Congratulations on completing part one! We will now dive a little deeper into the art of writing multiple-choice exams. As before, record your answers to the questions as you work through the material; you'll be able to check them later.
One common criticism of multiple-choice exams is that they can only test factual recall or simple calculation skills. This is not actually true, although it reflects how such exams are commonly used. There are two key problems to overcome when writing conceptual MC questions:
In the context of assessment, validity is the degree to which the assessment (such as an MC test) is both adequate and appropriate for its intended purpose. This includes obvious tests of validity, such as each MC question having only one correct answer and being free from ambiguity. Equally important, however, is that the assessment should actually measure what it sets out to. Thus, a conceptual MC question that effectively tested students’ language skills rather than their understanding of the subject would be considered inadequate for its purpose.
We have already seen some examples of questions that test a student’s ability to take tests: the use of negation, double negatives, and illogical sequencing of items would all be considered inappropriate, unless the exam was explicitly a test of logical reasoning skills (and even then...). Similarly, questions on content not yet covered, questions that require information the student could not reasonably be expected to know, and questions that use idiomatic language all undermine the validity of the test.
Another issue relating to validity is what the test (and its results) will be used for. For example, MC questions commonly form the basis of concept inventories used in diagnostic assessment, either as a research tool or for placement purposes; an inventory that failed to reflect learning gains or distinguish between students of different ability would again be considered inadequate for purpose.
To illustrate some of the potential problems in creating conceptual MC tests, and ways to address them, we will now look at two different approaches: the use of paired questions and case studies.
Suppose, for example, that you wanted to set a conceptual exam for a literature course by having the students analyse one of the assigned texts. The first question might be:
Some students, anticipating the types of questions you might ask or having prepared detailed summaries of each text studied, may simply have memorized lists of characters and their roles. As such, the question does not test conceptual understanding alone. One way around this is to pair questions so that concept and reasoning are probed together. In this case, the next question might be:
Another example might be designed to distinguish between students who perform calculations solely by memorized procedures, and those who understand the concepts underlying the same calculation:
In this example, question 4 cannot be answered by direct calculation, since insufficient information is provided. Instead, students have to reason from first principles that the gas with the lowest molar mass will yield the greatest number of moles and, therefore, the highest pressure. The reason for adding “Cannot be determined” is that, without it, students unable to make the conceptual connection would be forced to guess; including this option enables such students to be identified and provided with extra help.
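To make the contrast concrete, here is a short sketch of the underlying reasoning. The gases, masses, and conditions below are hypothetical, since the actual values from questions 3 and 4 are not reproduced on this page; the point is only that, for equal masses at fixed temperature and volume, the ideal gas law makes pressure vary inversely with molar mass:

```python
# Illustrative sketch of the reasoning behind question 4 (hypothetical values).
# For equal masses of gas in identical containers at the same temperature,
# p = nRT/V implies that the gas with the lowest molar mass yields the most
# moles and hence the highest pressure.
R = 8.314       # gas constant, J/(mol*K)
T = 298.15      # temperature, K (assumed)
V = 0.010       # container volume, m^3 (assumed)
mass = 1.0      # equal mass of each gas, kg (assumed)

molar_masses = {"H2": 0.002016, "He": 0.004003, "N2": 0.028014, "CO2": 0.044010}  # kg/mol

for gas, M in molar_masses.items():
    n = mass / M            # moles of gas; inverting this ratio is the common error
    p = n * R * T / V       # pressure, Pa
    print(f"{gas}: n = {n:7.1f} mol, p = {p / 1e6:6.2f} MPa")
```

Running this confirms the deduction: H2, with the lowest molar mass, produces the largest number of moles and the highest pressure.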
You could criticise this pairing, since the calculation for question 3 is effectively a cue for the deduction in question 4. This could be circumvented by providing the number of moles (instead of the mass and molar mass) in question 3. One way to decide between these options would be to determine the facility value, FV, for each question – this is simply the fraction of students choosing the correct answer. If the FV for question 4 was lower when paired with the simpler form of question 3 than with the original form shown above, then one might well conclude that students were being provided with a significant hint by the original calculation.
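Computing FV from raw response data is straightforward; the sketch below assumes a hypothetical layout in which each question maps to a list of per-student answers, and is not tied to any particular testing platform:

```python
# Minimal sketch of a facility value (FV) calculation.
# `answer_key` gives the correct option for each question (hypothetical data).
answer_key = {"Q3": "B", "Q4": "C"}
responses = {
    "Q3": ["B", "B", "A", "B", "D", "B", "C", "B"],
    "Q4": ["C", "A", "C", "E", "C", "A", "C", "B"],
}

def facility_value(answers, correct):
    """Fraction of students choosing the correct answer."""
    return sum(a == correct for a in answers) / len(answers)

for q, answers in responses.items():
    print(f"{q}: FV = {facility_value(answers, answer_key[q]):.2f}")
```

Comparing the FV of question 4 across cohorts given the two forms of question 3 would then reveal whether the original calculation acts as a hint.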
Another approach to writing conceptual MC questions is to use case studies. Typically, these might describe scenarios similar to ones used in class; alternatively, they might present entirely new situations that must be analysed using the skills taught in class rather than any specific prior knowledge. In fact, this whole page can serve as an example of the case-study approach to conceptual multiple-choice testing. Given this information, attempt to answer the following questions:
Another metric that can be employed to assess an MC question is its discriminating power, DP. Students are first ranked by their total score, and the answers from the upper and lower quartiles are separated out. For each question, DP is then calculated as the difference between the number of students answering correctly in the upper and lower quartiles, divided by half the total number of students in those two quartiles:
DP(%) = 200 × (NCUQ − NCLQ) / (NTUQ + NTLQ)

where NCUQ and NCLQ are the numbers of students answering correctly in the upper and lower quartiles, respectively, and NTUQ and NTLQ are the total numbers of students in those quartiles.
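A minimal sketch of this calculation, assuming the response data is available as (total score, correct-on-this-question) pairs for each student (a hypothetical layout), might look like this:

```python
# Sketch of a discriminating-power (DP) calculation for one question.
def discriminating_power(scores):
    """scores: list of (total_score, 1 if this question answered correctly else 0)."""
    ranked = sorted(scores, key=lambda s: s[0], reverse=True)
    q = max(1, len(ranked) // 4)              # quartile size
    upper, lower = ranked[:q], ranked[-q:]
    nc_uq = sum(correct for _, correct in upper)
    nc_lq = sum(correct for _, correct in lower)
    return 200 * (nc_uq - nc_lq) / (len(upper) + len(lower))

# Example: 8 students, (total score, correct on this question)
students = [(95, 1), (88, 1), (82, 1), (75, 0), (70, 1), (64, 0), (55, 0), (40, 0)]
print(f"DP = {discriminating_power(students):.0f}%")  # high DP: the question discriminates well
```

A strongly positive DP indicates that the question separates stronger from weaker students; a DP near zero (or negative) suggests the question should be revised or discarded.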
The distractors in question 3 were deliberately chosen as the results of common calculation errors, such as inverting the terms in the conversion of mass to moles. In this way, the question functions as a diagnostic assessment of the students’ calculation skills, just as question 4 is diagnostic of their conceptual understanding.
The same question can also be used for formative assessment, since a student can then be provided with highly specific feedback about their mistakes and allowed to try again; this works particularly well in an electronic test environment.
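In an electronic environment, this can be as simple as mapping each option to a targeted feedback message. The option letters and messages below are hypothetical, not taken from the actual question 3:

```python
# Sketch of distractor-specific feedback for formative use in an electronic test.
# Each distractor encodes one known error (hypothetical options and messages).
feedback = {
    "A": "Check your conversion: moles = mass / molar mass, not the inverse.",
    "B": "Correct!",
    "C": "You may have forgotten to convert the temperature to kelvin.",
    "D": "Check that the units of your volume match those of the gas constant.",
}

def give_feedback(choice):
    return feedback.get(choice, "Unrecognised option.")

print(give_feedback("A"))  # student sees targeted feedback and can try again
```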
Such questions can also be used for summative assessment, where cumulative knowledge and understanding are tested at the end of a unit or course. This can, however, lead to complaints that instructors are deliberately setting unfair “trick questions”.
Recorded your answers somewhere? Then proceed to the next page to check your score...