How can we ask students to demonstrate thinking skills and the ability to apply knowledge by ticking a box?
By Mirkka Jokelainen, Publisher, GL Assessment
It is often thought that multiple-choice questions (MCQs) assess a very narrow set of knowledge and skills, reduce complex concepts to a list for rote-learning, and can be answered by guessing. On the other hand, they are easy to create and mark. The challenge in creating a good multiple-choice test lies in proving all of these preconceptions wrong – except the easy marking (which is only easy at the end, after a long development process). In this post, I won’t go into the challenges of marking MCQs that are still in development, but will discuss the process of developing good, rigorous multiple-choice questions for reliable formative assessment.
A well-constructed multiple-choice question that provides a valid measure of learning is not at all easy to write. All items in our assessments, for instance, are written by subject experts with experience in teaching and assessment, to ensure that the test items reliably measure what they are supposed to measure and cover a sufficiently wide and varied range of content to provide a valid measure of skills and knowledge.
In addition to ensuring the content is appropriate, questions need to be well-thought-out and accessible, and all the answer options must be plausible enough to distract those students who are trying to guess or are unable to work out the answer. Plausible alternatives ensure that students must recall and apply skills and knowledge to be able to choose the correct answer. (However, no matter how perfectly plausible the distractors are, there should be only one correct answer.)
After the items have been written and the working group agrees that all items appear valid and will be able to do their job, the items are trialled to make sure of this. Even items that look excellent on the face of it can turn out not to work when taken by children. That is why, for this first version of the test, we create many more items than are needed for the final assessment. This way we know we have the capacity to drop the items that are not working, and include in the final assessment only the best items that ‘passed the test’.
In the trialling, hundreds of students from appropriate age groups take the trial tests. After these tests have been marked, we can see which test items did well, which didn’t, and why. For each individual test item, we look at several quality indicators to determine if the item works. For example, if all students get an item right (or wrong), the results don’t provide any differentiating information about their skills and knowledge. This may be as expected in summative testing, and can even be likely to happen in some smaller samples, such as a single class or one school. But on a large national sample, a good formative test item will differentiate between students. From the trial data, we also check that the distractors do their job.
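To give a feel for the per-item checks described above, here is a minimal sketch in Python. The data shape, the item key and the response letters are all hypothetical illustrations, not GL Assessment’s actual analysis pipeline: it computes an item’s facility (the proportion of correct answers) and counts how often each distractor was chosen, flagging distractors that attracted nobody.

```python
from collections import Counter

# Hypothetical trial data: each student's chosen option for one item.
# Option 'B' is the keyed (correct) answer; 'A', 'C' and 'D' are distractors.
responses = ["B", "A", "B", "C", "B", "B", "A", "A", "B", "B"]
key = "B"

facility = responses.count(key) / len(responses)  # proportion answering correctly
distractor_counts = Counter(r for r in responses if r != key)

print(f"Facility: {facility:.2f}")
for option in ["A", "C", "D"]:
    if distractor_counts[option] == 0:
        # A distractor nobody picks is doing no work and may not be plausible.
        print(f"Distractor {option} was never chosen - not plausible enough?")
```

The same idea also catches the degenerate cases mentioned above: a facility of 1.0 (or 0.0) means the item tells us nothing that differentiates between students in this sample.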
Another indicator of a good MCQ is whether the students who performed well in the test overall outperformed the lower-performing students on a particular item. If low performers outperform high performers on a particular question, there may be a problem with item validity – it appears not to be measuring what the rest of the test is measuring. The reason often turns out to be a faulty key, where either there is no correct answer or there is more than one.
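This high-versus-low comparison is commonly summarised as a discrimination index: the proportion of a top-scoring group who got the item right, minus the proportion of a bottom-scoring group who did. The sketch below uses invented trial results and a top/bottom split of half the sample (classical practice often uses the top and bottom 27%); a value near zero, or a negative one, is the warning sign described above.

```python
# Hypothetical trial results: (total test score, answered this item correctly).
students = [
    (45, True), (42, True), (40, True), (38, False), (35, True),
    (20, False), (18, True), (15, False), (12, False), (10, False),
]

# Sort by total score, then compare the top and bottom groups on this item.
students.sort(key=lambda s: s[0], reverse=True)
upper, lower = students[:5], students[-5:]

p_upper = sum(correct for _, correct in upper) / len(upper)
p_lower = sum(correct for _, correct in lower) / len(lower)
discrimination = p_upper - p_lower  # negative values suggest a faulty item or key

print(f"D = {discrimination:.2f}")
```

Here the high scorers clearly do better on the item (D = 0.60), which is what we want; an item where the sign flipped would be sent back for review.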
The difficulty of each item is also determined – that is, the percentage of students in the trial who answered the item correctly. For a balanced test, it is important to have items of varying difficulty, so that the results help differentiate between students with different skills, and the information in the results and reports is useful for teaching and learning.
The question-by-question analysis that is available in our reports will provide you with evidence that there are items of different difficulty in the test, and that the national sample, or your students, weren’t able to guess the answers. If guessing were a sound answering method, all the items would be of roughly the same difficulty and the performance of the national sample would be a flat line. Below is a graph from one of the Progress Tests in Science reports, which shows how the standardisation sample and the group tested performed in the MCQs that make up the assessment.
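The “flat line” claim is easy to check with a toy simulation (my own illustration, not taken from the report): if every student guesses at random among four options, every item’s facility clusters tightly around 1/4, regardless of what the item asks. Real trial data, by contrast, shows facilities spread across a wide range.

```python
import random

random.seed(1)
N_STUDENTS, N_ITEMS, N_OPTIONS = 1000, 20, 4

# Each simulated student guesses one of four options at random on every item;
# the keyed answer is option 0 throughout.
facilities = []
for item in range(N_ITEMS):
    correct = sum(random.randrange(N_OPTIONS) == 0 for _ in range(N_STUDENTS))
    facilities.append(correct / N_STUDENTS)

# Under pure guessing, every facility hovers near 1/4: a flat difficulty
# profile, quite unlike the varied profile of a real, well-built test.
print(f"min {min(facilities):.2f}, max {max(facilities):.2f}")
```

With a large enough sample, the spread between the easiest and hardest “guessed” item shrinks towards nothing, which is exactly why a varied difficulty profile in real results is evidence that students were working the answers out rather than guessing.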
[Pic. 1 p. 11 from the PTS sample report.]
So, how can we ask students to demonstrate thinking skills and the ability to apply knowledge by ticking a box? Again, well-written and well-researched items are the answer. To ensure that the assessment addresses thinking skills as well as knowledge, the questions must require the application of knowledge and in-depth evaluation of the answer options. Just because the answer can be given by ticking a box does not mean that that is all the student needs to do. To make students think, and to find out more about their thinking, the distractors can be based on errors that are commonly made when working out the answer.
A lot of work and expertise goes into creating good multiple-choice questions. It’s not quick and it’s not easy, but when done right, a wealth of information about a student’s learning can be accessed through this objective and reliable method of formative assessment. As with any assessment, MCQ test scores aren’t the final word or the only truth, but they can provide an important piece of information to add to other evidence, building a well-rounded picture of each individual’s learning.