Designing and Managing MCQs:

Chapter 4. Scoring and statistics.




In this chapter we consider the scoring of individual MCQs and of whole tests, as well as the statistics that pertain to students and questions.


4.1 Negative marking. To use or not to use?

The subject of "negative marking" is unnecessarily emotive, and the explanation that follows is aimed at enabling one to discuss the subject dispassionately, if at all possible. By negative marking, one means a scoring scheme whereby marks are subtracted every time an incorrect alternative has been selected. This "penalty" is frequently viewed as unfair by students, and is such a contentious issue in some quarters that its use is banned in certain faculties at UCT. What can be said to defend its use?

We must approach this by remembering that it is an inevitable consequence of the structure of MCQ testing that random answering of a large number of questions will result in some being answered correctly by chance alone. For example, if the toss of a coin were to dictate the answers to 1000 true/false questions, one can expect that close to one half of the questions would be answered correctly, and one half incorrectly. If one mark were to be given for a correct response, and zero for an incorrect one, the final score would be about 500/1000, or 50%, which is a pass mark! Clearly, this is undesirable.




4.2 Applying negative marking to each question

We can solve this problem by awarding 1 mark for a correct response, and deducting 1 mark for an incorrect response. In this way, a totally prepared candidate will achieve 100%, a totally unprepared candidate who guesses at random will achieve 0% on average, and the other candidates will achieve a percentage score that should be a reflection of their degree of preparation, as far as any test can. This, after all, is surely the prime goal of any test.
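The effect of the penalty on a pure guesser can be checked with a short calculation. The following Python sketch is only an illustration; the function name expected_guess_score and its parameters are assumptions introduced here, not part of any marking software referred to in this guide.

```python
from fractions import Fraction


def expected_guess_score(n_alternatives, mark_correct, penalty_wrong):
    """Expected mark per question for a candidate who picks one
    alternative completely at random (hypothetical helper)."""
    p_correct = Fraction(1, n_alternatives)
    return p_correct * mark_correct - (1 - p_correct) * penalty_wrong


# True/false, +1 for correct, 0 for incorrect: half the marks on average.
print(expected_guess_score(2, 1, 0))  # 1/2
# True/false, +1 for correct, -1 for incorrect: zero on average.
print(expected_guess_score(2, 1, 1))  # 0
```

Exact fractions are used so that the "zero on average" result is not blurred by rounding.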

The above example deals with a question with two alternatives. Let us look at a question with three alternatives.

If the question has three alternatives, it would be unfair to the candidates to award 1 mark for the correct alternative, and deduct 1 mark for each incorrect alternative. Rather, the total marks of the incorrect alternatives should off-set the marks awarded for a correct response. Therefore, if the correct answer is awarded 1 mark, then 0.5 marks should be deducted for each incorrect alternative. (Instead of working with half marks, it is probably easier to work with whole marks, where the correct answer is awarded 2 marks, and 1 mark is deducted for each incorrect alternative.)

We apply the same logic to a question with four alternatives - one correct and three incorrect. In this case, the value of the correct alternative should be 3 marks, with 1 mark deducted for each incorrect alternative. Here, the value of using whole marks becomes apparent. This is the system we will apply from here.

Instead of using words, we can apply a few symbols to make the explanation easier to follow:

The total number of alternatives offered in each question will be denoted by the letter "n". Because the number of correct alternatives in each question is always 1, the number of incorrect alternatives will be "n-1".

The number of marks awarded for a correct response will be denoted by the letter "C".

The number of marks deducted for each incorrect alternative will be denoted by the letter "I".

As we increase our alternatives, a pattern develops, and can be laid out as in the following table:

Column 1: Total number of alternatives (n)
Column 2: Number of correct alternatives
Column 3: Mark for correct alternative (C)
Column 4: Number of incorrect alternatives (n-1)
Column 5: Deduction for each incorrect alternative (I)
Column 6: Total deductions for incorrect alternatives

   1     2     3     4     5     6
   2     1     1     1     1     1
   3     1     2     2     1     2
   4     1     3     3     1     3
   5     1     4     4     1     4

By looking at column 3 and column 6, we can see that in all cases, the mark given to the correct alternative is off-set by the total marks deducted for the incorrect alternatives; this is achieved by adjusting the ratio between the mark for the correct alternative and the deduction for each incorrect alternative as the number of incorrect alternatives is increased.

Finally, we calculate the value of I (column 5) in the following way:

I = (mark awarded for correct answer, column 3) / (number of incorrect alternatives, column 4)

Using the notation above, this translates into

I = C/(n-1)
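As a check on the formula, the table above can be regenerated in a few lines of Python. This is a sketch only; the function name deduction_per_incorrect is an assumption made for illustration, and the whole-mark scheme C = n-1 is taken from the table.

```python
from fractions import Fraction


def deduction_per_incorrect(mark_correct, n_alternatives):
    """I = C/(n-1): marks deducted for each incorrect alternative."""
    if n_alternatives < 2:
        raise ValueError("a question needs at least two alternatives")
    return Fraction(mark_correct, n_alternatives - 1)


# Rebuild the table for n = 2..5 using the whole-mark scheme C = n - 1.
rows = []
for n in range(2, 6):
    C = n - 1
    I = deduction_per_incorrect(C, n)
    total_deductions = (n - 1) * I  # column 6
    rows.append((n, 1, C, n - 1, I, total_deductions))

for row in rows:
    print(*row)
```

In every row the last value equals the mark for the correct alternative, which is the off-setting property described above.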




4.3 Applying negative marking to the entire test

We may wish to allow each question to count four marks, and have no negative marking on each question, but apply the correction to the final result of the test. Even if the method outlined in 4.2 is adopted, it is useful to read through the rest of 4.3, 4.4 and 4.5. While the amount adjusted is exactly the same as it would have been for individual questions, it will have the added advantage of easily illustrating that when the candidate has taken an educated guess at a question's answer, by eliminating one or more possible alternatives, a higher score results than if he/she had skipped that question.

We have seen that the totally unprepared candidate has a chance of scoring a number of marks unfairly, that is, simply by guessing. This mark could just as easily have been obtained by a monkey guessing at random, and is often referred to as the "monkey score" (a term first used by Koeslag, Melzer and Schach, 1983), or the "undeserved score"; this is also why MCQ tests are often disparagingly called "monkey puzzles". As we have seen, in a test consisting of questions with only two alternatives, this monkey score is as high as 50%. In a test with more alternatives per question, the monkey score is reduced. In fact, using the notation above, the expected monkey score is the number of questions multiplied by C/n, and deducting I for each incorrect answer removes, on average, exactly this amount.

Therefore, we see that if negative marking is NOT applied, the final test score of a candidate consists of

Score = Deserved score + undeserved score

In order to correct this, we introduce a correction in the same way as we introduced a correction I above, so that the score now is

Score = Deserved score + undeserved score - (I x number of incorrect answers)

where I is again calculated as I = C/(n-1)

Let us look at how this is applied to a final test mark.

More typically, students will answer some questions correctly because they actually know or can work out the correct answer, and some questions correctly on account of some "partial knowledge", which can enable them to guess more judiciously than would be the case with pure random guesswork. They will score higher than the "monkey score" on a number of questions, and this represents a partial score awarded for partial knowledge.

The following example will make this clear:

A student takes a test in which 100 MCQs, each with 5 responses, are presented. A correct answer attracts 4 marks, and an incorrect answer -1 mark. The student answers 50 questions correctly by deliberately selecting the correct answer as a result of applying knowledge (the "deserved score"), 25 questions by random guesswork (leading to an "undeserved score" which will be corrected by the application of negative marking) and 25 questions by first eliminating 2 out of 5 responses and then guessing among the other three (leading to a "partial score"). Of the 25 random guesses, one fifth (5) can be expected to be correct and 20 incorrect. Of the 25 educated guesses, assume that 8 are correctly and 17 incorrectly answered. The student's final score will be:

Deserved:        50 x 4            = 200 marks
Undeserved:       5 x 4 - 20 x 1   =   0 marks
Partial score:    8 x 4 - 17 x 1   =  15 marks
                                     ___________
Total                              = 215 marks

At this point, however, we should note that it is essential that all the test questions have the same number of alternatives.

The 15 marks can be regarded as a bonus for partial knowledge. Note that if the student had abstained from answering all the questions where he or she harboured any doubts as to the correct answer, this bonus would have been thrown away.

If negative marking had not been in effect (that is, if 0 had been awarded for an incorrect answer), the student would have obtained 200 + 20 + 32 = 252 marks.

Two points must be emphasized:

The fact that the student receives a statistical credit for partial knowledge by being able to eliminate what to him or her seems to be obviously incorrect answers is perfectly fair. This is the equivalent of "educated waffling" in an essay, which scores some marks, but not the full number of possible marks.
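The size of this statistical credit can be estimated in advance. The sketch below (Python, with the hypothetical function name expected_mark_after_elimination) computes the expected mark per question for a student who rules out some alternatives and guesses among the rest. It assumes that the eliminated alternatives really are incorrect, and that the deduction is I = C/(n-1) as in 4.2.

```python
from fractions import Fraction


def expected_mark_after_elimination(n_alternatives, n_eliminated, mark_correct):
    """Expected mark per question when guessing at random among the
    alternatives left after eliminating some incorrect ones."""
    remaining = n_alternatives - n_eliminated
    if remaining < 1 or n_eliminated < 0:
        raise ValueError("invalid number of eliminated alternatives")
    penalty = Fraction(mark_correct, n_alternatives - 1)  # I = C/(n-1)
    p_correct = Fraction(1, remaining)
    return p_correct * mark_correct - (1 - p_correct) * penalty


# Five alternatives, 4 marks for a correct answer, as in the example of 4.3.
for eliminated in range(0, 5):
    print(eliminated, expected_mark_after_elimination(5, eliminated, 4))
```

With nothing eliminated the expectation is 0; with two alternatives eliminated it is 2/3 of a mark per question, or about 16.7 marks over 25 questions, which is in line with the 15 marks obtained in the example.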




4.4 The percentage score

Final scores are commonly expressed as a percentage. When negative marking is in effect, there are three ways of expressing the percentage (M is the "monkey score"):

Which method will actually be used should be carefully thought out before administering the test, and students should be informed as to the method of scoring and its implications. They should also be cautioned against skipping questions.




4.5 Adjustment of final percentages

If negative marking is not used, the random guessing factor will result in inadequate students scoring relatively more than their knowledge should allow. Hence it is common practice to make some adjustment to the final percentage scores.

If F% is the final corrected percentage score, and R% the uncorrected percentage score, then

F% = (nR% - 100)/(n-1)

where n is the number of responses in a test consisting of MCQs with a uniform number of responses and scoring.

From the above, we see that a raw score of 75% in a True/False test, where 1 mark is awarded for a correct answer and 0 for an incorrect answer, is equivalent to a corrected score of 50%: F% = (2 x 75 - 100)/(2 - 1) = 50.
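The correction formula is easily put into code. The following Python sketch uses a hypothetical function name, corrected_percentage, and assumes, as stated above, that every question has the same number of responses and the same scoring.

```python
def corrected_percentage(raw_percent, n_responses):
    """F% = (n * R% - 100) / (n - 1) for a test scored without negative marking."""
    if n_responses < 2:
        raise ValueError("each question needs at least two responses")
    return (n_responses * raw_percent - 100) / (n_responses - 1)


print(corrected_percentage(75, 2))   # 50.0  (the True/False example)
print(corrected_percentage(50, 2))   # 0.0   (pure guessing, True/False)
print(corrected_percentage(20, 5))   # 0.0   (pure guessing, 5 responses)
print(corrected_percentage(100, 5))  # 100.0 (a perfect score is unchanged)
```

Note that a raw score below the guessing level (for instance 40% in a True/False test) gives a negative corrected score; how such scores are reported is a policy decision that the formula does not settle.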

This impacts on the "pass mark" which must be achieved in MCQ tests. If the final pass mark is held to be 50%, then the uncorrected scores which are equivalent to a 50% pass mark will be


Note that other criteria, such as the minimum score for "first class", "upper second" and so on, should also be adjusted.
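The uncorrected score equivalent to any corrected threshold follows from rearranging the formula of this section to R% = ((n-1)F% + 100)/n. The Python sketch below applies this to a 50% pass mark; the function name raw_percentage_needed is an assumption made for illustration, and the values printed are computed from the formula rather than taken from a published table.

```python
def raw_percentage_needed(corrected_percent, n_responses):
    """Invert F% = (n*R% - 100)/(n - 1) to get R% = ((n - 1)*F% + 100)/n."""
    if n_responses < 2:
        raise ValueError("each question needs at least two responses")
    return ((n_responses - 1) * corrected_percent + 100) / n_responses


# Uncorrected score equivalent to a corrected pass mark of 50%.
for n in (2, 3, 4, 5):
    print(n, round(raw_percentage_needed(50, n), 1))
# 2 75.0
# 3 66.7
# 4 62.5
# 5 60.0
```

The same function can be applied to any other threshold by changing the first argument.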




4.6 Question attributes

The quality of a testing instrument, be it an essay or collection of MCQs, should be a matter of great concern to all involved. In the first place, students should feel that the test has been "fair", both in terms of the type of questions that they were asked to answer, and that the marking of their scripts should be as objective and consistent as possible. Secondly, the lecturers should monitor feedback from the test in order to ensure that the test was of a high quality and that the results reflect the real ability of the students.

Statistical treatment of the results obtained by a class will provide a window on the quality of a test consisting of MCQs.


4.6.1 How difficult/easy is a question?

If everyone in a class were to get a particular question right, it is clear that that question might have been too easy. Conversely, if everyone in the class had answered it incorrectly, the question might have been too difficult. The facility of a question is an attribute of that question which can be measured in terms of the performance of the class on that question. It is simply calculated by
facility = number of correct answers/total answers


For example, if 100 students attempted the question, and 60 students entered a correct answer for it, the facility will be 60/100 = 0.6. (Sometimes, the difficulty of a particular question is used. It is simply 1 - facility, or 0.4 in this case). The facility is sometimes expressed as a percentage, i.e. the percentage of those students answering the question who gave the correct answer, in this case 60%.

Questions which are too easy or too difficult should not be included in a test unless there are very good reasons to do so. Exceptions would be in order to identify students who lack certain "entry-level" knowledge to a course, or for some other diagnostic purposes. In general, one should aim for questions with a facility in the range 0.3 to 0.7. Some of the key reasons for this are discussed below.
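A minimal Python sketch of the facility calculation follows. The function names facility and in_recommended_range are hypothetical; the limits 0.3 and 0.7 are those suggested above.

```python
def facility(n_correct, n_attempted):
    """Fraction of the students attempting the question who answered it correctly."""
    if n_attempted <= 0:
        raise ValueError("at least one student must have attempted the question")
    return n_correct / n_attempted


def in_recommended_range(value, low=0.3, high=0.7):
    """True if a facility lies in the range suggested in the text."""
    return low <= value <= high


f = facility(60, 100)
print(f)                        # 0.6
print(round(1 - f, 2))          # difficulty = 1 - facility
print(in_recommended_range(f))  # True
```

A question answered correctly by 95 out of 100 students would have a facility of 0.95 and would be flagged as falling outside the suggested range.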

4.6.2 The discrimination of a question

One fatal inadequacy of questions with excessively low or high facilities is that they do not differentiate between students of widely differing abilities. The manner in which a question is answered by students of different abilities is called the discrimination or discriminating power of the question. It may be calculated in different ways, one of them based on the performance of two student groups of equal size, one group being of "low performance" and the other group being of "high performance":

discrimination = (Ch - Cl)/Nh

where Ch is the number of correct answers returned by the high performing group, and Cl the number of correct answers returned by the low performing group, and Nh is the number of students in the high (or low) performing group.

In common practice, the groups are defined as the upper and lower quartiles of the class. These quartiles are obtained by ordering the test scores from lowest to highest, and separating these scores into four parts of equal size. The part with the lowest marks is the lower quartile and the part with the highest marks is the upper quartile.

(Never mind if the class is not divisible by 4; the computer will do the necessary calculations and adjustments!) Statisticians will be quick to recognize that this method is only approximate, and that there are more reliable methods (Ward, p. 215-218). It is given here for the purpose of information only.

Suppose we have a class of 158 students, divided into quartiles with 40 students each in the lower and upper quartiles:

If 35 students in the upper quartile and 17 students in the lower quartile answered the question correctly, then the discrimination will be (35 - 17)/40 = 0.45. This question has a satisfactory discrimination.

Suppose that the situation is reversed: 17 students in the upper quartile and 35 students in the lower quartile answered correctly. The discrimination is now (17 - 35)/40 = -0.45, a negative value, indicating that students who did poorly on the test as a whole fared better on the question than those who did well in the test! This question should be eliminated from the test and looked at carefully. Perhaps it was poorly formulated, or the correct response was flagged as incorrect.
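Both cases can be reproduced in Python. This is a sketch under stated assumptions: the function names are hypothetical, and the way the group size is obtained when the class is not divisible by 4 (rounding the quarter up, which gives 40 for a class of 158) is an assumption, not necessarily the adjustment made by any particular program.

```python
import math


def discrimination(correct_high, correct_low, group_size):
    """(Ch - Cl)/Nh for two groups of equal size."""
    return (correct_high - correct_low) / group_size


def discrimination_from_results(results):
    """results: list of (total_test_score, answered_question_correctly) pairs.

    Assumption: the upper and lower groups each contain ceil(N/4) students.
    """
    ordered = sorted(results, key=lambda r: r[0])  # lowest score first
    group_size = math.ceil(len(ordered) / 4)
    low, high = ordered[:group_size], ordered[-group_size:]
    correct_low = sum(1 for _, ok in low if ok)
    correct_high = sum(1 for _, ok in high if ok)
    return discrimination(correct_high, correct_low, group_size)


print(discrimination(35, 17, 40))  # 0.45
print(discrimination(17, 35, 40))  # -0.45
print(math.ceil(158 / 4))          # 40
```

The second function shows how the two groups might be formed from raw results; students with tied scores at a group boundary are placed according to their order in the input, which is a simplification.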

The minimum acceptable discrimination level depends, inter alia, on the number of students taking the test, but can be taken as 0.3 for practical purposes.

The above discussion is far from being exhaustive. If you would like to know more, feel free to contact the persons whose names appear in Appendix A, and/or consult the references which are mentioned there.



