Interpreting Test Results

Understanding how to interpret three useful statistics concerning your students' multiple-choice test scores will help you construct well-designed tests and improve instruction.

Item difficulty or P: the percentage of students who correctly answered an item.

Also referred to as the p-value
Ranges from 0% to 100%, or more typically written as a proportion 0.00 to 1.00
The higher the value, the easier the item
P-values above 0.90 indicate very easy items that you should not use in subsequent tests. If almost all students responded correctly, an item addresses a concept probably not worth testing.
P-values below 0.20 indicate very difficult items. If almost all students responded incorrectly, either an item is flawed or students did not understand the concept. Consider revising confusing language, removing the item from subsequent tests, or targeting the concept for re-instruction.
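
As a minimal sketch of how item difficulty can be computed, the example below assumes your scored responses are stored as one list per student, with 1 for a correct answer and 0 for an incorrect one. The function names, the sample data, and the exact handling of the 0.90 and 0.20 cutoffs are illustrative assumptions, not part of any scanning report.

```python
def item_difficulty(responses):
    """Return P for each item: the proportion of students who answered correctly.

    responses: one list per student, 1 = correct, 0 = incorrect (assumed layout).
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [
        sum(student[i] for student in responses) / n_students
        for i in range(n_items)
    ]


def flag_item(p, easy=0.90, hard=0.20):
    """Label an item using the cutoffs described above (hypothetical helper)."""
    if p > easy:
        return "very easy"
    if p < hard:
        return "very difficult"
    return "ok"


# Hypothetical data: six students, three items.
scores = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
]

p_values = item_difficulty(scores)
flags = [flag_item(p) for p in p_values]
```

Here the first item is answered correctly by everyone and the third by only one student, so they are flagged as very easy and very difficult respectively.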

For maximum discrimination potential, desirable difficulty levels are slightly higher than midway between chance (1.00 divided by the number of choices) and perfect scores (1.00) for an item:

Format                                      Ideal Difficulty

Five-response multiple-choice               .60
Four-response multiple-choice               .62
Three-response multiple-choice              .66
True-false (two-response multiple-choice)   .75
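
The reference point mentioned above, midway between chance and a perfect score, can be sketched as follows. This only computes the midpoint itself. The function name is an assumption for illustration, and the table remains the recommendation to follow.

```python
def midpoint_difficulty(n_choices):
    """Midpoint between the chance score (1.00 / number of choices) and 1.00."""
    chance = 1.0 / n_choices
    return (chance + 1.0) / 2.0


# Midpoints for five-, four-, three-, and two-response items.
midpoints = {n: midpoint_difficulty(n) for n in (5, 4, 3, 2)}
```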

Item discrimination or R(IT): the relationship between how well students performed on the item and their total test score.

Also referred to as the Point-Biserial correlation (PBS)

Ranges from -1.00 to 1.00; negative values are possible but signal a problem item
The higher the value, the more discriminating the item
A highly discriminating item indicates that students with high test scores responded correctly whereas students with low test scores responded incorrectly.

Remove items with discrimination values near or less than zero. A value near zero means the item does not distinguish between high and low scorers. A negative value means that students who performed poorly on the test performed better on the item than students who performed well, which suggests the item is confusing for your better-scoring students in some way.
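
One way to see what this statistic measures is to compute it by hand. The sketch below treats the point-biserial as the Pearson correlation between the 0/1 item score and the total test score. It assumes the item is left in the total (an uncorrected value) and that an item everyone answers the same way is reported as 0.0. Both choices, along with the function names and sample data, are assumptions for illustration and may differ from your scoring report.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined here; reporting 0.0 is an assumption.
        return 0.0
    return cov / sqrt(var_x * var_y)


def item_discrimination(responses, item):
    """Point-biserial between one item (0/1) and the total score (item included)."""
    item_scores = [student[item] for student in responses]
    totals = [sum(student) for student in responses]
    return pearson(item_scores, totals)


# Hypothetical data: four students, four items.
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]

r_first = item_discrimination(scores, 0)  # answered correctly by higher scorers
r_last = item_discrimination(scores, 3)   # answered correctly by lower scorers
```

In this small example the first item comes out strongly positive and the last item comes out negative, which is the pattern that marks an item for removal or revision.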

Evaluate items using four guidelines for classroom test discrimination values:

Discrimination value   Evaluation

0.40 or higher         very good items
0.30 to 0.39           good items
0.20 to 0.29           fairly good items
0.19 or less           poor items
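
The four guidelines can be applied mechanically, for instance as in this sketch. The function name is hypothetical, and values that fall between the printed bands (such as 0.195) are assigned to the lower band as an assumption.

```python
def rate_discrimination(r):
    """Map a discrimination value onto the four classroom-test guidelines."""
    if r >= 0.40:
        return "very good"
    if r >= 0.30:
        return "good"
    if r >= 0.20:
        return "fairly good"
    return "poor"


ratings = [rate_discrimination(r) for r in (0.55, 0.31, 0.20, 0.05, -0.10)]
```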

Reliability coefficient or ALPHA: a measure of the amount of measurement error associated with a test score.

Ranges from 0.00 to 1.00
The higher the value, the more reliable the test score
Typically, a measure of internal consistency, indicating how well items are correlated with one another
High reliability indicates that items are measuring the same construct (e.g., knowledge of how to calculate integrals)
Two ways to improve test reliability: 1) increase the number of items or 2) use items with high discrimination values
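
If you want to check a reported value yourself, coefficient alpha can be sketched from the same 0/1 score matrix. The example assumes the standard Cronbach's alpha formula, k / (k - 1) * (1 - sum of item variances / variance of total scores), with population variances used throughout. The function name and sample data are illustrative assumptions.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of per-student item scores (1/0)."""
    k = len(responses[0])  # number of items
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [
        pvariance([student[i] for student in responses]) for i in range(k)
    ]
    total_variance = pvariance([sum(student) for student in responses])
    if total_variance == 0:
        raise ValueError("all students have the same total score")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: five students, four items of increasing difficulty.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(scores)
```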

Reliability      Interpretation

.90 and above    Excellent reliability; at the level of the best standardized tests
.80 - .90        Very good for a classroom test
.70 - .80        Good for a classroom test; in the range of most classroom tests. There are probably a few items that could be improved.
.60 - .70        Somewhat low. This test should be supplemented by other measures to determine grades. There are probably some items that could be improved.
.50 - .60        Suggests need to revise the test, unless it is quite short (ten or fewer items). The test must be supplemented by other measures for grading.
.50 or below     Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.
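
The table can be turned into a simple lookup, as in the sketch below. Because adjacent bands share their endpoints, the sketch assumes a value sitting exactly on a boundary belongs to the higher band. That tie-breaking rule, the shortened labels, and the function name are assumptions for illustration.

```python
def interpret_reliability(alpha):
    """Return a short label for a reliability coefficient, per the table above."""
    bands = [
        (0.90, "excellent"),
        (0.80, "very good for a classroom test"),
        (0.70, "good for a classroom test"),
        (0.60, "somewhat low"),
        (0.50, "suggests need to revise the test"),
    ]
    for lower_bound, label in bands:
        # Assumption: a value exactly on a boundary goes to the higher band.
        if alpha >= lower_bound:
            return label
    return "questionable reliability"


labels = [interpret_reliability(a) for a in (0.93, 0.85, 0.72, 0.61, 0.55, 0.30)]
```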

Distractor Evaluation

Another useful item review technique is distractor evaluation.

You should consider each distractor an important part of an item in view of nearly 50 years of research that shows that there is a relationship between the distractors students choose and total test score. The quality of the distractors influences student performance on a test item.

Although correct answers must be truly correct, it is just as important that distractors be clearly incorrect, appealing to low scorers who have not mastered the material rather than to high scorers. You should review all item options to anticipate potential errors of judgment and inadequate performance so you can revise, replace, or remove poor distractors.

One way to study responses to distractors is with a frequency table that tells you the proportion of students who selected each option. Remove or replace distractors selected by few or no students, because students evidently find them implausible.
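
A frequency table of this kind can be sketched as follows, assuming you have each student's selected option for a single item. The function names, the sample data, and the 5% cutoff for a rarely chosen distractor are hypothetical. Choose a cutoff that suits your class size.

```python
from collections import Counter


def distractor_frequencies(choices, options):
    """Proportion of students selecting each option for one item."""
    counts = Counter(choices)
    n = len(choices)
    return {option: counts.get(option, 0) / n for option in options}


def weak_distractors(frequencies, key, min_share=0.05):
    """Distractors chosen by fewer than min_share of students (assumed cutoff)."""
    return [
        option
        for option, share in frequencies.items()
        if option != key and share < min_share
    ]


# Hypothetical item: 20 students, correct answer "A", nobody picks "D".
choices = ["A"] * 12 + ["B"] * 5 + ["C"] * 3
frequencies = distractor_frequencies(choices, options=["A", "B", "C", "D"])
to_review = weak_distractors(frequencies, key="A")
```

In this example option D attracts no students, so it is the one to revise or replace.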

Caution when Interpreting Item Analysis Results

Mehrens and Lehmann (1973) offer three cautions about using the results of item analysis:

1) Item analysis data are not synonymous with item validity. An external criterion is required to accurately judge the validity of test items. By using the internal criterion of total test score, item analyses reflect internal consistency of items rather than validity.

2) The discrimination index is not always a measure of item quality. There are a variety of reasons why an item may have low discrimination power:

   o Extremely difficult or easy items will have low ability to discriminate, but such items are often needed to adequately sample course content and objectives.

   o An item may show low discrimination if the test measures many content areas and cognitive skills. For example, if the majority of the test measures "knowledge of facts," then an item assessing "ability to apply principles" may have a low correlation with total test score, yet both types of items are needed to measure attainment of course objectives.

3) Item analysis data are tentative. Such data are influenced by the type and number of students being tested, instructional procedures employed, and chance errors. If repeated use of items is possible, statistics should be recorded for each administration of each item.

References

DeVellis, R. F. (1991). Scale development: Theory and applications. Newbury Park, CA: Sage Publications.

Haladyna, T. M. (1999). Developing and validating multiple-choice test items (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

Lord, F. M. (1952). The relation of the reliability of multiple-choice tests to the distribution of item difficulties. Psychometrika, 17, 181-194.

Mehrens, W. A., & Lehmann, I. J. (1973). Measurement and evaluation in education and psychology (pp. 333-334). New York: Holt, Rinehart and Winston.

Suen, H. K. (1990). Principles of test theories. Hillsdale, NJ: Lawrence Erlbaum Associates.
