Test item reliability indicates how consistent the results produced by the items on a test are. Consistency can refer to the items’ stability over time or to the consistency of the items with one another. If an item is unreliable, statistical relationships will appear weaker than they really are, and inappropriate conclusions may be drawn about the relationships between variables.
Reliability is the extent to which an observed score (the true score plus or minus error) accurately reflects the true score. Returning to the example in this week’s Introduction, if your true weight were 150 pounds and you stepped on the scale hundreds of times, it would sometimes show 149, sometimes 151, and sometimes 150. If you averaged all of those readings, you would come close to your true weight, provided the scale’s errors are random rather than consistently high or low. If you looked at how much the readings varied, you would have a good measure of the scale’s error.
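The scale example can be sketched as a short simulation. This is only an illustration under stated assumptions: the error is assumed to be random and normally distributed with a standard deviation of 1 pound, and the function name `simulate_weighings` and all numbers other than the 150-pound true weight are hypothetical.

```python
import random
import statistics


def simulate_weighings(true_weight, error_sd, n_readings, seed=0):
    """Return readings modeled as true score plus random error.

    Assumption: the error is normally distributed around zero,
    so the scale is imprecise but not biased.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [true_weight + rng.gauss(0, error_sd) for _ in range(n_readings)]


# Hypothetical setup: true weight of 150 pounds, 500 steps on the scale.
readings = simulate_weighings(true_weight=150.0, error_sd=1.0, n_readings=500)

# The average of many readings estimates the true score.
mean_reading = statistics.mean(readings)

# The spread of the readings estimates the size of the scale's error.
spread = statistics.stdev(readings)

print(f"Mean reading: {mean_reading:.2f} pounds")
print(f"Spread of readings: {spread:.2f} pounds")
```

Under these assumptions, the mean should land close to 150 and the spread close to the assumed error of 1 pound. A scale that consistently read 2 pounds heavy would give a tight spread but a mean that misses the true weight, which is a problem of accuracy rather than of consistency.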
The situation is similar with a psychological test—a score on an IQ test represents an estimate of the theoretical “true” IQ; however, that observed score also includes error.
Researchers or test developers measure a test’s reliability with a reliability coefficient, generally a positive correlation coefficient that is less than 1.00. (A coefficient of 1.00 would indicate perfectly consistent scores, which is not attainable in practice because every measurement contains some error.) Reliability coefficients of at least .70 are generally considered acceptable for psychological tests or test items. If you know a test’s reliability, you can calculate its margin of error, a “plus or minus” band around an observed score that indicates an interval likely to contain the true score.
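One common way to turn a reliability coefficient into a margin of error is the standard error of measurement from classical test theory, SEM = SD × √(1 − reliability). The sketch below assumes that formula is the one intended here. The IQ-style standard deviation of 15, the reliability of .90, the observed score of 110, and the 95% multiplier of 1.96 are illustrative choices, and the function names are hypothetical.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """SEM = SD * sqrt(1 - reliability), per classical test theory."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


def true_score_band(observed, score_sd, reliability, z=1.96):
    """Return the (low, high) band around an observed score.

    Assumption: z = 1.96 gives an approximately 95% band when the
    measurement error is normally distributed.
    """
    margin = z * standard_error_of_measurement(score_sd, reliability)
    return observed - margin, observed + margin


# Illustrative values: SD of 15, reliability of .90, observed score of 110.
sem = standard_error_of_measurement(score_sd=15.0, reliability=0.90)
low, high = true_score_band(observed=110.0, score_sd=15.0, reliability=0.90)

print(f"SEM: {sem:.2f} points")
print(f"Band likely to contain the true score: {low:.1f} to {high:.1f}")
```

With these illustrative values the SEM is about 4.7 points, so the band runs from roughly 100.7 to 119.3. Lowering the reliability to .70 raises the SEM to about 8.2 points, which shows how quickly the band widens as reliability falls.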
A test item is reliable if its variations over time primarily reflect variations in what you are measuring. An unreliable item shows changes over a period that are not possible or are theoretically unexpected for the construct you are measuring. For instance, personality is a construct that is believed to be stable over a period of years or decades. An item that stated “I feel happier than usual today” would be unreliable for measuring personality, because mood changes easily from day to day, much more quickly than personality does.
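One way to check stability over time is a test-retest correlation: give the same item to the same people on two occasions and correlate the two sets of scores. The simulation below is a rough sketch, not real data. It assumes a stable personality trait, a trait item that tracks it closely, and a mood item driven mostly by how each person feels that day. All weights, sample sizes, and function names are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the run is repeatable
n_people = 1000

# Each person's stable trait level, assumed not to change between occasions.
trait = [rng.gauss(0, 1) for _ in range(n_people)]


def administer_trait_item():
    """Responses that mostly reflect the stable trait, plus a little error."""
    return [t + rng.gauss(0, 0.3) for t in trait]


def administer_mood_item():
    """Responses that mostly reflect that day's mood, barely the trait."""
    return [0.2 * t + rng.gauss(0, 1.0) for t in trait]


# Two separate administrations of each item to the same people.
r_trait_item = pearson_r(administer_trait_item(), administer_trait_item())
r_mood_item = pearson_r(administer_mood_item(), administer_mood_item())

print(f"Test-retest r, trait item: {r_trait_item:.2f}")
print(f"Test-retest r, mood item:  {r_mood_item:.2f}")
```

Under these assumptions the trait item should show a test-retest correlation well above the .70 guideline, while the mood item's correlation should sit near zero, mirroring the contrast between a personality item and “I feel happier than usual today.”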
For this week’s Discussion, think of a specific testing scenario. Then consider a reliable test item for that testing scenario and an unreliable item for that same testing scenario. Consider how you might know if these items are reliable or unreliable.
With these thoughts in mind:
Post by Day 4 a brief description of a specific testing scenario. Then describe one reliable test item and one unreliable test item for that testing scenario. Finally, explain what determines whether an item is reliable or unreliable within the scenario you presented.