Test Reliability

Test reliability describes the degree to which a test consistently measures the knowledge or abilities it is supposed to measure. Many factors can cause test results to be inconsistent and, therefore, cause a test to be unreliable. A test may contain unclear directions or flawed test items; the students taking a test may be distracted, ill, or fatigued; and test scorers may misunderstand a test rubric or hold biases. Test developers and instructors can check test reliability through a number of statistically based methods, including test-retest, split-half, internal consistency, and alternate form. They can also improve test reliability by adjusting a test's length and improving test item quality.

Keywords Alternate Form Reliability; High-Stakes Tests; Internal Consistency Reliability; Item Quality; No Child Left Behind Act of 2001 (NCLB); Norm-Referenced Test; Performance-Based Assessment; Reliability; Split-Half Reliability; Standardized Tests; Test Bias; Test Length; Test-Retest Reliability; Validity

Overview

Reliability has been defined as "the degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable and repeatable for an individual test taker" (Berkowitz, Wolkowitz, Fitch & Kopriva, 2000, as cited in Rudner & Schafer, 2001). Reliability is the extent to which the measurements gained from a test are derived from the knowledge or ability being measured; a test with a high degree of reliability has a small degree of random error, which should make test scores more consistent as the test is repeatedly administered. However, a test's reliability depends upon both the testing instrument and the students taking the test, which means that the reliability of any test can vary from group to group. This is why, before a school, district, or state adopts a test, the reliability estimates for both the sample to be tested and the norming groups should be considered (Rudner & Schafer, 2001).
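
To make this idea concrete, the short Python sketch below (not drawn from the cited sources) simulates the classical test theory view that an observed score is a true score plus random error; the score distributions and error size are illustrative assumptions only.

```python
import numpy as np

# A minimal sketch of the idea that reliability reflects how much of the
# observed-score variance is true-score variance rather than random error.
# The true-score and error spreads below are hypothetical, not from any real test.
rng = np.random.default_rng(0)

n_students = 1000
true_scores = rng.normal(loc=70, scale=10, size=n_students)  # stable ability
error = rng.normal(loc=0, scale=5, size=n_students)          # random measurement error
observed = true_scores + error                               # observed test scores

# Reliability is often described as true-score variance / observed-score variance.
reliability = true_scores.var() / observed.var()
print(f"Approximate reliability: {reliability:.2f}")  # roughly 10^2 / (10^2 + 5^2) = 0.80
```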

In order for test scores to be used to make accurate judgments about students' abilities, they must be both reliable and valid. Test validity refers to how well a test truly measures the knowledge and abilities it is designed to measure. A test must first be determined to be reliable and then be assessed for validity. It is possible for tests to be reliable but not valid. If a test is reliable, it can be counted on to report similar results when taken by similar groups under similar circumstances over a period of time (Moriarty, 2002). Reliability is vital to test development because, in the current educational culture of high-stakes testing, tests must be accurate assessments of student progress and achievement.

Test reliability also describes the consistency of students' scores as they take different forms of a test. No two test forms will produce identical results on a consistent basis, for a number of reasons. Because the forms are not identical, differences among test items may skew results. Additionally, students may make errors, feel ill, or be fatigued on a test day. External factors such as poor lighting, excessive noise, or room temperature can also interfere with testing.

However, even though all scores will not be identical, it is expected that they will be similar, which is one reason that test reliability is described in terms of degree of error. Checking for test reliability can determine the extent to which students' scores reflect random measurement errors, which fall into three categories: test factors, student factors, and scoring factors. Test factors that affect reliability can include test items, test directions, and (for multiple-choice tests) ambiguous test item responses. Student factors include lack of motivation, concentration lapses, fatigue, memory lapses, carelessness, and sheer luck. Scoring factors that affect a test's reliability include ambiguous scoring guidelines, carelessness on the part of the test scorer, and computational errors. These are all considered random errors because how they affect students' scores is unpredictable: sometimes they can help students and at other times they can hinder students (Wells & Wollack, 2003).

All tests contain some degree of error; there is no such thing as a perfect test. However, while errors may be unavoidable, the primary goal of test developers is to limit errors to a level that mirrors the purposes of the assessment. For example, a high-stakes test, such as an examination to grant a high school diploma, a license, or college admission, needs to have a small margin of error. In a low-stakes environment, however, an instructor-developed assessment can tolerate a larger margin of error since the results can be offset by other forms of assessment (Rudner & Schafer, 2001). If students' grades will be based solely on one examination, then the examination must have a high degree of reliability; but, in general, classroom tests can have a lower degree of reliability because most instructors also consider assessments like homework, papers, projects, presentations, participation, and other tests when determining student grades (Wells & Wollack, 2003).

Applications

Checking for Reliability

The four most commonly used methods of checking test reliability are: test-retest, split-half, internal consistency, and alternate form. All are statistically based and used in order to evaluate the stability of a grouping of test scores (Rudner & Schafer, 2001).

Test-Retest Reliability

Test-retest reliability is a coefficient obtained by administering the same exam twice to the same group of students and then correlating the two sets of results. In theory, this can be a good measure of score consistency because it provides a clear, constant measurement that carries from one administration to another. In practice, however, it is not widely endorsed as a check for reliability because of several challenges and limitations. First, it requires that the same test be given twice to the same group of students, which can be costly and time-consuming. Additionally, it can be difficult to determine whether the resulting coefficient accurately reflects the test's reliability. If the second administration is given within too short a time period, then student responses may be artificially consistent because students can remember the test questions; students may also have looked up the answers to questions they could not answer on the first administration. Alternatively, if the second administration is given at too late a date, then students' answers can be skewed by the knowledge they have acquired during the time between the tests (Rudner & Schafer, 2001).
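
As a rough illustration, test-retest coefficients are commonly computed as the correlation between the two sets of scores. The Python sketch below uses hypothetical score data and is not taken from the cited sources.

```python
import numpy as np

# A minimal sketch: correlate the scores the same students earn on two
# administrations of the same test. The score arrays are hypothetical.
first_administration = np.array([78, 85, 62, 90, 71, 88, 65, 80])
second_administration = np.array([75, 88, 60, 92, 70, 85, 68, 82])

# Pearson correlation between the two administrations serves as the estimate.
test_retest_r = np.corrcoef(first_administration, second_administration)[0, 1]
print(f"Test-retest reliability estimate: {test_retest_r:.2f}")
```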

Split-Half Reliability

Split-half reliability is a coefficient obtained by dividing a test into two halves and correlating the results of each half. Because longer tests are usually more reliable than shorter ones, the coefficient must often be corrected for length. Tests can be split in half by using the odd-numbered questions for one half and the even-numbered questions for the other; by randomly selecting which items go in each half; or by manually selecting which items go in each half in an attempt to balance content and level of difficulty. This method is advantageous because it requires only one test administration. Its disadvantage is that the coefficient will vary depending on how the test was divided. Also, the method is not appropriate for exams on which students' scores may be affected by a time limit (Rudner & Schafer, 2001).
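
The sketch below illustrates one common way this is done in practice: correlate scores on the odd- and even-numbered items, then step the correlation up with the Spearman-Brown correction, the standard adjustment for the fact that each half is only half as long as the full test. The response data are hypothetical, and the sketch is not drawn from the cited sources.

```python
import numpy as np

# A minimal sketch of an odd/even split-half estimate. Each row of `responses`
# is one hypothetical student's scored items (1 = correct, 0 = incorrect).
responses = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 1],
])

odd_half = responses[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7
even_half = responses[:, 1::2].sum(axis=1)  # items 2, 4, 6, 8

r_half = np.corrcoef(odd_half, even_half)[0, 1]
split_half_reliability = (2 * r_half) / (1 + r_half)  # Spearman-Brown correction
print(f"Half-test correlation: {r_half:.2f}")
print(f"Corrected split-half reliability: {split_half_reliability:.2f}")
```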

Internal Consistency

Internal consistency examines how consistently the individual items within a test measure the same content; in effect, it estimates how similar students' results would be if a different sample of items from the same domain had been used. Content sampling, which internal consistency estimates, is usually the largest component of measurement error (Rudner, 1994). The purpose of an exam is more than simply to determine how many items students can correctly answer; it is also to measure students' knowledge of the content covered by the testing instrument. To accomplish this, the items on the test must be sampled so that they are representative of the entire domain. The expectation is that students who have mastered the content will perform well and those who have yet to master the content will not do as well, regardless of the particular items used on the testing instrument (Wells & Wollack, 2003). The internal consistency method is advantageous in that it requires one test administration and the coefficient does not depend upon a particular split of the test items. Its primary disadvantage is that it is best used with tests that measure only one skill area (Rudner & Schafer, 2001).
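
One widely used internal-consistency estimate is coefficient alpha (Cronbach's alpha). The Python sketch below, which uses hypothetical dichotomously scored item responses and is not drawn from the cited sources, shows how such a coefficient can be computed from a single test administration.

```python
import numpy as np

# A minimal sketch of coefficient alpha. Each row of `responses` is one
# hypothetical student's scored items (1 = correct, 0 = incorrect).
responses = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 1],
])

k = responses.shape[1]                                # number of items
item_variances = responses.var(axis=0, ddof=1)        # variance of each item
total_variance = responses.sum(axis=1).var(ddof=1)    # variance of total scores

# alpha = k/(k-1) * (1 - sum of item variances / total-score variance)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Coefficient alpha: {alpha:.2f}")
```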

Alternate-Form Reliability

Most standardized tests have more than one version, or form, that can be used interchangeably. The different forms are matched with each other in terms of content and skill level, which is why they can be used in much the same way. Comparing the scores that the same students earn on different forms provides a measure of consistency and reliability. However, each form contains slightly different content, which can skew the results. Additionally, students who take the test a second time using an alternate form may have learned more in the time between tests, which can also skew results (Rudner & Schafer, 2001).

Sources of Error

The three primary sources of random measurement error in testing are: factors in the test, factors in the students taking the test, and scoring factors (Rudner & Schafer, 2001).

Test Factors

Most tests are composed of questions that test specific skills. Since it is impossible to ask questions that cover every possible scenario (for example, every combination of two-digit multiplication problems), generalization is necessary. Generalizations can be specific or broad. For example, if students can correctly answer several multiplication questions, it can be specifically generalized that they have mastered multiplication. On the other hand, if students can correctly add, subtract, multiply, and divide fractions, it can be broadly generalized that they have mastered fractions.

However, the questions selected to assess a skill may contain errors. The test content represented by specific test items can vary with each version of a test, causing sampling errors and decreasing the reliability of the test. Even though the content of the forms might differ only slightly, generalizations are made about the tested ability across all the test forms, lowering test reliability. An assessment that covers simple, basic skills would have reasonably similar content across forms, so it should have high reliability. As skills become more complex, however, more item-sampling errors can occur.

Other types of test error that can affect reliability include the plausibility of distracters (i.e., the wrong answers on a multiple-choice test); the presence of more than one correct answer; and test items that are too difficult or too easy for the students taking the assessment (Rudner & Schafer, 2001). Improper sequencing, or presenting test items in a telling order, can also affect test reliability because the sequence of test items may give away answers. Each item should represent an independent problem; the correct answer to one test item should not be dependent upon knowledge of the correct answer to a different test item (Cantor, 1987).

Student Factors

Students can affect the reliability of a test, too, because they are not always consistent. Students can be distracted, ill, or tired during a test. The testing room may be too warm or too cold (Gulek, 2003). These factors can cause students to make mindless errors, fail to correctly interpret exam instructions, forget directions that were read to them, or misread test questions. All of these factors can affect reliability (Rudner & Schafer, 2001).

Scoring Factors

Errors in scoring are another possible source of error. With objective tests, most scoring is done mechanically, which should minimize testing error. However, if students do not completely fill in an answer bubble, do not completely erase an answer they wanted to change, or make stray marks on the answer form, scoring may be inaccurate. On more open-ended questions that are evaluated by individuals, there can be many sources of error. Scorers can misinterpret the scoring rubric or be unsure of what was expected of students. And, like students, scorers are not always consistent. They may change their criteria while grading, or hold a bias. They may expect their stronger students to do well and, therefore, give them the benefit of the doubt, while expecting their weaker students to do less well and, therefore, not give them the benefit of the doubt (Rudner & Schafer, 2001). An instructor who has graded fifty papers is not likely to grade the fiftieth paper in exactly the same way he or she graded the first. Differences may also be apparent between a batch of papers graded one night and a batch graded the following night.

Further Insights

To ensure that the scores students earn accurately reflect what they have mastered, it is important to use testing instruments with a high degree of reliability. While test validity is also an important concern, test reliability should be the first priority when determining whether or not to use a particular testing instrument. An invalid test can be reliable, but an unreliable test cannot be valid. Therefore, test developers and instructors can save a lot of time by determining a test's reliability before determining its validity (Wells & Wollack, 2003).

Improving Reliability

Instructors who want to improve the reliability of their tests can look at two factors: test length and item quality.

Test Length

Test length can make a difference in reliability because a greater number of test items increases the number of times a trait is tested, making inferences about students' knowledge and abilities more accurate. For example, a test consisting of only a single item would force instructors to assess students' abilities from a single piece of data. On the other hand, a test containing thirty items gives students more opportunities to demonstrate their abilities and, thereby, gives the instructor more data to evaluate, improving the accuracy of the evaluation.

Measurement error can account for a large percentage of student scores, but the percentage of measurement error decreases as the length of the test increases. It is possible for students who have not mastered course content to correctly answer one test question, either by guessing or because they happen to know that particular answer, but it is much less likely that they could correctly answer all thirty questions through a combination of luck and knowledge (Wells & Wollack, 2003). It is reasonable to have eight to ten test items to measure each trait (McMillan, 1999).

The increase in reliability is greater when a shorter test is lengthened because the proportion of questions being added is greater. However, it is important that any test items added to the assessment be similar in content, quality, and difficulty to the ones already included in the testing instrument. Another factor to consider before lengthening a test is that the assessment may become too long to complete in a class period, or so long that students lose interest or become tired while taking it and therefore do not perform as well as they would have on a shorter assessment. To determine the length of a testing instrument, instructors should identify the maximum number of items they can include that will still allow students to finish the assessment within a reasonable period of time (Wells & Wollack, 2003).
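
Projections of how reliability changes with test length are commonly made with the Spearman-Brown prophecy formula. The Python sketch below, which uses hypothetical reliability and length values and is not drawn from the cited sources, illustrates the calculation.

```python
# A minimal sketch of the Spearman-Brown prophecy formula, which projects how
# reliability changes when a test is lengthened (or shortened) with comparable
# items. The starting reliability and item counts below are hypothetical.

def projected_reliability(current_reliability: float,
                          current_length: int,
                          new_length: int) -> float:
    """Project test reliability after changing the number of comparable items."""
    n = new_length / current_length  # factor by which the test is lengthened
    return (n * current_reliability) / (1 + (n - 1) * current_reliability)

# Doubling a 15-item test with reliability 0.70 to 30 comparable items:
print(f"Projected reliability: {projected_reliability(0.70, 15, 30):.2f}")  # about 0.82
```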

Item Quality

Item quality is the second factor that should be considered, since poorly constructed or inappropriate test items can reduce a test's reliability. A test item is a good one to include if it can discriminate among students with varying degrees of mastery. As in the construction of norm-referenced tests, an item discriminates well if students who have mastered the course content tend to answer it correctly while those who have not tend to answer it incorrectly. It is also a good idea to avoid including too many items that almost every student can answer correctly, or items that almost no student can answer correctly, since such items do not appropriately discriminate between students (Wells & Wollack, 2003).
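
One simple way to examine discrimination is to compare how often higher-scoring and lower-scoring students answer each item correctly. The Python sketch below uses an upper-lower discrimination index and hypothetical response data; it is an illustration, not a method taken from the cited sources.

```python
import numpy as np

# A minimal sketch of an upper-lower discrimination index: the proportion of
# high scorers answering an item correctly minus the proportion of low scorers
# doing so. Items with a low or negative index discriminate poorly.
responses = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 1],
])

total_scores = responses.sum(axis=1)
order = np.argsort(total_scores)
lower_group = responses[order[: len(order) // 2]]   # weaker-scoring half
upper_group = responses[order[len(order) // 2 :]]   # stronger-scoring half

discrimination = upper_group.mean(axis=0) - lower_group.mean(axis=0)
for item, d in enumerate(discrimination, start=1):
    print(f"Item {item}: discrimination index {d:+.2f}")
```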

Conclusion

Reliability describes the extent to which a test stably estimates students' abilities and the extent to which exam results are free of measurement error. Different types of reliability estimates are used to evaluate how different sources of error influence the testing instrument. Before adopting a testing instrument for a specific use, it should be determined how the reliability estimates were computed and whether appropriate statistical methods were used. For example, the split-half method should not be used on tests on which students' scores may have been affected by a time limit, since it can produce artificially high reliability estimates. How the testing instrument fared with different groups of test takers should also be considered, as well as how those reliability estimates were computed. Perhaps the most important question to consider is whether or not the reliability is high enough to justify using the test as a basis on which to make decisions about individual students (Rudner, 1994). This is a particularly vital question to ask if the instrument is going to be used in a high-stakes testing environment or to meet No Child Left Behind Act requirements.

To sum up, instructors should not rely solely on one assessment for grading purposes and should make sure that there are enough questions on an assessment to adequately measure student competency. Assessment procedures and scoring should be kept as objective as possible to avoid potential test bias. It is a good idea to use several different assessment methods because each method has its own unique sources of error, and students may perform better with some methods than with others. Therefore, best practices indicate that instructors should not give their students only multiple-choice tests, only essay examinations, or only fill-in-the-blank and short-answer assessments, but rather a combination of assessment types (McMillan, 1999).

Since the adoption of the No Child Left Behind Act, it is even more important that tests assess what they are supposed to assess and that errors in testing are minimized as much as possible. As noted before, there is no such thing as a perfect testing instrument, but those who adopt and implement tests in the classroom can take steps to ensure that a test is the most appropriate, valid, and reliable instrument available. By doing so, they can help ensure that students, schools, districts, and states are not adversely affected or penalized by the test itself.

Terms & Concepts

High-Stakes Tests: High-stakes tests are tests that are used to make consequential decisions about students, schools, school districts, and/or states. These tests can determine grade advancement, high school graduation, resource allocation, and instructor retention.

No Child Left Behind Act of 2001 (NCLB): The No Child Left Behind Act of 2001 is the latest reauthorization and a major overhaul of the Elementary and Secondary Education Act of 1965, the major federal law regarding K-12 education.

Norm-Referenced Test: Norm-referenced tests are assessments administered to students to determine how well they perform in comparison to other students taking the same assessment. Half the students taking the assessment score above the midrange point and half score below it.

Performance-Based Assessment: Performance-based assessments require students to actively demonstrate their knowledge rather than simply select a correct answer.

Reliability: Test reliability is the extent to which a test consistently measures what it is supposed to measure.

Standardized Tests: Standardized tests are tests that are administered and scored in a uniform manner. The tests are designed in such a way that the questions and interpretations are consistent.

Test Bias: Test bias occurs when provable and systematic differences in the results of students taking a test are discernible based on group membership, such as gender, socioeconomic standing, race, or ethnicity.

Validity: Test validity refers to the degree to which a test truly measures the knowledge and abilities it is designed to measure.

Bibliography

Attali, Y., Lewis, W., & Steier, M. (2013). Scoring with the computer: Alternative procedures for improving the reliability of holistic essay scoring. Language Testing, 30, 125-141. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=84761574&site=ehost-live

Boyd, D., Lankford, H., Loeb, S., & Wyckoff, J. (2013). Measuring test measurement error: A general approach. Journal of Educational & Behavioral Statistics, 38, 629-663. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=91934289&site=ehost-live

Cantor, J. (1987). Developing multiple-choice test items. Training & Development Journal, 41, 85. Retrieved September 21, 2007 from EBSCO Online Database Academic Search Premier. http://search.ebscohost.com/login.aspx?direct=true&db=aph&AN=9127185&site=ehost-live

Gulek, C. (2003). Preparing for high-stakes testing. Theory Into Practice, 42, 42. Retrieved August 18, 2007 from EBSCO Online Database Academic Search Premier. http://search.ebscohost.com/login.aspx?direct=true&db=aph&AN=9611432&site=ehost-live

McMillan, J. (1999). Establishing high quality classroom assessments. Richmond, VA: Metropolitan Educational Research Consortium. (ERIC Document Reproduction Service No. ED429146).

Moriarty, F. (2002). History of standardized testing. Retrieved September 19, 2007, from http://or.essortment.com/standardizedtes_riyw.htm

Morrison, K., & van der Werf, G. (2013, November). Editorial. Educational Research & Evaluation. pp. 649-650. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=91809544&site=ehost-live

Rudner, L. (1994). Questions to ask when evaluating tests. Washington, DC: Office of Educational Research and Improvement. (ERIC Document Reproduction Service No. ED385607). Retrieved October 5, 2007 from EBSCO Online Education Research Database. http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/14/1b/61.pdf

Rudner, L. & Schafer, W. (2001). Reliability. Washington, D.C.: Office of Educational Research and Improvement. (ERIC Document Reproduction Service No. ED458213). Retrieved October 5, 2007 from EBSCO Online Education Research Database. http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/19/5b/85.pdf

Wells, C. & Wollack, J. (2003). An instructor's guide to understanding test reliability. Retrieved September 21, 2007, from http://testing.wisc.edu/Reliability.pdf

Suggested Reading

Downing, S. & Haladyna, T. (2006). Handbook of Test Development. New York, NY: Routledge-Falmer.

Hamilton, L., Stecher, B. & Klein, S. (2002). Making Sense of Test-Based Accountability in Education. Santa Monica, CA: RAND Corporation.

Thompson, B. (2002). Score Reliability: Contemporary Thinking on Reliability Issues. Thousand Oaks, CA: Sage Publications.

Essay by Sandra Myers, M.Ed.

Sandra Myers holds a master's degree in adult education from Marshall University and is the former Director of Academic and Institutional Support at Miles Community College in Miles City, Montana, where she oversaw the college's community service, developmental education, and academic support programs. She has taught business, mathematics, and computer courses; her other areas of interest include adult education and community education.