Test Validity
Test validity is a critical concept in educational assessment, focusing on how well a test measures what it claims to assess. It is essential for ensuring that test scores genuinely reflect student abilities and knowledge, particularly in high-stakes testing environments where decisions about grades, graduation, and program admissions are made based on these assessments. There are several types of validity, including construct validity, which examines whether a test accurately evaluates theoretical constructs; content validity, which assesses the alignment of test items with the content area being measured; and criterion validity, which determines how well a test predicts future performance in relevant situations.
Additionally, consequential validity addresses the social implications of a test's use, acknowledging that the effects of testing can vary across different groups of students. The importance of test validity has evolved over time, with increasing recognition of the need for comprehensive validation processes that consider both the construction of tests and their real-world applications. Establishing validity is not a one-time task but requires ongoing evaluation to ensure tests remain effective and fair in their intended uses. Overall, valid tests serve as important tools that can facilitate informed decisions in education, benefiting both students and educational institutions.
On this Page
- Overview
- Test Validation
- History of Test Validity
- The APA Handbook of Standards
- Applications
- Content Validity
- Criterion Validity
- Construct Validity
- Convergent/Divergent Validation
- Factor Analysis
- Consequential Validity
- Further Insights
- Conducting Validation Studies
- Sources of Invalidity
- Conclusion
- Terms & Concepts
- Bibliography
- Suggested Reading
In a nation where high-stakes testing has become so prevalent in education, it is important that tests provide an accurate assessment of student progress and achievement. For test scores to serve as a sound basis for judgments about students' abilities, they must be both accurate and dependable. A test must first be reliable, and then be assessed for its validity. Test validity comprises three established types: construct validity, content validity, and criterion validity. Consequential validity, a more recent and still debated form of test validity, and the relationship between validity and reliability are also covered. A brief history of how test validity has developed and some questions that should be asked when validating a test are also included.
Keywords Assessment; College-Level Examination Program (CLEP) Test; Consequential Validity; Construct Validity; Content Validity; Criterion Validity; High-Stakes Tests; No Child Left Behind Act of 2001 (NCLB); Reliability; Standardized Tests; Test Bias
Overview
In a nation where high-stakes testing has become so prevalent in education, it is important that tests provide an accurate assessment of student progress and achievement. In order for test scores to be used to make accurate judgments about students' abilities, they must be both reliable and valid. A test must first be reliable, and then be assessed for its validity.
• Test reliability is the extent to which a test consistently measures what it is supposed to measure. Reliability deals with the way a test is constructed. If a test is reliable, it can be counted on to report similar results when taken by similar groups under similar circumstances over a period of time. A reliable test is relatively free from errors in its construction and measurement.
• Test validity refers to how well a test measures what it is supposed to measure. Validity is a matter of degree rather than an all-or-nothing property. A test is never completely valid or completely invalid, and as validity evidence continues to be gathered, new results can strengthen or contradict a test's previous findings (Messick, n.d., as cited in College Board, 2007b).
It is also possible for tests to be reliable but not valid. Testing instruments are everywhere and can assess practically anything. In such a high-stakes testing environment, one of the most crucial considerations is how test scores are used and how they can affect students, schools, districts, and states. Tests can be used to meet the stipulations of the No Child Left Behind Act, determine admission to schools or programs, determine high school graduation or grade retention, and diagnose educational deficiencies. With stakes this high, it is important that the assessment selected is appropriate for the situation.
Test Validation
Test validation refers to verifying the use of a test within a particular set of circumstances, such as admission into a gifted and talented program, high school graduation, or college entrance. Therefore, one aspect of test validation is studying test scores in the setting in which they were used to see whether the results adequately and appropriately measure what the test purports to measure (College Board, 2007b).
Test validation has evolved from using a single approach to testing for validity to multiple procedures used sequentially over the development of the testing instrument (Jackson, 1970, 1973; Guion, 1983, as cited in Anastasi, 1986). Validity is part of the test from the beginning of the testing instrument's development. Validation begins with identifying construct definitions by looking at theories, prior research, observation, and/or analysis of relevant behavior. Test items are then written to fit the construct definitions. Empirical item analysis follows, in which the most valid items are selected from the pool of test items. Other analyses, including factor analyses, may also be conducted. Once the instrument has been developed, validation of scores against other criteria using statistical analyses occurs (Anastasi, 1986).
Before establishing the validity of a test, decisions need to be made as to which test and test scores to validate. Including a few options is usually a good idea; since validity is a matter of degree and not all or nothing, several tests may fit the need, or a combination of a test and other factors may. For example, if college personnel are looking for an admissions test, they should consider what results arise if they do not use any test; if they use a combination of a test, high school grade point average, and essay; or if they use one particular test. By comparing the results of these possibilities, they will have a good indication of whether the test is valid; and if the test used alone proves as valid as a combination of other factors, admissions personnel can save a lot of time and effort (College Board, 2007b).
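The comparison described above is, in practice, a question of how well each option predicts the outcome of interest. The following sketch, written in Python with simulated data and hypothetical variable names (it is an illustration, not a College Board procedure), shows one way such a comparison might be carried out: fit a simple least-squares model for each option and compare how strongly its predictions correlate with a criterion such as first-year college grade point average.

```python
# Minimal sketch: compare the predictive value of an admissions test alone
# versus the test combined with high school GPA. All data and variable names
# are hypothetical, generated only for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 200
test_score = rng.normal(500, 100, n)            # hypothetical admissions test scores
hs_gpa = rng.normal(3.0, 0.5, n)                # hypothetical high school GPAs
college_gpa = (0.002 * test_score + 0.5 * hs_gpa
               + rng.normal(0, 0.4, n))         # hypothetical criterion (first-year GPA)

def multiple_r(predictors, criterion):
    """Correlation between least-squares predictions and the criterion."""
    X = np.column_stack([np.ones(len(criterion))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)
    predicted = X @ beta
    return np.corrcoef(predicted, criterion)[0, 1]

r_test_only = multiple_r([test_score], college_gpa)
r_combined = multiple_r([test_score, hs_gpa], college_gpa)
print(f"Admissions test alone:     r = {r_test_only:.2f}")
print(f"Test plus high school GPA: r = {r_combined:.2f}")
```

If the combined option does not predict the criterion meaningfully better than the test alone, the simpler option may be sufficient for the intended use.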
History of Test Validity
One of the earliest references to test validity came in 1928 from psychologist Clark Hull, and in 1937 H. E. Garrett asserted that test validity is the "extent to which a test measures what it purports to measure" (Garrett, 1937, as cited in Geisinger, 1992, p. 199). The first formal publication to include information on test validity came in 1954 after the American Psychological Association convened a committee to develop standards for psychological tests (Gray, 1997). The 1954 handbook they developed provided the first set of professional test standards and was called Technical Recommendations for Psychological Tests and Diagnostic Techniques. In it was the claim that there were four basic categories of test validity:
• Content,
• Predictive,
• Concurrent,
• Construct.
The handbook stated that it was incumbent on the test users to validate the test for the purposes for which they were planning to use it. This made validating a test the responsibility of both the test publisher and the test user. It also now required test users to identify how they planned to use tests and then validate according to the proposed purpose (Geisinger, 1992).
The APA Handbook of Standards
From the 1950s through the 1970s, test validity was considered to be dependent on use-specific and situation-specific correlations (Geisinger, 1992). In 1966 the American Psychological Association revised its 1954 handbook and renamed it Standards for Educational and Psychological Tests and Manuals (APA, 1966, as cited in Geisinger, 1992). The revised edition advocated matching the use of a test with a validation strategy for supporting that use (Messick, 1989, as cited in Geisinger, 1992). It also combined predictive and concurrent validity into a single category, criterion validity (Geisinger, 1992; Gray, 1997). The handbook further suggested that it would be a good idea to validate a test using more than one approach.
The next revision to the handbook came in 1974. This edition mentioned the social consequences of testing and stated that adverse impact and test bias should be considered whenever evaluating a test's validity. It also recommended that content validity take test-taker behavior into account. In 1985, the American Psychological Association teamed up with the American Educational Research Association and the National Council on Measurement in Education to revise the handbook to address test qualities, validity, reliability, and test uses, and the revision presented test validation as a unified undertaking (Geisinger, 1992). In 1999, the three organizations jointly revised the handbook to reflect changes in federal law, measurement trends that influence validity, the assessment of students with disabilities, and the testing of English language learners, among other things (AERA Books, 2006).
Applications
Four types of validity are now commonly evaluated for a testing instrument:
• Content Validity
• Criterion Validity
• Construct Validity
• Consequential Validity
Content Validity
Content validity deals with the adequacy with which test items represent the content area to be evaluated. A testing instrument would lack content validity if, for example, student scores on a comprehensive mathematics assessment depended on their understanding of English, or if the test sampled only a narrow slice of the domain, such as a handful of questions on fractions. Content validity is normally determined by expert judgment rather than by statistics, but assessments should still correlate highly with other testing instruments that are supposed to represent the same content (Packer, 2004).
Content validity looks at how well test questions align with the “content or subject area they are supposed to assess. Content or subject area may also be referred to as performance domain. Content-related evidence of validity uses the judgments of people who are considered experts in testing of the content area” (College Board, 2007a). Two other ways to establish the content validity of a test are curricular and face validity:
• Face validity is “the extent to which a test or the questions on a test appear to measure a particular construct”; in other words, does the test look like a reasonable measure of what it is supposed to assess (College Board, 2007a)? Test users, students, parents, or the public can make this determination, and face validity is also a good way to convince those who decide whether to adopt the testing instrument.
• Curricular validity is the “extent the content of the test matches the objectives of a specific curriculum” and is judged by groups of content experts who determine whether the content “is parallel to the curriculum objectives and if the test is balanced with curricular emphases” (College Board, 2007a). Curricular validity is critical for assessments that are used for high-stakes testing, such as high school exit exams, grade retention, and placement in special education programs (College Board, 2007a).
Criterion Validity
Criterion validity refers to whether an assessment adequately predicts a student's behavior in a specific situation (Packer, 2004). Criterion validity became a recognized form of validity in the 1960s by combining concurrent validation and predictive validation (Geisinger, 1992; Gray, 1997). Validation “refers to procedures used to determine how valid a predictor is. With concurrent validation, the predictor and the criterion data are collected at or around the same time. Concurrent validation is appropriate for diagnostic screening tests” (Packer, 2004, III.2). If a student scores very high on an English achievement test, it is reasonable to expect that the same student is earning high grades in English. The student's performance in the English class concurrently validates the score on the test.
Concurrent validity must be determined if one measure is going to be substituted for another, as with a course test-out exam or when colleges accept CLEP (College-Level Examination Program) scores, giving college credit for successfully completing the exam. One way to establish concurrent validation for CLEP scores is to have students who are completing the college course take the CLEP exam for that subject at the same time. If the correlation between the CLEP exam scores and the course grades is strong, the test would be considered valid for that particular use (College Board, 2007a).
In predictive validation, “the predictor scores are collected first, and the criterion data is collected at a later time” (Packer, 2004, III.2). Predictive validation is appropriate for assessments intended to predict students' future status. Predictive assessments can be course placement tests, driving tests, or any assessment whose results can later be compared to course grades, driving performance or attainment of a driver's license, or success or failure in whatever activity the assessment was designed to predict (College Board, 2007a).
A criterion-related validation study is accomplished by collecting both the assessment “scores and information on the criterion for the same students. By looking at the relationship between test scores and the criterion it is possible to see how valid the test was in determining success in college”; in practice, this is done by correlating the scores with the criterion (College Board, 2007a).
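As a concrete illustration, the brief sketch below (Python, with invented scores for hypothetical students) shows the core computation of such a study: pairing each student's test score with a criterion measure and reporting the correlation, often called the validity coefficient, between them.

```python
# Minimal sketch of a criterion-related validation study, assuming hypothetical
# paired data: each student's test score and a criterion measure (for a
# concurrent design, e.g. current course grades; for a predictive design,
# e.g. grades earned in a later term).
import numpy as np

# Hypothetical paired observations for the same students.
test_scores = np.array([480, 520, 610, 450, 700, 560, 630, 500, 590, 670])
criterion   = np.array([2.4, 2.8, 3.3, 2.1, 3.8, 3.0, 3.4, 2.6, 3.1, 3.7])

# The validity coefficient is typically the Pearson correlation between
# the predictor (test scores) and the criterion.
r = np.corrcoef(test_scores, criterion)[0, 1]
print(f"Criterion-related validity coefficient: r = {r:.2f}")
```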
Construct Validity
Construct validity pertains to whether or not an assessment accurately evaluates a construct, a trait that cannot be observed directly and is theoretical in nature; in other words, does the test measure what it says it measures (Packer, 2004)? For example, if a mathematics test asks the majority of its questions in word format using long phrases or sentences, it may be testing students' English abilities more than their mathematical abilities and, therefore, does not have construct validity (College Board, 2007a). There are two ways to determine a test's construct validity:
Convergent/Divergent Validation
Tests have convergent validity when they have a “high correlation with another test that measures the same construct” (Packer, 2004, III.3). Tests have divergent validity when they have a low correlation with tests that measure a different construct. “Evidence from content validity and criterion validity may be used to establish construct validity” (College Board, 2007a). To determine the construct validity of a mathematics test, the correlation of its scores with other mathematics assessments should be higher than its correlation with scores on reading assessments; the strong mathematics correlation provides convergent evidence, while the weak reading correlation provides divergent evidence (College Board, 2007a).
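A short sketch of this comparison appears below, using Python and invented score vectors for a hypothetical group of students; the specific values are for illustration only.

```python
# Minimal sketch of convergent/divergent evidence, assuming hypothetical score
# vectors for the same students on three tests: a new mathematics test, an
# established mathematics test, and a reading test.
import numpy as np

new_math   = np.array([55, 62, 70, 48, 80, 66, 73, 58])
other_math = np.array([52, 65, 72, 50, 78, 64, 75, 60])   # same construct
reading    = np.array([66, 58, 72, 60, 65, 70, 62, 68])   # different construct

r_convergent = np.corrcoef(new_math, other_math)[0, 1]
r_divergent  = np.corrcoef(new_math, reading)[0, 1]

# Convergent evidence: high correlation with the other mathematics test.
# Divergent evidence: noticeably lower correlation with the reading test.
print(f"Convergent (math vs. math):    r = {r_convergent:.2f}")
print(f"Divergent  (math vs. reading): r = {r_divergent:.2f}")
```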
Factor Analysis
Factor analysis is a statistical approach to determining construct validity. Construct validity can also be assessed using internal consistency, which means “scores on the individual test items should correlate highly with the total test score, and is used as evidence that the test is measuring a single construct” (Packer, 2004, III.3c). Construct validity can further be assessed using experimental intervention, which means “scores should change following experimental manipulation in the direction predicted by the theory underlying the construct” (Packer, 2004, III.3c).
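The sketch below illustrates the internal-consistency approach in Python with a hypothetical matrix of item scores. It computes each item's correlation with the total score and Cronbach's alpha; the alpha coefficient is not named in the sources cited here, but it is a widely used index of internal consistency and is included as an assumption about how such evidence is commonly quantified.

```python
# Minimal sketch of internal-consistency evidence, assuming a hypothetical
# matrix of item scores (rows = students, columns = items).
import numpy as np

# Hypothetical item scores for 6 students on 4 items (e.g., 0-5 points each).
items = np.array([
    [4, 5, 3, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 4, 3, 3],
    [4, 4, 4, 5],
], dtype=float)

total = items.sum(axis=1)

# Item-total correlations: each item should correlate highly with the total score.
for i in range(items.shape[1]):
    r = np.corrcoef(items[:, i], total)[0, 1]
    print(f"Item {i + 1} vs. total: r = {r:.2f}")

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total).
k = items.shape[1]
alpha = k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum() / total.var(ddof=1))
print(f"Cronbach's alpha = {alpha:.2f}")
```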
Consequential Validity
Consequential validity refers to the social consequences of using a particular test for a particular purpose. Consequential validity is somewhat controversial: some testing experts consider it an actual form of test validity, while others believe the social consequences of using a test are not an appropriate part of validity (College Board, 2007a). It has been noted that adverse social consequences do not by themselves make a test invalid, but those consequences should not be traceable to a source of test invalidity (Messick, 1988, as cited in College Board, 2007a).
For example, the fact that a particular subgroup of students receives lower scores on a placement test and is then required to complete developmental courses does not necessarily make the assessment consequentially invalid; but if “it was determined that the test was measuring different traits for that particular subgroup of students than for the larger group and the traits were not important for completing the assessment,” then the assessment may be consequentially invalid for that particular subgroup (College Board, 2007a).
Further Insights
Conducting Validation Studies
When developing validation studies, there are several questions that should be answered; and each question can have different answers depending on budget, availability, and expertise. According to the College Board, these questions include:
• Who determines whether a test has sufficient content validity? The answer could be an independent panel, the department, the department chair, a panel of instructors, or a panel of volunteers. It could also be that one group does an initial validation and then another group or individual confirms the validation after reviewing the data.
• What measure will be used to determine convergent or divergent validity? How many measures will be selected? Will some evaluated measures have similar results?
• What will be the basis for a curricular study? Will it be based on course outlines, state standards, classroom tests or course rubrics? Is there a different standard to use?
• What is the criterion in a criterion validity test? How is success decided in a particular course? How will success be determined in a program?
• Which social consequences should be evaluated in consequential validity? How will unintended effects be monitored? (College Board, 2007b)
Sources of Invalidity
There are two primary sources of invalidity: construct underrepresentation and construct-irrelevant variance.
• Construct underrepresentation occurs when the tasks measured in the assessment do not include important components of the construct. When that occurs, the test results may not show students' true abilities in the construct the test was supposed to measure.
• Construct-irrelevant variance occurs when the assessment measures too many variables, many of which are irrelevant to the construct of interest. This occurs in two different ways. Construct-irrelevant easiness occurs when unnecessary clues in questions let some students answer correctly “in ways that are irrelevant to the construct being assessed. Construct-irrelevant difficulty occurs when unnecessary aspects of the question make it irrelevantly difficult for some students or subgroups” (Brualdi, 1999, p. 15). Construct-irrelevant easiness will cause students to score higher than they otherwise would, and construct-irrelevant difficulty will cause students to score lower than they otherwise would (Brualdi, 1999).
Conclusion
Test scores are used to make many important decisions, such as assigning grades, grade retention, admission to gifted and talented programs, placement into special education, high school graduation, college admission, and course placement. In such a high-stakes environment, test scores can provide valuable information about students; but it is important to remember that just because a test has been validated for one particular use does not mean it is valid for a different situation, and it will need to be evaluated for that use as well.
Although no test will perfectly measure everything students have and have not mastered, valid tests can give instructors, schools, districts, and states a reasonable idea of how their students are doing, and they are one of the most effective tools for doing so. When used in conjunction with other forms of assessment, valid, reliable tests can provide a strong foundation for making academic decisions.
Terms & Concepts
College-Level Examination Program (CLEP) Test: The College-Level Examination Program is a national program that gives students the opportunity to earn undergraduate college credit by taking an exam.
Consequential Validity: Consequential validity refers to the social consequences of using an assessment for a particular purpose.
Construct Validity: Construct validity refers to whether or not an assessment accurately measures what it says it measures.
Content Validity: Content validity refers to “the adequacy with which test items adequately and representatively sample the content area to be measured” (Packer, 2004, III.1).
Criterion Validity: Criterion validity refers to whether an assessment adequately predicts a student's behavior in a specific situation.
High-Stakes Tests: High-stakes tests are tests whose scores are used to make decisions that have important consequences for students, schools, school districts, and/or states, including high school graduation, promotion to the next grade, resource allocation, and instructor retention.
No Child Left Behind Act of 2001 (NCLB): The No Child Left Behind Act of 2001 is the latest reauthorization and a major overhaul of the Elementary and Secondary Education Act of 1965, the major federal law regarding K-12 education.
Reliability: Test reliability is the extent to which a test consistently measures what it is intended to measure, producing similar results under similar conditions.
Standardized Tests: Standardized tests are tests that are administered and scored in a uniform manner, and the tests are designed in such a way that the questions and interpretations are consistent.
Test Bias: Test bias occurs when provable and systematic differences in the results of students taking the test are discernable based on group membership, such as gender, socioeconomic standing, race, or ethnic group.
Bibliography
American Educational Research Association. (2006). Standards for Educational and Psychological Testing. Retrieved September 19, 2007, from http://www.aera.net/publicatimons/Default.aspx?menu_id=46&id=1407
Anastasi, A. (1986). Evolving concepts of test validation. Annual Review of Psychology, 37, 1-15. Retrieved September 17, 2007 from EBSCO Online Database Academic Search Premier. http://search.ebscohost.com/login.aspx?direct=true&db=aph&AN=11265870&site=ehost-live
Brualdi, A. (1999). Traditional and modern concepts of validity. (Report EDO-TM-9910). Washington, DC: Office of Educational Research and Improvement. (ERIC Document Reproduction Service No. ED435714). Retrieved September 18, 2007 from EBSCO Online Education Research Database. http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/15/f1/64.pdf
Chatterji, M. (2013). Bad tests or bad test use? A case of SAT use to examine why we need stakeholder conversations on validity. Teachers College Record, 115, 1-10. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=90238824&site=ehost-live
College Board. (2007a). Types of validity evidence. Retrieved September 19, 2007, from http://www.collegeboard.com/highered/apr/aces/vhandbook/evidence.html
College Board. (2007b). What is test validity? Retrieved September 19, 2007, from http://www.collegeboard.com/highered/apr/aces/vhandbook/testvalid.html
Geisinger, K. (1992). The metamorphosis to test validation. Educational Psychologist, 27, 197. Retrieved September 18, 2007 from EBSCO Online Database Academic Search Premier. http://search.ebscohost.com/login.aspx?direct=true&db=aph&AN=6370701&site=ehost-live
Gray, T. (1997). Controversies regarding the nature of score validity: Still crazy after all these years. Paper presented at the Annual Meeting of the Southwest Educational Research Association. (ERIC Document Reproduction Service No. ED407414). Retrieved September 18, 2007 from EBSCO Online Education Research Database. http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/16/8c/5b.pdf
Moriarty, F. (2002). History of standardized testing. Retrieved September 19, 2007, from http://or.essortment.com/standardizedtes_riyw.htm
Packer, M. (2004). Experimental and statistical research methods. Retrieved September 19, 2007, from http://www.mathcs.duq.edu/~packer/Courses/Psy624/test.html
Welner, K. G. (2013). Consequential validity and the transformation of tests from measurement tools to policy tools. Teachers College Record, 115, 1-6. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=90238829&site=ehost-live
Suggested Reading
Haladyna, T. (2002). Essentials of standardized achievement testing: Validity and accountability. Boston, MA: Allyn & Bacon.
McPhail, M. (2007). Alternative validation strategies: Developing new and leveraging existing validity evidence. Hoboken, NJ: Wiley.
Messick, S. (1979). Test validity and the ethics of assessment. Princeton, NJ: Educational Testing Service.
Wainer, H. & Braun, H. (1988). Test validity. Florence, KY: Lawrence Erlbaum Associates, Inc.