Criterion-Referenced Testing
Criterion-referenced testing is an assessment approach designed to evaluate what students know and can do based on specific educational outcomes predetermined by instructors, schools, districts, or state standards. Unlike norm-referenced tests, which rank students against their peers, criterion-referenced tests focus solely on whether each student meets or exceeds established standards, providing detailed insights into individual performance. This method allows educators to identify strengths and weaknesses in student learning, as well as the effectiveness of their teaching strategies. High-stakes applications, such as high school exit exams, utilize predetermined cut scores to determine student success. While criterion-referenced tests aim for equality in expectations across diverse student populations, the flexibility in setting performance standards can lead to discrepancies in perceived student mastery between different regions. This type of testing is often used to comply with educational mandates, such as the No Child Left Behind Act, emphasizing the need for measurable outcomes in education. Ultimately, criterion-referenced testing serves as a valuable tool for evaluating student learning against specific learning objectives rather than relative performance, promoting a more individualized understanding of student achievement.
On this Page
- Overview
- Criterion-Referenced Tests & No Child Left Behind
- Criterion-Referenced Tests vs. Norm-Referenced Tests
- Further Insights
- Advantages & Disadvantages of Criterion-Referenced Testing
- Evaluating Learning
- Equal Expectations
- Applications
- Developing Criterion-Referenced Tests
- Using Criterion-Referenced Tests
- Terms & Concepts
- Bibliography
- Suggested Reading
Criterion-referenced tests are used to determine what students can do and what they know based on a predetermined, specific set of educational outcomes. These outcomes can be determined by the instructor, school, district, or state based on the curriculum standards that are set. Criterion-referenced tests do not compare students to other students, which is the purpose of norm-referenced tests. As long as the criterion-referenced test is properly aligned to the expected educational outcomes, it can give detailed information about how well a student has performed on each outcome included on the test (Bond, 1995).
Keywords Assessment; Bell Curve; Content Validity; Criterion-Referenced Tests; Cut Scores; Evaluation; High School Exit Exams; High-Stakes Tests; No Child Left Behind Act of 2001 (NCLB); Norm Group; Norm-Referenced Tests; Percentile Rank; Rubric; Standardized Tests
Overview
Criterion-referenced test scores can indicate whether students meet, do not meet, or exceed the predetermined acceptable standards that were assessed (Taylor & Walton, 2001).
For certain criterion-referenced tests, such as high school exit or grade advancement exams, “tests have cut scores, which are scores that determine whether a student passes or fails” (Bracey, 2000, p. 8). The cut score, which can also be referred to as the criterion, determines success or failure; the only concern is whether the student has attained it. For example, if the cut score on a high school exit exam is set at 65, all that matters is whether a student scores 65 or better, not the exact score: students who score 65 or better pass, and students who score 64 or below fail (Bracey, 2000). Those who achieve that score will receive their diploma; those who do not may be given another chance to take the assessment, may be referred for remediation, may have to take additional courses, or may have to repeat the school year.
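The cut-score logic described above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration: the threshold of 65 comes from the example in the text, and the function name is an assumption.

```python
# Hypothetical sketch of cut-score logic on a high school exit exam.
# The cut score of 65 is taken from the example in the text.

CUT_SCORE = 65  # the criterion: the passing threshold set by the state


def exit_exam_result(score: int) -> str:
    """Only attainment of the cut score matters, not the exact score."""
    return "pass" if score >= CUT_SCORE else "fail"


for score in (64, 65, 90):
    print(score, exit_exam_result(score))  # 64 fails; 65 and 90 both pass
```

Note that a 90 and a 65 produce the same outcome: the interpretation is purely pass/fail against the criterion, with no ranking among those who pass.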
Criterion-Referenced Tests & No Child Left Behind
Criterion-referenced tests are used to meet No Child Left Behind (NCLB) standards because one of the act's goals is to raise instructional standards by mandating that states challenge their students in mathematics, reading/English language arts, and, by the end of the 2007-2008 school year, science. NCLB added mandatory testing and federal reporting, with potentially serious consequences for states and districts that do not demonstrate 'adequate yearly progress,' making this type of criterion-referenced testing a high-stakes form of assessment.
NCLB stipulates that states are “to set challenging academic content standards and that the assessments must be aligned with those standards” (Linn, 2005, p. 81). However, NCLB does not define content standards, set performance standards for each state, or specify the types of assessments and cut scores to be used, leaving these determinations to each individual state. NCLB further requires states to set annual measurable objectives “based on the percentage of students performing at or above proficiency. These standards are used to determine if schools, districts and states make adequate yearly progress, and the progress targets must be set so all students will be at or above the proficient level by 2014” (Linn, 2005, p. 91). If the percentage of students passing state tests is insufficient, the school has not made adequate yearly progress and sanctions can be imposed.
Criterion-Referenced Tests vs. Norm-Referenced Tests
Because of the differences between the two kinds of tests, it is possible for students in states that use both norm-referenced and criterion-referenced tests to score well on one test and not as well on the other. For example, a criterion-referenced test may be untimed while a norm-referenced test is timed; a criterion-referenced test may give credit for demonstrating the proper technique even if the final answer is incorrect; and a norm-referenced test splits the test takers so that half the students fall above the 50th percentile and half below it (Taylor & Walton, 2001). This means that a student who did well on a norm-referenced test, answering 85 percent of the questions correctly, will still fall below the 50th percentile if a majority of the other test takers correctly answered 86 percent of the questions. On a criterion-referenced test, correctly answering that percentage of questions would normally earn the student a grade of B.
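The contrast between the two score interpretations can be made concrete with a short sketch. The class scores below are invented for illustration, and the 80-percent mastery criterion is an assumption, not a figure from the text.

```python
# Hypothetical sketch contrasting norm-referenced and criterion-referenced
# score interpretations. The class scores are invented for illustration.

def percentile_rank(score: float, all_scores: list[float]) -> float:
    """Norm-referenced reading: percent of group scores below this score."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)


def mastery(percent_correct: float, criterion: float = 80) -> bool:
    """Criterion-referenced reading: did the student reach the standard?"""
    return percent_correct >= criterion


# A student answering 85% correctly in a class where most peers scored 86%:
class_scores = [86, 86, 86, 86, 85, 84, 83]
print(percentile_rank(85, class_scores))  # well below the 50th percentile
print(mastery(85))                        # True: the criterion is met
```

The same raw performance looks weak under a norm-referenced interpretation and strong under a criterion-referenced one, which is exactly the divergence the paragraph above describes.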
Criterion-referenced tests are developed by reviewing a set of objectives or a curriculum and then composing test questions with the goal of having the test “determine how well the students have mastered the identified objectives or curriculum. As with teacher-made tests, a criterion-referenced test can contain words that are unusual or rare in everyday speech and reading, as long as they occur in the curriculum and as long as the students have had an opportunity to learn them. With a criterion-referenced test, we are not much interested in differentiating students by their scores” (Bracey, 2000, p. 9). Ideally, all students would earn a passing score. Whereas norm-referenced tests normally align students by percentile ranks, criterion-referenced tests tend to differentiate students in terms of whether they have met or exceeded expectations. As with “norm-referenced tests, criterion-referenced tests should be evaluated in terms of reliability and content validity” before being used (Bracey, 2000, p. 9).
Further Insights
Advantages & Disadvantages of Criterion-Referenced Testing
Evaluating Learning
Criterion-referenced tests have many advantages and uses. A properly aligned criterion-referenced test can give detailed information about how well a student has performed on each educational outcome or goal included on a test (Bond, 1995). For example, a criterion-referenced mathematics test focusing on fractions can pinpoint each student's strengths and weaknesses because instructors will be able to see if their students have mastered adding, subtracting, multiplying, and dividing fractions based on the questions they correctly answer. In addition to the primary competencies, instructors will also be able to determine if their students know basic mathematical concepts by how they work out each problem. Instructors can use criterion-referenced tests to determine if their students are learning the curriculum and how well they are teaching the curriculum.
If the subject matter of the tests is properly coordinated with the content of the instruction, criterion-referenced tests can also give students, their parents, and instructors more information about what exactly students have learned, which can help everyone focus on which competencies have been mastered and which still need to be (Bond, 1995). Many proponents of using criterion-referenced tests instead of norm-referenced tests believe that the grades produced by a norm-referenced test are devalued because they are assigned regardless of what the test scores actually are. Receiving the highest grade in the class is not such a great achievement if the high score was 45 out of 100 and the student received an A nonetheless. On a norm-referenced test, half of any class achieves scores above the average and half below it; the remaining grades are assigned according to predetermined distribution patterns to form a traditional bell curve. Advocates of norm-referenced tests, in turn, contend that the grades resulting from criterion-referenced tests are cheapened because more "A" grades can be awarded than a norm-referenced test would produce: in theory, all students in a classroom can correctly answer almost all the questions on a criterion-referenced test, and every one of them will be given an "A." This leads to a debate about whether a grade of "A" is more meaningful when fewer of them are earned, as with a norm-referenced test, or whether it is more important to show that more students have mastered the material, as a criterion-referenced test does (Aviles, 2001).
Equal Expectations
Another positive factor in using criterion-referenced tests is that the level of expectations is equal for all students. Students at schools with a large percentage of disadvantaged or at-risk students are expected to do as well as students who come from more affluent families with highly educated parents. A disadvantage, however, is that schools, school districts, and states may institute lower cut scores to ensure that a sufficient number of students achieve mastery level. Criterion-referenced tests may help determine whether some schools have students who are more difficult to teach than other schools do, as well as whether some students are able to learn more easily than others. This holds, assuming that all the students have been taught the same competencies, have been made aware of the testing criteria, understand the importance of the assessment they will be taking, and all other factors are equal. A school with a disproportionate number of at-risk students may have lower scores than a school with fewer at-risk students (Miller-Whitehead, 2001).
In an effort to comply with NCLB, states use criterion-referenced testing to determine adequate yearly progress. However, since each state can choose its own number for meeting the standard of learning for making adequate progress, there can be great disparity in those numbers. For example, one state may choose 300 as the cut score to prove competency and meet state standards and another state may elect to go with a cut score of 600 to show adequate learning progress to meet state standards (Taylor & Walton, 2001).
Applications
Criterion-referenced tests measure knowledge or skills that students should master and are based on grade level curriculum guidelines. Developing a criterion-referenced test should be a simple process for classroom instructors since they know what they are teaching and have personal knowledge of the students, and so can adjust the test accordingly. If instructors use rubrics to guide their instruction, the process can be even simpler.
Developing Criterion-Referenced Tests
However, for school district, state and national administrators, determining an appropriate criterion-referenced test can be difficult. There is no universal curriculum for the nation. NCLB requires that states set standards because they need to show their districts are making adequate yearly progress. These standards include stated learning objectives for subject areas and grade levels. The primary intention of criterion-referenced tests is to show that upon completion of a course, curriculum, or school year that all students will meet a certain level of mastery of the material, which is generally correctly answering between 75 and 80 percent of the test items. Students are encouraged to study for a criterion-referenced test and should know exactly what will be assessed (Miller-Whitehead, 2001).
Student scores on criterion-referenced tests are interpreted relative to a predetermined standard or competency, whereas student scores on norm-referenced tests are interpreted relative to how well students have done compared to the other students who comprise the norm group, as well as others who have already taken the test. Criterion-referenced tests tend to differentiate students in terms of 'not proficient,' 'proficient,' and 'advanced,' or 'does not meet,' 'meets,' and 'exceeds' expectations (Bracey, 2000). Criterion-referenced tests may be considered less competitive than norm-referenced tests because students are measured against a set standard rather than against each other (Aviles, 2001). The content for a criterion-referenced test is chosen by how well it matches the learning outcomes or curriculum, whereas the content for a norm-referenced test is selected by how well it can rank students from low to high. In other words, the content of criterion-referenced tests is based on its importance to the curriculum and outcomes, and the content of norm-referenced tests is based on how well it can discriminate among students. Both criterion-referenced and norm-referenced tests may be standardized; standardization is appropriate if the content of the test matches the knowledge and skills expected of all students, a match that is easier to achieve with a criterion-referenced test (Bond, 1995). Comparing students to other students with norm-referenced tests is useful whenever classes have limited space, for deciding admission into gifted and talented programs, for college admissions purposes, and for bestowing certain performance-related awards. Criterion-referenced tests can be used to determine student mastery of curriculum content and to help assess instructional competency in terms of how effectively course content is taught.
Using Criterion-Referenced Tests
A criterion-referenced test is a much better choice for program evaluation because it can give detailed information about what a student can and cannot do. Students know what competencies they are expected to master for a criterion-referenced test and are encouraged to study; students taking a norm-referenced test are not expected to study and may have little or no knowledge of what will be addressed on the test. Thus, criterion-referenced tests are the better choice for addressing the requirements of the No Child Left Behind Act. Criterion-referenced tests can help teachers diagnose their students' strengths and weaknesses and evaluate their program of study, whereas norm-referenced tests cannot be used as an effective diagnostic tool (Griffee, 1995). Criterion-referenced tests are usually a good choice for determining whether a culturally or linguistically diverse child has a language disorder (Battle, 2002, as cited in Laing & Kamhi, 2003). They can be used to evaluate a child's performance on a particular ability or linguistic concept, making it possible to consider the social context in which communication occurs and how language is used within the culture, something a norm-referenced test cannot do (Laing & Kamhi, 2003).
The most important facet of a well-constructed criterion-referenced test is its test specifications, also known as the objectives, because instructors need to know exactly what their students' performance means; there can be no ambiguity in the descriptors. For example, it is not enough to say that a criterion-referenced test is going to cover fractions, because that does not establish whether the test will cover only adding and subtracting fractions, only multiplying and dividing them, or all four operations. A well-constructed criterion-referenced test should also have an adequate number of test items for each competency; ideally, at least five questions should address each concept, especially when the test is used as a high school exit exam or other high-stakes test. The test should be limited in scope so that each concept is adequately covered without overwhelming students with the number of questions asked. It is better to assess more frequently and keep each test properly focused than to overload students with too many questions or too much testing time. Properly crafted questions will also address lesser or foundational skills that should have been acquired along the way. A well-constructed criterion-referenced test should also possess satisfactory reliability and validity. If possible, comparative data should be used to help instructors, school districts, and/or states determine where to set the cut scores; this is especially important for high-stakes testing (Popham, 1978).
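The guideline that every objective should be covered by an adequate number of items lends itself to a simple blueprint check. The sketch below is a hypothetical illustration: the objective names and item counts are invented, and the five-item minimum reflects the ideal stated above.

```python
# Minimal sketch of auditing a test blueprint: flag any objective covered
# by fewer than the minimum number of items. Objectives and counts are
# hypothetical examples.

from collections import Counter

MIN_ITEMS_PER_OBJECTIVE = 5  # the "at least five questions" guideline


def underrepresented(item_objectives: list[str]) -> list[str]:
    """Return objectives with fewer than the minimum number of test items."""
    counts = Counter(item_objectives)
    return [obj for obj, n in counts.items() if n < MIN_ITEMS_PER_OBJECTIVE]


# One label per test item, naming the objective that item assesses:
items = (["adding fractions"] * 5 + ["subtracting fractions"] * 5
         + ["multiplying fractions"] * 5 + ["dividing fractions"] * 3)
print(underrepresented(items))  # ['dividing fractions']
```

A test failing this check would not support unambiguous conclusions about the under-covered objective, which is the core concern of the specifications discussed above.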
Instructors may have mixed feelings about whether to use criterion-referenced or norm-referenced tests. Since norm-referenced tests are usually graded on a curve, if the highest score in the class is a 70, students who score a 70 will receive a grade of A, and students with scores in the 50s and 60s could receive Bs and Cs. With a criterion-referenced test and a highest score of 70, no one gets an A or a B. Either scenario could mean that the instructor was not able to adequately convey the lesson or that the students did not try hard enough to learn the concepts being taught. In using the traditional bell-shaped curve associated with norm-referenced tests, instructors need to consider whether they are lowering the bar: as long as no one excels in the class, average or below-average scores will earn grades of A and B. Grading on a curve can also make it difficult to determine whether an instructor's teaching skills are improving. Criterion-referenced grading does the opposite, forcing instructors to examine their methods if the highest score on a test was 60 or 70. Instructor expectations may have been too high, or the information may not have been taught with appropriate intensity, duration, or format; another possibility is that student effort was insufficient. Regardless of the reason, criterion-referenced tests can help with the assessment of instruction. They also help schools, districts, and states provide measurable outcomes in compliance with NCLB (Aviles, 2001).
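The two grading approaches can be sketched side by side. This is a hypothetical illustration: the fixed grade boundaries and the simple "shift the class maximum to 100" curve are assumptions chosen to mirror the scenario above, not a prescribed method.

```python
# Hypothetical sketch of criterion-referenced vs. curved grading.
# Grade boundaries and the curving rule are illustrative assumptions.

def criterion_grade(score: float) -> str:
    """Fixed scale: the grade depends only on the score itself."""
    for cutoff, grade in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= cutoff:
            return grade
    return "F"


def curved_grade(score: float, class_scores: list[float]) -> str:
    """Simple curve: shift every score so the class maximum becomes 100."""
    shift = 100 - max(class_scores)
    return criterion_grade(score + shift)


class_scores = [70, 65, 58, 52]
print([criterion_grade(s) for s in class_scores])             # ['C', 'D', 'F', 'F']
print([curved_grade(s, class_scores) for s in class_scores])  # ['A', 'A', 'B', 'B']
```

With a class maximum of 70, the fixed scale awards no grade above a C, while the curve hands out As and Bs for the same work, which is precisely the "lowering the bar" concern raised above.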
In trying to meet NCLB standards of making adequate yearly progress, states should take care in what type of standardized test they elect to use. There are three criteria that should be kept in mind when choosing the appropriate assessment instrument:
• Whether or not the assessment strategies of a particular test match the state's predetermined educational goals.
• Whether or not the testing instrument addresses the content that the state wants to assess.
• Whether or not the assessment tool is capable of giving the types of interpretations administrators need to make about student performance in order to meet reporting standards (Bond, 1995).
Well-constructed criterion-referenced tests can be an effective tool for states, districts, and schools to meet federal reporting guidelines and for instructors to assess their performance in the classroom and determine whether or not their students have achieved mastery of the subjects being assessed.
Terms & Concepts
Bell Curve: A bell curve is a method of distributing grades so that the grades assigned among students follow a distribution resembling the shape of a bell, with most students receiving average grades and fewer receiving very high or very low ones.
Content Validity: Content validity is the degree to which an exam measures what it is intended to measure.
Criterion-Referenced Tests: Criterion-referenced tests are assessments given to students to determine if specific skills have been mastered.
High School Exit Exams: High school exit exams are tests that students must pass in order to graduate from high school and receive a diploma.
High-Stakes Tests: High-stakes tests are those whose scores are used to make decisions that have important consequences for students, schools, school districts, and/or states and can include high school graduation, promotion to the next grade, resource allocation, and instructor retention.
No Child Left Behind Act of 2001 (NCLB): The latest reauthorization and a major overhaul of the Elementary and Secondary Education Act of 1965, the major federal law regarding K-12 education.
Norm Group: A norm group is a group of similar students who initially took a test when it was created. All students subsequently taking the test have their scores interpreted in relation to the performance of the norm group.
Norm-Referenced Tests: Norm-referenced tests are assessments administered to students to determine how well they perform in comparison to other students taking the same assessment.
Rubric: A rubric is a scoring guide, often used in formative assessment, that describes the different levels of quality, from proficient to poor, for a specific project or assignment.
Percentile Rank: Percentile rank gives a student's relative position by stating the percentage of student scores that fell below the stated score.
Standardized Tests: Standardized tests are tests that are given and evaluated in a uniform manner. The exams are created specifically so that each question and intended interpretation is consistent.
Bibliography
Aviles, C. (2001). Grading with norm-referenced or criterion-referenced measurements: To curve or not to curve, that is the question. Social Work Education, 20, 603-608. Retrieved July 4, 2007 from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=6006829&site=ehost-live
Bond, L. (1995). Norm-referenced testing and criterion-referenced testing: The differences in purpose, content, and interpretation of results. Oak Brook, IL: North Central Regional Educational Lab. (ERIC Document Reproduction Service No. ED402327). Retrieved July 4, 2007 from EBSCO Online Education Research Database. http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/14/d2/23.pdf
Bracey, G. (2000). Thinking about tests and testing: A short primer in "assessment literacy." Washington, DC: American Youth Policy Forum; Denver, CO: National Conference of State Legislatures. (ERIC Document Reproduction Service No. ED445096). Retrieved July 4, 2007 from EBSCO Online Education Research Database. http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/16/76/5a.pdf
Fulcher, G., & Svalberg, A. (2013). Limited aspects of reality: Frames of reference in language assessment. International Journal of English Studies, 13, 1-19. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=92896295&site=ehost-live
Griffee, D. (1995). Classroom testing for teachers who hate testing: Criterion-referenced test construction and evaluation. (ERIC Document Reproduction Service No. ED385140). Retrieved July 14, 2007 from EBSCO Online Education Research Database. http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/14/16/42.pdf
Johnson, T. G., Prusak, K. A., Pennington, T., & Wilkinson, C. (2011). The effects of the type of skill test, choice, and gender on the situational motivation of physical education students. Journal of Teaching in Physical Education, 30, 281-295. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=66418794&site=ehost-live
Key cards' records used to bust DeKalb educators?. (2011). School Planning & Management, 50, 10. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=59164555&site=ehost-live
Laing, S. & Kamhi, A. (2003). Alternative assessment of language and literacy in culturally and linguistically diverse populations. Language, Speech, & Hearing Services in Schools, 34, 44-55. Retrieved July 4, 2007 from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=8841126&site=ehost-live
Linn, R. (2005). Issues in the design of accountability systems. Yearbook of the National Society for the Study of Education, 104, 79-98. Retrieved May 2, 2007 from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=17238819&site=ehost-live
Miller-Whitehead, M. (2001). Practical considerations in the measurement of student achievement. (ERIC Document Reproduction Service No. ED457244). Retrieved July 4, 2007 from EBSCO Online Education Research Database. http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/19/41/c0.pdf
Popham, W. (1978). Well-crafted criterion-referenced tests. Educational Leadership, 36, 91. Retrieved July 14, 2007 from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=7730897&site=ehost-live
Taylor, K. & Walton, S. (2001). Who is norm? Instructor, 110, 18. Retrieved July 14, 2007 from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=4151097&site=ehost-live
Suggested Reading
American Association of School Administrators (1993). Making sense of testing and assessment. Lanham, MD: Rowman & Littlefield Publishers, Inc.
Brown, J. & Hudson, T. (2002). Criterion-referenced language testing. New York, NY: Cambridge University Press.
Popham, W. (1978). Criterion referenced measurement. Englewood Cliffs, NJ: Prentice-Hall, Inc.
Shaycoft, M. (1979). Handbook of criterion-referenced testing: Development, evaluation, and use. London, UK: Garland.
Shorrocks-Taylor, D. (1999). National testing: Past, present and future. Malden, MA: Blackwell Publishing Limited.