Norm-Referenced Testing

This article focuses on norm-referenced testing. Norm-referenced tests are assessments administered to students to determine how well they perform in comparison to other students taking the same assessment. The article describes how norm-referenced tests are created and developed and gives examples of widely used tests. Norm-referenced tests are compared with criterion-referenced tests, and some advantages and disadvantages of norm-referenced testing are discussed. Scoring systems and techniques are also explained.

Keywords Content Validity; Criterion-Referenced Test; Educational Assessment; High-Stakes Tests; Norm-Referenced Test; Normative Sample; Norming; Percentile; Percentile Rank; Sampling Error; SAT Test; Standardized Tests; Test Bias

Overview

Norm-referenced tests are assessments administered to students to determine how well they perform in comparison to other students taking the same assessment. Each student's current performance can be compared to that of a representative sample of students who have previously taken the test, known as a norm group or normative sample. Norm-referenced tests differ from criterion-referenced tests in that criterion-referenced tests show how a student stands in relation to a particular educational curriculum, with an emphasis not on comparing students with others taking the assessment but on whether a student has mastered specific skills that have been taught (Monetti & Hinkle, 2003). Very few, if any, students are expected to attain a perfect score on a norm-referenced test, and students are generally not encouraged to study for norm-referenced tests because the tests are intended to measure a broad range of general knowledge already attained in reading, language arts, mathematics, science, and social studies (Miller-Whitehead, 2001).

Using the Test

Norm-referenced tests are used to try to predict how well students will do in certain situations, such as college. They can also be used to place students in gifted and talented or remedial programs (Bracey, 2000). The SAT and the Preliminary SAT/National Merit Scholarship Qualifying Test (PSAT/NMSQT) are examples of norm-referenced tests. The SAT is used for college entrance, and the PSAT/NMSQT gives students practice for the SAT Reasoning Test and a chance to enter the National Merit Scholarship Corporation scholarship programs ("About PSAT/NMSQT," n.d.). The California Achievement Test and the Iowa Test of Basic Skills are also norm-referenced tests. There is no passing or failing a norm-referenced test; each student's score is reported relative to the scores of others who have taken the test, generally as a percentile.

Designing the Test

Step 1: Examining the Curriculum Materials

To develop a norm-referenced test, test publishers examine the curriculum materials produced by textbook and workbook publishers and then develop questions that measure the skills most commonly used in the materials they have reviewed. Experts then review the items to determine their content validity, or whether the test measures what it is supposed to measure. For example, a norm-referenced test that purports to measure students' reading skills but only assesses vocabulary would not have high content validity and should be redesigned. After proper content validity has been established, the test is tried out on a sample of students to see how the questions are answered. A norm-referenced test should include neither questions that hardly any students can answer correctly nor items that nearly all students answer correctly; in general, norm-referenced tests only include items that between 30 percent and 70 percent of the students who have taken the test answer correctly. In addition, questions that students with overall high scores tend to miss, and questions that students with overall low scores tend to answer correctly, are removed (Bracey, 2000).
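
This screening amounts to a difficulty-and-discrimination filter over pilot data. The sketch below is a minimal illustration of that logic, not any publisher's actual procedure; the 0/1 response matrix, the use of top and bottom thirds, and the requirement of positive discrimination are assumptions for demonstration, with only the 30-70 percent difficulty band taken from the text above.

    def screen_items(responses, low=0.30, high=0.70):
        """Keep items whose difficulty falls in [low, high] and on which
        high-scoring students outperform low-scoring students."""
        n_students = len(responses)
        n_items = len(responses[0])

        # Rank students by total score; take bottom and top thirds.
        totals = [sum(row) for row in responses]
        order = sorted(range(n_students), key=lambda s: totals[s])
        third = max(1, n_students // 3)
        low_group, high_group = order[:third], order[-third:]

        kept = []
        for i in range(n_items):
            # Difficulty: proportion of all students answering item i correctly.
            p = sum(responses[s][i] for s in range(n_students)) / n_students
            # Discrimination: high scorers minus low scorers on item i.
            d = (sum(responses[s][i] for s in high_group) / len(high_group)
                 - sum(responses[s][i] for s in low_group) / len(low_group))
            if low <= p <= high and d > 0:
                kept.append(i)
        return kept

    # Example: 4 students x 3 items of hypothetical pilot data.
    pilot = [[1, 1, 0],
             [1, 0, 0],
             [1, 1, 1],
             [0, 0, 0]]
    print(screen_items(pilot))  # [1]: item 0 is too easy, item 2 too hard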

Step 2: Interpreting Student Performance

Each student's performance on a norm-referenced test is assessed according to the performance of a "normed" group, a larger set of students who previously took the exam in order to establish the norms. Because "norming" an exam, that is, constructing its norms, is an expensive and involved process, test publishers usually use the same norms for about seven years. The results are reported as a percentile rank, meaning that a student receiving a score of 65 has performed as well as or better than 65 percent of the norming students, who, if the test was properly normed, are representative of all students who have taken the norm-referenced test since it was first given (Bond, 1995). However, scores on norm-referenced tests typically rise the longer a test is in use, which could be attributed to changes in instruction or test preparation that instructors implement as they become familiar with the test questions (Linn, Graue & Sanders, 1990, as cited in Monetti & Hinkle, 2003). Knowing student rank can be useful in deciding whether a student needs remedial assistance in a subject area or should be included in a gifted and talented program. Norm-referenced test results cannot provide information about what exactly students know, only that they know more of the test content than a given percentage of the students who comprise the norm group (Bond, 1995).

Step 3: Constructing Norms

The process of constructing norms is called "norming." The norms are entered into a chart that lets the test interpreter convert each raw score to a derived score, which makes it easier to compare one student's score to the norm group. There are four types of derived scores: percentiles, standard scores, developmental scales, and ratios and quotients. To construct norms, the population of interest first needs to be identified. This might be anything from the student body of a certain school district, to all applicants to a program, to all residents of one state, to all students in a particular region such as the Midwest, Northeast, Pacific Northwest, or South. The most important statistics to be analyzed for the sample data should then be determined, as should the tolerable amount of sampling error for those statistics. A procedure also needs to be devised for obtaining the norm group, and the sample size needs to be determined (Rodriguez, 1997).
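
As a concrete illustration of such a conversion chart, the following minimal sketch maps every possible raw score to a percentile rank, one of the derived scores named above. The norm-group scores and test length are hypothetical; real norming uses far larger, carefully drawn samples.

    # Hypothetical norm-group raw scores; a real norm group is far larger
    # and chosen to represent the population of interest.
    norm_scores = [12, 15, 18, 21, 21, 24, 26, 28, 31, 35]
    max_raw = 40  # assumed number of items on the test

    # Build the chart once: every possible raw score -> percentile rank.
    norms_chart = {}
    for raw in range(max_raw + 1):
        at_or_below = sum(1 for s in norm_scores if s <= raw)
        norms_chart[raw] = 100 * at_or_below / len(norm_scores)

    # Interpreting a new student's score is then a single lookup.
    print(norms_chart[24])  # 60.0: at or above 60 percent of the norm group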

What Are Norm-Referenced Tests?

Norm-referenced tests report student scores in relation to those established by a norm, or average, group of similar students who have taken the same test. Norms are statistics that describe how a defined group of students performed on a given test. Many norm groups can be established for different tests, and a student's relative ranking is not always predictable, as it depends on which norm group is used for comparison. To help states and districts select the norm-referenced test that best suits their needs, the normative sample should be described in detail, including the group's gender distribution, racial and ethnic background, geographic location, socioeconomic position, and education level (Rodriguez, 1997). This information allows states and districts to assess whether the comparison would be meaningful for their students. If the demographic characteristics do not match those of the students to be assessed, that particular test should not be used, as the results would not be relevant.

Further Insights

Norm-referenced tests are assessments given to students to determine how they perform in comparison to their peers who have taken the same test. All students taking the test are compared to a norm group that was given the test before it was distributed for mass use. Norm groups can be national or local, and which is used depends on the type of comparison being sought. Norm-referenced tests and criterion-referenced tests are vastly different but can be used in conjunction to provide an overall view of student performance, with norm-referenced tests providing a comparison with other students and criterion-referenced tests showing student mastery of subject matter (Monetti & Hinkle, 2003).

Scoring

Percentiles & Percentile Ranking

Percentiles are the most commonly used derived scores because they are easy to interpret. The difference between the two measures is that “a percentile is a point in the distribution below which a certain percentage of the scores fall, and a percentile rank gives a student's relative position, or the percentage of student scores that fell below the obtained score. For example, the 90th percentile is the point below which 90 percent of the scores in the distribution fall; it does not mean that a student who has scored at the 90th percentile answered 90 percent of the questions correctly. The percentile rank of a score is the percentage of scores less than or equal to that particular score. The percentile rank of a score of 85 is the percentage of scores in the distribution that fall at or below a score of 85. A percentile rank is a point on the percentile scale, and a percentile is a point on the original measurement scale” (Rodriguez, 1997, p. 8).
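
To make the distinction concrete, here is a minimal worked sketch using a small hypothetical score distribution and the simple nearest-rank method; operational testing programs use much larger samples and smoothed procedures.

    scores = sorted([52, 58, 61, 64, 67, 70, 73, 77, 82, 90])  # hypothetical

    def percentile(p):
        # A percentile is a point on the score scale: the score below which
        # roughly p percent of the distribution falls (nearest-rank method).
        k = max(0, min(len(scores) - 1, round(p / 100 * len(scores)) - 1))
        return scores[k]

    def percentile_rank(score):
        # A percentile rank is a point on the 0-100 scale: the percentage
        # of scores less than or equal to the given score.
        return 100 * sum(1 for s in scores if s <= score) / len(scores)

    print(percentile(90))       # 82: the 90th percentile is a score, not a percentage
    print(percentile_rank(82))  # 90.0: a score of 82 has a percentile rank of 90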

Tests Not to Be Used for Grading

Norm-referenced tests should not be used to determine grades. Because the distribution of norm-referenced test results follows a traditional bell-shaped curve, the results can differ depending on the class taking the assessment. For example, in a school district near Washington, D.C., all secondary schools administered the same end-of-course algebra test to their students. At one school, a score of 66 garnered a grade of A, while at other schools the same 66 earned students a grade of B, C, or D. When a norm-referenced test is used for grading, it can be very discouraging for students: if the majority of a class scores between 85 and 100, a student's 80 can result in a grade of D or F. In this case, a criterion-referenced test would be more appropriate, as it would grade students on what they have mastered rather than on how their work compares to that of their classmates (Brandt, 2003).
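
A minimal sketch shows why the same raw score can earn different curved grades in different classes. The z-score cutoffs and class scores below are hypothetical illustrations, not the district's actual grading method.

    import statistics

    def curved_grade(score, class_scores):
        # Grade relative to the class; the z-score cutoffs are assumed.
        z = (score - statistics.mean(class_scores)) / statistics.stdev(class_scores)
        if z >= 1.0:
            return "A"
        if z >= 0.0:
            return "B"
        if z >= -1.0:
            return "C"
        return "D"

    low_scoring_class = [45, 50, 55, 58, 60, 62, 66]
    high_scoring_class = [66, 78, 85, 88, 91, 95, 99]

    print(curved_grade(66, low_scoring_class))   # A: 66 is well above this class's mean
    print(curved_grade(66, high_scoring_class))  # D: the same 66 is far below this mean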

Impact of Socio-Economic Background on Testing

Standardized, norm-referenced tests are intended to treat all students taking the assessment as equals, since the norm group is representative of all students taking the test. However, research into test results shows that students from higher socioeconomic backgrounds historically do better than students from low-income backgrounds, and that African-American and Hispanic students do not do as well as white students. Research conducted in the late 1990s by the College Board, which owns the SAT, showed that average verbal and mathematics scores had not changed much, but that over the previous ten years the gap had widened between white students and lower-scoring African-American and Hispanic students. This may be partly because more minority students were taking the SAT (Wildavsky, 1999).

Test Bias

Cultural Bias vs. Economic Status

There is empirical evidence that norm-referenced tests are not culturally biased against African-Americans as an ethnic group, and that lower scores are more closely related to economic status (Roberts & DeBlassie, 1983, as cited in Castenell, Jr. & Castenell, 1988). However, test bias in the form of content validity problems and lack of test readiness has been cited as the reason low-income and minority students do poorly on standardized, norm-referenced achievement tests (Royer & Feldman, 1984, as cited in Castenell, Jr. & Castenell, 1988). Test publishers develop achievement tests based on what is commonly taught. If any gender, socioeconomic, or racial/ethnic group is underrepresented in the norming process, there is a possibility of test bias, because biased items cannot be detected and eliminated. Test readiness refers to all the social and psychological factors that can influence test performance, such as motivation, students' emotional stability, and their degree of anxiety about the test. This potential bias is difficult to measure because studies have shown that social and psychological factors are not equally present across all groups tested; therefore, test readiness is a bias that might be impossible to completely eliminate (Castenell, Jr. & Castenell, 1988).

Controversial Proposals for the SAT

As colleges and universities try to determine what criteria should be used for admitting students, the makers of the SAT posed a few considerations that were viewed as somewhat controversial, such as adding points to African-American and Hispanic students' scores to reward better-than-expected achievement, and grading students from low-income backgrounds on a curve to help them compete with their more privileged, higher-scoring peers. While these are possible solutions, they undermine the reason for using norm-referenced tests in the first place: a properly developed norm-referenced test is supposed to level the playing field for all students and be representative of all students who will be taking the assessment. In an attempt to resolve the dilemma admissions officers face when trying to admit more minority students while evaluating all applicants by similar criteria, Educational Testing Service, which creates and administers the SAT for the College Board, proposed developing a type of handicapping process for the SAT. It also developed a “new formula that predicts a student's expected SAT score based on a range of factors, including family income, parental education, and the quality of the high school attended. Students who earn scores that exceed the score predicted by their socioeconomic status by 200 points or more are identified as strivers. Another version of the formula would include race in the criteria, and students would be identified as strivers if they performed significantly better than the average score of their own racial or ethnic group” (Wildavsky, 1999, ¶ 2).

Those who oppose these suggestions maintain that they set a double standard for poor and minority students. The underlying assumption is also questionable: scoring above the predicted level for one's socioeconomic group or racial/ethnic background does not necessarily mean that a student had to overcome obstacles (Wildavsky, 1999). Norm-referenced tests are intended to treat all students as equals; therefore, students from different socioeconomic and racial/ethnic backgrounds should score the same if they have the same ability. If an assessment is determined to be biased, the norming process should be revisited for the next version of the test and the bias removed.

Viewpoints

Advantages of Norm-Referenced Testing

Norm-referenced tests have several advantages. Properly normed norm-referenced tests can produce reliable and objective measurement as long as the composition of the norm group is understood (Oosterhof, 2003, as cited in Monetti & Hinkle, 2003). One advantage of using norm-referenced tests is that they can give parents and counselors some idea of how well the student is doing compared to other students who have taken the test, which can aid in making career or college decisions (Miller-Whitehead, 2001).

Challenges of Norm-Referenced Testing

The most discussed challenges with norm-referenced testing are content bias, linguistic bias, and unbalanced representation in normative samples. As the population of the United States becomes even more ethnically and racially diverse, these issues will become more relevant.

Content Bias

Content bias refers to the expectation that every student has learned the same concepts, lessons, and vocabulary and has experienced comparable life events. Students from culturally and linguistically diverse backgrounds may not perform as well as other students on a content-biased test because of their different cultural experiences and socialization (Stockman, 2000, as cited in Laing & Kamhi, 2003).

Linguistic Bias

Linguistic bias refers to differences between the dialect used by the test administrator or testing instrument and the dialect used by the test-taker, including the dialect expected in a student's responses. One example is an instructor's use of standard American English with students who use African American Vernacular English. These students might not test as well as expected because they are unfamiliar with some of the language used on the testing instrument. If linguistic bias is not taken into consideration, their poor performance may be misinterpreted as a learning deficiency or impairment. Similarly, if students from culturally and linguistically diverse backgrounds are underrepresented in the normative samples for a norm-referenced test, they may not do as well as students who were included. Test developers have more recently included representation of differing populations in their normative groups; the concern now is whether diverse populations are represented in proportions that make the normative samples valid (Laing & Kamhi, 2003).

Alignment to State Standards

One disadvantage of norm-referenced standardized tests that have been marketed to states as aligned with each state's standards is that the testing instruments may not in fact be properly aligned. One study revealed wide inconsistencies between publishers' perceptions of alignment and those of practicing teachers who evaluated two publishers' tests. In language arts, for example, the practicing teachers judged only six of the sixteen standards evaluated (38 percent) to be aligned with either Test A or Test B. The publisher of Test A, however, judged eleven of the sixteen standards (69 percent) to be aligned with its test, and the publisher of Test B judged fourteen of the sixteen standards (88 percent) to be aligned with its test (Buckendahl, Plake, Impara & Irwin, 2000).

True Representation of the Students of the State

Another possible disadvantage of using norm-referenced, standardized tests to assess statewide education standards is that the norm group may not be truly representative of the students of the state. This is especially important in high-stakes testing, and state administrators should scrutinize the norms and normative group before purchasing an assessment. When examining norms, the types of derived scores reported, how demographically representative the normative sample is of the state's student population, the size of the normative group, and how recently the test was standardized should all be considered (Wallace, Larsen & Elskin, 1992, as cited in Gronna & Jenkins, 1997).

Terms & Concepts

Content Validity: Content validity is the extent to which an exam measures what it is purported to measure.

Criterion-Referenced Test: Criterion-referenced tests are assessments given to students to determine if specific skills have been mastered.

High-Stakes Tests: High-stakes tests are tests whose scores are used to make decisions that have important consequences for students, schools, school districts, and/or states, such as high school graduation, promotion to the next grade, resource allocation, and instructor retention.

Norm-Referenced Test: Norm-referenced tests are assessments administered to students to determine how well they perform in comparison to other students taking the same assessment.

Percentile: A percentile is the point in a distribution below which a given percentage of the scores fall.

Percentile Rank: Percentile rank reports a student's relative position: the percentage of scores that are less than or equal to his or her score.

Sampling Error: Sampling error is the difference between an estimate computed from a sample and the true value of the corresponding population parameter.

SAT Test: The SAT test is a standardized, norm-referenced test taken by high school students applying to college and is used by some college admissions professionals as part of the admissions process and for scholarship selection.

Standardized Tests: Standardized tests are tests that are administered and scored in a uniform manner, and the tests are designed in such a way that the questions and interpretations are consistent.

Test Bias: Test bias occurs when demonstrable, systematic differences in the results of students taking a test are discernible based on group membership, such as gender, socioeconomic standing, race, or ethnic group.

Bibliography

About PSAT/NMSQT (n.d.). Retrieved July 4, 2007, from http://www.collegeboard.com/student/testing/psat/about.html

Aviles, C. (2001). Grading with norm-referenced or criterion-referenced measurements: To curve or not to curve, that is the question. Social Work Education, 20, 603-608. Retrieved July 4, 2007, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=6006829&site=ehost-live

Bassett, S. (2011). Normed testing: Fair means for comparing skills. Education Week, 30, 31-32. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=58553980&site=ehost-live

Bond, L. (1995). Norm-referenced testing and criterion-referenced testing: The differences in purpose, content, and interpretation of results. Retrieved July 4, 2007, from Education Resources Information Center. http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/14/d2/23.pdf

Bracey, G. (2000). Thinking about tests and testing: A short primer in "Assessment Literacy." Washington, D.C.: American Youth Policy Forum. (ERIC Document Reproduction Service No. ED 445 096). Retrieved July 4, 2007, from Education Resources Information Center. http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/16/76/5a.pdf

Brandt, R. (2003). Don't blame the bell curve. Leadership, 32, 18. Retrieved July 4, 2007, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=8945418&site=ehost-live

Buckendahl, C., Plake, B., Impara, J. & Irwin, P. (2000). Alignment of standardized achievement tests to state content standards: A comparison of publishers' and teachers' perspectives. (ERIC Document Reproduction Service No. ED 442 829). Retrieved July 4, 2007, from Education Resources Information Center. http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/16/48/f0.pdf

Castenell, Jr., L. & Castenell, M. (1988). Norm-referenced testing and low-income blacks. Journal of Counseling & Development, 67, 205. Retrieved July 4, 2007, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=4962098&site=ehost-live

Flynn, L. J., Zheng, X., & Swanson, H. (2012). Instructing struggling older readers: A selective meta-analysis of intervention research. Learning Disabilities Research & Practice (Wiley-Blackwell), 27, 21-32. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=71839443&site=ehost-live

Fulcher, G., & Svalberg, A. (2013). Limited aspects of reality: Frames of reference in language assessment. International Journal of English Studies, 13, 1-19. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=92896295&site=ehost-live

Gronna, S. & Jenkins, A. (1997). Creating local norms to evaluate students in a norm-referenced statewide testing program. (ERIC Document Reproduction Service No. ED 408 339). Retrieved July 4, 2007, from Education Resources Information Center. http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/16/a2/32.pdf

Laing, S. & Kamhi, A. (2003). Alternative assessment of language and literacy in culturally and linguistically diverse populations. Language, Speech, & Hearing Services in Schools, 34, 44-55. Retrieved July 4, 2007, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=8841126&site=ehost-live

Miller-Whitehead, M. (2001). Practical considerations in the measurement of student achievement. (ERIC Document Reproduction Service No. ED 457 244). Retrieved July 4, 2007, from Education Resources Information Center. http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/19/41/c0.pdf

Monetti, D. & Hinkle, K. (2003). Five important test interpretation skills for school counselors. (ERIC Document Reproduction Service No. ED 481 472). Retrieved July 4, 2007, from Education Resources Information Center. http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/1b/78/be.pdf

Rodriguez, M. (1997). Norming and norm-referenced test scores. (ERIC Document Reproduction Service No. ED 406 445). Retrieved July 4, 2007, from Education Resources Information Center. http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/16/77/47.pdf

Wildavsky, B. (1999). Grading on a curve. U.S. News & World Report, 127, 53. Retrieved July 4, 2007, from EBSCO Online Database Academic Search Premier. http://search.ebscohost.com/login.aspx?direct=true&db=aph&AN=2227062&site=ehost-live

Suggested Reading

American Association of School Administrators (1993). Making sense of testing and assessment. Lanham, MD: Rowman & Littlefield Publishers, Inc.

Black, P. (1997). Testing: Friend or foe? Theory and practice of assessment and testing. Oxford, UK: Routledge.

Gipps, C. (1994). Beyond testing: Towards a theory of educational assessment. Oxford, UK: Routledge.

Shorrocks-Taylor, D. (1999). National testing: Past, present and future. Malden, MA: Blackwell Publishing Limited.

Essay by Sandra Myers, M.Ed.

Sandra Myers has a Master's degree in Adult Education from Marshall University and is the former Director of Academic and Institutional Support at Miles Community College in Miles City, Montana, where she oversaw the College's community service, developmental education, and academic support programs. She has taught business, mathematics, and computer courses; and her other areas of interest include adult education and community education.