Educational testing

Educational testing is an integral part of the modern education system, used at the local, state, and federal levels primarily to evaluate student performance and hold schools accountable. Its main objectives include assessing student progress, identifying strengths and weaknesses, guiding educational decisions, and informing curriculum development. There are two primary types of tests: norm-referenced tests (NRTs), which compare students against a norming group, and criterion-referenced tests (CRTs), which evaluate whether students have mastered specific standards.

Concerns regarding the validity and reliability of these tests are prevalent, as educators question whether the assessments accurately reflect student achievement and cater to diverse populations. Moreover, educational testing has become a focal point for discussions about equity, especially as socioeconomic factors and cultural biases appear to affect test outcomes. Trends in educational assessment have included international comparisons and the implementation of large-scale assessments like the National Assessment of Educational Progress (NAEP) in the U.S. Critics of educational testing argue that it can lead to anxiety among students, promote a narrow focus on test preparation, and fail to encapsulate the full range of a student's abilities. Overall, while educational testing serves important functions, it remains a topic of debate regarding its effectiveness and fairness in measuring student learning.

Published in: 2022

By: Holaway, Calli A.

Subject Terms

Educational testing

Summary: Mathematicians and researchers are constantly exploring the validity and reliability of educational testing.

Purpose of Testing

Educational testing is pervasive in modern education at the local, state, and federal levels, and mathematics is one of the most frequently tested areas. The purpose of educational testing is broad and multifaceted: to assess student progress and school accountability; to identify students’ strengths and weaknesses, as well as their eligibility or need for special services; to make educational decisions about individuals and groups of students; to choose curriculum and instructional techniques; to reward teachers or schools for performance; and to formulate educational legislation and policies. Students are often placed in courses and special programs as a result of educational testing and may be required to pass tests to graduate from high school or be admitted to schools at all levels, especially colleges and universities.

While some educators, parents, and politicians cite standardized tests for their presumed objectivity in measuring achievement and other skills or attributes, these tests are frequently a source of anxiety and competitive pressure for students. There is an entire industry dedicated to helping students prepare for and pass or score well on these tests. At the same time, researchers are constantly exploring the validity and reliability of tests with regard to fairness for subgroups of students, as well as their actual predictive ability. For example, there is a broad body of research on whether measures like high school grade point average, SAT math scores, or mathematics placement tests are predictors of success in college mathematics courses.

Types of Testing

The decisions that can be made from testing information depend on the type of test administered. Two types of tests provide different kinds of information: norm-referenced tests (NRTs) and criterion-referenced tests (CRTs).

NRTs are created to compare students to a norming group, which is composed of students who are similar to the student being tested. The scores of the norming group typically approximate the familiar normal (bell-shaped) curve. NRT scores are usually reported as percentiles, which indicate that a student scored above a certain percentage of the norming group. For example, a student at the 84th percentile scored the same as or higher than 84% of the students in the norming group. It is a common misconception that NRTs compare students to all other students who have taken the test; in fact, test-takers are compared only with the norming group, and most NRTs are renormed every several years using a new group. NRTs are typically very general in nature, covering a broad range of objectives. Items with a variety of difficulty levels are chosen for NRTs because they encourage wide variability in the scores, allowing evaluators to determine more accurately how a student compares to others. The SAT and many intelligence tests, such as the Wechsler Intelligence Scale for Children, are norm-referenced tests.
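The percentile definition above can be illustrated with a minimal Python sketch. The function name `percentile_rank` and the ten norming scores are hypothetical, chosen only for illustration; published tests use much larger norming samples and often smoothed norms.

```python
from bisect import bisect_right


def percentile_rank(score, norming_scores):
    """Percentage of the norming group scoring at or below `score`.

    Follows the "same as or higher than" reading used in the text.
    """
    ordered = sorted(norming_scores)
    at_or_below = bisect_right(ordered, score)  # count of scores <= score
    return 100.0 * at_or_below / len(ordered)


# Hypothetical norming group of ten scores (illustrative data only).
norming_group = [410, 450, 480, 500, 520, 540, 560, 600, 640, 700]

# A student scoring 600 matches or exceeds 8 of the 10 norming scores.
print(percentile_rank(600, norming_group))  # 80.0
```

Note that the student is compared only with the fixed norming group, not with everyone else who sat the same test.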

Unlike NRTs, which are used to compare students to each other, CRTs are used to determine if a student has mastered a given set of standards. CRTs are typically narrow in focus, testing only a few objectives, and are generally focused on those objectives that are deemed most important. Scores for CRTs are typically reported as percentage correct or as scaled scores. Proficiency is determined by comparing a student’s score to an established cut point. Many schools regularly administer end-of-grade or end-of-course tests through which student achievement in mathematics subjects is measured.
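By contrast with a percentile, a CRT result depends only on the student's own score and the cut point. The sketch below assumes a hypothetical cut point of 70 percent correct and assumes that a score exactly at the cut point counts as proficient; actual cut points and rules are set by each testing program.

```python
def percent_correct(num_correct, num_items):
    """Raw score expressed as a percentage of items answered correctly."""
    return 100.0 * num_correct / num_items


def is_proficient(score, cut_point):
    """True if the score meets or exceeds the cut point (assumed rule)."""
    return score >= cut_point


CUT_POINT = 70.0  # hypothetical value, for illustration only

score = percent_correct(42, 50)
print(score, is_proficient(score, CUT_POINT))  # 84.0 True
```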

Issues in Educational Testing

Two primary concerns with educational testing are the validity and reliability of the assessment. “Validity” in this context refers to whether a test is appropriate for the population being tested, as well as whether it appropriately addresses the content it is intended to measure. “Reliability” refers to the consistency of the scores a test produces, for example across repeated administrations or alternate forms. Educators from around the United States have expressed concern as to whether the tests currently used to measure student achievement are valid and reliable. In an effort to address this concern, many states have revised their tests in the past several years.

An additional concern with educational testing is how student progress is measured over time. Statisticians have developed a variety of growth models to determine if individual students are improving as they move through school. These models may focus on improvement from grade level to grade level, or they may focus on student progress within a single school year (referred to as “value-added” or “teacher impact”). An ongoing issue with measuring student progress over time lies in the relationship between the assessments and the statistical measures that are used to analyze assessment data. Growth models are all based on certain assumptions about the assessments, which may or may not be met. In order to determine the impact of schools on student learning, one must ensure that the assessments and the statistical models used to analyze the data are compatible.
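One very simple way to picture a growth measure is a residual gain: predict each student's end-of-year score from the start-of-year score with a fitted line, then average the prediction errors by classroom. The Python sketch below is only an illustration under strong assumptions (both tests on a comparable scale, a linear relationship, no other student characteristics); it is not the model used by any particular state or testing program, and all names and data are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for predicting y from x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residual_gains(start, end):
    """Actual end-of-year score minus the score predicted from the start."""
    slope, intercept = fit_line(start, end)
    return [e - (intercept + slope * s) for s, e in zip(start, end)]


def mean_gain_by_group(groups, gains):
    """Average residual gain for each group label (e.g., classroom)."""
    by_group = {}
    for label, gain in zip(groups, gains):
        by_group.setdefault(label, []).append(gain)
    return {label: mean(values) for label, values in by_group.items()}


# Hypothetical scores for four students in two classrooms.
start_scores = [300, 320, 340, 360]
end_scores = [310, 335, 345, 370]
classrooms = ["A", "A", "B", "B"]

gains = residual_gains(start_scores, end_scores)
print(mean_gain_by_group(classrooms, gains))
```

A positive group average means students in that classroom finished higher than the fitted line predicted. If the two tests are not on comparable scales, or the relationship is not linear, the residuals lose this interpretation, which is the compatibility concern raised above.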

Test Analysis

Standardized educational tests undergo a variety of analytical procedures to evaluate their effectiveness at measuring a construct. Item analysis is frequently conducted to determine if items are functioning the way test developers intended. This analysis of student responses to items provides the difficulty index and the discrimination index. The difficulty index is simply the ratio of the number of students who answered the item correctly to the number of students who attempted the item; a higher difficulty index indicates an easier item. The discrimination index provides information on how well an item differentiates between students who performed well on the test and students who did not. A positive discrimination index indicates that those students who performed well overall on the test were more likely to answer the item correctly, while a negative discrimination index indicates that those students who performed poorly overall were more likely to answer the item correctly. For NRTs, item discrimination is particularly important, and test developers attempt to develop items that will have a high discrimination index.
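Both indices can be computed directly from a table of scored responses. In the sketch below, the difficulty index follows the definition given above. The discrimination index uses one common approach, the difference in proportion correct between the highest- and lowest-scoring groups; that choice is an assumption here, as are the group fraction, the treatment of unattempted items as incorrect when totaling scores, and the response data.

```python
def difficulty_index(responses, item):
    """Proportion of students attempting the item who answered it correctly."""
    attempted = [row[item] for row in responses if row[item] is not None]
    return sum(attempted) / len(attempted)


def discrimination_index(responses, item, group_fraction=0.5):
    """Proportion correct in the upper group minus that in the lower group.

    Groups are formed by ranking students on total score. Unattempted
    items (None) count as incorrect here, which is an assumption.
    """
    ranked = sorted(
        responses,
        key=lambda row: sum(v or 0 for v in row),
        reverse=True,
    )
    k = max(1, round(len(ranked) * group_fraction))
    upper, lower = ranked[:k], ranked[-k:]
    p_upper = sum(row[item] or 0 for row in upper) / k
    p_lower = sum(row[item] or 0 for row in lower) / k
    return p_upper - p_lower


# Hypothetical scored responses: one row per student, one column per item.
# 1 = correct, 0 = incorrect, None = not attempted.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, None, 1],
]

print(difficulty_index(responses, 2))      # 0.4 (2 correct of 5 attempts)
print(discrimination_index(responses, 0))  # positive: favors high scorers
print(discrimination_index(responses, 3))  # negative: favors low scorers
```

The last item in this made-up data is answered correctly only by low-scoring students, so its discrimination index is negative, the pattern that would prompt test developers to review or discard an item.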

Modern test analysis also uses a process called “item response theory” (IRT) to determine the effectiveness of a test or test item. IRT evaluates items based on the parameters of item difficulty, discrimination, and guessing and provides test developers with the probability that a student with a certain ability level will answer an item correctly. In addition, IRT allows for a more sophisticated measure of a test’s reliability.
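The three parameters named above correspond to what is commonly called the three-parameter logistic (3PL) model. The sketch below assumes that model and uses the conventional symbols a (discrimination), b (difficulty), c (guessing), and theta (ability), none of which appear in the text; some formulations also include a scaling constant, omitted here. The parameter values are hypothetical.

```python
import math


def p_correct_3pl(theta, a, b, c):
    """Probability of a correct answer under the 3PL model.

    theta: student ability; a: discrimination; b: difficulty;
    c: guessing parameter (the lower bound on the probability).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item: moderate discrimination, average difficulty, and a
# guessing floor of 0.25 (as for a four-option multiple-choice item).
a, b, c = 1.2, 0.0, 0.25

for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_correct_3pl(theta, a, b, c), 3))
```

When ability equals the item's difficulty (theta equal to b), the probability falls halfway between the guessing floor and 1, which is 0.625 for this item.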

Trends in Educational Testing

Recent trends in educational testing have focused on international comparisons of student achievement. The best known of these comparisons are the Trends in International Mathematics and Science Study (TIMSS) and the Program for International Student Assessment (PISA), both of which are administered on a recurring cycle. The 2007 TIMSS included fourth-grade students from 36 countries and eighth-grade students from 48 countries. Participating countries submitted items for the test, and the test was developed by a committee of educational experts from various nations. The TIMSS also collected information on students’ backgrounds, including attitudes toward mathematics and science, academic self-concept, home life, and out-of-classroom activities. The PISA focused on problem solving in mathematics and science and on reading skills. The 2006 PISA included 15-year-olds from 57 countries. The goal of PISA is to determine students’ abilities to analyze and reason and to effectively communicate what they know. Additional international studies involving educational testing include the International Adult Literacy Survey, the Progress in International Reading Literacy Study, and the Civic Education Study.

In the United States, the National Assessment of Educational Progress (NAEP) is used to compare student achievement across states. NAEP includes students from grades 4, 8, and 12 and is designed to provide an overall picture of educational progress. Schools are randomly chosen to participate and students within those schools are also randomly chosen. The NAEP tests students in mathematics, reading, science, writing, civics, economics, and history.
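The two-stage selection described above, schools first and then students within them, can be sketched as follows. This is a simplified illustration with equal-probability draws and hypothetical school rosters; large-scale assessments generally use more elaborate sampling designs, and the sketch should not be read as the NAEP procedure.

```python
import random


def two_stage_sample(rosters, n_schools, n_students, seed=None):
    """Randomly pick schools, then randomly pick students in each.

    rosters: dict mapping school name -> list of student identifiers.
    Returns a dict of the chosen schools and their sampled students.
    """
    rng = random.Random(seed)
    chosen_schools = rng.sample(sorted(rosters), n_schools)
    return {
        school: rng.sample(rosters[school], min(n_students, len(rosters[school])))
        for school in chosen_schools
    }


# Hypothetical rosters (illustrative data only).
rosters = {
    "School A": ["a1", "a2", "a3", "a4", "a5"],
    "School B": ["b1", "b2", "b3"],
    "School C": ["c1", "c2", "c3", "c4"],
    "School D": ["d1", "d2"],
}

sample = two_stage_sample(rosters, n_schools=2, n_students=3, seed=1)
for school, students in sample.items():
    print(school, students)
```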

The public focus on educational testing in the United States sharpened with the implementation of the No Child Left Behind (NCLB) Act in 2002. For the first time in American history, schools were publicly designated as “meeting” or “failing to meet” state standards, and issues of educational testing were brought to the forefront. Organizations like Achieve began closely examining how schools were preparing students for college and the workforce and began working with state officials and business executives to improve student achievement. Educational testing is a valuable tool for these types of organizations, providing information on the effectiveness of American schools.

Controversies in Educational Testing

Not everyone believes educational testing is useful or meaningful, and there are many arguments against the use of such tests. For example, studies have suggested that the SAT is both culturally and statistically biased against African Americans, Hispanic Americans, and Asian Americans. Others have found that socioeconomic status is correlated with performance on the SAT, a relationship often attributed to the fact that students from wealthier families can afford expensive test preparation courses or multiple retakes of the test, both of which have been demonstrated to improve test scores in some cases. Still others have documented a gender gap in SAT mathematics scores that is not easily explained by factors like the difference in the number of male and female test takers.

On many tests, stereotype threat or vulnerability has also been shown to affect test scores when race, gender, or culture is cued before a test. In response, some have advocated that self-identification should occur after a test. Researchers have also shown that the structure or methodology of the test can have an effect on performance. For example, female test scores on tests of spatial ability can improve when “I don’t know” is removed as an answer, or when ratio scoring or untimed tests are used. Finally, many believe that some concepts cannot be adequately measured by standardized assessments, even when the answers are not exclusively multiple choice, and that using standardized tests as a primary method of assessment leads to “teaching to the test” rather than a broader educational experience for students.

