Testing and Evaluation
Testing and evaluation are essential components of the educational process, designed to assess both student learning and the effectiveness of teaching methods. Testing allows students to demonstrate their acquired knowledge or progress through various formats, such as written and oral examinations. Evaluation, on the other hand, examines the success of educational programs and helps identify areas for improvement in student learning outcomes. In the United States, standardized tests serve as key metrics for college admissions, as with the SAT and ACT, and for assessing state educational systems; however, their efficacy and fairness have been subjects of controversy due to potential biases related to race and gender.
Historically, educational testing has evolved significantly, from ancient China’s merit-based exams to modern standardized assessments. Various testing types exist, including diagnostic, formative, benchmark, and summative tests, each serving distinct purposes in measuring student performance throughout the learning process. Despite the structured approach, critics argue that standardized tests may not accurately reflect individual learning styles and capabilities, raising concerns about their reliability and fairness. As educational policies adapt, many institutions are reevaluating their reliance on such tests, prompting a shift towards more holistic assessment methods that consider diverse learner needs and backgrounds.
Testing and Evaluation
Testing and evaluation are two ways, used in tandem with one another, to gauge the effectiveness of student learning and the learning process. Testing, which can take several forms, is a way for students to demonstrate how much information they have learned or how much progress they have made over the course of a lesson. An evaluation shows how well the educational program is working and how to improve the students’ learning outcomes.
In the American educational system, testing often entails written or oral tests administered by a teacher that are graded on a percentage scale that corresponds with the familiar letter system A, B, C, D, and F. While this type of testing can be a reliable guide for teachers to see what their students have learned, other testing systems can inform teachers of their students’ strengths and weaknesses or hint at their future progress. Another type of testing, the standardized test, can be used for placement, such as with SAT testing for college admission, or as a way to measure the overall effectiveness of a state educational system. However, many common testing methods have proved controversial as critics have focused on their shortcomings and noted possible biases in regard to race and gender.
Testing plays a significant role in educational evaluation as test scores are one measure of the effectiveness of classroom learning. Evaluation also takes into account how well the students have met the expected learning outcomes, how much knowledge they have gained (as opposed to rote memorization), and the overall effectiveness of homework and the tests themselves.


Overview
The earliest-known evidence of education-based testing dates back to ancient China during the Zhou Dynasty (c. 1046 BCE–256 BCE). To reduce corruption and increase efficiency, Zhou leaders implemented a merit-based hiring policy for government posts. Promising candidates were recruited and given a series of standardized tests that grew increasingly difficult as the process continued. One of the tests was an essay based on the principles of Confucianism, a philosophical belief system that focuses on moral and ethical behavior. Successfully completing these tests allowed the applicant to move into a higher social class.
Western ideas of testing and evaluation evolved slowly over the centuries. For much of the medieval period, schooling was only available at monasteries and religious schools. Assessing the success of the teaching was done through oral argument or debate that was judged by senior members of the clergy. After the artistic and scientific advances of the Renaissance and Enlightenment, more people began attending universities throughout Europe, although admission was limited to those of wealthier classes. Although some written testing was added to the oral testing during this period, the ultimate assessment of a student’s ability was the final oral exam. A student’s entire classroom career was spent in preparation for this final exam, which was judged by their instructor or a school panel. This judgment was, by its nature, subjective, and the criteria for success varied from exam to exam.
In the nineteenth century, the idea of schooling began to shift from something reserved for the privileged to public education that was available to all. Horace Mann, an education reformer and an advocate for public schools, suggested that school students in Boston, Massachusetts, take a standardized written test so that all children would have the same opportunity to prove what they had learned.
During the latter half of that century, standardized written tests did begin to replace oral tests in America’s schools, albeit not in the way Mann intended. The tests were not implemented to gauge the academic performance of schoolchildren, but rather to note their learning ability. The tests were used to separate children of different intelligence levels so the academically “gifted” would not be held back by children considered “slow.” This trend was led by the growth in the psychological sciences in which researchers attempted to develop new theories on the nature of intelligence and how people learn.
Education reformers had hoped that standardized testing would eliminate bias, but the test-scoring systems differed greatly and often provided contradictory results. At the onset of the twentieth century, academic testing adopted a more uniform and scientific approach, inspired, in part, by testing administered by the US Army to evaluate soldiers. The concept of multiple-choice questions was developed to make the testing process more consistent and efficient. This led the College Entrance Examination Board to introduce a standardized college admissions test, the Scholastic Aptitude Test, or SAT, in 1926. A rival test, American College Testing, or ACT, was introduced in 1959. The modern SAT and ACT differ slightly in format, with the SAT including reading comprehension, writing and language, and math, while the ACT includes English, math, reading, science, and an optional written essay.
During the twentieth century and into the twenty-first, testing within primary and secondary school classrooms evolved in numerous ways. Depending on the grade level, testing was typically conducted throughout the school year, with more weighted testing, such as quarterly exams or final exams, given near the end of marking periods. Testing could include any type of format, from written tests to oral presentations.
In the 1960s, some of the focus in determining educational success began to shift from the performance of students to the performance of school districts and teaching methods. Working with psychologists, educators developed the concept of formative evaluation, which allowed teachers to identify the students who needed help the most. This idea was paired with summative evaluation, an end-of-year process that evaluated the success of the program by measuring student achievement. Over time, the formative evaluation process became less testing-based and more reliant on informal teacher observations.
During the thirty years prior to 2000, the test scores of American students in reading and math had either stagnated or declined in relation to the rest of the world. The scores also indicated that students of color and those living in poverty trailed behind their White and wealthier counterparts. To remedy this situation, then-President George W. Bush signed the bipartisan No Child Left Behind Act in 2002, aiming to reverse the trend of declining student performance and bridge the gap between social groups. The bill increased the involvement of the federal government in public education, which had primarily been under state and local control.
Under the law, states were required to give students in grades three through eight an annual standardized test. The test was also administered once in high school. The goal was to get all students to a proficient level in the core subjects by 2014. States were given some leeway to decide what the proficient level was and which of the standardized tests to use. If a school failed to meet its yearly progress goals, it faced escalating penalties. If a school missed its goals two years in a row, students were allowed to transfer to another school in the same district. A school that missed its goals for three straight years had to offer free tutoring. Schools that continued to miss goals could be taken over by the state or even shut down.
By the 2014 deadline, no state had achieved 100 percent compliance, and several had more than 50 percent of their schools fall short of the standards. In 2015, President Barack Obama signed the bipartisan Every Student Succeeds Act, an update of the No Child Left Behind Act. The law left in place the previous law’s testing system but shifted accountability for standards and consequences to the individual states.
Applications
Classifying methods of testing and evaluation in education can be difficult, as educators and psychologists often use different definitions and break the process into varying categories. In general, testing categories can be divided into four types: diagnostic, formative, benchmark, and summative. Diagnostic tests are typically given at the beginning of a new school year or a new lesson plan. The method is used to give the teacher an idea of what the student already knows so that the teacher can develop a successful lesson plan. Students are not usually graded on diagnostic tests. They are simply a way for teachers to identify problem learning areas and focus more attention on those areas.
Formative tests show teachers how a student is learning during a particular lesson. Formative tests are generally more informal, such as shorter tests or quizzes that carry less weight than more-structured exams. Formative tests measure the ongoing learning process, not the result. This type of testing is more effective when the tests are given often and accompanied by teacher feedback. In a similar vein, benchmark testing indicates a student’s mastery of an entire section of content, such as a chapter in a textbook. Students are graded heavily on a benchmark test because it is material they are expected to have learned. While parents will often not be notified about their children’s performance on diagnostic and formative tests, they are typically notified about performance on benchmark testing. Summative testing is end-of-the-year testing that shows how much of the overall material a student has learned. Summative testing is more structured and is intended to show if a student has learned the material at a level corresponding to school or state proficiency guidelines.
Testing techniques also vary and often include a combination of different styles. The most common testing techniques are objective and subjective testing. Subjective tests require a student to give a written response to a complex question. This type of testing is designed to assess student knowledge as well as the student’s writing and verbal skills and creativity. Subjective testing can include a question that requires a short answer or an answer in the form of an essay. It can also require a student to state their opinion on a topic from the material learned. Objective testing is more structured and requires a definitive right or wrong answer. Examples of objective testing include multiple-choice questions, true-or-false questions, or sections where the student must match answers and questions.
The two main types of educational evaluation are summative and formative. Summative evaluations analyze students’ total body of graded work to gauge overall performance, while formative evaluation involves the teacher’s observations that are communicated to the student through face-to-face interaction, parental meetings, feedback, or emails. Summative evaluations gauge the result at the end of the learning process, but formative evaluation can take place on a daily basis and be adjusted according to the needs of the student.
Other forms of evaluation include confirmative evaluation, which examines the success of an educational program after a year’s time. Norm-referenced evaluation compares student performance to a selected norm. This can be a national testing norm in a certain subject or a comparison to the performance of the entire school or school district. Criterion-referenced evaluation measures student performance in relation to a predetermined set of education standards. This type of evaluation is usually used to assess the success of a curriculum. Ipsative evaluation compares changes in a student’s performance with their past academic performance.
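The difference between norm-referenced, criterion-referenced, and ipsative evaluation can be illustrated with a short sketch. The example below is only one possible reading of those definitions: the function names, the scores, and the proficiency cutoff of 80 are hypothetical values chosen for illustration and do not come from any actual testing program.

```python
from statistics import mean


def norm_referenced(score, norm_scores):
    """Percentile rank: percentage of the norm group scoring below `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)


def criterion_referenced(score, proficiency_cutoff):
    """True if the score meets a predetermined standard (assumed cutoff)."""
    return score >= proficiency_cutoff


def ipsative(current_score, past_scores):
    """Change relative to the average of the student's own past scores."""
    return current_score - mean(past_scores)


# Hypothetical data, invented for illustration only.
norm_group = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
student_score = 78

print(norm_referenced(student_score, norm_group))   # compared with peers
print(criterion_referenced(student_score, 80))      # compared with a standard
print(ipsative(student_score, [70, 72, 74]))        # compared with own past
```

In this sketch the same score of 78 places the student at the middle of the norm group, falls short of the assumed standard, and still shows improvement over the student’s own earlier results, which is why the three approaches can lead to different conclusions about the same performance.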
Issues
As far back as the nineteenth century, educators and school reformers realized that school testing came with numerous drawbacks that limited its effectiveness. The development of standardized tests was meant to remedy this situation, but the tests led to other concerns. Critics argue that the tests are not a good way to measure student performance, as students learn in different ways and at different paces. Some students may absorb the information and have a complete grasp of the subject, but may not be able to express their knowledge well in a structured test. Others may excel at taking standardized tests, but maintain a limited knowledge of the subject. Furthermore, students’ test-taking ability can be affected by outside factors, such as stress at home, hunger, or lack of sleep.
Standardized tests such as the SAT and ACT have also been accused of being biased against people of color and women. The precursor to the SAT was developed by the US Army during World War I (1914–1918). Its creator believed in the false idea from the era that White people were innately more intelligent than other ethnic or racial groups. Critics contend that this racial and ethnic bias remains embedded in the SAT into the twenty-first century. For example, the SAT once contained an analogy question that required students to match the pair “runner:marathon” with “oarsman:regatta.” White students answered the question correctly at a higher rate because they were more familiar with the term “regatta” than were students of color. With the effectiveness of the SAT and ACT coming into question in the twenty-first century, many universities are no longer using the tests as the determining factor for admissions.
The type of tests a student takes can also produce skewed results depending on the quality of the test. Multiple-choice tests may be easier to create, but researchers have found that vague answers and the inclusion of “all/none of the above”-type answers lead students to guess more often. Further research has shown that the results from a short-answer written test and a well-designed multiple-choice test are very similar. Experts suggest that a multiple-choice test with concise answers (and no “other” or “all/none of the above”-type answers) can be an effective style of testing provided the teacher reviews the results with the student soon after. However, even this method of testing has its drawbacks. Researchers have also found that boys tend to do better on multiple-choice questions than girls, while girls do better on open-ended written questions. The reason, they theorize, is that girls are more risk-averse than boys and are less likely to guess at a multiple-choice answer.
About the Author
Richard Sheposh graduated from Penn State University in 1989 with a Bachelor of Arts degree in communications and journalism. He spent twenty-three years working in the newspaper industry as a writer and an editor before entering the educational publishing business.