Item Response Theory
Item Response Theory (IRT) is a statistical framework used to analyze test performance and improve the design of assessments. It focuses on measuring latent traits, such as abilities or personality characteristics, that are not directly observable but can be inferred from test responses. IRT posits that students who exhibit a stronger presence of a specific trait are more likely to answer corresponding test items correctly. This methodology enables more precise evaluations across a broad range of abilities, making it especially beneficial for both high and low-performing students.
One of the key applications of IRT is in computerized adaptive testing, where the difficulty of test questions is adjusted in real time based on a student’s previous responses. Additionally, IRT helps identify potential test bias and can be used to equate scores across different tests. The theory employs various models, including one-, two-, and three-parameter models, which account for different characteristics of test items, such as difficulty and discriminating power. Overall, IRT represents a significant advancement over classical test theory, offering greater flexibility and accuracy in measuring student abilities across diverse populations.
Item response theory uses mathematical functions to predict or explain students' test performance and, thereby, to design more accurate tests. Tests are designed to measure a test taker's latent traits (e.g., aptitude, achievement, or personality) that are otherwise unobservable. By assuming that one dominant factor accounts for test item performance, the theory holds that test takers who manifest a trait more strongly have a higher probability of answering test items correctly than those who do not manifest the trait as strongly. The theory can be applied to computerized adaptive testing, wherein test questions are adapted according to a test taker's previous answers, and to identifying students who may be cheating, guessing, or not trying. Compared to classical test theory, item response theory is better able to assess students whose abilities are either extremely high or extremely low.
Keywords Classical Test Theory; Computerized Adaptive Testing; Content Validity; Criterion-Referenced Test; Invariant Items; Item Response Function; Item Response Theory; Latent Trait; Norm Group; Norms; Norm-Referenced Test; Test Bias
Overview
Item response theory entails using mathematical functions to predict or explain a student's test performance using a set of factors called latent traits (or abilities). Latent traits are unobservable abilities that are to be measured by test items. The relationship between student item performance and these traits can be described by an item characteristic function, which aids in predicting test performance. The item characteristic function specifies that students with higher scores on the traits have higher expected probabilities for answering items correctly than students with lower scores on the traits. In using item response theory to create test items, the assumption is made that there is one dominant factor or ability that can account for item performance. The ability or trait measured by the test item is broadly or narrowly defined in terms of aptitude, achievement or personality (Hambleton, 1989).
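The relationship the item characteristic function describes can be made concrete with a small sketch. The one-parameter logistic form used below is only an illustration (the difficulty value is hypothetical), but it shows the key property: the probability of a correct response rises monotonically as the latent trait increases.

```python
import math

def icc(theta, b):
    """Item characteristic function in its simplest one-parameter logistic form:
    the probability of a correct response rises with the latent trait theta
    and falls as the item difficulty b increases."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A hypothetical item of middling difficulty (b = 0.0):
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"trait = {theta:+.1f}   P(correct) = {icc(theta, b=0.0):.2f}")
# Students higher on the trait have a higher expected probability of success.
```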
Parameter Estimation
Item response theory is used by many test publishers, credentialing organizations, departments of education, school districts, the armed services, and industries to construct both norm-referenced and criterion-referenced tests. Item response theory is also used to try to determine test bias, to equate different tests, and to report ability scores (Hambleton, 1989). Item response theory testing instruments provide both invariant item statistics and ability estimates. Through a parameter estimation process, test items and students are placed on an ability scale so that there is as close a relationship as possible between the expected student probabilities for success on test items obtained from the item and ability parameters and the actual performance of students at each ability level. These item parameter estimates and student ability estimates are revised continually until the best possible agreement is attained between predictions based on the ability and item parameter estimates and the actual test data.
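A minimal sketch of the ability-estimation half of this cycle, assuming a one-parameter logistic item response function and hypothetical item difficulties: given provisional item parameters, the ability estimate is the trait value that makes the observed responses most likely. In a full calibration the same search alternates with re-estimating the item parameters until predictions and observed responses agree as closely as possible.

```python
import math

def p_correct(theta, b):
    # One-parameter logistic item characteristic function (illustrative).
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_ability(responses, difficulties):
    """Grid-search maximum-likelihood estimate of a student's ability given
    provisional item difficulty estimates (both lists are hypothetical)."""
    grid = [g / 10.0 for g in range(-40, 41)]   # candidate theta values, -4 to +4
    def log_likelihood(theta):
        total = 0.0
        for u, b in zip(responses, difficulties):
            p = p_correct(theta, b)
            total += math.log(p) if u == 1 else math.log(1.0 - p)
        return total
    return max(grid, key=log_likelihood)

# Five items of increasing difficulty; the student answered the three easiest
# correctly and missed the two hardest.
print(estimate_ability([1, 1, 1, 0, 0], [-1.5, -0.5, 0.0, 1.0, 2.0]))
```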
Item Response Theory Models
Item response theory is a test theory that uses mathematical models of the probability of student responses to items to link “observable data to theory with the ability to statistically adjust scores for properties of test items such as difficulty, discriminating power, and liability to guessing” (Embretson & Reise, 2000; Lord, 1980; Van der Linden & Hambleton, 1997, as cited in Reid, Kolakowsky-Hayner, Lewis & Armstrong, 2007, p. 178). For simple items, “such as right or wrong answers, item response theory specifies that the probability of answering an item correctly should increase in a predictable manner as the level of the student's ability increases. A plot of the probability of answering an item correctly at every level of student ability results in a curve that increases until it reaches a point of inflection,” after which the rate of increase slows, so that the curve resembles a stretched-out "S" (Reid et al., 2007, p. 178).
One Parameter Model
The simplest item response theory model with one parameter assumes “that items differ from each other only in their degree of difficulty. Item characteristic curves for a test developed using this model would look similar to each other and differ only in their relative location on the ability or trait continuum” (Reid et al., 2007, p. 178).
Two Parameter Model
Item characteristic curves for a test developed using two parameters differ “in relative placement on the ability continuum and also in the degree to which probability of answering an item correctly increases steeply with an increase in level of underlying ability or trait” (Reid et al., 2007, p. 178).
Three Parameter Model
Item characteristic curves that use three parameters differ in their level of difficulty, in their ability to distinguish between students who are or are not at a certain level of the ability or trait, and in how often students with low skill levels will answer the item correctly, for example by guessing (Reid et al., 2007).
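The three models differ only in how many item properties the response function carries. The sketch below uses the three-parameter logistic form with hypothetical values: difficulty b shifts the curve along the ability scale, discrimination a controls how steeply the probability rises, and the lower asymptote c reflects how often low-ability students answer correctly anyway. Fixing c = 0 gives the two-parameter case, and additionally holding a constant across items gives the one-parameter case.

```python
import math

def p_correct_3pl(theta, a, b, c):
    """Three-parameter logistic item response function:
    b = difficulty, a = discrimination, c = lower asymptote ('guessing')."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Two hypothetical items of equal difficulty but different discrimination
# and guessing parameters:
for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    flat = p_correct_3pl(theta, a=0.7, b=0.0, c=0.0)
    steep = p_correct_3pl(theta, a=2.0, b=0.0, c=0.2)
    print(f"theta = {theta:+.1f}   low a, no guessing: {flat:.2f}   "
          f"high a, guessable: {steep:.2f}")
```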
Multiple Choice Items
More complex item response theory models are needed for multiple-choice items. These models specify a separate item characteristic curve for every response option and base ability or trait estimates on the pattern of responses across those options (Reid et al., 2007).
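The source does not name a particular model, but one widely used divide-by-total formulation of this kind is the nominal response model, sketched below with hypothetical option parameters: each response option gets its own slope and intercept, and the probability of choosing an option depends on the whole set.

```python
import math

def option_probabilities(theta, slopes, intercepts):
    """Nominal-response-style model for a multiple-choice item: option k's
    probability is exp(slope_k * theta + intercept_k), normalized over all
    options, so every option has its own characteristic curve."""
    scores = [math.exp(a * theta + c) for a, c in zip(slopes, intercepts)]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical 4-option item; option 2 is keyed correct, so it has the largest
# slope and its probability grows with the trait level.
slopes = [-0.5, -0.2, 1.2, -0.5]
intercepts = [0.3, 0.5, 0.0, -0.8]
for theta in (-2.0, 0.0, 2.0):
    probs = option_probabilities(theta, slopes, intercepts)
    print(theta, [round(p, 2) for p in probs])
```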
Foundations of Item Response Theory
Item response theory is a way to develop assessment instruments using mathematical models and statistical methods. Item response theory can be “used to analyze items and scales, create and administer psychological measures, and to measure individuals on psychological constructs like depression” (Reise, Ainsworth & Haviland, 2005, p. 95). There are three foundations of item response theory:
• Item response functions,
• Item information functions, and
• Invariance (Reise, Ainsworth & Haviland, 2005).
Item Response Functions
An item response function is a basic unit of item response theory. It is the mathematical function that describes the relation between where a student falls on the continuum of a given construct and the probability that the student will give a particular response to a scale item designed to measure that construct (Reise et al., 2005, p. 95). Item response functions are used to evaluate item quality. In item response theory terms, the construct is referred to as a latent trait because it cannot be observed directly and is instead inferred from responses to items on a scale that measures it. Item response theory strives to determine an item response function for each item of a measure (Reise et al., 2005).
Item Information Functions
Item information functions are used “to judge the quality of an item by turning the item response function into an information function that can show how much information (a number that represents an item's ability to differentiate among students) the item provides. Different items can provide different amounts of information in different ranges of a latent trait, with the easier items considered best for differentiating among students low on the trait and more difficult items best for differentiating among students high on the trait” (Reise et al., 2005, p. 25).
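For the two-parameter logistic model the information an item carries at a given trait level has a simple closed form, I(theta) = a^2 * P(theta) * (1 - P(theta)); the sketch below uses hypothetical parameters to show an easy item providing most of its information at low trait levels and a hard item at high trait levels.

```python
import math

def p_2pl(theta, a, b):
    # Two-parameter logistic item response function.
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Item information for a 2PL item: I(theta) = a^2 * P * (1 - P).
    Information peaks where theta equals the item difficulty b."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical easy item (b = -1.5) and hard item (b = +1.5), equal discrimination:
for theta in (-2.0, 0.0, 2.0):
    print(f"theta = {theta:+.1f}   easy item: {item_information(theta, 1.5, -1.5):.2f}   "
          f"hard item: {item_information(theta, 1.5, 1.5):.2f}")
```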
Invariance
For item response theory, invariance means that students' positions on a latent trait continuum can be estimated from their responses to any set of items with known item response functions, even if the items come from different measures. Invariance also means that the item properties represented by the item response function “do not depend on the characteristics of a particular population, and the scale of the trait does not depend on any particular item set because it exists independently of them” (Reise et al., 2005, p. 96).
Applications
Computerized Adaptive Testing
Item response theory can be used in computerized adaptive testing. Computerized adaptive testing allows a testing program to select the test items that provide the most information about an individual student, using fewer questions, by continually adjusting the test to each student's ability level. With adaptive testing, the response to one test item determines which item is administered next. Computerized adaptive testing therefore requires fewer test questions and strives to eliminate items that are too easy or too difficult for each student. The number and difficulty of test items can vary for each student, and items are selected from a large pool of questions. Adaptive testing is designed to be as precise as possible in determining the ability of each student who takes the test. Since item response theory parameters are sample invariant, the item parameters estimated from one sample group can be applied to all students (Zickar, 1998).
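A toy sketch of this select-administer-re-estimate loop, assuming a two-parameter logistic model, a hypothetical item pool, and simulated responses: at each step the unused item that is most informative at the current ability estimate is chosen, and the estimate is then updated from all responses so far.

```python
import math, random

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def adaptive_test(pool, true_theta, n_items=5):
    """Toy adaptive test: repeatedly administer the unused item that is most
    informative at the current ability estimate, then re-estimate ability
    (posterior mode under a standard-normal prior) from all responses so far."""
    grid = [g / 20.0 for g in range(-80, 81)]        # candidate theta values
    history, theta_hat = [], 0.0
    for _ in range(n_items):
        used = [item for item, _ in history]
        item = max((i for i in pool if i not in used),
                   key=lambda it: information(theta_hat, *it))
        correct = random.random() < p_2pl(true_theta, *item)   # simulated answer
        history.append((item, correct))

        def log_posterior(theta):
            value = -0.5 * theta * theta             # standard-normal prior
            for (a, b), u in history:
                p = p_2pl(theta, a, b)
                value += math.log(p) if u else math.log(1.0 - p)
            return value

        theta_hat = max(grid, key=log_posterior)
    return theta_hat

# Hypothetical pool of (discrimination, difficulty) pairs; true ability = 1.0.
pool = [(1.2, -2.0), (1.0, -1.0), (1.5, 0.0), (1.3, 1.0), (0.9, 2.0), (1.1, 0.5)]
print(adaptive_test(pool, true_theta=1.0))
```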
Appropriateness Measurement
Item response theory is also well suited to appropriateness measurement, which attempts to identify students who do not fit the model for responding to items (Levine & Rubin, 1979, as cited in Zickar, 1998). In theory, appropriateness measurement can be used to identify students who may be cheating, unmotivated, or answering the questions in a way that other students are not. Appropriateness measurement starts from a group of students who are considered to be 'regular,' and other response patterns are then checked against that normative pattern. Students whose patterns do not conform are examined further (Zickar, 1998).
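One common way to implement this idea (not tied to the sources cited above) is a standardized log-likelihood person-fit index, often written l_z: a student's observed response-pattern likelihood is compared with its expectation under the model, so large negative values flag patterns such as missing easy items while answering hard ones. The items, responses, and ability value below are hypothetical.

```python
import math

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def lz_person_fit(responses, theta, items):
    """Standardized log-likelihood person-fit index: (observed - expected)
    log-likelihood divided by its standard deviation under the model."""
    observed = expected = variance = 0.0
    for u, (a, b) in zip(responses, items):
        p = p_2pl(theta, a, b)
        q = 1.0 - p
        observed += math.log(p) if u else math.log(q)
        expected += p * math.log(p) + q * math.log(q)
        variance += p * q * math.log(p / q) ** 2
    return (observed - expected) / math.sqrt(variance)

# Hypothetical items ordered from easy to hard.  A consistent pattern (gets the
# easy items, misses the hard ones) fits; the reversed pattern is flagged.
items = [(1.2, -2.0), (1.0, -1.0), (1.1, 0.0), (1.3, 1.0), (1.2, 2.0)]
print(lz_person_fit([1, 1, 1, 0, 0], theta=0.0, items=items))   # near or above 0
print(lz_person_fit([0, 0, 1, 1, 1], theta=0.0, items=items))   # strongly negative
```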
Item response theory has many other applications. A few are listed here:
• Item banking. By using item response theory, a large number of test items can be calibrated to the same scale, different subsets can be selected from the bank, and all items may be reused in different combinations to obtain measurements on the same scale.
• Tailored testing. Because item response theory can be used to select a subset of items that are the most discriminating and are located at the ability level of a particular student, a smaller number of test items can be used to test a broad range of abilities.
• Test equating. Item response theory can be used to determine relationships between tests when the same test form cannot be administered again. Instruments can be assessed to determine commonalities between two instruments, as shown in the sketch after this list.
• Pattern analysis. Unusual response patterns can identify specific students who do not fit the model, which can help determine possible cheating or whether some students were not trying (Loyd, 1988).
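A minimal sketch of one common-item linking approach, the mean/sigma method, with hypothetical difficulty estimates: items that appear on both forms are calibrated separately, and the means and standard deviations of their difficulty estimates give the linear transformation that places scores from one form onto the scale of the other.

```python
import math

def mean_sigma_link(b_old, b_new):
    """Mean/sigma linking from common (anchor) items: returns slope A and
    intercept B such that theta_new = A * theta_old + B (and likewise
    b_new = A * b_old + B; discriminations are divided by A)."""
    def mean(xs):
        return sum(xs) / len(xs)
    def sd(xs):
        m = mean(xs)
        return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    A = sd(b_new) / sd(b_old)
    B = mean(b_new) - A * mean(b_old)
    return A, B

# Hypothetical difficulty estimates for the same anchor items calibrated
# separately on two different test forms:
b_old = [-1.2, -0.4, 0.3, 1.1]
b_new = [-0.9, -0.1, 0.6, 1.5]
A, B = mean_sigma_link(b_old, b_new)
print(f"A = {A:.3f}, B = {B:.3f}")
```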
Viewpoints
Advantages of Item Response Theory
The invariance property of item response theory measurements allows the same instrument to be used for students in different grade levels in the same school, and for the same grade level in different schools, because it can link the scales from different measures. This means that different item response theory instruments can be used with a variety of students in a variety of venues, and the scale scores from all the different instruments can be placed on a single, common scale (Reise et al., 2005).
Because the mathematical functions describe the relationship between test performance and skill level directly, a broad spectrum of ability levels can be estimated without knowing the range of ability within the normative sample (Hambleton & Swaminathan, 1985, as cited in Reid et al., 2007). Since item response theory deals with the “relationship between each item and the latent trait or ability being assessed by the instrument, each item can provide useful information about a student” (Hambleton & Swaminathan, 1985, as cited in Reid et al., 2007, p. 177).
Another positive aspect of item response theory is that different subsets of test items can be administered to different students, or to the same students on several occasions. The analysis of the results does not depend on administering one standard arrangement of test items. Estimates of a student's skill level are computed from equations that locate the student along the latent continuum (Reid et al., 2007).
There are other advantageous uses for item response theory assessments. They can help with
• Determining possible test bias in different assessment tools,
• Equating the results of a new instrument with those of older instruments,
• Ascertaining how precisely a testing instrument measures across the continuum of skills, and
• Developing instruments to accurately measure level of ability.
Using item response theory techniques, researchers are able to determine how well certain items differentiate between students who do or do not have a specified target level of the ability or trait being assessed. Item response theory also allows researchers to determine the extent to which approximation can adversely affect the measurement of an item (Reid et al., 2007).
Comparison to Classical Test Theory
Item response theory was developed to address several challenges with assessments based on classical test theory. With classical test development, the norms used for interpreting exam scores are sample-specific. “The average level of ability and the range of abilities among students whose performance was used to norm the test influence all subsequent test results” (Reid et al., 2007, p. 179). Item response theory techniques create a mathematical function that extends beyond the ability spectrum of the norm group, which can be essential for populations outside the norm group's range. Tests developed using classical test theory also require that the entire instrument be administered in order to obtain usable information, because a right or wrong answer to any one item is essentially valueless without information about how the other test items were answered. Item response theory methods allow for shorter tests because the relationship between the skill or trait and item performance is modeled for each item independently, so a valid conclusion and score do not rely on the answers to other test items.
Another challenge with classical test theory instruments is the difficulty of dealing with extremely high and extremely low levels of ability, even when they fall within the normative sample. With classically developed tests, “the standard error of measurement is assumed to be constant across all levels of ability or trait, but in reality it is higher at the extreme levels of each. In most instances, a test most accurately measures ability or trait when the average level of item difficulty is equal to the tested student's own ability level; the point where the student has a fifty percent probability of correctly answering each test item. By using item response theory, a test can be tailored to a particular ability level in order to obtain the most accurate estimate of ability within the shortest amount of time” (Hambleton & Swaminathan, 1985, as cited in Reid et al., 2007, p. 179).
Prediction
Classical testing methods and procedures also do not provide a basis for determining what a student might do when given a test item. This information is necessary if a testing instrument is supposed to predict test score characteristics in one or more populations of examinees or if it is supposed to have particular characteristics for certain populations of students. Item response theory can overcome these shortcomings by having an ability scale on which student abilities are independent of the choice of test items from the pool of test items over which the ability scale is defined (Hambleton, 1989).
Effects of Invariance
The invariance of item response theory instruments means that students' positions on a latent trait continuum can be estimated from their answers to any items with known item response functions, even if the items come from different measures. In classical test theory instruments, the item responses are combined to estimate a true score that pertains only to that particular measure. In addition, with an item response theory instrument, “item properties do not depend on the characteristics of a particular population” (Reise et al., 2005). In a classical test theory instrument, by contrast, the raw score scale is defined by the particular set of items on a particular measure.
Test Bias
Item response theory instruments also differ from classical test theory instruments in how they deal with potential test bias. Test bias can arise from differences in age, gender, culture, ethnicity, and socioeconomic standing. One of the biggest challenges with classical test theory instruments is obtaining a norm group that is representative of the testing population. Scale items may function differently for different groups, which makes it practically impossible to administer the same assessment to different groups of students and expect to draw meaning from the results or make accurate comparisons. While item response theory cannot claim to completely solve this issue, it does a better job of mitigating possible test bias because it is designed to examine how items function across groups and to place students from qualitatively diverse groups on a common scale (Reise et al., 2005).
Item response theory has shown considerable promise as a better way to develop tests than classical test theory models. However, as with any testing approach, more research and fine-tuning is still needed; for example, the randomness of computerized adaptive testing may still require adjustment. Even so, item response theory testing instruments have proven to have more flexibility, validity, and reliability than other mass-administered testing instruments.
Terms & Concepts
Classical Test Theory: Classical test theory refers to a body of related theories that predict testing outcomes such as item difficulty or student ability.
Content Validity: Content validity refers to the extent to which a measure represents all facets of a given construct.
Criterion-Referenced Test: Criterion-referenced tests are assessments given to students to determine if specific skills have been mastered.
Invariant Items: Invariant items in testing means that the item statistics are not dependent on the sample of students chosen for the norm group.
Item Response Function: An item response function is the mathematical function that describes the relationship between where a student lies on the continuum of a certain construct and the probability of a particular response to an item designed to measure that construct.
Latent Trait: “A latent trait is the underlying ability or trait presumed to be measured by the test items. Models of a latent trait cannot be tested directly because the latent trait cannot be directly observed, but they can be tested indirectly” (Reid, 2007, p. 179).
Norm Group: A norm group is a representative sample of students who have previously taken the assessment and by which each student subsequently taking the assessment is compared. Also known as a normative sample.
Norms: Norms are normative or average scores for a particular age group.
Norm-Referenced Test: Norm-referenced tests are assessments administered to students to determine how well they perform in comparison to other students taking the same assessment.
Test Bias: Test bias occurs when provable and systematic differences in the results of students taking the test are discernable based on group membership, such as gender, socioeconomic standing, race, or ethnic group.
Bibliography
Haberman, S., Sinharay, S., & Chon, K. (2013). Assessing item fit for unidimensional item response theory models using residuals from estimated item response functions. Psychometrika, 78, 417-440. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=88226837&site=ehost-live
Hambleton, R. (1989). Item response theory: Introduction and bibliography. Amherst, MA: University of Massachusetts Amherst Laboratory of Psychometric and Evaluative Research. (ERIC Document Reproduction Service No. ED310137). Retrieved July 30, 2007, from Education Resources Information Center http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/1f/4f/29.pdf
Loyd, B. (1988). Implications of item response theory for the measurement practitioner. Applied Measurement in Education, 1, 135. Retrieved July 30, 2007, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=7364145&site=ehost-live
Reid, C., Kolakowsky-Hayner, S., Lewis, A. & Armstrong, A. (2007). Modern psychometric methodology: Applications of item response theory. Rehabilitation Counseling Bulletin, 50, 177-188. Retrieved July 30, 2007, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=24418861&site=ehost-live
Reise, S., Ainsworth, A. & Haviland, M. (2005). Item response theory. Current Directions in Psychological Science, 14, 95-101. Retrieved July 30, 2007, from EBSCO Online Database Academic Search Premier. http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=17380617&site=ehost-live
Talbot III, R. M. (2013). Taking an item-level approach to measuring change with the force and motion conceptual evaluation: An application of item response theory. School Science & Mathematics, 113, 356-365. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=91808516&site=ehost-live
Ying, L., Hong, J., & Lissitz, R. W. (2012). Applying multidimensional item response theory models in validating test dimensionality: An example of k-12 large-scale science assessment. Journal of Applied Testing Technology, 13, 1-27. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=89389722&site=ehost-live
Zickar, M. (1998). Modeling item-level data with item response theory. Current Directions in Psychological Science, 7, 104-109.
Suggested Reading
Baker, F. (1985). The Basics of Item Response Theory. Portsmouth, NH: Heinemann.
Hambleton, R. & Swaminathan, H. (1984). Item Response Theory: Principles and Applications. New York, NY: Springer-Verlag.
Rogers, H., Swaminathan, H. & Hambleton, R. (1991). Fundamentals of Item Response Theory. Thousand Oaks, CA: Sage Publications.