Psychological testing and mathematics

SUMMARY: Though they often require a subjective element, psychological tests make every effort to generate useful quantitative data.

Testing is used for many purposes within psychology, among them evaluating intelligence, diagnosing psychiatric illness, and identifying aptitudes and interests. Although the results of testing are rarely the sole criterion for a diagnosis or other decision about an individual, they are often used in conjunction with information gained from other sources, such as interviews and observations of behavior. There are many types of psychological tests, but most share the goal of expressing an essentially unobservable quality, such as intelligence or anxiety, in terms of numbers. The numbers themselves are not meant to be taken literally; no one seriously believes that a person's intelligence is equivalent to their IQ score, for instance. Instead, the numbers are useful tools that help evaluate a person's situation: for example, how does the intellectual development of one particular child compare with that of other children of the same age? Of course, the results of psychological testing should be evaluated with the social context of the individual in mind and with full respect for human diversity.


Psychometrics

Psychometrics is a field of study that applies mathematical and statistical principles to devise new psychological tests and evaluate the properties of current tests. Psychologist Anne Anastasi was often known as the “test guru” for her pioneering work in psychometrics. In her 1954 book Psychological Testing, she discussed the ways in which trait development is influenced by education and heredity as well as how differences in training, culture, and language affect measurement. The two most common approaches to psychometrics in the twenty-first century are classical test theory and item response theory (IRT).

Classical test theory is the older approach, and the required calculations can be performed with pencil and paper, although twenty-first-century computer software is often used. Classical test theory assumes that all measurements are imperfect and thus contain error: the goal is to evaluate the amount of error in a measurement and develop ways to minimize it. Any observed measurement (for instance, a child's score on an intelligence test) is made up of two components: true score and error. This may be written as an equation: X = T + E, where X is the observed score, T is the true score (the score representing the child's true intelligence), and E is the error component (resulting from imperfect testing). Classical test theory assumes that error is random and thus will sometimes be positive (resulting in an observed score higher than the true score) and sometimes negative (resulting in an observed score lower than the true score), so that over an infinite number of testing occasions, the mean of the observed scores will equal the true score. Although a test is normally administered only once to a given individual, this model is useful because it facilitates the evaluation of the reliability and validity of different tests.
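The classical model can be illustrated with a short simulation. In this hypothetical sketch (the numbers and function names are illustrative, not from any established package), a child's true score is fixed at 100, each simulated administration adds random normally distributed error, and the mean observed score over many administrations converges toward the true score:

```python
import random

def simulate_observed_scores(true_score, error_sd, n_administrations, seed=42):
    """Classical test theory: each observed score is X = T + E,
    where E is random error drawn from a normal distribution."""
    rng = random.Random(seed)
    return [true_score + rng.gauss(0, error_sd) for _ in range(n_administrations)]

# Hypothetical child with a true score of 100 and an error standard deviation of 5.
scores = simulate_observed_scores(true_score=100, error_sd=5, n_administrations=10000)
mean_observed = sum(scores) / len(scores)
# Over many administrations the positive and negative errors tend to cancel,
# so the mean observed score approaches the true score of 100.
```

In practice only one administration is available, which is why the theory focuses on estimating how large the error component is rather than averaging it away.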

Item response theory (IRT) is a different approach to psychological testing. It assumes that observed performance on any given test item can be explained by a latent (unobservable) trait or ability, so that individuals may be evaluated in terms of how much of that trait they possess, and items may be evaluated in terms of how much of the trait is required to answer them positively. For an item on an intelligence test (intelligence being the latent trait), persons with higher intelligence should be more likely to answer the question correctly. The same principle applies to IRT-based tests evaluating other psychological characteristics; for instance, if an item in a psychological screening test is meant to diagnose depression, a person with more depressive symptoms should be more likely to answer it positively. IRT is a mathematically complex method of analysis that depends on specialized computer software, and it has become a popular means of evaluating psychological tests as computers have become more affordable.
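The core idea can be sketched with the two-parameter logistic (2PL) model, one common IRT formulation. The parameter names below are illustrative; real IRT software estimates these parameters from data rather than taking them as given:

```python
import math

def item_response_probability(theta, difficulty, discrimination=1.0):
    """2PL IRT model: probability that a person with latent trait level
    theta answers an item positively, given the item's difficulty and
    discrimination parameters."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# For an item of average difficulty, a person with a higher trait level
# has a higher probability of a positive answer:
low = item_response_probability(theta=-1.0, difficulty=0.0)   # about 0.27
high = item_response_probability(theta=1.0, difficulty=0.0)   # about 0.73
```

The difficulty parameter shifts the curve along the trait scale (harder items require more of the trait for the same probability), while the discrimination parameter controls how sharply the probability rises around the item's difficulty level.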

Although the mathematical models of IRT differ from those of classical test theory, the goals are the same: to devise tests that measure characteristics of individuals with a minimum of error. Several important distinctions exist between classical test theory and IRT, and these affect their appropriate use in a given experiment or research study. IRT can examine an individual's responses to individual items within a larger test, while classical test theory uses the test as a whole to provide a measurement of an individual's latent trait. Because IRT is more complex than classical test theory, it requires a higher level of statistical skill for proper analysis. It also requires a larger sample size to measure a construct appropriately. However, the results of IRT are more detailed and may be more useful in some situations, such as measuring grade levels over time.

Reliability and Validity

The term “reliability” refers to the consistency of a test score: if a test is reliable, it will yield consistent results over time, across groups of people, and without regard to irrelevant conditions, such as the person administering the test. Internal consistency is considered an aspect of reliability: it means that all the items in a test measure the same thing. Temporal reliability is also called “test-retest reliability” because it is typically evaluated by having groups of individuals take the same test on several occasions and seeing how their scores compare. Some differences are expected because of the random nature of the error component, but there should be a strong relationship between the observed scores of individuals on multiple occasions.

The term “interrater reliability” refers to the consistency of a test or scale regardless of who administers it. Psychiatric conditions, for instance, are often evaluated by having an observer rate an individual’s behavior using a scale, and the results for different observers evaluating the same individual at the same time should be similar: three psychologists using a scale to evaluate the same child for hyperactivity should reach similar conclusions. Both types of reliability are typically evaluated by correlating test results on different occasions (temporal) or the scores returned by different raters (interrater).
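Both forms of reliability thus reduce to computing a correlation coefficient. As a minimal sketch with made-up scores, the Pearson correlation between two administrations of the same test can be computed as follows (the same function would serve for two raters' scores):

```python
def pearson_r(x, y):
    """Pearson correlation coefficient, the usual index of temporal
    (test-retest) or interrater reliability."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

# Hypothetical scores for five people tested on two occasions:
time1 = [98, 104, 87, 110, 95]
time2 = [100, 102, 90, 108, 96]
r = pearson_r(time1, time2)  # close to 1, indicating high test-retest reliability
```

A coefficient near 1 indicates that individuals keep roughly the same standing relative to one another across occasions or raters, even if their raw scores shift slightly because of random error.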

Internal consistency can be measured in several ways. The split-half method involves having a group of individuals take a test, splitting the items into two groups (for instance, odd-numbered items in one group and even-numbered items in the other), and calculating the correlation between the total scores of the two groups. Cronbach’s alpha (or coefficient alpha) is a refinement of the split-half method: it is the mean of all possible split-half coefficients. The measure was developed and named “alpha” by Lee Cronbach, an educational psychologist and measurement theorist who began his career as a high school mathematics and chemistry teacher.
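As an illustration, Cronbach's alpha can be computed directly from its standard formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), for k items. The data below are hypothetical:

```python
def cronbachs_alpha(item_scores):
    """Cronbach's alpha for k items; item_scores is a list of k lists,
    each holding one score per person for that item."""
    k = len(item_scores)
    n = len(item_scores[0])

    def variance(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    sum_item_vars = sum(variance(item) for item in item_scores)
    # Each person's total score across all items:
    totals = [sum(item[p] for item in item_scores) for p in range(n)]
    return (k / (k - 1)) * (1 - sum_item_vars / variance(totals))

# Hypothetical responses of five people to three items on the same scale:
items = [
    [2, 4, 3, 5, 1],
    [3, 5, 3, 4, 2],
    [2, 5, 4, 5, 1],
]
alpha = cronbachs_alpha(items)  # roughly 0.95, indicating high internal consistency
```

When the items all measure the same underlying quality, people who score high on one item tend to score high on the others, the variance of the total scores grows relative to the individual item variances, and alpha approaches 1.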

The term “validity” refers to whether a test measures what it claims to be measuring. Three types of validity are typically discussed: content, predictive, and construct. Content validity refers to whether the test includes a reasonable sample of the subject or quality it is intended to measure (for instance, mathematical aptitude or quality of life) and is usually established by having a panel of experts evaluate the test in relation to its purpose. Predictive validity means that test scores correlate highly with measures of similar outcomes in the future; for instance, a test of mechanical aptitude should correlate with a new hire’s later success working as an auto mechanic. Construct validity refers to a pattern of correlations predicted by the theory behind the quantity being measured: the scores on a test should correlate highly with scores on other tests that measure similar qualities and less highly with those that measure different qualities.

Other forms of validity relevant to psychological testing include concurrent, external, discriminant, internal, and face validity. A test shows concurrent validity if it produces results consistent with a similar, well-established measure administered at the same time. External validity refers to the ability of a test to accurately measure its purported constructs in a population other than the one on which it was developed. For example, if a test accurately measures depression in women who live in one city, it cannot be said to have external validity if it fails to accurately measure depression in women in another city. Discriminant validity measures a test’s ability to distinguish between distinct groups. Face validity, the simplest form of validity, refers to whether the test appears to suit its proposed use.

Bibliography

Embretson, Susan E., and Steven P. Reise. Item Response Theory for Psychologists. Erlbaum, 2000.

Furr, R. Michael, and Verne R. Bacharach. Psychometrics: An Introduction. 4th ed., Sage, 2022.

Gopaul-McNicol, Sharon-Ann, et al. Assessment and Culture: Psychological Tests with Minority Populations. Elsevier, 2001.

Jabrayilov, Ruslan, et al. “Comparison of Classical Test Theory and Item Response Theory in Individual Change Assessment.” Applied Psychological Measurement, vol. 40, no. 8, 2016, pp. 559-72. doi.org/10.1177/0146621616664046. Accessed 17 Nov. 2024.

Kline, Paul. The Handbook of Psychological Testing. 2nd ed., Routledge, 2000.

Rust, John, et al. Modern Psychometrics: The Science of Psychological Assessment. 4th ed., Routledge, 2021.

Wood, James M., et al. “Psychometrics: Better Measurement Makes Better Clinicians.” The Great Ideas of Clinical Science: 17 Principles That Every Mental Health Professional Should Understand, edited by Scott O. Lilienfeld and William T. O’Donohue, Routledge, 2007.