Validity

Because of the nature of behavioral research, sociologists frequently use surveys and various other types of written data collection instruments (or their electronic equivalents) to obtain information from the people in their studies. Good data collection instruments have two characteristics in common: they are both reliable, consistently measuring whatever variable they measure, and valid, actually measuring what they purport to measure. There are several types of validity and concomitant approaches to determining the validity of an assessment instrument. Feedback from validation studies can be used to improve the quality of an assessment instrument and the data that behavioral scientists use to test their theories and describe the real world.

Keywords Correlation; Criterion; Data; Empirical; Operational Definition; Reliability; Sample; Survey; Survey Research; Validity

Overview

Data about human beings can be obtained from any number of sources, including observation of individuals or groups in either laboratory or real-world settings, historical data that have been collected for other purposes, and data collected by asking individuals directly about their opinions, attitudes, feelings, or past reactions. Data collected directly from individuals are typically gathered using various paper-and-pencil measurement instruments (or their electronic equivalents). Some of the more frequently used instruments of this type include surveys and questionnaires, personality tests, and tests of mental ability. As anyone who has ever participated in a survey or taken a test knows, however, some data collection instruments are better than others. To be useful for scientific research, data collection instruments need to have two characteristics: They must be both reliable and valid.

Validity is the degree to which a survey or other data collection instrument measures what it purports or was designed to measure. For example, a survey that attempts to gather information about participants' attitudes toward candidates in a political election is valid if it indeed captures information about their attitudes toward the candidates rather than something else (e.g., their attitudes toward the person administering the survey). Reliability is the degree to which an assessment instrument consistently measures what it is intended to measure. No matter how well written a data collection instrument appears to be on its face, it cannot be valid unless it is reliable: an unreliable measure does not consistently measure the same thing, which means that at least some of the time it is not measuring the construct it was designed to measure. Both validity and reliability are essential in survey research if the data collected are to give researchers the information they are actually trying to gather.

Because of the nature of behavioral research, sociologists frequently use surveys and various other types of written data collection instruments to obtain information from the people in their studies. As opposed to research in the physical sciences, where one knows without question the difference in weight between one gram and two grams of a chemical compound and can judge the reaction this change makes, measuring people's attitudes, opinions, and other subjective factors is less straightforward. For example, if a researcher wanted to determine how angry a certain situation made people, he or she could develop a continuum of actions that would empirically indicate people's anger level. On a scale of one to ten, a score of ten might be operationally defined as throwing a temper tantrum while a score of one might be operationally defined as no observable difference in behavior. The problem with this approach, of course, is that not everyone shows the same behavioral response to a situation even when feeling the same emotion. One person may express extreme displeasure with a disapproving gaze while another may express it by slashing someone's tires. If asked how angry they were on a scale of one to ten, however, both persons might reply that they were a 9.5. Researchers would need a better measure to determine how angry someone was.

Types of Validity

There are several different types of validity. Some of these are more appropriate to a discussion of the development of academic tests and organizational assessment instruments, where real-world criteria of success exist. For example, if a teacher wanted to develop a midterm for one of their classes, they would have available relatively objective criteria of success, such as whether the students could recall various facts in the textbook. Similarly, if a hiring manager wanted to develop a test that predicts how well an applicant will do on the job based on their experience and aptitudes, they would have various criteria for successful job performance among current employees. From a behavioral research perspective, however, validating a data collection instrument can be more complicated. For example, as mentioned above, there is no absolute real-world criterion for anger. All a researcher can do is ask people how angry they feel and take their word for it. This is true for most measures of attitudes, opinions, and the other subjective types of data that are of interest in many behavioral research studies.

Even when there are no objective criteria against which one can test the validity of an assessment instrument, it is still important to develop as valid an instrument as possible. There are several types of validity of interest for such instruments. Content validity is a measure of how well the instrument items reflect the concepts that the instrument developer is trying to assess. Content validation is often performed by experts in an appropriate field of study. For example, a psychologist could review an assessment instrument measuring "anger" and determine whether or not the questions reflect state-of-the-art knowledge about anger and its indicators, or an expert in early childhood education could determine the validity of an assessment instrument designed to test a child's reading level. Criterion-related validity is a measure of how well an assessment instrument measures what it is intended to measure as defined by another assessment instrument. Criterion-related validity, for example, could be ascertained by correlating the scores of the assessment instrument being validated with those of another instrument that has been proven successful in assessing "anger." Construct validity is a measure of how well an assessment instrument measures an underlying theoretical concept ("construct") that the researcher has developed.
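To make the correlational logic of criterion-related validation concrete, the sketch below correlates scores on a new instrument with scores on an established criterion instrument. This is a minimal sketch only: the function name, both instruments, and all scores are hypothetical, and it is not the procedure of any particular study.

```python
# Minimal sketch of criterion-related validation (hypothetical data).
# Each respondent completes both the new instrument and an established
# criterion instrument; the validity coefficient is the correlation
# between the two sets of scores.

from math import sqrt

def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson product-moment correlation between two paired score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical paired scores (one pair per respondent).
new_instrument = [3, 7, 5, 9, 2, 6, 8, 4]
criterion_instrument = [4, 6, 5, 9, 3, 7, 7, 4]

r = pearson_r(new_instrument, criterion_instrument)
print(f"Criterion-related validity coefficient: r = {r:.2f}")
```

A high positive coefficient would suggest that the new instrument tracks the established one; a low coefficient would signal a validity problem.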

Other types of validity that may be of interest in sociological research include cross validity, predictive validity, and face validity. In cross validation, the validity of an assessment instrument is tested with a new sample to determine whether the instrument is valid across situations. For example, the anger assessment instrument might be validated with high school students and then cross validated with working adults to see if it is valid in both situations or has only limited applicability. Predictive validity refers to how well an assessment instrument predicts future events. For example, a sociologist might develop a psychometric instrument to assess the presence of known risk factors for juvenile delinquency. The instrument could be administered to adolescents and the scores correlated with their subsequent incidence of juvenile delinquency. If there were a high correlation between scores on the instrument and juvenile delinquency, and the instrument also had high reliability, it could be used to identify adolescents at risk of becoming delinquent so that schools or social service agencies could intervene to counteract the risk factors. Rather than being a true measure of validity, face validity is merely the concept that an assessment instrument appears to measure what it is trying to measure. For example, an assessment instrument designed to collect data about anger would ask about the respondents' mental and behavioral reactions in various anger-provoking situations. This is not to say that other questions would not also be able to assess a person's anger level. However, questions with face validity are obviously related to the topic being investigated.
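A predictive validation could be computed in much the same way. In the sketch below, hypothetical time-1 risk scores are correlated with a hypothetical binary follow-up outcome (1 = delinquency recorded, 0 = not); with a dichotomous outcome, the Pearson formula yields what is known as the point-biserial correlation. All names and numbers here are illustrative assumptions, not real study data.

```python
# Minimal sketch of predictive validation (hypothetical data).
# Correlate instrument scores from time 1 with a later binary outcome.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

risk_scores = [12, 30, 22, 8, 27, 15, 33, 10]   # time-1 risk instrument scores
delinquency = [0, 1, 1, 0, 1, 0, 1, 0]          # follow-up outcome (1 = delinquent)

print(f"Predictive validity: r = {pearson_r(risk_scores, delinquency):.2f}")
```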

Applications

The Elementary School Success Profile

Feedback from validation studies can be used to improve the quality of an assessment instrument and the data that behavioral scientists use to test their theories and describe the real world. One example is a validation study performed on the Elementary School Success Profile (ESSP), an online assessment tool designed to help schools identify social environmental influences related to the school success of children in the third through fifth grades. The ESSP goes beyond many other school assessment instruments by helping schools identify appropriate social environmental intervention targets at both the individual and group levels. Specifically, the ESSP examines twelve dimensions of children's social environment: family who cares, friends who care, neighbors who care, teachers who care, school is a fun place to learn, school is a fun place to be with other children, acceptance by peers, friends have good behavior, good physical health, good adjustment, positive feelings about self, and knowledge of where to get support. To help ensure that the ESSP was useful for this purpose, Natasha K. Bowen, one of the original developers of the instrument, performed a validation study of it (2008).

Three separate components of the ESSP collect data from students, parents or guardians, and teachers. The ESSP includes several features designed to increase the reliability of the instrument, including the use of graphics and animations to hold the attention of children responding to the instrument, online screens that explain how to complete the questionnaire, and an audio option for individuals who have difficulty reading the questions. Despite these features, however, further safeguards are necessary to maximize the validity of the instrument. Although writing questions at an appropriate reading level is always an important consideration when developing items for an assessment tool, it is particularly important for instruments that attempt to collect data from children. Not only do children lack the depth and range of formal vocabulary that adults typically have, but they also tend to see things from a different perspective. Therefore, validation was particularly important so that the ESSP developers could be confident that the instrument was actually getting the data needed to be of use to educational professionals in identifying at-risk children.

Bowen performed a study to investigate the validity of the ESSP through the use of cognitive testing. In this approach to validation, questioning techniques are used to help researchers understand the thought processes being used by respondents to an assessment instrument so that sources of confusion and misunderstanding can be identified and eliminated to improve the validity of the instrument. As part of the process of developing and validating the ESSP, Bowen performed three rounds of cognitive interviews with fifty-eight children. The questions used in the cognitive testing were based on a review of the literature on cognitive methods, various relevant factors in childhood development, the purpose of the instrument, and constraints imposed by the schools in which the testing took place.

The interview process with the children used concurrent probes in which they were asked questions as they answered the questions on the ESSP. The children were asked to read each item out loud. If the child had difficulty reading a word, the interviewer would ask the child if he or she understood the word. Children were also asked to restate each item they read in their own words so that their understanding of its meaning could be assessed. The children were then directed to select a response to the item. After the children had chosen a response, the interviewer questioned them about why they chose the response to the item that they did. This series of questions and activities was designed to help the researcher determine whether the children interpreted the items on the instrument in the way that was intended by the developers.

Analysis of the data revealed four general types of problems with the version of the ESSP that was tested in the study:

  1. Difficulty recognizing words in the items,
  2. Difficulty comprehending word meanings,
  3. Incongruent explanations of their response choices,
  4. Misapplication of responses to the content of the item.

Word-recognition difficulties were operationally defined as instances in which a child did not recognize a word, or misread a word and did not correct him- or herself. Comprehension difficulties were operationally defined as misunderstanding the intent or content of the question. Incongruent responses displayed logical inconsistencies between the question and the child's answer, such as when an answer did not match the question (e.g., an answer of "sometimes" to a question referring to something that "happens all the time"). Misapplication of the answers to the questions involved judgments about whether the child applied the response options to the appropriate concept in the item. For example, a child who responded that s/he did not seek help from adults at school because "sometimes I can get the answer by myself and sometimes I can't" would be deemed to have misapplied the response options to the question.
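Operational definitions like these lend themselves to a simple coding scheme. The sketch below shows one way such codes might be tallied across interview responses; the code labels, data structures, and counts are all hypothetical illustrations, and Bowen (2008) describes the study's actual coding procedure.

```python
# Hypothetical sketch: tallying coded problems from cognitive interviews.
# Each (child, item) response is coded with zero or more of the four
# problem categories defined above. Codes and data are illustrative only.

from collections import Counter

PROBLEM_CODES = {
    "recognition": "did not recognize or misread a word without self-correcting",
    "comprehension": "misunderstood the intent or content of the item",
    "incongruence": "answer logically inconsistent with the question",
    "misapplication": "response options applied to the wrong concept in the item",
}

# One record per (child, item): the problem codes a coder assigned, if any.
coded_responses = [
    [], ["recognition"], ["comprehension", "misapplication"],
    [], ["incongruence"], [], ["comprehension"],
]

tally = Counter(code for response in coded_responses for code in response)
for code, description in PROBLEM_CODES.items():
    print(f"{code:<15} {tally[code]} occurrence(s): {description}")
```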

The results of the cognitive testing revealed numerous problems with the validity of the existing version of the ESSP. Word-recognition problems were the least common type and related to only a small number of words; the children were able to understand these words when they were spoken by the interviewer. Three strategies were developed to deal with word-recognition problems. In many cases, the problem words were simply deleted. Where they could not be deleted, definitions were added to a help screen for every item containing a problem word, and the word was included in the audio feature. Following these changes to the ESSP, performance improved on many of the items.

A substantial proportion of the validity problems encountered on the ESSP resulted from children not comprehending the content of an item. Some of these problems stemmed from a child's confusion about the person referenced in a question (e.g., one child thought that "a person who lives near me" referred to his mother rather than to a neighbor). In other cases, the children did not understand the main intent of the question; for example, when asked about friends to play with "outside of school," one child talked about friends at recess (that is, outside the physical school building) rather than friends to play with during non-school hours. To overcome such problems, the developers designed four strategies. In some cases, terms were defined on a screen preceding the actual question (e.g., "Friends are the kids you talk to and play with. Don't count brothers and sisters."). In some cases, the items causing difficulty were removed from the instrument. In other cases, items were reordered to reduce confusion (e.g., all negative items about a certain topic were grouped together). Some items were also simplified to improve comprehension. In subsequent testing, comprehension of the simplified items was found to be improved.

Problems with incongruence were more difficult to resolve and were not consistent across the rounds of testing. To attempt to alleviate problems of response incongruity, a screen was added that operationally defined the four response possibilities as part of the introductory material for the program. These definitions were also available in the help screens available for each item. However, since the cognitive testing focused only on the items themselves and not on the explanatory material in the introduction or the help screens, it was impossible to assess whether these changes improved the validity of the instrument.

The final category of problems was the misapplication of response options to the content of the questions. For example, one child responded to the statement "Grown-ups in the neighborhood would say something to me if I did something wrong" with the option "never." His explanation for this choice was that he never did anything wrong. Strategies to alleviate misapplication problems focused on simplifying the content of the items or reducing the cognitive demands of the items. Testing in further rounds, however, did not show this to be an adequate strategy for eliminating this category of problem.

Conclusion

Assessment instruments, including surveys, questionnaires, and various tests of mental functioning, are frequently used in sociological research. However, in order to be effective in gathering information that sociologists can use, these instruments need to be both reliable (i.e., consistently measure what they measure) and valid (i.e., measure what they purport to measure). By definition, an assessment instrument cannot be valid if it is not reliable.

There are many types of validity of interest to sociologists, including content validity, criterion-related validity, construct validity, cross validity, predictive validity, and face validity. By improving the validity of an assessment instrument, behavioral scientists can improve the quality of the data that they gather and better describe the world around them.

Terms & Concepts

Correlation: The degree to which two events or variables are consistently related. Correlation may be positive (i.e., as the value of one variable increases the value of the other variable increases), negative (i.e., as the value of one variable increases the value of the other variable decreases), or zero (i.e., the values of the two variables are unrelated). Correlation does not imply causation.
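For reference, the most common correlation index (one of several) is the Pearson product-moment coefficient, which for paired observations $(x_i, y_i)$ with means $\bar{x}$ and $\bar{y}$ is

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}},$$

ranging from $-1$ (perfect negative relationship) through $0$ (no linear relationship) to $+1$ (perfect positive relationship).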

Criterion: A dependent or predicted measure that is used to judge the effectiveness of persons, organizations, treatments, or predictors. The ultimate criterion measures effectiveness after all the data are collected. Intermediate criteria estimate this value earlier in the process. Immediate criteria estimate this value based on current values.

Data: In statistics, data are quantifiable observations or measurements that are used as the basis of scientific research.

Empirical: Theories or evidence that are derived from or based on observation or experiment.

Operational Definition: A definition that is stated in terms that can be observed and measured.

Reliability: The degree to which a psychological test or assessment instrument consistently measures what it is intended to measure. An assessment instrument cannot be valid unless it is reliable.

Sample: A subset of a population. A random sample is a sample that is chosen at random from the larger population with the assumption that such samples tend to reflect the characteristics of the larger population.

Survey: (a) A data collection instrument used to acquire information on the opinions, attitudes, or reactions of people; (b) a research study in which data concerning the opinions, attitudes, or reactions of members of a selected sample are gathered using a survey instrument or questionnaire for purposes of scientific analysis; typically the results of this analysis are used to extrapolate the findings from the sample to the underlying population; (c) to administer such an instrument to a sample.

Survey Research: A type of research in which data about the opinions, attitudes, or reactions of the members of a sample are gathered using a survey instrument. The phases of survey research are goal setting, planning, implementation, evaluation, and feedback. As opposed to experimental research, survey research does not allow for the manipulation of an independent variable.

Validity: The degree to which a survey or other data collection instrument measures what it purports to measure. A data collection instrument cannot be valid unless it is reliable. Content validity is a measure of how well assessment instrument items reflect the concepts that the instrument developer is trying to assess. Content validation is often performed by experts. Criterion-related validity is a measure of how well an assessment instrument measures what it is intended to measure as defined by another assessment instrument. Construct validity is a measure of how well an assessment instrument measures an underlying theoretical concept ("construct") that the researcher has developed. Face validity is merely the concept that an assessment instrument appears to measure what it is trying to measure. Cross validity is the validation of an assessment instrument with a new sample to determine if the instrument is valid across situations. Predictive validity refers to how well an assessment instrument predicts future events.

Essay by Ruth A. Wienclaw, PhD

Ruth A. Wienclaw holds a doctorate in industrial/organizational psychology with a specialization in organization development from the University of Memphis. She is the owner of a small business that works with organizations in both the public and private sectors, consulting on matters of strategic planning, training, and human/systems integration.

Bibliography

Bowen, N. K. (2008). Cognitive testing and the validity of child-report data from the Elementary School Success Profile. Social Work Research, 32, 18–28. Retrieved July 11, 2024 from EBSCO online database Academic Search Premier.

Follingstad, D., & Rogers, M. (2013). Validity concerns in the measurement of women’s and men’s report of intimate partner violence. Sex Roles, 69(3/4), 149–167. Retrieved July 11, 2024 from EBSCO online database SocINDEX with Full Text.

Gajewski, B., Price, L., Coffland, V., Boyle, D., & Bott, M. (2013). Integrated analysis of content and construct validity of psychometric instruments. Quality and Quantity, 47, 57–78. Retrieved July 11, 2024 from EBSCO online database SocINDEX with Full Text.

Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill Book Company.

Shakespeare-Finch, J., Martinek, E., Tedeschi, R. G., & Calhoun, L. G. (2013). A qualitative approach to assessing the validity of the Posttraumatic Growth Inventory. Journal of Loss and Trauma, 18, 572–591. Retrieved July 11, 2024 from EBSCO online database SocINDEX with Full Text.

Suggested Reading

Allen, M. J. & Yen, W. M. (1979). Introduction to measurement theory. Monterey, CA: Brooks/Cole Publishing Company.

Keatley, D., Clarke, D. D., & Hagger, M. S. (2013). Investigating the predictive validity of implicit and explicit measures of motivation in problem-solving behavioural tasks. British Journal of Social Psychology, 52, 510–524. Retrieved July 11, 2024 from EBSCO online database SocINDEX with Full Text.

Lemke, E. & Wiersma, W. (1976). Principles of psychological measurement. Chicago: Rand McNally College Publishing Company.

Moore, G. T. & Sugiyama, T. (2007). The Children's Physical Environment Rating Scale (CPERS): Reliability and validity for assessing the physical environment of early childhood educational facilities. Children, Youth & Environments, 17, 24–53. Retrieved July 11, 2024 from EBSCO online database Education Research Complete.

Steckler, A. & McLeroy, K. R. (2008). The importance of external validity. American Journal of Public Health, 98, 9–10. Retrieved July 11, 2024 from EBSCO online database Academic Search Premier.