Reliability
Reliability is a crucial concept in research and data collection, referring to the consistency with which an instrument measures a characteristic or attribute. It is essential for ensuring that the data collected are meaningful and can be trusted. Reliability is quantified by examining the observed variability in scores, which can arise from true differences among respondents as well as from errors in measurement. Various methods, including parallel forms, test-retest, and split-half techniques, are used to estimate reliability.
Factors influencing reliability can be personal, such as the participant's understanding of questions or their emotional state during assessment. For instance, temporary conditions like fatigue can affect responses, leading to variability that does not reflect true opinions or characteristics. Additionally, external factors, such as the environment in which a survey is conducted, can impact participant responses.
To achieve valid results, a data collection instrument must be both reliable and valid; reliability alone does not guarantee that an instrument measures what it is intended to measure. Ultimately, ensuring high reliability in research instruments is vital for drawing accurate conclusions and making informed decisions based on the data gathered.
On this Page
- True Data Variance vs. Data Error
- Factors Affecting Reliability
- Personal Factors
- Difficulty in Understanding the Data Collection Instrument
- Inaccurate Measurements
- Testing the Efficacy of Data Collection Instruments
- Applications
- Case Study: Presenting Child Need
- The Results
- The Children's Physical Environment Rating Scale
- Conclusion
- Terms & Concepts
- Bibliography
- Suggested Reading
Subject Terms
Reliability
To yield usable data, surveys, assessment tools, and other data collection instruments need to be both reliable and valid. Reliability is a measure of the degree to which such instruments consistently measure a characteristic or attribute. Statistically, reliability is a measure of the observed variability in obtained scores on an instrument. Variability can come both from true variance (such as differences in opinions, knowledge, or other characteristics of the individual) and from error variance. The total variability of a data collection or assessment instrument is the sum of the true variability and the variability due to error. Reliability can be estimated through the use of parallel forms of the instrument, repeated administration of the same form of the instrument, subdivision of the instrument into two parallel groups of items, and analysis of the covariance among the individual items.
In the case of data collection or assessment instruments, reliability—the degree to which an instrument consistently measures a characteristic or attribute—is a precondition for validity, the confidence that the instrument measures what it purports to measure. No matter how well-written a data collection or assessment instrument appears to be on its face, it cannot be valid unless it is reliable: a measure that does not consistently measure the same thing cannot be measuring the construct it was designed to measure. Both validity and reliability are therefore essential in survey research, so that the data collected in the study will actually give researchers the information they are trying to gather. Without both reliability and validity, the data collected are meaningless, and no conclusions can be drawn.
True Data Variance vs. Data Error
Even in the physical sciences, two sets of measurements performed on the same individuals never exactly duplicate each other. To the extent that this is true, the measurement instrument is unreliable, whether it is a physical scale used to weigh a chemical compound or a paper-and-pencil survey used to measure a person's attitude toward something. For example, on a scale of 1 to 10, what one person describes as a 10 another person may call a 9.5. This does not necessarily mean that their opinions differ, just that the two people are expressing them differently. Some of the total observed variance (the square of the standard deviation) in scores is due to true variance, or real differences in the way that people are responding to the question. The rest of the total variance is due to error.
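To make the variance decomposition concrete, here is a minimal simulation sketch in Python (assuming NumPy is available); the sample size and variance parameters are illustrative choices, not values drawn from any study discussed here.

```python
import numpy as np

rng = np.random.default_rng(42)
n_respondents = 10_000

# True scores: real differences among respondents (true variance).
true_scores = rng.normal(loc=5.0, scale=1.5, size=n_respondents)

# Measurement error: fatigue, ambiguous wording, distraction, etc.
errors = rng.normal(loc=0.0, scale=0.8, size=n_respondents)

# Each observed score is the true score plus error, so the total
# variance is (approximately) the sum of the two variances when
# the errors are uncorrelated with the true scores.
observed = true_scores + errors

total_var = observed.var()
true_var = true_scores.var()
error_var = errors.var()
print(f"total variance: {total_var:.2f} "
      f"(true {true_var:.2f} + error {error_var:.2f})")

# Reliability in the classical sense: the proportion of observed
# variance attributable to true differences among respondents.
print(f"reliability: {true_var / total_var:.2f}")
```

With these illustrative parameters, roughly three quarters of the observed variance reflects true differences among respondents; the remainder is measurement error.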
Factors Affecting Reliability
Personal Factors
There are many reasons why a data collection instrument may not be reliable and thus may contribute to the error variance. In general, what social scientists try to measure are lasting and general characteristics of individuals related to the underlying construct that the assessment instrument is trying to measure. However, other types of characteristics that are not part of the underlying construct, such as the individual's test-taking techniques and general ability to comprehend instructions, may also be measured.
In addition to the permanent characteristics of individuals, there are also temporary characteristics that can affect their responses to questions on data collection instruments. These might include such factors as general health, fatigue, or emotional strain, all of which can affect the way that an individual responds to a question, a phenomenon familiar to anyone who has had to take a test in school while ill. Similarly, external conditions such as heat, light, ventilation, or even momentary distraction can affect one's responses in a way that does not reflect the underlying theoretical construct. Further, the subject's motivation can also affect the reliability of a data collection instrument. For example, teachers generally assume that their students are motivated to do well on any data collection instrument (e.g., a midterm exam) given to them. The same assumption cannot be made when asking a random sample of individuals to answer questions on a data collection or assessment instrument. For instance, it is often difficult to get shoppers to cooperate in opinion surveys because they are intent on accomplishing their errands so that they can go home. The incentive that may be offered to entice participation in the survey, such as a crisp new dollar bill or a carton of instant macaroni and cheese, is nothing compared to the motivation of students to do well in a course.
Difficulty in Understanding the Data Collection Instrument
Another source of variability in the way people respond to a data collection instrument is individual differences in how people interpret the questions on the instrument. Care must always be taken in the development of a data collection instrument to write at a level that can be understood by all the people who will answer the questions. The questions need to be written unambiguously and with proper spelling, grammar, and punctuation to reduce the possibility of low reliability arising because people do not understand what the questions are asking. For example, a child could easily take a question about a person who "lives near" him or her to mean the family members in the immediate household rather than a neighbor. Similarly, a question about how much one likes "sweet tea" means a different thing in the southern United States, where it refers to iced tea sweetened with simple syrup, than it does in Great Britain. The use of clear, concise language and operational definitions can help increase the reliability of the instrument.
Reliability problems may also stem from individual differences in the way that people interpret responses to a data collection instrument. Even when the end points of a scale are operationally defined with clear examples, people who moderately dislike something may vary their answers between 20 and 40 on a scale of 100 yet all mean the same thing. Similarly, some people never give a perfect score to anything on a rating scale because they believe that there is always room for improvement.
Inaccurate Measurements
Another potential cause of low reliability is an instrument that is not valid because it is actually measuring more than one thing. For example, a researcher might set up an experiment to determine whether men or women are more likely to stop and assist a stranger on the street who needs help. This could be done by having a confederate drop a sheaf of loose papers and counting how many men and how many women stop to help. If more men stop than women, the researcher will probably conclude that men are more likely than women to help a stranger. However, these results might not be replicable, because a great number of factors can affect a person's willingness to stop and help. Similarly, great variability and concomitantly low reliability can be found when data are collected through a structured interview. Even when the questions are always asked with the same wording, differences between interviewers and in how they are perceived by the people answering the questions can result in the same data collection instrument yielding widely disparate responses because of the interviewing styles of different interviewers.
Testing the Efficacy of Data Collection Instruments
In general terms, reliability is defined as the degree to which a data collection or assessment instrument consistently measures a characteristic or attribute. There are several ways that the reliability of such an instrument can be estimated.
The first of these methods involves the administration of two parallel forms of the instrument under specified conditions. The statistical correlation between the results of the two administrations is calculated to determine the degree of consistency between the forms. However, it is typically difficult to develop two equivalent forms of the same assessment instrument with equal discriminability.
As a result of this difficulty, a second method of determining reliability, called test-retest reliability, is frequently used. In this approach, the same form of the data collection instrument is administered twice to the same sample of individuals, and the correlation between the two scores is calculated to determine the reliability.
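Computationally, both the parallel-forms and test-retest estimates come down to correlating two sets of scores obtained from the same respondents. A minimal sketch, assuming SciPy is available and using made-up scores:

```python
from scipy.stats import pearsonr

# Made-up scores for ten respondents on two administrations
# (two parallel forms, or the same form given twice).
administration_1 = [12, 15, 9, 20, 17, 11, 14, 18, 10, 16]
administration_2 = [13, 14, 10, 19, 18, 10, 15, 17, 11, 15]

# The Pearson correlation between the two sets of scores serves
# as the reliability estimate; values near 1 indicate that the
# instrument ranks respondents consistently.
r, p_value = pearsonr(administration_1, administration_2)
print(f"estimated reliability: r = {r:.2f}")
```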
A third approach to estimating reliability is to subdivide a single instrument into two presumably parallel groups of items. All of the items on the instrument are given to one sample of individuals at one time, and then the items are split out and treated as if they were two separate instruments. Each group is scored separately, and the resulting scores are correlated. This approach is called split-half reliability.
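A sketch of the split-half computation follows, using an invented respondents-by-items score matrix. Because each half is only half the length of the full instrument, the half-test correlation is conventionally stepped up with the Spearman-Brown formula (standard practice, though not mentioned above) to estimate the reliability of the full-length instrument.

```python
import numpy as np
from scipy.stats import pearsonr

# Invented data: rows are respondents, columns are items.
items = np.array([
    [4, 5, 3, 4, 5, 4],
    [2, 1, 2, 2, 1, 2],
    [5, 4, 5, 5, 4, 4],
    [3, 3, 2, 3, 3, 3],
    [1, 2, 1, 1, 2, 2],
])

# Split the instrument into odd- and even-numbered items and
# score each half separately for every respondent.
half_1 = items[:, 0::2].sum(axis=1)
half_2 = items[:, 1::2].sum(axis=1)

r_half, _ = pearsonr(half_1, half_2)

# Spearman-Brown step-up: each half is only half as long as the
# full instrument, so the raw correlation understates reliability.
r_full = 2 * r_half / (1 + r_half)
print(f"split-half r: {r_half:.2f}, corrected: {r_full:.2f}")
```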
Finally, the covariance among the individual items on the assessment instrument can be analyzed to estimate the true-score and error variances.
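The best-known coefficient built on the inter-item covariances is Cronbach's alpha (on which see Peterson & Yeolib, 2013, in the bibliography). A minimal sketch of the standard formula, again with an invented score matrix:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for a respondents-by-items score matrix."""
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

items = np.array([
    [4, 5, 3, 4],
    [2, 1, 2, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 1],
])
print(f"Cronbach's alpha: {cronbach_alpha(items):.2f}")
```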
Applications
Case Study: Presenting Child Need
In an example of the application of reliability and validity assessment to a real-world problem, Forrester, Fairtlough, and Bennet (2007) examined the inter-rater reliability of methods used to describe the needs of children referred to children's services in England and Wales. Although it is important to look at the unique characteristics of each case when determining what type of help a child needs, it is also important to be able to speak about "need" in a common language that reliably discriminates between various classifications of need so that children can be given the help necessary for their well-being. Several typologies of need identification have been developed, but to maximize their usefulness, they must yield reliable results. Forrester, Fairtlough, and Bennet (2007) examined the files of 200 consecutive closed referrals, first analyzing the files to classify the presenting needs or potential needs of the child in each case and then grouping them into clusters of variables for issues that occurred more than once. Fifty randomly selected cases were then tested to determine the relationship between the variables. The patterns were statistically analyzed using cross-tabulation and Spearman's rank correlation coefficient. Based on this analysis, the variables were reduced to a final list of ten, plus an "other" category.
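The cross-tabulation and rank-correlation step can be illustrated in a few lines; the two need indicators below are hypothetical stand-ins, not variables from the Forrester, Fairtlough, and Bennet study.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical 0/1 need indicators coded from 50 case files.
rng = np.random.default_rng(7)
need_a = rng.integers(0, 2, size=50)
# A second indicator that agrees with the first about 70% of the time.
need_b = np.where(rng.random(50) < 0.7, need_a, 1 - need_a)

# Cross-tabulation: counts of each combination of the two indicators.
crosstab = np.zeros((2, 2), dtype=int)
for a, b in zip(need_a, need_b):
    crosstab[a, b] += 1
print(crosstab)

# Spearman's rank correlation between the two indicators.
rho, p = spearmanr(need_a, need_b)
print(f"Spearman's rho: {rho:.2f} (p = {p:.3f})")
```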
The Results
The authors reported several main findings about the reliability of the descriptions of children's need used by social services. First, descriptions of need that relied on a "main" need were not as reliable as other approaches, and patterns of incidence could not be described adequately using the construct of "main" need. Although such an approach may simplify data presentation, it does not adequately describe the complexity of a child's situation, nor does it give any indication of the seriousness of the need. Second, although other approaches to describing need were more reliable than the "main" need approach, they, too, were not without problems. In these typologies, classifications such as "dysfunctional family" or "unstable or otherwise detrimental family" were vague and had low levels of reliability. The authors urged that such terms be better defined in order to increase reliability. Third, the legal definition of need had low levels of agreement between raters. In part, this appeared to result from the fact that the legal definition emphasized the seriousness of the need rather than its presence, as the other definitions and typologies do.
The authors concluded, first, that typologies needed to be developed for the full range of referrals to children's services. Second, they cautioned that the concept of "main" need used in both research and government policy was unreliable and not a good indicator of a child's situation or problem; its use could lead to misclassification and inappropriate intervention for at-risk children. Third, some specific categories in current use, such as "dysfunctional family," required better definition to increase the reliability of assessments. Finally, future typologies of need should be tested for inter-rater reliability before being implemented. It is only through a reliable instrument that at-risk children can be consistently identified and their needs appropriately assessed.
The Children's Physical Environment Rating Scale
In another example of a reliability study, Moore and Sugiyama (2007) examined the reliability and validity of a new scale to be used for assessing the physical environment of early childhood educational facilities. The literature links the physical environment of such facilities to cognitive and social development during early childhood. The Children's Physical Environment Rating Scale (CPERS) comprises 124 items clustered into fourteen scales that focus on planning, overall architectural quality, indoor activity spaces, and outdoor play areas.
The reliability of the CPERS was tested for inter-rater reliability, test-retest reliability, and internal consistency. Inter-rater reliability was tested in forty-six early childhood development centers in Sydney, Australia. Each center was assessed by two of seven raters through several cycles of field testing. The resulting data were statistically analyzed to determine the degree of agreement between raters for each item and Cronbach's generalizability coefficient G for each subscale. These analyses showed a high degree of agreement and generalizability between raters on the CPERS items. Based on this result, the authors concluded that the CPERS is a reliable instrument that can consistently be used to rate the physical environment of an early childhood facility, both for research purposes and for practical assessment.
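Item-level agreement between two raters can be quantified in several ways. The sketch below computes simple percent agreement and Cohen's kappa, which corrects for chance agreement; kappa is a common illustrative choice, not necessarily the coefficient the CPERS authors used, and the ratings are invented.

```python
from collections import Counter

# Hypothetical ratings of the same ten items by two raters (1-5 scale).
rater_1 = [3, 4, 4, 2, 5, 3, 4, 1, 2, 5]
rater_2 = [3, 4, 5, 2, 5, 3, 4, 1, 3, 5]

n = len(rater_1)
observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n

# Chance agreement: the probability that the raters agree by accident,
# given each rater's observed category frequencies.
c1, c2 = Counter(rater_1), Counter(rater_2)
expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / n**2

kappa = (observed - expected) / (1 - expected)
print(f"percent agreement: {observed:.2f}, Cohen's kappa: {kappa:.2f}")
```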
In addition, the authors examined the degree to which scores on the CPERS were stable over time. Each of eleven early childhood development centers was assessed once and then reassessed three to five weeks later, and the results were analyzed using Cronbach's G. The analysis showed a high degree of test-retest reliability, indicating that scores on the CPERS are stable over time and provide consistent measures.
Finally, the internal consistency of the CPERS scales was assessed at eleven centers by two raters, comparable to the center directors who might use the scale on a routine basis, working simultaneously but independently. The ratings were analyzed using Cronbach's alpha to determine the internal consistency of each subscale. In general, the results showed that the CPERS has very high internal consistency and is highly reliable for use in assessing early childhood centers.
Conclusion
In order for a data collection or assessment instrument to be valid and test what it purports to measure, it must be designed to be reliable and consistently measure a characteristic or attribute. If an instrument is not both reliable and valid, the resulting data are not of use to the researcher. There are many potential sources of variability in the results of data collection and assessment instruments. These include lasting and general characteristics of the individual, lasting but specific characteristics of the individual, temporary but general characteristics of the individual, temporary and specific characteristics of the individual, and systematic or chance factors affecting the administration of the instrument.
In addition, the data from every assessment instrument will also contain some degree of variability that is attributable to error. The total variability of a data collection or assessment instrument is the sum of the true variability and the variability due to error. Reliability can be estimated through the use of parallel forms of the instrument, repeated administration of the same form of the instrument, subdivision of the instrument into two presumably parallel groups of items, and analysis of the covariance among the individual items.
Terms & Concepts
Confederate: A person who assists a researcher by pretending to be part of the experimental situation while actually only playing a rehearsed part meant to stimulate a response from the research subject.
Correlation: The degree to which two events or variables are consistently related. Correlation may be positive (as the value of one variable increases, the value of the other variable increases), negative (as the value of one variable increases, the value of the other variable decreases), or zero (the values of the two variables are unrelated). Correlation does not imply causation.
Data: In statistics, quantifiable observations or measurements that are used as the basis of scientific research.
Operational Definition: A definition that is stated in terms that can be observed and measured.
Reliability: The degree to which a data collection or assessment instrument consistently measures a characteristic or attribute. An assessment instrument cannot be valid unless it is reliable.
Sample: A subset of a population. A random sample is a sample that is chosen at random from the larger population with the assumption that it will reflect the characteristics of the larger population.
Standard Deviation: A measure of variability that describes how far the typical score in a distribution is from the mean of the distribution.
Survey: (a) A data collection instrument used to acquire information on the opinions, attitudes, or reactions of people; (b) a research study in which members of a selected sample are asked questions concerning their opinions, attitudes, or reactions, and the responses are analyzed and used to extrapolate from the sample to the underlying population.
Survey Research: A type of research in which data about the opinions, attitudes, or reactions of the members of a sample are gathered using a survey instrument. The phases of survey research are goal setting, planning, implementation, evaluation, and feedback. Unlike experimental research, survey research does not allow for the manipulation of an independent variable.
Validity: The degree to which a survey or other data collection instrument measures what it purports to measure. A data collection instrument cannot be valid unless it is reliable. Content validity is a measure of how well assessment instrument items reflect the concepts that the instrument developer is trying to assess. Construct validity is a measure of how well an assessment instrument measures what it is intended to measure as defined by another assessment instrument. Face validity is when an assessment instrument appears to measure what it is trying to measure. Cross validity is the validation of an assessment instrument with a new sample to determine if the instrument is valid across situations. Predictive validity refers to how well an assessment instrument predicts future events.
Bibliography
Bowen, N. K. (2008). Cognitive testing and the validity of child-report data from the Elementary School Success Profile. Social Work Research, 32(1), 18-28. Retrieved July 11, 2024, from EBSCO Online Database Academic Search Complete.
Compton, D., Love, T. P., & Sell, J. (2012). Developing and assessing intercoder reliability in studies of group interaction. Sociological Methodology, 42(1), 348-364. Retrieved July 11, 2024, from EBSCO Online Database SocINDEX with Full Text.
Forrester, D., Fairtlough, A., & Bennet, Y. (2007). Describing the needs of children presenting to children's services: Issues of reliability and validity. Journal of Children's Services, 2(2), 48-59. Retrieved July 11, 2024, from EBSCO Online Database SocINDEX with Full Text.
Hendrick, T. M., Fischer, A. H., Tobi, H., & Frewer, L. J. (2013). Self-reported attitude scales: Current practice in adequate assessment of reliability, validity, and dimensionality. Journal of Applied Social Psychology, 43(7), 1538-1552. Retrieved November 8, 2013, from EBSCO Online Database SocINDEX with Full Text.
Moore, G. T., & Sugiyama, T. (2007). The Children's Physical Environment Rating Scale (CPERS): Reliability and validity for assessing the physical environment of early childhood educational facilities. Children, Youth and Environments, 17(4), 24-53. Retrieved July 11, 2024, from EBSCO Online Database SocINDEX with Full Text.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill Book Company.
Peterson, R. A., & Yeolib, K. (2013). On the relationship between coefficient alpha and composite reliability. Journal of Applied Psychology, 98(1), 194-198. Retrieved July 11, 2024, from EBSCO Online Database SocINDEX with Full Text.
Teixeira de Melo, A., Alarcão, M., & Pimentel, I. (2012). Validity and reliability of three rating scales to assess practitioners' skills to conduct collaborative, strength-based, systemic work in family-based services. American Journal of Family Therapy, 40(5), 420-433. Retrieved July 11, 2024, from EBSCO Online Database SocINDEX with Full Text.
Suggested Reading
Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. Monterey, CA: Brooks/Cole Publishing Company.
Bulloch, S. (2013). Seeking construct validity in interpersonal trust research: A proposal on linking theory and survey measures. Social Indicators Research, 113(3), 1289-1310. Retrieved July 11, 2024, from EBSCO Online Database SocINDEX with Full Text.
Conley, T. B. (2006). Court ordered multiple offender drunk drivers: Validity and reliability of rapid assessment. Journal of Social Work Practice in the Addictions, 6(3), 37-51. Retrieved July 11, 2024, from EBSCO Online Database Academic Search Complete.
Dunn, T. W., Smith, T. B., & Montoya, J. A. (2006). Multicultural competency instrumentation: A review and analysis of reliability generalization. Journal of Counseling & Development, 84(4), 471-482. Retrieved July 11, 2024, from EBSCO Online Database Academic Search Complete.
Gillaspy, J. A., Jr., & Campbell, T. C. (2007). Reliability and validity of scores from the Inventory of Drug Use Consequences. Journal of Addictions & Offender Counseling, 27(1), 17-27. Retrieved July 11, 2024, from EBSCO Online Database Academic Search Complete.
Lemke, E., & Wiersma, W. (1976). Principles of psychological measurement. Chicago: Rand McNally College Publishing Company.
Lewis, C. A., & Cruise, S. M. (2006). Temporal stability of the Francis Scale of Attitude toward Christianity among 9- to 11-year-old English children: Test-retest data over six weeks. Social Behavior and Personality, 34(9), 1081-1086. Retrieved July 11, 2024, from EBSCO Online Database Academic Search Complete.