Correlation
Correlation is a statistical term that measures the degree to which two events or variables are related. It indicates both the strength and direction of a relationship, which can be positive (an increase in one variable is associated with an increase in the other), negative (an increase in one variable is associated with a decrease in the other), or zero (indicating no relationship). Importantly, correlation does not imply causation; knowing that two variables are correlated does not mean that one causes the other. Techniques for assessing correlation include the Pearson Product Moment Correlation for parametric data and the Spearman Rank Correlation Coefficient for nonparametric data. These measures are integral to various statistical analyses, including regression and factor analysis, which help build models that reflect complex real-world relationships. For instance, correlation can aid researchers in understanding social behaviors, such as how advertising might influence consumer habits. However, care must be taken to avoid misinterpreting correlation as causation, as other factors may drive the observed relationship.
In statistics, correlation is the degree to which two events or variables are consistently related. This measure indicates both the strength and direction of the relationship between variables. However, it yields no information concerning the cause of the relationship. Correlation techniques are available for both parametric and nonparametric data. The Pearson Product Moment Correlation is also used in other inferential statistical techniques such as regression analysis and factor analysis to help researchers and theorists build models that reflect the complex relationships observed in the real world.
Keywords Correlation; Data; Demographic Data; Dependent Variable; Distribution; Factor Analysis; Independent Variable; Inferential Statistics; Model; Nonparametric Statistics; Parametric Statistics; Regression Analysis; Reliability; Variable
Overview
Every day we make assumptions about the relationship of one event to another in both our personal and professional lives. "My alarm clock failed to go off this morning, so I will be late for work." "The cat ate an entire can of cat food so she must be feeling better." "I received a polite e-mail from Mr. Jones, so he must not be angry that my report was not submitted on time." Sociologists attempt to express the relationship between variables in the same way on a broader scale. "Advertisements induce previous purchasers to buy additional lottery tickets." "People tend to act more openly with strangers who outwardly appear to be similar to themselves." "Younger males tend to be less prejudiced towards women in the workplace."
From a statistical point of view, the mathematical expression of such relationships is called correlation. This is the degree to which two events or variables are consistently related. Correlation may be positive (i.e., as the value of one variable increases, the value of the other variable increases), negative (i.e., as the value of one variable increases, the value of the other variable decreases), or zero (i.e., the values of the two variables are unrelated). However, correlation does not give one any information about what caused the relationship between the two variables. Properly used, knowing the correlation between variables can give one useful information about behavior. For example, if I know that my cat gets sick when I feed her "Happy Kitty" brand cat food, I am unlikely to feed her "Happy Kitty" in the future. Of course, knowing that she gets sick after eating "Happy Kitty" does not explain why she gets sick. It may be that she is sensitive to one of the ingredients in "Happy Kitty," or it may be that the manufacturer inadvertently released a batch of tainted food. However, my cat's digestive problems might not have anything to do with "Happy Kitty" at all. The neighborhood stray may be eating all of her "Happy Kitty" food, so that she is actually getting sick from something else she has eaten, or I may have switched her to "Happy Kitty" at the same time that she became sick from an unrelated cause. All I know is that when I feed her "Happy Kitty" she gets sick. Although I do not know why, this is still useful information. The same is true for the larger problems of sociology.
There are a number of ways to statistically determine the correlation between two variables. The most common of these is the technique referred to as the Pearson Product Moment Coefficient of Correlation, or Pearson r. This statistical technique allows researchers to determine whether the two variables are positively correlated (i.e., my cat gets sick when she eats "Happy Kitty"), negatively correlated (i.e., my cat is healthier when she eats "Happy Kitty"), or not correlated at all (i.e., there is no change in my cat's health when she eats "Happy Kitty").
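To make the idea concrete, the following is a minimal sketch of how one might compute a Pearson r in Python with the SciPy library. The variable names and data are invented for illustration and are not drawn from any study discussed here.

```python
# A minimal sketch of computing Pearson r, using invented data
# (hypothetical advertising exposure vs. lottery tickets purchased).
from scipy import stats

ad_exposure = [1, 2, 3, 4, 5, 6, 7, 8]      # hypothetical values
tickets_bought = [0, 1, 1, 2, 2, 3, 4, 4]   # hypothetical values

r, p_value = stats.pearsonr(ad_exposure, tickets_bought)
print(f"Pearson r = {r:.3f}, p = {p_value:.4f}")
# r near +1 indicates a positive correlation, near -1 a negative one,
# and near 0 no linear relationship. A large r still says nothing
# about WHY the two variables move together.
```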
Correlation vs. Causation
However, as mentioned above, knowing that two variables are correlated does not tell us whether one variable caused the other or whether both observations were caused by some other, unknown, third factor. Unlike the techniques of inferential statistics, in which we draw conclusions about a population from a sample or examine the influence of an independent variable on a dependent variable, correlation by itself supports no conclusions about causation. For example, if I have two clocks that keep perfect time in my house, I may observe that the alarm clock in my bedroom goes off every morning at seven o'clock just as the grandfather clock in the hallway chimes. This does not mean that the alarm clock caused the grandfather clock to chime or that the grandfather clock caused the alarm clock to go off. In fact, both of these events were caused by the same thing: the passage of 24 hours since the last time they did this. Although it is easy to see in this simple example that a third factor must have caused both clocks to go off, the causative factor for two related variables is not always so easy to spot. To act on unfounded assumptions about causation inferred from correlation is part of the cycle of superstitious behavior. Many ancient peoples, for example, included some sort of sun god in their pantheon of deities. They noticed that when they made offerings to their sun god, the sun arose the next morning, bringing with it heat and light. So, they made offerings. From our modern perspective, however, we know that the faithful practice of making offerings to a sun god was not the cause of the sun coming up the next morning. Rather, the apparent phenomenon of the rising sun is caused by the daily rotation of the earth on its axis.
The classic example showing the absurdity of inferring causation from correlation was published in the mid-20th century in a paper reporting the results of an analysis of fictional data. Neyman (1952) used an illustration of the correlation between the number of storks and the number of human births in various European countries. The correlation between sightings of storks and the number of births was both high and positive. Without understanding how to interpret the correlation coefficient, someone might conclude from this evidence that storks bring babies. The truth, however, was that the data were analyzed without respect to country size. Since larger northern European countries tend to have both more women and more storks, the observed correlation was due to country size. The correlation was incidental, not causal: correlation tells one nothing about causation. Although this example was originally meant to make people laugh, it was also meant as a warning: as absurd as it may sound, correlation coefficients are frequently misinterpreted to imply causation.
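A short simulation, in the spirit of Neyman's fictional stork data, can show how a lurking third variable manufactures a correlation. The numbers below are invented, and the adjustment used (dividing by country size) is just one simple way of controlling for the third variable.

```python
# A sketch of a spurious correlation driven by a third variable.
# "Country size" drives both stork counts and birth counts; the two
# counts are otherwise unrelated. All data are fictional.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
size = rng.uniform(1, 100, 50)              # fictional country sizes
storks = 2 * size + rng.normal(0, 5, 50)    # storks scale with size
births = 30 * size + rng.normal(0, 80, 50)  # births also scale with size

r_raw, _ = stats.pearsonr(storks, births)
# Remove the influence of size by correlating per-unit-size rates:
r_adj, _ = stats.pearsonr(storks / size, births / size)
print(f"raw r = {r_raw:.2f}, size-adjusted r = {r_adj:.2f}")
# The raw correlation is very high; once country size is controlled
# for, the relationship all but disappears.
```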
Pearson Product Moment Correlation
The Pearson Product Moment Correlation is a parametric test that makes several assumptions concerning the data being analyzed. First, it assumes that the data have been randomly selected from a population that has a normal distribution. In addition, it assumes that the data are interval or ratio in nature. This means that not only do the rank orders of the data have meaning (e.g., a value of 6 is greater than a value of 5), but the intervals between the values also have meaning. For example, weight is a ratio scale. The difference between 1 gram of a chemical compound and 2 grams of the compound is the same as the difference between 100 grams and 101 grams. These measurements have meaning because the weight scale has a true zero (i.e., we know what it means to have 0 grams of the compound) and the intervals between values are equal. On the other hand, in attitude surveys and other data collection instruments used by sociologists, it may not be as clear that the difference between 0 and 1 on a 100-point rating scale of the quality of a widget is the same as the difference between 50 and 51 or between 98 and 99. These are value judgments, and the scale may not have a true zero. Even if the scale does start at 0, it may be difficult to define what this value means. It is difficult to know whether a score of 0 differs meaningfully from a score of 1 on an attitude scale; in both cases, the rater had a severe negative reaction to the item being rated. Since ratings are subjective, even when numerical values are assigned to them, they do not necessarily meet the requirement of parametric statistics that the data be at the interval or ratio level.
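The practical consequence of the interval assumption can be demonstrated with a small example. The sketch below, with invented ratings, shows that Pearson r changes when ordinal codes are recoded with different but equally order-preserving spacing, which is one reason rank-based alternatives (introduced next) exist.

```python
# Why the interval assumption matters: Pearson r depends on the numeric
# spacing of category codes, so an order-preserving recoding of an
# ordinal scale changes r. All data here are invented.
from scipy import stats

ratings = [1, 2, 2, 3, 4, 4, 5, 5, 3, 1]          # 1-5 attitude ratings
outcome = [10, 12, 15, 16, 20, 19, 24, 23, 17, 11]

# An alternative coding that preserves order but stretches the top end:
recode = {1: 1, 2: 2, 3: 3, 4: 7, 5: 15}
ratings_alt = [recode[v] for v in ratings]

print(round(stats.pearsonr(ratings, outcome)[0], 3))
print(round(stats.pearsonr(ratings_alt, outcome)[0], 3))
# The two r values differ even though the rank order of the ratings is
# identical: Pearson r uses interval information that an ordinal scale
# may not actually carry.
```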
Spearman Rank Correlation Coefficient
Fortunately, the Pearson Product Moment Correlation is not the only method for determining the relationship between variables. In these situations, the Spearman Rank Correlation Coefficient can be used instead to determine the degree of relationship between two variables. The Spearman is a nonparametric statistical test that makes no assumptions about the underlying distribution of the data. Unlike the Pearson coefficient of correlation, which requires interval or ratio level data, the Spearman can be used with ordinal level (i.e., ranked) data, and it does not assume that there is a linear relationship between the variables. For example, the Spearman Rank Correlation could be used to determine whether the ratings of the violence level of television shows made by two different raters were close enough to be pooled (i.e., whether both individuals were using the same subjective criteria when rating the shows).
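A minimal sketch of that two-rater scenario follows, again with invented ratings. Note that, unlike Pearson r, Spearman's rho would be unchanged by the kind of order-preserving recoding shown earlier, because it depends only on ranks.

```python
# Spearman's rho on two raters' ordinal violence ratings for ten
# television shows. The ratings are invented for illustration.
from scipy import stats

rater_a = [5, 3, 4, 1, 2, 5, 4, 2, 3, 1]   # hypothetical 1-5 ratings
rater_b = [4, 3, 5, 1, 2, 5, 4, 1, 3, 2]

rho, p_value = stats.spearmanr(rater_a, rater_b)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.4f}")
# A high rho suggests the two raters order the shows similarly,
# supporting the decision to pool their ratings.
```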
Applications
Coefficients of Correlation & Their Use
Coefficients of correlation are used not only as stand-alone statistics but also as inputs into other statistical techniques, including regression analysis and factor analysis. These techniques are used to develop multidimensional models that describe the complex nature of real-world situations.
Factor Analysis
Factor analysis is a multivariate technique that is used to analyze the interrelationships between variables and attempts to articulate their common underlying factors. It is used in situations where it is assumed that the observed data are not chaotic but can be attributed to a smaller number of underlying factors. Multidimensional mathematical techniques are applied to the data to examine how they cluster together into "factors." In many ways, factor analysis is more a logical procedure than a statistical one, although it is based on the analysis of Pearson correlation coefficients between variables. Factor analysis asks how likely it is that the same underlying processes are responsible for multiple observed measures. Although factor analysis can yield interesting information about the relationships between seemingly unrelated data, the determination of factors is, in the end, a qualitative decision requiring the insights of the researcher. Further, factor analysis does not determine "the" set of factors that underlie the data; it typically reveals several plausible sets of factors. This means that the researcher needs to give careful consideration to what is known about the situation in order to determine which potential set of factors is superior to the others. Without such consideration, the resulting factors will not be meaningful.
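The sketch below illustrates the basic idea using the scikit-learn library (an assumption; the article names no software). Two invented latent traits generate six observed variables, and exploratory factor analysis approximately recovers the two-factor structure.

```python
# A hedged sketch of exploratory factor analysis on synthetic data:
# two hidden factors generate six observed variables. The trait names
# and loadings are invented purely for illustration.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
n = 500
trait1 = rng.normal(size=n)   # e.g., an underlying "anxiety" factor
trait2 = rng.normal(size=n)   # e.g., an underlying "aggression" factor

# Each observed item loads mainly on one factor, plus noise.
X = np.column_stack([
    0.9 * trait1 + 0.2 * rng.normal(size=n),
    0.8 * trait1 + 0.3 * rng.normal(size=n),
    0.7 * trait1 + 0.3 * rng.normal(size=n),
    0.9 * trait2 + 0.2 * rng.normal(size=n),
    0.8 * trait2 + 0.3 * rng.normal(size=n),
    0.7 * trait2 + 0.3 * rng.normal(size=n),
])

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
print(np.round(fa.components_, 2))  # loadings of each item on each factor
# Items 0-2 should load mainly on one recovered factor and items 3-5 on
# the other. Deciding what the factors *mean* remains the researcher's
# qualitative judgment, as the text above notes.
```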
Factor Analysis & Pearson r
An example of a research study that uses the Pearson r as an input for model building was performed by Brennan, Molnar, and Earls (2007). The researchers used correlation and factor analysis to refine a measure of adolescents' exposure to violence in Chicago neighborhoods. Trained interviewers conducted separate interviews with each adolescent and his or her primary caregiver about potentially harmful events that had occurred in the adolescent's life (both witnessed events and personal victimization). Subjects also answered 35 items concerning their anxiety/depression, aggression, and delinquency levels as well as a number of items used to collect demographic data. Among other methods, the researchers used correlation to identify items that might not fit well with the construct of violence exposure. These were items that either correlated with other items on the scale or did not increase the scale's reliability. Three scales were of particular interest: victimization, witnessing of violence, and learning of violence. To test whether these were truly three separate scales, the researchers performed a confirmatory factor analysis. Based on the results of the study, the researchers concluded that these were three different factors contributing to the exposure to violence in urban youth.
Regression Analysis & Pearson r
Another statistical technique that uses the Pearson r as an input is regression analysis. This is a family of statistical techniques used to develop a mathematical model for predicting one variable from knowledge of another variable. Advanced regression techniques allow researchers to use both multiple independent and multiple dependent variables in developing models. The regression equation is a mathematical model of a real-world situation that can be invaluable for forecasting and for learning more about the interaction of variables in the real world. There are many types of multivariate regression, including multiple linear regression, multivariate polynomial regression, and canonical correlation.
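For a single predictor, the tie between correlation and regression is direct: the squared Pearson r equals the proportion of variance the regression explains. The sketch below, with invented data loosely echoing the study described next, shows a simple linear regression in Python.

```python
# A minimal sketch of simple linear regression, using invented data.
# With one predictor, the regression's R-squared equals the squared
# Pearson r between predictor and outcome.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
hours_online = rng.uniform(0, 6, 60)     # hypothetical predictor
loneliness = 2.0 + rng.normal(0, 1, 60)  # no real effect built in

result = stats.linregress(hours_online, loneliness)
print(f"slope = {result.slope:.3f}, intercept = {result.intercept:.3f}")
print(f"r = {result.rvalue:.3f}, r^2 = {result.rvalue ** 2:.3f}")

# Using the fitted model to predict a new observation:
predicted = result.intercept + result.slope * 3.5
print(f"predicted score at 3.5 hours online: {predicted:.2f}")
# Here the slope and r should be near zero, mirroring a finding that
# time online does not predict the outcome.
```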
Subrahmanyam and Lin (2007) used correlation as an input into regression analysis to investigate the effects of Internet use on the well-being of adolescents. The researchers examined the effect of Internet usage on adolescents' feelings of loneliness and their perceptions of support from friends and family. Data were collected from 78 females and 78 males between 15 and 18.4 years of age. Each participant completed an Internet access questionnaire that asked how they used the Internet (e.g., total time spent online, time spent using e-mail, place of access). The questionnaire also explored the subjects' knowledge of and familiarity with their online correspondents, as well as their relationships with these individuals. The loneliness level of the subjects was measured using the eight-item Roberts Revision of the UCLA Loneliness Scale, and the availability of others to whom the participants could turn in times of need, along with their satisfaction with that support, was assessed using the 24-item Social Support Scale for Children. The data were analyzed using regression analysis. The results suggested that loneliness was not predicted by the length of time spent on the Internet. However, gender and the participants' perceptions of their online relationships did predict loneliness. Participants who felt that their online partners could be counted on in times of need tended to be lonelier than those who did not. Finally, perceived support from significant others was not related to the amount of time spent online, time spent on e-mail, relationships with online partners, or perceptions about these relationships.
Conclusion
Coefficients of correlation mathematically express the degree of relationship between two events or variables on a scale from 0.0 (demonstrating no relationship between the two variables) to 1.0 (demonstrating a perfect relationship between the variables). In addition, coefficients of correlation may be positive (demonstrating that as the value of one variable increases, so does the value of the other variable) or negative (demonstrating that as the value of one variable increases, the value of the other variable decreases). Two of the most common methods of determining correlation between variables are the Pearson Product Moment Coefficient of Correlation for use with parametric data and the Spearman Rank Order Coefficient of Correlation for use with nonparametric data. In addition, the Pearson statistic can be used as an input to other statistical techniques, such as regression analysis and factor analysis, in the building of models of complex real-world behavior. Although correlation is an important statistical tool for sociologists, it is important to remember that correlation by itself does not imply causation: correlated variables may both be caused by a third, unknown factor or may be only spuriously related.
Terms & Concepts
Correlation: The degree to which two events or variables are consistently related. Correlation may be positive (i.e., as the value of one variable increases the value of the other variable increases), negative (i.e., as the value of one variable increases the value of the other variable decreases), or zero (i.e., the values of the two variables are unrelated). Correlation does not imply causation.
Data: (sing. datum) In statistics, data are quantifiable observations or measurements that are used as the basis of scientific research.
Demographic Data: Statistical information about a given subset of the human population, such as persons living in a particular area, shopping at an area mall, or subscribing to a local newspaper. Demographic data might include such information as age, gender, income distribution, or growth trends.
Dependent Variable: The outcome variable or resulting behavior that changes depending on whether the subject receives the control or experimental condition (e.g., a consumer's reaction to a new cereal).
Distribution: A set of numbers collected from data and their associated frequencies.
Factor Analysis: A multivariate statistical technique that analyzes interrelationships between variables and attempts to articulate their common underlying factors.
Independent Variable: The variable in an experiment or research study that is intentionally manipulated in order to determine its effect on the dependent variable (e.g., the independent variable of type of cereal might affect the dependent variable of the consumer's reaction to it).
Inferential Statistics: A subset of mathematical statistics used in the analysis and interpretation of data. Inferential statistics are used to make inferences such as drawing conclusions about a population from a sample and in decision making.
Model: A representation of a situation, system, or subsystem. Conceptual models are mental images that describe the situation or system. Mathematical or computer models are mathematical representations of the system or situation being studied.
Nonparametric Statistics: A class of statistical procedures that is used in situations where it is not possible to estimate or test the values of the parameters (e.g., mean, standard deviation) of the distribution or where the shape of the underlying distribution is unknown.
Normal Distribution: A continuous distribution that is symmetrical about its mean and asymptotic to the horizontal axis. The area under the normal distribution is 1. The normal distribution is actually a family of curves and describes many characteristics observable in the natural world. The normal distribution is also called the Gaussian distribution or the normal curve of errors.
Parametric Statistics: A class of statistical procedures that is used in situations where it is reasonable to make certain assumptions about the underlying distribution of the data and where the values to be analyzed are either interval- or ratio-level data.
Regression Analysis: A family of statistical techniques used to develop a mathematical model for use in predicting one variable from the knowledge of another variable.
Reliability: The degree to which a psychological test or assessment instrument yields consistent results across administrations. An assessment instrument cannot be valid unless it is reliable.
Variable: An object in a research study that can have more than one value. Independent variables are stimuli that are manipulated in order to determine their effect on the dependent variables (response). Extraneous variables are variables that affect the response but that are not related to the question under investigation in the study.
Bibliography
Armore, S. J. (1966). Introduction to statistical analysis and inferences for psychology and education. New York: John Wiley & Sons.
Brennan, R. T., Molnar, B. E., & Earls, F. (2007). Refining the measurement of exposure to violence (ETV) in urban youth. Journal of Community Psychology, 35, 603–618. Retrieved March 28, 2008, from EBSCO Online Database SocINDEX with Full Text. http://web.ebscohost.com/ehost/pdf?vid=5&hid=108&sid=e7761ab9-64c2-4b80-aea4-8ef8ea7ef230%40sessionmgr102
Cooley, W. W., & Lohnes, P. R. (1971). Multivariate data analysis. New York: John Wiley and Sons.
Holgado-Tello, F., Chacón-Moscoso, S., Barbero-García, I., & Vila-Abad, E. (2010). Polychoric versus Pearson correlations in exploratory and confirmatory factor analysis of ordinal variables. Quality & Quantity, 44, 153–166. Retrieved October 25, 2013, from EBSCO Online Database SocINDEX with Full Text. http://search.ebscohost.com/login.aspx?direct=true&db=sih&AN=47161252
Holosko, M. J. (2010). What types of designs are we using in social work research and evaluation? Research on Social Work Practice, 20, 665–673. Retrieved October 25, 2013, from EBSCO Online Database SocINDEX with Full Text. http://search.ebscohost.com/login.aspx?direct=true&db=sih&AN=54489183
Hollander, M., & Wolfe, D. A. (1973). Nonparametric statistical methods. New York: John Wiley and Sons.
Huff, D. (1954). How to lie with statistics. New York: W. W. Norton & Company.
Neyman, J. (1952). Lectures and conferences on mathematical statistics and probability (2nd ed.). Washington, DC: US Department of Agriculture.
Segal, E. A., Cimino, A. N., Gerdes, K. E., Harmon, J. K., & Wagaman, M. (2013). A confirmatory factor analysis of the Interpersonal and Social Empathy Index. Journal of the Society for Social Work & Research, 4, 131–153. Retrieved October 25, 2013, from EBSCO Online Database SocINDEX with Full Text. http://search.ebscohost.com/login.aspx?direct=true&db=sih&AN=90515573
Subrahmanyam, K., & Lin, G. (2007). Adolescents on the net: Internet use and well-being. Adolescence, 42, 659–677. Retrieved March 18, 2008, from EBSCO Online Database SocINDEX with Full Text. http://web.ebscohost.com/ehost/pdf?vid=4&hid=7&sid=0448787f-afa0-4373-819c-88135f67c7ab%40sessionmgr7
Thurstone, L. L. (1947). Multiple-factor analysis. Chicago: University of Chicago Press.
Witt, R. S. (1980). Statistics. New York: Holt, Rinehart and Winston.