Regression Analysis (Sociology)
Regression analysis is a powerful statistical tool widely used in fields such as sociology to understand and predict behaviors and interactions among individuals and groups. This methodology involves constructing mathematical models that relate one variable (the dependent variable) to one or more independent variables. Simple linear regression is the most basic form, allowing researchers to predict a dependent variable using a single independent variable, while multiple linear regression expands this to several independent variables.
Researchers also utilize advanced techniques such as multivariate and polynomial regression to account for more complex relationships, including nonlinear patterns. Understanding the correlation between variables, including positive and negative relationships, is integral to regression analysis; however, correlation alone does not imply causation.
While regression analysis can yield valuable insights, it requires careful consideration of underlying assumptions and potential data issues, such as multicollinearity and outliers. Its applications are broad, informing studies on topics like adolescent well-being in relation to internet use, and the dynamics of prejudice in the workplace. Overall, regression analysis serves as a crucial tool for sociologists seeking to quantify and predict social phenomena based on empirical data.
Subject Terms
Regression Analysis
Abstract
Regression analysis is a family of statistical tools that can help sociologists better understand and predict the way that people act and interact. Regression analysis is used to build mathematical models to predict the value of one variable from knowledge of another. Although statistical methods of correlation offer researchers techniques to help them better understand the degree to which two variables are consistently related, such knowledge alone is typically insufficient to predict behavior. Simple linear regression allows the value of one dependent variable to be predicted from the knowledge of one independent variable. Multiple linear regression can be used to develop models to predict the value of a dependent variable from the knowledge of the value of more than one independent variable.
Overview
Regression analysis is a family of statistical tools that can help sociologists better understand the way that people act and interact in groups and society. Regression analysis allows researchers to build mathematical models that can be used to predict the value of one variable from knowledge of another. There are a number of specific regression techniques that can be used by sociologists to model real-world behavior. These include:
- Simple linear regression analysis, which allows the modeling of two variables, one independent and one dependent
- Multiple linear regression analysis, which allows the modeling of two or more independent variables to predict one dependent variable
- Multiple curvilinear regression, where the relationship between variables is nonlinear (e.g., quadratic)
- Multivariate linear regression, which allows the simultaneous examination of several dependent variables
- Multivariate polynomial regression, which can be used to account for nonlinear relationships
The most commonly used of these techniques, simple linear regression and multiple linear regression, are discussed in the following sections.
Simple Linear Regression. Statistics offers sociology researchers a number of correlation techniques to help them better understand the degree to which two variables are consistently related. For example, correlation can help one understand the relationship between educational level and income level. Correlation coefficients range from -1.0 to +1.0, and the absolute value of the coefficient, which lies between zero and one, shows the degree of relationship between the two variables. A correlation of 1.0 (or -1.0) shows that the variables are completely related and a change in the value of one variable will signify a corresponding change in the other, while a correlation of 0.0 shows that there is no linear relationship between the two variables and that knowing the value of one variable will tell us nothing about the value of the other.
In addition to signifying the degree of relationship between two variables, a correlation coefficient also shows how the two variables are related. A positive correlation means that as the value of one variable increases, so does the value of the other variable. A negative correlation, on the other hand, means that as the value of one variable increases, the value of the other variable decreases. An example of a high positive correlation would be the relationship of weight to age for healthy children: the older the child is, the more he or she will probably weigh. An example of a high negative correlation would be the relationship between temperature and the likelihood of snow: the higher the temperature is, the less likely it is to snow.
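The two examples above can be made concrete with a short calculation. The following is a minimal sketch of the Pearson correlation coefficient, one common correlation technique; the function name and all of the data values are hypothetical and chosen only for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: age (years) and weight (kg) for six children.
ages = [2, 4, 6, 8, 10, 12]
weights = [12, 16, 21, 26, 32, 40]

# Hypothetical data: temperature (degrees C) and number of snow days.
temps = [-10, -5, 0, 5, 10, 15]
snow_days = [9, 7, 5, 2, 1, 0]

r_pos = pearson_r(ages, weights)    # expected to be strongly positive
r_neg = pearson_r(temps, snow_days)  # expected to be strongly negative

print(f"age vs. weight:            r = {r_pos:.3f}")
print(f"temperature vs. snow days: r = {r_neg:.3f}")
```

With these invented values the first coefficient comes out close to +1 and the second close to -1, matching the direction of each relationship described in the text.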
However, helpful as it is to know the correlation between two variables, that knowledge alone does not necessarily give us sufficient information to predict behavior. For example, although we may know that people who do their grocery shopping when they are hungry are more likely to buy impulse items than those who are not, we cannot necessarily predict that a particular person will purchase unneeded items at the grocery store just because he or she is hungry. Merely knowing that there is a positive correlation between these two variables is insufficient to allow us to predict whether a given person or type of person is more likely to exhibit this behavior. In situations where one needs to predict the value of one variable from knowledge of another variable based on the data, one can use simple linear regression.
Simple linear regression is a bivariate statistical tool that allows the value of one dependent variable to be predicted from the knowledge of one independent variable. Examples of sociological applications of simple linear regression include predicting the crime rate from population density, voting behavior in an election from voting behavior in the primary, and relative income based on gender. The pairs of data used in linear regression analysis are typically graphed on a scatter plot that shows the values of the points for two-variable numerical data. A line of best fit is superimposed on the scatter plot and used to predict the value of the dependent variable based on different values of the independent variable. A sample scatter plot with line of best fit is shown in Figure 1.
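The equation for the regression line is the statistical equivalent of the linear slope-intercept equation from basic algebra, y = mx + b. The population model is:
y = β0 + β1x + ε
and the regression line estimated from sample data is:
ŷ = b0 + b1x
where
y = the observed value of the dependent variable
ŷ = the predicted value of y
β0 = the population y intercept, estimated from the sample by b0
β1 = the population slope, estimated from the sample by b1
ε = the error term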
For example, a sociologist interested in the behavior of small groups might want to determine whether or not the efficacy of the decisions made in small groups could be predicted from the number of people in the group. Although larger group size could mean that there are more ideas, more contribution to the thinking process, and a larger potential for synergistic thinking, a larger group could also mean that more time would be required to reach a decision, the competition of ideas could lead to confusion, and coalitions could form within the group and make it harder to resolve disagreements. A predictive model for group size versus efficacy of decision making could be developed by setting up an experiment that compared the efficacy of decision making on the same problem for groups of various sizes. The slope of the line of best fit passing through the data points on the scatter plot could be mathematically calculated, using these data points to determine the equation of the simple regression line. This equation could then be used by the sociologist to recommend optimal group size for similar types of decisions or projects based on the single variable of number of group members.
The problem with drawing a line of best fit through a scatter plot, of course, is that unless all the pairs of data fall on one straight line, it is possible to draw multiple lines through a data set. The question faced by the researcher is how to determine which of these possible lines will yield the best predictions of the dependent variable from the independent variable. This can be accomplished mathematically through residual analysis.
In regression analysis, a residual is defined as the difference between an actual y value and the corresponding predicted y value, or y − ŷ. To find the line of best fit, it is important to reduce the vertical distance between the points on the scatter plot and the line. This is done by choosing the line that minimizes the sum of the squares of the residuals. By looking at the residuals, a researcher can better understand how well the regression line fits past data in order to estimate how well it will predict future data.
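The least-squares idea can be sketched in a few lines. The example below uses the standard closed-form formulas for the slope and intercept of a one-predictor least-squares line and then compares its sum of squared residuals with that of two other candidate lines. The group sizes and decision scores are hypothetical, as are the function names.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared differences between actual and predicted y values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: group size and a decision-quality score.
group_sizes = [3, 4, 5, 6, 7, 8]
scores = [62, 66, 71, 73, 78, 80]

b0, b1 = fit_simple_ols(group_sizes, scores)
residuals = [y - (b0 + b1 * x) for x, y in zip(group_sizes, scores)]

sse_best = sum_squared_residuals(group_sizes, scores, b0, b1)
# Any other line through the same data has a larger sum of squares.
sse_shifted = sum_squared_residuals(group_sizes, scores, b0 + 1.0, b1)
sse_steeper = sum_squared_residuals(group_sizes, scores, b0, b1 + 0.5)

print(f"fitted line: y-hat = {b0:.2f} + {b1:.2f} * x")
print(f"SSE of fitted line:  {sse_best:.2f}")
print(f"SSE of shifted line: {sse_shifted:.2f}")
print(f"SSE of steeper line: {sse_steeper:.2f}")
```

Shifting the intercept or changing the slope increases the sum of squared residuals, which is what makes the fitted line the "best" one under this criterion.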
Standard regression analysis techniques make several assumptions, including that the model is correct and that the data are good. Unfortunately, the types of real-world data needed by sociologists tend to be messy. As a result, these assumptions are rarely met in practice. Many factors can contribute to problems in regression analysis, including the use of an incorrect functional form for the regression function; correlation of variables; nonconstant variance; sample data with outliers; and multicollinearity among subsets of the input variables such that they exhibit nearly identical linear relations. If one or more of these problems occurs, the entire analysis may be invalidated. This risk is compounded by the fact that standard summary statistics give few indications of when these problems have occurred. Although there are other indicators and potential remedies for these situations, they must be used with caution. For example, non-uniform residual plots may indicate that the underlying functions are nonlinear. Although outliers and extreme points can be deleted from the analysis, researchers must take care when doing so, because such points may carry important information about the data, such as that other variables need to be included in the analysis. If these points are eliminated, this information is lost. There are also a number of ways to identify and rectify multicollinearity. However, these approaches are not interchangeable, and which method is best depends on the underlying cause of the problem.
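Two of the warning signs mentioned above can be illustrated with a rough sketch. The first check fits a straight line to deliberately curved data and inspects the residuals for a systematic pattern. The second computes the correlation between two candidate predictors as a simple screen for multicollinearity. All data are invented, and the 0.95 cutoff is an arbitrary assumption rather than an established rule.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Check 1: a straight line fitted to curved (quadratic) data.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [x ** 2 for x in xs]  # deliberately nonlinear
b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
# Residuals that are positive at both ends and negative in the middle
# form a non-uniform pattern that hints at a nonlinear relationship.
print("residuals:", [round(r, 2) for r in residuals])

# Check 2: two hypothetical predictors that measure nearly the same thing.
hours_per_week = [5, 10, 14, 21, 28, 35]
hours_per_day = [0.7, 1.5, 2.0, 3.1, 3.9, 5.0]
r_predictors = pearson_r(hours_per_week, hours_per_day)
CUTOFF = 0.95  # arbitrary illustrative threshold (assumption)
if abs(r_predictors) > CUTOFF:
    print(f"predictors are highly correlated (r = {r_predictors:.3f})")
```

A pairwise correlation only catches multicollinearity between two predictors at a time; relations among larger subsets of variables call for other diagnostics.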
Multiple Linear Regression. When correctly used, simple linear regression can be very useful for building models and predicting the value of one variable from the knowledge of the value of another variable. However, the types of problems investigated by sociologists in the real world are often more complex and include multiple variables. For many such situations, multiple linear regression can be used to model the data. This statistical technique allows the prediction of a dependent variable from the knowledge of the value of more than one independent variable. For example, in the illustration of group decision-making efficacy used above, there are probably many more factors than group size that influence the value of the decision made by the group. The value of the dependent variable (decision-making efficacy) may also depend on the type of decision the group is trying to make, the experience of the group members with that type of problem, or how comfortable the individual members are working in group settings, among other factors. By using multiple linear regression analysis instead of simple linear regression, the researcher can take all these independent variables into account to determine their effect on the dependent variable, that is, the efficacy of the group's decision making.
As another example, a sociologist might want to know what social and psychological factors predict the likelihood of welfare use and exit from the welfare system. Such information could help identify at-risk individuals so that preventative measures could be taken or additional programs and support could be provided to help them exit the system. Similarly, regression analysis could be used to investigate the relationship between the symptoms of post-traumatic stress disorder (PTSD), life events, and unit cohesion for soldiers. Such information has the potential to help predict what variables contribute to the development of PTSD during wartime and determine possible ways to ameliorate their effects.
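To show how several independent variables enter one model, the sketch below fits a multiple linear regression by solving the least-squares normal equations with plain Gaussian elimination. It loosely follows the group decision-making illustration: the predictors (group size and average years of member experience) and all values are hypothetical, and the outcome was constructed without noise from the coefficients 50, 2, and 3 so that the fitted values can be checked by eye. In practice a statistics package would normally be used instead of hand-written linear algebra.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_ols(rows, ys):
    """Least-squares coefficients [b0, b1, ..., bk] for predictor rows."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical predictors: (group size, average years of experience).
predictors = [(3, 1), (4, 5), (5, 2), (6, 7), (7, 3), (8, 6)]
# Outcome built as 50 + 2 * size + 3 * experience, with no noise.
efficacy = [59, 73, 66, 83, 73, 84]

coefs = fit_multiple_ols(predictors, efficacy)
print("intercept and slopes:", [round(c, 3) for c in coefs])

# Predict efficacy for a new hypothetical group.
new_group = (5, 4)
predicted = coefs[0] + sum(b * v for b, v in zip(coefs[1:], new_group))
print("predicted efficacy:", round(predicted, 2))
```

Each slope describes the expected change in the dependent variable for a one-unit change in that predictor while the other predictor is held constant.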
Applications
Regression analysis is used in a wide variety of sociological research situations to help researchers better understand the relationship between variables and thus predict behaviors (dependent variables) based on contributing factors (independent variables). The following sections highlight two such studies. The first used regression analysis to examine the relationship between Internet use and adolescent well-being. The second used regression analysis to examine prejudice toward women in the workplace based on social dominance orientation, right-wing authoritarianism, and sexism of others.
Internet Use & Adolescent Well-Being. The dependence of twenty-first-century adolescents on electronic means of communication is evident in the omnipresence of cell phones, personal computers, and various other forms of Internet access. Many people have expressed concern that the dependence on the Internet as a medium for social interaction will lead to fragile emotional ties and will affect overall well-being in a negative way. Subrahmanyam and Lin (2007) investigated the effects of Internet use on the well-being of adolescents using regression analysis. Specifically, the study examined the effect of Internet usage on adolescents' feelings of isolation and on the amount of support from those close to them, including friends and family. Participants in the study were 78 females and 78 males between 15 and 18.4 years of age. The participants completed an Internet access questionnaire that asked how they used the Internet, including the total time they spent online, how much time they spent using e-mail, and where they accessed the Internet from. The questionnaire also explored their knowledge of and familiarity with their online correspondents and their relationships with those individuals. In addition, the eight-item Roberts revision of the UCLA Loneliness Scale was used to measure the loneliness level of the participants, and the 24-item Social Support Scale for Children was used to determine the availability of others to whom the participants could turn in times of need and how satisfied they were with that support.
The data collected in the study were analyzed using regression analysis. The results suggested that loneliness is not predicted by the length of time spent on the Internet, but rather by gender and by the participants' perceptions of their online relationships. Further, participants who felt that their online partners could be counted on in times of need tended to be lonelier than those who did not. The amount of time spent e-mailing, browsing online, or forging relationships with online partners did not affect the perceived support from loved ones.
The implications of the study are relevant to parents, teachers, and anyone else concerned about the well-being of adolescents. Although previous research has shown chat rooms to be dangerous places, current studies show that with proper supervision, they can be of great benefit to those adolescents who are lonely or feel they are not receiving sufficient support from parents and close friends. The study results also suggest that monitoring the length of time spent on the Internet is not in itself an effective way to supervise adolescents' Internet usage; it is also important to monitor what they do online, with whom they correspond, and what types of correspondence they have with their online partners.
Prejudice toward Women in the Workplace. Despite the strides toward equality in the workplace that have been achieved over the past few decades, prejudice toward women still exists. Christopher and Wojda (2008) examined the effects of social dominance orientation and right-wing authoritarianism on prejudice against women in the workplace, as expressed through employment skepticism and traditional role preference. The authors also examined the mediating effects of hostile sexism on these relationships. Of 1,500 potential participants who were contacted to take part in the study, 349 (182 women and 167 men) responded. These individuals filled out a 14-item scale about social dominance orientation, a 20-item scale measuring right-wing authoritarianism, the 10-item Brief Multidimensional Aversion to Women Who Work Scale, and the 22-item Ambivalent Sexism Inventory. They also answered questions designed to gather various pieces of demographic information.
The results were analyzed using a series of multiple regression analyses. The first two of these analyses tested whether or not social dominance orientation accounted for the variation in employment skepticism and whether right-wing authoritarianism accounted for the variation in preference for traditional roles. The next two regression analyses examined whether hostile sexism mediated the relationship between social dominance orientation and employment skepticism and whether benevolent sexism mediated the relationship between right-wing authoritarianism and traditional role preference. These analyses indicated that hostile sexism partially mediates the relationship between social dominance orientation and employment skepticism, while benevolent sexism fully mediates the relationship between right-wing authoritarianism and traditional role preference. The final two regression analyses examined whether or not benevolent sexism mediated the relationship between social dominance orientation and employment skepticism and whether hostile sexism mediated the relationship between right-wing authoritarianism and traditional role preference. The results of these analyses indicated that benevolent sexism does not significantly mediate the relationship between social dominance orientation and employment skepticism, and hostile sexism does not mediate the relationship between right-wing authoritarianism and traditional role preference. The results of the study added to the current research literature on this topic by elucidating the complex and multifaceted relationship between prejudice against women in the workplace and the attitudes and ideologies of others. The complex relationships between these factors have several implications.
For example, if a more qualified woman were denied a promotion in favor of a less qualified man, the supervisor making the decision might justify it by saying that the promotion would require longer hours, effectively reducing the woman's ability to fulfill her responsibilities in the home (traditional role preference). The supervisor could also deem the woman unfit to perform the job or task because of her gender (employment skepticism). The type of knowledge obtained from this and other research will enable the development of better tools for improving the treatment of women in the workplace and of training programs to educate supervisors about equal treatment and fair employment practices.
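The idea of testing mediation with a series of regressions can be illustrated with a small sketch. It follows one common regression-based approach: estimate the total effect of a predictor on an outcome, the effect of the predictor on the mediator, and the direct effect of the predictor once the mediator is controlled. This is not necessarily the exact procedure used by Christopher and Wojda (2008). The scores are synthetic and do not come from that study, and all names are hypothetical.

```python
def slope(xs, ys):
    """Slope from regressing ys on a single predictor xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def two_predictor_slopes(x1, x2, ys):
    """Slopes for x1 and x2 when ys is regressed on both, with intercept."""
    n = len(ys)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(ys) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (y - my) for a, y in zip(x1, ys))
    s2y = sum((b - m2) * (y - my) for b, y in zip(x2, ys))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# Synthetic scores (NOT study data): a predictor, a candidate mediator,
# and an outcome, for eight hypothetical respondents.
predictor = [1, 2, 3, 4, 5, 6, 7, 8]
mediator = [2.5, 3.5, 5.5, 8.5, 10.5, 11.5, 13.5, 16.5]
outcome = [9.0, 12.0, 20.0, 29.0, 36.0, 41.0, 47.0, 58.0]

c_total = slope(predictor, outcome)   # total effect of predictor
a_path = slope(predictor, mediator)   # predictor -> mediator
c_direct, b_path = two_predictor_slopes(predictor, mediator, outcome)

print(f"total effect (c):        {c_total:.3f}")
print(f"predictor -> mediator:   {a_path:.3f}")
print(f"mediator -> outcome (b): {b_path:.3f}")
print(f"direct effect (c'):      {c_direct:.3f}")
print(f"indirect effect (a * b): {a_path * b_path:.3f}")
```

In these synthetic data the direct effect is smaller than the total effect but not zero, the pattern usually described as partial mediation; a direct effect near zero would correspond to full mediation. Real analyses also test whether each coefficient is statistically significant, which this sketch omits.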
Conclusion
One of the tools available to sociologists in their quest to describe and predict human behavior within society is regression analysis. This is a family of statistical techniques that allows researchers to build mathematical models that can be used to predict the value of a dependent variable from knowledge of one or more independent variables. Regression analysis techniques make several assumptions about the underlying data that frequently do not apply to real-world data. With the cautious and appropriate use of various indicators and potential remedies, however, regression analysis can often yield models of the interactions of variables in the real world so that the state of sociological theory and practice can be advanced.
Terms & Concepts
Correlation: The degree to which two events or variables are consistently related. Correlation may be positive (as the value of one variable increases, so does the value of the other variable), negative (as the value of one variable increases, the value of the other variable decreases), or zero (the values of the two variables are unrelated). Correlation does not imply causation.
Data: In statistics, quantifiable observations or measurements that are used as the basis of scientific research.
Demographic Data: Statistical information about a given subset of the human population, such as persons who live in a particular area, shop at an area mall, or subscribe to a local newspaper. Demographic data might include such information as age, gender, or income distribution.
Dependent Variable: The outcome variable or resulting behavior that changes depending on whether the subject receives the control or experimental condition.
Employment Skepticism: The attitude that women do not belong in the workplace because they do not have the skills or abilities to work outside the home, they find the demands of the workplace too difficult, or some other gender-based rationale.
Independent Variable: The variable in an experiment or research study that is intentionally manipulated in order to determine its effect on the dependent variable.
Linear Regression: A statistical technique used to develop a mathematical model for use in predicting one variable from the knowledge of another variable or variables.
Model: A representation of a situation, system, or subsystem. Conceptual models are mental images that describe the situation or system. Mathematical or computer models are mathematical representations of the system or situation being studied.
Multicollinearity: When two or more independent variables in a multiple regression analysis are highly correlated.
Right-Wing Authoritarianism: An attitude in which the individual displays a high degree of deference to established authority, is aggressive toward societal outgroups (when such behavior is permitted by perceived authorities), and supports traditional values endorsed by authorities (Christopher & Wojda, 2008).
Sample: A subset of a population. A random sample is a sample that is chosen at random from the larger population with the assumption that it will reflect the characteristics of the larger population.
Social Dominance Orientation: The degree to which a person prefers a hierarchical social system that enforces superiority over groups of lower status.
Variable: An object in a research study that can have more than one value. Independent variables are stimuli that are manipulated in order to determine their effect on the dependent variables. Extraneous variables are variables that affect the outcome of the study but are not related to the question under investigation.
Bibliography
Black, K. (2006). Business statistics for contemporary decision making (4th ed.). New York: John Wiley & Sons.
Christopher, A. N., & Wojda, M. R. (2008). Social dominance orientation, right-wing authoritarianism, sexism, and prejudice toward women in the workforce. Psychology of Women Quarterly, 32, 65–73. Retrieved March 17, 2008, from EBSCO Online Database SocINDEX with Full Text. http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=30101363&site=ehost-live
Nelemans, S., Branje, S., Hale, W., Goossens, L., Koot, H., Oldehinkel, A., & Meeus, W. (2016). Discrepancies between perceptions of the parent-adolescent relationship and early adolescent depressive symptoms: An illustration of polynomial regression analysis. Journal of Youth & Adolescence, 45(10), 2049–2063. Retrieved October 29, 2018, from EBSCO Online Database Sociology Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=sxi&AN=118028756&site=ehost-live&scope=site
Siordia, C., Saenz, J., & Tom, S. E. (2012). An introduction to macro-level spatial nonstationarity: A geographically weighted regression analysis of diabetes and poverty. Human Geographies: Journal of Studies & Research in Human Geography, 6, 5–13. Retrieved November 6, 2013, from EBSCO Online Database SocINDEX with Full Text. http://search.ebscohost.com/login.aspx?direct=true&db=sih&AN=85124037&site=ehost-live
Subrahmanyam, K., & Lin, G. (2007). Adolescents on the net: Internet use and well-being. Adolescence, 42, 659–677. Retrieved March 18, 2008, from EBSCO Online Database SocINDEX with Full Text. http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=28031045&site=ehost-live
Takagi, D., Ikeda, K., & Kawachi, I. (2012). Neighborhood social capital and crime victimization: Comparison of spatial regression analysis and hierarchical regression analysis. Social Science & Medicine, 75, 1895–1902. Retrieved November 6, 2013, from EBSCO Online Database SocINDEX with Full Text. http://search.ebscohost.com/login.aspx?direct=true&db=sih&AN=80032953&site=ehost-live
Timm, N. H. (1975). Multivariate analysis with applications in education and psychology. Monterey, CA: Brooks/Cole Publishing Company.
Trahan, A., & Russell, J. (2017). Race and police use of force: A regression analysis of varying situational approval from 1972 to 2012. Applied Psychology in Criminal Justice, 13(2), 142–154. Retrieved October 29, 2018, from EBSCO Online Database Sociology Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=sxi&AN=127011895&site=ehost-live&scope=site
Vis, B. (2012). The comparative advantages of fsQCA and regression analysis for moderately large-n analyses. Sociological Methods & Research, 41, 168–198. Retrieved November 6, 2013, from EBSCO Online Database SocINDEX with Full Text. http://search.ebscohost.com/login.aspx?direct=true&db=sih&AN=77340047&site=ehost-live
Witte, R. S. (1980). Statistics. New York: Holt, Rinehart and Winston.
Suggested Reading
Brailey, K., Vasterling, J. J., Proctor, S. P., Constans, J. I., & Friedman, M. J. (2007, Aug). PTSD symptoms, life events, and unit cohesion in U.S. soldiers: Baseline findings from the neurocognition deployment health study. Journal of Traumatic Stress, 20, 495–503. Retrieved March 18, 2008, from EBSCO Online Database SocINDEX with Full Text. http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=26382263&site=ehost-live
Kozimor-King, M. L. (2008, Mar). Does belief matter? Social psychological characteristics and the likelihood of welfare use and exit. Journal of Sociology & Social Welfare, 35, 197–219. Retrieved March 17, 2008, from EBSCO Online Database SocINDEX with Full Text. http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=31120720&site=ehost-live
Lindner, A. M. (2012). Teaching quantitative literacy through a regression analysis of exam performance. Teaching Sociology, 40, 50–59. Retrieved November 6, 2013, from EBSCO Online Database SocINDEX with Full Text. http://search.ebscohost.com/login.aspx?direct=true&db=sih&AN=70250877&site=ehost-live
McClure, T. E., & May, D. C. (2008). Dealing with misbehavior at schools in Kentucky. Youth & Society, 39, 406–429. Retrieved March 17, 2008, from EBSCO Online Database SocINDEX with Full Text. http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=29385367&site=ehost-live
Wienclaw, R. A. (2018). Inferential statistics. Retrieved October 29, 2018, from EBSCO Online Database Research Starters - Sociology. http://search.ebscohost.com/login.aspx?direct=true&db=rst&AN=36267982&site=ehost-live&scope=site