Multiple regression

Multiple regression is a statistical method for estimating the value of a dependent variable from the values of two or more independent variables. The concept was first developed in the early twentieth century and evolved from the study of linear regression, the most basic form of predictive analysis. Researchers use multiple regression to estimate how strongly each independent variable influences the outcome they are studying. These estimates can help identify the factors that produce an outcome or forecast an effect or trend.


Origins

The concept of statistical regression was first developed around the beginning of the nineteenth century as a way to calculate the effect of one variable on another. It grew out of a desire to predict the movement of heavenly bodies to aid in the navigation of seafaring vessels. French mathematician Adrien-Marie Legendre was the first to publish his findings, but credit for the discovery is often given to German mathematician Carl Friedrich Gauss. Known as the least squares method, the process fits a predictive model to many observations by minimizing the sum of the squared differences between the observed and predicted values. A variable is something that can be measured, such as a period of time, age, height, or salary. An independent variable is a statistic that is not affected by other variables. A person’s age or the number of floors in a house are examples of independent variables. A dependent variable is a statistic that depends on other factors. A football quarterback’s passing yards, for instance, may depend on the defense he is playing against or the weather during the game.
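The least squares idea can be sketched in a few lines of code. The following is a minimal illustration with made-up data points (not from the source): it fits a straight line to one independent variable and shows the quantity that the method minimizes.

```python
import numpy as np

# Illustrative sketch of least squares with synthetic data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable (e.g., years)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # dependent variable

# np.polyfit chooses the slope and intercept that minimize the sum of
# squared differences between observed and predicted values.
slope, intercept = np.polyfit(x, y, 1)
predictions = slope * x + intercept
residuals = y - predictions
print(slope, intercept)
print(np.sum(residuals**2))  # total squared error, the quantity minimized
```

Any other line through these points would produce a larger sum of squared residuals, which is what "minimizing error" means in this context.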

In the late nineteenth century, English statistician Francis Galton first used the term regression to describe the relationship between two variables. Galton compared the weight of sweet pea seeds to the seeds from their parent plants and found that the offspring seeds differed from their parents in a predictable manner: Parent seeds usually produced seeds of similar size; however, extremely large parents tended to produce smaller seeds, and extremely small parents tended to produce larger seeds. His discovery became known as regression to the mean—the idea that extreme examples are usually followed by something closer to normal parameters. This concept is illustrated by drawing a line on a graph of various data points. The line is called the regression line and represents the mean, or statistical average.

After his experiment, Galton realized that more than one factor could have played a role in the size of his sweet pea seeds. He noticed that, as in humans, genetic traits might skip a generation. For example, a daughter may resemble her grandmother more than she may resemble her mother. Prior generations, such as the grandparents or great-grandparents, may also have affected the sweet pea seeds.

Galton’s student, the English mathematician Karl Pearson, also understood that comparing only two variables might not provide a reliable statistical representation of the data. Pearson built upon his mentor’s work and examined the effect multiple independent variables had on a dependent variable. Using what he called multiple regression, he took the information obtained from multiple independent variables and examined the data to make more accurate statistical predictions.

Examples of Multiple Regression

Comparing two variables can provide a reliable statistical snapshot of a situation; however, it is limited by the amount of available data. If a researcher, for example, is trying to determine the factors involved in how much a person earns and the only data available are salary and age, the researcher may conclude that the two have a direct correlation. The salary in this case is a dependent variable and age is an independent variable.

Age may play a role in deciding a person’s salary, but there are usually many other factors that affect the outcome. Time with the company, experience level, performance on the job, and level of education are just some of the other factors that can be considered. Using multiple regression, a researcher can fit all of these independent variables at once and estimate how much each one contributes to salary. The researcher may find, for instance, that while all the independent variables have an effect on salary, people who earn the most are also more likely to have a higher level of education. From this data, the researcher might not only draw a correlation between the two variables but also might predict that people who earn a college degree have a better chance of making more money when they enter the workforce.
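A fit like the one described above can be carried out with ordinary least squares. The sketch below uses invented salary figures and a subset of the factors mentioned (age, education, experience); the numbers and variable names are illustrative assumptions, not real data.

```python
import numpy as np

# Hypothetical data: one row per person.
# Columns: age (years), education (years of schooling), experience (years).
X = np.array([
    [25, 12,  2],
    [32, 16,  6],
    [40, 12, 15],
    [45, 18, 20],
    [51, 16, 25],
    [38, 14, 10],
], dtype=float)
salary = np.array([38, 55, 52, 75, 70, 58], dtype=float)  # in $1,000s

# Add a column of ones so the model includes an intercept term.
A = np.column_stack([np.ones(len(X)), X])

# Solve for the coefficients that minimize squared prediction error.
coef, *_ = np.linalg.lstsq(A, salary, rcond=None)
intercept, b_age, b_edu, b_exp = coef
print(intercept, b_age, b_edu, b_exp)

# Predict salary for a new person: age 35, 16 years education, 8 years experience.
print(np.array([1, 35, 16, 8]) @ coef)
```

Each coefficient estimates the change in salary associated with a one-unit change in that variable while the other variables are held constant, which is what lets the researcher separate, say, the effect of education from the effect of age.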

Multiple regression also helps guard against making inaccurate conclusions based only on partial correlation. For example, if a researcher compares hair length with average salary, the data may suggest that people with shorter hair make more money. If another independent variable, such as age, is added to the equation, a more accurate picture is generated. Older men tend to have less hair, and older women may be more likely to have shorter hair; therefore, the new data map would show a higher correlation between age and salary. The impact of hair length on salary would be minimized or eliminated.
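The hair-length example can be demonstrated numerically. In the sketch below, the data are fabricated so that age drives both shorter hair and higher salary, as the paragraph above describes; the specific values are assumptions for illustration only.

```python
import numpy as np

# Made-up data in which age is a confounder: older people in this sample
# have shorter hair and higher salaries.
age = np.array([25, 30, 35, 40, 45, 50, 55, 60], dtype=float)
hair_cm = np.array([42, 34, 30, 26, 18, 15, 11, 4], dtype=float)
salary = np.array([40, 47, 54, 62, 70, 77, 84, 92], dtype=float)  # $1,000s

# Simple regression: salary on hair length alone.
slope_simple, _ = np.polyfit(hair_cm, salary, 1)
print(slope_simple)  # negative: shorter hair appears linked to higher pay

# Multiple regression: salary on hair length and age together.
A = np.column_stack([np.ones(len(age)), hair_cm, age])
coef, *_ = np.linalg.lstsq(A, salary, rcond=None)
_, slope_hair, slope_age = coef
print(slope_hair, slope_age)  # hair effect shrinks toward zero; age dominates
```

Once age is in the model, the coefficient on hair length collapses toward zero, showing that the apparent hair-salary relationship was an artifact of the shared dependence on age.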
