Linear regression
Linear regression is a statistical method used to estimate the relationship between two or more variables by fitting a linear equation to observed data. The fundamental equation for linear regression is expressed as \( y = a + bx \), where \( y \) is the dependent variable, \( x \) is the independent variable, \( b \) is the slope of the line, and \( a \) is the y-intercept. This method is particularly useful for predicting outcomes based on input variables, such as determining a student's test score based on study time. There are two main types of linear regression: simple linear regression, which involves two variables, and multiple linear regression, which involves more than two variables. The method of least squares is commonly employed to minimize the sum of the squares of the errors in the regression line, ensuring the best fit for the data. By gathering and averaging the observed values of the variables, one can derive the slope and intercept, ultimately allowing for the creation of a regression line that visually represents the relationship between the variables. Linear regression plays a crucial role in various fields, including economics, psychology, and the natural sciences, providing valuable insights into data trends and relationships.
In statistics, a regression is a method of estimating the relationship between two or more variables based on observed data. Linear regression is the method used when that relationship can be best expressed by a linear equation, that is, in the basic format y = a + bx. The line described by this equation is often referred to as a regression line or a line of best fit, meaning that when the data is plotted in a scatter diagram, it falls roughly along that line. There are two forms of linear regression, simple linear regression and multiple linear regression; which one to use depends on how many variables must be accounted for.
Overview
Regression analysis in statistics is based on the concept of regression toward the mean. The earliest regression model was the method of least squares, published in 1805 by Adrien-Marie Legendre.
Simple linear regression is used in situations in which the observed data has only two variables, one explanatory or independent (x) and one dependent (y), meaning that the value of y changes in response to the value of x. For example, if one assumes that a student’s score on a test is dependent on how much time that student spent studying, then the score is y and the time spent studying is x. If these two values are provided for each student in a class, the set of coordinates (x, y) can then be plotted on a graph, and the data set can be used to determine the formula for the regression line. In such a case, the regression line is represented by the equation y = a + bx, where b represents the slope of the line and a is the y-intercept, or the value of y when the line crosses the y-axis.
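The role of the equation y = a + bx can be illustrated with a short Python sketch. The intercept, slope, and study-time values below are hypothetical, chosen only to show how the equation turns an x value (hours studied) into a predicted y value (test score); they are not fitted from real data.

```python
# Hypothetical regression line y = a + bx for the study-time example.
# The intercept a and slope b are assumed values, not fitted ones.
a = 50.0   # y-intercept: predicted score with zero hours of study
b = 5.0    # slope: predicted points gained per additional hour of study

hours = [1, 2, 3, 4]                    # independent variable x
predicted = [a + b * x for x in hours]  # dependent variable y
print(predicted)  # [55.0, 60.0, 65.0, 70.0]
```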
The most common method of simple linear regression is Legendre’s method of least squares, so named because it makes the sum of the squares of the errors (the vertical distance of each data point from the line) as small as possible. In this method, the means of all the observed values of x and y are computed first. Using the example above, the amount of time each student spent studying would be added together and then divided by the number of students, resulting in the average value of x, represented by the variable \( \bar{x} \). The test scores would be similarly averaged to produce \( \bar{y} \).
The next step is to calculate b, the slope of the regression line. For each pair of coordinates, subtract \( \bar{x} \) from x and \( \bar{y} \) from y, then multiply the two differences. Next, add the resulting products together. This sum, which can be mathematically expressed as \( \sum (x - \bar{x})(y - \bar{y}) \) and represented here by the variable c, is the numerator of the fraction that will express the value of b. To determine the value of the denominator d, subtract \( \bar{x} \) from each x value and square each resulting difference. Add the resulting squares together; this sum, \( \sum (x - \bar{x})^2 \), is d. Divide c by d to get b.
The final step is to calculate a, the y-intercept. First, multiply b by \( \bar{x} \). Next, subtract the resulting product from \( \bar{y} \) to produce a, which can be expressed as \( a = \bar{y} - b\bar{x} \). Once both a and b have been determined, a regression line can be plotted that estimates the grade a student would receive on the test based on the amount of time spent studying.
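The least-squares steps described above can be sketched in Python. The study-time and score data here are hypothetical, used only to exercise the slope and intercept formulas; each line mirrors one step of the procedure.

```python
def least_squares(xs, ys):
    """Fit y = a + bx by the method of least squares."""
    n = len(xs)
    x_bar = sum(xs) / n   # mean of the x values
    y_bar = sum(ys) / n   # mean of the y values
    # c: the sum of (x - x_bar)(y - y_bar), the numerator of b
    c = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # d: the sum of (x - x_bar) squared, the denominator of b
    d = sum((x - x_bar) ** 2 for x in xs)
    b = c / d              # slope of the regression line
    a = y_bar - b * x_bar  # y-intercept
    return a, b

# Hypothetical data: hours studied and the resulting test scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 68, 77]
a, b = least_squares(hours, scores)
print(a, b)  # 46.0 6.0
```

With these values, the fitted line y = 46 + 6x predicts, for example, a score of 64 for a student who studied three hours.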