Linear regression

In statistics, a regression is a method of estimating the relationship between two or more variables based on observed data. Linear regression is the method used when that relationship is best expressed by a linear equation, that is, one of the basic form y = a + bx. The line described by this equation is often referred to as a regression line or a line of best fit, meaning that when the data is plotted in a scatter diagram, it falls roughly along that line. There are two forms of linear regression, simple linear regression and multiple linear regression; which one to use depends on how many variables must be accounted for.

Overview

Regression analysis in statistics is based on the concept of regression toward the mean. The earliest regression model was the method of least squares, published in 1805 by Adrien-Marie Legendre.

Simple linear regression is used in situations in which the observed data has only two variables, one explanatory or independent (x) and one dependent (y), meaning that the value of y changes in response to the value of x. For example, if one assumes that a student’s score on a test depends on how much time that student spent studying, then the score is y and the time spent studying is x. If these two values are provided for each student in a class, the set of coordinates (x, y) can then be plotted on a graph, and the data set can be used to determine the formula for the regression line. In such a case, the regression line is represented by the equation y = a + bx, where b represents the slope of the line and a is the y-intercept, or the value of y where the line crosses the y-axis.
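Once a and b are known, the equation y = a + bx can be used directly to estimate y for any x. A minimal sketch of this, using hypothetical (not fitted) values a = 50 and b = 5 for the study-time example:

```python
def predict_score(hours, a=50.0, b=5.0):
    """Estimate a test score from hours studied using the line y = a + b*x.

    The intercept a = 50 and slope b = 5 are assumed example values,
    not coefficients fitted from real data.
    """
    return a + b * hours

print(predict_score(4))  # 50 + 5 * 4 = 70.0
```

Here a student who studied for 4 hours would be predicted to score 70 on the test, assuming those coefficients.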

The most common method of simple linear regression is Legendre’s method of least squares, so named because it makes the sum of the squares of the errors (the distance of each data point from the line) the smallest possible amount. In this method, the means of the observed values of x and y are first computed. Using the example above, the amount of time each student spent studying would be added together and then divided by the number of students, resulting in the average value of x, represented by the variable x̄. The test scores would be similarly averaged to produce ȳ.

The next step is to calculate b, the slope of the regression line: For each set of coordinates, subtract x̄ from x and ȳ from y, then multiply the two differences. Next, add together the resulting products. This sum, which can be mathematically expressed as Σ(x − x̄)(y − ȳ) and represented here by the variable c, is the numerator of the fraction that will express the value of b. To determine the value of the denominator d, subtract x̄ from each x value and square each resulting difference. Add the resulting squares together. This sum, Σ(x − x̄)², is d. Divide c by d to get b.

The final step is to calculate a, the y-intercept. First, multiply b by x̄. Next, subtract the resulting product from ȳ to produce a, which can be expressed as a = ȳ − bx̄. Once both a and b have been determined, a regression line can be plotted that estimates the grade a student would receive on the test based on the amount of time spent studying.
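The steps above can be sketched in a short program. The data below (hours studied and test scores for five students) is invented for illustration; the function follows the article’s procedure: compute the means, then the slope b = c/d, then the intercept a = ȳ − bx̄.

```python
def least_squares_fit(xs, ys):
    """Fit y = a + b*x by the method of least squares, following the
    steps described above: means, then slope, then intercept."""
    n = len(xs)
    x_bar = sum(xs) / n  # mean of the x values
    y_bar = sum(ys) / n  # mean of the y values
    # c = sum of (x - x_bar)(y - y_bar), the numerator of b
    c = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # d = sum of (x - x_bar)^2, the denominator of b
    d = sum((x - x_bar) ** 2 for x in xs)
    b = c / d            # slope
    a = y_bar - b * x_bar  # y-intercept
    return a, b

# Hypothetical data: hours studied vs. test score for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 63, 71, 79]
a, b = least_squares_fit(hours, scores)
print(a, b)  # 45.5 6.5, so the fitted line is y = 45.5 + 6.5x
```

With these numbers, each additional hour of studying is associated with about 6.5 more points on the test, and the line predicts a score of 45.5 for a student who did not study at all.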
