Central limit theorem
The central limit theorem is a concept in statistics which states that the distribution of the sample mean will approach a normal distribution as the sample size gets larger. In other words, even if data obtained from independent random samples seem skewed to one side, the sample means will eventually reflect the true statistical average as the sample size is increased. The central limit theorem is a fundamental pillar of statistics and allows researchers to arrive at conclusions about entire populations by examining data from smaller samples. Statisticians disagree on what constitutes a large enough sample size for the central limit theorem to provide valid results. In general, sample sizes of thirty or more are considered sufficient, although some researchers believe samples should be larger than forty or fifty.
![This figure demonstrates the central limit theorem, illustrating that increasing sample sizes result in sample means that are more closely distributed about the population mean. By Gerbem (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0) or GFDL (http://www.gnu.org/copyleft/fdl.html)], via Wikimedia Commons](https://imageserver.ebscohost.com/img/embimages/ers/sp/embedded/rssalemscience-259263-149117.jpg?ephost1=dGJyMNHX8kSepq84xNvgOLCmsE2epq5Srqa4SK6WxWXS)
![This chart shows the mean of binomial distributions of different sample sizes. Simulations show that as the sample size increases, the sample mean converges toward the true mean. By Daniel Resende [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons](https://imageserver.ebscohost.com/img/embimages/ers/sp/embedded/rssalemscience-259263-149116.jpg?ephost1=dGJyMNHX8kSepq84xNvgOLCmsE2epq5Srqa4SK6WxWXS)
Background
Statistics is a mathematical science that collects numerical data samples and analyzes those samples to determine the probability that they represent a larger whole. If a statistician wanted to discover the average height of every person in the United States, it would be a near impossible task to measure more than three hundred million people. Therefore, a researcher would measure a smaller sample size of people chosen at random so as not to unintentionally influence the results. Taking the sum of all the heights in the example and dividing it by the number of people sampled would reveal the statistical mean, or average.
Because the statistician is only measuring a segment of the population, several variable factors must be taken into consideration. The variance measures how far the numbers in a data set are spread from the mean. Mathematically, variance is determined by taking each point's distance from the mean and squaring that number, or multiplying it by itself; the variance is the average of those squared distances. The standard deviation is a measure of the dispersion of the data set from the mean: the further the data points are from the mean, the higher the standard deviation will be. Standard deviation is the square root of the variance. For example, if a sample size of four people reveals their heights to be 50 inches, 66 inches, 74 inches, and 45 inches, then the mean would be 58.75 inches. The variance in this case would be about 137.69, and the standard deviation about 11.73.
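The height example above can be checked directly. The following is a minimal sketch of the variance and standard-deviation arithmetic using Python's standard `statistics` module; it computes the population variance (dividing by the number of data points), as described in the text.

```python
from statistics import mean

# Heights (in inches) from the four-person sample in the text
heights = [50, 66, 74, 45]

m = mean(heights)  # 58.75

# Population variance: the average of the squared distances from the mean
variance = sum((h - m) ** 2 for h in heights) / len(heights)

# Standard deviation: the square root of the variance
std_dev = variance ** 0.5

print(m, round(variance, 2), round(std_dev, 2))  # 58.75 137.69 11.73
```

Note that dividing by one fewer than the number of points (the "sample variance") would give a somewhat larger value; the population form is used here to match the description in the text.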
Normal distribution is the probability distribution of data points in a symmetrical manner, with most of the points situated around the mean. This can be illustrated by the common bell curve, a graphic with a rounded peak in the center that tapers away at either end. In a graph representing a normal distribution, the mean is represented by the central peak of the curve, while the standard deviation determines the curve's width: a larger standard deviation produces a flatter, more spread-out curve.
Overview
Using height as an example is fairly straightforward, as most people in a population tend to be at or close to average. The central limit theorem comes into play when the data from a sample does not fit the normal distribution and seems to misrepresent the statistical probability of the data. In an analysis of average height, recording the measurements of eight people may yield reliable data that leads to an accurate result. If a statistician is trying to measure wealth by recording the incomes of eight people, however, a disparity in one respondent's income may skew the results. For example, seven of the eight people may earn between $30,000 and $100,000, but if the eighth person is a millionaire, then the statistical mean would be far greater than the income of the second-wealthiest respondent. The central limit theorem holds that if the sample size is increased, then the results will move closer to a normal distribution, yielding a more accurate depiction of the average household wealth. Many statisticians say that a sample size of thirty or more is enough to achieve accurate results; however, some insist on a size of more than forty or fifty. In cases where the data points are unusually irregular, a statistician may need to utilize a larger sample size.
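The effect described above can be simulated. The sketch below, using only Python's standard library, draws repeated samples from a deliberately skewed population (an exponential distribution, chosen here purely for illustration) and shows that the sample means cluster more tightly around the true population mean as the sample size grows.

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the simulation is repeatable

def sample_means(sample_size, n_samples=1000):
    """Draw repeated samples from a skewed (exponential) population
    with true mean 1.0, and return the mean of each sample."""
    return [mean(random.expovariate(1.0) for _ in range(sample_size))
            for _ in range(n_samples)]

small = sample_means(5)    # means of many small samples
large = sample_means(50)   # means of many larger samples

# Larger samples produce means that spread far less around the true mean
print(stdev(small), stdev(large))
```

Even though each individual draw comes from a skewed distribution, the means of the size-50 samples are both centered near the true average and far less dispersed than the means of the size-5 samples, which is the central limit theorem in action.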
In practice, statisticians often take repeated random samples of the same size from the same population. Each sample is averaged to produce a single data point, and the process is repeated a number of times to build a data set. If, in the wealth example, the sample size is three people, then the incomes of three people would be recorded and averaged, and that mean would become one data point. For example, if a random sample of three people reported incomes of $21,000, $36,000, and $44,000, their mean income, and thus the data point, would be $33,667.
For comparison's sake, assume the average salary in the United States was $45,000. If the survey was done correctly, the sample means should cluster near this figure. A group of five data points that yielded income figures of $33,000, $39,000, $44,000, $351,000, and $52,000 would result in a mean value of $103,800, more than double the national average. The numbers are obviously distorted by the fourth data point. If the sample size behind each data point is increased to ten respondents, the outlier income is averaged with more typical incomes, making each data point more representative of the true average salary. Assuming the other values stayed the same, if the fourth figure dropped to $118,000, then the mean would be $57,200, more in line with the normal distribution. Moving the sample size to fifteen, twenty, or thirty would bring the results increasingly closer to the normal distribution.
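The arithmetic in this worked example can be verified in a few lines; the figures below are the ones given in the text.

```python
from statistics import mean

NATIONAL_AVERAGE = 45_000

# Five data points, one inflated by a millionaire respondent (figures from the text)
points = [33_000, 39_000, 44_000, 351_000, 52_000]
mean_with_outlier = mean(points)  # 103,800 — more than double the national average

# Per the text, with ten respondents behind each point the fourth figure
# drops to $118,000, because the outlier is averaged with typical incomes
points[3] = 118_000
mean_diluted = mean(points)       # 57,200 — much closer to $45,000

print(mean_with_outlier, mean_diluted)
```

The second mean is still above the assumed national average, but the outlier's influence has been sharply reduced; further increases in sample size would continue to shrink it.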