Data analysis and probability in society
Data analysis and probability play a crucial role in various aspects of society, influencing decision-making processes across numerous fields, including healthcare, finance, government, and industry. Data analysis involves collecting, transforming, and modeling data to extract meaningful insights that aid in logical conclusions and informed choices. Statistical methods, particularly probabilistic approaches, are widely used due to advancements in technology that facilitate the rapid processing of large datasets. Professionals in this domain, such as statisticians, actuaries, and data analysts, can be found in both public and private sectors, addressing diverse challenges from economic forecasting to public health monitoring.
Historically, governments have relied on data collection for functions like census-taking, which is essential for resource allocation and representation. In the medical field, data analysis is instrumental in tracking disease outbreaks and determining the efficacy of treatments. Similarly, in finance, probability models are used to assess risks and establish insurance premiums. The entertainment industry also uses probability and data analysis in gaming and sports to make games more realistic, set odds, and evaluate player and team performance. Overall, the integration of data analysis and probability into everyday life underscores their significance in fostering informed decision-making and improving societal outcomes.
Data analysis can be thought of as the process of collecting, transforming, summarizing, and modeling data, usually with the goal of producing useful information that facilitates drawing logical conclusions or making decisions. Virtually any field that conducts experiments or makes observations is involved in data analysis.
There are many mathematical data analysis methods, including statistics, data mining, data presentation architecture, fuzzy logic, genetic algorithms, and Fourier analysis, named for mathematician Joseph Fourier. Probabilistic statistical methods are among the most widely applied tools, and they are what many people think of when they hear the term “data analysis.” The use of probability, statistical analysis, and other mathematical data analysis methods is widespread, especially given technological advances and computer software that facilitate rapid, automated data collection and efficient, effective processing of massive data sets. Jobs that involve data collection, probabilistic modeling, statistical data analysis, data interpretation, and data dissemination are found in both the public and private sectors, as well as in a diverse array of disciplines, including agriculture, biology, computer science, digital imaging, economics, engineering, education, forestry, geography, insurance, law, manufacturing, marketing, medicine, operations research, psychology, and pharmacology. Many specialized data analysts are known by job titles or classifications other than statistician, such as actuary, biostatistician, demographer, econometrician, epidemiologist, or psychometrician.
Professional Education
The first college statistics department was founded in 1911 at University College London. Other departments in universities around the world followed. In the twenty-first century, more than 200 colleges and universities in the United States offer undergraduate statistics degrees, and many more schools offer minors and courses in probability and statistics, data mining, and other mathematical data analysis methods. These courses may be taught either in mathematics and statistics departments or, often, in one of many partner disciplines, such as psychology, biology, or business. Graduate degrees in statistics do not necessarily require an undergraduate degree in statistics or mathematics, but most graduate degree programs prefer strong mathematical or statistical backgrounds with courses in areas like differential and integral calculus, mathematical modeling, probability theory, statistical methods, vector analysis, linear algebra, and mathematical statistics.
Historically, computational methods were a primary focus of statistics education. With the evolution of technology and the growing role of statistics in everyday life, statistics education has shifted to focus on conceptual understanding, analysis of real data in context, survey sampling and experimental design methods, technology for analysis and presentation, communication of methodology and results to both technical and nontechnical audiences, and statistical thinking or literacy. People with bachelor’s degrees in mathematics, statistics, or related mathematical fields, like operations research or decision sciences, can often find entry-level data analysis positions in government and industry, but research-related jobs and teaching at the community college level typically require master’s degrees. Teaching or research-related jobs at four-year colleges and universities usually require doctoral degrees. Work experience or qualifying exams, such as those administered by the Society of Actuaries, are often necessary for employment in some industries. Training and certification programs like Six Sigma Black Belt also signify a certain level of data analysis skill and knowledge.
Government
Virtually all federal organizations have data analysis specialists or entire statistical subdivisions that use mathematical and statistical models. Since ancient times, governments have collected data and used mathematical methods to perform necessary functions. Archaeological evidence suggests that many ancient civilizations conducted censuses to enumerate their populations, often for taxation or military recruitment. Livestock, trade goods, and other property were sometimes counted in addition to people. Mathematics facilitated decisions regarding the distribution of resources like land, water, and food. The German word for this process of “state arithmetic” is cited as the origin of the English word “statistics,” which first appeared in Statistical Accounts of Scotland, an eighteenth-century work by politician John Sinclair that included data about people, geography, and economics. In the United States, counting of the population is required by the US Constitution, and congressional representation for the US House of Representatives is determined by the decennial census population values.
Over the decades, many mathematicians and statisticians worked on planning and implementing the census, like Lemuel Shattuck, who also co-founded the American Statistical Association in 1839. Since its creation in 1902, the duties and activities of the US Census Bureau have grown beyond the mandated 10-year census to include collecting and analyzing data on many social and economic issues, and the US Census Bureau is one of the largest employers of mathematicians and statisticians in the country. At the start of the twenty-first century, various agencies of the US government employed approximately 20 percent of the statisticians in the country. An additional 10 percent were employed by state and local governments, including state universities.
Statisticians and other mathematical data analysts working within many federal agencies are also responsible for developing new and innovative methods for gathering, validating, and analyzing data, especially the massive, messy, or incomplete data sets that are increasingly common in technological and industrialized societies. They also work to reduce bias and more accurately model issues that affect individuals and organizations. Many countries and governing entities around the world have agencies that perform similar functions. One major area of interest for most governments is the economic health of the country and the well-being of its workers. In the United States, the Bureau of Labor Statistics measures and forecasts factors such as labor market activity, productivity, price changes, spending, and working conditions. The bureau began collecting data at the federal level in 1884. The Current Population Survey, implemented by the US Census Bureau, is a monthly survey of about 50,000 households that has been conducted for more than 50 years, and the Current Employment Statistics Survey gathers data from about 410,000 worksites to summarize variables such as hours worked and earnings.
While the Bureau of Labor Statistics focuses mostly on manufacturing and services, the US Department of Agriculture’s Economic Research Service, established in 1961, is responsible for data about farming, natural resources, and rural development, addressing issues like food safety, climate, farm employment, and rural economies. Its online Food Environment Atlas includes indicators that describe the US “food environment” and model concepts like food prices and people’s geographic proximity to grocery stores or restaurants. The National Agricultural Statistics Service, also established in 1961, conducts the Census of Agriculture. It can be traced in part to a 1957 Congressional decision to approve probability survey methods for agricultural research. The US Internal Revenue Service’s Statistics of Income Division, created in 1916, was among the first federal agencies to use stratified random sampling and machine summarization of data, both in the 1920s. In the twenty-first century, it assesses the tax impact of federal legislation.
Beyond their workforces, governments are also typically interested in the overall health, safety, and education of members of the broader society. The US National Center for Health Statistics, established in 1960, compiles public health statistics, tracks federal health initiatives, and helps assess trends related to health care and health behaviors. For example, it has monitored efforts to reduce obesity and teen pregnancy. Other data cover health care delivery and changes in it, such as the use of prescription medications and emergency rooms. The Bureau of Justice Statistics, founded in 1980, is primarily responsible for crime and criminal justice data collection, analysis, and dissemination in the United States. One of its principal reports is the annual National Crime Victimization Survey. The Federal Bureau of Investigation, founded in 1908, creates the annual Uniform Crime Report. The National Center for Education Statistics was mandated by the Education Sciences Reform Act of 2002 to collect and analyze “statistics and facts as shall show the condition and progress of education in the several states and territories” of the United States. The US Congress uses data from this agency to plan education programs and to apportion federal funds among states.
In the twentieth century, issues like the energy crisis of the 1970s, climate change, and concerns over the future availability of oil focused more attention on U.S. energy resources and infrastructure. The Energy Information Administration (EIA) was established in 1977 to independently and impartially collect and analyze data to disseminate information about energy resources, uses, infrastructure, and flow, as well as their impacts on and responses to economic and environmental variables. The goals are to assist in creating policies and making energy decisions as well as educating the public about all aspects of energy.
While government is one of the largest producers and users of statistics, not everyone agrees on their validity or utility. Many have criticized politicians for selectively using or deliberately misusing data and statistics, while others have suggested that the issue is insufficient training or understanding of mathematical data analysis—though statistical methods are increasingly part of political science degree programs. Former North Carolina Representative Lunsford Richardson Preyer once said: “Statistics do not always lie, but they seldom voluntarily tell the truth. We can argue any position on this bill on a set of statistics and some study or another.” At the same time, some propose that effective democracy depends on citizens being able to access and understand current statistics. The burden and responsibility to produce credible information then rests with both the public, which has an obligation to provide valid data and seek to understand the outcomes, and the government, which must collect, analyze, and publicize information in a reliable, timely, and nonpartisan manner.
Industry and Manufacturing
The notion of interchangeable parts—pioneered by individuals like eighteenth-century army officer and engineer Jean-Baptiste Vaquette de Gribeauval and inventor Eli Whitney—followed by the mass production of goods during the Industrial Revolution, ushered in a new era of data collection and analysis to ensure the quality of manufactured products. In the early twentieth century, physicist and statistician Walter Shewhart pioneered data analysis methods in manufacturing that led some to call him the “father of statistical quality control.” Among other accomplishments, he developed specialized charts using data and probability to sample and track the variability in processes to identify both natural, random process deviations and non-random deviations in order to eliminate the latter and thus improve consistency in the product.
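The charting idea described above can be sketched in a few lines of Python. This is a simplified illustration, not Shewhart's exact procedure: it assumes limits set at three sample standard deviations around the mean of a baseline run, whereas charts used in practice usually estimate variability from subgroup ranges. The measurements and the names `control_limits` and `out_of_control` are hypothetical.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, center, upper) limits from baseline measurements.

    Assumption: the baseline run reflects only natural, random variation.
    """
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation
    return center - k * spread, center, center + k * spread


def out_of_control(samples, limits):
    """Return the indices of samples that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, x in enumerate(samples) if x < lower or x > upper]


# Hypothetical part diameters in millimeters (illustrative values only).
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
limits = control_limits(baseline)

new_samples = [10.1, 9.9, 11.5, 10.0]
flagged = out_of_control(new_samples, limits)
print(limits, flagged)
```

Points inside the limits are treated as ordinary random variation, while a flagged point suggests a non-random cause worth investigating.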
W. Edwards Deming expanded on these notions to help develop the industrial management practice known as “continuous quality control” or “continuous quality improvement.” Deming is credited with significant contributions to Japan’s post–World War II reputation for high-quality products, and his data-based control methods have been widely adopted in the United States. For example, Motorola’s Six Sigma program, founded in the 1980s, focused on training managers and employees at various levels in statistical methods and practices designed to identify and remove causes of product defects with the overall goal of minimizing process variability. The program name derives from statistical notation: sigma (σ) is commonly used to represent standard deviation, a measure of variability. Six standard deviations on either side of the mean in a bell-shaped or normal curve encompass virtually all of the data values. If there are six standard deviations between the process mean and the nearest product specification limit, only about 3.4 items per million produced will fail to meet those specifications, a figure that conventionally allows the process mean to drift by up to 1.5 standard deviations over time. General Electric and other companies adapted and evolved the original Six Sigma ideas by merging them with other management strategies. For example, in the 1990s, concepts from a manufacturing optimization method known as “lean manufacturing” were combined with Six Sigma to produce a hybrid program called “Lean Six Sigma.”
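The defect rate quoted for Six Sigma can be checked with a short calculation. The sketch below assumes normally distributed output and symmetric specification limits. It also applies the common Six Sigma convention that the process mean may drift by 1.5 standard deviations, which is what yields a rate of roughly three to four defects per million. The function names are hypothetical.

```python
import math


def upper_tail(z):
    """P(Z > z) for a standard normal random variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def defects_per_million(sigma_level, shift=0.0):
    """Expected defects per million items.

    Assumptions: normal output, specification limits `sigma_level` standard
    deviations on each side of the target, and a process mean that has
    drifted `shift` standard deviations toward one limit.
    """
    near = upper_tail(sigma_level - shift)  # tail beyond the closer limit
    far = upper_tail(sigma_level + shift)   # tail beyond the farther limit
    return (near + far) * 1_000_000


centered = defects_per_million(6.0)            # no drift in the mean
shifted = defects_per_million(6.0, shift=1.5)  # conventional 1.5-sigma drift
print(f"centered: {centered:.6f} per million")
print(f"shifted:  {shifted:.2f} per million")
```

With no drift the rate is about two defects per billion. With the 1.5-sigma drift it is about 3.4 per million.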
Data analysis and probability are also used in advertising and market research. Many of the common market research practices used in the twenty-first century are traced to the work of engineer and pioneer television analyst Arthur Nielsen. These practices include using data analysis to quantify market share and to determine sales patterns by combining consumer surveys with sales audits.
Medicine and Pharmacy
In the nineteenth century, some in the medical community began to investigate the idea of using data analysis for medical applications. Physician William Farr applied data analytic methods to model epidemic diseases. He is often credited as the founder of epidemiology. Physician John Snow gathered data to trace the source of an 1854 cholera outbreak in London. Along with his census work, Shattuck helped implement many public health measures based on data analyses. Florence Nightingale invented her own graphical data presentations in order to summarize data on the health impacts of poor hygiene in British military hospitals. In the twenty-first century, agencies like the U.S. Centers for Disease Control and Prevention and the World Health Organization collect, analyze, and model data in order to, among other goals, track the spread of infectious disease; assess the impact of preventive measures, like vaccinations; and test the virulence of infectious agents.
Clinical trials or experiments are also performed to determine the effectiveness and safety of new medical procedures and drugs. In the eighteenth century, physician James Lind tested remedies for scurvy aboard a British navy ship, which can be cited as one of the first recorded cases of a controlled medical trial. Statistician and epidemiologist Austin Bradford Hill helped pioneer randomized, controlled clinical trials in the twentieth century and also worked to develop the Bradford-Hill criteria, a set of logical and mathematical conditions that must be met to determine causal relationships. Approval and patenting of pharmaceuticals and medical devices by federal agencies like the Food and Drug Administration, part of the U.S. Department of Health and Human Services, require extensive experimentation and data analysis. For example, when a television commercial for a drug states that it is “clinically proven,” this usually means that it has gone through experimental testing and that appropriate analyses of data have determined that it is very probably effective and safe, according to measures like the Bradford-Hill criteria.
Finance and Insurance
Probability is essential for quantifying risk, a concept that underlies most financial ventures and drives interest, credit, loan, and insurance rates. Data analysis can be used to derive probabilities and create financial models or indices like Fair Isaac Corporation (FICO) scores, the Dow Jones Industrial Average, and nations’ gross domestic products. Engineer and economist William Playfair is considered to be one of the creators of graphical data analysis. Beginning in the eighteenth century, he researched trade deficits and other types of economic and financial data.
Mathematician Louis Bachelier is known as the “father of financial mathematics” for his use of Brownian motion to model stock options at the turn of the twentieth century. Brownian motion, named for botanist Robert Brown, is a stochastic (probabilistic or random) process. The international Bachelier Financial Society is named for Louis Bachelier. Its goal is “the advancement of the discipline of finance under the application of the theory of stochastic processes, statistical and mathematical theory,” and it is open to individuals in any discipline. Actuarial scientists or actuaries are also widely employed to develop models of the financial impact of risk. For example, they may use a combination of theoretical probability and data analysis to determine appropriate premiums for life or health insurance using variables such as life expectancy, which is adjusted for characteristics or behaviors that modify risk, like gender or smoking.
Astronomer and mathematician Edmund Halley, for whom Halley’s Comet is named, is also often cited as the founder of actuarial science. He calculated mortality tables using data from the city of Breslau, Germany (now Wrocław, Poland). Published in 1693, these tables are the earliest known works to mathematically quantify the relationship between age and mortality.
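As a rough illustration of how a mortality probability feeds into a premium, the following Python sketch prices a one-year term life policy as the expected claim plus a proportional loading for expenses. Every number here is hypothetical, as are the function name and the smoker multiplier. Real actuarial pricing also accounts for interest, multi-year survival, and regulation.

```python
def one_year_term_premium(death_prob, benefit, loading=0.0):
    """Premium for a one-year policy: expected claim plus a loading.

    Assumptions: one payment of `benefit` if the insured dies within the
    year, no interest or discounting, and `loading` as a fraction of the
    expected claim.
    """
    expected_claim = death_prob * benefit
    return expected_claim * (1 + loading)


# Hypothetical inputs, not taken from any real mortality table.
base_death_prob = 0.002   # assumed one-year probability of death
smoker_multiplier = 2.0   # assumed risk adjustment for smoking
benefit = 100_000

standard = one_year_term_premium(base_death_prob, benefit, loading=0.25)
smoker = one_year_term_premium(
    base_death_prob * smoker_multiplier, benefit, loading=0.25
)
print(round(standard, 2), round(smoker, 2))
```

Doubling the assumed probability of death doubles the premium, which is how a risk factor such as smoking carries through to the price.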
Entertainment and Gambling
Archaeological evidence suggests that games of chance have existed since antiquity. Probability appears in different forms in written works throughout the centuries, like the body of Talmudic scholarship and the 1494 treatise of mathematician and friar Luca Pacioli known as Summa de arithmetica, geometria, proportioni et proportionalita. The mathematical study of probability as it is known in the twenty-first century is traditionally traced to seventeenth-century mathematicians Blaise Pascal and Pierre de Fermat, who were inspired to formulate their mathematical “doctrine of chances” by problems in gambling. In the twenty-first century, gambling is a multibillion-dollar industry. In Las Vegas and other places, oddsmakers use probability to determine risks, point spreads, and payoff values for games of chance, sporting events, and lotteries. Players often use betting systems that are based on data analysis or probability to attempt to beat the odds and increase their chances of winning.
One example was a group of students from the Massachusetts Institute of Technology and other schools who used card counting techniques and mathematical optimization strategies in blackjack, which was the basis of the 2008 movie 21 and the television documentary Breaking Vegas. The television game show Deal or No Deal, which has aired versions in approximately 80 countries around the world, has been studied by mathematicians, statisticians, and economists as a case of decision making involving probability and data analysis concepts, like expected value. Probability-based random number generation is incorporated into many popular video games to increase realism and create multiple scenarios, while moviemakers are exploring probability-based artificial intelligence systems to generate realistic behavior in large, computer-generated battle scenes. The pioneering Lord of the Rings movies used a program developed by computer graphics software engineer Stephen Regelous and named Multiple Agent Simulation System in Virtual Environment (MASSIVE), which uses probabilistic methods like fuzzy logic, derived from the fuzzy set theory of computer scientist and mathematician Lotfi Zadeh. Most sports collect a wide variety of data about their players, but in the late twentieth century, advanced mathematical modeling, such as sabermetrics, developed by statistician George William “Bill” James, gained popularity for analyzing player and team performance and making predictions.
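The expected value concept mentioned in connection with Deal or No Deal can be illustrated with a minimal Python sketch. It assumes each unopened case is equally likely to be the contestant's own, and the dollar amounts and the banker's offer are hypothetical rather than taken from any broadcast.

```python
from fractions import Fraction


def expected_value(amounts):
    """Expected value when each remaining amount is equally likely."""
    return Fraction(sum(amounts), len(amounts))


# Hypothetical game state: four unopened cases and a banker's offer.
remaining = [1, 100, 10_000, 500_000]
offer = 90_000

ev = expected_value(remaining)
print(float(ev), offer < ev)
```

Here the offer is below the expected value, so a contestant who only maximizes expected winnings would decline it. A risk-averse contestant might still accept, which is part of why the show interests researchers who study decision making.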