Scatterplots

Summary: Scatterplots are useful tools for mathematicians and statisticians to graph and present data.

Human beings are constantly exploring the world around them to discover relationships that can be used to explain past and current events or phenomena and perhaps to predict future occurrences.

The colloquial expression “a picture is worth a thousand words” has been attributed to many possible historical sources, including French leader and noted student of mathematics Napoleon Bonaparte, who purportedly said, “A good sketch is better than a long speech.” In the twenty-first century, graphing is a fundamental first step in any exploratory data analysis, and graphical representations are common in the media. Scatterplots, which most often represent values of paired variables as points in a Cartesian plane, help data investigators identify relationships, describe patterns and correlation, fit linear and nonlinear functions using techniques like regression analysis, and locate points known as “outliers” that deviate from the predominant pattern. In the primary grades, students often use line graphs, which some consider to be a special case of scatterplots; scatterplots of data may be explored beginning in the middle grades, in both mathematics and science classes.
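The three uses named above (describing correlation, fitting a line, and locating outliers) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a standard procedure: the data are hypothetical, the helper names `pearson_r`, `fit_line`, and `flag_outliers` are invented for this example, and the cutoff of two residual standard deviations is only one common convention.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def flag_outliers(xs, ys, k=2.0):
    """Return points whose residual from the fitted line exceeds k residual
    standard deviations (k is an illustrative cutoff, not a fixed rule)."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    sd = sqrt(sum(r * r for r in residuals) / (len(xs) - 2))
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) > k * sd]

# Hypothetical paired data: roughly y = 2x, with one point far off the pattern.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 30.0, 18.1, 20.2]

r = pearson_r(xs, ys)
slope, intercept = fit_line(xs, ys)
outliers = flag_outliers(xs, ys)

print(f"correlation r = {r:.2f}")
print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
print("flagged points:", outliers)
```

With these data, the single point at x = 8 is the only one flagged. It also pulls the fitted slope above 2 and lowers the correlation, which is one reason outliers are usually examined before a fitted line is trusted.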

Early History

Mathematicians and others have long sought alternative methods of representation for researching, presenting, and connecting the mathematical concepts they studied. The Cartesian plane, named for René Descartes, facilitated graphing of algebraic equations and data beginning in the seventeenth century. Historians have traced scatterplots to 1686, though the term “scatter diagram” is attributed to early twentieth-century researchers such as statistician Karl Pearson, and “scatterplot” seems to have first appeared in a 1939 dictionary.

Examples of early pioneers of data graphing include “political arithmetician” Augustus Crome, who studied the relationships between nations’ population sizes, land areas, and wealth; mathematician and sociologist Adolphe Quetelet, who conducted studies of body measurements that helped contribute to the measure now known as the Body Mass Index, which relates height and weight; and engineer and political scientist William Playfair, who called himself the “inventor of lineal arithmetic,” a term he used for graphs. Of the graphical method, Playfair said: “. . . it gives a simple, accurate, and permanent idea, by giving form and shape to a number of separate ideas, which are otherwise abstract and unconnected.” Playfair’s eighteenth-century graphical summaries of British trade across various years are perhaps the earliest example of what would now be referred to as “time series plots” (or in some cases “line graphs”), which may be considered a special case of scatterplots.

While Playfair plotted many economic variables as functions of time, the most extensive early use of scatterplots to relate two observed variables is probably the anthropometric and genetic research of Francis Galton, a cousin of scientist Charles Darwin. After studying medicine and mathematics in college, Galton became interested in the investigation and characterization of variability and deviations in many natural phenomena. He established a laboratory for the measurement and study of human mental and physical traits, focusing on empirical and statistical studies of heredity in the latter half of the nineteenth century. Many of Galton’s scatterplots involved graphing parental characteristics on one axis, usually the X axis, and offspring characteristics on the other. Like those of scientist Gregor Mendel, some of his initial genetic experiments were conducted on peas; later, he investigated measurements of people. Scatterplots of height appeared in his 1886 publication “Regression Towards Mediocrity in Hereditary Stature,” which is the origin of the name for the statistical technique of regression analysis. The word “mediocrity” in this context was a reference to the mean or average height (not a qualitative judgment) and was used to describe a pattern observed in the data: very short parents tend to have children taller than themselves, and very tall parents tend to have children shorter than themselves, in both cases closer to the mean.
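The pattern Galton described can be reproduced with a small simulation. The sketch below does not use Galton’s data: the mean, standard deviation, and parent-child correlation are hypothetical values chosen only for illustration, and `simulate_pair` is a name invented for this example. It assumes parent and child heights share the same mean and spread and are imperfectly correlated.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

MEAN = 68.0  # hypothetical population mean height (inches)
SD = 2.5     # hypothetical population standard deviation
R = 0.5      # assumed parent-child correlation, chosen for illustration

def simulate_pair():
    """Draw one (parent, child) pair with correlation R and equal spread."""
    parent = random.gauss(MEAN, SD)
    # The child's expected height is pulled toward MEAN by the factor R;
    # the noise term keeps the children's overall spread equal to SD.
    noise_sd = SD * (1 - R ** 2) ** 0.5
    child = MEAN + R * (parent - MEAN) + random.gauss(0, noise_sd)
    return parent, child

def mean(values):
    return sum(values) / len(values)

pairs = [simulate_pair() for _ in range(20000)]

# Parents more than two standard deviations from the mean, in each direction.
tall = [(p, c) for p, c in pairs if p > MEAN + 2 * SD]
short = [(p, c) for p, c in pairs if p < MEAN - 2 * SD]

for label, group in (("very tall", tall), ("very short", short)):
    parent_avg = mean([p for p, _ in group])
    child_avg = mean([c for _, c in group])
    print(f"{label} parents: parent mean {parent_avg:.1f}, child mean {child_avg:.1f}")
```

In both groups the children’s average lies between their parents’ average and the population mean. On a scatterplot of child height against parent height, this appears as a fitted line with a slope of less than 1.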

Recent Developments

Prior to the development of computers and data analytic software, data had to be graphed by hand. In the twenty-first century, computers facilitate many types of scatterplots. In addition to the standard plots of two variables in the Cartesian plane, there are three-dimensional scatterplots that display point clouds to explore the ways in which three variables relate and interact. Symbols used to represent points on a two- or three-dimensional scatterplot may also be coded using different colors or shapes to indicate additional variables and uncover patterns. Matrix plots are square grids of scatterplots that show all possible pairs from a set of variables, usually arranged such that all of the plots in the same row share the same Y variable and all plots in the same column share the same X variable. Mathematicians, statisticians, computer scientists, and other types of researchers have explored the theoretical and methodological links between scatterplots and map surfaces for use in applications such as data mining and spatial analysis of geographic information system (GIS) data.
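The row and column convention for matrix plots can be made concrete with a short sketch that builds only the layout, not the drawings. The variable names are hypothetical; each cell of `grid` names the (X, Y) pair that would be drawn in that panel.

```python
variables = ["height", "weight", "age"]  # hypothetical variable names

# grid[row][col] holds the (X, Y) pair for that panel:
# every panel in a row shares its Y variable,
# every panel in a column shares its X variable.
grid = [[(x_var, y_var) for x_var in variables] for y_var in variables]

for row in grid:
    print("   ".join(f"{y} vs {x}" for x, y in row))
```

A set of k variables therefore produces a k-by-k grid. The panels on the diagonal pair a variable with itself, so plotting software often shows something else there, such as the variable’s name or a histogram.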

While they are useful tools for exploration and representation, scatterplots are often subject to misinterpretation. For example, relationships or correlations shown in scatterplots are sometimes mistakenly taken as evidence of cause and effect. Causation must be inferred from the way in which the data were collected rather than from the strength of the association.
