Descriptive Statistics (Sociology)
Descriptive statistics is a branch of statistics that focuses on summarizing and describing data sets, making it an essential tool for researchers and analysts. It employs various techniques, such as charts and graphs, to visually represent data, making it easier to interpret and understand. Key components of descriptive statistics include measures of central tendency, specifically the mean, median, and mode, which aim to identify the average value within a data set. Additionally, measures of variability, such as range and standard deviation, provide insight into the spread or dispersion of data points.
While descriptive statistics are invaluable in presenting and organizing data, they do not allow for inferences or conclusions beyond the immediate data set. Each measure of central tendency and variability has its strengths and weaknesses, making it crucial to choose the appropriate one based on the data's distribution. Overall, descriptive statistics serves to simplify complex data, enabling clearer communication and understanding within research contexts, without drawing broader conclusions that require inferential statistics.
Subject Terms
Descriptive Statistics
Abstract
Descriptive statistics comprises a set of statistical tools that help sociologists, researchers, and other analysts better understand the masses of data with which they need to work. These tools include various types of charts and graphs to visually display the data so that they can be more easily understood, measures of central tendency that estimate the midpoint of a distribution, and measures of variability that summarize how widely dispersed the data are over the distribution. Each measure of central tendency and variability has particular strengths and weaknesses and should only be used under certain conditions. Descriptive statistics do not allow one to make inferences about the data or to determine whether or not the data values are statistically significant. Rather, they only describe data.
Overview
At its most basic, sociology is the study of humans within society. In order to better understand human behavior from this perspective, sociologists attempt to describe, explain, and predict the behavior of people in social contexts. At first glance, this task seems deceptively simple. After all, we usually know how and why we react the way we do in various situations. It should seem a simple step to extrapolate from our own attitudes and behavior to those of people in general. However, it is not valid to assume that everyone thinks or behaves in the same way. Human beings are infinitely diverse, and often two people can look at the same data or situation and arrive at two very different conclusions.
For example, although all voters have access to the same information during a presidential race, these races can be hotly contested, and voters can fiercely disagree over a candidate's merits. Even within the same party, voters can be divided over a candidate, with some giving credence to one piece of information about the candidate and others valuing another piece. It is a truism that people can look at the same situation and honestly disagree. For this reason, it is impossible to extrapolate from the attitudes or behavior of one individual to society at large. To truly describe, explain, and predict the behavior of people in social contexts, sociologists must acquire data on the attitudes and behaviors of more than one individual.
Just as data collected from only one individual is not of much use to sociologists, neither is data collected from a mere two or three people. Sociologists need to gather data from a large number of people in order to have any confidence that their findings can be extrapolated to people in general. The number of people used in sociological research studies routinely reaches into the hundreds for just this reason. Although hundreds or even thousands of inputs will give us a better picture of how people actually react or behave, this massive amount of data leads to another problem: How can we make sense of all the data and interpret them in a meaningful way? Fortunately, the field of mathematics offers us numerous statistical tools that can aid us in this task.
When thinking of statistics, most people think of inferential statistics, which is a subset of mathematical statistics used in the analysis and interpretation of data. Inferential statistics are used to make inferences from data, such as drawing conclusions about a population based on a sample. This branch of statistics comprises the seemingly arcane formulae and mathematical computations that so many students dread.
However, there is another class of statistical tools that is used to summarize data and develop inputs for use in inferential statistical computation. Although not a substitute for inferential statistics, descriptive statistics is very useful in helping sociologists better understand the masses of data with which they need to work. In general, descriptive statistics is a subset of mathematical statistics that describes and summarizes data. Descriptive statistics are used to summarize and display data through various types of charts and graphs, such as histograms and pie charts. Using these tools, one can easily get a rough idea of the shape of the data; describe the "average" of the data through measures of central tendency, including the mean, median, and mode; and summarize the variability of the data through such measures as the standard deviation, the semi-interquartile deviation, and the range.
Applications
Graphing. One subset of descriptive statistics comprises various graphing techniques that help one organize and summarize data so that they are more easily comprehended. One of the most common and helpful methods for doing this is a frequency distribution. In this technique, data are divided into intervals, typically of equal length, using techniques such as a stem-and-leaf plot or a box-and-whiskers plot. Graphing data within intervals rather than as individual data points reduces the number of data points on the graph, making the graph—and the underlying data—easier to comprehend.
For example, one might seek to understand people's attitudes about the effects of cell phone use on driving behavior by asking 1,000 people to rate the effects on a scale of 1 to 100, with 1 being the most negative and 100 being the most positive. However, it would be difficult to display these results by graphing all 1,000 points. There would be several clusters of data points where a number of people gave the same response, as well as clusters of data points where people gave similar but not identical responses. Although displaying the data in this way certainly shows the full range of people's responses, it is difficult to interpret the data because of the large number of data points. In addition, one must question whether there is truly a meaningful difference between a rating of 22 on a 100-point scale and a rating of 23. Both of the people responding believed that cell phone usage had a negative effect on driving behavior, but can one really say that the person who responded with a 22 felt that much more negatively about the effects of cell phone usage than the person who responded with a 23? Probably not.
Therefore, it is reasonable to aggregate the data into ranges within the span of scores (e.g., 1–10, 11–20, etc.) before graphing them. As a result, the number of points on the graph is decreased and larger patterns can emerge. Figure 1 shows a comparison between a scatter plot of raw data and a histogram with a superimposed frequency distribution.
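The aggregation step described above can be sketched in Python. The ratings below are randomly generated stand-ins for the hypothetical survey responses, not real data, and the intervals follow the 1-10, 11-20 grouping just described:

```python
from collections import Counter
import random

random.seed(0)
# Stand-in data: 1,000 hypothetical ratings on the 1-100 scale.
ratings = [random.randint(1, 100) for _ in range(1000)]

def interval(score):
    """Label the equal-width interval (1-10, 11-20, ...) a score falls in."""
    low = ((score - 1) // 10) * 10 + 1
    return f"{low}-{low + 9}"

# Count how many ratings fall into each interval.
frequencies = Counter(interval(r) for r in ratings)
for label in sorted(frequencies, key=lambda s: int(s.split("-")[0])):
    print(label, frequencies[label])
```

Plotting these ten counts as bars yields a histogram of the kind described for Figure 1, with far fewer points to interpret than the raw scatter plot.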
Measures of Central Tendency. Although graphing the data using this or other graphing techniques is helpful for better understanding the shape of the underlying distribution, other statistical tools, like measures of central tendency and measures of variability, can be used to understand the data even more thoroughly.
Measures of central tendency estimate the midpoint of a distribution. These measures include
- the median, or the number in the middle of the distribution when the data points are arranged in order;
- the mode, or the number that occurs most often in the distribution; and
- the mean, or the sum of all data values in the distribution divided by the total number of data points in the distribution.
These three methods frequently give different estimates of the midpoint of a distribution because they are all affected differently by the shape of the distribution and by any outlying points.
For example, as shown in Figure 2, for the data set 2, 3, 3, 7, 9, 14, 17, the mode is 3, as there are two 3s in the distribution, but only one of each of the other numbers; the median is 7, since, when the seven numbers in the distribution are arranged numerically, 7 is the number that occurs in the middle; and the mean (or arithmetic mean) is 7.857, since the sum of the seven numbers is 55 and 55 ÷ 7 = 7.857.
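The worked example can be verified with Python's standard statistics module:

```python
import statistics

data = [2, 3, 3, 7, 9, 14, 17]

mode = statistics.mode(data)      # the most frequent value
median = statistics.median(data)  # the middle value of the ordered list
mean = statistics.mean(data)      # sum of the values / number of values

print(mode, median, round(mean, 3))  # 3 7 7.857
```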
The three measures of central tendency all have different characteristics. It is important to remember that these differences are real and the measures are not interchangeable. For example, in a skewed distribution, where one end has extreme outliers but the data are otherwise normally distributed, the median may be pulled toward the skew (i.e., toward the end with the outliers). Because of this, when the ends are not balanced and data are clustered toward one end of the distribution, the median may disproportionately reflect the outlying data points.
The mean is even more affected by extreme scores. If, for example, one wants to know the "average" salary of the salaries shown in Figure 3, one has to carefully consider just how the distribution affects the mean. In this case, it may be more accurate to report the average salary as the mode rather than the mean due to the small proportion of people who make a much higher salary than other people in the same field. As shown in the figure, this small proportion of people pulls the mean in the direction of the skew.
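A small sketch illustrates the effect; the salaries below are invented for illustration and are not the values shown in Figure 3:

```python
import statistics

# Hypothetical salaries for illustration only (not the Figure 3 values):
# most workers earn about $40,000, but a few high earners skew the data.
salaries = [40_000] * 6 + [45_000, 50_000, 150_000, 250_000]

print(statistics.mode(salaries))    # 40000   (the typical salary)
print(statistics.median(salaries))  # 40000.0
print(statistics.mean(salaries))    # 73500.0 (pulled toward the skew)
```

Here the mode and median both report the typical salary, while a handful of high earners pulls the mean well above what most people in the group actually make.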
Each measure of central tendency is best used under different circumstances. The mode has the obvious advantage of being quick and easy: one need only determine which number occurs most frequently in the distribution. However, this same characteristic means that the mode also has the disadvantage of lacking stability: a small change in the numbers can lead to a great change in the mode. Because the mode does not take actual score values into account, it is not really valuable for any purpose other than to state which number has the highest frequency.
The median is a more stable measure than the mode, and occasionally it may stand alone as a statistic. In fact, the median is preferred to both the mode and the mean for use in non-symmetrical distributions because it is less variable than the mode and less affected by extreme scores than the mean. However, like the mode, the median is a terminal statistic: it cannot be used to make statistical inferences about the data.
The mean has advantages in most situations over the other two measures of central tendency, and it is not a terminal statistic, meaning it can be used as an input for many inferential statistical techniques. The mean is highly stable, and its value does not vary greatly because of a change in a single score. Because of these traits, it is generally advisable to always use the mean as the measure of central tendency unless there is a compelling reason not to do so, such as a non-symmetrical distribution with extreme outliers.
Measures of Variability. It is important to note that although measures of central tendency give a quick measure of the "average" value in a distribution, this information by itself is insufficient to truly understand the distribution of the underlying data. For example, the data may be evenly spread across the distribution, cluster in the middle, or cluster at either end. Yet all of these distributions can yield the same value for the mean. Measures of central tendency are helpful for better understanding large amounts of data, but they are only one part of the puzzle. For example, without seeing the graph of the distribution, knowing that a sample of data has a mean of 10 does not give one much information about the data. One needs additional information in order to really understand what the data signify. The scope and signification of the data set can be better understood by knowing how far the data points are from each other, what the end points of the distribution are, and, in general, how the data are distributed. To better understand this aspect of a collection of data, one uses measures of variability. Measures of variability are descriptive statistics that summarize how widely dispersed the data are over the distribution. Specifically, these measures are the range, the semi-interquartile deviation, and the standard deviation, corresponding to the mode, the median, and the mean, respectively.
The range is a statement of the difference between the highest and lowest scores in the distribution. In conjunction with a measure of central tendency, this information helps one better understand the data. For example, if a class's mean score on a test was 60 out of a total possible score of 100, one would draw different conclusions about the class's abilities if the lowest and highest scores were 0 and 100 than if they were 50 and 70. Looking at the distribution within the first range, it would appear that more people got over half of the questions correct, because otherwise the mean would be less than 50. In the second case, it would appear that either no one understood the material well enough to get a large majority of the questions correct or a significant number of questions were badly worded, because no one earned a score of more than 70 out of 100. Some distributions, as in the first case, have outlying data, or stragglers at one or both ends of the distribution that are far removed from the rest of the data. The range, however, treats all values in the distribution alike and does not give consideration to whether or not they are outliers.
Like the median, the semi-interquartile deviation is a positional measure that eliminates the extreme scores on both ends of the distribution. To determine the semi-interquartile deviation, one divides the distribution into quarters. The first quartile (Q1) is determined by finding the median of the lower half of the distribution (i.e., the number with 25 percent of the numbers in the distribution below it). The third quartile (Q3) is determined by finding the median of the upper half of the distribution (i.e., the number with 25 percent of the numbers in the distribution above it). The semi-interquartile deviation (Q) is then calculated by subtracting the value of the first quartile from the value of the third quartile and dividing this number by two: Q = (Q3 - Q1)/2.
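These steps translate directly into code. The sketch below uses the quartile convention just described (quartiles as the medians of the lower and upper halves, with the overall median excluded for an odd number of values); statistical packages may use slightly different quartile rules:

```python
import statistics

def semi_interquartile_deviation(values):
    """Compute Q = (Q3 - Q1) / 2, taking Q1 and Q3 as the medians of the
    lower and upper halves of the ordered data. For an odd number of
    values, the overall median is excluded from both halves.
    Assumes at least two values."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    q1 = statistics.median(ordered[:mid])
    q3 = statistics.median(ordered[-mid:])
    return (q3 - q1) / 2

# For the data set used earlier: lower half [2, 3, 3] gives Q1 = 3,
# upper half [9, 14, 17] gives Q3 = 14, so Q = (14 - 3) / 2 = 5.5.
print(semi_interquartile_deviation([2, 3, 3, 7, 9, 14, 17]))  # 5.5
```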
Just as the mean is a mathematically derived measure of central tendency, the standard deviation is a mathematical determination of the variability of a distribution. This statistic is an index of the degree to which scores differ from the mean of the distribution, making it a measure of variability that describes how far the typical score in a distribution is from the mean of the distribution. This statistic is obtained by subtracting the mean of the distribution from each score in order to determine the deviation of each score from the mean, squaring each resulting deviation, adding the squared deviations, dividing this sum by the total number of scores, and taking the square root of the result. The larger the standard deviation, the farther away the typical score is from the mean of the distribution.
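The computation just described can be sketched as a short function. This version yields the population standard deviation, since it divides by the total number of scores; some texts divide by one less than the number of scores when working with samples:

```python
import math

def standard_deviation(scores):
    """Population standard deviation, following the steps described:
    find each score's deviation from the mean, square the deviations,
    average the squared deviations, and take the square root."""
    mean = sum(scores) / len(scores)
    squared_deviations = [(s - mean) ** 2 for s in scores]
    return math.sqrt(sum(squared_deviations) / len(scores))

print(standard_deviation([2, 3, 3, 7, 9, 14, 17]))  # roughly 5.41
```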
Like measures of central tendency, each measure of variability has its own strengths and weaknesses. One of the uses of the range is to determine how many intervals should be used when developing a frequency distribution. The range is also the best method for determining variability if all one wants to do is look at the distribution. However, the range is highly unstable and easily affected by extreme scores. Further, it is a terminal statistic, not useful for much more than describing the distribution of the data. The semi-interquartile deviation has an advantage over the range in that it eliminates the extreme scores at both ends of the distribution, thereby making it more stable. In addition, the semi-interquartile deviation is a quick method for finding out whether or not a distribution is skewed. However, like the range, it is a terminal statistic. For most circumstances, particularly those in which one wants to do additional analysis of the data and make statistical inferences, the standard deviation is the best tool to use for describing the variability in a distribution. Like the mean, the standard deviation is used as the basis for inferential statistical techniques.
Conclusion
Descriptive statistics is a class of statistical tools that is very useful in helping sociologists, researchers, and other analysts better understand the masses of data with which they need to work. Descriptive statistics are used to summarize and display data in various types of charts and graphs, such as histograms and pie charts; mathematically describe what the "average" of the data is through measures of central tendency, including the mean, median, and mode; and summarize the variability of the data through such measures as the standard deviation, the semi-interquartile deviation, and the range. Each measure of central tendency and variability has different strengths and weaknesses, and the measures are not interchangeable.
It is important to remember that descriptive statistics do just that: describe the data. They do not allow one to make inferences about the data or determine whether or not the data values are statistically significant. This type of operation belongs to the realm of inferential statistics.
Terms & Concepts
Box-and-Whiskers Plot: A graphing technique that summarizes a data set by depicting the upper and lower quartiles, the median, and the two extreme values of a distribution. Also known as a box plot.
Data: In statistics, quantifiable observations or measurements that are used as the basis of scientific research.
Descriptive Statistics: A subset of mathematical statistics that describes and summarizes data.
Distribution: A set of numbers collected from data and their associated frequencies.
Inferential Statistics: A subset of mathematical statistics used in the analysis and interpretation of data, as well as in decision making.
Mean: An arithmetically derived measure of central tendency in which the sum of the values of all the data points is divided by the total number of data points.
Measures of Central Tendency: Descriptive statistics that are used to estimate the midpoint of a distribution. Measures of central tendency include the median, the mode, and the mean.
Measures of Variability: Descriptive statistics that summarize how widely dispersed the data are over the distribution. The range describes the difference between the highest and lowest scores, the semi-interquartile deviation is a positional measure that eliminates the extreme scores on both ends of the distribution, and the standard deviation is a mathematically derived index of the degree to which scores differ from the mean of the distribution.
Median: The number in the middle of a distribution when all values are placed in order. A measure of central tendency.
Mode: The number that occurs most often within a distribution. A measure of central tendency.
Population: The entire group of subjects belonging to a certain category, such as all women between the ages of 18 and 27, all dry-cleaning businesses, or all college students.
Quartile: Any of three points that divide an ordered distribution into four equal parts, each of which contains one quarter of the data.
Sample: A subset of a population. A random sample is a sample that is chosen at random from the larger population with the assumption that it will reflect the characteristics of the larger population.
Skewed: A distribution that is not symmetrical around the mean, meaning that there are more data points on one side of the mean than on the other.
Statistics: A branch of mathematics that deals with the analysis and interpretation of data. Mathematical statistics provides the theoretical underpinnings for various applied statistical disciplines in which data are analyzed to find answers to quantifiable questions. Applied statistics uses these techniques to solve real-world problems.
Stem-and-Leaf Plot: A graphing technique in which individual data points are broken into the rightmost units ("leaves") and the leftmost units ("stems"). For example, the number 42 would have a stem of 4 and a leaf of 2; the number 47 would have a stem of 4 and a leaf of 7.
Bibliography
Cibois, P. (2012). The interpretation of statistics in sociology. BMS: Bulletin De Methodologie Sociologique, 114, 50–58. Retrieved November 5, 2013, from EBSCO online database SocINDEX with Full Text. http://search.ebscohost.com/login.aspx?direct=true&db=sih&AN=89974832&site=ehost-live
Gringeri, C., Barusch, A., & Cambron, C. (2013). Examining foundations of qualitative research: A review of social work dissertations, 2008–2010. Journal of Social Work Education, 49, 760–773. Retrieved November 5, 2013, from EBSCO online database SocINDEX with Full Text. http://search.ebscohost.com/login.aspx?direct=true&db=sih&AN=90595347&site=ehost-live
Guo, J., Li, W., Li, C., & Gao, S. (2012). Standardization of interval symbolic data based on the empirical descriptive statistics. Computational Statistics & Data Analysis, 56, 602–610. Retrieved November 5, 2013, from EBSCO online database Academic Search Complete. http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=67136175&site=ehost-live
Huff, D. (1954). How to lie with statistics. New York: W. W. Norton & Company.
Witte, R. S. (1980). Statistics. New York: Holt, Rinehart and Winston.
Suggested Reading
Feld, S. L. (1997). Mathematics in thinking about sociology. Sociological Forum, 12, 3–9. Retrieved March 13, 2008, from EBSCO online database Academic Search Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ioh&AN=1537075&site=ehost-live
Gravetter, F. J., & Wallnau, L. B. (2006). Statistics for the behavioral sciences. Belmont, CA: Wadsworth/Thomson Learning.
Iyengar, S. (2013). Artists by the numbers: Moving from descriptive statistics to impact analyses. Work & Occupations, 40, 496–505. Retrieved November 5, 2013, from EBSCO online database SocINDEX with Full Text. http://search.ebscohost.com/login.aspx?direct=true&db=sih&AN=91553862&site=ehost-live
Wienclaw, R. (2018). Sociology and probability theory. Retrieved from EBSCO online database Research Starters—Sociology. http://search.ebscohost.com/login.aspx?direct=true&db=rst&AN=36268117&site=ehost-live&scope=site
Young, R. K., & Veldman, D. J. (1977). Introductory statistics for the behavioral sciences (3rd ed.). New York: Holt, Rinehart and Winston.