Numerical Data Presentation

Abstract

In business, one needs to examine data from many sources in order to determine the best strategy for success. Descriptive statistics offers numerous techniques for organizing numerical data so that they can be presented in a form that humans can easily assimilate. Common methods for organizing and arranging data include the stem-and-leaf plot and the box-and-whiskers plot. In addition, the stem-and-leaf plot can be used in the development of one of the most frequently used methods for graphically depicting data: the frequency distribution. Often, the midpoints on a frequency distribution are connected by a line called a frequency polygon. These graphs may also be translated into ogives, or cumulative frequency polygons. Quality control utilizes statistical tools to increase the level of quality and reduce defects and waste. Some of the descriptive statistics used in quality control include Pareto charts, scatter plots, and Shewhart control charts.

Overview

Human beings are constantly being bombarded by data. In business, one needs to examine data from many sources in order to determine the best strategy for success: customer feedback, competitors' actions, marketplace trends, and so on. Even within these categories, data need to be organized, described, and presented in ways that help human beings comprehend them and use them to solve problems and make decisions. For example, if the marketing department wanted to know customers' reactions to a proposed new widget design before the company decided whether or not to introduce the new product to the market at large, they might let a sample of potential customers use the widget and then complete a survey regarding their reactions. Although when organized and analyzed, these data could be invaluable inputs for making a decision, a pile of 1,000 surveys sitting on the corner of someone's desk is not.

To help solve this problem and to prepare the data for further analysis, the amount of data to be handled is frequently reduced through any one of a number of graphing techniques. These methods are just a few of the many forms of data visualization used to make the presentation of data more easily understandable. While useful at any scale, specialized techniques for numerical data presentation have become especially necessary in the era of big data—huge datasets based on information collected through the Internet and the ever-growing number of sensors throughout the world ("Visualizations Make Big Data Meaningful," 2014). As technology improves, data scientists develop new ways to present and interact with numerical and other forms of data, though many are rooted in basic statistical concepts.

Techniques for Organizing Data. Descriptive statistics offers numerous techniques for organizing numerical data so that they can be presented in a form that humans can easily assimilate. Take, for example, the following collection of 50 data points:

[Figure 1: Raw data. Fifty widget ratings listed in random order.]

The numbers are in random order and it is difficult to tell at a glance whether or not the number 53 is included in the set. If these were raw data from potential customers indicating their reactions and ratings to a new widget design on a 100-point scale, it would be extremely difficult to tell whether or not the new design was successful. One could reduce the confusion somewhat by arranging the raw data in numerical order:

[Figure 2: The same 50 widget ratings arranged in numerical order.]

It becomes much easier to see that the number 53 is included in the data set. However, it is still not readily apparent whether or not the group of people interviewed liked the new widget. One way to organize the data so that the answer to this question is clearer is to group them into intervals and graph the results in a frequency histogram. A histogram is a type of vertical bar chart that graphs frequencies of objects within various classes on the y-axis against the classes on the x-axis. Frequencies are graphed as a series of rectangles. One can, of course, use whatever size intervals are convenient. For example, if the data set ran from zero to 1,000, one might choose to clump the data in groups of 100 (i.e., 1–100, 101–200, etc.). If the range of data (i.e., the difference between the highest and the lowest values in the data set) is smaller, smaller intervals would be more appropriate.

Stem & Leaf Plots. The most basic set of tools used in descriptive statistics comprises various graphing techniques that help organize and summarize data so that they are more easily comprehensible. One common way of arranging data is through a stem-and-leaf plot. This is a graphing technique in which individual data points are broken into the rightmost units ("leaves") and the leftmost units ("stems"). For example, the number 42 would have a stem of 4 and a leaf of 2; the number 47 would have a stem of 4 and a leaf of 7. Using this technique, the data on customer response to the new widget design would look like this:

[Figure 3: Stem-and-leaf plot of the widget rating data.]

The stem-and-leaf plot gives one a better idea of the distribution of the data. For example, one can see that the majority of people rating the new design rated it between 20 and 59. However, one can also easily see in this plot that not everyone rated the new design in this interval and that there are extreme scores at each end of the distribution. What is not so easily seen in a stem-and-leaf plot is the median of the distribution (the middle value in the ordered data set).
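A stem-and-leaf plot of this kind can also be built in a few lines of code. The Python sketch below is illustrative only: the ratings list is a small hypothetical sample (the original 50 data points appear only in the figures), and the function simply splits each value into a tens-digit stem and a ones-digit leaf.

```python
from collections import defaultdict

# Hypothetical ratings on a 100-point scale (not the original 50-point data set).
ratings = [2, 17, 23, 26, 30, 30, 34, 41, 47, 53, 59, 62, 78, 85, 94]

def stem_and_leaf(values):
    """Print a stem-and-leaf plot: tens digit as stem, ones digits as leaves."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)      # e.g., 47 -> stem 4, leaf 7
        plot[stem].append(leaf)
    for stem in range(min(plot), max(plot) + 1):
        leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
        print(f"{stem:>2} | {leaves}")

stem_and_leaf(ratings)
```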

The Box & Whiskers Plot. The median of the distribution can be readily seen through another data presentation technique: the box-and-whiskers plot (also called a box plot). In this approach to graphing data, the upper and lower quartiles, the median, and the two extreme values of a distribution are used to summarize the data in a compact form. In the example of the widget rating data, the median is 41. The lower and upper quartiles (labeled Q1 and Q3, respectively, in Figure 4) are found in a similar way, by taking the medians of the lower and upper halves of the distribution. For the widget rating data, the lower quartile is 26 and the upper quartile is 59.5. The area between the upper and lower quartiles is enclosed with a rectangle on a number line and the position of the median is indicated within the rectangle. In addition, the end points of the data set (in the widget rating example, the lowest value is 2 and the highest value is 94) are also indicated on the number line and connected to the rectangle by lines.

[Figure 4: Box-and-whiskers plot of the widget rating data, showing the extremes, Q1, the median, and Q3.]

The box-and-whiskers plot summarizes a number of characteristics of the data at a glance. One can tell by looking at the box-and-whiskers plot where the midpoint of the distribution is and where the bulk of the scores are (i.e., the 50 percent of the scores within the rectangle). In addition, one can tell how spread out the scores are in the distribution by looking at the end points of the lines connecting the extreme values to the rectangle. The box-and-whiskers plot also helps one easily see whether or not the distribution is skewed, that is, whether it is not symmetrical around the mean (i.e., there are more data points on one side of the mean than on the other).
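The summary values that a box-and-whiskers plot displays can be computed directly. The Python sketch below uses the standard library's statistics module on a hypothetical sample of ratings; the exact quartile values depend on the convention chosen (here, the "inclusive" method of statistics.quantiles), so the sketch illustrates the procedure rather than reproducing the figures above.

```python
import statistics

# Hypothetical ratings (not the original 50-point data set shown in the figures).
ratings = [2, 17, 23, 26, 30, 30, 34, 41, 47, 53, 59, 62, 78, 85, 94]

median = statistics.median(ratings)
# quantiles(..., n=4) returns the three cut points Q1, Q2 (the median), and Q3.
q1, q2, q3 = statistics.quantiles(ratings, n=4, method="inclusive")

print(f"min={min(ratings)}  Q1={q1}  median={median}  Q3={q3}  max={max(ratings)}")
# A box-and-whiskers plot of these five values could then be drawn with,
# for example, matplotlib's plt.boxplot(ratings).
```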

Frequency Distribution. In statistics, one of the most frequently used methods for graphically depicting data is the frequency distribution. In this method, the data are divided into intervals of typically equal length, much in the same way as in the development of a stem-and-leaf plot. By graphing data within intervals rather than as individual data points, the number of data points on the graph is reduced and the graph—and the underlying data—becomes easier to comprehend. For example, Figure 5 shows the data on widget ratings graphed on a scatter plot. However, although the scatter plot correctly shows that the rating most frequently received was 30, this does not mean that the average value of the ratings was 30. To exaggerate the point further, if 10 people had given the new widget design a rating of 2 while everyone else had rated the new design in the 80s and 90s, the value with the greatest number of responses (i.e., 2) would not represent the overall reaction to the new design. In addition, one must consider whether or not there is really a meaningful difference between a rating of 22 on a 100-point scale and a rating of 23. In both cases, the person responding did not like the new widget.

[Figure 5: Scatter plot of the individual widget ratings and their frequencies.]

Aggregations of Data Sets. To get around these problems, data sets are typically aggregated within intervals and then graphed. For example, if one took the data as aggregated in the stem-and-leaf plot above, the frequencies of the values could be represented on a vertical bar chart called a histogram as shown in Figure 6a. This graphing technique reduces the number of points on the graph so that larger patterns can emerge and be understood. Often, the midpoints on a frequency distribution are connected by a line called a frequency polygon (Figure 6b) which may be smoothed to better illustrate the shape of the underlying distribution (Figure 6c). Histograms may also be translated into ogives, or cumulative frequency polygons. This graphing technique plots the cumulative frequencies of the data rather than the frequencies within the individual intervals. For example, in the widget rating data, there are three scores in the first interval of the stem-and-leaf plot and five scores in the second interval. Rather than plotting the points three and five for the corresponding intervals as is done in the frequency polygon (Figure 6b), the ogive plots the cumulative frequencies. Therefore, in this example, the first two points in the ogive are three and eight (i.e., 3 + 5). The ogive for the widget rating data is shown in Figure 6d.

[Figure 6: (a) Histogram, (b) frequency polygon, (c) smoothed frequency polygon, and (d) ogive for the widget rating data.]
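The grouping and accumulation just described can be sketched in code. In the hypothetical Python example below, ratings are grouped into intervals of width 10, the frequency of each interval is counted (as in the histogram and frequency polygon), and a running total gives the cumulative frequencies plotted in an ogive; the data and the interval width are assumptions for illustration.

```python
from itertools import accumulate

# Hypothetical ratings on a 100-point scale (not the original data set).
ratings = [2, 17, 23, 26, 30, 30, 34, 41, 47, 53, 59, 62, 78, 85, 94]

width = 10         # interval (class) width, chosen for convenience
n_bins = 10        # intervals 0-9, 10-19, ..., 90-99 cover the whole scale

# Frequency of ratings falling within each interval (as in a histogram).
freqs = [0] * n_bins
for r in ratings:
    freqs[min(r // width, n_bins - 1)] += 1

# Cumulative frequencies for the ogive: each entry adds all preceding intervals.
cum_freqs = list(accumulate(freqs))

for i, (f, cf) in enumerate(zip(freqs, cum_freqs)):
    lo, hi = i * width, i * width + width - 1
    print(f"{lo:2}-{hi:2}: frequency = {f}, cumulative = {cf}")
```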

Applications

Quality Control Engineering & Statistics. Quality control engineering is concerned in part with the quality of goods produced on a production line. Although it might be tempting to assume that modern high-technology equipment and automation would repeatedly produce quality products without adjustment, this assumption flies in the face of the laws of physics and of probability. No matter how automated a process or how advanced the technology used to control quality, production processes are never perfect, and error in the form of both defects and waste creeps in. Sometimes errors are due to "noise," random variability that occurs naturally. For example, the tonnage and quality of ore produced from a mine vary naturally from day to day. These changes in quality or quantity can affect the inputs into the production line (e.g., lower-quality ore may result in greater breakage of the widgets produced from it). However, other errors can be due to problems with the process, equipment, materials, or humans working the line. Quality control engineering examines processes for ways that they can be continually improved in order to increase the quality of the product.

Quality control utilizes tools from both descriptive statistics and inferential statistics to increase the level of quality and reduce defects and waste. Some of the descriptive statistics used in quality control include Pareto charts, scatter plots, and Shewhart control charts.

Pareto Charts. Pareto charts are used to display the most common types of defects in ranked order of occurrence. Pareto charts are also frequently shown with cumulative percentage line graphs to more easily represent the total percentage of errors accounted for by various defects.
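A Pareto analysis amounts to sorting defect categories by count and accumulating their percentages. The Python sketch below uses hypothetical defect types and counts to show the calculation behind the bars and the cumulative percentage line.

```python
# Minimal Pareto analysis: rank defect types by count, then compute the
# cumulative percentage of all defects accounted for at each rank.
# Defect categories and counts are hypothetical.
defects = {
    "scratches": 48,
    "misaligned parts": 21,
    "cracks": 12,
    "discoloration": 9,
    "other": 5,
}

total = sum(defects.values())
cumulative = 0
for defect, count in sorted(defects.items(), key=lambda kv: kv[1], reverse=True):
    cumulative += count
    print(f"{defect:<18} {count:>3}  {100 * cumulative / total:5.1f}% cumulative")

# The counts (bars) and the cumulative percentages (line) could then be
# drawn together with a plotting library such as matplotlib.
```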

Scatter Plots. Another type of graph commonly used in quality control is the scatter plot. These graphs depict two-variable numerical data so that the relationship between the variables can be examined. For example, if one wanted to know the relationship between the number of defects observed in a given month and the cost of the loss of quality to the company, the two values (number and cost) could be graphed on a two-dimensional graph so that one could better understand the relationship. Examples of a Pareto chart (with cumulative percentage line graph) and a scatter plot are shown in Figure 7.

[Figure 7: Example of a Pareto chart with a cumulative percentage line graph, and an example of a scatter plot.]
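A scatter plot of this kind requires only the two paired variables. The sketch below, assuming matplotlib is installed and using hypothetical monthly figures for defects and cost, plots each month as one point so the relationship between the two variables can be inspected visually.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly quality data: defects observed in a month and the
# associated cost of lost quality (in thousands of dollars).
defects_per_month = [12, 18, 7, 25, 14, 30, 9, 21]
cost_of_quality = [8, 13, 5, 19, 10, 24, 6, 15]

plt.scatter(defects_per_month, cost_of_quality)   # one point per month
plt.xlabel("Defects observed in month")
plt.ylabel("Cost of lost quality ($ thousands)")
plt.title("Defects vs. cost of lost quality")
plt.show()
```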

Quality Control Charts. Another way that quality control engineers deal with fluctuations in quality is through the use of quality control charts. These are simple graphing procedures that help quality control engineers and managers monitor processes and determine whether or not they are in control. These charts are based on two statistical ideas. First, random noise occurs naturally in any process (e.g., the variations in ore quality from a mine). Second, within a random process there is a certain amount of regularity. For a normally distributed variable, only about 5 percent of the time (i.e., roughly one occurrence in 20) will an observation differ from the mean by more than two standard deviations. A process is said to be within statistical control if it performs within the limits of its capability within these parameters.

Quality control charts (also called Shewhart control charts after their originator) help one examine quality data to determine whether or not a process is within statistical control. There are two categories of control charts: control charts for measurements and control charts for compliance. The X-bar chart (so called because it examines arithmetic means, the mathematical symbol for which is X̄) is a chart of the means of some characteristic of the product (e.g., acceptability of solder joints) for small random samples taken from the production line over time. As shown in Figure 8, these means are plotted over time on a chart that contains a center line (i.e., the mean for the process) and upper and lower control limits. The center line in the chart is the arithmetic mean of the means of the samples. The upper control limit in the chart is three standard deviations above the center line and the lower control limit is three standard deviations below the center line. If all the points plotted fall between the upper and lower control limits on the chart, the process is considered to be in control. If, however, a computed sample mean falls outside the control limits, the process is considered to be out of control. The process is then typically stopped so that an assignable cause can be determined. Sometimes assignable causes for the process being out of control are easily explained, such as passing phenomena that are unlikely to occur again. Other times, however, the assignable causes are more serious or long-lasting and require corrective action (e.g., replacing a defective part or machine, retraining employees, switching suppliers). In addition to X-bar charts, which track processes by examining the means of samples, quality control charts include R charts that track the range, p charts that track the proportion of defective products, c charts that track the number of defects, and s charts that track the sample standard deviation.

[Figure 8: X-bar control chart showing sample means plotted over time against a center line and upper and lower control limits.]
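The center line and control limits described above can be computed straightforwardly. The Python sketch below uses hypothetical sample means and estimates variability with the standard deviation of those means; in practice, control limits are often derived from the average sample range or a pooled standard deviation, so this is a simplified illustration of the three-sigma rule rather than a full implementation.

```python
import statistics

# Hypothetical X-bar values: the mean of a small sample drawn from the
# production line in each of 12 consecutive periods.
sample_means = [10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 9.7, 10.1, 10.6, 9.9, 10.2]

center_line = statistics.mean(sample_means)       # mean of the sample means
sigma = statistics.stdev(sample_means)            # simple estimate of variability

ucl = center_line + 3 * sigma                     # upper control limit
lcl = center_line - 3 * sigma                     # lower control limit

print(f"center line = {center_line:.3f}, UCL = {ucl:.3f}, LCL = {lcl:.3f}")
for period, m in enumerate(sample_means, start=1):
    status = "in control" if lcl <= m <= ucl else "out of control"
    print(f"period {period:2}: mean = {m:5.2f} -> {status}")
```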

In addition to Shewhart control charts, more sophisticated charting methods are available. For example, multivariate charting methods allow the quality control engineer to monitor several related variables simultaneously. Methods are also available for charting individual measurements rather than samples (e.g., moving average charts, exponentially weighted moving average charts), as are cumulative sum methods that are more sensitive than Shewhart control charts for detecting small, consistent changes.

Terms & Concepts

Box-and-Whiskers Plot: A graphing technique that summarizes a data set by depicting the upper and lower quartiles, the median, and the two extreme values of a distribution. Also known as a box plot.

Cumulative Frequency Polygon: A graph in which each point represents the sum of the frequencies of the interval and the preceding intervals. Also called an ogive chart.

Data: (sing. datum) In statistics, data are quantifiable observations or measurements that are used as the basis of scientific research.

Descriptive Statistics: A subset of mathematical statistics that describes and summarizes data.

Distribution: A set of numbers collected from data and their associated frequencies.

Frequency Distribution: A graphing technique in which an observed distribution is partitioned into intervals (typically of equal size) and the data within the intervals are summarized and displayed in a bar chart.

Frequency Polygon: A graphing technique in which the summary point for each interval of a frequency distribution is connected by a line from the left-most point to the right-most point.

Histogram: A graphing technique in which data are represented as vertical rectangles where the heights of the rectangles are proportional to the frequencies observed in the corresponding interval.

Inferential Statistics: A subset of mathematical statistics used in the analysis and interpretation of data. Inferential statistics are used to make inferences such as drawing conclusions about a population from a sample and in decision making.

Pareto Chart: A vertical bar chart that graphs the number of defects of each type for a product or service in descending order of frequency. These charts are used to display the most common types of defects in ranked order of occurrence. Pareto charts are often shown with cumulative percentage line graphs to more easily show the total percentage of errors accounted for by various defects.

Quartile: Any of three points that divide an ordered distribution into four equal parts, each of which contains one quarter of the scores.

Raw Data: Data that have not been organized, summarized, or otherwise processed so that they are in usable form. Also called ungrouped or atomic data. See Figure 1.

Scatter Plot: A graphical representation of pairs of data (e.g., length and width of an object).

Skewed: A distribution that is not symmetrical around the mean (i.e., there are more data points on one side of the mean than there are on the other).

Stem-and-Leaf Plot: A graphing technique in which individual data points are broken into the rightmost units ("leaves") and the leftmost units ("stems"). For example, the number 42 would have a stem of 4 and a leaf of 2; the number 47 would have a stem of 4 and a leaf of 7. See Figure 3.

Bibliography

Black, K. (2006). Business statistics for contemporary decision making (4th ed.). New York: John Wiley & Sons.

Daley, J. (2013). The numbers that lie. Entrepreneur, 41(8), 90–94. Retrieved December 3, 2013 from EBSCO Online Database Business Source Premier. http://search.ebscohost.com/login.aspx?direct=true&db=buh&AN=89123513

Ferguson, G. A. (1971). Statistical analysis in psychology and education (3rd ed.). New York: McGraw-Hill Book Company.

Hartmann, H. (2016). Statistics for engineers. Communications of the ACM, 59(7), 58–66. doi:10.1145/2890780. Retrieved December 27, 2016, from EBSCO online database Business Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=bsu&AN=116599166&site=ehost-live&scope=site

John, P. W. (1990). Statistical methods in engineering and quality assurance. New York: John Wiley & Sons.

Majumder, M., Hofmann, H., & Cook, D. (2013). Validation of visual statistical inference, applied to linear models. Journal of the American Statistical Association, 108(503), 942–956. Retrieved December 3, 2013 from EBSCO Online Database Business Source Premier. http://search.ebscohost.com/login.aspx?direct=true&db=buh&AN=90465281

Maynard, R. (2012). Understanding business performance data. Operations Management (1755-1501), 38(1), 25–29. Retrieved December 3, 2013 from EBSCO Online Database Business Source Premier. http://search.ebscohost.com/login.aspx?direct=true&db=buh&AN=73797367

Visualizations make big data meaningful. (2014). Communications of the ACM, 57(6), 19–21. Retrieved December 4, 2015, from EBSCO Online Database Business Source Complete. http://search.ebscohost.com/login.aspx?direct=true&db=bth&AN=96205554&site=ehost-live&scope=site

Witte, R. S. (1980). Statistics. New York: Holt, Rinehart and Winston.

Suggested Reading

Antony, J. (2001). Understanding, managing and implementing quality. London: Routledge.

Barabesi, L., & Fattorini, L. (2013). Special issue on inferential strategies for environmental surveys. Statistical Methods & Applications, 22(1), 1–2. Retrieved December 3, 2013 from EBSCO Online Database Business Source Premier. http://search.ebscohost.com/login.aspx?direct=true&db=buh&AN=85896238

Good-producing industries. (2006, December). Canadian Economic Observer, 19, 35–41. Retrieved September 6, 2007, from EBSCO Online Database Business Source Complete. http://search.ebscohost.com/login.aspx?direct=true&db=bth&AN=23835514&site=ehost-live

Hilbe, J. M. (2014). Modeling count data. New York, NY: Cambridge University Press.

Jance, M. L. (2012). Statistics and the entrepreneur. Academy of Business Research Journal, 133–37. Retrieved December 3, 2013 from EBSCO Online Database Business Source Premier. http://search.ebscohost.com/login.aspx?direct=true&db=buh&AN=85672206

Leitnaker, M. G. & Cooper, A. (2005). Using statistical thinking and designed experiments to understand process operation. Quality Engineering, 17(2), 279–289. Retrieved August 21, 2007, from EBSCO Online Database Business Source Complete. http://search.ebscohost.com/login.aspx?direct=true&db=bth&AN=17003962&site=ehost-live

Samuel, A. (2015). How to give a data-heavy presentation. Harvard Business Review Digital Articles, 2–5. Retrieved December 27, 2016, from EBSCO online database Business Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=bsu&AN=118685309&site=ehost-live&scope=site

Essay by Ruth A. Wienclaw, PhD

Dr. Ruth A. Wienclaw holds a doctorate in industrial/organizational psychology with a specialization in organization development from the University of Memphis. She is the owner of a small business that works with organizations in both the public and private sectors, consulting on matters of strategic planning, training, and human/systems integration.