Intermediate Applied Statistics
Intermediate Applied Statistics is a branch of statistics that focuses on methods for making inferences about populations based on sample data. Building upon foundational descriptive statistics, this area of study emphasizes techniques such as inferential statistics, which allow analysts to draw conclusions and make decisions that inform business strategies and organizational goals. Key statistical methods covered typically include analysis of variance (ANOVA), which examines the effects of multiple independent variables on a single dependent variable; the Pearson product moment coefficient of correlation, which assesses the relationship between two variables; and linear regression, used for predicting outcomes based on variable relationships.
These statistical techniques empower decision-makers by providing insights into complex data sets and enabling the evaluation of hypotheses through various tests. For instance, ANOVA helps avoid misleading results from multiple t-tests by analyzing variances across several groups simultaneously. Understanding these statistical methods is vital in contexts ranging from market analysis to quality control, making them integral to effective business practices. However, it is essential to recognize that statistical results reflect probabilities rather than certainties, necessitating careful interpretation to avoid errors in decision-making. Overall, Intermediate Applied Statistics equips individuals with the analytical tools needed to navigate the data-driven landscape of modern business.
Although descriptive statistics are invaluable for organizing and describing data, frequently one needs to be able to draw inferences from the data in order to make decisions and develop plans of action to help the organization reach its goals and objectives. This often involves the use of inferential statistics, a subset of mathematical statistics used in the analysis and interpretation of data. Most intermediate statistics courses discuss a number of statistical procedures that can be used for these purposes, including analysis of variance, a family of statistical techniques that analyze the joint and separate effects of multiple independent variables on a single dependent variable; the Pearson product moment coefficient of correlation, which estimates the degree to which two events or variables are consistently related; and linear regression, a statistical technique in which a line of best fit is extrapolated through a set of data points to analyze the effect of an independent variable on a dependent variable.
Statistics -- either in the form of numerical data or in their analysis and interpretation -- seem to be everywhere one looks. The newspaper tells us the statistics of Senator Harvey's voting record: the percentage of votes in which he participated, how many of his votes were in support of environmental issues, and whether or not his actions were in favor of or opposed to higher taxes. Television advertisements tell us that the Woody has been found to be 59 percent safer than the Speed Racer. The latest diet book promises to help us lose 14 percent more weight on average. Professors are rated by students on a five-point scale and students' grades are judged based on the normal curve. In the business world, descriptive statistics, a subset of mathematical statistics that describes and summarizes data, is useful, too. Marketers find it helpful to know that on average, people rate the new widget design as an 8.5 on a 10-point scale. Quality control engineers graph the number of defects found in a series of random samples to determine whether processes are in or out of control. Employees are given job performance ratings that determine whether they will be given a raise or put on probation. In these ways and more, statistics are part of our lives.
However, one often wants to be able to do more than merely describe data. Although descriptive statistics are invaluable for organizing and describing data through various graphing techniques, measures of central tendency, and measures of variability, one frequently needs to be able to draw inferences from the data in order to make decisions and develop plans of action to help the organization reach its goals and objectives. Good business strategy is based on the rigorous analysis of empirical data, including market needs and trends, competitor capabilities and offerings, and the organization's resources and abilities. Developing a good business strategy often involves the use of inferential statistics, a subset of mathematical statistics used in the analysis and interpretation of data. Inferential statistics are used to make inferences, such as drawing conclusions about a population from a sample, and to support many business decisions.
In addition to descriptive statistics, most beginning statistics courses also teach basic inferential statistical techniques, including z-tests to estimate the mean of a population from the mean of a sample. In addition, the t-tests taught in basic statistics courses can be used for hypothesis testing to determine the probability that various theories about business phenomena are true. However, these statistical techniques can only test simple hypotheses comparing two samples to determine if they come from the same population. Real-world business problems are often more complex, and there is frequently a need to compare more than two conditions at a time. Inferential statistics offers other techniques that can help managers and other business decision makers to answer more complex questions. Some of these techniques include analysis of variance (ANOVA), correlation, and linear regression.
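As a simple illustration, the sketch below shows how such a two-sample t-test might be run in Python; the satisfaction scores are hypothetical and the SciPy library is assumed to be available.

```python
# A minimal sketch of a two-sample t-test (hypothetical data; SciPy assumed).
from scipy import stats

# Customer satisfaction scores for two store layouts (illustrative values only)
layout_a = [7.1, 6.8, 7.4, 6.9, 7.2, 7.0, 6.7, 7.3]
layout_b = [6.2, 6.5, 6.1, 6.8, 6.4, 6.3, 6.6, 6.0]

# Test whether the two samples appear to come from the same population
t_stat, p_value = stats.ttest_ind(layout_a, layout_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```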
Interpreting Statistical Results
These statistical techniques are powerful tools that can be invaluable in assisting managers and other organizational decision makers in their tasks. However, it needs to be borne in mind that the results of statistical techniques do not yield black-and-white answers, but probabilities. Inferential statistics are used to test the probability that the null hypothesis (H0) is true. This is the statement that there is no statistical difference between the status quo and the experimental condition. If the null hypothesis is true, then the treatment or characteristic being studied made no difference to the end result. For example, a null hypothesis might state that children and adults do not differ in their preference for Super Crunchies cereal over Nutty Flakies cereal. The alternative hypothesis (H1), on the other hand, would state that there is a relationship between the two variables.
A lack of understanding of the way that probability works can result in poor experimental design that yields spurious results. It is important to remember that the results of a statistical data analysis do not prove whether or not the hypothesis is true, but indicate the probability of the hypothesis being true at a given confidence level. So, for example, if a t-test or analysis of variance results in a value that is significant at the α = .05 level, this means not that the hypothesis is true, but that the analyst is willing to run the risk of being wrong five times out of 100. This means that there is a possibility of error when interpreting statistics and either accepting or rejecting the null hypothesis. Type I errors occur when one incorrectly rejects the null hypothesis and accepts the alternative hypothesis. An example of a Type I error would be if an analyst concluded that adults enjoy Super Crunchies while children do not enjoy them when, in fact, there is no difference. Type II errors, on the other hand, occur when one incorrectly accepts the null hypothesis. For example, if the analyst interpreted the results to mean that both children and adults equally enjoy Super Crunchies when in actuality adults prefer it more than children do, then a Type II error would have occurred.
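The meaning of the α = .05 risk can be made concrete with a short simulation. In the hypothetical Python sketch below, pairs of samples are repeatedly drawn from the same population, so the null hypothesis is true by construction; roughly five percent of the t-tests nonetheless come out "significant," and every one of those rejections is a Type I error.

```python
# Simulation sketch: with alpha = .05 and a true null hypothesis, about 5% of
# tests are still "significant" by chance alone (Type I errors). Population
# parameters are hypothetical; NumPy and SciPy assumed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_tests = 10_000
false_rejections = 0

for _ in range(n_tests):
    # Both samples come from the SAME population, so H0 is true by construction
    a = rng.normal(loc=50, scale=10, size=30)
    b = rng.normal(loc=50, scale=10, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_rejections += 1

print(f"Observed Type I error rate: {false_rejections / n_tests:.3f} (expected ~{alpha})")
```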
Applications
The techniques taught in intermediate statistics courses vary from course to course. However, most intermediate statistics courses discuss a powerful family of statistical techniques called analysis of variance, which analyzes the joint and separate effects of multiple independent variables on a single dependent variable and determines the statistical significance of the effect. Another statistical technique that is often taught in intermediate statistics courses is the Pearson product moment coefficient of correlation, which estimates the degree to which two events or variables are consistently related. In addition, most intermediate statistics courses introduce students to the concept of linear regression, a statistical technique in which a line of best fit is extrapolated through a set of data points to analyze the effect of a single independent variable on a dependent variable.
Analysis of Variance
Analysis of variance is a family of techniques used to analyze the joint and separate effects of multiple independent variables on a single dependent variable and to determine the statistical significance of the effect. Although t-tests are fine for testing hypotheses for two conditions (e.g., does battery X have a longer life than battery Y?), t-tests cannot handle situations where more than two groups or treatment levels must be compared. So, for example, a t-test could not test whether there was a difference between the life of batteries X, Y, and Z. Although it might be tempting to run a t-test comparing batteries X and Y, another to test batteries Y and Z, and a third to test batteries X and Z, this approach can lead to spurious results: the "significance" observed in such situations may not reflect underlying differences but may actually be an artifact of testing the data too many times. The more tests are run on a single set of data, the higher the probability that spuriously significant results will occur merely by chance. For such situations, analysis of variance can be used with much less risk.
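The sketch below illustrates both points with hypothetical battery lifetimes (SciPy assumed): the rough arithmetic behind the inflated risk of running three separate t-tests, and a single one-way analysis of variance applied to all three batteries at once.

```python
# Hypothetical battery lifetimes in hours; SciPy assumed.
from scipy import stats

battery_x = [41.2, 39.8, 40.5, 42.1, 40.9, 41.5]
battery_y = [40.7, 41.3, 39.9, 40.2, 41.8, 40.4]
battery_z = [42.0, 41.1, 40.6, 41.7, 42.3, 40.8]

# Running three separate t-tests at alpha = .05 gives (assuming independent tests)
# roughly a 1 - 0.95**3 chance of at least one spurious "significant" result.
familywise_risk = 1 - (1 - 0.05) ** 3
print(f"Risk of at least one false positive across three t-tests: {familywise_risk:.1%}")

# A single one-way ANOVA compares all three groups at the stated alpha level
f_stat, p_value = stats.f_oneway(battery_x, battery_y, battery_z)
print(f"One-way ANOVA: F = {f_stat:.2f}, p = {p_value:.3f}")
```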
Data Assumptions
Analysis of variance makes several important assumptions about the data.
- First, it is assumed that the sample observations are drawn from normally distributed populations.
- Second, it is assumed that the samples are randomly drawn from the population.
- Third, it is assumed that the variances of the populations are equal.
With these assumptions in mind, analysis of variance examines two sources of variability.
- Variability between groups is the variation among the scores of subjects that are treated differently. In the example of the expected life of three different batteries, the between groups variability would look at the difference in battery life for batteries X vs. Y vs. Z.
- Within groups variability, the second type of variability of interest in analysis of variance, looks at the variation among the scores of subjects that are treated alike. In the battery example, within groups variability would look at the variation of scores among all battery Xs tested, all battery Ys tested, and all battery Zs tested. This type of variability is sometimes referred to as the "error term" because it is due to random error resulting from uncontrolled factors such as individual differences.
The statistic resulting from an analysis of variance is the F ratio, which is defined as:
F = variability between groups / variability within groups
When the null hypothesis is true (i.e., there is no difference between treatments), the F statistic reflects only random error. If there is a difference between treatments and the alternative hypothesis is true, however, then the F statistic reflects not only the variability due to random error but also the variability due to the treatment effect:
F = (random error + treatment effect) / variability within groups
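The calculation behind the F ratio can be sketched in a few lines of Python. The example below (using the same hypothetical battery lifetimes as above) computes the between-groups and within-groups sums of squares by hand, forms the F ratio from the corresponding mean squares, and checks the result against SciPy's built-in one-way ANOVA.

```python
# Worked sketch of the F ratio (hypothetical battery data; NumPy and SciPy assumed).
import numpy as np
from scipy import stats

groups = [
    np.array([41.2, 39.8, 40.5, 42.1, 40.9, 41.5]),  # battery X
    np.array([40.7, 41.3, 39.9, 40.2, 41.8, 40.4]),  # battery Y
    np.array([42.0, 41.1, 40.6, 41.7, 42.3, 40.8]),  # battery Z
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Between-groups variability: how far each group mean falls from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-groups variability (the "error term"): spread of scores inside each group
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)       # mean square between groups
ms_within = ss_within / (n_total - k)   # mean square within groups
f_ratio = ms_between / ms_within

print(f"F computed by hand:    {f_ratio:.3f}")
print(f"F from scipy f_oneway: {stats.f_oneway(*groups).statistic:.3f}")
```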
Research Design
Analysis of variance is not a single technique but a family of techniques used with experimental research. In a completely randomized design (one-way analysis of variance), subjects are randomly assigned to treatments in a research design that contains only one independent variable with two or more treatment levels. For example, if one wanted to know which of several packaging options people preferred (e.g., the current packaging vs. one or more new packaging options), one could analyze the data with a one-way analysis of variance. Another research design commonly analyzed using analysis of variance is the randomized block design. In this design, a second variable (referred to as a blocking variable) is used to control for confounding or extraneous variables that are not being tested in the research. Two-way (factorial) analysis of variance is used when the research design includes two or more independent variables (treatments) that the analyst wishes to examine simultaneously. For example, the marketing department may want to know if there is a difference between the way that women and men react to the three packaging options. In addition to univariate analysis of variance, multivariate analysis of variance (MANOVA) techniques are available that allow the business analyst to test hypotheses about more complex problems involving the simultaneous effects of multiple independent variables on multiple dependent variables.
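As a hedged illustration of the two-way design, the sketch below uses the statsmodels formula interface to test the packaging-by-gender example; the data frame, column names, and ratings are invented for illustration, and pandas and statsmodels are assumed to be installed.

```python
# Two-way ANOVA sketch: packaging option x gender (hypothetical ratings).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    "packaging": ["current", "current", "new_a", "new_a", "new_b", "new_b"] * 4,
    "gender":    ["f", "m"] * 12,
    "rating":    [7, 6, 8, 7, 6, 5, 7, 7, 9, 8, 6, 6,
                  8, 6, 8, 8, 5, 6, 7, 5, 9, 7, 6, 4],
})

# Main effects of packaging and gender, plus their interaction
model = ols("rating ~ C(packaging) * C(gender)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))
```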
Pearson Product Moment Coefficient of Correlation
Sometimes one needs to be able to predict one variable from knowledge of another variable in order to make decisions about the future. For example, if one were launching a new cereal in the marketplace, it might be helpful to know the demographics of the people who prefer the cereal so that it could be introduced into the correct market. If the new cereal appealed primarily to children but not to adults, it would not be a prudent strategy to put it in the grocery store in an all-adult community. Similarly, a cereal that appeals to children would probably have different packaging than one that appealed to adults. To make these and other decisions about marketing of the new cereal, it would be helpful to know the relationship between the variables of customer age and attitude toward the new cereal. One way to do this is through the Pearson product moment coefficient of correlation. This technique allows analysts to determine whether the two variables are positively correlated (i.e., the older people become, the more they like the new cereal), negatively correlated (i.e., the older people become, the less they like the new cereal), or not correlated at all (i.e., both children and adults like the new cereal equally).
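A minimal sketch of this analysis appears below; the customer ages and cereal ratings are hypothetical, and SciPy's pearsonr function is used to compute the coefficient.

```python
# Pearson correlation between customer age and rating of the new cereal
# (hypothetical data; SciPy assumed).
from scipy import stats

age    = [6, 8, 10, 12, 25, 34, 41, 52, 60, 68]
rating = [9, 9,  8,  8,  6,  5,  5,  4,  3,  3]

r, p_value = stats.pearsonr(age, rating)
print(f"r = {r:.2f}, p = {p_value:.4f}")  # r near -1 indicates a negative correlation
```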
Although the correlation coefficient can show the relationship between two variables, it does not show what caused the relationship. Just because two variables are related does not mean that one caused the other. Both may be caused by a third, unknown force. For example, just because the alarm clock goes off at 7:00 every morning does not mean that the rising sun causes the alarm to sound. Similarly, it is not the alarm clock that causes the sun to rise. Although these two variables are related, one does not cause the other. In another example, in general it could be said that weight gain in the first year of life is positively correlated with age (i.e., the older the baby, the more it tends to weigh). However, this same correlation would not apply to adults: heavier adults are not necessarily older than lighter adults. Correlation only shows the relationship between the two variables: It does not explain why the relationship occurs or what caused it.
In a classic example, it was once noted that the number of births in villages in a northern European country was highly correlated with the number of storks seen nesting in the chimneys. If one did not understand that correlation does not imply causation, it would be tempting to interpret this positive correlation to mean that storks bring babies. The truth, however, is that more babies were conceived in the summer, which meant they were born the following spring when, coincidentally, storks were happily nesting over the warm chimneys.
Simple Linear Regression
Knowing that there is a relationship between two variables does not always provide sufficient information to make good decisions, however. Sometimes it would be helpful to develop a mathematical model (i.e., a mathematical representation of the system or situation being studied) to be used to predict one variable from another. Regression analysis is a family of statistical techniques that is used for this purpose. Data are typically graphed on a scatter plot (a graph depicting pairs of points for two-variable numerical data) and a line of best fit that can be used to predict the value of one variable from a knowledge of the value of the other variable is mathematically determined. A sample scatter plot with line of best fit is shown in Figure 1.
For example, a trainer might want to determine the cost of running a seminar for varying numbers of students. More students, of course, would mean more income. On the other hand, more students would also mean more expenses such as costs for handouts, larger conference rooms, more trainers to run small group sessions, and so forth. Too few students, however, might mean that the training course would not pay for itself. To develop a predictive model for cost vs. number of trainees, data could be collected on these variables for a number of training courses. Linear regression could then be used to mathematically calculate the slope of the line of best fit passing through the data to determine the equation of the simple regression line. This equation could then be used by the trainer to determine optimal class sizes for training courses.
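The trainer's model might be sketched as follows; the trainee counts and course costs are invented for illustration, and SciPy's linregress function supplies the slope and intercept of the line of best fit.

```python
# Simple linear regression of course cost on number of trainees
# (hypothetical data; SciPy assumed).
from scipy import stats

trainees = [10, 15, 20, 25, 30, 35, 40, 45]
cost     = [2100, 2600, 3150, 3500, 4050, 4600, 5050, 5500]  # dollars

result = stats.linregress(trainees, cost)
print(f"cost ≈ {result.slope:.1f} * trainees + {result.intercept:.1f}")

# Use the fitted line to predict the cost of a 50-person course
predicted = result.slope * 50 + result.intercept
print(f"Predicted cost for 50 trainees: ${predicted:,.0f}")
```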
Conclusion
Intermediate statistics courses go beyond the basic tools offered by descriptive statistics to allow the user to make inferences about the data. Among the techniques typically taught in these courses are analysis of variance, the Pearson product moment coefficient of correlation and linear regression. These tools can be invaluable to the business analyst and decision maker alike in solving real world business problems.
Terms & Concepts
Analysis of Variance (ANOVA): A family of statistical techniques that analyze the joint and separate effects of multiple independent variables on a single dependent variable and determine the statistical significance of the effect.
Correlation: The degree to which two events or variables are consistently related. Correlation may be positive (i.e., as the value of one variable increases the value of the other variable increases), negative (i.e., as the value of one variable increases the value of the other variable decreases), or zero (i.e., the values of the two variables are unrelated). Correlation does not imply causation.
Data: (sing. datum) In statistics, data are quantifiable observations or measurements that are used as the basis of scientific research.
Dependent Variable: The outcome variable or resulting behavior that changes depending on whether the subject receives the control or experimental condition (e.g., a consumer's reaction to a new cereal).
Descriptive Statistics: A subset of mathematical statistics that describes and summarizes data.
Hypothesis: An empirically testable declaration that certain variables and their corresponding measures are related in a specific way proposed by a theory.
Independent Variable: The variable in an experiment or research study that is intentionally manipulated in order to determine its effect on the dependent variable (e.g., the independent variable of type of cereal might affect the dependent variable of the consumer's reaction to it).
Inferential Statistics: A subset of mathematical statistics used in the analysis and interpretation of data. Inferential statistics are used to make inferences such as drawing conclusions about a population from a sample and in decision making.
Linear Regression: A statistical technique used to develop a mathematical model for use in predicting one variable from the knowledge of another variable.
Mathematical Statistics: A branch of mathematics that deals with the analysis and interpretation of data. Mathematical statistics provides the theoretical underpinnings for various applied statistical disciplines, including business statistics, in which data are analyzed to find answers to quantifiable questions.
Null Hypothesis (H0): The statement that the findings of the experiment will show no statistical difference between the current condition (control condition) and the experimental condition.
Population: The entire group of subjects belonging to a certain category (e.g., all women between the ages of 18 and 27; all dry cleaning businesses; all college students).
Sample: A subset of a population. A random sample is a sample that is chosen at random from the larger population with the assumption that such samples tend to reflect the characteristics of the larger population.
Statistical Significance: The degree to which an observed outcome is unlikely to have occurred due to chance.
Variable: An object in a research study that can have more than one value. Independent variables are stimuli that are manipulated in order to determine their effect on the dependent variables (response). Extraneous variables are variables that affect the response but that are not related to the question under investigation in the study.