Statistical Computing
Statistical computing is the intersection of statistical theory and computer science, playing a crucial role in the analysis of data across various fields. It encompasses the development and use of software programs designed to perform statistical calculations, significantly enhancing the efficiency and accuracy of data analysis. There are three main approaches to statistical computing: single programs, statistical systems or large package programs, and collections of statistical algorithms. These programs, such as SAS, SPSS, and MINITAB, are designed to process large datasets quickly and reduce human error, although they do introduce their own types of computational errors.
While computers excel at performing complex calculations, humans remain essential in designing experiments, choosing analytical techniques, and interpreting results. Statistical computing not only allows for the management of vast amounts of data but also optimizes the extraction of meaningful insights from this data. Users should be aware of potential computational errors, such as blunders or approximation errors, which can impact the reliability of results. Overall, statistical computing is an integral component of modern scientific research and analysis, providing essential tools for informed decision-making across various domains.
On this Page
- The Approaches to Statistical Computing Programs
- Computer Error
- Three Types of Computational Error
- The Development of Statistical Computing Application Software
- Goal of Statistical Application Programs
- Five Components of a Computer System
- Applications
- Software Application Programs
- SAS
- MINITAB
- BMDP
- SPSS
- S-PLUS
- Conclusion
- Terms & Concepts
- Bibliography
- Suggested Reading
Statistical computing involves the interaction of virtually every aspect of statistical theory and practice with computer science. In many ways, statistical computing forms the boundary between the two disciplines. There are three major approaches to statistical computing programs: Single programs, statistical systems or large package programs, and collections of statistical algorithms. Although computers can reduce the errors that may be introduced into calculations by humans, computers also have their own set of associated errors. A wide variety of software application programs are available for performing statistical calculations. Programs run the gamut from blending with relational database software on one end of the spectrum to blending with mathematical software at the other end.
Except in class exercises or for very simple statistical calculations with a small sample size, virtually no one today performs statistical analyses by hand. Certainly the classroom experience of learning to calculate by hand is invaluable for understanding how statistics work, but the truth is that computers are far better than human beings at the processing and calculating tasks involved in performing statistical techniques. Human beings still need to design the underlying experiment, determine which statistical technique to use to analyze the data, and interpret the results. However, computers are better than humans at the processing step -- taking the inputs of raw data and turning them into interpretable results. Properly programmed and functioning computers do not reverse numbers or make arithmetic errors as humans are wont to do. In addition, computers excel at processing large amounts of data quickly in a way no human could ever do.
Statistical computing involves the interaction of virtually every aspect of statistical theory and practice as well as nearly every aspect of computer science. Both statistics and computer science are fundamental to all science and together provide complementary tools for scientific endeavors. Statistics is concerned with the accumulation of data, the optimal extraction of information from data, and determining how inferences can be made from data in order to extend knowledge. To do these things, statistics often involves processing or combining data either numerically or symbolically, a task at which computer science excels. Computer science deals with ways to optimize these processes, represent information and knowledge in useful ways, and understand the limits of what can be computed.
The Approaches to Statistical Computing Programs
There are three major approaches to statistical computing programs.
- Single programs (such as the Biomedical Data Program (BMDP) developed at the University of California at Los Angeles) comprise collections of statistical programs that require the user to do little more than supply the input data and retrieve the output in order to run statistical analyses and obtain usable results.
- A second approach is the statistical system or large package program. These are very complex programs that allow users to perform a wide range of statistical analyses by giving the computer instructions in the special language of the system. The Statistical Package for the Social Sciences (SPSS) and the British General Statistics Program (GENSTAT) are examples of this category. These programs can be most useful to frequent users who fully understand the system's language and have a good understanding of the system's strengths and weaknesses. However, these same requirements mean that this category of program can be difficult to use for those who do not use it regularly.
- The third approach to statistical computing is the development of a collection of statistical algorithms (i.e., sequences of well-defined, unambiguous, simple instructions in the form of mathematical and logical procedures that inform a computer how to solve a problem) that are combined into programs. If a convenient method can be found to do this, the algorithmic approach can be very flexible in meeting the needs of the user, as the sketch following this list illustrates.
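To make the notion of a statistical algorithm concrete, the following is a minimal sketch in Python (an illustration only; the function names are hypothetical and not drawn from any of the packages discussed here). Each routine is a self-contained sequence of unambiguous steps, and the two routines can then be combined into a larger analysis program in exactly the sense described above.

```python
import math

def sample_mean(xs):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(xs) / len(xs)

def sample_sd(xs):
    """Sample standard deviation (n - 1 denominator) of a sequence."""
    m = sample_mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

# Combining the algorithms into a small program:
data = [4.1, 3.8, 5.2, 4.9, 4.4]
print(sample_mean(data), sample_sd(data))
```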
Computer Error
Although computers can reduce the errors that can be introduced into calculations by humans, they have their own set of associated errors that may be introduced into the calculations. Although computer storage capabilities are increasing and becoming less expensive, in the end, computers still have limited storage space. To help maximize the use of this space, computers typically store only the most significant digits of data. The actual number of digits that can be stored is determined by the word length of the particular computer. For example, although many people are taught in school that the mathematical constant pi (the ratio of a circle's circumference to its diameter) is equal to 3.14 or 3.141 or 3.1416, it is actually a nonterminating, nonrepeating decimal that cannot be computed exactly. These numbers are therefore merely approximations of the value of pi. To store the value of pi, a computer needs to truncate the number at some point, rounding it appropriately. In many instances, how this is done is not of particular importance to the outcome of the calculations in which the value is used. In other instances, however, it matters: the rounding error can throw off the entire calculation and be magnified in subsequent computations.
Because of rounding error and other factors, computer results frequently contain some error. In most cases, this error is not large and the results are good enough for their purposes. However, being aware of the types of error that can appear in computations and how they are caused can help the user of statistical programs to anticipate potential problems and be better prepared to interpret the results.
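A minimal Python sketch of how a stored approximation can be magnified downstream (illustrative only; the truncation to four decimal places stands in for a short machine word):

```python
import math

# Store pi to only four decimal places, as a machine with a short word
# length might, then watch the small rounding error grow downstream.
pi_approx = 3.1416
print(pi_approx - math.pi)        # initial error of roughly 7.3e-06

radius = 6_371_000                # metres (roughly the Earth's radius)
area_exact = math.pi * radius ** 2
area_approx = pi_approx * radius ** 2
print(area_approx - area_exact)   # the error has grown to ~3e+08 m^2
```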
Three Types of Computational Error
In general, there are three types of errors that can affect the results of computations:
- Blunders are gross errors or mistakes that are easily corrected if detected. Examples of blunders include a "bug" in a computer program or incorrect data input into the system. The other two types of errors, however, are less easily corrected.
- Errors due to the use of an approximate computation method occur when one uses a function or process that approximates a true function or process. For example, evaluating the first n terms in a series expansion of a function may yield results that are only approximations even if the calculations are carried out exactly. Errors due to approximation imposed by the computer result from the way that the computer performs its calculations. These include rounding errors and chopping of fractions in floating point operations. This type of error may start with the data input into the computer for computation. In most cases, due to limitations in storage space and computational ability, these data themselves are only approximations of the "true" values. Therefore, even if the subsequent calculations are performed exactly, they are being performed on approximations of the exact values of the data. As a result, error -- although often negligible -- occurs.
- Another source of error in computer calculations derives from the fact that computers operate in a binary rather than a decimal system. In a binary system, a number is represented by a sequence of "switches" that may be "on" or "off." Each digit in a binary number represents a power of two. As shown in Table 1, the rightmost digit in the number is the units place, the next digit to the left is the twos place, and so forth. Although binary numbers can be confusing to humans, who are more used to working in the base 10 or decimal system, binary numbers are both more precise and more economical for use in a computer, since electronic circuits have two states: on and off. Building a circuit with ten states would be both more difficult and more expensive. Therefore, although a human may input a decimal number into the computer, the computer typically translates it into a binary number, performs its calculations, and then translates the result back into base 10 before returning the answer to the human. However, although whole numbers have exact binary equivalents, many decimal fractions do not. Error, therefore, can creep into the computations.
Table 1: Binary Representation

| Base 10 | Binary |
|---------|--------|
| 2^0 = 1 | 1 |
| 2^1 = 2 | 10 |
| 2^2 = 4 | 100 |
| 2^3 = 8 | 1000 |
| 2^4 = 16 | 10000 |

Examples:

| Base 10 | Binary |
|---------|--------|
| 3 | 11 |
| 7 | 111 |
| 11 | 1011 |
| 20 | 10100 |
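A short Python illustration of both points from Table 1 (a sketch, not tied to any particular statistical package): whole numbers convert to binary exactly, while a decimal fraction such as 0.1 is stored only as the nearest binary fraction.

```python
# Whole numbers have exact binary equivalents.
for n in (3, 7, 11, 20):
    print(n, bin(n))        # 3 -> 0b11, 7 -> 0b111, 11 -> 0b1011, 20 -> 0b10100

# Many decimal fractions do not: 0.1 is stored as the nearest binary
# fraction, so tiny errors creep into arithmetic that uses it.
print(0.1 + 0.2)            # 0.30000000000000004
print(0.1 + 0.2 == 0.3)     # False
```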
The Development of Statistical Computing Application Software
The development of high-quality statistical computing application software is not a trivial task. First, the most appropriate numerical methods must be chosen. These are then written as algorithms that can be implemented by the computer. The algorithms must then be carefully coded in the computer's language. In addition, the developer must consider other factors, including the user interface and the ease with which a user can operate the program. As the state of the art in statistics and in the disciplines that use its methods advances, there is a continuing need for new programs to support the new procedures. Sometimes these merely require the modification or extension of an existing algorithm. In other cases, however, the new procedure is so innovative that a totally new numerical method must be derived.
Goal of Statistical Application Programs
The goal of statistical application programs is to produce dependable, efficient results. To do this, it is insufficient to have good numerical methods and algorithms. These must also be properly programmed (i.e., expressed in computer language) if they are to be of use. Programming is an important step in the process of turning mathematical statistics into reliable, usable results. Well-written algorithms can work poorly in practice if they are poorly programmed, as the sketch below illustrates. This process requires both substantial knowledge and attention to detail.
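A classic illustration of this point, sketched in Python (the function names are hypothetical): the textbook variance formula E[x^2] - (E[x])^2 is mathematically correct, but programming it literally in floating point loses nearly all precision when the data values are large relative to their spread. Subtracting the mean first is algebraically equivalent yet numerically far better behaved.

```python
def variance_one_pass(xs):
    """Textbook formula E[x^2] - E[x]^2: correct algebra, poor numerics."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

def variance_two_pass(xs):
    """Subtract the mean first; algebraically equivalent but stable."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

# Large mean, small spread: the true sample variance is exactly 1.0.
data = [1e9, 1e9 + 1.0, 1e9 + 2.0]
print(variance_two_pass(data))  # 1.0
print(variance_one_pass(data))  # badly wrong -- possibly 0 or negative
```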
Five Components of a Computer System
There are five main components of a computer system.

1. Main Storage: The main storage (also called the memory) of the computer is the hardware that holds the information needed for the computational task or job being performed. Information in main storage is stored in the form of binary digits, or bits, which are often addressed in 8-bit units called bytes.
2. Input: The information supplied to the system, typically comprising program instructions, numeric constants, and input data.
3. Output: The information produced by the system, comprising the intermediate and final results of the computations.
4. Control: The control component directs the activities of the other components, including the retrieval, storage, decoding, and processing of data.
5. Arithmetic & Logic: The arithmetic and logic component performs the operations on the data, including comparisons and conversions. Together, the control and arithmetic and logic components are often referred to as the central processing unit (CPU).
Applications
Software Application Programs
An article published by George Mason University reviewed a wide variety of software application programs available for performing statistical calculations. Of particular interest are the SAS, MINITAB, BMDP, SPSS, and S-PLUS programs. Programs of this type run the gamut from blending with relational database software on one end of the spectrum to blending with mathematical software at the other end.
SAS
The SAS System for Statistical Analysis is arguably the industry standard in statistical software application programs. This statistical computing package grew out of a project in the Department of Experimental Statistics at North Carolina State University in the late 1960s. Today, this evolving system is a tool for complete data management and analysis. The SAS System comprises numerous tools including:
- Statistical analysis of time series data
- Classical statistical problems
- Multivariate analysis
- Linear modeling
- Clustering
- Data visualization
- Plotting
The SAS System is available for PC and UNIX-based platforms as well as for mainframe computers. In addition, the system can be used to conduct simulation studies with random number generators for various distributions. The SAS System allows users to integrate their own functions into the system. It allows sophisticated analysis of time series data, which is of particular interest in situations where it is important to analyze trends, business cycles, and seasonal fluctuations for forecasting and making business decisions regarding future conditions.
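To make the idea of a simulation study concrete, here is a minimal sketch written in Python rather than the SAS language (the names and parameters are illustrative, not drawn from SAS itself): random samples are drawn from a known distribution and used to check how well a statistic estimates the quantity it targets.

```python
import random

random.seed(42)                    # make the simulation repeatable

true_mean, n_samples, n_reps = 5.0, 30, 1000
estimates = []
for _ in range(n_reps):
    # Draw a sample from a normal distribution with known parameters.
    sample = [random.gauss(true_mean, 2.0) for _ in range(n_samples)]
    estimates.append(sum(sample) / n_samples)

# The average estimate should sit very close to the true mean.
bias = sum(estimates) / n_reps - true_mean
print(f"estimated bias of the sample mean: {bias:.4f}")
```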
MINITAB
Three other statistical application programs were started at about the same time as SAS: MINITAB, BMDP, and SPSS. As computing power has evolved, these systems have also evolved from being mainframe application programs to being available for use on smaller computers. MINITAB statistical software comprises tools to analyze scientific, business, and academic data. In the business world, MINITAB is used in a variety of applications, including quality control, chemometrics, and general statistics. MINITAB is available for most commonly used platforms, including Windows, DOS, Macintosh, OpenVMS, and UNIX. A student edition of the software is available for use by high school and college students. The software is designed to be user friendly and features drop-down menus and prompts at each step of the process. As a result of this design, MINITAB does not require lengthy manuals or an extended learning curve in order for the user to be able to utilize the software. Data can be entered directly into MINITAB or imported in a variety of file formats, including Lotus, Excel, Symphony, Quattro Pro, dBase, and ASCII files.
BMDP
The standard for high-end statistical analysis software is BMDP. Although originally designed for analyzing biomedical data, BMDP has long since evolved into a multipurpose statistical package. Currently, BMDP offers personal, classic, and professional editions. Like MINITAB, BMDP is designed to be user friendly. The user interface features point-and-click and fill-in-the-blank features that help users interact with the program. BMDP features a comprehensive library of over 50 statistical routines, and is based on the most advanced algorithms available. The personal edition includes descriptive statistics, t-tests, nonparametric statistics, analysis of variance, frequency tables, and both simple and multiple regression. In addition, the professional edition offers capabilities for log-linear modeling, correspondence analysis, additional regression techniques, and multivariate analysis.
SPSS
Although originally designed for large computer systems, SPSS is now also available for personal computers. SPSS is widely used for survey research, marketing and sales analysis, quality improvement, scientific research, and other applications. SPSS is designed to be a comprehensive set of statistical, graphing, and reporting tools. SPSS Base includes algorithms for most of the commonly used statistical techniques, complete graphics capabilities, and broad data management and reporting capabilities. Other modules build on this base. These modules include a professional statistics module, an advanced statistics module, modules for dealing with tables and trends, a developer's kit, as well as several other modules.
S-PLUS
S-PLUS is a statistical package that is primarily used by research-oriented statisticians. S-PLUS offers great flexibility for customization and the implementation of user-defined routines. In addition to statistical functions, S-PLUS offers extensive graphics capability. S-PLUS runs on both PC and UNIX-based platforms. The statistical capabilities of S-PLUS include the ability to generate random data from 20 different types of distributions; perform 12 different types of hypothesis tests; carry out linear, nonlinear, and projection-pursuit regression; and conduct multivariate analysis and the analysis of time series data.
Conclusion
In addition to these widely used statistical packages, there are a number of more experimental statistical computing programs available for academic and research use. Application software in this category tends to be less comprehensive and reliable, but more innovative, than the mainstream programs discussed above. Software in this category includes XGobi, Xlisp-Stat, ExplorN, and MANET.
Terms & Concepts
Algorithm: A sequence of well-defined, unambiguous, simple instructions (i.e., mathematical and logical procedures) that informs a computer how to solve a problem.
Analysis of Variance (ANOVA): A family of statistical techniques that analyze the joint and separate effects of multiple independent variables on a single dependent variable and determine the statistical significance of the effect.
Application Software: A software program that performs functions not related to the running of the computer itself. Application software includes word processing, electronic spreadsheets, computer graphics, and presentation software.
Data: (sing. datum) In statistics, data are quantifiable observations or measurements that are used as the basis of scientific research.
Descriptive Statistics: A subset of mathematical statistics that describes and summarizes data. Descriptive statistics include graphing techniques, measures of central tendency (i.e., mean, median, and mode), and measures of variability (e.g., range, standard deviation).
Floating Point Calculation: A method used to store and calculate numbers in a computer. In floating point calculation, the location of the decimal point is not fixed (i.e., "floating") so that significant digits are taken into account as needed in the calculation. Although floating point notation helps computers more accurately perform calculations, this approach is not without drawbacks.
Forecasting: In business, forecasting is the science of estimating or predicting future trends. Forecasts are used to support managers in making decisions about many aspects of the business including buying, selling, production, and hiring.
Mainframe: A very large computer that is capable of supporting hundreds or thousands of users simultaneously. Mainframe computers typically perform several functions at once. In the hierarchy of computers, mainframes fall between midrange computers and supercomputers.
Mathematical Statistics: A branch of mathematics that deals with the analysis and interpretation of data. Mathematical statistics provides the theoretical underpinnings for various applied statistical disciplines, including business statistics, in which data are analyzed to find answers to quantifiable questions.
Model: A representation of a situation, system, or subsystem. Conceptual models are mental images that describe the situation or system. Mathematical or computer models are mathematical representations of the system or situation being studied.
Multivariate Statistics: A branch of statistics that is used to summarize, represent, and analyze multiple quantitative measurements obtained on a number of individuals or objects. Examples of multivariate statistics include factor analysis, cluster analysis, and multivariate analysis of variance (MANOVA).
Relational Database: A database that stores data in two-dimensional tables. A relational database management system works with two data tables at the same time and relates the data through links (e.g., a common column or field).
Regression: A statistical technique used to develop a mathematical model for use in predicting one variable from the knowledge of another variable.
Statistics: A branch of mathematics that deals with the analysis and interpretation of data. Mathematical statistics provides the theoretical underpinnings for various applied statistical disciplines, including business statistics, in which data are analyzed to find answers to quantifiable questions. Applied statistics uses these techniques to solve real world problems.
Time Series Data: Data gathered on a specific characteristic over a period of time. Time series data are used in business forecasting. To be useful, time series data must be collected at intervals of regular length.
Bibliography
Cooke, D., Craven, A. H., & Clarke, G. M. (1990). Basic statistical computing (2nd ed.). London: Edward Arnold.
Culpepper, S., & Aguinis, H. (2011). R is for revolution: A cutting-edge, free, open source statistical package. Organizational Research Methods, 14(4), 735-740. Retrieved October 31, 2013, from EBSCO Online Database Business Source Complete. http://search.ebscohost.com/login.aspx?direct=true&db=bth&AN=65475944&site=ehost-live
George Mason University. (2007). A guide to statistical software. Retrieved August 15, 2007, from George Mason University online database http://www.galaxy.gmu.edu/papers/astr1.html
Jahn, N., Fenner, M., & Schirrwagen, J. (2013). PlosOpenR -- Exploring FP7 funded PLOS publications. Information Services & Use, 33(2), 93-101. Retrieved October 31, 2013, from EBSCO Online Database Business Source Complete. http://search.ebscohost.com/login.aspx?direct=true&db=bth&AN=91673015&site=ehost-live
Kennedy, W. J. Jr. & Gentle, J. E. (1980). Statistical computing. New York: Marcel Dekker.
Raim, A. M., Gobbert, M. K., Neerchal, N. K., & Morel, J. G. (2013). Maximum-likelihood estimation of the random-clumped multinomial model as a prototype problem for large-scale statistical computing. Journal of Statistical Computation & Simulation, 83(12), 2178-2194. Retrieved October 31, 2013, from EBSCO Online Database Business Source Complete. http://search.ebscohost.com/login.aspx?direct=true&db=bth&AN=91281577&site=ehost-live
Thisted, R. A. (1988). Elements of statistical computing: Numerical computation. New York: Chapman and Hall.
Suggested Reading
Statistical software. (2006). Scientific Computing, 23(9), 32-35. Retrieved July 26, 2007, from EBSCO Online Database Academic Search Complete. http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=22234584&site=ehost-live
Stokes, H. H. (2004). The evolution of econometric software design: A developer's view. Journal of Economic & Social Measurement, 29(1-3), 205-259. Retrieved July 26, 2007, from EBSCO Online Database Business Source Complete. http://search.ebscohost.com/login.aspx?direct=true&db=bth&AN=13857197&site=ehost-live
Wass, J. A. (2007). New software for scientists and engineers. Scientific Computing, 24(8), 13-39. Retrieved July 26, 2007, from EBSCO Online Database Business Source Complete. http://search.ebscohost.com/login.aspx?direct=true&db=bth&AN=25576389&site=ehost-live