Diagnostic testing and mathematics

Summary: Diagnostic tests rely on statistics from clinical research to predict the presence or severity of a disease in a specific patient.

The ability of humans to detect and treat diseases has advanced considerably in the past two centuries, with the discovery of underlying causes, such as microorganisms, and treatments, like antibiotics, as well as methods for diagnosing injury and disease. In medicine, a diagnostic test is an instrument used to detect or predict the presence, absence, or severity of disease.


The instrument used may take a variety of forms, including a patient inventory or a mechanical device. In clinical research, it is common practice to assess the quality of such instruments relative to established gold standards.

Here, the intention is often to replace a traditional method by a newer one that offers greater benefits to health providers or patients, including cost reduction and less physical or psychological discomfort.

It may be of interest to use the diagnostic tool to predict outcomes based on existing symptoms. In this case, the gold standard is used to confirm patient outcomes for comparison with test predictions based on surrogate measures.

Common measures of instrument quality include reliability, validity, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). Strictly speaking, these measures apply to the scores produced by the instruments rather than to the instruments themselves, since they are estimated from studies of a specific sample of patients. Mathematicians and statisticians are essential partners in creating many diagnostic tools, such as magnetic resonance imaging, as well as in developing and refining the measures that allow clinicians and researchers to determine the efficacy of diagnostic instruments. They also help design the experiments in which new instruments are tested and compared.

Nursing and other healthcare education programs frequently require courses in mathematics or statistics, and biostatistics was among the fastest-growing occupations of the late twentieth and early twenty-first centuries.

Reliability represents the reproducibility of the test outcomes. A simple case involves estimating the extent of chance-corrected agreement in the interpretation of categorical findings from medical images derived from patients. Here, agreement might be measured across different clinicians based on a single imaging procedure or, alternatively, across different imaging procedures. In such cases, an appropriate choice of kappa statistic or intra-class correlation coefficient may prove helpful. For continuous data, the Bland–Altman method has also proved particularly popular in measuring agreement across different methods. This is especially so within medicine, where, for example, there may be a need to compare residual tumor sizes obtained using magnetic resonance imaging with pathologic findings (the gold standard) in breast cancer patients who have undergone neoadjuvant (preoperative) chemotherapy.
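
As an illustrative sketch of the first of these, the kappa statistic for two raters (Cohen's kappa) corrects the observed proportion of agreement for the agreement expected by chance. The ratings below are hypothetical, and plain Python is used in place of a statistics package:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical labels."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed proportion of cases on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

# Hypothetical readings of ten images by two clinicians (1 = lesion, 0 = none).
a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]
print(f"kappa = {cohens_kappa(a, b):.2f}")
```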

The remaining measures above represent the accuracy of the test outcomes. Validity, which is a function of reliability, represents the extent to which the diagnostic test measures what is intended and is particularly relevant in psychological testing. Sensitivity (specificity) measures the proportion of genuine instances of disease (absence of disease, respectively) that are detected as such by the diagnostic test. By contrast, the PPV (NPV) measures the proportion of cases diagnosed by the test as instances of disease (absence of disease, respectively) that are, or will turn out to be, genuine. In assessing test accuracy, it can prove misleading to focus exclusively on sensitivity and specificity.
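
These four accuracy measures all follow directly from the two-by-two table that cross-classifies test results against the gold standard. A minimal sketch, with hypothetical counts:

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Accuracy measures from a 2x2 table of test result vs. gold standard."""
    return {
        "sensitivity": tp / (tp + fn),  # diseased patients correctly detected
        "specificity": tn / (tn + fp),  # disease-free patients correctly cleared
        "ppv": tp / (tp + fp),          # positive test results that are genuine
        "npv": tn / (tn + fn),          # negative test results that are genuine
    }

# Hypothetical study of 1,000 patients: 90 true positives, 40 false positives,
# 10 false negatives, and 860 true negatives.
print(diagnostic_measures(tp=90, fp=40, fn=10, tn=860))
```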

The PPV and NPV for a disease are influenced strongly by disease prevalence (the pre-test probability that a randomly chosen person from the study cohort has the disease). The PPV increases with increasing prevalence, and where prevalence is particularly low (less than 5%), the PPV can be markedly improved by moderate increases in test specificity. In interpreting a published PPV, it is essential not only to consider the confidence interval but also to verify whether disease prevalence for the published study is representative of that for the types of patient currently under consideration. This requirement applies equally to the NPV.
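
The dependence of the PPV on prevalence follows from Bayes' theorem: PPV = (sensitivity × prevalence) / (sensitivity × prevalence + (1 − specificity) × (1 − prevalence)). The short sketch below, with hypothetical figures, shows how a test with 95% sensitivity yields very different PPVs as prevalence and specificity vary:

```python
def ppv(prevalence, sensitivity, specificity):
    """Bayes' theorem: P(disease | positive test)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# At low prevalence, a moderate gain in specificity improves the PPV markedly.
for prev in (0.01, 0.05, 0.20):
    for spec in (0.90, 0.99):
        print(f"prevalence={prev:.0%}, specificity={spec:.0%}: "
              f"PPV={ppv(prev, 0.95, spec):.2f}")
```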

Further, there is typically an initial stage in which diagnostic test measurements in continuous form are classified into categories. This categorization requires the derivation of a threshold value for differentiating between diseased and non-diseased patients. The clinician may be interested in finding the threshold value that offers an optimal combination of values for sensitivity and (1-specificity). Examples of scores that have been used in this way include

  • The GRACE (Global Registry of Acute Coronary Events) score in predicting death and myocardial infarction for patients with Acute Coronary Syndrome
  • The APACHE (Acute Physiology and Chronic Health Evaluation) II score and GS (Glasgow Severity) score in the prediction of each of onset of severe pancreatitis, MODS (multiorgan dysfunction syndrome), and death in patients presenting with acute pancreatitis
  • The MELD (Model for End-Stage Liver Disease) and UKELD (United Kingdom MELD) scores in the assessment of risk of acute liver failure and hence the prediction of waiting-list mortality in patients awaiting liver transplants

The underlying procedure for deriving the threshold value involves segregating the test instrument scores into two groups, as determined by the gold standard: those who do and those who do not have the condition of interest. The accuracy of the diagnostic test is in turn assessed on the basis of these two groups. This assessment involves generating a series of threshold values and corresponding values for sensitivity and 1-specificity. The receiver operating characteristic (ROC) curve is a plot of sensitivity against 1-specificity. If the intention is to compare the performance of competing diagnostic tests, ROC curves for the different tests can be plotted on the same graph. For any one plot, the numerically optimal combination of sensitivity and specificity values is represented by the point on the curve that is closest to the top left-hand corner. However, the trade-off between sensitivity and specificity must also be carefully weighed.
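
A minimal sketch of this procedure, using hypothetical scores, sweeps candidate thresholds over the two gold-standard groups, records (1-specificity, sensitivity) pairs, and picks the threshold closest to the top left-hand corner of the ROC plot:

```python
def roc_points(scores_diseased, scores_healthy, thresholds):
    """Sweep thresholds; a score >= threshold counts as a positive test."""
    points = []
    for t in thresholds:
        sens = sum(s >= t for s in scores_diseased) / len(scores_diseased)
        spec = sum(s < t for s in scores_healthy) / len(scores_healthy)
        points.append((t, 1 - spec, sens))  # (threshold, x, y) on the ROC plot
    return points

# Hypothetical test scores, grouped by the gold standard.
diseased = [4.1, 5.3, 6.0, 6.8, 7.2, 8.5]
healthy = [2.0, 3.1, 3.9, 4.5, 5.0, 6.2]

pts = roc_points(diseased, healthy, thresholds=sorted(diseased + healthy))
# Numerically optimal threshold: the point closest to the corner (0, 1).
best = min(pts, key=lambda p: p[1] ** 2 + (1 - p[2]) ** 2)
print(f"threshold={best[0]}, 1-specificity={best[1]:.2f}, "
      f"sensitivity={best[2]:.2f}")
```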

For example, if the test is confirmatory, as might be the case in human immunodeficiency virus (HIV) testing, it may be preferable to choose a slightly different point that further reduces the proportion of false positives (1-specificity) at a small cost to sensitivity. In comparing the accuracy of two tests by means of ROC curves, it is common to use the area under the curve (AUC).

Where the diagnostic test identifies cases falling into the upper (lower) range of a test score, the AUC may be interpreted as the probability that a randomly chosen diseased patient will have a higher value (lower value, respectively) than a randomly chosen disease-free patient.
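
This interpretation coincides with the normalized Mann–Whitney U statistic and suggests a direct way of computing the AUC: compare every diseased/disease-free pair of scores, crediting ties with one half. A sketch, reusing the hypothetical scores from the ROC example above:

```python
def auc_concordance(scores_diseased, scores_healthy):
    """AUC as P(diseased score > healthy score); ties count one half."""
    wins = 0.0
    for d in scores_diseased:
        for h in scores_healthy:
            wins += 1.0 if d > h else 0.5 if d == h else 0.0
    return wins / (len(scores_diseased) * len(scores_healthy))

# The same hypothetical scores as in the ROC sketch.
diseased = [4.1, 5.3, 6.0, 6.8, 7.2, 8.5]
healthy = [2.0, 3.1, 3.9, 4.5, 5.0, 6.2]
print(f"AUC = {auc_concordance(diseased, healthy):.2f}")
```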

Where ROC curves do not cross, therefore, the greater the area under the curve, the more effective the diagnostic tool. Where they do cross, the curve with the lower overall AUC may still attain an optimal combination of sensitivity and specificity values not reached by the other curve. It may therefore make sense to compare the partial areas under the curves within one or more ranges of specificity values.
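
As a sketch of this idea, the partial area can be obtained by clipping each linear segment of an empirical ROC curve to the specificity range of interest and applying the trapezoidal rule (the curve coordinates below are hypothetical):

```python
def partial_auc(points, spec_lo, spec_hi):
    """Trapezoidal area under an ROC curve within a specificity range.

    `points` are (1-specificity, sensitivity) pairs sorted by 1-specificity;
    only segments with specificity between spec_lo and spec_hi contribute.
    """
    x_lo, x_hi = 1 - spec_hi, 1 - spec_lo  # convert to the ROC's x-axis
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        a, b = max(x0, x_lo), min(x1, x_hi)
        if a >= b:
            continue  # segment lies outside the range of interest
        # Interpolate sensitivity linearly at the clipped endpoints.
        ya = y0 + (y1 - y0) * (a - x0) / (x1 - x0)
        yb = y0 + (y1 - y0) * (b - x0) / (x1 - x0)
        area += 0.5 * (ya + yb) * (b - a)
    return area

# High-specificity region (specificity 0.8 to 1.0) of a hypothetical curve.
curve = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.8), (0.6, 0.9), (1.0, 1.0)]
print(f"partial AUC = {partial_auc(curve, 0.8, 1.0):.3f}")
```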

Bibliography

Fox, Keith A., et al. “Prediction of Risk of Death and Myocardial Infarction in the Six Months Following Presentation With ACS: A Prospective, Multinational, Observational Study (GRACE).” British Medical Journal 333 (2006).

Lasko, Thomas A., et al. “The Use of Receiver Operating Characteristic Curves in Biomedical Informatics.” Journal of Biomedical Informatics 38, no. 5 (2005).

Mofidi, Reza, et al. “Identification of Severe Acute Pancreatitis Using an Artificial Neural Network.” Surgery 141, no. 1 (2007).

Neuberger, James, et al. “Selection of Patients for Liver Transplantation and Allocation of Donated Livers in the UK.” Gut 57 (2008).

Obuchowski, Nancy A. “Receiver Operating Characteristic Curves and Their Use in Radiology.” Radiology 229 (2003).

Pan, Jian-Xin, and Kai-Tai Fang. Growth Curve Models and Statistical Diagnostics. New York: Springer, 2002.

Partridge, Savannah C., et al. “Accuracy of MR Imaging for Revealing Residual Breast Cancer in Patients Who Have Undergone Neoadjuvant Chemotherapy.” American Journal of Roentgenology 179 (2002).

Ward, Michael E. “Diagnostic Tests.” www.chlamydiae.com/restricted/docs/labtests/diag_examples.asp.

Wilson, Edwin B. “Probable Inference, The Law of Succession, and Statistical Inference.” Journal of the American Statistical Association 22, no. 158 (1927).

Zhou, Xiao-Hua, Donna McClish, and Nancy Obuchowski. Statistical Methods in Diagnostic Medicine. Hoboken, NJ: Wiley-Interscience, 2002.