Survival Models

Survival analysis is a topic that has captured the interest of many, especially since the turn of the millennium. This essay summarizes the survival function and the hazard function, along with various issues and concepts associated with efforts to estimate the probability of survival beyond some point in time. Many studies assume or confirm that survival time follows an exponential probability distribution. In contrast to a linear relationship between two variables, the probability of survival beyond a specific event may increase or decrease at an increasing or decreasing rate over time. One way to begin thinking about this topic is to consider that the probability of living beyond age 50 decreases at an increasing rate over time. Analysts can then examine factors that relate to such a finding in their attempts to improve longevity and perhaps the quality of life after a certain age. As a thumbnail sketch of survival analysis and its models, this essay covers bank loan and vehicle replacement decisions as two simplistic applications of a topic that can be quite complex.

Keywords Exponential Probability Distribution; Hazard Function; Longevity; Survival Analysis; Survival Function; Survival Time

Actuarial Science > Survival Models

Overview

This essay covers many basics of survival analysis. It introduces some terminology and offers a few applications from the areas of gambling, finance, and engineering. From a historical viewpoint, survival analysis drew heavily on statistical concepts such as probability distributions, confidence intervals, estimation, and hypothesis testing. Some readers may recall these concepts from their coursework in statistics. Survival analysis can be a very complex topic even for those with a firm understanding of Bayesian and other statistical methods. In simple terms, survival analysis is a method for estimating the probability of survival beyond some specific point in time.

This essay gets into some specifics in the pages ahead, but readers should know at the outset that survival analysis usually entails the following procedures: establishing the baseline form of the hazard function using a sample set of data; describing the differences that distinguish surviving entities from non-surviving entities resident in that sample; generalizing, if appropriate, the results to a larger group; and examining, if appropriate, the effect of factors on survival probability. Initiation of these steps requires consideration of the underlying hazard form. Survival time may follow an exponential distribution, in which the rate of failure remains constant over time, or some other distribution in which that rate increases or decreases. Studies also vary widely in their levels of sophistication with respect to the underlying form of the survival time distribution. For example, researchers at the outset of their study may simply form a convenient assumption about the temporal nature of that probability. Better yet, they may set a course toward confirming whether the probability of survival is constant, increasing, or decreasing over time. Before getting any deeper into the topic by introducing concepts such as survival and hazard functions, a need exists to pause and consider some scholarly advice.
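As a minimal sketch of the first step, assuming survival times follow an exponential distribution, the constant rate can be estimated by maximum likelihood as the number of observed failures divided by their total time, and the survival probability beyond any point then follows directly. The failure times below are hypothetical, chosen only for illustration.

```python
import math

def fit_exponential_rate(failure_times):
    """MLE for the exponential rate: lambda = n / sum(t_i)."""
    return len(failure_times) / sum(failure_times)

def survival_prob(lam, t):
    """S(t) = exp(-lambda * t): probability of surviving beyond t."""
    return math.exp(-lam * t)

# Hypothetical failure times (in years), for illustration only.
times = [1.2, 0.8, 2.5, 3.1, 0.4, 1.9]
lam = fit_exponential_rate(times)
print(f"estimated rate: {lam:.3f}")
print(f"P(survive beyond 2 years): {survival_prob(lam, 2.0):.3f}")
```

Note that under the exponential assumption the estimated rate fully determines the survival curve; richer distributions, introduced later in the essay, relax that restriction.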

The Martingale Concept

Oakes (2002) asserts that survival analysis is most understandable through the martingale concept. Consider an example from the subject of gambling. As most of us are aware, the first wager usually fails to produce the winnings we sought. There are a number of methods available for gamblers to cope with that initial failure. Some may look at it as a one-time expenditure of effort and currency, but others may take an extended view of the endeavor. In terms of the latter approach, the martingale is an attempt to recover recent losses by doubling the dollar amount of each successive wager.

Let us disaggregate that process by examining the various time segments represented in this martingale. Limiting our focus to the initial wager, and accepting its occurrence without any data on its actual outcome, there are three key parts to the sequence: the period before that wager; the wager event; and the period after that wager. One approach using survival analysis could begin by marking an initial wager placement as a point in time or a date that separates the past from the future for purposes of the analysis. Let us suppose that a researcher decides to conduct a study of all gamblers and all their bets made over the course of one year since that initial event.

Notice the exclusion of the period prior to the wager event. In terms of its statistical property, whether gamblers accept it or not, the probability of winning or losing on the next wager is largely independent of the outcome seen from a previous wager. Note also that the endpoint for study has no significance other than it exists by virtue of a research design specification. One might imagine the reason for the one-year duration is an attempt by a researcher to replicate a prior study and/or follow the lead of other researchers. Whatever the reason, the researcher in this hypothetical project initiates the data collection phase. S/he begins to record all the bets made by a specific group of gamblers and their respective outcomes (win or lose) since the date of that original event in addition to data on other variables that suit study purposes.

A highly important variable is the one created to record whether a gambler continued or ceased to place wagers during the study period. It makes sense to classify gamblers who continued to place wagers as survivors and those who ceased to place wagers as non-survivors. Suppose for a moment that the researchers conducted an initial descriptive analysis to discern whether the survivors beat the odds for the game they were playing, though they had second thoughts shortly after.

In their first analytic pass, they estimated the proportion of bets that had favorable or unfavorable outcomes. In doing so, they found that 40 percent of the survivors' bets were winners. As one can imagine, they became very interested in this result, which suggests the actual win rate is much better than the widely published payout ratio. Shortly afterwards, they came to realize a major flaw in their previous thought processes. Specifically, it is inappropriate from a statistical or scientific perspective to draw certain conclusions from the aforementioned comparison. For starters, there is a major difference between a frequency distribution and a probability distribution; the former simply captures observations from one sample, whereas the latter reflects expectations taking a much larger number of samples into account.

Perhaps those readers who completed a statistics course will recall the differences between a descriptive method of statistical analysis and an inferential method. Without going into a great amount of detail here, the latter utilizes probability distributions and permits analysts to generalize the findings from a sample to the larger population. In other words, analysts infer something about the larger population from the smaller sample. Anyone who reads about survival analysis is likely to find studies that use one or both types of analysis.
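The descriptive/inferential distinction can be made concrete with a confidence interval: a sample win rate of 40 percent supports inference about the population only once its sampling error is quantified. The sketch below assumes a hypothetical sample of 1,000 recorded bets and uses the standard normal approximation.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)  # standard error of the sample proportion
    return p - z * se, p + z * se

# Assumed: 400 winning bets out of 1,000 recorded (the essay's 40 percent).
low, high = proportion_ci(400, 1000)
print(f"95% CI for the win rate: ({low:.3f}, {high:.3f})")
```

If the published payout ratio fell outside such an interval, the researchers would have grounds for an inferential claim rather than a merely descriptive one.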

The Survival Function

Survival analysis is applicable to many types of decisions and contexts. Usually, the analysis entails the use of dichotomous terms such as survival or failure, gain or loss, stay or leave, and so forth. Those features inform us that survival analysis involves classifying situations into two outcomes, with a primary emphasis on the group of survivors. Taking a simple approach to a complex topic, this essay draws on examples such as motor vehicle replacements (Chen & Lin, 2006) and loan application decisions (Morrison, 2004). The reader will soon recognize the value of the technique's applications to inanimate objects and to living beings.

In terms of human or animal life, the survival function expresses the probability of the actual time of death occurring beyond some expected or specific point in time. As the name of the function implies, the group of interest is the survivors or those who are living longer than expected or after some specific date. With the availability of two sets of data and the application of statistical analysis procedures to them, a profile will eventually emerge that will allow analysts to compare and contrast survivors and non-survivors on a set of defining characteristics or features. For example, researchers may be vigilant in their attempts to delineate how an independent time-related variable such as birth year influences a dependent variable such as the probability of living past a presumed age of 110.
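One standard, nonparametric way to estimate such a survival function from data, not detailed in this essay, is the Kaplan-Meier product-limit estimator. The sketch below is a simplified version (it processes one observation per time point) applied to hypothetical lifetimes; a 0 flag marks a subject who left the study still surviving.

```python
def kaplan_meier(observations):
    """Product-limit estimate of S(t).
    observations: list of (time, event) pairs, where event=1 is an
    observed failure and event=0 a right-censored exit.
    Returns the survival curve as [(time, S(t)), ...]."""
    at_risk = len(observations)
    surv = 1.0
    curve = []
    for time, event in sorted(observations):
        if event == 1:                      # an observed failure
            surv *= (at_risk - 1) / at_risk
            curve.append((time, surv))
        at_risk -= 1                        # failures and censored exits both leave
    return curve

# Hypothetical lifetimes; 0 marks a censored (still-surviving) subject.
data = [(2, 1), (3, 0), (5, 1), (7, 1), (8, 0)]
for t, s in kaplan_meier(data):
    print(f"S({t}) = {s:.3f}")
```

Comparing such curves for two groups is one simple way to build the survivor/non-survivor profile the essay describes.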

If and when they review the literature and studies related to survival analysis, readers will find references to those types of variables and a number of statistical techniques. These references warrant brief mention here. Regression analysis is a statistical procedure for analyzing the nature of a relationship between an independent variable and a dependent variable. In most instances, the dependent variable is a consequence of the independent variable; for example, income is partially determined by education level. Data on these variables are available in many forms such as continuous (which typically covers all the possible numeric values) and non-continuous.

Data Management & Analysis

Survival analysis entails the application of regression techniques to a non-continuous dependent variable. In other words, the variable takes the form of outcomes that are recorded either in binary form (a gain or a loss, for instance) or in probabilistic form (values between zero and one that signify percent chance). As a simple matter for background purposes here, readers will find references in the literature to two types of regression analysis. One is the probit regression technique or model, and the other is the logit regression technique or model. Both are applicable when an outcome is in a dual or binary form; the logit model expresses its results on the odds or probability scale through the logistic function, while the probit model does so through the cumulative normal distribution.
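The two models differ only in the link function that maps a linear predictor onto a probability: the logistic function for logit, and the standard normal cumulative distribution function for probit. A minimal comparison of the two links:

```python
import math

def logistic(x):
    """Logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def normal_cdf(x):
    """Probit link: standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Both links send 0 to 0.5 and differ mainly in their tails.
for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  logit={logistic(x):.3f}  probit={normal_cdf(x):.3f}")
```

In practice the two models usually yield similar fitted probabilities; the logit form is often preferred because its coefficients translate directly into odds ratios.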

Survival models adapt regression analysis procedures so that the dependent, response, or outcome variable is in a form related to time. Like all statistical models and techniques, findings emerge after numerous attempts to address a host of challenges, including those having to do with data and information collection. Oakes (2002) informs us that information about events and any corresponding factors usually arrives in bits and pieces over time. Furthermore, the nature of its arrival suggests that the knowledge and predictions stemming from that data are problematic. By extension, calculation of a statistical function such as the likelihood function is straightforward when a research subject's lifetime is knowable with precision because study records include the dates of birth and death. In reality, however, records are often incomplete or missing data for many reasons.

A number of situations arise in a clinical trial that have a bearing on data completeness and accuracy. Data on some subjects are missing because they do not experience the event of interest, they exit a study, they move, and/or they are unavailable for follow-up. Incomplete records result in what are termed censored observations. Literature reviews will reveal two variants of the censored datum. On the one hand, researchers may know only that a study participant's death occurred after some specific date, as with someone who discontinued participation in a study while still alive. These instances lead to classification of the record as right-censored because the datum after a specific point in time is missing from the files. On the other hand, researchers may know only that a subject's lifetime was shorter than a certain length, without knowing the exact event time. This instance leads to classification of the data as left-censored because the datum before a specific point in time is missing from the files. In sum, various circumstances arise in data management and analysis that prompt researchers to acknowledge the limitations of their studies and statistical models.
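A sketch of how such records might be classified, assuming each record carries an observation-entry time, an event time (or none, if the event was never seen), and an exit time; the record layout is illustrative only:

```python
def classify(record):
    """Classify a study record by what is known about the event time.
    record: (entry_time, event_time_or_None, exit_time)."""
    entry, event, exit_ = record
    if event is None:
        return "right-censored"      # event not seen before the subject left
    if event < entry:
        return "left-censored"       # event happened before observation began
    return "complete"

print(classify((0, None, 10)))   # subject left the study event-free
print(classify((5, 3, 12)))      # event predates observation
print(classify((0, 7, 12)))      # event observed in full
```

Likelihood-based methods then treat each class differently: complete records contribute exact event times, while censored records contribute only the inequality that is known.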

Hazard Function, Rate & Models

Hazard function is a term found in studies including those that pertain to demography, actuarial science, and engineering. It may be increasing, decreasing, or constant over time with respect to a probability of survival or failure. For example, a mechanical system may fail soon after initial operation or much later as the system ages; a bank loan delinquency may be more likely soon after origination of a loan rather than later; and, the risk of dying increases with age.

A variety of interpretations are presented in the literature with respect to the interrelationships among survival, failure, risk, and hazard, most of which are quite complex and humbling. In basic terms, a hazard rate quantifies the risk of an event with respect to time. One interpretation is that a high hazard rate signifies a high instantaneous risk of failure, while a low hazard rate signifies a low instantaneous risk of failure. In other words, a high hazard rate equates to a short survival time and a low hazard rate equates to a long survival time.
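Formally, the hazard rate is the event density divided by the survival function, h(t) = f(t)/S(t). For the exponential distribution this ratio simplifies to the constant rate, which is why that model represents unchanging risk; the rate value below is an assumed illustration.

```python
import math

def exp_hazard(lam, t):
    """h(t) = f(t) / S(t) for the exponential distribution."""
    density = lam * math.exp(-lam * t)       # f(t)
    survival = math.exp(-lam * t)            # S(t)
    return density / survival                # algebraically equals lam

# The exponential hazard is flat: the same instantaneous risk at every age.
for t in (0.5, 2.0, 10.0):
    print(f"h({t}) = {exp_hazard(0.3, t):.3f}")
```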

Regression analysis techniques are useful for calculating hazard rates, provided the researcher avoids making convenient assumptions about their value and nature. The hazard rate can take various forms, and three models are available to those who want to estimate it. Readers should keep in mind that scale is a measure of a model's random disturbance term, which reveals information about the degree of error attached to the estimation process; it becomes a relevant and important consideration in a subsequent section of this essay. Table 1 summarizes those three models.

The Exponential Distribution Model is the simplest of the three because it suggests that risk remains constant over time. Most of the articles reviewed by the author of this essay refer to the Weibull Distribution Model. Those publications point to its medium level of complexity and to the realistic straight-line hazard it produces. The line slopes upward or downward, portraying the dynamic nature of risk. For example, an upward-sloping line indicates that the probability of failure increases at an increasing rate over time, which is consistent with the notion that the likelihood of death increases with age.
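The Weibull hazard, h(t) = (k/λ)(t/λ)^(k−1), makes the role of the shape parameter k explicit: values below one give falling risk, one recovers the constant exponential hazard, and values above one give rising risk. A small sketch with assumed parameter values:

```python
def weibull_hazard(shape, scale, t):
    """Weibull hazard: h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# shape < 1: risk falls with age; shape = 1: constant (exponential);
# shape > 1: risk rises with age, as with mortality.
for k in (0.5, 1.0, 2.0):
    h1 = weibull_hazard(k, 4.0, 1.0)
    h5 = weibull_hazard(k, 4.0, 5.0)
    trend = "falling" if h5 < h1 else ("flat" if h5 == h1 else "rising")
    print(f"shape={k}: h(1)={h1:.3f}, h(5)={h5:.3f} ({trend})")
```

This flexibility is why the Weibull model recurs so often in the literature: a single extra parameter lets the data, rather than an assumption, decide the direction of risk.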

Applications

This section integrates all the information covered up to this point. It applies the hazard and survival functions, data management and analysis issues, and statistical concepts and methods to two areas chosen because of their brevity, convenience, and possible relevance. Readers who are currently undergraduate college students may find themselves employed as a loan officer, a fleet manager, or another position in which survival analysis may be a valuable tool.

Bank Loans

Credit scoring and survival analysis techniques are related, but they differ in one key aspect according to Morrison (2004). Credit scores allow bankers to decide whether an applicant is eligible to receive a loan, a credit card, or some other form of borrowing. In a generic sense, credit scores represent the probability that the borrower will default during the next year on the loan agreement by failing to make loan payments and/or making them late. Credit scores reveal the statistical possibility of a nonpayment event within a year, and they help loan officers answer the question: What is the probability of that event?

Survival analysis, in contrast, addresses the question: When will the event occur? Those who use these approaches acknowledge that default could occur in the future, but they are only able to observe past and current data. The underlying assumption is that history makes for a good indicator of the future. In order to apply survival analysis to a loan officer's decision, an analyst will need to take some steps to collect relevant data. There is the need to record defaults from the point of loan origination. That data will provide initial information about the survival time of a loan.

An analyst with data on survival time can then conduct further analyses. One could explore variables profiling loans at risk of default. Those variables may include data specific to the loan, the borrower, and/or the economy; for example, the loan program type, the loan to value ratio, the borrower's income level, geographic location of borrower, state of the national economy, interest rates, and so on. Information on the relevance of these and other variables is available through reviews of academic journals and trade publications.
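To make the data-collection step concrete, the sketch below computes each loan's survival time from origination to default, flagging loans that have not yet defaulted as right-censored. The loan dates and the observation cutoff are hypothetical, chosen only for illustration.

```python
from datetime import date

def loan_survival_days(origination, default=None,
                       observed_through=date(2007, 12, 1)):
    """Days from origination to default, with an event flag.
    Right-censored loans (no default seen) get the assumed observation
    cutoff and a flag of 0."""
    if default is not None:
        return (default - origination).days, 1
    return (observed_through - origination).days, 0

# Hypothetical loan records for illustration.
print(loan_survival_days(date(2005, 1, 15), date(2006, 3, 1)))   # defaulted
print(loan_survival_days(date(2006, 6, 1)))                       # still paying
```

These (time, flag) pairs are exactly the input format survival methods expect, so the analyst can move directly from loan files to hazard estimation.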

Vehicle Scrappage

Vehicle scrappage rates and survival probabilities are complementary terms, though the former refers to actuality and the latter to possibility. By definition, a scrappage rate is the mathematical result of subtracting the survival probability from unity; for instance, a scrappage rate of 37 percent equates to a survival probability of 63 percent. In their summary of past studies on the topic, Chen and Lin (2006) report that scrappage rates vary with time, and thereby older vehicles incur a higher hazard of replacement than newer vehicles.
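The complement relationship is a one-line calculation; the figures below mirror the 37 percent example above.

```python
def scrappage_rate(survival_probability):
    """Scrappage rate is the complement of the survival probability."""
    return 1.0 - survival_probability

print(f"{scrappage_rate(0.63):.2f}")   # a survival probability of 63 percent
```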

Readers may recall that hazard refers to the underlying nature of a survival probability. Over time, the probability of survival can remain constant, increase, or decrease, but the choice of time segment matters a great deal. Nonetheless, it is safe to assume that survival probabilities decrease at an increasing rate or that scrappage rates increase at an increasing rate. For example, vehicles are more likely to fail, to be recycled, and/or to become the next addition to a junk pile as time passes. Let us return to the evidence, as reported by Chen and Lin (2006), at a deeper level.

They report that the earliest study on the subject found the following patterns in vehicle scrappage rates: It is very low during the first year after production; it increases significantly between the fourth and ninth years; it rises slowly between the ninth and fifteenth years; and then it declines after reaching the maximum scrappage rate. These findings serve to remind readers of how sophisticated the subject of survival analysis can become and to alert them of the difficulties one will encounter in developing statistical models of the subject. In brief, approaches to the subject vary in complexity to a great degree.

The most simplistic application of a survival model to vehicle duration is common across households. A scrappage decision at the household level typically centers on the total length of time that a resident holds the vehicle. Usually, it is a decision that fails to take into account the actual lifespan of the vehicle. Taking note of the vehicles traversing the highways and byways, it certainly appears that many individuals begin to consider replacing their motor vehicles after three years of possession and/or when the odometer displays 50,000 miles. Perhaps more visible along those routes are those of us who look to those typical household scrappages as providers of specific makes and models much closer to the end of the lifespan. Furthermore, some of us devote a significant amount of time to examining those variables and more. Certainly, it is far easier to form a buy or sell decision when it involves a small number of vehicles for any given time segment.

The situation is far more complex for an agency or organization holding a fleet of almost 2,000 vehicles. The key decision is which vehicles are to be scrapped and when. In their informative study of the DuPage County (Illinois) Forest Preserve District, Chen and Lin (2006) examined the current selection method and proposed an alternative method. In the case of the former, the usual method of selecting vehicles for replacement involved ranking them in accordance with evaluations based on a comprehensive set of nine criteria. Though the County's method is comprehensive, Chen and Lin (2006) describe it as being subjective and deterministic. The method is subjective because individuals refer to their experience in the procedures for assigning scores. It is deterministic because the methods exclude provisions for assessing the extent to which the course of evaluation contains errors.

In order to resolve those shortcomings, Chen and Lin (2006) set out to develop a statistical model for selecting vehicles for replacement. The alternative is also comprehensive yet objective and probabilistic. It is comprehensive because the model can incorporate a variety of factors, objective because survival probability calculations are a result of the model and probabilistic because the model allows for assessments of errors. In presenting their case for a comprehensive, objective, and probabilistic model to handle a vehicle scrappage decision, Chen and Lin (2006) make this important statement: "The key element in survival analysis is the specification of the hazard rate" (p. 735).

Their methodical analysis of the data leads them to conclude that the hazard rate resembles the Weibull distribution model; readers may recall its description earlier in this essay. Consequently, their specification of the hazard rate represents an important and accurate step in the process of survival analysis. Most importantly, their study confirms an underlying assumption of the model rather than merely accepting the basic nature of the hazard rate in the County's data. Researchers consider studies, models, and theories robust when they are able to specify the fewest underlying assumptions. In order to reach this level of achievement, scholars and analysts need to collect, verify, and utilize data in accordance with the design and purposes of their own study and those found in the literature.

As we head toward the final paragraphs of this essay, it is also important to recall that one objective was to specify which vehicles will be replaced and when. After dealing with a few challenges in the initial phases of data collection, Chen and Lin (2006) used two groups of data consisting of 232 vehicles in total. Each clean set of data contains the same variables, without any missing or erroneous data. The larger group, which holds records on the survivors, contains information on 146 vehicles in active use between July 1, 2004 and February 28, 2005. The other group, which holds records on the non-survivors, contains information on 86 vehicles that became inactive during that same period and were sold via auction or sent to a scrap yard.

In addition to helping public administrators to identify vehicles for replacement, Chen and Lin's (2006) detailed comparisons between the survivors and the non-survivors reveal which factors should receive the most attention in formulating decisions about vehicle selection for replacement. In paving the way to assist countless others, the study compared active vehicles to inactive vehicles and arrived at a profile based on 10 key variables: Survival time or useful life; vehicle age; odometer mileage; number of road calls; reformulated gasoline; number of repairs; and whether the vehicle is a minivan and/or a Chevrolet, a Dodge, a Ford, or another brand. As we move toward closure of this essay, readers of that article may gain substantial insight into their own vehicle purchase, utilization, and disposal decisions.

The article by Chen and Lin (2006) conveys some significant findings and represents a fine portrayal of survival analysis and its nuances. For the sake of brevity, the author of this essay encourages interested readers to review and contemplate those findings. Some may find it a humbling, yet worthwhile endeavor, much as this author did. Interestingly, the article stopped short of discussing all the differences between the alternative comprehensive, objective, and probabilistic model and the original model employed by the County.

Conclusion

In conclusion, this essay represents a thumbnail sketch of survival analysis. It covers a few applications of the concepts and the methods found in a recent scan of the literature. Readers who desire more breadth and depth are encouraged to consult academic journals and trade publications. Certainly, a reader will find a diverse array of information that will challenge and satisfy the needs of that individual. Some articles will be more helpful and comprehensive than others. This essay aims to fit that description.

Terms & Concepts

Dependent Variable: A variable that represents an outcome, a result or a consequence of one or more independent variables.

Hazard Function: Quantifies and describes the temporal nature of a survival risk.

Independent Variable: A variable that exerts an influence on an outcome, a result, or a consequence.

Left-censored Data: The classification signifying an incomplete record in which the event of interest is known only to have occurred before a certain time, so the exact event time is missing.

Logit Regression: A technique for examining the underlying nature of a relationship when the outcome is in binary form, such as a gain or loss; it models the outcome probability through the logistic function, yielding results interpretable as odds ratios.

Martingale: A system in which a gambler attempts to recover past losses by doubling or otherwise increasing the amount of subsequent bets.

Probit Regression: A technique for examining the underlying nature of a relationship when the outcome is in binary form; it models the outcome probability through the cumulative normal distribution.

Regression Analysis: A statistical procedure for examining the nature of a relationship between a dependent variable and an independent variable.

Right-censored Data: The classification signifying an incomplete record in which the event of interest is known only to have occurred after the last observed time, as with a participant who exited the study before experiencing the event.

Survival Function: Conveys the probability of survival beyond a specified time, sometimes accounting for random influences.

Weibull Distribution Model: A probability distribution that helps researchers determine whether a survival or failure rate increases or decreases over time, and whether it does so at a constant or changing pace.

Table 1: A Summary of Hazard Distribution Models

Model          Hazard Function Appearance         Able to Handle Inflections
Exponential    Horizontal, Straight Line          No
Weibull        Non-horizontal Straight Line       No
Log-Normal     Early Segment of S-shaped Curve    Yes

Bibliography

Chen, C., & Lin, J. (2006). Making an informed vehicle scrappage decision. Transport Reviews, 26(6), 731-748. Retrieved December 1, 2007, from EBSCO Online Database Business Source Premier. http://search.ebscohost.com/login.aspx?direct=true&db=buh&AN=22651090&site=ehost-live

Gu, Y., Sinha, D., & Banerjee, S. (2011). Analysis of cure rate survival data under proportional odds model. Lifetime Data Analysis, 17(1), 123-134. Retrieved November 15, 2013, from EBSCO Online Database Business Source Complete. http://search.ebscohost.com/login.aspx?direct=true&db=bth&AN=57281696&site=ehost-live

Morrison, J. (2004). Introduction to survival analysis in business. Journal of Business Forecasting Methods & Systems, 23(1), 18-22. Retrieved December 1, 2007, from EBSCO Online Database Business Source Premier. http://search.ebscohost.com/login.aspx?direct=true&db=buh&AN=13010424&site=ehost-live

Oakes, D. (2002). Survival analysis. In A.E. Raftery, M.A. Tanner, & M.T. Wells (Eds.), Statistics in the 21st Century (pp. 4-11). Boca Raton, FL: CRC Press LLC.

Richards, S.J. (2012). A handbook of parametric survival models for actuarial use. Scandinavian Actuarial Journal, 2012(4), 233-257. Retrieved November 15, 2013, from EBSCO Online Database Business Source Complete. http://search.ebscohost.com/login.aspx?direct=true&db=bth&AN=84342497&site=ehost-live

Shauly, M., Rabinowitz, G., Gilutz, H., & Parmet, Y. (2011). Combined survival analysis of cardiac patients by a Cox PH model and a Markov chain. Lifetime Data Analysis, 17(4), 496-513. Retrieved November 15, 2013, from EBSCO Online Database Business Source Complete. http://search.ebscohost.com/login.aspx?direct=true&db=bth&AN=65491405&site=ehost-live

Suggested Readings

Barbu, V., Boussmart, M., & Limnios, N. (2004). Discrete-time semi-Markov model for reliability and survival analysis. Communications in Statistics: Theory & Methods, 33(11/12), 2833-2868. Retrieved December 1, 2007, from EBSCO Online Database Business Source Premier. http://search.ebscohost.com/login.aspx?direct=true&db=buh&AN=15898983&site=ehost-live

Chung, Y., Dey, D., Kim, M., & Kim, C. (2005). Bayesian Model choice in exponential survival models. Communications in Statistics: Theory & Methods, 34(12), 2311-2330. Retrieved December 1, 2007, from EBSCO Online Database Business Source Premier. http://search.ebscohost.com/login.aspx?direct=true&db=buh&AN=18909091&site=ehost-live

Manchanda, P., Dubé, J., Goh, K., & Chintagunta, P. (2006). The effect of banner advertising on internet purchasing. Journal of Marketing Research (JMR), 43(1), 98-108. Retrieved December 1, 2007, from EBSCO Online Database Business Source Premier. http://search.ebscohost.com/login.aspx?direct=true&db=buh&AN=19625416&site=ehost-live

Omori, Y., & Johnson, R. (2006). The influence of random effects on univariate and bivariate discrete proportional hazards models. Communications in Statistics: Theory & Methods, 35(9), 1757-1764. Retrieved December 1, 2007, from EBSCO Online Database Business Source Premier. http://search.ebscohost.com/login.aspx?direct=true&db=buh&AN=22455480&site=ehost-live

Silva, G., & Amaral-Turkman, M. (2004). Bayesian analysis of an additive survival model with frailty. Communications in Statistics: Theory & Methods, 33(10), 2517-2533. Retrieved December 1, 2007, from EBSCO Online Database Business Source Premier. http://search.ebscohost.com/login.aspx?direct=true&db=buh&AN=15123735&site=ehost-live

Essay by Steven R. Hoagland, Ph.D.

Dr. Hoagland holds a baccalaureate and a master's degree in economics, a master of urban studies, and a doctorate in management with a cognate in education. His professional background includes leadership in planning, assessment, and research and service as an adjunct professor of economics. Dr. Hoagland has delivered more than 50 courses in business, economics, and statistics. When time and resources permit, as the founding executive director of a nonprofit organization launched in 2007, he guides college-bound high school students toward a more objective and simplified method of college selection. That endeavor holds promise for improving the financial return on consumer investments in higher learning and for advancing institutional accountability for performance and quality.