Propensity Score Matching

Abstract

Propensity score matching is a statistical technique used to reduce certain types of bias that can arise when participants are not randomly assigned to experimental treatment and control groups. In some types of studies, selection bias may be introduced, causing individuals with certain characteristics to have a greater likelihood of ending up in the treatment group. A propensity score is an estimate of the probability that an individual will be in the treatment group, given that individual's observed characteristics. In propensity score matching, each participant in the treatment group is matched with a participant in the control group who has a similar propensity score, so that the two groups being compared are balanced on those characteristics.

Overview

Propensity score matching features prominently in the field of education, where the need to provide services in an equitable fashion is paramount, due to the many ways in which bias has manifested itself throughout the history of the field. There has been a tendency for educators to assume that the dominant culture's worldview represents what is normal, and that other perspectives are somehow incorrect or inferior. Often the result of this type of bias has been some form of tracking, in which students from outside the dominant culture are provided with less effective, less rigorous instruction under the assumption that this is all they can cope with. Such bias can also surface during educational research, with potentially devastating effects. Propensity score matching is one of several methods used to filter out unconscious bias during the design phase of a research project.

It is difficult to understand the concept of propensity score matching without first considering the type of problem that the technique is designed to solve. In experimental research, scientists conduct experiments in order to test hypotheses. For example, if a scientist had developed a new medication to treat memory loss, the scientist would want to make sure that the medication helps to get rid of this condition. The scientist would develop a hypothesis that the medication would reduce memory loss in the average person, and then would design an experiment to see whether this turns out to be true, or if it is untrue. One problem with designing the experiment is that there is no such thing as an average person on which the medication can be tested—every person is different in some ways from every other person, so no single individual is completely average. To get around this difficulty, scientists approximate the average person by assembling a group of study participants for the experiment; if the group, or sample, is large enough, then from a statistical standpoint it can be used to represent the general population in the way that an average person would, if he or she existed (Randolph et al., 2014).

Once a group of participants has been created, the next step for the researcher is to divide the participants into two groups. These are called the control group and the experimental group (also known as the treatment group). The experiment needs to test what happens to those who use the medicine, but there must also be a way of testing what would have happened if no medicine had been applied. Without an untreated, or control, group, the experiment would be limited to comparing the treatment group with a purely hypothetical untreated group. The purpose of the experimental group is to receive the treatment, and the purpose of the control group is to show what happens without treatment.

Once the experiment ends, the results of the two groups are compared to see if there is any difference, and if so, how much of a difference and what kind of a difference (Liu & Ripley, 2014). In the above example, there might be a very small difference, perhaps experienced by only one person, or a very large difference, with almost everyone in the treatment group affected. Regardless of its size, the difference could either be a reduction in memory loss, or an increase; a reduction would confirm the hypothesis that the medication helps fight memory loss, while an increase would contradict the hypothesis.

The size of the difference is important, because some small differences might be attributable to random chance; perhaps the group that was selected happened to have especially curable (or incurable) memory loss. Researchers must determine whether a difference between the experimental and the control groups is statistically significant, that is, large enough that it is unlikely to be the result of chance. If the experiment shows a difference that is not statistically significant, then the hypothesis cannot be considered supported, since there is no way to be sure that the difference was due to the medication and not to random variation (Torres et al., 2017).
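One way to make "unlikely to be the result of chance" concrete is a permutation test: the group labels are shuffled many times to see how often chance alone produces a difference at least as large as the one observed. The following is a minimal sketch in Python, not a method taken from the studies cited here; the scores are invented, and the function name permutation_p_value and its parameters are assumptions made for illustration.

```python
import random
from statistics import mean

def permutation_p_value(treated, control, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(treated) - mean(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_treated]) - mean(pooled[n_treated:]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical memory-test scores (higher is better); not real data.
treated_scores = [14, 15, 13, 16, 15, 17, 14, 16]
control_scores = [11, 12, 10, 13, 12, 11, 12, 10]

p_large_gap = permutation_p_value(treated_scores, control_scores)
p_no_gap = permutation_p_value(control_scores, control_scores)
print(p_large_gap, p_no_gap)
```

A small value for the first comparison indicates that a gap this large would rarely arise from shuffling alone, while comparing a group with itself gives a value of 1, since every shuffle is at least as extreme as a difference of zero.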

Of all of these steps, the one with which propensity score matching is concerned is the stage at which participants in the experiment end up in the experimental or control group. If a researcher were to assign people to study groups based on some criterion such as gender or eye color, this could skew the results of the experiment; the two groups can be compared fairly only if they are alike in every respect other than the treatment. If all of the members of the experimental group are women, then any difference in outcome might be due to gender rather than to the medication. To avoid this problem, scientists use randomization to assign individual participants to the control and experimental groups; one way to think of this is as if each participant stepped forward and a coin was flipped to determine whether they would go into the control or experimental group. This process would continue until both groups had the same number of participants (Li, Palma & Xu, 2017).
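The coin-flip procedure can be sketched in a few lines of Python. This illustration shuffles the participant list and splits it in half, which is one simple way to obtain two equal-sized random groups; the function name randomize and the participant labels are hypothetical.

```python
import random

def randomize(participants, seed=None):
    """Shuffle participants and split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # The first half becomes the experimental group, the rest the control group.
    return shuffled[:half], shuffled[half:]

participants = [f"P{i:02d}" for i in range(1, 21)]  # 20 made-up participant labels
experimental, control = randomize(participants, seed=42)
print(len(experimental), len(control))  # prints: 10 10
```

Because group membership depends only on the shuffle, no characteristic of a participant can influence which group he or she joins.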

For some types of conditions that researchers wish to test, random assignment (the coin flipping) is not possible or practical, as in observational studies where the participants themselves, or others acting on their behalf, determine who receives the treatment; the bias that can result is known as selection bias. This occurs when some aspect or characteristic of the participants makes it more likely for them to end up in the experimental group. To return to the example of the memory loss researcher, assuming that memory loss is a condition that has a greater likelihood of occurring in elderly people, then the group receiving the medication would be likely to contain large numbers of elderly subjects, while the untreated group would tend to be younger. The problem this creates is that the two groups are no longer comparable. This means that the results cannot be extrapolated to the general population, because it cannot be determined whether any difference in outcome was caused by the treatment or by the difference in the composition of the two groups (Boughrara & Dridi, 2017).

Further Insights

Propensity score matching is a method designed to address selection bias. The starting point is a set of participants who already belong to either the experimental or the control group. Next, the characteristic that will be used for matching is identified; in the case of the memory loss medication, this characteristic is the age of the participant. Using the matching characteristic, the researcher next identifies pairs of participants, one from the control group and the other from the experimental group, who differ very little from each other on the matching characteristic.

Here, a matched pair would be one person from the experimental group and one from the control group who are the same age or almost the same age. The researcher continues the process of identifying matched pairs until all members of one group have been matched with a member of the other group. By doing this, the preexisting difference between the groups is balanced out, and the comparison can then proceed, since the matched control group and experimental group closely resemble one another on the matching characteristic (Lei et al., 2017).
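The pairing process described above can be sketched as a greedy nearest-neighbor match. This Python illustration is only one of several possible matching strategies and is not drawn from the cited studies; the ages are invented, and the function name match_pairs and the caliper parameter (the largest gap allowed within a pair) are assumptions made for the example.

```python
def match_pairs(treated, control, caliper):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, control: dicts mapping a participant id to the matching value.
    caliper: largest allowed gap between the two members of a pair.
    """
    available = dict(control)
    pairs = []
    for t_id, t_value in sorted(treated.items(), key=lambda item: item[1]):
        if not available:
            break
        # Find the unused control participant closest to this treated participant.
        c_id = min(available, key=lambda cid: abs(available[cid] - t_value))
        if abs(available[c_id] - t_value) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control participant is used at most once
    return pairs

# Hypothetical ages; not real data.
treated_ages = {"T1": 62, "T2": 71, "T3": 80, "T4": 55}
control_ages = {"C1": 54, "C2": 63, "C3": 70, "C4": 40, "C5": 90}

pairs = match_pairs(treated_ages, control_ages, caliper=3)
print(pairs)  # [('T4', 'C1'), ('T1', 'C2'), ('T2', 'C3')]
```

In this example the 80-year-old participant T3 is left unmatched because no remaining control participant is within three years of that age. With a caliper, some participants may therefore drop out of the comparison, which is a trade-off between the closeness of the matches and the size of the matched sample.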

An interesting example of the use of propensity score matching in social science research concerns the effectiveness of participation in group counseling for the treatment of drug addiction. An observational study was conducted of people participating in group therapy for drug use, and several characteristics were tracked, such as motivation and previous exposure to addiction treatment programs. At the conclusion of the study, as the outcomes of the treatment and control groups were being compared, propensity score matching was used to ensure balanced representation between the two groups, and an interesting result emerged.

While overall the participation in the group appeared to correlate with greater success in escaping the throes of addiction, the propensity score matching indicated that group therapy for addiction provided the greatest benefit to those who were least in need of it, i.e. those participants with higher motivation. Upon reflection, this does make an odd sort of sense, because people with high indices of motivation to overcome their addiction would already be on their way to doing so, and the group sessions would likely do little more than nudge them in the direction they were already moving. Yet it may also be the case that those in greater need of help dealing with their addiction, as demonstrated by lower scores on motivation measures, feel more powerless to surmount obstacles and thus attribute less value to the therapy, giving it the appearance of less efficacy (Shipman, Swanquist & Whited, 2017).

Issues

As with most topics in the field of research, there is some difference of opinion about the merits of using propensity score matching as a substitute for randomization. Some authors have argued that propensity score matching can have an effect opposite to the one intended, causing greater imbalances between groups than it corrects. Those who hold this view point out that propensity scores are by nature always approximations rather than exact values; they are, after all, estimates of probability, not certainty. Because of this, the thinking goes, using propensity scores inevitably increases unpredictability and the potential for imbalances (Lee & Little, 2017).

These objections can seem somewhat mysterious unless one understands an additional complication about propensity score matching, related to the idea of multiple covariates. Covariate is the term used to describe the characteristic used to create the matched pair. In the example of the medication used to treat memory loss, the covariate would be age. This seems clear enough until one considers that there can be multiple covariates in effect at the same time. In addition to age, other covariates at work could be gender, ethnicity, occupation, and so forth.
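When several covariates are in play, the propensity score summarizes them as a single estimated probability of being in the treatment group, and logistic regression is a common way to estimate it. The following Python sketch fits a small logistic regression by gradient descent using only the standard library; the data are simulated, the selection process is assumed rather than observed, and all function names and parameter values are hypothetical choices made for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def propensity(weights, covariates):
    """Estimated probability of being in the treatment group."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], covariates))
    return sigmoid(z)

def fit_logistic(rows, labels, learning_rate=0.1, epochs=800):
    """Fit logistic regression by batch gradient descent.

    rows: one list of (roughly scaled) covariates per participant.
    labels: 1 for the treatment group, 0 for the control group.
    Returns the weights, intercept first.
    """
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for covariates, label in zip(rows, labels):
            error = propensity(weights, covariates) - label
            gradient[0] += error
            for j, value in enumerate(covariates):
                gradient[j + 1] += error * value
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, gradient)]
    return weights

# Simulated (not real) participants: older people are more likely to be treated.
rng = random.Random(1)
rows, labels = [], []
for _ in range(300):
    age = rng.uniform(40, 80)
    female = rng.randint(0, 1)
    covariates = [(age - 60) / 10, female]    # age centered and scaled
    p_treated = sigmoid(1.2 * covariates[0])  # assumed "true" selection process
    rows.append(covariates)
    labels.append(1 if rng.random() < p_treated else 0)

weights = fit_logistic(rows, labels)
scores = [propensity(weights, covariates) for covariates in rows]
```

Each entry in scores is one participant's estimated propensity score. Matching can then be carried out on this single number rather than on each covariate separately.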

When propensity score matching is used to address one covariate, it can cause imbalance in another covariate, and when that second covariate is corrected for, this may cause the first and the third to become unmatched, and so forth. This problem becomes more serious when one notes that many researchers have been laboring under the belief that one can continue to add in covariates without any negative consequences, as if increasing the number of variables must inevitably increase the randomness. Certainly this is a possibility, but it is by no means an inevitability, as additional covariates may in some cases produce imbalanced groupings rather than greater randomization.
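Whether matching has left a covariate unbalanced can be checked with the standardized mean difference, which expresses the gap between the group means in units of the pooled standard deviation. The Python sketch below is illustrative only; the matched sample is invented, and the function name is hypothetical. A commonly cited rule of thumb treats absolute values below roughly 0.1 as adequately balanced, although this threshold is a convention rather than a fixed rule.

```python
import math
from statistics import mean, variance

def standardized_mean_difference(treated_values, control_values):
    """Difference in group means, in units of the pooled standard deviation."""
    difference = mean(treated_values) - mean(control_values)
    pooled_sd = math.sqrt((variance(treated_values) + variance(control_values)) / 2)
    if pooled_sd == 0:
        # No spread in either group: balanced only if the means coincide.
        return 0.0 if difference == 0 else math.copysign(math.inf, difference)
    return difference / pooled_sd

# Hypothetical matched sample: well matched on age, poorly matched on gender.
matched_treated = {"age": [60, 65, 70, 75], "female": [1, 1, 1, 0]}
matched_control = {"age": [61, 64, 71, 74], "female": [0, 0, 1, 0]}

balance = {
    name: standardized_mean_difference(matched_treated[name], matched_control[name])
    for name in matched_treated
}
print(balance)  # age is 0.0 (balanced); female is 1.0 (badly imbalanced)
```

Here the pairs were chosen to be close in age, yet the treated group is three-quarters female and the control group only one-quarter, which is the kind of imbalance in a second covariate that the passage above describes.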

Covariates may not only conflict with one another, but also interact with one another in exceedingly complex ways. Because covariate interactions so quickly become unmanageably complex, many researchers and statisticians gloss over the topic, under the mistaken belief that these interactions can be ignored if one introduces enough covariates to create what, in effect, amounts to chaos (Piccone, 2015).

Propensity score matching is frequently used as an informal research tool, and in many federal agencies its use is even required as part of research that receives funding from the government. Underlying its popularity, however, is the often-overlooked fact that propensity score matching is not, and was never intended to be, a perfectly accurate or fully randomized method of experimental research; it is a form of quasi-experimental research that relies on approximations that are useful with finite populations but limited when greater accuracy is required. Propensity score matching is most useful in observational research, and even then it only accounts for observed differences between groups.

Often, unobserved or hidden differences exist between groups that have an impact on group balance, yet propensity score matching can do nothing to address these hidden covariates; hence the concerns about over-reliance on the method. Propensity score matching rests on what has been called a strong assumption, namely that the groups are unconfounded once the observed covariates are taken into account, and in some cases this assumption is erroneous (Jacovidis, Foelber & Horst, 2017).

Terms & Concepts

Bipartite Matching: A method of matching that creates matched pairs from two separate groups, such as the treatment group and the control group. The goal is to create matched pairs in which both members of the pair have almost the same score on the characteristic of interest, whether this is height, age, weight, or something similar. A well-matched pair is one in which the members are very similar.

Control Group: In an experiment, the group of participants who do not receive treatment. Their function is to show what would happen if no treatment were applied, so that this can be compared against the outcome of the treatment group.

Experimental/Treatment Group: In an experiment, the group that receives treatment. The effects of the treatment, if any, are measured at the end of the experiment and compared with the results of the control group, to see if any differences between the two groups can be identified.

Matched Set: A pair of participants, one drawn from the control group and the other drawn from the experimental group, that are close to each other in terms of the characteristic of interest.

Non-Bipartite Matching: A method of propensity score matching that creates matched pairs by drawing from more groups than just the experimental and control groups.

Sample: A group being studied that is representative of the larger population, but small enough to allow researchers to work with.

Selection Bias: An inadvertent skewing of participant selection in an experiment, usually caused by the uneven distribution of certain characteristics in the general population.

Bibliography

Boughrara, A., & Dridi, I. (2017). Does inflation targeting matter for foreign portfolio investment: Evidence from propensity score matching. Journal of Economic Development, 42(2), 67–86.

Jacovidis, J. N., Foelber, K. J., & Horst, S. J. (2017). The effect of propensity score matching method on the quantity and quality of matches. Journal of Experimental Education, 85(4), 535–558. Retrieved January 1, 2018 from EBSCO Online Database Education Source. http://search.ebscohost.com/login.aspx?direct=true&db=eue&AN=124503752&site=ehost-live

Lee, J., & Little, T. D. (2017). A practical guide to propensity score analysis for applied clinical research. Behaviour Research & Therapy, 98, 76–90.

Lei, D., Xin-Xin, Z., Lin-Chun, F., Bao-Lin, Q., Jing, C., Jun, Y., & ... Ma, L. (2017). Propensity score matching analysis of a phase II study on simultaneous modulated accelerated radiation therapy using helical tomotherapy for nasopharyngeal carcinomas. BMC Cancer, 17, 1–11.

Li, Y., Palma, M. A., & Xu, Z. P. (2017). Impacts of playing after school on academic performance: A propensity score matching approach. Education Economics, 25(6), 575–589. Retrieved January 1, 2018 from EBSCO Online Database Education Source. http://search.ebscohost.com/login.aspx?direct=true&db=eue&AN=125480844&site=ehost-live

Liu, L., & Ripley, D. (2014). Propensity score matching in a study on technology-integrated science learning. International Journal of Technology in Teaching & Learning, 10(2), 88–104. Retrieved January 1, 2018 from EBSCO Online Database Education Source. http://search.ebscohost.com/login.aspx?direct=true&db=eue&AN=102673867&site=ehost-live

Piccone, J. E. (2015). Improving the quality of evaluation research in corrections: The use of propensity score matching. Journal of Correctional Education, 66(3), 28–46. Retrieved January 1, 2018 from EBSCO Online Database Education Source. http://search.ebscohost.com/login.aspx?direct=true&db=eue&AN=109335955&site=ehost-live

Randolph, J. J., Falbe, K., Manuel, A. K., & Balloun, J. L. (2014). A step-by-step guide to propensity score matching in R. Practical Assessment, Research & Evaluation, 19(18), 1–6. Retrieved January 1, 2018 from EBSCO Online Database Education Source. http://search.ebscohost.com/login.aspx?direct=true&db=eue&AN=99779852&site=ehost-live

Shipman, J. E., Swanquist, Q. T., & Whited, R. L. (2017). Propensity score matching in accounting research. Accounting Review, 92(1), 213–244.

Torres, F., Ríos, J., Saez-Peñataro, J., & Pontes, C. (2017). Is propensity score analysis a valid surrogate of randomization for the avoidance of allocation bias? Seminars in Liver Disease, 37(3), 275–286.

Suggested Reading

Guarcello, M., Levine, R., Beemer, J., Frazee, J., Laumakis, M., & Schellenberg, S. (2017). Balancing student success: Assessing supplemental instruction through coarsened exact matching. Technology, Knowledge & Learning, 22(3), 335–352. Retrieved January 1, 2018 from EBSCO Online Database Education Source. http://search.ebscohost.com/login.aspx?direct=true&db=eue&AN=125109473&site=ehost-live

Harris, H., & Horst, S. J. (2016). A brief guide to decisions at each step of the propensity score matching process. Practical Assessment, Research & Evaluation, 21(1–4), 1–11. Retrieved January 1, 2018 from EBSCO Online Database Education Source. http://search.ebscohost.com/login.aspx?direct=true&db=eue&AN=114374786&site=ehost-live

Hill, L., Maier-Katkin, D., Ladny, R., & Kinsley, K. (2018). When in doubt, go to the library: The effect of a library-intensive freshman research and writing seminar on academic success. Journal of Criminal Justice Education, 29(1), 116–136. Retrieved January 1, 2018 from EBSCO Online Database Education Source. http://search.ebscohost.com/login.aspx?direct=true&db=eue&AN=127544608&site=ehost-live

Huber, S., Dietrich, J. F., Nagengast, B., & Moeller, K. (2017). Using propensity score matching to construct experimental stimuli. Behavior Research Methods, 49(3), 1107–1119.

Jakubowski, M. (2015). Latent variables and propensity score matching: A simulation study with application to data from the Programme for International Student Assessment in Poland. Empirical Economics, 48(3), 1287–1325.

Yang, S., Imbens, G. W., Cui, Z., Faries, D. E., & Kadziola, Z. (2016). Propensity score matching and subclassification in observational studies with multi-level treatments. Biometrics, 72(4), 1055–1065.

Essay by Scott Zimmer, JD