One of the core challenges of data science is drawing meaningful causal conclusions from observational data. In many such cases, the goal is to estimate the true impact of a treatment or behaviour as fairly as possible. This article explores Propensity Score Matching (PSM), a statistical technique used for that very purpose.
Unlike randomized experiments (A/B tests) or treatment trials, observational studies typically exhibit important preexisting differences between the treatment and control groups. PSM offers a way to approximate a randomized experiment while controlling for confounders.
Let us explore it in further detail here.
What is PSM?
Propensity Score Matching (PSM) is a statistical mechanism to replicate a randomized experiment while controlling for confounders. This method pairs treated and untreated units with similar propensity scores, or the probability of receiving the treatment, to form a well-balanced comparison group.
In 1983, Rosenbaum and Rubin proposed this solution by describing the propensity score as the conditional probability of assignment to a treatment given an observed set of covariates. PSM tries to reduce biases created by confounders that can distort a simple outcome comparison between treated and untreated groups by pairing units with similar propensity scores.
Why is PSM Useful?
Let us understand this with an example. Suppose we wish to estimate the influence of a customer behavior, such as returning to an online shop, on an outcome, such as the decision to make a purchase. Instead of forcing some people to return, we rely on observational data. Simply comparing the purchase rates of returning vs. new customers is misleading, because the two cohorts differ in many ways (familiarity with the site, browsing behavior, and so on).
PSM helps by pairing returning customers with new customers who share the same observed traits except for the “returning” status, mimicking a randomized controlled trial. Once pairs are matched, any difference in purchase rates is more likely to come from the returning status itself.
In this post, we will explore Propensity Score Matching from theory to Python implementation in depth.
Understanding Propensity Score Matching
To illustrate the behavior of PSM in action, we will utilize a publicly accessible e-commerce dataset (Online Shoppers Purchasing Intention Dataset). The dataset includes web session data of an e-commerce system, whether the user generated revenue (purchase) in that session (outcome), and other features of their browsing behavior. We will establish a causal inference scenario:
- treatment = if the visitor is a returning customer,
- control = new customer,
- outcome = whether the session concludes in a purchase (Revenue=True).
We will estimate the effect of returning visitors on the probability of purchase using PSM by matching returning and new visitors with similar browsing metrics.
A propensity score is the probability of a unit (such as a user) receiving a treatment given its observed covariates. In our model, it would be the likelihood that a user is receiving treatment (returning visitor), based on their session properties such as pages visited, time spent, etc. Key assumptions when making causal inference using PSM entail:
- Conditional Independence: all confounding variables influencing treatment assignment and effect are observed and included in the propensity model.
- Common support (overlap): the probability of treatment for any given combination of covariates is strictly between 0 and 1 (the covariate distributions of the treated and control groups overlap). In practice, this means we need returning visitors and new visitors with similar properties; returning visitors whose features have no counterpart among new visitors (and vice versa) should be excluded, or PSM will not be useful.
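As a minimal sketch of enforcing common support: once propensity scores have been estimated (as we do later with logistic regression), units whose scores fall outside the overlapping range of the two groups can be trimmed. The helper name and the toy scores below are illustrative assumptions, not part of the dataset:

```python
import pandas as pd

def trim_to_common_support(df, score_col="PropensityScore", treat_col="Treatment"):
    """Keep only units whose propensity score lies in the overlap
    of the treated and control score ranges."""
    treated = df[df[treat_col] == 1][score_col]
    control = df[df[treat_col] == 0][score_col]
    low = max(treated.min(), control.min())
    high = min(treated.max(), control.max())
    return df[df[score_col].between(low, high)]

# Toy example: the 0.05 control and the 0.9 treated unit lie outside
# the overlap region [0.6, 0.8] and are dropped
toy = pd.DataFrame({
    "Treatment":       [1, 1, 1, 0, 0, 0],
    "PropensityScore": [0.6, 0.7, 0.9, 0.05, 0.65, 0.8],
})
trimmed = trim_to_common_support(toy)
print(len(trimmed))  # → 4
```

This is a crude range-based rule; in practice one often also imposes a caliper during matching, as discussed below.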
PSM Workflow
The general PSM workflow is:
- Propensity score estimation – typically fit a logistic regression of Treatment ~ Covariates to get each unit’s propensity
- Matching – pair each treated unit with one or more control units having close propensity scores
- Balance diagnostics – check whether the matched samples have balanced covariates (if imbalance remains, refine the model or the matching method)
- Treatment effect estimation – compute the outcome difference between matched treated and control groups
Case Study Setup: E-commerce Dataset and Treatment Scenario
In this practical scenario, we use the Online Shoppers Purchasing Intention dataset from the UCI Machine Learning Repository. It contains 12,330 retail website sessions, of which about 15% end with a purchase (Revenue=True). Each session includes features such as the number of different pages visited (Administrative, Informational, ProductRelated), the time spent on those pages, bounce rate, exit rate, page value, a special day indicator (proximity to a holiday), and some categorical features (Month, Browser, Region, TrafficType, VisitorType, and Weekend).
Treatment definition
We define Treatment = 1 (for “Returning_Visitor”) and 0 (for “New_Visitor”) based on VisitorType. Around 85% of sessions come from returning visitors, so the treated group is considerably larger than the control group.
Outcome
The Revenue feature is True if the session resulted in a purchase (conversion) and False otherwise. We want to estimate how much more probable a purchase is for repeat customers because they are repeat customers. While returning visitors have a much higher raw conversion rate than new visitors, they also inspect more products and stay on the site longer. We will use PSM to compare returning vs. new visitors while controlling for those factors.
Let’s load the dataset and do some light preprocessing for our analysis. We’ll use pandas to load the CSV and then encode the relevant columns for our binary treatment and outcome:
import pandas as pd
# Load the dataset
df = pd.read_csv("online_shoppers_intention.csv")
# Keep only the two visitor types of interest ('Other' is dropped for simplicity)
df = df[df['VisitorType'].isin(["Returning_Visitor", "New_Visitor"])].copy()
# Encode the outcome and treatment as 0/1
df['Revenue'] = df['Revenue'].map({False: 0, True: 1})
df['Treatment'] = df['VisitorType'].map({"Returning_Visitor": 1, "New_Visitor": 0})
# Take a quick look at treatment vs outcome rates
print(pd.crosstab(df['Treatment'], df['Revenue'], normalize="index") * 100)
In the above, we create a new column Treatment, which is 1 for returning visitors and 0 for new visitors. The crosstab will show the purchase rate (% Revenue=1) for each group before matching. We expect returning visitors to have a higher conversion rate than new visitors initially. However, this raw difference is confounded by other variables. To isolate the effect of being a return visitor, we proceed to estimate propensity scores.
Propensity Score Estimation (Logistic Regression)
For this, we will estimate each session’s propensity score (probability of being a returning visitor) given its observed features. We will use a logistic regression model for this exercise. The covariates should include all factors that influence the likelihood of being a returning visitor and are related to the outcome. In our case, plausible covariates are the various page visit counts, durations, and metrics like bounce rate, etc., since returning visitors likely have different browsing behaviors. We include a broad set of features from the session (excluding any that are trivial or occur after the decision to return). For simplicity, let’s use the numeric features available:
from sklearn.linear_model import LogisticRegression
# Features to use for propensity model (all numerical except target)
covariates = ['Administrative', 'Administrative_Duration', 'Informational',
              'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
              'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay']
X = df[covariates]
T = df['Treatment']
# Fit logistic regression to predict Treatment
ps_model = LogisticRegression(solver="lbfgs", max_iter=1000)
ps_model.fit(X, T)
# Predicted propensity scores
propensity_scores = ps_model.predict_proba(X)[:, 1]
df['PropensityScore'] = propensity_scores
print(df[['Treatment', 'PropensityScore']].head(10))
Here we train a logistic regression where the dependent variable is Treatment (returning vs new) and the independent variables are the session features. Each session gets a PropensityScore between 0 and 1. A high score (close to 1) means the model thinks this session looks very much like a returning visitor based on the browsing characteristics; a score near 0 means it looks like a new visitor profile.
Visualising Propensity Score Distributions
After fitting, it’s good practice to visualize the propensity score distributions for treated and control groups to check overlap (the positivity assumption). Ideally, there should be substantial overlap; if the distributions hardly overlap, PSM may not be reliable because we lack comparable counterparts.
Below is a distribution plot of propensity scores for the two groups:
Propensity score distribution for treated (returning visitors) and untreated (new visitors). We see substantial overlap in this case, which is essential for valid matching.
In our scenario, we expect returning visitors on average to have higher propensity scores (since they are returners), but if many new visitors have moderate-to-high scores and some returners have lower scores, we have overlap. If there were returners with extremely high scores >0.9 and no new visitors in that range, those returners would be hard to match and might need to be dropped (common support issue).
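As a sketch, such an overlap plot can be produced with matplotlib. Here, synthetic Beta-distributed scores stand in for the fitted model's output (an assumption for self-containment; in the article's pipeline you would plot `df['PropensityScore']` directly):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; use plt.show() interactively
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Stand-in for the real df: scores would come from the fitted logistic model
df = pd.DataFrame({
    "Treatment": np.r_[np.ones(500, int), np.zeros(500, int)],
    "PropensityScore": np.r_[rng.beta(5, 2, 500), rng.beta(2, 5, 500)],
})

fig, ax = plt.subplots(figsize=(8, 4))
for t, label in [(1, "Treated (returning)"), (0, "Control (new)")]:
    ax.hist(df.loc[df["Treatment"] == t, "PropensityScore"],
            bins=30, alpha=0.5, density=True, label=label)
ax.set_xlabel("Propensity score")
ax.set_ylabel("Density")
ax.set_title("Propensity score overlap")
ax.legend()
fig.savefig("propensity_overlap.png")
```

Overlaid semi-transparent histograms (or kernel density plots) make it easy to spot regions where one group has no counterparts in the other.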
Matching: Pairing on Propensity Scores
With propensity scores in hand, we proceed to match each treated unit with one or more control units with similar scores. A simple and commonly used method is one-to-one nearest neighbor matching without replacement: for each treated unit, find the control unit with the closest propensity score, and do not reuse control units once matched. This yields a matched sample of treated and control units of equal size. Other strategies include many-to-one matching, caliper matching (imposing a maximum allowed distance), optimal matching, etc. For this demonstration, we’ll use one-to-one nearest neighbor matching.
We can perform matching manually or with libraries. Below is how to do one-to-one matching in Python using a nearest-neighbors approach from scikit-learn:
import numpy as np
from sklearn.neighbors import NearestNeighbors
# Split the data into treated and control dataframes
treated_df = df[df['Treatment'] == 1].copy()
control_df = df[df['Treatment'] == 0].copy()
# Use nearest neighbor on propensity score
nn = NearestNeighbors(n_neighbors=1, metric="euclidean")
nn.fit(control_df[['PropensityScore']])
distances, indices = nn.kneighbors(treated_df[['PropensityScore']])
# `indices` gives the index of the closest control for each treated unit
matched_pairs = list(zip(treated_df.index, control_df.iloc[indices.flatten()].index))
print(f"Matched {len(matched_pairs)} pairs")
We fit a nearest neighbors model on the control group’s propensity scores, then, for each treated observation, find the nearest control. The result matched_pairs is a list of (treated_index, control_index) pairs of matched observations. Note that this simple approach effectively matches with replacement: it yields one pair per treated unit, and a control unit can be reused for several treated units. Matching strictly without replacement would instead cap the number of pairs at the size of the smaller group (here the controls) and leave some treated units unmatched. In practice, analysts generally drop unmatched treated units and focus the analysis on the region of common support.
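For completeness, one-to-one matching without replacement can be sketched with a simple greedy loop, optionally enforcing a caliper. The function name, the dict-based interface, and the toy scores are illustrative assumptions, not the scikit-learn procedure above:

```python
def greedy_match(treated_scores, control_scores, caliper=0.05):
    """Greedy one-to-one nearest-neighbor matching without replacement.
    treated_scores / control_scores: dicts of {index: propensity score}.
    Returns (treated_index, control_index) pairs whose score distance
    is within the caliper; each control is used at most once."""
    available = dict(control_scores)          # controls not yet used
    pairs = []
    for t_idx, t_score in treated_scores.items():
        if not available:
            break
        # nearest remaining control by absolute score distance
        c_idx = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_idx] - t_score) <= caliper:
            pairs.append((t_idx, c_idx))
            del available[c_idx]              # without replacement
    return pairs

# Toy data: two controls fall within the caliper, one treated stays unmatched
treated = {"t1": 0.80, "t2": 0.60, "t3": 0.30}
controls = {"c1": 0.78, "c2": 0.55, "c3": 0.05}
print(greedy_match(treated, controls))  # → [('t1', 'c1'), ('t2', 'c2')]
```

Greedy matching is order-dependent; optimal matching methods avoid that at higher computational cost.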
After matching, we create a new DataFrame for the matched sample containing only the matched treated and matched control observations. This matched sample is what we’ll use to estimate the treatment effect. But first, did matching succeed in balancing the covariates?
Balance Diagnostics
An important task in PSM analysis is to check whether the treated and control groups are indeed more balanced after matching. One way is to plot and compare the covariate distributions of matched treated vs. matched control units. More conveniently, we can examine the standardized mean difference (SMD) of each covariate before and after matching. The SMD is the difference in means between the groups divided by the pooled standard deviation. It is a unitless measure of imbalance, where an SMD of 0 indicates perfect balance. A common rule of thumb treats an absolute SMD below 0.1 as negligible imbalance.
Let’s compute and compare SMDs for a few key features before and after matching:
# Function to compute the standardized mean difference between two groups
def standardized_diff(treated_vals, control_vals):
    mt, mc = treated_vals.mean(), control_vals.mean()
    var_t, var_c = treated_vals.var(), control_vals.var()
    return (mt - mc) / np.sqrt((var_t + var_c) / 2)

covariates_to_check = ['ProductRelated', 'ProductRelated_Duration', 'BounceRates']
matched_treated_idx = [i for i, _ in matched_pairs]
matched_control_idx = [j for _, j in matched_pairs]
for cov in covariates_to_check:
    smd_before = standardized_diff(df.loc[df['Treatment'] == 1, cov],
                                   df.loc[df['Treatment'] == 0, cov])
    smd_after = standardized_diff(df.loc[matched_treated_idx, cov],
                                  df.loc[matched_control_idx, cov])
    print(f"{cov}: SMD before = {smd_before:.3f}, after = {smd_after:.3f}")
This will output something like:
ProductRelated: SMD before = 0.28, after = 0.05
ProductRelated_Duration: SMD before = 0.30, after = 0.08
BounceRates: SMD before = -0.22, after = -0.04
The exact values will change with the data, but we expect SMDs to shrink substantially towards 0 after matching. For example, the number of product-related pages (ProductRelated) had a moderately large imbalance before matching (treated users viewed more pages on average, SMD 0.28) that decreased to 0.05 afterwards – a successful balancing effect. The same improvement is observed for total product page duration and bounce rate. That tells us our propensity model plus matching did a reasonable job of generating similar groups. If any covariate still has a high SMD (say > 0.1) after matching, one could refine the propensity model (interaction terms, non-linear terms, etc.) or impose a caliper to force closer matches.
The figure below illustrates the covariate balance before and after matching for our example (with absolute SMDs for a few covariates):
Covariate balance before and after matching. The bar chart displays absolute standardized mean differences for three sample features. After matching, the bars shrink significantly and all covariate differences fall below the 0.1 threshold (red dashed line), indicating negligible imbalance.

What we see is that PSM has made the treated and control groups much more similar in terms of the observed variables. We can now compare these matched groups on the outcome to assess the causal effect.
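A sketch of such a balance plot with matplotlib, using the illustrative SMD values from the example output above rather than values computed from the real dataset:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; use plt.show() interactively
import matplotlib.pyplot as plt

# Illustrative absolute SMDs (matching the example output, not real results)
features = ["ProductRelated", "ProductRelated_Duration", "BounceRates"]
smd_before = [0.28, 0.30, 0.22]
smd_after = [0.05, 0.08, 0.04]

y = np.arange(len(features))
fig, ax = plt.subplots(figsize=(7, 3))
ax.barh(y - 0.2, smd_before, height=0.4, label="Before matching")
ax.barh(y + 0.2, smd_after, height=0.4, label="After matching")
ax.axvline(0.1, color="red", linestyle="--", label="0.1 threshold")
ax.set_yticks(y)
ax.set_yticklabels(features)
ax.set_xlabel("Absolute standardized mean difference")
ax.legend()
fig.tight_layout()
fig.savefig("smd_balance.png")
```

This "love plot" style of display makes it immediately visible which covariates remain imbalanced after matching.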
Estimating the Causal Effect
Lastly, we compute the treatment effect on the outcome using the matched sample. With one-to-one matched pairs and a binary outcome, we can simply compare the average conversion rate of treated vs. control units in the matched sample. This provides an estimate of the Average Treatment Effect on the Treated (ATT) – the difference in purchase probability for returning visitors compared to what it would have been had they instead been new visitors (counterfactually).
Suppose, for example, that after matching the returning visitors’ purchase rate is 12.5% and the new visitors’ purchase rate is 10%. The ATT would then be 2.5 percentage points (0.125 – 0.100), implying that being a returning visitor causally increases the purchase probability by 2.5 points (if our assumptions hold).
In code, this is just:
matched_treated_indices = [i for i, _ in matched_pairs]
matched_control_indices = [j for _, j in matched_pairs]
att = df.loc[matched_treated_indices, 'Revenue'].mean() - \
df.loc[matched_control_indices, 'Revenue'].mean()
print(f"ATT (treatment effect on purchase rate) = {att*100:.2f} percentage points")
Note that this estimate is specific to the population of returning visitors (ATT). Estimating the average treatment effect in the whole population (ATE) would require a somewhat different analysis (for example, weighting). In many cases, ATT is the key question: “how much did the treatment help those who received it?” Here, we’re measuring how much higher conversion is for returning users due to their returning status rather than behavioral differences.
Interpreting the result
If the ATT is positive (say 2.5pp, as was the case in our supposed outcome), that means returning visitors are more likely to purchase not simply by viewing more pages or spending more time (and we controlled for those), but due to being a return customer (e.g., more trust, intent). If the ATT is virtually zero, the observed covariates explain the entire difference in raw purchase rates, and returning status itself adds no extra boost once we account for them.
Finally, it’s critical to recognize the limitations of PSM. PSM can only adjust for confounders that are observed in the data; hidden confounders can still bias the estimate. Moreover, poor overlap or a mis-specified propensity model can produce inaccurate estimates. Sensitivity analyses strengthen the conclusions, and other methods, such as inverse probability weighting or doubly robust estimation, can be used to check robustness.
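As an example of one such robustness check, a minimal inverse probability weighting (IPW) estimate of the ATE can be sketched as follows, assuming `Treatment`, `Revenue`, and `PropensityScore` columns like those built above (toy data is used here; the function name is illustrative):

```python
import pandas as pd

def ipw_ate(df, y="Revenue", t="Treatment", ps="PropensityScore", eps=1e-6):
    """Horvitz-Thompson style IPW estimate of the average treatment effect.
    Scores are clipped away from 0 and 1 to keep weights finite."""
    p = df[ps].clip(eps, 1 - eps)
    treated_term = (df[t] * df[y] / p).mean()
    control_term = ((1 - df[t]) * df[y] / (1 - p)).mean()
    return treated_term - control_term

# Toy data: a constant score of 0.5 with equal group sizes makes IPW
# reduce to a simple difference in group means
toy = pd.DataFrame({
    "Treatment":       [1, 1, 0, 0],
    "Revenue":         [1, 0, 0, 0],
    "PropensityScore": [0.5, 0.5, 0.5, 0.5],
})
print(ipw_ate(toy))  # → 0.5 (treated mean 0.5 minus control mean 0.0)
```

Unlike matching, IPW uses every unit, reweighted by its propensity score; agreement between the matched and weighted estimates is reassuring, while disagreement flags model sensitivity.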
Conclusion
Propensity Score Matching is a well-established and powerful technique for causal inference with observational data. In this blog, we explained how PSM works in theory and practice using an e-commerce case study. We demonstrated estimating propensity scores with a logistic regression, matching treated and control units with similar scores, checking covariate balance with diagnostics such as standardized mean differences, and estimating the causal effect on an outcome. By matching returning visitors with new visitors who behaved similarly on an online shopping site, we could isolate the role of being a returning visitor in determining purchase likelihood.
Our analysis showed that returning visitors do indeed have (even after accounting for their browsing behavior) a better conversion probability.
The process we followed was: Introduction → Propensity Model → Matching → Diagnostics → Effect Estimation.
When using PSM, we must be careful about the selection of covariates (all relevant confounders must be included), check the assumptions (overlap between propensity scores), and remember that PSM reduces bias from observed covariates but does not remove bias from hidden ones. With those caveats, PSM offers an honest and transparent way to approximate a trial and infer causality from your data.
