Making Smarter Business Decisions with Propensity Score Analysis

Felix Lennert
August 13, 2025

Learn more about how and when to use propensity score analysis with Propensity Score Analysis: Basics taught by Professor Shenyang Guo on September 18-20. Or, expand your foundational knowledge with Propensity Score Analysis: Advanced on October 9-11 to explore more advanced techniques and methods. Register for one or both (email info@statisticalhorizons.com for a bundle discount)!

Imagine you’re a data analyst at a large company. Your team rolls out a new training program to improve sales performance. The training isn’t mandatory, and employees can decide to participate. A few months later, your boss asks you: Did it work?

This seems straightforward – compare trained vs. untrained employees. But this assumes that each employee’s decision to participate was effectively random. In reality, the factors that drive employees to choose training might also affect results: motivated employees might take training and succeed due to intrinsic motivation; employees with more free time might take training but perform worse on actual work; or struggling employees might seek training but only improve marginally.

Simply comparing trained vs. untrained employees could mislead you. This is where propensity score analysis comes in – a powerful tool that helps estimate causal effects when randomized experiments aren’t feasible. It’s common in academic research but underused in industry.

The Problem with Naive Comparisons

The following examples will use R and the tidyverse. We’ll demonstrate using the lalonde dataset from the MatchIt package, which evaluates a 1970s job training program’s impact on later earnings.

Key variables:

  • treat: 1 = received job training, 0 = did not
  • re78: earnings in 1978
  • Covariates: age, education (educ), race, previous earnings (re74, re75)

library(tidyverse)
library(MatchIt)

data("lalonde") 

lalonde <- as_tibble(lalonde)
glimpse(lalonde)
Rows: 614
Columns: 9
$ treat    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ age      <int> 37, 22, 30, 27, 33, 22, 23, 32, 22, 33, 19, 21, 18, 27, 17, 1…
$ educ     <int> 11, 9, 12, 11, 8, 9, 12, 11, 16, 12, 9, 13, 8, 10, 7, 10, 13,…
$ race     <fct> black, hispan, black, black, black, black, black, black, blac…
$ married  <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
$ nodegree <int> 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1…
$ re74     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ re75     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ re78     <dbl> 9930.0460, 3595.8940, 24909.4500, 7506.1460, 289.7899, 4056.4…

A simple comparison shows:

lalonde |> 
  group_by(treat) |> 
  summarize(mean_re78 = mean(re78))
# A tibble: 2 × 2
  treat mean_re78
  <int>     <dbl>
1     0     6984.
2     1     6349.

Straightforward answer: people who didn’t attend training earned more ($6,984) than those who did ($6,349). That seems more than a little counterintuitive. But are these groups comparable?

Looking at group characteristics:

lalonde |> 
  group_by(treat) |> 
  summarize(mean_age = mean(age),
            mean_educ = mean(educ)) |> 
  left_join(lalonde |> 
     count(treat, race) |> 
     pivot_wider(id_cols = treat, names_from = race, values_from = n))
Joining with `by = join_by(treat)`
# A tibble: 2 × 6
  treat mean_age mean_educ black hispan white
  <int>    <dbl>     <dbl> <int>  <int> <int>
1     0     28.0      10.2    87     61   281
2     1     25.8      10.3   156     11    18

These groups differ substantially: training participants are younger, and White and Hispanic employees are under-represented among them. If these variables also affect earnings, our naive comparison is biased.

Testing this relationship:

earnings_model <- lm(re78 ~ age + as.factor(race), data = lalonde)
earnings_model |> summary()

Call:
lm(formula = re78 ~ age + as.factor(race), data = lalonde)

Residuals:
   Min     1Q Median     3Q    Max 
 -9462  -5536  -1991   4020  54491 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            3851.44     925.96   4.159 3.65e-05 ***
age                      70.18      30.56   2.296  0.02200 *  
as.factor(race)hispan  1436.40     992.99   1.447  0.14854    
as.factor(race)white   1750.78     644.86   2.715  0.00682 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7400 on 610 degrees of freedom
Multiple R-squared:  0.02353,   Adjusted R-squared:  0.01873 
F-statistic:   4.9 on 3 and 610 DF,  p-value: 0.002258

Results show older employees and White employees earn significantly more. Since characteristics associated with higher earnings are also related to lower training participation, our naive comparison is misleading.

The Propensity Score Solution

To get an unbiased estimate, we need “apples to apples” comparisons – matching individuals similar on all dimensions except training participation.

Propensity score matching works by:

  1. Estimating propensity scores: Calculate each person’s predicted probability of receiving treatment based on their characteristics.
  2. Matching: Pair each treated individual with their most similar untreated counterpart.
  3. Comparing outcomes: Analyze results using these matched pairs.
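
Formally (this is the standard definition going back to Rosenbaum and Rubin), the propensity score is the probability of receiving treatment given the observed covariates:

```latex
e(x) = \Pr(T = 1 \mid X = x)
```

Two individuals with the same value of e(x) are, in expectation, comparable on all the covariates that entered the model, which is what justifies comparing matched pairs in Step 3.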

Step 1: Calculate Propensity Scores

treatment_model <- glm(treat ~ age + as.factor(race), 
                      data = lalonde, family = binomial())
lalonde$p_score <- predict(treatment_model, type = "response")
lalonde
# A tibble: 614 × 10
   treat   age  educ race   married nodegree  re74  re75   re78 p_score
   <int> <int> <int> <fct>    <int>    <int> <dbl> <dbl>  <dbl>   <dbl>
 1     1    37    11 black        1        1     0     0  9930.  0.610 
 2     1    22     9 hispan       0        1     0     0  3596.  0.159 
 3     1    30    12 black        0        0     0     0 24909.  0.631 
 4     1    27    11 black        0        1     0     0  7506.  0.639 
 5     1    33     8 black        0        1     0     0   290.  0.622 
 6     1    22     9 black        0        1     0     0  4056.  0.654 
 7     1    23    12 black        0        0     0     0     0   0.651 
 8     1    32    11 black        0        1     0     0  8472.  0.625 
 9     1    22    16 black        0        0     0     0  2164.  0.654 
10     1    33    12 white        1        0     0     0 12418.  0.0569
# ℹ 604 more rows

This creates a single summary measure capturing how likely someone was to participate in training.

Step 2: Matching Process

We could manually implement matching by:

  1. Splitting data into treated and untreated groups.
  2. For each treated individual, finding the untreated person with the closest propensity score.
  3. Matching without replacement (each untreated person used only once).
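
Before handing the job to matchit(), the three steps above can be sketched in a few lines of base R. This is purely illustrative: the id and p_score values below are made up, standing in for the scores computed in Step 1.

```r
# Toy data standing in for lalonde with a p_score column from Step 1
df <- data.frame(
  id      = 1:10,
  treat   = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0),
  p_score = c(0.62, 0.30, 0.55, 0.58, 0.28, 0.10, 0.61, 0.33, 0.52, 0.05)
)

# Step 1: split into treated and untreated groups
treated   <- df[df$treat == 1, ]
untreated <- df[df$treat == 0, ]

matches   <- data.frame(treated_id = integer(), control_id = integer())
available <- untreated  # the pool shrinks as controls are used up

for (i in seq_len(nrow(treated))) {
  # Step 2: find the untreated unit with the closest propensity score
  j <- which.min(abs(available$p_score - treated$p_score[i]))
  matches <- rbind(matches,
                   data.frame(treated_id = treated$id[i],
                              control_id = available$id[j]))
  # Step 3: without replacement, so drop the control that was just used
  available <- available[-j, ]
}

matches
```

Real implementations add refinements such as calipers (maximum allowed distance) and matching in a deliberate order, which is one reason to prefer a dedicated package.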

The matchit() function will do this for you. As before, we specify the variables that might influence an individual’s propensity to receive treatment. Then, we can extract the matched data set.

matched <- matchit(
  treat ~ age + as.factor(race),
  data = lalonde,
  method = "nearest", 
  distance = "logit"
)

matched_data <- match.data(matched)

Step 3: Compare Matched Results

After matching each training participant with their most similar non-participant, we can simply calculate differences that compare matched individuals:

matched_data |> 
  group_by(treat) |> 
  summarize(mean_re78 = mean(re78))
# A tibble: 2 × 2
  treat mean_re78
  <int>     <dbl>
1     0     6202.
2     1     6349.

Now we see a different picture: when comparing similar individuals, training has a positive effect ($6,349 vs. $6,202).
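
One detail worth noting: match.data() also returns a weights column, and in general the group means should be computed with those weights (for 1:1 nearest-neighbor matching without replacement they are all 1, so the simple means above coincide). A sketch of the weighted comparison, on a made-up stand-in for the matched data:

```r
# Toy stand-in for the output of match.data(): treat, outcome, weights
matched_toy <- data.frame(
  treat   = c(1, 1, 0, 0),
  re78    = c(7000, 5000, 6500, 4500),
  weights = c(1, 1, 1, 1)
)

# Weighted difference in means between matched treated and control units
att <- with(matched_toy,
            weighted.mean(re78[treat == 1], weights[treat == 1]) -
            weighted.mean(re78[treat == 0], weights[treat == 0]))
att
```

Using the weights matters as soon as you switch to matching schemes (with replacement, full matching) where controls can be reused or weighted unequally.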

Including more variables

For more robust matching, include all covariates that plausibly influence both participation and earnings. With the full set of covariates, the training effect becomes even more pronounced.

matched_full <- matchit(
  treat ~ age + educ + as.factor(race) + married + nodegree + re74 + re75,
  data = lalonde,
  method = "nearest", 
  distance = "logit"
)

matched_data_full <- match.data(matched_full)

matched_data_full |> 
  group_by(treat) |> 
  summarize(mean_re78 = mean(re78))
# A tibble: 2 × 2
  treat mean_re78
  <int>     <dbl>
1     0     5455.
2     1     6349.

Summary of Results

# A tibble: 3 × 4
  approach                 training_not_taken training_taken training_effect
  <chr>                                 <dbl>          <dbl>           <dbl>
1 No Matching                           6984.          6349.           -635.
2 Matching on Age and Race              6202.          6349.            148.
3 Matching on Full Data                 5455.          6349.            894.

The naive comparison suggests training reduced earnings by $635. After matching on the full covariate set, however, the estimated effect flips sign: training increased earnings by nearly $900.

When to Use Propensity Score Matching

This technique is invaluable for:

  • Marketing campaigns: Did the campaign work, or did engaged customers self-select?
  • Policy interventions: Comparing outcomes when participation isn’t random.
  • Employee programs: Evaluating training, wellness, or development initiatives.
  • Product launches: Understanding adoption effects beyond early adopter bias.

Limitations and Considerations

Propensity score matching addresses selection bias from observable differences but cannot account for unobserved factors. It works best when:

  • You have rich data on participant characteristics.
  • The assumption of “selection on observables” is reasonable.
  • Randomized experiments aren’t feasible.
  • Treatment and control groups have sufficient overlap in propensity scores.

Practical Implementation Tips

  1. Include relevant covariates: Use all variables that might influence both treatment selection and outcomes.
  2. Check balance: Verify that matched groups are similar on observed characteristics.
  3. Assess overlap: Ensure adequate common support between treatment and control groups.
  4. Sensitivity analysis: Test robustness with different matching methods and specifications.
  5. Consider alternatives: Explore other causal inference methods like instrumental variables or regression discontinuity when appropriate.
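
For tip 2, a common balance diagnostic is the standardized mean difference (SMD) of each covariate between matched groups. In practice summary(matched) from MatchIt reports these for you, but the computation itself is simple enough to sketch on made-up data:

```r
# Standardized mean difference: difference in group means divided by a
# pooled standard deviation; |SMD| < 0.1 is a common rule of thumb for
# acceptable balance after matching.
smd <- function(x, treat) {
  m1 <- mean(x[treat == 1])
  m0 <- mean(x[treat == 0])
  s  <- sqrt((var(x[treat == 1]) + var(x[treat == 0])) / 2)
  (m1 - m0) / s
}

# Toy example: ages in matched treated vs. control groups
age   <- c(25, 30, 28, 27, 26, 31, 29, 28)
treat <- c(1, 1, 1, 1, 0, 0, 0, 0)
smd(age, treat)
```

Computing the SMD before and after matching for every covariate shows at a glance whether matching actually made the groups comparable.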

Conclusion

Propensity score matching reveals a fundamental truth: the most obvious answer isn’t always right. This technique offers a rigorous path to causal insights when randomized experiments aren’t feasible.

The method transforms potentially misleading comparisons into meaningful causal analysis by creating artificial control groups through statistical matching. While not perfect, it’s a powerful tool for making evidence-based decisions in complex business environments where self-selection is the norm rather than the exception.

For data practitioners, the lesson is simple: before concluding an intervention worked or failed, ask whether participants were fundamentally different from non-participants. A few lines of matching code could mean the difference between the right strategic decision and the wrong one.

In our example, a naive analysis would have led to canceling an effective training program, while proper analysis reveals its substantial positive impact.

If you found this intriguing and want to learn more, including how to decide whether the differences you observe are statistically meaningful and how to use matching within more elaborate statistical models, Statistical Horizons offers courses on matching and other statistical topics that can help you thrive in an increasingly data-driven world.
