Why Dichotomizing Continuous Variables Is a Mistake

A small simulation showing the information loss from categorizing predictors.

Categories

modelling, regression, simulation
Author

Fei Zuo

Published

October 26, 2025

Modified

March 4, 2026


Many clinical trial analyses still dichotomize continuous predictors. This feels simple. It is not harmless.

Let’s simulate a clean linear relationship and see what happens.

Simulation setup

Code
set.seed(1)

n <- 500
x <- rnorm(n)
y <- 2*x + rnorm(n)

d <- data.frame(x, y)

Here, the outcome \(y\) depends linearly on the predictor \(x\).

\[ y = 2x + \epsilon, \quad \epsilon \sim N(0,1) \]

Model using a continuous predictor

Code
fit_cont <- lm(y ~ x, data = d)
summary(fit_cont)

Call:
lm(formula = y ~ x, data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.94113 -0.74507  0.01663  0.72882  3.11131 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.04496    0.04730  -0.951    0.342    
x            1.95692    0.04678  41.832   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.057 on 498 degrees of freedom
Multiple R-squared:  0.7785,    Adjusted R-squared:  0.778 
F-statistic:  1750 on 1 and 498 DF,  p-value: < 2.2e-16

This model uses the full information contained in the predictor \(x\).

The estimated coefficient represents the expected change in \(y\) for a one-unit increase in \(x\).
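To make that interpretation concrete, here is a quick check (re-creating the simulated data so the chunk runs on its own): predictions at two values of \(x\) one unit apart differ by exactly the fitted slope.

```r
# Re-create the simulation so this chunk is self-contained
set.seed(1)
n <- 500
x <- rnorm(n)
y <- 2*x + rnorm(n)
d <- data.frame(x, y)

fit_cont <- lm(y ~ x, data = d)

# Predictions one unit apart differ by exactly the estimated slope
pred <- predict(fit_cont, newdata = data.frame(x = c(0, 1)))
unname(diff(pred))        # same value as coef(fit_cont)["x"]
```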

Model using a dichotomized predictor

Code
d$x_cat <- ifelse(d$x >= median(d$x), 1, 0)
fit_cat <- lm(y ~ x_cat, data = d)
summary(fit_cat)

Call:
lm(formula = y ~ x_cat, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6942 -1.1305 -0.0022  1.0593  6.1757 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.5672     0.1017  -15.42   <2e-16 ***
x_cat         3.1330     0.1438   21.79   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.607 on 498 degrees of freedom
Multiple R-squared:  0.4881,    Adjusted R-squared:  0.4871 
F-statistic: 474.9 on 1 and 498 DF,  p-value: < 2.2e-16

This replaces the continuous predictor with a binary indicator of whether \(x\) falls at or above the sample median.

Instead of estimating a slope, the model now estimates the difference in mean outcome between the two groups.
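With a single binary predictor, the fitted coefficient is exactly the difference between the two group means, which we can verify directly (re-creating the data so the chunk is self-contained):

```r
# Re-create the simulation and the median split
set.seed(1)
n <- 500
x <- rnorm(n)
y <- 2*x + rnorm(n)
d <- data.frame(x, y)
d$x_cat <- ifelse(d$x >= median(d$x), 1, 0)
fit_cat <- lm(y ~ x_cat, data = d)

# The x_cat coefficient equals the difference in group means
grp_means <- tapply(d$y, d$x_cat, mean)
unname(diff(grp_means))   # same value as coef(fit_cat)["x_cat"]
```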

Comparing R-squared

About R²

R² (coefficient of determination) is the proportion of variability in the outcome explained by the predictors in the model:

\[ R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} \]

Where:

  • RSS = Residual Sum of Squares

  • TSS = Total Sum of Squares

Interpretation:

  • R² = 0 means the model explains none of the variability.
  • R² = 0.40 means 40% of the variability in the outcome is explained by the predictors.
  • R² = 1 means perfect fit (rare outside of overfitting).

Cautions:

  • R² is not a causal measure.
  • A high value does not imply a correct model.
  • R² never decreases when more predictors are added, even if they are irrelevant.
  • For non-linear models (e.g., logistic regression), pseudo-R² measures do not have the same interpretation.
Code
summary(fit_cont)$r.squared

[1] 0.7784605

Code
summary(fit_cat)$r.squared

[1] 0.4881285
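The R² value reported by `summary()` can also be reproduced directly from the definition above, computing RSS and TSS by hand (the chunk re-creates the data so it runs on its own):

```r
# Re-create the data and the continuous-predictor fit
set.seed(1)
n <- 500
x <- rnorm(n)
y <- 2*x + rnorm(n)
d <- data.frame(x, y)
fit_cont <- lm(y ~ x, data = d)

# R² from its definition: 1 - RSS/TSS
rss <- sum(resid(fit_cont)^2)
tss <- sum((d$y - mean(d$y))^2)
1 - rss/tss               # matches summary(fit_cont)$r.squared
```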

In this simulation, R² drops from about 0.78 to 0.49 when the predictor is dichotomized, even though the underlying relationship is unchanged: the median split discards information.

As a result, the dichotomized model explains less of the variability in the outcome, and its estimates are less precise and its tests less powerful.

Why information is lost

The continuous model uses the full range of variation in the predictor. The dichotomized model only uses whether the value falls above or below a threshold. Observations with very different values of \(x\) are treated as identical if they fall in the same category.

For example, \(x = 0.1\) and \(x = 3.0\) both fall at or above the sample median (which is close to zero here), so both are coded as 1, even though they carry very different information about the outcome.

This loss of information reduces the model’s ability to explain variability in the outcome and weakens its ability to detect real relationships.
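The power loss can be seen directly in a small additional simulation (not part of the original setup): using a weaker slope of 0.5 and a smaller sample of \(n = 40\), repeated many times, we can compare how often each model rejects the null hypothesis at the 5% level.

```r
# Extra simulation: weaker effect (0.5), smaller sample (n = 40),
# repeated 2000 times, comparing the power of the two models
set.seed(2)
reps <- 2000
n <- 40
p_cont <- numeric(reps)
p_cat <- numeric(reps)
for (i in seq_len(reps)) {
  x <- rnorm(n)
  y <- 0.5*x + rnorm(n)
  x_cat <- ifelse(x >= median(x), 1, 0)
  p_cont[i] <- summary(lm(y ~ x))$coefficients["x", 4]
  p_cat[i]  <- summary(lm(y ~ x_cat))$coefficients["x_cat", 4]
}
mean(p_cont < 0.05)   # power with the continuous predictor
mean(p_cat < 0.05)    # power after a median split: noticeably lower
```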

Takeaway

Categorizing a continuous predictor throws away information and reduces statistical efficiency. Unless there is a compelling clinical reason, predictors should generally be analyzed on their original continuous scale.