A small simulation showing the information loss from categorizing predictors.

Fei Zuo

Published October 26, 2025 · Updated March 4, 2026

```r
set.seed(1)
n <- 500
x <- rnorm(n)
y <- 2*x + rnorm(n)
d <- data.frame(x, y)
```
Many clinical trial analyses still dichotomize continuous predictors. This feels simple. It is not harmless.
Let’s simulate a clean linear relationship and see what happens.
Here, the outcome \(y\) depends linearly on the predictor \(x\).
\[ y = 2x + \epsilon, \quad \epsilon \sim N(0,1) \]
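The first model can be fit with a call like the following (a reconstruction; the original code chunk is collapsed, but the call matches the `lm(formula = y ~ x, data = d)` shown in the output):

```r
# Simulate the data as above and fit the continuous-predictor model
set.seed(1)
n <- 500
x <- rnorm(n)
y <- 2 * x + rnorm(n)
d <- data.frame(x, y)

m_cont <- lm(y ~ x, data = d)  # slope = expected change in y per unit of x
summary(m_cont)
```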
```
Call:
lm(formula = y ~ x, data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.94113 -0.74507  0.01663  0.72882  3.11131 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.04496    0.04730  -0.951    0.342    
x            1.95692    0.04678  41.832   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.057 on 498 degrees of freedom
Multiple R-squared:  0.7785,    Adjusted R-squared:  0.778
F-statistic:  1750 on 1 and 498 DF,  p-value: < 2.2e-16
```
This model uses the full information contained in the predictor \(x\).
The estimated coefficient represents the expected change in \(y\) for a one-unit increase in \(x\).
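The second model dichotomizes the predictor. Its code chunk is also collapsed, so the sketch below reconstructs `x_cat` from the description in the text (an indicator equal to 1 when \(x\) is at or above its median, 0 otherwise); the exact coding is an assumption:

```r
# Simulate the same data, then split x at its median
set.seed(1)
n <- 500
x <- rnorm(n)
y <- 2 * x + rnorm(n)
d <- data.frame(x, y)

# x_cat reconstructed from the text: at-or-above-median indicator
d$x_cat <- as.numeric(d$x >= median(d$x))

m_cat <- lm(y ~ x_cat, data = d)  # estimates the difference in group means
summary(m_cat)
```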
```
Call:
lm(formula = y ~ x_cat, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6942 -1.1305 -0.0022  1.0593  6.1757 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.5672     0.1017  -15.42   <2e-16 ***
x_cat         3.1330     0.1438   21.79   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.607 on 498 degrees of freedom
Multiple R-squared:  0.4881,    Adjusted R-squared:  0.4871
F-statistic: 474.9 on 1 and 498 DF,  p-value: < 2.2e-16
```
This model replaces the continuous predictor with a binary variable indicating whether the value of \(x\) falls at or above the median, or below it.
Instead of estimating a slope, the model now estimates the difference in mean outcome between the two groups.
R² (coefficient of determination) is the proportion of variability in the outcome explained by the predictors in the model:
\[ R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} \]
Where:

- RSS = residual sum of squares (variability left unexplained by the model)
- TSS = total sum of squares (total variability in the outcome)

Interpretation: R² ranges from 0 to 1; higher values mean the predictors account for a larger share of the outcome's variability.

Cautions: R² is a measure of fit, not of correctness or causality, and it never decreases when predictors are added, so it should not be the only criterion for comparing models.
In this simulation, comparing R² between models shows how dichotomizing a predictor discards information — the explanatory power decreases even though the underlying relationship is unchanged.
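For this simulation, the drop in R² can be read directly off the two fitted models (a sketch assuming the simulated data and the median-split `x_cat` described above):

```r
# Regenerate the simulation and compare R-squared across the two models
set.seed(1)
x <- rnorm(500)
y <- 2 * x + rnorm(500)
x_cat <- as.numeric(x >= median(x))  # reconstructed median split

r2_cont <- summary(lm(y ~ x))$r.squared      # ~0.78
r2_cat  <- summary(lm(y ~ x_cat))$r.squared  # ~0.49
c(continuous = r2_cont, dichotomized = r2_cat)
```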
The model using a dichotomized predictor throws away information. The estimated relationship becomes less precise and less powerful.
The continuous model uses the full range of variation in the predictor. The dichotomized model only uses whether the value falls above or below a threshold. Observations with very different values of \(x\) are treated as identical if they fall in the same category.
For example, values of \(x=0.1\) and \(x=3.0\) would both be coded as 1 if they fall at or above the median, even though they carry very different information about the outcome.
This loss of information reduces the model’s ability to explain variability in the outcome and weakens its ability to detect real relationships.
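The weakened ability to detect real relationships can be illustrated with a quick Monte Carlo sketch (hypothetical settings, not from the original post: a smaller sample and a weaker slope, so that neither model detects the effect every time):

```r
# Empirical power: how often each model yields p < 0.05 for a weak true effect
set.seed(2)
reps <- 1000
n <- 80
beta <- 0.2  # hypothetical weak slope
hits <- c(continuous = 0, dichotomized = 0)
for (i in seq_len(reps)) {
  x <- rnorm(n)
  y <- beta * x + rnorm(n)
  p_cont <- summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
  x_cat <- as.numeric(x >= median(x))
  p_cat <- summary(lm(y ~ x_cat))$coefficients["x_cat", "Pr(>|t|)"]
  hits <- hits + c(p_cont < 0.05, p_cat < 0.05)
}
hits / reps  # the continuous model detects the effect more often
```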
Categorizing a continuous predictor throws away information and reduces statistical efficiency. Unless there is a compelling clinical reason, predictors should generally be analyzed on their original continuous scale.