Power, Estimands, and Endpoint Choice in Seizure Trials
Aligning sample size, estimands, and seizure count models in epilepsy trials
In many epilepsy clinical trials evaluating seizure frequency, the primary analysis is based on subject-level outcome measures such as:
- Percent change from baseline (PCFB), or
- Responder status, derived from achieving a ≥50% reduction in PCFB.
What is an estimand
An estimand precisely defines the treatment effect a clinical trial seeks to estimate. According to ICH E9(R1) (2019), an estimand includes:
- The target population
- The treatment condition
- The outcome variable
- The handling of intercurrent events
- The population-level summary measure (e.g., mean difference, rate ratio)
It clarifies the scientific question before analysis begins.
Correspondingly, the sample size calculation is often derived from detecting:
A difference in mean PCFB under normal-theory assumptions, or a shift in the distribution of PCFB using a rank-based nonparametric test (e.g., the Wilcoxon rank-sum test).
- The latter is frequently described as powering for a “median difference,” although the Wilcoxon test does not, in general, test equality of medians — a distinction we will revisit.
A difference in responder proportions (using binomial assumptions).
These choices look standard, but they are conceptually disconnected from the estimand the trial is supposed to target.
What is the underlying outcome?
In epilepsy trials, what we observe are seizure counts reported over specified time intervals:
\[ Y_i^{\text{base}}, \quad Y_i^{\text{post}} \]
Seizures are:
Nonnegative integers
Heterogeneous across patients
Overdispersed
Often approximately negative binomial
In a parallel-group randomized trial, the natural causal estimand is defined on the count scale, typically as a rate ratio representing a multiplicative change in seizure frequency.
\[ \text{Rate ratio} = \frac{E\!\left[Y^{\text{post}}(1)\right]} {E\!\left[Y^{\text{post}}(0)\right]} \]
Estimand (ICH E9(R1))
In ICH E9(R1) terms, this specifies the treatment effect through the outcome (post-treatment seizure counts) and the summary measure (a rate ratio), producing a population-level causal contrast on the count scale.
This answers:
How does seizure frequency compare between treatment and control?
Modeling treatment as acting on the event rate reflects both the biological mechanism and the stochastic structure of the data.
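As a minimal sketch with made-up counts (and assuming equal follow-up in both arms), the empirical rate ratio is simply the ratio of group mean counts:

```python
from statistics import mean

# Hypothetical 28-day seizure counts from a small parallel-group trial
control = [4, 6, 2, 8, 3, 5, 7, 2, 6, 7]   # mean 5.0
treated = [2, 0, 5, 3, 1, 4, 0, 2, 3, 0]   # mean 2.0

# Empirical rate ratio: mean post-treatment count, treated vs control
rate_ratio = mean(treated) / mean(control)

# A rate ratio of 0.4 corresponds to a 60% reduction in seizure frequency
percent_reduction = 100 * (1 - rate_ratio)
print(rate_ratio, percent_reduction)
```

A fitted count model (e.g., negative binomial regression) would produce the same kind of contrast with a proper standard error; the point here is only the estimand's scale.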
Stochastic structure
“Stochastic” refers to the probabilistic pattern of variability in the data.
Seizure counts are discrete events whose variability depends on the underlying event rate and differs across patients. Count models respect this mean–variance relationship; transformed endpoints such as percent change or responder status do not.
So what happens when we change the outcome scale?
Instead of modeling counts directly, many trials analyze transformations of these counts.
The moment we change the outcome scale, we change the estimand even if the raw data are the same.
Seizures are recurrent events occurring over time. Treatment plausibly acts by modifying the underlying event rate, leading to proportional (multiplicative) changes rather than fixed absolute shifts.
Clinically, seizure studies are already described this way (e.g., a 30% reduction, a 50% reduction). A rate ratio formalizes that proportional interpretation directly on the count scale.
Powering on percent change from baseline (PCFB)
PCFB is defined as:
\[ \frac{Y^{\text{post}} - Y^{\text{base}}} {Y^{\text{base}}} \]
This is:
A nonlinear transformation
A ratio of two random variables
Baseline-dependent
Highly heteroskedastic
Asymmetric
Heteroskedasticity (linear regression)
The classical linear model assumes homoskedasticity (i.e., constant error variance): \(\mathrm{Var}(\varepsilon_i \mid X_i)=\sigma^2\).
Heteroskedasticity means the variance depends on \(X\): \(\mathrm{Var}(\varepsilon_i \mid X_i)=\sigma_i^2\).
Estimated coefficients remain unbiased (if the mean model is correct), but standard errors can be wrong, leading to invalid confidence intervals and p-values.
PCFB and heteroskedasticity
PCFB is a ratio with baseline in the denominator.
When baseline is small, small absolute changes produce large percent changes.
The resulting variability depends on baseline level.
Power calculations that assume a constant variance on the percent-change scale therefore misrepresent the true variability of the endpoint.
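A small simulation makes this concrete. This is a sketch under stated assumptions (Poisson counts given a patient-specific rate, a hand-rolled Knuth sampler, the same 30% true rate reduction for every patient, and an eligibility rule of at least one baseline seizure; all names are hypothetical):

```python
import math
import random

def rpois(lam, rng):
    # Knuth's Poisson sampler (adequate for moderate rates)
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        p *= rng.random()
        k += 1
    return k - 1

def pcfb_variance(rate, n, rng):
    # Simulate PCFB for n patients sharing a true seizure rate;
    # eligibility requires at least one baseline seizure, as trials do.
    vals = []
    while len(vals) < n:
        base = rpois(rate, rng)
        if base == 0:
            continue
        post = rpois(0.7 * rate, rng)  # identical 30% true reduction
        vals.append((post - base) / base)
    m = sum(vals) / n
    return sum((v - m) ** 2 for v in vals) / n

rng = random.Random(1)
var_low = pcfb_variance(rate=4.0, n=500, rng=rng)    # low-baseline patients
var_high = pcfb_variance(rate=40.0, n=500, rng=rng)  # high-baseline patients
print(var_low, var_high)  # same true effect, very different PCFB spread
```

The treatment effect is identical in both strata, yet the PCFB variance is several times larger at low baselines: exactly the heteroskedasticity a constant-variance power formula ignores.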
Yet sample size calculations typically assume:
Approximate normality (or large-sample symmetry)
Homoskedasticity
Additive effects on the percent-change scale
A well-defined mean percent change (i.e., not dominated by baseline-dependent instability, extreme values, or denominator sensitivity)
None of these assumptions are satisfied for the PCFB endpoint.
The estimand under PCFB
When PCFB is analyzed, the estimand becomes:
\[ \Delta_{\text{PCFB}} = E\!\left[ \frac{Y_i^{\text{post}} - Y_i^{\text{base}}} {Y_i^{\text{base}}} \;\middle|\; A_i = 1 \right] - E\!\left[ \frac{Y_i^{\text{post}} - Y_i^{\text{base}}} {Y_i^{\text{base}}} \;\middle|\; A_i = 0 \right] \]
Treatment indicator
\(A_i = 1\) denotes assignment to treatment and \(A_i = 0\) denotes control.
This answers:
Did average percent change differ between groups?
That is a different question from:
By what proportion did treatment change seizure frequency relative to control?
The former targets the mean of a ratio-based transformation. The latter targets the seizure process itself.
Once PCFB is chosen as the outcome scale, the trial no longer estimates a treatment effect on seizure counts or seizure frequency. It estimates a difference in average percent change, a quantity that:
Depends heavily on baseline values
Is unstable at low baseline values
Embeds regression to the mean
Distorts variability due to asymmetric bounds
Discards information on the absolute scale by equating biologically distinct count reductions
For example:
Going from 200 → 100 seizures
Going from 10 → 5 seizures
Both yield:
\[ \frac{{\text{post}} - {\text{baseline}}} {{\text{baseline}}} = -50\% \]
But the clinical burden is clearly different.
The PCFB transformation collapses distinct biological realities into identical summary values.
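A two-line check of that collapse (the `pcfb` helper is illustrative):

```python
def pcfb(baseline, post):
    # percent change from baseline, expressed as a fraction
    return (post - baseline) / baseline

# Two biologically different reductions...
a = pcfb(200, 100)   # 100 fewer seizures
b = pcfb(10, 5)      # 5 fewer seizures

# ...collapse to the same PCFB value
print(a, b)  # both -0.5, i.e. -50%
```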
The asymmetry problem
Percent decrease is bounded:
\[ -100\% \le \text{PCFB} < \infty \]
You cannot reduce seizures by more than 100%, but increases are unbounded. The scale is inherently asymmetric:
50 → 5 = −90%
5 → 50 = +900%
100 → 5 = −95%
5 → 100 = +1900%
This means:
Large increases can dominate averages.
Variance behaves very differently in the increasing direction.
The scale does not reflect biological symmetry.
The PCFB value depends heavily on baseline level, yet baseline can no longer be meaningfully adjusted for in a regression model. This will be demonstrated in a subsequent simulation post.
The asymmetry arises from the transformation, not from seizure biology.
Converting counts to PCFB distorts the variance structure and discards information contained in the original data.
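A toy example (made-up counts) of how a single unbounded increase can dominate the group summary:

```python
from statistics import mean, median

# Hypothetical (baseline, post) counts: three patients improve markedly,
# one low-baseline patient worsens
pairs = [(50, 5), (40, 8), (30, 6), (5, 50)]
pcfb = [(post - base) / base for base, post in pairs]  # -90%, -80%, -80%, +900%

print(mean(pcfb))    # positive: the one unbounded increase dominates
print(median(pcfb))  # negative: most patients improved
```

Three of four patients improved substantially, yet the mean PCFB suggests a large worsening.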
A treatment contrast defined on percent change from baseline is not equivalent to a proportional change in the group-level event rate.
A rate ratio compares mean seizure frequencies between groups. PCFB averages subject-specific ratios, which weight patients differently depending on baseline and do not, in general, recover the proportional (multiplicative) effect on the underlying event rate.
Does analyzing the median rescue PCFB?
A common response is:
“Then analyze the median percent change instead.”
This does not resolve the fundamental issues.
Even when focusing on the median, the estimand remains defined on the PCFB scale. The question simply becomes:
Did the median percent change differ between groups?
This is still:
A nonlinear ratio
Baseline-dependent
Asymmetric
Heteroskedastic
Switching from mean to median does not repair the scale distortion.
The estimand remains disconnected from the underlying seizure count process.
Wilcoxon does not test equality of medians
In practice, “median PCFB analysis” often uses the Wilcoxon rank-sum test.
But the Wilcoxon test does not, in general, test equality of medians. It tests for a shift in distributions; more precisely, it evaluates the probability that a randomly selected patient from the treatment group has a better outcome than a randomly selected patient from the control group. Under the null hypothesis,
\[ P(Z_1 > Z_0) + \frac{1}{2} P(Z_1 = Z_0) = \frac{1}{2} \]
Only under strong conditions (e.g., identical shapes with pure location shift) does this correspond to a median difference.
Those conditions are unlikely to hold for PCFB because:
PCFB is skewed
It is bounded below at −100%
It is unbounded above
Its variance depends on baseline
So describing a Wilcoxon-powered trial as being powered for a “median difference” is generally incorrect.
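A small illustration with made-up outcome values of the quantity the Wilcoxon statistic actually targets: two groups with identical medians can still yield a win probability different from 1/2:

```python
from statistics import median

z0 = [-5, -1, 0, 1, 5]    # control outcomes (hypothetical)
z1 = [-2, -1, 0, 1, 20]   # treatment outcomes (hypothetical)

# Estimate P(Z1 > Z0) + 0.5 * P(Z1 = Z0) over all cross-group pairs
wins = sum(a > b for a in z1 for b in z0)
ties = sum(a == b for a in z1 for b in z0)
prob_index = (wins + 0.5 * ties) / (len(z1) * len(z0))

print(median(z0), median(z1))  # identical medians (both 0)
print(prob_index)              # not 1/2: the distributions still differ
```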
Hodges–Lehmann is not a median difference either
The Hodges–Lehmann estimator is often described as estimating a “median difference.”
Notation
\(Z_{1i}\), \(Z_{0j}\): individual outcomes for subjects in the treatment and control groups.
\(Z_1\), \(Z_0\): the corresponding group-level outcome distributions (random variables).
The difference in medians subtracts the 50th percentile of each group’s distribution, whereas the Hodges–Lehmann estimator takes the median of all cross-group pairwise differences:
\[ \operatorname{Median}_{i,j}\bigl(Z_{1i} - Z_{0j}\bigr) \] That is not the same as the difference in marginal medians:
\[ \operatorname{Median}(Z_1) - \operatorname{Median}(Z_0) \]
These are only equal under special symmetry conditions.
Again, PCFB generally does not satisfy those conditions.
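A minimal numeric check (hypothetical values) that the two quantities can disagree:

```python
from statistics import median

# Hypothetical skewed outcomes
z1 = [0, 1, 10]  # treatment
z0 = [0, 2, 3]   # control

diff_of_medians = median(z1) - median(z0)               # 1 - 2 = -1
hodges_lehmann = median(a - b for a in z1 for b in z0)  # median of 9 pairwise diffs
print(diff_of_medians, hodges_lehmann)  # -1 vs 0: not the same quantity
```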
What about ranked ANCOVA?
Ranked ANCOVA replaces PCFB with its overall ranks and fits a standard linear model.
Under the null, it tests whether treatment shifts the adjusted rank distribution of PCFB. This may appear attractive as it allows baseline adjustment and avoids normality assumptions. But it does not fix the core issue.
The resulting estimand targets a shift in the ranked percent changes, not an effect on seizure counts. It answers:
Does treatment change the mean ranks of percent changes?
Ranked ANCOVA
Often described as “nonparametric,” ranked ANCOVA avoids distributional assumptions about the outcome but still fits a linear ANCOVA model after ranking.
Thus it retains the usual linear model structure, including additive covariate effects and common slopes unless interactions are included. It is distribution-free for the outcome, but not model-free.
In practice, ranked ANCOVA provides only a p-value for a treatment effect on the rank scale. It does not directly estimate a clinically interpretable treatment effect (e.g., a mean or median difference in PCFB), nor does it provide a corresponding measure of precision.
What often happens in epilepsy trials is that the p-value from ranked ANCOVA is reported alongside descriptive median PCFB values for each treatment group. This implicitly suggests that the test pertains to a difference in medians, which is generally not true. The p-value corresponds to a shift in adjusted ranks, not to a formally estimated median difference.
Again, ranking does not repair the distortion introduced by PCFB itself. It simply applies a different statistical test to the same transformed endpoint.
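As a rough sketch of the mechanics (hypothetical data; midranks for ties; the least-squares fit is hand-rolled here only to keep the example dependency-free), ranked ANCOVA regresses the pooled ranks of PCFB on treatment and baseline:

```python
def midranks(values):
    # average ranks, with tied values sharing their mean rank
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            r[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return r

def ols(X, y):
    # solve the normal equations X'X b = X'y by Gauss-Jordan elimination
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    M = [row + [b] for row, b in zip(XtX, Xty)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * v for x, v in zip(M[r], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]

# Hypothetical (baseline, post, treatment) triples
data = [(10, 9, 0), (20, 22, 0), (8, 8, 0), (30, 27, 0), (12, 13, 0),
        (10, 4, 1), (20, 9, 1), (8, 5, 1), (30, 12, 1), (12, 6, 1)]
pcfb = [(post - base) / base for base, post, _ in data]
y = midranks(pcfb)                            # outcome: pooled ranks of PCFB
X = [[1.0, a, base] for base, _, a in data]   # intercept, treatment, baseline

beta = ols(X, y)
print(beta[1])  # treatment effect on the *rank* scale, not on seizure counts
```

The fitted treatment coefficient lives on the rank scale; it is not interpretable as a seizure-frequency effect, which is precisely the limitation described above.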
What about quantile regression?
If the goal is to estimate a median difference in percent change, the appropriate tool is quantile regression.
For the median, the estimand is:
\[ Q_{0.5}(Z \mid A = 1) - Q_{0.5}(Z \mid A = 0) \]
where \(Z\) denotes percent change from baseline.
This answers:
What is the difference in median percent change between groups?
This directly estimates a difference in medians with confidence intervals, unlike Wilcoxon, Hodges–Lehmann, or ranked ANCOVA.
But the estimand remains a contrast in percent change, not a treatment effect on seizure frequency.
The statistical method can be correct, while the estimand is misaligned with the data-generating mechanism.
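For a binary treatment with no covariates, the median-regression estimand reduces to a difference in group medians, and a percentile bootstrap gives an interval. The following sketch (made-up PCFB values) assumes exactly that simplification rather than a full quantile-regression fit:

```python
import random
from statistics import median

# Hypothetical PCFB values (fractions; bounded below by -1, unbounded above)
z1 = [-0.9, -0.8, -0.75, -0.6, -0.5, -0.4, -0.3, 0.1, 0.5, 2.0]  # treatment
z0 = [-0.6, -0.5, -0.4, -0.3, -0.2, -0.1, 0.0, 0.3, 0.9, 4.0]    # control

# Point estimate of the median-difference estimand
est = median(z1) - median(z0)

# Percentile bootstrap interval for the median difference
rng = random.Random(7)
boots = sorted(
    median(rng.choices(z1, k=len(z1))) - median(rng.choices(z0, k=len(z0)))
    for _ in range(2000)
)
ci = (boots[49], boots[1949])  # approx. 2.5th and 97.5th percentiles
print(est, ci)
```

Unlike the rank-based procedures above, this yields an estimate and an interval for a named estimand, but the estimand is still defined on the PCFB scale.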
Powering on responder status
Responder status is typically derived as:
\[ R_i = \mathbf{1}\!\left\{ \frac{Y_i^{\text{post}} - Y_i^{\text{base}}}{Y_i^{\text{base}}} \le -0.5 \right\} \]
The estimand under responder analysis
When responder status is analyzed, the estimand becomes:
\[ \Delta_{\text{Resp}} = P\bigl(R(1)=1\bigr) - P\bigl(R(0)=1\bigr) \]
This answers:
Did the probability of achieving at least a 50% reduction differ between groups?
The responder estimand targets a tail probability of a thresholded transformation of a ratio of counts — several layers removed from the underlying seizure process.
Because responder status is derived from PCFB, it inherits the structural distortions of that transformation and further reduces the data to a binary indicator.
This approach implicitly assumes:
The outcome is inherently binary
Treatment is acting directly on the probability of response
There is no underlying count structure beneath the response indicator
Under this view, “responder status” is treated as if it were the primary stochastic outcome following a Bernoulli data-generating process. But in reality, it is not generated by a Bernoulli mechanism; it is instead constructed from noisy, overdispersed count data.
Consequences for power and sample size
Power calculations based on a responder analysis may be overstated because the endpoint:
Discards magnitude beyond the 50% threshold
Changes discontinuously near the cutoff
Is highly sensitive to baseline seizure distribution
Ignores count overdispersion and heterogeneity in variance calculations
Ignores follow-up duration or exposure time in variance calculations
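Two of these points can be seen directly from the responder definition itself (hypothetical counts; the `responder` helper is illustrative):

```python
def responder(base, post, threshold=-0.5):
    # 1{PCFB <= -50%}
    return (post - base) / base <= threshold

# One seizure separates responder from non-responder
print(responder(10, 5))   # True  (-50%)
print(responder(10, 6))   # False (-40%)

# A 99-seizure reduction "fails" while a 5-seizure reduction "responds"
print(responder(200, 101))  # False (-49.5%)
print(responder(10, 5))     # True  (-50%)
```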
Why this matters
When the outcome scale changes, the estimand changes. And when the estimand changes, the scientific question changes.
| Analysis | Estimand | Scientific Question |
|---|---|---|
| Count model | Rate ratio | By what proportion does treatment change seizure frequency relative to control? |
| PCFB | Difference in percent change (mean or median) | Did percent change differ between treatment and control? |
| Responder | Difference in response probability | Did the probability of achieving at least a 50% reduction differ between treatment and control? |
Why proportional (multiplicative) effects align with seizure-generating processes
Seizures are recurrent events generated by an underlying patient-specific event rate.
If treatment reduces that rate by, say, 30% compared to placebo, this implies:
\[ E[Y(1)] = 0.70 \, E[Y(0)] \]
This proportional (multiplicative) structure:
- Reflects how risk-reduction therapies typically operate
- Preserves the count scale
- Avoids baseline-dependent distortion introduced by dividing by individual baseline values
In contrast, percent change and responder status are transformations of observed counts. They introduce nonlinear distortion, heteroskedasticity, and arbitrary thresholds that are not inherent to seizure biology.
Powering under PCFB or responder frameworks is disconnected from the true data-generating process and has the following implications:
Type I error can drift because variance is mischaracterized
Power is misestimated
Larger sample sizes are often required
The trial is designed to detect the wrong quantity
You may believe that you are powering the trial to detect a reduction in seizure frequency, but you are actually powering it to detect:
A mean difference or a distribution shift in a transformed ratio, or
A proportion difference in a threshold-crossing probability
The design, the estimand, and the biology are no longer aligned. Once that misalignment is built into the sample size calculation, it propagates through the entire trial.
What should be done instead?
For seizure endpoints:
Specify the estimand on the count scale using an appropriate treatment contrast (e.g., a rate ratio) and convert the result to a clinically interpretable quantity (e.g., a rate ratio of 0.7 translates to a 30% reduction in seizure frequency for active compared to placebo)
Specify a realistic data-generating process (e.g., negative binomial with overdispersion)
Simulate trials under that model
Fit the intended analysis model
Estimate empirical power
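Under stated assumptions (negative binomial counts generated as a gamma-Poisson mixture, equal follow-up in both arms, and a delta-method Wald test on the log rate ratio standing in for a fitted count model), the steps above can be sketched end to end as:

```python
import math
import random

def rpois(lam, rng):
    # Knuth's Poisson sampler (adequate for the rates used here)
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        p *= rng.random()
        k += 1
    return k - 1

def rnegbin(mu, k, rng):
    # negative binomial via gamma-Poisson mixture: mean mu, dispersion k
    return rpois(rng.gammavariate(k, mu / k), rng)

def one_trial(n, mu0, rate_ratio, k, rng):
    # simulate one two-arm trial; Wald test of H0: log rate ratio = 0
    y0 = [rnegbin(mu0, k, rng) for _ in range(n)]
    y1 = [rnegbin(mu0 * rate_ratio, k, rng) for _ in range(n)]
    m0, m1 = sum(y0) / n, sum(y1) / n
    v0 = sum((y - m0) ** 2 for y in y0) / (n - 1)
    v1 = sum((y - m1) ** 2 for y in y1) / (n - 1)
    # delta-method standard error of log(m1/m0)
    se = math.sqrt(v1 / (n * m1 ** 2) + v0 / (n * m0 ** 2))
    return abs(math.log(m1 / m0) / se) > 1.96

def empirical_power(n, mu0, rate_ratio, k, nsims, seed):
    rng = random.Random(seed)
    return sum(one_trial(n, mu0, rate_ratio, k, rng) for _ in range(nsims)) / nsims

power = empirical_power(n=100, mu0=10.0, rate_ratio=0.6, k=1.0, nsims=200, seed=42)
alpha = empirical_power(n=100, mu0=10.0, rate_ratio=1.0, k=1.0, nsims=200, seed=42)
print(power, alpha)  # empirical power at RR = 0.6; type I error at RR = 1
```

In practice one would fit the intended negative binomial regression in each simulated trial and use more simulations; the structure of the loop is the same.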
This ensures:
Estimand coherence
Proper variance representation
Honest operating characteristics
What comes next
The remaining issue is operating characteristics — measures of how a design performs across repeated trials under a given data-generating process.
Analytical vs Empirical Power
Analytical power is calculated from a closed-form formula under assumed model conditions (e.g., normality, binomial variance).
Empirical power is estimated by simulation: generate many datasets under a specified data-generating process, analyze each one, and compute the proportion rejecting \(H_0\).
When model assumptions are incorrect, analytical power may be miscalibrated.
Empirical power reflects operating characteristics under the assumed data-generating process.
In a subsequent post, we will compare count-based (seizure frequency), PCFB, and responder analyses under realistic seizure-generating processes, to examine:
Type I error
Empirical power
Bias
Mean squared error (MSE)
Confidence interval coverage
Takeaway
Seizures are counts. Replacing counts with percent change or responder status changes the estimand and therefore the scientific question.
PCFB and responder analyses do not estimate a treatment effect on seizure frequency. They estimate contrasts in transformed or thresholded quantities whose statistical properties and interpretation are disconnected from the underlying count process.
When the estimand is misaligned with the seizure-generating process, power is computed for the wrong quantity and the trial answers a different question than intended.