Trial Notes

Moving Beyond Percent Change from Baseline: Longitudinal Negative Binomial Models in Single-Arm Studies

Fei Zuo — Sat, 04 Apr 2026 04:00:00 GMT

Disclaimer

This post uses simulated data for illustration and does not reflect any specific study, program, or proprietary design.

In single-arm or open-label seizure studies, efficacy is often summarized using percent change from baseline (PCFB). Tables of mean or median PCFB, and sometimes responder rates, are used to describe how patients appear to improve after starting treatment.

While familiar, these summaries come with important limitations:

collapse rich longitudinal data into a single derived value
behave asymmetrically and nonlinearly
obscure how outcomes evolve over time
do not directly estimate a seizure reduction (see discussion here)

An alternative is to analyze the observed seizure counts directly over time using a longitudinal negative binomial (NB) model. Rather than reducing the data to baseline-to-post-baseline summaries, this approach uses all repeated measurements and models how seizure rates evolve after treatment initiation.

In single-arm settings, this naturally leads to using time as a proxy for treatment exposure.

Model setup

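In the absence of a control arm, we observe seizure counts repeatedly after treatment begins. A longitudinal NB model for the conditional mean \(\mu_{ij} = E[Y_{ij} \mid b_i]\) can be written as:

\[ \log(\mu_{ij}) = \beta_0 + f(t_{ij}) + \gamma x_i + b_i + \log(T_{ij}) \]

where:

\(Y_{ij}\): seizure count for patient \(i\) at time \(j\)
\(f(t_{ij})\): effect of follow-up time
\(x_i\): baseline seizure frequency
\(\gamma\): coefficient for baseline
\(b_i\): subject-specific random effect
\(\log(T_{ij})\): offset accounting for observation duration (applicable when follow-up duration differs across subjects)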

Here, time captures how seizure rates change after treatment initiation, while baseline explains between-patient differences.
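To make the structure concrete, here is a minimal sketch that simulates counts from a model of this form using only the Python standard library. Everything numeric is an assumption chosen for illustration (intercept, linear time slope, baseline coefficient, random-effect SD, dispersion, 28-day visits), as are the function names. The NB count is drawn as a gamma-Poisson mixture.

```python
import math
import random
import statistics

def rpois(rng, lam):
    """Poisson draw via Knuth's multiplication method (adequate for modest lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_patient(rng, n_visits=6, beta0=-2.5, beta1=-0.10, gamma=0.05,
                     sd_b=0.4, shape=2.0, days=28):
    """Visit-level counts for one patient; every default is an assumed value."""
    x_i = rng.uniform(4, 30)                 # baseline seizure frequency
    b_i = rng.gauss(0.0, sd_b)               # subject-specific random effect
    counts = []
    for t in range(1, n_visits + 1):
        # log(mu) = beta0 + beta1 * t + gamma * x_i + b_i + log(T)
        mu = math.exp(beta0 + beta1 * t + gamma * x_i + b_i + math.log(days))
        lam = mu * rng.gammavariate(shape, 1.0 / shape)  # gamma mixing gives NB counts
        counts.append(rpois(rng, lam))
    return x_i, counts

rng = random.Random(2026)
cohort = [simulate_patient(rng)[1] for _ in range(2000)]
mean_by_visit = [statistics.fmean([c[v] for c in cohort]) for v in range(6)]
print([round(m, 2) for m in mean_by_visit])
```

With a negative time slope, the mean count declines from visit to visit, while the baseline term and the random effect account for differences between patients.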

How this is different from PCFB

PCFB asks:

How much did the outcome change from baseline to a single follow-up time?

A longitudinal NB model asks:

What is the trajectory of seizure rates over time after treatment starts?

Key differences:

PCFB discards the original count data → NB uses count data from all visits
PCFB embeds baseline in the outcome, introducing extra noise and variability → NB allows baseline to be modelled separately as a predictor
PCFB distorts scale because it’s asymmetric and nonlinear → NB models counts on a natural rate scale
PCFB summaries typically collapse repeated measures into a single value and so ignore within-subject correlation → NB explicitly models repeated measures

Baseline: covariate vs reference point

A common practice in single-arm analyses is to:

compare each post-baseline visit to baseline, rather than adjusting for baseline as a covariate

While intuitive, this approach has a key flaw.

Baseline seizure frequency is highly prognostic of future counts. It is not just a reference point—it is a major source of variability across patients.

When you anchor everything to baseline:

baseline becomes part of the outcome (as in PCFB)
its prognostic information is no longer used to explain variability
additional noise is introduced because the outcome depends on both baseline and post-baseline measurements
and time effects can become confounded with baseline-dependent dynamics (including regression to the mean)

In contrast, including baseline as a covariate:

separates between-patient heterogeneity from within-patient change over time
improves efficiency
and yields a cleaner interpretation of the post-treatment trajectory

In single-arm settings, apparent improvements measured as changes from baseline (including PCFB and responder rates derived from PCFB) do not identify a treatment effect. This is because the outcome is defined relative to baseline, so any observed changes, whether due to natural fluctuations, regression to the mean, or other time effects, are carried directly into the endpoint. Without a control group, these changes are indistinguishable from what would have happened in the absence of treatment. As a result, descriptive summaries such as PCFB or responder rates should not be interpreted as evidence of efficacy.

A longitudinal negative binomial model provides a more principled way to evaluate post-baseline trajectory by accounting for baseline differences and within-subject variability. Importantly, baseline is treated as a covariate rather than embedded in the outcome, so the model does not rely on baseline as a reference point for comparison.

Interpreting the time effect in a negative binomial model

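If time is modelled linearly, with the baseline, random-effect, and offset terms unchanged:

\[ \log(\mu_{ij}) = \beta_0 + \beta_1 t_{ij} + \gamma x_i + b_i + \log(T_{ij}) \]

then:

\(\exp(\beta_1)\) is the rate ratio per visit
e.g., \(\exp(\beta_1) = 0.90\) implies a 10% reduction in seizure rate per visit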

This provides a simple summary of longitudinal change but it assumes the effect is constant over time.
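The constant-effect assumption means the per-visit rate ratio compounds multiplicatively. A short sketch of that arithmetic, using the illustrative value 0.90 from above and assuming time is coded so that 0 is the reference point:

```python
import math

rate_ratio_per_visit = 0.90               # exp(beta_1), illustrative value
beta_1 = math.log(rate_ratio_per_visit)

for visit in range(1, 7):
    cumulative_rr = math.exp(beta_1 * visit)   # equals 0.90 ** visit
    reduction = (1 - cumulative_rr) * 100
    print(f"visit {visit}: rate ratio vs time 0 = {cumulative_rr:.3f} "
          f"({reduction:.1f}% lower)")
```

A 10% reduction per visit is therefore not a 40% reduction after four visits; under a linear time effect it is about 34%.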

Allowing the time effect to vary

In practice, treatment-associated trajectories are rarely perfectly linear. You may see:

rapid early improvement
delayed onset
plateauing
or rebound

These patterns can be explored by modelling time more flexibly.

Time as a categorical variable

Code
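A minimal sketch of the categorical-time specification, in assumed notation: \(\mu_{ij}\) is the expected count for patient \(i\) at visit \(j\), \(x_i\) the baseline seizure frequency, \(b_i\) a subject-specific random effect, \(T_{ij}\) the observation duration, and visit 1 is taken as the reference level. These choices are for illustration only.

\[ \log(\mu_{ij}) = \beta_0 + \sum_{k=2}^{K} \beta_k \, \mathbf{1}(t_{ij} = k) + \gamma x_i + b_i + \log(T_{ij}) \]

Under this coding, \(\exp(\beta_k)\) is the rate ratio at visit \(k\) relative to visit 1. No shape is imposed on the trajectory, so early improvement, plateauing, or rebound each appear as a different pattern across the \(\beta_k\).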

Power, Estimands, and Endpoint Choice in Seizure Trials

Fei Zuo — Sun, 04 Jan 2026 05:00:00 GMT

In many epilepsy clinical trials evaluating seizure frequency, the primary analysis is based on subject-level outcome measures such as:

Percent change from baseline (PCFB), or
Responder status derived from achieving a ≥50% reduction in PCFB.

What is an estimand

An estimand precisely defines the treatment effect a clinical trial seeks to estimate. According to ICH E9(R1) (2019), an estimand includes:

The target population
The treatment condition
The outcome variable
The handling of intercurrent events
The population-level summary measure (e.g., mean difference, rate ratio)

It clarifies the scientific question before analysis begins.

Correspondingly, the sample size calculation is often derived from detecting:

A difference in mean PCFB under normal-theory assumptions, or
A shift in the distribution of PCFB using a rank-based nonparametric test (e.g., the Wilcoxon rank-sum test).
- The latter is frequently described as powering for a “median difference,” although the Wilcoxon test does not, in general, test equality of medians; this distinction is revisited below.
A difference in responder proportions (using binomial assumptions).

These approaches look standard, but they are conceptually disconnected from the estimand the trial is supposed to target.

What is the underlying outcome?

In epilepsy trials, what we observe are seizure counts reported over specified time intervals:

\[ Y_i \in \{0, 1, 2, \dots\} \]

Seizures are:

Nonnegative integers
Heterogeneous across patients
Overdispersed
Often approximately negative binomial

In a parallel-group randomized trial, the natural causal estimand is defined on the count scale, typically as a rate ratio representing a multiplicative change in seizure frequency.

Estimand (ICH E9(R1))

In ICH E9(R1) terms, this specifies the treatment effect through the outcome (post-treatment seizure counts) and the summary measure (a rate ratio), producing a population-level causal contrast on the count scale.

This answers:

How does seizure frequency compare between treatment and control?

Modeling treatment as acting on the event rate reflects both the biological mechanism and the stochastic structure of the data.

Stochastic structure

“Stochastic” refers to the probabilistic pattern of variability in the data.
Seizure counts are discrete events whose variability depends on the underlying event rate and differs across patients. Count models respect this mean–variance relationship; transformed endpoints such as percent change or responder status do not.

So what happens when we change the outcome scale?

Instead of modeling counts directly, many trials analyze transformations of these counts.

The moment we change the outcome scale, we change the estimand even if the raw data are the same.

Why a rate ratio?

Seizures are recurrent events occurring over time. Treatment plausibly acts by modifying the underlying event rate, leading to proportional (multiplicative) changes rather than fixed absolute shifts.

Clinically, seizure studies are already described this way (e.g., a 30% reduction, a 50% reduction). A rate ratio formalizes that proportional interpretation directly on the count scale.

Powering on percent change from baseline (PCFB)

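PCFB is defined as:

\[ \text{PCFB}_i = 100 \times \frac{Y_i^{\text{post}} - Y_i^{\text{base}}}{Y_i^{\text{base}}} \]

where \(Y_i^{\text{base}}\) and \(Y_i^{\text{post}}\) are patient \(i\)'s seizure frequencies in the baseline and post-baseline periods.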

This is:

A nonlinear transformation
A ratio of two random variables
Baseline-dependent
Highly heteroskedastic
Asymmetric

Heteroskedasticity (linear regression)

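The classical linear model assumes homoskedasticity (i.e., constant error variance): \(\operatorname{Var}(\varepsilon_i \mid X_i) = \sigma^2\).

Heteroskedasticity means the variance depends on \(X_i\): \(\operatorname{Var}(\varepsilon_i \mid X_i) = \sigma^2(X_i)\).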

Estimated coefficients remain unbiased (if the mean model is correct), but standard errors can be wrong, leading to invalid confidence intervals and p-values.

PCFB and heteroskedasticity

PCFB is a ratio with baseline in the denominator.
When baseline is small, small absolute changes produce large percent changes.

The resulting variability depends on baseline level.
Power calculations that assume a constant variance on the percent-change scale therefore misrepresent the true variability of the endpoint.

Yet sample size calculations typically assume:

Approximate normality (or large-sample symmetry)
Homoskedasticity
Additive effects on the percent-change scale
A well-defined mean percent change (i.e., not dominated by baseline-dependent instability, extreme values, or denominator sensitivity)

None of these assumptions are satisfied for the PCFB endpoint.
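The baseline dependence of PCFB variability is easy to see in a small simulation. The sketch below assumes no true change at all: baseline and follow-up counts are drawn from the same Poisson rate. Poisson is a simplification relative to the overdispersed counts discussed above, and patients with a zero baseline count are excluded because PCFB is undefined for them. The two rates (5 and 50) are arbitrary illustrative values.

```python
import math
import random
import statistics

def rpois(rng, lam):
    """Poisson draw via Knuth's multiplication method (adequate for modest lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def pcfb_sd(rng, rate, n=5000):
    """SD of PCFB when baseline and follow-up share the same true rate."""
    values = []
    while len(values) < n:
        base, post = rpois(rng, rate), rpois(rng, rate)
        if base > 0:                      # PCFB is undefined at zero baseline
            values.append(100 * (post - base) / base)
    return statistics.stdev(values)

rng = random.Random(2026)
sd_low = pcfb_sd(rng, rate=5)             # low-baseline patients
sd_high = pcfb_sd(rng, rate=50)           # high-baseline patients
print(f"SD of PCFB at rate 5:  {sd_low:.1f}%")
print(f"SD of PCFB at rate 50: {sd_high:.1f}%")
```

Even with identical underlying behaviour, the spread of PCFB is several times larger in the low-baseline group, so a single assumed standard deviation cannot describe both.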

The estimand under PCFB

When PCFB is analyzed, the estimand becomes:

\[ E[\text{PCFB}_i \mid A_i = 1] - E[\text{PCFB}_i \mid A_i = 0] \]

Treatment indicator

\(A_i = 1\) denotes assignment to treatment and \(A_i = 0\) denotes control.

This answers:

Did average percent change differ between groups?

That is a different question from:

By what proportion did treatment change seizure frequency relative to control?

The former targets the mean of a ratio-based transformation. The latter targets the seizure process itself.

Once PCFB is chosen as the outcome scale, the trial no longer estimates a treatment effect on seizure counts or seizure frequency. It estimates a difference in average percent change, a quantity that:

Depends heavily on baseline values
Is unstable at low baseline values
Embeds regression to the mean
Distorts variability due to asymmetric bounds
Discards information on the absolute scale by equating biologically distinct count reductions

For example:

Going from 200 → 100 seizures
Going from 10 → 5 seizures

Both yield:

\[ \text{PCFB} = -50\% \]

But the clinical burden is clearly different.

The PCFB transformation collapses distinct biological realities into identical summary values.

The asymmetry problem

Percent decrease is bounded:

\[ -100\% \le \text{PCFB} < \infty \]

You cannot reduce seizures by more than 100%, but increases are unbounded. The scale is inherently asymmetric:

50 → 5 = −90%
5 → 50 = +900%
100 → 5 = −95%
5 → 100 = +1900%

This means:

Large increases can dominate averages.
Variance behaves very differently in the increasing direction.
The scale does not reflect biological symmetry.
The PCFB value depends heavily on baseline level but baseline can no longer be meaningfully adjusted for in a regression model. This will be demonstrated in a subsequent simulation post.

The asymmetry arises from the transformation, not from seizure biology.
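The contrast with a multiplicative scale can be checked directly. This sketch computes PCFB and the log rate ratio for the four pairs listed above; the helper names are hypothetical.

```python
import math

def pcfb(baseline, follow_up):
    """Percent change from baseline."""
    return 100 * (follow_up - baseline) / baseline

def log_rate_ratio(baseline, follow_up):
    """Log of the follow-up to baseline ratio."""
    return math.log(follow_up / baseline)

for base, post in [(50, 5), (5, 50), (100, 5), (5, 100)]:
    print(f"{base:>3} -> {post:<3}  PCFB = {pcfb(base, post):+7.0f}%   "
          f"log rate ratio = {log_rate_ratio(base, post):+.3f}")
```

On the PCFB scale a tenfold decrease and a tenfold increase are −90% and +900%. On the log rate ratio scale they are equal in magnitude and opposite in sign.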

Converting counts to PCFB distorts the variance structure and discards information contained in the original data.

Subject-level percent change ≠ proportional rate reduction

A treatment contrast defined on percent change from baseline is not equivalent to a proportional change in the group-level event rate.

A rate ratio compares mean seizure frequencies between groups. PCFB averages subject-specific ratios, which weight patients differently depending on baseline and do not, in general, recover the proportional (multiplicative) effect on the underlying event rate.
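A deliberately extreme toy example shows how far the two quantities can diverge. The four patients below are hypothetical, with identical baselines in both arms.

```python
# (baseline count, follow-up count) per patient; hypothetical values
control = [(200, 200), (10, 10)]
active = [(200, 100), (10, 15)]

def mean_pcfb(arm):
    """Average of subject-level percent changes."""
    return sum(100 * (post - base) / base for base, post in arm) / len(arm)

def mean_follow_up(arm):
    """Average follow-up count in the arm."""
    return sum(post for _, post in arm) / len(arm)

pcfb_difference = mean_pcfb(active) - mean_pcfb(control)
rate_ratio = mean_follow_up(active) / mean_follow_up(control)

print(f"difference in mean PCFB: {pcfb_difference:.1f} percentage points")
print(f"rate ratio (active / control): {rate_ratio:.3f}")
```

The difference in mean PCFB is zero because a −50% and a +50% cancel, yet the active arm has roughly 45% fewer seizures in total. The patient with 10 baseline seizures carries the same weight as the patient with 200.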

Does analyzing the median rescue PCFB?

A common response is:

“Then analyze the median percent change instead.”

This does not resolve the fundamental issues.

Even when focusing on the median, the estimand remains defined on the PCFB scale. The question simply becomes:

Did the median percent change differ between groups?

This is still:

A nonlinear ratio
Baseline-dependent
Asymmetric
Heteroskedastic

Switching from mean to median does not repair the scale distortion.

The estimand remains disconnected from the underlying seizure count process.

Wilcoxon does not test equality of medians

In practice, “median PCFB analysis” often uses the Wilcoxon rank-sum test.

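But the Wilcoxon test does not, in general, test equality of medians. It is sensitive to a shift in distributions; more precisely, it evaluates the probability that a randomly selected patient from the treatment group has a better outcome (here, a lower PCFB) than a randomly selected patient from the control group:

\[ P(Z_1 < Z_0) \]

where \(Z_1\) and \(Z_0\) denote the PCFB of a randomly selected treated and control patient. If the two distributions are identical and continuous, this probability is 1/2.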

Only under strong conditions (e.g., identical shapes with pure location shift) does this correspond to a median difference.

Those conditions are unlikely to hold for PCFB because:

PCFB is skewed
It is bounded below at −100%
It is unbounded above
Its variance depends on baseline

So describing a Wilcoxon-powered trial as being powered for a “median difference” is generally incorrect.

Hodges–Lehmann is not a median difference either

The Hodges–Lehmann estimator is often described as estimating a “median difference.”

Notation

\(Z_{1i}\), \(Z_{0j}\): individual outcomes for subjects in the treatment and control groups.

\(Z_1\), \(Z_0\): the corresponding group-level outcome distributions (random variables).

The difference in medians subtracts the 50th percentile of each group’s distribution, whereas the Hodges–Lehmann estimator takes the median of all cross-group pairwise differences.

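In fact, the Hodges–Lehmann estimator estimates the median of all pairwise differences:

\[ \hat{\Delta}_{HL} = \operatorname{median}_{i,j} \left( Z_{1i} - Z_{0j} \right) \]

That is not the same as the difference in marginal medians:

\[ \operatorname{median}(Z_1) - \operatorname{median}(Z_0) \]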

These are only equal under special symmetry conditions.

Again, PCFB generally does not satisfy those conditions.
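A small numeric check makes the distinction tangible. The PCFB values below are hypothetical and deliberately skewed; the sketch computes the difference in medians, the Hodges–Lehmann estimate, and the proportion of treatment-control pairs in which the treated patient has the lower PCFB (the quantity the Wilcoxon test is built on).

```python
import statistics

treatment = [-80, -60, -50, 100, 200]     # hypothetical PCFB values (%)
control = [-30, -20, -10, 0, 50]

diff_in_medians = statistics.median(treatment) - statistics.median(control)

pairwise = [z1 - z0 for z1 in treatment for z0 in control]
hodges_lehmann = statistics.median(pairwise)

# Share of cross-group pairs where the treated patient did better (ties count half)
p_better = sum((d < 0) + 0.5 * (d == 0) for d in pairwise) / len(pairwise)

print(f"difference in medians:   {diff_in_medians}")
print(f"Hodges-Lehmann estimate: {hodges_lehmann}")
print(f"P(Z1 < Z0) estimate:     {p_better:.2f}")
```

The three summaries answer three different questions, and on skewed data like these the first two do not agree.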

What about ranked ANCOVA?

Ranked ANCOVA replaces PCFB with its overall ranks and fits a standard linear model.

It tests whether treatment shifts the adjusted rank distribution of PCFB. This may appear attractive because it allows baseline adjustment and avoids normality assumptions, but it does not fix the core issue.

The resulting estimand targets a shift in the ranked percent changes, not an effect on seizure counts. It answers:

Does treatment change the mean ranks of percent changes?

Ranked ANCOVA

Often described as “nonparametric,” ranked ANCOVA avoids distributional assumptions about the outcome but still fits a linear ANCOVA model after ranking.

Thus it retains the usual linear model structure, including additive covariate effects and common slopes unless interactions are included. It is distribution-free for the outcome, but not model-free.

In practice, ranked ANCOVA provides only a p-value for a treatment effect on the rank scale. It does not directly estimate a clinically interpretable treatment effect (e.g., a mean or median difference in PCFB), nor does it provide a corresponding measure of precision.

What often happens in epilepsy trials is that the p-value from ranked ANCOVA is reported alongside descriptive median PCFB values for each treatment group. This implicitly suggests that the test pertains to a difference in medians, which is generally not true. The p-value corresponds to a shift in adjusted ranks, not to a formally estimated median difference.

Again, ranking does not repair the distortion introduced by PCFB itself. It simply applies a different statistical test to the same transformed endpoint.

What about quantile regression?

If the goal is to estimate a median difference in percent change, the appropriate tool is quantile regression.

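For the median, the estimand is:

\[ Q_{0.5}(Z \mid A = 1) - Q_{0.5}(Z \mid A = 0) \]

where \(Z\) denotes percent change from baseline, \(A\) the treatment indicator, and \(Q_{0.5}\) the median.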

This answers:

What is the difference in median percent change between groups?

This directly estimates a difference in medians with confidence intervals, unlike Wilcoxon, Hodges–Lehmann, or ranked ANCOVA.

But the estimand remains a contrast in percent change, not a treatment effect on seizure frequency.

The statistical method can be correct, while the estimand is misaligned with the data-generating mechanism.

Powering on responder status

Responder status is typically derived as:

\[ R_i = \mathbf{1}\{\text{PCFB}_i \le -50\%\} \]

The estimand under responder analysis

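When responder status is analyzed, the estimand becomes:

\[ P(R_i = 1 \mid A_i = 1) - P(R_i = 1 \mid A_i = 0) \]

where \(R_i = 1\) if patient \(i\) is a responder and \(A_i\) is the treatment indicator.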

This answers:

Did the probability of achieving at least a 50% reduction differ between groups?

The responder estimand targets a tail probability of a thresholded transformation of a ratio of counts — several layers removed from the underlying seizure process.

Because responder status is derived from PCFB, it inherits the structural distortions of that transformation and further reduces the data to a binary indicator.

This approach implicitly assumes:

The outcome is inherently binary
Treatment is acting directly on the probability of response
There is no underlying count structure beneath the response indicator

Under this view, “responder status” is treated as if it were the primary stochastic outcome following a Bernoulli data-generating process. In reality, it is not generated by a Bernoulli mechanism; it is constructed from noisy, overdispersed count data.
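The construction is mechanical, which a few lines make explicit. The patients below are hypothetical, and the helper names are assumptions for illustration.

```python
def pcfb(baseline, follow_up):
    """Percent change from baseline."""
    return 100 * (follow_up - baseline) / baseline

def is_responder(baseline, follow_up, threshold=-50.0):
    """Responder = at least a 50% reduction from baseline."""
    return pcfb(baseline, follow_up) <= threshold

# (baseline count, follow-up count); hypothetical patients
patients = [(100, 51), (100, 50), (100, 0), (4, 2), (4, 3)]
for base, post in patients:
    print(f"{base:>3} -> {post:<3} PCFB = {pcfb(base, post):+6.1f}%  "
          f"responder = {is_responder(base, post)}")
```

One seizure separates a responder from a non-responder at 100 → 51 versus 100 → 50. Seizure freedom (100 → 0) and a change of two seizures (4 → 2) are recorded as the same outcome.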

Consequences for power and sample size

Power may be overstated under a responder analysis because the approach:

Discards magnitude beyond the 50% threshold
Changes discontinuously near the cutoff
Is highly sensitive to baseline seizure distribution
Ignores count overdispersion and heterogeneity in variance calculations
Ignores follow-up duration or exposure time in variance calculations

Why this matters

When the outcome scale changes, the estimand changes. And when the estimand changes, the scientific question changes.

Analysis	Estimand	Scientific Question
Count model	Rate ratio	By what proportion does treatment change seizure frequency relative to control?
PCFB	Difference in percent change (mean or median)	Did percent change differ between treatment and control?
Responder	Difference in response probability	Did the probability of achieving at least a 50% reduction differ between treatment and control?

Why proportional (multiplicative) effects align with seizure-generating processes

Seizures are recurrent events generated by an underlying patient-specific event rate.
If treatment reduces that rate by, say, 30% compared to placebo, this implies:

\[ E[Y(1)] = 0.70 \, E[Y(0)] \]

This proportional (multiplicative) structure:

• Reflects how risk-reduction therapies typically operate
• Preserves the count scale
• Avoids baseline-dependent distortion introduced by dividing by individual baseline values

In contrast, percent change and responder status are transformations of observed counts. They introduce nonlinear distortion, heteroskedasticity, and arbitrary thresholds that are not inherent to seizure biology.

Powering under PCFB or responder frameworks is disconnected from the true data-generating process and has the following implications:

Type I error can drift because variance is mischaracterized
Power is misestimated
Larger sample sizes are often required
The trial is designed to detect the wrong quantity

You may believe that you are powering the trial to detect a seizure reduction, but you are actually powering it to detect:

A mean difference or a distribution shift in a transformed ratio, or
A proportion difference in a threshold-crossing probability

The design, the estimand, and the biology are no longer aligned. Once that misalignment is built into the sample size calculation, it propagates through the entire trial.

What should be done instead?

For seizure endpoints:

Specify the estimand on the count scale using an appropriate treatment contrast (e.g., a rate ratio) and convert the result to a clinically interpretable quantity (e.g., a rate ratio of 0.7 translates to a 30% reduction in seizure frequency for active compared to placebo)
Specify a realistic data-generating process (e.g., negative binomial with overdispersion)
Simulate trials under that model
Fit the intended analysis model
Estimate empirical power

This ensures:

Estimand coherence
Proper variance representation
Honest operating characteristics
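The steps above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not a recommended design: counts are negative binomial (drawn as a gamma-Poisson mixture) with a control mean of 10, shape 2, and 100 patients per arm, all arbitrary values. To stay within the standard library, the intended NB regression is replaced by a simpler stand-in, a Wald test on the log ratio of arm means with a delta-method standard error.

```python
import math
import random
import statistics

def rpois(rng, lam):
    """Poisson draw via Knuth's multiplication method (adequate for modest lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def rnegbin(rng, mean, shape):
    """NB count as a gamma-Poisson mixture: variance = mean + mean**2 / shape."""
    return rpois(rng, mean * rng.gammavariate(shape, 1.0 / shape))

def trial_rejects(rng, n_per_arm, control_mean, rate_ratio, shape, z_crit=1.96):
    """Simulate one two-arm trial and test H0: rate ratio = 1 (stand-in Wald test)."""
    control = [rnegbin(rng, control_mean, shape) for _ in range(n_per_arm)]
    active = [rnegbin(rng, control_mean * rate_ratio, shape) for _ in range(n_per_arm)]
    m0, m1 = statistics.mean(control), statistics.mean(active)
    # Delta-method variance of log(arm mean): s^2 / (n * mean^2)
    var_log_rr = (statistics.variance(control) / (n_per_arm * m0 ** 2)
                  + statistics.variance(active) / (n_per_arm * m1 ** 2))
    z = math.log(m1 / m0) / math.sqrt(var_log_rr)
    return abs(z) > z_crit

def rejection_rate(rate_ratio, n_sims=300, seed=1):
    """Proportion of simulated trials rejecting H0 at the two-sided 5% level."""
    rng = random.Random(seed)
    hits = sum(trial_rejects(rng, 100, 10.0, rate_ratio, 2.0) for _ in range(n_sims))
    return hits / n_sims

power = rejection_rate(0.7)               # true 30% reduction
type1 = rejection_rate(1.0, seed=2)       # no treatment effect
print(f"empirical power at rate ratio 0.7: {power:.2f}")
print(f"empirical type I error:            {type1:.2f}")
```

Replacing `trial_rejects` with the analysis model actually planned for the trial gives operating characteristics for that model under the same data-generating process.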

What comes next

The remaining issue is operating characteristics — measures of how a design performs across repeated trials under a given data-generating process.

Analytical vs Empirical Power

Analytical power is calculated from a closed-form formula under assumed model conditions (e.g., normality, binomial variance).

Empirical power is estimated by simulation: generate many datasets under a specified data-generating process, analyze each one, and compute the proportion rejecting \(H_0\).

When model assumptions are incorrect, analytical power may be miscalibrated.
Empirical power reflects operating characteristics under the assumed data-generating process.

In a subsequent post, we will compare the analysis of seizure counts or frequency, PCFB, and responder endpoints under realistic seizure-generating processes, to examine:

Type I error
Empirical power
Bias
Mean squared error (MSE)
Confidence interval coverage

Takeaway

Seizures are counts. Replacing counts with percent change or responder status changes the estimand and therefore the scientific question.

PCFB and responder analyses do not estimate a treatment effect on seizure frequency. They estimate contrasts in transformed or thresholded quantities whose statistical properties and interpretation are disconnected from the underlying count process.

When the estimand is misaligned with the seizure-generating process, power is computed for the wrong quantity and the trial answers a different question than intended.

Rethinking Placebo Response

Fei Zuo — Sun, 14 Dec 2025 05:00:00 GMT

In randomized clinical trials, especially in neurology and psychiatry, we often hear phrases like:

“The placebo response was high.”
“This indication has a large placebo effect.”
“The drug failed because of placebo response.”

These statements are often misleading. In this post, I’ll explain why placebo response is frequently misunderstood, why it cannot be interpreted as evidence of a causal improvement, and why misunderstanding it leads to flawed trial design and reasoning. The discussion here focuses on superiority trials.

What is placebo response?

Placebo response refers to the overall change observed in participants assigned to placebo between baseline and follow-up. This term does not refer solely to a psychobiological “placebo effect.” In methodological literature, placebo response encompasses all sources of change occurring in the placebo arm, including natural disease course and statistical artifacts, not just expectation-driven effects.

Placebo response is therefore a composite quantity.

The decomposition problem

In practice, observed change in the placebo arm reflects multiple distinct components:

Natural history (disease fluctuation, spontaneous remission, cyclical patterns)
Regression to the mean (especially when enrollment requires elevated baseline severity)
Measurement error (random variation in outcome assessment)
Study effect (being observed, monitored, structured follow-up, adherence reinforcement, behavioral change)
True placebo effect (expectation-driven psychobiologic response attributable to belief in treatment)

All of these are bundled together under the label “placebo response.”

And here is the problem:

These components cannot be separated in a standard parallel-group randomized trial. The design does not allow identification of which portion of within-arm change is attributable to which mechanism.
Placebo response is a within-arm change-from-baseline quantity. It is not a causal effect.

Within-arm change is not a causal estimand

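When we calculate placebo response, we are computing:

\[ \bar{Y}_{\text{placebo}}^{\text{follow-up}} - \bar{Y}_{\text{placebo}}^{\text{baseline}} \]

That is a descriptive statistic.

But the treatment effect in a randomized trial is:

\[ E[Y \mid \text{drug}] - E[Y \mid \text{placebo}] \]

That is a between-arm contrast.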

Randomization protects and justifies the treatment contrast, not the within-arm summaries.

So when people say: “Patients improved on placebo.”

The statement sounds causal but it is ambiguous. Improvement always implies a comparison.

The question is:

Improved relative to what?

Relative to baseline?

That is simply change over time. It bundles together everything that happens between baseline and follow-up: regression to the mean, measurement error, natural fluctuation in disease severity, study participation effects, and any expectation-driven placebo effects. It remains a before-after comparison within a single arm and does not isolate any specific component. More importantly, without a “no-placebo” counterfactual (i.e., a control arm in which participants receive no placebo intervention), there is no basis for concluding that placebo itself caused the observed change, because every other component can produce change over time even if no placebo were given.

Relative to no study participation?

Enrollment in a clinical trial changes behavior. Patients are monitored more closely, seen at scheduled visits, encouraged to adhere to treatment, and given structured clinical attention. These factors can influence outcomes even without active therapy (a so-called study effect). If “improvement on placebo” refers to improvement relative to what would have happened outside the trial entirely, then it incorporates these participation effects. That is a different counterfactual from baseline change and it is not directly observed in a standard parallel-group randomized trial.

Relative to the natural disease trajectory?

Many conditions fluctuate over time. Some improve spontaneously, some remit partially, and some follow cyclical patterns. If “improvement on placebo” refers to improvement relative to what would have happened under the untreated natural disease trajectory without study participation, then it invokes a different counterfactual altogether. That counterfactual is not identifiable in a standard parallel-group randomized trial.

Relative to drug?

This is the only comparison randomization is designed to answer. The causal estimand in a superiority trial is:

\[ E[Y(1)] - E[Y(0)] \]

What is a causal effect?

A causal effect is defined as the contrast between two potential outcomes for the same individual:

\[ Y_i(1) - Y_i(0) \]

where:

\(Y_i(1)\) is the outcome under treatment
\(Y_i(0)\) is the outcome under control

Because both potential outcomes cannot be observed for the same individual, individual causal effects cannot be directly measured. In a randomized trial, randomization allows unbiased estimation of the average causal effect by comparing outcomes between groups:

\[ E[Y(1)] - E[Y(0)] = E[Y \mid A = 1] - E[Y \mid A = 0] \]

where \(A\) indicates the randomized arm.

Apparent improvement within the placebo arm alone does not identify a causal effect, because it does not compare observed outcomes to the unobserved counterfactual.

Without clarifying the counterfactual, “improvement on placebo” collapses multiple distinct estimands into a single phrase, and only one of them (drug vs placebo) is relevant to causal treatment effect estimation.

What randomization actually protects

Randomization ensures that everything happening in the placebo arm is also happening in the treatment arm in expectation.

That includes:

Regression to the mean
Natural history
Measurement noise
Study participation effects
Expectation effects (if blinding holds)

Because the estimand is the difference between groups, all shared components cancel in expectation in the between-group contrast.

This is the key insight:

Placebo response is shared variability.
Shared variability does not bias the treatment effect.
Randomization renders placebo response irrelevant to the treatment contrast.

The logical mistake

The reasoning often goes like this:

The placebo arm “improved” a lot.
Therefore, placebo is “competing” with the drug.
Therefore, we must reduce placebo response.

But this logic confuses within-arm change with the between-arm contrast that defines treatment effect.

The treatment effect is the difference between randomized groups, not the magnitude of change within one arm.

To see this clearly, consider a simple simulation.

A simulation demonstration

We simulate a fluctuating disease across many trials where:

Patients enroll during a symptomatic period (as is common in practice due to eligibility criteria).
Outcomes are noisy.
There is regression to the mean.
The drug has a modest true effect of −2 units.
The placebo has no causal effect, but outcomes naturally fluctuate.

Regression to the mean

Because enrollment is based on a high observed baseline, some patients qualify as a result of random positive noise fluctuation rather than consistently high underlying severity.

When re-measured at follow-up, that noise component does not systematically repeat. As a result, the group will tend to show apparent improvement on average even in the absence of any active treatment.

This is precisely why change from baseline is not informative about causal effect. The observed improvement reflects statistical artifact and natural variability, not evidence that placebo caused benefit.

Code
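A minimal sketch of such a simulation, using only the Python standard library. The setup is an assumption for illustration: each patient has a stable underlying severity, baseline and follow-up are noisy measurements of it, enrollment requires an observed baseline of at least 25, and the drug shifts the follow-up outcome by −2 units. The means, SDs, threshold, and trial size are arbitrary choices.

```python
import random
import statistics

def simulate_trial(rng, n_per_arm=100, drug_effect=-2.0, threshold=25.0):
    """One trial: enroll on a high observed baseline, then randomize 1:1."""
    arms = ["placebo"] * n_per_arm + ["drug"] * n_per_arm
    rng.shuffle(arms)                                   # random allocation
    follow_up = {"placebo": [], "drug": []}
    change = {"placebo": [], "drug": []}
    for arm in arms:
        while True:                                     # screen until eligible
            severity = rng.gauss(20.0, 5.0)             # underlying severity
            baseline = severity + rng.gauss(0.0, 5.0)   # noisy baseline measurement
            if baseline >= threshold:
                break
        effect = drug_effect if arm == "drug" else 0.0  # placebo has no causal effect
        outcome = severity + effect + rng.gauss(0.0, 5.0)
        follow_up[arm].append(outcome)
        change[arm].append(outcome - baseline)
    contrast = statistics.fmean(follow_up["drug"]) - statistics.fmean(follow_up["placebo"])
    return statistics.fmean(change["placebo"]), statistics.fmean(change["drug"]), contrast

rng = random.Random(2025)
results = [simulate_trial(rng) for _ in range(200)]
mean_placebo_change = statistics.fmean([r[0] for r in results])
mean_drug_change = statistics.fmean([r[1] for r in results])
mean_contrast = statistics.fmean([r[2] for r in results])

print(f"mean within-arm change, placebo: {mean_placebo_change:.2f}")
print(f"mean within-arm change, drug:    {mean_drug_change:.2f}")
print(f"mean between-arm contrast:       {mean_contrast:.2f}")
```

The placebo arm shows a sizeable apparent improvement despite having no causal effect, and the drug arm shows the same improvement plus the drug effect. The between-arm contrast recovers the −2 units because the shared component appears in both arms.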

Why Block Size Should Not Appear in the Protocol or SAP

Fei Zuo — Sun, 30 Nov 2025 05:00:00 GMT

In randomized clinical trials, we talk a lot about blinding. But an equally important and subtly different concept is allocation concealment.

They are not the same thing.

Blinding protects against bias after treatment is assigned (e.g., preventing biased outcome assessment).
Allocation concealment protects against bias at the moment of enrollment before treatment is assigned (e.g., preventing foreknowledge of upcoming treatment assignment).

Because allocation concealment depends on unpredictability at the time of enrollment, certain operational details such as block size can weaken that protection if widely disclosed.

Regulatory and Reporting Standards Emphasize Concealment, Not Disclosure of Block Size

Major regulatory and reporting guidance documents are clear about the importance of allocation concealment. Notably, they do not require disclosure of block size in protocols or SAPs.

ICH E9 (Statistical Principles for Clinical Trials)

ICH E9 emphasizes that randomization protects against selection bias and provides a valid basis for statistical inference. This protection depends on the allocation sequence remaining unpredictable at the time of enrollment.

Unpredictability is the core principle. If the block size is known to investigators, predictability can arise — directly conflicting with this requirement.

ICH E6 (Good Clinical Practice)

ICH E6 emphasizes safeguarding the randomization process:

“The investigator should follow the trial’s randomization procedures, if any, and should ensure that the code is broken only in accordance with the protocol.”

Maintaining control of the randomization schedule, including operational details like block size, is part of preserving its integrity.

CONSORT 2010 Statement

CONSORT distinguishes clearly between:

Sequence generation
Allocation concealment mechanism

Item 9 of the CONSORT 2010 checklist requires:

“Mechanism used to implement the random allocation sequence (such as sequentially numbered containers), describing any steps taken to conceal the sequence until interventions were assigned.”

CONSORT focuses on how concealment was ensured, not on revealing details that would undermine it.

The Purpose of Block Randomization

Block randomization is used to maintain approximate treatment balance throughout enrollment.

For example, in a 1:1 trial with block size 4, each block contains:

2 assignments to Treatment A
2 assignments to Treatment B

The order of assignments within each block is randomly permuted. Since each block contains equal numbers of assignments for each treatment, treatment counts are exactly balanced at the end of every completed block and can differ by at most half the block size in between, reducing the risk of large imbalances if the trial stops early.

The Problem: Predictability

If investigators know the block size, allocation can become predictable.

Imagine:

1:1 randomization
Block size = 4
Three patients in a block have already been assigned: A, B, A

What is the fourth assignment?

It must be B.

At that point, allocation is no longer random in practice. It is deducible.
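To see how often this happens, the sketch below enumerates every possible 1:1 block of a given size and counts the assignments that are fully determined by the ones before them. It assumes an observer who knows the block size, the position within the block, and all earlier assignments (for example, at a single open-label site). The function names `forced_positions` and `forced_fraction` are illustrative.

```python
from fractions import Fraction
from itertools import permutations

def forced_positions(block):
    """Count assignments in a block that are certain given the earlier ones."""
    remaining = {arm: block.count(arm) for arm in set(block)}
    forced = 0
    for arm in block:
        # The next assignment is forced when only one arm has slots left.
        open_arms = [a for a, n in remaining.items() if n > 0]
        if len(open_arms) == 1:
            forced += 1
        remaining[arm] -= 1
    return forced

def forced_fraction(block_size):
    """Share of deducible assignments across all distinct 1:1 blocks of this size."""
    half = block_size // 2
    blocks = set(permutations("A" * half + "B" * half))
    total_forced = sum(forced_positions(b) for b in blocks)
    return Fraction(total_forced, len(blocks) * block_size)

for size in (2, 4, 6):
    print(size, forced_fraction(size))
```

Under these assumptions, one half of assignments are deducible with block size 2, one third with block size 4, and one quarter with block size 6. In the example above (A, B, A), only the fourth assignment is forced. Had the block started A, A, both remaining assignments would have been.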

The mere ability to anticipate the next assignment can influence enrollment decisions in small but systematic ways:

Delaying enrollment of a “difficult” patient
Accelerating enrollment of a “good prognosis” patient
Selectively enrolling borderline-eligible patients

This introduces selection bias, even in a formally randomized trial. That is a failure of allocation concealment.

Allocation Concealment ≠ Blinding

Even in an open-label trial (no blinding), allocation concealment must still be protected.

Blinding protects:

Outcome assessment
Patient behavior
Investigator expectations

Allocation concealment protects:

The integrity of who gets assigned what

You can have:

A blinded trial with broken allocation concealment
An open-label trial with perfect allocation concealment

They are conceptually distinct safeguards.

Why Block Size Does Not Belong in the Protocol or SAP

The protocol and SAP are often widely distributed:

Investigators
Coordinators
Sponsors
Regulators
Data monitoring committees

If the block size is written there, it is no longer protected.

Best practice is:

The protocol states that blocked randomization will be used.
The SAP describes the randomization method in general terms.
The actual block size (and any random variation in block size) is known only to:
- The unblinded statistician
- The randomization unit/vendor/system

This protects allocation concealment while still maintaining transparency about the design.

Fixed Block Sizes

A fixed block size makes assignments easier to predict, because anyone who learns or infers the size can track the position within each block.

A better approach is:

Randomly varying block sizes (e.g., 2, 4, and 6)

This dramatically reduces predictability. But even then, disclosing the possible block sizes weakens protection.
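A minimal sketch of a schedule with randomly varying block sizes, written in Python for illustration. The function name `varying_block_schedule`, the candidate sizes (2, 4, 6), and the seed are assumptions for this example. In practice the candidate sizes and the seed would be held by the unblinded statistician or randomization system.

```python
import random

def varying_block_schedule(n_subjects, block_sizes=(2, 4, 6), arms=("A", "B"), seed=None):
    """Build an equal-allocation schedule from permuted blocks of randomly drawn sizes."""
    if any(size % len(arms) != 0 for size in block_sizes):
        raise ValueError("every block size must be a multiple of the number of arms")
    rng = random.Random(seed)
    schedule, sizes_used = [], []
    # Whole blocks are appended, so the schedule may run slightly past n_subjects.
    while len(schedule) < n_subjects:
        size = rng.choice(block_sizes)
        block = [arm for arm in arms for _ in range(size // len(arms))]
        rng.shuffle(block)
        schedule.extend(block)
        sizes_used.append(size)
    return schedule, sizes_used

# Hypothetical example: enough blocks to cover 24 subjects.
schedule, sizes_used = varying_block_schedule(n_subjects=24, seed=2026)
print("block sizes:", sizes_used)
print("schedule:   ", "".join(schedule))
```

Because the block boundaries are no longer at fixed positions, an observer who sees only the assignments cannot be sure where one block ends and the next begins. Each block is still internally balanced, so the balance property is preserved.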

Randomization terminology

Simple randomization
Each subject is assigned independently (e.g., coin flip) using a chance mechanism. Balance is achieved on average, but temporary imbalances can occur.

Permuted block randomization — fixed block size
Assignments are balanced within blocks of a constant size (e.g., all blocks size 4), and the order within each block is randomly permuted.

Permuted block randomization — varying block sizes
The block size itself is randomly selected from a prespecified set (e.g., 2, 4 and 6). Each block remains internally permuted. This reduces predictability compared with fixed small blocks.

Do:
- Use permuted blocks to maintain balance during enrollment.
- Prefer varying block sizes to protect allocation concealment.
- Ensure block size is a multiple of the sum of the allocation ratio terms (e.g., a multiple of 2 for 1:1 allocation, or of 3 for 2:1 allocation).
- Restrict knowledge of block sizes to the unblinded randomization statistician/system.

Do not:
- Use small fixed block sizes when assignments could be inferred.
- Disclose block size in widely distributed protocol or SAP documents.
- Confuse allocation concealment with blinding.

The Statistical Consequences of Broken Allocation Concealment

Empirical evidence (Schulz et al., 1995; Moher et al., 1998; Savović et al., 2012) indicates that trials with inadequate or unclear allocation concealment:

Systematically overestimate treatment effects
Are more likely to report statistically significant results
Introduce selection bias at enrollment

What the Protocol Should Say

Instead of specifying block size, a protocol can say:

Participants will be randomized in a 1:1 ratio using a centralized, computer-generated permuted block randomization schedule. Details of the randomization sequence will be maintained by the unblinded statistician to ensure allocation concealment.

That is sufficient. Regulators do not require disclosure of block size in public-facing documents.

Takeaway

Block size is an operational detail. Its disclosure does not improve reproducibility, transparency, or statistical validity. But it can undermine allocation concealment.
And once allocation concealment is compromised, the trial is vulnerable to selection bias even if it remains fully blinded.
Protecting allocation concealment is protecting the integrity of randomization itself.
Block size should be known only to the unblinded randomization statistician or system.
Regulators do not require disclosure of block size in public-facing documents.

References

International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use. (1998). ICH E9: Statistical principles for clinical trials.

International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use. (2019). ICH E9(R1) addendum on estimands and sensitivity analysis in clinical trials.

International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use. (2016). ICH E6(R2): Guideline for good clinical practice.

Schulz, K. F., Altman, D. G., & Moher, D., for the CONSORT Group. (2010). CONSORT 2010 statement: Updated guidelines for reporting parallel group randomized trials. BMJ, 340, c332. https://doi.org/10.1136/bmj.c332

Schulz, K. F., Chalmers, I., Hayes, R. J., & Altman, D. G. (1995). Empirical evidence of bias: Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA, 273(5), 408–412. https://doi.org/10.1001/jama.1995.03520290060030

Schulz, K. F., & Grimes, D. A. (2002). Allocation concealment in randomised trials: Defending against deciphering. The Lancet, 359(9306), 614–618. https://doi.org/10.1016/S0140-6736(02)07750-4

Moher, D., Pham, B., Jones, A., Cook, D. J., Jadad, A. R., Moher, M., Tugwell, P., & Klassen, T. P. (1998). Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? The Lancet, 352(9128), 609–613. https://doi.org/10.1016/S0140-6736(98)01085-X

Savović, J., Jones, H. E., Altman, D. G., Harris, R. J., Jüni, P., Pildal, J., Als-Nielsen, B., Balk, E. M., Gluud, C., Gluud, L. L., Ioannidis, J. P. A., Schulz, K. F., Beynon, R., Welton, N. J., Wood, L., & Sterne, J. A. C. (2012). Influence of reported study design characteristics on intervention effect estimates from randomized, controlled trials: Combined analysis of meta-epidemiological studies. Annals of Internal Medicine, 157(6), 429–438. https://doi.org/10.7326/0003-4819-157-6-201209180-00537

Why Dichotomizing Continuous Variables Is a Mistake

Fei Zuo — Sun, 26 Oct 2025 04:00:00 GMT

Many clinical trial analyses still dichotomize continuous predictors. This feels simple. It is not harmless.

Let’s simulate a clean linear relationship and see what happens.

Simulation setup

Code