# TODO: collect the installation of all necessary packages in one place at the beginning of the tutorial
#install.packages(c("ggplot2", "ggdist", "gghalves", "pwr", "HSAUR", "MBESS"))
Linear Model (a single predictor)
We start with the simplest possible linear model: (a) a continuous outcome variable is predicted by a single dichotomous predictor. This model actually rephrases a t-test as a linear model! Then we build up increasingly complex models: (b) a single continuous predictor and (c) multiple continuous predictors (i.e., multiple regression).
Data = Model + Error
The general equation for a linear model is:
\[y = b_0 + b_1*x_1 + b_2*x_2 + ... + e \tag{1}\]
where \(y\) is the continuous outcome variable, \(b_0\) is the intercept, \(x_1\) is the first predictor variable with its associated regression weight \(b_1\), etc. Finally, \(e\) is the error term which captures all variability in the outcome that is not explained by the predictor variables in the model. It is assumed that the error term is independently and identically distributed (iid) following a normal distribution with a mean of zero and a variance of \(\sigma^2\):
\[e \mathop{\sim}\limits^{\mathrm{iid}} N(mean=0, var=\sigma^2)\]
This assumption of the distribution of the errors is important for our simulations, as we will simulate the error according to that assumption. In an actual statistical analysis, the variance of the errors, \(\sigma^2\), is estimated from the data.
“i.i.d.” means that the error of each observation is independent of the errors of all other observations. This assumption is violated, for example, when we look at romantic couples: if one partner is exceptionally distressed, the other partner presumably also has elevated stress levels. In this case, the errors of both partners are correlated and no longer independent. (See the section on modelling multilevel data for a statistical way of handling such interdependencies.)
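In R, drawing such an error term is a single call to rnorm(). Note that rnorm() expects the standard deviation, not the variance, so we take the square root of \(\sigma^2\). A minimal sketch (the values of n and sigma2 are placeholders chosen for illustration):
# draw iid normal errors for n observations
n <- 10          # number of observations (placeholder)
sigma2 <- 117    # assumed error variance (placeholder)
e <- rnorm(n, mean = 0, sd = sqrt(sigma2))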
A concrete example
Let’s fill the abstract symbols of Equation 1 with some concrete content. Assume that we want to analyze the treatment effect of an intervention that is supposed to reduce depressive symptoms. In the study, participants with a diagnosed depression are randomized to either a control group or the treatment group.
We have the following two variables in our data set:
- BDI: the continuous outcome variable, representing the severity of depressive symptoms after the treatment, assessed with the BDI-II inventory (higher values mean higher depression scores).
- treatment: our dichotomous predictor defining the random group assignment. We assign the following numbers: 0 = control group (e.g., an active control group), 1 = treatment group.
This is an opportunity to think about the actual scale of our variables. In the end, when we simulate data, we need to insert concrete numbers into our equations:
- What BDI values would we expect on average in our sample (before treatment)?
- What variability would we expect in our sample?
- What average treatment effect would we expect?
All of these values are needed to be able to simulate data, and the chosen numbers imply a certain effect size.
Get some real BDI data as a starting point
The creators of this tutorial are not experts in clinical psychology; we opportunistically selected open data sets based on their availability. Usually, we would look for meta-analyses - ideally bias-corrected - for more comprehensive evidence.
The R package HSAUR contains open data on 100 depressive patients, where 50 received treatment-as-usual (TAU) and 50 received a new treatment (“Beat the blues”; BtheB). Data was collected in a pre-post design with several follow-up measurements. For the moment, we focus on the pre-treatment baseline value (bdi.pre) and the first post-treatment value (bdi.2m). We will use this data set as a “pilot study” for our power analysis.
# the data can be found in the HSAUR package, must be installed first
#install.packages("HSAUR")
# load the data
data("BtheB", package = "HSAUR")
# get some information about the data set:
?HSAUR::BtheB
hist(BtheB$bdi.pre)
The standardized cutoffs for the BDI are:
- 0–13: minimal depression
- 14–19: mild depression
- 20–28: moderate depression
- 29–63: severe depression.
table(BtheB$bdi.pre >= 14) |> prop.table() |> round(2)
FALSE TRUE
0.2 0.8
Hence, 80% of participants had at least a “mild depression” before treatment according to these cutoffs.
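If you want the full breakdown across all four severity categories, a small illustrative extension (not part of the original analysis) is to bin the baseline scores with cut(), using the cutoffs listed above:
# classify baseline scores into the standard BDI-II severity categories
severity <- cut(BtheB$bdi.pre,
                breaks = c(0, 13, 19, 28, 63),
                labels = c("minimal", "mild", "moderate", "severe"),
                include.lowest = TRUE)
table(severity) |> prop.table() |> round(2)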
Returning to our questions from above:
- What BDI values would we expect on average in our sample (before treatment)?
# we take the pre-score here:
mean(BtheB$bdi.pre)
[1] 23.33
The average BDI score in that sample was 23, corresponding to a “moderate depression”.
- What variability would we expect in our sample?
var(BtheB$bdi.pre)
[1] 117.5163
- What average treatment effect would we expect?
# we take the 2 month follow-up measurement,
# separately for the "treatment as usual" and
# the "Beat the blues" group:
mean(BtheB$bdi.2m[BtheB$treatment == "TAU"], na.rm=TRUE)
[1] 19.46667
mean(BtheB$bdi.2m[BtheB$treatment == "BtheB"])
[1] 14.71154
Hence, the two treatments reduced BDI scores from an average of 23 to 19 (TAU) and 15 (BtheB). Based on that data set, we can conclude that a typical treatment effect is somewhere between a 4-point and an 8-point reduction of BDI scores.¹
For our purpose, we compute the average treatment effect combined for both treatments. The average post-treatment score is:
mean(BtheB$bdi.2m, na.rm=TRUE)
[1] 16.91753
So, the average reduction across both treatments is \(23 - 17 = 6\) points.
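If you prefer to let R do the subtraction and rounding, a one-line sketch of this computation is:
# average reduction across both treatment arms (pre minus post)
round(mean(BtheB$bdi.pre) - mean(BtheB$bdi.2m, na.rm = TRUE))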
Enter specific values for the model parameters
Let’s rewrite the abstract equation with the specific variable names. We first write the equation for the systematic part (without the error term). This also represents the predicted value:
\[\widehat{\text{BDI}} = b_0 + b_1*\text{treatment} \tag{2}\]
We use the notation \(\widehat{\text{BDI}}\) (with a hat) to denote the predicted BDI score.
The predicted score for the control group is then simply the intercept of the model, as the second term drops out when we enter the value “0” for the control group:
\[\widehat{\text{BDI}} = b_0 + b_1*0 = b_0\]
The predicted score for the treatment group is the value for the control group plus the regression weight:
\[\widehat{\text{BDI}} = b_0 + b_1*1\] Hence, the regression weight (a.k.a. the “slope parameter”) \(b_1\) estimates the mean difference between the two groups, which is the treatment effect.
With our knowledge from the open BDI data, we insert plausible values for the intercept \(b_0\) and the treatment effect \(b_1\). We expect a reduction of the depression score, so the treatment effect is assumed to be negative. We take the combined treatment effect of the two pilot treatments. And as power analysis is not rocket science, we generously round the values:
\[\widehat{\text{BDI}} = 23 - 6*\text{treatment}\]
Hence, the predicted value is \(23 - 6*0 = 23\) for the control group, and \(23 - 6*1 = 17\) for the treatment group.
With the current model, all persons in the control group have the same predicted value (23), as do all persons in the treatment group (17).
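Expressed in R, the systematic (deterministic) part of the model is just this short computation (a sketch using the rounded values; b0 and b1 are our own variable names):
b0 <- 23   # intercept = expected BDI score in the control group
b1 <- -6   # treatment effect = expected mean difference between groups
data.frame(treatment = c(0, 1),
           predicted_BDI = b0 + b1 * c(0, 1))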
As a final step, we add the random noise to the model, based on the variance in the pilot data:
\[\text{BDI} = 23 - 6*\text{treatment} + e; \quad e \sim N(0, var=117)\]
That’s our final equation with assumed population parameters! With that equation, we assume a certain state of reality and can sample “virtual participants”.
What is the effect size in the model?
The raw effect size is simply the treatment effect on the original BDI scale (i.e., the group difference in the outcome variable). In our case we assume that the treatment lowers the BDI score by 6 points, on average. The standardized effect size relates the raw effect size to the variability. In the two-group example, this can be expressed as Cohen’s d, which is the mean difference divided by the standard deviation (SD):
\[d = \frac{M_{treat} - M_{control}}{SD} = \frac{17 - 23}{\sqrt{117}} = -0.55\]
If you look up the formula of Cohen’s d, it typically uses the pooled SD from both groups. As we assumed that both groups have the same SD, we simply took that value.
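Translated into R, this is a quick check of the arithmetic (sqrt(117) is the assumed common SD):
# standardized effect size: mean difference divided by the assumed SD
d <- (17 - 23) / sqrt(117)
round(d, 2)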
Let’s simulate!
Once we have committed to an equation with concrete parameter values, we can simulate data for a sample of, say, \(n=50\) participants (25 per group). Simply write down the equation and add random normal noise with the rnorm function. Note that rnorm takes the standard deviation (not the variance). Finally, set a seed so that the generated random numbers are reproducible:
set.seed(0xBEEF)
# define all simulation parameters and predictor variables:
n <- 50
treatment <- c(rep(0, n/2), rep(1, n/2)) # first 25 control, then 25 treatment

# write down the equation
BDI <- 23 - 6*treatment + rnorm(n, mean=0, sd=sqrt(117))

# combine all variables in one data frame
df <- data.frame(treatment, BDI)
head(df)
treatment BDI
1 0 29.37241
2 0 25.49819
3 0 18.80709
4 0 14.52574
5 0 11.79615
6 0 10.98850
tail(df)
treatment BDI
45 1 25.871732
46 1 27.744107
47 1 20.749970
48 1 40.275788
49 1 2.981421
50 1 8.151660
Let’s plot the simulated data with raincloud plots (see here for a tutorial):
library(ggplot2)
ggplot(df, aes(x=as.factor(treatment), y=BDI)) +
  ggdist::stat_halfeye(adjust = .5, width = .3, .width = 0, justification = -.3, point_colour = NA) +
  geom_boxplot(width = .1, outlier.shape = NA) +
  gghalves::geom_half_point(side = "l", range_scale = .4, alpha = .5)
Let’s assume that we collected and analyzed this sample - what would have been the results?
summary(lm(BDI ~ treatment, data=df))
Call:
lm(formula = BDI ~ treatment, data = df)
Residuals:
Min 1Q Median 3Q Max
-24.8539 -7.9693 0.5002 7.7587 23.7909
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.975 2.318 10.776 2.07e-14 ***
treatment -7.787 3.278 -2.376 0.0216 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.59 on 48 degrees of freedom
Multiple R-squared: 0.1052, Adjusted R-squared: 0.08656
F-statistic: 5.643 on 1 and 48 DF, p-value: 0.02156
Here, and in the entire tutorial, we assume an \(\alpha\)-level of .005 (Benjamin et al., 2018). As you can see, in this specific sample the treatment effect is not significant: p = .02, which is above our \(\alpha\)-level of .005. But maybe the effect would be detected in another sample?
Doing the power analysis
Now we need to repeatedly draw many samples and see how many of the analyses would have detected the existing effect. To do this, we put the code from above into a loop (this time with \(n=100\) participants) and repeat the procedure many times.
set.seed(0xBEEF)
# define all predictor and simulation variables.
# (we do this outside the loop as it stays constant)
iterations <- 1000
n <- 100
treatment <- c(rep(0, n/2), rep(1, n/2))

# vector that collects the p-values from all 1000 virtual samples
# (prepare an empty NA vector with 1000 slots)
p_values <- rep(NA, iterations)

# now repeatedly draw samples, analyze them, and save the p-value of the
# focal regression weight (i.e., the slope parameter)
for (i in 1:iterations) {
  BDI <- 23 - 6*treatment + rnorm(n, mean=0, sd=sqrt(117))
  res <- lm(BDI ~ treatment)
  p_values[i] <- summary(res)$coefficients["treatment", "Pr(>|t|)"]
}
How many of our 1000 virtual samples would have found the effect?
table(p_values < .005)
FALSE TRUE
535 465
Only 46% of samples of size \(n=100\) result in a significant p-value.
46% - that is our power for \(\alpha = .005\), Cohen’s \(d=.55\), and \(n=100\).
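As a quick plausibility check (our addition; the tutorial's own analytic verification follows in the next section), the analytic power for these values can be computed with the pwr package and should land close to the simulated 46%:
library(pwr)
# analytic power for d = 0.55, alpha = .005, and 50 participants per group
pwr.t.test(n = 50, d = 0.55, sig.level = 0.005)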
Sample size planning: Find the necessary sample size
Now we know that a sample size of 100 does not lead to sufficient power. But what sample size would we need to achieve a power of at least 80%? In the simulation approach, you need to test different \(n\)s until you find the necessary sample size. We do this by wrapping the simulation code into another loop that successively increases \(n\), and we store the computed power for each \(n\).
set.seed(0xBEEF)
# define all predictor and simulation variables.
iterations <- 1000
ns <- seq(100, 300, by=20) # test ns between 100 and 300

result <- data.frame()

for (n in ns) { # outer loop
  treatment <- c(rep(0, n/2), rep(1, n/2))
  p_values <- rep(NA, iterations)

  # now repeatedly draw samples, analyze, and save the p-value
  for (i in 1:iterations) { # inner loop
    BDI <- 23 - 6*treatment + rnorm(n, mean=0, sd=sqrt(117))
    res <- lm(BDI ~ treatment)
    p_values[i] <- summary(res)$coefficients["treatment", "Pr(>|t|)"]
  }

  result <- rbind(result, data.frame(
    n = n,
    power = sum(p_values < .005)/iterations
  ))

  # show the result after each run (not shown here in the tutorial)
  print(result)
}
Let’s plot the result:
ggplot(result, aes(x=n, y=power)) + geom_point() + geom_line()
Hence, with n=180 (90 in each group), we have an 80% chance of detecting the effect.
🥳 Congratulations! You did your first power analysis by simulation. 🎉
For these simple models, we can also compute analytic solutions. Let’s verify our results with the pwr package:
library(pwr)
pwr.t.test(d = 0.55, sig.level = 0.005, power = .80)
Two-sample t test power calculation
n = 90.00212
d = 0.55
sig.level = 0.005
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
Exactly the same result - phew 😅
Sensitivity analyses
When we entered the specific numbers into our equations, we used point estimates from a relatively small pilot sample. As there is considerable uncertainty around these estimates, the true population value could be much smaller or larger. Therefore, we will explore two other ways to set the effect size: safeguard power analysis and the smallest effect size of interest (SESOI).
Safeguard power analysis
As a sensitivity analysis, we apply a safeguard power analysis, which aims at the lower end of a two-sided 60% CI around the focal treatment effect (the intercept is irrelevant here). (Of course, you can use any other value than 60%, but this is the value (tentatively) suggested by the inventors of the safeguard power analysis.)
If you suspect publication bias, another heuristic for arriving at a more realistic population effect size is the “divide-by-2” heuristic. (TODO: Link to presentation)
We can use the t.test function to compute a CI around that value:
t.test(BtheB$bdi.2m, BtheB$bdi.pre, conf.level = 0.60)
Welch Two Sample t-test
data: BtheB$bdi.2m and BtheB$bdi.pre
t = -4.1613, df = 194.87, p-value = 4.745e-05
alternative hypothesis: true difference in means is not equal to 0
60 percent confidence interval:
-7.712244 -5.112704
sample estimates:
mean of x mean of y
16.91753 23.33000
You can also use the MOTE online calculator by Erin Buchanan et al. to compute effect sizes and confidence intervals around them: https://doomlab.shinyapps.io/mote/
This takes either raw statistics (means, SDs) or test statistics as input.
As the assumed effect is negative, we aim at the upper, i.e., the more conservative limit, which is considerably smaller in absolute value: -5.1.
When you use a standardized mean difference that has been reported in a paper, you can use a function from the MBESS package to compute the CI:
library(MBESS)
# smd = standardized mean difference = Cohen's d
ci.smd(smd=0.55, n.1=50, n.2=50, conf.level=0.60)
Now we can rerun the power simulation with this more conservative value (the only change to the code above is that we changed the treatment effect from -6 to -5.1).
set.seed(0xBEEF)
# define all predictor and simulation variables.
iterations <- 1000
ns <- seq(100, 300, by=20)

result <- data.frame()

for (n in ns) {
  treatment <- c(rep(0, n/2), rep(1, n/2))
  p_values <- rep(NA, iterations)

  # now repeatedly draw samples, analyze, and save the p-value
  for (i in 1:iterations) {
    BDI <- 23 - 5.1*treatment + rnorm(n, mean=0, sd=sqrt(117))
    res <- lm(BDI ~ treatment)
    p_values[i] <- summary(res)$coefficients["treatment", "Pr(>|t|)"]
  }

  result <- rbind(result, data.frame(
    n = n,
    power = sum(p_values < .005)/iterations
  ))

  # show the result after each run
  print(result)
}
n power
1 100 0.314
2 120 0.408
3 140 0.473
4 160 0.570
5 180 0.618
6 200 0.717
7 220 0.735
8 240 0.795
9 260 0.823
10 280 0.878
11 300 0.887
With that more conservative effect size assumption, we would need around 240 participants, i.e. 120 per group.
Smallest effect size of interest (SESOI)
Many methodologists argue that we should not power for the expected effects size, but rather for the smallest effect size of interest. In this case, a non-significant result can be interpreted as “We accept the \(H_0\), and even if a real effect existed, it most likely is too small to be relevant”.
What change of BDI scores is perceived as “clinically important”? The hard part is to find a convincing theoretical or empirical argument for the chosen SESOI. In the case of the BDI, luckily someone else did that work.
The NICE guidance suggests that a change of ≥3 BDI-II points is clinically important.
However, as you might expect, things are more complicated. Button et al. (2015) analyzed data sets in which patients were asked, after a treatment, whether they felt “better”, “the same”, or “worse”. With these subjective ratings, they could relate changes in BDI-II scores to perceived improvements. Hence, even when depressive symptoms were measurably reduced in the BDI, patients might still answer “feels the same”, which indicates that the reduction did not surpass the threshold of a subjectively relevant improvement. For example, the minimal clinically important difference depends on baseline severity: for patients to feel notably better, they need a larger reduction in BDI-II scores if they start from a higher level of depressive symptoms. Based on this analysis, typical SESOIs are larger than the NICE guideline suggests, more in the range of a 6-point BDI reduction.
Let’s use the NICE recommendation of a 3-point BDI reduction as a lower threshold for our power analysis (any larger effect will then be detected with even higher power).
set.seed(0xBEEF)
# define all predictor and simulation variables.
iterations <- 1000
ns <- seq(600, 800, by=20)

result <- data.frame()

for (n in ns) {
  treatment <- c(rep(0, n/2), rep(1, n/2))
  p_values <- rep(NA, iterations)

  # now repeatedly draw samples, analyze, and save the p-value
  for (i in 1:iterations) {
    BDI <- 23 - 3*treatment + rnorm(n, mean=0, sd=sqrt(117))
    res <- lm(BDI ~ treatment)
    p_values[i] <- summary(res)$coefficients["treatment", "Pr(>|t|)"]
  }

  result <- rbind(result, data.frame(
    n = n,
    power = sum(p_values < .005)/iterations
  ))
}
n power
1 600 0.728
2 620 0.738
3 640 0.756
4 660 0.781
5 680 0.791
6 700 0.791
7 720 0.827
8 740 0.820
9 760 0.852
10 780 0.866
11 800 0.871
Hence, we need around 700 participants to reliably detect this smallest effect size of interest.
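As a plausibility check (our addition, analogous to the pwr verification above), the analytic solution for this scenario should suggest roughly 350 participants per group, i.e., about 700 in total:
library(pwr)
# d implied by the SESOI: a 3-point reduction relative to the assumed SD
pwr.t.test(d = 3/sqrt(117), sig.level = 0.005, power = .80)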
Did you spot the strange pattern in the result? At n=720, the power is 83%, but at n=740 it is only 82%. The true power cannot decrease with a larger sample size, so this must be Monte Carlo sampling error: 1000 iterations are not enough for precise estimates. Increasing the number of iterations to 10,000 takes much longer, but gives more precise results:
set.seed(0xBEEF)
# define all predictor and simulation variables.
iterations <- 10000

# same range of ns as before, but with 10 times more iterations
ns <- seq(600, 800, by=20)

result <- data.frame()

for (n in ns) {
  treatment <- c(rep(0, n/2), rep(1, n/2))
  p_values <- rep(NA, iterations)

  # now repeatedly draw samples, analyze, and save the p-value
  for (i in 1:iterations) {
    BDI <- 23 - 3*treatment + rnorm(n, mean=0, sd=sqrt(117))
    res <- lm(BDI ~ treatment)
    p_values[i] <- summary(res)$coefficients["treatment", "Pr(>|t|)"]
  }

  result <- rbind(result, data.frame(
    n = n,
    power = sum(p_values < .005)/iterations
  ))

  # show the result after each run
  print(result)
}
n power
1 600 0.7244
2 620 0.7353
3 640 0.7528
4 660 0.7721
5 680 0.7889
6 700 0.8001
7 720 0.8150
8 740 0.8328
9 760 0.8374
10 780 0.8573
11 800 0.8682
Now the power increases monotonically with sample size, as expected.
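How precise are these simulated power values? A rough back-of-the-envelope sketch (our addition) is the Monte Carlo standard error of a proportion, \(\sqrt{p(1-p)/\text{iterations}}\), which explains why estimates based on 1000 iterations can wiggle by a percentage point or two between neighboring \(n\)s:
# Monte Carlo standard error of a simulated power estimate
mc_se <- function(power, iterations) sqrt(power * (1 - power) / iterations)
mc_se(0.80, 1000)    # roughly 0.013, i.e., +/- 1-2 percentage points
mc_se(0.80, 10000)   # roughly 0.004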
Recap: What did we do?
These are the steps we took in this tutorial:
1. Define the statistical model that you will use to analyze your data (in our case: a linear regression). This is sometimes called the data generating mechanism or the data generating model.
2. Find empirical information about the parameters in the model. These are typically mean values, variances, group differences, regression coefficients, or correlations (in our case: a mean value (before treatment), a group difference, and an error variance). These numbers define your assumed state of the population.
3. Program your equation, which consists of a systematic (i.e., deterministic) part and (one or more) random error terms.
4. Do the simulation:
   a. Repeatedly sample data from your equation (only the random error changes in each run; the deterministic part stays the same).
   b. Compute the index of interest (typically: the p-value of a focal effect).
   c. Count how many samples would have detected the effect.
   d. Tune your simulation parameters until the desired level of power is achieved.
Concerning step 4d, the typical scenario is that you increase sample size until your desired level of power is achieved. But you could imagine other ways to increase power, e.g.:
- increase the reliability of a measurement instrument (which will be reflected in a larger effect size) until desired power is achieved. (This could relate to a trade-off: Do you invest more effort and time in better measurement, or do you keep an unreliable measure and collect more participants?)
- Compare a within- and a between-design in simulations
- … TODO
Bonus 1: Beyond power: General design analysis
TBD:
- Compute another quantity of interest (e.g., the width of a CI), and optimize the sample size for that.
Bonus 2: Optimizing your code for speed
TBD
system.time({
  # vector that collects the results
  p_values <- rep(NA, iterations)

  # now repeatedly draw samples, analyze, and save the p-value
  for (i in 1:iterations) {
    print(i) # show how long we have to wait ...
    BDI <- 23 - 6*treatment + rnorm(n, mean=0, sd=sqrt(117))
    res <- lm(BDI ~ treatment)
    p_values[i] <- summary(res)$coefficients["treatment", "Pr(>|t|)"]
  }
})
Generally, you can start with few iterations for a fast exploration of the parameter space. Once you have found the relevant range (e.g., the sample sizes that achieve a power between 70% and 90%), you can restrict the parameter space to that range and increase the number of iterations to get more precise results.
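One possible shortcut for this particular model (our suggestion, not part of the original text): with a single dichotomous predictor, the p-value of the regression slope is identical to that of a pooled two-sample t-test, and t.test() is typically cheaper than calling summary() on an lm object inside a tight loop:
system.time({
  p_values <- rep(NA, iterations)
  for (i in 1:iterations) {
    BDI <- 23 - 6*treatment + rnorm(n, mean=0, sd=sqrt(117))
    # pooled t-test: same p-value as the lm() slope, but faster to compute
    p_values[i] <- t.test(BDI[treatment == 0], BDI[treatment == 1],
                          var.equal = TRUE)$p.value
  }
})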
Bonus 3: The quick solution: sampling from pilot samples
If your pilot study has exactly the same data structure as the study you are planning, you can take the resulting lm object and directly sample from that.
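A minimal sketch of what this could look like (our assumption of the intended workflow): fit the analysis model to the pilot data, then let simulate() draw new outcome values from the estimated coefficients and residual variance:
# fit the planned analysis model to the pilot data
pilot_fit <- lm(bdi.2m ~ treatment, data = BtheB)

# the data actually used in the fit (complete cases only)
sim_dat <- model.frame(pilot_fit)

# replace the outcome by one simulated draw from the fitted model
sim_dat$bdi.2m <- simulate(pilot_fit, nsim = 1)$sim_1

# analyze the simulated data as if it were a freshly collected sample
summary(lm(bdi.2m ~ treatment, data = sim_dat))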
With that procedure, however, you assume that the pilot study perfectly hit the true effect size (and all other parameters of the model).
References
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., … Johnson, V. E. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6–10. https://doi.org/10.1038/s41562-017-0189-z
Button, K. S., Kounali, D., Thomas, L., Wiles, N. J., Peters, T. J., Welton, N. J., Ades, A. E., & Lewis, G. (2015). Minimal clinically important difference on the Beck Depression Inventory—II according to the patient’s perspective. Psychological Medicine, 45(15), 3269–3279. https://doi.org/10.1017/S0033291715001270
Footnotes
1. By comparing only the post-treatment scores, we used an unbiased causal model for the treatment effect. Later we will increase the precision of the estimate by including the baseline score in the model.