Bonus: Optimizing R code for speed

Author

Felix Schönbrodt

Optimizing code for speed can be an art (and you get lost and spend/waste hours by micro-optimizing some milliseconds). But the Pareto principle also applies here: with 20% effort, you can have quick and substantial gains.

Code profiling means that the code execution is timed, just like you had a stopwatch. You try to make your code snippet as fast as possible. RStudio has a built-in profiler that (in theory) allows to see which code line takes up the longest time. But in my experience, if the computation of each single line is very short (and the duration mostly comes from the many repetitions), it is very inaccurate (i.e., the time spent is allocated to the wrong lines). Therefore, we’ll resort to the simplest way of timing code: We simple measure overall execution time by wrapping our code in a system.time({ ... }) call.

First, naive version

We use system.time() to measure how long our code execution takes. Longer code blocks need to be wrapped in curly braces {...}. The function returns multiple timings; the relevant number for us is the “user” time.

Here is a first version of the power simulation code for a simple LM.

t0 <- system.time({

iterations <- 2000
ns <- seq(300, 500, by=50)
result <- data.frame()

for (n in ns) {
  p_values <- c()
  
  for (i in 1:iterations) {
    treatment <- c(rep(0, n/2), rep(1, n/2))
    BDI <- 23 - 3*treatment + rnorm(n, mean=0, sd=sqrt(117))
    df <- data.frame(treatment, BDI)
    res <- lm(BDI ~ treatment, data=df)
    p_values <- c(p_values, summary(res)$coefficients["treatment", "Pr(>|t|)"])
  }
  
  result <- rbind(result, data.frame(n = n, power = sum(p_values < .005)/iterations))
}

})
t0

   user  system elapsed 
  3.923   0.068   4.424

result

    n  power
1 300 0.3345
2 350 0.3870
3 400 0.4745
4 450 0.5525
5 500 0.6230

This first version takes 3.923 seconds. Of course we have sampling error here as well; if you run this code multiple times, you will always get slightly different timings. But, again, we refrain from micro-optimizing in the millisecond range, so a single run is generally good enough. You should only tune your simulation in a way that it takes at least a few seconds; if you are in the millisecond range, the timings are imprecise and you won’t see speed improvements very well.

Rule 1: No growing vectors/data frames

This is one of the most common bottlenecks: You start with an empty vector (or even worse: data frame), and grow it by rbind-ing new rows to it in each iteration.

t1 <- system.time({

iterations <- 2000
ns <- seq(300, 500, by=50)

result <- data.frame()

for (n in ns) {
  print(n)
  
  # CHANGE: Preallocate vector with the final size, initialize with NAs
  p_values <- rep(NA, iterations)
  
  for (i in 1:iterations) {
    treatment <- c(rep(0, n/2), rep(1, n/2))
    BDI <- 23 - 6*treatment + rnorm(n, mean=0, sd=sqrt(117))
    df <- data.frame(treatment, BDI)
    res <- lm(BDI ~ treatment, data=df)
    
    # CHANGE: assign resulting p-value to specific slot in vector
    p_values[i] <- summary(res)$coefficients["treatment", "Pr(>|t|)"]
  }
  
  # Here we stil have a growing data.frame - but as this is only done 5 times, it does not matter.
  result <- rbind(result, data.frame(n = n, power = sum(p_values < .005)/iterations))
}

})

[1] 300
[1] 350
[1] 400
[1] 450
[1] 500

# Combine the different timings in a data frame
timings <- rbind(t0, t1) |> data.frame()

# compute the absolute and relativ difference of consecutive rows:
timings$diff <- c(NA, timings[2:nrow(timings), 1] - timings[1:(nrow(timings)-1), 1])
timings$rel_diff <- c(NA, timings[2:nrow(timings), "diff"]/timings[1:(nrow(timings)-1), 1])

timings

   user.self sys.self elapsed user.child sys.child  diff   rel_diff
t0     3.923    0.068   4.424          0         0    NA         NA
t1     3.982    0.044   4.202          0         0 0.059 0.01503951

OK, this didn’t really change anything here. But in general (in particular with data frames) this is worth looking at.

Rule 2: Avoid data frames as far as possible

Use matrizes instead of data frames wherever possible; or avoid them at all (as we do in the code below).

t2 <- system.time({

iterations <- 2000
ns <- seq(300, 500, by=50)

result <- data.frame()

for (n in ns) {
  treatment <- c(rep(0, n/2), rep(1, n/2))
  p_values <- rep(NA, iterations)
  
  for (i in 1:iterations) {
    BDI <- 23 - 3*treatment + rnorm(n, mean=0, sd=sqrt(117))
    
    # CHANGE: We don't need the data frame - just create the two variables in the environment and lm() takes them from there.
    #df <- data.frame(treatment, BDI)
    res <- lm(BDI ~ treatment)
    
    p_values[i] <- summary(res)$coefficients["treatment", "Pr(>|t|)"]
  }
  
  result <- rbind(result, data.frame(n = n, power = sum(p_values < .005)/iterations))
}

})

timings <- rbind(t0, t1, t2) |> data.frame()
timings$diff <- c(NA, timings[2:nrow(timings), 1] - timings[1:(nrow(timings)-1), 1])
timings$rel_diff <- c(NA, timings[2:nrow(timings), "diff"]/timings[1:(nrow(timings)-1), 1])
timings

   user.self sys.self elapsed user.child sys.child   diff    rel_diff
t0     3.923    0.068   4.424          0         0     NA          NA
t1     3.982    0.044   4.202          0         0  0.059  0.01503951
t2     2.926    0.044   2.978          0         0 -1.056 -0.26519337

This showed a substantial improvement of around -1.1 seconds; a relative gain of -26.5%.

Rule 3: Avoid unnecessary computations

What do we actually need? In fact only the p-value for our focal predictor. But the lm function does so many more things, for example parsing the formula BDI ~ treatment.

We could strip away all overhead and do only the necessary steps: Fit the linear model, and retrieve the p-values (see https://stackoverflow.com/q/49732933). This needs some deeper knowledge of the functions and some google-fu. When you do this, you should definitely compare your results with the original result from the lm function and verify that they are identical!

t3 <- system.time({

iterations <- 2000
ns <- seq(300, 500, by=50)

result <- data.frame()

for (n in ns) {
  # construct the design matrix: first column is all-1 (intercept), second column is the treatment factor
  x <- cbind(
    rep(1, n),
    c(rep(0, n/2), rep(1, n/2))
  )
  
  p_values <- rep(NA, iterations)
  
  for (i in 1:iterations) {
    y <- 23 - 3*x[, 2] + rnorm(n, mean=0, sd=sqrt(117))

    # For comparison - do we get the same results? Yes!
    # res0 <- lm(y ~ x[, 2])
    # summary(res0)
    
    # fut the model:
    m <- .lm.fit(x, y)
    
    # compute p-values based on the residuals:
    rss <- sum(m$residuals^2)
    rdf <- length(y) - ncol(x)
    resvar <- rss/rdf
    R <- chol2inv(m$qr)
    se <- sqrt(diag(R) * resvar)
    ps <- 2*pt(abs(m$coef/se),rdf,lower.tail=FALSE)
    
    p_values[i] <- ps[2]
  }
  
  result <- rbind(result, data.frame(n = n, power = sum(p_values < .005)/iterations))
}

})
timings <- rbind(t0, t1, t2, t3) |> data.frame()
timings$diff <- c(NA, timings[2:nrow(timings), 1] - timings[1:(nrow(timings)-1), 1])
timings$rel_diff <- c(NA, timings[2:nrow(timings), "diff"]/timings[1:(nrow(timings)-1), 1])
timings

   user.self sys.self elapsed user.child sys.child   diff    rel_diff
t0     3.923    0.068   4.424          0         0     NA          NA
t1     3.982    0.044   4.202          0         0  0.059  0.01503951
t2     2.926    0.044   2.978          0         0 -1.056 -0.26519337
t3     0.304    0.016   0.320          0         0 -2.622 -0.89610390

Rule 4: Use optimized packages

For many statistical models, there are packages optimized for speed, see for example here: https://stackoverflow.com/q/49732933

library(RcppArmadillo)

t4 <- system.time({

iterations <- 2000
ns <- seq(300, 500, by=50)

result <- data.frame()

for (n in ns) {
  # construct the design matrix: first column is all-1 (intercept), second column is the treatment factor
  x <- cbind(
    rep(1, n),
    c(rep(0, n/2), rep(1, n/2))
  )
  
  p_values <- rep(NA, iterations)
  
  for (i in 1:iterations) {
    y <- 23 - 3*x[, 2] + rnorm(n, mean=0, sd=sqrt(117))

    # For comparison - do we get the same results? Yes!
    # res0 <- lm(y ~ x[, 2])
    # summary(res0)
    
    mdl <- RcppArmadillo::fastLmPure(x, y)
    pval <- 2*pt(abs(mdl$coefficients/mdl$stderr), mdl$df.residual, lower.tail=FALSE)
    
    p_values[i] <- pval[2]
  }
  
  result <- rbind(result, data.frame(n = n, power = sum(p_values < .005)/iterations))
}

})
timings <- rbind(t0, t1, t2, t3, t4) |> data.frame()
timings$diff <- c(NA, timings[2:nrow(timings), 1] - timings[1:(nrow(timings)-1), 1])
timings$rel_diff <- c(NA, timings[2:nrow(timings), "diff"]/timings[1:(nrow(timings)-1), 1])
timings

   user.self sys.self elapsed user.child sys.child   diff    rel_diff
t0     3.923    0.068   4.424          0         0     NA          NA
t1     3.982    0.044   4.202          0         0  0.059  0.01503951
t2     2.926    0.044   2.978          0         0 -1.056 -0.26519337
t3     0.304    0.016   0.320          0         0 -2.622 -0.89610390
t4     0.290    0.005   0.299          0         0 -0.014 -0.04605263

This step only gave a minor -5% relative increase in speed - but as a bonus, it made our code much easier to read and shorter.

I think we can stop here - we managed to reduce the execution time for our example from 3.923 seconds to 0.29 seconds - 13.5x faster!

With that fast code, we can easily explore a broad parameter range (n ranging from 100 to 1000) and increase the iterations to 2000 for more stable results. (See also: “Bonus: How many Monte-Carlo iterations are necessary?”)

Show the code

t5 <- system.time({

iterations <- 2000
ns <- seq(100, 1000, by=50)

result <- data.frame()

for (n in ns) {

    x <- cbind(
    rep(1, n),
    c(rep(0, n/2), rep(1, n/2))
  )
  
  p_values <- rep(NA, iterations)
  
  for (i in 1:iterations) {
    y <- 23 - 3*x[, 2] + rnorm(n, mean=0, sd=sqrt(117))
    
    mdl <- RcppArmadillo::fastLmPure(x, y)
    pval <- 2*pt(abs(mdl$coefficients/mdl$stderr), mdl$df.residual, lower.tail=FALSE)
    
    p_values[i] <- pval[2]
  }
  
  result <- rbind(result, data.frame(n = n, power = sum(p_values < .005)/iterations))
}

})
result

      n  power
1   100 0.0745
2   150 0.1380
3   200 0.1900
4   250 0.2590
5   300 0.3515
6   350 0.4135
7   400 0.4600
8   450 0.5575
9   500 0.6035
10  550 0.6785
11  600 0.7225
12  650 0.7780
13  700 0.8210
14  750 0.8295
15  800 0.8695
16  850 0.8890
17  900 0.9120
18  950 0.9330
19 1000 0.9490

Some steps, such as avoiding growing vectors, didn’t really help here, but will help a lot in other scenarios.

There are many blog post showing and comparing strategies to increase R performance, e.g.:

But always remember:

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”

Donald Knuth (Structured Programming with go to Statements, ACM Journal Computing Surveys, Vol 6, No. 4, Dec. 1974. p. 268)

Session Info

These speed measurements have been performed on a 2021 MacBook Pro with M1 processor.

sessionInfo()

R version 4.2.0 (2022-04-22)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 13.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RcppArmadillo_0.11.4.2.1 prettycode_1.1.0         colorDF_0.1.7           

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9        digest_0.6.31     crayon_1.5.2      lifecycle_1.0.3  
 [5] jsonlite_1.8.4    magrittr_2.0.3    evaluate_0.19     stringi_1.7.8    
 [9] rlang_1.0.6       cli_3.6.0         rstudioapi_0.14   vctrs_0.5.1      
[13] rmarkdown_2.19    tools_4.2.0       stringr_1.5.0     glue_1.6.2       
[17] htmlwidgets_1.6.1 purrr_1.0.0       yaml_2.3.6        xfun_0.36        
[21] fastmap_1.1.0     compiler_4.2.0    htmltools_0.5.4   knitr_1.41