Parallel Programming in R

Lately, I’ve been using more complicated estimators for both coursework and my own research. Often times, these estimators don’t have very pretty analytical solutions for their variance-covariance matrices. One way to get around this, is to bootstrap your estimator to get valid uncertainty estimates. Unfortunately, bootstrapping requires me to use for loops (which I despise). While for loops do tend to make for more readable code, they are hilariously slow in R.

After some searching around the internet, I ran across one potential fix to help speed up bootstrap calculations: parallel programming. Using the foreach package in R, I can speed up my bootstrap quite easily. The following example demonstrates the performance gains of using a foreach loop instead of for loop.

    #init model params
    samp.size <- 50
    num.boots <- 1000
    alpha.0 <- 1
    beta.0 <- 0
    sigma.0 <- 1

    #initialize some values for the bootstrap
    epsilon <- rt(n = samp.size,df = 5) #draw epsilon for error term
    X <- rnorm(n = samp.size,mean = 0, sd = 2) #draw X
    u <- sigma.0*epsilon #generate homoskedastic regression disturbance 
    Y <- alpha.0 + beta.0*X + u #create outcome
    df <- data.frame(Y,X) #convert X,Y to dataframe
    mod <- lm(Y~X) #estimate regression of Y on X
    mod$vcovHC <- sandwich::vcovHC(mod) #
    t.stat.observed <- coef(mod)['X']/sqrt(diag(mod$vcovHC))['X']

Regular For Loop

    start <- proc.time() #init start time
    t.stats <- NULL #init vector of bootstrapped t.statistics
    for(b in 1:num.boots){
      boot.indices <- sample(x = seq(1,nrow(df)),size = samp.size,replace = T)
      boot.df <- df[boot.indices,]
      mod <- lm(Y~X,data = boot.df)
      mod$vcovHC <- sandwich::vcovHC(mod)
      boot.t <- unname(coef(mod)['X']/sqrt(diag(mod$vcovHC))['X']) #return HC t stat
      t.stats[b] <- boot.t
    }
    proc.time() - start #elapsed time

    ##    user  system elapsed 
    ##   3.287   0.021   3.321

Okay not terrible… but let’s see how the parallelized version of for loop performs.

Parallel For Loop

    registerDoParallel(cores=2) #register number of cores
    start <- proc.time() #init start time
    t.stats <- foreach(i=1:num.boots, .combine = c) %dopar% {
      boot.indices <- sample(x = seq(1,nrow(df)),size = samp.size,replace = T)
      boot.df <- df[boot.indices,]
      mod <- lm(Y~X,data = boot.df)
      mod$vcovHC <- sandwich::vcovHC(mod)
      unname(coef(mod)['X']/sqrt(diag(mod$vcovHC))['X']) #return HC t stat
    }
    proc.time() - start #elapsed time

    ##    user  system elapsed 
    ##   3.671   0.270   2.204

Clearly there are some pretty large performance gains here. Moreover, the syntax behind the foreach loop still retains the readibility of the regular for loop. In sum, we can get some pretty large performance gains from some very easy tweaks to our R code.