Simulations

Loops
Additional Reading

Loops

This example demonstrates the use of loops to create a simulation to examine the how model statistics might vary for a given sampling framework.

In this case we are taking repeated random draws of size N from a population, then calculating the slope and confidence interval of the slope. We want to note cases where b1 contains zero since these would represent NULL results in our study.

For a single sample we calculate the slope and confidence interval as follows:

# regression model: Y = b0 + b1(X)
d.sample <- dplyr::sample_n( d, size=N )
m <- lm( y ~ x, data=d.sample )
b1 <- (coef( m ))[2]
ci <- confint( m )
ci.b1.lower <- ci[2,1]
ci.b1.upper <- ci[2,2]

Then to examine lots of scenarios we can just repeat this code inside of a loop.

get_slope <- function( d, N=10 )
{             
  d.sample <- dplyr::sample_n( d, size=N )
  m <- lm( y ~ x, data=d.sample )
  b1 <- (coef( m ))[2]
  ci <- confint( m )
  ci.b1.lower <- ci[2,1]
  ci.b1.upper <- ci[2,2]
  df <- data.frame( b1, ci.b1.lower, ci.b1.upper )
}

results <- NULL                 
for( i in 1:100 )
{
   one.model <- get_slope( d )
   results <- dplyr::bind_rows( results, one.model )
}

These types of bootstrapping simulations are very useful for generating robust versions of sampling statistics when the data is irregular or closed-form solutions do not exist.

In this case were are interested in statistical power as a function of sample size. In studies like drug trials it might cost $10,000 for each study participant, so drug companies want to minimize the sample size needed to veryify the effectiveness of their drugs.

We can use previous research to ascertain a reasonable correlation between X and Y or anticipated effect size to simulate some population data.

Type II Errors represent cases that the regression fails to produce a slope that is differentiable from zero (the confidence interval of slope b1 contains zero).

We would start with a small sample size (n=10 in this example) then increase it until we have exceeded a target Type II Error rate.

Some helper functions were created to generate the proper statistics inside of the loops.

Pay attention to differences in the constructors.

The first example collects only the slope from each simulation, so results are stored in a collector vector called slopes.

# BOOTSTRAPPING TYPE II ERRORS
# Examine Type II Errors
# as a function of sample size

# load data and helper functions
source( "https://raw.githubusercontent.com/DS4PS/cpp-527-fall-2020/master/lectures/loop-example.R" )
head( d )                       # data frame with X and Y 
get_sample_slope( d, n=10 )     # returns a single value
test_for_null_slope( d, n=10 )  # returns a one-row data frame

## EXAMINE SLOPES
## sample size = 10


slopes <- NULL  # collector vector 

for( i in 1:1000 )  # iterator i
{

  b1 <- get_sample_slope( d, n=10 )
  slopes[ i ] <- b1   
 
}


# descriptives from 10,000 random draws, sample size 10

head( slopes )
[1] 2.246041 3.979462 1.714822 4.689032 1.763237 3.107451

summary( slopes )  
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#  -2.194   1.596   2.176   2.088   2.600   4.868

hist( slopes, breaks=25, col="gray20", border="white" )

The second example generates the slope, confidence interval (lower and upper bounds), and a null significance test from each new sample. The results are stored in a data frame called results.

## EXAMINE CONFIDENCE INTERVALS
## sample size = 10

# build the
# results data frame 
# using row binding


results <- NULL

for( i in 1:50 )
{

  null.slope.test <- test_for_null_slope( d, n=10 )
  results <- rbind( results, null.slope.test )

}


head( results )

# confidence intervals from 50 draws, sample size 10

#            b1 ci.b1.lower ci.b1.upper null.slope
# x  -0.9783359  -4.5757086    2.619037       TRUE
# x1  2.3897431   0.4295063    4.349980      FALSE
# x2  2.0781628  -0.6677106    4.824036       TRUE
# x3  2.9178206   0.7080918    5.127549      FALSE
# x4  2.3702949   0.5238930    4.216697      FALSE
# x5  1.9701996   0.5513491    3.389050      FALSE

plot_ci( df=results )

## alternative constructor:
## the list version 
## runs faster but it 
## needs to be converted 
## to a data frame afterwards
  
results <- list()

for( i in 1:50 )
{

  null.slope.test <- test_for_null_slope( d, n=10 )
  results[[ i ]] <- null.slope.test

}

## convert list to df
results <- dplyr::bind_rows( results )

head( results )
plot_ci( df=results )

Additional Reading

These readings are a slightly more advanced treatment of loops and control structures used in simulations. Dive in or bookmark them for later.

Don’t Loop - Apply

Simulation