Inspecting the Data

head(twins)

##   Foster Biological Social
## 1     82         82   high
## 2     80         90   high
## 3     88         91   high
## 4    108        115   high
## 5    116        115   high
## 6    117        129   high

str(twins)

## 'data.frame':    27 obs. of  3 variables:
##  $ Foster    : int  82 80 88 108 116 117 132 71 75 93 ...
##  $ Biological: int  82 90 91 115 115 129 131 78 79 82 ...
##  $ Social    : Factor w/ 3 levels "high","low","middle": 1 1 1 1 1 1 1 3 3 3 ...

summary(twins)

##      Foster         Biological       Social  
##  Min.   : 63.00   Min.   : 68.0   high  : 7  
##  1st Qu.: 84.50   1st Qu.: 83.5   low   :14  
##  Median : 94.00   Median : 94.0   middle: 6  
##  Mean   : 95.11   Mean   : 95.3              
##  3rd Qu.:107.50   3rd Qu.:104.5              
##  Max.   :132.00   Max.   :131.0

Null Sampling Distribution for the Slope

Instructions:

Load the infer package.
Using the twins data and the tidy function from the broom package, run a linear model regressing the IQ of the Foster twin against the Biological twin.
Using the infer package, permute the response variable and run the same function to get a slope value on a dataset where there is no link between the explanatory and response variables. (Note: use hypothesize(null = “independence”) for the regression hypothesis test.)
Repeat the previous step 10 times to find 10 different slope values on 10 different permuted datasets.

# Load the infer package
library(infer)

# Calculate the observed slope
obs_slope <- lm(Foster ~ Biological, data = twins) %>%
  tidy() %>%   
  filter(term == "Biological") %>%
  pull(estimate)    

# Simulate 10 slopes with a permuted dataset
set.seed(4747) 
perm_slope <- twins %>%
  specify(Foster ~ Biological) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 10, type = "permute") %>%
  calculate(stat = "slope") 

# Print the observed slope and the 10 permuted slopes
obs_slope

## [1] 0.901436

perm_slope

## # A tibble: 10 x 2
##    replicate    stat
##        <int>   <dbl>
##  1         1  0.143 
##  2         2  0.0710
##  3         3 -0.456 
##  4         4  0.0749
##  5         5  0.297 
##  6         6  0.0673
##  7         7  0.140 
##  8         8  0.164 
##  9         9  0.0971
## 10        10  0.184

SE of the slope

Instructions; - Using the infer steps, generate 500 permuted slope statistics - Plot the null slopes using geom_density - Find the mean and the standard deviation of the null slopes

# Make a dataframe with replicates and plot them!
set.seed(4747)
perm_slope <- twins %>%
  specify(Foster ~ Biological) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 500, type = "permute") %>%
  calculate(stat = "slope") 

ggplot(perm_slope, aes(x = stat)) + 
  geom_density()

# Calculate the mean and the standard deviation of the slopes
perm_slope %>% 
  ungroup() %>% 
  summarize(mean(stat), sd(stat))

## # A tibble: 1 x 2
##   `mean(stat)` `sd(stat)`
##          <dbl>      <dbl>
## 1      0.00629      0.196

p-value

Instructions: - Calculate the absolute value of the observed test statistic. Use pull() to create a numeric value with which to work - Compare the observed statistic to the perm_slopes calculated in the previous exercise - The p-value will be the proportion of times (out of 1000) that the absolute value of the permuted statistics is greater than the absolute value of the observed statistic

# Calculate the absolute value of the slope
abs_obs_slope <- lm(Foster ~ Biological, data = twins) %>%
  tidy() %>%   
  filter(term == "Biological") %>%
  pull(estimate) %>%
  abs()

# Compute the p-value  
perm_slope %>% 
  mutate(abs_perm_slope = abs(stat)) %>%
  summarize(p_value = mean(abs_perm_slope > abs_obs_slope))

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0

Simulation-based CI for slope

Bootstrapping is another resampling method that allows estimation of the sampling distribution of the statistic of interest
Because interest is now in creating a CI, there is no null hypothesis, hence there isn’t any reason to permute any of the variables
The idea behind bootstrap sampling is to resample from the original dataset with replacement, i.e. for each boostrap sample, some of the observations will be replicated and others omitted
In linear-model setting, each resample is taken of a pair of observations one at a time i.e. on the case of our dataset here, a set of twins will be resampled with replacement
Slope will be calculated based on the bootstrap resample of the original observation
Applying resampling techniques at the dataset will help us understand the differences between data permutation and data bootstrapping
infer package allows us to do repeated sampling from one of two possible models
The generate() function step-permutes the variables so that the null hypothesis is true (which allows for hypothesis testing)

Bootstrapping the data

Use the infer steps to bootstrap the twins data 1000 times. Remember you do not need to hypothesize() because you are creating confidence intervals, not hypothesis testing now!
Use head() to see the first few lines of the dataframe you created

# Calculate 1000 bootstrapped slopes
set.seed(4747)
BS_slope <- twins %>%
  specify(Foster ~ Biological) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope") 

# Look at the head of BS_slope  
head(BS_slope)

## # A tibble: 6 x 2
##   replicate  stat
##       <int> <dbl>
## 1         1 0.946
## 2         2 0.966
## 3         3 0.870
## 4         4 0.930
## 5         5 0.807
## 6         6 0.900

SE method - bootstrap CI for slope

Instructions: - Again, use the infer steps to calculate 1000 bootstrapped BS_slope values - Find the mean and standard deviation of the slopes, and calculate the confidence interval as the mean plus or minus two standard deviations.

# Bootstrap the slopes
set.seed(4747)
BS_slope <- twins %>%
  specify(Foster ~ Biological) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope") 

# Create a confidence interval
BS_slope %>% 
summarize(lower = mean(stat) - 2*sd(stat),
          upper = mean(stat) + 2*sd(stat))

## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1 0.719  1.08

print(BS_slope)

## # A tibble: 1,000 x 2
##    replicate  stat
##        <int> <dbl>
##  1         1 0.946
##  2         2 0.966
##  3         3 0.870
##  4         4 0.930
##  5         5 0.807
##  6         6 0.900
##  7         7 0.862
##  8         8 0.930
##  9         9 0.768
## 10        10 0.835
## # ... with 990 more rows

Percentile method - Boostrap CI for slope

Set alpha to be 0.05
Use the infer steps to get 1000 BS_slope values
Create the confidence interval using quantile for the upper and lower cutoffs determined by alpha

# Set alpha
alpha <- .05

# Bootstrap the slopes
set.seed(4747)
BS_slope <- twins %>%
  specify(Foster ~ Biological) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope")

# Create a confidence interval  
BS_slope %>% 
summarize(low = quantile(stat, alpha / 2), 
          high = quantile(stat, 1 - alpha / 2))

## # A tibble: 1 x 2
##     low  high
##   <dbl> <dbl>
## 1 0.724  1.08

Inference for Linear Regression - Chapter 2 (Simulation-based Inference) COURSE NOTES

Joel Jr Rudinas

August 12, 2018

Introduction

Let us load the necessary library packages

Importing the Data