The Exercise

Our goal is to simulate data with the following relationship between Y and X: \(Y = 2x + e\)

…where \(e =\) random variable with normal distribution (0, 1).

We will simulate this data such that:
\(E(X) = 2\)
\(V(X) = 4\)

Using expectation operators, we can also compute the following:
\(E(Y) = 4\)
\(V(Y) = 17\)
\(Cov(X,Y) = 8\)
\(Corr(X,Y) = .97\)

Loading Packages & Data

From Personal/Lab Computer

Load tidyverse to get access to ggplot2, dplyr, and tibble. Load broom for easy bootstrapping.

ipak <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg)) 
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}
packages <- c("tidyverse", "broom")
ipak(packages)
# Change ggplot theme to light background
old <- theme_set(theme_light(base_size = 12))

From Lab Computer

library(dplyr)
library(tibble)
library(broom)

Simulating the Data

First we create a dataframe using the tibble() function. This basically creates a dataframe with a more user-friendly output. Each argument for tibble() begins with the column name (or variable) you want to create for your dataframe (in this case “SubjID” to indicate the Subject ID number). Since we are simulating 1,000 subjects – we will set this variable equal to 1 through 1000. Instead of typing this out manually using the c() function (e.g., c(1, 2, 3, 4, …, 1000)) we will use the sequence operator : to create a vector of integers starting with 1 and ending with 1000.

tibble(SubjID = 1:1000)


We can create a second variable simply by adding a new argument. Let’s start by creating a continuous predictor variable XN (X normal) We can do this using the rnorm() function, which takes three arguments: (1) The number of random numbers to generate; (2) The mean of the normal distribution from which to draw from; and (3) The SD of the distribution from which to draw from. Specifically, let’s draw 1,000 numbers (since we have 1,000 subjects) from a normal distribution with mean = 2 and SD = 2.

tibble(SubjID = 1:1000,
       XN = rnorm(1000,2,2))


In addition, let’s create YN (Y normal) which will be equal to 2 times the value of XN plus random noise with mean = 0 and SD = 1.

tibble(SubjID = 1:1000,
       XN = rnorm(1000,2,2),
       YN = 2*XN + rnorm(1000,0,1))


Now let’s assign this dataframe to an object called lab_data. In addition, lets set the random number generator to today’s date so that we can reproduce this exact dataframe when we revisit this page.

set.seed(11092016) # Set seed to today's date
lab_data <- tibble(SubjID = 1:1000,
                   XN = rnorm(1000,2,2),
                   YN = 2*XN + rnorm(1000,0,1))
lab_data

BONUS: Export to SPSS

For those who are more comfortable conducting tests in SPSS, feel free to export your new simulated data into a .csv, which can then be easily imported into SPSS. We can do this with the write.csv() function which takes 3 important arguments: (1) The name of the dataframe you want to export (in this case, lab_data); (2) The name of the file you want to save your dataframe as (unless you specify the full path, it will export this file into your working directory); and (3) Whether you want the first column to indicate the row names of your data (e.g., 1, 2, 3, 4, … 1000). This row.names argument defaults to TRUE (which is often a nuisance) so let’s change that to FALSE so that we do not export row names.

## This is commented out at the moment but simply remove the hashtag in front of the line in order to export the data.
# write.csv(lab_data, "lab09Wdata.csv", row.names=FALSE)

Verifying Data

Now that we’ve simulated our data, let’s make sure that the means and variances match what we would expect from expectation operators.

We should see that E(X) is approximately equal to 2 and V(X) is approximately 4.

mean(lab_data$XN)
[1] 2.046836
var(lab_data$XN)
[1] 3.745371


Success! Now let’s see if E(Y) is approximately equal to 4.5 and V(Y) is approximately 17.

mean(lab_data$YN)
[1] 4.025967
var(lab_data$YN)
[1] 16.14966


Finally, let’s see if the empirical covariance and correlation match the theoretical values:
\(Cov(X, Y) = 8\)
\(Corr(X, Y) = .97\)

We can use the cov() function to compute the covariance and the cor() function to compute the correlation. For both functions, the first two arguments must reference a vector of values, in this case: (1) lab_data\(XN** and (2) **lab_data\)YN.

cov(lab_data$XN, lab_data$YN)
[1] 7.519566
cor(lab_data$XN, lab_data$YN)
[1] 0.96686

BONUS: Efficient Alternative

The syntax below provides a more compact solution to the lab exercise.

set.seed(11092016) # Set seed to today's date
lab_data <- tibble(SubjID = 1:1000) %>% 
  mutate(XN = rnorm(n(),2,2),
         YN = 2*XN + rnorm(n(),0,1),
         XD = if_else(SubjID > 500, 4, 0),
         XND = if_else(XN >= 2, 4, 0))
summarize_at(lab_data, vars(XN, YN), funs(mean))
summarize_at(lab_data, vars(XN, YN), funs(var))
cov(lab_data$XN, lab_data$YN)
[1] 7.519566
cor(lab_data$XN, lab_data$YN)
[1] 0.96686

Computing New Variables

In the homework assignment you will be asked to compute additional variables. We can do this using the mutate() function which takes a dataframe as its first argument, and then each additional argument creates a new variable (much like the syntax for tibble()).

Let’s start by making a dichotomous predictor variable called XD, where the first 500 subjects have XD = 0 and the last 500 subjects have XD = 4. This way, the mean of XD should be roughly equivalent to the mean of XN. We can do this using the ifelse() function, which takes 3 arguments: (1) The conditional statement that either evaluates to TRUE or FALSE; (2) The value returned when the conditional statement is TRUE; and (3) The value returned when the conditional statement is FALSE (This will look familiar to Excel users). So let’s set (1) the conditional statement to SubjID > 500, such that (2) XD will equal 1 for all subject IDs greater than 500 whereas (3) XD will equal 0 for all subject IDs less than or equal to zero.

mutate(lab_data, XD = ifelse(SubjID > 500, 4, 0))


Since we want to add more than one variable, we can do so within one call of the mutate() function by simply adding extra arguments. Let’s create a dichotomized version of our continuous predictor variable (XND) by performing a mean split. Again, we use the ifelse() function such that when XN is greater than it’s mean (2), we set XND equal to 1 (or 0 otherwise). Finally, let’s save this to a new dataframe called lab_data2.

lab_data2 <- mutate(lab_data, 
                    XD = ifelse(SubjID > 500, 4, 0),
                    XND = ifelse(XN >= 2, 4, 0))
lab_data2

We can see our code worked by looking at the output above. For Subject 1, where XN = 3.56 (which is greater than the mean), we see XND = 1. For Subject 2, where XN = 0.66 (which is less than the mean), we see XND = 0.

Regression (OLS)

Let’s begin by predicting our continuous outcome variable (YN) as a function of our continuous predictor (XN). First we create a model named m1 which we assign to the linear model (using lm()) with the formula \(YN = XN\), which we express with the following model notation: YN ~ XN. Finally, we need to tell the lm() function which data we are working with, so we specify lab_data2. We then use the summary() function on the model we just created to print out the main results.

m1 <- lm(YN ~ XN, data=lab_data2)
summary(m1)

Call:
lm(formula = YN ~ XN, data = lab_data2)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6142 -0.7099 -0.0232  0.6928  3.1653 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.08346    0.04726  -1.766   0.0777 .  
XN           2.00770    0.01678 119.637   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.027 on 998 degrees of freedom
Multiple R-squared:  0.9348,    Adjusted R-squared:  0.9348 
F-statistic: 1.431e+04 on 1 and 998 DF,  p-value: < 2.2e-16

How well do these estimates correspond with our structural model \(Y = 2X + e\) ?

Now let’s predict our continuous outcome variable (YN) as a function of our dichotomized predictor variable (XND). This time we will call our model m2 and we will change the formula to YN ~ XND. In addition to printing the summary of this model, we will also use the confint() function to print out the confidence intervals for the regression estimates.

m2 <- lm(YN ~ XND, data=lab_data2)
summary(m2) # Model Summary

Call:
lm(formula = YN ~ XND, data = lab_data2)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.7767 -1.7469 -0.0702  1.7574  8.9486 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.91652    0.11608   7.896 7.57e-15 ***
XND          1.53933    0.04084  37.695  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.583 on 998 degrees of freedom
Multiple R-squared:  0.5874,    Adjusted R-squared:  0.587 
F-statistic:  1421 on 1 and 998 DF,  p-value: < 2.2e-16
confint(m2) # Confidence Intervals
               2.5 %   97.5 %
(Intercept) 0.688737 1.144306
XND         1.459195 1.619464

How well do these estimates correspond with our structural model \(Y = 2X + e\) ?
What do you notice about the confidence interval?
What does this say about using a mean split for a regression?

BONUS: Plotting Residuals

m1 <- lm(YN ~ XN, data=lab_data2)
summary(m1)

Call:
lm(formula = YN ~ XN, data = lab_data2)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6142 -0.7099 -0.0232  0.6928  3.1653 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.08346    0.04726  -1.766   0.0777 .  
XN           2.00770    0.01678 119.637   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.027 on 998 degrees of freedom
Multiple R-squared:  0.9348,    Adjusted R-squared:  0.9348 
F-statistic: 1.431e+04 on 1 and 998 DF,  p-value: < 2.2e-16
# base R
hist(m1$residuals)

## ggplot solution
m1_augment <- augment(m1) # augment() from the broom package
ggplot(m1_augment, aes(.resid)) + geom_histogram(fill="light grey",color="black") + theme_light()

Correlation

Now let’s create a dataset based on the same structural equation, but this time with far fewer observations (15 instead of 1,000).

lab_data3 <- tibble(SubjID = 1:25,
                    XN = rnorm(25, 2, 2),
                    YN = 2*XN + rnorm(25, 0, 1),
                    XND = ifelse(XN >= 2, 1, 0),
                    YND = ifelse(YN >= 4, 1, 0))
lab_data3

Boostrapped Correlation

First let’s create a function to compute a boostrapped confidence interval for a correlation. Don’t worry about knowing the details about how this function works. We just need to run the code below so that we can reference the function in the next section

# Create function
cor.CI.bootstrap <- function(df, var1, var2, nSims=1000, seed=NA) {
  
  if (!is.na(seed)) {set.seed(seed)}
  cat("\n Bootstrapped 95% CI for",var1,"&",var2,"with",nSims,"simulations:")
  cat("\n")
  cat("\n")
  
  df %>% 
    bootstrap(nSims) %>% 
    do(tidy(cor(.[var1], .[var2]))) %>% 
    ungroup() %>% select_(pearson_r = var2) %>% 
    arrange(pearson_r) %>% 
    slice( c(nSims*.025, nSims*.975)) %>% 
    mutate_all(funs(round(.,4))) %>% 
    add_column(bound = c("lower", "upper")) %>% print()
  
}

Now let’s apply this function to our lab data. The cor.CI.boostrap() function we just created takes three mandatory arguments (and one that is optional): (1) The dataframe that you will be working with; (2) One (of the two) variables that you want to correlate; and (3) The variable you want to correlate with the variable you specified in the previous argument. For this function to work you must ensure that the variable names you supply are wrapped in quotation marks (" " or ‘’); otherwise, the function will not work. (NOTE: There’s also an optional argument to specify the seed of the random number generator, so let’s set seed = 11092016 so we’re all on the same page.)

Let’s try this out on the data we’ve just created.

cor.CI.bootstrap(lab_data3, "XN", "YN", seed=11092016)

 Bootstrapped 95% CI for XN & YN with 1000 simulations:


|================================================================================================          | 91% ~0 s remaining     
|===================================================================================================       | 94% ~0 s remaining     
|======================================================================================================    | 96% ~0 s remaining     
|========================================================================================================= |100% ~0 s remaining     

Now let’s compute the traditional confidence interval.

cor.test(lab_data3$XN, lab_data3$YN)

    Pearson's product-moment correlation

data:  lab_data3$XN and lab_data3$YN
t = 13.481, df = 23, p-value = 0.000000000002095
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8714457 0.9745047
sample estimates:
      cor 
0.9421584 

How do these estimates differ from the boostrapped confidence intervals?

Now let’s do the same with our dichotomized predictor and outcome variables (XND, YND).

cor.CI.bootstrap(lab_data3, "XND", "YND", seed=11092016)

 Bootstrapped 95% CI for XND & YND with 1000 simulations:


|=======================================================================================================   | 98% ~0 s remaining     
cor.test(lab_data3$XND, lab_data3$YND)

    Pearson's product-moment correlation

data:  lab_data3$XND and lab_data3$YND
t = 4.2895, df = 23, p-value = 0.0002736
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3686443 0.8404150
sample estimates:
      cor 
0.6666667 

Which of these confidence bounds are more appropriate?

BONUS: More on boostrapping

First lets create a variable called N which will be equal to the sample size of our most recent data simulation (25 observations). We can do this using the nrow() function which will return the number of rows in our dataframe lab_data3.

Now lets use the sample() function to randomly sample the indices (or subject IDs) of these observations. We provide the following argument: (1) the sequence of numbers we want to sample, (2) the number of times we want to sample, and (3) whether we want to sample with replacement. When we set replace = FALSE, notice that each index (1 - 25) is sampled exactly once.

Unsurprisingly, if we were to sample our data (without replacement) 25 times, and created a bar chart for the number of times each subject was sampled… we would see a row of bars = 1.

N <- nrow(lab_data3)
1:N  #ordered sequence
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
sample(1:N, N, replace = FALSE) #randomly sampled sequence
 [1] 25 18 20 22  9 10 15  6  3 24 21 16 13 14  1  7 11 12  4 19  5  8  2 23 17
slice(lab_data3, sample(1:N, N, replace = FALSE)) %>% ggplot(aes(SubjID)) + geom_bar() + xlim(0,25)

Now let’s set replace = TRUE and see what happens. Notice we notice that some subjects might get sampled more than once, whereas others might never get sampled.

1:N  #ordered sequence
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
idx <- sample(1:N, N, replace = TRUE)
idx
 [1] 17 21  2  8  4 12 10 14 22 14 15 16 18 13 20 23  9 24 21  1  1  5  2  4 11
slice(lab_data3,idx) %>% ggplot(aes(SubjID)) + geom_bar() + xlim(0,25)

Here’s how our resampled data would look.

lab_data3_r1 <- slice(lab_data3, idx)
lab_data3_r1

For a correlation

The blue lines are the standard confidence bounds whereas the red lines are the boostrapped confidence bounds.

  
cor_bs %>% arrange(replicate)
cor_bs %>% 
  ggplot(aes(pearson_r)) + 
  geom_histogram(fill="light grey", color="black") + 
  geom_vline(aes(xintercept=pearson_r), data= slice(cor_bs,25), color="red", linetype="dashed") + 
  geom_vline(aes(xintercept=pearson_r), data= slice(cor_bs,975), color="red", linetype="dashed") + 
  geom_vline(aes(xintercept=conf.low), data= cor_reg, color="blue", linetype="dashed") + 
  geom_vline(aes(xintercept=conf.high), data= cor_reg, color="blue", linetype="dashed")

Here’s what the regression line would reveal for this resampled data.

lab_data3 %>% 
  slice(sample(1:n(), n(), replace = TRUE)) %>% 
  ggplot(aes(XND, YND)) + geom_jitter(width=.1, height=.1) + geom_smooth(method=lm, se=F)

Here’s what the regression lines would look like if we repeated this procedure 1,000 times.

bootnls_aug <- lab_data3 %>% bootstrap(1000) %>%
    do(augment(lm(YND ~ XND, data=.)))
ggplot(bootnls_aug, aes(XND, YND)) + geom_point() +
    geom_line(aes(y=.fitted, group=replicate), alpha=.1, color="blue")

---
title: "Lab10W"
author: "Julian Wills"
date: "November 9th, 2016"
output:
  html_notebook:
    theme: readable
    toc: yes
---

## The Exercise

Our goal is to simulate data with the following relationship between Y and X:
$Y = 2x + e$  

...where $e =$ random variable with normal distribution (0, 1).

We will simulate this data such that:   
$E(X) = 2$  
$V(X) = 4$

Using expectation operators, we can also compute the following:  
$E(Y) = 4$    
$V(Y) = 17$  
$Cov(X,Y) = 8$    
$Corr(X,Y) = .97$  

## Loading Packages & Data

### From Personal/Lab Computer
Load **tidyverse** to get access to ggplot2, dplyr, and tibble. Load **broom** for easy bootstrapping. 
```{r, results='hide', message=FALSE}
ipak <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg)) 
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}

packages <- c("tidyverse", "broom")
ipak(packages)

# Change ggplot theme to light background
old <- theme_set(theme_light(base_size = 12))
```

### From Lab Computer

```{r}
library(dplyr)
library(tibble)
library(broom)
```


## Simulating the Data

First we create a dataframe using the **tibble()** function. This basically creates a dataframe with a more user-friendly output. Each argument for tibble() begins with the column name (or *variable*) you want to create for your dataframe (in this case "SubjID" to indicate the Subject ID number). Since we are simulating 1,000 subjects -- we will set this variable equal to 1 through 1000. Instead of typing this out manually using the c() function (e.g., c(1, 2, 3, 4, ..., 1000)) we will use the sequence operator **:** to create a vector of integers starting with 1 and ending with 1000. 
```{r}

tibble(SubjID = 1:1000)

```

<br />
We can create a second variable simply by adding a new argument. Let's start by creating a continuous predictor variable **XN** (X normal) We can do this using the **rnorm()** function, which takes three arguments: (1) The number of random numbers to generate; (2) The mean of the normal distribution from which to draw from; and (3) The SD of the distribution from which to draw from. Specifically, let's draw **1,000** numbers (since we have 1,000 subjects) from a normal distribution with mean = 2 and SD = 2. 

```{r}
tibble(SubjID = 1:1000,
       XN = rnorm(1000,2,2))

```

<br />
In addition, let's create **YN** (Y normal) which will be equal to 2 times the value of XN plus random noise with mean = 0 and SD = 1. 
```{r}
tibble(SubjID = 1:1000,
       XN = rnorm(1000,2,2),
       YN = 2*XN + rnorm(1000,0,1))

```

<br />
Now let's assign this dataframe to an object called **lab_data**. In addition, lets set the random number generator to today's date so that we can reproduce this exact dataframe when we revisit this page. 
```{r}

set.seed(11092016) # Set seed to today's date

lab_data <- tibble(SubjID = 1:1000,
                   XN = rnorm(1000,2,2),
                   YN = 2*XN + rnorm(1000,0,1))
lab_data

```


### BONUS: Export to SPSS

For those who are more comfortable conducting tests in SPSS, feel free to export your new simulated data into a .csv, which can then be easily imported into SPSS. We can do this with the **write.csv()** function which takes 3 important arguments: (1) The name of the dataframe you want to export (in this case, *lab_data*); (2) The name of the file you want to save your dataframe as (unless you specify the full path, it will export this file into your working directory); and (3) Whether you want the first column to indicate the row names of your data (e.g., 1, 2, 3, 4, ... 1000). This row.names argument defaults to TRUE (which is often a nuisance) so let's change that to FALSE so that we do not export row names. 
```{r}

## This is commented out at the moment but simply remove the hashtag in front of the line in order to export the data.

# write.csv(lab_data, "lab09Wdata.csv", row.names=FALSE)

```


## Verifying Data

Now that we've simulated our data, let's make sure that the means and variances match what we would expect from expectation operators.  

We should see that E(X) is approximately equal to 2 and V(X) is approximately 4. 
```{r}

mean(lab_data$XN)
var(lab_data$XN)

```

<br />
Success! Now let's see if E(Y) is approximately equal to 4.5 and V(Y) is approximately 17. 
```{r}

mean(lab_data$YN)
var(lab_data$YN)

```

<br />
Finally, let's see if the empirical covariance and correlation match the theoretical values:  
$Cov(X, Y) = 8$  
$Corr(X, Y) = .97$  

We can use the **cov()** function to compute the covariance and the **cor()** function to compute the correlation. For both functions, the first two arguments must reference a vector of values, in this case: (1) **lab_data$XN** and (2) **lab_data$YN**. 
```{r}

cov(lab_data$XN, lab_data$YN)
cor(lab_data$XN, lab_data$YN)

```


### BONUS: Efficient Alternative

The syntax below provides a more compact solution to the lab exercise. 
```{r}

set.seed(11092016) # Set seed to today's date

lab_data <- tibble(SubjID = 1:1000) %>% 
  mutate(XN = rnorm(n(),2,2),
         YN = 2*XN + rnorm(n(),0,1),
         XD = if_else(SubjID > 500, 4, 0),
         XND = if_else(XN >= 2, 4, 0))

summarize_at(lab_data, vars(XN, YN), funs(mean))
summarize_at(lab_data, vars(XN, YN), funs(var))

cov(lab_data$XN, lab_data$YN)
cor(lab_data$XN, lab_data$YN)

```


## Computing New Variables

In the homework assignment you will be asked to compute additional variables. We can do this using the **mutate()** function which takes a dataframe as its first argument, and then each additional argument creates a new variable (much like the syntax for **tibble()**). 

Let's start by making a dichotomous predictor variable called **XD**, where the first 500 subjects have XD = 0 and the last 500 subjects have XD = 4. This way, the mean of XD should be roughly equivalent to the mean of XN. We can do this using the **ifelse()** function, which takes 3 arguments: (1) The *conditional statement* that either evaluates to TRUE or FALSE; (2) The value returned when the conditional statement is TRUE; and (3) The value returned when the conditional statement is FALSE (This will look familiar to Excel users). So let's set (1) the conditional statement to **SubjID > 500**, such that (2) XD will equal 1 for all subject IDs greater than 500 whereas (3) XD will equal 0 for all subject IDs less than or equal to zero. 
```{r}
mutate(lab_data, XD = ifelse(SubjID > 500, 4, 0))
```

<br />
Since we want to add more than one variable, we can do so within one *call* of the mutate() function by simply adding extra arguments. Let's create a dichotomized version of our continuous predictor variable (**XND**) by performing a mean split. Again, we use the **ifelse()** function such that when XN is greater than it's mean (2), we set XND equal to 1 (or 0 otherwise). Finally, let's save this to a new dataframe called **lab_data2**. 
```{r}

lab_data2 <- mutate(lab_data, 
                    XD = ifelse(SubjID > 500, 4, 0),
                    XND = ifelse(XN >= 2, 4, 0))

lab_data2

```

We can see our code worked by looking at the output above. For Subject 1, where XN = 3.56 (which is greater than the mean), we see XND = 1. For Subject 2, where XN = 0.66 (which is less than the mean), we see XND = 0. 

## Regression (OLS)

Let's begin by predicting our continuous outcome variable (YN) as a function of our continuous predictor (XN). First we create a model named **m1** which we assign to the linear model (using **lm()**) with the formula $YN = XN$, which we express with the following model notation: **YN ~ XN**. Finally, we need to tell the lm() function which **data** we are working with, so we specify *lab_data2*. We then use the **summary()** function on the model we just created to print out the main results. 
```{r}

m1 <- lm(YN ~ XN, data=lab_data2)
summary(m1)

```
How well do these estimates correspond with our structural model $Y = 2X + e$ ?

Now let's predict our continuous outcome variable (YN) as a function of our dichotomized predictor variable (XND). This time we will call our model **m2** and we will change the formula to **YN ~ XND**. In addition to printing the summary of this model, we will also use the **confint()** function to print out the confidence intervals for the regression estimates. 
```{r}

m2 <- lm(YN ~ XND, data=lab_data2)
summary(m2) # Model Summary
confint(m2) # Confidence Intervals

```
How well do these estimates correspond with our structural model $Y = 2X + e$ ?  
What do you notice about the confidence interval?  
What does this say about using a mean split for a regression?  

### BONUS: Plotting Residuals 

```{r}

m1 <- lm(YN ~ XN, data=lab_data2)
summary(m1)

# base R
hist(m1$residuals)

## ggplot solution
m1_augment <- augment(m1) # augment() from the broom package
ggplot(m1_augment, aes(.resid)) + geom_histogram(fill="light grey",color="black") + theme_light()

```

## Correlation

Now let's create a dataset based on the same structural equation, but this time with far fewer observations (15 instead of 1,000).
```{r}
lab_data3 <- tibble(SubjID = 1:25,
                    XN = rnorm(25, 2, 2),
                    YN = 2*XN + rnorm(25, 0, 1),
                    XND = ifelse(XN >= 2, 1, 0),
                    YND = ifelse(YN >= 4, 1, 0))


lab_data3

```


### Boostrapped Correlation

First let's create a function to compute a boostrapped confidence interval for a correlation. Don't worry about knowing the details about how this function works. We just need to run the code below so that we can reference the function in the next section
```{r}

# Create function
cor.CI.bootstrap <- function(df, var1, var2, nSims=1000, seed=NA) {
  
  if (!is.na(seed)) {set.seed(seed)}
  cat("\n Bootstrapped 95% CI for",var1,"&",var2,"with",nSims,"simulations:")
  cat("\n")
  cat("\n")
  
  df %>% 
    bootstrap(nSims) %>% 
    do(tidy(cor(.[var1], .[var2]))) %>% 
    ungroup() %>% select_(pearson_r = var2) %>% 
    arrange(pearson_r) %>% 
    slice( c(nSims*.025, nSims*.975)) %>% 
    mutate_all(funs(round(.,4))) %>% 
    add_column(bound = c("lower", "upper")) %>% print()
  
}

```
  

Now let's apply this function to our lab data. The **cor.CI.boostrap()** function we just created takes three mandatory arguments (and one that is optional): (1) The dataframe that you will be working with; (2) One (of the two) variables that you want to correlate; and (3) The variable you want to correlate with the variable you specified in the previous argument. For this function to work you *must* ensure that the variable names you supply are wrapped in quotation marks (" " or ' '); otherwise, the function will not work. (NOTE: There's also an optional argument to specify the *seed* of the random number generator, so let's set **seed = 11092016** so we're all on the same page.)

Let's try this out on the data we've just created. 
```{r}

cor.CI.bootstrap(lab_data3, "XN", "YN", seed=11092016)

```
  
  
Now let's compute the traditional confidence interval. 
```{r}

cor.test(lab_data3$XN, lab_data3$YN)

```
How do these estimates differ from the boostrapped confidence intervals?   


Now let's do the same with our dichotomized predictor and outcome variables (**XND, YND**). 
```{r}
cor.CI.bootstrap(lab_data3, "XND", "YND", seed=11092016)
cor.test(lab_data3$XND, lab_data3$YND)

```
Which of these confidence bounds are more appropriate?   


## BONUS: More on boostrapping


First lets create a variable called **N** which will be equal to the sample size of our most recent data simulation (25 observations). We can do this using the **nrow()** function which will return the number of rows in our dataframe *lab_data3*. 

Now lets use the **sample()** function to randomly sample the indices (or subject IDs) of these observations. We provide the following argument: (1) the sequence of numbers we want to sample, (2) the number of times we want to sample, and (3) whether we want to sample with replacement. When we set **replace = FALSE**, notice that each index (1 - 25) is sampled exactly once. 

Unsurprisingly, if we were to sample our data (without replacement) 25 times, and created a bar chart for the number of times each subject was sampled... we would see a row of bars = 1. 
```{r}
N <- nrow(lab_data3)
1:N  #ordered sequence
sample(1:N, N, replace = FALSE) #randomly sampled sequence

slice(lab_data3, sample(1:N, N, replace = FALSE)) %>% ggplot(aes(SubjID)) + geom_bar() + xlim(0,25)

```
  
  
Now let's set **replace = TRUE** and see what happens. Notice we notice that some subjects might get sampled more than once, whereas others might never get sampled. 
```{r}
1:N  #ordered sequence
idx <- sample(1:N, N, replace = TRUE)
idx

slice(lab_data3,idx) %>% ggplot(aes(SubjID)) + geom_bar() + xlim(0,25)
```
  
 
Here's how our resampled data would look. 
```{r}
lab_data3_r1 <- slice(lab_data3, idx)
lab_data3_r1
```
    


For a correlation
```{r, result='hide'}
cor_bs <- lab_data3 %>% 
  bootstrap(1000) %>% 
  do(tidy(cor(.$YND, .$XND))) %>% 
  rename(pearson_r = x) %>% 
  ungroup() %>% 
  arrange(pearson_r)

cor_reg <- cor.test(lab_data3$XND, lab_data3$YND) %>% tidy()
```

The blue lines are the standard confidence bounds whereas the red lines are the boostrapped confidence bounds. 
```{r}
  
cor_bs %>% arrange(replicate)
cor_bs %>% 
  ggplot(aes(pearson_r)) + 
  geom_histogram(fill="light grey", color="black") + 
  geom_vline(aes(xintercept=pearson_r), data= slice(cor_bs,25), color="red", linetype="dashed") + 
  geom_vline(aes(xintercept=pearson_r), data= slice(cor_bs,975), color="red", linetype="dashed") + 
  geom_vline(aes(xintercept=conf.low), data= cor_reg, color="blue", linetype="dashed") + 
  geom_vline(aes(xintercept=conf.high), data= cor_reg, color="blue", linetype="dashed")

```


Here's what the regression line would reveal for this resampled data. 
```{r}
lab_data3 %>% 
  slice(sample(1:n(), n(), replace = TRUE)) %>% 
  ggplot(aes(XND, YND)) + geom_jitter(width=.1, height=.1) + geom_smooth(method=lm, se=F)
```
  
  
Here's what the regression lines would look like if we repeated this procedure 1,000 times. 

```{r, results='hide'}
bootnls_aug <- lab_data3 %>% bootstrap(1000) %>%
    do(augment(lm(YND ~ XND, data=.)))
```

```{r}

ggplot(bootnls_aug, aes(XND, YND)) + geom_point() +
    geom_line(aes(y=.fitted, group=replicate), alpha=.1, color="blue")

```
