You may use this document as a starting point for this assignment. Your submission should be written in R and should address every point below.

Bootstrap

To keep this assignment simple, we are going to use the diamonds dataset, which ships with the ggplot2 package for R.

library(ggplot2)  # provides the diamonds dataset
summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 
  1. A quick Google search suggested that I should purchase a 1 carat diamond for an engagement ring. Using bootstrapping, test the hypothesis that the mean carat weight of the diamonds in this dataset is 1 carat. Explain in words how to draw a bootstrap sample and how to build a bootstrap distribution for the mean. Make sure to state the null and alternative hypotheses, report a confidence interval and a \(p\)-value, and state the conclusion about the mean in proper statistical terms.
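
As a starting point, one possible way to organize the bootstrap is sketched below. It is not the required solution, only a minimal outline under stated assumptions: the function name bootstrap_mean_test and its arguments (mu0, B, conf) are made up for illustration, the confidence interval uses the percentile method, and the \(p\)-value comes from resampling data that have been shifted so their mean equals the hypothesized value. Other valid approaches exist.

```r
# Sketch: bootstrap test of H0: mu = mu0 against Ha: mu != mu0.
# All names here (bootstrap_mean_test, mu0, B, conf) are illustrative.
bootstrap_mean_test <- function(x, mu0, B = 2000, conf = 0.95) {
  n <- length(x)
  obs_mean <- mean(x)

  # One bootstrap sample = n values drawn from x WITH replacement.
  # The bootstrap distribution is the collection of B such sample means.
  boot_means <- replicate(B, mean(sample(x, n, replace = TRUE)))

  # Percentile confidence interval from the bootstrap distribution.
  alpha <- 1 - conf
  ci <- unname(quantile(boot_means, c(alpha / 2, 1 - alpha / 2)))

  # To get a p-value, resample from data shifted so that H0 is true.
  x_null <- x - obs_mean + mu0
  null_means <- replicate(B, mean(sample(x_null, n, replace = TRUE)))

  # Two-sided p-value; the +1 terms keep it from being exactly zero.
  extreme <- abs(null_means - mu0) >= abs(obs_mean - mu0)
  p_value <- (sum(extreme) + 1) / (B + 1)

  list(observed_mean = obs_mean, conf_int = ci,
       p_value = p_value, boot_means = boot_means)
}

# Apply it to carat, assuming the ggplot2 package is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  set.seed(1)  # arbitrary seed, only for reproducibility
  res <- bootstrap_mean_test(ggplot2::diamonds$carat, mu0 = 1)
  print(res[c("observed_mean", "conf_int", "p_value")])
}
```

A histogram of res$boot_means is a useful check on the shape of the bootstrap distribution. The written explanation and the conclusion in statistical terms still need to come from you.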

Cross Validation

The multiple linear regression using carat, depth, and table to predict price is done below.

fit <- lm(price ~ carat + depth + table, data = diamonds)
summary(fit)
## 
## Call:
## lm(formula = price ~ carat + depth + table, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18288.0   -785.9    -33.2    527.2  12486.7 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13003.441    390.918   33.26   <2e-16 ***
## carat        7858.771     14.151  555.36   <2e-16 ***
## depth        -151.236      4.820  -31.38   <2e-16 ***
## table        -104.473      3.141  -33.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1526 on 53936 degrees of freedom
## Multiple R-squared:  0.8537, Adjusted R-squared:  0.8537 
## F-statistic: 1.049e+05 on 3 and 53936 DF,  p-value: < 2.2e-16
  1. Refit this linear model using 10-fold cross-validation. Explain in words what you are doing. Examine one of the folds carefully, explaining the steps involved. Report the \(R^2\) value and the residual standard error, and compare them to the values from the original model.
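
A hand-rolled version of k-fold cross-validation in base R might look like the sketch below. It is one possible outline, not the expected answer: the helper name cv_lm and the column names of its result are hypothetical, folds are assigned at random, and the error reported for each fold is the root mean squared error on the held-out rows. That quantity is close to, but not computed the same way as, the residual standard error printed by summary(fit), which divides by the residual degrees of freedom.

```r
# Sketch: k-fold cross-validation for a linear model, using base R only.
# cv_lm and its output columns are illustrative names.
cv_lm <- function(formula, data, k = 10) {
  n <- nrow(data)
  # Randomly assign every row to one of k folds of (nearly) equal size.
  fold_id <- sample(rep(seq_len(k), length.out = n))
  response <- all.vars(formula)[1]  # name of the response variable

  per_fold <- lapply(seq_len(k), function(i) {
    train <- data[fold_id != i, , drop = FALSE]  # fit on the other k - 1 folds
    test  <- data[fold_id == i, , drop = FALSE]  # hold out fold i
    fit    <- lm(formula, data = train)
    pred   <- predict(fit, newdata = test)
    actual <- test[[response]]
    sse <- sum((actual - pred)^2)
    sst <- sum((actual - mean(actual))^2)
    data.frame(fold = i, n_test = nrow(test),
               rmse = sqrt(sse / nrow(test)),  # out-of-sample RMSE
               r_squared = 1 - sse / sst)      # out-of-sample R^2
  })
  do.call(rbind, per_fold)
}

# Apply it to the model above, assuming the ggplot2 package is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  set.seed(1)  # arbitrary seed, only for reproducibility
  cv <- cv_lm(price ~ carat + depth + table, ggplot2::diamonds, k = 10)
  print(cv)
  print(colMeans(cv[, c("rmse", "r_squared")]))
}
```

To examine a single fold, pull out one row of cv and trace the same steps by hand: which rows were held out, which model was fit on the rest, and how the held-out predictions compare with the actual prices.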