You may use this document as a starting point for this assignment. Your submission should be written in R and include all points addressed below.

Bootstrap

To keep this assignment simple, we are going to use the diamonds dataset, which is included with the ggplot2 package.

library(ggplot2)  # provides the diamonds dataset
data <- diamonds
summary(data)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 
  1. A quick google search suggested that I should purchase a 1 carat diamond as an engagement ring. Using bootstrapping, test the hypothesis that the mean is 1 carat. Explain in words how to create a bootstrap and how to create a bootstrap distribution for the mean. Make sure to state the hypothesis, express a confidence interval, \(p\) value, and state the conclusion in the proper statistical terms for the mean.

First, I will explain how to create a bootstrap. Bootstrapping is a statistical procedure that resamples a single dataset with replacement to create many simulated samples, each the same size as the original dataset.

How do we create a bootstrap distribution for the mean? Once you bootstrap, you will have a lot of different resampled datasets. Next, you take the mean of each of those datasets. Let's say we have 10 resampled datasets with 5 data points each. Taking the mean of each one gives us 10 means. We then graph those 10 means (for example, as a histogram) to get our bootstrap distribution for the mean.
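The steps just described can be sketched by hand in base R. This is only a toy illustration: the vector `carat_toy` holds made-up values (it is not taken from diamonds), and the names `n_boot` and `boot_means` are chosen here for the example.

```r
# Toy illustration: made-up carat values, NOT taken from the diamonds data
set.seed(1)
carat_toy <- c(0.3, 0.5, 0.7, 1.0, 1.5)   # 5 data points
n_boot <- 10                              # 10 bootstrap samples

# Each bootstrap sample draws 5 values from carat_toy with replacement;
# we keep only the mean of each resampled dataset
boot_means <- replicate(n_boot, mean(sample(carat_toy, replace = TRUE)))

# One mean per bootstrap sample: together they form the bootstrap distribution
boot_means
```

Calling `hist(boot_means)` would draw the bootstrap distribution. In practice far more than 10 resamples are used so that the shape of the distribution is stable.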

I will get started by setting up my bootstrap.

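library(boot)
set.seed(42)  # make the resampling reproducible
# Statistic for boot(): the mean of the resampled observations x[i]
samp_mean <- function(x, i) {
  mean(x[i])
}
# Draw R = 100 bootstrap samples of carat and record the mean of each
results <- boot(data$carat, samp_mean, R = 100)
plot(results)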

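I will now state my hypotheses. The null hypothesis says that the mean is 1 carat, while the alternative hypothesis says that it is not 1 carat. \[ H_0: \mu = 1 \\ H_A: \mu\neq 1 \]

Here I compute the sample mean \(\overline x\), the bootstrap standard error \(SE\), and an approximate 95% confidence interval of \(\overline x \pm 2\,SE\). I will then compute the \(p\) value using a t statistic, where \(\mu_0 = 1\) is the hypothesized mean: \[ t = \frac{\mu_0-\overline x}{SE} \]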

xbar <- results$t0       # mean carat in the original sample
se <- sd(results$t)      # bootstrap standard error of the mean
c(xbar - 2 * se, xbar + 2 * se)
## [1] 0.7940702 0.8018093

So the approximate 95% confidence interval for the mean carat runs from 0.7940702 to 0.8018093.
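As a cross-check on the "estimate plus or minus 2 standard errors" interval, a percentile interval can be read directly from the bootstrap distribution. The sketch below is self-contained and uses simulated data as a stand-in for a numeric column (an assumption for illustration, not the diamonds data); the names `normal_ci` and `percentile_ci` are chosen here.

```r
set.seed(42)
# Simulated stand-in for a numeric column (assumption: NOT the diamonds data)
x <- rgamma(500, shape = 2, rate = 2.5)   # population mean is 2 / 2.5 = 0.8

# Bootstrap distribution of the mean from 2000 resamples
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

# Interval 1: estimate +/- 2 bootstrap standard errors
se <- sd(boot_means)
normal_ci <- mean(x) + c(-2, 2) * se

# Interval 2: middle 95% of the bootstrap means (percentile interval)
percentile_ci <- quantile(boot_means, c(0.025, 0.975))

normal_ci
percentile_ci
```

When the bootstrap distribution is roughly symmetric and bell-shaped, the two intervals should be close; a large gap between them would suggest skew in the bootstrap distribution.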

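t <- (1 - results$t0) / se       # distance of the sample mean from 1, in standard errors
# Two-sided p value, because H_A is "not equal". df = 99 follows the 100 bootstrap
# replicates; |t| is above 100 here, so a normal reference gives the same answer.
2 * (1 - pt(abs(t), df = 99))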
## [1] 0
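The \(p\) value can also be estimated from the bootstrap itself rather than from a t distribution. One common approach, sketched below under stated assumptions, shifts the sample so that its mean equals the hypothesized value (so the null hypothesis is true for the shifted data), resamples from it, and counts how often a resampled mean lands at least as far from the hypothesized value as the observed mean did. The data are simulated (not the diamonds data) and the names `x_null`, `null_means`, and `p_value` are chosen for this example.

```r
set.seed(42)
# Simulated stand-in for a numeric column (assumption: NOT the diamonds data)
x <- rgamma(500, shape = 2, rate = 2.5)   # population mean is 0.8
mu0 <- 1                                  # hypothesized mean under H_0

# Shift the sample so its mean is exactly mu0, i.e. so that H_0 holds
x_null <- x - mean(x) + mu0

# Bootstrap distribution of the mean when H_0 is true
null_means <- replicate(2000, mean(sample(x_null, replace = TRUE)))

# Two-sided p value: share of null means at least as extreme as the observed mean
obs_diff <- abs(mean(x) - mu0)
p_value <- mean(abs(null_means - mu0) >= obs_diff)
p_value
```

A proportion of exactly 0 only means that none of the 2000 resamples was as extreme as the observed mean, so it is better reported as "less than 1 in 2000" than as a literal zero.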

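We reject the null hypothesis at the 5% significance level: 1 does not fall inside our 95% confidence interval, and the \(p\) value is effectively 0. There is strong evidence that the mean carat of the diamonds in this dataset is not 1, so the 1 carat figure from Google does not match the mean of this data, which is about 0.80 carat.

Cross Validation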

The multiple linear regression using carat, depth, and table to predict price is done below.

fit <- lm(price ~ carat + depth + table, data = diamonds)
summary(fit)
## 
## Call:
## lm(formula = price ~ carat + depth + table, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18288.0   -785.9    -33.2    527.2  12486.7 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13003.441    390.918   33.26   <2e-16 ***
## carat        7858.771     14.151  555.36   <2e-16 ***
## depth        -151.236      4.820  -31.38   <2e-16 ***
## table        -104.473      3.141  -33.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1526 on 53936 degrees of freedom
## Multiple R-squared:  0.8537, Adjusted R-squared:  0.8537 
## F-statistic: 1.049e+05 on 3 and 53936 DF,  p-value: < 2.2e-16
  1. Repeat this linear model using 10 fold cross validation. Explain in words what you are doing. Examine one of the folds carefully explaining the steps involved. Examine the \(R^2\) value and residual mean standard error. Compare the values you get to the original model.
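library(caret)  # provides trainControl() and train()
# Ask caret for 10-fold cross validation
trainc <- trainControl(method = "cv", number = 10)
# Refit the same linear model, evaluating it on each held-out fold
model2 <- train(price ~ carat + depth + table, data = diamonds,
                method = "lm",
                trControl = trainc)
print(model2)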
## Linear Regression 
## 
## 53940 samples
##     3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 48545, 48547, 48546, 48546, 48546, 48545, ... 
## Resampling results:
## 
##   RMSE     Rsquared   MAE     
##   1526.12  0.8537866  994.4244
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

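What is k-fold cross validation? In k-fold, you divide your data into k pieces (folds) of roughly equal size and hold back one of the pieces. You then fit your model on the remaining k - 1 pieces and test the model on the piece that you held back. You then put the pieces back together, hold back a different piece, and repeat until each of the k pieces has been held back exactly once. We do this to measure how well the model predicts data it was not fit on, and we average the error measures over the k held-out pieces. In this case k is 10. Looking at a single fold: 53940 / 10 = 5394, so about 5,394 rows are held back and about 48,546 rows are used to fit the model, which matches the sample sizes listed in the output above. The RMSE and \(R^2\) for that fold are computed only from the predictions on the held-back rows.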
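To make the individual steps visible, the procedure above can be written as an explicit loop in base R. This is a minimal sketch on simulated data (the data frame `sim` and its columns `x1`, `x2`, `y` are made up for illustration), and it is not claimed to reproduce what `train()` does internally. Here \(R^2\) for a fold is taken as the squared correlation between observed and predicted values on the held-out rows, which is one common choice.

```r
set.seed(42)
# Simulated data with a known linear relationship (assumption: NOT the diamonds data)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 3 + 2 * sim$x1 - sim$x2 + rnorm(n)

k <- 10
# Randomly assign each row to one of k folds of equal size
fold_id <- sample(rep(1:k, length.out = n))

fold_rmse <- numeric(k)
fold_r2 <- numeric(k)
for (j in 1:k) {
  held_out <- sim[fold_id == j, ]   # the piece we hold back
  training <- sim[fold_id != j, ]   # the remaining k - 1 pieces

  fit_j <- lm(y ~ x1 + x2, data = training)     # fit on the training pieces only
  pred <- predict(fit_j, newdata = held_out)    # predict the held-back rows

  fold_rmse[j] <- sqrt(mean((held_out$y - pred)^2))   # out-of-sample RMSE
  fold_r2[j] <- cor(held_out$y, pred)^2               # out-of-sample R squared
}

# Average the per-fold results to get the cross-validated estimates
cv_rmse <- mean(fold_rmse)
cv_r2 <- mean(fold_r2)
c(RMSE = cv_rmse, Rsquared = cv_r2)
```

Printing `fold_rmse` and `fold_r2` shows how much the error measures vary from fold to fold, which is the information a single fit on all the data cannot give.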

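Comparing these values to the original model, the two are an almost perfect match: the cross-validated RMSE is 1526.12 against a residual standard error of 1526, and the cross-validated \(R^2\) is 0.8537866 against 0.8537. The \(R^2\) value tells us that carat, depth, and table together explain about 85% of the variation in price, so price can be predicted reasonably well from these three variables. The close agreement between the two sets of values suggests that the original model is not overfitting.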