You may use this document as a starting point for this assignment. Your submission should be written in R and include all points addressed below.
To keep this assignment simple, we are going to use the built-in dataset diamonds, which is included with the ggplot2 package for R.
library(ggplot2)  # diamonds ships with ggplot2
data <- diamonds
summary(data)
##      carat               cut        color        clarity          depth
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00
##                                     J: 2808   (Other): 2531
##      table           price             x                y
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900
##
##        z
##  Min.   : 0.000
##  1st Qu.: 2.910
##  Median : 3.530
##  Mean   : 3.539
##  3rd Qu.: 4.040
##  Max.   :31.800
##
The first thing I am going to do is explain how to create a bootstrap. Bootstrapping is a statistical procedure that resamples a single dataset with replacement to create many simulated samples.
How do you create a bootstrap distribution for the mean? Once you bootstrap, you will have a lot of different data sets. Next, you take the mean of each of those data sets. Let's say we have 10 data sets with 5 data points in each. You take the mean of each of those data sets, so in this case we end up with 10 means. We then graph those 10 means to get our bootstrap distribution for the mean.
I'm going to get started by setting up my bootstrap.
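The toy case described above (10 resampled data sets of 5 points each) can be sketched by hand in base R. This is only an illustration: the values in `x` are made up, and `boot_means` is a hypothetical name that is not used in the analysis that follows.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical data: 5 made-up carat values (an assumption, not taken from diamonds)
x <- c(0.3, 0.5, 0.7, 1.0, 1.5)

# Draw 10 resamples of size 5 with replacement and take the mean of each one
boot_means <- replicate(10, mean(sample(x, size = length(x), replace = TRUE)))

length(boot_means)  # 10 means, one per resampled data set

# Graphing the means gives the bootstrap distribution for the mean
hist(boot_means, main = "Bootstrap distribution of the mean", xlab = "Mean")
```

In practice you would use far more than 10 resamples; the small numbers here just mirror the example in the text.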
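library(boot)
set.seed(42)  # make the resampling reproducible
# Statistic for boot(): mean of the observations picked by the index vector i
samp_mean <- function(x, i) {
  mean(x[i])
}
# Draw 100 bootstrap resamples of carat and compute the mean of each
results <- boot(data$carat, samp_mean, R = 100)
plot(results)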
I will now state my hypotheses. The null hypothesis says that the mean is 1 carat, while the alternative hypothesis says that it is not 1. \[
H_0: \mu = 1 \\
H_A: \mu\neq 1
\] Here I am just getting the sample mean (xbar), the standard error (se), and the confidence interval. I will compute the p-value using a t test, where μ is the value under the null hypothesis (1 carat): \[
t = \frac{\mu-\overline x}{SE}
\]
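xbar <- results$t0    # sample mean of carat from the original data
se <- sd(results$t)   # bootstrap standard error of the mean
c(xbar - 2 * se, xbar + 2 * se)  # approximate 95% confidence interval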
## [1] 0.7940702 0.8018093
So the approximate 95% confidence interval runs from 0.7940702 to 0.8018093.
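The interval above is the estimate plus or minus 2 standard errors, which is a normal approximation. As a rough cross-check, a percentile interval can be read straight off the bootstrap means. The sketch below uses simulated data (`x` is made up and is not the diamonds carat column), and the names `boot_means`, `se_interval`, and `pct_interval` are hypothetical.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated stand-in data (an assumption; not the diamonds carat column)
x <- rexp(500, rate = 1 / 0.8)

# 1000 bootstrap means, each from a resample of x drawn with replacement
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))

# Interval 1: sample mean plus or minus 2 bootstrap standard errors
se <- sd(boot_means)
se_interval <- c(mean(x) - 2 * se, mean(x) + 2 * se)

# Interval 2: middle 95% of the bootstrap means (percentile method)
pct_interval <- quantile(boot_means, probs = c(0.025, 0.975))

se_interval
pct_interval
```

When the bootstrap distribution looks roughly normal, the two intervals should come out close to each other.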
t <- (1 - results$t0) / se
2 * (1 - pt(abs(t), 99))  # two-sided p-value, matching the "not equal" alternative
## [1] 0
We can safely reject the null hypothesis. Google was wrong about the mean being 1 carat, because 1 does not fall inside our 95% confidence interval.
# Cross Validation
The multiple linear regression using carat, depth, and table to predict price is done below.
fit <- lm(price ~ carat + depth + table, data = diamonds)
summary(fit)
##
## Call:
## lm(formula = price ~ carat + depth + table, data = diamonds)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -18288.0   -785.9    -33.2    527.2  12486.7
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13003.441    390.918   33.26   <2e-16 ***
## carat        7858.771     14.151  555.36   <2e-16 ***
## depth        -151.236      4.820  -31.38   <2e-16 ***
## table        -104.473      3.141  -33.26   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1526 on 53936 degrees of freedom
## Multiple R-squared: 0.8537, Adjusted R-squared: 0.8537
## F-statistic: 1.049e+05 on 3 and 53936 DF, p-value: < 2.2e-16
library(caret)  # provides trainControl() and train()
# Set up 10-fold cross-validation
trainc <- trainControl(method = "cv", number = 10)
model2 <- train(price ~ carat + depth + table, data = diamonds,
                method = "lm",
                trControl = trainc)
print(model2)
## Linear Regression
##
## 53940 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 48545, 48547, 48546, 48546, 48546, 48545, ...
## Resampling results:
##
##   RMSE     Rsquared   MAE
##   1526.12  0.8537866  994.4244
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
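What is k-fold? In k-fold cross-validation you divide your data into k pieces and hold back one of the pieces. You then fit your model on the remaining pieces and test the model on the piece that you held back. You then put the pieces back together, hold back a different one this time, and repeat until each of the k pieces has been held back once. We do this to get a test statistic for each held-back piece and then summarize those k statistics, which tells us how well the model predicts data it was not fit on. That is what k-fold is and how it works. In this case k is 10.
Compare the values you get to the original model. In my case the values were an almost perfect match, which was cool: the original model has a residual standard error of 1526 and an R-squared of 0.8537, and cross-validation gives an RMSE of 1526.12 and an R-squared of 0.8537866. The R-squared number tells us that there is a strong relationship, with about 85% of the variation in price explained by the model. What this means is that you can predict price with carat, depth, and table.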
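The steps just described can be written out by hand as a loop. This is a minimal sketch, not what caret does internally: it uses `mtcars` (a small data set that ships with base R) as a stand-in so it runs without extra packages, and the names `fold`, `train_set`, `test_set`, and `rmse` are hypothetical.

```r
set.seed(1)  # arbitrary seed so the fold assignment is reproducible

k <- 10
dat <- mtcars  # stand-in data set from base R (an assumption, not diamonds)

# Randomly assign each row to one of k folds of nearly equal size
fold <- sample(rep(1:k, length.out = nrow(dat)))

rmse <- numeric(k)
for (j in 1:k) {
  train_set <- dat[fold != j, ]  # fit on the other k - 1 pieces
  test_set  <- dat[fold == j, ]  # the piece held back this round
  fit_j <- lm(mpg ~ wt + hp, data = train_set)
  pred <- predict(fit_j, newdata = test_set)
  rmse[j] <- sqrt(mean((test_set$mpg - pred)^2))
}

rmse        # one test RMSE per fold
mean(rmse)  # a single summary of the held-out error
```

Swapping in `diamonds` and the formula `price ~ carat + depth + table` should give a hand-rolled version of the same idea.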