The link to the previous part of this project that involved regression can be found here: https://rpubs.com/bekkahmoore/738944 Moving on!
I will continue using the same data, but will now apply bootstrapping to it! Informally, my hunch was that the mean marriage rate of the younger couples (\(\mu_M\)) was higher than the mean marriage rate of the older couples (\(\mu_W\)), i.e. that younger marriage was more popular. Stated in proper statistical terms, the null hypothesis is that the two means are equal, and the alternative is that they differ: \[ \begin{align} H_0:\ \mu_M=\mu_W\\ H_A:\ \mu_M\neq\mu_W. \end{align} \] I’ll now implement bootstrap techniques to compare against my original results.
library(boot)   # bootstrap resampling functions: boot() and boot.ci()
set.seed(42)

# Statistic for boot(): the mean of the resampled values
sample_mean <- function(x, i) {
  mean(x[i])
}

old_age <- marriage_data1$Married.ages.32.34..pct.
young_age <- marriage_data1$Married.ages.23.25..pct.

# 100 bootstrap replicates of each group's mean
old_results <- boot(old_age, sample_mean, R = 100)
plot(old_results)
young_results <- boot(young_age, sample_mean, R = 100)
plot(young_results)
The means for both are actually pretty spot on to those I obtained in Project 0. Let me get the confidence intervals for the bootstraps and compare them to the Welch two-sample t-test I ran in Project 0.
boot.ci(old_results, type = "perc")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = old_results, type = "perc")
##
## Intervals :
## Level Percentile
## 95% ( 0.5797, 0.5944 )
## Calculations and Intervals on Original Scale
## Some percentile intervals may be unstable
boot.ci(young_results, type = "perc")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = young_results, type = "perc")
##
## Intervals :
## Level Percentile
## 95% ( 0.1249, 0.1374 )
## Calculations and Intervals on Original Scale
## Some percentile intervals may be unstable
Wow! The bootstrap percentile intervals are far narrower than the interval from my regular hypothesis test, which ran from 0.2948 to 0.8384. The comparison isn’t quite apples to apples, though: the Welch two-sample test reports an interval for the difference between the two means, while each interval here is for a single group’s mean, and a confidence interval for a mean describes uncertainty about the mean itself, not where 95% of the individual observations fall. The useful takeaway is that the older group’s interval (0.5797 to 0.5944, with the bootstrap mean of about 0.587 sitting right inside it) and the younger group’s interval (0.1249 to 0.1374) are far apart and do not overlap. So I reach the same conclusion as before and reject the null hypothesis of equal means: the younger mean is NOT higher than the older mean, it is clearly lower.
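For a quick side-by-side check (this wasn’t in my original write-up; it just re-runs the Welch test on the same two vectors defined above):

# Welch two-sample t-test on the same vectors used for the bootstraps. Its
# confidence interval is for the *difference* in means, so it is not directly
# comparable to either single-group percentile interval above.
t.test(old_age, young_age)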
I’ll now compute the \(p\) value for the test statistic.
# Statistic for boot(): label the rows with a resampled copy of the older-age
# column, average the younger-age column within each label, and return the
# difference between the first two of those group averages.
diff2 <- function(d, i) {
  d$group <- d$Married.ages.32.34..pct.[i]
  Mean <- tapply(X = d$Married.ages.23.25..pct., INDEX = d$group, mean)
  Diff <- Mean[1] - Mean[2]
  Diff
}
diffboot <- boot(data = data, statistic = diff2, R = 100)
print(diffboot)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data, statistic = diff2, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -2.2e-06 0.02030441 0.1205957
# Proportion of bootstrap replicates at least as extreme as the observed statistic
mean(abs(diffboot$t) > abs(diffboot$t0))
## [1] 1
Okay, it looks like I got a \(p\) value of 1… but I’m not entirely sure I set that statistic up correctly…
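Here is a minimal sketch of one way I could redo this: bootstrap the difference in means directly by resampling rows of marriage_data1, then form a percentile interval and an approximate \(p\) value by centering the bootstrap distribution at the null of no difference. (The mean_diff helper and the centering step are my own illustrative choices here, not code from my original project.)

# Statistic: difference in means between the older-age and younger-age columns,
# computed on a row-resampled copy of the data
mean_diff <- function(d, i) {
  d2 <- d[i, ]
  mean(d2$Married.ages.32.34..pct.) - mean(d2$Married.ages.23.25..pct.)
}

set.seed(42)
diff_boot <- boot(data = marriage_data1, statistic = mean_diff, R = 1000)
boot.ci(diff_boot, type = "perc")   # percentile CI for the difference in means

# Approximate two-sided p value: center the bootstrap distribution at zero
# (the null of equal means) and count replicates at least as extreme as the
# observed difference
centered <- diff_boot$t - mean(diff_boot$t)
mean(abs(centered) >= abs(diff_boot$t0))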
# Nonparametric Statistic
Let’s look at the medians of our two age groups and see whether they are different. The null hypothesis is that the two medians are equal; since the mean of the older group is higher, I expect its median to come out higher as well.
median(marriage_data1$Married.ages.32.34..pct.)
## [1] 0.5993488
median(marriage_data1$Married.ages.23.25..pct.)
## [1] 0.1115676
Those sample medians are actually very different. Let’s run them through the bootstrap exactly as we did above.
# Statistic for boot(): the median of the resampled values
sample_median <- function(x, i) {
  median(x[i])
}
bootOldMedian <- boot(marriage_data1$Married.ages.32.34..pct., sample_median, R = 100)
boot.ci(bootOldMedian, type = "perc")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootOldMedian, type = "perc")
##
## Intervals :
## Level Percentile
## 95% ( 0.5924, 0.6064 )
## Calculations and Intervals on Original Scale
## Some percentile intervals may be unstable
bootYoungMedian <- boot(marriage_data1$Married.ages.23.25..pct., sample_median, R = 100)
boot.ci(bootYoungMedian, type = "perc")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootYoungMedian, type = "perc")
##
## Intervals :
## Level Percentile
## 95% ( 0.1036, 0.1194 )
## Calculations and Intervals on Original Scale
## Some percentile intervals may be unstable
Once again the bootstrap did a nice job: each sample median falls comfortably inside its group’s 95% percentile interval (and, as with the means, these intervals describe uncertainty about the median itself, not where 95% of the data fall). Let me compute the \(p\) value here as well, very much like I did earlier.
# Same idea as diff2, but with medians: label the rows with a resampled copy of
# the older-age column and take the difference between the first two group
# medians of the younger-age column.
diff3 <- function(d, i) {
  d$group <- d$Married.ages.32.34..pct.[i]
  Median <- tapply(X = d$Married.ages.23.25..pct., INDEX = d$group, median)
  Diff <- Median[1] - Median[2]
  Diff
}
diffmedian <- boot(data = data, statistic = diff3, R = 100)
print(diffmedian)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data, statistic = diff3, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -2.2e-06 -0.01826726 0.126884
# Proportion of bootstrap replicates at least as extreme as the observed statistic
mean(abs(diffmedian$t) > abs(diffmedian$t0))
## [1] 1
Hmmmm, I got 1 again, so I have a strong feeling that something isn’t right with my \(p\) value calculations… But just by comparing the confidence intervals for the medians, it’s clear that the older age group once again comes out higher. The two intervals don’t overlap at all, so the data do not support equal medians: the older group really does have the higher median, and there is data to back up that claim!
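Since this section is about nonparametric statistics, one standard cross-check I could add (it wasn’t part of my original analysis) is the Wilcoxon rank-sum test on the same two columns:

# Wilcoxon rank-sum (Mann-Whitney) test: a nonparametric comparison of the two
# age groups that does not rely on means or normality
wilcox.test(marriage_data1$Married.ages.32.34..pct.,
            marriage_data1$Married.ages.23.25..pct.)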
# Cross Validation
I’ll now perform cross-validation, starting with the two-thirds split method. First, I need a model to validate: I want to revisit the idea of the older group’s marriage rate being a predictor of the younger group’s.
Rate_fit <- lm(Married.ages.23.25..pct. ~ Married.ages.32.34..pct., data = marriage_data1)
I really struggled to get the 66% split cross-validation to run, so for the sake of knitting this to a PDF successfully, I will omit my attempt and move on to 10-fold cross-validation.
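For completeness, here is a minimal sketch of roughly what that two-thirds split could look like (this is not the code I struggled with; train_idx, split_fit, and the RMSE at the end are illustrative choices, assuming marriage_data1 contains the two percentage columns):

set.seed(42)
n <- nrow(marriage_data1)
train_idx <- sample(n, size = round(2 * n / 3))   # two-thirds of the rows for training

train <- marriage_data1[train_idx, ]
test  <- marriage_data1[-train_idx, ]

split_fit  <- lm(Married.ages.23.25..pct. ~ Married.ages.32.34..pct., data = train)
split_pred <- predict(split_fit, newdata = test)

# Root mean squared prediction error on the held-out third
sqrt(mean((test$Married.ages.23.25..pct. - split_pred)^2))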
Just a rundown of how the process is going to work:

* I randomly shuffle my data
* I split the shuffled data into 10 even folds
* Each fold then takes a turn as the held-out test set while the model is trained on the other nine
Let’s do it!
# Shuffle the rows, then assign each row to one of 10 folds
Data_CV <- marriage_data1[sample(nrow(marriage_data1)), ]
folds <- cut(seq(1, nrow(marriage_data1)), breaks = 10, labels = FALSE)
# Note: testData and trainData are overwritten on every pass, so only the
# split from the final fold survives after the loop
for (i in 1:10) {
  testIndexes <- which(folds == i, arr.ind = TRUE)
  testData <- Data_CV[testIndexes, ]
  trainData <- Data_CV[-testIndexes, ]
}
# Because the formula references marriage_data1$... directly, lm() ends up
# fitting on the full data set rather than on trainData
model <- lm(marriage_data1$Married.ages.23.25..pct. ~ marriage_data1$Married.ages.32.34..pct., data = trainData)
summary(model)
##
## Call:
## lm(formula = marriage_data1$Married.ages.23.25..pct. ~ marriage_data1$Married.ages.32.34..pct.,
## data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.13979 -0.05930 -0.01189 0.04819 0.30733
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.15967 0.01518 -10.52 <2e-16
## marriage_data1$Married.ages.32.34..pct. 0.49471 0.02542 19.46 <2e-16
##
## (Intercept) ***
## marriage_data1$Married.ages.32.34..pct. ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07765 on 818 degrees of freedom
## Multiple R-squared: 0.3165, Adjusted R-squared: 0.3157
## F-statistic: 378.8 on 1 and 818 DF, p-value: < 2.2e-16
Let’s look at the confidence intervals for this model’s coefficients.
confint.lm(model)
## 2.5 % 97.5 %
## (Intercept) -0.1894745 -0.1298627
## marriage_data1$Married.ages.32.34..pct. 0.4448218 0.5446040
At first glance something seems off here, because I don’t have any negative marriage rates in my data set. But these are 95% confidence intervals for the two coefficients (the 2.5 % and 97.5 % columns are just the lower and upper bounds), not intervals for the marriage rates themselves. The negative values belong to the intercept, which is the predicted younger-group rate when the older-group rate is zero, an extrapolation well outside my data, so a negative value there is nothing to worry about. The slope’s interval, roughly 0.44 to 0.54, is the part that actually describes the relationship between the two rates.
Let me plot it so we can get a clear visual of the data.
ggplot(data = Data_CV, mapping = aes(x = Married.ages.32.34..pct., y = Married.ages.23.25..pct., color = Married.ages.23.25..pct.)) +
  geom_jitter() +
  geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
I’ll make some predictions on this testing set and see how well it works.
# Because the model formula referenced marriage_data1$... directly, predict()
# cannot match the columns of testData and falls back to the full 820 rows,
# which is what the warning below is about
pre <- predict(model, testData)
## Warning: 'newdata' had 82 rows but variables found have 820 rows
summary(pre)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.07383 0.10718 0.13684 0.13113 0.16408 0.25955
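One way I might tighten this up is shown below as a sketch (fold_rmse is an illustrative name; the key changes are bare column names in the formula and fitting on trainData inside the loop, so predict() can actually score the held-out fold):

set.seed(42)
Data_CV <- marriage_data1[sample(nrow(marriage_data1)), ]
folds <- cut(seq_len(nrow(Data_CV)), breaks = 10, labels = FALSE)

fold_rmse <- numeric(10)
for (i in 1:10) {
  testData  <- Data_CV[folds == i, ]
  trainData <- Data_CV[folds != i, ]
  # Fit on the nine training folds; bare column names let predict() line up
  # the columns of the held-out fold
  fit  <- lm(Married.ages.23.25..pct. ~ Married.ages.32.34..pct., data = trainData)
  pred <- predict(fit, newdata = testData)
  fold_rmse[i] <- sqrt(mean((testData$Married.ages.23.25..pct. - pred)^2))
}

fold_rmse          # prediction error on each held-out fold
mean(fold_rmse)    # average 10-fold cross-validation error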
This is kind of a hot mess. I had to change some things from my original Rmd file to be able to get it to knit, but I think the bootstrapping went fairly well. I am open to all suggestions on how to make my cross-validation better. Thank you!