Parametric tests require us to make certain assumptions about the distribution of the data. When the data violate these assumptions, parametric methods are not appropriate for data analysis and hypothesis testing, and non-parametric statistics provide a suitable alternative because they make no assumptions about the underlying distribution. The trade-off is that parametric tests have more power than their non-parametric counterparts when their assumptions do hold.
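That power gap can be seen in a quick simulation. This is a minimal sketch under assumed settings (n = 20 per group, a true mean shift of one standard deviation, 2,000 replicates), not part of the original assignment:
set.seed(1)
nsim <- 2000
rejections <- replicate(nsim, {
  x <- rnorm(20); y <- rnorm(20, mean = 1)  # normality holds, favoring the t-test
  c(t = t.test(x, y)$p.value < 0.05,
    wilcoxon = wilcox.test(x, y)$p.value < 0.05)
})
rowMeans(rejections)  # rejection rates: the t-test's should be slightly higher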
\(H_{0}: \theta = 0\)
\(H_{A}: \theta \neq 0\)
where \(\theta\) is the location shift between the two groups.
# Example where the conclusions differ
adopted <- data.frame(
  animal = c(rep("cat", 6), rep("dog", 6)),
  breed = c("One Cat", "Two Cat", "Red Cat", "Blue Cat", "Old Cat", "New Cat",
            "Husky", "Golden Retriever", "Cocker Spaniel", "Pug", "Samoyed", "Australian Shepherd"),
  values = c(58, 62, 65, 66, 82, 87, 67, 69, 86, 89, 100, 830))
# As you can see, I am clearly very familiar with cat breeds
xtab <- matrix(c(sum(adopted[1:6, 3]), sum(adopted[7:12, 3])), ncol = 1, byrow = TRUE)
rownames(xtab) <- c("Cats", "Dogs"); colnames(xtab) <- c("Adopted")
xtab
## Adopted
## Cats 420
## Dogs 1241
# Total number of adoptions: n = 1661
# Wilcoxon
wilcox.test(values ~ animal, data=adopted)
##
## Wilcoxon rank sum test
##
## data: values by animal
## W = 5, p-value = 0.04113
## alternative hypothesis: true location shift is not equal to 0
# T-test
t.test(values ~ animal, data=adopted)
##
## Welch Two Sample t-test
##
## data: values by animal
## t = -1.0962, df = 5.0146, p-value = 0.3228
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -457.4357 183.7690
## sample estimates:
## mean in group cat mean in group dog
## 70.0000 206.8333
The Wilcoxon test rejects at the 5% level (p = 0.041) while the t-test does not (p = 0.323): the single extreme value of 830 inflates the dog-group variance and robs the t-test of power, whereas the rank-based Wilcoxon statistic only registers that 830 is the largest value, not how large it is.
# Example where the conclusions are the same
adopted.v2 <- data.frame(
  animal = c(rep("cat", 6), rep("dog", 6)),
  breed = c("One Cat", "Two Cat", "Red Cat", "Blue Cat", "Old Cat", "New Cat",
            "Husky", "Golden Retriever", "Cocker Spaniel", "Pug", "Samoyed", "Australian Shepherd"),
  values = c(68, 62, 65, 66, 82, 87, 67, 69, 86, 89, 100, 820))  # +10 One Cat, -10 Australian Shepherd
xtab.v2 <- matrix(c(sum(adopted.v2[1:6, 3]), sum(adopted.v2[7:12, 3])), ncol = 1, byrow = TRUE)
rownames(xtab.v2) <- c("Cats", "Dogs"); colnames(xtab.v2) <- c("Adopted")
xtab.v2
## Adopted
## Cats 430
## Dogs 1231
# Total number of adoptions is still n = 1661: no new observations were added, only 10 adoptions
# were shifted from the breed with the most (830 - 10 = 820 Australian Shepherds) to the breed
# with the fewest (58 + 10 = 68 One Cats).
# Wilcoxon
wilcox.test(values ~ animal, data=adopted.v2)
##
## Wilcoxon rank sum test
##
## data: values by animal
## W = 6, p-value = 0.06494
## alternative hypothesis: true location shift is not equal to 0
# T-test
t.test(values ~ animal, data=adopted.v2)
##
## Welch Two Sample t-test
##
## data: values by animal
## t = -1.0841, df = 5.0116, p-value = 0.3277
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -449.8318 182.8318
## sample estimates:
## mean in group cat mean in group dog
## 71.66667 205.16667
Now neither test rejects at the 5% level. Shifting 10 adoptions barely moves the group means, so the t-test p-value is nearly unchanged (0.3228 vs. 0.3277), but the shift swaps a single rank pair (One Cat's 68 now exceeds the Husky's 67), which is enough to push the Wilcoxon p-value from 0.041 to 0.065.
A random forest is built from many CART trees, each fit to a different bootstrap sample of the data, with a random subset of predictors considered at each split. The forest then aggregates the predictions of all of these diverse trees, and the additional information contributed by each slightly different tree helps to smooth the decision boundary. The example below, using the iris data, helps visualize these concepts.
library(rpart); library(randomForest)
df <- iris[1:150, c(1:2, 5)]  # Sepal.Length, Sepal.Width, and Species
CART.example <- rpart(Species ~ ., data = df)
RF.example <- randomForest(Species ~ ., data = df)
decisionplot(CART.example, df, class = "Species", main = "CART")
decisionplot(RF.example, df, class = "Species", main = "Random Forest")
Decision plot visualization function: https://michael.hahsler.net/SMU/EMIS7332/R/
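Since decisionplot() is not part of any CRAN package, here is a minimal sketch of such a helper for two-predictor classifiers (the fuller original lives at the URL above):
decisionplot <- function(model, data, class, main = "", resolution = 100) {
  x <- data[, setdiff(names(data), class)]   # the two predictor columns
  cl <- as.factor(data[[class]])
  # build a fine grid spanning the predictor space
  r1 <- range(x[[1]]); r2 <- range(x[[2]])
  grid <- expand.grid(seq(r1[1], r1[2], length.out = resolution),
                      seq(r2[1], r2[2], length.out = resolution))
  names(grid) <- names(x)
  # rpart wants type = "class"; randomForest returns factor labels by default
  p <- if (inherits(model, "rpart")) predict(model, grid, type = "class")
       else predict(model, grid)
  plot(x, col = as.integer(cl) + 1, pch = 20, main = main)
  points(grid, col = as.integer(as.factor(p)) + 1, pch = ".")
}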
(a) Using the “USArrests” data set that is pre-loaded in R, construct a kernel density estimator of the number of assault arrests and show the density in a figure.
library(ggplot2)
ggplot(USArrests, aes(Assault, ..density..)) + geom_histogram(bins = 10) +
  geom_density(bw = 40, color = "cyan4")
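For comparison, the same estimator in base R; bw = 40 matches the ggplot call above, and omitting bw would fall back on Silverman's rule-of-thumb bandwidth (bw.nrd0):
d <- density(USArrests$Assault, bw = 40)  # Gaussian kernel by default
plot(d, main = "Kernel density estimate of assault arrests")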
(b) Using the “USArrests” data set that is pre-loaded in R, fit a loess regression model with UrbanPop as a predictor and Assault as the response variable.
ggplot(USArrests, aes(UrbanPop, Assault)) + geom_point() +
  geom_smooth(span = 0.82, color = "cyan4", se = FALSE) + geom_smooth(color = "pink")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
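Since the question asks for a loess model, it is also worth fitting one explicitly rather than only through geom_smooth(); span = 0.82 matches the cyan curve above, and loess()'s default span of 0.75 corresponds roughly to the pink one:
fit <- loess(Assault ~ UrbanPop, data = USArrests, span = 0.82)
summary(fit)
predict(fit, newdata = data.frame(UrbanPop = c(50, 70, 90)))  # fitted assault rates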
library(dslabs); library(reshape2)
measles <- subset(us_contagious_diseases, disease == "Measles")
# reshape to one row per year with one column per state
measles <- melt(measles, id = c("disease", "state", "year"), measure = "count")
measles <- dcast(measles, formula = year ~ state)
# abbreviate the multi-word state names
names(measles)[c(10, 31:36, 41:43, 50)] <-
  c("DC", "NH", "NJ", "NM", "NY", "NC", "ND", "RI", "SC", "SD", "WV")
(a) Construct a CART model to predict number of measles cases in Illinois using the number of cases in other states as potential predictors. Display the best tree that you find and try to offer some interpretation.
library(rpart.plot)
set.seed(1234)
ind <- sample(76)                # the data cover 76 years (rows)
a <- ind[1:52]; b <- ind[53:76]  # roughly 2/3 train, 1/3 test
train <- measles[a,]
test <- measles[b,]
tree <- rpart(Illinois ~ . - year, data = train, minsplit = 3, minbucket = 1)
rpart.plot(tree, branch.lty = 3, extra = 101)
tree
## n= 52
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 52 16374220000 12238.920
## 2) Indiana< 8771.5 41 1382022000 4827.902
## 4) Montana< 542.5 25 29132490 785.440 *
## 5) Montana>=542.5 16 306012100 11144.250 *
## 3) Indiana>=8771.5 11 4347079000 39861.820
## 6) Michigan< 68691.5 9 965083100 32241.110
## 12) Ohio< 20750 4 56403010 22066.250 *
## 13) Ohio>=20750 5 163279900 40381.000 *
## 7) Michigan>=68691.5 2 507275000 74155.000 *
p.tree <- predict(tree, test)
mean((p.tree - test$Illinois)^2)
## [1] 46302858
We start by checking the number of measles cases in Indiana and proceed down the tree accordingly. If there are at least 8,772 cases in Indiana, we next look at Michigan: at least 68,692 cases there yields a predicted 74,155 cases for Illinois, while fewer than 68,692 sends us to Ohio, where fewer than 20,750 cases gives a prediction of 22,066 and 20,750 or more gives a prediction of 40,381.
If there are fewer than 8,772 cases in Indiana, we look at Montana instead: at least 543 cases there gives a predicted 11,144 cases for Illinois, and fewer than 543 gives a prediction of 785. (In the printout, n counts the training observations in each node; the predictions are in the yval column.)
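The problem asks for the best tree we can find, so it is worth checking whether cross-validation favors a pruned tree. This is a sketch of the standard rpart workflow, not something done in the analysis above:
printcp(tree)  # cross-validation error (xerror) at each complexity parameter
best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned <- prune(tree, cp = best.cp)  # the subtree with the lowest xerror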
(b) Construct a random forest model to predict the number of measles cases in Illinois using the number of cases in other states as potential predictors. Is the random forest model better than the CART model at predicting measles cases?
forest <- randomForest(Illinois ~ . - year, data = train)
p.forest <- predict(forest, test)
mean((p.forest - test$Illinois)^2)
## [1] 18021505
# For each test year, tally which model's prediction is closer to the truth
table(ifelse(abs(p.tree - test$Illinois) < abs(p.forest - test$Illinois),
             "CART", "Random Forest"))
##
##          CART Random Forest
##             4            20
As expected, the random forest predictions were closer to the true number of Illinois cases in 20 of the 24 test years; the CART predictions were closer only 4 times. The random forest also cuts the test MSE by more than half (18,021,505 vs. 46,302,858).
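To see which states the forest actually leans on, one could also inspect variable importance; this is an optional check, not part of the write-up above:
imp <- importance(forest)  # IncNodePurity for a regression forest
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE])
# or, graphically: varImpPlot(forest)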
(c) Compute Kendall's \(\tau\) between the Illinois and Massachusetts measles data. Compute the standard error of this estimator using the bootstrap procedure. Also, estimate the bias of this estimator.
thetaHat <- cor(measles$Illinois, measles$Massachusetts, method="kendall"); thetaHat
## [1] 0.6707704
set.seed(1234)
x <- measles[, c(15, 23)]  # the Illinois and Massachusetts columns
n <- nrow(x)
nsim <- 1000
thetaBoots <- rep(NA, nsim)
for (i in 1:nsim){
  # resample years (rows) with replacement, keeping each IL/MA pair together
  bootsSample <- x[sample(1:n, n, replace = TRUE), ]
  thetaBoots[i] <- cor(bootsSample$Illinois, bootsSample$Massachusetts, method = "kendall")
}
# Standard error: root-mean-squared deviation of the replicates around thetaHat
MSE <- mean((thetaBoots - thetaHat)^2)
sqrt(MSE)
## [1] 0.05237864
# Estimated bias: mean of the bootstrap replicates minus the original estimate
mean(thetaBoots) - thetaHat
## [1] 0.001037724
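The bias is tiny relative to the standard error, so correcting for it hardly matters; for completeness, the usual bias-corrected estimate subtracts the estimated bias from the original estimator:
2 * thetaHat - mean(thetaBoots)  # bias-corrected Kendall's tau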
(d) Perform PCA on the measles data with states as variables. (Remember to exclude year as a predictor.) How many components are needed to account for 85% of the variability? Also, try to interpret the first two principal components.
library(dplyr)
copy.measles <- select(measles, -year)  # drop year so only states enter the PCA
pca <- prcomp(copy.measles, center = TRUE, scale. = TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 5.0717 2.3906 1.73002 1.55793 1.49685 1.21572
## Proportion of Variance 0.5044 0.1120 0.05869 0.04759 0.04393 0.02898
## Cumulative Proportion 0.5044 0.6164 0.67510 0.72269 0.76662 0.79560
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 1.08633 1.06723 1.0023 0.91873 0.81014 0.80742
## Proportion of Variance 0.02314 0.02233 0.0197 0.01655 0.01287 0.01278
## Cumulative Proportion 0.81874 0.84107 0.8608 0.87732 0.89019 0.90297
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.74147 0.71233 0.65939 0.6101 0.58548 0.57551
## Proportion of Variance 0.01078 0.00995 0.00853 0.0073 0.00672 0.00649
## Cumulative Proportion 0.91375 0.92370 0.93223 0.9395 0.94625 0.95274
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.55033 0.53632 0.53413 0.47762 0.45042 0.42163
## Proportion of Variance 0.00594 0.00564 0.00559 0.00447 0.00398 0.00349
## Cumulative Proportion 0.95868 0.96432 0.96991 0.97439 0.97837 0.98185
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 0.39860 0.37323 0.32642 0.30369 0.28084 0.25949
## Proportion of Variance 0.00312 0.00273 0.00209 0.00181 0.00155 0.00132
## Cumulative Proportion 0.98497 0.98770 0.98979 0.99160 0.99314 0.99446
## PC31 PC32 PC33 PC34 PC35 PC36
## Standard deviation 0.24514 0.23105 0.17986 0.17410 0.16067 0.13137
## Proportion of Variance 0.00118 0.00105 0.00063 0.00059 0.00051 0.00034
## Cumulative Proportion 0.99564 0.99669 0.99732 0.99792 0.99842 0.99876
## PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.11746 0.10899 0.09484 0.08418 0.07666 0.06662
## Proportion of Variance 0.00027 0.00023 0.00018 0.00014 0.00012 0.00009
## Cumulative Proportion 0.99903 0.99926 0.99944 0.99958 0.99969 0.99978
## PC43 PC44 PC45 PC46 PC47 PC48
## Standard deviation 0.06287 0.04965 0.04220 0.03468 0.03022 0.02156
## Proportion of Variance 0.00008 0.00005 0.00003 0.00002 0.00002 0.00001
## Cumulative Proportion 0.99986 0.99991 0.99994 0.99997 0.99998 0.99999
## PC49 PC50 PC51
## Standard deviation 0.01338 0.0106 0.008771
## Proportion of Variance 0.00000 0.0000 0.000000
## Cumulative Proportion 1.00000 1.0000 1.000000
PCA.2<-data.frame(pca$rotation[,1:2]); PCA.2
## PC1 PC2
## Alabama -0.16077162 0.042731298
## Alaska -0.08266421 -0.300024217
## Arizona -0.12868551 -0.247690985
## Arkansas -0.13829837 0.082508114
## California -0.14897156 -0.061962413
## Colorado -0.16622331 0.011455876
## Connecticut -0.13511995 -0.042029374
## Delaware -0.12920511 0.087914219
## DC -0.14506752 0.176021423
## Florida -0.15794779 -0.070406716
## Georgia -0.14505390 0.161494198
## Hawaii -0.06869364 -0.260656919
## Idaho -0.14512041 -0.155358321
## Illinois -0.15108708 0.098003499
## Indiana -0.16114808 -0.013318903
## Iowa -0.13971884 -0.137085486
## Kansas -0.12183149 0.179860957
## Kentucky -0.15941506 -0.107568440
## Louisiana -0.12880993 0.181271454
## Maine -0.13859742 -0.025486074
## Maryland -0.13778829 0.164185121
## Massachusetts -0.15596602 0.065003456
## Michigan -0.15537818 -0.012377601
## Minnesota -0.10365411 0.124337903
## Mississippi -0.09166173 -0.270562714
## Missouri -0.14658836 0.194138301
## Montana -0.15092050 -0.074427762
## Nebraska -0.11563188 0.151788097
## Nevada -0.09348111 -0.236351866
## NH -0.13037741 0.061583381
## NJ -0.15068380 0.019806893
## NM -0.14319861 -0.052276556
## NY -0.16794996 0.058355691
## NC -0.11861754 0.245930832
## ND -0.15390149 -0.112592318
## Ohio -0.15699184 0.066697673
## Oklahoma -0.15138139 0.069342825
## Oregon -0.14008630 -0.176361081
## Pennsylvania -0.15428094 0.185198560
## RI -0.10660378 0.002788881
## SC -0.13809717 0.099533976
## SD -0.10164930 0.206866330
## Tennessee -0.14427407 -0.186433116
## Texas -0.14544375 -0.188351499
## Utah -0.13373339 -0.018775683
## Vermont -0.12245849 -0.017323594
## Virginia -0.16727171 0.004395603
## Washington -0.15211770 -0.155900243
## WV -0.15105109 -0.108167943
## Wisconsin -0.16534908 -0.030975257
## Wyoming -0.15047417 0.074709929
9 components are needed to account for 85% of the variability (the cumulative proportion first exceeds 0.85 at PC9, at 0.8608).
1st principal component: the loadings are all negative and of similar magnitude (largest: NY, Virginia, Colorado, Wisconsin, Indiana, Alabama), so PC1 acts as a (sign-flipped) overall national measles level for each year.
2nd principal component: the largest magnitudes are Alaska, Mississippi, Hawaii, Arizona, Nevada, Texas, and Tennessee, all with negative loadings; PC2 contrasts these states against states with positive loadings such as NC, SD, Missouri, and Pennsylvania.
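A programmatic check of the 85% cutoff and of the loadings cited above (a convenience sketch, not in the original write-up):
which(cumsum(pca$sdev^2) / sum(pca$sdev^2) >= 0.85)[1]  # first PC reaching 85%
head(PCA.2[order(abs(PCA.2$PC2), decreasing = TRUE), "PC2", drop = FALSE], 8)  # biggest PC2 loadings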
(e) Perform hierarchical clustering on the measles data with states as observations. (Remember to exclude the year data.) How many clusters do there appear to be? Try to interpret each of the clusters.
copy.measles <- scale(copy.measles)  # base-R standardization (z-score each state column)
copy.measles <- t(copy.measles)      # transpose so states become the observations
# Average Euclidean
dist.Ea <- dist(copy.measles, method="euclidean")
clust.Ea <- hclust(dist.Ea, method="ave")
plot(clust.Ea, cex=.5)
rect.hclust(clust.Ea, k=3)
# Average Canberra
dist.Ca <- dist(copy.measles, method="canberra")
clust.Ca <- hclust(dist.Ca, method="ave")
plot(clust.Ca, cex=.4)
rect.hclust(clust.Ca, k=2)
# Average Manhattan
dist.Ma <- dist(copy.measles, method="manhattan")
clust.Ma <- hclust(dist.Ma, method="ave")
plot(clust.Ma, cex=.4)
rect.hclust(clust.Ma, k=3)
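To make the clusters easier to read off than from the dendrograms alone, the memberships can be listed directly (a convenience sketch using the Euclidean/average-linkage tree):
split(rownames(copy.measles), cutree(clust.Ea, k = 3))  # states in each of the 3 clusters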
Cutting at two or three clusters looks most natural, depending on the distance used, but it's a little hard to interpret since we aren't breaking out by year (where we could see clusters based on historical measles epidemics). I see some geographic trends in how the clusters split, but nothing that seems very meaningful (e.g., New York and California consistently land in the same cluster). Perhaps there's an underlying population component, but I would need to study the numbers from the original data more. The one recurring trend I do see is Alaska and Hawaii sharing a sub-branch in the Canberra and Manhattan dendrograms. It seems reasonable that they would mirror each other more than other states: measles is a communicable disease, and these two states don't share a physical border with the mainland US. Further, I doubt there was much travel to and from AK/HI back when measles was more common; cases dropped off massively after the vaccine was introduced in 1963, and flying wasn't mainstream until the end of the 1950s, so AK/HI probably weren't getting many visitors from the other 48 states.