Parametric tests require us to make certain assumptions about the distribution of the data. When the data violate these assumptions, parametric methods are not appropriate for data analysis and hypothesis testing, and non-parametric statistics provide a suitable alternative because they make no assumptions about the underlying distribution. The trade-off is that parametric tests have more power than their non-parametric counterparts when their assumptions do hold.
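That power gap can be seen in a quick simulation. This is a minimal sketch under assumed settings (n = 20 per group, a true mean shift of one standard deviation, 2,000 replicates), not part of the original assignment:
set.seed(1)
nsim <- 2000
rejections <- replicate(nsim, {
  x <- rnorm(20); y <- rnorm(20, mean = 1)  # normality holds, favoring the t-test
  c(t = t.test(x, y)$p.value < 0.05,
    wilcoxon = wilcox.test(x, y)$p.value < 0.05)
})
rowMeans(rejections)  # rejection rates: the t-test's should be slightly higher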
\(H_{0}: \theta = 0\)
\(H_{A}: \theta \neq 0\)
where \(\theta\) is the location shift between the two groups.
# Example where the conclusions differ
adopted <- data.frame(
  animal = c(rep("cat", 6), rep("dog", 6)),
  breed = c("One Cat", "Two Cat", "Red Cat", "Blue Cat", "Old Cat", "New Cat",
            "Husky", "Golden Retriever", "Cocker Spaniel", "Pug", "Samoyed", "Australian Shepherd"),
  values = c(58, 62, 65, 66, 82, 87, 67, 69, 86, 89, 100, 830))
# As you can see, I am clearly very familiar with cat breeds
xtab <- matrix(c(sum(adopted[1:6, 3]), sum(adopted[7:12, 3])), ncol = 1, byrow = TRUE)
rownames(xtab) <- c("Cats", "Dogs"); colnames(xtab) <- c("Adopted")
xtab
## Adopted
## Cats 420
## Dogs 1241
# Total number of adoptions: n = 1661
# Wilcoxon
wilcox.test(values ~ animal, data=adopted)
##
## Wilcoxon rank sum test
##
## data: values by animal
## W = 5, p-value = 0.04113
## alternative hypothesis: true location shift is not equal to 0
# T-test
t.test(values ~ animal, data=adopted)
##
## Welch Two Sample t-test
##
## data: values by animal
## t = -1.0962, df = 5.0146, p-value = 0.3228
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -457.4357 183.7690
## sample estimates:
## mean in group cat mean in group dog
## 70.0000 206.8333
The Wilcoxon test rejects at the 5% level (p = 0.041) while the t-test does not (p = 0.323): the single extreme value of 830 inflates the dog-group variance and robs the t-test of power, whereas the rank-based Wilcoxon statistic only registers that 830 is the largest value, not how large it is.
# Example where the conclusions are the same
adopted.v2 <- data.frame(
  animal = c(rep("cat", 6), rep("dog", 6)),
  breed = c("One Cat", "Two Cat", "Red Cat", "Blue Cat", "Old Cat", "New Cat",
            "Husky", "Golden Retriever", "Cocker Spaniel", "Pug", "Samoyed", "Australian Shepherd"),
  values = c(68, 62, 65, 66, 82, 87, 67, 69, 86, 89, 100, 820))  # +10 One Cat, -10 Australian Shepherd
xtab.v2 <- matrix(c(sum(adopted.v2[1:6, 3]), sum(adopted.v2[7:12, 3])), ncol = 1, byrow = TRUE)
rownames(xtab.v2) <- c("Cats", "Dogs"); colnames(xtab.v2) <- c("Adopted")
xtab.v2
## Adopted
## Cats 430
## Dogs 1231
# Total number of adoptions is still n = 1661: no new observations were added, only 10 adoptions
# were shifted from the breed with the most (830 - 10 = 820 Australian Shepherds) to the breed
# with the fewest (58 + 10 = 68 One Cats).
# Wilcoxon
wilcox.test(values ~ animal, data=adopted.v2)
##
## Wilcoxon rank sum test
##
## data: values by animal
## W = 6, p-value = 0.06494
## alternative hypothesis: true location shift is not equal to 0
# T-test
t.test(values ~ animal, data=adopted.v2)
##
## Welch Two Sample t-test
##
## data: values by animal
## t = -1.0841, df = 5.0116, p-value = 0.3277
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -449.8318 182.8318
## sample estimates:
## mean in group cat mean in group dog
## 71.66667 205.16667
Now neither test rejects at the 5% level. Shifting 10 adoptions barely moves the group means, so the t-test p-value is nearly unchanged (0.3228 vs. 0.3277), but the shift swaps a single rank pair (One Cat's 68 now exceeds the Husky's 67), which is enough to push the Wilcoxon p-value from 0.041 to 0.065.
A random forest is built from many CART trees, each fit to a different bootstrap sample of the data, with a random subset of predictors considered at each split. The forest then aggregates the predictions of all of these diverse trees, and the additional information contributed by each slightly different tree helps to smooth the decision boundary. The example below, using the iris data, helps visualize these concepts.
library(rpart); library(randomForest)
df <- iris[1:150, c(1:2, 5)]  # Sepal.Length, Sepal.Width, and Species
CART.example <- rpart(Species ~ ., data = df)
RF.example <- randomForest(Species ~ ., data = df)
decisionplot(CART.example, df, class = "Species", main = "CART")
decisionplot(RF.example, df, class = "Species", main = "Random Forest")
Decision plot visualization function: https://michael.hahsler.net/SMU/EMIS7332/R/
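Since decisionplot() is not part of any CRAN package, here is a minimal sketch of such a helper for two-predictor classifiers (the fuller original lives at the URL above):
decisionplot <- function(model, data, class, main = "", resolution = 100) {
  x <- data[, setdiff(names(data), class)]   # the two predictor columns
  cl <- as.factor(data[[class]])
  # build a fine grid spanning the predictor space
  r1 <- range(x[[1]]); r2 <- range(x[[2]])
  grid <- expand.grid(seq(r1[1], r1[2], length.out = resolution),
                      seq(r2[1], r2[2], length.out = resolution))
  names(grid) <- names(x)
  # rpart wants type = "class"; randomForest returns factor labels by default
  p <- if (inherits(model, "rpart")) predict(model, grid, type = "class")
       else predict(model, grid)
  plot(x, col = as.integer(cl) + 1, pch = 20, main = main)
  points(grid, col = as.integer(as.factor(p)) + 1, pch = ".")
}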
(a) Using the “USArrests” data set that is pre-loaded in R, construct a kernel density estimator of the number of assault arrests and show the density in a figure.
library(ggplot2)
ggplot(USArrests, aes(Assault, ..density..)) + geom_histogram(bins = 10) +
  geom_density(bw = 40, color = "cyan4")
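For comparison, the same estimator in base R; bw = 40 matches the ggplot call above, and omitting bw would fall back on Silverman's rule-of-thumb bandwidth (bw.nrd0):
d <- density(USArrests$Assault, bw = 40)  # Gaussian kernel by default
plot(d, main = "Kernel density estimate of assault arrests")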
(b) Using the “USArrests” data set that is pre-loaded in R, fit a loess regression model with UrbanPop as a predictor and Assault as the response variable.
ggplot(USArrests, aes(UrbanPop, Assault)) + geom_point() +
  geom_smooth(span = 0.82, color = "cyan4", se = FALSE) + geom_smooth(color = "pink")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
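Since the question asks for a loess model, it is also worth fitting one explicitly rather than only through geom_smooth(); span = 0.82 matches the cyan curve above, and loess()'s default span of 0.75 corresponds roughly to the pink one:
fit <- loess(Assault ~ UrbanPop, data = USArrests, span = 0.82)
summary(fit)
predict(fit, newdata = data.frame(UrbanPop = c(50, 70, 90)))  # fitted assault rates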
library(dslabs); library(reshape2)
measles <- subset(us_contagious_diseases, disease == "Measles")
# reshape to one row per year with one column per state
measles <- melt(measles, id = c("disease", "state", "year"), measure = "count")
measles <- dcast(measles, formula = year ~ state)
# abbreviate the multi-word state names
names(measles)[c(10, 31:36, 41:43, 50)] <-
  c("DC", "NH", "NJ", "NM", "NY", "NC", "ND", "RI", "SC", "SD", "WV")
(a) Construct a CART model to predict number of measles cases in Illinois using the number of cases in other states as potential predictors. Display the best tree that you find and try to offer some interpretation.
library(rpart.plot)
set.seed(1234)
ind <- sample(76)                # the data cover 76 years (rows)
a <- ind[1:52]; b <- ind[53:76]  # roughly 2/3 train, 1/3 test
train <- measles[a,]
test <- measles[b,]
tree <- rpart(Illinois ~ . - year, data = train, minsplit = 3, minbucket = 1)
rpart.plot(tree, branch.lty = 3, extra = 101)
tree
## n= 52
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 52 16374220000 12238.920
## 2) Indiana< 8771.5 41 1382022000 4827.902
## 4) Montana< 542.5 25 29132490 785.440 *
## 5) Montana>=542.5 16 306012100 11144.250 *
## 3) Indiana>=8771.5 11 4347079000 39861.820
## 6) Michigan< 68691.5 9 965083100 32241.110
## 12) Ohio< 20750 4 56403010 22066.250 *
## 13) Ohio>=20750 5 163279900 40381.000 *
## 7) Michigan>=68691.5 2 507275000 74155.000 *
p.tree <- predict(tree, test)
mean((p.tree - test$Illinois)^2)
## [1] 46302858
We start by checking the number of measles cases in Indiana and proceed down the tree accordingly. If there are at least 8,772 cases in Indiana, we next look at Michigan: at least 68,692 cases there yields a predicted 74,155 cases for Illinois, while fewer than 68,692 sends us to Ohio, where fewer than 20,750 cases gives a prediction of 22,066 and 20,750 or more gives a prediction of 40,381.
If there are fewer than 8,772 cases in Indiana, we look at Montana instead: at least 543 cases there gives a predicted 11,144 cases for Illinois, and fewer than 543 gives a prediction of 785. (In the printout, n counts the training observations in each node; the predictions are in the yval column.)
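The problem asks for the best tree we can find, so it is worth checking whether cross-validation favors a pruned tree. This is a sketch of the standard rpart workflow, not something done in the analysis above:
printcp(tree)  # cross-validation error (xerror) at each complexity parameter
best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned <- prune(tree, cp = best.cp)  # the subtree with the lowest xerror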
(b) Construct a random forest model to predict the number of measles cases in Illinois using the number of cases in other states as potential predictors. Is the random forest model better than the CART model at predicting measles cases?
forest <- randomForest(Illinois ~ . - year, data = train)
p.forest <- predict(forest, test)
mean((p.forest - test$Illinois)^2)
## [1] 18021505
# For each test year, tally which model's prediction is closer to the truth
table(ifelse(abs(p.tree - test$Illinois) < abs(p.forest - test$Illinois),
             "CART", "Random Forest"))
##
##          CART Random Forest
##             4            20
As expected, the random forest predictions were closer to the true number of Illinois cases in 20 of the 24 test years; the CART predictions were closer only 4 times. The random forest also cuts the test MSE by more than half (18,021,505 vs. 46,302,858).
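To see which states the forest actually leans on, one could also inspect variable importance; this is an optional check, not part of the write-up above:
imp <- importance(forest)  # IncNodePurity for a regression forest
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE])
# or, graphically: varImpPlot(forest)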
(c) Compute Kendall's \(\tau\) between the Illinois and Massachusetts measles data. Compute the standard error of this estimator using the bootstrap procedure. Also, estimate the bias of this estimator.
thetaHat <- cor(measles$Illinois, measles$Massachusetts, method="kendall"); thetaHat
## [1] 0.6707704
set.seed(1234)
x <- measles[, c(15, 23)]  # the Illinois and Massachusetts columns
n <- nrow(x)
nsim <- 1000
thetaBoots <- rep(NA, nsim)
for (i in 1:nsim){
  # resample years (rows) with replacement, keeping each IL/MA pair together
  bootsSample <- x[sample(1:n, n, replace = TRUE), ]
  thetaBoots[i] <- cor(bootsSample$Illinois, bootsSample$Massachusetts, method = "kendall")
}
# Standard error: root-mean-squared deviation of the replicates around thetaHat
MSE <- mean((thetaBoots - thetaHat)^2)
sqrt(MSE)
## [1] 0.05237864
# Estimated bias: mean of the bootstrap replicates minus the original estimate
mean(thetaBoots) - thetaHat
## [1] 0.001037724
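The bias is tiny relative to the standard error, so correcting for it hardly matters; for completeness, the usual bias-corrected estimate subtracts the estimated bias from the original estimator:
2 * thetaHat - mean(thetaBoots)  # bias-corrected Kendall's tau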
(d) Perform PCA on the measles data with states as variables. (Remember to exclude year as a predictor.) How many components are needed to account for 85% of the variability? Also, try to interpret the first two principal components.
library(dplyr)
copy.measles <- select(measles, -year)  # drop year so only states enter the PCA
pca <- prcomp(copy.measles, center = TRUE, scale. = TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 5.0717 2.3906 1.73002 1.55793 1.49685 1.21572
## Proportion of Variance 0.5044 0.1120 0.05869 0.04759 0.04393 0.02898
## Cumulative Proportion 0.5044 0.6164 0.67510 0.72269 0.76662 0.79560
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 1.08633 1.06723 1.0023 0.91873 0.81014 0.80742
## Proportion of Variance 0.02314 0.02233 0.0197 0.01655 0.01287 0.01278
## Cumulative Proportion 0.81874 0.84107 0.8608 0.87732 0.89019 0.90297
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.74147 0.71233 0.65939 0.6101 0.58548 0.57551
## Proportion of Variance 0.01078 0.00995 0.00853 0.0073 0.00672 0.00649
## Cumulative Proportion 0.91375 0.92370 0.93223 0.9395 0.94625 0.95274
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.55033 0.53632 0.53413 0.47762 0.45042 0.42163
## Proportion of Variance 0.00594 0.00564 0.00559 0.00447 0.00398 0.00349
## Cumulative Proportion 0.95868 0.96432 0.96991 0.97439 0.97837 0.98185
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 0.39860 0.37323 0.32642 0.30369 0.28084 0.25949
## Proportion of Variance 0.00312 0.00273 0.00209 0.00181 0.00155 0.00132
## Cumulative Proportion 0.98497 0.98770 0.98979 0.99160 0.99314 0.99446
## PC31 PC32 PC33 PC34 PC35 PC36
## Standard deviation 0.24514 0.23105 0.17986 0.17410 0.16067 0.13137
## Proportion of Variance 0.00118 0.00105 0.00063 0.00059 0.00051 0.00034
## Cumulative Proportion 0.99564 0.99669 0.99732 0.99792 0.99842 0.99876
## PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.11746 0.10899 0.09484 0.08418 0.07666 0.06662
## Proportion of Variance 0.00027 0.00023 0.00018 0.00014 0.00012 0.00009
## Cumulative Proportion 0.99903 0.99926 0.99944 0.99958 0.99969 0.99978
## PC43 PC44 PC45 PC46 PC47 PC48
## Standard deviation 0.06287 0.04965 0.04220 0.03468 0.03022 0.02156
## Proportion of Variance 0.00008 0.00005 0.00003 0.00002 0.00002 0.00001
## Cumulative Proportion 0.99986 0.99991 0.99994 0.99997 0.99998 0.99999
## PC49 PC50 PC51
## Standard deviation 0.01338 0.0106 0.008771
## Proportion of Variance 0.00000 0.0000 0.000000
## Cumulative Proportion 1.00000 1.0000 1.000000
PCA.2<-data.frame(pca$rotation[,1:2]); PCA.2
## PC1 PC2
## Alabama -0.16077162 0.042731298
## Alaska -0.08266421 -0.300024217
## Arizona -0.12868551 -0.247690985
## Arkansas -0.13829837 0.082508114
## California -0.14897156 -0.061962413
## Colorado -0.16622331 0.011455876
## Connecticut -0.13511995 -0.042029374
## Delaware -0.12920511 0.087914219
## DC -0.14506752 0.176021423
## Florida -0.15794779 -0.070406716
## Georgia -0.14505390 0.161494198
## Hawaii -0.06869364 -0.260656919
## Idaho -0.14512041 -0.155358321
## Illinois -0.15108708 0.098003499
## Indiana -0.16114808 -0.013318903
## Iowa -0.13971884 -0.137085486
## Kansas -0.12183149 0.179860957
## Kentucky -0.15941506 -0.107568440
## Louisiana -0.12880993 0.181271454
## Maine -0.13859742 -0.025486074
## Maryland -0.13778829 0.164185121
## Massachusetts -0.15596602 0.065003456
## Michigan -0.15537818 -0.012377601
## Minnesota -0.10365411 0.124337903
## Mississippi -0.09166173 -0.270562714
## Missouri -0.14658836 0.194138301
## Montana -0.15092050 -0.074427762
## Nebraska -0.11563188 0.151788097
## Nevada -0.09348111 -0.236351866
## NH -0.13037741 0.061583381
## NJ -0.15068380 0.019806893
## NM -0.14319861 -0.052276556
## NY -0.16794996 0.058355691
## NC -0.11861754 0.245930832
## ND -0.15390149 -0.112592318
## Ohio -0.15699184 0.066697673
## Oklahoma -0.15138139 0.069342825
## Oregon -0.14008630 -0.176361081
## Pennsylvania -0.15428094 0.185198560
## RI -0.10660378 0.002788881
## SC -0.13809717 0.099533976
## SD -0.10164930 0.206866330
## Tennessee -0.14427407 -0.186433116
## Texas -0.14544375 -0.188351499
## Utah -0.13373339 -0.018775683
## Vermont -0.12245849 -0.017323594
## Virginia -0.16727171 0.004395603
## Washington -0.15211770 -0.155900243
## WV -0.15105109 -0.108167943
## Wisconsin -0.16534908 -0.030975257
## Wyoming -0.15047417 0.074709929
9 components are needed to account for 85% of the variability (the cumulative proportion first exceeds 0.85 at PC9, at 0.8608).
1st principal component: the loadings are all negative and of similar magnitude (largest: NY, Virginia, Colorado, Wisconsin, Indiana, Alabama), so PC1 acts as a (sign-flipped) overall national measles level for each year.
2nd principal component: the largest magnitudes are Alaska, Mississippi, Hawaii, Arizona, Nevada, Texas, and Tennessee, all with negative loadings; PC2 contrasts these states against states with positive loadings such as NC, SD, Missouri, and Pennsylvania.
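A programmatic check of the 85% cutoff and of the loadings cited above (a convenience sketch, not in the original write-up):
which(cumsum(pca$sdev^2) / sum(pca$sdev^2) >= 0.85)[1]  # first PC reaching 85%
head(PCA.2[order(abs(PCA.2$PC2), decreasing = TRUE), "PC2", drop = FALSE], 8)  # biggest PC2 loadings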
(e) Perform hierarchical clustering on the measles data with states as observations. (Remember to exclude the year data.) How many clusters do there appear to be? Try to interpret each of the clusters.
copy.measles <- scale(copy.measles)  # base-R standardization (z-score each state column)
copy.measles <- t(copy.measles)      # transpose so states become the observations
# Average Euclidean
dist.Ea <- dist(copy.measles, method="euclidean")
clust.Ea <- hclust(dist.Ea, method="ave")
plot(clust.Ea, cex=.5)
rect.hclust(clust.Ea, k=3)
# Average Canberra
dist.Ca <- dist(copy.measles, method="canberra")
clust.Ca <- hclust(dist.Ca, method="ave")
plot(clust.Ca, cex=.4)
rect.hclust(clust.Ca, k=2)
# Average Manhattan
dist.Ma <- dist(copy.measles, method="manhattan")
clust.Ma <- hclust(dist.Ma, method="ave")
plot(clust.Ma, cex=.4)
rect.hclust(clust.Ma, k=3)
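To make the clusters easier to read off than from the dendrograms alone, the memberships can be listed directly (a convenience sketch using the Euclidean/average-linkage tree):
split(rownames(copy.measles), cutree(clust.Ea, k = 3))  # states in each of the 3 clusters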
Cutting at two or three clusters looks most natural, depending on the distance used, but it's a little hard to interpret since we aren't breaking out by year (where we could see clusters based on historical measles epidemics). I see some geographic trends in how the clusters split, but nothing that seems very meaningful (e.g., New York and California consistently land in the same cluster). Perhaps there's an underlying population component, but I would need to study the numbers from the original data more. The one recurring trend I do see is Alaska and Hawaii sharing a sub-branch in the Canberra and Manhattan dendrograms. It seems reasonable that they would mirror each other more than other states: measles is a communicable disease, and these two states don't share a physical border with the mainland US. Further, I doubt there was much travel to and from AK/HI back when measles was more common; cases dropped off massively after the vaccine was introduced in 1963, and flying wasn't mainstream until the end of the 1950s, so AK/HI probably weren't getting many visitors from the other 48 states.