(1)

(10 points) What is the difference between parametric and non-parametric statistics (2-3 lines should suffice for an answer)?

Parametric tests assume that the data come from a specific theoretical distribution, such as the normal, that is described by a small number of parameters. Non-parametric tests do not make that distributional assumption. When the parametric assumptions hold, parametric tests are more powerful than non-parametric tests, so they should be used when appropriate.
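The power claim can be illustrated with a small simulation. This is only a sketch under assumed settings (normal data, n = 20, a true shift of 0.5, 2000 replications, all chosen for illustration); it compares how often the one-sample t-test and the sign test reject a false null at the 5% level.

```r
# Simulated power of the t-test vs. the sign test when the data really are normal.
# All settings below are illustrative assumptions, not part of the exam question.
set.seed(2024)
n_rep <- 2000   # number of simulated data sets
n     <- 20     # sample size per data set
shift <- 0.5    # true mean (and median), so H0: theta = 0 is false
alpha <- 0.05

reject_t    <- logical(n_rep)
reject_sign <- logical(n_rep)
for (r in seq_len(n_rep)) {
  x <- rnorm(n, mean = shift, sd = 1)
  # Parametric: one-sided one-sample t-test
  reject_t[r] <- t.test(x, mu = 0, alternative = "greater")$p.value < alpha
  # Non-parametric: sign test via the exact binomial test on the count above 0
  reject_sign[r] <- binom.test(sum(x > 0), n, p = 0.5,
                               alternative = "greater")$p.value < alpha
}

power_t    <- mean(reject_t)     # estimated power of the t-test
power_sign <- mean(reject_sign)  # estimated power of the sign test
print(c(t_test = power_t, sign_test = power_sign))
```

Under these assumptions the t-test rejects more often, which is the sense in which it is "more powerful" when its assumptions hold.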

(2)

(10 points) Give an example of a hypothesis test that can be tested using both parametric and non-parametric tests. State the null and alternative hypotheses and describe (or give the names of) the parametric and non-parametric tests used to test the hypothesis that you have given as an example. (HINT: Pick an example that is easy for yourself!)

A one-sample test of location can be carried out with either the parametric one-sample t-test or the non-parametric sign test. \[H_0: \theta = 0\] \[H_a: \theta > 0\] In the one-sample t-test, \(\theta\) is the population mean; in the sign test, \(\theta\) is the population median (both are measures of center). The t-test uses the t-distribution to measure how far the sample mean lies above 0, relative to its standard error, and reports the corresponding one-sided p-value.

Under the null hypothesis, the sign test assumes that each observation has a 50% chance of falling above 0, and it counts the number of sample values greater than 0. It then calculates the probability of a result as extreme as, or more extreme than, the one observed. For example, if 10 out of 12 values are greater than 0, the p-value is the probability of seeing 10, 11, or 12 out of 12 values greater than 0, calculated from the binomial distribution with n = 12 and p = 0.5.
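The 10-out-of-12 example can be checked numerically. The following is a minimal base-R sketch; the variable names are illustrative.

```r
# Sign test p-value for the example: 10 of 12 values greater than 0.
n <- 12   # sample size
k <- 10   # observed number of values above 0

# P(X >= 10) for X ~ Binomial(12, 0.5), summed term by term
p_value <- sum(dbinom(k:n, size = n, prob = 0.5))

# The same one-sided p-value from the built-in exact binomial test
p_check <- binom.test(k, n, p = 0.5, alternative = "greater")$p.value

print(p_value)  # (66 + 12 + 1) / 4096 = 79/4096, about 0.0193
```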

(3)

(10 points) A random forest consists of many trees, but there is a difference between the trees in a random forest and a regular tree model. Describe the differences between these trees.

Each tree in a random forest is grown on a bootstrap sample of the data, and only a random subset of the variables is considered at each split. The forest then aggregates the trees' predictions, averaging them for regression or taking a majority vote for classification. A regular tree model is built once on the full data set with all variables available at every split (not all of them are chosen, and seeing which ones are important is an advantage of this type of model).
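The two sources of randomness can be sketched in base R without fitting a forest. This is an illustration only: it uses USArrests with Assault as a stand-in response, and the floor(p/3) rule is the regression default for mtry in the randomForest package.

```r
# The two random ingredients behind one tree of a random forest (illustration only).
set.seed(1)
predictors <- setdiff(names(USArrests), "Assault")  # Assault is a stand-in response
n <- nrow(USArrests)
p <- length(predictors)

# 1. Each tree is grown on a bootstrap sample: n rows drawn with replacement.
boot_rows <- sample(n, n, replace = TRUE)
oob_rows  <- setdiff(seq_len(n), boot_rows)  # rows this tree never sees (out-of-bag)

# 2. At each split only mtry randomly chosen variables are candidates.
mtry <- max(floor(p / 3), 1)
candidate_vars <- sample(predictors, mtry)

print(length(unique(boot_rows)))  # distinct rows used by this tree
print(candidate_vars)             # variables eligible at one split
```

A regular tree skips both steps: it uses every row once and considers every predictor at every split.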

(4)

Answer the following question with the “USArrests” dataset in R.

(A)

(10 points) Using the “USArrests” data set that is pre-loaded in R, construct a kernel density estimator of the number of assault arrests and show the density in a figure. (hint: ggplot2 makes this very easy).

n = nrow(USArrests)
S = sd(USArrests$Assault)
delta = 1.06/n^(1/5)*S; delta #Hardle's Rule of Thumb
## [1] 40.39738
hist(USArrests$Assault,main="Delta=40",freq=FALSE)
dens<-density(USArrests$Assault,bw=delta)
lines(dens$x,dens$y,col="red",lwd=3)
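To make explicit what the kernel density estimate computes at this bandwidth, the estimator can be written out by hand. This is a sketch that assumes a Gaussian kernel (the default in density()); kde, grid, and fhat are illustrative names.

```r
# Gaussian kernel density estimate written out by hand:
#   fhat(t) = (1 / (n * h)) * sum_i K((t - x_i) / h),  K = standard normal density
x <- USArrests$Assault
n <- length(x)
h <- 1.06 * sd(x) * n^(-1/5)   # same rule-of-thumb bandwidth as delta

kde <- function(t) sapply(t, function(t0) mean(dnorm((t0 - x) / h)) / h)

# Evaluate on a grid that extends well past the data so the tails are covered
grid <- seq(min(x) - 4 * h, max(x) + 4 * h, length.out = 512)
fhat <- kde(grid)

# A density should integrate to roughly 1 (Riemann sum over the grid)
area <- sum(fhat) * diff(grid)[1]
print(c(bandwidth = h, area = area))
```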

(B)

(10 points) Using the “USArrests” data set that is pre-loaded in R, fit a loess regression model with UrbanPop as a predictor and Assault as the response variable. (hint: ggplot2 makes this very easy).

lw = loess(Assault~UrbanPop,data=USArrests)

plot(USArrests$UrbanPop, USArrests$Assault)
#Order by the predictor so the fitted curve is drawn left to right
j <- order(USArrests$UrbanPop)
lines(USArrests$UrbanPop[j], lw$fitted[j],col="red",lwd=3)
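An alternative way to draw the fitted curve is to predict on an evenly spaced grid of UrbanPop values, which avoids ordering the observations. This is a sketch using the default loess settings; grid is an illustrative name.

```r
# Loess fit drawn from predictions on an evenly spaced grid of the predictor.
lw <- loess(Assault ~ UrbanPop, data = USArrests)

grid <- data.frame(
  UrbanPop = seq(min(USArrests$UrbanPop), max(USArrests$UrbanPop), length.out = 100)
)
grid$fit <- predict(lw, newdata = grid)  # stays inside the observed range, so no NAs

plot(USArrests$UrbanPop, USArrests$Assault,
     xlab = "UrbanPop", ylab = "Assault")
lines(grid$UrbanPop, grid$fit, col = "red", lwd = 3)
```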

(5)

Answer the following questions with the data sets in the dslabs package. We will focus on measles. (I did some data cleaning for you. Also, assume that the 0’s in the early years are correct.)

library(dslabs)
library(reshape2)
measles <- subset(us_contagious_diseases, disease == "Measles")
measles <- melt(measles, id = c("disease","state","year"), measure = "count")
measles <- dcast(measles, formula = year ~ state)
#Replace the state names that contain spaces with abbreviations.
names(measles)[10]<-"DC"
names(measles)[31]<-"NH"
names(measles)[32]<-"NJ"
names(measles)[33]<-"NM"
names(measles)[34]<-"NY"
names(measles)[35]<-"NC"
names(measles)[36]<-"ND"
names(measles)[41]<-"RI"
names(measles)[42]<-"SC"
names(measles)[43]<-"SD"
names(measles)[50]<-"WV"
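Renaming by column position works only while the column order stays fixed. A name-based lookup is one possible alternative; the sketch below uses a small made-up data frame (toy) rather than the measles data, and the abbrev lookup vector is an illustrative assumption.

```r
# Rename columns by matching on the old name instead of on the column position.
# toy is a made-up stand-in for the wide measles table.
toy <- data.frame(year = 1928:1930,
                  `New York` = c(1, 2, 3),
                  `West Virginia` = c(4, 5, 6),
                  Ohio = c(7, 8, 9),
                  check.names = FALSE)

# Lookup: old name -> new name (only names containing spaces need an entry)
abbrev <- c("New York" = "NY", "West Virginia" = "WV")

hit <- names(toy) %in% names(abbrev)
names(toy)[hit] <- unname(abbrev[names(toy)[hit]])

print(names(toy))  # "year" "NY" "WV" "Ohio"
```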

(A)

(10 points) Construct a CART model to predict number of measles cases in Illinois using the number of cases in other states as potential predictors. Display the best tree that you find and try to offer some interpretation.

library(tree) 
tree1<-tree(Illinois~.-year , data = measles)
#summary(tree1)
#predict(tree1)
plot(tree1)
text(tree1, cex=.75)

I used the default settings because I expect them to be better than my own attempts at tuning the tree. I was planning on learning from my homework assignment feedback.

This tree uses the measles counts in Indiana, West Virginia, and North Carolina. If the number of measles cases in Indiana is less than 8771.5, we look at West Virginia: if West Virginia has fewer than 1676 cases we predict 1206 cases in Illinois, otherwise we predict 12840. If Indiana has more than 8771.5 cases, we look at North Carolina: if North Carolina has fewer than 10117.5 cases we predict 29070 cases in Illinois, otherwise we predict 56350.
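The tree described above is just a set of nested rules, which can be written out as a function. This sketch copies the thresholds and leaf values from the interpretation; the function and argument names are illustrative.

```r
# The fitted tree written as nested if/else rules.
# Thresholds and predicted values are taken from the tree described in the text.
predict_illinois <- function(indiana, west_virginia, north_carolina) {
  if (indiana < 8771.5) {
    # Low-Indiana branch: split on West Virginia
    if (west_virginia < 1676) 1206 else 12840
  } else {
    # High-Indiana branch: split on North Carolina
    if (north_carolina < 10117.5) 29070 else 56350
  }
}

predict_illinois(indiana = 5000,  west_virginia = 1000, north_carolina = 0)      # 1206
predict_illinois(indiana = 12000, west_virginia = 1000, north_carolina = 15000)  # 56350
```

Only one of West Virginia or North Carolina is consulted for any given year, depending on which side of the Indiana split the year falls.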

(B)

(10 points) Construct a random forest model to predict number of measles cases in Illinois using the number of cases in other states as potential predictors. Is the random forest model better than the CART model at predicting measles cases?

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
rf = randomForest(Illinois~.-year , data = measles)
#p.rf = predict(rf, type = "response")

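The random forest model is expected to predict measles cases in Illinois better than the CART model because it averages many trees, each grown on a different bootstrap sample with a random subset of candidate variables at each split, which reduces the variance of a single tree. This expectation should be confirmed by comparing prediction error, for example the random forest's out-of-bag error against a cross-validated error for the tree.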
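One way to back up a "which model predicts better" claim is to compare cross-validated prediction error. The sketch below is a generic k-fold helper (cv_rmse is a hypothetical name, not from any package) demonstrated with two lm fits on USArrests as stand-ins so that it runs with base R alone; the comments indicate how the tree and random forest fits might be plugged in, which is an untested assumption here.

```r
# Generic k-fold cross-validated RMSE for any model with a predict(fit, newdata=) method.
cv_rmse <- function(fit_fun, data, response, k = 5, seed = 1) {
  set.seed(seed)
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold labels
  sq_err <- numeric(0)
  for (f in seq_len(k)) {
    train <- data[folds != f, , drop = FALSE]
    test  <- data[folds == f, , drop = FALSE]
    fit   <- fit_fun(train)                  # fit on k-1 folds
    pred  <- predict(fit, newdata = test)    # predict the held-out fold
    sq_err <- c(sq_err, (test[[response]] - pred)^2)
  }
  sqrt(mean(sq_err))
}

# Stand-in models so the example is self-contained:
fit_simple <- function(d) lm(Assault ~ 1, data = d)              # mean only
fit_richer <- function(d) lm(Assault ~ Murder + Rape, data = d)  # two predictors

rmse_simple <- cv_rmse(fit_simple, USArrests, "Assault")
rmse_richer <- cv_rmse(fit_richer, USArrests, "Assault")
print(c(simple = rmse_simple, richer = rmse_richer))

# For the measles models one could pass, for example (assumes tree and randomForest
# are installed; not run here):
#   function(d) tree::tree(Illinois ~ . - year, data = d)
#   function(d) randomForest::randomForest(Illinois ~ . - year, data = d)
```

Using the same seed for both calls gives both models the same folds, so the two errors are directly comparable.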

(C)

(10 points) Compute Kendall’s-τ between the Illinois and Massachusetts measles cases data. Compute the standard error of this estimator using the bootstrap procedure. Also, estimate the bias of this estimator.

Kendall’s Tau:

thetaHat = cor(measles$Illinois, measles$Massachusetts, method = "kendall")

Bootstrap MSE:

x = measles[,c(15,23)]
n = nrow(x)
nsim = 1000 
thetaBoots = rep(NA,nsim)  

for (i in 1:nsim){
  bootsSample = x[sample(1:n,n,replace=TRUE),]
  thetaBoots[i] = cor(bootsSample$Illinois, bootsSample$Massachusetts, method = "kendall")
}
#Bootstrap estimate of MSE
MSE = mean((thetaBoots-thetaHat)^2)

#Standard Error (root MSE about thetaHat; close to sd(thetaBoots) when the bias is small)
sqrt(MSE)
## [1] 0.05029133

Bias Estimate:

mean(thetaBoots)-thetaHat
## [1] 0.002743426
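The bootstrap standard error is usually taken as the standard deviation of the bootstrap replicates, while the root MSE about the original estimate also absorbs the bias. The sketch below shows how the two relate; it uses Murder and Assault from USArrests as a stand-in for the measles columns so that it runs without dslabs, and fixes a seed for reproducibility.

```r
# Bootstrap standard error, bias, and root MSE for Kendall's tau (stand-in data).
set.seed(42)
x <- USArrests[, c("Murder", "Assault")]
n <- nrow(x)
B <- 1000  # number of bootstrap resamples

theta_hat <- cor(x$Murder, x$Assault, method = "kendall")

theta_boot <- replicate(B, {
  idx <- sample(n, n, replace = TRUE)  # resample rows (pairs) with replacement
  cor(x$Murder[idx], x$Assault[idx], method = "kendall")
})

se_boot   <- sd(theta_boot)                         # bootstrap standard error
bias_boot <- mean(theta_boot) - theta_hat           # bootstrap bias estimate
rmse_boot <- sqrt(mean((theta_boot - theta_hat)^2)) # root MSE about theta_hat

# Decomposition: MSE = ((B - 1) / B) * SE^2 + bias^2
print(c(se = se_boot, bias = bias_boot, rmse = rmse_boot))
```

When the bias is small relative to the standard error, the root MSE and the standard error are nearly identical.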

(D)

(10 points) Perform PCA on the measles data with states as variables. (Remember to exclude year as a predictor) How many components are needed to account for 85% of the variability? Also, try to interpret the first two principal components.

scaled = scale(measles[,2:52])
s = cor(scaled)
e = eigen(s)
sum(e$values[1:8])/sum(e$values)
## [1] 0.8410743
sum(e$values[1:9])/sum(e$values)
## [1] 0.8607708
e$vectors[,2]
##  [1] -0.042731298  0.300024217  0.247690985 -0.082508114  0.061962413
##  [6] -0.011455876  0.042029374 -0.087914219 -0.176021423  0.070406716
## [11] -0.161494198  0.260656919  0.155358321 -0.098003499  0.013318903
## [16]  0.137085486 -0.179860957  0.107568440 -0.181271454  0.025486074
## [21] -0.164185121 -0.065003456  0.012377601 -0.124337903  0.270562714
## [26] -0.194138301  0.074427762 -0.151788097  0.236351866 -0.061583381
## [31] -0.019806893  0.052276556 -0.058355691 -0.245930832  0.112592318
## [36] -0.066697673 -0.069342825  0.176361081 -0.185198560 -0.002788881
## [41] -0.099533976 -0.206866330  0.186433116  0.188351499  0.018775683
## [46]  0.017323594 -0.004395603  0.155900243  0.108167943  0.030975257
## [51] -0.074709929
pc.measles=prcomp(scaled)
summary(pc.measles)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     5.0717 2.3906 1.73002 1.55793 1.49685 1.21572
## Proportion of Variance 0.5044 0.1120 0.05869 0.04759 0.04393 0.02898
## Cumulative Proportion  0.5044 0.6164 0.67510 0.72269 0.76662 0.79560
##                            PC7     PC8    PC9    PC10    PC11    PC12
## Standard deviation     1.08633 1.06723 1.0023 0.91873 0.81014 0.80742
## Proportion of Variance 0.02314 0.02233 0.0197 0.01655 0.01287 0.01278
## Cumulative Proportion  0.81874 0.84107 0.8608 0.87732 0.89019 0.90297
##                           PC13    PC14    PC15   PC16    PC17    PC18
## Standard deviation     0.74147 0.71233 0.65939 0.6101 0.58548 0.57551
## Proportion of Variance 0.01078 0.00995 0.00853 0.0073 0.00672 0.00649
## Cumulative Proportion  0.91375 0.92370 0.93223 0.9395 0.94625 0.95274
##                           PC19    PC20    PC21    PC22    PC23    PC24
## Standard deviation     0.55033 0.53632 0.53413 0.47762 0.45042 0.42163
## Proportion of Variance 0.00594 0.00564 0.00559 0.00447 0.00398 0.00349
## Cumulative Proportion  0.95868 0.96432 0.96991 0.97439 0.97837 0.98185
##                           PC25    PC26    PC27    PC28    PC29    PC30
## Standard deviation     0.39860 0.37323 0.32642 0.30369 0.28084 0.25949
## Proportion of Variance 0.00312 0.00273 0.00209 0.00181 0.00155 0.00132
## Cumulative Proportion  0.98497 0.98770 0.98979 0.99160 0.99314 0.99446
##                           PC31    PC32    PC33    PC34    PC35    PC36
## Standard deviation     0.24514 0.23105 0.17986 0.17410 0.16067 0.13137
## Proportion of Variance 0.00118 0.00105 0.00063 0.00059 0.00051 0.00034
## Cumulative Proportion  0.99564 0.99669 0.99732 0.99792 0.99842 0.99876
##                           PC37    PC38    PC39    PC40    PC41    PC42
## Standard deviation     0.11746 0.10899 0.09484 0.08418 0.07666 0.06662
## Proportion of Variance 0.00027 0.00023 0.00018 0.00014 0.00012 0.00009
## Cumulative Proportion  0.99903 0.99926 0.99944 0.99958 0.99969 0.99978
##                           PC43    PC44    PC45    PC46    PC47    PC48
## Standard deviation     0.06287 0.04965 0.04220 0.03468 0.03022 0.02156
## Proportion of Variance 0.00008 0.00005 0.00003 0.00002 0.00002 0.00001
## Cumulative Proportion  0.99986 0.99991 0.99994 0.99997 0.99998 0.99999
##                           PC49   PC50     PC51
## Standard deviation     0.01338 0.0106 0.008771
## Proportion of Variance 0.00000 0.0000 0.000000
## Cumulative Proportion  1.00000 1.0000 1.000000
#screeplot(pc.measles, type='lines', main="Scree Plot")

summary(measles$Alaska)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00  222.47   21.75 2511.00
summary(measles$Arizona)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00    32.75   642.50  2115.92  3630.25 10609.00

You need 9 principal components to account for at least 85% of the variability: the first 8 explain 84.1% and the first 9 explain 86.1%.
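The count can be read off programmatically rather than by trying 8 and then 9 components. This sketch first uses the cumulative proportions printed in the summary above, then shows the same calculation from a prcomp object; components_for is a hypothetical helper and USArrests is only a stand-in data set.

```r
# Smallest number of components whose cumulative variance share reaches a target.

# Cumulative proportions for PC1..PC10, copied from the summary output above
cum_prop <- c(0.5044, 0.6164, 0.67510, 0.72269, 0.76662,
              0.79560, 0.81874, 0.84107, 0.8608, 0.87732)
k_needed <- which(cum_prop >= 0.85)[1]
print(k_needed)  # 9

# The same idea starting from a prcomp fit (hypothetical helper)
components_for <- function(pc, target) {
  prop <- pc$sdev^2 / sum(pc$sdev^2)  # variance share of each component
  which(cumsum(prop) >= target)[1]
}

pc_demo <- prcomp(USArrests, scale. = TRUE)  # stand-in data, not the measles data
k_demo  <- components_for(pc_demo, 0.85)
print(k_demo)
```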

The first component appears to load on all states except the ones with 0s for the first many years (the states with the smallest loadings are Alaska, Mississippi, Nevada, and Hawaii), and it is most strongly composed of Alabama, New York, Wisconsin, and Virginia.

The second component seems to focus mostly on states that had many cases in the later years, such as Alaska, Nevada, Tennessee, Texas, North Carolina, Mississippi, and Hawaii (the opposite of component 1).
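Interpreting a component is easier when the loadings carry variable names and are ranked by magnitude. The sketch below does this on USArrests as a stand-in data set; the final comment shows how names might be attached to the eigenvectors computed with eigen(), which is an assumption about usage rather than part of the original code.

```r
# Rank the variables by the size of their loading on a principal component.
pc_demo <- prcomp(USArrests, scale. = TRUE)  # stand-in data, not the measles data

loadings_pc1 <- pc_demo$rotation[, "PC1"]    # named vector, one entry per variable
ranked <- sort(abs(loadings_pc1), decreasing = TRUE)
print(round(ranked, 3))

# Each loading vector has unit length, so squared loadings sum to 1
print(sum(loadings_pc1^2))

# With eigen() output, names can be attached first, e.g.
#   rownames(e$vectors) <- colnames(scaled)
```

The sign of a loading vector is arbitrary, so the ranking uses absolute values.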

(E)

(10 points) Perform hierarchical clustering on the measles data with states as observations. (Remember to exclude year data). How many clusters do there appear to be? Try to interpret each of the clusters.

rownames(scaled) = measles$year
d <- dist(scaled, method = "canberra")
hc <- hclust(d, method ="average")
plot(hc)

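There appear to be two very distinct clusters: the 1930-1960s and 1970-2003. The cluster on the right seems to represent years with a large number of measles cases, while the cluster on the left seems to be years with far fewer cases. Note that because the rows of scaled are years, this dendrogram groups years rather than states; clustering the states themselves would require transposing the matrix first.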
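Because dist() measures distances between rows, whichever dimension is in the rows is what gets clustered. The sketch below uses a small made-up matrix (toy, with years in rows and four hypothetical states A to D in columns) to show how transposing switches the clustering from years to states.

```r
# dist() works on rows, so transpose to cluster the columns (states) instead.
# toy is made-up data: rows are years, columns are hypothetical states.
toy <- cbind(A = c(10, 12, 11, 9),
             B = c(11, 13, 10, 9),
             C = c(100, 90, 95, 110),
             D = c(105, 92, 98, 108))
rownames(toy) <- 2000:2003

hc_years  <- hclust(dist(toy),    method = "average")  # clusters the 4 years
hc_states <- hclust(dist(t(toy)), method = "average")  # clusters the 4 states

groups <- cutree(hc_states, k = 2)  # cut the state dendrogram into two groups
print(groups)
```

In this toy example the two low-count states end up in one group and the two high-count states in the other.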