(10 points) What is the difference between parametric and non-parametric statistics (2-3 lines should suffice for an answer)?
Parametric tests assume the data come from an underlying theoretical distribution, such as the normal. Non-parametric tests are used when no such distributional assumption can be made. Parametric tests are more powerful than non-parametric tests, so they should be used when their assumptions hold.
(10 points) Give an example of a hypothesis test that can be tested using both parametric and non-parametric tests. State the null and alternative hypotheses and describe (or give the names of) the parametric and non-parametric tests used to test the hypothesis that you have given as an example. (HINT: Pick an example that is easy for yourself!)
The one-sample t-test for a mean has the sign test as its non-parametric counterpart. \[H_0: \theta = 0\] \[H_a: \theta > 0\] In the parametric one-sample t-test \(\theta\) is the mean of the data set, while in the non-parametric sign test \(\theta\) is the median (both are measures of center). The t-test uses the t-distribution to measure how far the sample mean is above 0 and finds the corresponding one-sided p-value.
The sign test assumes that under the null hypothesis each observation has a 50% chance of being above 0, and it counts the number of values greater than 0 in the sample. It then calculates the probability of a result as extreme as, or more extreme than, the one observed. For example, if 10 out of 12 values are greater than 0, the p-value is the probability of seeing 10, 11, or 12 out of 12 values greater than 0, calculated from the binomial distribution with p = 0.5.
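As a quick sketch of both tests (this example is not part of the original answer; the sample x below is hypothetical), base R's t.test() performs the parametric version and binom.test() performs exactly the binomial calculation described above:

# Hypothetical sample for illustration only: 10 of its 12 values are positive
x <- c(1.2, 0.4, -0.3, 2.1, 0.8, 1.5, -0.1, 0.9, 1.1, 0.6, 0.2, 1.8)
# Parametric: one-sample t-test of H0: mean = 0 vs Ha: mean > 0
t.test(x, mu = 0, alternative = "greater")
# Non-parametric: sign test of H0: median = 0 vs Ha: median > 0,
# i.e. a binomial test on the number of positive values with p = 0.5
binom.test(sum(x > 0), length(x), p = 0.5, alternative = "greater")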
(10 points) A random forest consists of many trees, but there is a difference between the trees in a random forest and a regular tree model. Describe the differences between these trees.
The trees in a random forest are each built on a bootstrap sample of the data, and at each split only a random subset of the variables is considered. The resulting random forest model aggregates its trees (by averaging, or a majority vote for classification) to make predictions. A regular tree model is built once on the full data and considers all variables at every split (not all end up being chosen, and seeing which ones are important is an advantage of this type of model).
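As a minimal illustration of the difference in code (not part of the original answer; mtcars is used only as a stand-in data set), tree() grows one tree on the full data and considers every predictor at each split, while randomForest() grows many trees (ntree) on bootstrap samples and tries only mtry randomly chosen predictors at each split:

library(tree)
library(randomForest)
# One regular tree: all predictors are candidates at every split
single.tree <- tree(mpg ~ ., data = mtcars)
# A random forest: 500 trees on bootstrap samples, 3 predictors tried per split
forest <- randomForest(mpg ~ ., data = mtcars, ntree = 500, mtry = 3)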
Answer the following questions with the “USArrests” dataset in R.
(10 points) Using the “USArrests” data set that is pre-loaded in R, construct a kernel density estimator of the number of assault arrests and show the density in a figure. (hint: ggplot2 makes this very easy).
n = nrow(USArrests)
S = sd(USArrests$Assault)
delta = 1.06/n^(1/5)*S; delta #Härdle's rule of thumb: 1.06*sd*n^(-1/5)
## [1] 40.39738
hist(USArrests$Assault,main="Delta=40",freq=FALSE) #histogram on the density scale
dens<-density(USArrests$Assault,bw=delta) #kernel density estimate with bandwidth delta
points(dens$x,dens$y,type="l",col="red",lwd=3) #overlay the KDE as a red curve
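For completeness, here is the ggplot2 route mentioned in the hint (a sketch, not part of the original answer), passing the same rule-of-thumb bandwidth delta to geom_density():

library(ggplot2)
# Kernel density estimate of the assault arrests with bandwidth delta
ggplot(USArrests, aes(x = Assault)) + geom_density(bw = delta)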
(10 points) Using the “USArrests” data set that is pre-loaded in R, fit a loess regression model with UrbanPop as a predictor and Assault as the response variable. (hint: ggplot2 makes this very easy).
lw = loess(Assault~UrbanPop,data=USArrests) #loess fit of Assault on UrbanPop
plot(USArrests$UrbanPop, USArrests$Assault) #scatterplot of the raw data
j <- order(USArrests$UrbanPop) #sort by the predictor so the fitted line draws left to right
points(USArrests$UrbanPop[j], lw$fitted[j],col="red",lwd=3,type="l") #overlay the fitted loess curve
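And the ggplot2 version from the hint (again a sketch, not part of the original answer): geom_smooth() fits the same kind of loess model and adds a confidence band by default.

library(ggplot2)
# Scatterplot of Assault against UrbanPop with a loess smooth
ggplot(USArrests, aes(x = UrbanPop, y = Assault)) +
  geom_point() +
  geom_smooth(method = "loess")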
Answer the following questions with the data sets in the dslabs package. We will focus on measles. (I did some data cleaning for you. Also, assume that the 0’s in the early years are correct.)
library(dslabs)
library(reshape2)
measles <- subset(us_contagious_diseases, disease == "Measles")
measles <- melt(measles, id = c("disease","state","year"), measure = "count")
measles <- dcast(measles, formula = year ~ state)
#Replace the state names that contain spaces with abbreviations.
names(measles)[10]<-"DC"
names(measles)[31]<-"NH"
names(measles)[32]<-"NJ"
names(measles)[33]<-"NM"
names(measles)[34]<-"NY"
names(measles)[35]<-"NC"
names(measles)[36]<-"ND"
names(measles)[41]<-"RI"
names(measles)[42]<-"SC"
names(measles)[43]<-"SD"
names(measles)[50]<-"WV"
(10 points) Construct a CART model to predict the number of measles cases in Illinois using the number of cases in other states as potential predictors. Display the best tree that you find and try to offer some interpretation.
library(tree)
tree1<-tree(Illinois~.-year , data = measles) #regression tree for Illinois cases, year excluded
#summary(tree1)
#predict(tree1)
plot(tree1)
text(tree1, cex=.75) #label the splits and the leaf predictions
I used the default settings, because I know they tune the tree better than I would by hand. I was planning on learning from my homework assignment feedback.
This tree uses the measles counts in Indiana, West Virginia, and North Carolina. If the number of measles cases in Indiana is less than 8771.5, we look at West Virginia: fewer than 1676 cases there predicts 1206 cases in Illinois, while 1676 or more predicts 12840. If Indiana has 8771.5 or more cases, we look at North Carolina: fewer than 10117.5 cases there predicts 29070 cases in Illinois, while 10117.5 or more predicts 56350.
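Had I wanted to tune rather than accept the defaults, the usual approach with the tree package is to cross-validate over tree size and prune to the size with the smallest deviance; this is a sketch (not part of the original answer), and the pruned tree may well coincide with the default one:

cv1 <- cv.tree(tree1) #cross-validated deviance for each candidate tree size
best.size <- cv1$size[which.min(cv1$dev)]
pruned <- prune.tree(tree1, best = best.size)
plot(pruned); text(pruned, cex=.75)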
(10 points) Construct a random forest model to predict the number of measles cases in Illinois using the number of cases in other states as potential predictors. Is the random forest model better than the CART model at predicting measles cases?
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
rf = randomForest(Illinois~.-year , data = measles)
#p.rf = predict(rf, type = "response")
The random forest model should be better than the CART model at predicting measles cases in Illinois: it averages many trees grown on bootstrap samples, each using random subsets of the predictors, which reduces variance relative to a single tree.
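To check this rather than just assert it, a rough sketch (not part of the original answer) compares the CART model's in-sample error with the random forest's out-of-bag error; the comparison even favors the tree, since its error is measured on the data it was fit to, while the out-of-bag error is an honest estimate of prediction error:

cart.mse <- mean((measles$Illinois - predict(tree1))^2) #in-sample MSE of the CART model
rf.mse <- mean((measles$Illinois - rf$predicted)^2) #OOB MSE: rf$predicted holds out-of-bag predictions
c(CART = cart.mse, RandomForest = rf.mse)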
(10 points) Compute Kendall’s-τ between the Illinois and Massachusetts measles cases data. Compute the standard error of this estimator using the bootstrap procedure. Also, estimate the bias of this estimator.
Kendall’s Tau:
thetaHat = cor(measles$Illinois, measles$Massachusetts, method = "kendall")
Bootstrap standard error (computed via the bootstrap MSE):
x = measles[,c(15,23)] #the Illinois and Massachusetts columns
n = nrow(x)
nsim = 1000 #number of bootstrap resamples
thetaBoots = rep(NA,nsim)
for (i in 1:nsim){
  bootsSample = x[sample(1:n,n,replace=TRUE),] #resample the years (rows) with replacement
  thetaBoots[i] = cor(bootsSample$Illinois, bootsSample$Massachusetts, method = "kendall")
}
#Bootstrap estimate of MSE
MSE = mean((thetaBoots-thetaHat)^2)
#Standard Error
sqrt(MSE)
## [1] 0.05029133
Bias estimate:
mean(thetaBoots)-thetaHat
## [1] 0.002743426
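As a sanity check (not part of the original answer), the conventional bootstrap standard error is simply the standard deviation of the bootstrap replicates, and it should agree closely with sqrt(MSE) above because the estimated bias is tiny:

sd(thetaBoots) #conventional bootstrap standard error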
(10 points) Perform PCA on the measles data with states as variables. (Remember to exclude year as a predictor.) How many components are needed to account for 85% of the variability? Also, try to interpret the first two principal components.
scaled = scale(measles[,2:52]) #standardize the 51 state columns (year excluded)
s = cor(scaled) #correlation matrix of the states
e = eigen(s) #eigen decomposition: values = variances, vectors = loadings
sum(e$values[1:8])/sum(e$values) #proportion of variance explained by the first 8 components
## [1] 0.8410743
sum(e$values[1:9])/sum(e$values) #proportion explained by the first 9 components
## [1] 0.8607708
e$vectors[,2] #loadings on the second principal component
## [1] -0.042731298 0.300024217 0.247690985 -0.082508114 0.061962413
## [6] -0.011455876 0.042029374 -0.087914219 -0.176021423 0.070406716
## [11] -0.161494198 0.260656919 0.155358321 -0.098003499 0.013318903
## [16] 0.137085486 -0.179860957 0.107568440 -0.181271454 0.025486074
## [21] -0.164185121 -0.065003456 0.012377601 -0.124337903 0.270562714
## [26] -0.194138301 0.074427762 -0.151788097 0.236351866 -0.061583381
## [31] -0.019806893 0.052276556 -0.058355691 -0.245930832 0.112592318
## [36] -0.066697673 -0.069342825 0.176361081 -0.185198560 -0.002788881
## [41] -0.099533976 -0.206866330 0.186433116 0.188351499 0.018775683
## [46] 0.017323594 -0.004395603 0.155900243 0.108167943 0.030975257
## [51] -0.074709929
pc.measles=prcomp(scaled) #the same PCA via prcomp
summary(pc.measles)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 5.0717 2.3906 1.73002 1.55793 1.49685 1.21572
## Proportion of Variance 0.5044 0.1120 0.05869 0.04759 0.04393 0.02898
## Cumulative Proportion 0.5044 0.6164 0.67510 0.72269 0.76662 0.79560
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 1.08633 1.06723 1.0023 0.91873 0.81014 0.80742
## Proportion of Variance 0.02314 0.02233 0.0197 0.01655 0.01287 0.01278
## Cumulative Proportion 0.81874 0.84107 0.8608 0.87732 0.89019 0.90297
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.74147 0.71233 0.65939 0.6101 0.58548 0.57551
## Proportion of Variance 0.01078 0.00995 0.00853 0.0073 0.00672 0.00649
## Cumulative Proportion 0.91375 0.92370 0.93223 0.9395 0.94625 0.95274
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.55033 0.53632 0.53413 0.47762 0.45042 0.42163
## Proportion of Variance 0.00594 0.00564 0.00559 0.00447 0.00398 0.00349
## Cumulative Proportion 0.95868 0.96432 0.96991 0.97439 0.97837 0.98185
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 0.39860 0.37323 0.32642 0.30369 0.28084 0.25949
## Proportion of Variance 0.00312 0.00273 0.00209 0.00181 0.00155 0.00132
## Cumulative Proportion 0.98497 0.98770 0.98979 0.99160 0.99314 0.99446
## PC31 PC32 PC33 PC34 PC35 PC36
## Standard deviation 0.24514 0.23105 0.17986 0.17410 0.16067 0.13137
## Proportion of Variance 0.00118 0.00105 0.00063 0.00059 0.00051 0.00034
## Cumulative Proportion 0.99564 0.99669 0.99732 0.99792 0.99842 0.99876
## PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.11746 0.10899 0.09484 0.08418 0.07666 0.06662
## Proportion of Variance 0.00027 0.00023 0.00018 0.00014 0.00012 0.00009
## Cumulative Proportion 0.99903 0.99926 0.99944 0.99958 0.99969 0.99978
## PC43 PC44 PC45 PC46 PC47 PC48
## Standard deviation 0.06287 0.04965 0.04220 0.03468 0.03022 0.02156
## Proportion of Variance 0.00008 0.00005 0.00003 0.00002 0.00002 0.00001
## Cumulative Proportion 0.99986 0.99991 0.99994 0.99997 0.99998 0.99999
## PC49 PC50 PC51
## Standard deviation 0.01338 0.0106 0.008771
## Proportion of Variance 0.00000 0.0000 0.000000
## Cumulative Proportion 1.00000 1.0000 1.000000
#screeplot(pc.measles, type='lines', main="Scree Plot")
summary(measles$Alaska)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 222.47 21.75 2511.00
summary(measles$Arizona)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 32.75 642.50 2115.92 3630.25 10609.00
You need 9 principal components to account for at least 85% of the variability.
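The same count can be read off programmatically; this is a small sketch (not part of the original answer) using the eigenvalues computed above:

#Smallest number of components whose cumulative share of variance reaches 85%
which(cumsum(e$values)/sum(e$values) >= 0.85)[1]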
The first component appears to load on essentially all states except those with 0s for many of the early years (Alaska, Mississippi, Nevada, and Hawaii), and it is most strongly composed of Alabama, New York, Wisconsin, and Virginia.
The second component seems to focus mostly on states that had many cases in the later years, such as Alaska, Nevada, Tennessee, Texas, North Carolina, Mississippi, and Hawaii (roughly the opposite of the first component).
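To make the loadings easier to read than the bare vector printed above, a small sketch (not part of the original answer) attaches the state names and sorts each component:

#Loadings with state names, sorted, for the first two principal components
pc1 <- sort(pc.measles$rotation[,1]); head(pc1); tail(pc1)
pc2 <- sort(pc.measles$rotation[,2]); head(pc2); tail(pc2)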
(10 points) Perform hierarchical clustering on the measles data with states as observations. (Remember to exclude year data.) How many clusters do there appear to be? Try to interpret each of the clusters.
rownames(scaled) = measles$year #label the rows (observations) by year
d <- dist(scaled, method = "canberra") #Canberra distances between years
hc <- hclust(d, method ="average") #average-linkage hierarchical clustering
plot(hc)
There appear to be two very distinct clusters: the 1930s-1960s and 1970-2003. The cluster on the right seems to represent years with a large number of measles cases, while the cluster on the left seems to contain years with significantly fewer cases.
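A quick way to check the two-cluster reading (a sketch, not part of the original answer) is to cut the dendrogram into two groups and list which years fall in each:

groups <- cutree(hc, k = 2) #assign each year to one of two clusters
table(groups)
split(measles$year, groups) #years in each cluster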