Parametric statistics make assumptions about the probability distributions of the variables; these assumptions often invoke a named parametric family or the CLT for large samples. Nonparametric statistics do not make these distributional assumptions about the data, so nonparametric tests are more robust than parametric tests, meaning they are valid in a broader range of situations.
Parametric: more assumptions (not as robust), more power.
Non-parametric: fewer assumptions (robust), less power (sometimes).
Let \(\Delta\) be the difference in means between two groups, and assume independence between the groups.
We have two options: 1. use the parametric two-sample independent t-test, or 2. use the nonparametric permutation test (or compute with ranks and use the Wilcoxon rank-sum test).
\(H_0 : \Delta = 0\) \(H_A : \Delta \neq 0\)
For the two-sample t-test, \(\Delta\) compares the means of the data, whereas the permutation/Wilcoxon rank-sum test compares the medians (a rank-based location shift). Both the mean and the median measure the center of a data set, but the mean can be heavily influenced by an outlier.
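Only the Wilcoxon version is run below; for completeness, here is a minimal sketch of a permutation test for the difference in medians (the data here are hypothetical, not the g1/g2 used later):
set.seed(1)
a <- c(38.9, 61.2, 73.3, 21.8, 63.4)
b <- c(80, 60, 63.4, 76, 89.4)
obs <- median(a) - median(b)                # observed test statistic
pooled <- c(a, b)
perm <- replicate(10000, {
  idx <- sample(length(pooled), length(a))  # random relabeling under H0
  median(pooled[idx]) - median(pooled[-idx])
})
mean(abs(perm) >= abs(obs))                 # two-sided permutation p-value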
Below, we show an example where the two tests provide different results due to an outlier.
With the two-sample t-test, we obtain a p-value of 0.3315. This is because we have thrown an outlier into group two, which skews its mean heavily to the right and inflates the variance so much that the t-test cannot declare the difference significant.
With the Wilcoxon rank-sum test, the p-value is 0.02 < α = 0.05, so we reject \(H_0\) and argue that the true difference in medians is not equal to 0.
Without the outlier, though, the two tests yield the same conclusion (both reject \(H_0\)).
#With an Outlier
g1 <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
g2 <- c(100000, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)
# Create a data frame
my_data <- data.frame(
group = rep(c("G1", "G2"), each = 9),
tally = c(g1, g2)
)
ttest <- t.test(tally ~ group, data = my_data, var.equal = TRUE)
ttest
##
## Two Sample t-test
##
## data: tally by group
## t = -1.0015, df = 16, p-value = 0.3315
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -34658.69 12417.76
## sample estimates:
## mean in group G1 mean in group G2
## 52.10 11172.57
wilcox.test(g1,g2)
## Warning in wilcox.test.default(g1, g2): cannot compute exact p-value with
## ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: g1 and g2
## W = 14, p-value = 0.02155
## alternative hypothesis: true location shift is not equal to 0
#Without an Outlier
g1 <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
g2 <- c(80, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)
# Create a data frame
my_data <- data.frame(
group = rep(c("G1", "G2"), each = 9),
tally = c(g1, g2)
)
ttest <- t.test(tally ~ group, data = my_data, var.equal = TRUE)
ttest
##
## Two Sample t-test
##
## data: tally by group
## t = -2.9507, df = 16, p-value = 0.009396
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -31.351864 -5.137025
## sample estimates:
## mean in group G1 mean in group G2
## 52.10000 70.34444
wilcox.test(g1,g2)
## Warning in wilcox.test.default(g1, g2): cannot compute exact p-value with
## ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: g1 and g2
## W = 14, p-value = 0.02155
## alternative hypothesis: true location shift is not equal to 0
Decision trees are used for classification and regression. Tree computations are fast, easily interpretable, can deal with missing data, and accommodate both smooth and jagged responses. Trees are built on the entire dataset and use all the variables of interest.
Decision trees use algorithms that determine an optimal choice at each node. The greedy algorithm is the most popular: it takes the locally optimal decision at each step but does not guarantee the global optimum.
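As an illustration of the greedy step, here is a sketch of the split search for a single numeric predictor, choosing the cut that minimizes the within-node sum of squares (rpart's actual search also handles factors, weights, and surrogate splits):
# Greedy search for the best split point on one predictor
best_split <- function(x, y) {
  cuts <- sort(unique(x))[-1]               # candidate cut points
  sse <- sapply(cuts, function(cut) {
    left <- y[x < cut]; right <- y[x >= cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cuts[which.min(sse)]                      # cut with the smallest total SSE
}
best_split(mtcars$wt, mtcars$mpg)           # e.g., best split of weight for predicting mpg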
A random forest selects random subsets of the observations/rows and of the variables, builds multiple decision trees from them, and then combines the results. For regression, the trees' predictions are averaged; for classification, after a large number of trees are built, each tree “votes” for a class, and the class receiving the most votes by simple majority is the “winner” or predicted class.
#By Hand
set.seed(1234)
n <- 100
x <- rnorm(n)                                   # simulated sample (not used below)
# Gaussian kernel density estimate at a point z, with bandwidth Delta
kern <- function(z, x, Delta=.1){
  n <- length(x)
  1/(n*Delta)*sum(dnorm((z-x)/Delta))
}
S <- sd(USArrests$Assault)                      # sample sd (not used below)
n <- length(USArrests$Assault)
hist(USArrests$Assault, main="Delta=25", freq=FALSE)
dens <- density(USArrests$Assault, bw=25)       # built-in KDE with bandwidth 25
points(dens$x, dens$y, type="l", col="red", lwd=3)
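To check that the hand-rolled kern() agrees with density(), we can evaluate it on a grid and overlay the curve; with a Gaussian kernel, Delta plays the same role as bw:
z <- seq(min(USArrests$Assault) - 50, max(USArrests$Assault) + 50, length.out = 200)
fhat <- sapply(z, kern, x = USArrests$Assault, Delta = 25)   # by-hand KDE on the grid
points(z, fhat, type = "l", col = "blue", lwd = 2, lty = 2)  # should trace the red curve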
#Part B.
library(ggplot2)
loess(UrbanPop~Assault, data=USArrests)
## Call:
## loess(formula = UrbanPop ~ Assault, data = USArrests)
##
## Number of Observations: 50
## Equivalent Number of Parameters: 4.58
## Residual Standard Error: 13.91
ggplot(USArrests, aes(UrbanPop, Assault)) +
geom_point() +
geom_smooth(span=0.6)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
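For comparison, here is a base-R sketch of the loess fit. Note that the ggplot call maps UrbanPop to x, so geom_smooth is actually smoothing Assault on UrbanPop, the reverse of the loess() call above:
fit <- loess(UrbanPop ~ Assault, data = USArrests)
grid <- seq(min(USArrests$Assault), max(USArrests$Assault), length.out = 100)
plot(UrbanPop ~ Assault, data = USArrests)                     # scatterplot
lines(grid, predict(fit, newdata = data.frame(Assault = grid)),
      col = "red", lwd = 2)                                    # loess curve over the grid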
library(dslabs)
library(reshape2)
measles <- subset(us_contagious_diseases, disease == "Measles")
measles <- melt(measles, id = c("disease","state","year"), measure = "count")
measles <- dcast(measles, formula = year ~ state)
#Abbreviate the state names that contain spaces (e.g., "New Hampshire" -> "NH").
names(measles)[10]<-"DC"
names(measles)[31]<-"NH"
names(measles)[32]<-"NJ"
names(measles)[33]<-"NM"
names(measles)[34]<-"NY"
names(measles)[35]<-"NC"
names(measles)[36]<-"ND"
names(measles)[41]<-"RI"
names(measles)[42]<-"SC"
names(measles)[43]<-"SD"
names(measles)[50]<-"WV"
#Split the data in half so we can compare models by test MSE
set.seed(1234)
train <- sample(1:dim(measles)[1], dim(measles)[1] / 2, replace =FALSE )
test <- -train
m.train <- measles[train, ]
m.test <- measles[test, ]
From this tree, it looks like Indiana and Michigan should definitely be in the model to help predict the number of cases of measles in Illinois. This makes sense, as they are geographically close to Illinois.
The tree reads as follows: if the number of measles cases in Indiana is less than 8772, we flow down to the left and look at the number of cases in Michigan. If the number of cases in Michigan is less than 5446, we flow down and look at Michigan again: if the cases are less than 391, we predict 172.3 cases in Illinois; if not, we predict 2174 cases. If the number of cases in Michigan is 5446 or more, we predict 11180 cases in Illinois. Finally, if the number of cases in Indiana is 8772 or more, we predict 45260 cases in Illinois.
Note: I have removed year from the predictors.
#CART - Classification and Regression Tree
library(rpart)
model <- rpart(Illinois ~.-year, data = m.train, control = rpart.control(cp = 0.0001))
par(xpd = NA) # otherwise on some devices the text is clipped
plot(model)
text(model)
print(model)
## n= 38
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 38 13949170000 11453.6600
## 2) Indiana< 8771.5 31 897773000 3819.1940
## 4) Michigan< 5445.5 22 28446070 809.2273
## 8) Michigan< 391 15 1232157 172.3333 *
## 9) Michigan>=391 7 8091168 2174.0000 *
## 5) Michigan>=5445.5 9 182788100 11176.8900 *
## 3) Indiana>=8771.5 7 3242852000 45263.4300 *
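To trace these rules in code, we can push a hypothetical observation through the tree (the Indiana and Michigan values below are assumed; the other columns are borrowed from a test row):
newrow <- m.test[1, ]        # borrow a row so all columns are present
newrow$Indiana <- 3000       # hypothetical: Indiana < 8771.5
newrow$Michigan <- 6000      # hypothetical: Michigan >= 5445.5
predict(model, newrow)       # follows nodes 1 -> 2 -> 5: about 11177 cases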
bestcp <- model$cptable[which.min(model$cptable[,"xerror"]),"CP"]
tree.pruned <- prune(model, cp = bestcp)
tree.pruned
## n= 38
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 38 13949170000 11453.6600
## 2) Indiana< 8771.5 31 897773000 3819.1940
## 4) Michigan< 5445.5 22 28446070 809.2273
## 8) Michigan< 391 15 1232157 172.3333 *
## 9) Michigan>=391 7 8091168 2174.0000 *
## 5) Michigan>=5445.5 9 182788100 11176.8900 *
## 3) Indiana>=8771.5 7 3242852000 45263.4300 *
predicted_cart <- predict(tree.pruned, m.test)
mean((predicted_cart - m.test$Illinois)^2) #Test MSE
## [1] 68725063
#87533226 MSE without pruning
#68725063 MSE with pruning cp = 0.0001
The random forest test MSE is 30728628, while the best CART test MSE was 68725063. So, the random forest is better at predicting the measles counts.
The Variable Importance Plot supports the same conclusion as above: Michigan and Indiana are important variables in this model.
Note: I have removed year from the predictors.
set.seed(1234)
library(randomForest)
model_RF <-randomForest(Illinois ~.-year, data=m.train, importance =TRUE, ntree = 500)
plot(model_RF)
varImpPlot(model_RF)
predicted_RF <- predict(model_RF, newdata=m.test)   # OOB predictions apply only to the training data, so plain test-set predictions are used here
mean((predicted_RF-m.test$Illinois)^2)              # Test MSE
## [1] 30728628
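To put numbers behind the importance plot, the permutation importance computed with importance = TRUE can be sorted directly:
head(sort(importance(model_RF)[, "%IncMSE"], decreasing = TRUE), 5)  # top 5 states by %IncMSE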
For Illinois and Massachusetts, Kendall's Tau = 0.6707.
The standard error of this estimator is approximately 0.0524 using the bootstrap procedure below.
The bootstrap estimate of the bias is 0.001037724.
#Kendall's Tau
thetaHat <- cor(measles$Illinois, measles$Massachusetts, method = "kendall")
thetaHat
## [1] 0.6707704
#Bootstrapping Standard Error Procedure
set.seed(1234)
x <- measles[,c(15,23)]
n <- nrow(x)
nsim <- 1000
thetaBoots <- rep(NA, nsim)
for (i in 1:nsim){
  bootsSample <- x[sample(1:n, n, replace=TRUE), ]   # resample rows with replacement
  thetaBoots[i] <- cor(bootsSample$Illinois, bootsSample$Massachusetts, method = "kendall")
}
#Bootstrap estimate of MSE around the original estimate
MSE <- mean((thetaBoots-thetaHat)^2)
#Standard Error (sqrt of MSE; close to sd(thetaBoots) here since the bias is tiny)
sqrt(MSE)
## [1] 0.05237864
#Bias
mean(thetaBoots)-thetaHat
## [1] 0.001037724
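A natural follow-up from the same bootstrap draws (not shown in the output above) is a percentile confidence interval:
quantile(thetaBoots, c(0.025, 0.975))   # 95% percentile bootstrap CI for Kendall's tau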
Below, I have removed year and scaled and centered the data, because raw counts are not comparable across states (e.g., New York's counts dwarf Wyoming's). Then, I have performed PCA.
We would need 9 principal components to account for 85% of the variability.
In the first principal component, the loadings are highest in magnitude for Alabama, Colorado, New York, Indiana, and Wisconsin (all negative).
In the second principal component, the loadings are highest in magnitude for Alaska, Arizona, Hawaii, Mississippi, and Nevada.
This is really hard to interpret because there are so many variables and none of them stands out very strongly. In the second principal component, Hawaii and Alaska have the largest loadings, and this makes sense because they are far removed from the continental US.
Since every loading in the first PC has the same sign, I would say that as the significant states in the first PC (Alabama, Colorado, New York, Indiana, and Wisconsin) increase, the remaining ones increase as well.
measlesPCA <- measles[,2:52]   # drop year; keep the 51 state columns
m.pca <- prcomp(measlesPCA, scale=TRUE, center=TRUE)
m.pca$rotation[,1:2]
## PC1 PC2
## Alabama -0.16077162 0.042731298
## Alaska -0.08266421 -0.300024217
## Arizona -0.12868551 -0.247690985
## Arkansas -0.13829837 0.082508114
## California -0.14897156 -0.061962413
## Colorado -0.16622331 0.011455876
## Connecticut -0.13511995 -0.042029374
## Delaware -0.12920511 0.087914219
## DC -0.14506752 0.176021423
## Florida -0.15794779 -0.070406716
## Georgia -0.14505390 0.161494198
## Hawaii -0.06869364 -0.260656919
## Idaho -0.14512041 -0.155358321
## Illinois -0.15108708 0.098003499
## Indiana -0.16114808 -0.013318903
## Iowa -0.13971884 -0.137085486
## Kansas -0.12183149 0.179860957
## Kentucky -0.15941506 -0.107568440
## Louisiana -0.12880993 0.181271454
## Maine -0.13859742 -0.025486074
## Maryland -0.13778829 0.164185121
## Massachusetts -0.15596602 0.065003456
## Michigan -0.15537818 -0.012377601
## Minnesota -0.10365411 0.124337903
## Mississippi -0.09166173 -0.270562714
## Missouri -0.14658836 0.194138301
## Montana -0.15092050 -0.074427762
## Nebraska -0.11563188 0.151788097
## Nevada -0.09348111 -0.236351866
## NH -0.13037741 0.061583381
## NJ -0.15068380 0.019806893
## NM -0.14319861 -0.052276556
## NY -0.16794996 0.058355691
## NC -0.11861754 0.245930832
## ND -0.15390149 -0.112592318
## Ohio -0.15699184 0.066697673
## Oklahoma -0.15138139 0.069342825
## Oregon -0.14008630 -0.176361081
## Pennsylvania -0.15428094 0.185198560
## RI -0.10660378 0.002788881
## SC -0.13809717 0.099533976
## SD -0.10164930 0.206866330
## Tennessee -0.14427407 -0.186433116
## Texas -0.14544375 -0.188351499
## Utah -0.13373339 -0.018775683
## Vermont -0.12245849 -0.017323594
## Virginia -0.16727171 0.004395603
## Washington -0.15211770 -0.155900243
## WV -0.15105109 -0.108167943
## Wisconsin -0.16534908 -0.030975257
## Wyoming -0.15047417 0.074709929
summary(m.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 5.0717 2.3906 1.73002 1.55793 1.49685 1.21572
## Proportion of Variance 0.5044 0.1120 0.05869 0.04759 0.04393 0.02898
## Cumulative Proportion 0.5044 0.6164 0.67510 0.72269 0.76662 0.79560
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 1.08633 1.06723 1.0023 0.91873 0.81014 0.80742
## Proportion of Variance 0.02314 0.02233 0.0197 0.01655 0.01287 0.01278
## Cumulative Proportion 0.81874 0.84107 0.8608 0.87732 0.89019 0.90297
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.74147 0.71233 0.65939 0.6101 0.58548 0.57551
## Proportion of Variance 0.01078 0.00995 0.00853 0.0073 0.00672 0.00649
## Cumulative Proportion 0.91375 0.92370 0.93223 0.9395 0.94625 0.95274
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.55033 0.53632 0.53413 0.47762 0.45042 0.42163
## Proportion of Variance 0.00594 0.00564 0.00559 0.00447 0.00398 0.00349
## Cumulative Proportion 0.95868 0.96432 0.96991 0.97439 0.97837 0.98185
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 0.39860 0.37323 0.32642 0.30369 0.28084 0.25949
## Proportion of Variance 0.00312 0.00273 0.00209 0.00181 0.00155 0.00132
## Cumulative Proportion 0.98497 0.98770 0.98979 0.99160 0.99314 0.99446
## PC31 PC32 PC33 PC34 PC35 PC36
## Standard deviation 0.24514 0.23105 0.17986 0.17410 0.16067 0.13137
## Proportion of Variance 0.00118 0.00105 0.00063 0.00059 0.00051 0.00034
## Cumulative Proportion 0.99564 0.99669 0.99732 0.99792 0.99842 0.99876
## PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.11746 0.10899 0.09484 0.08418 0.07666 0.06662
## Proportion of Variance 0.00027 0.00023 0.00018 0.00014 0.00012 0.00009
## Cumulative Proportion 0.99903 0.99926 0.99944 0.99958 0.99969 0.99978
## PC43 PC44 PC45 PC46 PC47 PC48
## Standard deviation 0.06287 0.04965 0.04220 0.03468 0.03022 0.02156
## Proportion of Variance 0.00008 0.00005 0.00003 0.00002 0.00002 0.00001
## Cumulative Proportion 0.99986 0.99991 0.99994 0.99997 0.99998 0.99999
## PC49 PC50 PC51
## Standard deviation 0.01338 0.0106 0.008771
## Proportion of Variance 0.00000 0.0000 0.000000
## Cumulative Proportion 1.00000 1.0000 1.000000
screeplot(m.pca)
biplot(m.pca, scale=0) # Yikes: with 51 variables this biplot is too crowded to interpret
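The claims above can be checked programmatically, both the 85% threshold and the common sign of the PC1 loadings:
cumvar <- cumsum(m.pca$sdev^2) / sum(m.pca$sdev^2)   # cumulative proportion of variance
which(cumvar >= 0.85)[1]                             # 9 components reach 85%
all(m.pca$rotation[, 1] < 0)                         # PC1 loadings all share one sign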
Below, I have performed many different methods of hierarchical clustering so that the results can be compared.
A cluster common to most of these dendrograms is Alaska, Hawaii, Mississippi, and Nevada. This cluster closely mirrors the loading analysis of our second principal component.
Another cluster evident in all of these dendrograms is a group of northeastern states (clearest with Euclidean distance and complete linkage). This geographical group stays together in nearly all of the clusterings, which may suggest that these states serve as predictors of the number of measles cases for each other.
South Carolina and North Carolina branch off of a single node in all of these dendrograms.
The general conclusion is that the clusters largely track geographical location.
library(robustHD)
## Loading required package: perry
## Loading required package: parallel
## Loading required package: robustbase
stdmeasles <- standardize(measlesPCA)   # center and scale each state's counts
stdmeasles <- t(stdmeasles)             # transpose so states are the rows to cluster
d=dist(stdmeasles,method = "euclidean")
fit=hclust(d, method="ave")
plot(fit,cex=0.5, main = "Average Linkage, Euclidean") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
fit=hclust(d, method="single")
plot(fit,cex=0.5, main = "Single Linkage, Euclidean") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
fit=hclust(d, method="complete")
plot(fit,cex=0.5, main = "Complete Linkage, Euclidean") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
d=dist(stdmeasles,method = "manhattan")
fit=hclust(d, method="ave")
plot(fit,cex=0.5, main = "Average Linkage, Manhattan") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
fit=hclust(d, method="single")
plot(fit,cex=0.5, main = "Single Linkage, Manhattan") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
fit=hclust(d, method="complete")
plot(fit,cex=0.5, main = "Complete Linkage, Manhattan") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
d=dist(stdmeasles,method = "maximum")
fit=hclust(d, method="ave")
plot(fit,cex=0.5, main = "Average Linkage, Maximum") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
fit=hclust(d, method="single")
plot(fit,cex=0.5, main = "Single Linkage, Maximum") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
fit=hclust(d, method="complete")
plot(fit,cex=0.5, main = "Complete Linkage, Maximum") # display dendogram
groups=cutree(fit, k=3)# cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
d=dist(stdmeasles,method = "canberra")
fit=hclust(d, method="ave")
plot(fit,cex=0.5, main = "Average Linkage, Maximum") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
fit=hclust(d, method="single")
plot(fit,cex=0.5, main = "Single Linkage, Maximum") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
fit=hclust(d, method="complete")
plot(fit,cex=0.5, main = "Complete Linkage, Maximum") # display dendogram
groups=cutree(fit, k=3)# cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
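One way to summarize the agreement across these fits is to cross-tabulate two of the cuts; a quick sketch:
d_euc <- dist(stdmeasles, method = "euclidean")
memb_complete <- cutree(hclust(d_euc, method = "complete"), k = 3)
memb_average <- cutree(hclust(d_euc, method = "ave"), k = 3)
table(memb_complete, memb_average)   # agreement of the two 3-cluster memberships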