Parametric statistics make assumptions about the probability distributions of the variables; these assumptions often invoke a named parametric family or the CLT for large samples. Nonparametric statistics do not make these distributional assumptions about the data, so nonparametric tests are more robust than parametric tests, meaning they are valid in a broader range of situations.
Parametric: more assumptions (not as robust), more power.
Non-parametric: fewer assumptions (robust), less power (sometimes).
Let \(\Delta\) be the difference in means between two groups, and assume independence between the groups.
We have two options: 1. use the parametric two-sample independent t-test, or 2. use the nonparametric permutation test (or compute with ranks and use the Wilcoxon rank-sum test).
\(H_0 : \Delta = 0\) \(H_A : \Delta \neq 0\)
For the two-sample t-test, \(\Delta\) compares the means of the data, whereas the permutation/Wilcoxon rank-sum test compares the medians (a rank-based location shift). Both the mean and the median measure the center of a data set, but the mean can be heavily influenced by an outlier.
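Only the Wilcoxon version is run below; for completeness, here is a minimal sketch of a permutation test for the difference in medians (the data here are hypothetical, not the g1/g2 used later):
set.seed(1)
a <- c(38.9, 61.2, 73.3, 21.8, 63.4)
b <- c(80, 60, 63.4, 76, 89.4)
obs <- median(a) - median(b)                # observed test statistic
pooled <- c(a, b)
perm <- replicate(10000, {
  idx <- sample(length(pooled), length(a))  # random relabeling under H0
  median(pooled[idx]) - median(pooled[-idx])
})
mean(abs(perm) >= abs(obs))                 # two-sided permutation p-value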
Below, we show an example where the two tests provide different results due to an outlier.
With the two-sample t-test, we obtain a p-value of 0.3315. This is because we have thrown an outlier into group two, which skews its mean heavily to the right and inflates the variance so much that the t-test cannot declare the difference significant.
With the Wilcoxon rank-sum test, the p-value is 0.02 < α = 0.05, so we reject \(H_0\) and argue that the true difference in medians is not equal to 0.
Without the outlier, though, the two tests yield the same conclusion (both reject \(H_0\)).
#With an Outlier
g1 <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
g2 <- c(100000, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)
# Create a data frame
my_data <- data.frame(
group = rep(c("G1", "G2"), each = 9),
tally = c(g1, g2)
)
ttest <- t.test(tally ~ group, data = my_data, var.equal = TRUE)
ttest
##
## Two Sample t-test
##
## data: tally by group
## t = -1.0015, df = 16, p-value = 0.3315
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -34658.69 12417.76
## sample estimates:
## mean in group G1 mean in group G2
## 52.10 11172.57
wilcox.test(g1,g2)
## Warning in wilcox.test.default(g1, g2): cannot compute exact p-value with
## ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: g1 and g2
## W = 14, p-value = 0.02155
## alternative hypothesis: true location shift is not equal to 0
#Without an Outlier
g1 <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
g2 <- c(80, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)
# Create a data frame
my_data <- data.frame(
group = rep(c("G1", "G2"), each = 9),
tally = c(g1, g2)
)
ttest <- t.test(tally ~ group, data = my_data, var.equal = TRUE)
ttest
##
## Two Sample t-test
##
## data: tally by group
## t = -2.9507, df = 16, p-value = 0.009396
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -31.351864 -5.137025
## sample estimates:
## mean in group G1 mean in group G2
## 52.10000 70.34444
wilcox.test(g1,g2)
## Warning in wilcox.test.default(g1, g2): cannot compute exact p-value with
## ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: g1 and g2
## W = 14, p-value = 0.02155
## alternative hypothesis: true location shift is not equal to 0
Decision trees are used for classification and regression. Tree computations are fast, easily interpretable, can deal with missing data, and accommodate both smooth and jagged responses. Trees are built on the entire dataset and use all the variables of interest.
Decision trees use algorithms that determine an optimal choice at each node. The greedy algorithm is the most popular: it takes the locally optimal decision at each step but does not guarantee the global optimum.
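As an illustration of the greedy step, here is a sketch of the split search for a single numeric predictor, choosing the cut that minimizes the within-node sum of squares (rpart's actual search also handles factors, weights, and surrogate splits):
# Greedy search for the best split point on one predictor
best_split <- function(x, y) {
  cuts <- sort(unique(x))[-1]               # candidate cut points
  sse <- sapply(cuts, function(cut) {
    left <- y[x < cut]; right <- y[x >= cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cuts[which.min(sse)]                      # cut with the smallest total SSE
}
best_split(mtcars$wt, mtcars$mpg)           # e.g., best split of weight for predicting mpg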
A random forest selects random subsets of the observations/rows and of the variables, builds multiple decision trees from them, and then combines the results. For regression, the trees' predictions are averaged; for classification, after a large number of trees are built, each tree “votes” for a class, and the class receiving the most votes by simple majority is the “winner” or predicted class.
#By Hand
set.seed(1234)
n <- 100
x <- rnorm(n)                                   # simulated sample (not used below)
# Gaussian kernel density estimate at a point z, with bandwidth Delta
kern <- function(z, x, Delta=.1){
  n <- length(x)
  1/(n*Delta)*sum(dnorm((z-x)/Delta))
}
S <- sd(USArrests$Assault)                      # sample sd (not used below)
n <- length(USArrests$Assault)
hist(USArrests$Assault, main="Delta=25", freq=FALSE)
dens <- density(USArrests$Assault, bw=25)       # built-in KDE with bandwidth 25
points(dens$x, dens$y, type="l", col="red", lwd=3)
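To check that the hand-rolled kern() agrees with density(), we can evaluate it on a grid and overlay the curve; with a Gaussian kernel, Delta plays the same role as bw:
z <- seq(min(USArrests$Assault) - 50, max(USArrests$Assault) + 50, length.out = 200)
fhat <- sapply(z, kern, x = USArrests$Assault, Delta = 25)   # by-hand KDE on the grid
points(z, fhat, type = "l", col = "blue", lwd = 2, lty = 2)  # should trace the red curve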
#Part B.
library(ggplot2)
loess(UrbanPop~Assault, data=USArrests)
## Call:
## loess(formula = UrbanPop ~ Assault, data = USArrests)
##
## Number of Observations: 50
## Equivalent Number of Parameters: 4.58
## Residual Standard Error: 13.91
ggplot(USArrests, aes(UrbanPop, Assault)) +
geom_point() +
geom_smooth(span=0.6)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
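For comparison, here is a base-R sketch of the loess fit. Note that the ggplot call maps UrbanPop to x, so geom_smooth is actually smoothing Assault on UrbanPop, the reverse of the loess() call above:
fit <- loess(UrbanPop ~ Assault, data = USArrests)
grid <- seq(min(USArrests$Assault), max(USArrests$Assault), length.out = 100)
plot(UrbanPop ~ Assault, data = USArrests)                     # scatterplot
lines(grid, predict(fit, newdata = data.frame(Assault = grid)),
      col = "red", lwd = 2)                                    # loess curve over the grid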
library(dslabs)
library(reshape2)
measles <- subset(us_contagious_diseases, disease == "Measles")
measles <- melt(measles, id = c("disease","state","year"), measure = "count")
measles <- dcast(measles, formula = year ~ state)
#Abbreviate the state names that contain spaces (e.g., "New Hampshire" -> "NH").
names(measles)[10]<-"DC"
names(measles)[31]<-"NH"
names(measles)[32]<-"NJ"
names(measles)[33]<-"NM"
names(measles)[34]<-"NY"
names(measles)[35]<-"NC"
names(measles)[36]<-"ND"
names(measles)[41]<-"RI"
names(measles)[42]<-"SC"
names(measles)[43]<-"SD"
names(measles)[50]<-"WV"
#Split the data in half so we can compare models by test MSE
set.seed(1234)
train <- sample(1:dim(measles)[1], dim(measles)[1] / 2, replace =FALSE )
test <- -train
m.train <- measles[train, ]
m.test <- measles[test, ]
From this tree, it looks like Indiana and Michigan should definitely be in the model to help predict the number of cases of measles in Illinois. This makes sense, as they are geographically close to Illinois.
The tree reads as follows: if the number of measles cases in Indiana is less than 8772, we flow down to the left and look at the number of cases in Michigan. If the number of cases in Michigan is less than 5446, we flow down and look at Michigan again: if the cases are less than 391, we predict 172.3 cases in Illinois; if not, we predict 2174 cases. If the number of cases in Michigan is 5446 or more, we predict 11180 cases in Illinois. Finally, if the number of cases in Indiana is 8772 or more, we predict 45260 cases in Illinois.
Note: I have removed year from the predictors.
#CART - Classification and Regression Tree
library(rpart)
model <- rpart(Illinois ~.-year, data = m.train, control = rpart.control(cp = 0.0001))
par(xpd = NA) # otherwise on some devices the text is clipped
plot(model)
text(model)
print(model)
## n= 38
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 38 13949170000 11453.6600
## 2) Indiana< 8771.5 31 897773000 3819.1940
## 4) Michigan< 5445.5 22 28446070 809.2273
## 8) Michigan< 391 15 1232157 172.3333 *
## 9) Michigan>=391 7 8091168 2174.0000 *
## 5) Michigan>=5445.5 9 182788100 11176.8900 *
## 3) Indiana>=8771.5 7 3242852000 45263.4300 *
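To trace these rules in code, we can push a hypothetical observation through the tree (the Indiana and Michigan values below are assumed; the other columns are borrowed from a test row):
newrow <- m.test[1, ]        # borrow a row so all columns are present
newrow$Indiana <- 3000       # hypothetical: Indiana < 8771.5
newrow$Michigan <- 6000      # hypothetical: Michigan >= 5445.5
predict(model, newrow)       # follows nodes 1 -> 2 -> 5: about 11177 cases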
bestcp <- model$cptable[which.min(model$cptable[,"xerror"]),"CP"]
tree.pruned <- prune(model, cp = bestcp)
tree.pruned
## n= 38
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 38 13949170000 11453.6600
## 2) Indiana< 8771.5 31 897773000 3819.1940
## 4) Michigan< 5445.5 22 28446070 809.2273
## 8) Michigan< 391 15 1232157 172.3333 *
## 9) Michigan>=391 7 8091168 2174.0000 *
## 5) Michigan>=5445.5 9 182788100 11176.8900 *
## 3) Indiana>=8771.5 7 3242852000 45263.4300 *
predicted_cart <- predict(tree.pruned, m.test)
mean((predicted_cart - m.test$Illinois)^2) #Test MSE
## [1] 68725063
#87533226 MSE without pruning
#68725063 MSE with pruning cp = 0.0001
The random forest test MSE is 30728628, while the best CART test MSE was 68725063. So, the random forest is better at predicting the measles counts.
The Variable Importance Plot supports the same conclusion as above: Michigan and Indiana are important variables in this model.
Note: I have removed year from the predictors.
set.seed(1234)
library(randomForest)
model_RF <-randomForest(Illinois ~.-year, data=m.train, importance =TRUE, ntree = 500)
plot(model_RF)
varImpPlot(model_RF)
predicted_RF <- predict(model_RF, newdata=m.test)   # OOB predictions apply only to the training data, so plain test-set predictions are used here
mean((predicted_RF-m.test$Illinois)^2)              # Test MSE
## [1] 30728628
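To put numbers behind the importance plot, the permutation importance computed with importance = TRUE can be sorted directly:
head(sort(importance(model_RF)[, "%IncMSE"], decreasing = TRUE), 5)  # top 5 states by %IncMSE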
For Illinois and Massachusetts, Kendall's Tau = 0.6707.
The standard error of this estimator is approximately 0.0524 using the bootstrap procedure below.
The bootstrap estimate of the bias is 0.001037724.
#Kendall's Tau
thetaHat <- cor(measles$Illinois, measles$Massachusetts, method = "kendall")
thetaHat
## [1] 0.6707704
#Bootstrapping Standard Error Procedure
set.seed(1234)
x <- measles[,c(15,23)]
n <- nrow(x)
nsim <- 1000
thetaBoots <- rep(NA, nsim)
for (i in 1:nsim){
  bootsSample <- x[sample(1:n, n, replace=TRUE), ]   # resample rows with replacement
  thetaBoots[i] <- cor(bootsSample$Illinois, bootsSample$Massachusetts, method = "kendall")
}
#Bootstrap estimate of MSE around the original estimate
MSE <- mean((thetaBoots-thetaHat)^2)
#Standard Error (sqrt of MSE; close to sd(thetaBoots) here since the bias is tiny)
sqrt(MSE)
## [1] 0.05237864
#Bias
mean(thetaBoots)-thetaHat
## [1] 0.001037724
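A natural follow-up from the same bootstrap draws (not shown in the output above) is a percentile confidence interval:
quantile(thetaBoots, c(0.025, 0.975))   # 95% percentile bootstrap CI for Kendall's tau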
Below, I have removed year and scaled and centered the data, because raw counts are not comparable across states (e.g., New York's counts dwarf Wyoming's). Then, I have performed PCA.
We would need 9 principal components to account for 85% of the variability.
In the first principal component, the loadings are highest in magnitude for Alabama, Colorado, New York, Indiana, and Wisconsin (all negative).
In the second principal component, the loadings are highest in magnitude for Alaska, Arizona, Hawaii, Mississippi, and Nevada.
This is really hard to interpret because there are so many variables and none of them stands out very strongly. In the second principal component, Hawaii and Alaska have the largest loadings, and this makes sense because they are far removed from the continental US.
Since every loading in the first PC has the same sign, I would say that as the significant states in the first PC (Alabama, Colorado, New York, Indiana, and Wisconsin) increase, the remaining ones increase as well.
measlesPCA <- measles[,2:52]   # drop year; keep the 51 state columns
m.pca <- prcomp(measlesPCA, scale=TRUE, center=TRUE)
m.pca$rotation[,1:2]
## PC1 PC2
## Alabama -0.16077162 0.042731298
## Alaska -0.08266421 -0.300024217
## Arizona -0.12868551 -0.247690985
## Arkansas -0.13829837 0.082508114
## California -0.14897156 -0.061962413
## Colorado -0.16622331 0.011455876
## Connecticut -0.13511995 -0.042029374
## Delaware -0.12920511 0.087914219
## DC -0.14506752 0.176021423
## Florida -0.15794779 -0.070406716
## Georgia -0.14505390 0.161494198
## Hawaii -0.06869364 -0.260656919
## Idaho -0.14512041 -0.155358321
## Illinois -0.15108708 0.098003499
## Indiana -0.16114808 -0.013318903
## Iowa -0.13971884 -0.137085486
## Kansas -0.12183149 0.179860957
## Kentucky -0.15941506 -0.107568440
## Louisiana -0.12880993 0.181271454
## Maine -0.13859742 -0.025486074
## Maryland -0.13778829 0.164185121
## Massachusetts -0.15596602 0.065003456
## Michigan -0.15537818 -0.012377601
## Minnesota -0.10365411 0.124337903
## Mississippi -0.09166173 -0.270562714
## Missouri -0.14658836 0.194138301
## Montana -0.15092050 -0.074427762
## Nebraska -0.11563188 0.151788097
## Nevada -0.09348111 -0.236351866
## NH -0.13037741 0.061583381
## NJ -0.15068380 0.019806893
## NM -0.14319861 -0.052276556
## NY -0.16794996 0.058355691
## NC -0.11861754 0.245930832
## ND -0.15390149 -0.112592318
## Ohio -0.15699184 0.066697673
## Oklahoma -0.15138139 0.069342825
## Oregon -0.14008630 -0.176361081
## Pennsylvania -0.15428094 0.185198560
## RI -0.10660378 0.002788881
## SC -0.13809717 0.099533976
## SD -0.10164930 0.206866330
## Tennessee -0.14427407 -0.186433116
## Texas -0.14544375 -0.188351499
## Utah -0.13373339 -0.018775683
## Vermont -0.12245849 -0.017323594
## Virginia -0.16727171 0.004395603
## Washington -0.15211770 -0.155900243
## WV -0.15105109 -0.108167943
## Wisconsin -0.16534908 -0.030975257
## Wyoming -0.15047417 0.074709929
summary(m.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 5.0717 2.3906 1.73002 1.55793 1.49685 1.21572
## Proportion of Variance 0.5044 0.1120 0.05869 0.04759 0.04393 0.02898
## Cumulative Proportion 0.5044 0.6164 0.67510 0.72269 0.76662 0.79560
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 1.08633 1.06723 1.0023 0.91873 0.81014 0.80742
## Proportion of Variance 0.02314 0.02233 0.0197 0.01655 0.01287 0.01278
## Cumulative Proportion 0.81874 0.84107 0.8608 0.87732 0.89019 0.90297
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.74147 0.71233 0.65939 0.6101 0.58548 0.57551
## Proportion of Variance 0.01078 0.00995 0.00853 0.0073 0.00672 0.00649
## Cumulative Proportion 0.91375 0.92370 0.93223 0.9395 0.94625 0.95274
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.55033 0.53632 0.53413 0.47762 0.45042 0.42163
## Proportion of Variance 0.00594 0.00564 0.00559 0.00447 0.00398 0.00349
## Cumulative Proportion 0.95868 0.96432 0.96991 0.97439 0.97837 0.98185
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 0.39860 0.37323 0.32642 0.30369 0.28084 0.25949
## Proportion of Variance 0.00312 0.00273 0.00209 0.00181 0.00155 0.00132
## Cumulative Proportion 0.98497 0.98770 0.98979 0.99160 0.99314 0.99446
## PC31 PC32 PC33 PC34 PC35 PC36
## Standard deviation 0.24514 0.23105 0.17986 0.17410 0.16067 0.13137
## Proportion of Variance 0.00118 0.00105 0.00063 0.00059 0.00051 0.00034
## Cumulative Proportion 0.99564 0.99669 0.99732 0.99792 0.99842 0.99876
## PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.11746 0.10899 0.09484 0.08418 0.07666 0.06662
## Proportion of Variance 0.00027 0.00023 0.00018 0.00014 0.00012 0.00009
## Cumulative Proportion 0.99903 0.99926 0.99944 0.99958 0.99969 0.99978
## PC43 PC44 PC45 PC46 PC47 PC48
## Standard deviation 0.06287 0.04965 0.04220 0.03468 0.03022 0.02156
## Proportion of Variance 0.00008 0.00005 0.00003 0.00002 0.00002 0.00001
## Cumulative Proportion 0.99986 0.99991 0.99994 0.99997 0.99998 0.99999
## PC49 PC50 PC51
## Standard deviation 0.01338 0.0106 0.008771
## Proportion of Variance 0.00000 0.0000 0.000000
## Cumulative Proportion 1.00000 1.0000 1.000000
screeplot(m.pca)
biplot(m.pca, scale=0) # Yikes: with 51 variables this biplot is too crowded to interpret
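The claims above can be checked programmatically, both the 85% threshold and the common sign of the PC1 loadings:
cumvar <- cumsum(m.pca$sdev^2) / sum(m.pca$sdev^2)   # cumulative proportion of variance
which(cumvar >= 0.85)[1]                             # 9 components reach 85%
all(m.pca$rotation[, 1] < 0)                         # PC1 loadings all share one sign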
Below, I have performed many different methods of hierarchical clustering so that the results can be compared.
A cluster common to most of these dendrograms is Alaska, Hawaii, Mississippi, and Nevada. This cluster closely mirrors the loading analysis of our second principal component.
Another cluster evident in all of these dendrograms is a group of northeastern states (clearest with Euclidean distance and complete linkage). This geographical group stays together in nearly all of the clusterings, which may suggest that these states serve as predictors of the number of measles cases for each other.
South Carolina and North Carolina branch off of a single node in all of these dendrograms.
The general conclusion is that the clusters largely track geographical location.
library(robustHD)
## Loading required package: perry
## Loading required package: parallel
## Loading required package: robustbase
stdmeasles <- standardize(measlesPCA)   # center and scale each state's counts
stdmeasles <- t(stdmeasles)             # transpose so states are the rows to cluster
d=dist(stdmeasles,method = "euclidean")
fit=hclust(d, method="ave")
plot(fit,cex=0.5, main = "Average Linkage, Euclidean") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
fit=hclust(d, method="single")
plot(fit,cex=0.5, main = "Single Linkage, Euclidean") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
fit=hclust(d, method="complete")
plot(fit,cex=0.5, main = "Complete Linkage, Euclidean") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
d=dist(stdmeasles,method = "manhattan")
fit=hclust(d, method="ave")
plot(fit,cex=0.5, main = "Average Linkage, Manhattan") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
fit=hclust(d, method="single")
plot(fit,cex=0.5, main = "Single Linkage, Manhattan") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
fit=hclust(d, method="complete")
plot(fit,cex=0.5, main = "Complete Linkage, Manhattan") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
d=dist(stdmeasles,method = "maximum")
fit=hclust(d, method="ave")
plot(fit,cex=0.5, main = "Average Linkage, Maximum") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
fit=hclust(d, method="single")
plot(fit,cex=0.5, main = "Single Linkage, Maximum") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
fit=hclust(d, method="complete")
plot(fit,cex=0.5, main = "Complete Linkage, Maximum") # display dendogram
groups=cutree(fit, k=3)# cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
d=dist(stdmeasles,method = "canberra")
fit=hclust(d, method="ave")
plot(fit,cex=0.5, main = "Average Linkage, Maximum") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
fit=hclust(d, method="single")
plot(fit,cex=0.5, main = "Single Linkage, Maximum") # display dendogram
groups=cutree(fit, k=3) # cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
fit=hclust(d, method="complete")
plot(fit,cex=0.5, main = "Complete Linkage, Maximum") # display dendogram
groups=cutree(fit, k=3)# cut tree into 3 clusters
rect.hclust(fit, k=3, border="red")
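One way to summarize the agreement across these fits is to cross-tabulate two of the cuts; a quick sketch:
d_euc <- dist(stdmeasles, method = "euclidean")
memb_complete <- cutree(hclust(d_euc, method = "complete"), k = 3)
memb_average <- cutree(hclust(d_euc, method = "ave"), k = 3)
table(memb_complete, memb_average)   # agreement of the two 3-cluster memberships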