Non-Parametric Statistics

Instructions

This exam is due May 4, 2019 at 11:55pm. There are 100 points available for students in both 388/488. Students enrolled in 388 will be scored out of 80 points. Don’t cheat. Seriously, don’t cheat. Have a great summer. Try not to do anything stupid.

(1) [10 points]

What is the difference between parametric and non-parametric statistics (2-3 lines should suffice for an answer)?

Parametric statistics makes assumptions about the probability distributions of data. These assumptions often involve parametric distributions or the central limit theorem is applied as samples get larger. Non-parametric statistics, however, does not make these distributional assumptions about the data; usually non-parametric statistical methods make other, more flexible assumptions about the data. In general, assuming more about data corresponds with greater power in statistical testing, but not always. Some advantages of non-parametric statistical methods over parametric statistical methods include the following:

Fewer/less stringent assumptions
Exactness of p-values, coverage probabilities & error rates
Typically easier to apply than their normal-theory counterparts
Procedures are typically easier to understand
Procedures are only slightly less efficient than their normal theory competitors when underlying populations are normal, and they are at least mildly, if not wildly, more efficient when underlying populations are not normal
Methods are relatively insensitve to outliers
Widely applicable since many procedures simply rely on the ranks of observations rather than their true magnitudes

(Taken directly from class notes)

(2) [10 points]

Give an example of a hypothesis test that can be tested using both parametric and non-parametric tests. State the null and alternative hypotheses and describe (or give the names of) the parametric and nonparametric tests used to test the hypothesis that you have given as an example. (HINT: Pick an example that is easy for yourself!)

Randomly generated data

set.seed(2634)
x <- rnorm(100, mean = 0, sd = 1)
kable(head(x), caption = "100 Randomly Generated Variables from Standard Normal")

100 Randomly Generated Variables from Standard Normal
x
0.4244015
-1.4940158
-1.6156726
-0.6741247
1.6568220
1.0843686

Parametric hypothesis test: One Sample t-Test
\(H_{0}: \mu = 0\)
\(H_{A}: \mu \neq 0\)

t.test(x)


    One Sample t-test

data:  x
t = -0.037259, df = 99, p-value = 0.9704
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.2193076  0.2112232
sample estimates:
   mean of x 
-0.004042188

Assumptions:
* Data are IID * Approximately normally distributed * No outliers

Non-parametric alternative test: Sign Test (let \(\theta\) be the median)
\(H_{0}: \theta = 0\)
\(H_{A}: \theta \neq 0\)

table(sign(x))
binom.test(x = 47, n = 100, p = 0.5)


-1  1 
47 53 

    Exact binomial test

data:  47 and 100
number of successes = 47, number of trials = 100, p-value = 0.6173
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.3694052 0.5724185
sample estimates:
probability of success 
                  0.47

Assumptions:
* data are IID

(3) [10 points]

A random forest consists of many trees, but there is a difference between the trees in a random forest and a regular tree model. Describe the differences between these trees.

The major differences between trees in CART models and Random Forests is the way trees are “grown.” When trees are being modeled in CART, all observations start in a single node and all binary splits for all independent variables are searched until the one that reduces within leaf variance the most is found. This process is repeated until the reduction in variance meets some threshold or a minimum number of observations in each node is reached.

Random forests are an ensemble method combining CART and bootstrapping. When a random forest model is being trained, a bootstrap sample is taken from a random sample. At each node a random subset of the independent variables is taken, all possible binary splits are considered only for that subset, and the split that reduces variance most is selected. Since each tree is constructed using only a sample of the data, there is an inherent holdout dataset; predictions can be made on the data which was not used to train the model and predictions are averaged across trees.

In summary, the major difference between trees in CART models and random forests is the number of independent variables considered at each split and the sample of observations that the models are trained on.

(4) Answer the following question with the “USArrest” dataset in R.

rm(list = ls())
data("USArrests")
df <- USArrests
kable(head(df), caption = "USA Arrests Data from R")

USA Arrests Data from R
	Murder	Assault	UrbanPop	Rape
Alabama	13.2	236	58	21.2
Alaska	10.0	263	48	44.5
Arizona	8.1	294	80	31.0
Arkansas	8.8	190	50	19.5
California	9.0	276	91	40.6
Colorado	7.9	204	78	38.7

a. [10 points] Using the “USArrests” data set that is pre-loaded in R, construct a kernel density estimator of the number of assault arrests and show the density in a figure. (hint: ggplot2 makes this very easy).

set.seed(2634)

ggplot(df, aes(x = Assault, y = ..density..)) + geom_histogram(bins = 10, closed = "right") + 
    geom_density(bw = 5, col = "red") + labs(title = "Kernel Density Estimator", 
    subtitle = "Bandwidth = 55", y = "Density", x = "Number of Assault Arrests")

ggplot(df, aes(x = Assault, y = ..density..)) + geom_histogram(bins = 10, closed = "right") + 
    geom_density(bw = 10, col = "red") + labs(title = "Kernel Density Estimator", 
    subtitle = "Bandwidth = 10", y = "Density", x = "Number of Assault Arrests")

ggplot(df, aes(x = Assault, y = ..density..)) + geom_histogram(bins = 10, closed = "right") + 
    geom_density(bw = 15, col = "red") + labs(title = "Kernel Density Estimator", 
    subtitle = "Bandwidth = 15", y = "Density", x = "Number of Assault Arrests")

b. [10 points] Using the “USArrests” data set that is pre-loaded in R, fit a loess regression model with UrbanPop as a predictor and Assault as the response variable. (hint: ggplot2 makes this very easy).

# Fitting Loess regression models with spans between .1 and 1
fits <- data_frame(span = seq(from = 0.1, to = 2, by = 0.1)) %>% group_by(span) %>% 
    do(augment(loess(Assault ~ UrbanPop, df, degree = 2, span = .$span)))

# Plotting smoothing parameters
ggplot(fits, aes(x = UrbanPop, y = Assault, frame = span)) + geom_point() + 
    geom_line(aes(y = .fitted), color = "red") + transition_states(span, wrap = TRUE, 
    transition_length = 3) + labs(title = "Loess Regression: Assault Arrests as a function of Population", 
    subtitle = "Span: {closest_state}", x = "Urban Population", y = "Assault Arrests")

(5) Answer the following questions with the data sets in the dslabs pacakge. We will focus on measles. (I did some data cleaning for you. Also, assume that the 0’s in the early years are correct.)

rm(list = ls())
measles <- subset(us_contagious_diseases, disease == "Measles")
measles <- melt(measles, id = c("disease", "state", "year"), measure = "count")
measles <- dcast(measles, formula = year ~ state)
# Remove the space in some of the state names.
names(measles)[10] <- "DC"
names(measles)[31] <- "NH"
names(measles)[32] <- "NJ"
names(measles)[33] <- "NM"
names(measles)[34] <- "NY"
names(measles)[35] <- "NC"
names(measles)[36] <- "ND"
names(measles)[41] <- "RI"
names(measles)[42] <- "SC"
names(measles)[43] <- "SD"
names(measles)[50] <- "WV"
kable(head(measles), caption = "Measles Data")

Measles Data
year	Alabama	Arizona	Arkansas	California	Colorado	Connecticut	Delaware	DC	Florida	Georgia	Idaho	Illinois	Indiana	Iowa	Kansas	Kentucky	Louisiana	Maine	Maryland	Massachusetts	Michigan	Minnesota	Missouri	Montana	Nebraska	NH	NJ	NM	NY	NC	ND	Ohio	Oklahoma	Oregon	Pennsylvania	RI	SC	SD	Tennessee	Texas	Utah	Vermont	Virginia	Washington	WV	Wisconsin	Wyoming
1928	8843	847	8899	3698	2099	10014	597	2566	1714	4637	88	7444	8470	702	2533	5551	5373	3204	17539	41371	25544	2052	7154	757	1083	1274	31631	2760	75391	56207	371	23522	6298	2089	57237	4986	19798	1097	8276	5524	85	1199	0	5348	3323	3642	500
1929	2959	236	1245	4024	748	9800	566	455	1127	1180	717	38503	10391	2041	10370	1125	1853	5732	2043	14861	18052	12663	6644	4285	4166	1609	7253	230	30318	1207	2286	37181	1095	4662	47605	1851	211	1158	860	4107	350	378	0	3866	6527	29825	696
1930	4156	2024	994	43416	11781	1810	261	889	5245	4459	879	16517	3915	9791	12555	1841	1816	1591	1209	26787	28213	5641	5462	633	10286	589	24525	2260	39887	1304	798	18488	3960	2642	34404	269	252	2400	4712	4275	5318	852	0	9904	2733	22082	772
1931	8934	2135	849	27807	4787	12869	2428	4198	3894	3053	118	39738	17628	851	1758	4801	244	3505	20959	16431	6403	3229	13723	1538	345	1406	20361	1415	54376	15299	966	19500	774	2343	79748	4529	3330	1474	3581	2336	153	1143	0	3121	5070	15146	139
1932	270	86	99	12618	2376	5701	39	273	207	1009	40	16113	2914	158	6150	1953	1453	8853	1093	19727	42116	2516	1618	5443	438	717	15407	1199	60464	13653	2150	36228	1034	4998	45957	9970	3192	667	1870	4564	72	4103	1309	10009	10476	28255	557
1933	1735	1261	5438	26551	297	5430	205	533	993	4628	898	12092	3570	824	6289	1560	931	110	697	15067	22089	20110	5771	2270	1625	1042	29855	668	72057	16197	2518	15751	2437	2253	32912	63	6319	2064	3667	21734	2091	1116	8656	2481	5315	9651	573

a. [10 points] Construct a CART model to predict number of measles cases in Illinois using the number of cases in other states as potential predictors. Display the best tree that you find and try to offer some interpretation.

# Set seed and partition training and test data
set.seed(2634)
ind <- sample(1:dim(measles)[1], 50, replace = FALSE)
df.train <- measles[ind, ]
df.test <- measles[-ind, ]


m3 <- rpart(formula = Illinois ~ ., data = df.train, method = "class", control = list(minsplit = 10, 
    maxdepth = 12, xval = 10))


hyper_grid <- expand.grid(minsplit = seq(5, 20, 1), maxdepth = seq(8, 15, 1))


models <- list()
for (i in 1:nrow(hyper_grid)) {
    # get minsplit, maxdepth values at row i
    minsplit <- hyper_grid$minsplit[i]
    maxdepth <- hyper_grid$maxdepth[i]
    # train a model and store in the list
    models[[i]] <- rpart(formula = Illinois ~ ., data = df.train, method = "class", 
        control = list(minsplit = minsplit, maxdepth = maxdepth))
}

get_cp <- function(x) {
    min <- which.min(x$cptable[, "xerror"])
    cp <- x$cptable[min, "CP"]
}

# function to get minimum error
get_min_error <- function(x) {
    min <- which.min(x$cptable[, "xerror"])
    xerror <- x$cptable[min, "xerror"]
}

hyper_grid %>% mutate(cp = purrr::map_dbl(models, get_cp), error = purrr::map_dbl(models, 
    get_min_error)) %>% arrange(error) %>% top_n(-5, wt = error)


optimal_tree <- rpart(formula = Illinois ~ ., data = df.train, method = "class", 
    control = list(minsplit = 13, maxdepth = 11, cp = 0.01))

pred <- predict(optimal_tree, newdata = df.test, type = "prob")
RMSE(pred = pred, obs = df.test$Illinois)
pred.v <- predict(optimal_tree, newdata = df.test, type = "vector")
RMSE(pred = pred.v, obs = df.test$Illinois)

# MSE
mse <- mean((pred - df.train$Illinois)^2)
mse
mse.v <- mean((pred.v - df.train$Illinois)^2)
mse.v

rpart.plot(optimal_tree)

  minsplit maxdepth         cp    error
1       18        9 0.02083333 1.020833
2       16       10 0.02083333 1.020833
3       20       10 0.02083333 1.020833
4        5       11 0.02083333 1.020833
5        8       14 0.02083333 1.020833
[1] 20994.23
[1] 20976.59
[1] 367342596
[1] 371806390

b. [10 points] Construct a random forest model to predict number of measles cases in Illinois using the number of cases in other states as potential predictors. Is the random forest model better than the CART model at predicting measles cases?

set.seed(2634)
IL.allForests <- randomForest(Illinois ~ ., data = df.train, importance = TRUE, 
    ntree = 1000)
plot(IL.allForests)

varImpPlot(IL.allForests)

kable(IL.allForests$importance)

# IL.Forests <-randomForest(Illinois ~., data = df.train, importance = TRUE,
# ntree = 1000)
predicted_IL_all <- predict(IL.allForests, newdata = df.test, OOB = TRUE)
mean((predicted_IL_all - df.train$Illinois)^2)

	%IncMSE	IncNodePurity
year	5013718.90	147016844
Alabama	3428043.46	196232298
Alaska	95544.66	12646942
Arizona	2838243.24	96108267
Arkansas	1911789.74	61883127
California	2136464.56	63279864
Colorado	845288.73	48817455
Connecticut	1623153.77	68633995
Delaware	2133156.97	152204705
DC	4031348.26	141946674
Florida	1510095.24	90524935
Georgia	295596.52	55653068
Hawaii	572525.72	11083755
Idaho	606956.24	40822339
Indiana	50378492.26	1359002447
Iowa	1625943.46	54703720
Kansas	1073052.50	42971344
Kentucky	1190068.17	65194579
Louisiana	20835.75	9632086
Maine	359022.01	31839591
Maryland	298545.90	40526396
Massachusetts	6061346.49	66800942
Michigan	11470584.70	1140871157
Minnesota	1662032.68	37375662
Mississippi	327433.78	10196907
Missouri	17003643.21	1314028002
Montana	21227254.51	640149893
Nebraska	4338777.71	170121677
Nevada	353571.72	5260739
NH	1889592.30	107437184
NJ	1821218.26	86121475
NM	1107558.84	30585413
NY	7675756.77	200186051
NC	1411373.60	180215866
ND	3735772.47	398554116
Ohio	34437650.30	1792231704
Oklahoma	1613582.83	38617384
Oregon	4213832.81	70573230
Pennsylvania	25977043.11	1820793259
RI	2232008.01	54976505
SC	231291.27	25179201
SD	3715450.95	199361294
Tennessee	3180351.65	98863190
Texas	3719564.86	86249283
Utah	973850.52	75584051
Vermont	516699.03	29817854
Virginia	2146913.56	70185814
Washington	649190.73	60611649
WV	19891411.55	432161089
Wisconsin	6194129.00	381078507
Wyoming	3302768.30	129772645

[1] 475351741

c. [10 points] Compute Kendall’s \(\tau\) between the Illinois and Massachusetts measles cases data. Compute the standard error of this estimator using the bootstrap procedure. Also, estimate the bias of this estimator.

d. [10 points] Perform PCA on the the measles data with states as variables. (Remember to exclude year as a predictor) How many components are needed to account for 85% of the variability? Also, try to interpret the first two principal components.

e. [10 points] Perform hierarchical clustering on the the measles data with states as observations. (Remember to exclude year data). How many clusters do there appear to be? Try to interpret each of the clusters.