Instructions

This exam is due May 4, 2019 at 11:55pm. There are 100 points available for students in both 388 and 488; students enrolled in 388 will be scored out of 80 points. Don’t cheat. Seriously, don’t cheat. Have a great summer. Try not to do anything stupid.

(1) [10 points]

What is the difference between parametric and non-parametric statistics (2-3 lines should suffice for an answer)?

Parametric statistics makes assumptions about the probability distributions of the data. These assumptions often involve a specific parametric family of distributions, or an appeal to the central limit theorem as the sample size grows. Non-parametric statistics, by contrast, does not make these distributional assumptions; instead, non-parametric methods make other, more flexible assumptions about the data. In general, assuming more about the data yields greater power in statistical testing, though not always. Some advantages of non-parametric statistical methods over parametric statistical methods include the following:

  • Fewer/less stringent assumptions
  • Exactness of p-values, coverage probabilities & error rates
  • Typically easier to apply than their normal-theory counterparts
  • Procedures are typically easier to understand
  • Procedures are only slightly less efficient than their normal theory competitors when underlying populations are normal, and they are at least mildly, if not wildly, more efficient when underlying populations are not normal
  • Methods are relatively insensitive to outliers
  • Widely applicable since many procedures simply rely on the ranks of observations rather than their true magnitudes

(Taken directly from class notes)

(2) [10 points]

Give an example of a hypothesis test that can be tested using both parametric and non-parametric tests. State the null and alternative hypotheses and describe (or give the names of) the parametric and nonparametric tests used to test the hypothesis that you have given as an example. (HINT: Pick an example that is easy for yourself!)

Randomly generated data
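A minimal sketch of how data like this could be generated (the seed is an assumption, so the exact values shown below will not be reproduced):

set.seed(488)    # assumed seed, not necessarily the one used originally
x <- rnorm(100)  # 100 draws from a standard normal distribution
head(x)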

100 values randomly generated from a standard normal distribution (first six shown):

         x
 0.4244015
-1.4940158
-1.6156726
-0.6741247
 1.6568220
 1.0843686

Parametric hypothesis test: One Sample t-Test
\(H_{0}: \mu = 0\)
\(H_{A}: \mu \neq 0\)
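In R, the test can be carried out with t.test (a sketch using the x generated above):

t.test(x, mu = 0)  # two-sided one-sample t-test of H0: mu = 0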


    One Sample t-test

data:  x
t = -0.037259, df = 99, p-value = 0.9704
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.2193076  0.2112232
sample estimates:
   mean of x 
-0.004042188 

Assumptions:

  • Data are IID
  • Approximately normally distributed
  • No outliers

Non-parametric alternative test: Sign Test (let \(\theta\) be the median)
\(H_{0}: \theta = 0\)
\(H_{A}: \theta \neq 0\)
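The sign test reduces to an exact binomial test on the signs of the observations (a sketch using the same x):

signs <- table(sign(x))  # counts of negative (-1) and positive (+1) observations
signs
binom.test(signs[["-1"]], n = sum(signs))  # exact binomial test with p = 0.5 under the null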


-1  1 
47 53 

    Exact binomial test

data:  47 and 100
number of successes = 47, number of trials = 100, p-value = 0.6173
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.3694052 0.5724185
sample estimates:
probability of success 
                  0.47 

Assumptions:

  • Data are IID

(3) [10 points]

A random forest consists of many trees, but there is a difference between the trees in a random forest and a regular tree model. Describe the differences between these trees.

The major difference between trees in CART models and random forests is the way the trees are “grown.” When a tree is grown in CART, all observations start in a single node, and all binary splits on all independent variables are searched until the split that most reduces the within-leaf variance is found. This process is repeated until the reduction in variance falls below some threshold or a minimum number of observations per node is reached.

Random forests are an ensemble method combining CART with bootstrapping. When a random forest is trained, each tree is grown on a bootstrap sample drawn with replacement from the training data. At each node, a random subset of the independent variables is selected, all possible binary splits are considered only for that subset, and the split that reduces variance the most is chosen. Since each tree is constructed from only a sample of the data, there is an inherent holdout set (the “out-of-bag” observations): predictions can be made on the observations that were not used to grow a given tree, and predictions are averaged across trees.

In summary, the major difference between trees in CART models and random forests is the number of independent variables considered at each split and the sample of observations that the models are trained on.
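As a concrete illustration of these differences, the following sketch fits both model types to the built-in mtcars data (illustrative only; mtcars and all argument values here are assumptions, not part of the exam):

library(rpart)
library(randomForest)

# CART: a single tree grown on all observations, searching every
# predictor for the best split at each node
cart <- rpart(mpg ~ ., data = mtcars)

# Random forest: 500 trees, each grown on a bootstrap sample of the rows,
# with only mtry randomly chosen predictors considered at each split
set.seed(1)
rf <- randomForest(mpg ~ ., data = mtcars, ntree = 500, mtry = 3)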

(4) Answer the following question with the “USArrests” dataset in R.

USA Arrests Data from R (first six rows)

           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7

(5) Answer the following questions with the data sets in the dslabs package. We will focus on measles. (I did some data cleaning for you. Also, assume that the 0’s in the early years are correct.)

Measles Data (annual case counts by state; first six rows shown)
year Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware DC Florida Georgia Hawaii Idaho Illinois Indiana Iowa Kansas Kentucky Louisiana Maine Maryland Massachusetts Michigan Minnesota Mississippi Missouri Montana Nebraska Nevada NH NJ NM NY NC ND Ohio Oklahoma Oregon Pennsylvania RI SC SD Tennessee Texas Utah Vermont Virginia Washington WV Wisconsin Wyoming
1928 8843 0 847 8899 3698 2099 10014 597 2566 1714 4637 0 88 7444 8470 702 2533 5551 5373 3204 17539 41371 25544 2052 0 7154 757 1083 0 1274 31631 2760 75391 56207 371 23522 6298 2089 57237 4986 19798 1097 8276 5524 85 1199 0 5348 3323 3642 500
1929 2959 0 236 1245 4024 748 9800 566 455 1127 1180 0 717 38503 10391 2041 10370 1125 1853 5732 2043 14861 18052 12663 0 6644 4285 4166 0 1609 7253 230 30318 1207 2286 37181 1095 4662 47605 1851 211 1158 860 4107 350 378 0 3866 6527 29825 696
1930 4156 0 2024 994 43416 11781 1810 261 889 5245 4459 0 879 16517 3915 9791 12555 1841 1816 1591 1209 26787 28213 5641 0 5462 633 10286 0 589 24525 2260 39887 1304 798 18488 3960 2642 34404 269 252 2400 4712 4275 5318 852 0 9904 2733 22082 772
1931 8934 0 2135 849 27807 4787 12869 2428 4198 3894 3053 0 118 39738 17628 851 1758 4801 244 3505 20959 16431 6403 3229 0 13723 1538 345 0 1406 20361 1415 54376 15299 966 19500 774 2343 79748 4529 3330 1474 3581 2336 153 1143 0 3121 5070 15146 139
1932 270 0 86 99 12618 2376 5701 39 273 207 1009 0 40 16113 2914 158 6150 1953 1453 8853 1093 19727 42116 2516 0 1618 5443 438 0 717 15407 1199 60464 13653 2150 36228 1034 4998 45957 9970 3192 667 1870 4564 72 4103 1309 10009 10476 28255 557
1933 1735 0 1261 5438 26551 297 5430 205 533 993 4628 0 898 12092 3570 824 6289 1560 931 110 697 15067 22089 20110 0 5771 2270 1625 0 1042 29855 668 72057 16197 2518 15751 2437 2253 32912 63 6319 2064 3667 21734 2091 1116 8656 2481 5315 9651 573

a. [10 points] Construct a CART model to predict number of measles cases in Illinois using the number of cases in other states as potential predictors. Display the best tree that you find and try to offer some interpretation.

# Load the packages used below
library(rpart)        # CART models
library(rpart.plot)   # plotting trees
library(dplyr)        # %>%, mutate, arrange, top_n
library(purrr)        # map_dbl
library(caret)        # RMSE

# Set seed and partition training and test data
set.seed(2634)
ind <- sample(1:dim(measles)[1], 50, replace = FALSE)
df.train <- measles[ind, ]
df.test <- measles[-ind, ]


# Fit an initial tree (method = "class" treats the case counts as categorical outcomes)
m3 <- rpart(formula = Illinois ~ ., data = df.train, method = "class", control = list(minsplit = 10, 
    maxdepth = 12, xval = 10))


# Grid of candidate hyperparameter values
hyper_grid <- expand.grid(minsplit = seq(5, 20, 1), maxdepth = seq(8, 15, 1))


# Fit one tree for each hyperparameter combination
models <- list()
for (i in 1:nrow(hyper_grid)) {
    # get minsplit, maxdepth values at row i
    minsplit <- hyper_grid$minsplit[i]
    maxdepth <- hyper_grid$maxdepth[i]
    # train a model and store in the list
    models[[i]] <- rpart(formula = Illinois ~ ., data = df.train, method = "class", 
        control = list(minsplit = minsplit, maxdepth = maxdepth))
}

# function to get the optimal cp
get_cp <- function(x) {
    min <- which.min(x$cptable[, "xerror"])
    x$cptable[min, "CP"]
}

# function to get the minimum cross-validation error
get_min_error <- function(x) {
    min <- which.min(x$cptable[, "xerror"])
    x$cptable[min, "xerror"]
}

# Rank the hyperparameter combinations by minimum cross-validation error
hyper_grid %>% mutate(cp = purrr::map_dbl(models, get_cp), error = purrr::map_dbl(models, 
    get_min_error)) %>% arrange(error) %>% top_n(-5, wt = error)


# Refit the tree with the selected hyperparameters
optimal_tree <- rpart(formula = Illinois ~ ., data = df.train, method = "class", 
    control = list(minsplit = 13, maxdepth = 11, cp = 0.01))

pred <- predict(optimal_tree, newdata = df.test, type = "prob")
RMSE(pred = pred, obs = df.test$Illinois)
pred.v <- predict(optimal_tree, newdata = df.test, type = "vector")
RMSE(pred = pred.v, obs = df.test$Illinois)

# Test-set MSE (the predictions were made on df.test, so compare against df.test)
mse <- mean((pred - df.test$Illinois)^2)
mse
mse.v <- mean((pred.v - df.test$Illinois)^2)
mse.v

rpart.plot(optimal_tree)

  minsplit maxdepth         cp    error
1       18        9 0.02083333 1.020833
2       16       10 0.02083333 1.020833
3       20       10 0.02083333 1.020833
4        5       11 0.02083333 1.020833
5        8       14 0.02083333 1.020833
[1] 20994.23
[1] 20976.59
[1] 367342596
[1] 371806390

b. [10 points] Construct a random forest model to predict number of measles cases in Illinois using the number of cases in other states as potential predictors. Is the random forest model better than the CART model at predicting measles cases?
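The call that fit the model is not reproduced above; a sketch of one that would produce variable-importance output like the table below (the arguments and seed are assumptions):

library(randomForest)

set.seed(2634)  # assumed seed
rf.measles <- randomForest(Illinois ~ ., data = df.train, importance = TRUE)
importance(rf.measles)                 # %IncMSE and IncNodePurity per predictor
pred.rf <- predict(rf.measles, newdata = df.test)
mean((pred.rf - df.test$Illinois)^2)   # test-set MSE for comparison with the CART model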

%IncMSE IncNodePurity
year 5013718.90 147016844
Alabama 3428043.46 196232298
Alaska 95544.66 12646942
Arizona 2838243.24 96108267
Arkansas 1911789.74 61883127
California 2136464.56 63279864
Colorado 845288.73 48817455
Connecticut 1623153.77 68633995
Delaware 2133156.97 152204705
DC 4031348.26 141946674
Florida 1510095.24 90524935
Georgia 295596.52 55653068
Hawaii 572525.72 11083755
Idaho 606956.24 40822339
Indiana 50378492.26 1359002447
Iowa 1625943.46 54703720
Kansas 1073052.50 42971344
Kentucky 1190068.17 65194579
Louisiana 20835.75 9632086
Maine 359022.01 31839591
Maryland 298545.90 40526396
Massachusetts 6061346.49 66800942
Michigan 11470584.70 1140871157
Minnesota 1662032.68 37375662
Mississippi 327433.78 10196907
Missouri 17003643.21 1314028002
Montana 21227254.51 640149893
Nebraska 4338777.71 170121677
Nevada 353571.72 5260739
NH 1889592.30 107437184
NJ 1821218.26 86121475
NM 1107558.84 30585413
NY 7675756.77 200186051
NC 1411373.60 180215866
ND 3735772.47 398554116
Ohio 34437650.30 1792231704
Oklahoma 1613582.83 38617384
Oregon 4213832.81 70573230
Pennsylvania 25977043.11 1820793259
RI 2232008.01 54976505
SC 231291.27 25179201
SD 3715450.95 199361294
Tennessee 3180351.65 98863190
Texas 3719564.86 86249283
Utah 973850.52 75584051
Vermont 516699.03 29817854
Virginia 2146913.56 70185814
Washington 649190.73 60611649
WV 19891411.55 432161089
Wisconsin 6194129.00 381078507
Wyoming 3302768.30 129772645
[1] 475351741

c. [10 points] Compute Kendall’s \(\tau\) between the Illinois and Massachusetts measles cases data. Compute the standard error of this estimator using the bootstrap procedure. Also, estimate the bias of this estimator.
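A sketch of one way to carry this out, resampling years with replacement (B and the seed are assumptions):

# point estimate of Kendall's tau
tau.hat <- cor(measles$Illinois, measles$Massachusetts, method = "kendall")

# bootstrap the estimator
set.seed(2634)  # assumed seed
B <- 2000       # assumed number of bootstrap replicates
n <- nrow(measles)
boot.tau <- replicate(B, {
    idx <- sample(n, replace = TRUE)
    cor(measles$Illinois[idx], measles$Massachusetts[idx], method = "kendall")
})

sd(boot.tau)              # bootstrap standard error
mean(boot.tau) - tau.hat  # bootstrap estimate of the bias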

d. [10 points] Perform PCA on the measles data with states as variables. (Remember to exclude year as a predictor.) How many components are needed to account for 85% of the variability? Also, try to interpret the first two principal components.
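A sketch of the analysis (scaling the variables before PCA is an assumption; either choice is defensible):

pca <- prcomp(measles[, -1], scale. = TRUE)  # drop the year column
summary(pca)          # cumulative proportion of variance shows the 85% cutoff
pca$rotation[, 1:2]   # loadings on the first two principal components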

e. [10 points] Perform hierarchical clustering on the measles data with states as observations. (Remember to exclude the year data.) How many clusters do there appear to be? Try to interpret each of the clusters.
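A sketch of the analysis (the linkage method and scaling are assumptions):

X <- t(measles[, -1])   # transpose so states are the observations; drop year
hc <- hclust(dist(scale(X)), method = "complete")
plot(hc)                # inspect the dendrogram to judge the number of clusters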