This exam is due May 4, 2019 at 11:55pm. There are 100 points available for students in both 388/488. Students enrolled in 388 will be scored out of 80 points. Don’t cheat. Seriously, don’t cheat. Have a great summer. Try not to do anything stupid.
Parametric statistics makes assumptions about the probability distributions of the data; these assumptions typically specify a parametric family of distributions, or appeal to the central limit theorem as samples grow large. Non-parametric statistics, by contrast, does not make these distributional assumptions; non-parametric methods usually make weaker, more flexible assumptions about the data. In general, assuming more about the data corresponds with greater power in statistical testing, though not always. Some advantages of non-parametric statistical methods over parametric statistical methods include the following:
(Taken directly from class notes)
Randomly generated data
set.seed(2634)
x <- rnorm(100, mean = 0, sd = 1)
kable(head(x), caption = "100 Randomly Generated Variables from Standard Normal")

| x |
|---|
| 0.4244015 |
| -1.4940158 |
| -1.6156726 |
| -0.6741247 |
| 1.6568220 |
| 1.0843686 |
Parametric hypothesis test: One Sample t-Test
\(H_{0}: \mu = 0\)
\(H_{A}: \mu \neq 0\)
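The output below can be reproduced with a call along these lines (the generating chunk was not echoed in the knitted document; the arguments shown are the defaults matching the reported two-sided test and 95 percent interval):

t.test(x, mu = 0, alternative = "two.sided", conf.level = 0.95)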
One Sample t-test
data: x
t = -0.037259, df = 99, p-value = 0.9704
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.2193076 0.2112232
sample estimates:
mean of x
-0.004042188
Assumptions:

* Data are IID
* Approximately normally distributed
* No outliers
Non-parametric alternative test: Sign Test (let \(\theta\) be the median)
\(H_{0}: \theta = 0\)
\(H_{A}: \theta \neq 0\)
-1 1
47 53
Exact binomial test
data: 47 and 100
number of successes = 47, number of trials = 100, p-value = 0.6173
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.3694052 0.5724185
sample estimates:
probability of success
0.47
Assumptions:

* Data are IID
The major difference between trees in CART models and random forests is the way trees are "grown." When a tree is grown in CART, all observations start in a single node; every binary split on every independent variable is searched, and the split that most reduces within-leaf variance is chosen. This process is repeated until the reduction in variance falls below some threshold or a minimum number of observations per node is reached.
Random forests are an ensemble method combining CART with bootstrapping. When a random forest is trained, each tree is grown on a bootstrap sample of the training data. At each node a random subset of the independent variables is drawn, all possible binary splits are considered only for that subset, and the split that reduces variance most is selected. Since each tree is constructed using only a sample of the data, there is an inherent holdout: predictions can be made on the out-of-bag observations that were not used to grow that tree, and predictions are averaged across trees.
In summary, the major differences between trees in CART models and trees in a random forest are the number of independent variables considered at each split and the sample of observations each tree is trained on, as the sketch below illustrates.
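As a minimal sketch of that difference (illustrative only, not part of the exam analysis; the mtry value here is an assumed choice), the contrast shows up directly in the fitting calls:

library(rpart)
library(randomForest)

# CART: one tree; every predictor is searched at every split
cart_fit <- rpart(Assault ~ ., data = USArrests, method = "anova")

# Random forest: many trees, each grown on a bootstrap sample of the
# rows, with only `mtry` randomly chosen predictors tried at each split
rf_fit <- randomForest(Assault ~ ., data = USArrests, ntree = 500, mtry = 1)

# Omitting newdata returns out-of-bag predictions: each row is predicted
# only by the trees that did not see it during training
oob_pred <- predict(rf_fit)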
rm(list = ls())
data("USArrests")
df <- USArrests
kable(head(df), caption = "USA Arrests Data from R")

| | Murder | Assault | UrbanPop | Rape |
|---|---|---|---|---|
| Alabama | 13.2 | 236 | 58 | 21.2 |
| Alaska | 10.0 | 263 | 48 | 44.5 |
| Arizona | 8.1 | 294 | 80 | 31.0 |
| Arkansas | 8.8 | 190 | 50 | 19.5 |
| California | 9.0 | 276 | 91 | 40.6 |
| Colorado | 7.9 | 204 | 78 | 38.7 |
set.seed(2634)
ggplot(df, aes(x = Assault, y = ..density..)) + geom_histogram(bins = 10, closed = "right") +
    geom_density(bw = 5, col = "red") + labs(title = "Kernel Density Estimator",
    subtitle = "Bandwidth = 5", y = "Density", x = "Number of Assault Arrests")

ggplot(df, aes(x = Assault, y = ..density..)) + geom_histogram(bins = 10, closed = "right") +
    geom_density(bw = 10, col = "red") + labs(title = "Kernel Density Estimator",
    subtitle = "Bandwidth = 10", y = "Density", x = "Number of Assault Arrests")

ggplot(df, aes(x = Assault, y = ..density..)) + geom_histogram(bins = 10, closed = "right") +
    geom_density(bw = 15, col = "red") + labs(title = "Kernel Density Estimator",
    subtitle = "Bandwidth = 15", y = "Density", x = "Number of Assault Arrests")

# Fitting loess regression models with spans between 0.1 and 2
fits <- data_frame(span = seq(from = 0.1, to = 2, by = 0.1)) %>% group_by(span) %>%
do(augment(loess(Assault ~ UrbanPop, df, degree = 2, span = .$span)))
# Animating the fitted loess curves across the smoothing parameter (span)
ggplot(fits, aes(x = UrbanPop, y = Assault, frame = span)) + geom_point() +
geom_line(aes(y = .fitted), color = "red") + transition_states(span, wrap = TRUE,
transition_length = 3) + labs(title = "Loess Regression: Assault Arrests as a function of Population",
    subtitle = "Span: {closest_state}", x = "Urban Population", y = "Assault Arrests")

rm(list = ls())
measles <- subset(us_contagious_diseases, disease == "Measles")
measles <- melt(measles, id = c("disease", "state", "year"), measure = "count")
measles <- dcast(measles, formula = year ~ state)
# Abbreviate the multi-word state names (removing the spaces)
names(measles)[c(10, 31:36, 41:43, 50)] <- c("DC", "NH", "NJ", "NM", "NY",
    "NC", "ND", "RI", "SC", "SD", "WV")
kable(head(measles), caption = "Measles Data")

| year | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | DC | Florida | Georgia | Hawaii | Idaho | Illinois | Indiana | Iowa | Kansas | Kentucky | Louisiana | Maine | Maryland | Massachusetts | Michigan | Minnesota | Mississippi | Missouri | Montana | Nebraska | Nevada | NH | NJ | NM | NY | NC | ND | Ohio | Oklahoma | Oregon | Pennsylvania | RI | SC | SD | Tennessee | Texas | Utah | Vermont | Virginia | Washington | WV | Wisconsin | Wyoming |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1928 | 8843 | 0 | 847 | 8899 | 3698 | 2099 | 10014 | 597 | 2566 | 1714 | 4637 | 0 | 88 | 7444 | 8470 | 702 | 2533 | 5551 | 5373 | 3204 | 17539 | 41371 | 25544 | 2052 | 0 | 7154 | 757 | 1083 | 0 | 1274 | 31631 | 2760 | 75391 | 56207 | 371 | 23522 | 6298 | 2089 | 57237 | 4986 | 19798 | 1097 | 8276 | 5524 | 85 | 1199 | 0 | 5348 | 3323 | 3642 | 500 |
| 1929 | 2959 | 0 | 236 | 1245 | 4024 | 748 | 9800 | 566 | 455 | 1127 | 1180 | 0 | 717 | 38503 | 10391 | 2041 | 10370 | 1125 | 1853 | 5732 | 2043 | 14861 | 18052 | 12663 | 0 | 6644 | 4285 | 4166 | 0 | 1609 | 7253 | 230 | 30318 | 1207 | 2286 | 37181 | 1095 | 4662 | 47605 | 1851 | 211 | 1158 | 860 | 4107 | 350 | 378 | 0 | 3866 | 6527 | 29825 | 696 |
| 1930 | 4156 | 0 | 2024 | 994 | 43416 | 11781 | 1810 | 261 | 889 | 5245 | 4459 | 0 | 879 | 16517 | 3915 | 9791 | 12555 | 1841 | 1816 | 1591 | 1209 | 26787 | 28213 | 5641 | 0 | 5462 | 633 | 10286 | 0 | 589 | 24525 | 2260 | 39887 | 1304 | 798 | 18488 | 3960 | 2642 | 34404 | 269 | 252 | 2400 | 4712 | 4275 | 5318 | 852 | 0 | 9904 | 2733 | 22082 | 772 |
| 1931 | 8934 | 0 | 2135 | 849 | 27807 | 4787 | 12869 | 2428 | 4198 | 3894 | 3053 | 0 | 118 | 39738 | 17628 | 851 | 1758 | 4801 | 244 | 3505 | 20959 | 16431 | 6403 | 3229 | 0 | 13723 | 1538 | 345 | 0 | 1406 | 20361 | 1415 | 54376 | 15299 | 966 | 19500 | 774 | 2343 | 79748 | 4529 | 3330 | 1474 | 3581 | 2336 | 153 | 1143 | 0 | 3121 | 5070 | 15146 | 139 |
| 1932 | 270 | 0 | 86 | 99 | 12618 | 2376 | 5701 | 39 | 273 | 207 | 1009 | 0 | 40 | 16113 | 2914 | 158 | 6150 | 1953 | 1453 | 8853 | 1093 | 19727 | 42116 | 2516 | 0 | 1618 | 5443 | 438 | 0 | 717 | 15407 | 1199 | 60464 | 13653 | 2150 | 36228 | 1034 | 4998 | 45957 | 9970 | 3192 | 667 | 1870 | 4564 | 72 | 4103 | 1309 | 10009 | 10476 | 28255 | 557 |
| 1933 | 1735 | 0 | 1261 | 5438 | 26551 | 297 | 5430 | 205 | 533 | 993 | 4628 | 0 | 898 | 12092 | 3570 | 824 | 6289 | 1560 | 931 | 110 | 697 | 15067 | 22089 | 20110 | 0 | 5771 | 2270 | 1625 | 0 | 1042 | 29855 | 668 | 72057 | 16197 | 2518 | 15751 | 2437 | 2253 | 32912 | 63 | 6319 | 2064 | 3667 | 21734 | 2091 | 1116 | 8656 | 2481 | 5315 | 9651 | 573 |
# Set seed and partition training and test data
set.seed(2634)
ind <- sample(1:dim(measles)[1], 50, replace = FALSE)
df.train <- measles[ind, ]
df.test <- measles[-ind, ]
m3 <- rpart(formula = Illinois ~ ., data = df.train, method = "class", control = list(minsplit = 10,
maxdepth = 12, xval = 10))
hyper_grid <- expand.grid(minsplit = seq(5, 20, 1), maxdepth = seq(8, 15, 1))
models <- list()
for (i in 1:nrow(hyper_grid)) {
# get minsplit, maxdepth values at row i
minsplit <- hyper_grid$minsplit[i]
maxdepth <- hyper_grid$maxdepth[i]
# train a model and store in the list
models[[i]] <- rpart(formula = Illinois ~ ., data = df.train, method = "class",
control = list(minsplit = minsplit, maxdepth = maxdepth))
}
# function to get optimal cp
get_cp <- function(x) {
    min <- which.min(x$cptable[, "xerror"])
    x$cptable[min, "CP"]
}
# function to get minimum error
get_min_error <- function(x) {
    min <- which.min(x$cptable[, "xerror"])
    x$cptable[min, "xerror"]
}
hyper_grid %>% mutate(cp = purrr::map_dbl(models, get_cp), error = purrr::map_dbl(models,
get_min_error)) %>% arrange(error) %>% top_n(-5, wt = error)
optimal_tree <- rpart(formula = Illinois ~ ., data = df.train, method = "class",
control = list(minsplit = 13, maxdepth = 11, cp = 0.01))
pred <- predict(optimal_tree, newdata = df.test, type = "prob")
RMSE(pred = pred, obs = df.test$Illinois)
pred.v <- predict(optimal_tree, newdata = df.test, type = "vector")
RMSE(pred = pred.v, obs = df.test$Illinois)
# MSE on the test set (predictions were made on df.test, so errors are
# computed against df.test$Illinois)
mse <- mean((pred - df.test$Illinois)^2)
mse
mse.v <- mean((pred.v - df.test$Illinois)^2)
mse.v
rpart.plot(optimal_tree)

minsplit maxdepth cp error
1 18 9 0.02083333 1.020833
2 16 10 0.02083333 1.020833
3 20 10 0.02083333 1.020833
4 5 11 0.02083333 1.020833
5 8 14 0.02083333 1.020833
[1] 20994.23
[1] 20976.59
[1] 367342596
[1] 371806390
set.seed(2634)
IL.allForests <- randomForest(Illinois ~ ., data = df.train, importance = TRUE,
ntree = 1000)
plot(IL.allForests)

kable(IL.allForests$importance)
# df.test was held out of training, so this is a genuine out-of-sample check
predicted_IL_all <- predict(IL.allForests, newdata = df.test)
mean((predicted_IL_all - df.test$Illinois)^2)

| | %IncMSE | IncNodePurity |
|---|---|---|
| year | 5013718.90 | 147016844 |
| Alabama | 3428043.46 | 196232298 |
| Alaska | 95544.66 | 12646942 |
| Arizona | 2838243.24 | 96108267 |
| Arkansas | 1911789.74 | 61883127 |
| California | 2136464.56 | 63279864 |
| Colorado | 845288.73 | 48817455 |
| Connecticut | 1623153.77 | 68633995 |
| Delaware | 2133156.97 | 152204705 |
| DC | 4031348.26 | 141946674 |
| Florida | 1510095.24 | 90524935 |
| Georgia | 295596.52 | 55653068 |
| Hawaii | 572525.72 | 11083755 |
| Idaho | 606956.24 | 40822339 |
| Indiana | 50378492.26 | 1359002447 |
| Iowa | 1625943.46 | 54703720 |
| Kansas | 1073052.50 | 42971344 |
| Kentucky | 1190068.17 | 65194579 |
| Louisiana | 20835.75 | 9632086 |
| Maine | 359022.01 | 31839591 |
| Maryland | 298545.90 | 40526396 |
| Massachusetts | 6061346.49 | 66800942 |
| Michigan | 11470584.70 | 1140871157 |
| Minnesota | 1662032.68 | 37375662 |
| Mississippi | 327433.78 | 10196907 |
| Missouri | 17003643.21 | 1314028002 |
| Montana | 21227254.51 | 640149893 |
| Nebraska | 4338777.71 | 170121677 |
| Nevada | 353571.72 | 5260739 |
| NH | 1889592.30 | 107437184 |
| NJ | 1821218.26 | 86121475 |
| NM | 1107558.84 | 30585413 |
| NY | 7675756.77 | 200186051 |
| NC | 1411373.60 | 180215866 |
| ND | 3735772.47 | 398554116 |
| Ohio | 34437650.30 | 1792231704 |
| Oklahoma | 1613582.83 | 38617384 |
| Oregon | 4213832.81 | 70573230 |
| Pennsylvania | 25977043.11 | 1820793259 |
| RI | 2232008.01 | 54976505 |
| SC | 231291.27 | 25179201 |
| SD | 3715450.95 | 199361294 |
| Tennessee | 3180351.65 | 98863190 |
| Texas | 3719564.86 | 86249283 |
| Utah | 973850.52 | 75584051 |
| Vermont | 516699.03 | 29817854 |
| Virginia | 2146913.56 | 70185814 |
| Washington | 649190.73 | 60611649 |
| WV | 19891411.55 | 432161089 |
| Wisconsin | 6194129.00 | 381078507 |
| Wyoming | 3302768.30 | 129772645 |
[1] 475351741