The U.S. National Park System includes 417 areas, including national parks, monuments, battlefields, military parks, historical parks, historic sites, lakeshores, seashores, recreation areas, scenic rivers and trails, and the White House (see the map in Figure 1). Every year, hundreds of millions of recreational visitors come to the parks. What do we know about the parks that can affect visitor counts? Can we accurately forecast the monthly visits to a given park? To derive insights and answer these questions, we look at the historical visits data and the park information released by the National Park Service (NPS). The dataset contains the following variables:

- ParkName: The full name of the park.
- ParkType: The type of the park. For this study we restrict ourselves to the following more frequently visited types: National Battlefield, National Historic Site, National Historical Park, National Memorial, National Monument, National Park, National Recreation Area, and National Seashore.
- Region: The region of the park: Alaska, Intermountain, Midwest, National Capital, Northeast, Pacific West, or Southeast.
- State: The abbreviation of the state where the park resides.
- Year, Month: The year and the month of the visits.
- lat, long: The latitude and longitude of the park.
- cost: A simple extraction of the park's entrance fee. Some parks have multiple levels of entrance fees (differing by transportation method, age, military status, etc.); for this problem we only extracted the first available cost.
- logVisits: The natural logarithm of the recreational visits to the park in the given year and month (with one added to the visits to avoid taking the log of zero).
- laglogVisits: The logVisits from the previous month.
- laglogVisitsYear: The logVisits from the same month of the previous year. (A sketch of how these lagged columns could be constructed follows this list.)
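These derived columns already come precomputed in parkvisits.csv, but as a rough illustration, the sketch below shows how logVisits and the two lags could be built from a raw monthly visits column. It assumes a hypothetical data frame raw with columns ParkName, Year, Month, and Visits; the actual NPS preprocessing may differ.

library(dplyr)
# Sketch only: sort each park's rows chronologically, then lag within park.
# Assumes consecutive months with no gaps in each park's series.
raw <- raw %>%
  arrange(ParkName, Year, Month) %>%
  group_by(ParkName) %>%
  mutate(logVisits        = log(Visits + 1),   # add 1 to avoid log(0)
         laglogVisits     = lag(logVisits, 1),  # previous month
         laglogVisitsYear = lag(logVisits, 12)) %>%  # same month last year
  ungroup()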
visits <- read.csv("parkvisits.csv")
str(visits)
## 'data.frame': 25587 obs. of 12 variables:
## $ ParkName : chr "Abraham Lincoln Birthplace NHP" "Abraham Lincoln Birthplace NHP" "Abraham Lincoln Birthplace NHP" "Abraham Lincoln Birthplace NHP" ...
## $ ParkType : chr "National Historical Park" "National Historical Park" "National Historical Park" "National Historical Park" ...
## $ Region : chr "Southeast " "Southeast " "Southeast " "Southeast " ...
## $ State : chr "KY" "KY" "KY" "KY" ...
## $ Year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ Month : int 1 2 3 4 5 6 7 8 9 10 ...
## $ lat : num 37.6 37.6 37.6 37.6 37.6 ...
## $ long : num -85.7 -85.7 -85.7 -85.7 -85.7 ...
## $ cost : num 0 0 0 0 0 0 0 0 0 0 ...
## $ logVisits : num 8.26 8.55 8.99 9.81 9.87 ...
## $ laglogVisits : num NA 8.26 8.55 8.99 9.81 ...
## $ laglogVisitsYear: num NA NA NA NA NA NA NA NA NA NA ...
visits2016jul <- subset(visits, Year == 2016 & Month == 7)
table(visits2016jul$ParkType)
##
## National Battlefield National Historic Site National Historical Park
## 10 75 42
## National Memorial National Monument National Park
## 28 68 57
## National Recreation Area National Seashore
## 15 10
visits2016jul[which.max(visits2016jul$logVisits),]
## ParkName ParkType Region State Year Month
## 11587 Great Smoky Mountains NP National Park Southeast TN 2016 7
## lat long cost logVisits laglogVisits laglogVisitsYear
## 11587 35.60116 -83.50818 0 14.197 14.12603 14.18127
Which region has the highest average log visits in July 2016?
tapply(visits2016jul$logVisits, visits2016jul$Region, FUN = mean, na.rm = TRUE)
## Alaska Intermountain Midwest National Capital
## 9.374157 10.129625 9.747281 10.293175
## Northeast Pacific West Southeast
## 9.914755 10.767849 10.027672
What is the correlation between entrance fee (the variable cost) and the log visits in July 2016?
cor(visits2016jul$cost, visits2016jul$logVisits)
## [1] 0.4010611
Let’s now look at the time dimension of the data. Subset the original data (visits) to “Yellowstone NP” only and save it as ys. Use the following code to plot logVisits across the months from 2010 through 2016:
ys_ts <- ts(ys$logVisits, start = c(2010, 1), freq = 12)
plot(ys_ts)
What observations do you make?
ys <- subset(visits, ParkName == "Yellowstone NP")
ys_ts <- ts(ys$logVisits, start = c(2010, 1), freq = 12)
plot(ys_ts)
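Yellowstone's visits are highly seasonal (the park largely shuts down in winter), and the plot shows this as a strong, regular annual cycle. Although not required for the problem, a classical decomposition is a quick way to separate the seasonal swing from the trend; a sketch:

# Optional: split the Yellowstone series into trend, seasonal, and remainder components.
ys_decomp <- decompose(ys_ts)
plot(ys_decomp)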
Note that there are some NAs in the data; you can run colSums(is.na(visits)) to see a summary by column.
Why do we have NAs in laglogVisits and laglogVisitsYear? These variables were created by lagging the log visits by a month or by a year, so the earliest observations for each park have no prior month (or prior year) to draw from, and the lagged value is missing.
colSums(is.na(visits))
## ParkName ParkType Region State
## 0 0 0 0
## Year Month lat long
## 0 0 84 84
## cost logVisits laglogVisits laglogVisitsYear
## 0 0 305 3660
To deal with the missing values, we will simply remove the observations that contain them (there are more sophisticated ways to handle missing data, but for this purpose removing the observations is fine). Run the following:
visits <- visits[rowSums(is.na(visits)) == 0, ]
How many observations are there in visits now?
str(visits)
## 'data.frame': 21855 obs. of 12 variables:
## $ ParkName : chr "Abraham Lincoln Birthplace NHP" "Abraham Lincoln Birthplace NHP" "Abraham Lincoln Birthplace NHP" "Abraham Lincoln Birthplace NHP" ...
## $ ParkType : chr "National Historical Park" "National Historical Park" "National Historical Park" "National Historical Park" ...
## $ Region : chr "Southeast " "Southeast " "Southeast " "Southeast " ...
## $ State : chr "KY" "KY" "KY" "KY" ...
## $ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ Month : int 1 2 3 4 5 6 7 8 9 10 ...
## $ lat : num 37.6 37.6 37.6 37.6 37.6 ...
## $ long : num -85.7 -85.7 -85.7 -85.7 -85.7 ...
## $ cost : num 0 0 0 0 0 0 0 0 0 0 ...
## $ logVisits : num 7.88 8.2 8.98 9.87 9.74 ...
## $ laglogVisits : num 8.32 7.88 8.2 8.98 9.87 ...
## $ laglogVisitsYear: num 8.26 8.55 8.99 9.81 9.87 ...
We are interested in predicting the log visits. Before doing the split, let’s also make Month a factor variable by including the following:
visits$Month <- as.factor(visits$Month)
Subset our dataset into a training set and a testing set by splitting on the year: training contains the 2010-2014 data, and testing contains the 2015-2016 data.
Let’s now build a simple linear regression model “mod” using the training set to predict the log visits. As a first step, we only use the laglogVisits variable (log visits from the previous month).
What’s the coefficient of the laglogVisits variable?
visits$Month <- as.factor(visits$Month)
str(visits)
## 'data.frame': 21855 obs. of 12 variables:
## $ ParkName : chr "Abraham Lincoln Birthplace NHP" "Abraham Lincoln Birthplace NHP" "Abraham Lincoln Birthplace NHP" "Abraham Lincoln Birthplace NHP" ...
## $ ParkType : chr "National Historical Park" "National Historical Park" "National Historical Park" "National Historical Park" ...
## $ Region : chr "Southeast " "Southeast " "Southeast " "Southeast " ...
## $ State : chr "KY" "KY" "KY" "KY" ...
## $ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ Month : Factor w/ 12 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ lat : num 37.6 37.6 37.6 37.6 37.6 ...
## $ long : num -85.7 -85.7 -85.7 -85.7 -85.7 ...
## $ cost : num 0 0 0 0 0 0 0 0 0 0 ...
## $ logVisits : num 7.88 8.2 8.98 9.87 9.74 ...
## $ laglogVisits : num 8.32 7.88 8.2 8.98 9.87 ...
## $ laglogVisitsYear: num 8.26 8.55 8.99 9.81 9.87 ...
# Subset into training (2010-2014) and testing (2015-2016) sets
training_visits <- subset(visits, Year <= 2014)
test_visits <- subset(visits, Year > 2014)
# Linear regression using only last month's log visits
mod <- lm(logVisits ~ laglogVisits, data = training_visits)
summary(mod)
##
## Call:
## lm(formula = logVisits ~ laglogVisits, data = training_visits)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.5265 -0.2866 0.0413 0.3306 12.3876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.659455 0.028923 22.8 <2e-16 ***
## laglogVisits 0.927945 0.003073 301.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8817 on 14557 degrees of freedom
## Multiple R-squared: 0.8623, Adjusted R-squared: 0.8623
## F-statistic: 9.117e+04 on 1 and 14557 DF, p-value: < 2.2e-16
# Prediction
Predicted_visits <- predict(mod, newdata = test_visits)
# out of sample R2
SSE <- sum((test_visits$logVisits - Predicted_visits)^2)
SST <- sum((test_visits$logVisits - mean(training_visits$logVisits))^2)
R_sq <- 1 - SSE/SST
R_sq
## [1] 0.8975923
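The same out-of-sample R2 calculation is repeated for every model below, so it could be wrapped in a small helper function; a sketch (the rest of the write-up keeps the explicit SSE/SST computation):

# Hypothetical helper: out-of-sample R^2 = 1 - SSE/SST, with SST measured
# against the mean of the *training* response, as in the calculation above.
osr2 <- function(actual, predicted, train_mean) {
  1 - sum((actual - predicted)^2) / sum((actual - train_mean)^2)
}
osr2(test_visits$logVisits, Predicted_visits, mean(training_visits$logVisits))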
We see that the model already achieves good predictive power simply by using the previous month’s visits. To see whether the other information we have about the parks can improve the model, let’s add these variables in a new model.
The new model would have the following variables:
laglogVisits, laglogVisitsYear, Year, Month, Region, ParkType, and cost
Looking at the model summary, which variables are significant at the 0.05 level?
mod2 <- lm(logVisits ~ laglogVisits + laglogVisitsYear + Year + Month + Region + ParkType + cost, data = training_visits)
summary(mod2)
##
## Call:
## lm(formula = logVisits ~ laglogVisits + laglogVisitsYear + Year +
## Month + Region + ParkType + cost, data = training_visits)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.2891 -0.2014 0.0058 0.2293 10.1600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.514727 10.909227 -0.231 0.817696
## laglogVisits 0.650702 0.005177 125.682 < 2e-16 ***
## laglogVisitsYear 0.314173 0.005163 60.845 < 2e-16 ***
## Year 0.001244 0.005421 0.230 0.818468
## Month2 0.174008 0.029701 5.859 4.77e-09 ***
## Month3 0.422671 0.029859 14.156 < 2e-16 ***
## Month4 0.276278 0.029842 9.258 < 2e-16 ***
## Month5 0.397868 0.030025 13.251 < 2e-16 ***
## Month6 0.272168 0.030113 9.038 < 2e-16 ***
## Month7 0.255440 0.030234 8.449 < 2e-16 ***
## Month8 0.027959 0.030201 0.926 0.354589
## Month9 -0.024103 0.030090 -0.801 0.423141
## Month10 -0.216415 0.029958 -7.224 5.30e-13 ***
## Month11 -0.177488 0.029782 -5.960 2.59e-09 ***
## Month12 -0.105784 0.029709 -3.561 0.000371 ***
## RegionIntermountain 0.207608 0.037177 5.584 2.39e-08 ***
## RegionMidwest 0.207578 0.038276 5.423 5.95e-08 ***
## RegionNational Capital 0.205045 0.047474 4.319 1.58e-05 ***
## RegionNortheast 0.234444 0.038089 6.155 7.70e-10 ***
## RegionPacific West 0.233219 0.037759 6.177 6.73e-10 ***
## RegionSoutheast 0.249889 0.038567 6.479 9.50e-11 ***
## ParkTypeNational Historic Site -0.048652 0.036199 -1.344 0.178960
## ParkTypeNational Historical Park 0.029225 0.037831 0.772 0.439832
## ParkTypeNational Memorial 0.028364 0.039655 0.715 0.474452
## ParkTypeNational Monument -0.026587 0.037258 -0.714 0.475489
## ParkTypeNational Park 0.019296 0.039394 0.490 0.624276
## ParkTypeNational Recreation Area 0.060879 0.045403 1.341 0.179980
## ParkTypeNational Seashore 0.034273 0.048327 0.709 0.478218
## cost 0.002859 0.001028 2.780 0.005435 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7309 on 14530 degrees of freedom
## Multiple R-squared: 0.9056, Adjusted R-squared: 0.9054
## F-statistic: 4977 on 28 and 14530 DF, p-value: < 2.2e-16
In the new model, what’s the out-of-sample R2 in the testing set?
Predicted_visits2 <- predict(mod2, newdata = test_visits)
SSE2 <- sum((test_visits$logVisits - Predicted_visits2)^2)
SST2 <- sum((test_visits$logVisits - mean(training_visits$logVisits))^2)
R_sq2 <- 1 - SSE2/SST2
R_sq2
## [1] 0.937253
Compared with the linear regression model, the out-of-sample R2 of a basic regression tree does not appear to be very good. We could potentially improve it by cross-validating the complexity parameter (cp).
Set the seed to 201 and run a 10-fold cross-validated CART model, with cp ranging from 0.0001 to 0.005 in increments of 0.0001. What is the optimal cp value on this grid?
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(201)
numFolds <- trainControl(method = "cv", number = 10)
cartGrid <- expand.grid( .cp = seq(0.0001,0.005,0.0001))
train(logVisits ~ laglogVisits + laglogVisitsYear + Year + Month + Region + ParkType + cost, data = training_visits, method = "rpart", trControl = numFolds, tuneGrid = cartGrid)
## CART
##
## 14559 samples
## 7 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 13103, 13103, 13102, 13104, 13103, 13104, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.0001 0.7091974 0.9108305 0.3458448
## 0.0002 0.7085292 0.9109220 0.3512776
## 0.0003 0.7152408 0.9092339 0.3583233
## 0.0004 0.7237052 0.9070232 0.3688886
## 0.0005 0.7362237 0.9038112 0.3815323
## 0.0006 0.7399929 0.9028245 0.3855472
## 0.0007 0.7380079 0.9033927 0.3894208
## 0.0008 0.7403798 0.9027542 0.3923063
## 0.0009 0.7408346 0.9025643 0.3927549
## 0.0010 0.7481463 0.9005786 0.3972774
## 0.0011 0.7467012 0.9009523 0.3987272
## 0.0012 0.7495321 0.9002201 0.3997401
## 0.0013 0.7519675 0.8995514 0.4014459
## 0.0014 0.7547692 0.8987879 0.4052245
## 0.0015 0.7568397 0.8982017 0.4061301
## 0.0016 0.7575437 0.8979788 0.4086154
## 0.0017 0.7585135 0.8976539 0.4106278
## 0.0018 0.7640608 0.8962863 0.4131544
## 0.0019 0.7667635 0.8955105 0.4159244
## 0.0020 0.7667635 0.8955105 0.4159244
## 0.0021 0.7725256 0.8939648 0.4215485
## 0.0022 0.7782371 0.8924668 0.4308113
## 0.0023 0.7839540 0.8908853 0.4366702
## 0.0024 0.7884100 0.8896100 0.4439863
## 0.0025 0.7870230 0.8899620 0.4449950
## 0.0026 0.7957688 0.8874951 0.4517651
## 0.0027 0.8003900 0.8861064 0.4611686
## 0.0028 0.8024952 0.8854989 0.4633812
## 0.0029 0.8005645 0.8860123 0.4650752
## 0.0030 0.8038477 0.8851550 0.4683632
## 0.0031 0.8038477 0.8851550 0.4683632
## 0.0032 0.8038477 0.8851550 0.4683632
## 0.0033 0.8038477 0.8851550 0.4683632
## 0.0034 0.8046526 0.8848610 0.4688624
## 0.0035 0.8046526 0.8848610 0.4688624
## 0.0036 0.8046526 0.8848610 0.4688624
## 0.0037 0.8046526 0.8848610 0.4688624
## 0.0038 0.8009293 0.8858505 0.4683310
## 0.0039 0.8009293 0.8858505 0.4683310
## 0.0040 0.8023013 0.8854159 0.4692306
## 0.0041 0.8023013 0.8854159 0.4692306
## 0.0042 0.8023013 0.8854159 0.4692306
## 0.0043 0.8023013 0.8854159 0.4692306
## 0.0044 0.8023013 0.8854159 0.4692306
## 0.0045 0.8023013 0.8854159 0.4692306
## 0.0046 0.8023013 0.8854159 0.4692306
## 0.0047 0.8023013 0.8854159 0.4692306
## 0.0048 0.8023013 0.8854159 0.4692306
## 0.0049 0.8023013 0.8854159 0.4692306
## 0.0050 0.8037850 0.8849801 0.4696590
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 2e-04.
Rerun the regression tree on the training data, now using the cp value selected in the previous problem (from the original grid). Note: do not take the tree from the cross-validation output directly.
What is the out-of-sample R2 in the testing set?
library(rpart)
set.seed(201)
Final_tree <- rpart(logVisits ~ laglogVisits + laglogVisitsYear + Year + Month + Region + ParkType + cost, data = training_visits, cp = 1e-4)
pred_FinTree <- predict(Final_tree, newdata = test_visits)
# out of sample R2
SSE3 <- sum((test_visits$logVisits - pred_FinTree)^2)
SST3 <- sum((test_visits$logVisits - mean(training_visits$logVisits))^2)
R_sq3 <- 1 - SSE3/SST3
R_sq3
## [1] 0.937113
We can potentially further improve the models by using a random forest. Set the seed to 201 again. Train a random forest model with the same set of covariates, using the default parameters (no need to specify any). This may take a few minutes.
What is the R2 on the testing set for the random forest model?
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(201)
# Build the random forest and generate predictions on the test set
RForest_visits <- randomForest(logVisits ~ laglogVisits + laglogVisitsYear + Year + Month + Region + ParkType + cost, data=training_visits)
Pred_RforestB <- predict(RForest_visits, newdata = test_visits)
# out of sample R2
SSE4 <- sum((test_visits$logVisits - Pred_RforestB)^2)
SST4 <- sum((test_visits$logVisits - mean(training_visits$logVisits))^2)
R_sq4 <- 1 - SSE4/SST4
R_sq4
## [1] 0.9473915
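As an optional follow-up (not part of the original questions), randomForest's importance() and varImpPlot() functions give a rough sense of which covariates the forest relies on most:

# Optional: IncNodePurity is the total reduction in residual sum of squares
# attributable to splits on each variable, summed over all trees.
importance(RForest_visits)
varImpPlot(RForest_visits)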