Exam 1: National Park Visits

The U.S. National Parks System includes 417 areas including national parks, monuments, battlefields, military parks, historical parks, historical sites, lakeshores, seashores, recreation areas, scenic rivers and trails, and the White House (see map in Figure 1). Every year, hundreds of millions of recreational visitors come to the parks. What do we know about the parks that can affect the visitor counts? Can we forecast the monthly visits to a given park accurately? To derive insights and answer these questions, we take a look at the historical visits data and the parks information released by the National Park Service (NPS). The dataset contains one row per park per month, with the following variables:

- ParkName: the full name of the park.
- ParkType: the type of the park (e.g., National Park, National Monument).
- Region: the NPS region the park belongs to.
- State: the state the park is located in.
- Year, Month: the year and month of the observation.
- lat, long: the latitude and longitude of the park.
- cost: the entrance fee.
- logVisits: the log of the number of recreational visits.
- laglogVisits: the logVisits from last month.
- laglogVisitsYear: the logVisits from last year.

visits <- read.csv("parkvisits.csv")
str(visits)
## 'data.frame':    25587 obs. of  12 variables:
##  $ ParkName        : chr  "Abraham Lincoln Birthplace NHP" "Abraham Lincoln Birthplace NHP" "Abraham Lincoln Birthplace NHP" "Abraham Lincoln Birthplace NHP" ...
##  $ ParkType        : chr  "National Historical Park" "National Historical Park" "National Historical Park" "National Historical Park" ...
##  $ Region          : chr  "Southeast " "Southeast " "Southeast " "Southeast " ...
##  $ State           : chr  "KY" "KY" "KY" "KY" ...
##  $ Year            : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Month           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ lat             : num  37.6 37.6 37.6 37.6 37.6 ...
##  $ long            : num  -85.7 -85.7 -85.7 -85.7 -85.7 ...
##  $ cost            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ logVisits       : num  8.26 8.55 8.99 9.81 9.87 ...
##  $ laglogVisits    : num  NA 8.26 8.55 8.99 9.81 ...
##  $ laglogVisitsYear: num  NA NA NA NA NA NA NA NA NA NA ...
visits2016jul <- subset(visits, Year == 2016 & Month == 7)
table(visits2016jul$ParkType)
## 
##     National Battlefield   National Historic Site National Historical Park 
##                       10                       75                       42 
##        National Memorial        National Monument            National Park 
##                       28                       68                       57 
## National Recreation Area        National Seashore 
##                       15                       10

Which park had the most visitors in July 2016?

visits2016jul[which.max(visits2016jul$logVisits),]
##                       ParkName      ParkType     Region State Year Month
## 11587 Great Smoky Mountains NP National Park Southeast     TN 2016     7
##            lat      long cost logVisits laglogVisits laglogVisitsYear
## 11587 35.60116 -83.50818    0    14.197     14.12603         14.18127

Relationship between regions and visits

Which region has the highest average log visits in July 2016?

tapply(visits2016jul$logVisits, visits2016jul$Region, FUN = mean, na.rm = TRUE)
##            Alaska    Intermountain           Midwest  National Capital  
##          9.374157         10.129625          9.747281         10.293175 
##        Northeast      Pacific West         Southeast  
##          9.914755         10.767849         10.027672
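
The printed averages show that Pacific West has the highest mean log visits. A small sketch to extract that answer programmatically rather than by eye:

region_means <- tapply(visits2016jul$logVisits, visits2016jul$Region, FUN = mean, na.rm = TRUE)
which.max(region_means)  # name and position of the largest average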

Relationship between cost and visits

What is the correlation between entrance fee (the variable cost) and the log visits in July 2016?

cor(visits2016jul$cost, visits2016jul$logVisits)
## [1] 0.4010611
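
The correlation is positive but moderate. To eyeball the relationship behind this number, a quick base-graphics sketch:

# Scatterplot of entrance fee against log visits for July 2016
plot(visits2016jul$cost, visits2016jul$logVisits,
     xlab = "Entrance fee (cost)", ylab = "log visits")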

Time series plot of visits

Let’s now look at the time dimension of the data. Subset the original data (visits) to “Yellowstone NP” only and save it as ys. Then use the following code to plot logVisits month by month from 2010 through 2016:

ys_ts = ts(ys$logVisits, start = c(2010, 1), freq = 12)

plot(ys_ts)

What observations do you make?

ys <- subset(visits, ParkName == "Yellowstone NP")

ys_ts <- ts(ys$logVisits, start = c(2010, 1), freq = 12)
plot(ys_ts)
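
The plot shows strong, regular seasonality, with visits peaking every summer. One way to make this pattern easier to see is a month plot, which groups the series by calendar month across years (a sketch using base R's monthplot):

# Each group shows one calendar month over 2010-2016; a pronounced
# summer peak indicates strong, stable seasonality
monthplot(ys_ts)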

Missing values

Note that there are some NA’s in the data - you can run colSums(is.na(visits)) to see the summary.

Why do we have NA’s in laglogVisits and laglogVisitsYear? These variables were created by lagging the log visits by a month or by a year, so each park's first month has no previous-month value and each park's first twelve months have no previous-year value.

colSums(is.na(visits))
##         ParkName         ParkType           Region            State 
##                0                0                0                0 
##             Year            Month              lat             long 
##                0                0               84               84 
##             cost        logVisits     laglogVisits laglogVisitsYear 
##                0                0              305             3660
To deal with the missing values, we will simply remove the observations with missing values first (there are more sophisticated ways to work with missing values, but for this purpose removing the observations is fine). Run the following:

visits <- visits[rowSums(is.na(visits)) == 0, ]
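
An equivalent, slightly more idiomatic way to drop incomplete rows is complete.cases():

# complete.cases() flags rows with no NA in any column
visits <- visits[complete.cases(visits), ]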

How many observations are there in visits now?
str(visits)
## 'data.frame':    21855 obs. of  12 variables:
##  $ ParkName        : chr  "Abraham Lincoln Birthplace NHP" "Abraham Lincoln Birthplace NHP" "Abraham Lincoln Birthplace NHP" "Abraham Lincoln Birthplace NHP" ...
##  $ ParkType        : chr  "National Historical Park" "National Historical Park" "National Historical Park" "National Historical Park" ...
##  $ Region          : chr  "Southeast " "Southeast " "Southeast " "Southeast " ...
##  $ State           : chr  "KY" "KY" "KY" "KY" ...
##  $ Year            : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Month           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ lat             : num  37.6 37.6 37.6 37.6 37.6 ...
##  $ long            : num  -85.7 -85.7 -85.7 -85.7 -85.7 ...
##  $ cost            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ logVisits       : num  7.88 8.2 8.98 9.87 9.74 ...
##  $ laglogVisits    : num  8.32 7.88 8.2 8.98 9.87 ...
##  $ laglogVisitsYear: num  8.26 8.55 8.99 9.81 9.87 ...

Predicting Visits

We are interested in predicting the log visits. Before splitting into training and testing sets, let's also make Month a factor variable by running the following:

visits$Month = as.factor(visits$Month)

Split the dataset into a training set and a testing set based on the year: training contains the 2010-2014 data, and testing contains the 2015-2016 data.

Let’s now build a simple linear regression model “mod” using the training set to predict the log visits. As a first step, we use only the laglogVisits variable (the log visits from last month).

What’s the coefficient of the laglogVisits variable?

visits$Month <- as.factor(visits$Month)
str(visits)
## 'data.frame':    21855 obs. of  12 variables:
##  $ ParkName        : chr  "Abraham Lincoln Birthplace NHP" "Abraham Lincoln Birthplace NHP" "Abraham Lincoln Birthplace NHP" "Abraham Lincoln Birthplace NHP" ...
##  $ ParkType        : chr  "National Historical Park" "National Historical Park" "National Historical Park" "National Historical Park" ...
##  $ Region          : chr  "Southeast " "Southeast " "Southeast " "Southeast " ...
##  $ State           : chr  "KY" "KY" "KY" "KY" ...
##  $ Year            : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Month           : Factor w/ 12 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ lat             : num  37.6 37.6 37.6 37.6 37.6 ...
##  $ long            : num  -85.7 -85.7 -85.7 -85.7 -85.7 ...
##  $ cost            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ logVisits       : num  7.88 8.2 8.98 9.87 9.74 ...
##  $ laglogVisits    : num  8.32 7.88 8.2 8.98 9.87 ...
##  $ laglogVisitsYear: num  8.26 8.55 8.99 9.81 9.87 ...
# Subsetting the data

training_visits <- subset(visits, Year <= 2014)
test_visits <- subset(visits, Year > 2014)
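
As a quick sanity check on the split (a sketch; the two counts should add up to the total in the cleaned data):

nrow(training_visits)  # observations from 2010-2014
nrow(test_visits)      # observations from 2015-2016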

# linear regression modelling

mod <- lm(logVisits ~ laglogVisits, data = training_visits)
summary(mod)
## 
## Call:
## lm(formula = logVisits ~ laglogVisits, data = training_visits)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.5265  -0.2866   0.0413   0.3306  12.3876 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.659455   0.028923    22.8   <2e-16 ***
## laglogVisits 0.927945   0.003073   301.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8817 on 14557 degrees of freedom
## Multiple R-squared:  0.8623, Adjusted R-squared:  0.8623 
## F-statistic: 9.117e+04 on 1 and 14557 DF,  p-value: < 2.2e-16
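
So the coefficient of laglogVisits is about 0.9279. To pull it out programmatically instead of reading it off the summary:

coef(mod)["laglogVisits"]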
# Prediction

Predicted_visits <- predict(mod, newdata = test_visits)

# out of sample R2
SSE <- sum((test_visits$logVisits - Predicted_visits)^2)
SST <- sum((test_visits$logVisits - mean(training_visits$logVisits))^2)
R_sq <- 1 - SSE/SST
R_sq
## [1] 0.8975923
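
Since we recompute the out-of-sample R2 for every model below, it may be worth wrapping these three lines in a small helper (a sketch; OSR2 is our own name, not a library function):

# Out-of-sample R2: 1 - SSE/SST, with SST measured against the
# training-set mean, exactly as above
OSR2 <- function(pred, test_actual, train_actual) {
  SSE <- sum((test_actual - pred)^2)
  SST <- sum((test_actual - mean(train_actual))^2)
  1 - SSE/SST
}
OSR2(Predicted_visits, test_visits$logVisits, training_visits$logVisits)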

Add new variables

We see that the model already achieves good predictive power using just the previous month's visits. To see whether the other information we have about the parks can improve the model, let's add these variables to a new model.

The new model would have the following variables:

laglogVisits, laglogVisitsYear, Year, Month, Region, ParkType, and cost

Looking at the model summary, which of the following statements are correct (significance at 0.05 level)?

mod2 <- lm(logVisits ~ laglogVisits + laglogVisitsYear + Year + Month + Region + ParkType + cost,
           data = training_visits)
summary(mod2)
## 
## Call:
## lm(formula = logVisits ~ laglogVisits + laglogVisitsYear + Year + 
##     Month + Region + ParkType + cost, data = training_visits)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.2891  -0.2014   0.0058   0.2293  10.1600 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -2.514727  10.909227  -0.231 0.817696    
## laglogVisits                      0.650702   0.005177 125.682  < 2e-16 ***
## laglogVisitsYear                  0.314173   0.005163  60.845  < 2e-16 ***
## Year                              0.001244   0.005421   0.230 0.818468    
## Month2                            0.174008   0.029701   5.859 4.77e-09 ***
## Month3                            0.422671   0.029859  14.156  < 2e-16 ***
## Month4                            0.276278   0.029842   9.258  < 2e-16 ***
## Month5                            0.397868   0.030025  13.251  < 2e-16 ***
## Month6                            0.272168   0.030113   9.038  < 2e-16 ***
## Month7                            0.255440   0.030234   8.449  < 2e-16 ***
## Month8                            0.027959   0.030201   0.926 0.354589    
## Month9                           -0.024103   0.030090  -0.801 0.423141    
## Month10                          -0.216415   0.029958  -7.224 5.30e-13 ***
## Month11                          -0.177488   0.029782  -5.960 2.59e-09 ***
## Month12                          -0.105784   0.029709  -3.561 0.000371 ***
## RegionIntermountain               0.207608   0.037177   5.584 2.39e-08 ***
## RegionMidwest                     0.207578   0.038276   5.423 5.95e-08 ***
## RegionNational Capital            0.205045   0.047474   4.319 1.58e-05 ***
## RegionNortheast                   0.234444   0.038089   6.155 7.70e-10 ***
## RegionPacific West                0.233219   0.037759   6.177 6.73e-10 ***
## RegionSoutheast                   0.249889   0.038567   6.479 9.50e-11 ***
## ParkTypeNational Historic Site   -0.048652   0.036199  -1.344 0.178960    
## ParkTypeNational Historical Park  0.029225   0.037831   0.772 0.439832    
## ParkTypeNational Memorial         0.028364   0.039655   0.715 0.474452    
## ParkTypeNational Monument        -0.026587   0.037258  -0.714 0.475489    
## ParkTypeNational Park             0.019296   0.039394   0.490 0.624276    
## ParkTypeNational Recreation Area  0.060879   0.045403   1.341 0.179980    
## ParkTypeNational Seashore         0.034273   0.048327   0.709 0.478218    
## cost                              0.002859   0.001028   2.780 0.005435 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7309 on 14530 degrees of freedom
## Multiple R-squared:  0.9056, Adjusted R-squared:  0.9054 
## F-statistic:  4977 on 28 and 14530 DF,  p-value: < 2.2e-16
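
To list the coefficients that are significant at the 0.05 level without scanning the table by eye, a short sketch:

coefs <- summary(mod2)$coefficients
# Column 4 holds the p-values; keep the rows below the 0.05 threshold
rownames(coefs)[coefs[, 4] < 0.05]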

Out-of-sample R2 of the new model

In the new model, what’s the out-of-sample R2 in the testing set?

Predicted_visits2 <- predict(mod2, newdata = test_visits)

SSE2 <- sum((test_visits$logVisits - Predicted_visits2)^2)
SST2 <- sum((test_visits$logVisits - mean(training_visits$logVisits))^2)
R_sq2 <- 1 - SSE2/SST2
R_sq2
## [1] 0.937253

Regression trees with CV

The out-of-sample R2 of a plain regression tree does not appear to be very good compared to the linear regression model. We could potentially improve it via cross-validation.

Set the seed to 201 and run a 10-fold cross-validated CART model, with cp ranging from 0.0001 to 0.005 in increments of 0.0001. What is the optimal cp value on this grid?

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(201)

numFolds <- trainControl(method = "cv", number = 10)
cartGrid <- expand.grid(.cp = seq(0.0001, 0.005, 0.0001))

train(logVisits ~ laglogVisits + laglogVisitsYear + Year + Month + Region + ParkType + cost,
      data = training_visits, method = "rpart", trControl = numFolds, tuneGrid = cartGrid)
## CART 
## 
## 14559 samples
##     7 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 13103, 13103, 13102, 13104, 13103, 13104, ... 
## Resampling results across tuning parameters:
## 
##   cp      RMSE       Rsquared   MAE      
##   0.0001  0.7091974  0.9108305  0.3458448
##   0.0002  0.7085292  0.9109220  0.3512776
##   0.0003  0.7152408  0.9092339  0.3583233
##   0.0004  0.7237052  0.9070232  0.3688886
##   0.0005  0.7362237  0.9038112  0.3815323
##   0.0006  0.7399929  0.9028245  0.3855472
##   0.0007  0.7380079  0.9033927  0.3894208
##   0.0008  0.7403798  0.9027542  0.3923063
##   0.0009  0.7408346  0.9025643  0.3927549
##   0.0010  0.7481463  0.9005786  0.3972774
##   0.0011  0.7467012  0.9009523  0.3987272
##   0.0012  0.7495321  0.9002201  0.3997401
##   0.0013  0.7519675  0.8995514  0.4014459
##   0.0014  0.7547692  0.8987879  0.4052245
##   0.0015  0.7568397  0.8982017  0.4061301
##   0.0016  0.7575437  0.8979788  0.4086154
##   0.0017  0.7585135  0.8976539  0.4106278
##   0.0018  0.7640608  0.8962863  0.4131544
##   0.0019  0.7667635  0.8955105  0.4159244
##   0.0020  0.7667635  0.8955105  0.4159244
##   0.0021  0.7725256  0.8939648  0.4215485
##   0.0022  0.7782371  0.8924668  0.4308113
##   0.0023  0.7839540  0.8908853  0.4366702
##   0.0024  0.7884100  0.8896100  0.4439863
##   0.0025  0.7870230  0.8899620  0.4449950
##   0.0026  0.7957688  0.8874951  0.4517651
##   0.0027  0.8003900  0.8861064  0.4611686
##   0.0028  0.8024952  0.8854989  0.4633812
##   0.0029  0.8005645  0.8860123  0.4650752
##   0.0030  0.8038477  0.8851550  0.4683632
##   0.0031  0.8038477  0.8851550  0.4683632
##   0.0032  0.8038477  0.8851550  0.4683632
##   0.0033  0.8038477  0.8851550  0.4683632
##   0.0034  0.8046526  0.8848610  0.4688624
##   0.0035  0.8046526  0.8848610  0.4688624
##   0.0036  0.8046526  0.8848610  0.4688624
##   0.0037  0.8046526  0.8848610  0.4688624
##   0.0038  0.8009293  0.8858505  0.4683310
##   0.0039  0.8009293  0.8858505  0.4683310
##   0.0040  0.8023013  0.8854159  0.4692306
##   0.0041  0.8023013  0.8854159  0.4692306
##   0.0042  0.8023013  0.8854159  0.4692306
##   0.0043  0.8023013  0.8854159  0.4692306
##   0.0044  0.8023013  0.8854159  0.4692306
##   0.0045  0.8023013  0.8854159  0.4692306
##   0.0046  0.8023013  0.8854159  0.4692306
##   0.0047  0.8023013  0.8854159  0.4692306
##   0.0048  0.8023013  0.8854159  0.4692306
##   0.0049  0.8023013  0.8854159  0.4692306
##   0.0050  0.8037850  0.8849801  0.4696590
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 2e-04.
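
caret reports the selected value at the bottom of the printout: cp = 2e-04. If the train() call above is saved to an object (say cv_fit, our own name), the winning tuning parameters can also be read off directly:

cv_fit <- train(logVisits ~ laglogVisits + laglogVisitsYear + Year + Month + Region + ParkType + cost,
                data = training_visits, method = "rpart", trControl = numFolds, tuneGrid = cartGrid)
cv_fit$bestTune  # the cp with the smallest cross-validated RMSE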

Final regression tree

Rerun the regression tree on the training data, now using the cp value selected in the previous problem (under the original range). Note: do not take the tree from the cross-validation output directly.

What is the out-of-sample R2 in the testing set?

library(rpart)

set.seed(201)

Final_tree <- rpart(logVisits ~ laglogVisits + laglogVisitsYear + Year + Month + Region + ParkType + cost,
                    data = training_visits, cp = 1e-4)
pred_FinTree <- predict(Final_tree, newdata = test_visits)

# out of sample R2

SSE3 <- sum((test_visits$logVisits - pred_FinTree)^2)
SST3 <- sum((test_visits$logVisits - mean(training_visits$logVisits))^2)
R_sq3 <- 1 - SSE3/SST3
R_sq3
## [1] 0.937113
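
To inspect the fitted tree itself, a sketch assuming the rpart.plot package is installed (with a cp this small the tree is large, so this is mainly a sanity check):

library(rpart.plot)
prp(Final_tree)  # draws the fitted regression tree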

Random Forest

We can potentially improve the models further by using a random forest. Set the seed to 201 again. Train a random forest model with the same set of covariates, using just the default parameters (there is no need to specify any). This may take a few minutes.

What is the R2 on the testing set for the random forest model?

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(201)

# Building the model and run predictions
RForest_visits <- randomForest(logVisits ~ laglogVisits + laglogVisitsYear + Year + Month + Region + ParkType + cost,
                               data = training_visits)
Pred_RforestB <- predict(RForest_visits, newdata = test_visits)

# out of sample R2

SSE4 <- sum((test_visits$logVisits - Pred_RforestB)^2)
SST4 <- sum((test_visits$logVisits - mean(training_visits$logVisits))^2)
R_sq4 <- 1 - SSE4/SST4
R_sq4
## [1] 0.9473915
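
The random forest gives the best out-of-sample R2 so far (0.947, versus about 0.937 for both the tuned tree and the full linear model). To see which covariates drive its predictions, the standard randomForest accessors:

importance(RForest_visits)   # increase in node purity attributable to each variable
varImpPlot(RForest_visits)   # the same information as a dot chart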
## DONE PART 1 =)