Description

Use historical draw results, and number of hunters to train a model we can use to predict the number of hunters in future years. I would like to compare this to the results we generated by grouping the seasons together.

NOTICE that I am only looking at the general rifle hunting seasons on public land. There are also hunters for Archery, Muzzleloader, Private Land, Ranching for Wildlife, etc.


Setup

Load required libraries for wrangling data, charting, and mapping

library(plyr,quietly = T) # data wrangling
library(dplyr,quietly = T) # data wrangling

Attaching package: ‘dplyr’

The following objects are masked from ‘package:plyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise, summarize

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
library(ggplot2, quietly = T) # charting
library(ggthemes,quietly = T) # so I can add the highcharts theme and palette
library(scales,quietly = T) # to load the percent function when labeling plots
library(caret,quietly = T) # classification and regression training
library(foreach,quietly = T) # parallel processing to speed up the model training
library(doMC,quietly = T) # parallel processing to speed up the model training
library(lubridate,quietly = T) # for timing models

Attaching package: ‘lubridate’

The following object is masked from ‘package:plyr’:

    here

The following object is masked from ‘package:base’:

    date

Set our preferred charting theme

theme_set(theme_minimal()+theme_hc()+theme(legend.key.width = unit(1.5, "cm")))

Run script to get hunter data

source('~/_code/colorado-dow/datasets/Colorado Elk Harvest Data.R', echo=F)
toOrdinal 1.0-0.0  (7-18-2017). For help: >help("toOrdinal") or visit https://centerforassessment.github.io/toOrdinal
The working directory was changed to /Users/psarnow/_code/colorado-dow/datasets inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.

Table of the harvest data

head(COElkRifleAll)
  Unit Harvest.Antlered Hunters.Antlered Success.Antlered Season HuntCode Harvest.Antlerless Hunters.Antlerless Success.Antlerless Hunters.Either Success.Either Year
1    1                0                0               NA      1 EM001O1R                 NA                 NA                 NA             NA             NA 2006
2    2                0                0               NA      1 EM002O1R                 NA                 NA                 NA             NA             NA 2006
3  201                0                0               NA      1 EM201O1R                 NA                 NA                 NA             NA             NA 2006
4    3                0                0               NA      1 EM003O1R                 NA                 NA                 NA             NA             NA 2006
5  301                0                0               NA      1 EM301O1R                 NA                 NA                 NA             NA             NA 2006
6    4                0                0               NA      1 EM004O1R                 NA                 NA                 NA             NA             NA 2006

Run script to get draw data

source('~/_code/colorado-dow/datasets/Elk Drawing Summaries.R', echo=F)
Expected 26 pieces. Missing pieces filled with `NA` in 85 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].binding character and factor vector, coercing into character vectorExpected 26 pieces. Missing pieces filled with `NA` in 85 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].binding character and factor vector, coercing into character vectorExpected 26 pieces. Missing pieces filled with `NA` in 85 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].binding character and factor vector, coercing into character vectorExpected 26 pieces. Missing pieces filled with `NA` in 81 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].binding character and factor vector, coercing into character vectorThe working directory was changed to /Users/psarnow/_code/colorado-dow/datasets inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.

Table of the data

head(COElkDrawAll)
  HuntCode Orig_Quota Ttl_Chce_1 Chcs_Drawn    Sex Unit Season Year Draw_Success
1 EE003O1R       1075       2188       1075 Either    3      1 2006    0.4913163
2 EE003O4R        300        533        300 Either    3      4 2006    0.5628518
3 EE004O4R        150        235        150 Either    4      4 2006    0.6382979
4 EE005O4R         10         33         10 Either    5      4 2006    0.3030303
5 EE006O1R        100        168        100 Either    6      1 2006    0.5952381
6 EE006O4R         80         83         80 Either    6      4 2006    0.9638554

source geodata

source('~/_code/colorado-dow/datasets/Colorado GMUnit and Road data.R', echo=F)
rgdal: version: 1.2-20, (SVN revision 725)
 Geospatial Data Abstraction Library extensions to R successfully loaded
 Loaded GDAL runtime: GDAL 2.1.3, released 2017/20/01
 Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rgdal/gdal
 GDAL binary built with GEOS: FALSE 
 Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
 Path to PROJ.4 shared files: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rgdal/proj
 Linking to sp version: 1.2-7 
rgeos version: 0.3-26, (SVN revision 560)
 GEOS runtime version: 3.6.1-CAPI-1.10.1 r0 
 Linking to sp version: 1.2-7 
 Polygon checking: TRUE 


Attaching package: ‘magrittr’

The following object is masked from ‘package:tidyr’:

    extract
OGR data source with driver: ESRI Shapefile 
Source: "/Users/psarnow/_code/colorado-dow/datasets/CPW_GMUBoundaries/BigGameGMUBoundaries03172015.shp", layer: "BigGameGMUBoundaries03172015"
with 185 features
It has 12 fields
Integer64 fields read as strings:  GMUID 
OGR data source with driver: ESRI Shapefile 
Source: "/Users/psarnow/_code/colorado-dow/datasets/ne_10m_roads/ne_10m_roads.shp", layer: "ne_10m_roads"
with 56601 features
It has 29 fields
Integer64 fields read as strings:  scalerank question 

Take a peak at the boundary data

head(Unitboundaries2)
  id      long      lat order  hole piece group COUNTY DEERDAU ELKDAU ANTDAU MOOSEDAU BEARDAU LIONDAU SQ_MILES    ACRES SHAPE_area SHAPE_len Unit
1  1 -109.0486 40.83476     1 FALSE     1   1.1 MOFFAT     D-1   E-47   A-11     M-99    B-15     L-1 127.2227 81422.53  329506619  100751.7    1
2  1 -109.0472 40.83357     2 FALSE     1   1.1 MOFFAT     D-1   E-47   A-11     M-99    B-15     L-1 127.2227 81422.53  329506619  100751.7    1
3  1 -109.0460 40.83295     3 FALSE     1   1.1 MOFFAT     D-1   E-47   A-11     M-99    B-15     L-1 127.2227 81422.53  329506619  100751.7    1
4  1 -109.0449 40.83228     4 FALSE     1   1.1 MOFFAT     D-1   E-47   A-11     M-99    B-15     L-1 127.2227 81422.53  329506619  100751.7    1
5  1 -109.0438 40.83204     5 FALSE     1   1.1 MOFFAT     D-1   E-47   A-11     M-99    B-15     L-1 127.2227 81422.53  329506619  100751.7    1
6  1 -109.0423 40.83181     6 FALSE     1   1.1 MOFFAT     D-1   E-47   A-11     M-99    B-15     L-1 127.2227 81422.53  329506619  100751.7    1

Set to predictive analytics directory

setwd("~/_code/colorado-dow/phase III - predictive analytics")

Organize data

Will start by grouping all of the seasons together, and modeling the number of hunters per Year and Unit

Group Draw results data by Year and Unit

COElkDraw_Unit <- summarise(group_by(COElkDrawAll,Year,Unit,Season),
                       Quota = sum(Orig_Quota,na.rm = T),
                       Drawn = sum(Chcs_Drawn,na.rm = T))

Appropriate field classes for model training

COElkDraw_Unit$Year <- as.numeric(COElkDraw_Unit$Year)

Group Hunter data by Year and Unit

COElkHunters_Unit <- summarise(group_by(COElkRifleAll,Year,Unit,Season),
                          Hunters = sum(c(Hunters.Antlered,Hunters.Antlerless,Hunters.Either),na.rm = T))
COElkHunters_Unit$Year <- as.numeric(COElkHunters_Unit$Year)

Join in Hunter and Draw data together

COElkHunters_Unit <- left_join(COElkHunters_Unit, COElkDraw_Unit, by = c("Year","Unit","Season"))

Replace the draw data that don’t have entries with 0

COElkHunters_Unit$Drawn[is.na(COElkHunters_Unit$Drawn)] <- 0
COElkHunters_Unit$Quota[is.na(COElkHunters_Unit$Quota)] <- 0

Split into train and test sets. Will use 75% of the data to train on.

COElkHunters_Unit <- mutate(group_by(COElkHunters_Unit,Unit),
                       numentries = n())
COElkHunters_Unit <- filter(COElkHunters_Unit, numentries >= 3)
COElkHunters_Unit$UnitYearSeason <- paste(COElkHunters_Unit$Unit, COElkHunters_Unit$Year,COElkHunters_Unit$Season)
traindata2 <- COElkHunters_Unit %>% group_by(Unit) %>% sample_frac(size = .75, replace = F)
testdata2 <- COElkHunters_Unit[!COElkHunters_Unit$UnitYearSeason %in% traindata2$UnitYearSeason,]
COElkHunters_Unit <- select(COElkHunters_Unit, -UnitYearSeason, -numentries)
traindata2 <- select(traindata2, -UnitYearSeason, -numentries)
testdata2 <- select(testdata2, -UnitYearSeason, -numentries)

Save off for importing into AzureML

write.csv(COElkHunters_Unit,file = "~/_code/colorado-dow/datasets/COElkHunters_Unit.csv",row.names = F)

Data Visualization

notice that the number of hunters data is skewed.

ggplot(COElkHunters_Unit, aes(Hunters)) + 
  geom_density() +
  xlab("Hunters in Unit") +
  ylab("Number of Units") +
  theme(axis.text.y = element_blank()) +
  labs(title="Distribution of Hunters in each Unit", subtitle="2006-2017", caption="source: cpw.state.co.us")

A general rule of thumb to consider is that skewed data whose ratio of the highest value to the lowest value is greater than 20 have significant skewness. Also, the skewness statistic can be used as a diagnostic. If the predictor distribution is roughly symmetric, the skewness values will be close to zero. As the distribution becomes more right skewed, the skewness statistic becomes larger. Similarly, as the distribution becomes more left skewed, the value becomes negative. Replacing the data with the log, square root, or inverse may help to remove the skew.

Example of how BoxCox can redistribute the data

preProcValues2 <- preProcess(as.data.frame(traindata2), method = "BoxCox")
trainBC <- predict(preProcValues2, as.data.frame(traindata2))
ggplot(trainBC, aes(Hunters)) + 
  geom_density() +
  xlab("BoxCox Hunters in Unit") +
  ylab("Number of Units") +
  theme(axis.text.y = element_blank()) +
  labs(title="BoxCox Distribution of Hunters in each Unit", subtitle="2006-2017", caption="source: cpw.state.co.us")

caret has a preproccess function for correcting for skewness ‘BoxCox’, we will need to be sure to look at using this function in the training models.


Model Building

Model Training Methods

Loop through possible methods, utilizing the quicker ‘adaptive_cv’ parameter search from caret. Consider scripting this into AzureML to make it run much faster, though there is more setup and errors to control for

quickmethods <- c("lm",'svmLinear',"svmRadial","knn","cubist","kknn","glm.nb")
step1_all_Unit <- NULL
for (imethod in quickmethods) {
  step1_Unit <- NULL
  start <- now()
  
  # if (imethod == "lm") {
  #   controlmethod <- "repeatedcv"
  # } else {controlmethod <- "adaptive_cv"}
  controlmethod <- "repeatedcv"
  fitControl <- trainControl(
    method = controlmethod,
    # search = 'random',
    number = 4,
    repeats = 4,
    allowParallel = TRUE,
    summaryFunction = defaultSummary)
  
  registerDoSEQ()
  registerDoMC(cores = 6)
  
  HuntersModel_1_Unit = train(Hunters ~ ., data = traindata2,
                         method = imethod,
                         tuneLength = 15,
                         trControl = fitControl)
  
  HuntersModel_1_Unit
  
  # measure performance
  predictdata <- predict(HuntersModel_1_Unit, testdata2)
  
  step1_Unit$method <- imethod
  step1_Unit$RMSE <- postResample(pred = predictdata, obs = testdata2$Hunters)[1]
  step1_Unit$duration <- now() - start
  step1_Unit <- as.data.frame(step1_Unit)
  step1_all_Unit <- rbind(step1_all_Unit,step1_Unit)
}
model fit failed for Fold1.Rep1: link=sqrt Error : no valid set of coefficients has been found: please supply starting values
NaNs producedmodel fit failed for Fold1.Rep1: link=identity Error : no valid set of coefficients has been found: please supply starting values
model fit failed for Fold2.Rep1: link=sqrt Error : no valid set of coefficients has been found: please supply starting values
NaNs producedmodel fit failed for Fold2.Rep1: link=identity Error : no valid set of coefficients has been found: please supply starting values
model fit failed for Fold3.Rep1: link=sqrt Error : no valid set of coefficients has been found: please supply starting values
NaNs producedmodel fit failed for Fold3.Rep1: link=identity Error : no valid set of coefficients has been found: please supply starting values
model fit failed for Fold4.Rep1: link=sqrt Error : no valid set of coefficients has been found: please supply starting values
NaNs producedmodel fit failed for Fold4.Rep1: link=identity Error : no valid set of coefficients has been found: please supply starting values
model fit failed for Fold1.Rep2: link=sqrt Error : no valid set of coefficients has been found: please supply starting values
model fit failed for Fold2.Rep2: link=sqrt Error : no valid set of coefficients has been found: please supply starting values
NaNs producedmodel fit failed for Fold1.Rep2: link=identity Error : no valid set of coefficients has been found: please supply starting values
NaNs producedmodel fit failed for Fold2.Rep2: link=identity Error : no valid set of coefficients has been found: please supply starting values
model fit failed for Fold3.Rep2: link=sqrt Error : no valid set of coefficients has been found: please supply starting values
model fit failed for Fold4.Rep2: link=sqrt Error : no valid set of coefficients has been found: please supply starting values
NaNs producedNaNs producedmodel fit failed for Fold3.Rep2: link=identity Error : no valid set of coefficients has been found: please supply starting values
model fit failed for Fold4.Rep2: link=identity Error : no valid set of coefficients has been found: please supply starting values
model fit failed for Fold1.Rep3: link=sqrt Error : no valid set of coefficients has been found: please supply starting values
model fit failed for Fold2.Rep3: link=sqrt Error : no valid set of coefficients has been found: please supply starting values
NaNs producedmodel fit failed for Fold1.Rep3: link=identity Error : no valid set of coefficients has been found: please supply starting values
NaNs producedmodel fit failed for Fold2.Rep3: link=identity Error : no valid set of coefficients has been found: please supply starting values
model fit failed for Fold3.Rep3: link=sqrt Error : no valid set of coefficients has been found: please supply starting values
model fit failed for Fold4.Rep3: link=sqrt Error : no valid set of coefficients has been found: please supply starting values
NaNs producedmodel fit failed for Fold3.Rep3: link=identity Error : no valid set of coefficients has been found: please supply starting values
NaNs producedmodel fit failed for Fold4.Rep3: link=identity Error : no valid set of coefficients has been found: please supply starting values
model fit failed for Fold1.Rep4: link=sqrt Error : no valid set of coefficients has been found: please supply starting values
model fit failed for Fold2.Rep4: link=sqrt Error : no valid set of coefficients has been found: please supply starting values
NaNs producedmodel fit failed for Fold1.Rep4: link=identity Error : no valid set of coefficients has been found: please supply starting values
NaNs producedmodel fit failed for Fold2.Rep4: link=identity Error : no valid set of coefficients has been found: please supply starting values
model fit failed for Fold3.Rep4: link=sqrt Error : no valid set of coefficients has been found: please supply starting values
model fit failed for Fold4.Rep4: link=sqrt Error : no valid set of coefficients has been found: please supply starting values
NaNs producedmodel fit failed for Fold3.Rep4: link=identity Error : no valid set of coefficients has been found: please supply starting values
NaNs producedmodel fit failed for Fold4.Rep4: link=identity Error : no valid set of coefficients has been found: please supply starting values
There were missing values in resampled performance measures.missing values found in aggregated results

View Results, and compare to previous first models that did not expand to Seasons

step1_all
         method     RMSE       duration
RMSE         lm 165.0025  6.866598 secs
RMSE1 svmLinear 178.0942  4.950974 secs
RMSE2 svmRadial 134.0561 22.081016 secs
RMSE3       knn 733.8210 10.502875 secs
RMSE4    cubist 151.9572  3.057477 secs
RMSE5      kknn 132.0211  9.778734 secs
RMSE6    cubist 160.5464  1.027274 secs
step1_all_Unit
         method      RMSE       duration
RMSE         lm 153.24028  3.146424 secs
RMSE1 svmLinear 162.96937 31.758684 secs
RMSE2 svmRadial  60.62208  7.481067 secs
RMSE3       knn 224.88778  1.459118 secs
RMSE4    cubist  63.20635  2.664269 secs
RMSE5      kknn  57.59955  5.068010 secs
RMSE6    glm.nb 161.06055 18.607480 secs

The performance RMSE metric is certainly improved when including the Seasonal grouping. Lets take the best method, and see if it is visually reasonable while also charting without Seasons.

Predictions using best Season model

Build model

controlmethod <- "repeatedcv"
  fitControl <- trainControl(
    method = controlmethod,
    # search = 'random',
    number = 4,
    repeats = 4,
    allowParallel = TRUE,
    summaryFunction = defaultSummary)
  
registerDoSEQ()
registerDoMC(cores = 6)
  
HuntersModel_1_Unit = train(Hunters ~ ., data = COElkHunters_Unit,
                         method = "svmRadial",
                         # preProc = c("center","scale"), 
                         tuneLength = 15,
                         trControl = fitControl)
  
HuntersModel_1_Unit
Support Vector Machines with Radial Basis Function Kernel 

6022 samples
   5 predictor

No pre-processing
Resampling: Cross-Validated (4 fold, repeated 4 times) 
Summary of sample sizes: 4516, 4515, 4518, 4517, 4517, 4516, ... 
Resampling results across tuning parameters:

  C        RMSE       Rsquared   MAE     
     0.25  165.11678  0.7321893  99.83178
     0.50  153.00322  0.7587465  92.55356
     1.00  137.94672  0.8001821  82.60569
     2.00  118.53390  0.8520248  70.15132
     4.00   98.10570  0.8977195  58.23606
     8.00   81.44651  0.9276793  50.10515
    16.00   72.00511  0.9416747  45.61673
    32.00   67.82607  0.9475692  43.21396
    64.00   65.01697  0.9516157  41.54591
   128.00   62.98457  0.9545102  40.12867
   256.00   61.85285  0.9560942  39.32768
   512.00   61.35372  0.9568002  38.92707
  1024.00   61.64766  0.9564220  38.99960
  2048.00   62.23304  0.9556303  39.26794
  4096.00   63.32468  0.9540772  39.62725

Tuning parameter 'sigma' was held constant at a value of 0.003719777
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.003719777 and C = 512.

Predict Hunter for next year, 2018

# Get list of Units and Seasons that will have data
COElkHunters2018_Unit <- COElkHunters_Unit
COElkHunters2018_Unit$Unit_Season <- paste(COElkHunters2018_Unit$Unit,COElkHunters2018_Unit$Season)
COElkHunters2018_Unit <- as.data.frame(unique(COElkHunters2018_Unit$Unit_Season))
colnames(COElkHunters2018_Unit) <- "Unit_Season"
# Fill in missing Units and Seasons per unique Unit_Seasons
COElkHunters2018_Unit$Unit <- str_extract(COElkHunters2018_Unit$Unit_Season,"[:alnum:]+(?=[:blank:])")
COElkHunters2018_Unit$Season <- str_extract(COElkHunters2018_Unit$Unit_Season,"(?<=[:blank:])[:alnum:]+")
COElkHunters2018_Unit <- select(COElkHunters2018_Unit, -Unit_Season)
COElkHunters2018_Unit$Year <- 2018
# Draw data for 2018
COElkDraw_Unit2018 <- filter(COElkDraw_Unit,Year==2018)
# A left join will autofill missing draw data with NAs, but will retain the full list of Unit Seasons
COElkHunters2018_Unit <- left_join(COElkHunters2018_Unit,COElkDraw_Unit2018)
Joining, by = c("Unit", "Season", "Year")
# Replace the draw data that don't have entries with 0
COElkHunters2018_Unit$Drawn[is.na(COElkHunters2018_Unit$Drawn)] <- 0
COElkHunters2018_Unit$Quota[is.na(COElkHunters2018_Unit$Quota)] <- 0
# Only use the fields that were included in the model
COElkHunters2018_Unit <- COElkHunters2018_Unit[, colnames(COElkHunters2018_Unit) %in% c("Unit","Season",HuntersModel_1_Unit$coefnames)]
# Use trained model to predict Hunters
COElkHunters2018_Unit$Hunters <- round(predict(HuntersModel_1_Unit, COElkHunters2018_Unit))
COElkHunters2018_Unit$Hunters[COElkHunters2018_Unit$Hunters<0] <- 0

Label and Join models together for comparisons

# Load first model without Seasons
load("~/_code/colorado-dow/datasets/COElkHunters2018.RData")
Hunterscompare <- rbind.fill(COElkHunters,COElkHunters2018)
Hunterscompare$modeldata <- "Without Seasons"
Hunterscompare_Season <- rbind.fill(COElkHunters_Unit,COElkHunters2018_Unit)
Hunterscompare_Season <- summarise(group_by(Hunterscompare_Season,Year,Unit),
                                   Hunters = sum(Hunters))
Hunterscompare_Season$modeldata <- "Seasons"
Hunterscompare <- rbind.fill(Hunterscompare,Hunterscompare_Season)
# Group Units
HunterscompareStatewide <- summarise(group_by(Hunterscompare,Year,modeldata),
                                   Hunters = sum(Hunters))
ggplot(HunterscompareStatewide, aes(Year,Hunters,group=modeldata,fill=modeldata)) +
  geom_bar(position="dodge",stat="identity") +
  coord_cartesian(ylim = c(130000,155000)) +
  scale_fill_hc() +
  labs(title="Statewide Elk Hunters", caption="source: cpw.state.co.us")

Hunters Statewide by Season and Year

Hunters_Season <- rbind.fill(COElkHunters_Unit,COElkHunters2018_Unit)
ggplot(Hunters_Season, aes(Year,Hunters,group=Season,fill=Season)) +
  geom_bar(position="dodge",stat="identity") +
  # coord_cartesian(ylim = c(130000,155000)) +
  scale_fill_hc() +
  labs(title="Statewide Elk Hunters", caption="source: cpw.state.co.us")

Hunters in Unit 77

Hunters_Season_77 <- filter(Hunters_Season, Unit == "77")
ggplot(Hunters_Season_77, aes(Year,Hunters,group=Season,fill=Season)) +
  geom_bar(position="dodge",stat="identity") +
  # coord_cartesian(ylim = c(130000,155000)) +
  scale_fill_hc() +
  labs(title="Statewide Elk Hunters", caption="source: cpw.state.co.us")

Compare to the Draw Data (Quota, Drawn)

Data_Season_77 <- gather(Hunters_Season_77,"Field",Amount,Hunters,Quota,Drawn)
ggplot(Data_Season_77, aes(Season,Amount,fill=Field,group=Field)) +
  # geom_point() +
  # geom_line() +
  facet_grid(. ~ Year) +
  geom_bar(position="dodge",stat="identity") +
  # coord_cartesian(ylim = c(130000,155000)) +
  scale_fill_hc() +
  labs(title="Statewide Elk Hunters", caption="source: cpw.state.co.us")

ggplot(Data_Season_77, aes(Year,Amount,color=Field,group=Field)) +
  # geom_point() +
  geom_line() +
  facet_grid(Season ~ .,scales = 'free') +
  # geom_bar(position="dodge",stat="identity") +
  # coord_cartesian(ylim = c(130000,155000)) +
  scale_color_hc() +
  labs(title="Statewide Elk Hunters", caption="source: cpw.state.co.us")

Keep going by creating models for hunters per year, unit and season?

I believe the results appear reasonable.

We can further identify better models and tune them. But right now we will proceed with the last trained model…. revisiting after the project is more complete.

Save off so we don’t have to recreate the model everytime we want the results

FinalHuntersSeasonmodel <- HuntersModel_1_Unit
save(FinalHuntersSeasonmodel, file = "~/_code/colorado-dow/datasets/FinalHuntersSeasonmodel.RData")
COElkHunters2018_Season <- COElkHunters2018_Unit
save(COElkHunters2018_Season,file="~/_code/colorado-dow/datasets/COElkHunters2018_Season.RData")
---
title: "Predict Number of Future Elk Hunters per Unit Season"
author: "Pierre Sarnow"
output:
  html_notebook:
    toc: yes
    toc_float: false
    toc_depth: 6
    theme: yeti
    hightlight: default
    code_folding: none
---


***
## Description
Use historical draw results, and number of hunters to train a model we can use to 
predict the number of hunters in future years. I would like to compare this to the results
we generated by grouping the seasons together. 


*__NOTICE__ that I am only looking at the general rifle hunting seasons on public land. There are also 
hunters for Archery, Muzzleloader, Private Land, Ranching for Wildlife, etc.*

***
## Setup
Load required libraries for wrangling data, charting, and mapping
```{r}
library(plyr,quietly = T) # data wrangling
library(dplyr,quietly = T) # data wrangling
library(ggplot2, quietly = T) # charting
library(ggthemes,quietly = T) # so I can add the highcharts theme and palette
library(scales,quietly = T) # to load the percent function when labeling plots
library(caret,quietly = T) # classification and regression training
library(foreach,quietly = T) # parallel processing to speed up the model training
library(doMC,quietly = T) # parallel processing to speed up the model training
library(lubridate,quietly = T) # for timing models
```

Set our preferred charting theme
```{r}
theme_set(theme_minimal()+theme_hc()+theme(legend.key.width = unit(1.5, "cm")))
``` 

Run script to get hunter data
```{r}
source('~/_code/colorado-dow/datasets/Colorado Elk Harvest Data.R', echo=F)
```

Table of the harvest data
```{r}
head(COElkRifleAll)
```


Run script to get draw data
```{r}
source('~/_code/colorado-dow/datasets/Elk Drawing Summaries.R', echo=F)
```

Table of the data
```{r}
head(COElkDrawAll)
```

source geodata
```{r}
source('~/_code/colorado-dow/datasets/Colorado GMUnit and Road data.R', echo=F)
```

Take a peak at the boundary data
```{r}
head(Unitboundaries2)
```

Set to predictive analytics directory
```{r}
setwd("~/_code/colorado-dow/phase III - predictive analytics")
```

### Organize data
Will start by grouping all of the seasons together, and modeling the number of hunters per Year and Unit

Group Draw results data by Year and Unit
```{r}
COElkDraw_Unit <- summarise(group_by(COElkDrawAll,Year,Unit,Season),
                       Quota = sum(Orig_Quota,na.rm = T),
                       Drawn = sum(Chcs_Drawn,na.rm = T))
```

Appropriate field classes for model training
```{r}
COElkDraw_Unit$Year <- as.numeric(COElkDraw_Unit$Year)
```

Group Hunter data by Year and Unit
```{r}
COElkHunters_Unit <- summarise(group_by(COElkRifleAll,Year,Unit,Season),
                          Hunters = sum(c(Hunters.Antlered,Hunters.Antlerless,Hunters.Either),na.rm = T))

COElkHunters_Unit$Year <- as.numeric(COElkHunters_Unit$Year)
```

Join in Hunter and Draw data together
```{r}
COElkHunters_Unit <- left_join(COElkHunters_Unit, COElkDraw_Unit, by = c("Year","Unit","Season"))
```

Replace the draw data that don't have entries with 0
```{r}
COElkHunters_Unit$Drawn[is.na(COElkHunters_Unit$Drawn)] <- 0
COElkHunters_Unit$Quota[is.na(COElkHunters_Unit$Quota)] <- 0
```

Split into train and test sets. Will use 75% of the data to train on. 

```{r}
COElkHunters_Unit <- mutate(group_by(COElkHunters_Unit,Unit),
                       numentries = n())
COElkHunters_Unit <- filter(COElkHunters_Unit, numentries >= 3)
COElkHunters_Unit$UnitYearSeason <- paste(COElkHunters_Unit$Unit, COElkHunters_Unit$Year,COElkHunters_Unit$Season)

traindata2 <- COElkHunters_Unit %>% group_by(Unit) %>% sample_frac(size = .75, replace = F)
testdata2 <- COElkHunters_Unit[!COElkHunters_Unit$UnitYearSeason %in% traindata2$UnitYearSeason,]

COElkHunters_Unit <- select(COElkHunters_Unit, -UnitYearSeason, -numentries)

traindata2 <- select(traindata2, -UnitYearSeason, -numentries)
testdata2 <- select(testdata2, -UnitYearSeason, -numentries)
```

Save off for importing into AzureML
```{r}
write.csv(COElkHunters_Unit,file = "~/_code/colorado-dow/datasets/COElkHunters_Unit.csv",row.names = F)
```
### Data Visualization
notice that the number of hunters data is skewed.
```{r fig.width=10}
ggplot(COElkHunters_Unit, aes(Hunters)) + 
  geom_density() +
  xlab("Hunters in Unit") +
  ylab("Number of Units") +
  theme(axis.text.y = element_blank()) +
  labs(title="Distribution of Hunters in each Unit", subtitle="2006-2017", caption="source: cpw.state.co.us")

```


A general rule of thumb to consider is that skewed data whose ratio of the highest value to the 
lowest value is greater than 20 have significant skewness. Also, the skewness statistic can be 
used as a diagnostic. If the predictor distribution is roughly symmetric, the skewness values 
will be close to zero. As the distribution becomes more right skewed, the skewness statistic 
becomes larger. Similarly, as the distribution becomes more left skewed, the value becomes negative.
Replacing the data with the log, square root, or inverse may help to remove the skew.

Example of how BoxCox can redistribute the data
```{r}
preProcValues2 <- preProcess(as.data.frame(traindata2), method = "BoxCox")
trainBC <- predict(preProcValues2, as.data.frame(traindata2))
```

```{r fig.width=10}
ggplot(trainBC, aes(Hunters)) + 
  geom_density() +
  xlab("BoxCox Hunters in Unit") +
  ylab("Number of Units") +
  theme(axis.text.y = element_blank()) +
  labs(title="BoxCox Distribution of Hunters in each Unit", subtitle="2006-2017", caption="source: cpw.state.co.us")
```
caret has a preproccess function for correcting for skewness 'BoxCox', we will need to be sure to
look at using this function in the training models.

***
## Model Building

### Model Training Methods
Loop through possible methods, utilizing the quicker 'adaptive_cv' parameter search from caret.
Consider scripting this into AzureML to make it run much faster, though there is more setup and errors to 
control for

```{r}
quickmethods <- c("lm",'svmLinear',"svmRadial","knn","cubist","kknn","glm.nb")

step1_all_Unit <- NULL
for (imethod in quickmethods) {
  step1_Unit <- NULL
  start <- now()
  
  # if (imethod == "lm") {
  #   controlmethod <- "repeatedcv"
  # } else {controlmethod <- "adaptive_cv"}
  controlmethod <- "repeatedcv"
  fitControl <- trainControl(
    method = controlmethod,
    # search = 'random',
    number = 4,
    repeats = 4,
    allowParallel = TRUE,
    summaryFunction = defaultSummary)
  
  registerDoSEQ()
  registerDoMC(cores = 6)
  
  HuntersModel_1_Unit = train(Hunters ~ ., data = traindata2,
                         method = imethod,
                         tuneLength = 15,
                         trControl = fitControl)
  
  HuntersModel_1_Unit
  
  # measure performance
  predictdata <- predict(HuntersModel_1_Unit, testdata2)
  
  step1_Unit$method <- imethod
  step1_Unit$RMSE <- postResample(pred = predictdata, obs = testdata2$Hunters)[1]
  step1_Unit$duration <- now() - start
  step1_Unit <- as.data.frame(step1_Unit)
  step1_all_Unit <- rbind(step1_all_Unit,step1_Unit)
}
```
View Results, and compare to previous first models that did not expand to Seasons
```{r}
step1_all
step1_all_Unit
```
The performance RMSE metric is certainly improved when including the Seasonal grouping.
Lets take the best method, and see if it is visually reasonable while also charting without Seasons.

### Predictions using best Season model
Build model
```{r}
controlmethod <- "repeatedcv"
  fitControl <- trainControl(
    method = controlmethod,
    # search = 'random',
    number = 4,
    repeats = 4,
    allowParallel = TRUE,
    summaryFunction = defaultSummary)
  
registerDoSEQ()
registerDoMC(cores = 6)
  
HuntersModel_1_Unit = train(Hunters ~ ., data = COElkHunters_Unit,
                         method = "svmRadial",
                         # preProc = c("center","scale"), 
                         tuneLength = 15,
                         trControl = fitControl)
  
HuntersModel_1_Unit
```
Predict Hunter for next year, 2018
```{r}
# Get list of Units and Seasons that will have data
COElkHunters2018_Unit <- COElkHunters_Unit
COElkHunters2018_Unit$Unit_Season <- paste(COElkHunters2018_Unit$Unit,COElkHunters2018_Unit$Season)
COElkHunters2018_Unit <- as.data.frame(unique(COElkHunters2018_Unit$Unit_Season))
colnames(COElkHunters2018_Unit) <- "Unit_Season"
# Fill in missing Units and Seasons per unique Unit_Seasons
COElkHunters2018_Unit$Unit <- str_extract(COElkHunters2018_Unit$Unit_Season,"[:alnum:]+(?=[:blank:])")
COElkHunters2018_Unit$Season <- str_extract(COElkHunters2018_Unit$Unit_Season,"(?<=[:blank:])[:alnum:]+")
COElkHunters2018_Unit <- select(COElkHunters2018_Unit, -Unit_Season)
COElkHunters2018_Unit$Year <- 2018
# Draw data for 2018
COElkDraw_Unit2018 <- filter(COElkDraw_Unit,Year==2018)

# A left join will autofill missing draw data with NAs, but will retain the full list of Unit Seasons
COElkHunters2018_Unit <- left_join(COElkHunters2018_Unit,COElkDraw_Unit2018)

# Replace the draw data that don't have entries with 0
COElkHunters2018_Unit$Drawn[is.na(COElkHunters2018_Unit$Drawn)] <- 0
COElkHunters2018_Unit$Quota[is.na(COElkHunters2018_Unit$Quota)] <- 0

# Only use the fields that were included in the model
COElkHunters2018_Unit <- COElkHunters2018_Unit[, colnames(COElkHunters2018_Unit) %in% c("Unit","Season",HuntersModel_1_Unit$coefnames)]
# Use trained model to predict Hunters
COElkHunters2018_Unit$Hunters <- round(predict(HuntersModel_1_Unit, COElkHunters2018_Unit))

COElkHunters2018_Unit$Hunters[COElkHunters2018_Unit$Hunters<0] <- 0
```

Label and Join models together for comparisons
```{r}
# Load first model without Seasons
load("~/_code/colorado-dow/datasets/COElkHunters2018.RData")
Hunterscompare <- rbind.fill(COElkHunters,COElkHunters2018)
Hunterscompare$modeldata <- "Without Seasons"

Hunterscompare_Season <- rbind.fill(COElkHunters_Unit,COElkHunters2018_Unit)
Hunterscompare_Season <- summarise(group_by(Hunterscompare_Season,Year,Unit),
                                   Hunters = sum(Hunters))
Hunterscompare_Season$modeldata <- "Seasons"

Hunterscompare <- rbind.fill(Hunterscompare,Hunterscompare_Season)

```
```{r}
# Group Units
HunterscompareStatewide <- summarise(group_by(Hunterscompare,Year,modeldata),
                                   Hunters = sum(Hunters))
```

```{r fig.width=10}
ggplot(HunterscompareStatewide, aes(Year,Hunters,group=modeldata,fill=modeldata)) +
  geom_bar(position="dodge",stat="identity") +
  coord_cartesian(ylim = c(130000,155000)) +
  scale_fill_hc() +
  labs(title="Statewide Elk Hunters", caption="source: cpw.state.co.us")
```


#### Hunters Statewide by Season and Year
```{r fig.width=10}
Hunters_Season <- rbind.fill(COElkHunters_Unit,COElkHunters2018_Unit)
ggplot(Hunters_Season, aes(Year,Hunters,group=Season,fill=Season)) +
  geom_bar(position="dodge",stat="identity") +
  # coord_cartesian(ylim = c(130000,155000)) +
  scale_fill_hc() +
  labs(title="Statewide Elk Hunters", caption="source: cpw.state.co.us")
```

#### Hunters in Unit 77
```{r fig.width=10}
Hunters_Season_77 <- filter(Hunters_Season, Unit == "77")
ggplot(Hunters_Season_77, aes(Year,Hunters,group=Season,fill=Season)) +
  geom_bar(position="dodge",stat="identity") +
  # coord_cartesian(ylim = c(130000,155000)) +
  scale_fill_hc() +
  labs(title="Statewide Elk Hunters", caption="source: cpw.state.co.us")
```


### Compare to the Draw Data (Quota, Drawn)
```{r fig.width=10}
Data_Season_77 <- gather(Hunters_Season_77,"Field",Amount,Hunters,Quota,Drawn)
ggplot(Data_Season_77, aes(Season,Amount,fill=Field,group=Field)) +
  # geom_point() +
  # geom_line() +
  facet_grid(. ~ Year) +
  geom_bar(position="dodge",stat="identity") +
  # coord_cartesian(ylim = c(130000,155000)) +
  scale_fill_hc() +
  labs(title="Statewide Elk Hunters", caption="source: cpw.state.co.us")
```
```{r fig.width=10}
ggplot(Data_Season_77, aes(Year,Amount,color=Field,group=Field)) +
  # geom_point() +
  geom_line() +
  facet_grid(Season ~ .,scales = 'free') +
  # geom_bar(position="dodge",stat="identity") +
  # coord_cartesian(ylim = c(130000,155000)) +
  scale_color_hc() +
  labs(title="Statewide Elk Hunters", caption="source: cpw.state.co.us")
```


## Keep going by creating models for hunters per year, unit and season?
I believe the results appear reasonable.

We can further identify better models and tune them.
But right now we will proceed with the last trained model.... revisiting after the
project is more complete.

Save off so we don't have to recreate the model everytime we want the results
```{r}
FinalHuntersSeasonmodel <- HuntersModel_1_Unit
```
```{r}
save(FinalHuntersSeasonmodel, file = "~/_code/colorado-dow/datasets/FinalHuntersSeasonmodel.RData")
```

```{r}
COElkHunters2018_Season <- COElkHunters2018_Unit
save(COElkHunters2018_Season,file="~/_code/colorado-dow/datasets/COElkHunters2018_Season.RData")
```


