This case study aims to predict the daily number of visitors to an attraction park in Germany. The park is located on the border between two counties and therefore benefits from the different bank holidays and school holidays of both counties.
This type of analysis can have important operational and business impact:
- forecast periods/days with higher attendance, to determine when to hire more temporary personnel, plan for more catering capacity, …
- identify low-attendance periods in order to design alternative offerings/entertainment that push attendance
- plan maintenance on low frequentation days
- forecast revenue better
- optimize pricing policy
The task is to predict the column called ‘label’ for the test set. We measure the prediction error with the RMSE (root mean-squared error) metric; given a baseline value of about 500, we will look for models with the lowest RMSE.
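For reference, the RMSE we report throughout is the quantity sketched below; caret computes the same metric through its RMSE() and postResample() helpers.
#minimal RMSE helper (illustrative sketch only)
rmse <- function(predicted, observed) sqrt(mean((predicted - observed)^2))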
All the code for this project is shown inline and is reproducible; the reader only needs to place the source files in the same folder as the Rmd or R file.
SUMMARY FINDINGS
You can read a more complete summary in our conclusions at the end of the document.
After trying several algorithms, we retained random forests as the one yielding predictions with the lowest RMSE (303.17 in-sample RMSE on full train set).
We also present a strategy to improve the predictions (more data pre-processing, changing variables, tweaking models further, trying other models, …).
As we wanted to stay within the recommended time frame of about 10 hours for this project, we retained and presented in this document the approaches giving the best RMSE.
LOADING LIBRARIES
We start by loading some of the libraries we’ll be needing for the analysis:
- tidyverse (dplyr, ggplot2, … for data wrangling, plots, …)
- scales, gridExtra: some additional formatting options for ggplot
- lubridate (to handle date formats)
- caret (create train/calibration datasets, run/test algorithms and get accuracy metrics, …)
- RWeka, rpart, ranger, glmnet (for different models, all integrated within caret though)
- neuralnet: simple implementation of neural networks
- psych: for pairs.panel function (diagnosing correlation between variables)
- doSNOW: package to run algorithms on multiple logical cores of your machine, useful to speed up random forests or other resource-intensive algorithms. We will decide during the project whether to use it (see the sketch below).
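If we do enable it, registering the parallel back-end would look roughly like the sketch below (commented out since we have not decided yet; the cluster size of 3 is an assumption to adapt to your machine):
#cl <- makeCluster(3, type = "SOCK") #create a socket cluster of 3 workers
#registerDoSNOW(cl)                  #caret::train picks up the registered back-end
#stopCluster(cl)                     #release the workers when done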
library(knitr); library(tidyverse); library(scales); library(gridExtra); library(lubridate); library(RWeka); library(caret); library(rpart); library(ranger); library(glmnet); library(psych); library(doSNOW); library(neuralnet)
For this task, we have 3 datasets at our disposal.
Loading the data:
train <- read.csv("data_train.csv", stringsAsFactors = FALSE)
test <- read.csv("data_test.csv", stringsAsFactors = FALSE)
weather <- read.csv("weather.csv", stringsAsFactors = FALSE)
#convert date variable, from character to date format
train$date <- as.POSIXct(train$date)
test$date <- as.POSIXct(test$date)
weather$date <- as.POSIXct(weather$date)
#print summary for train.csv
summary(train)
## date bank_holiday feature_0
## Min. :2005-03-20 00:00:00 Min. :0.00000 Min. :0.00000
## 1st Qu.:2006-05-30 12:00:00 1st Qu.:0.00000 1st Qu.:0.00000
## Median :2007-08-10 00:00:00 Median :0.00000 Median :0.00000
## Mean :2007-08-10 02:57:37 Mean :0.07688 Mean :0.06598
## 3rd Qu.:2008-10-19 12:00:00 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :2009-12-31 00:00:00 Max. :3.00000 Max. :1.00000
## feature_1 feature_2 feature_3 feature_4
## Min. :0.00000 Min. :0.00000 Min. :3.200 Min. :6.700
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:3.200 1st Qu.:6.700
## Median :0.00000 Median :0.00000 Median :4.300 Median :7.300
## Mean :0.05278 Mean :0.04073 Mean :3.856 Mean :7.077
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:4.300 3rd Qu.:7.300
## Max. :1.00000 Max. :1.00000 Max. :4.600 Max. :7.600
## feature_5 feature_6 feature_7 school_holiday
## Min. :1.700 Min. :3.700 Min. :0.00000 Min. :0.0000
## 1st Qu.:1.700 1st Qu.:3.700 1st Qu.:0.00000 1st Qu.:0.0000
## Median :2.300 Median :4.300 Median :0.00000 Median :0.0000
## Mean :2.077 Mean :4.077 Mean :0.04303 Mean :0.7321
## 3rd Qu.:2.300 3rd Qu.:4.300 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :2.600 Max. :4.600 Max. :1.00000 Max. :3.0000
## feature_8 feature_9 feature_10 label
## Min. :0 Min. :0 Min. :0.00000 Min. : 142
## 1st Qu.:0 1st Qu.:0 1st Qu.:0.00000 1st Qu.: 623
## Median :0 Median :0 Median :0.00000 Median : 930
## Mean :0 Mean :0 Mean :0.05852 Mean :1119
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0.00000 3rd Qu.:1532
## Max. :0 Max. :0 Max. :1.00000 Max. :3761
DATASET 1: data_train.csv - 1743 observations (about 82.76% of total training+test cases) and 15 variables:
- date: almost one observation per day (4 days are missing), time range = 20 March 2005 to 31 December 2009
- bank_holiday: 0 = no holiday, 1 = bank holiday for county 1, 2 = bank holiday for county 2, 3 = bank holiday for both counties
- school_holiday: 0 = no holiday, 1 = school holiday for county 1, 2 = school holiday for county 2, 3 = school holiday for both counties
- feature_0 to feature_10: 11 dummy-coded variables. We notice that feature_8 and feature_9 are just 0s.
- label: the number of visitors on each specific day, ranging from 142 to 3761 visitors.
There are no missing values in this dataset.
DATASET 2: data_test.csv - 363 observations and 15 variables (about 17.24% of total training+test cases):
We do not print the summary here as it holds exactly the same structure as the training set, with a couple of comments:
- date: time = 01 January 2010 to 31 December 2010
- label: variable holds only NA values, since it is the variable we need to predict (number of visitors). Other than that, there are no missing values in the dataset.
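A quick check confirming these statements (an optional sketch, not needed for the rest of the analysis):
#count missing values per column in the train and test sets
colSums(is.na(train)); colSums(is.na(test))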
summary(weather)
## date air_humidity air_temperature_daily_max
## Min. :2005-03-20 00:00:00 Min. : 38.00 Min. :-9.90
## 1st Qu.:2006-08-29 06:00:00 1st Qu.: 72.00 1st Qu.: 8.60
## Median :2008-02-08 12:00:00 Median : 81.00 Median :15.20
## Mean :2008-02-08 02:21:45 Mean : 79.22 Mean :14.77
## 3rd Qu.:2009-07-19 18:00:00 3rd Qu.: 89.00 3rd Qu.:20.93
## Max. :2010-12-31 00:00:00 Max. :100.00 Max. :36.90
## NA's :38 NA's :42
## air_temperature_daily_mean air_temperature_daily_min precipitation
## Min. :-12.10 Min. :-14.700 Min. : 0.000
## 1st Qu.: 5.80 1st Qu.: 2.600 1st Qu.: 0.000
## Median : 11.10 Median : 6.800 Median : 0.200
## Mean : 10.58 Mean : 6.463 Mean : 2.407
## 3rd Qu.: 15.80 3rd Qu.: 11.000 3rd Qu.: 2.800
## Max. : 28.00 Max. : 20.000 Max. :131.100
## NA's :33 NA's :43
## snow_height sunshine_hours wind_speed_max
## Min. : 0.0000 Min. : 0.000 Min. : 2.70
## 1st Qu.: 0.0000 1st Qu.: 0.400 1st Qu.: 7.70
## Median : 0.0000 Median : 3.300 Median : 9.80
## Mean : 0.5522 Mean : 4.481 Mean :10.32
## 3rd Qu.: 0.0000 3rd Qu.: 7.400 3rd Qu.:12.40
## Max. :28.0000 Max. :16.100 Max. :35.20
## NA's :41 NA's :43
DATASET 3: weather.csv - 2106 observations and 9 variables:
- date: covers the time range of the training and test sets
- 8 weather variables: humidity, maximal temperature, mean temperature, minimal temperature, precipitation, snow height, sunshine hours and wind speed. We notice missing values for 6 of these variables and we may need to come up with a strategy for that.
The data (train and test) is quite clean, without obvious outliers or missing values.
The weather dataset does contain missing values, which we deal with now.
Do people visit the attraction park more on a sunny day? A warm day? A dry day? These types of variables can be useful to explain patterns in attendance and although we may not use continuous variables as they are, we should start by inspecting missing values.
Question 1: In which years do the missing values occur?
#format a dataframe to store the missing values per year
NA_year_table <- data.frame(Year = c(2005:2010), NAs = rep(0, 6))
#loop over each year and count how many NAs we have
for(i in c(1:6)){
a <- sum(is.na(weather[year(weather$date) == 2004+i,]))
NA_year_table$NAs[i] <- a
}
kable(NA_year_table, align = "cc") #print table
| Year | NAs |
|---|---|
| 2005 | 50 |
| 2006 | 4 |
| 2007 | 0 |
| 2008 | 5 |
| 2009 | 0 |
| 2010 | 181 |
About 75% of the missing values occur in 2010, which is our test set period; we need to see whether we can reasonably impute them so that we end up with better predictions.
Question 2: how many consecutive NAs do we have per variable?
#which variables have NAs
weatherNAvars <- names(weather)[apply(weather, 2, function(x) sum(is.na(x))) != 0]
#initialize a list to store NA data per variable
consecutive_NAs_list <- list()
#for loop to count occurrences of consecutive NAs
for(i in weatherNAvars){
lrna <- rle(is.na(weather[,i]))
consecutive_NAs_list[[i]] <- sort(lrna$lengths[lrna$values], decreasing = TRUE)
}
#print the list
consecutive_NAs_list
## $air_humidity
## [1] 31 5 2
##
## $air_temperature_daily_max
## [1] 30 7 1 1 1 1 1
##
## $air_temperature_daily_mean
## [1] 30 2 1
##
## $air_temperature_daily_min
## [1] 30 7 1 1 1 1 1 1
##
## $sunshine_hours
## [1] 30 6 1 1 1 1 1
##
## $wind_speed_max
## [1] 30 7 1 1 1 1 1 1
We find a full month of consecutive missing data: the longest runs across the 6 variables add up to 181, which coincides with the number of missing values in 2010, our test set.
Runs of 7 or fewer consecutive missing days should not be a major problem; we deal with them a bit later.
Question 3: in which month do these 30+ missing values occur?
#which row indices have missing air_humidity values?
which(is.na(weather$air_humidity))
## [1] 154 155 156 157 158 1226 1227 2076 2077 2078 2079 2080 2081 2082
## [15] 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096
## [29] 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106
#the output above shows that rows 2076 to 2106 are NAs; to which date range do they correspond?
range(weather$date[2076:2106])
## [1] "2010-11-30 CET" "2010-12-31 CET"
#since similar quantities of NAs show up in each of the six variables, let's confirm that they occur on the same row indices
apply(weather[2076:2106,weatherNAvars], 2, function(x) sum(is.na(x)))
## air_humidity air_temperature_daily_max
## 31 30
## air_temperature_daily_mean air_temperature_daily_min
## 30 30
## sunshine_hours wind_speed_max
## 30 30
As seen above, the longest consecutive run of missing values, 30 days (31 for the humidity variable), occurs in all variables at the same time: the last month of the test set, from 1 December to 31 December 2010 (the humidity run starts one day earlier, on 30 November). This poses a problem, as we cannot infer this data in a failsafe way.
Some of the algorithms we will use handle NAs well, others do not, so we need to decide on a strategy.
Ignoring or not ignoring the variables - we have 2 options here:
- we conclude that the variables affected by NAs don't hold strong predictive power and we simply ignore them,
- or, if these variables turn out to be important predictors, we impute the missing values, at the risk of a last month of predictions that depends on the choice of imputation method.
Thoughts about the variables with missing values:
- air humidity: intuition suggests that humidity partly overlaps with precipitation or temperature (a confounding variable?), and park visitors do not check humidity before going out, only whether it rains or not.
- daily temperature (min, max, mean): the three are obviously correlated for most observations. The minimum temperature occurs in the middle of the night and shouldn't influence the decision to visit the park, so the maximum temperature seems the most useful metric. But do we need a continuous measurement here? Does a few degrees' difference affect visitors' decisions? Or would that trend be captured by the season or month of the year, which we will include anyway?
- sunshine hours: this can have a stronger impact, as visitors may well prefer a sunny day to visit an outdoor park.
- wind speed: the maximal value of 35.2 km/h is not very high, so the impact on the outcome variable may not be that great.
Intuition is one thing, but only testing the impact of each variable can confirm or refute our hunches; this is why we start by imputing the missing values.
Imputing strategy:
- we will derive a weather_imp dataframe with imputed values, from the weather dataframe
- two popular imputation methods are filling with the mean or median value, or linear interpolation. Using the mean would be counter-productive, as it would not capture season-driven daily weather variations. Linear interpolation will not work on the last missing month of data, as we do not know the closing value (31 December 2010).
AS A RESULT:
- For seven or fewer consecutive missing values: we use linear interpolation.
- For the 30+ missing data points at the end of the dataset, we mix both methods: we compute the average day-to-day variations (based on the mean of same-calendar-day values across the years) and start the imputed series at the last known value of the dataset.
In short: imputed value for 1 December 2010 = (known value on 30 November 2010) + (mean across years of 1 December - mean across years of 30 November), and so on …
First Step - Imputing last month of missing data
We initialize first a copy of the weather dataset to store the imputed data: weather_imp
Then, we create a function that performs our imputing method:
#create weather_imp object, an exact copy of the weather dataset
weather_imp <- weather
#create function to impute missing values for the last set of missing values, we just need to enter the index of the column to impute data into
imputeNA <- function(col1){
#while loop to identify row index of first missing value starting backwards from the last row
first_row_NA <- 2106
while (is.na(weather[first_row_NA,col1])) {first_row_NA <- first_row_NA-1}
first_row_NA <- first_row_NA+1
#initialize empty numeric vector to store imputed values
imputedValues <- 0
#for loop: store the mean of each specific calendar day across all years (2005:2010); we start one row before the first missing value to retain it as a baseline
for(i in c((first_row_NA-1):2106)){
imputedValues[i-first_row_NA+2] <- mean(weather[yday(weather$date) == yday(weather[i,1]),col1], na.rm = TRUE)
}
#measure differences between each day and its preceding day
diff_imputedValues <- diff(imputedValues)
#re-initialize imputedValues vector
imputedValues <- 0
#calculate first value, which is last known observed value + the first difference value
imputedValues[1] <- weather[first_row_NA-1,col1] + diff_imputedValues[1]
#the rest of the calculation will take last imputed value + computed difference
for(j in c(2:length(diff_imputedValues))){
imputedValues[j] <- imputedValues[j-1] + diff_imputedValues[j]
}
#merge the results into the dataframe, weather_imp
weather_imp[c(first_row_NA:2106), col1] <<- imputedValues
}
#run the function over the columns with missing values
for(k in c(2,3,4,5,8,9)){
imputeNA(k)
}
Second Step - Linearly imputing the rest of the missing values
#we exclude the first column that contains the date, so that it retains the original date format
weather_imp[-1] <- data.frame(lapply(weather_imp[-1], function(x) approxfun(seq_along(x), x)(seq_along(x))))
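To illustrate what the approxfun call above does, here is a toy sketch (not part of the pipeline): interior NAs are replaced by values lying on the straight line between the surrounding known points.
x <- c(10, NA, NA, 16, 18)
approxfun(seq_along(x), x)(seq_along(x)) #should return 10 12 14 16 18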
Housekeeping tasks: verify that no NAs are left in weather_imp, rename the variables to shorter names, and remove objects we no longer need:
#verify that there are no NAs left
sum(is.na(unlist(weather_imp)))
## [1] 0
#rename columns
names(weather_imp) <- c("date", "humidity", "temp_max", "temp_mean", "temp_min", "precipitation", "snow", "sunshine", "wind")
#delete unnecessary variables
rm(list = setdiff(ls(), c("test", "train", "weather", "weather_imp")))
Our weather_imp dataframe no longer has missing values. The imputation method may not be perfect, but it should serve our purpose. Further in the process, we may drop some of these variables or transform some into categorical variables, since continuous variables are not always the most relevant form.
This will be decided at a later stage in the process.
We create a new function that will perform several tasks:
- from the date, derive the month, year, and season
- from the date, derive if the observation is a weekday or a weekend
- delete the feature_8 and feature_9 variables that just hold zeros in both train and test sets
- merge weather data with the dataset
- rename the dataset as test_new and train_new
add_vars <- function(x){
#store name of the dataset (argument we pass to the function in order to reuse later to save the new dataframe)
name <- as.character(substitute(x))
#mutate function to add variables: Year, Month, is.weekend, Season
x <- x %>% mutate(Year = year(date), Month = month(date),
is.weekend = ifelse(weekdays(x$date) %in% c("Saturday", "Sunday"), 1, 0),
season = ifelse(Month %in% 10:12, "Fall",
ifelse(Month %in% 1:3, "Winter",
ifelse(Month %in% 4:6, "Spring",
"Summer")))
)
#reorder the levels of the categorical variable for consistent plotting
x$season <- factor(x$season, levels = c("Fall", "Winter", "Spring", "Summer"))
#delete empty variables
x$feature_8 <- NULL; x$feature_9 <- NULL
#merge dataset with the weather data
x <- x %>% left_join(.,weather_imp, by = "date")
#assign the dataset to the global environment, under a new name
assign(paste0(name, "_new"), x, envir = .GlobalEnv)
}
#perform the function on train and test datasets
add_vars(train); add_vars(test)
head(train_new, 3)
## date bank_holiday feature_0 feature_1 feature_2 feature_3
## 1 2005-03-20 0 0 0 0 3.2
## 2 2005-03-21 0 0 0 0 3.2
## 3 2005-03-22 0 0 0 0 3.2
## feature_4 feature_5 feature_6 feature_7 school_holiday feature_10 label
## 1 6.7 1.7 3.7 0 0 0 915
## 2 6.7 1.7 3.7 0 3 0 1057
## 3 6.7 1.7 3.7 0 3 0 1482
## Year Month is.weekend season humidity temp_max temp_mean temp_min
## 1 2005 3 1 Winter 69 8.9 4.4 0.2
## 2 2005 3 0 Winter 58 13.2 6.2 -0.3
## 3 2005 3 0 Winter 52 18.0 10.6 3.0
## precipitation snow sunshine wind
## 1 0.0 0 11.1 7.6
## 2 0.0 0 11.3 9.7
## 3 3.3 0 8.6 9.2
Our datasets hold some dummy variables and we have no real clue what they represent. We can observe how they correlate with each other.
pairs.panels(train_new[c("feature_0", "feature_1", "feature_2", "feature_3", "feature_4", "feature_5", "feature_6", "feature_7", "feature_10")])
We see very strong correlations (between 0.99 and 1) among feature_3 through feature_6. Keeping all of these variables won't bring additional predictive power, so we decide to keep only feature_4 and delete the other three.
#remove columns
test_new[,c(6,8,9)] <- NULL
train_new[,c(6,8,9)] <- NULL
We also had a hunch that the three temperature variables (min, mean, max) could be strongly correlated. We repeat the same process, showing correlations between the weather variables and label:
pairs.panels(train_new[names(train_new[-c(1:9,11:15)])])
Again we find strong correlations; we do not need all three variables, so we retain the one with the highest correlation with label: temp_max.
#remove columns
test_new[,c(17,18)] <- NULL
train_new[,c(17,18)] <- NULL
Our two new datasets, train_new and test_new, now have 20 variables and no missing values; we have deleted the dummy variables that were either all zeros or strongly correlated with others.
Before selecting our models and testing the ideal candidates, we will try to get a better feel of the data with some basic EDA.
summary(train_new$label); sd(train_new$label)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 142 623 930 1119 1532 3761
## [1] 616.6403
Daily visitors vary between 142 and 3761, with a mean of about 1119 visitors per day and a standard deviation of about 617.
#plot distribution of visitors per day
ggplot(train_new, aes(x = label)) +
  geom_histogram(aes(y = ..density..), col = "red", binwidth = 50, fill = "salmon", alpha = 1) +
  geom_density(col = "black") +
  geom_vline(aes(xintercept = mean(label)), col = 'grey', size = 2) +
  labs(title = 'Figure 1 - Distribution of visitors per day', x = 'Visitors per day', y = 'Density')
There is a fairly pronounced right skew, which is to be expected in this type of use case, since the distribution is bounded by 0 on the left and unbounded on the right. Diagnostic plots such as the QQ plot show some divergence at the lower and upper ends of the qqline, but considering the large sample size (1743), we can reasonably assume approximate normality.
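The QQ plot mentioned above can be reproduced with base R (a quick sketch):
qqnorm(train_new$label); qqline(train_new$label)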
What does the daily visitor trend look like?
ggplot(data = train_new, aes(x = date, y = label)) + geom_line() + geom_smooth(method = "lm") + labs(title = "Figure 2 - Daily visits", x = "Date", y = "Visitors")
We can detect seasonality and the regression line shows a slight decreasing trend in visitors over the years.
Is there a seasonal effect?
train_new %>% group_by(Year, season) %>% summarise(Visitors = sum(label)) %>% arrange(season, Year) %>%
ggplot(aes(x = Year, y = Visitors)) + geom_col(aes(fill = Year)) + facet_grid(~season) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + scale_y_continuous(labels = comma) +
labs(title = "Figure 3 - Visitors per season", x = "Season / Year", y = "Visitors")
We record more visitors on weekends than on weekdays
ggplot(data = train_new, aes(y = label)) + geom_boxplot() + facet_wrap(~is.weekend) +
labs(title = "Figure 4 - Boxplots, weekday(0) vs. weekends(1)", y = "Visitors")
Impact of school holidays
#school holiday
ggplot(data = train_new, aes(y = label)) + geom_boxplot() + facet_grid(~school_holiday) +
labs(title = "Figure 5 - Boxplots, school holidays", y = "Visitors")
We see the difference between no school holiday (0) and school holidays for both counties (3). One notable point is that there are more visits during school holidays in county 1 than during school holidays in county 2. One hint for the park administrators could be to push marketing/advertising in county 2.
Conclusion
This basic exploratory data analysis highlights some of the possible correlations between the variables in the dataframe, and which factors could drive visits up or down.
We will now tackle the prediction tasks.
Our problem is one of numeric prediction: the outcome (dependent) variable is numeric and we need methods adapted to this task, the standard ones being multiple linear regression, CART (classification and regression trees), model trees, random forests, and artificial neural networks.
Our process:
- divide our train set into training and calibration sets so that we can measure RMSE against unseen data.
- test and fine-tune each of the above mentioned algorithms
- derive a RMSE for each option
- select the best candidate (with lowest RMSE)
- predict the values for the test set
A few notes:
- we will treat the date variable as an ID variable, it won’t be used for predicting (whereas data derived from it like month or season or weekdays will be used)
- we have to remain cautious: each transformation we apply to our train and calibration sets must be recorded so that we apply it to the test set as well
The in-sample accuracy measures for each algorithm are generally inflated due to overfitting, so we also need to measure the real performance on unseen data. In order to do so, we cut our train_new dataset into 2 parts: a training set and a calibration set (we could also have called it a validation set).
We set the seed for reproducibility and use the createDataPartition function from caret, and derive our train set (train_new_1) and calibration set (calibration_1) with a 70% / 30% breakdown.
set.seed(75984325)
indexes <- createDataPartition(train_new$label, times = 1, p = 0.7, list = FALSE)
train_new_1 <- train_new[indexes,]
calibration_1 <- train_new[-indexes,]
The caret package is again useful to perform our task with cross-validation, allowing us to reduce the effect of overfitting.
#create a trainControl object to ask for a 10-fold cross validation:
ctrl<-trainControl(method = "cv",number = 10)
#fit the model, method is a linear model, metric is RMSE, we exclude the date from the predictors
lmCVFit<-train(label ~ ., data = train_new_1[,-1], method = "lm", trControl = ctrl, metric="RMSE")
#read the summary
summary(lmCVFit)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1705.86 -226.62 -60.36 201.19 2198.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.321e+05 4.239e+04 3.116 0.00188 **
## bank_holiday 3.350e+01 2.631e+01 1.273 0.20309
## feature_0 1.157e+02 4.709e+01 2.457 0.01414 *
## feature_1 -3.341e+02 8.142e+01 -4.104 4.34e-05 ***
## feature_2 -2.090e+02 8.545e+01 -2.446 0.01460 *
## feature_4 1.339e+02 8.515e+01 1.572 0.11619
## feature_7 9.109e+01 6.622e+01 1.376 0.16922
## school_holiday 2.212e+02 1.172e+01 18.875 < 2e-16 ***
## feature_10 5.104e+01 4.976e+01 1.026 0.30520
## Year -6.556e+01 2.138e+01 -3.067 0.00221 **
## Month -8.061e+01 1.494e+01 -5.395 8.24e-08 ***
## is.weekend 5.485e+02 2.617e+01 20.960 < 2e-16 ***
## seasonWinter -3.520e+02 1.374e+02 -2.561 0.01055 *
## seasonSpring -4.719e+02 9.798e+01 -4.816 1.65e-06 ***
## seasonSummer -1.446e+02 6.040e+01 -2.393 0.01685 *
## humidity -3.314e+00 1.870e+00 -1.772 0.07661 .
## temp_max 2.068e+01 2.738e+00 7.552 8.44e-14 ***
## precipitation 4.029e+00 2.785e+00 1.447 0.14824
## snow -7.332e+00 7.409e+00 -0.990 0.32255
## sunshine -1.408e-01 4.888e+00 -0.029 0.97703
## wind 7.076e+00 3.542e+00 1.998 0.04598 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 397.9 on 1201 degrees of freedom
## Multiple R-squared: 0.5923, Adjusted R-squared: 0.5855
## F-statistic: 87.25 on 20 and 1201 DF, p-value: < 2.2e-16
The output shows us that most predictors are significant (some of them may be strongly correlated though).
- Most significant predictors: is.weekend, school_holiday, temp_max.
- Least significant predictors: sunshine, feature_10, snow, wind, humidity.
Adjusted R-Squared: about 58% of the variance in the outcome is explained by the variables in the model.
Now we can predict the values of our calibration set and measure the RMSE:
#in-sample metrics
inSamp_lm <- apply(lmCVFit$resample[,1:3], 2, mean)
#out-of-sample metrics - predicting on the calibration set
PredValues_lm <- predict(lmCVFit, newdata = calibration_1)
outSamp_lm <- postResample(pred = PredValues_lm, obs = calibration_1$label)
#print results
inSamp_lm; outSamp_lm
## RMSE Rsquared MAE
## 404.5425317 0.5747277 298.4896042
## RMSE Rsquared MAE
## 412.7575881 0.5474085 303.8406353
With an in-sample RMSE of 404.54 and out-of-sample RMSE of 412.76, the results are better than our baseline of 500.
We can try to improve that accuracy with other algorithms.
(Note: other tuning parameters did not yield significantly better results, we are showing here only the basic tuning version.)
Decision trees can also apply to regression/numerical predictions:
ctrl <- trainControl(method = "cv", number = 10)
Grid <- expand.grid(cp = seq(0, 0.05, 0.005))
#fit model
rpartCVFit <- train(label ~. , data=train_new_1[,-1], method = "rpart", trControl=ctrl, metric="RMSE", tuneGrid = Grid)
plot(rpartCVFit)
inSamp_rpart <- apply(rpartCVFit$resample[,1:3], 2, mean)
PredValues_rpart <- predict(rpartCVFit, newdata = calibration_1)
outSamp_rpart <- postResample(pred = PredValues_rpart, obs = calibration_1$label)
inSamp_rpart; outSamp_rpart
## RMSE Rsquared MAE
## 355.1964343 0.6747459 248.6879636
## RMSE Rsquared MAE
## 347.6800634 0.6785883 240.0832782
With an in-sample RMSE of 355.20 and out-of-sample of 347.68, we notice an improvement compared to the multiple linear regression.
The logical next step is to try random forests.
Random forests are very robust to overfitting and provide accurate non-linear models; they grow many different trees without pruning.
We use the “ranger” package which is much quicker than the widely used randomForest package and yields similar results.
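caret tunes only a small default grid for ranger; a custom grid could be passed instead, along the lines of the sketch below (not run for this report, the values are assumptions):
#possible custom tuning grid for the ranger method
rfGrid <- expand.grid(mtry = c(5, 10, 15), splitrule = "variance", min.node.size = c(5, 10))
#then add tuneGrid = rfGrid to the train() call below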
#trainControl object with 10-fold cross validation
ctrl<-trainControl(method = "cv", number = 10)
#fit model
rfCVFit<-train(label ~ ., data = train_new_1[,-1], trControl = ctrl, method = "ranger", metric="RMSE")
inSamp_rf <- apply(rfCVFit$resample[,1:3], 2, mean)
PredValues_rf <- predict(rfCVFit, newdata = calibration_1)
outSamp_rf <- postResample(pred = PredValues_rf, obs = calibration_1$label)
inSamp_rf; outSamp_rf
## RMSE Rsquared MAE
## 299.5159514 0.7672192 210.1169362
## RMSE Rsquared MAE
## 305.8755619 0.7513004 207.4755160
We now obtain better metrics: about 300 (in-sample) and 306 (out-of-sample).
glmnet provides a parsimonious model through elastic net regularization, a combination of lasso regression (an L1 penalty on the absolute value of the coefficients, which can shrink some of them exactly to zero) and ridge regression (an L2 penalty on the squared magnitude of the coefficients). It handles regression as well as classification, so we try whether it improves our predictions.
#create grid to try different values of alpha (from 0 to 1, all lasso or all ridge) and lambda (size of the penalty)
myGrid <- expand.grid(alpha = seq(0, 1, length = 5),
lambda = seq(0.0001, 1, length = 20))
#fit model
glmnetCVFit <- train(label ~., data = train_new_1[,-1], method = "glmnet", tuneGrid = myGrid, trControl = ctrl, metric = "RMSE")
inSamp_glmnet <- apply(glmnetCVFit$resample[,1:3], 2, mean)
PredValues_glmnet <- predict(glmnetCVFit, newdata = calibration_1)
outSamp_glmnet <- postResample(pred = PredValues_glmnet, obs = calibration_1$label)
inSamp_glmnet; outSamp_glmnet
## RMSE Rsquared MAE
## 400.9727038 0.5803318 296.3920374
## RMSE Rsquared MAE
## 413.2968003 0.5462272 304.4698967
The results are in the same range as our linear model, so not satisfactory.
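Since glmnet can shrink coefficients exactly to zero, it can still be informative to see which predictors survive the penalty; a quick sketch (not run here):
glmnetCVFit$bestTune #selected alpha and lambda
coef(glmnetCVFit$finalModel, s = glmnetCVFit$bestTune$lambda) #coefficients at the selected lambda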
Neural networks usually yield among the best results in machine learning. We use the neuralnet package which, although not as complete as packages like Keras, is easy to configure and gives us an idea of whether it performs better than the other models.
To use the algorithm, we need numerical predictors, which we normalize so that they lie between 0 and 1.
Steps: create a new dataset (copied from our train_new dataset), convert the season variable to numeric, normalize/scale the variables, split the dataframe into train and test, fit the model, predict, scale the predictions and test values back, and compute the RMSE.
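For reference, the min-max normalization used in Step 2 below maps each column x to (x - min(x)) / (max(x) - min(x)); a minimal helper sketch of the transformation and its inverse:
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
#inverse used later to scale predictions back: x01 * (max(x) - min(x)) + min(x)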
#STEP 1 - create new dataset, convert categorical data to numerical
data_nn <- train_new
data_nn <- data_nn %>% mutate(season = ifelse(Month %in% 1:3, 1,
ifelse(Month %in% 4:6, 2,
ifelse(Month %in% 7:9, 3,
4))))
#STEP 2 - normalize
maxs <- apply(data_nn[,-1], 2, max)
mins <- apply(data_nn[,-1], 2, min)
scaled <- as.data.frame(scale(data_nn[,-1], center = mins, scale = maxs - mins))
#STEP 3 - Create partition
set.seed(75984325)
indexes <- createDataPartition(data_nn$label, times = 1, p = 0.7, list = FALSE)
train_nn <- scaled[indexes,]
test_nn <- scaled[-indexes,]
#STEP 4 - TRAIN
names_nn <- names(train_nn)
formula_nn <- as.formula(paste("label ~", paste(names_nn[!names_nn %in% "label"], collapse = " + ")))
set.seed(75984325)
model_nn <- neuralnet(formula_nn, data = train_nn, hidden = c(6,1), linear.output=T)
#STEP 5 - PLOT
plot(model_nn)
#STEP 6 - PREDICT
pred_nn <- compute(model_nn, test_nn[,-c(9)])
#STEP 7 - SCALE RESULTS BACK
#scale back the predictions
pred_nn_scale_back <- pred_nn$net.result*(max(data_nn$label)-min(data_nn$label))+min(data_nn$label)
#scale back the test set
test_nn_scale_back <- test_nn$label * (max(data_nn$label)-min(data_nn$label))+min(data_nn$label)
#STEP 8 - COMPUTE RMSE
RMSE(pred_nn_scale_back, test_nn_scale_back)
## [1] 318.751492
We obtain an out-of-sample RMSE of 318.75 (after trying many different parameters, not all documented here), which is higher than the out-of-sample error of the random forest model but better than the remaining models, making the neural network a second-place candidate.
We tested only five algorithms for this case study; with a bit more time, we could improve on this with more pre-processing, other variations of these models, or new algorithms …
But now comes the time to make a choice for our final predictions.
We can start by visualizing what our final predictions look like. This time we train the data on the full train set (train_new) instead of the sub train sample we took before (train_new_1), then we predict on the final test set:
#LINEAR MODEL
#train on the full train set (train_new)
ctrl<-trainControl(method = "cv",number = 10)
lmCVFit<-train(label ~ ., data = train_new[,-1], method = "lm", trControl = ctrl, metric="RMSE")
#predict on test set
predtest_lm <- predict(lmCVFit, newdata = test_new)
#in-sample RMSE
inSamp_lm_final <- apply(lmCVFit$resample[,1:3], 2, mean)
#RPART, same process
Grid <- expand.grid(cp = seq(0, 0.05, 0.005))
rpartCVFit <- train(label ~. , data=train_new[,-1], method = "rpart", trControl=ctrl, metric="RMSE", tuneGrid = Grid)
predtest_rpart <- predict(rpartCVFit, newdata = test_new)
inSamp_rpart_final <- apply(rpartCVFit$resample[,1:3], 2, mean)
#RANDOM FOREST
rfCVFit<-train(label ~ ., data = train_new[,-1], trControl = ctrl, method = "ranger", metric="RMSE")
predtest_rf <- predict(rfCVFit, newdata = test_new)
inSamp_rf_final <- apply(rfCVFit$resample[,1:3], 2, mean)
#glmnet
myGrid <- expand.grid(alpha = seq(0, 1, length = 5), lambda = seq(0.0001, 1, length = 20))
glmnetCVFit <- train(label ~., data = train_new[,-1], method = "glmnet", tuneGrid = myGrid, trControl = ctrl, metric = "RMSE")
predtest_glmnet <- predict(glmnetCVFit, newdata = test_new)
inSamp_glmnet_final <- apply(glmnetCVFit$resample[,1:3], 2, mean)
#create data frame with test set predictions per algorithm
final_predictions <- data.frame(date = test_new$date, lm = predtest_lm, rpart = predtest_rpart, rf = predtest_rf, glmnet = predtest_glmnet)
#plotting predictions
p1 <- ggplot(data = final_predictions, aes(x = date, y = lm)) + geom_line(data = train_new, aes(x = date, y = label)) + geom_line(color = "red") + labs(title = "Linear regression")
p2 <- ggplot(data = final_predictions, aes(x = date, y = rpart)) + geom_line(data = train_new, aes(x = date, y = label)) + geom_line(color = "red") + labs(title = "Decision Tree (rpart)")
p3 <- ggplot(data = final_predictions, aes(x = date, y = rf)) + geom_line(data = train_new, aes(x = date, y = label)) + geom_line(color = "red") + labs(title = "Random Forests")
p4 <- ggplot(data = final_predictions, aes(x = date, y = glmnet)) + geom_line(data = train_new, aes(x = date, y = label)) + geom_line(color = "red") + labs(title = "glmnet")
#merging plots into one figure only
grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2)
Even though we visually notice some differences, the best method is to print the RMSE for each model used. We use the in-sample metrics as we do not have an out-of-sample reference:
kable(rbind(inSamp_lm_final, inSamp_rpart_final, inSamp_rf_final, inSamp_glmnet_final), align = "ccc", digits = 2)
|  | RMSE | Rsquared | MAE |
|---|---|---|---|
| inSamp_lm_final | 404.01 | 0.57 | 298.80 |
| inSamp_rpart_final | 349.77 | 0.68 | 244.74 |
| inSamp_rf_final | 303.17 | 0.76 | 206.64 |
| inSamp_glmnet_final | 405.35 | 0.57 | 299.72 |
Random forest seems again the best performing solution.
Exporting the data:
# Select only date and random forest predictions
predictions_test <- final_predictions[,c(1,4)]
# rename the variables
names(predictions_test) <- c("date", "label")
# export the file, removing row names
write.csv(predictions_test, file = "predictions_test.csv", row.names=FALSE)
# verifying that the export is compliant, load, and inspect
predictions_test <- read.csv("predictions_test.csv", stringsAsFactors = FALSE)
predictions_test$date <- as.POSIXct(predictions_test$date)
# check structure
str(predictions_test)
## 'data.frame': 363 obs. of 2 variables:
## $ date : POSIXct, format: "2010-01-01" "2010-01-02" ...
## $ label: num 871 1171 1129 1196 1206 ...
The format looks as expected. In case the reader imports the file into R with the read.csv function, make sure to include the argument stringsAsFactors = FALSE.
Conclusion:
Our case study consists of predicting numeric values for the daily visitors to a leisure park in Germany.
We have a train set (ranging from 2005 to 2009) and a test set (for year 2010) with a number of explanatory variables. Our task is to predict the 363 values of the label variable in the test set.
We performed some transformations on the datasets:
- include weather data originating from a 3rd dataset (we had to impute missing values there and merge the final result with our train and test datasets)
- create a few variables inferred from the date variable (season, month, weekend/weekday, …)
We performed a quick exploratory data analysis showing a few insights, the most notable one being a decreasing trend of visitors over time.
We finally tested 5 different algorithms, retaining the variables as numeric predictors (except for the season variable).
Random forest is the model that showed the lowest RMSE (root mean squared error), roughly 200 points below our baseline of 500 (in-sample metric on the full train set).
With more time, we would experiment with additional or improved versions of these models to obtain better predictions.
Strategy for improving the predictions:
1. More pre-processing on the dataset: grouping some columns like is.weekend and bank_holiday for both counties (value == 3). Other numerical transformations (log, …) could also be explored, depending on the algorithms and the type of data they work with the best.
2. Binning some of the variables: we may not need that level of detail for variables such as precipitation.
3. Tweaking the parameters of the models further.
4. Investigating more models: there are literally hundreds of potential algorithms to choose from, and we could automate the comparison in caret (see the sketch below).
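As an illustration of point 4, a minimal sketch (not run for this report) of comparing several caret methods under the same cross-validation scheme:
methods_to_try <- c("lm", "rpart", "ranger", "glmnet")
cv_fits <- lapply(methods_to_try, function(m) {
  set.seed(75984325)
  train(label ~ ., data = train_new[,-1], method = m,
        trControl = trainControl(method = "cv", number = 10), metric = "RMSE")
})
#cross-validated RMSE of the best tune for each method
sapply(cv_fits, function(fit) min(fit$results$RMSE, na.rm = TRUE))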
We performed a couple of the above steps, which we did not document here. For instance, grouping some of the bank_holiday/school_holiday/weekend variables and other transformations were run through the 5 algorithms tested here, yielding better results for the linear and decision tree models but degrading the results of the random forest. The algorithm ranking by RMSE stayed the same, so we decided to stick with the setup presented above, as it gives the best overall results with random forests.
Contact:
Thank you for reading through this document. For any questions, feel free to reach out:
Xavier Valdayron
xavier@measuringsocial.com
xavier_valdayron@yahoo.com