Multiple Linear Regression (MLR) and Logistic Regression Models
Project Goal
In this report, we conduct an exploratory data analysis (EDA) on the Seoul Bike Sharing Demand dataset. We then fit multiple linear regression models and logistic regression models to make predictions on the two response variables: the numeric Rent_Count and its re-coded variable Rent_Many. Model performances are compared using RMSE (for the linear models) and accuracy (for the logistic models).
Set up: Packages and Helper Functions
In this task, we will use the following packages:
- here: enables easy file referencing and builds file paths in an OS-independent way
- stats: load this before tidyverse so that it does not mask some tidyverse functions
- tidyverse: includes collections of useful packages like dplyr (data manipulation), tidyr (tidying data), ggplot2 (creating graphs), etc.
- glue: embeds and evaluates R expressions into strings to be printed as messages
- scales: formats and labels scales nicely for better visualization
- GGally: extends ggplot2 by combining geometric objects with transformed data
- corrplot: provides a visual exploratory tool for correlation matrices, with automatic variable reordering to help detect hidden patterns among variables
- Hmisc: contains many functions useful for data analysis, high-level graphics, utility operations, sample size and power computation, and simulation
- ggmosaic: creates visualizations of categorical data; capable of producing bar charts, stacked bar charts, mosaic plots, and double-decker plots
- caret: training and plotting classification and regression models
- broom: tidies and unifies the fit objects returned by most modeling functions
- modelr: functions for modeling that help you seamlessly integrate modeling into a pipeline of data manipulation and visualization
## Loading required package: pacman
pacman::p_load(
  here,
  stats,
  tidyverse,
  glue, scales,
  GGally, corrplot, Hmisc, ggmosaic,
  caret, broom, modelr
)

And define the following helper functions:
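%&%
A small infix operator for concatenating strings, used later to build the model formula strings. Its defining chunk is not echoed in the report, so the following is an inferred sketch; the assumption is that it simply wraps paste0():
# infix string concatenation (assumed definition; the original chunk is not shown)
`%&%` <- function(lhs, rhs) paste0(lhs, rhs)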
fit_glm_models()
Fits several candidate glm models, each trained with cross-validation, evaluates each fitted model on the test set, and summarizes the metric scores.
- Arguments:
  - response: the response variable (a column name string)
  - models: a vector of formula strings
  - data_train: training set
  - data_test: test set
  - family: family of the error distribution used in glm()
  - metric: the summary metric used to compare the models
- Returned Value:
- A data frame to summarize the metric scores
# fit multiple glm models and summarize the metrics
fit_glm_models <- function(
response, models,
data_train, data_test,
family = c("gaussian", "binomial"),
metric = c("RMSE", "Accuracy")){
# verify query family and metric arguments
family <- match.arg(family)
metric <- match.arg(metric)
# initialize an empty vector to save the metric scores
lst_score <- c()
# train models
for (x in models) {
fit <- train(
as.formula(x),
data = data_train,
method = "glm",
family = family,
metric = metric,
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv", number = 5))
# make prediction on test data
pred <- predict(fit, newdata = data_test)
# get metric
if(metric == "RMSE"){
score <- modelr::rmse(fit, data_test)
} else if(metric == "Accuracy"){
score <- confusionMatrix(data = pred, reference = data_test[[response]])$overall["Accuracy"]
}
# save metric score
lst_score <- append(lst_score, round(score, 2))
# print progress
print(glue(
"Fit model: '{x}' \n",
"{metric} = {score} \n"))
}
# create a summary data frame
df_summary <- tibble(model = models, metric = metric, score = lst_score)
# return all results
return(df_summary)
}

Data
The Seoul Bike Sharing Demand dataset contains the count of public bicycles rented per hour in the Seoul Bike Sharing System, together with the corresponding weather data and holiday information. A local copy is saved in the data folder. Since the original column names contain spaces and special characters, and are rather long, we rename the columns when we read the data in.
df_raw <- read_csv(
here("data", "SeoulBikeData.csv"),
col_names = c(
"Date", "Rent_Count", "Hour",
"Temperature", "Humidity", "Wind_Speed",
"Visibility", "Dew_Point", "Solar_Radiation",
"Rainfall", "Snowfall",
"Seasons", "Holiday","Functioning_Day"),
col_types = c("?", rep("n", 10), rep("c", 3)),
# drop the first column (Date)
col_select = -1,
# skip the original header row (we supply our own column names)
skip = 1)
# show raw data frame
df_raw

## tibble [8,760 × 13] (S3: tbl_df/tbl/data.frame)
## $ Rent_Count : num [1:8760] 254 204 173 107 78 100 181 460 930 490 ...
## $ Hour : num [1:8760] 0 1 2 3 4 5 6 7 8 9 ...
## $ Temperature : num [1:8760] -5.2 -5.5 -6 -6.2 -6 -6.4 -6.6 -7.4 -7.6 -6.5 ...
## $ Humidity : num [1:8760] 37 38 39 40 36 37 35 38 37 27 ...
## $ Wind_Speed : num [1:8760] 2.2 0.8 1 0.9 2.3 1.5 1.3 0.9 1.1 0.5 ...
## $ Visibility : num [1:8760] 2000 2000 2000 2000 2000 ...
## $ Dew_Point : num [1:8760] -17.6 -17.6 -17.7 -17.6 -18.6 -18.7 -19.5 -19.3 -19.8 -22.4 ...
## $ Solar_Radiation: num [1:8760] 0 0 0 0 0 0 0 0 0.01 0.23 ...
## $ Rainfall : num [1:8760] 0 0 0 0 0 0 0 0 0 0 ...
## $ Snowfall : num [1:8760] 0 0 0 0 0 0 0 0 0 0 ...
## $ Seasons : chr [1:8760] "Winter" "Winter" "Winter" "Winter" ...
## $ Holiday : chr [1:8760] "No Holiday" "No Holiday" "No Holiday" "No Holiday" ...
## $ Functioning_Day: chr [1:8760] "Yes" "Yes" "Yes" "Yes" ...
## - attr(*, "spec")=
## .. cols(
## .. Date = col_skip(),
## .. Rent_Count = col_double(),
## .. Hour = col_double(),
## .. Temperature = col_double(),
## .. Humidity = col_double(),
## .. Wind_Speed = col_double(),
## .. Visibility = col_double(),
## .. Dew_Point = col_double(),
## .. Solar_Radiation = col_double(),
## .. Rainfall = col_double(),
## .. Snowfall = col_double(),
## .. Seasons = col_character(),
## .. Holiday = col_character(),
## .. Functioning_Day = col_character()
## .. )
## [1] "Winter" "Spring" "Summer" "Autumn"
## [1] "No Holiday" "Holiday"
## [1] "Yes" "No"
Next, we prepare the data by converting the three categorical columns Seasons, Holiday and Functioning_Day to factors with proper levels. From the numeric response Rent_Count, we also create a new binary factor variable Rent_Many (1 if Rent_Count is at least 700, 0 otherwise) as instructed in the home assignment.
df <- df_raw %>%
mutate(
Seasons = factor(Seasons, levels = c("Spring", "Summer", "Autumn", "Winter")),
Holiday = factor(Holiday, levels = c("No Holiday", "Holiday")),
Functioning_Day = factor(Functioning_Day , levels = c("No", "Yes")),
Rent_Many = if_else(Rent_Count >= 700, "1", "0") %>% factor()
) %>%
relocate(Rent_Many, .after = Rent_Count
)
# save column locations for each variable types
lst_cols <- list(
y = 1:2,
x_num = 3:11,
x_cat = 12:14
)
# show data frame - only the columns of interest
df[, c(lst_cols$y, lst_cols$x_cat)]

## tibble [8,760 × 5] (S3: tbl_df/tbl/data.frame)
## $ Rent_Count : num [1:8760] 254 204 173 107 78 100 181 460 930 490 ...
## $ Rent_Many : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
## $ Seasons : Factor w/ 4 levels "Spring","Summer",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Holiday : Factor w/ 2 levels "No Holiday","Holiday": 1 1 1 1 1 1 1 1 1 1 ...
## $ Functioning_Day: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## - attr(*, "spec")=
## .. cols(
## .. Date = col_skip(),
## .. Rent_Count = col_double(),
## .. Hour = col_double(),
## .. Temperature = col_double(),
## .. Humidity = col_double(),
## .. Wind_Speed = col_double(),
## .. Visibility = col_double(),
## .. Dew_Point = col_double(),
## .. Solar_Radiation = col_double(),
## .. Rainfall = col_double(),
## .. Snowfall = col_double(),
## .. Seasons = col_character(),
## .. Holiday = col_character(),
## .. Functioning_Day = col_character()
## .. )
Split the Data
We will use caret::createDataPartition() to create a
75/25 split of training and test sets.
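The splitting chunk itself is not echoed in the report. A minimal sketch of what it may look like (the seed value here is an illustrative assumption):
# reproducible 75/25 split, stratified on quantile groups of the numeric response
set.seed(1)  # illustrative seed; the original value is not shown
idx_train <- createDataPartition(df$Rent_Count, p = 0.75, list = FALSE)[, 1]
df_train <- df[idx_train, ]
df_test <- df[-idx_train, ]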
Basic EDA
A quick EDA will be done on the training set:
- Response variables (numeric Rent_Count and its re-coded variable Rent_Many)
- Numeric Predictors
- Categorical Predictors
Response Variable
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 191.5 503.0 704.1 1062.0 3556.0
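The summary chunk is not echoed; it is presumably produced by a call such as:
# five-number summary plus mean of the training-set response
summary(df_train$Rent_Count)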
# histogram
qplot(
df_train$Rent_Count, binwidth = 100,
main = "Histogram of Rented Bike Count",
xlab = "Rented Bike Count",
ylab = "Count")## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
From the 5-number summary and the histogram, the response variable Rent_Count has a right-skewed distribution. Using a rent count of 700 as the threshold, 2655 of the 6571 training records have rent counts greater than or equal to 700.
##
## 0 1
## 3916 2655
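The frequency table above is likewise from an unechoed chunk; it is consistent with:
# counts of the two Rent_Many classes in the training set
table(df_train$Rent_Many)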
# boxplot
qplot(
factor(Rent_Many), Rent_Count,
data = df_train,
geom = "boxplot",
main = "Boxplot of Rented Bike Count by Rent_More (>700 or not)",
xlab = "Rented Bike Count >700?",
ylab = "Count")The box plots provide a side-by-side comparison and shows that the group with more than 700 rents has larger variability and is highly right skewed.
Numeric Predictors
## Hour Temperature Humidity Wind_Speed
## Min. : 0.00 Min. :-17.80 Min. : 0.00 Min. :0.000
## 1st Qu.: 5.00 1st Qu.: 3.50 1st Qu.:43.00 1st Qu.:1.000
## Median :11.00 Median : 13.80 Median :57.00 Median :1.500
## Mean :11.45 Mean : 12.88 Mean :58.27 Mean :1.733
## 3rd Qu.:17.00 3rd Qu.: 22.50 3rd Qu.:74.00 3rd Qu.:2.400
## Max. :23.00 Max. : 39.30 Max. :98.00 Max. :7.400
## Visibility Dew_Point Solar_Radiation Rainfall
## Min. : 27 Min. :-30.600 Min. :0.0000 Min. : 0.0000
## 1st Qu.: 936 1st Qu.: -4.550 1st Qu.:0.0000 1st Qu.: 0.0000
## Median :1696 Median : 5.200 Median :0.0100 Median : 0.0000
## Mean :1434 Mean : 4.086 Mean :0.5698 Mean : 0.1541
## 3rd Qu.:2000 3rd Qu.: 14.700 3rd Qu.:0.9400 3rd Qu.: 0.0000
## Max. :2000 Max. : 26.800 Max. :3.5200 Max. :35.0000
## Snowfall
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.07486
## 3rd Qu.:0.00000
## Max. :8.80000
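The numeric summaries above also come from an unechoed chunk, presumably along these lines:
# five-number summaries plus means for the nine numeric predictors
summary(df_train[, lst_cols$x_num])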
# correlation plots
corrplot(
cor(df_train[, c(1, lst_cols$x_num)]),
type = "lower",
method = "square",
addCoef.col = "white",
cl.ratio = 0.2, tl.srt = 45,
title = "Correlation amoung numeric predictors and rent count"
)Some observations include:
Houris very uniformTemperatureandHumidityare quite normalWind_Speedis slightly right skewed andDew_Pointis slightly left-skewedVisibility,Solar_Radiation,RainfallandSnowfallare all very skewed. This is because of the climate condition of the area of the study. Most of days, it’s sunny.TemperatureandDew_Pointare highly correlatedHumidityalso shows correlation withVisibility,Dew_PointandSolar_Radiation
Categorical Predictors
## Seasons Holiday Functioning_Day
## Spring:1670 No Holiday:6251 No : 222
## Summer:1654 Holiday : 320 Yes:6349
## Autumn:1637
## Winter:1610
##
## Spring Summer Autumn Winter
## No 48 0 247 0
## Yes 2160 2208 1937 2160
##
## Spring Summer Autumn Winter
## No Holiday 2136 2160 2064 1968
## Holiday 72 48 120 192
##
## No Yes
## No Holiday 271 8057
## Holiday 24 408
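These one-way and two-way tables are also from unechoed chunks. Plausible calls are sketched below; note that the two-way totals match the full data set, so df rather than df_train is assumed there:
# one-way summary of the categorical predictors (training set)
summary(df_train[, lst_cols$x_cat])
# two-way tables on the full data set
table(df$Functioning_Day, df$Seasons)
table(df$Holiday, df$Seasons)
table(df$Holiday, df$Functioning_Day)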
For Holiday and Functioning_Day, we have unbalanced data. From the two-way tables, we can see that in Summer and Winter there are no days on which the bike system is down all day (no functioning hours). The sparseness of the categorical predictors can be clearly visualized in the following mosaic plot.
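The mosaic-plot chunk is not echoed either; a minimal sketch with ggmosaic (the exact aesthetic mapping is an assumption):
# mosaic plot of the three categorical predictors
df_train %>%
  ggplot() +
  geom_mosaic(aes(x = product(Holiday, Seasons), fill = Functioning_Day)) +
  labs(title = "Mosaic plot of Seasons, Holiday and Functioning_Day")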
## Warning: `unite_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `unite()` instead.
## ℹ The deprecated feature was likely used in the ggmosaic package.
## Please report the issue at <https://github.com/haleyjeppson/ggmosaic>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
To visualize how these categorical predictors relate to the rent count, faceted box plots will be very helpful.
df %>% ggplot(aes(Functioning_Day, Rent_Count)) +
geom_boxplot() +
geom_jitter(width = 0.05) +
facet_grid(
rows = vars(Holiday),
cols = vars(Seasons)) +
labs(
title = "Boxplots of Rent Count by Seasons",
x = "Functional Day"
)If we plot out the box plots on rent count over these categorical
variable, we can see that there are fewer rent counts during winter
(compared to other seasons) and on holidays (compared to non-holidays).
Of course, on days when there are bike not functioning
(Functioning_Day = No), there are fewer rent counts.
## [1] "Rent_Count" "Rent_Many" "Hour" "Temperature"
## [5] "Humidity" "Wind_Speed" "Visibility" "Dew_Point"
## [9] "Solar_Radiation" "Rainfall" "Snowfall" "Seasons"
## [13] "Holiday" "Functioning_Day"
Based on the findings above, we plot the following scatterplot matrix for the predictors of interest:
# pair-wise scatter plots with correlation
ggpairs(
df_train,
columns = c(1, 3:8),
mapping = ggplot2::aes(color = Holiday, alpha = 0.2),
upper = list(continuous = wrap("cor", size = 2.5)),
axisLabels = "none"
)

Fitting MLR Models
We will fit the following candidate models on the numeric response Rent_Count:
# candidate models
model_candidates <- c(
".",
"Hour + Temperature + Humidity + Wind_Speed",
"Hour + Temperature * Humidity + Wind_Speed",
"Hour + Temperature * Humidity + Seasons + Holiday",
"(Hour + Temperature + Humidity)^2 + Seasons + Holiday",
"Hour * Holiday + Temperature * Humidity"
)
# make formula for lm models
response_lm <- names(df_train)[1]
models_lm <- response_lm %&% " ~ " %&% model_candidates
models_lm

## [1] "Rent_Count ~ ."
## [2] "Rent_Count ~ Hour + Temperature + Humidity + Wind_Speed"
## [3] "Rent_Count ~ Hour + Temperature * Humidity + Wind_Speed"
## [4] "Rent_Count ~ Hour + Temperature * Humidity + Seasons + Holiday"
## [5] "Rent_Count ~ (Hour + Temperature + Humidity)^2 + Seasons + Holiday"
## [6] "Rent_Count ~ Hour * Holiday + Temperature * Humidity"
For each candidate model, we fit a linear regression on the training set with 5-fold cross-validation and then make predictions on the test set. The helper function returns a summary of the metric scores in a data frame.
# fit the models
fit_lm <- fit_glm_models(response_lm, models_lm, df_train, df_test, "gaussian", "RMSE")

## Fit model: 'Rent_Count ~ .'
## RMSE = 325.115109558129
## Fit model: 'Rent_Count ~ Hour + Temperature + Humidity + Wind_Speed'
## RMSE = 489.316684945385
## Fit model: 'Rent_Count ~ Hour + Temperature * Humidity + Wind_Speed'
## RMSE = 485.426023919223
## Fit model: 'Rent_Count ~ Hour + Temperature * Humidity + Seasons + Holiday'
## RMSE = 476.981701386591
## Fit model: 'Rent_Count ~ (Hour + Temperature + Humidity)^2 + Seasons + Holiday'
## RMSE = 457.464578480263
## Fit model: 'Rent_Count ~ Hour * Holiday + Temperature * Humidity'
## RMSE = 483.11649423516
The model “Rent_Count ~ .” has the smallest RMSE of 325.12. Note, however, that the “.” formula includes the other response variable Rent_Many as a predictor, which likely contributes to its advantage.
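Since fit_glm_models() returns a summary tibble, the candidates can also be ranked directly; for RMSE, smaller is better:
# rank the candidate models by test-set RMSE
fit_lm %>% arrange(score)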
Fitting Logistic Regression Models
We will fit the following candidate models on the 2-level categorical response Rent_Many:
# make formula strings for the logistic models
response_glm <- names(df_train)[2]
models_glm <- response_glm %&% " ~ " %&% model_candidates
models_glm

## [1] "Rent_Many ~ ."
## [2] "Rent_Many ~ Hour + Temperature + Humidity + Wind_Speed"
## [3] "Rent_Many ~ Hour + Temperature * Humidity + Wind_Speed"
## [4] "Rent_Many ~ Hour + Temperature * Humidity + Seasons + Holiday"
## [5] "Rent_Many ~ (Hour + Temperature + Humidity)^2 + Seasons + Holiday"
## [6] "Rent_Many ~ Hour * Holiday + Temperature * Humidity"
For each candidate model, we fit a logistic regression on the training set with 5-fold cross-validation and then make predictions on the test set. The helper function returns a summary of the metric scores in a data frame.
# fit the models
fit_glm <- fit_glm_models(response_glm, models_glm, df_train, df_test, "binomial", "Accuracy")

## Fit model: 'Rent_Many ~ .'
## Accuracy = 0.993147555961626
## Fit model: 'Rent_Many ~ Hour + Temperature + Humidity + Wind_Speed'
## Accuracy = 0.804020100502513
## Fit model: 'Rent_Many ~ Hour + Temperature * Humidity + Wind_Speed'
## Accuracy = 0.809502055733212
## Fit model: 'Rent_Many ~ Hour + Temperature * Humidity + Seasons + Holiday'
## Accuracy = 0.819552306989493
## Fit model: 'Rent_Many ~ (Hour + Temperature + Humidity)^2 + Seasons + Holiday'
## Accuracy = 0.822750114207401
## Fit model: 'Rent_Many ~ Hour * Holiday + Temperature * Humidity'
## Accuracy = 0.811329374143445
The model “Rent_Many ~ .” has the largest accuracy of 0.99. Again, the “.” formula includes the other response variable Rent_Count as a predictor, which likely explains the near-perfect accuracy.
Conclusion
Both the MLR models and the logistic regression models give similar results. Among the candidate models, the full main-effects model (without interactions or higher-order terms) performs best at predicting the test set.