Multiple Linear Regression (MLR) and Logistic Models

Project Goal

In this report, we conduct an exploratory data analysis (EDA) on the Seoul Bike Sharing Demand dataset. We then fit multiple linear regression models and logistic regression models to predict the two response variables: the numeric Rent_Count and its re-coded binary variable Rent_Many. Model performances are then compared using RMSE for the linear models and accuracy for the logistic models.

Set up: Packages and Helper Functions

In this task, we will use the following packages:

  • here: enables easy file referencing and builds file paths in an OS-independent way
  • stats: loaded before tidyverse so that tidyverse functions (e.g., dplyr::filter()) are not masked by their stats counterparts
  • tidyverse: a collection of useful packages such as dplyr (data manipulation), tidyr (tidying data), ggplot2 (creating graphs), etc.
  • glue: embeds and evaluates R expressions in strings to be printed as messages
  • scales: formats and labels scales nicely for better visualization
  • GGally: extends ggplot2 by combining geometric objects with transformed data
  • corrplot: provides a visual exploratory tool for correlation matrices that supports automatic variable reordering to help detect hidden patterns among variables
  • Hmisc: contains many functions useful for data analysis, high-level graphics, utility operations, and computing sample size and power
  • ggmosaic: creates visualizations of categorical data, including bar charts, stacked bar charts, mosaic plots, and double-decker plots
  • caret: training and plotting classification and regression models
  • broom: tidies and unifies the fit objects returned by most modeling functions
  • modelr: functions that help seamlessly integrate modeling into a pipeline of data manipulation and visualization
if (!require("pacman")) utils::install.packages("pacman", dependencies = TRUE)
## Loading required package: pacman
pacman::p_load(
  here,
  stats,
  tidyverse,
  glue, scales,
  GGally, corrplot, Hmisc, ggmosaic,
  caret, broom, modelr
)

We also define the following helper functions:

# string concatenation
# ex: "act" %&% "5"
'%&%' <- function(x, y) paste0(x, y)

fit_glm_models()

Fits multiple glm candidate models using 5-fold cross-validation on the training set, evaluates each fitted model on the test set, and summarizes the metric scores.

  • Arguments:
    • response: the name of the response variable (a string)
    • models: a character vector of model formula strings
    • data_train: training set
    • data_test: test set
    • family: family of error distribution used in glm()
    • metric: what summary metric will be used to select the optimal model
  • Returned Value:
    • A data frame to summarize the metric scores
# fit multiple glm models and summarize the metrics
fit_glm_models <- function(
    response, models, 
    data_train, data_test, 
    family = c("gaussian", "binomial"), 
    metric = c("RMSE", "Accuracy")){

  # validate the family and metric arguments
  family <- match.arg(family)
  metric <- match.arg(metric)

  # initialize an empty vector to collect metric scores
  lst_score <- c()
  
  # train models
  for (x in models) {
    fit <- train(
      as.formula(x),
      data = data_train,
      method = "glm",
      family = family,
      metric = metric,
      preProcess = c("center", "scale"),
      trControl = trainControl(method = "cv", number = 5))
    
    # make prediction on test data
    pred <- predict(fit, newdata = data_test)

    # get metric
    if(metric == "RMSE"){
      score <-  modelr::rmse(fit, data_test)
    } else if(metric == "Accuracy"){
      score <- confusionMatrix(pred, data_test[[response]])$overall["Accuracy"]
    }
    
    # save metric score
    lst_score <- append(lst_score, round(score, 2))
  
    # print progress
    print(glue(
      "Fit model: '{x}' \n",
      "{metric} = {score} \n"))
  }
  
  # create a summary data frame
  df_summary <- tibble(model = models, metric = metric, score = lst_score)
  
  # return all results
  return(df_summary)
}

Data

The Seoul Bike Sharing Demand dataset contains the count of public bicycles rented per hour in the Seoul Bike Sharing System, along with corresponding weather data and holiday information. A local copy is saved in the data folder. Since the original column names contain spaces and special characters, and are rather long, we rename the columns when we read the data in.

df_raw <- read_csv(
  here("data", "SeoulBikeData.csv"), 
  col_names = c(
    "Date", "Rent_Count", "Hour", 
    "Temperature", "Humidity", "Wind_Speed", 
    "Visibility", "Dew_Point", "Solar_Radiation", 
    "Rainfall", "Snowfall", 
    "Seasons", "Holiday","Functioning_Day"),
  col_types = c("?", rep("n", 10), rep("c", 3)),
  # skip first column (Date)
  col_select = -1,
  # skip first row
  skip = 1)
 
# show raw data frame
df_raw
# check structure
str(df_raw)
## tibble [8,760 × 13] (S3: tbl_df/tbl/data.frame)
##  $ Rent_Count     : num [1:8760] 254 204 173 107 78 100 181 460 930 490 ...
##  $ Hour           : num [1:8760] 0 1 2 3 4 5 6 7 8 9 ...
##  $ Temperature    : num [1:8760] -5.2 -5.5 -6 -6.2 -6 -6.4 -6.6 -7.4 -7.6 -6.5 ...
##  $ Humidity       : num [1:8760] 37 38 39 40 36 37 35 38 37 27 ...
##  $ Wind_Speed     : num [1:8760] 2.2 0.8 1 0.9 2.3 1.5 1.3 0.9 1.1 0.5 ...
##  $ Visibility     : num [1:8760] 2000 2000 2000 2000 2000 ...
##  $ Dew_Point      : num [1:8760] -17.6 -17.6 -17.7 -17.6 -18.6 -18.7 -19.5 -19.3 -19.8 -22.4 ...
##  $ Solar_Radiation: num [1:8760] 0 0 0 0 0 0 0 0 0.01 0.23 ...
##  $ Rainfall       : num [1:8760] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Snowfall       : num [1:8760] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Seasons        : chr [1:8760] "Winter" "Winter" "Winter" "Winter" ...
##  $ Holiday        : chr [1:8760] "No Holiday" "No Holiday" "No Holiday" "No Holiday" ...
##  $ Functioning_Day: chr [1:8760] "Yes" "Yes" "Yes" "Yes" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Date = col_skip(),
##   ..   Rent_Count = col_double(),
##   ..   Hour = col_double(),
##   ..   Temperature = col_double(),
##   ..   Humidity = col_double(),
##   ..   Wind_Speed = col_double(),
##   ..   Visibility = col_double(),
##   ..   Dew_Point = col_double(),
##   ..   Solar_Radiation = col_double(),
##   ..   Rainfall = col_double(),
##   ..   Snowfall = col_double(),
##   ..   Seasons = col_character(),
##   ..   Holiday = col_character(),
##   ..   Functioning_Day = col_character()
##   .. )
# list the unique values of the three categorical variables
unique(df_raw$Seasons)
## [1] "Winter" "Spring" "Summer" "Autumn"
unique(df_raw$Holiday)
## [1] "No Holiday" "Holiday"
unique(df_raw$Functioning_Day)
## [1] "Yes" "No"

Next, we will prepare the data by converting the three categorical columns Seasons, Holiday and Functioning_Day to factors with proper levels.

For the numeric response variable Rent_Count, we will create a new binary factor variable Rent_Many (1 if Rent_Count >= 700, 0 otherwise) as instructed in the home assignment.

df <- df_raw %>% 
  mutate(
    Seasons = factor(Seasons, levels = c("Spring", "Summer", "Autumn", "Winter")),
    Holiday = factor(Holiday, levels = c("No Holiday", "Holiday")),
    Functioning_Day = factor(Functioning_Day , levels = c("No",  "Yes")),
    Rent_Many = if_else(Rent_Count >= 700, "1", "0") %>% factor()
  ) %>% 
  relocate(Rent_Many, .after = Rent_Count)

# save column locations for each variable types
lst_cols <- list(
  y = 1:2,
  x_num = 3:11,
  x_cat = 12:14
)

# show data frame - only the columns of interest
df[, c(lst_cols$y, lst_cols$x_cat)]
# check structure - only the columns of interest
str(df[, c(lst_cols$y, lst_cols$x_cat)])
## tibble [8,760 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Rent_Count     : num [1:8760] 254 204 173 107 78 100 181 460 930 490 ...
##  $ Rent_Many      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
##  $ Seasons        : Factor w/ 4 levels "Spring","Summer",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Holiday        : Factor w/ 2 levels "No Holiday","Holiday": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Functioning_Day: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Date = col_skip(),
##   ..   Rent_Count = col_double(),
##   ..   Hour = col_double(),
##   ..   Temperature = col_double(),
##   ..   Humidity = col_double(),
##   ..   Wind_Speed = col_double(),
##   ..   Visibility = col_double(),
##   ..   Dew_Point = col_double(),
##   ..   Solar_Radiation = col_double(),
##   ..   Rainfall = col_double(),
##   ..   Snowfall = col_double(),
##   ..   Seasons = col_character(),
##   ..   Holiday = col_character(),
##   ..   Functioning_Day = col_character()
##   .. )

Split the Data

We will use caret::createDataPartition() to create a 75/25 split of training and test sets.

set.seed(2023)

# split data
trainIndex <- createDataPartition(df$Rent_Many, p = 0.75, list = FALSE)
df_train <- df[trainIndex, ]
df_test <- df[-trainIndex, ]
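
Since the outcome Rent_Many is passed to createDataPartition(), the split is stratified on it, so the class balance should be similar in both sets. A quick sanity check (a minimal sketch; the exact proportions depend on the seed):

# compare class proportions of Rent_Many across the two sets
prop.table(table(df_train$Rent_Many))
prop.table(table(df_test$Rent_Many))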

Basic EDA

A quick EDA will be done on the training set:

  • Response variable (numeric Rent_Count and its re-coded variable Rent_Many)
  • Categorical Predictors
  • Numeric Predictors

Response Variable

# 5-number summary
summary(df_train$Rent_Count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   191.5   503.0   704.1  1062.0  3556.0
# histogram
qplot(
  df_train$Rent_Count, binwidth = 100,
  main = "Histogram of Rented Bike Count", 
  xlab = "Rented Bike Count",
  ylab = "Count")
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
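
As the warning notes, qplot() is deprecated in recent ggplot2. A non-deprecated equivalent of the call above (a sketch producing the same histogram) is:

# ggplot() equivalent of the deprecated qplot() call
ggplot(df_train, aes(Rent_Count)) +
  geom_histogram(binwidth = 100) +
  labs(
    title = "Histogram of Rented Bike Count",
    x = "Rented Bike Count",
    y = "Count")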

From the 5-number summary and the histogram, the response variable Rent_Count has a right-skewed distribution. Using a rent count of 700 as the threshold, 2655 of the 6571 training records have rent counts greater than or equal to 700.

# check balance
table(df_train$Rent_Many)
## 
##    0    1 
## 3916 2655
# boxplot
qplot(
  factor(Rent_Many), Rent_Count, 
  data = df_train, 
  geom = "boxplot",
  main = "Boxplot of Rented Bike Count by Rent_More (>700 or not)", 
  xlab = "Rented Bike Count >700?",
  ylab = "Count")

The box plots provide a side-by-side comparison and show that the group with at least 700 rents has larger variability and is highly right-skewed.

Numeric Predictors

# 5-number summary on the numeric predictors
summary(df_train[, lst_cols$x_num])
##       Hour        Temperature        Humidity       Wind_Speed   
##  Min.   : 0.00   Min.   :-17.80   Min.   : 0.00   Min.   :0.000  
##  1st Qu.: 5.00   1st Qu.:  3.50   1st Qu.:43.00   1st Qu.:1.000  
##  Median :11.00   Median : 13.80   Median :57.00   Median :1.500  
##  Mean   :11.45   Mean   : 12.88   Mean   :58.27   Mean   :1.733  
##  3rd Qu.:17.00   3rd Qu.: 22.50   3rd Qu.:74.00   3rd Qu.:2.400  
##  Max.   :23.00   Max.   : 39.30   Max.   :98.00   Max.   :7.400  
##    Visibility     Dew_Point       Solar_Radiation     Rainfall      
##  Min.   :  27   Min.   :-30.600   Min.   :0.0000   Min.   : 0.0000  
##  1st Qu.: 936   1st Qu.: -4.550   1st Qu.:0.0000   1st Qu.: 0.0000  
##  Median :1696   Median :  5.200   Median :0.0100   Median : 0.0000  
##  Mean   :1434   Mean   :  4.086   Mean   :0.5698   Mean   : 0.1541  
##  3rd Qu.:2000   3rd Qu.: 14.700   3rd Qu.:0.9400   3rd Qu.: 0.0000  
##  Max.   :2000   Max.   : 26.800   Max.   :3.5200   Max.   :35.0000  
##     Snowfall      
##  Min.   :0.00000  
##  1st Qu.:0.00000  
##  Median :0.00000  
##  Mean   :0.07486  
##  3rd Qu.:0.00000  
##  Max.   :8.80000
# histogram
hist.data.frame(df_train[, lst_cols$x_num])

# correlation plots
corrplot(
  cor(df_train[, c(1, lst_cols$x_num)]),
  type = "lower",
  method = "square",
  addCoef.col = "white",
  cl.ratio = 0.2, tl.srt = 45,
  title = "Correlation amoung numeric predictors and rent count"
  )

Some observations include:

  • Hour is uniformly distributed, as expected for hourly records
  • Temperature and Humidity are roughly normal
  • Wind_Speed is slightly right-skewed and Dew_Point is slightly left-skewed
  • Visibility, Solar_Radiation, Rainfall and Snowfall are all heavily skewed, which reflects the local climate: most days are sunny with no precipitation
  • Temperature and Dew_Point are highly correlated (quantified in the snippet below)
  • Humidity also shows correlation with Visibility, Dew_Point and Solar_Radiation
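
Since Temperature and Dew_Point appear strongly collinear in the correlation plot, the pairwise correlation can be quantified directly (a minimal check; the exact value depends on the training split):

# pairwise correlation between the two collinear predictors
cor(df_train$Temperature, df_train$Dew_Point)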

Categorical Predictors

# one-way table on the categorical predictors
summary(df_train[, lst_cols$x_cat])
##    Seasons           Holiday     Functioning_Day
##  Spring:1670   No Holiday:6251   No : 222       
##  Summer:1654   Holiday   : 320   Yes:6349       
##  Autumn:1637                                    
##  Winter:1610
# two-way contingency table
table(df$Functioning_Day, df$Seasons)
##      
##       Spring Summer Autumn Winter
##   No      48      0    247      0
##   Yes   2160   2208   1937   2160
table(df$Holiday, df$Seasons)
##             
##              Spring Summer Autumn Winter
##   No Holiday   2136   2160   2064   1968
##   Holiday        72     48    120    192
table(df$Holiday, df$Functioning_Day)
##             
##                No  Yes
##   No Holiday  271 8057
##   Holiday      24  408

For Holiday and Functioning_Day, we have unbalanced data. From the two-way tables, we can see that in Summer and Winter there are no non-functioning days (Functioning_Day is always Yes). The sparseness of the categorical predictors can be visualized clearly in the following mosaic plot.

ggplot(df) +
  geom_mosaic(
    aes(x = product(Seasons, Holiday), fill = Functioning_Day)
  )
## Warning: `unite_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `unite()` instead.
## ℹ The deprecated feature was likely used in the ggmosaic package.
##   Please report the issue at <https://github.com/haleyjeppson/ggmosaic>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

To visualize how these categorical predictors relate to the rent count, faceted box plots will be very helpful.

df %>% ggplot(aes(Functioning_Day, Rent_Count)) +
  geom_boxplot() +
  geom_jitter(width = 0.05) +
  facet_grid(
    rows = vars(Holiday),
    cols = vars(Seasons)) +
  labs(
    title = "Boxplots of Rent Count by Seasons",
    x = "Functional Day"
  )

From the box plots of rent count over these categorical variables, we can see that there are fewer rents during winter (compared to other seasons) and on holidays (compared to non-holidays). Naturally, on non-functioning days (Functioning_Day = No), rent counts drop to essentially zero.

names(df_train)
##  [1] "Rent_Count"      "Rent_Many"       "Hour"            "Temperature"    
##  [5] "Humidity"        "Wind_Speed"      "Visibility"      "Dew_Point"      
##  [9] "Solar_Radiation" "Rainfall"        "Snowfall"        "Seasons"        
## [13] "Holiday"         "Functioning_Day"

Based on the findings above, we plot the following scatterplot matrix with the predictors of interest:

# pair-wise scatter plots with correlation 
ggpairs(
  df_train,
  columns = c(1, 3:8),
  mapping = ggplot2::aes(color = Holiday, alpha = 0.2),
  upper = list(continuous = wrap("cor", size = 2.5)),
  axisLabels = "none"
)

Fitting MLR Models

We will fit the following candidate models on the numeric response Rent_Count:

# candidate models
model_candidates <- c(
  ".",
  "Hour +  Temperature + Humidity + Wind_Speed",
  "Hour +  Temperature * Humidity + Wind_Speed",
  "Hour +  Temperature * Humidity + Seasons + Holiday",
  "(Hour + Temperature + Humidity)^2 + Seasons + Holiday",
  "Hour * Holiday + Temperature * Humidity"
)

# make formula for lm models
response_lm <- names(df_train)[1]
models_lm <- response_lm %&% " ~ " %&% model_candidates 
models_lm
## [1] "Rent_Count ~ ."                                                    
## [2] "Rent_Count ~ Hour +  Temperature + Humidity + Wind_Speed"          
## [3] "Rent_Count ~ Hour +  Temperature * Humidity + Wind_Speed"          
## [4] "Rent_Count ~ Hour +  Temperature * Humidity + Seasons + Holiday"   
## [5] "Rent_Count ~ (Hour + Temperature + Humidity)^2 + Seasons + Holiday"
## [6] "Rent_Count ~ Hour * Holiday + Temperature * Humidity"

For each candidate model, we fit the linear regression on the training set with 5-fold cross-validation and use the fitted model to make predictions on the test set. The helper function returns a summary of the metric scores in a data frame.

# fit the models
fit_lm <- fit_glm_models(response_lm, models_lm, df_train, df_test, "gaussian", "RMSE")
## Fit model: 'Rent_Count ~ .' 
## RMSE = 325.115109558129 
## Fit model: 'Rent_Count ~ Hour +  Temperature + Humidity + Wind_Speed' 
## RMSE = 489.316684945385 
## Fit model: 'Rent_Count ~ Hour +  Temperature * Humidity + Wind_Speed' 
## RMSE = 485.426023919223 
## Fit model: 'Rent_Count ~ Hour +  Temperature * Humidity + Seasons + Holiday' 
## RMSE = 476.981701386591 
## Fit model: 'Rent_Count ~ (Hour + Temperature + Humidity)^2 + Seasons + Holiday' 
## RMSE = 457.464578480263 
## Fit model: 'Rent_Count ~ Hour * Holiday + Temperature * Humidity' 
## RMSE = 483.11649423516
fit_lm

The model “Rent_Count ~ .” has the smallest RMSE of 325.12.
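
Rather than reading the winner off the printed log, the best candidate can also be selected programmatically from the returned summary (a small sketch using dplyr, loaded with the tidyverse):

# pick the candidate with the lowest test RMSE
fit_lm %>% slice_min(score, n = 1)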

Fitting Logistic Regression Models

We will fit the following candidate models on the 2-level categorical response Rent_Many:

# make formula for the logistic models
response_glm <- names(df_train)[2]
models_glm <- response_glm %&% " ~ " %&% model_candidates 
models_glm
## [1] "Rent_Many ~ ."                                                    
## [2] "Rent_Many ~ Hour +  Temperature + Humidity + Wind_Speed"          
## [3] "Rent_Many ~ Hour +  Temperature * Humidity + Wind_Speed"          
## [4] "Rent_Many ~ Hour +  Temperature * Humidity + Seasons + Holiday"   
## [5] "Rent_Many ~ (Hour + Temperature + Humidity)^2 + Seasons + Holiday"
## [6] "Rent_Many ~ Hour * Holiday + Temperature * Humidity"

For each candidate model, we fit the logistic regression on the training set with 5-fold cross-validation and use the fitted model to make predictions on the test set. The helper function returns a summary of the metric scores in a data frame.

# fit the models
fit_glm <- fit_glm_models(response_glm, models_glm, df_train, df_test, "binomial", "Accuracy")
## Fit model: 'Rent_Many ~ .' 
## Accuracy = 0.993147555961626 
## Fit model: 'Rent_Many ~ Hour +  Temperature + Humidity + Wind_Speed' 
## Accuracy = 0.804020100502513 
## Fit model: 'Rent_Many ~ Hour +  Temperature * Humidity + Wind_Speed' 
## Accuracy = 0.809502055733212 
## Fit model: 'Rent_Many ~ Hour +  Temperature * Humidity + Seasons + Holiday' 
## Accuracy = 0.819552306989493 
## Fit model: 'Rent_Many ~ (Hour + Temperature + Humidity)^2 + Seasons + Holiday' 
## Accuracy = 0.822750114207401 
## Fit model: 'Rent_Many ~ Hour * Holiday + Temperature * Humidity' 
## Accuracy = 0.811329374143445
fit_glm

The model “Rent_Many ~ .” has the largest accuracy of 0.99. Note that the “.” includes the numeric Rent_Count as a predictor; since Rent_Many is derived directly from Rent_Count, this largely explains the near-perfect accuracy.
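
The broom package loaded earlier can be used to inspect coefficients of any candidate. As an illustrative sketch (not part of the model selection above, and without caret's centering/scaling), one of the non-trivial candidates can be refit directly with glm() and its coefficient table tidied; the exact estimates depend on the training split:

# refit one candidate with glm() and tidy its coefficient table
fit_example <- glm(
  Rent_Many ~ Hour + Temperature * Humidity + Seasons + Holiday,
  data = df_train, family = binomial)
broom::tidy(fit_example)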

Conclusion

The MLR models and the logistic regression models give consistent results. Among the candidate models, the full main-effects model (“~ .”, without interactions or higher-order terms) performs best on the test set for both responses.