Preliminaries

Building on the insights into the relationships among the variables derived in timed task 1, the task at hand requires building a linear regression model, a logistic regression model, a comparison of each of these models with a modified counterpart, and dimensionality reduction. Before building any model, the relevant libraries must be imported, and the data set must be loaded and inspected for missing values and outliers.

#Remember to install these packages if you haven't already done so

library(pastecs) #For creating descriptive statistic summaries
library(ggplot2) #For creating histograms with more detail than plot
library(semTools) #For skewness and kurtosis
## Loading required package: lavaan
## This is lavaan 0.6-12
## lavaan is FREE software! Please report any bugs.
## 
## ###############################################################################
## This is semTools 0.5-6
## All users of R (or SEM) are invited to submit functions or ideas for functions.
## ###############################################################################
needed_packages <- c("foreign",  "lm.beta", "stargazer", "car", "ppcor", "rstatix")                      
# Extract not installed packages
not_installed <- needed_packages[!(needed_packages %in% installed.packages()[ , "Package"])]    
# Install not installed packages
if(length(not_installed)) install.packages(not_installed) 
library(foreign) #To work with SPSS data
library(lm.beta) #Will allow us to isolate the beta co-efficients
library(stargazer)#For formatting outputs/tables
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(rstatix)#Bartlett and posthoc Anova testing
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
library(ppcor)#partial correlation
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:rstatix':
## 
##     select
library(car)#Levene's test
## Loading required package: carData
options(repos="https://cran.rstudio.com" )
needed_packages <- c("foreign",  "lm.beta", "stargazer", "olsrr", "car")                      
# Extract not installed packages
not_installed <- needed_packages[!(needed_packages %in% installed.packages()[ , "Package"])]    
# Install not installed packages
if(length(not_installed)) install.packages(not_installed) 

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
## 
##     cement
## The following object is masked from 'package:datasets':
## 
##     rivers
library(car)
library(foreign) #To work with SPSS data
library(lm.beta) #Will allow us to isolate the beta co-efficients
library(stargazer)#For formatting outputs/tables
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(DescTools)
## 
## Attaching package: 'DescTools'
## The following object is masked from 'package:car':
## 
##     Recode
library(regclass)
## Loading required package: bestglm
## Loading required package: leaps
## Loading required package: VGAM
## Loading required package: stats4
## Loading required package: splines
## 
## Attaching package: 'VGAM'
## The following object is masked from 'package:DescTools':
## 
##     Rank
## The following object is masked from 'package:lmtest':
## 
##     lrtest
## The following object is masked from 'package:car':
## 
##     logit
## Loading required package: rpart
## Loading required package: randomForest
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## Important regclass change from 1.3:
## All functions that had a . in the name now have an _
## all.correlations -> all_correlations, cor.demo -> cor_demo, etc.
## 
## Attaching package: 'regclass'
## The following object is masked from 'package:DescTools':
## 
##     VIF
library(generalhoslem)
## Loading required package: reshape
library(visreg)
library(lsr)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:regclass':
## 
##     qq
## 
## Attaching package: 'caret'
## The following object is masked from 'package:VGAM':
## 
##     predictors
## The following objects are masked from 'package:DescTools':
## 
##     MAE, RMSE
# Common functions
getmode <- function(t_list) { 
  # Returns the most frequent value (the statistical mode) of a vector
  uniq_val <- unique(t_list)
  uniq_val[which.max(tabulate(match(t_list, uniq_val)))]
  }
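A quick illustration of getmode (not part of the original analysis):

#Example: the mode of a small vector
getmode(c(1, 2, 2, 3)) #returns 2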


bikeshare<- read.csv("BikeSharing.csv",header=TRUE)
#Convert the column names to lowercase to make them a bit easier to work with
colnames(bikeshare) <- tolower(colnames(bikeshare))
###Dimension of Data
dim(bikeshare)
## [1] 731  16
### Checking whether the bikeshare dataset contains missing values
any(is.na(bikeshare))
## [1] FALSE
###No missing values are present in the dataset
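As an additional check (a minimal sketch, not part of the original analysis), missing values can be counted per column and the ranges of the main numeric variables inspected for obvious outliers:

#Per-column count of missing values
colSums(is.na(bikeshare))
#Five-number summaries; extreme minima or maxima would suggest outliers
summary(bikeshare[, c("temp", "hum", "windspeed", "cnt")])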

Linear Regression

Linear regression constructs a predictive model of the relationship between a single outcome variable and a set of independent variables (predictors) by examining the variance in the outcome variable and determining how much of that variance can be attributed to the predictors.

Model 1:

The total number of bikes hired is modelled as predicted from temperature, humidity and the weather situation.

Total number of bikes hired

  • Total number of bikes hired is a ratio variable that is approximately normally distributed. The basic statistics and the evidence for normality are included in the first assignment.
  • Weather situation is a three-valued categorical variable in the biking data set (1, 2, 3). Because the variable is categorical, there is no need to test for normality. Before the tests, we simply run some basic statistics; since the variable is nominal, we report only the mode.

Correlation between the weather situation and the total number of bikes hired

  • Null hypothesis: There is no relation between the weather situation and the total number of bikes hired.
  • Alternate hypothesis: There is some relationship between the weather situation and the total number of bikes hired.

How are the predictors chosen:

  • In the first assignment, we established a relationship between temperature and the quantity of bikes hired.
  • In the previous assignment, we demonstrated that there is a correlation between humidity and the number of bikes hired.
  • Now, we will see if there is a correlation between the weather situation and the number of bikes hired.
getmode(bikeshare$weathersit)
## [1] 1

Thus, the weather situation labelled 1 is the mode of the weather situation column. Because weather situation is a nominal variable with more than two categories, we use Bartlett’s test to check homogeneity of variance, followed by a one-way ANOVA.

#Bartlett's test
bartlett.test(cnt~weathersit, bikeshare)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  cnt by weathersit
## Bartlett's K-squared = 5.3765, df = 2, p-value = 0.068

Bartlett’s test has a p-value greater than 0.05, so we fail to reject the null hypothesis of homogeneous variances. With the equal-variances assumption satisfied, we can proceed with the ANOVA test.

#ANOVA test
res <- stats::oneway.test(cnt ~ as.factor(weathersit), data = bikeshare, var.equal = TRUE)
res
## 
##  One-way analysis of means
## 
## data:  cnt and as.factor(weathersit)
## F = 40.066, num df = 2, denom df = 728, p-value < 2.2e-16

In this case, p < 0.05. As a result, we conclude that the test is statistically significant.

Reporting result of correlation test between weather situation and total number of bikes hired

  • A one-way between-groups analysis of variance (ANOVA) was conducted to explore the impact of weather situation on the total number of bikes hired. The results of the ANOVA test are statistically significant with p < 0.05. The effect size calculated using eta squared was moderate (0.09); a sketch reproducing this value follows this list.

  • The null hypothesis is rejected, and the alternate hypothesis is kept. Thus, we conclude that it is possible to use weather situation as a predictor.
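The eta squared value quoted above can be reproduced from the sums of squares of an aov fit; the sketch below is one way to do it (the original computation is not shown in this report):

#Eta squared = between-groups SS / total SS
aov_res <- aov(cnt ~ as.factor(weathersit), data = bikeshare)
ss <- summary(aov_res)[[1]][["Sum Sq"]]
ss[1] / sum(ss) #approximately 0.09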

Justification for inclusion of variables in the Linear Regression Model

  • There is a relationship between the outcome variable (cnt) and the scale independent variables (temp, hum), while the categorical variable (weathersit), entered via dummy variables, also exhibits a differential effect on the outcome variable.
  • Therefore, it has been shown that these variables contribute to the variation in the outcome variable and statistical evidence for their inclusion in the model has been obtained.

Constructing the Linear regression model

  • We now build a linear regression model with the total number of bikes hired as the outcome and temperature as a predictor, while controlling for humidity and weather situation.

  • Below, we will build the linear model and highlight some of its key features:

# Making dummy variables

# Creating labels for weather situation 
clear<-bikeshare%>%
  filter(weathersit==1)%>%
  mutate(weatherLab="Clear")

partCloud<-bikeshare%>%
  filter(weathersit==2)%>%
  mutate(weatherLab="Partly Cloudy")

precipitating<-bikeshare%>%
  filter(weathersit==3)%>%
  mutate(weatherLab="Precipitating")

bikeshare<-rbind(clear, partCloud, precipitating)
bikeshare$weatherLab<-as.factor(bikeshare$weatherLab)
contrasts(bikeshare$weatherLab)
##               Partly Cloudy Precipitating
## Clear                     0             0
## Partly Cloudy             1             0
## Precipitating             0             1
head(bikeshare)
##   instant     dteday season yr mnth holiday weekday workingday weathersit
## 1       3 2011-01-03      1  0    1       0       1          1          1
## 2       4 2011-01-04      1  0    1       0       2          1          1
## 3       5 2011-01-05      1  0    1       0       3          1          1
## 4       6 2011-01-06      1  0    1       0       4          1          1
## 5       9 2011-01-09      1  0    1       0       0          0          1
## 6      10 2011-01-10      1  0    1       0       1          1          1
##       temp    atemp      hum windspeed casual registered  cnt weatherLab
## 1 0.196364 0.189405 0.437273 0.2483090    120       1229 1349      Clear
## 2 0.200000 0.212122 0.590435 0.1602960    108       1454 1562      Clear
## 3 0.226957 0.229270 0.436957 0.1869000     82       1518 1600      Clear
## 4 0.204348 0.233209 0.518261 0.0895652     88       1518 1606      Clear
## 5 0.138333 0.116175 0.434167 0.3619500     54        768  822      Clear
## 6 0.150833 0.150888 0.482917 0.2232670     41       1280 1321      Clear
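For reference, the same labelled factor can be created in one step with factor(), which also preserves the original row order (the filter/rbind approach above re-sorts the rows by weather situation); a minimal alternative sketch:

#Equivalent labelling without splitting and re-binding the data frame
bikeshare$weatherLab <- factor(bikeshare$weathersit,
                               levels = c(1, 2, 3),
                               labels = c("Clear", "Partly Cloudy", "Precipitating"))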
#Creating the model
model1 <- lm(bikeshare$cnt ~ bikeshare$temp+bikeshare$hum+as.factor(bikeshare$weatherLab))
stargazer::stargazer(model1, type="text")
## 
## ====================================================
##                              Dependent variable:    
##                          ---------------------------
##                                      cnt            
## ----------------------------------------------------
## temp                            6,525.027***        
##                                   (300.530)         
##                                                     
## hum                             -1,070.558**        
##                                   (475.854)         
##                                                     
## weatherLab)Partly Cloudy         -400.649***        
##                                   (138.290)         
##                                                     
## weatherLab)Precipitating        -2,260.622***       
##                                   (349.282)         
##                                                     
## Constant                        2,144.466***        
##                                   (282.773)         
##                                                     
## ----------------------------------------------------
## Observations                         731            
## R2                                  0.459           
## Adjusted R2                         0.456           
## Residual Std. Error         1,429.396 (df = 726)    
## F Statistic               153.706*** (df = 4; 726)  
## ====================================================
## Note:                    *p<0.1; **p<0.05; ***p<0.01

NOTE:

  • Adjusted R^2 tells us how much of the variance in the outcome is explained by the model. Here, the adjusted R^2 is 0.46, so 46% of the variance in the outcome variable is explained by this model.

  • F-statistic: the significance of the F-statistic shows whether the predictors are related to the outcome. Above, the stars next to the F-statistic indicate p < 0.01, so there is a relationship between the predictors and the outcome. Both quantities can be read off the model summary, as sketched below.
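A minimal base R sketch extracting both quantities from the fitted model object (not part of the original report):

s <- summary(model1)
s$adj.r.squared #about 0.456
s$fstatistic    #F value with its numerator and denominator degrees of freedom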

Test assumptions on first model:

Cook’s distance

Cook’s distance is used to find influential outliers. If any observation has a Cook’s distance > 1, it is better to remove it. The plot below also marks the more conservative 4 × mean(Cook’s distance) cutoff, which flags observations for inspection rather than removal.

#Influential Outliers - Cook's distance
cooksd<-sort(cooks.distance(model1))

# plot Cook's distance
plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance")  
abline(h = 4*mean(cooksd, na.rm=T), col="red")  # add cutoff line
text(x=1:length(cooksd)+1, y=cooksd, labels=ifelse(cooksd>4*mean(cooksd, na.rm=T),names(cooksd),""), col="red")

From the above, we can see that no observation has a Cook’s distance > 1, so none are removed. Below, we identify the rows flagged by the 4 × mean cutoff and inspect their values:

influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))])  # influential row numbers
stem(influential)
## 
##   The decimal point is 2 digit(s) to the right of the |
## 
##   0 | 333333
##   2 | 96
##   4 | 93456
##   6 | 121123333
head(bikeshare[influential, ]) 
##     instant     dteday season yr mnth holiday weekday workingday weathersit
## 134     212 2011-07-31      3  0    7       0       0          0          1
## 613     442 2012-03-17      1  1    3       0       6          0          2
## 356     554 2012-07-07      3  1    7       0       6          0          1
## 556     266 2011-09-23      4  0    9       0       5          1          2
## 289     463 2012-04-07      2  1    4       0       6          0          1
## 713      90 2011-03-31      2  0    3       0       4          1          3
##         temp    atemp      hum windspeed casual registered  cnt    weatherLab
## 134 0.805833 0.729796 0.480833 0.1648130   1524       2778 4302         Clear
## 613 0.514167 0.505046 0.755833 0.1107040   3155       4681 7836 Partly Cloudy
## 356 0.861667 0.804913 0.492083 0.1635540   1448       3392 4840         Clear
## 556 0.609167 0.522125 0.972500 0.0783667    258       2137 2395 Partly Cloudy
## 289 0.437500 0.426129 0.254167 0.2748710   3252       3605 6857         Clear
## 713 0.268333 0.257575 0.918333 0.2176460    179       1506 1685 Precipitating
head(bikeshare[influential, ]$temp)
## [1] 0.805833 0.514167 0.861667 0.609167 0.437500 0.268333
head(bikeshare[influential, ]$hum)
## [1] 0.480833 0.755833 0.492083 0.972500 0.254167 0.918333
head(bikeshare[influential, ]$weathersit)
## [1] 1 2 1 2 1 3
  • The Bonferroni outlier test reports whether any observations have an outcome value that is unusual given their predictor values.

  • The below test reports the Bonferroni p-values for testing the observations to see if there are any mean-shift outliers:

#Bonferroni outlier test
car::outlierTest(model1)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##      rstudent unadjusted p-value Bonferroni p
## 545 -2.932583          0.0034675           NA
Leverage plots:

Leverage plots show the unique effect of each term in a linear model. Below are the leverage plots for all the predictors:

car::leveragePlots(model1)

Residual plots:

These plots are used to assess non-linearity, unequal error variances and outliers.

plot(model1,1)

plot(model1, 3)

hist(model1$residuals, main = "Linear model 1 -Histogram of residuals")

#Residual density plot
plot(density(resid(model1)))

car::qqPlot(model1, main="Linear model 1 - QQ Plot")

## [1] 126 545

Collinearity and tolerance:

#Calculate Collinearity
vifmodel<-car::vif(model1)
vifmodel
##                                     GVIF Df GVIF^(1/(2*Df))
## bikeshare$temp                  1.081276  1        1.039844
## bikeshare$hum                   1.641202  1        1.281094
## as.factor(bikeshare$weatherLab) 1.638598  2        1.131405
#Calculate tolerance (1/VIF); note that only the GVIF column is meaningful when inverted
1/vifmodel
##                                      GVIF  Df GVIF^(1/(2*Df))
## bikeshare$temp                  0.9248334 1.0       0.9616826
## bikeshare$hum                   0.6093096 1.0       0.7805829
## as.factor(bikeshare$weatherLab) 0.6102778 0.5       0.8838568

Conclusion for Linear Model 1:

A multiple regression analysis was conducted to determine whether temperature, humidity and the weather situation could predict the total number of bikes hired. The lm function in R automatically converts the factor variable weather situation into the corresponding dummy variables. Still, to illustrate the concept of dummy variables, weather situation is broken down into “Partly Cloudy” and “Precipitating” indicators and stored in a new column called “weatherLab”.

Examination of the histogram, normal P-P plot of standardized residuals and the scatter plots of the dependent variable against temperature, humidity and weather situation showed that some outliers existed. However, examination of the standardized residuals showed that none could be considered to have undue influence (95% within the limits of -1.96 to +1.96, and none with Cook’s distance > 1), as outlined in Field (2013).

Examination for multicollinearity showed that the tolerance and variance inflation factor measures were within acceptable levels (tolerance > 0.4, VIF < 2.5), as outlined in Tarling (2008). The scatter plot of standardized residuals showed that the data met the assumptions of homogeneity of variance and linearity. The data also meet the assumption of non-zero variances of the predictors.


Model 2

Linear Regression model 2:

The total number of bikes hired is modelled as predicted from temperature and humidity alone.

We will also attempt another model with only the numeric quantities from above and see if that model turns out to be a better fit for the data.

  • Outcome variable: Total number of bikes hired
  • Predictor variables: Temperature, Humidity

How are the predictor variables chosen:

  • The relationship between temperature and the total number of bikes hired was checked in assignment 1, and a correlation was found.
  • The relationship between humidity and the total number of bikes hired was checked in assignment 1, and a correlation was found.

In the following section, we build a linear model with the total number of bikes hired as the outcome and temperature and humidity as predictors. The model and a few of its significant characteristics are shown below:

model2 <- lm(bikeshare$cnt ~ bikeshare$hum+bikeshare$temp)
stargazer::stargazer(model2, type="text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 cnt            
## -----------------------------------------------
## hum                        -2,492.854***       
##                              (384.764)         
##                                                
## temp                       6,886.974***        
##                              (299.379)         
##                                                
## Constant                   2,657.895***        
##                              (272.423)         
##                                                
## -----------------------------------------------
## Observations                    731            
## R2                             0.427           
## Adjusted R2                    0.425           
## Residual Std. Error    1,468.676 (df = 728)    
## F Statistic          271.031*** (df = 2; 728)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Some things to note here:

  • Adjusted R^2 tells us how much of the variance in the outcome is explained by the model. Here, the adjusted R^2 is 0.43, so 43% of the variance in the outcome variable is explained by this model.
  • F-statistic: the significance of the F-statistic shows whether the predictors are related to the outcome. The stars next to the F-statistic indicate p < 0.01, so there is a relationship between the predictors and the outcome.

Test assumptions on second model:

Cook’s distance

Cook’s distance is used to find influential outliers. If any observation has a Cook’s distance > 1, it is better to remove it; as before, the plot marks the 4 × mean cutoff used to flag observations for inspection.

#Influential Outliers - Cook's distance
cooksd<-sort(cooks.distance(model2 ))
# plot Cook's distance
plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance")  
abline(h = 4*mean(cooksd, na.rm=T), col="red")  # add cutoff line
text(x=1:length(cooksd)+1, y=cooksd, labels=ifelse(cooksd>4*mean(cooksd, na.rm=T),names(cooksd),""), col="red")  # add labels

#find rows related to influential observations
influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))])  # influential row numbers
stem(influential)
## 
##   The decimal point is 2 digit(s) to the right of the |
## 
##   0 | 3393333333
##   2 | 6
##   4 | 93456
##   6 | 0112233
head(bikeshare[influential, ])
##     instant     dteday season yr mnth holiday weekday workingday weathersit
## 556     266 2011-09-23      4  0    9       0       5          1          2
## 87      153 2011-06-02      2  0    6       0       4          1          1
## 26       45 2011-02-14      1  0    2       0       1          1          1
## 537     202 2011-07-21      3  0    7       0       4          1          2
## 714     106 2011-04-16      2  0    4       0       6          0          3
## 716     250 2011-09-07      3  0    9       0       3          1          3
##         temp    atemp      hum windspeed casual registered  cnt    weatherLab
## 556 0.609167 0.522125 0.972500 0.0783667    258       2137 2395 Partly Cloudy
## 87  0.715000 0.643942 0.305000 0.2922870    736       4232 4968         Clear
## 26  0.415000 0.398350 0.375833 0.4179080    208       1705 1913         Clear
## 537 0.815000 0.826371 0.691250 0.2220210    632       3152 3784 Partly Cloudy
## 714 0.430833 0.425492 0.888333 0.3408080    121        674  795 Precipitating
## 716 0.599167 0.544229 0.917083 0.0970208    118       1878 1996 Precipitating
# influential observations.
head(bikeshare[influential, ]$cnt)  
## [1] 2395 4968 1913 3784  795 1996
head(bikeshare[influential, ]$temp)  
## [1] 0.609167 0.715000 0.415000 0.815000 0.430833 0.599167
head(bikeshare[influential, ]$hum)  
## [1] 0.972500 0.305000 0.375833 0.691250 0.888333 0.917083
car::outlierTest(model2 ) 
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##      rstudent unadjusted p-value Bonferroni p
## 712 -3.277679          0.0010964      0.80147

An examination of standardized residuals was performed to identify any outliers. The Bonferroni outlier test reports whether any observation has an outcome value that is unusual given its predictor values; here, no studentized residual is significant after the Bonferroni adjustment (largest |rstudent| = 3.28, Bonferroni p = 0.80).

Leverage points are observations with x values that are significantly different from the mean of x.

# leverage plots
car::leveragePlots(model2 ) 

In a well-fitting regression model, the variation around the line is constant along the line. The leverage plots demonstrate this to some extent, although some points show far greater variation than others.

Homoscedasticity means that the data points have roughly the same scatter about the fitted line. It is essential to check because heteroscedasticity increases bias and the probability of a Type I error.
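As a formal complement to the visual checks, a Breusch-Pagan test from the lmtest package loaded earlier can be applied (a minimal sketch, not part of the original analysis); a small p-value would indicate heteroscedasticity:

#Breusch-Pagan test: H0 is constant error variance
lmtest::bptest(model2)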

#Assess homoscedasticity 
plot(model2, 1)
plot(model2, 3)

#Create histogram and  density plot of the residuals
plot(density(resid(model2)))

#The plot appears to skew left, further investigation is required.

#Create a QQ plot
car::qqPlot(model2, main="QQ Plot") #qq plot for studentized resid

## [1] 545 712

Collinearity occurs when two or more independent variables contain strongly redundant information. Conducting multiple linear regression with collinear variables will produce invalid results.

#Calculate Collinearity
vifmodel<-car::vif(model2)
vifmodel
##  bikeshare$hum bikeshare$temp 
##       1.016384       1.016384
#Calculate tolerance
1/vifmodel
##  bikeshare$hum bikeshare$temp 
##      0.9838804      0.9838804

Conclusion for linear model 2:

A multiple regression analysis was conducted to determine whether temperature and humidity could predict the total number of bikes hired. Examination of the histogram, normal P-P plot of standardized residuals and the scatter plots of the dependent variable against temperature and humidity showed that some outliers existed. However, examination of the standardized residuals showed that none could be considered to have undue influence (95% within the limits of -1.96 to +1.96, and none with Cook’s distance > 1), as outlined in Field (2013).

Examination for multicollinearity showed that the tolerance and variance inflation factor measures were within acceptable levels (tolerance > 0.4, VIF < 2.5), as outlined in Tarling (2008). The scatter plot of standardized residuals showed that the data met the assumptions of homogeneity of variance and linearity. The data also meet the assumption of non-zero variances of the predictors.

Comparing the results of the two models

stargazer::stargazer(list(model1, model2), type = 'text')
## 
## ==========================================================================
##                                         Dependent variable:               
##                          -------------------------------------------------
##                                                 cnt                       
##                                    (1)                      (2)           
## --------------------------------------------------------------------------
## temp                           6,525.027***             6,886.974***      
##                                 (300.530)                (299.379)        
##                                                                           
## hum                            -1,070.558**            -2,492.854***      
##                                 (475.854)                (384.764)        
##                                                                           
## weatherLab)Partly Cloudy       -400.649***                                
##                                 (138.290)                                 
##                                                                           
## weatherLab)Precipitating      -2,260.622***                               
##                                 (349.282)                                 
##                                                                           
## Constant                       2,144.466***             2,657.895***      
##                                 (282.773)                (272.423)        
##                                                                           
## --------------------------------------------------------------------------
## Observations                       731                      731           
## R2                                0.459                    0.427          
## Adjusted R2                       0.456                    0.425          
## Residual Std. Error        1,429.396 (df = 726)     1,468.676 (df = 728)  
## F Statistic              153.706*** (df = 4; 726) 271.031*** (df = 2; 728)
## ==========================================================================
## Note:                                          *p<0.1; **p<0.05; ***p<0.01

We look at two important factors here: the adjusted R^2 and the F-statistic. The adjusted R^2 is higher for the first model than for the second, so the first model explains more of the variance in the outcome variable. The F-statistic is significant for both models. The first model also has a lower residual standard error than the second. Thus, the first model is a better fit than the second model; a formal nested comparison is sketched below.
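Because model 2 is model 1 with the weather dummies removed, the two models are nested and can also be compared with a formal F-test (a minimal sketch, not part of the original comparison); a significant result favours retaining the weather situation:

#Nested-model comparison: does adding the weather dummies significantly reduce the RSS?
anova(model2, model1)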


Logistic regression:

We create a new BusyDay variable, which is modelled as predicted from the weather situation, temperature and humidity.

Defining a busy day: a day where the total number of bikes hired is above the average total number of bikes hired across all days.

How the predictors are chosen:

  • For the weather situation, we established in the linear regression part that it is correlated with the total number of bikes hired. There is no specific test for a correlation between a categorical outcome and a numerical predictor (the point-biserial correlation applies only when the outcome has exactly two categories).
  • Regression itself can be used to assess the association between a categorical outcome and numerical predictors. The total number of bikes hired per day depends on humidity and temperature.
#Creating the busy day column
bikedata <- bikeshare
avg_cnt <- mean(bikedata$cnt)
bikedata$BusyDay[bikedata$cnt>avg_cnt] <- 1
bikedata$BusyDay[bikedata$cnt<=avg_cnt] <- 0
#Split into train and test set
#Note: two independent splits are used below. createDataPartition() defines the
#held-out test set, while the sample() split defines the training set, so some
#rows may appear in both; a single split would be cleaner.
train_split <- caret::createDataPartition(bikedata$BusyDay, p = 0.75, list = FALSE)

train <- bikedata[train_split, ]
test <- bikedata[-train_split, ]

set.seed(42)
default_idx <- sample(nrow(bikedata), ceiling(nrow(bikedata) / 2))
train <- bikedata[-default_idx, ]

table(train$BusyDay)
## 
##   0   1 
## 178 187

Constructing and visualizing the model:

We construct a logistic regression model with BusyDay as the outcome and temperature, humidity and the weather situation as predictors. Since the weather situation is categorical with more than two categories, it enters the model as two dummy-coded variables. We visualize the model using visreg.

#Build the model
logmodel <- glm(BusyDay ~ temp+hum+weatherLab, data = train,  family = binomial)
visreg::visreg(logmodel, xvar = "temp", by = "hum", scale = "response")

visreg::visreg(logmodel, xvar = "temp", by = "hum", scale = "linear")

Summary of model:

summary(logmodel)
## 
## Call:
## glm(formula = BusyDay ~ temp + hum + weatherLab, family = binomial, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7987  -0.7112   0.2691   0.7467   2.0360  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -2.5906     0.6950  -3.728 0.000193 ***
## temp                      8.7306     0.9631   9.065  < 2e-16 ***
## hum                      -1.8537     1.2514  -1.481 0.138532    
## weatherLabPartly Cloudy  -0.9469     0.3401  -2.784 0.005368 ** 
## weatherLabPrecipitating -16.4868   866.4841  -0.019 0.984819    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 505.78  on 364  degrees of freedom
## Residual deviance: 340.51  on 360  degrees of freedom
## AIC: 350.51
## 
## Number of Fisher Scoring iterations: 15

As seen above, temperature is highly significant (p < 0.001) and the partly cloudy indicator is significant (p < 0.01), while humidity is not significant in this model (p = 0.14). AIC stands for Akaike Information Criterion. It indicates how well the model fits the data and is primarily used to compare models, as sketched below.
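Since AIC is only informative relative to another model fit to the same data, one natural baseline is an intercept-only model (a minimal sketch; the null model here is illustrative, not part of the original analysis):

#Compare against an intercept-only baseline; lower AIC is better
null_model <- glm(BusyDay ~ 1, data = train, family = binomial)
AIC(null_model, logmodel)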

Other tests: Chi-sq test:

lmtest::lrtest(logmodel)
## Likelihood ratio test
## 
## Model 1: BusyDay ~ temp + hum + weatherLab
## Model 2: BusyDay ~ 1
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   5 -170.26                         
## 2   1 -252.89 -4 165.26  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The likelihood-ratio chi-square has 4 degrees of freedom and a value of 165.26 (computed as 2 × (−170.26 − (−252.89))). The result is statistically significant, so there is a significant difference between the fitted model and the baseline (intercept-only) model.

DescTools::PseudoR2(logmodel, which="all")
##        McFadden     McFaddenAdj        CoxSnell      Nagelkerke   AldrichNelson 
##       0.3267535       0.3069818       0.3641406       0.4856193       0.3116635 
## VeallZimmermann           Efron McKelveyZavoina            Tjur             AIC 
##       0.5365798       0.4104952       0.7149562       0.4012614     350.5116038 
##             BIC          logLik         logLik0              G2 
##     370.0110906    -170.2558019    -252.8877508     165.2638977
stargazer::stargazer(logmodel, type="text")
## 
## ===================================================
##                             Dependent variable:    
##                         ---------------------------
##                                   BusyDay          
## ---------------------------------------------------
## temp                             8.731***          
##                                   (0.963)          
##                                                    
## hum                               -1.854           
##                                   (1.251)          
##                                                    
## weatherLabPartly Cloudy          -0.947***         
##                                   (0.340)          
##                                                    
## weatherLabPrecipitating           -16.487          
##                                  (866.484)         
##                                                    
## Constant                         -2.591***         
##                                   (0.695)          
##                                                    
## ---------------------------------------------------
## Observations                        365            
## Log Likelihood                   -170.256          
## Akaike Inf. Crit.                 350.512          
## ===================================================
## Note:                   *p<0.1; **p<0.05; ***p<0.01

The exponentiated coefficients for all the predictors in the logistic regression model are given below:

#Exponentiate the co-efficients
exp(coefficients(logmodel))
##             (Intercept)                    temp                     hum 
##            7.497816e-02            6.189211e+03            1.566495e-01 
## weatherLabPartly Cloudy weatherLabPrecipitating 
##            3.879485e-01            6.916012e-08

The Odds ratios of the predictors are shown below:

## odds ratios 
cbind(Estimate=round(coef(logmodel),4),
OR=round(exp(coef(logmodel)),4))
##                         Estimate        OR
## (Intercept)              -2.5906    0.0750
## temp                      8.7306 6189.2109
## hum                      -1.8537    0.1566
## weatherLabPartly Cloudy  -0.9469    0.3879
## weatherLabPrecipitating -16.4868    0.0000
  • An odds ratio greater than 1 means the predictor increases the odds of the outcome occurring; an odds ratio below 1 means it decreases them. From the above, only temperature has an odds ratio greater than 1 (≈ 6189 per unit of the normalized temperature scale).
  • Humidity and both weather dummies have odds ratios below 1, so they reduce the likelihood of a busy day. A sketch translating these odds ratios into predicted probabilities follows this list.
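To make the odds ratios concrete, the fitted model can be asked for predicted probabilities at two illustrative temperatures, holding humidity at its training mean and the weather clear (the chosen values are hypothetical):

#Predicted probability of a busy day at two normalized temperatures
newd <- data.frame(temp = c(0.3, 0.6),
                   hum = mean(train$hum),
                   weatherLab = factor("Clear", levels = levels(train$weatherLab)))
predict(logmodel, newdata = newd, type = "response")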

Confusion matrix:

We have trained the model on the training dataset. Now we can run it on the test dataset and check how it is performing. The model is run with the test data and the results are displayed in a confusion matrix:

#Confusion matrix
regclass::confusion_matrix(logmodel, test)
##          Predicted 0 Predicted 1 Total
## Actual 0          68          21    89
## Actual 1          16          77    93
## Total             84          98   182

The accuracy of the model = correct predictions / total number of observations = (68 + 77)/182 = 145/182 ≈ 0.797. This shows that the model accuracy is about 80% (see the sketch below).
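A minimal sketch computing the same accuracy directly from the held-out predictions at the 0.5 threshold:

#Accuracy on the test set at a 0.5 probability threshold
pred_test <- ifelse(predict(logmodel, newdata = test, type = "response") > 0.5, 1, 0)
mean(pred_test == test$BusyDay) #about 0.80 for the matrix shown above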

#Confusion matrix on the training data at a 0.5 probability threshold
threshold=0.5
predicted_values<-ifelse(predict(logmodel,type="response")>threshold,1,0)
actual_values<-logmodel$y
cm<-table(predicted_values,actual_values)
cm
##                 actual_values
## predicted_values   0   1
##                0 141  42
##                1  37 145

The sensitivity and specificity of the model can be calculated from this confusion matrix as shown:

sensitivity <- cm[4] / sum(cm[4], cm[3]) # TP / (TP + FN)
specificity <- cm[1] / sum(cm[1], cm[2]) # TN / (TN + FP)
sensitivity
## [1] 0.7754011
specificity
## [1] 0.7921348

Hosmer and Lemeshow test:

This is a “goodness of fit” test for the model. A large p-value (> 0.05) indicates that the model is a good fit for the data.

generalhoslem::logitgof(train$BusyDay, fitted(logmodel))
## 
##  Hosmer and Lemeshow test (binary model)
## 
## data:  train$BusyDay, fitted(logmodel)
## X-squared = 26.408, df = 8, p-value = 0.0008941

Here, the p-value is about 0.0009, meaning that there may be some issues with the fit of the model.

Collinearity and Tolerance:

vifmodel<-car::vif(logmodel)
vifmodel
##                GVIF Df GVIF^(1/(2*Df))
## temp       1.212502  1        1.101137
## hum        1.771815  1        1.331095
## weatherLab 1.524476  2        1.111169
1/vifmodel
##                 GVIF  Df GVIF^(1/(2*Df))
## temp       0.8247410 1.0       0.9081525
## hum        0.5643931 1.0       0.7512610
## weatherLab 0.6559633 0.5       0.8999531

Conclusion for Logistic Regression model 1:

A binomial logistic regression analysis was conducted with busy day or not as the outcome variable and temperature, humidity and weather situation as predictors. The data met the assumption of independent observations. Examination for multicollinearity showed that the tolerance and variance inflation factor measures were within acceptable levels (tolerance > 0.4, VIF < 2.5), as outlined in Tarling (2008). The Hosmer-Lemeshow goodness-of-fit statistic has a small p-value, indicating some issues with the assumption of linearity between the independent variables and the log odds of the model (χ2(8) = 26.408, p = 0.0009).