Building on the insights about the relationships among the variables derived in timed task 1, the task at hand requires building a linear regression model, a logistic regression model, a comparison of each of these models with a modified counterpart, and dimensionality reduction. Before building any model, the relevant libraries must be imported, and the dataset must be imported and inspected for missing values and outliers.
#Remember to install these packages if you haven't already done so
options(repos = "https://cran.rstudio.com")
needed_packages <- c("pastecs", "ggplot2", "semTools", "foreign", "lm.beta",
                     "stargazer", "car", "ppcor", "rstatix", "olsrr", "lmtest",
                     "DescTools", "regclass", "generalhoslem", "visreg", "lsr",
                     "caret", "dplyr")
# Extract the packages that are not yet installed
not_installed <- needed_packages[!(needed_packages %in% installed.packages()[ , "Package"])]
# Install any missing packages
if(length(not_installed)) install.packages(not_installed)
library(pastecs)        #For creating descriptive statistic summaries
library(ggplot2)        #For creating histograms with more detail than plot
library(semTools)       #For skewness and kurtosis
library(foreign)        #To work with SPSS data
library(lm.beta)        #Will allow us to isolate the beta coefficients
library(stargazer)      #For formatting outputs/tables
library(rstatix)        #Bartlett and post-hoc ANOVA testing
library(ppcor)          #Partial correlation
library(car)            #Levene's test, VIF and outlier tests
library(olsrr)          #Regression diagnostics
library(lmtest)         #Likelihood ratio and Breusch-Pagan tests
library(DescTools)      #Pseudo R-squared measures
library(regclass)       #Confusion matrices
library(generalhoslem)  #Hosmer-Lemeshow goodness-of-fit test
library(visreg)         #Visualising regression models
library(lsr)            #Effect sizes (eta squared)
library(caret)          #Train/test partitioning
library(dplyr)          #Provides %>%, filter and mutate used below
# Common functions
getmode <- function(t_list) {
uniq_val <- unique(t_list)
uniq_val[which.max(tabulate(match(t_list, uniq_val)))]
}
bikeshare <- read.csv("BikeSharing.csv", header = TRUE)
#Setting the column names to those used in the dataset but in lowercase to make life a bit easier
colnames(bikeshare) <- tolower(colnames(bikeshare))
###Dimension of the data
dim(bikeshare)
## [1] 731 16
###Checking whether the bikeshare dataset contains missing values
###(is.null() only tests whether the object itself is NULL, so any(is.na()) is used instead)
any(is.na(bikeshare))
## [1] FALSE
###No missing values are present
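As a quick sketch of the outlier inspection mentioned above (using the numeric columns temp, hum and cnt shown later in head(bikeshare)), descriptive statistics and a boxplot can be produced for the key variables:
#Descriptive statistics for the key numeric columns (pastecs is loaded above)
pastecs::stat.desc(bikeshare[, c("temp", "hum", "cnt")])
#Boxplots to eyeball potential outliers in the normalised weather measures
boxplot(bikeshare[, c("temp", "hum")], main = "Normalised temperature and humidity")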
Linear regression constructs a predictive model of the relationship between a single outcome variable and a set of independent variables (predictors) by examining the variance in the outcome variable and determining how much of that variance can be attributed to the predictors.
The total number of bikes hired can be considered to be predicted from temperature, humidity and the weather situation.
How the predictors were chosen:
getmode(bikeshare$weathersit)
## [1] 1
The weather situation labelled 1 is therefore the mode of the weathersit column. Because weather situation is a nominal variable with more than two categories, we use Bartlett's test to check homoscedasticity, followed by the ANOVA test.
#Bartlett's test
bartlett.test(cnt~weathersit, bikeshare)
##
## Bartlett test of homogeneity of variances
##
## data: cnt by weathersit
## Bartlett's K-squared = 5.3765, df = 2, p-value = 0.068
Bartlett's test has a p-value greater than 0.05, so we fail to reject the null hypothesis of homogeneity of variance. We can therefore proceed with the ANOVA test.
#ANOVA test
res <- stats::oneway.test(cnt ~ as.factor(weathersit), data = bikeshare, var.equal = TRUE)
res
##
## One-way analysis of means
##
## data: cnt and as.factor(weathersit)
## F = 40.066, num df = 2, denom df = 728, p-value < 2.2e-16
In this case the p-value < 0.05, so the test is statistically significant.
A one-way between-groups analysis of variance (ANOVA) was conducted to explore the impact of weather situation on the total number of bikes hired. The results of the ANOVA test are statistically significant with p < 0.05. The effect size calculated using eta squared was moderate (0.09).
The null hypothesis is rejected and the alternative hypothesis is retained. Thus, we conclude that weather situation can be used as a predictor.
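The eta squared reported above can be reproduced with the already-loaded lsr package; this is a sketch rather than part of the original output, and aov_res is a new helper object:
#Effect size for the weather-situation ANOVA (eta.sq should be close to the 0.09 reported above)
aov_res <- aov(cnt ~ as.factor(weathersit), data = bikeshare)
lsr::etaSquared(aov_res)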
We now build a linear regression model with the total number of bikes hired as the outcome and temperature as a predictor, while controlling for humidity and weather situation.
Below, we build the linear model and highlight some of its key features:
# Making dummy variables
# Creating labels for weather situation
clear <- bikeshare %>%
  filter(weathersit == 1) %>%
  mutate(weatherLab = "Clear")
partCloud <- bikeshare %>%
  filter(weathersit == 2) %>%
  mutate(weatherLab = "Partly Cloudy")
precipitating <- bikeshare %>%
  filter(weathersit == 3) %>%
  mutate(weatherLab = "Precipitating")
bikeshare <- rbind(clear, partCloud, precipitating)
bikeshare$weatherLab <- as.factor(bikeshare$weatherLab)
contrasts(bikeshare$weatherLab)
## Partly Cloudy Precipitating
## Clear 0 0
## Partly Cloudy 1 0
## Precipitating 0 1
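For reference, a minimal sketch of an equivalent way to create the same labels without the filter/rbind round trip (which reorders the rows of the data frame):
#Recode weathersit directly into labelled factor levels
bikeshare$weatherLab <- factor(bikeshare$weathersit, levels = c(1, 2, 3),
                               labels = c("Clear", "Partly Cloudy", "Precipitating"))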
head(bikeshare)
## instant dteday season yr mnth holiday weekday workingday weathersit
## 1 3 2011-01-03 1 0 1 0 1 1 1
## 2 4 2011-01-04 1 0 1 0 2 1 1
## 3 5 2011-01-05 1 0 1 0 3 1 1
## 4 6 2011-01-06 1 0 1 0 4 1 1
## 5 9 2011-01-09 1 0 1 0 0 0 1
## 6 10 2011-01-10 1 0 1 0 1 1 1
## temp atemp hum windspeed casual registered cnt weatherLab
## 1 0.196364 0.189405 0.437273 0.2483090 120 1229 1349 Clear
## 2 0.200000 0.212122 0.590435 0.1602960 108 1454 1562 Clear
## 3 0.226957 0.229270 0.436957 0.1869000 82 1518 1600 Clear
## 4 0.204348 0.233209 0.518261 0.0895652 88 1518 1606 Clear
## 5 0.138333 0.116175 0.434167 0.3619500 54 768 822 Clear
## 6 0.150833 0.150888 0.482917 0.2232670 41 1280 1321 Clear
#Creating the model
model1 <- lm(bikeshare$cnt ~ bikeshare$temp+bikeshare$hum+as.factor(bikeshare$weatherLab))
stargazer::stargazer(model1, type="text")
##
## ====================================================
## Dependent variable:
## ---------------------------
## cnt
## ----------------------------------------------------
## temp 6,525.027***
## (300.530)
##
## hum -1,070.558**
## (475.854)
##
## weatherLabPartly Cloudy -400.649***
## (138.290)
##
## weatherLabPrecipitating -2,260.622***
## (349.282)
##
## Constant 2,144.466***
## (282.773)
##
## ----------------------------------------------------
## Observations 731
## R2 0.459
## Adjusted R2 0.456
## Residual Std. Error 1,429.396 (df = 726)
## F Statistic 153.706*** (df = 4; 726)
## ====================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
NOTE:
Adjusted R^2 - Tells us how much of the variance in the outcome is explained by the model. Here the adjusted R^2 is 0.46, so 46% of the variance in the outcome variable is explained by this model.
F-statistic - The significance of the F-statistic shows whether the predictors are related to the outcome. Above, the stars next to the F-statistic show that its p-value is < 0.01, so there is some relationship between the predictors and the outcome.
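As a minimal sketch (using model1 from above), the same fit statistics can also be extracted programmatically:
#Adjusted R-squared and F-statistic from the model summary
summary(model1)$adj.r.squared
summary(model1)$fstatistic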
Cook's distance: Cook's distance is used to find influential outliers. If any observation has a Cook's distance > 1, it is better to remove it.
#Influential Outliers - Cook's distance
cooksd<-sort(cooks.distance(model1))
# plot Cook's distance
plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance")
abline(h = 4*mean(cooksd, na.rm=T), col="red") # add cutoff line
text(x=1:length(cooksd)+1, y=cooksd, labels=ifelse(cooksd>4*mean(cooksd, na.rm=T),names(cooksd),""), col="red")
From the above, we can see that no observation has a Cook’s distance >1. Thus we will not remove any of them. Below, we will check if there are any influential values for any column:
influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))]) # influential row numbers
stem(influential)
##
## The decimal point is 2 digit(s) to the right of the |
##
## 0 | 333333
## 2 | 96
## 4 | 93456
## 6 | 121123333
head(bikeshare[influential, ])
## instant dteday season yr mnth holiday weekday workingday weathersit
## 134 212 2011-07-31 3 0 7 0 0 0 1
## 613 442 2012-03-17 1 1 3 0 6 0 2
## 356 554 2012-07-07 3 1 7 0 6 0 1
## 556 266 2011-09-23 4 0 9 0 5 1 2
## 289 463 2012-04-07 2 1 4 0 6 0 1
## 713 90 2011-03-31 2 0 3 0 4 1 3
## temp atemp hum windspeed casual registered cnt weatherLab
## 134 0.805833 0.729796 0.480833 0.1648130 1524 2778 4302 Clear
## 613 0.514167 0.505046 0.755833 0.1107040 3155 4681 7836 Partly Cloudy
## 356 0.861667 0.804913 0.492083 0.1635540 1448 3392 4840 Clear
## 556 0.609167 0.522125 0.972500 0.0783667 258 2137 2395 Partly Cloudy
## 289 0.437500 0.426129 0.254167 0.2748710 3252 3605 6857 Clear
## 713 0.268333 0.257575 0.918333 0.2176460 179 1506 1685 Precipitating
head(bikeshare[influential, ]$temp)
## [1] 0.805833 0.514167 0.861667 0.609167 0.437500 0.268333
head(bikeshare[influential, ]$hum)
## [1] 0.480833 0.755833 0.492083 0.972500 0.254167 0.918333
head(bikeshare[influential, ]$weathersit)
## [1] 1 2 1 2 1 3
The Bonferroni outlier test reports whether any outcome values are unusual given their predictor values.
The test below reports Bonferroni p-values for testing whether any observations are mean-shift outliers:
#Bonferroni outlier test
car::outlierTest(model1)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 545 -2.932583 0.0034675 NA
#Leverage plots:
#Leverage plots show the unique effect of a term in a Linear model. The below are the leverage plots for all the predictors:
car::leveragePlots(model1)
These plots are used to assess non-linearity, unequal error variances and outliers.
plot(model1, 1) #Residuals vs fitted values
plot(model1, 3) #Scale-location plot
hist(model1$residuals, main = "Linear model 1 -Histogram of residuals")
#Residual density plot
plot(density(resid(model1)))
car::qqPlot(model1, main="Linear model 1 - QQ Plot")
## [1] 126 545
#Calculate Collinearity
vifmodel<-car::vif(model1)
vifmodel
## GVIF Df GVIF^(1/(2*Df))
## bikeshare$temp 1.081276 1 1.039844
## bikeshare$hum 1.641202 1 1.281094
## as.factor(bikeshare$weatherLab) 1.638598 2 1.131405
1/vifmodel
## GVIF Df GVIF^(1/(2*Df))
## bikeshare$temp 0.9248334 1.0 0.9616826
## bikeshare$hum 0.6093096 1.0 0.7805829
## as.factor(bikeshare$weatherLab) 0.6102778 0.5 0.8838568
A multiple regression analysis was conducted to determine whether temperature, humidity and the weather situation could predict the total number of bikes hired. The lm function in R automatically converts the factor variable weather situation into the corresponding dummy variables, but to illustrate the concept of dummy variables, weather situation was broken down into "Partly Cloudy" and "Precipitating" indicators and stored in a new column called weatherLab.
Examination of the histogram, the normal P-P plot of standardized residuals and the scatter plot of the dependent variable against temperature, humidity and weather situation showed that some outliers existed. However, examination of the standardized residuals showed that none could be considered to have undue influence (95% within the limits of -1.96 to +1.96, and none with Cook's distance > 1), as outlined in Field (2013).
Examination for multicollinearity showed that the tolerance and variance inflation factor measures were within acceptable levels (tolerance > 0.4, VIF < 2.5), as outlined in Tarling (2008). The scatter plot of standardized residuals showed that the data met the assumptions of homogeneity of variance and linearity. The data also meet the assumption of non-zero variances of the predictors.
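The lm.beta package was loaded for exactly this purpose; a sketch of how it could report standardized (beta) coefficients, which put the predictors on a common scale for comparison:
#Standardized beta coefficients for model1
lm.beta::lm.beta(model1)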
The total number of bikes hired can be considered to be predicted from temperature and humidity alone.
We will also attempt a second model using only the numeric predictors from above and see whether it turns out to be a better fit for the data.
In the following section, we build a linear model with the total number of bikes hired as the outcome and temperature and humidity as predictors, and highlight some of its key features below:
model2 <- lm(bikeshare$cnt ~ bikeshare$hum+bikeshare$temp)
stargazer::stargazer(model2, type="text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## cnt
## -----------------------------------------------
## hum -2,492.854***
## (384.764)
##
## temp 6,886.974***
## (299.379)
##
## Constant 2,657.895***
## (272.423)
##
## -----------------------------------------------
## Observations 731
## R2 0.427
## Adjusted R2 0.425
## Residual Std. Error 1,468.676 (df = 728)
## F Statistic 271.031*** (df = 2; 728)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Some things to note here: Adjusted R^2 - Tells us how much of the variance in the outcome is explained by the model. Here the adjusted R^2 is 0.43, so 43% of the variance in the outcome variable is explained by this model. F-statistic - The significance of the F-statistic shows whether the predictors are related to the outcome; the stars next to it show that its p-value is < 0.01, so there is some relationship between the predictors and the outcome.
Cook's distance: Cook's distance is used to find influential outliers. If any observation has a Cook's distance > 1, it is better to remove it.
#Influential Outliers - Cook's distance
cooksd<-sort(cooks.distance(model2 ))
# plot Cook's distance
plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance")
abline(h = 4*mean(cooksd, na.rm=T), col="red") # add cutoff line
text(x=1:length(cooksd)+1, y=cooksd, labels=ifelse(cooksd>4*mean(cooksd, na.rm=T),names(cooksd),""), col="red") # add labels
#find rows related to influential observations
influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))]) # influential row numbers
stem(influential)
##
## The decimal point is 2 digit(s) to the right of the |
##
## 0 | 3393333333
## 2 | 6
## 4 | 93456
## 6 | 0112233
head(bikeshare[influential, ])
## instant dteday season yr mnth holiday weekday workingday weathersit
## 556 266 2011-09-23 4 0 9 0 5 1 2
## 87 153 2011-06-02 2 0 6 0 4 1 1
## 26 45 2011-02-14 1 0 2 0 1 1 1
## 537 202 2011-07-21 3 0 7 0 4 1 2
## 714 106 2011-04-16 2 0 4 0 6 0 3
## 716 250 2011-09-07 3 0 9 0 3 1 3
## temp atemp hum windspeed casual registered cnt weatherLab
## 556 0.609167 0.522125 0.972500 0.0783667 258 2137 2395 Partly Cloudy
## 87 0.715000 0.643942 0.305000 0.2922870 736 4232 4968 Clear
## 26 0.415000 0.398350 0.375833 0.4179080 208 1705 1913 Clear
## 537 0.815000 0.826371 0.691250 0.2220210 632 3152 3784 Partly Cloudy
## 714 0.430833 0.425492 0.888333 0.3408080 121 674 795 Precipitating
## 716 0.599167 0.544229 0.917083 0.0970208 118 1878 1996 Precipitating
# influential observations.
head(bikeshare[influential, ]$cnt)
## [1] 2395 4968 1913 3784 795 1996
head(bikeshare[influential, ]$temp)
## [1] 0.609167 0.715000 0.415000 0.815000 0.430833 0.599167
head(bikeshare[influential, ]$hum)
## [1] 0.972500 0.305000 0.375833 0.691250 0.888333 0.917083
car::outlierTest(model2 )
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 712 -3.277679 0.0010964 0.80147
An examination of the standardized residuals was performed on the data to identify any outliers, which revealed 165 cases of concern. The Bonferroni test determines whether the outcome variable has an unusual value given its predictor values; here, no studentized residuals were significant at Bonferroni p < 0.05.
Leverage points are observations with x values that are significantly different from the mean of x.
# leverage plots
car::leveragePlots(model2 )
In a well-fitting regression model, the variation around the line is constant along the line. The leverage plots demonstrate this to some extent, although some points show far greater variation than others.
Homoscedasticity means that points have the same scatter, with data values spread out to roughly the same extent. It is essential to take into account because heteroscedasticity increases bias and the probability of a Type I error.
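As a formal complement to the visual checks below, heteroscedasticity could also be tested with the Breusch-Pagan test from the already-loaded lmtest package; a sketch, not part of the original analysis:
#Breusch-Pagan test; p < 0.05 would suggest heteroscedasticity
lmtest::bptest(model2)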
#Assess homoscedasticity
plot(model2, 1) #Residuals vs fitted values
plot(model2, 3) #Scale-location plot
#Create a density plot of the residuals
plot(density(resid(model2)))
#The plot appears to skew left, further investigation is required.
#Create a QQ plot for the studentized residuals
car::qqPlot(model2, main="QQ Plot") #qq plot for studentized resid
## [1] 545 712
Collinearity occurs when two or more independent variables contain strongly redundant information. Conducting multiple linear regression with collinear variables will produce invalid results.
#Calculate Collinearity
vifmodel<-car::vif(model2)
vifmodel
## bikeshare$hum bikeshare$temp
## 1.016384 1.016384
#Calculate tolerance
1/vifmodel
## bikeshare$hum bikeshare$temp
## 0.9838804 0.9838804
A multiple regression analysis was conducted to determine whether temperature and humidity could predict the total number of bikes hired. Examination of the histogram, the normal P-P plot of standardized residuals and the scatter plot of the dependent variable against temperature and humidity showed that some outliers existed. However, examination of the standardized residuals showed that none could be considered to have undue influence (95% within the limits of -1.96 to +1.96, and none with Cook's distance > 1), as outlined in Field (2013).
Examination for multicollinearity showed that the tolerance and variance inflation factor measures were within acceptable levels (tolerance > 0.4, VIF < 2.5), as outlined in Tarling (2008). The scatter plot of standardized residuals showed that the data met the assumptions of homogeneity of variance and linearity. The data also meet the assumption of non-zero variances of the predictors.
stargazer::stargazer(list(model1, model2), type = 'text')
##
## ==========================================================================
## Dependent variable:
## -------------------------------------------------
## cnt
## (1) (2)
## --------------------------------------------------------------------------
## temp 6,525.027*** 6,886.974***
## (300.530) (299.379)
##
## hum -1,070.558** -2,492.854***
## (475.854) (384.764)
##
## weatherLabPartly Cloudy -400.649***
## (138.290)
##
## weatherLabPrecipitating -2,260.622***
## (349.282)
##
## Constant 2,144.466*** 2,657.895***
## (282.773) (272.423)
##
## --------------------------------------------------------------------------
## Observations 731 731
## R2 0.459 0.427
## Adjusted R2 0.456 0.425
## Residual Std. Error 1,429.396 (df = 726) 1,468.676 (df = 728)
## F Statistic 153.706*** (df = 4; 726) 271.031*** (df = 2; 728)
## ==========================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
We will look at two important factors here: the adjusted R^2 and the F-statistic. The adjusted R^2 is higher for the first model than for the second, so the first model explains more of the variance of the outcome variable. The F-statistic is significant for both models. The first model also has a lower residual standard error than the second. We can therefore say that the first model is a better fit than the second; an AIC comparison, sketched below, offers a further check.
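As an additional check not in the original write-up, the two fits could also be compared on AIC, where a lower value indicates a better trade-off between fit and complexity:
#Compare the two linear models on AIC (lower is better)
AIC(model1, model2)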
We created another variable, BusyDay, which can be considered to be predicted from the weather situation, temperature and humidity.
Defining a busy day: a day where the total number of bikes hired is above the average daily total number of bikes hired.
Outcome variable: busy day or not
Predictors: humidity, weather situation, temperature
#Creating the busy day column
bikedata <- bikeshare
avg_cnt <- mean(bikedata$cnt)
bikedata$BusyDay[bikedata$cnt > avg_cnt] <- 1
bikedata$BusyDay[bikedata$cnt <= avg_cnt] <- 0
#Split into train and test sets
#The test set is the 25% hold-out from createDataPartition; the model is then
#fitted on a random half of the data, so the two sets can overlap slightly
train_split <- caret::createDataPartition(bikedata$BusyDay, p = 0.75, list = FALSE)
test <- bikedata[-train_split, ]
set.seed(42)
default_idx <- sample(nrow(bikedata), ceiling(nrow(bikedata) / 2))
train <- bikedata[-default_idx, ]
table(train$BusyDay)
##
## 0 1
## 178 187
We construct a logistic regression model with BusyDay as the outcome and temperature, humidity and the weather situation as predictors. Since the weather situation is categorical with more than two categories, we use two coded dummy variables instead. We visualize the model using visreg.
#Build the model
logmodel <- glm(BusyDay ~ temp+hum+weatherLab, data = train, family = binomial)
visreg::visreg(logmodel, xvar = "temp", by = "hum", scale = "response")
visreg::visreg(logmodel, xvar = "temp", by = "hum", scale = "linear")
summary(logmodel)
##
## Call:
## glm(formula = BusyDay ~ temp + hum + weatherLab, family = binomial,
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7987 -0.7112 0.2691 0.7467 2.0360
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5906 0.6950 -3.728 0.000193 ***
## temp 8.7306 0.9631 9.065 < 2e-16 ***
## hum -1.8537 1.2514 -1.481 0.138532
## weatherLabPartly Cloudy -0.9469 0.3401 -2.784 0.005368 **
## weatherLabPrecipitating -16.4868 866.4841 -0.019 0.984819
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 505.78 on 364 degrees of freedom
## Residual deviance: 340.51 on 360 degrees of freedom
## AIC: 350.51
##
## Number of Fisher Scoring iterations: 15
As seen above, temperature and the Partly Cloudy weather dummy are the significant predictors (p < 0.001 for temp, p < 0.01 for Partly Cloudy); humidity is not significant here (p = 0.139). The very large coefficient and standard error for Precipitating suggest quasi-complete separation, which is checked below. AIC stands for Akaike Information Criterion; it indicates how well the model fits the data and is primarily used to compare models.
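A minimal sketch (using the train data frame built above) to check for separation among precipitating days:
#Cross-tabulate weather label against BusyDay in the training data; if no
#precipitating day is busy, that coefficient is pushed towards minus infinity,
#which would explain the huge standard error above
with(train, table(weatherLab, BusyDay))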
lmtest::lrtest(logmodel)
## Likelihood ratio test
##
## Model 1: BusyDay ~ temp + hum + weatherLab
## Model 2: BusyDay ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 5 -170.26
## 2 1 -252.89 -4 165.26 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The likelihood ratio test has 4 degrees of freedom and a chi-square value of 165.26. The result is statistically significant, so there is a significant difference between the baseline (intercept-only) model and the model created.
DescTools::PseudoR2(logmodel, which="all")
## McFadden McFaddenAdj CoxSnell Nagelkerke AldrichNelson
## 0.3267535 0.3069818 0.3641406 0.4856193 0.3116635
## VeallZimmermann Efron McKelveyZavoina Tjur AIC
## 0.5365798 0.4104952 0.7149562 0.4012614 350.5116038
## BIC logLik logLik0 G2
## 370.0110906 -170.2558019 -252.8877508 165.2638977
stargazer::stargazer(logmodel, type="text")
##
## ===================================================
## Dependent variable:
## ---------------------------
## BusyDay
## ---------------------------------------------------
## temp 8.731***
## (0.963)
##
## hum -1.854
## (1.251)
##
## weatherLabPartly Cloudy -0.947***
## (0.340)
##
## weatherLabPrecipitating -16.487
## (866.484)
##
## Constant -2.591***
## (0.695)
##
## ---------------------------------------------------
## Observations 365
## Log Likelihood -170.256
## Akaike Inf. Crit. 350.512
## ===================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The exponentiated coefficients for all the predictors in the logistic regression model are given below:
#Exponentiate the co-efficients
exp(coefficients(logmodel))
## (Intercept) temp hum
## 7.497816e-02 6.189211e+03 1.566495e-01
## weatherLabPartly Cloudy weatherLabPrecipitating
## 3.879485e-01 6.916012e-08
The Odds ratios of the predictors are shown below:
## odds ratios
cbind(Estimate=round(coef(logmodel),4),
OR=round(exp(coef(logmodel)),4))
## Estimate OR
## (Intercept) -2.5906 0.0750
## temp 8.7306 6189.2109
## hum -1.8537 0.1566
## weatherLabPartly Cloudy -0.9469 0.3879
## weatherLabPrecipitating -16.4868 0.0000
We have trained the model on the training dataset. Now we can run it on the test dataset and check how well it performs. The model is run on the test data and the results are displayed in a confusion matrix:
#Confusion matrix
regclass::confusion_matrix(logmodel, test)
## Predicted 0 Predicted 1 Total
## Actual 0 68 21 89
## Actual 1 16 77 93
## Total 84 98 182
The accuracy of the model = correct predictions / total no. of observations = (68 + 77)/182 = 145/182 ≈ 0.797. This shows that the model accuracy is about 80% on the test set.
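A minimal sketch (using logmodel and test from above; test_pred is a new helper object) that recomputes this accuracy directly at the same 0.5 threshold:
#Test-set accuracy: proportion of correct predictions
test_pred <- ifelse(predict(logmodel, newdata = test, type = "response") > 0.5, 1, 0)
mean(test_pred == test$BusyDay)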
threshold <- 0.5
#Predictions here are on the training data (no newdata argument)
predicted_values <- ifelse(predict(logmodel, type = "response") > threshold, 1, 0)
actual_values <- logmodel$y
cm <- table(predicted_values, actual_values)
cm
## actual_values
## predicted_values 0 1
## 0 141 42
## 1 37 145
The sensitivity and specificity of the model can be calculated from the confusion matrix as shown:
#cm[1] = true negatives, cm[2] = false positives,
#cm[3] = false negatives, cm[4] = true positives (column-major indexing)
sensitivity <- cm[4] / sum(cm[4], cm[3])
specificity <- cm[1] / sum(cm[1], cm[2])
sensitivity
## [1] 0.7754011
specificity
## [1] 0.7921348
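The already-loaded caret package can compute the same quantities (and several others) in one call; a sketch, treating 1 (busy day) as the positive class:
#Confusion matrix with sensitivity/specificity via caret
caret::confusionMatrix(factor(predicted_values), factor(actual_values), positive = "1")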
The Hosmer-Lemeshow test is a goodness-of-fit test for the model. Large p-values (> 0.05) indicate that the model is a good fit for the data.
generalhoslem::logitgof(train$BusyDay, fitted(logmodel))
##
## Hosmer and Lemeshow test (binary model)
##
## data: train$BusyDay, fitted(logmodel)
## X-squared = 26.408, df = 8, p-value = 0.0008941
Here, the p-value is about 0.0009, meaning that there might be some issues with the fit of the model.
Collinearity and tolerance:
vifmodel<-car::vif(logmodel)
vifmodel
## GVIF Df GVIF^(1/(2*Df))
## temp 1.212502 1 1.101137
## hum 1.771815 1 1.331095
## weatherLab 1.524476 2 1.111169
1/vifmodel
## GVIF Df GVIF^(1/(2*Df))
## temp 0.8247410 1.0 0.9081525
## hum 0.5643931 1.0 0.7512610
## weatherLab 0.6559633 0.5 0.8999531
A binomial logistic regression analysis was conducted with busy day or not as the outcome variable and temperature, humidity and weather situation as predictors. The data met the assumption of independent observations. Examination for multicollinearity showed that the tolerance and variance inflation factor measures were within acceptable levels (tolerance > 0.4, VIF < 2.5), as outlined in Tarling (2008). The Hosmer-Lemeshow goodness-of-fit statistic has a small p-value, indicating some issues with the assumption of linearity between the independent variables and the log odds of the model (χ²(8) = 26.408, p < 0.001).