For this week’s homework, I will be using the Bus Breakdown and Delays dataset found on Kaggle. The dataset is hosted by the City of New York and collects information from school bus vendors operating in the field in real time. The dependent variable will be bus delay and the independent variables will be Borough, School Age or Pre-K, and Run Type.
I loaded the packages I will use or might use.
library(readr)
library(dplyr)
library(Zelig)
library(texreg)
library(pander)
library(visreg)
library(effects)
I imported the dataset into R.
busdelay<-read_csv("C:/Users/wroni/Downloads/ny-bus-breakdown-and-delays/bus-breakdown-and-delays.csv")
Since I will be focusing on the probability of a bus delay, I recoded the dependent variable as binary: 1 (delay due to running late) or 0 (breakdown, i.e. no delay due to running late).
busdelay2 <- mutate(busdelay, busdelay_binary= recode(Breakdown_or_Running_Late,`Running Late` = 1, `Breakdown` = 0))
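A quick way to confirm the recode is to cross-tabulate the original and recoded columns. The sketch below does this on a small made-up data frame (the `toy` rows are hypothetical, not taken from the dataset) and uses a base R equivalent of the `recode()` call. On the real data, the same `table()` call could be applied to `busdelay2`.

```r
# Toy rows standing in for the real column (values are made up for illustration)
toy <- data.frame(
  Breakdown_or_Running_Late = c("Running Late", "Breakdown", "Running Late", NA)
)

# Same 0/1 coding as above, written in base R: 1 = Running Late, 0 = Breakdown
toy$busdelay_binary <- as.numeric(toy$Breakdown_or_Running_Late == "Running Late")

# Cross-tabulate original against recoded values, keeping missing values visible
check <- table(toy$Breakdown_or_Running_Late, toy$busdelay_binary, useNA = "ifany")
print(check)
```

Missing values stay missing under this coding, which matches the "observations deleted due to missingness" note that `glm()` reports later.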
This shows the first rows of the dataset with all of its variables, including the recoded variable from the previous step.
head(busdelay2)
The first model estimates the log-odds of a bus delay by borough.
m1 <- glm(busdelay_binary ~ Boro, family = binomial, data = busdelay2)
summary(m1)
##
## Call:
## glm(formula = busdelay_binary ~ Boro, family = binomial, data = busdelay2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7957 0.3093 0.4784 0.5287 0.7399
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.99800 0.21845 13.724 < 2e-16 ***
## BoroBronx -1.10098 0.21872 -5.034 4.81e-07 ***
## BoroBrooklyn -0.88810 0.21881 -4.059 4.93e-05 ***
## BoroConnecticut 0.01753 0.39067 0.045 0.964205
## BoroManhattan 0.01788 0.21913 0.082 0.934981
## BoroNassau County -1.33911 0.22222 -6.026 1.68e-09 ***
## BoroNew Jersey 0.15775 0.25111 0.628 0.529872
## BoroQueens -1.84233 0.21877 -8.421 < 2e-16 ***
## BoroRockland County -0.05356 0.26667 -0.201 0.840809
## BoroStaten Island 0.39824 0.22355 1.781 0.074836 .
## BoroWestchester 0.88962 0.23390 3.803 0.000143 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 199544 on 287612 degrees of freedom
## Residual deviance: 187666 on 287602 degrees of freedom
## (14376 observations deleted due to missingness)
## AIC: 187688
##
## Number of Fisher Scoring iterations: 6
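Because the estimates in the Model 1 output are on the log-odds scale, it can help to translate a few of them into predicted probabilities. The sketch below copies three estimates from the summary above and applies the inverse logit with `plogis()`. The variable names are my own, and the intercept is treated as the reference (omitted) borough category.

```r
# Estimates copied from the Model 1 summary above (log-odds scale)
intercept     <- 2.99800   # reference (omitted) borough category
b_queens      <- -1.84233  # BoroQueens
b_westchester <- 0.88962   # BoroWestchester

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
p_reference   <- plogis(intercept)
p_queens      <- plogis(intercept + b_queens)
p_westchester <- plogis(intercept + b_westchester)

round(c(reference = p_reference, Queens = p_queens, Westchester = p_westchester), 3)
```

With the fitted model object available, `predict(m1, newdata = ..., type = "response")` would give the same kind of probabilities without copying numbers by hand.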
This model adds a second independent variable, School Age or Pre-K. The second model estimates the log-odds of a bus delay by Borough and School Age or Pre-K.
m2 <- glm(busdelay_binary ~ Boro + School_Age_or_PreK, family = binomial, data = busdelay2)
summary(m2)
##
## Call:
## glm(formula = busdelay_binary ~ Boro + School_Age_or_PreK, family = binomial,
## data = busdelay2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4230 0.2037 0.3093 0.5078 0.7500
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.45975 0.22194 24.600 < 2e-16 ***
## BoroBronx -1.59527 0.21876 -7.292 3.04e-13 ***
## BoroBrooklyn -1.01483 0.21882 -4.638 3.52e-06 ***
## BoroConnecticut 0.01753 0.39067 0.045 0.964205
## BoroManhattan 0.01781 0.21913 0.081 0.935208
## BoroNassau County -1.33911 0.22222 -6.026 1.68e-09 ***
## BoroNew Jersey 0.15775 0.25111 0.628 0.529872
## BoroQueens -1.87333 0.21877 -8.563 < 2e-16 ***
## BoroRockland County -0.05356 0.26667 -0.201 0.840809
## BoroStaten Island 0.39571 0.22355 1.770 0.076704 .
## BoroWestchester 0.88962 0.23390 3.803 0.000143 ***
## School_Age_or_PreKSchool-Age -2.46175 0.03918 -62.829 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 199544 on 287612 degrees of freedom
## Residual deviance: 179849 on 287601 degrees of freedom
## (14376 observations deleted due to missingness)
## AIC: 179873
##
## Number of Fisher Scoring iterations: 6
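Models 1 and 2 are nested and were fit on the same 287613 observations, so the drop in residual deviance can be checked with a likelihood-ratio test. The sketch below uses the rounded deviances printed in the two summaries, so it is only approximate. With the model objects in memory, `anova(m1, m2, test = "Chisq")` should report the same comparison.

```r
# Residual deviances and degrees of freedom copied from the summaries above
dev_m1 <- 187666; df_m1 <- 287602
dev_m2 <- 179849; df_m2 <- 287601

# Likelihood-ratio statistic: drop in deviance from adding School_Age_or_PreK
lr_stat <- dev_m1 - dev_m2
lr_df   <- df_m1 - df_m2

# Upper-tail chi-squared probability for the test
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(statistic = lr_stat, df = lr_df, p = p_value)
```

The same test is not directly valid for Model 3 as fitted here, because it was estimated on three fewer observations.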
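This model adds a third independent variable, Run Type. The third model estimates the log-odds of a bus delay by Borough, School Age or Pre-K, and Run Type. It also includes an interaction between School Age or Pre-K and Run Type. However, the output reports NA for the Pre-K/EI run type and for every interaction term (10 coefficients not defined because of singularities), so no interaction effects could be estimated.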
m3 <- glm(busdelay_binary ~ Boro + School_Age_or_PreK * Run_Type, family = binomial, data = busdelay2)
summary(m3)
##
## Call:
## glm(formula = busdelay_binary ~ Boro + School_Age_or_PreK * Run_Type,
## family = binomial, data = busdelay2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4412 0.2035 0.3393 0.5751 0.9954
##
## Coefficients: (10 not defined because of singularities)
## Estimate
## (Intercept) 5.21563
## BoroBronx -1.34831
## BoroBrooklyn -0.78379
## BoroConnecticut 0.11510
## BoroManhattan 0.25517
## BoroNassau County -1.21267
## BoroNew Jersey 0.15601
## BoroQueens -1.64646
## BoroRockland County 0.06861
## BoroStaten Island 0.70274
## BoroWestchester 0.89665
## School_Age_or_PreKSchool-Age -2.64456
## Run_TypeGeneral Ed Field Trip -0.32177
## Run_TypeGeneral Ed PM Run -0.44231
## Run_TypePre-K/EI NA
## Run_TypeProject Read AM Run 0.22859
## Run_TypeProject Read Field Trip -0.59623
## Run_TypeProject Read PM Run 0.95148
## Run_TypeSpecial Ed AM Run 0.49293
## Run_TypeSpecial Ed Field Trip -0.48014
## Run_TypeSpecial Ed PM Run -0.33142
## School_Age_or_PreKSchool-Age:Run_TypeGeneral Ed Field Trip NA
## School_Age_or_PreKSchool-Age:Run_TypeGeneral Ed PM Run NA
## School_Age_or_PreKSchool-Age:Run_TypePre-K/EI NA
## School_Age_or_PreKSchool-Age:Run_TypeProject Read AM Run NA
## School_Age_or_PreKSchool-Age:Run_TypeProject Read Field Trip NA
## School_Age_or_PreKSchool-Age:Run_TypeProject Read PM Run NA
## School_Age_or_PreKSchool-Age:Run_TypeSpecial Ed AM Run NA
## School_Age_or_PreKSchool-Age:Run_TypeSpecial Ed Field Trip NA
## School_Age_or_PreKSchool-Age:Run_TypeSpecial Ed PM Run NA
## Std. Error
## (Intercept) 0.22234
## BoroBronx 0.21917
## BoroBrooklyn 0.21923
## BoroConnecticut 0.39169
## BoroManhattan 0.21953
## BoroNassau County 0.22264
## BoroNew Jersey 0.25148
## BoroQueens 0.21917
## BoroRockland County 0.26728
## BoroStaten Island 0.22416
## BoroWestchester 0.23424
## School_Age_or_PreKSchool-Age 0.04142
## Run_TypeGeneral Ed Field Trip 0.07936
## Run_TypeGeneral Ed PM Run 0.02967
## Run_TypePre-K/EI NA
## Run_TypeProject Read AM Run 0.30058
## Run_TypeProject Read Field Trip 1.24128
## Run_TypeProject Read PM Run 0.15421
## Run_TypeSpecial Ed AM Run 0.01742
## Run_TypeSpecial Ed Field Trip 0.07580
## Run_TypeSpecial Ed PM Run 0.02040
## School_Age_or_PreKSchool-Age:Run_TypeGeneral Ed Field Trip NA
## School_Age_or_PreKSchool-Age:Run_TypeGeneral Ed PM Run NA
## School_Age_or_PreKSchool-Age:Run_TypePre-K/EI NA
## School_Age_or_PreKSchool-Age:Run_TypeProject Read AM Run NA
## School_Age_or_PreKSchool-Age:Run_TypeProject Read Field Trip NA
## School_Age_or_PreKSchool-Age:Run_TypeProject Read PM Run NA
## School_Age_or_PreKSchool-Age:Run_TypeSpecial Ed AM Run NA
## School_Age_or_PreKSchool-Age:Run_TypeSpecial Ed Field Trip NA
## School_Age_or_PreKSchool-Age:Run_TypeSpecial Ed PM Run NA
## z value
## (Intercept) 23.458
## BoroBronx -6.152
## BoroBrooklyn -3.575
## BoroConnecticut 0.294
## BoroManhattan 1.162
## BoroNassau County -5.447
## BoroNew Jersey 0.620
## BoroQueens -7.512
## BoroRockland County 0.257
## BoroStaten Island 3.135
## BoroWestchester 3.828
## School_Age_or_PreKSchool-Age -63.848
## Run_TypeGeneral Ed Field Trip -4.055
## Run_TypeGeneral Ed PM Run -14.909
## Run_TypePre-K/EI NA
## Run_TypeProject Read AM Run 0.760
## Run_TypeProject Read Field Trip -0.480
## Run_TypeProject Read PM Run 6.170
## Run_TypeSpecial Ed AM Run 28.303
## Run_TypeSpecial Ed Field Trip -6.334
## Run_TypeSpecial Ed PM Run -16.246
## School_Age_or_PreKSchool-Age:Run_TypeGeneral Ed Field Trip NA
## School_Age_or_PreKSchool-Age:Run_TypeGeneral Ed PM Run NA
## School_Age_or_PreKSchool-Age:Run_TypePre-K/EI NA
## School_Age_or_PreKSchool-Age:Run_TypeProject Read AM Run NA
## School_Age_or_PreKSchool-Age:Run_TypeProject Read Field Trip NA
## School_Age_or_PreKSchool-Age:Run_TypeProject Read PM Run NA
## School_Age_or_PreKSchool-Age:Run_TypeSpecial Ed AM Run NA
## School_Age_or_PreKSchool-Age:Run_TypeSpecial Ed Field Trip NA
## School_Age_or_PreKSchool-Age:Run_TypeSpecial Ed PM Run NA
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## BoroBronx 7.65e-10 ***
## BoroBrooklyn 0.000350 ***
## BoroConnecticut 0.768863
## BoroManhattan 0.245096
## BoroNassau County 5.13e-08 ***
## BoroNew Jersey 0.535022
## BoroQueens 5.82e-14 ***
## BoroRockland County 0.797407
## BoroStaten Island 0.001719 **
## BoroWestchester 0.000129 ***
## School_Age_or_PreKSchool-Age < 2e-16 ***
## Run_TypeGeneral Ed Field Trip 5.02e-05 ***
## Run_TypeGeneral Ed PM Run < 2e-16 ***
## Run_TypePre-K/EI NA
## Run_TypeProject Read AM Run 0.446959
## Run_TypeProject Read Field Trip 0.630993
## Run_TypeProject Read PM Run 6.82e-10 ***
## Run_TypeSpecial Ed AM Run < 2e-16 ***
## Run_TypeSpecial Ed Field Trip 2.38e-10 ***
## Run_TypeSpecial Ed PM Run < 2e-16 ***
## School_Age_or_PreKSchool-Age:Run_TypeGeneral Ed Field Trip NA
## School_Age_or_PreKSchool-Age:Run_TypeGeneral Ed PM Run NA
## School_Age_or_PreKSchool-Age:Run_TypePre-K/EI NA
## School_Age_or_PreKSchool-Age:Run_TypeProject Read AM Run NA
## School_Age_or_PreKSchool-Age:Run_TypeProject Read Field Trip NA
## School_Age_or_PreKSchool-Age:Run_TypeProject Read PM Run NA
## School_Age_or_PreKSchool-Age:Run_TypeSpecial Ed AM Run NA
## School_Age_or_PreKSchool-Age:Run_TypeSpecial Ed Field Trip NA
## School_Age_or_PreKSchool-Age:Run_TypeSpecial Ed PM Run NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 199539 on 287609 degrees of freedom
## Residual deviance: 176362 on 287590 degrees of freedom
## (14379 observations deleted due to missingness)
## AIC: 176402
##
## Number of Fisher Scoring iterations: 6
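The NA rows in the output above are what `glm()` prints when some columns of the design matrix are exact linear combinations of others. One plausible cause, which I am assuming rather than confirming from the data, is that the Pre-K/EI run type occurs only for Pre-K rows while all other run types occur only for School-Age rows. The toy example below (made-up data with hypothetical run labels) reproduces that pattern and shows how a cross-tabulation reveals it.

```r
# Made-up data: every Pre-K row has run type "Pre-K/EI",
# and every School-Age row has one of the other run types
toy <- data.frame(
  age  = c(rep("Pre-K", 4), rep("School-Age", 8)),
  run  = c(rep("Pre-K/EI", 4), rep("AM Run", 4), rep("PM Run", 4)),
  late = c(1, 1, 0, 1,  1, 0, 1, 0,  0, 1, 0, 0)
)

# Empty cells in the cross-tab signal that the two factors overlap completely
xt <- table(toy$age, toy$run)
print(xt)

# Same structure as Model 3: main effects plus their interaction
fit <- glm(late ~ age * run, family = binomial, data = toy)

# Aliased coefficients come back as NA
print(coef(fit))
```

On the real data, `table(busdelay2$School_Age_or_PreK, busdelay2$Run_Type)` would show whether the same empty cells are present. If they are, the interaction cannot be estimated and dropping it (using `+` instead of `*`) gives the same fitted model.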
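Both the AIC and the BIC indicate that Model 3 is the best-fitting model (lower values indicate better fit). Note that Model 3 was fit on three fewer observations (287610 versus 287613), so the comparison is approximate.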
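As a check on where these fit statistics come from, the sketch below recomputes AIC and BIC from the log likelihoods, the number of estimated coefficients, and the number of observations reported for each model. The parameter counts (11, 12, and 20) are my own count of the non-NA coefficients in each summary.

```r
# Log likelihoods and observation counts as reported for the three models
loglik <- c(m1 = -93833.24, m2 = -89924.70, m3 = -88181.24)
n_obs  <- c(287613, 287613, 287610)

# Number of estimated (non-NA) coefficients, counted from each summary
k <- c(11, 12, 20)

# AIC = -2 logLik + 2k ; BIC = -2 logLik + k log(n)
aic <- -2 * loglik + 2 * k
bic <- -2 * loglik + k * log(n_obs)

round(rbind(AIC = aic, BIC = bic), 2)

# Model with the smallest AIC
names(which.min(aic))
```

BIC penalizes each extra coefficient by log(n) rather than 2, which is why the gap between Model 2 and Model 3 is smaller on BIC than on AIC.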
The coefficients are on the log-odds scale and are relative to the reference (omitted) borough category. In Model 3, the log-odds of a bus delay are higher by 0.90 for Westchester and 0.70 for Staten Island; the positive estimates for Manhattan (0.26), New Jersey (0.16), Connecticut (0.12), and Rockland County (0.07) are not statistically significant. The log-odds are lower by 1.35 for the Bronx, 0.78 for Brooklyn, 1.21 for Nassau County, and 1.65 for Queens. Please note that the dataset contains more than just the five NYC boroughs even though the variable is labeled Borough.
Once School Age or Pre-K is added alongside Borough, the log-odds of a bus delay are lower for School-Age than for Pre-K: by 2.46 in Model 2 and by 2.64 in Model 3.
Once Run Type is added alongside Borough and School Age or Pre-K, the log-odds of a bus delay are lower by 0.32 for General Education Field Trip, 0.44 for General Education PM Run, 0.48 for Special Education Field Trip, and 0.33 for Special Education PM Run, relative to the reference run type. The log-odds are higher by 0.49 for Special Education AM Run and by 0.95 for Project Read PM Run.
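To express these effects as odds ratios, the coefficients can be exponentiated. The sketch below does this for three Model 3 estimates copied from the summary, with approximate 95% Wald intervals built from the reported standard errors. The choice of these three terms is only for illustration.

```r
# Estimates and standard errors copied from the Model 3 summary (log-odds scale)
est <- c("School-Age"        = -2.64456,
         "Special Ed AM Run" =  0.49293,
         "General Ed PM Run" = -0.44231)
se  <- c(0.04142, 0.01742, 0.02967)

# Exponentiate to get odds ratios and approximate 95% Wald intervals
odds_ratio <- exp(est)
lower      <- exp(est - 1.96 * se)
upper      <- exp(est + 1.96 * se)

round(cbind(odds_ratio, lower, upper), 3)
```

An odds ratio below 1 means lower odds of a delay than in the reference category, and a ratio above 1 means higher odds. With the model object available, `exp(coef(m3))` gives the full set at once.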
table1 <- htmlreg(list(m1, m2, m3), doctype= FALSE)
pander(table1)
| | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| (Intercept) | 3.00*** | 5.46*** | 5.22*** |
| | (0.22) | (0.22) | (0.22) |
| BoroBronx | -1.10*** | -1.60*** | -1.35*** |
| | (0.22) | (0.22) | (0.22) |
| BoroBrooklyn | -0.89*** | -1.01*** | -0.78*** |
| | (0.22) | (0.22) | (0.22) |
| BoroConnecticut | 0.02 | 0.02 | 0.12 |
| | (0.39) | (0.39) | (0.39) |
| BoroManhattan | 0.02 | 0.02 | 0.26 |
| | (0.22) | (0.22) | (0.22) |
| BoroNassau County | -1.34*** | -1.34*** | -1.21*** |
| | (0.22) | (0.22) | (0.22) |
| BoroNew Jersey | 0.16 | 0.16 | 0.16 |
| | (0.25) | (0.25) | (0.25) |
| BoroQueens | -1.84*** | -1.87*** | -1.65*** |
| | (0.22) | (0.22) | (0.22) |
| BoroRockland County | -0.05 | -0.05 | 0.07 |
| | (0.27) | (0.27) | (0.27) |
| BoroStaten Island | 0.40 | 0.40 | 0.70** |
| | (0.22) | (0.22) | (0.22) |
| BoroWestchester | 0.89*** | 0.89*** | 0.90*** |
| | (0.23) | (0.23) | (0.23) |
| School_Age_or_PreKSchool-Age | | -2.46*** | -2.64*** |
| | | (0.04) | (0.04) |
| Run_TypeGeneral Ed Field Trip | | | -0.32*** |
| | | | (0.08) |
| Run_TypeGeneral Ed PM Run | | | -0.44*** |
| | | | (0.03) |
| Run_TypeProject Read AM Run | | | 0.23 |
| | | | (0.30) |
| Run_TypeProject Read Field Trip | | | -0.60 |
| | | | (1.24) |
| Run_TypeProject Read PM Run | | | 0.95*** |
| | | | (0.15) |
| Run_TypeSpecial Ed AM Run | | | 0.49*** |
| | | | (0.02) |
| Run_TypeSpecial Ed Field Trip | | | -0.48*** |
| | | | (0.08) |
| Run_TypeSpecial Ed PM Run | | | -0.33*** |
| | | | (0.02) |
| AIC | 187688.49 | 179873.39 | 176402.47 |
| BIC | 187804.75 | 180000.23 | 176613.86 |
| Log Likelihood | -93833.24 | -89924.70 | -88181.24 |
| Deviance | 187666.49 | 179849.39 | 176362.47 |
| Num. obs. | 287613 | 287613 | 287610 |
| ***p < 0.001, **p < 0.01, *p < 0.05 | | | |
This plot shows that the predicted probability of a bus delay is higher in Westchester and Staten Island than in the other boroughs, and that the differences between boroughs are most visible for School-Age.
visreg(m3,"Boro", by = "School_Age_or_PreK", scale="response")
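This plot shows the predicted probability of a bus delay for Pre-K versus School-Age within each borough. Consistent with the negative School-Age coefficient in Model 3, the probability is lower for School-Age than for Pre-K; among School-Age runs it is highest in Westchester.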
visreg(m3,"School_Age_or_PreK", by = "Boro", scale="response")
This plot again shows that the predicted probability of a bus delay is higher for the Special Education AM Run type, especially in Westchester and Staten Island.
visreg(m3,"Run_Type", by = "Boro", scale="response")
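The visreg plots above put predictions on the probability scale, where the same log-odds difference translates into different probability gaps depending on the baseline. The sketch below illustrates this with Model 3 estimates copied from the summary, comparing Westchester and Queens for Pre-K and School-Age. It holds Run Type at its reference category, which is a simplification I am assuming for illustration.

```r
# Estimates copied from the Model 3 summary (log-odds scale)
intercept     <- 5.21563
b_school_age  <- -2.64456
b_westchester <- 0.89665
b_queens      <- -1.64646

# Predicted probabilities at the reference run type
p <- c(
  prek_westchester   = plogis(intercept + b_westchester),
  prek_queens        = plogis(intercept + b_queens),
  school_westchester = plogis(intercept + b_school_age + b_westchester),
  school_queens      = plogis(intercept + b_school_age + b_queens)
)
round(p, 3)

# Gap between the two boroughs within each age group
gap_prek   <- p[["prek_westchester"]] - p[["prek_queens"]]
gap_school <- p[["school_westchester"]] - p[["school_queens"]]
round(c(Pre_K = gap_prek, School_Age = gap_school), 3)
```

Pre-K probabilities sit close to 1, so there is little room for boroughs to differ, while the lower School-Age baseline leaves the borough gap much easier to see.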