This analysis will use independent variables from the HHTS to build a model that attempts to explain why a given commuter drove to work. This analysis will use the binomial logit methodology. Initial steps include joining the person dataset to the household dataset in order to retain personal variables that may help explain mode choice. Second, I isolated only trips to work by sorting on the destination and filtering to only the first one-way trip to work that person took. Additionally, I filtered out many of the “don’t know” observations for variables I knew I would like to test.
This model assumes that “driving to work” means the person commuting to work reported being the driver of a vehicle. This does not include carpooling.
T-Tests
The final model includes two binary variables (as recorded by DVRPC), gender and disabled status. Separate t-tests between “Comm_drive” and gender and disability resulted in low p-values, meaning we can reject the null hypothesis that the difference in means is 0.
t.test (Comm_drive ~ GEND, data = CommuteTripsPerson5, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: Comm_drive by GEND
## t = 3.3472, df = 3001.4, p-value = 0.0008263
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.01582825 0.06059652
## sample estimates:
## mean in group 1 mean in group 2
## 0.9063942 0.8681818
t.test (Comm_drive ~ DISAB, data = CommuteTripsPerson5, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: Comm_drive by DISAB
## t = -4.0028, df = 13.048, p-value = 0.001494
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.8196964 -0.2451831
## sample estimates:
## mean in group 1 mean in group 2
## 0.3571429 0.8895826
Chi-Squared Tests
To analyze potential continuous independent variables for inclusion, I ran chi-squared tests between “Comm_drive” and age, trip duration, and income.
The results of the chi-squared tests were similar. Tests between all three variables produced a p-value that approaches 0, meaning we can reject the null hypothesis that there is no relationship between the variables. Because of this finding, there is evidence of association between the variables respectively.
CrossTable (CommuteTripsPerson5$Comm_drive, CommuteTripsPerson5$AGE, fisher = FALSE, chisq = TRUE,
expected = TRUE, sresid = FALSE, format="SPSS")
CrossTable (CommuteTripsPerson5$Comm_drive, CommuteTripsPerson5$TRPDUR, fisher = FALSE, chisq = TRUE,
expected = TRUE, sresid = FALSE, format="SPSS")
CrossTable (CommuteTripsPerson5$Comm_drive, CommuteTripsPerson5$INCOME, fisher = FALSE, chisq = TRUE,
expected = TRUE, sresid = FALSE, format="SPSS")
The final model includes trip duration, income, whether or not a commuter has a disability, time to walk to work from a parking location, and gender. All of these variables are statistically significant within the model working to predict whether or not someone will drive to work. This is shown in the model’s results, as well as the anova and drop1 tests.
The results of the model indicate that:
mod <- glm ( Comm_drive ~ TRPDUR+INCOME+as.factor(DISAB)+W1WLK+as.factor(GEND), data=CommuteTripsPerson5, y=TRUE, x = TRUE, family = binomial)
summary(mod)
##
## Call:
## glm(formula = Comm_drive ~ TRPDUR + INCOME + as.factor(DISAB) +
## W1WLK + as.factor(GEND), family = binomial, data = CommuteTripsPerson5,
## x = TRUE, y = TRUE)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5154 0.3757 0.4406 0.5055 2.5925
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.376104672 0.579983930 -0.648 0.51668
## TRPDUR -0.012580667 0.002692118 -4.673 0.000002966 ***
## INCOME 0.000004255 0.000001370 3.105 0.00191 **
## as.factor(DISAB)2 2.986652815 0.575062997 5.194 0.000000206 ***
## W1WLK -0.069890534 0.012905048 -5.416 0.000000061 ***
## as.factor(GEND)2 -0.463703573 0.119870120 -3.868 0.00011 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2154.8 on 3056 degrees of freedom
## Residual deviance: 2053.8 on 3051 degrees of freedom
## AIC: 2065.8
##
## Number of Fisher Scoring iterations: 5
Model Plotting
The following plots show how the probability of driving alone varies with different variables, with the model predictions shown in red.
Scenario Plotting
The plot below shows how differing incomes impact the probability that a given commuter will drive alone vs. take other modes. These scenarios assume that the commuter identifies as female and does not have a disability. The scenario also assumes that this commuter has an average walk from where they park to where they work. Based on these assumptions, we can see that based on this model the probability that someone will drive alone to work is highest in the lowest income scenario and lowest in the highest income scenario.
This analysis will follow a similar pattern from the first analysis in building a model to predict why some commuters walk/bike to work over other modes. However, in the sample of 3,347 trips, only a handful of these trips reflect walking or biking as the mode of transport. As a result, variables like, “licensed” or “disabled” that logically would impact mode choice do not necessarily show up as statistically significant within the model. This is a limitation of the data and is worth noting.
T-Tests
I conducted t-tests on multiple variables for both inclusion in the model and to show the potential skewedness of the data. For example, a t-test conducted between Comm_WB and gender resulted in a p-value of 0.945, meaning that we cannot reject the null hypothesis that the means are different. However, in transportation we know that men are much more likely to cycle generally than women, and would expect this trend to be present in a more holistic set of data. The same is true of the t-test conducted between Comm_WB and having a disability. The resulting p-value is 0.2072, meaning we cannot reject the null hypothesis, when again in reality we would expect having a disability to impact the decision to walk or bike to work. T-tests between Comm_WB and having a driver’s license or not resulted in a p-value low enough for us to reject the null hypothesis.
t.test (Comm_WB ~ GEND, data = CommuteTripsPerson5, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: Comm_WB by GEND
## t = 1.0376, df = 412.79, p-value = 0.3001
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.02018198 0.06530457
## sample estimates:
## mean in group 1 mean in group 2
## 0.06481481 0.04225352
t.test (Comm_WB ~ LIC, data = CommuteTripsPerson5, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: Comm_WB by LIC
## t = -2.4938, df = 18.293, p-value = 0.02242
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.50517111 -0.04348101
## sample estimates:
## mean in group 1 mean in group 2
## 0.04146341 0.31578947
t.test (Comm_WB ~ EDUC, data = CommuteTripsPerson5, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: Comm_WB by EDUC
## t = 1.4785, df = 59.754, p-value = 0.1445
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.02321977 0.15477532
## sample estimates:
## mean in group 1 mean in group 2
## 0.11111111 0.04533333
Chi-Squared Tests
To analyze potential continuous independent variables for inclusion, I ran chi-squared tests between “Comm_WB” and age, trip duration, and income.
The results of the chi-squared tests were varied. Tests between income and trip duration produced a p-value that approaches 0, meaning we can reject the null hypothesis that there is no relationship between the variables. Because of this finding, there is evidence of association between the variables respectively.
Chi-squared tests between Comm_WB and age produced a p-value that does not allow us to reject the null hypothesis, meaning that the true means may be the same across ages. This again does not necessarily reflect reality. We would expect younger people to be more willing to commute to work via biking or walking.
CrossTable (CommuteTripsPerson5$Comm_WB, CommuteTripsPerson5$INCOME, fisher = FALSE, chisq = TRUE,
expected = TRUE, sresid = FALSE, format="SPSS")
CrossTable (CommuteTripsPerson5$Comm_WB, CommuteTripsPerson5$AGE, fisher = FALSE, chisq = TRUE,
expected = TRUE, sresid = FALSE, format="SPSS")
CrossTable (CommuteTripsPerson5$Comm_WB, CommuteTripsPerson5$TRPDUR, fisher = FALSE, chisq = TRUE,
expected = TRUE, sresid = FALSE, format="SPSS")
The final model includes a binary variable representing if a commuter is licensed or unlicensed, and a categorical variable describing a commuter’s variation in work start time. The license variable is the most statistically significant, with two categories of variation in work start time significant at the 0.05 level. An attempt to categorize the variations in work start times as “can vary by up to 30 minutes” and “can vary by more than 30 minutes” did not yield statistically significant results. The results of the model indicate that: 1. If a commuter reported being unlicensed, they are 2.614% more likely to walk or bike to work. 2. If a commuter reported either a 15-minute grace period in work start time, or answered “other”, they are 1.971% and 1.767% more likely to walk or bike to work, respectively.
As mentioned throughout this section of the analysis, the data available for commutes who walk and bike to work is extremely limited. This made it difficult to both develop a statistically significant model and include variables that transportation planners traditionally recognize as indicators of biking or walking. This analysis specifically picked the first work trip per household, and captured all reported instances of biking and walking throughout the modes reported (there is space for a person to report up to four modes used to make a given trip). To broaden the dataset, it may be beneficial to group all non-driving commutes together to analyze. Alternatively, instead of just selecting the first work trip in the dataset, perhaps all work trips could be used.
mod_wb <- glm ( Comm_WB ~ as.factor(LIC)+ as.factor(VSTIM), data=CommuteTripsPerson5, family = binomial)
summary(mod_wb)
##
## Call:
## glm(formula = Comm_WB ~ as.factor(LIC) + as.factor(VSTIM), family = binomial,
## data = CommuteTripsPerson5)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9070 -0.2669 -0.2669 -0.2295 2.7019
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.6237 0.5715 -6.341 0.000000000229 ***
## as.factor(LIC)2 2.6416 0.5898 4.479 0.000007501951 ***
## as.factor(VSTIM)2 1.2258 1.1906 1.030 0.3032
## as.factor(VSTIM)3 1.8319 0.9539 1.920 0.0548 .
## as.factor(VSTIM)4 -15.2432 1169.3208 -0.013 0.9896
## as.factor(VSTIM)5 0.3064 0.6211 0.493 0.6218
## as.factor(VSTIM)7 1.6260 0.7918 2.054 0.0400 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 179.34 on 428 degrees of freedom
## Residual deviance: 155.05 on 422 degrees of freedom
## AIC: 169.05
##
## Number of Fisher Scoring iterations: 17
Model Plotting
Though not as clear as the previous plots, the plot below shows that the probability of commuting via walking/biking increases generally with the unlicensed.
plot(CommuteTripsPerson5$LIC, CommuteTripsPerson5$Comm_WB,
ylab="Commutes Via Walking/Biking",
xlab="Licensed or Unlicensed",
main="Probability of Commuting via Walking/Biking Based on Licensed Status")
points (CommuteTripsPerson5$LIC, fitted (mod_wb), col="red")
legend('topright', c('Observed', 'Predicted'),
col=c("black", "red"), pch=21:21,lty=1:1)
Similarly, this plot unfortunately does not help us generalize the results of the model, due to the reasons outlined previously. Further, we cannot create a useful scenarios model because no continuous variables were statistically significant enough for inclusion in the final model. However, as demonstrated by the plot below, the scenarios model likely would have produced inaccurate results based on the data limitations.
plot(CommuteTripsPerson5$VSTIM, CommuteTripsPerson5$Comm_WB,
ylab="Commutes Via Walking/Biking",
xlab="Has Flexibility in Work Start Time (least to most flexible)",
main="Probability of Commuting via Walking/Biking by Work Flexibility")
points (CommuteTripsPerson5$VSTIM, fitted (mod_wb), col="red")
legend('topright', c('Observed', 'Predicted'),
col=c("black", "red"), pch=21:21,lty=1:1)
To create a multinomial logit model, we first need to create a standard measure across modes. To do this, we assign a speed to each mode.
Assigned speeds are as follows:
I then created a distance column within the data by multiplying these rates by the recorded trip duration. Finally, I divided this new column by the speed itself to get the time spent in each mode. This allows the model to compare alternatives in a standardized manner.
After cleaning and standardizing the data, I attempted multiple combinations of the variables: time, age, gender, and income to build a multinomial logit model that predicts mode choice (between driving alone, carpooling, walking/biking, or taking transit). Gender is treated as a binary variable in this analysis. I attempted models that used decision-maker specific data, as well as alternative-specific data. Ultimately, the models that used a combination of both produced the best results.
mod.int <- mlogit (mode ~ 1, data = datMNL)
summary(mod.int)
mod_ml <- mlogit (mode ~ time | 1, data = datMNL)
summary(mod_ml)
mod_ml2 <- mlogit (mode ~ time | AGE + INCOME, data = datMNL)
summary(mod_ml2)
mod_ml3 <- mlogit (mode ~ time | GEND + INCOME, data = datMNL)
summary(mod_ml3)
mod_ml4 <- mlogit (mode ~ time | GEND, data = datMNL)
summary(mod_ml4)
mod_ml5 <- mlogit (mode ~ time | GEND + AGE + INCOME, data = datMNL)
summary(mod_ml5)
mod_ml6 <- mlogit (mode ~ 1 | GEND + AGE + INCOME, data = datMNL)
summary(mod_ml6)
mod_ml7 <- mlogit (mode ~ 1 | GEND + INCOME, data = datMNL)
summary(mod_ml7)
After cleaning and standardizing the data, I attempted multiple combinations of the variables: time, age, gender, and income to build a multinomial logit model that predicts mode choice (between driving alone, carpooling, walking/biking, or taking transit). Each model produced at least some statistically significant results, so I chose to compare models using the AIC function (results included below). Based on these results, the best fitting model for the data includes time, gender, and income as the best determinants of mode choice. Results from select models are included below. The final model is shown in the third column of the Stargazer table.
AIC(mod.int, mod_ml, mod_ml2, mod_ml3, mod_ml4, mod_ml5)
## df AIC
## mod.int 3 3294.285
## mod_ml 4 3099.229
## mod_ml2 10 3042.927
## mod_ml3 10 3031.889
## mod_ml4 7 3092.444
## mod_ml5 13 3037.371
stargazer(mod.int, mod_ml, mod_ml3, mod_ml5, type = "text")
##
## ================================================================================================
## Dependent variable:
## ----------------------------------------------------------------------------
## mode
## (1) (2) (3) (4)
## ------------------------------------------------------------------------------------------------
## (Intercept):carpool -2.895*** -2.512*** -2.759*** -2.760***
## (0.086) (0.092) (0.315) (0.316)
##
## (Intercept):transit -3.085*** -2.702*** -1.654*** -1.639***
## (0.094) (0.100) (0.355) (0.361)
##
## (Intercept):WB -2.881*** -0.980*** -0.629* -0.642*
## (0.085) (0.150) (0.369) (0.371)
##
## time -4.739*** -4.663*** -4.664***
## (0.442) (0.439) (0.439)
##
## GEND:carpool 0.446*** 0.460***
## (0.150) (0.155)
##
## GEND:transit 0.302* 0.312*
## (0.175) (0.181)
##
## GEND:WB -0.097 -0.095
## (0.177) (0.176)
##
## AGE:carpool -0.0005
## (0.001)
##
## AGE:transit -0.001
## (0.001)
##
## AGE:WB 0.0002
## (0.001)
##
## INCOME:carpool -0.00001** -0.00001**
## (0.00000) (0.00000)
##
## INCOME:transit -0.00002*** -0.00002***
## (0.00000) (0.00000)
##
## INCOME:WB -0.00000 -0.00000
## (0.00000) (0.00000)
##
## ------------------------------------------------------------------------------------------------
## Observations 3,012 3,012 3,012 3,012
## R2 0.000 0.060 0.084 0.084
## Log Likelihood -1,644.143 -1,545.614 -1,505.944 -1,505.685
## LR Test 0.000 (df = 3) 197.057*** (df = 4) 276.397*** (df = 10) 276.915*** (df = 13)
## ================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Model Results
The reference mode category for this model is driving alone. Because of this baseline, the model tells us that people are less likely to carpool, take transit, or walk or bike than drive alone to work, because all of these coefficients are negative. Further, as time increases, while controlling for gender and income, a commuter is significantly less likely to carpool, take transit, or walk or bike in comparison to driving alone.
As mentioned previously, gender is treated as a binary variable in this analysis, with male and female as the two options. Based on this, commuters who identify as female are 0.446% and 0.302% more likely to carpool or take transit to work respectively than males when compared to driving alone. Those who identify as female are 0.097% less likely than males to walk or bike to work when compared to driving alone.
Income is included as a continuous variable in this analysis. Based on this, a one dollar increase in income is associated with a 0.00000629%, 0.0000241%, and 0.00000295% decrease in the odds of carpooling, taking transit, or walking/biking to work respectively, when compared to driving alone.
To generalize these results, the odds of a commuter driving alone are generally higher than than of the other modes. As travel time goes up, the odds of a commuter not driving alone decrease. Women have generally higher odds of carpooling or taking transit than men when compared to driving alone. Women also have lower odds of walking or biking to work than men when compared to driving alone. Finally, an increase in income generally lowers the odds of using another mode vs. driving alone when commuting.
However, drawbacks and limitations of this model must be noted. First, the McFadden r-squared for this model is relatively low, meaning though some coefficients are statistically significant, the model itself is not particularly strong. New and additional variables may need to be included to better fit future models. Second, the limitations noted in the walking/biking binary logit model section still hold true here. There are very few observations where commuters walked/biked or carpooled. This limitation makes it difficult to generalize results, especially when compared to the data for driving alone, which makes up the bulk of the responses. Future models may need to be run on larger datasets, or again on more generalized trip data. While the multinomial logit model is not the best for this dataset, the binomial logit model developed in part B.1 that outlines the probability of driving alone to work is strong, due to having more observations.
If more data were available, I believe we could use the variables included to build strong predictive models of mode choice. However, it is important to remember that quantitative data can often gloss over the nuanced reasons certain people choose certain things. For this reason, more observations would allow more generalizable results across modes. Additionally, if resources were deployed to help people complete the survey in order to minimize the amount of “don’t knows” and “refused”, that could potentially help enrich the data as well.
mod_ml3 <- mlogit (mode ~ time | GEND + INCOME, data = datMNL)
exp(coef(mod_ml3))
## (Intercept):carpool (Intercept):transit (Intercept):WB time
## 0.06333675 0.19120488 0.53335414 0.00944111
## GEND:carpool GEND:transit GEND:WB INCOME:carpool
## 1.56132244 1.35218845 0.90781658 0.99999371
## INCOME:transit INCOME:WB
## 0.99997585 0.99999705
summary(mod_ml3)
##
## Call:
## mlogit(formula = mode ~ time | GEND + INCOME, data = datMNL,
## method = "nr")
##
## Frequencies of alternatives:choice
## auto carpool transit WB
## 0.864210 0.047809 0.039509 0.048473
##
## nr method
## 7 iterations, 0h:0m:0s
## g'(-H)^-1g = 1.71E-05
## successive function values within tolerance limits
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept):carpool -2.7592895799 0.3147229801 -8.7674 < 0.00000000000000022
## (Intercept):transit -1.6544097808 0.3548408410 -4.6624 0.000003125436340
## (Intercept):WB -0.6285696538 0.3690400246 -1.7033 0.088520
## time -4.6626817603 0.4392778945 -10.6144 < 0.00000000000000022
## GEND:carpool 0.4455331802 0.1495668941 2.9788 0.002894
## GEND:transit 0.3017243558 0.1748076489 1.7260 0.084341
## GEND:WB -0.0967129235 0.1766643040 -0.5474 0.584077
## INCOME:carpool -0.0000062886 0.0000024616 -2.5547 0.010628
## INCOME:transit -0.0000241498 0.0000034841 -6.9315 0.000000000004165
## INCOME:WB -0.0000029535 0.0000023902 -1.2357 0.216580
##
## (Intercept):carpool ***
## (Intercept):transit ***
## (Intercept):WB .
## time ***
## GEND:carpool **
## GEND:transit .
## GEND:WB
## INCOME:carpool *
## INCOME:transit ***
## INCOME:WB
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -1505.9
## McFadden R^2: 0.084055
## Likelihood ratio test : chisq = 276.4 (p.value = < 0.000000000000000222)
Predicted Values
While we cannot plot the same scenarios for multinomial logit, the probabilities for each mode are shown below. Finally, though confusing, I’ve included my best attempt at a plot of probabilities for each mode by age.
dat$predicted<- predict(mod_ml3, datMNL, type="response")
summary(dat$predicted)
## auto carpool transit WB
## Min. :0.3455 Min. :0.0000486 Min. :0.00004203 Min. :0.000000
## 1st Qu.:0.8174 1st Qu.:0.0359235 1st Qu.:0.01690930 1st Qu.:0.007591
## Median :0.8806 Median :0.0459537 Median :0.03313923 Median :0.029236
## Mean :0.8642 Mean :0.0478088 Mean :0.03950863 Mean :0.048473
## 3rd Qu.:0.9265 3rd Qu.:0.0617216 3rd Qu.:0.05269232 3rd Qu.:0.073370
## Max. :0.9999 Max. :0.5124735 Max. :0.15752963 Max. :0.259662
lpp <- melt(dat, id.vars = c("mode", "AGE", "GEND", "INCOME"), value.name = "predicted")
ggplot(data=lpp, aes(x = AGE, y = predicted, colour = mode)) + geom_line(
fill= c("blue", "red", "green", "orange"))+
ylim(0,1) +
xlim(0,90) +
ggtitle("Probabilities of Mode by Age")