Asssignment 2: regression

Prepared by Gregory Mallon

October 23rd, 2021

Transportation Demand Modeling

Package load and data connection

Preparation: read data into a R data frame:

#install.packages("tidyverse")
library(tidyverse)
household_hbw <- read_csv("data/household_hbw.csv")
household_hbw

## # A tibble: 6,449 × 83
##      sampn  area rtcqc   pnr advltr rmode rtmode  lang incen   day  assn ribus
##      <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2000019    21     1    NA      2    NA     NA    NA     2     2   415     2
##  2 2000051    21     4    NA      2    NA     NA    NA     2     5   397     2
##  3 2000056    21     1    NA      2    NA     NA    NA     2     3   381     1
##  4 2000057    21     4    NA      2    NA     NA    NA     2     5   355     2
##  5 2000095    21     1    NA      2    NA     NA    NA     2     2   373     2
##  6 2000116    21     2    NA      2    NA     NA    NA     2     4   375     2
##  7 2000122    21     2    NA      2    NA     NA    NA     2     1   379     2
##  8 2000138    21     2    NA      2    NA     NA    NA     2     1   365     2
##  9 2000151    21     1    NA      2    NA     NA    NA     2     5   355     2
## 10 2000165    21     1    NA      2    NA     NA    NA     2     3   353     1
## # … with 6,439 more rows, and 71 more variables: s1a <dbl>, wabik <dbl>,
## #   hhsiz <dbl>, hhveh <dbl>, vflag <dbl>, vhold <dbl>, bikes <dbl>,
## #   flexc <dbl>, resty <dbl>, o_resty <chr>, own <dbl>, o_own <lgl>,
## #   hlive <dbl>, bfcity <chr>, bfstate <chr>, bfzip <dbl>, bfres <dbl>,
## #   o_bfres <chr>, bfown <dbl>, o_bfown <lgl>, mfact1 <dbl>, mfact2 <dbl>,
## #   mfact3 <dbl>, mfact4 <dbl>, mfact5 <dbl>, mfact6 <dbl>, mfact7 <dbl>,
## #   o_mfact <chr>, fact2 <dbl>, fact4 <dbl>, mloca1 <dbl>, mloca2 <dbl>, …

Variable Transformations

# classify households by type of residence 3 categories single family, multi-family, other
household_hbw_sub <- household_hbw %>%
    mutate(single_family=ifelse(resty==1, TRUE, FALSE),
      resty_cat=case_when(resty == 1 ~ "Single-Family",
                           resty <= 3 ~ "Multi-Family",
                           #resty == 4 ~ "Mobile Home",
                           TRUE ~ "Other"
# classify households by number of vehicles into 3 categories. O vehicles, 1-2 vehicles, or 3 or more vehicles.                           
    ),
    hhveh_cat=case_when(hhveh == 0 ~ "0",
                             hhveh <= 2 ~ "1-2",
                             hhveh >= 3 ~ "3+"
         ),


#hhwrk into categories between 0-1 and 2 or more
hhwrk_cat=case_when(hhwrk <= 1 ~ "0-1",
                           TRUE ~ "2 or More" ),

 
  inc_cat=case_when(income == 1 ~ "0-24999",
                           income <= 2 ~ "25000-49999",
                           income >= 3 ~ "50000+"),

inc_cat_two=case_when(income == 1 ~ "0-49999",
                           income <= 2 ~ "50000-99999",
                           income >= 3 ~ "100000+"
    ),

own_cat=case_when(own == 1 ~ "Owned",
                           own == 2 ~ "Rented",
                           FALSE ~ "Other"),

hhsiz_cat=case_when(hhsiz == 1 ~ "1",
                  hhsiz == 2 ~ "2",
                  hhsiz == 3 ~ "3",
                  hhsiz == 4 ~ "4",
                  hhsiz == 5 ~ "5",
                  TRUE ~ "More Than 5"))

Cross Classification Trip Table and Line Chart Household Type Categories and Number of Vehicles Categories

#trip rate table for resty_cat and hhveh_cat
trip_rates_one <- household_hbw_sub %>%
  #filter(!is.na(resty_cat), resty != 9) %>%
  group_by(resty_cat, hhveh_cat) %>%
  summarize(avg_hbw=mean(hbw, na.rm=TRUE)) 

# pretty print the trip rates table
options(digits=2)
trip_rates_one %>%
  pivot_wider(names_from = "hhveh_cat", values_from = "avg_hbw") %>%
  knitr::kable()

resty_cat	0	1-2	3+
Multi-Family	0.18	0.48	0.72
Other	0.08	0.34	1.10
Single-Family	0.28	0.69	1.07

# plotting the trip rates
ggplot(trip_rates_one, aes(x=hhveh_cat, y=avg_hbw, group=resty_cat, color=resty_cat)) +
  
  geom_line()

Model 1 Assessing a goodness of fit with R^2

Joinging trip rates with HH data.

# the htrips column is the observed htrips, the avg_htrips column is the predicted htrips

household_hbw_sub$resty_cat <- as.factor(household_hbw_sub$resty_cat)
household_hbw_sub$hhveh_cat <- as.factor(household_hbw_sub$hhveh_cat)

predictions_one <- household_hbw_sub %>%
  left_join(trip_rates_one %>% rename(pred_hbw=avg_hbw), 
            by=c("resty_cat", "hhveh_cat")) %>%
  select(sampn, hbw, pred_hbw)

Assess goodness-of-fit of trip rates 1 with r^2

predictions_one %>%
  summarise(SSE=sum((hbw - pred_hbw)^2),
            SST=sum((hbw - mean(hbw))^2),
            R2= 1 - SSE/SST)

## # A tibble: 1 × 3
##     SSE   SST     R2
##   <dbl> <dbl>  <dbl>
## 1 4256. 4596. 0.0741

Estimate a linear Regression Model of the home-based-work trip generation

household_hbw_sub$resty_cat <- as.factor(household_hbw_sub$resty_cat)
household_hbw_sub$hhveh_cat <- as.factor(household_hbw_sub$hhveh_cat)

lm(hbw ~ hhveh_cat, data=household_hbw_sub) %>%
  summary()

## 
## Call:
## lm(formula = hbw ~ hhveh_cat, data = household_hbw_sub)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.057 -0.644 -0.057  0.356  4.356 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.2046     0.0438    4.67  3.1e-06 ***
## hhveh_cat1-2   0.4391     0.0455    9.65  < 2e-16 ***
## hhveh_cat3+    0.8525     0.0485   17.58  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.82 on 6446 degrees of freedom
## Multiple R-squared:  0.0643, Adjusted R-squared:  0.064 
## F-statistic:  222 on 2 and 6446 DF,  p-value: <2e-16

lm(hbw ~ hhveh, data=household_hbw) %>%
  summary()

## 
## Call:
## lm(formula = hbw ~ hhveh, data = household_hbw)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.073 -0.732 -0.284  0.492  4.492 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.28424    0.02008    14.2   <2e-16 ***
## hhveh        0.22365    0.00893    25.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.81 on 6447 degrees of freedom
## Multiple R-squared:  0.0887, Adjusted R-squared:  0.0885 
## F-statistic:  627 on 1 and 6447 DF,  p-value: <2e-16

lm(hbw ~ hhveh_cat * resty_cat, data=household_hbw_sub) %>%
  summary()

## 
## Call:
## lm(formula = hbw ~ hhveh_cat * resty_cat, data = household_hbw_sub)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.100 -0.692 -0.065  0.308  4.308 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           0.1837     0.0519    3.54  0.00041 ***
## hhveh_cat1-2                          0.2992     0.0586    5.11  3.3e-07 ***
## hhveh_cat3+                           0.5413     0.1386    3.90  9.5e-05 ***
## resty_catOther                       -0.1003     0.2403   -0.42  0.67634    
## resty_catSingle-Family                0.0941     0.1002    0.94  0.34768    
## hhveh_cat1-2:resty_catOther          -0.0377     0.2571   -0.15  0.88352    
## hhveh_cat3+:resty_catOther            0.4753     0.3276    1.45  0.14685    
## hhveh_cat1-2:resty_catSingle-Family   0.1150     0.1047    1.10  0.27217    
## hhveh_cat3+:resty_catSingle-Family    0.2463     0.1643    1.50  0.13393    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.81 on 6440 degrees of freedom
## Multiple R-squared:  0.0741, Adjusted R-squared:  0.0729 
## F-statistic: 64.4 on 8 and 6440 DF,  p-value: <2e-16

R^2

#install.packages("broom")
library(broom)
lm(hbw ~ hhveh_cat * resty_cat, data=household_hbw_sub) %>%
  augment() %>%
  summarize(SST = sum((hbw-mean(hbw))^2), 
          SSE = sum((hbw-.fitted)^2)) %>% 
  mutate(SSR = SST - SSE,
         R2 = 1 - SSE/SST) %>% 
  knitr::kable()

SST	SSE	SSR	R2
4596	4256	340	0.07

E) Adjusted R^2

#install.packages("texreg")
library(texreg)
lm(hbw ~ hhveh_cat * resty_cat, data=household_hbw_sub) %>%
  htmlreg(doctype = FALSE)

Statistical models
	Model 1
(Intercept)	0.18^***
	(0.05)
hhveh_cat1-2	0.30^***
	(0.06)
hhveh_cat3+	0.54^***
	(0.14)
resty_catOther	-0.10
	(0.24)
resty_catSingle-Family	0.09
	(0.10)
hhveh_cat1-2:resty_catOther	-0.04
	(0.26)
hhveh_cat3+:resty_catOther	0.48
	(0.33)
hhveh_cat1-2:resty_catSingle-Family	0.11
	(0.10)
hhveh_cat3+:resty_catSingle-Family	0.25
	(0.16)
R²	0.07
Adj. R²	0.07
Num. obs.	6449
p < 0.001; p < 0.01; p < 0.05

2nd Cross Classification Trip Table and Line Chart Household Type Categories and Number of Vehicles Categories

# Recoding Income to Income Categories (inc_cat) (0-24000, 25000-49999, 50000+)


household_hbw_sub %>%
  filter(!is.na(resty), resty != 9)

## # A tibble: 6,444 × 91
##      sampn  area rtcqc   pnr advltr rmode rtmode  lang incen   day  assn ribus
##      <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2000019    21     1    NA      2    NA     NA    NA     2     2   415     2
##  2 2000051    21     4    NA      2    NA     NA    NA     2     5   397     2
##  3 2000056    21     1    NA      2    NA     NA    NA     2     3   381     1
##  4 2000057    21     4    NA      2    NA     NA    NA     2     5   355     2
##  5 2000095    21     1    NA      2    NA     NA    NA     2     2   373     2
##  6 2000116    21     2    NA      2    NA     NA    NA     2     4   375     2
##  7 2000122    21     2    NA      2    NA     NA    NA     2     1   379     2
##  8 2000138    21     2    NA      2    NA     NA    NA     2     1   365     2
##  9 2000151    21     1    NA      2    NA     NA    NA     2     5   355     2
## 10 2000165    21     1    NA      2    NA     NA    NA     2     3   353     1
## # … with 6,434 more rows, and 79 more variables: s1a <dbl>, wabik <dbl>,
## #   hhsiz <dbl>, hhveh <dbl>, vflag <dbl>, vhold <dbl>, bikes <dbl>,
## #   flexc <dbl>, resty <dbl>, o_resty <chr>, own <dbl>, o_own <lgl>,
## #   hlive <dbl>, bfcity <chr>, bfstate <chr>, bfzip <dbl>, bfres <dbl>,
## #   o_bfres <chr>, bfown <dbl>, o_bfown <lgl>, mfact1 <dbl>, mfact2 <dbl>,
## #   mfact3 <dbl>, mfact4 <dbl>, mfact5 <dbl>, mfact6 <dbl>, mfact7 <dbl>,
## #   o_mfact <chr>, fact2 <dbl>, fact4 <dbl>, mloca1 <dbl>, mloca2 <dbl>, …

household_hbw_sub %>%
  filter(!is.na(resty), resty != 9) %>%
  count(resty, inc_cat_two) %>%
  group_by(resty) %>%
  mutate(percent=n/sum(n)) %>%
  select(-n) %>%
  pivot_wider(names_from = "inc_cat_two", values_from = "percent")

## # A tibble: 5 × 4
## # Groups:   resty [5]
##   resty `0-49999` `100000+` `50000-99999`
##   <dbl>     <dbl>     <dbl>         <dbl>
## 1     1    0.0206     0.929        0.0502
## 2     2    0.128      0.707        0.165 
## 3     3    0.168      0.661        0.171 
## 4     4    0.150      0.617        0.234 
## 5     7   NA          1           NA

#trip rate table for hhwrk_cat and inc_cat_two
trip_rates_two <- household_hbw_sub %>%
  #filter(!is.na(resty_cat), resty != 9) %>%
  group_by(hhwrk_cat, inc_cat_two) %>%
  summarize(avg_hbw=mean(hbw, na.rm=TRUE)) 

# pretty print the trip rates table
options(digits=2)
trip_rates_two %>%
  pivot_wider(names_from = "inc_cat_two", values_from = "avg_hbw") %>%
  knitr::kable()

hhwrk_cat	0-49999	100000+	50000-99999
0-1	0.13	0.42	0.26
2 or More	0.74	1.15	0.87

household_hbw_sub$hhwrk_cat <- as.factor(household_hbw_sub$hhwrk_cat)

# plotting the trip rates
ggplot(trip_rates_two, aes(x=inc_cat_two, y=avg_hbw, group=hhwrk_cat, color=hhwrk_cat)) +
  geom_line()

Cross-Classification trip table Assessing a goodness of fit with R^2

# the htrips column is the observed htrips, the avg_htrips column is the predicted htrips

predictions_two <- household_hbw_sub %>%
  left_join(trip_rates_two %>% rename(pred_hbw=avg_hbw), 
            by=c("hhwrk_cat", "inc_cat_two")) %>%
  select(sampn, hbw, pred_hbw)


predictions_two %>%
  summarise(SSE=sum((hbw - pred_hbw)^2),
            SST=sum((hbw - mean(hbw))^2),
            R2= 1 - SSE/SST)

## # A tibble: 1 × 3
##     SSE   SST    R2
##   <dbl> <dbl> <dbl>
## 1 3626. 4596. 0.211

lm(htrips ~ hhwrk_cat, data=household_hbw_sub) %>%
  summary()

## 
## Call:
## lm(formula = htrips ~ hhwrk_cat, data = household_hbw_sub)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13.10  -5.10  -1.15   2.90  73.85 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           7.151      0.121    58.9   <2e-16 ***
## hhwrk_cat2 or More    5.946      0.181    32.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.2 on 6447 degrees of freedom
## Multiple R-squared:  0.143,  Adjusted R-squared:  0.143 
## F-statistic: 1.08e+03 on 1 and 6447 DF,  p-value: <2e-16

lm(htrips ~ hhwrk, data=household_hbw_sub) %>%
  summary()

## 
## Call:
## lm(formula = htrips ~ hhwrk, data = household_hbw_sub)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -16.05  -4.66  -1.46   3.34  72.54 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.663      0.169    27.6   <2e-16 ***
## hhwrk          3.794      0.106    35.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.1 on 6447 degrees of freedom
## Multiple R-squared:  0.166,  Adjusted R-squared:  0.166 
## F-statistic: 1.28e+03 on 1 and 6447 DF,  p-value: <2e-16

lm(htrips ~ hhwrk_cat * inc_cat_two, data=household_hbw_sub) %>%
  summary()

## 
## Call:
## lm(formula = htrips ~ hhwrk_cat * inc_cat_two, data = household_hbw_sub)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13.15  -5.15  -1.43   2.85  73.57 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                  6.085      0.421   14.46   <2e-16
## hhwrk_cat2 or More                           4.547      1.710    2.66   0.0079
## inc_cat_two100000+                           1.350      0.442    3.05   0.0023
## inc_cat_two50000-99999                      -0.138      0.552   -0.25   0.8019
## hhwrk_cat2 or More:inc_cat_two100000+        1.170      1.721    0.68   0.4966
## hhwrk_cat2 or More:inc_cat_two50000-99999    1.213      1.936    0.63   0.5308
##                                              
## (Intercept)                               ***
## hhwrk_cat2 or More                        ** 
## inc_cat_two100000+                        ** 
## inc_cat_two50000-99999                       
## hhwrk_cat2 or More:inc_cat_two100000+        
## hhwrk_cat2 or More:inc_cat_two50000-99999    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.2 on 6443 degrees of freedom
## Multiple R-squared:  0.147,  Adjusted R-squared:  0.146 
## F-statistic:  221 on 5 and 6443 DF,  p-value: <2e-16

#install.packages("broom")
library(broom)
lm(hbw ~ hhwrk_cat * inc_cat_two, data=household_hbw_sub) %>%
  augment() %>%
  summarize(SST = sum((hbw-mean(hbw))^2), 
          SSE = sum((hbw-.fitted)^2)) %>% 
  mutate(SSR = SST - SSE,
         R2 = 1 - SSE/SST) %>% 
  knitr::kable()

SST	SSE	SSR	R2
4596	3626	970	0.21

#install.packages("texreg")
library(texreg)
lm(hbw ~ hhwrk_cat * inc_cat_two, data=household_hbw_sub) %>%
  htmlreg(doctype = FALSE)

Statistical models
	Model 1
(Intercept)	0.13^**
	(0.04)
hhwrk_cat2 or More	0.60^***
	(0.18)
inc_cat_two100000+	0.29^***
	(0.05)
inc_cat_two50000-99999	0.13^*
	(0.06)
hhwrk_cat2 or More:inc_cat_two100000+	0.13
	(0.18)
hhwrk_cat2 or More:inc_cat_two50000-99999	-0.00
	(0.20)
R²	0.21
Adj. R²	0.21
Num. obs.	6449
p < 0.001; p < 0.01; p < 0.05

Write Up Section 1

To: Dr. Liming Wang From: Gregory Mallon Subject: Trip rates tables, plots and R^2

In the first Cross Classification tab, I looked at two categorized variables: resty (household type) and hhveh (number of vehicles per household). I used a categorized version of resty (resty_cat) to avoid the “zero-cell” problem where there were no observations within a couple of the categories of regular resty. resty_cat was resty sorted based on single-family, multi-family, and other. Additionally, I categorized household vehicles into 0 vehicles, 1-2 vehicles, and 3 or more vehicles. I categorized them to avoid the “zero-cell” problem again and decrease the number of categories to fewer than five.

in the first table, it looked as though home-based work (hbw) trips increased as the number of cars in the household increased for every category of household type (rest_cat). Comparing household type, single-family residences took more hbw trips than any other category of housing type for households with 0 cars. Single-family residences took more hbw trips than any other housing category for households with 1-2 cars as well. However, for “Other” housing types took more trips than single-family for households with more than 3 cars. Additionally, in ggplot 1 you can see that the “Other” category crossed both single-family and multi-family indicating that “Other” household types with 3 or more cars were taking a lot more hbw trips than the other two housing categories within their respective household vehicle categories.

Assessing a goodness of fit for the first model was scored a R^2 of .07 indicating that these variables together are not good indicators of hbw trips. Comparing hhveh_cat to uncategorized hhveh in relation to hbw trips showed that the categorized version of hhveh was a worse predictor of hbw trips. Adjusted R^2 was almost identical to R^2 (slightly decreased from .074 to .073) meaning that not including one of the variables would likely be a worse alternative than using both.

In model 2, my better performing model, I looked at two categorized versions of independent variables: household workers and income. I sorted hhwrk into two categories and sorted income into three categories. The cross classification table and plot are a little hard to read because the largest income category $100,000 or more was put in the middle of both, but as income increases, the number of hbw trips increases regardless of the number hhwrk. His models goodness of fit measure with R^2 was .21 which is still not great. This suggests that other variables should be considered rather than using the number of household vehicles (hhveh_cat) and income (income_cat_two). I tried a number of different categorization techniques which had little effect over the R^2 goodness of fit metric. This included a hhwrk category that had 0, 1 worker, and more than two and a hhwrk_cat that had 0-1 worker and 2 or more workers.

When I started this assignment, I thought that income and house type would be better indicators of hbw trip than they were. This is because I was confused and thought the survey was aching about home-based work car trips only. So I was thinking that I would select variables based on an assumption of heavy car use to get to work. (i.e. single-family homes are farther away from downtown cores, and the number of vehicles may indicate how someone would get to work.)

2 Estimating a Linear Regression Model of Household Trip Generation

Use Excel’s Regression function to estimate linear regression models HBW trips. Try at least three (3) different linear regression specifications using different combinations of variables or different forms of a variable (e.g. workers in a numeric form vs in a coded categorical form). For your first model, start with the same variables you used in the Cross-Classification estimation. How do the results compare with the Cross-Classification version in terms of model goodnessof- fit (R2)?

household_hbw_sub$hhveh_cat <- as.factor(household_hbw_sub$hhveh_cat)
household_hbw_sub$hhwrk <- as.factor(household_hbw_sub$hhwrk)
household_hbw_sub$inc_cat_two <- as.factor(household_hbw_sub$inc_cat_two)
#install.packages("texreg")
library(texreg)
  #screenreg(list(
#htmlreg(list()) is the aternate
  htmlreg(list(
    "Model A"=lm(hbw ~ hhwrk_cat * inc_cat_two + resty_cat * hhveh_cat, data=household_hbw_sub),
    "Model B"=lm(hbw ~ hhwrk + hhveh_cat,  data=household_hbw_sub),
    "Model C"=lm(htrips ~ hhsiz_cat + hhwrk, data=household_hbw_sub)
  ),
  doctype = FALSE)

Statistical models
	Model A	Model B	Model C
(Intercept)	0.03	-0.14^***	3.76^***
	(0.06)	(0.04)	(0.20)
hhwrk_cat2 or More	0.54^**
	(0.18)
inc_cat_two100000+	0.19^***
	(0.05)
inc_cat_two50000-99999	0.10
	(0.06)
resty_catOther	-0.08
	(0.22)
resty_catSingle-Family	-0.01
	(0.09)
hhveh_cat1-2	0.16^**	0.15^***
	(0.06)	(0.04)
hhveh_cat3+	0.18	0.29^***
	(0.13)	(0.04)
hhwrk_cat2 or More:inc_cat_two100000+	0.14
	(0.18)
hhwrk_cat2 or More:inc_cat_two50000-99999	-0.00
	(0.20)
resty_catOther:hhveh_cat1-2	-0.02
	(0.24)
resty_catSingle-Family:hhveh_cat1-2	0.02
	(0.10)
resty_catOther:hhveh_cat3+	0.43
	(0.30)
resty_catSingle-Family:hhveh_cat3+	0.20
	(0.15)
hhwrk1		0.51^***	1.37^***
		(0.03)	(0.22)
hhwrk2		1.03^***	2.87^***
		(0.03)	(0.24)
hhwrk3		1.39^***	2.71^***
		(0.05)	(0.41)
hhwrk4		1.88^***	2.73^***
		(0.10)	(0.82)
hhwrk5		2.54^***	5.32^*
		(0.29)	(2.40)
hhsiz_cat2			2.22^***
			(0.20)
hhsiz_cat3			5.75^***
			(0.26)
hhsiz_cat4			10.90^***
			(0.28)
hhsiz_cat5			14.51^***
			(0.41)
hhsiz_catMore Than 5			20.13^***
			(0.53)
R²	0.22	0.28	0.45
Adj. R²	0.22	0.28	0.45
Num. obs.	6449	6449	6449
p < 0.001; p < 0.01; p < 0.05

Ftest Failed Can’t find ModelA or ModelB

ModelA <- lm(hbw ~ hhwrk_cat * inc_cat_two + resty_cat * hhveh_cat, data=household_hbw_sub),
ModelB <- lm(hbw ~ hhwrk + hhveh_cat,  data=household_hbw_sub)

options(knitr.kable.NA = ’’) anova(ModelA , ModelB) %>% knitr::kable()

Write up - 2

I compared two models for home-based-work (hbw) trip generation. In the first model, I used variables from my 2nd Cross-Classification Table above, which includes household worker categories (hhwrk_cat) and income categories (inc_cat). Additionally, I created a third model predicting htrips.

I categorized variables is as follows: hhwrk_cat=case_when(hhwrk <= 1 ~ “0-1”, TRUE ~ “2 or More” ),

inc_cat=case_when(income == 1 ~ “0-24999”, income <= 2 ~ “25000-49999”, income >= 3 ~ “50000+”)

Model A results showed that the variables I had selected were not a particularly good way to predict the average number of home-based-work trips, as my R^2 was extremely low, scoring .22 of a goodness-of-fit. With this in mind

For Model B, I experimented with many variables categorized, non-categorized, and as factors rather than in numerical form, trying to maximize my R^2 and adjusted R^2 output, respectively. In the end, I used two variables, household workers uncategorized and the number of household vehicles categorized. I used the variables as two separate terms rather than a cross-classification. I did this because the household workers returned 0 observations for households with 4 and 5 workers with More than three vehicles. Model B Scored a bit better than Model A with an R^2 of .28. This is still not a great model for predicting hbw trips. However, I learned some exciting things.

A household with one worker is more likely to produce .5 more hbw trips than a household with no workers, and as that number increases from a household with one worker to a household with two workers, the number increase to 1 hbw trips per day. This makes sense that a household with one worker would take more hbw trips; however interesting that a household with no workers is still predicted to take half of a new trip. As the number of hh workers increases from 2 to 3, .36 trips are added rather than .5 like it did for the relationship between 0 and 1 hh workers. This suggests that the relationship of hh workers and hbw trips is not linear; rather, it is a curve. I think this makes sense as, within households with multiple workers, not all of them may be working full-time jobs which require the hbw trip daily.

The other term in Model B was a categorized version of hh vehicles. I categorized household vehicles into three buckets, 0 vehicles, 1-2 vehicles, and three or more vehicles. Households with 1-2 vehicles showed they took about .15 more hbw trips than households without vehicles. Households with three or more vehicles showed that they took .29 more trips than households with 1-2 vehicles.

Of the two models, I prefer the 2nd with the higher R^2. I couldn’t complete an F-test, so I really can’t say with high confidence that this is the case. However, considering my model inputs, it does make sense that Model B would score higher at predicting the dependent variable, home-based-work trips. Model 2 considers a critical, independent variable, an uncategorized version of household workers. This variable does better than the categorized version of household workers. I included the categorized version in Model A due to the necessity regarding the cross-classification tab.

A critique in Model B would has to do with 0 household workers making work trips. I think of this as an obvious weakness. However, on the whole, home-based-work trips going up incrementally and not at a linear rate kind of baffles me a bit. The only possible conclusion I came to about that was perhaps they weren’t making daily trips for a part-time job.

Model C was one I developed to predict average trips rather than home-based-work trips. This is primarily due to the lack of ability to get models scoring well in goodness-of-fit for the hbw trips. Additionally, I wanted to make as simple a model that would give me an R^2 of something greater than .28. My third Model scored the best with an R^2 of .45. I used the variable household workers again as well as household size. I tried a bunch of other variables and variations of variables, but this was the simplest Model with the same R^2 as using more variables.

In conclusion, I think modeling home-based-work trips is challenging and would love to see how someone else did it if they got a better fitting model. I’m not confident in any model I produced to predict hbw trips. However, I feel pretty decent that my Model for htrips is relatively strong. Unfortunately, I couldn’t figure out the F-test, which would, at a minimum, help me test ModelA and ModelB against a null hypothesis.