Transit time by city (in Texas)

In this assignment, I examine the transit (commute) time of commuters in Texas metro areas. The commuters are adults (age>=18) who are in the labor force and live in a Texas Metropolitan Statistical Area (MSA).

First, we should load our libraries and the data:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(haven)
library(broom)
library(ggplot2)
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")

Filtering our target population

Here, I filter the target population from the IPUMS data set: residents of a Texas MSA who are at least 18 years old and in the labor force (labforce==2). I have also given names to the MSAs, which would otherwise be identified only by numeric codes. The filter is also meant to drop “000” values of the trantime variable, which the IPUMS codebook identifies as “NA”:

#filters by age (greater than or equal to 18), presence in labor force and location in Texas Metro Statistical Area
new_ipums<-ipums %>% filter(statefip=="48" & met2013!=0 & age>=18 & labforce==2 & trantime!="000")
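
One point worth checking at this step is how the comparison against "000" behaves. When R compares a number with a string, it converts the number to text first, so a numeric 0 becomes "0", which is not equal to "000". The sketch below uses a small made-up vector (not the IPUMS data) and assumes trantime is stored as a number, which I have not verified for this file. If that assumption holds, a numeric comparison such as trantime != 0 would be the safer test.

```r
# Toy stand-in for the IPUMS column (hypothetical values);
# assumes trantime is stored as a number rather than as text.
trantime <- c(0, 15, 30, 0, 45)

# Number vs. string: each number is coerced to text, so 0 becomes "0",
# and "0" != "000" is TRUE. No zeros are dropped.
keep_string <- trantime != "000"

# Number vs. number: zeros are flagged for removal as intended.
keep_numeric <- trantime != 0

sum(keep_string)   # rows kept by the string comparison
sum(keep_numeric)  # rows kept by the numeric comparison
```

Running table(new_ipums$trantime == 0) after the filter is a quick way to see which case applies to the real data.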

#adds names to the Texas MSAs for ease of use
new_ipums<-new_ipums %>%
  mutate(MSA_NAME=case_when(met2013=="11100"~"Amarillo, TX",
                            met2013=="12420"~"Austin-Round Rock, TX",
                            met2013=="13140"~"Beaumont-Port Arthur, TX",
                            met2013=="15180"~"Brownsville-Harlingen, TX",
                            met2013=="17780"~"College Station-Bryan, TX",
                            met2013=="19100"~"Dallas-Ft Worth-Arlington, TX",
                            met2013=="21340"~"El Paso, TX",
                            met2013=="26420"~"Houston-The Woodlands-Sugar Land, TX",
                            met2013=="28660"~"Killeen-Temple, TX",
                            met2013=="29700"~"Laredo, TX",
                            met2013=="31180"~"Lubbock, TX",
                            met2013=="32580"~"McAllen-Edinburg-Mission, TX",
                            met2013=="33260"~"Midland, TX",
                            met2013=="36220"~"Odessa, TX",
                            met2013=="41660"~"San Angelo, TX",
                            met2013=="41700"~"San Antonio-New Braunfels, TX",
                            met2013=="48660"~"Wichita Falls, TX",
                            met2013=="47380"~"Waco, TX",
                            met2013=="46340"~"Tyler, TX",
                            met2013=="18580"~"Corpus Christi, TX"))
#stores the MSA code as a factor so it can be used as a grouping variable
new_ipums$met2013<-as.factor(new_ipums$met2013)

Descriptive Statistics for Transit Time

Here, I will run standard descriptive statistics for our population and the transit-time variable:

#descriptive stats for trantime in Texas
new_ipums %>%
  group_by(met2013) %>%
  summarise(mean_transit_time=mean(trantime),sd=sd(trantime),n=n())
## # A tibble: 19 x 4
##    met2013 mean_transit_time       sd     n
##     <fctr>             <dbl>    <dbl> <int>
##  1   11100          16.76692 21.78368   133
##  2   12420          22.80106 18.89597   945
##  3   13140          21.01325 22.32188   151
##  4   15180          16.53846 11.43919   130
##  5   17780          20.37624 24.73291   101
##  6   18580          19.36161 23.53661   224
##  7   19100          24.97202 22.98984  3217
##  8   21340          21.04908 24.71307   326
##  9   26420          26.46783 22.44292  2456
## 10   29700          19.34343 20.35459    99
## 11   31180          15.99265 17.31345   136
## 12   32580          19.38806 22.01837   201
## 13   33260          19.40816 16.94466    49
## 14   36220          17.26190 12.81778    42
## 15   41660          25.04545 32.96507    44
## 16   41700          24.30786 22.46521   903
## 17   46340          21.75758 27.18271    99
## 18   47380          20.06716 18.82136   134
## 19   48660          11.98649 10.24962    74

Here we see that R has sorted the data numerically using “met2013”, the MSA code that has been used as a grouping variable. We can rearrange this table to sort by Mean Transit Time:

new_ipums %>%
  group_by(met2013) %>%
  summarise(mean_transit_time=mean(trantime),sd=sd(trantime),n=n()) %>%
  arrange(desc(mean_transit_time))
## # A tibble: 19 x 4
##    met2013 mean_transit_time       sd     n
##     <fctr>             <dbl>    <dbl> <int>
##  1   26420          26.46783 22.44292  2456
##  2   41660          25.04545 32.96507    44
##  3   19100          24.97202 22.98984  3217
##  4   41700          24.30786 22.46521   903
##  5   12420          22.80106 18.89597   945
##  6   46340          21.75758 27.18271    99
##  7   21340          21.04908 24.71307   326
##  8   13140          21.01325 22.32188   151
##  9   17780          20.37624 24.73291   101
## 10   47380          20.06716 18.82136   134
## 11   33260          19.40816 16.94466    49
## 12   32580          19.38806 22.01837   201
## 13   18580          19.36161 23.53661   224
## 14   29700          19.34343 20.35459    99
## 15   36220          17.26190 12.81778    42
## 16   11100          16.76692 21.78368   133
## 17   15180          16.53846 11.43919   130
## 18   31180          15.99265 17.31345   136
## 19   48660          11.98649 10.24962    74

We see that “26420” is the metro area with the longest mean transit time. But which metro area is that? We can group by the character string I created above (MSA_NAME) to find out:

new_ipums %>%
  group_by(MSA_NAME) %>%
  summarise(mean_transit_time=mean(trantime),sd=sd(trantime),n=n()) %>%
  arrange(desc(mean_transit_time))
## # A tibble: 19 x 4
##                                MSA_NAME mean_transit_time       sd     n
##                                   <chr>             <dbl>    <dbl> <int>
##  1 Houston-The Woodlands-Sugar Land, TX          26.46783 22.44292  2456
##  2                       San Angelo, TX          25.04545 32.96507    44
##  3        Dallas-Ft Worth-Arlington, TX          24.97202 22.98984  3217
##  4        San Antonio-New Braunfels, TX          24.30786 22.46521   903
##  5                Austin-Round Rock, TX          22.80106 18.89597   945
##  6                            Tyler, TX          21.75758 27.18271    99
##  7                          El Paso, TX          21.04908 24.71307   326
##  8             Beaumont-Port Arthur, TX          21.01325 22.32188   151
##  9            College Station-Bryan, TX          20.37624 24.73291   101
## 10                             Waco, TX          20.06716 18.82136   134
## 11                          Midland, TX          19.40816 16.94466    49
## 12         McAllen-Edinburg-Mission, TX          19.38806 22.01837   201
## 13                   Corpus Christi, TX          19.36161 23.53661   224
## 14                           Laredo, TX          19.34343 20.35459    99
## 15                           Odessa, TX          17.26190 12.81778    42
## 16                         Amarillo, TX          16.76692 21.78368   133
## 17            Brownsville-Harlingen, TX          16.53846 11.43919   130
## 18                          Lubbock, TX          15.99265 17.31345   136
## 19                    Wichita Falls, TX          11.98649 10.24962    74
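
The group sizes in this table range from 42 to 3,217, so the means are not all estimated with the same precision. As a rough illustration, the sketch below computes the standard error of the mean (sd divided by the square root of n) for three rows copied from the table. The object name msa_stats and the column se are my own labels for this example and are not part of the analysis above.

```r
# Three rows copied from the summary table above
msa_stats <- data.frame(
  MSA_NAME = c("Houston-The Woodlands-Sugar Land, TX",
               "San Angelo, TX",
               "Wichita Falls, TX"),
  mean_transit_time = c(26.46783, 25.04545, 11.98649),
  sd = c(22.44292, 32.96507, 10.24962),
  n = c(2456, 44, 74)
)

# Standard error of the mean: sd / sqrt(n)
msa_stats$se <- msa_stats$sd / sqrt(msa_stats$n)

print(msa_stats)
```

San Angelo's mean is close to Houston's, but it rests on only 44 observations, so its standard error is roughly ten times larger.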

Houston (unsurprisingly) has the longest mean travel time for our population, and Wichita Falls has the shortest. The graphs below show the same pattern. First, we plot transit time by “met2013”, with the MSAs ordered by mean transit time:

new_ipums %>%
  ggplot() +
  geom_boxplot(aes(x=reorder(met2013,trantime,FUN=mean),y=trantime)) +
  xlab(label="Metro Statistical Area")+
  ylab(label="Transit Time") +
  coord_flip()

And then, for the sake of clarity, we add metro area names from MSA_NAME to the same boxplot:

new_ipums %>%
  ggplot() +
  geom_boxplot(aes(x=reorder(MSA_NAME,trantime,FUN=mean),y=trantime)) +
  xlab(label="Metro Statistical Area")+
  ylab(label="Transit Time") +
  coord_flip()

Here once again, we see the data sorted by mean transit time, but this time with the MSA names visible. Houston-The Woodlands-Sugar Land continues to have the longest mean commute, and Wichita Falls the shortest. We also observe that there are many outliers in the data.

Here we plot a histogram of new_ipums$trantime to check the shape of the distribution and confirm the outliers:

hist(new_ipums$trantime)

It appears that the distribution is fairly skewed!

ANOVA Table / F Test

Here, I will perform ANOVA using trantime as the outcome, and met2013 as the grouping variable:

trantime_fit<-new_ipums %>%
  lm(trantime~met2013,data=.)
anova(trantime_fit)
## Analysis of Variance Table
## 
## Response: trantime
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## met2013     18   75244  4180.2  8.5119 < 2.2e-16 ***
## Residuals 9445 4638523   491.1                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With an F value of 8.5119 and a p-value below 2.2e-16, we can reject the null hypothesis that all Texas MSAs share the same mean transit time. The F test is an omnibus test: it tells us that at least one MSA differs from the others in mean transit time, but not which ones.
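
To make the F value less of a black box, it can be rebuilt by hand from the sums of squares and degrees of freedom in the ANOVA table. The sketch below copies those four numbers from the output above; the variable names are my own. Because the sums of squares in the printed table are rounded, the result agrees with the reported 8.5119 only to about three decimal places.

```r
# Values copied from the ANOVA table above
ss_between <- 75244     # Sum Sq for met2013
df_between <- 18        # Df for met2013
ss_resid   <- 4638523   # Sum Sq for Residuals
df_resid   <- 9445      # Df for Residuals

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_resid   <- ss_resid / df_resid

# The F statistic is the ratio of the two mean squares
f_value <- ms_between / ms_resid

# Upper-tail probability under the F distribution with (18, 9445) df
p_value <- pf(f_value, df_between, df_resid, lower.tail = FALSE)

f_value
p_value
```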

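Here, I have releveled the MSA factor so that Houston-The Woodlands-Sugar Land (26420) is the reference category, and refit the linear model. Each coefficient is then the difference between that MSA’s mean transit time and Houston’s. Every estimate is negative, and every MSA except San Angelo (41660, p = 0.67) differs from Houston at the 0.05 level (these p-values are not adjusted for multiple comparisons):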

new_ipums$met2013b<-relevel(as.factor(new_ipums$met2013),ref="26420")
trantime_fit2<-lm(trantime~met2013b,data=new_ipums)
tidy(trantime_fit2)
##             term   estimate std.error  statistic      p.value
## 1    (Intercept)  26.467834 0.4471721 59.1893676 0.000000e+00
## 2  met2013b11100  -9.700917 1.9729447 -4.9169735 8.937290e-07
## 3  met2013b12420  -3.666776 0.8483247 -4.3223729 1.559420e-05
## 4  met2013b13140  -5.454589 1.8580472 -2.9356568 3.336447e-03
## 5  met2013b15180  -9.929372 1.9944230 -4.9785688 6.518855e-07
## 6  met2013b17780  -6.091596 2.2499837 -2.7073957 6.793538e-03
## 7  met2013b18580  -7.106227 1.5467427 -4.5943172 4.397936e-06
## 8  met2013b19100  -1.495810 0.5938210 -2.5189581 1.178664e-02
## 9  met2013b21340  -5.418754 1.3063046 -4.1481551 3.381007e-05
## 10 met2013b29700  -7.124400 2.2717082 -3.1361419 1.717118e-03
## 11 met2013b31180 -10.475187 1.9521930 -5.3658562 8.247355e-08
## 12 met2013b32580  -7.079774 1.6258200 -4.3545868 1.347239e-05
## 13 met2013b33260  -7.059671 3.1972789 -2.2080246 2.726649e-02
## 14 met2013b36220  -9.205929 3.4486274 -2.6694473 7.610613e-03
## 15 met2013b41660  -1.422379 3.3706865 -0.4219851 6.730455e-01
## 16 met2013b41700  -2.159971 0.8624538 -2.5044487 1.228096e-02
## 17 met2013b46340  -4.710258 2.2717082 -2.0734432 3.815809e-02
## 18 met2013b47380  -6.400670 1.9659487 -3.2557664 1.134853e-03
## 19 met2013b48660 -14.481347 2.6146833 -5.5384708 3.133030e-08
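
The coefficients tie back directly to the group means reported earlier: the intercept (26.467834) is Houston's mean, and adding a coefficient recovers that MSA's mean, for example 26.467834 - 14.481347 = 11.986487 for Wichita Falls. The sketch below shows the same relationship on simulated data. The data frame toy, the groups A, B and C, and the column minutes are all invented for illustration and have nothing to do with the IPUMS extract.

```r
# Simulated data: three hypothetical groups with different mean commute times
set.seed(1)
toy <- data.frame(
  msa = factor(rep(c("A", "B", "C"), each = 20)),
  minutes = c(rnorm(20, mean = 26), rnorm(20, mean = 20), rnorm(20, mean = 12))
)

# Make group A the reference category, as was done for Houston above
toy$msa <- relevel(toy$msa, ref = "A")
fit <- lm(minutes ~ msa, data = toy)

# Plain group means for comparison
group_means <- tapply(toy$minutes, toy$msa, mean)

coef(fit)["(Intercept)"]                      # same as the mean of group A
coef(fit)["(Intercept)"] + coef(fit)["msaB"]  # same as the mean of group B
group_means
```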

Recap and Conclusion:

F TEST: F(18, 9445) = 8.5119*** (p < 2.2e-16)

Longest average commute time: Houston-The Woodlands-Sugar Land MSA

Shortest average commute time: Wichita Falls MSA