In this assignment, I examine the transit (commute) time of workers in Texas metro areas. The population is adults (age>=18) who are in the labor force and live in a Texas Metropolitan Statistical Area (MSA).
First, we should load our libraries and the data:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(haven)
library(broom)
library(ggplot2)
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")
Here, I filter the target population from the IPUMS data set: residents of a Texas MSA who are at least 18 years old and in the labor force (labforce==2). I have also given names to the MSAs (which would otherwise be bare code numbers), and I filter out “000” observations of the trantime variable, which the IPUMS codebook identifies as “N/A” values:
#filters by age (greater than or equal to 18), presence in labor force and location in Texas Metro Statistical Area
new_ipums<-ipums %>% filter(statefip=="48" & met2013!=0 & age>=18 & labforce==2 & trantime!="000")
#adds names to the Texas MSA codes for ease of use
new_ipums<-new_ipums %>%
  mutate(MSA_NAME=case_when(met2013=="11100"~"Amarillo, TX",
                            met2013=="12420"~"Austin-Round Rock, TX",
                            met2013=="13140"~"Beaumont-Port Arthur, TX",
                            met2013=="15180"~"Brownsville-Harlingen, TX",
                            met2013=="17780"~"College Station-Bryan, TX",
                            met2013=="19100"~"Dallas-Ft Worth-Arlington, TX",
                            met2013=="21340"~"El Paso, TX",
                            met2013=="26420"~"Houston-The Woodlands-Sugar Land, TX",
                            met2013=="28660"~"Killeen-Temple, TX",
                            met2013=="29700"~"Laredo, TX",
                            met2013=="31180"~"Lubbock, TX",
                            met2013=="32580"~"McAllen-Edinburg-Mission, TX",
                            met2013=="33260"~"Midland, TX",
                            met2013=="36220"~"Odessa, TX",
                            met2013=="41660"~"San Angelo, TX",
                            met2013=="41700"~"San Antonio-New Braunfels, TX",
                            met2013=="48660"~"Wichita Falls, TX",
                            met2013=="47380"~"Waco, TX",
                            met2013=="46340"~"Tyler, TX",
                            met2013=="18580"~"Corpus Christi, TX"))
new_ipums$met2013<-as.factor(new_ipums$met2013)
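One thing worth checking about the filter above: if trantime was read in as a numeric column (an assumption on my part, since I have not inspected how read_dta stores it), then comparing it with the string "000" makes R convert each number to text first. The number 0 becomes "0", which is a different string from "000", so no rows would be dropped. The sketch below uses a made-up toy vector (not the IPUMS data) to show the difference between the two comparisons:

```r
# Toy vector standing in for a numeric travel-time column (hypothetical values)
trantime_toy <- c(0, 15, 30, 0, 45)

# Comparing a number with a string coerces the number to character,
# so 0 becomes "0", which is not equal to "000"
kept_string <- trantime_toy[trantime_toy != "000"]

# Comparing with the number 0 drops the zero codes
kept_number <- trantime_toy[trantime_toy != 0]

length(kept_string)  # how many values survive the string comparison
length(kept_number)  # how many values survive the numeric comparison
```

A quick way to test this on the real data would be sum(new_ipums$trantime == 0); a nonzero count would mean the “N/A” codes are still in the sample and are being averaged in as zero-minute commutes.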
Here, I will run standard descriptive statistics for our population and the transit-time variable:
#descriptive stats for trantime in Texas
new_ipums %>%
group_by(met2013) %>%
summarise(mean_transit_time=mean(trantime),sd=sd(trantime),n=n())
## # A tibble: 19 x 4
## met2013 mean_transit_time sd n
## <fctr> <dbl> <dbl> <int>
## 1 11100 16.76692 21.78368 133
## 2 12420 22.80106 18.89597 945
## 3 13140 21.01325 22.32188 151
## 4 15180 16.53846 11.43919 130
## 5 17780 20.37624 24.73291 101
## 6 18580 19.36161 23.53661 224
## 7 19100 24.97202 22.98984 3217
## 8 21340 21.04908 24.71307 326
## 9 26420 26.46783 22.44292 2456
## 10 29700 19.34343 20.35459 99
## 11 31180 15.99265 17.31345 136
## 12 32580 19.38806 22.01837 201
## 13 33260 19.40816 16.94466 49
## 14 36220 17.26190 12.81778 42
## 15 41660 25.04545 32.96507 44
## 16 41700 24.30786 22.46521 903
## 17 46340 21.75758 27.18271 99
## 18 47380 20.06716 18.82136 134
## 19 48660 11.98649 10.24962 74
Here we see that R has sorted the rows by “met2013”, the MSA code used as the grouping variable. We can rearrange this table to sort by mean transit time, from longest to shortest:
new_ipums %>%
group_by(met2013) %>%
summarise(mean_transit_time=mean(trantime),sd=sd(trantime),n=n()) %>%
arrange(desc(mean_transit_time))
## # A tibble: 19 x 4
## met2013 mean_transit_time sd n
## <fctr> <dbl> <dbl> <int>
## 1 26420 26.46783 22.44292 2456
## 2 41660 25.04545 32.96507 44
## 3 19100 24.97202 22.98984 3217
## 4 41700 24.30786 22.46521 903
## 5 12420 22.80106 18.89597 945
## 6 46340 21.75758 27.18271 99
## 7 21340 21.04908 24.71307 326
## 8 13140 21.01325 22.32188 151
## 9 17780 20.37624 24.73291 101
## 10 47380 20.06716 18.82136 134
## 11 33260 19.40816 16.94466 49
## 12 32580 19.38806 22.01837 201
## 13 18580 19.36161 23.53661 224
## 14 29700 19.34343 20.35459 99
## 15 36220 17.26190 12.81778 42
## 16 11100 16.76692 21.78368 133
## 17 15180 16.53846 11.43919 130
## 18 31180 15.99265 17.31345 136
## 19 48660 11.98649 10.24962 74
We see that “26420” is the metro area with the longest mean transit time. But which metro area is that? We can group by the character string I created above (MSA_NAME) to find out:
new_ipums %>%
group_by(MSA_NAME) %>%
summarise(mean_transit_time=mean(trantime),sd=sd(trantime),n=n()) %>%
arrange(desc(mean_transit_time))
## # A tibble: 19 x 4
## MSA_NAME mean_transit_time sd n
## <chr> <dbl> <dbl> <int>
## 1 Houston-The Woodlands-Sugar Land, TX 26.46783 22.44292 2456
## 2 San Angelo, TX 25.04545 32.96507 44
## 3 Dallas-Ft Worth-Arlington, TX 24.97202 22.98984 3217
## 4 San Antonio-New Braunfels, TX 24.30786 22.46521 903
## 5 Austin-Round Rock, TX 22.80106 18.89597 945
## 6 Tyler, TX 21.75758 27.18271 99
## 7 El Paso, TX 21.04908 24.71307 326
## 8 Beaumont-Port Arthur, TX 21.01325 22.32188 151
## 9 College Station-Bryan, TX 20.37624 24.73291 101
## 10 Waco, TX 20.06716 18.82136 134
## 11 Midland, TX 19.40816 16.94466 49
## 12 McAllen-Edinburg-Mission, TX 19.38806 22.01837 201
## 13 Corpus Christi, TX 19.36161 23.53661 224
## 14 Laredo, TX 19.34343 20.35459 99
## 15 Odessa, TX 17.26190 12.81778 42
## 16 Amarillo, TX 16.76692 21.78368 133
## 17 Brownsville-Harlingen, TX 16.53846 11.43919 130
## 18 Lubbock, TX 15.99265 17.31345 136
## 19 Wichita Falls, TX 11.98649 10.24962 74
Houston (unsurprisingly) has the longest mean travel time for our population, and Wichita Falls has the shortest. The boxplots below show the same pattern. First, we plot by “met2013”, with the MSAs ordered by mean transit time:
new_ipums %>%
ggplot() +
geom_boxplot(aes(x=reorder(met2013,trantime,FUN=mean),y=trantime)) +
xlab(label="Metro Statistical Area")+
ylab(label="Transit Time") +
coord_flip()
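The ordering in this plot comes from reorder(), which rebuilds the factor levels so that they follow the per-group value of FUN (here the mean), smallest first. A minimal sketch on a made-up toy data frame (the names toy, msa and time are hypothetical, not part of the IPUMS data):

```r
# Toy data: three made-up areas with two commute times each
toy <- data.frame(
  msa  = c("A", "A", "B", "B", "C", "C"),
  time = c(30, 40, 5, 15, 20, 20)
)

# Group means are A = 35, B = 10, C = 20
ordered_msa <- reorder(toy$msa, toy$time, FUN = mean)

# Levels now run from the lowest group mean to the highest
levels(ordered_msa)
```

Because coord_flip() draws the first level at the bottom of the axis, the group with the longest mean ends up at the top of the flipped boxplot.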
And then, for the sake of clarity, we add metro area names from MSA_NAME to the same boxplot:
new_ipums %>%
ggplot() +
geom_boxplot(aes(x=reorder(MSA_NAME,trantime,FUN=mean),y=trantime)) +
xlab(label="Metro Statistical Area")+
ylab(label="Transit Time") +
coord_flip()
Here once again, we see the data sorted by mean transit time, but this time with the city names visible. Houston-The Woodlands-Sugar Land continues to have the longest mean commute, and Wichita Falls the shortest. We also observe that there are many large outliers in the data.
Here we plot a histogram of new_ipums$trantime to look at the shape of the distribution and confirm the existence of outliers:
hist(new_ipums$trantime)
It appears that the distribution is fairly skewed, with a long right tail!
Here, I will perform ANOVA using trantime as the outcome, and met2013 as the grouping variable:
trantime_fit<-new_ipums %>%
lm(trantime~met2013,data=.)
anova(trantime_fit)
## Analysis of Variance Table
##
## Response: trantime
## Df Sum Sq Mean Sq F value Pr(>F)
## met2013 18 75244 4180.2 8.5119 < 2.2e-16 ***
## Residuals 9445 4638523 491.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see, then, that with an F value of 8.5119 on 18 and 9445 degrees of freedom and a p-value below 2.2e-16 (R reports only this upper bound, not an exact value), we can reject the null hypothesis that all of the MSA means are equal. There is a statistically significant difference in mean transit time between at least some of the Metropolitan Statistical Areas of Texas.
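As a check on how the table fits together, the F value is just the between-group mean square divided by the residual mean square. The sketch below redoes that arithmetic from the rounded numbers printed in the ANOVA table, so the result can differ from R's own output in the last decimal place; the variable names are my own:

```r
# Values copied from the printed ANOVA table (already rounded by R)
ss_between <- 75244
df_between <- 18      # 19 MSAs minus 1
ss_within  <- 4638523
df_within  <- 9445    # 9464 observations minus 19 MSAs

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F statistic: ratio of the two mean squares (about 8.51)
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution at that value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

round(f_value, 2)
p_value < 2.2e-16
```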
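Here, I have refit the linear model of Texas MSA transit times with “Houston-The Woodlands-Sugar Land” (26420) as the reference level. Each coefficient is now the difference between that MSA's mean transit time and Houston's. When printed out, we see that every estimate is negative, and every MSA except San Angelo (41660, p = 0.67) differs from Houston at the unadjusted 0.05 level: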
new_ipums$met2013b<-relevel(as.factor(new_ipums$met2013),ref="26420")
trantime_fit2<-lm(trantime~met2013b,data=new_ipums)
tidy(trantime_fit2)
## term estimate std.error statistic p.value
## 1 (Intercept) 26.467834 0.4471721 59.1893676 0.000000e+00
## 2 met2013b11100 -9.700917 1.9729447 -4.9169735 8.937290e-07
## 3 met2013b12420 -3.666776 0.8483247 -4.3223729 1.559420e-05
## 4 met2013b13140 -5.454589 1.8580472 -2.9356568 3.336447e-03
## 5 met2013b15180 -9.929372 1.9944230 -4.9785688 6.518855e-07
## 6 met2013b17780 -6.091596 2.2499837 -2.7073957 6.793538e-03
## 7 met2013b18580 -7.106227 1.5467427 -4.5943172 4.397936e-06
## 8 met2013b19100 -1.495810 0.5938210 -2.5189581 1.178664e-02
## 9 met2013b21340 -5.418754 1.3063046 -4.1481551 3.381007e-05
## 10 met2013b29700 -7.124400 2.2717082 -3.1361419 1.717118e-03
## 11 met2013b31180 -10.475187 1.9521930 -5.3658562 8.247355e-08
## 12 met2013b32580 -7.079774 1.6258200 -4.3545868 1.347239e-05
## 13 met2013b33260 -7.059671 3.1972789 -2.2080246 2.726649e-02
## 14 met2013b36220 -9.205929 3.4486274 -2.6694473 7.610613e-03
## 15 met2013b41660 -1.422379 3.3706865 -0.4219851 6.730455e-01
## 16 met2013b41700 -2.159971 0.8624538 -2.5044487 1.228096e-02
## 17 met2013b46340 -4.710258 2.2717082 -2.0734432 3.815809e-02
## 18 met2013b47380 -6.400670 1.9659487 -3.2557664 1.134853e-03
## 19 met2013b48660 -14.481347 2.6146833 -5.5384708 3.133030e-08
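The coefficients in this table can be read straight off the group means from earlier. The intercept (26.467834) is Houston's mean transit time, and each other estimate is that MSA's mean minus Houston's; for Wichita Falls, 11.98649 - 26.46783 is about -14.4813, which matches the last row. A minimal sketch of the same relationship on a made-up toy data frame (toy, msa and time are hypothetical names and values, not the IPUMS data):

```r
# Toy data: three made-up areas with two commute times each
toy <- data.frame(
  msa  = factor(c("A", "A", "B", "B", "C", "C")),
  time = c(30, 40, 5, 15, 20, 20)
)

# Make "C" the reference level, as was done with Houston above
toy$msa_b <- relevel(toy$msa, ref = "C")
fit <- lm(time ~ msa_b, data = toy)

# Group means computed directly, for comparison with the coefficients
group_means <- tapply(toy$time, toy$msa, mean)

# Intercept equals the mean of C; the other coefficients equal
# mean(A) - mean(C) and mean(B) - mean(C)
coef(fit)
group_means - group_means["C"]
```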