Mini projects are designed to allow you to work on real-world planning-related data. The first mini project presents THREE questions related to recent discussions in (transportation) planning.
Your tasks include:
That is, you not only wrangle data but also make sense of their patterns. To help you do so, each question has several subquestions that guide your process of data processing and analysis. Hopefully, this is a fun and useful exercise in your geosptial data analysis and visualization career.
Submission instruction:
#comments in your script file.Happy R scripting!
We will examine whether the use of ridehailing services leads to lifestyles with fewer/no personal vehicles. Existing studies on this issue are limited in part because of use of non-representative samples (e.g., Hampshire et al. 2017). Here, we use recently collected nation-wide representative travel survey data, the 2017 National Household Travel Survey (NHTS). Go to https://nhts.ornl.gov/, and download data by clicking Download Now! in the middle. Download csv files (alternatively, click here). This is a zipped file, and once unzipping, you will use the person table: perpub.csv
read_csv()# place perpub.csv in your working directory
# to check your working directory, getwd()
per <- read_csv(_____)
colnames(). To understand variable names, refer to the codebook.colnames(per)
select(). Then, exclude rows (i.e., individuals) younger than 18 or older than 65 (i.e., include those from 18 to 65(inclusive)) by filter(). Also, examine the values of RIDESHARE. What do negative values mean?per2 <- per %>%
select(_____) %>%
filter(_____)
HHVEHCNT, and the other indicating the level of using ridehailing services in the last 30 days (four levels: 0, 1-3, 4-7, and 8+) with RIDESHARE. Search about cut() for these recoding tasks: exampleper3 <- per2 %>%
mutate(
hhvehcat = cut(HHVEHCNT, _____),
rdshcat = cut(RIDESHARE, _____)
)
WTPERFIN). Sum the weight for all possible combinations of the two categorical variables by group_by and count()per4 <- per3 %>%
group_by(hhvehcat, rdshcat) %>%
count(wt = _____) %>%
group_by(_____) %>%
mutate(
share = _____,
label = paste(as.character(round(share*100, 1)), "%", sep = " ")
) %>%
ungroup()
# a small trick: reverse the order of the factor variable
# https://forcats.tidyverse.org/reference/fct_rev.html
per4$hhvehcat <- fct_rev(per4$hhvehcat)
ggplot(): on the x axis, you have frequency level categories, and on the y axis, you have the share of each vehicle ownership category. Make sure the height of all bars is always 100%.install.packages("RColorBrewer")
library(RColorBrewer)
ggplot(per4) +
geom_bar(aes(x = _____, y = _____, fill = _____),
stat = _____) +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(x = _____, y = _____, label = label), size = 3,
position = position_stack(vjust = .5)) +
coord_flip() +
scale_fill_brewer(palette = "RdYlGn") +
...
Transportation planners, researchers, and journalists speculate that the increasing availability of ridershailing services in large US cities may have caused reduction in transit ridership (e.g., Graehler, Mucci, & Erhardt, 2018). To further examine this hypothesis, we use data from two sources: National Transit Database (NTD) and Google Trend. First, download the latest NTD monthly ridership data. As a measure for monthly transit ridership, we use unlinked passegner trip (UPT) counts.
readxl package: a good referenceinstall.packages("tidyverse")
library(tidyverse)
library(readxl)
temp <- read_excel(_____)
excel_sheets(temp) # check the list of sheets
upt <- read_excel(temp, sheet = "UPT")
upt2 <- upt %>%
filter(!is.na(UZA) == TRUE) %>% # at the bottom, we have several rows of summaries
group_by(_____) %>%
summarize(
YEAR12 = _____,
YEAR13 = _____,
YEAR14 = _____,
YEAR15 = _____,
YEAR16 = _____,
YEAR17 = _____
) %>%
filter(YEAR12 + YEAR13 + YEAR14 + YEAR15 + YEAR16 + YEAR17 > 0 ) %>%
mutate( # percent change from one year to the next year
pct_change_1213 = (YEAR13-YEAR12)/YEAR12 * 100,
pct_change_1314 = ____________________________,
pct_change_1415 = ____________________________,
pct_change_1516 = ____________________________,
pct_change_1617 = ____________________________
)
min_rank() and filter().year and pct_change, contain values from the five variables (pct_change_1213:pct_change_1617) above.scale() to standardize raw percent change values for each year separately.upt2$UZA_size <- _____(upt2$YEAR12)
upt3 <- upt2 %>%
arrange(desc(UZA_size)) %>%
filter(UZA_size>=_____) %>%
select(-(YEAR13:YEAR17))
upt4 <- upt3 %>%
gather(_____:_____, key = year, value = pct_change) %>%
group_by(year) %>%
mutate(
pct_change2 = _____(pct_change)
)
gtrendsR package: the creator’s GitHub and simple examples. Unfortunately, Google recently changed their data format, which requires the developer version of gtrendsR. What do I mean by the developer version?install.packages().gtrends_tb below (weekly search results with the keyword “Uber” for each city separately from 2012 to 2017).if (!require("devtools")) install.packages("devtools")
devtools::install_github("PMassicotte/gtrendsR")
library(gtrendsR)
One critical issue is that, for the Google Trends index that varies by city, we need to specify city codes as an input. Unfortunately, extracting city codes from gtrendsR is not very intuitive (FYI, check here). Thus, a preprocessed data file is provided here to save your time. This file also contains two name columns: one from the gtrendsR package, and the other from the NTD. Download city_st_tb3.csv from my Dropbox account, which is made available to anyone with the link. Also, the same file is available on Canvas.
city_st_NTD <- read_csv(_____)
head(city_st_NTD)
Once the city codes are ready, run a for loop to extract the historical Google Trends indexes for 45 US cities (five cities among the top 50 UZA of the NTD data do not have city codes in gtrendsR).
gtrends_list <- list()
for (i in 1:45){
results <- gtrends(c("Uber"), geo = c(city_st_NTD$_____[i]), time = "2012-01-01 2017-12-31")
results2 <- as_tibble(results$interest_over_time) %>%
separate(date, into = c("year", "month_date"), sep = _____) %>%
group_by(_____) %>% # calculate means for each year
summarize(
gtrends = _____(hits, na.rm = TRUE)
)
results2$`UZA Name` <- city_st_NTD$NTD[i]
gtrends_list[[i]] <- results2
}
Since Rtools are not available on the Arch-W358 computers yet, download gtrends_list.rds from my Dropbox account, which is made available to anyone with the link. Also, the same file is available on Canvas
gtrends_tb <- as_tibble(do.call("rbind", gtrends_list))
spread the year and gtrends columnsgather with key year and value gtrendsscale the gtrends values for each year (first group_by by year).gtrends_tb2 <- gtrends_tb %>%
spread(_____, _____) %>%
mutate(
pct_change_1213 = (`2013`-`2012`)/`2012`*100,
pct_change_1314 = __________________________,
pct_change_1415 = __________________________,
pct_change_1516 = __________________________,
pct_change_1617 = __________________________
) %>%
filter(!is.na(`2012`) == TRUE) %>%
select(`UZA Name`, pct_change_1213:pct_change_1617) %>%
gather(pct_change_1213:pct_change_1617, key = _____, value = _____) %>%
group_by(_____) %>%
mutate(
gtrends2 = _____(gtrends)
)
upt4 and gtrends_tb2, create the following chart.color by year.log(YEAR12).y ~ x at 50th percentile.facet_wrap (by year).upt5 <- _____(upt4, gtrends_tb2)
upt5 %>%
ggplot(aes(y = _____, x = _____, color = _____)) +
geom_point(aes(size = log(YEAR12))) +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
geom_quantile(formula = _____, quantiles = c(0.5), size = 0.5) +
_____(~_____) +
xlim(-2, 2) # remove the outliers
The literature on the recent urban revival claims that young professionals, or those who are 25 to 34 years old with bachelor’s degrees, choose to live in neighborhoods with convenient access to businesses/places for social, entertainment, and cultural activities (e.g., cafes, bars, theaters, and art museums) (Couture & Hanbury, 2017). A related question is whether the location choice patterns of young professionals is associated with gentrification in inner city neighborhoods, many of which have rich access to those “consumption amenity”. Also, studies documenting the historical trends of young professionals’ location choice stopped at 2010 (e.g., Baum-Snow, 2016) in part because of data availability at the time of their analyses.
In this context, we will use the latest US Census American Community Survey (ACS) to further examine this issue. The latest update is 2017 5-year estimates, providing summary statistics at various geographies based on data collected for five years from 2013 to 2017.
tidycensus package (A good introduction by its author: why tidycensus?). Do so at the census tract level for the top 50 Urban Areas (UAs), defined by 2010 population counts. First, check the census tract list and the state/county FIPS code list. You need ua50_tracts_list.Rds from my Dropbox account, which is made available to anyone with the link. Also, the same file is available on Canvas.# a list of 27,245 census tracts (2010) in the top 50 UAs
ua50_tracts_list <- read_rds(_____)
head(ua50_tracts_list)
# a list of state and county fips codes for data extraction: nrow = 241 counties (in 37 states)
# you can create this from ua50_tracts_list
temp <- tibble(
st = rep("00", times = 27245),
cnty = rep("000", times = 27245)
)
temp$st <- substr(ua50_tracts_list$GEOID, 1, 2) # two-digit state FIPS code
temp$cnty <- substr(ua50_tracts_list$GEOID, 3, 5) # three-digit county FIPS code
ua50_stcntyfp_list <- distinct(as_tibble(temp)) # remove duplicates
head(ua50_stcntyfp_list)
Get your API here and feed it to your R environment.
install.packages("tidycensus")
library(tidycensus)
census_api_key("YOUR API KEY GOES HERE")
Use get_acs to extract several key variables at the census tract level:
B01001_001 : Total population countB01001_011, B01001_012 : Male age 25-29, 30-34B01001_035, B01001_036 : Female age 25-29, 30-34B19013_001 : Median household incomeB01001H_001 : Non-Hispanic WhiteB15002_001 : Population 25 years old or older whose educational attainment is determined.B15002_015, B15002_016, B15002_017, B15002_018 : Educational attainment (male)B15002_032, B15002_033, B15002_034, B15002_035: Educational attainment (female)Note that we do this twice, one for year 2010 and the other for 2017.
Since we send many API calls, it’s a good practice to pause between calls by Sys.sleep() (unit: second).
pop2010_list <- list()
for (i in 1:241){
pop2010_list[[i]] <- get_acs(state = ua50_stcntyfp_list$st[i],
county = ua50_stcntyfp_list$cnty[i],
geography = "tract",
variables = c("B01001_001", ... ),
year = 2010, output = "wide")
Sys.sleep(2) # pause for two seconds
}
pop2010 <- as_tibble(do.call("rbind", pop2010_list))
By the way, how can we know variable names?
v15 <- load_variables(2017, "acs5", cache = TRUE)
View(v15)
ua50_tracts_listand pop2010.group_by(), mutate(), and scale() (check how to use: StackOverFlow).data2010 <- _____(ua50_tracts_list, pop2010) %>%
mutate(
pop_2010 = ,
den_pop_2010 = pop_2010/ALAND *(1000000/0.386102159), # convert unit from sq meter to sq mile
pct_2534_2010 = _____,
pct_nhwhite_2010 = _____,
pct_college_2010 = _____,
med_hhinc_2010 = _____
) %>%
select(GEOID, UAGEOID, pop_2010, den_pop_2010, pct_2534_2010,
pct_nhwhite_2010, pct_college_2010, med_hhinc_2010) %>%
group_by(_____) %>%
mutate(
pct_nhwhite_2010 = _____(pct_nhwhite_2010),
pct_college_2010 = _____(pct_college_2010),
med_hhinc_2010 = _____(med_hhinc_2010),
ses_2010 = (pct_nhwhite_2010+ pct_college_2010+ med_hhinc_2010)/3
) %>%
ungroup()
pop2017 <- ...
data2017 <- ...
data2010 and data2017. Compute the difference in the % 25-34 and SES index. Again, scale pct_2534_diff by UAGEOID. Last but not least, create a new categorical variable, which divides all census tracts into five groups by the raw population density in 2010 (use cut_number).data2yr <- _____(data2010, data2017) %>%
mutate(
ses_diff = _____,
pct_2534_diff = _____
) %>%
group_by(_____) %>%
mutate(
pct_2534_diff = _____(pct_2534_diff)
) %>%
ungroup()
data2yr$group_den <- _____(data2yr$den_pop_2010, _____)
geom_point() and inside aes(), change color by group_den.facet_warp to split the chart into five sub-charts by group_den.ggplot(data2yr, aes(y = _____, x = _____)) +
geom_point(aes(color = _____)) +
geom_quantile(quantiles = 0.5, size = 0.5, alpha = 0.5, color = "red") +
_____(~ _____, nrow = 2) +
geom_hline(yintercept = 0, color = "grey50", alpha = 0.5) +
coord_cartesian(ylim=c(-7.5, 7.5), xlim=c(-1.5, 1.5))