1st Mini-project

Mini projects are designed to allow you to work on real-world planning-related data. The first mini project presents THREE questions related to recent discussions in (transportation) planning.

Ridehailing and Vehicle Ownership
Ridehailing and Transit Ridership
Neighborhood Change and Young Adults in the Central City

Your tasks include:

Understand data and variables
Transform data
Visualize data
Interpret patterns in your visualization

That is, you not only wrangle data but also make sense of their patterns. To help you do so, each question has several subquestions that guide your process of data processing and analysis. Hopefully, this is a fun and useful exercise in your geosptial data analysis and visualization career.

Submission instruction:

Due: 11:59:00 PM on February 17 (Sunday), 2019
Where: Canvas
Deliverables: your R script file and images. Include your discussion as #comments in your script file.

Happy R scripting!

1. Ridehailing and Vehicle Onwership

Background

We will examine whether the use of ridehailing services leads to lifestyles with fewer/no personal vehicles. Existing studies on this issue are limited in part because of use of non-representative samples (e.g., Hampshire et al. 2017). Here, we use recently collected nation-wide representative travel survey data, the 2017 National Household Travel Survey (NHTS). Go to https://nhts.ornl.gov/, and download data by clicking Download Now! in the middle. Download csv files (alternatively, click here). This is a zipped file, and once unzipping, you will use the person table: perpub.csv

Read in perpub.csv by read_csv()

# place perpub.csv in your working directory 
# to check your working directory, getwd() 
per <- read_csv(_____)

Check column names by colnames(). To understand variable names, refer to the codebook.

colnames(per)

Create a new dataframe only with six variables (household id, person id, person age, frequency of using ridehailing app in the last 30 days, household vehicle count, and personal weights) by select(). Then, exclude rows (i.e., individuals) younger than 18 or older than 65 (i.e., include those from 18 to 65(inclusive)) by filter(). Also, examine the values of RIDESHARE. What do negative values mean?

per2 <- per %>% 
  select(_____) %>% 
  filter(_____)

Compute two categorical variables: one denoting the number of household vehicles (four levels: 0, 1, 2, & 3+) with HHVEHCNT, and the other indicating the level of using ridehailing services in the last 30 days (four levels: 0, 1-3, 4-7, and 8+) with RIDESHARE. Search about cut() for these recoding tasks: example

per3 <- per2 %>% 
  mutate(
    hhvehcat = cut(HHVEHCNT, _____), 
    rdshcat = cut(RIDESHARE, _____)
  )

To generate weighted summary statistics, you need to use the person weight (WTPERFIN). Sum the weight for all possible combinations of the two categorical variables by group_by and count()

per4 <- per3 %>%
  group_by(hhvehcat, rdshcat) %>%
  count(wt = _____) %>%
  group_by(_____) %>%
  mutate(
    share = _____, 
    label = paste(as.character(round(share*100, 1)), "%", sep = " ")
  ) %>%
  ungroup()

# a small trick: reverse the order of the factor variable 
# https://forcats.tidyverse.org/reference/fct_rev.html
per4$hhvehcat <- fct_rev(per4$hhvehcat)

Plot a bar chart by ggplot(): on the x axis, you have frequency level categories, and on the y axis, you have the share of each vehicle ownership category. Make sure the height of all bars is always 100%.

install.packages("RColorBrewer")
library(RColorBrewer)

ggplot(per4) + 
  geom_bar(aes(x = _____, y = _____, fill = _____), 
           stat = _____) + 
  scale_y_continuous(labels = scales::percent) + 
  geom_text(aes(x = _____, y = _____, label = label), size = 3, 
            position = position_stack(vjust = .5)) + 
  coord_flip() + 
  scale_fill_brewer(palette = "RdYlGn") + 
  ...

Explain patterns in your chart.

Do you see any correlation between more ridehailng use and fewer household vehicles?
How do you explain the patterns in your chart? (e.g., one causes the other, or a 3rd variable explains both.)
Suggest alternative visualization schemes that examine the relationships between ridehialing use and number of household vehicles.

2. Ridehailing and Transit Ridership

Background

Transportation planners, researchers, and journalists speculate that the increasing availability of ridershailing services in large US cities may have caused reduction in transit ridership (e.g., Graehler, Mucci, & Erhardt, 2018). To further examine this hypothesis, we use data from two sources: National Transit Database (NTD) and Google Trend. First, download the latest NTD monthly ridership data. As a measure for monthly transit ridership, we use unlinked passegner trip (UPT) counts.

Read in data from the UPT sheet of November 2018 Adjusted Database.xlsx with the readxl package: a good reference

install.packages("tidyverse")
library(tidyverse)
library(readxl)

temp <- read_excel(_____)
excel_sheets(temp) # check the list of sheets

upt <- read_excel(temp, sheet = "UPT")

Make summaries for each urban area (UZA) for each year. In the UPT sheet, individual rows present monthly ridership for transit services operated by transit agencies, and most UZAs have multiple transit agencies. Thus, we will sum UPT counts by UZA. Also, monthly counts show seasonality (e.g., higher UPTs under mild weather). Thus, we will sum monthly UPT by year. Then, compute percent change from one year to the next.

upt2 <- upt %>%
  filter(!is.na(UZA) == TRUE) %>% # at the bottom, we have several rows of summaries 
  group_by(_____) %>%
  summarize(
    YEAR12 = _____, 
    YEAR13 = _____, 
    YEAR14 = _____, 
    YEAR15 = _____, 
    YEAR16 = _____, 
    YEAR17 = _____ 
  ) %>% 
  filter(YEAR12 + YEAR13 + YEAR14 + YEAR15 + YEAR16 + YEAR17 > 0 ) %>% 
  mutate( # percent change from one year to the next year
    pct_change_1213 = (YEAR13-YEAR12)/YEAR12 * 100, 
    pct_change_1314 = ____________________________,
    pct_change_1415 = ____________________________,
    pct_change_1516 = ____________________________,
    pct_change_1617 = ____________________________ 
  )

Do a little more data wrangling to reformat the data.

Filter only top 50 UZA by their 2012 UPT counts with min_rank() and filter().
Transform data so that two new columns, year and pct_change, contain values from the five variables (pct_change_1213:pct_change_1617) above.
Use scale() to standardize raw percent change values for each year separately.

upt2$UZA_size <- _____(upt2$YEAR12)

upt3 <- upt2 %>%
  arrange(desc(UZA_size)) %>% 
  filter(UZA_size>=_____) %>%
  select(-(YEAR13:YEAR17))

upt4 <- upt3 %>% 
  gather(_____:_____, key = year, value = pct_change) %>%
  group_by(year) %>% 
  mutate(
    pct_change2 = _____(pct_change)
  )

Next, we will extract data from Google Trends. It provides the weekly frequency of certain search keywords on Google since January 2004. A recent study employed the Google Trends index for “Uber” as a proxy for the supply of Uber drivers in a given city (Hall, Palsson, & Price, 2018). We follow their approach here. How? Instead of directly extracting Google Trends data, we will use the gtrendsR package: the creator’s GitHub and simple examples. Unfortunately, Google recently changed their data format, which requires the developer version of gtrendsR. What do I mean by the developer version?

Why do we need it?
If you plan to use your own machine, you need to first install the latest version of Rtools. Of course, close RStudio, install Rtools, and restart RStudio. Then, you type the below script instead of install.packages().
If you plan to use lab machines, I will update you after talking to the College of Design IT support. If they decline to install Rtools on the lab machines, I will post the Google Trends data file here, or gtrends_tb below (weekly search results with the keyword “Uber” for each city separately from 2012 to 2017).

if (!require("devtools")) install.packages("devtools")
devtools::install_github("PMassicotte/gtrendsR")
library(gtrendsR)

One critical issue is that, for the Google Trends index that varies by city, we need to specify city codes as an input. Unfortunately, extracting city codes from gtrendsR is not very intuitive (FYI, check here). Thus, a preprocessed data file is provided here to save your time. This file also contains two name columns: one from the gtrendsR package, and the other from the NTD. Download city_st_tb3.csv from my Dropbox account, which is made available to anyone with the link. Also, the same file is available on Canvas.

city_st_NTD <- read_csv(_____)
head(city_st_NTD)

Once the city codes are ready, run a for loop to extract the historical Google Trends indexes for 45 US cities (five cities among the top 50 UZA of the NTD data do not have city codes in gtrendsR).

gtrends_list <- list()

for (i in 1:45){
  results <- gtrends(c("Uber"), geo = c(city_st_NTD$_____[i]), time = "2012-01-01 2017-12-31")
  results2 <- as_tibble(results$interest_over_time) %>% 
    separate(date, into = c("year", "month_date"), sep = _____) %>%
    group_by(_____) %>% # calculate means for each year 
    summarize(
      gtrends = _____(hits, na.rm = TRUE) 
    ) 
  results2$`UZA Name` <- city_st_NTD$NTD[i]
  gtrends_list[[i]] <- results2
}

Since Rtools are not available on the Arch-W358 computers yet, download gtrends_list.rds from my Dropbox account, which is made available to anyone with the link. Also, the same file is available on Canvas

gtrends_tb <- as_tibble(do.call("rbind", gtrends_list))

Do a little data wrangling on the Google Trends indexes.

spread the year and gtrends columns
Calculate percent change in Google Trends indexes from year to year
gather with key year and value gtrends
scale the gtrends values for each year (first group_by by year).

gtrends_tb2 <- gtrends_tb %>% 
  spread(_____, _____) %>% 
  mutate(
    pct_change_1213 = (`2013`-`2012`)/`2012`*100, 
    pct_change_1314 = __________________________, 
    pct_change_1415 = __________________________, 
    pct_change_1516 = __________________________, 
    pct_change_1617 = __________________________
  ) %>% 
  filter(!is.na(`2012`) == TRUE) %>% 
  select(`UZA Name`, pct_change_1213:pct_change_1617) %>% 
  gather(pct_change_1213:pct_change_1617, key = _____, value = _____) %>%
  group_by(_____) %>% 
  mutate(
    gtrends2 = _____(gtrends)
  )

After joining two data sets, upt4 and gtrends_tb2, create the following chart.

Put Google Trends indexes on x, and percent change in transit ridership on y.
Change color by year.
Change the size of points by log(YEAR12).
Add a horizonal line at zero.
Add a quantile regression line with a simple formula: y ~ x at 50th percentile.
Split the chart into five sub-charts with facet_wrap (by year).
Adjust the x range.
Change labels accordingly (the below chart presents several ugly labels… ).

upt5 <- _____(upt4, gtrends_tb2)

upt5 %>% 
  ggplot(aes(y = _____, x = _____, color = _____)) + 
  geom_point(aes(size = log(YEAR12))) + 
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") + 
  geom_quantile(formula = _____, quantiles = c(0.5), size = 0.5) + 
  _____(~_____) +
  xlim(-2, 2) # remove the outliers

Discuss the patterns in the below chart.

Can you see any correlation between the supply of ridehailing services to cities and changes in transit trips?
How can you explain seemlingly positive relationships between these two, especially from 2012 to 2013?
Which additional variables may explain positive/negative correlations between ridehailing and transit use?

3. Neighborhood Change and Young Adults in the Central City

Background

The literature on the recent urban revival claims that young professionals, or those who are 25 to 34 years old with bachelor’s degrees, choose to live in neighborhoods with convenient access to businesses/places for social, entertainment, and cultural activities (e.g., cafes, bars, theaters, and art museums) (Couture & Hanbury, 2017). A related question is whether the location choice patterns of young professionals is associated with gentrification in inner city neighborhoods, many of which have rich access to those “consumption amenity”. Also, studies documenting the historical trends of young professionals’ location choice stopped at 2010 (e.g., Baum-Snow, 2016) in part because of data availability at the time of their analyses.

In this context, we will use the latest US Census American Community Survey (ACS) to further examine this issue. The latest update is 2017 5-year estimates, providing summary statistics at various geographies based on data collected for five years from 2013 to 2017.

Retrieve the 2010 & 2017 ACS estimates with the tidycensus package (A good introduction by its author: why tidycensus?). Do so at the census tract level for the top 50 Urban Areas (UAs), defined by 2010 population counts. First, check the census tract list and the state/county FIPS code list. You need ua50_tracts_list.Rds from my Dropbox account, which is made available to anyone with the link. Also, the same file is available on Canvas.

# a list of 27,245 census tracts (2010) in the top 50 UAs 
ua50_tracts_list <- read_rds(_____)
head(ua50_tracts_list)

# a list of state and county fips codes for data extraction: nrow = 241 counties (in 37 states)
# you can create this from ua50_tracts_list
temp <- tibble(
  st = rep("00", times = 27245), 
  cnty = rep("000", times = 27245)
)
temp$st <- substr(ua50_tracts_list$GEOID, 1, 2) # two-digit state FIPS code
temp$cnty <- substr(ua50_tracts_list$GEOID, 3, 5) # three-digit county FIPS code

ua50_stcntyfp_list <- distinct(as_tibble(temp)) # remove duplicates 
head(ua50_stcntyfp_list)

Get your API here and feed it to your R environment.

install.packages("tidycensus")
library(tidycensus)

census_api_key("YOUR API KEY GOES HERE")

Use get_acs to extract several key variables at the census tract level:

B01001_001 : Total population count
B01001_011, B01001_012 : Male age 25-29, 30-34
B01001_035, B01001_036 : Female age 25-29, 30-34
B19013_001 : Median household income
B01001H_001 : Non-Hispanic White
B15002_001 : Population 25 years old or older whose educational attainment is determined.
B15002_015, B15002_016, B15002_017, B15002_018 : Educational attainment (male)
B15002_032, B15002_033, B15002_034, B15002_035: Educational attainment (female)

Note that we do this twice, one for year 2010 and the other for 2017.

Since we send many API calls, it’s a good practice to pause between calls by Sys.sleep() (unit: second).

pop2010_list <- list()

for (i in 1:241){
  pop2010_list[[i]] <- get_acs(state = ua50_stcntyfp_list$st[i], 
                               county = ua50_stcntyfp_list$cnty[i], 
                               geography = "tract",  
                               variables = c("B01001_001", ... ),  
                               year = 2010, output = "wide")
  Sys.sleep(2) # pause for two seconds 
}

pop2010 <- as_tibble(do.call("rbind", pop2010_list))

By the way, how can we know variable names?

v15 <- load_variables(2017, "acs5", cache = TRUE)
View(v15)

Compute a socioeconomic status (SES) index at the census tract level. We follow Baum-Snow (2016) (working paper, presentation).

Join ua50_tracts_listand pop2010.
Compute a few key variables: % 25-34, % college educate, % non-Hispanic white, and log(median household income).
Standardize these three variables for each UA. Use group_by(), mutate(), and scale() (check how to use: StackOverFlow).
Calculate the average of these three standardized variables. The higher the average for a census tract is, the higher the socioeconomic status of the census tract is in its UA.

data2010 <- _____(ua50_tracts_list, pop2010) %>% 
  mutate(
    pop_2010 = ,
    den_pop_2010 = pop_2010/ALAND *(1000000/0.386102159), # convert unit from sq meter to sq mile 
    pct_2534_2010 = _____,
    pct_nhwhite_2010 = _____,
    pct_college_2010 = _____, 
    med_hhinc_2010 = _____
    ) %>%
  select(GEOID, UAGEOID, pop_2010, den_pop_2010, pct_2534_2010,  
         pct_nhwhite_2010, pct_college_2010, med_hhinc_2010) %>% 
  group_by(_____) %>%
  mutate(
    pct_nhwhite_2010 = _____(pct_nhwhite_2010), 
    pct_college_2010 = _____(pct_college_2010), 
    med_hhinc_2010 = _____(med_hhinc_2010), 
    ses_2010 = (pct_nhwhite_2010+ pct_college_2010+ med_hhinc_2010)/3
  ) %>%
  ungroup()

Repeat the same operations for 2017.

pop2017 <- ... 
data2017 <- ...

Join data2010 and data2017. Compute the difference in the % 25-34 and SES index. Again, scale pct_2534_diff by UAGEOID. Last but not least, create a new categorical variable, which divides all census tracts into five groups by the raw population density in 2010 (use cut_number).

data2yr <- _____(data2010, data2017) %>% 
  mutate( 
    ses_diff = _____, 
    pct_2534_diff = _____ 
  ) %>% 
  group_by(_____) %>%
  mutate(
    pct_2534_diff = _____(pct_2534_diff) 
  ) %>% 
  ungroup()

data2yr$group_den <- _____(data2yr$den_pop_2010, _____)

Create the following chart.

Put the difference in SES index on x, and the different in % 25-34 on y.
Use geom_point() and inside aes(), change color by group_den.
Use facet_warp to split the chart into five sub-charts by group_den.
Add a horizonal line at zero.
Adjust x and y ranges.
Revise labels accordingly (Replace any ugly labels with appropriate ones.)

ggplot(data2yr, aes(y = _____, x = _____)) +
  geom_point(aes(color = _____)) + 
  geom_quantile(quantiles = 0.5, size = 0.5, alpha = 0.5, color = "red") + 
  _____(~ _____, nrow = 2) +
  geom_hline(yintercept = 0, color = "grey50", alpha = 0.5) + 
  coord_cartesian(ylim=c(-7.5, 7.5), xlim=c(-1.5, 1.5))

Discuss the patterns in the below chart.

Do you see any correlation between neighborhood change (i.e., chnages in the SES index) and the share of 25-34 age group?
Do you see the relationships between two measures different at different density levels?
How would you further investigate their relationships with additional variables?

1st Mini-project

Will be finalized before Feb 4, 2019 (Monday). Until then, there will be some modifications.

1. Ridehailing and Vehicle Onwership

Background

2. Ridehailing and Transit Ridership

Background

3. Neighborhood Change and Young Adults in the Central City

Background