1 Setup

1.1 Libraries

Load needed libraries

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidytuesdayR)
library(forcats)
library(rmdformats)

1.2 Reading in data

Now load the transit project data using tidytuesdayR:

tuesdata <- tidytuesdayR::tt_load('2021-01-05')
## Only 9 Github queries remaining until 2021-03-18 12:14:02 AM CDT.
## Only 9 Github queries remaining until 2021-03-18 12:14:03 AM CDT.
## --- Compiling #TidyTuesday Information for 2021-01-05 ----
## Only 8 Github queries remaining until 2021-03-18 12:14:03 AM CDT.
## --- There is 1 file available ---
## Only 7 Github queries remaining until 2021-03-18 12:14:02 AM CDT.
## --- Starting Download ---
## Only 7 Github queries remaining until 2021-03-18 12:14:02 AM CDT.
##  Downloading file 1 of 1: `transit_cost.csv`
## Only 6 Github queries remaining until 2021-03-18 12:14:02 AM CDT.
## --- Download complete ---
data <- tuesdata$transit_cost

Now let’s take a look at the data:

## # A tibble: 544 x 20
##        e country city   line  start_year end_year    rr length tunnel_per tunnel
##    <dbl> <chr>   <chr>  <chr> <chr>      <chr>    <dbl>  <dbl> <chr>       <dbl>
##  1  7136 CA      Vanco~ Broa~ 2020       2025         0    5.7 87.72%        5  
##  2  7137 CA      Toron~ Vaug~ 2009       2017         0    8.6 100.00%       8.6
##  3  7138 CA      Toron~ Scar~ 2020       2030         0    7.8 100.00%       7.8
##  4  7139 CA      Toron~ Onta~ 2020       2030         0   15.5 57.00%        8.8
##  5  7144 CA      Toron~ Yong~ 2020       2030         0    7.4 100.00%       7.4
##  6  7145 NL      Amste~ Nort~ 2003       2018         0    9.7 73.00%        7.1
##  7  7146 CA      Montr~ Blue~ 2020       2026         0    5.8 100.00%       5.8
##  8  7147 US      Seatt~ U-Li~ 2009       2016         0    5.1 100.00%       5.1
##  9  7152 US      Los A~ Purp~ 2020       2027         0    4.2 100.00%       4.2
## 10  7153 US      Los A~ Purp~ 2018       2026         0    4.2 100.00%       4.2
## # ... with 534 more rows, and 10 more variables: stations <dbl>, source1 <chr>,
## #   cost <dbl>, currency <chr>, year <dbl>, ppp_rate <dbl>, real_cost <chr>,
## #   cost_km_millions <dbl>, source2 <chr>, reference <chr>

2 Looking at the Data

What question do I want to ask? First, let’s look at a few cities and countries to see if anything interesting stands out.

2.1 Exploring Different Places: NYC

Let’s start with New York:

## # A tibble: 5 x 20
##       e country city   line   start_year end_year    rr length tunnel_per tunnel
##   <dbl> <chr>   <chr>  <chr>  <chr>      <chr>    <dbl>  <dbl> <chr>       <dbl>
## 1  7408 US      New Y~ 7 ext~ 2007       2014         0    1.6 100.00%       1.6
## 2  7409 US      New Y~ Secon~ 2007       2016         0    2.7 100.00%       2.7
## 3  7410 US      New Y~ Secon~ 2019       2029         0    2.6 100.00%       2.6
## 4  7411 US      New Y~ East ~ 2007       2022         1    2.8 100.00%       2.8
## 5  7416 US      New Y~ Gatew~ 2019       2026         1    5.3 100.00%       5.3
## # ... with 10 more variables: stations <dbl>, source1 <chr>, cost <dbl>,
## #   currency <chr>, year <dbl>, ppp_rate <dbl>, real_cost <chr>,
## #   cost_km_millions <dbl>, source2 <chr>, reference <chr>

New York has 3 subway projects (really only 2, since the 2nd Avenue subway is split into 2 parts) and 2 railroad projects (the rows where rr is 1). This could be interesting, but let’s look at other cities.

Ideally I’d like to compare the progress of various projects for one city or country.
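The code behind these filtered views is hidden in the rendered output. A minimal sketch of what it could look like, assuming a plain dplyr filter() on the city column: the toy rows below are copied from the printouts, and the full city strings ("New York", "Toronto") are an assumption, since the tables only show them truncated.

```r
library(dplyr)

# Toy rows copied from the printouts above. The full city strings are an
# assumption: the printed tables truncate them to "New Y~" and "Toron~".
projects <- tibble(
  e       = c(7408, 7409, 7410, 7411, 7416, 7137),
  country = c("US", "US", "US", "US", "US", "CA"),
  city    = c(rep("New York", 5), "Toronto"),
  rr      = c(0, 0, 0, 1, 1, 0)
)

# Keep one city at a time
nyc <- projects %>%
  filter(city == "New York")

# Tally subway (rr == 0) versus railroad (rr == 1) projects
nyc_by_type <- nyc %>%
  count(rr)
```

On the real data the same pattern would start from data instead of the toy projects tibble.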

2.2 Toronto

Let’s try another North American city, Toronto:

## # A tibble: 6 x 20
##       e country city   line   start_year end_year    rr length tunnel_per tunnel
##   <dbl> <chr>   <chr>  <chr>  <chr>      <chr>    <dbl>  <dbl> <chr>       <dbl>
## 1  7137 CA      Toron~ Vaugh~ 2009       2017         0    8.6 100.00%       8.6
## 2  7138 CA      Toron~ Scarb~ 2020       2030         0    7.8 100.00%       7.8
## 3  7139 CA      Toron~ Ontar~ 2020       2030         0   15.5 57.00%        8.8
## 4  7144 CA      Toron~ Yonge~ 2020       2030         0    7.4 100.00%       7.4
## 5  7283 CA      Toron~ Eglin~ 2011       2022         0   19   53.00%       10  
## 6  7400 CA      Toron~ Shepp~ 1994       2002         0    5.5 100.00%       5.5
## # ... with 10 more variables: stations <dbl>, source1 <chr>, cost <dbl>,
## #   currency <chr>, year <dbl>, ppp_rate <dbl>, real_cost <chr>,
## #   cost_km_millions <dbl>, source2 <chr>, reference <chr>

Toronto has 6 projects, none of which are railroad projects, and all of which involve tunnels (4 of the 6 are fully tunneled).

2.3 China, Beijing

Let’s try a different country this time. Maybe China or India?

## # A tibble: 253 x 20
##        e country city   line  start_year end_year    rr length tunnel_per tunnel
##    <dbl> <chr>   <chr>  <chr> <chr>      <chr>    <dbl>  <dbl> <chr>       <dbl>
##  1  7640 CN      Beiji~ Line~ 2004       2009         0  28.2  95.17%     26.8  
##  2  7649 CN      Beiji~ Line~ 1999       2003         0  40.8  0.00%       0    
##  3  7760 CN      Beiji~ Line~ 2017       2021         0   8.78 100.00%     8.78 
##  4  7769 CN      Beiji~ Line~ 2019       2021         0   4    100.00%     4    
##  5  7776 CN      Beiji~ Line~ 2016       2021         0  37.4  55.61%     20.8  
##  6  7777 CN      Beiji~ Line~ 2017       2021         0  29.2  100.00%    29.2  
##  7  7785 CN      Beiji~ Line~ 2016       2018         0   3.4  29.08%      0.989
##  8  7792 CN      Beiji~ Line~ 2017       2021         0  22.4  100.00%    22.4  
##  9  7793 CN      Beiji~ Line~ 2015       2022         0  49.7  100.00%    49.7  
## 10  7794 CN      Beiji~ Line~ 2013       2021         0  49.8  100.00%    49.8  
## 11  7795 CN      Beiji~ Line~ 2017       2020         0   5    <NA>       NA    
## 12  7816 CN      Beiji~ Line~ 2016       2020         0  17.2  87.21%     15    
## 13  7824 CN      Beiji~ Line~ 2019       2021         0  16.6  <NA>       NA    
## 14  7832 CN      Beiji~ Line~ 2008       2014         0  42.8  100.00%    42.8  
## 15  7864 CN      Beiji~ Line~ <NA>       <NA>         0  78.6  62.00%     48.5  
## 16  7883 CN      Beiji~ Line~ 2019       2022         0  30.2  60.26%     18.2  
## 17  7904 CN      Beiji~ Line~ 2009       2014         0  23.7  100.00%    23.7  
## 18  7960 CN      Beiji~ Line~ 2003       2008         0  24.6  100.00%    24.6  
## 19  7985 CN      Beiji~ Daxi~ 2016       2019         0  44    <NA>       NA    
## 20  7986 CN      Beiji~ Daxi~ 2019       <NA>         0   3.5  <NA>       NA    
## # ... with 233 more rows, and 10 more variables: stations <dbl>, source1 <chr>,
## #   cost <dbl>, currency <chr>, year <dbl>, ppp_rate <dbl>, real_cost <chr>,
## #   cost_km_millions <dbl>, source2 <chr>, reference <chr>

Wow! China has a LOT of projects, 253 to be exact! This is definitely too much to compare all across China, but let’s try just looking at one city in China, like Beijing:

## # A tibble: 27 x 20
##        e country city   line  start_year end_year    rr length tunnel_per tunnel
##    <dbl> <chr>   <chr>  <chr> <chr>      <chr>    <dbl>  <dbl> <chr>       <dbl>
##  1  7640 CN      Beiji~ Line~ 2004       2009         0  28.2  95.17%     26.8  
##  2  7649 CN      Beiji~ Line~ 1999       2003         0  40.8  0.00%       0    
##  3  7760 CN      Beiji~ Line~ 2017       2021         0   8.78 100.00%     8.78 
##  4  7769 CN      Beiji~ Line~ 2019       2021         0   4    100.00%     4    
##  5  7776 CN      Beiji~ Line~ 2016       2021         0  37.4  55.61%     20.8  
##  6  7777 CN      Beiji~ Line~ 2017       2021         0  29.2  100.00%    29.2  
##  7  7785 CN      Beiji~ Line~ 2016       2018         0   3.4  29.08%      0.989
##  8  7792 CN      Beiji~ Line~ 2017       2021         0  22.4  100.00%    22.4  
##  9  7793 CN      Beiji~ Line~ 2015       2022         0  49.7  100.00%    49.7  
## 10  7794 CN      Beiji~ Line~ 2013       2021         0  49.8  100.00%    49.8  
## # ... with 17 more rows, and 10 more variables: stations <dbl>, source1 <chr>,
## #   cost <dbl>, currency <chr>, year <dbl>, ppp_rate <dbl>, real_cost <chr>,
## #   cost_km_millions <dbl>, source2 <chr>, reference <chr>

To no surprise, the capital of China has 27 different transit projects. Still, this is probably too much to convey in one graphic.

2.4 India

Let’s look at India now and see how it compares:

## # A tibble: 29 x 20
##        e country city   line  start_year end_year    rr length tunnel_per tunnel
##    <dbl> <chr>   <chr>  <chr> <chr>      <chr>    <dbl>  <dbl> <chr>       <dbl>
##  1  7546 IN      Ahmad~ Phas~ 2015       2023         0   40   16.00%        6.4
##  2  7536 IN      Banga~ Phas~ 2017       2024         0   73.9 0.00%        14  
##  3  7537 IN      Banga~ Phas~ 2006       2017         0   42.3 20.00%        8.5
##  4  7553 IN      Chenn~ Phas~ 2009       2019         0   45.1 53.00%       24  
##  5  7554 IN      Chenn~ Phas~ 2016       2020         0    9.1 26.00%        2.4
##  6  7555 IN      Chenn~ Phas~ 2020       2026         0  119.  36.00%       42.8
##  7  7368 IN      Delhi  Phas~ 2012       2020         0  161.  33.00%       53  
##  8  7369 IN      Delhi  Phas~ 2019       2025         0   61.7 36.00%       22.3
##  9  7370 IN      Delhi  Phas~ 2019       2025         0   42.3 35.00%       14.7
## 10  7538 IN      Gurga~ Phas~ 2009       2013         0    5.5 0.00%         0  
## 11  7539 IN      Gurga~ Phas~ 2013       2017         0    6.3 0.00%         0  
## 12  7547 IN      Hyder~ Phas~ 2012       2020         0   72   0.00%         0  
## 13  7552 IN      Hyder~ Airp~ 2020       <NA>         0   31   8.00%         2.5
## 14  7363 IN      Kochi  Metro 2013       2017         0   13   0.00%         0  
## 15  7288 IN      Mumbai Mono~ 2009       2019         0   20.2 0.00%         0  
## 16  7289 IN      Mumbai Line~ 2016       2022         0   33.5 100.00%      33.5
## 17  7290 IN      Mumbai Line~ 2016       2021         0   18.6 0.00%         0  
## 18  7291 IN      Mumbai Line~ 2018       2023         0   23.5 0.00%         0  
## 19  7296 IN      Mumbai Line~ 2018       2022         0   32.3 0.00%         0  
## 20  7297 IN      Mumbai Line~ 2019       2022         0    2.7 0.00%         0  
## 21  7298 IN      Mumbai Line~ 2017       2024         0   24.9 0.00%         0  
## 22  7299 IN      Mumbai Line~ 2017       2022         0   14.5 0.00%         0  
## 23  7304 IN      Mumbai Line~ 2016       2021         0   16.5 0.00%         0  
## 24  7305 IN      Mumbai Line~ 2019       2022         0   13.5 16.00%        2.2
## 25  7306 IN      Mumbai Line~ 2020       2024         0    9.2 8.00%         0.7
## 26  7307 IN      Mumbai Line~ 2020       2026         0   12.8 69.00%        8.8
## 27  7312 IN      Mumbai Line~ 2020       2026         0   20.8 0.00%         0  
## 28  7544 IN      Nagpur Phas~ 2014       2020         0   38.2 0.00%         0  
## 29  7545 IN      Nagpur Phas~ <NA>       <NA>         0   43.8 0.00%         0  
## # ... with 10 more variables: stations <dbl>, source1 <chr>, cost <dbl>,
## #   currency <chr>, year <dbl>, ppp_rate <dbl>, real_cost <chr>,
## #   cost_km_millions <dbl>, source2 <chr>, reference <chr>

Alright, only 29! That’s a lot more manageable! Mumbai is the most populous city in India, so it probably has the most projects going on.
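Rather than eyeballing the table, the guess about Mumbai can be checked with a per-city tally. This is only a sketch: the toy vector below includes just the cities whose names print in full in the India table (the truncated ones are left out), with row counts read off that printout.

```r
library(dplyr)

# Cities that print in full in the India table above, repeated once per project
india_projects <- tibble(
  city = c(rep("Delhi", 3), "Kochi", rep("Mumbai", 13), rep("Nagpur", 2))
)

# Number of projects per city, largest first
city_counts <- india_projects %>%
  count(city, sort = TRUE)
```

On the full data, the equivalent would presumably be data %>% filter(country == "IN") %>% count(city, sort = TRUE).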

2.5 Mumbai (Bombay)

Let’s check:

## # A tibble: 13 x 20
##        e country city   line  start_year end_year    rr length tunnel_per tunnel
##    <dbl> <chr>   <chr>  <chr> <chr>      <chr>    <dbl>  <dbl> <chr>       <dbl>
##  1  7288 IN      Mumbai Mono~ 2009       2019         0   20.2 0.00%         0  
##  2  7289 IN      Mumbai Line~ 2016       2022         0   33.5 100.00%      33.5
##  3  7290 IN      Mumbai Line~ 2016       2021         0   18.6 0.00%         0  
##  4  7291 IN      Mumbai Line~ 2018       2023         0   23.5 0.00%         0  
##  5  7296 IN      Mumbai Line~ 2018       2022         0   32.3 0.00%         0  
##  6  7297 IN      Mumbai Line~ 2019       2022         0    2.7 0.00%         0  
##  7  7298 IN      Mumbai Line~ 2017       2024         0   24.9 0.00%         0  
##  8  7299 IN      Mumbai Line~ 2017       2022         0   14.5 0.00%         0  
##  9  7304 IN      Mumbai Line~ 2016       2021         0   16.5 0.00%         0  
## 10  7305 IN      Mumbai Line~ 2019       2022         0   13.5 16.00%        2.2
## 11  7306 IN      Mumbai Line~ 2020       2024         0    9.2 8.00%         0.7
## 12  7307 IN      Mumbai Line~ 2020       2026         0   12.8 69.00%        8.8
## 13  7312 IN      Mumbai Line~ 2020       2026         0   20.8 0.00%         0  
## # ... with 10 more variables: stations <dbl>, source1 <chr>, cost <dbl>,
## #   currency <chr>, year <dbl>, ppp_rate <dbl>, real_cost <chr>,
## #   cost_km_millions <dbl>, source2 <chr>, reference <chr>

This is interesting: several different projects, a few of which include tunnels, but most of which do not. With so many lines, let’s take a look at what this system looks/will look like:

3 Finding Variables of Interest & Cleaning

3.1 Some Thoughts to Consider

Now that I have identified a location of interest, I need to think about what interests me about these projects. As India’s largest city, Mumbai has a population of 12.4 million spread across the city’s 603 square kilometers, and a metro population of 20 million spread across a greater metro area of over 4,300 square kilometers. It’s no surprise, then, that the city’s metro network is a long system that will consist of over fourteen lines once complete.

Let’s refresh my memory of the variables we have to work with:

## # A tibble: 13 x 20
##        e country city   line  start_year end_year    rr length tunnel_per tunnel
##    <dbl> <chr>   <chr>  <chr> <chr>      <chr>    <dbl>  <dbl> <chr>       <dbl>
##  1  7288 IN      Mumbai Mono~ 2009       2019         0   20.2 0.00%         0  
##  2  7289 IN      Mumbai Line~ 2016       2022         0   33.5 100.00%      33.5
##  3  7290 IN      Mumbai Line~ 2016       2021         0   18.6 0.00%         0  
##  4  7291 IN      Mumbai Line~ 2018       2023         0   23.5 0.00%         0  
##  5  7296 IN      Mumbai Line~ 2018       2022         0   32.3 0.00%         0  
##  6  7297 IN      Mumbai Line~ 2019       2022         0    2.7 0.00%         0  
##  7  7298 IN      Mumbai Line~ 2017       2024         0   24.9 0.00%         0  
##  8  7299 IN      Mumbai Line~ 2017       2022         0   14.5 0.00%         0  
##  9  7304 IN      Mumbai Line~ 2016       2021         0   16.5 0.00%         0  
## 10  7305 IN      Mumbai Line~ 2019       2022         0   13.5 16.00%        2.2
## 11  7306 IN      Mumbai Line~ 2020       2024         0    9.2 8.00%         0.7
## 12  7307 IN      Mumbai Line~ 2020       2026         0   12.8 69.00%        8.8
## 13  7312 IN      Mumbai Line~ 2020       2026         0   20.8 0.00%         0  
## # ... with 10 more variables: stations <dbl>, source1 <chr>, cost <dbl>,
## #   currency <chr>, year <dbl>, ppp_rate <dbl>, real_cost <chr>,
## #   cost_km_millions <dbl>, source2 <chr>, reference <chr>

One thing I notice is that there’s a good variety of lengths among these projects, from under 3 km to over 33 km, so this might be something to explore.

## # A tibble: 13 x 2
##    line        length
##    <chr>        <dbl>
##  1 Line 3        33.5
##  2 Line 4        32.3
##  3 Line 5        24.9
##  4 Line 2B       23.5
##  5 Line 12       20.8
##  6 Monorail      20.2
##  7 Line 2A       18.6
##  8 Line 7        16.5
##  9 Line 6        14.5
## 10 Line 7A + 9   13.5
## 11 Line 11       12.8
## 12 Line 10        9.2
## 13 Line 4A        2.7

Given the variety in length, I imagine there’s a good variety of cost among the different lines, so let’s look at that next.

## # A tibble: 13 x 4
##    line        length real_cost cost_km_millions
##    <chr>        <dbl> <chr>                <dbl>
##  1 Line 4        32.3 6838.03              212. 
##  2 Line 2B       23.5 5163.42              220. 
##  3 Line 5        24.9 3955.755             159. 
##  4 Line 4A        2.7 377.28               140. 
##  5 Line 11       12.8 3400.8               266. 
##  6 Line 2A       18.6 3076.8               165. 
##  7 Line 7A + 9   13.5 3063.46              227. 
##  8 Line 7        16.5 2979.84              181. 
##  9 Line 6        14.5 2783.5               192. 
## 10 Line 12       20.8 2274.24              109. 
## 11 Line 10        9.2 1834.08              199. 
## 12 Monorail      20.2 1620                  80.2
## 13 Line 3        33.5 15040                449.

Hm, that doesn’t seem right. The code told R to arrange real cost in descending order, but that doesn’t seem to have happened. It looks like real_cost is currently a character variable, which explains the unexpected ordering: the values are sorted as text, character by character from the left, rather than by numeric value (so "377.28" lands above "3400.8", and "15040" ends up last).

Looking over the values for real cost, I don’t see any values that would cause issues with a simple as.numeric() conversion.
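To see the text-versus-number ordering in isolation, here is a small base R sketch using a handful of the real_cost strings from the table above. The method = "radix" argument is my addition so that the text sort does not depend on the machine’s locale; it is not part of the original analysis.

```r
# A few real_cost values from the Mumbai table, stored as text as in the raw data
real_cost_chr <- c("15040", "6838.03", "3955.755", "377.28", "1620")

# Sorted as text: compared character by character, so "15040" falls below "1620"
as_text <- sort(real_cost_chr, decreasing = TRUE, method = "radix")

# Sorted as numbers: 15040 comes first, as expected
as_number <- sort(as.numeric(real_cost_chr), decreasing = TRUE)

# Quick safety check for the conversion: count values that fail to parse
n_failed <- sum(is.na(as.numeric(real_cost_chr)))
```

If n_failed were greater than zero, as.numeric() would have quietly turned those entries into NA (with a warning), so it is worth checking before overwriting the column.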

3.2 Quick Data Cleaning

Before continuing, let’s fix that:

data %>%
  filter(city == "Mumbai") %>%
  select(line, length, real_cost, cost_km_millions) %>%
  mutate(real_cost = as.numeric(real_cost)) %>%
  arrange(desc(real_cost))
## # A tibble: 13 x 4
##    line        length real_cost cost_km_millions
##    <chr>        <dbl>     <dbl>            <dbl>
##  1 Line 3        33.5    15040             449. 
##  2 Line 4        32.3     6838.            212. 
##  3 Line 2B       23.5     5163.            220. 
##  4 Line 5        24.9     3956.            159. 
##  5 Line 11       12.8     3401.            266. 
##  6 Line 2A       18.6     3077.            165. 
##  7 Line 7A + 9   13.5     3063.            227. 
##  8 Line 7        16.5     2980.            181. 
##  9 Line 6        14.5     2784.            192. 
## 10 Line 12       20.8     2274.            109. 
## 11 Line 10        9.2     1834.            199. 
## 12 Monorail      20.2     1620              80.2
## 13 Line 4A        2.7      377.            140.

OK, so that roughly follows what we would expect: the more expensive a line is, the longer it tends to be (although Lines 11 and 7A + 9 seem to be exceptions).

Now there is one factor that is in the data that I haven’t considered: the presence of tunnels. Usually the more a transit line is in a tunnel (i.e., below grade) the more expensive the project tends to be. With that in mind we would expect the projects with the highest cost per km to have tunnels.

To check this, we first need to create a new logical variable to show the presence or absence of tunnels. We can do this using a simple ifelse() operation on the tunnel variable, which is the length of any tunnels in km (so a 0 would mean no tunnels).

data %>%
  filter(city == "Mumbai") %>%
  select(line, length, real_cost, cost_km_millions, tunnel) %>%
  mutate(real_cost = as.numeric(real_cost)) %>%
  mutate(tunnel_yn = ifelse(tunnel > 0, TRUE, FALSE)) %>%
  select(line, length, real_cost, cost_km_millions, tunnel_yn) %>%
  arrange(desc(cost_km_millions))
## # A tibble: 13 x 5
##    line        length real_cost cost_km_millions tunnel_yn
##    <chr>        <dbl>     <dbl>            <dbl> <lgl>    
##  1 Line 3        33.5    15040             449.  TRUE     
##  2 Line 11       12.8     3401.            266.  TRUE     
##  3 Line 7A + 9   13.5     3063.            227.  TRUE     
##  4 Line 2B       23.5     5163.            220.  FALSE    
##  5 Line 4        32.3     6838.            212.  FALSE    
##  6 Line 10        9.2     1834.            199.  TRUE     
##  7 Line 6        14.5     2784.            192.  FALSE    
##  8 Line 7        16.5     2980.            181.  FALSE    
##  9 Line 2A       18.6     3077.            165.  FALSE    
## 10 Line 5        24.9     3956.            159.  FALSE    
## 11 Line 4A        2.7      377.            140.  FALSE    
## 12 Line 12       20.8     2274.            109.  FALSE    
## 13 Monorail      20.2     1620              80.2 FALSE

Great! It looks like my reasoning held up for the most part: the three most expensive projects per km all have tunnels. The one tunneled project outside the top three (Line 10) has only a very short tunnel section (0.7 km of its 9.2 km), and thus does not suffer from an inflated cost per km.

What is interesting is that the two projects with the lowest cost per km are two of the longer lines, at over 20 km in length.
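A side note on the tunnel_yn step above: the comparison tunnel > 0 already returns a logical vector, so the ifelse() wrapper is optional. The sketch below uses made-up tunnel lengths (including an NA, which the Beijing rows show can occur) to compare the two forms.

```r
# Hypothetical tunnel lengths in km; NA stands in for a missing value
tunnel_km <- c(33.5, 0, 2.2, NA)

# The form used in this analysis
with_ifelse <- ifelse(tunnel_km > 0, TRUE, FALSE)

# The bare comparison gives the same logical vector
direct <- tunnel_km > 0

# Both keep NA as NA instead of silently calling it "no tunnel"
same_result <- identical(with_ifelse, direct)
```

Either form works. Spelling out TRUE and FALSE is a little safer than T and F, because T and F are ordinary variables that can be reassigned, while TRUE and FALSE are reserved words.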

3.3 Research Question and Variables

So now that I have identified my variables of interest, I’ve decided to look at the different projects in Mumbai and see how they compare in length, real total cost, real cost per km, and the presence or absence of a tunnel.

To answer this I will look at the following variables:

  • Length (length)
  • Total Real Cost (real_cost)
  • Real Cost per km (cost_km_millions)
  • Tunnel vs. No Tunnel (tunnel_yn)

4 Visualization

Since the main variable I want to highlight is total cost (a continuous variable), I think the best way to plot this is as a bar plot. I imagine this as similar to the API plot I did for assignment 6. To make it easier to read, I will use coord_flip() while still coding my x axis as the projects/lines. I will recode the lines as factors and reorder them by cost per km, so that the first (top) bar is the project with the greatest cost per km.

To make the x (really y) axis ordering more apparent, I will also overplot a second set of bars (geom_col) on top of the first to show the cost per km of each line, so that it is clear they are in descending order and the scale is more apparent.
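The reordering idea can be sketched on its own with forcats, using three of the cost per km values from the table above. fct_reorder() sorts the factor levels in ascending order of the second argument by default.

```r
library(forcats)

# Three Mumbai lines with their cost per km (millions), taken from the table above
line    <- factor(c("Line 3", "Line 4A", "Monorail"))
cost_km <- c(449, 140, 80.2)

# Default level order is alphabetical
default_levels <- levels(line)

# Reorder the levels by cost per km, lowest first
by_cost_km <- fct_reorder(line, cost_km)
reordered_levels <- levels(by_cost_km)
```

After coord_flip(), the first level is drawn at the bottom and the last at the top, so the ascending level order reads as descending from the top of the plot. Passing .desc = TRUE to fct_reorder() would reverse it.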

4.1 A Few Simple Plots

Let’s start with a simple plot. First, let’s define our data for the plot.

plot_data <- data %>%
  filter(city == "Mumbai") %>%
  select(line, length, real_cost, cost_km_millions, tunnel) %>%
  mutate(real_cost = as.numeric(real_cost)) %>% # Convert real cost from character to numeric
  mutate(tunnel_yn = ifelse(tunnel > 0, TRUE, FALSE)) %>% # TRUE if any part of the line is tunneled
  select(line, length, real_cost, cost_km_millions, tunnel_yn) %>%
  mutate(line = as.factor(line)) # Factor so the lines can be reordered for plotting

Now let’s run a basic plot

plot_data %>% # Call data
  mutate(line = fct_reorder(line, real_cost)) %>% # Reorders lines by real cost 
  ggplot() + # Start ggplot
  geom_col(aes(x = line, y = real_cost, fill = tunnel_yn)) + # First bars which show cost and presence/absence of tunnel
  coord_flip() + # flips x/y so everything I code as x is really on the y axis
  labs(title = "Real Cost of Mumbai Transit Projects") + # Add a title
  xlab("Transit Line") + # Label x axis 
  ylab("Real Cost in Millions of Dollars ($)") # Label y axis

Interesting, now let’s visually compare cost and length and see if there’s anything interesting we can observe.

plot_data %>% 
  mutate(line = fct_reorder(line, length)) %>% # This time order by length 
  ggplot() +
  geom_col(aes(x = line, y = length, fill = tunnel_yn)) + # Same thing plot length instead of cost 
  coord_flip() + 
  labs(title = "Length of Mumbai Transit Projects") + 
  xlab("Transit Line") + 
  ylab("Length in km") # Update for new axis

Interestingly, the monorail, despite being the second cheapest project, is actually one of the longer projects!

Although the length graph is interesting, I am going to keep the cost rather than length. If I find I want to add in length later I can always use a geom_text and list line lengths.

4.2 Refining the Plot

Let’s move back to plot 1 and start making some refinements. First, the bars in the original plot vary widely in length because of the large gap in cost between Line 4A and Line 3, so taking a log transformation may make this a little less extreme.

plot_data %>%
  mutate(line = fct_reorder(line, real_cost)) %>% # Flip back to ordering by real cost instead of length
  ggplot() +
  geom_col(aes(x = line, y = real_cost, fill = tunnel_yn)) + # Back to bars to show cost and presence/absence of tunnel
  coord_flip() +  
  labs(title = "Real Cost of Mumbai Transit Projects") +  
  xlab("Transit Line") + 
  ylab("Real Cost in Millions of Dollars ($)") + # Update back to real cost
  scale_y_continuous(trans = 'log10') # Apply a log10 transformation to the y-axis to reduce the extremes

Great! Now we should just remember to make note of the log scale at some point since log scales are not always typical (and can be optically deceptive).
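To put a number on how much the log scale compresses things, here is a quick base R check using the most and least expensive lines from the table above. As far as I understand, ggplot2 transforms the values before drawing the bars, so on the logged axis the bar lengths follow log10(cost) rather than cost itself; treat that reading as an assumption.

```r
# Most and least expensive Mumbai projects (real cost, millions of dollars)
real_cost <- c("Line 3" = 15040, "Line 4A" = 377.28)

# On a linear axis, the Line 3 bar is roughly 40 times as long as Line 4A
linear_ratio <- real_cost[["Line 3"]] / real_cost[["Line 4A"]]

# After a log10 transformation the two values are much closer together
log_values <- log10(real_cost)
log_ratio  <- log_values[["Line 3"]] / log_values[["Line 4A"]]
```

This is why the logged plot looks more balanced, and also why bar lengths on it can no longer be compared as simple proportions.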

Now let’s try to do the overplot to show the cost per km:

plot_data %>%
  mutate(line = fct_reorder(line, real_cost)) %>% 
  ggplot() +
  geom_col(aes(x = line, y = real_cost, fill = tunnel_yn)) +
  geom_col(aes(x = line, y = cost_km_millions)) + # Adds additional plot to show the cost/km
  coord_flip() + 
  labs(title = "Real Cost of Mumbai Transit Projects") + 
  xlab("Transit Line") + 
  ylab("Real Cost in Millions of Dollars ($)") + 
  scale_y_continuous(trans = 'log10')

Now let’s try reordering the lines so they are descending by cost per km.

plot_data %>%
  mutate(line = fct_reorder(line, cost_km_millions)) %>% # Changes ordering to descending by cost/km
  ggplot() +
  geom_col(aes(x = line, y = real_cost, fill = tunnel_yn)) +
  geom_col(aes(x = line, y = cost_km_millions)) +
  coord_flip() + 
  labs(title = "Real Cost of Mumbai Transit Projects") + 
  xlab("Transit Line") + 
  ylab("Real Cost in Millions of Dollars ($)") + 
  scale_y_continuous(trans = 'log10')

Now let’s add in the line length as text beside each bar

plot_data %>%
  mutate(line = fct_reorder(line, cost_km_millions)) %>% 
  ggplot() + # Start of ggplot
  geom_col(aes(x = line, y = real_cost, fill = tunnel_yn)) + 
  geom_col(aes(x = line, y = cost_km_millions)) + 
  geom_text(aes(x = line, 
                y = real_cost, 
                label = paste0(length, "km"), # Assigns text to display as the length followed by km
                color = "#FF9933",# Add some color in! 
                fontface = "bold"), # Make it bold to stand out
            nudge_y = .3) + # Nudge slightly so just past the bars on graph
  coord_flip() + 
  labs(title = "Real Cost of Mumbai Transit Projects") + 
  xlab("Transit Line") + 
  ylab("Real Cost in Millions of Dollars ($)") + 
  scale_y_continuous(trans = 'log10') + 
  scale_color_manual(name = "Text Color", label = "Project Length in km", values = "#FF9933") # Adds in key to explain text

4.3 Adding Creative Touches

This is coming together nicely, now let’s just clean up a few details, add a few thematic elements and update our labels.

plot_data %>%
  mutate(line = fct_reorder(line, cost_km_millions)) %>% 
  ggplot() + 
  geom_col(aes(x = line, y = real_cost, fill = tunnel_yn)) + 
  geom_col(aes(x = line, y = cost_km_millions, fill = "#FF9933")) + 
  geom_text(aes(x = line, 
                y = real_cost, 
                label = paste0(length, "km"), 
                color = "#FF9933",
                fontface = "bold"), 
            nudge_y = .3) + 
  coord_flip() + 
  scale_fill_manual(name = "Bar Color", # Manually controls bar fill to reflect different characteristics
                    labels = c("Cost per km", "No Tunnel", "Has Tunnel"),  # Will cause a legend to appear with labels
                    values = c("#FF9933", "#000080", "#138808")) + # Assign color values for each (taken from Indian flag)
  labs(title = "Real Cost of Transit Projects in Mumbai, India") + 
  xlab("Transit Line") + 
  ylab("Real Cost in Millions of Dollars ($)") + 
  scale_y_continuous(trans = 'log10') + 
  scale_color_manual(name = "Text Color", label = "Project Length in km", values = "#FF9933") # Adds in key to explain text
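
One caveat about the scale_fill_manual() call above: labels and values are matched to the fill levels by position, which depends on how "#FF9933", FALSE, and TRUE happen to sort. A sketch of a more explicit alternative, assuming a placeholder key "cost_km" for the second layer (my own name, not from the original) and using breaks plus a named values vector on three rows copied from the tables:

```r
library(ggplot2)

# Three lines copied from the tables above
df <- data.frame(
  line             = c("Line 3", "Line 4", "Monorail"),
  real_cost        = c(15040, 6838.03, 1620),
  cost_km_millions = c(449, 212, 80.2),
  tunnel_yn        = c(TRUE, FALSE, FALSE)
)

p <- ggplot(df) +
  geom_col(aes(x = line, y = real_cost, fill = tunnel_yn)) +
  # "cost_km" is a hypothetical key that only serves to create a legend entry
  geom_col(aes(x = line, y = cost_km_millions, fill = "cost_km")) +
  coord_flip() +
  scale_fill_manual(
    name   = "Bar Color",
    breaks = c("cost_km", "FALSE", "TRUE"),                   # legend order
    labels = c("Cost per km", "No Tunnel", "Has Tunnel"),     # one label per break
    values = c("cost_km" = "#FF9933", "FALSE" = "#000080", "TRUE" = "#138808")
  )

# print(p) draws the plot
```

With the names spelled out, each color and label stays attached to the right bars even if the level order changes.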

Hmmm, with the log transformation of the axis, the cost per km bar really loses its intended meaning, which was to show how much of the total cost 1 km makes up. Let’s try dropping the log transformation and seeing how it looks.

plot_data %>%
  mutate(line = fct_reorder(line, cost_km_millions)) %>% 
  ggplot() + 
  geom_col(aes(x = line, y = real_cost, fill = tunnel_yn)) + 
  geom_col(aes(x = line, y = cost_km_millions, fill = "#FF9933")) + 
  geom_text(aes(x = line, 
                y = real_cost, 
                label = paste0(length, "km"), 
                color = "#FF9933",
                fontface = "bold"), 
            nudge_y = .3) + 
  coord_flip() + 
  scale_fill_manual(name = "Bar Color", 
                    labels = c("Cost per km", "No Tunnel", "Has Tunnel"), 
                    values = c("#FF9933", "#000080", "#138808")) + 
  labs(title = "Real Cost of Transit Projects in Mumbai, India") + 
  xlab("Transit Line") + 
  ylab("Real Cost in Millions of Dollars ($)") + 
  # scale_y_continuous(trans = 'log10') + 
  scale_color_manual(name = "Text Color", label = "Project Length in km", values = "#FF9933") 

Ugh. This restores the intended effect of showing the cost per km as an equal fraction of each line’s bar (i.e., you could take the cost per km bar for Line 3, multiply it by 33.5, and it would span the whole bar), but it makes the orange bar so small for most lines that it almost becomes meaningless. Let’s stick with the logged axis.
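The arithmetic behind that trade-off is easy to check by hand. The sketch below uses three rows copied from the tables above and assumes cost_km_millions is simply real_cost divided by length, which the printed (rounded) values seem to bear out.

```r
# Three Mumbai lines copied from the tables above
mumbai <- data.frame(
  line             = c("Line 3", "Line 4", "Monorail"),
  length           = c(33.5, 32.3, 20.2),
  real_cost        = c(15040, 6838.03, 1620),
  cost_km_millions = c(449, 212, 80.2)
)

# Cost per km implied by total cost and length
mumbai$implied_cost_km <- mumbai$real_cost / mumbai$length

# Share of the full bar that a single km takes up on a linear axis
mumbai$km_share <- 1 / mumbai$length
```

For Line 3, one km is only about 3% of the total bar, which is why the orange bars nearly vanish on the unlogged axis.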

4.4 The Final Plot

Now to make a few finishing touches!

plot_data %>%
  mutate(line = fct_reorder(line, cost_km_millions)) %>% 
  ggplot() + 
  geom_col(aes(x = line, y = real_cost, fill = tunnel_yn)) + 
  geom_col(aes(x = line, y = cost_km_millions, fill = "#FF9933")) + 
  geom_text(aes(x = line, 
                y = real_cost, 
                label = paste0(length, "km"), 
                color = "#FF9933",
                fontface = "bold"),
            size = 3.1, # make the font a little smaller 
            nudge_y = .25) + 
  coord_flip() + 
  scale_fill_manual(name = "Bar Color", 
                    labels = c("Cost per km", "No Tunnel", "Has Tunnel"), 
                    values = c("#FF9933", "#000080", "#138808")) + 
  labs(title = "Real Cost of Transit Projects in Mumbai, India",
       subtitle = "Total Real Cost vs. Cost/km, with Project Length", # Add a subtitle 
       caption = "X-Axis is scaled to log_10") + # Make note about x-axis being logged  
  xlab("Transit Line") + 
  ylab("Real Cost in Millions of Dollars ($)") + 
  scale_y_continuous(trans = 'log10') + 
  scale_color_manual(name = "Text Color", label = "Project Length in km", values = "#FF9933") + 
  theme_minimal() + # Pick a neutral theme 
  theme(plot.title = element_text(color = "#FF9933", size = 20, face = "bold"), # Make the title bold, larger and color to match
        plot.subtitle = element_text(face = "bold.italic"), # Put subtitle in bold italics 
        plot.caption = element_text(face = "italic", hjust = 0.5), # Centers and puts caption in italics
        legend.position = "right", legend.box = "vertical") # Puts legend at the right but keeps bar and text colors separate

And voila! A good graph that is easy to read, has nice colors (that fit with the subject) and is clearly labeled.