Load needed libraries
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidytuesdayR)
library(forcats)
library(rmdformats)
Now load the transit project data from tidytuesdayR
tuesdata <- tidytuesdayR::tt_load('2021-01-05')
## Only 9 Github queries remaining until 2021-03-18 12:14:02 AM CDT.
## Only 9 Github queries remaining until 2021-03-18 12:14:02 AM CDT.
## Only 9 Github queries remaining until 2021-03-18 12:14:02 AM CDT.
## Only 9 Github queries remaining until 2021-03-18 12:14:02 AM CDT.
## Only 9 Github queries remaining until 2021-03-18 12:14:03 AM CDT.
## --- Compiling #TidyTuesday Information for 2021-01-05 ----
## Only 8 Github queries remaining until 2021-03-18 12:14:03 AM CDT.
## --- There is 1 file available ---
## Only 7 Github queries remaining until 2021-03-18 12:14:02 AM CDT.
## --- Starting Download ---
## Only 7 Github queries remaining until 2021-03-18 12:14:02 AM CDT.
## Downloading file 1 of 1: `transit_cost.csv`
## Only 6 Github queries remaining until 2021-03-18 12:14:02 AM CDT.
## --- Download complete ---
data <- tuesdata$transit_cost
Now let’s take a look at the data
## # A tibble: 544 x 20
## e country city line start_year end_year rr length tunnel_per tunnel
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 7136 CA Vanco~ Broa~ 2020 2025 0 5.7 87.72% 5
## 2 7137 CA Toron~ Vaug~ 2009 2017 0 8.6 100.00% 8.6
## 3 7138 CA Toron~ Scar~ 2020 2030 0 7.8 100.00% 7.8
## 4 7139 CA Toron~ Onta~ 2020 2030 0 15.5 57.00% 8.8
## 5 7144 CA Toron~ Yong~ 2020 2030 0 7.4 100.00% 7.4
## 6 7145 NL Amste~ Nort~ 2003 2018 0 9.7 73.00% 7.1
## 7 7146 CA Montr~ Blue~ 2020 2026 0 5.8 100.00% 5.8
## 8 7147 US Seatt~ U-Li~ 2009 2016 0 5.1 100.00% 5.1
## 9 7152 US Los A~ Purp~ 2020 2027 0 4.2 100.00% 4.2
## 10 7153 US Los A~ Purp~ 2018 2026 0 4.2 100.00% 4.2
## # ... with 534 more rows, and 10 more variables: stations <dbl>, source1 <chr>,
## # cost <dbl>, currency <chr>, year <dbl>, ppp_rate <dbl>, real_cost <chr>,
## # cost_km_millions <dbl>, source2 <chr>, reference <chr>
What question do I want to ask? Well, first let’s look at a few cities and countries to see if there’s anything interesting.
Let’s start with New York:
## # A tibble: 5 x 20
## e country city line start_year end_year rr length tunnel_per tunnel
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 7408 US New Y~ 7 ext~ 2007 2014 0 1.6 100.00% 1.6
## 2 7409 US New Y~ Secon~ 2007 2016 0 2.7 100.00% 2.7
## 3 7410 US New Y~ Secon~ 2019 2029 0 2.6 100.00% 2.6
## 4 7411 US New Y~ East ~ 2007 2022 1 2.8 100.00% 2.8
## 5 7416 US New Y~ Gatew~ 2019 2026 1 5.3 100.00% 5.3
## # ... with 10 more variables: stations <dbl>, source1 <chr>, cost <dbl>,
## # currency <chr>, year <dbl>, ppp_rate <dbl>, real_cost <chr>,
## # cost_km_millions <dbl>, source2 <chr>, reference <chr>
New York has 3 subway projects (really only 2 since 2nd Avenue subway is in 2 parts) and 2 railroad projects. This could be interesting, but let’s look at other cities.
Ideally I’d like to compare the progress of various projects for one city or country.
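The per-city tables in this section are printed without the code that produced them. A minimal sketch of the kind of filter that would give such a view, assuming dplyr's filter() on the city column. The toy tibble is hypothetical and reuses only a few values visible in the printed output, and the exact spelling of the city name in the real data is an assumption.

```r
library(dplyr)

# Hypothetical stand-in for the full transit data (three rows only)
toy <- tibble(
  e      = c(7408, 7409, 7137),
  city   = c("New York", "New York", "Toronto"),
  length = c(1.6, 2.7, 8.6)
)

# Keep only the rows for one city
ny <- toy %>%
  filter(city == "New York")

ny
```

The same pattern with country == "CN" or country == "IN" would give a per-country view.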
Let’s try another North American city, Toronto:
## # A tibble: 6 x 20
## e country city line start_year end_year rr length tunnel_per tunnel
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 7137 CA Toron~ Vaugh~ 2009 2017 0 8.6 100.00% 8.6
## 2 7138 CA Toron~ Scarb~ 2020 2030 0 7.8 100.00% 7.8
## 3 7139 CA Toron~ Ontar~ 2020 2030 0 15.5 57.00% 8.8
## 4 7144 CA Toron~ Yonge~ 2020 2030 0 7.4 100.00% 7.4
## 5 7283 CA Toron~ Eglin~ 2011 2022 0 19 53.00% 10
## 6 7400 CA Toron~ Shepp~ 1994 2002 0 5.5 100.00% 5.5
## # ... with 10 more variables: stations <dbl>, source1 <chr>, cost <dbl>,
## # currency <chr>, year <dbl>, ppp_rate <dbl>, real_cost <chr>,
## # cost_km_millions <dbl>, source2 <chr>, reference <chr>
Toronto has 6 projects, none of which are railroad projects, and all of which involve tunnels (four are entirely underground).
Let’s try a different country this time. Maybe China or India?
## # A tibble: 253 x 20
## e country city line start_year end_year rr length tunnel_per tunnel
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 7640 CN Beiji~ Line~ 2004 2009 0 28.2 95.17% 26.8
## 2 7649 CN Beiji~ Line~ 1999 2003 0 40.8 0.00% 0
## 3 7760 CN Beiji~ Line~ 2017 2021 0 8.78 100.00% 8.78
## 4 7769 CN Beiji~ Line~ 2019 2021 0 4 100.00% 4
## 5 7776 CN Beiji~ Line~ 2016 2021 0 37.4 55.61% 20.8
## 6 7777 CN Beiji~ Line~ 2017 2021 0 29.2 100.00% 29.2
## 7 7785 CN Beiji~ Line~ 2016 2018 0 3.4 29.08% 0.989
## 8 7792 CN Beiji~ Line~ 2017 2021 0 22.4 100.00% 22.4
## 9 7793 CN Beiji~ Line~ 2015 2022 0 49.7 100.00% 49.7
## 10 7794 CN Beiji~ Line~ 2013 2021 0 49.8 100.00% 49.8
## 11 7795 CN Beiji~ Line~ 2017 2020 0 5 <NA> NA
## 12 7816 CN Beiji~ Line~ 2016 2020 0 17.2 87.21% 15
## 13 7824 CN Beiji~ Line~ 2019 2021 0 16.6 <NA> NA
## 14 7832 CN Beiji~ Line~ 2008 2014 0 42.8 100.00% 42.8
## 15 7864 CN Beiji~ Line~ <NA> <NA> 0 78.6 62.00% 48.5
## 16 7883 CN Beiji~ Line~ 2019 2022 0 30.2 60.26% 18.2
## 17 7904 CN Beiji~ Line~ 2009 2014 0 23.7 100.00% 23.7
## 18 7960 CN Beiji~ Line~ 2003 2008 0 24.6 100.00% 24.6
## 19 7985 CN Beiji~ Daxi~ 2016 2019 0 44 <NA> NA
## 20 7986 CN Beiji~ Daxi~ 2019 <NA> 0 3.5 <NA> NA
## # ... with 233 more rows, and 10 more variables: stations <dbl>, source1 <chr>,
## # cost <dbl>, currency <chr>, year <dbl>, ppp_rate <dbl>, real_cost <chr>,
## # cost_km_millions <dbl>, source2 <chr>, reference <chr>
Wow! China has a LOT of projects, 253 to be exact! This is definitely too much to compare all across China, but let’s try just looking at one city in China, like Beijing:
## # A tibble: 27 x 20
## e country city line start_year end_year rr length tunnel_per tunnel
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 7640 CN Beiji~ Line~ 2004 2009 0 28.2 95.17% 26.8
## 2 7649 CN Beiji~ Line~ 1999 2003 0 40.8 0.00% 0
## 3 7760 CN Beiji~ Line~ 2017 2021 0 8.78 100.00% 8.78
## 4 7769 CN Beiji~ Line~ 2019 2021 0 4 100.00% 4
## 5 7776 CN Beiji~ Line~ 2016 2021 0 37.4 55.61% 20.8
## 6 7777 CN Beiji~ Line~ 2017 2021 0 29.2 100.00% 29.2
## 7 7785 CN Beiji~ Line~ 2016 2018 0 3.4 29.08% 0.989
## 8 7792 CN Beiji~ Line~ 2017 2021 0 22.4 100.00% 22.4
## 9 7793 CN Beiji~ Line~ 2015 2022 0 49.7 100.00% 49.7
## 10 7794 CN Beiji~ Line~ 2013 2021 0 49.8 100.00% 49.8
## # ... with 17 more rows, and 10 more variables: stations <dbl>, source1 <chr>,
## # cost <dbl>, currency <chr>, year <dbl>, ppp_rate <dbl>, real_cost <chr>,
## # cost_km_millions <dbl>, source2 <chr>, reference <chr>
To no surprise, the capital of China has 27 different transit projects. Still, this is probably too much to convey in one graphic.
Let’s look at India now and see how it compares:
## # A tibble: 29 x 20
## e country city line start_year end_year rr length tunnel_per tunnel
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 7546 IN Ahmad~ Phas~ 2015 2023 0 40 16.00% 6.4
## 2 7536 IN Banga~ Phas~ 2017 2024 0 73.9 0.00% 14
## 3 7537 IN Banga~ Phas~ 2006 2017 0 42.3 20.00% 8.5
## 4 7553 IN Chenn~ Phas~ 2009 2019 0 45.1 53.00% 24
## 5 7554 IN Chenn~ Phas~ 2016 2020 0 9.1 26.00% 2.4
## 6 7555 IN Chenn~ Phas~ 2020 2026 0 119. 36.00% 42.8
## 7 7368 IN Delhi Phas~ 2012 2020 0 161. 33.00% 53
## 8 7369 IN Delhi Phas~ 2019 2025 0 61.7 36.00% 22.3
## 9 7370 IN Delhi Phas~ 2019 2025 0 42.3 35.00% 14.7
## 10 7538 IN Gurga~ Phas~ 2009 2013 0 5.5 0.00% 0
## 11 7539 IN Gurga~ Phas~ 2013 2017 0 6.3 0.00% 0
## 12 7547 IN Hyder~ Phas~ 2012 2020 0 72 0.00% 0
## 13 7552 IN Hyder~ Airp~ 2020 <NA> 0 31 8.00% 2.5
## 14 7363 IN Kochi Metro 2013 2017 0 13 0.00% 0
## 15 7288 IN Mumbai Mono~ 2009 2019 0 20.2 0.00% 0
## 16 7289 IN Mumbai Line~ 2016 2022 0 33.5 100.00% 33.5
## 17 7290 IN Mumbai Line~ 2016 2021 0 18.6 0.00% 0
## 18 7291 IN Mumbai Line~ 2018 2023 0 23.5 0.00% 0
## 19 7296 IN Mumbai Line~ 2018 2022 0 32.3 0.00% 0
## 20 7297 IN Mumbai Line~ 2019 2022 0 2.7 0.00% 0
## 21 7298 IN Mumbai Line~ 2017 2024 0 24.9 0.00% 0
## 22 7299 IN Mumbai Line~ 2017 2022 0 14.5 0.00% 0
## 23 7304 IN Mumbai Line~ 2016 2021 0 16.5 0.00% 0
## 24 7305 IN Mumbai Line~ 2019 2022 0 13.5 16.00% 2.2
## 25 7306 IN Mumbai Line~ 2020 2024 0 9.2 8.00% 0.7
## 26 7307 IN Mumbai Line~ 2020 2026 0 12.8 69.00% 8.8
## 27 7312 IN Mumbai Line~ 2020 2026 0 20.8 0.00% 0
## 28 7544 IN Nagpur Phas~ 2014 2020 0 38.2 0.00% 0
## 29 7545 IN Nagpur Phas~ <NA> <NA> 0 43.8 0.00% 0
## # ... with 10 more variables: stations <dbl>, source1 <chr>, cost <dbl>,
## # currency <chr>, year <dbl>, ppp_rate <dbl>, real_cost <chr>,
## # cost_km_millions <dbl>, source2 <chr>, reference <chr>
Alright, only 29! That’s a lot more manageable! Mumbai is India’s biggest and most populous city, so it probably has the most projects going on.
Let’s check:
## # A tibble: 13 x 20
## e country city line start_year end_year rr length tunnel_per tunnel
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 7288 IN Mumbai Mono~ 2009 2019 0 20.2 0.00% 0
## 2 7289 IN Mumbai Line~ 2016 2022 0 33.5 100.00% 33.5
## 3 7290 IN Mumbai Line~ 2016 2021 0 18.6 0.00% 0
## 4 7291 IN Mumbai Line~ 2018 2023 0 23.5 0.00% 0
## 5 7296 IN Mumbai Line~ 2018 2022 0 32.3 0.00% 0
## 6 7297 IN Mumbai Line~ 2019 2022 0 2.7 0.00% 0
## 7 7298 IN Mumbai Line~ 2017 2024 0 24.9 0.00% 0
## 8 7299 IN Mumbai Line~ 2017 2022 0 14.5 0.00% 0
## 9 7304 IN Mumbai Line~ 2016 2021 0 16.5 0.00% 0
## 10 7305 IN Mumbai Line~ 2019 2022 0 13.5 16.00% 2.2
## 11 7306 IN Mumbai Line~ 2020 2024 0 9.2 8.00% 0.7
## 12 7307 IN Mumbai Line~ 2020 2026 0 12.8 69.00% 8.8
## 13 7312 IN Mumbai Line~ 2020 2026 0 20.8 0.00% 0
## # ... with 10 more variables: stations <dbl>, source1 <chr>, cost <dbl>,
## # currency <chr>, year <dbl>, ppp_rate <dbl>, real_cost <chr>,
## # cost_km_millions <dbl>, source2 <chr>, reference <chr>
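One way to check the "most projects" hunch directly, rather than by eye, is to count rows per city. This is a sketch using dplyr's count(); the toy tibble is hypothetical and includes only the four Indian cities whose names are not truncated in the printout above, with the row counts visible there.

```r
library(dplyr)

# Hypothetical subset: project counts as seen in the India table above
toy_india <- tibble(
  city = rep(c("Delhi", "Kochi", "Mumbai", "Nagpur"), times = c(3, 1, 13, 2))
)

# Count projects per city, largest first
city_counts <- toy_india %>%
  count(city, sort = TRUE)

city_counts
```

On the full data, the same call after filter(country == "IN") would rank every Indian city in the dataset.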
This is interesting: several different projects, a few of which include tunnels, but most of which do not. With so many lines, let’s take a look at what this system looks/will look like:
So now that I have identified a location of interest, I need to think about what interests me about these projects. As India’s largest city, Mumbai has a population of 12.4 million spread across the city’s 603 square kilometers, and a metro population of 20 million spread across a greater metro area of over 4,300 square kilometers, so it’s no surprise that the city’s metro network is an extensive system with over fourteen proposed lines once complete.
Let’s refresh our memory of the variables we have to work with:
## # A tibble: 13 x 20
## e country city line start_year end_year rr length tunnel_per tunnel
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 7288 IN Mumbai Mono~ 2009 2019 0 20.2 0.00% 0
## 2 7289 IN Mumbai Line~ 2016 2022 0 33.5 100.00% 33.5
## 3 7290 IN Mumbai Line~ 2016 2021 0 18.6 0.00% 0
## 4 7291 IN Mumbai Line~ 2018 2023 0 23.5 0.00% 0
## 5 7296 IN Mumbai Line~ 2018 2022 0 32.3 0.00% 0
## 6 7297 IN Mumbai Line~ 2019 2022 0 2.7 0.00% 0
## 7 7298 IN Mumbai Line~ 2017 2024 0 24.9 0.00% 0
## 8 7299 IN Mumbai Line~ 2017 2022 0 14.5 0.00% 0
## 9 7304 IN Mumbai Line~ 2016 2021 0 16.5 0.00% 0
## 10 7305 IN Mumbai Line~ 2019 2022 0 13.5 16.00% 2.2
## 11 7306 IN Mumbai Line~ 2020 2024 0 9.2 8.00% 0.7
## 12 7307 IN Mumbai Line~ 2020 2026 0 12.8 69.00% 8.8
## 13 7312 IN Mumbai Line~ 2020 2026 0 20.8 0.00% 0
## # ... with 10 more variables: stations <dbl>, source1 <chr>, cost <dbl>,
## # currency <chr>, year <dbl>, ppp_rate <dbl>, real_cost <chr>,
## # cost_km_millions <dbl>, source2 <chr>, reference <chr>
One thing I am noticing is that, despite Mumbai’s size, the projects vary quite a bit in length, so this might be something to explore.
## # A tibble: 13 x 2
## line length
## <chr> <dbl>
## 1 Line 3 33.5
## 2 Line 4 32.3
## 3 Line 5 24.9
## 4 Line 2B 23.5
## 5 Line 12 20.8
## 6 Monorail 20.2
## 7 Line 2A 18.6
## 8 Line 7 16.5
## 9 Line 6 14.5
## 10 Line 7A + 9 13.5
## 11 Line 11 12.8
## 12 Line 10 9.2
## 13 Line 4A 2.7
Given the variety in length, I imagine there’s a good variety of cost among the different lines, so let’s look at that next.
## # A tibble: 13 x 4
## line length real_cost cost_km_millions
## <chr> <dbl> <chr> <dbl>
## 1 Line 4 32.3 6838.03 212.
## 2 Line 2B 23.5 5163.42 220.
## 3 Line 5 24.9 3955.755 159.
## 4 Line 4A 2.7 377.28 140.
## 5 Line 11 12.8 3400.8 266.
## 6 Line 2A 18.6 3076.8 165.
## 7 Line 7A + 9 13.5 3063.46 227.
## 8 Line 7 16.5 2979.84 181.
## 9 Line 6 14.5 2783.5 192.
## 10 Line 12 20.8 2274.24 109.
## 11 Line 10 9.2 1834.08 199.
## 12 Monorail 20.2 1620 80.2
## 13 Line 3 33.5 15040 449.
Hm, that doesn’t seem right. The code told R to arrange the real cost in descending order, but that doesn’t seem to have happened. It looks like real cost is currently a character variable, which explains the unexpected ordering: character values are sorted as text, compared one character at a time from the left, so 377.28 lands above 3400.8 and 15040 ends up last.
Looking over the values for real cost, I don’t see any values that would cause issues with a simple as.numeric() conversion.
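The sorting behaviour described above can be reproduced in isolation. The sketch below uses four of the real_cost strings from the table; method = "radix" is used only to make the text sort independent of the machine's locale, and the variable names are hypothetical. It also shows one way to confirm programmatically that as.numeric() will not silently turn any value into NA.

```r
# Four real_cost values from the table above, stored as text
cost_chr <- c("15040", "3400.8", "377.28", "6838.03")

# Text sort: compared character by character, so "377.28" beats "3400.8"
chr_sorted <- sort(cost_chr, decreasing = TRUE, method = "radix")
chr_sorted

# Numeric sort: compared by magnitude
cost_num <- as.numeric(cost_chr)
num_sorted <- sort(cost_num, decreasing = TRUE)
num_sorted

# Values that cannot be converted become NA; list any that fail
failed <- cost_chr[is.na(suppressWarnings(as.numeric(cost_chr)))]
failed
```

An empty failed vector means the conversion is safe for these values.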
Before continuing, let’s fix that:
data %>%
filter(city == "Mumbai") %>%
select(line, length, real_cost, cost_km_millions) %>%
mutate(real_cost = as.numeric(real_cost)) %>%
arrange(desc(real_cost))
## # A tibble: 13 x 4
## line length real_cost cost_km_millions
## <chr> <dbl> <dbl> <dbl>
## 1 Line 3 33.5 15040 449.
## 2 Line 4 32.3 6838. 212.
## 3 Line 2B 23.5 5163. 220.
## 4 Line 5 24.9 3956. 159.
## 5 Line 11 12.8 3401. 266.
## 6 Line 2A 18.6 3077. 165.
## 7 Line 7A + 9 13.5 3063. 227.
## 8 Line 7 16.5 2980. 181.
## 9 Line 6 14.5 2784. 192.
## 10 Line 12 20.8 2274. 109.
## 11 Line 10 9.2 1834. 199.
## 12 Monorail 20.2 1620 80.2
## 13 Line 4A 2.7 377. 140.
OK, so that roughly follows what we would expect: the more expensive a line is, the longer it tends to be (although Lines 11 and 7A + 9 seem to be exceptions).
Now there is one factor that is in the data that I haven’t considered: the presence of tunnels. Usually the more a transit line is in a tunnel (i.e., below grade) the more expensive the project tends to be. With that in mind we would expect the projects with the highest cost per km to have tunnels.
To check this, we first need to create a new logical variable to show the presence or absence of tunnels. We can do this using a simple ifelse() operation on the tunnel variable, which is the length of any tunnels (so a 0 would mean no tunnels).
data %>%
filter(city == "Mumbai") %>%
select(line, length, real_cost, cost_km_millions, tunnel) %>%
mutate(real_cost = as.numeric(real_cost)) %>%
mutate(tunnel_yn = ifelse(tunnel > 0, TRUE, FALSE)) %>% # TRUE/FALSE spelled out; T and F can be overwritten
select(line, length, real_cost, cost_km_millions, tunnel_yn) %>%
arrange(desc(cost_km_millions))
## # A tibble: 13 x 5
## line length real_cost cost_km_millions tunnel_yn
## <chr> <dbl> <dbl> <dbl> <lgl>
## 1 Line 3 33.5 15040 449. TRUE
## 2 Line 11 12.8 3401. 266. TRUE
## 3 Line 7A + 9 13.5 3063. 227. TRUE
## 4 Line 2B 23.5 5163. 220. FALSE
## 5 Line 4 32.3 6838. 212. FALSE
## 6 Line 10 9.2 1834. 199. TRUE
## 7 Line 6 14.5 2784. 192. FALSE
## 8 Line 7 16.5 2980. 181. FALSE
## 9 Line 2A 18.6 3077. 165. FALSE
## 10 Line 5 24.9 3956. 159. FALSE
## 11 Line 4A 2.7 377. 140. FALSE
## 12 Line 12 20.8 2274. 109. FALSE
## 13 Monorail 20.2 1620 80.2 FALSE
Great! It looks like my reasoning held up for the most part: the top three projects by cost per km all have tunnels. The one tunneled project outside the top three (Line 10) has only a very short tunnel section (0.7 of its 9.2 km), and thus does not suffer from an inflated cost per km.
What is interesting is that the two projects with the lowest cost per km are two of the longer lines at over 20km in length.
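As an aside on the tunnel_yn step: the comparison tunnel > 0 already returns a logical vector, so wrapping it in ifelse() gives the same result. A small sketch with made-up tunnel lengths (the vector and variable names are hypothetical) shows the equivalence, and also that a missing tunnel length stays NA rather than becoming FALSE, which would matter for cities such as Beijing where some tunnel values are NA.

```r
# Hypothetical tunnel lengths in km, including one missing value
tunnel <- c(33.5, 0, 2.2, NA)

via_ifelse  <- ifelse(tunnel > 0, TRUE, FALSE)
via_compare <- tunnel > 0

via_compare
same_result <- identical(via_ifelse, via_compare)
same_result
```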
So now that I have identified my variables of interest, I’ve decided to look at the different projects in Mumbai and see how they compare in length, real total cost, real cost per km, and the presence or absence of a tunnel.
To answer this I will look at the following variables: line, length, real_cost, cost_km_millions, and tunnel (recoded as tunnel_yn).
Since the main variable I want to highlight is total cost (a continuous variable), I think the best way to plot this is as a bar plot. I imagine this as a similar plot to the API plot I did for assignment 6. To make it easier, I will use coord_flip() while still coding my x axis as the projects/lines, which I will recode as factors and then reorder by cost per km, so that the first (top) bar is the project with the greatest cost per km.
To make the x (really y) axis ordering more apparent, I will also overplot a second set of bars (a second geom_col) over the first to show the cost per km of each line, so that it is clear they are in descending order and the scale is easier to read.
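The reordering idea can be seen on its own with forcats::fct_reorder(), which sorts factor levels by another variable in ascending order. The sketch below uses three Mumbai lines with their cost per km as displayed (rounded) in the table above; the object names are hypothetical.

```r
library(forcats)

# Three lines and their cost per km (millions), rounded as displayed above
line    <- factor(c("Line 3", "Monorail", "Line 11"))
cost_km <- c(449, 80.2, 266)

# Levels are reordered from lowest to highest cost per km
reordered <- fct_reorder(line, cost_km)
levels(reordered)
```

ggplot2 draws the first level nearest the origin, so after coord_flip() the last level (the costliest line per km) ends up as the top bar.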
Let’s start with a simple plot. First, let’s define our data for the plot.
plot_data <- data %>%
filter(city == "Mumbai") %>%
select(line, length, real_cost, cost_km_millions, tunnel) %>%
mutate(real_cost = as.numeric(real_cost)) %>%
mutate(tunnel_yn = ifelse(tunnel > 0, TRUE, FALSE)) %>% # TRUE/FALSE spelled out; T and F can be overwritten
select(line, length, real_cost, cost_km_millions, tunnel_yn) %>%
mutate(line = as.factor(line))
Now let’s run a basic plot
plot_data %>% # Call data
mutate(line = fct_reorder(line, real_cost)) %>% # Reorders lines by real cost
ggplot() + # Start ggplot
geom_col(aes(x = line, y = real_cost, fill = tunnel_yn)) + # First bars which show cost and presence/absence of tunnel
coord_flip() + # flips x/y so everything I code as x is really on the y axis
labs(title = "Real Cost of Mumbai Transit Projects") + # Add a title
xlab("Transit Line") + # Label x axis
ylab("Real Cost in Millions of Dollars ($)") # Label y axis
Interesting, now let’s visually compare cost and length and see if there’s anything interesting we can observe.
plot_data %>%
mutate(line = fct_reorder(line, length)) %>% # This time order by length
ggplot() +
geom_col(aes(x = line, y = length, fill = tunnel_yn)) + # Same thing plot length instead of cost
coord_flip() +
labs(title = "Length of Mumbai Transit Projects") + # Title updated to match what is plotted
xlab("Transit Line") +
ylab("Length in km") # Update for new axis
Interestingly, the monorail, despite being the second cheapest project, is actually one of the longer projects!
Although the length graph is interesting, I am going to keep the cost rather than length. If I find I want to add in length later I can always use a geom_text and list line lengths.
Let’s move back to plot 1 and start making some refinements. First, the original plot has pretty wide variance due to the wide range between Line 4A and Line 3, so taking a log transformation may make this a little less extreme.
plot_data %>%
mutate(line = fct_reorder(line, real_cost)) %>% # Flip back to ordering by real cost instead of length
ggplot() +
geom_col(aes(x = line, y = real_cost, fill = tunnel_yn)) + # Back to bars to show cost and presence/absence of tunnel
coord_flip() +
labs(title = "Real Cost of Mumbai Transit Projects") +
xlab("Transit Line") +
ylab("Real Cost in Millions of Dollars ($)") + # Update back to real cost
scale_y_continuous(trans = 'log10') # Do a log10 transformation to y-axis to reduce extremes/variance
Great! Now we should just remember to make note of the log scale at some point since log scales are not always typical (and can be optically deceptive).
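To get a feel for how much the log scale compresses the range, here is a small sketch comparing the cheapest and most expensive Mumbai projects from the table above (Line 4A and Line 3). The variable names are hypothetical.

```r
# Real cost (millions) of the cheapest and most expensive projects
real_cost <- c(`Line 4A` = 377.28, `Line 3` = 15040)

# On a linear axis, one bar is this many times longer than the other
ratio <- max(real_cost) / min(real_cost)
ratio

# On a log10 axis, the two values are only this far apart
log_gap <- diff(log10(real_cost))
log_gap
```

A ratio of roughly 40 to 1 shrinks to a gap of less than two units on the log10 axis, which is why the short bars become readable and also why the bar lengths can no longer be compared proportionally.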
Now let’s try to do the overplot to show the cost per km
plot_data %>%
mutate(line = fct_reorder(line, real_cost)) %>%
ggplot() +
geom_col(aes(x = line, y = real_cost, fill = tunnel_yn)) +
geom_col(aes(x = line, y = cost_km_millions)) + # Adds additional plot to show the cost/km
coord_flip() +
labs(title = "Real Cost of Mumbai Transit Projects") +
xlab("Transit Line") +
ylab("Real Cost in Millions of Dollars ($)") +
scale_y_continuous(trans = 'log10')
Now let’s try reordering the lines so they are descending by cost per km.
plot_data %>%
mutate(line = fct_reorder(line, cost_km_millions)) %>% # Order by cost/km (ascending levels, so the costliest per km is on top after coord_flip)
ggplot() +
geom_col(aes(x = line, y = real_cost, fill = tunnel_yn)) +
geom_col(aes(x = line, y = cost_km_millions)) +
coord_flip() +
labs(title = "Real Cost of Mumbai Transit Projects") +
xlab("Transit Line") +
ylab("Real Cost in Millions of Dollars ($)") +
scale_y_continuous(trans = 'log10')
Now let’s add in the line length as text beside each bar
plot_data %>%
mutate(line = fct_reorder(line, cost_km_millions)) %>%
ggplot() + # Start of ggplot
geom_col(aes(x = line, y = real_cost, fill = tunnel_yn)) +
geom_col(aes(x = line, y = cost_km_millions)) +
geom_text(aes(x = line,
y = real_cost,
label = paste0(length, "km"), # Assigns text to display as the length followed by km
color = "#FF9933"), # Add some color in! Kept inside aes() so the text gets its own legend entry
fontface = "bold", # Make it bold to stand out (a fixed setting, so it sits outside aes())
nudge_y = .3) + # Nudge slightly so just past the bars on graph
coord_flip() +
labs(title = "Real Cost of Mumbai Transit Projects") +
xlab("Transit Line") +
ylab("Real Cost in Millions of Dollars ($)") +
scale_y_continuous(trans = 'log10') +
scale_color_manual(name = "Text Color", labels = "Project Length in km", values = "#FF9933") # Adds in key to explain text
This is coming together nicely, now let’s just clean up a few details, add a few thematic elements and update our labels.
plot_data %>%
mutate(line = fct_reorder(line, cost_km_millions)) %>%
ggplot() +
geom_col(aes(x = line, y = real_cost, fill = tunnel_yn)) +
geom_col(aes(x = line, y = cost_km_millions, fill = "#FF9933")) +
geom_text(aes(x = line,
y = real_cost,
label = paste0(length, "km"),
color = "#FF9933"),
fontface = "bold",
nudge_y = .3) +
coord_flip() +
scale_fill_manual(name = "Bar Color", # Manually controls bar fill to reflect different characteristics
labels = c("Cost per km", "No Tunnel", "Has Tunnel"), # Will cause a legend to appear with labels
values = c("#FF9933", "#000080", "#138808")) + # Assign color values for each (taken from Indian flag)
labs(title = "Real Cost of Transit Projects in Mumbai, India") +
xlab("Transit Line") +
ylab("Real Cost in Millions of Dollars ($)") +
scale_y_continuous(trans = 'log10') +
scale_color_manual(name = "Text Color", labels = "Project Length in km", values = "#FF9933") # Adds in key to explain text
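The fill legend in this chunk relies on the order in which ggplot2 sorts the three fill values ("#FF9933", FALSE, TRUE), so the labels and colors have to be listed in exactly that order. A possible alternative, sketched here as an assumption rather than the approach used above, is to map fill to descriptive strings and pass a named vector to scale_fill_manual(), so each color is matched by name. The toy data frame reuses three rows from the tables above (costs rounded as displayed); tunnel_label and fill_colors are hypothetical names.

```r
library(ggplot2)

# Three Mumbai lines, values rounded as displayed in the tables above
toy <- data.frame(
  line             = c("Line 3", "Line 4", "Monorail"),
  real_cost        = c(15040, 6838, 1620),
  cost_km_millions = c(449, 212, 80.2),
  tunnel_yn        = c(TRUE, FALSE, FALSE)
)

# Hypothetical helper column: readable labels instead of TRUE/FALSE
toy$tunnel_label <- ifelse(toy$tunnel_yn, "Has Tunnel", "No Tunnel")

# Named vector: each color is tied to its label, not to sort order
fill_colors <- c("Cost per km" = "#FF9933",
                 "No Tunnel"   = "#000080",
                 "Has Tunnel"  = "#138808")

p <- ggplot(toy) +
  geom_col(aes(x = line, y = real_cost, fill = tunnel_label)) +
  geom_col(aes(x = line, y = cost_km_millions, fill = "Cost per km")) +
  coord_flip() +
  scale_fill_manual(name = "Bar Color", values = fill_colors)

p
```

With named values, adding or renaming a category does not silently shift the colors of the others.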
Hmmm, with the log transformation of the axis, the cost per km bar really loses its intended meaning, which was to show how much of the total cost a single km accounts for. Let’s try dropping the log scale and seeing how it looks.
plot_data %>%
mutate(line = fct_reorder(line, cost_km_millions)) %>%
ggplot() +
geom_col(aes(x = line, y = real_cost, fill = tunnel_yn)) +
geom_col(aes(x = line, y = cost_km_millions, fill = "#FF9933")) +
geom_text(aes(x = line,
y = real_cost,
label = paste0(length, "km"),
color = "#FF9933"),
fontface = "bold",
nudge_y = .3) +
coord_flip() +
scale_fill_manual(name = "Bar Color",
labels = c("Cost per km", "No Tunnel", "Has Tunnel"),
values = c("#FF9933", "#000080", "#138808")) +
labs(title = "Real Cost of Transit Projects in Mumbai, India") +
xlab("Transit Line") +
ylab("Real Cost in Millions of Dollars ($)") +
# scale_y_continuous(trans = 'log10') +
scale_color_manual(name = "Text Color", labels = "Project Length in km", values = "#FF9933")
Ugh. This restores the intended effect of showing the cost per km as equal fractions of the line (i.e., you could take the cost per km bar for Line 3 in the top row, multiply it by 33.5, and it would span the whole bar), but it makes the orange bar so small for most lines that it almost becomes meaningless, so let’s stick with the logged axis.
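The "multiply by the length" relationship can be checked numerically. The sketch below uses three rows from the tables above with values rounded as displayed, so small differences are expected; implied_cost and rel_diff are hypothetical column names.

```r
library(dplyr)

# Values rounded as displayed in the tables above
toy <- tibble(
  line             = c("Line 3", "Line 4", "Monorail"),
  length           = c(33.5, 32.3, 20.2),
  real_cost        = c(15040, 6838, 1620),
  cost_km_millions = c(449, 212, 80.2)
) %>%
  mutate(
    implied_cost = cost_km_millions * length,               # cost per km times km
    rel_diff     = abs(implied_cost - real_cost) / real_cost # relative gap to real cost
  )

toy
```

For these rows the implied and reported totals agree to within a fraction of a percent, which is consistent with rounding in the displayed values.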
Now to make a few finishing touches!
plot_data %>%
mutate(line = fct_reorder(line, cost_km_millions)) %>%
ggplot() +
geom_col(aes(x = line, y = real_cost, fill = tunnel_yn)) +
geom_col(aes(x = line, y = cost_km_millions, fill = "#FF9933")) +
geom_text(aes(x = line,
y = real_cost,
label = paste0(length, "km"),
color = "#FF9933"),
fontface = "bold",
size = 3.1, # make the font a little smaller
nudge_y = .25) +
coord_flip() +
scale_fill_manual(name = "Bar Color",
labels = c("Cost per km", "No Tunnel", "Has Tunnel"),
values = c("#FF9933", "#000080", "#138808")) +
labs(title = "Real Cost of Transit Projects in Mumbai, India",
subtitle = "Total Real Cost vs. Cost/km, with Project Length", # Add a subtitle
caption = "X-Axis is scaled to log_10") + # Make note about x-axis being logged
xlab("Transit Line") +
ylab("Real Cost in Millions of Dollars ($)") +
scale_y_continuous(trans = 'log10') +
scale_color_manual(name = "Text Color", labels = "Project Length in km", values = "#FF9933") +
theme_minimal() + # Pick a neutral theme
theme(plot.title = element_text(color = "#FF9933", size = 20, face = "bold"), # Make the title bold, larger and color to match
plot.subtitle = element_text(face = "bold.italic"), # Put subtitle in bold italics
plot.caption = element_text(face = "italic", hjust = 0.5), # Centers and puts caption in italics
legend.position = "right", legend.box = "vertical") # Puts legend at the right but keeps bar and text colors separate
And voilà! A good graph that is easy to read, has nice colors (that fit with the subject), and is clearly labeled.