This assignment is to select a Tidy Tuesday posting and use its data to demonstrate some of the visualization methods you learned in the DataCamp course on best practices.

Problem 1

Which Tidy Tuesday item did you select (date)? What is it about? Which variables are you using for your demonstration? You will need a quantitative variable and a categorical variable.

I chose the Tidy Tuesday dataset of 2021-01-05, the Transit Cost Project: https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-01-05/readme.md

This dataset has information about transit system construction projects, such as cost, length, and start year. The data dictionary is in the readme linked above. I use the number of stations (stations) as my quantitative variable and year (year, converted to a factor) as my categorical variable.

Problem 2

Load the libraries you need and the data from Tidy Tuesday.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
transit_cost <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-01-05/transit_cost.csv')
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_character(),
##   e = col_double(),
##   rr = col_double(),
##   length = col_double(),
##   tunnel = col_double(),
##   stations = col_double(),
##   cost = col_double(),
##   year = col_double(),
##   ppp_rate = col_double(),
##   cost_km_millions = col_double()
## )
## i Use `spec()` for the full column specifications.
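transit_cost <- transit_cost %>%
  # start_year is read as character (see the column specification above),
  # so this comparison is string-based rather than numeric
  filter(start_year > 2016) # keep only recent projects

# treat year as a categorical variable
transit_cost$year <- as.factor(transit_cost$year)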
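
Because start_year falls under the default col_character() in the column specification, the filter above compares strings. The following is a minimal sketch of a numeric alternative, not the method used in this assignment. The toy tibble and its "unknown" entry are invented for illustration and are not values taken from the dataset.

```r
library(dplyr)

# Hypothetical stand-in for transit_cost; all values are invented.
toy <- tibble::tibble(
  start_year = c("2015", "2018", "2020", "unknown"),
  stations   = c(10, 4, 7, 3)
)

# String comparison: "unknown" sorts after "2016", so that row is kept.
kept_as_text <- toy %>%
  filter(start_year > 2016)

# Numeric comparison: non-numeric entries become NA, and filter() drops NA.
kept_as_number <- toy %>%
  mutate(start_year = suppressWarnings(as.numeric(start_year))) %>%
  filter(start_year > 2016)

nrow(kept_as_text)
nrow(kept_as_number)
```

Applying the numeric version to the real data could change which rows survive the filter, so the plots and warnings below might differ.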

Problem 3

Create a histogram for your quantitative variable. Display three alternative choices of bin width.

# Place your code here.

# binwidth = 3
ggplot(transit_cost, aes(x = stations)) +
  geom_histogram(binwidth = 3, color = "white") +
  labs(title = "Distribution of Stations Variable")
## Warning: Removed 1 rows containing non-finite values (stat_bin).

# binwidth = 6
ggplot(transit_cost, aes(x = stations)) +
  geom_histogram(binwidth = 6, color = "white") +
  labs(title = "Distribution of Stations Variable")
## Warning: Removed 1 rows containing non-finite values (stat_bin).

# binwidth = 10
ggplot(transit_cost, aes(x = stations)) +
  geom_histogram(binwidth = 10, color = "white") +
  labs(title = "Distribution of Stations Variable")
## Warning: Removed 1 rows containing non-finite values (stat_bin).
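
Each histogram call above warns that one row with a non-finite stations value was removed. One way to make that step explicit is to count and drop missing values before plotting. This is only a sketch: the toy tibble is invented and stands in for transit_cost.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data with one missing station count.
toy <- tibble::tibble(stations = c(2, 5, 5, 8, 12, NA, 20))

# How many values would the histogram layer drop?
n_missing <- sum(is.na(toy$stations))

# Drop them up front so the histogram layer has nothing to remove.
toy_complete <- toy %>%
  filter(!is.na(stations))

p <- ggplot(toy_complete, aes(x = stations)) +
  geom_histogram(binwidth = 3, color = "white") +
  labs(title = "Distribution of Stations Variable")
```

Printing p draws the histogram without the missing-value warning, because the filtering has already happened.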

Problem 4

Create a density plot with a rug for your quantitative variable. Show three different alternatives for the parameter bw.

# Place your code here.

# bw = 5
ggplot(transit_cost, aes(x = stations)) +
  geom_density(bw = 5) +
  geom_rug(alpha = 0.1) +
  labs(title = "Distribution of Stations Variable")
## Warning: Removed 1 rows containing non-finite values (stat_density).

# bw = 10
ggplot(transit_cost, aes(x = stations)) +
  geom_density(bw = 10) +
  geom_rug(alpha = 0.1) +
  labs(title = "Distribution of Stations Variable")
## Warning: Removed 1 rows containing non-finite values (stat_density).

# bw = 15
ggplot(transit_cost, aes(x = stations)) +
  geom_density(bw = 15) +
  geom_rug(alpha = 0.1) +
  labs(title = "Distribution of Stations Variable")
## Warning: Removed 1 rows containing non-finite values (stat_density).
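
The three bw values above were picked by hand. For comparison, when bw is not supplied, geom_density() uses bw = "nrd0", the rule of thumb implemented by stats::bw.nrd0(). The sketch below computes that default on invented station counts (stations_toy is hypothetical, not the real data) and sets it beside a hand-picked value.

```r
# Invented station counts, used only to illustrate the calculation.
stations_toy <- c(2, 3, 5, 5, 6, 8, 9, 12, 15, 20, 28)

# Rule-of-thumb bandwidth that density() and geom_density() use by default.
default_bw <- stats::bw.nrd0(stations_toy)

# Kernel density estimates at the default and at a hand-picked bandwidth.
d_default <- density(stations_toy, bw = default_bw)
d_wide    <- density(stations_toy, bw = 15)

# Bandwidths actually used by each estimate.
c(default = d_default$bw, wide = d_wide$bw)
```

A bandwidth well above the default tends to smooth away features of the distribution, while one well below it tends to produce a bumpy curve.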

Problem 5

Create an appropriate visualization of the relationship between your categorical and quantitative variables.

# Place your code here.

ggplot(transit_cost, aes(x = year, y = stations, fill = year)) +
  geom_boxplot(alpha = 0.1) +
  geom_jitter(width = 0.25, alpha = 0.25) +
  coord_flip() +
  labs(title = "Stations on Transit Lines Each Year")
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1 rows containing missing values (geom_point).
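
A numeric summary can be read alongside the boxplot, since groups with few projects produce boxes that are easy to over-interpret. The following is a sketch on an invented toy tibble. The column names match transit_cost, but the values and the by_year name are hypothetical.

```r
library(dplyr)

# Hypothetical data: year as a factor, stations with one missing value.
toy <- tibble::tibble(
  year     = factor(c(2018, 2018, 2018, 2019, 2019, 2020)),
  stations = c(4, 6, NA, 10, 12, 3)
)

# Project count and median station count per year, ignoring missing values.
by_year <- toy %>%
  group_by(year) %>%
  summarise(
    n_projects      = n(),
    median_stations = median(stations, na.rm = TRUE),
    .groups = "drop"
  )

by_year
```

Note that n_projects counts rows, including the one with a missing stations value.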

Problem 6

Provide an alternative answer to Problem 5.

# Place your code here.

ggplot(transit_cost, aes(x = year, y = stations, fill = year)) +
  geom_violin() +
  coord_flip() +
  labs(title = "Stations on Transit Lines Each Year")
## Warning: Removed 1 rows containing non-finite values (stat_ydensity).