The goal of this assignment is to practice in preparing different datasets for downstream analysis work.
We were tasked with choosing three wide datasets identified in the Week 5 Discussion items.
The datasets I chose were:
Cameron Identified a dataset located in the MTA Web Site listing annual ridership at all stations from 2013-2018.
He also suggested “An example of analysis that can be done is the change in ridership per station, or perhaps by borough, from one year to the next.”
Getting the Data I copied/pasted the data (minus the borough totals at the end) from that site to an excel spreadsheet and subsequently saved it to a csv located at my DS607 github repo.
subway_raw <- read.csv(file = 'https://raw.githubusercontent.com/georg4re/DS607/master/data/subway-ridership.csv')
head(subway_raw)
## ï..Station..alphabetical.by.borough. X2013 X2014
## 1 The Bronx
## 2 138 St-Grand Concourse 4 subway 5 subway 957,984 1,033,559
## 3 149 St-Grand Concourse 2 subway 4 subway 5 subway 4,427,399 4,536,888
## 4 161 St-Yankee Stadium B subway D subway 4 subway 8,766,012 8,961,029
## 5 167 St 4 subway 3,081,534 3,067,345
## 6 167 St B subwayD subway 3,091,289 3,245,977
## X2015 X2016 X2017 X2018 X2017.2018.Change X X2018.Rank
## 1 NA
## 2 1,056,380 1,070,024 1,036,746 944,598 -92,148 -8.90% 365
## 3 4,424,754 4,381,900 4,255,015 3,972,763 -282,252 -6.60% 121
## 4 8,922,188 8,784,407 8,596,506 8,392,290 -204,216 -2.40% 38
## 5 3,180,274 3,179,087 2,954,228 2,933,140 -21,088 -0.70% 165
## 6 3,295,032 3,365,748 3,293,451 2,022,919 -1,270,532 -38.60% 231
## ï..Station..alphabetical.by.borough. X2013 X2014
## Length:428 Length:428 Length:428
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## X2015 X2016 X2017 X2018
## Length:428 Length:428 Length:428 Length:428
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## X2017.2018.Change X X2018.Rank
## Length:428 Length:428 Min. : 1.0
## Class :character Class :character 1st Qu.:106.8
## Mode :character Mode :character Median :212.5
## Mean :212.5
## 3rd Qu.:318.2
## Max. :424.0
## NA's :4
As we can see, the dataset is “wide”, it presents several other challenges like:
Cleaning the Data 1: Change column Names
#First column name is tough to rename so using basic R
names(subway_raw)[1] = 'station'
#using dplyr to rename other columns
subway_data <- subway_raw %>%
rename( '2013'= 'X2013',
'2014' = 'X2014',
'2015' = 'X2015',
'2016' = 'X2016',
'2017' = 'X2017',
'2018' = 'X2018',
'2017-2018 Change' = X2017.2018.Change,
'Pct Change' = X,
'2018 Rank' = X2018.Rank
)
colnames(subway_data)
## [1] "station" "2013" "2014" "2015"
## [5] "2016" "2017" "2018" "2017-2018 Change"
## [9] "Pct Change" "2018 Rank"
Cleaning the Data 2: Assign the Borough name to each row
#Find the rows containing the Borough names
#I noticed these are the only ones that have NA in the 2018 rank
borough_names <- subway_data %>%
dplyr::filter(is.na(subway_data$`2018 Rank`)) %>%
select(station)
# Loop thru the subway_data and assign the proper borough to the data
subway_data$borough <- ''
idx <- 0 #begin with the first borough
for (i in 1:nrow(subway_data)) {
if (is.na(subway_data$`2018 Rank`[i])) { #it's the boroug's row
idx <- idx + 1
} else {
subway_data$borough[i] = borough_names$station[idx]
}
}
#I notice I could have avoided extracting the borough's name into
# a vector and just use the previous name within the for loop
#I will now remove the rows with only borough names from the data
subway_data <- subway_data %>%
drop_na(`2018 Rank`)
head(subway_data)
## station 2013 2014
## 1 138 St-Grand Concourse 4 subway 5 subway 957,984 1,033,559
## 2 149 St-Grand Concourse 2 subway 4 subway 5 subway 4,427,399 4,536,888
## 3 161 St-Yankee Stadium B subway D subway 4 subway 8,766,012 8,961,029
## 4 167 St 4 subway 3,081,534 3,067,345
## 5 167 St B subwayD subway 3,091,289 3,245,977
## 6 170 St 4 subway 2,961,575 2,941,958
## 2015 2016 2017 2018 2017-2018 Change Pct Change 2018 Rank
## 1 1,056,380 1,070,024 1,036,746 944,598 -92,148 -8.90% 365
## 2 4,424,754 4,381,900 4,255,015 3,972,763 -282,252 -6.60% 121
## 3 8,922,188 8,784,407 8,596,506 8,392,290 -204,216 -2.40% 38
## 4 3,180,274 3,179,087 2,954,228 2,933,140 -21,088 -0.70% 165
## 5 3,295,032 3,365,748 3,293,451 2,022,919 -1,270,532 -38.60% 231
## 6 3,045,205 3,038,777 2,785,331 2,562,443 -222,888 -8.00% 183
## borough
## 1 The Bronx
## 2 The Bronx
## 3 The Bronx
## 4 The Bronx
## 5 The Bronx
## 6 The Bronx
Finish cleaning data: Changing the numeric column types + remove unneeded columns
subway_data <- subway_data %>%
mutate('2013' = as.numeric(gsub(',', '', subway_data$`2013`)),
'2014' = as.numeric(gsub(',', '', subway_data$`2014`)),
'2015' = as.numeric(gsub(',', '', subway_data$`2015`)),
'2016' = as.numeric(gsub(',', '', subway_data$`2016`)),
'2017' = as.numeric(gsub(',', '', subway_data$`2017`)),
'2018' = as.numeric(gsub(',', '', subway_data$`2018`))) %>%
select(borough, station, '2013', '2014', '2015', '2016', '2017', '2018')
head(subway_data)
## borough station 2013
## 1 The Bronx 138 St-Grand Concourse 4 subway 5 subway 957984
## 2 The Bronx 149 St-Grand Concourse 2 subway 4 subway 5 subway 4427399
## 3 The Bronx 161 St-Yankee Stadium B subway D subway 4 subway 8766012
## 4 The Bronx 167 St 4 subway 3081534
## 5 The Bronx 167 St B subwayD subway 3091289
## 6 The Bronx 170 St 4 subway 2961575
## 2014 2015 2016 2017 2018
## 1 1033559 1056380 1070024 1036746 944598
## 2 4536888 4424754 4381900 4255015 3972763
## 3 8961029 8922188 8784407 8596506 8392290
## 4 3067345 3180274 3179087 2954228 2933140
## 5 3245977 3295032 3365748 3293451 2022919
## 6 2941958 3045205 3038777 2785331 2562443
Pivoting Let’s pivot from wide to long so as to have one single observation per row.
subway_clean <- subway_data %>%
pivot_longer(
3:8,
names_to="year",
values_to="riders",
values_drop_na = TRUE
)
head(subway_clean)
## # A tibble: 6 x 4
## borough station year riders
## <chr> <chr> <chr> <dbl>
## 1 The Bronx 138 St-Grand Concourse 4 subway 5 subway 2013 957984
## 2 The Bronx 138 St-Grand Concourse 4 subway 5 subway 2014 1033559
## 3 The Bronx 138 St-Grand Concourse 4 subway 5 subway 2015 1056380
## 4 The Bronx 138 St-Grand Concourse 4 subway 5 subway 2016 1070024
## 5 The Bronx 138 St-Grand Concourse 4 subway 5 subway 2017 1036746
## 6 The Bronx 138 St-Grand Concourse 4 subway 5 subway 2018 944598
Let’s Analyze this data
Now that we have a long, focused data set, we can go ahead and try to perform the requested analysis. Because of the many stations, I will forgo analyzing a particular station’s number of riders but this analysis can be performed using the subway_clean data set.
riders_per_borough <- subway_clean %>%
group_by(borough, year) %>%
summarize(avg_riders = mean(riders, na.rm = TRUE), .groups = "drop")
riders_per_borough
## # A tibble: 24 x 3
## borough year avg_riders
## <chr> <chr> <dbl>
## 1 Brooklyn 2013 2371925.
## 2 Brooklyn 2014 2437738.
## 3 Brooklyn 2015 2468462.
## 4 Brooklyn 2016 2449301.
## 5 Brooklyn 2017 2416386.
## 6 Brooklyn 2018 2358161.
## 7 Manhattan 2013 8037119.
## 8 Manhattan 2014 8250454.
## 9 Manhattan 2015 8223157.
## 10 Manhattan 2016 8189785.
## # ... with 14 more rows
library(ggplot2)
library(scales)
p <- ggplot(data = riders_per_borough, aes(x = year, y = avg_riders, group = borough, colour=borough)) +
geom_line() +
geom_point() +
scale_x_discrete(breaks = riders_per_borough$year, labels = riders_per_borough$year)
#Ensure a clean display
p + scale_y_continuous(labels = comma)
All boroughs show a decline on ridership within the last few years. This is noticeable from 2016. There was an increase in ridership in 2013 followed by slight increases until 2016. Manhattan, The Bronx and Queens show a steeper decline than Brooklyn. Let’s take a look at the overall ridership:
all_riders <- subway_clean %>%
group_by(year) %>%
summarize(avg_riders = mean(riders, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
p <- ggplot(data = all_riders) +
geom_bar(mapping = aes(x = year, y = avg_riders, fill=year), stat='identity') +
scale_y_continuous(labels = comma)
p
This graph shows that the average number of riders has steadily declined in NY since 2016.
Arushi Arora Identified a dataset located in the Fred Economic Data Website listing annual US Gross Domestic product for 2019 and 2020
Arora suggested that “It might be helpful specially during the pandemic to identify sectors that are doing great and other that have been greatly impacted.”
Getting the Data I saved the data provided by Aroro as a tab delimited file located at my DS607 github repo.
gdp_raw <- read.table(file= 'https://raw.githubusercontent.com/georg4re/DS607/master/data/2019-2020-gdp.txt',
sep="\t", header=TRUE)
head(gdp_raw)
## Line Name Q2.2020 Q1.2020 Q2.2019
## 1 1 Gross domestic product 19,520.114 21,561.139 21,329.877
## 2 2 Personal consumption expenditures 13,097.348 14,545.460 14,497.320
## 3 3 Goods 4,361.518 4,552.919 4,517.679
## 4 4 Durable goods 1,478.299 1,496.444 1,535.984
## 5 5 Nondurable goods 2,883.219 3,056.474 2,981.695
## 6 6 Services 8,735.830 9,992.541 9,979.641
## Line Name Q2.2020 Q1.2020
## Min. : 1.00 Length:26 Length:26 Length:26
## 1st Qu.: 7.25 Class :character Class :character Class :character
## Median :13.50 Mode :character Mode :character Mode :character
## Mean :13.50
## 3rd Qu.:19.75
## Max. :26.00
## Q2.2019
## Length:26
## Class :character
## Mode :character
##
##
##
As we can see, the dataset is “wide”, it also treats the values as characters and not as numeric values. Several sub section data also have the same names, we will drop all repeated subsections with the same name for a smaller dataset without losing important data.
####Cleaning data: Changing the numeric column types
gdp_data <- gdp_raw %>%
mutate('Q2.2020' = as.numeric(gsub(',', '', gdp_raw$`Q2.2020`)) * 1000,
'Q1.2020' = as.numeric(gsub(',', '', gdp_raw$`Q1.2020`)) * 1000,
'Q2.2019' = as.numeric(gsub(',', '', gdp_raw$`Q2.2019`)) * 1000) %>%
rename( '2020.2'= 'Q2.2020',
'2020.1' = 'Q1.2020',
'2019.2' = 'Q2.2019') %>%
distinct(Name, .keep_all = TRUE)
#I multiplied by 1000 to get rid of the decimal point in the notation
head(gdp_data)
## Line Name 2020.2 2020.1 2019.2
## 1 1 Gross domestic product 19520114 21561139 21329877
## 2 2 Personal consumption expenditures 13097348 14545460 14497320
## 3 3 Goods 4361518 4552919 4517679
## 4 4 Durable goods 1478299 1496444 1535984
## 5 5 Nondurable goods 2883219 3056474 2981695
## 6 6 Services 8735830 9992541 9979641
Pivoting Let’s pivot from wide to long so as to have one single observation per row.
gdp_clean <- gdp_data %>%
pivot_longer(
3:5,
names_to="Quarter",
values_to="Gdp",
values_drop_na = TRUE
) %>%
select(Name, Quarter, Gdp)%>%
arrange(Name, Quarter, by_group = TRUE)
head(gdp_clean)
## # A tibble: 6 x 3
## Name Quarter Gdp
## <chr> <chr> <dbl>
## 1 Change in private inventories 2019.2 53060
## 2 Change in private inventories 2020.1 -52117
## 3 Change in private inventories 2020.2 -298356
## 4 Durable goods 2019.2 1535984
## 5 Durable goods 2020.1 1496444
## 6 Durable goods 2020.2 1478299
Now that we have a long, focused data set, we can go ahead and try to perform the requested analysis. We can pay special attention to the second quarter of 2020 to see the areas more greatly affected by the pandemic
all_gdp <- gdp_clean %>%
group_by(Quarter) %>%
summarise(avg_gdp = mean(Gdp), median_gdp = median(Gdp))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 3
## Quarter avg_gdp median_gdp
## <chr> <dbl> <dbl>
## 1 2019.2 3764302 2420831
## 2 2020.1 3779190. 2410136.
## 3 2020.2 3416684. 2060748.
Based on the overall mean gdp we can see that, as we could probably presume, the average GDP figure fell significantly during the 2nd quarter of 2020.
We’ll add a percent change column to our clean data to identify gains/losses by quarter
gdp_by_pct_change <-mutate(gdp_clean, Row = 1:n()) %>%
group_by(Name) %>%
mutate(pct_change = Gdp/lag(Gdp) * 100) %>%
ungroup %>%
select(Name, Quarter, pct_change) %>%
arrange(pct_change)
head(gdp_by_pct_change)
## # A tibble: 6 x 3
## Name Quarter pct_change
## <chr> <chr> <dbl>
## 1 Change in private inventories 2020.1 -98.2
## 2 Exports 2020.2 73.3
## 3 Net exports of goods and services 2020.1 76.7
## 4 Imports 2020.2 79.6
## 5 Gross private domestic investment 2020.2 85.1
## 6 Services 2020.2 87.4
By looking at this data, we can see that Exports and Imports suffered the most during the Second Quarter as they decreased the most among the categories we have.
p <- ggplot(data = gdp_clean, aes(x = Quarter, y = Gdp, group=Name, color=Name )) +
geom_line() +
geom_point() +
scale_y_continuous(labels = comma)
p <- p + theme(legend.position="bottom")
p <- p + guides(fill=guide_legend(nrow=5, byrow=TRUE))
p
We see a general decline in values, let’s take a look at this same graph by percent change:
#Assign 100 to all NA assuming we start at 100% on 2019.2
gdp_by_pct_change[is.na(gdp_by_pct_change)] <- 100
p <- ggplot(data = gdp_by_pct_change, aes(x = Quarter, y = pct_change, group=Name, color=Name )) +
geom_line() +
geom_point() +
scale_y_continuous(labels = comma)
p <- p + theme(legend.position="bottom")
p <- p + guides(fill=guide_legend(nrow=10, byrow=TRUE))
p
Let’s take a look at the bigger changes in the 2nd Quarter:
gdp_by_pct_change %>%
filter(Quarter=="2020.2") %>%
arrange(desc(pct_change)) %>%
slice(1:5) %>%
ggplot(data = .) +
geom_bar(mapping = aes(x = Name, y = pct_change, fill=Name), stat='identity') +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank()
)
Based on this, the biggest gain was in Change in private inventories with a near 600% increase. This is also the same variable decreased almost 100% between 2019 and the 1st quarter in 2020
Most affected
gdp_by_pct_change %>%
filter(Quarter == '2020.2') %>%
arrange(pct_change) %>%
slice(1:5) %>%
ggplot(data = .) +
geom_bar(mapping = aes(x = Name, y = pct_change, fill=Name), stat='identity') +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank()
)
Again, this graph shows that exports, imports and Gross private domestic investments were the most affected sectors in the 2nd Quarter.
Although there is a general decline in values, some sectors like Private inventories and Defense reported gains in their GDP during the 2nd quarter of 2020.