library(tidyverse)
library(openxlsx)
library(patchwork)

Introduction

Tidying data is apparently one of the most common uses of your time as a data scientist. That is why having the methods for tidying data down is important to learn. In this assignment we will be importing untidy data from a .csv file, tidying the data up, and then performing analysis on the data. The data we will be working on are three different untidy datasets provided by our classmates:

  1. The New York Gasoline Retail Prices Weekly Average by Region - Susanna Wong

Here the requested analysis is to: “Create a line graph to view the trend or compare the prices of the gasoline for each region. We can view if the prices increases or decreases overall”.

  1. Healthcare employment and wages - Jian Quan Chen

Here the requested analysis is to: “Look at the trend in each profession and see if employment has been increasing or decreasing and compare this with the change in mean salary.”

  1. International Mortality Estimates by Year - Shoshana Farber

Here the requested analysis is to: “Compare mortality rates for males and females between countries”.

Loading The Data

For the New York gas prices, we’re able to load the data in by utilizing readr to read the download url directly from data.ny.gov.

url <- r"(https://data.ny.gov/api/views/nqur-w4p7/rows.csv?accessType=DOWNLOAD&sorting=true)"
df_gas <- read_csv(url, skip_empty_rows = TRUE, show_col_types = FALSE)
glimpse(df_gas)
## Rows: 801
## Columns: 18
## $ Date                             <chr> "02/27/2023", "02/20/2023", "02/13/20…
## $ `New York State Average ($/gal)` <dbl> 3.47, 3.51, 3.54, 3.57, 3.56, 3.48, 3…
## $ `Albany Average ($/gal)`         <dbl> 3.51, 3.54, 3.85, 3.61, 3.59, 3.48, 3…
## $ `Batavia Average ($/gal)`        <dbl> 3.38, 3.38, 3.72, 3.44, 3.43, 3.37, 3…
## $ `Binghamton Average ($/gal)`     <dbl> 3.41, 3.45, 3.77, 3.52, 3.50, 3.45, 3…
## $ `Buffalo Average ($/gal)`        <dbl> 3.39, 3.41, 3.75, 3.45, 3.45, 3.41, 3…
## $ `Dutchess Average ($/gal)`       <dbl> 3.56, 3.59, 3.90, 3.63, 3.62, 3.54, 3…
## $ `Elmira Average ($/gal)`         <dbl> 3.40, 3.43, 3.77, 3.51, 3.47, 3.37, 3…
## $ `Glens Falls Average ($/gal)`    <dbl> 3.62, 3.63, 3.88, 3.69, 3.67, 3.58, 3…
## $ `Ithaca Average ($/gal)`         <dbl> 3.47, 3.50, 3.82, 3.53, 3.51, 3.49, 3…
## $ `Kingston Average ($/gal)`       <dbl> 3.38, 3.40, 3.74, 3.49, 3.47, 3.38, 3…
## $ `Nassau Average ($/gal)`         <dbl> 3.36, 3.42, 3.73, 3.58, 3.49, 3.37, 3…
## $ `New York City Average ($/gal)`  <dbl> 3.48, 3.54, 3.82, 3.64, 3.62, 3.53, 3…
## $ `Rochester Average ($/gal)`      <dbl> 3.45, 3.47, 3.78, 3.52, 3.51, 3.46, 3…
## $ `Syracuse Average ($/gal)`       <dbl> 3.45, 3.47, 3.81, 3.52, 3.50, 3.44, 3…
## $ `Utica Average ($/gal)`          <dbl> 3.50, 3.53, 3.87, 3.62, 3.62, 3.59, 3…
## $ `Watertown Average ($/gal)`      <dbl> 3.48, 3.50, 3.86, 3.56, 3.54, 3.47, 3…
## $ `White Plains Average ($/gal)`   <dbl> 3.56, 3.61, 3.94, 3.66, 3.66, 3.58, 3…

For the healthcare employment and wages data, the CDC provides a download link for the data in xlsx format. Looking through packages in r, I discovered openxlsx which allows for loading in xlsx files through their download urls as dataframes.

While loading in this data we preemptively start cleaning away metadata that comes in the file. The first 4 rows are skipped as they don’t contain any data. Then we take a subset that does not include the last 3 rows as this is also metadata. We also want to add indicators for what the date columns are for to break it down further down the line.

url <- r"(https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/Health_US/hus20-21tables/hcempl.xlsx)"
df_healthcare <- head(
    read.xlsx(url,na.strings = c('…'),startRow = 5)
  , -5)
colnames(df_healthcare)[2:8] <- paste(colnames(df_healthcare)[2:8],"Employment",sep="_")
colnames(df_healthcare)[9:15] <- paste(colnames(df_healthcare)[9:15],"Wage",sep="_")
glimpse(df_healthcare)
## Rows: 42
## Columns: 15
## $ Occupation.title  <chr> "Health care practitioners and technical occupations…
## $ `2000_Employment` <dbl> NA, 11530, 40080, NA, 148460, 31760, 28010, 43030, 1…
## $ `2005_Employment` <dbl> NA, 10030, 43560, NA, 161140, 43590, 23780, 48850, 1…
## $ `2009_Employment` <dbl> NA, 12590, 48070, NA, 173900, 51630, 24510, 53220, 2…
## $ `2010_Employment` <dbl> NA, 12860, 48720, NA, 177520, 53010, 23890, 53510, 2…
## $ `2015_Employment` <dbl> NA, 12070, 51400, 320550, 200550, 61250, 28950, 5974…
## $ `2016_Employment` <dbl> NA, 12310, 53760, 326920, 204990, 65790, 32240, 6143…
## $ `2020_Employment` <dbl> NA, 13300, 55980, 326220, 194830, 73920, 26430, 6633…
## $ `2000_Wage`       <dbl> NA, 22.92, 16.81, NA, 24.99, 22.03, 10.98, 18.76, 11…
## $ `2005_Wage`       <dbl> NA, 27.72, 19.99, NA, 29.15, 26.65, 12.20, 22.09, 13…
## $ `2009_Wage`       <dbl> NA, 32.14, 23.91, NA, 32.63, 30.60, 13.72, 25.59, 15…
## $ `2010_Wage`       <dbl> NA, 33.58, 24.38, NA, 33.02, 31.20, 13.86, 26.13, 16…
## $ `2015_Wage`       <dbl> NA, 37.22, 26.97, 24.91, 34.96, 34.08, 14.03, 28.08,…
## $ `2016_Wage`       <dbl> NA, 38.12, 27.45, 25.13, 35.31, 34.49, 14.12, 28.69,…
## $ `2020_Wage`       <dbl> NA, 42.90, 29.30, 26.92, 37.53, 37.40, 15.83, 30.84,…

For the international mortality dataset, I downloaded the csv provided by the source’s website and uploaded it to GitHub. From GitHub it is loaded in through readr where we skip the first row in order to have direct access to the underlying columns. We skip the first three columns as a row index would be redundant and we will not need country codes for the analysis. We also set the names, in order to be able to pivot longer during data tidying.

url <- r"(https://raw.githubusercontent.com/alu-potato/DATA607/main/Assignments/Week%206%20Project/InternationalMortality.csv)"
names_mortality <- c("Row","FIPS","GENC","Country","Year","Both_Infant Mortality","Male_Infant Mortality","Female_Infant Mortality","Both_Child Mortality","Male_Child Mortality","Female_Child Mortality","Both_Under 5 Mortality","Male_Under 5 Mortality","Female_Under 5 Mortality","Both_Life Expectancy","Male_Life Expectancy","Female_Life Expectancy")
suppressMessages(
  df_mortality <- read_csv(url, skip_empty_rows = TRUE, show_col_types = FALSE, skip = 2, col_select = c(-1:-3), col_names = names_mortality)
)
glimpse(df_mortality)
## Rows: 3,178
## Columns: 14
## $ Country                    <chr> "Afghanistan", "Afghanistan", "Afghanistan"…
## $ Year                       <dbl> 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2…
## $ `Both_Infant Mortality`    <dbl> 127.79, 125.79, 123.82, 121.88, 119.96, 118…
## $ `Male_Infant Mortality`    <dbl> 136.47, 134.45, 132.47, 130.50, 128.57, 126…
## $ `Female_Infant Mortality`  <dbl> 118.68, 116.70, 114.74, 112.81, 110.92, 109…
## $ `Both_Child Mortality`     <dbl> 71.37, 69.90, 68.45, 67.04, 65.64, 64.29, 6…
## $ `Male_Child Mortality`     <dbl> 70.34, 68.93, 67.54, 66.19, 64.85, 63.55, 6…
## $ `Female_Child Mortality`   <dbl> 72.44, 70.89, 69.38, 67.91, 66.46, 65.05, 6…
## $ `Both_Under 5 Mortality`   <dbl> 190.04, 186.89, 183.79, 180.75, 177.73, 174…
## $ `Male_Under 5 Mortality`   <dbl> 197.21, 194.11, 191.06, 188.05, 185.08, 182…
## $ `Female_Under 5 Mortality` <dbl> 182.52, 179.32, 176.16, 173.06, 170.01, 167…
## $ `Both_Life Expectancy`     <dbl> 48.93, 49.32, 49.71, 50.10, 50.49, 50.87, 5…
## $ `Male_Life Expectancy`     <dbl> 47.72, 48.08, 48.45, 48.81, 49.17, 49.52, 4…
## $ `Female_Life Expectancy`   <dbl> 50.20, 50.62, 51.04, 51.46, 51.88, 52.29, 5…

Tidying The Data

Now that we have a tibble ready to be worked with in R, we need to think about what exactly needs to be transformed to make this data tidy. First we’ll consider the variables within each data set.

In the case of the gasoline retail prices there are only the variables of date, region, and gas price. Currently date is stored as a string, so we want to change that to the type of date. Region is currently wide and needs to be turned into a long format. Gas price has the same problem as region since these two are tied together.

To achieve these changes: 1. Utilize dplyr::mutate() in order to change the column type of date to date. 2. Utilize tidyr::pivot_longer() in order to make region and gas price individual columns.

df_gas_tidy <- df_gas |>
  mutate(Date = as.Date(Date,"%m/%d/%Y")) |>
  pivot_longer(cols=2:18,names_to= "Region",values_to = "Average Cost", names_pattern = r"((.*)\sAverage)")
glimpse(df_gas_tidy)
## Rows: 13,617
## Columns: 3
## $ Date           <date> 2023-02-27, 2023-02-27, 2023-02-27, 2023-02-27, 2023-0…
## $ Region         <chr> "New York State", "Albany", "Batavia", "Binghamton", "B…
## $ `Average Cost` <dbl> 3.47, 3.51, 3.38, 3.41, 3.39, 3.56, 3.40, 3.62, 3.47, 3…

In the case of healthcare employment there are the variables of year, occupation title, employment, and mean hourly wage. Currently year is stored as the column names, so we want to pivot that lengthwise. Occupation title needs to have suffixes such as “\4” removed from it and additionally there are occupation categories mixed into the data which don’t provide us much value. Employment and hourly wage are currently tied to year in the wide format that needs to be change lengthwise.

To achieve these changes: 1. Utilize dplyr::mutate() in order to remove the extra characters from the occupation title column. 2. Utilize drop_na to extract occupation categories which can be identified by having NAs in the 2020 columns. 3. Utilize tidyr::pivot_longer() in order to make year, employment, and mean hourly wage individual columns.

df_healthcare_tidy <- df_healthcare |>
  mutate(Occupation.title = gsub(r"(\\\d)","",Occupation.title)) |>
  drop_na(`2020_Employment`) |>
  pivot_longer(cols= -Occupation.title,names_to= c("Year",".value"), names_transform = c(Year = as.integer),names_sep = "_") |>
  rename("Occupation" = Occupation.title)
glimpse(df_healthcare_tidy)
## Rows: 280
## Columns: 4
## $ Occupation <chr> "Audiologists", "Audiologists", "Audiologists", "Audiologis…
## $ Year       <int> 2000, 2005, 2009, 2010, 2015, 2016, 2020, 2000, 2005, 2009,…
## $ Employment <dbl> 11530, 10030, 12590, 12860, 12070, 12310, 13300, 40080, 435…
## $ Wage       <dbl> 22.92, 27.72, 32.14, 33.58, 37.22, 38.12, 42.90, 16.81, 19.…

In the case of international mortality there are the variables of country, year, mortality category, gender, and mortality rate. Currently country and year are stored as we would like. Mortality category needs to be pivoted to be a row value along with gender. Mortality rate is tied to these two variables and will need to be turned into a single value per row.

To achieve these changes: 1. Utilize tidyr::pivot_longer() in order to make mortality category, gender, and rate individual columns.

df_mortality_tidy <- df_mortality |>
  pivot_longer(cols= c(-Country,-Year),names_to= c("Gender","Mortality Category"),values_to = "Mortality Value",names_sep = "_")
glimpse(df_mortality_tidy)
## Rows: 38,136
## Columns: 5
## $ Country              <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afg…
## $ Year                 <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2…
## $ Gender               <chr> "Both", "Male", "Female", "Both", "Male", "Female…
## $ `Mortality Category` <chr> "Infant Mortality", "Infant Mortality", "Infant M…
## $ `Mortality Value`    <dbl> 127.79, 136.47, 118.68, 71.37, 70.34, 72.44, 190.…

Analyzing The Data

Now that our data is tidied up, we can analyze it.

For the gasoline data we want to: Create a line graph to view the trend or compare the prices of the gasoline for each region. We can view if the prices increases or decreases overall. This is a fairly simple analysis that can be achieved through a faceted ggplot line plot.

If we compare the overall trend below starting from 2018 which is when all regions began collecting data, we can see that the price increase and decrease are tied to each other. Overall since 2018 there has been an increase in the gas prices.

df_gas_tidy |>
  filter(Date > as.Date("2018-01-01")) |>
  ggplot(aes(x=Date,y=`Average Cost`)) +
    geom_line() +
    facet_wrap(~Region)

If we tighten our scope on when we want to look at the trend of gas prices and only look at New York State, we can see that overall gas prices have still increased in this year, but they are on a downward trend at the moment in New York State as a whole.

df_gas_tidy |>
  filter(Region == "New York State" & Date > as.Date("2023-01-01")) |>
  ggplot(aes(x=Date,y=`Average Cost`)) +
    geom_line()

Now say we wanted to find out overall what region gas prices are cheapest at, it would be hard to differentiate simply by looking at the line graphs. By looking at a table of mean gas price since 2018 per region, we can see that Elmira is currently the cheapest region to get gas from.

df_gas_tidy |>
  filter(Date > as.Date("2018-01-01")) |>
  group_by(Region) |>
  summarize("Mean Gas Price" = mean(`Average Cost`)) |>
  knitr::kable()
Region Mean Gas Price
Albany 3.000372
Batavia 3.011933
Binghamton 3.009442
Buffalo 3.013606
Dutchess 3.102639
Elmira 2.964721
Glens Falls 3.045502
Ithaca 3.018439
Kingston 2.972416
Nassau 3.005762
New York City 3.120855
New York State 3.050297
Rochester 3.030520
Syracuse 2.991227
Utica 3.061747
Watertown 3.071004
White Plains 3.168104

For the healthcare employment data we want to: Look at the trend in each profession and see if employment has been increasing or decreasing and compare this with the change in mean salary.

We go through this analysis by finding the average of both employment and wage each year. As comparing every single profession individually would not be necessary. If we plot mean employment through the years and mean wage through the years we see a strong upward trend overall for both of these. However, employment actually falls temporarily between 2010 and 2015. Perhaps this was linked to the 2008 financial crisis. It is a promising sign that wages stay increasing throughout regardless.

healthcare_employment <- df_healthcare_tidy |>
  group_by(Year) |>
  summarize("Mean Employment" = mean(Employment,na.rm = TRUE), "Mean Wage" = mean(Wage,na.rm = TRUE)) |>
  ggplot() +
    geom_line(aes(x=Year,y = `Mean Employment`))

healthcare_wage <- df_healthcare_tidy |>
  group_by(Year) |>
  summarize("Mean Employment" = mean(Employment,na.rm = TRUE), "Mean Wage" = mean(Wage,na.rm = TRUE)) |>
  ggplot() +
    geom_line(aes(x=Year,y = `Mean Wage`))

healthcare_employment + healthcare_wage

For the mortality estimates we want to: Compare mortality rates for males and females between countries.

We can begin this comparison by taking the data frame and filtering out the data regarding both genders as we do not need it in our comparison. Then if we group by “Year”,“Gender”, and “Mortality Category” we can summarize the mortality rates per country into the given groups. Finally, we plot each type of mortality split on gender over time.

Utilizing these plots we can compare the different mortality rates for males and females between all the countries. All types of mortality have consistently decreased in the past decade, conversely so has life expectancy overall. The changes do not seem to vary between gender. However, we can see these constant differences between genders: Child mortality does not have a discernible difference between genders. However, for each other type of mortality, the rate is higher for men by about 8%. Life expectancy continues this trend with female life expectancy being about 5 years longer.

df_mortality_tidy |>
  filter(Gender != "Both") |>
  group_by(Year, Gender, `Mortality Category`) |>
  summarize("Mean Mortality" = mean(`Mortality Value`, na.rm = TRUE), .groups = "drop_last")|>
  ggplot() +
    geom_line(aes(x=Year,y = `Mean Mortality`, color=Gender)) +
    facet_wrap(~`Mortality Category`)

Conclusions

Here are our conclusions for each dataset:

  1. Gas prices in New York State as a whole have increased over time.

  2. For healthcare workers in the United States between 2000 and 2020, overall employment has increased and so have wages.

  3. For international mortality rates, both male and female mortality has decreased, but females have a constant lead in not dying compared to men.