Introduction

The objective of Project 2 is to get more practice tidying and transforming data. The CDC publishes data regarding employment numbers and mean salaries of healthcare workers for selected years.

The data is available at: https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/Health_US/hus20-21tables/hcempl.xlsx

For analysis, the objective is to look at the trend in each profession and see if employment has been increasing or decreasing and compare this with the change in mean salary. Also, determine which profession has grown most throughout the last 20 years.

Processing

First, load required packages.

library(tidyverse)
library(kableExtra)
library(rio)
library(httr)

Load the data and take a look at it.

# Learned how to read in excel file from github from:
# https://community.rstudio.com/t/read-xlsx-from-github/9386/7

github_link <- "https://github.com/klgriffen96/spring23_data607_proj2/blob/main/hcempl.xlsx?raw=true"
temp_file <- tempfile(fileext = ".xlsx")
req <- GET(github_link, 
          # write result to disk
           write_disk(path = temp_file))
df <- import(temp_file, skip=3)
kable(head(df))  |>
  kable_styling("striped")
…1 Employment\1 …3 …4 …5 …6 …7 …8 Mean hourly wage (dollars)\2 …10 …11 …12 …13 …14 …15
Occupation title 2000 2005 2009 2010 2015 2016 2020 2000 2005 2009 2010 2015 2016 2020.00
Health care practitioners and technical occupations NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Audiologists 11530 10030 12590 12860 12070 12310 13300 22.92 27.72 32.14 33.58 37.22 38.119999999999997 42.90
Cardiovascular technologists and technicians 40080 43560 48070 48720 51400 53760 55980 16.809999999999999 19.989999999999998 23.91 24.38 26.97 27.45 29.30
Clinical laboratory technologists and technicians 320550 326920 326220 24.91 25.13 26.92
Dental hygienists 148460 161140 173900 177520 200550 204990 194830 24.99 29.15 32.630000000000003 33.020000000000003 34.96 35.31 37.53

Ultimately, the component names in the data frame should be the following:

First notice how the information that we want to be the column name in most cases is spread between the column name and the first row, rename the columns so they have all the relevant information.

df <- rename(df, employment = "Employment\\1")
df <- rename(df, mean_hourly_wage = "Mean hourly wage (dollars)\\2")
temp <- ""
for (i in 1:length(df)){
  if (str_starts(colnames(df)[i], "[a-zA-Z]")){
    temp <- colnames(df)[i]
  }
  colnames(df)[i] <- paste(temp, df[[i]][1] ,sep = ',')
  if (i == 1) {
    colnames(df)[i] <- df[[i]][1]
  }
}
df <- rename(df, occupation_title = "Occupation title")
kable(head(df))  |>
  kable_styling("striped")
occupation_title employment,2000 employment,2005 employment,2009 employment,2010 employment,2015 employment,2016 employment,2020 mean_hourly_wage,2000 mean_hourly_wage,2005 mean_hourly_wage,2009 mean_hourly_wage,2010 mean_hourly_wage,2015 mean_hourly_wage,2016 mean_hourly_wage,2020
Occupation title 2000 2005 2009 2010 2015 2016 2020 2000 2005 2009 2010 2015 2016 2020.00
Health care practitioners and technical occupations NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Audiologists 11530 10030 12590 12860 12070 12310 13300 22.92 27.72 32.14 33.58 37.22 38.119999999999997 42.90
Cardiovascular technologists and technicians 40080 43560 48070 48720 51400 53760 55980 16.809999999999999 19.989999999999998 23.91 24.38 26.97 27.45 29.30
Clinical laboratory technologists and technicians 320550 326920 326220 24.91 25.13 26.92
Dental hygienists 148460 161140 173900 177520 200550 204990 194830 24.99 29.15 32.630000000000003 33.020000000000003 34.96 35.31 37.53

Cut out the first row which no longer has information needed as that information got appended to first row. Cut out the last three rows which contained notes.

df <- df[-1,]
df <- head(df, - 3)  

There are two categories of occupations which is worth preserving, there is “Health care practitioners and technical occupations” and then there is “Health care support occupations”. These are good to preserve because analysis can be broken up by category. These two distinctions are called out as a row in the occupation column, and then they have no column associated with them. Create a column for occupation type and fill it in with the appropriate name.

occupation_category_list <- c()
temp <- ""
for (i in 1:dim(df)[1]){
  if (str_detect(df$occupation_title[i], "occupations")){
    temp <- df$occupation_title[i]
  }
  occupation_category_list <- append(occupation_category_list, temp)
}
df <- mutate(df, occupation_category = occupation_category_list)

Now remove those rows that just have the occupation category and no data.

cat <- unique(df$occupation_category)
rem <- c()
for (i in 1:dim(df)[1]){
  if (df$occupation_title[i] %in% unique(df$occupation_category)){
    rem <- append(rem, i)
  }
}
df <- df[-(rem),]

Some of the data was read in as numbers and others read in as characters, change so that every number is represented as an integer.

for (i in 1:length(df)){
  if (str_detect(colnames(df)[i], "occupation") == FALSE){
    df[[i]] <- as.integer(df[[i]])
  }
}

Now that the data has been cleaned up a bit, perform the necessary pivots to get the desired dataframe.

df_l <- df |> 
  pivot_longer(
    cols = !(occupation_title | occupation_category),
    names_to = c("emp_count_wage", "year"), 
    names_sep = ",",
    values_to = "count_wage"
  )

df_l$year <- as.integer(df_l$year)

df_w <- df_l |>
  pivot_wider(names_from = emp_count_wage,
              values_from = count_wage
  )

Take a look at the new dataframe.

head(df_w)
## # A tibble: 6 × 5
##   occupation_title occupation_category                      year emplo…¹ mean_…²
##   <chr>            <chr>                                   <int>   <int>   <int>
## 1 Audiologists     Health care practitioners and technica…  2000   11530      22
## 2 Audiologists     Health care practitioners and technica…  2005   10030      27
## 3 Audiologists     Health care practitioners and technica…  2009   12590      32
## 4 Audiologists     Health care practitioners and technica…  2010   12860      33
## 5 Audiologists     Health care practitioners and technica…  2015   12070      37
## 6 Audiologists     Health care practitioners and technica…  2016   12310      38
## # … with abbreviated variable names ¹​employment, ²​mean_hourly_wage

Analysis

As previously mentioned this analysis is to look at the trend in each profession and see if employment has been increasing or decreasing and compare this with the change in mean salary. Also, determine which profession has grown most throughout the last 20 years.

First, just get a general idea of the data, did employment go up or go down?

ggp <- df_w|>
  group_by(occupation_title) |>
  mutate(emp_norm = employment/max(employment)) |>
  ggplot(aes(x = year, y = emp_norm, color=occupation_title)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE, formula = 'y ~ x') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), 
                                   legend.position="none")

show(ggp)

Based on this plot, not all employment numbers increased. Check the 3 professions employment increased the most for and then the three professions it increased the least (or decreased) for.

df_n <- df_w|>
  group_by(occupation_title) |>
  mutate(emp_norm = employment/max(employment)) |>
  filter(year == 2020,
         is.na(emp_norm) == FALSE) |>
  arrange(desc(emp_norm), desc(employment))

kable(head(df_n,3))  |>
  kable_styling("striped")
occupation_title occupation_category year employment mean_hourly_wage emp_norm
Registered nurses\4 Health care practitioners and technical occupations 2020 2986500 38 1
Medical assistants Health care support occupations 2020 710200 17 1
Pharmacy technicians Health care practitioners and technical occupations 2020 415310 17 1
kable(tail(df_n,3))  |>
  kable_styling("striped")
occupation_title occupation_category year employment mean_hourly_wage emp_norm
Pharmacy aides Health care support occupations 2020 38900 15 0.6495241
Occupational therapy aides Health care support occupations 2020 5630 16 0.6332958
Medical transcriptionists Health care support occupations 2020 49530 17 0.5088873

It looks like the professions that had the least relative growth in 2020 were all in the health care support occupations category.

Now, take a look at wages.

ggp <- df_w|>
  group_by(occupation_title) |>
  ggplot(aes(x = year, y = mean_hourly_wage, color=occupation_title)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE, formula = 'y ~ x') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), 
                                   legend.position="none")

show(ggp)

It looks like across all professions, hourly wage grew. Check which 3 professions hourly wages grew the most and which 3 grew the least.

df_n <- df_w|>
  group_by(occupation_title) |>
  mutate(wage_norm = mean_hourly_wage/max(mean_hourly_wage),
         wage_diff = 1 - min(wage_norm)) |>
  filter(is.na(wage_diff) == FALSE,
         year == 2020) |>
  arrange(desc(wage_diff))

kable(head(df_n,3))  |>
  kable_styling("striped")
occupation_title occupation_category year employment mean_hourly_wage wage_norm wage_diff
Audiologists Health care practitioners and technical occupations 2020 13300 42 1 0.4761905
Medical equipment preparers Health care support occupations 2020 56870 19 1 0.4736842
Physician assistants Health care practitioners and technical occupations 2020 125280 55 1 0.4727273
kable(tail(df_n,3))  |>
  kable_styling("striped")
occupation_title occupation_category year employment mean_hourly_wage wage_norm wage_diff
Occupational therapy aides Health care support occupations 2020 5630 16 1 0.3125000
Medical transcriptionists Health care support occupations 2020 49530 17 1 0.2941176
Physical therapist aides Health care support occupations 2020 45790 14 1 0.2857143

The occupation category that wages grew the least for was Health care support occupations, so it is no surprise that that category sees professionals leaving - both Occupational therapy aids and Medical transcriptions are listed in the bottom 3 pay increase and have high percentages of their workforce leaving.

Conclusion

In conclusion, the goal was to be able to study the CDC data on employment numbers and wages for health care workers. This objective was met.