The objective of Project 2 is to get more practice tidying and transforming data. The CDC publishes data regarding employment numbers and mean salaries of healthcare workers for selected years.
The data is available at: https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/Health_US/hus20-21tables/hcempl.xlsx
For analysis, the objective is to look at the trend in each profession and see if employment has been increasing or decreasing and compare this with the change in mean salary. Also, determine which profession has grown most throughout the last 20 years.
First, load required packages.
library(tidyverse)
library(kableExtra)
library(rio)
library(httr)
Load the data and take a look at it.
# Learned how to read in excel file from github from:
# https://community.rstudio.com/t/read-xlsx-from-github/9386/7
github_link <- "https://github.com/klgriffen96/spring23_data607_proj2/blob/main/hcempl.xlsx?raw=true"
temp_file <- tempfile(fileext = ".xlsx")
req <- GET(github_link,
# write result to disk
write_disk(path = temp_file))
df <- import(temp_file, skip=3)
kable(head(df)) |>
kable_styling("striped")
…1 | Employment\1 | …3 | …4 | …5 | …6 | …7 | …8 | Mean hourly wage (dollars)\2 | …10 | …11 | …12 | …13 | …14 | …15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Occupation title | 2000 | 2005 | 2009 | 2010 | 2015 | 2016 | 2020 | 2000 | 2005 | 2009 | 2010 | 2015 | 2016 | 2020.00 |
Health care practitioners and technical occupations | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Audiologists | 11530 | 10030 | 12590 | 12860 | 12070 | 12310 | 13300 | 22.92 | 27.72 | 32.14 | 33.58 | 37.22 | 38.119999999999997 | 42.90 |
Cardiovascular technologists and technicians | 40080 | 43560 | 48070 | 48720 | 51400 | 53760 | 55980 | 16.809999999999999 | 19.989999999999998 | 23.91 | 24.38 | 26.97 | 27.45 | 29.30 |
Clinical laboratory technologists and technicians | … | … | … | … | 320550 | 326920 | 326220 | … | … | … | … | 24.91 | 25.13 | 26.92 |
Dental hygienists | 148460 | 161140 | 173900 | 177520 | 200550 | 204990 | 194830 | 24.99 | 29.15 | 32.630000000000003 | 33.020000000000003 | 34.96 | 35.31 | 37.53 |
Ultimately, the component names in the data frame should be the following:
First notice how the information that we want to be the column name in most cases is spread between the column name and the first row, rename the columns so they have all the relevant information.
df <- rename(df, employment = "Employment\\1")
df <- rename(df, mean_hourly_wage = "Mean hourly wage (dollars)\\2")
temp <- ""
for (i in 1:length(df)){
if (str_starts(colnames(df)[i], "[a-zA-Z]")){
temp <- colnames(df)[i]
}
colnames(df)[i] <- paste(temp, df[[i]][1] ,sep = ',')
if (i == 1) {
colnames(df)[i] <- df[[i]][1]
}
}
df <- rename(df, occupation_title = "Occupation title")
kable(head(df)) |>
kable_styling("striped")
occupation_title | employment,2000 | employment,2005 | employment,2009 | employment,2010 | employment,2015 | employment,2016 | employment,2020 | mean_hourly_wage,2000 | mean_hourly_wage,2005 | mean_hourly_wage,2009 | mean_hourly_wage,2010 | mean_hourly_wage,2015 | mean_hourly_wage,2016 | mean_hourly_wage,2020 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Occupation title | 2000 | 2005 | 2009 | 2010 | 2015 | 2016 | 2020 | 2000 | 2005 | 2009 | 2010 | 2015 | 2016 | 2020.00 |
Health care practitioners and technical occupations | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Audiologists | 11530 | 10030 | 12590 | 12860 | 12070 | 12310 | 13300 | 22.92 | 27.72 | 32.14 | 33.58 | 37.22 | 38.119999999999997 | 42.90 |
Cardiovascular technologists and technicians | 40080 | 43560 | 48070 | 48720 | 51400 | 53760 | 55980 | 16.809999999999999 | 19.989999999999998 | 23.91 | 24.38 | 26.97 | 27.45 | 29.30 |
Clinical laboratory technologists and technicians | … | … | … | … | 320550 | 326920 | 326220 | … | … | … | … | 24.91 | 25.13 | 26.92 |
Dental hygienists | 148460 | 161140 | 173900 | 177520 | 200550 | 204990 | 194830 | 24.99 | 29.15 | 32.630000000000003 | 33.020000000000003 | 34.96 | 35.31 | 37.53 |
Cut out the first row which no longer has information needed as that information got appended to first row. Cut out the last three rows which contained notes.
df <- df[-1,]
df <- head(df, - 3)
There are two categories of occupations which is worth preserving, there is “Health care practitioners and technical occupations” and then there is “Health care support occupations”. These are good to preserve because analysis can be broken up by category. These two distinctions are called out as a row in the occupation column, and then they have no column associated with them. Create a column for occupation type and fill it in with the appropriate name.
occupation_category_list <- c()
temp <- ""
for (i in 1:dim(df)[1]){
if (str_detect(df$occupation_title[i], "occupations")){
temp <- df$occupation_title[i]
}
occupation_category_list <- append(occupation_category_list, temp)
}
df <- mutate(df, occupation_category = occupation_category_list)
Now remove those rows that just have the occupation category and no data.
cat <- unique(df$occupation_category)
rem <- c()
for (i in 1:dim(df)[1]){
if (df$occupation_title[i] %in% unique(df$occupation_category)){
rem <- append(rem, i)
}
}
df <- df[-(rem),]
Some of the data was read in as numbers and others read in as characters, change so that every number is represented as an integer.
for (i in 1:length(df)){
if (str_detect(colnames(df)[i], "occupation") == FALSE){
df[[i]] <- as.integer(df[[i]])
}
}
Now that the data has been cleaned up a bit, perform the necessary pivots to get the desired dataframe.
df_l <- df |>
pivot_longer(
cols = !(occupation_title | occupation_category),
names_to = c("emp_count_wage", "year"),
names_sep = ",",
values_to = "count_wage"
)
df_l$year <- as.integer(df_l$year)
df_w <- df_l |>
pivot_wider(names_from = emp_count_wage,
values_from = count_wage
)
Take a look at the new dataframe.
head(df_w)
## # A tibble: 6 × 5
## occupation_title occupation_category year emplo…¹ mean_…²
## <chr> <chr> <int> <int> <int>
## 1 Audiologists Health care practitioners and technica… 2000 11530 22
## 2 Audiologists Health care practitioners and technica… 2005 10030 27
## 3 Audiologists Health care practitioners and technica… 2009 12590 32
## 4 Audiologists Health care practitioners and technica… 2010 12860 33
## 5 Audiologists Health care practitioners and technica… 2015 12070 37
## 6 Audiologists Health care practitioners and technica… 2016 12310 38
## # … with abbreviated variable names ¹employment, ²mean_hourly_wage
As previously mentioned this analysis is to look at the trend in each profession and see if employment has been increasing or decreasing and compare this with the change in mean salary. Also, determine which profession has grown most throughout the last 20 years.
First, just get a general idea of the data, did employment go up or go down?
ggp <- df_w|>
group_by(occupation_title) |>
mutate(emp_norm = employment/max(employment)) |>
ggplot(aes(x = year, y = emp_norm, color=occupation_title)) +
geom_point() +
geom_smooth(method = lm, se = FALSE, formula = 'y ~ x') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
legend.position="none")
show(ggp)
Based on this plot, not all employment numbers increased. Check the 3 professions employment increased the most for and then the three professions it increased the least (or decreased) for.
df_n <- df_w|>
group_by(occupation_title) |>
mutate(emp_norm = employment/max(employment)) |>
filter(year == 2020,
is.na(emp_norm) == FALSE) |>
arrange(desc(emp_norm), desc(employment))
kable(head(df_n,3)) |>
kable_styling("striped")
occupation_title | occupation_category | year | employment | mean_hourly_wage | emp_norm |
---|---|---|---|---|---|
Registered nurses\4 | Health care practitioners and technical occupations | 2020 | 2986500 | 38 | 1 |
Medical assistants | Health care support occupations | 2020 | 710200 | 17 | 1 |
Pharmacy technicians | Health care practitioners and technical occupations | 2020 | 415310 | 17 | 1 |
kable(tail(df_n,3)) |>
kable_styling("striped")
occupation_title | occupation_category | year | employment | mean_hourly_wage | emp_norm |
---|---|---|---|---|---|
Pharmacy aides | Health care support occupations | 2020 | 38900 | 15 | 0.6495241 |
Occupational therapy aides | Health care support occupations | 2020 | 5630 | 16 | 0.6332958 |
Medical transcriptionists | Health care support occupations | 2020 | 49530 | 17 | 0.5088873 |
It looks like the professions that had the least relative growth in 2020 were all in the health care support occupations category.
Now, take a look at wages.
ggp <- df_w|>
group_by(occupation_title) |>
ggplot(aes(x = year, y = mean_hourly_wage, color=occupation_title)) +
geom_point() +
geom_smooth(method = lm, se = FALSE, formula = 'y ~ x') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
legend.position="none")
show(ggp)
It looks like across all professions, hourly wage grew. Check which 3 professions hourly wages grew the most and which 3 grew the least.
df_n <- df_w|>
group_by(occupation_title) |>
mutate(wage_norm = mean_hourly_wage/max(mean_hourly_wage),
wage_diff = 1 - min(wage_norm)) |>
filter(is.na(wage_diff) == FALSE,
year == 2020) |>
arrange(desc(wage_diff))
kable(head(df_n,3)) |>
kable_styling("striped")
occupation_title | occupation_category | year | employment | mean_hourly_wage | wage_norm | wage_diff |
---|---|---|---|---|---|---|
Audiologists | Health care practitioners and technical occupations | 2020 | 13300 | 42 | 1 | 0.4761905 |
Medical equipment preparers | Health care support occupations | 2020 | 56870 | 19 | 1 | 0.4736842 |
Physician assistants | Health care practitioners and technical occupations | 2020 | 125280 | 55 | 1 | 0.4727273 |
kable(tail(df_n,3)) |>
kable_styling("striped")
occupation_title | occupation_category | year | employment | mean_hourly_wage | wage_norm | wage_diff |
---|---|---|---|---|---|---|
Occupational therapy aides | Health care support occupations | 2020 | 5630 | 16 | 1 | 0.3125000 |
Medical transcriptionists | Health care support occupations | 2020 | 49530 | 17 | 1 | 0.2941176 |
Physical therapist aides | Health care support occupations | 2020 | 45790 | 14 | 1 | 0.2857143 |
The occupation category that wages grew the least for was Health care support occupations, so it is no surprise that that category sees professionals leaving - both Occupational therapy aids and Medical transcriptions are listed in the bottom 3 pay increase and have high percentages of their workforce leaving.
In conclusion, the goal was to be able to study the CDC data on employment numbers and wages for health care workers. This objective was met.