Project 2 - Health Care Data

Introduction

The objective of Project 2 is to get more practice tidying and transforming data. The CDC publishes data regarding employment numbers and mean salaries of healthcare workers for selected years.

The data is available at: https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/Health_US/hus20-21tables/hcempl.xlsx

For analysis, the objective is to look at the trend in each profession and see if employment has been increasing or decreasing and compare this with the change in mean salary. Also, determine which profession has grown most throughout the last 20 years.

Processing

First, load required packages.

library(tidyverse)
library(kableExtra)
library(rio)
library(httr)

Load the data and take a look at it.

# Learned how to read in excel file from github from:
# https://community.rstudio.com/t/read-xlsx-from-github/9386/7

github_link <- "https://github.com/klgriffen96/spring23_data607_proj2/blob/main/hcempl.xlsx?raw=true"
temp_file <- tempfile(fileext = ".xlsx")
req <- GET(github_link, 
          # write result to disk
           write_disk(path = temp_file))
df <- import(temp_file, skip=3)
kable(head(df))  |>
  kable_styling("striped")

…1	Employment\1	…3	…4	…5	…6	…7	…8	Mean hourly wage (dollars)\2	…10	…11	…12	…13	…14	…15
Occupation title	2000	2005	2009	2010	2015	2016	2020	2000	2005	2009	2010	2015	2016	2020.00
Health care practitioners and technical occupations	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Audiologists	11530	10030	12590	12860	12070	12310	13300	22.92	27.72	32.14	33.58	37.22	38.119999999999997	42.90
Cardiovascular technologists and technicians	40080	43560	48070	48720	51400	53760	55980	16.809999999999999	19.989999999999998	23.91	24.38	26.97	27.45	29.30
Clinical laboratory technologists and technicians	…	…	…	…	320550	326920	326220	…	…	…	…	24.91	25.13	26.92
Dental hygienists	148460	161140	173900	177520	200550	204990	194830	24.99	29.15	32.630000000000003	33.020000000000003	34.96	35.31	37.53

Ultimately, the component names in the data frame should be the following:

occupation
occupation_category
year
people_employed
mean_hourly_wage

First notice how the information that we want to be the column name in most cases is spread between the column name and the first row, rename the columns so they have all the relevant information.

df <- rename(df, employment = "Employment\\1")
df <- rename(df, mean_hourly_wage = "Mean hourly wage (dollars)\\2")
temp <- ""
for (i in 1:length(df)){
  if (str_starts(colnames(df)[i], "[a-zA-Z]")){
    temp <- colnames(df)[i]
  }
  colnames(df)[i] <- paste(temp, df[[i]][1] ,sep = ',')
  if (i == 1) {
    colnames(df)[i] <- df[[i]][1]
  }
}
df <- rename(df, occupation_title = "Occupation title")
kable(head(df))  |>
  kable_styling("striped")

occupation_title	employment,2000	employment,2005	employment,2009	employment,2010	employment,2015	employment,2016	employment,2020	mean_hourly_wage,2000	mean_hourly_wage,2005	mean_hourly_wage,2009	mean_hourly_wage,2010	mean_hourly_wage,2015	mean_hourly_wage,2016	mean_hourly_wage,2020
Occupation title	2000	2005	2009	2010	2015	2016	2020	2000	2005	2009	2010	2015	2016	2020.00
Health care practitioners and technical occupations	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Audiologists	11530	10030	12590	12860	12070	12310	13300	22.92	27.72	32.14	33.58	37.22	38.119999999999997	42.90
Cardiovascular technologists and technicians	40080	43560	48070	48720	51400	53760	55980	16.809999999999999	19.989999999999998	23.91	24.38	26.97	27.45	29.30
Clinical laboratory technologists and technicians	…	…	…	…	320550	326920	326220	…	…	…	…	24.91	25.13	26.92
Dental hygienists	148460	161140	173900	177520	200550	204990	194830	24.99	29.15	32.630000000000003	33.020000000000003	34.96	35.31	37.53

Cut out the first row which no longer has information needed as that information got appended to first row. Cut out the last three rows which contained notes.

df <- df[-1,]
df <- head(df, - 3)

There are two categories of occupations which is worth preserving, there is “Health care practitioners and technical occupations” and then there is “Health care support occupations”. These are good to preserve because analysis can be broken up by category. These two distinctions are called out as a row in the occupation column, and then they have no column associated with them. Create a column for occupation type and fill it in with the appropriate name.

occupation_category_list <- c()
temp <- ""
for (i in 1:dim(df)[1]){
  if (str_detect(df$occupation_title[i], "occupations")){
    temp <- df$occupation_title[i]
  }
  occupation_category_list <- append(occupation_category_list, temp)
}
df <- mutate(df, occupation_category = occupation_category_list)

Now remove those rows that just have the occupation category and no data.

cat <- unique(df$occupation_category)
rem <- c()
for (i in 1:dim(df)[1]){
  if (df$occupation_title[i] %in% unique(df$occupation_category)){
    rem <- append(rem, i)
  }
}
df <- df[-(rem),]

Some of the data was read in as numbers and others read in as characters, change so that every number is represented as an integer.

for (i in 1:length(df)){
  if (str_detect(colnames(df)[i], "occupation") == FALSE){
    df[[i]] <- as.integer(df[[i]])
  }
}

Now that the data has been cleaned up a bit, perform the necessary pivots to get the desired dataframe.

df_l <- df |> 
  pivot_longer(
    cols = !(occupation_title | occupation_category),
    names_to = c("emp_count_wage", "year"), 
    names_sep = ",",
    values_to = "count_wage"
  )

df_l$year <- as.integer(df_l$year)

df_w <- df_l |>
  pivot_wider(names_from = emp_count_wage,
              values_from = count_wage
  )

Take a look at the new dataframe.

head(df_w)

## # A tibble: 6 × 5
##   occupation_title occupation_category                      year emplo…¹ mean_…²
##   <chr>            <chr>                                   <int>   <int>   <int>
## 1 Audiologists     Health care practitioners and technica…  2000   11530      22
## 2 Audiologists     Health care practitioners and technica…  2005   10030      27
## 3 Audiologists     Health care practitioners and technica…  2009   12590      32
## 4 Audiologists     Health care practitioners and technica…  2010   12860      33
## 5 Audiologists     Health care practitioners and technica…  2015   12070      37
## 6 Audiologists     Health care practitioners and technica…  2016   12310      38
## # … with abbreviated variable names ¹employment, ²mean_hourly_wage

Analysis

As previously mentioned this analysis is to look at the trend in each profession and see if employment has been increasing or decreasing and compare this with the change in mean salary. Also, determine which profession has grown most throughout the last 20 years.

First, just get a general idea of the data, did employment go up or go down?

ggp <- df_w|>
  group_by(occupation_title) |>
  mutate(emp_norm = employment/max(employment)) |>
  ggplot(aes(x = year, y = emp_norm, color=occupation_title)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE, formula = 'y ~ x') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), 
                                   legend.position="none")

show(ggp)

Based on this plot, not all employment numbers increased. Check the 3 professions employment increased the most for and then the three professions it increased the least (or decreased) for.

df_n <- df_w|>
  group_by(occupation_title) |>
  mutate(emp_norm = employment/max(employment)) |>
  filter(year == 2020,
         is.na(emp_norm) == FALSE) |>
  arrange(desc(emp_norm), desc(employment))

kable(head(df_n,3))  |>
  kable_styling("striped")

occupation_title	occupation_category	year	employment	mean_hourly_wage	emp_norm
Registered nurses\4	Health care practitioners and technical occupations	2020	2986500	38	1
Medical assistants	Health care support occupations	2020	710200	17	1
Pharmacy technicians	Health care practitioners and technical occupations	2020	415310	17	1

kable(tail(df_n,3))  |>
  kable_styling("striped")

occupation_title	occupation_category	year	employment	mean_hourly_wage	emp_norm
Pharmacy aides	Health care support occupations	2020	38900	15	0.6495241
Occupational therapy aides	Health care support occupations	2020	5630	16	0.6332958
Medical transcriptionists	Health care support occupations	2020	49530	17	0.5088873

It looks like the professions that had the least relative growth in 2020 were all in the health care support occupations category.

Now, take a look at wages.

ggp <- df_w|>
  group_by(occupation_title) |>
  ggplot(aes(x = year, y = mean_hourly_wage, color=occupation_title)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE, formula = 'y ~ x') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), 
                                   legend.position="none")

show(ggp)

It looks like across all professions, hourly wage grew. Check which 3 professions hourly wages grew the most and which 3 grew the least.

df_n <- df_w|>
  group_by(occupation_title) |>
  mutate(wage_norm = mean_hourly_wage/max(mean_hourly_wage),
         wage_diff = 1 - min(wage_norm)) |>
  filter(is.na(wage_diff) == FALSE,
         year == 2020) |>
  arrange(desc(wage_diff))

kable(head(df_n,3))  |>
  kable_styling("striped")

occupation_title	occupation_category	year	employment	mean_hourly_wage	wage_norm	wage_diff
Audiologists	Health care practitioners and technical occupations	2020	13300	42	1	0.4761905
Medical equipment preparers	Health care support occupations	2020	56870	19	1	0.4736842
Physician assistants	Health care practitioners and technical occupations	2020	125280	55	1	0.4727273

kable(tail(df_n,3))  |>
  kable_styling("striped")

occupation_title	occupation_category	year	employment	mean_hourly_wage	wage_norm	wage_diff
Occupational therapy aides	Health care support occupations	2020	5630	16	1	0.3125000
Medical transcriptionists	Health care support occupations	2020	49530	17	1	0.2941176
Physical therapist aides	Health care support occupations	2020	45790	14	1	0.2857143

The occupation category that wages grew the least for was Health care support occupations, so it is no surprise that that category sees professionals leaving - both Occupational therapy aids and Medical transcriptions are listed in the bottom 3 pay increase and have high percentages of their workforce leaving.

Conclusion

In conclusion, the goal was to be able to study the CDC data on employment numbers and wages for health care workers. This objective was met.

Project 2 - Health Care Data

Kayleah Griffen

3/1/2023

Introduction

Processing

Analysis

Conclusion