library(tidyverse)
library(reshape2)

Read in CSV

I’ll start by reading in the file and pulling up a quick preview of the top and bottom of the data frame. I know from previewing the .csv that the top four rows contain extaneous info, so I’ll skip them when reading in.

employment <- read_csv('https://github.com/kac624/cuny/raw/main/D607/data/healthcare_empl.csv',
                 skip = 4)
## New names:
## Rows: 47 Columns: 15
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (13): Occupation title, 2000...2, 2005...3, 2009...4, 2010...5, 2015...6... dbl
## (1): 2020...15 num (1): 2020...8
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `2000` -> `2000...2`
## • `2005` -> `2005...3`
## • `2009` -> `2009...4`
## • `2010` -> `2010...5`
## • `2015` -> `2015...6`
## • `2016` -> `2016...7`
## • `2020` -> `2020...8`
## • `2000` -> `2000...9`
## • `2005` -> `2005...10`
## • `2009` -> `2009...11`
## • `2010` -> `2010...12`
## • `2015` -> `2015...13`
## • `2016` -> `2016...14`
## • `2020` -> `2020...15`
head(employment)
## # A tibble: 6 × 15
##   Occupation t…¹ 2000.…² 2005.…³ 2009.…⁴ 2010.…⁵ 2015.…⁶ 2016.…⁷ 2020.…⁸ 2000.…⁹
##   <chr>          <chr>   <chr>   <chr>   <chr>   <chr>   <chr>     <dbl> <chr>  
## 1 Health care p… <NA>    <NA>    <NA>    <NA>    <NA>    <NA>         NA <NA>   
## 2 Audiologists   11,530  10,030  12,590  12,860  12,070  12,310    13300 22.92  
## 3 Cardiovascula… 40,080  43,560  48,070  48,720  51,400  53,760    55980 16.81  
## 4 Clinical labo… …       …       …       …       320,550 326,920  326220 …      
## 5 Dental hygien… 148,460 161,140 173,900 177,520 200,550 204,990  194830 24.99  
## 6 Diagnostic me… 31,760  43,590  51,630  53,010  61,250  65,790    73920 22.03  
## # … with 6 more variables: `2005...10` <chr>, `2009...11` <chr>,
## #   `2010...12` <chr>, `2015...13` <chr>, `2016...14` <chr>, `2020...15` <dbl>,
## #   and abbreviated variable names ¹​`Occupation title`, ²​`2000...2`,
## #   ³​`2005...3`, ⁴​`2009...4`, ⁵​`2010...5`, ⁶​`2015...6`, ⁷​`2016...7`,
## #   ⁸​`2020...8`, ⁹​`2000...9`
tail(employment)
## # A tibble: 6 × 15
##   Occupation t…¹ 2000.…² 2005.…³ 2009.…⁴ 2010.…⁵ 2015.…⁶ 2016.…⁷ 2020.…⁸ 2000.…⁹
##   <chr>          <chr>   <chr>   <chr>   <chr>   <chr>   <chr>     <dbl> <chr>  
## 1 "Physical the… 34,620  41,930  44,160  45,900  50,540  50,030    45790 10.06  
## 2 "Physical the… 44,120  58,670  63,750  65,960  81,230  85,580    92740 16.52  
## 3 "Psychiatric … 57,680  56,150  62,610  64,730  69,550  67,410    51550 10.79  
## 4 "… Category n… <NA>    <NA>    <NA>    <NA>    <NA>    <NA>         NA <NA>   
## 5 "NOTES: This … <NA>    <NA>    <NA>    <NA>    <NA>    <NA>         NA <NA>   
## 6 "SOURCE: U.S.… <NA>    <NA>    <NA>    <NA>    <NA>    <NA>         NA <NA>   
## # … with 6 more variables: `2005...10` <chr>, `2009...11` <chr>,
## #   `2010...12` <chr>, `2015...13` <chr>, `2016...14` <chr>, `2020...15` <dbl>,
## #   and abbreviated variable names ¹​`Occupation title`, ²​`2000...2`,
## #   ³​`2005...3`, ⁴​`2009...4`, ⁵​`2010...5`, ⁶​`2015...6`, ⁷​`2016...7`,
## #   ⁸​`2020...8`, ⁹​`2000...9`

Problems appear at both the top and the bottom. The columns names are difficult to understand, so I’ll rename them. The first set of years detail the number of filled positions for each occupation in each year, whereas the second set of years shows the mean hourly wage. We’ll clean these a bit more later, but for now, I’ll rename the columns to show an e (for employees) and a w (for wages) to distinguish the two sets of variables.

colnames(employment)
##  [1] "Occupation title" "2000...2"         "2005...3"         "2009...4"        
##  [5] "2010...5"         "2015...6"         "2016...7"         "2020...8"        
##  [9] "2000...9"         "2005...10"        "2009...11"        "2010...12"       
## [13] "2015...13"        "2016...14"        "2020...15"
colnames(employment) <- c('occupation',
                          'e2000', 'e2005', 'e2009', 'e2010', 'e2015', 'e2016', 'e2020',
                          'w2000', 'w2005', 'w2009', 'w2010', 'w2015', 'w2016', 'w2020')

The bottom of the dataframe contains some extreanous information, as well, so we can remove the bottom three rows.

employment <- employment[1:(nrow(employment) - 3),]

tail(employment)
## # A tibble: 6 × 15
##   occupation   e2000 e2005 e2009 e2010 e2015 e2016 e2020 w2000 w2005 w2009 w2010
##   <chr>        <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 "Occupation… 15,9… 22,1… 26,6… 27,7… 35,4… 38,1… 42750 16.76 19.13 24.44 24.66
## 2 "Orderlies\… …     …     …     …     52,6… 52,9… 43570 …     …     …     …    
## 3 "Pharmacy a… 59,8… 46,6… 52,2… 49,5… 38,0… 36,6… 38900 9.10  9.76  10.74 10.98
## 4 "Physical t… 34,6… 41,9… 44,1… 45,9… 50,5… 50,0… 45790 10.06 11.01 12.01 12.02
## 5 "Physical t… 44,1… 58,6… 63,7… 65,9… 81,2… 85,5… 92740 16.52 18.98 23.36 23.95
## 6 "Psychiatri… 57,6… 56,1… 62,6… 64,7… 69,5… 67,4… 51550 10.79 11.47 13.19 12.84
## # … with 3 more variables: w2015 <chr>, w2016 <chr>, w2020 <dbl>

There appear to be two columns that contain all NAs. They serve as categories for the occupations, but they serve no purpose in our data. So, I remove them.

employment %>%
  filter(is.na(e2000))
## # A tibble: 2 × 15
##   occupation   e2000 e2005 e2009 e2010 e2015 e2016 e2020 w2000 w2005 w2009 w2010
##   <chr>        <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 Health care… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     NA <NA>  <NA>  <NA>  <NA> 
## 2 Health care… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     NA <NA>  <NA>  <NA>  <NA> 
## # … with 3 more variables: w2015 <chr>, w2016 <chr>, w2020 <dbl>
employment <- employment %>%
  filter(!is.na(e2000))

It appears that missing values are denoted by an ellipses (“…”), so we must find and replace those values with NAs. This will support conversion of the values from characters to strings later on.

employment %>%
  filter(e2000 == '…')
## # A tibble: 8 × 15
##   occupation  e2000 e2005 e2009 e2010 e2015 e2016  e2020 w2000 w2005 w2009 w2010
##   <chr>       <chr> <chr> <chr> <chr> <chr> <chr>  <dbl> <chr> <chr> <chr> <chr>
## 1 "Clinical … …     …     …     …     320,… 326,… 3.26e5 …     …     …     …    
## 2 "Magnetic … …     …     …     …     33,4… 35,8… 3.93e4 …     …     …     …    
## 3 "Medical d… …     …     …     …     …     …     3.18e5 …     …     …     …    
## 4 "Nurse ane… …     …     …     …     39,4… 39,8… 4.20e4 …     …     …     …    
## 5 "Nurse mid… …     …     …     …     7,430 6,270 7.12e3 …     …     …     …    
## 6 "Nurse pra… …     …     …     …     136,… 150,… 2.11e5 …     …     …     …    
## 7 "Home heal… …     …     …     …     …     …     3.21e6 …     …     …     …    
## 8 "Orderlies… …     …     …     …     52,6… 52,9… 4.36e4 …     …     …     …    
## # … with 3 more variables: w2015 <chr>, w2016 <chr>, w2020 <dbl>
employment[employment == '…'] <- NA

Next, we’ll convert the dataset from a wide to a long format. We want to remove years as columns and convert it to a variable, but we have two sets of years in the data. So, we’ll perform to separate conversions and then join them in a final data frame.

positions <- employment %>%
  melt(measure.vars = 2:8, variable.name = 'year', value.name = 'positions') %>%
  mutate(year = str_replace(year,'e','')) %>%
  select(occupation, year, positions)

wages <- employment %>%
  melt(measure.vars = 9:15, variable.name = 'year', value.name = 'wage') %>%
  mutate(year = str_replace(year,'w','')) %>%
  select(occupation, year, wage)

employment_long <- left_join(positions, wages, by = c('occupation', 'year'))

head(employment_long,10)
##                                           occupation year positions  wage
## 1                                       Audiologists 2000    11,530 22.92
## 2       Cardiovascular technologists and technicians 2000    40,080 16.81
## 3  Clinical laboratory technologists and technicians 2000      <NA>  <NA>
## 4                                  Dental hygienists 2000   148,460 24.99
## 5                    Diagnostic medical sonographers 2000    31,760 22.03
## 6                               Dietetic technicians 2000    28,010 10.98
## 7                       Dietitians and nutritionists 2000    43,030 18.76
## 8       Emergency medical technicians and paramedics 2000   165,530 11.89
## 9  Licensed practical and licensed vocational nurses 2000   679,470 14.65
## 10       Magnetic resonance imaging technologists\\3 2000      <NA>  <NA>

Now that we have a clean, long format, we can convert our numeric variables from string to double format. First, however, we must remove the commas from the position count values.

employment_long$positions <- str_replace_all(employment_long$positions,',','')

employment_long[2:4] <- lapply(employment_long[2:4], function(x) as.numeric(as.character(x)))

Analysis

To analyze the data, we’ll construct a number of plots to identify trends. If we aggregate across occupations, we see a clear upward trend both in the total number of filled positions and in the mean wage.

employment_long %>%
  filter(!is.na(positions)) %>%
  group_by(year) %>%
  summarize(total_positions = sum(positions)) %>%
  ggplot(aes(x = year, y = total_positions)) +
  geom_line(linewidth = 1)

employment_long %>%
  filter(!is.na(positions)) %>%
  group_by(year) %>%
  summarize(mean_wage = mean(wage)) %>%
  ggplot(aes(x = year, y = mean_wage)) +
  geom_line(linewidth = 1)

What about trends in individual occupations? In terms of filled positions, it seems that, while the aggregate figure is going up, various trends emerge within individual occupations. Wages, however, show a much clearer upward trend across all occupations. Finally, we can try and subset a group of related occupations. For example, if we look at positions related to physical therapy, it becomes clear that this particular field is growing.

employment_long %>%
  filter(!is.na(positions)) %>%
  ggplot(aes(x = year, color = occupation)) +
  geom_line(aes(y = positions), show.legend = FALSE) +
  scale_y_log10()

employment_long %>%
  filter(!is.na(positions)) %>%
  ggplot(aes(x = year, color = occupation)) +
  geom_line(aes(y = wage), show.legend = FALSE)

employment_long %>%
  filter(!is.na(positions),
         str_detect(occupation, 'Phys')) %>%
  ggplot(aes(x = year, color = occupation)) +
  geom_line(aes(y = positions), linewidth = 1)

Finally, we’ll create a plot to observe the relationship between volume and employment and wages. Because of the large disparity in volumes across occupations, however, the first plot is quite difficult to read. So, we create a second plot and transform both variables to log space. While we’re now able to distinguish across points, there is no apparent trend. I suspected a downward trend might emerge, where, positions with fewer filled roles have higher wages (because of increased demand and/or specialization). However, no such trend emerges. We do see, however that wage observations tend to be higher for later years, which matches the observations above regarding the upward trend in wage growth.

employment_long %>%
  filter(!is.na(positions)) %>%
  ggplot(aes(x = wage, y = positions, color = year)) +
  geom_point()

employment_long %>%
  filter(!is.na(positions)) %>%
  ggplot(aes(x = wage, y = positions, color = year)) +
  geom_point() + 
  scale_x_log10() +
  scale_y_log10()