library(tidyverse)
library(reshape2)
I’ll start by reading in the file and pulling up a quick preview of the top and bottom of the data frame. I know from previewing the .csv that the top four rows contain extaneous info, so I’ll skip them when reading in.
employment <- read_csv('https://github.com/kac624/cuny/raw/main/D607/data/healthcare_empl.csv',
skip = 4)
## New names:
## Rows: 47 Columns: 15
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (13): Occupation title, 2000...2, 2005...3, 2009...4, 2010...5, 2015...6... dbl
## (1): 2020...15 num (1): 2020...8
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `2000` -> `2000...2`
## • `2005` -> `2005...3`
## • `2009` -> `2009...4`
## • `2010` -> `2010...5`
## • `2015` -> `2015...6`
## • `2016` -> `2016...7`
## • `2020` -> `2020...8`
## • `2000` -> `2000...9`
## • `2005` -> `2005...10`
## • `2009` -> `2009...11`
## • `2010` -> `2010...12`
## • `2015` -> `2015...13`
## • `2016` -> `2016...14`
## • `2020` -> `2020...15`
head(employment)
## # A tibble: 6 × 15
## Occupation t…¹ 2000.…² 2005.…³ 2009.…⁴ 2010.…⁵ 2015.…⁶ 2016.…⁷ 2020.…⁸ 2000.…⁹
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 Health care p… <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
## 2 Audiologists 11,530 10,030 12,590 12,860 12,070 12,310 13300 22.92
## 3 Cardiovascula… 40,080 43,560 48,070 48,720 51,400 53,760 55980 16.81
## 4 Clinical labo… … … … … 320,550 326,920 326220 …
## 5 Dental hygien… 148,460 161,140 173,900 177,520 200,550 204,990 194830 24.99
## 6 Diagnostic me… 31,760 43,590 51,630 53,010 61,250 65,790 73920 22.03
## # … with 6 more variables: `2005...10` <chr>, `2009...11` <chr>,
## # `2010...12` <chr>, `2015...13` <chr>, `2016...14` <chr>, `2020...15` <dbl>,
## # and abbreviated variable names ¹`Occupation title`, ²`2000...2`,
## # ³`2005...3`, ⁴`2009...4`, ⁵`2010...5`, ⁶`2015...6`, ⁷`2016...7`,
## # ⁸`2020...8`, ⁹`2000...9`
tail(employment)
## # A tibble: 6 × 15
## Occupation t…¹ 2000.…² 2005.…³ 2009.…⁴ 2010.…⁵ 2015.…⁶ 2016.…⁷ 2020.…⁸ 2000.…⁹
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 "Physical the… 34,620 41,930 44,160 45,900 50,540 50,030 45790 10.06
## 2 "Physical the… 44,120 58,670 63,750 65,960 81,230 85,580 92740 16.52
## 3 "Psychiatric … 57,680 56,150 62,610 64,730 69,550 67,410 51550 10.79
## 4 "… Category n… <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
## 5 "NOTES: This … <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
## 6 "SOURCE: U.S.… <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
## # … with 6 more variables: `2005...10` <chr>, `2009...11` <chr>,
## # `2010...12` <chr>, `2015...13` <chr>, `2016...14` <chr>, `2020...15` <dbl>,
## # and abbreviated variable names ¹`Occupation title`, ²`2000...2`,
## # ³`2005...3`, ⁴`2009...4`, ⁵`2010...5`, ⁶`2015...6`, ⁷`2016...7`,
## # ⁸`2020...8`, ⁹`2000...9`
Problems appear at both the top and the bottom. The columns names are difficult to understand, so I’ll rename them. The first set of years detail the number of filled positions for each occupation in each year, whereas the second set of years shows the mean hourly wage. We’ll clean these a bit more later, but for now, I’ll rename the columns to show an e (for employees) and a w (for wages) to distinguish the two sets of variables.
colnames(employment)
## [1] "Occupation title" "2000...2" "2005...3" "2009...4"
## [5] "2010...5" "2015...6" "2016...7" "2020...8"
## [9] "2000...9" "2005...10" "2009...11" "2010...12"
## [13] "2015...13" "2016...14" "2020...15"
colnames(employment) <- c('occupation',
'e2000', 'e2005', 'e2009', 'e2010', 'e2015', 'e2016', 'e2020',
'w2000', 'w2005', 'w2009', 'w2010', 'w2015', 'w2016', 'w2020')
The bottom of the dataframe contains some extreanous information, as well, so we can remove the bottom three rows.
employment <- employment[1:(nrow(employment) - 3),]
tail(employment)
## # A tibble: 6 × 15
## occupation e2000 e2005 e2009 e2010 e2015 e2016 e2020 w2000 w2005 w2009 w2010
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 "Occupation… 15,9… 22,1… 26,6… 27,7… 35,4… 38,1… 42750 16.76 19.13 24.44 24.66
## 2 "Orderlies\… … … … … 52,6… 52,9… 43570 … … … …
## 3 "Pharmacy a… 59,8… 46,6… 52,2… 49,5… 38,0… 36,6… 38900 9.10 9.76 10.74 10.98
## 4 "Physical t… 34,6… 41,9… 44,1… 45,9… 50,5… 50,0… 45790 10.06 11.01 12.01 12.02
## 5 "Physical t… 44,1… 58,6… 63,7… 65,9… 81,2… 85,5… 92740 16.52 18.98 23.36 23.95
## 6 "Psychiatri… 57,6… 56,1… 62,6… 64,7… 69,5… 67,4… 51550 10.79 11.47 13.19 12.84
## # … with 3 more variables: w2015 <chr>, w2016 <chr>, w2020 <dbl>
There appear to be two columns that contain all NAs. They serve as categories for the occupations, but they serve no purpose in our data. So, I remove them.
employment %>%
filter(is.na(e2000))
## # A tibble: 2 × 15
## occupation e2000 e2005 e2009 e2010 e2015 e2016 e2020 w2000 w2005 w2009 w2010
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 Health care… <NA> <NA> <NA> <NA> <NA> <NA> NA <NA> <NA> <NA> <NA>
## 2 Health care… <NA> <NA> <NA> <NA> <NA> <NA> NA <NA> <NA> <NA> <NA>
## # … with 3 more variables: w2015 <chr>, w2016 <chr>, w2020 <dbl>
employment <- employment %>%
filter(!is.na(e2000))
It appears that missing values are denoted by an ellipses (“…”), so we must find and replace those values with NAs. This will support conversion of the values from characters to strings later on.
employment %>%
filter(e2000 == '…')
## # A tibble: 8 × 15
## occupation e2000 e2005 e2009 e2010 e2015 e2016 e2020 w2000 w2005 w2009 w2010
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 "Clinical … … … … … 320,… 326,… 3.26e5 … … … …
## 2 "Magnetic … … … … … 33,4… 35,8… 3.93e4 … … … …
## 3 "Medical d… … … … … … … 3.18e5 … … … …
## 4 "Nurse ane… … … … … 39,4… 39,8… 4.20e4 … … … …
## 5 "Nurse mid… … … … … 7,430 6,270 7.12e3 … … … …
## 6 "Nurse pra… … … … … 136,… 150,… 2.11e5 … … … …
## 7 "Home heal… … … … … … … 3.21e6 … … … …
## 8 "Orderlies… … … … … 52,6… 52,9… 4.36e4 … … … …
## # … with 3 more variables: w2015 <chr>, w2016 <chr>, w2020 <dbl>
employment[employment == '…'] <- NA
Next, we’ll convert the dataset from a wide to a long format. We want to remove years as columns and convert it to a variable, but we have two sets of years in the data. So, we’ll perform to separate conversions and then join them in a final data frame.
positions <- employment %>%
melt(measure.vars = 2:8, variable.name = 'year', value.name = 'positions') %>%
mutate(year = str_replace(year,'e','')) %>%
select(occupation, year, positions)
wages <- employment %>%
melt(measure.vars = 9:15, variable.name = 'year', value.name = 'wage') %>%
mutate(year = str_replace(year,'w','')) %>%
select(occupation, year, wage)
employment_long <- left_join(positions, wages, by = c('occupation', 'year'))
head(employment_long,10)
## occupation year positions wage
## 1 Audiologists 2000 11,530 22.92
## 2 Cardiovascular technologists and technicians 2000 40,080 16.81
## 3 Clinical laboratory technologists and technicians 2000 <NA> <NA>
## 4 Dental hygienists 2000 148,460 24.99
## 5 Diagnostic medical sonographers 2000 31,760 22.03
## 6 Dietetic technicians 2000 28,010 10.98
## 7 Dietitians and nutritionists 2000 43,030 18.76
## 8 Emergency medical technicians and paramedics 2000 165,530 11.89
## 9 Licensed practical and licensed vocational nurses 2000 679,470 14.65
## 10 Magnetic resonance imaging technologists\\3 2000 <NA> <NA>
Now that we have a clean, long format, we can convert our numeric variables from string to double format. First, however, we must remove the commas from the position count values.
employment_long$positions <- str_replace_all(employment_long$positions,',','')
employment_long[2:4] <- lapply(employment_long[2:4], function(x) as.numeric(as.character(x)))
To analyze the data, we’ll construct a number of plots to identify trends. If we aggregate across occupations, we see a clear upward trend both in the total number of filled positions and in the mean wage.
employment_long %>%
filter(!is.na(positions)) %>%
group_by(year) %>%
summarize(total_positions = sum(positions)) %>%
ggplot(aes(x = year, y = total_positions)) +
geom_line(linewidth = 1)
employment_long %>%
filter(!is.na(positions)) %>%
group_by(year) %>%
summarize(mean_wage = mean(wage)) %>%
ggplot(aes(x = year, y = mean_wage)) +
geom_line(linewidth = 1)
What about trends in individual occupations? In terms of filled positions, it seems that, while the aggregate figure is going up, various trends emerge within individual occupations. Wages, however, show a much clearer upward trend across all occupations. Finally, we can try and subset a group of related occupations. For example, if we look at positions related to physical therapy, it becomes clear that this particular field is growing.
employment_long %>%
filter(!is.na(positions)) %>%
ggplot(aes(x = year, color = occupation)) +
geom_line(aes(y = positions), show.legend = FALSE) +
scale_y_log10()
employment_long %>%
filter(!is.na(positions)) %>%
ggplot(aes(x = year, color = occupation)) +
geom_line(aes(y = wage), show.legend = FALSE)
employment_long %>%
filter(!is.na(positions),
str_detect(occupation, 'Phys')) %>%
ggplot(aes(x = year, color = occupation)) +
geom_line(aes(y = positions), linewidth = 1)
Finally, we’ll create a plot to observe the relationship between volume and employment and wages. Because of the large disparity in volumes across occupations, however, the first plot is quite difficult to read. So, we create a second plot and transform both variables to log space. While we’re now able to distinguish across points, there is no apparent trend. I suspected a downward trend might emerge, where, positions with fewer filled roles have higher wages (because of increased demand and/or specialization). However, no such trend emerges. We do see, however that wage observations tend to be higher for later years, which matches the observations above regarding the upward trend in wage growth.
employment_long %>%
filter(!is.na(positions)) %>%
ggplot(aes(x = wage, y = positions, color = year)) +
geom_point()
employment_long %>%
filter(!is.na(positions)) %>%
ggplot(aes(x = wage, y = positions, color = year)) +
geom_point() +
scale_x_log10() +
scale_y_log10()