In this project the goal is to practice data transformation applied to untidy data. For this purpose, I chose a classmate’s untidy dataset from our discussion board. Using such dataset, the following items will be addressed:
The dataset is Bureau of Labor Statistics data on occupations and industries for the US for a given year (there is 2015 to 2020 in separate files, but year 2020 was chosen). The source file can be found at: https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-02-23/bls-2020.xlsx
The file is wide - industries run down the left and occupations run across the top. The industries on the left are repeated 6 times for Male, Female, Black, White, Asian, and total. Some of the rows are there to sum up other rows.
Much you could do here - take a year and identify industries and occupations by gender - any surprises? What you expected? Do occupations by gender stay relatively consistent whatever industry you are in? Or does industry matter? If you are ambitious you could read in several years and do a trend analysis.
# Packages to be used
library(tidyverse)
library(readxl)Read the Bureau of Labor Statistics data
Let’s read and display the raw data
# Read data from Excel file
xl_file_path <- "C:\\tmp\\bls-2020.xlsx"
# Read entire data set
bls_raw_data <- readxl::read_excel(xl_file_path,
sheet = "cpsaat17",
)
# Display some of the raw data
head(bls_raw_data, 20)## # A tibble: 20 x 13
## `HOUSEHOLD DATA\~ ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ...11
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 [In thousands] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 Industry, sex, a~ 2020 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> Tota~ Mana~ <NA> "Ser~ <NA> "Sal~ <NA> "Nat~ <NA> <NA>
## 5 <NA> <NA> Mana~ Prof~ "Pro~ Serv~ "Sal~ "Off~ "Far~ "Con~ "Ins~
## 6 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 7 TOTAL <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 8 Agriculture and ~ 2349 961 58 "13" 94 "12" "96" "931" "10" "33"
## 9 Mining, quarryin~ 684 146 104 "1" 6 "11" "38" "0" "208" "39"
## 10 Construction 10786 2210 337 "14" 46 "145" "545" "6" "640~ "618"
## 11 Manufacturing 14550 2845 2247 "34" 176 "631" "114~ "17" "301" "653"
## 12 Durable goods 9129 1767 1620 "17" 79 "344" "697" "7" "233" "414"
## 13 Nondurable goods 5421 1078 627 "17" 97 "288" "451" "11" "68" "239"
## 14 Wholesale and re~ 18989 1607 1214 "68" 611 "876~ "197~ "45" "129" "656"
## 15 Wholesale trade 3380 519 172 "10" 39 "116~ "445" "28" "24" "144"
## 16 Retail trade 15609 1088 1041 "58" 573 "760~ "152~ "17" "105" "513"
## 17 Transportation a~ 8552 1012 402 "47" 100 "150" "184~ "3" "210" "515"
## 18 Information 2594 651 1095 "6" 38 "200" "329" "0" "12" "219"
## 19 Financial activi~ 10646 4729 1081 "56" 222 "228~ "193~ "1" "48" "164"
## 20 Professional and~ 18816 5064 6601 "540" 2757 "583" "200~ "12" "153" "286"
## # ... with 2 more variables: ...12 <chr>, ...13 <chr>
Because we are interested in analyzing the data at the sex/gender level, as we read the data from the file, we can perform an initial in flight cleanup as follows:
# The column names in the Excel file are too long and are multi-lined,
# Because the existing headers cannot be used, define new ones.
bls_col_names <- c("Industry","Total_Employed","Management_Business_Financial",
"Professional","Protective_Services",
"Non_Protective_Services",
"Sales_Occupations","Office_Admin_Support",
"Farming_Fishing_Forestry",
"Construction_Extraction","Intallation_Maintenance_Repair",
"Production_Occupations","Transportation")
#Define list of Industry Categories that will be added to the data
ind_cats <- c(
"Agriculture and related"
,"Mining, quarrying, and oil and gas extraction"
,"Construction"
,"Manufacturing"
,"Manufacturing"
,"Wholesale and retail trade"
,"Wholesale and retail trade"
,"Transportation and utilities"
,"Information"
,"Financial activities"
,"Professional and business services"
,"Education and health services"
,"Leisure and hospitality"
,"Other services"
,"Other services"
,"Public administration"
)
# Define list of row names, which contain Totals. Those will need to be removed.
totals_row_names <- c("Manufacturing",
"Wholesale and retail trade",
"Other services")# From the entire data set, we are only interested in getting employment data by sex/gender
men_bls_raw_data <- readxl::read_excel(xl_file_path,
sheet = "cpsaat17",
col_names = bls_col_names,
skip = 29, n_max = 19)
women_bls_raw_data <- readxl::read_excel(xl_file_path,
sheet = "cpsaat17",
col_names = bls_col_names,
skip = 50,
n_max = 19)Let’s take a look at the partially cleaned data.
| Industry | Total_Employed | Management_Business_Financial | Professional | Protective_Services | Non_Protective_Services | Sales_Occupations | Office_Admin_Support | Farming_Fishing_Forestry | Construction_Extraction | Intallation_Maintenance_Repair | Production_Occupations | Transportation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Agriculture and related | 1699 | 707 | 36 | 11 | 52 | 6 | 16 | 714 | 10 | 31 | 36 | 80 |
| Mining, quarrying, and oil and gas extraction | 585 | 104 | 90 | 1 | 4 | 9 | 11 | 0 | 200 | 38 | 56 | 72 |
| Construction | 9613 | 1894 | 256 | 10 | 32 | 112 | 101 | 6 | 6157 | 604 | 147 | 294 |
| Manufacturing | 10255 | 1831 | 1722 | 26 | 111 | 402 | 466 | 15 | 287 | 607 | 3857 | 932 |
| Durable goods | 6814 | 1197 | 1305 | 12 | 50 | 257 | 290 | 7 | 221 | 383 | 2562 | 530 |
| Nondurable goods | 3441 | 634 | 417 | 13 | 61 | 145 | 176 | 8 | 66 | 224 | 1295 | 402 |
| Industry | Total_Employed | Management_Business_Financial | Professional | Protective_Services | Non_Protective_Services | Sales_Occupations | Office_Admin_Support | Farming_Fishing_Forestry | Construction_Extraction | Intallation_Maintenance_Repair | Production_Occupations | Transportation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Agriculture and related | 650 | 253 | 22 | 2 | 42 | 6 | 79 | 217 | 0 | 2 | 7 | 20 |
| Mining, quarrying, and oil and gas extraction | 99 | 43 | 14 | 0 | 2 | 1 | 28 | 0 | 8 | 1 | 1 | 2 |
| Construction | 1173 | 316 | 82 | 4 | 14 | 32 | 444 | 0 | 247 | 14 | 7 | 14 |
| Manufacturing | 4295 | 1014 | 525 | 9 | 66 | 230 | 682 | 3 | 14 | 46 | 1477 | 230 |
| Durable goods | 2314 | 570 | 315 | 5 | 29 | 87 | 406 | 0 | 12 | 32 | 760 | 99 |
| Nondurable goods | 1981 | 445 | 210 | 4 | 37 | 143 | 276 | 3 | 2 | 14 | 717 | 131 |
Additional cleanup need to be performed on the data:
# Perform additional data wrangling
men_bls_data <- men_bls_raw_data %>%
#Remove rows containing totals
filter(!Industry %in% totals_row_names) %>%
#Add new variable for Industry Category
mutate(Industry_Category = ind_cats)
women_bls_data <- women_bls_raw_data %>%
#Remove rows containing totals
filter(!Industry %in% totals_row_names) %>%
#Add new variable for Industry Category
mutate(Industry_Category = ind_cats)Let’s take a look at the cleaned up data.
| Industry | Total_Employed | Management_Business_Financial | Professional | Protective_Services | Non_Protective_Services | Sales_Occupations | Office_Admin_Support | Farming_Fishing_Forestry | Construction_Extraction | Intallation_Maintenance_Repair | Production_Occupations | Transportation | Industry_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Agriculture and related | 1699 | 707 | 36 | 11 | 52 | 6 | 16 | 714 | 10 | 31 | 36 | 80 | Agriculture and related |
| Mining, quarrying, and oil and gas extraction | 585 | 104 | 90 | 1 | 4 | 9 | 11 | 0 | 200 | 38 | 56 | 72 | Mining, quarrying, and oil and gas extraction |
| Construction | 9613 | 1894 | 256 | 10 | 32 | 112 | 101 | 6 | 6157 | 604 | 147 | 294 | Construction |
| Durable goods | 6814 | 1197 | 1305 | 12 | 50 | 257 | 290 | 7 | 221 | 383 | 2562 | 530 | Manufacturing |
| Nondurable goods | 3441 | 634 | 417 | 13 | 61 | 145 | 176 | 8 | 66 | 224 | 1295 | 402 | Manufacturing |
| Wholesale trade | 2358 | 340 | 106 | 7 | 28 | 855 | 159 | 16 | 24 | 140 | 86 | 597 | Wholesale and retail trade |
| Industry | Total_Employed | Management_Business_Financial | Professional | Protective_Services | Non_Protective_Services | Sales_Occupations | Office_Admin_Support | Farming_Fishing_Forestry | Construction_Extraction | Intallation_Maintenance_Repair | Production_Occupations | Transportation | Industry_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Agriculture and related | 650 | 253 | 22 | 2 | 42 | 6 | 79 | 217 | 0 | 2 | 7 | 20 | Agriculture and related |
| Mining, quarrying, and oil and gas extraction | 99 | 43 | 14 | 0 | 2 | 1 | 28 | 0 | 8 | 1 | 1 | 2 | Mining, quarrying, and oil and gas extraction |
| Construction | 1173 | 316 | 82 | 4 | 14 | 32 | 444 | 0 | 247 | 14 | 7 | 14 | Construction |
| Durable goods | 2314 | 570 | 315 | 5 | 29 | 87 | 406 | 0 | 12 | 32 | 760 | 99 | Manufacturing |
| Nondurable goods | 1981 | 445 | 210 | 4 | 37 | 143 | 276 | 3 | 2 | 14 | 717 | 131 | Manufacturing |
| Wholesale trade | 1022 | 179 | 67 | 3 | 10 | 310 | 286 | 11 | 0 | 4 | 44 | 107 | Wholesale and retail trade |
Much you could do here - take a year and identify industries and occupations by gender - any surprises? What you expected? Do occupations by gender stay relatively consistent whatever industry you are in? Or does industry matter? If you are ambitious you could read in several years and do a trend analysis.
ggplot(men_bls_data, aes(x = reorder(Industry, Total_Employed), y = Total_Employed, fill = Industry)) +
geom_bar(stat = "identity") +
coord_flip() +
ggtitle("Men Employment by Industry") +
labs(x = "Industry", y = "Total Employed") +
theme(legend.position = "none")
ggplot(women_bls_data, aes(x = reorder(Industry, Total_Employed), y = Total_Employed, fill = Industry)) +
geom_bar(stat = "identity") +
coord_flip() +
ggtitle("Women Employment by Industry") +
labs(x = "Industry", y = "Total Employed") +
theme(legend.position = "none")From the graph above we can see that the total average number of employed people per industry is very different for men vs women. We can see that,
In this project I used the package “readxl” to read the data from an Excel source. This package’s pre-processing parameters helped to partially pre-clean up the file while reading it into R. Using its very useful parameters became an effective tool to prepare the data for analysis. I also used the “dplyr” package to perform additional data wrangling. Both packages proved to be very useful in preparing tidy data for data analysis.