Step 1: Problem Statement

In this project the goal is to practice data transformation applied to untidy data. For this purpose, I chose a classmate’s untidy dataset from our discussion board. Using such dataset, the following items will be addressed:

  • Create a .CSV or .XLSX file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
  • Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
  • Perform the analysis requested in the discussion item.

About the dataset and the analysis on it

The dataset is Bureau of Labor Statistics data on occupations and industries for the US for a given year (there is 2015 to 2020 in separate files, but year 2020 was chosen). The source file can be found at: https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-02-23/bls-2020.xlsx

The file is wide - industries run down the left and occupations run across the top. The industries on the left are repeated 6 times for Male, Female, Black, White, Asian, and total. Some of the rows are there to sum up other rows.

Much you could do here - take a year and identify industries and occupations by gender - any surprises? What you expected? Do occupations by gender stay relatively consistent whatever industry you are in? Or does industry matter? If you are ambitious you could read in several years and do a trend analysis.

Step 2: Data Collection and Wrangling

# Packages to be used
library(tidyverse)
library(readxl)

Read the Bureau of Labor Statistics data

Let’s read and display the raw data

# Read data from Excel file
xl_file_path <- "C:\\tmp\\bls-2020.xlsx"

# Read entire data set
bls_raw_data <- readxl::read_excel(xl_file_path,
                                       sheet = "cpsaat17",
                                       )
# Display some of the raw data
head(bls_raw_data, 20)
## # A tibble: 20 x 13
##    `HOUSEHOLD DATA\~ ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10 ...11
##    <chr>             <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 [In thousands]    <NA>  <NA>  <NA>   <NA> <NA>   <NA>  <NA>  <NA>  <NA>  <NA>
##  2 <NA>              <NA>  <NA>  <NA>   <NA> <NA>   <NA>  <NA>  <NA>  <NA>  <NA>
##  3 Industry, sex, a~ 2020  <NA>  <NA>   <NA> <NA>   <NA>  <NA>  <NA>  <NA>  <NA>
##  4 <NA>              Tota~ Mana~ <NA>  "Ser~ <NA>  "Sal~  <NA> "Nat~  <NA>  <NA>
##  5 <NA>              <NA>  Mana~ Prof~ "Pro~ Serv~ "Sal~ "Off~ "Far~ "Con~ "Ins~
##  6 <NA>              <NA>  <NA>  <NA>   <NA> <NA>   <NA>  <NA>  <NA>  <NA>  <NA>
##  7 TOTAL             <NA>  <NA>  <NA>   <NA> <NA>   <NA>  <NA>  <NA>  <NA>  <NA>
##  8 Agriculture and ~ 2349  961   58    "13"  94    "12"  "96"  "931" "10"  "33" 
##  9 Mining, quarryin~ 684   146   104   "1"   6     "11"  "38"  "0"   "208" "39" 
## 10 Construction      10786 2210  337   "14"  46    "145" "545" "6"   "640~ "618"
## 11 Manufacturing     14550 2845  2247  "34"  176   "631" "114~ "17"  "301" "653"
## 12 Durable goods     9129  1767  1620  "17"  79    "344" "697" "7"   "233" "414"
## 13 Nondurable goods  5421  1078  627   "17"  97    "288" "451" "11"  "68"  "239"
## 14 Wholesale and re~ 18989 1607  1214  "68"  611   "876~ "197~ "45"  "129" "656"
## 15 Wholesale trade   3380  519   172   "10"  39    "116~ "445" "28"  "24"  "144"
## 16 Retail trade      15609 1088  1041  "58"  573   "760~ "152~ "17"  "105" "513"
## 17 Transportation a~ 8552  1012  402   "47"  100   "150" "184~ "3"   "210" "515"
## 18 Information       2594  651   1095  "6"   38    "200" "329" "0"   "12"  "219"
## 19 Financial activi~ 10646 4729  1081  "56"  222   "228~ "193~ "1"   "48"  "164"
## 20 Professional and~ 18816 5064  6601  "540" 2757  "583" "200~ "12"  "153" "286"
## # ... with 2 more variables: ...12 <chr>, ...13 <chr>

Because we are interested in analyzing the data at the sex/gender level, as we read the data from the file, we can perform an initial in flight cleanup as follows:

  • Read only the rows for male and female groups
  • Skip all the rows before the target groups to avoid unnecessary cleanup later
  • Since each of the groups of rows in the file have 19 rows, read only the corresponding 19 rows for each sex group.
# The column names in the Excel file are too long and are multi-lined, 
# Because the existing headers cannot be used, define new ones.
bls_col_names <- c("Industry","Total_Employed","Management_Business_Financial",
                   "Professional","Protective_Services",
                   "Non_Protective_Services",
                   "Sales_Occupations","Office_Admin_Support",
                   "Farming_Fishing_Forestry",
                   "Construction_Extraction","Intallation_Maintenance_Repair",
                   "Production_Occupations","Transportation")

#Define list of Industry Categories that will be added to the data
ind_cats <- c(
"Agriculture and related"
,"Mining, quarrying, and oil and gas extraction"
,"Construction"
,"Manufacturing"
,"Manufacturing"
,"Wholesale and retail trade"
,"Wholesale and retail trade"
,"Transportation and utilities"
,"Information"
,"Financial activities"
,"Professional and business services"
,"Education and health services"
,"Leisure and hospitality"
,"Other services"
,"Other services"
,"Public administration"
)

# Define list of row names, which contain Totals. Those will need to be removed.
totals_row_names <- c("Manufacturing",
                      "Wholesale and retail trade",
                      "Other services")
# From the entire data set, we are only interested in getting employment data by sex/gender
men_bls_raw_data <- readxl::read_excel(xl_file_path,
                                       sheet = "cpsaat17",
                                       col_names = bls_col_names,
                                       skip = 29, n_max = 19)

women_bls_raw_data <- readxl::read_excel(xl_file_path,
                                         sheet = "cpsaat17",
                                         col_names = bls_col_names,
                                         skip = 50,
                                         n_max = 19)

Let’s take a look at the partially cleaned data.

Table: Untidy data for men employment
Industry Total_Employed Management_Business_Financial Professional Protective_Services Non_Protective_Services Sales_Occupations Office_Admin_Support Farming_Fishing_Forestry Construction_Extraction Intallation_Maintenance_Repair Production_Occupations Transportation
Agriculture and related 1699 707 36 11 52 6 16 714 10 31 36 80
Mining, quarrying, and oil and gas extraction 585 104 90 1 4 9 11 0 200 38 56 72
Construction 9613 1894 256 10 32 112 101 6 6157 604 147 294
Manufacturing 10255 1831 1722 26 111 402 466 15 287 607 3857 932
Durable goods 6814 1197 1305 12 50 257 290 7 221 383 2562 530
Nondurable goods 3441 634 417 13 61 145 176 8 66 224 1295 402
Table: Untidy data for women employment
Industry Total_Employed Management_Business_Financial Professional Protective_Services Non_Protective_Services Sales_Occupations Office_Admin_Support Farming_Fishing_Forestry Construction_Extraction Intallation_Maintenance_Repair Production_Occupations Transportation
Agriculture and related 650 253 22 2 42 6 79 217 0 2 7 20
Mining, quarrying, and oil and gas extraction 99 43 14 0 2 1 28 0 8 1 1 2
Construction 1173 316 82 4 14 32 444 0 247 14 7 14
Manufacturing 4295 1014 525 9 66 230 682 3 14 46 1477 230
Durable goods 2314 570 315 5 29 87 406 0 12 32 760 99
Nondurable goods 1981 445 210 4 37 143 276 3 2 14 717 131

Additional cleanup need to be performed on the data:

  • Rows containing totals need to be removed
  • The industry column needs to be split into two columns Industry and Industry Category
# Perform additional data wrangling

men_bls_data <- men_bls_raw_data %>%
  #Remove rows containing totals
  filter(!Industry %in% totals_row_names) %>%  
  #Add new variable for Industry Category
  mutate(Industry_Category = ind_cats)

women_bls_data <- women_bls_raw_data %>%
  #Remove rows containing totals
  filter(!Industry %in% totals_row_names) %>%
  #Add new variable for Industry Category
  mutate(Industry_Category = ind_cats)

Let’s take a look at the cleaned up data.

Table: Tidy data for men employment
Industry Total_Employed Management_Business_Financial Professional Protective_Services Non_Protective_Services Sales_Occupations Office_Admin_Support Farming_Fishing_Forestry Construction_Extraction Intallation_Maintenance_Repair Production_Occupations Transportation Industry_Category
Agriculture and related 1699 707 36 11 52 6 16 714 10 31 36 80 Agriculture and related
Mining, quarrying, and oil and gas extraction 585 104 90 1 4 9 11 0 200 38 56 72 Mining, quarrying, and oil and gas extraction
Construction 9613 1894 256 10 32 112 101 6 6157 604 147 294 Construction
Durable goods 6814 1197 1305 12 50 257 290 7 221 383 2562 530 Manufacturing
Nondurable goods 3441 634 417 13 61 145 176 8 66 224 1295 402 Manufacturing
Wholesale trade 2358 340 106 7 28 855 159 16 24 140 86 597 Wholesale and retail trade
Table: Tidy data for women employment
Industry Total_Employed Management_Business_Financial Professional Protective_Services Non_Protective_Services Sales_Occupations Office_Admin_Support Farming_Fishing_Forestry Construction_Extraction Intallation_Maintenance_Repair Production_Occupations Transportation Industry_Category
Agriculture and related 650 253 22 2 42 6 79 217 0 2 7 20 Agriculture and related
Mining, quarrying, and oil and gas extraction 99 43 14 0 2 1 28 0 8 1 1 2 Mining, quarrying, and oil and gas extraction
Construction 1173 316 82 4 14 32 444 0 247 14 7 14 Construction
Durable goods 2314 570 315 5 29 87 406 0 12 32 760 99 Manufacturing
Nondurable goods 1981 445 210 4 37 143 276 3 2 14 717 131 Manufacturing
Wholesale trade 1022 179 67 3 10 310 286 11 0 4 44 107 Wholesale and retail trade

Step 3: Visualize & Summarize Data

Much you could do here - take a year and identify industries and occupations by gender - any surprises? What you expected? Do occupations by gender stay relatively consistent whatever industry you are in? Or does industry matter? If you are ambitious you could read in several years and do a trend analysis.

ggplot(men_bls_data, aes(x = reorder(Industry, Total_Employed), y = Total_Employed, fill = Industry)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ggtitle("Men Employment by Industry") +
  labs(x = "Industry", y = "Total Employed") +
  theme(legend.position = "none")

ggplot(women_bls_data, aes(x = reorder(Industry, Total_Employed), y = Total_Employed, fill = Industry)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ggtitle("Women Employment by Industry") +
  labs(x = "Industry", y = "Total Employed") +
  theme(legend.position = "none")

From the graph above we can see that the total average number of employed people per industry is very different for men vs women. We can see that,

  • The highest number number of employed men are in the “Professional and business services” industry, while women are in “Education and health services”.
  • The lowest number number of employed men are in the “Private Household services” industry, while women are in “Mining, quarrying, and oil and gas extraction”.

Step 4: Communication of Results

In this project I used the package “readxl” to read the data from an Excel source. This package’s pre-processing parameters helped to partially pre-clean up the file while reading it into R. Using its very useful parameters became an effective tool to prepare the data for analysis. I also used the “dplyr” package to perform additional data wrangling. Both packages proved to be very useful in preparing tidy data for data analysis.