Project 2 - Data Transformation 1 of 3

Step 1: Problem Statement

In this project the goal is to practice data transformation applied to untidy data. For this purpose, I chose a classmate’s untidy dataset from our discussion board. Using such dataset, the following items will be addressed:

Create a .CSV or .XLSX file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item.

About the dataset and the analysis on it

The dataset is Bureau of Labor Statistics data on occupations and industries for the US for a given year (there is 2015 to 2020 in separate files, but year 2020 was chosen). The source file can be found at: https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-02-23/bls-2020.xlsx

The file is wide - industries run down the left and occupations run across the top. The industries on the left are repeated 6 times for Male, Female, Black, White, Asian, and total. Some of the rows are there to sum up other rows.

Much you could do here - take a year and identify industries and occupations by gender - any surprises? What you expected? Do occupations by gender stay relatively consistent whatever industry you are in? Or does industry matter? If you are ambitious you could read in several years and do a trend analysis.

Step 2: Data Collection and Wrangling

# Packages to be used
library(tidyverse)
library(readxl)

Read the Bureau of Labor Statistics data

Let’s read and display the raw data

# Read data from Excel file
xl_file_path <- "C:\\tmp\\bls-2020.xlsx"

# Read entire data set
bls_raw_data <- readxl::read_excel(xl_file_path,
                                       sheet = "cpsaat17",
                                       )
# Display some of the raw data
head(bls_raw_data, 20)

## # A tibble: 20 x 13
##    `HOUSEHOLD DATA\~ ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10 ...11
##    <chr>             <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 [In thousands]    <NA>  <NA>  <NA>   <NA> <NA>   <NA>  <NA>  <NA>  <NA>  <NA>
##  2 <NA>              <NA>  <NA>  <NA>   <NA> <NA>   <NA>  <NA>  <NA>  <NA>  <NA>
##  3 Industry, sex, a~ 2020  <NA>  <NA>   <NA> <NA>   <NA>  <NA>  <NA>  <NA>  <NA>
##  4 <NA>              Tota~ Mana~ <NA>  "Ser~ <NA>  "Sal~  <NA> "Nat~  <NA>  <NA>
##  5 <NA>              <NA>  Mana~ Prof~ "Pro~ Serv~ "Sal~ "Off~ "Far~ "Con~ "Ins~
##  6 <NA>              <NA>  <NA>  <NA>   <NA> <NA>   <NA>  <NA>  <NA>  <NA>  <NA>
##  7 TOTAL             <NA>  <NA>  <NA>   <NA> <NA>   <NA>  <NA>  <NA>  <NA>  <NA>
##  8 Agriculture and ~ 2349  961   58    "13"  94    "12"  "96"  "931" "10"  "33" 
##  9 Mining, quarryin~ 684   146   104   "1"   6     "11"  "38"  "0"   "208" "39" 
## 10 Construction      10786 2210  337   "14"  46    "145" "545" "6"   "640~ "618"
## 11 Manufacturing     14550 2845  2247  "34"  176   "631" "114~ "17"  "301" "653"
## 12 Durable goods     9129  1767  1620  "17"  79    "344" "697" "7"   "233" "414"
## 13 Nondurable goods  5421  1078  627   "17"  97    "288" "451" "11"  "68"  "239"
## 14 Wholesale and re~ 18989 1607  1214  "68"  611   "876~ "197~ "45"  "129" "656"
## 15 Wholesale trade   3380  519   172   "10"  39    "116~ "445" "28"  "24"  "144"
## 16 Retail trade      15609 1088  1041  "58"  573   "760~ "152~ "17"  "105" "513"
## 17 Transportation a~ 8552  1012  402   "47"  100   "150" "184~ "3"   "210" "515"
## 18 Information       2594  651   1095  "6"   38    "200" "329" "0"   "12"  "219"
## 19 Financial activi~ 10646 4729  1081  "56"  222   "228~ "193~ "1"   "48"  "164"
## 20 Professional and~ 18816 5064  6601  "540" 2757  "583" "200~ "12"  "153" "286"
## # ... with 2 more variables: ...12 <chr>, ...13 <chr>

Because we are interested in analyzing the data at the sex/gender level, as we read the data from the file, we can perform an initial in flight cleanup as follows:

Read only the rows for male and female groups
Skip all the rows before the target groups to avoid unnecessary cleanup later
Since each of the groups of rows in the file have 19 rows, read only the corresponding 19 rows for each sex group.

# The column names in the Excel file are too long and are multi-lined, 
# Because the existing headers cannot be used, define new ones.
bls_col_names <- c("Industry","Total_Employed","Management_Business_Financial",
                   "Professional","Protective_Services",
                   "Non_Protective_Services",
                   "Sales_Occupations","Office_Admin_Support",
                   "Farming_Fishing_Forestry",
                   "Construction_Extraction","Intallation_Maintenance_Repair",
                   "Production_Occupations","Transportation")

#Define list of Industry Categories that will be added to the data
ind_cats <- c(
"Agriculture and related"
,"Mining, quarrying, and oil and gas extraction"
,"Construction"
,"Manufacturing"
,"Manufacturing"
,"Wholesale and retail trade"
,"Wholesale and retail trade"
,"Transportation and utilities"
,"Information"
,"Financial activities"
,"Professional and business services"
,"Education and health services"
,"Leisure and hospitality"
,"Other services"
,"Other services"
,"Public administration"
)

# Define list of row names, which contain Totals. Those will need to be removed.
totals_row_names <- c("Manufacturing",
                      "Wholesale and retail trade",
                      "Other services")

# From the entire data set, we are only interested in getting employment data by sex/gender
men_bls_raw_data <- readxl::read_excel(xl_file_path,
                                       sheet = "cpsaat17",
                                       col_names = bls_col_names,
                                       skip = 29, n_max = 19)

women_bls_raw_data <- readxl::read_excel(xl_file_path,
                                         sheet = "cpsaat17",
                                         col_names = bls_col_names,
                                         skip = 50,
                                         n_max = 19)

Let’s take a look at the partially cleaned data.

Table: Untidy data for men employment
Industry	Total_Employed	Management_Business_Financial	Professional	Protective_Services	Non_Protective_Services	Sales_Occupations	Office_Admin_Support	Farming_Fishing_Forestry	Construction_Extraction	Intallation_Maintenance_Repair	Production_Occupations	Transportation
Agriculture and related	1699	707	36	11	52	6	16	714	10	31	36	80
Mining, quarrying, and oil and gas extraction	585	104	90	1	4	9	11	0	200	38	56	72
Construction	9613	1894	256	10	32	112	101	6	6157	604	147	294
Manufacturing	10255	1831	1722	26	111	402	466	15	287	607	3857	932
Durable goods	6814	1197	1305	12	50	257	290	7	221	383	2562	530
Nondurable goods	3441	634	417	13	61	145	176	8	66	224	1295	402

Table: Untidy data for women employment
Industry	Total_Employed	Management_Business_Financial	Professional	Protective_Services	Non_Protective_Services	Sales_Occupations	Office_Admin_Support	Farming_Fishing_Forestry	Construction_Extraction	Intallation_Maintenance_Repair	Production_Occupations	Transportation
Agriculture and related	650	253	22	2	42	6	79	217	0	2	7	20
Mining, quarrying, and oil and gas extraction	99	43	14	0	2	1	28	0	8	1	1	2
Construction	1173	316	82	4	14	32	444	0	247	14	7	14
Manufacturing	4295	1014	525	9	66	230	682	3	14	46	1477	230
Durable goods	2314	570	315	5	29	87	406	0	12	32	760	99
Nondurable goods	1981	445	210	4	37	143	276	3	2	14	717	131

Additional cleanup need to be performed on the data:

Rows containing totals need to be removed
The industry column needs to be split into two columns Industry and Industry Category

# Perform additional data wrangling

men_bls_data <- men_bls_raw_data %>%
  #Remove rows containing totals
  filter(!Industry %in% totals_row_names) %>%  
  #Add new variable for Industry Category
  mutate(Industry_Category = ind_cats)

women_bls_data <- women_bls_raw_data %>%
  #Remove rows containing totals
  filter(!Industry %in% totals_row_names) %>%
  #Add new variable for Industry Category
  mutate(Industry_Category = ind_cats)

Let’s take a look at the cleaned up data.

Table: Tidy data for men employment
Industry	Total_Employed	Management_Business_Financial	Professional	Protective_Services	Non_Protective_Services	Sales_Occupations	Office_Admin_Support	Farming_Fishing_Forestry	Construction_Extraction	Intallation_Maintenance_Repair	Production_Occupations	Transportation	Industry_Category
Agriculture and related	1699	707	36	11	52	6	16	714	10	31	36	80	Agriculture and related
Mining, quarrying, and oil and gas extraction	585	104	90	1	4	9	11	0	200	38	56	72	Mining, quarrying, and oil and gas extraction
Construction	9613	1894	256	10	32	112	101	6	6157	604	147	294	Construction
Durable goods	6814	1197	1305	12	50	257	290	7	221	383	2562	530	Manufacturing
Nondurable goods	3441	634	417	13	61	145	176	8	66	224	1295	402	Manufacturing
Wholesale trade	2358	340	106	7	28	855	159	16	24	140	86	597	Wholesale and retail trade

Table: Tidy data for women employment
Industry	Total_Employed	Management_Business_Financial	Professional	Protective_Services	Non_Protective_Services	Sales_Occupations	Office_Admin_Support	Farming_Fishing_Forestry	Construction_Extraction	Intallation_Maintenance_Repair	Production_Occupations	Transportation	Industry_Category
Agriculture and related	650	253	22	2	42	6	79	217	0	2	7	20	Agriculture and related
Mining, quarrying, and oil and gas extraction	99	43	14	0	2	1	28	0	8	1	1	2	Mining, quarrying, and oil and gas extraction
Construction	1173	316	82	4	14	32	444	0	247	14	7	14	Construction
Durable goods	2314	570	315	5	29	87	406	0	12	32	760	99	Manufacturing
Nondurable goods	1981	445	210	4	37	143	276	3	2	14	717	131	Manufacturing
Wholesale trade	1022	179	67	3	10	310	286	11	0	4	44	107	Wholesale and retail trade

Step 3: Visualize & Summarize Data

ggplot(men_bls_data, aes(x = reorder(Industry, Total_Employed), y = Total_Employed, fill = Industry)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ggtitle("Men Employment by Industry") +
  labs(x = "Industry", y = "Total Employed") +
  theme(legend.position = "none")

ggplot(women_bls_data, aes(x = reorder(Industry, Total_Employed), y = Total_Employed, fill = Industry)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ggtitle("Women Employment by Industry") +
  labs(x = "Industry", y = "Total Employed") +
  theme(legend.position = "none")

From the graph above we can see that the total average number of employed people per industry is very different for men vs women. We can see that,

The highest number number of employed men are in the “Professional and business services” industry, while women are in “Education and health services”.
The lowest number number of employed men are in the “Private Household services” industry, while women are in “Mining, quarrying, and oil and gas extraction”.

Step 4: Communication of Results

In this project I used the package “readxl” to read the data from an Excel source. This package’s pre-processing parameters helped to partially pre-clean up the file while reading it into R. Using its very useful parameters became an effective tool to prepare the data for analysis. I also used the “dplyr” package to perform additional data wrangling. Both packages proved to be very useful in preparing tidy data for data analysis.