Assignment

Requirements

\(MSDA\ 607\ Final\ Project\)

Deliverables Schedule

Deliverable	Date	Points
One Paragraph Proposal	Sunday April 26	20
Final Project	Wednesday May 19	120
Final Project Presentation	Before or during Final Meetup on Wednesday May 19	30

Policy on Collaboration

You may work in a team of up to three people. Each project team member is responsible for understanding and being able to explain all of the submitted project code. Remember that you can take work that you find elsewhere as a base to build on, but you need to acknowledge the source, so that I base your grade on what you contributed, not on what you started with!

Approval Meeting

Once you’ve turned in your one paragraph proposal, I want to schedule a 15 minute phone meeting with each person or team (starting when we return from Thanksgiving break), where you’ll describe the reason (benefit) for doing this work and/or question you’re seeking to answer, where you’ll source the data, and the overall flow. For team projects, I also want you to articulate the roles and responsibilities of each team member.

Final Project Checklist

To receive full credit, you’ll need to deliver on all of the items in the checklist below. Please read carefully through this checklist before you make your project proposal. You are (within these checklist constraints) strongly urged to limit scope and make the necessary simplifying assumptions so that you can deliver your work on time!

Scope

With COVID-19 impacting employment worldwide, many articles have been published highlighting the toll it has taken on specific demographics. Our motivation to dive deeper into the impact by demographic, has been influenced by articles such as COVID-19 has spurred America’s first female recession. This article talk about the negative affect the pandemic has had on women in the United States. Using data from the U.S. Bureau of Labor Statistics, we hope to identify if the impact was disproportionate or great to women vs. men.

Data Preparation

Load Data

In order to conduct our studies we used data from U.S. Bureau of Labor Statistics | Labor Force Statistics from the Current Population Survey specifically data regarding:

The data was web scraped then encapsulated using the functions as.dataframe(), html_table() and read_html, then subsetted.

url_9  <- "https://www.bls.gov/web/empsit/cpseea09.htm"
url_10 <- "https://www.bls.gov/web/empsit/cpseea10.htm"
url_30 <- "https://www.bls.gov/web/empsit/cpseea30.htm"

Table 9 Data

head(df_tb9  <- as.data.frame(html_table(read_html(url_9))[[2]]),5)%>%
  reactable(highlight = TRUE, pagination = FALSE, height = 450)

________________________________________________________________________________

Table 10 Data

head(df_tb10 <- as.data.frame(html_table(read_html(url_10))[[2]]),5)%>%
  reactable(highlight = TRUE, pagination = FALSE, height = 450)

Table 30 Data

head(df_tb30 <- as.data.frame(html_table(read_html(url_30))[[2]]),10)%>%
  reactable(highlight = TRUE, pagination = TRUE, height = 850)

Transform Data

Data manipulation and tidying for Table 9 included:

relabeling column values to reflect a (4 digit: year) - (2 digit: month) (%y%m) format.
subsetting data frames to remove unnecessary data or blank rows with the help of piping %>%
categorizing data with a Gender column, that included a Male,Female and Total value respectively.

Note: Total value was used to specify data that reflects total unemployment counts.

colnames(df_tb10) <- c("Label","2020-03", "2020-04",
                         "2020-05", "2020-06", "2020-07",
                         "2020-08", "2020-09", "2020-10",
                         "2020-11", "2020-12","2021-01",
                         "2021-02", "2021-03")
                              

colnames(df_tb9) <- c("Label", "2020-03", "2020-04",
                         "2020-05", "2020-06", "2020-07",
                         "2020-08", "2020-09", "2020-10",
                         "2020-11", "2020-12","2021-01",
                         "2021-02", "2021-03")

df_tb9_tdy<-
    as.data.frame(df_tb9[1,]%>%
      assign_in(list(1,1),"Month")%>%
        bind_rows(df_tb9[3:14,],
          df_tb9[16:27,],
          df_tb9[29:40,])
    )
# Adding male and female category
df_tb9_tdy$Gender<-"Total"
df_tb9_tdy$Gender[14:25] <- "Male"
df_tb9_tdy$Gender[26:37] <- "Female"

And repeated for Table 10.

df_tb10_tdy<-
    as.data.frame(df_tb10[1,]%>%
      assign_in(list(1,1),"Month")%>%
        bind_rows(df_tb10[3:14,],
          df_tb10[16:27,],
          df_tb10[29:40,])
    )

# Adding male and female category
df_tb10_tdy$Gender<-"Total"
df_tb10_tdy$Gender[14:25] <- "Male"
df_tb10_tdy$Gender[26:37] <- "Female"

This tidy data (df_tb9_tdy,df_tb10_tdy) is the baseline used for our evaluation of the comparison between unemployment rates of men and women, during the beginning month until present day, of the pandemic.

Using ZevRoss: Know Your Data | Easy multi-panel plots in R using facet_wrap() and facet_grid() from ggplot2 as a point of reference, we continue to transform the data.

Specifically:

regex is used within the gsub() function, to remove , from our Rate column.
Month and Rate is then transformed to data types that will allow for further analysis.

#Dependency on df_tb10_tdy    Rate, Gender, Age Group
df_tb10_ln <- as.data.frame(df_tb10_tdy %>% 
  slice(c(16,17,19,22,23,24,25,28,29,31,34,35,36,37)) %>% 
    gather("Month_Year", "Rate", 2:14)%>%
      lapply(function(x){ 
        gsub(",", "", x)}))

df_tb10_ln$Month_Year<- as.yearmon(df_tb10_ln$Month_Year) 
df_tb10_ln$Rate <- as.numeric(df_tb10_ln$Rate)

The data is stored to df_tb10_ln. When sorted, it is easy to see that the highest unemployment rates overall, did occur among women between the ages of 16 to 19, during the months of March and April of 2020, by a small margin over men 18 to 19 in March & April 2020.

Label	Gender	Month_Year	Rate
18 to 19 years	Female	Mar 2020	38.7
16 to 17 years	Female	Mar 2020	33.3
16 to 17 years	Female	Apr 2020	31.0
18 to 19 years	Male	Mar 2020	30.7
18 to 19 years	Female	Apr 2020	30.3
18 to 19 years	Male	Apr 2020	28.8

From here we calculate rates and store the values to df_tb9_rt. The data is transformed in a month over month format, so that it can be assessed as a trend throughout the year. The functions used to create this data frame included:

piping %>%
slice()
gather()
mutate()

This data frame is grouped by Gender specifically.

Dates and counts are converted with as.yearmon() & as.numeric().

The calculations for our month-over-month (MoM) values and month-over-month rates (MoM_Rates) were as follows:

\(MoM= \frac{Count - lag(Count)}{lag(Count)}\) \(MoM\_Rate =MoM\times100 :rounded\ to\ the\ tenth\)

where lag() is a time series shift from the stats package.

# Month over Month employment rate change by gender 
#Dependency on df_tb9_tdy:     Rate and Gender
df_tb9_rt <- df_tb9_tdy %>% 
  slice(c(14,26)) %>% 
    gather("Month_Year", "Count", 2:14)

df_tb9_rt$Month_Year<- as.yearmon(df_tb9_rt$Month_Year)
df_tb9_rt$Count <- as.numeric(gsub(",","",df_tb9_rt$Count))

df_tb9_rt <- 
    df_tb9_rt %>% 
      group_by(Gender) %>% 
        mutate(MoM = (Count - lag(Count)) / lag(Count)) %>% 
          mutate(MoM_Rate =  round(MoM * 100, 2))

The result is a data frame with rates for ages 16 and over (or all age groups).

Label	Gender	Month_Year	Count	MoM	MoM_Rate
Women, 16 years and over	Female	Nov 2020	5042	0.0430285	4.30
Men, 16 years and over	Male	Mar 2021	5394	0.0301757	3.02
Men, 16 years and over	Male	Jan 2021	5395	-0.0071770	-0.72
Men, 16 years and over	Male	Oct 2020	5894	-0.0107419	-1.07
Women, 16 years and over	Female	Mar 2021	4418	-0.0125168	-1.25
Women, 16 years and over	Female	Feb 2021	4474	-0.0225038	-2.25

Following a similar approach, but grouping by Gender & Label we create data frame df_tb9_MoM and can see the impact for each age bracket.

# Month over Month employment rate change by age and gender 
#Dependency on df_tb9_tdy:     Count, age, and Gender

df_tb9_MoM <- df_tb9_tdy %>% 
  slice(c(16,17,19,22,23,24,25,28,29,31,34,35,36,37)) %>% 
  gather("Month_Year", "Count", 2:14)

df_tb9_MoM$Month_Year<- as.yearmon(df_tb9_MoM$Month_Year) 
df_tb9_MoM$Count <- as.numeric(gsub(",","",df_tb9_MoM$Count))

df_tb9_MoM <- df_tb9_MoM %>% 
  group_by(Gender, Label) %>% 
  mutate(MoM = (Count - lag(Count)) / lag(Count)) %>% 
  mutate(MoM_Rate =  round(MoM * 100, 1))

Label	Gender	Month_Year	Count	MoM	MoM_Rate
16 to 17 years	Male	Oct 2020	146	-0.1360947	-13.6
16 to 17 years	Female	Oct 2020	168	0.2923077	29.2
16 to 17 years	Male	Nov 2020	159	0.0890411	8.9
16 to 17 years	Female	Nov 2020	143	-0.1488095	-14.9
16 to 17 years	Male	Dec 2020	157	-0.0125786	-1.3
16 to 17 years	Female	Dec 2020	180	0.2587413	25.9
16 to 17 years	Male	Jan 2021	175	0.1146497	11.5
16 to 17 years	Female	Jan 2021	149	-0.1722222	-17.2
16 to 17 years	Male	Feb 2021	143	-0.1828571	-18.3
16 to 17 years	Female	Feb 2021	135	-0.0939597	-9.4
16 to 17 years	Male	Mar 2021	126	-0.1188811	-11.9

For our linear trend analysis df_tb10_trend is created for an additional perspective of unemployment by demographic by year.

#Dependency on df_tb10_tdy:     Rate and Gender
#data transformation operation: from wide to long function: gather, package: tidyr  
df_tb10_trend <- df_tb10_tdy %>% 
  slice(c(14,26)) %>% 
  gather("Month_Year", "Rate", 2:14)

#data transformation operation: date format function: as.yearmon, package: zoo  
df_tb10_trend$Month_Year<- as.yearmon(df_tb10_trend$Month_Year) 
df_tb10_trend$Rate <- as.numeric(df_tb10_trend$Rate)

Label	Gender	Month_Year	Rate
Men, 16 years and over	Male	Mar 2020	13.6
Women, 16 years and over	Female	Mar 2020	16.1
Men, 16 years and over	Male	Apr 2020	12.2
Women, 16 years and over	Female	Apr 2020	14.5
Men, 16 years and over	Male	May 2020	10.5

Because the format of table 30 was slightly different from the latter, information it was tidied using the same overview but accounting for the specific rows and columns we required.

df_tb30 <- 
  as.data.frame(
    bind_rows(df_tb30[3,]
              ,df_tb30[5:33,])%>%
      select(-2,-3)%>%
                lapply(function(x){
                   gsub(" occupations", "", x)}
                        ))%>%
  `colnames<-`(c("Occupation_Label",
                 "Total_Unemployed_Rates_April_2020",
    "Total_Unemployed_Rates_April_2021", "Men_Unemployment_Rate_2020",
    "Men_Unemployment_Rate_2021", "Women_Unemployment_Rate_2020",
    "Women_Unemployment_Rate_2021"))

Industries are categorized as primary industry categories and sub categories in the original HTML file shown below

therefore 2 data frames were made with the gather() function from the tidyr package.

df_tb30_full : noting all main and sub industry names
df_tb30_sub : noting only the primary categories
df_tb30_rc : including rate change

Gender_Year column is created in the transformation with rate being transformed into a single column.

# All categories
df_tb30_full<-as.data.frame(df_tb30 %>% 
  select(-2,-3)%>%
    slice(c(2:30)) %>%
      gather("Gender_Year", "Rate", 2:5))%>%
        mutate(Rate = as.numeric(Rate))
# Only Primary categories
df_tb30_sub<-as.data.frame(df_tb30 %>% 
  select(-2,-3)%>%
    slice(c(3,6,15,21,24,28)) %>%
      gather("Gender_Year", "Rate", 2:5))%>%
        mutate(Rate = as.numeric(Rate))

An if else statement is used, referencing Gender_Year to categorize gender in our results.

df_tb30_full$Gender<-
  ifelse(grepl("Men", df_tb30_full$Gender_Year), "Male",
               ifelse(grepl("Women", df_tb30_full$Gender_Year), "Female", ""))

df_tb30_sub$Gender<-
  ifelse(grepl("Men", df_tb30_sub$Gender_Year), "Male",
         ifelse(grepl("Women", df_tb30_sub$Gender_Year), "Female", ""))

The Year column is added to df_tb30_sub using case_when() and endsWith()

df_tb30_sub <- 
  df_tb30_sub %>%
  mutate(Year = case_when(
    endsWith(Gender_Year, "2020") ~ "2020",
    endsWith(Gender_Year, "2021") ~ "2021"))

In order to create Rate_Change in df_tb30_rc we used the following calculation in mutate():

\(\frac{2021-2020}{2020} \times 100\)

df_tb30_rc<-
  df_tb30_sub %>%
    select(- 2) %>% 
      spread(Year,Rate)

df_tb30_rc<-df_tb30_rc%>%
  mutate(Rate_Change =
    round((((df_tb30_rc$'2021' - df_tb30_rc$'2020')/df_tb30_rc$'2020')*100),2))

The resulting data frames are as follows:

df_tb30_full

Occupation_Label	Gender_Year	Rate	Gender
Management, professional, and related	Men_Unemployment_Rate_2020	6.6	Male
Management, business, and financial operations	Men_Unemployment_Rate_2020	5.8	Male
Management	Men_Unemployment_Rate_2020	5.9	Male
Business and financial operations	Men_Unemployment_Rate_2020	5.3	Male
Professional and related	Men_Unemployment_Rate_2020	7.3	Male
Computer and mathematical	Men_Unemployment_Rate_2020	4.2	Male
Architecture and engineering	Men_Unemployment_Rate_2020	5.2	Male
Life, physical, and social science	Men_Unemployment_Rate_2020	7.1	Male
Community and social service	Men_Unemployment_Rate_2020	5.1	Male
Legal	Men_Unemployment_Rate_2020	4.8	Male

df_tb30_sub

Occupation_Label	Gender_Year	Rate	Gender	Year
Management, business, and financial operations	Men_Unemployment_Rate_2020	5.8	Male	2020
Professional and related	Men_Unemployment_Rate_2020	7.3	Male	2020
Service	Men_Unemployment_Rate_2020	23.6	Male	2020
Sales and office	Men_Unemployment_Rate_2020	12.6	Male	2020
Natural resources, construction, and maintenance	Men_Unemployment_Rate_2020	16.2	Male	2020
Production, transportation, and material moving	Men_Unemployment_Rate_2020	17.4	Male	2020
Management, business, and financial operations	Men_Unemployment_Rate_2021	2.9	Male	2021
Professional and related	Men_Unemployment_Rate_2021	3.1	Male	2021
Service	Men_Unemployment_Rate_2021	8.2	Male	2021
Sales and office	Men_Unemployment_Rate_2021	5.6	Male	2021

df_tb30_rc

Occupation_Label	Gender	2020	2021	Rate_Change
Management, business, and financial operations	Female	6.7	3.4	-49.25
Management, business, and financial operations	Male	5.8	2.9	-50.00
Natural resources, construction, and maintenance	Female	17.0	10.2	-40.00
Natural resources, construction, and maintenance	Male	16.2	7.8	-51.85
Production, transportation, and material moving	Female	21.1	9.0	-57.35
Production, transportation, and material moving	Male	17.4	7.8	-55.17
Professional and related	Female	9.9	2.9	-70.71
Professional and related	Male	7.3	3.1	-57.53
Sales and office	Female	16.1	5.6	-65.22
Sales and office	Male	12.6	5.6	-55.56
Service	Female	29.8	7.9	-73.49
Service	Male	23.6	8.2	-65.25

Finally for our analysis of married women vs. married men rates we utilize the rows 43 and 44 from df_tb10 and transform in a similar fashion as the above data.

df_tb10marital<-df_tb10[43:44,]
names(df_tb10marital)[1] <- "Marital Label"
df_tb10marital$`Marital Label`<- c("Male with Spouse Present","Female with Spouse Present")
df_tb10marital$Gender<-c("Male","Female")

df_tb10marital <- 
  as.data.frame(df_tb10marital %>% 
    gather("Month_Year", "Rate", 2:14))%>%
      mutate(Rate = as.numeric(Rate))%>%
        mutate(Month_Year = as.yearmon(Month_Year))

df_tb10marital

Marital Label	Gender	Month_Year	Rate
Male with Spouse Present	Male	Mar 2020	9.6
Female with Spouse Present	Female	Mar 2020	13.1
Male with Spouse Present	Male	Apr 2020	8.2
Female with Spouse Present	Female	Apr 2020	11.5
Male with Spouse Present	Male	May 2020	6.9

Exploratory Data Analysis

The data from Unemployment by Occupation and Genderseems to support that women were more significantly impacted by unemployment in 2020, than men in virtually all industries.

ggplot(df_tb30_full, aes(x=fct_reorder(factor(Occupation_Label),Rate), y = (Rate),color = Gender,
                         group = Gender, fill=Gender)) +
  geom_col(position="dodge") +
  theme_minimal()+
  theme(axis.text.x = element_text( hjust=1))+
  labs(title = 'Unemployment by Occupation and Gender'
       ,subtitle = "4/2020-4/2021",
       x = "Occupation", y = "Rate")+
  coord_flip()

However when running summary() on our data, we see an low R-squared value of 0.0118. This indicates that our data does not fit well in our regression model.

#testing linear model. issue: gender is categorical var- need to look into this
# only accounts for 1.18% of data, Multiple R-squared:  0.0118
summary(lm(Count~Gender, df_tb9_rt))

## 
## Call:
## lm(formula = Count ~ Gender, data = df_tb9_rt)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -2123  -1627  -1095   1431   5293 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6540.8      634.3  10.312 2.69e-10 ***
## GenderMale     480.2      897.1   0.535    0.597    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2287 on 24 degrees of freedom
## Multiple R-squared:  0.0118, Adjusted R-squared:  -0.02938 
## F-statistic: 0.2865 on 1 and 24 DF,  p-value: 0.5974

This fact is supported when running plot(), which results in various charts that we cannot used to support whether a relationship between unemployment and gender exists.

plot(lm(Count~Gender, df_tb9_rt))

## hat values (leverages) are all = 0.07692308
##  and there are no factor predictors; no plot no. 5

Analysis

Based on the the data seen in the below, women experienced a substantial loss in employment from the months of March to August 2020. This coincides with the first day of stay at home order in the United States.

Gender & Marital Status

Marital Status and Unemployment Rate

At the beginning of the pandemic the unemployment rates were higher in the married women than married men. As time progresses the unemployment rate decreases for both the married men and women. However, married women are still experiencing higher unemployment rate than the married men.

ggplot(df_tb10marital, aes(x=factor(Month_Year), y = (Rate),color = `Gender`,
   group = `Gender`, fill=`Gender`)) +
   geom_col(position="dodge") +
   theme_minimal()+
   theme(axis.text.x = element_text(hjust=1))+
   labs(title = 'Unemployment Rates by Gender and Marital Status'
        ,subtitle = "Married with Spouse Present",
        x = "Month-Year", y = "Rate")+
   coord_flip()

Gender & Age

Overall

We see that both male and females saw their largest unemployment spike in early March, women consistently having a higher rate monthly until about September 2020.

df_tb10_trend

ggplot(df_tb10_trend, aes(x=factor(Month_Year), y = (Rate),color = Gender,
                          group = Gender, fill=Gender)) +
  geom_col(position="dodge") +
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, hjust=1))+
  labs(title = 'Unemployment Rates by Gender'
       ,subtitle = "03/2020-03/2021",
       x = "Month", y = "Rate")

Additionally, when broken down into age groups, we can see 18 to 19 years, 16 to 17 years and 35 to 44 years of age where the highest for women and far exceeded the rates for men in the early months.

df_tb10_ln

(ggplot(data = df_tb10_ln, aes(Month_Year,Rate,color = Gender,group = Gender))+
  geom_line(size = 1) +
  geom_point() + 
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, hjust=1))+
  labs(title = 'Unemployment Rates by Age Group and Gender'
       ,subtitle = "03/2020-03/2021",
       x = "Month", y = "Rate")+
  facet_wrap( ~ Label, scales = "free_y"))

Month-over-month

Grouping and plotting our data by month, allows us to see the trend from March 2020 to March 2021. Doing so paints a clearer picture of the impact by demographic of COVID.

df_tb9_rt

ggplot(df_tb9_rt, aes(x=factor(Month_Year), y=MoM, fill=Gender)) + 
  geom_col(position="dodge") +
  scale_y_continuous(labels = scales::percent)+  
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, hjust=1))+
  labs(title = 'Month over Month Unemployment Rate Change by Gender'
       ,subtitle = "03/2020-03/2021, in thousands",
       x = "Month", y = "Rate Chnage")

df_tb9_MoM

##bar plot month over month age grp and geneder
(ggplot(df_tb9_MoM, aes(x=factor(Month_Year), y = (MoM), color = Gender,
                        group = Gender, fill=Gender)) +
  geom_col(position="dodge") +
  theme_minimal()+
  theme(axis.text.x = element_text(angle=90, hjust=1))+
  labs(title= 'Month over Month Unemployment Rate Change by Age Group and Gender',
       subtitle="03/2020-03/2021",
       x="Month", y="Rate Change")+
  facet_wrap( ~ Label, scales = "free_y"))

Gender & Occupation

When the data is stratified by industry, we see the trend holds true regardless of profession.

(ggplot(data = df_tb30_sub, aes(Year, Rate, color = Gender,
                                group = Gender, fill = Gender)) +
  geom_col(position="dodge") +
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, hjust=1))+
  labs(title = 'Unemployment Rates by Occupation and Gender'
       ,subtitle = "2020-2021",
       x = "Year", y = "Rate")+
  facet_wrap( ~ Occupation_Label, scales = "free_y",
              ncol = 3,labeller = labeller(Service = "Gender_Year")))

##bar plot month over month age group and gender
ggplot(df_tb30_rc, aes(x=factor(Occupation_Label), y = (Rate_Change), color = Gender,
                        group = Gender, fill=Gender)) +
  geom_col(position="dodge") +
  theme_minimal()+
  theme(axis.text.x = element_text(hjust=1))+
  labs(title= 'Rate Change Per Industry',
       subtitle="04/2020-04/2021",
       x="Industry", y="Rate Change")+coord_flip()

Conclusion

Our results support the title for the the article which inspired us, since it was published August 5,2020. At this point in the pandemic women were losing or leaving their jobs at a alarming rate. The concerns raised at the time were valid, and was at least 1 month away from producing data to supported the contrary. However, with the pandemic coming to a close (hopefully), we believe the trend should lean more favorably in women returning to the workforce.

DATA607 Final

Gabriella Martinez, Maliat Islam, Gabriel Campos

May 19 2021