Deliverables Schedule
| Deliverable | Date | Points |
|---|---|---|
| One Paragraph Proposal | Sunday April 26 | 20 |
| Final Project | Wednesday May 19 | 120 |
| Final Project Presentation | Before or during Final Meetup on Wednesday May 19 | 30 |
Policy on Collaboration
You may work in a team of up to three people. Each project team member is responsible for understanding and being able to explain all of the submitted project code. Remember that you can take work that you find elsewhere as a base to build on, but you need to acknowledge the source, so that I base your grade on what you contributed, not on what you started with!
Approval Meeting
Once you’ve turned in your one paragraph proposal, I want to schedule a 15 minute phone meeting with each person or team (starting when we return from Thanksgiving break), where you’ll describe the reason (benefit) for doing this work and/or question you’re seeking to answer, where you’ll source the data, and the overall flow. For team projects, I also want you to articulate the roles and responsibilities of each team member.
Final Project Checklist
To receive full credit, you’ll need to deliver on all of the items in the checklist below. Please read carefully through this checklist before you make your project proposal. You are (within these checklist constraints) strongly urged to limit scope and make the necessary simplifying assumptions so that you can deliver your work on time!
With COVID-19 impacting employment worldwide, many articles have been published highlighting the toll it has taken on specific demographics. Our motivation to dive deeper into the impact by demographic, has been influenced by articles such as COVID-19 has spurred America’s first female recession. This article talk about the negative affect the pandemic has had on women in the United States. Using data from the U.S. Bureau of Labor Statistics, we hope to identify if the impact was disproportionate or great to women vs. men.
In order to conduct our studies we used data from U.S. Bureau of Labor Statistics | Labor Force Statistics from the Current Population Survey specifically data regarding:
The data was web scraped then encapsulated using the functions as.dataframe(), html_table() and read_html, then subsetted.
url_9 <- "https://www.bls.gov/web/empsit/cpseea09.htm"
url_10 <- "https://www.bls.gov/web/empsit/cpseea10.htm"
url_30 <- "https://www.bls.gov/web/empsit/cpseea30.htm"Table 9 Data
head(df_tb9 <- as.data.frame(html_table(read_html(url_9))[[2]]),5)%>%
reactable(highlight = TRUE, pagination = FALSE, height = 450)
________________________________________________________________________________
Table 10 Data
head(df_tb10 <- as.data.frame(html_table(read_html(url_10))[[2]]),5)%>%
reactable(highlight = TRUE, pagination = FALSE, height = 450)Table 30 Data
Data manipulation and tidying for Table 9 included:
relabeling column values to reflect a (4 digit: year) - (2 digit: month) (%y%m) format.
subsetting data frames to remove unnecessary data or blank rows with the help of piping %>%
categorizing data with a Gender column, that included a Male,Female and Total value respectively.
Note: Total value was used to specify data that reflects total unemployment counts.
colnames(df_tb10) <- c("Label","2020-03", "2020-04",
"2020-05", "2020-06", "2020-07",
"2020-08", "2020-09", "2020-10",
"2020-11", "2020-12","2021-01",
"2021-02", "2021-03")
colnames(df_tb9) <- c("Label", "2020-03", "2020-04",
"2020-05", "2020-06", "2020-07",
"2020-08", "2020-09", "2020-10",
"2020-11", "2020-12","2021-01",
"2021-02", "2021-03")df_tb9_tdy<-
as.data.frame(df_tb9[1,]%>%
assign_in(list(1,1),"Month")%>%
bind_rows(df_tb9[3:14,],
df_tb9[16:27,],
df_tb9[29:40,])
)
# Adding male and female category
df_tb9_tdy$Gender<-"Total"
df_tb9_tdy$Gender[14:25] <- "Male"
df_tb9_tdy$Gender[26:37] <- "Female"And repeated for Table 10.
df_tb10_tdy<-
as.data.frame(df_tb10[1,]%>%
assign_in(list(1,1),"Month")%>%
bind_rows(df_tb10[3:14,],
df_tb10[16:27,],
df_tb10[29:40,])
)
# Adding male and female category
df_tb10_tdy$Gender<-"Total"
df_tb10_tdy$Gender[14:25] <- "Male"
df_tb10_tdy$Gender[26:37] <- "Female"This tidy data (df_tb9_tdy,df_tb10_tdy) is the baseline used for our evaluation of the comparison between unemployment rates of men and women, during the beginning month until present day, of the pandemic.
Using ZevRoss: Know Your Data | Easy multi-panel plots in R using facet_wrap() and facet_grid() from ggplot2 as a point of reference, we continue to transform the data.
Specifically:
regex is used within the gsub() function, to remove , from our Rate column.Month and Rate is then transformed to data types that will allow for further analysis.#Dependency on df_tb10_tdy Rate, Gender, Age Group
df_tb10_ln <- as.data.frame(df_tb10_tdy %>%
slice(c(16,17,19,22,23,24,25,28,29,31,34,35,36,37)) %>%
gather("Month_Year", "Rate", 2:14)%>%
lapply(function(x){
gsub(",", "", x)}))df_tb10_ln$Month_Year<- as.yearmon(df_tb10_ln$Month_Year)
df_tb10_ln$Rate <- as.numeric(df_tb10_ln$Rate)The data is stored to df_tb10_ln. When sorted, it is easy to see that the highest unemployment rates overall, did occur among women between the ages of 16 to 19, during the months of March and April of 2020, by a small margin over men 18 to 19 in March & April 2020.
| Label | Gender | Month_Year | Rate |
|---|---|---|---|
| 18 to 19 years | Female | Mar 2020 | 38.7 |
| 16 to 17 years | Female | Mar 2020 | 33.3 |
| 16 to 17 years | Female | Apr 2020 | 31.0 |
| 18 to 19 years | Male | Mar 2020 | 30.7 |
| 18 to 19 years | Female | Apr 2020 | 30.3 |
| 18 to 19 years | Male | Apr 2020 | 28.8 |
From here we calculate rates and store the values to df_tb9_rt. The data is transformed in a month over month format, so that it can be assessed as a trend throughout the year. The functions used to create this data frame included:
%>%slice()gather()mutate()This data frame is grouped by Gender specifically.
Dates and counts are converted with as.yearmon() & as.numeric().
The calculations for our month-over-month (MoM) values and month-over-month rates (MoM_Rates) were as follows:
\(MoM= \frac{Count - lag(Count)}{lag(Count)}\) \(MoM\_Rate =MoM\times100 :rounded\ to\ the\ tenth\)
where lag() is a time series shift from the stats package.
# Month over Month employment rate change by gender
#Dependency on df_tb9_tdy: Rate and Gender
df_tb9_rt <- df_tb9_tdy %>%
slice(c(14,26)) %>%
gather("Month_Year", "Count", 2:14)
df_tb9_rt$Month_Year<- as.yearmon(df_tb9_rt$Month_Year)
df_tb9_rt$Count <- as.numeric(gsub(",","",df_tb9_rt$Count))
df_tb9_rt <-
df_tb9_rt %>%
group_by(Gender) %>%
mutate(MoM = (Count - lag(Count)) / lag(Count)) %>%
mutate(MoM_Rate = round(MoM * 100, 2))The result is a data frame with rates for ages 16 and over (or all age groups).
| Label | Gender | Month_Year | Count | MoM | MoM_Rate |
|---|---|---|---|---|---|
| Women, 16 years and over | Female | Nov 2020 | 5042 | 0.0430285 | 4.30 |
| Men, 16 years and over | Male | Mar 2021 | 5394 | 0.0301757 | 3.02 |
| Men, 16 years and over | Male | Jan 2021 | 5395 | -0.0071770 | -0.72 |
| Men, 16 years and over | Male | Oct 2020 | 5894 | -0.0107419 | -1.07 |
| Women, 16 years and over | Female | Mar 2021 | 4418 | -0.0125168 | -1.25 |
| Women, 16 years and over | Female | Feb 2021 | 4474 | -0.0225038 | -2.25 |
Following a similar approach, but grouping by Gender & Label we create data frame df_tb9_MoM and can see the impact for each age bracket.
# Month over Month employment rate change by age and gender
#Dependency on df_tb9_tdy: Count, age, and Gender
df_tb9_MoM <- df_tb9_tdy %>%
slice(c(16,17,19,22,23,24,25,28,29,31,34,35,36,37)) %>%
gather("Month_Year", "Count", 2:14)
df_tb9_MoM$Month_Year<- as.yearmon(df_tb9_MoM$Month_Year)
df_tb9_MoM$Count <- as.numeric(gsub(",","",df_tb9_MoM$Count))
df_tb9_MoM <- df_tb9_MoM %>%
group_by(Gender, Label) %>%
mutate(MoM = (Count - lag(Count)) / lag(Count)) %>%
mutate(MoM_Rate = round(MoM * 100, 1))| Label | Gender | Month_Year | Count | MoM | MoM_Rate |
|---|---|---|---|---|---|
| 16 to 17 years | Male | Oct 2020 | 146 | -0.1360947 | -13.6 |
| 16 to 17 years | Female | Oct 2020 | 168 | 0.2923077 | 29.2 |
| 16 to 17 years | Male | Nov 2020 | 159 | 0.0890411 | 8.9 |
| 16 to 17 years | Female | Nov 2020 | 143 | -0.1488095 | -14.9 |
| 16 to 17 years | Male | Dec 2020 | 157 | -0.0125786 | -1.3 |
| 16 to 17 years | Female | Dec 2020 | 180 | 0.2587413 | 25.9 |
| 16 to 17 years | Male | Jan 2021 | 175 | 0.1146497 | 11.5 |
| 16 to 17 years | Female | Jan 2021 | 149 | -0.1722222 | -17.2 |
| 16 to 17 years | Male | Feb 2021 | 143 | -0.1828571 | -18.3 |
| 16 to 17 years | Female | Feb 2021 | 135 | -0.0939597 | -9.4 |
| 16 to 17 years | Male | Mar 2021 | 126 | -0.1188811 | -11.9 |
For our linear trend analysis df_tb10_trend is created for an additional perspective of unemployment by demographic by year.
#Dependency on df_tb10_tdy: Rate and Gender
#data transformation operation: from wide to long function: gather, package: tidyr
df_tb10_trend <- df_tb10_tdy %>%
slice(c(14,26)) %>%
gather("Month_Year", "Rate", 2:14)
#data transformation operation: date format function: as.yearmon, package: zoo
df_tb10_trend$Month_Year<- as.yearmon(df_tb10_trend$Month_Year)
df_tb10_trend$Rate <- as.numeric(df_tb10_trend$Rate)| Label | Gender | Month_Year | Rate |
|---|---|---|---|
| Men, 16 years and over | Male | Mar 2020 | 13.6 |
| Women, 16 years and over | Female | Mar 2020 | 16.1 |
| Men, 16 years and over | Male | Apr 2020 | 12.2 |
| Women, 16 years and over | Female | Apr 2020 | 14.5 |
| Men, 16 years and over | Male | May 2020 | 10.5 |
Because the format of table 30 was slightly different from the latter, information it was tidied using the same overview but accounting for the specific rows and columns we required.
df_tb30 <-
as.data.frame(
bind_rows(df_tb30[3,]
,df_tb30[5:33,])%>%
select(-2,-3)%>%
lapply(function(x){
gsub(" occupations", "", x)}
))%>%
`colnames<-`(c("Occupation_Label",
"Total_Unemployed_Rates_April_2020",
"Total_Unemployed_Rates_April_2021", "Men_Unemployment_Rate_2020",
"Men_Unemployment_Rate_2021", "Women_Unemployment_Rate_2020",
"Women_Unemployment_Rate_2021"))Industries are categorized as primary industry categories and sub categories in the original HTML file shown below
therefore 2 data frames were made with the gather() function from the tidyr package.
df_tb30_full : noting all main and sub industry namesdf_tb30_sub : noting only the primary categoriesdf_tb30_rc : including rate changeGender_Year column is created in the transformation with rate being transformed into a single column.
# All categories
df_tb30_full<-as.data.frame(df_tb30 %>%
select(-2,-3)%>%
slice(c(2:30)) %>%
gather("Gender_Year", "Rate", 2:5))%>%
mutate(Rate = as.numeric(Rate))
# Only Primary categories
df_tb30_sub<-as.data.frame(df_tb30 %>%
select(-2,-3)%>%
slice(c(3,6,15,21,24,28)) %>%
gather("Gender_Year", "Rate", 2:5))%>%
mutate(Rate = as.numeric(Rate))An if else statement is used, referencing Gender_Year to categorize gender in our results.
df_tb30_full$Gender<-
ifelse(grepl("Men", df_tb30_full$Gender_Year), "Male",
ifelse(grepl("Women", df_tb30_full$Gender_Year), "Female", ""))
df_tb30_sub$Gender<-
ifelse(grepl("Men", df_tb30_sub$Gender_Year), "Male",
ifelse(grepl("Women", df_tb30_sub$Gender_Year), "Female", ""))The Year column is added to df_tb30_sub using case_when() and endsWith()
df_tb30_sub <-
df_tb30_sub %>%
mutate(Year = case_when(
endsWith(Gender_Year, "2020") ~ "2020",
endsWith(Gender_Year, "2021") ~ "2021"))In order to create Rate_Change in df_tb30_rc we used the following calculation in mutate():
\(\frac{2021-2020}{2020} \times 100\)
df_tb30_rc<-
df_tb30_sub %>%
select(- 2) %>%
spread(Year,Rate)
df_tb30_rc<-df_tb30_rc%>%
mutate(Rate_Change =
round((((df_tb30_rc$'2021' - df_tb30_rc$'2020')/df_tb30_rc$'2020')*100),2))The resulting data frames are as follows:
df_tb30_full
| Occupation_Label | Gender_Year | Rate | Gender |
|---|---|---|---|
| Management, professional, and related | Men_Unemployment_Rate_2020 | 6.6 | Male |
| Management, business, and financial operations | Men_Unemployment_Rate_2020 | 5.8 | Male |
| Management | Men_Unemployment_Rate_2020 | 5.9 | Male |
| Business and financial operations | Men_Unemployment_Rate_2020 | 5.3 | Male |
| Professional and related | Men_Unemployment_Rate_2020 | 7.3 | Male |
| Computer and mathematical | Men_Unemployment_Rate_2020 | 4.2 | Male |
| Architecture and engineering | Men_Unemployment_Rate_2020 | 5.2 | Male |
| Life, physical, and social science | Men_Unemployment_Rate_2020 | 7.1 | Male |
| Community and social service | Men_Unemployment_Rate_2020 | 5.1 | Male |
| Legal | Men_Unemployment_Rate_2020 | 4.8 | Male |
df_tb30_sub
| Occupation_Label | Gender_Year | Rate | Gender | Year |
|---|---|---|---|---|
| Management, business, and financial operations | Men_Unemployment_Rate_2020 | 5.8 | Male | 2020 |
| Professional and related | Men_Unemployment_Rate_2020 | 7.3 | Male | 2020 |
| Service | Men_Unemployment_Rate_2020 | 23.6 | Male | 2020 |
| Sales and office | Men_Unemployment_Rate_2020 | 12.6 | Male | 2020 |
| Natural resources, construction, and maintenance | Men_Unemployment_Rate_2020 | 16.2 | Male | 2020 |
| Production, transportation, and material moving | Men_Unemployment_Rate_2020 | 17.4 | Male | 2020 |
| Management, business, and financial operations | Men_Unemployment_Rate_2021 | 2.9 | Male | 2021 |
| Professional and related | Men_Unemployment_Rate_2021 | 3.1 | Male | 2021 |
| Service | Men_Unemployment_Rate_2021 | 8.2 | Male | 2021 |
| Sales and office | Men_Unemployment_Rate_2021 | 5.6 | Male | 2021 |
df_tb30_rc
| Occupation_Label | Gender | 2020 | 2021 | Rate_Change |
|---|---|---|---|---|
| Management, business, and financial operations | Female | 6.7 | 3.4 | -49.25 |
| Management, business, and financial operations | Male | 5.8 | 2.9 | -50.00 |
| Natural resources, construction, and maintenance | Female | 17.0 | 10.2 | -40.00 |
| Natural resources, construction, and maintenance | Male | 16.2 | 7.8 | -51.85 |
| Production, transportation, and material moving | Female | 21.1 | 9.0 | -57.35 |
| Production, transportation, and material moving | Male | 17.4 | 7.8 | -55.17 |
| Professional and related | Female | 9.9 | 2.9 | -70.71 |
| Professional and related | Male | 7.3 | 3.1 | -57.53 |
| Sales and office | Female | 16.1 | 5.6 | -65.22 |
| Sales and office | Male | 12.6 | 5.6 | -55.56 |
| Service | Female | 29.8 | 7.9 | -73.49 |
| Service | Male | 23.6 | 8.2 | -65.25 |
Finally for our analysis of married women vs. married men rates we utilize the rows 43 and 44 from df_tb10 and transform in a similar fashion as the above data.
df_tb10marital<-df_tb10[43:44,]
names(df_tb10marital)[1] <- "Marital Label"
df_tb10marital$`Marital Label`<- c("Male with Spouse Present","Female with Spouse Present")
df_tb10marital$Gender<-c("Male","Female")
df_tb10marital <-
as.data.frame(df_tb10marital %>%
gather("Month_Year", "Rate", 2:14))%>%
mutate(Rate = as.numeric(Rate))%>%
mutate(Month_Year = as.yearmon(Month_Year))df_tb10marital
| Marital Label | Gender | Month_Year | Rate |
|---|---|---|---|
| Male with Spouse Present | Male | Mar 2020 | 9.6 |
| Female with Spouse Present | Female | Mar 2020 | 13.1 |
| Male with Spouse Present | Male | Apr 2020 | 8.2 |
| Female with Spouse Present | Female | Apr 2020 | 11.5 |
| Male with Spouse Present | Male | May 2020 | 6.9 |
The data from Unemployment by Occupation and Genderseems to support that women were more significantly impacted by unemployment in 2020, than men in virtually all industries.
ggplot(df_tb30_full, aes(x=fct_reorder(factor(Occupation_Label),Rate), y = (Rate),color = Gender,
group = Gender, fill=Gender)) +
geom_col(position="dodge") +
theme_minimal()+
theme(axis.text.x = element_text( hjust=1))+
labs(title = 'Unemployment by Occupation and Gender'
,subtitle = "4/2020-4/2021",
x = "Occupation", y = "Rate")+
coord_flip()However when running summary() on our data, we see an low R-squared value of 0.0118. This indicates that our data does not fit well in our regression model.
#testing linear model. issue: gender is categorical var- need to look into this
# only accounts for 1.18% of data, Multiple R-squared: 0.0118
summary(lm(Count~Gender, df_tb9_rt))##
## Call:
## lm(formula = Count ~ Gender, data = df_tb9_rt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2123 -1627 -1095 1431 5293
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6540.8 634.3 10.312 2.69e-10 ***
## GenderMale 480.2 897.1 0.535 0.597
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2287 on 24 degrees of freedom
## Multiple R-squared: 0.0118, Adjusted R-squared: -0.02938
## F-statistic: 0.2865 on 1 and 24 DF, p-value: 0.5974
This fact is supported when running plot(), which results in various charts that we cannot used to support whether a relationship between unemployment and gender exists.
## hat values (leverages) are all = 0.07692308
## and there are no factor predictors; no plot no. 5
Based on the the data seen in the below, women experienced a substantial loss in employment from the months of March to August 2020. This coincides with the first day of stay at home order in the United States.
Marital Status and Unemployment Rate
At the beginning of the pandemic the unemployment rates were higher in the married women than married men. As time progresses the unemployment rate decreases for both the married men and women. However, married women are still experiencing higher unemployment rate than the married men.
ggplot(df_tb10marital, aes(x=factor(Month_Year), y = (Rate),color = `Gender`,
group = `Gender`, fill=`Gender`)) +
geom_col(position="dodge") +
theme_minimal()+
theme(axis.text.x = element_text(hjust=1))+
labs(title = 'Unemployment Rates by Gender and Marital Status'
,subtitle = "Married with Spouse Present",
x = "Month-Year", y = "Rate")+
coord_flip()We see that both male and females saw their largest unemployment spike in early March, women consistently having a higher rate monthly until about September 2020.
df_tb10_trend
ggplot(df_tb10_trend, aes(x=factor(Month_Year), y = (Rate),color = Gender,
group = Gender, fill=Gender)) +
geom_col(position="dodge") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, hjust=1))+
labs(title = 'Unemployment Rates by Gender'
,subtitle = "03/2020-03/2021",
x = "Month", y = "Rate")Additionally, when broken down into age groups, we can see 18 to 19 years, 16 to 17 years and 35 to 44 years of age where the highest for women and far exceeded the rates for men in the early months.
df_tb10_ln
(ggplot(data = df_tb10_ln, aes(Month_Year,Rate,color = Gender,group = Gender))+
geom_line(size = 1) +
geom_point() +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, hjust=1))+
labs(title = 'Unemployment Rates by Age Group and Gender'
,subtitle = "03/2020-03/2021",
x = "Month", y = "Rate")+
facet_wrap( ~ Label, scales = "free_y"))Grouping and plotting our data by month, allows us to see the trend from March 2020 to March 2021. Doing so paints a clearer picture of the impact by demographic of COVID.
df_tb9_rt
ggplot(df_tb9_rt, aes(x=factor(Month_Year), y=MoM, fill=Gender)) +
geom_col(position="dodge") +
scale_y_continuous(labels = scales::percent)+
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, hjust=1))+
labs(title = 'Month over Month Unemployment Rate Change by Gender'
,subtitle = "03/2020-03/2021, in thousands",
x = "Month", y = "Rate Chnage")df_tb9_MoM
##bar plot month over month age grp and geneder
(ggplot(df_tb9_MoM, aes(x=factor(Month_Year), y = (MoM), color = Gender,
group = Gender, fill=Gender)) +
geom_col(position="dodge") +
theme_minimal()+
theme(axis.text.x = element_text(angle=90, hjust=1))+
labs(title= 'Month over Month Unemployment Rate Change by Age Group and Gender',
subtitle="03/2020-03/2021",
x="Month", y="Rate Change")+
facet_wrap( ~ Label, scales = "free_y"))When the data is stratified by industry, we see the trend holds true regardless of profession.
(ggplot(data = df_tb30_sub, aes(Year, Rate, color = Gender,
group = Gender, fill = Gender)) +
geom_col(position="dodge") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, hjust=1))+
labs(title = 'Unemployment Rates by Occupation and Gender'
,subtitle = "2020-2021",
x = "Year", y = "Rate")+
facet_wrap( ~ Occupation_Label, scales = "free_y",
ncol = 3,labeller = labeller(Service = "Gender_Year")))##bar plot month over month age group and gender
ggplot(df_tb30_rc, aes(x=factor(Occupation_Label), y = (Rate_Change), color = Gender,
group = Gender, fill=Gender)) +
geom_col(position="dodge") +
theme_minimal()+
theme(axis.text.x = element_text(hjust=1))+
labs(title= 'Rate Change Per Industry',
subtitle="04/2020-04/2021",
x="Industry", y="Rate Change")+coord_flip()Our results support the title for the the article which inspired us, since it was published August 5,2020. At this point in the pandemic women were losing or leaving their jobs at a alarming rate. The concerns raised at the time were valid, and was at least 1 month away from producing data to supported the contrary. However, with the pandemic coming to a close (hopefully), we believe the trend should lean more favorably in women returning to the workforce.