This assignment examines the U.S. Unemployment Rates analysis written by Rajat Raj and published on Kaggle. That analysis uses the U.S. Unemployment data set, also hosted on Kaggle and authored by Guillem Servera.
The analysis looks at U.S. unemployment rates through a variety of lenses (data visualization and analysis methods), showing how different groups of people are affected in different ways. It ends by suggesting that men’s and women’s unemployment rates often move in tandem, but that there are periods of divergence that can highlight the unique circumstances each gender may experience in the workplace.
The original analysis produced over a dozen visualizations in Python, each requiring its own feature engineering. I will explore a few that I found the most interesting.
The U.S. Unemployment data was originally sourced from Kaggle. A copy has been stored on GitHub.
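library(tidyverse)  # dplyr, tidyr, ggplot2
library(lubridate)  # year(), month()
library(zoo)        # rollmean()
unemployment_sex_url <-
  "https://raw.githubusercontent.com/cdube89128/DATA-607/main/week-01/df_sex_unemployment_rates.csv"
df_unemployed_sex <- read.csv(unemployment_sex_url)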
I’m starting by exploring the structure of the data. I want to know which data types need cleanup and get an idea of what feature engineering might be applicable later. I would also check for errors in reading the CSV, but there were none with this source data.
summary(df_unemployed_sex)
##  date               overall_rate       men_rate           women_rate
##  Length:917         Min.   : 2.500     Min.   : 2.300     Min.   : 2.700
##  Class :character   1st Qu.: 4.400     1st Qu.: 4.200     1st Qu.: 4.800
##  Mode  :character   Median : 5.500     Median : 5.300     Median : 5.700
##                     Mean   : 5.695     Mean   : 5.568     Mean   : 5.949
##                     3rd Qu.: 6.700     3rd Qu.: 6.700     3rd Qu.: 6.900
##                     Max.   :14.800     Max.   :13.500     Max.   :16.200
##
##  men_16_17_rate     women_16_17_rate   men_16_19_rate     women_16_19_rate
##  Min.   : 6.30      Min.   : 5.00      Min.   : 6.40      Min.   : 5.80
##  1st Qu.:15.10      1st Qu.:14.60      1st Qu.:14.10      1st Qu.:12.80
##  Median :18.40      Median :17.20      Median :16.50      Median :15.10
##  Mean   :18.57      Mean   :17.08      Mean   :16.77      Mean   :15.11
##  3rd Qu.:21.40      3rd Qu.:19.70      3rd Qu.:19.00      3rd Qu.:17.40
##  Max.   :36.40      Max.   :35.90      Max.   :30.70      Max.   :37.50
##
##  men_18_19_rate     women_18_19_rate   men_16_24_rate     women_16_24_rate
##  Min.   : 4.50      Min.   : 4.50      Min.   : 4.80      Min.   : 4.70
##  1st Qu.:12.80      1st Qu.:11.60      1st Qu.:10.00      1st Qu.: 9.30
##  Median :15.20      Median :13.70      Median :11.90      Median :11.00
##  Mean   :15.55      Mean   :13.81      Mean   :12.15      Mean   :11.02
##  3rd Qu.:17.80      3rd Qu.:16.00      3rd Qu.:13.70      3rd Qu.:12.70
##  Max.   :31.20      Max.   :38.90      Max.   :24.60      Max.   :30.40
##
##  men_20_24_rate     women_20_24_rate   men_25plus_rate    women_25plus_rate
##  Min.   : 3.200     Min.   : 3.0       Min.   : 1.600     Min.   : 2.1
##  1st Qu.: 7.800     1st Qu.: 7.0       1st Qu.: 3.100     1st Qu.: 3.8
##  Median : 9.200     Median : 8.6       Median : 4.000     Median : 4.5
##  Mean   : 9.735     Mean   : 8.7       Mean   : 4.276     Mean   : 4.7
##  3rd Qu.:11.300     3rd Qu.:10.1       3rd Qu.: 5.100     3rd Qu.: 5.4
##  Max.   :23.100     Max.   :27.9       Max.   :12.100     Max.   :14.2
##
##  men_25_34_rate     women_25_34_rate   men_25_54_rate     women_25_54_rate
##  Min.   : 1.500     Min.   : 2.600     Min.   : 1.500     Min.   : 2.200
##  1st Qu.: 3.700     1st Qu.: 4.900     1st Qu.: 3.100     1st Qu.: 4.000
##  Median : 4.900     Median : 5.900     Median : 4.000     Median : 4.700
##  Mean   : 5.192     Mean   : 6.068     Mean   : 4.354     Mean   : 4.951
##  3rd Qu.: 6.400     3rd Qu.: 7.100     3rd Qu.: 5.300     3rd Qu.: 5.700
##  Max.   :14.200     Max.   :15.000     Max.   :12.100     Max.   :13.700
##
##  men_35_44_rate     women_35_44_rate   men_45_54_rate     women_45_54_rate
##  Min.   : 1.100     Min.   : 1.800     Min.   : 1.200     Min.   : 1.700
##  1st Qu.: 2.800     1st Qu.: 3.800     1st Qu.: 2.700     1st Qu.: 3.100
##  Median : 3.700     Median : 4.500     Median : 3.400     Median : 3.800
##  Mean   : 3.925     Mean   : 4.682     Mean   : 3.755     Mean   : 3.959
##  3rd Qu.: 4.800     3rd Qu.: 5.400     3rd Qu.: 4.500     3rd Qu.: 4.600
##  Max.   :10.500     Max.   :12.700     Max.   :11.300     Max.   :13.400
##
##  men_55plus_rate    women_55plus_rate
##  Min.   : 1.50      Min.   : 1.900
##  1st Qu.: 3.00      1st Qu.: 2.900
##  Median : 3.60      Median : 3.400
##  Mean   : 3.91      Mean   : 3.814
##  3rd Qu.: 4.50      3rd Qu.: 4.000
##  Max.   :12.10      Max.   :15.300
##                     NA's   :552
glimpse(df_unemployed_sex)
## Rows: 917
## Columns: 26
## $ date <chr> "1948-01-01", "1948-02-01", "1948-03-01", "1948-04-0…
## $ overall_rate <dbl> 3.4, 3.8, 4.0, 3.9, 3.5, 3.6, 3.6, 3.9, 3.8, 3.7, 3.…
## $ men_rate <dbl> 3.4, 3.6, 3.8, 3.8, 3.5, 3.3, 3.4, 3.6, 3.7, 3.6, 3.…
## $ women_rate <dbl> 3.3, 4.5, 4.4, 4.3, 3.7, 4.3, 4.2, 4.4, 4.1, 4.0, 3.…
## $ men_16_17_rate <dbl> 9.7, 13.0, 14.0, 11.6, 7.1, 11.3, 9.9, 9.8, 10.2, 7.…
## $ women_16_17_rate <dbl> 8.8, 13.2, 11.4, 10.6, 5.4, 12.9, 11.0, 7.6, 7.3, 8.…
## $ men_16_19_rate <dbl> 9.4, 10.8, 11.9, 9.8, 7.6, 9.3, 10.2, 10.4, 9.6, 9.4…
## $ women_16_19_rate <dbl> 7.2, 8.9, 8.6, 9.2, 6.1, 9.3, 9.0, 8.5, 7.6, 7.3, 8.…
## $ men_18_19_rate <dbl> 9.5, 9.2, 10.3, 8.6, 8.6, 7.4, 11.0, 10.3, 8.9, 10.5…
## $ women_18_19_rate <dbl> 6.8, 6.8, 7.3, 8.6, 7.0, 6.8, 7.2, 8.8, 7.3, 6.1, 8.…
## $ men_16_24_rate <dbl> 8.0, 8.6, 10.0, 8.6, 7.6, 7.8, 7.5, 7.7, 7.5, 6.9, 7…
## $ women_16_24_rate <dbl> 4.9, 6.3, 6.7, 6.7, 5.4, 6.5, 7.5, 6.2, 5.5, 5.7, 6.…
## $ men_20_24_rate <dbl> 7.2, 7.4, 9.0, 7.9, 7.6, 6.9, 6.0, 6.2, 6.3, 5.6, 5.…
## $ women_20_24_rate <dbl> 3.4, 4.5, 5.4, 5.0, 4.9, 4.6, 6.5, 4.7, 4.1, 4.6, 4.…
## $ men_25plus_rate <dbl> 2.5, 2.6, 2.6, 2.8, 2.7, 2.4, 2.5, 2.8, 2.9, 2.8, 3.…
## $ women_25plus_rate <dbl> 2.7, 3.7, 3.2, 3.5, 3.1, 3.5, 3.3, 4.0, 3.6, 3.4, 3.…
## $ men_25_34_rate <dbl> 2.6, 2.7, 2.7, 3.2, 2.9, 2.5, 2.4, 3.1, 2.8, 3.0, 3.…
## $ women_25_34_rate <dbl> 4.3, 5.1, 3.5, 3.8, 3.3, 4.2, 4.2, 4.7, 4.9, 4.5, 4.…
## $ men_25_54_rate <dbl> 2.3, 2.6, 2.6, 2.8, 2.5, 2.3, 2.4, 2.7, 2.8, 2.6, 2.…
## $ women_25_54_rate <dbl> 2.8, 3.7, 3.3, 3.5, 3.1, 3.6, 3.3, 4.1, 3.7, 3.6, 3.…
## $ men_35_44_rate <dbl> 2.1, 2.5, 2.6, 2.7, 2.4, 2.3, 2.4, 2.3, 2.6, 2.3, 2.…
## $ women_35_44_rate <dbl> 1.8, 2.6, 3.0, 3.5, 3.0, 3.3, 2.6, 4.1, 3.0, 3.0, 3.…
## $ men_45_54_rate <dbl> 2.3, 2.6, 2.4, 2.5, 2.3, 2.2, 2.2, 2.4, 3.2, 2.5, 2.…
## $ women_45_54_rate <dbl> 2.1, 3.3, 3.3, 3.1, 2.9, 3.1, 3.1, 3.4, 3.2, 3.0, 2.…
## $ men_55plus_rate <dbl> 3.0, 2.9, 2.8, 2.9, 3.1, 2.8, 2.9, 3.3, 3.4, 3.3, 3.…
## $ women_55plus_rate <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
# Seeing a lot of NAs in women_55plus_rate, so double-checking that there is data present
df_unemployed_sex %>%
  summarise_all(list(
    count = ~sum(is.na(.))
  ))
##   date_count overall_rate_count men_rate_count women_rate_count
## 1          0                  0              0                0
##   men_16_17_rate_count women_16_17_rate_count men_16_19_rate_count
## 1                    0                      0                    0
##   women_16_19_rate_count men_18_19_rate_count women_18_19_rate_count
## 1                      0                    0                      0
##   men_16_24_rate_count women_16_24_rate_count men_20_24_rate_count
## 1                    0                      0                    0
##   women_20_24_rate_count men_25plus_rate_count women_25plus_rate_count
## 1                      0                     0                       0
##   men_25_34_rate_count women_25_34_rate_count men_25_54_rate_count
## 1                    0                      0                    0
##   women_25_54_rate_count men_35_44_rate_count women_35_44_rate_count
## 1                      0                    0                      0
##   men_45_54_rate_count women_45_54_rate_count men_55plus_rate_count
## 1                    0                      0                     0
##   women_55plus_rate_count
## 1                     552
# Comparing to the total number of rows in the data frame
nrow(df_unemployed_sex)
## [1] 917
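The NA count shows 552 of 917 rows missing for women_55plus_rate, and the glimpse() output shows the gaps sitting at the start of the series. A minimal sketch of how one might check where a column's coverage begins is below. It runs on a small made-up data frame (`toy` and its values are illustrative assumptions, not the real data):

```r
library(dplyr)

# Toy stand-in for the real data frame; the values are made up for illustration
toy <- data.frame(
  date = as.Date(c("1948-01-01", "1948-02-01", "1948-03-01", "1948-04-01")),
  women_55plus_rate = c(NA, NA, 3.1, 2.9)
)

# Share of missing values and the first date that has an observed value
coverage <- toy %>%
  summarise(
    share_missing = mean(is.na(women_55plus_rate)),
    first_observed = min(date[!is.na(women_55plus_rate)])
  )

print(coverage)
```

Pointing the same summarise() call at df_unemployed_sex should show the date from which the women_55plus_rate series is actually populated.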
Only the date column was read in as a character vector instead of a date, so I convert it here. I also do some minor feature engineering, splitting the year and month into their own columns so that I can group by those values.
df_unemployed_sex <- df_unemployed_sex %>%
  mutate(date = as.Date(date),
         year = year(date),
         month = month(date)
  )
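As a quick sanity check on the conversion, one could confirm that no date string failed to parse. The sketch below is an assumption-laden illustration on made-up strings (`raw_dates` is hypothetical); it passes an explicit format so that unparseable values turn into NA rather than raising an error:

```r
library(lubridate)

# Made-up sample: the last string is deliberately not a valid date
raw_dates <- c("1948-01-01", "1948-02-01", "not-a-date")

# With an explicit format, strings that do not match become NA
parsed <- as.Date(raw_dates, format = "%Y-%m-%d")

# Number of non-missing strings that failed to parse
n_failed <- sum(is.na(parsed) & !is.na(raw_dates))

# year() and month() return plain numbers that are easy to group on
parts <- data.frame(date = parsed, year = year(parsed), month = month(parsed))

print(n_failed)
print(parts)
```

A count of zero on the real date column would confirm that every row converted cleanly.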
These are fairly straightforward plots for the audience to get their bearings on the data segmented by gender and by age group. They are based on the ones present in the original analysis.
# Plot for men and women
ggplot(df_unemployed_sex, aes(x = date)) +
  geom_line(aes(y = overall_rate, color = "Overall")) +
  geom_line(aes(y = men_rate, color = "Men")) +
  geom_line(aes(y = women_rate, color = "Women")) +
  labs(title = "Unemployment Rates Over Time",
       x = "Date",
       y = "Unemployment Rate (%)",
       color = "Rate Type") +
  scale_color_manual(values = c("Overall" = "black",
                                "Men" = "blue",
                                "Women" = "red")) +
  ylim(0, 40) +
  theme_minimal()
# Plot for youth age groups (16-19)
ggplot(df_unemployed_sex, aes(x = date)) +
  geom_line(aes(y = men_16_17_rate, color = "Men 16-17")) +
  geom_line(aes(y = women_16_17_rate, color = "Women 16-17")) +
  geom_line(aes(y = men_18_19_rate, color = "Men 18-19")) +
  geom_line(aes(y = women_18_19_rate, color = "Women 18-19")) +
  labs(title = "Youth Unemployment Rates by Age Group and Gender",
       x = "Date",
       y = "Unemployment Rate (%)",
       color = "Age Group") +
  ylim(0, 40) +
  theme_minimal() +
  theme(legend.position = "bottom")
# Plot for adult age groups (20+)
ggplot(df_unemployed_sex, aes(x = date)) +
  geom_line(aes(y = men_20_24_rate, color = "Men 20-24")) +
  geom_line(aes(y = women_20_24_rate, color = "Women 20-24")) +
  geom_line(aes(y = men_25_34_rate, color = "Men 25-34")) +
  geom_line(aes(y = women_25_34_rate, color = "Women 25-34")) +
  geom_line(aes(y = men_35_44_rate, color = "Men 35-44")) +
  geom_line(aes(y = women_35_44_rate, color = "Women 35-44")) +
  geom_line(aes(y = men_45_54_rate, color = "Men 45-54")) +
  geom_line(aes(y = women_45_54_rate, color = "Women 45-54")) +
  geom_line(aes(y = men_55plus_rate, color = "Men 55+")) +
  # Note: women_55plus_rate is left out because most of its values are NA (552 of 917)
  labs(title = "Adult Unemployment Rates by Age Group and Gender",
       x = "Date",
       y = "Unemployment Rate (%)",
       color = "Age Group") +
  ylim(0, 40) +
  theme_minimal() +
  theme(legend.position = "bottom")
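The plots above add one geom_line() per column. An alternative worth noting is to reshape the data to long form first, so that a single geom_line() draws every series. The sketch below is only an illustration under assumptions: `toy` is a made-up data frame that borrows the real column naming pattern, and the values are invented.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in with the same column naming pattern; values are made up
toy <- data.frame(
  date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")),
  men_20_24_rate = c(7.1, 7.4, 8.9),
  women_20_24_rate = c(6.2, 6.0, 8.1),
  men_25_34_rate = c(3.5, 3.6, 4.4)
)

# One row per date and series, so one geom_line() covers every column
toy_long <- toy %>%
  pivot_longer(cols = ends_with("_rate"),
               names_to = "series",
               values_to = "rate")

p <- ggplot(toy_long, aes(x = date, y = rate, color = series)) +
  geom_line() +
  labs(x = "Date", y = "Unemployment Rate (%)", color = "Age Group") +
  theme_minimal()

print(p)
```

The trade-off is that the legend then shows the raw column names (for example men_20_24_rate) unless they are recoded, whereas the per-column approach sets a readable label for each line directly.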
This is one of the original author’s visualizations that I found interesting. The author also singled out the columns they wanted to visualize. Because the assignment says “You should finish with a data frame that contains a subset of the columns in your selected dataset”, I am removing the columns that are no longer needed.
# Reduce columns down to just the ones the original author wanted to visualize
df_unemployed_sex <- df_unemployed_sex[,
  c('date', 'year', 'month', 'overall_rate', 'men_rate',
    'women_rate', 'men_16_17_rate', 'women_16_17_rate',
    'men_25_34_rate', 'women_25_34_rate', 'men_55plus_rate', 'women_55plus_rate')]
columns_to_visualize <- c('men_rate', 'women_rate', 'men_16_17_rate', 'women_16_17_rate',
                          'men_25_34_rate', 'women_25_34_rate', 'men_55plus_rate', 'women_55plus_rate')
# Group by year and calculate the mean of each column in columns_to_visualize
heatmap_data <- df_unemployed_sex %>%
  group_by(year) %>%
  summarise(across(all_of(columns_to_visualize),
                   ~ mean(.x, na.rm = TRUE))) %>%
  ungroup()
# Convert to long format for ggplot2
heatmap_long <- heatmap_data %>%
  pivot_longer(cols = -year,
               names_to = "category",
               values_to = "unemployment_rate")
# Create the heatmap
ggplot(heatmap_long, aes(x = year, y = category, fill = unemployment_rate)) +
  geom_tile() +
  scale_fill_gradientn(colors = hcl.colors(20, "YlGnBu"),
                       name = "Unemployment Rate (%)") +
  labs(title = "Unemployment Rates Heatmap: Gender & Age Groups Over Years",
       x = "Year",
       y = "Category") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5))
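One detail of the yearly averages is worth spelling out: for any year in which a column has no observed values at all, mean(..., na.rm = TRUE) returns NaN rather than a number. Given the 552 missing women_55plus_rate values, that likely applies to the early years of that series. A minimal sketch on made-up data (`toy`, `mean_rate`, and `n_observed` are illustrative names, not part of the original analysis):

```r
library(dplyr)

# Toy data: 1948 has no observed values for the series, 1949 has two
toy <- data.frame(
  year = c(1948, 1948, 1949, 1949),
  women_55plus_rate = c(NA, NA, 3.0, 4.0)
)

# Count the observed values alongside the mean so empty years are easy to spot
yearly <- toy %>%
  group_by(year) %>%
  summarise(
    n_observed = sum(!is.na(women_55plus_rate)),
    mean_rate = mean(women_55plus_rate, na.rm = TRUE)
  ) %>%
  ungroup()

print(yearly)
```

In the heatmap, tiles for such years should be drawn in the fill scale's na.value color (grey by default in ggplot2) rather than a color from the gradient, so a grey band marks missing data and not a low rate.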
This is another visualization that I found very interesting. Implementing it meant calculating the 12-month moving averages and storing them on the data frame as well.
# Calculate 12-month moving averages
df_unemployed_sex <- df_unemployed_sex %>%
  arrange(date) %>%
  mutate(
    overall_moving_avg = rollmean(overall_rate, k = 12, fill = NA, align = "right"),
    men_moving_avg = rollmean(men_rate, k = 12, fill = NA, align = "right"),
    women_moving_avg = rollmean(women_rate, k = 12, fill = NA, align = "right")
  )
# Plot
ggplot(df_unemployed_sex, aes(x = date)) +
  geom_line(aes(y = overall_moving_avg, color = "Overall", linetype = "Overall"), linewidth = 1) +
  geom_line(aes(y = men_moving_avg, color = "Men", linetype = "Men"), linewidth = 1) +
  geom_line(aes(y = women_moving_avg, color = "Women", linetype = "Women"), linewidth = 1) +
  labs(title = "12-Month Moving Average of Unemployment Rates Over Time",
       x = "Date",
       y = "Unemployment Rate (%)",
       color = "Category",
       linetype = "Category") +
  scale_color_manual(values = c("Overall" = "black",
                                "Men" = "blue",
                                "Women" = "orange")) +
  scale_linetype_manual(values = c("Overall" = "solid",
                                   "Men" = "dashed",
                                   "Women" = "dashed")) +
  theme_minimal() +
  theme(
    panel.grid.major = element_line(linewidth = 0.5, linetype = "dashed"),
    panel.grid.minor = element_blank(),
    legend.position = "bottom"
  )
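To make the right-aligned window concrete, here is a small worked example of rollmean() on a made-up vector, using a window of 3 instead of 12 so the arithmetic is easy to follow (`x` and `trailing` are illustrative names):

```r
library(zoo)

# Made-up values standing in for a monthly series
x <- c(1, 2, 3, 4, 5)

# Right-aligned window of 3: each result averages the current point and the two before it
trailing <- rollmean(x, k = 3, fill = NA, align = "right")

print(trailing)
# NA NA 2 3 4
```

With k = 12 the same logic leaves the first 11 months as NA, since a full year of history is not yet available; ggplot2 leaves those points out of the line, typically with a warning about removed rows.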
This was just a snippet of what Rajat Raj originally did in their analysis. I really enjoyed how many perspectives the original took on the data. However, I am slightly confused by the age ranges they chose to focus on. While I understand that including every age range would have made the graphs visually overwhelming, leaving out the middle-aged unemployment rates (the 35-44 and 45-54 groups) felt odd, and I did not find a clear rationale for that choice.
As an aside, this source data didn’t require much cleaning, so I leaned slightly more into feature engineering and into trimming the dataset down to only what was needed.