DATA 607 Assignment 1

Introduction

This assignment examines the U.S. Unemployment Rates analysis published on Kaggle by Rajat Raj. That analysis uses the U.S. Unemployment data set, also hosted on Kaggle and authored by Guillem Servera.

The analysis views U.S. unemployment rates through a variety of lenses (data visualization and analysis methods) to show how different groups of people are affected in different ways. It concludes that men’s and women’s unemployment rates often move in tandem, with periods of divergence that can highlight circumstances unique to each gender in the workplace.

The original analysis produced over a dozen visualizations in Python, each requiring its own feature engineering. I will explore a few that I found the most interesting.

Load (Raw) Data

The U.S. Unemployment data was originally sourced from Kaggle. A copy is stored on GitHub.

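library(tidyverse)  # dplyr, tidyr, ggplot2
library(lubridate)  # year(), month()
library(zoo)        # rollmean()
unemployment_sex_url <- 
  "https://raw.githubusercontent.com/cdube89128/DATA-607/main/week-01/df_sex_unemployment_rates.csv"
df_unemployed_sex <- read.csv(unemployment_sex_url)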
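A basic check on the loaded file might look like the sketch below. It parses a small inline CSV rather than the real file, so it runs without network access. The helper name `check_unemployment_csv` and its `required` argument are hypothetical and not part of the original analysis.

```r
# Toy CSV standing in for the real file (first three rows of four columns)
csv_text <- "date,overall_rate,men_rate,women_rate
1948-01-01,3.4,3.4,3.3
1948-02-01,3.8,3.6,4.5
1948-03-01,4.0,3.8,4.4"

toy <- read.csv(text = csv_text)

# Hypothetical helper: TRUE when the required columns exist and no date repeats
check_unemployment_csv <- function(df, required = c("date", "overall_rate")) {
  has_cols <- all(required %in% names(df))
  has_cols && anyDuplicated(df$date) == 0
}

check_unemployment_csv(toy)
```

The same function could be called on `df_unemployed_sex` right after `read.csv()`, assuming the column names shown in the output below.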

Exploratory Analysis

I’m starting by exploring the structure of the data. I want to know which data types need cleanup and get an idea of what feature engineering might be applicable later. I would normally also check for errors in reading the CSV, but there were none with this source data.

summary(df_unemployed_sex)

##      date            overall_rate       men_rate        women_rate    
##  Length:917         Min.   : 2.500   Min.   : 2.300   Min.   : 2.700  
##  Class :character   1st Qu.: 4.400   1st Qu.: 4.200   1st Qu.: 4.800  
##  Mode  :character   Median : 5.500   Median : 5.300   Median : 5.700  
##                     Mean   : 5.695   Mean   : 5.568   Mean   : 5.949  
##                     3rd Qu.: 6.700   3rd Qu.: 6.700   3rd Qu.: 6.900  
##                     Max.   :14.800   Max.   :13.500   Max.   :16.200  
##                                                                       
##  men_16_17_rate  women_16_17_rate men_16_19_rate  women_16_19_rate
##  Min.   : 6.30   Min.   : 5.00    Min.   : 6.40   Min.   : 5.80   
##  1st Qu.:15.10   1st Qu.:14.60    1st Qu.:14.10   1st Qu.:12.80   
##  Median :18.40   Median :17.20    Median :16.50   Median :15.10   
##  Mean   :18.57   Mean   :17.08    Mean   :16.77   Mean   :15.11   
##  3rd Qu.:21.40   3rd Qu.:19.70    3rd Qu.:19.00   3rd Qu.:17.40   
##  Max.   :36.40   Max.   :35.90    Max.   :30.70   Max.   :37.50   
##                                                                   
##  men_18_19_rate  women_18_19_rate men_16_24_rate  women_16_24_rate
##  Min.   : 4.50   Min.   : 4.50    Min.   : 4.80   Min.   : 4.70   
##  1st Qu.:12.80   1st Qu.:11.60    1st Qu.:10.00   1st Qu.: 9.30   
##  Median :15.20   Median :13.70    Median :11.90   Median :11.00   
##  Mean   :15.55   Mean   :13.81    Mean   :12.15   Mean   :11.02   
##  3rd Qu.:17.80   3rd Qu.:16.00    3rd Qu.:13.70   3rd Qu.:12.70   
##  Max.   :31.20   Max.   :38.90    Max.   :24.60   Max.   :30.40   
##                                                                   
##  men_20_24_rate   women_20_24_rate men_25plus_rate  women_25plus_rate
##  Min.   : 3.200   Min.   : 3.0     Min.   : 1.600   Min.   : 2.1     
##  1st Qu.: 7.800   1st Qu.: 7.0     1st Qu.: 3.100   1st Qu.: 3.8     
##  Median : 9.200   Median : 8.6     Median : 4.000   Median : 4.5     
##  Mean   : 9.735   Mean   : 8.7     Mean   : 4.276   Mean   : 4.7     
##  3rd Qu.:11.300   3rd Qu.:10.1     3rd Qu.: 5.100   3rd Qu.: 5.4     
##  Max.   :23.100   Max.   :27.9     Max.   :12.100   Max.   :14.2     
##                                                                      
##  men_25_34_rate   women_25_34_rate men_25_54_rate   women_25_54_rate
##  Min.   : 1.500   Min.   : 2.600   Min.   : 1.500   Min.   : 2.200  
##  1st Qu.: 3.700   1st Qu.: 4.900   1st Qu.: 3.100   1st Qu.: 4.000  
##  Median : 4.900   Median : 5.900   Median : 4.000   Median : 4.700  
##  Mean   : 5.192   Mean   : 6.068   Mean   : 4.354   Mean   : 4.951  
##  3rd Qu.: 6.400   3rd Qu.: 7.100   3rd Qu.: 5.300   3rd Qu.: 5.700  
##  Max.   :14.200   Max.   :15.000   Max.   :12.100   Max.   :13.700  
##                                                                     
##  men_35_44_rate   women_35_44_rate men_45_54_rate   women_45_54_rate
##  Min.   : 1.100   Min.   : 1.800   Min.   : 1.200   Min.   : 1.700  
##  1st Qu.: 2.800   1st Qu.: 3.800   1st Qu.: 2.700   1st Qu.: 3.100  
##  Median : 3.700   Median : 4.500   Median : 3.400   Median : 3.800  
##  Mean   : 3.925   Mean   : 4.682   Mean   : 3.755   Mean   : 3.959  
##  3rd Qu.: 4.800   3rd Qu.: 5.400   3rd Qu.: 4.500   3rd Qu.: 4.600  
##  Max.   :10.500   Max.   :12.700   Max.   :11.300   Max.   :13.400  
##                                                                     
##  men_55plus_rate women_55plus_rate
##  Min.   : 1.50   Min.   : 1.900   
##  1st Qu.: 3.00   1st Qu.: 2.900   
##  Median : 3.60   Median : 3.400   
##  Mean   : 3.91   Mean   : 3.814   
##  3rd Qu.: 4.50   3rd Qu.: 4.000   
##  Max.   :12.10   Max.   :15.300   
##                  NA's   :552

glimpse(df_unemployed_sex)

## Rows: 917
## Columns: 26
## $ date              <chr> "1948-01-01", "1948-02-01", "1948-03-01", "1948-04-0…
## $ overall_rate      <dbl> 3.4, 3.8, 4.0, 3.9, 3.5, 3.6, 3.6, 3.9, 3.8, 3.7, 3.…
## $ men_rate          <dbl> 3.4, 3.6, 3.8, 3.8, 3.5, 3.3, 3.4, 3.6, 3.7, 3.6, 3.…
## $ women_rate        <dbl> 3.3, 4.5, 4.4, 4.3, 3.7, 4.3, 4.2, 4.4, 4.1, 4.0, 3.…
## $ men_16_17_rate    <dbl> 9.7, 13.0, 14.0, 11.6, 7.1, 11.3, 9.9, 9.8, 10.2, 7.…
## $ women_16_17_rate  <dbl> 8.8, 13.2, 11.4, 10.6, 5.4, 12.9, 11.0, 7.6, 7.3, 8.…
## $ men_16_19_rate    <dbl> 9.4, 10.8, 11.9, 9.8, 7.6, 9.3, 10.2, 10.4, 9.6, 9.4…
## $ women_16_19_rate  <dbl> 7.2, 8.9, 8.6, 9.2, 6.1, 9.3, 9.0, 8.5, 7.6, 7.3, 8.…
## $ men_18_19_rate    <dbl> 9.5, 9.2, 10.3, 8.6, 8.6, 7.4, 11.0, 10.3, 8.9, 10.5…
## $ women_18_19_rate  <dbl> 6.8, 6.8, 7.3, 8.6, 7.0, 6.8, 7.2, 8.8, 7.3, 6.1, 8.…
## $ men_16_24_rate    <dbl> 8.0, 8.6, 10.0, 8.6, 7.6, 7.8, 7.5, 7.7, 7.5, 6.9, 7…
## $ women_16_24_rate  <dbl> 4.9, 6.3, 6.7, 6.7, 5.4, 6.5, 7.5, 6.2, 5.5, 5.7, 6.…
## $ men_20_24_rate    <dbl> 7.2, 7.4, 9.0, 7.9, 7.6, 6.9, 6.0, 6.2, 6.3, 5.6, 5.…
## $ women_20_24_rate  <dbl> 3.4, 4.5, 5.4, 5.0, 4.9, 4.6, 6.5, 4.7, 4.1, 4.6, 4.…
## $ men_25plus_rate   <dbl> 2.5, 2.6, 2.6, 2.8, 2.7, 2.4, 2.5, 2.8, 2.9, 2.8, 3.…
## $ women_25plus_rate <dbl> 2.7, 3.7, 3.2, 3.5, 3.1, 3.5, 3.3, 4.0, 3.6, 3.4, 3.…
## $ men_25_34_rate    <dbl> 2.6, 2.7, 2.7, 3.2, 2.9, 2.5, 2.4, 3.1, 2.8, 3.0, 3.…
## $ women_25_34_rate  <dbl> 4.3, 5.1, 3.5, 3.8, 3.3, 4.2, 4.2, 4.7, 4.9, 4.5, 4.…
## $ men_25_54_rate    <dbl> 2.3, 2.6, 2.6, 2.8, 2.5, 2.3, 2.4, 2.7, 2.8, 2.6, 2.…
## $ women_25_54_rate  <dbl> 2.8, 3.7, 3.3, 3.5, 3.1, 3.6, 3.3, 4.1, 3.7, 3.6, 3.…
## $ men_35_44_rate    <dbl> 2.1, 2.5, 2.6, 2.7, 2.4, 2.3, 2.4, 2.3, 2.6, 2.3, 2.…
## $ women_35_44_rate  <dbl> 1.8, 2.6, 3.0, 3.5, 3.0, 3.3, 2.6, 4.1, 3.0, 3.0, 3.…
## $ men_45_54_rate    <dbl> 2.3, 2.6, 2.4, 2.5, 2.3, 2.2, 2.2, 2.4, 3.2, 2.5, 2.…
## $ women_45_54_rate  <dbl> 2.1, 3.3, 3.3, 3.1, 2.9, 3.1, 3.1, 3.4, 3.2, 3.0, 2.…
## $ men_55plus_rate   <dbl> 3.0, 2.9, 2.8, 2.9, 3.1, 2.8, 2.9, 3.3, 3.4, 3.3, 3.…
## $ women_55plus_rate <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

# women_55plus_rate shows many NAs in the summary, so count the NAs in every column
df_unemployed_sex %>%
  summarise(across(everything(),
                   ~ sum(is.na(.x)),
                   .names = "{.col}_count"))

##   date_count overall_rate_count men_rate_count women_rate_count
## 1          0                  0              0                0
##   men_16_17_rate_count women_16_17_rate_count men_16_19_rate_count
## 1                    0                      0                    0
##   women_16_19_rate_count men_18_19_rate_count women_18_19_rate_count
## 1                      0                    0                      0
##   men_16_24_rate_count women_16_24_rate_count men_20_24_rate_count
## 1                    0                      0                    0
##   women_20_24_rate_count men_25plus_rate_count women_25plus_rate_count
## 1                      0                     0                       0
##   men_25_34_rate_count women_25_34_rate_count men_25_54_rate_count
## 1                    0                      0                    0
##   women_25_54_rate_count men_35_44_rate_count women_35_44_rate_count
## 1                      0                    0                      0
##   men_45_54_rate_count women_45_54_rate_count men_55plus_rate_count
## 1                    0                      0                     0
##   women_55plus_rate_count
## 1                     552

# Compare with the total number of rows in the data frame
nrow(df_unemployed_sex)

## [1] 917
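To put the two counts above side by side, the sketch below computes the share of missing values from them and then shows, on a made-up toy series, one way to find the first date at which a column is populated. The toy values are illustrative only and say nothing about where the real `women_55plus_rate` series begins.

```r
# Counts taken from the output above
n_missing <- 552
n_rows <- 917
pct_missing <- round(100 * n_missing / n_rows, 1)
pct_missing  # 60.2

# Toy series (made-up values): locate the first non-missing observation
toy <- data.frame(
  date = as.Date(c("1948-01-01", "1948-02-01", "1948-03-01", "1948-04-01")),
  women_55plus_rate = c(NA, NA, 3.1, 2.9)
)
first_obs <- toy$date[which(!is.na(toy$women_55plus_rate))[1]]
first_obs
```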

Clean Up

Only the date column was read in with the wrong type: it came in as a character vector instead of a date, so I convert it here. I am also doing some minor feature engineering by separating the year and month into their own columns to allow grouping by those values.

df_unemployed_sex <- df_unemployed_sex %>%
    mutate(date = as.Date(date),
           year = year(date),
           month = month(date)
           )
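As a small sanity check on this kind of conversion, the sketch below parses dates with an explicit format so that a malformed string becomes `NA` instead of raising an error, then extracts year and month. It uses base R's `format()` in place of lubridate's `year()` and `month()` to stay dependency-free, and the input strings (including the deliberately bad one) are made up.

```r
# Made-up input: two valid dates and one malformed entry
raw_dates <- c("1948-01-01", "1948-02-01", "not-a-date")

# With an explicit format, unparseable strings become NA
parsed <- as.Date(raw_dates, format = "%Y-%m-%d")

# Row positions that failed to parse
bad_rows <- which(is.na(parsed))

# Base R equivalents of lubridate::year() and lubridate::month()
yr <- as.integer(format(parsed, "%Y"))
mo <- as.integer(format(parsed, "%m"))

data.frame(raw_dates, parsed, yr, mo)
```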

Unemployment Rate Over Time

These are fairly straightforward plots for the audience to get their bearings on the data segmented by gender and by age group. They are based on the ones present in the original analysis.

# Plot for men and women
ggplot(df_unemployed_sex, aes(x = date)) +
  geom_line(aes(y = overall_rate, color = "Overall")) +
  geom_line(aes(y = men_rate, color = "Men")) +
  geom_line(aes(y = women_rate, color = "Women")) +
  labs(title = "Unemployment Rates Over Time",
       x = "Date",
       y = "Unemployment Rate (%)",
       color = "Rate Type") +
  scale_color_manual(values = c("Overall" = "black", 
                               "Men" = "blue", 
                               "Women" = "red")) +
  ylim(0, 40) +
  theme_minimal()

# Plot for youth age groups (16-19)
ggplot(df_unemployed_sex, aes(x = date)) +
  geom_line(aes(y = men_16_17_rate, color = "Men 16-17")) +
  geom_line(aes(y = women_16_17_rate, color = "Women 16-17")) +
  geom_line(aes(y = men_18_19_rate, color = "Men 18-19")) +
  geom_line(aes(y = women_18_19_rate, color = "Women 18-19")) +
  labs(title = "Youth Unemployment Rates by Age Group and Gender",
       x = "Date", 
       y = "Unemployment Rate (%)",
       color = "Age Group") +
  ylim(0, 40) +
  theme_minimal() +
  theme(legend.position = "bottom")

# Plot for adult age groups (20+)
ggplot(df_unemployed_sex, aes(x = date)) +
  geom_line(aes(y = men_20_24_rate, color = "Men 20-24")) +
  geom_line(aes(y = women_20_24_rate, color = "Women 20-24")) +
  geom_line(aes(y = men_25_34_rate, color = "Men 25-34")) +
  geom_line(aes(y = women_25_34_rate, color = "Women 25-34")) +
  geom_line(aes(y = men_35_44_rate, color = "Men 35-44")) +
  geom_line(aes(y = women_35_44_rate, color = "Women 35-44")) +
  geom_line(aes(y = men_45_54_rate, color = "Men 45-54")) +
  geom_line(aes(y = women_45_54_rate, color = "Women 45-54")) +
  geom_line(aes(y = men_55plus_rate, color = "Men 55+")) +
  # Note: women_55plus_rate appears to have mostly NA values
  labs(title = "Adult Unemployment Rates by Age Group and Gender",
       x = "Date",
       y = "Unemployment Rate (%)",
       color = "Age Group") +
  ylim(0, 40) +
  theme_minimal() +
  theme(legend.position = "bottom")
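The plots above add one `geom_line()` per column. An alternative, sketched here under the assumption that tidyr and ggplot2 are installed, is to pivot the rate columns into long format and map the column name to color, so a single `geom_line()` draws every series. The column names `group` and `rate` are my own choices, and the toy rows reuse the first three values shown in the `glimpse()` output.

```r
library(tidyr)
library(ggplot2)

# First three rows of two youth columns, copied from the glimpse() output
toy <- data.frame(
  date = as.Date(c("1948-01-01", "1948-02-01", "1948-03-01")),
  men_16_17_rate = c(9.7, 13.0, 14.0),
  women_16_17_rate = c(8.8, 13.2, 11.4)
)

# One row per (date, group) pair instead of one column per group
long <- pivot_longer(toy, cols = -date, names_to = "group", values_to = "rate")

# A single geom_line() now covers every group
p <- ggplot(long, aes(x = date, y = rate, color = group)) +
  geom_line() +
  labs(x = "Date", y = "Unemployment Rate (%)", color = "Age Group") +
  theme_minimal()
```

The trade-off is that legend labels become the raw column names unless they are recoded, which the explicit per-column approach avoids.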

Unemployment Rates Heatmap

This is one of the original author’s visualizations that I found interesting. The original analysis also singled out the columns to visualize. Because this assignment said “You should finish with a data frame that contains a subset of the columns in your selected dataset”, I am opting to remove the now-unnecessary columns altogether.

# Reduce columns down to just the ones the original author wanted to visualize
df_unemployed_sex <- df_unemployed_sex[,
  c('date', 'year', 'month', 'overall_rate', 'men_rate',
    'women_rate', 'men_16_17_rate', 'women_16_17_rate',
    'men_25_34_rate', 'women_25_34_rate', 'men_55plus_rate', 'women_55plus_rate')]
columns_to_visualize <- c('men_rate', 'women_rate', 'men_16_17_rate', 'women_16_17_rate',
                          'men_25_34_rate', 'women_25_34_rate', 'men_55plus_rate', 'women_55plus_rate')

# Group by year and calculate the mean of each selected column
heatmap_data <- df_unemployed_sex %>%
  group_by(year) %>%
  summarise(across(all_of(columns_to_visualize),
                   ~ mean(.x, na.rm = TRUE))) %>%
  ungroup()

# Convert to long format for ggplot2
heatmap_long <- heatmap_data %>%
  pivot_longer(cols = -year, 
               names_to = "category", 
               values_to = "unemployment_rate")

# Create the heatmap
ggplot(heatmap_long, aes(x = year, y = category, fill = unemployment_rate)) +
  geom_tile() +
  scale_fill_gradientn(colors = hcl.colors(20, "YlGnBu"),
                       name = "Unemployment Rate (%)") +
  labs(title = "Unemployment Rates Heatmap: Gender & Age Groups Over Years",
       x = "Year",
       y = "Category") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5))
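One detail of the yearly means worth noting: when every value in a year is `NA`, `mean(..., na.rm = TRUE)` returns `NaN` rather than a number, so I would expect such years to show up as missing tiles in the heatmap. The sketch below demonstrates this with base R's `tapply()` on made-up values; it is not run against the real data.

```r
# Made-up values: one year entirely missing, one year fully observed
toy <- data.frame(
  year = c(1948, 1948, 1990, 1990),
  women_55plus_rate = c(NA, NA, 3.0, 3.4)
)

# Per-year mean with NAs dropped, analogous to the summarise() step above
yearly <- tapply(toy$women_55plus_rate, toy$year, mean, na.rm = TRUE)

# The all-NA year yields NaN; the observed year yields an ordinary mean
is.nan(yearly["1948"])
yearly["1990"]
```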

12-Month Moving Average

This is another visualization that I found very interesting. Implementing it meant calculating the 12-month moving averages and storing them on the data frame.

# Calculate 12-month moving averages
df_unemployed_sex <- df_unemployed_sex %>%
  arrange(date) %>%  
  mutate(
    overall_moving_avg = rollmean(overall_rate, k = 12, fill = NA, align = "right"),
    men_moving_avg = rollmean(men_rate, k = 12, fill = NA, align = "right"),
    women_moving_avg = rollmean(women_rate, k = 12, fill = NA, align = "right")
  )

# Plot
ggplot(df_unemployed_sex, aes(x = date)) +
  geom_line(aes(y = overall_moving_avg, color = "Overall", linetype = "Overall"), linewidth = 1) +
  geom_line(aes(y = men_moving_avg, color = "Men", linetype = "Men"), linewidth = 1) +
  geom_line(aes(y = women_moving_avg, color = "Women", linetype = "Women"), linewidth = 1) +
  labs(title = "12-Month Moving Average of Unemployment Rates Over Time",
       x = "Date",
       y = "Unemployment Rate (%)",
       color = "Category",
       linetype = "Category") +
  scale_color_manual(values = c("Overall" = "black", 
                               "Men" = "blue", 
                               "Women" = "orange")) +
  scale_linetype_manual(values = c("Overall" = "solid",
                                  "Men" = "dashed",
                                  "Women" = "dashed")) +
  theme_minimal() +
  theme(
    panel.grid.major = element_line(linewidth = 0.5, linetype = "dashed"),
    panel.grid.minor = element_blank(),
    legend.position = "bottom"
  )
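To make the right-aligned window concrete, the sketch below computes a trailing moving average in base R with `stats::filter()`. It uses a window of 3 instead of 12 purely for readability, on the first five `overall_rate` values from the `glimpse()` output. This is a cross-check of the idea, not the `rollmean()` call used above.

```r
# First five overall_rate values from the glimpse() output
x <- c(3.4, 3.8, 4.0, 3.9, 3.5)
k <- 3  # window size (the analysis above uses 12)

# sides = 1 averages the current value and the k - 1 values before it,
# so the first k - 1 positions have no full window and come back as NA
trailing <- as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))

# Position 3 is the mean of x[1:3], position 4 the mean of x[2:4], and so on
trailing
```

This is the same behavior as `align = "right"` with `fill = NA`: each point summarizes only the months up to and including it, which is why the plotted lines start 11 months after the first observation.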

Conclusions

This was just a sample of what Rajat Raj did in the original analysis. I really enjoyed how many perspectives the original took on the data. However, I am slightly confused by the age ranges they chose to focus on. While I understand that including every age range would have made the graphs visually overwhelming, leaving out the middle-aged U.S. unemployment rate felt odd, and I did not find a clear reason for that choice.

As an aside, this source data didn’t require much cleaning, so I leaned slightly more into feature engineering and into trimming the dataset down to only what was needed.