project1

Author

Asma Abbas

Introduction

The dataset I’ve chosen to visualize is called “labor.” It contains several statistics on employment in the United States from 1972 to 2015. I got it from GitHub, and it was compiled by Austin Cory Bart. It breaks the data down across different groups of people, and the visualization I produce will look at those groups. All of the data comes from the Current Population Survey, conducted by the Census Bureau. Some of the important variables are:

Data.Unemployed.Black or African American.Unemployment Rate.Men: The unemployment rates for the demographic of Black or African American men.

Data.Unemployed.Black or African American.Unemployment Rate.Women: The unemployment rates for the demographic of Black or African American women.

Data.Unemployed.Asian.Unemployment Rate: The unemployment rate for the Asian demographic of the survey.

Data.Unemployed.White.Unemployment Rate.Men: The unemployment rate for White men.

Data.Unemployed.White.Unemployment Rate.Women: The unemployment rate for White women.

Using this dataset, I want to compare unemployment rates across these different demographics.

Loading the necessary libraries

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.4.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("C:/Users/Saima Abbas/Downloads")
labor <- read_csv("labor.csv")

Rows: 528 Columns: 51
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): Time.Month Name
dbl (50): Time.Month, Time.Year, Data.Civilian Noninstitutional Population.A...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Loading tidyverse and the csv file containing the dataset.

Cleaning up the data

When I was looking through the dataset, I noticed that several columns contained zeros. I thought these would distort the results, so I decided it would be best to filter them out.

# Keep only the five unemployment-rate columns, then drop any row
# where one of those columns is missing or zero.
labor_2 <- labor |>
  select(
    `Data.Unemployed.Black or African American.Unemployment Rate.Men`,
    `Data.Unemployed.Black or African American.Unemployment Rate.Women`,
    `Data.Unemployed.Asian.Unemployment Rate`,
    `Data.Unemployed.White.Unemployment Rate.Men`,
    `Data.Unemployed.White.Unemployment Rate.Women`
  ) |>
  filter(if_all(everything(), ~ !is.na(.x) & .x != 0))

This step is the cleaning. I used the select function to pick out the variables I considered important, since we’re going to be looking at unemployment rates. Then I used the filter function to remove every row that has a zero (or a missing value) in any of these columns. (I used a website to help me work out how to filter out the zeros, since I couldn’t quite figure it out from the notes. It is cited under Sources.)
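To see what this kind of filter keeps, here is a minimal sketch on a made-up four-row table. The column names `men` and `women` and all of the values are hypothetical, not taken from labor.csv, and the sketch assumes a dplyr version that includes if_all().

```r
library(dplyr)

# Hypothetical toy data (not values from labor.csv)
toy <- data.frame(
  men   = c(5.1, 0,   4.8, NA),
  women = c(6.0, 7.2, 0,   5.5)
)

# Keep a row only if every column is both non-missing and non-zero
cleaned <- toy |>
  filter(if_all(everything(), ~ !is.na(.x) & .x != 0))

print(cleaned)
```

Only the first row survives. A single zero or NA removes the whole row, including the valid values in the other columns of that row, which is worth keeping in mind when one column has many more zeros than the others.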

Creating a dataframe

unemployment_rates <- labor_2 |>
  gather(key = "Group", value = "Unemployment_Rate")

This chunk reshapes the data from wide to long format: the Group column holds the original column names, and the Unemployment_Rate column holds the values.
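For reference, the tidyr documentation recommends pivot_longer() over gather() for new code. Below is a minimal sketch of the same reshaping idea with pivot_longer(), on made-up data; the column names `rate_men` and `rate_women` and their values are hypothetical.

```r
library(tidyr)

# Hypothetical wide data: one column per group (made-up values)
wide <- data.frame(
  rate_men   = c(7.1, 6.8),
  rate_women = c(6.5, 6.2)
)

# Same idea as gather(key = "Group", value = "Unemployment_Rate"):
# column names go into Group, cell values into Unemployment_Rate
long <- wide |>
  pivot_longer(
    cols      = everything(),
    names_to  = "Group",
    values_to = "Unemployment_Rate"
  )

print(long)
```

The rows come out in a different order than with gather() (row by row instead of column by column), but the set of Group and Unemployment_Rate pairs is the same.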

Creating the data frame and setting up for the graph.

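# Derive Race and Gender labels from the text of each original column name.
unemployment_rates <- unemployment_rates |>
  mutate(
    Race = case_when(
      grepl("Black", Group) ~ "Black or African American",
      grepl("Asian", Group) ~ "Asian",
      grepl("White", Group) ~ "White"),
    Gender = case_when(
      grepl("Men", Group) ~ "Men",
      grepl("Women", Group) ~ "Women",
      # No "Men" or "Women" in the name (the Asian column): label as "Both"
      TRUE ~ "Both"))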

Here I set up a dataframe containing what was necessary for the graphs. First I used the mutate function to make two new columns, Race and Gender, that simplify the original variable names. Inside case_when, each grepl call checks whether a piece of text (such as “Black”) appears in the Group name, and the first condition that matches decides the label. After doing this for the different races, I repeated the same process for gender. However, because (for some reason?) the dataset doesn’t include Asian men and women as separate variables, I added a final TRUE condition as a catch-all that labels them “Both.” Basically, if a column name mentions neither “Men” nor “Women,” the row goes under “Both.”
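The labeling logic can be checked on its own with three of the column names from the dataset. This is only a small sketch: the vector name `groups` is made up for the illustration. It also shows why matching on “Men” does not accidentally catch “Women”: grepl is case-sensitive by default.

```r
library(dplyr)

# Three of the original column names, used as plain strings
groups <- c(
  "Data.Unemployed.Black or African American.Unemployment Rate.Men",
  "Data.Unemployed.Asian.Unemployment Rate",
  "Data.Unemployed.White.Unemployment Rate.Women"
)

# Conditions are tried in order; TRUE acts as the catch-all
gender <- case_when(
  grepl("Men", groups)   ~ "Men",
  grepl("Women", groups) ~ "Women",
  TRUE                   ~ "Both"
)
print(gender)

# "Women" contains "men" but not "Men", so the match depends on case
case_sensitive   <- grepl("Men", "Women")
case_insensitive <- grepl("Men", "Women", ignore.case = TRUE)
print(c(case_sensitive, case_insensitive))
```

If the patterns were ever matched without regard to case, the “Men” condition would need to come after the “Women” condition, because the first match wins.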

Making the visualization

ggplot(unemployment_rates, aes(x = Race, y = Unemployment_Rate, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Unemployment Rates Across Different Groups and Gender",
    x = "Race/Ethnicity",
    y = "Unemployment Rate (%)",
    fill = "Gender",
    caption = "Source: Austin Cory Bart, Labor Dataset") +
  theme_light() +
  scale_fill_manual(values = c("Men" = "#007BA7", "Women" = "#800020", "Both" = "#FF00FF"))

In this bar graph, I began by setting which variable goes on each axis, and then worked on the visuals. I didn’t perform any calculations on the data and wanted it plotted exactly as it is in the dataset, so I used “identity,” and then used “dodge” to place the bars for each gender side by side within each race. All that was left after that was labeling each axis, providing the source, and picking out colors. I picked one of the built-in themes, and found hex codes on Google for colors I thought would look good together.
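Because the reshaped data has many rows for each group (one per remaining month), another option would be to collapse each group to a single number before plotting. The sketch below assumes the mean is the summary of interest, which is an assumption and not what the chart above does. The data frame `toy`, the column `Mean_Rate`, and all values are made up.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with two rows per group (made-up values)
toy <- data.frame(
  Race              = c("White", "White", "White", "White", "Asian", "Asian"),
  Gender            = c("Men", "Men", "Women", "Women", "Both", "Both"),
  Unemployment_Rate = c(4, 6, 3, 5, 2, 4)
)

# One row per Race/Gender pair: the average rate for that pair
avg_rates <- toy |>
  group_by(Race, Gender) |>
  summarise(Mean_Rate = mean(Unemployment_Rate), .groups = "drop")

print(avg_rates)

# geom_col() plots the values as given, like geom_bar(stat = "identity")
p <- ggplot(avg_rates, aes(x = Race, y = Mean_Rate, fill = Gender)) +
  geom_col(position = "dodge") +
  labs(x = "Race/Ethnicity", y = "Mean Unemployment Rate (%)")
```

With one value per bar, the height of each bar has a single clear meaning (the average rate for that group), and the same summarized table can be reused for other chart types.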

Alternative visualization

ggplot(unemployment_rates, aes(x = Race, y = Unemployment_Rate, fill = Gender)) +
  geom_bar(stat = "identity") +  
  labs(
    title = "Unemployment Across Race and Gender",
    x = "Race",
    y = "Unemployment Rate (%)",
    fill = "Gender",
    caption = "Source: Austin Cory Bart, Labor Dataset") +
  theme_light() +
  scale_fill_manual(values = c("Men" = "darkblue", "Women" = "hotpink", "Both" = "purple"))

Since I wasn’t just comparing race, but gender too, I thought it would make sense to highlight that factor. I figured a nicer way to present it would be a stacked bar graph. Essentially, I did the same as in the previous bar graph, but I omitted “dodge” so that the bars stacked by default. Here, instead of using hex codes, I typed in the names of the colors I wanted and picked out a theme (I hope that still counts as customizing outside of the default colors?).
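One likely reason the stacked chart ends up on a much larger y-axis than the dodged one: with stat = "identity", stacking places every row on top of the previous one, so the total height for each race is the sum of all of its rows. Dodged bars that share the same position and fill are, as far as I can tell, drawn over one another, so only the tallest one is visible. The arithmetic can be sketched on made-up numbers (the data frame `toy` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: three monthly rates per race (made-up values)
toy <- data.frame(
  Race              = c("Asian", "Asian", "Asian", "White", "White", "White"),
  Unemployment_Rate = c(3, 4, 5, 4, 5, 6)
)

# Compare the height a stacked bar would reach with the tallest single row
heights <- toy |>
  group_by(Race) |>
  summarise(
    stacked_total      = sum(Unemployment_Rate),
    tallest_single_bar = max(Unemployment_Rate)
  )

print(heights)
```

In this toy case the stacked totals are 12 and 15 while the tallest single rows are 5 and 6, so the stacked totals grow with the number of rows and no longer read as percentages.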

Essay response:

How you cleaned the dataset up (be detailed and specific, using proper terminology where appropriate).
What the visualization represents, any interesting patterns or surprises that arise within the visualization.
Anything that you might have shown that you could not get to work or that you wished you could have included.

  Cleaning the dataset was honestly the worst part of the process, because I couldn't figure out what I needed to clean. I feel as though I isolated the variables rather than cleaning up the dataset. What I did do was remove every row with a zero or missing value in the selected columns, which prepared the data and let me move forward. I suppose using the gather function also counts, since reshaping the data made the comparison by group easier. From there, I just created the dataframe that isolated the variables needed to create the visualization(s).
  
  The visualization at hand represents the unemployment rates across different demographics. The categories differ by both race and gender. The dataset only provided data on three races, which is why those are the only ones included. I was surprised that there was no data on Hispanic demographics, or just more groups in general. I was also surprised by how low the unemployment rate was for the Asian demographic. “Asian” is such an umbrella term that I figured it would cover a large population, which I assumed would mean a higher unemployment rate, if that makes sense. As a (South) Asian who had the worst luck in finding a job, I thought the rate would be higher (but I guess that's just me). However, since there were so many zeros present in the dataset, I don't think it really encapsulates all Asians (East Asians, Southeast Asians, South Asians, etc.). It was also unfortunate that the dataset didn't split this group into men and women, so that comparison could be made. Moving to the next group in the graphs, something interesting is that the unemployment rate for Black or African American men is significantly higher than it is for Black or African American women. For White men, the unemployment rate is slightly higher than it is for White women. All of these are interesting, and I wonder how they've changed in ten years, since the data stops at 2015.
  
  Initially, the plan was to create a scatterplot and look at these rates across different years, but there was so much to work with that I believe I got overwhelmed and wasn't able to execute it properly. I also had problems cleaning the dataset at first: I was working with removing the zeros and with the group function, and I just didn't understand them yet. Some things I wish I could have included were more variables, and maybe the rises and falls of employment across the different years. I also wanted (and will try in the future!) to make something that is more visually interesting to look at. I also struggled with understanding why the second graph is on a different scale than the first. For future reference, I will work more on actually cleaning up a dataset, and on tackling more of its variables.

Sources:

The website that I used to help:

Steven P. Sanderson II, MPH. “How to Remove Rows with Any Zeros in R: A Complete Guide with Examples.” Steve’s Data Tips and Tricks, 6 Jan. 2025, www.spsanderson.com/steveondata/posts/2025-01-06/.