Recent Graduates Income

Author

Nhi Vu

What is this data set about?

The College Income data set (recent-grads.csv) comes from a GitHub repository. It combines data collected by the U.S. Department of Education with American Community Survey data (2010 - 2012) from the Census Bureau. The data set focuses on college graduates and their fields of study in order to provide more detailed connections between field of study, income, and employment outcomes.

The data set includes numerical variables such as median income, income at the 25th and 75th percentiles, total number of graduates, the sample size, and the share of women in each major (stored as a proportion). It also includes categorical variables such as major and major category. Employment data is split by whether graduates work in college-level jobs or non-college jobs.

My goal is to examine how majors and gender representation relate to earnings and job placement, and whether any patterns or inequalities appear in the transition from college to employment.

Load the libraries and data set

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("/Users/nhi.vu/Desktop/DATA110")
recent_grads <- read_csv("recent-grads.csv")

Rows: 173 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Major, Major_category
dbl (19): Rank, Major_code, Total, Men, Women, ShareWomen, Sample_size, Empl...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning up the data

I made all headers lowercase and removed spaces. I also removed rows with missing (NA) values in specific columns such as total, men, and women, and I double-checked that all numeric values are stored as numeric.

names(recent_grads) <- tolower(names(recent_grads)) # make all headers lowercase
names(recent_grads) <- gsub(" ","",names(recent_grads)) # remove all spaces
nona <- recent_grads |>
  filter(!is.na(total) & !is.na(men) & !is.na(women))  # drop rows with NA in these specific columns
head(nona)

# A tibble: 6 × 21
   rank major_code major total   men women major_category sharewomen sample_size
  <dbl>      <dbl> <chr> <dbl> <dbl> <dbl> <chr>               <dbl>       <dbl>
1     1       2419 PETR…  2339  2057   282 Engineering         0.121          36
2     2       2416 MINI…   756   679    77 Engineering         0.102           7
3     3       2415 META…   856   725   131 Engineering         0.153           3
4     4       2417 NAVA…  1258  1123   135 Engineering         0.107          16
5     5       2405 CHEM… 32260 21239 11021 Engineering         0.342         289
6     6       2418 NUCL…  2573  2200   373 Engineering         0.145          17
# ℹ 12 more variables: employed <dbl>, full_time <dbl>, part_time <dbl>,
#   full_time_year_round <dbl>, unemployed <dbl>, unemployment_rate <dbl>,
#   median <dbl>, p25th <dbl>, p75th <dbl>, college_jobs <dbl>,
#   non_college_jobs <dbl>, low_wage_jobs <dbl>
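The type check mentioned in the cleaning step is not shown in the code above, so here is a minimal sketch of one way it could be done. It uses base R so it runs without extra packages, and the `toy` data frame and its values are made up for illustration (a stand-in for `recent_grads`, not the real data).

```r
# Hypothetical stand-in for recent_grads; the values are invented.
toy <- data.frame(
  Major = c("A", "B", "C"),
  Total = c(100, NA, 250),
  Men   = c(60, 40, NA),
  Women = c(40, 35, 120),
  stringsAsFactors = FALSE
)
names(toy) <- tolower(names(toy))  # same header cleanup as above

numeric_cols <- c("total", "men", "women")

# TRUE only if every listed column is stored as a numeric type
all_numeric <- all(vapply(toy[numeric_cols], is.numeric, logical(1)))

# Number of missing values per column, before filtering
na_counts <- colSums(is.na(toy[numeric_cols]))

# Keep only rows with no NA in the listed columns
toy_clean <- toy[complete.cases(toy[numeric_cols]), ]
```

On the real data, the same calls could be pointed at `recent_grads` in place of `toy`.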

Select the top 7 majors by median earnings

top7 <- nona |>
  arrange(desc(median)) |> # sort the data set in descending order of median
  head(7) # only choosing the first 7 rows
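One detail worth knowing about the sort-then-head approach is how it handles ties. The sketch below compares it with dplyr's `slice_max()` on a small made-up data frame (the `toy` values are hypothetical, chosen so that two majors tie on median earnings). It is an alternative to consider, not a correction of the code above.

```r
library(dplyr)

# Hypothetical toy data: majors B and C tie on median earnings.
toy <- data.frame(
  major  = c("A", "B", "C", "D"),
  median = c(50000, 40000, 40000, 30000)
)

# Approach used above: sort descending, then keep the first n rows.
top2_head <- toy |>
  arrange(desc(median)) |>
  head(2)

# slice_max() keeps ties by default, so it can return more than n rows.
top2_ties <- toy |> slice_max(median, n = 2)

# with_ties = FALSE returns exactly n rows, like the head() approach.
top2_exact <- toy |> slice_max(median, n = 2, with_ties = FALSE)
```

With `head()`, a tie at the cutoff is broken silently by row order, so `slice_max()` can make that choice explicit.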

Create a treemap for Recent Grads

I would like to create a treemap where:

The index is the major (focusing on top 7)
The size of the box is the number of graduates
The heatmap color is the median earnings

library(RColorBrewer)
library(treemap)
treemap(top7,
        index = "major",    # the index sets the labels, i.e. what each box represents
        vSize = "total",    # the size of each box is the total number of graduates for that major
        vColor = "median",  # the color is based on the median earnings
        type = "manual",    # map the numeric vColor values directly onto the chosen palette
        palette = "YlGn",   # choosing a palette for colors
        title = "Top 7 Majors by Median Earnings",  # the title of the visualization
        title.legend = "Median Earnings",           # the title of the legend
        fontsize.labels = 10)

Creating a scatter plot to show the top 50 majors

First, I create a data frame that keeps only the columns I need and the 50 rows with the highest median earnings.

new1 <- nona |>
  select(sharewomen, total, major, median, major_category) |> 
  arrange(desc(median)) |>
  head(50)

Plotting the scatter plot

For this scatter plot, I want to see how the share of women, median earnings, and total graduates relate to one another.

plot1 <- ggplot(new1, aes(x = sharewomen, y = median, color = major_category, size = total)) + 
  geom_point(alpha = 1) + # geom_point draws the circles; size is inherited from aes() above
  labs(title = "Earnings by Gender Share and Major Category", # labels for the title, axes, and legends
       x = "Share of Women in Major", # x-axis title
       y = "Median Earnings",         # y-axis title
       color = "Major Category",      # legend title for color (based on major category)
       size = "Total Graduates",      # legend title for size (based on total graduates)
       caption = "Source: U.S. Department of Education and ACS data (2010 - 2012) from the Census Bureau.") +
  scale_color_brewer(palette = "Paired") + # choosing a color palette for the circles
  theme_minimal() +
  theme(
    legend.text = element_text(size = 7.5),    # Legend item text size
    legend.title = element_text(size = 9.5),   # Legend title size
    plot.caption = element_text(size = 7),     # Caption size
    legend.spacing.x = unit(0.05, "pt"))
plot1
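The relationships the scatter plot displays can also be summarized with a number. The sketch below is one possible check, using Spearman rank correlation from base R; the `toy` values are invented for illustration only and are not taken from recent-grads.csv, so the results say nothing about the real data.

```r
# Hypothetical toy values, made up for illustration only.
toy <- data.frame(
  sharewomen = c(0.10, 0.25, 0.50, 0.80),
  total      = c(2000, 30000, 15000, 60000),
  median     = c(90000, 70000, 50000, 35000)
)

# Spearman rank correlation is less sensitive to a few very high earners
# than the default Pearson correlation.
r_share <- cor(toy$sharewomen, toy$median, method = "spearman")
r_total <- cor(toy$total, toy$median, method = "spearman")

# On the real data, the analogous call would be:
# cor(new1$sharewomen, new1$median, method = "spearman")
```

A value near -1 for `r_share` would support the pattern that earnings fall as the share of women rises, while a value near 0 for `r_total` would suggest that enrollment size has little relation to pay.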

Overall Essay

To get everything set up for the analysis, the first thing I did was standardize the variable names to make them easier to reference in code. I did this using tolower() in combination with gsub(), which made all the column headers lowercase and took the white space out of the variable names. After that, I used the filter() function to remove rows with missing values in variables that I needed, like total, men, and women, as these values are important for understanding gender distribution and total graduates within a specific major. This is an important cleaning step that helps avoid biased results and incomplete visuals.

I created a treemap mainly to see how it works; it shows the top 7 majors by median earnings. I used major for the index, the size of each box is the number of graduates, and the color is the median earnings. From that graph, you can see that the majority of the top-earning majors are engineering majors.

The final product is a scatter plot that shows the share of women in a major on the x-axis and the median earnings for graduates on the y-axis. The circle size is based on total graduates and the color is based on major category. This way you can start to think about how gender representation relates to earnings across majors. What I found interesting is that majors with a lower share of women, usually STEM or engineering, tended to have higher median earnings, while female-dominated majors tended to be grouped with lower-earning fields like education and the arts. Heavy enrollment also does not guarantee higher pay, which may surprise some who view the chart.

I also want to give credit to ChatGPT for helping guide me through this project, especially with the text sizes of the labels and legends.