The College Income data set (recent-grads.csv) comes from a GitHub repository. It is based on data collected by the U.S. Department of Education and on American Community Survey data (2010 - 2012) from the Census Bureau. The data set focuses on college graduates and their fields of study, in order to draw more detailed connections between field of study, income, and employment outcomes.
The data set includes numerical variables such as median income, income at the 25th and 75th percentiles, total number of graduates, sample size, and the share of women in each major. It also includes categorical variables such as major and major category. Employment data is split by whether graduates work in college-level jobs or non-college jobs.
My goal is to examine how major and gender representation relate to earnings and job placement, and whether any patterns or inequalities appear in the transition from college to employment.
Load the libraries and data set
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
recent_grads <- read_csv("recent-grads.csv") # load the data set; assumes the file is in the working directory
Rows: 173 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Major, Major_category
dbl (19): Rank, Major_code, Total, Men, Women, ShareWomen, Sample_size, Empl...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Cleaning up the data
I made all headers lowercase and removed spaces. I also removed rows with n/a values in specific columns (total, men, and women), and I double-checked that all numeric values are stored as numeric values.
names(recent_grads) <- tolower(names(recent_grads)) # lowercase all headers
names(recent_grads) <- gsub(" ", "", names(recent_grads)) # remove all spaces
nona <- recent_grads |>
  filter(!is.na(total) & !is.na(men) & !is.na(women)) # drop rows with n/a in those specific columns
head(nona)
top7 <- nona |>
  arrange(desc(median)) |> # sort the data set in descending order of median earnings
  head(7) # keep only the first 7 rows
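To make the effect of these cleaning steps easier to see, here is a minimal sketch on a toy data frame. The values and the "Major Name" header are made up for illustration and do not come from recent-grads.csv. The sketch uses base R only, so it runs without loading any packages.

```r
# Toy data frame with made-up values (NOT taken from recent-grads.csv)
toy <- data.frame(
  `Major Name` = c("A", "B", "C", "D"),
  Total  = c(100, NA, 300, 400),
  Men    = c(60, 50, NA, 150),
  Women  = c(40, 70, 120, 250),
  Median = c(50000, 40000, 60000, 45000),
  check.names = FALSE   # keep the space in "Major Name" so gsub() has something to remove
)

# Same header cleanup as in the report: lowercase, then strip spaces
names(toy) <- tolower(names(toy))
names(toy) <- gsub(" ", "", names(toy))

# Keep only rows where total, men, and women are all present
toy_nona <- toy[!is.na(toy$total) & !is.na(toy$men) & !is.na(toy$women), ]

# Sort by median (descending) and keep the first 2 rows
toy_top2 <- head(toy_nona[order(-toy_nona$median), ], 2)

print(names(toy))
print(toy_top2)
```

Rows B and C are dropped because each has a missing value in one of the three columns, which is the same logic the filter() call applies to the real data.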
Create a treemap for Recent Grads
I would like to create a treemap where:
The index is the major (focusing on top 7)
The size of the box is the number of graduates
The box color is the median earnings
library(RColorBrewer)
library(treemap)
treemap(top7,
        index = "major", # the index: the label for each box
        vSize = "total", # box size reflects the total graduates for that major
        vColor = "median", # box color is based on median earnings
        type = "manual", # map the vColor values onto the palette supplied below
        palette = "YlGn", # choosing a palette for colors
        title = "Top 7 Majors by Median Earnings", # the title of the visualization
        title.legend = "Median Earnings", # the title of the legend
        fontsize.labels = 10)
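As far as I understand the treemap documentation, type = "manual" maps the numeric vColor column linearly onto the palette, from the smallest value to the largest. The sketch below illustrates that idea in base R. It is not treemap's actual implementation; the earnings values are made up, and the two endpoint colors are just an illustrative light yellow and dark green.

```r
# Made-up median earnings for seven hypothetical majors
medians <- c(110000, 75000, 73000, 70000, 65000, 62000, 60000)

# Rescale to the 0-1 range: smallest value -> 0, largest value -> 1
scaled <- (medians - min(medians)) / (max(medians) - min(medians))

# Interpolate between two illustrative endpoint colors (assumed, not read from treemap)
ramp <- colorRamp(c("#FFFFE5", "#004529"))
rgb_vals <- ramp(scaled)   # matrix of red, green, blue values on a 0-255 scale

# Convert each row to a hex color string
hex <- rgb(rgb_vals[, 1], rgb_vals[, 2], rgb_vals[, 3], maxColorValue = 255)

print(data.frame(median = medians, color = hex))
```

The lowest-earning major gets the lightest color and the highest-earning major gets the darkest, with everything else placed in between according to its value.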
Creating a scatter plot of the top majors
First, I create a data frame (new1) that keeps only the columns I need and only the first 50 rows.
For this scatter plot, I want to see how the share of women, the median earnings, and the total number of graduates relate to one another.
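The data frame step described above can be sketched as follows. This is an assumption about how such a data frame might be built, not the original code: nona_demo is a hypothetical stand-in with made-up values, and the set of kept columns is a guess based on the variables the plot uses.

```r
# Hypothetical stand-in for the cleaned data (made-up values, 60 rows)
nona_demo <- data.frame(
  rank = 1:60,
  major = paste("Major", 1:60),
  major_category = rep(c("Engineering", "Business", "Education"), 20),
  sharewomen = seq(0.10, 0.90, length.out = 60),
  median = seq(110000, 22000, length.out = 60),
  total = rep(c(2000, 15000, 40000), 20)
)

# Columns assumed to be needed for the scatter plot
keep_cols <- c("major", "major_category", "sharewomen", "median", "total")

# Keep only those columns, then only the first 50 rows
new1_demo <- nona_demo[, keep_cols] |> head(50)

print(dim(new1_demo))
```

With dplyr loaded, the same step could be written as nona |> select(major, major_category, sharewomen, median, total) |> head(50).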
plot1 <- ggplot(new1, aes(x = sharewomen, y = median, color = major_category, size = total)) +
  geom_point(alpha = 1) + # geom_point draws the circles on the graph
  labs(title = "Earnings by Gender Share and Major Category", # plot title
       x = "Share of Women in Major", # x-axis title
       y = "Median Earnings", # y-axis title
       color = "Major Category", # legend colors are based on major category
       size = "Total Graduates", # circle size is based on total graduates
       caption = "Source: U.S. Department of Education and ACS data (2010 - 2012) from the Census Bureau.") +
  scale_color_brewer(palette = "Paired") + # choosing a color palette for the circles
  theme_minimal() +
  theme(legend.text = element_text(size = 7.5), # legend item text size
        legend.title = element_text(size = 9.5), # legend title size
        plot.caption = element_text(size = 7), # caption size
        legend.spacing.x = unit(0.05, "pt"))
plot1
Overall Essay
To set up the analysis, I first standardized the variable names to make them easier to reference in code. I did this using tolower() in combination with gsub(), which made all the column headers lowercase and removed white space from the variable names. After that, I used the filter() function to remove rows with missing values in the variables I needed, namely total, men, and women, as these values are important for understanding gender distribution and total graduates within a specific major. This cleaning step helps ensure that the results are not biased and the visuals are not incomplete.
I created a treemap to see how it works; it shows the top 7 majors by median earnings. I used major for the index, the size of each box is the number of graduates, and the color is the median earnings. From that graph, you can see that the majority of the top-earning majors are engineering majors.
The final product is a scatter plot that shows the share of women in a major on the x-axis and the median earnings for graduates on the y-axis. Circle size is based on total graduates, and color is based on major category. This way you can start to think about how gender representation plays into what is being earned across majors. What I found interesting is that majors with a lower share of women, usually STEM or engineering, tended to have higher median earnings, while female-dominated majors were grouped with lower-earning fields like education and the arts. High enrollment in a major also does not guarantee higher pay, which may surprise some who view the chart.
I also want to give credit to ChatGPT for helping guide me through this project, especially with the text sizes of the labels and legends.
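One way to put a number on the pattern in the scatter plot is a simple correlation and a linear fit. The sketch below uses made-up values purely to show the mechanics; it is not a result from recent-grads.csv, and the variable names simply mirror the cleaned column names.

```r
# Made-up values for illustration only (NOT results from the real data)
demo <- data.frame(
  sharewomen = c(0.12, 0.20, 0.35, 0.50, 0.65, 0.80, 0.90),
  median     = c(75000, 65000, 60000, 45000, 40000, 36000, 33000)
)

# Pearson correlation between share of women and median earnings
r <- cor(demo$sharewomen, demo$median)

# Simple linear model: how median earnings change with the share of women
fit <- lm(median ~ sharewomen, data = demo)

print(r)
print(coef(fit))
```

On the real data, the equivalent check would be something like cor(new1$sharewomen, new1$median, use = "complete.obs"). A negative value would be consistent with the pattern described in the essay, though a correlation alone does not explain why the pattern exists.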