In this report, we visualize and analyze data on the job outcomes of students who graduated from college between 2010 and 2012. The original data was released by the American Community Survey, then cleaned by FiveThirtyEight and published on their GitHub repo. Specifically, we seek to answer questions about the relationships between student employment, unemployment, gender, major and other variables solely through visualization.
We find that student employment tended to be lower in majors related to the humanities, arts and education, and that these majors tended to have a higher percentage of women enrolled. We also find that women's participation varies widely across majors, with shares as low as 2% in some majors related to Engineering and Science.
Note: This project is part of Dataquest's Data Scientist in Python track.
We begin by loading the required libraries.
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
recent_grads = pd.read_csv('recent-grads.csv')
recent_grads.info()
recent_grads.head()
recent_grads.describe()
The dataset consists of 173 majors and 21 columns, including the Rank (rank by median earnings), Major, Major Category, total number of people with the major, sample size, number of male and female graduates, number of employed graduates, median salary, and others.
With the exception of Major and Major Category, columns are either an integer or float type. We also note that there are almost no null values.
Each row holds all the information for a particular major, as seen below for Petroleum Engineering.
print('Petroleum Engineering overview')
recent_grads.iloc[0]
Because we will be working with Matplotlib, we drop rows with missing values: Matplotlib expects the columns of values passed to it to have matching lengths.
raw_data_count = recent_grads.shape[0]
recent_grads = recent_grads.dropna()
cleaned_data_count = recent_grads.shape[0]
print('Cleaned data count: ', cleaned_data_count)
print('Raw data count: ', raw_data_count)
From the above, we can see that all null values corresponded to a single row so only one entry was dropped from our dataset.
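The claim that every missing value sits in a single row can also be checked directly instead of being inferred from the row counts. The sketch below uses a small made-up frame (`toy`, with invented values) in place of `recent_grads`; the same calls should apply to the real data, but the numbers shown here say nothing about it.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for recent_grads; all values are invented for illustration.
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C', 'D'],
    'Total': [100.0, np.nan, 250.0, 80.0],
    'Men': [60.0, np.nan, 100.0, 30.0],
    'Women': [40.0, np.nan, 150.0, 50.0],
})

# Number of missing values in each column.
nulls_per_column = toy.isnull().sum()

# Number of rows that contain at least one missing value.
rows_with_nulls = int(toy.isnull().any(axis=1).sum())

# dropna() removes exactly those rows, so the two counts should agree.
rows_dropped = len(toy) - len(toy.dropna())

print(nulls_per_column)
print('Rows with nulls:', rows_with_nulls)
print('Rows dropped:', rows_dropped)
```

If several columns report one null each while `rows_with_nulls` is 1, the missing values all belong to the same row.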
Now that we've cleaned the dataset, we will explore relationships between various columns using a variety of plots.
ax = recent_grads.plot(x='Total', y='Median', kind='scatter')
ax.set_title('Total number of graduates vs. Median')
This first plot indicates that there is no strong correlation between the popularity of a major (as measured by its number of graduates) and the median salary of its graduates.
recent_grads.plot(x='Total', y='Unemployment_rate', kind='scatter',
                  title='Total number of graduates vs. Unemployment Rate')
In the previous plot, there is no visible relationship between the popularity of a major and the unemployment rate within that major. This goes against relationships one might expect in theory, such as very popular majors having a higher unemployment rate due to a larger supply of graduates. Other factors, such as high demand in those popular sectors, may offset this effect, resulting in a weak or absent relationship between the two variables.
recent_grads.plot(x='Full_time', y='Median', kind='scatter',
                  title='Number of Full Time Workers vs. Median')
This plot also shows no clear relationship between the number of full-time workers and the median salary in that major.
recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter',
                  title='Share of Women vs. Unemployment Rate')
recent_grads.plot(x='ShareWomen', y='Median', kind='scatter',
                  title='Share of women vs. Median')
The previous two plots suggest that there is no clear correlation between the participation of women in a major and the unemployment rate of graduates from that major. However, there is a slightly negative relationship between the participation of women in a major and the median salary in that major.
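Reading correlation off a scatter plot is subjective, so a numeric cross-check can help. The following is a minimal sketch using pandas' `DataFrame.corr()` (Pearson by default) on a small invented frame; `toy` and its values are assumptions made for illustration and do not reproduce the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for recent_grads; all values are invented for illustration.
toy = pd.DataFrame({
    'ShareWomen': [0.1, 0.3, 0.5, 0.7, 0.9],
    'Median': [70000, 55000, 45000, 38000, 30000],
    'Unemployment_rate': [0.06, 0.05, 0.07, 0.05, 0.06],
})

# Pairwise Pearson correlation between the three columns.
corr = toy[['ShareWomen', 'Median', 'Unemployment_rate']].corr()

# Correlation of ShareWomen with the two outcome columns.
print(corr.loc['ShareWomen', ['Median', 'Unemployment_rate']])
```

On the real data, replacing `toy` with `recent_grads` would give coefficients to set beside the scatter plots. A value near 0 supports "no correlation", while a negative value supports a negative relationship.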
As a next step, we will generate histograms to explore the distributions of several columns.
recent_grads['Sample_size'].hist(bins = 25)
plt.title('Histogram of sample size of graduates')
recent_grads['Median'].hist(bins = 20)
plt.title('Histogram of median of graduates')
recent_grads['Employed'].hist(bins = 25)
plt.title('Histogram of number of employed graduates per major')
recent_grads['Full_time'].hist(bins = 25)
plt.title('Histogram of number of full time graduates per major')
recent_grads['ShareWomen'].hist(bins = 15)
plt.title('Histogram of women participation per major')
recent_grads['Unemployment_rate'].hist(bins = 25)
plt.title('Histogram of unemployment rate of graduates per major')
recent_grads['Men'].hist(bins = 25)
plt.title('Histogram of number of male graduates per major')
recent_grads['Women'].hist(bins = 25)
plt.title('Histogram of number of female graduates per major')
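When these calls are run outside separate notebook cells, successive histograms are drawn onto the same axes. One possible alternative, sketched here under the assumption of a small invented frame (`toy`) standing in for `recent_grads`, is to give each column its own subplot by passing `ax` to `Series.hist()`.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for recent_grads; all values are invented for illustration.
toy = pd.DataFrame({
    'Sample_size': [10, 50, 120, 300, 45, 80],
    'Median': [30000, 35000, 42000, 60000, 33000, 31000],
    'ShareWomen': [0.2, 0.5, 0.8, 0.1, 0.6, 0.7],
    'Unemployment_rate': [0.05, 0.06, 0.07, 0.02, 0.09, 0.12],
})
cols = list(toy.columns)

# One subplot per column, arranged in a 2x2 grid.
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, col in zip(axes.ravel(), cols):
    toy[col].hist(bins=5, ax=ax)  # draw this column's histogram on its own axes
    ax.set_title('Histogram of ' + col)
fig.tight_layout()
```

The grid shape and bin count are arbitrary choices for the toy data. With the eight columns plotted above, a 4x2 grid would be the natural equivalent.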
Some highlights from the previous histograms: the most common median salary range is USD 30,000 to USD 35,000, and only a small number of majors have a median salary above USD 80,000.
Additionally, one can observe that the share of women is relatively uniform between 0.2 and 0.8, indicating a high variance in the participation of women across majors.
The unemployment rate is centered around 6%, with some majors having very high unemployment rates of over 12%.
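Statements such as "the most common median salary range" can be checked by counting values per bin. The sketch below does this with `pd.cut` and `value_counts` on an invented series of salaries. The 5,000-wide bin edges are an assumption chosen here; `hist(bins=20)` instead splits the observed range into equal-width bins, so its edges will generally differ.

```python
import pandas as pd

# Hypothetical median salaries in USD; invented for illustration.
medians = pd.Series([22000, 31000, 32000, 33500, 34000,
                     36000, 40000, 45000, 62000, 110000])

# 5,000-wide bins from 20,000 to 115,000; right=False makes each bin [lower, upper).
bins = list(range(20000, 115001, 5000))
binned = pd.cut(medians, bins=bins, right=False)

# Count of majors per bin, in bin order.
counts = binned.value_counts().sort_index()

# The bin holding the most values, and the share of values above 80,000.
most_common = counts.idxmax()
share_above_80k = (medians > 80000).mean()

print('Most common range:', most_common)
print('Share above 80,000:', share_above_80k)
```

Applied to `recent_grads['Median']`, the same steps would report the modal salary bin and the fraction of majors above USD 80,000 for whichever bin edges are chosen.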
Next, we will be using the scatter_matrix() function from the pandas.plotting module.
scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(10,10))
plt.suptitle('Scatter matrix of sample size and graduate median per major')
scatter_matrix(recent_grads[['Sample_size', 'Median', 'Unemployment_rate']], figsize=(10,10))
plt.suptitle('Scatter matrix of sample size, graduate median and unemployment rate per major')
scatter_matrix(recent_grads[['Men', 'Women', 'Total']], figsize=(10,10))
plt.suptitle('Scatter matrix of number of male, female and total graduates per major')
With these scatter matrix plots, we can confirm our previous statements about the most common median salary range and unemployment rates. Moreover, we see that the median salary and the unemployment rate are largely uncorrelated.
Additionally, in terms of gender distribution, although the share of women varies widely across majors, the overall distributions of the number of male and female graduates are very similar, indicating that in aggregate the participation of both genders is of comparable size. We can check this against the mean number of women and men in the recent_grads description generated in the overview, where the mean number of women in fact surpasses the mean number of men.
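The difference between per-major shares and the aggregate share can be made concrete. The sketch below uses an invented frame (`toy`) in which the share of women ranges from 10% to 90% across majors while the aggregate split is even. The column names mirror those used above, but the values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for recent_grads; all values are invented for illustration.
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C'],
    'Men': [900, 500, 100],
    'Women': [100, 500, 900],
})
toy['Total'] = toy['Men'] + toy['Women']
toy['ShareWomen'] = toy['Women'] / toy['Total']

# Per-major share varies widely...
share_range = (toy['ShareWomen'].min(), toy['ShareWomen'].max())

# ...while the aggregate share weights each major by its size.
overall_share_women = toy['Women'].sum() / toy['Total'].sum()

# Mean graduates per major, as reported by describe().
mean_men = toy['Men'].mean()
mean_women = toy['Women'].mean()

print('ShareWomen range:', share_range)
print('Overall share of women:', overall_share_women)
print('Mean men / women per major:', mean_men, mean_women)
```

On the real data, `overall_share_women` computed from `recent_grads` would state the aggregate split directly instead of leaving it to be inferred from the means.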
We can take a closer look at the variance in the gender participation across majors using bar plots as follows.
recent_grads[:20].plot.bar(x='Major', y='ShareWomen')
plt.title('Participation of Women Across Majors - Engineering')
As the dataset is ordered by Rank, that is, by median earnings, the first 20 rows are the highest-paying majors. These are mostly from the Engineering category, and female participation among them is overall quite low, particularly in Petroleum, Mining and Mineral, and Metallurgical Engineering.
recent_grads[-20:].plot.bar(x='Major', y='ShareWomen')
plt.title('Participation of Women Across Majors - Arts, Education and Humanities')
On the other hand, the last 20 rows are majors from the Biology & Life Science, Psychology & Social Work, Education, Arts, and Humanities categories. Here, female participation is disproportionately high, with women accounting for almost 100% of graduates in majors such as Communication Disorders Sciences and Early Childhood Education.
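Which categories actually appear in the first and last rows can be confirmed by counting them, instead of relying on the bar labels. The sketch below assumes the category column is named `Major_category` and uses an invented six-row frame (`toy`) with `n = 2` in place of the 20-row slices.

```python
import pandas as pd

# Hypothetical stand-in for recent_grads; all values are invented for illustration.
toy = pd.DataFrame({
    'Rank': [1, 2, 3, 4, 5, 6],
    'Major': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Major_category': ['Engineering', 'Engineering', 'Business',
                       'Education', 'Arts', 'Education'],
})

n = 2  # the report uses 20; 2 keeps the toy example readable

# Categories represented in the first and last n rows.
top_categories = toy[:n]['Major_category'].value_counts()
bottom_categories = toy[-n:]['Major_category'].value_counts()

# Check that the rows are in Rank order, which is what makes the slices meaningful.
is_sorted_by_rank = toy['Rank'].is_monotonic_increasing

print(top_categories)
print(bottom_categories)
print('Sorted by Rank:', is_sorted_by_rank)
```

Running the same counts on `recent_grads[:20]` and `recent_grads[-20:]` would show how well the category labels in the plot titles fit each slice.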
Let us now take a look at the unemployment rate in these majors.
recent_grads[:20].plot.bar(x='Major', y='Unemployment_rate')
plt.title('Unemployment Rate Across Majors - Engineering')
recent_grads[-20:].plot.bar(x='Major', y='Unemployment_rate')
plt.title('Unemployment Rate Across Majors - Arts, Education and Humanities')
Overall, unemployment tends to be higher in the majors from the last twenty rows, that is, those relating to the humanities, arts and education, which also tend to have a higher participation of women.
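The comparison between the two groups of bars can be summarized with group means. This is a minimal sketch on an invented frame (`toy`), again with `n = 2` standing in for the 20-row slices; the values are assumptions and do not reflect the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for recent_grads; all values are invented for illustration.
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Unemployment_rate': [0.02, 0.04, 0.06, 0.05, 0.08, 0.10],
    'ShareWomen': [0.10, 0.20, 0.50, 0.60, 0.80, 0.90],
})

n = 2  # the report uses 20
cols = ['Unemployment_rate', 'ShareWomen']

# Mean of each column over the first and last n rows, side by side.
summary = pd.DataFrame({
    'first_n': toy[:n][cols].mean(),
    'last_n': toy[-n:][cols].mean(),
})

print(summary)
```

A higher `last_n` value in both rows would agree with the reading of the bar plots above. On the real data, the same table could be built from `recent_grads[:20]` and `recent_grads[-20:]`.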
These bar plots provide a more granular view of the trends that we noticed in the previous scatter plots and scatter matrix plots.
In this project, we used visualizations including scatter plots, bar plots and histograms to analyze relationships between variables related to job outcomes for graduates across a variety of majors. Using these visualizations, we were able to pinpoint interesting relationships and overall trends, including common ranges for the median salary and unemployment rate, and their differences across different groups including gender and major category.