Introduction: In this short paper we perform exploratory data analysis on a labor market dataset for college graduates, broken down by major. We will explore the highest-paying college degrees and the degrees with the lowest unemployment rates, as well as their counterparts at the other end of each ranking.
(Note: The statistical manipulations here are meant purely to demonstrate R functionality, not to draw statistical inferences from the data.)
Clear previous environment
Before starting the project, we clear any objects left in the R environment from previous work. To do that, run the following:
remove(list = ls())
Load Main Libraries
To manipulate and plot the data, we need to load (and, if necessary, first install) the following packages. plyr is loaded before tidyverse so that its functions do not mask the dplyr functions of the same name:
library(plyr)
library(tidyverse)
library(forecast)
library(reshape)
library(hrbrthemes)
library(plotrix)
Import Dataset
The dataset we are using comes from Kaggle (Google Dataset Search is another good place to find pretty much any dataset). Next, we import the data into our environment.
College_Data <- read.csv('C:/Users/xhoni/Desktop/archive/labor_market_college_grads.csv')
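Before going further, it can help to confirm that the imported data frame contains the columns used in the rest of this paper. The sketch below is one possible way to do that. It runs on a small stand-in data frame (`toy_data`, with made-up values) so that it is self-contained; with the real data you would pass `College_Data` instead. The helper name `check_columns` is hypothetical.

```r
# Column names used later in this paper (taken from the code in the text).
required_cols <- c(
  "Major",
  "Unemployment.Rate",
  "Median.Wage.Early.Career",
  "Median.Wage.Mid.Career",
  "Share.with.Graduate.Degree"
)

# Hypothetical helper: returns the required columns that are missing from df.
check_columns <- function(df, required = required_cols) {
  setdiff(required, names(df))
}

# Stand-in data with made-up values, only to make the sketch runnable.
toy_data <- data.frame(
  Major = c("Major A", "Major B"),
  Unemployment.Rate = c(3.1, 5.4),
  Median.Wage.Early.Career = c(40000, 52000),
  Median.Wage.Mid.Career = c(65000, 90000),
  Share.with.Graduate.Degree = c(30.5, 45.2)
)

missing_cols <- check_columns(toy_data)  # empty when nothing is missing
str(toy_data)                            # column types at a glance
```

An empty result from `check_columns()` means every column referenced later is present.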
Select Top 3 Majors with Highest Salaries “Early Career”
First, let's explore the data and see which majors have the highest starting salaries (the wages in the dataset are given as medians).
top_majors <- head(
  arrange(
    College_Data,
    desc(Median.Wage.Early.Career)
  ),
  n = 3
)
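The `head(arrange(..., desc(...)))` pattern above sorts the rows from highest to lowest and keeps the first few. For readers who want to see what that does without any packages, here is a minimal base R sketch of the same idea. The data frame `toy_data` and its values are made up for illustration and are not the Kaggle dataset.

```r
# Stand-in data with made-up values (not the Kaggle dataset).
toy_data <- data.frame(
  Major = c("Major A", "Major B", "Major C", "Major D", "Major E"),
  Median.Wage.Early.Career = c(40000, 52000, 35000, 61000, 47000)
)

# order(..., decreasing = TRUE) gives row indices from highest to lowest wage.
sorted <- toy_data[order(toy_data$Median.Wage.Early.Career, decreasing = TRUE), ]

top_3 <- head(sorted, n = 3)     # three highest wages
bottom_3 <- tail(sorted, n = 3)  # three lowest wages, still in descending order

print(top_3)
print(bottom_3)
```

Note that `tail()` on a descending sort returns the lowest values with the very lowest one last.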
Now let's plot the data and see the result of our first manipulation.
ggplot(top_majors,
       aes(x = Major, y = Median.Wage.Early.Career, fill = Major)) +
  # y stays numeric so bar heights are proportional to the wage
  geom_bar(stat = "identity", width = 0.5) +
  xlab("Majors") + ylab("Starting Wage") +
  theme_minimal() +
  theme(axis.text.x = element_blank())
Wow, now that is a shocker! The top three are all engineering majors (Chemical, Computer, Electrical), which have the highest starting salaries.
Select Worst 3 Majors with Lowest Salaries “Early Career”
Now let's check the majors with the lowest starting salaries.
worse_majors <- tail(
  arrange(
    College_Data,
    desc(Median.Wage.Early.Career)
  ),
  n = 3
)
Plot the findings:
ggplot(worse_majors,
       aes(x = Major, y = Median.Wage.Early.Career, fill = Major)) +
  # y stays numeric so bar heights are proportional to the wage
  geom_bar(stat = "identity", width = 0.5) +
  xlab("Majors") + ylab("Starting Wage") +
  theme_dark() +
  theme(axis.text.x = element_blank())
At the other end of the data we have 3 social science majors, of which Family and Consumer Sciences scored the lowest, with a median starting wage of $32,300.
Now let's switch gears and see which majors give you the toughest time finding a job, and which the easiest. First, let's check which have the lowest unemployment rates.
Least Unemployment Rate
Run the following script to pick out the 4 majors with the lowest unemployment rates, that is, the ones that give you the best chance of finding a job:
least_unemployment <- tail(
  arrange(
    College_Data,
    desc(Unemployment.Rate)
  ),
  n = 4
)
Now again, let's plot the findings:
ggplot(least_unemployment,
       aes(x = Major, y = Unemployment.Rate, fill = Major)) +
  # y stays numeric so bar heights are proportional to the rate
  geom_bar(stat = "identity", width = 0.5) +
  xlab("Majors") + ylab("Unemployment Rate") +
  theme_minimal() +
  theme(axis.text.x = element_blank())
While they do not pay the highest salaries, majors such as medical technician or early childhood education are associated with a significantly better chance of finding a job.
At the other end of the spectrum, let's see which college majors will give you the hardest time finding a job after college:
highest_unemployment <- head(
  arrange(
    College_Data,
    desc(Unemployment.Rate)
  ),
  n = 4
)
Now you can plot the results, similarly to the charts above:
ggplot(highest_unemployment,
       aes(x = Major, y = Unemployment.Rate, fill = Major)) +
  # y stays numeric so bar heights are proportional to the rate
  geom_bar(stat = "identity", width = 0.5) +
  xlab("Majors") + ylab("Unemployment Rate") +
  theme_dark() +
  theme(axis.text.x = element_blank())
The very surprising one is the Physics major, which has the highest unemployment rate at 7.671%. Right behind it, we can clearly distinguish the Mass Media major. So if you are considering pursuing one of those majors, make sure you do the market research and enroll in a more competitive program.
Another interesting path we can pursue in our EDA is to find which majors give you the largest wage rise from early career to mid career.
First, compute a new column that holds the difference between the two wages.
rise_sal <- (College_Data$Median.Wage.Mid.Career - College_Data$Median.Wage.Early.Career)
Next, add the new column to the existing data frame:
College_Data$rise_sal <- rise_sal
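To make the arithmetic concrete, here is a small worked sketch of the same subtraction, done in a single assignment. The wages in `toy_data` are made up for illustration and are not taken from the Kaggle dataset.

```r
# Made-up wages for illustration only.
toy_data <- data.frame(
  Major = c("Major A", "Major B", "Major C"),
  Median.Wage.Early.Career = c(40000, 52000, 35000),
  Median.Wage.Mid.Career = c(70000, 80000, 75000)
)

# Element-wise subtraction: one difference per row, stored as a new column.
toy_data$rise_sal <- toy_data$Median.Wage.Mid.Career -
  toy_data$Median.Wage.Early.Career

# Row with the largest rise.
toy_data[which.max(toy_data$rise_sal), ]
```

Because the subtraction is vectorized, each row gets its own mid-career minus early-career difference, so no loop is needed.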
Now, let's find the majors with the highest wage growth from early to mid career:
highest_rise_sal <- head(
  arrange(
    College_Data,
    desc(rise_sal)
  ),
  n = 5
)
And similarly, let's plot the results:
ggplot(highest_rise_sal,
       aes(x = Major, y = rise_sal, fill = Major)) +
  geom_bar(stat = "identity", width = 0.5) +
  xlab("Majors") + ylab("Salary Increase") +
  theme_classic() +
  theme(axis.text.x = element_blank())
An interesting finding is that Pharmacy majors see the highest rise, while Physics majors (who also have the highest unemployment rate) apparently get a very large raise once they reach a mid-career position.
Now, let's see how the wages are distributed across majors:
Density
ggplot(College_Data) +
  # Top: early-career wages
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_density(aes(x = Median.Wage.Early.Career, y = after_stat(density)),
               fill = "#69b3a2") +
  # Bottom: mid-career wages, mirrored below the axis
  geom_density(aes(x = Median.Wage.Mid.Career, y = -after_stat(density)),
               fill = "#404080") +
  theme_bw() +
  theme(axis.text.y = element_blank()) +
  xlab("Wage Distribution")
Most early-career salaries cluster around the $38,000 mark (which is painfully low), while most mid-career salaries cluster around the $68,000 mark. We can also see outliers above $100k (these are the STEM majors we saw in the previous plots).
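Reading the peak off a density plot by eye is approximate. One way to get a number is to compute the kernel density estimate directly and look up where it is highest. The sketch below does that on simulated, made-up wages (`toy_wages`), not on the Kaggle data; with the real data you would pass a wage column of `College_Data` to `density()` instead.

```r
set.seed(1)  # reproducible made-up sample

# Simulated right-skewed "wages"; NOT the Kaggle data.
toy_wages <- 30000 + rlnorm(200, meanlog = 9, sdlog = 0.5)

# Kernel density estimate, the same kind of curve that geom_density() draws.
d <- density(toy_wages)

# The x position where the estimated density is highest.
peak_wage <- d$x[which.max(d$y)]
peak_wage
```

The result depends on the bandwidth that `density()` chooses, so treat it as a rough location for the peak rather than an exact figure.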
Another tool we can use to examine the distribution is the Q-Q plot.
The Q-Q plot compares the quantiles of the mid-career wages with the quantiles of a normal distribution; points that stray from the reference line show where the data deviate from normality.
ggplot(College_Data, aes(sample = Median.Wage.Mid.Career)) +
  stat_qq() +
  stat_qq_line() +
  theme_light()
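To see what a normal Q-Q plot is made of, the coordinates can be built by hand: the sorted observations are paired with the standard normal quantiles at matching probabilities. The following is a minimal base R sketch on a simulated, made-up sample (`toy_wages`), not the Kaggle data, and it cross-checks the result against base R's `qqnorm()`.

```r
set.seed(42)  # reproducible made-up sample
toy_wages <- rnorm(50, mean = 68000, sd = 12000)  # NOT the Kaggle data

n <- length(toy_wages)

# Theoretical quantiles of a standard normal at n evenly spread probabilities.
theoretical <- qnorm(ppoints(n))

# Sample quantiles are simply the sorted observations.
sample_q <- sort(toy_wages)

# Each (theoretical, sample) pair is one point of the Q-Q plot.
qq_points <- data.frame(theoretical = theoretical, sample = sample_q)
head(qq_points)

# Cross-check against base R's qqnorm(), which returns the same coordinates.
qq_base <- qqnorm(toy_wages, plot.it = FALSE)
all.equal(sort(qq_base$x), theoretical)
```

If the sample were perfectly normal, these points would fall on a straight line; curvature at either end indicates heavier or lighter tails than the normal distribution.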
Visually, the data at the higher wages deviate more from the normal distribution than the data at the lower wages.
We can run a formal test to confirm this deviation: the Shapiro-Wilk test for normality. The null hypothesis of this test is that the data are normally distributed. If the p-value is greater than 0.05, the null hypothesis is not rejected; if it is lower, the null hypothesis is rejected.
shapiro.test(College_Data$Median.Wage.Mid.Career)
##
## Shapiro-Wilk normality test
##
## data: College_Data$Median.Wage.Mid.Career
## W = 0.95555, p-value = 0.01089
As we can see, the p-value (0.01089) is lower than 0.05, so the null hypothesis of a normally distributed population is rejected (which confirms our earlier finding).
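The decision rule can also be applied in code rather than by reading the printout. The sketch below first applies it to the p-value reported above, then shows how to pull the p-value out of a `shapiro.test()` result. The second part uses a simulated, made-up skewed sample (`toy_wages`), not the Kaggle data, and the 0.05 threshold is the one stated in the text.

```r
alpha <- 0.05  # significance level used in the text

# p-value reported in the Shapiro-Wilk output above.
p_reported <- 0.01089
reject_reported <- p_reported < alpha  # TRUE: normality is rejected

# The same rule applied to a test object, using a made-up skewed sample.
set.seed(7)
toy_wages <- rexp(60, rate = 1 / 50000)  # NOT the Kaggle data
res <- shapiro.test(toy_wages)
reject_toy <- res$p.value < alpha

reject_reported
reject_toy
```

With the real data, `shapiro.test(College_Data$Median.Wage.Mid.Career)$p.value` would take the place of `res$p.value`.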
Finally, let's test for possible correlations among different variables (on a scale from 1, a perfect positive correlation, to -1, a perfect negative correlation).
cor(College_Data$Share.with.Graduate.Degree, College_Data$Median.Wage.Early.Career)
## [1] -0.07664187
cor(College_Data$Share.with.Graduate.Degree, College_Data$Median.Wage.Mid.Career)
## [1] -0.001777715
cor(College_Data$Share.with.Graduate.Degree, College_Data$rise_sal)
## [1] 0.07915562
cor(College_Data$Unemployment.Rate, College_Data$Median.Wage.Early.Career)
## [1] 0.04972598
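Instead of calling `cor()` once per pair, all pairwise coefficients can be computed in one go by passing a data frame of numeric columns. The sketch below illustrates this on randomly generated, made-up values (`toy_data`), not the Kaggle data, and also shows `cor.test()` as one way to attach a p-value to a single pair.

```r
set.seed(3)  # made-up values, NOT the Kaggle data
toy_data <- data.frame(
  Share.with.Graduate.Degree = runif(30, 10, 60),
  Unemployment.Rate = runif(30, 1, 8),
  Median.Wage.Early.Career = runif(30, 30000, 65000),
  Median.Wage.Mid.Career = runif(30, 50000, 110000)
)

# Pearson correlation for every pair of columns at once.
cor_matrix <- cor(toy_data)
round(cor_matrix, 3)

# cor.test() adds a p-value and confidence interval for a single pair.
ct <- cor.test(toy_data$Unemployment.Rate, toy_data$Median.Wage.Early.Career)
ct$estimate
ct$p.value
```

With the real data you would first select the numeric columns of `College_Data`, since `cor()` does not accept the character `Major` column.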
All four coefficients are close to zero (between about -0.08 and 0.08), so there are no meaningful linear correlations so far. We will leave it at that and let you explore other findings along the paths provided.