Labor Market for College Graduates

Introduction: In this short paper we will perform exploratory data analysis (EDA) of the labor market for college graduates, broken down by major. We will explore the highest-paying college degrees, the degrees with the lowest unemployment rates, and their counterparts at the other end of each ranking.

(Note: The statistical manipulations are meant purely to demonstrate R functionality, not to draw statistical inferences from the data.)

Clear previous environment

Before starting the project, we clear any objects left in the environment from previous work. To accomplish that, we run the following:

remove(list = ls())

Load Main Libraries

To perform the manipulations that follow, we need to load (and, if necessary, first install) the following packages:

library(tidyverse)
library(forecast)
library(plyr)
library(reshape)
library(hrbrthemes)
library(plotrix)

Import Dataset

The dataset we are using comes from Kaggle (you can also use Google Dataset Search to find pretty much any dataset). Next, we import the data into our environment.

College_Data <- read.csv('C:/Users/xhoni/Desktop/archive/labor_market_college_grads.csv')
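After importing, it can help to check how the columns were read. The sketch below uses a tiny made-up CSV (the rows and the raw header names are assumptions, not the Kaggle data) to illustrate one likely reason the column names used in this paper look like Median.Wage.Early.Career: by default, read.csv() replaces spaces in header names with dots.

```r
# Hypothetical two-row CSV; values and raw header names are invented for illustration.
csv_text <- "Major,Unemployment Rate,Median Wage Early Career
Example Major A,3.5,40000
Example Major B,5.0,35000"

toy <- read.csv(text = csv_text)

# check.names = TRUE (the default) makes headers syntactically valid,
# so "Median Wage Early Career" becomes "Median.Wage.Early.Career".
names(toy)

# str() shows the type of each column, a quick check that wages were read as numbers.
str(toy)
```

Running names() and str() on College_Data itself gives the same kind of overview for the real file.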

Select Top 3 Majors with Highest Salaries “Early Career”

First, let's explore the data and see which majors have the highest starting salaries (the wages in the data are reported as medians).

top_majors <- head(
  arrange(
    College_Data,
    desc(Median.Wage.Early.Career)
  ),n=3
)

Now let's plot the data and see the result of our first manipulation.

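# y stays numeric so that bar heights are proportional to the wage
ggplot(top_majors, 
       aes(x = Major, y = Median.Wage.Early.Career,
           fill = Major)) + 
  geom_bar(stat = "identity",
           width = 0.5) + 
  xlab("Majors") + ylab("Starting Wage") +
  theme_minimal() +
  # hide the x-axis labels; the legend already names each major
  theme(axis.text.x = element_blank())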

Wow, now that is a shocker! The three highest starting salaries all belong to engineering majors (Chemical, Computer, Electrical).

Select 3 Majors with Lowest Salaries “Early Career”

Now let's check the majors with the lowest starting salaries.

worse_majors <- tail(
  arrange(
    College_Data,
    desc(Median.Wage.Early.Career)
  ),n=3
)

Plot the findings

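# y stays numeric so that bar heights are proportional to the wage
ggplot(worse_majors, 
       aes(x = Major, y = Median.Wage.Early.Career,
           fill = Major)) + 
  geom_bar(stat = "identity",
           width = 0.5) + 
  xlab("Majors") + ylab("Starting Wage") +
  theme_dark() +
  # hide the x-axis labels; the legend already names each major
  theme(axis.text.x = element_blank())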

At the other end of the data we have 3 social science majors, of which Family and Consumer Sciences scored the lowest, with a median starting salary of $32,300.
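The two selections above (sort, then take the head or the tail) can also be written more directly. The following is a minimal sketch using dplyr's slice_max() and slice_min() on a made-up data frame; the toy values are assumptions for illustration and are not taken from College_Data.

```r
library(dplyr)

# Made-up data frame standing in for College_Data; values are illustrative only.
toy <- data.frame(
  Major = c("A", "B", "C", "D", "E"),
  Median.Wage.Early.Career = c(40000, 65000, 32000, 50000, 36000)
)

# Three highest starting wages.
top3 <- dplyr::slice_max(toy, order_by = Median.Wage.Early.Career,
                         n = 3, with_ties = FALSE)

# Three lowest starting wages.
bottom3 <- dplyr::slice_min(toy, order_by = Median.Wage.Early.Career,
                            n = 3, with_ties = FALSE)

top3$Major
bottom3$Major
```

Setting with_ties = FALSE keeps exactly n rows even when several majors share the same wage, which matches the behavior of head() and tail() on a sorted data frame.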

Now let's switch gears and see which majors make it hardest, or easiest, to find a job. First, let's check which majors have the lowest unemployment rates.

Least Unemployment Rate

Run the following script to pick out the 4 majors that give you the best chance of finding a job:

least_unemployment <- tail(
  arrange(
    College_Data,
    desc(Unemployment.Rate)
  ),n=4
)

Now again, let's plot the findings.

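# y stays numeric so that bar heights are proportional to the rate
ggplot(least_unemployment, 
       aes(x = Major, y = Unemployment.Rate,
           fill = Major)) + 
  geom_bar(stat = "identity",
           width = 0.5) + 
  xlab("Majors") + ylab("Unemployment Rate") +
  theme_minimal() +
  # hide the x-axis labels; the legend already names each major
  theme(axis.text.x = element_blank())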

While these fields do not pay the highest salaries, being a medical technician or an early childhood education teacher significantly improves your chances of finding a job.

At the other end of the spectrum, let's see which college majors will give you the hardest time finding a job after college:

highest_unemployment <- head(
  arrange(
    College_Data,
    desc(Unemployment.Rate)
  ),n=4
)

Now you can plot the results, as in the previous charts:

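# y stays numeric so that bar heights are proportional to the rate
ggplot(highest_unemployment, 
       aes(x = Major, y = Unemployment.Rate,
           fill = Major)) + 
  geom_bar(stat = "identity",
           width = 0.5) + 
  xlab("Majors") + ylab("Unemployment Rate") +
  theme_dark() +
  # hide the x-axis labels; the legend already names each major
  theme(axis.text.x = element_blank())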

The most surprising result is the Physics major, which has the highest unemployment rate at 7.671%. Right behind it is the Mass Media major. So if you are considering pursuing one of those majors, make sure you do the market research and enroll in a more competitive program.

Another interesting path we can pursue in our EDA is to find which majors give you the largest rise in wages from early to mid career.

First, compute the difference between the mid-career and early-career median wages.

rise_sal <- (College_Data$Median.Wage.Mid.Career - College_Data$Median.Wage.Early.Career)

Next, add the result as a new column to the existing data frame:

College_Data$rise_sal <- rise_sal
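The same two steps can be combined into one. Here is a minimal sketch with dplyr's mutate() on a made-up data frame (the toy values are assumptions, not the real data). The dplyr:: prefix is used because plyr is loaded after tidyverse in this paper and exports functions with the same names.

```r
library(dplyr)

# Made-up data frame standing in for College_Data; values are illustrative only.
toy <- data.frame(
  Major = c("A", "B", "C"),
  Median.Wage.Early.Career = c(40000, 65000, 32000),
  Median.Wage.Mid.Career = c(70000, 100000, 45000)
)

# Add the wage rise as a new column, then sort from largest to smallest rise.
toy <- toy %>%
  dplyr::mutate(rise_sal = Median.Wage.Mid.Career - Median.Wage.Early.Career) %>%
  dplyr::arrange(dplyr::desc(rise_sal))

toy
```

In this toy example major B comes first with a rise of 35,000, followed by A (30,000) and C (13,000).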

Now, let's find the majors with the highest wage growth from early to mid career:

highest_rise_sal <- head(
  arrange(
    College_Data,
    desc(rise_sal)
  ),n=5
)

And similarly, let's plot the results:

ggplot(highest_rise_sal, 
       aes(x = Major, y = rise_sal,
           fill = Major)) + 
  geom_bar(stat = "identity",
           width = 0.5) + 
  xlab("Majors") + ylab("Salary Increase") +
  theme_classic() +
  theme(axis.text.x = element_blank())

An interesting finding is that Pharmacy majors see the highest rise, while Physics majors (who also have the highest unemployment rate) apparently see a very large raise once they reach a mid-career position.

Now, let's see how wages are distributed across majors:

Density

ggplot(College_Data) +
  # Top: early-career wages
  geom_density(aes(x = Median.Wage.Early.Career, y = after_stat(density)), fill = "#69b3a2") +
  # Bottom: mid-career wages, mirrored below the axis
  geom_density(aes(x = Median.Wage.Mid.Career, y = -after_stat(density)), fill = "#404080") +
  theme_bw() +
  theme(axis.text.y = element_blank()) +
  xlab("Wage Distribution")

Most early-career salaries cluster around the $38,000 mark (which is painfully low); by comparison, most mid-career salaries cluster around the $68,000 mark. We can also see outliers above $100k (which are STEM majors, as we saw in the previous plots).

Another tool we can use to examine the distribution is the Q-Q plot.

The Q-Q plot shows how far the mid-career median wages deviate from a normal distribution.

ggplot(College_Data, aes(sample = Median.Wage.Mid.Career)) +
  stat_qq() +
  stat_qq_line()+
  theme_light()

Visually, the data at the higher-wage end deviates more from the normal distribution than the data at the lower end.

We can run a formal test to confirm the deviation we observed: the Shapiro-Wilk test for normality. The null hypothesis for this test is that the data are normally distributed. If the p-value is greater than 0.05, the null hypothesis is not rejected.

shapiro.test(College_Data$Median.Wage.Mid.Career)

## 
##  Shapiro-Wilk normality test
## 
## data:  College_Data$Median.Wage.Mid.Career
## W = 0.95555, p-value = 0.01089

As we can see, the p-value (0.01089) is below 0.05, so the null hypothesis of a normally distributed population is rejected (which agrees with what we saw in the Q-Q plot).

Finally, let's test for possible correlations among different variables (on a scale from 1, a perfect positive correlation, to -1, a perfect negative correlation).
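The decision rule can also be applied in code instead of by reading the printed output. The sketch below runs shapiro.test() on simulated, strongly right-skewed data (the sample, the seed, and the 0.05 threshold variable are assumptions for illustration, not the College_Data values) and extracts the p-value from the returned object.

```r
set.seed(1)  # arbitrary seed so the simulated sample is reproducible

# Simulated, strongly right-skewed "wages" (log-normal); not the real data.
skewed_wages <- rlnorm(200, meanlog = 11, sdlog = 1)

result <- shapiro.test(skewed_wages)

alpha <- 0.05
# TRUE means the null hypothesis of normality is rejected at the alpha level.
reject_normality <- result$p.value < alpha

result$statistic   # the W statistic
result$p.value
reject_normality
```

The same pattern works on the real column: replace skewed_wages with College_Data$Median.Wage.Mid.Career.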

Graduate Degree and Early Salary

cor(College_Data$Share.with.Graduate.Degree, College_Data$Median.Wage.Early.Career)

## [1] -0.07664187

Graduate Degree and Mid Career Salary

cor(College_Data$Share.with.Graduate.Degree, College_Data$Median.Wage.Mid.Career)

## [1] -0.001777715

Graduate Degree and Rise in Salaries

cor(College_Data$Share.with.Graduate.Degree, College_Data$rise_sal)

## [1] 0.07915562

Unemployment Rate & Early Wage

cor(College_Data$Unemployment.Rate, College_Data$Median.Wage.Early.Career)

## [1] 0.04972598

None of these pairs shows a meaningful linear correlation (every coefficient is within 0.08 of zero), so we will leave it at that and let you explore other relationships along the paths provided.
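For further exploration, the pairwise checks above can be done in one call. This is a minimal sketch on a made-up data frame (the numbers are assumptions, chosen only to make the example run): cor() applied to several numeric columns returns the full correlation matrix, and cor.test() adds a p-value for a single pair.

```r
# Made-up data standing in for College_Data; the numbers are illustrative only.
toy <- data.frame(
  Unemployment.Rate = c(3.5, 5.0, 4.2, 6.1, 2.8),
  Median.Wage.Early.Career = c(40000, 65000, 32000, 50000, 36000),
  Median.Wage.Mid.Career = c(70000, 100000, 45000, 80000, 60000)
)

# Pearson correlation for every pair of columns at once;
# use = "complete.obs" drops rows that contain missing values.
cor_matrix <- cor(toy, use = "complete.obs")
round(cor_matrix, 2)

# cor.test() reports a p-value and confidence interval for one pair.
cor.test(toy$Unemployment.Rate, toy$Median.Wage.Early.Career)
```

On the real data, select only the numeric columns of College_Data before calling cor(), since the Major column is text.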

Xhoni Shollaj

6/13/2021