In this project we want to find out which are the best-paid jobs in the US job market. To do that, we will use a sample dataset from Kaggle that was scraped from Monster.com.
First, let's read the file.
dataset <- read.csv("monster_com-job_sample.csv")
Let's load the required libraries.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ stringr 1.4.0
## ✔ tidyr 1.2.0 ✔ forcats 0.5.1
## ✔ readr 2.1.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(stringr)
Check the dataset contents
head(dataset$job_title)
## [1] "IT Support Technician Job in Madison"
## [2] "Business Reporter/Editor Job in Madison"
## [3] "Johnson & Johnson Family of Companies Job Application for Senior Training Leader | Monster.com var MONS_LOG_VARS = {\"JobID\":"
## [4] "Engineer - Quality Job in Dixon"
## [5] "Shift Supervisor - Part-Time Job in Camphill"
## [6] "Construction PM - Charlottesville Job in Charlottesville"
Check dimensions of the data frame
dim(dataset)
## [1] 22000 14
Check if there are columns containing NA values
colSums(is.na(dataset))
## country country_code date_added has_expired job_board
## 0 0 0 0 0
## job_description job_title job_type location organization
## 0 0 0 0 0
## page_url salary sector uniq_id
## 0 0 0 0
Check column names
colnames(dataset)
## [1] "country" "country_code" "date_added" "has_expired"
## [5] "job_board" "job_description" "job_title" "job_type"
## [9] "location" "organization" "page_url" "salary"
## [13] "sector" "uniq_id"
We do not need all these columns, so we will only select the relevant ones.
dataset <- select(dataset,job_title,job_type,salary,sector)
Check datatypes
str(dataset)
## 'data.frame': 22000 obs. of 4 variables:
## $ job_title: chr "IT Support Technician Job in Madison" "Business Reporter/Editor Job in Madison" "Johnson & Johnson Family of Companies Job Application for Senior Training Leader | Monster.com var MONS_LOG_VARS = {\"JobID\":" "Engineer - Quality Job in Dixon" ...
## $ job_type : chr "Full Time Employee" "Full Time" "Full Time, Employee" "Full Time" ...
## $ salary : chr "" "" "" "" ...
## $ sector : chr "IT/Software Development" "" "" "Experienced (Non-Manager)" ...
As we can see, all of them are character (string) columns, but salary should be numeric. We will have to clean this column and convert it to numeric so that we can make comparisons and compute statistics.
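A naive as.numeric() call on raw strings like these would only produce NAs; for example (a hypothetical salary string, shown just to illustrate the problem):
as.numeric("$80,000.00 /year")
## Warning: NAs introduced by coercion
## [1] NA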
head(dataset$salary)
## [1] "" "" "" "" "" ""
We can see that some salaries are empty. We will remove all leading and trailing spaces and then discard the records whose salary is empty.
dataset$salary <- str_trim(dataset$salary)
dataset <- dataset[dataset$salary != "",]
Now we only have records with a non-empty salary. Let's sort them by salary and look at a few values.
dataset <- dataset[order(dataset$salary),]
head(dataset$salary)
## [1] "\"Best Employer\" excellent compensation and benefits program"
## [2] "*Excellent compensation and benefits package!*"
## [3] "• Generous salary aligned with experience"
## [4] "+Bonuses + Benefits"
## [5] "$1,000,000.00+ /yearDOE"
## [6] "$1,000,000.00+ /yearDOE"
We do not want salaries given as free text; we only want to keep salaries expressed as numbers. So we will discard the salaries that contain no digits, using a regular expression.
dataset <- dataset[grepl("\\d+", dataset$salary),]
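As a quick sanity check (using two of the raw salary strings we just saw), grepl("\\d+", x) is TRUE only when x contains at least one digit:
grepl("\\d+", c("+Bonuses + Benefits", "$1,000,000.00+ /yearDOE"))
## [1] FALSE  TRUE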
Let's create a function to analyze each salary record. We need to differentiate between yearly, monthly, weekly and hourly salaries, and then convert the character values to numeric. I will keep the raw salary field and create a new column for the cleaned salary, so we can always check whether the output salary is consistent with the raw input.
analyze_salary <- function(dataset) {
  # Extract the first run of digits, commas and dots, drop the commas and convert to numeric
  dataset$cleaned_salary <- as.numeric(gsub(",", "", str_extract(dataset$salary, "[\\d,.]+")))
  # Extract the pay period ("/hour", "/week", "/month" or "/year") when it is specified
  dataset$salary_type <- str_match(dataset$salary, "/(hour|week|month|year)")[,2]
  #print(dataset[c("salary","cleaned_salary","salary_type")])
  return(dataset)
}
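To see what the two extraction steps do, here is a quick check on one of the sample salary strings shown above:
str_extract("$1,000,000.00+ /yearDOE", "[\\d,.]+")
## [1] "1,000,000.00"
str_match("$1,000,000.00+ /yearDOE", "/(hour|week|month|year)")[,2]
## [1] "year"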
dataset <- analyze_salary(dataset)
## Warning in analyze_salary(dataset): NAs introduced by coercion
The warning above tells us that some salary strings could not be parsed into a number and became NA, and some records list 0.00 as the salary. We will discard both kinds of records.
dataset <- dataset[dataset$cleaned_salary != 0 & !is.na(dataset$cleaned_salary),]
We will do the same with the salary type column, discarding records that did not specify whether the salary was hourly, weekly, monthly or yearly.
dataset <- dataset[!is.na(dataset$salary_type),]
Let's check whether there are still any NA (not available) values in either of those two columns.
sum(is.na(dataset$cleaned_salary))
## [1] 0
sum(is.na(dataset$salary_type))
## [1] 0
Salaries are clean now. To be able to compare all jobs, I will also create a yearly salary column, even when the ad did not quote a yearly figure. For hourly salaries I assume a 40-hour week and 52 working weeks per year.
dataset <- mutate(dataset, yearly_salary = case_when(
  salary_type == 'year' ~ cleaned_salary,
  salary_type == 'month' ~ cleaned_salary * 12,
  salary_type == 'week' ~ cleaned_salary * 52,
  salary_type == 'hour' ~ cleaned_salary * 40 * 52
))
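For example, under these assumptions an hourly rate of $25 corresponds to a yearly salary of:
25 * 40 * 52
## [1] 52000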
Let's remove some obvious outliers: yearly salaries below $6,000 or above $300,000.
dataset <- dataset[dataset$yearly_salary > 6000 & dataset$yearly_salary < 300000,]
We will check the top 10 best-paid job ads on Monster.com.
print(slice_max(dataset, yearly_salary, n = 10)[, c("job_title", "yearly_salary")])
## job_title
## 1 Neurologist Job in Fort Bragg
## 2 Real Estate Attorney - Counsel Job in New York
## 3 Staff Psychiatrist Job in Corpus Christi
## 4 Lead Psychiatrist Job in Corpus Christi
## 5 Contract Physician (Suboxone Certified) Job in Columbus
## 6 Contract Physician (Suboxone Certified) Job in Columbus
## 7 Software Systems Engineer Job in Dallas
## 8 Medical Affairs Director - Infectious Diseases & Immunology Job in Madison
## 9 Global Medical Affairs Director - Solid Tumor Therapeutics Job in Northbrook
## 10 General Dentist Job in Greenville
## yearly_salary
## 1 250000.0
## 2 215000.0
## 3 212267.3
## 4 212267.3
## 5 208000.0
## 6 208000.0
## 7 208000.0
## 8 200000.0
## 9 200000.0
## 10 200000.0
We can see that there are duplicates in our data, so we will discard them and keep only the unique records.
dataset <- distinct(dataset)
Now we check the top 10 salaries again and see that there are no duplicates.
print(slice_max(dataset, yearly_salary, n = 10)[, c("job_title", "yearly_salary")])
## job_title
## 1 Neurologist Job in Fort Bragg
## 2 Real Estate Attorney - Counsel Job in New York
## 3 Staff Psychiatrist Job in Corpus Christi
## 4 Lead Psychiatrist Job in Corpus Christi
## 5 Contract Physician (Suboxone Certified) Job in Columbus
## 6 Software Systems Engineer Job in Dallas
## 7 Medical Affairs Director - Infectious Diseases & Immunology Job in Madison
## 8 Global Medical Affairs Director - Solid Tumor Therapeutics Job in Northbrook
## 9 General Dentist Job in Greenville
## 10 VP IT Software Development Job in Mount Laurel
## yearly_salary
## 1 250000.0
## 2 215000.0
## 3 212267.3
## 4 212267.3
## 5 208000.0
## 6 208000.0
## 7 200000.0
## 8 200000.0
## 9 200000.0
## 10 190000.0
Now the 10 lowest salaries.
print(slice_min(dataset, yearly_salary, n = 10)[, c("job_title", "yearly_salary")])
## job_title yearly_salary
## 1 BH - Server Job in Pittsburgh 6240.0
## 2 Server Assistant Job in Saint Petersburg 10462.4
## 3 Beverage Server Job in Kansas City 12480.0
## 4 FOOD SERVER Job in East Chicago 12584.0
## 5 Route Sales Representative Job in Cincinnati 13000.0
## 6 Hotel Breakfast Server Job in Baton Rouge 15600.0
## 7 Food Assembler Job in Atlanta 16016.0
## 8 Banquet Servers Job in Newark 16640.0
## 9 Picker/Packer Job in Fairburn 16640.0
## 10 Hostess Job in Saint Petersburg 16744.0
Now we can start to analyze the sectors. Let's take a look at some of the job sectors available.
head(levels(as.factor(dataset$sector)))
## [1] ""
## [2] "Account Management (Commissioned)Insurance Agent/BrokerFinancial Products Sales/Brokerage"
## [3] "Accounting/Finance/Insurance"
## [4] "Administrative/Clerical"
## [5] "Banking/Real Estate/Mortgage Professionals"
## [6] "Biotech/R&D/Science"
We will create another data frame for analyzing the sectors, with the number of job ads and the mean salary in each sector.
sectors <- dataset[dataset$sector != "",] %>%
group_by(sector) %>%
summarise(count = n(), mean_salary = mean(cleaned_salary))
Let's check which sectors have the most job ads.
print(slice_max(sectors, count, n = 5))
## # A tibble: 5 × 3
## sector count mean_salary
## <chr> <int> <dbl>
## 1 Experienced (Non-Manager) 591 37164.
## 2 Manager (Manager/Supervisor of Staff) 271 58297.
## 3 Accounting/Finance/Insurance 178 50847.
## 4 Entry Level 168 16095.
## 5 Sales/Retail/Business Development 91 40829.
Best-paid sectors (with more than one job ad)
print(slice_max(filter(sectors, count > 1), mean_salary, n = 5))
## # A tibble: 5 × 3
## sector count mean_salary
## <chr> <int> <dbl>
## 1 Executive (SVP, VP, Department Head, etc) 16 111688.
## 2 Building Construction/Skilled Trades 26 69232.
## 3 Marketing/Product 8 64629
## 4 Manager (Manager/Supervisor of Staff) 271 58297.
## 5 Insurance Agent/BrokerGeneral/Other: Sales/Business Develop… 4 57500
I want to order both the jobs and the sectors data frames from highest to lowest pay.
dataset <- dataset[order(-dataset$yearly_salary),]
sectors <- sectors[order(-sectors$mean_salary),]
Let's create bar charts for the highest-paid jobs and sectors.
ggplot(dataset[1:5, ]) +
  geom_col(aes(yearly_salary, reorder(job_title, yearly_salary),
               fill = reorder(job_title, yearly_salary))) +
  scale_fill_manual(values = rainbow(5)) +
  theme(legend.position = "none") +
  labs(x = "Salary per year", y = "Job") +
  geom_text(aes(x = 35000, y = job_title,
                label = sprintf("$%.2f", yearly_salary)))
ggplot(sectors[sectors$count > 1, ][1:5, ]) +
  geom_col(aes(mean_salary,
               reorder(str_sub(sector, 1, 51), mean_salary),
               fill = reorder(str_sub(sector, 1, 51), mean_salary))) +
  scale_fill_manual(values = rainbow(5)) +
  theme(legend.position = "none") +
  labs(x = "Average salary", y = "Sector") +
  geom_text(aes(x = mean_salary - 14000, y = str_sub(sector, 1, 51),
                label = sprintf("$%.2f", mean_salary)))
Now I want to know the top 10 most sought-after skills in the 500 highest-paid job ads. As a rough proxy, I will split the job titles into individual words and count how often each word appears.
skills <- dataset[1:500, ] %>%
  select(skills = job_title) %>%
  separate_rows(skills, sep = " ") %>%
  group_by(skills) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
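As a small illustration of the splitting step, separate_rows() turns a single job title into one row per word (using one of the titles we saw earlier):
separate_rows(tibble(skills = "Neurologist Job in Fort Bragg"), skills, sep = " ")
## # A tibble: 5 × 1
##   skills
##   <chr>
## 1 Neurologist
## 2 Job
## 3 in
## 4 Fort
## 5 Bragg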
top_n(skills,n = 20)
## Selecting by count
## # A tibble: 20 × 2
## skills count
## <chr> <int>
## 1 in 465
## 2 Job 463
## 3 Manager 167
## 4 - 147
## 5 Engineer 78
## 6 Dallas 64
## 7 Project 56
## 8 Senior 53
## 9 Quality 40
## 10 / 32
## 11 Director 32
## 12 Analyst 27
## 13 Developer 25
## 14 Houston 24
## 15 of 24
## 16 Chicago 23
## 17 Cincinnati 22
## 18 Medical 21
## 19 New 19
## 20 Construction 18
Let's remove some common words and city names that are not skills.
skills <- skills[!skills$skills %in% c("in","Job","-","Dallas","/","Houston","of","Chicago","Cincinnati","New"),]
top_n(skills,n = 10)
## Selecting by count
## # A tibble: 10 × 2
## skills count
## <chr> <int>
## 1 Manager 167
## 2 Engineer 78
## 3 Project 56
## 4 Senior 53
## 5 Quality 40
## 6 Director 32
## 7 Analyst 27
## 8 Developer 25
## 9 Medical 21
## 10 Construction 18
Let’s plot this info
ggplot(data = skills[1:10, ]) +
  geom_point(mapping = aes(y = count,
                           x = reorder(skills, count),
                           colour = skills)) +
  labs(title = "Most sought-after skills in the top 500 jobs", y = "Count", x = "Skill") +
  theme(legend.position = "none")