In this project we want to evaluate what are the best jobs in US job market. For doing that we will use a sample dataset from Kaggle that was scraped from Monster.com

First lets read the file

dataset <- read.csv("monster_com-job_sample.csv")

Lets import needed libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ stringr 1.4.0
## ✔ tidyr   1.2.0     ✔ forcats 0.5.1
## ✔ readr   2.1.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(stringr)

Check the dataset contents

head(dataset$job_title)
## [1] "IT Support Technician Job in Madison"                                                                                          
## [2] "Business Reporter/Editor Job in Madison"                                                                                       
## [3] "Johnson & Johnson Family of Companies Job Application for Senior Training Leader | Monster.com var MONS_LOG_VARS = {\"JobID\":"
## [4] "Engineer - Quality Job in Dixon"                                                                                               
## [5] "Shift Supervisor - Part-Time Job in Camphill"                                                                                  
## [6] "Construction PM - Charlottesville Job in Charlottesville"

Check dimensions of the data frame

dim(dataset)
## [1] 22000    14

Check if there are columns containing NA values

colSums(is.na(dataset))
##         country    country_code      date_added     has_expired       job_board 
##               0               0               0               0               0 
## job_description       job_title        job_type        location    organization 
##               0               0               0               0               0 
##        page_url          salary          sector         uniq_id 
##               0               0               0               0

Check column names

colnames(dataset)
##  [1] "country"         "country_code"    "date_added"      "has_expired"    
##  [5] "job_board"       "job_description" "job_title"       "job_type"       
##  [9] "location"        "organization"    "page_url"        "salary"         
## [13] "sector"          "uniq_id"

We do not need all these columns, so we will only select the relevant ones.

dataset <- select(dataset,job_title,job_type,salary,sector)

Check datatypes

str(dataset)
## 'data.frame':    22000 obs. of  4 variables:
##  $ job_title: chr  "IT Support Technician Job in Madison" "Business Reporter/Editor Job in Madison" "Johnson & Johnson Family of Companies Job Application for Senior Training Leader | Monster.com var MONS_LOG_VARS = {\"JobID\":" "Engineer - Quality Job in Dixon" ...
##  $ job_type : chr  "Full Time Employee" "Full Time" "Full Time, Employee" "Full Time" ...
##  $ salary   : chr  "" "" "" "" ...
##  $ sector   : chr  "IT/Software Development" "" "" "Experienced (Non-Manager)" ...

As we can see all of them are character type (strings). But salary should be numeric. Soon we will have to clean this column to convert it to numeric and be able to make comparisons and stats.

head(dataset$salary)
## [1] "" "" "" "" "" ""

We can see that some records salaries are empty. We will remove all leading and trailing spaces and discard those empty salaries.

dataset$salary <- str_trim(dataset$salary)
dataset <- dataset[dataset$salary != "",]

Now we only have records with salaries

dataset <- dataset[order(dataset$salary),]
head(dataset$salary)
## [1] "\"Best Employer\" excellent compensation and benefits program"
## [2] "*Excellent compensation and benefits package!*"               
## [3] "• Generous salary aligned with experience"                    
## [4] "+Bonuses + Benefits"                                          
## [5] "$1,000,000.00+ /yearDOE"                                      
## [6] "$1,000,000.00+ /yearDOE"

We do not want the salaries with strings. We only want to keep the salaries expressed in numbers. So we will discard salaries with no numbers. For doing this we will use regular expressions.

dataset <- dataset[grepl("\\d+", dataset$salary),]

Lets create a function to analyze each salary record. We need to differentiate between yearly, monthly and hourly salaries. And then convert the character to numeric field. I will keep the raw salary field and create a new one for the cleaned salary. So we can always check if the output salary is coherent with the input raw salary field.

analyze_salary <- function (dataset){
  dataset$cleaned_salary <- as.numeric(gsub(",","",str_extract(dataset$salary,"[\\d,.]+")))
  dataset$salary_type <- str_match(dataset$salary,"/(hour|week|month|year)")[,2]
  #print(dataset[c("salary","cleaned_salary","salary_type")])
  return(dataset)
  }
dataset <- analyze_salary(dataset)
## Warning in analyze_salary(dataset): NAs introduced by coercion

Some records did not enter a salary or they entered 0.00 as salary. So we will discard those records.

dataset <- dataset[dataset$cleaned_salary != 0 & !is.na(dataset$cleaned_salary),]

We will do the same with the salary type column. We will discard the records that did not specify if the salary was hourly, weekly, monthly or yearly.

dataset <- dataset[!is.na(dataset$salary_type),]

Lets check if there are still any NA (Not available) field on any of those 2 columns.

sum(is.na(dataset$cleaned_salary))
## [1] 0
sum(is.na(dataset$salary_type))
## [1] 0

Salaries are clean now. But I will create a column for yearly salary, so we can compare all jobs, no matter their ad was not with yearly salary.

dataset <- mutate(dataset, yearly_salary = case_when(
  dataset$salary_type == 'year' ~ dataset$cleaned_salary,
  dataset$salary_type == 'month' ~ dataset$cleaned_salary * 12,
  dataset$salary_type == 'week' ~ dataset$cleaned_salary * 52,
  dataset$salary_type == 'hour' ~ dataset$cleaned_salary * 40 * 52
)
  )

Let’s remove some outliers that I can see at a glance.

dataset <- dataset[dataset$yearly_salary > 6000 & dataset$yearly_salary < 300000,]

We will check what are the top 10 best paid job ads in monster.us

print(slice_max(dataset,dataset$yearly_salary,n = 10)[,c(1,7)])
##                                                                       job_title
## 1                                                 Neurologist Job in Fort Bragg
## 2                                Real Estate Attorney - Counsel Job in New York
## 3                                      Staff Psychiatrist Job in Corpus Christi
## 4                                       Lead Psychiatrist Job in Corpus Christi
## 5                       Contract Physician (Suboxone Certified) Job in Columbus
## 6                       Contract Physician (Suboxone Certified) Job in Columbus
## 7                                       Software Systems Engineer Job in Dallas
## 8    Medical Affairs Director - Infectious Diseases & Immunology Job in Madison
## 9  Global Medical Affairs Director - Solid Tumor Therapeutics Job in Northbrook
## 10                                            General Dentist Job in Greenville
##    yearly_salary
## 1       250000.0
## 2       215000.0
## 3       212267.3
## 4       212267.3
## 5       208000.0
## 6       208000.0
## 7       208000.0
## 8       200000.0
## 9       200000.0
## 10      200000.0

We can see that there are duplicates in our data. We need to discard them and keep unique records.

dataset <- distinct(dataset)

Now we check top 10 salaries jobs, and we see there are no duplicates.

print(slice_max(dataset,dataset$yearly_salary,n = 10)[,c(1,7)])
##                                                                       job_title
## 1                                                 Neurologist Job in Fort Bragg
## 2                                Real Estate Attorney - Counsel Job in New York
## 3                                      Staff Psychiatrist Job in Corpus Christi
## 4                                       Lead Psychiatrist Job in Corpus Christi
## 5                       Contract Physician (Suboxone Certified) Job in Columbus
## 6                                       Software Systems Engineer Job in Dallas
## 7    Medical Affairs Director - Infectious Diseases & Immunology Job in Madison
## 8  Global Medical Affairs Director - Solid Tumor Therapeutics Job in Northbrook
## 9                                             General Dentist Job in Greenville
## 10                               VP IT Software Development Job in Mount Laurel
##    yearly_salary
## 1       250000.0
## 2       215000.0
## 3       212267.3
## 4       212267.3
## 5       208000.0
## 6       208000.0
## 7       200000.0
## 8       200000.0
## 9       200000.0
## 10      190000.0

Lowest 10 salaries

print(slice_min(dataset,dataset$yearly_salary,n = 10)[,c(1,7)])
##                                       job_title yearly_salary
## 1                 BH - Server Job in Pittsburgh        6240.0
## 2      Server Assistant Job in Saint Petersburg       10462.4
## 3            Beverage Server Job in Kansas City       12480.0
## 4               FOOD SERVER Job in East Chicago       12584.0
## 5  Route Sales Representative Job in Cincinnati       13000.0
## 6     Hotel Breakfast Server Job in Baton Rouge       15600.0
## 7                 Food Assembler Job in Atlanta       16016.0
## 8                 Banquet Servers Job in Newark       16640.0
## 9                 Picker/Packer Job in Fairburn       16640.0
## 10              Hostess Job in Saint Petersburg       16744.0

Now can start to analyze the sectors Let’s take a look at some of the job sectors available.

head(levels(as.factor(dataset$sector)))
## [1] ""                                                                                         
## [2] "Account Management (Commissioned)Insurance Agent/BrokerFinancial Products Sales/Brokerage"
## [3] "Accounting/Finance/Insurance"                                                             
## [4] "Administrative/Clerical"                                                                  
## [5] "Banking/Real Estate/Mortgage Professionals"                                               
## [6] "Biotech/R&D/Science"

We will create another dataset for analyzing the sectors.

sectors <- dataset[dataset$sector != "",] %>%
  group_by(sector) %>%
  summarise(count = n(), mean_salary = mean(cleaned_salary))

Let’s check what are the most demanded sectors.

print(slice_max(sectors,sectors$count,n = 5))
## # A tibble: 5 × 3
##   sector                                count mean_salary
##   <chr>                                 <int>       <dbl>
## 1 Experienced (Non-Manager)               591      37164.
## 2 Manager (Manager/Supervisor of Staff)   271      58297.
## 3 Accounting/Finance/Insurance            178      50847.
## 4 Entry Level                             168      16095.
## 5 Sales/Retail/Business Development        91      40829.

Best paid sectors (with more than 1 occurrence)

print(slice_max(sectors[sectors$count > 1,],sectors[sectors$count > 1,]$mean_salary,n = 5))
## # A tibble: 5 × 3
##   sector                                                       count mean_salary
##   <chr>                                                        <int>       <dbl>
## 1 Executive (SVP, VP, Department Head, etc)                       16     111688.
## 2 Building Construction/Skilled Trades                            26      69232.
## 3 Marketing/Product                                                8      64629 
## 4 Manager (Manager/Supervisor of Staff)                          271      58297.
## 5 Insurance Agent/BrokerGeneral/Other: Sales/Business Develop…     4      57500

I want to order both jobs and sectors datasets by best paid

dataset <- dataset[order(-dataset$yearly_salary),]
sectors <- sectors[order(-sectors$mean_salary),]

Let’s create graphs for the highest paid jobs and sectors.

ggplot(dataset[1:5,]) + 
  geom_col(aes(yearly_salary,reorder(job_title, yearly_salary),fill=palette(rainbow(5)))) +
  theme(legend.position = "none") +
  labs(x = "Salary per year", y = "Job") +
  geom_text(aes(x=35000,y=job_title,
                label=sprintf("$%.2f",yearly_salary)
                )
            )

ggplot(sectors[sectors$count > 1,][1:5,]) + 
  geom_col(aes(mean_salary,
               reorder(str_sub(sector,1,51), mean_salary),
               fill=palette(rainbow(5)))) +
  theme(legend.position = "none") +
  labs(x = "Average salary", y = "Job") +
  geom_text(aes(x=mean_salary-14000,y=str_sub(sector,1,51),
                label=sprintf("$%.2f",mean_salary)
                )
            )

Now I want to know what are the top 10 most sought after skills in the 500 highest paid jobs

skills <- dataset[1:500,] %>%
    select('skills' = 1) %>%
    separate_rows(skills, sep = " ") %>%
    group_by(skills) %>%
    summarise(count = n()) %>%
    arrange(desc(count))

top_n(skills,n = 20)
## Selecting by count
## # A tibble: 20 × 2
##    skills       count
##    <chr>        <int>
##  1 in             465
##  2 Job            463
##  3 Manager        167
##  4 -              147
##  5 Engineer        78
##  6 Dallas          64
##  7 Project         56
##  8 Senior          53
##  9 Quality         40
## 10 /               32
## 11 Director        32
## 12 Analyst         27
## 13 Developer       25
## 14 Houston         24
## 15 of              24
## 16 Chicago         23
## 17 Cincinnati      22
## 18 Medical         21
## 19 New             19
## 20 Construction    18

Lets remove some common words that are not skills.

skills <- skills[!skills$skills %in% c("in","Job","-","Dallas","/","Houston","of","Chicago","Cincinnati","New"),]
top_n(skills,n = 10)
## Selecting by count
## # A tibble: 10 × 2
##    skills       count
##    <chr>        <int>
##  1 Manager        167
##  2 Engineer        78
##  3 Project         56
##  4 Senior          53
##  5 Quality         40
##  6 Director        32
##  7 Analyst         27
##  8 Developer       25
##  9 Medical         21
## 10 Construction    18

Let’s plot this info

ggplot(data = skills[1:10,]) + 
  geom_point(mapping = aes(y = count, 
                         x = reorder(skills,count),
                         fill = skills),
           stat='identity') +
  labs(title = "Most sought after skills in top 500 jobs", y = "Times", x = "Skill") +
  theme(legend.position = "none")