Introduction

The tidying part of our project 3 was split into two parts. In Project 3 Tidying Part 1 , we did the following:

  • Remove leading and trailing white spaces for multiple columns

  • Split the Job location column into two columns: city and state

  • Fill in the missing values in the state column.

  • Remove duplicated job posting. This part was particularly tricky because multiple columns contained duplicates.

However, the dataframe needs further tidying.

Tidying Part 2

Load Packages and CSV file

The LinkedIn dataframe contains 2313 job postings anf 14 columns.

In tidying part 2, we continue to tidy the following.

  • Remove columns that are needed for analysis

  • Tidy the job location columns. This included removiing white space from the job location column.

  • Remove duplicated columns that had the same job description, company and city. Remote was created by extracting it from the description column.

  • Create new columns for degree qualification, job setting, (Remote) and salary. Education and salary columns we created by search for those items in the description column. Extracting salary information proved to be the most challenging, as it was often written in different formats in the description column. If the salary was missing in the description column we used the mice package to impute missing values using Predictive Mean Matching (PMM)

-We also renamed some of the columns.

library(dplyr)
library(stringr) 
library(tidyr)
library(knitr)
library(priceR)
library(rex)

library(mice)
library(DT)
data <- read.csv("https://raw.githubusercontent.com/suswong/DATA-607-Project-3/main/tidied_Linkedin_version4.csv") 

datatable(head(data, 50),
  plugins = "ellipsis",
  options = list(scrollX = TRUE,
    columnDefs = list(list(
      targets = "_all",
      render = JS("$.fn.dataTable.render.ellipsis(30, false )")
    ))
  )
)

Remove Columns

For our analysis, we do not need the following columns. They will be removed.

  • ‘Keyword’: This column contains the keyword used to search for the job posting

  • Job_link: This column contains the job link.

  • ‘Company_ink’: This column contains the company link

  • ‘Post_time’: This column contains the post time of the job

  • ‘Applicants_count’: This column contains the number of applicants

data$Keyword <- NULL
data$Company_link <- NULL
data$Post_time <- NULL
data$Applicants_count <- NULL
data$Education <- NULL
data$Remote <- NULL
data$R <- NULL
data$Salaries <- NULL
data$Search_Engine <- NULL
data$Job_link <- NULL

datatable(head(data, 50),
  plugins = "ellipsis",
  options = list(scrollX = TRUE,
    columnDefs = list(list(
      targets = "_all",
      render = JS("$.fn.dataTable.render.ellipsis(30, false )")
    ))
  )
)

Remove Duplicate Job Description

Although we did remove duplicates in tidying part 1, we were not able to remove all of them due to the different spacing in the job postings. I trimmed the spacing and used the ‘duplicated’ function to remove other duplicated job posting. After removing the duplicated job postings, there are 1303 job postings.

data <- data[!duplicated(data[,1:3]),] 
# Reset Index

rownames(data) <- 1:nrow(data)

datatable(head(data, 50),
  plugins = "ellipsis",
  options = list(scrollX = TRUE,
    columnDefs = list(list(
      targets = "_all",
      render = JS("$.fn.dataTable.render.ellipsis(30, false )")
    ))
  )
)

Tidy Job location

Rename City Name

There were two labels for “New York City”: New York and New York City City Metropolitan Area. Both labels were renamed as “New York City” in the ‘City’ column.

data <- data %>% 
  mutate(across('Job_location_City', str_replace, 'New York', 'New York City'))
data <- data %>% 
  mutate(across('Job_location_City', str_replace, 'New York City City Metropolitan Area
', 'New York City'))

datatable(head(data, 50),
  plugins = "ellipsis",
  options = list(scrollX = TRUE,
    columnDefs = list(list(
      targets = "_all",
      render = JS("$.fn.dataTable.render.ellipsis(30, false )")
    ))
  )
)

Remove Leading Space

In tidying part 1, we split the city and state in the location column by its comma. However, there is a apace after the comma, which we did not remove. I used the ‘str_trim’ to remove the white spaces from the start and end of each value in the state column.

data$Job_location_State <- str_trim(data$Job_location_State)
data$Job_description <- str_trim(data$Job_description)

datatable(head(data, 50),
  plugins = "ellipsis",
  options = list(scrollX = TRUE,
    columnDefs = list(list(
      targets = "_all",
      render = JS("$.fn.dataTable.render.ellipsis(30, false )")
    ))
  )
)

Rename the columns of the dataset

data <- data |> rename_('City' = 'Job_location_City', 'State' = 'Job_location_State', 'Description' = 'Job_description', 'Titile' = 'Job_title', 'Industry' = 'Job_function', 'Function' = 'Industries')

datatable(head(data, 50),
  plugins = "ellipsis",
  options = list(scrollX = TRUE,
    columnDefs = list(list(
      targets = "_all",
      render = JS("$.fn.dataTable.render.ellipsis(30, false )")
    ))
  )
)

Empty Values in State Column

In the raw web scrapped data, the city and state job location was under one column: ‘Job_location”. Some job posting did not have a specific city and state value. Only ’United States’ was recorded. When we split the city and state by the comma, the value ‘United States’ was recorded in the ‘city’ column and no value was recorded in the ‘state’ column.

data <- data[!((data$Company=='') | data$Industry=="") | (data$Description==''), ]

data[data == ''] <- NA

data$State <- data$State %>% replace_na('US')
 
data$State <- gsub("United States", "US", data$State)

datatable(head(data, 50),
  plugins = "ellipsis",
  options = list(scrollX = TRUE,
    columnDefs = list(list(
      targets = "_all",
      render = JS("$.fn.dataTable.render.ellipsis(30, false )")
    ))
  )
)

Create New Columns

Create a column for Degree Qualification

A new column was created for the degree qualification. This information was extracted from the job description column.

data$Education <- sapply(data$Description, FUN = function(x){
  # Use base R regex functions to for conditions, and return values for new column
  if (grepl("bachelor", tolower(x))){
    return("Bachelor")
  }else if (grepl("Bachelor's", tolower(x))){
    return("Bachelor")
  }else if (grepl("bachelor's", toupper(x))){
    return("Bachelor")
  }else if (grepl("master", tolower(x))){
    return("Masters")
  }else if (grepl("Master's", tolower(x))){
    return("Masters")
  }else if (grepl("PhD", tolower(x))){
    return("Doctorate")
  }else{
    return("Not Stated")
  }
})

datatable(head(data, 50),
  plugins = "ellipsis",
  options = list(scrollX = TRUE,
    columnDefs = list(list(
      targets = "_all",
      render = JS("$.fn.dataTable.render.ellipsis(30, false )")
    ))
  )
)

Create a column for job setting

A new column was created for the job setting. This information was extracted from the job description column. A ‘yes’ was recorded if ‘remote’ was within the job description column and ‘no’ otherwise.

data$Remote <- sapply(data$Description, FUN = function(x){
  
  if (grepl("remote", tolower(x))){
    return("Yes")
  }else if (grepl("Remote", tolower(x))){
    return("Yes")
  }else{
    return("No")
  }
})

datatable(head(data, 50),
  plugins = "ellipsis",
  options = list(scrollX = TRUE,
    columnDefs = list(list(
      targets = "_all",
      render = JS("$.fn.dataTable.render.ellipsis(30, false )")
    ))
  )
)

Create a column for salary

First we tackle this problem by searching for number patterns using the regular expressions in the description column. This results in some false positive, so we then use the R price package to filter out non-salary numbers. Some rows were missing salary information, so we use the MICE package to impute missing values using Predictive Mean Matching. This imputation method is very popular and is simillar in concept to K-nearest neigbor but is exclusive to numeric attributes.

data$Salary <- str_extract_all(data$Description, "(\\d{1,4}((\\,|\\.|\\s)(\\d00)){1,2}|(\\d{3,7}))")

data$Salary[3] <- "62,000"

data$Salary[2] <- "110,000"

data$Salary[12] <- "100,000"

data$Salary[9] <- "60,000"
data$Salary <- extract_salary(data$Salary, exclude_below = 20000, exclude_above = 240000)
## Warning in fun(libname, pkgname): couldn't connect to display ":0"
data$Salary <- data$Salary$salary
 
data <- mice(data, method = 'pmm')
## 
##  iter imp variable
##   1   1  Salary
##   1   2  Salary
##   1   3  Salary
##   1   4  Salary
##   1   5  Salary
##   2   1  Salary
##   2   2  Salary
##   2   3  Salary
##   2   4  Salary
##   2   5  Salary
##   3   1  Salary
##   3   2  Salary
##   3   3  Salary
##   3   4  Salary
##   3   5  Salary
##   4   1  Salary
##   4   2  Salary
##   4   3  Salary
##   4   4  Salary
##   4   5  Salary
##   5   1  Salary
##   5   2  Salary
##   5   3  Salary
##   5   4  Salary
##   5   5  Salary
## Warning: Number of logged events: 11
data <- complete(data,2)

datatable(head(data, 50),
  plugins = "ellipsis",
  options = list(scrollX = TRUE,
    columnDefs = list(list(
      targets = "_all",
      render = JS("$.fn.dataTable.render.ellipsis(30, false )")
    ))
  )
)