Introduction

Our initial dataset (‘Raw_Data_Linkedin1’) contained a concerning number of duplicates. Throughout the week, I used Octoparse to web scrape job postings in the United States, searching on several keywords: “data”, “data analyst”, “data scientist”, and “data engineer”. I then combined all of the files and removed the duplicates using ‘distinct()’ from the ‘dplyr’ package.

LinkedIn CSV Files

All of the files contain the same attributes: “Keyword”, “Location”, “Job_title”, “Job_link”, “Company”, “Company_link”, “Job_location”, “Post_time”, “Applicants_count”, “Job_description”, “Seniority_level”, “Employment_type”, “Job_function”, and “Industries”.

library(dplyr)
Raw_Data_Linkedin1 <- read.csv('https://raw.githubusercontent.com/suswong/Project-3-raw-dataset/main/Job%20details%20by%20search_LinkedIn%20version%202(1).csv')
Raw_Data_Linkedin2 <- read.csv('https://raw.githubusercontent.com/suswong/data-sets/main/LinkedIn%20Version%203.csv')
Raw_Data_Linkedin3 <- read.csv('https://raw.githubusercontent.com/suswong/Project-3-raw-dataset/main/LinkedIn_data_scientist.csv')
Raw_Data_Linkedin4 <- read.csv('https://raw.githubusercontent.com/suswong/Project-3-raw-dataset/main/LinkedIn_data%20engineer.csv')
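
All four files should share the same schema, so it is worth a quick sanity check that the column names line up before combining anything. The sketch below assumes the reads above succeeded; the ‘raw_files’ list is only a convenience name for this check.

# Sanity check: confirm the four data frames share the same column names
raw_files <- list(Raw_Data_Linkedin1, Raw_Data_Linkedin2, Raw_Data_Linkedin3, Raw_Data_Linkedin4)
sapply(raw_files, ncol)                        # column counts should all match
length(unique(lapply(raw_files, names))) == 1  # TRUE if every file has identical columns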

LinkedIn Dataset 1

Raw_Data_Linkedin1 has 5059 job postings. We scraped these job postings using the keyword “data”.

library(DT)
datatable(head(Raw_Data_Linkedin1, 50),
  plugins = "ellipsis",
  options = list(scrollX = TRUE,
    columnDefs = list(list(
      targets = "_all",
      render = JS("$.fn.dataTable.render.ellipsis(30, false )")
    ))
  )
)
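
The “Keyword” column records the search term behind each posting, so the keyword claims above can be spot-checked. A minimal sketch, assuming the column is populated as exported by Octoparse (the same check works for the other three data frames):

# Tabulate the search keyword recorded for each posting in dataset 1
table(Raw_Data_Linkedin1$Keyword, useNA = "ifany")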

LinkedIn Dataset 2

Raw_Data_Linkedin2 has 522 job postings. We scraped these job postings using the keyword “data”.

library(DT)
datatable(head(Raw_Data_Linkedin2, 50),
  plugins = "ellipsis",
  options = list(scrollX = TRUE,
    columnDefs = list(list(
      targets = "_all",
      render = JS("$.fn.dataTable.render.ellipsis(30, false )")
    ))
  )
)

LinkedIn Dataset 3

Raw_Data_Linkedin3 has 457 job postings. We scraped these job postings using the keyword “data scientist”.

library(DT)
datatable(head(Raw_Data_Linkedin3, 50),
  plugins = "ellipsis",
  options = list(scrollX = TRUE,
    columnDefs = list(list(
      targets = "_all",
      render = JS("$.fn.dataTable.render.ellipsis(30, false )")
    ))
  )
)

LinkedIn Dataset 4

Raw_Data_Linkedin4 has 741 job postings. We scraped these job postings using the keyword “data engineer”.

library(DT)
datatable(head(Raw_Data_Linkedin4, 50),
  plugins = "ellipsis",
  options = list(scrollX = TRUE,
    columnDefs = list(list(
      targets = "_all",
      render = JS("$.fn.dataTable.render.ellipsis(30, false )")
    ))
  )
)
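
The row counts quoted above can also be re-checked directly before combining. A short sketch using base ‘nrow()’:

# Re-check the number of job postings in each raw data frame
sapply(list(Linkedin1 = Raw_Data_Linkedin1,
            Linkedin2 = Raw_Data_Linkedin2,
            Linkedin3 = Raw_Data_Linkedin3,
            Linkedin4 = Raw_Data_Linkedin4), nrow)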

Combine the Dataframe

I used the base R ‘rbind()’ function to combine the four dataframes, which resulted in 7236 job postings. Then, I removed the duplicates using ‘distinct()’ from the ‘dplyr’ package. As a result, there are 2313 job postings from LinkedIn. Finally, I wrote the deduplicated dataset to a CSV file to be tidied in the next step.

total <- rbind(Raw_Data_Linkedin1, Raw_Data_Linkedin2, Raw_Data_Linkedin3, Raw_Data_Linkedin4)

distinct_total <- distinct(total)

write.csv(distinct_total, "C:\\Desktop\\DATA 607\\distinct_total_Linkedin.csv", row.names=FALSE)
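
To quantify how many duplicate postings were dropped, the before-and-after row counts can be compared. A small sketch based on the ‘total’ and ‘distinct_total’ objects created above:

# Compare row counts before and after deduplication
nrow(total)                          # combined postings across all four files
nrow(distinct_total)                 # unique postings kept by distinct()
nrow(total) - nrow(distinct_total)   # number of duplicate rows removed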