Load the Linkedin CSV file into R. The first thing we need to tidy is:
The leading and trailing white spaces (including the \n newlines) around many of the values
Linkedin <- read.csv('https://raw.githubusercontent.com/suswong/data-sets/main/Job%20details%20by%20search_LinkedIn.csv')
head(Linkedin,1)
## Keyword Location Job_title
## 1 Data Scientist United States Data Scientist
## Job_link
## 1 https://www.linkedin.com/jobs/view/data-scientist-at-apple-3516925104?refId=3ZlScDidpN8txCbEDGFY%2Fg%3D%3D&trackingId=5A3CpcGPEReSCAw45jZG9Q%3D%3D&position=1&pageNum=0&trk=public_jobs_jserp-result_search-card
## Company
## 1 \n Apple\n
## Company_link
## 1 https://www.linkedin.com/company/apple?trk=public_jobs_topcard-org-name
## Job_location
## 1 \n Cupertino, CA\n
## Post_time
## 1 \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n 17 hours ago\n \n
## Applicants_count
## 1 \n \n \n Over 200 applicants\n \n
## Job_description
## 1 \nSummary\n\nAt Apple, you’ll accomplish truly phenomenal things. If you have a passion for developing insights from real world data, have great attention to detail, and are dedicated to improving customer experience, please join us!\n\n\nThe SWE DA Data Science team analyzes and produces insights from diagnostic and usage data from hundreds of millions of devices every day from all over the world. The insights are used to improve Apple's products and services, to inform strategic directions, and to improve user experience. We are a fast-pace and high-functioning team of Data Scientists that use the latest Big Data technologies to tackle complex, large-scale problems using immense quantities of collected data. We work collaboratively to make impact that changes people’s lives, and we have fun while doing it!\n\n\nKey Qualifications\n\nAdvanced statistics and modeling knowledge.\n\nStrong data visualization skills (e.g., Tableau).\n\nProgramming skills in Python.\n\nDetail-oriented, can identify and fix own bugs, and write code that runs efficiently.\n\nGood experience with applying Big Data technologies (e.g., MapReduce, Hadoop, Spark) to large quantities of data.\n\nGood experience using relational databases and SQL.\n\n\nDescription\n\nYou will work cross-functionally with partners in software engineering, hardware, and marketing. You will use your deep knowledge of data extraction, exploration, and analysis to produce reports and visualizations of critical hardware and software phenomena. 
With the knowledge accumulated, you will build models and validate hypotheses on the uses of our software and devices.\n\n\nEducation & Experience\n\nAdvanced degree in Statistics, Data Mining, Machine Learning, Analytics, Applied Math, Computer Science, Electrical Engineering, Physics, or related fields.\n\n\nAdditional Requirements\n\nStrong interpersonal skills; the ability to understand business requirements and naturally explain complex technical topics to everyone — from data scientists to engineers to product marketing partners to executives.\n\nExcellent understanding of Machine Learning algorithms, including regression, clustering, classification, and other advanced analytic techniques.\n\nSelf-starter and ability to multitask.\n\n\nPay & Benefits\n\nAt Apple, base pay is one part of our total compensation package and is determined within a range. This provides the opportunity to progress as you grow and develop within a role. The base pay range for this role is between $121,000 and $230,000, and your base pay will depend on your skills, qualifications, experience, and location.\n\nApple employees also have the opportunity to become an Apple shareholder through participation in Apple’s discretionary employee stock programs. Apple employees are eligible for discretionary restricted stock unit awards, and can purchase Apple stock at a discount if voluntarily participating in Apple’s Employee Stock Purchase Plan. You’ll also receive benefits including: Comprehensive medical and dental coverage, retirement benefits, a range of discounted products and free services, and for formal education related to advancing your career at Apple, reimbursement for certain educational expenses — including tuition. Additionally, this role might be eligible for discretionary bonuses or commission payments as well as relocation. 
Learn more about Apple Benefits.\n\nNote: Apple benefit, compensation and employee stock programs are subject to eligibility requirements and other terms of the applicable plan or program.\n\n\nRole Number: 200439245\n\n
## Seniority_level Employment_type
## 1 \n Not Applicable\n \n Full-time\n
## Job_function
## 1 \n Engineering and Information Technology\n
## Industries
## 1 \n Computers and Electronics Manufacturing\n
library(DT)
datatable(Linkedin)
We need to remove the leading and trailing white spaces in the following columns:
Company
Job_location
Post_time
Applicants_count
Seniority_level
Employment_type
Job_function
Industries
rem_WS_Linkedin <- Linkedin
# Trim leading and trailing white space in every column of the data frame at once
rem_WS_Linkedin <- data.frame(lapply(rem_WS_Linkedin, trimws), stringsAsFactors = FALSE)
# At first, I trimmed each column manually with str_trim() (commented out below). The following Stack Overflow answer shows how to trim the entire data frame instead: https://stackoverflow.com/questions/20760547/removing-whitespace-from-a-whole-data-frame-in-r
# tidied_Linkedin <- Linkedin
# tidied_Linkedin$Company <- str_trim(tidied_Linkedin$Company)
# tidied_Linkedin$Job_location <- str_trim(tidied_Linkedin$Job_location)
# tidied_Linkedin$Post_time <- str_trim(tidied_Linkedin$Post_time)
# tidied_Linkedin$Applicants_count <- str_trim(tidied_Linkedin$Applicants_count)
# tidied_Linkedin$Seniority_level <- str_trim(tidied_Linkedin$Seniority_level)
# tidied_Linkedin$Employment_type <- str_trim(tidied_Linkedin$Employment_type)
# tidied_Linkedin$Job_function <- str_trim(tidied_Linkedin$Job_function)
# tidied_Linkedin$Industries <- str_trim(tidied_Linkedin$Industries)
head(rem_WS_Linkedin,1)
## Keyword Location Job_title
## 1 Data Scientist United States Data Scientist
## Job_link
## 1 https://www.linkedin.com/jobs/view/data-scientist-at-apple-3516925104?refId=3ZlScDidpN8txCbEDGFY%2Fg%3D%3D&trackingId=5A3CpcGPEReSCAw45jZG9Q%3D%3D&position=1&pageNum=0&trk=public_jobs_jserp-result_search-card
## Company
## 1 Apple
## Company_link
## 1 https://www.linkedin.com/company/apple?trk=public_jobs_topcard-org-name
## Job_location Post_time Applicants_count
## 1 Cupertino, CA 17 hours ago Over 200 applicants
## Job_description
## 1 Summary\n\nAt Apple, you’ll accomplish truly phenomenal things. If you have a passion for developing insights from real world data, have great attention to detail, and are dedicated to improving customer experience, please join us!\n\n\nThe SWE DA Data Science team analyzes and produces insights from diagnostic and usage data from hundreds of millions of devices every day from all over the world. The insights are used to improve Apple's products and services, to inform strategic directions, and to improve user experience. We are a fast-pace and high-functioning team of Data Scientists that use the latest Big Data technologies to tackle complex, large-scale problems using immense quantities of collected data. We work collaboratively to make impact that changes people’s lives, and we have fun while doing it!\n\n\nKey Qualifications\n\nAdvanced statistics and modeling knowledge.\n\nStrong data visualization skills (e.g., Tableau).\n\nProgramming skills in Python.\n\nDetail-oriented, can identify and fix own bugs, and write code that runs efficiently.\n\nGood experience with applying Big Data technologies (e.g., MapReduce, Hadoop, Spark) to large quantities of data.\n\nGood experience using relational databases and SQL.\n\n\nDescription\n\nYou will work cross-functionally with partners in software engineering, hardware, and marketing. You will use your deep knowledge of data extraction, exploration, and analysis to produce reports and visualizations of critical hardware and software phenomena. 
With the knowledge accumulated, you will build models and validate hypotheses on the uses of our software and devices.\n\n\nEducation & Experience\n\nAdvanced degree in Statistics, Data Mining, Machine Learning, Analytics, Applied Math, Computer Science, Electrical Engineering, Physics, or related fields.\n\n\nAdditional Requirements\n\nStrong interpersonal skills; the ability to understand business requirements and naturally explain complex technical topics to everyone — from data scientists to engineers to product marketing partners to executives.\n\nExcellent understanding of Machine Learning algorithms, including regression, clustering, classification, and other advanced analytic techniques.\n\nSelf-starter and ability to multitask.\n\n\nPay & Benefits\n\nAt Apple, base pay is one part of our total compensation package and is determined within a range. This provides the opportunity to progress as you grow and develop within a role. The base pay range for this role is between $121,000 and $230,000, and your base pay will depend on your skills, qualifications, experience, and location.\n\nApple employees also have the opportunity to become an Apple shareholder through participation in Apple’s discretionary employee stock programs. Apple employees are eligible for discretionary restricted stock unit awards, and can purchase Apple stock at a discount if voluntarily participating in Apple’s Employee Stock Purchase Plan. You’ll also receive benefits including: Comprehensive medical and dental coverage, retirement benefits, a range of discounted products and free services, and for formal education related to advancing your career at Apple, reimbursement for certain educational expenses — including tuition. Additionally, this role might be eligible for discretionary bonuses or commission payments as well as relocation. 
Learn more about Apple Benefits.\n\nNote: Apple benefit, compensation and employee stock programs are subject to eligibility requirements and other terms of the applicable plan or program.\n\n\nRole Number: 200439245
## Seniority_level Employment_type Job_function
## 1 Not Applicable Full-time Engineering and Information Technology
## Industries
## 1 Computers and Electronics Manufacturing
datatable(rem_WS_Linkedin)
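One detail of the whole-data-frame approach is that trimws() returns character vectors, so any numeric column passed through it comes back as text. The Linkedin columns shown above all look like text, so this may not matter here. The following is a minimal sketch, on a made-up data frame (the names toy and Applicants are hypothetical), of trimming only the character columns:

```r
# Toy data frame (made-up values) with padded strings and one numeric column
toy <- data.frame(
  Company = c("\n  Apple\n  ", "  Meta "),
  Applicants = c(200, 35),
  stringsAsFactors = FALSE
)

# Find the character columns and trim only those,
# so the numeric column keeps its type
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], trimws)

str(toy)
```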
Most of the values in the ‘Job_location’ column contain both a city and a state, separated by a comma. We need to split ‘Job_location’ at that comma.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(stringr)
split_location_Linkedin <- rem_WS_Linkedin
# tidied_Linkedin %>%
# separate(Job_location,c("City","State"),sep=",")
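# Split at the first comma; the pattern also consumes the space after it, so the state is "CA" rather than " CA"
split_location_Linkedin[c('Job_location_City', 'Job_location_State')] <- str_split_fixed(split_location_Linkedin$Job_location, ',\\s*', 2)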
#colnames(split_location_Linkedin)
# Drop the 'Job_location' and 'Location' columns by not including them in the new table
tidied_Linkedin <- split_location_Linkedin[c('Keyword', 'Job_title', 'Job_link', 'Company','Company_link','Job_location_City','Job_location_State','Post_time','Applicants_count','Seniority_level','Employment_type','Job_function','Industries','Job_description')]
datatable(tidied_Linkedin)
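The behaviour of the comma split can be sketched on a small made-up vector (loc and parts are hypothetical names, not part of the analysis). The pattern used here also consumes any spaces after the comma, and a value without a comma ends up entirely in the first column with an empty string in the second:

```r
library(stringr)

# Made-up locations: two "City, ST" values and one state-only value
loc <- c("Cupertino, CA", "Austin, TX", "California")

# Split at the first comma and any spaces that follow it;
# n = 2 always returns a two-column character matrix
parts <- str_split_fixed(loc, ",\\s*", 2)
colnames(parts) <- c("City", "State")

parts
```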
Some values in the raw file contain only the state and no city. When I split the ‘Job_location’ column at the comma, those state names were placed in the city column, so the corresponding values in ‘Job_location_State’ are missing. I fill them in below.
#unique(tidied_Linkedin$Job_location_State)
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "New York"] <- "NY"
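# 'Utica-Rome Area' carries no state in the raw data, so store a true missing value rather than the string "NA"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Utica-Rome Area"] <- NA_character_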
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "California"] <- "CA"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "North Carolina"] <- "NC"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Florida"] <- "FL"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Texas"] <- "TX"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Minnesota"] <- "MN"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Ohio"] <- "OH"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Des Moines Metropolitan Area"] <- "IA"
datatable(tidied_Linkedin)
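The state-name assignments above could also be done with a lookup instead of one line per state. The following is a minimal sketch on made-up rows (the jobs data frame is hypothetical), using the built-in base R vectors state.name and state.abb. Metropolitan areas such as ‘Des Moines Metropolitan Area’ are not state names, so they would still need manual handling:

```r
# Made-up rows mimicking the result of the comma split
jobs <- data.frame(
  Job_location_City  = c("Cupertino", "California", "Texas", "Utica-Rome Area"),
  Job_location_State = c("CA", "", "", ""),
  stringsAsFactors = FALSE
)

# state.name and state.abb are built-in vectors in matching order
idx <- match(jobs$Job_location_City, state.name)
is_state <- !is.na(idx)

# Fill in the abbreviation only where the "city" is really a state name
jobs$Job_location_State[is_state] <- state.abb[idx[is_state]]

jobs
```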
#install.packages("XQuartz", dependencies = TRUE)
#install.packages('priceR')
#library(priceR)
#tidied_Linkedin$salary <- extract_salary(tidied_Linkedin$Job_description, salary_range_handling = "min")
glassdoor <- read.csv('https://raw.githubusercontent.com/suswong/data-sets/main/Job%20listing_Glassdoor.csv')
The ‘Location’ column does not contain any values, so we need to remove it. The job location is stored in the ‘Place’ column instead.
rem_location <- glassdoor[,-2] #Remove the 2nd column, which is 'Location'
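Dropping a column by position works as long as the column order never changes. As an alternative, here is a minimal sketch on a made-up data frame (toy and is_empty are hypothetical names) that checks which columns are actually empty and drops those:

```r
# Made-up frame with an entirely empty 'Location' column
toy <- data.frame(
  Keyword  = c("Data Scientist", "Data Analyst"),
  Location = c(NA, NA),
  Place    = c("New York, NY", "Austin, TX"),
  stringsAsFactors = FALSE
)

# A column counts as empty when every value is NA or a blank string
is_empty <- vapply(
  toy,
  function(col) all(is.na(col) | trimws(as.character(col)) == ""),
  logical(1)
)

# Drop with a logical mask computed from the data rather than by position
toy <- toy[, !is_empty, drop = FALSE]

names(toy)
```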
We split the ‘Place’ column into ‘Job_location_City’ and ‘Job_location_State’, then rearrange the columns. In the process, I dropped the ‘Page’, ‘Place’, and ‘salary’ columns. The ‘Page’ column records which page of the Glassdoor search results the job posting was found on, which is not necessary for our analysis.
library(stringr)
split_Place_Glassdoor <- rem_location
# split_Place_Glassdoor %>%
# separate(Place,c("City","State"),sep=",")
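# Split at the first comma; the pattern also consumes the space after it, so the state is "CA" rather than " CA"
split_Place_Glassdoor[c('Job_location_City', 'Job_location_State')] <- str_split_fixed(split_Place_Glassdoor$Place, ',\\s*', 2)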
colnames(split_Place_Glassdoor)
## [1] "Keyword" "Page" "company"
## [4] "rating" "Job_title" "Place"
## [7] "salary" "post_date" "Job_description"
## [10] "Job_location_City" "Job_location_State"
# Drop the 'Place', 'Page', and 'salary' columns by not including them in the new table
tidied_glassdoor <- split_Place_Glassdoor[c('Keyword','Job_title', 'company','rating','Job_location_City','Job_location_State','post_date','Job_description')]
To combine both datasets into one data frame, the columns they share must have the same names. I also want a column that indicates whether the job posting came from Glassdoor or Linkedin. Since we do not need the ‘Keyword’ column, I replaced all of its values with ‘Glassdoor’ or ‘Linkedin’ and renamed the column to ‘Search_Engine’.
final_glassdoor <- tidied_glassdoor
final_glassdoor$Keyword <- "Glassdoor"
colnames(final_glassdoor)[1] <- "Search_Engine"
colnames(final_glassdoor)
## [1] "Search_Engine" "Job_title" "company"
## [4] "rating" "Job_location_City" "Job_location_State"
## [7] "post_date" "Job_description"
final_Linkedin <- tidied_Linkedin
final_Linkedin$Keyword <- "Linkedin"
colnames(final_Linkedin)[1] <- "Search_Engine"
colnames(final_Linkedin)
## [1] "Search_Engine" "Job_title" "Job_link"
## [4] "Company" "Company_link" "Job_location_City"
## [7] "Job_location_State" "Post_time" "Applicants_count"
## [10] "Seniority_level" "Employment_type" "Job_function"
## [13] "Industries" "Job_description"
colnames(final_glassdoor) <- c('Search_Engine','Job_title', 'Company','Rating','Job_location_City','Job_location_State','post_date','Job_description')
final_glassdoor <- final_glassdoor[c('Search_Engine','Job_title', 'Company','Job_location_City','Job_location_State','Job_description')]
final_Linkedin <- final_Linkedin[c('Search_Engine','Job_title', 'Company','Job_location_City','Job_location_State','Job_description')]
total <- rbind(final_Linkedin, final_glassdoor)
write.csv(total, "C:\\Desktop\\DATA 607\\Combined_Linkedin_Glassdoor_Version1.csv", row.names=FALSE)
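rbind() only stacks data frames whose column names match, which is why the columns were renamed and reduced to a common set above. The following is a minimal sketch of that alignment on made-up stand-ins (the linkedin, glassdoor, shared, and combined objects and the company ‘Acme’ are hypothetical), using intersect() to find the shared columns instead of typing them out:

```r
# Made-up stand-ins for the two tidied data sets
linkedin <- data.frame(
  Job_title = "Data Scientist", Company = "Apple",
  Post_time = "17 hours ago", stringsAsFactors = FALSE
)
glassdoor <- data.frame(
  Job_title = "Data Analyst", company = "Acme", rating = 4.1,
  stringsAsFactors = FALSE
)

# Harmonise the differing column name, then tag each row with its source
names(glassdoor)[names(glassdoor) == "company"] <- "Company"
linkedin$Search_Engine  <- "Linkedin"
glassdoor$Search_Engine <- "Glassdoor"

# rbind() needs identical column names, so keep only the shared columns
shared <- intersect(names(linkedin), names(glassdoor))
combined <- rbind(linkedin[shared], glassdoor[shared])

table(combined$Search_Engine)
```

A quick table() of ‘Search_Engine’ on the combined data is an easy way to confirm that rows from both sources made it into the result.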