The goal of this project is to use data to answer the question, “Which are the most valued data science skills?” Using Octoparse and Parsehub, we acquired job postings from LinkedIn and Glassdoor.
https://rpubs.com/sussing/1017226
The LinkedIn dataframe contains 2,313 job postings and 14 columns: “Keyword”, “Location”, “Job_title”, “Job_link”, “Company”, “Company_link”, “Job_location”, “Post_time”, “Applicants_count”, “Job_description”, “Seniority_level”, “Employment_type”, “Job_function”, and “Industries”.
Linkedin <- read.csv('https://raw.githubusercontent.com/suswong/DATA-607-Project-3/main/distinct_total_Linkedin.csv')
library(DT)
datatable(head(Linkedin, 50),
          plugins = "ellipsis",
          options = list(
            scrollX = TRUE,
            columnDefs = list(list(
              targets = "_all",
              render = JS("$.fn.dataTable.render.ellipsis(30, false)")
            ))
          ))
The Glassdoor dataframe contains 1,320 job postings and 10 columns: “Keyword”, “Location”, “Page”, “company”, “rating”, “Job_title”, “Place”, “salary”, “post_date”, and “Job_description”.
glassdoor <- read.csv('https://raw.githubusercontent.com/suswong/DATA-607-Project-3/main/distinct_total_glassdoor.csv')
datatable(head(glassdoor, 50),
          plugins = "ellipsis",
          options = list(
            scrollX = TRUE,
            columnDefs = list(list(
              targets = "_all",
              render = JS("$.fn.dataTable.render.ellipsis(30, false)")
            ))
          ))
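As a quick sanity check, dim() confirms the counts quoted above:

# Confirm the dimensions stated above (2313 x 14 and 1320 x 10)
dim(Linkedin)
dim(glassdoor)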
Below are some things we need to tidy:
Remove leading and trailing white spaces
Split the ‘Job_location’ column into two columns: city and state
Fill in missing values in the state column
We need to remove the leading and trailing white spaces in the following columns:
Company
Job_location
Applicants_count
Seniority_level
Employment_type
Job_function
Industries
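Rather than spotting these columns by eye, a short sketch like the following flags every column whose values carry leading or trailing whitespace:

# Flag columns with leading/trailing whitespace; the POSIX class
# [[:space:]] also matches the '\n' characters visible in the output below
has_ws <- sapply(Linkedin, function(x) any(grepl("^[[:space:]]|[[:space:]]$", x)))
names(Linkedin)[has_ws]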
There are leading and trailing white spaces in several columns. In the raw output below they appear as ‘\n’ characters and runs of blanks.
head(Linkedin,1)
## Keyword Location Job_title
## 1 Data United States Data Analyst
## Job_link
## 1 https://www.linkedin.com/jobs/view/data-analyst-at-young-life-3521907674?refId=j1Bxa6%2F%2BWbkCr9i%2FQ091RQ%3D%3D&trackingId=vi3sxWiThZ5AMIhSHt0oWQ%3D%3D&position=1&pageNum=0&trk=public_jobs_jserp-result_search-card
## Company
## 1 \n Young Life\n
## Company_link
## 1 https://www.linkedin.com/company/young-life?trk=public_jobs_topcard-org-name
## Job_location
## 1 \n United States\n
## Post_time
## 1 \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n 1 day ago\n \n
## Applicants_count
## 1 \n \n \n Be among the first 25 applicants\n \n
## Job_description
## 1 Mission/Authority\n\nServe mission leadership and staff by curating and analyzing key mission data to deliver insights and decision support to key Young Life stakeholders. Alongside technical and information leaders, deliver the data necessary to support effective decisions, discover efficiencies in mission data systems, and evaluate business data to determine effective analytics solutions.\n\nEssential Duties\n\nBuild and maintain reporting and analytics is Workday, Salesforce, and Power BI related to Young Life’s Finance, Human Resources, Learning, and Product Development Team.\nAssist in growing data literacy and self-service reporting capabilities for Young Life Finance, Human Resources, Learning, and Field organizations.\nInterpret, analyze, and utilize statistical analysis techniques to discover patterns and key insights in mission critical data.\nExecute the curation of data delivered by the data team and exploratory data for analytics consumption and deployment.\nOrganize and optimize all data residing in mission data schemas.\nSupport department embedded analysts by delivering new data sets to meet their reporting and analytics needs.\nAssist the Data Architect in developing the data models necessary to support a wide variety of information needs.\nCollaborate with platform and data resources to ensure a high level of data availability and performance for all datasets used in reporting and analytics.\nCollaborate with the data quality staff to ensure clean and accurate data for reporting and analytical models.\nCollaborate with key business staff and leaders to design and build analytical solutions to meet the operational reporting and decision support need of the mission.\nAssist in building a business intelligence center of excellence to support a wide variety of analytical and reporting needs for mission leadership and staff.\nAssist in pioneering data activities in response to prospective use cases driven by IS and Strategic Leadership.\nEngage in assigned data related projects across the analytics and data ecosystem.\n\nWorking Relationships\n\nMultiple regular interactions with key technical and ministry operations stakeholders to maintain a sense of awareness and understanding of current ministry needs.\nRegular involvement in the cross-departmental teams.\nCollaborate with Project Management to provide data expertise and perspective on key mission projects.\nAssists the Director of Mission Analytics and Decision Support in furthering the goals of the mission.\n\nEducation\n\nBachelor’s, Master’s Degree, or equivalent experience in a business, technical, or data related field required.\n\nWorking Conditions\n\nOffice Environment at the Service Center in Colorado Springs, CO with hybrid or remote options.\nTravel 1-3 times a year may be required.\nThis position offers a competitive salary range of $59,800 - $74,749 annually commensurate with relevant experience, qualifications, and education.\n\nQualifications Required For The Job\n\n5-7 years minimum work experience.\nProficiency in modern business intelligence and enterprise platform tools is required (Workday, Salesforce, Power BI, Excel, or similar).\nStrong working knowledge of relational databases and/or SQL is required.\nExperience with Snowflake and Python is preferred.\n3 or more years of related data, business intelligence, or data analyst experience required.\nHighly organized, self-motivated, and attentive to detail.\nStrong interpersonal, communication, and problem-solving skills 
needed.\n\nApplication\n\nPlease submit your application, resume, and cover letter by April 3rd, 2023.
## Seniority_level Employment_type
## 1 \n Entry level\n \n Full-time\n
## Job_function
## 1 \n Strategy/Planning and Information Technology\n
## Industries
## 1 \n Religious Institutions\n
The leading and trailing white spaces, including the ‘\n’ characters, are removed using trimws().
rem_WS_Linkedin <- Linkedin
# Trim leading/trailing whitespace across the whole dataframe in one pass.
# At first, I removed the white spaces manually, column by column (see the
# commented-out code below); the link shows the whole-dataframe approach:
# https://stackoverflow.com/questions/20760547/removing-whitespace-from-a-whole-data-frame-in-r
rem_WS_Linkedin <- data.frame(lapply(rem_WS_Linkedin, trimws), stringsAsFactors = FALSE)
# tidied_Linkedin <- Linkedin
# tidied_Linkedin$Company <- str_trim(tidied_Linkedin$Company)
# tidied_Linkedin$Job_location <- str_trim(tidied_Linkedin$Job_location)
# tidied_Linkedin$Post_time <- str_trim(tidied_Linkedin$Post_time)
# tidied_Linkedin$Applicants_count <- str_trim(tidied_Linkedin$Applicants_count)
# tidied_Linkedin$Seniority_level <- str_trim(tidied_Linkedin$Seniority_level)
# tidied_Linkedin$Employment_type <- str_trim(tidied_Linkedin$Employment_type)
# tidied_Linkedin$Job_function <- str_trim(tidied_Linkedin$Job_function)
# tidied_Linkedin$Industries <- str_trim(tidied_Linkedin$Industries)
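For reference, the same whole-dataframe trim can be written with dplyr (a sketch, assuming dplyr 1.0+ for across()); it is left commented out since the lapply() version above already does the job:

# rem_WS_Linkedin <- Linkedin %>%
#   mutate(across(everything(), trimws))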
head(rem_WS_Linkedin,1)
## Keyword Location Job_title
## 1 Data United States Data Analyst
## Job_link
## 1 https://www.linkedin.com/jobs/view/data-analyst-at-young-life-3521907674?refId=j1Bxa6%2F%2BWbkCr9i%2FQ091RQ%3D%3D&trackingId=vi3sxWiThZ5AMIhSHt0oWQ%3D%3D&position=1&pageNum=0&trk=public_jobs_jserp-result_search-card
## Company
## 1 Young Life
## Company_link
## 1 https://www.linkedin.com/company/young-life?trk=public_jobs_topcard-org-name
## Job_location Post_time Applicants_count
## 1 United States 1 day ago Be among the first 25 applicants
## Job_description
## 1 Mission/Authority\n\nServe mission leadership and staff by curating and analyzing key mission data to deliver insights and decision support to key Young Life stakeholders. Alongside technical and information leaders, deliver the data necessary to support effective decisions, discover efficiencies in mission data systems, and evaluate business data to determine effective analytics solutions.\n\nEssential Duties\n\nBuild and maintain reporting and analytics is Workday, Salesforce, and Power BI related to Young Life’s Finance, Human Resources, Learning, and Product Development Team.\nAssist in growing data literacy and self-service reporting capabilities for Young Life Finance, Human Resources, Learning, and Field organizations.\nInterpret, analyze, and utilize statistical analysis techniques to discover patterns and key insights in mission critical data.\nExecute the curation of data delivered by the data team and exploratory data for analytics consumption and deployment.\nOrganize and optimize all data residing in mission data schemas.\nSupport department embedded analysts by delivering new data sets to meet their reporting and analytics needs.\nAssist the Data Architect in developing the data models necessary to support a wide variety of information needs.\nCollaborate with platform and data resources to ensure a high level of data availability and performance for all datasets used in reporting and analytics.\nCollaborate with the data quality staff to ensure clean and accurate data for reporting and analytical models.\nCollaborate with key business staff and leaders to design and build analytical solutions to meet the operational reporting and decision support need of the mission.\nAssist in building a business intelligence center of excellence to support a wide variety of analytical and reporting needs for mission leadership and staff.\nAssist in pioneering data activities in response to prospective use cases driven by IS and Strategic Leadership.\nEngage in assigned data related projects across the analytics and data ecosystem.\n\nWorking Relationships\n\nMultiple regular interactions with key technical and ministry operations stakeholders to maintain a sense of awareness and understanding of current ministry needs.\nRegular involvement in the cross-departmental teams.\nCollaborate with Project Management to provide data expertise and perspective on key mission projects.\nAssists the Director of Mission Analytics and Decision Support in furthering the goals of the mission.\n\nEducation\n\nBachelor’s, Master’s Degree, or equivalent experience in a business, technical, or data related field required.\n\nWorking Conditions\n\nOffice Environment at the Service Center in Colorado Springs, CO with hybrid or remote options.\nTravel 1-3 times a year may be required.\nThis position offers a competitive salary range of $59,800 - $74,749 annually commensurate with relevant experience, qualifications, and education.\n\nQualifications Required For The Job\n\n5-7 years minimum work experience.\nProficiency in modern business intelligence and enterprise platform tools is required (Workday, Salesforce, Power BI, Excel, or similar).\nStrong working knowledge of relational databases and/or SQL is required.\nExperience with Snowflake and Python is preferred.\n3 or more years of related data, business intelligence, or data analyst experience required.\nHighly organized, self-motivated, and attentive to detail.\nStrong interpersonal, communication, and problem-solving skills 
needed.\n\nApplication\n\nPlease submit your application, resume, and cover letter by April 3rd, 2023.
## Seniority_level Employment_type Job_function
## 1 Entry level Full-time Strategy/Planning and Information Technology
## Industries
## 1 Religious Institutions
datatable(head(rem_WS_Linkedin),
          plugins = "ellipsis",
          options = list(
            scrollX = TRUE,
            columnDefs = list(list(
              targets = "_all",
              render = JS("$.fn.dataTable.render.ellipsis(30, false)")
            ))
          ))
Most of the values in the ‘Job_location’ column contain both a city and a state, separated by a comma, so we split the column on the comma. All of the values in the ‘Location’ column are “United States”; since every posting in this dataframe is from the United States, that column adds nothing to the analysis. We then reorder the columns so that the location columns are not last in the dataframe, dropping the ‘Location’ and ‘Job_location’ columns in the process.
library(shiny)
library(dplyr)
library(tidyr)
library(stringr)
split_location_Linkedin <- rem_WS_Linkedin # Create a new dataframe for the next edits
# tidied_Linkedin %>%
#   separate(Job_location, c("City", "State"), sep = ",")
# I ended up not using separate() because I never assigned its result back,
# so the change did not persist.
split_location_Linkedin[c('Job_location_City', 'Job_location_State')] <- str_split_fixed(split_location_Linkedin$Job_location, ',', 2) # Two new columns are appended at the end
#colnames(split_location_Linkedin)
# Drop the 'Location' and 'Job_location' columns
tidied_Linkedin <- split_location_Linkedin[c('Keyword', 'Job_title', 'Job_link', 'Company', 'Company_link',
                                             'Job_location_City', 'Job_location_State', 'Post_time',
                                             'Applicants_count', 'Seniority_level', 'Employment_type',
                                             'Job_function', 'Industries', 'Job_description')]
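As an aside, the separate() approach in the comment above does persist when its result is assigned back; a sketch of that alternative (fill = "right" leaves the state empty when no comma is present):

# split_location_Linkedin <- rem_WS_Linkedin %>%
#   separate(Job_location, c('Job_location_City', 'Job_location_State'),
#            sep = ",", fill = "right", remove = FALSE)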
datatable(head(tidied_Linkedin, 50),
          plugins = "ellipsis",
          options = list(
            scrollX = TRUE,
            columnDefs = list(list(
              targets = "_all",
              render = JS("$.fn.dataTable.render.ellipsis(30, false)")
            ))
          ))
Some values in the raw file contain only a state or metro area and no city. When the ‘Job_location’ column was split on the comma, those values landed in the city column, leaving the corresponding ‘Job_location_State’ entries empty. I manually filled in the missing state abbreviations based on the location provided.
#unique(tidied_Linkedin$Job_location_State)
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "New York"] <- "NY"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "New York City Metropolitan Area"] <- "NY"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Utica-Rome Area"] <- "NY"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Denver Metropolitan Area"] <- "CO"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "California"] <- "CA"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "San Francisco Bay Area"] <- "CA"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Los Angeles Metropolitan Area"] <- "CA"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "North Carolina"] <- "NC"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Florida"] <- "FL"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Texas"] <- "TX"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Dallas-Fort Worth Metroplex"] <- "TX"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Minnesota"] <- "MN"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Ohio"] <- "OH"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Salt Lake City Metropolitan Area"] <- "UT"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Greater Chicago Area"] <- "IA"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Greater Seattle Area"] <- "WA"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Des Moines Metropolitan Area"] <- "IA"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Detroit Metropolitan Area"] <- "MI"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Greater Rockford Area"] <- "IL"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Charlotte Metro"] <- "NC"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Greater Houston"] <- "TX"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Greater St. Louis"] <- "MO"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Greater Boston"] <- "MA"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "San Diego Metropolitan Area"] <- "CA"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Atlanta Metropolitan Area"] <- "GA"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Greater Orlando"] <- "FL"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Greater Milwaukee"] <- "WI"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Washington DC-Baltimore Area"] <- "VA"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Kansas City Metropolitan Area"] <- "MO"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Nashville Metropolitan Area"] <- "TN"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Greater Philadelphia"] <- "PA"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Greater Phoenix Area"] <- "AZ"
tidied_Linkedin$Job_location_State[tidied_Linkedin$Job_location_City == "Cincinnati Metropolitan Area"] <- "OH"
datatable(head(tidied_Linkedin),
          plugins = "ellipsis",
          options = list(
            scrollX = TRUE,
            columnDefs = list(list(
              targets = "_all",
              render = JS("$.fn.dataTable.render.ellipsis(30, false)")
            ))
          ))
Next, we tidy the Glassdoor dataframe loaded above. Below are some things we need to tidy:
Remove any column that does not contain any value
Split the ‘Place’ column into two columns: city and state
Remove duplicates
datatable(head(glassdoor, 50),
          plugins = "ellipsis",
          options = list(
            scrollX = TRUE,
            columnDefs = list(list(
              targets = "_all",
              render = JS("$.fn.dataTable.render.ellipsis(30, false)")
            ))
          ))
The ‘Location’ column does not contain any values, so it is removed; the job location is in the ‘Place’ column instead.
rem_location <- glassdoor[,-2] #Remove the 2nd column, which is 'Location'
datatable(head(rem_location, 50),
          plugins = "ellipsis",
          options = list(
            scrollX = TRUE,
            columnDefs = list(list(
              targets = "_all",
              render = JS("$.fn.dataTable.render.ellipsis(30, false)")
            ))
          ))
We split the ‘Place’ column into ‘Job_location_City’ and ‘Job_location_State’, then rearrange the column order. In the process, I dropped the ‘Page’ and ‘Place’ columns. The ‘Page’ column records which results page the posting appeared on in the Glassdoor search; it is not necessary for our analysis.
library(stringr)
split_Place_Glassdoor <- rem_location
# Same approach as for the LinkedIn dataframe: split 'Place' on the comma
split_Place_Glassdoor[c('Job_location_City', 'Job_location_State')] <- str_split_fixed(split_Place_Glassdoor$Place, ',', 2)
colnames(split_Place_Glassdoor)
## [1] "Keyword" "Page" "company"
## [4] "rating" "Job_title" "Place"
## [7] "salary" "post_date" "Job_description"
## [10] "Job_location_City" "Job_location_State"
# Drop the 'Place' column and "Page" column by not including them in the new table
tidied_location_glassdoor <- split_Place_Glassdoor[c('Keyword','Job_title', 'company','rating','Job_location_City','Job_location_State','post_date','Job_description')]
datatable(head(tidied_location_glassdoor, 50),
          plugins = "ellipsis",
          options = list(
            scrollX = TRUE,
            columnDefs = list(list(
              targets = "_all",
              render = JS("$.fn.dataTable.render.ellipsis(30, false)")
            ))
          ))
There were duplicate job postings. They went undetected earlier because the same posting appeared under different ‘Page’ numbers. After the ‘Page’ column is removed, I used distinct() to drop the duplicated postings. As a result, we have 664 job postings from Glassdoor; about half of the original 1,320 postings were duplicates.
tidied_glassdoor <- distinct(tidied_location_glassdoor)
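A quick count confirms how many rows the deduplication removed:

nrow(tidied_location_glassdoor) - nrow(tidied_glassdoor) # duplicate rows dropped
nrow(tidied_glassdoor)                                   # 664 remaining postings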
To combine the two dataframes into one, we need the shared columns to carry the same names. I also want a new column indicating whether a posting came from Glassdoor or Linkedin. Since we no longer need the ‘Keyword’ column, I overwrote its values with ‘Glassdoor’ or ‘Linkedin’ and renamed the column ‘Search_Engine’.
final_Linkedin <- tidied_Linkedin
final_Linkedin$Keyword <- "Linkedin"
colnames(final_Linkedin)[1] <- "Search_Engine"
#colnames(final_Linkedin)
final_Linkedin <- final_Linkedin[c('Search_Engine', 'Job_title', 'Company', 'Job_location_City', 'Job_location_State', 'Job_description')]
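The positional rename above assumes ‘Keyword’ is the first column; a name-based dplyr alternative (a sketch, left commented out) avoids that assumption:

# final_Linkedin <- tidied_Linkedin %>%
#   rename(Search_Engine = Keyword) %>%
#   mutate(Search_Engine = "Linkedin") %>%
#   select(Search_Engine, Job_title, Company,
#          Job_location_City, Job_location_State, Job_description)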
datatable(head(final_Linkedin, 50),
          plugins = "ellipsis",
          options = list(
            scrollX = TRUE,
            columnDefs = list(list(
              targets = "_all",
              render = JS("$.fn.dataTable.render.ellipsis(30, false)")
            ))
          ))
final_glassdoor <- tidied_glassdoor
final_glassdoor$Keyword <- "Glassdoor"
colnames(final_glassdoor)[1] <- "Search_Engine"
#colnames(final_glassdoor)
colnames(final_glassdoor) <- c('Search_Engine', 'Job_title', 'Company', 'Rating', 'Job_location_City', 'Job_location_State', 'post_date', 'Job_description')
final_glassdoor <- final_glassdoor[c('Search_Engine', 'Job_title', 'Company', 'Job_location_City', 'Job_location_State', 'Job_description')]
datatable(head(final_glassdoor, 50),
          plugins = "ellipsis",
          options = list(
            scrollX = TRUE,
            columnDefs = list(list(
              targets = "_all",
              render = JS("$.fn.dataTable.render.ellipsis(30, false)")
            ))
          ))
Combine the two dataframes, then use distinct() from the ‘dplyr’ package to remove any duplicate rows.
total <- rbind(final_Linkedin, final_glassdoor)
distinct_total_df <- distinct(total)
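And a final sanity check on the combined dataframe:

nrow(total)             # rows before deduplication
nrow(distinct_total_df) # rows after removing duplicates across sources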
datatable(distinct_total_df[c(1:20, 5055:5074, 5850:5869), ],
          colnames = c('Search Engine', 'Job Title', 'Company',
                       'Job Location City', 'Job Location State', 'Job Description'),
          plugins = "ellipsis",
          options = list(
            scrollX = TRUE,
            columnDefs = list(list(
              targets = "_all",
              render = JS("$.fn.dataTable.render.ellipsis(30, false)")
            ))
          ))
Create csv files of the tidied data for analysis.
write.csv(distinct_total_df, "C:\\Desktop\\DATA 607\\Combined_Linkedin_Glassdoor_Version4.csv", row.names=FALSE)
write.csv(tidied_glassdoor, "C:\\Desktop\\DATA 607\\tidied_glassdoor_version4.csv", row.names=FALSE)
write.csv(tidied_Linkedin, "C:\\Desktop\\DATA 607\\tidied_Linkedin_version4.csv", row.names=FALSE)
The ‘Job_description’ column contains very long strings, which made the rendered datatables hard to read. I adapted part of the code from the following Stack Overflow answer to display a more readable table: https://stackoverflow.com/questions/71713905/r-shiny-datatableoutput-prevent-column-from-showing-full-text-column
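Since the same datatable() options are repeated throughout this report, a small wrapper (a sketch; the helper name ellipsis_table is my own, not from the linked answer) would keep the calls short:

# Hypothetical helper wrapping the repeated ellipsis/scrollX options
ellipsis_table <- function(df, n = 50, chars = 30) {
  datatable(head(df, n),
            plugins = "ellipsis",
            options = list(
              scrollX = TRUE,
              columnDefs = list(list(
                targets = "_all",
                render = JS(sprintf("$.fn.dataTable.render.ellipsis(%d, false)", chars))
              ))
            ))
}
# Usage: ellipsis_table(tidied_Linkedin)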