## Project Objective: Use data analysis to answer the question: "Which are the most valued data science skills?"
The Project Three team first met on September 13th via Google Hangouts to discuss our approach to the project. We determined that job search sites were the most likely places to find information on which data science skills are most highly valued. We decided to research web APIs for popular job search websites, including CareerBuilder, LinkedIn, Glassdoor, Monster, Indeed, AngelList, Upwork, Idealist, Hired, Dice, Fiverr, ZipRecruiter, and Kaggle.
Unfortunately, we came up empty in our search for APIs. Most of the sites restricted access to their APIs or required a partnership with the site. We then turned our attention to web scraping. Working in parallel, each of us selected a site to scrape (Chester Poon took LinkedIn, Joshua Bentley took CareerBuilder, and John Hancock took Indeed). Web scraping proved very challenging. In the sections below, we each recount our experience scraping our respective site.
### John Hancock (Indeed.com)
The Indeed.com site works by running a search for jobs and returning pages of hyperlinks, one per job, each leading to the full description of the posting. I discovered that the R package rvest does not work well with pages where you have to click through a link to reach the information, as is the case on Indeed.com. So I turned to Python to do the scraping, in particular the Selenium package, which uses a webdriver to automate a web browser.
My process automated a search for data science jobs in New York, NY. The code navigates through the more than 900 result pages returned by the search. As stated earlier, each results page contains hyperlinks to individual jobs, which hold the more detailed information. The Python code identifies each job link by its XPath, puts all of the links on the page into a list, iterates over that list, and clicks on each link. Once a posting was open, the job description, title, and company were scraped from the page. The entire process took quite a while, running for more than three days, but I was able to collect 330 job postings.
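The actual scrape was written in Python with Selenium, but the same pattern can be followed from R with the RSelenium package. The sketch below mirrors the process described above (gather the job links on a results page, then open each posting); the URL and XPath expressions are illustrative placeholders rather than the selectors from the real script, and it collects hrefs and navigates to them instead of clicking each element.

```r
library(RSelenium)

# Start a webdriver-controlled browser session
driver <- rsDriver(browser = "firefox", port = 4545L)
remDr  <- driver$client

# Placeholder search URL for data science jobs in New York, NY
remDr$navigate("https://www.indeed.com/jobs?q=data+scientist&l=New+York%2C+NY")

# Placeholder XPath: collect the href of every job link on the results page
link_elems <- remDr$findElements(using = "xpath",
                                 value = "//a[contains(@class, 'jobtitle')]")
job_urls <- sapply(link_elems, function(el) el$getElementAttribute("href")[[1]])

# Open each posting and scrape the fields we kept (title, company, description)
jobs <- lapply(job_urls, function(u) {
  remDr$navigate(u)
  data.frame(
    Title       = remDr$findElement("xpath", "//h1")$getElementText()[[1]],
    Description = remDr$findElement("xpath", "//*[@id='jobDescriptionText']")$getElementText()[[1]],
    stringsAsFactors = FALSE
  )
})
jobs <- do.call(rbind, jobs)

# Shut down the browser and the driver server
remDr$close()
driver$server$stop()
```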
### Chester Poon (LinkedIn.com)
One of the major challenges specific to LinkedIn was the website's security barriers. We were only able to do a one-time scrape before the site flagged my profile and IP address as the source of a bot. I was able to get the initial pull of data, but when I tried to pull more to be better in line with my teammates' data, LinkedIn would automatically log me out as soon as I initialized the script. I also tried creating a fake account, as well as using Joshua's profile, to get around the security, without any luck. It was here that we discovered that the page rendered slightly differently for each login and each person accessing the site, which confounded the script. For example, one version had the text of a button tag nested inside a span tag whereas the other did not; other times, an attribute id differed between versions of the site. These dynamically changing pages, together with being logged out whenever a bot was detected, limited our scope.
As a next step, it would be best to gain permission from LinkedIn to scrape their site. We discovered rather late in the process that websites publish a robots.txt file that specifies rules for web scraping, and sometimes a contact email for requesting access. Knowing this sooner could have saved much time. LinkedIn's robots.txt file is located at https://www.linkedin.com/robots.txt
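For future reference, a site's robots.txt can be checked from R before any scraping is attempted; a minimal example with the robotstxt package (which we did not use in this project):

```r
library(robotstxt)

# Download and print LinkedIn's robots.txt rules
rtxt <- get_robotstxt(domain = "www.linkedin.com")
rtxt

# Check whether a generic bot is allowed to crawl a given path
paths_allowed(paths = "/jobs/", domain = "www.linkedin.com", bot = "*")
```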
For the Python code and markdown for the LinkedIn.com scrape, see: https://chesterpoon8.github.io/Project3-linked_in.html
### A. Joshua Bentley (CareerBuilder.com)
I had done a similar project in the past and had identified CareerBuilder as an ideal site to scrape: it hosts all of the job information on its own pages, whereas other sites usually show partial information and require you to click through to the hiring company's site, forcing the coder to build multiple nested scrapes. Unfortunately, the team on that earlier project chose a different site, so I was happy to be able to return to the effort here.
Hearing about the issues my partners had with their scrapes, I was happy to find that my initial intuition about CareerBuilder was correct, and I was able to scrape the site using rvest. Given more time with the project, I believe I could have set the script up so that you could feed it a list of search terms and have it return jobs for each search.
The main drawback was that the site would only return 300 jobs per search. To get around this I would probably isolate the city search field and run the code through several big cities, or vary other search parameters such as pay rate, industry, or type of employment; a rough sketch of the parameterized version appears after the links below.
The RPubs page and .Rmd can be found at:
http://rpubs.com/ajbentley/ds607_proj3_cbscrape
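As a sketch of that parameterized approach, the loop below shows the general shape in rvest; the query-string format and CSS selector are placeholders that would need to be matched to the live CareerBuilder pages, not the ones from the actual scrape.

```r
library(rvest)
library(stringr)

search_terms <- c("data scientist", "data engineer", "machine learning")

scrape_search <- function(term) {
  # Placeholder URL pattern; the real query string on the site differs
  url  <- paste0("https://www.careerbuilder.com/jobs?keywords=",
                 str_replace_all(term, " ", "+"))
  page <- read_html(url)
  data.frame(
    search_term = term,
    # Placeholder selector for job titles on the results page
    title = page %>% html_nodes(".job-title") %>% html_text(trim = TRUE),
    stringsAsFactors = FALSE
  )
}

all_jobs <- do.call(rbind, lapply(search_terms, scrape_search))
```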
This section details how the collected data was compiled into a final data frame.
The data scraped from LinkedIn was more structured than the data from CareerBuilder or Indeed: for most LinkedIn postings, job qualifications were broken out in a separate section. The LinkedIn csv that Chester delivered was therefore used as the basis for identifying skills in the data scraped from the other two websites.
LinkedIn_DSjobs <- read.csv("https://raw.githubusercontent.com/JohnKHancock/DATA-607-Data-Acquisition-and-Management/master/Project%20Three/LinkedIn.csv", stringsAsFactors = FALSE)
# Remove the "(Programming Language)" suffix (e.g., "Python (Programming Language)" -> "Python") so skills match the text from the other websites.
LinkedIn_DSjobs$skills<-str_trim(str_replace(LinkedIn_DSjobs$skills, "\\(Programming Language\\)", ""))
# Creating Frequency counts of the skills
frequency<- count(LinkedIn_DSjobs, skills)
# Code will allow us to limit the # of skills shown if needed in the future (0 means all are included)
top_Skills<- frequency[frequency$n > 0,]
# Setting the list to descending order
top_Skills[order(-top_Skills$n),]
## # A tibble: 912 x 2
## skills n
## <chr> <int>
## 1 Python 249
## 2 Statistics 201
## 3 Master's Degree 187
## 4 Bachelor's Degree 174
## 5 Data Science 167
## 6 Machine Learning 164
## 7 Communication 151
## 8 Data Analysis 149
## 9 SQL 132
## 10 Analytics 113
## # ... with 902 more rows
When we first ran the report we found that a skill called "D" made a strong appearance in our rankings. We initially excluded it because we didn't recognize it, but then found that it is in fact a programming language. Later we realized that the code was picking up every instance of a capital D (e.g., in David, Director, or PhD). The same issue likely inflated the counts for other single-letter languages such as C and R, and we weren't sure whether "C++" was being treated as a literal string or misinterpreted as a regular expression.
We then found other skills that could be similarly mistaken, such as "Go." We discovered this very late in the process, and after spending some time looking for a solution we chose to simply exclude D, R, and the C languages from the analysis and note this as an issue to address in the future (see the sketch below for one possible fix).
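One possible fix, for future work, is to require word boundaries around single-letter skills and to treat symbol-heavy names like "C++" as literal strings rather than regular expressions. A minimal sketch with stringr (the sample text is made up for illustration):

```r
library(stringr)

txt <- c("Experience with R, SAS, and Python required.",
         "Reporting to the Director of Data Science; PhD preferred.",
         "Proficiency in C++ is a plus.")

# Naive pattern: also matches the capital R inside "Reporting"
sum(str_count(txt, "R"))

# Word-boundary pattern: only matches R (or D) as a standalone token
sum(str_count(txt, "\\bR\\b"))
sum(str_count(txt, "\\bD\\b"))   # 0, even though "PhD" contains a capital D

# Treat "C++" as a literal string so the plus signs are not regex quantifiers
sum(str_count(txt, fixed("C++")))
```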
We also noticed a number of items in the skills list that were not actually skills, including education levels, industries, or simply references to things like "scientists."
The scraped data from Indeed.com and CareerBuilder.com was much less structured. Those job postings were free-form text rather than structured like LinkedIn, and the requested skills were not broken out into their own section. The entire text of each job description, along with the job title and company, was scraped into csv files.
Indeed_DSjobs <- read.csv("https://raw.githubusercontent.com/JohnKHancock/DATA-607-Data-Acquisition-and-Management/master/Project%20Three/Data_Science_Indeed_Jobs.csv", stringsAsFactors = FALSE, encoding = "UTF-8")
Indeed_DSjobs$Description <- str_replace_all(Indeed_DSjobs$Description,"[\n]","")
top_Skills_Indeed <- data.frame(skills=character(), n=integer())
Each skill identified from LinkedIn was searched for in the job descriptions scraped from Indeed, and the number of hits was captured in the variable "n".
for (i in 1:nrow(top_Skills)) {
  newRow <- data.frame(skills = top_Skills$skills[i],
                       n = length(unlist(str_extract_all(Indeed_DSjobs$Description,
                                                         top_Skills$skills[i]))))
  top_Skills_Indeed <- rbind(top_Skills_Indeed, newRow)
}
top_Skills_Indeed$skills <- as.character(top_Skills_Indeed$skills)
top_Skills_Indeed<-top_Skills_Indeed[!duplicated(top_Skills_Indeed),]
In the end, we get a data frame listing each skill and the number of times that skill is mentioned in the Indeed.com job descriptions.
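The counts were then viewed in descending order. Only the output appears in the original report; a display step along these lines (assumed, not from the original code) would produce it:

```r
# Show the ten most frequently matched skills for Indeed
head(top_Skills_Indeed[order(-top_Skills_Indeed$n), ], 10)
```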
## skills n
## 659 Python 268
## 770 SQL 256
## 690 Research 174
## 34 Analytics 161
## 362 Go 118
## 761 Spark 112
## 149 Computer Science 100
## 375 Hadoop 89
## 300 Engineering 83
## 421 Java 83
The process used for Indeed.com was then repeated for CareerBuilder.com.
CareerBuilder_DSjobs <- read.csv("https://raw.githubusercontent.com/JohnKHancock/DATA-607-Data-Acquisition-and-Management/master/Project%20Three/careerbuilder.csv", encoding = "UTF-8")
CareerBuilder_DSjobs$jobdesc <- str_replace_all(CareerBuilder_DSjobs$jobdesc,"[\n]","")
top_Skills_CB <- data.frame(skills=character(), n=integer())
for (i in 1:nrow(top_Skills)) {
  newRow <- data.frame(skills = top_Skills$skills[i],
                       n = length(unlist(str_extract_all(CareerBuilder_DSjobs$jobdesc,
                                                         top_Skills$skills[i]))))
  top_Skills_CB <- rbind(top_Skills_CB, newRow)
}
top_Skills_CB<-top_Skills_CB[!duplicated(top_Skills_CB), ]
top_Skills_CB$skills<-as.character(top_Skills_CB$skills)
cts <- top_Skills_CB[order(-top_Skills_CB$n),]
cts[1:10, 1:2]
## skills n
## 659 Python 196
## 770 SQL 169
## 713 SAS 137
## 34 Analytics 107
## 783 Statistics 97
## 454 Machine Learning 96
## 300 Engineering 95
## 375 Hadoop 85
## 690 Research 85
## 248 Design 79
The frequency counts from the three data frames were compiled into one, master_skills_df, which lists the 912 skills identified from LinkedIn, the frequency of each skill per site, and the total number of times each skill was mentioned across all three sites.
tables <- list(top_Skills,top_Skills_Indeed,top_Skills_CB)
master_skills_df <- reduce(tables,left_join,by="skills")
master_skills_df[1:10, 1:4]
## # A tibble: 10 x 4
## skills n.x n.y n
## <chr> <int> <int> <int>
## 1 .NET Framework 2 0 0
## 2 A/B Testing 2 0 0
## 3 Account Management 4 1 1
## 4 Account Reconciliation 3 0 0
## 5 Accounting 1 3 38
## 6 Accumulo 1 0 4
## 7 Acting 1 2 1
## 8 Active DoD Secret Clearance 1 0 0
## 9 Actuarial Science 3 1 4
## 10 Ad Hoc Reporting 3 0 0
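Before the columns are renamed in the next step, a Total column is added to master_skills_df. The code for that step is not shown in the report; the line below is an assumption, consistent with the five-column output that follows (rowSums returns a double, matching the <dbl> Total column shown):

```r
# Assumed step: sum the three per-site counts into a Total column
master_skills_df$Total <- rowSums(master_skills_df[, c("n.x", "n.y", "n")])
```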
colnames(master_skills_df) <- c("Skills", "LinkedIn", "CareerBuilder","Indeed", "Total")
head(master_skills_df)
## # A tibble: 6 x 5
## Skills LinkedIn CareerBuilder Indeed Total
## <chr> <int> <int> <int> <dbl>
## 1 .NET Framework 2 0 0 2
## 2 A/B Testing 2 0 0 2
## 3 Account Management 4 1 1 6
## 4 Account Reconciliation 3 0 0 3
## 5 Accounting 1 3 38 42
## 6 Accumulo 1 0 4 5
In this section, we detail how the data was compiled into a relational database.
To load the data we scraped from the three websites, we first clean each data set: standardize the column names, add a column identifying the source website, and make the job IDs uniform across sites.
LinkedIn
Let's load the LinkedIn csv file from our web scrape and take a quick glance.
linkedin <- read.csv('https://raw.githubusercontent.com/JohnKHancock/DATA-607-Data-Acquisition-and-Management/master/Project%20Three/LinkedIn.csv', stringsAsFactors = FALSE)
knitr::kable(head(linkedin), format = "html") %>%
kable_styling(bootstrap_options = c("striped", "condensed"))
industry | skills | title | job_id |
---|---|---|---|
Marketing and Advertising | Python (Programming Language) | Data Scientist | LI1 |
Marketing and Advertising | Data Science | Data Scientist | LI1 |
Marketing and Advertising | Artificial Neural Networks | Data Scientist | LI1 |
Marketing and Advertising | Deep Learning | Data Scientist | LI1 |
Marketing and Advertising | Google Cloud Platform (GCP) | Data Scientist | LI1 |
Marketing and Advertising | Machine Learning | Data Scientist | LI1 |
Looks mostly clean already. We’ll add another column to identify the source of the data, which we’ll also do for the other two data sets.
linkedin <- linkedin %>%
mutate(website = "linkedin.com")
knitr::kable(head(linkedin), format = "html") %>%
kable_styling(bootstrap_options = c("striped", "condensed"))
industry | skills | title | job_id | website |
---|---|---|---|---|
Marketing and Advertising | Python (Programming Language) | Data Scientist | LI1 | linkedin.com |
Marketing and Advertising | Data Science | Data Scientist | LI1 | linkedin.com |
Marketing and Advertising | Artificial Neural Networks | Data Scientist | LI1 | linkedin.com |
Marketing and Advertising | Deep Learning | Data Scientist | LI1 | linkedin.com |
Marketing and Advertising | Google Cloud Platform (GCP) | Data Scientist | LI1 | linkedin.com |
Marketing and Advertising | Machine Learning | Data Scientist | LI1 | linkedin.com |
Indeed
Now we’ll load the csv from the Indeed web scrape. We’ll skip displaying the data from this data frame due to the large amount of text in one of our columns.
indeed <- read.csv('https://raw.githubusercontent.com/JohnKHancock/DATA-607-Data-Acquisition-and-Management/master/Project%20Three/Data_Science_Indeed_Jobs.csv', stringsAsFactors = FALSE)
names(indeed)
## [1] "X" "Title" "Company" "Description" "Link"
From viewing this data:

- we can tell that "X" is actually our id for each job posting.
- we will not need the Link column, as the other two data sets do not have it.
- we will keep the "Company" column, since we were unable to scrape a specific industry from Indeed.
- we'll clean up the column names so they are in line with the LinkedIn data frame, and make a small edit to job_id to keep things more uniform.
indeed <- indeed %>%
mutate(website = "indeed.com") %>%
rename(job_id = X,
title = Title,
company = Company,
description = Description
) %>%
select(-Link)
indeed$job_id <- paste("IND",as.character(indeed$job_id), sep = "")
str(indeed)
## 'data.frame': 330 obs. of 5 variables:
## $ job_id : chr "IND0" "IND1" "IND2" "IND3" ...
## $ title : chr "Data Science Quant Developer | 2019 Grad" "DATA SCIENCE -INTERN" "Data Science Internship, Summer 2019" "Internships, Data Strategy/Data Science- Summer '19" ...
## $ company : chr "Liquidnet" "Foot Locker, Inc." "Etsy" "Viacom" ...
## $ description: chr "With the acquisition of OTAS Technologies in 2017 and our continued global evolution, Liquidnet is expanding it"| __truncated__ "OVERVIEW\nFoot Locker, Inc. is the world’s leading retailer of athletically inspired footwear and apparel. Our "| __truncated__ "Etsy is looking for Data Science Interns to join our Data Science team in Brooklyn, New York. We are committed "| __truncated__ "Overview\nPlease note: these departments primarily hire Master's Students. If you're an undergrad student inter"| __truncated__ ...
## $ website : chr "indeed.com" "indeed.com" "indeed.com" "indeed.com" ...
CareerBuilder
Our CareerBuilder scrape also includes a text-heavy description column, so we'll just look at the structure to get an idea of the data.
career <- read.csv('https://raw.githubusercontent.com/JohnKHancock/DATA-607-Data-Acquisition-and-Management/master/Project%20Three/careerbuilder.csv', stringsAsFactors = FALSE)
str(career)
## 'data.frame': 300 obs. of 4 variables:
## $ job_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ jobbank..Job.Title.: chr "Associate Director - Methodologist / Statistician - Advanced Analytics" "Product Manager | Data Science | Applications" "Data Science & Tagging Analyst" "Data Science Expert" ...
## $ industry : chr "Other Great Industries" "Computer Hardware, Computer Software" "Other Great Industries" "Other Great Industries" ...
## $ jobdesc : chr "IQVIA™ is The Human Data Science Company™, focused on using data and science to help healthcare clients find be"| __truncated__ "AI & Machine Learning Predictive Analytics Start UpProduct Manager | Enterprise PlatformBostonGlocomms is worki"| __truncated__ " Job Description:he Defense and Intelligence Group of Leidos has a Data Science & Tagging Analyst position in B"| __truncated__ "Overview\nMedical Science & Computing (MSC) is an exciting growth oriented company, dedicated to providing miss"| __truncated__ ...
From viewing the structure of the data frame above, we will:

- rename the "jobbank..Job.Title." and "jobdesc" columns to appropriate names,
- add the column indicating the website this set was scraped from, and
- fix the job_id so it is in line with the previous two data frames.
career <- career %>%
rename(title = jobbank..Job.Title.,
description = jobdesc) %>%
mutate(website = "careerbuilder.com")
career$job_id <- paste("CB", as.character(career$job_id), sep = "")
str(career)
## 'data.frame': 300 obs. of 5 variables:
## $ job_id : chr "CB1" "CB2" "CB3" "CB4" ...
## $ title : chr "Associate Director - Methodologist / Statistician - Advanced Analytics" "Product Manager | Data Science | Applications" "Data Science & Tagging Analyst" "Data Science Expert" ...
## $ industry : chr "Other Great Industries" "Computer Hardware, Computer Software" "Other Great Industries" "Other Great Industries" ...
## $ description: chr "IQVIA™ is The Human Data Science Company™, focused on using data and science to help healthcare clients find be"| __truncated__ "AI & Machine Learning Predictive Analytics Start UpProduct Manager | Enterprise PlatformBostonGlocomms is worki"| __truncated__ " Job Description:he Defense and Intelligence Group of Leidos has a Data Science & Tagging Analyst position in B"| __truncated__ "Overview\nMedical Science & Computing (MSC) is an exciting growth oriented company, dedicated to providing miss"| __truncated__ ...
## $ website : chr "careerbuilder.com" "careerbuilder.com" "careerbuilder.com" "careerbuilder.com" ...
Now that we have cleaned and prepared our three separate data sets from the three different sources, we can unite them into a single data frame.
main_df <- full_join(linkedin,indeed,
by=c("job_id",
"title",
"website"))
main_df <- full_join(main_df,career,
by=c("job_id",
"title",
"industry",
"description",
"website"))
knitr::kable(head(main_df), format = "html") %>%
kable_styling(bootstrap_options = c("striped", "condensed"))
industry | skills | title | job_id | website | company | description |
---|---|---|---|---|---|---|
Marketing and Advertising | Python (Programming Language) | Data Scientist | LI1 | linkedin.com | NA | NA |
Marketing and Advertising | Data Science | Data Scientist | LI1 | linkedin.com | NA | NA |
Marketing and Advertising | Artificial Neural Networks | Data Scientist | LI1 | linkedin.com | NA | NA |
Marketing and Advertising | Deep Learning | Data Scientist | LI1 | linkedin.com | NA | NA |
Marketing and Advertising | Google Cloud Platform (GCP) | Data Scientist | LI1 | linkedin.com | NA | NA |
Marketing and Advertising | Machine Learning | Data Scientist | LI1 | linkedin.com | NA | NA |
Now that we have our single data frame, we can move onto building our database.
In order for our database to conform to a normalized design, we created a separate data frame of unique values for each variable in our main data frame and then assigned an ID to each unique value. Each of these data frames represents a table in our database. The code below takes the unique values of each column except the job_id column. We'll also check each data frame to make sure it contains what we want.
industry <- unique(main_df$industry)
industry_id <- paste("IN",as.character(1:length(industry)),sep = "")
industry_df <- data.frame(industry,industry_id)
knitr::kable(head(industry_df), format = "html")%>%
kable_styling(bootstrap_options = c("striped", "condensed"))
industry | industry_id |
---|---|
Marketing and Advertising | IN1 |
Wireless | IN2 |
Financial Services | IN3 |
Publishing | IN4 |
Management Consulting | IN5 |
Information Technology and Services | IN6 |
skills <- unique(main_df$skills)
skills_id <- paste("SK",as.character(1:length(skills)),sep = "")
skills_df <- data.frame(skills,skills_id)
knitr::kable(head(skills_df), format = "html")%>%
kable_styling(bootstrap_options = c("striped", "condensed"))
skills | skills_id |
---|---|
Python (Programming Language) | SK1 |
Data Science | SK2 |
Artificial Neural Networks | SK3 |
Deep Learning | SK4 |
Google Cloud Platform (GCP) | SK5 |
Machine Learning | SK6 |
title <- unique(main_df$title)
title_id <- paste("TI",as.character(1:length(title)),sep = "")
title_df <- data.frame(title,title_id)
knitr::kable(head(title_df), format = "html")%>%
kable_styling(bootstrap_options = c("striped", "condensed"))
title | title_id |
---|---|
Data Scientist | TI1 |
Senior Data Scientist | TI2 |
Data Science Consultant | TI3 |
Data Analyst | TI4 |
ERP -SAP Master Data Analyst | TI5 |
Data Scientist-Python | TI6 |
company <- unique(main_df$company)
company_id <- paste("CO",as.character(1:length(company)),sep = "")
company_df <- data.frame(company,company_id)
knitr::kable(head(company_df), format = "html")%>%
kable_styling(bootstrap_options = c("striped", "condensed"))
company | company_id |
---|---|
NA | CO1 |
Liquidnet | CO2 |
Foot Locker, Inc. | CO3 |
Etsy | CO4 |
Viacom | CO5 |
Mount Sinai Health System | CO6 |
website <- unique(main_df$website)
website_id <- paste("WS",as.character(1:length(website)),sep = "")
website_df <- data.frame(website,website_id)
knitr::kable(head(website_df), format = "html")%>%
kable_styling(bootstrap_options = c("striped", "condensed"))
website | website_id |
---|---|
linkedin.com | WS1 |
indeed.com | WS2 |
careerbuilder.com | WS3 |
description <- unique(main_df$description)
description_id <- paste("D",as.character(1:length(description)),sep = "")
description_df <- data.frame(description,description_id)
str(description_df)
## 'data.frame': 631 obs. of 2 variables:
## $ description : Factor w/ 630 levels "\n\n\n\nLocation\nSan Jose - CA, US\nLevel\nPhD\nExperience\n2-5 Professional\nAvailable since\n3/7/2018\nFunct"| __truncated__,..: NA 623 385 201 391 497 348 141 134 356 ...
## $ description_id: Factor w/ 631 levels "D1","D10","D100",..: 1 112 223 334 445 556 599 610 621 2 ...
Now we’ll build our jobs data frame, which is a replica of the original main data frame, but with ids instead of actual values for each variable. Doing this will allow for more flexibility and efficiency if we’d like to answer a variety of different questions.
jobs_df <- main_df %>%
left_join(industry_df, by="industry") %>%
left_join(skills_df, by="skills") %>%
left_join(title_df, by="title") %>%
left_join(company_df, by="company") %>%
left_join(website_df, by="website") %>%
left_join(description_df, by="description") %>%
select(-industry,-skills,-title,-company,-website,-description)
knitr::kable(head(jobs_df), format = "html")%>%
kable_styling(bootstrap_options = c("striped", "condensed"))
job_id | industry_id | skills_id | title_id | company_id | website_id | description_id |
---|---|---|---|---|---|---|
LI1 | IN1 | SK1 | TI1 | CO1 | WS1 | D1 |
LI1 | IN1 | SK2 | TI1 | CO1 | WS1 | D1 |
LI1 | IN1 | SK3 | TI1 | CO1 | WS1 | D1 |
LI1 | IN1 | SK4 | TI1 | CO1 | WS1 | D1 |
LI1 | IN1 | SK5 | TI1 | CO1 | WS1 | D1 |
LI1 | IN1 | SK6 | TI1 | CO1 | WS1 | D1 |
We can now create our database connection (here a temporary SQLite database) and load our data into it.
ds_job_db <- dbConnect(RSQLite::SQLite(), "")
dbWriteTable(ds_job_db, "industry", industry_df)
dbWriteTable(ds_job_db, "skills", skills_df)
dbWriteTable(ds_job_db, "title", title_df)
dbWriteTable(ds_job_db, "company", company_df)
dbWriteTable(ds_job_db, "website", website_df)
dbWriteTable(ds_job_db, "description", description_df)
dbWriteTable(ds_job_db, "jobs", jobs_df)
We'll now answer the following summary questions by writing a SQL query for each and displaying the results with a table or a simple ggplot2 graph:
Total Number of Jobs by Website
jobs_cnt <- dbGetQuery(ds_job_db,
"
SELECT DISTINCT
W.WEBSITE as Website
,COUNT(DISTINCT J.JOB_ID) AS 'Number of Jobs'
FROM JOBS J
INNER JOIN WEBSITE W ON J.WEBSITE_ID = W.WEBSITE_ID
GROUP BY
W.WEBSITE
"
)
ggplot(jobs_cnt, aes(Website,`Number of Jobs`)) +
geom_bar(stat = "identity", aes(fill=Website)) +
theme_minimal() +
geom_text(aes(label = `Number of Jobs`),vjust = 2)
Top 25 Industries that, on average, request the greatest number of skills.
s_avg <- dbGetQuery(ds_job_db,
"
SELECT
MAIN.INDUSTRY
,AVG(MAIN.COUNT) AS 'Average Number of Skills'
FROM (
SELECT DISTINCT
I.INDUSTRY
,J.JOB_ID
,COUNT(DISTINCT S.SKILLS_ID) AS COUNT
FROM JOBS J
INNER JOIN INDUSTRY I ON J.INDUSTRY_ID = I.INDUSTRY_ID
INNER JOIN SKILLS S ON J.SKILLS_ID = S.SKILLS_ID
GROUP BY
I.INDUSTRY
,J.JOB_ID
) MAIN
WHERE
MAIN.INDUSTRY IS NOT NULL
GROUP BY
MAIN.INDUSTRY
ORDER BY
AVG(MAIN.COUNT) DESC
LIMIT 25
"
)
knitr::kable(s_avg, format = "html")%>%
kable_styling(bootstrap_options = c("striped", "condensed"))
INDUSTRY | Average Number of Skills |
---|---|
Apparel & Fashion | 11.00000 |
Aviation & Aerospace | 11.00000 |
Banking | 11.00000 |
Broadcast Media | 11.00000 |
Business Supplies and Equipment | 11.00000 |
Commercial Real Estate | 11.00000 |
Consumer Electronics | 11.00000 |
Defense & Space | 11.00000 |
Education Management | 11.00000 |
Gambling & Casinos | 11.00000 |
Government Administration | 11.00000 |
Health, Wellness and Fitness | 11.00000 |
Hospitality | 11.00000 |
Human Resources | 11.00000 |
Industrial Automation | 11.00000 |
Information Services | 11.00000 |
Leisure, Travel & Tourism | 11.00000 |
Mechanical or Industrial Engineering | 11.00000 |
Mining & Metals | 11.00000 |
Paper & Forest Products | 11.00000 |
Public Relations and Communications | 11.00000 |
Renewables & Environment | 11.00000 |
Security and Investigations | 11.00000 |
Accounting | 10.66667 |
Financial Services | 10.55263 |
We pared the list down to the Top 50 skills by keeping only those with more than 41 total mentions across the job postings.
Top50_df <- master_skills_df[master_skills_df$Total>41,]
Top50_df <- Top50_df[order(-Top50_df$Total),]
Top50_df_percentages <- Top50_df %>%
mutate(Indeed_pct = round(Indeed / sum(Indeed), 3),
CareerBuilder_pct = round(CareerBuilder / sum(CareerBuilder),3),
LinkedIn_pct = round(LinkedIn / sum(LinkedIn),3),
Total_pct = round(Total / sum(Total),3)) %>%
select(Skills,Indeed_pct,CareerBuilder_pct,LinkedIn_pct,Total_pct)
head(Top50_df_percentages,10)
## # A tibble: 10 x 5
## Skills Indeed_pct CareerBuilder_pct LinkedIn_pct Total_pct
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Python 0.079 0.1 0.102 0.094
## 2 SQL 0.068 0.096 0.054 0.073
## 3 Analytics 0.043 0.06 0.047 0.05
## 4 Statistics 0.039 0.022 0.083 0.047
## 5 Machine Learning 0.039 0.027 0.067 0.044
## 6 Research 0.034 0.065 0.015 0.039
## 7 Management 0.022 0.031 0.044 0.032
## 8 SAS 0.055 0.024 0.006 0.028
## 9 Engineering 0.038 0.031 0.014 0.028
## 10 Hadoop 0.034 0.033 0.015 0.028
## # A tibble: 10 x 5
## Skills Indeed_pct CareerBuilder_pct LinkedIn_pct Total_pct
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Programming 0.008 0.005 0.009 0.007
## 2 Insurance 0.009 0.008 0.003 0.007
## 3 Matlab 0.005 0.005 0.011 0.007
## 4 Databases 0.002 0.002 0.016 0.006
## 5 Data Analytics 0.007 0.006 0.004 0.006
## 6 Microsoft Office 0.007 0.007 0.003 0.006
## 7 NoSQL 0.006 0.008 0.003 0.006
## 8 Testing 0 0.003 0.014 0.006
## 9 Accounting 0.015 0.001 0 0.006
## 10 Security 0.014 0.001 0.002 0.006
## # A tibble: 9 x 5
## Skills LinkedIn CareerBuilder Indeed Total
## <chr> <int> <int> <int> <dbl>
## 1 Python 249 268 196 713
## 2 SQL 132 256 169 557
## 3 Analytics 113 161 107 381
## 4 Statistics 201 59 97 357
## 5 Machine Learning 164 71 96 331
## 6 Research 36 174 85 295
## 7 Management 107 83 55 245
## 8 SAS 14 63 137 214
## 9 Engineering 34 83 95 212
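The bar chart below uses a Top10 data frame whose construction is not shown above; presumably it is built from the head of the ordered Top50_df, something like:

```r
# Assumed: the skills with the highest totals (after the exclusions noted earlier)
Top10 <- head(Top50_df, 10)
```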
ggplot(data=Top10, aes(x=Skills, y=Total))+
geom_bar(position="dodge",stat="identity",fill=c("orange")) +
coord_flip() +
ggtitle("Top Ten Most In Demand Data Science Skills")
The primary conclusion, besides the fact that web scraping is a very tricky business, is that the most sought-after skills are the ones you would expect. We believe R would also appear on this list had we been able to match it correctly, but Python, SQL, analytics, and statistics are the skills you need to get in the door.
For next steps, we would first want an accurate way to identify when R and the C languages (as well as D, though that is less of an issue at present) are actually being mentioned in descriptions, such as the word-boundary matching sketched earlier.
We would also work on reducing duplication in skill terms. For example, "Analytics" and "Analytical Skills" could be merged; one possible starting point is sketched below.
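A lightweight way to start merging near-duplicate skill labels is a manual lookup table applied to the master skills data frame; a sketch (the mappings themselves are illustrative, not a vetted list):

```r
library(dplyr)

# Illustrative mapping from variant labels to a canonical skill name
skill_map <- c("Analytical Skills"    = "Analytics",
               "Data Analytics"       = "Analytics",
               "Statistical Analysis" = "Statistics")

merged_skills_df <- master_skills_df %>%
  mutate(Skills = recode(Skills, !!!skill_map)) %>%
  group_by(Skills) %>%
  summarise(LinkedIn      = sum(LinkedIn),
            CareerBuilder = sum(CareerBuilder),
            Indeed        = sum(Indeed),
            Total         = sum(Total))
```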
We also realize that some jobs are repeated, not only across the three sites but even within a single site, where the same position may be listed multiple times. We would need to figure out how to account for and filter these; one possible starting point is sketched below.
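Detecting repeated postings could start with exact matches on title, company, and website (fuzzier matching on the description text would be the next step). A sketch against the combined main_df; note that for LinkedIn, where company was not scraped, this would over-collapse and would need an additional key:

```r
library(dplyr)

# One row per job, then keep a single job_id per title/company/website combination
keep_ids <- main_df %>%
  distinct(job_id, title, company, website) %>%
  distinct(title, company, website, .keep_all = TRUE) %>%
  pull(job_id)

deduped_df <- main_df %>% filter(job_id %in% keep_ids)
```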
The other big piece would be to run a TF-IDF analysis. This would show not just which skills are mentioned most overall, but which skills best distinguish individual postings: TF-IDF gives a term a high weight when it appears often within a given posting but in relatively few postings overall. SQL is mentioned almost everywhere (it is a must-have), so its weight would be low; a skill like neural networks, mentioned heavily but only within a smaller set of postings, would score high there, suggesting a more specialized, higher-value skill.
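A sketch of what that could look like with the tidytext package, run over the free-text descriptions in main_df (tidytext is our suggestion here, not part of the original analysis):

```r
library(dplyr)
library(tidytext)

# Tokenize each description, count words per posting, then weight by TF-IDF
desc_tfidf <- main_df %>%
  filter(!is.na(description)) %>%
  unnest_tokens(word, description) %>%
  count(job_id, word, sort = TRUE) %>%
  bind_tf_idf(word, job_id, n) %>%
  arrange(desc(tf_idf))

head(desc_tfidf)
```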