Assignment

W. Edwards Deming said, “In God we trust, all others must bring data.” Please use data to answer the question, “Which are the most valued data science skills?”

Consider your work as an exploration; there is not necessarily a “right answer.”

Selecting a website

Our approach to this project was to identify a website that listed a variety of job postings related to Data Science. In deciding which platform to use (e.g., LinkedIn, Indeed, Glassdoor), we realized that many of the sites presented information in a non-standardized way. For example, one posting might list the skills under a section titled “Requirements,” whereas another might list them under a section titled “Experience.”

After a bit of searching, we found a site, onetonline.org, that uses a standardized template for its postings. A user can go to the website and search for a general job type in the Occupational Search section to bring up relevant listings.

Approach

After searching for the term Data Scientist on the website, we were directed to a page with a number of links to relevant postings. (see below)

Each link leads to a posting that follows a standard template. The nice thing about this website is that it breaks the requirements down into a number of general categories, including Technology Skills, Abilities, Skills, and Knowledge. Because we are focusing on the soft skills of data science, we decided to limit our pull to only the skills listed in the Skills and Abilities sections.

Strategy

Our approach to gathering all of the information was to:

  • Pull in the html from the results page of the Data Scientist search
  • Grab all of the websites that were linked to the postings
  • Cycle through each website to extract the relevant skills and abilities for that position
  • Add in the job title and website and classify each record as sourced from the SKILLS or ABILITIES section
  • Combine all results into one data frame
  • Load data into a relational database
  • Analyze listings to see what skills appear the most in the set

Libraries

For this project, we will use the rvest package for all of the HTML extraction. The stringr package will be used to identify the appropriate websites to cycle through, and the tidyverse will be used for our data cleansing.

library(rvest)
library(tidyverse)
library(stringr)

Gather Websites for Relevant Job Postings

First, we will need to read in all of the websites that contain our data of interest. Looking at the HTML of the results page, we can see that the job posting links are wrapped in <td> tags with a class of report2ed. We will use a combination of html_nodes (filtered for our class of interest), html_children, and html_attr to pull in the websites associated with all of these postings. We will then eliminate any websites that don’t start with https://.

htmlURL <- 'https://www.onetonline.org/find/result?s=data%20scientist&a=1'
page <- read_html(htmlURL)

#parse out the websites for the links
websites <- page %>% 
  html_nodes("td.report2ed") %>% 
  html_children() %>%
  html_attr('href') 

# keep only links that begin with https://
finalWebsites <- websites[str_detect(websites, '^https://')]
# (not used below) a smaller subset of the first three links
finalWebsites1 <- finalWebsites[1:3]
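
As a quick sanity check (output not shown), we can confirm how many posting links made it through the filter:

length(finalWebsites)
head(finalWebsites, 3)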

Job Posting Skills

Next, we will cycle through each website to extract the information from the Skills and Abilities sections. We will add in the job title and website, as well as a distinction between Skills and Abilities. All of this information will be compiled into one final data frame.

finalJobInfo <- data.frame()

seqList <- seq_along(finalWebsites)


for (i in seqList)
{
jobURL <- finalWebsites[i]
#jobURL <- 'https://www.onetonline.org/link/summary/15-1111.00'
jobWebsite <- read_html(jobURL)

# Parse out the job title
jobTitle <- jobWebsite %>%
  html_nodes("span.titleb") %>%
  html_text()


############################################
# Job skills
############################################

# Parse out the job skills
jobSkills <- jobWebsite %>%
  html_nodes("div.section_Skills") %>%
  html_text()

jobSkills <- jobSkills[2]

# convert to tibble, add in job title to record
jobSkills <- as_tibble(jobSkills)
jobSkills$JOB_TITLE <- jobTitle

# split the string into 1 row for each skill listed
s1 <- str_split(jobSkills$value, '\n\n\n\n')
jobSkills <- data.frame(JOB_TITLE = rep(jobSkills$JOB_TITLE, sapply(s1, length)), value = unlist(s1))

jobSkills$TYPE <- 'SKILLS'


############################################
# Job abilities
############################################

# Parse out the job abilities
jobAbilities<- jobWebsite %>%
  html_nodes("div.section_Abilities") %>%
  html_text()

jobAbilities <- jobAbilities[2]

# convert to tibble, add in job title to record
jobAbilities <- as_tibble(jobAbilities)
jobAbilities$JOB_TITLE <- jobTitle

# split the string into 1 row for each ability listed
s <- str_split(jobAbilities$value, '\n\n\n\n')
jobAbilities <- data.frame(JOB_TITLE = rep(jobAbilities$JOB_TITLE, sapply(s, length)), value = unlist(s))

jobAbilities$TYPE <- 'ABILITY'


jobInfo <- rbind(jobAbilities,jobSkills)
jobInfo$WEBSITE <- jobURL

finalJobInfo <- rbind(finalJobInfo, jobInfo)

}

Tidying up the data

Now that all of the information is compiled in one large data frame, we will clean it up a bit. First, we’ll add in a column called SOURCE that will contain an indication of where the information is taken from. Then, we’ll remove any records that don’t have “value” filled in, reorder the columns, and rename the “value” field.

finalJobInfo2<- as_tibble(finalJobInfo)
finalJobInfo2$SOURCE <- 'ONET'

finalJobInfo2 <- finalJobInfo2 %>%
  filter(value != '') %>%
  select(SOURCE,JOB_TITLE,WEBSITE,TYPE,value) %>%
  rename(SKILL=value)

Table Structure

We are going to create a table structure that is flexible enough to allow us to incorporate new information (from other sources) in the future. To do so, we will create 4 tables:

  • SKILLS: one row per distinct skill
  • SKILL_TYPE: one row per skill type (SKILLS or ABILITY)
  • JOB_POSTING: one row per job title and source combination
  • JOB_SKILLS: a bridge table linking each posting to its skills and skill types

All of these tables can be linked to provide the same information that is contained in the finalJobInfo2 table.

Create table keys

First, we’ll need to create a key for each of the groupings mentioned above.

finalJobInfo2$SKILL_KEY <- finalJobInfo2 %>%
  group_indices(SKILL)

finalJobInfo2$SKILL_TYPE_KEY <- finalJobInfo2 %>%
  group_indices(TYPE)

finalJobInfo2$JOB_POSTING_KEY <- finalJobInfo2 %>%
  group_indices(JOB_TITLE,SOURCE)

Create table structures

Now that we have the keys created, we can split the data into the 4 tables mentioned above. We’ll store these in CSV files and then upload them into a relational database for later use.

SKILLS <- as.data.frame(unique(finalJobInfo2[c("SKILL_KEY","SKILL")]))
SKILL_TYPE <- as.data.frame(unique(finalJobInfo2[c("SKILL_TYPE_KEY","TYPE")]))
JOB_POSTING <- as.data.frame(unique(finalJobInfo2[c("JOB_POSTING_KEY","JOB_TITLE", "SOURCE", "WEBSITE")]))

JOB_SKILLS <- finalJobInfo2 %>%
  select(JOB_POSTING_KEY,SKILL_TYPE_KEY,SKILL_KEY)
  
write.csv(SKILLS,"SKILLS.csv")
write.csv(SKILL_TYPE, "SKILL_TYPE.csv")
write.csv(JOB_POSTING, "JOB_POSTING.csv")
write.csv(JOB_SKILLS, "JOB_SKILLS.csv")
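
The upload step itself is not shown in this document; a minimal sketch using DBI and RMySQL appears below. The host, credentials, and hw schema name are placeholders that would need to match your own database, and the table names mirror those referenced in the SQL query later on.

library(DBI)

# placeholder connection details
mydb <- dbConnect(RMySQL::MySQL(), dbname = "hw", host = "localhost",
                  username = "user", password = "password")

# write each data frame to its own table in the hw schema
dbWriteTable(mydb, "skills", SKILLS, row.names = FALSE, overwrite = TRUE)
dbWriteTable(mydb, "skill_type", SKILL_TYPE, row.names = FALSE, overwrite = TRUE)
dbWriteTable(mydb, "job_posting", JOB_POSTING, row.names = FALSE, overwrite = TRUE)
dbWriteTable(mydb, "job_skills", JOB_SKILLS, row.names = FALSE, overwrite = TRUE)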

Analysis

Now that we have the raw data, let’s start to take a look at it. First, we’ll have to load it into R.

Libraries

library(RMySQL)
## Loading required package: DBI
library(ggplot2)
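
# NOTE: `mydb` is assumed to be an open MySQL connection, e.g. the one
# created with dbConnect() in the upload sketch above.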
jobData <- dbGetQuery(mydb, "
SELECT 
    jobPosting.SOURCE,
    jobPosting.JOB_TITLE,
    jobPosting.WEBSITE,
    skillType.TYPE,
    CASE    WHEN LEFT(skills.SKILL, LOCATE('?',skills.SKILL))= '' THEN skills.SKILL
            ELSE LEFT(skills.SKILL, LOCATE('?',skills.SKILL)-1) END AS SKILL,
    CASE    WHEN LEFT(skills.SKILL, LOCATE('?',skills.SKILL))= '' THEN ''
            ELSE RIGHT(skills.SKILL, LENGTH(skills.SKILL)-LOCATE('?',skills.SKILL)-1) END AS SKILL_DESCRIPTION 
FROM hw.job_skills jobSkills

INNER JOIN hw.job_posting jobPosting ON jobSkills.JOB_POSTING_KEY = jobPosting.JOB_POSTING_KEY
INNER JOIN hw.skill_type skillType ON jobSkills.SKILL_TYPE_KEY = skillType.SKILL_TYPE_KEY 
INNER JOIN hw.skills skills ON jobSkills.SKILL_KEY = skills.SKILL_KEY"
)

Notice that in the SQL pull we split the stored skill text into a skill name and a description column, using the ‘?’ character that separates the name from its description as the delimiter. Because not all records contain a description, we wrote a CASE statement that accounts for both formats.

Now that we have the data loaded, let’s take a look at what the top skills are:

skillFrequency <- as.data.frame(table(jobData$SKILL)%>% sort(decreasing= TRUE))
colnames(skillFrequency) <- c('Skill', 'Frequency')
skillFrequency[1:20,]
##                            Skill Frequency
## 1                   Near Vision        625
## 2            Oral Comprehension        620
## 3               Oral Expression        614
## 4           Problem Sensitivity        610
## 5          Information Ordering        605
## 6           Deductive Reasoning        591
## 7            Speech Recognition        591
## 8              Active Listening        590
## 9                      Speaking        586
## 10               Speech Clarity        583
## 11            Critical Thinking        578
## 12        Written Comprehension        576
## 13        Reading Comprehension        561
## 14                   Monitoring        559
## 15          Inductive Reasoning        556
## 16         Category Flexibility        531
## 17 Judgment and Decision Making        524
## 18           Written Expression        513
## 19              Time Management        496
## 20                 Coordination        494
top_n(skillFrequency, n=20, Frequency) %>%
          ggplot(., aes(x=Skill, y=Frequency))+
              geom_bar(stat='identity') + 
  ggtitle("Soft Skills for Data Scientists") + 
  xlab("Skill") + ylab("Number of postings") +
  theme(axis.text.x = element_text(angle = 90))

Conclusions

From this analysis, we can see that the top 10 skills from this general pull are:

  • Near Vision
  • Oral Comprehension
  • Oral Expression
  • Problem Sensitivity
  • Information Ordering
  • Deductive Reasoning
  • Speech Recognition
  • Active Listening
  • Speaking
  • Speech Clarity

The first skill (which is somewhat humorous) can be eliminated from the list. The other skills, however, are very telling of what a data scientist is expected to do. What is most interesting is the number of communication skills at the top of this list (Oral Comprehension, Oral Expression, Speaking, and Speech Clarity). These are likely top priorities because data scientists are often asked to communicate their findings to others.

Wait, there’s more!

In doing this analysis, we were curious to see whether there were existing datasets that provided information about the top skills required of data scientists. We stumbled upon the following site, with some really interesting information regarding data science skills, that we thought we’d take a look at: https://www.kaggle.com/discdiver/the-most-in-demand-skills-for-data-scientists/data

library(knitr)

url <- "https://raw.githubusercontent.com/gpadmaperuma/DATA607/master/ds_general_skills_revised.csv"
ds_skills <- read.csv(url, stringsAsFactors = FALSE)

Tidying and wrangling

Let’s take a look at the structure and see what type of tidying is needed.

str(ds_skills)
## 'data.frame':    30 obs. of  5 variables:
##  $ Keyword    : chr  "machine learning" "analysis" "statistics" "computer science" ...
##  $ LinkedIn   : chr  "5,701" "5,168" "4,893" "4,517" ...
##  $ Indeed     : chr  "3,439" "3,500" "2,992" "2,739" ...
##  $ SimplyHired: chr  "2,561" "2,668" "2,308" "2,093" ...
##  $ Monster    : chr  "2,340" "3,306" "2,399" "1,900" ...

We removed the commas from the counts and converted them from character to numeric.

ds_skills$LinkedIn <- str_replace_all(ds_skills$LinkedIn, ",", "") %>% as.numeric()
ds_skills$Indeed <- str_replace_all(ds_skills$Indeed, ",", "") %>% as.numeric()
ds_skills$SimplyHired <- str_replace_all(ds_skills$SimplyHired, ",", "") %>% as.numeric()
ds_skills$Monster <- str_replace_all(ds_skills$Monster, ",", "") %>% as.numeric()
str(ds_skills)
## 'data.frame':    30 obs. of  5 variables:
##  $ Keyword    : chr  "machine learning" "analysis" "statistics" "computer science" ...
##  $ LinkedIn   : num  5701 5168 4893 4517 3404 ...
##  $ Indeed     : num  3439 3500 2992 2739 2344 ...
##  $ SimplyHired: num  2561 2668 2308 2093 1791 ...
##  $ Monster    : num  2340 3306 2399 1900 2053 ...
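
As a side note, the four replace-and-convert steps above could be collapsed into a single call (a sketch, assuming dplyr 1.0 or later for across()):

ds_skills <- ds_skills %>%
  mutate(across(LinkedIn:Monster, ~ as.numeric(str_replace_all(.x, ",", ""))))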

We created a subset by eliminating the Total row and any rows with missing data.

ds_skills_subset <- subset(ds_skills, !is.na(LinkedIn))%>%
  subset((!Keyword == "Total"))

We mutate the data frame to generate a new column, Frequency, containing the total count of each skill keyword across the four job boards.

ds_skill2 <- ds_skills_subset %>%
  mutate(Frequency = LinkedIn + Indeed + SimplyHired + Monster)

More Tidying

When going through the data science skill keywords, we found that artificial intelligence appears in three different rows. To avoid double counting, we added the AI and artificial intelligence counts and then subtracted the overlapping row, assigning the result to the AI + artificial intelligence row.

ds_skill2[18,2:6] <- ds_skill2[16,2:6] + ds_skill2[17,2:6] - ds_skill2[18,2:6]

We did the same thing with NLP and natural language processing, updating the NLP + natural language processing row.

ds_skill2[21,2:6] <- ds_skill2[19,2:6] + ds_skill2[20,2:6] - ds_skill2[21,2:6]

Now that we have updated the AI + artificial intelligence and NLP + natural language processing rows, it is time to remove the extra rows.

ds_skills_tidy <- ds_skill2[- c(16,17,19,20),]
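
Because these corrections index the rows by position (rows 16 through 21), they are somewhat fragile. A name-based sketch for the AI rows is shown below; the exact Keyword spellings are assumptions and would need to be checked against the data, and the NLP rows and the final row removal would follow the same pattern.

# illustrative alternative: locate the AI rows by Keyword instead of by position
ai       <- which(ds_skill2$Keyword == "ai")
ai_long  <- which(ds_skill2$Keyword == "artificial intelligence")
ai_combo <- which(ds_skill2$Keyword == "ai + artificial intelligence")
ds_skill2[ai_combo, 2:6] <- ds_skill2[ai, 2:6] + ds_skill2[ai_long, 2:6] - ds_skill2[ai_combo, 2:6]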

We created another column, Percentage, calculated by dividing the Frequency of each keyword by the overall total frequency.

ds_skills_tidy <- ds_skills_tidy %>%
  mutate(Percentage = Frequency/sum(Frequency))
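
As a quick check, the new Percentage column should sum to 1:

sum(ds_skills_tidy$Percentage)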

Analysis and Visualization

Using the ggplot2 package, we created a bar plot that shows each data science skill’s share of total mentions across the job boards, ranked from highest to lowest.

# creating a color palette with more colors
library(ggplot2)
library(RColorBrewer)
library(ggthemes)

nb.cols <- 18
mycolor <- colorRampPalette(brewer.pal(8, "Set2"))(nb.cols)

# plot the top data science skills, ordered from highest to lowest share
ggplot(ds_skills_tidy, aes(x = reorder(Keyword, Percentage), y = Percentage, fill = Keyword)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = round(Frequency, 1)), hjust = -0.5, color = "black",
            position = position_dodge(1), size = 2) +
  scale_fill_manual(values = mycolor) +
  theme(axis.text.x = element_text(angle = 0, vjust = 0.5)) +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "bottom") +
  ggtitle("Top Data Science Skills") +
  xlab("Keyword") + ylab("Percentage of total mentions") +
  coord_flip()

Conclusions

The skills posted on job sites differ from company to company. Data science is a field that requires technical tools as well as soft skills. This analysis shows that analysis, machine learning, and statistics are the top data science skills that companies are looking for. These data were gathered from top job sites such as LinkedIn, Indeed, SimplyHired, and Monster.