The purpose of the project is to answer the question: “Which are the most valued data science skills?”. This is exploratory, fairly open-ended work that does not pretend to give a single “right answer”; all conclusions are subjective, as multiple approaches can be taken.
To answer the question I took data from different sources and perspectives: the employer’s (a job website), the employee’s (an industry-wide survey), and an existing study on the topic.
Job board website (employer’s perspective). Examining job postings can indicate which skills are in demand for Data Science positions; CWjobs.co.uk (a UK job board) was chosen for that purpose.
Survey (employee’s perspective). The Kaggle website conducted an industry-wide survey to establish a comprehensive view of the state of data science and machine learning. The survey received over 16,000 responses from 171 countries and territories and contains a lot of information about the most valuable data science skills.
The data is provided as a CSV file, source: https://www.kaggle.com/rayjohnsoncomedy/job-skills/data
There is a lot of research on the most valuable Data Science skills; I will look at one such study.
This project provides a general view of the most valued Data Science skills; it ignores the fact that required skills vary across projects and types of Data Science jobs.
https://www.cwjobs.co.uk/jobs/data-scientist?s=header
The idea here is to scrape the result pages one by one (with the search criterion “Data Scientist”) and count occurrences of a pre-defined set of keywords within each job description. The resulting counts indicate the skills most demanded by employers.
Loading necessary packages
library(rvest)
## Loading required package: xml2
library("rjson")
library("stringi")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("tidyr")
library("RMySQL")
## Loading required package: DBI
library("DBI")
library("ggplot2")
Steps for scraping the CWjobs website:
1. extract links to job description pages by iterating over the search results
2. iterating through the links obtained in the previous step, extract the job description from each page and count the matches of each skill from a list read from file
3. create a data frame of the following format: skill, jobref, count
4. save the data to a MySQL database
5. retrieve those records from the DB into a data frame and calculate count frequencies for each skill
# Iterate over the search result pages, collecting a data frame of job descriptions.
get.jobs.df <- function(pages_count){
  res <- data.frame()
  for(i in 1:pages_count){
    links <- tryCatch({
      get.page.links(i)
    }, error = function(cond){
      message(cond)
      print(paste("Oops! Error for links on page: ", toString(i), " pages_count: ", toString(pages_count)))
      list()
    })
    for (link in links){
      jobs.df <- tryCatch({
        get.job.descriptions(link)
      }, error = function(cond){
        message(cond)
        print(paste("Oops! Error for link:", link))
        data.frame()
      })
      res <- rbind(res, jobs.df)
    }
  }
  return(res)
}
# returns the list of links to job description pages on a given search page
get.page.links <- function(page_number){
  pagination <- if(page_number > 1) paste('&page=', toString(page_number), sep = "") else ''
  html <- read_html(paste("https://www.cwjobs.co.uk/jobs/data-scientist?s=header", pagination, sep = ""))
  links <- html %>% html_nodes(".job-title") %>% html_nodes("a") %>% html_attr("href")
  # relevant_links(unlist(Filter(is.job.relevant, links)))
  return(links)
}
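A quick sanity check of the link scraper (the exact links depend on the live site at the time of running):
# Preview the first few job description links from page 1 of the search
# results (assumes the CWjobs page structure is unchanged).
head(get.page.links(1))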
# get a one-row job description data frame. Format: columns jobref, descr
get.job.descriptions <- function(link){
  page <- read_html(link)
  descr <- page %>% html_nodes(".job-description") %>% html_text()
  # extract the job reference from the link to the job description page
  link.elements <- unlist(strsplit(link, "-"))
  job.ref <- tail(link.elements, n = 1)
  df <- data.frame(jobref = job.ref, descr = descr)
  return(df)
}
# add the number of matches per skill to the data frame
# expected input format: columns jobref, descr
get.data <- function(jobs.df){
  skills <- readLines("/Users/olgashiligin/Documents/skills")  # one skill per line
  skills.df <- data.frame(matrix(ncol = 3, nrow = 0))
  colnames(skills.df) <- c("skill", "jobref", "count")
  for(i in 1:nrow(jobs.df)){
    job.ref <- jobs.df[i, 1]
    job.descr <- jobs.df[i, 2]
    for (skill in skills){
      count <- get.skill.count(job.descr, skill)
      newrow <- data.frame(skill = skill, jobref = job.ref, count = count)
      skills.df <- rbind(skills.df, newrow)
    }
  }
  return(skills.df)
}
# count case-insensitive, whole-word occurrences of a skill in a description
get.skill.count <- function(descr, skill){
  regex <- paste('\\b', tolower(skill), '\\b', sep = "")
  count <- stri_count_regex(tolower(descr), pattern = regex)
  return(count)
}
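A quick illustration of the whole-word matching (the example string is hypothetical):
# "R" matches as a standalone word, "SQL" is counted case-insensitively,
# and "Java" does not match inside "JavaScript" thanks to the \b boundaries.
get.skill.count("Strong R and sql skills; some JavaScript.", "R")    # 1
get.skill.count("Strong R and sql skills; some JavaScript.", "SQL")  # 1
get.skill.count("Strong R and sql skills; some JavaScript.", "Java") # 0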
# jobs.df = get.jobs.df(11)
# results.df <- get.data(jobs.df)
# write.csv(results.df, file = "skills.csv")
# save.to.db(df)
Storing the results from the data frame to the relational database:
1. make a table in the relational database
2. populate the table with the scraping results
save.to.db <- function(match.df){
  connection <- dbConnect(MySQL(),
                          user = "root", password = "olga123",
                          dbname = "607_assignments", host = "localhost")
  # recreate the results table from scratch
  drop.table.query <- "DROP TABLE IF EXISTS 607_assignments.SKILL_COUNTS"
  drop.table <- dbSendStatement(connection, drop.table.query)
  dbClearResult(drop.table)
  create.table.query <- "CREATE TABLE IF NOT EXISTS 607_assignments.SKILL_COUNTS(
    id INT AUTO_INCREMENT,
    skill VARCHAR(255) NOT NULL,
    jobref VARCHAR(255) NOT NULL,
    count INT NOT NULL,
    PRIMARY KEY (id)
  ) ENGINE=INNODB;"
  create.table <- dbSendStatement(connection, create.table.query)
  dbClearResult(create.table)
  # insert the skill counts row by row
  for (row in 1:nrow(match.df)) {
    skill <- match.df[row, "skill"]
    jobref <- match.df[row, "jobref"]
    count <- match.df[row, "count"]
    insert.query <- paste("INSERT INTO 607_assignments.SKILL_COUNTS(skill,jobref,count)
      VALUES ('", skill, "', '", jobref, "', ", toString(count), " )", sep = "")
    insert.row <- dbSendStatement(connection, insert.query)
    dbClearResult(insert.row)
  }
}
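As a side note, the row-by-row INSERT loop above can be replaced by a single bulk write; a minimal sketch using DBI's dbWriteTable with the same connection settings (the function name save.to.db.bulk is hypothetical and not used elsewhere):
save.to.db.bulk <- function(match.df){
  connection <- dbConnect(MySQL(),
                          user = "root", password = "olga123",
                          dbname = "607_assignments", host = "localhost")
  # overwrite = TRUE drops and recreates the table, mirroring the
  # DROP/CREATE logic above; column types are inferred from the data frame
  dbWriteTable(connection, "SKILL_COUNTS", match.df,
               overwrite = TRUE, row.names = FALSE)
  dbDisconnect(connection)
}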
Reading and analysing the results from the database.
connection <- dbConnect(MySQL(),
                        user = "root", password = "olga123",
                        dbname = "607_assignments", host = "localhost")
query <- "SELECT * from SKILL_COUNTS"
results <- dbGetQuery(connection, query)
Creating an R data frame and counting the number of job descriptions scraped.
df<-data.frame(results)
jobs_number <- df %>%
summarise(count = n_distinct(jobref))
jobs_number
## count
## 1 200
Top Data Science skills most demanded by employers
leading_skills <- df %>%
  filter(count != 0) %>%
  group_by(skill) %>%
  count(skill) %>%
  arrange(desc(n))
leading_skills
## # A tibble: 45 x 2
## # Groups: skill [45]
## skill n
## <chr> <int>
## 1 Python 96
## 2 SQL 86
## 3 machine Learning 72
## 4 R 60
## 5 C 43
## 6 algorithms 42
## 7 modelling 39
## 8 Programming 37
## 9 Analytical 36
## 10 Models 34
## # ... with 35 more rows
p <- ggplot(leading_skills, aes(x = reorder(skill, -n), y = n)) +
  geom_col(aes(fill = n)) +
  scale_fill_gradient2(low = "white", high = "blue") +
  theme_minimal() +
  coord_flip() +
  ggtitle("Data Science Skills Most Demanded By Employers") +
  xlab("Skills") + ylab("Total Count")
p
The graph shows that the most demanded Data Science skills are Python, SQL, machine learning and R, so coding skills are the ones employers search for most. It is also worth noting that communication skills, despite their non-technical nature, hold a fairly strong mid-table position.
Using the industry-wide survey results, the following aspects of Data Science skills were analysed.
Reading the survey data and filtering for respondents whose job title is Data Scientist.
survey_data <- read.csv("responses.csv")
sub_survey <- survey_data %>%
  filter(CurrentJobTitleSelect == "Data Scientist")
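The denominator 2433 used in the tables below is the number of “Data Scientist” respondents; computing it from the filtered data avoids the magic number (assuming the same snapshot of the survey):
# number of respondents with the "Data Scientist" job title
n_ds <- nrow(sub_survey)
n_ds  # 2433 in this snapshot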
Coding skills
code_writer<-table(sub_survey$CodeWriter)/2433
frame_coding<-as.data.frame(code_writer)
frame_coding
## Var1 Freq
## 1 0.00000000
## 2 No 0.01685162
## 3 Yes 0.98314838
98% of respondents have coding skills.
Formal Education
formal_edu<-table(sub_survey$FormalEducation)/2433
frame_degree<-as.data.frame(formal_edu) %>%
arrange(desc(Freq))
frame_degree
##                                                                 Var1        Freq
## 1                                                    Master's degree 0.484997945
## 2                                                    Doctoral degree 0.248253185
## 3                                                  Bachelor's degree 0.212494862
## 4                                                Professional degree 0.024249897
## 5 Some college/university study without earning a bachelor's degree 0.020139745
## 6           I did not complete any formal education past high school 0.004521167
## 7                                                                    0.003699137
## 8                                             I prefer not to answer 0.001644061
The most common formal education among Data Scientists is a Master’s degree (48.5%), while Doctoral (24.8%) and Bachelor’s (21.2%) degrees have almost equal shares in second and third place.
Importance of Formal Education
importance_edu<-table(sub_survey$UniversityImportance)/2433
frame_imp_edu<-as.data.frame(importance_edu) %>%
arrange(desc(Freq))
frame_imp_edu
## Var1 Freq
## 1 Very important 0.350184957
## 2 Important 0.263460748
## 3 Somewhat important 0.163173037
## 4 0.134812988
## 5 Not very important 0.063296342
## 6 Not at all important 0.023838882
## 7 N/A, I did not receive any formal education 0.001233046
Over 60% of respondents consider formal education important or very important.
Coding language recommendation
language<-table(sub_survey$LanguageRecommendationSelect)/2433
frame_language<-as.data.frame(language) %>%
arrange(desc(Freq))
frame_language
## Var1 Freq
## 1 Python 0.4944512947
## 2 0.2494862310
## 3 R 0.1976983148
## 4 SQL 0.0242498972
## 5 Scala 0.0078092889
## 6 C/C++/C# 0.0061652281
## 7 SAS 0.0049321825
## 8 Java 0.0045211673
## 9 Matlab 0.0036991369
## 10 Other 0.0036991369
## 11 Julia 0.0020550760
## 12 Haskell 0.0008220304
## 13 Stata 0.0004110152
## 14 F# 0.0000000000
Python (49.4%) and R (19.8%) are the programming languages Data Scientists most often recommend for Data Science.
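For a visual comparison with the job-board chart above, the same bar-chart treatment can be applied to the survey recommendations (a sketch; the blank category is dropped):
frame_language %>%
  filter(Var1 != "") %>%  # drop blank responses
  ggplot(aes(x = reorder(Var1, Freq), y = Freq)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  theme_minimal() +
  ggtitle("Languages Recommended by Data Scientists") +
  xlab("Language") + ylab("Share of respondents")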
Machine learning method planned for next year
ml_method <- table(sub_survey$MLMethodNextYearSelect)/2433
frame_ml_method<-as.data.frame(ml_method) %>%
arrange(desc(Freq))
frame_ml_method
## Var1 Freq
## 1 Deep learning 0.327579120
## 2 0.256473490
## 3 Neural Nets 0.099465680
## 4 Bayesian Methods 0.044389642
## 5 Time Series Analysis 0.035347308
## 6 Genetic & Evolutionary Algorithms 0.028771065
## 7 Text Mining 0.025893958
## 8 Anomaly Detection 0.025482943
## 9 Social Network Analysis 0.021783806
## 10 Ensemble Methods (e.g. boosting, bagging) 0.019317715
## 11 Other 0.018495684
## 12 Monte Carlo Methods 0.017673654
## 13 Cluster Analysis 0.013563502
## 14 Survival Analysis 0.009453350
## 15 I don't plan on learning a new ML/DS method 0.008631319
## 16 Proprietary Algorithms 0.006987259
## 17 Link Analysis 0.005343198
## 18 Random Forests 0.005343198
## 19 Regression 0.004932182
## 20 Support Vector Machines (SVM) 0.004932182
## 21 Uplift Modeling 0.004932182
## 22 Factor Analysis 0.004110152
## 23 Association Rules 0.003288122
## 24 Rule Induction 0.003288122
## 25 Decision Trees 0.002877106
## 26 MARS 0.001644061
Deep Learning is the machine learning method most Data Scientists plan to learn next year (32.7%).
Machine learning tool planned for next year
ml_tool <- table(sub_survey$MLToolNextYearSelect)/2433
frame_ml_tool<-as.data.frame(ml_tool) %>%
arrange(desc(Freq))
frame_ml_tool
## Var1 Freq
## 1 0.2560624743
## 2 TensorFlow 0.2211261817
## 3 Spark / MLlib 0.0990546650
## 4 Python 0.0764488286
## 5 Other 0.0394574599
## 6 R 0.0328812166
## 7 Amazon Machine Learning 0.0271270037
## 8 Julia 0.0246609125
## 9 I don't plan on learning a new tool/technology 0.0238388820
## 10 Hadoop/Hive/Pig 0.0234278668
## 11 Amazon Web services 0.0197287300
## 12 DataRobot 0.0189066995
## 13 Google Cloud Compute 0.0176736539
## 14 Jupyter notebooks 0.0164406083
## 15 IBM Watson / Waton Analytics 0.0143855323
## 16 Microsoft Azure Machine Learning 0.0115084258
## 17 C/C++ 0.0094533498
## 18 Stan 0.0090423346
## 19 Java 0.0078092889
## 20 Microsoft R Server (Formerly Revolution Analytics) 0.0073982737
## 21 Tableau 0.0053431977
## 22 Cloudera 0.0036991369
## 23 NoSQL 0.0036991369
## 24 RapidMiner (free version) 0.0024660912
## 25 SAS Enterprise Miner 0.0024660912
## 26 SQL 0.0024660912
## 27 Unix shell / awk 0.0020550760
## 28 Flume 0.0016440608
## 29 KNIME (free version) 0.0016440608
## 30 SAS Base 0.0016440608
## 31 Weka 0.0016440608
## 32 KNIME (commercial version) 0.0012330456
## 33 Mathematica 0.0012330456
## 34 Oracle Data Mining/ Oracle R Enterprise 0.0012330456
## 35 QlikView 0.0012330456
## 36 SAP BusinessObjects Predictive Analytics 0.0012330456
## 37 IBM SPSS Statistics 0.0008220304
## 38 MATLAB/Octave 0.0008220304
## 39 Orange 0.0008220304
## 40 Perl 0.0008220304
## 41 RapidMiner (commercial version) 0.0008220304
## 42 Salfrod Systems CART/MARS/TreeNet/RF/SPM 0.0008220304
## 43 TIBCO Spotfire 0.0008220304
## 44 Angoss 0.0004110152
## 45 IBM SPSS Modeler 0.0004110152
## 46 Impala 0.0004110152
## 47 Microsoft Excel Data Mining 0.0004110152
## 48 Microsoft SQL Server Data Mining 0.0004110152
## 49 SAS JMP 0.0004110152
## 50 Statistica (Quest/Dell-formerly Statsoft) 0.0004110152
## 51 IBM Cognos 0.0000000000
## 52 Minitab 0.0000000000
TensorFlow (22.1%) and Spark/MLlib (9.9%) are the machine learning tools Data Scientists most plan to learn next year.
A published study, the results of which are also based on a survey, provides interesting insights into the most valuable data science skills in terms of how prevalent certain skills are among Data Scientists.
source: http://businessoverbroadway.com/investigating-data-scientists-their-skills-and-team-makeup
The article “Investigating Data Scientists, their Skills and Team Makeup” gives us an analysis of proficiency in Data Science skills.
knitr::include_graphics('DS_skills.png')
The graph shows which skills are rarest among Data Scientists and therefore, by scarcity, the most valuable.
Taken together, the three data sources point to coding (Python, SQL and R), machine learning and a solid formal education as the most valued data science skills so far.
As this project does not pretend to be exhaustive, further research could analyse salaries and the Data Science skills associated with them.