Project3_DataScienceSkills

Introduction

The purpose of the project is to answer the question: “Which are the most valued data science skills?”. As this is exploratory work, fairly open ended, and does not pretend to give a “right answer”. All conclusions are subjective as multiple approaches can be taken.

In order to answer the question I took data from different sourses and perspectives: employer’s (job website), employee’s (idustry wide survey), also I found ready research on that topic.

Job board website (employer’s perspective). To examine job search which can indicate “in demand skills” for Data Science positions. CWjobs.co.uk. (UK job board) was taken for that purpose.
Survey (employee’s perspective). Kaggle website conducted an industry-wide survey to establish a comprehensive view on the state of data science and machine learning. The survey received over 16,000 responses from 171 countries and territories and contains a lot information about the most valuable data science skills.

Data is presented in csv file, source: https://www.kaggle.com/rayjohnsoncomedy/job-skills/data

Ready to use analysis from Web (analyst/market experts’ view).

The are a lot of researches on what is the most valuable Data Science skills. I will look at one of them.

Assumptions:

This project provides general view on the most valued Data Science skills. In this project I ignore the fact that skills vary among projects and types of Data Science jobs.

Job board website: scrape the data.

https://www.cwjobs.co.uk/jobs/data-scientist?s=header

The idea here is to scrape the web pages one by one (with a search criteria - “Data Scientist”) and count of a pre-defined set of keywords found within the jod description text. The result of the count is going to detect skills the most demanded by employers.

Loading nessesary packages

library(rvest)

## Loading required package: xml2

library("rjson")
library("stringi")
vignette("selectorgadget")

## starting httpd help server ...

##  done

library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library("tidyr")
library("RMySQL")

## Loading required package: DBI

library("DBI")
library("ggplot2")

Scraping CW jobs website steps:

extract links to job description pages by iterating search results
iterating through links obtained in prvious step extract job description from each page and count number of matches of each skill from list read from file
create DataFrame of following format: skill, jobref, count
save data to MySQL database
retrieved those records from DB into DataFrame and calculate count frequencies for each skill

get.jobs.df <- function(pages_count){
  res <- data.frame()
  for(i in 1:pages_count){
    links <- tryCatch({
               get.page.links(i)
            },error=function(cond){
                  message(cond)
                  print(paste("Oops! Error for links on page: ",toString(i), " pages_count: ",toString(pages_count)))
                  list()
            }
            )
     
     for (link in links){
          jobs.df <- tryCatch({get.job.descriptions(link)},
                        error = function(cond){
                            message(cond)
                            print(paste("Oops! Error for link:",link))
                            data.frame()
                        })
           res <- rbind(res,jobs.df)
    }   
  } 
  return(res)
}

# returns list of links to job description pages on given search page
get.page.links <- function(page_number){
                            pagination <- if(page_number>1) {paste('&page=',toString(page_number),sep="")} else {''}
                                    
                            html <- read_html(paste("https://www.cwjobs.co.uk/jobs/data-scientist?s=header",pagination,sep=""))
                            links <- html %>% html_nodes(".job-title") %>% html_nodes("a") %>% html_attr("href")
                            # relevant_links(unlist(Filter(is.job.relevant, links)))
                            
                            return( links )
}
# get job description dataframe. Format: columns: jobref,job description
get.job.descriptions <- function(link){
                                    temp.text <- read_html(link)
                                    temp.text.result <- temp.text %>% html_nodes(".job-description") %>% html_text()
# extract job ref from the link to job description page
                                    link.elements <- unlist(strsplit(link,"-"))
                                    job.ref <- tail(link.elements,n=1)
                                    df <- data.frame(jobref=job.ref, descr=temp.text.result)
                                    return( df )
                               }
#  add number of matches per skill to the data frame
get.data <- function(jobs.df){# expected format col.names=c(jobref,descr)
  skills <- readLines("/Users/olgashiligin/Documents/skills")
  skills.df <- data.frame(matrix(ncol = 3, nrow = 0))
  cols <- c("skill","jobref","count")
  colnames(skills.df) <- cols
  for(i in 1:nrow(jobs.df)){
    job.ref<-jobs.df[i,1]
    job.descr <- jobs.df[i,2]
    for (skill in skills){
      count <- get.skill.count(job.descr,skill)
      newrow <- data.frame(skill=skill,jobref=job.ref,count=count)
      skills.df <- rbind(skills.df,newrow)
    }
  }
  return(skills.df)
}

get.skill.count <- function(descr,skill){
  regex <- paste('\\b',tolower(skill),'\\b',sep="")
  count <- stri_count_regex(tolower(descr), pattern = regex)
  return(count)
}

# jobs.df = get.jobs.df(11)
# results.df <- get.data(jobs.df)
# write.csv(results.df, file = "skills.csv")
# save.to.db(df)

Storing results from the data frame to the relational data base.

making table in relational database
populating table with the scraping results

save.to.db <- function(match.df){
  connection <- dbConnect(MySQL(),
                          user="root", password="olga123",
                          dbname="607_assignments", host="localhost")
  
  drop.table.query = "DROP TABLE IF EXISTS 607_assignments.SKILL_COUNTS"
  drop.table <- dbSendStatement(connection, drop.table.query)
  dbClearResult(drop.table)
  
  create.table.query = "CREATE TABLE IF NOT EXISTS 607_assignments.SKILL_COUNTS(
                                                                                  id INT AUTO_INCREMENT,
                                                                                  skill VARCHAR(255) NOT NULL,
                                                                                  jobref VARCHAR(255) NOT NULL,
                                                                                  count INT NOT NULL,
                                                                                  PRIMARY KEY (id)
                                                                                  )  ENGINE=INNODB;"
  create.table <- dbSendStatement(connection, create.table.query)
  dbClearResult(create.table)
  
  for (row in 1:nrow(match.df)) {
    skill <- match.df[row, "skill"]
    jobref  <- match.df[row, "jobref"]
    count  <- match.df[row, "count"]
    
    insert.query = paste("INSERT INTO 607_assignments.SKILL_COUNTS(skill,jobref,count) 
                        VALUES ('",skill,"', '",jobref,"', ",toString(count)," )",sep="")
    insert.row <-  dbSendStatement(connection, insert.query)
    dbClearResult(insert.row)
  }
}

Reading and analysing results from the data base.

conbection <- dbConnect(MySQL(),
                 user="root", password="olga123",
                 dbname="607_assignments", host="localhost")

results <- "SELECT * from SKILL_COUNTS"
results <- dbGetQuery(conbection, results)

Creating R data frame and counting number of job’s descriptions scraped.

df<-data.frame(results)
jobs_number <- df %>% 
  summarise(count = n_distinct(jobref))
jobs_number

##   count
## 1   200

Top Data Science skills most demanded by employers

leading_skills<- df %>% filter(count !="0") %>%
  group_by(skill) %>% 
  count(skill) %>%  
  arrange(desc(n))
leading_skills

## # A tibble: 45 x 2
## # Groups:   skill [45]
##    skill                n
##    <chr>            <int>
##  1 Python              96
##  2 SQL                 86
##  3 machine Learning    72
##  4 R                   60
##  5 C                   43
##  6 algorithms          42
##  7 modelling           39
##  8 Programming         37
##  9 Analytical          36
## 10 Models              34
## # ... with 35 more rows

p<-ggplot(leading_skills, aes(x=reorder(skill,-n), y=n)) +
  geom_bar(stat="identity")+theme_minimal() +
  geom_col(aes(fill = n)) + 
  scale_fill_gradient2(low = "white", high = "blue") + 
  coord_flip() + 
  ggtitle ("Data Science Skills Most Demanded By Employers") + xlab("Total Count") + ylab("Skills")
p

Graph shows that the most demanded Data Science skills are Python, SQL, machine learning and R. I would conclude that coding skills are the most searched by employers. Also I would like to note that communication skills despite of their non-technical nature have quite strong middle position in the graph.

2. Industry-wide survey

Using industry wide survey results the following point of Data Science skills were analysed:

General coding skills
Type of formal education
Importance of formal education
Coding language recommendation
Machine Learning Methods
Machine Learning Tools

Reading survey data and filtering respondents who has Data Scientist as a job title.

survey_data<- read.csv("responses.csv")
sub_survey<-survey_data %>% 
  filter (CurrentJobTitleSelect =="Data Scientist")

Coding skills

code_writer<-table(sub_survey$CodeWriter)/2433
frame_coding<-as.data.frame(code_writer) 
frame_coding

##   Var1       Freq
## 1      0.00000000
## 2   No 0.01685162
## 3  Yes 0.98314838

98% of respondents have coding skills.

Formal Education

formal_edu<-table(sub_survey$FormalEducation)/2433
frame_degree<-as.data.frame(formal_edu) %>% 
  arrange(desc(Freq))
frame_degree

##                                                                Var1
## 1                                                   Master's degree
## 2                                                   Doctoral degree
## 3                                                 Bachelor's degree
## 4                                               Professional degree
## 5 Some college/university study without earning a bachelor's degree
## 6          I did not complete any formal education past high school
## 7                                                                  
## 8                                            I prefer not to answer
##          Freq
## 1 0.484997945
## 2 0.248253185
## 3 0.212494862
## 4 0.024249897
## 5 0.020139745
## 6 0.004521167
## 7 0.003699137
## 8 0.001644061

Main type of formal education among Data Scientists is Masters degree (48%), whereas Doctoral and Bachelor’s degrees’ share are almost equal (24% and 21.2% respectively) and have second position in the list.

Importance of Formal Education

importance_edu<-table(sub_survey$UniversityImportance)/2433
frame_imp_edu<-as.data.frame(importance_edu) %>% 
  arrange(desc(Freq))
frame_imp_edu

##                                          Var1        Freq
## 1                              Very important 0.350184957
## 2                                   Important 0.263460748
## 3                          Somewhat important 0.163173037
## 4                                             0.134812988
## 5                          Not very important 0.063296342
## 6                        Not at all important 0.023838882
## 7 N/A, I did not receive any formal education 0.001233046

Half of the respondents consider formal education as important or very important.

Coding language recommendation

language<-table(sub_survey$LanguageRecommendationSelect)/2433
frame_language<-as.data.frame(language) %>% 
  arrange(desc(Freq))
frame_language

##        Var1         Freq
## 1    Python 0.4944512947
## 2           0.2494862310
## 3         R 0.1976983148
## 4       SQL 0.0242498972
## 5     Scala 0.0078092889
## 6  C/C++/C# 0.0061652281
## 7       SAS 0.0049321825
## 8      Java 0.0045211673
## 9    Matlab 0.0036991369
## 10    Other 0.0036991369
## 11    Julia 0.0020550760
## 12  Haskell 0.0008220304
## 13    Stata 0.0004110152
## 14       F# 0.0000000000

Python (49%) and R (19.8%) are recommended by the Data Scientists as the best programming language for Data Science.

ml_method<-table(sub_survey$MLMethodNextYearSelect)/2433
frame_ml_method<-as.data.frame(ml_method) %>% 
  arrange(desc(Freq))
frame_ml_method

##                                           Var1        Freq
## 1                                Deep learning 0.327579120
## 2                                              0.256473490
## 3                                  Neural Nets 0.099465680
## 4                             Bayesian Methods 0.044389642
## 5                         Time Series Analysis 0.035347308
## 6            Genetic & Evolutionary Algorithms 0.028771065
## 7                                  Text Mining 0.025893958
## 8                            Anomaly Detection 0.025482943
## 9                      Social Network Analysis 0.021783806
## 10   Ensemble Methods (e.g. boosting, bagging) 0.019317715
## 11                                       Other 0.018495684
## 12                         Monte Carlo Methods 0.017673654
## 13                            Cluster Analysis 0.013563502
## 14                           Survival Analysis 0.009453350
## 15 I don't plan on learning a new ML/DS method 0.008631319
## 16                      Proprietary Algorithms 0.006987259
## 17                               Link Analysis 0.005343198
## 18                              Random Forests 0.005343198
## 19                                  Regression 0.004932182
## 20               Support Vector Machines (SVM) 0.004932182
## 21                             Uplift Modeling 0.004932182
## 22                             Factor Analysis 0.004110152
## 23                           Association Rules 0.003288122
## 24                              Rule Induction 0.003288122
## 25                              Decision Trees 0.002877106
## 26                                        MARS 0.001644061

Deep Learning is the most popular machine learning method among Data Scientists (32,7%)

ml_tool<-table(sub_survey$MLToolNextYearSelect)/2433
frame_ml_tool<-as.data.frame(ml_tool) %>% 
  arrange(desc(Freq))
frame_ml_tool

##                                                  Var1         Freq
## 1                                                     0.2560624743
## 2                                          TensorFlow 0.2211261817
## 3                                       Spark / MLlib 0.0990546650
## 4                                              Python 0.0764488286
## 5                                               Other 0.0394574599
## 6                                                   R 0.0328812166
## 7                             Amazon Machine Learning 0.0271270037
## 8                                               Julia 0.0246609125
## 9      I don't plan on learning a new tool/technology 0.0238388820
## 10                                    Hadoop/Hive/Pig 0.0234278668
## 11                                Amazon Web services 0.0197287300
## 12                                          DataRobot 0.0189066995
## 13                               Google Cloud Compute 0.0176736539
## 14                                  Jupyter notebooks 0.0164406083
## 15                       IBM Watson / Waton Analytics 0.0143855323
## 16                   Microsoft Azure Machine Learning 0.0115084258
## 17                                              C/C++ 0.0094533498
## 18                                               Stan 0.0090423346
## 19                                               Java 0.0078092889
## 20 Microsoft R Server (Formerly Revolution Analytics) 0.0073982737
## 21                                            Tableau 0.0053431977
## 22                                           Cloudera 0.0036991369
## 23                                              NoSQL 0.0036991369
## 24                          RapidMiner (free version) 0.0024660912
## 25                               SAS Enterprise Miner 0.0024660912
## 26                                                SQL 0.0024660912
## 27                                   Unix shell / awk 0.0020550760
## 28                                              Flume 0.0016440608
## 29                               KNIME (free version) 0.0016440608
## 30                                           SAS Base 0.0016440608
## 31                                               Weka 0.0016440608
## 32                         KNIME (commercial version) 0.0012330456
## 33                                        Mathematica 0.0012330456
## 34            Oracle Data Mining/ Oracle R Enterprise 0.0012330456
## 35                                           QlikView 0.0012330456
## 36           SAP BusinessObjects Predictive Analytics 0.0012330456
## 37                                IBM SPSS Statistics 0.0008220304
## 38                                      MATLAB/Octave 0.0008220304
## 39                                             Orange 0.0008220304
## 40                                               Perl 0.0008220304
## 41                    RapidMiner (commercial version) 0.0008220304
## 42           Salfrod Systems CART/MARS/TreeNet/RF/SPM 0.0008220304
## 43                                     TIBCO Spotfire 0.0008220304
## 44                                             Angoss 0.0004110152
## 45                                   IBM SPSS Modeler 0.0004110152
## 46                                             Impala 0.0004110152
## 47                        Microsoft Excel Data Mining 0.0004110152
## 48                   Microsoft SQL Server Data Mining 0.0004110152
## 49                                            SAS JMP 0.0004110152
## 50          Statistica (Quest/Dell-formerly Statsoft) 0.0004110152
## 51                                         IBM Cognos 0.0000000000
## 52                                            Minitab 0.0000000000

TensorFlow and Spark are the most popular Machine Learning Tools among Data Scientists.

3. Ready to use analysis from Web (analyst/market experts’ view).

One research, the result of which are also based on the survey, provides interesting insights on the question about most valuable data science skills in terms of prevalence of certain skills among Data Scientists.

source: http://businessoverbroadway.com/investigating-data-scientists-their-skills-and-team-makeup

Investigating Data Scientists, their Skills and Team Makeup article gives us analysis on the proficiency in Data Science Skills

knitr::include_graphics('DS_skills.png')

Graph clearly shows that the most valuable skills (in terms of their rarity) are:

Cloud Management
Programming Skills
Big data and Distributed data
NLP
Programming Skills
Unstructured Data
Bayesian Stats

Conclusions

Analysis of 3 data sources allows me to select the most valued data science skills so far:

coding skill (in particular: Python, R, SQL);
machine learning skills
skill to work with big and unstructured data
communication skills

Further Research and Analysis

As this project does not pretend to be complete and exhausted, the following further research can be done: analysis of salaries and associated Data Science skills with it.