We ran a search for Data Scientist job listings on Indeed. Our findings were interesting and perhaps not surprising, but they were insightful nonetheless.
This kind of analysis can also help you gain insights into what customers are saying online about your business, your products, your brand, your ads and campaigns, and more.
Keep in mind that this was done only for the first few pages of search results. To gain better insights, we might want to scrape more pages, and other sites too.
Let’s take a look at what the analysis reveals.
First, we loaded the necessary libraries
library(xml2)
library(rvest)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tm)
## Loading required package: NLP
Then, we mined Indeed’s results for Data Scientist jobs in the United States
# Read the search results page and pull out the job summary snippets
url <- 'https://www.indeed.com/q-Data-Scientist-jobs.html'
names <- read_html(url)
scraped_summaries <- html_nodes(names, '.summary')
names_text <- html_text(scraped_summaries)
names_text
## [1] "\n Ready to demonstrate your marketing analytics and data sciences prowess? You should be well-versed in machine learning algorithms – which is best for achieving..."
## [2] "\n You’ll work with world-class data scientists and engineers. We are looking for rockstar Data Scientists to join our team...."
## [3] "\n We are looking to hire an analytics & insights leader to partner with the Structured Data Product team. This leader needs to possess not only super strong..."
## [4] "\n The Data Scientist utilizes extensive knowledge in data structures, data relationships, and tools to perform statistical analysis, data visualizations, analytic..."
## [5] "\n The Machine Learning Engineer will build extensible and highly scalable machine learning components for One Click Retail's micro-services...."
## [6] "\n Interprets results from multiple sources using a variety of techniques, ranging from structured data extraction to complex data mining from clinical notes...."
## [7] "\n You will build machine learning models to transform social media feeds into actionable items. Available employment types:...."
## [8] "\n Extensive experience with Python applying some of the models above to real-world data; Experience with data visualization tools (e.g., Tableau, Power BI, D3.js..."
## [9] "\n Build complex data sets from multiple data sources. In engineering, computer science, physics, mathematics or other quantitative fields OR Master's Degree in a..."
## [10] "\n Use data mining and machine learning skills to design and develop products which drive engagement, growth, retention, and monetization...."
## [11] "\n Data Scientist in Fremont, CA*. Areas of team focus include mHealth, genomics, and other projects in support of design, analysis, and meta-analysis of clinical..."
## [12] "\n Data modeling, data mining, data engineering, data analysis, and machine learning. Experience with manipulating and analyzing clinical data and machine learning..."
## [13] "\n The R Programmer will perform research on large historical healthcare data sets, develop and improve statistical and machine learning approaches for products..."
## [14] "\n As a key member of the Coca-Cola Company Freestyle organization supporting specifically Engineering & Innovation, the Data Scientist will perform/lead data..."
## [15] "\n A critical member of the Analytics team, the Data Scientist will work with the Business Intelligence team, web analysts, marketers, and our web development team..."
## [16] "\n 6+ years of professional experience in a business environment as a Data Scientist, Machine Learning engineer or comparable analytical position...."
## [17] "\n And interpret complex SQL and Hive QL queries to inform data mining and. As a Data Scientist, you will use statistical analysis...."
Create the URLs to cycle through the first few results pages
# Query-string offsets for the follow-on results pages (Indeed's start parameter)
pa <- c("", "&start=10", "&start=20", "&start=30", "&start=40", "&start=50")
pages <- paste('https://www.indeed.com/jobs?q=Data+Scientist&l=United+States', pa, sep = '')
# Container for the scraped and cleaned text from every page
BB <- c()
Scrape each page and clean up the text
# install.packages("stringi")  # run this once if stringi is not already installed
library(stringi)
for (i in 1:length(pages)) {
  url <- pages[i]
  names <- read_html(url)
  # Grab the job summaries plus the company, location, and pagination nodes
  scraped_names <- html_nodes(names, '.pagination , .company, .summary, .location, .turnstileLink')
  names_text <- html_text(scraped_names)
  # Clean up the text using regular expressions
  # (see ?regex for how regular expressions are used in R)
  names2 <- stri_replace_all_regex(names_text, "\\(", "")     # drop opening parentheses
  names3 <- stri_replace_all_regex(names2, "[:punct:]", "")   # drop the remaining punctuation
  names4 <- stri_replace_all_regex(names3, "[:digit:]", " ")  # replace digits with spaces
  names5 <- stri_replace_all_regex(names4, "\n", " ")         # replace newlines with spaces
  names6 <- stri_replace_all_regex(names5, ",", " ")          # replace any leftover commas
  names7 <- iconv(names6, to = "ASCII//TRANSLIT")             # transliterate to plain ASCII
  # Collapse the runs of extra spaces that are left over
  names8 <- stri_replace_all_regex(names7, " +", " ")
  B <- names8
  # Append this page's results to those from the previous pages
  BB <- c(BB, B)
}
BB <- iconv(BB, to = "ASCII//TRANSLIT")
BB <- tolower(BB)
head(BB, 20)
## [1] "machine learning engineer"
## [2] " one click retail"
## [3] " one click retail"
## [4] " reviews"
## [5] "salt lake city ut"
## [6] " the machine learning engineer will build extensible and highly scalable machine learning components for one click retails microservices"
## [7] "sr data scientist"
## [8] " captioncall"
## [9] " captioncall"
## [10] "reviews"
## [11] "salt lake city ut"
## [12] " utilizes data mining techniques and develops data models to assist in the visualization and interpretation of data"
## [13] "machine learning engineer"
## [14] " banjo"
## [15] " banjo"
## [16] "reviews"
## [17] "park city ut"
## [18] " banjo is looking to add a machine learning engineer to our team in park city natural inclination to demand precision in everything you build even while"
## [19] "consumer product data scientist"
## [20] " laird superfood"
Remove stopwords, along with any other words you deem unimportant, and inspect the first 5 documents
BB.corpus <- Corpus(VectorSource(BB))
BB.corpus <- tm_map(BB.corpus, removeWords, stopwords("en"))
inspect(BB.corpus[1:5])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 5
##
## [1] machine learning engineer one click retail
## [3] one click retail reviews
## [5] salt lake city ut
myStopwords <- c("engineer", "reviews", "using", "ebay", "central", "leasing", "will", "draper", "date", "york", "seeking", "scientist",
"scientists", "dell", "beaverton", "queue", "salt", "lake", "engineeri", "excell", "strong", "members", "including", "city",
"using", "draw", "washington", "resource", "area", "84123", "large", "python", "intern", "health", "ensure",
"portland", "central", "deep", "experience", "include", "captioncall", "97204", "agency", "makers", "work",
"architects", "clinical", "scale", "motivated", "very", "investments", "level", "applied", "progressive",
"results?", "next?", "looking", "leasing", "page", "click", "software", "involves", "progressive", "atlanta",
"support", "santa", "fellow", "valley", "collaborate", "other", "youll", "redmond", "explorer", "part", "nexta",
"that", "program", "resultsa", "national", "stewards", "such", "denver", "with", "skillset", "engineers", "apple",
"infrastructure", "clara", "working", "distributed", "developers", "previousa", "nexta", "resultsa", "previousa",
"geographically", "shape", "palo", "alto", "johnson", "ca", "ut", "co", "tx", "uncovers", "initiatives", "A", "pagea")
BB.corpus <- tm_map(BB.corpus, removeWords, myStopwords)
inspect(BB.corpus[1:5])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 5
##
## [1] machine learning one retail one retail
## [5]
Create a term-document matrix
tdmBB <- TermDocumentMatrix(BB.corpus, control = list(wordLengths = c(4, 20)))
tdmBB
## <<TermDocumentMatrix (terms: 446, documents: 463)>>
## Non-/sparse entries: 1397/205101
## Sparsity : 99%
## Maximal term length: 17
## Weighting : term frequency (tf)
inspect(tdmBB)
## <<TermDocumentMatrix (terms: 446, documents: 463)>>
## Non-/sparse entries: 1397/205101
## Sparsity : 99%
## Maximal term length: 17
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 119 138 175 202 263 297 341 41 421 59
## analytics 0 1 0 0 0 1 0 0 0 0
## banjo 0 0 0 0 0 0 0 0 0 0
## build 0 0 0 0 0 0 0 0 0 0
## data 1 4 2 1 0 4 0 1 0 1
## junior 0 0 0 0 0 0 0 0 0 0
## learning 1 0 1 2 0 0 0 1 0 1
## machine 1 0 1 2 0 0 0 1 0 1
## mining 0 0 0 0 0 0 0 0 0 1
## predictive 0 0 0 0 0 0 0 0 0 0
## science 0 1 1 1 0 1 0 1 0 0
Get terms with a frequency of 10 or higher
findFreqTerms(tdmBB, lowfreq = 10)
## [1] "learning" "machine" "retail" "build" "data"
## [6] "mining" "models" "banjo" "park" "product"
## [11] "marketing" "teams" "analysis" "science" "business"
## [16] "analytics" "predictive" "junior"
We might want only the terms that appear more often than that, so let’s raise the threshold to 15 and check the results.
findFreqTerms(tdmBB, lowfreq = 15)
## [1] "learning" "machine" "data" "mining" "banjo" "science"
## [7] "analytics" "junior"
Now, we sort the terms by the total number of times each appears, in decreasing order.
BBtermFreq <- rowSums(as.matrix(tdmBB))
BBtermFreq <- subset(BBtermFreq, BBtermFreq >= 10)
sort(BBtermFreq, decreasing = TRUE)
## data learning machine analytics mining science
## 200 62 56 35 19 18
## banjo junior predictive build product retail
## 17 16 14 13 13 12
## models analysis park marketing teams business
## 12 12 11 11 10 10
We then create a data frame we can work with
dfTermfreq <- data.frame(term = names(BBtermFreq), freq = BBtermFreq)
dfTermfreq
## term freq
## learning learning 62
## machine machine 56
## retail retail 12
## build build 13
## data data 200
## mining mining 19
## models models 12
## banjo banjo 17
## park park 11
## product product 13
## marketing marketing 11
## teams teams 10
## analysis analysis 12
## science science 18
## business business 10
## analytics analytics 35
## predictive predictive 14
## junior junior 16
And we plot the results
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
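A minimal sketch of how the term frequencies could be plotted with ggplot2, assuming a simple horizontal bar chart ordered by frequency (the exact plotting code is not echoed above):
library(ggplot2)
ggplot(dfTermfreq, aes(x = reorder(term, freq), y = freq)) +
  geom_col() +       # bar heights come straight from the freq column
  coord_flip() +     # flip so the term labels are readable
  labs(x = "Term", y = "Frequency")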
We can also find associations between any of the terms that interest us in our job search, for example.
findAssocs(tdmBB, 'machine', 0.85)
## $machine
## learning
## 0.9
findAssocs(tdmBB, 'visualization', 0.65)
## $visualization
## assist develops utilizes techniques interpretation
## 0.79 0.79 0.79 0.75 0.72
findAssocs(tdmBB, 'retail', 0.45)
## $retail
## numeric(0)
findAssocs(tdmBB, 'marketing', 0.55)
## $marketing
## leadership responsibilities sales demonstrate
## 0.67 0.67 0.67 0.67
## prowess ready sciences achieving
## 0.67 0.67 0.67 0.61
## wellversed teams best
## 0.61 0.56 0.56
findAssocs(tdmBB, 'solutions', 0.55)
## $solutions
## theory
## 0.92
findAssocs(tdmBB, 'insights', 0.75)
## $insights
## numeric(0)
findAssocs(tdmBB, 'algorithms', 0.75)
## $algorithms
## achieving wellversed best demonstrate prowess ready
## 0.86 0.86 0.80 0.79 0.79 0.79
## sciences
## 0.79
We then create a matrix to use for the word cloud we will generate.
# Convert the term-document matrix to a plain matrix and total up the term counts
tdmMatrix <- as.matrix(tdmBB)
wordFreq <- sort(rowSums(tdmMatrix), decreasing = TRUE)
set.seed(375)
library(wordcloud)
## Loading required package: RColorBrewer
Create the wordcloud
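The wordcloud() call is not echoed in the output; judging from the warnings below, it passed names(wordFreq), wordFreq, and min.freq = 5. A minimal sketch along those lines (the random.order and colors settings are assumptions):
wordcloud(words = names(wordFreq), freq = wordFreq, min.freq = 5,
          random.order = FALSE,                  # assumed: plot the most frequent terms first
          colors = brewer.pal(8, "Dark2"))       # assumed: RColorBrewer palette loaded above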
## Warning in wordcloud(words = names(wordFreq), freq = wordFreq, min.freq =
## 5, : responsibilities could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(wordFreq), freq = wordFreq, min.freq =
## 5, : demonstrate could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(wordFreq), freq = wordFreq, min.freq =
## 5, : sciences could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(wordFreq), freq = wordFreq, min.freq =
## 5, : complex could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(wordFreq), freq = wordFreq, min.freq =
## 5, : methods could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(wordFreq), freq = wordFreq, min.freq =
## 5, : application could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(wordFreq), freq = wordFreq, min.freq =
## 5, : projects could not be fit on page. It will not be plotted.
Conclusion
If you were looking for a data scientist job, you would want to have machine learning skills, be able to develop, interpret, and use data visualizations, and be able to demonstrate how you apply those skills to marketing in order to achieve results. It is also worth noting that being a team player matters, and that data mining and analytics are highly sought-after skills for this role as well.