We ran a search for Data Scientist job listings on Indeed. Our findings were interesting and perhaps not surprising, but they were insightful nonetheless.
This kind of analysis can also help you gain insights into what customers are saying online about your business, your products, your brand, your ads and campaigns, and more.
Keep in mind that this was done only for the first few pages of search results. To gain better insights, we might want to scrape more pages, and other sites too.
Let’s take a look at what the analysis reveals.
First, we loaded the necessary libraries
library(xml2)
library(rvest)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tm)
## Loading required package: NLP
Then, we mined Indeed’s results for Data Scientist jobs in the United States
# Read the search results page and pull out the job summary snippets
url <- 'https://www.indeed.com/q-Data-Scientist-jobs.html'
names <- read_html(url)
scraped_summaries <- html_nodes(names, '.summary')
names_text <- html_text(scraped_summaries)
names_text
## [1] "\n Ready to demonstrate your marketing analytics and data sciences prowess? You should be well-versed in machine learning algorithms – which is best for achieving..."
## [2] "\n You’ll work with world-class data scientists and engineers. We are looking for rockstar Data Scientists to join our team...."
## [3] "\n We are looking to hire an analytics & insights leader to partner with the Structured Data Product team. This leader needs to possess not only super strong..."
## [4] "\n The Data Scientist utilizes extensive knowledge in data structures, data relationships, and tools to perform statistical analysis, data visualizations, analytic..."
## [5] "\n The Machine Learning Engineer will build extensible and highly scalable machine learning components for One Click Retail's micro-services...."
## [6] "\n Interprets results from multiple sources using a variety of techniques, ranging from structured data extraction to complex data mining from clinical notes...."
## [7] "\n You will build machine learning models to transform social media feeds into actionable items. Available employment types:...."
## [8] "\n Extensive experience with Python applying some of the models above to real-world data; Experience with data visualization tools (e.g., Tableau, Power BI, D3.js..."
## [9] "\n Build complex data sets from multiple data sources. In engineering, computer science, physics, mathematics or other quantitative fields OR Master's Degree in a..."
## [10] "\n Use data mining and machine learning skills to design and develop products which drive engagement, growth, retention, and monetization...."
## [11] "\n Data Scientist in Fremont, CA*. Areas of team focus include mHealth, genomics, and other projects in support of design, analysis, and meta-analysis of clinical..."
## [12] "\n Data modeling, data mining, data engineering, data analysis, and machine learning. Experience with manipulating and analyzing clinical data and machine learning..."
## [13] "\n The R Programmer will perform research on large historical healthcare data sets, develop and improve statistical and machine learning approaches for products..."
## [14] "\n As a key member of the Coca-Cola Company Freestyle organization supporting specifically Engineering & Innovation, the Data Scientist will perform/lead data..."
## [15] "\n A critical member of the Analytics team, the Data Scientist will work with the Business Intelligence team, web analysts, marketers, and our web development team..."
## [16] "\n 6+ years of professional experience in a business environment as a Data Scientist, Machine Learning engineer or comparable analytical position...."
## [17] "\n And interpret complex SQL and Hive QL queries to inform data mining and. As a Data Scientist, you will use statistical analysis...."
Create the URLs to cycle through the first few results pages
# Query-string offsets for the follow-on results pages (Indeed's start parameter)
pa <- c("", "&start=10", "&start=20", "&start=30", "&start=40", "&start=50")
pages <- paste('https://www.indeed.com/jobs?q=Data+Scientist&l=United+States', pa, sep = '')
# Container for the scraped and cleaned text from every page
BB <- c()
Scrape each page and clean up the text
# install.packages("stringi")  # run this once if stringi is not already installed
library(stringi)
for (i in 1:length(pages)) {
  url <- pages[i]
  names <- read_html(url)
  # Grab the job summaries plus the company, location, and pagination nodes
  scraped_names <- html_nodes(names, '.pagination , .company, .summary, .location, .turnstileLink')
  names_text <- html_text(scraped_names)
  # Clean up the text using regular expressions
  # (see ?regex for how regular expressions are used in R)
  names2 <- stri_replace_all_regex(names_text, "\\(", "")     # drop opening parentheses
  names3 <- stri_replace_all_regex(names2, "[:punct:]", "")   # drop the remaining punctuation
  names4 <- stri_replace_all_regex(names3, "[:digit:]", " ")  # replace digits with spaces
  names5 <- stri_replace_all_regex(names4, "\n", " ")         # replace newlines with spaces
  names6 <- stri_replace_all_regex(names5, ",", " ")          # replace any leftover commas
  names7 <- iconv(names6, to = "ASCII//TRANSLIT")             # transliterate to plain ASCII
  # Collapse the runs of extra spaces that are left over
  names8 <- stri_replace_all_regex(names7, " +", " ")
  B <- names8
  # Append this page's results to those from the previous pages
  BB <- c(BB, B)
}
BB <- iconv(BB, to = "ASCII//TRANSLIT")
BB <- tolower(BB)
head(BB, 20)
## [1] "machine learning engineer"
## [2] " one click retail"
## [3] " one click retail"
## [4] " reviews"
## [5] "salt lake city ut"
## [6] " the machine learning engineer will build extensible and highly scalable machine learning components for one click retails microservices"
## [7] "sr data scientist"
## [8] " captioncall"
## [9] " captioncall"
## [10] "reviews"
## [11] "salt lake city ut"
## [12] " utilizes data mining techniques and develops data models to assist in the visualization and interpretation of data"
## [13] "machine learning engineer"
## [14] " banjo"
## [15] " banjo"
## [16] "reviews"
## [17] "park city ut"
## [18] " banjo is looking to add a machine learning engineer to our team in park city natural inclination to demand precision in everything you build even while"
## [19] "consumer product data scientist"
## [20] " laird superfood"
Remove stopwords, along with any other words you deem unimportant, and inspect the first 5 documents
BB.corpus <- Corpus(VectorSource(BB))
BB.corpus <- tm_map(BB.corpus, removeWords, stopwords("en"))
inspect(BB.corpus[1:5])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 5
##
## [1] machine learning engineer one click retail
## [3] one click retail reviews
## [5] salt lake city ut
myStopwords <- c("engineer", "reviews", "using", "ebay", "central", "leasing", "will", "draper", "date", "york", "seeking", "scientist",
"scientists", "dell", "beaverton", "queue", "salt", "lake", "engineeri", "excell", "strong", "members", "including", "city",
"using", "draw", "washington", "resource", "area", "84123", "large", "python", "intern", "health", "ensure",
"portland", "central", "deep", "experience", "include", "captioncall", "97204", "agency", "makers", "work",
"architects", "clinical", "scale", "motivated", "very", "investments", "level", "applied", "progressive",
"results?", "next?", "looking", "leasing", "page", "click", "software", "involves", "progressive", "atlanta",
"support", "santa", "fellow", "valley", "collaborate", "other", "youll", "redmond", "explorer", "part", "nexta",
"that", "program", "resultsa", "national", "stewards", "such", "denver", "with", "skillset", "engineers", "apple",
"infrastructure", "clara", "working", "distributed", "developers", "previousa", "nexta", "resultsa", "previousa",
"geographically", "shape", "palo", "alto", "johnson", "ca", "ut", "co", "tx", "uncovers", "initiatives", "A", "pagea")
BB.corpus <- tm_map(BB.corpus, removeWords, myStopwords)
inspect(BB.corpus[1:5])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 5
##
## [1] machine learning one retail one retail
## [5]
Create a term-document matrix
tdmBB <- TermDocumentMatrix(BB.corpus, control = list(wordLengths = c(4, 20)))
tdmBB
## <<TermDocumentMatrix (terms: 446, documents: 463)>>
## Non-/sparse entries: 1397/205101
## Sparsity : 99%
## Maximal term length: 17
## Weighting : term frequency (tf)
inspect(tdmBB)
## <<TermDocumentMatrix (terms: 446, documents: 463)>>
## Non-/sparse entries: 1397/205101
## Sparsity : 99%
## Maximal term length: 17
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 119 138 175 202 263 297 341 41 421 59
## analytics 0 1 0 0 0 1 0 0 0 0
## banjo 0 0 0 0 0 0 0 0 0 0
## build 0 0 0 0 0 0 0 0 0 0
## data 1 4 2 1 0 4 0 1 0 1
## junior 0 0 0 0 0 0 0 0 0 0
## learning 1 0 1 2 0 0 0 1 0 1
## machine 1 0 1 2 0 0 0 1 0 1
## mining 0 0 0 0 0 0 0 0 0 1
## predictive 0 0 0 0 0 0 0 0 0 0
## science 0 1 1 1 0 1 0 1 0 0
Get terms with a frequency of 10 or higher
findFreqTerms(tdmBB, lowfreq = 10)
## [1] "learning" "machine" "retail" "build" "data"
## [6] "mining" "models" "banjo" "park" "product"
## [11] "marketing" "teams" "analysis" "science" "business"
## [16] "analytics" "predictive" "junior"
We might want only the terms that appear more often than that, so let’s raise the threshold to 15 and check the results.
findFreqTerms(tdmBB, lowfreq = 15)
## [1] "learning" "machine" "data" "mining" "banjo" "science"
## [7] "analytics" "junior"
Now, we sort the terms by the total number of times each appears, in decreasing order.
BBtermFreq <- rowSums(as.matrix(tdmBB))
BBtermFreq <- subset(BBtermFreq, BBtermFreq >= 10)
sort(BBtermFreq, decreasing = TRUE)
## data learning machine analytics mining science
## 200 62 56 35 19 18
## banjo junior predictive build product retail
## 17 16 14 13 13 12
## models analysis park marketing teams business
## 12 12 11 11 10 10
We then create a data frame we can work with
dfTermfreq <- data.frame(term = names(BBtermFreq), freq = BBtermFreq)
dfTermfreq
## term freq
## learning learning 62
## machine machine 56
## retail retail 12
## build build 13
## data data 200
## mining mining 19
## models models 12
## banjo banjo 17
## park park 11
## product product 13
## marketing marketing 11
## teams teams 10
## analysis analysis 12
## science science 18
## business business 10
## analytics analytics 35
## predictive predictive 14
## junior junior 16
And we plot the results
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
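A minimal sketch of how the term frequencies could be plotted with ggplot2, assuming a simple horizontal bar chart ordered by frequency (the exact plotting code is not echoed above):
library(ggplot2)
ggplot(dfTermfreq, aes(x = reorder(term, freq), y = freq)) +
  geom_col() +       # bar heights come straight from the freq column
  coord_flip() +     # flip so the term labels are readable
  labs(x = "Term", y = "Frequency")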
We can also find associations between any of the terms that interest us in our job search, for example.
findAssocs(tdmBB, 'machine', 0.85)
## $machine
## learning
## 0.9
findAssocs(tdmBB, 'visualization', 0.65)
## $visualization
## assist develops utilizes techniques interpretation
## 0.79 0.79 0.79 0.75 0.72
findAssocs(tdmBB, 'retail', 0.45)
## $retail
## numeric(0)
findAssocs(tdmBB, 'marketing', 0.55)
## $marketing
## leadership responsibilities sales demonstrate
## 0.67 0.67 0.67 0.67
## prowess ready sciences achieving
## 0.67 0.67 0.67 0.61
## wellversed teams best
## 0.61 0.56 0.56
findAssocs(tdmBB, 'solutions', 0.55)
## $solutions
## theory
## 0.92
findAssocs(tdmBB, 'insights', 0.75)
## $insights
## numeric(0)
findAssocs(tdmBB, 'algorithms', 0.75)
## $algorithms
## achieving wellversed best demonstrate prowess ready
## 0.86 0.86 0.80 0.79 0.79 0.79
## sciences
## 0.79
We then create a matrix to use for the word cloud we will generate.
# Convert the term-document matrix to a plain matrix and total up the term counts
tdmMatrix <- as.matrix(tdmBB)
wordFreq <- sort(rowSums(tdmMatrix), decreasing = TRUE)
set.seed(375)
library(wordcloud)
## Loading required package: RColorBrewer
Create the wordcloud
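The wordcloud() call is not echoed in the output; judging from the warnings below, it passed names(wordFreq), wordFreq, and min.freq = 5. A minimal sketch along those lines (the random.order and colors settings are assumptions):
wordcloud(words = names(wordFreq), freq = wordFreq, min.freq = 5,
          random.order = FALSE,                  # assumed: plot the most frequent terms first
          colors = brewer.pal(8, "Dark2"))       # assumed: RColorBrewer palette loaded above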
## Warning in wordcloud(words = names(wordFreq), freq = wordFreq, min.freq =
## 5, : responsibilities could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(wordFreq), freq = wordFreq, min.freq =
## 5, : demonstrate could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(wordFreq), freq = wordFreq, min.freq =
## 5, : sciences could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(wordFreq), freq = wordFreq, min.freq =
## 5, : complex could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(wordFreq), freq = wordFreq, min.freq =
## 5, : methods could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(wordFreq), freq = wordFreq, min.freq =
## 5, : application could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(wordFreq), freq = wordFreq, min.freq =
## 5, : projects could not be fit on page. It will not be plotted.
Conclusion
If you were looking for a data scientist job, you would want to have machine learning skills, be able to develop, interpret, and use data visualizations, and be able to demonstrate how you apply those skills to marketing in order to achieve results. It is also worth noting that being a team player matters, and that data mining and analytics are highly sought-after skills for this role as well.