Data Gathering
The objective of this project is to identify the in-demand skills and requirements for Data Analyst roles in London, UK. I scraped 115 job posts on LinkedIn, which yielded 6,145 data points. I’ll run a text analysis (word frequency and word association) to find the in-demand skills, then visualize the output as a word cloud.
Project Findings
Setup and Data Loading
# Install Packages
install.packages("tm") # for text mining
install.packages("SnowballC") # for text stemming
install.packages("wordcloud") # word-cloud generator
install.packages("RColorBrewer") # color palettes
install.packages("syuzhet") # for sentiment analysis
install.packages("ggplot2") # for plotting graphs
# Load Libraries
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
library("syuzhet")
library("ggplot2")
# Dataset Import
text <- readLines("Dataset.txt")
# Load Data as Corpus
TextDoc <- Corpus(VectorSource(text))
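The install.packages() calls above run on every execution of the script, so packages that are already present get downloaded again. A minimal sketch of one way to install only what is missing follows. The helper name missing_packages is my own illustration and not part of the original script.

```r
# Packages used in this project
pkgs <- c("tm", "SnowballC", "wordcloud", "RColorBrewer", "syuzhet", "ggplot2")

# Hypothetical helper: return the entries of `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Usage (commented out so the sketch has no side effects):
# to_install <- missing_packages(pkgs)
# if (length(to_install) > 0) install.packages(to_install)
```

With this guard in place, the library() calls stay unchanged and re-running the script does not trigger a reinstall.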
Data Cleanup
Because the data was scraped from the web, it needs some basic cleanup. I’ll replace a few symbols with spaces, remove numbers and punctuation, convert everything to lowercase, strip extra white space and remove English stop words. Finally, I’ll reduce each word to its root form by stemming.
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
TextDoc <- tm_map(TextDoc, toSpace, "/") # Replace "/" with Space
TextDoc <- tm_map(TextDoc, toSpace, "@") # Replace "@" with Space
TextDoc <- tm_map(TextDoc, toSpace, "\\|") # Replace "|" with Space
TextDoc <- tm_map(TextDoc, toSpace, "-") # Replace "-" with Space
TextDoc <- tm_map(TextDoc, removeNumbers) # Remove Numbers
TextDoc <- tm_map(TextDoc, removePunctuation) # Remove Punctuation
TextDoc <- tm_map(TextDoc, content_transformer(tolower)) # Convert the text to lower case
TextDoc <- tm_map(TextDoc, stripWhitespace) # Collapse extra white space
TextDoc <- tm_map(TextDoc, removeWords, stopwords("english")) # Remove common English stop words
TextDoc <- tm_map(TextDoc, stemDocument) # Stem words to their root form
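To make the effect of these steps easier to see, here is a small sketch that runs the same transformations on a single made-up line. The sample sentence is an assumption for illustration and does not come from the scraped data set. Depending on the tm version, tm_map() may emit warnings on this kind of corpus, which can be ignored here.

```r
library(tm)
library(SnowballC)  # stemDocument() relies on this package

# Made-up sample line, not taken from the scraped data set
sample_text <- "Strong SQL/Python skills required - 3+ years of experience"
demo <- Corpus(VectorSource(sample_text))

# Same transformations as in the cleanup above, in the same order
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
demo <- tm_map(demo, toSpace, "/")
demo <- tm_map(demo, toSpace, "-")
demo <- tm_map(demo, removeNumbers)
demo <- tm_map(demo, removePunctuation)
demo <- tm_map(demo, content_transformer(tolower))
demo <- tm_map(demo, stripWhitespace)
demo <- tm_map(demo, removeWords, stopwords("english"))
demo <- tm_map(demo, stemDocument)

# Split the cleaned line into tokens to see which stems survive
tokens <- strsplit(trimws(as.character(demo[[1]])), "\\s+")[[1]]
print(tokens)
```

The printed tokens are stems such as "skill" and "experi" rather than full words, which is why the frequency tables in the following sections list truncated forms.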
Preliminary Results
Next I’ll view the preliminary results to check for unwanted words that need to be removed. The TermDocumentMatrix() function from the tm package builds a matrix of term counts per document, and summing its rows gives the frequency of each word. The results are sorted in descending order and the 400 most frequent words are shown.
TextDoc_dtm <- TermDocumentMatrix(TextDoc)
dtm_m <- as.matrix(TextDoc_dtm)
dtm_v <- sort(rowSums(dtm_m),decreasing=TRUE)
dtm_d <- data.frame(word = names(dtm_v),freq=dtm_v)
head(dtm_d, 400)
## word freq
## data data 191
## experi experi 133
## work work 86
## analyt analyt 66
## manag manag 65
## skill skill 58
## abil abil 52
## strong strong 46
## knowledg knowledg 44
## use use 40
## sql sql 39
## understand understand 37
## includ includ 32
## tool tool 31
## busi busi 31
## model model 30
## communic communic 29
## python python 29
## team team 28
## process process 26
## develop develop 25
## servic servic 24
## excel excel 24
## client client 22
## audit audit 21
## demonstr demonstr 21
## technic technic 21
## build build 21
## stakehold stakehold 20
## requir requir 20
## learn learn 19
## complex complex 19
## scienc scienc 19
## insight insight 19
## report report 18
## financi financi 18
## statist statist 18
## effect effect 17
## system system 17
## good good 17
## problem problem 17
## environ environ 17
## support support 16
## – – 16
## degre degre 16
## year year 16
## intern intern 15
## analysi analysi 15
## technolog technolog 15
## within within 14
## level level 14
## larg larg 14
## account account 14
## project project 14
## engin engin 14
## financ financ 14
## tableau tableau 14
## will will 14
## abl abl 14
## perform perform 13
## qualiti qualiti 13
## activ activ 13
## relev relev 13
## industri industri 13
## visualis visualis 13
## solut solut 13
## etc etc 13
## creat creat 12
## relationship relationship 12
## present present 12
## solv solv 12
## machin machin 12
## comfort comfort 12
## googl googl 12
## research research 12
## coach coach 11
## relat relat 11
## provid provid 11
## previous previous 11
## written written 11
## organis organis 11
## prefer prefer 11
## deliveri deliveri 11
## languag languag 11
## program program 11
## profici profici 11
## queri queri 11
## power power 11
## deliv deliv 10
## time time 10
## oper oper 10
## comput comput 10
## similar similar 10
## databas databas 10
## econom econom 10
## standard standard 9
## identifi identifi 9
## high high 9
## well well 9
## set set 9
## equival equival 9
## practic practic 9
## structur structur 9
## record record 9
## track track 9
## appli appli 9
## across across 9
## document document 9
## can can 9
## market market 9
## proven proven 9
## need need 8
## profession profession 8
## extern extern 8
## posit posit 8
## self self 8
## term term 8
## manipul manipul 8
## issu issu 8
## new new 8
## background background 8
## test test 8
## one one 8
## plan plan 8
## aca aca 8
## acca acca 8
## detail detail 8
## mathemat mathemat 8
## quantit quantit 8
## desir desir 8
## design design 8
## advanc advanc 8
## familiar familiar 8
## improv improv 7
## lead lead 7
## senior senior 7
## advantag advantag 7
## verbal verbal 7
## essenti essenti 7
## consult consult 7
## commerci commerci 7
## inform inform 7
## platform platform 7
## cloud cloud 7
## product product 7
## ideal ideal 7
## role role 7
## principl principl 7
## write write 7
## analyz analyz 7
## etl etl 7
## concept concept 7
## qualifi qualifi 7
## other other 6
## non non 6
## offic offic 6
## qualif qualif 6
## part part 6
## sas sas 6
## execut execut 6
## techniqu techniqu 6
## govern govern 6
## interest interest 6
## architectur architectur 6
## like like 6
## orient orient 6
## least least 6
## junior junior 6
## follow follow 6
## deadlin deadlin 6
## applic applic 6
## current current 6
## help help 6
## basic basic 6
## fast fast 6
## pace pace 6
## passion passion 6
## action action 6
## method method 6
## focus focus 5
## extens extens 5
## chang chang 5
## motiv motiv 5
## toward toward 5
## function function 5
## leadership leadership 5
## alteryx alteryx 5
## collabor collabor 5
## except except 5
## softwar softwar 5
## azur azur 5
## organ organ 5
## field field 5
## studi studi 5
## english english 5
## implement implement 5
## multipl multipl 5
## compet compet 5
## bigqueri bigqueri 5
## find find 5
## contribut contribut 5
## analyst analyst 5
## extract extract 5
## powerbi powerbi 5
## engag engag 5
## candid candid 5
## gather gather 5
## looker looker 5
## effici effici 4
## particular particular 4
## area area 4
## framework framework 4
## result result 4
## subject subject 4
## colleagu colleagu 4
## core core 4
## peopl peopl 4
## task task 4
## meet meet 4
## innov innov 4
## control control 4
## exist exist 4
## opportun opportun 4
## firm firm 4
## generat generat 4
## prepar prepar 4
## trend trend 4
## measur measur 4
## member member 4
## big big 4
## gcp gcp 4
## convers convers 4
## combin combin 4
## attent attent 4
## integr integr 4
## creativ creativ 4
## interperson interperson 4
## approach approach 4
## impact impact 4
## dataset dataset 4
## definit definit 4
## collect collect 4
## discoveri discoveri 4
## person person 4
## think think 4
## sector sector 4
## methodolog methodolog 4
## cleans cleans 4
## clear clear 4
## risk risk 4
## futur futur 4
## solid solid 4
## key key 4
## bachelor bachelor 4
## map map 4
## must must 4
## storytel storytel 4
## access access 4
## strategi strategi 4
## word word 4
## quick quick 4
## social social 4
## deep deep 4
## ica ica 4
## aftermarket aftermarket 4
## tag tag 4
## epidemiolog epidemiolog 4
## health health 4
## respons respons 3
## supervis supervis 3
## output output 3
## assur assur 3
## review review 3
## user user 3
## expert expert 3
## pro pro 3
## attitud attitud 3
## prioritis prioritis 3
## willing willing 3
## minimum minimum 3
## univers univers 3
## audienc audienc 3
## revenu revenu 3
## compani compani 3
## anomali anomali 3
## optimis optimis 3
## recommend recommend 3
## analys analys 3
## necessari necessari 3
## scientist scientist 3
## object object 3
## demand demand 3
## msc msc 3
## phd phd 3
## plus plus 3
## academ academ 3
## exposur exposur 3
## awar awar 3
## tight tight 3
## flexibl flexibl 3
## pragmat pragmat 3
## idea idea 3
## defin defin 3
## know know 3
## rang rang 3
## content content 3
## determin determin 3
## error error 3
## rule rule 3
## critic critic 3
## explain explain 3
## situat situat 3
## differ differ 3
## influenc influenc 3
## code code 3
## transform transform 3
## general general 3
## concis concis 3
## progress progress 3
## budget budget 3
## individu individu 3
## full full 3
## capabl capabl 3
## master master 3
## start start 3
## adob adob 3
## api api 3
## migrat migrat 3
## expertis expertis 3
## snowflak snowflak 3
## artefact artefact 3
## dashboard dashboard 3
## suit suit 3
## visual visual 3
## administr administr 3
## bachelor’ bachelor’ 3
## mindset mindset 3
## player player 3
## autom autom 3
## web web 3
## growth growth 3
## deal deal 3
## agil agil 3
## love love 3
## systemat systemat 3
## take take 3
## sourc sourc 3
## pressur pressur 3
## powerpoint powerpoint 3
## challeng challeng 3
## media media 3
## evalu evalu 3
## prioriti prioriti 3
## custom custom 3
## target target 3
## html html 3
## adher adher 2
## ensur ensur 2
## feedback feedback 2
## mental mental 2
## tailor tailor 2
## upon upon 2
## overal overal 2
## perspect perspect 2
## consist consist 2
## point point 2
## escal escal 2
## face face 2
## bespok bespok 2
## depend depend 2
## line line 2
## mentor mentor 2
## first first 2
## addit addit 2
## success success 2
## complet complet 2
## limit limit 2
## origin origin 2
## pitch pitch 2
## portfolio portfolio 2
## java java 2
## librari librari 2
## vision vision 2
## forecast forecast 2
## hypothes hypothes 2
## pattern pattern 2
## predict predict 2
## metric metric 2
## aspect aspect 2
## experiment experiment 2
## centric centric 2
## stori stori 2
## scala scala 2
## script script 2
## store store 2
## varieti varieti 2
## git git 2
## best best 2
## nosql nosql 2
## certif certif 2
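The table above lists "–" (an en dash) 16 times, which suggests it survived the removePunctuation step, presumably because it is not an ASCII punctuation character. A minimal sketch of one way to strip such dashes is shown below in base R. The helper name strip_dashes and the sample line are assumptions for illustration.

```r
# Hypothetical helper: replace en dashes (U+2013) and em dashes (U+2014) with a space
strip_dashes <- function(x) gsub("[\u2013\u2014]", " ", x)

# Made-up sample line containing an en dash
sample_text <- "reporting \u2013 dashboards and insights"

# Replace the dash, then collapse the leftover run of spaces
cleaned <- gsub("\\s+", " ", strip_dashes(sample_text))
print(cleaned)

# In the cleanup pipeline, the existing toSpace helper could do the same job:
# TextDoc <- tm_map(TextDoc, toSpace, "\u2013")
```

Running this before the punctuation step would keep the dash out of the frequency table and out of the association results.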
Removing Irrelevant Words
After an initial inspection of the results, I’ll remove 61 irrelevant words from the data set. Then I’ll rebuild the term-document matrix with TermDocumentMatrix() and recount the word frequencies.
TextDoc <- tm_map(TextDoc, removeWords, c(
  "data", "experi", "work", "analyt", "manag", "skill", "abil", "strong", "knowledg", "use",
  "understand", "includ", "process", "develop", "servic", "client", "technic", "build", "stakehold", "requir",
  "learn", "effect", "system", "good", "degre", "year", "intern", "analysi", "technolog", "within",
  "level", "account", "project", "engin", "financ", "will", "abl", "perform", "activ", "relev",
  "industri", "visualis", "solut", "etc", "creat", "relationship", "present", "solv", "comfort", "coach",
  "relat", "provid", "previous", "written", "organis", "program", "profici", "queri", "deliv", "time",
  "oper"
))
TextDoc_dtm <- TermDocumentMatrix(TextDoc)
dtm_m <- as.matrix(TextDoc_dtm)
dtm_v <- sort(rowSums(dtm_m),decreasing=TRUE)
dtm_d <- data.frame(word = names(dtm_v),freq=dtm_v)
head(dtm_d, 150)
## word freq
## sql sql 39
## tool tool 31
## busi busi 31
## model model 30
## communic communic 29
## python python 29
## team team 28
## excel excel 24
## audit audit 21
## demonstr demonstr 21
## complex complex 19
## scienc scienc 19
## insight insight 19
## report report 18
## financi financi 18
## statist statist 18
## problem problem 17
## environ environ 17
## support support 16
## – – 16
## larg larg 14
## tableau tableau 14
## qualiti qualiti 13
## machin machin 12
## googl googl 12
## research research 12
## prefer prefer 11
## deliveri deliveri 11
## languag languag 11
## power power 11
## comput comput 10
## similar similar 10
## databas databas 10
## econom econom 10
## standard standard 9
## identifi identifi 9
## high high 9
## well well 9
## set set 9
## equival equival 9
## practic practic 9
## structur structur 9
## record record 9
## track track 9
## appli appli 9
## across across 9
## document document 9
## can can 9
## market market 9
## proven proven 9
## need need 8
## profession profession 8
## extern extern 8
## posit posit 8
## self self 8
## term term 8
## manipul manipul 8
## issu issu 8
## new new 8
## background background 8
## test test 8
## one one 8
## plan plan 8
## aca aca 8
## acca acca 8
## detail detail 8
## mathemat mathemat 8
## quantit quantit 8
## desir desir 8
## design design 8
## advanc advanc 8
## familiar familiar 8
## improv improv 7
## lead lead 7
## senior senior 7
## advantag advantag 7
## verbal verbal 7
## essenti essenti 7
## consult consult 7
## commerci commerci 7
## inform inform 7
## platform platform 7
## cloud cloud 7
## product product 7
## ideal ideal 7
## role role 7
## principl principl 7
## write write 7
## analyz analyz 7
## etl etl 7
## concept concept 7
## qualifi qualifi 7
## other other 6
## non non 6
## offic offic 6
## qualif qualif 6
## part part 6
## sas sas 6
## execut execut 6
## techniqu techniqu 6
## govern govern 6
## interest interest 6
## architectur architectur 6
## like like 6
## orient orient 6
## least least 6
## junior junior 6
## follow follow 6
## deadlin deadlin 6
## applic applic 6
## current current 6
## help help 6
## basic basic 6
## fast fast 6
## pace pace 6
## passion passion 6
## action action 6
## method method 6
## focus focus 5
## extens extens 5
## chang chang 5
## motiv motiv 5
## toward toward 5
## function function 5
## leadership leadership 5
## alteryx alteryx 5
## collabor collabor 5
## except except 5
## softwar softwar 5
## azur azur 5
## organ organ 5
## field field 5
## studi studi 5
## english english 5
## implement implement 5
## multipl multipl 5
## compet compet 5
## bigqueri bigqueri 5
## find find 5
## contribut contribut 5
## analyst analyst 5
## extract extract 5
## powerbi powerbi 5
## engag engag 5
## candid candid 5
## gather gather 5
## looker looker 5
## effici effici 4
## particular particular 4
## area area 4
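One thing worth checking in these tables is that no one- or two-letter terms appear. According to the tm documentation, term-document matrices drop terms shorter than three characters by default (the wordLengths control option), so a skill such as the R language, which becomes "r" after lower-casing, would not be counted. The sketch below illustrates the option on made-up lines. I have not verified how it interacts with the full scraped data set, so treat it as an assumption to test.

```r
library(tm)

# Made-up lines; "r" stands in for the R language after lower-casing
docs <- Corpus(VectorSource(c("sql python r", "r excel tableau")))

# Default settings: terms shorter than three characters are dropped
tdm_default <- TermDocumentMatrix(docs)

# Lowering the minimum word length keeps one-letter tokens
tdm_short <- TermDocumentMatrix(docs, control = list(wordLengths = c(1, Inf)))

print("r" %in% Terms(tdm_default))
print("r" %in% Terms(tdm_short))
```

If short skill names matter for the analysis, passing the same control list to the TermDocumentMatrix() calls above would be one way to keep them.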
Generating Word Cloud
Now that the results are satisfactory, I’ll visualize the output as a word cloud with a minimum frequency of 5 and a maximum of 150 words, plotted in descending order of frequency.
# Generate Word Cloud
set.seed(1234)
wordcloud(words = dtm_d$word, freq = dtm_d$freq, min.freq = 5,
          max.words = 150, random.order = FALSE, rot.per = 0.40,
          colors = brewer.pal(8, "Dark2"))
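A word cloud gives a quick impression but makes exact comparisons hard. As a complementary view, here is a sketch of a bar chart built with ggplot2, which is loaded in the setup but not otherwise used. The ten stems and counts are copied by hand from the frequency table printed earlier. In the real pipeline, head(dtm_d, 10) could supply the same data.

```r
library(ggplot2)

# Top ten stems and counts copied from the frequency table printed above
top_terms <- data.frame(
  word = c("sql", "tool", "busi", "model", "communic",
           "python", "team", "excel", "audit", "demonstr"),
  freq = c(39, 31, 31, 30, 29, 29, 28, 24, 21, 21)
)

# Horizontal bars, ordered so the most frequent term sits at the top
p <- ggplot(top_terms, aes(x = reorder(word, freq), y = freq)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = "Stemmed term", y = "Frequency",
       title = "Most frequent terms after cleanup")

print(p)
```

The fill colour, labels and title are arbitrary choices for the sketch.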
Skills Association
From the previous results, I know which skills are in demand for Data Analysts. Now I’ll find which words are most strongly associated with each of the following skills, using a minimum correlation of 0.25.
findAssocs(TextDoc_dtm, terms = c("sql","python","alteryx","insight","excel", "azur", "tableau", "powerbi", "model","communic", "team"), corlimit = 0.25)
## $sql
## server databas manipul
## 0.30 0.28 0.26
##
## $python
## languag java librari scala php
## 0.44 0.26 0.26 0.26 0.26
##
## $alteryx
## knime storag assist lab digit suit visual
## 0.45 0.45 0.45 0.45 0.31 0.26 0.26
##
## $insight
## synthes action competit custom
## 0.32 0.32 0.32 0.26
##
## $excel
## word powerpoint spreadsheet advanc
## 0.40 0.35 0.29 0.28
##
## $azur
## major – scientist migrat
## 0.45 0.31 0.26 0.26
##
## $tableau
## looker power qlik tool necess metabas periscop lab
## 0.47 0.44 0.38 0.30 0.27 0.27 0.27 0.27
## apach superset dax
## 0.27 0.27 0.27
##
## $powerbi
## metabas periscop looker qlik dashboard
## 0.45 0.45 0.40 0.31 0.26
##
## $model
## entiti semant regress dimension distinct jsdm json xdm
## 0.48 0.48 0.33 0.33 0.32 0.32 0.32 0.32
## elast linear price propens
## 0.32 0.32 0.32 0.32
##
## $communic
## verbal non simpl influenc interperson clear
## 0.40 0.36 0.36 0.31 0.26 0.26
## write varieti propos email
## 0.26 0.25 0.25 0.25
##
## $team
## effici player lead overal divers associ close
## 0.44 0.41 0.39 0.37 0.37 0.35 0.35
## enthusiast real want world member
## 0.35 0.35 0.35 0.35 0.26
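To show what these numbers mean, the sketch below computes the same kind of association by hand on a toy term-document matrix: the correlation between one term's counts and every other term's counts across documents. The matrix values and the helper name term_assocs are made up for illustration, and this is not tm's actual implementation. Note that readLines() turns every line of Dataset.txt into its own document, so the correlations above describe how often words appear together on the same line.

```r
# Toy term-document matrix: rows are terms, columns are documents (made-up counts)
m <- rbind(
  sql     = c(1, 1, 0, 1, 0),
  databas = c(1, 1, 0, 0, 0),
  excel   = c(0, 0, 1, 0, 1)
)

# Hypothetical helper mimicking the idea behind findAssocs():
# correlate one term's counts with every other term's counts,
# then keep the terms at or above the correlation limit
term_assocs <- function(m, term, corlimit) {
  others <- setdiff(rownames(m), term)
  r <- sapply(others, function(t) cor(m[term, ], m[t, ]))
  sort(r[r >= corlimit], decreasing = TRUE)
}

assoc <- term_assocs(m, "sql", corlimit = 0.25)
print(round(assoc, 2))
```

In this toy matrix, "databas" appears in documents where "sql" appears and so passes the limit, while "excel" appears only where "sql" is absent and is filtered out.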