Background & Findings

Data Gathering

The objective of this project is to identify the in-demand skills and requirements for Data Analyst roles in London, UK. I scraped data from 115 job posts on LinkedIn and gathered 6,145 data points. I’ll perform text analysis (word frequency and association) to find the in-demand skills and visualize the output as a word cloud.

Project Findings

Project Walk-through

Setup and Data Loading

# Install Packages
install.packages("tm")  # for text mining
install.packages("SnowballC") # for text stemming
install.packages("wordcloud") # word-cloud generator
install.packages("RColorBrewer") # color palettes
install.packages("syuzhet") # for sentiment analysis
install.packages("ggplot2") # for plotting graphs

# Load Libraries
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
library("syuzhet")
library("ggplot2")

# Dataset Import
text <- readLines("Dataset.txt")

# Load Data as Corpus
TextDoc <- Corpus(VectorSource(text))
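
readLines() returns one character element per line of the file, and VectorSource() treats each element as a separate document, so the corpus has as many documents as Dataset.txt has lines. The sketch below checks the line count and drops blank lines before a corpus would be built. It uses a temporary file with made-up contents in place of Dataset.txt, and whether blank lines should be dropped is an assumption, not something the original workflow does.

```r
# Made-up stand-in for Dataset.txt (illustrative contents only)
tmp <- tempfile(fileext = ".txt")
writeLines(c("SQL and Python required", "", "Strong Excel skills"), tmp)

lines <- readLines(tmp)
length(lines)  # one element per line, including the blank one

# Keep only lines that contain something other than white space
non_empty <- lines[nzchar(trimws(lines))]
length(non_empty)

unlink(tmp)  # remove the temporary file
```

Passing non_empty instead of the raw lines to VectorSource() would avoid empty documents in the corpus.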

Data Cleanup

As this is web-scraped data, it needs some basic cleanup. I’ll replace a few symbols with spaces, remove numbers and punctuation, convert the text to lowercase, strip extra white space, and remove English stop words. Finally, I’ll reduce the words to their root form by stemming.

toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
TextDoc <- tm_map(TextDoc, toSpace, "/") # Replace "/" with Space 
TextDoc <- tm_map(TextDoc, toSpace, "@") # Replace "@" with Space
TextDoc <- tm_map(TextDoc, toSpace, "\\|") # Replace "|" with Space (the pipe is escaped because it is a regex operator)
TextDoc <- tm_map(TextDoc, toSpace, "-") # Replace "-" with Space
TextDoc <- tm_map(TextDoc, removeNumbers) # Remove Numbers 
TextDoc <- tm_map(TextDoc, removePunctuation) # Remove Punctuation
TextDoc <- tm_map(TextDoc, content_transformer(tolower)) # Convert the text to lower case
TextDoc <- tm_map(TextDoc, stripWhitespace) # Remove White space
TextDoc <- tm_map(TextDoc, removeWords, stopwords("english")) # Remove common English stop words
TextDoc <- tm_map(TextDoc, stemDocument) # Converting to Root Format
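
The frequency table further down still contains an en dash token and a "bachelor’" entry with a curly apostrophe, which suggests these Unicode characters pass through the cleanup steps above. One way to handle them is sketched below in base R. The helper name clean_unicode_punct is hypothetical, and the sample string is made up.

```r
# Hypothetical helper: replace en dashes (U+2013) and curly
# apostrophes (U+2019) with a space, like toSpace does for "/" and "@"
clean_unicode_punct <- function(x) gsub("[\u2013\u2019]", " ", x)

sample_text <- "bachelor\u2019s \u2013 msc"
cleaned <- clean_unicode_punct(sample_text)
cleaned
```

To use it in the pipeline it could be wrapped with content_transformer() and applied with tm_map() before the stripWhitespace step, so the extra spaces it introduces are collapsed.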

Preliminary Results

Now I’ll view the preliminary results to see whether any unwanted words need to be removed. I’ll use the TermDocumentMatrix() function from the tm text mining package to build a table of word frequencies. The results are sorted in descending order, and the 400 most frequent words are shown.

TextDoc_dtm <- TermDocumentMatrix(TextDoc)
dtm_m <- as.matrix(TextDoc_dtm)
dtm_v <- sort(rowSums(dtm_m),decreasing=TRUE)
dtm_d <- data.frame(word = names(dtm_v),freq=dtm_v)
head(dtm_d, 400)
##                      word freq
## data                 data  191
## experi             experi  133
## work                 work   86
## analyt             analyt   66
## manag               manag   65
## skill               skill   58
## abil                 abil   52
## strong             strong   46
## knowledg         knowledg   44
## use                   use   40
## sql                   sql   39
## understand     understand   37
## includ             includ   32
## tool                 tool   31
## busi                 busi   31
## model               model   30
## communic         communic   29
## python             python   29
## team                 team   28
## process           process   26
## develop           develop   25
## servic             servic   24
## excel               excel   24
## client             client   22
## audit               audit   21
## demonstr         demonstr   21
## technic           technic   21
## build               build   21
## stakehold       stakehold   20
## requir             requir   20
## learn               learn   19
## complex           complex   19
## scienc             scienc   19
## insight           insight   19
## report             report   18
## financi           financi   18
## statist           statist   18
## effect             effect   17
## system             system   17
## good                 good   17
## problem           problem   17
## environ           environ   17
## support           support   16
## –                       –   16
## degre               degre   16
## year                 year   16
## intern             intern   15
## analysi           analysi   15
## technolog       technolog   15
## within             within   14
## level               level   14
## larg                 larg   14
## account           account   14
## project           project   14
## engin               engin   14
## financ             financ   14
## tableau           tableau   14
## will                 will   14
## abl                   abl   14
## perform           perform   13
## qualiti           qualiti   13
## activ               activ   13
## relev               relev   13
## industri         industri   13
## visualis         visualis   13
## solut               solut   13
## etc                   etc   13
## creat               creat   12
## relationship relationship   12
## present           present   12
## solv                 solv   12
## machin             machin   12
## comfort           comfort   12
## googl               googl   12
## research         research   12
## coach               coach   11
## relat               relat   11
## provid             provid   11
## previous         previous   11
## written           written   11
## organis           organis   11
## prefer             prefer   11
## deliveri         deliveri   11
## languag           languag   11
## program           program   11
## profici           profici   11
## queri               queri   11
## power               power   11
## deliv               deliv   10
## time                 time   10
## oper                 oper   10
## comput             comput   10
## similar           similar   10
## databas           databas   10
## econom             econom   10
## standard         standard    9
## identifi         identifi    9
## high                 high    9
## well                 well    9
## set                   set    9
## equival           equival    9
## practic           practic    9
## structur         structur    9
## record             record    9
## track               track    9
## appli               appli    9
## across             across    9
## document         document    9
## can                   can    9
## market             market    9
## proven             proven    9
## need                 need    8
## profession     profession    8
## extern             extern    8
## posit               posit    8
## self                 self    8
## term                 term    8
## manipul           manipul    8
## issu                 issu    8
## new                   new    8
## background     background    8
## test                 test    8
## one                   one    8
## plan                 plan    8
## aca                   aca    8
## acca                 acca    8
## detail             detail    8
## mathemat         mathemat    8
## quantit           quantit    8
## desir               desir    8
## design             design    8
## advanc             advanc    8
## familiar         familiar    8
## improv             improv    7
## lead                 lead    7
## senior             senior    7
## advantag         advantag    7
## verbal             verbal    7
## essenti           essenti    7
## consult           consult    7
## commerci         commerci    7
## inform             inform    7
## platform         platform    7
## cloud               cloud    7
## product           product    7
## ideal               ideal    7
## role                 role    7
## principl         principl    7
## write               write    7
## analyz             analyz    7
## etl                   etl    7
## concept           concept    7
## qualifi           qualifi    7
## other               other    6
## non                   non    6
## offic               offic    6
## qualif             qualif    6
## part                 part    6
## sas                   sas    6
## execut             execut    6
## techniqu         techniqu    6
## govern             govern    6
## interest         interest    6
## architectur   architectur    6
## like                 like    6
## orient             orient    6
## least               least    6
## junior             junior    6
## follow             follow    6
## deadlin           deadlin    6
## applic             applic    6
## current           current    6
## help                 help    6
## basic               basic    6
## fast                 fast    6
## pace                 pace    6
## passion           passion    6
## action             action    6
## method             method    6
## focus               focus    5
## extens             extens    5
## chang               chang    5
## motiv               motiv    5
## toward             toward    5
## function         function    5
## leadership     leadership    5
## alteryx           alteryx    5
## collabor         collabor    5
## except             except    5
## softwar           softwar    5
## azur                 azur    5
## organ               organ    5
## field               field    5
## studi               studi    5
## english           english    5
## implement       implement    5
## multipl           multipl    5
## compet             compet    5
## bigqueri         bigqueri    5
## find                 find    5
## contribut       contribut    5
## analyst           analyst    5
## extract           extract    5
## powerbi           powerbi    5
## engag               engag    5
## candid             candid    5
## gather             gather    5
## looker             looker    5
## effici             effici    4
## particular     particular    4
## area                 area    4
## framework       framework    4
## result             result    4
## subject           subject    4
## colleagu         colleagu    4
## core                 core    4
## peopl               peopl    4
## task                 task    4
## meet                 meet    4
## innov               innov    4
## control           control    4
## exist               exist    4
## opportun         opportun    4
## firm                 firm    4
## generat           generat    4
## prepar             prepar    4
## trend               trend    4
## measur             measur    4
## member             member    4
## big                   big    4
## gcp                   gcp    4
## convers           convers    4
## combin             combin    4
## attent             attent    4
## integr             integr    4
## creativ           creativ    4
## interperson   interperson    4
## approach         approach    4
## impact             impact    4
## dataset           dataset    4
## definit           definit    4
## collect           collect    4
## discoveri       discoveri    4
## person             person    4
## think               think    4
## sector             sector    4
## methodolog     methodolog    4
## cleans             cleans    4
## clear               clear    4
## risk                 risk    4
## futur               futur    4
## solid               solid    4
## key                   key    4
## bachelor         bachelor    4
## map                   map    4
## must                 must    4
## storytel         storytel    4
## access             access    4
## strategi         strategi    4
## word                 word    4
## quick               quick    4
## social             social    4
## deep                 deep    4
## ica                   ica    4
## aftermarket   aftermarket    4
## tag                   tag    4
## epidemiolog   epidemiolog    4
## health             health    4
## respons           respons    3
## supervis         supervis    3
## output             output    3
## assur               assur    3
## review             review    3
## user                 user    3
## expert             expert    3
## pro                   pro    3
## attitud           attitud    3
## prioritis       prioritis    3
## willing           willing    3
## minimum           minimum    3
## univers           univers    3
## audienc           audienc    3
## revenu             revenu    3
## compani           compani    3
## anomali           anomali    3
## optimis           optimis    3
## recommend       recommend    3
## analys             analys    3
## necessari       necessari    3
## scientist       scientist    3
## object             object    3
## demand             demand    3
## msc                   msc    3
## phd                   phd    3
## plus                 plus    3
## academ             academ    3
## exposur           exposur    3
## awar                 awar    3
## tight               tight    3
## flexibl           flexibl    3
## pragmat           pragmat    3
## idea                 idea    3
## defin               defin    3
## know                 know    3
## rang                 rang    3
## content           content    3
## determin         determin    3
## error               error    3
## rule                 rule    3
## critic             critic    3
## explain           explain    3
## situat             situat    3
## differ             differ    3
## influenc         influenc    3
## code                 code    3
## transform       transform    3
## general           general    3
## concis             concis    3
## progress         progress    3
## budget             budget    3
## individu         individu    3
## full                 full    3
## capabl             capabl    3
## master             master    3
## start               start    3
## adob                 adob    3
## api                   api    3
## migrat             migrat    3
## expertis         expertis    3
## snowflak         snowflak    3
## artefact         artefact    3
## dashboard       dashboard    3
## suit                 suit    3
## visual             visual    3
## administr       administr    3
## bachelor’       bachelor’    3
## mindset           mindset    3
## player             player    3
## autom               autom    3
## web                   web    3
## growth             growth    3
## deal                 deal    3
## agil                 agil    3
## love                 love    3
## systemat         systemat    3
## take                 take    3
## sourc               sourc    3
## pressur           pressur    3
## powerpoint     powerpoint    3
## challeng         challeng    3
## media               media    3
## evalu               evalu    3
## prioriti         prioriti    3
## custom             custom    3
## target             target    3
## html                 html    3
## adher               adher    2
## ensur               ensur    2
## feedback         feedback    2
## mental             mental    2
## tailor             tailor    2
## upon                 upon    2
## overal             overal    2
## perspect         perspect    2
## consist           consist    2
## point               point    2
## escal               escal    2
## face                 face    2
## bespok             bespok    2
## depend             depend    2
## line                 line    2
## mentor             mentor    2
## first               first    2
## addit               addit    2
## success           success    2
## complet           complet    2
## limit               limit    2
## origin             origin    2
## pitch               pitch    2
## portfolio       portfolio    2
## java                 java    2
## librari           librari    2
## vision             vision    2
## forecast         forecast    2
## hypothes         hypothes    2
## pattern           pattern    2
## predict           predict    2
## metric             metric    2
## aspect             aspect    2
## experiment     experiment    2
## centric           centric    2
## stori               stori    2
## scala               scala    2
## script             script    2
## store               store    2
## varieti           varieti    2
## git                   git    2
## best                 best    2
## nosql               nosql    2
## certif             certif    2
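
One thing to keep in mind when reading this table: by default TermDocumentMatrix() only keeps terms of at least three characters, so short tokens such as "r" or "bi" cannot appear in it, however often the posts mention them. The sketch below shows how the wordLengths control option changes that. The two sample sentences are made up, and the sketch assumes the tm package is installed.

```r
library(tm)

# Made-up documents containing short tokens ("r", "bi")
docs <- Corpus(VectorSource(c("r and sql required",
                              "power bi or r preferred")))

# Default settings: terms shorter than three characters are dropped
tdm_default <- TermDocumentMatrix(docs)

# Keep terms of any length
tdm_short <- TermDocumentMatrix(docs,
                                control = list(wordLengths = c(1, Inf)))

"r" %in% Terms(tdm_default)
"r" %in% Terms(tdm_short)
```

Lowering the minimum length also lets in more noise (stray single letters, for example), so the removal list in the next step would probably need to grow.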

Removing Irrelevant Words

After an initial inspection of the results, I’ll remove 61 words from the data set. Then I’ll rebuild the term-document matrix with TermDocumentMatrix() and recount the word frequencies.

TextDoc <- tm_map(TextDoc, removeWords, c(
  "data", "experi", "work", "analyt", "manag", "skill", "abil", "strong",
  "knowledg", "use", "understand", "includ", "process", "develop", "servic",
  "client", "technic", "build", "stakehold", "requir", "learn", "effect",
  "system", "good", "degre", "year", "intern", "analysi", "technolog",
  "within", "level", "account", "project", "engin", "financ", "will", "abl",
  "perform", "activ", "relev", "industri", "visualis", "solut", "etc",
  "creat", "relationship", "present", "solv", "comfort", "coach", "relat",
  "provid", "previous", "written", "organis", "program", "profici", "queri",
  "deliv", "time", "oper"
))

TextDoc_dtm <- TermDocumentMatrix(TextDoc)
dtm_m <- as.matrix(TextDoc_dtm)
dtm_v <- sort(rowSums(dtm_m),decreasing=TRUE)
dtm_d <- data.frame(word = names(dtm_v),freq=dtm_v)
head(dtm_d, 150)
##                    word freq
## sql                 sql   39
## tool               tool   31
## busi               busi   31
## model             model   30
## communic       communic   29
## python           python   29
## team               team   28
## excel             excel   24
## audit             audit   21
## demonstr       demonstr   21
## complex         complex   19
## scienc           scienc   19
## insight         insight   19
## report           report   18
## financi         financi   18
## statist         statist   18
## problem         problem   17
## environ         environ   17
## support         support   16
## –                     –   16
## larg               larg   14
## tableau         tableau   14
## qualiti         qualiti   13
## machin           machin   12
## googl             googl   12
## research       research   12
## prefer           prefer   11
## deliveri       deliveri   11
## languag         languag   11
## power             power   11
## comput           comput   10
## similar         similar   10
## databas         databas   10
## econom           econom   10
## standard       standard    9
## identifi       identifi    9
## high               high    9
## well               well    9
## set                 set    9
## equival         equival    9
## practic         practic    9
## structur       structur    9
## record           record    9
## track             track    9
## appli             appli    9
## across           across    9
## document       document    9
## can                 can    9
## market           market    9
## proven           proven    9
## need               need    8
## profession   profession    8
## extern           extern    8
## posit             posit    8
## self               self    8
## term               term    8
## manipul         manipul    8
## issu               issu    8
## new                 new    8
## background   background    8
## test               test    8
## one                 one    8
## plan               plan    8
## aca                 aca    8
## acca               acca    8
## detail           detail    8
## mathemat       mathemat    8
## quantit         quantit    8
## desir             desir    8
## design           design    8
## advanc           advanc    8
## familiar       familiar    8
## improv           improv    7
## lead               lead    7
## senior           senior    7
## advantag       advantag    7
## verbal           verbal    7
## essenti         essenti    7
## consult         consult    7
## commerci       commerci    7
## inform           inform    7
## platform       platform    7
## cloud             cloud    7
## product         product    7
## ideal             ideal    7
## role               role    7
## principl       principl    7
## write             write    7
## analyz           analyz    7
## etl                 etl    7
## concept         concept    7
## qualifi         qualifi    7
## other             other    6
## non                 non    6
## offic             offic    6
## qualif           qualif    6
## part               part    6
## sas                 sas    6
## execut           execut    6
## techniqu       techniqu    6
## govern           govern    6
## interest       interest    6
## architectur architectur    6
## like               like    6
## orient           orient    6
## least             least    6
## junior           junior    6
## follow           follow    6
## deadlin         deadlin    6
## applic           applic    6
## current         current    6
## help               help    6
## basic             basic    6
## fast               fast    6
## pace               pace    6
## passion         passion    6
## action           action    6
## method           method    6
## focus             focus    5
## extens           extens    5
## chang             chang    5
## motiv             motiv    5
## toward           toward    5
## function       function    5
## leadership   leadership    5
## alteryx         alteryx    5
## collabor       collabor    5
## except           except    5
## softwar         softwar    5
## azur               azur    5
## organ             organ    5
## field             field    5
## studi             studi    5
## english         english    5
## implement     implement    5
## multipl         multipl    5
## compet           compet    5
## bigqueri       bigqueri    5
## find               find    5
## contribut     contribut    5
## analyst         analyst    5
## extract         extract    5
## powerbi         powerbi    5
## engag             engag    5
## candid           candid    5
## gather           gather    5
## looker           looker    5
## effici           effici    4
## particular   particular    4
## area               area    4
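
If only the frequency table is needed, a lighter alternative is to filter the data frame instead of editing the corpus. The sketch below does this on a toy table standing in for dtm_d. The names freq_table and drop_words are hypothetical, and the counts are illustrative only.

```r
# Toy stand-in for dtm_d (illustrative counts only)
freq_table <- data.frame(word = c("data", "sql", "work", "python"),
                         freq = c(10, 5, 4, 3),
                         stringsAsFactors = FALSE)

# Hypothetical subset of the removal list
drop_words <- c("data", "work")

# Keep only rows whose word is not in the removal list
filtered <- freq_table[!freq_table$word %in% drop_words, ]
filtered
```

This leaves the corpus itself untouched, so corpus-based steps such as findAssocs() would still see the removed words.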

Generating Word Cloud

Now that the results are satisfactory, I’ll visualize the output as a word cloud with a minimum frequency of 5 and a maximum of 150 words, plotted in descending order of frequency.

# Generate word cloud
set.seed(1234)
wordcloud(words = dtm_d$word, freq = dtm_d$freq, min.freq = 5,
          max.words=150, random.order=FALSE, rot.per=0.40, 
          colors=brewer.pal(8, "Dark2"))
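
The second frequency table lists 147 words with a frequency of at least 5, so min.freq rather than max.words appears to be the binding limit here. The sketch below approximates which rows would be eligible for the cloud. The helper cloud_candidates is hypothetical and is not part of the wordcloud package. The toy counts are taken from the table above.

```r
# Hypothetical helper approximating the min.freq / max.words selection
cloud_candidates <- function(freq_table, min_freq = 5, max_words = 150) {
  kept <- freq_table[freq_table$freq >= min_freq, ]  # drop rare words
  kept <- kept[order(-kept$freq), ]                  # most frequent first
  head(kept, max_words)                              # cap the word count
}

# A few rows copied from the frequency table
toy <- data.frame(word = c("sql", "python", "tableau", "area"),
                  freq = c(39, 29, 14, 4),
                  stringsAsFactors = FALSE)

cloud_candidates(toy)$word
```

Running nrow(cloud_candidates(dtm_d)) on the real table would show how many words the chosen thresholds let through.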

Skills Association

From the previous results, I know which skills are in demand for Data Analysts. Now I’ll find which other words are most strongly associated with each of the following skills, using a minimum correlation of 0.25.

findAssocs(TextDoc_dtm, terms = c("sql","python","alteryx","insight","excel", "azur", "tableau", "powerbi", "model","communic", "team"), corlimit = 0.25)
## $sql
##  server databas manipul 
##    0.30    0.28    0.26 
## 
## $python
## languag    java librari   scala     php 
##    0.44    0.26    0.26    0.26    0.26 
## 
## $alteryx
##  knime storag assist    lab  digit   suit visual 
##   0.45   0.45   0.45   0.45   0.31   0.26   0.26 
## 
## $insight
##  synthes   action competit   custom 
##     0.32     0.32     0.32     0.26 
## 
## $excel
##        word  powerpoint spreadsheet      advanc 
##        0.40        0.35        0.29        0.28 
## 
## $azur
##     major         – scientist    migrat 
##      0.45      0.31      0.26      0.26 
## 
## $tableau
##   looker    power     qlik     tool   necess  metabas periscop      lab 
##     0.47     0.44     0.38     0.30     0.27     0.27     0.27     0.27 
##    apach superset      dax 
##     0.27     0.27     0.27 
## 
## $powerbi
##   metabas  periscop    looker      qlik dashboard 
##      0.45      0.45      0.40      0.31      0.26 
## 
## $model
##    entiti    semant   regress dimension  distinct      jsdm      json       xdm 
##      0.48      0.48      0.33      0.33      0.32      0.32      0.32      0.32 
##     elast    linear     price   propens 
##      0.32      0.32      0.32      0.32 
## 
## $communic
##      verbal         non       simpl    influenc interperson       clear 
##        0.40        0.36        0.36        0.31        0.26        0.26 
##       write     varieti      propos       email 
##        0.26        0.25        0.25        0.25 
## 
## $team
##     effici     player       lead     overal     divers     associ      close 
##       0.44       0.41       0.39       0.37       0.37       0.35       0.35 
## enthusiast       real       want      world     member 
##       0.35       0.35       0.35       0.35       0.26
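
findAssocs() reports, for each term, the other terms whose per-document counts correlate with it at or above corlimit. Because the corpus was built from readLines(), each line of Dataset.txt counts as one document here. The sketch below shows the same kind of calculation in base R on a made-up matrix of 3 terms and 5 documents, using Pearson correlation via cor(). The counts are illustrative and do not come from the dataset.

```r
# Made-up term-by-document count matrix (rows = terms, columns = documents)
tdm <- matrix(c(1, 0, 1, 1, 0,
                1, 0, 1, 0, 0,
                0, 1, 0, 0, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("sql", "databas", "excel"),
                              paste0("doc", 1:5)))

# Correlate "sql" with every term across documents
assoc <- cor(t(tdm))["sql", ]

# Drop the term itself, sort, and apply the correlation limit
assoc <- sort(assoc[names(assoc) != "sql"], decreasing = TRUE)
assoc <- assoc[assoc >= 0.25]
round(assoc, 2)
```

In this toy matrix "databas" co-occurs with "sql" in the same documents and passes the limit, while "excel" never appears alongside it and is filtered out.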