The three documents that compose our corpus can be found here:
Indeed: https://raw.githubusercontent.com/vindication09/Data607_Project3/master/Indeed_Corpus.txt
ZipRec: https://raw.githubusercontent.com/vindication09/Data607_Project3/master/ZipRecruiter_Corpus.txt
Reddit: https://raw.githubusercontent.com/vindication09/Data607_Project3/master/Reddit_Corpus.txt
This may solve the memory problem with the R markdown. library(ff)
library(ff)
## Warning: package 'ff' was built under R version 3.4.4
## Warning: package 'bit' was built under R version 3.4.1
In order to reduce depending on large files, we will retireve the corpus from their respective online sources.
library(readr)
## Warning: package 'readr' was built under R version 3.4.3
library(stringr)
## Warning: package 'stringr' was built under R version 3.4.1
indeed <- read_file("https://raw.githubusercontent.com/vindication09/Data607_Project3/master/Indeed_Corpus.txt")
ziprec<-read_file("https://raw.githubusercontent.com/vindication09/Data607_Project3/master/ZipRecruiter_Corpus.txt")
reddit<-read_file("https://raw.githubusercontent.com/vindication09/Data607_Project3/master/Reddit_Corpus.txt")
Let’s clean out anything that is not an alphanumeric character.
str_replace_all(indeed, "[^[:alnum:]]", "")
str_replace_all(ziprec, "[^[:alnum:]]", "")
str_replace_all(reddit, "[^[:alnum:]]", "")
The special characters will cause an error. We need to remove them.
indeed <- sapply(indeed,function(row) iconv(row, "latin1", "ASCII", sub=""))
ziprec <- sapply(ziprec,function(row) iconv(row, "latin1", "ASCII", sub=""))
reddit <- sapply(reddit,function(row) iconv(row, "latin1", "ASCII", sub=""))
We have our documents imported into our working directory, so we can bind them together.
docs<-c(indeed, ziprec, reddit)
We need additional libraries.
library(tm)
## Warning: package 'tm' was built under R version 3.4.4
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.4.1
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(stringr)
Define the corpus.
corpus <-Corpus(VectorSource(docs))
There should be 3 documents in our corpus. Let’s check.
length(corpus)
## [1] 3
Let’s inspect documents.
corpus[[1]];corpus[[2]];corpus[[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 630186
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1377744
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1242510
Let’s convert to a docment term matrix. We need to process the data first. (Note: tolower was giving an encoding error)
doc.corpus <- tm_map(corpus, tolower)
doc.corpus <- tm_map(corpus, removePunctuation)
doc.corpus <- tm_map(corpus, removeNumbers)
doc.corpus <- tm_map(corpus, removeWords, stopwords("english"))
This next part removes endings such as “ing” or “s”.
library(SnowballC)
## Warning: package 'SnowballC' was built under R version 3.4.1
doc.corpus <- tm_map(doc.corpus, stemDocument)
We should remove additional whitespace caused by all our transformations.
doc.corpus <- tm_map(doc.corpus, stripWhitespace)
Now we create a term document matrix.
tdm = TermDocumentMatrix(doc.corpus)
tdm
## <<TermDocumentMatrix (terms: 14731, documents: 3)>>
## Non-/sparse entries: 21325/22868
## Sparsity : 52%
## Maximal term length: 89
## Weighting : term frequency (tf)
Let’s inspect some elements in the index. (Commented out due to large output)
#inspect(tdm[1:3,1:3])
We can compute the transpose of this matrix.
dtm <- DocumentTermMatrix(doc.corpus)
Inspect some elements in the transpose matrix. (commented out due to large output)
#inspect(dtm[1:3,1:3])
There seems to be no indication that the transpose is better than the original so we will keep working off tdm.
Let’s find the most frequent words.
findFreqTerms(tdm, 2000)
## [1] "busi" "data" "develop" "experi" "scienc" "team" "will"
## [8] "work"
We can also do some word association. How do the words in this corpus associate with data? Let’s set our correlation threshhold at 0.99.
findAssocs(tdm, "data", 0.99)
## $data
## 000 3rd algorithms amazon austin
## 1.00 1.00 1.00 1.00 1.00
## auto avail bound chief cluster
## 1.00 1.00 1.00 1.00 1.00
## colorado company concurr constant defens
## 1.00 1.00 1.00 1.00 1.00
## depart deriv diagram eleg element
## 1.00 1.00 1.00 1.00 1.00
## era exceed excel expans experienc
## 1.00 1.00 1.00 1.00 1.00
## export exposur famili five floor
## 1.00 1.00 1.00 1.00 1.00
## follow four fulfil greater greatest
## 1.00 1.00 1.00 1.00 1.00
## holist immedi index industri invest
## 1.00 1.00 1.00 1.00 1.00
## investor issu knowledge latenc layout
## 1.00 1.00 1.00 1.00 1.00
## life lift logic march mentor
## 1.00 1.00 1.00 1.00 1.00
## microsoft minim minor modeling mongodb
## 1.00 1.00 1.00 1.00 1.00
## occasion occur overal paid parti
## 1.00 1.00 1.00 1.00 1.00
## petabyt placement pleas position postgr
## 1.00 1.00 1.00 1.00 1.00
## practic princip produc profil progress
## 1.00 1.00 1.00 1.00 1.00
## promot proof pyspark quality queri
## 1.00 1.00 1.00 1.00 1.00
## reassign reduc region requisit respect
## 1.00 1.00 1.00 1.00 1.00
## review routin sale senior seven
## 1.00 1.00 1.00 1.00 1.00
## site snowflak south specialti substitut
## 1.00 1.00 1.00 1.00 1.00
## suitabl throughput timelin title tools
## 1.00 1.00 1.00 1.00 1.00
## understand view virtual vulner women
## 1.00 1.00 1.00 1.00 1.00
## workshop 2005 2006 3nf 5pm
## 1.00 1.00 1.00 1.00 1.00
## ace agent ansi apex arm
## 1.00 1.00 1.00 1.00 1.00
## automation aws backlog bottl bottleneck
## 1.00 1.00 1.00 1.00 1.00
## brooklyn calendar calib clusters converg
## 1.00 1.00 1.00 1.00 1.00
## cube ddl deem diploma disclaimer
## 1.00 1.00 1.00 1.00 1.00
## disconnect disk dress duplic engineer
## 1.00 1.00 1.00 1.00 1.00
## exhaust frameworks freelanc gen graviti
## 1.00 1.00 1.00 1.00 1.00
## guard harass holder inter intim
## 1.00 1.00 1.00 1.00 1.00
## japan kimbal louisiana martin matricul
## 1.00 1.00 1.00 1.00 1.00
## meal minneapoli monet mpp nashvill
## 1.00 1.00 1.00 1.00 1.00
## needs norm nuclear obie office
## 1.00 1.00 1.00 1.00 1.00
## oracle others pension pharma poc
## 1.00 1.00 1.00 1.00 1.00
## polic powerbi proficient quest ration
## 1.00 1.00 1.00 1.00 1.00
## reader redund repositori rethink reusabl
## 1.00 1.00 1.00 1.00 1.00
## robert seattl shore sites situations
## 1.00 1.00 1.00 1.00 1.00
## solving steer tabular tea toad
## 1.00 1.00 1.00 1.00 1.00
## upstream urban users viabil wednesday
## 1.00 1.00 1.00 1.00 1.00
## winner workspac writer advoc bash
## 1.00 1.00 1.00 0.99 0.99
## big comfort confirm conflict contact
## 0.99 0.99 0.99 0.99 0.99
## cybersecur extens fortun frequent grown
## 0.99 0.99 0.99 0.99 0.99
## guid gym internet period pound
## 0.99 0.99 0.99 0.99 0.99
## pressur purchas redshift represent req
## 0.99 0.99 0.99 0.99 0.99
## schema sick stabl staf studio
## 0.99 0.99 0.99 0.99 0.99
## subject substanti tight web year
## 0.99 0.99 0.99 0.99 0.99
## air athena bigqueri bug category
## 0.99 0.99 0.99 0.99 0.99
## certifications concepts culture demo dynamodb
## 0.99 0.99 0.99 0.99 0.99
## elt exp h1b horizont hybrid
## 0.99 0.99 0.99 0.99 0.99
## mart metro microstrategi mutual off
## 0.99 0.99 0.99 0.99 0.99
## parser patch prem quickly reus
## 0.99 0.99 0.99 0.99 0.99
## shave slas solicit southern ssis
## 0.99 0.99 0.99 0.99 0.99
## uncommon value wrangler
## 0.99 0.99 0.99
We can see there are sparse terms, meaning they don’t occur often. Let’s remove them.
tdm.common = removeSparseTerms(tdm, 0.1)
#compare dimensions
dim(tdm);dim(tdm.common)
## [1] 14731 3
## [1] 2316 3
Let’s inspect our reduced matrix. (commented out due to large output)
#inspect(tdm.common[1:3,1:3])
Let’s visualize the contents of our newly reduced matrix.
library(slam)
## Warning: package 'slam' was built under R version 3.4.3
tdm.dense <- as.matrix(tdm.common)
#tdm.dense
Convert the matrix to a tidy format.
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.4.3
tdm.dense = melt(tdm.dense, value.name = "count")
#head(tdm.dense)
We now have a long format data frame that can be stored in a relational database containing the information from our clean document term matrix.
tdm.dense.df<-data.frame(tdm.dense)
Let’s continue on with our investigation. What are the 50 most frequent terms in our clean matrix?
freq=rowSums(as.matrix(tdm.common))
head(freq,50)
## 000 100 10s 120 150 200
## 107 101 5 10 10 18
## 2007 2008 2011 2014 2015 2016
## 6 5 4 10 20 29
## 2017 2018 21st 300 3rd 500
## 31 47 5 10 15 44
## 700 800 abil abl absolut abstract
## 11 9 794 406 47 19
## academ academia acceler accept access accommod
## 70 60 27 82 196 65
## accomplish accord account accredit accur accuraci
## 35 54 110 32 65 56
## achiev acknowledg acquir acquisit across act
## 141 4 39 49 430 76
## action activ actual actuari acumen adapt
## 178 221 238 17 20 61
## add addit
## 82 162
How about the bottom?
tail(freq,50)
## what whatev wherev whether white whole wholli whose
## 18 65 8 81 8 67 6 12
## wid wide wider will willing win wish within
## 32 111 10 2132 31 59 34 356
## without woman women won wonder word work worker
## 309 8 25 3 56 88 3308 23
## workflow workforc workload workplac workshop world worldwid worth
## 68 38 17 54 11 418 25 93
## wrangl write written www xml yarn year years
## 27 267 224 162 30 17 1405 41
## yes yet yield york youd youll young your
## 89 86 16 50 10 139 35 59
## youv zero
## 20 18
Let’s see if we can get a better story by performing tf-idf.
tdm_B = TermDocumentMatrix(doc.corpus,
control = list(weighting = weightTfIdf,
stopwords = 'english',
removePunctuation = T,
removeNumbers = T,
stemming = T))
tdm_B
## <<TermDocumentMatrix (terms: 12450, documents: 3)>>
## Non-/sparse entries: 11894/25456
## Sparsity : 68%
## Maximal term length: 83
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
Now Let’s inspect the frequencies based on tf-idf weighting.
freq2=rowSums(as.matrix(tdm_B))
head(freq2,10);tail(freq2,10)
## aaa aab aad aap aath
## 1.447296e-05 1.217180e-05 2.652169e-05 3.651539e-05 1.217180e-05
## aawesom aback abandon abbott abbottcor
## 1.447296e-05 1.447296e-05 2.581298e-05 4.868718e-05 1.217180e-05
## zna zone zonesif zoo zookeep
## 2.652169e-05 4.297810e-05 1.217180e-05 2.894591e-05 9.737436e-05
## zookeeperjson zoomdata zumiez zuora zurich
## 1.217180e-05 1.217180e-05 1.217180e-05 2.434359e-05 1.326084e-04
Let’s plot the frequencies.
plot(sort(freq2, decreasing = T),col="blue",main="Word TF-IDF frequencies", xlab="TF-IDF-based rank", ylab = "TF-IDF")
The frequencies follow the expected pattern for tf-idf weighting .
Let’s plot the most frequent terms.
high.freq=tail(sort(freq2),n=10)
hfp.df=as.data.frame(sort(high.freq))
hfp.df$names <- rownames(hfp.df)
ggplot(hfp.df, aes(reorder(names,high.freq), high.freq)) +
geom_bar(stat="identity") + coord_flip() +
xlab("Terms") + ylab("Frequency") +
ggtitle("Term frequencies")
Based on tf-idf,we are not told a better story. We cannot conclude that tf-idf is a better algorithm to give us insight on what the top data science skills are. I can only speculate that this is due to the inverse relationship in tf-idf. Higher weight is assigned to terms that are “rare.” The skills and education credentials are not “rare” words in our corpus since they are mentioned all the time. This would explain why they were assigned low tf-idf weights.
Let’s compare this to the frequencies from our original document term matrix.
#use freq
high.freq=tail(sort(freq),n=10)
hfp.df=as.data.frame(sort(high.freq))
hfp.df$names <- rownames(hfp.df)
ggplot(hfp.df, aes(reorder(names,high.freq), high.freq)) +
geom_bar(stat="identity") + coord_flip() +
xlab("Terms") + ylab("Frequency") +
ggtitle("Term frequencies")
Let’s put the information from the frequencies into an all inclusive word cloud.
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.4.4
## Loading required package: RColorBrewer
wordcloud (tdm.dense.df$Terms, tdm.dense.df$count, random.order=FALSE, max.words=100, colors=brewer.pal(8, "Dark2"))
Used word matrix network and topic analysis to get more insights.
library(topicmodels)
## Warning: package 'topicmodels' was built under R version 3.4.4
#word matrix network
freq_terms <- findFreqTerms(tdm, 2000) #set freq_terms
plot(tdm, term = freq_terms, corThreshold = 0.1, weighting = T) # find terms that have corr bigger or equal to 0.1.
#topic analysis
dtm <- as.DocumentTermMatrix(tdm)
lda <- LDA(dtm, k = 10) # get 10 topics
term <- terms(lda, 5) # get first 5 terms of every topic
(term <- apply(term, MARGIN = 2, paste, collapse = ", "))
## Topic 1
## "data, work, experi, learn, develop"
## Topic 2
## "data, peopl, work, will, use"
## Topic 3
## "data, analyt, statist, engin, use"
## Topic 4
## "data, engin, requir, develop, manag"
## Topic 5
## "get, can, learn, scienc, scientist"
## Topic 6
## "data, manag, team, work, busi"
## Topic 7
## "data, experi, team, will, analyt"
## Topic 8
## "data, experi, develop, design, busi"
## Topic 9
## "requir, data, experi, work, scienc"
## Topic 10
## "experi, data, build, work, support"
The plain counts of words tells a better story. We see words such as experience, team, and learn. We can infer that data scientist positions requires experience, working on a team, and learning. The word data is the top result. Perhaps data scientists should have a good understanding of the data they are working with. The word cloud features words such as develop, programming, data, experience…etc, We also see some of the skills that popped up in the word clouds pertaining to each individual job search site. The inclusive word cloud based on the job descriptions appears to caputre the top soft skills that were not captured by the indivudal world clouds. We also include a topic analysis using LDA. Based on our combined efforts, we can conclude that some important soft skills are: management learning experience team work understanding