The three documents that compose our corpus can be found here:
Indeed: https://raw.githubusercontent.com/vindication09/Data607_Project3/master/Indeed_Corpus.txt
ZipRec: https://raw.githubusercontent.com/vindication09/Data607_Project3/master/ZipRecruiter_Corpus.txt
Reddit: https://raw.githubusercontent.com/vindication09/Data607_Project3/master/Reddit_Corpus.txt
This may solve the memory problem with the R Markdown document.
library(ff)
## Warning: package 'ff' was built under R version 3.4.4
## Warning: package 'bit' was built under R version 3.4.1
To avoid depending on large local files, we will retrieve the corpora from their respective online sources.
library(readr)
## Warning: package 'readr' was built under R version 3.4.3
library(stringr)
## Warning: package 'stringr' was built under R version 3.4.1
indeed <- read_file("https://raw.githubusercontent.com/vindication09/Data607_Project3/master/Indeed_Corpus.txt")
ziprec <- read_file("https://raw.githubusercontent.com/vindication09/Data607_Project3/master/ZipRecruiter_Corpus.txt")
reddit <- read_file("https://raw.githubusercontent.com/vindication09/Data607_Project3/master/Reddit_Corpus.txt")
Let’s clean out anything that is not an alphanumeric character, replacing it with a space so word boundaries are preserved. Note that the results must be assigned back, otherwise the replacements are discarded.
indeed <- str_replace_all(indeed, "[^[:alnum:]]", " ")
ziprec <- str_replace_all(ziprec, "[^[:alnum:]]", " ")
reddit <- str_replace_all(reddit, "[^[:alnum:]]", " ")
Non-ASCII special characters will cause encoding errors downstream, so we convert the text to ASCII and drop them.
indeed <- sapply(indeed, function(row) iconv(row, "latin1", "ASCII", sub=""))
ziprec <- sapply(ziprec, function(row) iconv(row, "latin1", "ASCII", sub=""))
reddit <- sapply(reddit, function(row) iconv(row, "latin1", "ASCII", sub=""))
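As a quick sanity check (a minimal sketch; the [[:ascii:]] character class is only available with perl = TRUE), we can confirm that no non-ASCII characters remain:
any(grepl("[^[:ascii:]]", c(indeed, ziprec, reddit), perl = TRUE)) # should be FALSE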
With all three documents loaded into our workspace, we can combine them into a single character vector.
docs <- c(indeed, ziprec, reddit)
We need additional libraries.
library(tm)
## Warning: package 'tm' was built under R version 3.4.4
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.4.1
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
Define the corpus.
corpus <- Corpus(VectorSource(docs))
There should be 3 documents in our corpus. Let’s check.
length(corpus)
## [1] 3
Let’s inspect the documents.
corpus[[1]];corpus[[2]];corpus[[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 630186
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1377744
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1242510
Let’s convert to a document-term matrix. We need to preprocess the data first. Each tm_map call must be chained off the previous result, otherwise only the last transformation survives. (Note: tolower was giving an encoding error, so we wrap it in content_transformer.)
doc.corpus <- tm_map(corpus, content_transformer(tolower))
doc.corpus <- tm_map(doc.corpus, removePunctuation)
doc.corpus <- tm_map(doc.corpus, removeNumbers)
doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
This next part removes endings such as “ing” or “s”.
library(SnowballC)
## Warning: package 'SnowballC' was built under R version 3.4.1
doc.corpus <- tm_map(doc.corpus, stemDocument)
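To see what the stemmer does, here is a tiny illustrative sketch using SnowballC’s wordStem directly (the sample words are ours, not from the corpus):
wordStem(c("learning", "models", "working")) # "learn" "model" "work"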
We should remove additional whitespace caused by all our transformations.
doc.corpus <- tm_map(doc.corpus, stripWhitespace)
Now we create a term document matrix.
tdm = TermDocumentMatrix(doc.corpus)
tdm
## <<TermDocumentMatrix (terms: 14731, documents: 3)>>
## Non-/sparse entries: 21325/22868
## Sparsity : 52%
## Maximal term length: 89
## Weighting : term frequency (tf)
Let’s inspect some elements in the index. (Commented out due to large output)
#inspect(tdm[1:3,1:3])
We can compute the transpose of this matrix.
dtm <- DocumentTermMatrix(doc.corpus)
Inspect some elements in the transpose matrix. (commented out due to large output)
#inspect(dtm[1:3,1:3])
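A quick sanity check (sketch) that dtm really is the transpose of tdm:
identical(dim(dtm), rev(dim(tdm))) # TRUE: 3 documents by 14731 terms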
There seems to be no indication that the transpose is better than the original, so we will keep working with tdm.
Let’s find the most frequent words.
findFreqTerms(tdm, 2000)
## [1] "busi" "data" "develop" "experi" "scienc" "team" "will"
## [8] "work"
We can also do some word association. How do the words in this corpus associate with the term “data”? Let’s set our correlation threshold at 0.99.
findAssocs(tdm, "data", 0.99)
## $data
## 000 3rd algorithms amazon austin
## 1.00 1.00 1.00 1.00 1.00
## auto avail bound chief cluster
## 1.00 1.00 1.00 1.00 1.00
## colorado company concurr constant defens
## 1.00 1.00 1.00 1.00 1.00
## depart deriv diagram eleg element
## 1.00 1.00 1.00 1.00 1.00
## era exceed excel expans experienc
## 1.00 1.00 1.00 1.00 1.00
## export exposur famili five floor
## 1.00 1.00 1.00 1.00 1.00
## follow four fulfil greater greatest
## 1.00 1.00 1.00 1.00 1.00
## holist immedi index industri invest
## 1.00 1.00 1.00 1.00 1.00
## investor issu knowledge latenc layout
## 1.00 1.00 1.00 1.00 1.00
## life lift logic march mentor
## 1.00 1.00 1.00 1.00 1.00
## microsoft minim minor modeling mongodb
## 1.00 1.00 1.00 1.00 1.00
## occasion occur overal paid parti
## 1.00 1.00 1.00 1.00 1.00
## petabyt placement pleas position postgr
## 1.00 1.00 1.00 1.00 1.00
## practic princip produc profil progress
## 1.00 1.00 1.00 1.00 1.00
## promot proof pyspark quality queri
## 1.00 1.00 1.00 1.00 1.00
## reassign reduc region requisit respect
## 1.00 1.00 1.00 1.00 1.00
## review routin sale senior seven
## 1.00 1.00 1.00 1.00 1.00
## site snowflak south specialti substitut
## 1.00 1.00 1.00 1.00 1.00
## suitabl throughput timelin title tools
## 1.00 1.00 1.00 1.00 1.00
## understand view virtual vulner women
## 1.00 1.00 1.00 1.00 1.00
## workshop 2005 2006 3nf 5pm
## 1.00 1.00 1.00 1.00 1.00
## ace agent ansi apex arm
## 1.00 1.00 1.00 1.00 1.00
## automation aws backlog bottl bottleneck
## 1.00 1.00 1.00 1.00 1.00
## brooklyn calendar calib clusters converg
## 1.00 1.00 1.00 1.00 1.00
## cube ddl deem diploma disclaimer
## 1.00 1.00 1.00 1.00 1.00
## disconnect disk dress duplic engineer
## 1.00 1.00 1.00 1.00 1.00
## exhaust frameworks freelanc gen graviti
## 1.00 1.00 1.00 1.00 1.00
## guard harass holder inter intim
## 1.00 1.00 1.00 1.00 1.00
## japan kimbal louisiana martin matricul
## 1.00 1.00 1.00 1.00 1.00
## meal minneapoli monet mpp nashvill
## 1.00 1.00 1.00 1.00 1.00
## needs norm nuclear obie office
## 1.00 1.00 1.00 1.00 1.00
## oracle others pension pharma poc
## 1.00 1.00 1.00 1.00 1.00
## polic powerbi proficient quest ration
## 1.00 1.00 1.00 1.00 1.00
## reader redund repositori rethink reusabl
## 1.00 1.00 1.00 1.00 1.00
## robert seattl shore sites situations
## 1.00 1.00 1.00 1.00 1.00
## solving steer tabular tea toad
## 1.00 1.00 1.00 1.00 1.00
## upstream urban users viabil wednesday
## 1.00 1.00 1.00 1.00 1.00
## winner workspac writer advoc bash
## 1.00 1.00 1.00 0.99 0.99
## big comfort confirm conflict contact
## 0.99 0.99 0.99 0.99 0.99
## cybersecur extens fortun frequent grown
## 0.99 0.99 0.99 0.99 0.99
## guid gym internet period pound
## 0.99 0.99 0.99 0.99 0.99
## pressur purchas redshift represent req
## 0.99 0.99 0.99 0.99 0.99
## schema sick stabl staf studio
## 0.99 0.99 0.99 0.99 0.99
## subject substanti tight web year
## 0.99 0.99 0.99 0.99 0.99
## air athena bigqueri bug category
## 0.99 0.99 0.99 0.99 0.99
## certifications concepts culture demo dynamodb
## 0.99 0.99 0.99 0.99 0.99
## elt exp h1b horizont hybrid
## 0.99 0.99 0.99 0.99 0.99
## mart metro microstrategi mutual off
## 0.99 0.99 0.99 0.99 0.99
## parser patch prem quickly reus
## 0.99 0.99 0.99 0.99 0.99
## shave slas solicit southern ssis
## 0.99 0.99 0.99 0.99 0.99
## uncommon value wrangler
## 0.99 0.99 0.99
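Under the hood, findAssocs reports the Pearson correlation between two terms’ count vectors across the three documents. A minimal sketch verifying this for one of the terms listed above (“big”, from the 0.99 group):
m <- as.matrix(tdm)
cor(m["data", ], m["big", ]) # roughly 0.99, matching findAssocs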
We can see there are sparse terms, i.e., terms that do not appear in every document. Let’s remove them.
tdm.common = removeSparseTerms(tdm, 0.1)
#compare dimensions
dim(tdm);dim(tdm.common)
## [1] 14731 3
## [1] 2316 3
Let’s inspect our reduced matrix. (Commented out due to large output)
#inspect(tdm.common[1:3,1:3])
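With a sparsity threshold of 0.1 and only three documents, a retained term must appear in all three. A quick sanity check (sketch):
all(rowSums(as.matrix(tdm.common) > 0) == 3) # should be TRUE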
To visualize the contents of our newly reduced matrix later on, we first convert it to an ordinary dense matrix.
library(slam)
## Warning: package 'slam' was built under R version 3.4.3
tdm.dense <- as.matrix(tdm.common)
#tdm.dense
Convert the matrix to a tidy format.
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.4.3
tdm.dense = melt(tdm.dense, value.name = "count")
#head(tdm.dense)
We now have a long-format data frame, containing the information from our cleaned term-document matrix, that can be stored in a relational database.
tdm.dense.df<-data.frame(tdm.dense)
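For illustration, here is a hedged sketch of how this frame could be written to a relational database, assuming the DBI and RSQLite packages are available (the file name corpus.sqlite is hypothetical):
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "corpus.sqlite") # hypothetical database file
dbWriteTable(con, "term_counts", tdm.dense.df) # one row per (term, document) pair
dbDisconnect(con)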
Let’s continue on with our investigation. What do the term frequencies in our cleaned matrix look like? Here are the first 50 terms (note that head() lists them alphabetically, not by frequency):
freq=rowSums(as.matrix(tdm.common))
head(freq,50)
## 000 100 10s 120 150 200
## 107 101 5 10 10 18
## 2007 2008 2011 2014 2015 2016
## 6 5 4 10 20 29
## 2017 2018 21st 300 3rd 500
## 31 47 5 10 15 44
## 700 800 abil abl absolut abstract
## 11 9 794 406 47 19
## academ academia acceler accept access accommod
## 70 60 27 82 196 65
## accomplish accord account accredit accur accuraci
## 35 54 110 32 65 56
## achiev acknowledg acquir acquisit across act
## 141 4 39 49 430 76
## action activ actual actuari acumen adapt
## 178 221 238 17 20 61
## add addit
## 82 162
How about the last 50 terms?
tail(freq,50)
## what whatev wherev whether white whole wholli whose
## 18 65 8 81 8 67 6 12
## wid wide wider will willing win wish within
## 32 111 10 2132 31 59 34 356
## without woman women won wonder word work worker
## 309 8 25 3 56 88 3308 23
## workflow workforc workload workplac workshop world worldwid worth
## 68 38 17 54 11 418 25 93
## wrangl write written www xml yarn year years
## 27 267 224 162 30 17 1405 41
## yes yet yield york youd youll young your
## 89 86 16 50 10 139 35 59
## youv zero
## 20 18
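Since head() and tail() walk the terms alphabetically, to rank terms by count we sort first (sketch):
head(sort(freq, decreasing = TRUE), 10) # top 10 terms by raw count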
Let’s see if we can get a better story by applying tf-idf weighting.
tdm_B = TermDocumentMatrix(doc.corpus,
control = list(weighting = weightTfIdf,
stopwords = 'english',
removePunctuation = T,
removeNumbers = T,
stemming = T))
tdm_B
## <<TermDocumentMatrix (terms: 12450, documents: 3)>>
## Non-/sparse entries: 11894/25456
## Sparsity : 68%
## Maximal term length: 83
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
Now let’s inspect the frequencies based on tf-idf weighting.
freq2=rowSums(as.matrix(tdm_B))
head(freq2,10);tail(freq2,10)
## aaa aab aad aap aath
## 1.447296e-05 1.217180e-05 2.652169e-05 3.651539e-05 1.217180e-05
## aawesom aback abandon abbott abbottcor
## 1.447296e-05 1.447296e-05 2.581298e-05 4.868718e-05 1.217180e-05
## zna zone zonesif zoo zookeep
## 2.652169e-05 4.297810e-05 1.217180e-05 2.894591e-05 9.737436e-05
## zookeeperjson zoomdata zumiez zuora zurich
## 1.217180e-05 1.217180e-05 1.217180e-05 2.434359e-05 1.326084e-04
Let’s plot the frequencies.
plot(sort(freq2, decreasing = T),col="blue",main="Word TF-IDF frequencies", xlab="TF-IDF-based rank", ylab = "TF-IDF")
The weights follow the expected pattern for tf-idf: a long-tailed distribution in which a handful of terms carry high weights while the vast majority are near zero.
Let’s plot the highest-weighted terms.
high.freq=tail(sort(freq2),n=10)
hfp.df=as.data.frame(sort(high.freq))
hfp.df$names <- rownames(hfp.df)
ggplot(hfp.df, aes(reorder(names,high.freq), high.freq)) +
geom_bar(stat="identity") + coord_flip() +
xlab("Terms") + ylab("TF-IDF weight") +
ggtitle("Top terms by TF-IDF weight")
Based on tf-idf, we are not told a better story. We cannot conclude that tf-idf is a better weighting for revealing the top data science skills. I can only speculate that this is due to the inverse relationship in tf-idf: higher weight is assigned to terms that are “rare” across documents. The skills and education credentials are not “rare” words in our corpus, since they are mentioned in all three documents, which would explain why they were assigned low tf-idf weights.
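To make the inverse relationship concrete, here is a minimal sketch of the idf component used by tm’s weightTfIdf (log base 2 of the document ratio). A term that appears in all three documents gets idf = 0, wiping out its weight no matter how often it occurs:
idf <- function(doc_freq, N = 3) log2(N / doc_freq) # idf as in weightTfIdf
idf(3) # 0: ubiquitous terms such as "data" vanish
idf(1) # ~1.58: terms confined to a single document are boosted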
Let’s compare this to the frequencies from our original term-document matrix.
#use freq
high.freq=tail(sort(freq),n=10)
hfp.df=as.data.frame(sort(high.freq))
hfp.df$names <- rownames(hfp.df)
ggplot(hfp.df, aes(reorder(names,high.freq), high.freq)) +
geom_bar(stat="identity") + coord_flip() +
xlab("Terms") + ylab("Frequency") +
ggtitle("Term frequencies")
Let’s put the information from the frequencies into an all-inclusive word cloud.
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.4.4
## Loading required package: RColorBrewer
wordcloud(tdm.dense.df$Terms, tdm.dense.df$count, random.order=FALSE, max.words=100, colors=brewer.pal(8, "Dark2"))
We use a word network and topic analysis to get more insight.
library(topicmodels)
## Warning: package 'topicmodels' was built under R version 3.4.4
#word network of frequent terms
freq_terms <- findFreqTerms(tdm, 2000) # terms appearing at least 2000 times
plot(tdm, term = freq_terms, corThreshold = 0.1, weighting = T) # link terms with correlation >= 0.1 (requires Rgraphviz)
#topic analysis
dtm <- as.DocumentTermMatrix(tdm)
lda <- LDA(dtm, k = 10) # get 10 topics
term <- terms(lda, 5) # get first 5 terms of every topic
(term <- apply(term, MARGIN = 2, paste, collapse = ", "))
## Topic 1
## "data, work, experi, learn, develop"
## Topic 2
## "data, peopl, work, will, use"
## Topic 3
## "data, analyt, statist, engin, use"
## Topic 4
## "data, engin, requir, develop, manag"
## Topic 5
## "get, can, learn, scienc, scientist"
## Topic 6
## "data, manag, team, work, busi"
## Topic 7
## "data, experi, team, will, analyt"
## Topic 8
## "data, experi, develop, design, busi"
## Topic 9
## "requir, data, experi, work, scienc"
## Topic 10
## "experi, data, build, work, support"
The plain counts of words tell a better story. We see words such as experience, team, and learn. We can infer that data scientist positions require experience, working on a team, and continuous learning. The word data is the top result; perhaps data scientists should have a good understanding of the data they are working with. The word cloud features words such as develop, programming, data, and experience, among others. We also see some of the skills that popped up in the word clouds pertaining to each individual job search site. The inclusive word cloud based on the job descriptions appears to capture the top soft skills that were not captured by the individual word clouds. We also include a topic analysis using LDA. Based on our combined efforts, we can conclude that some important soft skills are: management, learning, experience, teamwork, and understanding.