Text Mining - Data Scientist

Kuldeep Mahani

1 February 2019

Introduction

Read Twitter data and perform preliminary cleaning
Build a corpus, specifying the source as a character vector
Convert to lowercase
## what are the most tedious tasks you have to do as a data
## analyst and data scientist im looking for ideas to build
## new t
Remove @usernames
## what are the most tedious tasks you have to do as a data
## analyst and data scientist im looking for ideas to build
## new t
Remove everything except English letters and spaces
## what are the most tedious tasks you have to do as a data
## analyst and data scientist im looking for ideas to build
## new t
Remove stopwords
## tedious tasks analyst im looking ideas build new t
Remove single-letter words
## tedious tasks analyst im looking ideas build new t
Remove extra whitespace
## tedious tasks analyst im looking ideas build new t
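The cleaning steps above can be sketched as a single function. The report does not show its own code, so this is a minimal Python illustration under assumptions: the name `clean_tweet`, the regular expressions, and the small stopword set are all hypothetical. The sample output above suggests that "data" and "scientist" were dropped along with standard stopwords, so they are included here for illustration only.

```python
import re

# Illustrative stopword subset (assumption); the report's actual list is not shown.
# "data" and "scientist" are included because the output above appears to drop them.
STOPWORDS = {"what", "are", "the", "most", "you", "have", "to", "do", "as", "a",
             "and", "for", "data", "scientist"}

def clean_tweet(text, stopwords=STOPWORDS):
    """Apply the cleaning steps in the order listed in the report."""
    text = text.lower()                       # convert to lowercase
    text = re.sub(r"@\w+", "", text)          # remove @usernames
    text = re.sub(r"[^a-z ]", "", text)       # keep only English letters and spaces
    words = [w for w in text.split() if w not in stopwords]  # remove stopwords
    words = [w for w in words if len(w) > 1]  # remove single-letter words
    return " ".join(words)                    # rejoin with single spaces

sample = "@bob What are the MOST tedious tasks you have to do as a Data Analyst?  I'm looking!"
print(clean_tweet(sample))
```

Deleting punctuation outright, rather than replacing it with a space, is what turns "I'm" into "im", which matches the "im" visible in the report's output.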
Keep a copy of “myCorpus” for stem completion later
Stem the words in the corpus
## tedious task analyst im look idea build new t
Function to correct and complete the text after stemming
Apply stem completion and display a tweet with the completed and corrected text
## read book another interest post hajir like libya still
## happily cur
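Stem completion maps each stem back to a full word taken from the unstemmed copy of the corpus. The report's own completion function is not shown, so the following is only a sketch of one common strategy: pick the most frequent dictionary word that starts with the stem. The function name `stem_completion` and the tie-breaking rule are assumptions.

```python
from collections import Counter

def stem_completion(stems, dictionary_words):
    """Complete each stem to the most frequent dictionary word that starts with it."""
    freq = Counter(dictionary_words)
    completed = []
    for stem in stems:
        candidates = [w for w in freq if w.startswith(stem)]
        if candidates:
            # Highest frequency wins; ties go to the shorter, then alphabetically first word.
            completed.append(min(candidates, key=lambda w: (-freq[w], len(w), w)))
        else:
            completed.append(stem)  # no match in the dictionary: keep the stem
    return completed

# Words from the unstemmed copy of the corpus (toy data).
dictionary = "tedious tasks analyst looking ideas build new tasks looking look".split()
stems = "tedious task analyst look idea build new".split()
print(stem_completion(stems, dictionary))
```

Prefix matching can pick a wrong word when several words share a stem, which is why a manual replacement step for specific words can still be needed afterwards.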
Replace incorrectly completed words with the proper ones
Create a term-document matrix
## <<TermDocumentMatrix (terms: 7294, documents: 4438)>>
## Non-/sparse entries: 33859/32336913
## Sparsity           : 100%
## Maximal term length: 50
## Weighting          : term frequency (tf)
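The summary above reports 100% sparsity, which is a rounded figure: 7294 terms times 4438 documents gives 32,370,772 cells, of which 33,859 are non-zero. The sketch below builds a small term-document matrix and computes sparsity the same way. The function names and the dictionary-based storage are assumptions made for illustration.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return {term: {doc_index: count}}, storing only non-zero entries."""
    tdm = {}
    for j, doc in enumerate(docs):
        for term, count in Counter(doc.split()).items():
            tdm.setdefault(term, {})[j] = count
    return tdm

def sparsity(tdm, n_docs):
    """Fraction of cells in the full terms x documents grid that are zero."""
    total = len(tdm) * n_docs
    non_zero = sum(len(row) for row in tdm.values())
    return (total - non_zero) / total

docs = ["tedious task analyst", "job senior engin", "job hire team job"]
tdm = term_document_matrix(docs)
print(len(tdm), "terms, sparsity", sparsity(tdm, len(docs)))

# Consistency check on the figures reported above.
terms, documents = 7294, 4438
non_sparse, sparse = 33859, 32336913
assert terms * documents == non_sparse + sparse
print(sparse / (terms * documents))  # just under 1, hence the rounded "100%"
```

Storing only non-zero counts is the natural choice here, since roughly one cell in a thousand is populated.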
Frequency analysis
##  [1] "ai"            "analyst"       "analytics"     "applicable"   
##  [5] "become"        "best"          "bigdata"       "bio"          
##  [9] "busi"          "ca"            "can"           "career"       
## [13] "certificate"   "check"         "chief"         "click"        
## [17] "climat"        "compani"       "comput"        "develop"      
## [21] "dont"          "engin"         "experi"        "find"         
## [25] "good"          "great"         "group"         "gt"           
## [29] "help"          "hire"          "ibm"           "im"           
## [33] "interest"      "iot"           "job"           "join"         
## [37] "just"          "know"          "latest"        "lead"         
## [41] "learn"         "like"          "link"          "look"         
## [45] "machine"       "machinelearn"  "make"          "need"         
## [49] "new"           "now"           "one"           "open"         
## [53] "opportuni"     "posit"         "python"        "read"         
## [57] "research"      "role"          "say"           "senior"       
## [61] "skill"         "take"          "team"          "technological"
## [65] "think"         "time"          "today"         "top"          
## [69] "us"            "use"           "want"          "will"         
## [73] "work"          "yourbe"
##  [1] "ai"          "analytics"   "applicable"  "become"      "busi"       
##  [6] "can"         "career"      "certificate" "check"       "click"      
## [11] "compani"     "develop"     "dont"        "engin"       "great"      
## [16] "hire"        "im"          "interest"    "job"         "join"       
## [21] "just"        "know"        "lead"        "learn"       "like"       
## [26] "link"        "look"        "machine"     "make"        "need"       
## [31] "new"         "one"         "open"        "research"    "role"       
## [36] "senior"      "skill"       "team"        "us"          "use"        
## [41] "want"        "will"        "work"        "yourbe"
##  [1] "ai"         "analytics"  "applicable" "become"     "can"       
##  [6] "compani"    "dont"       "engin"      "hire"       "im"        
## [11] "job"        "join"       "know"       "lead"       "learn"     
## [16] "like"       "look"       "need"       "new"        "one"       
## [21] "open"       "research"   "role"       "senior"     "skill"     
## [26] "team"       "use"        "want"       "will"       "work"
##  [1] "ai"         "analytics"  "applicable" "become"     "can"       
##  [6] "engin"      "hire"       "job"        "learn"      "like"      
## [11] "look"       "need"       "new"        "one"        "senior"    
## [16] "team"       "work"
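The four lists above shrink from 74 terms to 17, which is consistent with listing terms at successively higher minimum-frequency thresholds. The thresholds themselves are not shown in the report, so the values below are placeholders. This sketch, with the hypothetical name `frequent_terms`, shows the idea on toy data.

```python
from collections import Counter

def frequent_terms(docs, min_freq):
    """Return, in alphabetical order, the terms whose total count is at least min_freq."""
    totals = Counter(word for doc in docs for word in doc.split())
    return sorted(term for term, n in totals.items() if n >= min_freq)

docs = ["job hire team", "job learn ai", "ai job look", "learn look job"]
for threshold in (1, 2, 4):  # placeholder thresholds, not the report's
    print(threshold, frequent_terms(docs, threshold))
```

Each list is a subset of the previous one, as in the report's output.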
Plot the frequent terms

Calculate word frequencies, sort them by frequency, and set up the word cloud

Find terms associated with specific keywords in the tweets: certificate, research, python, analyst, nervous, academia, business
##            certificate
## ibm               0.53
## consortium        0.35
## google            0.34
## new               0.33
## cloud             0.31
## create            0.28
## launch            0.28
## close             0.25
## tough             0.25
## assess            0.22
## unveiled          0.21

##          research
## facebook     0.33
## grad         0.25
## phd          0.20

##               python
## clickhouse      0.35
## ide             0.35
## idedata         0.35
## integrationin   0.35
## rollout         0.35
## rstudio         0.35

##            analyst
## highpaying    0.26
## zuka          0.26
## perfect       0.23
## business      0.22

##                     nervous
## amatch                 0.63
## amd                    0.45
## internatio             0.45
## ipp                    0.45
## leadglobal             0.45
## mondelz                0.45
## postgraduate           0.45
## scientistindustrial    0.45
## external               0.32
## scientistintern        0.32
## even                   0.29
## applicable             0.28
## half                   0.26
## robert                 0.26

##                      academia
## boardkhan                0.43
## lis                      0.43
## lisjobs                  0.43
## yup                      0.43
## frustrate                0.35
## capgemini                0.30
## capital                  0.30
## coworking                0.30
## eventspace               0.30
## funders                  0.30
## georgetown               0.30
## lu                       0.30
## magnimind                0.30
## neighborhood             0.30
## pin                      0.30
## protec                   0.30
## resourcesdatascience     0.30
## rigorous                 0.30
## tobedatascientist        0.30

##            business
## highpaying     0.58
## frustrate      0.33
## analyst        0.22
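The scores in the tables above read as correlations between a keyword and other terms across documents. The report does not show how they were computed; the sketch below assumes Pearson correlation of per-document term counts, filtered by a lower limit and rounded to two decimals. The function names and the limit are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def find_associations(docs, keyword, min_corr):
    """Terms whose document-count vector correlates with the keyword's at >= min_corr."""
    tokens = [doc.split() for doc in docs]
    vocab = sorted({w for words in tokens for w in words})
    if keyword not in vocab:
        return {}
    counts = {t: [words.count(t) for words in tokens] for t in vocab}
    result = {}
    for term in vocab:
        if term == keyword:
            continue
        r = pearson(counts[keyword], counts[term])
        if r >= min_corr:
            result[term] = round(r, 2)
    # Strongest associations first.
    return dict(sorted(result.items(), key=lambda kv: -kv[1]))

docs = ["ibm certificate new", "ibm certificate cloud", "python job", "job hire"]
assoc = find_associations(docs, "certificate", min_corr=0.5)
print(assoc)
```

A high score for a rare term can come from a single tweet that happens to contain both words, so the long lists of equal scores above (for example 0.35 or 0.45) likely reflect co-occurrence in one or two documents rather than a broad pattern.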

Topic modelling to identify latent (hidden) topics using LDA
##                                        Topic 1 
##   "job, senior, engin, one, learn, work, role" 
##                                        Topic 2 
##      "new, look, work, will, know, ai, career" 
##                                        Topic 3 
## "job, learn, will, look, one, applicable, use" 
##                                        Topic 4 
##     "job, hire, look, need, open, learn, team" 
##                                        Topic 5 
##    "hire, ai, applicable, im, look, new, work"
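To show how LDA arrives at topic term lists like those above, here is a compact collapsed Gibbs sampler on toy data. It is only a sketch: the report does not state which implementation, inference method, or hyperparameters were used, so `lda_gibbs`, `top_terms`, `alpha`, `beta`, the iteration count, and the seed are all assumptions.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one {word: count} table per topic."""
    rng = random.Random(seed)
    tokens = [doc.split() for doc in docs]
    vocab = sorted({w for words in tokens for w in words})
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in tokens]            # topic counts per document
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics                            # tokens assigned per topic
    assignments = []
    for d, words in enumerate(tokens):                      # random initial assignment
        row = []
        for w in words:
            k = rng.randrange(n_topics)
            row.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(row)
    for _ in range(n_iter):
        for d, words in enumerate(tokens):
            for i, w in enumerate(words):
                k = assignments[d][i]                       # remove the token's current topic
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token (unnormalized).
                weights = [(doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                           / (topic_total[t] + n_vocab * beta) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k                       # record the resampled topic
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word

def top_terms(topic_word, n):
    """Comma-separated top-n terms per topic, most frequent first."""
    return [", ".join(sorted(c, key=lambda w: (-c[w], w))[:n]) for c in topic_word]

docs = ["job senior engin role", "job hire team open", "ai learn career new",
        "ai learn new look", "job hire senior role", "ai career learn look"]
topic_word = lda_gibbs(docs, n_topics=2)
for number, terms in enumerate(top_terms(topic_word, 3), start=1):
    print("Topic", number, ":", terms)
```

The same high-frequency words (such as "job" and "look") appear in several of the report's five topics, which often indicates that very common terms dominate; removing them or reducing the number of topics are typical remedies.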