Read the files as-is (i.e., we want to read the raw text):

news=readLines("en_US.news.txt")
## Warning in readLines("en_US.news.txt"): incomplete final line found on
## 'en_US.news.txt'
blogs=readLines("en_US.blogs.txt")
tweets=readLines("en_US.twitter.txt")
## Warning in readLines("en_US.twitter.txt"): line 167155 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 268547 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1274086 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1759032 appears to contain
## an embedded nul
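These warnings are harmless for this analysis, but they could be avoided. A minimal sketch, assuming the same file, that skips the embedded nul characters and declares the encoding explicitly:

# Optional: silence the warnings by skipping embedded nuls and reading as UTF-8
tweets <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)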

File sizes (in megabytes):

file.info("en_US.news.txt")$size/1024^2
## [1] 196.2775
file.info("en_US.blogs.txt")$size/1024^2
## [1] 200.4242
file.info("en_US.twitter.txt")$size/1024^2
## [1] 159.3641

Since each file is very large, we take a sample of 1,000 lines from each file and then combine the three samples into one master project sample for exploratory data analysis. Note that paste() below combines the samples element-wise, so each of the 1,000 lines of ProjectData contains one blog line, one news line and one tweet:

SampleTweets=sample(tweets, 1000)
SampleBlogs=sample(blogs, 1000)
SampleNews=sample(news, 1000)
ProjectData=paste(SampleBlogs,SampleNews,SampleTweets, sep=" ")
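Because sample() is random, every run of this report will select slightly different lines (and produce slightly different counts below). A minimal sketch for making the sample reproducible; the seed value here is arbitrary:

# Fix the random seed before sampling so the same lines are drawn on every run
set.seed(1234)
SampleTweets <- sample(tweets, 1000)
SampleBlogs  <- sample(blogs, 1000)
SampleNews   <- sample(news, 1000)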

Take a peek at the top and bottom of the master project file:

head(ProjectData,n=1)
## [1] "“If we lose one journalist today, another reckless crime is done to you because if you kill a journalist, you kill a world,” said Balabo. “We all just miss her laughter and her presence,” Hunter-Feeney reminisced, fighting back tears. “She loved life and loved her family. aww i wish u cld have gone and take pics 4 me'!!!!"
tail(ProjectData, n=1)
## [1] "“Yup!” \"If we learn that someone is not old enough to have a Google account, or we receive a report, we will investigate and take the appropriate action,\" says Google spokesman Jay Nancarrow. He adds that \"users first have a chance to demonstrate that they meet our age requirements. If they don't, we will close the account.\" Facebook and most other sites have similar policies. Yeah, that's right! ~RT : We suffer way too much from sh*t that never happens! 'get ready to be ok'! ;^)"

Number of characters in each line:

ProjectData_chars=nchar(ProjectData)

Longest line :

ProjectData[which(ProjectData_chars == max(ProjectData_chars))]
## [1] "\"13. In my view this is a case in which the claimants have given every impression of playing games in this action. I do not need to make a finding as to whether are not they actually were playing games but they have certainly given that very strong impression. The claimants have also given the very strong impression that they do not know what their case is and they have been reluctant to be pinned down, and have demonstrated a tendency from time to time to change their case.14. There is nothing inherently wrong with changing a case; it happens all the time in these courts. When it happens in the manner in which it happened in this case and with decreasing rather than increasing clarity as to what the case is, then that is a practice which must be stopped. It has been said more than once, particularly in the context of patent litigation, that it must be conducted efficiently, properly and with a proper regard to costs. All litigation must be conducted with a proper regard to the need to disclose what a case is so that the other party can understand it and meet it.15. In this respect the claimant has, in my view, fallen lamentably short of the standards required of patent litigants in this court. I am quite satisfied that the claimant has brought this application on itself and has brought itself close to having its claim struck out and, therefore, as a matter of principle the claimant ought to pay the costs of this application.16. I also have no doubt that the claimants should pay the costs of this application on an indemnity basis. The wilful, as it seems to me, refusal to disclose its case until asked point blank by a judge of this Division what the case was and their reliance on totally misconceived technicalities arising out of the patent, and a reliance on an assertion that the defendant really ought to ask the right questions, are, in my view, rather remarkable in the context of the litigation that I have seen. It is simply not the way in which litigation ought to be conducted and is plainly, in my view, within the category of conduct which falls to be penalized by an award of indemnity costs. I therefore order that the claimant pay the costs of this application on an indemnity basis\".Claimants always face the challenge of having to provide sufficient information to comply with the Civil Procedure Rules while striving as far as possible to “keep one’s powder dry” as to certain aspects of a claim. A claimant is not required to disclose full details of its entire case before trial, but it is clear in this case that the judge believed the claimant had overstepped the mark, penalising it by ordering indemnity costs. The moral of the story, says Kempner, is that a claimant, facing a request for clarification and/or further information from the defendant, must ensure it has provided proper details of its claim before it can safely decline to provide the requested clarification and/or further information. Other hangouts: Paco's Tacos in Westchester makes tortillas fresh all day long, entrees less than $10; 6212 W. Manchester Ave.; (310) 645-8692. Coffee Co. serves good affordable breakfasts, crepes and home-style food; 8751 La Tijera Blvd., Westchester; (310) 645-7315. Need a good corner play here! Come on Yanks!!!!"

Shortest line :

ProjectData[which(ProjectData_chars == min(ProjectData_chars))]
## [1] "Canvas CurseKirby Canvas Curse 1 1/3 cups sugar i've been blessed!"

Number of lines in the master sample:

length(ProjectData)
## [1] 1000

Length of the longest line (in characters):

max(ProjectData_chars)
## [1] 3278

In the data set, if you divide the number of lines in which the word “love” occurs by the number of lines in which the word “hate” occurs (all lowercase), about what ratio do you get?

love=sum(grepl(pattern = "love", x = ProjectData))
love
## [1] 105
hate=sum(grepl(pattern = "hate", x = ProjectData))
hate
## [1] 27
lovehateratio=love/hate
lovehateratio
## [1] 3.888889
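One caveat: grepl() matches substrings, so lines containing “loved”, “lovely”, “hated” and so on are counted as well. A small sketch, assuming the same ProjectData object, that counts whole-word matches only:

# Count lines containing "love"/"hate" as whole words (excluding e.g. "loved", "hated")
love_word <- sum(grepl("\\blove\\b", ProjectData, perl = TRUE))
hate_word <- sum(grepl("\\bhate\\b", ProjectData, perl = TRUE))
love_word / hate_word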

Basic counting:

Load the stringr package:

library(stringr)

Count the total number of occurrences. Unlike grepl() above, which counts lines containing at least one match, str_count() counts every match within each line, which is why the totals (and the resulting ratio) can differ from the line counts above:

Love=str_count(ProjectData, "love")
sum(Love)
## [1] 127
Hate=str_count(ProjectData, "hate")
sum(Hate)
## [1] 27
LoveHateRatio=sum(Love)/sum(Hate)
LoveHateRatio
## [1] 4.703704
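These counts are also case-sensitive. A small sketch, assuming a recent version of stringr (where the regex() modifier with ignore_case is available), that counts “Love”, “LOVE”, etc. as well:

# Case-insensitive totals; still counts substrings such as "loved"
LoveAll <- str_count(ProjectData, regex("love", ignore_case = TRUE))
HateAll <- str_count(ProjectData, regex("hate", ignore_case = TRUE))
sum(LoveAll) / sum(HateAll)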

Loading required packages:

library(NLP)
library(tm)
library(SnowballC)

Building the corpus:

ProjectCorpus=Corpus(VectorSource(ProjectData))

Preparing the corpus:

We generally need to perform some pre-processing of the text data to prepare it for analysis. Example transformations include converting the text to lower case, removing numbers and punctuation, removing stop words, stemming, and identifying synonyms. The basic transformations are all available within tm.

ProjectCorpus
## <<VCorpus (documents: 1000, metadata (corpus/indexed): 0/0)>>

Exploring the corpus:

class(ProjectCorpus)
## [1] "VCorpus" "Corpus"
class(ProjectCorpus[[1]])
## [1] "PlainTextDocument" "TextDocument"
inspect(ProjectCorpus[1])
## <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
## 
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## “If we lose one journalist today, another reckless crime is done to you because if you kill a journalist, you kill a world,” said Balabo. “We all just miss her laughter and her presence,” Hunter-Feeney reminisced, fighting back tears. “She loved life and loved her family. aww i wish u cld have gone and take pics 4 me'!!!!

The function tm_map() is used to apply one of these transformations across all documents within a corpus. Other transformations can be implemented with ordinary R functions wrapped in content_transformer() to create a function that can be passed to tm_map().

getTransformations()
## [1] "removeNumbers"     "removePunctuation" "removeWords"      
## [4] "stemDocument"      "stripWhitespace"

In the following steps we apply each of the transformations, one by one, to remove unwanted characters from the text.

ProjectCorpus1 <- tm_map(ProjectCorpus, tolower)
ProjectCorpus1[[1]]
## [1] "“if we lose one journalist today, another reckless crime is done to you because if you kill a journalist, you kill a world,” said balabo. “we all just miss her laughter and her presence,” hunter-feeney reminisced, fighting back tears. “she loved life and loved her family. aww i wish u cld have gone and take pics 4 me'!!!!"
ProjectCorpus2 <- tm_map(ProjectCorpus1, PlainTextDocument)
ProjectCorpus2[[1]]
## <<PlainTextDocument (metadata: 7)>>
## “if we lose one journalist today, another reckless crime is done to you because if you kill a journalist, you kill a world,” said balabo. “we all just miss her laughter and her presence,” hunter-feeney reminisced, fighting back tears. “she loved life and loved her family. aww i wish u cld have gone and take pics 4 me'!!!!
ProjectCorpus3 <- tm_map(ProjectCorpus2, removePunctuation)
ProjectCorpus3[[1]]
## <<PlainTextDocument (metadata: 7)>>
## “if we lose one journalist today another reckless crime is done to you because if you kill a journalist you kill a world” said balabo “we all just miss her laughter and her presence” hunterfeeney reminisced fighting back tears “she loved life and loved her family aww i wish u cld have gone and take pics 4 me
stopwords("english")
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"
ProjectCorpus4 <- tm_map(ProjectCorpus3, removeWords, stopwords("english"))
ProjectCorpus4[[1]]
## <<PlainTextDocument (metadata: 7)>>
## “  lose one journalist today another reckless crime  done      kill  journalist  kill  world” said balabo “  just miss  laughter   presence” hunterfeeney reminisced fighting back tears “ loved life  loved  family aww  wish u cld  gone  take pics 4
ProjectCorpus5 <- tm_map(ProjectCorpus4, stemDocument)
ProjectCorpus5[[1]]
## <<PlainTextDocument (metadata: 7)>>
## “  lose one journalist today anoth reckless crime  done      kill  journalist  kill  world” said balabo “  just miss  laughter   presence” hunterfeeney reminisc fight back tear “ love life  love  famili aww  wish u cld  gone  take pic 4
ProjectCorpus6 <- tm_map(ProjectCorpus5, removeNumbers)
ProjectCorpus6[[1]]
## <<PlainTextDocument (metadata: 7)>>
## “  lose one journalist today anoth reckless crime  done      kill  journalist  kill  world” said balabo “  just miss  laughter   presence” hunterfeeney reminisc fight back tear “ love life  love  famili aww  wish u cld  gone  take pic
ProjectCorpus7 <- tm_map(ProjectCorpus6, stripWhitespace)
ProjectCorpus7[[1]]
## <<PlainTextDocument (metadata: 7)>>
## “ lose one journalist today anoth reckless crime done kill journalist kill world” said balabo “ just miss laughter presence” hunterfeeney reminisc fight back tear “ love life love famili aww wish u cld gone take pic
ProjectCorpus7[[100]]
## <<PlainTextDocument (metadata: 7)>>
## law bad guy can persuad run program comput it’ comput anymor last video made snort laughter scene show workahol guy shroom offic tri captur guy web ethernet cabl look forward show weekendth even wo fact horrif stori
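A note on the tolower step above: newer versions of tm expect base functions to be wrapped in content_transformer(), which keeps the documents as PlainTextDocument objects and makes the separate PlainTextDocument conversion unnecessary. A minimal sketch of the equivalent call:

# Lower-case the corpus while preserving the document class (newer tm versions)
ProjectCorpusLower <- tm_map(ProjectCorpus, content_transformer(tolower))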

Creating a Document Term Matrix :

A document term matrix is simply a matrix with documents as rows and terms as columns, where each cell counts how often a term occurs in a document. We use DocumentTermMatrix() to create the matrix:

ProjectDTM<- DocumentTermMatrix(ProjectCorpus7)
ProjectDTM
## <<DocumentTermMatrix (documents: 1000, terms: 10485)>>
## Non-/sparse entries: 43439/10441561
## Sparsity           : 100%
## Maximal term length: 32
## Weighting          : term frequency (tf)
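Alternatively, much of the clean-up could be requested at matrix-construction time through the control argument of DocumentTermMatrix(); the settings below are an illustrative sketch, not the values used in this report:

# Build a DTM directly from the raw corpus, applying the transformations on the fly
ProjectDTM2 <- DocumentTermMatrix(ProjectCorpus,
                                  control = list(tolower           = TRUE,
                                                 removePunctuation = TRUE,
                                                 removeNumbers     = TRUE,
                                                 stopwords         = TRUE,
                                                 stemming          = TRUE,
                                                 wordLengths       = c(3, Inf)))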

We can inspect the document term matrix using inspect(). Here, to avoid too much output, we inspect only a small subset of documents and terms. (The document names show as character(0), most likely because the earlier PlainTextDocument conversion dropped the original document IDs.)

inspect(ProjectDTM[1:5,900:905])
## <<DocumentTermMatrix (documents: 5, terms: 6)>>
## Non-/sparse entries: 0/30
## Sparsity           : 100%
## Maximal term length: 10
## Weighting          : term frequency (tf)
## 
##               Terms
## Docs           beachfront bead bean bear beard bearer
##   character(0)          0    0    0    0     0      0
##   character(0)          0    0    0    0     0      0
##   character(0)          0    0    0    0     0      0
##   character(0)          0    0    0    0     0      0
##   character(0)          0    0    0    0     0      0

The document term matrix is in fact quite sparse (that is, mostly empty) and so it is actually stored in a much more compact representation internally. We can still get the row and column counts :

class(ProjectDTM)
## [1] "DocumentTermMatrix"    "simple_triplet_matrix"
dim(ProjectDTM)
## [1]  1000 10485
ProjectDTM
## <<DocumentTermMatrix (documents: 1000, terms: 10485)>>
## Non-/sparse entries: 43439/10441561
## Sparsity           : 100%
## Maximal term length: 32
## Weighting          : term frequency (tf)

The transpose is created using TermDocumentMatrix():

ProjectTDM <- TermDocumentMatrix(ProjectCorpus7)
ProjectTDM
## <<TermDocumentMatrix (terms: 10485, documents: 1000)>>
## Non-/sparse entries: 43439/10441561
## Sparsity           : 100%
## Maximal term length: 32
## Weighting          : term frequency (tf)

We can obtain the term frequencies as a vector by converting the document term matrix into an ordinary matrix and summing the column counts.

For a very large corpus, however, the size of the dense matrix can exceed R's limits. This manifests itself as an integer overflow error with a message like:

Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow

If this occurs, consider removing sparse terms from the document term matrix first.

Removing Sparse Terms :

We are often not interested in infrequent terms in our documents. Such “sparse” terms can be removed from the document term matrix quite easily using removeSparseTerms():

dim(ProjectDTM)
## [1]  1000 10485
ProjectSparse <- removeSparseTerms(ProjectDTM, 0.98)
ProjectSparse
## <<DocumentTermMatrix (documents: 1000, terms: 395)>>
## Non-/sparse entries: 17272/377728
## Sparsity           : 96%
## Maximal term length: 9
## Weighting          : term frequency (tf)
dim(ProjectSparse)
## [1] 1000  395
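The 0.98 threshold keeps only terms that appear in more than roughly 2% of the 1,000 sampled documents (about 20 of them), which is consistent with the smallest total frequency of 21 seen below. A quick sketch to check the document frequency of the surviving terms:

# In how many of the 1,000 documents does each remaining term appear?
DocFreq <- colSums(as.matrix(ProjectSparse) > 0)
summary(DocFreq)   # the minimum should be around 20, matching the 0.98 cut-off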

Removing the sparse terms has eliminated most of the vocabulary. Now we can obtain the term frequencies as a vector by converting the reduced document term matrix into an ordinary matrix and summing the column counts. We can see the effect by looking at the terms we have left:

ProjectFrequency <- colSums(as.matrix(ProjectSparse))
ProjectFrequency 
##       –       —       “       abl    accord    actual       add 
##        55        82        82        25        25        37        40 
##     addit       age       ago     allow     along   alreadi      also 
##        22        26        29        32        41        22       135 
##     alway  american     anoth     anyth    appear      area    around 
##        56        27        53        29        22        26        71 
##       art       ask    author      away      back       bad    beauti 
##        39        58        22        41       114        34        26 
##     becom     begin    believ      best    better       big       bit 
##        38        25        29        60        57        57        40 
##      blog      book       boy     break     build      busi       buy 
##        30        64        23        28        34        45        22 
##      call      came       can      cant       car      care      case 
##        88        29       208        58        25        33        47 
##      caus    center     chanc     chang     check  children      citi 
##        28        28        22        71        25        31        56 
##     class cleveland     close     coach   collect     color      come 
##        23        30        36        30        25        27       112 
## communiti   compani   complet   continu   countri     coupl     cours 
##        22        40        30        36        28        30        22 
##     cover     creat    custom       cut       day     death     decid 
##        23        28        26        26       153        33        32 
##     didnt       die    differ  director   don’t      done      dont 
##        36        39        55        22        51        37        85 
##     drive     earli       eat      educ       end     enjoy    enough 
##        24        32        21        27        56        31        49 
##      even     event      ever     everi   everyon   everyth     excit 
##       110        29        33        53        39        23        27 
##    expect    experi      face      fact      fall    famili       fan 
##        27        23        29        28        31        55        25 
##       far   favorit      feel     field     figur      film     final 
##        27        22        85        23        26        42        43 
##      find      fire     first      five    follow      food      forc 
##        50        23       136        32        65        31        22 
##   forward     found      four      free    friday    friend      full 
##        24        47        29        30        22        58        26 
##       fun      game      gave       get      girl      give       god 
##        31        66        21       213        28        59        30 
##      good       got     great     group       guy      hand    happen 
##       133        82        87        36        33        43        51 
##     happi      hard      head      hear     heard     heart      help 
##        46        27        52        29        22        29        65 
##       hes      high   histori       hit      hold      home      hope 
##        30        62        24        32        23        75        35 
##      hour      hous     howev     i’m      idea    immedi    import 
##        45        60        25        35        32        21        21 
##    includ   increas    inform   instead  interest    involv      isnt 
##        74        25        32        27        31        28        23 
##      issu     it’       ive       job      john      just      keep 
##        25        74        28        48        26       252        65 
##       kid      kill      kind      know      larg      last      late 
##        39        27        34       117        36       104        24 
##     later      lead     learn     least      leav      left      less 
##        23        26        32        31        36        46        23 
##       let     level      life     light      like      line      list 
##        60        27        82        42       246        39        28 
##     littl      live       lol      long      look       lot      love 
##        83        66        28        59       126        70       134 
##      made      main      make       man     manag      mani       may 
##        63        23       173        50        31        72        59 
##      mayb      mean      meet     might      mind     minut      miss 
##        23        64        48        45        28        48        31 
##    monday     money     month      morn      move      movi      much 
##        21        36        46        32        41        21        89 
##     music      must      name    nation      near      need     never 
##        29        30        45        31        28       101        57 
##       new      next      nice     night       now    number     offer 
##       135        56        31        61       132        33        32 
##     offic    offici     often       old       one     onlin      open 
##        35        30        29        24       313        25        62 
##     order     other      park      part     parti      pass      past 
##        35        23        38        61        28        35        29 
##       pay     peopl   percent   perfect    person      pick    pictur 
##        25       131        26        26        46        31        39 
##      piec     place      plan      play     pleas     point     polic 
##        24        63        61        75        23        42        53 
##   possibl      post     power   practic    presid    pretti     price 
##        27        35        30        23        31        32        26 
##   probabl   problem   product   program   project    provid    public 
##        26        24        33        28        52        32        31 
##       put  question      quit      rais      read      real    realli 
##        48        33        25        23        52        26        92 
##    receiv    recent    record    releas    remain    rememb    report 
##        29        31        30        22        29        25        37 
##    result    return     right      road      roll      room       run 
##        29        24        95        31        23        49        61 
##      said  saturday       saw       say     scene    school    season 
##       304        23        23       124        25        64        38 
##    second       see      seem      seen      sens      serv    servic 
##        40       120        51        22        24        35        28 
##       set     sever     share      show      side      sinc       sit 
##        46        33        38        69        38        52        28 
##     sleep     small    someon    someth     sound   special     spend 
##        27        40        47        69        29        25        29 
##     start     state      stay      step     still      stop     store 
##       102        76        30        26        81        50        42 
##     stori    street   student    summer   support      sure      take 
##        49        33        30        31        31        40       114 
##      talk      team      tell     thank      that     thing     think 
##        42        52        35        72        64       113       139 
##    though   thought     three      time     today    togeth      told 
##        47        48        50       215        75        33        33 
##      took       top     train       tri      turn       two   univers 
##        48        34        29        69        32       103        30 
##       use      valu      view     visit      wait      walk      want 
##       119        22        28        28        41        35       133 
##       war     watch     water       way      week   weekend      well 
##        26        44        42       112        69        22        94 
##      went   whether     white     whole      will       win   without 
##        44        30        30        32       269        29        39 
##     woman    wonder      word      work     world     write      year 
##        25        29        51       144        75        40       186 
##       yes       yet      your 
##        35        38        39
table(ProjectFrequency)
## ProjectFrequency
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38 
##   6  17  20   9  17  16  12  19  20  17  18  14  11   4  10   7   3   6 
##  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  55  56  57 
##   8   7   4   6   2   2   4   5   4   6   3   4   4   5   3   3   4   3 
##  58  59  60  61  62  63  64  65  66  69  70  71  72  74  75  76  81  82 
##   3   3   3   4   2   2   4   3   2   4   1   2   2   2   4   1   1   4 
##  83  85  87  88  89  92  94  95 101 102 103 104 110 112 113 114 117 119 
##   1   2   1   1   1   1   1   1   1   1   1   1   1   2   1   2   1   1 
## 120 124 126 131 132 133 134 135 136 139 144 153 173 186 208 213 215 246 
##   1   1   1   1   1   2   1   2   1   1   1   1   1   1   1   1   1   1 
## 252 269 304 313 
##   1   1   1   1
length(ProjectFrequency)
## [1] 395
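As an aside, for a corpus too large for the as.matrix() conversion used above (the integer-overflow situation mentioned earlier), the same frequency vector can be computed directly on the sparse simple_triplet_matrix, for example with the slam package that tm already depends on. A minimal sketch:

library(slam)
# Column sums computed without expanding the sparse matrix into a dense one
ProjectFrequencySlam <- col_sums(ProjectSparse)
all.equal(ProjectFrequencySlam, ProjectFrequency)   # should be TRUE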

By ordering the frequencies we can list the most frequent terms and the least frequent terms:

ProjectOrder <- order(ProjectFrequency)
ProjectOrder
##   [1]  87 129 160 161 218 223   8  13  19  24  42  52  64  70  81 107 119
##  [18] 124 145 291 312 366 377  38  57  71  97 100 109 114 152 168 183 189
##  [35] 205 211 247 264 270 284 299 303 304  85 120 150 182 242 260 275 296
##  [52] 313   4   5  30  47  54  61 105 157 163 169 244 253 283 293 306 328
##  [69] 386   9  20  28  73  74 110 126 173 184 255 256 273 274 286 333 372
##  [86]  16  62  88  98  99 106 142 165 177 191 267 323  39  50  51  68  72
## [103] 102 131 167 171 196 199 215 229 250 277 315 322 367 368  10  18  31
## [120]  44  93 101 122 144 146 225 241 252 288 292 295 327 329 360 384 387
## [137]  36  58  60  66  69 123 133 148 226 240 269 290 332 339 364 380 381
## [154]  55  90 103 118 127 166 186 208 217 228 234 258 271 280 289 298 340
## [171] 341  11  77  86 116 151 159 164 185 221 238 272 279 362 382  48  76
## [188]  94 138 237 276 282 317 338 356 357  27  40 178 359 154 158 239 246
## [205] 251 268 314 346 370 393  59  67  78 137 180 187 219   6  83 294  29
## [222] 248 308 318 320 394  22  79  96 176 195 259 385 395   7  35  65 309
## [239] 324 342 391  12  25 222 369 111 193 265 336 344 374 112 139 373 379
## [256]  41 155 214 227 141 188 220 257 316  49 121 325 351 172 213 216 281
## [273] 352 358  91 300 337 113 207 335 353  82 140 311 388 143 278 285 321
## [290] 345  17  95 266   1  80 104  15  56  89 233  33  34 231  23  46 125
## [307] 132 200 210  32 156 190 235 249 262 301 149 245 204 261  37 212 307
## [324] 348 117 147 175 128 198 319 326 361 376 202  21  53 209 347 162 170
## [341] 153 263 355 390 331 334   2   3 135 192 197  84 108 136  43 224 287
## [358] 378 297 230 330 363 181  92  63 375 349  26 343 179 365 310 305 201
## [375] 254 236 134 371 203  14 232 115 350 389  75 206 392  45 130 354 194
## [392] 174 383 302 243

Least frequent terms :

ProjectFrequency[head(ProjectOrder)]
##    eat   gave immedi import monday   movi 
##     21     21     21     21     21     21

Most frequent terms :

ProjectFrequency[tail(ProjectOrder)]
## time like just will said  one 
##  215  246  252  269  304  313

These terms are much more likely to be of interest to us. Not surprisingly, given the mix of blog, news and Twitter text in the corpus, the most frequent terms are common everyday words such as “one”, “said”, “will”, “just”, “like” and “time”.

Distribution of Term Frequencies

Frequency of frequencies:

head(table(ProjectFrequency), 10)
## ProjectFrequency
## 21 22 23 24 25 26 27 28 29 30 
##  6 17 20  9 17 16 12 19 20 17
tail(table(ProjectFrequency), 10)
## ProjectFrequency
## 173 186 208 213 215 246 252 269 304 313 
##   1   1   1   1   1   1   1   1   1   1

So we can see that many terms occur only a handful of times, while just a few occur hundreds of times.

Identifying Frequent Items and Associations

One of the first things we often want to do is get an idea of the most frequent terms in the corpus. We use findFreqTerms() to do this. Here we limit the output to those terms that occur at least 200 times:

findFreqTerms(ProjectDTM,lowfreq=200)
## [1] "can"  "get"  "just" "like" "one"  "said" "time" "will"

So that only lists a few. We can get more of them by reducing the threshold:

findFreqTerms(ProjectDTM,lowfreq=100)
##  [1] "also"  "back"  "can"   "come"  "day"   "even"  "first" "get"  
##  [9] "good"  "just"  "know"  "last"  "like"  "look"  "love"  "make" 
## [17] "need"  "new"   "now"   "one"   "peopl" "said"  "say"   "see"  
## [25] "start" "take"  "thing" "think" "time"  "two"   "use"   "want" 
## [33] "way"   "will"  "work"  "year"

We can also find associations with a word, specifying a correlation limit:

findAssocs(ProjectDTM, "can", corlimit=0.2)
##         can
## compass 0.2
## dire    0.2
## disk    0.2
## skim    0.2
findAssocs(ProjectDTM, "like", corlimit=0.2)
##               like
## colobas       0.22
## contradictori 0.22
## durango       0.22
## evergreen     0.22
## granddaught   0.22
## maureen       0.22
## obrien        0.22
## seem          0.22
## sorti         0.22
## surrog        0.22
## firework      0.20
## vicki         0.20
findAssocs(ProjectDTM, "one", corlimit=0.2)
##                      one
## repli               0.35
## biggi               0.32
## bracelet            0.32
## circumcised”      0.32
## clerk               0.32
## gaudi               0.32
## hbo                 0.32
## teller              0.32
## that”             0.32
## transact            0.32
## lou                 0.30
## accumul             0.29
## husband             0.26
## dive                0.25
## sub                 0.25
## wrist               0.25
## yet                 0.24
## answer              0.23
## counter             0.23
## jewelri             0.23
## “i’ll           0.22
## angl                0.22
## canada              0.22
## exasper             0.22
## twist               0.21
## father              0.20
findAssocs(ProjectDTM, "time", corlimit=0.2)
##          time
## cow      0.23
## imped    0.23
## imper    0.23
## medicaid 0.23
## medicar  0.23
## sacr     0.23
## moral    0.22
## cost     0.21
findAssocs(ProjectDTM, "year", corlimit=0.2)
##                 year
## chili           0.33
## dutrow          0.33
## incredul        0.33
## manipul         0.33
## midterm         0.33
## project”      0.33
## sophomor        0.33
## stringent       0.33
## viva            0.33
## guidelin        0.29
## project         0.25
## derbi           0.22
## sketch          0.22
## snuck           0.22
## testament       0.22
## threw           0.22
## radar           0.21
## last            0.20
findAssocs(ProjectDTM, "will", corlimit=0.2)
##                 will
## alcov           0.27
## banquett        0.27
## bookshelv       0.27
## cornic          0.27
## cozi            0.27
## deper           0.27
## firebox         0.27
## herringbon      0.27
## josj            0.27
## pillow          0.27
## ala             0.25
## capturethequeen 0.22
## demolish        0.22
## flatter         0.22
## freezer         0.22
## gambit          0.22
## gosl            0.22
## immeasur        0.22
## jacob           0.22
## mlk             0.22
## paz             0.22
## peacocki        0.22
## regrad          0.22
## restrip         0.22
## revamp          0.22
## selfassur       0.22
## selfpossess     0.22
## spill           0.22
## sugarscap       0.22
## upscal          0.22
findAssocs(ProjectDTM, "just", corlimit=0.2)
##           just
## cosign    0.22
## dopey     0.22
## heartless 0.22
## lend      0.22
## monkey    0.22
## murmur    0.22
## nil       0.22
## nurseget  0.22
## ohalloran 0.22
## overrid   0.22
## sedat     0.22
## teeni     0.22
## zak       0.22

If two words always appear together then the correlation would be 1.0 and if they never appear together the correlation would be 0.0. Thus the correlation is a measure of how closely associated the words are in the corpus.
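Under the hood, findAssocs() reports the Pearson correlation between the corresponding term columns of the document term matrix. A small sketch to verify this for one of the pairs reported above, assuming the terms "can" and "compass" are present in ProjectDTM:

# Correlation between the "can" and "compass" term columns of the DTM
DTMmat <- as.matrix(ProjectDTM)
cor(DTMmat[, "can"], DTMmat[, "compass"])   # should be close to the 0.2 reported above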

Correlation Plots

library(Rgraphviz) # Correlation plots.
## Loading required package: graph
## Loading required package: grid
library(graph)
library(grid)
plot(ProjectDTM,
     terms=findFreqTerms(ProjectDTM, lowfreq=120),
     corThreshold=0.11)
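Rgraphviz is distributed through Bioconductor rather than CRAN, so it must be installed separately. A sketch for current R installations (older set-ups used the biocLite() script instead):

# Install Rgraphviz from Bioconductor if it is not already available
if (!requireNamespace("Rgraphviz", quietly = TRUE)) {
    install.packages("BiocManager")
    BiocManager::install("Rgraphviz")
}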

Plotting Word Frequencies

We can generate a sorted frequency count of the words remaining in the document term matrix:

Frequency<- sort(colSums(as.matrix(ProjectSparse)), decreasing=TRUE)
head(Frequency, 100)
##    one   said   will   just   like   time    get    can   year   make 
##    313    304    269    252    246    215    213    208    186    173 
##    day   work  think  first   also    new   love   good   want    now 
##    153    144    139    136    135    135    134    133    133    132 
##  peopl   look    say    see    use   know   back   take  thing   come 
##    131    126    124    120    119    117    114    114    113    112 
##    way   even   last    two  start   need  right   well realli   much 
##    112    110    104    103    102    101     95     94     92     89 
##   call  great   dont   feel  littl    —    “    got   life  still 
##     88     87     85     85     83     82     82     82     82     81 
##  state   home   play  today  world includ  it’   mani  thank around 
##     76     75     75     75     75     74     74     72     72     71 
##  chang    lot   show someth    tri   week   game   live follow   help 
##     71     70     69     69     69     69     66     66     65     65 
##   keep   book   mean school   that   made  place   high   open  night 
##     65     64     64     64     64     63     63     62     62     61 
##   part   plan    run   best   hous    let   give   long    may    ask 
##     61     61     61     60     60     60     59     59     59     58 
##   cant friend better    big  never  alway   citi    end   next    – 
##     58     58     57     57     57     56     56     56     56     55
WordFrequency <- data.frame(word=names(Frequency), freq=Frequency)
head(WordFrequency)
##      word freq
## one   one  313
## said said  304
## will will  269
## just just  252
## like like  246
## time time  215

We can then plot the frequency of those words that occur more than 150 times in the corpus:

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:NLP':
## 
##     annotate
library(qdap) # Quantitative discourse analysis of transcripts.
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## Loading required package: RColorBrewer
## 
## Attaching package: 'qdap'
## 
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, as.TermDocumentMatrix
## 
## The following object is masked from 'package:base':
## 
##     Filter
subset(WordFrequency, freq > 150) %>%
        ggplot(aes(word, freq)) +
        geom_bar(stat = "identity") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
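By default the bars are ordered alphabetically by word. A small sketch that orders them by frequency instead; the data are the same, only the x aesthetic changes:

# Plot the same words, but with bars sorted from most to least frequent
FrequentWords <- subset(WordFrequency, freq > 150)
ggplot(FrequentWords, aes(reorder(word, -freq), freq)) +
        geom_bar(stat = "identity") +
        labs(x = "word") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))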

A word cloud is an effective alternative for providing a quick visual overview of word frequencies in a corpus:

library(wordcloud)
library(RColorBrewer)
set.seed(1)
wordcloud(names(Frequency),Frequency,min.freq=100,colors=brewer.pal(6, "Dark2"),scale=c(3, .05))