Read the files as-is (i.e., we want the raw text):
news=readLines("en_US.news.txt")
## Warning in readLines("en_US.news.txt"): incomplete final line found on
## 'en_US.news.txt'
blogs=readLines("en_US.blogs.txt")
tweets=readLines("en_US.twitter.txt")
## Warning in readLines("en_US.twitter.txt"): line 167155 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 268547 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1274086 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1759032 appears to contain
## an embedded nul
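Neither warning is fatal: the news file simply lacks a final end-of-line, and the Twitter file contains a few embedded nul characters; the data are still read. If we wanted to silence the warnings, a minimal sketch (assuming a recent R version, where readLines() has the warn and skipNul arguments):
news=readLines("en_US.news.txt", warn=FALSE) # don't warn about the missing final EOL
tweets=readLines("en_US.twitter.txt", skipNul=TRUE, warn=FALSE) # skip embedded nuls instead of warning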
File sizes (in MB):
file.info("en_US.news.txt")$size/1024^2
## [1] 196.2775
file.info("en_US.blogs.txt")$size/1024^2
## [1] 200.4242
file.info("en_US.twitter.txt")$size/1024^2
## [1] 159.3641
Since each file is very large, we take a sample of 1000 lines from each file and then combine the three samples into one master project sample for exploratory data analysis. Note that paste() below joins the samples element-wise, so the master sample has 1000 lines, each containing one blog line, one news line and one tweet:
SampleTweets=sample(tweets, 1000)
SampleBlogs=sample(blogs, 1000)
SampleNews=sample(news, 1000)
ProjectData=paste(SampleBlogs,SampleNews,SampleTweets, sep=" ")
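Because sample() is random, the exact lines drawn change on every run. A minimal sketch, assuming we want a reproducible sample (the seed value is arbitrary):
set.seed(1234) # fix the random number generator so the same 1000 lines are drawn each time
SampleTweets=sample(tweets, 1000)
SampleBlogs=sample(blogs, 1000)
SampleNews=sample(news, 1000)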
Take a peek at the top and bottom of the master project file:
head(ProjectData,n=1)
## [1] "“If we lose one journalist today, another reckless crime is done to you because if you kill a journalist, you kill a world,†said Balabo. “We all just miss her laughter and her presence,†Hunter-Feeney reminisced, fighting back tears. “She loved life and loved her family. aww i wish u cld have gone and take pics 4 me'!!!!"
tail(ProjectData, n=1)
## [1] "“Yup!†\"If we learn that someone is not old enough to have a Google account, or we receive a report, we will investigate and take the appropriate action,\" says Google spokesman Jay Nancarrow. He adds that \"users first have a chance to demonstrate that they meet our age requirements. If they don't, we will close the account.\" Facebook and most other sites have similar policies. Yeah, that's right! ~RT : We suffer way too much from sh*t that never happens! 'get ready to be ok'! ;^)"
Number of characters in each line:
ProjectData_chars=nchar(ProjectData)
Longest line:
ProjectData[which(ProjectData_chars == max(ProjectData_chars))]
## [1] "\"13. In my view this is a case in which the claimants have given every impression of playing games in this action. I do not need to make a finding as to whether are not they actually were playing games but they have certainly given that very strong impression. The claimants have also given the very strong impression that they do not know what their case is and they have been reluctant to be pinned down, and have demonstrated a tendency from time to time to change their case.14. There is nothing inherently wrong with changing a case; it happens all the time in these courts. When it happens in the manner in which it happened in this case and with decreasing rather than increasing clarity as to what the case is, then that is a practice which must be stopped. It has been said more than once, particularly in the context of patent litigation, that it must be conducted efficiently, properly and with a proper regard to costs. All litigation must be conducted with a proper regard to the need to disclose what a case is so that the other party can understand it and meet it.15. In this respect the claimant has, in my view, fallen lamentably short of the standards required of patent litigants in this court. I am quite satisfied that the claimant has brought this application on itself and has brought itself close to having its claim struck out and, therefore, as a matter of principle the claimant ought to pay the costs of this application.16. I also have no doubt that the claimants should pay the costs of this application on an indemnity basis. The wilful, as it seems to me, refusal to disclose its case until asked point blank by a judge of this Division what the case was and their reliance on totally misconceived technicalities arising out of the patent, and a reliance on an assertion that the defendant really ought to ask the right questions, are, in my view, rather remarkable in the context of the litigation that I have seen. It is simply not the way in which litigation ought to be conducted and is plainly, in my view, within the category of conduct which falls to be penalized by an award of indemnity costs. I therefore order that the claimant pay the costs of this application on an indemnity basis\".Claimants always face the challenge of having to provide sufficient information to comply with the Civil Procedure Rules while striving as far as possible to “keep one’s powder dry†as to certain aspects of a claim. A claimant is not required to disclose full details of its entire case before trial, but it is clear in this case that the judge believed the claimant had overstepped the mark, penalising it by ordering indemnity costs. The moral of the story, says Kempner, is that a claimant, facing a request for clarification and/or further information from the defendant, must ensure it has provided proper details of its claim before it can safely decline to provide the requested clarification and/or further information. Other hangouts: Paco's Tacos in Westchester makes tortillas fresh all day long, entrees less than $10; 6212 W. Manchester Ave.; (310) 645-8692. Coffee Co. serves good affordable breakfasts, crepes and home-style food; 8751 La Tijera Blvd., Westchester; (310) 645-7315. Need a good corner play here! Come on Yanks!!!!"
Shortest line:
ProjectData[which(ProjectData_chars == min(ProjectData_chars))]
## [1] "Canvas CurseKirby Canvas Curse 1 1/3 cups sugar i've been blessed!"
Number of lines in the master sample:
length(ProjectData)
## [1] 1000
Length of the longest line (in characters):
max(ProjectData_chars)
## [1] 3278
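A quick way to see the whole distribution of line lengths rather than just the extremes (a supplementary check, not part of the original output):
summary(ProjectData_chars) # minimum, quartiles, mean and maximum characters per line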
In the data set, if you divide the number of lines in which the word "love" occurs by the number of lines in which the word "hate" occurs (all lowercase), about what value do you get?
love=sum(grepl(pattern = "love", x = ProjectData))
love
## [1] 105
hate=sum(grepl(pattern = "hate", x = ProjectData))
hate
## [1] 27
lovehateratio=love/hate
lovehateratio
## [1] 3.888889
Basic counting:
Load the stringr package:
library(stringr)
Count total occurrences (not just matching lines) with str_count():
Love=str_count(ProjectData, "love")
sum(Love)
## [1] 127
Hate=str_count(ProjectData, "hate")
sum(Hate)
## [1] 27
LoveHateRatio=sum(Love)/sum(Hate)
LoveHateRatio
## [1] 4.703704
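The two ratios differ because grepl() counts matching lines while str_count() counts every occurrence; both also match "love" and "hate" inside longer words such as "loved" or "hater". A minimal sketch, assuming we want whole-word matches only (using a standard regex word boundary):
LoveWord=sum(str_count(ProjectData, "\\blove\\b")) # whole word "love" only
HateWord=sum(str_count(ProjectData, "\\bhate\\b")) # whole word "hate" only
LoveWord/HateWord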
Loading required packages:
library(NLP)
library(tm)
library(SnowballC)
Building the corpus:
ProjectCorpus=Corpus(VectorSource(ProjectData))
We generally need to perform some pre-processing of the text data to prepare it for analysis. Example transformations include converting the text to lower case, removing numbers and punctuation, removing stop words, stemming, and identifying synonyms. The basic transformations are all available within tm.
ProjectCorpus
## <<VCorpus (documents: 1000, metadata (corpus/indexed): 0/0)>>
Exploring the corpus:
class(ProjectCorpus)
## [1] "VCorpus" "Corpus"
class(ProjectCorpus[[1]])
## [1] "PlainTextDocument" "TextDocument"
inspect(ProjectCorpus[1])
## <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
##
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## “If we lose one journalist today, another reckless crime is done to you because if you kill a journalist, you kill a world,” said Balabo. “We all just miss her laughter and her presence,” Hunter-Feeney reminisced, fighting back tears. “She loved life and loved her family. aww i wish u cld have gone and take pics 4 me'!!!!
The function tm_map() is used to apply one of these transformations across all documents within a corpus. Other transformations can be implemented using R functions and wrapped within content_transformer() to create a function that can be passed to tm_map().
getTransformations()
## [1] "removeNumbers" "removePunctuation" "removeWords"
## [4] "stemDocument" "stripWhitespace"
In the following sections we will apply each of the transformations, one-by-one, to remove unwanted characters from the text.
ProjectCorpus1 <- tm_map(ProjectCorpus, tolower)
ProjectCorpus1[[1]]
## [1] "“if we lose one journalist today, another reckless crime is done to you because if you kill a journalist, you kill a world,†said balabo. “we all just miss her laughter and her presence,†hunter-feeney reminisced, fighting back tears. “she loved life and loved her family. aww i wish u cld have gone and take pics 4 me'!!!!"
ProjectCorpus2 <- tm_map(ProjectCorpus1, PlainTextDocument)
ProjectCorpus2[[1]]
## <<PlainTextDocument (metadata: 7)>>
## “if we lose one journalist today, another reckless crime is done to you because if you kill a journalist, you kill a world,” said balabo. “we all just miss her laughter and her presence,” hunter-feeney reminisced, fighting back tears. “she loved life and loved her family. aww i wish u cld have gone and take pics 4 me'!!!!
ProjectCorpus3 <- tm_map(ProjectCorpus2, removePunctuation)
ProjectCorpus3[[1]]
## <<PlainTextDocument (metadata: 7)>>
## “if we lose one journalist today another reckless crime is done to you because if you kill a journalist you kill a world” said balabo “we all just miss her laughter and her presence” hunterfeeney reminisced fighting back tears “she loved life and loved her family aww i wish u cld have gone and take pics 4 me
stopwords("english")
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
ProjectCorpus4 <- tm_map(ProjectCorpus3, removeWords, stopwords("english"))
ProjectCorpus4[[1]]
## <<PlainTextDocument (metadata: 7)>>
## “ lose one journalist today another reckless crime done kill journalist kill world” said balabo “ just miss laughter presence” hunterfeeney reminisced fighting back tears “ loved life loved family aww wish u cld gone take pics 4
ProjectCorpus5 <- tm_map(ProjectCorpus4, stemDocument)
ProjectCorpus5[[1]]
## <<PlainTextDocument (metadata: 7)>>
## “ lose one journalist today anoth reckless crime done kill journalist kill world” said balabo “ just miss laughter presence” hunterfeeney reminisc fight back tear “ love life love famili aww wish u cld gone take pic 4
ProjectCorpus6 <- tm_map(ProjectCorpus5, removeNumbers)
ProjectCorpus6[[1]]
## <<PlainTextDocument (metadata: 7)>>
## “ lose one journalist today anoth reckless crime done kill journalist kill world” said balabo “ just miss laughter presence” hunterfeeney reminisc fight back tear “ love life love famili aww wish u cld gone take pic
ProjectCorpus7 <- tm_map(ProjectCorpus6, stripWhitespace)
ProjectCorpus7[[1]]
## <<PlainTextDocument (metadata: 7)>>
## “ lose one journalist today anoth reckless crime done kill journalist kill world” said balabo “ just miss laughter presence” hunterfeeney reminisc fight back tear “ love life love famili aww wish u cld gone take pic
ProjectCorpus7[[100]]
## <<PlainTextDocument (metadata: 7)>>
## law bad guy can persuad run program comput it’ comput anymor last video made snort laughter scene show workahol guy shroom offic tri captur guy web ethernet cabl look forward show weekendth even wo fact horrif stori
A document term matrix is simply a matrix with documents as the rows, terms as the columns, and the frequency of each term in each document as the cells. We use DocumentTermMatrix() to create it:
ProjectDTM<- DocumentTermMatrix(ProjectCorpus7)
ProjectDTM
## <<DocumentTermMatrix (documents: 1000, terms: 10485)>>
## Non-/sparse entries: 43439/10441561
## Sparsity : 100%
## Maximal term length: 32
## Weighting : term frequency (tf)
We can inspect the document term matrix using inspect(). Here, to avoid too much output, we inspect only a small subset of documents and terms:
inspect(ProjectDTM[1:5,900:905])
## <<DocumentTermMatrix (documents: 5, terms: 6)>>
## Non-/sparse entries: 0/30
## Sparsity : 100%
## Maximal term length: 10
## Weighting : term frequency (tf)
##
## Terms
## Docs beachfront bead bean bear beard bearer
## character(0) 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0
The document term matrix is in fact quite sparse (that is, mostly empty) and so it is actually stored in a much more compact representation internally. We can still get the row and column counts:
class(ProjectDTM)
## [1] "DocumentTermMatrix" "simple_triplet_matrix"
dim(ProjectDTM)
## [1] 1000 10485
ProjectDTM
## <<DocumentTermMatrix (documents: 1000, terms: 10485)>>
## Non-/sparse entries: 43439/10441561
## Sparsity : 100%
## Maximal term length: 32
## Weighting : term frequency (tf)
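To get a feel for how compact the sparse storage is, we could compare the size of the sparse object with the size of its dense equivalent (a supplementary check; the dense form is a full 1000 x 10485 numeric matrix):
format(object.size(ProjectDTM), units="MB") # sparse triplet representation
format(object.size(as.matrix(ProjectDTM)), units="MB") # fully dense matrix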
The transpose is created using TermDocumentMatrix():
ProjectTDM <- TermDocumentMatrix(ProjectCorpus7)
ProjectTDM
## <<TermDocumentMatrix (terms: 10485, documents: 1000)>>
## Non-/sparse entries: 43439/10441561
## Sparsity : 100%
## Maximal term length: 32
## Weighting : term frequency (tf)
We could obtain the term frequencies as a vector by converting the document term matrix into an ordinary matrix and summing the column counts. For a very large corpus, however, the size of that dense matrix can exceed R's limits. This manifests as an integer overflow error with a message like:
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message: In nr * nc : NAs produced by integer overflow
If this occurs, consider removing sparse terms from the document term matrix first (an alternative that avoids the dense conversion entirely is sketched after the removeSparseTerms() step below).
Removing sparse terms:
We are often not interested in infrequent terms in our documents. Such "sparse" terms can be removed from the document term matrix quite easily using removeSparseTerms():
dim(ProjectDTM)
## [1] 1000 10485
ProjectSparse <- removeSparseTerms(ProjectDTM, 0.98)
ProjectSparse
## <<DocumentTermMatrix (documents: 1000, terms: 395)>>
## Non-/sparse entries: 17272/377728
## Sparsity : 96%
## Maximal term length: 9
## Weighting : term frequency (tf)
dim(ProjectSparse)
## [1] 1000 395
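As an aside, if a document term matrix were still too large to convert to a dense matrix (the integer-overflow situation described above), the conversion can be avoided altogether: tm stores the matrix in the slam package's sparse triplet format, and slam can compute column sums on that representation directly. A minimal sketch (slam is a dependency of tm, so it should already be installed):
library(slam)
SparseFrequency=col_sums(ProjectDTM) # term frequencies with no as.matrix() call, so no overflow risk
head(sort(SparseFrequency, decreasing=TRUE))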
Coming back to our (now much smaller) matrix: removeSparseTerms() has removed most of the sparse terms, and we can obtain the term frequencies as a vector by converting the reduced document term matrix into an ordinary matrix and summing the column counts. We can see the effect by looking at the terms we have left:
ProjectFrequency <- colSums(as.matrix(ProjectSparse))
ProjectFrequency
## – — " abl accord actual add
## 55 82 82 25 25 37 40
## addit age ago allow along alreadi also
## 22 26 29 32 41 22 135
## alway american anoth anyth appear area around
## 56 27 53 29 22 26 71
## art ask author away back bad beauti
## 39 58 22 41 114 34 26
## becom begin believ best better big bit
## 38 25 29 60 57 57 40
## blog book boy break build busi buy
## 30 64 23 28 34 45 22
## call came can cant car care case
## 88 29 208 58 25 33 47
## caus center chanc chang check children citi
## 28 28 22 71 25 31 56
## class cleveland close coach collect color come
## 23 30 36 30 25 27 112
## communiti compani complet continu countri coupl cours
## 22 40 30 36 28 30 22
## cover creat custom cut day death decid
## 23 28 26 26 153 33 32
## didnt die differ director don’t done dont
## 36 39 55 22 51 37 85
## drive earli eat educ end enjoy enough
## 24 32 21 27 56 31 49
## even event ever everi everyon everyth excit
## 110 29 33 53 39 23 27
## expect experi face fact fall famili fan
## 27 23 29 28 31 55 25
## far favorit feel field figur film final
## 27 22 85 23 26 42 43
## find fire first five follow food forc
## 50 23 136 32 65 31 22
## forward found four free friday friend full
## 24 47 29 30 22 58 26
## fun game gave get girl give god
## 31 66 21 213 28 59 30
## good got great group guy hand happen
## 133 82 87 36 33 43 51
## happi hard head hear heard heart help
## 46 27 52 29 22 29 65
## hes high histori hit hold home hope
## 30 62 24 32 23 75 35
## hour hous howev i’m idea immedi import
## 45 60 25 35 32 21 21
## includ increas inform instead interest involv isnt
## 74 25 32 27 31 28 23
## issu it’ ive job john just keep
## 25 74 28 48 26 252 65
## kid kill kind know larg last late
## 39 27 34 117 36 104 24
## later lead learn least leav left less
## 23 26 32 31 36 46 23
## let level life light like line list
## 60 27 82 42 246 39 28
## littl live lol long look lot love
## 83 66 28 59 126 70 134
## made main make man manag mani may
## 63 23 173 50 31 72 59
## mayb mean meet might mind minut miss
## 23 64 48 45 28 48 31
## monday money month morn move movi much
## 21 36 46 32 41 21 89
## music must name nation near need never
## 29 30 45 31 28 101 57
## new next nice night now number offer
## 135 56 31 61 132 33 32
## offic offici often old one onlin open
## 35 30 29 24 313 25 62
## order other park part parti pass past
## 35 23 38 61 28 35 29
## pay peopl percent perfect person pick pictur
## 25 131 26 26 46 31 39
## piec place plan play pleas point polic
## 24 63 61 75 23 42 53
## possibl post power practic presid pretti price
## 27 35 30 23 31 32 26
## probabl problem product program project provid public
## 26 24 33 28 52 32 31
## put question quit rais read real realli
## 48 33 25 23 52 26 92
## receiv recent record releas remain rememb report
## 29 31 30 22 29 25 37
## result return right road roll room run
## 29 24 95 31 23 49 61
## said saturday saw say scene school season
## 304 23 23 124 25 64 38
## second see seem seen sens serv servic
## 40 120 51 22 24 35 28
## set sever share show side sinc sit
## 46 33 38 69 38 52 28
## sleep small someon someth sound special spend
## 27 40 47 69 29 25 29
## start state stay step still stop store
## 102 76 30 26 81 50 42
## stori street student summer support sure take
## 49 33 30 31 31 40 114
## talk team tell thank that thing think
## 42 52 35 72 64 113 139
## though thought three time today togeth told
## 47 48 50 215 75 33 33
## took top train tri turn two univers
## 48 34 29 69 32 103 30
## use valu view visit wait walk want
## 119 22 28 28 41 35 133
## war watch water way week weekend well
## 26 44 42 112 69 22 94
## went whether white whole will win without
## 44 30 30 32 269 29 39
## woman wonder word work world write year
## 25 29 51 144 75 40 186
## yes yet your
## 35 38 39
table(ProjectFrequency)
## ProjectFrequency
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
## 6 17 20 9 17 16 12 19 20 17 18 14 11 4 10 7 3 6
## 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 55 56 57
## 8 7 4 6 2 2 4 5 4 6 3 4 4 5 3 3 4 3
## 58 59 60 61 62 63 64 65 66 69 70 71 72 74 75 76 81 82
## 3 3 3 4 2 2 4 3 2 4 1 2 2 2 4 1 1 4
## 83 85 87 88 89 92 94 95 101 102 103 104 110 112 113 114 117 119
## 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1
## 120 124 126 131 132 133 134 135 136 139 144 153 173 186 208 213 215 246
## 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1
## 252 269 304 313
## 1 1 1 1
length(ProjectFrequency)
## [1] 395
By ordering the frequencies we can list the most frequent terms and the least frequent terms:
ProjectOrder <- order(ProjectFrequency)
ProjectOrder
## [1] 87 129 160 161 218 223 8 13 19 24 42 52 64 70 81 107 119
## [18] 124 145 291 312 366 377 38 57 71 97 100 109 114 152 168 183 189
## [35] 205 211 247 264 270 284 299 303 304 85 120 150 182 242 260 275 296
## [52] 313 4 5 30 47 54 61 105 157 163 169 244 253 283 293 306 328
## [69] 386 9 20 28 73 74 110 126 173 184 255 256 273 274 286 333 372
## [86] 16 62 88 98 99 106 142 165 177 191 267 323 39 50 51 68 72
## [103] 102 131 167 171 196 199 215 229 250 277 315 322 367 368 10 18 31
## [120] 44 93 101 122 144 146 225 241 252 288 292 295 327 329 360 384 387
## [137] 36 58 60 66 69 123 133 148 226 240 269 290 332 339 364 380 381
## [154] 55 90 103 118 127 166 186 208 217 228 234 258 271 280 289 298 340
## [171] 341 11 77 86 116 151 159 164 185 221 238 272 279 362 382 48 76
## [188] 94 138 237 276 282 317 338 356 357 27 40 178 359 154 158 239 246
## [205] 251 268 314 346 370 393 59 67 78 137 180 187 219 6 83 294 29
## [222] 248 308 318 320 394 22 79 96 176 195 259 385 395 7 35 65 309
## [239] 324 342 391 12 25 222 369 111 193 265 336 344 374 112 139 373 379
## [256] 41 155 214 227 141 188 220 257 316 49 121 325 351 172 213 216 281
## [273] 352 358 91 300 337 113 207 335 353 82 140 311 388 143 278 285 321
## [290] 345 17 95 266 1 80 104 15 56 89 233 33 34 231 23 46 125
## [307] 132 200 210 32 156 190 235 249 262 301 149 245 204 261 37 212 307
## [324] 348 117 147 175 128 198 319 326 361 376 202 21 53 209 347 162 170
## [341] 153 263 355 390 331 334 2 3 135 192 197 84 108 136 43 224 287
## [358] 378 297 230 330 363 181 92 63 375 349 26 343 179 365 310 305 201
## [375] 254 236 134 371 203 14 232 115 350 389 75 206 392 45 130 354 194
## [392] 174 383 302 243
Least frequent terms:
ProjectFrequency[head(ProjectOrder)]
## eat gave immedi import monday movi
## 21 21 21 21 21 21
Most frequent terms:
ProjectFrequency[tail(ProjectOrder)]
## time like just will said one
## 215 246 252 269 304 313
These terms are much more likely to be of interest to us. Not surprisingly, given the mix of blog, news and Twitter text in the corpus, the most frequent terms are common everyday words.
Frequency of frequencies:
head(table(ProjectFrequency), 10)
## ProjectFrequency
## 21 22 23 24 25 26 27 28 29 30
## 6 17 20 9 17 16 12 19 20 17
tail(table(ProjectFrequency), 10)
## ProjectFrequency
## 173 186 208 213 215 246 252 269 304 313
## 1 1 1 1 1 1 1 1 1 1
So we can see here that a large number of terms occur only a small number of times.
Identifying frequent terms and associations:
One of the first things we often want to do is get an idea of the most frequent terms in the corpus. We use findFreqTerms() for this. Here we limit the output to those terms that occur at least 200 times:
findFreqTerms(ProjectDTM,lowfreq=200)
## [1] "can" "get" "just" "like" "one" "said" "time" "will"
So that only lists a few. We can get more of them by reducing the threshold:
findFreqTerms(ProjectDTM,lowfreq=100)
## [1] "also" "back" "can" "come" "day" "even" "first" "get"
## [9] "good" "just" "know" "last" "like" "look" "love" "make"
## [17] "need" "new" "now" "one" "peopl" "said" "say" "see"
## [25] "start" "take" "thing" "think" "time" "two" "use" "want"
## [33] "way" "will" "work" "year"
We can also find associations with a word, specifying a correlation limit:
findAssocs(ProjectDTM, "can", corlimit=0.2)
## can
## compass 0.2
## dire 0.2
## disk 0.2
## skim 0.2
findAssocs(ProjectDTM, "like", corlimit=0.2)
## like
## colobas 0.22
## contradictori 0.22
## durango 0.22
## evergreen 0.22
## granddaught 0.22
## maureen 0.22
## obrien 0.22
## seem 0.22
## sorti 0.22
## surrog 0.22
## firework 0.20
## vicki 0.20
findAssocs(ProjectDTM, "one", corlimit=0.2)
## one
## repli 0.35
## biggi 0.32
## bracelet 0.32
## circumcised” 0.32
## clerk 0.32
## gaudi 0.32
## hbo 0.32
## teller 0.32
## that” 0.32
## transact 0.32
## lou 0.30
## accumul 0.29
## husband 0.26
## dive 0.25
## sub 0.25
## wrist 0.25
## yet 0.24
## answer 0.23
## counter 0.23
## jewelri 0.23
## “i’ll 0.22
## angl 0.22
## canada 0.22
## exasper 0.22
## twist 0.21
## father 0.20
findAssocs(ProjectDTM, "time", corlimit=0.2)
## time
## cow 0.23
## imped 0.23
## imper 0.23
## medicaid 0.23
## medicar 0.23
## sacr 0.23
## moral 0.22
## cost 0.21
findAssocs(ProjectDTM, "year", corlimit=0.2)
## year
## chili 0.33
## dutrow 0.33
## incredul 0.33
## manipul 0.33
## midterm 0.33
## project” 0.33
## sophomor 0.33
## stringent 0.33
## viva 0.33
## guidelin 0.29
## project 0.25
## derbi 0.22
## sketch 0.22
## snuck 0.22
## testament 0.22
## threw 0.22
## radar 0.21
## last 0.20
findAssocs(ProjectDTM, "will", corlimit=0.2)
## will
## alcov 0.27
## banquett 0.27
## bookshelv 0.27
## cornic 0.27
## cozi 0.27
## deper 0.27
## firebox 0.27
## herringbon 0.27
## josj 0.27
## pillow 0.27
## ala 0.25
## capturethequeen 0.22
## demolish 0.22
## flatter 0.22
## freezer 0.22
## gambit 0.22
## gosl 0.22
## immeasur 0.22
## jacob 0.22
## mlk 0.22
## paz 0.22
## peacocki 0.22
## regrad 0.22
## restrip 0.22
## revamp 0.22
## selfassur 0.22
## selfpossess 0.22
## spill 0.22
## sugarscap 0.22
## upscal 0.22
findAssocs(ProjectDTM, "just", corlimit=0.2)
## just
## cosign 0.22
## dopey 0.22
## heartless 0.22
## lend 0.22
## monkey 0.22
## murmur 0.22
## nil 0.22
## nurseget 0.22
## ohalloran 0.22
## overrid 0.22
## sedat 0.22
## teeni 0.22
## zak 0.22
If two words always appear together then the correlation would be 1.0 and if they never appear together the correlation would be 0.0. Thus the correlation is a measure of how closely associated the words are in the corpus.
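We can check what findAssocs() is reporting by computing the same quantity by hand: the correlation between the per-document counts of two terms, taken from the columns of the document term matrix ("love" and "like" below are simply two terms we know appear in the matrix; this is the value findAssocs() compares against corlimit):
DTMmatrix=as.matrix(ProjectDTM) # rows are documents, columns are terms
cor(DTMmatrix[, "love"], DTMmatrix[, "like"]) # the association findAssocs(ProjectDTM, "love", ...) would report for "like"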
library(Rgraphviz) # Correlation plots.
## Loading required package: graph
## Loading required package: grid
library(graph)
library(grid)
plot(ProjectDTM,
terms=findFreqTerms(ProjectDTM, lowfreq=120),
corThreshold=0.11)
We can generate the frequency count of every remaining word in the corpus, sorted in decreasing order:
Frequency<- sort(colSums(as.matrix(ProjectSparse)), decreasing=TRUE)
head(Frequency, 100)
## one said will just like time get can year make
## 313 304 269 252 246 215 213 208 186 173
## day work think first also new love good want now
## 153 144 139 136 135 135 134 133 133 132
## peopl look say see use know back take thing come
## 131 126 124 120 119 117 114 114 113 112
## way even last two start need right well realli much
## 112 110 104 103 102 101 95 94 92 89
## call great dont feel littl — " got life still
## 88 87 85 85 83 82 82 82 82 81
## state home play today world includ it’ mani thank around
## 76 75 75 75 75 74 74 72 72 71
## chang lot show someth tri week game live follow help
## 71 70 69 69 69 69 66 66 65 65
## keep book mean school that made place high open night
## 65 64 64 64 64 63 63 62 62 61
## part plan run best hous let give long may ask
## 61 61 61 60 60 60 59 59 59 58
## cant friend better big never alway citi end next –
## 58 58 57 57 57 56 56 56 56 55
WordFrequency <- data.frame(word=names(Frequency), freq=Frequency)
head(WordFrequency)
## word freq
## one one 313
## said said 304
## will will 269
## just just 252
## like like 246
## time time 215
We can then plot the frequency of those words that occur more than 150 times in the corpus:
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
library(qdap) # Quantitative discourse analysis of transcripts.
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## Loading required package: RColorBrewer
##
## Attaching package: 'qdap'
##
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, as.TermDocumentMatrix
##
## The following object is masked from 'package:base':
##
## Filter
subset(WordFrequency, freq>150) %>%
ggplot(aes(word, freq)) +
geom_bar(stat="identity") +
theme(axis.text.x=element_text(angle=45, hjust=1))
We can generate a word cloud as an effective way to get a quick visual overview of the frequency of words in the corpus:
library(wordcloud)
library(RColorBrewer)
set.seed(1)
wordcloud(names(Frequency),Frequency,min.freq=100,colors=brewer.pal(6, "Dark2"),scale=c(3, .05))