This is a basic analysis of the data sets (text files) provided for the Johns Hopkins/Coursera Data Science capstone project.
Our eventual goal is to use natural language processing techniques to create an algorithm that predicts the next word a user will type based on one or two previous words. Our data consists of large samples of text from news articles (en_US.news.txt), Twitter (en_US.twitter.txt), and blog posts (en_US.blogs.txt).
At this stage we take a text mining approach, and our main task is to clean the data set and get a sense of the most frequently used words.
library(knitr)
library(tm)
file.info("en_US.news.txt")$size / 2^20
## [1] 196.2775
file.info("en_US.twitter.txt")$size / 2^20
## [1] 159.3641
file.info("en_US.blogs.txt")$size / 2^20
## [1] 200.4242
file_length <- c(length(readLines("en_US.blogs.txt")),
                 length(readLines("en_US.twitter.txt")),
                 length(readLines("en_US.news.txt"))) # number of lines in each file
## Warning in readLines("en_US.twitter.txt"): line 167155 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 268547 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1274086 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1759032 appears to contain
## an embedded nul
## Warning in readLines("en_US.news.txt"): incomplete final line found on
## 'en_US.news.txt'
max(file_length)
## [1] 2360148
The Twitter file (en_US.twitter.txt) has the most lines, over 2.3 million. Trying to read and process that much text at once causes R to crash, so we are only going to read the first 10,000 lines of each file.
news <- readLines(con = "en_US.news.txt", 1e4, encoding = "UTF-8")
news[9980]
## [1] "<U+0093>You feel like you lose a piece of yourself,<U+0094> says Arah Montagne. She's not your average survivor. Arah, 34, never actually had breast cancer. But she had the BRCA1 gene, raising her chances of getting breast cancer to 85 percent."
Next we take a few steps to remove punctuation, strange characters (due to inconsistent text encoding), capital letters, and stop words (small, extremely common words like “the”, “for”, “me” and “very”). What we would like to obtain is a list of roughly the 1,000 most common words. Since we will need to repeat these steps for each file, we have wrapped them in the following function:
get.frequent.words <- function(inputfile, rarity = 0.997, n = 1e4, remove.stopwords = TRUE){
    text <- readLines(con = inputfile, n, encoding = "UTF-8")
    text <- iconv(text, "latin1", "ASCII", sub = "") # drop odd characters such as "\u0093" left by inconsistent encoding
    text <- tolower(text)
    text <- removePunctuation(text)
    corpus <- Corpus(VectorSource(text))
    if (remove.stopwords) {
        corpus <- tm_map(corpus, removeWords, stopwords("english"))
    }
    dtm <- DocumentTermMatrix(corpus)
    sparse <- removeSparseTerms(dtm, rarity) # keep only terms appearing in at least (1 - rarity), i.e. roughly 0.3%, of the lines
    wordfreq <- colSums(as.matrix(sparse)) # named vector of words with their number of occurrences
    return(wordfreq)
}
Now to actually read in the text files and look at the most frequent words:
words1 <- get.frequent.words("en_US.blogs.txt")
words2 <- get.frequent.words("en_US.news.txt")
words3 <- get.frequent.words("en_US.twitter.txt")
length(words1)
## [1] 1086
length(words2)
## [1] 1114
length(words3)
## [1] 335
Notice that the Twitter sample yields far fewer frequent words (335, compared with roughly 1,100 for the other two), possibly due to the 140-character limit in effect on Twitter.
tail(sort(words1),100)
## family end days bit times without house
## 206 209 210 211 212 212 213
## blog better man put night ever read
## 215 217 217 219 220 221 221
## went let give old big thought though
## 221 222 225 230 231 232 238
## use today used part thing since best
## 239 242 247 250 252 257 260
## long lot thats look find week cant
## 260 261 262 263 264 267 268
## away next god book feel place come
## 269 271 275 279 280 280 286
## home sure always another world great every
## 291 293 310 310 312 323 326
## need never didnt years year right say
## 326 329 331 333 345 350 351
## may take something last said around got
## 361 362 368 369 379 383 391
## work ive life want two made things
## 393 394 400 410 413 415 416
## many still going love back think day
## 418 425 444 513 521 521 531
## see little much well way really first
## 533 542 556 556 561 562 567
## even good make also dont new people
## 575 579 581 589 616 625 626
## now know get time like can just
## 675 717 768 1016 1091 1137 1158
## will one
## 1233 1327
tail(sort(words2),100)
## business office run government house never
## 156 156 156 158 159 160
## need end law told health really
## 161 166 166 166 168 168
## better use help night part money
## 169 169 172 172 173 174
## officials play court place family left
## 174 176 177 177 179 181
## show little around lot long according
## 182 183 184 185 189 190
## center big right second president hes
## 193 194 194 194 196 198
## come company next want week best
## 200 205 207 211 211 212
## including high got public say know
## 212 213 214 219 222 223
## since four another see thats team
## 225 226 227 237 250 254
## think take much police still work
## 254 257 258 258 258 260
## way season made day well county
## 269 270 277 278 286 288
## game even may good back many
## 290 298 301 302 308 313
## make dont says million going home
## 314 315 315 321 329 330
## city three school percent now get
## 333 336 348 369 370 416
## state people like years first time
## 474 484 487 501 507 513
## last just year can two also
## 524 553 569 576 576 586
## new one will said
## 675 840 1084 2484
tail(sort(words3),100)
## coming miss music real fun let soon
## 88 88 88 88 89 89 89
## anyone bad live free things everyone watching
## 90 90 90 91 91 92 92
## even hate weekend gonna something big tomorrow
## 94 94 94 95 95 96 96
## help feel yeah awesome guys man hey
## 97 98 98 99 99 99 100
## sure world keep take getting looking game
## 100 100 103 103 104 105 106
## morning wait always haha year please look
## 106 106 110 111 111 112 118
## week home better life never yes next
## 118 119 121 123 123 126 129
## twitter first say ill best come show
## 129 137 139 140 144 145 145
## way hope thank still youre last night
## 145 147 151 153 155 159 162
## work thats make tonight right people happy
## 168 175 187 193 195 198 206
## much well need want really follow think
## 208 210 211 212 219 225 227
## cant going back got see lol new
## 238 245 246 254 258 260 285
## today time great now know one day
## 298 320 326 328 330 359 370
## can dont thanks will good love get
## 385 394 394 402 419 435 473
## like just
## 509 656
The word frequencies are broadly similar across the three files. We may consider removing the word “said”, as it occurs freakishly often in the news articles (2,484 times).
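If we do, one simple option is to drop it from the frequency vector after the fact; the sketch below is only illustrative (words2_filtered is a hypothetical name, and the filter is not applied in the analysis that follows):
words2_filtered <- words2[!names(words2) %in% c("said")] # news word counts without "said" (illustrative only)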
We continue by aggregating the words from all three files and summing their counts, giving a data frame with two columns: the word and its total count.
all_words <- union(names(words1), names(words2))
all_words <- union(all_words, names(words3))
words <- data.frame(all_words)
words$all_words <- as.character(words$all_words)
words$count <- 0
for (i in 1:length(all_words)){ # loop through all_words and add up the counts from all three files
    n <- 0
    if (all_words[i] %in% names(words1)){
        n <- n + as.numeric(words1[all_words[i]])
    }
    if (all_words[i] %in% names(words2)){
        n <- n + as.numeric(words2[all_words[i]])
    }
    if (all_words[i] %in% names(words3)){
        n <- n + as.numeric(words3[all_words[i]])
    }
    words[i, 2] <- n
}
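For reference, the same totals can be obtained more concisely, since words1, words2 and words3 are named numeric vectors; this is just an equivalent sketch (combined, totals and words_alt are illustrative names), not a change to the analysis:
combined <- c(words1, words2, words3) # one long named vector; duplicate names are allowed
totals <- tapply(combined, names(combined), sum) # sum the counts for each distinct word
words_alt <- data.frame(all_words = names(totals), count = as.numeric(totals), stringsAsFactors = FALSE)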
Note that we have over 1,500 distinct words, which is too many for a word cloud. We will keep just the words that occur more than 200 times, which leaves us with around 300 words for the word cloud.
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.2.5
## Loading required package: RColorBrewer
top_words <- subset(words, count > 200)
str(top_words)
## 'data.frame': 302 obs. of 2 variables:
## $ all_words: chr "able" "according" "actually" "add" ...
## $ count : num 236 248 308 214 259 ...
tail(top_words[order(top_words$count),],50)
## all_words count
## 950 things 646
## 514 life 667
## 945 thats 687
## 965 today 693
## 621 need 698
## 803 say 712
## 566 may 719
## 925 take 722
## 784 right 739
## 428 home 740
## 554 made 761
## 385 great 803
## 526 little 804
## 562 many 810
## 1060 work 821
## 1017 want 833
## 896 still 836
## 382 got 859
## 1077 years 893
## 761 really 949
## 273 even 967
## 1026 way 975
## 951 think 1002
## 377 going 1018
## 610 much 1022
## 1076 year 1025
## 814 see 1028
## 993 two 1046
## 497 last 1052
## 545 love 1052
## 1037 well 1052
## 57 back 1075
## 557 make 1082
## 209 day 1179
## 326 first 1211
## 26 also 1259
## 487 know 1270
## 381 good 1300
## 681 people 1308
## 236 dont 1325
## 638 now 1373
## 625 new 1585
## 361 get 1657
## 962 time 1849
## 516 like 2087
## 119 can 2098
## 477 just 2367
## 649 one 2526
## 1048 will 2719
## 796 said 2942
wc <- wordcloud(top_words$all_words, top_words$count, scale = c(3,0.5), random.order = FALSE)
Our next step is to take these words (or the top 1,000 words) and build n-grams. Specifically, we would like to know which 2-, 3- and 4-word sequences begin with these words, and to have that data readily available for our prediction algorithm.
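As a first illustration of that step, here is a minimal sketch of how bigram counts could be extracted from one of the samples using base R and the cleaning functions already loaded; get.bigrams, the simple whitespace tokenization and the fact that it ignores line boundaries are assumptions for illustration, not the final implementation:
get.bigrams <- function(text){
    text <- iconv(text, "latin1", "ASCII", sub = "") # same encoding clean-up as before
    tokens <- unlist(strsplit(removePunctuation(tolower(text)), "\\s+"))
    tokens <- tokens[tokens != ""] # drop empty tokens
    bigrams <- paste(head(tokens, -1), tail(tokens, -1)) # pair each word with the word that follows it
    sort(table(bigrams), decreasing = TRUE) # bigram counts, most frequent first
}
head(get.bigrams(news), 10) # e.g. the ten most common two-word sequences in the news sample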