This is a basic analysis of the data sets (text files) provided for the Johns Hopkins/Coursera Data Science capstone project.

Our eventual goal is to use natural language processing techniques to create an algorithm that predicts the next word a user will type based on the one or two previous words. Our data consists of large samples of text from news articles (en_US.news.txt), Twitter (en_US.twitter.txt) and blog posts (en_US.blogs.txt).

At this stage we take a text mining approach, and our main task is to clean the data set and get a sense of the most frequently used words.

library(knitr)
library(tm)

file.info("en_US.news.txt")$size / 2^20
## [1] 196.2775
file.info("en_US.twitter.txt")$size / 2^20
## [1] 159.3641
file.info("en_US.blogs.txt")$size / 2^20
## [1] 200.4242
file_length <- c(length(readLines("en_US.blogs.txt")),
                 length(readLines("en_US.twitter.txt")),
                 length(readLines("en_US.news.txt"))) # number of lines in each file
## Warning in readLines("en_US.twitter.txt"): line 167155 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 268547 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1274086 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1759032 appears to contain
## an embedded nul
## Warning in readLines("en_US.news.txt"): incomplete final line found on
## 'en_US.news.txt'
max(file_length)
## [1] 2360148
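
As an aside, the embedded-nul warnings above come from readLines and can be silenced with its skipNul argument; a minimal example (not used in the rest of this report):

readLines("en_US.twitter.txt", n = 1e4, encoding = "UTF-8", skipNul = TRUE)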

The file with the most lines (en_US.twitter.txt) runs to over 2.3 million lines. Trying to process this much text at once causes R to crash on our machine, so we will only read the first 10,000 lines from each file.

news <- readLines(con = "en_US.news.txt", 1e4, encoding = "UTF-8")
news[9980] # an example of a line containing stray encoding artifacts
## [1] "<U+0093>You feel like you lose a piece of yourself,<U+0094> says Arah Montagne. She's not your average survivor. Arah, 34, never actually had breast cancer. But she had the BRCA1 gene, raising her chances of getting breast cancer to 85 percent."

Next we take a few steps to remove punctuation, strange characters (due to inconsistent text encoding), capital letters, and stop words (small, extremely common words like “the”, “for”, “me” and “very”). From the cleaned text we then want a list of the 1,000 or so most common words. Since this is something we will need to repeat for each file, we have wrapped these steps in the following function:

get.frequent.words <- function(inputfile, rarity = 0.997, n = 1e4, remove.stopwords = TRUE){
    news <- readLines(con = inputfile, n, encoding = "UTF-8")
    news <- iconv(news, "latin1", "ASCII", sub = "") # strip non-ASCII characters such as "\u0093"
    news <- tolower(news)
    news <- removePunctuation(news)
    
    corpusnews <- Corpus(VectorSource(news))
    if (remove.stopwords == TRUE) {
        corpusnews <- tm_map(corpusnews, removeWords, stopwords("english"))
    }
    
    dtmnews <- DocumentTermMatrix(corpusnews)
    sparsenews <- removeSparseTerms(dtmnews, rarity) # keep only terms appearing in at least ~0.3% of lines
    wordsnews <- as.data.frame(as.matrix(sparsenews))
    
    wordfreq <- colSums(wordsnews) # a named vector of words with their number of occurrences
    return(wordfreq)
}

Now to actually read in the text files and look at the most frequent words:

words1 <- get.frequent.words("en_US.blogs.txt")
words2 <- get.frequent.words("en_US.news.txt")
words3 <- get.frequent.words("en_US.twitter.txt")

length(words1)
## [1] 1086
length(words2)
## [1] 1114
length(words3)
## [1] 335

Notice that the Twitter text contains far fewer frequent words, possibly due to the 140-character limit in effect on Twitter.

tail(sort(words1),100)
##    family       end      days       bit     times   without     house 
##       206       209       210       211       212       212       213 
##      blog    better       man       put     night      ever      read 
##       215       217       217       219       220       221       221 
##      went       let      give       old       big   thought    though 
##       221       222       225       230       231       232       238 
##       use     today      used      part     thing     since      best 
##       239       242       247       250       252       257       260 
##      long       lot     thats      look      find      week      cant 
##       260       261       262       263       264       267       268 
##      away      next       god      book      feel     place      come 
##       269       271       275       279       280       280       286 
##      home      sure    always   another     world     great     every 
##       291       293       310       310       312       323       326 
##      need     never     didnt     years      year     right       say 
##       326       329       331       333       345       350       351 
##       may      take something      last      said    around       got 
##       361       362       368       369       379       383       391 
##      work       ive      life      want       two      made    things 
##       393       394       400       410       413       415       416 
##      many     still     going      love      back     think       day 
##       418       425       444       513       521       521       531 
##       see    little      much      well       way    really     first 
##       533       542       556       556       561       562       567 
##      even      good      make      also      dont       new    people 
##       575       579       581       589       616       625       626 
##       now      know       get      time      like       can      just 
##       675       717       768      1016      1091      1137      1158 
##      will       one 
##      1233      1327
tail(sort(words2),100)
##   business     office        run government      house      never 
##        156        156        156        158        159        160 
##       need        end        law       told     health     really 
##        161        166        166        166        168        168 
##     better        use       help      night       part      money 
##        169        169        172        172        173        174 
##  officials       play      court      place     family       left 
##        174        176        177        177        179        181 
##       show     little     around        lot       long  according 
##        182        183        184        185        189        190 
##     center        big      right     second  president        hes 
##        193        194        194        194        196        198 
##       come    company       next       want       week       best 
##        200        205        207        211        211        212 
##  including       high        got     public        say       know 
##        212        213        214        219        222        223 
##      since       four    another        see      thats       team 
##        225        226        227        237        250        254 
##      think       take       much     police      still       work 
##        254        257        258        258        258        260 
##        way     season       made        day       well     county 
##        269        270        277        278        286        288 
##       game       even        may       good       back       many 
##        290        298        301        302        308        313 
##       make       dont       says    million      going       home 
##        314        315        315        321        329        330 
##       city      three     school    percent        now        get 
##        333        336        348        369        370        416 
##      state     people       like      years      first       time 
##        474        484        487        501        507        513 
##       last       just       year        can        two       also 
##        524        553        569        576        576        586 
##        new        one       will       said 
##        675        840       1084       2484
tail(sort(words3),100)
##    coming      miss     music      real       fun       let      soon 
##        88        88        88        88        89        89        89 
##    anyone       bad      live      free    things  everyone  watching 
##        90        90        90        91        91        92        92 
##      even      hate   weekend     gonna something       big  tomorrow 
##        94        94        94        95        95        96        96 
##      help      feel      yeah   awesome      guys       man       hey 
##        97        98        98        99        99        99       100 
##      sure     world      keep      take   getting   looking      game 
##       100       100       103       103       104       105       106 
##   morning      wait    always      haha      year    please      look 
##       106       106       110       111       111       112       118 
##      week      home    better      life     never       yes      next 
##       118       119       121       123       123       126       129 
##   twitter     first       say       ill      best      come      show 
##       129       137       139       140       144       145       145 
##       way      hope     thank     still     youre      last     night 
##       145       147       151       153       155       159       162 
##      work     thats      make   tonight     right    people     happy 
##       168       175       187       193       195       198       206 
##      much      well      need      want    really    follow     think 
##       208       210       211       212       219       225       227 
##      cant     going      back       got       see       lol       new 
##       238       245       246       254       258       260       285 
##     today      time     great       now      know       one       day 
##       298       320       326       328       330       359       370 
##       can      dont    thanks      will      good      love       get 
##       385       394       394       402       419       435       473 
##      like      just 
##       509       656

The word frequencies are broadly similar across the three files. We may consider removing the word “said”, as it occurs disproportionately often in the news articles.
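
If we do decide to drop it, a one-line sketch (not applied in this report) would be:

words2 <- words2[names(words2) != "said"] # remove "said" from the news word counts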

We continue by aggregating the words from all three files and adding up their counts, ending up with a data frame of two columns: the word and its total count.

all_words <- union(names(words1), names(words2))
all_words <- union(all_words, names(words3))
words <- data.frame(all_words)
words$all_words <- as.character(words$all_words) # keep words as character strings rather than factors
words$count <- 0

for (i in (1:length(all_words))){ #loop through all_words and add up the counts from all three files
    n <- 0
    if (all_words[i] %in% names(words1)){
        n <- as.numeric(words1[all_words[i]]) + n
    }
    
    if (all_words[i] %in% names(words2)){
        n <- as.numeric(words2[all_words[i]]) + n
    }
    
    if (all_words[i] %in% names(words3)){
        n <- as.numeric(words3[all_words[i]]) + n
    }
    words[i,2] <- n
}
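
As an aside, the same totals could be computed more concisely by concatenating the three named vectors and summing the counts by name; this is just a sketch, not the approach used above:

combined <- c(words1, words2, words3)                  # one long named vector of counts
word_totals <- tapply(combined, names(combined), sum)  # add up the counts for each word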

Note that we now have over 1,500 distinct words, which is too many for a word cloud. We will keep just the words that occur more than 200 times, which leaves us with around 300 words for the word cloud.

library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.2.5
## Loading required package: RColorBrewer
top_words <- subset(words, count > 200)
str(top_words)
## 'data.frame':    302 obs. of  2 variables:
##  $ all_words: chr  "able" "according" "actually" "add" ...
##  $ count    : num  236 248 308 214 259 ...
tail(top_words[order(top_words$count),],50)
##      all_words count
## 950     things   646
## 514       life   667
## 945      thats   687
## 965      today   693
## 621       need   698
## 803        say   712
## 566        may   719
## 925       take   722
## 784      right   739
## 428       home   740
## 554       made   761
## 385      great   803
## 526     little   804
## 562       many   810
## 1060      work   821
## 1017      want   833
## 896      still   836
## 382        got   859
## 1077     years   893
## 761     really   949
## 273       even   967
## 1026       way   975
## 951      think  1002
## 377      going  1018
## 610       much  1022
## 1076      year  1025
## 814        see  1028
## 993        two  1046
## 497       last  1052
## 545       love  1052
## 1037      well  1052
## 57        back  1075
## 557       make  1082
## 209        day  1179
## 326      first  1211
## 26        also  1259
## 487       know  1270
## 381       good  1300
## 681     people  1308
## 236       dont  1325
## 638        now  1373
## 625        new  1585
## 361        get  1657
## 962       time  1849
## 516       like  2087
## 119        can  2098
## 477       just  2367
## 649        one  2526
## 1048      will  2719
## 796       said  2942
wc <- wordcloud(top_words$all_words, top_words$count, scale = c(3,0.5), random.order = FALSE)

Conclusion

Our next step is to take these words (or the top 1,000 of them) and build n-grams. Specifically, we would like to know which 2-, 3- and 4-word sequences begin with these words, and to have that data readily available for our prediction algorithm.
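
As a rough illustration of the kind of counting involved, here is a minimal sketch of bigram counts built from a cleaned character vector (clean_text is assumed to hold lines cleaned in the same way as inside get.frequent.words; this is not the final implementation):

tokens <- unlist(strsplit(clean_text, "\\s+"))          # split the cleaned lines into words
tokens <- tokens[tokens != ""]                          # drop any empty tokens
bigrams <- paste(head(tokens, -1), tail(tokens, -1))    # pair each word with the word that follows it
bigram_freq <- sort(table(bigrams), decreasing = TRUE)  # count and rank the pairs
head(bigram_freq, 10)

Note that this naive version also pairs the last word of one line with the first word of the next; a real implementation would treat lines separately.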