Executive Summary

In this report, I examine the file and word structure of three combined text files for the Data Science Coursera Capstone Project. I use only the English-language files compiled from Twitter, blogs, and news sources. I investigate the most common words and phrases in the text, both with and without common linker words such as ‘the’, ‘of’, and ‘and’. Finally, I outline how I plan to use a model to predict the next word.

Exploratory Data Analysis

First, I read in the three data files and explore the texts. Each file is quite large.
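
The exact read-in code is not shown in this report; a minimal sketch, assuming the standard en_US file names from the capstone download, is:

#read in the three English files (file names are an assumption)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)

#summarize size on disk (MB) and number of lines per file
files <- c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt")
data.frame(file = c("twitter", "blogs", "news"),
           size_MB = round(file.size(files) / 1024^2, 1),
           lines = c(length(twitter), length(blogs), length(news)))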

Here I begin to explore the text. Since the data sets are so large, let's start with 1% of each file and combine the samples to see what the most common words and phrases are. To do this, I use a combination of the tm and quanteda packages; I found quanteda to be much faster on my machine for tokenizing words and phrases (n-grams).

#choose a random 1% from each text file and combine them into one training corpus
twitter_sample <- subset_text(twitter, 0.01)
blogs_sample <- subset_text(blogs, 0.01)
news_sample <- subset_text(news, 0.01)
all_text <- c(twitter_sample, blogs_sample, news_sample)
token <- Corpus(VectorSource(all_text))
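
Note that subset_text is a small helper defined in a setup chunk that is not shown here; a minimal sketch, assuming it simply samples the requested fraction of lines, would be:

#assumed implementation of subset_text: randomly sample a fraction of the lines
subset_text <- function(text, fraction) {
    set.seed(1234) #fix the seed so the sample is reproducible
    sample(text, size = floor(length(text) * fraction))
}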

I noticed that there are many non-English characters in my text files. They will make the model prediction much harder, so let's remove them, along with punctuation, numbers, profanity, and extra white space. I also transformed all words to lowercase to make it easier to compare across samples.

#look at all the unique characters to see what needs to be removed
uniqchars <- unique(strsplit(paste(all_text, sep = " ", collapse = " "), "")[[1]])
print(uniqchars)
##   [1] "S"          "i"          "n"          "g"          " "         
##   [6] "t"          "!"          "G"          "h"          "o"         
##  [11] "s"          "H"          "u"          "e"          "r"         
##  [16] "m"          "a"          "k"          "c"          "y"         
##  [21] ":"          "("          "<"          "3"          "T"         
##  [26] "f"          "C"          "l"          "p"          "w"         
##  [31] "d"          "v"          "A"          "L"          "."         
##  [36] "'"          "I"          "«"          "b"          "»"         
##  [41] "1"          "0"          "V"          "D"          ","         
##  [46] "z"          "8"          "9"          "?"          "W"         
##  [51] "X"          "F"          "="          "q"          "-"         
##  [56] ")"          "R"          "P"          "K"          "Y"         
##  [61] "O"          "N"          "E"          "x"          "\""        
##  [66] "U"          "#"          "~"          "&"          "2"         
##  [71] "j"          "M"          "B"          "/"          "“"         
##  [76] "”"          ""          "♥"          "\u270a"     "Q"         
##  [81] ";"          "5"          "6"          "J"          "*"         
##  [86] "_"          "4"          "Z"          "@"          "7"         
##  [91] "\U0001f60d" "]"          "\\"         "+"          "$"         
##  [96] "\U0001f60a" "\U0001f64f" "["          "❤"          "%"         
## [101] "\U0001f618" "✌"          "\U0001f614" "\U0001f52b" "\U0001f49c"
## [106] "’"          ">"          "\U0001f61d" "…"          "^"         
## [111] "\U0001f603" "\U0001f612" "\U0001f630" "º"          "é"         
## [116] "\U0001f44d" "\U0001f620" "\U0001f44c" "|"          "–"         
## [121] ""          ""          "\U0001f609" "\U0001f61c" "\U0001f61a"
## [126] "®"          "☼"          "\U0001f62d" "☁"          "Þ"         
## [131] "☺"          "♪"          ""          ""          "\U0001f499"
## [136] "\U0001f3c3" ""          ""          ""          ""         
## [141] "\U0001f343" "\U0001f3c0" "\U0001f431" "‘"          "☀"         
## [146] "\U0001f525" "\U0001f633" "\U0001f632" "♡"          "▒"         
## [151] "\U0001f616" "\U0001f4f1" "\U0001f442" "►"          "◄"         
## [156] "ó"          "♬"          "\U0001f4a4" "‎"           "•"         
## [161] "\u26be"     "\U0001f61e" ""          ""          "£"         
## [166] "\U0001f631" "—"          "ð"          "\u009f"     "\u0098"    
## [171] "¢"          "\u0092"     "\u0094"     "\U0001f389" "♀"         
## [176] ""          "\U0001f35f" "\U0001f388" "{"          "}"         
## [181] "\U0001f601" ""          "\U0001f33e" "\U0001f68c" "✍"         
## [186] "\U0001f625" "\U0001f3b6" "\U0001f3b5" "✈"          "\U0001f48b"
## [191] "\u2728"     "\U0001f44b" "″"          "\U0001f44e" "°"         
## [196] "©"          "´"          ""          "―"          "☆"         
## [201] "ü"          "\U0001f49b" "ⓞ"          "ⓥ"          "ⓔ"         
## [206] "π"          "\u274c"     "\u2b55"     "\U0001f602" "\U0001f44a"
## [211] "\U0001f4a2" ""          ""          ""          "\U0001f3b8"
## [216] "¡"          ""          "\U0001f60f" ""          ""         
## [221] ""          ""          ""          "♔"          ""         
## [226] "\U0001f64c" "\U0001f47c" "\U0001f623" "\U0001f418" ""         
## [231] "\U0001f480" "\U0001f497" "`"          "�"          "\U0001f451"
## [236] "\U0001f436" "\U0001f46f" "‑"          "\U0001f622" "⁰"         
## [241] "\U0001f493" "\U0001f60c" "\U0001f4aa" ""          "\U0001f647"
## [246] "✰"          "≠"          ""          "€"          "ñ"         
## [251] "\U0001f47d" "\U0001f4a9" "\U0001f6bd" "‏"           "ç"         
## [256] "ê"          "\U0001f604" "\U0001f346" "\U0001f494" "★"         
## [261] "\u26a1"     "\u2614"     "\U0001f449" "\U0001f374" "\U0001f378"
## [266] "\U0001f370" "\U0001f33b" "←"          "๏"          "\u26ea"    
## [271] "·"          ""          ""          "♺"          "™"         
## [276] "\U0001f459" ""          "È"          ""          "\U0001f42f"
## [281] "\U0001f1ea" "\U0001f1f8" "\U0001f472" ""          "\U0001f385"
## [286] "\u270b"     "ë"          "á"          "½"          ""         
## [291] "♣"          "☐"          "☑"          ""          "\U0001f381"
## [296] "\U0001f382" "ū"          "\U0001f6c0" "\u2754"     ""         
## [301] "\u2615"     "¥"          "\U0001f4a8" "\U0001f498" ""         
## [306] ""          ""          ""          "¦"          "\U0001f621"
## [311] "¨"          "Ü"          "ā"          "ṁ"          "ṇ"         
## [316] "­"          "我"         "的"         "心"         "è"         
## [321] ""           "荒"         "木"         "経"         "惟"        
## [326] "ア"         "ラ"         "ー"         "キ"         "ḥ"         
## [331] "ś"          "à"          "ú"          "¼"          "Π"         
## [336] "α"          "ρ"          "σ"          "κ"          "ε"         
## [341] "υ"          "ή"          "δ"          "τ"          "ί"         
## [346] "ς"          "φ"          "ο"          "β"          "董"        
## [351] "志"         "鸿"         "′"          "å"          "ö"         
## [356] "¥"         "М"          "о"          "с"          "а"         
## [361] "д"          "п"          "р"          "т"          "и"         
## [366] "в"          "С"          "Р"          "К"          "Г"         
## [371] "Б"          "ï"          "天"         "坛"         "í"         
## [376] "â"          "š"          "×"          "√"          "ý"         
## [381] "Ш"          "к"          "陣"         "山"         "稲"        
## [386] "荷"         "神"         "御"         "商"         "売"        
## [391] "繁"         "昌"         "家"         "内"         "安"        
## [396] "全"         "交"         "通"         ""          "ブ"        
## [401] "リ"         "チ"         "ネ"         "タ"         "バ"        
## [406] "レ"         "§"          "¶"          "ô"          "ø"         
## [411] "†"          "◆"          "\u0097"     "\u0093"     "\u0095"    
## [416] "\u0096"     "ä"          "\u0091"     "É"          "​"          
## [421] "î"          "Â"          "⅓"          "♠"          "\u0090"    
## [426] "♦"          "¹"          "Á"
#Clean up the text
profanity <- c(t(read.csv(text = getURL("http://www.bannedwordlist.com/lists/swearWords.csv"), header = FALSE)))
#note: small filler (stop) words are deliberately kept for this first analysis
clean <- token %>%
    tm_map(content_transformer(tolower)) %>% #convert everything to lower
    tm_map(removeWords, profanity) %>% #remove profanity
    tm_map(removePunctuation)  %>% #remove punctuation
    tm_map(removeNumbers) %>% #remove numbers
    tm_map(stripWhitespace) #strip white space

#remove remaining non-ASCII characters, then check the unique characters again
dat <- sapply(clean, function(row) iconv(row, "latin1", "ASCII", sub = ""))

unique(strsplit(paste(dat, sep = " ", collapse = " "), "")[[1]])
##  [1] "s" "i" "n" "g" " " "t" "h" "o" "u" "e" "r" "m" "a" "k" "c" "y" "f"
## [18] "l" "p" "w" "d" "v" "b" "z" "x" "q" "j"
cleaned <- Corpus(VectorSource(dat))

Now that we have a clean corpus, let's tokenize it with quanteda and plot the most common words (unigrams).

#N-gram tokenizer
corp <- corpus(cleaned)
tok <- tokens(corp)
unigram <- tokens_ngrams(tok, n = 1)

#create document-feature matrix
uni <- colSums(dfm(unigram))
uni_df <- data.frame(word = names(uni), frequency = uni) %>%
    dplyr::mutate(word = factor(word)) %>%
    dplyr::arrange(desc(frequency))

#Plot the most popular words
ggplot(data = uni_df[1:30,]) +
    geom_bar(aes(x = factor(word, levels = unique(uni_df$word), ordered = T), y = frequency), stat = "identity") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90)) +
    labs(x = "Top words", y = "Frequency")

From this subset of the data, it looks like “the”, “to”, and “a” are the most popular words. That makes sense, because they are commonly used as linkers in text. Let's now look at the most common two- and three-word phrases (bigrams and trigrams).

#create most common bigrams
bigram <- tokens_ngrams(tok, n = 2)

#create document-feature matrix
bi <- colSums(dfm(bigram))
bi_df <- data.frame(word = names(bi), frequency = bi) %>%
    dplyr::mutate(word = factor(word)) %>%
    dplyr::arrange(desc(frequency))

#Plot the most popular bigrams
ggplot(data = bi_df[1:30,]) +
    geom_bar(aes(x = factor(word, levels = unique(bi_df$word), ordered = T), y = frequency), stat = "identity") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90)) +
    labs(x = "Top Bigrams", y = "Frequency")

#create most common trigrams
trigram <- tokens_ngrams(tok, n = 3)

#create document-feature matrix
tri <- colSums(dfm(trigram))
tri_df <- data.frame(word = names(tri), frequency = tri) %>%
    dplyr::mutate(word = factor(word)) %>%
    dplyr::arrange(desc(frequency))

#Plot the most popular trigrams
ggplot(data = tri_df[1:30,]) +
    geom_bar(aes(x = factor(word, levels = unique(tri_df$word), ordered = T), y = frequency), stat = "identity") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90)) +
    labs(x = "Top Trigrams", y = "Frequency")

From the above plots, it looks like “of the” and “in the” are by far the two most popular bigrams. The most popular trigrams are “one of the”, “a lot of”, and “thanks for the”. While these are the most frequent n-grams, the result is not particularly interesting, because the phrases are made up mostly of small filler words (e.g., “the”). Next, let's remove these stop words and repeat the same analyses, as sketched below.
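
The code for this second pass is analogous to the blocks above; a minimal sketch using quanteda's built-in English stop word list (the exact word list used may differ) is:

#remove common English stop words before rebuilding the n-grams
tok_nostop <- tokens_remove(tok, stopwords("en"))

#rank the unigrams without stop words; the bigrams and trigrams are rebuilt
#the same way with n = 2 and n = 3
uni_ns <- colSums(dfm(tokens_ngrams(tok_nostop, n = 1)))
uni_ns_df <- data.frame(word = names(uni_ns), frequency = uni_ns) %>%
    dplyr::arrange(desc(frequency))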

With the stop words removed, we find that “said”, “just”, and “one” are the most popular words; “right now”, “last year”, and “last night” are the most popular two-word phrases; and “happy mothers day”, “let us know”, and “new york city” are the most popular three-word phrases. This analysis gives us more insight into the content of the text we will be working with, but it fails to include popular linkers like “the” and “and”.

Goals for Shiny App

The application will have an input box where the user enters a string of text. I will create a model that predicts the next word from the frequency with which words are found together in my training dataset, using the same n-gram technique explored above.
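
As a rough illustration of the kind of lookup the app might perform (a sketch only, not the final model, which will also need back-off and smoothing for phrases that never appear in the training data):

#predict the next word from the most frequent trigram that starts with the
#last two words of the input (assumes at least two input words; trigram
#tokens are stored as word1_word2_word3, so the answer is the third piece)
predict_next <- function(input, tri_df) {
    words <- tail(unlist(strsplit(tolower(input), "\\s+")), 2)
    prefix <- paste0(paste(words, collapse = "_"), "_")
    matches <- tri_df[startsWith(as.character(tri_df$word), prefix), ]
    if (nrow(matches) == 0) return(NA_character_)
    sub(".*_", "", as.character(matches$word[1]))
}

predict_next("one of", tri_df) #should return "the", since "one of the" is the top trigram above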