Synopsis

The purpose of this project is to build a prediction model, similar to SwiftKey's, that suggests the most likely next word based on the previous word(s) supplied by the user. This report covers reading in the data and cleaning it so that it is suitable for generating groups of words, and it includes an exploratory analysis of the frequency of those groups of words (n-grams). An n-gram is a contiguous sequence of n words; for example, "thanks for the" is a trigram.
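
As a small illustration of n-grams (a sketch only, on a made-up phrase, using the ngram package that is loaded in the R code section below):

library(ngram)

# Made-up phrase, not taken from the project data
example_phrase <- "thanks for the follow and thanks for the support"

# List every trigram (n = 3) in the phrase together with its frequency
get.phrasetable(ngram(example_phrase, n = 3))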

Data manipulation

The training data used for this project consists of three files, all in US-English:
- US blogs
- US news
- US twitter

Read in the data

# Twitter data
con_US_twitter <- file("./DATA/final/en_US/en_US.twitter.txt", "r")
US_twitter <- readLines(con_US_twitter)
close(con_US_twitter)

# Blog data
con_US_blog <- file("./DATA/final/en_US/en_US.blogs.txt", "r")
US_blog <- readLines(con_US_blog)
close(con_US_blog)

# News data
con_US_news <- file("./DATA/final/en_US/en_US.news.txt", "r")
US_news <- readLines(con_US_news)
close(con_US_news)
#File sizes
fs_UStwitter <- file.size("./DATA/final/en_US/en_US.twitter.txt")/1024^2
fs_USblog <- file.size("./DATA/final/en_US/en_US.blogs.txt")/1024^2
fs_USnews <- file.size("./DATA/final/en_US/en_US.news.txt")/1024^2
#Word count
wc_UStwitter <- wordcount(US_twitter, " ")
wc_USblog <- wordcount(US_blog, " ")
wc_USnews <- wordcount(US_news, " ")
#Line count
lc_UStwitter <- length(US_twitter)
lc_USblog <- length(US_blog)
lc_USnews <- length(US_news)

Produce some summary information on the three datasets, including the size of each file, the number of words, and the number of lines.

US_summary <- tibble(File_Name = c("US Twitter", "US Blogs", "US News"),
                     `File_Size (MB)` = round(c(fs_UStwitter, fs_USblog, fs_USnews),2),
                     Word_Count = c(wc_UStwitter, wc_USblog, wc_USnews),
                     Line_Count = c(lc_UStwitter, lc_USblog, lc_USnews))
US_summary
## # A tibble: 3 x 4
##   File_Name  `File_Size (MB)` Word_Count Line_Count
##   <chr>                 <dbl>      <int>      <int>
## 1 US Twitter             159.   30373543    2360148
## 2 US Blogs               200.   37334131     899288
## 3 US News                196.    2643969      77259

The Blogs data set is the largest file and contains the most words. The Twitter file is the smallest, yet it holds far more words and lines than the News file; the short, abbreviated words and slang typical of tweets may explain Twitter's high word count relative to its file size. The News file, by contrast, has a very low word and line count for its 196 MB, which may indicate that readLines stopped at an embedded control character before reaching the end of the file; this is worth checking before the full data set is used.
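
As a quick, hedged check (not part of the original analysis), the News file could be re-read in binary mode (with skipNul = TRUE as a precaution), so that embedded control characters do not end the read early, and the two line counts compared:

# Re-read the News file in binary mode so an embedded control character
# cannot cut the read short (assumption: this explains the low counts)
con_US_news_check <- file("./DATA/final/en_US/en_US.news.txt", "rb")
US_news_check <- readLines(con_US_news_check, encoding = "UTF-8", skipNul = TRUE)
close(con_US_news_check)

length(US_news_check)   # line count from the binary-mode read
length(US_news)         # line count from the original read above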

Combine the three datasets

US_data <- c(US_twitter, US_news, US_blog)
saveRDS(US_data, "./DATA/US_data.RDS")

Take a sample of the dataset

Sampling will reduce the run time. Let's start with 1,000 lines and consider whether more are required at a later stage. The lines are sampled at random (without replacement) from the combined data set, so all three sources contribute.

set.seed(0)
US_sample_1000 <- sample(US_data, 1000)
head(US_sample_1000)
## [1] "You don�t know life until you know why you�re living it. You must find your own purpose in it"                                                                                                                                                                                                                                                                          
## [2] "Ugh. Arrived 1 min late! You're evil! Now I don't feel so bad about sending you out for (xmas party) space heaters 2 years ago! :)"                                                                                                                                                                                                                                         
## [3] "Unfortunately true-- RT : There have been more tweets about 1 Joe MVP vote than Atl not getting the right call."                                                                                                                                                                                                                                                            
## [4] "So while I was watching the St. Louis Cardinals bring home the World Series trophy, I was busy creating cards for various challenges. With a busy weekend and start to this week I haven't had a chance to post them yet, so I figured I would lump them all in one post. I found a couple new simple card designs that I love, I hope to inspire you with these creations!"
## [5] "Gozer was very big in Sumeria."                                                                                                                                                                                                                                                                                                                                             
## [6] "She mustnâ\200\231t listen to the voices,"

Clean the data

Clean the data by performing the following actions:
- convert all words to lower case
- adjust incomplete sentences
- replace contractions
- remove word elongation
- replace internet slang with full words
- remove URLs and email addresses
- remove hash symbols
- remove non-ASCII characters
- remove certain emoticons
- strip out excess whitespace
- remove punctuation

This is all done using the clean_text function (see the R code section below); a brief illustration of a few of the individual cleaning helpers follows.
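
For example (a small sketch on made-up strings; the exact replacements depend on textclean's dictionaries):

library(textclean)

# Expand a contraction
replace_contraction("I can't wait")

# Collapse elongated words ("soooo goooood" becomes "so good")
replace_word_elongation("This is soooo goooood")

# Remove a URL from the text
replace_url("More details at https://example.com tomorrow")

The whole sample is then cleaned in a single call: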

US_sample_1000_cleaned <- clean_text(US_sample_1000)

Create corpus

# Wrapping the cleaned sample in list() makes it a single document in the corpus
US_corpus_vec <- VCorpus(VectorSource(list(US_sample_1000_cleaned)),
                         readerControl = list(reader=readPlain, language="english"))
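
As a quick sanity check (a sketch only), the corpus should contain a single plain-text document holding the whole cleaned sample:

length(US_corpus_vec)                    # expected: 1 document
meta(US_corpus_vec[[1]], "language")     # expected: "english"
head(content(US_corpus_vec[[1]]), 2)     # first two cleaned lines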

Remove profanity

Import a list of profanities

bad_words <- readLines("./DATA/badwords.txt")

This list of bad words is from https://www.cs.cmu.edu/~biglou/resources/bad-words.txt

Remove the bad words and any remaining punctuation from the corpus.

US_corpus_vec <- tm_map(US_corpus_vec, removeWords, bad_words)
US_corpus_vec <- tm_map(US_corpus_vec, removePunctuation)

N-gram tokenization

Create functions that produce groups of one, two and three words using NGramTokenizer from the RWeka package.

unigramToken <- function(x) NGramTokenizer(x,Weka_control(min=1, max=1))
bigramToken <- function(x) NGramTokenizer(x,Weka_control(min=2, max=2))
trigramToken <- function(x) NGramTokenizer(x,Weka_control(min=3, max=3))
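
As a small illustration on a made-up sentence (not project data), the bigram tokenizer returns every overlapping pair of adjacent words:

bigramToken("thanks for the follow")
# expected (roughly): "thanks for" "for the" "the follow"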

Count how many times each group of words appears in the corpus by building a term-document matrix for each n-gram size.

unigrams <- as.matrix(TermDocumentMatrix(US_corpus_vec, control = list(tokenize = unigramToken)))
bigrams <- as.matrix(TermDocumentMatrix(US_corpus_vec, control = list(tokenize = bigramToken)))
trigrams <- as.matrix(TermDocumentMatrix(US_corpus_vec, control = list(tokenize = trigramToken)))

Convert outputs to tibbles to construct some graphs

unigrams_df <- tibble(word = rownames(unigrams),
                      count = unigrams[,1])
bigrams_df <- tibble(word = rownames(bigrams),
                     count = bigrams[,1])
trigrams_df <- tibble(word = rownames(trigrams),
                      count = trigrams[,1])

Sort the word combinations in descending order of frequency

ordered_unigrams <- unigrams_df %>% arrange(desc(count))
ordered_bigrams <- bigrams_df %>% arrange(desc(count))
ordered_trigrams <- trigrams_df %>% arrange(desc(count))
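
These ordered frequency tables are the raw material for the eventual next-word prediction model. As a hedged sketch of the idea (the predict_next_word helper below is hypothetical and not part of this report's code), the most frequent trigram beginning with the user's last two words would supply the predicted word:

# Hypothetical helper: given the user's last two words, return the final word
# of the most frequent matching trigram (NA if nothing matches)
predict_next_word <- function(prev_two_words, trigram_table = ordered_trigrams) {
  prev_two_words <- tolower(trimws(prev_two_words))
  matches <- trigram_table %>%
    filter(str_detect(word, paste0("^", prev_two_words, " ")))
  if (nrow(matches) == 0) return(NA_character_)
  word(matches$word[1], -1)   # last word of the top-ranked trigram
}

# Example call (the result depends on the sampled data):
# predict_next_word("thanks for")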

Exploratory data analysis

Unigrams

ggplot(ordered_unigrams[1:10,] %>% mutate(word_desc = fct_reorder(word, desc(count))), aes(x = word_desc, y = count)) +
  geom_bar(stat = "identity", fill = "red") +
  ggtitle("Histogram of 10 most common unigrams") +
  xlab("Unigrams") +
  ylab("Frequency") 

Word cloud of unigrams that appear at least 10 times; more frequent words are drawn larger.

wordcloud(words = ordered_unigrams$word, freq = ordered_unigrams$count, min.freq = 10, colors = brewer.pal(12,"Paired"), scale = c(6,0.5))

Bigrams

ggplot(ordered_bigrams[1:10,] %>% mutate(word_desc = fct_reorder(word, desc(count))), aes(x = word_desc, y = count)) +
  geom_bar(stat = "identity", fill = "green") +
  ggtitle("Histogram of 10 most common bigrams") +
  xlab("Bigrams") +
  ylab("Frequency") 

Trigrams

ggplot(ordered_trigrams[1:10,] %>% mutate(word_desc = fct_reorder(word, desc(count))), aes(x = word_desc, y = count)) +
  geom_bar(stat = "identity", fill = "blue") +
  ggtitle("Histogram of 10 most common trigrams") +
  xlab("Trigrams") +
  ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Observations

Next steps

R code

Import libraries

library(tidyverse)   # tibble, dplyr, ggplot2, stringr and forcats
library(tm)          # corpus handling and term-document matrices
library(textclean)   # replace_*() helpers and strip() used in clean_text
library(qdapRegex)   # regex-based text cleaning utilities
library(textshape)   # text reshaping utilities
library(ngram)       # wordcount()
library(RWeka)       # NGramTokenizer() for n-gram tokenization
library(wordcloud)   # wordcloud(); attaches RColorBrewer for brewer.pal()

clean_text function

clean_text <- function(text){
  
  text %>%
    tolower() %>%                                # convert to lower case
    replace_incomplete(replacement = " ") %>%    # replace incomplete-sentence end marks (e.g. "...") with a space
    replace_contraction() %>%                    # expand contractions (e.g. "can't")
    str_replace_all("haven't", "have not") %>%   # contractions not handled above
    str_replace_all("hadn't", "had not") %>%
    replace_word_elongation() %>%                # "soooo" becomes "so"
    replace_internet_slang() %>%                 # replace internet slang with full words
    replace_url() %>%                            # remove URLs
    replace_email() %>%                          # remove email addresses
    replace_hash() %>%                           # remove hashtags
    replace_non_ascii() %>%                      # remove non-ASCII characters
    str_replace_all(":\\)|;\\)", "") %>%         # remove :) and ;) emoticons
    str_replace_all("<3", "") %>%                # remove heart emoticon
    str_replace_all("^:+ ", "") %>%              # remove leading colons
    stripWhitespace() %>%                        # collapse excess whitespace
    strip()                                      # strip remaining punctuation and other non-letter characters
}
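
As a final illustration, clean_text applied to a single made-up string (the exact output depends on textclean's replacement dictionaries):

# Made-up input touching several of the cleaning steps:
# slang, word elongation, an emoticon, a hashtag and a URL
clean_text("OMG I looooove this :) #blessed http://example.com")
# expected: a single lower-case string with the slang expanded, the
# elongation collapsed, and the emoticon, hashtag and URL removed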