Introduction

Description of capstone

The following is a description of the capstone project from the Coursera course materials.

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

I went to the

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.

This course will start with the basics, analyzing a large corpus of text documents to discover the structure in the data and how words are put together. It will cover cleaning and analyzing text data, then building and sampling from a predictive text model. Finally, you will use the knowledge you gained in data products to build a predictive text product you can show off to your family, friends, and potential employers.

Description of milestone report

The dataset to be used to build predictive text models includes a number of blog posts, news articles (or blurbs), and tweets in English.

In the milestone report, the task is to read in the data, perform some exploratory analysis, and summarize a basic outline of a plan for using the data to create a prediction algorithm and Shiny app.

Libraries

Load tidyverse libraries including stringr, ggplot2, tidyr, and dplyr.

Also load tm (text mining library).

library(ggplot2)
library(tidyr)
library(dplyr)
library(stringr)
library(tm)

Downloading and reading in the data

First, download and unzip the data.

download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",destfile="Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")

There will now be the following three text files under the current working directory:

  1. final/en_US/en_US.blogs.txt
  2. final/en_US/en_US.news.txt
  3. final/en_US/en_US.twitter.txt

Use a simple readLines command to read each one in. Read in all lines of the blogs and news files; for the Twitter data, let's start with just the first 300,000 lines and see what happens.

blogs <- readLines("final/en_US/en_US.blogs.txt")
news <- readLines("final/en_US/en_US.news.txt")
twitter_first_300k <- readLines("final/en_US/en_US.twitter.txt",n=300000)
## Warning in readLines("final/en_US/en_US.twitter.txt", n = 3e+05): line
## 167155 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", n = 3e+05): line
## 268547 appears to contain an embedded nul

We get a warning message when reading in the Twitter data. Looking at the lines R warns about, they all contain “^@” if you open the file in vi; these are special characters representing an embedded NUL (null byte).

Let’s run a system command to make a copy of the Twitter data we can play with outside of R.

system("cp final/en_US/en_US.twitter.txt final/en_US/en_US.twitter.corrected.txt")

Outside of R, open final/en_US/en_US.twitter.corrected.txt in vi and correct it using the instructions described here:

https://unix.stackexchange.com/questions/217010/search-and-replace-control-characters-m-i-in-vi

Mainly, use “:%s/\%x00/ /g” to replace the NUL characters with spaces in vi, then “:wq” to save.
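
As an aside, an alternative that avoids editing the file by hand is to skip the embedded NULs at read time with the skipNul argument of readLines. This is just an option; the rest of this report uses the corrected file.

# Alternative (not run here): drop embedded NULs while reading the original file
# twitter <- readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE)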

Let’s try reading in the whole Twitter data set now, from the corrected file.

twitter <- readLines("final/en_US/en_US.twitter.corrected.txt")

No more warnings! We are ready to start exploring the data.

Basic data exploration

Since we used the “readLines” command, each result is a character vector with one element per line of the file.

So, we can use a simple set of “length” commands to check the number of lines in each file.

print("Number of lines in blogs:")
## [1] "Number of lines in blogs:"
num_lines_in_blogs <- length(blogs)
num_lines_in_blogs
## [1] 899288
print("Number of lines in news:")
## [1] "Number of lines in news:"
num_lines_in_news <- length(news)
num_lines_in_news
## [1] 1010242
print("Number of lines in Twitter:")
## [1] "Number of lines in Twitter:"
num_lines_in_twitter <- length(twitter)
num_lines_in_twitter
## [1] 2360148

Next, use str_count from the stringr package to estimate the number of words per line in each data set: count each line's runs of whitespace and add one.

word_counts_blogs <- str_count(blogs,'\\s+')+1
word_counts_news <- str_count(news,'\\s+')+1
word_counts_twitter <- str_count(twitter,'\\s+')+1
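
As a quick sanity check of this whitespace-counting approach, here is a toy example (not part of the analysis):

# Four runs of whitespace plus one gives a count of 5 words
str_count("I went to the store", '\\s+') + 1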

Use a simple set of histograms to examine the words per line in each file.

par(mfrow=c(2,2))
hist(word_counts_blogs,xlab="Word count",ylab="Number of lines",main="Blogs")
hist(word_counts_news,xlab="Word count",ylab="Number of lines",main="News")
hist(word_counts_twitter,xlab="Word count",ylab="Number of lines",main="Twitter")
plot.new()

par(mfrow=c(1,2))
hist(word_counts_blogs[word_counts_blogs < 500],xlab="Word count",ylab="Number of lines",main="Blogs\n<500 words only")
hist(word_counts_news[word_counts_news < 100],xlab="Word count",ylab="Number of lines",main="News\n<100 words only")

As we would expect based on the nature of the data, most lines in the Twitter data set contain relatively few words.

However, we also find that the blogs and news data sets contain many lines with relatively few words.

One rule of thumb is that there are usually 100 to 200 words in a paragraph (https://wordcounter.net/blog/2016/01/07/10986_how-many-words-paragraph.html). So it seems like many of the entries we have for blogs and news are at most a paragraph in length, and thus may actually be blurbs instead of full-length articles.
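
If we wanted to quantify this, we could compute the proportion of lines containing at most roughly a paragraph's worth of words. A quick sketch (values not shown here):

# Fraction of lines with at most ~200 words (roughly one paragraph)
mean(word_counts_blogs <= 200)
mean(word_counts_news <= 200)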

Let’s collect all the summary statistics we’ve looked at so far into one table. Also, get the size of each file in MB and include it in the same table.

Following Michael Lee (https://rpubs.com/michael-lee/137667), convert each file size to MB by dividing by 1024^2.

size_blogs <- file.info("final/en_US/en_US.blogs.txt")$size/(1024^2)
size_news <- file.info("final/en_US/en_US.news.txt")$size/(1024^2)
size_twitter <- file.info("final/en_US/en_US.twitter.txt")$size/(1024^2)

data.frame(data = c("blogs","news","twitter"),
    file.size = c(size_blogs,size_news,size_twitter),
    line.num = c(num_lines_in_blogs,num_lines_in_news,num_lines_in_twitter),
    total.words = c(sum(word_counts_blogs),sum(word_counts_news),sum(word_counts_twitter)),
    median.words.per.line = c(median(word_counts_blogs),median(word_counts_news),median(word_counts_twitter)))
##      data file.size line.num total.words median.words.per.line
## 1   blogs  200.4242   899288    37334131                    28
## 2    news  196.2775  1010242    34372530                    31
## 3 twitter  159.3641  2360148    30373585                    12

As we also saw from the histograms, the median number of words per line is much higher in blogs and news than in the Twitter data, which makes sense given the character limit on tweets.

Downsampling and cleaning data

Let’s downsample each data set to 10,000 documents.

Then, create a corpus combining the samples from each data type.

Also do some minor cleaning, including converting to lowercase, removing punctuation, removing numbers, and removing English stopwords.

Some of the cleaning code inspired by Dimitris Triantafyllou (https://rstudio-pubs-static.s3.amazonaws.com/68968_fe248cab46834aab997f826eab454a45.html).

set.seed(1392)

blogs_sample <- sample(blogs,10000)
news_sample <- sample(news,10000)
twitter_sample <- sample(twitter,10000)
# Strip punctuation from the stopword list (e.g. "don't" becomes "dont") so it matches the cleaned text
english_stopwords <- removePunctuation(stopwords("english"))

create_corpus <- function(text){
    # Replace curly apostrophes with straight ones before cleaning
    text <- str_replace_all(text,pattern="’",replacement="'")
    # Build a corpus with one document per line of input text
    my.corpus <- VCorpus(DataframeSource(data.frame(doc_id = 1:length(text),text = text)))
    my.corpus <- tm_map(my.corpus, content_transformer(tolower))
    my.corpus <- tm_map(my.corpus, removePunctuation)
    my.corpus <- tm_map(my.corpus, removeNumbers)
    my.corpus <- tm_map(my.corpus, removeWords, english_stopwords)
    my.corpus <- tm_map(my.corpus, stripWhitespace)
    return(my.corpus)
}

all_three_datasets_sample <- create_corpus(c(blogs_sample,news_sample,twitter_sample))
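
To spot-check the cleaning, we can look at the content of one cleaned document; the exact text will depend on the random sample.

# Inspect the first cleaned document (lowercased, punctuation/numbers/stopwords removed)
as.character(all_three_datasets_sample[[1]])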

Most common words and combinations of words in the combined sample

As suggested in the tm package FAQs (http://tm.r-forge.r-project.org/faq.html#Bigrams), we can create a custom tokenizing function to look at n-grams (treating n words as a single phrase).

Then, pass this function to tm’s DocumentTermMatrix.

Look at single words, two word combinations (bigrams), and three word combinations (trigrams).
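
To illustrate what the n-gram tokenizer produces, here is a toy example using NLP's ngrams function on a pre-split sentence (for illustration only; it is not used below):

# Bigrams of a toy sentence: "i went", "went to", "to the", "the store"
unlist(lapply(ngrams(c("i","went","to","the","store"), 2), paste, collapse = " "))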

DocumentTermMatrix_n_words <- function(corpus,n){
    # Tokenizer that splits each document into n-grams (n consecutive words)
    tokenizer_function <- function(x){
        unlist(lapply(ngrams(words(x),n), paste, collapse = " "), use.names = FALSE)
    }

    DocumentTermMatrix_raw_result <- DocumentTermMatrix(corpus,control=list(tokenize=tokenizer_function))

    # Convert the sparse matrix slots into a long data frame of (document, term, count)
    DocumentTermMatrix_processed <- data.frame(Document = DocumentTermMatrix_raw_result$i,
        Word.or.phrase = DocumentTermMatrix_raw_result$dimnames$Term[DocumentTermMatrix_raw_result$j],
        Occurrences = DocumentTermMatrix_raw_result$v)
    # Only interested in how many documents contain a term, not occurrences within a single document
    DocumentTermMatrix_processed$Occurrences[DocumentTermMatrix_processed$Occurrences > 1] <- 1
    DocumentTermMatrix_processed <- DocumentTermMatrix_processed %>%
        group_by(Word.or.phrase) %>%
        summarize(Num.documents = sum(Occurrences))

    DocumentTermMatrix_processed <- data.frame(DocumentTermMatrix_processed,stringsAsFactors=FALSE)

    # Rank terms by the number of documents they appear in, and record which corpus and n were used
    DocumentTermMatrix_processed <- data.frame(DocumentTermMatrix_processed[order(DocumentTermMatrix_processed$Num.documents,decreasing=TRUE),],
        Data = deparse(substitute(corpus)),
        Rank.in.data = 1:nrow(DocumentTermMatrix_processed),
        Num.words = n,
        stringsAsFactors=FALSE)

    return(DocumentTermMatrix_processed)
}

ngrams_per_data_set <- DocumentTermMatrix_n_words(all_three_datasets_sample,1)
ngrams_per_data_set <- rbind(ngrams_per_data_set,DocumentTermMatrix_n_words(all_three_datasets_sample,2))
ngrams_per_data_set <- rbind(ngrams_per_data_set,DocumentTermMatrix_n_words(all_three_datasets_sample,3))
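
As a quick check, we can see how many distinct terms were found for each n-gram size (the counts will vary with the random sample):

# Number of distinct terms per n-gram size
table(ngrams_per_data_set$Num.words)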

# Make sure the term column is a plain character vector rather than a factor
ngrams_per_data_set$Word.or.phrase <- as.vector(ngrams_per_data_set$Word.or.phrase)

ngrams_per_data_set_flt <- ngrams_per_data_set[ngrams_per_data_set$Rank.in.data <= 30,]

ngrams_per_data_set_flt <- ngrams_per_data_set_flt[order(ngrams_per_data_set_flt$Rank.in.data),]

ggplot(ngrams_per_data_set_flt[ngrams_per_data_set_flt$Num.words == 1,],
       aes(x=reorder(Word.or.phrase,-Num.documents),Num.documents)) +
    geom_bar(stat="identity") +
    xlab("Term") +
    ylab("Number of documents") +
    ggtitle("Top 30 most common single words") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(ngrams_per_data_set_flt[ngrams_per_data_set_flt$Num.words == 2,],
       aes(x=reorder(Word.or.phrase,-Num.documents),Num.documents)) +
    geom_bar(stat="identity") +
    xlab("Term") +
    ylab("Number of documents") +
    ggtitle("Top 30 most common two-word phrases") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(ngrams_per_data_set_flt[ngrams_per_data_set_flt$Num.words == 3,],
       aes(x=reorder(Word.or.phrase,-Num.documents),Num.documents)) +
    geom_bar(stat="identity") +
    xlab("Term") +
    ylab("Number of documents") +
    ggtitle("Top 30 most common three-word phrases") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

We start to see some interesting patterns! For example, “love” is in the top 30 most frequent single words. We also see a lot of time-based references.

At least one of the trigrams (e.g. “illinois incorporated item”) is a bit odd. However, the number of documents containing some of these trigrams is quite low, and there is no guarantee that the documents are independent (near-duplicate lines or shared boilerplate can inflate a specific phrase). So it makes sense that a few oddly specific phrases show up here.

Future directions

One possible future direction is to use the most frequent bigrams and trigrams to help predict the next word. For example, if a phrase ends in the word “last”, we might predict from the bigrams that the next word is likely to be “year”, “week”, or “night”. If no bigrams or trigrams start with the last word or two of the input, we might fall back to suggesting one of the most common single words. A rough sketch of this backoff idea is shown below.
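
Purely as an illustration of this backoff idea, here is a sketch that derives a simple prefix-to-next-word lookup from the ngrams_per_data_set table built above. The predict_next_word function and the prefix and next.word columns are names invented for this sketch, not objects created earlier in the report.

# Hypothetical lookup table: for each bigram/trigram, split off the final word
ngram_lookup <- ngrams_per_data_set[ngrams_per_data_set$Num.words > 1,]
ngram_lookup$prefix <- sub("\\s+\\S+$", "", ngram_lookup$Word.or.phrase)
ngram_lookup$next.word <- sub("^.*\\s+", "", ngram_lookup$Word.or.phrase)

# Fallback: the three most common single words in the sample
top_single_words <- ngrams_per_data_set$Word.or.phrase[ngrams_per_data_set$Num.words == 1][1:3]

predict_next_word <- function(last_words){
    # Look for n-grams whose prefix exactly matches the end of the input
    matches <- ngram_lookup[ngram_lookup$prefix == last_words,]
    if (nrow(matches) > 0) {
        # Return the next words of the most frequent matching n-grams
        head(matches$next.word[order(matches$Num.documents, decreasing = TRUE)], 3)
    } else {
        # Back off to the most common single words
        top_single_words
    }
}

predict_next_word("last")

A real prediction model would also need to handle longer prefixes, unseen n-grams (e.g. via smoothing), and the stopwords that were removed during cleaning, since a keyboard app must be able to suggest common words like “the”.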