Executive Summary

The purpose of this project is to build a predictive text model based on three text sources: blogs, news and tweets.

This report for the joint Coursera & Johns Hopkins University Data Science specialisation Capstone project summarises the data preprocessing and the exploratory data analysis of the provided text data sets. Based on this first analysis, a roadmap for the development of the prediction algorithm and the Shiny app is briefly outlined.

Instructions

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.

The motivation for this project is to:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.

  2. Create a basic report of summary statistics about the data sets.

  3. Report any interesting findings that you amassed so far.

  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Introduction

Even though no background reading is necessary to understand this report, I would like to suggest that readers peruse the following text, which influenced this analysis: Tidy Text Mining with R. See the next paragraph for more information.

The initial raw data was downloaded and saved locally for analysis.

# First we specify the source and destination of the dataset
destination_file <- "Coursera-SwiftKey.zip"
source_file <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# Execute the download (into the working directory set above)
download.file(source_file, destination_file)

# Extract the files from the zip file
unzip(destination_file)

For the purpose of this assignment we are only interested in the provided US English text documents: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.

Task “Exploratory data analysis” - Coursera

The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships observed in the data.

Data Cleanup

Three different corpora were provided. First, it is useful to get some idea of what these files contain.

So let’s dive in with some summary statistics!

# load the files
file1 <- file("E:/DS_Capstone/Dataset/en_US.blogs.txt","rb")
file2 <- file("E:/DS_Capstone/Dataset/en_US.news.txt","rb")
file3 <- file("E:/DS_Capstone/Dataset/en_US.twitter.txt","rb")

blog <- readLines(file1, encoding = 'UTF-8', skipNul = TRUE)
news <- readLines(file2, encoding = 'UTF-8', skipNul = TRUE)
twit <- readLines(file3, encoding = 'UTF-8', skipNul = TRUE)

# Get basic information and summary stats (stringi provides the text statistics, knitr the table)
library(stringi)
library(knitr)

stats_for_raw <- data.frame(
            FileName=c("en_US.blogs","en_US.news","en_US.twitter"),
            FileSizeinMB=c(file.info("E:/DS_Capstone/Dataset/en_US.blogs.txt")$size/1024^2,
                           file.info("E:/DS_Capstone/Dataset/en_US.news.txt")$size/1024^2,
                           file.info("E:/DS_Capstone/Dataset/en_US.twitter.txt")$size/1024^2),
            t(rbind(sapply(list(blog,news,twit),stri_stats_general),
            WordCount=sapply(list(blog,news,twit),stri_stats_latex)[4,]))
            )
kable(stats_for_raw)
FileName        FileSizeinMB     Lines  LinesNEmpty      Chars  CharsNWhite  WordCount
en_US.blogs         200.4242    899288       899288  206824382    170389539   37570839
en_US.news          196.2775   1010242      1010242  203223154    169860866   34494539
en_US.twitter       159.3641   2360148      2360148  162096241    134082806   30451170

As the data is provided as text files, we proceeded with a brief browse of each file.

After browsing the corpora, it is clear that some preprocessing is needed before continuing. But let's keep in mind that we ultimately want to build an application that predicts the specific word a user might want to write.

Many preprocessing steps are usually found in textbook approaches to corpus preparation.
These include:

  • removing punctuation
  • removing numbers
  • converting to lowercase
  • removing stopwords
  • stemming and lemmatisation
  • stripping whitespace
  • unnesting tokens

As we ultimately want to predict text, it was decided to proceed with two versions of the data: one with stopwords and one without. We assume that accurately predicting these commonly used words might improve the results. Foreign words and spelling mistakes were also removed.

# Let's first make some tidy data: we turn each corpus into a data frame using the dplyr package
library(dplyr)
library(tidytext)

blog_df <- data_frame(text=blog)
news_df <- data_frame(text=news)
twit_df <- data_frame(text=twit)

# Now we use tidytext's unnest_tokens function to turn each data frame into a "one-token-per-row" format

tidy_blog <- blog_df %>% unnest_tokens(word, text)
tidy_news <- news_df %>% unnest_tokens(word, text)
tidy_twit <- twit_df %>% unnest_tokens(word, text)

# A lot of stop words pop up; I will check whether it is useful to remove them for further analysis
# But just for information's sake I will keep both datasets ... memory is cheap, they say :o)

# compare stopwords removed blogs/tweets/news

# We can remove stop words (kept in the tidytext dataset stop_words) with an anti_join

data(stop_words)

tidy_blog_sw <- tidy_blog %>% anti_join(stop_words)
tidy_news_sw <- tidy_news %>% anti_join(stop_words)
tidy_twit_sw <- tidy_twit %>% anti_join(stop_words)
# Notice that punctuation has been stripped and tokens are converted to lowercase

# But we also need to remove numbers from these corpora

mystopwords <- data_frame(word = c(as.character(1:900000)))
tidy_blog_sw <- tidy_blog_sw %>%   anti_join(mystopwords)
tidy_news_sw <- tidy_news_sw %>%   anti_join(mystopwords)
tidy_twit_sw <- tidy_twit_sw %>%   anti_join(mystopwords)
# When browsing the files I saw both spelling mistakes and foreign words. I decided to remove them!
library(hunspell)

broom_tbs <- hunspell_check(as.vector(t(tidy_blog_sw)))
tbs_c = tidy_blog_sw[broom_tbs,]
broom_tb <- hunspell_check(as.vector(t(tidy_blog)))
tb_c = tidy_blog[broom_tb,]
broom_tns <- hunspell_check(as.vector(t(tidy_news_sw)))
tns_c = tidy_news_sw[broom_tns,]
broom_tn <- hunspell_check(as.vector(t(tidy_news)))
tn_c = tidy_news[broom_tn,]
broom_tts <- hunspell_check(as.vector(t(tidy_twit_sw)))
tts_c = tidy_twit_sw[broom_tts,]
broom_tt <- hunspell_check(as.vector(t(tidy_twit)))
tt_c = tidy_twit[broom_tt,]

rm(broom_tb,broom_tbs,broom_tn,broom_tns,broom_tt,broom_tts,blog,news,twit)

Task “Exploratory data analysis”

What are the 10 most frequently occurring words?

# Let us have a look at the most frequently occurring words in each corpus (once with and once without stopwords)

count_tbs <- tidy_blog_sw %>% count(word, sort = TRUE) %>% mutate(word = reorder(word, n))
count_tb <- tidy_blog %>% count(word, sort = TRUE) %>% mutate(word = reorder(word, n))
count_tns <- tidy_news_sw %>% count(word, sort = TRUE) %>% mutate(word = reorder(word, n))
count_tn <- tidy_news %>% count(word, sort = TRUE) %>% mutate(word = reorder(word, n))
count_tts <- tidy_twit_sw %>% count(word, sort = TRUE) %>% mutate(word = reorder(word, n))
count_tt <- tidy_twit %>% count(word, sort = TRUE) %>% mutate(word = reorder(word, n))

# prepare plots of the top 10 occurring words
library(ggplot2)
library(cowplot)

p_tbs <- top_n(count_tbs,10) %>% ggplot(aes(word, n, fill=n)) +
  geom_bar(stat = "identity") + xlab(NULL) +
  scale_fill_gradient(low="dodgerblue4",high="dodgerblue1")+
  theme(legend.position="none") + coord_flip()
p_tb <- top_n(count_tb,10) %>% ggplot(aes(word, n, fill=n)) +
  geom_bar(stat = "identity") + xlab(NULL) +
  scale_fill_gradient(low="dodgerblue4",high="dodgerblue1")+
  theme(legend.position="none") + coord_flip()
p_tns <- top_n(count_tns,10) %>% ggplot(aes(word, n, fill=n)) +
  geom_bar(stat = "identity") + xlab(NULL) +
  scale_fill_gradient(low="brown4",high="brown1")+
  theme(legend.position="none") + coord_flip()
p_tn <- top_n(count_tn,10) %>% ggplot(aes(word, n, fill=n)) +
  geom_bar(stat = "identity") + xlab(NULL) +
  scale_fill_gradient(low="brown4",high="brown1")+
  theme(legend.position="none") + coord_flip()
p_tts <- top_n(count_tts,10) %>% ggplot(aes(word, n, fill=n)) +
  geom_bar(stat = "identity") + xlab(NULL) +
  scale_fill_gradient(low="chartreuse4",high="chartreuse1")+
  theme(legend.position="none") + coord_flip()
p_tt <- top_n(count_tt,10) %>% ggplot(aes(word, n, fill=n)) +
  geom_bar(stat = "identity") + xlab(NULL) +
  scale_fill_gradient(low="chartreuse4",high="chartreuse1")+
  theme(legend.position="none") + coord_flip()
plot_grid(p_tbs, p_tb, p_tns, p_tn, p_tts, p_tt, labels=c("A", "B", "C", "D", "E", "F"), ncol = 2, nrow = 3) #need package cowplot

rm(p_tbs, p_tb, p_tns, p_tn, p_tts, p_tt,tidy_blog,tidy_news,tidy_twit)

Are there meaningful n-grams? (N-grams are contiguous sequences of words in the given corpora.)
These can be very useful for predicting the next word based on the previous 1, 2, or 3 words, as the toy example below illustrates.
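
As a purely illustrative toy example (the sentence is invented, not drawn from the corpora, and dplyr and tidytext are assumed to be loaded as above), this is how unnest_tokens splits raw text into bigrams; the same call with n = 3 would yield trigrams.

# toy illustration only: bigrams from a single made-up sentence
toy_df <- data_frame(text = "thanks for the follow and happy new year")
toy_df %>% unnest_tokens(ngram, text, token = "ngrams", n = 2)
# yields: "thanks for", "for the", "the follow", "follow and", ...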

The following values were calculated on a subset.

gc()
blog_n2 <- tidy_blog_sw %>% sample_n(300000) %>% unnest_tokens(ngram, word, token = "ngrams", n = 2)
news_n2 <- tidy_news_sw %>% sample_n(300000) %>% unnest_tokens(ngram, word, token = "ngrams", n = 2)
twit_n2 <- tidy_twit_sw %>% sample_n(300000) %>% unnest_tokens(ngram, word, token = "ngrams", n = 2)

blog_n3 <- tidy_blog_sw %>% sample_n(1000000) %>% unnest_tokens(ngram, word, token = "ngrams", n = 3)
news_n3 <- tidy_news_sw %>% sample_n(1000000) %>% unnest_tokens(ngram, word, token = "ngrams", n = 3)
twit_n3 <- tidy_twit_sw %>% sample_n(1000000) %>% unnest_tokens(ngram, word, token = "ngrams", n = 3)

blog_n4 <- tidy_blog_sw %>% sample_n(300000) %>% unnest_tokens(ngram, word, token = "ngrams", n = 4)
news_n4 <- tidy_news_sw %>% sample_n(300000) %>% unnest_tokens(ngram, word, token = "ngrams", n = 4)
twit_n4 <- tidy_twit_sw %>% sample_n(300000) %>% unnest_tokens(ngram, word, token = "ngrams", n = 4)

#blog_n5 <- tidy_blog %>% sample_n(3000) %>% unnest_tokens(ngram, word, token = "ngrams", n = 5)
#news_n5 <- tidy_news %>% sample_n(3000) %>% unnest_tokens(ngram, word, token = "ngrams", n = 5)
#twit_n5 <- tidy_twit %>%  sample_n(3000) %>% unnest_tokens(ngram, word, token = "ngrams", n = 5)


n2_list <- list(blogs = blog_n2, news = news_n2 , twit= twit_n2 )
n2_count <- function(words){ words %>% count(ngram, sort = TRUE) }
n2_frequency <- lapply(n2_list, n2_count)

p_bn2 <- head(n2_frequency$blogs,10) %>% ggplot(aes(x=reorder(ngram,n), y = n, fill = (n))) +
  geom_bar(stat="identity") + xlab(NULL) +
  scale_fill_gradient(low="dodgerblue4",high="dodgerblue1")+
  theme(legend.position="none") + coord_flip()
p_nn2 <- head(n2_frequency$news,10) %>% ggplot(aes(x=reorder(ngram,n), y = n, fill = (n))) +
  geom_bar(stat = "identity") + xlab(NULL) +
  scale_fill_gradient(low="dodgerblue4",high="dodgerblue1")+
  theme(legend.position="none") + coord_flip()
p_tn2 <- head(n2_frequency$twit,10) %>% ggplot(aes(x=reorder(ngram,n), y = n, fill = (n))) +
  geom_bar(stat = "identity") + xlab(NULL) +
  scale_fill_gradient(low="dodgerblue4",high="dodgerblue1")+
  theme(legend.position="none") + coord_flip()

plot_grid(p_bn2, p_nn2, p_tn2, labels=c("Blogs", "News", "Tweets"),ncol = 3, nrow = 1) 

rm(p_bn2, p_nn2, p_tn2,n2_list,n2_count)

n3_list <- list(blogs = blog_n3, news = news_n3 , twit= twit_n3 )
n3_count <- function(words){ words %>% count(ngram, sort = TRUE) }
n3_frequency <- lapply(n3_list, n3_count)

p_bn3 <- head(n3_frequency$blogs,10) %>% ggplot(aes(x=reorder(ngram,n), y = n, fill = (n))) +
  geom_bar(stat="identity") + xlab(NULL) +
  scale_fill_gradient(low="dodgerblue4",high="dodgerblue1")+
  theme(legend.position="none") + coord_flip()
p_nn3 <- head(n3_frequency$news,10) %>% ggplot(aes(x=reorder(ngram,n), y = n, fill = (n))) +
  geom_bar(stat = "identity") + xlab(NULL) +
  scale_fill_gradient(low="dodgerblue4",high="dodgerblue1")+
  theme(legend.position="none") + coord_flip()
p_tn3 <- head(n3_frequency$twit,10) %>% ggplot(aes(x=reorder(ngram,n), y = n, fill = (n))) +
  geom_bar(stat = "identity") + xlab(NULL) +
  scale_fill_gradient(low="dodgerblue4",high="dodgerblue1")+
  theme(legend.position="none") + coord_flip()

plot_grid(p_bn3, p_nn3, p_tn3, labels=c("Blogs", "News", "Tweets"),ncol = 3, nrow = 1) 

rm(p_bn3, p_nn3, p_tn3,n3_list,n3_count)

#n4_list <- list(blogs = blog_n4, news = news_n4 , twit= twit_n4 )
#n4_count <- function(words){ words %>% count(ngram, sort = TRUE) }
#n4_frequency <- lapply(n4_list, n4_count)

#p_bn4 <- top_n(n4_frequency$blogs,10) %>% ggplot(aes(x=reorder(ngram,n), y = n, fill = (n))) +
#  geom_bar(stat="identity") + xlab(NULL) +
#  scale_fill_gradient(low="dodgerblue4",high="dodgerblue1")+
#  theme(legend.position="none") + coord_flip()
#p_nn4 <- top_n(n4_frequency$news,10) %>% ggplot(aes(x=reorder(ngram,n), y = n, fill = (n))) +
#  geom_bar(stat = "identity") + xlab(NULL) +
#  scale_fill_gradient(low="dodgerblue4",high="dodgerblue1")+
#  theme(legend.position="none") + coord_flip()
#p_tn4 <- top_n(n4_frequency$twit,10) %>% ggplot(aes(x=reorder(ngram,n), y = n, fill = (n))) +
#  geom_bar(stat = "identity") + xlab(NULL) +
#  scale_fill_gradient(low="dodgerblue4",high="dodgerblue1")+
#  theme(legend.position="none") + coord_flip()

#plot_grid(p_bn4, p_nn4, p_tn4, labels=c("Blogs", "News", "Tweets"),ncol = 3, nrow = 1) #need package cowplot

App development plan

We will develop a Shiny app in which the user will input a string and a prediction algorithm will display a series of words with probabilities computed from n-gram models (n = 1, 2, 3, 4) in a backoff approach.
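
As a first, non-definitive sketch of the backoff idea, the function below looks up the last typed word in a bigram table and falls back to overall word frequencies when nothing is found. The tables bigram_freq (columns prefix, word, n) and unigram_freq (columns word, n) are hypothetical placeholders rather than objects created in this report, and dplyr is assumed to be loaded.

# hedged sketch only, not the final implementation
predict_next <- function(w1, bigram_freq, unigram_freq, k = 3) {
  hits <- bigram_freq %>%
    filter(prefix == w1) %>%   # candidates observed directly after the given word
    arrange(desc(n)) %>%
    head(k)
  if (nrow(hits) > 0) return(hits$word)
  # back off to the overall most frequent words when the prefix was never seen
  unigram_freq %>% arrange(desc(n)) %>% head(k) %>% pull(word)
}

# hypothetical usage: predict_next("happy", bigram_freq, unigram_freq)

In the actual app the higher-order tables (trigrams, 4-grams) would be consulted first, with the lookup falling through to lower orders in the same way.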

Questions to consider

Several questions will be considered:

  1. How can we efficiently store an n-gram model (Markov chains)?

  2. How can we use the knowledge about word frequencies to make the model smaller and more efficient? (A rough sketch follows after this list.)

  3. How many parameters do we need (i.e. how big is n in our n-gram model)?

  4. Is there a simple way to “smooth” the probabilities?

  5. How do we evaluate whether the model is any good?

  • Furthermore, we will have to balance the runtime and the size of the model in order to provide a reasonable experience to the user; there will likely be a tradeoff between the two.
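
Relating to questions 1 and 2, a rough sketch of one possible approach is to prune rare n-grams and store the remainder in a keyed data.table for fast lookup. The input ngram_freq and its columns (prefix, word, n) are hypothetical placeholders, not objects created in this report.

# hypothetical sketch: prune rare n-grams and index the rest for fast lookup
library(data.table)

compress_model <- function(ngram_freq, min_count = 2) {
  dt <- as.data.table(ngram_freq)   # assumed columns: prefix, word, n
  dt <- dt[n >= min_count]          # dropping rare n-grams shrinks the model considerably
  setkey(dt, prefix)                # a keyed (indexed) table keeps prefix lookup fast
  dt
}

# hypothetical usage: model <- compress_model(ngram_freq); model["happy"]

The min_count threshold would then be the main lever for trading model size against coverage.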