The goal of this capstone project is to build a prediction algorithm and a Shiny app that takes a phrase (multiple words) as input and outputs a prediction of the next word. The training dataset is downloaded from the course website. It contains four datasets in four different languages, and I use the English dataset for this project. The English dataset has three txt files, named blogs, news and twitter respectively.
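To make the end goal concrete, below is a minimal sketch of the planned Shiny interface; predictNextWord is a placeholder name for the prediction function developed in the rest of this report, so the app cannot run until that function exists.
library(shiny)
# a text box for the input phrase and a text output for the predicted next word
ui <- fluidPage(
  titlePanel("Next word prediction"),
  textInput("phrase", "Type a phrase:"),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    predictNextWord(input$phrase)  # placeholder, implemented later in the project
  })
}
# shinyApp(ui, server)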
The main workflow of natural language processing in this report includes the following steps: loading and summarising the raw data, sampling the data, tokenising and cleaning the text (including profanity filtering), and exploring word and n-gram frequencies.
library(tm)
library(plyr)
library(dplyr)
library(ggplot2)
library(tidytext)
library(stringr)
library(quanteda)
library(textclean)
library(lexicon)
library(wordcloud2)
library(ngram)
library(webshot)
library(htmlwidgets)
library(pacman)
library(pluralize)
The three text files were loaded using the readLines function. The number of lines in each file was counted and shown in the bar graph: the twitter file has the highest number of lines, roughly double that of the blogs and news files. The word count per line is shown as a histogram for each of the three files: the twitter file has the lowest word count per line.
blogs <- readLines("en_US.blogs.txt")
blog_line <- length(blogs)
blog_word_count <- ldply(blogs, wordcount) # word counts
news <- readLines("en_US.news.txt")
news_line <- length(news)
news_word_count <- ldply(news, wordcount)
twitter <- readLines("en_US.twitter.txt")
twitter_line <- length(twitter)
twitter_word_count <- ldply(twitter, wordcount)
# line number plot
x = c("blogs", "news", "twitter")
y = c(blog_line, news_line, twitter_line)
df <- do.call(rbind, Map(data.frame, source = x, line_number = y))
fig1 <- ggplot(data = df, aes(x = source, y = line_number)) + geom_bar(stat = "identity", fill = "orange") + theme_minimal() + ggtitle("Number of lines in each document")
options(repr.plot.width = 3, repr.plot.height = 3)
fig1
# word count plot
par(mfrow = c(3, 1))
fig2_1 <- hist(log10(blog_word_count$V1), main = "Histogram of log10(word count in blogs)", xlab = "", col = "red")
fig2_2 <- hist(log10(news_word_count$V1), main = "Histogram of log10(word count in news)", xlab = "", col = "green")
fig2_3 <- hist(log10(twitter_word_count$V1), main = "Histogram of log10(word count in twitter)", xlab = "", col = "blue")
To build prediction models, the raw dataset was sampled using the rbinom function to “flip an unfair coin” with the probability of getting heads set to 25%, creating a separate sample dataset. The line count and word count per line were then plotted as in step 2, this time using the sampled data.
# using unfair coin flip to randomly sample texts
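# (a fixed seed, e.g. set.seed(123), could be set here if the sampling needs to be reproducible)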
## blogs
index_blogs <- rbinom(length(blogs), 1, 0.25)
sample_blogs <- blogs[which(index_blogs == 1)]
write.table(sample_blogs, file = "sample_blogs.txt")
sample_blog_line <- length(sample_blogs)
sample_blog_word_count <- ldply(sample_blogs, wordcount)
## news
index_news <- rbinom(length(news), 1, 0.25)
sample_news <- news[which(index_news == 1)]
write.table(sample_news, file = "sample_news.txt")
sample_news_line <- length(sample_news)
sample_news_word_count <- ldply(sample_news, wordcount)
## twitter
index_twitter <- rbinom(length(twitter), 1, 0.25)
sample_twitter <- twitter[which(index_twitter == 1)]
write.table(sample_twitter, file = "sample_twitter.txt")
sample_twitter_line <- length(sample_twitter)
sample_twitter_word_count <- ldply(sample_twitter, wordcount)
# Plots
# line number plot
x = c("blogs", "news", "twitter")
y = c(sample_blog_line, sample_news_line, sample_twitter_line)
df <- do.call(rbind, Map(data.frame, source = x, line_number = y))
fig3 <- ggplot(data = df, aes(x = source, y = line_number)) + geom_bar(stat = "identity", fill = "orange") + theme_minimal() + ggtitle("Number of lines in each sampled document")
fig3
# word count plot
par(mfrow = c(3, 1))
fig4_1 <- hist(log10(sample_blog_word_count$V1), main = "Histogram of log10(word count in sampled blogs)", xlab = "", col = "red2")
fig4_2 <- hist(log10(sample_news_word_count$V1), main = "Histogram of log10(word count in sampled news)", xlab = "", col = "green2")
fig4_3 <- hist(log10(sample_twitter_word_count$V1), main = "Histogram of log10(word count in sampled twitter)", xlab = "", col = "blue2")
In this part, the lines in each file were broken down into tokens such as words, punctuation marks and numbers, and the words were then filtered against a profanity word list. The tokenisation step was performed with the quanteda package. The blog data was processed first, and the whole cleaning process was then wrapped into a function for processing the news and twitter data.
The sampled blog data was used to create a sparse document-feature matrix with the dfm function, with punctuation, numbers, symbols and URLs removed and all letters converted to lower case. Interestingly, this function does not handle mentions (e.g. @word) or hashtags (e.g. #word), and sometimes misses URLs and punctuation, so further cleaning steps were implemented with dfm_select. Then the stop words were removed and a singularisation step was performed to tidy up the word data. The resulting data frame was subjected to a profanity check using a combined list of five profanity word lists provided in the lexicon package, and a word cloud was generated from the sorted data frame using the top 500 words. The process used for the blog data was implemented as a function called textToWord, which was then used to generate word clouds for the news and twitter data.
# blog data
# word
words <- dfm(sample_blogs, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, split_hyphens = TRUE, remove_url = TRUE)
dim(words)
## [1] 224846 146471
## remove mentions
words_no_mention <- dfm_select(words, pattern = "@\\w+", selection = "remove", valuetype = "regex")
dim(words_no_mention)
## [1] 224846 146304
## remove urls
words_no_url <- dfm_select(words_no_mention, pattern = "https?://.+", selection = "remove", valuetype = "regex")
dim(words_no_url) # the change in feature count indicates that remove_url did not catch everything
## [1] 224846 146294
## remove hashtags
words_no_hash <- dfm_select(words_no_url, pattern = "#\\w+", selection = "remove", valuetype = "regex")
dim(words_no_hash)
## [1] 224846 145976
## remove punct
words_no_punct <- dfm_select(words_no_hash, pattern = "[[:punct:]]", selection = "remove", valuetype = "regex")
dim(words_no_punct) # looks like remove_punct did not catch all punctuation either
## [1] 224846 131497
## remove stop words
words_no_stopword <- dfm(words_no_punct, remove = stopwords("english"))
dim(words_no_stopword)
## [1] 224846 131372
## word singularisation - using the pluralize package!!
feat <- words_no_stopword@Dimnames$features
# now replace the features with singularized features
words_no_stopword@Dimnames$features <- singularize(feat)
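# for reference, singularize(c("cats", "boxes", "children")) should return "cat" "box" "child"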
# the features now contain duplicate names, and the sparse dfm cannot be converted to a data frame directly. Instead, the feature frequencies are used to build a data frame in which the duplicated features can be consolidated.
a <- featfreq(words_no_stopword)
df <- as.data.frame(a)
df$feature <- names(a)
colnames(df) <- c("freq", "word")
df <- df[c("word", "freq")]
# consolidate rows
df <- ddply(df, "word", numcolwise(sum))
# combine five profanity word lists from the lexicon package into profanity_words, a list of 3162 unique words
profanity_words <- unique(c(profanity_alvarez, profanity_arr_bad, profanity_banned, profanity_racist, profanity_zac_anger))
# filter out the profanity_words
"%!in%" = Negate("%in%")
newdf <- df %>% select(word,freq) %>% filter(word %!in% profanity_words) %>% arrange(desc(freq))
## generate word cloud
top500 <- newdf[1:500, ]
blogcloud <- wordcloud2(top500, shape = "pentagon")
saveWidget(blogcloud, "blog_cloud.html", selfcontained = F)
webshot("blog_cloud.html", "blog_cloud.png", delay = 5, vwidth = 2000, vheight = 2000)
## this process is implemented in the function textToWord
textToWord <- function(filename){
words <- dfm(filename, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, split_hyphens = TRUE, remove_url = TRUE)
words_no_mention <- dfm_select(words, pattern = "@\\w+", selection = "remove", valuetype = "regex")
words_no_url <- dfm_select(words_no_mention, pattern = "https?://.+", selection = "remove", valuetype = "regex")
words_no_hash <- dfm_select(words_no_url, pattern = "#\\w+", selection = "remove", valuetype = "regex")
words_no_punct <- dfm_select(words_no_hash, pattern = "[[:punct:]]", selection = "remove", valuetype = "regex")
words_no_stopword <- dfm(words_no_punct, remove = stopwords("english"))
feat <- words_no_stopword@Dimnames$features
words_no_stopword@Dimnames$features <- singularize(feat)
a <- featfreq(words_no_stopword)
df <- as.data.frame(a)
df$feature <- names(a)
colnames(df) <- c("freq", "word")
df <- df[c("word", "freq")]
df <- ddply(df, "word", numcolwise(sum))
profanity_words <- unique(c(profanity_alvarez, profanity_arr_bad, profanity_banned, profanity_racist, profanity_zac_anger))
"%!in%" = Negate("%in%")
newdf <- df %>% select(word,freq) %>% filter(word %!in% profanity_words) %>% arrange(desc(freq))
return(newdf)
}
# now we use this function to analyse news
newsdf <- textToWord(sample_news)
top500_newsdf <- newsdf[1:500, ]
newscloud <- wordcloud2(top500_newsdf, shape = "circle")
saveWidget(newscloud, "news_cloud.html", selfcontained = F)
webshot("news_cloud.html", "news_cloud.png", delay = 5, vwidth = 800, vheight = 800)
# now we use this function to analyse twitter
twitterdf <- textToWord(sample_twitter)
top500_twitter <- twitterdf[1:500, ]
twittercloud <- wordcloud2(top500_twitter, shape = "triangle")
saveWidget(twittercloud, "twitter_cloud.html", selfcontained = F)
webshot("twitter_cloud.html", "twitter_cloud.png", delay = 5, vwidth = 2000, vheight = 2000)
For this analysis, it is better to keep information about the relative positions of the words, so instead of creating a document-feature matrix straight away, the lines were kept and tokenised. As in the previous analysis, I started with the blog data.
words <- tokens(sample_blogs, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, split_hyphens = TRUE, remove_url = TRUE)
# 2-gram
words_2gram <- tokens_ngrams(words, n = 2, concatenator = " ")
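# for illustration, tokens_ngrams on the tokens of "thanks for the follow" with n = 2 would give "thanks for", "for the", "the follow"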
words_2gram_dfm <- dfm(words_2gram)
b <- featfreq(words_2gram_dfm)
df1 <- as.data.frame(b)
df1$feature <- names(b)
colnames(df1) <- c("freq", "word")
df1 <- df1[c("word", "freq")]
# consolidate rows
df1 <- ddply(df1, "word", numcolwise(sum))
newdf1 <- df1 %>% select(word,freq) %>% filter(word %!in% profanity_words) %>% arrange(desc(freq))
top20_2gram <- newdf1[1:20, ]
top20_2gram$word <- factor(top20_2gram$word, levels = unique(top20_2gram$word))
fig5 <- ggplot(data = top20_2gram, aes(x = word, y = freq)) + geom_bar(stat = "identity", fill = "salmon") + theme_minimal() + ggtitle("Top 20 most frequently used 2-gram") + theme(axis.text.x = element_text(angle = 45))
fig5
# 3-gram
words_3gram <- tokens_ngrams(words, n = 3, concatenator = " ")
words_3gram_dfm <- dfm(words_3gram)
c <- featfreq(words_3gram_dfm)
df2 <- as.data.frame(c)
df2$feature <- names(c)
colnames(df2) <- c("freq", "word")
df2 <- df2[c("word", "freq")]
# consolidate rows
df2 <- ddply(df2, "word", numcolwise(sum))
newdf2 <- df2 %>% select(word,freq) %>% filter(word %!in% profanity_words) %>% arrange(desc(freq))
top20_3gram <- newdf2[1:20, ]
top20_3gram$word <- factor(top20_3gram$word, levels = unique(top20_3gram$word))
fig6 <- ggplot(data = top20_3gram, aes(x = word, y = freq)) + geom_bar(stat = "identity", fill = "plum") + theme_minimal() + ggtitle("Top 20 most frequently used 3-gram") + theme(axis.text.x = element_text(angle = 45))
fig6
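These frequency tables already hint at how the final prediction step could work: take the last one or two words of the input phrase and look them up as the prefix of the most frequent matching n-gram. The sketch below is only an illustration of that idea using the newdf1 (2-gram) and newdf2 (3-gram) tables built above; the function name predictNextWord and the simple highest-frequency backoff are my assumptions here, not the final model.
predictNextWord <- function(phrase, df3 = newdf2, df2 = newdf1) {
  # clean the input roughly the same way as the training text: lower case, no punctuation
  tokens_in <- tolower(unlist(strsplit(gsub("[[:punct:]]", "", phrase), "\\s+")))
  tokens_in <- tokens_in[tokens_in != ""]
  n <- length(tokens_in)
  if (n == 0) return(NA_character_)
  # try the last two words as the prefix of a 3-gram
  if (n >= 2) {
    prefix <- paste(tokens_in[n - 1], tokens_in[n])
    hits <- df3[startsWith(as.character(df3$word), paste0(prefix, " ")), ]
    # the tables are sorted by frequency, so the first hit is the most frequent 3-gram
    if (nrow(hits) > 0) return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
  }
  # back off to the 2-gram table using only the last word
  hits <- df2[startsWith(as.character(df2$word), paste0(tokens_in[n], " ")), ]
  if (nrow(hits) > 0) return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
  NA_character_
}
# example call: predictNextWord("thanks for the")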