Overview

The objective of this project is to demonstrate initial exploration of the data, to produce a basic report of summary statistics about the data sets, and to report any interesting findings discovered along the way. This document explains only the major features of the data identified so far and briefly summarizes plans for creating the prediction algorithm and Shiny app in a way that a non-data-scientist manager can understand. I use tables and plots to illustrate the important summaries of the data set.

Loading and Reading Data

Loading Libraries Required

library(tm)          ## text mining framework: Corpus, tm_map, TermDocumentMatrix
library(corpora)
library(quanteda)
library(wordcloud)   ## word-cloud plot
library(ggplot2)     ## bar charts
library(dplyr)       ## pipes and data manipulation
library(stringi)     ## fast word counting (stri_count_words)
library(RWeka)       ## NGramTokenizer for the N-gram functions
library(corpus)      ## term_stats for the trigram table

Loading and Reading the Dataset

First, I download the data, store it in a folder, and extract the text files. Next, I load the data into R using the base function ‘readLines’, with another function, ‘file’, to open a connection.

if (!file.exists("./download")){dir.create("./download")}
fileUrl = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
#download.file(fileUrl,destfile = "./download/Coursera-SwiftKey.zip")
unzip(zipfile = "./download/Coursera-SwiftKey.zip")

blogconnect <- file("./final/en_US/en_US.blogs.txt","r")
blogread <- readLines(blogconnect, skipNul=T,encoding="UTF-8")
close(blogconnect)

newsconnect <- file("./final/en_US/en_US.news.txt","r")
newsread <- readLines(newsconnect, skipNul=T,encoding="UTF-8")
close(newsconnect)

twitterconnect <- file("./final/en_US/en_US.twitter.txt","r")
tweetread <- readLines(twitterconnect, skipNul=T,encoding="UTF-8")
close(twitterconnect)

Preprocessing and Sampling

Next, I do some pre-processing: updating the stopword list, creating N-gram tokenizer functions, creating summary variables, and sampling the three different sources and combining the samples for analysis. I then clean the combined sample so it can be analysed as a corpus, and build unigram, bigram, and trigram frequency data frames.

## Standard English stopwords plus a few extra high-frequency words
upStopword <- c(stopwords("en"),"can","will","want","just","like","may","cant","isnt","also","the","dont")
## RWeka tokenizers for one-, two- and three-word terms
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
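
As a quick illustration of these tokenizer functions (the example phrase is made up, not taken from the data), the bigram tokenizer splits a string into overlapping two-word terms:

bigram("thanks for the follow")
## expected result: "thanks for"  "for the"  "the follow"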

bloglen <- length(blogread)                        ## number of lines
blogwcount <- sum(stri_count_words(blogread))      ## total number of words
maxblog <- max(nchar(blogread))                    ## characters in the longest line
blogWavg <- mean(stri_count_words(blogread))       ## average words per line

newslen <- length(newsread)
newswcount <- sum(stri_count_words(newsread))
maxnews <- max(nchar(newsread))
newsWavg <- mean(stri_count_words(newsread))

twitterlen <- length(tweetread)
twitterwcount <- sum(stri_count_words(tweetread))
maxtwitter <- max(nchar(tweetread))
twitterWavg <- mean(stri_count_words(tweetread))


data_summary <- data.frame(Data_Source=c("Blog","News","Twitter"),
                           Number_of_Lines=c(bloglen,newslen,twitterlen),
                           Number_of_Words=c(blogwcount,newswcount,twitterwcount),
                           Number_of_Characters_in_Longest_Line=c(maxblog,maxnews,maxtwitter),
                           Average_Words=c(blogWavg,newsWavg,twitterWavg))
memory.limit(size = 1e+9)    ## raise the Windows memory limit (size is given in MB)
## [1] 1e+09
set.seed(123)    ## for reproducible sampling
blogSample <- sample(blogread, round(bloglen * 0.05))        ## 5% sample of each source
newsSample <- sample(newsread, round(newslen * 0.05))
twitterSample <- sample(tweetread, round(twitterlen * 0.05))
corpus <- c(blogSample, newsSample, twitterSample)

corpus <- VCorpus(VectorSource(iconv(corpus, "UTF-8", "ASCII", sub = "")))    ## drop non-ASCII characters and build a tm corpus

corpus = tm_map(corpus, content_transformer(tolower)) %>% 
        tm_map(removeWords,upStopword) %>% 
        tm_map(removeNumbers) %>% 
        tm_map(removePunctuation) %>% 
        tm_map(PlainTextDocument) %>% 
        tm_map(stripWhitespace)

## Unigram Processing
uniterms <- TermDocumentMatrix(corpus, control = list(tokenize = unigram))
unifreq <- findFreqTerms(uniterms, lowfreq = 35)    ## keep unigrams appearing at least 35 times

unigram_matrix <- sort(rowSums(as.matrix(uniterms[unifreq,])),decreasing = T)
unigram_dataframe <- data.frame(word=names(unigram_matrix),frequency = unigram_matrix)


## Bigram Processing
biterms <- TermDocumentMatrix(corpus, control = list(tokenize=bigram))
bifreq <- findFreqTerms(biterms, lowfreq = 15)

bigram_matrix <- sort(rowSums(as.matrix(biterms[bifreq,])),decreasing = T)
bigram_dataframe <-  data.frame(word = names(bigram_matrix),frequency = bigram_matrix)
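
The trigram data frame mentioned earlier can be built with the same pattern; the sketch below is my assumption of how it would look (the lowfreq cut-off of 10 is an arbitrary choice). It is memory-hungry on the full sample, which is why the trigram summary in the visualization section uses term_stats from the corpus package instead.

## Trigram Processing (sketch, mirroring the unigram and bigram steps above)
triterms <- TermDocumentMatrix(corpus, control = list(tokenize = trigram))
trifreq <- findFreqTerms(triterms, lowfreq = 10)    ## assumed cut-off

trigram_matrix <- sort(rowSums(as.matrix(triterms[trifreq,])), decreasing = T)
trigram_dataframe <- data.frame(word = names(trigram_matrix), frequency = trigram_matrix)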

Data Summary

data_summary
##   Data_Source Number_of_Lines Number_of_Words
## 1        Blog          899288        37546239
## 2        News           77259         2674536
## 3     Twitter         2360148        30093413
##   Number_of_Characters_in_Longest_Line Average_Words
## 1                             37546239      41.75107
## 2                              2674536      34.61779
## 3                             30093413      12.75065

Blog Article

The blog data contains 899,288 lines of text and 37,546,239 words, averaging about 42 words per line.

News Article

The news data contains 77,259 lines of text and 2,674,536 words, averaging about 35 words per line.

Twitter Article

The Twitter data contains 2,360,148 lines of text and 30,093,413 words, averaging about 13 words per line.

After combining the 5% samples from each of the three sources above, we obtain the unigram and bigram frequency tables given below:

print("Unigram")
## [1] "Unigram"
head(unigram_dataframe)    #First 6 Unigrams rows
##      word frequency
## one   one     10616
## get   get      9251
## time time      8500
## love love      7688
## good good      7617
## now   now      7344
print("Bigram")
## [1] "Bigram"
head(bigram_dataframe)    #First 6 Bigrams rows
##                            word frequency
## right now             right now      1155
## last night           last night       696
## looking forward looking forward       510
## happy birthday   happy birthday       448
## new york               new york       437
## first time           first time       418

The tables show the unigrams and bigrams with their corresponding frequencies in decreasing order.

Data Visualization

This section contains visual summaries of the data. First, I plot a word cloud of the 150 most frequent unigrams. Then I plot bar charts of the most frequent unigrams and bigrams. The trigram data is presented as a table because of the memory required to plot it.

wordcloud(words = unigram_dataframe$word , freq = unigram_dataframe$frequency, max.words = 150, scale = c(5,0.6) , rot.per = 0.5, use.r.layout = F, colors = brewer.pal(8,"Dark2"))

# Unigram Bar Chart
gg_unigram <- ggplot(data = unigram_dataframe[1:25,], aes(y = reorder(word,frequency), x = frequency))
gg_unigram <- gg_unigram + geom_bar(stat = "identity", fill = "green") 
gg_unigram <- gg_unigram + ggtitle("Most Frequent Unigrams in Decreasing Order") + ylab("Words") + xlab("Frequency")
gg_unigram

# Bigram Bar Chart
gg_bigram <- ggplot(data = bigram_dataframe[1:20,], aes(y = reorder(word,frequency), x = frequency)) + geom_bar(stat = "identity", fill = "orange")
gg_bigram <- gg_bigram + ggtitle("Most Frequent Bigrams in Decreasing Order") + ylab("Words") + xlab("Frequency")
gg_bigram

term_stats(corpus,ngrams = 3,types = T)
##    term                   type1   type2      type3   count support
## 1  happy mothers day      happy   mothers    day       176     174
## 2  let us know            let     us         know      123     123
## 3  happy new year         happy   new        year       96      95
## 4  new york city          new     york       city       75      73
## 5  cinco de mayo          cinco   de         mayo       62      61
## 6  looking forward seeing looking forward    seeing     56      56
## 7  st patricks day        st      patricks   day        40      35
## 8  new years eve          new     years      eve        36      34
## 9  hope great day         hope    great      day        34      34
## 10 new york times         new     york       times      34      34
## 11 new york ny            new     york       ny         30      30
## 12 two years ago          two     years      ago        30      30
## 13 come see us            come    see        us         29      29
## 14 really looking forward really  looking    forward    27      27
## 15 world war ii           world   war        ii         26      25
## 16 come join us           come    join       us         25      25
## 17 dreams come true       dreams  come       true       23      23
## 18 happy valentines day   happy   valentines day        23      23
## 19 look forward seeing    look    forward    seeing     23      23
## 20 love love love         love    love       love       23      23
## ⋮ (1451243 rows total)

Model Prediction

The prediction model will be based on the data wrangling carried out above. Using the N-grams, ordered from highest to lowest frequency, the model will predict a word from the first letters typed, the next word from a single word of input using the bigram tokens, or a third word from the two preceding words using the trigram tokens. These predictions will also chain together: the model predicts a unigram from the initial letters, a bigram completion from that unigram, and a trigram completion from the bigram, so that a whole phrase can be suggested from a single input.
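
A minimal sketch of that back-off idea is given below. The function name predict_next is hypothetical, the regular-expression lookup is only one possible implementation, and the trigram_dataframe argument assumes the trigram table sketched earlier has been built; the final model may differ.

## Back-off sketch: try the trigram table, then the bigram table,
## then fall back to the single most frequent unigram.
predict_next <- function(phrase, uni = unigram_dataframe,
                         bi = bigram_dataframe, tri = trigram_dataframe) {
        words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
        if (length(words) == 2) {
                hit <- tri[grepl(paste0("^", words[1], " ", words[2], " "), tri$word), ]
                if (nrow(hit) > 0) return(sub(".* ", "", hit$word[1]))    ## third word of top trigram
        }
        hit <- bi[grepl(paste0("^", tail(words, 1), " "), bi$word), ]
        if (nrow(hit) > 0) return(sub(".* ", "", hit$word[1]))            ## second word of top bigram
        as.character(uni$word[1])                                         ## most frequent word overall
}

## Example (hypothetical input): predict_next("looking") should suggest "forward"
## based on the bigram table above. In the planned Shiny app, a function like this
## would be called on the text the user has typed so far.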