Introduction

The goal of this analysis is to showcase the very first phase of the Capstone project, SwiftKey Next Word Prediction. Some preliminary steps need to be performed before creating the Word Prediction Model.

Loading and cleaning of the data will be performed before organizing the dataset for our analysis.

This exercise comprises the following steps:

  1. Demonstrate that the data has been downloaded and successfully loaded.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings amassed so far.
  4. Get feedback on the plans for creating a prediction algorithm and Shiny app.

The English-language dataset will be used along with a dictionary to predict the next word.

1. Loading Libraries and Datasets

  1. Loading the Necessary Libraries.
library(dplyr)
library(tidyr)
library(tm)
library(ggplot2)
library(lexicon)
library(sentimentr)
library(stringr)
library(wordcloud)
library(caret)
library(tokenizers)
library(ngram)
library(NLP)
  2. Setting the Working Directory.

  3. Reading the Files from the Directory.
setwd("C:/Users/r.pratap.singh/Desktop/JohnHopkins/capstone/Coursera-SwiftKey/final/en_US")

# US Twitter File
us_twitter <- "en_US_twitter.txt"
con_tw <- file(us_twitter,open="r")
line_tw <- readLines(con_tw) 
long_tw <- length(line_tw)
twitter_size <- round((file.info(us_twitter)$size) /1024^2,1)
close(con_tw)

# US News File
us_news <- "en_US_news.txt"
con_news <- file(us_news,open="r")
line_news <- readLines(con_news) 
long_news <- length(line_news)
news_size <- round((file.info(us_news)$size) /1024^2,1)
close(con_news)

# US blog File
us_blog <- "en_US_blogs.txt"
con_blog <- file(us_blog,open="r")
line_blog <- readLines(con_blog) 
long_blog <- length(line_blog)
blogs_size <- round((file.info(us_blog)$size) /1024^2,1)
close(con_blog)

2. Summary Statistics of the Datasets

Calculating the size of each dataset, the number of lines, and the total number of words.

twitterWC <- sum(sapply(gregexpr("\\S+", line_tw), length))
newsWC <- sum(sapply(gregexpr("\\S+", line_news), length))
blogWC <- sum(sapply(gregexpr("\\S+", line_blog), length))

d1 <- cbind(c(twitter_size, blogs_size, news_size), c(long_tw, long_blog, long_news), c(twitterWC, blogWC, newsWC))
rownames(d1) <- c("Twitter", "Blogs", "News")
colnames(d1) <- c("Size in MB", "Number of Lines", "Number of Words")

Statistics table

library(kableExtra)
kable(d1, "html") %>%
  kable_styling(full_width = F)
          Size in MB   Number of Lines   Number of Words
Twitter        159.4           2360148          30373792
Blogs          200.4             77259          37334441
News           196.3             77259           2643972

3. Exploratory Analysis and Data Cleansing

3.1. Sampling

To move ahead with building the prediction model, an appropriate sample of the data needs to be taken. Taking too large a sample would compromise the speed of prediction, so we restrict ourselves to a sample of 5,000 lines each from the News, Twitter and Blogs datasets.

set.seed(5)

sample_tweet <- sample(line_tw, size = 5000)
sample_blog <- sample(line_blog, size = 5000)
sample_news <- sample(line_news, size = 5000)

sample_all <- as.character(rbind(sample_tweet, sample_blog, sample_news))

3.2. Data Cleaning - Tokenization & Removing Profanities

  • A tokenize_fun function is created to clean the data and remove profanity.
  • The function first converts all characters to lowercase.
  • The second part removes profane words, using the profanity_alvarez list from the lexicon package.
  • Punctuation is then removed from the resulting set.
  • The last part removes numbers, special characters, extra whitespace and blank lines from the set.
  • The function is called on the sample_all dataset and the cleaned data is stored back into sample_all.
  • As a final step, stray suffix fragments such as ‘ies’, ‘er’, ‘ed’ and ‘ing’ left behind by the profanity removal are removed, and the result is stored back into sample_all.
############# 1. Tokenization ############################
tokenize_fun <- function(x){
  # convert to lowercase
  x <- tolower(x)
  ###################### 2. Removing Profanity #######################
  # getting the profane word list from the lexicon package (profanity_alvarez)
  profane <- unique(tolower(lexicon::profanity_alvarez))
  # remove each profane word from the text; fixed = TRUE matches the words
  # literally, so special characters in the list do not need escaping
  for (i in 1:length(profane)){
    x <- gsub(profane[i], "", x, fixed = TRUE)
  }
  # remove punctuation
  x <- removePunctuation(x)
  # remove numbers
  x <- removeNumbers(x)
  # remove blank entries
  x <- x[which(x != " ")]
  # remove all characters apart from alphabets
  x <- gsub("[^a-zA-Z]", " ", x)
  # collapse extra whitespace
  x <- stripWhitespace(x)
  return(x)
}

sample_all <- tokenize_fun(sample_all)

# remove stray suffix fragments (ing, ed, ies, er, s, e) left behind after profanity removal
sample_all <- gsub("[^a-z]ing |[^a-z]ed |[^a-z]ies |[^a-z]er |[^a-z]s |[^a-z]e ", " ", sample_all)

3.3. Exploratory Data Analysis

  • sample_all stores all the clean data. It will be used as a base for N-Gram Modeling.

sample_all_1 is created just to check the occurrence of words after removing articles, fillers etc. This variable is for illustration purposes only.

The tokenize_ngrams function is used to split sample_all_1 into 1-grams, again only for illustration.

# Block to remove articles and fillers for Plot
sample_all_1 <- gsub("[^a-z]is |[^a-z]am |[^a-z]are |[^a-z]an |[^a-z]the |[^a-z]so ",
                     " ", sample_all)
sample_all_1 <- gsub("[^a-z]was |[^a-z]were |[^a-z]a |[^a-z]in |[^a-z]on |[^a-z]to |[^a-z]if ",
                     " ", sample_all_1)
sample_all_1 <- gsub("[^a-z]and |[^a-z]of |[^a-z]we |[^a-z]you |[^a-z]at |[^a-z]as |[^a-z]or ",
                     " ", sample_all_1)
sample_all_1 <- gsub("[^a-z]his |[^a-z]that |[^a-z]they |[^a-z]for |[^a-z]it |[^a-z]my ",
                     " ", sample_all_1)
sample_all_1 <- gsub("[^a-z]has |[^a-z]have |[^a-z]this |[^a-z]not |[^a-z]her |[^a-z]or |[^a-z]i ",
                     " ", sample_all_1)
sample_all_1 <- gsub("[^a-z]he |[^a-z]be |[^a-z]she |[^a-z]by |[^a-z]a |[^a-z]in |[^a-z]its ",
                     " ", sample_all_1)

# Tokenizing with ngram where n= 1
sample_all_1 <- tokenize_ngrams(sample_all_1, n = 1)
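
An equivalent, more compact way to drop such articles and fillers would be tm's removeWords with its built-in English stopword list. This is only a sketch of an alternative, not what the analysis above uses; sample_all_1_alt is an illustrative name.

# alternative sketch (not the approach used above): remove common English
# stopwords with tm's built-in list, then collapse the leftover whitespace
sample_all_1_alt <- stripWhitespace(removeWords(sample_all, stopwords("en")))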
  • sample_all_2 is created from sample_all using the tokenize_ngrams function with n = 2, for 2-Gram modeling.
  • sample_all_3 is created from sample_all using the tokenize_ngrams function with n = 3, for 3-Gram modeling.
  • sample_all_4 is created from sample_all using the tokenize_ngrams function with n = 4, for 4-Gram modeling.
  • sample_word is created from sample_all using the tokenize_ngrams function with n = 1, for 1-Gram modeling.
# Tokenizing with ngram where n= 2
sample_all_2 <- tokenize_ngrams(sample_all, n = 2)
# Tokenizing with ngram where n= 3
sample_all_3 <- tokenize_ngrams(sample_all, n = 3)
# Tokenizing with ngram where n= 4
sample_all_4 <- tokenize_ngrams(sample_all, n = 4)
# Tokenizing with ngram where n= 1
sample_word <- tokenize_ngrams(sample_all, n = 1)
  • s1 is created from sample_all_1 by first unlisting it, then tabulating the word occurrences and arranging them in descending order. This is for illustration purposes only.

  • s2 is created from sample_all_2 in the same way, for the 2-Gram model.
  • s3 is created from sample_all_3 in the same way, for the 3-Gram model.
  • s4 is created from sample_all_4 in the same way, for the 4-Gram model.
  • w1 is created from sample_word in the same way, for the 1-Gram model.

# Creating dataframe for ngram and store it in descending order frequency
s1 <- cbind.data.frame(table(unlist(sample_all_1)))  %>%
  arrange(desc(Freq)) 
s2 <- cbind.data.frame(table(unlist(sample_all_2)))  %>%
  arrange(desc(Freq))
s3 <- cbind.data.frame(table(unlist(sample_all_3)))  %>%
  arrange(desc(Freq))
s4 <- cbind.data.frame(table(unlist(sample_all_4)))   %>%
  arrange(desc(Freq))
w1 <- cbind.data.frame(table(unlist(sample_word))) %>%
  arrange(desc(Freq))

3.4. Word Clouds & Bar Plots for each N-Gram Model

  • Word cloud for the 1-Gram model from s1 (illustration dataset), filtering on a minimum of 350 occurrences.
wordcloud(words = s1$Var1, freq = s1$Freq, min.freq = 350, colors = brewer.pal(8,"Paired"))

  • Word cloud for the 1-Gram model from w1, filtering on a minimum of 500 occurrences.
wordcloud(words = w1$Var1, freq = w1$Freq, min.freq = 500, colors = brewer.pal(8,"Paired"))

  • Word cloud for the 2-Gram model from s2, filtering on a minimum of 200 occurrences, followed by a bar plot for the 2-Gram model.
wordcloud(words = s2$Var1, freq = s2$Freq, min.freq = 200, colors = brewer.pal(8,"Paired"))

# BarPlot for 2 Ngram model
ggplot(data=s2[1:20,], aes(x=reorder(Var1,-Freq), y = Freq) )+
  geom_bar(stat = "identity", fill = "skyblue", color = "blue") +
  theme(axis.text.x=element_text(angle=90)) +
  xlab("2Gram words ") +
  ylab("Frequency of Words")

  • Word cloud for the 3-Gram model from s3, filtering on a minimum of 42 occurrences, followed by a bar plot for the 3-Gram model.
wordcloud(words = s3$Var1, freq = s3$Freq, min.freq = 42, colors = brewer.pal(8,"Paired"))

# BarPlot for 3 Ngram model
ggplot(data=s3[1:20,], aes(x=reorder(Var1,-Freq), y = Freq) )+
  geom_bar(stat = "identity", fill = "skyblue", color = "blue") +
  theme(axis.text.x=element_text(angle=90)) +
  xlab("3Gram words ") +
  ylab("Frequency of Words")

  • Word cloud for the 4-Gram model from s4, filtering on a minimum of 8 occurrences, followed by a bar plot for the 4-Gram model.
wordcloud(words = s4$Var1, freq = s4$Freq, min.freq = 8, colors = brewer.pal(8,"Paired"))

# BarPlot for 4 Ngram model
ggplot(data=s4[1:20,], aes(x=reorder(Var1,-Freq), y = Freq) )+
  geom_bar(stat = "identity", fill = "skyblue", color = "blue") +
  theme(axis.text.x=element_text(angle=90)) +
  xlab("4Gram words ") +
  ylab("Frequency of Words")

3.5. Number of words for 50% and 90% Coverage

  • A coverage function is created to find the number of words required for a given percentage of coverage. A frequency vector (object) and a percentage (percent) are passed as input, and it returns the number of words.
coverage <- function(object, percent) {
  cover <- 0
  sumCover <- sum(object)
  for(i in 1:length(object)) {
    cover <- cover + object[i]
    if(cover >=  percent*(sumCover)){break}
  }
  return(i)
}

For 50% coverage, the number of words required is: 134

For 90% coverage, the number of words required is: 7137
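
For reference, a minimal sketch of how these counts can be reproduced, assuming the descending-sorted 1-Gram frequencies in w1$Freq are the input (the original call is not shown above):

# minimal sketch, assuming the sorted 1-Gram frequencies in w1$Freq are used
coverage(w1$Freq, 0.5)   # number of words needed for 50% coverage
coverage(w1$Freq, 0.9)   # number of words needed for 90% coverage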

3.6. Storing the Dataframes for Prediction

  • Each dataframe is separated into a words column and an out column, where the out column contains the last word of the N-Gram. The out column will act as the predicted word for the words column.

  • These dataframes will be used for predicting words; a hypothetical lookup sketch follows the code below.

s4 <- separate(s4, Var1, into = c("words", "out"), sep = " (?=[^ ]+$)")
s3 <- separate(s3, Var1, into = c("words", "out"), sep = " (?=[^ ]+$)")
s2 <- separate(s2, Var1, into = c("words", "out"), sep = " (?=[^ ]+$)")
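
As a preview of how these dataframes could be used, the hypothetical helper below matches a cleaned prefix against the words column and returns its most frequent continuations from the out column. This is only a sketch; predict_from_ngram is an illustrative name, not part of the final prediction code.

# hypothetical helper (illustration only): look up a prefix in one of the
# separated N-Gram dataframes and return the top continuations
predict_from_ngram <- function(prefix, ngram_df, top = 3) {
  prefix  <- stripWhitespace(tolower(prefix))
  matches <- ngram_df[ngram_df$words == prefix, ]   # rows are already sorted by Freq
  head(matches$out, top)
}
# example: predict_from_ngram("one of the", s4) returns the most frequent
# fourth words following "one of the"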

3.7. N-gram Prediction Model Dataset Sample

This section shows the structure and a few sample rows of each dataset.

Dataset structure created from s4

s4[1:10,]
##            words  out Freq
## 1     the end of  the   35
## 2    the rest of  the   35
## 3     at the end   of   27
## 4     one of the most   27
## 5  in the middle   of   26
## 6        i don t know   24
## 7    is going to   be   22
## 8      is one of  the   21
## 9    at the same time   19
## 10    to be able   to   19

Dataset structure created from s3

s3[1:10,]
##       words out Freq
## 1    one of the  185
## 2     a lot  of  151
## 3    it was   a   81
## 4   as well  as   78
## 5     i don   t   75
## 6     to be   a   70
## 7   the end  of   66
## 8    out of the   65
## 9  going to  be   64
## 10   i want  to   63

Dataset structure created from s2

s2[1:10,]
##    words out Freq
## 1     of the 2029
## 2     in the 1925
## 3     to the 1001
## 4     on the  898
## 5    for the  839
## 6     to  be  679
## 7     at the  630
## 8    and the  619
## 9     in   a  536
## 10    it was  486

Dataset structure created from w1

w1[1:10,]
##    Var1  Freq
## 1   the 21921
## 2    to 12176
## 3   and 11307
## 4     a 10653
## 5    of  9463
## 6    in  7474
## 7     i  6852
## 8  that  4724
## 9   for  4568
## 10   is  4519

The part below will be executed to store the files in RDS format; a sketch of loading them back follows the list:

  • saveRDS(w1, file = "Ngram1.rds")
  • saveRDS(s2, file = "Ngram2.rds")
  • saveRDS(s3, file = "Ngram3.rds")
  • saveRDS(s4, file = "Ngram4.rds")
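
These files can then be read back in the Shiny app server code with readRDS; a minimal sketch using the file names above:

# minimal sketch: loading the stored N-Gram tables back in the Shiny server code
w1 <- readRDS("Ngram1.rds")
s2 <- readRDS("Ngram2.rds")
s3 <- readRDS("Ngram3.rds")
s4 <- readRDS("Ngram4.rds")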

4. Plans for creating a prediction algorithm and Shiny app

In order to build the N-gram model, I’ve processed the N-grams by breaking them into several different indexes. Each of the N-grams is saved as intermediate files, as follows (a sketch of this indexing idea follows the list):

  1. A file for the “n-gram” to “index” mapping (since the words take most of the memory space, I’ll keep them in only one place).
  2. These will be saved as RDS files and loaded in the Shiny app server program.
  3. A file for the “n-1 gram” prior. This holds the indexes into the N-1 “n-gram to index” mapping (calculated in a previous step).
  4. A file for the “1-gram” posterior. This holds the indexes into the 1-gram “n-gram to index” mapping (calculated in the first step).
  5. A file with the calculated probabilities, mapping each N-gram to its calculated probability (sorted decreasingly by probability).
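
A minimal sketch of this indexing idea, using the separated 4-Gram dataframe s4 from section 3.6. All object and file names here are illustrative assumptions, not the final implementation:

# illustrative sketch of the planned intermediate files (names are hypothetical)
prior_vocab     <- unique(s4$words)   # distinct n-1 gram prefixes
posterior_vocab <- unique(s4$out)     # distinct single-word continuations
ngram4_index <- data.frame(
  prior     = match(s4$words, prior_vocab),      # index into the n-1 gram mapping
  posterior = match(s4$out, posterior_vocab),    # index into the 1-gram mapping
  Freq      = s4$Freq
)
saveRDS(prior_vocab,     file = "prior_vocab.rds")
saveRDS(posterior_vocab, file = "posterior_vocab.rds")
saveRDS(ngram4_index,    file = "ngram4_index.rds")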

Shiny App Functionality:

  1. Load the “term frequency” files, sorted decreasingly by frequency in a previous step.
  2. Extract the “N-gram to index” vector, then extract the priors (N-1-grams) and posteriors (1-grams).
  3. Calculate the conditional probabilities of the N-grams given the N-1-grams using MLE, and extract the top 20 predicted words.
  4. Create the back-off search model to search the N, N-1 and N-2 models (see the sketch after this list).
  5. The Shiny app will display the top 3 predicted words (based on decreasing frequency of the word-combination occurrences).
  6. The Shiny app will also display a word cloud of the next predicted word.
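
A minimal sketch of the back-off search described in points 3 and 4, using the dataframes separated in section 3.6. Since their rows are already sorted by Freq, the most frequent continuation of a prefix comes first, which matches the MLE ordering; backoff_predict is an illustrative name only.

# illustrative back-off sketch: try the 4-Gram table first, then shorter N-Grams
backoff_predict <- function(phrase, top = 3) {
  tokens <- unlist(strsplit(tolower(phrase), "\\s+"))
  for (model in list(list(df = s4, k = 3), list(df = s3, k = 2), list(df = s2, k = 1))) {
    if (length(tokens) < model$k) next
    prefix <- paste(tail(tokens, model$k), collapse = " ")
    hits   <- model$df[model$df$words == prefix, ]   # already sorted by Freq (MLE order)
    if (nrow(hits) > 0) return(head(hits$out, top))
  }
  as.character(head(w1$Var1, top))   # fall back to the most frequent unigrams
}
# example: backoff_predict("at the same") should suggest "time" among the top words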