The goal of this analysis is to showcase the very first phase of the Capstone project development, i.e. SwiftKey Next Word Prediction. Some preliminary steps need to be performed before creating the Word Prediction Model.
Loading and cleaning of the data will be performed before organizing/creating the dataset for our analysis.
This exercise comprises the following steps:

1. Demonstrate that you've downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
The English language dataset will be used along with a dictionary to predict the next word.
library(dplyr)
library(tidyr)
library(tm)
library(ggplot2)
library(lexicon)
library(sentimentr)
library(stringr)
library(wordcloud)
library(caret)
library(tokenizers)
library(ngram)
library(NLP)
Setting the working directory.
setwd("C:/Users/r.pratap.singh/Desktop/JohnHopkins/capstone/Coursera-SwiftKey/final/en_US")
# US Twitter File
us_twitter <- "en_US_twitter.txt"
con_tw <- file(us_twitter,open="r")
line_tw <- readLines(con_tw)
long_tw <- length(line_tw)
twitter_size <- round((file.info(us_twitter)$size) /1024^2,1)
close(con_tw)
# US News File
us_news <- "en_US_news.txt"
con_news <- file(us_news,open="r")
line_news <- readLines(con_news)
long_news <- length(line_news)
news_size <- round((file.info(us_news)$size) /1024^2,1)
close(con_news)
# US blog File
us_blog <- "en_US_blogs.txt"
con_blog <- file(us_blog,open="r")
line_blog <- readLines(con_blog)
long_blog <- length(line_blog)
blogs_size <- round((file.info(us_blog)$size) /1024^2,1)
close(con_blog)
Calculating each dataset's size, the number of lines present, and the total number of words present.
twitterWC <- sum(sapply(gregexpr("\\S+", line_tw), length))
newsWC <- sum(sapply(gregexpr("\\S+", line_news), length))
blogWC <- sum(sapply(gregexpr("\\S+", line_blog), length))
d1 <- cbind(c(twitter_size, blogs_size, news_size), c(long_tw, long_blog, long_news), c(twitterWC, blogWC, newsWC))
rownames(d1) <- c("Twitter", "Blogs", "News")
colnames(d1) <- c("Size in MB", "Number of Lines", "Number of Words")
Statistics table
library(kableExtra)
kable(d1, "html") %>%
kable_styling(full_width = F)
|  | Size in MB | Number of Lines | Number of Words |
|---|---|---|---|
| Twitter | 159.4 | 2360148 | 30373792 |
| Blogs | 200.4 | 77259 | 37334441 |
| News | 196.3 | 77259 | 2643972 |
In order to move ahead with building the prediction model, an appropriate sample of the data needs to be taken. Taking too large a sample can compromise the speed of word prediction, so we will restrict ourselves to a sample of 5000 lines each from the News, Twitter and Blogs datasets.
set.seed(5)
sample_tweet <- sample(line_tw, size = 5000)
sample_blog <- sample(line_blog, size = 5000)
sample_news <- sample(line_news, size = 5000)
sample_all <- as.character(rbind(sample_tweet, sample_blog, sample_news))
The tokenize_fun function is created to clean the data and remove profanity (using the profanity_alvarez word list from the lexicon package). The sample_all dataset is passed to the function, cleaned, and stored back in sample_all.

############# 1. Tokenization ############################
tokenize_fun <- function(x){
# convert to lowercase
x <- tolower(x)
###################### 2. Removing Profanity #######################
# getting the profane list of words from lexicon library profanity_alvarez
profane <- unique(tolower(lexicon::profanity_alvarez))
profane <- gsub("\\(", "c", profane)
# remove profane words from the text being cleaned (x), not the global sample_all
for (i in 1:length(profane)){
x <- gsub(profane[i],"", x)
}
# remove punctuation
x <- removePunctuation(x)
# remove numbers
x <- removeNumbers(x)
# Removing blank line
x <- x[which(x != " ")]
# Removing all Special Characters apart from alphabets
x <- gsub("[^a-zA-Z]", " " , x )
# collapse extra whitespace
x <- stripWhitespace(x)
return(x)
}
sample_all <- tokenize_fun(sample_all)
# remove orphaned suffix fragments (ing, ed, ies, er, s, e) left behind after removing profanity
sample_all <- gsub("[^a-z]ing |[^a-z]ed |[^a-z]ies |[^a-z]er |[^a-z]s |[^a-z]e ", " ", sample_all)
sample_all stores all the clean data and will be used as the base for N-Gram modeling. sample_all_1 is created just to check the occurrence of words after removing articles, fillers, etc. This variable is for illustration purposes only.

The tokenize_ngrams function is used to break sample_all_1 into 1-grams. This is for illustration purposes.
# Block to remove articles and fillers for Plot
sample_all_1 <- gsub("[^a-z]is |[^a-z]am |[^a-z]are |[^a-z]an |[^a-z]the |[^a-z]so ",
" ", sample_all)
sample_all_1 <- gsub("[^a-z]was |[^a-z]were |[^a-z]a |[^a-z]in |[^a-z]on |[^a-z]to |[^a-z]if ",
" ", sample_all_1)
sample_all_1 <- gsub("[^a-z]and |[^a-z]of |[^a-z]we |[^a-z]you |[^a-z]at |[^a-z]as |[^a-z]or ",
" ", sample_all_1)
sample_all_1 <- gsub("[^a-z]his |[^a-z]that |[^a-z]they |[^a-z]for |[^a-z]it |[^a-z]my ",
" ", sample_all_1)
sample_all_1 <- gsub("[^a-z]has |[^a-z]have |[^a-z]this |[^a-z]not |[^a-z]her |[^a-z]or |[^a-z]i ",
" ", sample_all_1)
sample_all_1 <- gsub("[^a-z]he |[^a-z]be |[^a-z]she |[^a-z]by |[^a-z]a |[^a-z]in |[^a-z]its ",
" ", sample_all_1)
# Tokenizing with ngram where n= 1
sample_all_1 <- tokenize_ngrams(sample_all_1, n = 1)
sample_all_2 is created from sample_all using the tokenize_ngrams function with n = 2 for 2-Gram modeling. sample_all_3 is created with n = 3 for 3-Gram modeling. sample_all_4 is created with n = 4 for 4-Gram modeling. sample_word is created with n = 1 for 1-Gram modeling.

# Tokenizing with ngram where n= 2
sample_all_2 <- tokenize_ngrams(sample_all, n = 2)
# Tokenizing with ngram where n= 3
sample_all_3 <- tokenize_ngrams(sample_all, n = 3)
# Tokenizing with ngram where n= 4
sample_all_4 <- tokenize_ngrams(sample_all, n = 4)
# Tokenizing with ngram where n= 1
sample_word <- tokenize_ngrams(sample_all, n = 1)
The s1 dataframe is created from sample_all_1 by first unlisting it, then tabulating the occurrences and arranging them in descending order. This is for illustration purposes.

The s2 dataframe is created from sample_all_2 in the same way, **for the 2-Gram model**. The s3 dataframe is created from sample_all_3, **for the 3-Gram model**. The s4 dataframe is created from sample_all_4, **for the 4-Gram model**. The w1 dataframe is created from sample_word, **for the 1-Gram model**.
# Creating dataframe for ngram and store it in descending order frequency
s1 <- cbind.data.frame(table(unlist(sample_all_1))) %>%
arrange(desc(Freq))
s2 <- cbind.data.frame(table(unlist(sample_all_2))) %>%
arrange(desc(Freq))
s3 <- cbind.data.frame(table(unlist(sample_all_3))) %>%
arrange(desc(Freq))
s4 <- cbind.data.frame(table(unlist(sample_all_4))) %>%
arrange(desc(Freq))
w1 <- cbind.data.frame(table(unlist(sample_word))) %>%
arrange(desc(Freq))
s1 (illustration dataset) is filtered on a minimum of 350 occurrences.

wordcloud(words = s1$Var1, freq = s1$Freq, min.freq = 350, colors = brewer.pal(8,"Paired"))

w1 is filtered on a minimum of 500 occurrences.

wordcloud(words = w1$Var1, freq = w1$Freq, min.freq = 500, colors = brewer.pal(8,"Paired"))

s2 is filtered on a minimum of 200 occurrences. Creation of the bar plot for the 2-Gram model.

wordcloud(words = s2$Var1, freq = s2$Freq, min.freq = 200, colors = brewer.pal(8,"Paired"))
# BarPlot for 2 Ngram model
ggplot(data=s2[1:20,], aes(x=reorder(Var1,-Freq), y = Freq) )+
geom_bar(stat = "identity", fill ="skyblue", color="blue") +
theme(axis.text.x=element_text(angle=90)) +
xlab("2Gram words ") +
ylab("Frequency of Words")
s3 is filtered on a minimum of 42 occurrences. Creation of the bar plot for the 3-Gram model.

wordcloud(words = s3$Var1, freq = s3$Freq, min.freq = 42, colors = brewer.pal(8,"Paired"))
# BarPlot for 3 Ngram model
ggplot(data=s3[1:20,], aes(x=reorder(Var1,-Freq), y = Freq) )+
geom_bar(stat = "identity", fill ="skyblue", color="blue") +
theme(axis.text.x=element_text(angle=90)) +
xlab("3Gram words ") +
ylab("Frequency of Words")
s4 is filtered on a minimum of 8 occurrences. Creation of the bar plot for the 4-Gram model.

wordcloud(words = s4$Var1, freq = s4$Freq, min.freq = 8, colors = brewer.pal(8,"Paired"))
# BarPlot for 4 Ngram model
ggplot(data=s4[1:20,], aes(x=reorder(Var1,-Freq), y = Freq) )+
geom_bar(stat = "identity", fill ="skyblue", color="blue") +
theme(axis.text.x=element_text(angle=90)) +
xlab("4Gram words ") +
ylab("Frequency of Words")
The coverage function is created to check how many words are needed for a given percentage of coverage. A sorted frequency vector and a percentage are passed as input, and it returns the number of words.

coverage <- function(object, percent) {
cover <- 0
sumCover <- sum(object)
for(i in 1:length(object)) {
cover <- cover + object[i]
if(cover >= percent*(sumCover)){break}
}
return(i)
}
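The coverage function is not applied in this section; a short usage sketch (an assumption on my part, applied to the already-sorted w1 frequencies) would look like this:

# Hypothetical check: how many unique words cover 50% and 90% of all
# word occurrences in the sample (w1$Freq is already sorted descending)
coverage(w1$Freq, 0.5)
coverage(w1$Freq, 0.9)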
Separating the dataframes into a words column and an out column, where the out column contains the last word of each N-Gram. This out column will act as the predicted word for the words column.

These dataframes will be used for predicting words; a minimal lookup sketch follows the code below.
s4 <- separate(s4, Var1, into = c("words", "out"), sep = " (?=[^ ]+$)")
s3 <- separate(s3, Var1, into = c("words", "out"), sep = " (?=[^ ]+$)")
s2 <- separate(s2, Var1, into = c("words", "out"), sep = " (?=[^ ]+$)")
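The actual prediction code is built in a later phase; the following is only a minimal back-off sketch (the predict_next name and logic are my assumptions, not the final model). It matches the last three, two, then one word(s) of the input against the words column of s4, s3 and s2, and returns the most frequent out value.

# Minimal back-off lookup sketch (predict_next is a hypothetical helper,
# not part of the original report). The n-gram tables are already sorted
# by descending frequency, so the first match is the most frequent one.
predict_next <- function(phrase) {
  tokens <- unlist(strsplit(tolower(phrase), "\\s+"))
  for (n in 3:1) {
    if (length(tokens) >= n) {
      key <- paste(tail(tokens, n), collapse = " ")
      df  <- list(s2, s3, s4)[[n]]          # n preceding words -> (n+1)-gram table
      hit <- df[df$words == key, ]
      if (nrow(hit) > 0) return(hit$out[1])
    }
  }
  as.character(w1$Var1[1])                  # fall back to the most frequent unigram
}
predict_next("at the end")                  # looks up "at the end" in s4 first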
This is to show the structure and a few samples of the datasets.

**Dataset structure created from s4**
s4[1:10,]
## words out Freq
## 1 the end of the 35
## 2 the rest of the 35
## 3 at the end of 27
## 4 one of the most 27
## 5 in the middle of 26
## 6 i don t know 24
## 7 is going to be 22
## 8 is one of the 21
## 9 at the same time 19
## 10 to be able to 19
**Dataset structure created from s3**
s3[1:10,]
## words out Freq
## 1 one of the 185
## 2 a lot of 151
## 3 it was a 81
## 4 as well as 78
## 5 i don t 75
## 6 to be a 70
## 7 the end of 66
## 8 out of the 65
## 9 going to be 64
## 10 i want to 63
**Dataset structure created from s2**
s2[1:10,]
## words out Freq
## 1 of the 2029
## 2 in the 1925
## 3 to the 1001
## 4 on the 898
## 5 for the 839
## 6 to be 679
## 7 at the 630
## 8 and the 619
## 9 in a 536
## 10 it was 486
**Dataset structure created from w1**
w1[1:10,]
## Var1 Freq
## 1 the 21921
## 2 to 12176
## 3 and 11307
## 4 a 10653
## 5 of 9463
## 6 in 7474
## 7 i 6852
## 8 that 4724
## 9 for 4568
## 10 is 4519
The part below will be executed to store the files in RDS format.
In order to build the N-gram model, I've processed the N-grams by breaking them into several different indexes. Each of the N-grams is saved as an intermediary file, as follows:
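The actual file names and saving code are not shown here; a minimal sketch of this storage step, assuming one RDS file per n-gram table (file names are placeholders, not the ones used in the project), could be:

# Sketch of the intermediary storage step; file names are assumptions
saveRDS(w1, "unigram.rds")
saveRDS(s2, "bigram.rds")
saveRDS(s3, "trigram.rds")
saveRDS(s4, "quadgram.rds")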
Shiny App Functionality:
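The app itself belongs to a later phase; the outline below is only a hedged sketch of the intended behaviour (all names and layout are my assumptions): the user types a phrase and the app displays the predicted next word via a lookup such as the predict_next sketch above.

library(shiny)

# Hypothetical outline of the planned app, not the final implementation
ui <- fluidPage(
  titlePanel("SwiftKey Next Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  verbatimTextOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    predict_next(input$phrase)   # hypothetical lookup sketched earlier
  })
}

shinyApp(ui = ui, server = server)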