1. Synopsis:

The data provided contains three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. These files will first be cleaned and then used to create different n-grams. The n-grams will in turn be used to build a Markov chain transition matrix using Katz's back-off probability calculations. First, however, a thorough exploration of the files is in order.

2. Loading libraries:

Loading the necessary libraries.

# Loading needed libraries (note: plyr is loaded after the tidyverse, so plyr's
# versions of some verbs mask dplyr's; use dplyr:: prefixes if conflicts arise)
library(tidyverse); library(cld3); library(wordcloud); library(ngram); library(knitr)
library(plyr); library(gdata); library(R.utils); library(tm); library(kableExtra)

3. Downloading files:

1. Download Coursera-SwiftKey.zip from the given link.
2. Uncompress the file.
3. The en_US folder contains three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt (see the sketch after this list).
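
These steps can be scripted in R. The following is a minimal sketch, assuming the dataset URL given in the course materials; the destination file name is illustrative.

# Downloading and uncompressing the data (run once)
zipurl<-"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(zipurl,destfile = "Coursera-SwiftKey.zip")
}
unzip("Coursera-SwiftKey.zip")  # extracts the final/en_US folder used below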

4. Exploration of the en_US files:

1. Counting lines in each file.
2. Counting words in each file.

# Counting lines for each file
linecount1<-countLines("/Users/mahmoudelsheikh/Google Drive/Coursera Data analysis/Courseraworkingspace/final/en_US/en_US.blogs.txt")
linecount2<-countLines("/Users/mahmoudelsheikh/Google Drive/Coursera Data analysis/Courseraworkingspace/final/en_US/en_US.news.txt")
linecount3<-countLines("/Users/mahmoudelsheikh/Google Drive/Coursera Data analysis/Courseraworkingspace/final/en_US/en_US.twitter.txt")

# Reading en_US.blog.txt
con1<-file("/Users/mahmoudelsheikh/Google Drive/Coursera Data analysis/Courseraworkingspace/final/en_US/en_US.blogs.txt","r")
txt1<-readLines(con1)
close(con1)
 
# Reading en_US.news.txt
con2<-file("/Users/mahmoudelsheikh/Google Drive/Coursera Data analysis/Courseraworkingspace/final/en_US/en_US.news.txt","r")
txt2<-readLines(con2)
close(con2)

# Reading en_US.twitter.txt
con3<-file("/Users/mahmoudelsheikh/Google Drive/Coursera Data analysis/Courseraworkingspace/final/en_US/en_US.twitter.txt","r")
txt3<-readLines(con3)
close(con3)
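
Note: on some platforms readLines() stops early on en_US.news.txt because the file contains embedded nul characters. If the line counts above look short, a common workaround is to open the file in binary mode and skip the nuls, as in this sketch:

# Workaround sketch: read en_US.news.txt in binary mode, skipping embedded nuls
con2<-file("/Users/mahmoudelsheikh/Google Drive/Coursera Data analysis/Courseraworkingspace/final/en_US/en_US.news.txt","rb")
txt2<-readLines(con2,skipNul = TRUE)
close(con2)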

# Counting words for each file
Word_count<-c(wordcount(txt1),wordcount(txt2),wordcount(txt3))

# Combining data to create a result table
Line_count<-c(linecount1,linecount2,linecount3)
Source_file<-c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt")
Result<-cbind.data.frame('Source file' = Source_file,'Number of lines'= Line_count,'Number of Words'= Word_count)
kable(Result,caption = "Table 1: Line counts and word counts for each English text file") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 1: Line counts and word counts for each English text file

Source file         Number of lines   Number of Words
en_US.blogs.txt              899288          37334131
en_US.news.txt              1010242          34372530
en_US.twitter.txt           2360148          30373543

5. Creating a function to clean text:

The function does the following:
1. Transforms all text to lower case.
2. Removes punctuation, except intra-word contractions (the apostrophe in "don't") and intra-word dashes ("-").
3. Removes numbers, as numbers are out of the scope of the application.
4. Removes extra white space in all text files.
5. Detects the language of each line and keeps only English, as other languages are out of the scope of the application.
6. Removes any NA lines left after cleaning.

cleantext<-function(text_vector) {
  x<-text_vector
  # Transforming all letters to lower case
  x<-tolower(x)
  # Detecting language and keeping only lines classified as English
  cattxt<-detect_language(x)
  x<-x[cattxt == "en"]
  # Removing punctuation, keeping only intra-word dashes and contractions
  x<-removePunctuation(x,preserve_intra_word_contractions = TRUE,preserve_intra_word_dashes = TRUE)
  # Removing numbers
  x<-removeNumbers(x)
  # Collapsing runs of white space to single spaces and trimming line ends
  x<-stripWhitespace(x)
  x<-trimws(x)
  # Removing lines that are NA after cleaning and returning the result
  x<-x[!is.na(x)]
  x
}
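
A quick sanity check on made-up input (the sample strings below are illustrative, not from the corpus): the German line should be dropped by the language filter, though cld3 can be unreliable on very short strings.

# Illustrative check: punctuation, numbers, and the non-English line should go
cleantext(c("Hello there, this is a simple English test sentence from 2014!",
            "Dies ist ein deutscher Beispielsatz, der entfernt werden sollte."))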

6. Cleaning the files:

# Cleaning and combining all texts into one vector
cleantxt<-cleantext(c(txt1,txt2,txt3))

7. Tokenization:

7.1 Counting words per line of cleaned text:

A count of the words in each line will prove useful in the tokenization step. It allows excluding any line that contains fewer than 2 words when creating bigrams, and any line that contains fewer than 3 words when creating trigrams.

# Creating the wordslength vector
wordslength<-vector(length = length(cleantxt))
# Counting words in each line of the cleaned text
for (i in seq_along(cleantxt)) {
  wordslength[i]<-wordcount(cleantxt[i])
}
head(wordslength,10)
##  [1]  16 139  40 109   9  54  45 137 168  56

7.2.1 Creating Unigrams:

# Tokenization
ng1<-ngram(cleantxt,n=1,sep = " ")
# Getting the phrase table and arranging descendingly by frequency
ng1tbl<-get.phrasetable(ng1) %>% arrange(desc(freq)) %>% mutate(ngram_length = 1) %>% mutate(ngram = trimws(ngrams)) %>% select(ngram,freq,ngram_length)
# Creating empty preword and currentword columns (they are filled for bigrams and trigrams)
ng1tbl<-ng1tbl %>% mutate(preword= "",currentword = "")
# Removing NA rows from the currentword column (to make sure there are no erroneous rows in the table)
ng1tbl<-ng1tbl[!is.na(ng1tbl$currentword),]

7.2.2 Unigram graphs and table:

# Unigram phrase table structure
str(ng1tbl)
## 'data.frame':    674868 obs. of  5 variables:
##  $ ngram       : chr  "the" "and" "to" "a" ...
##  $ freq        : int  3788280 1954149 1953250 1752264 1631355 1255283 902961 800112 709195 707759 ...
##  $ ngram_length: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ preword     : chr  "" "" "" "" ...
##  $ currentword : chr  "" "" "" "" ...
# View the 30 most frequent Unigram tokens
topuni<-ng1tbl[1:30,]
kable(topuni,caption = "Table 2: Top 30 Unigram tokens") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(width = "100%", height = "300px")
Table 2: Top 30 Unigram tokens

ngram   freq      ngram_length   preword   currentword
the     3788280   1
and     1954149   1
to      1953250   1
a       1752264   1
of      1631355   1
in      1255283   1
i        902961   1
that     800112   1
is       709195   1
for      707759   1
it       608704   1
with     537459   1
on       534547   1
was      503012   1
as       407167   1
you      383818   1
at       376235   1
this     375065   1
he       367221   1
have     359531   1
be       358010   1
but      350723   1
are      330165   1
my       307037   1
from     297732   1
not      282670   1
said     279947   1
we       277012   1
his      265005   1
by       251144   1
# Graph showing the top 30 Unigram tokens
topunig<-ggplot(topuni,aes(reorder(ngram,-freq),freq))+geom_col()+
  ggtitle("Most frequent Unigram tokens")+xlab("Unigram tokens")+ylab("Frequency")+
  theme(plot.title = element_text(size = 20),
        axis.title = element_text(color = "black",face = "italic",size = 14,hjust = 0),
        axis.text.x = element_text(angle = 45, hjust = 1, size = 14),
        axis.text.y = element_text(size = 14))

topunig
Figure 1: Bar chart showing the top 30 Unigram tokens

# Creating a word cloud for the top 500 Unigram tokens
pal<-brewer.pal(8,"Dark2")
topuni100<-ng1tbl[1:500,]
wordcuni<-wordcloud(words = topuni100$ngram,freq = topuni100$freq,colors=pal,random.color = TRUE,random.order = FALSE,rot.per = 0)
Figure 2: Word cloud showing the top 500 Unigram tokens

7.3.1 Creating Bigrams:

# Tokenization after excluding any line that has fewer than 2 words
ng2<-ngram(cleantxt[wordslength >= 2],n=2,sep= " ")
# Getting the phrase table and arranging descendingly by frequency
ng2tbl<-get.phrasetable(ng2) %>% arrange(desc(freq)) %>% mutate(ngram_length = 2) %>% mutate(ngram = trimws(ngrams)) %>% select(ngram,freq,ngram_length)
# Creating preword and currentword columns
ng2tbl<-ng2tbl %>% mutate(preword= word(ngram,start = 1L,end = 1L),currentword = word(ngram,start = 2L,end = 2L))
# Removing NA rows from the currentword column (to make sure there are no erroneous rows in the table)
ng2tbl<-ng2tbl[!is.na(ng2tbl$currentword),]

7.3.2 Bigram graphs and table:

# Bigram phrase table structure
str(ng2tbl)
## 'data.frame':    11010574 obs. of  5 variables:
##  $ ngram       : chr  "of the" "in the" "to the" "on the" ...
##  $ freq        : int  371214 330482 169291 146781 126138 114206 109894 104671 95831 86955 ...
##  $ ngram_length: num  2 2 2 2 2 2 2 2 2 2 ...
##  $ preword     : chr  "of" "in" "to" "on" ...
##  $ currentword : chr  "the" "the" "the" "the" ...
# View the 30 most frequent Bigram tokens
topbi<-ng2tbl[1:30,]
kable(topbi,caption = "Table 3: Top 30 Bigram tokens") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(width = "100%", height = "300px")
Table 3: Top 30 Bigram tokens

ngram       freq     ngram_length   preword   currentword
of the      371214   2              of        the
in the      330482   2              in        the
to the      169291   2              to        the
on the      146781   2              on        the
for the     126138   2              for       the
to be       114206   2              to        be
and the     109894   2              and       the
at the      104671   2              at        the
in a         95831   2              in        a
with the     86955   2              with      the
from the     74798   2              from      the
it was       73900   2              it        was
is a         73461   2              is        a
with a       67561   2              with      a
of a         66655   2              of        a
for a        64023   2              for       a
it is        63203   2              it        is
as a         59637   2              as        a
i was        59352   2              i         was
and i        58915   2              and       i
one of       56664   2              one       of
that the     55499   2              that      the
will be      54971   2              will      be
i have       53789   2              i         have
by the       53205   2              by        the
is the       49219   2              is        the
and a        48547   2              and       a
the first    47073   2              the       first
to a         45510   2              to        a
i am         45411   2              i         am
# Graph showing the top 30 Bigram tokens
topbig<-ggplot(topbi,aes(reorder(ngram,-freq),freq))+geom_col()+
  ggtitle("Most frequent Bigram tokens")+xlab("Bigram tokens")+ylab("Frequency")+
  theme(plot.title = element_text(size = 20),
        axis.title = element_text(color = "black",face = "italic",size = 14,hjust = 0),
        axis.text.x = element_text(angle = 45, hjust = 1, size = 14),
        axis.text.y = element_text(size = 14))

topbig
Figure 3: Bar chart showing the top 30 Bigram tokens

# Creating a word cloud for the top 500 Bigram tokens
pal<-brewer.pal(8,"Dark2")
topbi100<-ng2tbl[1:500,]
wordcbi<-wordcloud(words = topbi100$ngram,freq = topbi100$freq,colors=pal,random.color = TRUE,random.order = FALSE,fixed.asp = TRUE)
Figure 4: Word cloud showing the top 500 Bigram tokens

7.4.1 Creating Trigrams:

# Tokenization after excluding any line that has fewer than 3 words
ng3<-ngram(cleantxt[wordslength >= 3],n=3,sep = " ")
# Getting the phrase table and arranging descendingly by frequency
ng3tbl<-get.phrasetable(ng3) %>% arrange(desc(freq)) %>% mutate(ngram_length = 3) %>% mutate(ngram = trimws(ngrams)) %>% select(ngram,freq,ngram_length)
# Creating preword and currentword columns (the steps are separated here because this takes less time and processing power than the mutate() approach used for the smaller n-grams)
prewordng3tbl<-word(ng3tbl$ngram,start = 1L,end = 2L)
currentwordng3tbl<-word(ng3tbl$ngram,start = 3L,end = 3L)
ng3tbl<-cbind(ng3tbl,preword=prewordng3tbl,currentword=currentwordng3tbl)
# Removing NA rows from the currentword column (to make sure there are no erroneous rows in the table)
ng3tbl<-ng3tbl[!is.na(ng3tbl$currentword),]

7.4.2 Trigram graphs and table:

# Trigram phrase table structure
str(ng3tbl)
## 'data.frame':    34084787 obs. of  5 variables:
##  $ ngram       : chr  "one of the" "a lot of" "as well as" "some of the" ...
##  $ freq        : int  28756 23469 13040 12110 12110 12023 11979 10788 10605 10257 ...
##  $ ngram_length: num  3 3 3 3 3 3 3 3 3 3 ...
##  $ preword     : Factor w/ 10725910 levels "\u0094 \u0097",..: 6651999 189398 878813 8596587 9598194 9278075 6809580 4922599 6937582 1115056 ...
##  $ currentword : Factor w/ 647393 levels "\u0096","\u0097",..: 571259 413112 66910 571259 38889 413112 571259 38889 571259 580382 ...
# View the 30 most frequent Trigram tokens
toptri<-ng3tbl[1:30,]
kable(toptri,caption = "Table 4: Top 30 Trigram tokens") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(width = "100%", height = "300px")
Table 4: Top 30 Trigram tokens

ngram               freq    ngram_length   preword        currentword
one of the          28756   3              one of         the
a lot of            23469   3              a lot          of
as well as          13040   3              as well        as
some of the         12110   3              some of        the
to be a             12110   3              to be          a
the end of          12023   3              the end        of
out of the          11979   3              out of         the
it was a            10788   3              it was         a
part of the         10605   3              part of        the
be able to          10257   3              be able        to
going to be          9920   3              going to       be
a couple of          8734   3              a couple       of
the rest of          8445   3              the rest       of
this is a            7967   3              this is        a
the first time       7692   3              the first      time
the fact that        7632   3              the fact       that
i want to            7631   3              i want         to
end of the           7578   3              end of         the
according to the     7313   3              according to   the
there is a           7257   3              there is       a
in the first         7213   3              in the         first
most of the          6907   3              most of        the
the united states    6729   3              the united     states
at the end           6579   3              at the         end
it is a              6530   3              it is          a
this is the          6332   3              this is        the
is one of            6304   3              is one         of
it would be          6177   3              it would       be
i have to            6151   3              i have         to
for the first        5966   3              for the        first
# Graph showing the top 30 Trigram tokens
toptrig<-ggplot(toptri,aes(reorder(ngram,-freq),freq))+geom_col()+
  ggtitle("Most frequent Trigram tokens")+xlab("Trigram tokens")+ylab("Frequency")+
  theme(plot.title = element_text(size = 20),
        axis.title = element_text(color = "black",face = "italic",size = 14,hjust = 0),
        axis.text.x = element_text(angle = 45, hjust = 1, size = 14),
        axis.text.y = element_text(size = 14))

toptrig
Figure 5: Bar chart showing the top 30 Trigram tokens

# Creating a word cloud for the top 500 Trigram tokens
pal<-brewer.pal(8,"Dark2")
toptri100<-ng3tbl[1:500,]
wordctri<-wordcloud(words = toptri100$ngram,freq = toptri100$freq*2,colors=pal,random.color = TRUE,random.order = FALSE,rot.per = 0)
Figure 6: Word cloud showing the top 500 Trigram tokens

8. Future steps:

1. Define a suitable sample size to balance prediction accuracy against the computing time and power required.
2. Create functions to calculate Katz's back-off probabilities and save them as a Markov transition matrix (a first sketch of the back-off lookup follows below).
3. Create a Shiny app that takes a preword/history and gives three word suggestions.
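
As a preview of step 2, the lookup itself can be prototyped over the tables built above: try the trigram table first, back off to the bigram table, then fall back to the most frequent unigrams. This is a minimal sketch of a simple back-off, not yet full Katz back-off (no discounting or back-off weights); the function name and the candidate selection are illustrative.

# Hypothetical sketch: suggest the n most likely next words for a given history.
# The phrase tables are already sorted by frequency, so the first matches are
# the most frequent ones; full Katz back-off would additionally discount the
# observed counts and weight the backed-off distributions.
suggestwords<-function(history,n=3) {
  history<-trimws(tolower(history))
  lastwords<-word(history,start = -2L,end = -1L)  # last two words of the history
  lastword<-word(history,start = -1L)             # last word of the history
  # Trigram candidates: rows whose preword equals the last two words
  cand<-as.character(ng3tbl[which(ng3tbl$preword == lastwords),"currentword"])
  # Backing off to Bigram candidates if there are too few Trigram matches
  if (length(cand) < n) {
    cand<-c(cand,as.character(ng2tbl[which(ng2tbl$preword == lastword),"currentword"]))
  }
  # Backing off to the most frequent Unigrams as a last resort
  if (length(cand) < n) {
    cand<-c(cand,ng1tbl$ngram)
  }
  head(unique(cand),n)
}

# Illustrative call: should return "the" among its suggestions
suggestwords("at the end of")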