The data provided contains three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. These files will first be cleaned and then used to create different n-grams. The n-grams will in turn be used to build a Markov-chain transition matrix whose probabilities are estimated with Katz's back-off. First, a thorough exploration of the files is in order.
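For reference, the bigram form of Katz's back-off that the transition matrix will eventually hold can be written as

$$
P_{katz}(w_i \mid w_{i-1}) =
\begin{cases}
d_{w_{i-1}w_i}\,\dfrac{C(w_{i-1}w_i)}{C(w_{i-1})} & \text{if } C(w_{i-1}w_i) > k \\
\alpha_{w_{i-1}}\,P(w_i) & \text{otherwise}
\end{cases}
$$

where $C(\cdot)$ is a count taken from the n-gram tables built below, $d$ is a discount factor (typically estimated with Good-Turing), $k$ is a small count threshold (often 0), and $\alpha$ is the back-off weight that redistributes the discounted probability mass to the lower-order model.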
Loading necessary libraries.
# Loading needed libraries
library(tidyverse)
library(cld3)
library(wordcloud)
library(ngram)
library(knitr)
library(plyr)
library(gdata)
library(R.utils)
library(tm)
library(kableExtra)
1. Download the Coursera-Swiftkey.zip from the given link (a minimal sketch of steps 1 and 2 follows this list).
2. Uncompress the file.
3. The en_US folder contains three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
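A minimal sketch of steps 1 and 2, assuming the archive is saved in the working directory; zip_url below is a placeholder for the given link, not the real address:

# Sketch of steps 1 and 2; zip_url is a placeholder for the given link
zip_url<-"https://example.com/Coursera-SwiftKey.zip"   # placeholder, replace with the course link
if (!file.exists("Coursera-SwiftKey.zip")) {
download.file(zip_url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
unzip("Coursera-SwiftKey.zip", exdir = ".")   # extracts the final/en_US/ folder used below
list.files("final/en_US")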
1. Count the lines in each file.
2. Count the words in each file.
# Counting lines for each file
linecount1<-countLines("/Users/mahmoudelsheikh/Google Drive/Coursera Data analysis/Courseraworkingspace/final/en_US/en_US.blogs.txt")
linecount2<-countLines("/Users/mahmoudelsheikh/Google Drive/Coursera Data analysis/Courseraworkingspace/final/en_US/en_US.news.txt")
linecount3<-countLines("/Users/mahmoudelsheikh/Google Drive/Coursera Data analysis/Courseraworkingspace/final/en_US/en_US.twitter.txt")
# Reading en_US.blogs.txt
con1<-file("/Users/mahmoudelsheikh/Google Drive/Coursera Data analysis/Courseraworkingspace/final/en_US/en_US.blogs.txt","r")
txt1<-readLines(con1)
close(con1)
# Reading en_US.news.txt
con2<-file("/Users/mahmoudelsheikh/Google Drive/Coursera Data analysis/Courseraworkingspace/final/en_US/en_US.news.txt","r")
txt2<-readLines(con2)
close(con2)
# Reading en_US.twitter.txt
con3<-file("/Users/mahmoudelsheikh/Google Drive/Coursera Data analysis/Courseraworkingspace/final/en_US/en_US.twitter.txt","r")
txt3<-readLines(con3)
close(con3)
# Counting words for each file
Word_count<-c(wordcount(txt1),wordcount(txt2),wordcount(txt3))
# Combining data to create a result table
Line_count<-c(linecount1,linecount2,linecount3)
Source_file<-c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt")
Result<-cbind.data.frame('Source file' = Source_file,'Number of lines'= Line_count,'Number of Words'= Word_count)
kable(Result,caption = "Table 1: Line and word counts for each English text file") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Source file | Number of lines | Number of Words |
|---|---|---|
| en_US.blogs.txt | 899288 | 37334131 |
| en_US.news.txt | 1010242 | 34372530 |
| en_US.twitter.txt | 2360148 | 30373543 |
The function does the following:
1. Transform all text to lower case.
2. Detect the language of each line and keep only English lines, as other languages are out of the scope of the application.
3. Remove punctuation, except intra-word contractions (e.g. the apostrophe in "don't") and intra-word dashes ("-").
4. Remove numbers, as numbers are out of the scope of the application.
5. Collapse and trim extra white space in all text.
6. Remove any lines left as NA after cleaning.
cleantext<-function(text_vector) {
x<-text_vector
# Transforming all letters to lower case
x<-tolower(x)
# Detecting language and keeping only lines classified as English
cattxt<-detect_language(x)
x<-x[cattxt == "en"]
# Removing punctuation, keeping intra-word dashes and contractions
x<-removePunctuation(x,preserve_intra_word_contractions = TRUE,preserve_intra_word_dashes = TRUE)
# Removing numbers
x<-removeNumbers(x)
# Collapsing runs of white space to single spaces, then trimming leading/trailing spaces
x<-stripWhitespace(x)
x<-trimws(x)
# Removing lines that are NA after cleaning (lines whose language could not be detected)
x<-x[!is.na(x)]
x
}
# Cleaning and combining all texts into one vector
cleantxt<-cleantext(c(txt1,txt2,txt3))
A count of the words in each line will prove useful in the tokenization step: it makes it possible to exclude any line that contains fewer than 2 words when creating bigrams, and any line that contains fewer than 3 words when creating trigrams.
# Creating the wordslength vector: counting words in each line of the cleaned text
wordslength<-sapply(cleantxt, wordcount, USE.NAMES = FALSE)
head(wordslength,10)
## [1] 16 139 40 109 9 54 45 137 168 56
# Tokenization
ng1<-ngram(cleantxt,n=1,sep = " ")
# Getting the phrase table and arranging it in descending order of frequency
ng1tbl<-get.phrasetable(ng1) %>% arrange(desc(freq)) %>% mutate(ngram_length = 1) %>% mutate(ngram = trimws(ngrams)) %>% select(ngram,freq,ngram_length)
# Creating preword and currentword columns
ng1tbl<-ng1tbl %>% mutate(preword= "",currentword = "")
# Removing NA rows from the currentword column (to make sure there are no erroneous rows in the table)
ng1tbl<-ng1tbl[!is.na(ng1tbl$currentword),]
# Unigram phrase table structure
str(ng1tbl)
## 'data.frame': 674868 obs. of 5 variables:
## $ ngram : chr "the" "and" "to" "a" ...
## $ freq : int 3788280 1954149 1953250 1752264 1631355 1255283 902961 800112 709195 707759 ...
## $ ngram_length: num 1 1 1 1 1 1 1 1 1 1 ...
## $ preword : chr "" "" "" "" ...
## $ currentword : chr "" "" "" "" ...
# View the 30 most frequent Unigram tokens
topuni<-ng1tbl[1:30,]
kable(topuni,caption = "Table 2: Top 30 Unigram tokens") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
scroll_box(width = "100%", height = "300px")
| ngram | freq | ngram_length | preword | currentword |
|---|---|---|---|---|
| the | 3788280 | 1 | | |
| and | 1954149 | 1 | | |
| to | 1953250 | 1 | | |
| a | 1752264 | 1 | | |
| of | 1631355 | 1 | | |
| in | 1255283 | 1 | | |
| i | 902961 | 1 | | |
| that | 800112 | 1 | | |
| is | 709195 | 1 | | |
| for | 707759 | 1 | | |
| it | 608704 | 1 | | |
| with | 537459 | 1 | | |
| on | 534547 | 1 | | |
| was | 503012 | 1 | | |
| as | 407167 | 1 | | |
| you | 383818 | 1 | | |
| at | 376235 | 1 | | |
| this | 375065 | 1 | | |
| he | 367221 | 1 | | |
| have | 359531 | 1 | | |
| be | 358010 | 1 | | |
| but | 350723 | 1 | | |
| are | 330165 | 1 | | |
| my | 307037 | 1 | | |
| from | 297732 | 1 | | |
| not | 282670 | 1 | | |
| said | 279947 | 1 | | |
| we | 277012 | 1 | | |
| his | 265005 | 1 | | |
| by | 251144 | 1 | | |
# Graph showing top 30 Unigram tokens
topunig<-ggplot(topuni,aes(reorder(ngram,-freq),freq))+
  geom_col()+
  ggtitle("Most frequent Unigram tokens")+
  xlab("Unigram tokens")+
  ylab("Frequency")+
  theme(plot.title = element_text(size = 20),
        axis.title = element_text(color = "black",face = "italic",size = 14,hjust = 0),
        axis.text.x = element_text(angle = 45, hjust = 1, size = 14),
        axis.text.y = element_text(size = 14))
topunig
Figure 1: Bar chart showing the top 30 Unigram tokens
# Creating a word cloud for the top 500 Unigram tokens
pal<-brewer.pal(8,"Dark2")
topuni500<-ng1tbl[1:500,]
wordcuni<-wordcloud(words = topuni500$ngram,freq = topuni500$freq,colors=pal,random.color = TRUE,random.order = FALSE,rot.per = 0)
Figure 2: Wordcloud showing the top 500 Unigram tokens
# Tokenization after excluding any line that has fewer than 2 words
ng2<-ngram(cleantxt[wordslength != 1],n=2,sep= " ")
# Getting the phrase table and arranging it in descending order of frequency
ng2tbl<-get.phrasetable(ng2) %>% arrange(desc(freq)) %>% mutate(ngram_length = 2) %>% mutate(ngram = trimws(ngrams)) %>% select(ngram,freq,ngram_length)
# Creating preword and currentword columns
ng2tbl<-ng2tbl %>% mutate(preword= word(ngram,start = 1L,end = 1L),currentword = word(ngram,start = 2L,end = 2L))
# Removing NA rows from the currentword column (to make sure there are no erroneous rows in the table)
ng2tbl<-ng2tbl[!is.na(ng2tbl$currentword),]
# Bigram phrase table structure
str(ng2tbl)
## 'data.frame': 11010574 obs. of 5 variables:
## $ ngram : chr "of the" "in the" "to the" "on the" ...
## $ freq : int 371214 330482 169291 146781 126138 114206 109894 104671 95831 86955 ...
## $ ngram_length: num 2 2 2 2 2 2 2 2 2 2 ...
## $ preword : chr "of" "in" "to" "on" ...
## $ currentword : chr "the" "the" "the" "the" ...
# View the 30 most frequent Bigram tokens
topbi<-ng2tbl[1:30,]
kable(topbi,caption = "Table 3: Top 30 Bigram tokens") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
scroll_box(width = "100%", height = "300px")
| ngram | freq | ngram_length | preword | currentword |
|---|---|---|---|---|
| of the | 371214 | 2 | of | the |
| in the | 330482 | 2 | in | the |
| to the | 169291 | 2 | to | the |
| on the | 146781 | 2 | on | the |
| for the | 126138 | 2 | for | the |
| to be | 114206 | 2 | to | be |
| and the | 109894 | 2 | and | the |
| at the | 104671 | 2 | at | the |
| in a | 95831 | 2 | in | a |
| with the | 86955 | 2 | with | the |
| from the | 74798 | 2 | from | the |
| it was | 73900 | 2 | it | was |
| is a | 73461 | 2 | is | a |
| with a | 67561 | 2 | with | a |
| of a | 66655 | 2 | of | a |
| for a | 64023 | 2 | for | a |
| it is | 63203 | 2 | it | is |
| as a | 59637 | 2 | as | a |
| i was | 59352 | 2 | i | was |
| and i | 58915 | 2 | and | i |
| one of | 56664 | 2 | one | of |
| that the | 55499 | 2 | that | the |
| will be | 54971 | 2 | will | be |
| i have | 53789 | 2 | i | have |
| by the | 53205 | 2 | by | the |
| is the | 49219 | 2 | is | the |
| and a | 48547 | 2 | and | a |
| the first | 47073 | 2 | the | first |
| to a | 45510 | 2 | to | a |
| i am | 45411 | 2 | i | am |
# Graph showing top 30 Bigram tokens
topbig<-ggplot(topbi,aes(reorder(ngram,-freq),freq))+
  geom_col()+
  ggtitle("Most frequent Bigram tokens")+
  xlab("Bigram tokens")+
  ylab("Frequency")+
  theme(plot.title = element_text(size = 20),
        axis.title = element_text(color = "black",face = "italic",size = 14,hjust = 0),
        axis.text.x = element_text(angle = 45, hjust = 1, size = 14),
        axis.text.y = element_text(size = 14))
topbig
Figure 3: Bar chart showing the top 30 Bigram tokens
# Creating a word cloud for the top 500 Bigram tokens
pal<-brewer.pal(8,"Dark2")
topbi500<-ng2tbl[1:500,]
wordcbi<-wordcloud(words = topbi500$ngram,freq = topbi500$freq,colors=pal,random.color = TRUE,random.order = FALSE,fixed.asp = TRUE)
Figure 4: Wordcloud showing the top 500 Bigram tokens
# Tokenization after excluding any line that has fewer than 3 words
ng3<-ngram(cleantxt[wordslength != 1 & wordslength != 2],n=3,sep = " ")
# Getting the phrase table and arranging it in descending order of frequency
ng3tbl<-get.phrasetable(ng3) %>% arrange(desc(freq)) %>% mutate(ngram_length = 3) %>% mutate(ngram = trimws(ngrams)) %>% select(ngram,freq,ngram_length)
# Creating preword and currentword columns (the steps are separated here because, at ~34 million rows, building the columns first and binding them afterwards takes less time and processing power than the mutate() chain used for the smaller n-gram tables)
prewordng3tbl<-word(ng3tbl$ngram,start = 1L,end = 2L)
currentwordng3tbl<-word(ng3tbl$ngram,start = 3L,end = 3L)
# stringsAsFactors = FALSE keeps the new columns as character, matching the Unigram and Bigram tables
ng3tbl<-cbind.data.frame(ng3tbl,preword=prewordng3tbl,currentword=currentwordng3tbl,stringsAsFactors = FALSE)
# Removing NA rows from the currentword column (to make sure there are no erroneous rows in the table)
ng3tbl<-ng3tbl[!is.na(ng3tbl$currentword),]
# Trigram phrase table structure
str(ng3tbl)
## 'data.frame': 34084787 obs. of 5 variables:
## $ ngram : chr "one of the" "a lot of" "as well as" "some of the" ...
## $ freq : int 28756 23469 13040 12110 12110 12023 11979 10788 10605 10257 ...
## $ ngram_length: num 3 3 3 3 3 3 3 3 3 3 ...
## $ preword : chr "one of" "a lot" "as well" "some of" ...
## $ currentword : chr "the" "of" "as" "the" ...
# View the 30 most frequent Trigram tokens
toptri<-ng3tbl[1:30,]
kable(toptri,caption = "Table 4: Top 30 Trigram tokens") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
scroll_box(width = "100%", height = "300px")
| ngram | freq | ngram_length | preword | currentword |
|---|---|---|---|---|
| one of the | 28756 | 3 | one of | the |
| a lot of | 23469 | 3 | a lot | of |
| as well as | 13040 | 3 | as well | as |
| some of the | 12110 | 3 | some of | the |
| to be a | 12110 | 3 | to be | a |
| the end of | 12023 | 3 | the end | of |
| out of the | 11979 | 3 | out of | the |
| it was a | 10788 | 3 | it was | a |
| part of the | 10605 | 3 | part of | the |
| be able to | 10257 | 3 | be able | to |
| going to be | 9920 | 3 | going to | be |
| a couple of | 8734 | 3 | a couple | of |
| the rest of | 8445 | 3 | the rest | of |
| this is a | 7967 | 3 | this is | a |
| the first time | 7692 | 3 | the first | time |
| the fact that | 7632 | 3 | the fact | that |
| i want to | 7631 | 3 | i want | to |
| end of the | 7578 | 3 | end of | the |
| according to the | 7313 | 3 | according to | the |
| there is a | 7257 | 3 | there is | a |
| in the first | 7213 | 3 | in the | first |
| most of the | 6907 | 3 | most of | the |
| the united states | 6729 | 3 | the united | states |
| at the end | 6579 | 3 | at the | end |
| it is a | 6530 | 3 | it is | a |
| this is the | 6332 | 3 | this is | the |
| is one of | 6304 | 3 | is one | of |
| it would be | 6177 | 3 | it would | be |
| i have to | 6151 | 3 | i have | to |
| for the first | 5966 | 3 | for the | first |
# Graph showing top 30 Trigram tokens
toptrig<-ggplot(toptri,aes(reorder(ngram,-freq),freq))+
  geom_col()+
  ggtitle("Most frequent Trigram tokens")+
  xlab("Trigram tokens")+
  ylab("Frequency")+
  theme(plot.title = element_text(size = 20),
        axis.title = element_text(color = "black",face = "italic",size = 14,hjust = 0),
        axis.text.x = element_text(angle = 45, hjust = 1, size = 14),
        axis.text.y = element_text(size = 14))
toptrig
Figure 5: Bar chart showing the top 30 Trigram tokens
# Creating a word cloud for the top 500 Trigram tokens
pal<-brewer.pal(8,"Dark2")
toptri500<-ng3tbl[1:500,]
# Frequencies are doubled for display scaling so the lower-frequency trigrams remain legible
wordctri<-wordcloud(words = toptri500$ngram,freq = toptri500$freq*2,colors=pal,random.color = TRUE,random.order = FALSE,rot.per = 0)
Figure 6: Wordcloud showing the top 500 Trigram tokens
1. Define a suitable sample size that balances prediction accuracy against the required computing time and power.
2. Create functions to calculate Katz's back-off probabilities and save them as a Markov transition matrix (a minimal sketch follows below).
3. Create a Shiny app that takes a preword/history and gives three word suggestions.
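As a starting point for step 2, here is a minimal sketch of a back-off next-word lookup built on the ng1tbl and ng2tbl tables created above. The suggest_next() helper and the constant 0.4 back-off weight are illustrative assumptions in the spirit of stupid back-off, not the final Katz implementation, which will add Good-Turing discounting to set the back-off mass properly.

# Sketch only: suggest_next() and the fixed 0.4 weight are assumptions,
# not the final Katz back-off (which will use Good-Turing discounts)
suggest_next<-function(history, n = 3) {
h<-word(tolower(trimws(history)), -1)   # condition on the last word only (bigram model)
seen<-ng2tbl[ng2tbl$preword == h,]
if (nrow(seen) > 0) {
# Relative bigram frequencies C(h,w)/C(h) for observed continuations
probs<-data.frame(word = seen$currentword, prob = seen$freq/sum(seen$freq))
} else {
# Unseen history: back off to scaled unigram probabilities
probs<-data.frame(word = ng1tbl$ngram, prob = 0.4*ng1tbl$freq/sum(ng1tbl$freq))
}
head(probs[order(-probs$prob),], n)
}
suggest_next("one of")   # should rank "the" first, in line with Table 3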