The data provided contains three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. These files will first be cleaned and then used to create different n-grams. The n-grams will in turn be used to build a Markov-chain transition matrix whose probabilities are estimated with Katz's back-off. First, a thorough exploration of the files is in order.
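For reference, the bigram form of Katz's back-off that the transition matrix will eventually hold can be written as

$$
P_{katz}(w_i \mid w_{i-1}) =
\begin{cases}
d_{w_{i-1}w_i}\,\dfrac{C(w_{i-1}w_i)}{C(w_{i-1})} & \text{if } C(w_{i-1}w_i) > k \\
\alpha_{w_{i-1}}\,P(w_i) & \text{otherwise}
\end{cases}
$$

where $C(\cdot)$ is a count taken from the n-gram tables built below, $d$ is a discount factor (typically estimated with Good-Turing), $k$ is a small count threshold (often 0), and $\alpha$ is the back-off weight that redistributes the discounted probability mass to the lower-order model.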
Loading necessary libraries.
# Loading needed libraries
library(tidyverse)
library(cld3)
library(wordcloud)
library(ngram)
library(knitr)
library(plyr)
library(gdata)
library(R.utils)
library(tm)
library(kableExtra)
1. Download the Coursera-Swiftkey.zip from the given link (a minimal sketch of steps 1 and 2 follows this list).
2. Uncompress the file.
3. The en_US folder contains three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
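A minimal sketch of steps 1 and 2, assuming the archive is saved in the working directory; zip_url below is a placeholder for the given link, not the real address:

# Sketch of steps 1 and 2; zip_url is a placeholder for the given link
zip_url<-"https://example.com/Coursera-SwiftKey.zip"   # placeholder, replace with the course link
if (!file.exists("Coursera-SwiftKey.zip")) {
download.file(zip_url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
unzip("Coursera-SwiftKey.zip", exdir = ".")   # extracts the final/en_US/ folder used below
list.files("final/en_US")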
1. Count the lines in each file.
2. Count the words in each file.
# Counting lines for each file
linecount1<-countLines("/Users/mahmoudelsheikh/Google Drive/Coursera Data analysis/Courseraworkingspace/final/en_US/en_US.blogs.txt")
linecount2<-countLines("/Users/mahmoudelsheikh/Google Drive/Coursera Data analysis/Courseraworkingspace/final/en_US/en_US.news.txt")
linecount3<-countLines("/Users/mahmoudelsheikh/Google Drive/Coursera Data analysis/Courseraworkingspace/final/en_US/en_US.twitter.txt")
# Reading en_US.blogs.txt
con1<-file("/Users/mahmoudelsheikh/Google Drive/Coursera Data analysis/Courseraworkingspace/final/en_US/en_US.blogs.txt","r")
txt1<-readLines(con1)
close(con1)
# Reading en_US.news.txt
con2<-file("/Users/mahmoudelsheikh/Google Drive/Coursera Data analysis/Courseraworkingspace/final/en_US/en_US.news.txt","r")
txt2<-readLines(con2)
close(con2)
# Reading en_US.twitter.txt
con3<-file("/Users/mahmoudelsheikh/Google Drive/Coursera Data analysis/Courseraworkingspace/final/en_US/en_US.twitter.txt","r")
txt3<-readLines(con3)
close(con3)
# Counting words for each file
Word_count<-c(wordcount(txt1),wordcount(txt2),wordcount(txt3))
# Combining data to create a result table
Line_count<-c(linecount1,linecount2,linecount3)
Source_file<-c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt")
Result<-cbind.data.frame('Source file' = Source_file,'Number of lines'= Line_count,'Number of Words'= Word_count)
kable(Result,caption = "Table 1: Line and word counts for each English text file") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Source file | Number of lines | Number of Words |
|---|---|---|
| en_US.blogs.txt | 899288 | 37334131 |
| en_US.news.txt | 1010242 | 34372530 |
| en_US.twitter.txt | 2360148 | 30373543 |
The function does the following:
1. Transform all text to lower case.
2. Detect the language of each line and keep only English lines, as other languages are out of the scope of the application.
3. Remove punctuation, except intra-word contractions (e.g. the apostrophe in "don't") and intra-word dashes ("-").
4. Remove numbers, as numbers are out of the scope of the application.
5. Collapse and trim extra white space in all text.
6. Remove any lines left as NA after cleaning.
cleantext<-function(text_vector) {
x<-text_vector
# Transforming all letters to lower case
x<-tolower(x)
# Detecting language and keeping only lines classified as English
cattxt<-detect_language(x)
x<-x[cattxt == "en"]
# Removing punctuation, keeping intra-word dashes and contractions
x<-removePunctuation(x,preserve_intra_word_contractions = TRUE,preserve_intra_word_dashes = TRUE)
# Removing numbers
x<-removeNumbers(x)
# Collapsing runs of white space to single spaces, then trimming leading/trailing spaces
x<-stripWhitespace(x)
x<-trimws(x)
# Removing lines that are NA after cleaning (lines whose language could not be detected)
x<-x[!is.na(x)]
x
}
# Cleaning and combining all texts into one vector
cleantxt<-cleantext(c(txt1,txt2,txt3))
A count of the words in each line will prove useful in the tokenization step: it makes it possible to exclude any line that contains fewer than 2 words when creating bigrams, and any line that contains fewer than 3 words when creating trigrams.
# Creating the wordslength vector: counting words in each line of the cleaned text
wordslength<-sapply(cleantxt, wordcount, USE.NAMES = FALSE)
head(wordslength,10)
## [1] 16 139 40 109 9 54 45 137 168 56
# Tokenization
ng1<-ngram(cleantxt,n=1,sep = " ")
# Getting the phrase table and arranging it in descending order of frequency
ng1tbl<-get.phrasetable(ng1) %>% arrange(desc(freq)) %>% mutate(ngram_length = 1) %>% mutate(ngram = trimws(ngrams)) %>% select(ngram,freq,ngram_length)
# Creating preword and currentword columns
ng1tbl<-ng1tbl %>% mutate(preword= "",currentword = "")
# Removing NA rows from the currentword column (to make sure there are no erroneous rows in the table)
ng1tbl<-ng1tbl[!is.na(ng1tbl$currentword),]
# Unigram phrase table structure
str(ng1tbl)
## 'data.frame': 674868 obs. of 5 variables:
## $ ngram : chr "the" "and" "to" "a" ...
## $ freq : int 3788280 1954149 1953250 1752264 1631355 1255283 902961 800112 709195 707759 ...
## $ ngram_length: num 1 1 1 1 1 1 1 1 1 1 ...
## $ preword : chr "" "" "" "" ...
## $ currentword : chr "" "" "" "" ...
# View the 30 most frequent Unigram tokens
topuni<-ng1tbl[1:30,]
kable(topuni,caption = "Table 2: Top 30 Unigram tokens") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
scroll_box(width = "100%", height = "300px")
| ngram | freq | ngram_length | preword | currentword |
|---|---|---|---|---|
| the | 3788280 | 1 | | |
| and | 1954149 | 1 | | |
| to | 1953250 | 1 | | |
| a | 1752264 | 1 | | |
| of | 1631355 | 1 | | |
| in | 1255283 | 1 | | |
| i | 902961 | 1 | | |
| that | 800112 | 1 | | |
| is | 709195 | 1 | | |
| for | 707759 | 1 | | |
| it | 608704 | 1 | | |
| with | 537459 | 1 | | |
| on | 534547 | 1 | | |
| was | 503012 | 1 | | |
| as | 407167 | 1 | | |
| you | 383818 | 1 | | |
| at | 376235 | 1 | | |
| this | 375065 | 1 | | |
| he | 367221 | 1 | | |
| have | 359531 | 1 | | |
| be | 358010 | 1 | | |
| but | 350723 | 1 | | |
| are | 330165 | 1 | | |
| my | 307037 | 1 | | |
| from | 297732 | 1 | | |
| not | 282670 | 1 | | |
| said | 279947 | 1 | | |
| we | 277012 | 1 | | |
| his | 265005 | 1 | | |
| by | 251144 | 1 | | |
# Graph showing top 30 Unigram tokens
topunig<-ggplot(topuni,aes(reorder(ngram,-freq),freq))+
  geom_col()+
  ggtitle("Most frequent Unigram tokens")+
  xlab("Unigram tokens")+
  ylab("Frequency")+
  theme(plot.title = element_text(size = 20),
        axis.title = element_text(color = "black",face = "italic",size = 14,hjust = 0),
        axis.text.x = element_text(angle = 45, hjust = 1, size = 14),
        axis.text.y = element_text(size = 14))
topunig
Figure 1: Bar chart showing the top 30 Unigram tokens
# Creating a word cloud for the top 500 Unigram tokens
pal<-brewer.pal(8,"Dark2")
topuni500<-ng1tbl[1:500,]
wordcuni<-wordcloud(words = topuni500$ngram,freq = topuni500$freq,colors=pal,random.color = TRUE,random.order = FALSE,rot.per = 0)
Figure 2: Wordcloud showing the top 500 Unigram tokens
# Tokenization after excluding any line that has fewer than 2 words
ng2<-ngram(cleantxt[wordslength != 1],n=2,sep= " ")
# Getting the phrase table and arranging it in descending order of frequency
ng2tbl<-get.phrasetable(ng2) %>% arrange(desc(freq)) %>% mutate(ngram_length = 2) %>% mutate(ngram = trimws(ngrams)) %>% select(ngram,freq,ngram_length)
# Creating preword and currentword columns
ng2tbl<-ng2tbl %>% mutate(preword= word(ngram,start = 1L,end = 1L),currentword = word(ngram,start = 2L,end = 2L))
# Removing NA rows from the currentword column (to make sure there are no erroneous rows in the table)
ng2tbl<-ng2tbl[!is.na(ng2tbl$currentword),]
# Bigram phrase table structure
str(ng2tbl)
## 'data.frame': 11010574 obs. of 5 variables:
## $ ngram : chr "of the" "in the" "to the" "on the" ...
## $ freq : int 371214 330482 169291 146781 126138 114206 109894 104671 95831 86955 ...
## $ ngram_length: num 2 2 2 2 2 2 2 2 2 2 ...
## $ preword : chr "of" "in" "to" "on" ...
## $ currentword : chr "the" "the" "the" "the" ...
# View the 30 most frequent Bigram tokens
topbi<-ng2tbl[1:30,]
kable(topbi,caption = "Table 3: Top 30 Bigram tokens") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
scroll_box(width = "100%", height = "300px")
| ngram | freq | ngram_length | preword | currentword |
|---|---|---|---|---|
| of the | 371214 | 2 | of | the |
| in the | 330482 | 2 | in | the |
| to the | 169291 | 2 | to | the |
| on the | 146781 | 2 | on | the |
| for the | 126138 | 2 | for | the |
| to be | 114206 | 2 | to | be |
| and the | 109894 | 2 | and | the |
| at the | 104671 | 2 | at | the |
| in a | 95831 | 2 | in | a |
| with the | 86955 | 2 | with | the |
| from the | 74798 | 2 | from | the |
| it was | 73900 | 2 | it | was |
| is a | 73461 | 2 | is | a |
| with a | 67561 | 2 | with | a |
| of a | 66655 | 2 | of | a |
| for a | 64023 | 2 | for | a |
| it is | 63203 | 2 | it | is |
| as a | 59637 | 2 | as | a |
| i was | 59352 | 2 | i | was |
| and i | 58915 | 2 | and | i |
| one of | 56664 | 2 | one | of |
| that the | 55499 | 2 | that | the |
| will be | 54971 | 2 | will | be |
| i have | 53789 | 2 | i | have |
| by the | 53205 | 2 | by | the |
| is the | 49219 | 2 | is | the |
| and a | 48547 | 2 | and | a |
| the first | 47073 | 2 | the | first |
| to a | 45510 | 2 | to | a |
| i am | 45411 | 2 | i | am |
# Graph showing top 30 Bigram tokens
topbig<-ggplot(topbi,aes(reorder(ngram,-freq),freq))+
  geom_col()+
  ggtitle("Most frequent Bigram tokens")+
  xlab("Bigram tokens")+
  ylab("Frequency")+
  theme(plot.title = element_text(size = 20),
        axis.title = element_text(color = "black",face = "italic",size = 14,hjust = 0),
        axis.text.x = element_text(angle = 45, hjust = 1, size = 14),
        axis.text.y = element_text(size = 14))
topbig
Figure 3: Bar chart showing the top 30 Bigram tokens
# Creating a word cloud for the top 500 Bigram tokens
pal<-brewer.pal(8,"Dark2")
topbi500<-ng2tbl[1:500,]
wordcbi<-wordcloud(words = topbi500$ngram,freq = topbi500$freq,colors=pal,random.color = TRUE,random.order = FALSE,fixed.asp = TRUE)
Figure 4: Wordcloud showing the top 500 Bigram tokens
# Tokenization after excluding any line that has fewer than 3 words
ng3<-ngram(cleantxt[wordslength != 1 & wordslength != 2],n=3,sep = " ")
# Getting the phrase table and arranging it in descending order of frequency
ng3tbl<-get.phrasetable(ng3) %>% arrange(desc(freq)) %>% mutate(ngram_length = 3) %>% mutate(ngram = trimws(ngrams)) %>% select(ngram,freq,ngram_length)
# Creating preword and currentword columns (the steps are separated here because, at ~34 million rows, building the columns first and binding them afterwards takes less time and processing power than the mutate() chain used for the smaller n-gram tables)
prewordng3tbl<-word(ng3tbl$ngram,start = 1L,end = 2L)
currentwordng3tbl<-word(ng3tbl$ngram,start = 3L,end = 3L)
# stringsAsFactors = FALSE keeps the new columns as character, matching the Unigram and Bigram tables
ng3tbl<-cbind.data.frame(ng3tbl,preword=prewordng3tbl,currentword=currentwordng3tbl,stringsAsFactors = FALSE)
# Removing NA rows from the currentword column (to make sure there are no erroneous rows in the table)
ng3tbl<-ng3tbl[!is.na(ng3tbl$currentword),]
# Trigram phrase table structure
str(ng3tbl)
## 'data.frame': 34084787 obs. of 5 variables:
## $ ngram : chr "one of the" "a lot of" "as well as" "some of the" ...
## $ freq : int 28756 23469 13040 12110 12110 12023 11979 10788 10605 10257 ...
## $ ngram_length: num 3 3 3 3 3 3 3 3 3 3 ...
## $ preword : chr "one of" "a lot" "as well" "some of" ...
## $ currentword : chr "the" "of" "as" "the" ...
# View the 30 most frequent Trigram tokens
toptri<-ng3tbl[1:30,]
kable(toptri,caption = "Table 4: Top 30 Trigram tokens") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
scroll_box(width = "100%", height = "300px")
| ngram | freq | ngram_length | preword | currentword |
|---|---|---|---|---|
| one of the | 28756 | 3 | one of | the |
| a lot of | 23469 | 3 | a lot | of |
| as well as | 13040 | 3 | as well | as |
| some of the | 12110 | 3 | some of | the |
| to be a | 12110 | 3 | to be | a |
| the end of | 12023 | 3 | the end | of |
| out of the | 11979 | 3 | out of | the |
| it was a | 10788 | 3 | it was | a |
| part of the | 10605 | 3 | part of | the |
| be able to | 10257 | 3 | be able | to |
| going to be | 9920 | 3 | going to | be |
| a couple of | 8734 | 3 | a couple | of |
| the rest of | 8445 | 3 | the rest | of |
| this is a | 7967 | 3 | this is | a |
| the first time | 7692 | 3 | the first | time |
| the fact that | 7632 | 3 | the fact | that |
| i want to | 7631 | 3 | i want | to |
| end of the | 7578 | 3 | end of | the |
| according to the | 7313 | 3 | according to | the |
| there is a | 7257 | 3 | there is | a |
| in the first | 7213 | 3 | in the | first |
| most of the | 6907 | 3 | most of | the |
| the united states | 6729 | 3 | the united | states |
| at the end | 6579 | 3 | at the | end |
| it is a | 6530 | 3 | it is | a |
| this is the | 6332 | 3 | this is | the |
| is one of | 6304 | 3 | is one | of |
| it would be | 6177 | 3 | it would | be |
| i have to | 6151 | 3 | i have | to |
| for the first | 5966 | 3 | for the | first |
# Graph showing top 30 Trigram tokens
toptrig<-ggplot(toptri,aes(reorder(ngram,-freq),freq))+
  geom_col()+
  ggtitle("Most frequent Trigram tokens")+
  xlab("Trigram tokens")+
  ylab("Frequency")+
  theme(plot.title = element_text(size = 20),
        axis.title = element_text(color = "black",face = "italic",size = 14,hjust = 0),
        axis.text.x = element_text(angle = 45, hjust = 1, size = 14),
        axis.text.y = element_text(size = 14))
toptrig
Figure 5: Bar chart showing the top 30 Trigram tokens
# Creating a word cloud for the top 500 Trigram tokens
pal<-brewer.pal(8,"Dark2")
toptri500<-ng3tbl[1:500,]
# Frequencies are doubled for display scaling so the lower-frequency trigrams remain legible
wordctri<-wordcloud(words = toptri500$ngram,freq = toptri500$freq*2,colors=pal,random.color = TRUE,random.order = FALSE,rot.per = 0)
Figure 6: Wordcloud showing the top 500 Trigram tokens
1. Define a suitable sample size that balances prediction accuracy against the required computing time and power.
2. Create functions to calculate Katz's back-off probabilities and save them as a Markov transition matrix (a minimal sketch follows below).
3. Create a Shiny app that takes a preword/history and gives three word suggestions.
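As a starting point for step 2, here is a minimal sketch of a back-off next-word lookup built on the ng1tbl and ng2tbl tables created above. The suggest_next() helper and the constant 0.4 back-off weight are illustrative assumptions in the spirit of stupid back-off, not the final Katz implementation, which will add Good-Turing discounting to set the back-off mass properly.

# Sketch only: suggest_next() and the fixed 0.4 weight are assumptions,
# not the final Katz back-off (which will use Good-Turing discounts)
suggest_next<-function(history, n = 3) {
h<-word(tolower(trimws(history)), -1)   # condition on the last word only (bigram model)
seen<-ng2tbl[ng2tbl$preword == h,]
if (nrow(seen) > 0) {
# Relative bigram frequencies C(h,w)/C(h) for observed continuations
probs<-data.frame(word = seen$currentword, prob = seen$freq/sum(seen$freq))
} else {
# Unseen history: back off to scaled unigram probabilities
probs<-data.frame(word = ng1tbl$ngram, prob = 0.4*ng1tbl$freq/sum(ng1tbl$freq))
}
head(probs[order(-probs$prob),], n)
}
suggest_next("one of")   # should rank "the" first, in line with Table 3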