This is an interim report showing my progress on the project so far.

Loading of data and sampling

The data was downloaded to my local drive and saved in a folder. I will be using the English data set for the project. Let us first load all the required libraries.

## Setting the working directory and loading the required libraries
setwd("E:/Capstone/final/en_US")
library(tm)
## Loading required package: NLP
library(ngram)
library(stringr)

The next step is to read each of the files and create a sample file from it. Each sample is a random selection of 5% of the lines in the complete file. To create the samples, I will write a function that reads a data set, determines its length, and draws a random 5% sample of its lines.

## This function creates a sample file for a data set
samp <- function(textfile){
  
  sampfile <- readLines(textfile) ## Reading all the lines of the data set
  fileln <- length(sampfile) ## Calculating the number of lines in the file
  sampsiz <- round(fileln * 0.05) ## Taking only a 5% sample
  txtsamp <- sampfile[sample(fileln, sampsiz)] ## Taking a random sample of the lines
  return(txtsamp)
  
}

Using this function, let us create a sample from each of the data sets.

news_samp <- samp("en_US.news.txt")

blog_samp <- samp("en_US.blogs.txt")

twit_samp <- samp("en_US.twitter.txt")
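
For reproducibility, set.seed() could be called before the three sampling calls above (the seed value itself is arbitrary). As a quick check, illustrative only, the number of lines in each sample can be inspected.

## Quick check of the number of lines drawn into each sample
length(news_samp)
length(blog_samp)
length(twit_samp)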

Creating tokens and cleaning the data set

Once the sample files are created, the next step is to create word tokens from the data set. The purpose of creating tokens is to understand the distribution of words within the data set. I will use the news file to create the tokens. After creating the tokens, I will convert them all to lowercase.

After tokenizing, we will deal with punctuation and special characters. First, some common contractions are expanded into their root words. After that, the remaining punctuation in the document is removed from the data set.

## This function takes a text sample, creates word tokens and cleans up punctuation

clean_tokn <- function(txtfile){
  
  news_tokn <- scan_tokenizer(txtfile) ## Creating tokens from the sample
  news_tokn <- tolower(news_tokn)      ## Converting the tokens to lower case
  
  ## Expanding common contractions into their root words
  contractions <- c("\\bi'm\\b"      = "i am",      "\\bi'd\\b"      = "i would",
                    "\\bi've\\b"     = "i have",    "\\ba's\\b"      = "as",
                    "\\bthere's\\b"  = "there is",  "\\bdidn't\\b"   = "did not",
                    "\\bisn't\\b"    = "is not",    "\\bwasn't\\b"   = "was not",
                    "\\bhasn't\\b"   = "has not",   "\\bhaven't\\b"  = "have not",
                    "\\bwouldn't\\b" = "would not", "\\bdon't\\b"    = "do not",
                    "\\bcan't\\b"    = "cannot",    "\\bwon't\\b"    = "will not",
                    "\\bit's\\b"     = "it is",     "\\bwhat's\\b"   = "what is",
                    "\\bthat's\\b"   = "that is",   "\\bhe's\\b"     = "he is",
                    "\\bhe'd\\b"     = "he would",  "\\bshe's\\b"    = "she is",
                    "\\bwe're\\b"    = "we are",    "\\bwe've\\b"    = "we have",
                    "\\bthey're\\b"  = "they are",  "\\bthey've\\b"  = "they have",
                    "\\bthey'll\\b"  = "they will", "\\bthey'd\\b"   = "they would",
                    "\\bwho've\\b"   = "who have",  "\\byou're\\b"   = "you are")
  for(i in seq_along(contractions)){
    news_tokn <- gsub(names(contractions)[i], contractions[[i]], news_tokn)
  }
  
  news_tokn <- gsub("[[:punct:]]", "", news_tokn) ## removing punctuation
  news_tokn <- gsub("^\\s+|\\s+$", "", news_tokn) ## removing leading and trailing spaces
  news_tokn <- gsub("[0-9]", "", news_tokn)       ## removing numbers
  news_tokn <- gsub("[^a-z ]", "", news_tokn)     ## removing any remaining special characters (mis-encoded quote marks etc.)
  
  return(news_tokn)
  
}

Let us now create a clean set of tokens from the news data sample.

news_tokn <- clean_tokn(news_samp)
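
Removing punctuation and numbers can leave empty strings behind in the token vector. Dropping them is an optional extra step (not part of the original processing) before the corpus is built.

## Optional: drop any empty tokens left over after cleaning
news_tokn <- news_tokn[news_tokn != ""]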

Finding the word frequencies

Let us now find the frequencies of the words. To do this, we first have to convert the text into a corpus; here each cleaned token becomes a document in the corpus. From the corpus we then build a term-document matrix so that we can calculate the frequency of each word.

news_corpus <- Corpus(VectorSource(news_tokn)) ## Converts into a corpus of words


news_tdm <- TermDocumentMatrix(news_corpus) ## Creates a term document matrix from the corpus of words
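
As a quick sanity check (illustrative only), dim() shows the size of the term-document matrix: the first value is the number of terms (rows) and the second the number of documents (columns).

## Size of the term-document matrix: terms along the rows, documents along the columns
dim(news_tdm)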

Let us now calculate the frequency of each word within the corpus. In the term-document matrix the terms are along the rows and the documents are along the columns, so summing across each row gives the frequency of the corresponding word. The sums below are taken over blocks of 1,000 terms at a time.

news_max <- apply(news_tdm[1:1000,],1,sum)
news_max2 <- apply(news_tdm[1001:2000,],1,sum)
news_max3 <- apply(news_tdm[2001:3000,],1,sum)
news_max4 <- apply(news_tdm[3001:4000,],1,sum)
news_max5 <- apply(news_tdm[4001:5000,],1,sum)
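
A possible alternative (not used above) is row_sums() from the slam package, which tm already depends on; it sums every row of the sparse term-document matrix directly, without slicing it into blocks.

## Alternative: sum all rows of the sparse term-document matrix in one call
news_rowsums <- slam::row_sums(news_tdm)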

Let us now find the most frequent words in the list. For this, each of the blocks computed above is sorted in decreasing order of frequency, the top 100 words of each block are kept, and the results are then combined and sorted again.

news_freq2 <- news_max5[order(news_max5,decreasing=TRUE)][1:100]
news_freq3 <- news_max4[order(news_max4,decreasing=TRUE)][1:100]
news_freq4 <- news_max3[order(news_max3,decreasing=TRUE)][1:100]
news_freq5 <- news_max2[order(news_max2,decreasing=TRUE)][1:100]
news_freq6 <- news_max[order(news_max,decreasing=TRUE)][1:100]

news_freq <- c(news_freq2,news_freq3,news_freq4,news_freq5,news_freq6)

news_freq <- news_freq[order(news_freq,decreasing=TRUE)]

Summary statistics of the files

## Number of lines in each file

## Number of lines in the blogs file

length(readLines("en_US.blogs.txt"))
## [1] 899288
## Number of lines in the news file

length(readLines("en_US.news.txt"))
## [1] 77259
## Number of lines in the twitter file

length(readLines("en_US.twitter.txt"))
## [1] 2360148
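
In addition to the line counts, an approximate word count per file can be computed with stringr (loaded earlier). This is an illustrative extra statistic, not part of the original report.

## Approximate number of words in each file, counted as runs of non-space characters
sum(str_count(readLines("en_US.blogs.txt"), "\\S+"))
sum(str_count(readLines("en_US.news.txt"), "\\S+"))
sum(str_count(readLines("en_US.twitter.txt"), "\\S+"))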

Let us now look at the distribution and frequencies of the words.

## Some of the most frequent words and their frequencies

news_freq[1:20]
##   and  said   frm   are   but  have   has   his   had   mre   all  abut 
##  3216  1022   684   655   629   616   461   431   413   347   320   279 
##  last  been   her   new   int   can after   she 
##   261   257   256   248   248   244   225   219

Let us look at some plots of the frequencies. We will create a histogram and a scatter plot to see what the word frequencies look like.

## Histogram of the word frequencies
hist(news_freq)

## Plot of the word frequencies
plot(news_freq)
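
A bar plot of the most frequent words (an optional extra view, not part of the original report) makes it easier to compare individual words.

## Bar plot of the 20 most frequent words; las = 2 rotates the labels
barplot(news_freq[1:20], las = 2)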

Plan for creating the word prediction app in Shiny

I plan to build a word prediction app in Shiny. For this, I plan to follow these steps:

  1. Create multiple samples from each of the files
  2. Clean the samples and create a comprehensive corpus
  3. Identify the frequent words in the corpus and their conditional probabilities
  4. Create n-gram models (both bi-gram and tri-gram) and identify the combinations in which the words appear (a minimal sketch follows this list)
  5. Create conditional probability matrices for the combination of words found in the n-gram models
  6. Based on the conditional probability matrices, create word prediction models built on Markov chains
  7. Deploy the model as a Shiny app.
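
As a minimal sketch of step 4 (assuming the ngram package loaded earlier; the object names are illustrative, and the bi-grams here span sentence boundaries), a bi-gram frequency table could be built from the cleaned news tokens roughly as follows.

## Illustrative sketch: build a bi-gram frequency table from the cleaned tokens
news_text <- paste(news_tokn[news_tokn != ""], collapse = " ") ## collapse tokens back into one string
news_bigrams <- ngram(news_text, n = 2)                        ## bi-gram model from the ngram package
head(get.phrasetable(news_bigrams))                            ## most frequent bi-grams with counts and proportions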