This is an interim project report showing my progress so far.
The data was downloaded to my local drive and saved in a folder. I will be using the English data sets for the project. Let us first load the required libraries.
## Loading the Library files
setwd("E:/Capstone/final/en_US")
library(tm)
## Loading required package: NLP
library(ngram)
library(stringr)
The next step is to read each of the files and create a sample from each one. Each sample is a random selection of 5% of the lines of the complete file. To create the samples, I will write a function that reads a data set, calculates its length, and draws a 5% sample of its lines.
## This function creates a sample from each data set
samp <- function(textfile){
sampfile <- readLines(textfile) ## Reading all the lines of the data set
fileln <- length(sampfile) ## Calculating the number of lines in the file
sampsiz <- round(fileln * 0.05) ## Taking only a 5% sample
txtsamp <- sampfile[sample(fileln, sampsiz)] ## Drawing a simple random sample of lines
return(txtsamp)
}
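Because the sampling is random, each run of the report will draw a different subset of lines. Fixing the random seed before drawing the samples keeps the results reproducible (a minimal sketch; the seed value is arbitrary):
set.seed(1234) ## any fixed value makes the random sampling below reproducible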
Using this function, let us create a sample from each of the data sets.
news_samp <- samp("en_US.news.txt")
blog_samp <- samp("en_US.blogs.txt")
twit_samp <- samp("en_US.twitter.txt")
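Since reading the full files is slow, the samples can be cached to disk once they are drawn, using base R's writeLines (the file names here are only illustrative):
writeLines(news_samp, "news_samp.txt")
writeLines(blog_samp, "blog_samp.txt")
writeLines(twit_samp, "twit_samp.txt")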
Once the sample files are created, I will create word tokens from the data set. The purpose of tokenizing is to understand the distribution of words within the data set. I will be using the news sample for creating the tokens, and after creating the tokens I will convert them all to lowercase.
After tokenizing, we will deal with punctuation and special characters. First, some common contractions are expanded to their root words. After that, the remaining punctuation in the document is removed from the data set.
## This function takes a text sample, creates word tokens, expands contractions, and cleans punctuation and special characters
clean_tokn <- function(txtfile){
news_tokn <- scan_tokenizer(txtfile) ## Creating tokens from the sample
news_tokn <- tolower(news_tokn) ## Converting the tokens to lower case
news_tokn <- gsub("^[i]['][m]$", "i am", news_tokn)
news_tokn <- gsub("^[i]['][d]$", "i would", news_tokn)
news_tokn <- gsub("^[a]['][s]$", "as", news_tokn)
news_tokn <- gsub("^[there]['][s]$", "there is", news_tokn)
news_tokn <- gsub("^[didn]['][t]$", "did not", news_tokn)
news_tokn <- gsub("^[a]['][s]$", "as", news_tokn)
news_tokn <- gsub("^[isn]['][t]$", "is not", news_tokn)
news_tokn <- gsub("[we]['][re]", "we are", news_tokn)
news_tokn <- gsub("[can]['][t]", "cannot", news_tokn)
news_tokn <- gsub("[it]['][s]", "it is", news_tokn)
news_tokn <- gsub("[what]['][s]", "what is", news_tokn)
news_tokn <- gsub("[wouldn]['][t]", "would not", news_tokn)
news_tokn <- gsub("[hasn]['][t]", "has not", news_tokn)
news_tokn <- gsub("[that]['][s]", "that is", news_tokn)
news_tokn <- gsub("[haven]['][t]", "have not", news_tokn)
news_tokn <- gsub("[he]['][d]", "he would", news_tokn)
news_tokn <- gsub("[he]['][s]", "he is", news_tokn)
news_tokn <- gsub("[a]['][s]", "has", news_tokn)
news_tokn <- gsub("[wasn]['][t]", "was not", news_tokn)
news_tokn <- gsub("[don]['][t]", "do not", news_tokn)
news_tokn <- gsub("[they]['][ll]", "they will", news_tokn)
news_tokn <- gsub("[she]['][s]", "she is", news_tokn)
news_tokn <- gsub("[wasn]['][t]", "was not", news_tokn)
news_tokn <- gsub("[they]['][re]", "they are", news_tokn)
news_tokn <- gsub("[we]['][ve]", "we have", news_tokn)
news_tokn <- gsub("[won]['][t]", "will not", news_tokn)
news_tokn <- gsub("[they]['][ll]", "they will", news_tokn)
news_tokn <- gsub("[i]['][ve]", "i have", news_tokn)
news_tokn <- gsub("[they]['][ll]", "they will", news_tokn)
news_tokn <- gsub("[who]['][ve]", "they have", news_tokn)
news_tokn <- gsub("[they]['][ve]", "who have", news_tokn)
news_tokn <- gsub("[they]['][d]", "they would", news_tokn)
news_tokn <- gsub("[you]['][re]", "you are", news_tokn)
news_tokn <- gsub("[[:punct:]]", "", news_tokn) ## removing punctuations
news_tokn <- gsub("^\\s+|\\s+$", "",news_tokn) ## removing free spaces
news_tokn <- gsub("[0-9]", "",news_tokn) ## removing numbers
news_tokn <- gsub("â???o", "",news_tokn) ## removing some special charachters
news_tokn <- gsub("â???T", "",news_tokn) ## removing some special charachters
news_tokn <- gsub("???", "",news_tokn) ## removing some special charachters
news_tokn <- gsub("â", "",news_tokn) ## removing some special charachters
news_tokn <- gsub("???T", "",news_tokn) ## removing some special charachters
news_tokn <- gsub("o", "",news_tokn) ## removing some special charachters
news_tokn <- gsub("???T???T", "",news_tokn) ## removing some special charachters
}
Let us first create a clean set of tokens from the news sample.
news_tokn <- clean_tokn(news_samp)
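As a quick sanity check, we can look at the first few cleaned tokens and the total token count with base R:
head(news_tokn) ## first few cleaned tokens
length(news_tokn) ## total number of tokens in the sample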
Let us now try to find the frequencies of the words. To do this, we have to convert the text into a corpus. After converting it into a corpus, we need to build a term-document matrix from the corpus so that we can calculate the frequency of each word.
news_corpus <- Corpus(VectorSource(news_tokn)) ## Converts into a corpus of words
news_tdm <- TermDocumentMatrix(news_corpus) ## Creates a term document matrix from the corpus of words
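As a side note, tm also provides a shortcut for this kind of exploration: findFreqTerms() lists the terms that occur at least a given number of times (the threshold of 100 below is arbitrary):
findFreqTerms(news_tdm, lowfreq = 100) ## terms appearing at least 100 times in the sample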
Let us now calculate the frequency of words within the corpus. For this, we sum along the rows to calculate the frequency of each word; in the term-document matrix, the terms are along the rows and the documents are along the columns. The first 5000 terms are processed in blocks of 1000 rows.
news_max <- apply(news_tdm[1:1000,],1,sum)
news_max2 <- apply(news_tdm[1001:2000,],1,sum)
news_max3 <- apply(news_tdm[2001:3000,],1,sum)
news_max4 <- apply(news_tdm[3001:4000,],1,sum)
news_max5 <- apply(news_tdm[4001:5000,],1,sum)
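The same row sums can also be computed in one step, without converting the matrix to dense form, using the slam package that tm builds on; this is shown only as an alternative sketch, assuming slam is available (it is installed as a dependency of tm):
library(slam)
news_max_all <- row_sums(news_tdm) ## frequency of every term, computed on the sparse matrix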
Let us now find the most frequent words. For this, each of the frequency vectors computed above is sorted in decreasing order, and the 100 most frequent words from each block are combined into a single sorted list.
news_freq2 <- news_max5[order(news_max5,decreasing=TRUE)][1:100]
news_freq3 <- news_max4[order(news_max4,decreasing=TRUE)][1:100]
news_freq4 <- news_max3[order(news_max3,decreasing=TRUE)][1:100]
news_freq5 <- news_max2[order(news_max2,decreasing=TRUE)][1:100]
news_freq6 <- news_max[order(news_max,decreasing=TRUE)][1:100]
news_freq <- c(news_freq2,news_freq3,news_freq4,news_freq5,news_freq6)
news_freq <- news_freq[order(news_freq,decreasing=TRUE)]
## No of lines of each file
## Length of blog file
length(readLines("en_US.blogs.txt"))
## [1] 899288
## Length of news file
length(readLines("en_US.news.txt"))
## [1] 77259
## Length of twitter file
length(readLines("en_US.twitter.txt"))
## [1] 2360148
Let us look at the distribution and frequencies of the words
## Some of the most frequent words and their frequencies
news_freq[1:20]
## and said frm are but have has his had mre all abut
## 3216 1022 684 655 629 616 461 431 413 347 320 279
## last been her new int can after she
## 261 257 256 248 248 244 225 219
Let us now look at some plots of the frequencies. We create a histogram and an index plot to see what the word frequency distribution looks like.
## Histogram of the word frequencies
hist(news_freq)
## Plot of the word frequencies
plot(news_freq)
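A bar plot of the most frequent terms is often easier to read than the raw index plot; a sketch using base graphics (the choice of the top 20 terms is arbitrary):
## Bar plot of the 20 most frequent words, with axis labels rotated for readability
barplot(news_freq[1:20], las = 2, main = "Top 20 words in the news sample")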
I plan to make a word prediction game in Shiny. To do this, I plan to carry out the following steps: