The project consists of building a Shiny application capable of predicting the next word a user will type, based on the words typed previously. In this report I show how to load the data, then conduct an exploratory analysis and a cleaning workflow, ending with a data set that can be used to develop and train a prediction algorithm.
The objective of this project is to develop an application that predicts the next word a person will type based on the words typed previously. The app relies on a prediction algorithm trained with data obtained from blogs, news websites and Twitter. While the ultimate goal of the project is to develop an online application using Shiny, this report is limited to loading the data, conducting an exploratory analysis and cleaning the data, in preparation for developing a prediction model.
The training data set can be downloaded here and contains text obtained from blogs, news sites and Twitter in English, German, Russian and Finnish. In this project we will only consider the English data set.
We begin by loading the three English files. The first thing we can do is look at the file sizes to get an idea of the computational resources that will be needed. From the table below it can be seen that all files are above 250 Mb, with the Twitter file being the largest at over 300 Mb. The table below also shows the number of characters, lines and words in each file. Even though the blogs file has the lowest number of lines, it has the highest number of words. This is expected because people tend to write longer posts on blogs than on Twitter or in news articles. In addition, the first three lines of the blogs file are shown to give a better idea of what the data looks like, followed by a sketch of how these summaries can be computed.
               File     Size
1   en_US.blogs.txt 255.4 Mb
2    en_US.news.txt 257.3 Mb
3 en_US.twitter.txt   319 Mb
          Blogs      News   Twitter
Chars 206824382 203223154 162096241
Lines    899288   1010242   2360148
Words  37570839  34494539  30451170
[1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
[2] "We love you Mr. Brown."
[3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
It is well known that text mining is computationally expensive. Given the size of the data set, I decided to use only a fraction of it for the analysis and training. I chose 10% of the data because it seems like a reasonable trade-off between processing time and data volume.
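As an illustration, the sub-sampling could be done as in the sketch below; the use of sample() with a fixed seed is my assumption rather than the exact code used in the analysis, and blogs, news and twitter are assumed to hold the full character vectors loaded earlier.
set.seed(1234) #Arbitrary seed, for reproducibility
sampleFraction <- 0.10 #Keep 10% of the lines of each file
subBlogs   <- sample(blogs,   floor(length(blogs)*sampleFraction))
subNews    <- sample(news,    floor(length(news)*sampleFraction))
subTwitter <- sample(twitter, floor(length(twitter)*sampleFraction))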
Having sub-sampled the data, I proceed to check whether any lines are written in a language other than English. While the data claims to contain only English text, it is always good practice to check, since errors may occur. For this task I use the textcat function from the library of the same name. This process is not perfect and, especially for short sentences, the wrong language may be detected. However, after running the code I found that the sampled data contains no lines written in a language other than English.
if (sum(file.exists(c("subBlogs2.RDS","subNews2.RDS","subTwitter2.RDS")))==3){
  #Reload the cached results if they already exist
  subBlogs2<-readRDS("subBlogs2.RDS")
  subNews2<-readRDS("subNews2.RDS")
  subTwitter2<-readRDS("subTwitter2.RDS")
} else {
  #Keep only the lines that textcat classifies as English
  subBlogs2<-subBlogs[textcat(subBlogs)=="english"]
  subNews2<-subNews[textcat(subNews)=="english"]
  subTwitter2<-subTwitter[textcat(subTwitter)=="english"]
  saveRDS(subBlogs2, "subBlogs2.RDS")
  saveRDS(subNews2, "subNews2.RDS")
  saveRDS(subTwitter2, "subTwitter2.RDS")
}
length(subBlogs2)/length(subBlogs)
[1] 1
length(subNews2)/length(subNews)
[1] 1
length(subTwitter2)/length(subTwitter)
[1] 1
The next step is to combine all three sub-sampled files into a single object and create a corpus from it using the corpus function from the quanteda library. This step is necessary so that quanteda's functions can operate on the data.
if (file.exists("corpusData.RDS")){
corpusData<-readRDS("corpusData.RDS")
} else {
#Combining all vectors into a single one
allData <- c(subBlogs, subNews, subTwitter)
#Building the corpus from the char vector
corpusData <- corpus(allData)
saveRDS(corpusData, "corpusData.RDS")
}
The size of the corpus object is 174.7 Mb, which is still manageable.
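As a side note, the reported figure can be checked with object.size(); a minimal sketch:
#In-memory size of the corpus object, in megabytes
format(object.size(corpusData), units = "Mb")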
We now have two options to further clean our data: the first is to use the dfm function, the second is to use the tokens function. I have decided to use the latter. First, I converted all characters to lowercase, so that words such as Long and long are counted as the same word. Having done that, I tokenized the corpus into words with the tokens function from quanteda, removing punctuation, symbols, numbers, URLs and separators, and splitting hyphenated words.
corpusData<-tolower(corpusData) #All chars to lowercase
if (file.exists("masterTokens.RDS")){
  masterTokens<-readRDS("masterTokens.RDS")
} else {
  #Tokenize into words, dropping punctuation, symbols, numbers, URLs and
  #separators, and splitting hyphenated words
  masterTokens <- tokens(
    x = corpusData,
    what = "word",
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE,
    remove_url = TRUE,
    remove_separators = TRUE,
    split_hyphens = TRUE
  )
  saveRDS(masterTokens, "masterTokens.RDS")
}
Lastly, I used the function tokens_remove to remove stop words and profane words. Stop words are typically short, very common words in a language, and therefore provide little information for predicting the next word. Below is an example of some of the words considered stop words in English. The list of profane words was obtained from here.
profane<-readLines("list.txt") #List of profane words
masterTokens<-tokens_remove(masterTokens, c(stopwords("en"), profane)) #Drop stop words and profanity
head(stopwords("en"))
[1] "i" "me" "my" "myself" "we" "our"
Having cleaned the data, I proceeded to generate the document-feature matrix (DFM) for the corpus. The DFM is a matrix that describes the frequency of each term in each document of the corpus, and it can be created with the dfm function from the quanteda package. While building the DFM I decided to stem the words. Stemming is the process of reducing a word to its root (e.g. runs and running both reduce to run), and it is a common normalization technique in natural language processing.
if (file.exists("uniDfm.RDS")){
uniDfm<-readRDS("uniDfm.RDS")
} else {
uniDfm<- dfm(masterTokens, stem=TRUE)
saveRDS(uniDfm, "uniDfm.RDS")
}
Having constructed the DFM for unigrams, we follow the same process to build the DFMs for bi- and tri-grams. Stemming is applied in all cases. The bar plots below show the 20 most common n-grams in each case.
if (file.exists("biDfm.RDS")){
biDfm<-readRDS("biDfm.RDS")
} else {
biDfm<- masterTokens %>% tokens_ngrams(2) %>% dfm(stem=TRUE)
saveRDS(biDfm, "biDfm.RDS")
}
if (file.exists("triDfm.RDS")){
triDfm<-readRDS("triDfm.RDS")
} else {
triDfm<- masterTokens %>% tokens_ngrams(3) %>% dfm(stem=TRUE)
saveRDS(triDfm, "triDfm.RDS")
}
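As an illustration of how the counts behind such bar plots can be extracted, the sketch below uses topfeatures() from quanteda together with ggplot2 on the unigram DFM; the use of ggplot2 is an assumption, since the original plotting code is not reproduced here.
library(ggplot2)
#Extract the 20 most frequent unigrams and their counts
topUni <- topfeatures(uniDfm, 20)
topUniDf <- data.frame(term = names(topUni), count = as.numeric(topUni))
#Bar plot of the top 20 unigrams, ordered by frequency
ggplot(topUniDf, aes(x = reorder(term, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unigram", y = "Frequency")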
As stated at the beginning, the scope of this report is to show that I can load the data, explore it, clean it and put it into a format that can serve as input for a prediction model. Now that I am familiar with the data and able to manipulate it, the next steps are to choose a type of prediction model, build it, train it and test its accuracy. After that, I will build a Shiny app and deploy it on Shiny's servers.