The project consists of building a Shiny application capable of predicting the next word a user will type, based on the words typed previously. In this report I show how to load the data, then conduct an exploratory analysis and a cleaning workflow, ending with a data set that can be used to develop and train a prediction algorithm.
The objective of this project is to develop an application that predicts the next word a person will type based on the words typed previously. The app relies on a prediction algorithm trained with data obtained from blogs, news websites and Twitter. While the ultimate goal of the project is to develop an online application using Shiny, this report is limited to loading the data, conducting an exploratory analysis and cleaning the data, in preparation for developing a prediction model.
The training data set can be downloaded here and contains text obtained from blogs, news sites and Twitter in English, German, Russian and Finnish. In this project we will only consider the English data set.
We begin by loading the three English files. The first thing we can do is look at the file sizes to get an idea of the computational resources that will be needed. From the table below it can be seen that all files are above 250 Mb, with the Twitter file being the largest at over 300 Mb. The table below also shows the number of characters, lines and words in each file. Even though the blogs file has the lowest number of lines, it has the highest number of words. This is expected because people tend to write longer posts on blogs than on Twitter or in news articles. In addition, the first three lines of the blogs file are shown to give a better idea of what the data looks like, followed by a sketch of how these summaries can be computed.
               File     Size
1   en_US.blogs.txt 255.4 Mb
2    en_US.news.txt 257.3 Mb
3 en_US.twitter.txt   319 Mb
          Blogs      News   Twitter
Chars 206824382 203223154 162096241
Lines    899288   1010242   2360148
Words  37570839  34494539  30451170
[1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
[2] "We love you Mr. Brown."
[3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
It is well known that text mining is computationally expensive. Given the size of the data set, I decided to use only a fraction of it for the analysis and training. I chose 10% of the data because it seems like a reasonable trade-off between processing time and data volume.
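As an illustration, the sub-sampling could be done as in the sketch below; the use of sample() with a fixed seed is my assumption rather than the exact code used in the analysis, and blogs, news and twitter are assumed to hold the full character vectors loaded earlier.
set.seed(1234) #Arbitrary seed, for reproducibility
sampleFraction <- 0.10 #Keep 10% of the lines of each file
subBlogs   <- sample(blogs,   floor(length(blogs)*sampleFraction))
subNews    <- sample(news,    floor(length(news)*sampleFraction))
subTwitter <- sample(twitter, floor(length(twitter)*sampleFraction))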
Having sub-sampled the data, I proceed to check whether any lines are written in a language other than English. While the data claims to contain only English text, it is always good practice to check, since errors may occur. For this task I use the textcat function from the library of the same name. This process is not perfect and, especially for short sentences, the wrong language may be detected. However, after running the code I found that the sampled data contains no lines written in a language other than English.
if (sum(file.exists(c("subBlogs2.RDS","subNews2.RDS","subTwitter2.RDS")))==3){
  #Reload the cached results if they already exist
  subBlogs2<-readRDS("subBlogs2.RDS")
  subNews2<-readRDS("subNews2.RDS")
  subTwitter2<-readRDS("subTwitter2.RDS")
} else {
  #Keep only the lines that textcat classifies as English
  subBlogs2<-subBlogs[textcat(subBlogs)=="english"]
  subNews2<-subNews[textcat(subNews)=="english"]
  subTwitter2<-subTwitter[textcat(subTwitter)=="english"]
  saveRDS(subBlogs2, "subBlogs2.RDS")
  saveRDS(subNews2, "subNews2.RDS")
  saveRDS(subTwitter2, "subTwitter2.RDS")
}
length(subBlogs2)/length(subBlogs)
[1] 1
length(subNews2)/length(subNews)
[1] 1
length(subTwitter2)/length(subTwitter)
[1] 1
The next step is to combine all three sub-sampled files into a single object and create a corpus from it using the corpus function from the quanteda library. This step is necessary so that quanteda's functions can operate on the data.
if (file.exists("corpusData.RDS")){
corpusData<-readRDS("corpusData.RDS")
} else {
#Combining all vectors into a single one
allData <- c(subBlogs, subNews, subTwitter)
#Building the corpus from the char vector
corpusData <- corpus(allData)
saveRDS(corpusData, "corpusData.RDS")
}
The size of the corpus object is 174.7 Mb, which is still manageable.
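As a side note, the reported figure can be checked with object.size(); a minimal sketch:
#In-memory size of the corpus object, in megabytes
format(object.size(corpusData), units = "Mb")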
We now have two options to further clean our data: the first is to use the dfm function, the second is to use the tokens function. I have decided to use the latter. First, I converted all characters to lowercase, so that words such as Long and long are counted as the same word. Having done that, I tokenized the corpus into words with the tokens function from quanteda, removing punctuation, symbols, numbers, URLs and separators, and splitting hyphenated words.
corpusData<-tolower(corpusData) #All chars to lowercase
if (file.exists("masterTokens.RDS")){
  masterTokens<-readRDS("masterTokens.RDS")
} else {
  #Tokenize into words, dropping punctuation, symbols, numbers, URLs and
  #separators, and splitting hyphenated words
  masterTokens <- tokens(
    x = corpusData,
    what = "word",
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE,
    remove_url = TRUE,
    remove_separators = TRUE,
    split_hyphens = TRUE
  )
  saveRDS(masterTokens, "masterTokens.RDS")
}
Lastly, I used the function tokens_remove to remove stop words and profane words. Stop words are typically short, very common words in a language, and therefore provide little information for predicting the next word. Below is an example of some of the words considered stop words in English. The list of profane words was obtained from here.
profane<-readLines("list.txt") #List of profane words
masterTokens<-tokens_remove(masterTokens, c(stopwords("en"), profane)) #Drop stop words and profanity
head(stopwords("en"))
[1] "i" "me" "my" "myself" "we" "our"
Having cleaned the data, I proceeded to generate the document-feature matrix (DFM) for the corpus. The DFM is a matrix that describes the frequency of each term in each document of the corpus, and it can be created with the dfm function from the quanteda package. While building the DFM I decided to stem the words. Stemming is the process of reducing a word to its root (e.g. runs and running both reduce to run), and it is a common normalization technique in natural language processing.
if (file.exists("uniDfm.RDS")){
uniDfm<-readRDS("uniDfm.RDS")
} else {
uniDfm<- dfm(masterTokens, stem=TRUE)
saveRDS(uniDfm, "uniDfm.RDS")
}
Having constructed the DFM for unigrams, we follow the same process to build the DFMs for bi- and tri-grams. Stemming is applied in all cases. The bar plots below show the 20 most common n-grams in each case.
if (file.exists("biDfm.RDS")){
biDfm<-readRDS("biDfm.RDS")
} else {
biDfm<- masterTokens %>% tokens_ngrams(2) %>% dfm(stem=TRUE)
saveRDS(biDfm, "biDfm.RDS")
}
if (file.exists("triDfm.RDS")){
triDfm<-readRDS("triDfm.RDS")
} else {
triDfm<- masterTokens %>% tokens_ngrams(3) %>% dfm(stem=TRUE)
saveRDS(triDfm, "triDfm.RDS")
}
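As an illustration of how the counts behind such bar plots can be extracted, the sketch below uses topfeatures() from quanteda together with ggplot2 on the unigram DFM; the use of ggplot2 is an assumption, since the original plotting code is not reproduced here.
library(ggplot2)
#Extract the 20 most frequent unigrams and their counts
topUni <- topfeatures(uniDfm, 20)
topUniDf <- data.frame(term = names(topUni), count = as.numeric(topUni))
#Bar plot of the top 20 unigrams, ordered by frequency
ggplot(topUniDf, aes(x = reorder(term, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unigram", y = "Frequency")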
As stated at the beginning, the scope of this report is to show that I can load the data, explore it, clean it and put it into a format that can serve as input for a prediction model. Now that I am familiar with the data and able to manipulate it, the next steps are to choose a type of prediction model, build it, train it and test its accuracy. After that, I will build a Shiny app and deploy it on Shiny's servers.