library(hash)                                    # hash tables, used below to build the vocabulary
## hash-2.2.6 provided by Decision Patterns

Purpose

The goal of the Data Science Specialization’s Capstone Project is to build a word prediction model to be applied in the research and development of a predictive keyboard. That is, given an unfinished sentence, the model should be able to find the most probable next word. The objective of this document is to give a brief description of the data in the three data sets provided as input to the project.

Getting the datasets

The datasets were made available to students as a compressed file to be downloaded. Files for three different languages were provided; we chose the English ones. Once downloaded, they were extracted with the following results.

dir('./DataSets/')
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

We can convert all data sets into a single corpus using the ‘tm’ package.

library(tm)
sourceDir = './DataSets/'
LANG = 'en'
crp <- Corpus(DirSource(sourceDir, encoding = "UTF-8"), 
              readerControl = list(language = LANG))
class(crp)
## [1] "VCorpus" "Corpus"  "list"

The result is a ‘Corpus’ object, which is built on the standard R ‘list’ class. This structure contains three elements, one for each file, and each element ultimately wraps a character vector. We will use the blogs data set for our examples. Let’s peek inside the first element.

for (elem in crp) {cat(attr(elem, 'ID'), '-', class(elem), '\n')}
## en_US.blogs.txt - PlainTextDocument TextDocument character 
## en_US.news.txt - PlainTextDocument TextDocument character 
## en_US.twitter.txt - PlainTextDocument TextDocument character 
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## Length of element:  899288 
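
From here on we work with the blogs element as a plain character vector, which the cleaning code below refers to as dset. The extraction itself is not shown in this report; it could be done roughly like this (the object name dset is assumed):

dset <- content(crp[[1]])                      # character vector with the blog entries
length(dset)                                   # roughly 900 thousand entries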

Cleaning the data

So we’ve got almost 900 thousand blog entries here. As we can see in the first one, there are capitalized words, punctuation signs and, as we discovered by eye inspection, strange characters. In order to get some meaningful statistics we’ll need to standardize their format. We will use a subset of the blogs (20 thousand) in order to speed up processing.

We use a customized function based on functions from the ‘base’ and ‘tm’ packages. Basically, we convert everything to ASCII (ignoring any strange characters), convert all words to lowercase, and remove punctuation marks, keeping only commas and periods, which are encoded.

trData <- dset[1:20000]                          # subset of 20,000 blog entries
ctrlList <- list(convertTolower=c(TRUE, 3),                 # data cleaning parameters
                 verbose=FALSE,
                 convertToASCII=TRUE,
                 removePunct=TRUE,
                 removeNumbers=TRUE,
                 removeStopWords=c(FALSE, NULL))

trData <- cleanDoc(x=trData, control=ctrlList)             # Clean data
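
The cleanDoc function itself is not listed in this report. As an orientation only, a minimal sketch of the kind of transformations it applies, written in base R, might look like the following; it is not the actual implementation.

cleanSketch <- function(x) {
    x <- iconv(x, from = 'UTF-8', to = 'ASCII', sub = '')   # to ASCII, dropping strange characters
    x <- tolower(x)                                          # lowercase everything
    x <- gsub(',', ' xxcommaxx ', x, fixed = TRUE)           # encode commas
    x <- gsub('\\.', ' xxstopxx ', x)                        # encode periods
    x <- gsub('[^a-z ]', ' ', x)                             # drop remaining punctuation and digits
    gsub('\\s+', ' ', trimws(x))                             # squeeze whitespace
}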

Let’s take a look at the first three blogs. They certainly look clean enough now.

## [1] "in the years thereafter xxcommaxx most of the oil fields and platforms were named after pagan gods"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
## [2] "we love you mr xxstopxx brown"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
## [3] "chad has been awesome with the kids and holding down the fort while i work later than usual xxstopxx the kids have been busy together playing skylander on the xbox together xxcommaxx after kyan cashed in his  from his piggy bank xxstopxx he wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it he never taps into that thing either xxcommaxx that is how we know he wanted it so bad xxstopxx we made him count all of his money to make sure that he had enough xxstopxx it was very cute to watch his reaction when he realized he did xxstopxx he also does a very good job of letting lola feel like she is playing too xxcommaxx by letting her switch out the characters xxstopxx she loves it almost as much as him"

Word frequencies

Since we are going to develop a predictive model, we are interested in understanding the frequencies with which words appear in the texts. Our first model will be based on n-gram frequencies, so we have developed a set of functions dedicated to extracting n-grams from the training data. In order to avoid exhausting the computer’s memory with too much data, we process the input in chunks. The following call extracts 1-grams (words) from the subset of the training set we’re using as an example for this report.

n1g <- Ngram.tf(x = trData,                    # input data
                     fun = ngram2,             # function for ngram extraction
                     n = 1,                    # we want words
                     encoded = FALSE,          # plain text
                     encode = FALSE,           # do not encode them
                     threshold = 1,            # keep terms with this frequency or higher
                     chunkSize = 0.1)          # divide the input data in chunks of this proportion
## >>> Creating 1-grams 
##     886128 terms generated. 
## >>> Estimating time for main process. 
##     Estimated time for main process:  56 secs 
## >>> Partitioning input file in 10 parts of size 88612 
## >>> Terms integrity checked OK 
## >>> Creating term frequencies dicts. 
##     Processing list 1 with 90022 terms. 
##     Processing list 2 with 88458 terms. 
##     Processing list 3 with 88656 terms. 
##     Processing list 4 with 95419 terms. 
##     Processing list 5 with 88242 terms. 
##     Processing list 6 with 87587 terms. 
##     Processing list 7 with 87725 terms. 
##     Processing list 8 with 93600 terms. 
##     Processing list 9 with 120813 terms. 
##     Processing list 10 with 45606 terms. 
## >>> Clearing memory and unlisting dicts. 
## >>> End of Job. Job time was: 55.31 secs
WRDS   <- names(n1g[order(n1g, decreasing=T)])   # terms ordered by frequency, most frequent first
WRDS_H <- hash(WRDS, 1:length(WRDS))             # word  -> index hash (the vocabulary)
WRDS_I <- invertVocab(WRDS_H)                    # index -> word hash
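
Ngram.tf and invertVocab are custom helpers whose code is not shown here. Conceptually, for 1-grams the same frequencies could be obtained with base R alone (without the chunking), and the hashes simply map words to integer indices and back. A rough sketch under those assumptions:

library(hash)

tokens <- unlist(strsplit(trData, '\\s+'))              # split the cleaned blogs into words
tokens <- tokens[tokens != '']                          # drop empty strings
n1g_simple <- table(tokens)                             # term frequencies, equivalent in spirit to n1g

vocab <- names(sort(n1g_simple, decreasing = TRUE))
W2I <- hash(vocab, seq_along(vocab))                    # word  -> index
I2W <- hash(as.character(seq_along(vocab)), vocab)      # index -> word
W2I[['the']]                                            # the most frequent word gets index 1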

We’ve managed to build a hash table of words that we will use later as a vocabulary. We have kept low-frequency terms on purpose: we will show that a high proportion of them are not actual English words. For example, let’s take a look at the first terms generated in the dictionary.

names(n1g)[1:20]
##  [1] "a"             "aa"            "aaa"           "aaaaaand"     
##  [5] "aaaaand"       "aaaagggghhhh"  "aaaandimpasse" "aaaannnd"     
##  [9] "aaargh"        "aadhaar"       "aadhar"        "aadil"        
## [13] "aaggh"         "aahhh"         "aam"           "aamer"        
## [17] "aamoth"        "aand"          "aang"          "aapish"

It looks like there are lots of them. In order to see how large that proportion is, we will build a data frame of words together with their counts.

n1gn <- Ngram.tf(x = trData,                    # input data
                     n = 1,                         # ngram type
                     fun = ngram2,                  # ngram generator function
                     encoded = FALSE,               # input not encoded
                     encode = TRUE,                 # encode ngram
                     threshold = 1,                 # keep terms with this frequency or higher
                     chunkSize = 0.1)               # chunk size to process by steps
## >>> Creating 1-grams 
##     886128 terms generated. 
## >>> Estimating time for main process. 
##     Estimated time for main process:  21.52 secs 
## >>> Partitioning input file in 10 parts of size 88612 
## >>> Terms integrity checked OK 
## >>> Creating term frequencies dicts. 
##     Processing list 1 with 112214 terms. 
##     Processing list 2 with 86412 terms. 
##     Processing list 3 with 91182 terms. 
##     Processing list 4 with 86514 terms. 
##     Processing list 5 with 85047 terms. 
##     Processing list 6 with 85045 terms. 
##     Processing list 7 with 85018 terms. 
##     Processing list 8 with 84898 terms. 
##     Processing list 9 with 84903 terms. 
##     Processing list 10 with 84895 terms. 
## >>> Clearing memory and unlisting dicts. 
## >>> End of Job. Job time was: 1.24 mins
n1gn_df <- ngram.DF(n1gn)                       # create data frame
head(n1gn_df, 20)[-5]                           # Show first rows
##    Key Count        W1        Prob
## 1    1 41104       the 0.046386075
## 2    2 38752 xxcommaxx 0.043731831
## 3    3 32358  xxstopxx 0.036516169
## 4    4 24064       and 0.027156348
## 5    5 23362        to 0.026364137
## 6    6 19744         a 0.022281205
## 7    7 19242        of 0.021714696
## 8    8 16819         i 0.018980328
## 9    9 13098        in 0.014781160
## 10  10  9992      that 0.011276023
## 11  11  9595        is 0.010828007
## 12  12  8846        it 0.009982756
## 13  13  7864       for 0.008874564
## 14  14  6433       you 0.007259674
## 15  15  6311      with 0.007121996
## 16  16  6119       was 0.006905323
## 17  17  6105        on 0.006889524
## 18  18  5912        my 0.006671722
## 19  19  5737      this 0.006474234
## 20  20  4832        as 0.005452937

The 1-gram data frame has been ordered by Count in descending order, so we’re looking at the most frequent words here. The most common punctuation marks have been encoded, since they could prove meaningful for prediction purposes. But we will see that those ‘words’ appearing only once are too numerous and that they distort probabilities.
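
ngram.DF is also a custom helper. For 1-grams it essentially turns a named frequency vector into a data frame ordered by count, with Prob computed as Count divided by the total number of tokens. A hypothetical sketch, using n1g_simple from the earlier sketch:

ngramDFSketch <- function(freqs) {
    cnt <- sort(c(freqs), decreasing = TRUE)    # named counts, most frequent first
    data.frame(Key   = seq_along(cnt),
               Count = as.integer(cnt),
               W1    = names(cnt),
               Prob  = as.numeric(cnt) / sum(cnt),
               stringsAsFactors = FALSE)
}
head(ngramDFSketch(n1g_simple))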

Taking a look at our data frame as it is right now, we get the following histogram. Evidently, it is not very informative.

histsum(n1gn_df$Count, units='Counts', tit='Word Frequencies')

## >>  Word Frequencies 
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.00     1.00     1.00    19.24     4.00 41100.00
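
histsum is a small custom plotting helper that draws the histogram and prints the summary shown above; roughly the same information can be obtained with base R (shown only for orientation, not the actual function):

hist(n1gn_df$Count, main = 'Word Frequencies', xlab = 'Counts')    # histogram of raw counts
summary(n1gn_df$Count)                                             # five-number summary plus mean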

Getting rid of hapax legomena

Let’s take a look at some of those words that appear just once among the almost 900 thousand tokens we extracted.

head(subset(n1gn_df, Count <= 1), 20)
##         Key Count            W1         Prob       ProbWB
## 22362 22362     1           aaa 1.128505e-06 1.072762e-06
## 22363 22363     1      aaaaaand 1.128505e-06 1.072762e-06
## 22364 22364     1       aaaaand 1.128505e-06 1.072762e-06
## 22365 22365     1  aaaagggghhhh 1.128505e-06 1.072762e-06
## 22366 22366     1 aaaandimpasse 1.128505e-06 1.072762e-06
## 22367 22367     1      aaaannnd 1.128505e-06 1.072762e-06
## 22368 22368     1        aaargh 1.128505e-06 1.072762e-06
## 22369 22369     1       aadhaar 1.128505e-06 1.072762e-06
## 22370 22370     1         aadil 1.128505e-06 1.072762e-06
## 22371 22371     1         aaggh 1.128505e-06 1.072762e-06
## 22372 22372     1         aahhh 1.128505e-06 1.072762e-06
## 22373 22373     1         aamer 1.128505e-06 1.072762e-06
## 22374 22374     1        aamoth 1.128505e-06 1.072762e-06
## 22375 22375     1          aand 1.128505e-06 1.072762e-06
## 22376 22376     1          aang 1.128505e-06 1.072762e-06
## 22377 22377     1        aapish 1.128505e-06 1.072762e-06
## 22378 22378     1       aaronic 1.128505e-06 1.072762e-06
## 22379 22379     1        aarons 1.128505e-06 1.072762e-06
## 22380 22380     1         aasif 1.128505e-06 1.072762e-06
## 22381 22381     1          aasl 1.128505e-06 1.072762e-06

More than half of the distinct terms are hapax legomena. We will remove them.

n1gn_wrds <- subset(n1gn_df, Count > 1)
hapax <- nrow(n1gn_df) - nrow(n1gn_wrds)
cat('There are', hapax, 'out of', nrow(n1gn_df), 'tokens that appear only once. \n')
## There are 23684 out of 46045 tokens that appear only once.

We must also remember to remove the codes we introduced for the punctuation marks (note that the pattern below also picks up ‘xxiii’, a Roman numeral that happens to match).

punct <- grep('^xx[a-z]+', n1gn_wrds$W1)
n1gn_wrds[punct,][,-5]
##         Key Count        W1         Prob
## 2         2 38752 xxcommaxx 4.373183e-02
## 3         3 32358  xxstopxx 3.651617e-02
## 61       61  1967 xxcolonxx 2.219770e-03
## 22304 22304     2     xxiii 2.257010e-06
n1gn_wrds <- n1gn_wrds[-punct,]

We can now recompute the word probabilities.

n1gn_wrds$Prob <- n1gn_wrds$Count / sum((n1gn_wrds$Count))
cat('Sum of Probabilities = ', sum(n1gn_wrds$Prob), '. \n', sep='')
## Sum of Probabilities = 1.
head(n1gn_wrds,20)[,-5]
##    Key Count   W1        Prob
## 1    1 41104  the 0.052072235
## 4    4 24064  and 0.030485263
## 5    5 23362   to 0.029595941
## 6    6 19744    a 0.025012510
## 7    7 19242   of 0.024376556
## 8    8 16819    i 0.021307000
## 9    9 13098   in 0.016593084
## 10  10  9992 that 0.012658276
## 11  11  9595   is 0.012155340
## 12  12  8846   it 0.011206476
## 13  13  7864  for 0.009962438
## 14  14  6433  you 0.008149589
## 15  15  6311 with 0.007995034
## 16  16  6119  was 0.007751800
## 17  17  6105   on 0.007734065
## 18  18  5912   my 0.007489564
## 19  19  5737 this 0.007267867
## 20  20  4832   as 0.006121376
## 21  21  4773 have 0.006046632
## 22  22  4584  but 0.005807199

We will apply a log transformation to the counts in order to get a more meaningful histogram.

histsum(log2(n1gn_wrds$Count), units='Log2(Count)', tit='Word Frequencies')

## >>  Word Frequencies 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.614   3.459  15.330

The histogram now looks much better. We can clearly see Zipf’s law in action here.
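
As an aside, Zipf’s law is easiest to see on a log-log rank-frequency plot. The following is an illustrative sketch, not part of the original analysis:

cnt <- sort(n1gn_wrds$Count, decreasing = TRUE)          # counts ordered by rank
plot(log10(seq_along(cnt)), log10(cnt),                  # roughly a straight line under Zipf's law
     xlab = 'log10(rank)', ylab = 'log10(count)', main = 'Rank vs frequency')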

Next Steps

Next steps will involve training on a much larger proportion of the data set, extracting bigrams and trigrams to create an n-gram prediction model, and testing its results.
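
As a small illustration of what the bigram extraction step will involve (not the final implementation), consecutive word pairs can be formed from the cleaned text like this:

bigrams <- function(txt) {                       # all consecutive word pairs in one entry
    w <- unlist(strsplit(txt, '\\s+'))
    w <- w[w != '']
    if (length(w) < 2) return(character(0))
    paste(w[-length(w)], w[-1])
}
head(sort(table(unlist(lapply(trData, bigrams))), decreasing = TRUE))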