The goal of the Data Science Specialization’s Capstone Project is to build a word prediction model to be applied in the research and development of a predictive keyboard. That is, given an unfinished sentence, the model should be able to find the most probable next word. The aim of this document is to give a brief description of the data in the three data sets provided as input to the project.
The data sets were made available to students as a compressed file to be downloaded. Files corresponding to three different languages were provided; we chose the English ones. Once on disk, they were extracted with the following result.
dir('./DataSets/')
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
We can convert all data sets into a single corpus using the ‘tm’ package.
library(tm)
sourceDir = './DataSets/'
LANG = 'en'
crp <- Corpus(DirSource(sourceDir, encoding = "UTF-8"),
readerControl = list(language = LANG))
class(crp)
## [1] "VCorpus" "Corpus" "list"
The result is a ‘VCorpus’ object, which is built on the standard R ‘list’ class. The structure contains three elements, one per file; each element is a PlainTextDocument whose content is a character vector. We will use the blogs data set for our examples. Let’s list the elements and then peek inside the first one.
for (elem in crp) {cat(attr(elem, 'ID'), '-', class(elem), '\n')}
## en_US.blogs.txt - PlainTextDocument TextDocument character
## en_US.news.txt - PlainTextDocument TextDocument character
## en_US.twitter.txt - PlainTextDocument TextDocument character
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan gods."
## Length of element: 899288
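The code that produced the peek above is not shown in the report. Presumably it also extracts the blogs element into the character vector ‘dset’ used below; a hedged sketch of that step (the name ‘dset’ is taken from the later code, the rest is an assumption) could be:
dset <- content(crp[[1]])                     # assumed: blogs element as a plain character vector
dset[1]                                       # first blog entry
cat('Length of element:', length(dset), '\n') # number of blog entries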
So we have almost 900 thousand blog entries here. As we can see in the first one, there are capitalized words, punctuation marks and, as we discovered by eye inspection, strange characters. In order to get some meaningful statistics we will need to standardize their format. We will work with a subset of 20,000 blog entries in order to speed up processing.
We use a customized function based on functions from the ‘base’ and ‘tm’ packages. Basically, we convert everything to ASCII (ignoring characters that cannot be converted), convert all text to lowercase, remove numbers, and remove punctuation marks except commas and periods, which are kept as encoded tokens.
trData <- dset[1:20000]
ctrlList <- list(convertTolower=c(TRUE, 3), # data cleaning parameters
verbose=FALSE,
convertToASCII=TRUE,
removePunct=TRUE,
removeNumbers=TRUE,
removeStopWords=c(FALSE, NULL))
trData <- cleanDoc(x=trData, control=ctrlList) # Clean data
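The source of cleanDoc is not included in this report. As a rough sketch of the kind of transformation it applies (the function name cleanDocSketch and the exact rules are assumptions inferred from the control list and the output below):
cleanDocSketch <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "") # convert to ASCII, dropping odd characters
  x <- tolower(x)                                       # lowercase everything
  x <- gsub(",", " xxcommaxx ", x, fixed = TRUE)        # encode commas as tokens
  x <- gsub("\\.", " xxstopxx ", x)                     # encode periods as tokens
  x <- gsub("[0-9]+", " ", x)                           # remove numbers
  x <- gsub("[^a-z ]", " ", x)                          # drop any remaining punctuation
  gsub("\\s+", " ", trimws(x))                          # normalize whitespace
}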
Let’s take a look at the first three cleaned blog entries. They now look clean enough for further processing.
## [1] "in the years thereafter xxcommaxx most of the oil fields and platforms were named after pagan gods"
## [2] "we love you mr xxstopxx brown"
## [3] "chad has been awesome with the kids and holding down the fort while i work later than usual xxstopxx the kids have been busy together playing skylander on the xbox together xxcommaxx after kyan cashed in his from his piggy bank xxstopxx he wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it he never taps into that thing either xxcommaxx that is how we know he wanted it so bad xxstopxx we made him count all of his money to make sure that he had enough xxstopxx it was very cute to watch his reaction when he realized he did xxstopxx he also does a very good job of letting lola feel like she is playing too xxcommaxx by letting her switch out the characters xxstopxx she loves it almost as much as him"
Since we are going to develop a predictive model, we are interested in understanding the frequencies with which words appear in the texts. Our first model will be based on n-gram frequencies, so we have developed a set of functions dedicated to extracting n-grams from the training data. In order to avoid exhausting the computer’s memory with too much data, we process the input in steps. The following call extracts 1-grams (words) from the subset of the training set we are using as an example for this report.
n1g <- Ngram.tf(x = trData, # input data
fun = ngram2, # function for ngram extraction
n = 1, # we want words
encoded = FALSE, # plain text
encode = FALSE, # do not encode them
threshold = 1, # keep terms with this frequency or higher
chunkSize = 0.1) # divide the input data in chunks of this proportion
## >>> Creating 1-grams
## 886128 terms generated.
## >>> Estimating time for main process.
## Estimated time for main process: 56 secs
## >>> Partitioning input file in 10 parts of size 88612
## >>> Terms integrity checked OK
## >>> Creating term frequencies dicts.
## Processing list 1 with 90022 terms.
## Processing list 2 with 88458 terms.
## Processing list 3 with 88656 terms.
## Processing list 4 with 95419 terms.
## Processing list 5 with 88242 terms.
## Processing list 6 with 87587 terms.
## Processing list 7 with 87725 terms.
## Processing list 8 with 93600 terms.
## Processing list 9 with 120813 terms.
## Processing list 10 with 45606 terms.
## >>> Clearing memory and unlisting dicts.
## >>> End of Job. Job time was: 55.31 secs
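Ngram.tf and ngram2 are custom helpers developed for this project and are not listed here. Purely as an illustration of the chunked counting idea (the function below is a sketch, not the project’s actual implementation):
ngram1Counts <- function(x, chunkSize = 0.1) {
  n <- ceiling(length(x) * chunkSize)                       # entries per chunk
  chunks <- unname(split(x, ceiling(seq_along(x) / n)))     # partition the input
  total <- integer(0)
  for (chunk in chunks) {
    tf <- table(unlist(strsplit(chunk, "\\s+")))            # term frequencies for this chunk
    total <- c(total, tf)                                   # accumulate per-chunk counts
  }
  sort(tapply(total, names(total), sum), decreasing = TRUE) # merge duplicates across chunks
}
# e.g. n1g_sketch <- ngram1Counts(trData, chunkSize = 0.1)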
library(hash)                                  # hash-2.2.6, provided by Decision Patterns
WRDS <- names(n1g[order(n1g, decreasing=T)])   # words sorted by decreasing frequency
WRDS_H <- hash(WRDS, 1:length(WRDS))           # word -> index hash table
WRDS_I <- invertVocab(WRDS_H)                  # index -> word (custom helper)
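invertVocab is another custom helper, presumably building the reverse index-to-word lookup. A hedged sketch using the keys() and values() functions of the ‘hash’ package might be:
invertVocabSketch <- function(h) {
  hash(as.character(values(h)), keys(h))   # swap keys and values of the vocabulary hash
}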
We have built a hash table of words that we will use later as a vocabulary. We have kept low-frequency terms on purpose: we will show that a high proportion of them are not actual English words. For example, let’s take a look at the first terms generated in the dictionary.
names(n1g)[1:20]
## [1] "a" "aa" "aaa" "aaaaaand"
## [5] "aaaaand" "aaaagggghhhh" "aaaandimpasse" "aaaannnd"
## [9] "aaargh" "aadhaar" "aadhar" "aadil"
## [13] "aaggh" "aahhh" "aam" "aamer"
## [17] "aamoth" "aand" "aang" "aapish"
It looks like there are lots of them. In order to see how large that proportion is, we will build a data frame of words and their counts.
n1gn <- Ngram.tf(x = trData, # input data
n = 1, # ngram type
fun = ngram2, # ngram generator function
encoded = FALSE, # input not encoded
encode = TRUE, # encode ngram
threshold = 1, # keep terms with this frequency or higher
chunkSize = 0.1) # chunk size to process by steps
## >>> Creating 1-grams
## 886128 terms generated.
## >>> Estimating time for main process.
## Estimated time for main process: 21.52 secs
## >>> Partitioning input file in 10 parts of size 88612
## >>> Terms integrity checked OK
## >>> Creating term frequencies dicts.
## Processing list 1 with 112214 terms.
## Processing list 2 with 86412 terms.
## Processing list 3 with 91182 terms.
## Processing list 4 with 86514 terms.
## Processing list 5 with 85047 terms.
## Processing list 6 with 85045 terms.
## Processing list 7 with 85018 terms.
## Processing list 8 with 84898 terms.
## Processing list 9 with 84903 terms.
## Processing list 10 with 84895 terms.
## >>> Clearing memory and unlisting dicts.
## >>> End of Job. Job time was: 1.24 mins
n1gn_df <- ngram.DF(n1gn) # create data frame
head(n1gn_df, 20)[-5] # Show first rows
## Key Count W1 Prob
## 1 1 41104 the 0.046386075
## 2 2 38752 xxcommaxx 0.043731831
## 3 3 32358 xxstopxx 0.036516169
## 4 4 24064 and 0.027156348
## 5 5 23362 to 0.026364137
## 6 6 19744 a 0.022281205
## 7 7 19242 of 0.021714696
## 8 8 16819 i 0.018980328
## 9 9 13098 in 0.014781160
## 10 10 9992 that 0.011276023
## 11 11 9595 is 0.010828007
## 12 12 8846 it 0.009982756
## 13 13 7864 for 0.008874564
## 14 14 6433 you 0.007259674
## 15 15 6311 with 0.007121996
## 16 16 6119 was 0.006905323
## 17 17 6105 on 0.006889524
## 18 18 5912 my 0.006671722
## 19 19 5737 this 0.006474234
## 20 20 4832 as 0.005452937
The 1-gram data frame is ordered by Count in descending order, so we are looking at the most frequent words here. The most common punctuation marks have been encoded as tokens, since they could prove useful for prediction. But we will see that the ‘words’ appearing only once are so numerous that they distort the probabilities.
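ngram.DF is also a custom helper. A minimal sketch of how such a frequency data frame could be built from a named vector of counts (the column choices are inferred from the output above, the function name is an assumption):
ngramDFSketch <- function(counts) {
  counts <- sort(counts, decreasing = TRUE)            # most frequent terms first
  data.frame(Key   = seq_along(counts),
             Count = as.integer(counts),
             W1    = names(counts),
             Prob  = as.integer(counts) / sum(counts), # maximum-likelihood probabilities
             stringsAsFactors = FALSE)
}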
Taking a look at our data frame as it is right now, we get the following histogram, which is evidently not very informative.
histsum(n1gn_df$Count, units='Counts', tit='Word Frequencies')
## >> Word Frequencies
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 1.00 19.24 4.00 41100.00
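histsum is a small custom summary-and-plot helper. A rough base-graphics equivalent (a sketch, not the actual function used in this report) could be:
histsumSketch <- function(x, units = '', tit = '') {
  cat('>>', tit, '\n')
  print(summary(x))                 # numeric summary, as shown above
  hist(x, xlab = units, main = tit) # histogram of the supplied values
}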
Let’s take a look at some of the words that appear just once among the almost 900 thousand tokens we extracted.
head(subset(n1gn_df, Count <= 1), 20)
## Key Count W1 Prob ProbWB
## 22362 22362 1 aaa 1.128505e-06 1.072762e-06
## 22363 22363 1 aaaaaand 1.128505e-06 1.072762e-06
## 22364 22364 1 aaaaand 1.128505e-06 1.072762e-06
## 22365 22365 1 aaaagggghhhh 1.128505e-06 1.072762e-06
## 22366 22366 1 aaaandimpasse 1.128505e-06 1.072762e-06
## 22367 22367 1 aaaannnd 1.128505e-06 1.072762e-06
## 22368 22368 1 aaargh 1.128505e-06 1.072762e-06
## 22369 22369 1 aadhaar 1.128505e-06 1.072762e-06
## 22370 22370 1 aadil 1.128505e-06 1.072762e-06
## 22371 22371 1 aaggh 1.128505e-06 1.072762e-06
## 22372 22372 1 aahhh 1.128505e-06 1.072762e-06
## 22373 22373 1 aamer 1.128505e-06 1.072762e-06
## 22374 22374 1 aamoth 1.128505e-06 1.072762e-06
## 22375 22375 1 aand 1.128505e-06 1.072762e-06
## 22376 22376 1 aang 1.128505e-06 1.072762e-06
## 22377 22377 1 aapish 1.128505e-06 1.072762e-06
## 22378 22378 1 aaronic 1.128505e-06 1.072762e-06
## 22379 22379 1 aarons 1.128505e-06 1.072762e-06
## 22380 22380 1 aasif 1.128505e-06 1.072762e-06
## 22381 22381 1 aasl 1.128505e-06 1.072762e-06
Almost 50% of the distinct word types are hapax legomena, i.e. they appear only once. We will remove them.
n1gn_wrds <- subset(n1gn_df, Count > 1)
hapax <- nrow(n1gn_df) - nrow(n1gn_wrds)
cat('There are', hapax, 'out of', nrow(n1gn_df), 'word types that appear only once. \n')
## There are 23684 out of 46045 word types that appear only once.
We must also remember to remove the codes we introduced for the punctuation marks.
punct <- grep('^xx[a-z]+', n1gn_wrds$W1)
n1gn_wrds[punct,][,-5]
## Key Count W1 Prob
## 2 2 38752 xxcommaxx 4.373183e-02
## 3 3 32358 xxstopxx 3.651617e-02
## 61 61 1967 xxcolonxx 2.219770e-03
## 22304 22304 2 xxiii 2.257010e-06
n1gn_wrds <- n1gn_wrds[-punct,]
Note that the pattern above also matches ‘xxiii’, a Roman numeral, which gets removed together with the punctuation codes. We can now recompute the word probabilities.
n1gn_wrds$Prob <- n1gn_wrds$Count / sum(n1gn_wrds$Count)
cat('Sum of Probabilities = ', sum(n1gn_wrds$Prob), '. \n', sep='')
## Sum of Probabilities = 1.
head(n1gn_wrds,20)[,-5]
## Key Count W1 Prob
## 1 1 41104 the 0.052072235
## 4 4 24064 and 0.030485263
## 5 5 23362 to 0.029595941
## 6 6 19744 a 0.025012510
## 7 7 19242 of 0.024376556
## 8 8 16819 i 0.021307000
## 9 9 13098 in 0.016593084
## 10 10 9992 that 0.012658276
## 11 11 9595 is 0.012155340
## 12 12 8846 it 0.011206476
## 13 13 7864 for 0.009962438
## 14 14 6433 you 0.008149589
## 15 15 6311 with 0.007995034
## 16 16 6119 was 0.007751800
## 17 17 6105 on 0.007734065
## 18 18 5912 my 0.007489564
## 19 19 5737 this 0.007267867
## 20 20 4832 as 0.006121376
## 21 21 4773 have 0.006046632
## 22 22 4584 but 0.005807199
We apply a log transformation to the counts in order to get a more meaningful histogram.
histsum(log2(n1gn_wrds$Count), units='Log2(Count)', tit='Word Frequencies')
## >> Word Frequencies
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.614 3.459 15.330
The histogram now looks much better. We can clearly see Zipf’s law in action here.
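A quick way to visualise Zipf’s law with the counts computed above (a sketch; this plot is not part of the original report) is to check that log frequency is roughly linear in log rank:
ranks <- seq_len(nrow(n1gn_wrds))
plot(log2(ranks), log2(sort(n1gn_wrds$Count, decreasing = TRUE)),
     xlab = 'Log2(Rank)', ylab = 'Log2(Count)', main = 'Zipf plot of word frequencies')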
Next steps will involve training on a much larger proportion of the data set, extracting bigrams and trigrams to build an n-gram prediction model, and testing its results.