This report is an exploratory analysis of the SwiftKey dataset. The dataset covers several locales, with text drawn from blogs, Twitter, and news. Our goal is to build an n-gram model from the training data and predict the next word given a word sequence.
We downloaded the dataset and extracted it locally. There are 4 locales, and each locale contains 3 datasets: twitter, blogs, and news.
We can count the number of lines in each file with a utility function, getCountOfLines, which wraps the shell command "wc -l". The line and word counts are shown below; a sketch of such a utility follows the table.
## filename linecount wordcount
## 1 final/en_US/en_US.blogs.txt 899288 210160014
## 2 final/en_US/en_US.news.txt 1010242 205811885
## 3 final/en_US/en_US.twitter.txt 2360148 167105338
## 4 final/de_DE/de_DE.blogs.txt 371440 85459666
## 5 final/de_DE/de_DE.news.txt 244743 95591959
## 6 final/de_DE/de_DE.twitter.txt 947774 75578341
## 7 final/fi_FI/fi_FI.blogs.txt 439785 108503595
## 8 final/fi_FI/fi_FI.news.txt 485758 94234350
## 9 final/fi_FI/fi_FI.twitter.txt 285214 25331142
## 10 final/ru_RU/ru_RU.blogs.txt 337100 116855835
## 11 final/ru_RU/ru_RU.news.txt 196360 118996424
## 12 final/ru_RU/ru_RU.twitter.txt 881414 105182346
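A minimal sketch of such a utility in R, assuming the extracted files live under ./final and that the shell command wc is available (this re-implementation of getCountOfLines is for illustration only):

# Count lines and words in a text file by shelling out to `wc -lw`.
getCountOfLines <- function(filepath) {
    out <- system2("wc", args = c("-lw", shQuote(filepath)), stdout = TRUE)
    counts <- as.numeric(strsplit(trimws(out), "\\s+")[[1]][1:2])
    data.frame(filename = filepath,
               linecount = counts[1],
               wordcount = counts[2])
}

# Apply to every *.txt file under the extracted dataset
files <- list.files("final", pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)
do.call(rbind, lapply(files, getCountOfLines))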
We start by analyzing the text with the text analysis package quanteda. This package lets us clean the text data, including removing stop words, punctuation, numbers, and symbols.
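As an illustrative sketch of that cleaning step (the exact chunk is not reproduced in this report), assuming train.raw holds the sampled lines for one file (see the sampling sketch further below):

library(quanteda)

# Build a corpus from the sampled lines, then tokenize and clean:
# drop punctuation, numbers, symbols, and English stop words.
train.corpus <- corpus(train.raw)
train.tokens <- tokens(train.corpus,
                       remove_punct = TRUE,
                       remove_numbers = TRUE,
                       remove_symbols = TRUE)
train.tokens <- tokens_remove(train.tokens, stopwords("en"))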
Since we predict words with n-grams, we build bigram, trigram, and 4-gram models. For the bigram, the first word is used to predict the next word (the outcome). For the trigram, the first two words are used to predict the last word, and for the 4-gram, the first three words are used to predict the last word.
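For example, the 4-gram token "one_of_the_best" has the prefix "one_of_the" and the outcome "best"; a small sketch of that split (splitNgram is a hypothetical helper, and "_" is quanteda's default n-gram concatenator):

# Split an n-gram token "w1_w2_..._wn" into its prefix
# (first n-1 words) and its outcome (last word).
splitNgram <- function(ngram) {
    parts <- strsplit(ngram, "_", fixed = TRUE)[[1]]
    list(prefix = paste(head(parts, -1), collapse = "_"),
         outcome = tail(parts, 1))
}

splitNgram("one_of_the_best")
# prefix: "one_of_the", outcome: "best"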
For each file we first tokenize the corpus, build a sparse matrix of n-gram tokens with their occurrence counts, and then sum the counts to obtain the frequency of each n-gram token.
Because the files are huge, we sample the data to avoid using excessive memory; later we can use k-fold sampling to train on multiple folds or read the entire corpus. Since the training corpus contains different locales and different text sources, we build a separate model for each source.
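A minimal sampling sketch, assuming a sample fraction of 0.1 and a hypothetical filePath variable pointing at one source file:

set.seed(1234)                      # reproducible sampling
sampleFraction <- 0.1

all.lines <- readLines(filePath, encoding = "UTF-8", skipNul = TRUE)
train.raw <- sample(all.lines, size = floor(length(all.lines) * sampleFraction))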
train.tokens <- tokens_wordstem(train.tokens,
                                language = getStemLanguage(localeName))
train.tokens <- tokens_ngrams(train.tokens, n = nGramModelUnit)
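The chunk above stems the cleaned tokens and forms n-grams of size nGramModelUnit. Continuing the sketch, the per-document counts and corpus-wide frequencies described earlier could then be computed as follows (variable names beyond train.tokens are assumptions):

# Sparse matrix of n-gram counts per document, then summed to
# corpus-wide frequencies.
train.dfm  <- dfm(train.tokens)
ngram.freq <- colSums(train.dfm)
ngram.freq <- sort(ngram.freq, decreasing = TRUE)

head(ngram.freq, 20)          # top 20 most frequent n-grams
# or equivalently: topfeatures(train.dfm, 20)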
We obtain a model for each data source. Below are the top 20 most frequent n-grams from each file, along with their frequencies.
Below are the top 20 most frequent bigrams for sample size 0.1.
include_graphics("./en_US_2gram_0.1.jpg")
Below are the top 20 most frequent trigrams for sample size 0.1.
include_graphics("./en_US_3gram_0.1.jpg")
Below are the top 20 most frequent 4-grams for sample size 0.1.
include_graphics("./en_US_4gram_0.1.jpg")
To check the performance and statistics of the n-gram models we built, we list the following stats in the table below. From the table we can see that the larger the sample size, the longer the build time.
stat_list <- readRDS("stats_model_en_US.rds")
stat_list <- stat_list %>%
    mutate(filename = basename(as.character(filename))) %>%
    mutate(timespent = round(as.numeric(timespent)))
print(stat_list)
## filename sample_size ngramUnits timespent tokens
## 1 en_US.blogs.txt 0.01 4 57 300936
## 2 en_US.news.txt 0.01 4 70 304513
## 3 en_US.twitter.txt 0.01 4 53 206380
## 4 en_US.blogs.txt 0.05 4 405 1604484
## 5 en_US.news.txt 0.05 4 441 1464360
## 6 en_US.twitter.txt 0.05 4 448 1009221
## 7 en_US.blogs.txt 0.1 4 884 3168905
## 8 en_US.news.txt 0.1 4 1041 2895384
## 9 en_US.twitter.txt 0.1 4 1123 1963647
## 10 en_US.blogs.txt 0.01 2 338 157491
## 11 en_US.news.txt 0.01 2 340 174426
## 12 en_US.twitter.txt 0.01 2 324 129822
## 13 en_US.blogs.txt 0.1 2 593 1001615
## 14 en_US.news.txt 0.1 2 734 1017311
## 15 en_US.twitter.txt 0.1 2 855 787276
## 16 en_US.blogs.txt 0.01 3 65 274008
## 17 en_US.news.txt 0.01 3 81 285380
## 18 en_US.twitter.txt 0.01 3 96 203157
## 19 en_US.blogs.txt 0.05 3 410 1315494
## 20 en_US.news.txt 0.05 3 467 1254255
## 21 en_US.twitter.txt 0.05 3 491 905402
## 22 en_US.blogs.txt 0.1 3 1073 2465992
## 23 en_US.news.txt 0.1 3 1385 2358702
## 24 en_US.twitter.txt 0.1 3 1533 1681314
We can also see that the percentage of n-grams occurring only once is quite high for the trigram and 4-gram models (almost 90% of the 4-grams occur only once).
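A minimal sketch of how that singleton percentage could be computed from a frequency vector such as ngram.freq in the sketch above (singletonRate is a hypothetical helper):

# Share of n-grams that occur exactly once (singletons).
singletonRate <- function(freq) {
    mean(freq == 1)
}

round(100 * singletonRate(ngram.freq), 1)   # e.g. ~90 for the 4-gram model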
In order to build a Shiny app that non-data-scientists can also use, we have the following requirements:
The source code for this Rmd file can be found on GitHub (https://github.com/jjtt8080/SwiftKey_Data_Capstone).