Executive Summary

This report presents an exploratory analysis of the SwiftKey dataset. The dataset covers several locales, and the text comes from blogs, Twitter, and news. Our goal is to build an n-gram model from the training data and predict the next word given a preceding word sequence.

Download and read the data

We have downloaded the dataset and extracted it locally. There are 4 locales, and each locale contains 3 files: twitter, blogs, and news.

We count the number of lines in each file with a utility function getCountOfLines that wraps the shell command “wc -l”. A sketch of such a helper and the resulting counts are shown below.
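The helper below is a minimal sketch rather than the report's actual code; it uses “wc -lw” so that the word counts shown in the table are covered as well.

getCountOfLines <- function(filename) {
    # "wc -lw" prints: <line count> <word count> <filename>
    output <- system(paste("wc -lw", shQuote(filename)), intern = TRUE)
    parts <- strsplit(trimws(output), "\\s+")[[1]]
    data.frame(filename = filename,
               linecount = as.numeric(parts[1]),
               wordcount = as.numeric(parts[2]))
}

The counts for each file are: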

##                         filename linecount wordcount
## 1    final/en_US/en_US.blogs.txt    899288 210160014
## 2     final/en_US/en_US.news.txt   1010242 205811885
## 3  final/en_US/en_US.twitter.txt   2360148 167105338
## 4    final/de_DE/de_DE.blogs.txt    371440  85459666
## 5     final/de_DE/de_DE.news.txt    244743  95591959
## 6  final/de_DE/de_DE.twitter.txt    947774  75578341
## 7    final/fi_FI/fi_FI.blogs.txt    439785 108503595
## 8     final/fi_FI/fi_FI.news.txt    485758  94234350
## 9  final/fi_FI/fi_FI.twitter.txt    285214  25331142
## 10   final/ru_RU/ru_RU.blogs.txt    337100 116855835
## 11    final/ru_RU/ru_RU.news.txt    196360 118996424
## 12 final/ru_RU/ru_RU.twitter.txt    881414 105182346

Building the word frequency table using the n-gram model

We start by analyzing the text with the text mining package quanteda. This package lets us clean the text data, including removing stop words, punctuation, numbers, and symbols.
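A minimal sketch of this cleaning step with quanteda is shown below; the object names and the use of English stop words are illustrative assumptions.

library(quanteda)

# Tokenize the sampled text and drop punctuation, numbers, symbols,
# and English stop words; lower-case the remaining tokens
train.corpus <- corpus(train.lines)   # train.lines: sampled text lines (see the sampling sketch below)
train.tokens <- tokens(train.corpus,
                       remove_punct = TRUE,
                       remove_numbers = TRUE,
                       remove_symbols = TRUE)
train.tokens <- tokens_tolower(train.tokens)
train.tokens <- tokens_remove(train.tokens, stopwords("en"))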

Since we need to predict the next word with n-grams, we build bi-gram, tri-gram, and quad-gram (4-gram) models. For the bi-gram, the first word is used to predict the next word (the outcome). For the tri-gram, the first two words are used to predict the last word. For the quad-gram, the first three words are used to predict the last word. A sketch of how such a lookup could work is shown below.
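Below is a hypothetical sketch (not the report's actual code) of this lookup: it picks the most frequent n-gram that starts with the given words and returns its final word. It assumes a named n-gram frequency vector whose names are joined with “_”, the tokens_ngrams default separator.

# Hypothetical sketch: pick the most frequent n-gram whose prefix matches
# the supplied words and return its last word as the prediction
predictNextWord <- function(ngramFreq, firstWords) {
    prefix <- paste0(paste(firstWords, collapse = "_"), "_")
    candidates <- ngramFreq[startsWith(names(ngramFreq), prefix)]
    if (length(candidates) == 0) return(NA_character_)
    best <- names(candidates)[which.max(candidates)]
    sub(prefix, "", best, fixed = TRUE)
}

Here ngramFreq stands for a named frequency vector like the ones built in the next step.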

For each file we first tokenize the corpus, build a sparse matrix of the n-gram tokens with their occurrence counts, and then sum those counts to obtain the frequency of each n-gram token.

Because the files are huge, we sample the data to avoid excessive memory use; we can later use k-folds to train on multiple samples or to read the entire corpus. Since our training corpus contains different locales and different text sources, we build a separate model for each source.
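A minimal sketch of the line-level sampling follows; the sample fraction, seed, and file path are illustrative assumptions.

sampleSize <- 0.1
set.seed(1234)                          # make the sample reproducible
all.lines <- readLines("final/en_US/en_US.blogs.txt",
                       encoding = "UTF-8", skipNul = TRUE)
train.lines <- sample(all.lines, size = floor(length(all.lines) * sampleSize))

After sampling and cleaning, the tokens are stemmed for the file's locale and converted into n-grams: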

 # Stem the tokens for the current locale, then form n-grams
 # (getStemLanguage, localeName, and nGramModelUnit are defined elsewhere)
 train.tokens <- tokens_wordstem(train.tokens,
                                 language = getStemLanguage(localeName))
 train.tokens <- tokens_ngrams(train.tokens, n = nGramModelUnit)
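The frequency table can then be built from the n-gram tokens; a minimal sketch, with assumed object names, is:

# Build a sparse document-feature matrix of the n-gram tokens and sum the
# counts across documents to get the total frequency of each n-gram
train.dfm <- dfm(train.tokens)
ngram.freq <- sort(colSums(train.dfm), decreasing = TRUE)
head(ngram.freq, 20)                    # top 20 most frequent n-grams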

Top 20 n-grams

We obtain a model for each data source. Here are the top 20 most frequent n-grams obtained from each file, together with their frequencies.

Below are the top 20 most frequent bi-grams for a sample size of 0.1.

include_graphics("./en_US_2gram_0.1.jpg")

Below are the top 20 most frequent tri-grams for a sample size of 0.1.

include_graphics("./en_US_3gram_0.1.jpg")

Below are the top 20 most frequent 4-grams for a sample size of 0.1.

include_graphics("./en_US_4gram_0.1.jpg")

Statistics of the dataset

To check the performance and statistics of the n-gram models we built, we list the following stats in the table below. The table shows that the larger the sample size, the longer the time spent building the model.

library(dplyr)

# Load the recorded build statistics and tidy them for display
stat_list <- readRDS("stats_model_en_US.rds")
stat_list <- stat_list %>%
    mutate(filename = basename(as.character(filename))) %>%
    mutate(timespent = round(as.numeric(timespent)))
print(stat_list)
##             filename sample_size ngramUnits timespent  tokens
## 1    en_US.blogs.txt        0.01          4        57  300936
## 2     en_US.news.txt        0.01          4        70  304513
## 3  en_US.twitter.txt        0.01          4        53  206380
## 4    en_US.blogs.txt        0.05          4       405 1604484
## 5     en_US.news.txt        0.05          4       441 1464360
## 6  en_US.twitter.txt        0.05          4       448 1009221
## 7    en_US.blogs.txt         0.1          4       884 3168905
## 8     en_US.news.txt         0.1          4      1041 2895384
## 9  en_US.twitter.txt         0.1          4      1123 1963647
## 10   en_US.blogs.txt        0.01          2       338  157491
## 11    en_US.news.txt        0.01          2       340  174426
## 12 en_US.twitter.txt        0.01          2       324  129822
## 13   en_US.blogs.txt         0.1          2       593 1001615
## 14    en_US.news.txt         0.1          2       734 1017311
## 15 en_US.twitter.txt         0.1          2       855  787276
## 16   en_US.blogs.txt        0.01          3        65  274008
## 17    en_US.news.txt        0.01          3        81  285380
## 18 en_US.twitter.txt        0.01          3        96  203157
## 19   en_US.blogs.txt        0.05          3       410 1315494
## 20    en_US.news.txt        0.05          3       467 1254255
## 21 en_US.twitter.txt        0.05          3       491  905402
## 22   en_US.blogs.txt         0.1          3      1073 2465992
## 23    en_US.news.txt         0.1          3      1385 2358702
## 24 en_US.twitter.txt         0.1          3      1533 1681314

We can also see that the percentage of n-grams that occur only once is quite high for the 3-gram and 4-gram models (almost 90% of the 4-grams occur only once).
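This can be read directly off the frequency table; a minimal sketch, assuming the ngram.freq vector from the earlier sketch:

# Percentage of n-grams that occur exactly once
round(mean(ngram.freq == 1) * 100, 1)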

Goal of building a Shiny app for the prediction model

In order to build a Shiny app that non-data scientists can also use, we have the following requirements.

Appendix

The source code for this Rmd file can be found on GitHub (https://github.com/jjtt8080/SwiftKey_Data_Capstone).