The goal of this document is to present the exploratory analysis performed on the SwiftKey data and the goals pursued for an application that predicts the next word based on the previously typed words.
The idea behind this work is that when we write text there is some correlation in the order in which we write words: the probability of a word appearing at a certain position in a sentence is partly determined by the preceding words. This feature is nowadays commonly exploited on smartphones to assist typing by predicting the next word that will be typed. To perform this prediction, a training dataset is normally used to train a model, which stores the probabilities of different patterns of words.
In this document we analyze a dataset of text content that we will later use to build a predictive model and an app to predict the next word based on the previously typed words.
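To make the idea concrete, the following minimal sketch, written in R (R is assumed here and for all code examples in this document; `predict_next_word` and the toy bigram table are purely illustrative, not the application's actual code), shows how stored bigram counts can be used to suggest the next word.

```r
# Minimal next-word prediction from bigram counts (illustrative sketch).
predict_next_word <- function(text, bigram_counts) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  last  <- tail(words, 1)
  # Keep only the bigrams that start with the last typed word.
  candidates <- bigram_counts[bigram_counts$first == last, ]
  if (nrow(candidates) == 0) return(NA_character_)
  # Suggest the most frequent continuation.
  candidates$second[which.max(candidates$count)]
}

# Toy bigram table; in the real application these counts would come from
# the SwiftKey training corpora.
bigram_counts <- data.frame(
  first  = c("right", "right", "new"),
  second = c("now",   "here",  "year"),
  count  = c(120,      35,      80),
  stringsAsFactors = FALSE
)

predict_next_word("I will do it right", bigram_counts)  # "now"
```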
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
We will analyze the data for the English language.
The raw data have been read and a number of cleaning tasks have been performed on them: as the n-gram tables below show, the text has been lowercased and stripped of punctuation, and the words have been stemmed. These transformations make it easier to study the correlation between subsequent words.
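As a rough sketch of this kind of cleaning (base R only; the exact pipeline used for the report is not reproduced here and may also include stop-word removal and stemming):

```r
# A minimal cleaning sketch; the real pipeline may differ in its details.
clean_text <- function(lines) {
  lines <- tolower(lines)                   # lowercase everything
  lines <- gsub("[[:punct:]]", " ", lines)  # drop punctuation ("don't" -> "don t")
  lines <- gsub("[[:digit:]]", " ", lines)  # drop numbers
  lines <- gsub("\\s+", " ", lines)         # collapse repeated whitespace
  trimws(lines)
}

clean_text("Can't wait to see you at 8pm!!")
# [1] "can t wait to see you at pm"
```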
The following exploratory data analyses have been performed:

* Number of words and lines in the analyzed data
* Word frequencies
* Bigram frequencies
* Trigram frequencies
* Quadrigram frequencies
* 50% coverage
* 90% coverage
In these analyses only 0.5% of the whole corpora has been used, to keep the computation time within seconds.
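A minimal sketch of how such a sample can be drawn, assuming the files are read with `readLines` and sampled line by line (the 0.5% rate comes from the text; `sample_corpus` is an illustrative name):

```r
set.seed(1234)  # make the 0.5% sample reproducible

sample_corpus <- function(path, rate = 0.005) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[sample(length(lines), size = ceiling(rate * length(lines)))]
}

blogs_sample   <- sample_corpus("en_US.blogs.txt")
news_sample    <- sample_corpus("en_US.news.txt")
twitter_sample <- sample_corpus("en_US.twitter.txt")
```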
The following table reports the number of words in the whole corpora, separated by document.
| File | Words |
|---|---|
| en_US.blogs.txt | 37334114 |
| en_US.news.txt | 34365936 |
| en_US.twitter.txt | 30359852 |
The following table reports the number of lines in the whole corpora, separated by document.
| File | Lines |
|---|---|
| en_US.blogs.txt | 899288 |
| en_US.news.txt | 1010242 |
| en_US.twitter.txt | 2360148 |
| File | |
|---|---|
| en_US.blogs.txt | 93062 |
| en_US.news.txt | 7067 |
| en_US.twitter.txt | 77112 |
From these tables it is clear that the “twitter” data contain the largest number of lines, while the “blogs” data contain the largest number of words.
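The counts above can be obtained with a sketch like the following (base R; counting words by splitting each line on whitespace is an assumption about the method actually used):

```r
count_lines_and_words <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- sum(lengths(strsplit(lines, "\\s+")))  # tokens per line, summed
  c(lines = length(lines), words = words)
}

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
sapply(files, count_lines_and_words)  # one column of counts per file
```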
Here we present the most frequent words in the data analyzed. The following table reports the 10 most frequent words, ordered by their total frequency across the corpora.

These are the most frequent words in the whole data analyzed, with their frequency (= number of occurrences of that word / total number of words).
| en_US.blogs.txt | en_US.news.txt | en_US.twitter.txt | totals | ngram |
|---|---|---|---|---|
| 0.005200834 | 0.004245083 | 0.009661272 | 0.007103323 | get |
| 0.005211579 | 0.003254563 | 0.009414877 | 0.006962272 | just |
| 0.005963766 | 0.002971558 | 0.008079158 | 0.006764800 | like |
| 0.005716619 | 0.006933635 | 0.006107999 | 0.005935421 | will |
| 0.006081967 | 0.003679072 | 0.005705986 | 0.005822581 | time |
| 0.006135694 | 0.003113061 | 0.005459591 | 0.005721024 | can |
We can see that the word distribution differs somewhat from one document to another.
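A minimal sketch of how such relative frequencies can be computed from a cleaned sample (reusing the hypothetical `clean_text` and `twitter_sample` objects sketched above):

```r
# Relative word frequency: occurrences of each word / total number of words.
word_frequencies <- function(lines) {
  tokens <- unlist(strsplit(clean_text(lines), "\\s+"))
  tokens <- tokens[tokens != ""]
  freq   <- sort(table(tokens), decreasing = TRUE)
  freq / sum(freq)
}

head(word_frequencies(twitter_sample), 10)  # ten most frequent words
```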
Here we present the most frequent pairs of consecutive words (bigrams) in the data analyzed. The following table reports the 10 most frequent bigrams, ordered by their total frequency across the corpora.

These are the most frequent bigrams in the whole data analyzed, with their frequencies.
| en_US.blogs.txt | en_US.news.txt | en_US.twitter.txt | totals | ngram |
|---|---|---|---|---|
| 0.00132981465 | 0.0000000000 | 0.00006063031 | 0.0007177136 | don t |
| 0.00018555553 | 0.0002765869 | 0.00092158075 | 0.0005141829 | right now |
| 0.00002061728 | 0.0000000000 | 0.00101858925 | 0.0004606222 | cant wait |
| 0.00075253077 | 0.0000000000 | 0.00001212606 | 0.0003963493 | didn t |
| 0.00019586417 | 0.0000000000 | 0.00064268131 | 0.0003856372 | dont know |
| 0.00032987650 | 0.0002765869 | 0.00044866431 | 0.0003802811 | feel like |
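These bigram frequencies (and, with `n = 3` or `n = 4`, the trigram and quadrigram frequencies reported below) can be computed with a small helper like the following base-R sketch, which again reuses the hypothetical `clean_text` and sampled corpora:

```r
# Relative n-gram frequencies, computed within each line of the sample.
ngram_frequencies <- function(lines, n = 2) {
  grams <- unlist(lapply(strsplit(clean_text(lines), "\\s+"), function(w) {
    w <- w[w != ""]
    if (length(w) < n) return(character(0))
    # Slide a window of n words over the line.
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  freq <- sort(table(grams), decreasing = TRUE)
  freq / sum(freq)
}

head(ngram_frequencies(twitter_sample, n = 2), 10)  # most frequent bigrams
head(ngram_frequencies(twitter_sample, n = 3), 10)  # most frequent trigrams
```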
Here we present the most frequent triplets of consecutive words (trigrams) in the data analyzed. The following table reports the 10 most frequent trigrams, ordered by their total frequency across the corpora.

These are the most frequent trigrams in the whole data analyzed, with their frequencies.
| en_US.blogs.txt | en_US.news.txt | en_US.twitter.txt | totals | ngram |
|---|---|---|---|---|
| 0.00000000000 | 0 | 0.0001697669 | 0.00007498621 | cant wait see |
| 0.00014432246 | 0 | 0.0000000000 | 0.00007498621 | don t know |
| 0.00001030875 | 0 | 0.0001576407 | 0.00007498621 | happi new year |
| 0.00001030875 | 0 | 0.0001455145 | 0.00006963005 | happi mother day |
| 0.00001030875 | 0 | 0.0001333883 | 0.00006427389 | let us know |
| 0.00011339622 | 0 | 0.0000000000 | 0.00005891773 | don t want |
Here we present the most frequent quartets of consecutive words (quadrigrams) in the data analyzed. The following table reports the 10 most frequent quadrigrams, ordered by their total frequency across the corpora.

These are the most frequent quadrigrams in the whole data analyzed, with their frequencies.
| en_US.blogs.txt | en_US.news.txt | en_US.twitter.txt | totals | ngram |
|---|---|---|---|---|
| 0.00000000000 | 0 | 0.0001697669 | 0.00007498621 | cant wait see |
| 0.00014432246 | 0 | 0.0000000000 | 0.00007498621 | don t know |
| 0.00001030875 | 0 | 0.0001576407 | 0.00007498621 | happi new year |
| 0.00001030875 | 0 | 0.0001455145 | 0.00006963005 | happi mother day |
| 0.00001030875 | 0 | 0.0001333883 | 0.00006427389 | let us know |
| 0.00011339622 | 0 | 0.0000000000 | 0.00005891773 | don t want |
We analyzed the smallest number of words needed in a dictionary to cover a given percentage of the words in the dataset.
This is the percentage of unique words needed to cover 50% of the words in the analyzed data. The row labeled “wordPerc” reports the percentage of unique words needed to cover 50% of the text data; the row labeled “wordFreqThreshold” reports the word-frequency threshold needed to cover 50% of the text data.

| | en_US.twitter.txt | totals | en_US.blogs.txt | en_US.news.txt |
|---|---|---|---|---|
| wordPerc | 0.0181724990 | 0.0240902101 | 0.0278179022 | 0.0417501514 |
| wordFreqThreshold | 0.0002852993 | 0.0003554482 | 0.0004942941 | 0.0016980331 |
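The 50% figures above (and the 90% figures below) can be computed with a sketch like the following; the names `wordPerc` and `wordFreqThreshold` mirror the row labels of the tables, and `word_frequencies` is the hypothetical helper sketched earlier:

```r
# Fraction of unique words, and the word-frequency threshold, needed to
# cover a target share of all word occurrences.
coverage <- function(freq, target = 0.5) {
  freq <- sort(as.numeric(freq), decreasing = TRUE)  # most frequent words first
  n_needed <- which(cumsum(freq) >= target)[1]       # words needed for coverage
  c(wordPerc          = n_needed / length(freq),
    wordFreqThreshold = freq[n_needed])
}

coverage(word_frequencies(twitter_sample), target = 0.5)
coverage(word_frequencies(twitter_sample), target = 0.9)
```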
This is the percentage of unique words needed to cover 90% of the words in the analyzed data. The row labeled “wordPerc” reports the percentage of unique words needed to cover 90% of the text data; the row labeled “wordFreqThreshold” reports the word-frequency threshold needed to cover 90% of the text data.

| | en_US.blogs.txt | totals | en_US.twitter.txt | en_US.news.txt |
|---|---|---|---|---|
| wordPerc | 0.29085317553 | 0.30459904012 | 0.32356367364 | 0.38702763152 |
| wordFreqThreshold | 0.00001074552 | 0.00001692611 | 0.00002593630 | 0.00014150276 |