Executive Summary

This document presents an exploratory analysis of the SwiftKey data and outlines the goals for an application that predicts the next word based on the previously typed words.

Background

The idea behind this work is that when we write text there is some correlation in the order in which words appear: the probability of a word occurring at a certain position in a sentence is partly determined by the preceding words. This property is nowadays commonly exploited on smartphones to assist typing by predicting the next word. To perform this prediction, a training dataset is normally used to build a model that stores the probabilities of different patterns of words.
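As a minimal illustration of this idea, the sketch below (in Python, not the model that will actually be built for the application) counts how often each word follows a given word in a training corpus and predicts the most frequent continuation:

    from collections import Counter, defaultdict

    def train_bigram_model(sentences):
        """Count, for each word, how often every other word follows it."""
        following = defaultdict(Counter)
        for sentence in sentences:
            words = sentence.lower().split()
            for current_word, next_word in zip(words, words[1:]):
                following[current_word][next_word] += 1
        return following

    def predict_next(model, word):
        """Return the word most frequently seen after `word`, if any."""
        candidates = model.get(word.lower())
        return candidates.most_common(1)[0][0] if candidates else None

    # Toy training data; a real model would be trained on the SwiftKey corpus.
    model = train_bigram_model(["I want to go home", "I want to sleep", "we want to go out"])
    print(predict_next(model, "to"))   # -> 'go'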

In this document we analyze a dataset of text content that we will later use to build a predictive model and an app to predict the next word based on the previously typed words.

Data

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

We will analyze the data for the English language.
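As an illustration, the archive can be downloaded and the three English files read line by line as follows (a sketch; the paths inside the archive are assumed to be final/en_US/...):

    import urllib.request
    import zipfile

    URL = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
    FILES = ["final/en_US/en_US.blogs.txt",      # assumed layout of the archive
             "final/en_US/en_US.news.txt",
             "final/en_US/en_US.twitter.txt"]

    urllib.request.urlretrieve(URL, "Coursera-SwiftKey.zip")
    corpora = {}
    with zipfile.ZipFile("Coursera-SwiftKey.zip") as archive:
        for name in FILES:
            with archive.open(name) as f:
                corpora[name] = f.read().decode("utf-8", errors="ignore").splitlines()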

Data cleaning performed

The raw data have been read and a series of cleaning tasks has been performed on them.

These transformations make it easier to study the correlations between consecutive words.
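The tokens shown in the tables below are lowercased, stripped of punctuation, and stemmed; an illustrative cleaning pipeline along those lines (a sketch, not necessarily the exact steps performed here) could be:

    import re
    from nltk.stem import PorterStemmer   # one possible choice of stemmer

    stemmer = PorterStemmer()

    def clean_line(line):
        """Lowercase a line, strip punctuation and digits, and stem each token."""
        line = line.lower()
        line = re.sub(r"[^a-z\s]", " ", line)   # keep letters and whitespace only
        tokens = line.split()
        # Stop-word removal or profanity filtering could also be applied here.
        return [stemmer.stem(token) for token in tokens]

    print(clean_line("Happy New Year!"))   # -> ['happi', 'new', 'year']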

Exploratory data analysis

The following exploratory data analyses have been performed:

* Number of words and lines in the analyzed data
* Word frequencies
* Bigram frequencies
* Trigram frequencies
* Quadrigram frequencies
* 50% coverage
* 90% coverage

In these analyses only 0.5% of the whole corpus has been used, in order to keep the computation times within seconds.
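A simple way to obtain such a sample (a sketch; the actual sampling may have been done differently) is to keep each line with probability 0.005:

    import random

    def sample_lines(lines, fraction=0.005, seed=42):
        """Keep roughly `fraction` of the lines, chosen uniformly at random."""
        rng = random.Random(seed)
        return [line for line in lines if rng.random() < fraction]

    with open("en_US.twitter.txt", encoding="utf-8", errors="ignore") as f:
        twitter_sample = sample_lines(f.readlines())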

Number of words and lines in the corpus and the analyzed data

The following table reports the number of words in the whole corpus, separated by document.

Document            Words
en_US.blogs.txt     37334114
en_US.news.txt      34365936
en_US.twitter.txt   30359852

The following table reports the number of lines in the whole corpus, separated by document.

Document            Lines
en_US.blogs.txt      899288
en_US.news.txt      1010242
en_US.twitter.txt   2360148

The following table reports the number of words in the analyzed data (the 0.5% sample), separated by document.

Document            Words
en_US.blogs.txt     93062
en_US.news.txt       7067
en_US.twitter.txt   77112

It is clear that the “twitter” data contain by far the most lines, although in terms of total words the blogs and news data are slightly larger.
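Counts like the ones above can be obtained with a single pass over each file, counting lines and whitespace-separated words (a minimal sketch):

    def count_words_and_lines(path):
        """Return (words, lines) for a text file, counting whitespace-separated tokens."""
        words = lines = 0
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                lines += 1
                words += len(line.split())
        return words, lines

    for path in ["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"]:
        print(path, *count_words_and_lines(path))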

Word frequencies

Here we present the most frequent words in the analyzed data. The following table reports the most frequent words, ordered by their overall frequency.

These are the most frequent words in the whole analyzed data, together with their frequency (number of occurrences of the word divided by the total number of words).

ngram   en_US.blogs.txt   en_US.news.txt   en_US.twitter.txt   totals
get     0.005200834       0.004245083      0.009661272         0.007103323
just    0.005211579       0.003254563      0.009414877         0.006962272
like    0.005963766       0.002971558      0.008079158         0.006764800
will    0.005716619       0.006933635      0.006107999         0.005935421
time    0.006081967       0.003679072      0.005705986         0.005822581
can     0.006135694       0.003113061      0.005459591         0.005721024

We can see that the word distribution differs somewhat across the three documents.
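Word frequencies as defined above can be computed by counting occurrences and dividing by the total number of words; a minimal sketch, assuming the lines have already been cleaned and tokenized as described earlier:

    from collections import Counter

    def word_frequencies(tokenized_lines):
        """Map each word to its relative frequency (occurrences / total number of words)."""
        counts = Counter(word for line in tokenized_lines for word in line)
        total = sum(counts.values())
        return {word: count / total for word, count in counts.items()}

    freqs = word_frequencies([["get", "time"], ["just", "like", "get"]])   # toy input
    top_words = sorted(freqs.items(), key=lambda item: item[1], reverse=True)[:10]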

Bigram frequencies

Here we present the most frequent pairs of consecutive words (bigrams) in the analyzed data. The following table reports the most frequent bigrams, together with their frequencies in each document and overall.

ngram       en_US.blogs.txt   en_US.news.txt   en_US.twitter.txt   totals
don t       0.00132981465     0.0000000000     0.00006063031       0.0007177136
right now   0.00018555553     0.0002765869     0.00092158075       0.0005141829
cant wait   0.00002061728     0.0000000000     0.00101858925       0.0004606222
didn t      0.00075253077     0.0000000000     0.00001212606       0.0003963493
dont know   0.00019586417     0.0000000000     0.00064268131       0.0003856372
feel like   0.00032987650     0.0002765869     0.00044866431       0.0003802811
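Bigram frequencies, as well as the trigram and quadrigram frequencies in the next sections, can be computed with the same counting approach applied to sliding windows of n consecutive tokens; a minimal sketch:

    from collections import Counter

    def ngram_frequencies(tokenized_lines, n=2):
        """Relative frequencies of n-grams (n consecutive tokens within a line)."""
        counts = Counter()
        for tokens in tokenized_lines:
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
        total = sum(counts.values())
        return {gram: count / total for gram, count in counts.items()}

    lines = [["cant", "wait", "see", "you"], ["cant", "wait"]]          # toy input
    bigrams = ngram_frequencies(lines, n=2)    # e.g. {'cant wait': 0.5, 'wait see': 0.25, ...}
    trigrams = ngram_frequencies(lines, n=3)   # n=3 and n=4 give tri- and quadrigrams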

Trigram frequencies

Here we present the most frequent triplets of consecutive words (trigrams) in the analyzed data. The following table reports the most frequent trigrams, ordered by their overall frequency.

These are the most frequent trigrams in the whole data analyzed.

ngram              en_US.blogs.txt   en_US.news.txt   en_US.twitter.txt   totals
cant wait see      0.00000000000     0                0.0001697669        0.00007498621
don t know         0.00014432246     0                0.0000000000        0.00007498621
happi new year     0.00001030875     0                0.0001576407        0.00007498621
happi mother day   0.00001030875     0                0.0001455145        0.00006963005
let us know        0.00001030875     0                0.0001333883        0.00006427389
don t want         0.00011339622     0                0.0000000000        0.00005891773

Quadrigram frequencies

Here we present the most frequent groups of four consecutive words (quadrigrams) in the analyzed data. The following table reports the most frequent quadrigrams, ordered by their overall frequency.

These are the most frequent quadrigrams in the whole data analyzed.

ngram              en_US.blogs.txt   en_US.news.txt   en_US.twitter.txt   totals
cant wait see      0.00000000000     0                0.0001697669        0.00007498621
don t know         0.00014432246     0                0.0000000000        0.00007498621
happi new year     0.00001030875     0                0.0001576407        0.00007498621
happi mother day   0.00001030875     0                0.0001455145        0.00006963005
let us know        0.00001030875     0                0.0001333883        0.00006427389
don t want         0.00011339622     0                0.0000000000        0.00005891773

Word Coverage

We analyzed the smallest number of words needed in a dictionary to cover a given percentage of the words in the dataset.
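Coverage can be estimated by sorting the unique words by decreasing frequency and accumulating their frequencies until the target fraction is reached. The sketch below (assuming word frequencies computed as in the earlier word-frequency sketch) returns the fraction of unique words needed and the frequency of the last word added, corresponding to the “wordPerc” and “wordFreqThreshold” rows in the tables that follow:

    def coverage(freqs, target=0.5):
        """Fraction of unique words (and frequency threshold) needed to cover `target` of all words."""
        ordered = sorted(freqs.values(), reverse=True)
        covered = 0.0
        for i, freq in enumerate(ordered, start=1):
            covered += freq
            if covered >= target:
                return {"wordPerc": i / len(ordered), "wordFreqThreshold": freq}
        return {"wordPerc": 1.0, "wordFreqThreshold": ordered[-1]}

    # coverage(freqs, 0.5) and coverage(freqs, 0.9) give the 50% and 90% rows below.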

50% coverage

This is the percentage of unique words needed to cover 50% of the words in the analyzed data. The row labeled “wordPerc” reports the percentage of unique words needed to cover 50% of the text data; the row labeled “wordFreqThreshold” reports the word-frequency threshold needed to cover 50% of the text data.

                    en_US.twitter.txt   totals         en_US.blogs.txt   en_US.news.txt
wordPerc            0.0181724990        0.0240902101   0.0278179022      0.0417501514
wordFreqThreshold   0.0002852993        0.0003554482   0.0004942941      0.0016980331

90% coverage

This is the percentage of unique words needed to cover 90% of the words in the analyzed data. The row labeled “wordPerc” reports the percentage of unique words needed to cover 90% of the text data; the row labeled “wordFreqThreshold” reports the word-frequency threshold needed to cover 90% of the text data.

                    en_US.blogs.txt   totals          en_US.twitter.txt   en_US.news.txt
wordPerc            0.29085317553     0.30459904012   0.32356367364       0.38702763152
wordFreqThreshold   0.00001074552     0.00001692611   0.00002593630       0.00014150276