Executive Summary

This document presents an exploratory analysis of the SwiftKey data and outlines the goals for an application that predicts the next word based on the previously typed words.

Background

The idea behind this work is that when we write text there is some correlation in the order in which words appear: the probability of a word occurring at a certain position in a sentence is partly determined by the preceding words. This property is nowadays commonly exploited on smartphones to assist typing by predicting the next word. To perform this prediction, a training dataset is normally used to build a model that stores the probabilities of different patterns of words.
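As a minimal illustration of this idea, the sketch below (in Python, not the model that will actually be built for the application) counts how often each word follows a given word in a training corpus and predicts the most frequent continuation:

    from collections import Counter, defaultdict

    def train_bigram_model(sentences):
        """Count, for each word, how often every other word follows it."""
        following = defaultdict(Counter)
        for sentence in sentences:
            words = sentence.lower().split()
            for current_word, next_word in zip(words, words[1:]):
                following[current_word][next_word] += 1
        return following

    def predict_next(model, word):
        """Return the word most frequently seen after `word`, if any."""
        candidates = model.get(word.lower())
        return candidates.most_common(1)[0][0] if candidates else None

    # Toy training data; a real model would be trained on the SwiftKey corpus.
    model = train_bigram_model(["I want to go home", "I want to sleep", "we want to go out"])
    print(predict_next(model, "to"))   # -> 'go'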

In this document we analyze a dataset of text content that we will later use to build a predictive model and an app to predict the next word based on the previously typed words.

Data

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

We will analyze the data for the English language.
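As an illustration, the archive can be downloaded and the three English files read line by line as follows (a sketch; the paths inside the archive are assumed to be final/en_US/...):

    import urllib.request
    import zipfile

    URL = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
    FILES = ["final/en_US/en_US.blogs.txt",      # assumed layout of the archive
             "final/en_US/en_US.news.txt",
             "final/en_US/en_US.twitter.txt"]

    urllib.request.urlretrieve(URL, "Coursera-SwiftKey.zip")
    corpora = {}
    with zipfile.ZipFile("Coursera-SwiftKey.zip") as archive:
        for name in FILES:
            with archive.open(name) as f:
                corpora[name] = f.read().decode("utf-8", errors="ignore").splitlines()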

Data cleaning performed

The raw data have been read and a series of cleaning tasks has been performed on them.

These transformations make it easier to study the correlations between consecutive words.
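The tokens shown in the tables below are lowercased, stripped of punctuation, and stemmed; an illustrative cleaning pipeline along those lines (a sketch, not necessarily the exact steps performed here) could be:

    import re
    from nltk.stem import PorterStemmer   # one possible choice of stemmer

    stemmer = PorterStemmer()

    def clean_line(line):
        """Lowercase a line, strip punctuation and digits, and stem each token."""
        line = line.lower()
        line = re.sub(r"[^a-z\s]", " ", line)   # keep letters and whitespace only
        tokens = line.split()
        # Stop-word removal or profanity filtering could also be applied here.
        return [stemmer.stem(token) for token in tokens]

    print(clean_line("Happy New Year!"))   # -> ['happi', 'new', 'year']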

Exploratory data analysis

The following exploratory data analyses have been performed:

* Number of words and lines in the analyzed data
* Word frequencies
* Bigram frequencies
* Trigram frequencies
* Quadrigram frequencies
* 50% coverage
* 90% coverage

In these analyses only 0.5% of the whole corpus has been used, in order to keep the computation times within seconds.
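A simple way to obtain such a sample (a sketch; the actual sampling may have been done differently) is to keep each line with probability 0.005:

    import random

    def sample_lines(lines, fraction=0.005, seed=42):
        """Keep roughly `fraction` of the lines, chosen uniformly at random."""
        rng = random.Random(seed)
        return [line for line in lines if rng.random() < fraction]

    with open("en_US.twitter.txt", encoding="utf-8", errors="ignore") as f:
        twitter_sample = sample_lines(f.readlines())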

Number of words and lines in the corpus and the analyzed data

The following table reports the number of words in the whole corpus, separated by document.

Document            Words
en_US.blogs.txt     37334114
en_US.news.txt      34365936
en_US.twitter.txt   30359852

The following table reports the number of lines in the whole corpus, separated by document.

Document            Lines
en_US.blogs.txt      899288
en_US.news.txt      1010242
en_US.twitter.txt   2360148

The following table reports the number of words in the analyzed data (the 0.5% sample), separated by document.

Document            Words
en_US.blogs.txt     93062
en_US.news.txt       7067
en_US.twitter.txt   77112

It is clear that the “twitter” data contain by far the most lines, although in terms of total words the blogs and news data are slightly larger.
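Counts like the ones above can be obtained with a single pass over each file, counting lines and whitespace-separated words (a minimal sketch):

    def count_words_and_lines(path):
        """Return (words, lines) for a text file, counting whitespace-separated tokens."""
        words = lines = 0
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                lines += 1
                words += len(line.split())
        return words, lines

    for path in ["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"]:
        print(path, *count_words_and_lines(path))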

Word frequencies

Here we present the most frequent words in the analyzed data. The following table reports the most frequent words, ordered by their overall frequency.

These are the most frequent words in the whole analyzed data, together with their frequency (number of occurrences of the word divided by the total number of words).

ngram   en_US.blogs.txt   en_US.news.txt   en_US.twitter.txt   totals
get     0.005200834       0.004245083      0.009661272         0.007103323
just    0.005211579       0.003254563      0.009414877         0.006962272
like    0.005963766       0.002971558      0.008079158         0.006764800
will    0.005716619       0.006933635      0.006107999         0.005935421
time    0.006081967       0.003679072      0.005705986         0.005822581
can     0.006135694       0.003113061      0.005459591         0.005721024

We can see that the word distribution differs somewhat across the three documents.
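Word frequencies as defined above can be computed by counting occurrences and dividing by the total number of words; a minimal sketch, assuming the lines have already been cleaned and tokenized as described earlier:

    from collections import Counter

    def word_frequencies(tokenized_lines):
        """Map each word to its relative frequency (occurrences / total number of words)."""
        counts = Counter(word for line in tokenized_lines for word in line)
        total = sum(counts.values())
        return {word: count / total for word, count in counts.items()}

    freqs = word_frequencies([["get", "time"], ["just", "like", "get"]])   # toy input
    top_words = sorted(freqs.items(), key=lambda item: item[1], reverse=True)[:10]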

Bigram frequencies

Here we present the most frequent pairs of consecutive words (bigrams) in the analyzed data. The following table reports the most frequent bigrams, together with their frequencies in each document and overall.

ngram       en_US.blogs.txt   en_US.news.txt   en_US.twitter.txt   totals
don t       0.00132981465     0.0000000000     0.00006063031       0.0007177136
right now   0.00018555553     0.0002765869     0.00092158075       0.0005141829
cant wait   0.00002061728     0.0000000000     0.00101858925       0.0004606222
didn t      0.00075253077     0.0000000000     0.00001212606       0.0003963493
dont know   0.00019586417     0.0000000000     0.00064268131       0.0003856372
feel like   0.00032987650     0.0002765869     0.00044866431       0.0003802811
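Bigram frequencies, as well as the trigram and quadrigram frequencies in the next sections, can be computed with the same counting approach applied to sliding windows of n consecutive tokens; a minimal sketch:

    from collections import Counter

    def ngram_frequencies(tokenized_lines, n=2):
        """Relative frequencies of n-grams (n consecutive tokens within a line)."""
        counts = Counter()
        for tokens in tokenized_lines:
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
        total = sum(counts.values())
        return {gram: count / total for gram, count in counts.items()}

    lines = [["cant", "wait", "see", "you"], ["cant", "wait"]]          # toy input
    bigrams = ngram_frequencies(lines, n=2)    # e.g. {'cant wait': 0.5, 'wait see': 0.25, ...}
    trigrams = ngram_frequencies(lines, n=3)   # n=3 and n=4 give tri- and quadrigrams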

Trigram frequencies

Here we present the most frequent triplets of consecutive words (trigrams) in the analyzed data. The following table reports the most frequent trigrams, ordered by their overall frequency.

These are the most frequent trigrams in the whole data analyzed.

ngram              en_US.blogs.txt   en_US.news.txt   en_US.twitter.txt   totals
cant wait see      0.00000000000     0                0.0001697669        0.00007498621
don t know         0.00014432246     0                0.0000000000        0.00007498621
happi new year     0.00001030875     0                0.0001576407        0.00007498621
happi mother day   0.00001030875     0                0.0001455145        0.00006963005
let us know        0.00001030875     0                0.0001333883        0.00006427389
don t want         0.00011339622     0                0.0000000000        0.00005891773

Quadrigram frequencies

Here we present the most frequent groups of four consecutive words (quadrigrams) in the analyzed data. The following table reports the most frequent quadrigrams, ordered by their overall frequency.

These are the most frequent quadrigrams in the whole data analyzed.

ngram              en_US.blogs.txt   en_US.news.txt   en_US.twitter.txt   totals
cant wait see      0.00000000000     0                0.0001697669        0.00007498621
don t know         0.00014432246     0                0.0000000000        0.00007498621
happi new year     0.00001030875     0                0.0001576407        0.00007498621
happi mother day   0.00001030875     0                0.0001455145        0.00006963005
let us know        0.00001030875     0                0.0001333883        0.00006427389
don t want         0.00011339622     0                0.0000000000        0.00005891773

Word Coverage

We analyzed the smallest number of words needed in a dictionary to cover a given percentage of the words in the dataset.
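Coverage can be estimated by sorting the unique words by decreasing frequency and accumulating their frequencies until the target fraction is reached. The sketch below (assuming word frequencies computed as in the earlier word-frequency sketch) returns the fraction of unique words needed and the frequency of the last word added, corresponding to the “wordPerc” and “wordFreqThreshold” rows in the tables that follow:

    def coverage(freqs, target=0.5):
        """Fraction of unique words (and frequency threshold) needed to cover `target` of all words."""
        ordered = sorted(freqs.values(), reverse=True)
        covered = 0.0
        for i, freq in enumerate(ordered, start=1):
            covered += freq
            if covered >= target:
                return {"wordPerc": i / len(ordered), "wordFreqThreshold": freq}
        return {"wordPerc": 1.0, "wordFreqThreshold": ordered[-1]}

    # coverage(freqs, 0.5) and coverage(freqs, 0.9) give the 50% and 90% rows below.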

50% coverage

This is the percentage of unique words needed to cover 50% of the words in the analyzed data. The row labeled “wordPerc” reports the percentage of unique words needed to cover 50% of the text data; the row labeled “wordFreqThreshold” reports the word-frequency threshold needed to cover 50% of the text data.

                    en_US.twitter.txt   totals         en_US.blogs.txt   en_US.news.txt
wordPerc            0.0181724990        0.0240902101   0.0278179022      0.0417501514
wordFreqThreshold   0.0002852993        0.0003554482   0.0004942941      0.0016980331

90% coverage

This is the percentage of unique words needed to cover 90% of the words in the analyzed data. The row labeled “wordPerc” reports the percentage of unique words needed to cover 90% of the text data; the row labeled “wordFreqThreshold” reports the word-frequency threshold needed to cover 90% of the text data.

                    en_US.blogs.txt   totals          en_US.twitter.txt   en_US.news.txt
wordPerc            0.29085317553     0.30459904012   0.32356367364       0.38702763152
wordFreqThreshold   0.00001074552     0.00001692611   0.00002593630       0.00014150276