Summarize the data

Find the length of each file

We need to see how much data is in each file, both to identify the largest source and to understand the biases that an imbalance between sources could introduce. The table below lists the number of lines (NOL) in each file along with its path.
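
A minimal sketch of how these counts could be gathered, assuming the files live under ./data/Coursera-SwiftKey/final as the paths in the table suggest (files and file_summary are illustrative names, not necessarily the ones used to build the table):

    files <- list.files("./data/Coursera-SwiftKey/final",
                        pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)

    count_lines <- function(path) {
      con <- file(path, open = "rb")
      on.exit(close(con))
      n <- 0L
      # read in 50,000-line chunks so the larger files never sit fully in memory
      while (length(chunk <- readLines(con, n = 50000L, skipNul = TRUE)) > 0) {
        n <- n + length(chunk)
      }
      n
    }

    file_summary <- data.frame(
      FileName = sub("^.*final", "", files),   # path relative to the final/ folder
      NOL      = vapply(files, count_lines, integer(1)),
      Filepath = files,
      row.names = NULL
    )
    file_summary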

                   FileName     NOL                                               Filepath
1    /de_DE/de_DE.blogs.txt  181958   ./data/Coursera-SwiftKey/final/de_DE/de_DE.blogs.txt
2     /de_DE/de_DE.news.txt  244743    ./data/Coursera-SwiftKey/final/de_DE/de_DE.news.txt
3  /de_DE/de_DE.twitter.txt  947774 ./data/Coursera-SwiftKey/final/de_DE/de_DE.twitter.txt
4    /en_US/en_US.blogs.txt  899288   ./data/Coursera-SwiftKey/final/en_US/en_US.blogs.txt
5     /en_US/en_US.news.txt   77259    ./data/Coursera-SwiftKey/final/en_US/en_US.news.txt
6  /en_US/en_US.twitter.txt 2360148 ./data/Coursera-SwiftKey/final/en_US/en_US.twitter.txt
7    /fi_FI/fi_FI.blogs.txt  439785   ./data/Coursera-SwiftKey/final/fi_FI/fi_FI.blogs.txt
8     /fi_FI/fi_FI.news.txt  485758    ./data/Coursera-SwiftKey/final/fi_FI/fi_FI.news.txt
9  /fi_FI/fi_FI.twitter.txt  285214 ./data/Coursera-SwiftKey/final/fi_FI/fi_FI.twitter.txt
10   /ru_RU/ru_RU.blogs.txt  337100   ./data/Coursera-SwiftKey/final/ru_RU/ru_RU.blogs.txt
11    /ru_RU/ru_RU.news.txt  196360    ./data/Coursera-SwiftKey/final/ru_RU/ru_RU.news.txt
12 /ru_RU/ru_RU.twitter.txt  881414 ./data/Coursera-SwiftKey/final/ru_RU/ru_RU.twitter.txt

From this we can see that the English Twitter file, at roughly 2.36 million lines, is by far the largest single source in the dataset.

We can then look at the length, in characters, of the longest single line in each file.
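
The longest line can be measured with nchar(); a short sketch reusing the files vector and file_summary data frame from the previous step:

    longest_line <- function(path) {
      # loads each whole file once; acceptable for a one-off summary
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
      max(nchar(lines, type = "chars"))
    }

    file_summary$LongestLineLength <- vapply(files, longest_line, integer(1))
    file_summary[, c("FileName", "NOL", "LongestLineLength")]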

                   FileName     NOL LongestLineLength
1    /de_DE/de_DE.blogs.txt  181958              7194
2     /de_DE/de_DE.news.txt  244743              3949
3  /de_DE/de_DE.twitter.txt  947774               140
4    /en_US/en_US.blogs.txt  899288             40833
5     /en_US/en_US.news.txt   77259              5760
6  /en_US/en_US.twitter.txt 2360148               140
7    /fi_FI/fi_FI.blogs.txt  439785             18299
8     /fi_FI/fi_FI.news.txt  485758              3820
9  /fi_FI/fi_FI.twitter.txt  285214               140
10   /ru_RU/ru_RU.blogs.txt  337100              7806
11    /ru_RU/ru_RU.news.txt  196360              9540
12 /ru_RU/ru_RU.twitter.txt  881414               180

Now we have a record of the longest line in each file. The Twitter sources never exceed 180 characters, consistent with Twitter's message length limit, while single blog and news entries can run to thousands of characters (over 40,000 in the English blogs).

According to Dictionary.com (based on the Google Ngram Viewer), the ten most common nouns in the English language are:

  1. time
  2. person
  3. year
  4. way
  5. day
  6. thing
  7. man
  8. world
  9. life
  10. hand

We will count the occurrences of these words in the English sources of the dataset, just to see whether the ranking holds up.
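
A sketch of one way to produce such counts, using case-insensitive, whole-word matches in the three en_US files (the exact matching rules behind the table below are not shown, so treat these counts as approximate):

    nouns <- c("time", "person", "year", "way", "day",
               "thing", "man", "world", "life", "hand")

    count_word <- function(word, lines) {
      # \\b marks word boundaries, so "time" does not also match "sometimes"
      hits <- gregexpr(paste0("\\b", word, "\\b"), lines,
                       ignore.case = TRUE, perl = TRUE)
      sum(vapply(hits, function(m) if (m[1] == -1L) 0L else length(m), integer(1)))
    }

    en_files <- files[grepl("en_US", files, fixed = TRUE)]
    noun_counts <- sapply(en_files, function(path) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
      vapply(nouns, count_word, integer(1), lines = lines)
    })
    colnames(noun_counts) <- basename(en_files)
    noun_counts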

       en_US.blogs en_US.news en_US.twitter
time         99802       5844         92510
person       22299       1289         16680
year         55755       8454         36712
way          97708       5234         87706
day         105470      10270        234743
thing       110928       4343         98578
man         101145      10151         75008
world        24511       1051         18299
life         35279       1373         29914
hand         25846       1736         13381

Plotting the occurrences of each word gives a quick visual comparison of the three sources.
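
A minimal base-graphics sketch of such a plot, built from the noun_counts matrix defined above (the variable names are my own):

    barplot(t(noun_counts), beside = TRUE, las = 2,
            col = c("steelblue", "darkorange", "forestgreen"),
            legend.text = colnames(noun_counts),
            args.legend = list(x = "topright", bty = "n"),
            ylab = "Occurrences",
            main = "Occurrences of the ten common nouns, by source")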

Plan for modelling

As indicated in the assigned readings, we will build n-grams from each of the three data sources. This involves reading each line and breaking it into its constituent n-word phrases, whatever the value of n turns out to be.

Since we need to predict the next word from the words that precede it, I think the model will work efficiently and accurately with n = 3: the first two words of each trigram provide the context, and the third is the word to predict.

Our next task will be to break each line of each text file into overlapping three-word sequences (trigrams) and build the model on those.
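
As an illustration of that splitting step, here is a simplified trigram sketch in base R (tokenisation is deliberately crude, and preprocessing such as punctuation cleanup and profanity filtering is left for later):

    make_ngrams <- function(line, n = 3) {
      # crude tokenisation: lower-case the text, then split on anything
      # that is not a letter or an apostrophe
      words <- unlist(strsplit(tolower(line), "[^a-z']+"))
      words <- words[words != ""]
      if (length(words) < n) return(character(0))
      vapply(seq_len(length(words) - n + 1),
             function(i) paste(words[i:(i + n - 1)], collapse = " "),
             character(1))
    }

    make_ngrams("Our next task will be to break each line into trigrams")
    # "our next task" "next task will" "task will be" ...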