Find the length of each file
We can then look at the number of lines and the longest single line in each of the files; a sketch of the computation follows the table.
   FileName                  NumberOfLines  LongestLineLength
1  /de_DE/de_DE.blogs.txt           181958               7194
2  /de_DE/de_DE.news.txt            244743               3949
3  /de_DE/de_DE.twitter.txt         947774                140
4  /en_US/en_US.blogs.txt           899288              40833
5  /en_US/en_US.news.txt             77259               5760
6  /en_US/en_US.twitter.txt        2360148                140
7  /fi_FI/fi_FI.blogs.txt           439785              18299
8  /fi_FI/fi_FI.news.txt            485758               3820
9  /fi_FI/fi_FI.twitter.txt         285214                140
10 /ru_RU/ru_RU.blogs.txt           337100               7806
11 /ru_RU/ru_RU.news.txt            196360               9540
12 /ru_RU/ru_RU.twitter.txt         881414                180
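A minimal R sketch of how this table can be produced; the "final" directory used here is an assumption about where the dataset was unzipped:

# Assumed layout: the corpus .txt files sit under a local "final" directory.
files <- list.files("final", pattern = "\\.txt$",
                    recursive = TRUE, full.names = TRUE)
# readLines() loads each file; length() and nchar() give the two statistics.
stats <- do.call(rbind, lapply(files, function(f) {
  lines <- readLines(f, skipNul = TRUE, warn = FALSE)
  data.frame(FileName          = f,
             NumberOfLines     = length(lines),
             LongestLineLength = max(nchar(lines)))
}))
stats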

We now have a record of the number of lines and the longest line in each file.
Next, we count the occurrences of a set of common English nouns in the English sources, to get a feel for the relative word frequencies across the three file types (a sketch of the counting code follows the table).
Word    en_US.blogs  en_US.news  en_US.twitter
time          99802        5844          92510
person        22299        1289          16680
year          55755        8454          36712
way           97708        5234          87706
day          105470       10270         234743
thing        110928        4343          98578
man          101145       10151          75008
world         24511        1051          18299
life          35279        1373          29914
hand          25846        1736          13381
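One way to produce counts like these (a sketch: blogs, news, and twitter are assumed to be character vectors already read from the en_US files with readLines(), and count_word is an illustrative helper, not the exact method used above):

words <- c("time", "person", "year", "way", "day",
           "thing", "man", "world", "life", "hand")
# Count whole-word, case-insensitive matches across all lines of a source.
count_word <- function(word, text) {
  matches <- gregexpr(paste0("\\b", word, "\\b"), text, ignore.case = TRUE)
  sum(vapply(matches, function(m) sum(m > 0), numeric(1)))
}
counts <- sapply(list(en_US.blogs   = blogs,
                      en_US.news    = news,
                      en_US.twitter = twitter),
                 function(src) vapply(words, count_word, numeric(1), text = src))
counts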
Plotting the occurrences of each word in each source:
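A sketch of one way to draw such a plot in base R, assuming the counts matrix from the sketch above:

# Transpose so rows are sources and columns are words,
# then draw grouped bars with one group per word.
barplot(t(counts), beside = TRUE, las = 2,
        legend.text = TRUE,
        main = "Occurrences of common nouns by source")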

Plan for modelling
As indicated in the assigned readings, we will try to build n-grams from each of the three data sources. This involves reading each line and breaking it into its constituent phrases for whichever value of n we choose.
Since we need to predict the next word from the words that precede it, I expect the model to balance efficiency and accuracy best at n = 3: a trigram model predicts the third word from the two words before it.
Our next task will be to break each line of each text file into overlapping sequences of three words (trigrams) and build the model on those; a sketch of the tokenization follows.
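A minimal sketch of trigram extraction (make_trigrams is an illustrative name, and the simple lowercase tokenization is an assumption; the real model will need more careful text cleaning):

# Split each line into lowercase word tokens, then paste each
# run of three consecutive tokens into a trigram.
make_trigrams <- function(lines) {
  token_lists <- strsplit(tolower(lines), "[^a-z']+")
  unlist(lapply(token_lists, function(tokens) {
    tokens <- tokens[tokens != ""]
    n <- length(tokens)
    if (n < 3) return(character(0))
    paste(tokens[1:(n - 2)], tokens[2:(n - 1)], tokens[3:n])
  }))
}

make_trigrams("The quick brown fox jumps")
# "the quick brown" "quick brown fox" "brown fox jumps"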