Introduction

In this project we need to show that we have become familiar with the data and that we are on track to create a prediction algorithm. We should submit a report on RPubs (http://rpubs.com/) that presents an exploratory analysis and the goals for the eventual app and algorithm. The report should be concise, explain only the major features of the data identified so far, and briefly summarize the plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager.

We should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

  1. Demonstrate that the data was downloaded and successfully loaded.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings.
  4. Get feedback on the plans for creating a prediction algorithm and Shiny app.

Tools Used

Since Python also has very good libraries for Natural Language Processing tasks, I will be using a mix of R, Python and Linux shell scripting to extract relevant information from the dataset.

Initial Exploration

First, let us see what the files look like by inspecting their first lines with the Linux command 'head' (a Python equivalent is sketched after the outputs below):

Output from 'head en_US.twitter.txt' Linux command

Output from 'head en_US.blogs.txt' Linux command

Output from 'head en_US.news.txt' Linux command
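
For completeness, the same quick preview can be done from Python. The sketch below is only illustrative (the relative paths are an assumption; in practice they point to wherever the corpus files were extracted) and simply prints the first three lines of each file:

import io

# Hypothetical paths; adjust to the location of the extracted corpus files.
paths = ['en_US.twitter.txt', 'en_US.blogs.txt', 'en_US.news.txt']

for path in paths:
    print(f'--- First lines of {path} ---')
    with io.open(path, encoding='utf8', errors='replace') as fin:
        for _ in range(3):
            print(fin.readline().rstrip())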

The number of lines and words in each file can be computed with the Linux command "wc" (a Python alternative is sketched below):

Output from the 'wc' Linux command
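
For readers without a Linux shell, a rough Python equivalent of "wc -lw" could look like the sketch below. It uses a simple whitespace split, so the word counts may differ slightly from wc, and the file paths are again hypothetical:

import io

# Hypothetical paths; adjust to the location of the corpus files.
for path in ['en_US.blogs.txt', 'en_US.news.txt', 'en_US.twitter.txt']:
    lines = words = 0
    with io.open(path, encoding='utf8', errors='replace') as fin:
        for line in fin:
            lines += 1
            words += len(line.split())  # whitespace-separated words, similar to wc -w
    print(f'{path}: {lines} lines, {words} words')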

The results are summarized in the table below:

Line and Word count per text file (English version)

File Name            Line Count    Word Count
en_US.blogs.txt         899,288    37,334,117
en_US.news.txt        1,010,242    34,365,936
en_US.twitter.txt     2,360,148    30,373,559
Total                 4,269,678   102,073,612

Let's now use Python and the NLTK library to compute not only the number of lines, but also the number of sentences contained in each file.

In order to run the Python commands, I first had to convert the text files using the following Linux command:

sed 's/\\0//g' en_US.twitter.txt > en_US.twitter_converted.txt
The reticulate package is used to run the Python code from within R:

library(reticulate)
use_condaenv("DataScienceSpec", required = TRUE)

The Python code starts by importing the required modules:

from nltk import word_tokenize, sent_tokenize
import os
import io

filenames = {'en_US.blogs': r'C:\Users\camargom\Downloads\Coursera\final\en_US\en_US.blogs_converted.txt',
             'en_US.news' : r'C:\Users\camargom\Downloads\Coursera\final\en_US\en_US.news_converted.txt',
             'en_US.twitter' : r'C:\Users\camargom\Downloads\Coursera\final\en_US\en_US.twitter_converted.txt'  }

# for quick testing
#filenames = {'en_US.blogs': r'C:\Users\camargom\Downloads\Coursera\final\en_US\en_US.twitter_converted_small.txt',
#              'en_US.news': r'C:\Users\camargom\Downloads\Coursera\final\en_US\en_US.twitter_converted_small.txt' }

# dictionary mapping each file to its list of sentences
sentences = {}

# I have run this command to convert the file:
# sed 's/\\0//g' en_US.twitter.txt > en_US.twitter_converted.txt
for key in filenames:
  if os.path.isfile(filenames[key]):
      with io.open(filenames[key], encoding='utf8') as fin:
          text = fin.read()
          # split the raw text into sentences with NLTK's sentence tokenizer
          sentences[key] = sent_tokenize(text)
          print(f'{key} - Number of characters: {len(text)} - Number of sentences: {len(sentences[key])}')

The results are summarized in the table below:

Sentence, Line, Word and Character count per text file (English version)

File Name           Sentence Count    Line Count    Word Count    Character Count
en_US.blogs.txt          2,083,684       899,288    37,334,117        207,723,793
en_US.news.txt           1,830,494     1,010,242    34,365,936        204,233,401
en_US.twitter.txt        2,811,218     2,360,148    30,373,559        164,456,392
Total                    6,725,396     4,269,678   102,073,612        576,413,586

Now let's inspect the distribution of sentence lengths (measured here as the number of characters per sentence) in each of the 3 files.

# length (in characters) of every sentence in each file
len_sentences = {}

for key in sentences:
  len_sentences[key] = [len(i) for i in sentences[key]]

import numpy as np
import matplotlib.pyplot as plt

# plot and save a histogram of sentence lengths for each file
for key in len_sentences:
  plt.clf()

  n, bins, patches = plt.hist(len_sentences[key], 100, range=[0, 1000])

  plt.xlabel('Sentence length (characters)')
  plt.ylabel('Number of occurrences')
  plt.title(f'Histogram of {key} sentence length')
  plt.savefig(f'{key}_sentence_length.png')

As expected, we see that the sentences in the Twitter dataset are usually shorter than those in the blogs and news datasets. The news dataset has the longest sentences on average, which is also expected given its generally more formal language.

Word Distribution

Let us now compute the distribution of words in the whole dataset (all 3 files considered together). Here the 100 most common words, as well as the number of occurrences of each individual word, are computed.

from collections import Counter

# merge the sentences from all 3 files into a single list
all_sentences = []

for key in sentences:
  all_sentences += sentences[key]

# join everything into one lowercase string and tokenize it into words and punctuation
whole_text = (' '.join(filter(None, all_sentences))).lower()

all_tokens = word_tokenize(whole_text)

# count how many times each token appears
counter = Counter(all_tokens)
print('Most common words:')
counter.most_common(100)

100 most common words and number of occurrences:

(‘.’ : 4941686), (‘the’, 4760895), (‘,’, 4485534), (‘to’, 2753351), (‘and’, 2408096), (‘a’, 2377580), (‘of’, 2005044), (‘i’, 1999642), (‘in’, 1644631), (‘!’, 1480352), (‘it’, 1145713), (‘that’, 1122896), (‘for’, 1098683), (‘is’, 1094776), (‘you’, 1035878), (‘on’, 817046), ("‘s“, 793586), (‘with’, 714043), (‘was’, 642921), (‘:’, 615584), (‘?’, 607629), (‘my’, 603317), (”’‘“, 579315), (‘at’, 570578), (‘be’, 547600), (‘have’, 545448), (‘this’, 542991), (‘``’, 519072), (‘�’, 501647), (‘are’, 500812), (‘we’, 495163), (‘)’, 492366), (‘he’, 483023), (‘but’, 481914), (‘as’, 481313), (‘not’, 422372), (”n’t“, 408018), (‘from’, 383936), (‘so’, 380936), (‘(’, 375693), (‘do’, 368192), (‘me’, 367212), (‘they’, 361910), (‘…’, 341570), (‘all’, 330222), (‘will’, 315146), (‘by’, 313861), (‘or’, 309679), (‘said’, 304842), (‘just’, 303750), (‘what’, 302549), (‘your’, 302071), (‘his’, 301100), (‘an’, 298796), (‘about’, 295617), (‘out’, 294476), (‘one’, 291810), (‘up’, 291296), (‘if’, 278548), (‘#’, 273587), (‘can’, 271360), (‘like’, 269205), (‘has’, 265478), (‘when’, 264771), (‘who’, 260108), (‘there’, 247226), (‘more’, 243006), (‘s’, 236295), (‘had’, 235428), (‘get’, 226346), (‘she’, 219280), (‘would’, 218262), (‘time’, 213904), (‘her’, 207883), (‘their’, 204548), (‘some’, 201581), (‘no’, 198108), (‘new’, 193911), (‘been’, 187858), (‘our’, 185883), (”’m", 184462), (’�‘, 183299), (’were’, 182465), (‘�’, 180041), (‘good’, 178262), (‘now’, 178153), (‘how’, 178026), (‘-’, 175355), (‘day’, 168109), (‘know’, 162698), (‘people’, 161523), (‘them’, 160750), (‘love’, 160138), (‘did’, 153042), (‘$’, 149905), (‘;’, 145229), (‘which’, 143980), (‘back’, 141015), (‘than’, 140371), (‘go’, 139292)

import matplotlib.pyplot as plt

plt.clf()

# histogram of how often each distinct word occurs (occurrence count on the x-axis)
n, bins, patches = plt.hist(list(counter.values()), 500, range=[0, 500])

plt.xlabel('Number of occurrences')
plt.ylabel('Number of words')
plt.title('Histogram of word occurrences (3 files together)')
plt.savefig('hist_word_occurence.png')

Bigram Distribution

Now we compute the same statistics, but for bigrams. Below one can see the 100 most common bigrams as well as the histogram.

from nltk.util import ngrams

# build all consecutive word pairs (bigrams) from the token list and count them
bi_grams = list(ngrams(all_tokens, 2))
counter_bigram = Counter(bi_grams)
counter_bigram.most_common(100)

100 most common bigrams and number of occurrences:

((‘of’, ‘the’), 428080), ((‘in’, ‘the’), 405701), ((‘.’, ‘i’), 401850), ((‘!’, ‘!’), 341329), ((‘,’, ‘and’), 325770), ((‘.’, ‘the’), 306619), ((‘�’, ‘s’), 228712), ((‘to’, ‘the’), 212163), ((‘,’, ‘but’), 207401), ((‘for’, ‘the’), 199942), ((‘,’, ‘the’), 197143), ((‘on’, ‘the’), 195152), ((‘,’, ‘i’), 195131), ((‘.’, ‘``’), 194845), ((‘i’, “‘m“), 183353), ((‘it’,”’s“), 180666), ((‘.’,”’‘“), 161915), ((‘to’, ‘be’), 161660), ((‘.’, ‘it’), 156437), ((‘,’,”’’”), 150300), ((‘at’, ‘the’), 142118), ((‘do’, “n’t”), 128071), ((‘and’, ‘the’), 124440), ((‘in’, ‘a’), 118707), ((‘�’, ‘t’), 113105), ((‘!’, ‘i’), 110721), ((‘with’, ‘the’), 105254), ((‘:’, ‘)’), 102587), ((‘it’, ‘was’), 100267), ((‘is’, ‘a’), 100198), ((‘and’, ‘i’), 97539), ((‘.’, ‘but’), 97311), ((‘said’, ‘.’), 96208), ((‘.’, ‘he’), 94841), ((‘,’, ‘a’), 94440), ((‘for’, ‘a’), 93483), ((‘i’, ‘have’), 92552), ((‘i’, ‘was’), 89246), ((‘if’, ‘you’), 86881), ((‘from’, ‘the’), 86573), ((‘.’, ‘and’), 83453), ((‘it’, ‘is’), 83069), ((‘with’, ‘a’), 81641), ((‘will’, ‘be’), 81042), ((‘,’, ‘it’), 80668), ((‘.’, ‘we’), 80260), ((‘going’, ‘to’), 79798), ((‘of’, ‘a’), 78654), ((‘i’, ‘am’), 76540), ((‘it’, ‘.’), 76158), ((‘have’, ‘a’), 73368), ((‘one’, ‘of’), 72568), ((‘is’, ‘the’), 72431), ((‘to’, ‘get’), 70378), ((‘as’, ‘a’), 68535), ((‘,’, ‘which’), 67953), ((‘.’, ‘in’), 67727), ((‘that’, “’s”), 66767), ((‘ca’, “n’t”), 65753), ((‘i’, ‘do’), 64563), ((‘want’, ‘to’), 63931), ((‘.’, ‘this’), 62716), ((‘but’, ‘i’), 62299), ((‘i’, ‘�’), 62125), ((‘have’, ‘to’), 61462), ((‘by’, ‘the’), 61019), ((‘.’, ‘you’), 60354), ((‘this’, ‘is’), 60272), ((‘that’, ‘i’), 59401), ((‘)’, ‘.’), 59127), ((‘that’, ‘the’), 58914), ((‘i’, ‘think’), 58305), ((‘to’, ‘do’), 58200), ((‘and’, ‘a’), 57828), ((‘,’, ‘who’), 57487), ((‘the’, ‘first’), 56960), ((‘was’, ‘a’), 56210), ((‘out’, ‘of’), 55935), ((‘rt’, ‘:’), 55299), ((‘to’, ‘a’), 55114), ((‘,’, ‘you’), 54871), ((‘.’, ‘they’), 54760), ((‘to’, ‘see’), 53886), ((‘.’, ‘�’), 53688), ((‘.’, ‘a’), 53295), ((‘on’, ‘a’), 53244), ((‘.’, ‘that’), 53214), ((‘,’, ‘he’), 53179), ((‘you’, “’re”), 53002), ((‘.’, ‘so’), 51872), ((‘.’, ‘if’), 51521), ((‘you’, ‘can’), 51349), ((‘i’, ‘love’), 50357), ((‘all’, ‘the’), 50243), ((‘it’, ‘�’), 49673), ((‘the’, ‘same’), 49442), ((‘,’, ‘we’), 49058), ((‘did’, “n’t”), 48956), ((‘?’, ‘i’), 48698), ((‘i’, “’ve”), 48489)

import matplotlib.pyplot as plt

plt.clf()

# histogram of how often each distinct bigram occurs
n, bins, patches = plt.hist(list(counter_bigram.values()), 500, range=[0, 500])

plt.xlabel('Number of occurrences')
plt.ylabel('Number of bigrams')
plt.title('Histogram of bigram occurrences (3 files together)')
plt.savefig('hist_bigram_word_occurence.png')

We can see that a small number of words has a very high frequency, while the rest are used very seldom.

Trigram Distribution

Here one can see the analysis for trigrams. The 100 most frequent ones, as well as the histogram, are shown below.

from nltk.util import ngrams

# build all consecutive word triples (trigrams) and count them
tri_grams = list(ngrams(all_tokens, 3))
counter_trigram = Counter(tri_grams)
counter_trigram.most_common(100)

100 most common trigrams and number of occurrences:

((‘!’, ‘!’, ‘!’), 152308), ((‘it’, ‘�’, ‘s’), 48686), ((‘.’, ‘it’, “‘s“), 43891), ((‘i’, ‘do’,”n’t“), 43663), ((‘said’, ‘.’,”’”), 40815), ((‘.’, ‘i’, “‘m“), 35989), ((‘one’, ‘of’, ‘the’), 34520), ((‘don’, ‘�’, ‘t’), 33559), ((‘i’, ‘�’, ‘m’), 32830), ((‘.’,”’”, ‘i’), 30992), ((‘a’, ‘lot’, ‘of’), 29984), ((‘,’, ‘but’, ‘i’), 28837), ((‘,’, ‘and’, ‘the’), 25136), ((‘,’, “’‘“, ‘said’), 24427), ((‘.’, ‘it’, ‘was’), 24372), ((‘thanks’, ‘for’, ‘the’),23763), ((‘,’, ‘and’, ‘i’), 22637), ((‘it’,”’s“, ‘a’), 21273), ((‘i’, ‘ca’,”n’t“), 20858), ((‘,’,”’‘“, ‘he’),20666), ((‘!’, ‘:’, ‘)’), 20603), ((‘,’, ‘it’,”’s“), 20258), ((‘he’, ‘said’, ‘.’), 20178), ((‘.’, ‘if’, ‘you’),20132), ((‘.’, ‘i’, ‘have’), 19572), ((‘.’,”’”, ‘we’), 19094), ((‘.’, ‘it’, ‘is’), 18857), ((‘.’, ‘this’, ‘is’),18826), ((‘.’, “‘“, ‘it’), 18291), ((‘!’, ‘!’, ‘i’), 18276), ((‘to’, ‘be’, ‘a’), 18085), ((‘?’, ‘?’, ‘?’), 17630), ((‘going’, ‘to’, ‘be’), 17423), ((‘i’,”’m“, ‘not’), 17313), ((‘,’, ‘but’, ‘it’), 17280), ((”’‘“, ‘he’, ‘said’),16837), ((‘,’, ‘i’,”’m“), 16534), ((‘.’, ‘i’, ‘am’), 16491), ((‘.’, ‘i’, ‘was’), 16342), ((‘didn’, ‘�’, ‘t’),16121), ((‘ca’,”n’t“, ‘wait’), 15910), ((‘i’, ‘�’, ‘ve’), 15648), ((‘that’, ‘�’, ‘s’), 15596), ((‘!’, ‘i’,”’m“),15513), ((‘.’, ‘that’,”’s“), 15481), ((‘!’, ‘rt’, ‘:’), 15191), ((‘i’, ‘want’, ‘to’), 14984), ((‘the’, ‘end’, ‘of’),14876), ((‘out’, ‘of’, ‘the’), 14714), ((‘.’,”’‘“, ‘the’), 14636), ((‘.’, ‘i’, ‘�’), 14630), ((‘.’, ‘i’, ‘think’), 14597), ((‘,’, ‘according’, ‘to’), 14582), ((‘i’, ‘did’,”n’t“), 14332), ((‘.’, ‘it’, ‘�’), 14299), ((‘it’, ‘was’, ‘a’),14162), ((‘do’,”n’t“, ‘know’), 13957), ((‘as’, ‘well’, ‘as’), 13806), ((‘,’, ‘but’, ‘the’), 13771), ((‘some’, ‘of’, ‘the’),13632), ((‘.’, ‘in’, ‘the’), 13524), ((‘.’,”’”, "‘“), 13484), ((‘it’,”’s“, ‘not’), 13430), ((‘you’, ‘do’,”n’t“), 13425), ((‘:’, ‘-’, ‘)’), 13242), ((‘be’, ‘able’, ‘to’), 13059), ((‘.’, ‘:’, ‘)’), 12993), ((‘.’, ‘i’, ‘do’), 12852), ((‘.’, ‘`’, ‘the’), 12586), ((‘i’, ‘don’, ‘�’), 12352), ((‘part’, ‘of’, ‘the’), 12312), ((‘.’, ‘i’, ‘love’), 12126), ((‘``’, ‘it’,”’s“), 11913), ((‘can’, ‘�’, ‘t’), 11874), ((‘�’, ‘s’, ‘a’), 11810), ((‘i’, ‘have’, ‘a’), 11797), ((‘,’, ‘which’, ‘is’), 11462), ((‘.’, ‘thanks’, ‘for’), 11398), ((‘i’,”’ve“, ‘been’), 11273), ((‘the’, ‘rest’,‘of’),11224), ((‘i’, ‘have’, ‘to’), 11214), ((‘looking’, ‘forward’, ‘to’), 11194), ((‘,’, ‘and’, ‘it’), 10886), ((‘.’, ‘rt’, ‘:’),10782), ((‘do’,”n’t“, ‘have’), 10766), ((‘,’, ‘so’, ‘i’), 10715), ((‘doesn’, ‘�’, ‘t’), 10339), ((‘:’, ‘)’, ‘i’), 10256), ((‘.’, ‘i’,”’ve“), 10215), ((‘you’, ‘�’, ‘re’), 10205), ((‘is’, ‘going’, ‘to’), 10141), ((‘the’, ‘first’, ‘time’), 10134), ((‘thank’, ‘you’, ‘for’), 10080), ((‘a’, ‘couple’, ‘of’), 10028), ((‘!’, ‘it’,”’s"), 9929), ((’,‘, ’it’, ‘was’), 9912), ((‘.’, ‘he’, ‘was’), 9888), ((‘i’, ‘think’, ‘i’), 9831), ((‘this’, ‘is’, ‘a’), 9795), ((‘.’, ‘thank’, ‘you’), 9771)

import matplotlib.pyplot as plt

plt.clf()

# histogram of how often each distinct trigram occurs (note: use the trigram counter here)
n, bins, patches = plt.hist(list(counter_trigram.values()), 500, range=[0, 500])

plt.xlabel('Number of occurrences')
plt.ylabel('Number of trigrams')
plt.title('Histogram of trigram occurrences (3 files together)')
plt.savefig('hist_trigram_word_occurence.png')

Prediction Algorithm

As a next step, I plan to create a prediction model based on an N-gram language model, i.e. a technique that predicts the next word as the one with the highest probability of occurrence given the previous N-1 words, as estimated from the training data. In order to decide whether to use N equal to 1, 2, 3 or more, I plan to set aside part of the training set for evaluation and then test which choice of N assigns the highest probabilities to the ground-truth text. However, it will probably only be possible to test N less than or equal to 3, since with the limited hardware available it might not be feasible to handle larger values of N (the number of distinct 4-grams is very large).
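
As a rough illustration of the idea (a minimal sketch, not the final implementation), the function below uses the trigram counts computed above to suggest the most likely next words for a two-word context; counter_trigram is the Counter built in the trigram section, and the example call is hypothetical.

from collections import Counter
from nltk import word_tokenize

def predict_next_word(context, trigram_counter, top_n=3):
    """Suggest the top_n most frequent continuations of the last two tokens of context."""
    tokens = word_tokenize(context.lower())
    if len(tokens) < 2:
        return []
    w1, w2 = tokens[-2], tokens[-1]
    # keep the third token of every trigram that starts with the (w1, w2) context
    candidates = Counter({tri[2]: count
                          for tri, count in trigram_counter.items()
                          if tri[0] == w1 and tri[1] == w2})
    return [word for word, _ in candidates.most_common(top_n)]

# Example usage (hypothetical): predict_next_word('thanks for', counter_trigram)
# would return the most frequent continuations seen in the corpus, e.g. 'the'.

A full model would turn these counts into probabilities and back off to bigram and unigram counts when a given context has never been seen, which is what the evaluation on the held-out data will help tune.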

More information can be found at: https://web.stanford.edu/~jurafsky/slp3/3.pdf