Introduction

In this project we need to show that we have become familiar with the data and that we are on track to create a prediction algorithm. We should submit a report on RPubs (http://rpubs.com/) that presents an exploratory analysis and the goals for the eventual app and algorithm. The report should be concise, explain only the major features of the data identified so far, and briefly summarize the plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager.

We should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

  1. Demonstrate that the data was downloaded and successfully loaded.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings.
  4. Get feedback on the plans for creating a prediction algorithm and Shiny app.

Tools Used

Since Python also has very good libraries for Natural Language Processing tasks, I will be using a mix of R, Python and Linux shell scripting to extract relevant information from the dataset.

Initial Exploration

First, let us see what the files look like by inspecting their first lines with the Linux command 'head' (a Python equivalent is sketched after the outputs below):

Output from 'head en_US.twitter.txt' Linux command

Output from 'head en_US.blogs.txt' Linux command

Output from 'head en_US.news.txt' Linux command
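
For completeness, the same quick preview can be done from Python. The sketch below is only illustrative (the relative paths are an assumption; in practice they point to wherever the corpus files were extracted) and simply prints the first three lines of each file:

import io

# Hypothetical paths; adjust to the location of the extracted corpus files.
paths = ['en_US.twitter.txt', 'en_US.blogs.txt', 'en_US.news.txt']

for path in paths:
    print(f'--- First lines of {path} ---')
    with io.open(path, encoding='utf8', errors='replace') as fin:
        for _ in range(3):
            print(fin.readline().rstrip())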

The number of lines and words in each file can be computed with the Linux command "wc" (a Python alternative is sketched below):

Output from the 'wc' Linux command
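
For readers without a Linux shell, a rough Python equivalent of "wc -lw" could look like the sketch below. It uses a simple whitespace split, so the word counts may differ slightly from wc, and the file paths are again hypothetical:

import io

# Hypothetical paths; adjust to the location of the corpus files.
for path in ['en_US.blogs.txt', 'en_US.news.txt', 'en_US.twitter.txt']:
    lines = words = 0
    with io.open(path, encoding='utf8', errors='replace') as fin:
        for line in fin:
            lines += 1
            words += len(line.split())  # whitespace-separated words, similar to wc -w
    print(f'{path}: {lines} lines, {words} words')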

The results are summarized in the table below:

Line and Word count per text file (English version)

File Name            Line Count    Word Count
en_US.blogs.txt         899,288    37,334,117
en_US.news.txt        1,010,242    34,365,936
en_US.twitter.txt     2,360,148    30,373,559
Total                 4,269,678   102,073,612

Let's now use Python and the NLTK library to compute not only the number of lines, but also the number of sentences contained in each file.

In order to run the Python commands, I first had to convert the text files using the following Linux command:

sed 's/\\0//g' en_US.twitter.txt > en_US.twitter_converted.txt
The reticulate package is used to run the Python code from within R:

library(reticulate)
use_condaenv("DataScienceSpec", required = TRUE)

The Python code starts by importing the required modules:

from nltk import word_tokenize, sent_tokenize
import os
import io

filenames = {'en_US.blogs': r'C:\Users\camargom\Downloads\Coursera\final\en_US\en_US.blogs_converted.txt',
             'en_US.news' : r'C:\Users\camargom\Downloads\Coursera\final\en_US\en_US.news_converted.txt',
             'en_US.twitter' : r'C:\Users\camargom\Downloads\Coursera\final\en_US\en_US.twitter_converted.txt'  }

# for quick testing
#filenames = {'en_US.blogs': r'C:\Users\camargom\Downloads\Coursera\final\en_US\en_US.twitter_converted_small.txt',
#              'en_US.news': r'C:\Users\camargom\Downloads\Coursera\final\en_US\en_US.twitter_converted_small.txt' }

# dictionary mapping each file to its list of sentences
sentences = {}

# I have run this command to convert the file:
# sed 's/\\0//g' en_US.twitter.txt > en_US.twitter_converted.txt
for key in filenames:
  if os.path.isfile(filenames[key]):
      with io.open(filenames[key], encoding='utf8') as fin:
          text = fin.read()
          # split the raw text into sentences with NLTK's sentence tokenizer
          sentences[key] = sent_tokenize(text)
          print(f'{key} - Number of characters: {len(text)} - Number of sentences: {len(sentences[key])}')

The results are summarized in the table below:

Sentence, Line, Word and Character count per text file (English version)

File Name           Sentence Count    Line Count    Word Count    Character Count
en_US.blogs.txt          2,083,684       899,288    37,334,117        207,723,793
en_US.news.txt           1,830,494     1,010,242    34,365,936        204,233,401
en_US.twitter.txt        2,811,218     2,360,148    30,373,559        164,456,392
Total                    6,725,396     4,269,678   102,073,612        576,413,586

Now let's inspect the distribution of sentence lengths (measured here as the number of characters per sentence) in each of the 3 files.

# length (in characters) of every sentence in each file
len_sentences = {}

for key in sentences:
  len_sentences[key] = [len(i) for i in sentences[key]]

import numpy as np
import matplotlib.pyplot as plt

# plot and save a histogram of sentence lengths for each file
for key in len_sentences:
  plt.clf()

  n, bins, patches = plt.hist(len_sentences[key], 100, range=[0, 1000])

  plt.xlabel('Sentence length (characters)')
  plt.ylabel('Number of occurrences')
  plt.title(f'Histogram of {key} sentence length')
  plt.savefig(f'{key}_sentence_length.png')

As expected, we see that the sentences in the Twitter dataset are usually shorter than those in the blogs and news datasets. The news dataset has the longest sentences on average, which is also expected given its generally more formal language.

Word Distribution

Let us now compute the distribution of words in the whole dataset (all 3 files considered together). Here the 100 most common words, as well as the number of occurrences of each individual word, are computed.

from collections import Counter

# merge the sentences from all 3 files into a single list
all_sentences = []

for key in sentences:
  all_sentences += sentences[key]

# join everything into one lowercase string and tokenize it into words and punctuation
whole_text = (' '.join(filter(None, all_sentences))).lower()

all_tokens = word_tokenize(whole_text)

# count how many times each token appears
counter = Counter(all_tokens)
print('Most common words:')
counter.most_common(100)

100 most common words and number of occurrences:

(‘.’ : 4941686), (‘the’, 4760895), (‘,’, 4485534), (‘to’, 2753351), (‘and’, 2408096), (‘a’, 2377580), (‘of’, 2005044), (‘i’, 1999642), (‘in’, 1644631), (‘!’, 1480352), (‘it’, 1145713), (‘that’, 1122896), (‘for’, 1098683), (‘is’, 1094776), (‘you’, 1035878), (‘on’, 817046), ("‘s“, 793586), (‘with’, 714043), (‘was’, 642921), (‘:’, 615584), (‘?’, 607629), (‘my’, 603317), (”’‘“, 579315), (‘at’, 570578), (‘be’, 547600), (‘have’, 545448), (‘this’, 542991), (‘``’, 519072), (‘�’, 501647), (‘are’, 500812), (‘we’, 495163), (‘)’, 492366), (‘he’, 483023), (‘but’, 481914), (‘as’, 481313), (‘not’, 422372), (”n’t“, 408018), (‘from’, 383936), (‘so’, 380936), (‘(’, 375693), (‘do’, 368192), (‘me’, 367212), (‘they’, 361910), (‘…’, 341570), (‘all’, 330222), (‘will’, 315146), (‘by’, 313861), (‘or’, 309679), (‘said’, 304842), (‘just’, 303750), (‘what’, 302549), (‘your’, 302071), (‘his’, 301100), (‘an’, 298796), (‘about’, 295617), (‘out’, 294476), (‘one’, 291810), (‘up’, 291296), (‘if’, 278548), (‘#’, 273587), (‘can’, 271360), (‘like’, 269205), (‘has’, 265478), (‘when’, 264771), (‘who’, 260108), (‘there’, 247226), (‘more’, 243006), (‘s’, 236295), (‘had’, 235428), (‘get’, 226346), (‘she’, 219280), (‘would’, 218262), (‘time’, 213904), (‘her’, 207883), (‘their’, 204548), (‘some’, 201581), (‘no’, 198108), (‘new’, 193911), (‘been’, 187858), (‘our’, 185883), (”’m", 184462), (’�‘, 183299), (’were’, 182465), (‘�’, 180041), (‘good’, 178262), (‘now’, 178153), (‘how’, 178026), (‘-’, 175355), (‘day’, 168109), (‘know’, 162698), (‘people’, 161523), (‘them’, 160750), (‘love’, 160138), (‘did’, 153042), (‘$’, 149905), (‘;’, 145229), (‘which’, 143980), (‘back’, 141015), (‘than’, 140371), (‘go’, 139292)

import matplotlib.pyplot as plt

plt.clf()

# histogram of how often each distinct word occurs (occurrence count on the x-axis)
n, bins, patches = plt.hist(list(counter.values()), 500, range=[0, 500])

plt.xlabel('Number of occurrences')
plt.ylabel('Number of words')
plt.title('Histogram of word occurrences (3 files together)')
plt.savefig('hist_word_occurence.png')

Bigram Distribution

Now we compute the same statistics, but for bigrams. Below one can see the 100 most common bigrams as well as the histogram.

from nltk.util import ngrams

# build all consecutive word pairs (bigrams) from the token list and count them
bi_grams = list(ngrams(all_tokens, 2))
counter_bigram = Counter(bi_grams)
counter_bigram.most_common(100)

100 most common bigrams and number of occurrences:

((‘of’, ‘the’), 428080), ((‘in’, ‘the’), 405701), ((‘.’, ‘i’), 401850), ((‘!’, ‘!’), 341329), ((‘,’, ‘and’), 325770), ((‘.’, ‘the’), 306619), ((‘�’, ‘s’), 228712), ((‘to’, ‘the’), 212163), ((‘,’, ‘but’), 207401), ((‘for’, ‘the’), 199942), ((‘,’, ‘the’), 197143), ((‘on’, ‘the’), 195152), ((‘,’, ‘i’), 195131), ((‘.’, ‘``’), 194845), ((‘i’, “‘m“), 183353), ((‘it’,”’s“), 180666), ((‘.’,”’‘“), 161915), ((‘to’, ‘be’), 161660), ((‘.’, ‘it’), 156437), ((‘,’,”’’”), 150300), ((‘at’, ‘the’), 142118), ((‘do’, “n’t”), 128071), ((‘and’, ‘the’), 124440), ((‘in’, ‘a’), 118707), ((‘�’, ‘t’), 113105), ((‘!’, ‘i’), 110721), ((‘with’, ‘the’), 105254), ((‘:’, ‘)’), 102587), ((‘it’, ‘was’), 100267), ((‘is’, ‘a’), 100198), ((‘and’, ‘i’), 97539), ((‘.’, ‘but’), 97311), ((‘said’, ‘.’), 96208), ((‘.’, ‘he’), 94841), ((‘,’, ‘a’), 94440), ((‘for’, ‘a’), 93483), ((‘i’, ‘have’), 92552), ((‘i’, ‘was’), 89246), ((‘if’, ‘you’), 86881), ((‘from’, ‘the’), 86573), ((‘.’, ‘and’), 83453), ((‘it’, ‘is’), 83069), ((‘with’, ‘a’), 81641), ((‘will’, ‘be’), 81042), ((‘,’, ‘it’), 80668), ((‘.’, ‘we’), 80260), ((‘going’, ‘to’), 79798), ((‘of’, ‘a’), 78654), ((‘i’, ‘am’), 76540), ((‘it’, ‘.’), 76158), ((‘have’, ‘a’), 73368), ((‘one’, ‘of’), 72568), ((‘is’, ‘the’), 72431), ((‘to’, ‘get’), 70378), ((‘as’, ‘a’), 68535), ((‘,’, ‘which’), 67953), ((‘.’, ‘in’), 67727), ((‘that’, “’s”), 66767), ((‘ca’, “n’t”), 65753), ((‘i’, ‘do’), 64563), ((‘want’, ‘to’), 63931), ((‘.’, ‘this’), 62716), ((‘but’, ‘i’), 62299), ((‘i’, ‘�’), 62125), ((‘have’, ‘to’), 61462), ((‘by’, ‘the’), 61019), ((‘.’, ‘you’), 60354), ((‘this’, ‘is’), 60272), ((‘that’, ‘i’), 59401), ((‘)’, ‘.’), 59127), ((‘that’, ‘the’), 58914), ((‘i’, ‘think’), 58305), ((‘to’, ‘do’), 58200), ((‘and’, ‘a’), 57828), ((‘,’, ‘who’), 57487), ((‘the’, ‘first’), 56960), ((‘was’, ‘a’), 56210), ((‘out’, ‘of’), 55935), ((‘rt’, ‘:’), 55299), ((‘to’, ‘a’), 55114), ((‘,’, ‘you’), 54871), ((‘.’, ‘they’), 54760), ((‘to’, ‘see’), 53886), ((‘.’, ‘�’), 53688), ((‘.’, ‘a’), 53295), ((‘on’, ‘a’), 53244), ((‘.’, ‘that’), 53214), ((‘,’, ‘he’), 53179), ((‘you’, “’re”), 53002), ((‘.’, ‘so’), 51872), ((‘.’, ‘if’), 51521), ((‘you’, ‘can’), 51349), ((‘i’, ‘love’), 50357), ((‘all’, ‘the’), 50243), ((‘it’, ‘�’), 49673), ((‘the’, ‘same’), 49442), ((‘,’, ‘we’), 49058), ((‘did’, “n’t”), 48956), ((‘?’, ‘i’), 48698), ((‘i’, “’ve”), 48489)

import matplotlib.pyplot as plt

plt.clf()

# histogram of how often each distinct bigram occurs
n, bins, patches = plt.hist(list(counter_bigram.values()), 500, range=[0, 500])

plt.xlabel('Number of occurrences')
plt.ylabel('Number of bigrams')
plt.title('Histogram of bigram occurrences (3 files together)')
plt.savefig('hist_bigram_word_occurence.png')

We can see that a small number of words has a very high frequency, while the rest are used very seldom.

Trigram Distribution

Here one can see the analysis for trigrams. The 100 most frequent ones, as well as the histogram, are shown below.

from nltk.util import ngrams

# build all consecutive word triples (trigrams) and count them
tri_grams = list(ngrams(all_tokens, 3))
counter_trigram = Counter(tri_grams)
counter_trigram.most_common(100)

100 most common trigrams and number of occurrences:

((‘!’, ‘!’, ‘!’), 152308), ((‘it’, ‘�’, ‘s’), 48686), ((‘.’, ‘it’, “‘s“), 43891), ((‘i’, ‘do’,”n’t“), 43663), ((‘said’, ‘.’,”’”), 40815), ((‘.’, ‘i’, “‘m“), 35989), ((‘one’, ‘of’, ‘the’), 34520), ((‘don’, ‘�’, ‘t’), 33559), ((‘i’, ‘�’, ‘m’), 32830), ((‘.’,”’”, ‘i’), 30992), ((‘a’, ‘lot’, ‘of’), 29984), ((‘,’, ‘but’, ‘i’), 28837), ((‘,’, ‘and’, ‘the’), 25136), ((‘,’, “’‘“, ‘said’), 24427), ((‘.’, ‘it’, ‘was’), 24372), ((‘thanks’, ‘for’, ‘the’),23763), ((‘,’, ‘and’, ‘i’), 22637), ((‘it’,”’s“, ‘a’), 21273), ((‘i’, ‘ca’,”n’t“), 20858), ((‘,’,”’‘“, ‘he’),20666), ((‘!’, ‘:’, ‘)’), 20603), ((‘,’, ‘it’,”’s“), 20258), ((‘he’, ‘said’, ‘.’), 20178), ((‘.’, ‘if’, ‘you’),20132), ((‘.’, ‘i’, ‘have’), 19572), ((‘.’,”’”, ‘we’), 19094), ((‘.’, ‘it’, ‘is’), 18857), ((‘.’, ‘this’, ‘is’),18826), ((‘.’, “‘“, ‘it’), 18291), ((‘!’, ‘!’, ‘i’), 18276), ((‘to’, ‘be’, ‘a’), 18085), ((‘?’, ‘?’, ‘?’), 17630), ((‘going’, ‘to’, ‘be’), 17423), ((‘i’,”’m“, ‘not’), 17313), ((‘,’, ‘but’, ‘it’), 17280), ((”’‘“, ‘he’, ‘said’),16837), ((‘,’, ‘i’,”’m“), 16534), ((‘.’, ‘i’, ‘am’), 16491), ((‘.’, ‘i’, ‘was’), 16342), ((‘didn’, ‘�’, ‘t’),16121), ((‘ca’,”n’t“, ‘wait’), 15910), ((‘i’, ‘�’, ‘ve’), 15648), ((‘that’, ‘�’, ‘s’), 15596), ((‘!’, ‘i’,”’m“),15513), ((‘.’, ‘that’,”’s“), 15481), ((‘!’, ‘rt’, ‘:’), 15191), ((‘i’, ‘want’, ‘to’), 14984), ((‘the’, ‘end’, ‘of’),14876), ((‘out’, ‘of’, ‘the’), 14714), ((‘.’,”’‘“, ‘the’), 14636), ((‘.’, ‘i’, ‘�’), 14630), ((‘.’, ‘i’, ‘think’), 14597), ((‘,’, ‘according’, ‘to’), 14582), ((‘i’, ‘did’,”n’t“), 14332), ((‘.’, ‘it’, ‘�’), 14299), ((‘it’, ‘was’, ‘a’),14162), ((‘do’,”n’t“, ‘know’), 13957), ((‘as’, ‘well’, ‘as’), 13806), ((‘,’, ‘but’, ‘the’), 13771), ((‘some’, ‘of’, ‘the’),13632), ((‘.’, ‘in’, ‘the’), 13524), ((‘.’,”’”, "‘“), 13484), ((‘it’,”’s“, ‘not’), 13430), ((‘you’, ‘do’,”n’t“), 13425), ((‘:’, ‘-’, ‘)’), 13242), ((‘be’, ‘able’, ‘to’), 13059), ((‘.’, ‘:’, ‘)’), 12993), ((‘.’, ‘i’, ‘do’), 12852), ((‘.’, ‘`’, ‘the’), 12586), ((‘i’, ‘don’, ‘�’), 12352), ((‘part’, ‘of’, ‘the’), 12312), ((‘.’, ‘i’, ‘love’), 12126), ((‘``’, ‘it’,”’s“), 11913), ((‘can’, ‘�’, ‘t’), 11874), ((‘�’, ‘s’, ‘a’), 11810), ((‘i’, ‘have’, ‘a’), 11797), ((‘,’, ‘which’, ‘is’), 11462), ((‘.’, ‘thanks’, ‘for’), 11398), ((‘i’,”’ve“, ‘been’), 11273), ((‘the’, ‘rest’,‘of’),11224), ((‘i’, ‘have’, ‘to’), 11214), ((‘looking’, ‘forward’, ‘to’), 11194), ((‘,’, ‘and’, ‘it’), 10886), ((‘.’, ‘rt’, ‘:’),10782), ((‘do’,”n’t“, ‘have’), 10766), ((‘,’, ‘so’, ‘i’), 10715), ((‘doesn’, ‘�’, ‘t’), 10339), ((‘:’, ‘)’, ‘i’), 10256), ((‘.’, ‘i’,”’ve“), 10215), ((‘you’, ‘�’, ‘re’), 10205), ((‘is’, ‘going’, ‘to’), 10141), ((‘the’, ‘first’, ‘time’), 10134), ((‘thank’, ‘you’, ‘for’), 10080), ((‘a’, ‘couple’, ‘of’), 10028), ((‘!’, ‘it’,”’s"), 9929), ((’,‘, ’it’, ‘was’), 9912), ((‘.’, ‘he’, ‘was’), 9888), ((‘i’, ‘think’, ‘i’), 9831), ((‘this’, ‘is’, ‘a’), 9795), ((‘.’, ‘thank’, ‘you’), 9771)

import matplotlib.pyplot as plt

plt.clf()

# histogram of how often each distinct trigram occurs (note: use the trigram counter here)
n, bins, patches = plt.hist(list(counter_trigram.values()), 500, range=[0, 500])

plt.xlabel('Number of occurrences')
plt.ylabel('Number of trigrams')
plt.title('Histogram of trigram occurrences (3 files together)')
plt.savefig('hist_trigram_word_occurence.png')

Prediction Algorithm

As a next step, I plan to create a prediction model based on an N-gram language model, i.e. a technique that predicts the next word as the one with the highest probability of occurrence given the previous N-1 words, as estimated from the training data. In order to decide whether to use N equal to 1, 2, 3 or more, I plan to set aside part of the training set for evaluation and then test which choice of N assigns the highest probabilities to the ground-truth text. However, it will probably only be possible to test N less than or equal to 3, since with the limited hardware available it might not be feasible to handle larger values of N (the number of distinct 4-grams is very large).
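
As a rough illustration of the idea (a minimal sketch, not the final implementation), the function below uses the trigram counts computed above to suggest the most likely next words for a two-word context; counter_trigram is the Counter built in the trigram section, and the example call is hypothetical.

from collections import Counter
from nltk import word_tokenize

def predict_next_word(context, trigram_counter, top_n=3):
    """Suggest the top_n most frequent continuations of the last two tokens of context."""
    tokens = word_tokenize(context.lower())
    if len(tokens) < 2:
        return []
    w1, w2 = tokens[-2], tokens[-1]
    # keep the third token of every trigram that starts with the (w1, w2) context
    candidates = Counter({tri[2]: count
                          for tri, count in trigram_counter.items()
                          if tri[0] == w1 and tri[1] == w2})
    return [word for word, _ in candidates.most_common(top_n)]

# Example usage (hypothetical): predict_next_word('thanks for', counter_trigram)
# would return the most frequent continuations seen in the corpus, e.g. 'the'.

A full model would turn these counts into probabilities and back off to bigram and unigram counts when a given context has never been seen, which is what the evaluation on the held-out data will help tune.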

More information can be found at: https://web.stanford.edu/~jurafsky/slp3/3.pdf