In this project we need to show that we have become familiar with the data and that we can later build a prediction algorithm. We are asked to submit a report on RPubs (http://rpubs.com/) that presents an exploratory analysis and the goals for the eventual app and algorithm. The report should be concise, explain only the major features of the data identified so far, and briefly summarize the plans for creating the prediction algorithm and Shiny app in a way that is understandable to a non-data-scientist manager.
We should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to demonstrate familiarity with the data, report basic summary statistics and interesting findings, and outline the plans for the eventual prediction algorithm and Shiny app.
Since Python also has very good libraries for Natural Language Processing tasks, I will use a mix of R, Python, and Linux shell scripting to extract relevant information from the dataset.
First, let us see what the files look like by inspecting their first lines with the Linux command `head`.
Output of the `head en_US.twitter.txt` command
Output of the `head en_US.blogs.txt` command
Output of the `head en_US.news.txt` command
The number of lines and words in each file can be computed with the Linux command `wc`:
Output of the `wc` command
The table below summarizes the results:
| File Name | Line Count | Word Count |
|---|---|---|
| en_US.blogs.txt | 899,288 | 37,334,117 |
| en_US.news.txt | 1,010,242 | 34,365,936 |
| en_US.twitter.txt | 2,360,148 | 30,373,559 |
| Total | 4,269,678 | 102,073,612 |
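As a cross-check, the same line and word counts can also be reproduced in Python. The sketch below is only an illustration: it assumes the three files sit in the working directory, and the word counts may differ slightly from `wc` depending on how whitespace is handled.

```python
import os

# Assumed file names; adjust the paths to wherever the corpus was downloaded
files = ['en_US.blogs.txt', 'en_US.news.txt', 'en_US.twitter.txt']

for name in files:
    if not os.path.isfile(name):
        continue
    lines = 0
    words = 0
    # Read in binary mode to avoid decoding problems with the raw files
    with open(name, 'rb') as fin:
        for line in fin:
            lines += 1
            words += len(line.split())
    print(f'{name}: {lines} lines, {words} words')
```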
Let us now use Python and the NLTK library to compute not only the number of lines, but also the number of sentences in each file.
In order to run the Python commands, I first had to convert the text files using the following Linux command:

```bash
sed 's/\\0//g' en_US.twitter.txt > en_US.twitter_converted.txt
```
```r
library(reticulate)
use_condaenv("DataScienceSpec", required = TRUE)
```

```python
from nltk import word_tokenize, sent_tokenize
import os
import io

filenames = {'en_US.blogs': r'C:\Users\camargom\Downloads\Coursera\final\en_US\en_US.blogs_converted.txt',
             'en_US.news': r'C:\Users\camargom\Downloads\Coursera\final\en_US\en_US.news_converted.txt',
             'en_US.twitter': r'C:\Users\camargom\Downloads\Coursera\final\en_US\en_US.twitter_converted.txt'}

# for quick testing
#filenames = {'en_US.blogs': r'C:\Users\camargom\Downloads\Coursera\final\en_US\en_US.twitter_converted_small.txt',
#             'en_US.news': r'C:\Users\camargom\Downloads\Coursera\final\en_US\en_US.twitter_converted_small.txt'}

sentences = {}
# I have run this command to convert the file:
# sed 's/\\0//g' en_US.twitter.txt > en_US.twitter_converted.txt
for key in filenames:
    if os.path.isfile(filenames[key]):
        with io.open(filenames[key], encoding='utf8') as fin:
            text = fin.read()
            sentences[key] = sent_tokenize(text)
            print(f'{key} - Number of characters: {len(text)} - Number of sentences: {len(sentences[key])}')
```
The table below combines the sentence and character counts with the earlier line and word counts:
| File Name | Sentence Count | Line Count | Word Count | Character Count |
|---|---|---|---|---|
| en_US.blogs.txt | 2,083,684 | 899,288 | 37,334,117 | 207,723,793 |
| en_US.news.txt | 1,830,494 | 1,010,242 | 34,365,936 | 204,233,401 |
| en_US.twitter.txt | 2,811,218 | 2,360,148 | 30,373,559 | 164,456,392 |
| Total | 6,725,396 | 4,269,678 | 102,073,612 | 576,413,586 |
Now let us inspect the distribution of sentence lengths (measured here as the number of characters per sentence) in each of the three files.
```python
import numpy as np
import matplotlib.pyplot as plt

# Sentence lengths (in characters) for each corpus
len_sentences = {}
for key in sentences:
    len_sentences[key] = [len(i) for i in sentences[key]]

for key in len_sentences:
    plt.clf()
    n, bins, patches = plt.hist(len_sentences[key], 100, range=[0, 1000])
    plt.xlabel('Sentence length (characters)')
    plt.ylabel('Number of occurrences')
    plt.title(f'Histogram of {key} sentence length')
    plt.savefig(f'{key}_sentence_length.png')
```
As expected, sentences in the Twitter dataset are usually shorter than in the blogs and news datasets. The news dataset has the longest sentences on average, which is also expected given its more formal language.
Let us now compute the word frequency distribution for the whole dataset (all three files considered together). Here the 100 most common words, as well as the number of occurrences of each individual word, are computed.
```python
from collections import Counter

all_sentences = []
for key in sentences:
    all_sentences += sentences[key]

whole_text = (' '.join(filter(None, all_sentences))).lower()
all_tokens = word_tokenize(whole_text)

counter = Counter(all_tokens)
print('Most common words:')
counter.most_common(100)
```
100 most common words and number of occurrences:
(‘.’ : 4941686), (‘the’, 4760895), (‘,’, 4485534), (‘to’, 2753351), (‘and’, 2408096), (‘a’, 2377580), (‘of’, 2005044), (‘i’, 1999642), (‘in’, 1644631), (‘!’, 1480352), (‘it’, 1145713), (‘that’, 1122896), (‘for’, 1098683), (‘is’, 1094776), (‘you’, 1035878), (‘on’, 817046), ("‘s“, 793586), (‘with’, 714043), (‘was’, 642921), (‘:’, 615584), (‘?’, 607629), (‘my’, 603317), (”’‘“, 579315), (‘at’, 570578), (‘be’, 547600), (‘have’, 545448), (‘this’, 542991), (‘``’, 519072), (‘�’, 501647), (‘are’, 500812), (‘we’, 495163), (‘)’, 492366), (‘he’, 483023), (‘but’, 481914), (‘as’, 481313), (‘not’, 422372), (”n’t“, 408018), (‘from’, 383936), (‘so’, 380936), (‘(’, 375693), (‘do’, 368192), (‘me’, 367212), (‘they’, 361910), (‘…’, 341570), (‘all’, 330222), (‘will’, 315146), (‘by’, 313861), (‘or’, 309679), (‘said’, 304842), (‘just’, 303750), (‘what’, 302549), (‘your’, 302071), (‘his’, 301100), (‘an’, 298796), (‘about’, 295617), (‘out’, 294476), (‘one’, 291810), (‘up’, 291296), (‘if’, 278548), (‘#’, 273587), (‘can’, 271360), (‘like’, 269205), (‘has’, 265478), (‘when’, 264771), (‘who’, 260108), (‘there’, 247226), (‘more’, 243006), (‘s’, 236295), (‘had’, 235428), (‘get’, 226346), (‘she’, 219280), (‘would’, 218262), (‘time’, 213904), (‘her’, 207883), (‘their’, 204548), (‘some’, 201581), (‘no’, 198108), (‘new’, 193911), (‘been’, 187858), (‘our’, 185883), (”’m", 184462), (’�‘, 183299), (’were’, 182465), (‘�’, 180041), (‘good’, 178262), (‘now’, 178153), (‘how’, 178026), (‘-’, 175355), (‘day’, 168109), (‘know’, 162698), (‘people’, 161523), (‘them’, 160750), (‘love’, 160138), (‘did’, 153042), (‘$’, 149905), (‘;’, 145229), (‘which’, 143980), (‘back’, 141015), (‘than’, 140371), (‘go’, 139292)
```python
import matplotlib.pyplot as plt

plt.clf()
# Distribution of word frequencies: how many distinct words occur a given number of times
n, bins, patches = plt.hist(list(counter.values()), 500, range=[0, 500])
plt.xlabel('Number of occurrences')
plt.ylabel('Number of words')
plt.title('Histogram of word occurrences (3 files together)')
plt.savefig('hist_word_occurence.png')
```
Now we compute the same statistics, but for bigrams. Below one can see the 100 most common bigrams as well as the histogram of bigram frequencies.
```python
from nltk.util import ngrams

bi_grams = list(ngrams(all_tokens, 2))
counter_bigram = Counter(bi_grams)
counter_bigram.most_common(100)
```
100 most common bigrams and number of occurrences:
((‘of’, ‘the’), 428080), ((‘in’, ‘the’), 405701), ((‘.’, ‘i’), 401850), ((‘!’, ‘!’), 341329), ((‘,’, ‘and’), 325770), ((‘.’, ‘the’), 306619), ((‘�’, ‘s’), 228712), ((‘to’, ‘the’), 212163), ((‘,’, ‘but’), 207401), ((‘for’, ‘the’), 199942), ((‘,’, ‘the’), 197143), ((‘on’, ‘the’), 195152), ((‘,’, ‘i’), 195131), ((‘.’, ‘``’), 194845), ((‘i’, “‘m“), 183353), ((‘it’,”’s“), 180666), ((‘.’,”’‘“), 161915), ((‘to’, ‘be’), 161660), ((‘.’, ‘it’), 156437), ((‘,’,”’’”), 150300), ((‘at’, ‘the’), 142118), ((‘do’, “n’t”), 128071), ((‘and’, ‘the’), 124440), ((‘in’, ‘a’), 118707), ((‘�’, ‘t’), 113105), ((‘!’, ‘i’), 110721), ((‘with’, ‘the’), 105254), ((‘:’, ‘)’), 102587), ((‘it’, ‘was’), 100267), ((‘is’, ‘a’), 100198), ((‘and’, ‘i’), 97539), ((‘.’, ‘but’), 97311), ((‘said’, ‘.’), 96208), ((‘.’, ‘he’), 94841), ((‘,’, ‘a’), 94440), ((‘for’, ‘a’), 93483), ((‘i’, ‘have’), 92552), ((‘i’, ‘was’), 89246), ((‘if’, ‘you’), 86881), ((‘from’, ‘the’), 86573), ((‘.’, ‘and’), 83453), ((‘it’, ‘is’), 83069), ((‘with’, ‘a’), 81641), ((‘will’, ‘be’), 81042), ((‘,’, ‘it’), 80668), ((‘.’, ‘we’), 80260), ((‘going’, ‘to’), 79798), ((‘of’, ‘a’), 78654), ((‘i’, ‘am’), 76540), ((‘it’, ‘.’), 76158), ((‘have’, ‘a’), 73368), ((‘one’, ‘of’), 72568), ((‘is’, ‘the’), 72431), ((‘to’, ‘get’), 70378), ((‘as’, ‘a’), 68535), ((‘,’, ‘which’), 67953), ((‘.’, ‘in’), 67727), ((‘that’, “’s”), 66767), ((‘ca’, “n’t”), 65753), ((‘i’, ‘do’), 64563), ((‘want’, ‘to’), 63931), ((‘.’, ‘this’), 62716), ((‘but’, ‘i’), 62299), ((‘i’, ‘�’), 62125), ((‘have’, ‘to’), 61462), ((‘by’, ‘the’), 61019), ((‘.’, ‘you’), 60354), ((‘this’, ‘is’), 60272), ((‘that’, ‘i’), 59401), ((‘)’, ‘.’), 59127), ((‘that’, ‘the’), 58914), ((‘i’, ‘think’), 58305), ((‘to’, ‘do’), 58200), ((‘and’, ‘a’), 57828), ((‘,’, ‘who’), 57487), ((‘the’, ‘first’), 56960), ((‘was’, ‘a’), 56210), ((‘out’, ‘of’), 55935), ((‘rt’, ‘:’), 55299), ((‘to’, ‘a’), 55114), ((‘,’, ‘you’), 54871), ((‘.’, ‘they’), 54760), ((‘to’, ‘see’), 53886), ((‘.’, ‘�’), 53688), ((‘.’, ‘a’), 53295), ((‘on’, ‘a’), 53244), ((‘.’, ‘that’), 53214), ((‘,’, ‘he’), 53179), ((‘you’, “’re”), 53002), ((‘.’, ‘so’), 51872), ((‘.’, ‘if’), 51521), ((‘you’, ‘can’), 51349), ((‘i’, ‘love’), 50357), ((‘all’, ‘the’), 50243), ((‘it’, ‘�’), 49673), ((‘the’, ‘same’), 49442), ((‘,’, ‘we’), 49058), ((‘did’, “n’t”), 48956), ((‘?’, ‘i’), 48698), ((‘i’, “’ve”), 48489)
```python
import matplotlib.pyplot as plt

plt.clf()
# Distribution of bigram frequencies
n, bins, patches = plt.hist(list(counter_bigram.values()), 500, range=[0, 500])
plt.xlabel('Number of occurrences')
plt.ylabel('Number of bigrams')
plt.title('Histogram of bigram occurrences (3 files together)')
plt.savefig('hist_bigram_word_occurence.png')
```
We can see that a small number of words occurs very frequently, while the rest are used only rarely.
Here one can see the analysis for trigrams. The 100 most frequent trigrams as well as the histogram are shown below.
```python
from nltk.util import ngrams

tri_grams = list(ngrams(all_tokens, 3))
counter_trigram = Counter(tri_grams)
counter_trigram.most_common(100)
```
100 most common trigrams and number of occurrences:
((‘!’, ‘!’, ‘!’), 152308), ((‘it’, ‘�’, ‘s’), 48686), ((‘.’, ‘it’, “‘s“), 43891), ((‘i’, ‘do’,”n’t“), 43663), ((‘said’, ‘.’,”’”), 40815), ((‘.’, ‘i’, “‘m“), 35989), ((‘one’, ‘of’, ‘the’), 34520), ((‘don’, ‘�’, ‘t’), 33559), ((‘i’, ‘�’, ‘m’), 32830), ((‘.’,”’”, ‘i’), 30992), ((‘a’, ‘lot’, ‘of’), 29984), ((‘,’, ‘but’, ‘i’), 28837), ((‘,’, ‘and’, ‘the’), 25136), ((‘,’, “’‘“, ‘said’), 24427), ((‘.’, ‘it’, ‘was’), 24372), ((‘thanks’, ‘for’, ‘the’),23763), ((‘,’, ‘and’, ‘i’), 22637), ((‘it’,”’s“, ‘a’), 21273), ((‘i’, ‘ca’,”n’t“), 20858), ((‘,’,”’‘“, ‘he’),20666), ((‘!’, ‘:’, ‘)’), 20603), ((‘,’, ‘it’,”’s“), 20258), ((‘he’, ‘said’, ‘.’), 20178), ((‘.’, ‘if’, ‘you’),20132), ((‘.’, ‘i’, ‘have’), 19572), ((‘.’,”’”, ‘we’), 19094), ((‘.’, ‘it’, ‘is’), 18857), ((‘.’, ‘this’, ‘is’),18826), ((‘.’, “‘“, ‘it’), 18291), ((‘!’, ‘!’, ‘i’), 18276), ((‘to’, ‘be’, ‘a’), 18085), ((‘?’, ‘?’, ‘?’), 17630), ((‘going’, ‘to’, ‘be’), 17423), ((‘i’,”’m“, ‘not’), 17313), ((‘,’, ‘but’, ‘it’), 17280), ((”’‘“, ‘he’, ‘said’),16837), ((‘,’, ‘i’,”’m“), 16534), ((‘.’, ‘i’, ‘am’), 16491), ((‘.’, ‘i’, ‘was’), 16342), ((‘didn’, ‘�’, ‘t’),16121), ((‘ca’,”n’t“, ‘wait’), 15910), ((‘i’, ‘�’, ‘ve’), 15648), ((‘that’, ‘�’, ‘s’), 15596), ((‘!’, ‘i’,”’m“),15513), ((‘.’, ‘that’,”’s“), 15481), ((‘!’, ‘rt’, ‘:’), 15191), ((‘i’, ‘want’, ‘to’), 14984), ((‘the’, ‘end’, ‘of’),14876), ((‘out’, ‘of’, ‘the’), 14714), ((‘.’,”’‘“, ‘the’), 14636), ((‘.’, ‘i’, ‘�’), 14630), ((‘.’, ‘i’, ‘think’), 14597), ((‘,’, ‘according’, ‘to’), 14582), ((‘i’, ‘did’,”n’t“), 14332), ((‘.’, ‘it’, ‘�’), 14299), ((‘it’, ‘was’, ‘a’),14162), ((‘do’,”n’t“, ‘know’), 13957), ((‘as’, ‘well’, ‘as’), 13806), ((‘,’, ‘but’, ‘the’), 13771), ((‘some’, ‘of’, ‘the’),13632), ((‘.’, ‘in’, ‘the’), 13524), ((‘.’,”’”, "‘“), 13484), ((‘it’,”’s“, ‘not’), 13430), ((‘you’, ‘do’,”n’t“), 13425), ((‘:’, ‘-’, ‘)’), 13242), ((‘be’, ‘able’, ‘to’), 13059), ((‘.’, ‘:’, ‘)’), 12993), ((‘.’, ‘i’, ‘do’), 12852), ((‘.’, ‘`’, ‘the’), 12586), ((‘i’, ‘don’, ‘�’), 12352), ((‘part’, ‘of’, ‘the’), 12312), ((‘.’, ‘i’, ‘love’), 12126), ((‘``’, ‘it’,”’s“), 11913), ((‘can’, ‘�’, ‘t’), 11874), ((‘�’, ‘s’, ‘a’), 11810), ((‘i’, ‘have’, ‘a’), 11797), ((‘,’, ‘which’, ‘is’), 11462), ((‘.’, ‘thanks’, ‘for’), 11398), ((‘i’,”’ve“, ‘been’), 11273), ((‘the’, ‘rest’,‘of’),11224), ((‘i’, ‘have’, ‘to’), 11214), ((‘looking’, ‘forward’, ‘to’), 11194), ((‘,’, ‘and’, ‘it’), 10886), ((‘.’, ‘rt’, ‘:’),10782), ((‘do’,”n’t“, ‘have’), 10766), ((‘,’, ‘so’, ‘i’), 10715), ((‘doesn’, ‘�’, ‘t’), 10339), ((‘:’, ‘)’, ‘i’), 10256), ((‘.’, ‘i’,”’ve“), 10215), ((‘you’, ‘�’, ‘re’), 10205), ((‘is’, ‘going’, ‘to’), 10141), ((‘the’, ‘first’, ‘time’), 10134), ((‘thank’, ‘you’, ‘for’), 10080), ((‘a’, ‘couple’, ‘of’), 10028), ((‘!’, ‘it’,”’s"), 9929), ((’,‘, ’it’, ‘was’), 9912), ((‘.’, ‘he’, ‘was’), 9888), ((‘i’, ‘think’, ‘i’), 9831), ((‘this’, ‘is’, ‘a’), 9795), ((‘.’, ‘thank’, ‘you’), 9771)
```python
import matplotlib.pyplot as plt

plt.clf()
# Distribution of trigram frequencies
n, bins, patches = plt.hist(list(counter_trigram.values()), 500, range=[0, 500])
plt.xlabel('Number of occurrences')
plt.ylabel('Number of trigrams')
plt.title('Histogram of trigram occurrences (3 files together)')
plt.savefig('hist_trigram_word_occurence.png')
```
As a next step, I plan to create a prediction model based on an N-gram language model, a technique that predicts as the next word the one with the highest probability of occurrence given the preceding words, estimated from the training data. To decide whether to use N equal to 1, 2, 3 or more, I plan to set aside part of the training set for evaluation and then test which choice of N assigns the highest probabilities to the ground-truth text. However, it will probably only be feasible to test N up to 3: with limited hardware, larger values of N might not be practical, since the number of distinct 4-grams would be very large.
More information can be found at: https://web.stanford.edu/~jurafsky/slp3/3.pdf
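To make the plan more concrete, below is a minimal sketch of a next-word predictor that backs off from trigrams to bigrams to plain word frequencies. It is only an illustration, not the final algorithm: the function name and parameters are made up for this sketch, it reuses the `counter`, `counter_bigram`, and `counter_trigram` objects computed above, applies no smoothing, and scans the full counters on every query, which would be far too slow for the Shiny app (a real implementation would pre-index the counts by context).

```python
from collections import Counter
from nltk import word_tokenize

def predict_next_word(text, counter_trigram, counter_bigram, counter_unigram, top_k=3):
    """Suggest likely next words by simple backoff:
    trigram context first, then bigram context, then overall word frequency."""
    tokens = word_tokenize(text.lower())

    # 1) Use the last two words as a trigram context
    if len(tokens) >= 2:
        context = tuple(tokens[-2:])
        candidates = Counter({tri[2]: c for tri, c in counter_trigram.items()
                              if tri[:2] == context})
        if candidates:
            return [w for w, _ in candidates.most_common(top_k)]

    # 2) Back off to the last word as a bigram context
    if tokens:
        last = tokens[-1]
        candidates = Counter({bi[1]: c for bi, c in counter_bigram.items()
                              if bi[0] == last})
        if candidates:
            return [w for w, _ in candidates.most_common(top_k)]

    # 3) Back off to the most frequent words overall
    return [w for w, _ in counter_unigram.most_common(top_k)]

# Example usage with the counters computed earlier in this report:
# predict_next_word('thanks for the', counter_trigram, counter_bigram, counter)
```

For the evaluation described above, one would run this kind of predictor over held-out sentences and measure how often the correct next word appears among the top suggestions, comparing the different values of N.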