Milestone Report

Coursera Data Science Specialization Capstone

Nacho Rodriguez

Exploratory Analysis

(Getting and cleaning)

The first step of the exploratory analysis is getting a sample of each of the sets and cleaning it. For this purpose, the following iterative process has been followed (sketched in R after the list):

  • Reading a batch of lines from the original file (each line is randomly kept or discarded, for sampling)
  • Tokenizing each line into separate "words"
  • Cleaning each "word" using regular expressions (removing numbers, special characters, ...)
  • Unifying lower and upper case
  • Reading another batch of lines (back to Step 1)
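A minimal sketch of this loop in R. The file name, batch size, and 10% sampling rate are illustrative assumptions, not the exact values used:

```r
set.seed(1234)
con <- file("en_US.twitter.txt", "r")               # hypothetical input file
sample_tokens <- character(0)
repeat {
  batch <- readLines(con, n = 10000, skipNul = TRUE) # Step 1: read a batch of lines
  if (length(batch) == 0) break                      # stop at end of file
  keep <- batch[runif(length(batch)) < 0.1]          # randomly keep ~10% (assumed rate)
  tokens <- unlist(strsplit(keep, "\\s+"))           # Step 2: tokenize on whitespace
  tokens <- gsub("[^a-zA-Z']", "", tokens)           # Step 3: clean via regular expressions
  tokens <- tolower(tokens[tokens != ""])            # Step 4: unify case, drop empties
  sample_tokens <- c(sample_tokens, tokens)
}
close(con)
```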

Exploratory Analysis

(Summarizing)

Once the text is in a tokenized and cleaned format, an N-gram approach is used to summarize the information (a counting sketch follows the list):

  • 1-grams: Single-word appearances are counted
  • 2-grams: Exact pairs of words are counted
  • 3-grams: Exact 3-word groups are counted
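A minimal base-R sketch of this counting, assuming the `sample_tokens` vector from the cleaning step above; `rel` is the relative frequency in percent, matching the tables below:

```r
ngram_counts <- function(tokens, n) {
  # Build all consecutive n-word groups from the token stream
  grams <- vapply(seq_len(length(tokens) - n + 1),
                  function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                  character(1))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(gram  = names(counts),
             count = as.integer(counts),
             rel   = round(100 * as.integer(counts) / length(grams), 3))
}
unigrams <- ngram_counts(sample_tokens, 1)
bigrams  <- ngram_counts(sample_tokens, 2)
trigrams <- ngram_counts(sample_tokens, 3)
head(unigrams)   # most common single words and their relative frequency
```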

First results

Twitter set

  • The English-language Twitter training set has:
  • A size of 163 MB
  • 2,360,148 lines
  • 30,373,583 words
  • The longest line has 213 characters and the shortest one just 2 characters
  • The table shows the most common words and their relative frequency (%); a sketch of how these figures can be obtained follows the table
##      word1   rel
## 3005   the 3.173
## 3072    to 2.684
## 1443     i 2.417
## 8        a 2.062
## 3484   you 1.837
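A minimal sketch of how such summary figures can be computed in R; the file name is an assumption:

```r
lines <- readLines("en_US.twitter.txt", skipNul = TRUE)  # hypothetical file name
file.size("en_US.twitter.txt") / 2^20    # size in MB
length(lines)                            # number of lines
sum(lengths(strsplit(lines, "\\s+")))    # approximate word count
range(nchar(lines))                      # shortest and longest line lengths
```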

Twitter set

  • A quick look at the 1,000 most common words indicates a long-tail distribution, so a small subset of the words will probably be able to represent a large part of the total text (a coverage check is sketched below)

[Figure: relative frequency of the 1,000 most common words in the Twitter set]
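A minimal sketch of quantifying that claim, assuming the `unigrams` data frame from the counting step above: the cumulative relative frequency of the top-k words gives the share of the text they cover.

```r
coverage <- cumsum(unigrams$rel)  # cumulative % of all word occurrences
coverage[1000]                    # text share covered by the 1,000 most common words
```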

First results

Blogs set

  • The English-language blogs training set has:
  • A size of 205 MB
  • 899,299 lines
  • 37,334,131 words
  • The longest line has 40,835 characters and the shortest one just 1 character
  • The table shows the most common words and their relative frequency (%)
##      word1   rel
## 7203   the 5.010
## 270    and 2.938
## 7316    to 2.904
## 16       a 2.440
## 4900    of 2.343

[Figure: relative frequency of the most common words in the blogs set]

Blogs set

  • Even looking at the 10,000 most common words indicates a long-tail distribution, so a small subset of the words will probably be able to represent a large part of the total text

[Figure: relative frequency of the 10,000 most common words in the blogs set]

First results

News set

  • The English-language news training set has:
  • A size of 201 MB
  • 1,010,243 lines
  • 34,372,530 words
  • The longest line has 11,384 characters and the shortest one just 1 character
  • The table shows the most common words and their relative frequency (%)
##      word1   rel
## 6965   the 5.904
## 7076    to 2.751
## 11       a 2.636
## 277    and 2.611
## 4762    of 2.324

[Figure: relative frequency of the most common words in the news set]

News set

  • Again, the news training set shows a similar long-tail word distribution

[Figure: word frequency distribution for the news set]

Next Steps

  • Putting together the N-gram analyses from each source
  • A first predictive approach will simply be based on the most frequent N-grams (sketched below)
  • Example: if "The White House" appears 10 times in the training set and "The White Sox" appears 8 times, then for the app input "The White" the output will be "House" (and not "Sox")
  • Memory footprint and speed should also be taken into consideration
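A minimal sketch of that frequency-based prediction, assuming the `trigrams` data frame from the counting step above (names and structure are illustrative assumptions, not the final app code):

```r
predict_next <- function(input, trigrams) {
  prefix <- tolower(trimws(input))                            # e.g. "the white"
  hits <- trigrams[startsWith(trigrams$gram, paste0(prefix, " ")), ]
  if (nrow(hits) == 0) return(NA_character_)                  # no match in the table
  best <- hits$gram[which.max(hits$count)]                    # most frequent matching trigram
  tail(strsplit(best, " ")[[1]], 1)                           # its last word is the prediction
}
predict_next("The white", trigrams)   # expected: "house"
```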