Synopsis

This document is a first step toward the Final Capstone Project of the Data Science specialization offered by JHU, which consists of building an app for text prediction. The scope of this first draft is to obtain the training data and carry out a general data exploration in order to understand word distributions, term relationships and the most frequent words in a text collection, as a first step toward building a suitable algorithm for the predictive text model.

The training dataset is available at this link. It contains several sets of files with text collected from three different web sources: blogs, news and Twitter. The analysis is limited to the English-language files. Data will be loaded into R and, because the files are considerably large, the strategy focuses on reducing their size by sampling the original files.

Getting data

Data is loaded into R with the fread function from the data.table package, which loads these text files considerably faster than readLines. fread prints the progress messages shown below while reading each file.

# Loading files into the R environment
        setwd("/Users/elobo/HomeRstudio/CAPSTONE/")
        EngFile1 <- data.table::fread("./final/en_US/en_US.blogs.txt", sep = "\n", header = FALSE)

Read 15.6% of 899288 rows
Read 33.4% of 899288 rows
Read 52.3% of 899288 rows
Read 63.4% of 899288 rows
Read 85.6% of 899288 rows
Read 96.7% of 899288 rows
Read 899049 rows and 1 (of 1) columns from 0.196 GB file in 00:00:08
        EngFile2 <- data.table::fread("./final/en_US/en_US.twitter.txt", sep = "\n", header = FALSE)

Read 5.1% of 2360148 rows
Read 33.5% of 2360148 rows
Read 46.6% of 2360148 rows
Read 57.6% of 2360148 rows
Read 83.5% of 2360148 rows
Read 2360052 rows and 1 (of 1) columns from 0.156 GB file in 00:00:13
        EngFile3 <- data.table::fread("./final/en_US/en_US.news.txt", sep = "\n", header = FALSE)

Read 26.7% of 1010242 rows
Read 29.7% of 1010242 rows
Read 66.3% of 1010242 rows
Read 75.2% of 1010242 rows
Read 78.2% of 1010242 rows
Read 1010228 rows and 1 (of 1) columns from 0.192 GB file in 00:00:12
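
The synopsis mentions reducing file size by sampling the original files, but the sampling code is not shown. A minimal sketch of one way to do it follows. The helper name sample_lines, the 2% fraction and the seed are assumptions made for illustration, not the author's actual settings.

```r
# Hypothetical helper: keep a random fraction of the lines of a character vector.
# The fraction and seed are illustrative assumptions.
sample_lines <- function(lines, fraction = 0.02, seed = 1234) {
  set.seed(seed)                                   # make the sample reproducible
  n_keep <- max(1, floor(length(lines) * fraction))
  idx <- sort(sample.int(length(lines), n_keep))   # keep the original line order
  lines[idx]
}

# Toy input standing in for the single text column returned by fread
toy <- paste("line", 1:1000)
reduced <- sample_lines(toy, fraction = 0.02)
length(reduced)
```

Sorting the sampled indices keeps the lines in file order, which matters later if the position of a term along the file is of interest.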

Exploratory Analysis

Pre-assessment

A quick overview of the original files' statistics and features, to get a rough idea of how well the raw files are structured.

[1] "File Comparison:"
File.Name          File_Size_MB  Line_Count  Word_Count  Mean.Line.Length  Max.Line.Length  Max.word.length
en_US.blogs.txt        200.4242      899049    37546246         230.03633            40833              165
en_US.twitter.txt      159.3640     2360052    30093406          68.67509             1795              126
en_US.news.tx          196.2775     1010228    34762395         201.03024            11384               84

From the table above we can infer that the Twitter file probably contains lines in which several entries have been joined: the application only allows 140-character entries, yet the longest line has 1795 characters. Let's check this by looking at the longest line:

[1] 506478
checking longest line in twitter file:
[1] "Good habits r not made on birthdays-The workshop of character is everyday life-The uneventful & commonplace is where the battle is won/lost\r\nThis is going to break me clean in two. This is going to bring me close to you.\r\nfor all *I* know you *could* be a terrorist!\r\nThank you so much for reading! You've got a treasure trove:)\r\nThat's why I love twitter.\r\nSorry. Meant to respond sooner. The name challenge is OK. We plan to kick it up a notch in two weeks. Have any good ideas?\r\nUh, yes? I mean. YES! All houses need bunk beds and a gumball machine Big style. Also, Zoltar machine.\r\nThe best mirror is an old friend. -English proverb\r\nMary Nell Bryant with Children's Hospital Alliance of Tennessee speaks to support One with Courage\r\nMonarch beats Castle View 14-10 in boys lax\r\nBusy week with a lot to do. Great weather helps!\r\nhey hey hey hey i love you. k bye (: <3\r\nMoney is money\r\nI ❤ NASHVILLE!\r\nHappy Birthday Kenny (: Love you too (:\r\nNo perfect game for Humber tonight\r\nRT We love you AND your big mouth! (Fave tweet of this librarian-centric day).\r\nI cannot bring myself to listen to anything but foreigner, Aerosmith, journey, or queen. Can you blame me?\r\nThanks to everyone who came to Film Journeys last night , it was a great film/discussion--we'll do it again next month.\r\nactually I'm thinking that being done for the day will be the winning strategy.\r\noh ya so thats midnight your time. yes they are playing it now and will then and I might sing live\r\nPlaying the entire Beastie Boys discography in the brewery today. #RIPMCA We'll miss you.\r\nThanks for the follow :) You may also follow me on We are happy to assist with any #Dell concerns. Have a great day! LM\r\n“: Freshmen students texted their mentors (juniors/seniors) to let them know how their first day of school was"
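
The check above can be sketched as follows. This is only an illustration on a toy vector; the variable names are assumptions, and the real input would be the column of Twitter lines loaded earlier.

```r
# Toy stand-in for the Twitter lines (hypothetical data)
tweets <- c("short tweet",
            "first entry\r\nsecond entry\r\nthird entry",
            "another one")

line_len <- nchar(tweets)        # number of characters per line
longest  <- which.max(line_len)  # index of the longest line
longest
line_len[longest]

# Split the suspicious line on embedded line breaks to recover the entries
entries <- strsplit(tweets[longest], "\r\n", fixed = TRUE)[[1]]
length(entries)
```

Splitting on the embedded "\r\n" sequence is one possible way to separate joined entries before any further cleaning.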

Pre-Processing

The goal of pre-processing is to clean the raw files as much as possible before further computation. This phase works on a reduced version of the files and applies the following steps:

  • Remove numbers
  • Remove punctuation
  • Remove stopwords
  • Remove extra whitespace
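
The steps above can be sketched in base R as shown below. This is not the author's code: the function name clean_text and the tiny stopword list are hypothetical, and the lowercasing step is an assumption based on the lowercase sample output.

```r
# Hypothetical stopword list, for illustration only
my_stopwords <- c("the", "a", "is", "to", "and")

clean_text <- function(x, stopwords = my_stopwords) {
  x <- tolower(x)                          # assumed step: normalise case
  x <- gsub("[[:digit:]]+", " ", x)        # remove numbers
  x <- gsub("[[:punct:]]+", " ", x)        # remove punctuation
  pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)  # remove stopwords (whole words only)
  x <- gsub("\\s+", " ", x)                # collapse repeated whitespace
  trimws(x)                                # drop leading/trailing whitespace
}

clean_text("The 3 dogs, and a cat, went to   the park!")
```

Replacing removed characters with a space rather than an empty string avoids gluing neighbouring words together; the final whitespace step then tidies the result.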

Checking some pre-processing results:

[1] "this story is about family whose members just want to live their lives in their own way without bothering anybody else there martin vanderhof aka grandpa his daughter penny and her husband paul sycamore their daughters alice and essie who is married to ed carmichael there also an assortment of other characters some who live in the house and some who don"
[1] "this day just gone bad since heard read the bad news bout vinny"
[1] "the attorney chatter comes as the special prosecutor in the case florida state attorney angela corey announced early tuesday evening that she is preparing to release new information in the case within the next hours"

Most Frequent words

To visualize the most frequent words across all three text collections, the cleaned text is tokenized. The plots below show the most frequent 1-grams, 2-grams and 3-grams.
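
The tokenization code is not shown in this report. A minimal base R sketch of building n-grams and counting their frequencies could look like this; make_ngrams is a hypothetical helper, and a dedicated text-mining package would be the usual choice for files of this size.

```r
# Hypothetical helper: build n-grams from a character vector of cleaned lines
make_ngrams <- function(lines, n = 2) {
  grams <- lapply(strsplit(lines, " ", fixed = TRUE), function(words) {
    words <- words[nzchar(words)]                 # drop empty tokens
    if (length(words) < n) return(character(0))   # line too short for an n-gram
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  unlist(grams)
}

toy <- c("new york city is big", "i love new york city", "new york is far")

# Frequency table of 2-grams, most frequent first
top_bigrams <- sort(table(make_ngrams(toy, 2)), decreasing = TRUE)
head(top_bigrams, 3)
```

N-grams are built line by line, so no n-gram spans two different entries.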

1-Grams

2-grams and 3-grams

Distribution of the most frequent features (2-grams, 3-grams)

Knowing which word combinations are most frequent, an interesting next step is to look at how they are distributed along the files. Helper functions were written that take a cleaned file and a target term as arguments and display the distribution of that term.

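# Comparing the "new york city" 3-gram for blogs and news
# (token_3dist is the author's own helper, not shown here; grid.arrange comes from gridExtra)
library(gridExtra)
pd1 <- token_3dist(rf1c, "new york city")
pd2 <- token_3dist(rf3c, "new york city")
grid.arrange(pd1, pd2, ncol = 2, nrow = 1)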

# checking other 2-grams inside the twitter file
token_2dist(rf2c,"can wait")

token_2dist(rf2c,"don know")

token_2dist(rf2c,"last year")

# checking a 3-gram on twitter
token_3dist(rf2c,"happy mother day")

# checking a 3-gram on news
token_3dist(rf3c,"president barack obama")
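
The definitions of token_2dist and token_3dist are not included in this report. As a rough idea of what such a helper might compute, the sketch below finds the line indices at which a term occurs and bins them into a histogram. The function name term_positions and the toy data are assumptions, and the real helpers may work quite differently.

```r
# Hypothetical helper: line indices at which a term occurs in a cleaned file.
# Note: this is a plain substring match, so it can also match inside longer words.
term_positions <- function(lines, term) {
  grep(term, lines, fixed = TRUE)
}

toy <- c("new york city is big", "nothing here", "i love new york city",
         "still nothing", "new york is far")
pos <- term_positions(toy, "new york city")
pos

# Bin the positions along the file; call plot(h) to draw the histogram
h <- hist(pos, breaks = seq(0, length(toy), by = 1), plot = FALSE)
h$counts
```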

Sample statistics

To estimate how the statistics of the sampled files may differ from those of the original files, the Central Limit Theorem gives an approximation of the population statistics. The idea is basically to take the distribution of an n-gram, draw repeated samples from it, and compute the mean of the sample means.

Let's check the case of bi-grams, from the most frequent to the less frequent:

Blog file statistic: 
2-grams: 1. dont know  2. new york  3. years ago  4. one day
  Sampling.mean Sampling.se
1      8586.042    47.05709
2      8785.932    64.01418
3      9886.075    64.74892
4      9279.580   120.23372
Twitter file statistic: 
2-grams: 1. can wait  2. right now  3. great day  4. mother day
  Sampling.mean Sampling.se
1      23626.83    63.92370
2      24370.97    66.16305
3      22513.84   212.59938
4      19814.29   229.94293
News file statistic: 
2-grams: 1. last year  2. new york  3. health care  4. last month
  Sampling.mean Sampling.se
1     10048.738    27.60952
2      9875.300    30.85482
3      9500.930    90.02062
4      9758.825    94.13708
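
How the figures above were computed is not shown. One way to obtain a sampling mean and standard error of this kind is sketched below, under the assumption that the variable of interest is the line index at which a term occurs. The function name sampling_stats, the number of samples and the sample size are all illustrative choices.

```r
# Hypothetical helper: sampling distribution of the mean of a numeric vector.
# Assumes `positions` has more than one element.
sampling_stats <- function(positions, n_samples = 1000, sample_size = 30, seed = 42) {
  set.seed(seed)
  # Draw repeated samples with replacement and keep each sample mean
  means <- replicate(n_samples,
                     mean(sample(positions, sample_size, replace = TRUE)))
  data.frame(Sampling.mean = mean(means),   # mean of the sample means
             Sampling.se   = sd(means))     # spread of the sample means
}

# Toy positions: a term occurring at 500 random lines of a 20,000-line file
set.seed(1)
positions <- sample.int(20000, 500)
stats <- sampling_stats(positions)
stats
```

By the Central Limit Theorem the sample means are approximately normally distributed around the mean of the positions, with a spread close to the standard deviation of the positions divided by the square root of the sample size.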

Exploratory findings

  • English uses many function words acting as auxiliaries in sentences, e.g. can, just, will, don't, does. These are therefore expected to be among the most frequent terms.

  • Combining words drastically reduces term frequency: 2-grams and 3-grams occur far less often than 1-grams, so the frequency summaries differ greatly.

  • For any model under consideration, it seems sensible to set aside words such as just, get, will, one, can, dont, do; a good approach would be to add them to a customized stopword list.

  • In the blog collection, words seem to relate in general to people, mood and time.

  • In the Twitter collection, most word combinations seem to relate to a particular event happening at some point in time.

  • From the CLT approximation we can infer that the population mean is likely to be located near the middle of the file's line index, and that the most frequent tokens show less variability around the mean.

Possible predictive model brainstorm

Once the data has been cleaned, the algorithm for predicting text could be something like this:

  • Check the correlation of features with the most frequent one and decide which features to select.
  • Take a feature matrix of the most frequent terms and evaluate accuracy on a classification problem.
  • Try some well-known ML algorithms and evaluate their performance and accuracy.
  • Another option would be to evaluate a Markov model over a document-term matrix built from the cleaned data. With this approach the idea is to build an n-gram model that uses the conditional probability P(X_n | X_1, ..., X_(n-1)).
  • Build a probability distribution for a random variable based on n-grams and check what the distribution says about the likelihood of the observed data.
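
To make the n-gram idea in the list above concrete, here is a minimal sketch of a bigram (first-order Markov) model using maximum-likelihood estimates, P(next | previous) = count(previous next) / count(previous). The function names and toy data are hypothetical, and a real model would need smoothing or back-off for unseen word pairs.

```r
# Hypothetical helper: estimate P(next | previous) from cleaned lines
build_bigram_model <- function(lines) {
  pairs <- do.call(rbind, lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
    w <- w[nzchar(w)]
    if (length(w) < 2) return(NULL)          # no word pair on this line
    data.frame(prev = w[-length(w)], nxt = w[-1], stringsAsFactors = FALSE)
  }))
  counts <- as.data.frame(table(prev = pairs$prev, nxt = pairs$nxt),
                          stringsAsFactors = FALSE)
  counts <- counts[counts$Freq > 0, ]        # keep observed pairs only
  # Divide each pair count by the total count of its first word
  counts$prob <- counts$Freq / ave(counts$Freq, counts$prev, FUN = sum)
  counts
}

# Hypothetical helper: most probable next word, or NA for an unseen word
predict_next <- function(model, word) {
  cand <- model[model$prev == word, ]
  if (nrow(cand) == 0) return(NA_character_)
  cand$nxt[which.max(cand$prob)]
}

toy <- c("i love new york", "new york city", "new york is far", "new ideas")
model <- build_bigram_model(toy)
predict_next(model, "new")
```

The same construction extends to higher-order n-grams by conditioning on the previous n-1 words instead of a single word.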