This is a report for the capstone project of the Coursera Johns Hopkins Data Science Specialization.

The motivation for this report is to:

  1. Demonstrate that I’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that have amazed me so far.
  4. Get feedback on my plans for creating a prediction algorithm and Shiny app.

a) File exploration

With the help of the Linux command wc (https://en.wikipedia.org/wiki/Wc_(Unix)) we find the following:

filesSummary
##      file linescount wordscount charactercount
## 1 twitter    2360148   30359852      166843164
## 2    news    1009907   34365936      205243643
## 3   blogs     898584   37334114      208623085
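
As a cross-check, these counts could also be gathered from within R by shelling out to wc (a minimal sketch; the en_US.*.txt file names are assumptions about the raw corpus files):

files <- c(twitter = "en_US.twitter.txt",
           news    = "en_US.news.txt",
           blogs   = "en_US.blogs.txt")
# wc -l -w -m prints lines, words and characters for each file
counts <- t(sapply(files, function(f) {
  out <- system2("wc", args = c("-l", "-w", "-m", f), stdout = TRUE)
  as.numeric(strsplit(trimws(out), " +")[[1]][1:3])
}))
filesSummary <- data.frame(file = names(files),
                           linescount = counts[, 1],
                           wordscount = counts[, 2],
                           charactercount = counts[, 3])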

However, if we open the files we see several issues that have to be faced when creating n-grams to build our model: mixed upper and lower case, digits, special characters, stray punctuation, and runs of whitespace.

For these reasons we have to clean and tidy our data.

b) Data tidying

With the help of the Unix tools:
* tr
* awk
* sed

we decided to (an equivalent R sketch follows the list):

  1. convert all characters to lowercase
  2. remove special characters
  3. remove digits
  4. remove punctuation except apostrophes and dashes
  5. collapse runs of multiple spaces into a single space
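
As a rough illustration, the same five steps can be expressed in R (this is an equivalent sketch, not the actual tr/awk/sed commands used):

# Illustrative R equivalent of the tr/awk/sed cleaning pipeline
clean_line <- function(x) {
  x <- tolower(x)                              # 1. lowercase
  x <- iconv(x, "UTF-8", "ASCII", sub = " ")   # 2. map special characters to spaces
  x <- gsub("[0-9]", " ", x)                   # 3. remove digits
  x <- gsub("[^a-z' -]", " ", x)               # 4. keep only letters, apostrophe, dash
  x <- gsub(" +", " ", x)                      # 5. collapse runs of spaces
  trimws(x)
}
clean_line("Call me at 555-0100, OK?")  # returns "call me at - ok"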

c) More data exploration

Having cleaned our files, we created for each data file a new file containing unique word frequencies. E.g. for Twitter the first two lines are:
936331, the
787437, to
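
One way to build such a frequency list (an illustrative R sketch; twitter.clean.txt is an assumed name for the cleaned file):

# Count unique word frequencies for one cleaned file (illustrative)
lines <- readLines("twitter.clean.txt", warn = FALSE)  # assumed file name
words <- unlist(strsplit(lines, " ", fixed = TRUE))
freqs <- sort(table(words[words != ""]), decreasing = TRUE)
head(freqs, 2)  # the two most frequent words, as in the sample above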

We load these frequency files into R with data.table:

library(data.table)

# Read each "freq, word" file and give the columns descriptive names
twitWords <- fread("twitter.unicount.txt", sep = ",", header = FALSE,
                   colClasses = c("numeric", "character"),
                   stringsAsFactors = FALSE, na.strings = "NA")
setnames(twitWords, names(twitWords), c("freq", "word"))

newsWords <- fread("news.unicount.txt", sep = ",", header = FALSE,
                   colClasses = c("numeric", "character"),
                   stringsAsFactors = FALSE, na.strings = "NA")
setnames(newsWords, names(newsWords), c("freq", "word"))

blogsWords <- fread("blogs.unicount.txt", sep = ",", header = FALSE,
                    colClasses = c("numeric", "character"),
                    stringsAsFactors = FALSE, na.strings = "NA")
setnames(blogsWords, names(blogsWords), c("freq", "word"))

We move on to explore our tables:

head(twitWords,10)
##       freq word
##  1: 936331  the
##  2: 787437   to
##  3: 726051    i
##  4: 614545    a
##  5: 548560  you
##  6: 438176  and
##  7: 385140  for
##  8: 378136   in
##  9: 359232   of
## 10: 358756   is
head(newsWords,10)
##        freq  word
##  1: 1971977   the
##  2:  901179    to
##  3:  891338     a
##  4:  885278   and
##  5:  771093    of
##  6:  674213    in
##  7:  351244   for
##  8:  350523  that
##  9:  284130    is
## 10:  266868    on
head(blogsWords,10)
##        freq  word
##  1: 1857944   the
##  2: 1092742   and
##  3: 1066296    to
##  4:  901848     a
##  5:  875332    of
##  6:  832173     i
##  7:  594343    in
##  8:  472671  that
##  9:  442779    it
## 10:  432459    is

We note the close similarity of the top words across the three files.

Next we look at the distribution of our word frequencies:

summary(twitWords)
##       freq              word          
##  Min.   :     1.0   Length:388043     
##  1st Qu.:     1.0   Class :character  
##  Median :     1.0   Mode  :character  
##  Mean   :    76.8                     
##  3rd Qu.:     3.0                     
##  Max.   :936331.0
summary(newsWords)
##       freq               word          
##  Min.   :      1.0   Length:319395     
##  1st Qu.:      1.0   Class :character  
##  Median :      2.0   Mode  :character  
##  Mean   :    106.1                     
##  3rd Qu.:      6.0                     
##  Max.   :1971977.0
summary(blogsWords)
##       freq               word          
##  Min.   :      1.0   Length:361688     
##  1st Qu.:      1.0   Class :character  
##  Median :      1.0   Mode  :character  
##  Mean   :    103.4                     
##  3rd Qu.:      4.0                     
##  Max.   :1857944.0

We note that, as expected, our text is dominated by a small percentage of the total unique words.
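
To quantify this, we can ask how many distinct words are needed to cover a given share of all Twitter tokens (a quick sketch; twitWords is already sorted by decreasing frequency, as the head() output above shows):

cover <- cumsum(twitWords$freq) / sum(twitWords$freq)
which(cover >= 0.5)[1]  # distinct words covering 50% of all tokens
which(cover >= 0.9)[1]  # distinct words covering 90% of all tokens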

We can also see from the following histogram that our data is highly skewed:

hist(twitWords[,freq])

The same result is visible with a simple plot:

plot(twitWords[,freq])

Even if we restrict ourselves to the top 100 words, the distribution is still highly skewed:

plot(twitWords[1:100,freq])
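
The skewness is easier to appreciate on logarithmic axes (an optional view, not part of the plots above):

# Rank-frequency plot on log-log axes
plot(seq_len(nrow(twitWords)), twitWords[, freq], log = "xy",
     xlab = "word rank", ylab = "frequency")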

d) Next steps

We created n-grams for each text file.
Again the distribution is very skewed. We should take this into account when building our model and keep only frequent n-grams in which only the most frequent words appear.
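
For illustration, a minimal bigram count in R might look like this (a sketch assuming the cleaned file is named twitter.clean.txt; the actual n-gram pipeline may differ):

library(data.table)
lines  <- readLines("twitter.clean.txt", warn = FALSE)  # assumed file name
tokens <- strsplit(lines, " ", fixed = TRUE)
bigrams <- unlist(lapply(tokens, function(w) {
  if (length(w) < 2) return(character(0))
  paste(w[-length(w)], w[-1])              # adjacent word pairs
}))
bigramDT <- data.table(ngram = bigrams)[, .N, by = ngram][order(-N)]
head(bigramDT)  # most frequent bigrams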