Data Science Capstone: Milestone Report

Introduction

The goal of the capstone project is to build a predictive text product. The first step in building a predictive text product is to understand the basic relationships in the data, in preparation for building linguistic models. The milestone report attempts to accomplish two tasks:

Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Getting Data - Download Data

# Download and unzip the data; change data_dir to match your own R environment path
data_dir <- path.expand("~/Coursera/01Capstone")
zip_file <- file.path(data_dir, "Coursera-SwiftKey.zip")
# Check, download and unzip against the same path so the steps stay consistent
if (!file.exists(zip_file)) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile = zip_file, mode = "wb")
  unzip(zip_file, exdir = data_dir)
}

The data set contains three sources (blogs, news, and twitter) for four different languages (German, English - US, Finnish, and Russian). The report will focus on the English - US.

# Read the blogs, news and twitter data into the R environment
blogs <- readLines("~/Coursera/01Capstone/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("~/Coursera/01Capstone/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("~/Coursera/01Capstone/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Summary Statistics

library(stringi)

library(dplyr)

# Get file sizes
blogsMB <- round(file.info("~/Coursera/01Capstone/final/en_US/en_US.blogs.txt")$size / 1024 ^ 2,1)
newsMB <- round(file.info("~/Coursera/01Capstone/final/en_US/en_US.news.txt")$size / 1024 ^ 2,1)
twitterMB <- round(file.info("~/Coursera/01Capstone/final/en_US/en_US.twitter.txt")$size / 1024 ^ 2,1)

# Get words in files
blogsWords <- stri_count_words(blogs)
newsWords <- stri_count_words(news)
twitterWords <- stri_count_words(twitter)

# Summary of the data sets
sumstat <-data.frame(source = c("blogs", "news", "twitter"),
           size = c(blogsMB, newsMB, twitterMB),
           lines = c(length(blogs), length(news), length(twitter)),
           words = c(sum(blogsWords), sum(newsWords), sum(twitterWords)))

sumstat <- mutate(sumstat, avgWordperLine = round(words/lines,1))

sumstat

##    source  size   lines    words avgWordperLine
## 1   blogs 200.4  899288 37546246           41.8
## 2    news 196.3   77259  2674536           34.6
## 3 twitter 159.4 2360148 30093410           12.8

Cleaning Data

As the samples below show, the data is unstructured:

# View 5 samples from each data set
sample(blogs, 5)

## [1] "The slide into the abyss is orchestrated to be played in tiny incremental variations so the disharmony is not evident to the casual listener. The score looks fine to those unacquainted with music and the accidentals are cleverly hidden within measures of beautiful melody."
## [2] "Before we get to the eye candy, I hit over 100 followers recently, so do intend a giveaway soon, I'm trying to figure out what to offer!"
## [3] "One of the keepers of the Bear Tower once told me that there is no animal so dangerous or so savage and unmanageable as the hybrid resulting when a fighting dog mounts a she-wolf. We are accustomed to think of the beasts of the forest and mountain as wild, and to think of the men who spring up, as it seems, from their soil as savage. But the truth is that there is a wildness more vicious (as we would know better if we were not so habituated to it) in certain domestic animals, despite their understanding so much human speech and sometimes even speaking a few words; and there is a more profound savagery in men and women whose ancestors have lived in cities and towns since the dawn of humanity. Vodalus, in whose veins flowed the undefiled blood of a thousand exultants - exarchs, ethnarchs, and starosts - was capable of violence unimaginable to the autochthons that stalked the streets of Thrax, naked beneath their huanaco cloaks."
## [4] "But since I want to get rid of corporate taxes I<U+0092>m not too invested in that argument. So what else do I have<U+0085>"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [5] "Side note, the commenting thing is totally still not fixed. I can comment on some blogs but not on others. So hopefully that'll be rectified by the time that I'm back."

sample(news, 5)

## [1] "If you're going to make that decision -- and sign your name to it -- you'd want to able to justify it to yourself. Kitzhaber decided he couldn't do that, writing, \"The single-best indicator of who will and will not be executed has nothing to do with the circumstances of a crime or the findings of a jury. The factor that determines whether someone sentenced to death in Oregon is actually executed is that they volunteer. That is a perversion of justice.\""
## [2] "<U+0093>A 1972 graduate of Millville Senior High School, Jane graduated from Elizabethtown College and later served for four years as an admissions counselor-recruiter for the college."                                                                                                                                                                                                                                                                                         
## [3] "Ratepayer outcry over bioswales and bikes has been fierce. Commissioner Amanda Fritz last reported that she's answered about 100 e-mails on this topic."
## [4] "But he made clear that he did not shun other free agent offers because he felt he owed something to the Blazers, who have paid him more than $19 million over the last four years to play 82 games."
## [5] "Stanford will have a deep linebacking corps. Interestingly, Jarek Lancaster and A.J. Tarpley - the first and third leading tacklers this season - could lose their starting jobs. Noor Davis, a 6-foot-4, 233-pound verbal commit from Florida, may take over at one outside spot."

sample(twitter, 5)

## [1] "If the Blackhawks beat the Lightning the #Caps will clinch the Southeast Division"
## [2] "It's all about the glorification and protection of criminals. Poverty and education are also at fault."
## [3] "small business ideas here"
## [4] "Why are there so many things I need/ want to buy??!"
## [5] "when will you be going into the studio again??"

Before exploratory analysis can be conducted, the data sources need to be preprocessed, or standardized. Standardizing the data involves removing special characters, numbers, and stop-words, and changing the text to lower case. One of the challenges faced was the removal of non-ASCII characters. Another challenge was that the removeSparseTerms() function uses a significant amount of memory when run. The illustrations below are based on a sample of less than 1% of the data set.
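The report does not show its cleaning code, so the following is only a minimal base R sketch of the standardizing steps described above. The helper names sample_lines and clean_text, the 0.5% sampling fraction, the seed, and the short stop-word list are all assumptions made for illustration, not the method actually used in the report.

```r
# Hypothetical helpers; names and defaults are illustrative, not from the report.

# Draw a reproducible random sample of lines (frac = 0.005 is an assumed value)
sample_lines <- function(x, frac = 0.005, seed = 1234) {
  set.seed(seed)
  sample(x, size = ceiling(length(x) * frac))
}

# Standardize text: drop non-ASCII, lower-case, strip numbers,
# special characters and a (deliberately tiny, assumed) stop-word list
clean_text <- function(x,
                       stop_words = c("the", "a", "an", "and", "of", "to", "in", "is")) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # remove non-ASCII characters
  x <- tolower(x)                                        # change to lower case
  x <- gsub("[0-9]+", " ", x)                            # remove numbers
  x <- gsub("[^a-z' ]", " ", x)                          # remove special characters
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)                # remove stop-words
  x <- gsub("\\s+", " ", x)                              # collapse repeated spaces
  trimws(x)
}

clean_text("The 3 Cats & a Dog!")
# [1] "cats dog"
```

Sampling before cleaning keeps the memory footprint small, which may help with the performance issue noted above; a real analysis would likely use a fuller stop-word list.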

Data Exploration

The first illustration depicts the most common words in the sample data set:

The second illustration depicts the most common contiguous word pairs in the sample data set:

The third illustration depicts the most common contiguous three-word sequences in the sample data set:
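As a rough sketch of how the frequencies behind these illustrations might be tabulated, the base R function below counts contiguous sequences of n words. The name ngram_counts is a hypothetical helper, not the report's own code, and it assumes the input lines have already been cleaned and that words are separated by whitespace.

```r
# Hypothetical helper: count contiguous n-word sequences in a character vector.
# Assumes the text is already cleaned and whitespace-separated.
ngram_counts <- function(txt, n = 1) {
  tokens <- strsplit(txt, "\\s+")
  grams <- unlist(lapply(tokens, function(w) {
    w <- w[nzchar(w)]                        # drop empty tokens
    if (length(w) < n) return(character(0))  # line too short for an n-gram
    idx <- seq_len(length(w) - n + 1)
    vapply(idx, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)      # most frequent first
}

txt <- c("a b a b", "a b c")
head(ngram_counts(txt, 1), 3)  # single words
head(ngram_counts(txt, 2), 3)  # word pairs
head(ngram_counts(txt, 3), 3)  # three-word sequences
```

The resulting tables can be passed to a plotting function such as barplot() to produce charts of the top entries. This loop-based version is meant for small samples only and would be slow on the full data set.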

Project Next Steps

The next steps of the project are to work through the performance challenges and begin building a predictive text model, followed by creating a predictive text product with Shiny.