Background

This document is the first part of the Coursera Data Science Capstone offered by the Johns Hopkins Bloomberg School of Public Health and taught by Jeff Leek, PhD, Roger D. Peng, PhD, and Brian Caffo, PhD. The capstone applies data science to NLP (Natural Language Processing). Data sets from three sources (Twitter, blogs and news) are provided as a corpus for building a model that predicts the next word in a sequence of words based on the information in the corpus.

Data

The data can be downloaded from the Coursera web page or from here.

We can download the data from the URL and unzip it.

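library(ggplot2)
# Create a working directory and download the data set into it
dir.create('c:/Coursera', showWarnings = FALSE)
setwd('c:/Coursera')
# mode = 'wb' requests a binary transfer so the zip archive stays intact on Windows
download.file('https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip','c:/Coursera/Coursera-SwiftKey.zip', mode = 'wb')
# Extract the archive into the working directory
unzip('c:/Coursera/Coursera-SwiftKey.zip',exdir='.')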

We now take a closer look at the extracted files.

## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

There will be warnings when importing the files for the first time. For the Twitter file I manually removed the surplus elements using a text editor. For the news file I removed some characters from the file. After that I managed to read the data into R with no errors or warnings.

We can load the data using the readLines function.

con = 'c:/Coursera/final/en_US/en_US.news.txt'
news = readLines(con)
remove(con)

con = 'c:/Coursera/final/en_US/en_US.blogs.txt'
blogs = readLines(con)
remove(con)

con = 'c:/Coursera/final/en_US/en_US.twitter.txt'
twitter = readLines(con)
remove(con)
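
The manual clean-up described above may be avoidable. The following is a minimal sketch, assuming the warnings come from embedded NUL bytes or stray control characters (the report does not say which). It opens the file as a binary connection and passes skipNul = TRUE to readLines. The helper name read_lines_safely and the temporary demo file are hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not from the original report): read a text file through
# a binary-mode connection and skip embedded NUL bytes instead of warning.
read_lines_safely <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Self-contained demonstration on a temporary file that contains a NUL byte.
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("first li"), as.raw(0), charToRaw("ne\nsecond line\n")), tmp)
demo_lines <- read_lines_safely(tmp)
print(demo_lines)
```

If this works for the course files, a call such as read_lines_safely('c:/Coursera/final/en_US/en_US.twitter.txt') could replace the hand-edited copies, but that has not been verified against the original data here.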

Exploratory Data Analysis

Let us look at what the data actually looks like.

head(news)
## [1] "He wasn't home alone, apparently."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                                                                                                                                                                                                                                                                                                                                                         
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."                                                                                                                                                                                                                                                                                                                                 
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"                                                                                                                                                                                                                                                            
## [6] "There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to prime time -- eventually splitting off the first round to a separate day."
head(blogs)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"                                                                                    
## [6] "If you have an alternative argument, let's hear it! :)"
head(twitter)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."                                                                       
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [5] "Words from a complete stranger! Made my birthday even better :)"                                                
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"

We can use regular expressions to count the number of words in each of the files and a simple count of the lines to get the basic information about the differences in the data set.

# Approximate word count: runs of non-word characters per line, plus one
countwords = function(x) sum(sapply(gregexpr("\\W+", x), length) + 1)
summarydata = data.frame(
  dataset   = c(1, 2, 3),
  wordcount = c(countwords(news), countwords(blogs), countwords(twitter)),
  linecount = c(length(news), length(blogs), length(twitter)),
  row.names = c('news', 'blogs', 'twitter'))

summarydata
##         dataset wordcount linecount
## news          1  36876910   1010242
## blogs         2  39386844    899288
## twitter       3  32874052   2360148
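
The word counts above are approximations. As a small illustration on made-up lines (not taken from the corpus), the sketch below compares the separator-based count used in this report with a count of word-character runs. The second approach and the helper name count_tokens are assumptions added for comparison, not the method used for the table.

```r
# Toy lines, invented for illustration only.
toy <- c("Hello, world!", "hello", "one two three")

# Approach used in this report: runs of non-word characters per line, plus one.
approx_count <- sapply(gregexpr("\\W+", toy), length) + 1

# Alternative (assumption): count runs of word characters directly.
# gregexpr returns -1 when nothing matches, so only positive positions count.
count_tokens <- function(x) {
  sapply(gregexpr("\\w+", x), function(hit) sum(hit > 0))
}
token_count <- count_tokens(toy)

print(data.frame(line = toy, approx_count, token_count))
```

The separator-based count overstates lines that end in punctuation and lines without any separator, so the totals in the table are best read as rough magnitudes.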

This shows that although the number of lines is greatest for the Twitter data set, the total word count is greatest for the blogs data set. This makes sense, as tweets are usually much shorter than blog posts. Something to notice is that all three data sets are fairly large.
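
The difference in entry length can be made explicit by dividing the word counts by the line counts. The sketch below copies the figures from the summary table; the name counts and the column words_per_line are introduced only for this illustration.

```r
# Figures copied from the summary table in this report.
counts <- data.frame(
  wordcount = c(36876910, 39386844, 32874052),
  linecount = c(1010242, 899288, 2360148),
  row.names = c("news", "blogs", "twitter")
)

# Average number of (approximate) words per line, rounded to one decimal.
counts$words_per_line <- round(counts$wordcount / counts$linecount, 1)
print(counts)
```

With these figures a Twitter line averages roughly 14 words, against roughly 44 for blogs and 37 for news.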

Here is the comparison for the line count and word count of the data sets.
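
The charts themselves are not reproduced in this text. A minimal sketch of how such a comparison could be drawn with ggplot2 follows, assuming simple bar charts; the original plotting code is not shown in the report, so the chart type, titles, and the names plotdata, p_lines and p_words are assumptions.

```r
library(ggplot2)

# Figures copied from the summary table in this report.
plotdata <- data.frame(
  dataset   = c("news", "blogs", "twitter"),
  wordcount = c(36876910, 39386844, 32874052),
  linecount = c(1010242, 899288, 2360148)
)

# One bar per data set for the line count.
p_lines <- ggplot(plotdata, aes(x = dataset, y = linecount)) +
  geom_bar(stat = "identity") +
  labs(title = "Line count per data set", x = "Data set", y = "Lines")

# One bar per data set for the word count.
p_words <- ggplot(plotdata, aes(x = dataset, y = wordcount)) +
  geom_bar(stat = "identity") +
  labs(title = "Word count per data set", x = "Data set", y = "Words")
```

Calling print(p_lines) or print(p_words) draws the respective chart.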

That concludes the basic exploratory data analysis.

Preparations for model building and prediction

The next step is to define a sample size for each data set and draw samples, as working on the whole data set seems difficult right now. Once the sample is created I will further prepare it by removing words that are not useful for prediction. After the sample is completely cleaned I will create a term frequency matrix to see which terms are used most often in the data set. This will also be the beginning of building the unigrams for the n-gram model. Right now the plan is to use unigrams, bigrams and trigrams for the model. Interpolation will be used to calculate the probabilities of the next word in a sequence, and new words never seen in the corpus will be covered by using 'UNK' tokens for low-frequency words. After this is done the model will be transferred to the Shiny environment and tested there. Let's see how it goes!
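
To make the plan more concrete, here is a rough sketch of the intended approach on a tiny made-up corpus, written in base R. Everything in it is an assumption for illustration: the toy sentences, the helper names, the 'UNK' threshold of 2, and the fixed interpolation weights (in practice these would be tuned on held-out data). It is not the final model.

```r
# Toy corpus standing in for a cleaned sample (illustrative data only).
corpus <- c("i love data science", "i love new york", "i like data")

# Lowercase, split on whitespace, and add sentence boundary markers.
tokenize <- function(line) c("<s>", strsplit(tolower(line), "\\s+")[[1]], "</s>")
tokens <- lapply(corpus, tokenize)

# Replace words seen fewer than min_count times with the 'UNK' token.
min_count <- 2
word_freq <- table(unlist(tokens))
rare <- names(word_freq)[word_freq < min_count]
tokens <- lapply(tokens, function(tk) ifelse(tk %in% rare, "UNK", tk))

# Build the n-grams of one sentence, then count them over all sentences.
ngrams <- function(tk, n) {
  if (length(tk) < n) return(character(0))
  sapply(seq_len(length(tk) - n + 1),
         function(i) paste(tk[i:(i + n - 1)], collapse = " "))
}
count_ngrams <- function(n) {
  tab <- table(unlist(lapply(tokens, ngrams, n = n)))
  setNames(as.numeric(tab), names(tab))
}
uni <- count_ngrams(1); bi <- count_ngrams(2); tri <- count_ngrams(3)

# Look up a count, returning 0 for n-grams that were never seen.
get_count <- function(counts, key) if (key %in% names(counts)) counts[[key]] else 0
ratio <- function(num, den) if (den > 0) num / den else 0

# Number of word tokens, excluding the start marker (it is never predicted).
n_words <- sum(uni) - get_count(uni, "<s>")

# Linear interpolation of unigram, bigram and trigram estimates.
# lambda = weights for (unigram, bigram, trigram); assumed values summing to 1.
interp_prob <- function(w1, w2, w, lambda = c(0.1, 0.3, 0.6)) {
  p1 <- ratio(get_count(uni, w), n_words)
  p2 <- ratio(get_count(bi, paste(w2, w)), get_count(uni, w2))
  p3 <- ratio(get_count(tri, paste(w1, w2, w)), get_count(bi, paste(w1, w2)))
  sum(lambda * c(p1, p2, p3))
}

# Pick the most probable next word, never suggesting '<s>' or 'UNK'.
predict_next <- function(w1, w2) {
  candidates <- setdiff(names(uni), c("<s>", "UNK"))
  probs <- sapply(candidates, function(w) interp_prob(w1, w2, w))
  names(which.max(probs))
}

print(predict_next("i", "love"))
```

Mapping rare words to 'UNK' before counting gives unseen words a non-zero probability at prediction time, while excluding 'UNK' from the candidates keeps it from ever being suggested to the user.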