Capstone Milestone Report

Abstract

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.

We analyse three corpora of US English text found online. We find that the blogs and news corpora are similar, whereas the Twitter corpus is different. We propose that this is the result of the 140-character limit on Twitter messages.

Introduction

The motivation for this project is to:

- Demonstrate that you’ve downloaded the data and have successfully loaded it in.
- Create a basic report of summary statistics about the data sets.
- Report any interesting findings that you amassed so far.
- Get feedback on your plans for creating a prediction algorithm and Shiny app.

In this report we look at three corpora of US English text: a set of internet blog posts, a set of internet news articles, and a set of Twitter messages.

We collect the following forms of information:

- file size
- number of lines
- number of non-empty lines
- number of words
- distribution of words (quantiles and plot)
- number of characters
- number of non-white characters

In the following section we describe the data collection process; the section after that gives the results of the data exploration; finally, we present conclusions and give references.

For our analysis we use the R computing environment [@R], as well as the libraries stringi [@stringi] and ggplot2 [@ggplot2]. In order to make the code more readable we use the pipe operator from the magrittr library [@magrittr]. This report is compiled using the rmarkdown [@rmarkdown] and knitr [@knitr] libraries. Finally, during writing we used the RStudio IDE [@RStudio].

Downloading Data

Download file if it does not exist, and unzip it.

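# single location for the archive, used for the check, the download and the unzip
zip_path <- "./data/Coursera-SwiftKey.zip"

if (!file.exists("./data/final")) {
  # make sure the target directory exists before downloading into it
  dir.create("./data", showWarnings = FALSE)

  if (!file.exists(zip_path)) {
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  destfile = zip_path, mode = "wb", quiet = TRUE)
  }

  unzip(zip_path, exdir = "./data")
}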

We only load the English data, as specified by the requirements.

blogs <- readLines("data/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul=TRUE)
twitter <-  readLines("data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul=TRUE)

# Reading the news file in text mode kept giving an 'incomplete final line found' warning,
# so we open the connection in binary mode instead
con <- file("data/final/en_US/en_US.news.txt", open="rb")
news <- readLines(con, encoding="UTF-8", skipNul=TRUE)
close(con)
rm(con)

Show some sample data.

head(blogs, 2)

## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan <U+0093>gods<U+0094>."
## [2] "We love you Mr. Brown."

head(twitter, 2)

## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."

head(news, 2)

## [1] "He wasn't home alone, apparently."                                                                                                                        
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."

Basic report of summary statistics

Before we analyse the files, we look at their sizes (in megabytes, MB, computed as bytes / 1024^2).

# file size (in MegaBytes/MB)
file.info("data/final/en_US/en_US.blogs.txt")$size   / 1024^2

## [1] 200.4242

file.info("data/final/en_US/en_US.news.txt")$size    / 1024^2

## [1] 196.2775

file.info("data/final/en_US/en_US.twitter.txt")$size / 1024^2

## [1] 159.3641

For our analysis we need two libraries.

# library for character string analysis
library(stringi)

# library for plotting
library(ggplot2)

We analyse the lines and characters.

# general statistics per corpus: lines, non-empty lines, characters, non-white characters
stri_stats_general( blogs )

##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539

stri_stats_general( news )

##       Lines LinesNEmpty       Chars CharsNWhite 
##     1010242     1010242   203223154   169860866

stri_stats_general( twitter )

##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162096241   134082806
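The line and character counts can be combined into an average line length per corpus. The following is a minimal sketch in base R that copies the figures from the output above by hand; the data frame `stats` and its column names are our own illustrative choices, not part of the original analysis.

```r
# Figures copied from the stri_stats_general() output above
stats <- data.frame(
  corpus = c("blogs", "news", "twitter"),
  lines  = c(899288, 1010242, 2360148),
  chars  = c(206824382, 203223154, 162096241)
)

# Mean number of characters per line in each corpus
stats$chars_per_line <- round(stats$chars / stats$lines, 1)
print(stats)
```

Blog and news lines average roughly 230 and 201 characters, while Twitter lines average about 69, which already hints at how different the Twitter corpus is.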

We count the words per line and summarize the distribution of these counts for each corpus, using summary statistics and a distribution plot. We start with the “blogs” corpus:

words_blogs   <- stri_count_words(blogs)
summary( words_blogs )

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00

b <- qplot(   words_blogs )
b

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Analyze the “news” corpus:

words_news    <- stri_count_words(news)
summary( words_news )

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.41   46.00 1796.00

qplot(   words_news )

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Analyze the “twitter” corpus:

words_twitter <- stri_count_words(twitter)
summary( words_twitter )

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00

qplot(   words_twitter )

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
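As the stat_bin messages point out, the three histograms rely on the default bin width, and each plot has its own axis range, which makes the corpora hard to compare. One possible alternative, sketched below, is to stack the counts in one data frame and draw faceted histograms with an explicit bin width. The toy values stand in for `words_blogs`, `words_news` and `words_twitter`; the names `word_counts` and `p` and the bin width of 5 are assumptions made for illustration.

```r
library(ggplot2)

# Toy stand-ins for words_blogs, words_news and words_twitter (illustrative values only)
word_counts <- data.frame(
  corpus = rep(c("blogs", "news", "twitter"), each = 5),
  words  = c(0, 9, 28, 60, 150,
             1, 19, 32, 46, 120,
             1, 7, 12, 18, 47)
)

# One panel per corpus, shared x axis, explicit bin width instead of the default
p <- ggplot(word_counts, aes(x = words)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~ corpus, ncol = 1) +
  labs(x = "Words per line", y = "Number of lines")
print(p)
```

With the real data, the very long blog lines would stretch the shared axis, so a cap on the x axis or a larger bin width may be needed.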

Interesting Findings

Although all three data sets feed the same prediction algorithm, they differ considerably from one another.

The Twitter data in particular is very different from the blogs and news data: a tweet has a median of 12 words and never more than 47, against medians of 28 and 32 words per line for blogs and news. This is probably due to the 140-character limit of Twitter, which also encourages heavy use of abbreviations. Abbreviations make the data set hard to clean and raise an open question for prediction: should we expand all abbreviations to their full form and add those to the corpus, or leave them as they are? Hashtags add to the difficulty, because people use them inconsistently (some as part of the sentence, some before or after the actual tweet). Twitter therefore seems to be the hardest case to handle.

One of the biggest challenges I foresee is defining the process and rules for cleaning the data. For example, Twitter and blog texts tend to contain URLs, which should ideally be removed from the corpus.
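A first cleaning pass could look roughly like the sketch below, written in base R. The helper `clean_lines` is hypothetical, the URL pattern is deliberately simple, and keeping the hashtag word while dropping the `#` is only one assumed answer to the hashtag question; none of this is a settled rule set.

```r
# Hypothetical helper: strip URLs and hashtag markers from raw text lines
clean_lines <- function(x) {
  x <- gsub("(https?://|www\\.)[^[:space:]]+", " ", x)  # drop URLs
  x <- gsub("#([[:alnum:]_]+)", "\\1", x)               # keep the hashtag word, drop '#'
  x <- gsub("[[:space:]]+", " ", x)                     # collapse runs of whitespace
  trimws(x)
}

cleaned <- clean_lines(c("Read this http://example.com/post now",
                         "so #happy today"))
print(cleaned)
```

On these two toy lines the result is "Read this now" and "so happy today".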

Plans

Finally, I plan to build a word prediction algorithm. I am still working on the model, but I plan to use some of the techniques presented in the previous courses. I will create a training data set to build the prediction model and then test the model on data held out from training.
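Since the model is not yet chosen, the following is only a sketch of one possible shape for this plan: a random train/test split followed by a simple bigram frequency lookup. The bigram approach, the 75/25 split, the toy corpus and the helper `predict_next` are all assumptions made for illustration, not the final design.

```r
set.seed(42)  # make the random split reproducible

# Toy corpus standing in for the cleaned lines (illustrative only)
corpus_lines <- c("i love you", "i love cake", "i love you too", "you love cake")

# Hold out part of the data for testing (assumed 75/25 split)
train_idx <- sample(seq_along(corpus_lines), size = floor(0.75 * length(corpus_lines)))
train <- corpus_lines[train_idx]
test  <- corpus_lines[-train_idx]

# Count bigrams (pairs of adjacent words) in the training lines
tokens  <- strsplit(tolower(train), "[[:space:]]+")
bigrams <- unlist(lapply(tokens, function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}))
counts <- table(bigrams)

# Hypothetical helper: most frequent word following `word`, or NA if unseen
predict_next <- function(word, counts) {
  hits <- counts[startsWith(names(counts), paste0(word, " "))]
  if (length(hits) == 0) return(NA_character_)
  sub("^[^ ]+ ", "", names(hits)[which.max(hits)])
}

predict_next("i", counts)
```

The held-out `test` lines could then be used to check how often the predicted word matches the word that actually follows.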