Data Scientist Capstone Project: 1st milestone

Synopsis

The final goal of this project is to predict the next word given a phrase. This first milestone document covers the following topics:
- how the Coursera SwiftKey dataset is obtained;
- the first basic summaries of the dataset;
- my first reflections on which problems to deal with and which to ignore;
- plans for a machine learning algorithm to predict the next word given a phrase.

Introduction

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.

The motivation for this project is to:
1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.

Setting working directory

setwd("C:/Users/Dcollin/Desktop/Capstone")
list.files("final/en_US")

## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
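
A minimal sketch of how the presence of the three files could be checked without relying on setwd(). The helper name check_corpus_files and its data_dir argument are assumptions made for illustration; they are not part of the original analysis.

```r
# Hypothetical helper: report which of the expected corpus files are present.
# The function name and the data_dir argument are illustrative assumptions.
check_corpus_files <- function(data_dir) {
  expected <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
  present <- file.exists(file.path(data_dir, expected))
  names(present) <- expected
  present
}

# Demo on a temporary directory that contains only one of the three files
demo_dir <- file.path(tempdir(), "en_US_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("hello world", file.path(demo_dir, "en_US.news.txt"))
found <- check_corpus_files(demo_dir)
print(found)
```

With the real data, the call would presumably be check_corpus_files("final/en_US"), and all three entries should come back TRUE.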

2. Create a basic report of summary statistics about the data sets.

Setting file path variables

news<-"final/en_US/en_US.news.txt"
blogs<-"final/en_US/en_US.blogs.txt"
twitter<-"final/en_US/en_US.twitter.txt"

Reading file information

file.info(news)

##                                 size isdir mode               mtime
## final/en_US/en_US.news.txt 205811889 FALSE  666 2014-07-22 05:13:04
##                                          ctime               atime exe
## final/en_US/en_US.news.txt 2017-11-25 16:40:57 2017-11-25 16:40:57  no

file.info(blogs)

##                                  size isdir mode               mtime
## final/en_US/en_US.blogs.txt 210160014 FALSE  666 2014-07-22 05:13:05
##                                           ctime               atime exe
## final/en_US/en_US.blogs.txt 2017-11-25 16:41:09 2017-11-25 16:41:09  no

file.info(twitter)

##                                    size isdir mode               mtime
## final/en_US/en_US.twitter.txt 167105338 FALSE  666 2014-07-22 05:12:58
##                                             ctime               atime exe
## final/en_US/en_US.twitter.txt 2017-11-25 16:40:46 2017-11-25 16:40:46  no
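
The sizes above are in bytes, which is hard to read at a glance. The following sketch converts the three reported sizes to mebibytes; the vector names size_bytes and size_mb are introduced here for illustration only.

```r
# Sizes in bytes, copied from the file.info() output above
size_bytes <- c(news = 205811889, blogs = 210160014, twitter = 167105338)

# Convert to mebibytes (1 MiB = 1024^2 bytes) and round to two decimals
size_mb <- round(size_bytes / 1024^2, 2)
print(size_mb)

# For files on disk, the same figures could be obtained directly with
# file.size(), for example: file.size(news) / 1024^2
```

Each file is therefore roughly 160 to 200 MiB, which is worth keeping in mind before loading all three into memory at once.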

Reading files

f_blogs<-readLines(blogs, encoding="UTF-8")
f_twitter<-readLines(twitter, encoding="UTF-8")

## Warning in readLines(twitter, encoding = "UTF-8"): line 167155 appears to
## contain an embedded nul

## Warning in readLines(twitter, encoding = "UTF-8"): line 268547 appears to
## contain an embedded nul

## Warning in readLines(twitter, encoding = "UTF-8"): line 1274086 appears to
## contain an embedded nul

## Warning in readLines(twitter, encoding = "UTF-8"): line 1759032 appears to
## contain an embedded nul

# Import news dataset in binary mode
con <- file(news, open="rb")
f_news <- readLines(con, encoding="UTF-8")
close(con)
rm(con)
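
The two reading strategies above could be combined into one helper, sketched below. It opens every file through a binary connection, as done for the news file, and passes skipNul = TRUE to readLines() so that embedded nul bytes are dropped instead of triggering the warnings shown for the twitter file. The name read_corpus is hypothetical, and whether dropping nuls is acceptable for the later analysis is an assumption.

```r
# Hypothetical helper (not part of the original report): read a text file
# through a binary connection and drop embedded nul bytes while reading.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # close the connection even if readLines() fails
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Demo: a small file whose second line contains an embedded nul byte
demo_file <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x61, 0x62, 0x0a, 0x63, 0x00, 0x64, 0x0a)), demo_file)
demo_lines <- read_corpus(demo_file)
print(demo_lines)
```

With the variables defined earlier, the three corpora could then be read as read_corpus(blogs), read_corpus(news) and read_corpus(twitter).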

Extract head and tail samples from the 3 files

head(f_news,3)

## [1] "He wasn't home alone, apparently."                                                                                                                                                
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                        
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."

tail(f_news,3)

## [1] "But I'm in the mood. After six or more months of chill and ice crystals in Northeast Ohio, the ground is soft and fragrant. Seemingly overnight, things are growing as if we were in the tropics. We are again producing fruit of the earth: sweet corn, mightily fragrant herbs, deep green and tender broccoli."                                                      
## [2] "That starts this Sunday at Chivas. The Goats aren't a great team, but they just beat one (a 1-0 win over Salt Lake at Rio Tinto). They also have the one player who can rival Roger Espinoza as \"The Best Guy in MLS That No One Talks About Because He Doesn't Play in New York, LA or the Pacific Northwest\" in goalkeeper Dan Kennedy. These will be tough points."
## [3] "The only outwardly religious adornment was a billboard-sized banner with an image of Our Lady of Charity, patron saint of Cuba, hanging on the side of the National Library."

head(f_blogs,3)

## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan <U+0093>gods<U+0094>."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."

tail(f_blogs,3)

## [1] "Plus, I have also been allowing myself not to get <U+0091>stressed<U+0092> over things that have not been done! If the ironing is not done right now, it<U+0092>s not the end of the world! If that phone call is made tomorrow rather than today, then that<U+0092>s OK too! Living in the moment and allowing myself the time to get <U+0091>back to feeling great<U+0092>!"
## [2] "(5) What's the barrier to entry and why is the business sustainable?"                                                                                                                                                                                                                                                               
## [3] "In response to an over-whelming number of comments we sat down and created a list of do (s) and don<U+0092>t (s) <U+0096> these recommendations are easy to follow and except for - adding some herbs to your rinse . So let<U+0092>s get begin<U+0085>"

head(f_twitter,3)

## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."

tail(f_twitter,3)

## [1] "u welcome"                                                                                                     
## [2] "It is #RHONJ time!!"                                                                                           
## [3] "The key to keeping your woman happy= attention, affection, treat her like a queen and sex her like a pornstar!"

Basic Statistics

For our analysis we need two libraries.

# library for character string analysis
library(stringi)
# library for plotting
library(ggplot2)

Analyze lines and characters:

stri_stats_general(f_blogs)

##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539

stri_stats_general(f_news)

##       Lines LinesNEmpty       Chars CharsNWhite 
##     1010242     1010242   203223154   169860866

stri_stats_general(f_twitter)

##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162096031   134082634
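
To make the three outputs easier to compare, the line and character counts could be gathered into a single table. The sketch below uses base R only, so its character counts may differ slightly from those of stri_stats_general(); the helper name corpus_summary and the toy input are assumptions for illustration.

```r
# Hypothetical helper: one row of line and character counts per corpus.
# `corpora` is a named list of character vectors (one element per line).
corpus_summary <- function(corpora) {
  data.frame(
    corpus = names(corpora),
    lines = vapply(corpora, length, integer(1)),
    chars = vapply(corpora, function(x) as.numeric(sum(nchar(x, type = "chars"))),
                   numeric(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Demo on toy data, not on the real corpora
toy <- list(
  blogs = c("We love you Mr. Brown.", "u welcome"),
  twitter = c("It is #RHONJ time!!")
)
toy_summary <- corpus_summary(toy)
toy_summary$chars_per_line <- toy_summary$chars / toy_summary$lines
print(toy_summary)
```

On the real data the call would presumably be corpus_summary(list(blogs = f_blogs, news = f_news, twitter = f_twitter)).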

Count the words per line and summarize the distribution of these counts per corpus, using summary statistics and a distribution plot. Analyze the “blogs” corpus:

words_blogs <- stri_count_words(f_blogs)
summary(words_blogs)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00

qplot(words_blogs)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analyze the “news” corpus:

words_news <- stri_count_words(f_news)
summary(words_news)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.41   46.00 1796.00

qplot(words_news)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analyze the “twitter” corpus:

words_twitter <- stri_count_words(f_twitter)
summary(words_twitter)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00

qplot(words_twitter)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
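
The three summaries are easier to compare side by side. The sketch below builds one table with a row per corpus. It counts words by splitting on whitespace, which is an assumption made to keep the example in base R; stri_count_words() follows different word-boundary rules and can return different counts. The names count_words and summarise_counts are hypothetical.

```r
# Approximate word counts by splitting on whitespace (an assumption; the
# report itself uses stri_count_words(), which can count differently).
count_words <- function(x) lengths(strsplit(trimws(x), "\\s+"))

# Hypothetical helper: one row of summary statistics per corpus.
# `counts` is a named list of numeric vectors of words per line.
summarise_counts <- function(counts) {
  t(vapply(counts, function(x) {
    c(min = min(x), median = stats::median(x), mean = mean(x), max = max(x))
  }, numeric(4)))
}

# Demo on a few lines taken from the samples shown earlier
toy <- list(
  news = c("He wasn't home alone, apparently.", "We love you Mr. Brown."),
  twitter = c("u welcome", "It is #RHONJ time!!",
              "they've decided its more fun if I don't.")
)
toy_counts <- lapply(toy, count_words)
word_table <- summarise_counts(toy_counts)
print(word_table)
```

With the vectors computed above, the input would presumably be list(blogs = words_blogs, news = words_news, twitter = words_twitter).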

Intermediate Conclusion for Milestone Report

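The distributions of words per line in the “blogs” and “news” corpora are similar: both are right-skewed with long tails (maxima of 6726 and 1796 words) and appear to be roughly log-normal. The “twitter” corpus is different: its lines are much shorter (median 12 words, maximum 47) as a result of the 140-character limit.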
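
One way to probe the log-normal impression is to look at the counts on the log scale: if the counts are roughly log-normal, their logarithms should be roughly normal, so the mean and median of the logged counts should be close. The sketch below runs on simulated data only; the helper name log_scale_summary is hypothetical, and this check is a suggestion rather than something the analysis above performed.

```r
# Hypothetical helper: summarise word counts on the log scale.
log_scale_summary <- function(counts) {
  logged <- log(counts[counts > 0])  # drop zero-word lines; log(0) is -Inf
  c(mean = mean(logged),
    median = stats::median(logged),
    sd = stats::sd(logged))
}

# Demo on simulated, integer-valued log-normal data (not on the corpora)
set.seed(1)
simulated <- round(stats::rlnorm(10000, meanlog = 3, sdlog = 1))
sim_summary <- log_scale_summary(simulated)
print(round(sim_summary, 2))

# A visual check could use a normal Q-Q plot of the logged counts:
# qqnorm(log(counts[counts > 0]))
```

A large gap between the mean and the median of the logged counts for a real corpus would suggest that the log-normal description is only a rough one.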

Final Conclusion for Milestone report

For the final project, predictive models will need to be trained on training data sets drawn from the corpora. Separate models for each type of text (blogs, news and Twitter) will need to be compared, to understand how they perform against aggregate models trained on the entire corpus spanning all three types of text.
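
The report does not yet commit to a particular model. As one possible baseline, the sketch below counts word pairs (bigrams) in a set of training lines and predicts the most frequent follower of the last word in a phrase. The bigram approach, the crude tokenizer, and the names tokenize, train_bigrams and predict_next are all assumptions made for illustration, not the planned implementation.

```r
# Split a line into lower-case word tokens. This tokenizer is deliberately
# crude: anything other than a-z and the apostrophe acts as a separator.
tokenize <- function(line) {
  words <- strsplit(tolower(line), "[^a-z']+")[[1]]
  words[nzchar(words)]
}

# Count bigrams over a character vector of lines.
# Returns a named integer vector; names have the form "first second".
train_bigrams <- function(lines) {
  pairs <- unlist(lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < 2) return(character(0))
    paste(w[-length(w)], w[-1])
  }))
  counts <- table(pairs)
  result <- as.integer(counts)
  names(result) <- names(counts)
  result
}

# Predict the most frequent word following the last word of `phrase`.
# Returns NA when that word never started a bigram in the training data.
predict_next <- function(bigrams, phrase) {
  w <- tokenize(phrase)
  if (length(w) == 0) return(NA_character_)
  prefix <- paste0(w[length(w)], " ")
  hits <- bigrams[startsWith(names(bigrams), prefix)]
  if (length(hits) == 0) return(NA_character_)
  sub(prefix, "", names(hits)[which.max(hits)], fixed = TRUE)
}

# Demo on toy training lines, not on the real corpora
train_lines <- c("I love you", "I love you so much", "we love you", "I love cake")
model <- train_bigrams(train_lines)
next_after_love <- predict_next(model, "They say I love")
next_after_so <- predict_next(model, "so")
next_unseen <- predict_next(model, "zebra")
print(c(next_after_love, next_after_so, next_unseen))
```

The same functions could be trained once per corpus and once on the combined lines, then scored on held-out lines, to make the per-corpus versus aggregate comparison described above.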