JHU - Data Science Capstone - Milestone Report

Introduction

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.

The motivation for this project is to:

1. Demonstrate that you’ve downloaded the data and have successfully loaded it in
2. Create a basic report of summary statistics about the data sets
3. Report any interesting findings that you amassed so far
4. Get feedback on your plans for creating a prediction algorithm and Shiny app

1. Demonstrate that you’ve downloaded the data and have successfully loaded it in

setwd('C:/Users/huaig/Desktop/Nick/Coding/Coursera/Johns Hopkins University/10. Data Science Capstone/Coursera-SwiftKey/')
list.files('final/en_US')

## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

2. Create a basic report of summary statistics about the data sets

# Build the three file paths from one shared directory instead of repeating it
data_dir <- 'C:/Users/huaig/Desktop/Nick/Coding/Coursera/Johns Hopkins University/10. Data Science Capstone/Coursera-SwiftKey/final/en_US'
blogs_data <- file.path(data_dir, 'en_US.blogs.txt')
news_data <- file.path(data_dir, 'en_US.news.txt')
twitter_data <- file.path(data_dir, 'en_US.twitter.txt')

Size of each file in megabytes (MB, computed as bytes / 1024^2)

file.info(blogs_data)$size / 1024^2

## [1] 200.4242

file.info(news_data)$size / 1024^2

## [1] 196.2775

file.info(twitter_data)$size / 1024^2

## [1] 159.3641

Line counts & characters

blogs_data_f <- readLines(blogs_data)
news_data_f <- readLines(news_data)

## Warning in readLines(news_data): incomplete final line found on 'C:/Users/huaig/
## Desktop/Nick/Coding/Coursera/Johns Hopkins University/10. Data Science Capstone/
## Coursera-SwiftKey/final/en_US/en_US.news.txt'

twitter_data_f <- readLines(twitter_data)

## Warning in readLines(twitter_data): line 167155 appears to contain an embedded
## nul

## Warning in readLines(twitter_data): line 268547 appears to contain an embedded
## nul

## Warning in readLines(twitter_data): line 1274086 appears to contain an embedded
## nul

## Warning in readLines(twitter_data): line 1759032 appears to contain an embedded
## nul
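The warnings above suggest that the plain text-mode reads are not entirely clean: the Twitter file contains embedded nul bytes, and the "incomplete final line" warning for the news file may mean the read stopped before the true end of the file (on Windows, a text-mode read can stop at an embedded 0x1A control character). All figures in this report come from the plain readLines() calls above. The following is only a minimal sketch of a more defensive read, assuming the files are UTF-8 encoded; read_lines_binary is a hypothetical helper name that is not part of the original analysis, and the demo uses a small temporary file rather than the real data.

```r
# Hypothetical helper (not in the original analysis): read a text file through
# a binary-mode connection so control characters do not end the read early,
# and drop embedded nul bytes instead of warning about them.
read_lines_binary <- function(path) {
  con <- file(path, open = 'rb')
  on.exit(close(con))
  # Assumption: the file is UTF-8 encoded
  readLines(con, encoding = 'UTF-8', skipNul = TRUE, warn = FALSE)
}

# Self-contained demo: a temporary file containing a 0x1A byte, a nul byte,
# and no trailing newline.
demo_path <- tempfile(fileext = '.txt')
writeBin(c(charToRaw('first\n'), as.raw(0x1a), charToRaw('second\n'),
           as.raw(0x00), charToRaw('third')), demo_path)
demo_lines <- read_lines_binary(demo_path)
length(demo_lines)
```

If this approach were adopted, the same helper could be applied to blogs_data, news_data and twitter_data, and the line counts compared against the ones reported in this document.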

library(stringi)
blogs_stats <- stri_stats_general(blogs_data_f)
news_stats <- stri_stats_general(news_data_f)
twitter_stats <- stri_stats_general(twitter_data_f)

all_stats <- data.frame(blogs_stats, news_stats, twitter_stats)
colnames(all_stats) <- c('Blogs', 'News', 'Twitter')
t(all_stats)

##           Lines LinesNEmpty     Chars CharsNWhite
## Blogs    899288      899288 208361438   171926076
## News      77259       77259  15683765    13117038
## Twitter 2360148     2360148 162384825   134370864

library(ggplot2)
# Line count per source, collected in a data frame for plotting
num_lines <- data.frame(
  names = c('Blogs', 'News', 'Twitter'),
  num_lines = c(length(blogs_data_f), length(news_data_f), length(twitter_data_f))
)
ggplot(num_lines, aes(x = names, y = num_lines)) +
  geom_bar(stat = 'identity', fill = 'blue', color = 'blue') +
  xlab('Data Source') + ylab('Total No. of Lines') +
  ggtitle('Total Line Count per Data Source')

Word counts & stats

blogs_words <- stri_count_words(blogs_data_f)
news_words <- stri_count_words(news_data_f)
twitter_words <- stri_count_words(twitter_data_f)

all_words <- rbind(summary(blogs_words), summary(news_words), summary(twitter_words))
rownames(all_words) <- c('Blogs', 'News', 'Twitter')

word_count <- rbind(sum(blogs_words), sum(news_words), sum(twitter_words))
rownames(word_count) <- c('Blogs', 'News', 'Twitter')
colnames(word_count) <- 'Word Count'

word_stats <- cbind(all_words, word_count)
word_stats

##         Min. 1st Qu. Median     Mean 3rd Qu. Max. Word Count
## Blogs      0       9     29 42.42716      61 6726   38154238
## News       1      19     32 34.86840      46 1123    2693898
## Twitter    1       7     12 12.80349      18   60   30218125

3. Report any interesting findings that you amassed so far

Based on the summary statistics above, the three data sets are quite different. For instance, Twitter lines are much shorter (a median of 12 words per line, versus 29 for blogs and 32 for news) because tweets are limited to 140 characters, so users tend to use many abbreviations. This, coupled with the prevalent use of hashtags, makes the Twitter data harder to clean than the blogs and news data.

Another data-cleaning challenge is the common use of URLs in blogs and tweets. Ideally, URLs should be removed, as they are not part of the language the corpus is meant to capture.
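The cleaning issues described above could be handled along the following lines. This is a minimal base-R sketch, not the cleaning code actually used for this report: clean_text is a hypothetical helper name, the URL pattern is a simple approximation, and dropping whole hashtag tokens (rather than only the '#' sign) is an assumption that may not suit every model.

```r
# Hypothetical helper (not part of the original analysis): strip URLs and
# hashtags from a character vector, then tidy up the whitespace.
clean_text <- function(x) {
  # Remove anything that looks like a URL (approximate pattern)
  x <- gsub('(https?://|www\\.)\\S+', ' ', x, perl = TRUE)
  # Assumption: drop hashtag tokens entirely
  x <- gsub('#\\w+', ' ', x, perl = TRUE)
  # Collapse runs of whitespace and trim both ends
  x <- gsub('\\s+', ' ', x, perl = TRUE)
  trimws(x)
}

cleaned <- clean_text(c('Check http://rpubs.com/ now #rstats',
                        'No links here'))
cleaned
```

Lines without URLs or hashtags pass through unchanged apart from whitespace normalization.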

4. Get feedback on your plans for creating a prediction algorithm and Shiny app

Next steps:

- Clean punctuation, strange characters and whitespace, and apply stemming
- Create a corpus
- Create unigram, bigram and trigram functions
- Create a TermDocumentMatrix
- Create a frequency file
- Develop the predictive model
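To make the planned n-gram and frequency steps more concrete, here is a minimal base-R sketch on three toy lines. It is an illustration under stated assumptions, not the final model: tokenize, ngrams, ngram_freq and predict_next are hypothetical helper names, the tokenizer keeps only lowercase letters and apostrophes, and prediction simply returns the most frequent word following the given word in the bigram table.

```r
# Hypothetical helpers (illustration only, not the final algorithm)

# Lowercase a line, replace everything except letters, apostrophes and spaces,
# and split it into words.
tokenize <- function(line) {
  line <- gsub("[^a-z' ]+", ' ', tolower(line))
  words <- strsplit(line, ' +')[[1]]
  words[nzchar(words)]
}

# Build all n-grams (as space-separated strings) from a vector of words.
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = ' '),
         character(1))
}

# Frequency table of n-grams over several lines, most frequent first.
ngram_freq <- function(lines, n) {
  grams <- unlist(lapply(lines, function(l) ngrams(tokenize(l), n)))
  sort(table(grams), decreasing = TRUE)
}

# Most frequent word that follows `word` in the bigram table (NA if unseen).
predict_next <- function(bigram_freq, word) {
  hits <- bigram_freq[startsWith(names(bigram_freq), paste0(word, ' '))]
  if (length(hits) == 0) return(NA_character_)
  sub('^\\S+ ', '', names(hits)[1], perl = TRUE)
}

sample_lines <- c('I love data science', 'I love R', 'data science is fun')
bigrams <- ngram_freq(sample_lines, 2)
predict_next(bigrams, 'i')
```

A real model would need to extend this idea to trigrams and fall back to shorter n-grams when a longer one has not been seen, which is one of the design questions on which feedback is welcome.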