The goal of this project is simply to show that you have become comfortable working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to: 1. Demonstrate that you've downloaded the data and have successfully loaded it in. 2. Create a basic report of summary statistics about the data sets. 3. Report any interesting findings that you have amassed so far. 4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
Review criteria: 1. Does the link lead to an HTML page describing the exploratory analysis of the training data set? 2. Has the data scientist done basic summaries of the three files? Word counts, line counts, and basic data tables? 3. Has the data scientist made basic plots, such as histograms, to illustrate features of the data? 4. Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?
If the dataset is not already available locally, download it from the URL and unzip it.
# Optionally clear the workspace before starting
#rm(list = ls(all.names = TRUE))
#gc()

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
setwd('..')
# Download and unzip the data only if the archive is not already present
if (!file.exists("Coursera-SwiftKey.zip")) {
  print("Download file and unzip")
  download.file(url, "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip", exdir = "C:/Users/JoséIgnacio/Documents")
}
Read the blog, news, and Twitter datasets from the English-language files and display statistics for the three datasets in a table. The statistics include the file size in megabytes, the total number of lines, the total number of words, and the total number of characters.
library(stringi)

# Blogs: file size (MB), line, word, and character counts
blogfile <- "./en_US/en_US.blogs.txt"
blogsize <- file.info(blogfile)$size / 1024^2
blogs <- readLines(blogfile, skipNul = TRUE)
bloglines <- length(blogs)
blogchars <- sum(nchar(blogs))
blogwords <- sum(stri_count_words(blogs))
#summary(nchar(blogs))

# News
newsfile <- "./en_US/en_US.news.txt"
newssize <- file.info(newsfile)$size / 1024^2
news <- readLines(newsfile, skipNul = TRUE)
newslines <- length(news)
newschars <- sum(nchar(news))
newswords <- sum(stri_count_words(news))
#summary(nchar(news))

# Twitter
twitterfile <- "./en_US/en_US.twitter.txt"
twittersize <- file.info(twitterfile)$size / 1024^2
twitter <- readLines(twitterfile, skipNul = TRUE)
twitterlines <- length(twitter)
twitterchars <- sum(nchar(twitter))
twitterwords <- sum(stri_count_words(twitter))
#summary(nchar(twitter))
# Combine the file statistics into a single summary table
table <- data.frame("Name" = c("Blogs", "News", "Twitter"),
                    "Size(MB)" = c(blogsize, newssize, twittersize),
                    "Lines" = c(bloglines, newslines, twitterlines),
                    "Words" = c(blogwords, newswords, twitterwords),
                    "Char" = c(blogchars, newschars, twitterchars))
table
## Name Size.MB. Lines Words Char
## 1 Blogs 200.4242 899288 37546250 206824509
## 2 News 196.2775 77259 2674536 15639408
## 3 Twitter 159.3641 2360148 30093413 162122861
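The review criteria also ask for basic plots such as histograms. A minimal sketch of one such plot, assuming only the blogs, news, and twitter vectors read in above, is to show the distribution of words per line for each source:

# Minimal sketch: histograms of words per line for each source,
# using the blogs, news, and twitter character vectors loaded above.
par(mfrow = c(1, 3))
hist(stri_count_words(blogs), breaks = 50, main = "Blogs", xlab = "Words per line")
hist(stri_count_words(news), breaks = 50, main = "News", xlab = "Words per line")
hist(stri_count_words(twitter), breaks = 50, main = "Twitter", xlab = "Words per line")
par(mfrow = c(1, 1))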
Because the full data set is very large, we sample a portion of it and build a corpus. We then convert the text to lower case, remove punctuation, numbers, extra whitespace, and stopwords, and store the result as plain text documents.
library(NLP)
library(tm)

samplesize <- 30000
#samplesize <- 5000

# Sample from each source and combine into one character vector
mydata <- c(sample(blogs, samplesize), sample(news, samplesize), sample(twitter, samplesize))

# Build the corpus and clean it
corpus <- VCorpus(VectorSource(mydata))
corpus <- tm_map(corpus, content_transformer(tolower))  # lower case
corpus <- tm_map(corpus, removePunctuation)             # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                 # drop numbers
corpus <- tm_map(corpus, stripWhitespace)               # collapse whitespace
corpus <- tm_map(corpus, removeWords, stopwords())      # drop English stopwords
corpus <- tm_map(corpus, PlainTextDocument)             # store as plain text documents
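To sanity-check the cleaning steps, a minimal sketch (using only the corpus and mydata objects created above) is to print a couple of cleaned documents and compare them with the raw sampled lines:

# Minimal sketch: inspect the first two cleaned documents
# and compare them with the corresponding raw sampled lines.
writeLines(as.character(corpus[[1]]))
writeLines(as.character(corpus[[2]]))
head(mydata, 2)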
Tokenization is the splitting of a phrase, sentence, paragraph, or entire document into smaller units, such as individual words or terms. First, we create unigrams by splitting the text into single words and explore the most common words in our sample.
library(tidytext)
library(dplyr)
library(tibble)
library(tidyr)
library(ggplot2)

# Convert the corpus to a tidy data frame with one row per document
text_df <- tidy(corpus)
text_df <- text_df %>% select(c("id", "text")) %>% rename(line = id)

# Unigram counts: one word per token, keep the 60 most frequent
freq <- text_df %>% unnest_tokens(word, text) %>% count(word, sort = TRUE) %>% top_n(60)

ggplot(freq, aes(x = word, y = n)) +
  geom_bar(stat = "identity", fill = "dark blue") +
  xlab("Word") + ylab("Count") + ggtitle("Top 60") +
  theme(axis.text.x = element_text(angle = 65, hjust = 1))
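A question that matters for the eventual prediction model is how many unique words are needed to cover most of the text. A rough sketch, assuming the same text_df used above and counting all words rather than only the top 60, could estimate this as follows:

# Minimal sketch: estimate how many unique words cover 50% and 90%
# of all word instances in the sample.
allfreq <- text_df %>% unnest_tokens(word, text) %>% count(word, sort = TRUE)
coverage <- cumsum(allfreq$n) / sum(allfreq$n)
which(coverage >= 0.5)[1]   # unique words needed for ~50% coverage
which(coverage >= 0.9)[1]   # unique words needed for ~90% coverage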
Second, we look at the most frequent bigrams (pairs of consecutive words).
# Bigram counts: pairs of consecutive words, keep the 41 most frequent
freqb <- text_df %>% unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% count(bigram, sort = TRUE) %>% top_n(41)
freqb <- freqb[-1, ]  # remove the NA row produced for documents too short to form a bigram

ggplot(freqb, aes(x = bigram, y = n)) +
  geom_bar(stat = "identity", fill = "blue") +
  xlab("Bigram") + ylab("Count") + ggtitle("Top 40") +
  theme(axis.text.x = element_text(angle = 55, hjust = 1))
Lastly, we look at the most frequent trigrams (triples of consecutive words).
# Trigram counts: triples of consecutive words, keep the 21 most frequent
freqt <- text_df %>% unnest_tokens(trigram, text, token = "ngrams", n = 3) %>% count(trigram, sort = TRUE) %>% top_n(21)
freqt <- freqt[-1, ]  # remove the NA row

ggplot(freqt, aes(x = trigram, y = n)) +
  geom_bar(stat = "identity", fill = "cyan") +
  xlab("Trigram") + ylab("Count") + ggtitle("Top 20") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
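These n-gram counts are the building blocks of the planned prediction algorithm: given the last one or two words typed, the app will look up the most frequent n-gram that starts with them and suggest the next word. A very rough sketch of that idea follows; it assumes only the small freqb and freqt tables built above (so it is purely illustrative), and the final algorithm will use full counts, back-off or smoothing, and a Shiny front end.

# Minimal sketch of the prediction idea: look for a trigram starting with
# the last two typed words, and fall back to a bigram starting with the
# last word. Uses the freqb and freqt tables built above.
predict_next <- function(phrase) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  if (n >= 2) {
    hits <- freqt[grepl(paste0("^", words[n - 1], " ", words[n], " "), freqt$trigram), ]
    if (nrow(hits) > 0) {
      return(sub(".* ", "", hits$trigram[1]))   # last word of the best matching trigram
    }
  }
  hits <- freqb[grepl(paste0("^", words[n], " "), freqb$bigram), ]
  if (nrow(hits) > 0) {
    return(sub(".* ", "", hits$bigram[1]))      # second word of the best matching bigram
  }
  NA_character_
}

predict_next("happy new")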