Executive Summary

This report is a brief introduction to the Shiny text predictor. In this report, I will:

  1. Download the Data
  2. Import the data to R
  3. Take a sample of the data
  4. Clean and standardize the data
  5. Create and visualize nGrams
  6. Explain Project Plan

Download the Data

The dataset was downloaded directly from the course website, as directed by the tutor. The zip file was downloaded and extracted; the resulting directory contains files in German, English, Finnish, and Russian.

library(stringi)
library(knitr)
library(tm)
library(RWeka)
library(ggplot2)

# Get the data: download and unzip only if the archive is not already present
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip <- "Coursera-SwiftKey.zip"
if (!file.exists(zip)) {
  download.file(url, zip)
  unzip(zip)
}

Import the data to R

I will be using the US English (en_US) files for this project, together with a list of profanity (swearWords.csv) that is filtered out later.

# Read the three English corpora and the profanity list
blogs <- readLines("en_US.blogs.txt", encoding = 'UTF-8', skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = 'UTF-8', skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = 'UTF-8', skipNul = TRUE)
swearWords <- readLines("swearWords.csv", encoding = 'UTF-8', skipNul = TRUE)

# Summary statistics: line count, length of the longest line (in characters) and total words
kable(data.frame(row.names    = c("blogs", "news", "twitter"),
                 LineCount    = sapply(list(blogs, news, twitter), length),
                 LongestLines = sapply(list(stri_length(blogs), stri_length(news), stri_length(twitter)), max),
                 TotalWords   = sapply(list(blogs, news, twitter), stri_stats_latex)[4, ]))
          LineCount  LongestLines  TotalWords
blogs        899288         40833    37570839
news          77259          5760     2651432
twitter     2360148           140    30451170

Take a sample of the data

# Take a reproducible random sample of 10,000 lines from each source
set.seed(2016)
sampleTwitter <- twitter[sample(1:length(twitter), 10000)]
sampleNews    <- news[sample(1:length(news), 10000)]
sampleBlogs   <- blogs[sample(1:length(blogs), 10000)]
usSample <- c(sampleTwitter, sampleNews, sampleBlogs)
rm(blogs, news, twitter, sampleBlogs, sampleNews, sampleTwitter) # free memory

Clean and standardize the data

usSample <- iconv(usSample, 'UTF-8', 'ASCII', "byte") # substitute non-ASCII characters with byte codes
usCorpus <- VCorpus(VectorSource(usSample))

# transformation that replaces matched patterns with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
usCorpus <- tm_map(usCorpus, toSpace, "/|@|\\|") # replace /, @ and | with spaces
usCorpus <- tm_map(usCorpus, removeWords, stopwords("english")) # remove English stop words (applied before lowercasing, so capitalized stop words such as "The" survive)
usCorpus <- tm_map(usCorpus, content_transformer(tolower)) # convert to lowercase
usCorpus <- tm_map(usCorpus, removePunctuation) # remove punctuation
usCorpus <- tm_map(usCorpus, removeNumbers) # remove numbers
usCorpus <- tm_map(usCorpus, stripWhitespace) # strip extra whitespace
usCorpus <- tm_map(usCorpus, removeWords, swearWords) # remove profanity
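
To sanity-check the transformations, a couple of the cleaned documents can be printed (a quick check of my own; the indices are arbitrary):

# Peek at two cleaned documents to confirm the transformations behaved as expected
inspect(usCorpus[1:2])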

Create nGrams

#Tokenizer functions
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
quadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))
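
The frequency tables plotted below (unigram_df and bigram_df) are not built by the code shown in this report. Here is a minimal sketch of how they could be constructed from these tokenizers, assuming a helper ngram_freq_df (a name of my own, not from the original analysis) that counts terms with TermDocumentMatrix and keeps the 20 most frequent:

# Sketch only: build a top-20 frequency table for a given tokenizer
unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))

ngram_freq_df <- function(corpus, tokenizer, top = 20) {
  tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer))
  freq <- sort(slam::row_sums(tdm), decreasing = TRUE) # total count of each n-gram (slam is installed as a dependency of tm)
  head(data.frame(word = names(freq), freq = freq), top)
}

unigram_df <- ngram_freq_df(usCorpus, unigramTokenizer)
bigram_df  <- ngram_freq_df(usCorpus, bigramTokenizer)
trigram_df <- ngram_freq_df(usCorpus, trigramTokenizer)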

Visualize nGrams

# Plotting function: bar chart of the 20 most frequent n-grams
bar_plot <- function(df, title) {
  df$word <- factor(df$word, levels = df$word) # keep the words in frequency order
  ggplot(df, aes(x = word, y = freq)) +
    geom_bar(stat = "identity", fill = "red", colour = "black", width = 0.80) +
    labs(title = title) +
    xlab("Words") +
    ylab("Count") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}
# Unigram Chart
bar_plot(unigram_df,"Unigrams")

unigram_df
##          word freq
## the       the 4938
## said     said 3074
## will     will 2837
## one       one 2584
## just     just 2251
## like     like 2063
## can       can 2049
## time     time 1816
## get       get 1722
## new       new 1555
## people people 1379
## now       now 1351
## also     also 1303
## first   first 1260
## good     good 1232
## know     know 1211
## day       day 1189
## but       but 1172
## and       and 1119
## back     back 1099
bar_plot(bigram_df,"Bigrams")

bigram_df
##                word freq
## i think     i think  481
## i know       i know  343
## i love       i love  303
## i just       i just  277
## i can         i can  276
## i will       i will  264
## i want       i want  215
## i like       i like  186
## last year last year  186
## right now right now  169
## new york   new york  163
## i really   i really  150
## i get         i get  147
## i hope       i hope  143
## i thought i thought  139
## years ago years ago  139
## i feel       i feel  136
## i donet     i donet  134
## i need       i need  131
## i got         i got  130

Project Plan

Now that the data has been cleaned, it can be explored; the most prevalent nGrams in the sample are shown above. The next step is to build a predictive model that uses these nGrams. The Shiny app will present a text input box and, based on what the user types, predict the next word or phrase.
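
As an illustration of the kind of lookup the model could perform, here is a minimal sketch that suggests the next word from the bigram counts in bigram_df above. The function name predict_next_word and the "most frequent matching bigram" rule are placeholders of my own; the final model will also need back-off to lower-order nGrams and handling of unseen words.

# Sketch only: suggest the next word from bigram counts
# bigram_df is assumed to have a `word` column ("w1 w2") and a `freq` column, as shown above
predict_next_word <- function(input, bigram_df) {
  last_word  <- tail(strsplit(tolower(trimws(input)), "\\s+")[[1]], 1)
  candidates <- bigram_df[startsWith(as.character(bigram_df$word), paste0(last_word, " ")), ]
  if (nrow(candidates) == 0) return(NA_character_)
  strsplit(as.character(candidates$word[which.max(candidates$freq)]), " ")[[1]][2]
}

predict_next_word("new", bigram_df) # "york", from the "new york" bigram above

In the Shiny app, a call like this would sit behind the text input and update the suggested word as the user types.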