Executive Summary

The motivation for this project is to:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you have amassed so far.

This Milestone Report works through the following steps to meet those goals:

  • Load the data
  • Take and show random samples of the imported data
  • Perform some initial cleaning of the data
  • Present basic overall statistics
  • Plot sample data
  • Summarize the data overall

Project Setup

We will clear the session, load the required packages, and set up the overall environment.

## Clear the workspace and load the required packages
rm(list = ls(all.names = TRUE))
library(ggplot2)
library(downloader)
library(plyr)
library(dplyr)
library(knitr)
library(tm)
library(wordcloud)
library(slam)
library(ngram)
library(kableExtra)
library(RColorBrewer)
library(gridExtra)
library(RWeka)
## Fix the random seed so the sampling below is reproducible
set.seed(123456)

Description of Datasets

SwiftKey provides three (3) datasets:

  • blogs, news, and twitter

These data sets are normally available from public sources. Because they are extremely large, we will sample the data so it remains representative of the whole while staying manageable.

This project will only focus on the English corpora.

## Check: do you already have the data directory?
if(!file.exists("./data")){
  dir.create("./data")
}
Url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
## Check: is the zip file in your working directory? If not, download it.
if(!file.exists("./data/Coursera-SwiftKey.zip")){
  download.file(Url, destfile = "./data/Coursera-SwiftKey.zip", mode = "wb")
}
## Check: unzip the downloaded zip file if it has not been extracted yet.
if(!file.exists("./data/final")){
  unzip(zipfile = "./data/Coursera-SwiftKey.zip", exdir = "./data")
}

Having downloaded the data set and extracted it into the working directory, we next:

  • list all the files in the ./data/final/en_US/ dataset folder

There will be text from 3 different sources:

  1. News
  2. Blogs
  3. Twitter feeds

In this project, we will only focus on the English - US data sets.

path <- file.path("./data/final" , "en_US")
files<-list.files(path, recursive=TRUE)

#file connection twitter data set
con <- file("./data/final/en_US/en_US.twitter.txt", "r") 
Twitter<-readLines(con, skipNul = TRUE, warn = FALSE, encoding = "UTF-8")
# Close the connection 
close(con)

#file connection blog data set
con <- file("./data/final/en_US/en_US.blogs.txt", "r") 
Blogs<-readLines(con, skipNul = TRUE, warn = FALSE, encoding = "UTF-8")
# Close the connection 
close(con)

#file connection news data set (opened in binary mode; the news file contains an
#embedded control character that can truncate a text-mode read)
con <- file("./data/final/en_US/en_US.news.txt", "rb") 
News<-readLines(con, skipNul = TRUE, warn = FALSE, encoding = "UTF-8")
# Close the connection 
close(con)

Basic Summary

Before we start building the corpus, we need to clean the data and create a basic summary of the three datasets provided.

We will review the:

  • file sizes
  • number of lines
  • number of characters
  • number of words for each source file.

We will also include basic per-line statistics: words per line (min, mean, and max). A sketch of that calculation follows the summary code below.

library(stringi)

#Get file size in megabytes
Blogs.size <- file.info("./data/final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
News.size <- file.info("./data/final/en_US/en_US.news.txt")$size / 1024 ^ 2
Twitter.size <- file.info("./data/final/en_US/en_US.twitter.txt")$size / 1024 ^ 2

# Count the words in each line of each file
Blogs.words <- stri_count_words(Blogs)
News.words <- stri_count_words(News)
Twitter.words <- stri_count_words(Twitter)

# Summary of the data sets
data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = round(c(Blogs.size, News.size, Twitter.size), digits = 2),
           line.count = c(length(Blogs), length(News), length(Twitter)),
           word.count = c(sum(Blogs.words), sum(News.words), sum(Twitter.words)),
           mean.words.per.line = round(c(mean(Blogs.words), mean(News.words), mean(Twitter.words)), digits = 2))
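
As promised above, the per-line minimum, mean, and maximum word counts can be derived from the same word-count vectors. The data frame below is a sketch added here for completeness, not part of the original summary table:

# Sketch (assumption, not from the original report): min / mean / max words
# per line for each source, computed from the stri_count_words() vectors above
data.frame(source = c("blogs", "news", "twitter"),
           min.words  = c(min(Blogs.words),  min(News.words),  min(Twitter.words)),
           mean.words = round(c(mean(Blogs.words), mean(News.words), mean(Twitter.words)), 2),
           max.words  = c(max(Blogs.words),  max(News.words),  max(Twitter.words)))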

Histogram: Words per Line

The above table shows that, on average:

  • Blogs have the highest number of words per line.
  • News falls in between.
  • Twitter has the lowest mean word count, due to its character limit (a sketch of the words-per-line histogram follows this list).
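
One way to produce the words-per-line histogram referred to in the heading above is sketched below. It assumes the Blogs.words, News.words, and Twitter.words vectors from the summary step; the log scale and bin count are choices made here, not taken from the original report.

# Sketch: histogram of words per line for each source; a log scale keeps the
# long-tailed blog and news distributions readable
wpl <- rbind(data.frame(source = "blogs",   words = Blogs.words),
             data.frame(source = "news",    words = News.words),
             data.frame(source = "twitter", words = Twitter.words))
wpl <- subset(wpl, words > 0)   # drop empty lines so the log scale is well defined
ggplot(wpl, aes(x = words)) +
  geom_histogram(bins = 50, fill = "blue") +
  scale_x_log10() +
  facet_wrap(~ source, scales = "free_y") +
  labs(x = "Words per line (log scale)", y = "Number of lines")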

To improve processing time, a 5% sample will be taken from each of the three data sets and then combined into a unified document corpus for the subsequent analyses.

Cleaning The Data

Because the text data sets are quite large, we will randomly choose 5% of the data for cleaning and exploratory analysis, and convert the text to UTF-8 characters.

Before performing the exploratory analysis, we must first clean the data:

  • remove any URLs, Twitter handles, and email patterns
  • remove special characters
  • remove punctuation
  • remove numbers
  • remove excess whitespace
  • remove English stopwords and any non-UTF-8 characters
  • convert all text to lower case
  • remove any and all profanity

#Remove profanity from each list: using the profanity list originally published by Google
  profanityFile <- "full-list-of-bad-words-banned-by-google.csv"
  pathToprofanityList <- file.path("./data", profanityFile)
  #keep the list as a plain character vector (first column) so removeWords() below works
  profanity <- read.csv(pathToprofanityList, sep = "\t", strip.white = TRUE,
                        stringsAsFactors = FALSE, encoding = "UTF-8")[, 1]
  
#Sample the data
set.seed(123456)
sampleData <- c(sample(Blogs, length(Blogs) * 0.05),
                 sample(News, length(News) * 0.05),
                 sample(Twitter, length(Twitter) * 0.05))
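# The cleaning list above also calls for dropping non-UTF-8 content; this line is an
# assumed (not original) way to strip characters that do not convert cleanly
sampleData <- iconv(sampleData, "UTF-8", "ASCII", sub = "")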
#Create corpus and clean the data
  corpus <- VCorpus(VectorSource(sampleData))
  toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
#remove URL, Twitter handle and email patterns
  corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
  corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
  corpus <- tm_map(corpus, toSpace, "\\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b")
#remove profanity from the sample data set
  profanity <- iconv(profanity, "latin1", "ASCII", sub = "")
  corpus <- tm_map(corpus, removeWords, profanity)
#remove the rest of the unwanted characters
#(tolower is wrapped in content_transformer() so the corpus structure is preserved)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stripWhitespace)

Exploratory Analysis

We will use several techniques to develop an understanding of the data.

This Section will look at:

  • most frequently used words
  • tokenizing
  • n-gram generation
  • unigrams
  • bigrams
  • trigrams

options(mc.cores = 1)
#gather the frequencies of terms from a term-document matrix
getFreq <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}

#RWeka tokenizers for bigrams and trigrams
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
#bar chart of the 30 most frequent terms in a frequency data frame
makePlot <- function(data, label) {
  ggplot(data[1:30,], aes(reorder(word, -freq), freq)) +
         labs(x = label, y = "Frequency") +
         theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
         geom_bar(stat = "identity", fill = I("blue"))
}

Histograms: Most Common Words, Tokenizing, and N-Grams

# Get frequencies of most common n-grams in data sample
freq1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.9999))
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999))
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))

Unigram Word Frequencies

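The unigram chart is rendered from the frequency table above with the makePlot() helper defined earlier; a minimal sketch of that call (the label text is an assumption) is:

makePlot(freq1, "30 Most Common Unigrams")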

Bigram Word Frequencies

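Likewise, the bigram chart would come from the bigram frequency table; a sketch of that call (label text assumed) is:

makePlot(freq2, "30 Most Common Bigrams")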

Trigram Word Frequencies

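And the trigram chart follows the same pattern (label text assumed):

makePlot(freq3, "30 Most Common Trigrams")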

Conclusion

In conclusion, the final deliverable of the capstone project is a predictive text algorithm that will be deployed as a Shiny app with a simple user interface.

Possible models:

  • A predictive algorithm using an n-gram model with frequency lookup, similar to our exploratory analysis above.
  • Use the trigram model to predict the next word.
  • If no trigram matches, back off to the bigram model, and then to the unigram model if needed (see the sketch after this list).
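
As a rough, non-authoritative sketch of that backoff idea, assuming the freq1, freq2, and freq3 tables built above and a hypothetical helper named predictNext(), the lookup might look like this:

# Sketch of a naive backoff lookup over the freq1/freq2/freq3 tables built earlier;
# predictNext() and its internals are hypothetical, not part of the original analysis
predictNext <- function(phrase, freq1, freq2, freq3) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  # Try trigrams first: match the last two words of the phrase
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- freq3[grepl(paste0("^", prefix, " "), freq3$word), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$word[1]))
  }
  # Back off to bigrams: match the last word only
  if (n >= 1) {
    hits <- freq2[grepl(paste0("^", words[n], " "), freq2$word), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$word[1]))
  }
  # Fall back to the single most frequent unigram
  as.character(freq1$word[1])
}

Here the most frequent matching n-gram wins; a production version would precompute indexed lookup tables rather than scanning with grepl() on every request.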

The user interface of the Shiny app will consist of a text input box that will allow a user to enter a phrase. Then the app will use our algorithm to suggest the most likely next word after a short delay.
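
A minimal sketch of such an interface, assuming the hypothetical predictNext() helper and frequency tables sketched above (this is not the final app), could look like the following:

library(shiny)

# Minimal sketch of the planned UI: a text box for the phrase and the
# predicted next word underneath; predictNext() is the hypothetical helper above
ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  textOutput("nextWord")
)

server <- function(input, output) {
  output$nextWord <- renderText({
    if (nchar(input$phrase) == 0) return("")
    predictNext(input$phrase, freq1, freq2, freq3)
  })
}

shinyApp(ui = ui, server = server)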

The final strategy will be the one that offers the best balance of efficiency and prediction accuracy.