This is the initial Milestone Report for the Coursera Data Science Capstone Project. The task is to build a predictive text model using natural language processing techniques. This report describes the key features of the training data through exploratory data analysis and outlines the plan for the predictive model.
**Download the zip file containing the text files from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.**
# Download and unzip into the working directory so the file.exists() check matches the downloaded file
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")
}
The data sets consist of text from three sources: blogs, news, and Twitter feeds. The text is provided in four languages: German, English (United States), Finnish, and Russian. I will focus on the English (United States) data sets only.
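As a quick sanity check (a minimal sketch, assuming the zip was extracted into a final/ folder in the working directory, as in the download step above), the extracted files can be listed:
# List the extracted files; each language has its own folder (e.g. en_US)
list.files("final", recursive = TRUE)
# The three English (United States) files used below
list.files("final/en_US")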
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines("final/en_US/en_US.news.txt", encoding =
## "UTF-8", skipNul = TRUE): incomplete final line found on 'final/en_US/
## en_US.news.txt'
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
Next, create a summary of the three data sets: file size, number of lines, number of words, and mean number of words per line.
library(stringi)
blogs.size <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(blogs.size, news.size, twitter.size),
           num.lines = c(length(blogs), length(news), length(twitter)),
           num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
           mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
##    source file.size.MB num.lines num.words mean.num.words
## 1   blogs     200.4242    899288  37546246       41.75108
## 2    news     196.2775     77259   2674536       34.61779
## 3 twitter     159.3641   2360148  30093410       12.75065
Before the analysis, the data is cleaned so that processing is more efficient. In this step I remove URLs, Twitter handles, special characters, punctuation, numbers, stop words, and extra whitespace. Because of the file sizes, a 1% sample is used to keep the runtime manageable while still demonstrating the approach.
library(tm)
## Loading required package: NLP
set.seed(679)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
                 sample(news, length(news) * 0.01),
                 sample(twitter, length(twitter) * 0.01))
library(stringr)
usableText <- str_replace_all(data.sample, "[^[:alnum:]]", " ")  # replace non-alphanumeric characters with spaces
usableText <- iconv(usableText, from = "UTF-8", to = "ASCII", sub = "")  # drop remaining non-ASCII (accented) characters
corpus <- VCorpus(VectorSource(usableText))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")  # remove URLs
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")                      # remove Twitter handles
corpus <- tm_map(corpus, content_transformer(tolower))             # convert to lower case
corpus <- tm_map(corpus, removeWords, stopwords("en"))             # remove English stop words
corpus <- tm_map(corpus, removePunctuation)                        # remove punctuation
corpus <- tm_map(corpus, removeNumbers)                            # remove numbers
corpus <- tm_map(corpus, stripWhitespace)                          # collapse extra whitespace
corpus <- tm_map(corpus, PlainTextDocument)                        # ensure plain text documents
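To check that the cleaning steps behaved as intended, a few documents from the corpus can be inspected (a sketch only; the indices are arbitrary examples):
# Look at the first few cleaned documents in the sampled corpus
lapply(corpus[1:3], as.character)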
Next, perform exploratory analysis on the cleaned sample and list the most common unigrams as a first indication of common themes.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
options(mc.cores=1)
getFreq <- function(tdm) {
  # Sum term frequencies across documents and return them sorted in decreasing order
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}
# Tokenizer functions for bigrams and trigrams (NGramTokenizer and Weka_control come from the RWeka package)
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
makePlot <- function(data, label) {
  # Bar chart of the 30 most frequent terms, ordered by decreasing frequency
  ggplot(data[1:30, ], aes(reorder(word, -freq), freq)) +
    labs(x = label, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
    geom_bar(stat = "identity", fill = I("grey50"))
}
library(ngram)
library(tokenizers)
library(rJava)
library(RWeka)       # provides NGramTokenizer and Weka_control used in the tokenizers above
library(RTextTools)
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
freq1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.9999))
Histogram of the 30 most common unigrams in the data sample.
makePlot(freq1, "30 Most Common Unigrams")
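The same frequency analysis can be extended to bigrams and trigrams with the tokenizer functions defined above (a sketch; the sparsity threshold simply mirrors the unigram call and the plot labels are illustrative):
# Bigram and trigram frequencies using the RWeka tokenizers
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999))
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))
makePlot(freq2, "30 Most Common Bigrams")
makePlot(freq3, "30 Most Common Trigrams")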
# Next Steps For Prediction Algorithm And Shiny App

The next steps of this capstone project are to finalize a predictive algorithm and deploy it as a Shiny app. The predictive algorithm will use an n-gram model with frequency lookup, building on the exploratory analysis above. A potential strategy is to use the trigram model to predict the next word; if no matching trigram is found, the algorithm backs off to the bigram model and then to the unigram model as required. In the app, the user will enter a phrase into an input box, and the app will suggest the most likely 'next word' based on that input.
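To illustrate the planned backoff strategy (a minimal sketch only: the frequency tables freq1, freq2, and freq3 are assumed to be the n-gram frequency data frames computed above, and predictNextWord is a hypothetical helper, not the final implementation):
# Hypothetical backoff lookup: try trigrams, then bigrams, then the top unigram
predictNextWord <- function(phrase, freq1, freq2, freq3) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  if (length(words) == 2) {
    # Trigrams whose first two words match the last two words of the phrase
    hits <- freq3[grepl(paste0("^", words[1], " ", words[2], " "), freq3$word), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$word[1]))
  }
  # Back off to bigrams whose first word matches the last word of the phrase
  hits <- freq2[grepl(paste0("^", tail(words, 1), " "), freq2$word), ]
  if (nrow(hits) > 0) return(sub(".* ", "", hits$word[1]))
  # Fall back to the single most frequent unigram
  as.character(freq1$word[1])
}
A production version would also need smoothing and tie-breaking, but this captures the trigram-to-bigram-to-unigram fallback described above.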