C10 Data Science Capstone - Milestone Report

By Sandy Sng
12 June 2018

Executive Summary

This is the Milestone Report for the Coursera Data Science Capstone Project.
The goal of the Capstone Project is to create an algorithm and build a predictive text mining application that predicts the next word from the words a user has already typed. Using three datasets of English sentences (extracted from blogs, news, and Twitter), we will build and analyse basic n-gram models that predict the next word from the frequently occurring words that precede it.

The motivation for this Milestone Report is to:
1. Demonstrate that the data has been downloaded and successfully loaded.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings so far.
4. Get feedback on the plans for creating a prediction algorithm and Shiny app.

Getting the Data

Download the three datasets from the Coursera-SwiftKey data source. We will only use the English versions (the en_US files) for this analysis.


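# Download and unzip the data only if the "final" folder is not already present
if (!file.exists("final")) {
  # download.file() needs a destination file; mode = "wb" keeps the zip intact on Windows
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}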

blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Examine the datasets for the following information: file sizes (in MB), line counts, word counts, and mean words per line.


library(stringi)

# Get file sizes in MB (bytes / 1024^2)
blogs.size <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2

# Count the words on each line of each file
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)

# Summary of the datasets
data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(blogs.size, news.size, twitter.size),
           num.lines = c(length(blogs), length(news), length(twitter)),
           num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
           mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
##    source file.size.MB num.lines num.words mean.num.words
## 1   blogs     200.4242    899288  37546239       41.75107
## 2    news     196.2775   1010242  34762395       34.40997
## 3 twitter     159.3641   2360148  30093413       12.75065
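
To make the summary columns concrete, here is a minimal sketch on three made-up lines. It uses base R's strsplit() as a rough stand-in for stri_count_words(), which is an assumption for illustration only: the two can disagree on real text with punctuation or unusual spacing. The names toy, toy.words, and toy.summary are hypothetical.

```r
# Toy lines (made up for illustration; not taken from the datasets)
toy <- c("the quick brown fox", "hello world", "one")

# Rough word count per line: split each line on runs of whitespace.
# Assumption: this only approximates stri_count_words() for simple text.
toy.words <- lengths(strsplit(trimws(toy), "[[:space:]]+"))

# Same columns as the summary table above, computed for the toy lines
toy.summary <- data.frame(num.lines = length(toy),
                          num.words = sum(toy.words),
                          mean.num.words = mean(toy.words))
print(toy.summary)
```

Here the three lines contain 4, 2, and 1 words, so num.words is 7 and mean.num.words is 7 / 3.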

Cleaning the Data

Cleaning involves removing URLs, special characters, punctuation, numbers, excess whitespace, and stopwords, and converting the text to lower case. Since the data sets are quite large, we will randomly sample 1% of each data set to demonstrate the data cleaning and exploratory analysis.


library(tm)

# Sample the data (1% of each source, chosen at random)
set.seed(324)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
                 sample(news, length(news) * 0.01),
                 sample(twitter, length(twitter) * 0.01))

# Create corpus and clean the data
corpus <- VCorpus(VectorSource(data.sample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Replace URLs with a space; \\S+ stops at the first whitespace, so text after the URL is kept
corpus <- tm_map(corpus, toSpace, "(f|ht)tps?://\\S+")
# Replace Twitter handles (@ followed by non-whitespace characters) with a space
corpus <- tm_map(corpus, toSpace, "@\\S+")
# Wrap base functions such as tolower in content_transformer() so each document keeps its class
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove stopwords before punctuation so that contractions such as "don't" still match
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
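
To see what these cleaning steps do to a single line, the sketch below applies similar steps with base R only. The helper clean.text() is hypothetical and is not part of tm; it skips stopword removal and replaces punctuation with a space rather than deleting it, so its output will not match the tm pipeline exactly.

```r
# Hypothetical helper (not part of tm): base R approximation of the cleaning steps.
# Assumptions: no stopword removal; punctuation becomes a space instead of being deleted.
clean.text <- function(x) {
  x <- gsub("(f|ht)tps?://[^[:space:]]+", " ", x)  # replace URLs with a space
  x <- gsub("@[^[:space:]]+", " ", x)              # replace Twitter handles with a space
  x <- tolower(x)                                  # convert to lower case
  x <- gsub("[[:punct:]]+", " ", x)                # replace punctuation with a space
  x <- gsub("[[:digit:]]+", " ", x)                # replace numbers with a space
  x <- gsub("[[:space:]]+", " ", x)                # collapse runs of whitespace
  trimws(x)                                        # drop leading/trailing whitespace
}

# Made-up example line
cleaned <- clean.text("Check THIS out @user http://example.com/page 123 times!!!")
print(cleaned)  # "check this out times"
```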

Exploratory Data Analysis

Part 1: Visualising Data using Wordcloud

Words with the highest frequency of occurrence are plotted first, in the center of the wordcloud.


library(wordcloud)
wordcloud(corpus, max.words = 1000, random.order = FALSE, rot.per = 0.3, use.r.layout = FALSE, colors = brewer.pal(4, "BuPu"))
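
The wordcloud is driven by word frequencies. As a rough illustration of that idea, the sketch below counts word frequencies on a few made-up lines with base R. The names toy.docs, tokens, and freq are hypothetical, and this is not how the wordcloud package computes frequencies internally.

```r
# Toy documents (made up for illustration)
toy.docs <- c("good morning good day", "good night", "day after day")

# Split every document into words and pool them into one vector
tokens <- unlist(strsplit(toy.docs, "[[:space:]]+"))

# Count each distinct word and sort from most to least frequent
freq <- sort(table(tokens), decreasing = TRUE)
print(freq)
```

In this toy example "good" and "day" each occur three times, so they would be drawn largest and closest to the center.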

Next Steps: Creating a prediction algorithm & Shiny app

We have now completed the steps of getting and cleaning the data, and partially explored the data using a wordcloud. Next, we will:
- increase our sample size from the current 1% to a larger sample,
- continue the Exploratory Data Analysis with “Part 2: Visualising Data using n-gram models”, building basic n-gram models that predict the next word from the frequently occurring words in the data,
- create a Shiny app to provide a friendly user interface.
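
As a starting point for the n-gram step, the sketch below shows one possible way to build a bigram (2-gram) frequency table and use it to suggest the next word. It is a minimal illustration on made-up text, not the final algorithm: the function names bigram.counts() and bigram.predict() are hypothetical, it uses base R only, and it has no smoothing or back-off for unseen words.

```r
# Toy corpus (made up); the real input would be the cleaned sample text
toy.text <- c("i love data science", "i love r", "i like data")

# Count how often each (first, second) word pair occurs across all lines
bigram.counts <- function(lines) {
  words <- strsplit(lines, "[[:space:]]+")
  firsts <- unlist(lapply(words, head, -1))   # every word except the last on each line
  seconds <- unlist(lapply(words, tail, -1))  # every word except the first on each line
  counts <- as.data.frame(table(first = firsts, second = seconds),
                          stringsAsFactors = FALSE)
  counts[counts$Freq > 0, ]                   # keep only pairs that actually occur
}

# Suggest the word that most often follows `word` (NA if `word` was never seen)
bigram.predict <- function(counts, word) {
  hits <- counts[counts$first == word, ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$second[which.max(hits$Freq)]
}

counts <- bigram.counts(toy.text)
print(bigram.predict(counts, "i"))     # "love" (follows "i" twice; "like" only once)
print(bigram.predict(counts, "data"))  # "science"
```

Ties are broken by whichever pair appears first in the table, and an unseen word returns NA. Handling such cases more gracefully is one of the design questions on which feedback would be welcome.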