By Sandy Sng
12 June 2018
This is the Milestone Report for the Coursera Data Science Capstone Project.
The goal of the Capstone Project is to create an algorithm and build a predictive text mining application that predicts the next word based on the previous words typed by a user. Using three datasets of English sentences (extracted from blogs, news, and Twitter), we will build and analyse basic n-gram models that predict the next word from frequently occurring word sequences.
The motivation for this Milestone Report is to:
1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you have amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
Download the three datasets from the Coursera-SwiftKey data source. We will only use the English versions (directory: en_US) for this analysis.
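# Download and unzip the data only if it is not already present
if (!file.exists("final")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}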
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
Examine the datasets for the following information: file sizes (in MB), line counts, word counts, and mean words per line.
library(stringi)
# Get file sizes
blogs.size <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
# Get no. of words in files
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
# Summary of the datasets
data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(blogs.size, news.size, twitter.size),
           num.lines = c(length(blogs), length(news), length(twitter)),
           num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
           mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
##    source file.size.MB num.lines num.words mean.num.words
## 1   blogs     200.4242    899288  37546239       41.75107
## 2    news     196.2775   1010242  34762395       34.40997
## 3 twitter     159.3641   2360148  30093413       12.75065
Cleaning the data involves removing URLs, special characters, punctuation, numbers, excess whitespace, and stopwords, and converting the text to lower case. Since the datasets are quite large, we will randomly sample 1% of each one to demonstrate the data cleaning and exploratory analysis.
library(tm)
# Sample the data (random at 1%)
set.seed(324)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
sample(news, length(news) * 0.01),
sample(twitter, length(twitter) * 0.01))
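# Create corpus and clean the data
corpus <- VCorpus(VectorSource(data.sample))
# Helper that replaces every match of a pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# URLs: match up to the next whitespace so the rest of the line is kept
corpus <- tm_map(corpus, toSpace, "(f|ht)tps?://[^[:space:]]+")
# Twitter handles: use a POSIX class, since \\s is not reliable inside brackets
corpus <- tm_map(corpus, toSpace, "@[^[:space:]]+")
# Wrap base functions in content_transformer() so documents keep their class;
# this makes a final conversion back to PlainTextDocument unnecessary
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)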
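To make the effect of these cleaning steps concrete, here is a minimal base R sketch that applies the same sequence to a single made-up sentence. The helper name `clean_text` and its tiny stopword list are assumptions for illustration only; the report itself uses the `tm` transformations and the full `stopwords("en")` list.

```r
# Hypothetical helper (not part of tm): mirrors the cleaning order used above
# on a plain character vector. The stopword list is a small illustrative subset.
clean_text <- function(x, stop.words = c("the", "a", "is", "at", "to", "and")) {
  x <- gsub("(f|ht)tps?://[^[:space:]]+", " ", x)  # replace URLs with a space
  x <- gsub("@[^[:space:]]+", " ", x)              # replace Twitter handles
  x <- tolower(x)                                  # lower case
  stop.pattern <- paste0("\\b(", paste(stop.words, collapse = "|"), ")\\b")
  x <- gsub(stop.pattern, " ", x, perl = TRUE)     # drop whole-word stopwords
  x <- gsub("[[:punct:]]+", "", x)                 # delete punctuation
  x <- gsub("[[:digit:]]+", "", x)                 # delete numbers
  x <- gsub("[[:space:]]+", " ", x)                # collapse runs of whitespace
  trimws(x)                                        # trim leading/trailing space
}

cleaned <- clean_text("Check https://example.com NOW, @sandy: the cat is at 42 Main St!")
print(cleaned)
```

Stopwords are removed before punctuation here, as in the corpus pipeline, so that contractions in a stopword list can still be matched with their apostrophes intact.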
Words with the highest frequency of occurrence are plotted first, in the center of the wordcloud.
library(wordcloud)
wordcloud(corpus, max.words = 1000, random.order = FALSE, rot.per = 0.3, use.r.layout = FALSE, colors = brewer.pal(4, "BuPu"))
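A word cloud shows relative frequency only visually, so a plain frequency table can be a useful companion. The sketch below counts words in a small made-up character vector using base R; `toy.text` is a stand-in for the cleaned sample, not the report's data, and the same idea could be applied to a term-document matrix built from the corpus.

```r
# Toy stand-in for cleaned text (an assumption for illustration only)
toy.text <- c("good morning good day", "good night", "morning coffee")

# Split each line on whitespace and flatten into one vector of tokens
tokens <- unlist(strsplit(toy.text, "[[:space:]]+"))

# Count each distinct word and order from most to least frequent
word.freq <- sort(table(tokens), decreasing = TRUE)

# Show the most frequent words
print(head(word.freq, 2))
```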
We have now completed the steps of getting and cleaning the data, and have partially explored the data using a wordcloud. Next, we will:
- increase the sample size from the current 1% to a larger share of the data,
- continue the Exploratory Data Analysis in “Part 2: Visualising Data using n-gram models”, building basic n-gram models that predict the next word from frequently occurring word sequences in the data,
- create a Shiny app with a user-friendly interface.
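As a rough illustration of the planned n-gram approach, the base R sketch below counts bigrams (adjacent word pairs) in a few made-up lines and predicts the next word as the most frequent follower of the last word typed. The function names `build_bigrams` and `predict_next` and the toy input are hypothetical; the final algorithm may use longer n-grams, backoff, or a different tokenizer.

```r
# Hypothetical helper: count adjacent word pairs in already-cleaned lines
build_bigrams <- function(lines) {
  pairs <- lapply(strsplit(lines, "[[:space:]]+"), function(w) {
    if (length(w) < 2) return(NULL)  # a single word has no bigram
    data.frame(first = head(w, -1), second = tail(w, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  # Sum a count of 1 per occurrence for each (first, second) pair
  counts <- aggregate(n ~ first + second, data = cbind(pairs, n = 1), FUN = sum)
  counts[order(-counts$n), ]
}

# Hypothetical helper: most frequent word observed after `word`, or NA if unseen
predict_next <- function(bigrams, word) {
  candidates <- bigrams[bigrams$first == word, ]
  if (nrow(candidates) == 0) return(NA_character_)
  as.character(candidates$second[which.max(candidates$n)])
}

toy.lines <- c("i love data science", "i love coffee", "i love data", "we love data")
bigrams <- build_bigrams(toy.lines)
print(predict_next(bigrams, "love"))
```

Returning `NA` for an unseen word keeps the sketch simple; a fuller model would need a fallback such as the overall most frequent word.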