Introduction

In this capstone project we apply data science in the area of natural language processing (NLP). This milestone report is part of the Coursera Data Science Capstone project and describes the exploratory data analysis performed on the course data set.

The data comes from a corpus called HC Corpora and contains texts collected from blogs, Twitter and news sites: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

This project will use the English data set.

Exploratory Data Analysis

Getting the data

library(ggplot2) 
library(kableExtra)
library(knitr)
library(tm)
library(SnowballC)
library(wordcloud)
library(stringi)
library(tidytext)
library(dplyr)

# download and unzip the data 
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")
}

# read the text data
DT_blog <- readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt")
DT_news <- readLines("Coursera-SwiftKey/final/en_US/en_US.news.txt")
DT_twitter <- readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt")

A summary of the data files is shown below.

# summarize information about the data files
summary_DT <- data.frame(filename = c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt"),
                         size = c(format(object.size(DT_blog), units = "auto"), 
                                  format(object.size(DT_news), units = "auto"), 
                                  format(object.size(DT_twitter), units = "auto")),
                         lines = c(length(DT_blog),length(DT_news),length(DT_twitter)))
# summary table
summary_DT %>%
  kable("html") %>%
  kable_styling()
filename            size      lines
en_US.blogs.txt     248.5 Mb  899288
en_US.news.txt      19.2 Mb   77259
en_US.twitter.txt   301.4 Mb  2360148
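
The table summarizes object size and line counts only; approximate word counts per file could be added with stringi, which is already loaded above. A minimal sketch (not run as part of this report):

# approximate word counts per file
words_blog    <- sum(stri_count_words(DT_blog))
words_news    <- sum(stri_count_words(DT_news))
words_twitter <- sum(stri_count_words(DT_twitter))
c(blogs = words_blog, news = words_news, twitter = words_twitter)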

Creating the sample data set

As the summary table above shows, the files are very large. To avoid performance problems we randomly choose 1% of the blogs and news data sets and 0.1% of the twitter data set to demonstrate data preprocessing, exploratory data analysis and the prediction algorithm. We also combine all three data sets together.

# subset the data
set.seed(123)
sample_blog <- sample(DT_blog, length(DT_blog) * 0.01)
sample_news <- sample(DT_news, length(DT_news) * 0.01)
sample_twitter <- sample(DT_twitter, length(DT_twitter) * 0.001)

# combine subsets together
text_sample  <- c(sample_blog, sample_news, sample_twitter)

Cleaning and preprocessing

We now continue with the sample text data set by cleaning and preprocessing the data. First we create a corpus and then perform the following steps to clean and simplify the data:

  1. Convert all characters to lower case.
  2. Remove numbers.
  3. Remove punctuation.
  4. Remove extra whitespace.
  5. Remove non-ASCII characters.
  6. Remove stop words.
  7. Perform stemming.

# create a corpus
sampleCorpus <- Corpus(VectorSource(text_sample))

# create a plain text document
sampleCorpus <- tm_map(sampleCorpus, PlainTextDocument)

# lower case
sampleCorpus <- tm_map(sampleCorpus, content_transformer(tolower))
# remove numbers
sampleCorpus <- tm_map(sampleCorpus, removeNumbers)
# remove punctuation
sampleCorpus <- tm_map(sampleCorpus, removePunctuation)
# remove extra whitespaces
sampleCorpus <- tm_map(sampleCorpus, stripWhitespace)
# remove all non-ASCII chars
sampleCorpus <- tm_map(sampleCorpus, content_transformer(function(x) iconv(x, "latin1", "ASCII", sub = "")))
# remove stop words
sampleCorpus <- tm_map(sampleCorpus, removeWords, stopwords("en"))
# stemming
sampleCorpus <- tm_map(sampleCorpus, stemDocument)
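
As a quick sanity check, a few cleaned documents can be printed to verify the transformations behaved as expected (an optional step, not required for the analysis):

# peek at the first cleaned documents
head(sapply(sampleCorpus, as.character), 3)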

Findings

The word cloud below shows the most frequent words.

wordcloud(sampleCorpus, max.words = 100, random.order = FALSE, colors=brewer.pal(8,"Dark2"))
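
The frequencies behind the word cloud can also be tabulated directly; a minimal sketch using a term-document matrix from tm (assuming the sample corpus is small enough to convert to a dense matrix):

# build a term-document matrix and list the most frequent terms
tdm <- TermDocumentMatrix(sampleCorpus)
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(term_freq, 10)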

Next we start to build a base for our prediction model. To create a prediction model we need to build N-grams from our sample data. N-grams capture relationships between words: they describe how often word A is followed by word B.

# convert corpus as a dataframe for tidytext
df <- data.frame(text = sapply(sampleCorpus, as.character), stringsAsFactors = FALSE)

# subsetting a dataframe for performance reasons
df_sample <- head(df, 1000)

# create bigrams
bigrams_sample <- df_sample %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

head(bigrams_sample)
##           bigram
## 1     sum unjust
## 1.1 unjust wound
## 1.2    wound men
## 1.3      men let
## 1.4       let us
## 1.5  us overlook
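
Counting how often each bigram occurs gives the basic frequency table a next-word prediction model builds on. A minimal sketch with dplyr, using the bigram column created above:

# count bigram frequencies, most common first
bigram_counts <- bigrams_sample %>%
  count(bigram, sort = TRUE)
head(bigram_counts)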

Next steps