In this capstone project we apply data science to natural language processing (NLP). This milestone report is part of the Coursera Data Science Capstone and describes the exploratory data analysis performed on the course data set.
The data come from a corpus called HC Corpora, which contains text collected from blogs, Twitter and news sites: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
This project will use the English dataset:
library(ggplot2)
library(kableExtra)
library(knitr)
library(tm)
library(SnowballC)
library(wordcloud)
library(stringi)
library(tidytext)
library(dplyr)
# download and unzip the data
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip", exdir = "Coursera-SwiftKey")
}
# read the text data
DT_blog <- readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt")
DT_news <- readLines("Coursera-SwiftKey/final/en_US/en_US.news.txt")
DT_twitter <- readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt")
A summary of the data files (in-memory object size and number of lines) is shown below.
# summarize information about the data files
summary_DT <- data.frame(filename = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
                         size = c(format(object.size(DT_blog), units = "auto"),
                                  format(object.size(DT_news), units = "auto"),
                                  format(object.size(DT_twitter), units = "auto")),
                         lines = c(length(DT_blog), length(DT_news), length(DT_twitter)))
# summary table
summary_DT %>%
  kable("html") %>%
  kable_styling()
| filename | size | lines |
|---|---|---|
| en_US.blogs.txt | 248.5 Mb | 899288 |
| en_US.news.txt | 19.2 Mb | 77259 |
| en_US.twitter.txt | 301.4 Mb | 2360148 |
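Since the stringi package is already loaded, a word count per file could be added to the summary as well. The sketch below is one possible way to do this; the words column is my addition and is not part of the original summary.
# a possible extension: add word counts per file using stringi
summary_DT$words <- c(sum(stri_count_words(DT_blog)),
                      sum(stri_count_words(DT_news)),
                      sum(stri_count_words(DT_twitter)))
summary_DT %>%
  kable("html") %>%
  kable_styling()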
As the summary table above shows, the files are very large. To avoid performance problems we randomly sample 1% of the blogs and news data sets and 0.1% of the Twitter data set to demonstrate data preprocessing, exploratory data analysis and the prediction algorithm. We also combine all three samples into one data set.
# subset the data
set.seed(123)
sample_blog <- sample(DT_blog, length(DT_blog) * 0.01)
sample_news <- sample(DT_news, length(DT_news) * 0.01)
sample_twitter <- sample(DT_twitter, length(DT_twitter) * 0.001)
# combine subsets together
text_sample <- c(sample_blog, sample_news, sample_twitter)
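Because the sampling depends on the random seed, it can be convenient to save the combined sample to disk and reload it in later sessions. A minimal sketch; the file name sample_data.rds is an assumption, not part of the original analysis.
# save the combined sample so it can be reused without re-sampling (hypothetical file name)
saveRDS(text_sample, "sample_data.rds")
# text_sample <- readRDS("sample_data.rds")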
We now continue with the sampled text by cleaning and preprocessing the data. First we create a corpus and then perform the following steps to clean and simplify the text:
# create a corpus
sampleCorpus <- Corpus(VectorSource(text_sample))
# create a plain text document
sampleCorpus <- tm_map(sampleCorpus, PlainTextDocument)
# lower case
sampleCorpus <- tm_map(sampleCorpus, content_transformer(tolower))
# remove numbers
sampleCorpus <- tm_map(sampleCorpus, removeNumbers)
# remove punctuation
sampleCorpus <- tm_map(sampleCorpus, removePunctuation)
# remove extra whitespaces
sampleCorpus <- tm_map(sampleCorpus, stripWhitespace)
# remove all non-ASCII characters
sampleCorpus <- tm_map(sampleCorpus, content_transformer(function(x) iconv(x, "latin1", "ASCII", sub = "")))
# remove stop words
sampleCorpus <- tm_map(sampleCorpus, removeWords, stopwords("en"))
# stemming
sampleCorpus <- tm_map(sampleCorpus, stemDocument)
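To verify that the transformations behaved as expected, a few cleaned documents can be inspected before moving on. A small sketch, mirroring the as.character conversion used later in this report; the choice of the first three documents is arbitrary.
# look at a few cleaned documents to check the preprocessing
lapply(sampleCorpus[1:3], as.character)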
The word cloud below shows the most frequent words.
wordcloud(sampleCorpus, max.words = 100, random.order = FALSE, colors=brewer.pal(8,"Dark2"))
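Since ggplot2 is loaded, the term frequencies behind the word cloud can also be shown as a bar chart of the most common terms. A sketch based on a term-document matrix of the cleaned corpus; the cut-off of 20 terms is an arbitrary choice, not taken from the original report.
# build a term-document matrix and plot the 20 most frequent terms
tdm <- TermDocumentMatrix(sampleCorpus)
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
freq_df <- data.frame(word = names(term_freq)[1:20], freq = term_freq[1:20])
ggplot(freq_df, aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "term", y = "frequency")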
Next we start to build the foundation for our prediction model. To create a prediction model we need to build N-grams from our sample data. N-grams capture relationships between words: they describe how often word A is followed by word B.
# convert the corpus to a data frame for tidytext
df <- data.frame(text = sapply(sampleCorpus, as.character), stringsAsFactors = FALSE)
# subset the data frame for performance reasons
df_sample <- head(df, 1000)
# create bigrams
bigrams_sample <- df_sample %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
head(bigrams_sample)
## bigram
## 1 sum unjust
## 1.1 unjust wound
## 1.2 wound men
## 1.3 men let
## 1.4 let us
## 1.5 us overlook
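The most frequent bigrams can then be counted; these counts are the building blocks of a simple next-word prediction table. A sketch using dplyr: the bigram column name comes from the unnest_tokens call above, while the split into word1/word2 with tidyr (an additional package, not loaded earlier in this report) is an assumption about how the prediction table might be organised.
library(tidyr)
# count bigram frequencies and split them into (word1, word2) pairs
bigram_counts <- bigrams_sample %>%
  count(bigram, sort = TRUE) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ")
head(bigram_counts, 10)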