This is a milestone report for the ‘Data Science Capstone’ course offered by Johns Hopkins University through Coursera. The goal of the project is to build a Predictive Text Model App.
As a first step in building the App, we need to source a natural language dataset. The dataset has been provided by the course, which in turn sourced it from SwiftKey. In this report we conduct an Exploratory Data Analysis (EDA) of the dataset. The dataset comprises text from three sources (blogs, news and Twitter) in four languages (English, German, Finnish and Russian). We will use the English dataset to build our App.
The requisite libraries for our EDA are loaded as follows:
library(stringi); library(tm); library(wordcloud); library(RWeka)
library(ggplot2); library(gridExtra)
The dataset is downloaded and loaded as follows:
url <- 'https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip'
destination <- '.\\Project_Data.zip'
if(!file.exists(destination)){
download.file(url=url, destfile=destination, method='curl')
unzip(zipfile=destination, exdir='.')
}
# Read each file; en_US.news.txt is opened in binary mode ('rb') so that an
# embedded control character does not truncate the read in text mode
file_blogs <- file(".\\final\\en_US\\en_US.blogs.txt")
data_blogs <- readLines(file_blogs, encoding = "UTF-8", skipNul = TRUE)
close(file_blogs)
file_news <- file(".\\final\\en_US\\en_US.news.txt", 'rb')
data_news <- readLines(file_news, encoding = "UTF-8", skipNul = TRUE)
close(file_news)
file_twitter <- file(".\\final\\en_US\\en_US.twitter.txt")
data_twitter <- readLines(file_twitter, encoding = "UTF-8", skipNul = TRUE)
close(file_twitter)
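For reference, the approximate size of each raw file on disk can be checked as follows. This is a supplementary check using the same paths as above; base R's file.size() is assumed to be available (R 3.2 or later):
# Approximate size of each raw file in megabytes (same paths as above)
file_sizes_mb <- round(c(Blogs   = file.size(".\\final\\en_US\\en_US.blogs.txt"),
                         News    = file.size(".\\final\\en_US\\en_US.news.txt"),
                         Twitter = file.size(".\\final\\en_US\\en_US.twitter.txt")) / 1024^2, 1)
print(file_sizes_mb)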
We have calculated three summary statistics for each source, namely line count, word count and words per line. The computation and resulting table are shown below:
# Count lines
line_count <- c(length(data_blogs),
                length(data_news),
                length(data_twitter))
# Count words
word_count <- c(sum(stri_count_words(data_blogs)),
                sum(stri_count_words(data_news)),
                sum(stri_count_words(data_twitter)))
# Display summary table
data_summary <- data.frame(Lines = line_count,
                           Words = word_count,
                           Words_per_line = word_count / line_count)
row.names(data_summary) <- c('Blogs', 'News', 'Twitter')
print(data_summary)
          Lines    Words Words_per_line
Blogs    899288 37546239       41.75107
News    1010242 34762395       34.40997
Twitter 2360148 30093413       12.75065
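As a supplementary view (not part of the summary table above), the already-loaded stringi package can report character-level statistics for each source; a minimal sketch:
# Character-level statistics per source (stringi); Chars = total characters,
# CharsNWhite = characters excluding whitespace
char_stats <- rbind(Blogs   = stri_stats_general(data_blogs),
                    News    = stri_stats_general(data_news),
                    Twitter = stri_stats_general(data_twitter))
print(char_stats[, c("Chars", "CharsNWhite")])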
The full dataset is too voluminous to process with the available computing resources. Hence, we will sample 10,000 lines at random from each of the three datasets.
set.seed(12354)
sample_size <- 10000
sample_blogs <- sample(data_blogs, size=sample_size)
sample_news <- sample(data_news, size=sample_size)
sample_twitter <- sample(data_twitter, size=sample_size)
remove(data_blogs); remove(data_news); remove(data_twitter) # Freeing up memory
We will perform the following steps to clean our sampled data using the tm package:
(a) Convert the text from UTF-8 to ASCII, replacing non-convertible characters with their byte codes
(b) Build a corpus from the character vector
(c) Convert all text to lower case
(d) Remove profanity and other banned words
(e) Remove punctuation
(f) Remove numbers
(g) Strip extra whitespace
Note: The code chunk including the list of banned_words is hidden.
cleanData <- function(input_data){
  output_data <- iconv(input_data, from = "UTF-8", to = "ASCII", sub = "byte") # Step (a)
  output_data <- VCorpus(VectorSource(output_data))                            # Step (b)
  output_data <- tm_map(output_data, content_transformer(tolower))             # Step (c)
  output_data <- tm_map(output_data, removeWords, banned_words)                # Step (d)
  output_data <- tm_map(output_data, removePunctuation)                        # Step (e)
  output_data <- tm_map(output_data, removeNumbers)                            # Step (f)
  tm_map(output_data, stripWhitespace)                                         # Step (g)
}
clean_blogs <- cleanData(sample_blogs)
clean_news <- cleanData(sample_news)
clean_twitter <- cleanData(sample_twitter)
remove(sample_blogs); remove(sample_news); remove(sample_twitter) # Freeing up memory
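As an optional sanity check (not part of the original chunk), the first document of a cleaned corpus can be inspected to confirm that the transformations were applied:
# Peek at the first cleaned blog entry: it should be lower case, with
# punctuation, numbers and banned words removed
content(clean_blogs[[1]])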
In this section we analyze the frequency of the most common words by plotting a word cloud for each of the three datasets:
wordcloud(clean_blogs, max.words = 200, random.order = FALSE, colors = "black")
title("Top 200 most frequent words in Blogs dataset")
wordcloud(clean_news, max.words = 200, random.order = FALSE, colors = "orange")
title("Top 200 most frequent words in News dataset")
wordcloud(clean_twitter, max.words = 200, random.order = FALSE, colors = "skyblue")
title("Top 200 most frequent words in Twitter dataset")
In this section, we will look at the frequency plots of bigrams and trigrams for the three datasets. The observations and inferences are as follows:
analyzeNgrams <- function(input_data){
  # Tokenizers for 2-grams and 3-grams (RWeka)
  bigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
  trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
  # Term-document matrices of bigrams and trigrams
  bigram_tdm <- TermDocumentMatrix(
    input_data, control = list(tokenize = bigram_tokenizer))
  trigram_tdm <- TermDocumentMatrix(
    input_data, control = list(tokenize = trigram_tokenizer))
  list(bigram = bigram_tdm, trigram = trigram_tdm)
}
plotNgrams <- function(bigram_tdm, trigram_tdm, bigramLow, trigramLow, title){
  # Keep only n-grams occurring at least lowfreq times, then sum their counts
  bigrams <- findFreqTerms(bigram_tdm, lowfreq = bigramLow)
  bigrams_count <- rowSums(as.matrix(bigram_tdm[bigrams, ]))
  data_bigram <- data.frame(Words = bigrams, Count = bigrams_count)
  trigrams <- findFreqTerms(trigram_tdm, lowfreq = trigramLow)
  trigrams_count <- rowSums(as.matrix(trigram_tdm[trigrams, ]))
  data_trigram <- data.frame(Words = trigrams, Count = trigrams_count)
  # Horizontal bar charts, ordered by frequency
  g1 <- ggplot(data_bigram, aes(x = reorder(Words, Count), y = Count)) +
    geom_col() + coord_flip() +
    xlab("Bigrams") + ylab("Count") +
    labs(title = paste("Bigrams Frequency Chart for", title))
  g2 <- ggplot(data_trigram, aes(x = reorder(Words, Count), y = Count)) +
    geom_col() + coord_flip() +
    xlab("Trigrams") + ylab("Count") +
    labs(title = paste("Trigrams Frequency Chart for", title))
  grid.arrange(g1, g2)
}
tdm_blogs <- analyzeNgrams(clean_blogs)
plotNgrams(tdm_blogs$bigram, tdm_blogs$trigram, 400, 60, 'Blogs dataset')
tdm_news <- analyzeNgrams(clean_news)
plotNgrams(tdm_news$bigram, tdm_news$trigram, 300, 40, 'News dataset')
tdm_twitter <- analyzeNgrams(clean_twitter)
plotNgrams(tdm_twitter$bigram, tdm_twitter$trigram, 130, 25, 'Twitter dataset')
The analysis above gives us a clear course of action. First, we will combine the three datasets and split the combined dataset into training and testing sets. Then, we will evaluate several candidate modelling approaches.
The final choice of model will depend on its accuracy on the testing dataset, its prediction speed and the computing resources it requires.
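For illustration only, the sketch below shows one way bigram counts such as those computed above could drive a naive next-word lookup. The function predictNextWord is hypothetical, as is the assumption that a bigram frequency data frame (like the data_bigram built inside plotNgrams) is available; it is not the final model, which will be chosen as described above.
# Illustrative sketch only: predict the next word as the most frequent bigram
# starting with the given word. 'bigram_counts' is assumed to be a data frame
# with columns Words (e.g. "thank you") and Count.
predictNextWord <- function(word, bigram_counts){
  # Keep bigrams whose first token matches the input word
  candidates <- bigram_counts[grepl(paste0("^", word, " "), bigram_counts$Words), ]
  if (nrow(candidates) == 0) return(NA_character_)
  # Pick the highest-count candidate and return its second token
  best <- as.character(candidates$Words[which.max(candidates$Count)])
  strsplit(best, " ")[[1]][2]
}
# Hypothetical usage, assuming a bigram count data frame is available:
# predictNextWord("thanks", data_bigram)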