The main purpose of this milestone report is to demonstrate familiarity with the datasets and the subject matter expertise required to complete the Coursera data science capstone project, designed by Johns Hopkins University and sponsored by SwiftKey. The goal of the capstone project itself is to create a predictive text model from a large corpus of text documents used as training data. Natural Language Processing (NLP) techniques are applied to analyze the data and build the predictive text model, which will be packaged as a Shiny R application in the capstone project.
The milestone report has the following sections:
- Import the datasets
- Explore the main characteristics of the datasets
- Merge and sample the datasets
- Clean the merged dataset
- Most common unigrams
- Most common bigrams
- Most common trigrams
- Next steps
library(stringi)   # fast string processing and text statistics
library(quanteda)  # tokenization and n-gram generation
library(ggplot2)   # plotting
The datasets were downloaded from the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
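For reproducibility, the download and extraction can also be scripted; this is a minimal sketch (the destination file name and extraction directory are assumptions, not part of the original workflow):
zip.url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(zip.url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip", exdir = ".")  # extracts the language subfolders, including en_US
}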
Load the blogs dataset.
blogs <- readLines("C:/Open/Coursera/Milestone/en_US/en_US.blogs.txt",
                   encoding = "UTF-8", skipNul = TRUE)
blogs <- iconv(blogs, from = "latin1", to = "UTF-8", sub = "")  # sub = "" drops characters that cannot be converted
Load the news dataset.
news <- readLines("C:/Open/Coursera/Milestone/en_US/en_US.news.txt",
                  encoding = "UTF-8", skipNul = TRUE)
news <- iconv(news, from = "latin1", to = "UTF-8", sub = "")
Load the twitter dataset.
twitter <- readLines("C:/Open/Coursera/Milestone/en_US/en_US.twitter.txt",
                     encoding = "UTF-8", skipNul = TRUE)
twitter <- iconv(twitter, from = "latin1", to = "UTF-8", sub = "")
Determine the number of entries and words in each dataset (the fourth element returned by stri_stats_latex() is the word count).
paste("The blogs dataset has ", length(blogs), " blog posts with a total of ",
      stri_stats_latex(blogs)[4], " words.", sep = "")
## [1] "The blogs dataset has 899288 blog posts with a total of 37864968 words."
paste("The news dataset has ", length(news), " news articles with a total of ",
      stri_stats_latex(news)[4], " words.", sep = "")
## [1] "The news dataset has 77259 news articles with a total of 2665727 words."
paste("The twitter dataset has ", length(twitter), " tweets with a total of ",
      stri_stats_latex(twitter)[4], " words.", sep = "")
## [1] "The twitter dataset has 2360148 tweets with a total of 30553797 words."
Merge the three datasets into a single dataset.
total.corpus <- c(blogs, news, twitter)
paste("The merged dataset has ", length(total.corpus), " entries with ",
      stri_stats_latex(total.corpus)[4], " words.", sep = "")
## [1] "The merged dataset has 3336695 entries with 71084492 words."
The merged dataset is too large for a computer with a limited amount of RAM. Therefore, a relatively small subset of entries will be sampled and used for the exploratory analysis.
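The memory pressure can be checked directly before sampling (a minimal sketch; the exact figure depends on the machine and R build):
format(object.size(total.corpus), units = "MB")  # approximate in-memory size of the merged character vector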
Randomly permute the order of the entries in the merged dataset before drawing a smaller exploratory sample.
set.seed(1813)
corpus <- total.corpus[sample(length(total.corpus))]
Take 100,000 entries (out of more than 3.3 million in the merged dataset) for the exploratory analysis.
set.seed(1813)
corpus <- sample(corpus, size = 100000, replace = FALSE)
paste("The exploratory dataset has ", length(corpus), " entries with ",
      stri_stats_latex(corpus)[4], " words.", sep = "")
## [1] "The exploratory dataset has 100000 entries with 2133636 words."
Base R string functions (gsub, strsplit, tolower, trimws) are used to clean the dataset. First, remove special characters.
corpus <- gsub("#"," ", corpus)
corpus <- gsub("-"," ", corpus)
corpus <- gsub("_"," ", corpus)
corpus <- gsub("\'"," ", corpus)
corpus <- gsub("\""," ", corpus)
Split each entry at punctuation marks other than the comma, so that n-grams are not formed across sentence or clause boundaries.
corpus <- unlist(strsplit(corpus, ".", fixed=TRUE))
corpus <- unlist(strsplit(corpus, "?", fixed=TRUE))
corpus <- unlist(strsplit(corpus, "!", fixed=TRUE))
corpus <- unlist(strsplit(corpus, "(", fixed=TRUE))
corpus <- unlist(strsplit(corpus, ")", fixed=TRUE))
corpus <- unlist(strsplit(corpus, "/", fixed=TRUE))
corpus <- unlist(strsplit(corpus, ":", fixed=TRUE))
corpus <- unlist(strsplit(corpus, ";", fixed=TRUE))
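The chain of fixed-string splits can likewise be collapsed into a single regular-expression split; a minimal sketch that produces the same segments:
corpus <- unlist(strsplit(corpus, "[.?!()/:;]"))  # split at any of the punctuation marks listed above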
Mark the end of each sentence with the symbol |, so that n-grams spanning sentence boundaries can be identified and removed later.
corpus <- paste(corpus,"|")
Convert all words to lower case.
corpus <- tolower(corpus)
Remove leading and trailing whitespace from every sentence.
corpus <- trimws(corpus)
Collapse runs of two or more whitespace characters into a single space.
corpus <- gsub("\\s+", " ", corpus)
Split each sentence into individual words.
corpus <- unlist(strsplit(corpus, "\\s"))
Transform the cleaned unigram dataset into a data frame.
word.freq <- data.frame(unlist(corpus))
Find the most common unigrams and order them.
unigrams <- as.data.frame(table(word.freq))
unigrams <- unigrams[order(-unigrams$Freq), ]
Delete the symbol |, which is not needed for unigrams.
unigrams <- unigrams[!(unigrams$word.freq=="|"), ]
Plot the top 10 unigrams.
top.unigrams <- head(unigrams, 10)
top.unigrams$word.freq <- factor(top.unigrams$word.freq,
                                 levels = top.unigrams$word.freq[order(-top.unigrams$Freq)])
g1 <- ggplot(top.unigrams, aes(word.freq, Freq))
g1 <- g1 + geom_bar(stat = 'identity', col = "dark green", fill = "light green")
g1 <- g1 + labs(title = "Most Frequent Unigrams", x = "", y = "Frequency")
g1 <- g1 + theme(plot.title = element_text(lineheight = .5, hjust = 0.5, face = "bold"))
g1
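As an optional extra check (not part of the original figures), the ordered unigram table also shows how many distinct words are needed to cover a given share of all word instances; a minimal sketch:
cum.share <- cumsum(unigrams$Freq) / sum(unigrams$Freq)
paste("Distinct words covering 50% of all word instances:", which(cum.share >= 0.5)[1])
paste("Distinct words covering 90% of all word instances:", which(cum.share >= 0.9)[1])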
Package quanteda is used to find the most common bigrams (two consecutive words).
bigram.data <- tokens_ngrams(corpus, n = 2L, concatenator=" ")
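For reference, since corpus is at this point a plain character vector of single words, the same kind of word pairs can also be built with base R by pasting adjacent elements (a minimal sketch, shown only as an equivalent formulation):
bigram.data <- paste(head(corpus, -1), tail(corpus, -1))  # adjacent word pairs separated by a space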
Bigrams that start or end with the sentence marker | span a sentence boundary and are removed.
bigram.data <- grep("^[^\\|]", bigram.data, perl=TRUE, value=TRUE)
bigram.data <- grep("[^\\|]$", bigram.data, perl=TRUE, value=TRUE)
Transform the cleaned bigram dataset into a data frame.
bigram.df <- data.frame(unlist(bigram.data))
Find the most common bigrams and order them.
bigram.freq <- as.data.frame(table(bigram.df))
bigram.freq <- bigram.freq[order(-bigram.freq$Freq), ]
Plot the top 10 bigrams.
top.bigrams <- head(bigram.freq, n = 10)
top.bigrams$bigram.df <- factor(top.bigrams$bigram.df,
                                levels = top.bigrams$bigram.df[order(-top.bigrams$Freq)])
g2 <- ggplot(top.bigrams, aes(bigram.df, Freq))
g2 <- g2 + geom_bar(stat = 'identity', col = "dark blue", fill = "light blue")
g2 <- g2 + labs(title = "Most Frequent Bigrams", x = "", y = "Frequency")
g2 <- g2 + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g2 <- g2 + theme(plot.title = element_text(lineheight = .5, hjust = 0.5, face = "bold"))
g2
Package quanteda is used to find the most common trigrams (three consecutive words).
trigram.data <- tokens_ngrams(corpus, n = 3L, concatenator=" ")
Trigrams that contain the sentence marker | at any position span a sentence boundary and are removed.
trigram.data <- grep("\\|", trigram.data, value = TRUE, invert = TRUE)  # drop any trigram containing |, including in the middle position
Transform the cleaned trigram dataset into a data frame.
trigram.df <- data.frame(unlist(trigram.data))
Find the most common trigrams and order them.
trigram.freq <- as.data.frame(table(trigram.df))
trigram.freq <- trigram.freq[order(-trigram.freq$Freq), ]
Plot the top 10 trigrams.
top.trigrams <- head(trigram.freq, n = 10)
top.trigrams$trigram.df <- factor(top.trigrams$trigram.df,
                                  levels = top.trigrams$trigram.df[order(-top.trigrams$Freq)])
g3 <- ggplot(top.trigrams, aes(trigram.df, Freq))
g3 <- g3 + geom_bar(stat = 'identity', col = "black", fill = "red")
g3 <- g3 + labs(title = "Most Frequent Trigrams", x = "", y = "Frequency")
g3 <- g3 + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g3 <- g3 + theme(plot.title = element_text(lineheight = .5, hjust = 0.5, face = "bold"))
g3
The next step is to build a predictive algorithm based primarily on the bigram and trigram frequencies. The algorithm will predict the most likely next word for a given word or pair of words. Finally, a Shiny app will be developed to provide an easy, interactive way to use the predictive algorithm.
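As a first illustration of how such a lookup could work, the trigram frequency table built above can already be queried: given two preceding words, return the most frequent third word. This is only a rough sketch under the assumptions of this report (the eventual model will need back-off to bigrams and unigrams plus smoothing), and the function name predict.next.word is hypothetical:
predict.next.word <- function(w1, w2, trigram.freq) {
  prefix <- paste(tolower(w1), tolower(w2), "")                      # e.g. "thanks for "
  hits <- trigram.freq[startsWith(as.character(trigram.freq$trigram.df), prefix), ]
  if (nrow(hits) == 0) return(NA_character_)                         # no matching trigram observed
  sub("^\\S+ \\S+ ", "", as.character(hits$trigram.df[1]))           # third word of the most frequent match
}
predict.next.word("thanks", "for", trigram.freq)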