This is a milestone report for the Data Science Capstone Project offered by Johns Hopkins University through Coursera. The purpose of the capstone project is to create a predictive text algorithm from a corpus of text using natural language processing.
This milestone report describes the process of loading and cleaning the data and presents an exploratory analysis of the training dataset. The analysis shows how frequently words appear and lists some of the most commonly used words in the form of n-grams (unigrams, bigrams and trigrams). A random 5% sample of the data is used to demonstrate the ability to create a predictive algorithm based on these n-grams. The report concludes with bar plots of the unigrams, bigrams and trigrams produced from the sample data.
#Loading the libraries
library(stringr)            # string manipulation
library(ggplot2)            # plotting
library(dplyr)              # data manipulation and piping
library(tm)                 # text mining utilities
library(stringi)            # fast string statistics (stri_count_words)
library(quanteda)           # tokenization, n-grams and document-feature matrices
library(quanteda.textstats) # textstat_frequency() (a separate package in quanteda >= 3)
library(ngram)              # n-gram utilities
library(wordcloud2)         # word cloud visualization
The data from SwiftKey (provided on the course website) is downloaded in zip format. The English-language versions of the "blogs", "news" and "twitter" text files are extracted and used for the project. UTF-8, the most common encoding for web pages, is used for compatibility. The 'news' file is read in binary mode (to account for information stored as bytes), since otherwise a large number of lines would be dropped. Non-English characters are removed using the 'iconv' function.
The predictive algorithm is expected to reject profanity, so profane words are removed using the code below. The list of profanities is borrowed from Google's badwords.txt.
#Setting up the folder, downloading data and unzipping the file
#(I) Setting the path to the directory storing the data
setwd("E:/Coursera/Data Science Capstone/")
#(II) Downloading the file (only if it has not been downloaded already)
if(!file.exists("./Data")){
  dir.create("./Data")
  fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(fileURL, destfile="./Data/Swiftkey.zip", mode="wb")
  #(III) Unzipping the downloaded file
  unzip(zipfile="./Data/Swiftkey.zip", exdir="./Data")
}
# Adding the file containing the profanities. This will be used to remove profanities from our sample.
if (!file.exists("./Data/profanity.txt")) {
  download.file(
    url = "https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/badwordslist/badwords.txt",
    destfile = "./Data/profanity.txt")
}
profanity <- readLines("./Data/profanity.txt")
# Loading the blogs, news and twitter files (connections are closed after reading)
blogs_file <- file("./Data/final/en_US/en_US.blogs.txt")
blogs <- readLines(blogs_file, encoding = "UTF-8", skipNul = TRUE)
close(blogs_file)
news_file <- file("./Data/final/en_US/en_US.news.txt", "rb")
news <- readLines(news_file, encoding = "UTF-8", skipNul = TRUE)
close(news_file)
twitter_file <- file("./Data/final/en_US/en_US.twitter.txt")
twitter <- readLines(twitter_file, encoding = "UTF-8", skipNul = TRUE)
close(twitter_file)
#Removing all non-English (non-ASCII) characters
blogs <- iconv(blogs, "latin1", "ASCII", sub="")
news <- iconv(news, "latin1", "ASCII", sub="")
twitter <- iconv(twitter, "latin1", "ASCII", sub="")
The number of lines and words, the mean number of words per line (or post) and the maximum number of words per line are calculated as part of the exploratory data analysis. As expected, Twitter lines tend to be the shortest, while blog lines are the longest. With an average of about 13 words per line, Twitter lives up to its reputation as a microblogging platform.
#Creating the table summarizing the statistics of the three files.
nlines_blogs <- length(blogs)
nlines_news <- length(news)
nlines_twitter <- length(twitter)
count_words_blogs <- stri_count_words(blogs)
count_words_news <- stri_count_words(news)
count_words_twitter <- stri_count_words(twitter)
sum_words_blogs <- sum(count_words_blogs)
sum_words_news <- sum(count_words_news)
sum_words_twitter <- sum(count_words_twitter)
mean_words_blogs <- mean(count_words_blogs)
mean_words_news <- mean(count_words_news)
mean_words_twitter <- mean(count_words_twitter)
max_words_blogs <- max(count_words_blogs)
max_words_news <- max(count_words_news)
max_words_twitter <- max(count_words_twitter)
summary <- data.frame(filename = c("Blogs", "News", "Twitter"),
number_lines = c(nlines_blogs, nlines_news, nlines_twitter),
total_words = c(sum_words_blogs, sum_words_news, sum_words_twitter ),
mean_words =c(mean_words_blogs, mean_words_news, mean_words_twitter ),
max_words =c(max_words_blogs, max_words_news, max_words_twitter ))
summary
## filename number_lines total_words mean_words max_words
## 1 Blogs 899288 37510168 41.71096 6725
## 2 News 1010242 34749301 34.39701 1796
## 3 Twitter 2360148 30088605 12.74861 47
The sample data is 5 percent of the total data extracted for this report. A small sample is used to reduce computation time. Lines from the blogs, news and twitter files are combined and randomly sampled. At 5 percent, the sample contains more than 200,000 lines, which I believe is a sufficient size to extract the information needed for the predictive algorithm.
The sample data is then tokenized, that is, the text is split into individual words that serve as machine-readable input. The tokens are converted into a document-feature matrix, which is used to calculate the top features for each of the n-grams listed below. During tokenization all words are converted to lower case, and punctuation, numbers, separators, symbols and URLs are removed. Stop words such as "a", "as", "of", "for" and "the" are removed as well. Hyphenated words are split, and words sharing a stem, such as "computing", "computers" and "computation", are reduced to their common stem.
Unigrams, bigrams and trigrams are created below. These will play an important role in the predictive algorithms.
#Creating a random subset of the overall data:
set.seed(123456789)
all_sample <- c(sample(blogs, round(nlines_blogs * 0.05)),
                sample(news, round(nlines_news * 0.05)),
                sample(twitter, round(nlines_twitter * 0.05)))
#Data tokenization (lower-casing and stemming are applied with tokens_tolower() and
#tokens_wordstem(), since tokens() itself has no toLower or stem argument)
token_text <- tokens(all_sample, remove_punct = TRUE, remove_numbers = TRUE, remove_separators = TRUE,
                     remove_symbols = TRUE, remove_url = TRUE, split_hyphens = TRUE)
token_text <- tokens_tolower(token_text)
token_text <- tokens_wordstem(token_text, language = "english")
#Cleaning profanity from the data
token_textcl <- tokens_remove(token_text, profanity, padding = TRUE)
#Creating n-grams, starting with unigrams. Stop words are removed once and reused for all three n-grams.
token_nostop <- tokens_remove(token_textcl, stopwords('english'))
n1gram <- quanteda::dfm(tokens_ngrams(token_nostop, n = 1))
#Creating bigrams, replacing the underscore separating the words with a space
n2gram <- quanteda::dfm(tokens_ngrams(token_nostop, n = 2, concatenator = " "))
#Creating trigrams
n3gram <- quanteda::dfm(tokens_ngrams(token_nostop, n = 3, concatenator = " "))
For each n-gram type, frequency falls off sharply with rank: a small number of common words (or word combinations) are repeated very often, while most appear only rarely. Comparing the most common features, bigrams occur less frequently than unigrams, and trigrams are the least frequent of the three. Although different numbers of top-ranked features are plotted (n = 1000, 100 and 20), the same pattern holds: the frequency of occurrence of an n-gram decreases sharply with its rank. These long-tailed rank-frequency plots also support the use of random sampling, since the most common words are as likely to appear in a small sample as in the full data.
#Plotting the frequency of words against their rank
textstat_frequency(n1gram, n= 1000) %>%
ggplot(aes(x= rank, y = frequency, color=frequency) ) + geom_point() + labs(x= "Ranking of Words", y = "Total Count") + ggtitle(" Ranking the Words in the Unigram") + theme(legend.position = "none")
textstat_frequency(n2gram, n= 100) %>%
ggplot(aes(x= rank, y = frequency, color=frequency) ) + geom_point() + labs(x= "Ranking of Words", y = "Total Count") + ggtitle(" Ranking the Words in the Bigram") + theme(legend.position = "none")
textstat_frequency(n3gram, n= 20) %>%
ggplot(aes(x= rank, y = frequency, color=frequency) ) + geom_point() + labs(x= "Ranking of Words", y = "Total Count") + ggtitle(" Ranking the Words in the Trigram") + theme(legend.position = "none")
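To quantify how concentrated the distribution is, the cumulative share of all tokens covered by the top-ranked unigrams can be computed directly from the frequency table. The sketch below is illustrative only; the 50% and 90% thresholds are arbitrary choices, not project requirements.
#Sketch: how many distinct unigrams cover 50% and 90% of all tokens in the sample?
freq_tab <- textstat_frequency(n1gram)                        # all features, sorted by frequency
cum_share <- cumsum(freq_tab$frequency) / sum(freq_tab$frequency)
words_50 <- which(cum_share >= 0.5)[1]                        # rank at which 50% coverage is reached
words_90 <- which(cum_share >= 0.9)[1]                        # rank at which 90% coverage is reached
data.frame(coverage = c("50%", "90%"), n_unigrams = c(words_50, words_90))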
The top eight features of each n-gram are ranked and shown in the bar plots below. The top features are first extracted and then converted into data frames, which serve as input to the code for the bar plots.
#Creating the top features for the three ngrams and converting into dataframes.
topfeatures1 <- topfeatures(n1gram, 8)
topfeatures1 <- sort(topfeatures1, decreasing = TRUE)
topfeatures1.df <- data.frame(words=names(topfeatures1), freq=topfeatures1)
topfeatures2 <- topfeatures(n2gram, 8)
topfeatures2 <- sort(topfeatures2, decreasing = TRUE)
topfeatures2.df <- data.frame(words=names(topfeatures2), freq=topfeatures2)
topfeatures3 <- topfeatures(n3gram, 8)
topfeatures3 <- sort(topfeatures3, decreasing = TRUE)
topfeatures3.df <- data.frame(words=names(topfeatures3), freq=topfeatures3)
Again, the sharp decrease in frequency with rank is apparent for each of the three n-grams: a few words are used far more often than others, and English text relies heavily on this small set of frequent words. A word cloud is also created to show the relative occurrence of specific words in the unigram; the size of each word decreases with its prominence in the sample data.
#Plotting the graph for unigram
gp1 <- ggplot(data=topfeatures1.df, aes(x= reorder(factor(words), -freq), y=freq, fill=freq))+ geom_bar(stat="identity") + labs(x="Unigram", y="Total Count") +ggtitle("Most Common Unigrams") + theme(legend.position = "none") + scale_fill_gradient(high="blue", low="cyan")
gp1
#Creating the word cloud
topfeatures1_wc <- topfeatures(n1gram, 30)
topfeatures1_wc <- sort(topfeatures1_wc, decreasing = TRUE)
topfeatures1_wc.df <- data.frame(words=names(topfeatures1_wc), freq=topfeatures1_wc)
wordcloud2(topfeatures1_wc.df, size=0.5, color='random-dark')
#Plotting the graph for bigram
gp2 <- ggplot(data=topfeatures2.df, aes(x= reorder(words, -freq), y=freq, fill=freq))+ geom_bar(stat="identity") + labs(x="Bigram", y="Total Count") +ggtitle("Most Common Bigrams") + theme(legend.position = "none") + scale_fill_gradient(high="blue", low="cyan")
gp2
#Plotting the graph for trigram
gp3 <- ggplot(data=topfeatures3.df, aes(x= reorder(words, -freq), y=freq, fill=freq))+ geom_bar(stat="identity") + labs(x="Trigram", y="Total Count") +ggtitle("Most Common Trigrams") + theme(legend.position = "none") + scale_fill_gradient(high="blue", low="cyan") + theme(axis.text.x=element_text(angle=30,hjust=1))
gp3
The purpose of this exercise is to become familiar with the data set. The final report will focus on the predictive algorithm, which will be applied to the data to obtain word recommendations. A Shiny app will be developed that takes an input (possibly consisting of multiple words) and predicts the next word. The n-grams generated here will be used to calculate the frequency of words in their relevant context and return the predicted word(s) as output. The Shiny app will be accompanied by a presentation explaining the process to a potential investor.
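As a rough illustration of the planned approach (not the final implementation), the sketch below builds a simple lookup from the trigram frequencies computed above: given the last two words of an input phrase, it returns the most frequent third words observed in the sample. The final algorithm is expected to preprocess the input in the same way as the training tokens (lower-casing, stemming, stop-word handling) and to back off to bigrams and unigrams when no trigram match is found.
#Sketch of a frequency-based next-word lookup using the trigram dfm created above.
#Assumes trigram features are space-separated ("word1 word2 word3"), as set via concatenator = " ".
tri_freq <- textstat_frequency(n3gram)          # trigram features sorted by frequency
predict_next <- function(phrase, n = 3) {
  #Keep only the last two words of the input, lower-cased
  words <- tolower(unlist(strsplit(trimws(phrase), "\\s+")))
  context <- paste(tail(words, 2), collapse = " ")
  #Find trigrams whose first two words match the context
  hits <- tri_freq[startsWith(tri_freq$feature, paste0(context, " ")), ]
  if (nrow(hits) == 0) return(character(0))     # a back-off to bigrams/unigrams would go here
  #Return the predicted third word(s), most frequent first
  sub(paste0("^", context, " "), "", head(hits$feature, n))
}
predict_next("new york")   # e.g. returns the most frequent words following "new york" in the sample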