Johns Hopkins Data Science Specialization : Milestone Report

Exploratory Data Analysis and Modeling

Suhas. P. K

2023-09-09

Project Overview

Introduction

The introduction of mobile devices has driven many inventions and innovations. There was a time when sending an email was a luxury. Within a span of two decades, communication devices have evolved from huge, hefty computing machines into simple palm-held, touch-responsive mobile phones.

One of the basic features of these mobile devices is predicting the next word based on the text that the user types.

In this milestone project, as part of the Johns Hopkins Data Science Specialization in partnership with SwiftKey, I will attempt to build a predictive text model like the one used by SwiftKey.

The project includes analyzing a large corpus of text documents to discover the structure in the data and how words are put together. This involves cleaning and analyzing the text data, sampling it, and building a predictive text model, which will eventually become a predictive text product built with R Shiny.

Course datasets

This is the training dataset that I will be using for the basic exercises of this project.

Download here

Libraries and usage.

  • The libraries used in no particular order are knitr, stringi, NLP, tm, RWeka, SnowballC, data.table, wordcloud, ggplot2, kableExtra, gridExtra, ggdark, and RColorBrewer.

Importing the datasets.

  • The data set can be downloaded and unzipped directly in a code chunk, but I had already downloaded it manually; for completeness, a sketch of that step is shown after the import code below.

  • Once the zip file is downloaded, unzip it. It extracts to a folder named Coursera-SwiftKey, which contains text data collected from Twitter, news sites, and blogs.

  • Here is the code to import the data.

# Read Blog lines
blogs <- file(
  paste(getwd(),'Coursera-SwiftKey/final/en_US/en_US.blogs.txt',sep = "/"),
  "r"
)
blog_lines <- readLines(blogs, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
close(blogs)

# Read News lines
news <- file(
  paste(getwd(),'Coursera-SwiftKey/final/en_US/en_US.news.txt',sep = "/"),
  "r"
)
news_lines <- readLines(news,warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
close(news)

# Read Twitter Lines
twitter <- file(
  paste(getwd(),'Coursera-SwiftKey/final/en_US/en_US.twitter.txt',sep = "/"),
  "r"
)
twitter_lines <- readLines(twitter,warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
close(twitter)
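
For completeness, here is a minimal sketch of the download-and-unzip step mentioned above. The URL is the one given in the course instructions and is assumed here; adjust the URL and paths if needed.

# Sketch only: fetch and unzip the course data set (URL assumed from the
# course instructions; the archive is expected to contain a final/en_US folder).
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"

if (!file.exists(zip_file)) {
  download.file(zip_url, destfile = zip_file, mode = "wb")
}
if (!dir.exists("Coursera-SwiftKey")) {
  unzip(zip_file, exdir = "Coursera-SwiftKey")
}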

Exploratory Data Analysis

General statistics on ‘Words per line (wpl)’

  • Once the data sets are imported, let us compute some general statistics on them.

  • Let us answer these basic questions.

    1. What is the file size of each data set in MB?
    2. How many lines does each data set contain?
    3. How many words does each data set contain?
  • Apart from these explorations, let us also look at the data sets with a more statistical approach.

  • One such measure is words per line (wpl), which describes how many words a line contains on average.

  • Here is the code chunk.

# number of lines 
num_lines <- sapply(list(blog_lines,news_lines,twitter_lines), length)

# number of characters
num_char <- sapply(list(nchar(blog_lines),nchar(news_lines),nchar(twitter_lines)),sum)

# number of words 
num_words <- sapply(list(blog_lines,news_lines,twitter_lines), stri_stats_latex)[4,]

# number of words per line
wpl <- lapply(list(blog_lines,news_lines,twitter_lines), function(x){stri_count_words(x)})

# summary 
wpl_gist <- sapply(list(blog_lines,news_lines,twitter_lines), function(x){
  summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')]})

rownames(wpl_gist) = c('wpl_min', 'wpl_mean', 'wpl_max')

gist <- data.frame(
  FileName = c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt"),
  FileSize = sapply(
    list(blog_lines,news_lines,twitter_lines), function(x){format(object.size(x),"MB")}
  ),
  Lines = num_lines,
  Characters = num_char,
  Words = num_words,
  t(round(wpl_gist))
)

# Tabular representation

gist %>% kbl() %>%  kable_styling(bootstrap_options = c("condensed", "responsive"))
FileName           FileSize  Lines    Characters  Words     wpl_min  wpl_mean  wpl_max
en_US.blogs.txt    255.4 Mb  899288   206824505   37570839  0        42        6726
en_US.news.txt     19.8 Mb   77259    15639408    2651432   1        35        1123
en_US.twitter.txt  319 Mb    2360148  162096241   30451170  1        13        47

Results table and interpretations

From the result table, here are some insights:

  1. The data set en_US.twitter.txt has the largest file size (319 MB).

  2. The Twitter data set has more lines than the blogs data set, because tweets are short and each one contributes only a few lines.

  3. Even though en_US.twitter.txt (319 MB) is larger than en_US.blogs.txt (255.4 MB), the maximum words per line is far lower in en_US.twitter.txt than in en_US.blogs.txt. In other words, tweets typically contain fewer words per line than blog posts.
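
For intuition, here is a tiny illustrative example (separate from the analysis above) of how stri_count_words from the stringi package produces the per-line word counts behind the wpl columns.

# Illustrative only: per-line word counts for three short strings.
toy_lines <- c("Hello world", "One two three four", "a b c")
stri_count_words(toy_lines)           # expected: 2 4 3
summary(stri_count_words(toy_lines))  # min / mean / max, as used for wpl_gist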

Visualization of wpl.

plot1 <- qplot(wpl[[1]],
               geom = "histogram",
               main = "US Blogs",
               xlim = c(0,200),
               xlab = "Words per Line",
               ylab = "Frequency",
               binwidth = 5) + dark_theme_light()

plot2 <- qplot(wpl[[2]],
               geom = "histogram",
               main = "US News",
               xlim = c(0,300),
               xlab = "Words per Line",
               ylab = "Frequency",
               binwidth = 5) + dark_theme_light()

plot3 <- qplot(wpl[[3]],
               geom = "histogram",
               main = "US Twitter",
               xlab = "Words per Line",
               ylab = "Frequency",
               binwidth = 1) + dark_theme_light()

plotList = list(plot1, plot2, plot3)
do.call(grid.arrange, c(plotList, list(ncol = 1)))

# free up some memory
rm(plot1, plot2, plot3)

Sampling and cleaning.

  • Now that I have all three data sets, I will take a sample of each.

  • Then I start to clean the data. Cleaning involves:

    1. Removing all non-English characters from the sample data.
    2. Combining all three sample data sets into a single data set and writing it to disk.
    3. Calculating the number of lines and words in the sample data set.
set.seed(660067)

# assigning sample size
sampleSize = 0.01

# sampling all 3 datasets
sample_blogs <- sample(blog_lines, length(blog_lines)*sampleSize, replace = FALSE)
sample_news <- sample(news_lines, length(news_lines)*sampleSize, replace = FALSE)
sample_twitter <- sample(twitter_lines, length(twitter_lines)*sampleSize, replace = FALSE)

# Removing all non-English characters from the sample data.

sample_blogs <- iconv(sample_blogs, "latin1", "ASCII", sub="")
sample_news <- iconv(sample_news, "latin1", "ASCII", sub="")
sample_twitter <- iconv(sample_twitter, "latin1", "ASCII", sub="")

# Combining all 3 sample data sets into a single data set and writing to disk.

sample_data <- c(sample_blogs,sample_news,sample_twitter)
sample_data_filename <- paste(getwd(), "Coursera-SwiftKey/final/en_US/en_US.sample.txt", sep = "/")
con <- file(sample_data_filename, open = "w")
writeLines(sample_data,con)
close(con)

# number of lines and words in the sample data set.
sample_data_lines <- length(sample_data)
sample_data_words <- sum(stri_count_words(sample_data))
sample_data_lines
## [1] 26976
sample_data_words
## [1] 793939
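
Since the same sample-then-clean pattern is applied to all three sources, it could also be wrapped in a small helper function. This is only an optional refactoring sketch; the results above were produced by the code shown earlier.

# Optional refactoring sketch (not used for the results above):
# sample a fraction of a character vector and strip non-ASCII characters.
sample_and_clean <- function(lines, fraction = 0.01) {
  sampled <- sample(lines, floor(length(lines) * fraction), replace = FALSE)
  iconv(sampled, "latin1", "ASCII", sub = "")
}

# e.g. sample_data <- c(sample_and_clean(blog_lines),
#                       sample_and_clean(news_lines),
#                       sample_and_clean(twitter_lines))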

Profanity dataset.

  • Twitter data in particular tends to contain profanity, so naturally the sample data set will contain profane words.
  • These profane words need to be removed.
  • For the profanity check, I downloaded a .csv file from Kaggle: Profanities in English.
  • The downloaded data set is in CSV format, so I converted it to a text file to keep all data sets in the same format.
  • The code to convert the .csv file to a .txt file is given below.
  • It is saved as ‘en_US.profanity.txt’.
pf <- read.csv("profanity_en.csv")

head(pf)

library(readr)

profanity_words <- pf$text
text_file_path <- paste(
  getwd(),"Coursera-SwiftKey/final/en_US/en_US.profanity.txt",sep = "/"
)

con <- file(text_file_path,open = "w")
writeLines(profanity_words,con)
close(con)

rm(profanity_words)

Building Corpus

  • Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

  • Based on the sample data set, we need to create a corpus. This corpus should contain clean words only.

  • So, the following cleaning steps must be performed (implemented in the build_corpus() function below),

    1. Removing URL, Twitter handle, and email patterns.
    2. Removing profane words.
    3. Converting Latin-encoded characters to ASCII.
    4. Converting words to lower case.
    5. Removing stop words.
    6. Removing punctuation marks.
    7. Removing extra white space.
    8. Converting documents back to plain text.
    9. Removing numbers.
  • After cleaning, the corpus is saved, converted to a data frame, and the cleaned lines are written to disk.

build_corpus <- function(dataset){
  docs <- VCorpus(VectorSource(dataset))
  to_space <- content_transformer(
    function(x,pattern)
      gsub(pattern, " ",x)
  )
  
  # Removing URL, Twitter handles, email pattern.
  docs <- tm_map(docs, to_space,"(f|ht)tp(s?)://(.*)[.][a-z]+")
  docs <- tm_map(docs, to_space,"@[^\\s]+")
  docs <- tm_map(docs, to_space, "\\b[A-Z a-z 0-9._ - ]*[@](.*?)[.]{1,3} \\b")
  
  # Removing Profane words from the sample data.
  
  con <- file(
    paste(getwd(),'Coursera-SwiftKey/final/en_US/en_US.profanity.txt',sep = "/"),
  "r"
  )
  profanity <-readLines(
    con, warn = FALSE, encoding = "UTF-8", skipNul = TRUE
  )
  close(con)
  
  profanity <- iconv(profanity, "latin1", "ASCII", sub = "")
  docs <- tm_map(docs, removeWords,profanity)
  
  # Converting the words to lower case.
  docs <- tm_map(docs, content_transformer(tolower))
  # Removing stopwords
  docs <- tm_map(docs, removeWords, stopwords("english"))
  # Removing punctuation marks.
  docs <- tm_map(docs, removePunctuation)
  # Removing white spaces.
  docs <- tm_map(docs, stripWhitespace)
  # Converting documents back to plain text.
  docs <- tm_map(docs, PlainTextDocument)
  # Removing numbers.
  docs <- tm_map(docs, removeNumbers)
  
  return(docs)
}

# Building corpus and writing to disk
corpus <- build_corpus(sample_data)
saveRDS(corpus, file = paste(
  getwd(),"Coursera-SwiftKey/final/en_US/en_US.corpus.rds", sep = "/"
))

# Converting the corpus to a data frame and writing the lines to disk.

corpus_text <- data.frame(text = unlist(
  sapply(corpus, '[', "content")), stringsAsFactors = FALSE
)

corpus_filename <- paste(getwd(),"Coursera-SwiftKey/final/en_US/en_US.corpus.txt",sep = "/")
con <- file(corpus_filename, open = "w")
writeLines(corpus_text$text,con)
close(con)
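
As a quick sanity check (illustrative only), the saved corpus can be reloaded and a few cleaned lines inspected:

# Illustrative check: reload the saved corpus and inspect a few cleaned lines.
corpus_check <- readRDS(paste(getwd(), "Coursera-SwiftKey/final/en_US/en_US.corpus.rds", sep = "/"))
length(corpus_check)        # number of documents in the cleaned corpus
head(corpus_text$text, 3)   # first few cleaned lines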

Word Frequencies

  • Here, we analyze the word frequencies.
  • The following steps are involved,
    1. A term-document matrix is a matrix that describes the frequency of terms occurring in a collection of documents. A term-document matrix of the corpus is built.
    2. This matrix is then converted into a data frame, which is easier to read.
  • Later, we visualize the frequencies using a bar plot and a word cloud. A tiny toy illustration of a term-document matrix follows this list; the real computation on the sample corpus comes after it.
  • Note: this corpus is created from our sample data.
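Toy illustration (not part of the analysis): a term-document matrix for two tiny documents, built with the tm package.

# Toy illustration only: term-document matrix for two tiny documents.
toy_corpus <- VCorpus(VectorSource(c("the cat sat", "the cat ran fast")))
toy_tdm <- TermDocumentMatrix(toy_corpus)
as.matrix(toy_tdm)
# Rows are terms, columns are documents; each cell is a term count.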
tdm <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
word_Freq <- data.frame(word = names(freq), freq = freq, row.names = NULL)


# Plotting top 10 most frequent words.

top_words <- word_Freq[1:10, ]

plot_most_freq_word <- ggplot(
  top_words, aes(x = reorder(word, -freq), y = freq)
) +
  geom_bar(stat = "identity", fill = "#90EE90") +
  geom_text(aes(label = freq), vjust = -0.20, size = 3) +
  xlab("Words") +
  ylab("Word Frequencies") +
  dark_theme_light() +
  theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
        axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        axis.text.y = element_text(hjust = 0.5, vjust = 0.5)) +
  ggtitle("10 Most Frequent Words")

plot_most_freq_word

# Word cloud
suppressWarnings(
  wordcloud(
    words = word_Freq$word,
    freq = word_Freq$freq,
    min.freq = 1,
    max.words = 100,
    random.order = FALSE,
    rot.per = 0.35,
    colors = brewer.pal(10, "Dark2")
  ) 
  
  
)

Tokenizing and N-Gram Generation

  • Tokenization is used in NLP to split paragraphs and sentences into smaller units that can be more easily assigned meanings.

  • An n-gram is a contiguous sequence of n items extracted from a given sample of text or speech.

  • The items can be letters, words, or base pairs. The n-grams here are collected from the text corpus.

  • The smallest n-grams, with n = 1, are called unigrams. Unigrams are simply the individual tokens appearing in the sentences.

  • Bigrams are n-grams with n = 2 and trigrams are n-grams with n = 3.

  • Tokenizer functions for these n-grams are written using RWeka's NGramTokenizer, as shown below; a small usage example follows the definitions.

unigram_tokenizer <- function(x){
  NGramTokenizer(x, Weka_control(min=1, max = 1))
}

bigram_tokenizer <- function(x){
  NGramTokenizer(x, Weka_control(min=2,max=2))
}

trigram_tokenizer <- function(x){
  NGramTokenizer(x,
                 Weka_control(min=3, max=3))
}

tetragram_tokenizer <- function(x){
  NGramTokenizer(x,
                 Weka_control(min=4, max=4))
}
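
As a quick illustrative check of the tokenizers defined above, applying them to a toy sentence shows what each one produces (expected output shown as comments):

# Illustrative only: tokenizer output on a toy sentence.
toy_text <- "this is a small toy sentence"
unigram_tokenizer(toy_text)   # "this" "is" "a" "small" "toy" "sentence"
bigram_tokenizer(toy_text)    # "this is" "is a" "a small" "small toy" "toy sentence"
trigram_tokenizer(toy_text)   # "this is a" "is a small" "a small toy" "small toy sentence"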

Unigrams

  • The unigrams are generated as follows,
    1. A term-document matrix of unigrams is created from the corpus.
    2. Sparse terms are removed and the matrix is converted into a frequency data frame.
    3. The unigrams are visualized with a bar chart and a word cloud.
# creating term document matrix for the corpus
unigram_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = unigram_tokenizer))

# Eliminating sparse terms for each N-gram and get frequencies of most common n-grams

unigram_matrix_freq <- sort(rowSums(as.matrix(
  removeSparseTerms(
    unigram_matrix, 0.99
  )
)), decreasing = TRUE)


unigram_matrix_freq <- data.frame(
  word = names(unigram_matrix_freq),
  freq = unigram_matrix_freq,
  row.names = NULL
)

uni_path <- "E:/my data science journey/john_hopkins_data_science/JHU_Capstone_project/Word_prediction_app/unigram_List.rds"
saveRDS(unigram_matrix_freq,uni_path)


# Plotting histogram
unigram_hist <- ggplot(
  unigram_matrix_freq[1:20,],
  aes(x = reorder(word, -freq), y = freq)
) + geom_bar(stat = "identity", fill = "#37BEB0")+ dark_theme_light() +geom_text(
  aes(label = freq), vjust = -0.20, size = 3) + xlab("Words") + ylab("Frequency")+
  theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text( angle = 45, hjust = 1.0, vjust = 1),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5),
        )+ ggtitle("20 Most Common Unigrams")

unigram_hist

# Word cloud
suppressWarnings(
  wordcloud(
    words = unigram_matrix_freq$word,
    freq = unigram_matrix_freq$freq,
    min.freq = 1,
    max.words = 100,
    random.order = FALSE,
    rot.per = 0.35,
    colors = brewer.pal(10, "Dark2")
  ) 
  
  
)
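
The saved unigram table can be reloaded later, for example in the prediction app; a quick illustrative check (assuming the path above) is:

# Illustrative check: reload the saved unigram frequencies and peek at the top terms.
unigram_check <- readRDS(uni_path)
head(unigram_check, 5)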

Bigrams

  • The bigrams are generated as follows,
    1. A term-document matrix of bigrams is created from the corpus.
    2. Sparse terms are removed and the matrix is converted into a frequency data frame.
    3. The bigrams are visualized with a bar chart and a word cloud.
# creating term document matrix for the corpus
bigram_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))

# Eliminating sparse terms for each N-gram and get frequencies of most common n-grams

bigram_matrix_freq <- sort(rowSums(as.matrix(
  removeSparseTerms(
    bigram_matrix, 0.999
  )
)), decreasing = TRUE)


bigram_matrix_freq <- data.frame(
  word = names(bigram_matrix_freq),
  freq = bigram_matrix_freq,
  row.names = NULL
)

bi_path <- "E:/my data science journey/john_hopkins_data_science/JHU_Capstone_project/Word_prediction_app/bigram_List.rds" 
saveRDS(bigram_matrix_freq,bi_path)

# Plotting histogram
bigram_hist <- ggplot(
  bigram_matrix_freq[1:20,],
  aes(x = reorder(word, -freq), y = freq)
) + geom_bar(stat = "identity", fill = "#FFD580")+ dark_theme_light() +geom_text(
  aes(label = freq), vjust = -0.20, size = 3) + xlab("Words") + ylab("Frequency")+
  theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(angle = 45, hjust = 1.0, vjust = 1.0),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))+ ggtitle("20 Most Common bigrams")

bigram_hist

# Word cloud
suppressWarnings(
  wordcloud(
    words = bigram_matrix_freq$word,
    freq = bigram_matrix_freq$freq,
    min.freq = 1,
    max.words = 100,
    random.order = FALSE,
    rot.per = 0.35,
    colors = brewer.pal(10, "Dark2")
  ) 
  
  
)

Trigrams

  • The trigrams are generated as follows,
    1. A term-document matrix of trigrams is created from the corpus.
    2. Sparse terms are removed and the matrix is converted into a frequency data frame.
    3. The trigrams are visualized with a bar chart and a word cloud.
# creating term document matrix for the corpus
trigram_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))

# Eliminating sparse terms for each N-gram and get frequencies of most common n-grams

trigram_matrix_freq <- sort(rowSums(as.matrix(
  removeSparseTerms(
    trigram_matrix, 0.9999
  )
)), decreasing = TRUE)


trigram_matrix_freq <- data.frame(
  word = names(trigram_matrix_freq),
  freq = trigram_matrix_freq,
  row.names = NULL
)

tri_path <- "E:/my data science journey/john_hopkins_data_science/JHU_Capstone_project/Word_prediction_app/trigram_List.rds" 
saveRDS(trigram_matrix_freq,tri_path)

# Plotting histogram
trigram_hist <- ggplot(
  trigram_matrix_freq[1:10,],
  aes(x = reorder(word, -freq), y = freq)
) + geom_bar(stat = "identity", fill = "#fff44f")+ dark_theme_light() +geom_text(
  aes(label = freq), vjust = -0.20, size = 3) + xlab("Words") + ylab("Frequency")+
  theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(angle = 45, hjust = 1.0),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))+ ggtitle("10 Most Common trigrams")

trigram_hist

# Word cloud
suppressWarnings(
  wordcloud(
    words = trigram_matrix_freq$word,
    freq = trigram_matrix_freq$freq,
    min.freq = 1,
    max.words = 100,
    random.order = FALSE,
    rot.per = 0.35,
    colors = brewer.pal(10, "Dark2")
  ) 
  
  
)

Tetragrams

  • The tetragrams (n-grams with n = 4) are generated in the same way as the trigrams above.

# creating term document matrix for the corpus
tetragram_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = tetragram_tokenizer))

# Eliminating sparse terms for each N-gram and get frequencies of most common n-grams

tetragram_matrix_freq <- sort(rowSums(as.matrix(
  removeSparseTerms(
    tetragram_matrix, 0.9999
  )
)), decreasing = TRUE)


tetragram_matrix_freq <- data.frame(
  word = names(tetragram_matrix_freq),
  freq = tetragram_matrix_freq,
  row.names = NULL
)

tetra_path <- "E:/my data science journey/john_hopkins_data_science/JHU_Capstone_project/Word_prediction_app/tetragram_List.rds"
saveRDS(tetragram_matrix_freq,tetra_path)

# Plotting histogram
tetragram_hist <- ggplot(
  tetragram_matrix_freq[1:10,],
  aes(x = reorder(word, -freq), y = freq)
) + geom_bar(stat = "identity", fill = "#fff44f")+ dark_theme_light() +geom_text(
  aes(label = freq), vjust = -0.20, size = 3) + xlab("Words") + ylab("Frequency")+
  theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(angle = 45, hjust = 1.0),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))+ ggtitle("10 Most Common Tetragrams")

tetragram_hist

# Word cloud
suppressWarnings(
  wordcloud(
    words = tetragram_matrix_freq$word,
    freq = tetragram_matrix_freq$freq,
    min.freq = 1,
    max.words = 100,
    random.order = FALSE,
    rot.per = 0.35,
    colors = brewer.pal(10, "Dark2")
  ) 
  
  
)
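
The four n-gram blocks above repeat the same build pattern. As an optional refactoring sketch (not the code used to produce the results above), a single helper could cover all of them:

# Optional refactoring sketch: build an n-gram frequency data frame from the corpus.
# The sparsity threshold differs per n in the report, so it is passed as an argument.
build_ngram_freq <- function(corpus, n, sparsity) {
  tok <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = tok))
  freq <- sort(rowSums(as.matrix(removeSparseTerms(tdm, sparsity))), decreasing = TRUE)
  data.frame(word = names(freq), freq = freq, row.names = NULL)
}

# e.g. bigram_matrix_freq <- build_ngram_freq(corpus, 2, 0.999)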

Conclusion and remarks

  • All the n-gram tokens have been generated.
  • It would be worthwhile to run several tests with different sample sizes of the data set to find a balance between computational cost and the accuracy of the results.

Next step

  1. Build and test different prediction models and evaluate each based on their performance.

  2. Make and test any necessary modifications to resolve any issues encountered during modeling.

  3. Build, test and deploy a Shiny app with a simple user interface that has acceptable run time and reliably and accurately predicts the next word based on a word or phrase entered by the user.
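
As a rough outline of step 1, and only as a minimal sketch assuming the n-gram frequency tables saved above (the final model may use a different approach), a simple backoff-style next-word lookup could look like this:

# Minimal sketch of a backoff next-word lookup (illustrative, not the final model).
# Assumes bigram/trigram frequency tables like those saved above, where each
# "word" entry is a space-separated n-gram sorted by decreasing frequency.
predict_next_word <- function(phrase, bigrams, trigrams, top_n = 3) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words)

  # Try trigrams first: match the last two words, return the third.
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- trigrams[startsWith(as.character(trigrams$word), paste0(prefix, " ")), ]
    if (nrow(hits) > 0) {
      return(head(sapply(strsplit(as.character(hits$word), " "), `[`, 3), top_n))
    }
  }

  # Back off to bigrams: match the last word, return the second.
  prefix <- words[n]
  hits <- bigrams[startsWith(as.character(bigrams$word), paste0(prefix, " ")), ]
  if (nrow(hits) > 0) {
    return(head(sapply(strsplit(as.character(hits$word), " "), `[`, 2), top_n))
  }

  NA_character_
}

# e.g. predict_next_word("thanks for the", bigram_matrix_freq, trigram_matrix_freq)

In practice, the Shiny app would load the saved .rds frequency tables once at start-up and score candidates with a proper smoothing or backoff weighting scheme rather than a plain frequency lookup.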