Executive Summary

This milestone report is the week-two assignment of the Data Science Capstone Project course, part of the Data Science Specialization offered by Johns Hopkins University on Coursera.

The overall goal of the capstone is to develop a prediction algorithm for the most likely next word in a sequence of words. The purpose of this milestone report is to present an exploratory analysis of the data, whose findings will inform the eventual prediction algorithm and app.

Summary stats and Basic Information about Corpus Dataset

To get a sense of what the data looks like, we report the size in megabytes, the number of lines, the number of characters and the number of words for each of the three datasets (Blogs, News and Twitter), followed by the minimum, mean and maximum number of words per line.

library(knitr);
library(dplyr);
library(doParallel);
library(tm);
library(SnowballC);
library(stringi);
library(ggplot2);
library(wordcloud);
library(kableExtra);

setwd("E:/Project/Data Science Capstone Project/final/en_US");

blogpath <- "./en_US.blogs.txt";
newspath <- "./en_US.news.txt";
twitpath <- "./en_US.twitter.txt";

# Read blogs data in binary mode
conn <- file(blogpath, open="rb");
blog <- readLines(conn, skipNul = TRUE ,encoding="UTF-8"); 
close(conn);
# Read news data in binary mode
conn <- file(newspath, open="rb");
news <- readLines(conn,skipNul = TRUE, encoding="UTF-8"); 
close(conn);
# Read twitter data in binary mode
conn <- file(twitpath, open="rb");
twit <- readLines(conn,skipNul = TRUE, encoding="UTF-8"); 
close(conn);
# Remove temporary variable
rm(conn)

# Min, mean and max words per line (WPL) for each dataset
datasets <- list(blog, news, twit)
WPL <- sapply(datasets, function(x) summary(stri_count_words(x))[c('Min.','Mean','Max.')])
rownames(WPL) <- c('WPL_Min','WPL_Mean','WPL_Max')
# One row per file: in-memory object size, line, character and word counts, then WPL
stats <- data.frame(
  FileName=c("en_US.blogs","en_US.news","en_US.twitter"),
  "File Size" = sapply(datasets, function(x){format(object.size(x), units = "MB")}),
  t(rbind(
    sapply(datasets,stri_stats_general)[c('Lines','Chars'),],
    Words=sapply(datasets,stri_stats_latex)['Words',],
    WPL)
  ))
# kable() renders the table in the knitted report; View() only works interactively
kable(stats)

So far, we have built a table of statistics for the raw data, using the stringi package for the counts.

Sample the data and obtain descriptive statistics

Obtain a sample of the data

Now, we will obtain the same set of statistics for a sample of the data. First, I set the seed so that the exact same samples can be reproduced later. I could sample by number of characters or words, but I sample by line count, taking one tenth of the lines from each dataset.

set.seed(20170219)
## For whatever reason, the sample function (as below) truncated the news dataset
## Draw 10% of the lines without replacement; floor() makes the size a whole number
blog10 <- sample(blog, size = floor(length(blog) / 10), replace = FALSE)
news10 <- sample(news, size = floor(length(news) / 10), replace = FALSE)
twit10 <- sample(twit, size = floor(length(twit) / 10), replace = FALSE)
##  I tried the rbinom subsetting method below and it did not work for me.
##  rbinom(n, size, prob) returns binomial counts clustered around size * prob,
##  so used as indices they repeat and pick lines only from the middle of the data.
#blog10 <- blog[rbinom(length(blog)/10, length(blog), .5)]
#news10 <- news[rbinom(length(news)/10, length(news), .5)]
#twit10 <- twit[rbinom(length(twit)/10, length(twit), .5)]
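
As a hedged illustration of the line-based sampling above, the sketch below uses a small made-up vector (toy, which is not part of the corpus) to show that sample() with a fixed seed returns the same distinct lines every time, while indices drawn with rbinom() stay in a narrow band around the middle of the vector.

```r
# Toy stand-in for a corpus: 1000 made-up "lines" (an assumption for illustration)
toy <- paste("line", seq_len(1000))

# Reproducible 10% sample without replacement
set.seed(20170219)
s1 <- sample(toy, size = floor(length(toy) / 10), replace = FALSE)
set.seed(20170219)
s2 <- sample(toy, size = floor(length(toy) / 10), replace = FALSE)

identical(s1, s2)        # same seed, same sample
anyDuplicated(s1) == 0   # no line is drawn twice

# rbinom(n, size, prob) returns binomial counts, not uniform indices:
# with prob = 0.5 they concentrate near length(toy) / 2
idx <- rbinom(length(toy) / 10, length(toy), 0.5)
range(idx)
```

Resetting the seed immediately before each call is what makes the two samples identical; setting it once at the top of a script gives the same reproducibility as long as the calls run in the same order.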

Obtain basic statistics describing the three dataset samples

The next few steps are (almost) the same as before, except this time I will use the samples instead of the full datasets.

First, I will obtain sample sizes in mebibytes (MiB, 1024^2 bytes), the IEC binary unit, rather than SI megabytes (10^6 bytes). MiB still makes sense here even though I took a 1/10 sample of the datasets. If I took a much smaller sample, kibibytes (KiB) would be a better unit.

blog10MB <- format(object.size(blog10), standard = "IEC", units = "MiB")
news10MB <- format(object.size(news10), standard = "IEC", units = "MiB")
twit10MB <- format(object.size(twit10), standard = "IEC", units = "MiB")
## Get the number of lines
blog10Lines <- length(blog10)
news10Lines <- length(news10)
twit10Lines <- length(twit10)
## Get the number of words per line using sapply and gregexpr base functions
blog10Words<-sapply(gregexpr("[[:alpha:]]+", blog10), function(x) sum(x > 0))
news10Words<-sapply(gregexpr("[[:alpha:]]+", news10), function(x) sum(x > 0))
twit10Words<-sapply(gregexpr("[[:alpha:]]+", twit10), function(x) sum(x > 0))
## Sum the number of words in each line to get total words
blog10WordsSum<-sum(blog10Words)
news10WordsSum<-sum(news10Words)
twit10WordsSum<-sum(twit10Words)
##Get the character count (per line) for each data set
blog10Char<-nchar(blog10, type = "chars")
news10Char<-nchar(news10, type = "chars")
twit10Char<-nchar(twit10, type = "chars")
##Sum the character counts to get total number of characters
blog10CharSum<-sum(blog10Char)
news10CharSum<-sum(news10Char)
twit10CharSum<-sum(twit10Char)
## Alternative: Use the Unix command wc e.g. system("wc filepath")
## This will give the lines, words and characters.
## For simple things like these, I trust Unix commands > R base functions > R packages :)
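
To make the difference between the two unit systems concrete, here is a minimal sketch that converts a byte count into SI megabytes and IEC mebibytes. The helper names bytes_to_MB and bytes_to_MiB are hypothetical; the byte count is the one reported for dat10NoPunc in the profanity section of this report.

```r
# Convert a byte count to SI megabytes (10^6 bytes) and IEC mebibytes (2^20 bytes)
bytes_to_MB  <- function(bytes) bytes / 1e6
bytes_to_MiB <- function(bytes) bytes / 1024^2

size_bytes <- 85298144  # object.size(dat10NoPunc) as reported in this report
mb  <- bytes_to_MB(size_bytes)
mib <- bytes_to_MiB(size_bytes)
round(c(MB = mb, MiB = mib), 1)
```

The same object is therefore about 85.3 MB but only about 81.3 MiB, which is why the unit should always be stated next to the number.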

Generate a table showing the basic dataset statistics

This is the second deliverable and nicely summarizes information about our samples (from the previous code chunk). Because each sample holds one tenth of the lines, the totals (lines, words, characters) should be roughly one tenth of those in the previous table, while the per-line means should be close to the full-data values. If this is not the case, then it may indicate that something went wrong with the sampling.

df10 <- data.frame(File=c("Blogs Sample", "News Sample", "Twitter Sample"),
                   fileSize = c(blog10MB, news10MB, twit10MB),
                   lineCount = c(blog10Lines, news10Lines, twit10Lines),
                   wordCount = c(blog10WordsSum, news10WordsSum, twit10WordsSum),
                   charCount = c(blog10CharSum,news10CharSum,twit10CharSum),
                   wordMean = c(mean(blog10Words), mean(news10Words), mean(twit10Words)),
                   charMean = c(mean(blog10Char), mean(news10Char), mean(twit10Char))
)
# kable() renders the table in the knitted report; View() only works interactively
kable(df10)
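
One way to check a sample against the full data can be sketched as follows. The helper compare_to_full and the toy lines are assumptions made for illustration; the word count reuses the gregexpr approach from the code above.

```r
# Compare a sample against the full data (hypothetical helper, not from the report)
compare_to_full <- function(full_lines, sample_lines) {
  count_words <- function(x) {
    m <- gregexpr("[[:alpha:]]+", x)
    sapply(m, function(hit) sum(hit > 0))
  }
  c(line_ratio       = length(sample_lines) / length(full_lines),
    full_word_mean   = mean(count_words(full_lines)),
    sample_word_mean = mean(count_words(sample_lines)))
}

# Toy data: each made-up line has between 1 and 20 words
set.seed(1)
full <- sapply(sample(1:20, 5000, replace = TRUE),
               function(n) paste(rep("word", n), collapse = " "))
samp <- sample(full, size = length(full) %/% 10)

check <- compare_to_full(full, samp)
check
```

A line ratio far from 0.1, or a sample mean far from the full-data mean, would be a hint that the sampling step needs another look.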

Data cleaning

For the data cleaning steps, I will first put all three of the datasets together. Then I will remove stop words, extra whitespace, punctuation, profanity, one-letter words and symbols. For many of these steps, I will use the tm package.

Put all of the dataset samples together

## Put all of the data samples together
#dat<- c(blog,news,twit)
dat10<- c(blog10,news10,twit10)

Remove stop words, multiple spaces and punctuation

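dat10NoPunc<- removePunctuation(dat10)
dat10NoWS<- stripWhitespace(dat10NoPunc)
## Note: removeWords() matches case-sensitively and the text is not yet lowercased,
## so capitalized stop words (e.g. "The") are not removed at this step
dat10NoStop <- removeWords(dat10NoWS, stopwords("english"))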
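
The order of the cleaning steps affects which stop words are actually removed. The sketch below is a base R approximation, not the tm implementation: it uses a hypothetical three-word stop list and a case-sensitive word-boundary pattern to compare the order used above with an alternative that lowercases first and strips punctuation last.

```r
# Hypothetical three-word stop list; tm::stopwords("english") is much longer
stop_list <- c("the", "of", "don't")
pattern <- paste0("\\b(", paste(stop_list, collapse = "|"), ")\\b")

remove_stop <- function(x) gsub(pattern, "", x, perl = TRUE)  # case-sensitive
squeeze <- function(x) trimws(gsub("\\s+", " ", x))           # collapse spaces

line <- "The end of the road, don't stop"

# Order used in the report: strip punctuation first, match case-sensitively
a <- squeeze(remove_stop(gsub("[[:punct:]]", "", line)))
# Alternative order: lowercase, remove stop words, then strip punctuation
b <- squeeze(gsub("[[:punct:]]", "", remove_stop(tolower(line))))

a  # "The end road dont stop"
b  # "end road stop"
```

In the first ordering the capitalized "The" and the apostrophe-free "dont" survive because neither matches an entry in the stop list any more.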

Remove profanity

At first, I was not sure whether to remove profanity, because I did not like the lists of “bad” words I found on GitHub, e.g. https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en and https://gist.github.com/jamiew/1112488. There are perfectly normal (in my opinion) words mixed into those lists, e.g. anatomical words. We are all adults here (I assume), and I think profane words can also be interesting for analysis. In the end, I decided to remove profanity because I was terrified at the thought that the N-word would be one of the top-ranked unigrams by frequency. It did not make a big difference in object size, so it is probably not a big loss of data.

## download profanity word lists
download.file("https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en", 
              destfile = file.path(getwd(), "profan1.csv"))
download.file("https://gist.githubusercontent.com/jamiew/1112488/raw/7ca9b1669e1c24b27c66174762cb04e14cf05aa7/google_twunter_lol", 
              destfile = file.path(getwd(), "profan2.csv"))
##Read in profanity word lists
## Note: as.character() on a data frame deparses each column into one string,
## so for a one-column file profan1 is a single element, not one element per word
profan1<- as.character(read.csv("profan1.csv", header=FALSE))
profan2<- as.character(row.names(read.csv("profan2.csv", header=TRUE, sep = ":")))
## Put the two lists together
profan<-c(profan1, profan2)
## Trim the first and last elements of profan. With a one-column profan1 the
## first element is that single deparsed string, so only the second list is used
profan<-profan[-1]
profan<-profan[-length(profan)]
## Remove profanity
dat10NoProfan <- removeWords(dat10NoStop, profan) 
## Compare object sizes before and after cleaning. dat10NoPunc still contains stop
## words, so the difference reflects stop-word removal as well as profanity removal
object.size(dat10NoPunc)

## 85298144 bytes

object.size(dat10NoProfan)

## 74721856 bytes

object.size(dat10NoPunc)-object.size(dat10NoProfan)

## 10576288 bytes
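
How a word list is read determines whether it ends up as one element per word. The following sketch writes a tiny made-up list to a temporary file (the entries and the file are assumptions for illustration) and compares three ways of loading it.

```r
# Write a tiny made-up word list, one entry per line, to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("alpha", "bravo", "charlie"), tmp)

# as.character() on a data frame deparses each column into ONE string
as_df <- as.character(read.csv(tmp, header = FALSE))
length(as_df)     # 1

# readLines() keeps one element per line
as_lines <- readLines(tmp)
length(as_lines)  # 3

# Extracting the first column also gives one element per word
as_col <- read.csv(tmp, header = FALSE)[[1]]
length(as_col)    # 3

unlink(tmp)
```

For a plain one-word-per-line file, readLines() is the simplest option, since it involves no separator or header handling at all.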

I noticed a lot of weird characters in my output file, e.g. “â”, “o”, “î”, “z”, “???”, “T”, “³”, “ð”, “¾”, “ñ”, “~”… I think the problem may be in the data and unrelated to preprocessing.
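
One plausible source of characters like “â” and “ð” (an assumption, not verified against the corpus) is UTF-8 text being decoded byte by byte as a single-byte encoding such as Latin-1. The sketch below reproduces the effect on a five-byte example.

```r
# UTF-8 bytes for "café": the é is stored as the two bytes c3 a9
utf8_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9))
s <- rawToChar(utf8_bytes)

# Correct interpretation: declare the bytes as UTF-8
good <- s
Encoding(good) <- "UTF-8"

# Wrong interpretation: treat each byte as a separate Latin-1 character
bad <- iconv(s, from = "latin1", to = "UTF-8")

nchar(good, type = "chars")  # 4
nchar(bad, type = "chars")   # 5
```

The mis-decoded string gains a character because the two-byte é turns into two separate symbols, which is the same pattern as the stray accented letters in the output.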

Convert everything to lowercase

I came up with several ways of converting to lowercase, including:

dat10Lower <- sapply(dat10NoStop, tolower)
dat10Lower <- tm_map(corp, tolower)
dat10Lower <- tolower(dat10NoStop)

I went with the stringi package method below, although I think the alternatives above work just fine.

dat10Lower <- stri_trans_tolower(dat10NoProfan)

Remove special symbols

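I decided to list the symbols I want removed, but such a list does not survive encoding problems well, and characters such as “.” and “?” act as regular expression metacharacters (an unescaped “.” matches every character). The simple regular expression below accomplishes the same thing more reliably by keeping only lowercase letters, digits and whitespace.

## Keep only a-z, 0-9 and whitespace; every other character is dropped
dat10azONLY <- stri_replace_all_regex(dat10Lower, "[^a-z0-9\\s]", "")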
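
When symbols are removed with a regular expression, characters such as "." and "?" are metacharacters rather than literals. A small sketch on a made-up string shows the pitfall and two ways around it: literal matching with fixed = TRUE and a whitelist character class.

```r
x <- "hello... world?? 100% ok"

# Unescaped "." matches every character, so this alternation deletes everything
all_gone <- gsub("%|.", "", x)
all_gone   # ""

# fixed = TRUE treats the pattern as a literal string (one symbol per call)
literal <- gsub(".", "", x, fixed = TRUE)
literal    # "hello world?? 100% ok"

# Whitelist: drop anything that is not a lowercase letter, digit or space
kept <- gsub("[^a-z0-9 ]", "", x)
kept       # "hello world 100 ok"
```

The whitelist has the advantage that it does not need to anticipate every stray symbol in the data.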

Remove one-letter words

All non-alphanumeric characters should be removed by now, but I remove punctuation again just to be sure. I also remove extra whitespace again, in case any of the previous steps created new whitespace. I will remove this redundancy and improve efficiency for my final project app.

dat10NoPunc2<- removePunctuation(dat10azONLY)
dat10NoWS2<- stripWhitespace(dat10NoPunc2)
#Remove single letter words
dat10NoShort <- removeWords(dat10NoWS2, "\\b\\w{1}\\b")

Tokenization

Here I put together lists of unigrams, bigrams and trigrams.

## I was not able to install RWeka package, because of a java version problem.
## Instead of trying to figure it out, I used the ngram_tokenizer snippet
## created by Maciej Szymkiewicz, aka zero323 on Github.
download.file("https://raw.githubusercontent.com/zero323/r-snippets/master/R/ngram_tokenizer.R", 
              destfile = file.path(getwd(), "ngram_tokenizer.R"))
## File names are case-sensitive on some systems, so match the downloaded name
source("ngram_tokenizer.R")

## Unigrams: tokenize, count each distinct term, sort by decreasing count
unigram_tokenizer <- ngram_tokenizer(1)
uniList <- unigram_tokenizer(dat10NoShort)
uniTable <- table(unlist(uniList))
dfUni <- data.frame(Word = names(uniTable),
                    Count = as.numeric(uniTable))
dfUniSort <- dfUni[order(-dfUni$Count),]

## Bigrams
bigram_tokenizer <- ngram_tokenizer(2)
biList <- bigram_tokenizer(dat10NoShort)
biTable <- table(unlist(biList))
dfBi <- data.frame(Word = names(biTable),
                   Count = as.numeric(biTable))
dfBiSort <- dfBi[order(-dfBi$Count),]

## Trigrams
trigram_tokenizer <- ngram_tokenizer(3)
triList <- trigram_tokenizer(dat10NoShort)
triTable <- table(unlist(triList))
dfTri <- data.frame(Word = names(triTable),
                    Count = as.numeric(triTable))
dfTriSort <- dfTri[order(-dfTri$Count),]
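
For readers who cannot download the tokenizer snippet, the same idea can be sketched in base R. This is not the zero323 implementation; make_ngrams and count_ngrams are hypothetical helpers that assume the text is already cleaned and separated by whitespace.

```r
# Build n-grams from one line of space-separated, already-cleaned text
make_ngrams <- function(line, n) {
  words <- strsplit(trimws(line), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams over several lines and sort by decreasing frequency
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, make_ngrams, n = n))
  counts <- table(grams)
  df <- data.frame(Word = names(counts), Count = as.numeric(counts),
                   stringsAsFactors = FALSE)
  df[order(-df$Count), ]
}

toy_lines <- c("to be or not to be", "not to be outdone")
bi <- count_ngrams(toy_lines, 2)
head(bi, 3)
```

N-grams are built per line, so no n-gram spans two different lines; on the toy input the most frequent bigram is "to be" with a count of 3.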

Exploratory Data Analysis

After preparing the n-gram lists, I am ready to visualize the data. First, I will make some histograms (bar charts of counts) to show the most frequent words.

Unigram histogram

## las = 2 draws the bar labels perpendicular to the axis; the wide bottom
## margin leaves room for them
par(mar = c(8,4,1,1) + 0.1, las = 2)
## No xlim here: with space = 0.1 the 20 bars span 22 units, so xlim = c(0,20)
## would clip the last bars
barplot(dfUniSort[1:20,2],col="blue",
        names.arg = dfUniSort$Word[1:20],
        space=0.1,
        main = "Top 20 Unigrams by Frequency",
        cex.names = 1, xpd = FALSE)

par(mar = c(8,4,1,1) + 0.1, las = 2)
barplot(dfBiSort[1:20,2],col="green",
        names.arg = dfBiSort$Word[1:20],
        space=0.1,
        main = "Top 20 Bigrams by Frequency",
        cex.names = 1, xpd = FALSE)

par(mar = c(8,4,1,1) + 0.1, las = 2)
barplot(dfTriSort[1:20,2],col="red",
        names.arg = dfTriSort$Word[1:20],
        space=0.1,
        main = "Top 20 Trigrams by Frequency",
        cex.names = 1, xpd = FALSE)

Exploratory Data Analysis Conclusions

Based on the plots above, it appears that the data cleaning and tokenization steps worked. I think removing single-letter words does not hurt. I concluded that I should not remove two-letter words, as the lack of the word “of” may result in some meaning being lost. I believe these steps work well, but they take a long time, and I need to think about efficiency when training my model.

Observations

I noticed foreign words (mostly Spanish) in the output files. These may cause problems, so I will work on a way of removing them using a Spanish dictionary (word list), similar to the approach used for removing profanity. It would be great to make an app that could translate foreign words on the fly so that they can also be used in the analysis. That said, I think it will be better to simply remove foreign words too.

Next steps

For my app, I am interested in supporting hashtags from the Twitter data. The idea is to predict what may follow a hashtag, just as for other words. Hashtags by themselves are unigrams even if they represent multiple words (e.g. #HungryLikeAWolf), but they may be preceded by other words. The predictive model would first try to predict from a quadgram, then back off to a trigram, then a bigram, and finally the word itself. In addition to word buttons to insert the text, I plan to show the output as a word cloud in which the size of each word reflects the probability of that word following what the user typed.
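
The back-off idea described above can be sketched as a lookup that tries the longest available context first. Everything in this example is hypothetical: the tables, their counts and the function predict_next are made up for illustration, and a real model would build the tables from the corpus. A three-word prefix corresponds to a quadgram, a two-word prefix to a trigram and a one-word prefix to a bigram.

```r
# Hypothetical n-gram tables keyed by prefix length (made-up counts)
ngrams <- list(
  `3` = data.frame(prefix = c("thanks for the", "thanks for the"),
                   next_word = c("follow", "help"), count = c(5, 2),
                   stringsAsFactors = FALSE),
  `2` = data.frame(prefix = c("for the", "for the"),
                   next_word = c("first", "record"), count = c(9, 4),
                   stringsAsFactors = FALSE),
  `1` = data.frame(prefix = c("the", "the"),
                   next_word = c("same", "first"), count = c(20, 15),
                   stringsAsFactors = FALSE)
)

# Try the longest prefix first, then back off to shorter ones
predict_next <- function(text, tables, max_order = 3) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(NA_character_)
  for (k in seq(min(max_order, length(words)), 1)) {
    tab <- tables[[as.character(k)]]
    if (is.null(tab)) next
    prefix <- paste(tail(words, k), collapse = " ")
    hits <- tab[tab$prefix == prefix, ]
    if (nrow(hits) > 0) return(hits$next_word[which.max(hits$count)])
  }
  NA_character_  # no table contains a matching prefix
}

p1 <- predict_next("Thanks for the", ngrams)  # found in the 3-word table
p2 <- predict_next("wait for the", ngrams)    # backs off to the 2-word table
p3 <- predict_next("see the", ngrams)         # backs off to the 1-word table
c(p1, p2, p3)
```

Returning the single most frequent continuation keeps the sketch short; for the word cloud, the matching rows and their relative counts could be returned instead.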

Milestone Report

Abhijit Paul

2024-12-06