Milestone Report Introduction

As part of Data Science Specialization Capstone experience, I have demonstrate the ability to work with relatively unstructured text data. The first step is to download and read in the data.

The code below is from the course materials. It shows how to for read data from txt files using the readLines function.

con <- file(“en_US.twitter.txt”, “r”) readLines(con, 1) ## Read the first line of text readLines(con, 1) ## Read the next line of text readLines(con, 5) ## Read in the next 5 lines of text close(con) ## It’s important to close the connection when you are done ## Download and read in the data The data are available at this url: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip ### Download the data

## You can set your working directory to final/en_US...
## and then use the filenames (wrapped in quotes)...
## but I prefer to load the path into memory...
path<-paste(getwd(),"/final/en_US", sep="")
## and then load the filenames into memory.
files<-list.files(path)
##Here are the file names
print(files)
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
## If I had more files, I would use a loop for the next steps...
## but for 3 files it is not worth it and in my opinion,
## avoiding loops makes the code more readable and faster...

## Load file paths into memory
blogPath <- paste(path,"/", files[1],sep="")
newsPath <- paste(path,"/", files[2],sep="")
twitPath <- paste(path,"/", files[3],sep="")

Read in the data

## First, I will read in all of the data, but later I will take a subsample.
## The raw binary (rb) method was the only way I could read in the full news data set. 
## I skipped the embedded nuls in all, though these were only in the twitter data set.
con <- file(blogPath, open="rb")
blog<-readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)

con <- file(newsPath, open="rb")
news<-readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)

con <- file(twitPath, open="rb")
twit<-readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)

Obtain basic statistics describing the three datasets

Now that the data are read into memory, we can obtain basic information about the datafiles (e.g. file sizes) and their contents (e.g. word counts). First I obtain file sizes in bytes. Then I will convert bytes to mebibytes (MiB). Mebibytes are part of the International Electrotechnical Commission (IEC) system. Personally, I prefer International System of Units (SI) units, because I think SI units e.g. kilobytes and megabytes (1000 and 1000^2 bytes respectively) make more sense than IEC units e.g. kibibytes and mebibytes (1000 and 1000^2 respectively) However, my preference is biased by the fact that I am laboratory scientist :)

## Get file sizes in Bytes
blogBytes <- file.info(blogPath)$size
newsBytes <- file.info(newsPath)$size
twitBytes <- file.info(twitPath)$size
##Convert bytes to mebibytes(MiB)
blogMB <- blogBytes / 1024 ^ 2
newsMB <- newsBytes / 1024 ^ 2
twitMB <- twitBytes / 1024 ^ 2
## Get the number of lines
blogLines <- length(count.fields(blogPath, sep="\n"))
newsLines <- length(count.fields(newsPath, sep="\n"))
twitLines <- length(count.fields(twitPath, sep="\n"))

## Get the number of words per line using sapply and gregexpr base functions
blogWords<-sapply(gregexpr("[[:alpha:]]+", blog), function(x) sum(x > 0))
newsWords<-sapply(gregexpr("[[:alpha:]]+", news), function(x) sum(x > 0))
twitWords<-sapply(gregexpr("[[:alpha:]]+", twit), function(x) sum(x > 0))

## Alternative: Get the number of words in each line using stringi package
##install.packages("stringi")
##library(stringi)
##blogWords <- stri_count_words(blog)
##newsWords <- stri_count_words(news)                              
##twitWords <- stri_count_words(twit)

## Sum the number of words in each line to get total words
blogWordsSum<-sum(blogWords)
newsWordsSum<-sum(newsWords)
twitWordsSum<-sum(twitWords)

##Get the character count (per line) for each data set
blogChar<-nchar(blog, type = "chars")
newsChar<-nchar(news, type = "chars")
twitChar<-nchar(twit, type = "chars")

##Sum the character counts to get total number of characters
blogCharSum<-sum(blogChar)
newsCharSum<-sum(newsChar)
twitCharSum<-sum(twitChar)

## Alternative: Use the Unix command wc e.g. system("wc filepath")
## This will give the lines, words and characters.
## I trust Unix commands > R base functions > R packages :)

Generate a table showing the basic dataset statistics

This is the first deliverable and nicely summarizes information about our datasets (from the previous code chunk).

df<-data.frame(File=c("Blogs", "News", "Twitter"),
               fileSize = c(blogMB, newsMB, twitMB),
               lineCount = c(blogLines, newsLines, twitLines),
               wordCount = c(blogWordsSum, newsWordsSum, twitWordsSum),
               charCount = c(blogCharSum,newsCharSum,twitCharSum),
               wordMean = c(mean(blogWords), mean(newsWords), mean(twitWords)),
               charMean = c(mean(blogChar), mean(newsChar), mean(twitChar))
               )

View(df)

So far, we made a table of raw data stats using only base functions (i.e. no dependencies) ## Sample the data and obtain descriptive statistics ### Obtain a sample of the data Now, we will obtain the same set of statistics for a sample of the data. First, I will set the seed so that I can obtain the same exact samples later. Next, I could sample by number of characters or words, but I will sample by line count.

set.seed(20170219)
## For whatever reason, the sample function (as below) truncated the news dataset 
blog10 <- sample(blog, size = length(blog) / 10, replace = FALSE)
news10 <- sample(news, size = length(news)/10, replace = FALSE)
twit10 <- sample(twit, size = length(twit) / 10, replace = FALSE)
##  I used the rbinom subsetting method below and it did not work for me.
#blog10 <- blog[rbinom(length(blog)/10, length(blog), .5)]
#news10 <- news[rbinom(length(news)/10, length(news), .5)]
#twit10 <- twit[rbinom(length(twit)/10, length(twit), .5)]

Obtain basic statistics describing the three dataset samples

The next few steps are (almost) the same as before, except this time I will use the samples instead of the full datasets.

First, I will obtain sample sizes in mebibytes (MiB), as per my IEC vs. SI unit rant above :) MiB still makes sense here even though I took a 1/10 sample of the datasets. If I took a smaller sample it would be better to use kebibyte units.

blog10MB <- format(object.size(blog10), standard = "IEC", units = "MiB")
news10MB <- format(object.size(news10), standard = "IEC", units = "MiB")
twit10MB <- format(object.size(twit10), standard = "IEC", units = "MiB")

## Get the number of lines
blog10Lines <- length(blog10)
news10Lines <- length(news10)
twit10Lines <- length(twit10)


## Get the number of words per line using sapply and gregexpr base functions
blog10Words<-sapply(gregexpr("[[:alpha:]]+", blog10), function(x) sum(x > 0))
news10Words<-sapply(gregexpr("[[:alpha:]]+", news10), function(x) sum(x > 0))
twit10Words<-sapply(gregexpr("[[:alpha:]]+", twit10), function(x) sum(x > 0))

## Sum the number of words in each line to get total words
blog10WordsSum<-sum(blog10Words)
news10WordsSum<-sum(news10Words)
twit10WordsSum<-sum(twit10Words)

##Get the character count (per line) for each data set
blog10Char<-nchar(blog10, type = "chars")
news10Char<-nchar(news10, type = "chars")
twit10Char<-nchar(twit10, type = "chars")

##Sum the character counts to get total number of characters
blog10CharSum<-sum(blog10Char)
news10CharSum<-sum(news10Char)
twit10CharSum<-sum(twit10Char)

## Alternative: Use the Unix command wc e.g. system("wc filepath")
## This will give the lines, words and characters.
## For simple things like these, I trust Unix commands > R base functions > R packages :)

Generate a table showing the basic dataset statistics

This is the second deliverable and nicely summarizes information about our samples (from the previous code chunk). It is important to make sure that the values in this table match the previous table. If this is not the case, then it may indicate that something went wrong with the sampling.

df10 <- data.frame(File=c("Blogs Sample", "News Sample", "Twitter Sample"),
               fileSize = c(blog10MB, news10MB, twit10MB),
               lineCount = c(blog10Lines, news10Lines, twit10Lines),
               wordCount = c(blog10WordsSum, news10WordsSum, twit10WordsSum),
               charCount = c(blog10CharSum,news10CharSum,twit10CharSum),
               wordMean = c(mean(blog10Words), mean(news10Words), mean(twit10Words)),
               charMean = c(mean(blog10Char), mean(news10Char), mean(twit10Char))
               )

View(df10)

Data cleaning

For the data cleaning steps, I will first put all three of the datasets together. Then I will remove stop words, extra whitespace, punctuation, profanity, one-letter words and symbols. For many of these steps, I will use the tm package. ### Put all of the dataset samples together

#install.packages("tm")
library(tm)
## Loading required package: NLP
## Put all of the data samples together
#dat<- c(blog,news,twit)
dat10<- c(blog10,news10,twit10)

Remove stop words, multiple spaces and punctuation

dat10NoPunc<- removePunctuation(dat10)
dat10NoWS<- stripWhitespace(dat10NoPunc)
dat10NoStop <- removeWords(dat10NoWS, stopwords("english"))

Remove profanity

At first, I was not sure whether to remove profanity… because I didn’t like the list of “bad” words I found on github e.g. https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en and https://gist.github.com/jamiew/1112488 There are perfectly normal (in my opinion) words mixed in those lists e.g. anatomical words We are all adults here (I assume) and I think the profane words can also be interesting for analysis. I decided to remove profanity because I was terrified at the thought that N-word would be one of the top ranked unigrams by frequency :/ In the end, it did not make a big difference in object size, so probably not a big loss in data.

## download profanity word lists
download.file("https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en", 
    destfile = paste(getwd(),"/profan1.csv", sep=""))
download.file("https://gist.githubusercontent.com/jamiew/1112488/raw/7ca9b1669e1c24b27c66174762cb04e14cf05aa7/google_twunter_lol", 
    destfile = paste(getwd(),"/profan2.csv", sep=""))
##Read in profanity word lists
profan1<- as.character(read.csv("profan1.csv", header=FALSE))
profan2<- as.character(row.names(read.csv("profan2.csv", header=TRUE, sep = ":")))
## Put the two lists together
profan<-c(profan1, profan2)
## Trim the first and last line of profan
profan<-profan[-1]
profan<-profan[-length(profan)]
## Remove profanity
dat10NoProfan <- removeWords(dat10NoStop, profan) 

## Find out the object size difference after removing profanity
object.size(dat10NoPunc)
## 85298144 bytes
object.size(dat10NoProfan)
## 74721856 bytes
object.size(dat10NoPunc)-object.size(dat10NoProfan)
## 10576288 bytes

I noticed a lot of weird characters in my output file,

e.g “?”, “o”, “?”, “z”,“???”,“T”,“?”,“?”,“?”,“?”, “~”…

I think the problem may be in the data and unrelated to preprocessing.

Convert everything to lowercase

I came up with different methods for converting to lowercase including dat10Lower<-sapply(dat10NoStop, tolower) dat10Lower<- tm_map(corp, tolower) dat10Lower<- tolower(dat10NoStop) I went with the stringi package method below. Nevertheless, I think the examples above work just fine.

library(stringi)
dat10Lower <- stri_trans_tolower(dat10NoProfan)

Remove special symbols

I decided to list the symbols I want removed. I know I can come up with simple regular expression to accomplish the same thing.

dat10azONLY <- gsub("?|?|???|T|o|'|?|?|?|f|.|?|?|?|?|?|>|<|?|?|?|?|~|~", "", dat10Lower) 

Remove one-letter words

All non-alphanumeric characters should be removed by now, but I remove punctuation again just to be sure. I also removed extra whitespace again, just in case and of the previous steps created new whitespace. I will remove any redundancy and improve efficiency for my final project app.

dat10NoPunc2<- removePunctuation(dat10azONLY)
dat10NoWS2<- stripWhitespace(dat10NoPunc2)
#Remove single letter words
dat10NoShort <- removeWords(dat10NoWS2, "\\b\\w{1}\\b") 

##Tokenization Here I put together lists of unigrams, bigrams and trigrams.

## I was not able to install RWeka package, because of a java version problem.
## Instead of trying to figure it out, I used the ngram_tokenizer snippet
## created by Maciej Szymkiewicz, aka zero323 on Github.

download.file("https://raw.githubusercontent.com/zero323/r-snippets/master/R/ngram_tokenizer.R", 
    destfile = paste(getwd(),"/ngram_tokenizer.R", sep=""))
source("ngram_Tokenizer.R")
unigram_tokenizer <- ngram_tokenizer(1)
uniList <- unigram_tokenizer(dat10NoShort)
freqNames <- as.vector(names(table(unlist(uniList))))
freqCount <- as.numeric(table(unlist(uniList)))
dfUni <- data.frame(Word = freqNames,
                    Count = freqCount)
attach(dfUni)
dfUniSort<-dfUni[order(-Count),]
detach(dfUni)

bigram_tokenizer <- ngram_tokenizer(2)
biList <- bigram_tokenizer(dat10NoShort)
freqNames <- as.vector(names(table(unlist(biList))))
freqCount <- as.numeric(table(unlist(biList)))
dfBi <- data.frame(Word = freqNames,
                    Count = freqCount)
attach(dfBi)
dfBiSort<-dfBi[order(-Count),]
detach(dfBi)

trigram_tokenizer <- ngram_tokenizer(3)
triList <- trigram_tokenizer(dat10NoShort)
freqNames <- as.vector(names(table(unlist(triList))))
freqCount <- as.numeric(table(unlist(triList)))
dfTri <- data.frame(Word = freqNames,
                    Count = freqCount)
attach(dfTri)
dfTriSort<-dfTri[order(-Count),]
detach(dfTri)

Exploratory Data Analysis

After preparing the Ngram lists, I am ready to visualize the data First, I will make some histograms to show the most frequent words ### Unigram histogram

par(mar = c(8,4,1,1) + 0.1, las = 2)
barplot(dfUniSort[1:20,2],col="blue",
        names.arg = dfUniSort$Word[1:20],srt = 45,
        space=0.1, xlim=c(0,20),
        main = "Top 20 Unigrams by Frequency",
        cex.names = 1, xpd = FALSE)

### Bigram histogram

par(mar = c(8,4,1,1) + 0.1, las = 2)
barplot(dfBiSort[1:20,2],col="green",
        names.arg = dfBiSort$Word[1:20],srt = 45,
        space=0.1, xlim=c(0,20),
        main = "Top 20 Bigrams by Frequency",
        cex.names = 1, xpd = FALSE)

### Trigram histogram

par(mar = c(8,4,1,1) + 0.1, las = 2)
barplot(dfTriSort[1:20,2],col="red",
        names.arg = dfTriSort$Word[1:20],srt = 45,
        space=0.1, xlim=c(0,20),
        main = "Top 20 Trigrams by Frequency",
        cex.names = 1, xpd = FALSE)

?barplot
## starting httpd help server ... done

Exploratory Data Analysis Conclusions

Based on the plots above, it appears that the data cleaning and tokenization steps worked. I think removing single letter words does not hurt. I concluded that I should not remove two-letter words, as the lack of the word “of” may result in some meaning being lost. I believe these steps work well but they take a long time and I need to think about efficiency when training my model.

Observations

I noticed foreign words (mostly in Spanish) in the output files. These may cause problems, so I will work on a way of removing the words using a Spanish dictionary (word list). This will be similar to the approach of removing profanity. It would be great to make an app that could translate foreign words on the fly so that they can also be used in the analysis. That being said, I think however it will be better to simply remove foreign words to.

Next steps

For my app, I am interested in providing functionality for hash tags from the twitter data. The idea is to predict what may follow a hash tag, just like other words. Hashtags by themselves are unigrams even if they represent multiple words (e.g. #HungryLikeAWolf), but they may be preceeded by other words. The predictive model would first try to predict by a quadgram, then a trigram, then a bigram and the word itself. In addition to word buttons to insert the text, I plan to show the output as a wordcloud wherein the word size is the probability of that word following what the user typed in.