Milestone Report Introduction

As part of the Data Science Specialization Capstone experience, the first step is to download and read in the data.

The code below shows how to read data from text files using the readLines function.

con <- file("en_US.twitter.txt", "r")
readLines(con, 1) ## Read the first line of text
readLines(con, 1) ## Read the next line of text
readLines(con, 5) ## Read in the next 5 lines of text
close(con) ## It's important to close the connection when you are done

Download and read in the data

The data are available at this URL: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Download the data
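If the data are not already on disk, they can be fetched and unzipped with base R. This is a minimal sketch; it assumes the zip extracts into the final/en_US folder used in the next step, and it skips the download when the file is already present.

## Download and unzip the corpus (only if it is not already present)
zipUrl  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipFile <- "Coursera-SwiftKey.zip"
if (!file.exists(zipFile)) {
  download.file(zipUrl, destfile = zipFile, mode = "wb")
}
if (!dir.exists("final")) {
  unzip(zipFile)  ## assumed to create the final/en_US directory used below
}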

Load filepaths into memory

## To load the path and filenames into memory.
path<-paste(getwd(),"/final/en_US", sep="")
files<-list.files(path)
##File names
print(files)
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
## Load file paths into memory
blogPath <- paste(path,"/", files[1],sep="")
newsPath <- paste(path,"/", files[2],sep="")
twitPath <- paste(path,"/", files[3],sep="")

Read in the data

## It is necessary to read in all of the data.
## Open the connections in raw binary mode (rb) so the full news data set is read in without truncation.
con <- file(blogPath, open="rb")
blog<-readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)

con <- file(newsPath, open="rb")
news<-readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)

con <- file(twitPath, open="rb")
twit<-readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)

Obtain basic statistics describing the three datasets

After reading all of the data into memory, I will obtain basic information about the data files and their contents. First, I will obtain the file sizes in bytes and then convert the bytes to mebibytes (MiB).

## Get file sizes in Bytes
blogBytes <- file.info(blogPath)$size
newsBytes <- file.info(newsPath)$size
twitBytes <- file.info(twitPath)$size
##Convert bytes to mebibytes(MiB)
blogMB <- blogBytes / 1024 ^ 2
newsMB <- newsBytes / 1024 ^ 2
twitMB <- twitBytes / 1024 ^ 2
## Get the number of lines
blogLines <- length(count.fields(blogPath, sep="\n"))
newsLines <- length(count.fields(newsPath, sep="\n"))
twitLines <- length(count.fields(twitPath, sep="\n"))

## Get the number of words per line using sapply and gregexpr base functions
blogWords<-sapply(gregexpr("[[:alpha:]]+", blog), function(x) sum(x > 0))
newsWords<-sapply(gregexpr("[[:alpha:]]+", news), function(x) sum(x > 0))
twitWords<-sapply(gregexpr("[[:alpha:]]+", twit), function(x) sum(x > 0))

## Sum the number of words in each line to get total words
blogWordsSum<-sum(blogWords)
newsWordsSum<-sum(newsWords)
twitWordsSum<-sum(twitWords)

##Get the character count (per line) for each data set
blogChar<-nchar(blog, type = "chars")
newsChar<-nchar(news, type = "chars")
twitChar<-nchar(twit, type = "chars")

##Sum the character counts to get total number of characters
blogCharSum<-sum(blogChar)
newsCharSum<-sum(newsChar)
twitCharSum<-sum(twitChar)

Generate a table showing the basic dataset statistics

This is the first deliverable; it summarizes basic information about our datasets.

df<-data.frame(File=c("Blogs", "News", "Twitter"),
               fileSize = c(blogMB, newsMB, twitMB),
               lineCount = c(blogLines, newsLines, twitLines),
               wordCount = c(blogWordsSum, newsWordsSum, twitWordsSum),
               charCount = c(blogCharSum,newsCharSum,twitCharSum),
               wordMean = c(mean(blogWords), mean(newsWords), mean(twitWords)),
               charMean = c(mean(blogChar), mean(newsChar), mean(twitChar))
               )

View(df)

So far, we have made a table of raw data statistics using only base functions.

Sample the data and obtain descriptive statistics

Obtain a sample of the data

Now, we will obtain the same set of statistics for a sample of the data. First, we need to set the seed so that the exact same samples can be reproduced.

set.seed(20170219)
## Take a 10% sample of each dataset; note that sample() truncates the non-integer size (length/10) to a whole number
blog10 <- sample(blog, size = length(blog) / 10, replace = FALSE)
news10 <- sample(news, size = length(news)/10, replace = FALSE)
twit10 <- sample(twit, size = length(twit) / 10, replace = FALSE)

Obtain basic statistics describing the three dataset samples

The next few steps are the same as before, but they use the samples instead of the full datasets.

First, obtain the in-memory size of each sample in mebibytes (MiB).

blog10MB <- format(object.size(blog10), standard = "IEC", units = "MiB")
news10MB <- format(object.size(news10), standard = "IEC", units = "MiB")
twit10MB <- format(object.size(twit10), standard = "IEC", units = "MiB")

## Get the number of lines
blog10Lines <- length(blog10)
news10Lines <- length(news10)
twit10Lines <- length(twit10)


## Get the number of words per line using sapply and gregexpr base functions
blog10Words<-sapply(gregexpr("[[:alpha:]]+", blog10), function(x) sum(x > 0))
news10Words<-sapply(gregexpr("[[:alpha:]]+", news10), function(x) sum(x > 0))
twit10Words<-sapply(gregexpr("[[:alpha:]]+", twit10), function(x) sum(x > 0))

## Sum the number of words in each line to get total words
blog10WordsSum<-sum(blog10Words)
news10WordsSum<-sum(news10Words)
twit10WordsSum<-sum(twit10Words)

##Get the character count (per line) for each data set
blog10Char<-nchar(blog10, type = "chars")
news10Char<-nchar(news10, type = "chars")
twit10Char<-nchar(twit10, type = "chars")

##Sum the character counts to get total number of characters
blog10CharSum<-sum(blog10Char)
news10CharSum<-sum(news10Char)
twit10CharSum<-sum(twit10Char)

Generate a table showing the basic sample statistics

This is the second deliverable; it summarizes information about our samples. It is important to check that the values in this table are consistent with the previous table: the line, word, and character counts should be roughly one tenth of the full-dataset values, and the per-line means should be similar. A quick check is shown after the table.

df10 <- data.frame(File=c("Blogs Sample", "News Sample", "Twitter Sample"),
               fileSize = c(blog10MB, news10MB, twit10MB),
               lineCount = c(blog10Lines, news10Lines, twit10Lines),
               wordCount = c(blog10WordsSum, news10WordsSum, twit10WordsSum),
               charCount = c(blog10CharSum,news10CharSum,twit10CharSum),
               wordMean = c(mean(blog10Words), mean(news10Words), mean(twit10Words)),
               charMean = c(mean(blog10Char), mean(news10Char), mean(twit10Char))
               )

View(df10)
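As a sanity check on the sampling, the ratios of the sample counts to the full-dataset counts can be compared directly. This is a minimal sketch assuming df and df10 are both still in memory with the columns defined above.

## Sample-to-full ratios should be close to 0.1 for the counts,
## and the difference in per-line means should be small
round(df10$lineCount / df$lineCount, 3)
round(df10$wordCount / df$wordCount, 3)
round(df10$wordMean - df$wordMean, 2)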

Data cleaning

It will be necessary to put all three of the dataset samples together. Then I will remove stop words, extra whitespace, punctuation, profanity, one-letter words, and symbols; a sketch of the profanity step is shown after the stop-word removal code below.

Put all of the dataset samples together

#install.packages("tm")
library(tm)
## Loading required package: NLP
## Put all of the data samples together
dat10<- c(blog10,news10,twit10)

Remove stop words, multiple spaces and punctuation

dat10NoPunc<- removePunctuation(dat10)
dat10NoWS<- stripWhitespace(dat10NoPunc)
dat10NoStop <- removeWords(dat10NoWS, stopwords("english"))
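
Profanity removal was listed as a cleaning step above; a minimal sketch is shown here. The file name profanity.txt is a placeholder for whatever word list (one word per line) is chosen, and if this step is applied the later steps would operate on dat10NoProfanity instead of dat10NoStop.

## Remove profanity using an external word list (profanity.txt is a placeholder name)
profanity <- readLines("profanity.txt", skipNul = TRUE, encoding = "UTF-8")
dat10NoProfanity <- removeWords(dat10NoStop, profanity)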

Remove special symbols

e.g., stray symbols left over from the text encoding, such as curly quotes, ellipsis characters, and tildes.

Convert everything to lowercase

I considered several different methods for converting to lowercase, including:

dat10Lower <- sapply(dat10NoStop, tolower)
dat10Lower <- tm_map(corp, tolower)   ## requires the data to be in a tm corpus (corp) first
dat10Lower <- tolower(dat10NoStop)

I went with the stringi package method below.

library(stringi)
dat10Lower <- stri_trans_tolower(dat10NoStop)

Remove special symbols

A simple regular expression can accomplish the same thing.

## Strip any remaining non-ASCII symbols
dat10azONLY <- gsub("[^\\x20-\\x7E]", "", dat10Lower, perl = TRUE)

Remove one-letter words

All non-alphanumeric characters should be removed by now.

dat10NoPunc2<- removePunctuation(dat10azONLY)
dat10NoWS2<- stripWhitespace(dat10NoPunc2)
#Remove single letter words
dat10NoShort <- removeWords(dat10NoWS2, "\\b\\w{1}\\b") 

Tokenization

Create lists of unigrams, bigrams, and trigrams.
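
For context, a simple n-gram tokenizer can be written as a factory function that returns a tokenizer for a given n. The sketch below (make_ngram_tokenizer is a hypothetical name) only illustrates the idea; it is not the contents of the ngram_tokenizer.R script downloaded next.

## A minimal n-gram tokenizer factory: for each input line, split on whitespace
## and paste together every run of n consecutive words
make_ngram_tokenizer <- function(n) {
  function(x) {
    unlist(lapply(strsplit(x, "\\s+"), function(words) {
      words <- words[words != ""]
      if (length(words) < n) return(character(0))
      vapply(seq_len(length(words) - n + 1),
             function(i) paste(words[i:(i + n - 1)], collapse = " "),
             character(1))
    }))
  }
}
## Example: make_ngram_tokenizer(2)("the quick brown fox") returns "the quick" "quick brown" "brown fox"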

download.file("https://raw.githubusercontent.com/zero323/r-snippets/master/R/ngram_tokenizer.R", 
    destfile = paste(getwd(),"/ngram_tokenizer.R", sep=""))
source("ngram_Tokenizer.R")
unigram_tokenizer <- ngram_tokenizer(1)
uniList <- unigram_tokenizer(dat10NoShort)
freqNames <- as.vector(names(table(unlist(uniList))))
freqCount <- as.numeric(table(unlist(uniList)))
dfUni <- data.frame(Word = freqNames,
                    Count = freqCount)
dfUniSort <- dfUni[order(-dfUni$Count), ]

bigram_tokenizer <- ngram_tokenizer(2)
biList <- bigram_tokenizer(dat10NoShort)
freqNames <- as.vector(names(table(unlist(biList))))
freqCount <- as.numeric(table(unlist(biList)))
dfBi <- data.frame(Word = freqNames,
                    Count = freqCount)
dfBiSort <- dfBi[order(-dfBi$Count), ]

trigram_tokenizer <- ngram_tokenizer(3)
triList <- trigram_tokenizer(dat10NoShort)
freqNames <- as.vector(names(table(unlist(triList))))
freqCount <- as.numeric(table(unlist(triList)))
dfTri <- data.frame(Word = freqNames,
                    Count = freqCount)
dfTriSort <- dfTri[order(-dfTri$Count), ]

Exploratory Data Analysis

After preparing the n-gram lists, we are ready to visualize the data. First, we use histograms to show the most frequent words.

Unigram histogram

par(mar = c(8,4,1,1) + 0.1, las = 2)
barplot(dfUniSort[1:10,2],col="blue",
        names.arg = dfUniSort$Word[1:10],srt = 45,
        space=0.1, xlim=c(0,10),
        main = "Top 10 Unigrams by Frequency",
        cex.names = 1, xpd = FALSE)

Bigram histogram

par(mar = c(8,4,1,1) + 0.1, las = 2)
barplot(dfBiSort[1:10,2],col="blue",
        names.arg = dfBiSort$Word[1:10],srt = 45,
        space=0.1, xlim=c(0,10),
        main = "Top 10 Bigrams by Frequency",
        cex.names = 1, xpd = FALSE)

Trigram histogram

par(mar = c(8,4,1,1) + 0.1, las = 2)
barplot(dfTriSort[1:10,2],col="blue",
        names.arg = dfTriSort$Word[1:10],srt = 45,
        space=0.1, xlim=c(0,10),
        main = "Top 10 Trigrams by Frequency",
        cex.names = 1, xpd = FALSE)


Exploratory Data Analysis Conclusions

Based on the plots above, it appears that the data cleaning and tokenization steps worked. I concluded that I should not remove two-letter words, as the lack of the word “of” may result in some meaning being lost.

Observations

I noticed foreign words in the output files. Removing them can follow the same approach as removing profanity, so it will be better to simply remove the foreign words. A possible approach is sketched below.
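
One way to do this, sketched here as an assumption rather than a final decision, is to mirror the profanity step with an external word list; foreign_words.txt is a placeholder file name.

## Remove foreign words with a word list, mirroring the profanity approach
## (foreign_words.txt is a placeholder; one word per line)
foreignWords <- readLines("foreign_words.txt", skipNul = TRUE, encoding = "UTF-8")
dat10NoForeign <- removeWords(dat10NoShort, foreignWords)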