As part of the Data Science Specialization Capstone experience, the first step is to download and read in the data.
The code below shows how to read data from txt files using the readLines function.
con <- file("en_US.twitter.txt", "r")
readLines(con, 1) ## Read the first line of text
readLines(con, 1) ## Read the next line of text
readLines(con, 5) ## Read in the next 5 lines of text
close(con) ## It's important to close the connection when you are done
## Download and read in the data
The data are available at this URL: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
### Download the data
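The download step itself is not shown in the chunks below, so here is a minimal sketch of it, assuming the zip is saved to the working directory and extracted in place; the local file name Coursera-SwiftKey.zip is an assumption.
## Download and unzip the data set (a sketch; run once)
zipUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipFile <- "Coursera-SwiftKey.zip" ## local file name is an assumption
if (!file.exists(zipFile)) {
  download.file(zipUrl, destfile = zipFile, mode = "wb")
}
if (!dir.exists("final")) {
  unzip(zipFile) ## extracts the final/ directory, including final/en_US
}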
## Load the path and file names into memory
path<-paste(getwd(),"/final/en_US", sep="")
files<-list.files(path)
## File names
print(files)
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
## Load file paths into memory
blogPath <- paste(path,"/", files[1],sep="")
newsPath <- paste(path,"/", files[2],sep="")
twitPath <- paste(path,"/", files[3],sep="")
## It is necessary to read in all of the data.
## Open the connections in raw binary (rb) mode so that the full news data set is read in.
con <- file(blogPath, open="rb")
blog<-readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)
con <- file(newsPath, open="rb")
news<-readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)
con <- file(twitPath, open="rb")
twit<-readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)
After reading all of the data into memory, I will obtain basic information about the data files and their contents: first the file sizes in bytes, which I then convert to mebibytes (MiB).
## Get file sizes in Bytes
blogBytes <- file.info(blogPath)$size
newsBytes <- file.info(newsPath)$size
twitBytes <- file.info(twitPath)$size
## Convert bytes to mebibytes (MiB)
blogMB <- blogBytes / 1024 ^ 2
newsMB <- newsBytes / 1024 ^ 2
twitMB <- twitBytes / 1024 ^ 2
## Get the number of lines
blogLines <- length(count.fields(blogPath, sep="\n"))
newsLines <- length(count.fields(newsPath, sep="\n"))
twitLines <- length(count.fields(twitPath, sep="\n"))
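## Note: these counts should equal length(blog), length(news) and length(twit), since readLines returns one element per line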
## Get the number of words per line using sapply and gregexpr base functions
blogWords<-sapply(gregexpr("[[:alpha:]]+", blog), function(x) sum(x > 0))
newsWords<-sapply(gregexpr("[[:alpha:]]+", news), function(x) sum(x > 0))
twitWords<-sapply(gregexpr("[[:alpha:]]+", twit), function(x) sum(x > 0))
## Sum the number of words in each line to get total words
blogWordsSum<-sum(blogWords)
newsWordsSum<-sum(newsWords)
twitWordsSum<-sum(twitWords)
## Get the character count (per line) for each data set
blogChar<-nchar(blog, type = "chars")
newsChar<-nchar(news, type = "chars")
twitChar<-nchar(twit, type = "chars")
## Sum the character counts to get total number of characters
blogCharSum<-sum(blogChar)
newsCharSum<-sum(newsChar)
twitCharSum<-sum(twitChar)
The first deliverable is a table that summarizes information about our datasets.
df<-data.frame(File=c("Blogs", "News", "Twitter"),
fileSize = c(blogMB, newsMB, twitMB),
lineCount = c(blogLines, newsLines, twitLines),
wordCount = c(blogWordsSum, newsWordsSum, twitWordsSum),
charCount = c(blogCharSum,newsCharSum,twitCharSum),
wordMean = c(mean(blogWords), mean(newsWords), mean(twitWords)),
charMean = c(mean(blogChar), mean(newsChar), mean(twitChar))
)
View(df)
So far, we have made a table of raw data statistics using only base functions.
## Sample the data and obtain descriptive statistics
### Obtain a sample of the data
Now, we will obtain the same set of statistics for a sample of the data. First, we need to set the seed so that exactly the same samples are drawn each time.
set.seed(20170219)
## The sample function truncated the news data set (the non-integer size argument is truncated to a whole number)
blog10 <- sample(blog, size = length(blog) / 10, replace = FALSE)
news10 <- sample(news, size = length(news)/10, replace = FALSE)
twit10 <- sample(twit, size = length(twit) / 10, replace = FALSE)
The next few steps are the same as before, but use the samples instead of the full datasets.
First, obtain the sample sizes in mebibytes (MiB).
blog10MB <- format(object.size(blog10), standard = "IEC", units = "MiB")
news10MB <- format(object.size(news10), standard = "IEC", units = "MiB")
twit10MB <- format(object.size(twit10), standard = "IEC", units = "MiB")
## Get the number of lines
blog10Lines <- length(blog10)
news10Lines <- length(news10)
twit10Lines <- length(twit10)
## Get the number of words per line using sapply and gregexpr base functions
blog10Words<-sapply(gregexpr("[[:alpha:]]+", blog10), function(x) sum(x > 0))
news10Words<-sapply(gregexpr("[[:alpha:]]+", news10), function(x) sum(x > 0))
twit10Words<-sapply(gregexpr("[[:alpha:]]+", twit10), function(x) sum(x > 0))
## Sum the number of words in each line to get total words
blog10WordsSum<-sum(blog10Words)
news10WordsSum<-sum(news10Words)
twit10WordsSum<-sum(twit10Words)
## Get the character count (per line) for each data set
blog10Char<-nchar(blog10, type = "chars")
news10Char<-nchar(news10, type = "chars")
twit10Char<-nchar(twit10, type = "chars")
## Sum the character counts to get total number of characters
blog10CharSum<-sum(blog10Char)
news10CharSum<-sum(news10Char)
twit10CharSum<-sum(twit10Char)
This is the second deliverable; it summarizes information about our samples. It is important to check that the values in this table are consistent with the previous table (roughly one tenth of the full-data counts), as the check after the table shows.
df10 <- data.frame(File=c("Blogs Sample", "News Sample", "Twitter Sample"),
fileSize = c(blog10MB, news10MB, twit10MB),
lineCount = c(blog10Lines, news10Lines, twit10Lines),
wordCount = c(blog10WordsSum, news10WordsSum, twit10WordsSum),
charCount = c(blog10CharSum,news10CharSum,twit10CharSum),
wordMean = c(mean(blog10Words), mean(news10Words), mean(twit10Words)),
charMean = c(mean(blog10Char), mean(news10Char), mean(twit10Char))
)
View(df10)
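As a quick consistency check, the sample counts can be compared with the full-data counts. This is a minimal sketch, assuming df and df10 from the chunks above; the data frame name checkRatio is illustrative.
## Ratio of sample counts to full-data counts; all values should be close to 0.10
checkRatio <- data.frame(File = df$File,
  lineRatio = df10$lineCount / df$lineCount,
  wordRatio = df10$wordCount / df$wordCount,
  charRatio = df10$charCount / df$charCount)
print(checkRatio)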
It will be necessary to put all three of the dataset samples together. Then I will remove stop words, extra whitespace, punctuation, profanity, one-letter words and symbols (a sketch of the profanity step follows the cleaning chunk below).
### Put all of the dataset samples together
#install.packages("tm")
library(tm)
## Loading required package: NLP
## Put all of the data samples together
dat10<- c(blog10,news10,twit10)
dat10NoPunc<- removePunctuation(dat10)
dat10NoWS<- stripWhitespace(dat10NoPunc)
dat10NoStop <- removeWords(dat10NoWS, stopwords("english"))
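The profanity step mentioned above does not appear in the chunk; here is a minimal sketch using removeWords from tm. The word list is only a placeholder, and the result is not wired into the rest of the pipeline; in practice the terms would come from an external list (e.g. a hypothetical profanity.txt).
## Remove profanity with removeWords; the terms below are placeholders
## profanityWords <- readLines("profanity.txt", skipNul = TRUE) ## hypothetical word list file
profanityWords <- c("badword1", "badword2")
dat10NoProfanity <- removeWords(dat10NoStop, profanityWords)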
I considered several methods for converting to lowercase, including sapply(dat10NoStop, tolower), tm_map(corp, tolower), and tolower(dat10NoStop). I went with the stringi package method below.
library(stringi)
dat10Lower <- stri_trans_tolower(dat10NoStop)
A simple regular expression strips any remaining character that is not a letter, digit, or space.
dat10azONLY <- gsub("[^[:alnum:][:space:]]", "", dat10Lower)
All non-alphanumeric characters, apart from spaces, should be removed by now.
dat10NoPunc2<- removePunctuation(dat10azONLY)
dat10NoWS2<- stripWhitespace(dat10NoPunc2)
## Remove single-letter words
dat10NoShort <- removeWords(dat10NoWS2, "\\b\\w{1}\\b")
Create lists of unigrams, bigrams, and trigrams.
download.file("https://raw.githubusercontent.com/zero323/r-snippets/master/R/ngram_tokenizer.R",
destfile = paste(getwd(),"/ngram_tokenizer.R", sep=""))
source("ngram_Tokenizer.R")
unigram_tokenizer <- ngram_tokenizer(1)
uniList <- unigram_tokenizer(dat10NoShort)
freqNames <- as.vector(names(table(unlist(uniList))))
freqCount <- as.numeric(table(unlist(uniList)))
dfUni <- data.frame(Word = freqNames,
Count = freqCount)
dfUniSort <- dfUni[order(-dfUni$Count), ]
bigram_tokenizer <- ngram_tokenizer(2)
biList <- bigram_tokenizer(dat10NoShort)
freqNames <- as.vector(names(table(unlist(biList))))
freqCount <- as.numeric(table(unlist(biList)))
dfBi <- data.frame(Word = freqNames,
Count = freqCount)
dfBiSort <- dfBi[order(-dfBi$Count), ]
trigram_tokenizer <- ngram_tokenizer(3)
triList <- trigram_tokenizer(dat10NoShort)
freqNames <- as.vector(names(table(unlist(triList))))
freqCount <- as.numeric(table(unlist(triList)))
dfTri <- data.frame(Word = freqNames,
Count = freqCount)
dfTriSort <- dfTri[order(-dfTri$Count), ]
After preparing the n-gram lists, we are ready to visualize the data. First, histograms show the most frequent words in each list.
### Unigram histogram
par(mar = c(8,4,1,1) + 0.1, las = 2)
barplot(dfUniSort[1:10, 2], col = "blue",
names.arg = dfUniSort$Word[1:10],
space = 0.1,
main = "Top 10 Unigrams by Frequency",
cex.names = 1)
### Bigram histogram
par(mar = c(8,4,1,1) + 0.1, las = 2)
barplot(dfBiSort[1:10, 2], col = "blue",
names.arg = dfBiSort$Word[1:10],
space = 0.1,
main = "Top 10 Bigrams by Frequency",
cex.names = 1)
### Trigram histogram
par(mar = c(8,4,1,1) + 0.1, las = 2)
barplot(dfTriSort[1:10, 2], col = "blue",
names.arg = dfTriSort$Word[1:10],
space = 0.1,
main = "Top 10 Trigrams by Frequency",
cex.names = 1)
Based on the plots above, it appears that the data cleaning and tokenization steps worked. I concluded that I should not remove two-letter words, as the lack of the word “of” may result in some meaning being lost.
I also noticed foreign words in the output files. Handling them will be similar to the approach for profanity: it will be better to simply remove the foreign words from the samples.
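One rough way to approximate this, sketched below, is to flag tokens that contain non-ASCII characters using stringi; this is only a heuristic (many foreign words are pure ASCII), and the variable names are illustrative.
## Flag unigrams containing non-ASCII characters as likely foreign words
## (uniList comes from the tokenization step above; stringi is already loaded)
isNonASCII <- !stri_enc_isascii(uniList)
uniListNoForeign <- uniList[!isNonASCII]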