As part of Data Science Specialization Capstone experience, I have demonstrate the ability to work with relatively unstructured text data. The first step is to download and read in the data.
The code below is from the course materials. It shows how to for read data from txt files using the readLines function.
con <- file(“en_US.twitter.txt”, “r”) readLines(con, 1) ## Read the first line of text readLines(con, 1) ## Read the next line of text readLines(con, 5) ## Read in the next 5 lines of text close(con) ## It’s important to close the connection when you are done ## Download and read in the data The data are available at this url: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip ### Download the data
## You can set your working directory to final/en_US...
## and then use the filenames (wrapped in quotes)...
## but I prefer to load the path into memory...
path<-paste(getwd(),"/final/en_US", sep="")
## and then load the filenames into memory.
files<-list.files(path)
##Here are the file names
print(files)
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
## If I had more files, I would use a loop for the next steps...
## but for 3 files it is not worth it and in my opinion,
## avoiding loops makes the code more readable and faster...
## Load file paths into memory
blogPath <- paste(path,"/", files[1],sep="")
newsPath <- paste(path,"/", files[2],sep="")
twitPath <- paste(path,"/", files[3],sep="")
## First, I will read in all of the data, but later I will take a subsample.
## The raw binary (rb) method was the only way I could read in the full news data set.
## I skipped the embedded nuls in all, though these were only in the twitter data set.
con <- file(blogPath, open="rb")
blog<-readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)
con <- file(newsPath, open="rb")
news<-readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)
con <- file(twitPath, open="rb")
twit<-readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)
Now that the data are read into memory, we can obtain basic information about the datafiles (e.g. file sizes) and their contents (e.g. word counts). First I obtain file sizes in bytes. Then I will convert bytes to mebibytes (MiB). Mebibytes are part of the International Electrotechnical Commission (IEC) system. Personally, I prefer International System of Units (SI) units, because I think SI units e.g. kilobytes and megabytes (1000 and 1000^2 bytes respectively) make more sense than IEC units e.g. kibibytes and mebibytes (1000 and 1000^2 respectively) However, my preference is biased by the fact that I am laboratory scientist :)
## Get file sizes in Bytes
blogBytes <- file.info(blogPath)$size
newsBytes <- file.info(newsPath)$size
twitBytes <- file.info(twitPath)$size
##Convert bytes to mebibytes(MiB)
blogMB <- blogBytes / 1024 ^ 2
newsMB <- newsBytes / 1024 ^ 2
twitMB <- twitBytes / 1024 ^ 2
## Get the number of lines
blogLines <- length(count.fields(blogPath, sep="\n"))
newsLines <- length(count.fields(newsPath, sep="\n"))
twitLines <- length(count.fields(twitPath, sep="\n"))
## Get the number of words per line using sapply and gregexpr base functions
blogWords<-sapply(gregexpr("[[:alpha:]]+", blog), function(x) sum(x > 0))
newsWords<-sapply(gregexpr("[[:alpha:]]+", news), function(x) sum(x > 0))
twitWords<-sapply(gregexpr("[[:alpha:]]+", twit), function(x) sum(x > 0))
## Alternative: Get the number of words in each line using stringi package
##install.packages("stringi")
##library(stringi)
##blogWords <- stri_count_words(blog)
##newsWords <- stri_count_words(news)
##twitWords <- stri_count_words(twit)
## Sum the number of words in each line to get total words
blogWordsSum<-sum(blogWords)
newsWordsSum<-sum(newsWords)
twitWordsSum<-sum(twitWords)
##Get the character count (per line) for each data set
blogChar<-nchar(blog, type = "chars")
newsChar<-nchar(news, type = "chars")
twitChar<-nchar(twit, type = "chars")
##Sum the character counts to get total number of characters
blogCharSum<-sum(blogChar)
newsCharSum<-sum(newsChar)
twitCharSum<-sum(twitChar)
## Alternative: Use the Unix command wc e.g. system("wc filepath")
## This will give the lines, words and characters.
## I trust Unix commands > R base functions > R packages :)
This is the first deliverable and nicely summarizes information about our datasets (from the previous code chunk).
df<-data.frame(File=c("Blogs", "News", "Twitter"),
fileSize = c(blogMB, newsMB, twitMB),
lineCount = c(blogLines, newsLines, twitLines),
wordCount = c(blogWordsSum, newsWordsSum, twitWordsSum),
charCount = c(blogCharSum,newsCharSum,twitCharSum),
wordMean = c(mean(blogWords), mean(newsWords), mean(twitWords)),
charMean = c(mean(blogChar), mean(newsChar), mean(twitChar))
)
View(df)
So far, we made a table of raw data stats using only base functions (i.e. no dependencies) ## Sample the data and obtain descriptive statistics ### Obtain a sample of the data Now, we will obtain the same set of statistics for a sample of the data. First, I will set the seed so that I can obtain the same exact samples later. Next, I could sample by number of characters or words, but I will sample by line count.
set.seed(20170219)
## For whatever reason, the sample function (as below) truncated the news dataset
blog10 <- sample(blog, size = length(blog) / 10, replace = FALSE)
news10 <- sample(news, size = length(news)/10, replace = FALSE)
twit10 <- sample(twit, size = length(twit) / 10, replace = FALSE)
## I used the rbinom subsetting method below and it did not work for me.
#blog10 <- blog[rbinom(length(blog)/10, length(blog), .5)]
#news10 <- news[rbinom(length(news)/10, length(news), .5)]
#twit10 <- twit[rbinom(length(twit)/10, length(twit), .5)]
The next few steps are (almost) the same as before, except this time I will use the samples instead of the full datasets.
First, I will obtain sample sizes in mebibytes (MiB), as per my IEC vs. SI unit rant above :) MiB still makes sense here even though I took a 1/10 sample of the datasets. If I took a smaller sample it would be better to use kebibyte units.
blog10MB <- format(object.size(blog10), standard = "IEC", units = "MiB")
news10MB <- format(object.size(news10), standard = "IEC", units = "MiB")
twit10MB <- format(object.size(twit10), standard = "IEC", units = "MiB")
## Get the number of lines
blog10Lines <- length(blog10)
news10Lines <- length(news10)
twit10Lines <- length(twit10)
## Get the number of words per line using sapply and gregexpr base functions
blog10Words<-sapply(gregexpr("[[:alpha:]]+", blog10), function(x) sum(x > 0))
news10Words<-sapply(gregexpr("[[:alpha:]]+", news10), function(x) sum(x > 0))
twit10Words<-sapply(gregexpr("[[:alpha:]]+", twit10), function(x) sum(x > 0))
## Sum the number of words in each line to get total words
blog10WordsSum<-sum(blog10Words)
news10WordsSum<-sum(news10Words)
twit10WordsSum<-sum(twit10Words)
##Get the character count (per line) for each data set
blog10Char<-nchar(blog10, type = "chars")
news10Char<-nchar(news10, type = "chars")
twit10Char<-nchar(twit10, type = "chars")
##Sum the character counts to get total number of characters
blog10CharSum<-sum(blog10Char)
news10CharSum<-sum(news10Char)
twit10CharSum<-sum(twit10Char)
## Alternative: Use the Unix command wc e.g. system("wc filepath")
## This will give the lines, words and characters.
## For simple things like these, I trust Unix commands > R base functions > R packages :)
This is the second deliverable and nicely summarizes information about our samples (from the previous code chunk). It is important to make sure that the values in this table match the previous table. If this is not the case, then it may indicate that something went wrong with the sampling.
df10 <- data.frame(File=c("Blogs Sample", "News Sample", "Twitter Sample"),
fileSize = c(blog10MB, news10MB, twit10MB),
lineCount = c(blog10Lines, news10Lines, twit10Lines),
wordCount = c(blog10WordsSum, news10WordsSum, twit10WordsSum),
charCount = c(blog10CharSum,news10CharSum,twit10CharSum),
wordMean = c(mean(blog10Words), mean(news10Words), mean(twit10Words)),
charMean = c(mean(blog10Char), mean(news10Char), mean(twit10Char))
)
View(df10)
For the data cleaning steps, I will first put all three of the datasets together. Then I will remove stop words, extra whitespace, punctuation, profanity, one-letter words and symbols. For many of these steps, I will use the tm package. ### Put all of the dataset samples together
#install.packages("tm")
library(tm)
## Loading required package: NLP
## Put all of the data samples together
#dat<- c(blog,news,twit)
dat10<- c(blog10,news10,twit10)
dat10NoPunc<- removePunctuation(dat10)
dat10NoWS<- stripWhitespace(dat10NoPunc)
dat10NoStop <- removeWords(dat10NoWS, stopwords("english"))
At first, I was not sure whether to remove profanity… because I didn’t like the list of “bad” words I found on github e.g. https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en and https://gist.github.com/jamiew/1112488 There are perfectly normal (in my opinion) words mixed in those lists e.g. anatomical words We are all adults here (I assume) and I think the profane words can also be interesting for analysis. I decided to remove profanity because I was terrified at the thought that N-word would be one of the top ranked unigrams by frequency :/ In the end, it did not make a big difference in object size, so probably not a big loss in data.
## download profanity word lists
download.file("https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en",
destfile = paste(getwd(),"/profan1.csv", sep=""))
download.file("https://gist.githubusercontent.com/jamiew/1112488/raw/7ca9b1669e1c24b27c66174762cb04e14cf05aa7/google_twunter_lol",
destfile = paste(getwd(),"/profan2.csv", sep=""))
##Read in profanity word lists
profan1<- as.character(read.csv("profan1.csv", header=FALSE))
profan2<- as.character(row.names(read.csv("profan2.csv", header=TRUE, sep = ":")))
## Put the two lists together
profan<-c(profan1, profan2)
## Trim the first and last line of profan
profan<-profan[-1]
profan<-profan[-length(profan)]
## Remove profanity
dat10NoProfan <- removeWords(dat10NoStop, profan)
## Find out the object size difference after removing profanity
object.size(dat10NoPunc)
## 85298144 bytes
object.size(dat10NoProfan)
## 74721856 bytes
object.size(dat10NoPunc)-object.size(dat10NoProfan)
## 10576288 bytes
After preparing the Ngram lists, I am ready to visualize the data First, I will make some histograms to show the most frequent words ### Unigram histogram
par(mar = c(8,4,1,1) + 0.1, las = 2)
barplot(dfUniSort[1:20,2],col="blue",
names.arg = dfUniSort$Word[1:20],srt = 45,
space=0.1, xlim=c(0,20),
main = "Top 20 Unigrams by Frequency",
cex.names = 1, xpd = FALSE)
### Bigram histogram
par(mar = c(8,4,1,1) + 0.1, las = 2)
barplot(dfBiSort[1:20,2],col="green",
names.arg = dfBiSort$Word[1:20],srt = 45,
space=0.1, xlim=c(0,20),
main = "Top 20 Bigrams by Frequency",
cex.names = 1, xpd = FALSE)
### Trigram histogram
par(mar = c(8,4,1,1) + 0.1, las = 2)
barplot(dfTriSort[1:20,2],col="red",
names.arg = dfTriSort$Word[1:20],srt = 45,
space=0.1, xlim=c(0,20),
main = "Top 20 Trigrams by Frequency",
cex.names = 1, xpd = FALSE)
?barplot
## starting httpd help server ... done
Based on the plots above, it appears that the data cleaning and tokenization steps worked. I think removing single letter words does not hurt. I concluded that I should not remove two-letter words, as the lack of the word “of” may result in some meaning being lost. I believe these steps work well but they take a long time and I need to think about efficiency when training my model.
I noticed foreign words (mostly in Spanish) in the output files. These may cause problems, so I will work on a way of removing the words using a Spanish dictionary (word list). This will be similar to the approach of removing profanity. It would be great to make an app that could translate foreign words on the fly so that they can also be used in the analysis. That being said, I think however it will be better to simply remove foreign words to.
For my app, I am interested in providing functionality for hash tags from the twitter data. The idea is to predict what may follow a hash tag, just like other words. Hashtags by themselves are unigrams even if they represent multiple words (e.g. #HungryLikeAWolf), but they may be preceeded by other words. The predictive model would first try to predict by a quadgram, then a trigram, then a bigram and the word itself. In addition to word buttons to insert the text, I plan to show the output as a wordcloud wherein the word size is the probability of that word following what the user typed in.