As part of the Data Science Specialization Capstone experience, I demonstrate the ability to work with relatively unstructured text data. The first step is to download and read in the data.
The code below is from the course materials. It shows how to read data from txt files using the readLines function.
con <- file("en_US.twitter.txt", "r")
readLines(con, 1) ## Read the first line of text
readLines(con, 1) ## Read the next line of text
readLines(con, 5) ## Read in the next 5 lines of text
close(con) ## It's important to close the connection when you are done
## Download and read in the data
The data are available at this url: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
### Download the data
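As a sketch (the zip file name and the final/ folder it extracts to are assumptions based on the archive above), the download and unzip step could look like this:
## Sketch: fetch and unzip the corpus if it is not already present
zipURL  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipFile <- "Coursera-SwiftKey.zip"
if (!file.exists(zipFile)) download.file(zipURL, destfile = zipFile, mode = "wb")
if (!dir.exists("final")) unzip(zipFile)  ## extracts the final/ folder with en_US etc.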
## You can set your working directory to final/en_US...
## and then use the filenames (wrapped in quotes)...
## but I prefer to load the path into memory...
path<-paste(getwd(),"/final/en_US", sep="")
## and then load the filenames into memory.
files<-list.files(path)
##Here are the file names
print(files)
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
## If I had more files, I would use a loop for the next steps...
## but for 3 files it is not worth it and in my opinion,
## avoiding loops makes the code more readable and faster...
## Load file paths into memory
blogPath <- paste(path,"/", files[1],sep="")
newsPath <- paste(path,"/", files[2],sep="")
twitPath <- paste(path,"/", files[3],sep="")
## First, I will read in all of the data, but later I will take a subsample.
## The raw binary ("rb") method was the only way I could read in the full news data set.
## I skipped embedded nuls in all three files, though these were only present in the twitter data set.
con <- file(blogPath, open="rb")
blog<-readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)
con <- file(newsPath, open="rb")
news<-readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)
con <- file(twitPath, open="rb")
twit<-readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)
Now that the data are read into memory, we can obtain basic information about the data files (e.g. file sizes) and their contents (e.g. word counts). First I obtain file sizes in bytes, then convert bytes to mebibytes (MiB). Mebibytes are part of the International Electrotechnical Commission (IEC) system. Personally, I prefer International System of Units (SI) units, because I think SI units, e.g. kilobytes and megabytes (1000 and 1000^2 bytes respectively), make more sense than IEC units, e.g. kibibytes and mebibytes (1024 and 1024^2 bytes respectively). However, my preference is biased by the fact that I am a laboratory scientist :)
## Get file sizes in Bytes
blogBytes <- file.info(blogPath)$size
newsBytes <- file.info(newsPath)$size
twitBytes <- file.info(twitPath)$size
##Convert bytes to mebibytes(MiB)
blogMB <- blogBytes / 1024 ^ 2
newsMB <- newsBytes / 1024 ^ 2
twitMB <- twitBytes / 1024 ^ 2
## Get the number of lines
blogLines <- length(count.fields(blogPath, sep="\n"))
newsLines <- length(count.fields(newsPath, sep="\n"))
twitLines <- length(count.fields(twitPath, sep="\n"))
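## Note: count.fields reads the files in text mode, so (e.g. on Windows) it can
## stop early at the embedded control character in the news file, the same reason
## the "rb" trick was needed above for readLines. length(blog), length(news) and
## length(twit) would count the lines already read into memory.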
## Get the number of words per line using sapply and gregexpr base functions
blogWords<-sapply(gregexpr("[[:alpha:]]+", blog), function(x) sum(x > 0))
newsWords<-sapply(gregexpr("[[:alpha:]]+", news), function(x) sum(x > 0))
twitWords<-sapply(gregexpr("[[:alpha:]]+", twit), function(x) sum(x > 0))
## Alternative: Get the number of words in each line using stringi package
##install.packages("stringi")
##library(stringi)
##blogWords <- stri_count_words(blog)
##newsWords <- stri_count_words(news)
##twitWords <- stri_count_words(twit)
## Sum the number of words in each line to get total words
blogWordsSum<-sum(blogWords)
newsWordsSum<-sum(newsWords)
twitWordsSum<-sum(twitWords)
##Get the character count (per line) for each data set
blogChar<-nchar(blog, type = "chars")
newsChar<-nchar(news, type = "chars")
twitChar<-nchar(twit, type = "chars")
##Sum the character counts to get total number of characters
blogCharSum<-sum(blogChar)
newsCharSum<-sum(newsChar)
twitCharSum<-sum(twitChar)
## Alternative: Use the Unix command wc e.g. system("wc filepath")
## This will give the lines, words and characters.
## I trust Unix commands > R base functions > R packages :)
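As a quick sketch (this assumes a Unix-like shell is available from R):
## Sketch: wc prints lines, words and bytes for a file
system(paste("wc", blogPath), intern = TRUE)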
This is the first deliverable and nicely summarizes information about our datasets (from the previous code chunk).
df<-data.frame(File=c("Blogs", "News", "Twitter"),
fileSize = c(blogMB, newsMB, twitMB),
lineCount = c(blogLines, newsLines, twitLines),
wordCount = c(blogWordsSum, newsWordsSum, twitWordsSum),
charCount = c(blogCharSum,newsCharSum,twitCharSum),
wordMean = c(mean(blogWords), mean(newsWords), mean(twitWords)),
charMean = c(mean(blogChar), mean(newsChar), mean(twitChar))
)
df
## File fileSize lineCount wordCount charCount wordMean charMean
## 1 Blogs 200.4242 898821 37873860 206824505 42.11538 229.98695
## 2 News 196.2775 77258 34613675 203223159 34.26276 201.16285
## 3 Twitter 159.3641 2329858 30555611 162096241 12.94648 68.68054
So far, we have made a table of raw-data statistics using only base functions (i.e. no dependencies).
## Sample the data and obtain descriptive statistics
### Obtain a sample of the data
Now, we will obtain the same set of statistics for a sample of the data. First, I will set the seed so that I can obtain exactly the same samples later. Next, I could sample by number of characters or words, but I will sample by line count.
set.seed(20170219)
blog10 <- sample(blog, size = length(blog) / 10, replace = FALSE)
news10 <- sample(news, size = length(news)/10, replace = FALSE)
twit10 <- sample(twit, size = length(twit) / 10, replace = FALSE)
## I first tried the rbinom subsetting method below, but it did not work for me:
## rbinom draws random binomial counts rather than unique line indices, so the
## indices cluster around half the file length and contain duplicates, which is
## why the resulting samples looked truncated.
#blog10 <- blog[rbinom(length(blog)/10, length(blog), .5)]
#news10 <- news[rbinom(length(news)/10, length(news), .5)]
#twit10 <- twit[rbinom(length(twit)/10, length(twit), .5)]
The next few steps are (almost) the same as before, except this time I will use the samples instead of the full datasets.
First, I will obtain sample sizes in mebibytes (MiB), as per my IEC vs. SI unit rant above :) Note that object.size measures the size of the R object in memory, whereas the earlier table used the on-disk file size, so the two are not directly comparable. MiB still makes sense here even though I took a 1/10 sample of the datasets; if I took a smaller sample it would be better to use kibibyte units.
blog10MB <- format(object.size(blog10), standard = "IEC", units = "MiB")
news10MB <- format(object.size(news10), standard = "IEC", units = "MiB")
twit10MB <- format(object.size(twit10), standard = "IEC", units = "MiB")
## Get the number of lines
blog10Lines <- length(blog10)
news10Lines <- length(news10)
twit10Lines <- length(twit10)
## Get the number of words per line using sapply and gregexpr base functions
blog10Words<-sapply(gregexpr("[[:alpha:]]+", blog10), function(x) sum(x > 0))
news10Words<-sapply(gregexpr("[[:alpha:]]+", news10), function(x) sum(x > 0))
twit10Words<-sapply(gregexpr("[[:alpha:]]+", twit10), function(x) sum(x > 0))
## Sum the number of words in each line to get total words
blog10WordsSum<-sum(blog10Words)
news10WordsSum<-sum(news10Words)
twit10WordsSum<-sum(twit10Words)
##Get the character count (per line) for each data set
blog10Char<-nchar(blog10, type = "chars")
news10Char<-nchar(news10, type = "chars")
twit10Char<-nchar(twit10, type = "chars")
##Sum the character counts to get total number of characters
blog10CharSum<-sum(blog10Char)
news10CharSum<-sum(news10Char)
twit10CharSum<-sum(twit10Char)
## Alternative: Use the Unix command wc e.g. system("wc filepath")
## This will give the lines, words and characters.
## For simple things like these, I trust Unix commands > R base functions > R packages :)
This is the second deliverable and nicely summarizes information about our samples (from the previous code chunk). It is important to check these values against the previous table: the per-line means should be close to the full-data means, and the line, word and character totals should be roughly one tenth of the full-data totals. If this is not the case, it may indicate that something went wrong with the sampling.
df10 <- data.frame(File=c("Blogs Sample", "News Sample", "Twitter Sample"),
fileSize = c(blog10MB, news10MB, twit10MB),
lineCount = c(blog10Lines, news10Lines, twit10Lines),
wordCount = c(blog10WordsSum, news10WordsSum, twit10WordsSum),
charCount = c(blog10CharSum,news10CharSum,twit10CharSum),
wordMean = c(mean(blog10Words), mean(news10Words), mean(twit10Words)),
charMean = c(mean(blog10Char), mean(news10Char), mean(twit10Char))
)
df10
## File fileSize lineCount wordCount charCount wordMean charMean
## 1 Blogs Sample 24.9 MiB 89928 3792244 20702129 42.16978 230.20782
## 2 News Sample 25 MiB 101024 3466726 20359366 34.31587 201.52999
## 3 Twitter Sample 30.4 MiB 236014 3054861 16207919 12.94356 68.67355
For the data cleaning steps, I will first put all three of the dataset samples together. Then I will remove stop words, extra whitespace, punctuation, profanity, one-letter words and symbols. For many of these steps, I will use the tm package.
### Put all of the dataset samples together
#install.packages("tm")
library(tm)
## Loading required package: NLP
## Put all of the data samples together
#dat<- c(blog,news,twit)
dat10<- c(blog10,news10,twit10)
dat10NoPunc<- removePunctuation(dat10)
dat10NoWS<- stripWhitespace(dat10NoPunc)
dat10NoStop <- removeWords(dat10NoWS, stopwords("english"))
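One caveat worth noting (an observation only; I did not change the run above): removeWords is case-sensitive, so capitalized stop words such as "The" at the start of a sentence survive this step, and lowercasing before stop-word removal would catch them. A toy illustration:
## Illustration only: lowercasing first lets the stop-word filter catch "The"
toy <- "The cat, the CAT,   sat!"
removeWords(stripWhitespace(removePunctuation(tolower(toy))), stopwords("english"))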
At first, I was not sure whether to remove profanity, because I didn't like the lists of "bad" words I found on GitHub, e.g. https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en and https://gist.github.com/jamiew/1112488. There are perfectly normal (in my opinion) words mixed into those lists, e.g. anatomical terms. We are all adults here (I assume), and I think the profane words can also be interesting for analysis. I decided to remove profanity anyway because I was terrified at the thought that the N-word would be one of the top-ranked unigrams by frequency :/ In the end, it did not make a big difference in object size, so probably not a big loss of data.
## download profanity word lists
if (!file.exists("profan1.csv")) {
download.file("https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en",
destfile = paste(getwd(),"/profan1.csv", sep=""))
}
if (!file.exists("profan2.csv")) {
download.file("https://gist.githubusercontent.com/jamiew/1112488/raw/7ca9b1669e1c24b27c66174762cb04e14cf05aa7/google_twunter_lol",
destfile = paste(getwd(),"/profan2.csv", sep=""))
}
##Read in profanity word lists
profan1<- as.character(read.csv("profan1.csv", header=FALSE))
profan2<- as.character(row.names(read.csv("profan2.csv", header=TRUE, sep = ":")))
## Put the two lists together
profan<-c(profan1, profan2)
## Trim the first and last line of profan
profan<-profan[-1]
profan<-profan[-length(profan)]
## Remove profanity
dat10NoProfan <- removeWords(dat10NoStop, profan)
## Find out the object size difference after removing profanity
object.size(dat10NoPunc)
## 82070992 bytes
object.size(dat10NoProfan)
## 71548504 bytes
object.size(dat10NoPunc)-object.size(dat10NoProfan)
## 10522488 bytes
I noticed a lot of weird characters in my output, e.g. "â", "o", "î", "z", "???", "T", "³", "ð", "¾", "ñ", "~"… These look like encoding artifacts to me, so I think the problem may be in the data themselves and unrelated to the preprocessing.
I came up with several methods for converting to lowercase, including dat10Lower <- sapply(dat10NoStop, tolower); dat10Lower <- tm_map(corp, tolower), which assumes the text has first been wrapped in a tm corpus corp; and dat10Lower <- tolower(dat10NoStop). I went with the stringi package method below. Nevertheless, I think the examples above work just fine.
library(stringi)
dat10Lower <- stri_trans_tolower(dat10NoProfan)
I decided to explicitly list the symbols I want removed. I know I could come up with a simple regular expression to accomplish the same thing.
dat10azONLY <- gsub("[ðâ?'³¾ñ.º°»²¼><¹·¸¦~]", "", dat10Lower)
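An alternative I did not run here (a sketch only) is to skip the hand-picked symbol list and drop everything outside ASCII in one pass with base R's iconv:
## Sketch (not part of the original run): remove all non-ASCII characters
## sub = "" drops characters that cannot be converted
dat10ascii <- iconv(dat10Lower, from = "UTF-8", to = "ASCII", sub = "")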
All non-alphanumeric characters should be removed by now, but I remove punctuation again just to be sure. I also remove extra whitespace again, in case any of the previous steps created new whitespace. I will remove this redundancy and improve efficiency for my final project app.
dat10NoPunc2<- removePunctuation(dat10azONLY)
dat10NoWS2<- stripWhitespace(dat10NoPunc2)
#Remove single letter words
dat10NoShort <- removeWords(dat10NoWS2, "\\b\\w{1}\\b")
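## (This works because removeWords builds a regular expression from its words
##  argument, so the \\b\\w{1}\\b pattern above is applied as a regex.)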
Here I put together lists of unigrams, bigrams and trigrams.
## I was not able to install the RWeka package because of a Java version problem.
## Instead of trying to figure it out, I used the ngram_tokenizer snippet
## created by Maciej Szymkiewicz, aka zero323, on GitHub.
if (!file.exists("ngram_tokenizer.R")) {
download.file("https://raw.githubusercontent.com/zero323/r-snippets/master/R/ngram_tokenizer.R",
destfile = paste(getwd(),"/ngram_tokenizer.R", sep=""))
}
source("ngram_Tokenizer.R")
unigram_tokenizer <- ngram_tokenizer(1)
uniList <- unigram_tokenizer(dat10NoShort)
freqNames <- as.vector(names(table(unlist(uniList))))
freqCount <- as.numeric(table(unlist(uniList)))
dfUni <- data.frame(Word = freqNames,
Count = freqCount)
dfUniSort <- dfUni[order(-dfUni$Count), ]
bigram_tokenizer <- ngram_tokenizer(2)
biList <- bigram_tokenizer(dat10NoShort)
freqNames <- as.vector(names(table(unlist(biList))))
freqCount <- as.numeric(table(unlist(biList)))
dfBi <- data.frame(Word = freqNames,
Count = freqCount)
dfBiSort <- dfBi[order(-dfBi$Count), ]
trigram_tokenizer <- ngram_tokenizer(3)
triList <- trigram_tokenizer(dat10NoShort)
freqNames <- as.vector(names(table(unlist(triList))))
freqCount <- as.numeric(table(unlist(triList)))
dfTri <- data.frame(Word = freqNames,
Count = freqCount)
dfTriSort <- dfTri[order(-dfTri$Count), ]
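The three blocks above are nearly identical, so as an aside, the same steps could be wrapped in a small helper function (a sketch, not what I ran, assuming ngram_tokenizer has been sourced as above):
## Sketch only: build a sorted n-gram frequency table for any n
ngramFreq <- function(n, text) {
  tokens <- ngram_tokenizer(n)(text)     ## tokenize the text into n-grams
  freq   <- table(tokens)                ## count each distinct n-gram
  df     <- data.frame(Word = names(freq), Count = as.numeric(freq))
  df[order(-df$Count), ]                 ## sort by descending count
}
## e.g. dfTriSort <- ngramFreq(3, dat10NoShort)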
After preparing the n-gram lists, I am ready to visualize the data. First, I will make some histograms to show the most frequent words.
### Unigram histogram
par(mar = c(8,4,1,1) + 0.1, las = 2)
barplot(dfUniSort[1:20,2],col="blue",
names.arg = dfUniSort$Word[1:20],srt = 45,
space=0.1, xlim=c(0,20),
main = "Top 20 Unigrams by Frequency",
cex.names = 1, xpd = FALSE)
### Bigram histogram
par(mar = c(8,4,1,1) + 0.1, las = 2)
barplot(dfBiSort[1:20,2],col="green",
names.arg = dfBiSort$Word[1:20],srt = 45,
space=0.1, xlim=c(0,20),
main = "Top 20 Bigrams by Frequency",
cex.names = 1, xpd = FALSE)
### Trigram histogram
par(mar = c(8,4,1,1) + 0.1, las = 2)
barplot(dfTriSort[1:20,2],col="red",
names.arg = dfTriSort$Word[1:20],srt = 45,
space=0.1, xlim=c(0,20),
main = "Top 20 Trigrams by Frequency",
cex.names = 1, xpd = FALSE)
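As with the n-gram tables, the three plotting blocks could be collapsed into a small helper (a sketch, not what I ran; plotTop is just an illustrative name):
## Sketch only: barplot of the top n entries of a sorted frequency table
plotTop <- function(dfSort, n = 20, col = "blue", main = "") {
  par(mar = c(8, 4, 1, 1) + 0.1, las = 2)
  barplot(dfSort$Count[1:n], names.arg = dfSort$Word[1:n],
          col = col, space = 0.1, main = main, cex.names = 1)
}
## e.g. plotTop(dfBiSort, col = "green", main = "Top 20 Bigrams by Frequency")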
Based on the plots above, it appears that the data cleaning and tokenization steps worked. I think removing single letter words does not hurt. I concluded that I should not remove two-letter words, as the lack of the word “of” may result in some meaning being lost. I believe these steps work well but they take a long time and I need to think about efficiency when training my model.
I noticed foreign words (mostly Spanish) in the output. These may cause problems, so I will work on a way of removing them using a Spanish dictionary (word list), similar to the approach used for removing profanity. It would be great to make an app that could translate foreign words on the fly so that they could also be used in the analysis; that being said, I think it will be better to simply remove them.
For my app, I am interested in providing functionality for hashtags from the Twitter data. The idea is to predict what may follow a hashtag, just like any other word. Hashtags by themselves are unigrams even if they represent multiple words (e.g. #HungryLikeAWolf), but they may be preceded by other words. The predictive model would first try to predict with a quadgram, then back off to a trigram, then a bigram, and finally fall back to single-word (unigram) frequencies. In addition to word buttons for inserting the text, I plan to show the output as a word cloud in which the word size reflects the probability of that word following what the user typed.
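To make the backoff idea concrete, here is a rough sketch using only the unigram/bigram/trigram tables built above (quadgrams are not built yet; predictNext is just an illustrative name, and it assumes the tokenizer joins n-grams with single spaces):
## Sketch only: backoff-style next-word lookup over the sorted n-gram tables
predictNext <- function(input) {
  words <- unlist(strsplit(stri_trans_tolower(input), "\\s+"))
  n <- length(words)
  if (n >= 2) {                                   ## try trigrams first
    tri <- as.character(dfTriSort$Word)
    hit <- tri[startsWith(tri, paste(words[n - 1], words[n], ""))]
    if (length(hit) > 0) return(sub(".* ", "", hit[1]))
  }
  if (n >= 1) {                                   ## back off to bigrams
    bi  <- as.character(dfBiSort$Word)
    hit <- bi[startsWith(bi, paste(words[n], ""))]
    if (length(hit) > 0) return(sub(".* ", "", hit[1]))
  }
  as.character(dfUniSort$Word[1])                 ## last resort: top unigram
}
## e.g. predictNext("thanks for the")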