This report is the first of a series written for the Data Science Capstone course by Johns Hopkins University (offered through Coursera). The objective of the capstone is to fit a model that allows the user to input one (or more) words and predicts the next word. In this report, the initial stages of the project are presented. Specifically, the following tasks are accomplished:
Get the data from the course website
Import the data into R and take a smaller sample for ease of computation.
Exploratory analysis (e.g. what are the most common words?)
The data were downloaded from the course website and saved locally. The downloaded data contained English text from three different sources: Twitter, news and blogs. The following code was used to import the data into R:
# set working directory
setwd("./en_US")
# Read the English Twitter dataset
con <- file("en_US.twitter.txt", "r")
# Read lines of text (skip embedded nul characters)
twitter <- readLines(con, skipNul = TRUE)
# close connection
close(con)
# Read the English blogs dataset
con <- file("en_US.blogs.txt", "r")
# Read lines of text (skip embedded nul characters)
blogs <- readLines(con, skipNul = TRUE)
# close connection
close(con)
# Read the English news dataset
con <- file("en_US.news.txt", "r")
# Read lines of text (skip embedded nul characters)
news <- readLines(con, skipNul = TRUE)
# close connection
close(con)
The three sources of data contain different numbers of lines and words, with Twitter being the most represented source in terms of number of lines.
| Source | Size (bytes) | Total lines | Total words |
|---|---|---|---|
| Twitter | 316037600 | 2360148 | 30093410 |
| News | 261759048 | 1010242 | 34762395 |
| Blogs | 260564320 | 899288 | 37546246 |
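As a rough illustration, the numbers in the table could be reproduced along the following lines. This is only a sketch, not necessarily the code used to build the table, and the helper name sourceSummary is made up: size in memory from object.size(), line counts from length(), and a simple whitespace-based word count.
# sketch: summarise each source (size in memory, number of lines, number of words)
sourceSummary <- function(x) {
  data.frame(size        = as.numeric(object.size(x)),
             Lines.Total = length(x),
             Words.Total = sum(sapply(strsplit(x, "\\s+"), length)))
}
rbind(Twitter = sourceSummary(twitter),
      News    = sourceSummary(news),
      Blogs   = sourceSummary(blogs))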
Each dataset requires a massive amount of memory (unfortunately, my laptop is not that powerful!); hence, to avoid running out of memory, I sampled ~5% of the original data to be used in what follows:
# Twitter data
# set seed for reproducibility
set.seed(123)
# select randomly from a binomial distribution
select <- rbinom(n = length(twitter), size = 1, prob = 0.05)
# create sample data
twitterSample <- twitter[(select == 1)]
# News data
# set seed for reproducibility
set.seed(12)
# select randomly from a binomial distribution
select <- rbinom(n = length(news), size = 1, prob = 0.05)
# create sample data
newsSample <- news[(select == 1)]
# Blogs data
# set seed for reproducibility
set.seed(13)
# select randomly from a binomial distribution
select <- rbinom(n = length(blogs), size = 1, prob = 0.05)
# create sample data
blogsSample <- blogs[(select == 1)]
| Source | Size (bytes) | Total lines | Total words |
|---|---|---|---|
| Twitter | 15899280 | 117684 | 1499181 |
| News | 13163216 | 50698 | 1749084 |
| Blogs | 13002816 | 44870 | 1874420 |
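For completeness, the same hypothetical sourceSummary() helper sketched above could be reused to produce the table for the ~5% samples:
rbind(Twitter = sourceSummary(twitterSample),
      News    = sourceSummary(newsSample),
      Blogs   = sourceSummary(blogsSample))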
To build the corpus, the data from each source were merged together. The corpus was built using the R package quanteda. I first tried the tm package; however, I decided to go with quanteda as it used less memory, allowed me to select a bigger subset of the original data, and reduced computation time by hours.
# merge the samples from each source and build a single corpus
finalSample <- c(twitterSample, newsSample, blogsSample)
finalSample <- corpus(finalSample)
# add source
docvars(finalSample, "source") <- c(rep("Twitter", length(twitterSample)),rep("news", length(newsSample)),
rep("Blogs", length(blogsSample)))
# remove separate files
rm(twitterSample); rm(newsSample); rm(blogsSample)
In the following analyses, the most common words, 2-grams and 3-grams were found after:
converting text to lowercase
removing numbers
removing punctuation
removing symbols
removing Twitter characters @ and #
removing URL beginning with http/https
removing stopwords. Stopwords are common English words (e.g. and, the, a) that will not help in building a prediction model. The word “will” was also removed: it is commonly used, but it will not add any information to a prediction model.
# need to add "will" to the ignored features as it is not in the stopword list
top1word <- dfm(finalSample, toLower = T, removeNumbers = T, removePunct = T,
removeTwitter = T, stem = T, ignoredFeatures = c("will",stopwords("english")),
verbose = F)
top1word
## Document-feature matrix of: 213,252 documents, 106,299 features.
# 10 most frequent words
top10 <- data.frame(topfeatures(top1word, 10))
Word <- rownames(top10)
top10 <- data.frame(Word, top10)
rownames(top10) <- NULL
colnames(top10) <- c("word","frequency")
kable(top10)
| word | frequency |
|---|---|
| said | 15373 |
| one | 15362 |
| just | 15097 |
| like | 15039 |
| get | 14953 |
| go | 13368 |
| time | 12834 |
| can | 12349 |
| day | 11128 |
| year | 10674 |
ggplot(top10, aes(x = reorder(word, frequency), y = frequency)) + geom_bar(stat = "identity") + theme_bw() +
coord_flip() + ylab("") + ggtitle("Top 10 most common words") + xlab("")
The following function was built to understand how many unique words are needed in a frequency-sorted dictionary to cover a user-specified percentage (perc) of all word instances in the language.
# put frequency for each word in one table (this is already sorted from the most frequent to the least)
wordFreq <- topfeatures(top1word, n = 141347)
# numbers of words needed for a specified coverage
coverage <- function(perc, x){
sum(cumsum(x) < sum(x)*perc)
}
# set different coverage (p) and check the number of words needed at each level
p <- seq(0,1,0.01)
wordsNeed <- c()
for(i in 1:length(p)){
wordsNeed[i] <- coverage(p[i], wordFreq)
}
dat <- data.frame(words = wordsNeed, coverage = p)
ggplot(dat, aes(x = wordsNeed, y = p)) + geom_line() + xlab("Number of unique words needed") +
ylab("Percentage of vocabulary needed") + theme_bw() + ggtitle("Number of words to cover the vocabulary") +
geom_vline(xintercept = coverage(0.5, wordFreq), col = "red") +
geom_vline(xintercept = coverage(0.9, wordFreq), col = "blue") +
annotate("text", x = (100 + coverage(0.5, wordFreq)), y = 0.25, label = coverage(0.5, wordFreq), col = "red", angle = 90) +
annotate("text", x = (100 + coverage(0.9, wordFreq)), y = 0.25, label = coverage(0.9, wordFreq), col = "blue", angle = 90)
# remove objects to free memory
rm(top1word); rm(wordFreq)
The 10 most common 2-grams were found using the dfm function, this time setting the n-gram length to 2:
top2word <- dfm(finalSample, toLower = T, removeNumbers = T, removePunct = T,
removeTwitter = T, stem = T, ignoredFeatures = c("will",stopwords("english")),
ngram = 2, verbose = F)
top2word
## Document-feature matrix of: 213,252 documents, 810,191 features.
# 10 most frequent 2gram
top102gram <- data.frame(topfeatures(top2word, 10))
Word <- rownames(top102gram)
top102gram <- data.frame(Word, top102gram)
rownames(top102gram) <- NULL
colnames(top102gram) <- c("word","frequency")
top102gram$word <- gsub("_", " ", top102gram$word )
kable(top102gram)
| word | frequency |
|---|---|
| right now | 1216 |
| last year | 1086 |
| new york | 1039 |
| last night | 834 |
| high school | 701 |
| last week | 682 |
| years ago | 680 |
| feel lik | 604 |
| looking forward | 594 |
| first tim | 584 |
ggplot(top102gram, aes(x = reorder(word, frequency), y = frequency)) + geom_bar(stat = "identity") + theme_bw() +
coord_flip() + ylab("") + ggtitle("Top 10 most common 2-grams") + xlab("")
Similarly, for the 10 most common 3-grams:
top3word <- dfm(finalSample, toLower = T, removeNumbers = T, removePunct = T,
removeTwitter = T, stem = T, ignoredFeatures = c("'","will",stopwords("english")),
ngram = 3, verbose = F)
top3word
## Document-feature matrix of: 213,252 documents, 521,607 features.
# 10 most frequent 3gram
top103gram <- data.frame(topfeatures(top3word, 10))
Word <- rownames(top103gram)
top103gram <- data.frame(Word, top103gram)
rownames(top103gram) <- NULL
colnames(top103gram) <- c("word","frequency")
top103gram$word <- gsub("_", " ", top103gram$word)
kable(top103gram)
| word | frequency |
|---|---|
| let us know | 132 |
| new york c | 120 |
| happy new year | 97 |
| happy mothers day | 97 |
| happy mother’s day | 92 |
| two years ago | 91 |
| new york tim | 79 |
| president barack obama | 79 |
| cinco de mayo | 62 |
| world war ii | 54 |
ggplot(top103gram, aes(x = reorder(word, frequency), y = frequency)) + geom_bar(stat = "identity") + theme_bw() +
coord_flip() + ylab("") + ggtitle("Top 10 most common 3-grams") + xlab("")
“said” was the most used word.
“right now” was the most used 2-gram.
“let us know” was the most used 3-gram.
The relation between the number of unique words needed and the vocabulary coverage was not linear.
Things to do before fitting the model:
try to eliminate profanity (a possible approach is sketched after this list).
How can we account for spelling mistakes? Or repeated words (e.g. a sentence containing the 2-gram “jobs jobs”)?
News, blogs and Twitter have quite different structures (e.g. a tweet is limited to 140 characters), so I want to try building a predictive model that accounts for the source of the sentence. For instance, a user who wants to know what comes after the word “happy” might get a different suggestion depending on the type of document he/she is currently writing (see the second sketch below).
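One simple way to eliminate profanity could be to drop every sampled line that contains a banned word before the corpus is built (i.e. on the merged character vector, before the call to corpus()). This is only a sketch under that assumption; profanityWords is a placeholder that would have to be replaced with a real list of banned words.
# sketch: drop sampled lines containing any banned word (placeholder list)
profanityWords <- c("badword1", "badword2")
profanityRegex <- paste0("\\b(", paste(profanityWords, collapse = "|"), ")\\b")
keep <- !grepl(profanityRegex, finalSample, ignore.case = TRUE)
finalSample <- finalSample[keep]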
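To illustrate the idea of source-dependent suggestions, here is a minimal sketch assuming a hypothetical frequency table ngramFreq with columns source, context, nextWord and count; none of these objects are created in the analysis above.
# sketch: return the most frequent next words for a given context and source
predictNext <- function(context, src, ngramFreq, top = 3) {
  hits <- ngramFreq[ngramFreq$source == src & ngramFreq$context == context, ]
  hits <- hits[order(-hits$count), ]
  head(hits$nextWord, top)
}
# e.g. predictNext("happy", "Twitter", ngramFreq) may suggest different words
# than predictNext("happy", "news", ngramFreq)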
The next step is building the model. First, I will try to work around my computer's memory problems and use a bigger set of data. Second, I will divide the available data into two sets: a training set on which the model will be fitted, and a test set on which the prediction accuracy of the selected model will be tested.
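A possible way to do the split is sketched below, applied to the merged sampled lines before the corpus is built; the 80/20 proportion and the object names are assumptions of mine.
# sketch: random 80/20 split of the sampled lines into training and test sets
set.seed(321)
allLines <- c(twitterSample, newsSample, blogsSample)
trainIdx <- sample(seq_along(allLines), size = round(0.8 * length(allLines)))
trainSet <- allLines[trainIdx]
testSet  <- allLines[-trainIdx]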
To build the model, I will start by looking at 4-grams, 3-grams and 2-grams (in this exact order). To avoid using too much memory and to speed up the algorithm, the model will be built so that the next word depends only on the current n-gram (a model with no long-term memory). The output of the algorithm will be the most frequent word (or maybe the 3 most frequent words). If no matching 4-gram, 3-gram or 2-gram is found, maybe just return the 3 most common words? (This point needs some more research.)
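The backoff idea could look roughly like the following sketch. The frequency tables freq4, freq3 and freq2 (one row per n-gram, with the preceding words in context and the final word in nextWord, sorted by decreasing frequency) and the vector topWords of overall most common words are assumptions, not objects created above.
# sketch: try the longest matching n-gram first, then shorten the context,
# and finally fall back to the overall most common words
predictWord <- function(lastWords, freq4, freq3, freq2, topWords, n = 3) {
  for (tab in list(freq4, freq3, freq2)) {
    hits <- tab$nextWord[tab$context == lastWords]
    if (length(hits) > 0) return(head(hits, n))
    # back off: drop the first word of the context and try a shorter n-gram
    lastWords <- sub("^\\S+\\s*", "", lastWords)
  }
  head(topWords, n)
}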