This report is the first of a series written for the Data Science Capstone course by Johns Hopkins University (offered through Coursera). The objective of the capstone is to fit a model that allows the user to input one (or more) words and predicts the next word. In this report, the initial stages of the project are presented. Specifically, the following tasks are accomplished:
Get the data from the course website
Import the data into R and take a smaller sample for ease of computation.
Exploratory analysis (e.g. what are the most common words?)
The data were downloaded from the course website and saved locally. The downloaded data contained English text from three different sources: Twitter, news and blogs. The following code was used to import the data into R:
# set working directory
setwd("./en_US")
# Read the English Twitter dataset
con <- file("en_US.twitter.txt", "r")
# Read lines of text (skip embedded nul characters)
twitter <- readLines(con, skipNul = TRUE)
# close connection
close(con)
# Read the English blogs dataset
con <- file("en_US.blogs.txt", "r")
# Read lines of text (skip embedded nul characters)
blogs <- readLines(con, skipNul = TRUE)
# close connection
close(con)
# Read the English news dataset
con <- file("en_US.news.txt", "r")
# Read lines of text (skip embedded nul characters)
news <- readLines(con, skipNul = TRUE)
# close connection
close(con)
The three sources of data contain different numbers of lines and words, with Twitter being the most represented source in terms of number of lines.
| Source | Size (bytes) | Total lines | Total words |
|---|---|---|---|
| Twitter | 316037600 | 2360148 | 30093410 |
| News | 261759048 | 1010242 | 34762395 |
| Blogs | 260564320 | 899288 | 37546246 |
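As a rough illustration, the numbers in the table could be reproduced along the following lines. This is only a sketch, not necessarily the code used to build the table, and the helper name sourceSummary is made up: size in memory from object.size(), line counts from length(), and a simple whitespace-based word count.
# sketch: summarise each source (size in memory, number of lines, number of words)
sourceSummary <- function(x) {
  data.frame(size        = as.numeric(object.size(x)),
             Lines.Total = length(x),
             Words.Total = sum(sapply(strsplit(x, "\\s+"), length)))
}
rbind(Twitter = sourceSummary(twitter),
      News    = sourceSummary(news),
      Blogs   = sourceSummary(blogs))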
Each dataset requires a massive amount of memory (unfortunately, my laptop is not that powerful!); hence, to avoid running out of memory, I sampled ~5% of the original data to be used in what follows:
# Twitter data
# set seed for reproducibility
set.seed(123)
# select randomly from a binomial distribution
select <- rbinom(n = length(twitter), size = 1, prob = 0.05)
# create sample data
twitterSample <- twitter[(select == 1)]
# News data
# set seed for reproducibility
set.seed(12)
# select randomly from a binomial distribution
select <- rbinom(n = length(news), size = 1, prob = 0.05)
# create sample data
newsSample <- news[(select == 1)]
# Blogs data
# set seed for reproducibility
set.seed(13)
# select randomly from a binomial distribution
select <- rbinom(n = length(blogs), size = 1, prob = 0.05)
# create sample data
blogsSample <- blogs[(select == 1)]
| Source | Size (bytes) | Total lines | Total words |
|---|---|---|---|
| Twitter | 15899280 | 117684 | 1499181 |
| News | 13163216 | 50698 | 1749084 |
| Blogs | 13002816 | 44870 | 1874420 |
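For completeness, the same hypothetical sourceSummary() helper sketched above could be reused to produce the table for the ~5% samples:
rbind(Twitter = sourceSummary(twitterSample),
      News    = sourceSummary(newsSample),
      Blogs   = sourceSummary(blogsSample))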
To build the corpus, the data from each source were merged together. The corpus was built using the R package quanteda. I first tried the tm package; however, I decided to go with quanteda as it used less memory, allowed me to select a bigger subset of the original data, and reduced computation time by hours.
# merge the samples from each source and build a single corpus
finalSample <- c(twitterSample, newsSample, blogsSample)
finalSample <- corpus(finalSample)
# add source
docvars(finalSample, "source") <- c(rep("Twitter", length(twitterSample)),rep("news", length(newsSample)),
rep("Blogs", length(blogsSample)))
# remove separate files
rm(twitterSample); rm(newsSample); rm(blogsSample)
In the following analyses, the most common words, 2-grams and 3-grams were found after:
converting text to lowercase
removing numbers
removing punctuation
removing symbols
removing Twitter characters @ and #
removing URL beginning with http/https
removing stopwords. Stopwords are common English words (e.g. and, the, a) that will not help in building a prediction model. The word “will” was also removed: it is commonly used, but it will not add any information to a prediction model.
# need to add "will" to the ignored features as it is not in the stopword list
top1word <- dfm(finalSample, toLower = T, removeNumbers = T, removePunct = T,
removeTwitter = T, stem = T, ignoredFeatures = c("will",stopwords("english")),
verbose = F)
top1word
## Document-feature matrix of: 213,252 documents, 106,299 features.
# 10 most frequent words
top10 <- data.frame(topfeatures(top1word, 10))
Word <- rownames(top10)
top10 <- data.frame(Word, top10)
rownames(top10) <- NULL
colnames(top10) <- c("word","frequency")
kable(top10)
| word | frequency |
|---|---|
| said | 15373 |
| one | 15362 |
| just | 15097 |
| like | 15039 |
| get | 14953 |
| go | 13368 |
| time | 12834 |
| can | 12349 |
| day | 11128 |
| year | 10674 |
ggplot(top10, aes(x = reorder(word, frequency), y = frequency)) + geom_bar(stat = "identity") + theme_bw() +
coord_flip() + ylab("") + ggtitle("Top 10 most common words") + xlab("")
The following function was built to understand how many unique words are needed in a frequency-sorted dictionary to cover a user-specified percentage (perc) of all word instances in the language.
# put frequency for each word in one table (this is already sorted from the most frequent to the least)
wordFreq <- topfeatures(top1word, n = 141347)
# numbers of words needed for a specified coverage
coverage <- function(perc, x){
sum(cumsum(x) < sum(x)*perc)
}
# set different coverage (p) and check the number of words needed at each level
p <- seq(0,1,0.01)
wordsNeed <- c()
for(i in 1:length(p)){
wordsNeed[i] <- coverage(p[i], wordFreq)
}
dat <- data.frame(words = wordsNeed, coverage = p)
ggplot(dat, aes(x = wordsNeed, y = p)) + geom_line() + xlab("Number of unique words needed") +
ylab("Percentage of vocabulary needed") + theme_bw() + ggtitle("Number of words to cover the vocabulary") +
geom_vline(xintercept = coverage(0.5, wordFreq), col = "red") +
geom_vline(xintercept = coverage(0.9, wordFreq), col = "blue") +
annotate("text", x = (100 + coverage(0.5, wordFreq)), y = 0.25, label = coverage(0.5, wordFreq), col = "red", angle = 90) +
annotate("text", x = (100 + coverage(0.9, wordFreq)), y = 0.25, label = coverage(0.9, wordFreq), col = "blue", angle = 90)
# remove objects to free memory
rm(top1word); rm(wordFreq)
The 10 most common 2-grams were found using the dfm function, this time setting the n-gram length to 2:
top2word <- dfm(finalSample, toLower = T, removeNumbers = T, removePunct = T,
removeTwitter = T, stem = T, ignoredFeatures = c("will",stopwords("english")),
ngram = 2, verbose = F)
top2word
## Document-feature matrix of: 213,252 documents, 810,191 features.
# 10 most frequent 2gram
top102gram <- data.frame(topfeatures(top2word, 10))
Word <- rownames(top102gram)
top102gram <- data.frame(Word, top102gram)
rownames(top102gram) <- NULL
colnames(top102gram) <- c("word","frequency")
top102gram$word <- gsub("_", " ", top102gram$word )
kable(top102gram)
| word | frequency |
|---|---|
| right now | 1216 |
| last year | 1086 |
| new york | 1039 |
| last night | 834 |
| high school | 701 |
| last week | 682 |
| years ago | 680 |
| feel lik | 604 |
| looking forward | 594 |
| first tim | 584 |
ggplot(top102gram, aes(x = reorder(word, frequency), y = frequency)) + geom_bar(stat = "identity") + theme_bw() +
coord_flip() + ylab("") + ggtitle("Top 10 most common 2-grams") + xlab("")
Similarly, for the 10 most common 3-grams:
top3word <- dfm(finalSample, toLower = T, removeNumbers = T, removePunct = T,
removeTwitter = T, stem = T, ignoredFeatures = c("'","will",stopwords("english")),
ngram = 3, verbose = F)
top3word
## Document-feature matrix of: 213,252 documents, 521,607 features.
# 10 most frequent 3gram
top103gram <- data.frame(topfeatures(top3word, 10))
Word <- rownames(top103gram)
top103gram <- data.frame(Word, top103gram)
rownames(top103gram) <- NULL
colnames(top103gram) <- c("word","frequency")
top103gram$word <- gsub("_", " ", top103gram$word)
kable(top103gram)
| word | frequency |
|---|---|
| let us know | 132 |
| new york c | 120 |
| happy new year | 97 |
| happy mothers day | 97 |
| happy mother’s day | 92 |
| two years ago | 91 |
| new york tim | 79 |
| president barack obama | 79 |
| cinco de mayo | 62 |
| world war ii | 54 |
ggplot(top103gram, aes(x = reorder(word, frequency), y = frequency)) + geom_bar(stat = "identity") + theme_bw() +
coord_flip() + ylab("") + ggtitle("Top 10 most common 3-grams") + xlab("")
“said” was the most used word.
“right now” was the most used 2-gram.
“let us know” was the most used 3-gram.
The relation between the number of unique words needed and the vocabulary coverage was not linear.
Things to do before fitting the model:
try to eliminate profanity (a possible approach is sketched after this list).
How can we account for spelling mistakes? Or repeated words (e.g. a sentence containing the 2-gram “jobs jobs”)?
News, blogs and Twitter have quite different structures (e.g. a tweet is limited to 140 characters), so I want to try building a predictive model that accounts for the source of the sentence. For instance, a user who wants to know what comes after the word “happy” might get a different suggestion depending on the type of document he/she is currently writing (see the second sketch below).
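One simple way to eliminate profanity could be to drop every sampled line that contains a banned word before the corpus is built (i.e. on the merged character vector, before the call to corpus()). This is only a sketch under that assumption; profanityWords is a placeholder that would have to be replaced with a real list of banned words.
# sketch: drop sampled lines containing any banned word (placeholder list)
profanityWords <- c("badword1", "badword2")
profanityRegex <- paste0("\\b(", paste(profanityWords, collapse = "|"), ")\\b")
keep <- !grepl(profanityRegex, finalSample, ignore.case = TRUE)
finalSample <- finalSample[keep]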
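To illustrate the idea of source-dependent suggestions, here is a minimal sketch assuming a hypothetical frequency table ngramFreq with columns source, context, nextWord and count; none of these objects are created in the analysis above.
# sketch: return the most frequent next words for a given context and source
predictNext <- function(context, src, ngramFreq, top = 3) {
  hits <- ngramFreq[ngramFreq$source == src & ngramFreq$context == context, ]
  hits <- hits[order(-hits$count), ]
  head(hits$nextWord, top)
}
# e.g. predictNext("happy", "Twitter", ngramFreq) may suggest different words
# than predictNext("happy", "news", ngramFreq)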
The next step is building the model. First, I will try to work around my computer's memory problems and use a bigger set of data. Second, I will divide the available data into two sets: a training set on which the model will be fitted, and a test set on which the prediction accuracy of the selected model will be tested.
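A possible way to do the split is sketched below, applied to the merged sampled lines before the corpus is built; the 80/20 proportion and the object names are assumptions of mine.
# sketch: random 80/20 split of the sampled lines into training and test sets
set.seed(321)
allLines <- c(twitterSample, newsSample, blogsSample)
trainIdx <- sample(seq_along(allLines), size = round(0.8 * length(allLines)))
trainSet <- allLines[trainIdx]
testSet  <- allLines[-trainIdx]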
To build the model, I will start by looking at 4-grams, 3-grams and 2-grams (in this exact order). To avoid using too much memory and to speed up the algorithm, the model will be built so that the next word depends only on the current n-gram (a model with no long-term memory). The output of the algorithm will be the most frequent word (or maybe the 3 most frequent words). If no matching 4-gram, 3-gram or 2-gram is found, maybe just return the 3 most common words? (This point needs some more research.)
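The backoff idea could look roughly like the following sketch. The frequency tables freq4, freq3 and freq2 (one row per n-gram, with the preceding words in context and the final word in nextWord, sorted by decreasing frequency) and the vector topWords of overall most common words are assumptions, not objects created above.
# sketch: try the longest matching n-gram first, then shorten the context,
# and finally fall back to the overall most common words
predictWord <- function(lastWords, freq4, freq3, freq2, topWords, n = 3) {
  for (tab in list(freq4, freq3, freq2)) {
    hits <- tab$nextWord[tab$context == lastWords]
    if (length(hits) > 0) return(head(hits, n))
    # back off: drop the first word of the context and try a shorter n-gram
    lastWords <- sub("^\\S+\\s*", "", lastWords)
  }
  head(topWords, n)
}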