Getting the data from the Coursera SwiftKey capstone data repository, and reading in the blogs, news, and Twitter English-language files.
setwd('/Users/garymu/Dropbox (Personal)/Coursera/DS/capstone/data')
unzip('Coursera-SwiftKey.zip', exdir = './final')
#load packages used throughout the analysis
library(quanteda)
library(ggplot2)
library(gridExtra)
library(dplyr)
#read in data; skipNul = TRUE skips embedded nul characters,
#which some of these raw files are known to contain
con <- file("final/en_US/en_US.blogs.txt", "r")
blog <- readLines(con, skipNul = TRUE)
con2 <- file("final/en_US/en_US.news.txt", "r")
news <- readLines(con2, skipNul = TRUE)
con3 <- file("final/en_US/en_US.twitter.txt", "r")
twitter <- readLines(con3, skipNul = TRUE)
close(con)
close(con2)
close(con3)
Before conducting the analysis, I'd like to get a high-level understanding of the data I am dealing with. I will look at things like how many lines and words there are in each file, to decide whether I can use the whole files or will need to sample later on for the analysis.
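A minimal sketch of one way to get these counts (word counts here are approximated by splitting each line on whitespace, so the exact numbers depend on the tokenization used):
#number of lines in each file
line_counts <- c(blogs = length(blog), news = length(news), twitter = length(twitter))
#approximate word counts: split each line on whitespace and count the pieces
word_counts <- c(blogs = sum(lengths(strsplit(blog, "\\s+"))),
                 news = sum(lengths(strsplit(news, "\\s+"))),
                 twitter = sum(lengths(strsplit(twitter, "\\s+"))))
line_counts
word_counts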
It looks like the files are fairly large: the news and Twitter files each have over 1 million lines (the blogs file has close to 1 million), and all three contain over 30 million words, with blogs having the most.
Using the entire corpus to build the n-gram models could cause performance issues, so I will sample the data.
#set seed and sample 10,000 lines from each file
set.seed(7)
sample_twitter <- sample(twitter, 10000)
sample_news <- sample(news, 10000)
sample_blogs <- sample(blog, 10000)
I will sample 10,000 lines from each file to build the n-gram models, and I will use the quanteda package for the n-gram exploratory analysis.
#build a quanteda corpus from each of the three samples
news_corpus <- corpus(sample_news)
blogs_corpus <- corpus(sample_blogs)
twitter_corpus <- corpus(sample_twitter)
#building unigram, bigram and trigram document-feature matrices (dfm)
#for each of the corpora with quanteda
build_ngram_dfm <- function(corpus, n = 1){
  dfm_list <- list()
  for(i in n){
    #lowercase, stem, drop punctuation and English stopwords, and form i-grams
    dfm_list[[length(dfm_list) + 1]] <-
      dfm(corpus,
          tolower = TRUE,
          stem = TRUE,
          remove_punct = TRUE,
          ngrams = i,
          verbose = TRUE,
          remove_twitter = FALSE,
          remove = stopwords('english'))
  }
  return(dfm_list)
}
#build uni/bi/tri-gram document feature matrices
news_1_3_gram_dfm <- build_ngram_dfm(news_corpus, 1:3)
blog_1_3_gram_dfm <- build_ngram_dfm(blogs_corpus, 1:3)
twitter_1_3_gram_dfm <- build_ngram_dfm(twitter_corpus, 1:3)
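Note that newer versions of quanteda deprecate passing these arguments directly to dfm(); if the code needs to run there, a roughly equivalent sketch (the function name build_ngram_dfm_v3 is just illustrative) would tokenize first and then build each DFM:
#rough equivalent using the tokens-based quanteda workflow
build_ngram_dfm_v3 <- function(corpus, n = 1){
  toks <- tokens(corpus, remove_punct = TRUE)
  toks <- tokens_tolower(toks)
  toks <- tokens_remove(toks, stopwords('english'))
  toks <- tokens_wordstem(toks)
  dfm_list <- list()
  for(i in n){
    dfm_list[[length(dfm_list) + 1]] <- dfm(tokens_ngrams(toks, n = i))
  }
  return(dfm_list)
}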
Now that we have built the uni-, bi-, and trigram document-feature matrices (DFMs) for the three corpora, we can start looking at the top n-grams. First, let's look at the most common single words using the unigram DFMs.
#top unigrams
newsu <- topfeatures(news_1_3_gram_dfm[[1]])
blogsu <- topfeatures(blog_1_3_gram_dfm[[1]])
twu <- topfeatures(twitter_1_3_gram_dfm[[1]])
#build data frames of the top unigrams, with levels ordered by count for plotting
twu_df <- tibble(words = names(twu), count = twu)
twu_df$words <- factor(twu_df$words, levels = twu_df$words[order(twu_df$count)])
newsu_df <- tibble(words = names(newsu), count = newsu)
newsu_df$words <- factor(newsu_df$words, levels = newsu_df$words[order(newsu_df$count)])
blogsu_df <- tibble(words = names(blogsu), count = blogsu)
blogsu_df$words <- factor(blogsu_df$words, levels = blogsu_df$words[order(blogsu_df$count)])
plot_twu <- ggplot(twu_df,aes(x = words, y = count)) + geom_bar(stat = 'identity') + coord_flip() + ggtitle('Twitter Top 10 Unigram')
plot_newsu <- ggplot(newsu_df, aes(x = words, y = count)) + geom_bar(stat = 'identity') + coord_flip() + ggtitle('News Top 10 Unigram')
plot_blogsu <- ggplot(blogsu_df, aes(x = words, y = count)) + geom_bar(stat = 'identity') + coord_flip() + ggtitle('Blogs Top 10 Unigram')
grid.arrange(plot_twu, plot_newsu, plot_blogsu, ncol = 3)
We have looked at the most frequent single words; now let's look at which words are most often used together, using the trigram DFMs.
#top trigrams
newst <- topfeatures(news_1_3_gram_dfm[[3]])
blogst <- topfeatures(blog_1_3_gram_dfm[[3]])
twt <- topfeatures(twitter_1_3_gram_dfm[[3]])
#build data frames of the top trigrams, with levels ordered by count for plotting
twt_df <- tibble(words = names(twt), count = twt)
twt_df$words <- factor(twt_df$words, levels = twt_df$words[order(twt_df$count)])
newst_df <- tibble(words = names(newst), count = newst)
newst_df$words <- factor(newst_df$words, levels = newst_df$words[order(newst_df$count)])
blogst_df <- tibble(words = names(blogst), count = blogst)
blogst_df$words <- factor(blogst_df$words, levels = blogst_df$words[order(blogst_df$count)])
plot_twt <- ggplot(twt_df,aes(x = words, y = count)) + geom_bar(stat = 'identity') + coord_flip() + ggtitle('Twitter Top 10 Trigram')
plot_newst <- ggplot(newst_df, aes(x = words, y = count)) + geom_bar(stat = 'identity') + coord_flip() + ggtitle('News Top 10 Trigram')
plot_blogst <- ggplot(blogst_df, aes(x = words, y = count)) + geom_bar(stat = 'identity') + coord_flip() + ggtitle('Blogs Top 10 Trigram')
grid.arrange(plot_twt, plot_newst, plot_blogst, ncol = 3)
As we can see, the unigram DFMs give us the words that are used most frequently, while the trigram DFMs (and likewise the bigram DFMs) show us which words are most often used together. This will help me build an algorithm to predict the next word a user is going to type, based on the type of document the user is writing (Twitter, blog, or news).
For example, if one of the top Twitter trigrams is along the lines of “thanks for the”, then when someone types “thanks for”, the predicted next word would be “the”, based on observed frequency.
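As a rough illustration of such a lookup, one could query the Twitter trigram DFM for features beginning with a chosen two-word prefix (a sketch only: quanteda joins n-gram features with underscores, and because the DFMs above were built with stemming and stopword removal, this particular prefix may not actually appear in them):
#most frequent trigrams starting with the (assumed) prefix "thanks_for"
prefix_trigrams <- dfm_select(twitter_1_3_gram_dfm[[3]], pattern = "thanks_for_*")
topfeatures(prefix_trigrams)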
What about word sequences that are not in the n-gram model? If that happens, we can fall back to the (n-1)-gram model, and keep backing off until we reach the unigram model.
For example, if someone types “This really does” and that sequence is not in our 4-gram training data, so we fail to predict, we can fall back to the trigram model, using “really does” as the prefix in our trigram data to predict the next word, and so on. If everything fails, the naive approach is to use the most frequent word from the unigram model.
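A minimal sketch of this back-off idea, using hypothetical toy n-gram count tables (named numeric vectors keyed by underscore-joined n-grams; in practice these would be built from the corpora, ideally without removing stopwords, since stopwords are exactly the kind of word we often need to predict):
#toy n-gram count tables; real tables would be built from the sampled corpora
trigram_counts <- c("thanks_for_the" = 50, "thanks_for_sharing" = 20, "one_of_the" = 35)
bigram_counts  <- c("for_the" = 120, "of_the" = 300, "does_not" = 4)
unigram_counts <- c("the" = 1000, "to" = 800, "and" = 750)
#predict the next word from the preceding words, backing off from the trigram
#table to the bigram table and finally to the single most frequent unigram
predict_next_word <- function(prev_words, trigrams, bigrams, unigrams){
  prev_words <- tolower(prev_words)
  #trigram level: use the last two words as the prefix
  if(length(prev_words) >= 2){
    prefix <- paste(tail(prev_words, 2), collapse = "_")
    hits <- trigrams[startsWith(names(trigrams), paste0(prefix, "_"))]
    if(length(hits) > 0){
      best <- names(hits)[which.max(hits)]
      return(substring(best, nchar(prefix) + 2))
    }
  }
  #bigram level: back off to the last word only
  if(length(prev_words) >= 1){
    prefix <- tail(prev_words, 1)
    hits <- bigrams[startsWith(names(bigrams), paste0(prefix, "_"))]
    if(length(hits) > 0){
      best <- names(hits)[which.max(hits)]
      return(substring(best, nchar(prefix) + 2))
    }
  }
  #unigram level: fall back to the most frequent single word
  names(unigrams)[which.max(unigrams)]
}
predict_next_word(c("thanks", "for"), trigram_counts, bigram_counts, unigram_counts)          #returns "the"
predict_next_word(c("this", "really", "does"), trigram_counts, bigram_counts, unigram_counts) #backs off and returns "not"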