Introduction

This is the milestone report for the second week of the Data Science Capstone. The aim of this report is to showcase basic features of the data, present first analyses, and outline some ideas about how to proceed from here.

The data

The data were provided by SwiftKey and contain sentences collected from blogs, Twitter, and the news in four different languages (German, English, Finnish, Russian). I will concentrate on the English dataset in this report, but in principle everything should be applicable to the other languages as well.

We start with the three files en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.

The file sizes of about 200 MB each already tell us that memory capacity will be a major factor in this analysis. First, I read all the data and check the number of lines, the total word count, and the mean number of words per line for each dataset.

library(stringi)
library(pander)

setwd("~/Google Drive/coursera/cap_stone/final/en_US/")

# read each file and record the number of lines (l), the total word
# count (w), and the mean number of words per line (m)
conn <- file("en_US.blogs.txt", "r")
blogs <- readLines(conn, skipNul = TRUE)
l1 <- length(blogs)
w1 <- sum(stri_count_words(blogs))
m1 <- w1 / l1
close(conn)

conn <- file("en_US.news.txt", "r")
news <- readLines(conn, skipNul = TRUE)
l2 <- length(news)
w2 <- sum(stri_count_words(news))
m2 <- w2 / l2
close(conn)

conn <- file("en_US.twitter.txt", "r")
twitter <- readLines(conn, skipNul = TRUE)
l3 <- length(twitter)
w3 <- sum(stri_count_words(twitter))
m3 <- w3 / l3
close(conn)

# collect the per-dataset statistics in one table
stats <- data.frame(dataset         = c("blogs", "news", "twitter"),
                    number_of_lines = c(l1, l2, l3),
                    number_of_words = c(w1, w2, w3),
                    mean_no_words   = c(m1, m2, m3))
pander(stats)
 dataset   number_of_lines   number_of_words   mean_no_words
--------- ----------------- ----------------- ---------------
  blogs         899288          37546246           41.75
  news         1010242          34762395           34.41
 twitter       2360148          30093410           12.75

Exploratory analysis

All three datasets contain more than 30,000,000 words, but the numbers of lines, and therefore the mean numbers of words per line, differ vastly. Twitter lines contain the fewest words, followed by the news and lastly the blogs. So the first question is: do we really need that many words to see trends in word occurrences? To check, I sampled different fractions of lines (50%, 25%, 10%, 5%, 2.5%) from the blogs dataset and compared the lists of the 1000 most prevalent words, to see whether a much smaller set of words gives the same result. Since this analysis takes some time, I am only showing the overlap of the 1000 most prevalent words between the 25%, 10%, and 2.5% samples.
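
The sampling step itself is not shown here; the following is a minimal sketch of how such top-1000 lists can be obtained, assuming simple lowercased word counts (the helper top_words is illustrative, not the exact code used):

library(stringi)

# sample a fraction of lines and return the n most frequent words
top_words <- function(lines, frac, n = 1000) {
    set.seed(42)                          # reproducible sampling
    sampled <- sample(lines, round(frac * length(lines)))
    tokens <- unlist(stri_extract_all_words(stri_trans_tolower(sampled)))
    head(names(sort(table(tokens), decreasing = TRUE)), n)
}

# e.g. size of the overlap between the 25% and 2.5% top-1000 lists
length(intersect(top_words(blogs, 0.25), top_words(blogs, 0.025)))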

library(VennDiagram)
## Loading required package: grid
## Loading required package: futile.logger
# areas and pairwise/triple overlaps were computed beforehand from the
# top-1000 lists of the three samples
grid.newpage()
draw.triple.venn(area1 = 1000, area2 = 1000, area3 = 1000,
                 n12 = 973, n23 = 940, n13 = 948, n123 = 932,
                 category = c("25%", "10%", "2.5%"), lty = "blank",
                 fill = c("skyblue", "pink1", "mediumorchid"))

(Venn diagram: overlap of the 1000 most prevalent words between the 25%, 10%, and 2.5% samples.)

While the overlap of 932 words is certainly very high, it is also obvious that the 2.5% sample has the highest number of words unique to its top-1000 list, most likely an artifact of the random sampling of lines. Still, a 2.5% sample seems reasonable for a first look at the data; for the model, higher percentages will be used.

It is obvious that some words are used much more often than you would expect ('one', 'will', 'can'). There is also overlap between the datasets (e.g. 665 words are in the top 1000 of both the blogs and the news), but some words seem to be dataset-specific ('said' in the news dataset). Interestingly, the Twitter dataset seems to contain far fewer distinct words (ca. 7,000 compared to 15,000 in the other datasets).

I will now generate bigrams and trigrams for the datasets and see which terms are overrepresented.
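
The n-gram code below operates on corpora (b_corpus, n_corpus, t_corpus) whose construction is not shown in this report. A minimal sketch, assuming 2.5% samples and standard tm cleaning steps (make_corpus is an illustrative helper):

library(tm)

# build a cleaned tm corpus from a character vector of lines
make_corpus <- function(lines) {
    corpus <- VCorpus(VectorSource(lines))
    corpus <- tm_map(corpus, content_transformer(tolower))  # lowercase
    corpus <- tm_map(corpus, removePunctuation)             # drop punctuation
    corpus <- tm_map(corpus, removeNumbers)                 # drop digits
    tm_map(corpus, stripWhitespace)                         # collapse whitespace
}

set.seed(42)
b_corpus <- make_corpus(sample(blogs, round(0.025 * length(blogs))))
n_corpus <- make_corpus(sample(news, round(0.025 * length(news))))
t_corpus <- make_corpus(sample(twitter, round(0.025 * length(twitter))))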

library(dplyr)    # for arrange()

# builds a term-document matrix from a corpus using an n-gram tokenizer
# and returns a data frame of n-grams sorted by frequency;
# ngrams() and words() come from NLP, which is attached with tm
get_freq_ngram <- function(x, ngram) {
    Tokenizer <- function(x) unlist(lapply(ngrams(words(x), ngram), paste,
                                           collapse = " "), use.names = FALSE)
    tdm <- removeSparseTerms(TermDocumentMatrix(x, control = list(tokenize = Tokenizer)),
                             0.9999)
    ng <- as.data.frame(rowSums(as.matrix(tdm)))
    ng$word <- rownames(ng)
    names(ng) <- c("freq", "word")
    arrange(ng, desc(freq))
}

blogs_bi = get_freq_ngram(b_corpus,2)
news_bi = get_freq_ngram(n_corpus,2)
twitter_bi = get_freq_ngram(t_corpus,2)

par(mfrow=c(1,3))
par(mar=c(10,4.1,4.1,2.1))
barplot(blogs_bi[1:20,]$freq, las = 2, names.arg = blogs_bi[1:20,]$word,
         col ="lightblue", main ="blogs",
         ylab = "Word frequencies")
barplot(news_bi[1:20,]$freq, las = 2, names.arg = news_bi[1:20,]$word,
         col ="lightblue", main ="news",
         ylab = "Word frequencies")
barplot(twitter_bi[1:20,]$freq, las = 2, names.arg = twitter_bi[1:20,]$word,
         col ="lightblue", main ="twitter",
         ylab = "Word frequencies")

In contrast to the single words, the bigrams differ quite a bit between the datasets, e.g. there are a lot of city names (New York, New Jersey) in the news dataset. Again, the Twitter dataset has the fewest distinct bigrams. The same trends are also obvious in the trigrams.

blogs_tri = get_freq_ngram(b_corpus,3)
news_tri = get_freq_ngram(n_corpus,3)
twitter_tri = get_freq_ngram(t_corpus,3)

par(mfrow=c(1,3))
par(mar=c(10,4.1,4.1,2.1))
barplot(blogs_tri[1:20,]$freq, las = 2, names.arg = blogs_tri[1:20,]$word,
         col ="lightblue", main ="blogs",
         ylab = "Word frequencies")
barplot(news_tri[1:20,]$freq, las = 2, names.arg = news_tri[1:20,]$word,
         col ="lightblue", main ="news",
         ylab = "Word frequencies")
barplot(twitter_tri[1:20,]$freq, las = 2, names.arg = twitter_tri[1:20,]$word,
         col ="lightblue", main ="twitter",
         ylab = "Word frequencies")

Thoughts for the model

The actual Shiny app will be relatively simple. Since there were clear differences between the datasets, it may be good to give the user the option to choose between casual (Twitter, blogs) and formal (news) word predictions. Beyond that, the app will consist of a text box for the user input and an area that displays the predicted words.
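
A minimal sketch of such a UI, using standard shiny building blocks (all identifiers are illustrative and the prediction call is stubbed out):

library(shiny)

ui <- fluidPage(
    titlePanel("Next-word prediction"),
    radioButtons("style", "Prediction style:",
                 choices = c("casual (blogs/twitter)" = "casual",
                             "formal (news)" = "formal")),
    textInput("text", "Type your sentence:"),
    textOutput("prediction")
)

server <- function(input, output) {
    output$prediction <- renderText({
        # the real app would call the n-gram model here
        paste("predictions for:", input$text)
    })
}

shinyApp(ui, server)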

I will use the tm package to build models for each dataset and for the combined datasets. I will also have to test whether using bigrams is better than using trigrams, or whether the outcome is the same.
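
To make the last point concrete, here is a minimal sketch of a lookup-based prediction with a simple trigram-to-bigram backoff, using the frequency tables produced by get_freq_ngram above (predict_next is an illustrative name, not the final model):

# predict the next word from the last words of the input, backing off
# from trigrams to bigrams when no trigram matches
predict_next <- function(input, tri, bi, n = 3) {
    toks <- unlist(strsplit(tolower(input), "\\s+"))
    last2 <- paste(tail(toks, 2), collapse = " ")
    hits <- tri[startsWith(tri$word, paste0(last2, " ")), ]
    if (nrow(hits) == 0) {                    # back off to bigrams
        hits <- bi[startsWith(bi$word, paste0(tail(toks, 1), " ")), ]
    }
    # the tables are sorted by frequency, so take the final word of the
    # top-n most frequent matching n-grams
    sapply(strsplit(head(hits$word, n), " "), tail, 1)
}

predict_next("thanks for the", blogs_tri, blogs_bi)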