Introduction

The following report is the first milestone of the capstone project of the Data Science Specialization offered by Coursera and Johns Hopkins University, in association with SwiftKey. The goal of the capstone project is to build a predictive text model that guesses the next word the user wants to write. SwiftKey provided data sets in several languages (English, German, Russian, Finnish) drawn from different sources: news, blogs and Twitter.

This report represents the first step in the pipeline and consists of an exploratory data analysis (EDA) of the data sets.

The following libraries will be needed:

library(quanteda)
library(ggplot2)
library(data.table)
library(dplyr)
library(gridExtra)

About the Data

The dataset can be obtained from the following link

For a first analysis we focus only on the English data set taken from Twitter.

con <- file("./Data/en_US.twitter.txt","rb")
txt <- readLines(con, encoding="UTF-8")
close(con)
length(txt)
[1] 2360148
print(object.size(txt), units = "auto")
301.4 Mb

The data set is large: it contains more than 2.3 million lines and takes up about 300 MB of memory. To save memory and speed up the analysis, a random sample of 50,000 lines is taken, which should still provide a clear picture of the data set. The sample is then converted into a corpus object from the quanteda library for easier handling.

sampleIndex <- sample(seq_along(txt), size = 50000, replace = FALSE)
txt <- txt[sampleIndex] %>% corpus()
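
Because sample() draws different lines on every run, the numbers and plots in a report like this one change between runs unless a seed is fixed. The following is a minimal sketch of reproducible sampling on a toy vector rather than the real file. The names `lines`, `idx` and `sampled`, and the seed value 1234, are assumptions made for illustration.

```r
# Toy stand-in for the full text vector (hypothetical data, not the real file)
lines <- paste("line", 1:1000)

# Fixing the seed makes the random draw repeatable; the value is arbitrary
set.seed(1234)
idx <- sample(seq_along(lines), size = 50, replace = FALSE)
sampled <- lines[idx]

length(sampled)      # 50
anyDuplicated(idx)   # 0, because sampling is done without replacement

# Resetting the same seed reproduces exactly the same indices
set.seed(1234)
idx_again <- sample(seq_along(lines), size = 50, replace = FALSE)
identical(idx, idx_again)   # TRUE
```

The same idea would apply to the full data set by calling set.seed() once before the sampling step.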

EDA

Tokenization

Using the quanteda package we can tokenize the corpus object and use the tokens to construct different n-grams. The tokens are changed to lower case since quanteda differentiates between upper and lower case. In English, capitalization does not play as important a role as in German, where special care should be taken.

tks <- txt %>% 
    tokens(what = "word", 
           remove_numbers = TRUE, remove_punct = TRUE,
           remove_symbols = TRUE, remove_hyphens = TRUE) %>%
    tokens_tolower()

unigram <- tks %>% tokens_ngrams(n = 1)
bigram <- tks %>% tokens_ngrams(n = 2)
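
To make the n-gram step concrete, here is a small base R sketch of how adjacent tokens can be joined into n-grams. It only illustrates the idea and is not quanteda's implementation. The helper `make_ngrams` and the example tokens are hypothetical.

```r
# Hypothetical helper: build n-grams from the tokens of a single document
make_ngrams <- function(tokens, n = 2, sep = "_") {
  # A document shorter than n tokens yields no n-grams
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  # Slide a window of width n over the tokens and join each window
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = sep),
         character(1))
}

# Lower-case the tokens first, as in the pipeline above
toks <- tolower(c("Thanks", "for", "the", "follow"))

make_ngrams(toks, n = 1)   # "thanks" "for" "the" "follow"
make_ngrams(toks, n = 2)   # "thanks_for" "for_the" "the_follow"
```

A document with k tokens produces k - n + 1 n-grams, so very short lines contribute few or no bigrams.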

Ngram Analysis

Using the dfm and textstat_frequency functions from the quanteda package we can easily identify the most common unigrams and bigrams to get a better picture of the data set.

uni.freq <- unigram %>% dfm() %>% textstat_frequency(n = 100)
bi.freq <- bigram %>% dfm() %>% textstat_frequency(n = 100)
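
As a rough illustration of what a frequency table of this kind contains, the following base R sketch counts tokens and ranks them. It loosely mirrors the feature, frequency and rank columns used in the plots of this report, but it is not quanteda's implementation and it ignores ties in the ranking. The token vector `toks` and the data frame `freq` are hypothetical.

```r
# Hypothetical token vector standing in for the flattened unigrams
toks <- c("the", "to", "the", "a", "to", "the")

# Count each distinct token and order the counts from most to least frequent
counts <- sort(table(toks), decreasing = TRUE)

freq <- data.frame(
  feature   = names(counts),
  frequency = as.integer(counts),
  rank      = seq_along(counts),   # simple positional rank, ties not handled
  stringsAsFactors = FALSE
)

freq$feature     # "the" "to" "a"
freq$frequency   # 3 2 1
```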

Word Cloud

library(RColorBrewer)
unigram %>% dfm() %>% textplot_wordcloud(max_words = 200,color = brewer.pal(8,"Dark2"))

bigram %>% dfm() %>% textplot_wordcloud(max_words = 200,color = brewer.pal(8,"Dark2"))

quanteda constructs bigrams by joining two words with an underscore ("_"). The following code replaces the underscore with a blank space for a better representation of the data.

# gsub() is vectorized, so it can be applied to the whole column at once
bi.freq$feature <- gsub("_", " ", bi.freq$feature, fixed = TRUE)
p1 <- uni.freq %>%
    filter(rank < 21) %>%
    ggplot( aes(x = reorder(feature, frequency), y = frequency)) +
    geom_point() + 
    coord_flip() + 
    labs(x = NULL, y = "Frequency")
p2 <- bi.freq %>%
    filter(rank < 21) %>%
    ggplot( aes(x = reorder(feature, frequency), y = frequency)) +
    geom_point() + 
    coord_flip() + 
    labs(x = NULL, y = "Frequency")
grid.arrange(p1,p2, nrow = 1)

We can then apply the same procedure to the two remaining data sets and compare their most common words.

con <- file("./Data/en_US.blogs.txt","rb")
txt.blogs <- readLines(con, encoding="UTF-8")
close(con)  # close the blogs connection before reusing `con`
con <- file("./Data/en_US.news.txt","rb")
txt.news <- readLines(con, encoding="UTF-8")
close(con)

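# Sample each file over its own length: sampleIndex was drawn from the
# Twitter line count, so reusing it here could return NA for any index
# beyond the end of a shorter file
txt.blogs <- sample(txt.blogs, size = 50000)
txt.news <- sample(txt.news, size = 50000)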

tks.blogs <- txt.blogs %>%
    tokens(what = "word", 
           remove_numbers = TRUE, remove_punct = TRUE,
           remove_symbols = TRUE, remove_hyphens = TRUE) %>%
    tokens_tolower()

tks.news <- txt.news %>%
    tokens(what = "word", 
           remove_numbers = TRUE, remove_punct = TRUE,
           remove_symbols = TRUE, remove_hyphens = TRUE) %>%
    tokens_tolower()

p1 <- uni.freq %>%
    filter(rank < 21) %>%
    ggplot( aes(x = reorder(feature, frequency), y = frequency)) +
    geom_point() + 
    coord_flip() + 
    labs(x = NULL, y = "Frequency", title = "Twitter")

p2 <- tks.blogs %>%
    dfm() %>%
    textstat_frequency() %>%
    filter(rank < 21) %>%
    ggplot( aes(x = reorder(feature, frequency), y = frequency)) +
    geom_point() + 
    coord_flip() + 
    labs(x = NULL, y = "Frequency", title = "Blogs")

p3 <- tks.news %>%
    dfm() %>%
    textstat_frequency() %>%
    filter(rank < 21) %>%
    ggplot( aes(x = reorder(feature, frequency), y = frequency)) +
    geom_point() + 
    coord_flip() + 
    labs(x = NULL, y = "Frequency", title = "News")

grid.arrange(p1,p2,p3,nrow = 1)

Conclusion

The plots show that the most common words are “the” and “to”. These words do not add much meaning to a sentence, but they are nevertheless an important part of the English language.

The next steps of the capstone project are to develop and to finalise the predictive algorithm, and then deploy the algorithm as a Shiny app.