Context and requirements

The goal of this project is to demonstrate the ability to work with large text data sets. This report presents an exploratory analysis of the major features of the data in a way that is understandable to a non-data-scientist manager. Specifically, the aims are to:

  1. demonstrate that data can be easily loaded,
  2. create a basic report of summary statistics about the data sets,
  3. report any interesting findings made so far.

Loading the libraries

library(knitr)
library(readtext)
library(stringr)
library(quanteda)
library(ggplot2)

Exploring the data set

setwd("C:/Users/Samir/Desktop/E-Learning/Coursera/Capstone Project/final/en_US")
blogs   <- readLines("en_US.blogs.txt", warn = FALSE)   # read each file only once
news    <- readLines("en_US.news.txt", warn = FALSE)
twitter <- readLines("en_US.twitter.txt", warn = FALSE)
summary <- data.frame(File = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
                      Size_in_MB = c(file.info("en_US.blogs.txt")$size,
                                     file.info("en_US.news.txt")$size,
                                     file.info("en_US.twitter.txt")$size) / (1024^2),
                      Max_Nb_Characters_per_row = c(max(nchar(blogs)), max(nchar(news)), max(nchar(twitter))),
                      Min_Nb_Characters_per_row = c(min(nchar(blogs)), min(nchar(news)), min(nchar(twitter))),
                      Nb_of_rows = c(length(blogs), length(news), length(twitter)))
kable(summary)
File                 Size_in_MB   Max_Nb_Characters_per_row   Min_Nb_Characters_per_row   Nb_of_rows
en_US.blogs.txt        200.4242                       40835                           1       899288
en_US.news.txt         196.2775                        5760                           2        77259
en_US.twitter.txt      159.3641                         213                           2      2360148

Loading and preliminary processing of the data

The data is imported with the readtext() function (from the package of the same name). Even though all of the data is in plain-text files, this function is convenient because it loads files of different formats into the same standard output.

corpus <- readtext("*.txt") # read all *.txt files into a readtext data frame
corpus$text <- iconv(corpus$text, "UTF-8", "ASCII", sub = "") # exclude non-ASCII characters
corpus$text <- str_replace_all(corpus$text, " - ", " ") # exclude isolated dashes
corpus$text <- str_replace_all(corpus$text, "[^[:alnum:]]['-]", " ") # drop apostrophes/hyphens not attached to a word
corpus$text <- str_replace_all(corpus$text, "\\b[^IiAa]\\b", "") # exclude one-character strings except "I" and "a"
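As a quick sanity check, here is the effect of the last substitution on a made-up string (the sentence is invented purely for illustration):

str_replace_all("only I saw a b c d plan", "\\b[^IiAa]\\b", "")
## [1] "only I saw a    plan"

The single letters b, c and d are dropped while "I" and "a" survive; the leftover runs of spaces are harmless, since the tokeniser splits on whitespace anyway.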

Tokenisation and processing of tokens

To compute the frequency of occurrence of each word, the text is split into tokens, and all punctuation, numbers and symbols are removed. The tokens are then converted to lower case, English stopwords such as “the” are excluded, and finally the tokens are stemmed. From this step on, I use the quanteda package, allocating four threads to the computation.

quanteda_options(threads = 4) # let quanteda use four threads
toks <- tokens(corpus(corpus), remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks) # stemming; stems such as "happi" appear in the n-gram lists below
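To illustrate what these steps produce, here is the same pipeline applied to a single made-up sentence (the expected result is shown as a comment):

demo <- tokens("The runners were running quickly", remove_punct = TRUE)
demo <- tokens_tolower(demo)
demo <- tokens_remove(demo, stopwords("english"))
tokens_wordstem(demo)
## text1: "runner" "run" "quickli"

The stopwords “the” and “were” are removed, and the stemmer reduces “runners”, “running” and “quickly” to their stems.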

Construction of the document-feature matrix

The document-feature matrix (DFM) is a representation that allows algebraic analysis of the data. The rows of this matrix represent the documents of the corpus, the columns represent the tokens, and the cells give the frequency of each token in each document. Since computing the DFM is time-consuming and resource-intensive, especially for higher-order n-grams, I chose to do all of the processing in the previous steps. Note, however, that it is also possible to do most of this processing directly through the dfm() function.
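To make the structure concrete, here is a minimal sketch on a two-document toy corpus (the documents toy, doc1 and doc2 are invented for illustration):

toy <- c(doc1 = "the cat sat on the mat", doc2 = "the dog sat")
dfm(tokens(toy))
## a 2 x 6 sparse dfm; schematically:
## features: the cat sat on mat dog
## doc1        2   1   1  1   1   0
## doc2        1   0   1  0   0   1

The real DFMs below have exactly the same shape, only with 3 documents and hundreds of thousands to millions of features.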

1-gram features

dtm_ng1 <- dfm(toks, ngrams = 1, verbose = TRUE)
## Creating a dfm from a tokens input...
##    ... lowercasing
##    ... found 3 documents, 437,099 features
##    ... created a 3 x 437,099 sparse dfm
##    ... complete. 
## Elapsed time: 144 seconds.
topfeatures(dtm_ng1, n = 20)
##   just    can   like    get    one     go   time   love    day   make 
## 256018 251392 247499 246467 231657 216605 201934 191008 187186 159898 
##   know   good  thank    now    see   work    new  think   look   want 
## 157854 157373 150672 147314 135317 131606 130067 128734 126886 126194
textplot_wordcloud(dtm_ng1, max_words = 50, colors = c("red", "pink", "green", "purple", "orange", "blue"))

topFeats_1 <- topfeatures(dtm_ng1, 20)
topDf_1 <- data.frame(features = names(topFeats_1), freq = topFeats_1)
ggplot(data = topDf_1, aes(x = reorder(features, freq), y = freq, fill = features)) + 
        geom_bar(stat = "identity") +
        theme_minimal() +
        labs(x = "Feature", y = "Frequency", title = "Feature frequencies") +
        coord_flip() +
        guides(fill = FALSE)

2-gram features

dtm_ng2 <- dfm(toks, ngrams = 2, verbose = TRUE)
## Creating a dfm from a tokens input...
##    ... lowercasing
##    ... found 3 documents, 12,246,120 features
##    ... created a 3 x 12,246,120 sparse dfm
##    ... complete. 
## Elapsed time: 432 seconds.
topfeatures(dtm_ng2, n = 20) 
##      right_now      look_like       can_wait     last_night      feel_like 
##          22322          17986          16474          15865          15199 
##   look_forward   thank_follow       don_know        can_get       year_old 
##          14912          12502          11467          11215          11131 
##      last_year       new_york      make_sure happi_birthday       let_know 
##           9109           9002           8967           8900           8660 
##       year_ago      good_morn     first_time         let_go       just_got 
##           8648           8439           8373           8124           7964
textplot_wordcloud(dtm_ng2, max_words = 50, colors = c("red", "pink", "green", "purple", "orange", "blue")) 

topFeats_2 <- topfeatures(dtm_ng2, 20)
topDf_2 <- data.frame(features = names(topFeats_2), freq = topFeats_2)
ggplot(data = topDf_2, aes(x = reorder(features, freq), y = freq, fill = features)) + 
        geom_bar(stat = "identity") +
        theme_minimal() +
        labs(x = "Feature", y = "Frequency", title = "Feature frequencies") +
        coord_flip() +
        guides(fill = FALSE)

3-gram features

dtm_ng3 <- dfm(toks, ngrams = 3, verbose = TRUE)
## Creating a dfm from a tokens input...
##    ... lowercasing
##    ... found 3 documents, 32,961,816 features
##    ... created a 3 x 32,961,816 sparse dfm
##    ... complete. 
## Elapsed time: 1698 seconds.
topfeatures(dtm_ng3, n = 20)
##   happi_mother_day       can_wait_see        let_us_know 
##               3472               3148               2637 
##     happi_new_year   look_forward_see      new_york_citi 
##               2176               1601               1321 
##      cinco_de_mayo follow_follow_back    dream_come_true 
##               1128                917                868 
##      don_even_know     love_love_love      new_york_time 
##                865                858                827 
##       can_wait_get     st_patrick_day     make_feel_like 
##                813                765                709 
## happi_valentin_day       new_year_eve       ve_ever_seen 
##                694                684                668 
##       let_just_say      just_got_back 
##                666                662
textplot_wordcloud(dtm_ng3, max_words = 50, colors = c("red", "pink", "green", "purple", "orange", "blue"))

topFeats_3 <- topfeatures(dtm_ng3, 20)
topDf_3 <- data.frame(features = names(topFeats_3), freq = topFeats_3)
ggplot(data = topDf_3, aes(x = reorder(features, freq), y = freq, fill = features)) + 
        geom_bar(stat = "identity") +
        theme_minimal() +
        labs(x = "Feature", y = "Frequency", title = "Feature frequencies") +
        coord_flip() +
        guides(fill = FALSE)

A complete study, culminating in a Shiny application, will be available soon.