Introduction

A first step in developing a text prediction application is to perform text analysis on the available data, in this case three files containing millions of English sentences (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt).

This document presents the results of that analysis, carried out as part of the Coursera Data Science Specialization Capstone project. Given the amount of data involved, and following the recommendations on the course website, a Linux machine with 16 GB of RAM was used so that the analysis could run without partitioning the data. For those interested, a link to an article on how to create such an instance on AWS is provided in the references.
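
On a machine with less memory, one alternative would be to analyze a random sample of lines from each file instead. The sketch below illustrates that approach; it was not used in this analysis, and the 10% rate and the engtext_sample/ path are only illustrative.

# Not used here: keep a 10% random sample of one file so later steps need less memory
set.seed(1234)
all_lines <- readLines("engtext/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
sampled   <- sample(all_lines, round(0.1 * length(all_lines)))
dir.create("engtext_sample", showWarnings = FALSE)
writeLines(sampled, "engtext_sample/en_US.twitter.txt")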

A table with the elapsed time and object size for every operation is given at the end of the document, together with some findings and the next steps considered to achieve the goal of the project.

Exploratory Data Analysis and Modeling

Text analysis comprises the following operations (and associated objects); a compact sketch of the full pipeline is given right after this list:

  1. Importing data (files, webpages)
  2. Cleaning and Preprocessing (Corpus and tokens)
  3. Representing, filtering and weighting (document-feature matrix, token list)
  4. Analyzing (results)
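
Expressed with the packages used later in this report, the whole pipeline fits in a few calls. The sketch below is only a compact overview; the parameterised version of each step appears in the following sections.

# Compact sketch of the full pipeline (details in the sections below)
txt   <- readtext("engtext/*.txt")              # 1. importing
crp   <- corpus(txt)                            # 2. cleaning and preprocessing
tok   <- tokens(crp, remove_punct = TRUE)
dfmat <- dfm(tokens_ngrams(tok, n = 2))         # 3. representing and weighting
textstat_frequency(dfmat, n = 20)               # 4. analyzing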

Some useful concepts are as follows:

  - Corpus: a collection of documents stored together with document-level metadata.
  - Tokens: the individual units (here, words) obtained by splitting each document.
  - Document-feature matrix (dfm): a matrix recording how often each feature (word or n-gram) appears in each document.

I decided to use the R package readtext to read the files and quanteda to perform the text analysis. The latter provides all the functions needed to carry out the operations above and to manipulate the corresponding objects.

Loading required libraries

library(quanteda)   # corpus, tokens, dfm and textstat_* functions
library(readtext)   # reading the text files
library(ggplot2)    # plotting
library(scales)     # formatting plot axis labels

Reading the documents

enus_data <- readtext("engtext/*.txt")
summary(enus_data)
##     doc_id              text          
##  Length:3           Length:3          
##  Class :character   Class :character  
##  Mode  :character   Mode  :character
str(enus_data)
## Classes 'readtext' and 'data.frame': 3 obs. of  2 variables:
##  $ doc_id: chr  "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
##  $ text  : chr  "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.\nWe love you Mr. B"| __truncated__ "He wasn't home alone, apparently.\nThe St. Louis plant had to close. It would die of old age. Workers had been "| __truncated__ "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.\n"| __truncated__
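
A quick way to gauge the size of each imported text directly from the readtext data frame (a small supplementary check, not part of the original analysis):

# Characters and lines per document, computed from the imported texts
data.frame(doc_id = enus_data$doc_id,
           chars  = nchar(enus_data$text),
           lines  = lengths(strsplit(enus_data$text, "\n", fixed = TRUE)))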

Creating a Corpus

enus_corpus <- corpus(enus_data)
summary(enus_corpus)
## Corpus consisting of 3 documents:
## 
##               Text  Types   Tokens Sentences
##    en_US.blogs.txt 482432 42840140   2072941
##     en_US.news.txt 431664 39918314   1867522
##  en_US.twitter.txt 566951 36719658   2588551
## 
## Source: /home/capstone/* on x86_64 by capstone
## Created: Mon Jun 25 01:11:03 2018
## Notes:
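
The figures reported by summary() can also be obtained individually through quanteda helper functions, which is convenient when only one of them is needed:

# Individual corpus statistics, corresponding to the summary above
ndoc(enus_corpus)        # number of documents
ntoken(enus_corpus)      # tokens per document
ntype(enus_corpus)       # distinct word types per document
nsentence(enus_corpus)   # sentences per document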

Creating Tokens

# Tokenise the corpus, dropping numbers, punctuation, symbols,
# Twitter-specific characters (# and @) and URLs
enus_token <- tokens(enus_corpus, 
                     remove_numbers = TRUE, 
                     remove_punct = TRUE, 
                     remove_symbols = TRUE, 
                     remove_twitter = TRUE, 
                     remove_url = TRUE)
str(enus_token)
## List of 3
##  $ en_US.blogs.txt  : chr [1:36903161] "In" "the" "years" "thereafter" ...
##  $ en_US.news.txt   : chr [1:33485223] "He" "wasn't" "home" "alone" ...
##  $ en_US.twitter.txt: chr [1:29540580] "How" "are" "you" "Btw" ...
##  - attr(*, "types")= chr [1:999217] "In" "the" "years" "thereafter" ...
##  - attr(*, "padding")= logi FALSE
##  - attr(*, "class")= chr "tokens"
##  - attr(*, "what")= chr "word"
##  - attr(*, "ngrams")= int 1
##  - attr(*, "skip")= int 0
##  - attr(*, "concatenator")= chr "_"
##  - attr(*, "docvars")='data.frame':  3 obs. of  0 variables
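
The token counts above still distinguish capitalised and lower-case forms (e.g. "In" vs. "in"). Lower-casing is an optional normalisation step, not applied in this analysis, that would shrink the number of types before the n-grams are built:

# Optional (not applied here): merge case variants of the same word
enus_token_lc <- tokens_tolower(enus_token)
ntype(enus_token_lc)     # fewer distinct types than in enus_token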

Creating DFMs and n-grams

Bigrams

enus_ng2 <- tokens_ngrams(enus_token, n=2, concatenator=" ")
enus_dfm2 <- dfm(enus_ng2)
df.dfm2 <- textstat_frequency(enus_dfm2, n=20)
df.dfm2
##     feature frequency rank docfreq group
## 1    of the    431090    1       3   all
## 2    in the    413185    2       3   all
## 3    to the    214544    3       3   all
## 4   for the    201614    4       3   all
## 5    on the    197104    5       3   all
## 6     to be    162012    6       3   all
## 7    at the    143381    7       3   all
## 8   and the    126365    8       3   all
## 9      in a    120233    9       3   all
## 10 with the    106282   10       3   all
## 11     is a    101092   11       3   all
## 12   it was     96487   12       3   all
## 13    for a     94219   13       3   all
## 14 from the     87498   14       3   all
## 15   i have     86473   15       3   all
## 16    i was     86081   16       3   all
## 17    it is     82611   17       3   all
## 18    and i     82551   18       3   all
## 19   with a     81952   19       3   all
## 20  will be     81163   20       3   all
Plot: top 20 bigrams by frequency.
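
The ggplot2 and scales packages loaded earlier can produce such a chart from df.dfm2; a minimal sketch (not necessarily the original plotting code):

# Bar chart of the 20 most frequent bigrams
ggplot(df.dfm2, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = comma) +
  labs(x = "Bigram", y = "Frequency", title = "Top 20 bigrams")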

Trigrams

Before building the trigrams, memory was reclaimed with gc():

gc()
##            used  (Mb) gc trigger   (Mb)   max used   (Mb)
## Ncells  2993958 159.9   38820696 2073.3   33637410 1796.5
## Vcells 72781052 555.3  801851851 6117.7 1043815786 7963.7
enus_ng3 <- tokens_ngrams(enus_token, n=3, concatenator=" ")
enus_dfm3 <- dfm(enus_ng3)
df.dfm3 <- textstat_frequency(enus_dfm3, n=20)
df.dfm3
##               feature frequency rank docfreq group
## 1          one of the     34620    1       3   all
## 2            a lot of     30060    2       3   all
## 3      thanks for the     23846    3       3   all
## 4             to be a     18229    4       3   all
## 5         going to be     17447    5       3   all
## 6           i want to     15082    6       3   all
## 7          the end of     14938    7       3   all
## 8          out of the     14814    8       3   all
## 9            it was a     14334    9       3   all
## 10         as well as     13952   10       3   all
## 11        some of the     13679   11       3   all
## 12         be able to     13068   12       3   all
## 13        part of the     12395   13       3   all
## 14           i have a     11872   14       3   all
## 15          i have to     11292   15       3   all
## 16        the rest of     11246   16       3   all
## 17 looking forward to     11232   17       3   all
## 18       i don't know     11124   18       3   all
## 19      thank you for     10297   19       3   all
## 20        is going to     10177   20       3   all
Plot: top 20 trigrams by frequency.
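
The same plotting approach used for the bigrams applies here. More relevant to the goal of the project, the trigram frequencies already suggest the shape of a prediction table: each trigram can be split into a two-word prefix and the word that follows it. The sketch below is only an illustration on the top-20 table; a real model would use the full frequency table (textstat_frequency without the n argument).

# Illustrative only: "prefix -> most likely next word" lookup from the trigrams
parts  <- strsplit(as.character(df.dfm3$feature), " ", fixed = TRUE)
lookup <- data.frame(
  prefix    = sapply(parts, function(p) paste(p[1:2], collapse = " ")),
  next_word = sapply(parts, function(p) p[3]),
  frequency = df.dfm3$frequency,
  stringsAsFactors = FALSE
)
# df.dfm3 is sorted by frequency, so the first row per prefix is the best guess
lookup <- lookup[!duplicated(lookup$prefix), ]
head(lookup)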

Time elapsed

Operation            Time (mins)   Object Size
Reading documents      0.605791         552 Mb
Creating a Corpus      6.924373         550 Mb
Creating Tokens        2.4952943      440.2 Mb
Bigrams                5.9525106     1569.7 Mb
Trigrams              19.5056557     4251.4 Mb
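
These figures were recorded while running the analysis. One way to capture them is shown below for the corpus step; this is an illustrative pattern, not the code actually used in this report.

# Illustrative timing and size capture for a single step
t0 <- Sys.time()
enus_corpus <- corpus(enus_data)
elapsed_min <- as.numeric(difftime(Sys.time(), t0, units = "mins"))
data.frame(operation   = "Creating a Corpus",
           time_mins   = round(elapsed_min, 4),
           object_size = format(object.size(enus_corpus), units = "Mb"))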

Findings and Next Steps

Findings:

Next Steps:

References