A first step in developing a text prediction application is to perform text analysis on the available data, in this case three files containing millions of sentences in the English language (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt).
This document presents the results of that analysis, as part of the Coursera Data Science Specialization Capstone project. Given the amount of data to be analyzed, and following the recommendations on the course website, a Linux machine with 16 GB of RAM was used so the analysis could be performed without partitioning the data. For those interested, a link to an article on how to create such an instance on AWS is available in the references.
A table with the elapsed time for each operation is available at the end of the document, together with some findings and the next steps considered to achieve the goal of the project.
Text analysis comprises the following operations (and objects):

- Reading the raw text files (readtext object)
- Creating a corpus (corpus object)
- Tokenizing the corpus (tokens object)
- Building bigrams and trigrams (n-gram tokens objects)
- Computing feature frequencies (dfm object and frequency tables)

Some useful concepts are as follows:

- Corpus: a collection of documents stored together with document-level metadata.
- Token: a unit of text, typically a word, obtained by splitting the documents.
- N-gram: a contiguous sequence of n tokens; bigrams and trigrams have two and three tokens respectively.
- Document-feature matrix (dfm): a matrix recording how often each feature (a token or n-gram) occurs in each document.
I decided to use the R package readtext to read the files, and quanteda to perform the text analysis. The latter contains all the functions required to perform the operations and manipulate the objects listed above.
library(quanteda)
library(readtext)
library(ggplot2)
library(scales)
enus_data <- readtext("engtext/*.txt")
summary(enus_data)
##     doc_id              text
##  Length:3           Length:3
##  Class :character   Class :character
##  Mode  :character   Mode  :character
str(enus_data)
## Classes 'readtext' and 'data.frame': 3 obs. of 2 variables:
## $ doc_id: chr "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
## $ text : chr "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.\nWe love you Mr. B"| __truncated__ "He wasn't home alone, apparently.\nThe St. Louis plant had to close. It would die of old age. Workers had been "| __truncated__ "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.\n"| __truncated__
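As a quick sanity check before building the corpus, the raw files can also be inspected with base R; a minimal sketch counting lines and characters per file (skipNul guards against embedded null characters, which may appear in these files):

# Sketch: line and character counts for each raw file (base R only)
raw_files <- list.files("engtext", pattern = "\\.txt$", full.names = TRUE)
for (f in raw_files) {
  f_lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  cat(basename(f), ":", length(f_lines), "lines,",
      sum(nchar(f_lines)), "characters\n")
}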
enus_corpus <- corpus(enus_data)
summary(enus_corpus)
## Corpus consisting of 3 documents:
##
##                Text  Types   Tokens Sentences
##     en_US.blogs.txt 482432 42840140   2072941
##      en_US.news.txt 431664 39918314   1867522
##   en_US.twitter.txt 566951 36719658   2588551
##
## Source: /home/capstone/* on x86_64 by capstone
## Created: Mon Jun 25 01:11:03 2018
## Notes:
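For machines with less RAM, the same analysis could be run on a random sample of sentences instead of the full corpus; a sketch using quanteda's corpus_reshape() and corpus_sample() (illustrative only; the full corpus was used for all results in this report):

# Sketch: reshape the corpus to sentences and keep a 10% random sample
set.seed(1234)
enus_sent <- corpus_reshape(enus_corpus, to = "sentences")
enus_sample <- corpus_sample(enus_sent, size = round(0.1 * ndoc(enus_sent)))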
enus_token <- tokens(enus_corpus,
                     remove_numbers = TRUE,
                     remove_punct = TRUE,
                     remove_symbols = TRUE,
                     remove_twitter = TRUE,
                     remove_url = TRUE)
str(enus_token)
## List of 3
## $ en_US.blogs.txt : chr [1:36903161] "In" "the" "years" "thereafter" ...
## $ en_US.news.txt : chr [1:33485223] "He" "wasn't" "home" "alone" ...
## $ en_US.twitter.txt: chr [1:29540580] "How" "are" "you" "Btw" ...
## - attr(*, "types")= chr [1:999217] "In" "the" "years" "thereafter" ...
## - attr(*, "padding")= logi FALSE
## - attr(*, "class")= chr "tokens"
## - attr(*, "what")= chr "word"
## - attr(*, "ngrams")= int 1
## - attr(*, "skip")= int 0
## - attr(*, "concatenator")= chr "_"
## - attr(*, "docvars")='data.frame': 3 obs. of 0 variables
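Before moving to n-grams, the same frequency machinery can be applied to single words; a sketch of the top-20 unigrams (lowercased so that "The" and "the" are counted together):

# Sketch: top-20 single-word frequencies
enus_dfm1 <- dfm(tokens_tolower(enus_token))
df.dfm1 <- textstat_frequency(enus_dfm1, n = 20)
df.dfm1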
enus_ng2 <- tokens_ngrams(enus_token, n=2, concatenator=" ")
enus_dfm2 <- dfm(enus_ng2)
df.dfm2 <- textstat_frequency(enus_dfm2, n=20)
df.dfm2
##     feature frequency rank docfreq group
## 1    of the    431090    1       3   all
## 2    in the    413185    2       3   all
## 3    to the    214544    3       3   all
## 4   for the    201614    4       3   all
## 5    on the    197104    5       3   all
## 6     to be    162012    6       3   all
## 7    at the    143381    7       3   all
## 8   and the    126365    8       3   all
## 9      in a    120233    9       3   all
## 10 with the    106282   10       3   all
## 11     is a    101092   11       3   all
## 12   it was     96487   12       3   all
## 13    for a     94219   13       3   all
## 14 from the     87498   14       3   all
## 15   i have     86473   15       3   all
## 16    i was     86081   16       3   all
## 17    it is     82611   17       3   all
## 18    and i     82551   18       3   all
## 19   with a     81952   19       3   all
## 20  will be     81163   20       3   all
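The ggplot2 and scales packages loaded at the start can be used to visualize these counts; a sketch plotting the top-20 bigrams:

# Sketch: horizontal bar chart of the 20 most frequent bigrams
ggplot(df.dfm2, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = comma) +   # comma() comes from the scales package
  labs(x = "Bigram", y = "Frequency")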
Garbage collection was run between the n-gram steps to free memory:

gc()
##             used  (Mb) gc trigger   (Mb)   max used   (Mb)
## Ncells   2993958 159.9   38820696 2073.3   33637410 1796.5
## Vcells  72781052 555.3  801851851 6117.7 1043815786 7963.7
enus_ng3 <- tokens_ngrams(enus_token, n=3, concatenator=" ")
enus_dfm3 <- dfm(enus_ng3)
df.dfm3 <- textstat_frequency(enus_dfm3, n=20)
df.dfm3
##                feature frequency rank docfreq group
## 1           one of the     34620    1       3   all
## 2             a lot of     30060    2       3   all
## 3       thanks for the     23846    3       3   all
## 4              to be a     18229    4       3   all
## 5          going to be     17447    5       3   all
## 6            i want to     15082    6       3   all
## 7           the end of     14938    7       3   all
## 8           out of the     14814    8       3   all
## 9             it was a     14334    9       3   all
## 10          as well as     13952   10       3   all
## 11         some of the     13679   11       3   all
## 12          be able to     13068   12       3   all
## 13         part of the     12395   13       3   all
## 14            i have a     11872   14       3   all
## 15           i have to     11292   15       3   all
## 16         the rest of     11246   16       3   all
## 17  looking forward to     11232   17       3   all
## 18        i don't know     11124   18       3   all
## 19       thank you for     10297   19       3   all
## 20         is going to     10177   20       3   all
| Operation | Time (min) | Object size (MB) |
|---|---|---|
| Reading documents | 0.61 | 552.0 |
| Creating a corpus | 6.92 | 550.0 |
| Creating tokens | 2.50 | 440.2 |
| Bigrams | 5.95 | 1569.7 |
| Trigrams | 19.51 | 4251.4 |
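A note on measurement: the timings and object sizes above were presumably collected with system.time() and object.size(); a sketch of that approach, shown here for the bigram step:

# Sketch (assumed approach): time an operation and measure its result
t_ng2 <- system.time(
  enus_ng2 <- tokens_ngrams(enus_token, n = 2, concatenator = " ")
)
cat("Elapsed:", round(t_ng2["elapsed"] / 60, 2), "minutes\n")
print(object.size(enus_ng2), units = "Mb")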
Findings:

- The most frequent bigrams and trigrams consist almost entirely of function words ("of", "the", "to", "a", ...). stopwords() provides a list of such words. I decided not to remove these words, since the purpose of the project is to predict the next term, no matter whether it is a function word or not.
- Memory usage grows quickly with n-gram order: the trigrams object (4251.4 MB) is almost ten times the size of the tokens object (440.2 MB) it was built from.

Next steps:

- Use the bigram and trigram frequency tables as the basis for a next-word prediction model, the goal of the project.
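As a preview of that next step, a next-word lookup can be built directly from the trigram frequency table. Below is a minimal sketch: a naive highest-frequency match without smoothing or backoff, where predict_next and df.all3 are illustrative names rather than part of the analysis above.

# Sketch: return the most frequent completion of a two-word prefix.
# Note: textstat_frequency() without n returns ALL trigrams, which is
# memory-hungry on the full data set; a pruned table would be used in practice.
df.all3 <- textstat_frequency(enus_dfm3)
predict_next <- function(w1, w2, trigram_freq) {
  prefix <- paste(w1, w2, "")                       # e.g. "one of "
  hits <- trigram_freq[startsWith(trigram_freq$feature, prefix), ]
  if (nrow(hits) == 0) return(NA_character_)
  best <- hits$feature[which.max(hits$frequency)]   # top-frequency match
  substring(best, nchar(prefix) + 1)                # keep only the third word
}
predict_next("one", "of", df.all3)                  # expected: "the"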