A first step in developing a text prediction application is to perform text analysis on the available data, in this case three files containing millions of sentences in the English language (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt).
This document presents the results of that analysis, as part of the Coursera Data Science Specialization Capstone project. Given the amount of data to be analyzed, and following the recommendations on the course website, a Linux machine with 16 GB of RAM was used so the analysis could be performed without partitioning the data. For those interested, a link to an article on how to create such an instance on AWS is available in the references.
A table with the elapsed time for each operation is available at the end of the document, together with some findings and the next steps considered to achieve the goal of the project.
Text analysis comprises the following operations (and objects):

- Reading the raw text files (readtext object)
- Creating a corpus (corpus object)
- Tokenizing the corpus (tokens object)
- Building bigrams and trigrams (n-gram tokens objects)
- Computing feature frequencies (dfm object and frequency tables)

Some useful concepts are as follows:

- Corpus: a collection of documents stored together with document-level metadata.
- Token: a unit of text, typically a word, obtained by splitting the documents.
- N-gram: a contiguous sequence of n tokens; bigrams and trigrams have two and three tokens respectively.
- Document-feature matrix (dfm): a matrix recording how often each feature (a token or n-gram) occurs in each document.
I decided to use the R package readtext to read the files, and quanteda to perform the text analysis. The latter contains all the functions required to perform the operations and manipulate the objects listed above.
library(quanteda)
library(readtext)
library(ggplot2)
library(scales)
enus_data <- readtext("engtext/*.txt")
summary(enus_data)
##     doc_id              text
##  Length:3           Length:3
##  Class :character   Class :character
##  Mode  :character   Mode  :character
str(enus_data)
## Classes 'readtext' and 'data.frame': 3 obs. of 2 variables:
## $ doc_id: chr "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
## $ text : chr "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.\nWe love you Mr. B"| __truncated__ "He wasn't home alone, apparently.\nThe St. Louis plant had to close. It would die of old age. Workers had been "| __truncated__ "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.\n"| __truncated__
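As a quick sanity check before building the corpus, the raw files can also be inspected with base R; a minimal sketch counting lines and characters per file (skipNul guards against embedded null characters, which may appear in these files):

# Sketch: line and character counts for each raw file (base R only)
raw_files <- list.files("engtext", pattern = "\\.txt$", full.names = TRUE)
for (f in raw_files) {
  f_lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  cat(basename(f), ":", length(f_lines), "lines,",
      sum(nchar(f_lines)), "characters\n")
}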
enus_corpus <- corpus(enus_data)
summary(enus_corpus)
## Corpus consisting of 3 documents:
##
##                Text  Types   Tokens Sentences
##     en_US.blogs.txt 482432 42840140   2072941
##      en_US.news.txt 431664 39918314   1867522
##   en_US.twitter.txt 566951 36719658   2588551
##
## Source: /home/capstone/* on x86_64 by capstone
## Created: Mon Jun 25 01:11:03 2018
## Notes:
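For machines with less RAM, the same analysis could be run on a random sample of sentences instead of the full corpus; a sketch using quanteda's corpus_reshape() and corpus_sample() (illustrative only; the full corpus was used for all results in this report):

# Sketch: reshape the corpus to sentences and keep a 10% random sample
set.seed(1234)
enus_sent <- corpus_reshape(enus_corpus, to = "sentences")
enus_sample <- corpus_sample(enus_sent, size = round(0.1 * ndoc(enus_sent)))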
enus_token <- tokens(enus_corpus,
                     remove_numbers = TRUE,
                     remove_punct = TRUE,
                     remove_symbols = TRUE,
                     remove_twitter = TRUE,
                     remove_url = TRUE)
str(enus_token)
## List of 3
## $ en_US.blogs.txt : chr [1:36903161] "In" "the" "years" "thereafter" ...
## $ en_US.news.txt : chr [1:33485223] "He" "wasn't" "home" "alone" ...
## $ en_US.twitter.txt: chr [1:29540580] "How" "are" "you" "Btw" ...
## - attr(*, "types")= chr [1:999217] "In" "the" "years" "thereafter" ...
## - attr(*, "padding")= logi FALSE
## - attr(*, "class")= chr "tokens"
## - attr(*, "what")= chr "word"
## - attr(*, "ngrams")= int 1
## - attr(*, "skip")= int 0
## - attr(*, "concatenator")= chr "_"
## - attr(*, "docvars")='data.frame': 3 obs. of 0 variables
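Before moving to n-grams, the same frequency machinery can be applied to single words; a sketch of the top-20 unigrams (lowercased so that "The" and "the" are counted together):

# Sketch: top-20 single-word frequencies
enus_dfm1 <- dfm(tokens_tolower(enus_token))
df.dfm1 <- textstat_frequency(enus_dfm1, n = 20)
df.dfm1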
enus_ng2 <- tokens_ngrams(enus_token, n=2, concatenator=" ")
enus_dfm2 <- dfm(enus_ng2)
df.dfm2 <- textstat_frequency(enus_dfm2, n=20)
df.dfm2
##     feature frequency rank docfreq group
## 1    of the    431090    1       3   all
## 2    in the    413185    2       3   all
## 3    to the    214544    3       3   all
## 4   for the    201614    4       3   all
## 5    on the    197104    5       3   all
## 6     to be    162012    6       3   all
## 7    at the    143381    7       3   all
## 8   and the    126365    8       3   all
## 9      in a    120233    9       3   all
## 10 with the    106282   10       3   all
## 11     is a    101092   11       3   all
## 12   it was     96487   12       3   all
## 13    for a     94219   13       3   all
## 14 from the     87498   14       3   all
## 15   i have     86473   15       3   all
## 16    i was     86081   16       3   all
## 17    it is     82611   17       3   all
## 18    and i     82551   18       3   all
## 19   with a     81952   19       3   all
## 20  will be     81163   20       3   all
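The ggplot2 and scales packages loaded at the start can be used to visualize these counts; a sketch plotting the top-20 bigrams:

# Sketch: horizontal bar chart of the 20 most frequent bigrams
ggplot(df.dfm2, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = comma) +   # comma() comes from the scales package
  labs(x = "Bigram", y = "Frequency")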
Garbage collection was run between the n-gram steps to free memory:

gc()
##             used  (Mb) gc trigger   (Mb)   max used   (Mb)
## Ncells   2993958 159.9   38820696 2073.3   33637410 1796.5
## Vcells  72781052 555.3  801851851 6117.7 1043815786 7963.7
enus_ng3 <- tokens_ngrams(enus_token, n=3, concatenator=" ")
enus_dfm3 <- dfm(enus_ng3)
df.dfm3 <- textstat_frequency(enus_dfm3, n=20)
df.dfm3
##                feature frequency rank docfreq group
## 1           one of the     34620    1       3   all
## 2             a lot of     30060    2       3   all
## 3       thanks for the     23846    3       3   all
## 4              to be a     18229    4       3   all
## 5          going to be     17447    5       3   all
## 6            i want to     15082    6       3   all
## 7           the end of     14938    7       3   all
## 8           out of the     14814    8       3   all
## 9             it was a     14334    9       3   all
## 10          as well as     13952   10       3   all
## 11         some of the     13679   11       3   all
## 12          be able to     13068   12       3   all
## 13         part of the     12395   13       3   all
## 14            i have a     11872   14       3   all
## 15           i have to     11292   15       3   all
## 16         the rest of     11246   16       3   all
## 17  looking forward to     11232   17       3   all
## 18        i don't know     11124   18       3   all
## 19       thank you for     10297   19       3   all
## 20         is going to     10177   20       3   all
| Operation | Time (min) | Object size (MB) |
|---|---|---|
| Reading documents | 0.61 | 552.0 |
| Creating a corpus | 6.92 | 550.0 |
| Creating tokens | 2.50 | 440.2 |
| Bigrams | 5.95 | 1569.7 |
| Trigrams | 19.51 | 4251.4 |
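A note on measurement: the timings and object sizes above were presumably collected with system.time() and object.size(); a sketch of that approach, shown here for the bigram step:

# Sketch (assumed approach): time an operation and measure its result
t_ng2 <- system.time(
  enus_ng2 <- tokens_ngrams(enus_token, n = 2, concatenator = " ")
)
cat("Elapsed:", round(t_ng2["elapsed"] / 60, 2), "minutes\n")
print(object.size(enus_ng2), units = "Mb")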
Findings:

- The most frequent bigrams and trigrams consist almost entirely of function words ("of", "the", "to", "a", ...). stopwords() provides a list of such words. I decided not to remove these words, since the purpose of the project is to predict the next term, no matter whether it is a function word or not.
- Memory usage grows quickly with n-gram order: the trigrams object (4251.4 MB) is almost ten times the size of the tokens object (440.2 MB) it was built from.

Next steps:

- Use the bigram and trigram frequency tables as the basis for a next-word prediction model, the goal of the project.
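As a preview of that next step, a next-word lookup can be built directly from the trigram frequency table. Below is a minimal sketch: a naive highest-frequency match without smoothing or backoff, where predict_next and df.all3 are illustrative names rather than part of the analysis above.

# Sketch: return the most frequent completion of a two-word prefix.
# Note: textstat_frequency() without n returns ALL trigrams, which is
# memory-hungry on the full data set; a pruned table would be used in practice.
df.all3 <- textstat_frequency(enus_dfm3)
predict_next <- function(w1, w2, trigram_freq) {
  prefix <- paste(w1, w2, "")                       # e.g. "one of "
  hits <- trigram_freq[startsWith(trigram_freq$feature, prefix), ]
  if (nrow(hits) == 0) return(NA_character_)
  best <- hits$feature[which.max(hits$frequency)]   # top-frequency match
  substring(best, nchar(prefix) + 1)                # keep only the third word
}
predict_next("one", "of", df.all3)                  # expected: "the"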