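## Load packages
library(readtext)
library(quanteda)
library(dplyr)     # %>% and select()
library(ggplot2)
## With quanteda >= 3, also load quanteda.textstats (textstat_frequency)
## and quanteda.textplots (textplot_wordcloud)

## Read text files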
dataframe <- readtext("./final/en_us/*.txt",
                      docvarsfrom = "filenames",
                      docvarnames = c("language", "source"),
                      dvsep = "_",
                      encoding = "UTF-8")

## Create corpus
doc.corpus <- corpus(dataframe)
corpusSummary <- summary(doc.corpus)

Summary

This report presents the data to be used in this capstone project. Our goal in producing it was to familiarize ourselves with this type of data file before developing a strategy for a language prediction model.

Three files were used: “en_US.blogs.txt”, “en_US.news.txt”, and “en_US.twitter.txt”. An initial look at the data shows a total of 82,631,275 tokens (words, numbers, and punctuation marks) and 4,805,050 sentences across the three files.

Exploratory Analysis

Our first step, after loading the data, was to create a corpus of the three documents. The following summary shows the initial exploratory analysis.

corpusSummary %>% select(source, language, Types, Tokens, Sentences) %>%
        knitr::kable(format='markdown', align='c')
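|   source   | language | Types  |  Tokens  | Sentences |
|:----------:|:--------:|:------:|:--------:|:---------:|
|  US.blogs  |    en    | 482484 | 42840192 |  2072941  |
|  US.news   |    en    | 115180 | 3071381  |  143558   |
| US.twitter |    en    | 566995 | 36719702 |  2588551  |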

The following edits to the data were applied:

  1. Tokenize the data.
## Create tokens
doc.tokens <- tokens(doc.corpus)
  2. Remove punctuation and numbers.
## Remove punctuation and numbers
doc.tokens <- tokens(doc.tokens, remove_punct = TRUE, remove_numbers = TRUE)
  3. Remove stop words.
## Remove stop words (matching is case-insensitive by default)
doc.tokens <- tokens_select(doc.tokens, stopwords('english'), selection='remove')
  4. Stem the words.
## Stem the tokens
doc.tokens <- tokens_wordstem(doc.tokens)
  5. Convert all words to lowercase.
## Convert all words to lowercase
doc.tokens <- tokens_tolower(doc.tokens)
  6. Create a document-feature matrix (dfm).
## Create a dfm from doc.tokens
doc.dfm.final <- dfm(doc.tokens)
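
To see what these steps do on a small scale, the same sequence can be run on a single sentence. This is only a sketch: the sentence and the `toy.*` object names are invented for illustration and are not part of the project data, and it assumes the quanteda package is installed.

```r
library(quanteda)

# Hypothetical one-sentence input, not taken from the project data
toy <- "The 3 cats are running quickly, and the dogs were barking!"

# Step 1: split the sentence into tokens
toy.raw <- tokens(toy)

# Step 2: drop punctuation and numbers
toy.clean <- tokens(toy.raw, remove_punct = TRUE, remove_numbers = TRUE)

# Step 3: drop English stop words
toy.clean <- tokens_select(toy.clean, stopwords("english"), selection = "remove")

# Step 4: reduce words to their stems
toy.clean <- tokens_wordstem(toy.clean)

# Step 5: convert to lowercase
toy.clean <- tokens_tolower(toy.clean)

ntoken(toy.raw)          # token count before cleaning
ntoken(toy.clean)        # token count after cleaning
as.character(toy.clean)  # the remaining tokens
```

Comparing the two `ntoken()` results shows, in miniature, the kind of reduction reported for the full files.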

As expected, the number of tokens decreases after these steps: the three files drop from 82,631,275 tokens in total to 37,282,492. With punctuation, numbers, and stop words removed, the most common content words in the three files are easier to see.

summary(doc.tokens)
##                   Length   Class  Mode     
## en_US.blogs.txt   18897577 -none- character
## en_US.news.txt     1497784 -none- character
## en_US.twitter.txt 16887131 -none- character
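
The size of the reduction can be checked with a few lines of base R. The counts below are copied by hand from the two summaries in this report; the variable names are made up for this sketch.

```r
# Token counts copied from the corpus summary (before cleaning)
before <- c(blogs = 42840192, news = 3071381, twitter = 36719702)

# Token counts copied from summary(doc.tokens) (after cleaning)
after <- c(blogs = 18897577, news = 1497784, twitter = 16887131)

# Percentage of tokens retained, per file and overall
retained <- round(100 * after / before, 1)
overall <- round(100 * sum(after) / sum(before), 1)

print(retained)
print(overall)
```

Overall, a little under half of the original tokens remain after cleaning.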

Most Used Words

tokenFreq <- textstat_frequency(doc.dfm.final)
head(tokenFreq, 15)
##    feature frequency rank docfreq group
## 1     just    255824    1       3   all
## 2      get    245663    2       3   all
## 3     like    245084    3       3   all
## 4      one    227587    4       3   all
## 5       go    214363    5       3   all
## 6     time    197504    6       3   all
## 7      can    193852    7       3   all
## 8     love    188878    8       3   all
## 9      day    180803    9       3   all
## 10    make    158173   10       3   all
## 11    know    157246   11       3   all
## 12    good    156416   12       3   all
## 13   thank    149318   13       3   all
## 14     now    146839   14       3   all
## 15     see    134909   15       3   all
doc.dfm.final %>% 
        textstat_frequency(n = 15) %>% 
        ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
        geom_point() +
        #geom_bar(stat="identity") +
        coord_flip() +
        labs(x = NULL, y = "Frequency") +
        theme_minimal()

A word cloud is another useful way to visualize the most common words.

set.seed(132)
textplot_wordcloud(doc.dfm.final, max_words = 100)

Upcoming Plan

We will design a next-word prediction model based on n-grams. By observing patterns in the most common 2-gram and 3-gram structures, we will predict the next word in a text.
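
The idea of predicting the next word from 2-gram counts can be sketched in base R. This is a toy illustration, not the model planned for the project: the sentences are invented, and `predict_next` is a hypothetical helper that simply returns the most frequent follower of a word.

```r
# Toy sentences invented for illustration; not drawn from the corpus
sentences <- c("i love this day", "i love this time", "i love that song")

# Split each sentence into lowercase words
words <- strsplit(tolower(sentences), " ", fixed = TRUE)

# Build all 2-grams (previous word, next word) within each sentence
bigrams <- do.call(rbind, lapply(words, function(w) {
  data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))

# Hypothetical helper: most frequent word observed after `word`,
# or NA if `word` never appears as the first half of a 2-gram
predict_next <- function(word, bigrams) {
  followers <- bigrams$nxt[bigrams$prev == word]
  if (length(followers) == 0) return(NA_character_)
  names(which.max(table(followers)))
}

predict_next("love", bigrams)  # "this" follows "love" twice, "that" once
```

A 3-gram version would condition on the two preceding words instead of one, falling back to the 2-gram counts when the pair has not been seen.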