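## Load the packages used in this report
library(readtext)
library(quanteda)
library(dplyr)
library(ggplot2)
## Note: from quanteda 3.0 onward, textstat_frequency() and textplot_wordcloud()
## live in the quanteda.textstats and quanteda.textplots packages
## Read text files
dataframe <- readtext("./final/en_us/*.txt",
                      docvarsfrom = "filenames",
                      docvarnames = c("language", "source"),
                      dvsep = "_",
                      encoding = "UTF-8")
## Create corpus
doc.corpus <- corpus(dataframe)
corpusSummary <- summary(doc.corpus)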
This report presents the data to be used during this capstone project. Our goal in producing this report was to familiarize ourselves with this type of data file and to begin developing a strategy for a language prediction model.
Three files were used: “en_US.blogs.txt”, “en_US.news.txt”, and “en_US.twitter.txt”. Our initial look at the data shows a total of 82,631,275 words in all three files, and a total of 4,805,050 sentences.
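As a quick sanity check, the totals quoted above can be recomputed from the per-file counts reported in the corpus summary table of this report. The vector names below are illustrative and not part of the original analysis.

```r
# Per-file counts copied from the corpus summary table in this report
tokens_per_file    <- c(blogs = 42840192, news = 3071381, twitter = 36719702)
sentences_per_file <- c(blogs = 2072941,  news = 143558,  twitter = 2588551)

# Totals across the three files
total_tokens    <- sum(tokens_per_file)
total_sentences <- sum(sentences_per_file)

# Print with thousands separators for comparison with the text
print(format(total_tokens, big.mark = ","))
print(format(total_sentences, big.mark = ","))
```

Both sums should agree with the totals stated in the paragraph above.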
Our first step, after loading the data, was to create a corpus of the three documents. The following summary shows the initial exploratory analysis.
corpusSummary %>%
  select(source, language, Types, Tokens, Sentences) %>%
  knitr::kable(format = 'markdown', align = 'c')
| source | language | Types | Tokens | Sentences |
|---|---|---|---|---|
| US.blogs | en | 482484 | 42840192 | 2072941 |
| US.news | en | 115180 | 3071381 | 143558 |
| US.twitter | en | 566995 | 36719702 | 2588551 |
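For readers unfamiliar with the column names, here is a rough base-R sketch of what "Tokens", "Types", and "Sentences" measure, using a made-up sentence pair. It only approximates the idea: quanteda's tokenizer and sentence splitter are more sophisticated, so counts on real text will differ from this naive version.

```r
# Toy text (hypothetical example, not taken from the corpus)
toy <- "The cat sat on the mat. The dog sat too."

# Naive tokenizer: runs of letters, or single punctuation marks
toks <- regmatches(toy, gregexpr("[[:alpha:]]+|[[:punct:]]", toy))[[1]]

n_tokens <- length(toks)          # every occurrence counts
n_types  <- length(unique(toks))  # distinct tokens only (case-sensitive here)

# Naive sentence count: number of sentence-ending punctuation runs
n_sentences <- lengths(regmatches(toy, gregexpr("[.!?]+", toy)))

print(c(tokens = n_tokens, types = n_types, sentences = n_sentences))
```

A large gap between tokens and types, as in the table above, simply means that many words are repeated often.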
The following preprocessing steps were applied to the data:
## Create tokens
doc.tokens <- tokens(doc.corpus)
## Remove punctuation and numbers
doc.tokens <- tokens(doc.tokens, remove_punct = TRUE, remove_numbers = TRUE)
## Remove stopwords
doc.tokens <- tokens_select(doc.tokens, stopwords('english'), selection='remove')
## Stem the tokens
doc.tokens <- tokens_wordstem(doc.tokens)
## Convert all words to lowercase
doc.tokens <- tokens_tolower(doc.tokens)
## Creating dfm from doc.tokens
doc.dfm.final <- dfm(doc.tokens)
As expected, the number of tokens decreases after these steps. With punctuation, numbers, and stopwords removed, we get a clearer picture of the most common content words in the three files.
summary(doc.tokens)
## Length Class Mode
## en_US.blogs.txt 18897577 -none- character
## en_US.news.txt 1497784 -none- character
## en_US.twitter.txt 16887131 -none- character
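To put a number on the decrease, the token counts from the corpus summary table can be compared with the lengths reported just above. This is a back-of-the-envelope sketch: it assumes both sets of counts are directly comparable, and the variable names are illustrative.

```r
# Token counts before preprocessing (corpus summary table)
before <- c(blogs = 42840192, news = 3071381, twitter = 36719702)
# Token counts after preprocessing (summary(doc.tokens) output)
after  <- c(blogs = 18897577, news = 1497784, twitter = 16887131)

# Share of tokens kept, and percentage removed, per file
retained    <- after / before
removed_pct <- round(100 * (1 - retained), 1)

print(removed_pct)
```

Under that assumption, a little over half of the tokens in each file are dropped by the cleaning steps.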
tokenFreq <- textstat_frequency(doc.dfm.final)
head(tokenFreq, 15)
## feature frequency rank docfreq group
## 1 just 255824 1 3 all
## 2 get 245663 2 3 all
## 3 like 245084 3 3 all
## 4 one 227587 4 3 all
## 5 go 214363 5 3 all
## 6 time 197504 6 3 all
## 7 can 193852 7 3 all
## 8 love 188878 8 3 all
## 9 day 180803 9 3 all
## 10 make 158173 10 3 all
## 11 know 157246 11 3 all
## 12 good 156416 12 3 all
## 13 thank 149318 13 3 all
## 14 now 146839 14 3 all
## 15 see 134909 15 3 all
doc.dfm.final %>%
  textstat_frequency(n = 15) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  #geom_bar(stat="identity") +
  coord_flip() +
  labs(x = NULL, y = "Frequency") +
  theme_minimal()
Another useful way to visualize the most common words is a word cloud.
set.seed(132)
textplot_wordcloud(doc.dfm.final, max_words = 100)
We will design a language prediction model based on n-grams. By observing patterns in the most common 2-gram and 3-gram structures, we will predict the next word in a text.
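The idea can be illustrated with a minimal base-R sketch of a 2-gram (bigram) predictor on a toy sentence. This is not the model that will be built: the text is made up, `predict_next()` is a hypothetical helper, and a real model would need smoothing, back-off to shorter n-grams, and the full corpus. On the actual data, quanteda's `tokens_ngrams()` could generate the n-grams instead of the manual pairing used here.

```r
# Toy text standing in for the corpus (hypothetical example data)
text  <- "i love my dog and i love my cat and i love you"
words <- strsplit(tolower(text), "\\s+")[[1]]

# Build bigrams: each word paired with the word that follows it
bigrams <- data.frame(
  first  = head(words, -1),
  second = tail(words, -1),
  stringsAsFactors = FALSE
)

# predict_next() is a hypothetical helper: it returns the most frequent
# follower of `word`, or NA if `word` never appears as a first element
predict_next <- function(word, bigrams) {
  followers <- bigrams$second[bigrams$first == word]
  if (length(followers) == 0) return(NA_character_)
  counts <- table(followers)
  names(counts)[which.max(counts)]
}

print(predict_next("love", bigrams))
print(predict_next("i", bigrams))
```

In this toy text, "love" is followed by "my" twice and by "you" once, so the sketch suggests "my" after "love". The same counting logic extends to 3-grams by conditioning on the two preceding words.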