The goal of this project is to demonstrate the ability to work with large, unstructured text data. This report consists of an exploratory analysis presenting the major features of the data in a way that is understandable to a non-data-scientist manager. The motivation is to become familiar with the data before building a word-prediction algorithm based on n-gram frequencies.
library(knitr)    # kable() for formatted tables
library(readtext) # readtext() for importing text files
library(stringr)  # regular-expression helpers
library(quanteda) # tokenization and document-feature matrices
library(ggplot2)  # plotting
setwd("C:/Users/Samir/Desktop/E-Learning/Coursera/Capstone Project/final/en_US") # adjust to the local data folder
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
lines <- lapply(files, readLines) # read each file only once
summary <- data.frame(File = files,
                      Size_in_MB = file.info(files)$size / (1024^2),
                      Max_Nb_Characters_per_row = sapply(lines, function(l) max(nchar(l))),
                      Min_Nb_Characters_per_row = sapply(lines, function(l) min(nchar(l))),
                      Nb_of_rows = sapply(lines, length))
kable(summary)
| File | Size_in_MB | Max_Nb_Characters_per_row | Min_Nb_Characters_per_row | Nb_of_rows |
|---|---|---|---|---|
| en_US.blogs.txt | 200.4242 | 40835 | 1 | 899288 |
| en_US.news.txt | 196.2775 | 5760 | 2 | 77259 |
| en_US.twitter.txt | 159.3641 | 213 | 2 | 2360148 |
The data is imported using the readtext() function (from the package of the same name). Even though all the data here is in plain text files, this function is very convenient because it can load files of different formats and return the same standardized output.
corpus <- readtext("*.txt")$text # read the text files and keep the text as a character vector
corpus <- iconv(corpus, "UTF-8", "ASCII", sub = "") # exclude non-ASCII characters
corpus <- str_replace_all(corpus, " - ", " ") # exclude isolated dashes
corpus <- str_replace_all(corpus, "[^[:alnum:]]['-]", " ") # drop apostrophes and dashes that do not follow an alphanumeric character
corpus <- str_replace_all(corpus, "\\b[^IiAa]\\b", "") # exclude one-character tokens except "I" and "a"
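As a quick sanity check, here is the one-character filter applied to a made-up sentence (hypothetical input; the expected output is shown as a comment):
str_replace_all("I bought a t shirt in aisle b", "\\b[^IiAa]\\b", "") # hypothetical example sentence
## [1] "I bought a  shirt in aisle "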
To compute the frequency of occurrence of each word, the text is split into tokens, and all punctuation, numbers, and symbols are removed. The tokens are then converted to lower case, English stopwords such as “the” are excluded, and the remaining tokens are stemmed. From this step on, I use the quanteda package, allocating four threads to the computation.
quanteda_options(threads = 4) # use four threads for quanteda computations
toks <- tokens(corpus, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks) # stem tokens, e.g. "happiness" -> "happi"
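For intuition, here is the same pipeline applied to a single made-up sentence (illustrative only; the sentence is not from the corpus, and the result is shown as a comment):
toks_demo <- tokens("The 3 dogs can't wait for Christmas!",
                    remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
toks_demo <- tokens_remove(tokens_tolower(toks_demo), stopwords("english"))
tokens_wordstem(toks_demo) # remaining tokens: "dog" "wait" "christma"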
The document-feature matrix (DFM) is a representation that allows algebraic analysis of the data. The rows of this matrix represent the documents of the corpus, the columns represent the tokens (features), and the cells contain the frequency of each token in each document. Since computing the DFM is time-consuming and resource-intensive, especially for higher-order n-grams, I chose to complete all of the processing in the previous steps. Note, however, that most of the processing can also be done directly through arguments of the dfm() function.
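To make the structure concrete, here is a toy two-document example (made-up documents; the resulting counts are shown as comments):
toy <- tokens(c(doc1 = "cats chase mice", doc2 = "mice chase mice"))
dfm(toy) # a 2 x 3 sparse dfm:
# doc1: cats = 1, chase = 1, mice = 1
# doc2: cats = 0, chase = 1, mice = 2
tokens_ngrams(toy, n = 2) # bigram features, e.g. "cats_chase", "chase_mice"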
dtm_ng1 <- dfm(toks, ngrams = 1, verbose = TRUE) # unigram document-feature matrix
## Creating a dfm from a tokens input...
## ... lowercasing
## ... found 3 documents, 437,099 features
## ... created a 3 x 437,099 sparse dfm
## ... complete.
## Elapsed time: 144 seconds.
topfeatures(dtm_ng1, n = 20)
## just can like get one go time love day make
## 256018 251392 247499 246467 231657 216605 201934 191008 187186 159898
## know good thank now see work new think look want
## 157854 157373 150672 147314 135317 131606 130067 128734 126886 126194
textplot_wordcloud(dtm_ng1, max_words = 50, colors = c("red", "pink", "green", "purple", "orange", "blue"))
topFeats_1 <- topfeatures(dtm_ng1, 20)
topDf_1 <- data.frame(features = names(topFeats_1), freq = topFeats_1)
ggplot(data = topDf_1, aes(x = reorder(features, freq), y = freq, fill = features)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(x = "Feature", y = "Frequency", title = "Top 20 Unigrams") +
  coord_flip() +
  guides(fill = FALSE)
dtm_ng2 <- dfm(toks, ngrams = 2, verbose = TRUE) # bigram document-feature matrix
## Creating a dfm from a tokens input...
## ... lowercasing
## ... found 3 documents, 12,246,120 features
## ... created a 3 x 12,246,120 sparse dfm
## ... complete.
## Elapsed time: 432 seconds.
topfeatures(dtm_ng2, n = 20)
## right_now look_like can_wait last_night feel_like
## 22322 17986 16474 15865 15199
## look_forward thank_follow don_know can_get year_old
## 14912 12502 11467 11215 11131
## last_year new_york make_sure happi_birthday let_know
## 9109 9002 8967 8900 8660
## year_ago good_morn first_time let_go just_got
## 8648 8439 8373 8124 7964
textplot_wordcloud(dtm_ng2, max_words = 50, colors = c("red", "pink", "green", "purple", "orange", "blue"))
topFeats_2 <- topfeatures(dtm_ng2, 20)
topDf_2 <- data.frame(features = names(topFeats_2), freq = topFeats_2)
ggplot(data = topDf_2, aes(x = reorder(features, freq), y = freq, fill = features)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(x = "Feature", y = "Frequency", title = "Top 20 Bigrams") +
  coord_flip() +
  guides(fill = FALSE)
dtm_ng3 <- dfm(toks, ngrams = 3, verbose = TRUE) # trigram document-feature matrix
## Creating a dfm from a tokens input...
## ... lowercasing
## ... found 3 documents, 32,961,816 features
## ... created a 3 x 32,961,816 sparse dfm
## ... complete.
## Elapsed time: 1698 seconds.
topfeatures(dtm_ng3, n = 20)
## happi_mother_day can_wait_see let_us_know
## 3472 3148 2637
## happi_new_year look_forward_see new_york_citi
## 2176 1601 1321
## cinco_de_mayo follow_follow_back dream_come_true
## 1128 917 868
## don_even_know love_love_love new_york_time
## 865 858 827
## can_wait_get st_patrick_day make_feel_like
## 813 765 709
## happi_valentin_day new_year_eve ve_ever_seen
## 694 684 668
## let_just_say just_got_back
## 666 662
textplot_wordcloud(dtm_ng3, max_words = 50, colors = c("red", "pink", "green", "purple", "orange", "blue"))
topFeats_3 <- topfeatures(dtm_ng3, 20)
topDf_3 <- data.frame(features = names(topFeats_3), freq = topFeats_3)
ggplot(data = topDf_3, aes(x = reorder(features, freq), y = freq, fill = features)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(x = "Feature", y = "Frequency", title = "Top 20 Trigrams") +
  coord_flip() +
  guides(fill = FALSE)