The aim of this capstone project is to apply data science techniques in the area of Natural Language Processing using R.
The starting dataset for this project is a large corpus of documents, which will be used to build a simple but effective predictive text model. The data can be found here.
This milestone report explains the basic steps taken to load and sample the data, clean it, and organize it into a useful form for the subsequent modeling tasks.
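The code chunks below assume that the required packages have already been loaded and the data downloaded in a setup chunk that is not shown here. A minimal sketch of such a setup step could look like the following, where data_url and the zip file name are only placeholders:
library(dplyr)     # tibble(), sample_frac(), bind_rows()
library(knitr)     # kable() tables
library(tm)        # text cleaning and document-term matrices
library(RWeka)     # NGramTokenizer() for bi-grams and tri-grams
library(ggplot2)   # plotting
library(plotly)    # interactive plots via ggplotly()

data_url <- "..."  # placeholder for the dataset link mentioned above
dir.create("./data", showWarnings = FALSE)
download.file(data_url, destfile = "./data/dataset.zip", mode = "wb")
unzip("./data/dataset.zip", exdir = "./data/")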
After downloading and unzipping the data in the data/ directory, we can check which files we have.
list.files("./data/final/", recursive = T)
## [1] "de_DE/de_DE.blogs.txt" "de_DE/de_DE.news.txt"
## [3] "de_DE/de_DE.twitter.txt" "en_US/en_US.blogs.txt"
## [5] "en_US/en_US.news.txt" "en_US/en_US.twitter.txt"
## [7] "fi_FI/fi_FI.blogs.txt" "fi_FI/fi_FI.news.txt"
## [9] "fi_FI/fi_FI.twitter.txt" "ru_RU/ru_RU.blogs.txt"
## [11] "ru_RU/ru_RU.news.txt" "ru_RU/ru_RU.twitter.txt"
There are three files, blogs.txt, news.txt and twitter.txt, for each of the four provided languages: German (de_DE), English (en_US), Finnish (fi_FI) and Russian (ru_RU). We will use the English (en_US) dataset for this project.
Let’s get some basic statistics about these files; specifically, their size in MB, number of lines and number of words.
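The variables referenced in the next chunk (blogs_size, blogs_len, blogs_words and their news/twitter counterparts) are computed in a chunk that is not shown. One possible way to obtain them, using stringi for word counts (variable names and details here are assumptions), is:
# read the raw English files (sketch of a hidden chunk)
blogs_lines   <- readLines("./data/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news_lines    <- readLines("./data/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter_lines <- readLines("./data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# size on disk in MB, number of lines and (approximate) number of words
blogs_size  <- file.size("./data/final/en_US/en_US.blogs.txt") / 1024^2
blogs_len   <- length(blogs_lines)
blogs_words <- sum(stringi::stri_count_words(blogs_lines))
# the news_* and twitter_* variables are computed in the same way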
corp_stats <- tibble(corpus = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
                     size_MB = c(blogs_size, news_size, twitter_size),
                     num_lines = c(blogs_len, news_len, twitter_len),
                     num_words = c(blogs_words, news_words, twitter_words))

corp_stats %>%
  kable()
| corpus | size_MB | num_lines | num_words |
|---|---|---|---|
| en_US.blogs.txt | 200.4242 | 899288 | 37546239 |
| en_US.news.txt | 196.2775 | 1010242 | 34762395 |
| en_US.twitter.txt | 159.3641 | 2360148 | 30093413 |
From these numbers, the datasets are quite large, so we will select a small sample of each for the sake of simplicity. This might of course harm our analysis, but it should be fine for the purposes of this project.
We will select 1% of the lines in each document and create a cumulative sample by merging these three samples together.
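The sampling code below also relies on blogs_df, news_df and twitter_df, which are assumed to be one-column tibbles wrapping the raw lines (the column name text is an assumption):
# wrap the raw character vectors in tibbles so that dplyr::sample_frac() can be used
blogs_df   <- tibble(text = blogs_lines)
news_df    <- tibble(text = news_lines)
twitter_df <- tibble(text = twitter_lines)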
set.seed(420)

# draw a 1% sample of lines from each source (seed fixed above for reproducibility)
blogs_sample <- sample_frac(blogs_df, size = .01, replace = T)
news_sample <- sample_frac(news_df, size = .01, replace = T)
twitter_sample <- sample_frac(twitter_df, size = .01, replace = T)

# merge the three samples into a single cumulative sample
cum_sample <- bind_rows(blogs_sample, news_sample, twitter_sample)
Now we can compute the same statistics seen above, but for our reduced samples.
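As before, the sample statistics come from a chunk that is not shown; a possible sketch (assuming the text column introduced above, and using object.size() for the in-memory size) is:
# sample statistics: in-memory size, number of lines and word count (sketch)
blogs_sample_size  <- format(object.size(blogs_sample), units = "Mb")
blogs_sample_len   <- nrow(blogs_sample)
blogs_sample_words <- sum(stringi::stri_count_words(blogs_sample$text))
# the news, twitter and cumulative samples are treated in the same way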
sample_stats <- tibble(sample = c("blogs", "news", "twitter", "cumulative"),
                       size = c(blogs_sample_size, news_sample_size, twitter_sample_size, cum_sample_size),
                       num_lines = c(blogs_sample_len, news_sample_len, twitter_sample_len, cum_sample_len),
                       num_words = c(blogs_sample_words, news_sample_words, twitter_sample_words, cum_sample_words))

sample_stats %>%
  kable()
| sample | size | num_lines | num_words |
|---|---|---|---|
| blogs | 2.5 Mb | 8993 | 375790 |
| news | 2.6 Mb | 10102 | 344597 |
| twitter | 3.2 Mb | 23601 | 300564 |
| cumulative | 8.3 Mb | 42696 | 1020951 |
After this sampling, the data should be much easier to analyse and manipulate, so we can proceed to clean and prepare it for the rest of the project.
The data we have so far need to be cleaned a bit: this means removing common stop words (“the”, “of”, “in”, etc.), removing punctuation and extra whitespace, and converting every word to lowercase. We will also remove profanity, based on a list available on this GitHub repository.
This will of course harm the accuracy of our future predictions, but will ensure that the data are coherent and easy to work with.
We are going to use the tm package for this purpose.
First we will create a so-called corpus, which contains our data in a suitable format.
corpus <- VCorpus(VectorSource(c(cum_sample)))
Now we can apply all the desired transformations and cleaning steps.
# collapse repeated whitespace and convert everything to lowercase
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))

# remove English stop words and punctuation (keeping intra-word contractions)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_contractions = T)

# replace typographic characters (en dash, curly quotes, ellipsis) with spaces
tospace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})
corpus <- tm_map(corpus, tospace, "–")
corpus <- tm_map(corpus, tospace, "”")
corpus <- tm_map(corpus, tospace, "“")
corpus <- tm_map(corpus, tospace, "…")

# remove numbers, URLs and Twitter handles
corpus <- tm_map(corpus, removeNumbers)
remove_url <- function(x) gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", " ", x)
corpus <- tm_map(corpus, content_transformer(remove_url))
remove_handles <- function(x) gsub("@[^\\s]+", " ", x)
corpus <- tm_map(corpus, content_transformer(remove_handles))

# remove profanity and strip the extra whitespace introduced by the steps above
profanity <- readLines("./data/profanity.txt")
corpus <- tm_map(corpus, removeWords, profanity)
corpus <- tm_map(corpus, stripWhitespace)
Now we are ready to perform some analysis and gain a few interesting insights from our data.
The clean corpus we generated can be used to analyse word frequencies as well as n-gram frequencies. An n-gram is a contiguous sequence of n items (in this case, words) from a given corpus; we will explore bi-grams (n = 2) and tri-grams (n = 3).
In order to do this, we must first create a document-term matrix, which contains the frequency of each term (or n-gram) in the documents of our cleaned corpus.
We are going to use the RWeka library for this purpose.
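As a quick illustration of what the tokenizer produces, here is a toy call on a made-up sentence that is not part of the corpus:
# toy example: extract bi-grams from a short sentence
NGramTokenizer("thanks for the follow", Weka_control(min = 2, max = 2))
# should yield the bi-grams "thanks for", "for the" and "the follow"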
dtm1 <- DocumentTermMatrix(corpus)
top_1gram <- findMostFreqTerms(dtm1, n = 30)$`1`
top_1gram
## will said just one like can get time new good
## 3077 3039 2953 2894 2618 2493 2307 2200 1948 1826
## now day love know people see first back going also
## 1750 1665 1637 1600 1501 1399 1386 1348 1337 1285
## great make think last year much two really work got
## 1248 1242 1238 1229 1227 1191 1178 1168 1133 1115
df_1gram <- tibble(word = names(top_1gram), freq = top_1gram)
p_1gram <- df_1gram %>%
  ggplot(aes(x = reorder(word, -freq), y = freq, label = word)) +
  geom_col(fill = "steelblue") +
  labs(x = "", y = "Frequency", title = "Distribution of top 30 words") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(p_1gram, tooltip = c("freq", "word"))
bi_gram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm2 <- DocumentTermMatrix(corpus, control = list(tokenize = bi_gram_tokenizer))
top_2gram <- findMostFreqTerms(dtm2, n = 30)$`1`
top_2gram
## right now last year new york high school
## 248 214 198 149
## last night years ago first time can get
## 140 137 133 123
## feel like looking forward last week even though
## 122 116 108 103
## st louis make sure looks like next week
## 100 98 95 94
## good morning happy birthday united states one day
## 93 89 86 85
## can see look like new jersey every day
## 83 81 81 80
## two years just like s day let know
## 79 76 76 72
## go back just got
## 70 70
df_2gram <- tibble(word = names(top_2gram), freq = top_2gram)
p_2gram <- df_2gram %>%
  ggplot(aes(x = reorder(word, -freq), y = freq, label = word)) +
  geom_col(fill = "seagreen") +
  labs(x = "", y = "Frequency", title = "Distribution of top 30 bigrams") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(p_2gram, tooltip = c("freq", "word"))
tri_gram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm3 <- DocumentTermMatrix(corpus, control = list(tokenize = tri_gram_tokenizer))
top_3gram <- findMostFreqTerms(dtm3, n = 30)$`1`
top_3gram
## fake fake fake mother s day
## 33 29
## none repeat scroll repeat scroll yellow
## 25 25
## stylebackground none repeat happy new year
## 25 24
## new york city let us know
## 23 21
## president barack obama two years ago
## 20 18
## valentine s day cake cake cake
## 17 15
## happy mother s cinco de mayo
## 15 14
## happy mothers day last year s
## 14 14
## new year s new york times
## 13 13
## martin luther king come see us
## 12 11
## county sheriff s g protein g
## 11 11
## rock n roll g carbohydrate g
## 11 10
## sheriff s office st louis county
## 10 10
## three years ago wall street journal
## 10 10
## attorney s office fat g saturated
## 9 9
df_3gram <- tibble(word = names(top_3gram), freq = top_3gram)
p_3gram <- df_3gram %>%
  ggplot(aes(x = reorder(word, -freq), y = freq, label = word)) +
  geom_col(fill = "darksalmon") +
  labs(x = "", y = "Frequency", title = "Distribution of top 30 trigrams") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(p_3gram, tooltip = c("freq", "word"))
The provided dataset is composed of three different documents: blog posts, news articles and tweets. Our aim was to understand their structure and peculiarities in order to build a text prediction model, similar to those we use daily on our smartphones.
I could have chosen to use only the blog posts and news text, since tweets usually contain many more mistakes and contractions due to the 140-character restriction; however, this so-called Twitter slang has become part of everyday English, so for the sake of completeness I included the tweets dataset in the analysis as well.
In order to provide fast but reliable predictions, I had to sample the data: 1% of the original dataset seemed a reasonable amount to keep computations fast while still remaining representative of the whole corpus. This should be enough for the purposes of this project.
Stop words were removed, so we may not be able to predict some very common words such as “and”, “or”, “in” and so on. An additional cleaning step would be word stemming, but this could have restricted our already-reduced dataset a bit too much, so I chose to avoid it; a sketch of how it could be added is shown below.
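For reference, stemming could be added as one more transformation on the corpus (requires the SnowballC package); it is deliberately left out of this analysis:
# optional stemming step, not applied in this report:
# corpus <- tm_map(corpus, stemDocument)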
All these choices might impair the accuracy of the prediction model, but hopefully the final application will still be able to perform quite well with the given amount of data.
The n-gram plots showed that some words are far more common than others, and the 2- and 3-gram plots make it evident that some of these words frequently appear together. This will be exploited when creating the prediction algorithm, which will probably take advantage of a back-off model: the model first looks at the most common 3-grams (or perhaps also 4-grams) to predict the next word; if no match is found, it backs off to the 2-gram model and finally to the single-word model.
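As a rough preview, a simple back-off lookup could look like the sketch below. It assumes hypothetical frequency tables tri_grams and bi_grams (tibbles with a prefix column holding the preceding words and a word column holding the candidate next word, sorted by decreasing frequency) and uni_grams (single words sorted by frequency); none of these exist yet in this report.
# sketch of a simple back-off prediction over hypothetical lookup tables
predict_next_word <- function(input, tri_grams, bi_grams, uni_grams) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  n <- length(words)

  # 1. try the tri-gram table, using the last two words as prefix
  if (n >= 2) {
    hit <- tri_grams$word[tri_grams$prefix == paste(words[n - 1], words[n])]
    if (length(hit) > 0) return(hit[1])
  }

  # 2. back off to the bi-gram table, using the last word as prefix
  if (n >= 1) {
    hit <- bi_grams$word[bi_grams$prefix == words[n]]
    if (length(hit) > 0) return(hit[1])
  }

  # 3. finally, fall back to the most frequent single word
  uni_grams$word[1]
}
A real implementation would typically also weight the backed-off estimates (for example with "stupid backoff" or Kneser-Ney smoothing) instead of simply returning the first match.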
The final prediction application will be implemented as a Shiny app.
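As a preview of the structure only, a minimal sketch of such an app, reusing the hypothetical predict_next_word() function and frequency tables from above, might look like this (not the final design):
# minimal Shiny skeleton (sketch only)
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)
    predict_next_word(input$phrase, tri_grams, bi_grams, uni_grams)
  })
}

shinyApp(ui, server)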
All the code for this report as well as for the future development of the app will be available on this GitHub repository.