The scope of the capstone project is to design an application able to predict the next word(s) based on an incomplete sentence typed by the user. A corpus of three files with text data from Twitter, blogs and news was provided and will be used to create the prediction model. The application will be developed in R using the Shiny package.
The two main tasks of this milestone report are:
1. Perform an exploratory analysis of the corpora provided, to understand the distribution of words and the relationships between them.
2. Briefly summarize plans for creating the prediction algorithm and the Shiny app.
The corpora were collected from publicly available sources by a web crawler. The crawler checks for language, so as to mainly collect text in the desired language.
We are only loading the English texts.
path <- ("C:/Users/Marco/Documents/ds_specialization_github/datasciencecoursera/Capstone Project/swiftkey_data/en_US/")
blogs <- readLines(paste0(path,"en_US.blogs.txt"), encoding = "UTF-8")
news <- readLines(paste0(path,"en_US.news.txt"), encoding = "UTF-8")
twitter <- readLines(paste0(path,"en_US.twitter.txt"), encoding = "UTF-8")
Given the large size of the datasets, we are going to take a sample of each of them. The idea is to have a smaller, representative subset on which we can perform quicker analysis and eventually build statistical models.
We are going to take 2% of the data from each dataset.
# Take a random 2% sample of the lines of a character vector
sample_2_per <- function(x) {
  sample(x, length(x) * 0.02, replace = FALSE)
}
set.seed(15)
blogs_s <- sample_2_per(blogs)
news_s <- sample_2_per(news)
twitter_s <- sample_2_per(twitter)
# We can remove original text files loaded into R to free up memory
library(pryr)
mem_used()
rm(blogs, news, twitter)
mem_used() # we have much more free memory now
First of all, we are going to remove non-ASCII characters (e.g. symbols, emoticons, etc.), which are often present in social network messages.
# Create a function to convert text to ASCII (characters that cannot be
# converted, e.g. emoticons, are dropped)
conv_to_ascii <- function(x) {
  to_ascii <- iconv(x, "latin1", "ASCII", sub = "")
  return(to_ascii)
}
blogs_s <- conv_to_ascii(blogs_s)
news_s <- conv_to_ascii(news_s)
twitter_s <- conv_to_ascii(twitter_s)
all_samples <- c(blogs_s, news_s, twitter_s)
After several attempts using tm and RWeka, I have decided, mainly due to performance issues, to explore the **quanteda** library.
I am also going to make use of the recently released tidytext library to generate some visualizations.
library("quanteda")
library("tidytext")
my_corpus <- corpus(all_samples)
summary(my_corpus, 3)
## Corpus consisting of 66732 documents, showing 3 documents:
##
## Text Types Tokens Sentences
## text1 50 65 2
## text2 16 18 1
## text3 72 124 6
##
## Source: C:/Users/Marco/Dropbox/Coursera/Data Science Specialization - JHU/Capstone Project/week2/* on x86-64 by Marco
## Created: Sun Oct 29 19:02:20 2017
## Notes:
# or to extract for example just the 2nd text of the corpus:
# texts(my_corpus)[2]
For example, the 3rd text is composed of 6 sentences and a total of 124 tokens.
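To double-check these counts programmatically, quanteda's ntoken() and nsentence() helpers can be used (a quick sketch):
ntoken(my_corpus)[3]    # 124 tokens
nsentence(my_corpus)[3] # 6 sentences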
I am creating a sparse document-feature matrix and applying the following cleaning operations: lowercasing, stemming, and removal of punctuation, numbers, English stopwords and profanity.
# Read the profanity list as a character vector (read.table alone returns a data frame)
profanity <- unlist(read.table("swearWords.csv", sep = ",", stringsAsFactors = FALSE),
                    use.names = FALSE)
my_corpus_cleaned <- dfm(all_samples, tolower = TRUE,
                         stem = TRUE, remove_punct = TRUE, remove_numbers = TRUE,
                         remove = c(stopwords("english"), profanity),
                         dictionary = NULL, verbose = TRUE)
top_words <- topfeatures(my_corpus_cleaned, 50)
Plotting the text dataset as a wordcloud. Word sizes are proportional to their frequencies.
textplot_wordcloud(my_corpus_cleaned, min.freq = 200, random.order = FALSE,
rot.per = .25, colors = RColorBrewer::brewer.pal(8,"Dark2"))
Please note I am doing this analysis using the combined sample created previously, which includes only a 2% subset of the original data.
From the cleaned document-feature matrix generated above I know there is a total of 50,713 features (unique terms) inside.
library(dplyr)
# Build a frequency table of all features with the cumulative share of occurrences
all_features <- topfeatures(my_corpus_cleaned, 50713)
all_features <- as.data.frame(all_features) %>% rename(freq = all_features)
all_features <- all_features %>%
  mutate(word = row.names(all_features), total = sum(freq),
         per = freq / total, cum = cumsum(per)) %>%
  select(word, freq, cum)
cum50 <- which(all_features$cum >= 0.5)[1]
cum50 / 50713
## [1] 0.01031294
We see that just 523 words (about 1% of the unique words present in the corpus) account for 50% of all word occurrences in the document.
cum90 <- which(all_features$cum >= 0.9)[1]
cum90 / 50713
## [1] 0.1483446
On the other hand, we need about 7,523 words (roughly 15% of the unique words) to account for 90% of the document. This is a very important point to consider when building the prediction algorithm: it might not be necessary to keep the entire corpus to provide accurate predictions for next words.
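As a minimal sketch of that pruning idea (the 90% cutoff is just an assumption that would need tuning against prediction accuracy), we could keep only the words accounting for 90% of occurrences:
# Keep only the words covering 90% of all occurrences (assumed cutoff)
vocab90 <- all_features$word[1:cum90]
length(vocab90) / nrow(all_features) # fraction of unique words retained (~15%)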
Additional analysis: it would be interesting to break this analysis down by source (blogs vs news vs Twitter), to see whether, for example, the percentage of unique words needed is even lower for Twitter than for blogs or news (hypothesis: social networks have a more restricted vocabulary, while news might present a much richer one).
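A rough sketch of how that comparison could be computed, reusing the coverage logic from above (the helper name coverage_share is mine, not a quanteda function):
# Share of unique words needed to cover a given fraction of a text sample
coverage_share <- function(txt, target = 0.5) {
  d <- dfm(txt, tolower = TRUE, remove_punct = TRUE,
           remove = stopwords("english"))
  f <- sort(colSums(d), decreasing = TRUE) # word frequencies, descending
  which(cumsum(f) / sum(f) >= target)[1] / length(f)
}
# e.g. compare coverage_share(twitter_s) vs coverage_share(news_s)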
To understand relationships between words I am creating several n-grams. An n-gram is a contiguous sequence of n words found inside a wider sequence of text. I'll be creating bigrams (2-word sequences), trigrams (3 words) and fourgrams (4 words).
In quanteda I can still use the dfm() function, passing the ngrams argument to specify the number of words in the sequence.
# Create a function since the only parameter I might have to change is the number of grams
create_gram <- function (x, n) {
gram <- dfm(x, ngrams=n, concatenator=" ", tolower = T,
stem = T, remove_punct=T, remove_numbers=T,
remove = stopwords("english"), dictionary = NULL, verbose=TRUE)
return(gram)
}
bigrams <- create_gram(my_corpus, 2)
trigrams <- create_gram(my_corpus, 3)
fourgrams <- create_gram(my_corpus, 4)
library(knitr) # needed for kable()
top10_bigrams <- topfeatures(bigrams, 10)
kable(as.data.frame(top10_bigrams))
| bigram | freq |
|---|---|
| of the | 5130 |
| in the | 4959 |
| to the | 2854 |
| for the | 2700 |
| on the | 2526 |
| to be | 2438 |
| at the | 1809 |
| i have | 1656 |
| want to | 1580 |
| go to | 1568 |
textplot_wordcloud(bigrams, min.freq = 200, random.order = FALSE,
rot.per = .25, colors = RColorBrewer::brewer.pal(8,"Dark2"))
We can also visualize the network of bigrams, to understand all the relationships between words simultaneously. To make the visualization we will use the ggraph package, as suggested in the book Text Mining with R.
First I need to get the data into a tidy format (a data frame where each word is a column):
library(tidyr)
top100_bigrams <- topfeatures(bigrams, 100)
test <- as.data.frame(top100_bigrams)
bigrams_tidy <- data.frame(row.names(test), test$top100_bigrams)
names(bigrams_tidy) <- c("words", "freq")
bigrams_tidy <- separate(bigrams_tidy, words, into = c("word1", "word2"), sep = " ")
kable(head(bigrams_tidy))
| word1 | word2 | freq |
|---|---|---|
| of | the | 5130 |
| in | the | 4959 |
| to | the | 2854 |
| for | the | 2700 |
| on | the | 2526 |
| to | be | 2438 |
Now we can build the network viz (I am plotting just the top 100 bigrams).
library(igraph)
library(ggraph)
bigram_graph <- bigrams_tidy %>%
graph_from_data_frame()
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(show.legend = FALSE,arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
As we might expect, the network shows that words like “I”, “you”, “to” and “the” represent key nodes around which sentences are built.
top10_trigrams <- topfeatures(trigrams, 10)
kable(as.data.frame(top10_trigrams))
| trigram | freq |
|---|---|
| thank for the | 510 |
| a lot of | 395 |
| i want to | 395 |
| one of the | 389 |
| look forward to | 322 |
| go to be | 294 |
| to be a | 275 |
| be abl to | 227 |
| it was a | 227 |
| i have to | 223 |
textplot_wordcloud(trigrams, min.freq = 50, random.order = FALSE,
rot.per = .25, colors = RColorBrewer::brewer.pal(8,"Dark2"))
top10_fourgrams <- topfeatures(fourgrams, 10)
kable(as.data.frame(top10_fourgrams))
| fourgram | freq |
|---|---|
| thank for the follow | 119 |
| the rest of the | 100 |
| the end of the | 91 |
| is go to be | 81 |
| thank for the rt | 80 |
| for the first time | 79 |
| at the end of | 78 |
| when it come to | 69 |
| in the middl of | 67 |
| if you want to | 64 |
textplot_wordcloud(fourgrams, min.freq = 30, random.order = FALSE,
rot.per = .25, colors = RColorBrewer::brewer.pal(8,"Dark2"))
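Looking ahead to the prediction algorithm, these n-gram frequency tables already suggest a simple approach: look the typed prefix up in the highest-order table and back off to shorter prefixes when nothing matches. The sketch below is illustrative only; predict_next and the use of raw counts (no smoothing or backoff weights) are my assumptions, not the final design, and since the tables are stemmed and stopword-free the input phrase would need the same preprocessing.
# Illustrative next-word lookup with naive backoff over the n-gram tables.
# The *_freq arguments are named count vectors as returned by topfeatures().
predict_next <- function(phrase, bigram_freq, trigram_freq, fourgram_freq) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  tbls <- list(bigram_freq, trigram_freq, fourgram_freq) # indexed by prefix length
  for (n in 3:1) { # try the longest prefix first, then back off
    if (length(words) < n) next
    prefix <- paste(tail(words, n), collapse = " ")
    hits <- tbls[[n]][startsWith(names(tbls[[n]]), paste0(prefix, " "))]
    if (length(hits) > 0) { # return the last word of the most frequent match
      best <- names(hits)[which.max(hits)]
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  NA_character_ # no n-gram matched the typed prefix
}
# e.g. predict_next("thank for the", topfeatures(bigrams, 1000),
#                   topfeatures(trigrams, 1000), topfeatures(fourgrams, 1000))
# should return "follow", matching the top fourgram above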