knitr::opts_chunk$set(echo = TRUE)
setwd("C:/Users/USUARIO/Desktop/")
require(viridisLite)
require(pacman)
pacman::p_load(knitr, tidyverse, plotly, stringr, tm, stringi, RWeka, ggplot2, kableExtra, wordcloud, SnowballC, htmlTable, RColorBrewer, viridis, cluster, factoextra, tidyr, quanteda, quanteda.textplots)
Text mining is a branch of data science focused on analyzing the information contained in texts. Its objective is to describe documents accurately and to uncover information that is not explicit by searching for correlations and patterns between words.
Text analysis is interesting because it lets us study the way in which a document is written, whether it is a book, a speech, a scientific article, or another kind of text. Through it we can, for example, carry out sentiment analysis or find correlations between speeches.
There are several packages for quantitative text analysis; in this case we will use the `quanteda` package, which was developed for analyzing data in text form. In this work we apply it to perform a content analysis.
There are three essential components of a text analysis (a minimal example is sketched below):

- The corpus: an object within R that we create by loading our text data.
- The document-feature matrix (dfm): the analytical unit on which we perform the analysis.
- Tokens: each individual word in a text.
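As a minimal sketch of these three objects, built with `quanteda` on a couple of toy sentences (not the capstone data; the object names here are illustrative only):

library(quanteda)
# Two toy documents, used only to illustrate corpus -> tokens -> dfm
txt <- c(doc1 = "The cat sat on the mat.",
         doc2 = "The dog sat on the log.")
toy_corpus <- corpus(txt)                              # corpus: the raw texts plus document metadata
toy_tokens <- tokens(toy_corpus, remove_punct = TRUE)  # tokens: the individual words
toy_dfm    <- dfm(toy_tokens)                          # dfm: documents x features count matrix
topfeatures(toy_dfm)                                   # most frequent features in the toy corpus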
A working directory will then be created and the data will be downloaded from the URL given by the platform.
if(!file.exists("./work")){
dir.create("./work")
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, destfile="./work/Coursera-SwiftKey.zip", mode = "wb")
unzip(zipfile="./work/Coursera-SwiftKey.zip", exdir="./work")
}
setwd("C:/Users/USUARIO/Desktop/work/final/en_US/")
We then load the three English documents.
path <- "C:/Users/USUARIO/Desktop/work/final/en_US"
news <- readLines(paste(path, "/en_US.news.txt", sep = ""))
blogs <- readLines(paste(path, "/en_US.blogs.txt", sep = ""))
twitter <- readLines(paste(path, "/en_US.twitter.txt", sep = ""))
A summary of the three texts is then created; it includes the size of each file, the number of lines, and the number of words, among other data of interest.
dates <- function(files, lines) {
  size <- file.info(files)$size / 1024^2                 # file size in MB
  count <- sum(sapply(strsplit(lines, "\\s+"), length))  # total number of words
  return(c(files, format(round(size, 2), nsmall = 2), length(lines), count))
}
#Calling function
news_stat<- dates("en_US.news.txt", news)
blogs_stat <- dates("en_US.blogs.txt", blogs)
Twit_stat<- dates("en_US.twitter.txt", twitter)
summary <- c(news_stat, blogs_stat,Twit_stat)
table <- summary %>% unlist() %>% matrix(nrow = 3, byrow = T) %>% as.data.frame()
colnames(table) <- c("Files", "Size_MB" , "Number_of_lines" , "Number_of_words" )
total <- data.frame("Files" = "Total", "Size_MB" = as.character(556.06), "Number_of_lines" = as.numeric(3336695), "Number_of_words" = as.numeric(70352205))
table <- rbind(table, total)
table <- table %>% htmlTable(tfoot = "Statistical summary",col.rgroup = c("steelblue", "grey"),encoding="UTF-8")
In this section we are going to build the corpus with the given texts.
After the individual analysis, some helper functions will be written so the process can be applied in a more general way. This is easier and more reusable; however, the exact steps depend on the specific needs of each project.
# Function for building the corpus (B for "Building")
B <- function(file) {
  corpus <- paste(file, collapse = " ")  # collapse all lines into a single string
  corpus <- VectorSource(corpus)         # wrap it as a tm source
  corpus <- Corpus(corpus)               # build the tm corpus
  return(corpus)
}
Cleaning the corpus is very important: texts usually contain characters, such as accents, that get in the way of a proper analysis, and we also need to convert the text to lowercase, remove punctuation marks, and remove stop words, among other steps.
# Function for cleaning the corpus (C for "Cleaning")
C <- function(cleancorpus) {
  # Replace special characters with a space (a character class keeps the regex valid)
  toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
  cleancorpus <- tm_map(cleancorpus, toSpace, "[/@:*&!?_#'´’-]")
  cleancorpus <- tm_map(cleancorpus, removeNumbers)
  cleancorpus <- tm_map(cleancorpus, content_transformer(tolower))
  cleancorpus <- tm_map(cleancorpus, removePunctuation)
  cleancorpus <- tm_map(cleancorpus, removeWords, stopwords("english"))
  cleancorpus <- tm_map(cleancorpus, stripWhitespace)
  return(cleancorpus)
}
One of the things we will do with the three texts is extract the most common words in each document, which gives us an idea of the composition of each text.
These words are shown in a `plotly` graph displaying the frequencies of the 30 most common words.
Next, a word cloud shows the 100 most common words in each text, which is useful for seeing at a glance which words are used most. Finally, a hierarchical clustering analysis is performed on the word frequencies of each text, showing how words relate to one another and which clusters are formed.
m_c_w <- function(cleancorpus) {
  sparse <- DocumentTermMatrix(cleancorpus)
  matrix <- as.matrix(sparse)                       # convert the document-term matrix into a regular matrix
  fw <- colSums(matrix)                             # frequency of each word
  fw <- as.data.frame(sort(fw, decreasing = TRUE))
  fw$word <- rownames(fw)
  colnames(fw) <- c("Frequency", "word")
  return(fw)
}
sam1 <-sample(news, round(0.25*length(news)), replace = F)
corpus <- B(sam1)
corpus_news <- C(corpus)
news_m_c_w <- m_c_w(corpus_news)
news_m_c_w1<- news_m_c_w[1:30,]
news_m_c_w2 <- news_m_c_w[1:50,]
p <- ggplot(news_m_c_w1, aes(x = reorder(word, Frequency), y = Frequency, fill = factor(reorder(word, -Frequency)))) +
  geom_col() +
  xlab("Word") + ylab("Frequency of words") +
  labs(title = "Most common words : US_News") +
  theme_minimal() +
  theme(legend.title = element_blank(), legend.position = "bottom",
        axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_viridis(discrete = TRUE)
fig <- ggplotly(p)
fig
wordcloud(news_m_c_w$word[1:100], news_m_c_w$Frequency[1:100],
colors=brewer.pal(8, "Set1"))
# Cluster on the numeric frequencies only; the words are kept as row names
hc_euclidea <- hclust(d = dist(x = news_m_c_w2["Frequency"], method = "euclidean"),
                      method = "complete")
fviz_dend(x = hc_euclidea, k = 6, cex = 0.7) +
  geom_hline(yintercept = 5.5, linetype = "dashed") +
  labs(title = "Hierarchical clustering",
       subtitle = "US_News analysis, K=6")
sam2 <-sample(blogs, round(0.25*length(blogs)), replace = F)
corpus_blogs <- B(sam2)
corpus_blogs <- C(corpus_blogs)
blogs_m_c_w <- m_c_w(corpus_blogs)
blogs_m_c_w1<- blogs_m_c_w[1:30,]
blogs_m_c_w2<- blogs_m_c_w[1:50,]
p <- ggplot(blogs_m_c_w1, aes(x = reorder(word, Frequency), y = Frequency, fill = factor(reorder(word, -Frequency)))) +
  geom_col() +
  xlab("Word") + ylab("Frequency of words") +
  labs(title = "Most common words : US_Blogs") +
  theme_minimal() +
  theme(legend.title = element_blank(), legend.position = "bottom",
        axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_viridis(discrete = TRUE)
fig <- ggplotly(p)
fig
wordcloud(blogs_m_c_w$word[1:100], blogs_m_c_w$Frequency[1:100],
colors=brewer.pal(8, "Dark2"))
hc_euclidea <- hclust(d = dist(x = blogs_m_c_w2["Frequency"], method = "euclidean"),
                      method = "complete")
fviz_dend(x = hc_euclidea, k = 6, cex = 0.7) +
  geom_hline(yintercept = 5.5, linetype = "dashed") +
  labs(title = "Hierarchical clustering",
       subtitle = "US_Blogs analysis, K=6")
sam3 <-sample(twitter, round(0.25*length(twitter)), replace = F)
corpus_twitter <- B(sam3)
corpus_twitter <- C(corpus_twitter)
twitter_m_c_w <- m_c_w(corpus_twitter)
twitter_m_c_w1<- twitter_m_c_w[1:30,]
twitter_m_c_w2<- twitter_m_c_w[1:50,]
p <- ggplot(twitter_m_c_w1, aes(x = reorder(word, Frequency), y = Frequency, fill = factor(reorder(word, -Frequency)))) +
  geom_col() +
  xlab("Word") + ylab("Frequency of words") +
  labs(title = "Most common words : US_Twitter") +
  theme_minimal() +
  theme(legend.title = element_blank(), legend.position = "bottom",
        axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_viridis(discrete = TRUE)
fig <- ggplotly(p)
fig
wordcloud(twitter_m_c_w$word[1:100], twitter_m_c_w$Frequency[1:100],
colors=brewer.pal(8, "Set2"))
hc_euclidea <- hclust(d = dist(x = twitter_m_c_w2["Frequency"], method = "euclidean"),
                      method = "complete")
fviz_dend(x = hc_euclidea, k = 6, cex = 0.7) +
  geom_hline(yintercept = 5.5, linetype = "dashed") +
  labs(title = "Hierarchical clustering",
       subtitle = "US_Twitter analysis, K=6")
The next stage is to work on text classification and on how we will develop our prediction system. For this, we will implement a probabilistic language model that allows us to build a word prediction system.
For our model we will use the Markov assumption, under which the probability of a word depends only on the word (or words) immediately preceding it. We will develop our language model based on n-grams using the `quanteda` package; an exploratory analysis of the bigrams present in the news document is presented below.
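Stated formally for the bigram case (the standard formulation, shown here for reference), the assumption and its maximum-likelihood estimate are:

$$
P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1}), \qquad
\hat{P}(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})}
$$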
Finally, we show a word cloud and a bar chart that let us observe how the bigrams behave in the news text. From this, a general model involving all the texts should be built; a minimal sketch of such a predictor is given after the bigram analysis.
sam1 <-sample(news, round(0.25*length(news)), replace = F)
newsCorpus <- corpus(sam1)
News_tokens <- newsCorpus %>%
  tokens(what = "word", remove_numbers = TRUE, remove_punct = TRUE,
         remove_separators = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(c("â", "s")) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("english")) %>%
  tokens_wordstem(language = "english") %>%
  tokens_ngrams(n = 2) %>%
  dfm()
textplot_wordcloud(News_tokens, min_count = 6, random_order = FALSE, rotation = 0.25,
                   color = RColorBrewer::brewer.pal(8, "Dark2"))
topFeatures <- topfeatures(News_tokens, 30)
bigrams_news <- data.frame(word=names(topFeatures), count=topFeatures)
p <- ggplot(bigrams_news, aes(x = reorder(word, count), y = count, fill = factor(reorder(word, -count)))) +
  geom_col() +
  xlab("Bigrams") + ylab("Frequency of bigrams") +
  labs(title = "Most common Bigrams : US_News") +
  theme_minimal() +
  theme(legend.title = element_blank(), legend.position = "bottom",
        axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_viridis(discrete = TRUE) +
  coord_flip()
fig <- ggplotly(p)
fig
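To illustrate how these bigram counts could feed the predictor described above, the following sketch builds a simple lookup table from the `News_tokens` object and returns the most frequent continuations of a given word. This is only a rough illustration under the Markov assumption: the helper `predict_next` is a hypothetical name, it uses the stemmed, stop-word-free news sample only, and it is not the final model.

# Minimal bigram predictor sketch (illustrative only)
bigram_counts <- colSums(News_tokens)                       # total count of each bigram feature
parts <- strsplit(names(bigram_counts), "_", fixed = TRUE)  # tokens_ngrams() joins words with "_"
bigram_table <- data.frame(first  = sapply(parts, `[`, 1),
                           second = sapply(parts, `[`, 2),
                           count  = as.numeric(bigram_counts),
                           stringsAsFactors = FALSE)

# Hypothetical helper: the n most frequent continuations of `word`,
# proportional to the maximum-likelihood estimate of P(next word | word)
predict_next <- function(word, n = 3) {
  cand <- bigram_table[bigram_table$first == tolower(word), ]
  cand <- cand[order(cand$count, decreasing = TRUE), ]
  head(cand$second, n)
}

predict_next("new")   # e.g. the most frequent continuations of "new" in the news sample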