knitr::opts_chunk$set(echo = TRUE)
setwd("C:/Users/USUARIO/Desktop/")
require(viridisLite)
require(pacman)
pacman::p_load(knitr, tidyverse, plotly, stringr, tm, stringi, RWeka, ggplot2, kableExtra, wordcloud, SnowballC, htmlTable, RColorBrewer, viridis, cluster, factoextra, tidyr, quanteda, quanteda.textplots)
Text mining is a branch of data science focused on analyzing the information contained in texts. Its objective is to describe documents accurately and to uncover information that is not explicit by searching for correlations and patterns between words.
Text analysis is interesting because it lets us study the way in which a document is written, whether it is a book, a speech, a scientific article, or another kind of text. Through it we can, for example, carry out sentiment analysis or find correlations between speeches.
There are several packages for quantitative text analysis; in this case we will use the `quanteda` package, which was developed for analyzing data in text form. In this work we apply it to perform a content analysis.
There are three essential components of a text analysis (a minimal example is sketched below):

- The corpus: an object within R that we create by loading our text data.
- The document-feature matrix (dfm): the analytical unit on which we perform the analysis.
- Tokens: each individual word in a text.
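As a minimal sketch of these three objects, built with `quanteda` on a couple of toy sentences (not the capstone data; the object names here are illustrative only):

library(quanteda)
# Two toy documents, used only to illustrate corpus -> tokens -> dfm
txt <- c(doc1 = "The cat sat on the mat.",
         doc2 = "The dog sat on the log.")
toy_corpus <- corpus(txt)                              # corpus: the raw texts plus document metadata
toy_tokens <- tokens(toy_corpus, remove_punct = TRUE)  # tokens: the individual words
toy_dfm    <- dfm(toy_tokens)                          # dfm: documents x features count matrix
topfeatures(toy_dfm)                                   # most frequent features in the toy corpus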
A working directory will then be created and the data will be downloaded from the URL given by the platform.
if(!file.exists("./work")){
dir.create("./work")
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, destfile="./work/Coursera-SwiftKey.zip", mode = "wb")
unzip(zipfile="./work/Coursera-SwiftKey.zip", exdir="./work")
}
setwd("C:/Users/USUARIO/Desktop/work/final/en_US/")
We then load the three English documents.
path <- "C:/Users/USUARIO/Desktop/work/final/en_US"
news <- readLines(paste(path, "/en_US.news.txt", sep = ""))
blogs <- readLines(paste(path, "/en_US.blogs.txt", sep = ""))
twitter <- readLines(paste(path, "/en_US.twitter.txt", sep = ""))
A summary of the three texts is then created; it includes the size of each file, the number of lines, and the number of words, among other data of interest.
dates <- function(files, lines) {
  size <- file.info(files)$size / 1024^2                 # file size in MB
  count <- sum(sapply(strsplit(lines, "\\s+"), length))  # total number of words
  return(c(files, format(round(size, 2), nsmall = 2), length(lines), count))
}
#Calling function
news_stat<- dates("en_US.news.txt", news)
blogs_stat <- dates("en_US.blogs.txt", blogs)
Twit_stat<- dates("en_US.twitter.txt", twitter)
summary <- c(news_stat, blogs_stat,Twit_stat)
table <- summary %>% unlist() %>% matrix(nrow = 3, byrow = T) %>% as.data.frame()
colnames(table) <- c("Files", "Size_MB" , "Number_of_lines" , "Number_of_words" )
total <- data.frame("Files" = "Total", "Size_MB" = as.character(556.06), "Number_of_lines" = as.numeric(3336695), "Number_of_words" = as.numeric(70352205))
table <- rbind(table, total)
table <- table %>% htmlTable(tfoot = "Statistical summary",col.rgroup = c("steelblue", "grey"),encoding="UTF-8")
In this section we are going to build the corpus with the given texts.
After the individual analysis, some helper functions will be written so the process can be applied in a more general way. This is easier and more reusable; however, the exact steps depend on the specific needs of each project.
# Function for building the corpus (B for "Building")
B <- function(file) {
  corpus <- paste(file, collapse = " ")  # collapse all lines into a single string
  corpus <- VectorSource(corpus)         # wrap it as a tm source
  corpus <- Corpus(corpus)               # build the tm corpus
  return(corpus)
}
Cleaning the corpus is very important: texts usually contain characters, such as accents, that get in the way of a proper analysis, and we also need to convert the text to lowercase, remove punctuation marks, and remove stop words, among other steps.
# Function for cleaning the corpus (C for "Cleaning")
C <- function(cleancorpus) {
  # Replace special characters with a space (a character class keeps the regex valid)
  toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
  cleancorpus <- tm_map(cleancorpus, toSpace, "[/@:*&!?_#'´’-]")
  cleancorpus <- tm_map(cleancorpus, removeNumbers)
  cleancorpus <- tm_map(cleancorpus, content_transformer(tolower))
  cleancorpus <- tm_map(cleancorpus, removePunctuation)
  cleancorpus <- tm_map(cleancorpus, removeWords, stopwords("english"))
  cleancorpus <- tm_map(cleancorpus, stripWhitespace)
  return(cleancorpus)
}
One of the things we will do with the three texts is extract the most common words in each document, which gives us an idea of the composition of each text.
These words are shown in a `plotly` graph displaying the frequencies of the 30 most common words.
Next, a word cloud shows the 100 most common words in each text, which is useful for seeing at a glance which words are used most. Finally, a hierarchical clustering analysis is performed on the word frequencies of each text, showing how words relate to one another and which clusters are formed.
m_c_w <- function(cleancorpus) {
  sparse <- DocumentTermMatrix(cleancorpus)
  matrix <- as.matrix(sparse)                       # convert the document-term matrix into a regular matrix
  fw <- colSums(matrix)                             # frequency of each word
  fw <- as.data.frame(sort(fw, decreasing = TRUE))
  fw$word <- rownames(fw)
  colnames(fw) <- c("Frequency", "word")
  return(fw)
}
sam1 <-sample(news, round(0.25*length(news)), replace = F)
corpus <- B(sam1)
corpus_news <- C(corpus)
news_m_c_w <- m_c_w(corpus_news)
news_m_c_w1<- news_m_c_w[1:30,]
news_m_c_w2 <- news_m_c_w[1:50,]
p <- ggplot(news_m_c_w1, aes(x = reorder(word, Frequency), y = Frequency, fill = factor(reorder(word, -Frequency)))) +
  geom_col() +
  xlab("Word") + ylab("Frequency of words") +
  labs(title = "Most common words : US_News") +
  theme_minimal() +
  theme(legend.title = element_blank(), legend.position = "bottom",
        axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_viridis(discrete = TRUE)
fig <- ggplotly(p)
fig
wordcloud(news_m_c_w$word[1:100], news_m_c_w$Frequency[1:100],
colors=brewer.pal(8, "Set1"))
# Cluster on the numeric frequencies only; the words are kept as row names
hc_euclidea <- hclust(d = dist(x = news_m_c_w2["Frequency"], method = "euclidean"),
                      method = "complete")
fviz_dend(x = hc_euclidea, k = 6, cex = 0.7) +
  geom_hline(yintercept = 5.5, linetype = "dashed") +
  labs(title = "Hierarchical clustering",
       subtitle = "US_News analysis, K=6")
sam2 <-sample(blogs, round(0.25*length(blogs)), replace = F)
corpus_blogs <- B(sam2)
corpus_blogs <- C(corpus_blogs)
blogs_m_c_w <- m_c_w(corpus_blogs)
blogs_m_c_w1<- blogs_m_c_w[1:30,]
blogs_m_c_w2<- blogs_m_c_w[1:50,]
p <- ggplot(blogs_m_c_w1, aes(x = reorder(word, Frequency), y = Frequency, fill = factor(reorder(word, -Frequency)))) +
  geom_col() +
  xlab("Word") + ylab("Frequency of words") +
  labs(title = "Most common words : US_Blogs") +
  theme_minimal() +
  theme(legend.title = element_blank(), legend.position = "bottom",
        axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_viridis(discrete = TRUE)
fig <- ggplotly(p)
fig
wordcloud(blogs_m_c_w$word[1:100], blogs_m_c_w$Frequency[1:100],
colors=brewer.pal(8, "Dark2"))
hc_euclidea <- hclust(d = dist(x = blogs_m_c_w2["Frequency"], method = "euclidean"),
                      method = "complete")
fviz_dend(x = hc_euclidea, k = 6, cex = 0.7) +
  geom_hline(yintercept = 5.5, linetype = "dashed") +
  labs(title = "Hierarchical clustering",
       subtitle = "US_Blogs analysis, K=6")
sam3 <-sample(twitter, round(0.25*length(twitter)), replace = F)
corpus_twitter <- B(sam3)
corpus_twitter <- C(corpus_twitter)
twitter_m_c_w <- m_c_w(corpus_twitter)
twitter_m_c_w1<- twitter_m_c_w[1:30,]
twitter_m_c_w2<- twitter_m_c_w[1:50,]
p <- ggplot(twitter_m_c_w1, aes(x = reorder(word, Frequency), y = Frequency, fill = factor(reorder(word, -Frequency)))) +
  geom_col() +
  xlab("Word") + ylab("Frequency of words") +
  labs(title = "Most common words : US_Twitter") +
  theme_minimal() +
  theme(legend.title = element_blank(), legend.position = "bottom",
        axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_viridis(discrete = TRUE)
fig <- ggplotly(p)
fig
wordcloud(twitter_m_c_w$word[1:100], twitter_m_c_w$Frequency[1:100],
colors=brewer.pal(8, "Set2"))
hc_euclidea <- hclust(d = dist(x = twitter_m_c_w2["Frequency"], method = "euclidean"),
                      method = "complete")
fviz_dend(x = hc_euclidea, k = 6, cex = 0.7) +
  geom_hline(yintercept = 5.5, linetype = "dashed") +
  labs(title = "Hierarchical clustering",
       subtitle = "US_Twitter analysis, K=6")
The next stage is to work on text classification and on how we will develop our prediction system. For this, we will implement a probabilistic language model that allows us to build a word prediction system.
For our model we will use the Markov assumption, under which the probability of a word depends only on the word (or words) immediately preceding it. We will develop our language model based on n-grams using the `quanteda` package; an exploratory analysis of the bigrams present in the news document is presented below.
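Stated formally for the bigram case (the standard formulation, shown here for reference), the assumption and its maximum-likelihood estimate are:

$$
P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1}), \qquad
\hat{P}(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})}
$$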
Finally, we show a word cloud and a bar chart that let us observe how the bigrams behave in the news text. From this, a general model involving all the texts should be built; a minimal sketch of such a predictor is given after the bigram analysis.
sam1 <-sample(news, round(0.25*length(news)), replace = F)
newsCorpus <- corpus(sam1)
News_tokens <- newsCorpus %>%
  tokens(what = "word", remove_numbers = TRUE, remove_punct = TRUE,
         remove_separators = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(c("â", "s")) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("english")) %>%
  tokens_wordstem(language = "english") %>%
  tokens_ngrams(n = 2) %>%
  dfm()
textplot_wordcloud(News_tokens, min_count = 6, random_order = FALSE, rotation = 0.25,
                   color = RColorBrewer::brewer.pal(8, "Dark2"))
topFeatures <- topfeatures(News_tokens, 30)
bigrams_news <- data.frame(word=names(topFeatures), count=topFeatures)
p <- ggplot(bigrams_news, aes(x = reorder(word, count), y = count, fill = factor(reorder(word, -count)))) +
  geom_col() +
  xlab("Bigrams") + ylab("Frequency of bigrams") +
  labs(title = "Most common Bigrams : US_News") +
  theme_minimal() +
  theme(legend.title = element_blank(), legend.position = "bottom",
        axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_viridis(discrete = TRUE) +
  coord_flip()
fig <- ggplotly(p)
fig
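To illustrate how these bigram counts could feed the predictor described above, the following sketch builds a simple lookup table from the `News_tokens` object and returns the most frequent continuations of a given word. This is only a rough illustration under the Markov assumption: the helper `predict_next` is a hypothetical name, it uses the stemmed, stop-word-free news sample only, and it is not the final model.

# Minimal bigram predictor sketch (illustrative only)
bigram_counts <- colSums(News_tokens)                       # total count of each bigram feature
parts <- strsplit(names(bigram_counts), "_", fixed = TRUE)  # tokens_ngrams() joins words with "_"
bigram_table <- data.frame(first  = sapply(parts, `[`, 1),
                           second = sapply(parts, `[`, 2),
                           count  = as.numeric(bigram_counts),
                           stringsAsFactors = FALSE)

# Hypothetical helper: the n most frequent continuations of `word`,
# proportional to the maximum-likelihood estimate of P(next word | word)
predict_next <- function(word, n = 3) {
  cand <- bigram_table[bigram_table$first == tolower(word), ]
  cand <- cand[order(cand$count, decreasing = TRUE), ]
  head(cand$second, n)
}

predict_next("new")   # e.g. the most frequent continuations of "new" in the news sample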