Coursera Data Science Capstone

Overview

This report presents an exploratory analysis of the words, and their frequencies, that appear most often in the training data, a large body of text documents provided by SwiftKey. After reading the data, I summarize each of the English-language data sets (blogs, news, and Twitter) to understand their size and shape. I then draw a sample, clean it, and carry out the exploratory analysis on it. The goal is to get an idea of how this data can be used to build a prediction algorithm that takes a word as input and predicts the following word or words, and to implement that algorithm in a Shiny app.

Loading the data

For this project I use the English-language blogs, news, and Twitter data, which I downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip and saved in the working directory. I then read each file as shown below:

blogs_data_us <- readLines("final/en_US/en_US.blogs.txt", skipNul = TRUE)
news_data_us <- readLines("final/en_US/en_US.news.txt", skipNul = TRUE)

## Warning in readLines("final/en_US/en_US.news.txt", skipNul = TRUE): incomplete
## final line found on 'final/en_US/en_US.news.txt'

twitter_data_us <- readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE)
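The warning reported by readLines for the news file means that its last line has no trailing newline. On some platforms a text-mode read can also stop early at an embedded control character, so it may be worth reading through a binary-mode connection and comparing the line counts. The following is only a minimal sketch: the helper name `read_lines_binary` and the temporary demo file are assumptions made for illustration and are not part of the original analysis.

```r
# Hypothetical helper: read a text file through a binary-mode connection so
# that embedded control characters do not end the read early.
read_lines_binary <- function(path, encoding = "UTF-8") {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = encoding, skipNul = TRUE, warn = FALSE)
}

# Small self-contained check on a temporary file with no final newline.
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("first line\nsecond line"), tmp)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
```

If the count returned by such a helper for a real file differs from the count obtained with a plain readLines call, the summary and sample below would need to be recomputed.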

Summarizing the data

Below is a brief summary of the data loaded above. For each data set it gives the size of the loaded object in megabytes (as reported by object.size, so it can differ from the size of the file on disk), the number of lines, the number of characters, and the number of words.

##       files_names size_Mb num_lines num_characters num_words
## 1   blogs_data_us   267.8    899288      208361438  38154238
## 2    news_data_us    20.7     77259       15683765   2693898
## 3 twitter_data_us   334.5   2360148      162385035  30218166

The summary was made with the following code:

library(stringi)
# Keep the data sets in a list so the same function can be applied to each
files <- list(blogs_data_us, news_data_us, twitter_data_us)
files_names <- c("blogs_data_us", "news_data_us", "twitter_data_us")
# Build a data frame directly so the numeric columns stay numeric
df_summary <- data.frame(files_names = files_names,
                         size_Mb = round(sapply(files, object.size) / 10^6, 1),
                         num_lines = sapply(files, length),
                         num_characters = sapply(files, function(x) sum(nchar(x))),
                         num_words = sapply(files, function(x) sum(stri_count_words(x))))
df_summary

Choosing a sample of the data

Because the data sets are very large, I took a random sample of 1% of the lines of each one, as shown below.

set.seed(1234)  # arbitrary seed, added so that the sample is reproducible
sample_blog <- sample(blogs_data_us, floor(length(blogs_data_us) * 0.01))
sample_news <- sample(news_data_us, floor(length(news_data_us) * 0.01))
sample_twitter <- sample(twitter_data_us, floor(length(twitter_data_us) * 0.01))

Cleaning the data

The sample created above is the data used for the exploratory analysis. I cleaned it by converting all letters to lowercase and removing numbers, punctuation marks, extra whitespace, non-ASCII characters, and English stop words. Finally, I converted each document to a plain text document, as shown below.

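library(tm)
# Combine the three samples into a single corpus
corpus_data <- VCorpus(VectorSource(c(sample_blog, sample_news, sample_twitter)))
# Base R functions are wrapped in content_transformer() so that each
# element of the corpus remains a text document
corpus_data <- tm_map(corpus_data, content_transformer(tolower))
corpus_data <- tm_map(corpus_data, removeNumbers)
corpus_data <- tm_map(corpus_data, removePunctuation)
corpus_data <- tm_map(corpus_data, stripWhitespace)
# Drop characters that cannot be represented in ASCII
corpus_data <- tm_map(corpus_data, content_transformer(function(x) iconv(x, "latin1", "ASCII", sub = "")))
corpus_data <- tm_map(corpus_data, removeWords, stopwords("en"))
corpus_data <- tm_map(corpus_data, PlainTextDocument)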
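To make the effect of these cleaning steps concrete, the sketch below applies roughly equivalent base R operations to a single string. It is only an illustration: the function name `clean_text`, the example sentence, and the four-word stop-word list are made up here, and the tm transformations above may differ in detail.

```r
# Base R illustration of the cleaning steps on one example string.
# The example text and the tiny stop-word list are made up for this sketch.
clean_text <- function(x, stop_words = c("the", "a", "is", "at")) {
  x <- tolower(x)                                   # lowercase
  x <- gsub("[0-9]+", "", x)                        # remove numbers
  x <- gsub("[[:punct:]]+", "", x)                  # remove punctuation
  x <- iconv(x, "latin1", "ASCII", sub = "")        # drop non-ASCII characters
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)            # remove stop words
  trimws(gsub("\\s+", " ", x, perl = TRUE))         # collapse extra whitespace
}

cleaned <- clean_text("The meeting is at 10:30, see you THERE!")
cleaned
```

Note that in this sketch the whitespace is collapsed last, because removing stop words leaves extra spaces behind.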

Exploratory data analysis

This exploratory analysis aims to answer one question: which words, or sequences of words, appear most often in the texts? The answer gives an idea of how words behave in the data and which path to take when building the prediction algorithm. To answer it, I first drew a word cloud of the most frequent words in the data and then made bar charts of the 20 most frequent unigrams, bigrams, and trigrams, which are shown below.

library(wordcloud)
wordcloud(corpus_data, max.words = 100, random.order = FALSE, colors=brewer.pal(8,"Dark2"))

library(RWeka)
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

frequency_function <- function(input) {
  frequency <- sort(rowSums(as.matrix(input)), decreasing = TRUE)
  return(data.frame(word = names(frequency), frequency = frequency))
}
frequency_1 <- frequency_function(removeSparseTerms(TermDocumentMatrix(corpus_data), 0.9999))
frequency_2 <- frequency_function(removeSparseTerms(TermDocumentMatrix(corpus_data, control = list(tokenize = bigram)), 0.9999))
frequency_3 <- frequency_function(removeSparseTerms(TermDocumentMatrix(corpus_data, control = list(tokenize = trigram)), 0.9999))
library(ggplot2)
# Keep the 20 most frequent n-grams of each order
frequencies <- list(frequency_1[1:20, ], frequency_2[1:20, ], frequency_3[1:20, ])
titles <- c("Top 20 Unigrams By Frequency", "Top 20 Bigrams By Frequency", "Top 20 Trigrams By Frequency")
for (i in 1:3) {
  # Order the bars from most to least frequent
  p <- ggplot(frequencies[[i]], aes(reorder(word, -frequency), frequency))
  p <- p + theme_bw() + labs(title = titles[i], x = "Words", y = "Frequency")
  p <- p + theme(plot.title = element_text(face = "bold", size = rel(1.5), vjust = 2.5, hjust = 0.5), axis.text.x = element_text(angle = 60, size = 10, hjust = 1))
  p <- p + geom_bar(stat = "identity", fill = "grey50")
  print(p)
}
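For readers who want to see what the n-gram counts behind these charts look like without RWeka or tm, here is a minimal base R sketch. The helper `make_ngrams` and the example sentence are assumptions made for illustration, and the tokenization is a simple split on spaces rather than the tokenizer used above.

```r
# Build all n-grams of a given order from a character vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Made-up example sentence, already cleaned and lowercased.
tokens <- strsplit("i love data i love r i love data science", " ")[[1]]

# Count each bigram and sort from most to least frequent.
bigram_counts <- sort(table(make_ngrams(tokens, 2)), decreasing = TRUE)
head(bigram_counts, 3)
```

The same function with n = 1 or n = 3 would give the unigram and trigram counts for the toy sentence.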

Next steps

Now that I know a little more about the training data available for the prediction algorithm, the next step is to clean the data further by excluding profanity. I will then build the prediction algorithm, which could use the trigram model to predict the two words that follow the entered word. If the trigram model finds no match, the bigram model will be used to predict the single word that follows, and if the bigram model also fails, the unigram model will be used. Once the prediction algorithm is ready, the last step is to create the Shiny application, which will take a word as input and return the following word or words.
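The fallback from trigrams to bigrams to unigrams described above could be sketched as follows. This is a toy illustration under stated assumptions, not the final algorithm: the frequency tables, their counts, and the function names `best_completion` and `predict_next` are all made up, and a real implementation would use the tables computed from the sample.

```r
# Toy n-gram frequency tables; the phrases and counts are made up.
trigrams <- c("thanks for the" = 5, "for the follow" = 3)
bigrams  <- c("for the" = 8, "the follow" = 3, "the best" = 2)
unigrams <- c("the" = 20, "for" = 9, "best" = 4)

# Return the most frequent completion of `prefix` in a named count vector,
# or NA if no n-gram starts with that prefix.
best_completion <- function(prefix, counts) {
  hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  sub(paste0(prefix, " "), "", names(hits)[which.max(hits)], fixed = TRUE)
}

# Back off from trigrams to bigrams to the most frequent unigram.
predict_next <- function(word) {
  word <- tolower(trimws(word))
  guess <- best_completion(word, trigrams)
  if (is.na(guess)) guess <- best_completion(word, bigrams)
  if (is.na(guess)) guess <- names(unigrams)[which.max(unigrams)]
  guess
}

predict_next("thanks")  # matched in the trigram table
predict_next("the")     # falls back to the bigram table
predict_next("zebra")   # falls back to the most frequent unigram
```

Keeping each table as a named count vector makes the lookup easy to read, although a larger data set would probably call for an indexed structure such as a keyed data frame.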

Coursera Data Science Capstone - Milestone Report

Obed Garcia

22/7/2020