library(dplyr)
library(NLP)
library(tm)
library(tidytext)
library(tidyr)
library(quanteda)
library(ggplot2)
library(stringi)

Introduction

This report summarizes what we have researched so far and the ideas we are pursuing for the development of the word-prediction application, “Next word.”
We start from the data provided by “HC Corpora.” Several languages are available but, at least initially, we only work with the files in the “en_US” folder. This folder contains 3 different files: Blogs, News and Twitter (the News file has to be opened in binary mode in order to read all of its data).

Getting the data

We download and unzip the data from the URL provided by Coursera (a sketch of this step is shown below) and load the three files into memory. They are very large files, between 150 and 200 Mb each, with more than 100 million words in total.
Working with all of the data is out of the question, so we have to decide how large a sample to use. We take a random sample of 10% of the lines of each file.
From this 10% selection we build a corpus, which serves as the starting point for generating the different n-grams (1, 2, 3, 4…).
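
The download step itself is not shown in the code below; the following is a minimal sketch, where data_url is a placeholder for the URL provided by Coursera and the archive name and resulting ./final/en_US/ folder are assumptions based on the paths used later.

# Sketch only: data_url stands in for the URL provided by Coursera
data_url <- "<URL provided by Coursera>"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(data_url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")  # assumed to create ./final/en_US/ with the three files
}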

memory.limit(size = 16000)
## [1] 16000
set.seed(1313)

conn <- file("https://www.cs.cmu.edu/~biglou/resources/bad-words.txt","r")
df_prof <- readLines(conn)
close(conn)
rm(conn)

porc <- 0.10 # 10%

conn <- file("./final/en_US/en_US.news.txt", open = "rb")
ne <- readLines(conn, encoding = "UTF-8", skipNul = TRUE)
close(conn)
rm(conn)
ne <- iconv(ne,"latin1","ASCII",sub = "")
df_sum_ne <- tibble(file = "News", words_total = sum(stri_count_words(ne)), Mb_total = round(sum(stri_numbytes(ne)) / 1024^2,0))
rm(ne)
if (!file.exists(paste0("./final/en_US/sample/ne.q", porc, ".RData"))) {
  ne_s <- sample(ne, round(porc * length(ne)))
  saveRDS(ne_s, file = paste0("./final/en_US/sample/ne.q", porc, ".RData"))
  rm(ne_s)
}

bl <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
bl <- iconv(bl, "latin1", "ASCII", sub = "")
df_sum_bl <- tibble(file = "Blogs", words_total = sum(stri_count_words(bl)), Mb_total = round(sum(stri_numbytes(bl)) / 1024^2, 0))
if (!file.exists(paste0("./final/en_US/sample/bl.q", porc, ".RData"))) {
  bl_s <- sample(bl, round(porc * length(bl)))
  saveRDS(bl_s, file = paste0("./final/en_US/sample/bl.q", porc, ".RData"))
  rm(bl_s)
}
rm(bl)

tw <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
tw <- iconv(tw, "latin1", "ASCII", sub = "")
df_sum_tw <- tibble(file = "Twitter", words_total = sum(stri_count_words(tw)), Mb_total = round(sum(stri_numbytes(tw)) / 1024^2, 0))
if (!file.exists(paste0("./final/en_US/sample/tw.q", porc, ".RData"))) {
  tw_s <- sample(tw, round(porc * length(tw)))
  saveRDS(tw_s, file = paste0("./final/en_US/sample/tw.q", porc, ".RData"))
  rm(tw_s)
}
rm(tw)

rbind(df_sum_ne, df_sum_bl, df_sum_tw)
## # A tibble: 3 x 3
##   file    words_total Mb_total
##   <chr>         <int>    <dbl>
## 1 News       34749301      194
## 2 Blogs      37510168      196
## 3 Twitter    30088605      154
ne <- readRDS(paste0("./final/en_US/sample/ne.q", porc, ".RData"))
bl <- readRDS(paste0("./final/en_US/sample/bl.q", porc, ".RData"))
tw <- readRDS(paste0("./final/en_US/sample/tw.q", porc, ".RData"))

df_sum_ne <- tibble(file = "News", words_sample = sum(stri_count_words(ne)), Mb_sample = 
round(sum(stri_numbytes(ne)) / 1024^2,0))
df_sum_bl <- tibble(file = "Blogs", words_sample = sum(stri_count_words(bl)), Mb_sample = round(sum(stri_numbytes(bl)) / 1024^2,0))
df_sum_tw <- tibble(file = "Twitter", words_sample = sum(stri_count_words(tw)), Mb_sample = round(sum(stri_numbytes(tw)) / 1024^2,0))

rbind(df_sum_ne, df_sum_bl, df_sum_tw)
## # A tibble: 3 x 3
##   file    words_sample Mb_sample
##   <chr>          <int>     <dbl>
## 1 News         3480208        19
## 2 Blogs        3732720        20
## 3 Twitter      3006843        15

Cleaning the data

The sampled text contains many kinds of irregularities that must be normalized before it can be analyzed. We tokenize it, applying the following cleaning subprocesses:
- Eliminate numbers.
- Eliminate punctuation marks.
- Eliminate symbols.
- Eliminate text separators.
- Eliminate particularities from Twitter.
- Eliminate dashes.
- Convert all text to lower case.
- Eliminate profanity.

# Count each n-gram and convert the counts to relative frequencies
jc.freq = function(x) {
  x %>%
    group_by(NextWord) %>%
    summarise(count = dplyr::n()) %>%
    mutate(freq = count / sum(count)) %>%
    select(-count) %>%
    arrange(desc(freq))
}
# Tokenize and clean the text with quanteda, returning a character vector of n-grams
jc.toke = function(x, ngramSize = 1) {
  toks <- quanteda::tokens(x,
                           remove_numbers = T,
                           remove_punct = T,
                           remove_symbols = T,
                           remove_separators = T,
                           remove_twitter = T,   # strip Twitter-specific characters (#, @)
                           remove_hyphens = T,
                           remove_url = T,
                           ngrams = ngramSize,
                           concatenator = " ")
  toks <- quanteda::tokens_tolower(toks)         # convert everything to lower case
  toks <- quanteda::tokens_remove(toks, df_prof) # remove profanity
  # Flatten the tokens object into a plain character vector of n-grams
  unlist(as.list(toks), use.names = FALSE)
}
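
As a quick, purely illustrative check of the two helpers (the sentences below are made up and are not part of the corpus), we can tokenize a couple of lines into bigrams and compute their relative frequencies:

demo <- c("Check out http://example.com today!!", "Today we check the #NLP files again")
demo_bigrams <- jc.toke(demo, ngramSize = 2)   # cleaned, lower-cased bigrams
jc.freq(tibble(NextWord = demo_bigrams))       # one row per bigram, sorted by freq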

train <- c(ne,bl,tw)
rm(ne,bl,tw)

train = corpus(train) # build a quanteda corpus from the combined sample

if (!file.exists(paste0("./final/en_US/ngrams/1ng.q", porc, ".RData"))) {
  train1 = jc.toke(train)
  dfTrain1 = tibble(NextWord = train1)
  dfTrain1 = jc.freq(dfTrain1)
  saveRDS(dfTrain1,file = paste0("./final/en_US/ngrams/1ng.q", porc, ".RData"))
  rm(train1, dfTrain1)
}

if (!file.exists(paste0("./final/en_US/ngrams/2ng.q", porc, ".RData"))) {
  train2 = jc.toke(train, 2)
  dfTrain2= tibble(NextWord = train2)
  dfTrain2 = jc.freq(dfTrain2) %>%
    separate(NextWord, c('word1', 'NextWord'), " ")
  saveRDS(dfTrain2,file = paste0("./final/en_US/ngrams/2ng.q", porc, ".RData"))
  rm(train2, dfTrain2)
  }

if (!file.exists(paste0("./final/en_US/ngrams/3ng.q", porc, ".RData"))) {
  train3 = jc.toke(train, 3)
  dfTrain3 = tibble(NextWord = train3)
  dfTrain3 = jc.freq(dfTrain3) %>%
    separate(NextWord, c('word1', 'word2', 'NextWord'), " ")
  saveRDS(dfTrain3,file = paste0("./final/en_US/ngrams/3ng.q", porc, ".RData"))
  rm(train3, dfTrain3)
}
if (!file.exists(paste0("./final/en_US/ngrams/4ng.q", porc, ".RData"))) {
  train4 = jc.toke(train, 4)
  dfTrain4 = tibble(NextWord = train4)
  dfTrain4 = jc.freq(dfTrain4) %>%
    separate(NextWord, c('word1', 'word2', 'word3', 'NextWord'), " ")
  rm(train, train1, train2, train3, train4)
  saveRDS(dfTrain4,file = paste0("./final/en_US/ngrams/4ng.q", porc, ".RData"))
  rm(train4, dfTrain4)
}

rm(train)
dfTrain1 <- readRDS(paste0("./final/en_US/ngrams/1ng.q", porc, ".RData"))
dfTrain2 <- readRDS(paste0("./final/en_US/ngrams/2ng.q", porc, ".RData"))
dfTrain3 <- readRDS(paste0("./final/en_US/ngrams/3ng.q", porc, ".RData"))
dfTrain4 <- readRDS(paste0("./final/en_US/ngrams/4ng.q", porc, ".RData"))

Exploratory Analysis

We now examine how frequently the terms in each n-gram table occur, plotting the 30 most common entries of each.

dfTrain1.h <- head(dfTrain1,30)
ggplot(dfTrain1.h, aes(reorder(NextWord, -freq), freq)) +
         labs(x = "30 most common unigrams", y = "Frequency") +
         theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
         geom_bar(stat = "identity", fill = I("grey50"))

dfTrain2.h <- head(dfTrain2,30)
ggplot(dfTrain2.h, aes(reorder(paste(word1, NextWord), -freq), freq)) +
         labs(x = "30 most common bigrams", y = "Frequency") +
         theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
         geom_bar(stat = "identity", fill = I("grey50"))

dfTrain3.h <- head(dfTrain3,30)
ggplot(dfTrain3.h, aes(reorder(paste(word1, word2, NextWord), -freq), freq)) +
         labs(x = "30 most common trigrams", y = "Frequency") +
         theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
         geom_bar(stat = "identity", fill = I("grey50"))

dfTrain4.h <- head(dfTrain4,30)
ggplot(dfTrain4.h, aes(reorder(paste(word1, word2, word3, NextWord), -freq), freq)) +
         labs(x = "30 most common tetragrams", y = "Frequency") +
         theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
         geom_bar(stat = "identity", fill = I("grey50"))

Next steps

We still have to decide whether to eliminate stop words.
In any case, using only 10% of the data will presumably have an adverse effect on the accuracy of the results. However, increasing the sample size produces very large n-gram tables which, when used by our algorithm, would make the response time much longer and render the application slow and of little practical use.
Along the same lines, we could also increase the order of the n-grams: instead of stopping at 4, we could build up to 8-grams or higher, which should increase accuracy but would also greatly lengthen search and response times.
First we will need to see what can be achieved on a home computer, and then what is feasible within the Shiny app container; as a starting point, we can check how much memory the n-gram tables occupy (see the sketch below).
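
A rough feasibility check (a sketch only; the figures depend on the sample used) is to measure the in-memory size of the four frequency tables built above:

# Approximate in-memory size of each n-gram table
sapply(list(unigrams = dfTrain1, bigrams = dfTrain2,
            trigrams = dfTrain3, tetragrams = dfTrain4),
       function(df) format(object.size(df), units = "Mb"))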
The algorithm we plan to implement is essentially a back-off search: we look in the n-gram tables from the highest order downwards and return the continuations with the highest frequency. The text typed in the App must first go through the same transformations applied to the corpus, and the lookup then follows this sequence (a sketch of the lookup is given after the list):
- If the input text is empty, we return the most frequent words from the 1-gram table.
- For an input of N words, we search the (N+1)-gram table for the most frequent continuations.
- …
- For an input of 3 words, we search the 4-gram table.
- For an input of 2 words, we search the 3-gram table.
- For an input of 1 word, we search the 2-gram table.
- If nothing is found in any of the higher-order tables, we fall back to the most frequent words in the 1-gram table.
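
A minimal sketch of this back-off lookup, assuming the tables dfTrain1 to dfTrain4 built above are in memory and that the input has already been cleaned with the same transformations as the corpus; the function name jc.predict and the argument n_best are hypothetical, chosen only for illustration.

# Hypothetical back-off lookup over the n-gram tables built above
jc.predict <- function(input, n_best = 3) {
  w <- unlist(strsplit(tolower(input), "\\s+"))     # words typed so far, already cleaned
  n <- length(w)
  res <- NULL
  if (n >= 3) {                                     # try the 4-gram table first
    res <- dfTrain4 %>%
      filter(word1 == w[n - 2], word2 == w[n - 1], word3 == w[n])
  }
  if ((is.null(res) || nrow(res) == 0) && n >= 2) { # back off to the 3-gram table
    res <- dfTrain3 %>% filter(word1 == w[n - 1], word2 == w[n])
  }
  if ((is.null(res) || nrow(res) == 0) && n >= 1) { # back off to the 2-gram table
    res <- dfTrain2 %>% filter(word1 == w[n])
  }
  if (is.null(res) || nrow(res) == 0) {             # empty input or no match: 1-gram table
    res <- dfTrain1
  }
  res %>% arrange(desc(freq)) %>% head(n_best) %>% pull(NextWord)
}

jc.predict("thanks for the")  # returns the n_best most frequent continuations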