Report goal

  1. Demonstrate that I’ve downloaded the data and have successfully loaded it in.

  2. Create a basic report of summary statistics about the data sets.

  3. Report any interesting findings.

  4. Outline plans for creating a prediction algorithm and a Shiny app.

Getting the data

The data was obtained from the link below:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The dataset was collected from publicly available sources by a web crawler. The zip archive contains four datasets: de_DE, en_US, fi_FI, ru_RU. The English-language files will be used for this exploration.
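For reproducibility, a minimal sketch of downloading and extracting the archive could look like the following; the destination file name and extraction folder are placeholders, not part of the original workflow.

# Hypothetical download-and-extract step; adjust destfile/exdir to your setup
zip_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(zip_url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
unzip("Coursera-SwiftKey.zip", exdir = "Coursera-SwiftKey")  # contains the "final" folder with the four locales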

rm(list = ls(all.names = TRUE))

library(readr)
library(MODIS)   # provides fileSize(), used in the data summary below
## Warning: package 'MODIS' was built under R version 3.6.3
## Loading required package: mapdata
## Warning: package 'mapdata' was built under R version 3.6.3
## Loading required package: maps
## Warning: package 'maps' was built under R version 3.6.3
## Loading required package: raster
## Warning: package 'raster' was built under R version 3.6.3
## Loading required package: sp
## Warning: package 'sp' was built under R version 3.6.3
FOLDER = "/Users/danie/Documents/cursofinal/Coursera-SwiftKey/final"
setwd(FOLDER)

conTwitter <- file(paste0(FOLDER,"/", "en_US/en_US.twitter.txt"), "r") 
conNews <- file(paste0(FOLDER,"/", "en_US/en_US.news.txt"), "r") 
conBlogs <- file(paste0(FOLDER,"/", "en_US/en_US.blogs.txt"), "r") 
Tfull <- readLines(conTwitter)
## Warning in readLines(conTwitter): line 167155 appears to contain an embedded nul
## Warning in readLines(conTwitter): line 268547 appears to contain an embedded nul
## Warning in readLines(conTwitter): line 1274086 appears to contain an embedded
## nul
## Warning in readLines(conTwitter): line 1759032 appears to contain an embedded
## nul
Bfull <- readLines(conBlogs)
Nfull <- readLines(conNews)
## Warning in readLines(conNews): incomplete final line found on '/Users/danie/
## Documents/cursofinal/Coursera-SwiftKey/final/en_US/en_US.news.txt'
close(conTwitter)
close(conNews)
close(conBlogs)

Summary of the data

From the summary we can see that the Twitter data has the largest number of lines, while the blogs file is the largest in terms of size on disk.

files <- c("en_US.twitter.txt", "en_US.news.txt", "en_US.blogs.txt")
lens <- c(length(Tfull), length(Nfull), length(Bfull))
sizes <- c(fileSize(paste0(FOLDER,"/", "en_US/en_US.twitter.txt"), units = "MB"),
           fileSize(paste0(FOLDER,"/", "en_US/en_US.news.txt"), units = "MB"),
           fileSize(paste0(FOLDER,"/", "en_US/en_US.blogs.txt"), units = "MB"))
# Word counts: split each line on whitespace and sum the token counts
words <- c(sum(sapply(strsplit(Tfull, "\\s+"), length)),
           sum(sapply(strsplit(Nfull, "\\s+"), length)),
           sum(sapply(strsplit(Bfull, "\\s+"), length)))
data_summary <- data.frame(File=files, NumberLines=lens, SizeMB=sizes, NumWords = words)
data_summary
##                File NumberLines   SizeMB NumWords
## 1 en_US.twitter.txt     2360148 159.3641 30373792
## 2    en_US.news.txt       77259 196.2775  2643972
## 3   en_US.blogs.txt      899288 200.4242 37334441

Sampling

Given the size of the dataset, we will use 10% of each source for the exploration. The sampling is carried out using a binomial distribution, and all samples are then combined.

set.seed(100)
# draw random line indices for each source from its own file, then combine the samples
Tsample <- Tfull[rbinom(length(Tfull) * 0.1, length(Tfull), 0.5)]
Bsample <- Bfull[rbinom(length(Bfull) * 0.1, length(Bfull), 0.5)]
Nsample <- Nfull[rbinom(length(Nfull) * 0.1, length(Nfull), 0.5)]
AllSample <- c(Tsample, Bsample, Nsample)
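For reference, a more literal reading of the binomial sampling idea is to flip a biased coin for every line and keep the lines whose flip comes up 1. This is only a sketch; the 10% keep-probability and the *2 object names are assumptions.

# Alternative sketch: keep each line independently with probability 0.1
set.seed(100)
Tsample2 <- Tfull[as.logical(rbinom(length(Tfull), 1, 0.1))]
Bsample2 <- Bfull[as.logical(rbinom(length(Bfull), 1, 0.1))]
Nsample2 <- Nfull[as.logical(rbinom(length(Nfull), 1, 0.1))]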

Cleaning

All the data cleaning for this project was performed with the tm package, which provides corpus-handling methods once the whole sample has been converted to a corpus.

library(tm) 
## Warning: package 'tm' was built under R version 3.6.3
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.6.3
library(SnowballC) 
## Warning: package 'SnowballC' was built under R version 3.6.3
corpus <- VCorpus(VectorSource(AllSample))
# wrap base R functions in content_transformer() so the corpus structure is preserved
clean_corp <- tm_map(corpus, content_transformer(function(x) gsub("[^a-zA-Z0-9 ]","",x)))
clean_corp <- tm_map(clean_corp, content_transformer(tolower))
clean_corp <- tm_map(clean_corp, removeNumbers)
clean_corp <- tm_map(clean_corp, stripWhitespace)
clean_corp <- tm_map(clean_corp, removeWords, stopwords("english")) 
clean_corp <- tm_map(clean_corp, stemDocument) 
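As an optional sanity check (not part of the original pipeline), one cleaned document can be compared against its raw counterpart:

# Optional check: compare a raw sample line with its cleaned version
AllSample[1]
as.character(clean_corp[[1]])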

Exploration

The package used for tokenizing and for the exploratory data analysis is quanteda (quantitative analysis of textual data).

The first step is to tokenize the corpus and clean it by removing symbols, numbers, separators, punctuation, and stopwords.

library(quanteda)
## Warning: package 'quanteda' was built under R version 3.6.3
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 4 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:tm':
## 
##     stopwords
## The following objects are masked from 'package:NLP':
## 
##     meta, meta<-
library(quanteda.textplots)
## Warning: package 'quanteda.textplots' was built under R version 3.6.3
corpus_q <- quanteda::corpus(AllSample)
toks <- tokens(corpus_q, remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_numbers = TRUE,
  remove_url = TRUE,
  remove_separators = TRUE)


# REMOVE STOPWORDS
toks_nostop <- tokens_select(toks, pattern = stopwords("en"), selection = "remove")
dfm_tok <- dfm(toks_nostop)
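As a quick sanity check before plotting, the most frequent tokens can be listed directly from the document-feature matrix (output not shown here):

topfeatures(dfm_tok, 10)   # ten most frequent tokens in the cleaned sample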

The package also includes some standardized plots, useful for visualizing textual data.

library(quanteda.textplots)

textplot_wordcloud(dfm_tok)

textplot_wordcloud(dfm_tok, min_count = 10,
     color = c('red', 'pink', 'green', 'purple', 'orange', 'blue'))

Frequency plots

library(quanteda.textstats)
## Warning: package 'quanteda.textstats' was built under R version 3.6.3
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
freq_dfm <- textstat_frequency(dfm_tok, n = 50)

# Sort by reverse frequency order
freq_dfm$feature <- with(freq_dfm, reorder(feature, -frequency))

ggplot(freq_dfm, aes(x = feature, y = frequency)) +
    geom_point() + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

uni <- tokens_ngrams(toks_nostop, n = 1)
bi <- tokens_ngrams(toks_nostop, n = 2)
tri <- tokens_ngrams(toks_nostop, n = 3)

dfm_bi <- dfm(bi)
dfm_tri <- dfm(tri)

textplot_wordcloud(dfm_bi, max_words = 100,
                   ordered_color = TRUE)

freq_dfmtri <- textstat_frequency(dfm_tri, n = 50)
freq_dfmtri$feature <- with(freq_dfmtri, reorder(feature, -frequency))

ggplot(freq_dfmtri, aes(x = feature, y = frequency)) +
    geom_point() + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

Task 3: Modeling

The final goal is to build a model that predicts the next word given the last one, two, or three words; in other words, it must estimate the likelihood of each candidate next word.

It has to be taken into account that new combinations of words will appear, so the model should be able to handle cases where an n-gram was never observed and still predict a word.

Runtime and memory consumption also have to be considered.

PLAN

Initially, some research needs to be done on n-gram models: how they work and what tools have been used to build them.

The general idea for starting the construction of the model is to build a dictionary (input: prediction) of n-grams and their following word, and use that as the train/test dataset. If the N-word combination is not found within the corpus, the algorithm should look for the (N-1)-gram, then the (N-2)-gram, and so on; when no prediction can be made, a default behaviour or parameter should be applied. A rough sketch of this backoff idea is shown below.
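The following is only an illustration of that plan, not the final implementation. It builds frequency tables from the quanteda objects created above; the helper names, the "most frequent continuation" rule, and the fallback word are all assumptions.

# Hypothetical backoff sketch built on the tokens object created above (toks_nostop)
library(quanteda)
library(quanteda.textstats)

ngram_freq <- function(toks, n) {
  # frequency table of n-grams, sorted by decreasing frequency
  textstat_frequency(dfm(tokens_ngrams(toks, n = n)))
}
freq2 <- ngram_freq(toks_nostop, 2)   # bigrams:  1-word prefix + predicted word
freq3 <- ngram_freq(toks_nostop, 3)   # trigrams: 2-word prefix + predicted word

predict_next <- function(prefix_words) {
  prefix_words <- tail(prefix_words, 2)          # use at most the last two words
  for (n in rev(seq_len(length(prefix_words)))) {
    prefix <- paste(tail(prefix_words, n), collapse = "_")
    tab <- if (n == 2) freq3 else freq2
    hits <- tab[startsWith(tab$feature, paste0(prefix, "_")), ]
    if (nrow(hits) > 0) {
      parts <- strsplit(hits$feature[1], "_")[[1]]
      return(parts[length(parts)])               # last token of the most frequent match
    }
  }
  "the"  # arbitrary default when no n-gram matches
}

# Example call (the result depends on the sampled corpus):
# predict_next(c("happy", "new"))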

The input for the Shiny app will be a text box and a button to trigger the model prediction.
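A minimal skeleton of that interface, assuming a predict_next() function along the lines of the sketch above, might look like this; the widget ids and labels are placeholders.

# Hypothetical Shiny skeleton: a text box, a button, and the predicted word
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  actionButton("go", "Predict next word"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    input$go                                   # re-run only when the button is pressed
    isolate({
      words <- unlist(strsplit(tolower(input$phrase), "\\s+"))
      if (length(words) == 0) "" else predict_next(words)
    })
  })
}

# shinyApp(ui, server)   # uncomment to launch locally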