Intro

This is the first report of the milestone project, in which I explain the exploratory analysis and the goals for the eventual app and algorithm. The document covers how to download and read the data, the exploratory analysis, plots illustrating the most common terms and n-grams, and the plans for using them in the Shiny app.

Libraries and Functions

These are the libraries, functions and seed that I use for the calculations.

#libraries
library(tm)
library(wordcloud)
library(RWeka)
library(ggplot2)


#Functions
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  return(corpus)
}

convertTDMtoDF <- function(tdm){
  ngram_m <- as.matrix(tdm)
  ngram_m <- sort(rowSums(ngram_m),decreasing = T)
  ngram_df <- data.frame(terms=names(ngram_m),num=ngram_m)
  return(ngram_df)
}


#Reproducible
set.seed(1103)

Download and data loading

Code for downloading the data, using the course URL.

#Download the zip file with the data and unzip it
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, destfile = "carpeta1.zip")
unzip("carpeta1.zip",exdir=".")

In this report I will be using the English files: blogs, news and Twitter.

setwd("~/CurseraDataScience/DataScienceCapstone")
#read blogs
blogs <- file("./final/en_US/en_US.blogs.txt","r")
blogs_lines <- readLines(blogs, encoding = "UTF-8",skipNul = TRUE)
close(blogs)
#read news
news <- file("./final/en_US/en_US.news.txt","r")
news_lines <- readLines(news, encoding = "UTF-8",skipNul = TRUE)
close(news)
#read twitter
twitter <- file("./final/en_US/en_US.twitter.txt","r")
twitter_lines <- readLines(twitter, encoding = "UTF-8",skipNul = TRUE)
close(twitter)

Summarizing

This is a summary of the files; it includes the file size (in MB), the number of lines and the number of characters.

size <- file.size(c("./final/en_US/en_US.blogs.txt","./final/en_US/en_US.news.txt","./final/en_US/en_US.twitter.txt"))
size <- round((size/1024)/1000) #approximate size in MB
lines <- sapply(list(blogs_lines,news_lines,twitter_lines), length)
characters <- sapply(list(blogs_lines, news_lines, twitter_lines), function(x){sum(nchar(x))})
sum <- data.frame(files =c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt"),size,lines,characters)
sum
##               files size   lines characters
## 1   en_US.blogs.txt  205  899288  206824505
## 2    en_US.news.txt  201   77259   15639408
## 3 en_US.twitter.txt  163 2360148  162096241

Subsetting

I took a random sample of 5% of the lines of each file, to make processing easier and to follow the idea of training and test sets. I then write every subset to a new file, appending the word “sub” to the name; these files are saved in the subdirectory “sub” (a sketch of this step follows the sampling code below).

sub_blogs <- blogs_lines[sample(length(blogs_lines),length(blogs_lines)*.05)]
sub_news <- news_lines[sample(length(news_lines),length(news_lines)*.05)]
sub_twitter <- twitter_lines[sample(length(twitter_lines),length(twitter_lines)*.05)]
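
A minimal sketch of how the subsets can be written out, assuming writeLines and file names with the “sub” suffix as described above (the exact write call and names are not shown in the original chunk):

#create the sub directory and write each 5% sample to it
dir.create("./final/en_US/sub", showWarnings = FALSE)
writeLines(sub_blogs, "./final/en_US/sub/en_US.blogs_sub.txt")
writeLines(sub_news, "./final/en_US/sub/en_US.news_sub.txt")
writeLines(sub_twitter, "./final/en_US/sub/en_US.twitter_sub.txt")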

Create and clean corpus

I will use the subset files in the “sub” directory to create the corpus, using the tm package.

dir <- "./final/en_US/sub"
en_corpus <- VCorpus(DirSource(dir,encoding = "UTF-8"), readerControl = list(language="en"))
inspect(en_corpus)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 10461430
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 773070
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 8133346

To clean the corpus I use the clean_corpus function shown in the Libraries and Functions section. It applies several transformations: stripping extra whitespace, removing punctuation, converting letters to lower case, removing numbers and removing English stop words that are not relevant.

en_corpus <- clean_corpus(en_corpus)

Tokenization with RWeka and plots

An n-gram is a contiguous sequence of n items from a given sample of text or speech; the items can be phonemes, syllables, letters, words or base pairs, according to the application (Wikipedia). I used the RWeka package to build the tokenizers and passed them to the tm function TermDocumentMatrix to obtain the term frequency matrix for each n-gram size from 1 to 4.

UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
tdm1 <- TermDocumentMatrix(en_corpus, control = list(tokenize=UnigramTokenizer))

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm2 <- TermDocumentMatrix(en_corpus, control = list(tokenize = BigramTokenizer))

TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
tdm3 <- TermDocumentMatrix(en_corpus, control = list(tokenize=TrigramTokenizer))

FourgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))
tdm4 <- TermDocumentMatrix(en_corpus, control = list(tokenize=FourgramTokenizer))

I converted the resulting objects into data frames, using the convertTDMtoDF function shown in the Libraries and Functions section, so that I can plot them and look at the 25 most frequent terms of each n-gram.

unigram_freq <- convertTDMtoDF(tdm1)
bigram_freq <- convertTDMtoDF(tdm2)
trigram_freq <- convertTDMtoDF(tdm3)
fourgram_freq <- convertTDMtoDF(tdm4)
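
As an illustration, the top 25 unigrams can be plotted with ggplot2 as follows (a sketch only; the same pattern applies to the bigram, trigram and fourgram data frames, and this is not necessarily the exact code behind the plots below):

#bar chart of the 25 most frequent unigrams
top_uni <- head(unigram_freq, 25)
ggplot(top_uni, aes(x = reorder(terms, num), y = num)) +
  geom_col() +
  coord_flip() +
  labs(x = "Term", y = "Frequency", title = "Top 25 unigrams")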

1-Gram plot

2-Gram plot

3-Gram plot

4-Gram plot

App plans

The objective of the project is to build a text prediction algorithm: the input is a text written by the user, and the output is the most probable word to continue that text.

The idea is to use the n-grams to predict, based on their frequencies, the word the user is looking for to complete the text, as sketched below.
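
A rough sketch of that lookup, using the n-gram frequency data frames built above (the helper name predict_next and the simple backoff from 4-grams down to the unigram fallback are illustrative assumptions, not the final algorithm):

#toy backoff: find n-grams that start with the user's last words
predict_next <- function(text, n = 3){
  words <- unlist(strsplit(tolower(text), "\\s+"))
  for(k in 3:1){                              #try 3-, 2- and 1-word prefixes
    if(length(words) < k) next
    prefix <- paste(tail(words, k), collapse = " ")
    freq <- list(fourgram_freq, trigram_freq, bigram_freq)[[4 - k]]
    hits <- freq[startsWith(as.character(freq$terms), paste0(prefix, " ")), ]
    if(nrow(hits) > 0){
      #the data frames are already sorted by frequency, keep the last word
      return(head(sapply(strsplit(as.character(hits$terms), " "), tail, 1), n))
    }
  }
  head(as.character(unigram_freq$terms), n)   #fallback: most frequent words
}

predict_next("happy new")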

Then I'll build a Shiny app that gives the user a text box and a button, and shows 3 candidate words to complete the input text; a possible skeleton is sketched below.
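
A possible skeleton for that app (a sketch only; it assumes the shiny package, which is not among this report's libraries, and the predict_next helper sketched above):

library(shiny)

ui <- fluidPage(
  textInput("text", "Type your text:"),
  actionButton("go", "Predict"),
  textOutput("suggestions")
)

server <- function(input, output){
  preds <- eventReactive(input$go, predict_next(input$text, n = 3))
  output$suggestions <- renderText(paste(preds(), collapse = " | "))
}

shinyApp(ui, server)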