This is the milestone report for the project, in which I explain the exploratory analysis and the goals for the eventual app and algorithm. The document covers: how to download and read the data, the exploratory analysis, plots illustrating the most common terms and n-grams, and the plans to use them for the Shiny app.
These are the libraries, functions and random seed used throughout the calculations.
#libraries
library(tm)
library(wordcloud)
library(RWeka)
library(ggplot2)
#Functions
#Apply the standard tm cleaning steps to a corpus
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, stripWhitespace)              #collapse extra whitespace
  corpus <- tm_map(corpus, removePunctuation)            #remove punctuation
  corpus <- tm_map(corpus, content_transformer(tolower)) #convert to lower case
  corpus <- tm_map(corpus, removeNumbers)                #remove digits
  corpus <- tm_map(corpus, removeWords, stopwords("en")) #remove English stop words
  return(corpus)
}
#Convert a TermDocumentMatrix into a data frame of terms sorted by frequency
convertTDMtoDF <- function(tdm){
  ngram_m <- as.matrix(tdm)
  ngram_m <- sort(rowSums(ngram_m), decreasing = TRUE)   #total count of each term
  ngram_df <- data.frame(terms = names(ngram_m), num = ngram_m)
  return(ngram_df)
}
#Reproducible
set.seed(1103)
Code for downloading the data from the course URL.
#Download the zip file of the data and extract it
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, destfile = "carpeta1.zip")
unzip("carpeta1.zip",exdir=".")
In this report I will be using the English files: blogs, news and Twitter.
setwd("~/CurseraDataScience/DataScienceCapstone")
#read blogs
blogs <- file("./final/en_US/en_US.blogs.txt","r")
blogs_lines <- readLines(blogs, encoding = "UTF-8",skipNul = TRUE)
close(blogs)
#read news
news <- file("./final/en_US/en_US.news.txt","r")
news_lines <- readLines(news, encoding = "UTF-8",skipNul = TRUE)
close(news)
#read twitter
twitter <- file("./final/en_US/en_US.twitter.txt","r")
twitter_lines <- readLines(twitter, encoding = "UTF-8",skipNul = TRUE)
close(twitter)
This is a summary of the files, including file size (in MB), line count and character count.
size <- file.size(c("./final/en_US/en_US.blogs.txt","./final/en_US/en_US.news.txt","./final/en_US/en_US.twitter.txt"))
size <- round((size/1024)/1000) #approximate size in MB
lines <- sapply(list(blogs_lines,news_lines,twitter_lines), length)
characters <- sapply(list(blogs_lines, news_lines, twitter_lines), function(x){sum(nchar(x))})
sum <- data.frame(files =c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt"),size,lines,characters)
sum
## files size lines characters
## 1 en_US.blogs.txt 205 899288 206824505
## 2 en_US.news.txt 201 77259 15639408
## 3 en_US.twitter.txt 163 2360148 162096241
To make processing faster and to follow the idea of training and test sets, I took a random subset containing 5% of the lines of each file. Each subset is then written to a new file with "sub" added to its name, and these files are saved in the subdirectory "sub" (as sketched after the code below).
sub_blogs <- blogs_lines[sample(length(blogs_lines),length(blogs_lines)*.05)]
sub_news <- news_lines[sample(length(news_lines),length(news_lines)*.05)]
sub_twitter <- twitter_lines[sample(length(twitter_lines),length(twitter_lines)*.05)]
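The write step itself is not shown; a minimal sketch of how the subsets can be saved with writeLines, assuming the subdirectory ./final/en_US/sub and file names chosen here for illustration:
#write each 5% subset to the "sub" subdirectory (file names assumed for illustration)
dir.create("./final/en_US/sub", showWarnings = FALSE)
writeLines(sub_blogs, "./final/en_US/sub/en_US.blogs_sub.txt")
writeLines(sub_news, "./final/en_US/sub/en_US.news_sub.txt")
writeLines(sub_twitter, "./final/en_US/sub/en_US.twitter_sub.txt")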
I use the subset files in the "sub" directory to create the corpus with the "tm" package.
dir <- "./final/en_US/sub"
en_corpus <- VCorpus(DirSource(dir,encoding = "UTF-8"), readerControl = list(language="en"))
inspect(en_corpus)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 10461430
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 773070
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 8133346
To clean the corpus I use the clean_corpus function shown in the functions section. It applies several transformations: converting letters to lower case, stripping extra whitespace, removing punctuation and numbers, and dropping stop words that are not relevant.
en_corpus <- clean_corpus(en_corpus)
An n-gram is "a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application" (Wikipedia). I used the RWeka package to build the tokenizers and passed them to the tm function TermDocumentMatrix to obtain the term frequency matrices for n-grams of order 1 to 4.
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tdm1 <- TermDocumentMatrix(en_corpus, control = list(tokenize = UnigramTokenizer))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm2 <- TermDocumentMatrix(en_corpus, control = list(tokenize = BigramTokenizer))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm3 <- TermDocumentMatrix(en_corpus, control = list(tokenize = TrigramTokenizer))
FourgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
tdm4 <- TermDocumentMatrix(en_corpus, control = list(tokenize = FourgramTokenizer))
I converted the resulting objects into data frames, using the convertTDMtoDF function shown in the functions and libraries section, in order to plot the 25 most frequent terms of each n-gram.
unigram_freq <- convertTDMtoDF(tdm1)
bigram_freq <- convertTDMtoDF(tdm2)
trigram_freq <- convertTDMtoDF(tdm3)
fourgram_freq <- convertTDMtoDF(tdm4)
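The plotting code is not reproduced here; a minimal sketch with ggplot2, shown for the unigrams (the 2-, 3- and 4-gram plots follow the same pattern):
#plot the 25 most frequent unigrams
top25 <- head(unigram_freq, 25)
ggplot(top25, aes(x = reorder(terms, num), y = num)) +
  geom_col() +
  coord_flip() +
  labs(x = "Term", y = "Frequency", title = "Top 25 unigrams")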
1-Gram Plot
2-Gram Plot
3-Gram Plot
4-Gram Plot
The objective of the project is to build a text prediction algorithm: the input is text written by the user, and the output is the most probable word to continue it.
The idea is to use the n-gram frequencies to predict, based on probability, the word the user is looking for to complete the text.
I will then build a Shiny app with a text box and a button that offers 3 candidate words to complete the input text.
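As a rough illustration of the idea (not the final algorithm), a lookup in the n-gram tables can return the most frequent completions of the last word(s) of the input, backing off from higher to lower orders. The helper below is a hypothetical sketch using only the bigram table:
#hypothetical sketch: suggest up to 3 next words from the bigram frequencies
predict_next <- function(input, bigram_df, n = 3){
  words <- unlist(strsplit(tolower(input), "\\s+"))
  last_word <- tail(words, 1)
  #bigrams whose first word matches the last word of the input
  matches <- bigram_df[grepl(paste0("^", last_word, " "), bigram_df$terms), ]
  #keep the second word of the n most frequent matching bigrams
  sub("^\\S+\\s+", "", head(as.character(matches$terms), n))
}
predict_next("happy new", bigram_freq)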