library(tm)
## Loading required package: NLP
library(slam)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
While creating this milestone report I did a lot of research and revised the report several times along the way. For example, I first loaded the data with the readLines function and realized that this was not a good choice for the news file; I also tried the scan function with the same result. Only after opening the news file as a binary connection and passing it to readLines did the loading work correctly.
I also had trouble with RAM usage and processing time. At the beginning I decided to use the entire dataset, but memory constraints made that impractical, so I decided to sample it instead. I combined all three files (twitter, news and blogs), sampled 50,000 "documents" from the combined set, and then created a new text file containing only this sample.
For profanity filtering I found a word list from the user shutterstock on GitHub and saved it as a text file on my machine so that these words could later be removed from my sample.
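As a reference, here is a minimal sketch of how that local file could be created directly from the GitHub list; the raw URL is assumed from the repository link in the references below, and badwords_en.txt is the file name read later in this report.
# Sketch: fetch the shutterstock bad-words list and save it locally as
# badwords_en.txt (raw URL assumed from the repository link below).
badwords_url <- "https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
writeLines(readLines(badwords_url, warn = FALSE), "badwords_en.txt")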
I downloaded the dataset from Coursera (Coursera-SwiftKey.zip), unpacked it, and chose the English dataset, i.e. the "final/en_US" folder, for this milestone report. This folder contains three text files provided by SwiftKey.
setwd("~/Documents/Capstone/final/en_US")
twitter <- readLines("en_US.twitter.txt",skipNul = TRUE)
blogs <- readLines("en_US.blogs.txt", skipNul = TRUE)
con <- paste0("en_US.news.txt")
con <- file(con, open="rb")
news <- readLines(con)
close(con)
rm(con)
en_US <- c(twitter,news,blogs)
As we can see in the summary below, the ratio of words to lines is fairly similar for the blogs and news files (roughly 41 and 34 words per line, respectively), while the twitter file has a much smaller value of only about 13 words per line. This is due to the 140-character limit that the Twitter platform imposes on its users.
setwd("/Users/Leonardo/Documents/Capstone/final/en_US")
fsize = file.info(c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt"))
size.mb = round((fsize$size/1024)/1000)
lines = sapply(list(blogs,news,twitter), length)
words <- sapply(list(blogs,news,twitter), function(x){ NROW(unlist(strsplit(x, split=" ")))})
summary = data.frame(files = row.names(fsize),size_MB = size.mb ,lines,words)
summary
## files size_MB lines words
## 1 en_US.blogs.txt 205 899288 37334131
## 2 en_US.news.txt 201 1010242 34372530
## 3 en_US.twitter.txt 163 2360148 30373583
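As a quick check on the words-per-line figures quoted above, the ratio can be computed directly from the summary data frame built in the chunk above (a small sketch, not part of the original output):
# Approximate words per line for each file.
# From the table above this gives roughly 41.5 (blogs), 34.0 (news) and 12.9 (twitter).
round(summary$words / summary$lines, 1)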
I sampled from all three documents at once; the strategy was to sample entire documents rather than individual words, so I believe this gives a good sample (50,000 documents) for this initial exploratory analysis. I wrote this sample to a new file called sample_US.txt and from now on use this file as the main source for the analysis.
# Sample 50,000 documents from the combined dataset and save them to disk.
sample_US = en_US[sample(length(en_US), 50000)]
write(sample_US, file = "sample_US.txt")
# Remove profanity using the word list saved earlier.
badwords = readLines("badwords_en.txt")
sample_US <- removeWords(sample_US, badwords)
# Build a corpus and apply the basic cleaning transformations.
sample_US <- VectorSource(sample_US)
sample_Corpus <- Corpus(sample_US)
sample_Corpus <- tm_map(sample_Corpus, stripWhitespace)
sample_Corpus <- tm_map(sample_Corpus, removePunctuation)
sample_Corpus <- tm_map(sample_Corpus, removeNumbers)
sample_Corpus <- tm_map(sample_Corpus, tolower)
sample_Corpus <- tm_map(sample_Corpus, stemDocument, language = "english")
# tolower returns plain character vectors, so convert back to PlainTextDocument.
sample_Corpus <- tm_map(sample_Corpus, PlainTextDocument)
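A note for reproducibility: with tm version 0.6 and later, base functions such as tolower are expected to be wrapped in content_transformer() so that the corpus elements keep their document class. A minimal equivalent of the lower-casing step above would be:
# Equivalent lower-casing step for tm >= 0.6; content_transformer() keeps the
# document class, so the PlainTextDocument conversion is then unnecessary.
sample_Corpus <- tm_map(sample_Corpus, content_transformer(tolower))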
Let’s take a look at our corpus and use the TermDocumentMatrix function to analyze the distribution of words in our dataset.
sample_Corpus.tdm <- TermDocumentMatrix(sample_Corpus,control = list(minWordLength = 1))
term_freq <- rowapply_simple_triplet_matrix(sample_Corpus.tdm,sum)
term_freq <- term_freq[order(term_freq,decreasing = T)]
top20 <- as.data.frame(term_freq[1:20])
top20 <- data.frame(words = row.names(top20),top20)
names(top20)[2] = "freq"
row.names(top20) <- NULL
ggplot(data=top20, aes(x=words, y=freq, fill=freq)) + geom_bar(stat="identity") + guides(fill=FALSE)
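By default ggplot2 orders the bars alphabetically; a small optional variant (not part of the original analysis) reorders them by frequency, which makes the ranking easier to read:
# Optional variant: order the bars by decreasing frequency.
ggplot(top20, aes(x = reorder(words, -freq), y = freq, fill = freq)) +
  geom_bar(stat = "identity") + guides(fill = FALSE) + xlab("words")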
Next, remove English stop words for a better understanding of the remaining words.
sample_Corpus <- tm_map(sample_Corpus, removeWords, stopwords("english"))
sample_Corpus.tdm <- TermDocumentMatrix(sample_Corpus,control = list(minWordLength = 1))
term_freq <- rowapply_simple_triplet_matrix(sample_Corpus.tdm,sum)
term_freq <- term_freq[order(term_freq,decreasing = T)]
top20 <- as.data.frame(term_freq[1:20])
top20 <- data.frame(words = row.names(top20),top20)
names(top20)[2] = "freq"
row.names(top20) <- NULL
ggplot(data=top20, aes(x=words, y=freq, fill=freq)) + geom_bar(stat="identity") + guides(fill=FALSE)
For the final project, I don't think it is a good idea to remove these words from our dataset, because our goal is to help the user save typing time no matter which word is typed. The intention here was only to explore the dataset.
As we can see in the cluster dendrogram plot below, some words are more correlated with each other than others; for example, good and day are strongly correlated. This is my starting point for developing an algorithm to predict the next word: calculate these correlations and build a sort of prediction database.
# Keep only terms that appear in at least 3% of the documents.
sample_Corpus.tdm97 <- removeSparseTerms(sample_Corpus.tdm, sparse = 0.97)
# Convert to a dense data frame for scaling and clustering
# (as.matrix avoids printing the whole matrix, unlike inspect()).
sample_Corpus.tdm97 <- as.data.frame(as.matrix(sample_Corpus.tdm97))
sample_Corpus.tdm97_scale <- scale(sample_Corpus.tdm97)
d <- dist(sample_Corpus.tdm97_scale, method = "euclidean") # distance matrix
fit <- hclust(d, method = "ward.D2")
plot(fit)
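One way to start quantifying these associations is tm's findAssocs(), which reports terms whose frequencies are correlated with a given term in the term-document matrix. A small sketch follows; the term "good" and the 0.1 correlation cutoff are illustrative choices, not part of the original analysis.
# Terms correlated with "good" in the term-document matrix, with a
# correlation of at least 0.1 (both choices are illustrative).
findAssocs(sample_Corpus.tdm, "good", 0.1)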
The next step will be developing a text prediction algorithm. First I will build uni-gram, bi-gram and tri-gram models using a larger dataset. Then I will create train/test datasets and try some different algorithms to build a model that predicts the next word.
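As a rough illustration of that direction (not part of this report's analysis), a minimal base-R bi-gram count over the saved sample could look like the sketch below; it ignores document boundaries and reuses the sample_US.txt file written earlier, and sample_text is just a helper name introduced here.
# Minimal bi-gram counter over the saved sample (ignores document boundaries).
sample_text <- readLines("sample_US.txt")
tokens <- unlist(strsplit(tolower(sample_text), "\\s+"))
tokens <- tokens[nzchar(tokens)]
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
head(sort(table(bigrams), decreasing = TRUE), 10)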
I found the list of bad words at: https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en
Another useful site about text analytics, from Matthew Jockers: http://www.matthewjockers.net/materials/dh-2014-introduction-to-text-analysis-and-topic-modeling-with-r/