library(tm)
## Loading required package: NLP
library(slam)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
While creating this milestone report I did a lot of research and revised the report several times along the way. For example, I first loaded the data with the readLines function and realized that this was not a good choice for the news file; I also tried the scan function with the same result. Only after opening the news file as a binary connection and passing it to readLines did the loading work correctly.
I also had trouble with RAM usage and processing time. At the beginning I decided to use the entire dataset, but memory constraints made that impractical, so I decided to sample it instead. I combined all three files (twitter, news and blogs), sampled 50,000 "documents" from the combined set, and then created a new text file containing only this sample.
For profanity filtering I found a word list from the user shutterstock on GitHub and saved it as a text file on my machine so that these words could later be removed from my sample.
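As a reference, here is a minimal sketch of how that local file could be created directly from the GitHub list; the raw URL is assumed from the repository link in the references below, and badwords_en.txt is the file name read later in this report.
# Sketch: fetch the shutterstock bad-words list and save it locally as
# badwords_en.txt (raw URL assumed from the repository link below).
badwords_url <- "https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
writeLines(readLines(badwords_url, warn = FALSE), "badwords_en.txt")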
I downloaded the dataset from Coursera (Coursera-SwiftKey.zip), unpacked it, and chose the English dataset, i.e. the "final/en_US" folder, for this milestone report. This folder contains three text files provided by SwiftKey.
setwd("~/Documents/Capstone/final/en_US")
twitter <- readLines("en_US.twitter.txt",skipNul = TRUE)
blogs <- readLines("en_US.blogs.txt", skipNul = TRUE)
con <- paste0("en_US.news.txt")
con <- file(con, open="rb")
news <- readLines(con)
close(con)
rm(con)
en_US <- c(twitter,news,blogs)
As we can see in the summary below, the ratio of words to lines is fairly similar for the blogs and news files (roughly 41 and 34 words per line, respectively), while the twitter file has a much smaller value of only about 13 words per line. This is due to the 140-character limit that the Twitter platform imposes on its users.
setwd("/Users/Leonardo/Documents/Capstone/final/en_US")
fsize = file.info(c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt"))
size.mb = round((fsize$size/1024)/1000)
lines = sapply(list(blogs,news,twitter), length)
words <- sapply(list(blogs,news,twitter), function(x){ NROW(unlist(strsplit(x, split=" ")))})
summary = data.frame(files = row.names(fsize),size_MB = size.mb ,lines,words)
summary
## files size_MB lines words
## 1 en_US.blogs.txt 205 899288 37334131
## 2 en_US.news.txt 201 1010242 34372530
## 3 en_US.twitter.txt 163 2360148 30373583
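As a quick check on the words-per-line figures quoted above, the ratio can be computed directly from the summary data frame built in the chunk above (a small sketch, not part of the original output):
# Approximate words per line for each file.
# From the table above this gives roughly 41.5 (blogs), 34.0 (news) and 12.9 (twitter).
round(summary$words / summary$lines, 1)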
I sampled from all three documents at once; the strategy was to sample entire documents rather than individual words, so I believe this gives a good sample (50,000 documents) for this initial exploratory analysis. I wrote this sample to a new file called sample_US.txt and from now on use this file as the main source for the analysis.
# Sample 50,000 documents from the combined dataset and save them to disk.
sample_US = en_US[sample(length(en_US), 50000)]
write(sample_US, file = "sample_US.txt")
# Remove profanity using the word list saved earlier.
badwords = readLines("badwords_en.txt")
sample_US <- removeWords(sample_US, badwords)
# Build a corpus and apply the basic cleaning transformations.
sample_US <- VectorSource(sample_US)
sample_Corpus <- Corpus(sample_US)
sample_Corpus <- tm_map(sample_Corpus, stripWhitespace)
sample_Corpus <- tm_map(sample_Corpus, removePunctuation)
sample_Corpus <- tm_map(sample_Corpus, removeNumbers)
sample_Corpus <- tm_map(sample_Corpus, tolower)
sample_Corpus <- tm_map(sample_Corpus, stemDocument, language = "english")
# tolower returns plain character vectors, so convert back to PlainTextDocument.
sample_Corpus <- tm_map(sample_Corpus, PlainTextDocument)
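A note for reproducibility: with tm version 0.6 and later, base functions such as tolower are expected to be wrapped in content_transformer() so that the corpus elements keep their document class. A minimal equivalent of the lower-casing step above would be:
# Equivalent lower-casing step for tm >= 0.6; content_transformer() keeps the
# document class, so the PlainTextDocument conversion is then unnecessary.
sample_Corpus <- tm_map(sample_Corpus, content_transformer(tolower))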
Let’s take a look at our corpus and use the TermDocumentMatrix function to analyze the distribution of words in our dataset.
sample_Corpus.tdm <- TermDocumentMatrix(sample_Corpus,control = list(minWordLength = 1))
term_freq <- rowapply_simple_triplet_matrix(sample_Corpus.tdm,sum)
term_freq <- term_freq[order(term_freq,decreasing = T)]
top20 <- as.data.frame(term_freq[1:20])
top20 <- data.frame(words = row.names(top20),top20)
names(top20)[2] = "freq"
row.names(top20) <- NULL
ggplot(data=top20, aes(x=words, y=freq, fill=freq)) + geom_bar(stat="identity") + guides(fill=FALSE)
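By default ggplot2 orders the bars alphabetically; a small optional variant (not part of the original analysis) reorders them by frequency, which makes the ranking easier to read:
# Optional variant: order the bars by decreasing frequency.
ggplot(top20, aes(x = reorder(words, -freq), y = freq, fill = freq)) +
  geom_bar(stat = "identity") + guides(fill = FALSE) + xlab("words")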
Next, remove English stop words for a better understanding of the remaining words.
sample_Corpus <- tm_map(sample_Corpus, removeWords, stopwords("english"))
sample_Corpus.tdm <- TermDocumentMatrix(sample_Corpus,control = list(minWordLength = 1))
term_freq <- rowapply_simple_triplet_matrix(sample_Corpus.tdm,sum)
term_freq <- term_freq[order(term_freq,decreasing = T)]
top20 <- as.data.frame(term_freq[1:20])
top20 <- data.frame(words = row.names(top20),top20)
names(top20)[2] = "freq"
row.names(top20) <- NULL
ggplot(data=top20, aes(x=words, y=freq, fill=freq)) + geom_bar(stat="identity") + guides(fill=FALSE)
For the final project, I don't think it is a good idea to remove these words from our dataset, because our goal is to help the user save typing time no matter which word is typed. The intention here was only to explore the dataset.
As we can see in the cluster dendrogram plot below, some words are more correlated with each other than others; for example, good and day are strongly correlated. This is my starting point for developing an algorithm to predict the next word: calculate these correlations and build a sort of prediction database.
# Keep only terms that appear in at least 3% of the documents.
sample_Corpus.tdm97 <- removeSparseTerms(sample_Corpus.tdm, sparse = 0.97)
# Convert to a dense data frame for scaling and clustering
# (as.matrix avoids printing the whole matrix, unlike inspect()).
sample_Corpus.tdm97 <- as.data.frame(as.matrix(sample_Corpus.tdm97))
sample_Corpus.tdm97_scale <- scale(sample_Corpus.tdm97)
d <- dist(sample_Corpus.tdm97_scale, method = "euclidean") # distance matrix
fit <- hclust(d, method = "ward.D2")
plot(fit)
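One way to start quantifying these associations is tm's findAssocs(), which reports terms whose frequencies are correlated with a given term in the term-document matrix. A small sketch follows; the term "good" and the 0.1 correlation cutoff are illustrative choices, not part of the original analysis.
# Terms correlated with "good" in the term-document matrix, with a
# correlation of at least 0.1 (both choices are illustrative).
findAssocs(sample_Corpus.tdm, "good", 0.1)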
The next step will be developing a text prediction algorithm. First I will build uni-gram, bi-gram and tri-gram models using a larger dataset. Then I will create train/test datasets and try some different algorithms to build a model that predicts the next word.
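As a rough illustration of that direction (not part of this report's analysis), a minimal base-R bi-gram count over the saved sample could look like the sketch below; it ignores document boundaries and reuses the sample_US.txt file written earlier, and sample_text is just a helper name introduced here.
# Minimal bi-gram counter over the saved sample (ignores document boundaries).
sample_text <- readLines("sample_US.txt")
tokens <- unlist(strsplit(tolower(sample_text), "\\s+"))
tokens <- tokens[nzchar(tokens)]
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
head(sort(table(bigrams), decreasing = TRUE), 10)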
I found the list of bad words at: https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en
Another useful site about text analytics, from Matthew Jockers: http://www.matthewjockers.net/materials/dh-2014-introduction-to-text-analysis-and-topic-modeling-with-r/