This is the milestone report for the Exploratory Analysis section of the Coursera Data Science Capstone project. The goal of the capstone is to analyze a large corpus of text documents, discover the structure of the data and how words are put together, and then use the cleaned data to build a predictive text model.
This report summarizes the findings of that exploratory analysis. It includes the major statistics of the data set along with graphs that illustrate the distribution of words.
library(R.utils)    # general file and system utilities
library(stringi)    # fast string processing and word counts
library(ggplot2)    # plotting
library(ngram)      # n-gram helpers
library(tm)         # text mining: corpora, stopwords, term-document matrices
library(RWeka)      # Weka-based tokenizers
library(tau)        # textcnt() for n-gram counting
library(wordcloud)  # word clouds (loads RColorBrewer for palettes)
setwd("/Users/sandraezidiegwu/Documents/Data Science/Capstone Project/final/en_US/")
swiftkey <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(swiftkey, destfile = "./SwiftKey", method = "curl")
unzip("SwiftKey")
blog <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)  # skipNul = TRUE: some files contain embedded nuls
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
To reduce processing time, I will work with a random sample of 1,000 lines from each source.
set.seed(124)  # fix the seed so the sample is reproducible
blog.s <- sample(blog, 1000)
news.s <- sample(news, 1000)
twitter.s <- sample(twitter, 1000)
words <- c(blog.s, news.s, twitter.s)
write(words, file = "words.txt")  # persist the combined sample
words <- readLines("words.txt")
The table below summarizes each source file alongside the combined 3,000-line sample:

##        File Name File Size (MB) Line Count Word Count
## 1          Blogs         248.49     899288   37334131
## 2           News         249.63    1010242   34372530
## 3        Twitter         301.40    2360148   30373543
## 4 Sample Summary           0.66       3000      89476
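For reference, figures like these can be computed directly from the objects read above; a minimal sketch using stringi, where file.summary() is an illustrative helper and not part of the original analysis:

# Illustrative helper: size on disk, line count, and word count for one source.
file.summary <- function(path, lines) {
  data.frame(File = basename(path),
             Size.MB = round(file.info(path)$size / 1024^2, 2),
             Lines = length(lines),
             Words = sum(stri_count_words(lines)))
}
rbind(file.summary("en_US.blogs.txt", blog),
      file.summary("en_US.news.txt", news),
      file.summary("en_US.twitter.txt", twitter),
      file.summary("words.txt", words))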
Next, I will clean the sampled text:

words <- gsub("http\\S+\\s*", "", words)      # remove URLs
words <- gsub("[[:punct:]]", "", words)       # strip punctuation
words <- gsub("[[:digit:]]", "", words)       # strip digits
words <- gsub("http[[:alnum:]]*", "", words)  # catch URL fragments left after punctuation removal
words <- gsub("^\\s+|\\s+$", "", words)       # trim leading and trailing whitespace
# Convert to lower case
words <- tolower(words)
# Remove stopwords
words <- removeWords(words, stopwords(kind = "en"))
# Remove profanity
words.profanity <- readLines("profanity.csv")
words <- removeWords(words, words.profanity)
# Create corpus
words.corpus <- Corpus(VectorSource(words), readerControl = list(reader = readPlain, language = "en"))
Using the ‘tau’ package, I will create functions that tokenize the text into unigrams, bigrams, and trigrams.
unigramtoken <- function(x, n = 1) rownames(as.data.frame(unclass(textcnt(x$content, method = "string", n = n))))
bigramtoken  <- function(x, n = 2) rownames(as.data.frame(unclass(textcnt(x$content, method = "string", n = n))))
trigramtoken <- function(x, n = 3) rownames(as.data.frame(unclass(textcnt(x$content, method = "string", n = n))))
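As a quick sanity check, the tokenizers can be run on a toy document shaped like a corpus element (illustrative only; textcnt() lowercases its input by default):

# Expected result: the bigrams "the quick", "quick brown", and "brown fox" (order may vary).
bigramtoken(list(content = "The quick brown fox"))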
With these tokenizers, I will build term-document matrices and extract the most frequent terms:

tdm.unigram <- TermDocumentMatrix(words.corpus, control = list(tokenize = unigramtoken))
freq1 <- sort(rowSums(as.matrix(tdm.unigram)), decreasing = TRUE)  # unigram frequencies, highest first
freq.df1 <- data.frame(word = names(freq1), freq = freq1)
head(freq.df1, 10)
##      word freq
## said said  242
## will will  231
## one   one  215
## just just  212
## like like  184
## can   can  168
## get   get  152
## time time  141
## new   new  139
## now   now  135
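Since ggplot2 is already loaded, the same frequencies can be shown as a bar chart; the sketch below plots the top 15 unigrams, and the identical pattern applies to freq.df2 and freq.df3 further down:

# Horizontal bar chart of the 15 most frequent unigrams.
ggplot(head(freq.df1, 15), aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = NULL, y = "Frequency", title = "Top 15 Unigrams in the Sample")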
tdm.bigram <- TermDocumentMatrix(words.corpus, control = list(tokenize = bigramtoken))
freq2 <- sort(rowSums(as.matrix(tdm.bigram)), decreasing = TRUE)
freq.df2 <- data.frame(word = names(freq2), freq = freq2)
head(freq.df2, 10)
##                    word freq
## high school high school   18
## new york       new york   16
## im going       im going   14
## last year     last year   14
## right now     right now   13
## st louis       st louis   13
## years ago     years ago   13
## first time   first time   12
## dont know     dont know   11
## dont think   dont think   10
tdm.trigram <- TermDocumentMatrix(words.corpus, control = list(tokenize = trigramtoken))
freq3 <- sort(rowSums(as.matrix(tdm.trigram)), decreasing = TRUE)
freq.df3 <- data.frame(word = names(freq3), freq = freq3)
head(freq.df3, 10)
##                                           word freq
## president barack obama   president barack obama    5
## new york times                   new york times    4
## im hard time                       im hard time    3
## also great way                   also great way    2
## can honestly say               can honestly say    2
## cant imagine youre           cant imagine youre    2
## cant wait share                 cant wait share    2
## columbia university new columbia university new    2
## couple weeks ago               couple weeks ago    2
## couple years ago               couple years ago    2
Wordcloud of the most frequent words in the sample data set:
wordcloud(freq.df1$word, freq.df1$freq, scale = c(4, 0.8), max.words = 40,
          random.order = FALSE, random.color = TRUE, colors = brewer.pal(8, "Dark2"))