The purpose of this report is to demonstrate that I have become familiar with the text data supplied for the capstone course of the data science certificate. I downloaded the zip file of data and saved it in a local folder. This is the first step toward building a predictive algorithm that determines the next word in a tweet, blog post, or news article.
I will load the data into R, produce basic summary statistics, report any interesting findings, and present my plans for the prediction algorithm and Shiny app.
Set the working directory, load libraries, and initialize the environment.
setwd("/Users/elissachasen/Google Drive/coursera/capstone/final")
library(tm)
library(tidytext)
library(dplyr)
library(ggplot2)
options(stringsAsFactors = FALSE)
Due to the long loading time of the large files, I have created a function that takes a random subset of the data. For each of these files, I chose to work with a subset including 10% of the original file.
smaller <- function(seed, file, prop){
  set.seed(seed)
  # sample row indices: vector of length prop * length(file)
  v <- sample(x = 1:length(file), size = prop * length(file))
  ord <- sort(v)          # put the sampled indices back in order
  newfile <<- file[ord]   # save the subsetted file to the global environment
}
con <- file("en_US/en_US.blogs.txt", "r")
blog <- readLines(con)
close(con)
smaller(27, blog, 0.1)
blog_sub <- newfile
con <- file("en_US/en_US.twitter.txt", "r")
twitter <- readLines(con)
close(con)
smaller(27, twitter, 0.1)
twit_sub <- newfile
con <- file("en_US/en_US.news.txt", "r")
news <- readLines(con)
close(con)
smaller(27, news, 0.1)
news_sub <- newfile
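The three read-and-subset steps above are identical apart from the file path. As a compact alternative (a sketch assuming the same en_US/ directory layout and the smaller() function defined above), the files could be processed in a loop:
# loop over the three files instead of repeating the read/subset steps
paths <- c(blog = "en_US/en_US.blogs.txt",
           twitter = "en_US/en_US.twitter.txt",
           news = "en_US/en_US.news.txt")
subsets <- lapply(paths, function(p){
  con <- file(p, "r")
  lines <- readLines(con)
  close(con)
  smaller(27, lines, 0.1)   # writes newfile to the global environment
  newfile
})
blog_sub <- subsets$blog
twit_sub <- subsets$twitter
news_sub <- subsets$news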
blog_sub[sample(length(blog_sub), size = 1)]
## [1] "Friday, August 13-Owl Be There"
news_sub[sample(length(news_sub), size = 1)]
## [1] "Back at De'Jerica Michaels' apartment, she pointed proudly to a certificate on the wall showing her completion of a Job Corps high school equivalency program. Next, she hopes to go to college."
twit_sub[sample(length(twit_sub), size = 1)]
## [1] "Things went so well! Thank you! Great to see you too! I'm sure I'll see you around. I can't seem to escape"
How many lines does each subset contain? What is the mean number of characters per line, and how long is the longest line in each?
bl <- length(blog_sub)
tl <- length(twit_sub)
nl <- length(news_sub)
# function to print the mean and max number of characters per line
stat <- function(dat){
  l <- length(dat)
  ch <- rep(NA, l)
  for(i in 1:l){
    ch[i] <- nchar(dat[i])
  }
  print(mean(ch))
  print(max(ch))
}
stat(blog_sub)
## [1] 229.8662
## [1] 12409
stat(news_sub)
## [1] 201.5778
## [1] 3555
stat(twit_sub)
## [1] 68.58331
## [1] 140
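Since nchar() is vectorized in R, the same statistics can also be computed without an explicit loop; for example:
# one-line equivalent of the stat() calls above
sapply(list(blog = blog_sub, news = news_sub, twitter = twit_sub),
       function(x) c(mean = mean(nchar(x)), max = max(nchar(x))))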
# summary table; mean and max characters per line are rounded from the stat() output above
df <- data.frame(file = c("twitter", "blog", "news"), length = c(tl, bl, nl),
                 mnchar = c(69, 230, 202), maxchar = c(140, 12409, 3555))
barplot(df$length, names.arg = c("Twitter", "Blog", "News"),
ylab="Number of lines", border = "red")
barplot(df$mnchar, names.arg = c("Twitter", "Blog", "News"),
ylab="Mean # of characters/line", border = "red")
barplot(df$maxchar, names.arg = c("Twitter", "Blog", "News"),
ylab="Max # of characters/line", border = "red")
I will now narrow the analysis to just one of the files (blogs) to keep this report succinct. I want to look at the cleaned term-document matrix in two ways: first without removing stop words, and then again after removing them.
blogC <- Corpus(VectorSource(blog_sub))
# cleaning functions
blogC <- tm_map(blogC, removeNumbers)
blogC <- tm_map(blogC, removePunctuation)
blogC <- tm_map(blogC, content_transformer(tolower))
blogC <- tm_map(blogC, stripWhitespace)
blogC2 <- tm_map(blogC, removeWords, stopwords("en")) # second corpus with stop words removed
# convert to term document matrix
blogTDM <- TermDocumentMatrix(blogC)
blogTDM2 <- TermDocumentMatrix(blogC2)
td <- tidy(blogTDM)
td2 <- tidy(blogTDM2)
aggdata <- td %>%
  group_by(term) %>%
  summarize(counts = sum(count)) %>%
  arrange(desc(counts))
aggdata2 <- td2 %>%
  group_by(term) %>%
  summarize(counts = sum(count)) %>%
  arrange(desc(counts))
top <- aggdata[1:20,]
top$term
## [1] "the" "and" "that" "for" "you" "with" "was" "this"
## [9] "have" "but" "are" "not" "from" "all" "they" "one"
## [17] "about" "its" "what" "out"
top2 <- aggdata2[1:20,]
top2$term
## [1] "one" "will" "just" "like" "can" "time" "get"
## [8] "now" "people" "know" "dont" "new" "also" "even"
## [15] "first" "back" "really" "well" "much" "day"
intersect(top$term, top2$term)
## [1] "one"
gblog <- ggplot(top, aes(x = term, y = counts)) +
geom_bar(stat = "identity") + coord_flip() +
labs(title = "Most Frequent Blog Words, with stop words")
gblog
gblog2 <- ggplot(top2, aes(x = term, y = counts)) +
geom_bar(stat = "identity") + coord_flip() +
labs(title = "Most Frequent Blog Words, stop words removed")
gblog2
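One possible refinement (not reflected in the plots above): because term is a character column, ggplot2 orders the bars alphabetically. Ordering them by count instead makes the ranking easier to read, for example:
# order bars by frequency rather than alphabetically
ggplot(top2, aes(x = reorder(term, counts), y = counts)) +
  geom_bar(stat = "identity") + coord_flip() + xlab("term") +
  labs(title = "Most Frequent Blog Words, stop words removed")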
Removing the stop words discards many of the most commonly used words, which would reduce the power of a model that predicts the next word.
# tokenize the blog subset into bigrams and trigrams and count their frequencies
# (the n-gram column is named bigram in both tables so the plotting code below can be reused)
blog_t2 <- tibble(text = blog_sub) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)
blog_t3 <- tibble(text = blog_sub) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 3) %>%
  count(bigram, sort = TRUE)
top_n2 <- blog_t2[1:20,]
n2blog <- ggplot(top_n2, aes(x = bigram, y = n)) +
geom_bar(stat = "identity") + coord_flip() +
labs(title = "Most Frequent Bigram") +
ylab("count")
n2blog
top_n3 <- blog_t3[1:20,]
n3blog <- ggplot(top_n3, aes(x = bigram, y = n)) +
geom_bar(stat = "identity") + coord_flip() +
labs(title = "Most Frequent Trigram") +
ylab("count")
n3blog
The n-grams will be useful for predicting the next word. I will need to test whether word, bigram, or trigram frequencies differ between the Twitter, blog, and news data. If they differ, I will continue to analyze the data sets separately; if not, I will combine them.
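As a preview of how the n-gram counts could feed the prediction algorithm, the sketch below looks up the last two words of a phrase in the trigram table and backs off to the bigram table when there is no match. predict_next() is a hypothetical helper, not part of the analysis above; it assumes the blog_t2 and blog_t3 tables computed earlier (where the n-gram column is named bigram in both).
# hypothetical helper: predict the next word from the trigram/bigram counts
predict_next <- function(phrase, n = 3){
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  last2 <- paste(tail(words, 2), collapse = " ")
  last1 <- tail(words, 1)
  # candidate trigrams that start with the last two words of the phrase
  hits <- blog_t3 %>% filter(startsWith(bigram, paste0(last2, " ")))
  if(nrow(hits) == 0){
    # back off to bigrams that start with the last word
    hits <- blog_t2 %>% filter(startsWith(bigram, paste0(last1, " ")))
  }
  # the count tables are already sorted, so the top rows are the best guesses
  sub(".* ", "", head(hits$bigram, n))
}
predict_next("one of the")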