Milestone report

The purpose of this report is to demonstrate that I have become familiar with the text data supplied for the capstone class in the Data Science certificate. I have downloaded the zip file of data and saved it in a folder; this is the first step in building a predictive algorithm for determining the next word in a tweet, blog post, or news article.

I will load the data into R, create a basic report of summary statistics, report any interesting findings, and present plans for creating the prediction algorithm and Shiny app.

Set the working directory, load libraries, and initialize the environment

setwd("/Users/elissachasen/Google Drive/coursera/capstone/final")
library(tm)
library(tidytext)
library(dplyr)
library(ggplot2)
options(stringsAsFactors = FALSE)

Load data

Because the full files take a long time to load and process, I have created a function that takes a random subset of the data. For each file, I chose to work with a subset containing 10% of the original lines.

smaller <- function(seed, file, prop){
  set.seed(seed)
  # sample floor(prop * length(file)) line indices without replacement
  v <- sample(x = 1:length(file), size = floor(prop * length(file)))
  ord <- sort(v)  # keep the sampled lines in their original order
  file[ord]       # return the subsetted lines
}

con <- file("en_US/en_US.blogs.txt", "r")
blog <- readLines(con)
close(con)
blog_sub <- smaller(27, blog, 0.1)

con <- file("en_US/en_US.twitter.txt", "r")
twitter <- readLines(con)
close(con)
twit_sub <- smaller(27, twitter, 0.1)

con <- file("en_US/en_US.news.txt", "r")
news <- readLines(con)
close(con)
news_sub <- smaller(27, news, 0.1)
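
As a quick sanity check, each subset should contain roughly 10% of the lines of its source file:

# proportion of lines retained in each subset
length(blog_sub) / length(blog)
length(twit_sub) / length(twitter)
length(news_sub) / length(news)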

Look at a random line from each of the files

blog_sub[sample(length(blog_sub), size = 1)]
## [1] "Friday, August 13-Owl Be There"
news_sub[sample(length(news_sub), size = 1)]
## [1] "Back at De'Jerica Michaels' apartment, she pointed proudly to a certificate on the wall showing her completion of a Job Corps high school equivalency program. Next, she hopes to go to college."
twit_sub[sample(length(twit_sub), size = 1)]
## [1] "Things went so well! Thank you! Great to see you too! I'm sure I'll see you around. I can't seem to escape"

Examine each text set:

How many lines does each data set contain? What is the average number of characters per line, and how long is the longest line?

bl <- length(blog_sub)
tl <- length(twit_sub)
nl <- length(news_sub)
#function to print the mean and max number of characters per line
stat <- function(dat){
  ch <- nchar(dat)  # number of characters in each line (vectorized)
  print(mean(ch))
  print(max(ch))
}

stat(blog_sub)
## [1] 229.8662
## [1] 12409
stat(news_sub)
## [1] 201.5778
## [1] 3555
stat(twit_sub)
## [1] 68.58331
## [1] 140

Make a data frame of the files and their descriptive statistics

df <- data.frame(file = c("twitter", "blog", "news"), length = c(tl, bl, nl),
                 mnchar = c(69, 230, 202), maxchar = c(140, 12409, 3555))
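
The mean and max character counts above were entered by hand from the stat() output. As a sketch of an alternative (not the approach used here, and with subs and df_auto as names introduced only for this sketch), the same summary could be computed directly from the subsets:

# alternative: compute the summary programmatically instead of hard-coding it
subs <- list(twitter = twit_sub, blog = blog_sub, news = news_sub)
df_auto <- data.frame(file    = names(subs),
                      length  = sapply(subs, length),
                      mnchar  = sapply(subs, function(x) round(mean(nchar(x)))),
                      maxchar = sapply(subs, function(x) max(nchar(x))))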

Create the corpus, clean it, and convert it to a term-document matrix

I will now focus on just one of the files (blogs) to keep this analysis succinct. I want to examine the cleaned term-document matrix in two ways: first without removing stop words, and then again after the stop words are removed.

blogC <- Corpus(VectorSource(blog_sub))
# cleaning functions
blogC <- tm_map(blogC, removeNumbers)
blogC <- tm_map(blogC, removePunctuation)
blogC <- tm_map(blogC, content_transformer(tolower))
blogC <- tm_map(blogC, stripWhitespace)
blogC2 <- tm_map(blogC, removeWords, stopwords("en"))
# convert to term document matrix
blogTDM <- TermDocumentMatrix(blogC)
blogTDM2 <- TermDocumentMatrix(blogC2)
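
As a quick check that the matrices were built as expected (an extra step, with an arbitrarily chosen cutoff), tm's findFreqTerms() lists the terms that appear at least a given number of times:

findFreqTerms(blogTDM, lowfreq = 1000)   # frequent terms, stop words included
findFreqTerms(blogTDM2, lowfreq = 1000)  # frequent terms, stop words removed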

Use tidytext to create a data set of terms, documents, and counts, then aggregate the counts by term.

td <- tidy(blogTDM)
td2 <- tidy(blogTDM2)
aggdata <-
  td %>%
  group_by(term) %>%
  summarize(counts = sum(count)) %>%
  arrange(desc(counts))

aggdata2 <-
  td2 %>%
  group_by(term) %>%
  summarize(counts = sum(count)) %>%
  arrange(desc(counts))

What are the 20 most common terms in the blog data?

top <- aggdata[1:20,]
top$term
##  [1] "the"   "and"   "that"  "for"   "you"   "with"  "was"   "this" 
##  [9] "have"  "but"   "are"   "not"   "from"  "all"   "they"  "one"  
## [17] "about" "its"   "what"  "out"

What about when the stop words are removed?

top2 <- aggdata2[1:20,]
top2$term
##  [1] "one"    "will"   "just"   "like"   "can"    "time"   "get"   
##  [8] "now"    "people" "know"   "dont"   "new"    "also"   "even"  
## [15] "first"  "back"   "really" "well"   "much"   "day"

Which words overlap?

intersect(top$term, top2$term)
## [1] "one"

Visualize the words and their frequencies.

gblog <- ggplot(top, aes(x = term, y = counts)) + 
  geom_bar(stat = "identity") + coord_flip() +
  labs(title = "Most Frequent Blog Words, with stop words")
gblog

gblog2 <- ggplot(top2, aes(x = term, y = counts)) + 
  geom_bar(stat = "identity") + coord_flip() +
  labs(title = "Most Frequent Blog Words, stop words removed")
gblog2
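
One possible refinement, not applied above, is to order the bars by frequency with reorder() so that the most common terms appear at the top of the flipped axis:

# same plot as gblog2, with terms ordered by count instead of alphabetically
ggplot(top2, aes(x = reorder(term, counts), y = counts)) + 
  geom_bar(stat = "identity") + coord_flip() +
  labs(title = "Most Frequent Blog Words, stop words removed", x = "term")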

By removing the stop words, we lose many of the most commonly used words, which would reduce our power to predict the next word.
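
To put a rough number on that loss, the share of all word occurrences in the blog subset that are stop words can be estimated from the aggregated counts (a quick sketch, using tm's English stop word list):

# approximate proportion of all word occurrences that are stop words
sum(aggdata$counts[aggdata$term %in% stopwords("en")]) / sum(aggdata$counts)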

Looking for n-grams

blog_t2 <- tibble(text = blog_sub) %>% 
        unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
        count(bigram, sort = TRUE)
blog_t3 <- tibble(text = blog_sub) %>% 
        unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
        count(trigram, sort = TRUE)

Visualize the 20 most common bigrams and trigrams

top_n2 <- blog_t2[1:20,]
n2blog <- ggplot(top_n2, aes(x = bigram, y = n)) + 
  geom_bar(stat = "identity") + coord_flip() +
  labs(title = "Most Frequent Bigram") + 
  ylab("count")
n2blog

top_n3 <- blog_t3[1:20,]
n3blog <- ggplot(top_n3, aes(x = trigram, y = n)) + 
  geom_bar(stat = "identity") + coord_flip() +
  labs(title = "Most Frequent Trigrams") +
  ylab("count")
n3blog

The n-grams will be useful for predicting the next word. I will need to conduct an analysis to see whether word, bigram, or trigram frequencies differ between the Twitter, blog, and news data. If they differ, I will continue to analyze the data sets separately; if not, I will combine them. A sketch of one way to make that comparison is shown below.
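
The helper below is a sketch of that comparison for bigrams; top_bigrams is a name introduced only here, not a function used elsewhere in this report:

# hypothetical helper: the k most common bigrams in a set of lines
top_bigrams <- function(lines, k = 20){
  counts <- tibble(text = lines) %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    count(bigram, sort = TRUE)
  counts$bigram[1:k]
}
# how many of the top 20 blog bigrams also appear in the other top-20 lists?
length(intersect(top_bigrams(blog_sub), top_bigrams(twit_sub)))
length(intersect(top_bigrams(blog_sub), top_bigrams(news_sub)))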