Capstone Week 2 - Milestone Report

Introduction

The purpose of this project is to:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Step one is to load all required libraries and create some initial variables.

library(stringr)
library(stringi)
library(hunspell)
library(DescTools)
library(ggplot2)
library(knitr)
library(kableExtra)
library(gtable)
library(grid)
library(scales)

urltext <- c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt")
#urltext <- c("enUS.mini.txt","enUS.mini.txt","enUS.mini.txt")
datatype <- c("Blogs","News","Twitter")

The Problem

Three very large text files have been provided for analysis, with the end purpose of developing a predictive text model. In this phase the intent is to get to know the files and determine whether there are any key differences between them. The files contain blog posts, Twitter posts, and news articles. For this milestone only single-word (unigram) analysis is undertaken.

Step 1 - Read and Parse Text

In this step each of the three files is read by a common function, proc_file (defined in the Appendix), and key elements are returned for further analysis.

# Empty data frame to hold one summary row per data source
info_stor <- data.frame(Data_Type = "", 
                        Lines = "", 
                        Characters = "",
                        Words = "", 
                        Percentile_50th = "", 
                        Percentile_75th = "",
                        Percentile_90th = "",
                        Percentile_95th = "", stringsAsFactors = FALSE)

for (i in 1:3){
      # Process the file and store its summary statistics in row i
      data_values <- proc_file(urltext[i])
      info_stor[i,1] <- datatype[i]
      for (j in 2:8){
            info_stor[i,j] <- data_values[[j]]
      }

      # Keep the full word frequency table of each source for plotting later
      if (i == 1) {
            blogswords <- data.frame(data_values[1])
      } else if (i == 2) {
            newswords <- data.frame(data_values[1])
      } else {
            twitterwords <- data.frame(data_values[1])
      }
}

Step 2 - Examine a Summary of the Data

Begin by looking at similarities and differences between the data files. The table below shows the relative sizes in lines, characters, and distinct words (after removing non-word character strings). It also shows how many unique words it takes to cover 50 percent, 75 percent, 90 percent, and 95 percent of the total word count.

kable(info_stor) %>% kable_styling(bootstrap_options = c("striped","hover","condensed"))
Data_Type      Lines   Characters   Words   Percentile_50th   Percentile_75th   Percentile_90th   Percentile_95th
Blogs         899288    199231704   62348                83               849              4153              8676
News           77259     14916062   36035               119              1162              4478              8488
Twitter      2360148    152968969   51277                88               559              2597              5479

From this table we see some differences between the three file types. Most notably, the number of words required to reach 50%, 75%, 90%, and 95% of the cumulative word frequency differs significantly, with the Twitter file requiring far fewer words.
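The coverage counts in the last four columns can be reproduced from any sorted word frequency table by walking its cumulative frequency. The sketch below is illustrative only: it takes the first word at or above each threshold rather than using DescTools::Closest as the report's proc_file function does, so results may differ by one.

# Minimal sketch: how many distinct words are needed to cover a given
# fraction of all word occurrences. freq_table is assumed to have
# columns word and Freq (names are illustrative).
words_for_coverage <- function(freq_table, pct) {
      freq_table <- freq_table[order(-freq_table$Freq), ]       # most frequent first
      cumpct <- cumsum(freq_table$Freq) / sum(freq_table$Freq)  # cumulative share of all occurrences
      which(cumpct >= pct)[1]                                   # first rank reaching the threshold
}

# Toy example: thresholds for a four-word table
toy <- data.frame(word = c("the", "to", "and", "cat"), Freq = c(50, 30, 15, 5))
sapply(c(0.50, 0.75, 0.90, 0.95), function(p) words_for_coverage(toy, p))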

Step 3 - Additional Analysis

It is also useful to look at plots of the most frequently used words in each of the three files. The following three graphs show both the cumulative frequency percentage and the word count frequency for each of the three data sources: Blogs, News, and Twitter.

for (i in 1:3) {
      graphtitle <- paste("Cumulative Percentage and Frequencies of the 40 Most Used Words in ",
                          info_stor[i,1], " Posts", sep='')
      # mval is the total word count, used to rescale the secondary axis from percentages to raw counts
      if (i == 1) {
            mval <- blogswords[1,2]/blogswords[1,3]
            run_plot(blogswords, graphtitle, mval)
      } else if (i == 2) {
            mval <- newswords[1,2]/newswords[1,3]
            run_plot(newswords, graphtitle, mval)
      } else {
            mval <- twitterwords[1,2]/twitterwords[1,3]
            run_plot(twitterwords, graphtitle, mval)
      }
}

From the plots we see that there are some similarities and some differences in the frequencies of the words. For example, the pronouns “I”, “my”, and “you” appear much higher in the Twitter list than in the other two lists. The word “about” shows up in the News and Blogs lists, but not in the Twitter list. It will be especially interesting to look at the differences in the bigrams and trigrams in future analysis.

Next Steps

The next step is to extend the analysis to n-grams (bigrams and trigrams) in each of the data files. That work leads toward the predictive model and, ultimately, a Shiny app that predicts the next word of a typed-in sentence.
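As an illustration of the planned n-gram step (not code from this report), bigram counts can be built from the same kind of cleaned word vector that proc_file already produces. In the sketch below, cleanwords is a hypothetical stand-in for that cleaned, lower-cased word vector.

# Illustrative sketch of bigram counting for the next phase.
# cleanwords stands in for the cleaned, lower-cased word vector
# produced inside proc_file (the name is hypothetical).
cleanwords <- c("thanks", "for", "the", "follow", "thanks", "for", "the", "memories")

# Pair each word with the word that follows it, then tabulate and sort
bigrams <- paste(cleanwords[-length(cleanwords)], cleanwords[-1])
bigram.freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram.freq)

The same idea extends to trigrams by pasting three consecutive words, which will feed the frequency tables behind the prediction algorithm.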

Appendix

Function to process each data file

This function reads the data file for each of the three sources and returns the word frequency table plus the summary statistics used in the report above.

# This is the function that is used to process each of the data files
proc_file <- function(urlname){
      options(warn = -1)   # suppress readLines warnings (e.g., incomplete final line)
      sentences <- readLines(urlname, skipNul = TRUE)
      options(warn = 0)
      #Remove periods and commas
      sentences <- gsub("\\.", "", sentences)
      sentences <- gsub("\\,", "", sentences)
      sentences <- str_trim(sentences)                # trim leading and trailing whitespace
      sentences <- gsub("[^A-Za-z ]", "", sentences)  # strip any remaining non-alphabetic characters
      sentences <- tolower(sentences)                 # set everything to lower case
      #Get number of words, lines, and characters
      words.lines <- stri_stats_general(sentences)['Lines']
      words.chars <- stri_stats_general(sentences)['Chars']
      #Split sentence into words
      words<-strsplit(sentences," ")
      words<- as.character(unlist(words))
      correct <- hunspell_check(words)   # flag tokens found in the English dictionary
      correctwords <- data.frame(words = words, correct = correct, stringsAsFactors = FALSE)
      correctwords <- subset(correctwords, correct)              # keep only dictionary words
      correctwords <- correctwords[correctwords$words != "", ]   # drop empty tokens
      words <- as.vector(as.character(correctwords[,1]))
      #Calculate word frequencies
      words.freq <- table(words)
      words.table <- as.data.frame(words.freq)   # columns: words, Freq
      words.table$Freq <- as.integer(words.table$Freq)
      words.table <- words.table[order(-words.table$Freq),]
      row.names(words.table) <- 1:nrow(words.table)
      #Add per-word and cumulative frequency percentages
      words.table$freqpct <- words.table$Freq/sum(words.table$Freq)
      words.table$cumpct <- cumsum(words.table$Freq)/sum(words.table$Freq)
      words.total <- nrow(words.table)
      #Find the number of distinct words whose cumulative percentage is closest to each threshold
      words.50 <- Closest(words.table$cumpct, 0.50, which = TRUE)
      words.75 <- Closest(words.table$cumpct, 0.75, which = TRUE)
      words.90 <- Closest(words.table$cumpct, 0.90, which = TRUE)
      words.95 <- Closest(words.table$cumpct, 0.95, which = TRUE)
      #Return the ordered frequency table plus the summary statistics
      return.list <- list("table" = words.table,
                          "Lines" = words.lines,
                          "Characters" = words.chars,
                          "Words" = words.total,
                          "Words at 50th pct" = words.50,
                          "Words at 75th pct" = words.75,
                          "Words at 90th pct" = words.90,
                          "Words at 95th pct" = words.95)
      return(return.list)
}

run_plot <- function(tn, title, mval) {

      # This function produces the overlaid plots. The word frequency table, the chart title, and
      # the multiplier value for the secondary y axis are passed in; everything else is automatic.

      grid.newpage()
      # Red bars: cumulative frequency percentage (primary axis); the secondary axis is
      # rescaled by mval to show raw word counts
      p1 <- ggplot(data = tn[1:40,], aes(x = reorder(table.words, table.cumpct))) +
            geom_bar(aes(y = table.cumpct), stat = "identity", fill = "red", alpha = 0.50) +
            scale_y_continuous(sec.axis = sec_axis(trans = ~ . * mval,
                                                   name = "Word Count Frequency",
                                                   labels = comma)) +
            theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
      p1 <- p1 + labs(x = "Words", y = "Cumulative Frequency", title = title)
      p1 <- p1 + theme(axis.text.x = element_text(angle = 90, hjust = 1))
      p1 <- p1 + annotate(geom = "text", x = 10, y = 0.40, hjust = 0,
                          label = "Word Count is Blue", color = "blue")
      p1 <- p1 + annotate(geom = "text", x = 10, y = 0.37, hjust = 0,
                          label = "Cum Freq Pct is Red", color = "red")

      # Blue bars: raw word count frequency, drawn on a transparent panel so it can be overlaid on p1
      p2 <- ggplot(data = tn[1:40,], aes(x = reorder(table.words, table.cumpct))) +
            geom_bar(aes(y = table.Freq), stat = "identity", fill = "blue", alpha = 0.50) +
            theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) %+replace%
            theme(panel.background = element_rect(fill = NA))
 

# extract gtable
      g1 <- ggplot_gtable(ggplot_build(p1))
      g2 <- ggplot_gtable(ggplot_build(p2))

# overlap the panel of the 2nd plot on that of the 1st plot
      pp <- c(subset(g1$layout, name == "panel", select = t:r))
      g <- gtable_add_grob(g1, g2$grobs[[which(g2$layout$name == "panel")]],
                           pp$t, pp$l, pp$b, pp$l)
      grid.draw(g)
    
}