Data Science Capstone Week 2

Milestone Report

The goal of this report is to explain my exploratory analysis and my goals for the eventual Shiny app and text prediction algorithm. The motivation for this report is to:

1. Demonstrate that I have downloaded the data and successfully loaded it.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that I have amassed so far.
4. Get feedback on my plans for creating a prediction algorithm and Shiny app.

Preliminary data analysis

Load the dependencies and read in the 3 files (Blogs, Twitter and News) to understand their size, word structure and frequency.

library(readtext)
library(stringr)
library(reshape2)
library(ggplot2)
library(gridExtra)

d <- "D:\\Coursera Data Science\\Data Science Capstone\\final\\en_US\\"

f <- list.files(path = d)

FileInfo <- lapply(f, function(x){
  rFile <- paste0(d,x)
  # File size in megabytes
  fsize <- (file.info(rFile)$size)/(1024*1024)
  con <- file(rFile, open = "r")
  rLines <- readLines(con)
  # Close the connection here; code placed after return() never runs
  close(con)
  # nchar() is vectorised, so no lapply() is needed
  nChars <- nchar(rLines)
  aveChars <- mean(nChars)
  # which.max() returns the index of the longest line, not its length
  maxChars <- which.max(nChars)
  nWords <- str_count(rLines,pattern = "\\w+")
  aveWords <- mean(nWords)
  return(c(x, round(fsize,2), length(rLines), maxChars, round(aveChars,2), round(aveWords,2)))
})

df <- data.frame(t(sapply(FileInfo,c)))
colnames(df) <- c("File.Name", "Size(MB)", "No.Lines", "Max.Chars", "Ave.Chars", "Ave.Words")
df

##           File.Name Size(MB) No.Lines Max.Chars Ave.Chars Ave.Words
## 1   en_US.blogs.txt   200.42   899288    483415     231.7     42.92
## 2    en_US.news.txt   196.28    77259     14556       203     35.67
## 3 en_US.twitter.txt   159.36  2360148   1484357      68.8     13.19

From the table, the 3 files are of broadly similar size (roughly 160 to 200 MB). Next, we explore the structure of their contents in terms of the number of lines, characters and words.
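Because each row of the summary is built with c() from a file name plus numbers, every value in the table is stored as text (or as a factor in older R versions) rather than as a number. A minimal sketch of converting such columns back to numeric before plotting, using a hypothetical two-row table with placeholder values (summary_df and its contents are assumptions for illustration, not the real results):

```r
# Hypothetical summary table with placeholder values stored as text
summary_df <- data.frame(
  File.Name = c("a.txt", "b.txt"),
  Size.MB   = c("1.50", "2.25"),
  No.Lines  = c("10", "20"),
  stringsAsFactors = FALSE
)

# Convert every column except the file name to numeric
num_cols <- setdiff(names(summary_df), "File.Name")
summary_df[num_cols] <- lapply(summary_df[num_cols], as.numeric)

str(summary_df)
```

Converting once up front avoids repeating a conversion inside each plot call.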

df_melt <- melt(df, id="File.Name")
# as.numeric() keeps the decimal part that as.integer() would truncate
g <- ggplot(df_melt[df_melt$variable %in% c("No.Lines", "Max.Chars"),]) +
  aes(x=File.Name, y=as.numeric(value), fill=variable) +
  geom_bar(stat = "identity", position = "dodge")
g2 <- ggplot(df_melt[df_melt$variable %in% c("Ave.Chars", "Ave.Words"),]) +
  aes(x=File.Name, y=as.numeric(value), fill=variable) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_hue(l=30)
grid.arrange(g, g2, nrow=1)

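The figure on the left shows the number of lines and the Max.Chars value for the 3 files. Note that Max.Chars is computed with which.max(), so it is the position of the longest line in each file rather than its character count. We can see that Twitter contains by far the most lines. This is expected, as each Twitter entry is short (140-280 characters). Blogs and News, by contrast, contain more words per line, often whole paragraphs.

The figure on the right shows the average characters and words per line for each content type. Blogs and News have similar characteristics, whereas Twitter lines are much shorter in both characters and words.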
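The Max.Chars values come from which.max(), which reports where the longest line sits, not how long it is. A minimal sketch of the difference on a hypothetical three-line vector (sample_lines is made up for illustration and is not taken from the corpus):

```r
# Hypothetical sample lines, not taken from the corpus
sample_lines <- c("short", "a much longer line of text", "mid length")

n_chars <- nchar(sample_lines)

longest_index  <- which.max(n_chars)  # position of the longest line
longest_length <- max(n_chars)        # character count of that line

longest_index
longest_length
```

If the character count of the longest line is the quantity of interest, max() on the nchar() result is the call to use.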

We can conclude that Blogs and News share many common characteristics whereas Twitter is the odd one amongst the 3. Do we want to include Twitter in building our prediction model?

I would say yes. If we build the model with only Blogs and News, it would skew towards a certain way of writing. Since we are not building an app for a particular kind of user, we have to include other ways of writing that are more informal and shorter in nature.
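One possible way to keep all three styles represented is to draw a random sample of lines from each source and pool them into a single training corpus. The sketch below assumes this approach; the helper sample_lines(), the 50% fraction, the seed and the toy vectors are all hypothetical choices for illustration, not part of the analysis above.

```r
set.seed(2018)  # arbitrary seed so the sample is reproducible

# Hypothetical helper: draw a random fraction of the lines (expects at least one line)
sample_lines <- function(lines, fraction) {
  n <- as.integer(ceiling(length(lines) * fraction))
  lines[sample.int(length(lines), n)]
}

# Toy stand-ins for the three sources
blogs   <- c("blog one", "blog two", "blog three", "blog four")
news    <- c("news one", "news two", "news three", "news four")
twitter <- c("tweet one", "tweet two", "tweet three", "tweet four")

# Pool an equal fraction from each source
corpus <- c(sample_lines(blogs, 0.5),
            sample_lines(news, 0.5),
            sample_lines(twitter, 0.5))

length(corpus)
```

Sampling the same fraction from each file preserves their relative line counts; a fixed number of lines per source would instead weight the three styles equally.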

Chuk Yong

9 August 2018