The goal of this report is to explain my exploratory analysis and my goals for the eventual Shiny app and text prediction algorithm. The motivation for this report is to:
1. Demonstrate that I've downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that I have amassed so far.
4. Get feedback on my plans for creating a prediction algorithm and Shiny app.
Load the dependencies and read in the 3 files (Blogs, News and Twitter) to understand their size, word structure and frequency.
library(readtext)
library(stringr)
library(reshape2)
library(ggplot2)
library(gridExtra)
d <- "D:\\Coursera Data Science\\Data Science Capstone\\final\\en_US\\"
f <- list.files(path = d)
FileInfo <- lapply(f, function(x){
  rFile <- paste0(d, x)
  fsize <- file.info(rFile)$size / (1024 * 1024)  # file size in MB
  con <- file(rFile, open = "r")
  rLines <- readLines(con)
  close(con)  # close before returning; code placed after return() never runs
  nChars <- nchar(rLines)  # nchar() is vectorised, no lapply() needed
  aveChars <- mean(nChars)
  maxChars <- which.max(nChars)  # position of the longest line, not its length
  nWords <- str_count(rLines, pattern = "\\w+")
  aveWords <- mean(nWords)
  c(x, round(fsize, 2), length(rLines), maxChars, round(aveChars, 2), round(aveWords, 2))
})
df <- data.frame(t(sapply(FileInfo,c)))
colnames(df) <- c("File.Name", "Size(MB)", "No.Lines", "Max.Chars", "Ave.Chars", "Ave.Words")
df
##           File.Name Size(MB) No.Lines Max.Chars Ave.Chars Ave.Words
## 1   en_US.blogs.txt   200.42   899288    483415     231.7     42.92
## 2    en_US.news.txt   196.28    77259     14556       203     35.67
## 3 en_US.twitter.txt   159.36  2360148   1484357      68.8     13.19
From the table, the 3 files are comparable in size (roughly 160 to 200 MB each). We will explore the structure of their contents further in terms of number of lines, characters and words.
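The table's own numbers hint that the News file may not have been read in full: 77,259 lines of about 203 characters each is roughly 15.7 million characters, well short of a 196 MB file, while the Blogs and Twitter figures are consistent with their sizes. One possible cause, stated here as an assumption rather than something this report confirms, is a control character that ends a text-mode read early on Windows; opening the connection in binary mode (`"rb"`) is a common workaround. The sketch below shows one way to check how much of a file the read actually covered. The helper `read_coverage` and the temporary file are hypothetical and only for illustration.

```r
# Hypothetical helper: fraction of the file's bytes accounted for by the lines read.
read_coverage <- function(path, lines) {
  # +1 per line for the newline terminator (assumes "\n" line endings)
  bytes_read <- sum(nchar(lines, type = "bytes") + 1)
  bytes_read / file.size(path)
}

# Toy demonstration on a temporary file rather than the real corpus
tmp <- tempfile(fileext = ".txt")
out <- file(tmp, open = "wb")          # binary mode so "\n" is written as-is
writeLines(c("first line", "second line", "third line"), out, sep = "\n")
close(out)

con <- file(tmp, open = "rb")          # binary mode read
lines <- readLines(con, skipNul = TRUE)
close(con)

coverage <- read_coverage(tmp, lines)  # close to 1 means the whole file was read
unlink(tmp)
```

A coverage value far below 1 for one of the corpus files would suggest that its line count and averages describe only part of the file.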
df_melt <- melt(df, id = "File.Name")
# as.numeric() keeps the decimals of the averages; as.integer() would truncate them
g <- ggplot(df_melt[df_melt$variable %in% c("No.Lines", "Max.Chars"), ]) +
  aes(x = File.Name, y = as.numeric(value), fill = variable) +
  geom_bar(stat = "identity", position = "dodge")
g2 <- ggplot(df_melt[df_melt$variable %in% c("Ave.Chars", "Ave.Words"), ]) +
  aes(x = File.Name, y = as.numeric(value), fill = variable) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_hue(l = 30)
grid.arrange(g, g2, nrow = 1)
The figure on the left shows the number of lines and the Max.Chars value for the 3 files. Note that Max.Chars is computed with which.max, so it is the position of the longest line rather than its length. We can see that Twitter contains the most lines. This is expected, as tweets are short (limited to 140 characters, later 280), whereas Blogs and News entries carry more words per line, often a whole paragraph.
The figure on the right shows the average characters and words per line for each content type. Blogs and News have similar characteristics, whereas Twitter lines are much shorter (about 69 characters and 13 words on average).
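Since which.max returns a position, a summary that reports both the position and the length of the longest line may be easier to interpret. The following is a minimal sketch on a toy vector, using base R only; the helper `summarise_lines` and the sample lines are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: per-file line statistics from a character vector.
summarise_lines <- function(lines) {
  n_chars <- nchar(lines)
  # regmatches() yields character(0) for lines with no match, so lengths() gives 0
  n_words <- lengths(regmatches(lines, gregexpr("\\w+", lines)))
  list(
    n_lines     = length(lines),
    longest_idx = which.max(n_chars),  # position of the longest line
    max_chars   = max(n_chars),        # length of the longest line
    ave_chars   = round(mean(n_chars), 2),
    ave_words   = round(mean(n_words), 2)
  )
}

toy <- c("short line", "a considerably longer line of text", "")
stats <- summarise_lines(toy)
str(stats)
```

Applied to the real files, `max_chars` would be the value to plot next to the line counts, while `longest_idx` only says where that line sits in the file.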
We can conclude that Blogs and News share many characteristics, whereas Twitter is the odd one out among the 3. Do we want to include Twitter in building our prediction model?
I would say yes. If we built the model with only Blogs and News, it would skew towards a more formal, long-form way of writing. Since we are not building an app for a particular kind of user, we have to include other ways of writing that are more informal and shorter in nature.
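One way to include all three sources without letting the file with the most lines dominate is to draw the same number of lines from each. This is only a sketch of that idea under my own assumptions: the toy `corpora` list, the helper `sample_lines` and the value of `n_per_source` are hypothetical, and the real sample size would need to be chosen with memory and run time in mind.

```r
set.seed(2024)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical stand-ins for the lines read from each file
corpora <- list(
  blogs   = c("blog line one", "blog line two", "blog line three"),
  news    = c("news line one", "news line two", "news line three", "news line four"),
  twitter = c("tweet 1", "tweet 2", "tweet 3", "tweet 4", "tweet 5", "tweet 6")
)

# Draw up to n lines from one source, without replacement
sample_lines <- function(lines, n) {
  lines[sample.int(length(lines), size = min(n, length(lines)))]
}

n_per_source <- 2
training <- lapply(corpora, sample_lines, n = n_per_source)
sapply(training, length)  # number of sampled lines per source
```

Sampling by line count gives each source equal weight in lines but not in words, since Twitter lines are much shorter; sampling to a target word count per source would be an alternative design.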