This is the Week 2 Milestone report for the Coursera Data Science Capstone Project.
The goal of this report is to perform exploratory analysis on the given raw data in order to understand the statistical properties of the data set, which can later be used when building the initial prediction model for the final Shiny app product. This report identifies the major features of the provided data and then briefly summarizes the plan for creating the prediction algorithm.
The predictive model will be trained using a document corpus compiled from three sources of text data:
- Blogs
- Twitter
- News
The model will focus only on the English corpora; corpora in three other languages are also available in the dataset.
The data was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Due to the large size of the data, I downloaded it from the above link, unzipped it, and stored it locally in the folder Coursera-SwiftKey to avoid downloading and unzipping it each time the report is run.
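For reproducibility, the same data could also be fetched from within R. The sketch below is purely illustrative (the files used in this report were downloaded manually) and assumes the Coursera-SwiftKey folder layout used later in this report.
# Illustrative only: download and unzip the dataset if it is not already present
zipurl  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipfile <- "Coursera-SwiftKey.zip"
if (!dir.exists("Coursera-SwiftKey")) {
  download.file(zipurl, destfile = zipfile, mode = "wb")  # large download
  unzip(zipfile, exdir = "Coursera-SwiftKey")             # creates Coursera-SwiftKey/final/...
}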
# Loading the necessary packages
library(knitr)
library(dplyr)
library(ggplot2)
# Clearing the workspace including hidden objects
rm(list=ls(all.names = TRUE))
# Setting the working directory
setwd("C:/Users/ashwi/OneDrive/Documents/Coursera/Data Science Capstone/Project Data/Coursera-SwiftKey/final/en_US")
Reading the Blogs, News, and Twitter Files
# Twitter
# (passing the file path directly lets readLines open and close the connection itself)
twitterfile <- readLines("~/Coursera/Data Science Capstone/Project Data/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# Blogs
blogfile <- readLines("~/Coursera/Data Science Capstone/Project Data/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
# News
newsfile <- readLines("~/Coursera/Data Science Capstone/Project Data/Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
# Looking at a few lines from each file
blogfile[c(5, 10)]
## [1] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
## [2] "Peter Schiff: Hard to tell. It will look pretty bad for most Americans when prices will go way up and they can’t afford to buy stuff. It could also get very bad as far as loss of individual liberty. A lot of people will blame it on capitalism, on freedom, and they will claim we need more government. It could be used as an impetus for more regulation, which would be a disaster, or it could be an impetus to get rid of all the regulation that was causing the problem. But whether we will do the right or the wrong thing here in America, there will be a lot of pain first. We got some serious problems we have to deal with, but we are not dealing with the problems, we only make the problems worse."
twitterfile[c(12, 24)]
## [1] "Dammnnnnn what a catch" "Tommorows the day..."
newsfile[c(3, 16)]
## [1] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
## [2] "Andrade's children, Erin and son Patrick, are adults, and he said he was intrigued by the opportunity to play a mentoring role in the life of a high school student."
Checking the file sizes, word counts, line counts, character counts, and longest lines of the three files
# Checking the size of each file in megabytes (MB)
blogname <- "en_US.blogs.txt"
twittername <- "en_US.twitter.txt"
newsname <- "en_US.news.txt"
MBsize <- round(file.info(c(blogname,
twittername,
newsname))$size/1024^2)
# Checking the number of lines for all three files
bloglines <- length(blogfile)
twitterlines <- length(twitterfile)
newslines <- length(newsfile)
file_lines <- c(bloglines, twitterlines, newslines)
# Checking the number of characters
blogchar <- sum(nchar(blogfile))
twitterchar <- sum(nchar(twitterfile))
newschar <- sum(nchar(newsfile))
file_char <- c(blogchar, twitterchar, newschar)
# Checking the number of words
blogwords <- sum(sapply(strsplit(blogfile, " "), length))
twitterwords <- sum(sapply(strsplit(twitterfile, " "), length))
newswords <- sum(sapply(strsplit(newsfile, " "), length))
file_words <- c(blogwords, twitterwords, newswords)
# Finding the number of characters in the longest line in each file
blogll <- max(nchar(blogfile))
twitterll <- max(nchar(twitterfile))
newsll <- max(nchar(newsfile))
file_ll <- c(blogll, twitterll, newsll)
Finding the mean words per line for each of the three files
mblogwpl <- mean(sapply(strsplit(blogfile, " "), length))
mtwitterwpl <- mean(sapply(strsplit(twitterfile, " "), length))
mnewswpl <- mean(sapply(strsplit(newsfile, " "), length))
mfile_wpl <- c(mblogwpl, mtwitterwpl, mnewswpl)
Creating a summary of all the above characteristics
# Creating a data frame with values of all characteristics seen above
summary <- data.frame(Name = c("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt"),
Size = paste(MBsize,"MB"))
summary$Lines = file_lines
summary$Characters = file_char
summary$Words = file_words
summary$LongestLine = file_ll
summary$MeanWordsperLine = mfile_wpl
summary
## Name Size Lines Characters Words LongestLine
## 1 en_US.blogs.txt 200 MB 899288 206824505 37334131 40833
## 2 en_US.twitter.txt 159 MB 2360148 162096241 30373583 140
## 3 en_US.news.txt 196 MB 77259 15639408 2643969 5760
## MeanWordsperLine
## 1 41.51521
## 2 12.86936
## 3 34.22215
From the table, we can see that the file sizes range from about 160 MB to 200 MB, with the Twitter file being the smallest and the Blogs file the largest. The Twitter file has by far the most lines (2,360,148), compared with 899,288 for the Blogs file and 77,259 for the News file. The word counts exceed 30 million for the Blogs and Twitter files, while the News file contains roughly 2.6 million words. In terms of the number of characters, the Blogs file is the largest (206,824,505) and the News file the smallest (15,639,408).
It is interesting to see that the Twitter file has by far the shortest longest line (140 characters), which can be attributed to the limit Twitter places on the length of a single tweet. The News and Blogs files are not restricted in this way, so their longest lines are much longer: 5,760 and 40,833 characters respectively.
Plotting the file sizes, line counts, and mean words per line as bar charts for the three files
# Filesize Plot (adding a numeric size column, since summary$Size is a text label)
summary$SizeMB <- MBsize
plot1 <- ggplot(summary, aes(x = Name, y = SizeMB)) +
  geom_bar(stat = "identity", fill = "blue", alpha = 0.7) +
  labs(title = "File Size Graph") +
  xlab("File Name") + ylab("File Size in Megabytes")
print(plot1)
# Line Count Plot
plot2 <- ggplot(summary, aes(Name, Lines, fill = Name)) +
geom_bar(stat = "identity") +
labs(title = "Lines in each file") +
xlab("File Name") + ylab("Number of Lines")
print(plot2)
# Mean Words per Line Plot for all three files
plot3 <- ggplot(summary, aes(Name, MeanWordsperLine)) +
geom_bar(stat = "identity", fill = "magenta4", alpha = 0.5, width = 0.5) +
labs(title = "Mean Words Per Line for three files") +
xlab("Name of the File") + ylab("Mean Words per Line")+
theme_minimal()
print(plot3)
The final goal of the Capstone project is to build a predictive algorithm, deployed as a Shiny app, that takes a phrase (multiple words) as input and predicts the most likely next word.
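As a purely illustrative sketch of what that could look like (no model has been built yet, and both the function and the pair_counts lookup table below are hypothetical), the core of the app might be a function that looks up the most frequent continuations of the last word of the phrase observed in the corpus:
# Hypothetical sketch: pair_counts is assumed to be a data frame of observed
# word pairs with columns prev_word, next_word, and count
predict_next_word <- function(phrase, pair_counts, n = 3) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]       # split the phrase into words
  candidates <- pair_counts[pair_counts$prev_word == tail(words, 1), ]
  candidates <- candidates[order(-candidates$count), ]  # most frequent continuations first
  head(candidates$next_word, n)                         # return the top n predictions
}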
The exploratory analysis above has shown the main characteristics of the data, i.e. the word counts, line counts, character counts, mean words per line, longest lines, and file sizes.
A generic way to move forward with building the prediction model will be to:
1. Convert uppercase letters to lowercase (un-capitalize the words)
2. Remove words and characters that are in a language other than English
3. Remove punctuation characters
4. Filter out profanity
5. Randomly sample the combined dataset to make the algorithm faster, so that the app can run the algorithm in the background and simply return its predictions as output; otherwise, the lag would make the app unusable and/or inaccurate. A minimal sketch of these steps is shown below.
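The following is a minimal base R sketch of steps 1 to 5, under a few assumptions: profanity_words stands in for a profanity word list that has not yet been chosen, and the 5% sampling fraction is arbitrary. Sampling is done first here so the cleaning only runs on the sample; the final implementation may instead rely on a dedicated text-mining package.
# Illustrative cleaning and sampling sketch (base R); profanity_words is a
# hypothetical character vector of terms to filter out
set.seed(1234)
combined <- c(blogfile, twitterfile, newsfile)
sampled  <- sample(combined, size = round(0.05 * length(combined)))  # step 5: random 5% sample

cleaned <- tolower(sampled)                            # step 1: lowercase
cleaned <- iconv(cleaned, "UTF-8", "ASCII", sub = "")  # step 2: drop non-English (non-ASCII) characters
cleaned <- gsub("[[:punct:]]", " ", cleaned)           # step 3: remove punctuation
cleaned <- gsub("\\s+", " ", trimws(cleaned))          # collapse leftover whitespace

# step 4: drop lines containing any term from the (hypothetical) profanity list
profanity_pattern <- paste(profanity_words, collapse = "|")
cleaned <- cleaned[!grepl(profanity_pattern, cleaned)]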