This is the Week 2 Milestone report for the Coursera Data Science Capstone Project.
The goal of this report is to perform exploratory analysis on the given raw data in order to understand the statistical properties of the data set, which can later be used when building the initial prediction model for the final Shiny app product. This report identifies the major features of the provided data and then briefly summarizes the plan for creating the prediction algorithm.
The predictive model will be trained using a document corpus compiled from three sources of text data:
- Blogs
- Twitter
- News
The model will focus only on the English corpora; corpora in three other languages are also available in the dataset.
The data was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Due to the large size of the data, I downloaded it from the above link, unzipped it, and stored it locally in the folder Coursera-SwiftKey to avoid downloading and unzipping it each time the report is run.
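For reproducibility, the same data could also be fetched from within R. The sketch below is purely illustrative (the files used in this report were downloaded manually) and assumes the Coursera-SwiftKey folder layout used later in this report.
# Illustrative only: download and unzip the dataset if it is not already present
zipurl  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipfile <- "Coursera-SwiftKey.zip"
if (!dir.exists("Coursera-SwiftKey")) {
  download.file(zipurl, destfile = zipfile, mode = "wb")  # large download
  unzip(zipfile, exdir = "Coursera-SwiftKey")             # creates Coursera-SwiftKey/final/...
}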
# Loading the necessary packages
library(knitr)
library(dplyr)
library(ggplot2)
# Clearing the workspace including hidden objects
rm(list=ls(all.names = TRUE))
# Setting the working directory
setwd("C:/Users/ashwi/OneDrive/Documents/Coursera/Data Science Capstone/Project Data/Coursera-SwiftKey/final/en_US")
Reading the Blogs, News, and Twitter Files
# Twitter
# (passing the file path directly lets readLines open and close the connection itself)
twitterfile <- readLines("~/Coursera/Data Science Capstone/Project Data/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# Blogs
blogfile <- readLines("~/Coursera/Data Science Capstone/Project Data/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
# News
newsfile <- readLines("~/Coursera/Data Science Capstone/Project Data/Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
# Looking at a few lines from each file
blogfile[c(5, 10)]
## [1] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
## [2] "Peter Schiff: Hard to tell. It will look pretty bad for most Americans when prices will go way up and they can’t afford to buy stuff. It could also get very bad as far as loss of individual liberty. A lot of people will blame it on capitalism, on freedom, and they will claim we need more government. It could be used as an impetus for more regulation, which would be a disaster, or it could be an impetus to get rid of all the regulation that was causing the problem. But whether we will do the right or the wrong thing here in America, there will be a lot of pain first. We got some serious problems we have to deal with, but we are not dealing with the problems, we only make the problems worse."
twitterfile[c(12, 24)]
## [1] "Dammnnnnn what a catch" "Tommorows the day..."
newsfile[c(3, 16)]
## [1] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
## [2] "Andrade's children, Erin and son Patrick, are adults, and he said he was intrigued by the opportunity to play a mentoring role in the life of a high school student."
Checking the file sizes, word counts, line counts, character counts, and longest lines of the three files
# Checking the size of each file in megabytes (MB)
blogname <- "en_US.blogs.txt"
twittername <- "en_US.twitter.txt"
newsname <- "en_US.news.txt"
MBsize <- round(file.info(c(blogname,
twittername,
newsname))$size/1024^2)
# Checking the number of lines for all three files
bloglines <- length(blogfile)
twitterlines <- length(twitterfile)
newslines <- length(newsfile)
file_lines <- c(bloglines, twitterlines, newslines)
# Checking the number of characters
blogchar <- sum(nchar(blogfile))
twitterchar <- sum(nchar(twitterfile))
newschar <- sum(nchar(newsfile))
file_char <- c(blogchar, twitterchar, newschar)
# Checking the number of words
blogwords <- sum(sapply(strsplit(blogfile, " "), length))
twitterwords <- sum(sapply(strsplit(twitterfile, " "), length))
newswords <- sum(sapply(strsplit(newsfile, " "), length))
file_words <- c(blogwords, twitterwords, newswords)
# Finding the number of characters in the longest line in each file
blogll <- max(nchar(blogfile))
twitterll <- max(nchar(twitterfile))
newsll <- max(nchar(newsfile))
file_ll <- c(blogll, twitterll, newsll)
Finding the mean words per line for each of the three files
mblogwpl <- mean(sapply(strsplit(blogfile, " "), length))
mtwitterwpl <- mean(sapply(strsplit(twitterfile, " "), length))
mnewswpl <- mean(sapply(strsplit(newsfile, " "), length))
mfile_wpl <- c(mblogwpl, mtwitterwpl, mnewswpl)
Creating a summary of all the above characteristics
# Creating a data frame with values of all characteristics seen above
summary <- data.frame(Name = c("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt"),
Size = paste(MBsize,"MB"))
summary$Lines = file_lines
summary$Characters = file_char
summary$Words = file_words
summary$LongestLine = file_ll
summary$MeanWordsperLine = mfile_wpl
summary
## Name Size Lines Characters Words LongestLine
## 1 en_US.blogs.txt 200 MB 899288 206824505 37334131 40833
## 2 en_US.twitter.txt 159 MB 2360148 162096241 30373583 140
## 3 en_US.news.txt 196 MB 77259 15639408 2643969 5760
## MeanWordsperLine
## 1 41.51521
## 2 12.86936
## 3 34.22215
From the table, we can see that the file sizes range from about 160 MB to 200 MB, with the Twitter file being the smallest and the Blogs file the largest. The Twitter file has by far the most lines (2,360,148), compared with 899,288 for the Blogs file and 77,259 for the News file. The word counts exceed 30 million for the Blogs and Twitter files, while the News file contains roughly 2.6 million words. In terms of the number of characters, the Blogs file is the largest (206,824,505) and the News file the smallest (15,639,408).
It is interesting to see that the Twitter file has by far the shortest longest line (140 characters), which can be attributed to the limit Twitter places on the length of a single tweet. The News and Blogs files are not restricted in this way, so their longest lines are much longer: 5,760 and 40,833 characters respectively.
Plotting the file sizes, line counts, and mean words per line as bar charts for the three files
# Filesize Plot (adding a numeric size column, since summary$Size is a text label)
summary$SizeMB <- MBsize
plot1 <- ggplot(summary, aes(x = Name, y = SizeMB)) +
  geom_bar(stat = "identity", fill = "blue", alpha = 0.7) +
  labs(title = "File Size Graph") +
  xlab("File Name") + ylab("File Size in Megabytes")
print(plot1)
# Line Count Plot
plot2 <- ggplot(summary, aes(Name, Lines, fill = Name)) +
geom_bar(stat = "identity") +
labs(title = "Lines in each file") +
xlab("File Name") + ylab("Number of Lines")
print(plot2)
# Mean Words per Line Plot for all three files
plot3 <- ggplot(summary, aes(Name, MeanWordsperLine)) +
geom_bar(stat = "identity", fill = "magenta4", alpha = 0.5, width = 0.5) +
labs(title = "Mean Words Per Line for three files") +
xlab("Name of the File") + ylab("Mean Words per Line")+
theme_minimal()
print(plot3)
The final goal of the Capstone project is to build a predictive algorithm, deployed as a Shiny app, that takes a phrase (multiple words) as input and predicts the most likely next word.
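As a purely illustrative sketch of what that could look like (no model has been built yet, and both the function and the pair_counts lookup table below are hypothetical), the core of the app might be a function that looks up the most frequent continuations of the last word of the phrase observed in the corpus:
# Hypothetical sketch: pair_counts is assumed to be a data frame of observed
# word pairs with columns prev_word, next_word, and count
predict_next_word <- function(phrase, pair_counts, n = 3) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]       # split the phrase into words
  candidates <- pair_counts[pair_counts$prev_word == tail(words, 1), ]
  candidates <- candidates[order(-candidates$count), ]  # most frequent continuations first
  head(candidates$next_word, n)                         # return the top n predictions
}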
The exploratory analysis above has shown the main characteristics of the data, i.e. the word counts, line counts, character counts, mean words per line, longest lines, and file sizes.
A generic way to move forward with building the prediction model will be to:
1. Convert uppercase letters to lowercase (un-capitalize the words)
2. Remove words and characters that are in a language other than English
3. Remove punctuation characters
4. Filter out profanity
5. Randomly sample the combined dataset to make the algorithm faster, so that the app can run the algorithm in the background and simply return its predictions as output; otherwise, the lag would make the app unusable and/or inaccurate. A minimal sketch of these steps is shown below.
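The following is a minimal base R sketch of steps 1 to 5, under a few assumptions: profanity_words stands in for a profanity word list that has not yet been chosen, and the 5% sampling fraction is arbitrary. Sampling is done first here so the cleaning only runs on the sample; the final implementation may instead rely on a dedicated text-mining package.
# Illustrative cleaning and sampling sketch (base R); profanity_words is a
# hypothetical character vector of terms to filter out
set.seed(1234)
combined <- c(blogfile, twitterfile, newsfile)
sampled  <- sample(combined, size = round(0.05 * length(combined)))  # step 5: random 5% sample

cleaned <- tolower(sampled)                            # step 1: lowercase
cleaned <- iconv(cleaned, "UTF-8", "ASCII", sub = "")  # step 2: drop non-English (non-ASCII) characters
cleaned <- gsub("[[:punct:]]", " ", cleaned)           # step 3: remove punctuation
cleaned <- gsub("\\s+", " ", trimws(cleaned))          # collapse leftover whitespace

# step 4: drop lines containing any term from the (hypothetical) profanity list
profanity_pattern <- paste(profanity_words, collapse = "|")
cleaned <- cleaned[!grepl(profanity_pattern, cleaned)]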