This is the first Milestone Project within the Data Science Capstone. We have been provided three English text documents.
The goal of this assignment is to begin working with the data and to lay the groundwork for a predictive model that predicts the next word in a sentence.
I will use RStudio and R Version 3.3.2 (2016-10-31) “Sincere Pumpkin Patch”.
The report is written in R Markdown, rendered to HTML with the knitr package, and published to RPubs.
The required libraries are loaded and a parallel cluster is built in order to improve performance.
The install.packages commands are provided, commented out, for anyone who wishes to reproduce this document.
#install.packages("dplyr")
#install.packages("doParallel")
#install.packages("stringi")
#install.packages("tm")
#install.packages("slam")
#install.packages("ggplot2")
#install.packages("wordcloud")
Initializing Libraries:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(stringi)
library(tm)
## Loading required package: NLP
library(slam)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(wordcloud)
## Loading required package: RColorBrewer
library(SnowballC)
Initializing Parallel Libraries:
jobcluster <- makeCluster(detectCores())
invisible(clusterEvalQ(jobcluster, library(tm)))
invisible(clusterEvalQ(jobcluster, library(slam)))
invisible(clusterEvalQ(jobcluster, library(stringi)))
invisible(clusterEvalQ(jobcluster, library(wordcloud)))
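A cluster created with makeCluster only speeds up work that is explicitly sent to it, and it should be released with stopCluster when the work is done. The following is a minimal sketch of that life cycle using the base parallel package; the names demo_cluster, demo_lines and demo_counts, the two-worker setting and the toy input are assumptions for illustration, not part of this report. With doParallel, calling registerDoParallel(jobcluster) is what lets foreach loops written with %dopar% use the cluster.

```r
library(parallel)  # ships with base R

# Assumption: two workers for illustration; the report uses detectCores()
demo_cluster <- makeCluster(2)

# Hypothetical toy input standing in for lines of text
demo_lines <- c("one two three", "four five", "six")

# Count whitespace-separated tokens, one line per task, on the workers
demo_counts <- parSapply(
  demo_cluster, demo_lines,
  function(x) length(strsplit(x, "\\s+")[[1]]),
  USE.NAMES = FALSE
)
print(demo_counts)

# Release the worker processes when finished
stopCluster(demo_cluster)
```

For inputs this small the cost of starting workers outweighs any gain; the pattern only pays off on large character vectors such as the three files loaded below.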
The data used was provided by SwiftKey and Johns Hopkins University.
#Blogs Data
conn <- file("C:/Users/kyle_/Documents/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", open = "rb")
blogs <- readLines(conn, encoding = "UTF-8")
close(conn)
#News Data
conn <- file("C:/Users/kyle_/Documents/Coursera-SwiftKey/final/en_US/en_US.news.txt", open = "rb")
news <- readLines(conn, encoding = "UTF-8")
close(conn)
#Twitter Data
conn <- file("C:/Users/kyle_/Documents/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", open = "rb")
twits <- readLines(conn, encoding = "UTF-8")
## Warning in readLines(conn, encoding = "UTF-8"): line 167155 appears to
## contain an embedded nul
## Warning in readLines(conn, encoding = "UTF-8"): line 268547 appears to
## contain an embedded nul
## Warning in readLines(conn, encoding = "UTF-8"): line 1274086 appears to
## contain an embedded nul
## Warning in readLines(conn, encoding = "UTF-8"): line 1759032 appears to
## contain an embedded nul
close(conn)
rm(conn)
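The embedded nul warnings above mean that readLines stopped reading those four Twitter lines at the nul byte. One possible way to handle this, sketched below under stated assumptions, is the skipNul argument of readLines, which skips nul bytes instead of ending the line there. The temporary file and its contents are hypothetical stand-ins for en_US.twitter.txt.

```r
# Hypothetical temp file standing in for en_US.twitter.txt
tmp <- tempfile(fileext = ".txt")

# Write two lines, the first containing an embedded nul byte (0x00)
writeBin(
  c(charToRaw("hello"), as.raw(0), charToRaw(" world\n"),
    charToRaw("second line\n")),
  tmp
)

# skipNul = TRUE skips the nul byte instead of ending the line at it
demo_lines <- readLines(tmp, encoding = "UTF-8", skipNul = TRUE)
print(demo_lines)

# Clean up the temporary file
unlink(tmp)
```

Whether to skip nuls or to drop the affected lines is a judgment call; with only four affected lines out of roughly 2.36 million, either choice should have little effect on the counts.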
The basic analysis of the three English files follows. This analysis focuses on the following counts: lines, non-empty lines, characters, non-whitespace characters, total words, and words per line (WPL).
Based on this we establish the following:
# Word/Line for Data Types
rawWPL<-lapply(list(blogs,news,twits),function(x) stri_count_words(x))
# Summary
rawstats<-data.frame(
File=c("blogs","news","twitter"),
t(rbind(sapply(list(blogs,news,twits),stri_stats_general),
TotalWords=sapply(list(blogs,news,twits),stri_stats_latex)[4,])),
# Compute words per line summary
WPL=rbind(summary(rawWPL[[1]]),summary(rawWPL[[2]]),summary(rawWPL[[3]]))
)
print(rawstats)
## File Lines LinesNEmpty Chars CharsNWhite TotalWords WPL.Min.
## 1 blogs 899288 899288 206824382 170389539 37570839 0
## 2 news 1010242 1010242 203223154 169860866 34494539 1
## 3 twitter 2360148 2360148 162096031 134082634 30451128 1
## WPL.1st.Qu. WPL.Median WPL.Mean WPL.3rd.Qu. WPL.Max.
## 1 9 28 41.75 60 6726
## 2 19 32 34.41 46 1796
## 3 7 12 12.75 18 47
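The words-per-line distributions of all three files are right-skewed: in each case the mean exceeds the median, and the maximum lies far above the third quartile. Extreme values are presumed to be outliers or sampling errors.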
# Histogram for Data Types
qplot(rawWPL[[1]],geom="histogram",main="Histogram for US Blogs",
xlab="No. of Words",ylab="Frequency",binwidth=10)
qplot(rawWPL[[2]],geom="histogram",main="Histogram for US News",
xlab="No. of Words",ylab="Frequency",binwidth=10)
qplot(rawWPL[[3]],geom="histogram",main="Histogram for US Twits",
xlab="No. of Words",ylab="Frequency",binwidth=10)
rm(rawWPL, rawstats)
The three documents differ greatly in their words per line. In order to build a predictive model I will need to identify and remove outliers.
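The report does not say how outliers will be identified. One common option, shown here purely as an assumption and not as the method this project uses, is Tukey's rule: flag values more than 1.5 times the interquartile range beyond the first or third quartile. The vector wpl is hypothetical toy data, not taken from the files.

```r
# Hypothetical words-per-line counts; the last value mimics a very long line
wpl <- c(7, 9, 12, 12, 13, 15, 18, 20, 47, 600)

# Tukey fences: 1.5 * IQR beyond the first and third quartiles
q <- quantile(wpl, c(0.25, 0.75))
fence_low  <- q[[1]] - 1.5 * IQR(wpl)
fence_high <- q[[2]] + 1.5 * IQR(wpl)

# Flag and drop values outside the fences
is_outlier  <- wpl < fence_low | wpl > fence_high
wpl_trimmed <- wpl[!is_outlier]
print(wpl[is_outlier])
```

Because the distributions are right-skewed, a rule like this flags only long lines; a percentile cutoff or a cap on line length would be reasonable alternatives.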
The model will also require analyzing the frequency distribution of words and of word combinations.
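As a rough illustration of what such a frequency analysis could look like, the sketch below counts two-word combinations (bigrams) in base R. The toy corpus, the simple tokenization rule and the helper make_ngrams are all assumptions made for this example; the report loads the tm package, which may well be used for this step instead.

```r
# Hypothetical toy corpus standing in for a cleaned sample of the text
corpus <- c("the cat sat on the mat", "the cat ate the fish")

# Lowercase each line and split it into word tokens
tokens <- strsplit(tolower(corpus), "[^a-z']+")

# Build n-grams within one line so they never span a line break
make_ngrams <- function(words, n) {
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- unlist(lapply(tokens, make_ngrams, n = 2))

# Frequency table, most common first
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
print(head(bigram_freq, 3))
```

Calling make_ngrams with n = 1 or n = 3 gives single-word and three-word counts in the same way.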