Introduction

This is the first Milestone Project within the Data Science Capstone. We have been provided with three English text documents:

  1. Blogs
  2. News
  3. Twitter

The goal of this assignment is to begin working with the data and to lay the groundwork for a predictive model. The model will predict the next word in a sentence.

Methods

I will use RStudio and R Version 3.3.2 (2016-10-31) “Sincere Pumpkin Patch”.

The report is written in R Markdown, rendered to HTML with the knitr package, and then published to RPubs.

Set Up

The required libraries will be loaded and a parallel cluster will be built in order to improve performance.

The install.packages() commands are provided, commented out, for anyone who wishes to reproduce this document.

#install.packages("dplyr")
#install.packages("doParallel")
#install.packages("stringi")
#install.packages("tm")
#install.packages("slam")
#install.packages("ggplot2")
#install.packages("wordcloud")
#install.packages("SnowballC")

Initializing Libraries:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(stringi)
library(tm)
## Loading required package: NLP
library(slam)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(wordcloud)
## Loading required package: RColorBrewer
library(SnowballC)

Initializing Parallel Libraries:

jobcluster <- makeCluster(detectCores())
invisible(clusterEvalQ(jobcluster, library(tm)))
invisible(clusterEvalQ(jobcluster, library(slam)))
invisible(clusterEvalQ(jobcluster, library(stringi)))
invisible(clusterEvalQ(jobcluster, library(wordcloud)))
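
The report creates the cluster but does not show it doing any work. As a minimal sketch of how such a cluster could be used and then released, the example below counts tokens per line on a toy vector. The helper `count_tokens`, the toy input, and the choice of two workers are assumptions made for illustration, not part of the original analysis.

```r
library(parallel)

# Hypothetical helper: count whitespace-separated tokens in each line (base R only).
count_tokens <- function(lines) {
  lengths(strsplit(trimws(lines), "\\s+"))
}

# Toy input standing in for a chunk of the real corpus.
toy_lines <- c("the quick brown fox", "hello world", "a b c d e")

# Two workers keep the sketch light; the report uses detectCores().
cl <- makeCluster(2)

# parLapply() splits the input across the workers and preserves input order.
counts <- unlist(parLapply(cl, toy_lines, count_tokens))

# Release the worker processes once the parallel work is finished.
stopCluster(cl)

print(counts)
```

Calling `stopCluster()` at the end of a session avoids leaving idle worker processes behind.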

Load Data

The data were provided by SwiftKey and Johns Hopkins University.

#Blogs Data

conn <- file("C:/Users/kyle_/Documents/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", open = "rb")
blogs <- readLines(conn, encoding = "UTF-8")
close(conn)

#News Data

conn <- file("C:/Users/kyle_/Documents/Coursera-SwiftKey/final/en_US/en_US.news.txt", open = "rb")
news <- readLines(conn, encoding = "UTF-8")
close(conn)

#Twitter Data

conn <- file("C:/Users/kyle_/Documents/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", open = "rb")
twits <- readLines(conn, encoding = "UTF-8")
## Warning in readLines(conn, encoding = "UTF-8"): line 167155 appears to
## contain an embedded nul
## Warning in readLines(conn, encoding = "UTF-8"): line 268547 appears to
## contain an embedded nul
## Warning in readLines(conn, encoding = "UTF-8"): line 1274086 appears to
## contain an embedded nul
## Warning in readLines(conn, encoding = "UTF-8"): line 1759032 appears to
## contain an embedded nul
close(conn)

rm(conn)
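
The "embedded nul" warnings above come from nul bytes in the Twitter file. One possible way to avoid them is the `skipNul` argument of `readLines()`. The sketch below wraps the read in a hypothetical helper, `read_corpus`, and demonstrates it on a temporary file rather than the real data; whether dropping the nul bytes is appropriate for the final model is an assumption to check.

```r
# Hypothetical helper (not in the original report): read a UTF-8 text file,
# skipping embedded nul bytes instead of warning about them.
read_corpus <- function(path) {
  conn <- file(path, open = "rb")
  on.exit(close(conn))  # close the connection even if readLines() fails
  readLines(conn, encoding = "UTF-8", skipNul = TRUE)
}

# Demonstration on a temporary file whose second line contains a nul byte.
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("first line\nsec"), as.raw(0), charToRaw("ond line\n")), tmp)
demo_lines <- read_corpus(tmp)
unlink(tmp)

print(demo_lines)
```

The same helper could replace the three repeated file/readLines/close sequences, since only the path differs between them.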

Exploratory Analysis

A basic analysis of the three English files follows. It focuses on the following counts:

  1. Line Count
  2. Character Count
  3. Word Count
  4. Words Per Line

The summary output below establishes the following:

  1. The mean number of words per line for blogs is 41.75.
  2. The mean number of words per line for news is 34.41.
  3. The mean number of words per line for Twitter is 12.75.

# Word/Line for Data Types
rawWPL <- lapply(list(blogs, news, twits), stri_count_words)

# Summary
rawstats <- data.frame(
            File = c("blogs", "news", "twitter"),
            # Line and character counts (stri_stats_general) plus total words
            # (fourth row of stri_stats_latex), one row per file
            t(rbind(sapply(list(blogs, news, twits), stri_stats_general),
                    TotalWords = sapply(list(blogs, news, twits), stri_stats_latex)[4, ])),
            # Compute words per line summary
            WPL = rbind(summary(rawWPL[[1]]), summary(rawWPL[[2]]), summary(rawWPL[[3]]))
            )
print(rawstats)
##      File   Lines LinesNEmpty     Chars CharsNWhite TotalWords WPL.Min.
## 1   blogs  899288      899288 206824382   170389539   37570839        0
## 2    news 1010242     1010242 203223154   169860866   34494539        1
## 3 twitter 2360148     2360148 162096031   134082634   30451128        1
##   WPL.1st.Qu. WPL.Median WPL.Mean WPL.3rd.Qu. WPL.Max.
## 1           9         28    41.75          60     6726
## 2          19         32    34.41          46     1796
## 3           7         12    12.75          18       47

Creation of Histograms

The words-per-line distributions of all three sources are right-skewed. Extreme values are presumed to be outliers or sampling errors.
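
As a quick numeric check of the skew, the sketch below compares the mean and median words per line reported in the summary table above. A mean above the median is only a heuristic sign of a long right tail, not a formal test; the column names `MeanOverMedian` and `RightSkewed` are introduced here for illustration.

```r
# Mean and median words per line, copied from the summary table above.
wpl <- data.frame(
  File   = c("blogs", "news", "twitter"),
  Mean   = c(41.75, 34.41, 12.75),
  Median = c(28, 32, 12)
)

# A mean above the median suggests a long right tail.
wpl$MeanOverMedian <- round(wpl$Mean / wpl$Median, 2)
wpl$RightSkewed <- wpl$Mean > wpl$Median

print(wpl)
```

By this measure the blogs file shows the strongest skew, which is consistent with its maximum of 6726 words in a single line.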

# Histogram for Data Types
qplot(rawWPL[[1]],geom="histogram",main="Histogram for US Blogs",
      xlab="No. of Words",ylab="Frequency",binwidth=10)

qplot(rawWPL[[2]],geom="histogram",main="Histogram for US News",
      xlab="No. of Words",ylab="Frequency",binwidth=10)

qplot(rawWPL[[3]],geom="histogram",main="Histogram for US Twits",
      xlab="No. of Words",ylab="Frequency",binwidth=10)

rm(rawWPL, rawstats)

Conclusion

The three documents differ greatly in their words per line. In order to build a predictive model I will need to identify and remove outliers.
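
One possible way to flag outliers is Tukey's rule, which marks values above the third quartile plus 1.5 times the interquartile range. The sketch below applies it to made-up words-per-line counts. Both the toy data and the choice of rule are assumptions; the report does not say which method will be used.

```r
# Toy words-per-line counts; the last value mimics one extremely long line.
wpl <- c(5, 9, 12, 18, 28, 33, 41, 60, 75, 6726)

# Tukey's rule: flag values above Q3 + 1.5 * IQR (an assumed threshold).
q <- quantile(wpl, c(0.25, 0.75), names = FALSE)
upper <- q[2] + 1.5 * (q[2] - q[1])

is_outlier <- wpl > upper
wpl_trimmed <- wpl[!is_outlier]

print(upper)
print(wpl_trimmed)
```

On heavily skewed data this rule can discard many legitimate long lines, so a percentile cut-off may be worth comparing against it.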

I will also need to analyze the frequency distribution of individual words and of word combinations (n-grams).
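
To illustrate what a frequency distribution of word combinations looks like, the sketch below counts two-word sequences (bigrams) in a toy corpus using base R only. The helper `line_bigrams` and the toy sentences are hypothetical; the actual analysis may instead rely on the tm package loaded earlier.

```r
# Toy corpus standing in for the cleaned text.
toy_lines <- c("the cat sat on the mat", "the cat ate", "on the mat")

# Hypothetical helper: lower-case a line, split it on whitespace, and paste
# each word to its successor to form bigrams.
line_bigrams <- function(line) {
  words <- strsplit(tolower(trimws(line)), "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

bigrams <- unlist(lapply(toy_lines, line_bigrams))

# Frequency table, most common bigrams first.
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
print(head(bigram_freq, 3))
```

A table like this is the raw material for next-word prediction: given "the", the most frequent bigram starting with "the" suggests a likely next word.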