Data Science Capstone Project - Week 2

Table of contents:
- Introduction
- Data Loading
- Data Processing
- Next Steps

Introduction

The goal of this milestone report for the Coursera Data Science Capstone project is to show how the data was downloaded and to explain the plan for creating the prediction algorithm. It describes the major features of the data that I have identified and briefly summarizes the plans for building the prediction algorithm and the Shiny app.

Data

The following data set is provided for this project: http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Text documents are provided in English, German, Finnish and Russian, and they come in 3 different forms:
- Blogs
- News
- Twitter

Since I don't know any of the other 3 languages, I will use the English data.

Data specifications:

  • Blogs: 899288 lines and 206824505 characters
  • News: 1010242 lines and 203223159 characters
  • Twitter: 2360148 lines and 162096031 characters

Since these data sets are huge and processing takes a long time, we will work with a sample of the data for the processing and analysis in this report. The full data set will be used in the final project for the prediction algorithm. Data from the 3 files are combined and a text corpus is built using the tm library. We only load the first 1000 lines of each file for this report.

Load Necessary Libraries

library(tm, quietly = TRUE, warn.conflicts = FALSE)
library(fpc, quietly = TRUE, warn.conflicts = FALSE)
library(SnowballC, quietly = TRUE, warn.conflicts = FALSE)
library(ggplot2, quietly = TRUE, warn.conflicts = FALSE)
library(wordcloud, quietly = TRUE, warn.conflicts = FALSE)
library(gridExtra, quietly = TRUE, warn.conflicts = FALSE)

Loading Data

Then we load the Blogs, News and Twitter data files and display the number of lines and characters in each file:

setwd("/Users/Arezu/Desktop/Coursera/CapstoneProject/en_US")
blogsf <- readLines("/Users/Arezu/Desktop/Coursera/CapstoneProject/en_US/en_US.blogs.txt", encoding = "UTF-8",warn = FALSE)
# to get the number of rows in the Blogs file
NROW(blogsf)
## [1] 899288
# to get the number of characters in the Blogs file
sum(nchar(blogsf))
## [1] 206824505
newsf <- readLines("/Users/Arezu/Desktop/Coursera/CapstoneProject/en_US/en_US.news.txt", encoding = "UTF-8",warn = FALSE)
# to get the number of rows in the News file
NROW(newsf)
## [1] 1010242
# to get the number of characters in the News file
sum(nchar(newsf))
## [1] 203223159
twitterf <- readLines("/Users/Arezu/Desktop/Coursera/CapstoneProject/en_US/en_US.twitter.txt", encoding = "UTF-8",warn = FALSE)
# to get the number of rows in the Twitter file
NROW(twitterf)
## [1] 2360148
# to get the number of characters in the Twitter file
sum(nchar(twitterf))
## [1] 162096031

Sample Data

Since the original data files are huge, I choose the first 1000 lines of each file as sample data:

# Loading the first 1000 sample rows of the Blogs data file
blogsf <- readLines("/Users/Arezu/Desktop/Coursera/CapstoneProject/en_US/en_US.blogs.txt",1000, encoding = "UTF-8",warn = FALSE)
# checking the number of rows
NROW(blogsf)
## [1] 1000
# Loading the first 1000 sample rows of the News data file
newsf <- readLines("/Users/Arezu/Desktop/Coursera/CapstoneProject/en_US/en_US.news.txt",1000, encoding = "UTF-8",warn = FALSE)
# checking the number of rows
NROW(newsf)
## [1] 1000
# Loading the first 1000 sample rows of the Twitter data file
twitterf <- readLines("/Users/Arezu/Desktop/Coursera/CapstoneProject/en_US/en_US.twitter.txt",1000, encoding = "UTF-8",warn = FALSE)
# checking the number of rows in Twitter file
NROW(twitterf)
## [1] 1000
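
Taking the first 1000 lines is convenient for this report, but it may not be representative of each file. As a minimal sketch of an alternative, assuming blogsf, newsf and twitterf still hold the full files from the previous section, a reproducible random sample could be drawn instead (sample_lines is a hypothetical helper; 1000 lines per file is an arbitrary choice):

# draw a reproducible random sample of n lines from each file
set.seed(1234)
sample_lines <- function(x, n = 1000) {
  x[sample(seq_along(x), size = min(n, length(x)))]
}
blogs_sample   <- sample_lines(blogsf)
news_sample    <- sample_lines(newsf)
twitter_sample <- sample_lines(twitterf)
NROW(blogs_sample)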

Data Processing

For data processing, we construct a corpus from the files; clean up the data by removing punctuation, special characters, etc. (tokenization), as well as profanity; and build an n-gram model.
I use the tm package, an R text-mining package, to clean up the data.

# combine the 3 files and build a corpus
combinedfiletemp <- c(blogsf,newsf,twitterf)
combinedfiles <- paste(combinedfiletemp, collapse = " ")
masterfile <- Corpus(VectorSource(combinedfiles))

Summary of the Sample Data

inspect(masterfile)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 500297

Data Cleaning

Now we remove punctuation, convert the text to lower case, remove numbers and extra white space, stem the words (e.g. stripping -s, -es and -ing endings), and remove stop words (e.g. the, also, a, an, and, ...):

masterfile <- tm_map(masterfile, removePunctuation)
masterfile <- tm_map(masterfile, content_transformer(tolower))
masterfile <- tm_map(masterfile, removeNumbers)
masterfile <- tm_map(masterfile, stripWhitespace)
masterfile <- tm_map(masterfile, stemDocument)
masterfile <- tm_map(masterfile, removeWords, stopwords("english"))
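
The profanity filtering mentioned above is not shown here; a minimal sketch, assuming a plain-text profanity list has been saved as profanity.txt (a hypothetical file, one word per line):

# remove profanity using a custom word list (one word per line)
profanity <- readLines("profanity.txt", encoding = "UTF-8", warn = FALSE)
masterfile <- tm_map(masterfile, removeWords, profanity)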

Facts and details about the sample data:

We take a look at some word frequencies in our sample data set:

# create a document-term matrix
dtm <- DocumentTermMatrix(masterfile)
dtm
## <<DocumentTermMatrix (documents: 1, terms: 13551)>>
## Non-/sparse entries: 13551/0
## Sparsity           : 0%
## Maximal term length: 95
## Weighting          : term frequency (tf)
# organize words based on their frequency
freqwords <- colSums(as.matrix(dtm))
length(freqwords)
## [1] 13551
ordWords <- order(freqwords)
dtms <- removeSparseTerms(dtm, 0.1)
# show the top 15 most frequent words
freqwords <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freqwords,15)
##  said  will   one  just  like   can  time   new   get  dont   day  know 
##   304   260   255   250   248   192   192   186   171   146   144   144 
##   now  good first 
##   138   132   128
# identify all terms that appear at least 100 times
findFreqTerms(dtm, lowfreq = 100)
##  [1] "also"   "can"    "day"    "dont"   "first"  "get"    "good"  
##  [8] "just"   "know"   "like"   "love"   "make"   "much"   "new"   
## [15] "now"    "one"    "people" "said"   "time"   "two"    "will"  
## [22] "year"
wf <- data.frame(words=names(freqwords), freqwords=freqwords)
head(wf)
##      words freqwords
## said  said       304
## will  will       260
## one    one       255
## just  just       250
## like  like       248
## can    can       192

Plot words that appear more than 100 times:

f <- ggplot(subset(wf,freqwords>100), aes(words, freqwords))
f <- f + geom_bar(stat = "identity", fill = "yellow", colour = "black")
f <- f + theme(axis.text.x=element_text(angle=45, hjust = 1, colour = blues9)) 
f

Word Cloud of Frequent Words

set.seed(142)
wordcloud(names(freqwords),freqwords,min.freq = 50, scale=c(5, .1),colors = brewer.pal(6,"Dark2"))

Next Steps

The next steps will be to build a predictive algorithm that uses an n-gram model with a frequency lookup similar to the analysis above. This algorithm will then be deployed in a Shiny app and will suggest the most likely next word after a phrase is typed.
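
As an illustration of the planned frequency lookup, here is a minimal base-R sketch of a bigram model that suggests the most likely next word. It reuses combinedfiletemp (the raw sample lines from the Data Processing section); predict_next is a placeholder name, and the final algorithm will use higher-order n-grams with a back-off strategy rather than this simple lookup.

# build a bigram frequency table from the raw sample text
words <- unlist(strsplit(tolower(combinedfiletemp), "[^a-z']+"))
words <- words[words != ""]
bigrams <- paste(head(words, -1), tail(words, -1))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

# suggest the most frequent word that follows the given word
predict_next <- function(word) {
  candidates <- bigram_freq[startsWith(names(bigram_freq), paste0(tolower(word), " "))]
  if (length(candidates) == 0) return(NA_character_)
  sub("^\\S+ ", "", names(candidates)[1])
}
predict_next("new")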