Table of contents:
- Introduction
- Data Loading
- Data Processing
- Next Steps
## Introduction
The goal of this milestone report for the Coursera Data Science Capstone project is to show how the data was downloaded and to explain the plan for creating the prediction algorithm. This document describes the major features of the data that I have identified and briefly summarizes the plans for building the prediction algorithm and the Shiny app.
## Data Loading
In this project the following data is provided: “http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip”
Text documents are provided in English, German, Finnish and Russian, and they come in 3 different forms:
- Blogs
- News
- Twitter

Since I don't know any of the other 3 languages, I'm going to use the English data.
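The download itself is not shown in this report; a minimal sketch of how the zip file could be fetched and extracted in R (the destination file name is an assumption; the archive unpacks into a final/ folder with one subfolder per locale):

# Download and extract the Coursera-SwiftKey data set (run once)
url <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
unzip("Coursera-SwiftKey.zip", exdir = ".")
# The English files are then available under final/en_US/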
Since these data sets are huge and processing takes a long time, we will use a sample data set for the data processing and analysis in this report. The full data set will be used in the final project for the prediction algorithm. Data from the 3 files are combined and a text corpus is built using the tm library. We only load 1000 lines from each file for this report.
### Load Necessary Libraries
library(tm, quietly = TRUE, warn.conflicts = FALSE)
library(fpc, quietly = TRUE, warn.conflicts = FALSE)
library(SnowballC, quietly = TRUE, warn.conflicts = FALSE)
library(ggplot2, quietly = TRUE, warn.conflicts = FALSE)
library(wordcloud, quietly = TRUE, warn.conflicts = FALSE)
library(gridExtra, quietly = TRUE, warn.conflicts = FALSE)
Then we load the Blogs, News and Twitter data files and display the number of rows and characters in each file:
setwd("/Users/Arezu/Desktop/Coursera/CapstoneProject/en_US")
blogsf <- readLines("/Users/Arezu/Desktop/Coursera/CapstoneProject/en_US/en_US.blogs.txt", encoding = "UTF-8",warn = FALSE)
# number of rows in the Blogs file
NROW(blogsf)
## [1] 899288
# number of characters in the Blogs file
sum(nchar(blogsf))
## [1] 206824505
newsf <- readLines("/Users/Arezu/Desktop/Coursera/CapstoneProject/en_US/en_US.news.txt", encoding = "UTF-8",warn = FALSE)
# number of rows in the News file
NROW(newsf)
## [1] 1010242
# number of characters in the News file
sum(nchar(newsf))
## [1] 203223159
twitterf <- readLines("/Users/Arezu/Desktop/Coursera/CapstoneProject/en_US/en_US.twitter.txt", encoding = "UTF-8",warn = FALSE)
# number of rows in the Twitter file
NROW(twitterf)
## [1] 2360148
# number of characters in the Twitter file
sum(nchar(twitterf))
## [1] 162096031
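For convenience, the same counts can also be collected into a single summary table; a small sketch using the objects loaded above (filesummary is just an illustrative name):

# summary of the three full files: lines and characters
filesummary <- data.frame(
  file  = c("Blogs", "News", "Twitter"),
  lines = c(NROW(blogsf), NROW(newsf), NROW(twitterf)),
  chars = c(sum(nchar(blogsf)), sum(nchar(newsf)), sum(nchar(twitterf)))
)
filesummary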
Since the original data files are huge, I take the first 1000 lines of each file as the sample data:
# Loading the first 1000 sample rows of the Blogs data file
blogsf <- readLines("/Users/Arezu/Desktop/Coursera/CapstoneProject/en_US/en_US.blogs.txt",1000, encoding = "UTF-8",warn = FALSE)
# checking the number of rows
NROW(blogsf)
## [1] 1000
# Loading the first 1000 sample rows of the News data file
newsf <- readLines("/Users/Arezu/Desktop/Coursera/CapstoneProject/en_US/en_US.news.txt",1000, encoding = "UTF-8",warn = FALSE)
# checking the number of rows
NROW(newsf)
## [1] 1000
# Loading the first 1000 sample rows of the Twitter data file
twitterf <- readLines("/Users/Arezu/Desktop/Coursera/CapstoneProject/en_US/en_US.twitter.txt",1000, encoding = "UTF-8",warn = FALSE)
# checking the number of rows in Twitter file
NROW(twitterf)
## [1] 1000
## Data Processing
For data processing, we construct a corpus from the files; clean up the data by removing punctuation, special characters, etc. (tokenization) as well as profanity; and build an n-gram model.
I use the tm package, an R text-mining package, to clean up the data.
# combine the 3 files and build a corpus from the combined text
combinedfiletemp <- c(blogsf,newsf,twitterf)
combinedfiles <- paste(combinedfiletemp, collapse = " ")
masterfile <- Corpus(VectorSource(combinedfiles))
inspect(masterfile)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 500297
Now we remove punctuation, numbers and extra white space, convert the text to lower case, stem the words (e.g. removing -s, -es, -ing endings) and remove English stop words (e.g. the, also, a, an, and, …):
masterfile <- tm_map(masterfile, removePunctuation)   # remove punctuation
masterfile <- tm_map(masterfile, tolower)             # convert to lower case
masterfile <- tm_map(masterfile, removeNumbers)       # remove numbers
masterfile <- tm_map(masterfile, stripWhitespace)     # collapse extra white space
masterfile <- tm_map(masterfile, stemDocument)        # stem words
masterfile <- tm_map(masterfile, removeWords, stopwords("english"))  # remove English stop words
masterfile <- tm_map(masterfile, PlainTextDocument)   # convert back to a PlainTextDocument
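Profanity filtering is mentioned above but not shown here; a minimal sketch, assuming a plain-text word list is available (profanity.txt is a hypothetical file name, one word per line):

# remove profanity using an external word list (profanity.txt is a placeholder name)
badwords   <- readLines("profanity.txt", warn = FALSE)
masterfile <- tm_map(masterfile, removeWords, badwords)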
We take a look at some word frequencies in our sample data set:
# create a document-term matrix
dtm <- DocumentTermMatrix(masterfile)
dtm
## <<DocumentTermMatrix (documents: 1, terms: 13551)>>
## Non-/sparse entries: 13551/0
## Sparsity : 0%
## Maximal term length: 95
## Weighting : term frequency (tf)
# count the frequency of each term
freqwords <- colSums(as.matrix(dtm))
length(freqwords)
## [1] 13551
ordWords <- order(freqwords)          # term indices ordered by frequency
dtms <- removeSparseTerms(dtm, 0.1)   # remove sparse terms from the document-term matrix
# show the 15 most frequent terms
freqwords <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freqwords,15)
## said will one just like can time new get dont day know
## 304 260 255 250 248 192 192 186 171 146 144 144
## now good first
## 138 132 128
# identify all terms that appear at least 100 times
findFreqTerms(dtm, lowfreq = 100)
## [1] "also" "can" "day" "dont" "first" "get" "good"
## [8] "just" "know" "like" "love" "make" "much" "new"
## [15] "now" "one" "people" "said" "time" "two" "will"
## [22] "year"
wf <- data.frame(words=names(freqwords), freqwords=freqwords)
head(wf)
## words freqwords
## said said 304
## will will 260
## one one 255
## just just 250
## like like 248
## can can 192
f <- ggplot(subset(wf,freqwords>100), aes(words, freqwords))
f <- f + geom_bar(stat = "identity", fill = "yellow", colour = "black")
f <- f + theme(axis.text.x=element_text(angle=45, hjust = 1, colour = blues9))
f
### Plot frequent words
set.seed(142)
wordcloud(names(freqwords),freqwords,min.freq = 50, scale=c(5, .1),colors = brewer.pal(6,"Dark2"))
## Next Steps
The next steps are to build a predictive algorithm that uses an n-gram model with a frequency lookup similar to the analysis above. The algorithm will then be deployed in a Shiny app that suggests the most likely next word after a phrase is typed.
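As a first illustration of the frequency-lookup idea, here is a minimal bigram-count sketch in base R on the combined sample text (only an illustration; the actual n-gram model is not implemented in this report):

# count bigrams in the combined sample text (base R sketch)
tokens  <- unlist(strsplit(tolower(combinedfiles), "[^a-z']+"))
tokens  <- tokens[tokens != ""]                       # drop empty tokens
bigrams <- paste(head(tokens, -1), tail(tokens, -1))  # pair each word with the next one
head(sort(table(bigrams), decreasing = TRUE), 10)     # 10 most frequent bigrams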