title: "Data Science Capstone Assignment - Milestone Report"
author: "ahlulwulus"
date: "March 20, 2016"
output: html_document

Synopsis

This is the milestone report created for one of the Data Science Capstone assignments. It explains the exploratory analysis performed on the project datasets.

The Datasets

The data used in this report was downloaded from:

Capstone Dataset (source: [Coursera-SwiftKey.zip](https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip)). Only the following English text files were used:

* en_US.blogs.txt
* en_US.news.txt
* en_US.twitter.txt

Exploratory Analysis

File Analysis

  1. Download the data file into the working directory and load it into memory.
setwd("D:/Ahlulwulus/Training/Data Scientist 2015/Module 10 - Capstone")
twitter <- readLines(con <- file("./Data/en_US/en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)

# blog <- readLines(con <- file("./Data/en_US/en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE)
# new <- readLines(con <- file("./Data/en_US/en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE)
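The download itself is not shown in the report; a minimal sketch, assuming the zip is fetched and unpacked under ./Data (the URL comes from the dataset link above; the destination names are hypothetical):

zipUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# One-time setup: download and unpack the Capstone dataset.
download.file(zipUrl, destfile = "./Data/Coursera-SwiftKey.zip", mode = "wb")
unzip("./Data/Coursera-SwiftKey.zip", exdir = "./Data")
# Note: the archive's internal folder layout may differ from the
# ./Data/en_US paths used above, so files may need to be moved.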

  2. Capture the number of lines in each file.
lenTwitter <- length(twitter)

# lenBlog <- length(blog)
# lenNew <- length(new)

  3. Count the number of words in each file.
twitterWords <- sum(sapply(gregexpr("\\S+", twitter), length))

# blogWords <- sum(sapply(gregexpr("\\S+", blog), length))
# newWords <- sum(sapply(gregexpr("\\S+", new), length))
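The gregexpr("\\S+", ...) call finds every run of non-whitespace characters in a line, so summing the match counts gives a word total. A small illustration:

# Two lines containing 2 and 3 words respectively -> 5 matches in total.
sum(sapply(gregexpr("\\S+", c("hello world", "one two three")), length))
## [1] 5
# Caveat: gregexpr returns -1 (a length-1 result) when a line has no
# match, so completely empty lines are each counted as one word.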

  4. Calculate the average number of words per line.
avgTwitter <- twitterWords %/% lenTwitter   # %/% is integer division, so the remainder is dropped

# avgBlog <- blogWords %/% lenBlog
# avgNew <- newWords %/% lenNew
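For readability, the per-file figures can be collected into a single table; a minimal sketch using only the Twitter values computed above (the blog and news rows would follow the same pattern once those files are loaded):

# Collect the basic file statistics into one data frame.
fileStats <- data.frame(
  file  = "en_US.twitter.txt",
  lines = lenTwitter,
  words = twitterWords,
  avgWordsPerLine = avgTwitter
)
fileStats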

Data Analysis

  1. Replace all non-ASCII ("weird") characters.
cleanedTwitter <- iconv(twitter, "UTF-8", "ASCII", "byte")

# cleanedBlog <- iconv(blog, "UTF-8", "ASCII", "byte")
# cleanedNew <- iconv(new, "UTF-8", "ASCII", "byte")
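With sub = "byte", iconv does not drop unconvertible characters but replaces them with their hex byte codes. A small illustration (assuming a UTF-8 locale):

# The two-byte UTF-8 sequence for "é" is rendered as <c3><a9>.
iconv("café", "UTF-8", "ASCII", "byte")
## [1] "caf<c3><a9>"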

  2. Create a sample of the data (5,000 lines) for the corpus (Twitter only).
library(tm)
twitterSample <- sample(cleanedTwitter, 5000)   # draw 5,000 random lines
doc.vec <- VectorSource(twitterSample)          # wrap the sample as a tm source
doc.corpus <- Corpus(doc.vec)                   # build the corpus

# convert to lower case; content_transformer() keeps the corpus
# structure intact, so no conversion back to plain text documents
# is needed afterwards
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))

# remove all punctuation
doc.corpus <- tm_map(doc.corpus, removePunctuation)

# remove all numbers
doc.corpus <- tm_map(doc.corpus, removeNumbers)

# collapse extra whitespace
doc.corpus <- tm_map(doc.corpus, stripWhitespace)
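To sanity-check the cleaning, a few sample documents can be printed; a minimal sketch using tm's inspect():

# Show the first two cleaned sample documents.
inspect(doc.corpus[1:2])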

Remove the large intermediate objects from memory to free space:

rm(cleanedTwitter)
rm(twitterSample)
  3. Visualize the corpus using a word cloud.
library(wordcloud)
wordcloud(doc.corpus, max.words = 50, random.order = FALSE, rot.per = 0.35,
          use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
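As a complementary view of the same sample, the most frequent terms can be listed explicitly; a minimal sketch using tm's TermDocumentMatrix (the threshold of 50 occurrences is an arbitrary choice):

# Build a term-document matrix from the cleaned corpus and list
# every term appearing at least 50 times in the 5,000-line sample.
tdm <- TermDocumentMatrix(doc.corpus)
findFreqTerms(tdm, lowfreq = 50)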

Conclusion