title: "Data Science Capstone Assignment - Milestone Report"
author: "ahlulwulus"
date: "March 20, 2016"
output: html_document

Synopsis

This is the milestone report created for one of the Data Science Capstone assignments. It explains the exploratory analysis performed on the project datasets.

The Datasets

The data used in this report was downloaded from:

Capstone Dataset (source: [Coursera-SwiftKey.zip](https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip)). Only the following English text files were used:

* en_US.blogs.txt
* en_US.news.txt
* en_US.twitter.txt

Exploratory Analysis

File Analysis

  1. Download the data file into the working directory and load it into memory.
setwd("D:/Ahlulwulus/Training/Data Scientist 2015/Module 10 - Capstone")
twitter <- readLines(con <- file("./Data/en_US/en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)

# blog <- readLines(con <- file("./Data/en_US/en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE)
# new <- readLines(con <- file("./Data/en_US/en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE)
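The download itself is not shown in the report; a minimal sketch, assuming the zip is fetched and unpacked under ./Data (the URL comes from the dataset link above; the destination names are hypothetical):

zipUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# One-time setup: download and unpack the Capstone dataset.
download.file(zipUrl, destfile = "./Data/Coursera-SwiftKey.zip", mode = "wb")
unzip("./Data/Coursera-SwiftKey.zip", exdir = "./Data")
# Note: the archive's internal folder layout may differ from the
# ./Data/en_US paths used above, so files may need to be moved.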

  2. Capture the number of lines in each file.
lenTwitter <- length(twitter)

# lenBlog <- length(blog)
# lenNew <- length(new)

  3. Count the number of words in each file.
twitterWords <- sum(sapply(gregexpr("\\S+", twitter), length))

# blogWords <- sum(sapply(gregexpr("\\S+", blog), length))
# newWords <- sum(sapply(gregexpr("\\S+", new), length))
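The gregexpr("\\S+", ...) call finds every run of non-whitespace characters in a line, so summing the match counts gives a word total. A small illustration:

# Two lines containing 2 and 3 words respectively -> 5 matches in total.
sum(sapply(gregexpr("\\S+", c("hello world", "one two three")), length))
## [1] 5
# Caveat: gregexpr returns -1 (a length-1 result) when a line has no
# match, so completely empty lines are each counted as one word.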

  4. Calculate the average number of words per line.
avgTwitter <- twitterWords %/% lenTwitter   # %/% is integer division, so the remainder is dropped

# avgBlog <- blogWords %/% lenBlog
# avgNew <- newWords %/% lenNew
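For readability, the per-file figures can be collected into a single table; a minimal sketch using only the Twitter values computed above (the blog and news rows would follow the same pattern once those files are loaded):

# Collect the basic file statistics into one data frame.
fileStats <- data.frame(
  file  = "en_US.twitter.txt",
  lines = lenTwitter,
  words = twitterWords,
  avgWordsPerLine = avgTwitter
)
fileStats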

Data Analysis

  1. Replace all non-ASCII ("weird") characters.
cleanedTwitter <- iconv(twitter, "UTF-8", "ASCII", "byte")

# cleanedBlog <- iconv(blog, "UTF-8", "ASCII", "byte")
# cleanedNew <- iconv(new, "UTF-8", "ASCII", "byte")
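With sub = "byte", iconv does not drop unconvertible characters but replaces them with their hex byte codes. A small illustration (assuming a UTF-8 locale):

# The two-byte UTF-8 sequence for "é" is rendered as <c3><a9>.
iconv("café", "UTF-8", "ASCII", "byte")
## [1] "caf<c3><a9>"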

  2. Create a sample of the data (5,000 lines) for the corpus (Twitter only).
library(tm)
twitterSample <- sample(cleanedTwitter, 5000)   # draw 5,000 random lines
doc.vec <- VectorSource(twitterSample)          # wrap the sample as a tm source
doc.corpus <- Corpus(doc.vec)                   # build the corpus

# convert to lower case; content_transformer() keeps the corpus
# structure intact, so no conversion back to plain text documents
# is needed afterwards
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))

# remove all punctuation
doc.corpus <- tm_map(doc.corpus, removePunctuation)

# remove all numbers
doc.corpus <- tm_map(doc.corpus, removeNumbers)

# collapse extra whitespace
doc.corpus <- tm_map(doc.corpus, stripWhitespace)
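To sanity-check the cleaning, a few sample documents can be printed; a minimal sketch using tm's inspect():

# Show the first two cleaned sample documents.
inspect(doc.corpus[1:2])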

Remove the large intermediate objects from memory to free space:

rm(cleanedTwitter)
rm(twitterSample)
  3. Visualize the corpus using a word cloud.
library(wordcloud)
wordcloud(doc.corpus, max.words = 50, random.order = FALSE, rot.per = 0.35,
          use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
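As a complementary view of the same sample, the most frequent terms can be listed explicitly; a minimal sketch using tm's TermDocumentMatrix (the threshold of 50 occurrences is an arbitrary choice):

# Build a term-document matrix from the cleaned corpus and list
# every term appearing at least 50 times in the 5,000-line sample.
tdm <- TermDocumentMatrix(doc.corpus)
findFreqTerms(tdm, lowfreq = 50)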

Conclusion