title: "Data Science Capstone Assignment - Milestone Report"
author: "ahlulwulus"
date: "March 20, 2016"
output: html_document
This milestone report was created for one of the Data Science Capstone assignments. It explains the exploratory analysis done on the project datasets.
The data used in this report was downloaded from the Capstone Dataset (source: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). Only the following English text files were used:
* en_US.blogs.txt
* en_US.news.txt
* en_US.twitter.txt
Exploratory Analysis
setwd("D:/Ahlulwulus/Training/Data Scientist 2015/Module 10 - Capstone")
twitter <- readLines(con <- file("./Data/en_US/en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)
# blog <- readLines(con <- file("./Data/en_US/en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE)
# new <- readLines(con <- file("./Data/en_US/en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE)
lenTwitter <- length(twitter)
# lenBlog <- length(blog)
# lenNew <- length(new)
twitterWords <- sum(sapply(gregexpr("\\S+", twitter), length))
# blogWords <- sum(sapply(gregexpr("\\S+", blog), length))
# newWords <- sum(sapply(gregexpr("\\S+", new), length))
avgTwitter <- twitterWords %/% lenTwitter
# avgBlog <- blogWords %/% lenBlog
# avgNew <- newWords %/% lenNew
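The three statistics computed above (line count, word count, and average words per line) can be gathered into one summary table. The following is a minimal sketch on a toy vector, not the report's actual data; `toy_lines` and `summarise_lines` are hypothetical names introduced for illustration. It uses `regmatches()` so that an empty line counts as zero words, because `gregexpr()` returns a single `-1` for a line with no match, which `length()` would count as one word.

```r
# Toy stand-in for a text file read with readLines() (hypothetical data)
toy_lines <- c("the quick brown fox", "hello world", "", "one")

# Hypothetical helper: line count, word count, and integer average per line
summarise_lines <- function(lines) {
  # regmatches() yields character(0) for lines without a match,
  # so empty lines contribute 0 words
  words_per_line <- lengths(regmatches(lines, gregexpr("\\S+", lines)))
  data.frame(
    lines    = length(lines),
    words    = sum(words_per_line),
    avgWords = sum(words_per_line) %/% length(lines)
  )
}

toy_summary <- summarise_lines(toy_lines)
print(toy_summary)
```

The same helper could be applied to `twitter`, and to `blog` and `new` if those files are loaded.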
cleanedTwitter<- iconv(twitter, 'UTF-8', 'ASCII', "byte")
# cleanedBlog <- iconv(blog, 'UTF-8', 'ASCII', "byte")
# cleanedNew <- iconv(new, 'UTF-8', 'ASCII', "byte")
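To see what the `iconv()` call does, here is a small sketch on a made-up string (`toy_text` is a hypothetical example, not taken from the dataset). With `sub = "byte"`, each byte of a non-ASCII character is replaced by its hex code in angle brackets, while `sub = ""` would drop such characters instead.

```r
# Hypothetical sample containing one non-ASCII character (e with acute accent)
toy_text <- "caf\u00e9 time"

# Same call shape as in the report: non-convertible bytes become <xx> hex codes
as_bytes <- iconv(toy_text, "UTF-8", "ASCII", sub = "byte")

# Alternative: silently drop anything that cannot be represented in ASCII
dropped <- iconv(toy_text, "UTF-8", "ASCII", sub = "")

print(as_bytes)
print(dropped)
```

Which variant is preferable depends on the later steps: the hex codes keep a trace of the original characters but can show up as tokens, whereas dropping them loses that information.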
library(tm)
## Warning: package 'tm' was built under R version 3.2.3
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.2.3
twitterSample<-sample(cleanedTwitter, 5000)
doc.vec <- VectorSource(twitterSample)
doc.corpus <- Corpus(doc.vec)
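Because `sample()` draws at random, each run of the report picks a different set of 5000 tweets, so downstream results can change between runs. A minimal sketch of making the draw repeatable with `set.seed()` follows; the seed value and the `toy_pool` vector are arbitrary assumptions for illustration.

```r
# Hypothetical stand-in for the cleaned tweets
toy_pool <- paste("tweet", 1:100)

# Fixing the seed before sampling makes the draw repeatable
set.seed(1234)  # arbitrary seed, chosen only for illustration
first_draw <- sample(toy_pool, 10)

set.seed(1234)
second_draw <- sample(toy_pool, 10)

# Both draws contain the same elements in the same order
print(identical(first_draw, second_draw))
```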
# convert to lower case (content_transformer keeps the documents as tm text documents)
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
# remove all punctuation
doc.corpus <- tm_map(doc.corpus, removePunctuation)
#remove all numbers
doc.corpus<- tm_map(doc.corpus, removeNumbers)
## remove whitespace
doc.corpus <- tm_map(doc.corpus, stripWhitespace)
## force everything back to plaintext document
doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
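The effect of the cleaning steps can be illustrated without `tm` using base R string functions. This is only an approximation of what the `tm` transformations do, written as a sketch on a made-up string; `toy_tweet` and `clean_text` are hypothetical names.

```r
# Hypothetical example string (not from the dataset)
toy_tweet <- "Hello,   World! 123 times"

# Rough base R counterpart of the tm cleaning steps, applied in the same order
clean_text <- function(x) {
  x <- tolower(x)                   # convert to lower case
  x <- gsub("[[:punct:]]", "", x)   # remove punctuation
  x <- gsub("[[:digit:]]", "", x)   # remove numbers
  x <- gsub("\\s+", " ", x)         # collapse runs of whitespace
  x
}

cleaned <- clean_text(toy_tweet)
print(cleaned)  # "hello world times"
```

Note that whitespace is collapsed last, since removing punctuation and numbers can leave extra spaces behind.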
Remove the large intermediate objects from memory to free up space.
rm(cleanedTwitter)
rm(twitterSample)
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.2.4
## Loading required package: RColorBrewer
## Warning: package 'RColorBrewer' was built under R version 3.2.3
wordcloud(doc.corpus, max.words = 50, random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
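A word cloud sizes each word by how often it occurs, so the same information can be read from a frequency table. A minimal base R sketch on made-up text (`toy_docs` is hypothetical), assuming the text has already been cleaned and can be split on whitespace:

```r
# Hypothetical cleaned documents
toy_docs <- c("good morning world", "good night", "good morning")

# Split into words and count occurrences, most frequent first
toy_words <- unlist(strsplit(toy_docs, "\\s+"))
word_freq <- sort(table(toy_words), decreasing = TRUE)

# The top entries correspond to the largest words in a word cloud
print(head(word_freq, 2))
```

A table like this makes it easier to report exact counts alongside the plot.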