In this document we perform a simple exploratory analysis of the English-language blog text file from the Coursera SwiftKey dataset. The file was downloaded through the link below.
library(stringi)
library(tm)
library(wordcloud)
library(dplyr)
library(ggplot2)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, destfile = "C:/Users/danlu/Documents/Capstone.zip")
Capf <- unzip("C:/Users/danlu/Documents/Capstone.zip", exdir = "C:/Users/danlu/Documents/Capf")
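The archive is large, so in practice the download and extraction can be guarded to run only when the files are not already on disk (a minimal sketch using the same local paths as above):
zipfile <- "C:/Users/danlu/Documents/Capstone.zip"
if (!file.exists(zipfile)) {
  download.file(url, destfile = zipfile)        # download only once
}
if (!dir.exists("C:/Users/danlu/Documents/Capf")) {
  unzip(zipfile, exdir = "C:/Users/danlu/Documents/Capf")  # extract only once
}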
list.files(path = "C:/Users/danlu/Documents/Capf")
## [1] "final"
list.files(path = "C:/Users/danlu/Documents/Capf/final/en_US")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
Load data
filepath<-"C:/Users/danlu/Documents/Capf/final/en_US/en_US.blogs.txt"
data_B <- readLines(filepath)
filepath_news<-"C:/Users/danlu/Documents/Capf/final/en_US/en_US.news.txt"
data_n <- readLines(filepath_news)
filepath_t<-"C:/Users/danlu/Documents/Capf/final/en_US/en_US.twitter.txt"
data_t <- readLines(filepath_t)
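On some platforms readLines() warns about embedded nul characters (notably in the Twitter file) or an incomplete final line. A more defensive read, sketched below and not required for the results that follow, opens the file as a binary connection and skips nuls:
read_txt <- function(path) {
  con <- file(path, open = "rb")                  # binary connection avoids line-ending surprises
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}
# e.g. the Twitter file could be re-read as: data_t <- read_txt(filepath_t)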
stri_stats_general(data_B)
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 208361438 171926076
stri_stats_general(data_n)
## Lines LinesNEmpty Chars CharsNWhite
## 77259 77259 15683765 13117038
stri_stats_general(data_t)
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162384825 134370864
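The three files can also be summarised in a single table; this sketch counts words with stri_count_words() from stringi (the data frame and column names are illustrative, not part of the original analysis):
summary_df <- data.frame(
  file  = c("blogs", "news", "twitter"),
  lines = c(length(data_B), length(data_n), length(data_t)),
  words = c(sum(stri_count_words(data_B)),
            sum(stri_count_words(data_n)),
            sum(stri_count_words(data_t)))
)
summary_df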
Create a corpus
data<-Corpus(VectorSource(data_B))
data1<-data[1:20000]
inspect(data1[1])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## [1] In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
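Only the first 20,000 blog lines are kept to make the corpus manageable. A random sample would be more representative of the whole file; a sketch (with an arbitrary seed, and sample_B / data1_alt as illustrative names):
set.seed(1234)                        # arbitrary seed for reproducibility
sample_B <- sample(data_B, 20000)     # random 20,000-line sample instead of the first 20,000
data1_alt <- Corpus(VectorSource(sample_B))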
Clean data
doc2 <- tm_map(data1, removePunctuation)                   # drop punctuation
doc2 <- tm_map(doc2, stemDocument)                         # stem words to their roots
doc2 <- tm_map(doc2, removeNumbers)                        # drop digits
doc2 <- tm_map(doc2, removeWords, stopwords("english"))    # drop common English stopwords
doc2 <- tm_map(doc2, stripWhitespace)                      # collapse extra whitespace
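Note that the text is never lowercased and stemming runs before stopword removal, so capitalized stopwords ("The", "And") and stemmed forms such as "becaus" survive into the frequency list below. A variant that lowercases first and removes stopwords before stemming would avoid this (a sketch; doc2b is illustrative and is not used in the results below):
doc2b <- tm_map(data1, content_transformer(tolower))        # lowercase so "The" matches "the"
doc2b <- tm_map(doc2b, removePunctuation)
doc2b <- tm_map(doc2b, removeNumbers)
doc2b <- tm_map(doc2b, removeWords, stopwords("english"))   # remove stopwords before stemming
doc2b <- tm_map(doc2b, stemDocument)
doc2b <- tm_map(doc2b, stripWhitespace)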
To analyze the textual data, we use a Document-Term Matrix (DTM) representation: documents as the rows, terms/words as the columns, and the frequency of each term in each document as the entries. Because of the number of unique words in the corpus, the matrix can be very large and sparse, so sparse terms are removed.
DTM <- DocumentTermMatrix(doc2)
DTM<-removeSparseTerms(DTM,sparse = 0.99)
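A quick check of the matrix shape shows how far removeSparseTerms() shrinks the vocabulary (a sketch; the exact numbers depend on the sample used):
dim(DTM)                 # rows = documents, columns = terms kept at 99% sparsity
inspect(DTM[1:3, 1:5])   # small corner of the matrix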
List term frequencies and visualize
freq = data.frame(sort(colSums(as.matrix(DTM)), decreasing=TRUE))
head(freq,50)
## sort.colSums.as.matrix.DTM....decreasing...TRUE.
## the 4153
## one 2872
## will 2520
## time 2468
## like 2428
## just 2281
## can 2220
## get 2095
## make 1776
## know 1587
## day 1480
## love 1377
## use 1376
## year 1371
## thing 1352
## now 1335
## becaus 1313
## want 1312
## and 1290
## think 1286
## work 1274
## peopl 1251
## new 1245
## way 1229
## even 1227
## first 1194
## also 1193
## see 1192
## look 1168
## this 1166
## good 1140
## take 1128
## much 1127
## onli 1127
## back 1127
## but 1106
## veri 1103
## say 1098
## realli 1094
## well 1070
## littl 1058
## come 1047
## need 1023
## start 941
## – 930
## week 899
## feel 888
## mani 883
## two 858
## ani 857
wordcloud(rownames(freq), freq[,1], max.words = 50, colors = brewer.pal(8, "Dark2"))
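Since ggplot2 is already loaded, the most frequent terms can also be shown as a bar chart. A sketch (top_terms is an illustrative name; the first column of freq is given a short name because data.frame() generated the long one seen above):
top_terms <- data.frame(word = rownames(freq), count = freq[, 1])[1:20, ]
ggplot(top_terms, aes(x = reorder(word, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Term", y = "Frequency", title = "Top 20 terms in the blog sample")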
In a nutshell, we were able to download the data and load it into R, examine the files in the “en_US” folder, and compute basic statistics on the three files it contains. We then listed the most frequent words in “en_US.blogs.txt”. Future predictive modeling work is to follow.