In this document we perform a simple exploratory analysis of the English-language blog text file from the Coursera SwiftKey dataset. The file was downloaded through the link below.
library(stringi)
library(tm)
library(wordcloud)
library(dplyr)
library(ggplot2)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, destfile = "C:/Users/danlu/Documents/Capstone.zip")
Capf <- unzip("C:/Users/danlu/Documents/Capstone.zip", exdir = "C:/Users/danlu/Documents/Capf")
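The archive is large, so in practice the download and extraction can be guarded to run only when the files are not already on disk (a minimal sketch using the same local paths as above):
zipfile <- "C:/Users/danlu/Documents/Capstone.zip"
if (!file.exists(zipfile)) {
  download.file(url, destfile = zipfile)        # download only once
}
if (!dir.exists("C:/Users/danlu/Documents/Capf")) {
  unzip(zipfile, exdir = "C:/Users/danlu/Documents/Capf")  # extract only once
}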
list.files(path = "C:/Users/danlu/Documents/Capf")
## [1] "final"
list.files(path = "C:/Users/danlu/Documents/Capf/final/en_US")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
Load data
filepath<-"C:/Users/danlu/Documents/Capf/final/en_US/en_US.blogs.txt"
data_B <- readLines(filepath)
filepath_news<-"C:/Users/danlu/Documents/Capf/final/en_US/en_US.news.txt"
data_n <- readLines(filepath_news)
filepath_t<-"C:/Users/danlu/Documents/Capf/final/en_US/en_US.twitter.txt"
data_t <- readLines(filepath_t)
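On some platforms readLines() warns about embedded nul characters (notably in the Twitter file) or an incomplete final line. A more defensive read, sketched below and not required for the results that follow, opens the file as a binary connection and skips nuls:
read_txt <- function(path) {
  con <- file(path, open = "rb")                  # binary connection avoids line-ending surprises
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}
# e.g. the Twitter file could be re-read as: data_t <- read_txt(filepath_t)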
stri_stats_general(data_B)
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 208361438 171926076
stri_stats_general(data_n)
## Lines LinesNEmpty Chars CharsNWhite
## 77259 77259 15683765 13117038
stri_stats_general(data_t)
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162384825 134370864
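The three files can also be summarised in a single table; this sketch counts words with stri_count_words() from stringi (the data frame and column names are illustrative, not part of the original analysis):
summary_df <- data.frame(
  file  = c("blogs", "news", "twitter"),
  lines = c(length(data_B), length(data_n), length(data_t)),
  words = c(sum(stri_count_words(data_B)),
            sum(stri_count_words(data_n)),
            sum(stri_count_words(data_t)))
)
summary_df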
Create a corpus
data<-Corpus(VectorSource(data_B))
data1<-data[1:20000]
inspect(data1[1])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## [1] In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
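Only the first 20,000 blog lines are kept to make the corpus manageable. A random sample would be more representative of the whole file; a sketch (with an arbitrary seed, and sample_B / data1_alt as illustrative names):
set.seed(1234)                        # arbitrary seed for reproducibility
sample_B <- sample(data_B, 20000)     # random 20,000-line sample instead of the first 20,000
data1_alt <- Corpus(VectorSource(sample_B))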
Clean data
doc2 <- tm_map(data1, removePunctuation)                   # drop punctuation
doc2 <- tm_map(doc2, stemDocument)                         # stem words to their roots
doc2 <- tm_map(doc2, removeNumbers)                        # drop digits
doc2 <- tm_map(doc2, removeWords, stopwords("english"))    # drop common English stopwords
doc2 <- tm_map(doc2, stripWhitespace)                      # collapse extra whitespace
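Note that the text is never lowercased and stemming runs before stopword removal, so capitalized stopwords ("The", "And") and stemmed forms such as "becaus" survive into the frequency list below. A variant that lowercases first and removes stopwords before stemming would avoid this (a sketch; doc2b is illustrative and is not used in the results below):
doc2b <- tm_map(data1, content_transformer(tolower))        # lowercase so "The" matches "the"
doc2b <- tm_map(doc2b, removePunctuation)
doc2b <- tm_map(doc2b, removeNumbers)
doc2b <- tm_map(doc2b, removeWords, stopwords("english"))   # remove stopwords before stemming
doc2b <- tm_map(doc2b, stemDocument)
doc2b <- tm_map(doc2b, stripWhitespace)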
To analyze the textual data, we use a Document-Term Matrix (DTM) representation: documents as the rows, terms/words as the columns, and the frequency of each term in each document as the entries. Because of the number of unique words in the corpus, the matrix can be very large and sparse, so sparse terms are removed.
DTM <- DocumentTermMatrix(doc2)
DTM<-removeSparseTerms(DTM,sparse = 0.99)
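A quick check of the matrix shape shows how far removeSparseTerms() shrinks the vocabulary (a sketch; the exact numbers depend on the sample used):
dim(DTM)                 # rows = documents, columns = terms kept at 99% sparsity
inspect(DTM[1:3, 1:5])   # small corner of the matrix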
List term frequencies and visualize
freq = data.frame(sort(colSums(as.matrix(DTM)), decreasing=TRUE))
head(freq,50)
## sort.colSums.as.matrix.DTM....decreasing...TRUE.
## the 4153
## one 2872
## will 2520
## time 2468
## like 2428
## just 2281
## can 2220
## get 2095
## make 1776
## know 1587
## day 1480
## love 1377
## use 1376
## year 1371
## thing 1352
## now 1335
## becaus 1313
## want 1312
## and 1290
## think 1286
## work 1274
## peopl 1251
## new 1245
## way 1229
## even 1227
## first 1194
## also 1193
## see 1192
## look 1168
## this 1166
## good 1140
## take 1128
## much 1127
## onli 1127
## back 1127
## but 1106
## veri 1103
## say 1098
## realli 1094
## well 1070
## littl 1058
## come 1047
## need 1023
## start 941
## – 930
## week 899
## feel 888
## mani 883
## two 858
## ani 857
wordcloud(rownames(freq), freq[,1], max.words = 50, colors = brewer.pal(8, "Dark2"))
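Since ggplot2 is already loaded, the most frequent terms can also be shown as a bar chart. A sketch (top_terms is an illustrative name; the first column of freq is given a short name because data.frame() generated the long one seen above):
top_terms <- data.frame(word = rownames(freq), count = freq[, 1])[1:20, ]
ggplot(top_terms, aes(x = reorder(word, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Term", y = "Frequency", title = "Top 20 terms in the blog sample")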
In a nutshell, we were able to download the data and load it into R, examine the files in the “en_US” folder, and compute basic statistics on the three files it contains. We then listed the most frequent words in “en_US.blogs.txt”. Future predictive modeling work is to follow.