1.Introduction

This report is part of the Coursera Data Science capstone project. The data are a corpus of text documents downloaded from the Coursera course page (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). The goal of the project is to build a predictive model and an app around it. The first step is to download the data and carry out a basic exploratory analysis; this report presents the results of that first step.

2.Download Data

library(utils)
direction<-"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(direction,"data.zip")
unzip("data.zip")

3.Load and Check Data

We use only the English (en_US) files. The code below assumes the working directory is the folder that contains them.

library(stringi)
news<-readLines("en_US.news.txt", encoding="UTF-8")
## Warning in readLines("en_US.news.txt", encoding = "UTF-8"): incomplete
## final line found on 'en_US.news.txt'
blogs<-readLines("en_US.blogs.txt", encoding="UTF-8")
twitter<-readLines("en_US.twitter.txt", encoding="UTF-8")
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line 167155
## appears to contain an embedded nul
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line 268547
## appears to contain an embedded nul
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line 1274086
## appears to contain an embedded nul
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line 1759032
## appears to contain an embedded nul
stri_stats_general(news)
##       Lines LinesNEmpty       Chars CharsNWhite 
##       77259       77259    15639408    13072698
stri_stats_general(blogs)
##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539
stri_stats_general(twitter)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162096031   134082634
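
The warnings above suggest that the files contain bytes that readLines handles awkwardly when a file is read in text mode: embedded nul characters, and possibly control characters that can end a read early on some platforms. A minimal sketch of a more defensive read, assuming a binary connection and the skipNul argument of readLines; the temporary file here is a made-up stand-in for the real corpus files:

```r
# Toy file with an embedded nul byte, standing in for en_US.twitter.txt.
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("first line\n"), as.raw(0), charToRaw("second line\n")), tmp)

# Open in binary mode ("rb") so the bytes are read as they are on disk,
# and drop embedded nuls instead of warning about them.
con <- file(tmp, open = "rb")
lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
unlink(tmp)

length(lines)
```

If the line counts reported by stri_stats_general change after reading the real files this way, the earlier counts were probably affected by the warnings.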

4.Preliminary Data Cleaning

The full dataset is too large to process comfortably, so we randomly sample 1000 lines from each of the three sources and combine them into a single character vector.

set.seed(123)
blogs_sample<-sample(blogs,1000)
twitter_sample<-sample(twitter,1000)
news_sample<-sample(news,1000)
data<-c(blogs_sample,twitter_sample,news_sample)

Remove punctuation, numbers, and excess whitespace, and convert all characters to lowercase.

library(tm)
## Loading required package: NLP
datac<-removePunctuation(data)
datac<-removeNumbers(datac)
datac<-stripWhitespace(datac)
datac<-tolower(datac)
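
To make the effect of these four steps concrete, here is a small sketch that applies them to one sentence. The sentence is made up for illustration and is not taken from the corpus.

```r
library(tm)

# Made-up input, not from the corpus.
toy <- "It's 9 AM,  and   counting!"

cleaned <- removePunctuation(toy)   # drops the apostrophe, comma and "!"
cleaned <- removeNumbers(cleaned)   # drops the digit
cleaned <- stripWhitespace(cleaned) # collapses runs of spaces to one space
cleaned <- tolower(cleaned)

cleaned
# "its am and counting"
```

Note that removing punctuation merges contractions such as "It's" into "its", which may matter later when predicting words.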

5.Exploratory Analysis

Compute single-word (unigram) frequencies and plot the ten most frequent.

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(RWeka)
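# Return the ten most frequent items of a token vector as a data frame
# with columns `ta` (the token) and `Freq` (its count).
tenmost<-function(ta) {
  freq<-as.data.frame(table(ta))
  freq<-freq[order(freq$Freq,decreasing = TRUE),][1:10,]
  return(freq)
}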
# Tokenize the cleaned text (datac) into single words
one_word<-NGramTokenizer(datac, Weka_control(min=1, max=1))
one_word_ten<-tenmost(one_word)
ggplot(data=one_word_ten,aes(y=Freq,x=ta))+geom_bar(stat="identity")+scale_x_discrete(limits=one_word_ten$ta)

Compute two-word (bigram) frequencies and plot the ten most frequent.

# Tokenize the cleaned text (datac) into two-word sequences
two_word<-NGramTokenizer(datac, Weka_control(min=2, max=2))
two_word_ten<-tenmost(two_word)
ggplot(data=two_word_ten,aes(y=Freq,x=ta))+geom_bar(stat="identity")+scale_x_discrete(limits=two_word_ten$ta)

Compute three-word (trigram) frequencies and plot the ten most frequent.

# Tokenize the cleaned text (datac) into three-word sequences
three_word<-NGramTokenizer(datac, Weka_control(min=3, max=3))
three_word_ten<-tenmost(three_word)
ggplot(data=three_word_ten,aes(y=Freq,x=ta))+geom_bar(stat="identity")+scale_x_discrete(limits=three_word_ten$ta)
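
For readers who want to see what n-gram counting involves, the idea can be sketched in base R. This is an illustration only: count_ngrams is a hypothetical helper, it splits on whitespace alone, and it is not how RWeka's NGramTokenizer is implemented. The toy sentences are made up.

```r
# Hypothetical helper: count n-grams in a character vector of lines.
# N-grams never cross a line boundary.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, "[[:space:]]+"), function(tokens) {
    tokens <- tokens[nzchar(tokens)]              # drop empty tokens
    if (length(tokens) < n) return(character(0))  # line too short for an n-gram
    starts <- seq_len(length(tokens) - n + 1)
    # Paste each window of n consecutive tokens into one string.
    vapply(starts,
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

# Made-up input, not from the corpus.
toy <- c("the cat sat on the mat", "the cat ran")
bigram_counts <- count_ngrams(toy, 2)
head(bigram_counts, 3)
```

Here "the cat" occurs in both toy lines, so it comes out on top with a count of 2.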

6.Conclusion

As we can see, the dataset is large, and building a model on all of it would take a great deal of memory and computation, so the most important open question is how to make the algorithm efficient. As the plots show, the most frequent single words are mostly prepositions and pronouns. On their own they usually carry little meaning, so we should decide in the next step how to handle them; we may remove them as noise. In this step we took 1000 samples from each source, but in later steps we should weight each source appropriately, because the three sources clearly differ (in size, for example).
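
One way the weighting could work is to sample from each source in proportion to its number of lines. The sketch below is only one assumed scheme, not a decision made in this report; proportional_sample is a hypothetical helper and the toy vectors stand in for blogs, twitter and news.

```r
# Hypothetical helper: draw about `total` lines overall, allocating them to
# each source in proportion to its line count (an assumed weighting scheme).
proportional_sample <- function(groups, total) {
  counts <- vapply(groups, length, integer(1))
  sizes <- round(total * counts / sum(counts))
  sizes <- pmin(sizes, counts)  # never ask for more lines than a source has
  samples <- Map(function(x, k) sample(x, k), groups, sizes)
  list(sizes = sizes, data = unlist(samples, use.names = FALSE))
}

set.seed(123)
# Toy stand-ins for the three sources.
toy <- list(
  blogs   = paste("blog line", 1:300),
  twitter = paste("tweet", 1:600),
  news    = paste("news line", 1:100)
)
result <- proportional_sample(toy, 100)
result$sizes
```

Because of rounding, the sizes may not always add up to exactly `total`; with the toy counts above they do.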