1.Introduction
This is a report on the Coursera data science project. The data is a corpus of text documents downloaded from Coursera's page (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). The goal of the project is to construct a predictive model and build an app around it. The first step is to download the data and perform some basic exploratory analysis; this report presents the results of that first step.
2.Download Data
library(utils)
direction<-"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(direction,"data.zip")
unzip("data.zip")
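The archive is large, so re-running the report downloads it again every time. A minimal sketch of one way to avoid that, assuming a hypothetical helper named download_if_missing (not part of the original report) that only downloads when the target file is absent:

```r
# Hypothetical helper (not in the original report): download a file only
# when it is not already present. Returns TRUE if a download happened.
download_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(FALSE)                           # nothing to do, reuse the local copy
  }
  download.file(url, destfile, mode = "wb") # binary mode is the safe choice for zip files
  TRUE
}

# Possible usage with the names from the report:
# download_if_missing(direction, "data.zip")
# unzip("data.zip")
```

Calling unzip() afterwards is still needed; the sketch only guards the download step.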
3.Load and Check Data
We use the English (en_US) files only.
library(stringi)
news<-readLines("en_US.news.txt", encoding="UTF-8")
## Warning in readLines("en_US.news.txt", encoding = "UTF-8"): incomplete
## final line found on 'en_US.news.txt'
blogs<-readLines("en_US.blogs.txt", encoding="UTF-8")
twitter<-readLines("en_US.twitter.txt", encoding="UTF-8")
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line 167155
## appears to contain an embedded nul
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line 268547
## appears to contain an embedded nul
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line 1274086
## appears to contain an embedded nul
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line 1759032
## appears to contain an embedded nul
stri_stats_general(news)
## Lines LinesNEmpty Chars CharsNWhite
## 77259 77259 15639408 13072698
stri_stats_general(blogs)
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general(twitter)
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096031 134082634
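The readLines() warnings above ("incomplete final line" and "embedded nul") suggest that some lines may have been read incompletely, and on some platforms a text-mode read can stop early at a control character. A minimal sketch of a commonly used workaround, assuming a hypothetical helper named read_corpus: it opens the file through a binary-mode connection and asks readLines() to skip embedded nuls. The counts reported above were produced without it and may differ if it is used.

```r
# Hypothetical helper (not in the original report): read a text file via a
# binary-mode connection so that control characters do not end the read early,
# and skip embedded nul bytes instead of truncating the affected lines.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))                       # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Possible usage with the file names from the report:
# news    <- read_corpus("en_US.news.txt")
# twitter <- read_corpus("en_US.twitter.txt")
```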
4.Preliminary Data Cleaning
The dataset is too large to process in full, so we randomly draw 1,000 lines from each source (blogs, Twitter, news) and combine them into a single character vector.
set.seed(123)
blogs_sample<-sample(blogs,1000)
twitter_sample<-sample(twitter,1000)
news_sample<-sample(news,1000)
data<-c(blogs_sample,twitter_sample,news_sample)
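Drawing the same number of lines from each source ignores that the three files differ greatly in size. As an alternative, here is a minimal sketch of proportional sampling, where each source contributes in proportion to its line count. The helper name proportional_sample and the toy input are assumptions for illustration, not the method used in this report.

```r
# Hypothetical helper (not in the original report): sample `total` lines
# overall, giving each source a share proportional to its number of lines.
proportional_sample <- function(sources, total) {
  sizes  <- lengths(sources)                     # number of lines per source
  n_each <- round(total * sizes / sum(sizes))    # proportional allocation
  n_each <- pmin(n_each, sizes)                  # never ask for more than exists
  unlist(Map(function(x, n) sample(x, n), sources, n_each), use.names = FALSE)
}

# Toy example: source "a" has 10 lines, source "b" has 30 lines
toy <- list(a = rep("a", 10), b = rep("b", 30))
set.seed(123)
s <- proportional_sample(toy, 8)   # 2 lines from "a", 6 lines from "b"
```

With the real data this could be called as proportional_sample(list(blogs, twitter, news), 3000). Because of rounding, the returned total can differ slightly from the requested one.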
Remove punctuation, numbers, and excess whitespace, and convert all characters to lowercase.
library(tm)
## Loading required package: NLP
datac<-removePunctuation(data)
datac<-removeNumbers(datac)
datac<-stripWhitespace(datac)
datac<-tolower(datac)
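To make the effect of these four steps concrete, here is a base-R sketch that approximates them with regular expressions on a toy string. The helper name clean_text is hypothetical, and the tm functions may behave differently in edge cases (for example with non-ASCII punctuation).

```r
# Hypothetical helper (not in the original report): approximate the cleaning
# steps above with base-R regular expressions.
clean_text <- function(x) {
  x <- gsub("[[:punct:]]+", "", x)    # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)    # drop numbers
  x <- gsub("[[:space:]]+", " ", x)   # collapse runs of whitespace
  tolower(x)                          # convert to lowercase
}

cleaned <- clean_text("Hello, World! 123  times")
# cleaned is "hello world times"
```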
5.Exploratory Analysis
Calculate single-word (unigram) frequencies and plot the ten most frequent words.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(RWeka)
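# Return the ten most frequent values of ta as a data frame (columns: ta, Freq)
tenmost<-function(ta) {
freq<-as.data.frame(table(ta))
# sort by descending frequency and keep the first ten rows
freq<-freq[order(freq$Freq,decreasing = TRUE),][1:10,]
return(freq)
}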
one_word<-NGramTokenizer(datac, Weka_control(min=1, max=1)) # tokenize the cleaned text, not the raw sample
one_word_ten<-tenmost(one_word)
ggplot(data=one_word_ten,aes(y=Freq,x=ta))+geom_bar(stat="identity")+scale_x_discrete(limits=one_word_ten$ta)

Calculate two-word (bigram) frequencies and plot the ten most frequent.
two_word<-NGramTokenizer(datac, Weka_control(min=2, max=2)) # tokenize the cleaned text, not the raw sample
two_word_ten<-tenmost(two_word)
ggplot(data=two_word_ten,aes(y=Freq,x=ta))+geom_bar(stat="identity")+scale_x_discrete(limits=two_word_ten$ta)

Calculate three-word (trigram) frequencies and plot the ten most frequent.
three_word<-NGramTokenizer(datac, Weka_control(min=3, max=3)) # tokenize the cleaned text, not the raw sample
three_word_ten<-tenmost(three_word)
ggplot(data=three_word_ten,aes(y=Freq,x=ta))+geom_bar(stat="identity")+scale_x_discrete(limits=three_word_ten$ta)
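The n-gram counts above rely on RWeka, which needs a Java installation. For readers who want to see the idea without that dependency, here is a base-R sketch of the same technique: split each line into words, slide a window of n words over it, and count the resulting strings. The names ngrams_base and top_n are hypothetical, and the tokenization is a simplification (whitespace only), so its counts may not match RWeka's exactly.

```r
# Hypothetical helper (not in the original report): build word n-grams
# line by line, so that n-grams never span two different lines.
ngrams_base <- function(lines, n) {
  tokens <- strsplit(trimws(lines), "[[:space:]]+")
  unlist(lapply(tokens, function(w) {
    w <- w[nzchar(w)]                         # drop empty tokens
    if (length(w) < n) return(character(0))   # line too short for an n-gram
    starts <- seq_len(length(w) - n + 1)
    vapply(starts,
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }), use.names = FALSE)
}

# Hypothetical helper: the k most frequent n-grams as a sorted table
top_n <- function(grams, k = 10) {
  head(sort(table(grams), decreasing = TRUE), k)
}

grams <- ngrams_base(c("a b a b", "a b"), 2)
# grams is c("a b", "b a", "a b", "a b")
top <- top_n(grams, 1)
```

Keeping n-grams within a line is a design choice here: it avoids counting word pairs that merely happen to sit at the end of one tweet and the start of the next.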

6.Conclusion
As we can see, the dataset is large, and building a model on all of it would take too much memory and computation, so the most important question is how to optimize the algorithm. As the plots show, most of the top single words are prepositions and pronouns. Most of the time they carry little meaning, so we should consider how to handle them in the next step, possibly by removing them as noise. In this step we drew 1,000 samples from each source, but we should also take the relative weight of each source into account, because the three sources clearly differ.
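As an illustration of the noise-removal idea, here is a minimal base-R sketch that filters very common words out of a vector of tokens. The word list is a small hand-picked example and the helper name drop_stop_words is hypothetical; a real analysis would more likely use an established stop-word list.

```r
# Illustrative, hand-picked stop-word list (an assumption, not a standard list)
stop_words <- c("the", "to", "and", "a", "of", "in", "i", "it")

# Hypothetical helper (not in the original report): drop tokens that appear
# in the stop-word list, ignoring case.
drop_stop_words <- function(words, stop = stop_words) {
  words[!(tolower(words) %in% stop)]
}

kept <- drop_stop_words(c("The", "cat", "sat", "in", "the", "hat"))
# kept is c("cat", "sat", "hat")
```

Whether to remove such words depends on the goal: they add little to a frequency plot, but a next-word predictor may still need them, since they are among the words people type most often.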