The first step in analyzing this SwiftKey data set is figuring out (a) what data you have and (b) what the standard tools and models for that type of data are. We then perform a thorough exploratory analysis, examining the distribution of words and the relationships between words in the corpora.
In this report we follow these steps to understand the data and complete some basic exploration:
0. Background setting
1. Load the datasets into R
2. Basic summaries of the three files
3. Basic data cleaning
4. Features of the data
5. Goals for the eventual app and algorithm
# 0. Background setting
## Loading required package: NLP
## Loading required package: RColorBrewer
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
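The messages above come from attaching the packages used in this report: tm (which loads NLP), wordcloud (which loads RColorBrewer), and ggplot2. A minimal setup chunk would look like the following; the library(knitr) call is an assumption, inferred from the kable() calls later on.

library(tm)        # text mining: Corpus, tm_map, transformations; loads NLP
library(wordcloud) # word clouds; loads RColorBrewer for palettes
library(ggplot2)   # plotting; masks annotate() from NLP
library(knitr)     # kable() for the summary tables below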
# 1. Load the datasets into R
twitter <- readLines(con <- file("./en_US.twitter.txt"),
                     encoding = "UTF-8", skipNul = TRUE)
close(con)
blogs <- readLines(con <- file("./en_US.blogs.txt"),
encoding = "UTF-8", skipNul = TRUE)
close(con)
news <- readLines(con <- file("./en_US.news.txt"),
encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(con <- file("./en_US.news.txt"), encoding = "UTF-8", :
## incomplete final line found on './en_US.news.txt'
close(con)
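The warning above indicates that en_US.news.txt contains a control character that ends text-mode reading early (note the small line count for news in the table below). A common workaround, not part of the code above, is to read the file in binary mode:

con <- file("./en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)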
# 2. Basic summaries of the three files
kable(tableL)
| source | lines |
|---|---|
| twitter | 2360148 |
| blogs | 899288 |
| news | 77259 |
kable(tableS)

| source | size (MB) |
|---|---|
| twitter | 163.1888 |
| blogs | 200.9882 |
| news | 205.2344 |
kable(tableW)

| source | words |
|---|---|
| twitter | 30373583 |
| blogs | 37334131 |
| news | 2643969 |
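The objects tableL, tableS, and tableW are not constructed in the code shown above. A minimal sketch of how such summaries might be computed, assuming the second table reports file size in MB and counting words by a simple whitespace split (the exact metrics used in the report are not shown):

files <- c(twitter = "./en_US.twitter.txt",
           blogs   = "./en_US.blogs.txt",
           news    = "./en_US.news.txt")
texts <- list(twitter = twitter, blogs = blogs, news = news)
tableL <- data.frame(source = names(texts),
                     lines  = sapply(texts, length))
tableS <- data.frame(source = names(files),
                     size_MB = file.info(files)$size / 1024^2)
tableW <- data.frame(source = names(texts),
                     words  = sapply(texts, function(x)
                                sum(lengths(strsplit(x, "\\s+")))))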
# 3. Basic data cleaning
## Produce a function to clean the data:
## 3.1 The basic steps of cleaning the data:
# Convert to ASCII so the tm transformations handle the text cleanly;
# non-convertible bytes are substituted rather than dropped
cleanedT <- iconv(twitter, 'UTF-8', 'ASCII', "byte")
cleanedB <- iconv(blogs, 'UTF-8', 'ASCII', "byte")
cleanedN <- iconv(news, 'UTF-8', 'ASCII', "byte")
# Draw 5000-line samples (with replacement) from each source for faster exploration
set.seed(404)
Tsample <- sample(cleanedT, 5000, replace = TRUE)
Bsample <- sample(cleanedB, 5000, replace = TRUE)
Nsample <- sample(cleanedN, 5000, replace = TRUE)
## 3.2 Build a function to simplify the data processing:
BasicClean <- function(x) {
  Dvector <- VectorSource(x)
  dCorpus <- Corpus(Dvector)
  dCorpus <- tm_map(dCorpus, tolower)           # lowercase all text
  dCorpus <- tm_map(dCorpus, removePunctuation) # strip punctuation
  dCorpus <- tm_map(dCorpus, removeNumbers)     # strip digits
  dCorpus <- tm_map(dCorpus, stripWhitespace)   # collapse repeated whitespace
  dCorpus <- tm_map(dCorpus, PlainTextDocument) # coerce to plain text documents
  return(dCorpus)
}
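tm_map() on a SimpleCorpus emits a "transformation drops documents" warning for each step below; the warnings are cosmetic here. For base functions such as tolower, the idiomatic tm call wraps them in content_transformer(), a variant of the code above rather than what this report ran:

dCorpus <- tm_map(dCorpus, content_transformer(tolower))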
### Clean twitter data:
TWSCorpus<-BasicClean(Tsample)
## Warning in tm_map.SimpleCorpus(dCorpus, tolower): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(dCorpus, removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(dCorpus, removeNumbers): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(dCorpus, stripWhitespace): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(dCorpus, PlainTextDocument): transformation
## drops documents
TWSCorpus <- Corpus(VectorSource(TWSCorpus)) # re-wrap the transformed documents as a fresh Corpus
### Clean blogs data:
BlogCorpus<-BasicClean(Bsample)
BlogCorpus <- Corpus(VectorSource(BlogCorpus))
### Clean news data:
NewsCorpus<-BasicClean(Nsample)
NewsCorpus <- Corpus(VectorSource(NewsCorpus))
# 4. Features of the data
wordcloud(TWSCorpus, max.words=200, colors=brewer.pal(8,"Dark2"))
wordcloud(BlogCorpus, max.words=200, colors=brewer.pal(8,"Dark2"))
wordcloud(NewsCorpus, max.words=200, colors=brewer.pal(8,"Dark2"))
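Beyond the word clouds, token frequencies can be inspected directly through a term-document matrix. A short sketch using the cleaned twitter corpus from above (an illustration, not output from this report):

tdm  <- TermDocumentMatrix(TWSCorpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 10)  # ten most frequent tokens in the twitter sample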
# 5. Goals for the eventual app and algorithm
- After this exploration of the data, the next step is to use n-grams to tokenize the text and build the model (see the sketch after this list).
- Token frequencies will be used in building the model.
- These sets of n-grams will then feed a predictive model.
- To make the project easier to use, a Shiny app will serve as the user interface for interacting with the predictive model to predict the next word.
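As a preview of the n-gram step, here is a minimal base-R bigram tokenizer and frequency count over the twitter sample. This is a sketch only; the eventual model may rely on a dedicated tokenizer package instead.

bigrams <- function(line) {
  w <- unlist(strsplit(tolower(line), "\\s+"))
  if (length(w) < 2) return(character(0))
  paste(w[-length(w)], w[-1])  # join consecutive word pairs
}
topBigrams <- sort(table(unlist(lapply(Tsample, bigrams))), decreasing = TRUE)
head(topBigrams, 10)  # most frequent bigrams in the twitter sample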