The first step in analyzing this SwiftKey data set is figuring out (a) what data you have and (b) what the standard tools and models for that type of data are. We then perform a thorough exploratory analysis, examining the distribution of words and the relationships between words in the corpora.
In this report we follow these steps to understand the data and complete some basic exploration:
0. Background setting
1. Load the datasets into R
2. Basic summaries of the three files
3. Basic data cleaning
4. Features of the data
5. Goals for the eventual app and algorithm
# 0. Background setting
## Loading required package: NLP
## Loading required package: RColorBrewer
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
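The messages above come from attaching the packages used in this report: tm (which loads NLP), wordcloud (which loads RColorBrewer), and ggplot2. A minimal setup chunk would look like the following; the library(knitr) call is an assumption, inferred from the kable() calls later on.

library(tm)        # text mining: Corpus, tm_map, transformations; loads NLP
library(wordcloud) # word clouds; loads RColorBrewer for palettes
library(ggplot2)   # plotting; masks annotate() from NLP
library(knitr)     # kable() for the summary tables below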
# 1. Load the datasets into R
twitter <- readLines(con <- file("./en_US.twitter.txt"),
                     encoding = "UTF-8", skipNul = TRUE)
close(con)
blogs <- readLines(con <- file("./en_US.blogs.txt"),
encoding = "UTF-8", skipNul = TRUE)
close(con)
news <- readLines(con <- file("./en_US.news.txt"),
encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(con <- file("./en_US.news.txt"), encoding = "UTF-8", :
## incomplete final line found on './en_US.news.txt'
close(con)
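The warning above indicates that en_US.news.txt contains a control character that ends text-mode reading early (note the small line count for news in the table below). A common workaround, not part of the code above, is to read the file in binary mode:

con <- file("./en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)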
# 2. Basic summaries of the three files
kable(tableL)
| source | lines |
|---|---|
| twitter | 2360148 |
| blogs | 899288 |
| news | 77259 |
kable(tableS)

| source | size (MB) |
|---|---|
| twitter | 163.1888 |
| blogs | 200.9882 |
| news | 205.2344 |
kable(tableW)

| source | words |
|---|---|
| twitter | 30373583 |
| blogs | 37334131 |
| news | 2643969 |
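The objects tableL, tableS, and tableW are not constructed in the code shown above. A minimal sketch of how such summaries might be computed, assuming the second table reports file size in MB and counting words by a simple whitespace split (the exact metrics used in the report are not shown):

files <- c(twitter = "./en_US.twitter.txt",
           blogs   = "./en_US.blogs.txt",
           news    = "./en_US.news.txt")
texts <- list(twitter = twitter, blogs = blogs, news = news)
tableL <- data.frame(source = names(texts),
                     lines  = sapply(texts, length))
tableS <- data.frame(source = names(files),
                     size_MB = file.info(files)$size / 1024^2)
tableW <- data.frame(source = names(texts),
                     words  = sapply(texts, function(x)
                                sum(lengths(strsplit(x, "\\s+")))))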
# 3. Basic data cleaning
## Produce a function to clean the data:
## 3.1 The basic steps of cleaning the data:
# Convert to ASCII so the tm transformations handle the text cleanly;
# non-convertible bytes are substituted rather than dropped
cleanedT <- iconv(twitter, 'UTF-8', 'ASCII', "byte")
cleanedB <- iconv(blogs, 'UTF-8', 'ASCII', "byte")
cleanedN <- iconv(news, 'UTF-8', 'ASCII', "byte")
# Draw 5000-line samples (with replacement) from each source for faster exploration
set.seed(404)
Tsample <- sample(cleanedT, 5000, replace = TRUE)
Bsample <- sample(cleanedB, 5000, replace = TRUE)
Nsample <- sample(cleanedN, 5000, replace = TRUE)
## 3.2 Build a function to simplify the data processing:
BasicClean <- function(x) {
  Dvector <- VectorSource(x)
  dCorpus <- Corpus(Dvector)
  dCorpus <- tm_map(dCorpus, tolower)           # lowercase all text
  dCorpus <- tm_map(dCorpus, removePunctuation) # strip punctuation
  dCorpus <- tm_map(dCorpus, removeNumbers)     # strip digits
  dCorpus <- tm_map(dCorpus, stripWhitespace)   # collapse repeated whitespace
  dCorpus <- tm_map(dCorpus, PlainTextDocument) # coerce to plain text documents
  return(dCorpus)
}
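tm_map() on a SimpleCorpus emits a "transformation drops documents" warning for each step below; the warnings are cosmetic here. For base functions such as tolower, the idiomatic tm call wraps them in content_transformer(), a variant of the code above rather than what this report ran:

dCorpus <- tm_map(dCorpus, content_transformer(tolower))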
### Clean twitter data:
TWSCorpus<-BasicClean(Tsample)
## Warning in tm_map.SimpleCorpus(dCorpus, tolower): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(dCorpus, removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(dCorpus, removeNumbers): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(dCorpus, stripWhitespace): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(dCorpus, PlainTextDocument): transformation
## drops documents
TWSCorpus <- Corpus(VectorSource(TWSCorpus)) # re-wrap the transformed documents as a fresh Corpus
### Clean blogs data:
BlogCorpus<-BasicClean(Bsample)
BlogCorpus <- Corpus(VectorSource(BlogCorpus))
### Clean news data:
NewsCorpus<-BasicClean(Nsample)
NewsCorpus <- Corpus(VectorSource(NewsCorpus))
# 4. Features of the data
wordcloud(TWSCorpus, max.words=200, colors=brewer.pal(8,"Dark2"))
wordcloud(BlogCorpus, max.words=200, colors=brewer.pal(8,"Dark2"))
wordcloud(NewsCorpus, max.words=200, colors=brewer.pal(8,"Dark2"))
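Beyond the word clouds, token frequencies can be inspected directly through a term-document matrix. A short sketch using the cleaned twitter corpus from above (an illustration, not output from this report):

tdm  <- TermDocumentMatrix(TWSCorpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 10)  # ten most frequent tokens in the twitter sample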
# 5. Goals for the eventual app and algorithm
- After this exploration of the data, the next step is to use n-grams to tokenize the text and build the model (see the sketch after this list).
- Token frequencies will be used in building the model.
- These sets of n-grams will then feed a predictive model.
- To make the project easier to use, a Shiny app will serve as the user interface for interacting with the predictive model to predict the next word.
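As a preview of the n-gram step, here is a minimal base-R bigram tokenizer and frequency count over the twitter sample. This is a sketch only; the eventual model may rely on a dedicated tokenizer package instead.

bigrams <- function(line) {
  w <- unlist(strsplit(tolower(line), "\\s+"))
  if (length(w) < 2) return(character(0))
  paste(w[-length(w)], w[-1])  # join consecutive word pairs
}
topBigrams <- sort(table(unlist(lapply(Tsample, bigrams))), decreasing = TRUE)
head(topBigrams, 10)  # most frequent bigrams in the twitter sample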