The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.
The motivation for this project is to:
Demonstrate that you’ve downloaded the data and have successfully loaded it in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings that you amassed so far.
Get feedback on your plans for creating a prediction algorithm and Shiny app.
In this part I'm going to download, unzip, and select the data to use.
# Download and unzip the Coursera SwiftKey data set (skip the download if the zip is already present)
name_file <- "Coursera-SwiftKey.zip"
source <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists(name_file)) download.file(source, name_file)
file <- unzip(name_file)
We need to know the file sizes because they matter for deciding the strategy to develop the model. I select the English-language (en_US) files from the Blogs, Twitter, and News sources.
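The chunk that produced the output below is not echoed; for reference, here is a minimal sketch of how this step can be reproduced. It assumes the data/en_US paths from the listing below and creates the fBlog, fTwitter and fNews objects that are reused later in the report.

# Sketch only: list the unzipped files, load the English files, and report sizes and line counts
print(paste("Files downloaded:", file))

fBlog    <- readLines("data/en_US/en_US.blogs.txt",   encoding = "UTF-8")
fTwitter <- readLines("data/en_US/en_US.twitter.txt", encoding = "UTF-8")
fNews    <- readLines("data/en_US/en_US.news.txt",    encoding = "UTF-8")

print(paste("Blog Size in MB:",    file.info("data/en_US/en_US.blogs.txt")$size   / 1024^2))
print(paste("Twitter Size in MB:", file.info("data/en_US/en_US.twitter.txt")$size / 1024^2))
print(paste("News Size in MB:",    file.info("data/en_US/en_US.news.txt")$size    / 1024^2))

print(paste("Number of lines in blog:",    length(fBlog)))
print(paste("Number of lines in twitter:", length(fTwitter)))
print(paste("Number of lines in News:",    length(fNews)))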
## [1] "Files downloaded: data//de_DE/de_DE.blogs.txt"
## [2] "Files downloaded: data//de_DE/de_DE.news.txt"
## [3] "Files downloaded: data//de_DE/de_DE.twitter.txt"
## [4] "Files downloaded: data//en_US/en_US.blogs.txt"
## [5] "Files downloaded: data//en_US/en_US.news.txt"
## [6] "Files downloaded: data//en_US/en_US.twitter.txt"
## [7] "Files downloaded: data//fi_FI/fi_FI.blogs.txt"
## [8] "Files downloaded: data//fi_FI/fi_FI.news.txt"
## [9] "Files downloaded: data//fi_FI/fi_FI.twitter.txt"
## [10] "Files downloaded: data//ru_RU/ru_RU.blogs.txt"
## [11] "Files downloaded: data//ru_RU/ru_RU.news.txt"
## [12] "Files downloaded: data//ru_RU/ru_RU.twitter.txt"
## [1] "Blog Size in MB: 200.424207687378"
## [1] "Twitter Size in MB: 159.364068984985"
## [1] "News Size in MB: 196.277512550354"
## [1] "Number of lines in blog: 899288"
## [1] "Number of lines in twitter: 2360148"
## [1] "Number of lines in News: 77259"
I counted the words in each line of the files and summarised the counts for an initial look at the content.
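The counting code is not shown above; a minimal sketch of one way to obtain these per-line counts, assuming the fBlog, fTwitter and fNews objects from the earlier sketch and using the stringi package as the word counter:

# Sketch only: per-line word counts for each file (stringi::stri_count_words is one possible counter)
library(stringi)

wBlog    <- stri_count_words(fBlog)
wTwitter <- stri_count_words(fTwitter)
wNews    <- stri_count_words(fNews)

print("Summary of Words in Blog file:")
print(summary(wBlog))
print("Summary of Words in Twitter file:")
print(summary(wTwitter))
print("Summary of Words in News file:")
print(summary(wNews))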
## [1] "Summary of Words in Blog file:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
## [1] "Summary of Words in Twitter file:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
## [1] "Summary of Words in News file:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.62 46.00 1123.00
Histograms of the words per line in the Blog, Twitter, and News files are plotted here (ggplot2 used its default of 30 bins for each).
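A minimal sketch of one of these histograms, assuming the wBlog counts from the sketch above:

# Sketch only: words-per-line histogram for the Blog file; geom_histogram() is
# left at its defaults, which is what produces the bins = 30 message
library(ggplot2)

ggplot(data.frame(words = wBlog), aes(x = words)) +
  geom_histogram() +
  labs(title = "Words per line in the Blog file", x = "Words per line", y = "Lines")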
In this exploratory stage, I took a sample to make the data easier to process: 1,000 lines from each file, consolidated into a single data set.
set.seed(7472)
# Sample 1,000 lines from each source
sBlog <- fBlog[sample(1:length(fBlog), 1000)]
sTwitter <- fTwitter[sample(1:length(fTwitter), 1000)]
sNews <- fNews[sample(1:length(fNews), 1000)]
# Consolidate the samples into a single character vector
sData <- c(sTwitter, sNews, sBlog)
I apply some cleaning tasks to the sample, removing punctuation, numbers, extra whitespace, and English stop words.
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
Corp <- Corpus(VectorSource(sData))
# Replace quotes, slashes, @ signs and pipes with spaces
sSpce <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
Corp <- tm_map(Corp, sSpce, "\"|/|@|\\|")
Corp <- tm_map(Corp, content_transformer(tolower))       # lower-case everything
Corp <- tm_map(Corp, removePunctuation)                  # remove punctuation
Corp <- tm_map(Corp, removeNumbers)                      # remove digits
Corp <- tm_map(Corp, stripWhitespace)                    # collapse repeated spaces
Corp <- tm_map(Corp, removeWords, stopwords('english'))  # remove English stop words
Next I create unigrams, bigrams, and trigrams to look at more specific information about the data.
library(RWeka)

# Tokenize the corpus into n-grams of a given size and return the 'top' most frequent ones
fNGrams <- function(Corp, grams, top) {
  ngram <- NGramTokenizer(Corp, Weka_control(min = grams, max = grams,
                                             delimiters = " \\r\\n\\t.,;:\"()?!"))
  ngram <- data.frame(table(ngram))
  ngram <- ngram[order(ngram$Freq, decreasing = TRUE), ][1:top, ]
  colnames(ngram) <- c("Words", "Count")
  ngram
}

moGrams <- fNGrams(Corp, 1, 50)   # top 50 unigrams
biGrams <- fNGrams(Corp, 2, 50)   # top 50 bigrams
triGrams <- fNGrams(Corp, 3, 50)  # top 50 trigrams
Plot the n-grams to examine the most frequent terms.
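As a sketch, one way to plot the bigram frequencies (the same pattern applies to moGrams and triGrams):

# Sketch only: frequency bar chart for the top bigrams
library(ggplot2)

ggplot(biGrams, aes(x = reorder(Words, Count), y = Count)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 50 bigrams in the sample", x = "Bigram", y = "Count")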
a. The data sources are free-form text and contain a lot of useless content, so it is important to carefully process, clean, and transform them in the proper order to extract meaningful information for building a model.
b. The size of the files forces us to work with samples during the construction and analysis of the model, and additional effort will be required to optimize the app's response time.