The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.
The motivation for this project is to:
Demonstrate that you’ve downloaded the data and have successfully loaded it in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings that you amassed so far.
Get feedback on your plans for creating a prediction algorithm and Shiny app.
In this part I'm going to download, unzip, and select the data to use.
# Download and unzip the Coursera SwiftKey data set (skip the download if the zip is already present)
name_file <- "Coursera-SwiftKey.zip"
source <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists(name_file)) download.file(source, name_file)
file <- unzip(name_file)
We need to know the file sizes because they matter for deciding the strategy to develop the model. I select the English-language (en_US) files from the Blogs, Twitter, and News sources.
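The chunk that produced the output below is not echoed; for reference, here is a minimal sketch of how this step can be reproduced. It assumes the data/en_US paths from the listing below and creates the fBlog, fTwitter and fNews objects that are reused later in the report.

# Sketch only: list the unzipped files, load the English files, and report sizes and line counts
print(paste("Files downloaded:", file))

fBlog    <- readLines("data/en_US/en_US.blogs.txt",   encoding = "UTF-8")
fTwitter <- readLines("data/en_US/en_US.twitter.txt", encoding = "UTF-8")
fNews    <- readLines("data/en_US/en_US.news.txt",    encoding = "UTF-8")

print(paste("Blog Size in MB:",    file.info("data/en_US/en_US.blogs.txt")$size   / 1024^2))
print(paste("Twitter Size in MB:", file.info("data/en_US/en_US.twitter.txt")$size / 1024^2))
print(paste("News Size in MB:",    file.info("data/en_US/en_US.news.txt")$size    / 1024^2))

print(paste("Number of lines in blog:",    length(fBlog)))
print(paste("Number of lines in twitter:", length(fTwitter)))
print(paste("Number of lines in News:",    length(fNews)))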
## [1] "Files downloaded: data//de_DE/de_DE.blogs.txt"
## [2] "Files downloaded: data//de_DE/de_DE.news.txt"
## [3] "Files downloaded: data//de_DE/de_DE.twitter.txt"
## [4] "Files downloaded: data//en_US/en_US.blogs.txt"
## [5] "Files downloaded: data//en_US/en_US.news.txt"
## [6] "Files downloaded: data//en_US/en_US.twitter.txt"
## [7] "Files downloaded: data//fi_FI/fi_FI.blogs.txt"
## [8] "Files downloaded: data//fi_FI/fi_FI.news.txt"
## [9] "Files downloaded: data//fi_FI/fi_FI.twitter.txt"
## [10] "Files downloaded: data//ru_RU/ru_RU.blogs.txt"
## [11] "Files downloaded: data//ru_RU/ru_RU.news.txt"
## [12] "Files downloaded: data//ru_RU/ru_RU.twitter.txt"
## [1] "Blog Size in MB: 200.424207687378"
## [1] "Twitter Size in MB: 159.364068984985"
## [1] "News Size in MB: 196.277512550354"
## [1] "Number of lines in blog: 899288"
## [1] "Number of lines in twitter: 2360148"
## [1] "Number of lines in News: 77259"
I counted the words in each line of the files and summarised the counts for an initial look at the content.
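The counting code is not shown above; a minimal sketch of one way to obtain these per-line counts, assuming the fBlog, fTwitter and fNews objects from the earlier sketch and using the stringi package as the word counter:

# Sketch only: per-line word counts for each file (stringi::stri_count_words is one possible counter)
library(stringi)

wBlog    <- stri_count_words(fBlog)
wTwitter <- stri_count_words(fTwitter)
wNews    <- stri_count_words(fNews)

print("Summary of Words in Blog file:")
print(summary(wBlog))
print("Summary of Words in Twitter file:")
print(summary(wTwitter))
print("Summary of Words in News file:")
print(summary(wNews))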
## [1] "Summary of Words in Blog file:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
## [1] "Summary of Words in Twitter file:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
## [1] "Summary of Words in News file:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.62 46.00 1123.00
Histograms of the words per line in the Blog, Twitter, and News files are plotted here (ggplot2 used its default of 30 bins for each).
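A minimal sketch of one of these histograms, assuming the wBlog counts from the sketch above:

# Sketch only: words-per-line histogram for the Blog file; geom_histogram() is
# left at its defaults, which is what produces the bins = 30 message
library(ggplot2)

ggplot(data.frame(words = wBlog), aes(x = words)) +
  geom_histogram() +
  labs(title = "Words per line in the Blog file", x = "Words per line", y = "Lines")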
In this exploratory stage, I took a sample to make the data easier to process: 1,000 lines from each file, consolidated into a single data set.
set.seed(7472)
# Sample 1,000 lines from each source
sBlog <- fBlog[sample(1:length(fBlog), 1000)]
sTwitter <- fTwitter[sample(1:length(fTwitter), 1000)]
sNews <- fNews[sample(1:length(fNews), 1000)]
# Consolidate the samples into a single character vector
sData <- c(sTwitter, sNews, sBlog)
I apply some cleaning tasks to the sample, removing punctuation, numbers, extra whitespace, and English stop words.
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
Corp <- Corpus(VectorSource(sData))
# Replace quotes, slashes, @ signs and pipes with spaces
sSpce <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
Corp <- tm_map(Corp, sSpce, "\"|/|@|\\|")
Corp <- tm_map(Corp, content_transformer(tolower))       # lower-case everything
Corp <- tm_map(Corp, removePunctuation)                  # remove punctuation
Corp <- tm_map(Corp, removeNumbers)                      # remove digits
Corp <- tm_map(Corp, stripWhitespace)                    # collapse repeated spaces
Corp <- tm_map(Corp, removeWords, stopwords('english'))  # remove English stop words
Next I create unigrams, bigrams, and trigrams to look at more specific information about the data.
library(RWeka)

# Tokenize the corpus into n-grams of a given size and return the 'top' most frequent ones
fNGrams <- function(Corp, grams, top) {
  ngram <- NGramTokenizer(Corp, Weka_control(min = grams, max = grams,
                                             delimiters = " \\r\\n\\t.,;:\"()?!"))
  ngram <- data.frame(table(ngram))
  ngram <- ngram[order(ngram$Freq, decreasing = TRUE), ][1:top, ]
  colnames(ngram) <- c("Words", "Count")
  ngram
}

moGrams <- fNGrams(Corp, 1, 50)   # top 50 unigrams
biGrams <- fNGrams(Corp, 2, 50)   # top 50 bigrams
triGrams <- fNGrams(Corp, 3, 50)  # top 50 trigrams
Plot the n-grams to examine the most frequent terms.
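As a sketch, one way to plot the bigram frequencies (the same pattern applies to moGrams and triGrams):

# Sketch only: frequency bar chart for the top bigrams
library(ggplot2)

ggplot(biGrams, aes(x = reorder(Words, Count), y = Count)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 50 bigrams in the sample", x = "Bigram", y = "Count")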
a. The data sources are free-form text and contain a lot of useless content, so it is important to carefully process, clean, and transform them in the proper order to extract meaningful information for building a model.
b. The size of the files forces us to work with samples during the construction and analysis of the model, and additional effort will be required to optimize the app's response time.