Data Science Capstone Project Report

Introduction

The goal of this project is to demonstrate proficiency in working with the data and the preliminary skills needed to create a prediction algorithm. This report presents some exploratory analysis and the goals for the eventual app and algorithm. It is concise: it explains only the major features identified in the data and briefly summarizes the plans for creating the prediction algorithm and Shiny app, in a way that a non-data-scientist manager can understand.

1. Data downloaded and successfully loaded

setwd("C:/misdatos/md4/DATA SCIENCE/CAPSTONE PROJECT/material")
list.files("Coursera-SwiftKey/final/en_US")

## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

# datasource<-"C:/misdatos/md4/DATA SCIENCE/CAPSTONE PROJECT/material/Coursera-SwiftKey/final/en_US"

blogsfile   <- "Coursera-SwiftKey/final/en_US/en_US.blogs.txt"
twitterfile <- "Coursera-SwiftKey/final/en_US/en_US.twitter.txt"
newsfile    <- "Coursera-SwiftKey/final/en_US/en_US.news.txt"

## Loading files

blogs   <- readLines(blogsfile,   encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitterfile, encoding = "UTF-8", skipNul = TRUE)
news    <- readLines(newsfile,    encoding = "UTF-8", skipNul = TRUE)

## Warning in readLines(newsfile, encoding = "UTF-8", skipNul = TRUE):
## incomplete final line found on
## 'Coursera-SwiftKey/final/en_US/en_US.news.txt'
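
The warning above is often harmless, but one possible cause, particularly on Windows, is an embedded control character that makes text-mode reading stop before the end of the file. That cause is an assumption and was not verified in this report. A minimal sketch of a workaround is to read through a connection opened in binary mode; the helper name `read_text_binary` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not in the original report): read a text file through
# a binary-mode connection so control characters do not end the read early.
read_text_binary <- function(path) {
  con <- file(path, open = "rb")   # "rb" = read, binary mode
  on.exit(close(con))              # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small self-contained check on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
demo_lines <- read_text_binary(tmp)
length(demo_lines)
```

Comparing `length(read_text_binary(newsfile))` with `length(news)` would show whether any lines were lost in the original read.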

2. Summary statistics about the data

## Load necessary packages

library(stringi)  ## character string analysis
library(ggplot2)  ## plotting library
 

## some statistics about the files
 
wblogs <- stri_count_words(blogs)        ## Words per line in the blogs file
summary(wblogs)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00

wtwitter <- stri_count_words(twitter)    ## Words per line in the twitter file
summary(wtwitter)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00

wnews <- stri_count_words(news)          ## Words per line in the news file
summary(wnews)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.62   46.00 1123.00
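
The three summaries above can also be put side by side in one table, which makes the files easier to compare. The following is a minimal sketch in base R; the function name `summarise_counts` and the toy word counts are illustrative assumptions, not results from the data.

```r
# Hypothetical helper: one row per source with the number of lines,
# the total number of words, and the mean words per line.
summarise_counts <- function(counts_by_source) {
  data.frame(
    source      = names(counts_by_source),
    lines       = vapply(counts_by_source, length, integer(1)),
    total_words = vapply(counts_by_source, function(x) as.numeric(sum(x)), numeric(1)),
    mean_words  = vapply(counts_by_source, mean, numeric(1)),
    row.names   = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy word counts standing in for wblogs and wtwitter (made-up values)
demo_summary <- summarise_counts(list(
  blogs   = c(9L, 28L, 60L),
  twitter = c(7L, 12L, 18L)
))
demo_summary
```

With the real data, the call would be `summarise_counts(list(blogs = wblogs, twitter = wtwitter, news = wnews))`.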

## Plotting statistics

pblogs   <- qplot(wblogs)
ptwitter <- qplot(wtwitter)
pnews    <- qplot(wnews)

 
 pblogs

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

 ptwitter

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

 pnews

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
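
The `stat_bin` messages above suggest choosing a bin width explicitly. Because a few very long entries stretch the x-axis, it may also help to zoom in on the bulk of the distribution. The sketch below shows one way to do both with `ggplot2`; the function name `plot_word_hist`, the bin width of 5 and the 200-word limit are assumptions chosen for illustration, not tuned values.

```r
library(ggplot2)

# Hypothetical helper: histogram of words per line with an explicit bin width.
# coord_cartesian() zooms the view without dropping any observations.
plot_word_hist <- function(counts, title, binwidth = 5, max_words = 200) {
  ggplot(data.frame(words = counts), aes(x = words)) +
    geom_histogram(binwidth = binwidth) +
    coord_cartesian(xlim = c(0, max_words)) +
    labs(title = title, x = "Words per line", y = "Number of lines")
}

# Toy data standing in for wblogs (made-up values)
demo_plot <- plot_word_hist(c(0, 9, 28, 41, 60, 120), "Blogs (toy data)")
```

With the real data, `plot_word_hist(wblogs, "Blogs")` would replace `qplot(wblogs)`.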

3. Interesting Findings

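Although this is a challenging project, the provided data is suitable for the task. The three data sets are rich and appropriate for developing a prediction model for the spelling corrector. The three files have different characteristics: the twitter file has the shortest entries (median of 12 words per line, maximum of 47), while the blogs file has the longest (mean of 41.75 words per line, maximum of 6,726). The news file falls in between on the mean (34.62 words per line), although its median (32) is higher than that of the blogs file (28).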

4. Plans for creating a prediction algorithm and Shiny app

The next steps to complete the project are:
a) Determine the clean-up and preprocessing required.
b) Create the n-grams (for example, bigrams) needed for the prediction task.
c) Build and evaluate a prediction model.
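
The planned steps can be sketched on a toy example in base R. This is only an illustration under simplifying assumptions, not the final algorithm: the function names are hypothetical, cleaning keeps only the ASCII letters a-z and apostrophes, and the "model" simply returns the most frequent word that follows a given word.

```r
# Step a) Clean-up: lower-case, keep letters/apostrophes, collapse spaces.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Step b) Bigrams: every pair of adjacent words within a line.
bigram_pairs <- function(lines) {
  tokens <- strsplit(clean_text(lines), " ", fixed = TRUE)
  tokens <- Filter(function(w) length(w) >= 2, tokens)  # need two or more words
  data.frame(
    first  = as.character(unlist(lapply(tokens, function(w) head(w, -1)))),
    second = as.character(unlist(lapply(tokens, function(w) tail(w, -1)))),
    stringsAsFactors = FALSE
  )
}

# Step c) Prediction: most frequent follower of `word`, or NA if unseen.
# Ties are broken alphabetically, because table() sorts its names.
predict_next <- function(pairs, word) {
  candidates <- pairs$second[pairs$first == tolower(word)]
  if (length(candidates) == 0) return(NA_character_)
  names(which.max(table(candidates)))
}

# Toy corpus (made-up sentences, not taken from the data sets)
toy_lines <- c("I love data science", "I love R", "You love data")
toy_pairs <- bigram_pairs(toy_lines)
predict_next(toy_pairs, "love")   # "data" follows "love" twice, "r" once
```

A real model would need larger n-grams, a strategy for unseen words, and an evaluation on held-out text.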

Capstone Project

Felipe Llaugel

December 28, 2015