The goal of this project is to demonstrate proficiency in working with the data and the preliminary skills needed to create a prediction algorithm. This report presents some exploratory analysis and the goals for the eventual app and algorithm. It is concise: it explains only the major features identified in the data and briefly summarizes the plans for creating the prediction algorithm and Shiny app in a way that a non-data-scientist manager can understand.
setwd("C:/misdatos/md4/DATA SCIENCE/CAPSTONE PROJECT/material")
list.files("Coursera-SwiftKey/final/en_US")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
# datasource<-"C:/misdatos/md4/DATA SCIENCE/CAPSTONE PROJECT/material/Coursera-SwiftKey/final/en_US"
blogsfile <- c("Coursera-SwiftKey/final/en_US/en_US.blogs.txt")
twitterfile <- c("Coursera-SwiftKey/final/en_US/en_US.twitter.txt")
newsfile <- c("Coursera-SwiftKey/final/en_US/en_US.news.txt")
## Loading files
blogs <- readLines(blogsfile, encoding = "UTF-8", skipNul=TRUE)
twitter <- readLines(twitterfile, encoding = "UTF-8", skipNul=TRUE)
news <- readLines(newsfile, encoding="UTF-8", skipNul=TRUE)
## Warning in readLines(newsfile, encoding = "UTF-8", skipNul = TRUE):
## incomplete final line found on
## 'Coursera-SwiftKey/final/en_US/en_US.news.txt'
## Load necessary packages
library(stringi) ## character string analysis
library(ggplot2) ## plotting library
## some statistics about the files
wblogs <- stri_count_words(blogs) ## Statistics of the blogs file
summary(wblogs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
wtwitter <- stri_count_words(twitter) ## Statistics of the twitter file
summary(wtwitter)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
wnews <- stri_count_words(news) ## Statistics of the news file
summary(wnews)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.62 46.00 1123.00
## Plotting statistics
pblogs <- qplot( wblogs )
ptwitter <- qplot( wtwitter )
pnews <- qplot( wnews )
pblogs
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
ptwitter
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
pnews
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
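The stat_bin message above suggests setting the bin width explicitly. The following is a minimal sketch of how that could look, not the code used for the plots in this report. The data is simulated, and the bin width of 5 and the x-axis limit of 200 are illustrative assumptions chosen so that a single very long line does not flatten the histogram.

```r
library(ggplot2)

# Simulated per-line word counts standing in for wblogs (hypothetical data):
# mostly moderate lines plus one extreme outlier, as in the blogs file.
set.seed(1)
wcounts <- c(rpois(1000, lambda = 40), 6726)

# An explicit binwidth avoids the stat_bin message; coord_cartesian() zooms
# the x axis without removing the outlier from the underlying binning.
pcounts <- ggplot(data.frame(words = wcounts), aes(x = words)) +
  geom_histogram(binwidth = 5) +
  coord_cartesian(xlim = c(0, 200)) +
  labs(x = "Words per line", y = "Number of lines")
pcounts
```

The same pattern could be applied to wblogs, wtwitter, and wnews, with a bin width and axis limit tuned to each file.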
Although this is a challenging project, the provided data is suitable for the requested task. The three data sets are rich and appropriate for developing a prediction model for the spelling corrector. The three files have different characteristics: the Twitter file has the shortest lines (median 12 words, maximum 47), while the blogs file has the longest (mean 41.75 words, maximum 6,726).
The next steps to complete the project are: a) determine the clean-up and preprocessing required; b) create the n-grams (such as bigrams) for the prediction task; c) build and evaluate a prediction model.
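The three steps above can be sketched in base R on a toy corpus. This is only an illustration under stated assumptions, not the final algorithm: the sample lines are made up, the clean-up rules are deliberately crude, and the helper names clean_text, make_bigrams, and predict_next are hypothetical.

```r
# Toy corpus standing in for a sample of the blogs, news, and Twitter lines
# (hypothetical data, not taken from the real files).
sample_lines <- c("I love data science",
                  "I love R",
                  "Data science is fun!")

# a) Minimal clean-up: lower-case, replace everything except letters,
#    apostrophes, and spaces, then collapse and trim whitespace.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)
  x <- gsub(" +", " ", x)
  trimws(x)
}

# b) Build bigrams line by line so that no pair spans two lines.
make_bigrams <- function(x) {
  tokens <- strsplit(clean_text(x), " ", fixed = TRUE)
  pairs <- lapply(tokens, function(w) {
    if (length(w) < 2) return(character(0))
    paste(head(w, -1), tail(w, -1))
  })
  unlist(pairs)
}

bigram_counts <- sort(table(make_bigrams(sample_lines)), decreasing = TRUE)

# c) Naive predictor: the most frequent word that follows `word`
#    in the bigram table, or NA if the word was never seen.
predict_next <- function(word, counts) {
  first <- sub(" .*$", "", names(counts))
  hits <- counts[first == tolower(word)]
  if (length(hits) == 0) return(NA_character_)
  sub("^.* ", "", names(hits)[which.max(hits)])
}

predict_next("data", bigram_counts)  # "science"
```

On the real data, the same idea would likely need sampling to fit in memory, longer n-grams, and a fallback for unseen words, and the model would be evaluated on held-out lines.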