The goal of this project is to understand the provided data, produce summary statistics, and establish an overall approach for building a prediction algorithm and a Shiny app that predicts the most probable next word from the given input.
The data for this project was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip, provided by Coursera in partnership with SwiftKey. For this project we use only the files from the “en_US” folder.
Below is a summary of the three files, showing the number of lines and the number of words in each.
## filename num_lines num_words
## 1 news 1010242 34762395
## 2 twitter 2360148 30093410
## 3 blogs 899288 37546246
From the summary we can see that the twitter data has the most lines, whereas the blogs data has the most words.
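The code that produced these counts is not shown in the report. As a minimal sketch, assuming words are counted as whitespace-separated tokens (the original counts may have been computed differently), a summary like this could be built in base R. The helper name `summarise_text` and the toy input are hypothetical.

```r
# Hypothetical helper: count lines and whitespace-separated words
# in a character vector (one element per line of the file).
summarise_text <- function(lines) {
  lines <- trimws(lines)
  # Split non-empty lines on runs of whitespace; empty lines add zero words
  words_per_line <- lengths(strsplit(lines[nzchar(lines)], "[[:space:]]+"))
  data.frame(num_lines = length(lines), num_words = sum(words_per_line))
}

# Toy input standing in for the result of readLines() on one of the en_US files
toy <- c("the quick brown fox", "jumps over", "")
toy_summary <- summarise_text(toy)
print(toy_summary)
```

On the real data, the same helper would be applied to the character vector read from each of the three files, and the three resulting rows combined with `rbind()`.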
These files are large, so processing them in full takes a long time. To reduce the processing time, we work with a random subset of each file, so that the subset remains a representative sample. For simplicity, we take a 1% sample of the data.
Below is a summary of the three sample files, showing the number of lines and the number of words in each.
## filename num_linesSample num_wordsSample
## 1 newsSample 10102 348176
## 2 twitterSample 23601 300582
## 3 blogsSample 8992 390738
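The sample line counts above are consistent with taking the floor of 1% of each file's line count. The sketch below shows one way such a sample could be drawn in base R; the helper name `sample_lines`, the seed, and the toy input are assumptions for illustration, not the report's actual code.

```r
set.seed(1234)  # assumed seed, only so that the sketch is reproducible

# Hypothetical helper: simple random sample of lines without replacement,
# keeping the sampled lines in their original order.
sample_lines <- function(lines, fraction = 0.01) {
  n <- floor(length(lines) * fraction)
  lines[sort(sample.int(length(lines), n))]
}

# Toy input standing in for the lines of one of the en_US files
toy_lines <- sprintf("line %d", 1:1000)
toy_sample <- sample_lines(toy_lines, 0.01)
length(toy_sample)
```

Sampling whole lines rather than individual words keeps each sentence intact, which matters later when word sequences (n-grams) are extracted.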
From the graphs we can see that the sampled data requires more cleaning.
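The report does not say which cleaning steps are still missing. As a rough sketch only, assuming the goal is lowercase text with URLs, digits, and punctuation removed, a basic cleaning pass could look like this in base R. The helper name `clean_text` and the choice of steps are assumptions.

```r
# Hypothetical helper: a basic cleaning pass using base R only.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("http[^[:space:]]*", " ", x)  # drop URLs
  x <- gsub("[^a-z' ]", " ", x)           # keep letters, apostrophes, spaces
  x <- gsub("[[:space:]]+", " ", x)       # collapse runs of whitespace
  trimws(x)
}

cleaned <- clean_text("Check THIS out: http://example.com   it's 100% great!!")
print(cleaned)  # "check this out it's great"
```

Apostrophes are kept here so that contractions such as "it's" survive as single tokens; whether that is desirable depends on how the prediction model tokenizes its input.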
The next steps are to clean the data further, build a predictive model from the cleaned data using the n-gram model, and provide an interface to the user through the Shiny app.
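To illustrate the idea behind n-gram prediction, here is a minimal bigram sketch in base R: it counts how often each word follows another and predicts the most frequent follower. This is a simplification under stated assumptions, not the planned model; the helper names `build_bigrams` and `predict_next` and the toy corpus are hypothetical, and a real model would likely use longer n-grams with a fallback to shorter ones.

```r
# Hypothetical helper: count bigrams (adjacent word pairs) in cleaned lines.
build_bigrams <- function(lines) {
  tokens <- strsplit(lines, " ", fixed = TRUE)
  tokens <- tokens[lengths(tokens) >= 2]  # a bigram needs at least two words
  first  <- unlist(lapply(tokens, function(w) w[-length(w)]))
  second <- unlist(lapply(tokens, function(w) w[-1]))
  counts <- as.data.frame(table(first = first, second = second),
                          stringsAsFactors = FALSE)
  counts[counts$Freq > 0, ]
}

# Hypothetical helper: most frequent word seen after `word`, or NA if unseen.
predict_next <- function(counts, word) {
  hits <- counts[counts$first == word, ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$second[which.max(hits$Freq)]
}

corpus <- c("i love you", "i love coffee", "i love you too", "you love me")
bigrams <- build_bigrams(corpus)
predict_next(bigrams, "love")  # "you" (follows "love" twice in the toy corpus)
```

In a Shiny app, a function like `predict_next` would be called from the server logic with the last word of the user's input, and the result shown as the suggested next word.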