Data Science Capstone Project Milestone Report 1

Table of contents:

  • Introduction
  • Data Loading
  • Data Processing
  • Next Steps

Introduction
The goal of this milestone report for the Coursera Data Science Capstone project is to show how the data was downloaded and to explain the plan for creating the prediction algorithm. This document explains the major features of the data that I have identified and briefly summarizes the plans for creating the prediction algorithm and Shiny app.

Data Loading
In this project the following data is provided:
http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Text documents are provided in English, German, Finnish and Russian, and they come in 3 different forms:

  • Blogs
  • News
  • Twitter

Since I don't know any of the other 3 languages, I'm going to use the data in English. First, we load the data:
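A minimal sketch of this loading step, assuming the archive has already been downloaded and unzipped so that the English files sit under final/en_US/. That path and the helper names read_text_file and describe_text are assumptions made for illustration, not taken from the original code. Opening the connection in binary mode and skipping embedded nulls is a precaution against control characters in the raw text; line and character counts are then obtained with length() and nchar().

```r
# Read a text file as UTF-8 lines; n = -1L reads the whole file.
# Hypothetical helper: the name and defaults are illustrative.
read_text_file <- function(path, n = -1L) {
  con <- file(path, open = "rb")   # binary mode, so control characters do not end the read early
  on.exit(close(con))
  readLines(con, n = n, encoding = "UTF-8", skipNul = TRUE)
}

# Summarise a character vector: number of lines and total number of characters.
describe_text <- function(lines) {
  n_chars <- nchar(lines, type = "chars", allowNA = TRUE)
  c(lines = length(lines), characters = sum(as.numeric(n_chars), na.rm = TRUE))
}

# Assumed location of the unzipped English files.
data_dir <- file.path("final", "en_US")
paths <- file.path(data_dir, c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
names(paths) <- c("blogs", "news", "twitter")

# Only read files that are actually present, so the sketch runs anywhere.
available <- paths[file.exists(paths)]
texts <- lapply(available, read_text_file)
specs <- vapply(texts, describe_text, numeric(2))
```

When the files are present, specs holds one column per source with its line and character counts.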

Data specifications:

  • Blogs: 899288 lines and 206824505 characters
  • News: 1010242 lines and 203223159 characters
  • Twitter: 2360148 lines and 162096031 characters

Since these datasets are huge and processing takes a long time, we will use a sample for the data processing and analysis in this report. The full data set will be used in the final project for the prediction algorithm. We only load 1000 lines for this report. Data from the 3 files are combined and a text corpus is built using the tm library.

Load Necessary Libraries
Error in library(NPL, quietly = TRUE, warn.conflicts = FALSE) : 
  there is no package called 'NPL'
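
The error above arises because no package named NPL is installed; the tm package depends on a package called NLP, which is presumably the one intended. What follows is a minimal sketch of the library loading, sampling and corpus-building steps described earlier. The helper name sample_lines, the seed value and the toy input vectors are assumptions made for illustration; the real report would pass in the blogs, news and Twitter lines. Random sampling is only one reading of "load 1000 lines"; taking the first 1000 lines of each file with readLines(n = 1000) would be another.

```r
# Load NLP and tm only if they are installed, so the sketch still runs without them.
have_tm <- requireNamespace("NLP", quietly = TRUE) &&
  requireNamespace("tm", quietly = TRUE)
if (have_tm) {
  library(NLP, quietly = TRUE, warn.conflicts = FALSE)
  library(tm, quietly = TRUE, warn.conflicts = FALSE)
}

# Draw up to n lines at random from a character vector (hypothetical helper).
# Note: set.seed() resets the global RNG state; the seed value is arbitrary.
sample_lines <- function(lines, n = 1000L, seed = 1234L) {
  set.seed(seed)
  lines[sample.int(length(lines), min(n, length(lines)))]
}

# Toy stand-ins for the three English sources.
blogs   <- c("A blog post line.", "Another blog line.")
news    <- c("A news line.")
twitter <- c("a tweet", "another tweet", "yet another tweet")

# Combine the three sources, then sample from the combined text.
combined <- c(blogs, news, twitter)
sampled  <- sample_lines(combined, n = 1000L)

# Build the corpus with tm when it is available; otherwise leave it as NULL.
corpus <- if (have_tm) tm::VCorpus(tm::VectorSource(sampled)) else NULL
```

With the toy input, the sample simply contains all six lines, because fewer than 1000 are available.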