Introduction

The goal of this project is to demonstrate that you have become familiar with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise: explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that a non-data-scientist manager could understand. You should make use of tables and plots to illustrate important summaries of the data set.

The motivation for this project is to:

1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you have amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Downloading and reading in files

# Set the working directory (machine-specific path; adjust as needed)
setwd("C:/Users/ract1/Desktop/Ricardo Carranza/Data Science/Coursera/Data Science Capstone")
destfile <- "./Coursera-SwiftKey.zip"
# Download and unzip the archive only if it is not already present
if (!file.exists(destfile)) {
  url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(url, destfile, method = "curl")
  unzip(destfile)
}

news <- readLines("final/en_US/en_US.news.txt", encoding = 'UTF-8', warn = FALSE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = 'UTF-8', warn = FALSE)
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = 'UTF-8', warn = FALSE)
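
On Windows, text-mode `readLines` can stop early if a file contains an embedded end-of-file control character, so a line count that looks unexpectedly low may be worth checking. The following is a minimal sketch of a reader that opens the connection in binary mode instead. `read_corpus` is a hypothetical helper name and is not part of the original code; the demo uses a temporary file rather than the real data.

```r
# Hypothetical helper (not in the original report): read a text file through a
# binary-mode connection so embedded control characters do not end the read.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
}

# Small self-contained demo on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_corpus(tmp)
length(demo_lines)
```

With the real data, the call would look like `news <- read_corpus("final/en_US/en_US.news.txt")`.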

Exploratory data analysis

Word counts and line counts

##         nr of lines nr of words
## news          77259     2643969
## twitter     2360148    30373543
## blogs        899288    37334131
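
The code that produced the table above is not shown in the report. One way such counts could be computed in base R is sketched below, assuming that a "word" is any whitespace-separated token. `count_summary` and the demo vectors are hypothetical stand-ins for the real `news`, `twitter` and `blogs` objects.

```r
# Hypothetical helper: count lines and whitespace-separated words
# in a character vector.
count_summary <- function(x) {
  words_per_line <- lengths(strsplit(trimws(x), "\\s+"))
  c("nr of lines" = length(x), "nr of words" = sum(words_per_line))
}

# Tiny stand-ins for the real corpora (illustrative data only)
demo <- list(
  news    = c("Stocks rose on Monday", "Rain is expected"),
  twitter = c("so tired today"),
  blogs   = c("I went to the market", "It was closed", "Maybe tomorrow")
)

# One row per source, one column per statistic
summary_table <- t(sapply(demo, count_summary))
summary_table
```

A different tokenizer (for example one that splits on punctuation) would give different word counts, so the figures depend on this definition.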

The files are too large to process in full. Therefore, a 1% sample is taken from each file, and the three samples are combined into a single data set.
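
The sampling code itself is not shown. A minimal sketch of the idea follows, assuming simple random sampling of lines without replacement. The seed value, the `sample_lines` helper and the demo vectors are all hypothetical.

```r
set.seed(1234)  # arbitrary seed, used here only to make the sketch reproducible

# Hypothetical helper: draw a random fraction of the lines of a character vector.
sample_lines <- function(x, fraction = 0.01) {
  n <- max(1, round(length(x) * fraction))
  x[sample.int(length(x), n)]
}

# Stand-ins for the real corpora (illustrative data only)
news_demo    <- paste("news line", 1:1000)
twitter_demo <- paste("tweet", 1:2000)
blogs_demo   <- paste("blog post", 1:500)

# Sample each source separately, then combine the samples
combined <- c(sample_lines(news_demo),
              sample_lines(twitter_demo),
              sample_lines(blogs_demo))
length(combined)
```

Sampling each source separately keeps every source represented in proportion to its size in the combined sample.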

Unigram, bigram, and trigram frequencies

##    unigram_combi Freq1 bigram_combi Freq2  trigram_combi Freq3
## 1            the 26535       of the  2526        I don t   358
## 2             to 18621       in the  2364        I can t   211
## 3              I 16354          I m  1547       a lot of   181
## 4              a 14869      for the  1376 Thanks for the   180
## 5            and 14792       to the  1327     one of the   168
## 6             of 12985       on the  1214        I m not   159
## 7             in  9583        to be  1152        to be a   148
## 8            you  7957        don t   872    going to be   124
## 9             is  7906       at the   860      I want to   123
## 10           for  7372      and the   736     be able to   121
## 11          that  7121       I have   725     don t know   107
## 12            it  6911         is a   723       I have a   106
## 13            on  5445         it s   717       I didn t   104
## 14            my  4975        I was   699     the end of   102
## 15             s  4680         in a   691      I ve been   101
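
The tokenizer used for the table above is not shown; forms such as "don t" suggest that it also strips apostrophes. The sketch below illustrates the general technique in base R, assuming lines are split on whitespace only, so it will not reproduce the table exactly. `ngram_freq` and the demo sentences are hypothetical.

```r
# Hypothetical helper: build a frequency table of n-grams from a character
# vector of lines. N-grams never span two lines.
ngram_freq <- function(lines, n) {
  grams <- unlist(lapply(strsplit(trimws(lines), "\\s+"), function(tokens) {
    if (length(tokens) < n) return(character(0))
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts,
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts), Freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Illustrative data only
demo <- c("I want to go", "I want to stay", "to be or not to be")
uni <- ngram_freq(demo, 1)
tri <- ngram_freq(demo, 3)
head(uni, 3)
head(tri, 3)
```

For the full sample, a dedicated text-mining package would likely be faster than this line-by-line approach.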

Plots
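
The original plots are not reproduced here. As a minimal sketch, a bar chart of the most frequent unigrams could be drawn in base R as follows. The five counts are copied from the frequency table in this report; the axis labels and title are assumptions.

```r
# Top five unigram counts, copied from the frequency table in this report
top_unigrams <- c("the" = 26535, "to" = 18621, "I" = 16354,
                  "a" = 14869, "and" = 14792)

# Horizontal bars, most frequent term at the top
bar_mids <- barplot(rev(top_unigrams),
                    horiz = TRUE, las = 1,
                    xlab = "Frequency in the 1% sample",
                    main = "Most frequent unigrams")
```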