The goal of this project is to show that you have become familiar with the data and are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise: explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that a non-data scientist manager would understand. Use tables and plots to illustrate important summaries of the data set. The motivation for this project is to:
1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you have amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
# Set the working directory to the project folder (this path is machine-specific).
setwd("C:/Users/ract1/Desktop/Ricardo Carranza/Data Science/Coursera/Data Science Capstone")
# Download and unzip the data set only if the archive is not already present.
destfile <- "./Coursera-SwiftKey.zip"
if (!file.exists(destfile)) {
  url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(url, destfile, method = "curl")
  unzip(destfile)
}
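# Read the three English-language corpora as UTF-8 text.
# Note: on Windows, text-mode readLines() can stop early at an embedded control
# character; opening the file with file(path, "rb") avoids this.
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", warn = FALSE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", warn = FALSE)
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", warn = FALSE)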
##         nr of lines nr of words
## news          77259     2643969
## twitter     2360148    30373543
## blogs        899288    37334131
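The report does not show the code behind the summary table above. A minimal sketch of how such line and word counts could be computed in base R follows. The helper names `count_words` and `summarise_corpora` are hypothetical, a "word" is assumed to be any run of non-whitespace characters, and the toy input stands in for the real `news`, `twitter`, and `blogs` vectors.

```r
# Hypothetical helper: count whitespace-separated words in a character vector.
count_words <- function(lines) {
  # gregexpr() returns -1 for lines without a match, so count positions > 0.
  matches <- gregexpr("\\S+", lines)
  sum(vapply(matches, function(m) sum(m > 0), integer(1)))
}

# Hypothetical helper: one row per corpus with its line and word counts.
summarise_corpora <- function(corpora) {
  counts <- vapply(
    corpora,
    function(x) as.numeric(c(length(x), count_words(x))),
    numeric(2)
  )
  result <- t(counts)
  colnames(result) <- c("nr of lines", "nr of words")
  result
}

# Toy data (assumption): replace with list(news = news, twitter = twitter, blogs = blogs).
toy <- list(
  news    = c("The mayor spoke today.", "Rain is expected."),
  twitter = c("so excited!!", "good morning", "lol")
)
corpus_summary <- summarise_corpora(toy)
print(corpus_summary)
```

Counting non-whitespace runs is only one possible definition of a word; a tokenizer that strips punctuation first would give somewhat different totals.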
##
##    unigram_combi Freq1 bigram_combi Freq2  trigram_combi Freq3
##  1           the 26535       of the  2526        I don t   358
##  2            to 18621       in the  2364        I can t   211
##  3             I 16354          I m  1547       a lot of   181
##  4             a 14869      for the  1376 Thanks for the   180
##  5           and 14792       to the  1327     one of the   168
##  6            of 12985       on the  1214        I m not   159
##  7            in  9583        to be  1152        to be a   148
##  8           you  7957        don t   872    going to be   124
##  9            is  7906       at the   860      I want to   123
## 10           for  7372      and the   736     be able to   121
## 11          that  7121       I have   725     don t know   107
## 12            it  6911         is a   723       I have a   106
## 13            on  5445         it s   717       I didn t   104
## 14            my  4975        I was   699     the end of   102
## 15             s  4680         in a   691      I ve been   101
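The code that produced the n-gram table above is not shown in the report. As a rough illustration only, frequency tables of this kind could be built in base R as sketched below. The helper `top_ngrams` is hypothetical. Entries such as `don t` and `s` suggest that punctuation (including apostrophes) was replaced by spaces and that case was kept, so the sketch assumes that cleaning rule; the original analysis may have used a different tokenizer or a sample of the data.

```r
# Hypothetical helper: the k most frequent n-grams in a character vector.
top_ngrams <- function(lines, n, k = 15) {
  # Assumed cleaning rule: replace every run of non-letters with one space.
  cleaned <- gsub("[^A-Za-z]+", " ", lines)
  tokens <- strsplit(trimws(cleaned), " +")
  grams <- unlist(lapply(tokens, function(w) {
    w <- w[nzchar(w)]                      # drop empty tokens
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)   # n-grams never cross line boundaries
    vapply(starts,
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  freq <- sort(table(grams), decreasing = TRUE)
  head(
    data.frame(ngram = names(freq), Freq = as.integer(freq),
               stringsAsFactors = FALSE),
    k
  )
}

# Toy data (assumption): replace with a sample of the real corpora.
toy <- c("I don't know", "I don't care", "I know")
unigrams <- top_ngrams(toy, 1)
trigrams <- top_ngrams(toy, 3)
print(unigrams)
print(trigrams)
```

This approach holds every n-gram in memory, so for the full corpora it would likely need to run on a random sample of lines or be replaced by a dedicated text-mining package.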