Capstone Project - Milestone Report

Introduction

This report presents summary statistics and some exploratory data analysis of the SwiftKey dataset.

Getting data

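# Download the archive only if it is not already in the working directory
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
# Unzip only if the extracted "final/en_US" folder is missing
if (!dir.exists("final/en_US")) {
  unzip(zipfile = "Coursera-SwiftKey.zip")
}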

Exploratory analysis

The dataset has 4 subfolders: “de_DE”, “en_US”, “fi_FI” and “ru_RU”.

dir("final")

## [1] "de_DE" "en_US" "fi_FI" "ru_RU"

I will analyze only the en_US set.

data_files <- "final/en_US"
lst <- dir(data_files)
# Build a table of file names and their sizes in bytes
size_table <- data.frame(filename = lst,
                         size = file.size(file.path(data_files, lst)))
size_table

##            filename      size
## 1   en_US.blogs.txt 210160014
## 2    en_US.news.txt 205811889
## 3 en_US.twitter.txt 167105338

Line count summary

blog <- readLines(paste(data_files,"en_US.blogs.txt",sep="/"))
length(blog)

## [1] 899288

news <- readLines(paste(data_files,"en_US.news.txt",sep="/"))

## Warning in readLines(paste(data_files, "en_US.news.txt", sep = "/")):
## incomplete final line found on 'final/en_US/en_US.news.txt'

length(news)

## [1] 77259
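The news file is about the same size as the blogs file yet reports far fewer lines, and the warning above hints that the read may have stopped early. One known cause on Windows is a control character (such as 0x1A) that a text-mode connection treats as end of file. A minimal sketch, assuming this is the cause, reads through a binary-mode connection instead. The helper name `read_lines_binary` is hypothetical, and the temporary file only stands in for the real data.

```r
# Hypothetical helper: read a text file through a binary-mode connection
# so that control characters are not interpreted as end of file.
read_lines_binary <- function(path, encoding = "UTF-8") {
  con <- file(path, open = "rb")
  on.exit(close(con))
  # warn = FALSE suppresses the "incomplete final line" warning
  readLines(con, encoding = encoding, warn = FALSE)
}

# Self-contained demonstration on a temporary file that contains
# a 0x1A character and has no trailing newline
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("first line\nsecond \x1a line\nthird line"), tmp)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
```

If the assumption holds, calling the same helper on `final/en_US/en_US.news.txt` should give a line count closer in scale to the other two files.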

twitter <- readLines(paste(data_files,"en_US.twitter.txt",sep="/"))

## Warning in readLines(paste(data_files, "en_US.twitter.txt", sep = "/")):
## line 167155 appears to contain an embedded nul

## Warning in readLines(paste(data_files, "en_US.twitter.txt", sep = "/")):
## line 268547 appears to contain an embedded nul

## Warning in readLines(paste(data_files, "en_US.twitter.txt", sep = "/")):
## line 1274086 appears to contain an embedded nul

## Warning in readLines(paste(data_files, "en_US.twitter.txt", sep = "/")):
## line 1759032 appears to contain an embedded nul

length(twitter)

## [1] 2360148
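The embedded nul warnings above can probably be avoided with the `skipNul` argument of `readLines`, which drops nul bytes while reading. The following is a small sketch on a temporary file (an assumption standing in for the Twitter data), not the code used to produce the counts in this report.

```r
# Write a small temporary file whose first line contains a nul byte
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("good morning"),
           as.raw(0),
           charToRaw(" twitter\nsecond tweet\n")),
         tmp)

# skipNul = TRUE tells readLines to skip nul bytes instead of warning
demo_tweets <- readLines(tmp, skipNul = TRUE)
length(demo_tweets)
```

The same argument can be passed when reading `en_US.twitter.txt`.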

Wordcount summary

library(stringi)
library(ggplot2)

words_blogs   <- stri_count_words(blog)
words_news    <- stri_count_words(news)
words_twitter <- stri_count_words(twitter)
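The per-line word counts can also be condensed into a few numbers per file. The sketch below uses base R only. The helper `summarise_counts` is a hypothetical name, and the toy vector stands in for `words_blogs`, `words_news` or `words_twitter`.

```r
# Hypothetical helper: summarise a vector of per-line word counts
summarise_counts <- function(counts) {
  c(lines  = length(counts),   # number of lines
    words  = sum(counts),      # total number of words
    mean   = mean(counts),     # average words per line
    median = median(counts),   # median words per line
    max    = max(counts))      # longest line, in words
}

# Toy example: three lines with 3, 10 and 7 words
toy_counts <- c(3, 10, 7)
toy_summary <- summarise_counts(toy_counts)
toy_summary
```

Applied to the real data, something like `sapply(list(blogs = words_blogs, news = words_news, twitter = words_twitter), summarise_counts)` would give one column per file.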

qplot(words_blogs)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(words_news)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(words_twitter)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
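The `stat_bin()` messages appear because no bin width was chosen. As a sketch of one way to address this, the histogram can be built with `ggplot()` and `geom_histogram()` with an explicit `binwidth`. The helper name `plot_word_counts`, the bin width of 5 and the toy data are assumptions for illustration only.

```r
library(ggplot2)

# Hypothetical helper: histogram of words per line with an explicit bin width
plot_word_counts <- function(counts, title, binwidth = 5) {
  ggplot(data.frame(words = counts), aes(x = words)) +
    geom_histogram(binwidth = binwidth) +
    labs(title = title, x = "Words per line", y = "Number of lines")
}

# Toy data standing in for words_blogs, words_news or words_twitter
toy_counts <- c(3, 10, 7, 12, 25, 8)
p <- plot_word_counts(toy_counts, "Toy example")
p
```

A suitable bin width depends on the file: tweets are short, so a small value is likely to work better there than for blogs or news.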

Conclusions

The objective of the project is to build a prediction engine. This report shows only some simple analysis.