1. Introduction

This milestone report presents an exploratory data analysis of the SwiftKey data for week 2 of the Coursera Data Science Capstone Project. The dataset is available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The data consists of three text files drawn from three different sources: blogs, news and Twitter. The text is provided in four languages, but only the English data is used here. The purpose of this report is to understand the data as a first step toward building the prediction model, which will be developed later.

As a first step, we load the necessary packages, download the dataset from the source and unzip it.

library(tidyverse)
library(downloader)
library(plyr)
library(stringi)
library(tm)
library(RWeka)
library(ggplot2)
library(NLP)
if(!file.exists("./projectData")){
  dir.create("./projectData")
}
Url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

if(!file.exists("./projectData/Coursera-SwiftKey.zip")){
  download.file(Url,destfile="./projectData/Coursera-SwiftKey.zip",mode = "wb")
}

if(!file.exists("./projectData/final")){
  unzip(zipfile="./projectData/Coursera-SwiftKey.zip",exdir="./projectData")
}
# Read the three English files from the extracted folder
setwd("./projectData/final/en_US")
twitterInfo <- readLines("en_US.twitter.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
blogsInfo <- readLines("en_US.blogs.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
newsInfo <- readLines("en_US.news.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
# Word counts: the fourth element of stri_stats_latex() is "Words"
twitterwords <- stri_stats_latex(twitterInfo)["Words"]
blogswords <- stri_stats_latex(blogsInfo)["Words"]
newswords <- stri_stats_latex(newsInfo)["Words"]

nchar_twitter<-sum(nchar(twitterInfo))
nchar_blogs<-sum(nchar(blogsInfo))
nchar_news<-sum(nchar(newsInfo))

# Keep the word counts in the same order as the file names (twitter, blogs, news)
df <- data.frame("File Name" = c("twitter", "blogs", "news"),
                 "num of lines" = c(length(twitterInfo), length(blogsInfo), length(newsInfo)),
                 "num of words" = c(sum(twitterwords), sum(blogswords), sum(newswords)))
df
  File.Name num.of.lines num.of.words
1   twitter      2360148     30451170
2     blogs       899288     37570839
3      news        77259      2651432

This output gives a quick summary of the data: the number of lines and the number of words in each file. The dataset is clearly large, which makes it difficult to process in full.
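The summary above can also be built one row per source, which keeps each label paired with its own counts. The following is a minimal sketch in base R, not the code used for this report: the helper `summarise_text` and the toy vectors are hypothetical, and the word count is a simple whitespace split rather than `stri_stats_latex`.

```r
# Hypothetical helper: summarise one character vector of text lines.
summarise_text <- function(name, lines) {
  # Split on runs of whitespace and drop empty tokens to approximate a word count.
  tokens <- unlist(strsplit(lines, "[[:space:]]+"))
  data.frame(
    source    = name,
    num_lines = length(lines),
    num_words = sum(nzchar(tokens)),
    num_chars = sum(nchar(lines)),
    stringsAsFactors = FALSE
  )
}

# Toy stand-ins for the real files (assumption: illustrative data only).
toy_twitter <- c("so happy today", "good morning all")
toy_blogs   <- c("This is a longer blog sentence about data.")

# Each row is computed from a single vector, so names and counts cannot drift apart.
summary_df <- rbind(
  summarise_text("twitter", toy_twitter),
  summarise_text("blogs", toy_blogs)
)
print(summary_df)
```

With the real data, the same helper could be applied to `twitterInfo`, `blogsInfo` and `newsInfo`.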

2. Data Cleaning

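# Assumption: the *_c vectors are not defined in this excerpt; here they are
# taken to be the raw text with non-ASCII characters dropped.
twitter_c <- iconv(twitterInfo, "UTF-8", "ASCII", sub = "")
blogs_c   <- iconv(blogsInfo, "UTF-8", "ASCII", sub = "")
news_c    <- iconv(newsInfo, "UTF-8", "ASCII", sub = "")

# Draw a 1% random sample from each source
set.seed(1234)  # arbitrary seed (an addition) so the sample is reproducible
sampledata <- c(sample(twitter_c, length(twitter_c) * 0.01),
                sample(blogs_c, length(blogs_c) * 0.01),
                sample(news_c, length(news_c) * 0.01))

corpus <- VCorpus(VectorSource(sampledata))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tps?://[^[:space:]]+")  # URLs
corpus <- tm_map(corpus, toSpace, "@[^[:space:]]+")              # Twitter handles
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

# Collect the cleaned text into a data frame, one row per document
corpusresult <- data.frame(text = sapply(corpus, as.character), stringsAsFactors = FALSE)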

Before exploratory data analysis, the data must be cleaned and arranged in a consistent form. A 1% sample is used because the full dataset is too large to handle comfortably. All non-English characters, URLs, punctuation marks, numbers, extra whitespace, common English stop words and profanities are removed from the sample dataset.
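To make the effect of these cleaning steps concrete, here is a minimal base R sketch applied to a few toy strings. It is an illustration under stated assumptions, not the report's pipeline: `clean_text` is a hypothetical helper, the regular expressions are one possible choice, and stop word and profanity removal are left out because they depend on external word lists.

```r
# Hypothetical helper that mirrors the cleaning steps with base R only.
clean_text <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # drop non-ASCII characters
  x <- gsub("(f|ht)tps?://[^[:space:]]+", " ", x)        # remove URLs
  x <- gsub("@[^[:space:]]+", " ", x)                    # remove Twitter handles
  x <- tolower(x)                                        # lower-case everything
  x <- gsub("[[:punct:]]+", " ", x)                      # remove punctuation
  x <- gsub("[[:digit:]]+", " ", x)                      # remove numbers
  x <- gsub("[[:space:]]+", " ", x)                      # collapse repeated whitespace
  trimws(x)                                              # trim leading/trailing spaces
}

# Toy input (assumption: illustrative strings, not taken from the dataset).
toy <- c("Check https://example.com NOW!!",
         "@user I have 2 cats :)",
         "Caf\u00e9 time")

cleaned <- clean_text(toy)
print(cleaned)
```

Running each step on a handful of strings like this is a cheap way to check the patterns before applying them to the sampled corpus.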

The figures above show the most common n-grams in the dataset. The prediction model will use unigrams, bigrams and trigrams. We will build this model in the next stage, together with an interactive user interface that is easy to use.
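For readers who want to see how n-gram frequencies of this kind can be obtained, the sketch below counts n-grams with base R only. It is an assumption-laden illustration rather than the code behind the figures: `make_ngrams` and `ngram_counts` are hypothetical helpers, the input is toy text, and the report itself loads RWeka, which offers its own tokenizer for this purpose.

```r
# Hypothetical helper: split one cleaned line into word n-grams of size n.
make_ngrams <- function(text, n) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))  # line too short for this n
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: frequency table of n-grams, most frequent first.
ngram_counts <- function(lines, n) {
  grams <- unlist(lapply(lines, make_ngrams, n = n))
  sort(table(grams), decreasing = TRUE)
}

# Toy cleaned text (assumption: stands in for the sampled corpus).
toy_lines <- c("i love data science", "i love r", "data science is fun")

bigram_counts <- ngram_counts(toy_lines, 2)
print(head(bigram_counts, 3))
```

N-grams are built per line, so no n-gram spans two separate documents. The same function with `n = 1` or `n = 3` gives unigram and trigram counts.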

The full code for this report is available at https://github.com/muhan1027/CapstonReportD4.