Milestone Report for Data Science Capstone Project

Introduction

This report describes the training dataset used to develop the text prediction engine of a Shiny application that predicts the next word based on the user’s input.

We get familiar with the data by describing it. For each of the sources (internet news, blog posts and Twitter messages), we provide:

file size
number of lines
number of words
histogram of the number of words per line
number of characters

To do so, we first size the data (file size, number of lines, and so on) and then explore it with a few plots.

About the Data

The training dataset comes from a corpus called HC Corpora. It consists of three files containing text scraped by a web crawler from blogs, news articles and social media (specifically, tweets from Twitter), all in the en_US locale (US English).

Data Acquisition

The training dataset was downloaded from the URL given in the code below (unless the archive was already present) and unzipped into the parent of the working directory.

setwd("C:/Users/Ankit/Documents/Coursera Data Science Specialization/10 - Capstone/Assignment/milestone_report")

dataDirectory <- "../"

trainingDataset <- file.path(dataDirectory, "Coursera-SwiftKey.zip")

if (!file.exists(trainingDataset)) {
  url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(url, destfile = trainingDataset)
}

# Extract next to the archive; dataDirectory resolves to the same folder as the absolute path used before
unzip(trainingDataset, exdir = dataDirectory)

The contents of each file were then read in separately.

# Read each file through its own connection and close it as soon as it has been read
con <- file("../final/en_US/en_US.blogs.txt", open="rb")
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
con <- file("../final/en_US/en_US.news.txt", open="rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
con <- file("../final/en_US/en_US.twitter.txt", open="rb")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

Statistics

library(stringi)

## Warning: package 'stringi' was built under R version 3.2.3

blogs.size <- round(file.info("../final/en_US/en_US.blogs.txt")$size / 1024^2)
blogs.stats <- stri_stats_general(blogs)
blogs.words <- sum(stri_count_words(blogs))
blogs.words.per.line <- blogs.words/blogs.stats[1]
blogs.maxwords <- max(stri_count_words(blogs))
blogs.chars.per.word <- blogs.stats[3]/blogs.words

news.size <- round(file.info("../final/en_US/en_US.news.txt")$size / 1024^2)
news.stats <- stri_stats_general(news)
news.words <- sum(stri_count_words(news))
news.words.per.line <- news.words/news.stats[1]
news.maxwords <- max(stri_count_words(news))
news.chars.per.word <- news.stats[3]/news.words

twitter.size <- round(file.info("../final/en_US/en_US.twitter.txt")$size / 1024^2)
twitter.stats <- stri_stats_general(twitter)
twitter.words <- sum(stri_count_words(twitter))
twitter.words.per.line <- twitter.words/twitter.stats[1]
twitter.maxwords <- max(stri_count_words(twitter))
twitter.chars.per.word <- twitter.stats[3]/twitter.words

data.stats <- data.frame(Source = c("Blogs","News","Twitter"),
        Size.MB = c(blogs.size, news.size, twitter.size),
        Total.Lines = c(blogs.stats[1], news.stats[1], twitter.stats[1]),
        Total.Words = c(blogs.words, news.words, twitter.words),
        Total.Chars = c(blogs.stats[3], news.stats[3], twitter.stats[3]),
        Words.Per.Line = c(blogs.words.per.line, news.words.per.line, twitter.words.per.line),
        Max.Words.Per.Line = c(blogs.maxwords, news.maxwords, twitter.maxwords),
        Chars.Per.Word = c(blogs.chars.per.word, news.chars.per.word, twitter.chars.per.word))

data.stats

##    Source Size.MB Total.Lines Total.Words Total.Chars Words.Per.Line
## 1   Blogs      NA      899288    37546246   206824382       41.75108
## 2    News      NA     1010242    34762395   203223154       34.40997
## 3 Twitter      NA     2360148    30093410   162096241       12.75065
##   Max.Words.Per.Line Chars.Per.Word
## 1               6726       5.508524
## 2               1796       5.846063
## 3                 47       5.386436

The table above summarizes the data. A short description of each column follows:

Source: specifies where the original data comes from

Size.MB: is the original file size rounded to MB (an NA in the table above means file.info() did not find the file at the path it was given when the report was generated)

Total.Lines: is the total number of lines in the data

Total.Words: is the total number of words in the data

Total.Chars: is the total number of characters in the data

Words.Per.Line: is the average number of words per line, that is, the average number of words in blog posts, news articles and tweets

Max.Words.Per.Line: is the largest number of words in a line, that is, the number of words in the longest blog post, news article and tweet

Chars.Per.Word: is the average number of characters per word, computed as Total.Chars divided by Total.Words, so spaces and punctuation are included in the character count

The words used appear to be similar across the different sources; at least, the average number of characters per word is very similar (from 5.4 to 5.8). There is, however, a large difference in the number of words per text object (blog post, news article and tweet), ranging from 13 (Twitter) to 42 (blogs) words per text object. This is expected because (i) there’s a limit of 140 characters per Twitter message and (ii) the purpose of a Twitter message is, in essence, to be short.
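The three near-identical blocks of statistics code could be folded into one function applied to each source. The following is only a sketch under stated assumptions: `summarize_source` is a hypothetical helper that is not part of the original analysis, it uses base R whitespace splitting instead of `stri_count_words()` (so word counts on real text may differ slightly), and it runs on toy input rather than the corpus files.

```r
# Hypothetical helper (not from the original report): summarise one
# character vector of text lines using base R only.
summarize_source <- function(lines) {
  trimmed <- trimws(lines)
  # Approximate word counts by splitting on runs of whitespace;
  # stri_count_words() uses ICU word boundaries and may count differently.
  words.per.line <- lengths(strsplit(trimmed, "\\s+"))
  words.per.line[!nzchar(trimmed)] <- 0L  # empty lines contain no words
  total.words <- sum(words.per.line)
  total.chars <- sum(nchar(lines, type = "chars"))  # includes spaces and punctuation
  data.frame(
    Total.Lines = length(lines),
    Total.Words = total.words,
    Total.Chars = total.chars,
    Words.Per.Line = total.words / length(lines),
    Max.Words.Per.Line = max(words.per.line),
    Chars.Per.Word = total.chars / total.words
  )
}

# Toy input standing in for the real blogs and twitter vectors
toy <- list(
  Blogs = c("a short blog post", "another slightly longer blog post here"),
  Twitter = c("hi there", "ok")
)

# One row per source, bound into a single summary table
toy.stats <- do.call(rbind, lapply(toy, summarize_source))
toy.stats <- cbind(Source = names(toy), toy.stats)
print(toy.stats)
```

With the real data, the same call would take `list(Blogs = blogs, News = news, Twitter = twitter)`; the file sizes would still have to be looked up separately with `file.info()`.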

Blogs

The distribution of blog post length, i.e., the number of words per line, is depicted here. It shows that almost all blog posts have fewer than 500 words.

blogs.words <- stri_count_words(blogs)
hist(blogs.words, main="Histogram of words in blogs", xlab="No. of words per blog post", breaks=20)

If we “zoom” in on blog posts with fewer than 200 words, we see that most blog posts have fewer than 50 words.

hist(blogs.words, main="Histogram of words in blogs", xlab="No. of words per blog post", breaks=600, xlim=c(0, 200))
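Statements such as "almost all" or "most" can be read off a histogram only roughly; the corresponding shares can also be computed directly from the word counts. The sketch below assumes a numeric vector of words per text, as returned by `stri_count_words()`, and uses made-up toy counts; `share_below` is a hypothetical helper name.

```r
# Hypothetical helper: fraction of texts with fewer than `limit` words
share_below <- function(word.counts, limit) {
  mean(word.counts < limit)
}

# Toy word counts standing in for stri_count_words(blogs)
toy.counts <- c(5, 12, 30, 45, 48, 60, 150, 700)

share_below(toy.counts, 500)  # share of texts with fewer than 500 words
share_below(toy.counts, 50)   # share of texts with fewer than 50 words

# Median and upper quantiles give a compact view of the long right tail
quantile(toy.counts, probs = c(0.5, 0.9, 0.99))
```

Applied to the real `blogs.words` vector, these values would put numbers on the two claims made about the histograms above.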

News

At first glance, the distribution of text length in news looks much like the one for blogs:

news.words <- stri_count_words(news)
hist(news.words, main="Histogram of words in news", xlab="No. of words per news article", breaks=20)

But if we “zoom” in on news items with fewer than 100 words, a different picture emerges: news texts most commonly have between 20 and 40 words.

hist(news.words, main="Histogram of words in news", xlab="No. of words per news article", breaks=300, xlim=c(0, 100))

Twitter

The distribution of text length in tweets is different. For starters, a tweet can be at most 140 characters long. Measured in words, the longest tweet in the data is 47 words long:

twitter.words <- stri_count_words(twitter)
hist(twitter.words, main="Histogram of words in tweets", xlab="No. of words per tweet", breaks=20)

Almost all tweets have fewer than 30 words, and the highest concentration is around 8 words per tweet.
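The location of the peak can be checked without a plot by tabulating the word counts and taking the most frequent value. This is a minimal sketch on toy data; `most_common_count` is a hypothetical helper, and on the real `twitter.words` vector the result depends on how words are counted.

```r
# Hypothetical helper: most frequent value in a vector of word counts
most_common_count <- function(word.counts) {
  freq <- table(word.counts)                # frequency of each distinct count
  as.integer(names(freq)[which.max(freq)])  # count with the highest frequency
}

# Toy word counts standing in for stri_count_words(twitter)
toy.tweets <- c(3, 8, 8, 8, 12, 7, 8, 20, 7)
most_common_count(toy.tweets)
```

If several counts tie for the highest frequency, `which.max()` returns the first of them, i.e. the smallest such count.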

Conclusion

This basic exploration of the available data confirms what we would expect: blog posts use more words than news articles, and news articles use many more words than tweets.

The only surprise may concern word complexity (as measured by the number of characters per word). The average ranges from 5.4 (in tweets) to 5.8 (in news articles), which suggests that the vocabulary used is probably of the same kind across sources, i.e., no heavy use of complex or long words. Still, the vocabulary used in news articles is slightly more complex than the one used in blog posts, and the vocabulary used in blog posts is slightly more complex than the one used in tweets.
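One caveat on this measure: assuming Total.Chars is the third entry ("Chars") returned by `stri_stats_general()`, it counts every character, including spaces and punctuation, so the ratio overstates the number of letters per word. The toy sketch below, in base R with made-up sentences and illustrative variable names, shows how the two figures differ.

```r
# Toy sentences standing in for lines of the corpus
toy <- c("The quick brown fox.", "Hello, world!")

# Approximate the word list by splitting on whitespace
words <- unlist(strsplit(toy, "\\s+"))

all.chars <- sum(nchar(toy))                               # spaces and punctuation included
letters.only <- sum(nchar(gsub("[^[:alpha:]]", "", toy)))  # alphabetic characters only

chars.per.word <- all.chars / length(words)       # analogue of Chars.Per.Word
letters.per.word <- letters.only / length(words)  # letters per word, strictly

c(chars.per.word = chars.per.word, letters.per.word = letters.per.word)
```

Because the extra characters are mostly one space per word in every source, the ranking of the three sources is unlikely to change, but the absolute values should not be read as word lengths.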