This report describes the training dataset used to develop the text prediction engine of a Shiny application that predicts the next word from the user’s input.
In practice, we get familiar with the data by describing it. For each source (internet news, blog posts and Twitter messages), we provide:
file size
number of lines
number of words
histogram of the number of words per line
number of characters
To get to know the data, we size it (file size, number of lines, and so on) and explore it with a few plots.
The training dataset comes from a corpus called HC Corpora. It consists of three files containing text scraped by a web crawler from blogs, news articles and social media (specifically, tweets from Twitter), all in the en_US locale (US English).
The training dataset was downloaded from the URL given in the code below and unzipped one level above the working directory.
# Machine-specific working directory; adjust to your own setup.
setwd("C:/Users/Ankit/Documents/Coursera Data Science Specialization/10 - Capstone/Assignment/milestone_report")
dataDirectory <- ".."
trainingDataset <- file.path(dataDirectory, "Coursera-SwiftKey.zip")
# Download the archive only if it is not already present.
if (!file.exists(trainingDataset)) {
  url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  # Binary mode keeps the zip archive intact on Windows.
  download.file(url, destfile = trainingDataset, mode = "wb")
}
# Unpack next to the archive (the parent of the working directory).
unzip(trainingDataset, exdir = dataDirectory)
The contents of each file were then read in separately.
# Open each file in binary mode, read it as UTF-8, and close the connection before moving on.
con <- file("../final/en_US/en_US.blogs.txt", open = "rb")
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
con <- file("../final/en_US/en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
con <- file("../final/en_US/en_US.twitter.txt", open = "rb")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
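Since the same open, read and close steps are repeated for each file, they can be wrapped in a small function. The following is only a sketch: `read_corpus` is a hypothetical helper that is not part of the original report, and it is demonstrated on a temporary toy file rather than on the real corpus.

```r
# Hypothetical helper: read a text file as UTF-8 lines and always
# close the connection, even if readLines() fails.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con), add = TRUE)
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Toy demonstration on a temporary file instead of the real data.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
toy <- read_corpus(tmp)
length(toy)   # 2
unlink(tmp)
```

With the real data, the same helper would be called once per file, for example with the `../final/en_US/en_US.blogs.txt` path used above.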
library(stringi)
# file.info() needs the same relative path that was used to read the files;
# a bare file name is looked up in the working directory and yields NA if missing.
blogs.size <- round(file.info("../final/en_US/en_US.blogs.txt")$size / 1024^2)
blogs.stats <- stri_stats_general(blogs)  # element 1 = lines, element 3 = characters
blogs.words <- sum(stri_count_words(blogs))
blogs.words.per.line <- blogs.words/blogs.stats[1]
blogs.maxwords <- max(stri_count_words(blogs))
blogs.chars.per.word <- blogs.stats[3]/blogs.words
news.size <- round(file.info("../final/en_US/en_US.news.txt")$size / 1024^2)
news.stats <- stri_stats_general(news)
news.words <- sum(stri_count_words(news))
news.words.per.line <- news.words/news.stats[1]
news.maxwords <- max(stri_count_words(news))
news.chars.per.word <- news.stats[3]/news.words
twitter.size <- round(file.info("../final/en_US/en_US.twitter.txt")$size / 1024^2)
twitter.stats <- stri_stats_general(twitter)
twitter.words <- sum(stri_count_words(twitter))
twitter.words.per.line <- twitter.words/twitter.stats[1]
twitter.maxwords <- max(stri_count_words(twitter))
twitter.chars.per.word <- twitter.stats[3]/twitter.words
# One row per source, one column per statistic.
data.stats <- data.frame(Source = c("Blogs", "News", "Twitter"),
                         Size.MB = c(blogs.size, news.size, twitter.size),
                         Total.Lines = c(blogs.stats[1], news.stats[1], twitter.stats[1]),
                         Total.Words = c(blogs.words, news.words, twitter.words),
                         Total.Chars = c(blogs.stats[3], news.stats[3], twitter.stats[3]),
                         Words.Per.Line = c(blogs.words.per.line, news.words.per.line, twitter.words.per.line),
                         Max.Words.Per.Line = c(blogs.maxwords, news.maxwords, twitter.maxwords),
                         Chars.Per.Word = c(blogs.chars.per.word, news.chars.per.word, twitter.chars.per.word))
data.stats
##    Source Size.MB Total.Lines Total.Words Total.Chars Words.Per.Line
## 1   Blogs      NA      899288    37546246   206824382       41.75108
## 2    News      NA     1010242    34762395   203223154       34.40997
## 3 Twitter      NA     2360148    30093410   162096241       12.75065
##   Max.Words.Per.Line Chars.Per.Word
## 1               6726       5.508524
## 2               1796       5.846063
## 3                 47       5.386436
The table above summarizes the data. A short description of each column follows:
Source: where the original data comes from
Size.MB: the original file size, rounded to the nearest MB (shown as NA in the output above because the file sizes could not be read when that output was generated)
Total.Lines: the total number of lines in the data
Total.Words: the total number of words in the data
Total.Chars: the total number of characters in the data
Words.Per.Line: the average number of words per line, that is, per blog post, news article or tweet
Max.Words.Per.Line: the largest number of words in a single line, that is, in the longest blog post, news article or tweet
Chars.Per.Word: the average number of characters per word, computed as total characters divided by total words, so spaces and punctuation are included in the count
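To make the column definitions concrete, here is a minimal sketch that computes the same statistics (except the file size) for a tiny made-up input. `summarise_lines` and `toy` are hypothetical names introduced only for this illustration; the stringi calls are the same ones used in the report.

```r
library(stringi)

# Hypothetical helper: one-row summary for a character vector of text lines.
summarise_lines <- function(lines, source) {
  words <- stri_count_words(lines)     # number of words in each line
  stats <- stri_stats_general(lines)   # named vector: Lines, LinesNEmpty, Chars, CharsNWhite
  data.frame(
    Source             = source,
    Total.Lines        = stats[["Lines"]],
    Total.Words        = sum(words),
    Total.Chars        = stats[["Chars"]],
    Words.Per.Line     = sum(words) / stats[["Lines"]],
    Max.Words.Per.Line = max(words),
    Chars.Per.Word     = stats[["Chars"]] / sum(words)
  )
}

# Two made-up lines with 4 and 2 words.
toy <- c("the quick brown fox", "hello world")
toy.summary <- summarise_lines(toy, "Toy")
toy.summary
```

Because Chars.Per.Word divides all characters by the word count, it is somewhat larger than the average number of letters in a word.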
The words used appear similar in length across the sources: the average number of characters per word is very close (from 5.4 to 5.8). The number of words per text object (blog post, news article or tweet) differs markedly, though, ranging from about 13 (Twitter) to about 42 (blogs). This is expected because (i) there’s a limit of 140 characters per Twitter message and (ii) the purpose of a Twitter message is, in essence, to be short.
The distribution of the number of words per line in blog posts is depicted here; it shows that almost all blog posts have fewer than 500 words.
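The 140-character limit mentioned above can be checked directly by counting characters per message. This is a minimal sketch on made-up messages: `toy.tweets` is hypothetical, and no result for the real `twitter` vector is claimed here.

```r
# Made-up messages; the second one is exactly 140 characters long.
toy.tweets <- c("short message",
                paste(rep("a", 140), collapse = ""))

# Number of characters in each message.
tweet.chars <- nchar(toy.tweets, type = "chars")
tweet.chars               # 13 140

# TRUE when no message exceeds the limit.
all(tweet.chars <= 140)   # TRUE
```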
blogs.words <- stri_count_words(blogs)
hist(blogs.words, main="Histogram of words in blogs", xlab="No. of words per blog post", breaks=20)
And if we “zoom” in on blog posts with fewer than 200 words, we see that most blog posts have fewer than 50 words.
hist(blogs.words, main="Histogram of words in blogs", xlab="No. of words per blog post", breaks=600, xlim=c(0, 200))
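Statements such as “almost all” and “most” can also be expressed as numbers rather than read off a histogram. The sketch below uses made-up word counts: `toy.words` and `share_below` are hypothetical names, and with the real data the input would be the per-line vector returned by `stri_count_words(blogs)`.

```r
# Made-up word counts per text.
toy.words <- c(3, 12, 25, 41, 48, 60, 150, 700)

# Share of texts with fewer words than a given threshold.
share_below <- function(words, threshold) mean(words < threshold)

share_below(toy.words, 500)   # 7 of 8 texts: 0.875
share_below(toy.words, 50)    # 5 of 8 texts: 0.625

# Quantiles give a similar picture without picking thresholds by eye.
quantile(toy.words, probs = c(0.5, 0.9, 0.99))
```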
At first glance, the distribution of words in news articles looks much like that of blogs:
news.words <- stri_count_words(news)
hist(news.words, main="Histogram of words in news", xlab="No. of words per news article", breaks=20)
But if we “zoom” in on news articles with fewer than 100 words, we see something different: the most common length of a news text is between 20 and 40 words.
hist(news.words, main="Histogram of words in news", xlab="No. of words per news article", breaks=300, xlim=c(0, 100))
The distribution for tweets is different. For starters, the longest tweet allowed is 140 characters. Measured in words, the longest tweet in the data is 47 words long:
twitter.words <- stri_count_words(twitter)
hist(twitter.words, main="Histogram of words in tweets", xlab="No. of words per tweet", breaks=20)
Almost all tweets have fewer than 30 words, and the highest concentration is around 8 words per tweet.
This basic exploration of the available data confirms what we would expect: blog posts use more words than news articles, and news articles use many more words than tweets.
The only surprise may concern word complexity (as measured by the number of characters per word). The average number of characters per word ranges from 5.4 (in tweets) to 5.8 (in news articles), which suggests that the vocabulary used is probably of the same kind, i.e., no source relies heavily on complex or long words. Still, the vocabulary used in news articles is slightly more complex than that of blog posts, and the vocabulary of blog posts is slightly more complex than that of tweets.
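The most common number of words per tweet can also be computed instead of being estimated from the plot. A minimal sketch on made-up counts follows; `toy.words` is hypothetical and stands in for the per-tweet vector returned by `stri_count_words(twitter)`.

```r
# Made-up word counts per tweet.
toy.words <- c(5, 8, 8, 8, 9, 12, 12, 30)

# Frequency of each distinct word count.
counts <- table(toy.words)

# The word count that occurs most often (the mode).
most.common <- as.integer(names(counts)[which.max(counts)])
most.common   # 8
```

If several word counts are tied for the highest frequency, `which.max` returns the first of them, so the result is the smallest tied value.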