Overview

This milestone report explores the SwiftKey text datasets provided for the Data Science Capstone project.
The goal is to understand the structure of the data, its basic statistics, and the challenges involved in building a predictive text model.

library(stringi)  # stri_count_words() for the word counts below
library(dplyr)    # general data-manipulation helpers

Data Loading

blogs <- readLines("data/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("data/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("data/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
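
On some platforms, a text-mode read can stop early if a file contains embedded control characters, so it may be worth comparing the line counts with a binary-mode read. The following is a minimal sketch of that defensive variant. `read_corpus` is a hypothetical helper, not part of the original report, and the temporary file only stands in for a real corpus file.

```r
# Hypothetical helper (an assumption, not in the original report): read a
# text file through a binary-mode connection so that embedded control
# characters are not interpreted as end-of-file.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Demonstration on a small temporary file standing in for a corpus file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_corpus(tmp)
length(demo_lines)
```

If the line counts from this helper match those from the plain `readLines()` calls above, the simpler calls are sufficient.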

Basic Summary Statistics

data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)
##   Dataset   Lines    Words
## 1   Blogs  899288 37546250
## 2    News 1010242 34762395
## 3 Twitter 2360148 30093413
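
The counts above also give the average number of words per line, which makes the difference between the sources easier to see. The sketch below recomputes it from the reported totals; the `WordsPerLine` column name is an illustrative choice, not part of the original report.

```r
# Counts copied from the summary table above
summary_stats <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010242, 2360148),
  Words = c(37546250, 34762395, 30093413)
)

# Average words per line for each source, rounded to one decimal place
summary_stats$WordsPerLine <- round(summary_stats$Words / summary_stats$Lines, 1)
summary_stats
```

Blog and news lines average roughly 42 and 34 words, while tweets average about 13, so the Twitter file has the most lines but the fewest words.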

File Sizes (MiB)

file.info("data/en_US.blogs.txt")$size / 1024^2
## [1] 200.4242
file.info("data/en_US.news.txt")$size / 1024^2
## [1] 196.2775
file.info("data/en_US.twitter.txt")$size / 1024^2
## [1] 159.3641
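
The three calls above can also be written as a single vectorized call. This is a minimal sketch: `size_mib` is a hypothetical helper name, and the temporary files stand in for the corpus files so that the example runs on its own.

```r
# Hypothetical helper (an assumption): sizes in MiB (bytes / 1024^2)
# for a vector of file paths
size_mib <- function(paths) {
  round(file.size(paths) / 1024^2, 2)
}

# Demonstration on temporary files of known size
tmp_small <- tempfile()
tmp_large <- tempfile()
writeBin(raw(1024^2), tmp_small)      # exactly 1 MiB of zero bytes
writeBin(raw(3 * 1024^2), tmp_large)  # exactly 3 MiB of zero bytes
demo_sizes <- size_mib(c(tmp_small, tmp_large))
demo_sizes
```

With the real data, the same helper could be called once on a vector of the three file paths.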

Longest Line Length (characters)

max(nchar(blogs))
## [1] 40833
max(nchar(news))
## [1] 11384
max(nchar(twitter))
## [1] 140
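
The maximum alone says little about typical line length, so a fuller picture could come from the distribution of `nchar()` values. The sketch below shows the idea on a toy vector. `toy_lines` is illustrative data, not taken from the corpora, and the same calls could be applied to `blogs`, `news`, or `twitter`.

```r
# Toy input standing in for one of the corpora (illustrative only)
toy_lines <- c("short", "a somewhat longer line of text", "x")

# Characters per line, then a summary of the distribution
line_lengths <- nchar(toy_lines)
summary(line_lengths)

# Selected quantiles: median, 95th percentile, and maximum
quantile(line_lengths, probs = c(0.5, 0.95, 1))
```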

Sampling the Data

Because each file is large (roughly 160 to 200 MB), a random sample of 5,000 lines is drawn from each source for exploratory analysis, with a fixed seed so the sample is reproducible.

set.seed(123)
sample_blogs <- sample(blogs, 5000)
sample_news <- sample(news, 5000)
sample_twitter <- sample(twitter, 5000)
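
Because the sources differ in size, a fixed count of 5,000 lines covers a different share of each one. The sketch below computes those shares from the line counts reported earlier and shows one possible alternative, sampling a fixed fraction instead. `sample_fraction` is a hypothetical helper, not part of the original report.

```r
# Line counts copied from the summary statistics above
total_lines <- c(Blogs = 899288, News = 1010242, Twitter = 2360148)
sample_size <- 5000

# Percentage of each source covered by a 5,000-line sample
sample_pct <- round(100 * sample_size / total_lines, 2)
sample_pct

# Hypothetical alternative (an assumption): sample a fixed fraction of lines
sample_fraction <- function(x, fraction) {
  sample(x, size = ceiling(fraction * length(x)))
}

# Demonstration on a built-in vector: half of the 26 letters
set.seed(123)
demo_sample <- sample_fraction(letters, 0.5)
length(demo_sample)
```

Each sample covers well under 1% of its source, and the Twitter sample covers the smallest share.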

Challenges & Next Steps

Conclusion

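This exploratory analysis established the structure and scale of the three corpora: each holds roughly 0.9 to 2.4 million lines and 30 to 38 million words, and the longest lines range from 140 characters for Twitter to more than 40,000 characters for blogs. These findings will guide the development of a predictive text model in the next stages of the capstone project.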