Data Science Capstone – Milestone Report

Overview

This milestone report explores the three English-language (en_US) SwiftKey text data sets, drawn from blogs, news, and Twitter, provided for the Data Science Capstone project.
The goal is to summarise their structure and basic statistics, and to identify the challenges involved in building a predictive text model.

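library(stringi)  # provides stri_count_words() for the word counts below
library(dplyr)    # loaded, but not called in the code shown in this report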

Data Loading

blogs <- readLines("data/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("data/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("data/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Basic Summary Statistics

data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)

##   Dataset   Lines    Words
## 1   Blogs  899288 37546250
## 2    News 1010242 34762395
## 3 Twitter 2360148 30093413
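
The line and word counts above also imply an average line length for each source. The following is a minimal sketch that re-enters the reported totals by hand; the names `summary_stats` and `WordsPerLine` are illustrative and not part of the original analysis.

```r
# Totals copied from the summary table above
summary_stats <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines   = c(899288, 1010242, 2360148),
  Words   = c(37546250, 34762395, 30093413)
)

# Average number of words per line for each source
summary_stats$WordsPerLine <- round(summary_stats$Words / summary_stats$Lines, 2)
summary_stats
```

Blog and news lines average roughly 42 and 34 words, against about 13 for tweets, so a fixed number of sampled lines carries very different amounts of text depending on the source.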

File Sizes (MB)

file.info("data/en_US.blogs.txt")$size / 1024^2

## [1] 200.4242

file.info("data/en_US.news.txt")$size / 1024^2

## [1] 196.2775

file.info("data/en_US.twitter.txt")$size / 1024^2

## [1] 159.3641
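
The three separate calls above could also be collapsed into one, since base R's `file.size()` accepts a vector of paths. The sketch below assumes nothing about the real corpus: two temporary files (the hypothetical `paths` vector) stand in for the data files so that the example runs anywhere.

```r
# Illustrative only: two temporary files stand in for the corpus files
paths <- c(a = tempfile(fileext = ".txt"), b = tempfile(fileext = ".txt"))
writeLines(rep("hello world", 1000), paths[["a"]])
writeLines(rep("hi", 10), paths[["b"]])

# file.size() is vectorised, so one call covers every path
size_mb <- file.size(paths) / 1024^2
round(size_mb, 4)
```

Replacing `paths` with the three `data/en_US.*.txt` paths would return all three sizes in a single vector.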

Longest Line Length (characters)

max(nchar(blogs))

## [1] 40833

max(nchar(news))

## [1] 11384

max(nchar(twitter))

## [1] 140

Sampling the Data

Because the full data set is large, exploratory analysis is run on a random sample of 5,000 lines from each source. The seed is fixed so that the same sample is drawn on every run.

set.seed(123)
sample_blogs <- sample(blogs, 5000)
sample_news <- sample(news, 5000)
sample_twitter <- sample(twitter, 5000)
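
One caveat with a fixed sample size is that `sample()` raises an error when asked for more elements than the vector holds (without replacement). A small guard avoids that if the same code is later pointed at a shorter file. This is only a sketch: `sample_lines` is a hypothetical helper, and `toy_lines` stands in for a corpus vector such as `blogs`.

```r
# Hypothetical helper: draw up to `n` lines without replacement,
# capping the request at the number of lines available
sample_lines <- function(lines, n = 5000) {
  sample(lines, min(n, length(lines)))
}

set.seed(123)
toy_lines <- paste("line", 1:20)  # stand-in for a corpus vector such as `blogs`

small_sample <- sample_lines(toy_lines, n = 5)    # ordinary case
capped_sample <- sample_lines(toy_lines, n = 100)  # request exceeds the data

length(small_sample)
length(capped_sample)
```

When the request exceeds the data, the helper simply returns every line in shuffled order instead of failing.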

Challenges & Next Steps

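Large data size requires efficient memory management: the three files total about 556 MB on disk and roughly 4.27 million lines
Data cleaning is required before modelling: handling punctuation, filtering profanity, and deciding how to treat stopwords
N-gram models (for example bigrams and trigrams) will be explored in the next phase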
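
As a rough illustration of the cleaning and n-gram steps listed above, the sketch below lowercases a few toy sentences, splits them into words with `stringi`, and counts bigrams. It is an assumption about how the next phase might begin, not the method the project will necessarily use; `toy_text` and `make_bigrams` are hypothetical names, and profanity or stopword filtering (which needs a word list) is not shown.

```r
library(stringi)

# Toy stand-in for sampled lines such as `sample_twitter`
toy_text <- c("The cat sat on the mat.", "The cat ate!", "A dog sat on the mat")

# Lowercase, then split each line into words (punctuation is dropped)
tokens <- stri_extract_all_words(stri_trans_tolower(toy_text))

# Build bigrams within each line so that pairs never span two lines
make_bigrams <- function(words) {
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}
bigrams <- unlist(lapply(tokens, make_bigrams))

# Frequency table, most frequent bigrams first
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
head(bigram_counts, 4)
```

Pairing `head(words, -1)` with `tail(words, -1)` lines each word up with its successor; the same pattern extends to trigrams by pasting a third, further-shifted vector.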

Conclusion

This exploratory analysis established the structure and scale of the data set: about 4.27 million lines and roughly 102 million words across the three sources. These findings will guide the development of a predictive text model in the next stages of the capstone project.