Capstone EDA Report

Loading and Summary Statistics

library(stringi)

# Read each corpus as UTF-8, skipping any embedded NUL characters
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Summary: number of lines in each corpus
length(blogs)

## [1] 899288

length(news)

## [1] 1010206

length(twitter)

## [1] 2360148

# Summary: total number of words in each corpus
sum(stri_count_words(blogs))

## [1] 37546806

sum(stri_count_words(news))

## [1] 34761151

sum(stri_count_words(twitter))

## [1] 30096690
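
The line and word counts above could also be collected into a single table, which is easier to compare across sources. The following is a minimal sketch, not the method used in this report: the helper name `summarise_corpora` and the tiny in-memory `demo` list are hypothetical, and the only assumption is that each corpus is a character vector with one entry per line, as returned by `readLines`.

```r
library(stringi)

# Hypothetical helper: build one summary row per corpus.
# `corpora` is assumed to be a named list of character vectors,
# one element per line of text.
summarise_corpora <- function(corpora) {
  data.frame(
    source = names(corpora),
    # Number of lines (elements) in each corpus
    lines = vapply(corpora, length, integer(1)),
    # Total words per corpus, counted with stringi's word-boundary rules
    words = vapply(
      corpora,
      function(x) as.numeric(sum(stri_count_words(x))),
      numeric(1)
    ),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Small illustrative input (not the real data files)
demo <- list(
  blogs   = c("Hello world", "R is fun"),
  twitter = c("short tweet")
)

stats <- summarise_corpora(demo)
print(stats)
```

With the real data, the call would presumably be `summarise_corpora(list(blogs = blogs, news = news, twitter = twitter))`, which should reproduce the individual counts shown above in one data frame.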

Mayank Jain

2025-06-21

Introduction

Loading and Summary Statistics