Exploratory Data Analysis for Next Word Prediction

Introduction

This report explores the text data provided for the Coursera Data Science Capstone project. The analysis characterizes the structure and size of the data before a next-word prediction model is built.

Data Description

The data consists of three text sources:

- Blogs
- News
- Twitter

These datasets contain large amounts of unstructured English text.

Basic Summary Statistics

The following line counts summarize the size of each dataset.

# Number of lines in each source
blogs_lines <- 899288
news_lines <- 77259
twitter_lines <- 2360148

# Summary table of line counts by source
data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(blogs_lines, news_lines, twitter_lines)
)

##    Source   Lines
## 1   Blogs  899288
## 2    News   77259
## 3 Twitter 2360148
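
The counts above are entered as constants. As a minimal sketch of how such counts could be derived from the raw files, the helper below reads a file and returns its number of lines. The function name `count_lines` and the file paths in the usage note are hypothetical and do not come from the report. Opening the connection in binary mode is a precaution: on some platforms a text-mode read can stop early at an embedded control character, which would undercount lines.

```r
# Hypothetical helper (not from the report): count the lines in a text file.
count_lines <- function(path) {
  # Binary mode avoids a premature end-of-file on embedded control characters.
  con <- file(path, open = "rb")
  on.exit(close(con))
  # skipNul drops embedded NUL bytes instead of truncating the line.
  length(readLines(con, warn = FALSE, skipNul = TRUE))
}

# Demonstration on a small temporary file, so the sketch runs without the real data.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), demo_path)
demo_count <- count_lines(demo_path)
print(demo_count)
```

With the real data, the same helper could be applied to each source file, for example `sapply(c(Blogs = "blogs.txt", News = "news.txt", Twitter = "twitter.txt"), count_lines)`, where the three paths are placeholders for wherever the files are stored.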

Soaif Habib
