This report presents an exploratory analysis of the training data used in the Coursera Data Science Capstone Project. The goal is to understand the structure and basic characteristics of the text data that will be used to build a word prediction model.
The model will be trained using a unified document corpus compiled from the following three sources of text data:
Blogs
News
Twitter
All datasets were successfully loaded into R, and the number of lines in each dataset confirms their large scale and suitability for text mining and natural language processing tasks.
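A minimal sketch of this loading step is shown below; the file names en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt and their location in the working directory are assumptions based on the standard Capstone dataset layout.

```r
# Read each source file as a character vector of lines
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Line counts confirm the scale of each dataset
sapply(list(Blogs = blogs, News = news, Twitter = twitter), length)
```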
Histograms of line lengths (a sketch for reproducing them follows this list) show that:
Blog entries have the widest distribution, with many long-form texts.
News articles are moderately sized and more structured.
Twitter text is short and compact, reflecting informal communication styles.
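These distributions can be reproduced with a minimal base R sketch, assuming the blogs, news, and twitter vectors from the loading step above:

```r
# Characters per line for each dataset, plotted side by side
par(mfrow = c(1, 3))
hist(nchar(blogs),   main = "Blogs",   xlab = "Characters per line")
hist(nchar(news),    main = "News",    xlab = "Characters per line")
hist(nchar(twitter), main = "Twitter", xlab = "Characters per line")
```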
A frequency analysis of a random sample of Twitter text reveals that commonly used English words dominate the dataset. This highlights the importance of removing stop words and applying further text normalization techniques before model training.
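A sketch of such a frequency count is given below; the sample size of 10,000 lines and the simple whitespace tokenization are illustrative assumptions rather than the exact procedure used in the analysis.

```r
set.seed(123)
tweet_sample <- tolower(sample(twitter, 10000))

# Keep only letters, apostrophes, and spaces, then split on whitespace
words <- unlist(strsplit(gsub("[^a-z' ]", " ", tweet_sample), "\\s+"))
words <- words[words != ""]

# The most frequent tokens are dominated by common English stop words
head(sort(table(words), decreasing = TRUE), 20)
```

Stop words could subsequently be filtered out, for example against a standard list such as tm::stopwords("en").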
The table below summarizes the basic characteristics of the three datasets, including the number of lines, maximum character length, and mean character length per line.
Key observation from the summary table:
The Twitter dataset contains the largest number of lines.
Blog posts have the longest individual entries.
Tweets are significantly shorter due to character limitations.
| Dataset | Lines | Max_Characters | Mean_Characters |
|---|---|---|---|
| Blogs | 899288 | 40833 | 229.98695 |
| News | 1010206 | 11384 | 201.16149 |
| Twitter | 2360148 | 140 | 68.68045 |
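For reference, these statistics can be computed along the following lines, again assuming the blogs, news, and twitter vectors from the loading step:

```r
corpora <- list(Blogs = blogs, News = news, Twitter = twitter)

summary_table <- data.frame(
  Dataset         = names(corpora),
  Lines           = sapply(corpora, length),
  Max_Characters  = sapply(corpora, function(x) max(nchar(x))),
  Mean_Characters = sapply(corpora, function(x) mean(nchar(x)))
)
summary_table
```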
The exploratory analysis shows clear differences between the three datasets:
Blogs contain long, descriptive, and informal text.
News articles are more structured and formal.
Twitter data is short, informal, and highly variable.
These distinctions suggest that careful preprocessing, including tokenization, normalization, and n-gram modeling, will be critical for building an effective predictive text model.
The next phase of the project will focus on deeper text preprocessing and model development. This will include:
Tokenizing text into unigrams, bigrams, and trigrams (see the sketch below).
Building frequency-based n-gram models.
Evaluating prediction accuracy and performance.
Deploying the final predictive model as a Shiny application.
These steps will enable the development of an interactive text prediction tool that suggests the next word based on user input.
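As a preview of the tokenization step, the sketch below builds a simple n-gram frequency table in base R; the helper make_ngrams, the regex-based cleaning, and the 5,000-line sample size are illustrative assumptions, not the final modeling code.

```r
# Hypothetical helper: count n-grams in a character vector of lines
make_ngrams <- function(lines, n = 2) {
  word_lists <- strsplit(gsub("[^a-z' ]", " ", tolower(lines)), "\\s+")
  ngrams <- unlist(lapply(word_lists, function(w) {
    w <- w[w != ""]
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(ngrams), decreasing = TRUE)
}

# Example: most frequent bigrams in a small sample of the combined corpus
set.seed(123)
corpus_sample <- sample(c(blogs, news, twitter), 5000)
head(make_ngrams(corpus_sample, n = 2), 10)
```

A full model would extend this idea to unigram, bigram, and trigram frequency tables that can be queried to suggest the most likely next word.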
This exploratory analysis provides a solid foundation for the predictive modeling phase of the Coursera Data Science Capstone project. Understanding the size, structure, and characteristics of the datasets ensures that appropriate preprocessing and modeling decisions can be made in subsequent stages.