Introduction

This report presents an exploratory analysis of the training data used in the Coursera Data Science Capstone Project. The goal is to understand the structure and basic characteristics of the text data that will be used to build a word prediction model.

The model will be trained using a unified document corpus compiled from the following three sources of text data:

  1. Blogs

  2. News

  3. Twitter

Loading The Data

All datasets were successfully loaded into R, and the number of lines in each dataset confirms their large scale and suitability for text mining and natural language processing tasks.
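
As a rough sketch, the three files could be read with base R as follows; the file paths assume the standard en_US layout of the capstone dataset and may differ locally:

    # Assumed file layout of the capstone dataset; adjust paths as needed
    blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
    news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
    twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

    # Number of lines per dataset
    sapply(list(Blogs = blogs, News = news, Twitter = twitter), length)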

Line length distribution

Histograms of line lengths show that blog and news lines are strongly right-skewed, with most lines fairly short but a few running to thousands of characters, while Twitter lines are capped at 140 characters.
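
A minimal way to produce such histograms, assuming the blogs, news, and twitter character vectors from the loading step:

    # Characters per line for each source, plotted side by side
    par(mfrow = c(1, 3))
    hist(nchar(blogs),   breaks = 50, main = "Blogs",   xlab = "Characters per line")
    hist(nchar(news),    breaks = 50, main = "News",    xlab = "Characters per line")
    hist(nchar(twitter), breaks = 50, main = "Twitter", xlab = "Characters per line")
    par(mfrow = c(1, 1))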

Word count comparison
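
A simple words-per-line comparison could be sketched as follows; the whitespace split and the words_per_line helper are illustrative simplifications rather than a full tokenizer:

    # Words per line, using a simple whitespace split
    # (a random sample could be used to speed this up on the full corpus)
    words_per_line <- function(x) lengths(strsplit(x, "\\s+"))

    boxplot(list(Blogs   = words_per_line(blogs),
                 News    = words_per_line(news),
                 Twitter = words_per_line(twitter)),
            outline = FALSE, ylab = "Words per line")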

Most frequent words

A frequency analysis of a random sample of Twitter text reveals that commonly used English words dominate the dataset. This highlights the importance of removing stop words and applying further text normalization techniques before model training.
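
One possible sketch of this frequency analysis, with an illustrative sample size and only basic normalization (stop words are deliberately kept to show how they dominate):

    # Word frequencies in a random sample of tweets (stop words retained)
    set.seed(123)
    tweet_sample <- sample(twitter, 10000)
    tokens <- unlist(strsplit(gsub("[^a-z' ]", " ", tolower(tweet_sample)), "\\s+"))
    tokens <- tokens[tokens != ""]
    head(sort(table(tokens), decreasing = TRUE), 20)

    # Stop words could then be dropped with a list such as tm::stopwords("en")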

Summary Table

The table below summarizes the basic characteristics of the three datasets, including the number of lines, maximum character length, and mean character length per line.

Key observation from the summary table: Twitter contributes by far the most lines but the shortest ones (capped at 140 characters), whereas blogs have the fewest lines and the longest average line length (about 230 characters).

Summary statistics of the text datasets

  Dataset      Lines   Max characters   Mean characters
  Blogs       899288            40833         229.98695
  News       1010206            11384         201.16149
  Twitter    2360148              140          68.68045
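
The figures in the table could be assembled with a short sketch like the following, again assuming the vectors from the loading step:

    # Assembling the summary statistics from the loaded character vectors
    data.frame(
      Dataset         = c("Blogs", "News", "Twitter"),
      Lines           = c(length(blogs), length(news), length(twitter)),
      Max_Characters  = sapply(list(blogs, news, twitter), function(x) max(nchar(x))),
      Mean_Characters = sapply(list(blogs, news, twitter), function(x) mean(nchar(x)))
    )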

Findings:

The exploratory analysis shows clear differences among the three datasets: the sources differ substantially in the number of lines, in typical line length, and in how widely line lengths vary.

These distinctions suggest that careful preprocessing, including tokenization, normalization, and n-gram modeling, will be critical for building an effective predictive text model.
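
As an illustrative sketch only, bigram counts could be built from a small random sample using the tokenizers package; the sample size and the package are assumptions, not the project's fixed toolchain:

    # Bigram counts from a small random sample of the combined corpus
    library(tokenizers)

    set.seed(123)
    sample_text <- sample(c(blogs, news, twitter), 5000)
    bigrams <- unlist(tokenize_ngrams(sample_text, n = 2))
    head(sort(table(bigrams), decreasing = TRUE), 10)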

Next steps:

The next phase of the project will focus on deeper text preprocessing and model development. This will include:

  1. Tokenization of the corpus into words and sentences

  2. Normalization, including lowercasing and removal of punctuation and stop words

  3. Building n-gram frequency tables to capture common word sequences

  4. Developing and evaluating an n-gram based next-word prediction model

These steps will enable the development of an interactive text prediction tool that suggests the next word based on user input.
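
For illustration, a minimal next-word lookup based on the bigram counts from the previous sketch might look like this; the predict_next helper is hypothetical, and a real model would add smoothing and backoff across n-gram orders:

    # Most frequent completions of a given word, from the bigram counts above
    predict_next <- function(word, bigram_counts, k = 3) {
      matches <- bigram_counts[grepl(paste0("^", word, " "), names(bigram_counts))]
      matches <- sort(matches, decreasing = TRUE)
      sub(".* ", "", names(head(matches, k)))
    }

    predict_next("in", table(bigrams))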

Conclusion:

This exploratory analysis provides a solid foundation for the predictive modeling phase of the Coursera Data Science Capstone project. Understanding the size, structure, and characteristics of the datasets ensures that appropriate preprocessing and modeling decisions can be made in subsequent stages.