This report presents an exploratory analysis of the training data used in the Coursera Data Science Capstone Project. The goal is to understand the structure and basic characteristics of the text data that will be used to build a word prediction model.
The model will be trained using a unified document corpus compiled from the following three sources of text data:
Blogs
News
Twitter
All datasets were successfully loaded into R, and the number of lines in each dataset confirms their large scale and suitability for text mining and natural language processing tasks.
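A minimal sketch of this loading step is shown below; the file names en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt and their location in the working directory are assumptions based on the standard Capstone dataset layout.

```r
# Read each source file as a character vector of lines
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Line counts confirm the scale of each dataset
sapply(list(Blogs = blogs, News = news, Twitter = twitter), length)
```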
Histograms of line lengths (a sketch for reproducing them follows this list) show that:
Blog entries have the widest distribution, with many long-form texts.
News articles are moderately sized and more structured.
Twitter text is short and compact, reflecting informal communication styles.
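These distributions can be reproduced with a minimal base R sketch, assuming the blogs, news, and twitter vectors from the loading step above:

```r
# Characters per line for each dataset, plotted side by side
par(mfrow = c(1, 3))
hist(nchar(blogs),   main = "Blogs",   xlab = "Characters per line")
hist(nchar(news),    main = "News",    xlab = "Characters per line")
hist(nchar(twitter), main = "Twitter", xlab = "Characters per line")
```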
A frequency analysis of a random sample of Twitter text reveals that commonly used English words dominate the dataset. This highlights the importance of removing stop words and applying further text normalization techniques before model training.
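A sketch of such a frequency count is given below; the sample size of 10,000 lines and the simple whitespace tokenization are illustrative assumptions rather than the exact procedure used in the analysis.

```r
set.seed(123)
tweet_sample <- tolower(sample(twitter, 10000))

# Keep only letters, apostrophes, and spaces, then split on whitespace
words <- unlist(strsplit(gsub("[^a-z' ]", " ", tweet_sample), "\\s+"))
words <- words[words != ""]

# The most frequent tokens are dominated by common English stop words
head(sort(table(words), decreasing = TRUE), 20)
```

Stop words could subsequently be filtered out, for example against a standard list such as tm::stopwords("en").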
The table below summarizes the basic characteristics of the three datasets, including the number of lines, maximum character length, and mean character length per line.
Key observation from the summary table:
The Twitter dataset contains the largest number of lines.
Blog posts have the longest individual entries.
Tweets are significantly shorter due to character limitations.
| Dataset | Lines | Max_Characters | Mean_Characters |
|---|---|---|---|
| Blogs | 899288 | 40833 | 229.98695 |
| News | 1010206 | 11384 | 201.16149 |
| Twitter | 2360148 | 140 | 68.68045 |
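For reference, these statistics can be computed along the following lines, again assuming the blogs, news, and twitter vectors from the loading step:

```r
corpora <- list(Blogs = blogs, News = news, Twitter = twitter)

summary_table <- data.frame(
  Dataset         = names(corpora),
  Lines           = sapply(corpora, length),
  Max_Characters  = sapply(corpora, function(x) max(nchar(x))),
  Mean_Characters = sapply(corpora, function(x) mean(nchar(x)))
)
summary_table
```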
The exploratory analysis shows clear differences between the three datasets:
Blogs contain long, descriptive, and informal text.
News articles are more structured and formal.
Twitter data is short, informal, and highly variable.
These distinctions suggest that careful preprocessing, including tokenization, normalization, and n-gram modeling, will be critical for building an effective predictive text model.
The next phase of the project will focus on deeper text preprocessing and model development. This will include:
Tokenizing text into unigrams, bigrams, and trigrams (see the sketch below).
Building frequency-based n-gram models.
Evaluating prediction accuracy and performance.
Deploying the final predictive model as a Shiny application.
These steps will enable the development of an interactive text prediction tool that suggests the next word based on user input.
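As a preview of the tokenization step, the sketch below builds a simple n-gram frequency table in base R; the helper make_ngrams, the regex-based cleaning, and the 5,000-line sample size are illustrative assumptions, not the final modeling code.

```r
# Hypothetical helper: count n-grams in a character vector of lines
make_ngrams <- function(lines, n = 2) {
  word_lists <- strsplit(gsub("[^a-z' ]", " ", tolower(lines)), "\\s+")
  ngrams <- unlist(lapply(word_lists, function(w) {
    w <- w[w != ""]
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(ngrams), decreasing = TRUE)
}

# Example: most frequent bigrams in a small sample of the combined corpus
set.seed(123)
corpus_sample <- sample(c(blogs, news, twitter), 5000)
head(make_ngrams(corpus_sample, n = 2), 10)
```

A full model would extend this idea to unigram, bigram, and trigram frequency tables that can be queried to suggest the most likely next word.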
This exploratory analysis provides a solid foundation for the predictive modeling phase of the Coursera Data Science Capstone project. Understanding the size, structure, and characteristics of the datasets ensures that appropriate preprocessing and modeling decisions can be made in subsequent stages.