Data Science Capstone Milestone Report

Introduction

This project builds a predictive text model using natural language processing (NLP) techniques. The training data consists of text from blogs, news articles, and Twitter. The end goal is a Shiny application that predicts the next word from the user's input.

Data Summary

The dataset contains three files:

  • Blogs dataset
  • News dataset
  • Twitter dataset

The data was successfully downloaded and loaded into R for analysis.
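
As a rough illustration of the kind of summary that can be computed once the text is loaded, the sketch below counts lines and words for each source. It is written in Python purely for illustration and is not the R code used for this report. The function name summarize, the word pattern, and the tiny sample texts are all made up for the example.

```python
import re

def summarize(lines):
    """Return basic size statistics for one text source given as a list of lines."""
    # Count word-like tokens (letters and apostrophes) on each line.
    word_counts = [len(re.findall(r"[A-Za-z']+", line)) for line in lines]
    return {
        "lines": len(lines),
        "words": sum(word_counts),
        "max_words_per_line": max(word_counts, default=0),
        "mean_words_per_line": sum(word_counts) / len(lines) if lines else 0.0,
    }

# Made-up samples standing in for the three real files (assumption).
samples = {
    "blogs": ["This is a longer blog style sentence about my weekend.", "Another post."],
    "news": ["The council approved the budget on Tuesday."],
    "twitter": ["omg so true", "lol", "cant wait for friday"],
}

for name, lines in samples.items():
    print(name, summarize(lines))
```

With the real data, each list of lines would be read from the corresponding file instead of being typed in.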

Exploratory Analysis

Basic exploratory analysis was performed on the datasets.

Key observations:

  • The blogs dataset contains long-form text.
  • The news dataset contains formal language.
  • The Twitter dataset contains short and informal messages.
  • Word frequencies vary significantly across datasets.
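
One way to compare word frequencies across the three sources is to look at each source's most common words as a share of its own tokens, so that sources of different sizes stay comparable. The following is a minimal Python sketch under that assumption, not the report's R code. The function name top_word_shares and the simple lowercase tokenization are illustrative choices.

```python
from collections import Counter
import re

def top_word_shares(lines, n=3):
    """Return the n most frequent words with their share of all tokens in the source."""
    counts = Counter()
    for line in lines:
        # Lowercase first so that "The" and "the" are counted together.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(n)]

# Made-up one-line example (assumption): "the" is 2 of 4 tokens.
print(top_word_shares(["the cat the dog"], n=1))  # [('the', 0.5)]
```

Running the same function on each source separately and placing the results side by side shows where the vocabularies differ.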

Findings

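Some words occur much more frequently than others. Common English words dominate the corpus, while rare words appear only a few times. This skew is useful for building an efficient prediction model: a relatively small set of frequent words accounts for most of the text, so rare words can likely be dropped to save memory without losing much coverage.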
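
The skew in word frequencies described above can be quantified by asking how many distinct words are needed to cover a given fraction of all word occurrences. The sketch below shows one way to compute this in Python. It is an illustration, not the report's R code, and the function name words_needed_for_coverage and the toy token list are assumptions made for the example.

```python
from collections import Counter

def words_needed_for_coverage(tokens, coverage):
    """Smallest number of distinct words whose occurrences make up at least
    `coverage` (a fraction between 0 and 1) of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    # Walk through words from most to least frequent, accumulating their counts.
    for needed, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= coverage * total:
            return needed
    return 0  # only reached for empty input

# Toy token list (assumption): "the" alone is half of the 8 tokens.
tokens = "the the the the a a b c".split()
print(words_needed_for_coverage(tokens, 0.5))   # 1
print(words_needed_for_coverage(tokens, 0.75))  # 2
print(words_needed_for_coverage(tokens, 1.0))   # 4
```

On a real corpus, a small result for 50% coverage next to a much larger one for 90% would confirm the pattern described in this section.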

Prediction Model Plan

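The prediction model will be based on N-grams, that is, sequences of N consecutive words. The model will look at the last few words typed and predict the most likely next word. Techniques such as backoff (falling back to a shorter word sequence when the longer one was never seen in training) and smoothing (reserving some probability for unseen word sequences) may be used to improve prediction accuracy.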
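
The backoff idea in this plan can be sketched as follows. This is a deliberately simple Python illustration, not the planned implementation and not R code from this project: it counts next words for contexts of zero, one, and two preceding words, then predicts from the longest context that was seen in training. It applies no smoothing, and the function names build_ngram_tables and predict_next, the whitespace tokenization, and the training sentences are all made up for the example.

```python
from collections import Counter, defaultdict

def build_ngram_tables(sentences, max_n=3):
    """Map each context (a tuple of 0 to max_n-1 preceding words) to a Counter of next words."""
    tables = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            for k in range(max_n):  # k is the context length
                if i - k < 0:
                    break  # not enough preceding words for this context length
                context = tuple(words[i - k:i])
                tables[context][word] += 1
    return tables

def predict_next(tables, text, max_n=3):
    """Predict the next word, backing off from the longest context to shorter ones."""
    words = text.lower().split()
    for k in range(min(max_n - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - k:])
        if context in tables:
            # Most frequent next word observed after this context.
            return tables[context].most_common(1)[0][0]
    return None  # nothing was seen in training

# Made-up training sentences (assumption).
sentences = [
    "i like green tea",
    "i like green apples",
    "i like green tea a lot",
    "you like black coffee",
]
tables = build_ngram_tables(sentences)
print(predict_next(tables, "I like green"))    # tea
print(predict_next(tables, "we drink black"))  # coffee
```

In the second call the two-word context "drink black" was never seen, so the sketch backs off to the one-word context "black". A fuller model would also weight or smooth the counts rather than simply taking the most frequent continuation.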

Conclusion

The exploratory analysis provided useful insights into the structure of the text data. These findings will support the development of an effective predictive text application.