1. Introduction

The goal of this project is to build a text prediction application similar to the next-word suggestions offered by mobile phone keyboards. This report presents an exploratory analysis of the provided text data and outlines plans for the prediction algorithm and Shiny application.


2. Data Description

The data comes from the HC Corpora dataset and contains text from three sources: blogs, news articles, and Twitter.

The English (en_US) datasets were used for this analysis.


3. Data Loading

The data was successfully downloaded and loaded into R. Due to the large size of the datasets, only samples were used for exploratory analysis to ensure efficient processing.
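The sampling step can be sketched in base R as follows. This is a minimal illustration, not the code used for this report: the function name `sample_lines`, the default sampling fraction, and the seed are all assumptions. The file is opened in binary mode because text-mode reads can stop early on stray control characters in some corpus files.

```r
# Hypothetical helper: read a text file and keep a random fraction of its lines.
sample_lines <- function(path, fraction = 0.05, seed = 1234) {
  con <- file(path, open = "rb")          # binary mode avoids early truncation
  on.exit(close(con))
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  set.seed(seed)                          # fixed seed makes the sample reproducible
  n_keep <- ceiling(length(lines) * fraction)
  keep <- sample.int(length(lines), size = n_keep)
  lines[sort(keep)]                       # preserve the original line order
}
```

Calling `sample_lines(path, fraction = 0.05)` on each of the three files would give a sample small enough to explore interactively. Note that the function resets the global random seed on each call.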


4. Summary Statistics

The following table summarizes the basic characteristics of the datasets:

Dataset   Description
Blogs     Long-form personal writing
News      Formal news articles
Twitter   Short informal messages
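
Basic size figures for each sample can be computed with a few lines of base R. The sketch below is illustrative only: the function name `summarize_text` and the choice of metrics (line count, word count, longest line) are assumptions, and no figures from the actual datasets are implied.

```r
# Hypothetical helper: basic size statistics for a character vector of lines.
summarize_text <- function(lines) {
  # Split each line on runs of whitespace; empty lines contribute zero words.
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    lines     = length(lines),
    words     = sum(words_per_line),
    max_chars = max(nchar(lines))
  )
}
```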

5. Exploratory Data Analysis

Basic text preprocessing steps included:

- Converting text to lowercase
- Removing punctuation and numbers
- Removing extra whitespace
- Removing stop words
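
The preprocessing steps listed above can be sketched in base R as follows. This is a minimal illustration rather than the pipeline actually used: the function name `clean_text` and the very short stop-word list are assumptions, and a real analysis would more likely rely on a text-mining package with a full stop-word list.

```r
# Hypothetical helper applying the four preprocessing steps in order.
clean_text <- function(x, stop_words = c("the", "and", "to", "a", "of", "in")) {
  x <- tolower(x)                          # 1. lowercase
  x <- gsub("[[:punct:]]+", "", x)         # 2a. drop punctuation
  x <- gsub("[[:digit:]]+", "", x)         # 2b. drop numbers
  tokens <- strsplit(trimws(x), "\\s+")    # 3. splitting on whitespace runs
                                           #    collapses extra spaces
  tokens <- lapply(tokens, function(w) w[!(w %in% stop_words)])  # 4. stop words
  vapply(tokens, paste, character(1), collapse = " ")
}
```

Deleting punctuation outright merges contractions and hyphenated words ("don't" becomes "dont"). Replacing punctuation with a space instead is an equally reasonable choice.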

The most frequently occurring words were common English words such as “the”, “and”, and “to”.
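
A word-frequency count of this kind can be sketched in base R. The function name `top_words` and the tokenization rule are assumptions. The sketch counts raw tokens without stop-word removal, which is one way words such as "the" end up at the top of the list.

```r
# Hypothetical helper: the n most frequent lowercase tokens in a set of lines.
top_words <- function(lines, n = 10) {
  # Split on anything that is not a letter or an apostrophe.
  tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]         # drop empty strings left by the split
  head(sort(table(tokens), decreasing = TRUE), n)
}
```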


6. Interesting Findings


7. Plan for Prediction Algorithm

The prediction algorithm will:

- Use n-gram models (unigrams, bigrams, trigrams)
- Apply a backoff strategy for prediction
- Focus on efficiency and accuracy
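
The planned approach can be sketched in base R as a simple, unsmoothed backoff: try the trigram table first, fall back to bigrams, and finally to the most frequent unigrams. This is an assumption about how the plan might be realized, not a finished design. The names `build_ngrams`, `train_model`, and `predict_next` are hypothetical, and no discounting (as in Katz backoff) is applied.

```r
# Build all n-grams of a given order from a token vector.
build_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  idx <- seq_len(length(tokens) - n + 1)
  vapply(idx, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency tables for orders 1 to 3, each sorted from most to least frequent.
train_model <- function(tokens) {
  lapply(1:3, function(n) sort(table(build_ngrams(tokens, n)), decreasing = TRUE))
}

# Predict up to k next words, backing off from trigrams to bigrams to unigrams.
predict_next <- function(model, text, k = 3) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  for (n in 3:2) {
    if (length(words) < n - 1) next        # not enough context for this order
    prefix <- paste(tail(words, n - 1), collapse = " ")
    grams <- as.character(names(model[[n]]))
    hits <- grams[startsWith(grams, paste0(prefix, " "))]
    if (length(hits) > 0) {
      return(head(sub(".* ", "", hits), k))  # keep only the final word
    }
  }
  head(names(model[[1]]), k)               # last resort: most common words
}
```

For example, after `model <- train_model(tokens)`, a call such as `predict_next(model, "like green")` looks for trigrams that begin with "like green" before falling back to bigrams that begin with "green". Scanning every n-gram with `startsWith` is easy to read but slow on large tables, so a keyed lookup would be a natural efficiency improvement.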


8. Plan for Shiny Application

The Shiny app will:

- Take user text as input
- Predict the next word
- Display multiple suggestions
- Be simple and responsive
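
A minimal Shiny layout matching these points might look like the sketch below. It assumes the `shiny` package is installed. The predictor `suggest_words` is a hypothetical placeholder that returns fixed words, and a real app would call the n-gram model instead. The input and output IDs are likewise illustrative.

```r
library(shiny)

# Hypothetical placeholder predictor; a real app would query the n-gram model.
suggest_words <- function(text, k = 3) {
  head(c("the", "and", "to"), k)
}

ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("phrase", "Type a phrase:"),
  sliderInput("k", "Number of suggestions:", min = 1, max = 3, value = 3),
  verbatimTextOutput("suggestions")
)

server <- function(input, output, session) {
  # Re-runs whenever the phrase or the slider changes.
  output$suggestions <- renderText({
    req(nzchar(trimws(input$phrase)))      # show nothing until text is entered
    paste(suggest_words(input$phrase, input$k), collapse = "  |  ")
  })
}

app <- shinyApp(ui, server)
# runApp(app)  # uncomment to launch the app locally
```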


9. Conclusion

This report confirms that the data is suitable for building a text prediction model. The project is on track for successful completion.