The goal of this project is to build a text prediction application similar to the predictive keyboards on mobile phones. This report presents an exploratory analysis of the provided text data and outlines plans for the prediction algorithm and Shiny application.
The data comes from the HC Corpora dataset and contains text from three sources: blogs, news articles, and Twitter.
The English (en_US) datasets were used for this analysis.
The data was successfully downloaded and loaded into R. Due to the large size of the datasets, only samples were used for exploratory analysis to ensure efficient processing.
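The sampling step can be sketched in base R as follows. This is a minimal illustration rather than the exact code used for the report: the helper name `sample_lines`, the 5% fraction, and the seed are assumptions, and a generated vector stands in for the lines that `readLines()` would return from a real corpus file.

```r
# Hypothetical helper: draw a reproducible random sample of lines.
# Note: set.seed() changes the global random number generator state.
sample_lines <- function(lines, fraction = 0.01, seed = 42) {
  stopifnot(fraction > 0, fraction <= 1)
  set.seed(seed)
  n_keep <- max(1, floor(length(lines) * fraction))
  # Sort the sampled indices so the sample keeps the original line order.
  lines[sort(sample.int(length(lines), n_keep))]
}

# Stand-in data. With a real file this would be something like
# readLines("path/to/corpus.txt", encoding = "UTF-8", skipNul = TRUE),
# where the path is a placeholder.
demo_lines <- paste("line", 1:1000)
demo_sample <- sample_lines(demo_lines, fraction = 0.05)
length(demo_sample)
```

Fixing the seed makes the exploratory results repeatable across runs, which helps when comparing later changes to the preprocessing.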
The following table summarizes the basic characteristics of the datasets:
| Dataset | Description |
|---|---|
| Blogs | Long-form personal writing |
| News | Formal news articles |
| Twitter | Short informal messages |
Basic text preprocessing steps included:

- Converting text to lowercase
- Removing punctuation and numbers
- Removing extra whitespace
- Removing stop words
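
The preprocessing steps above could look roughly like this in base R. The function name `clean_text` and the very short stop-word list are assumptions made for illustration. A real pipeline would likely use a much longer list, possibly from a text-mining package.

```r
# Tiny illustrative stop-word list (an assumption; real lists are much longer).
stop_words <- c("the", "and", "to", "a", "of", "in", "is")

clean_text <- function(x, stop_words) {
  x <- tolower(x)                                # convert to lowercase
  x <- gsub("[[:punct:][:digit:]]+", " ", x)     # punctuation and numbers become spaces
  tokens <- strsplit(trimws(x), "[[:space:]]+")  # split on runs of whitespace
  vapply(tokens, function(tok) {
    # Drop empty tokens and stop words, then rejoin with single spaces.
    paste(tok[nzchar(tok) & !(tok %in% stop_words)], collapse = " ")
  }, character(1))
}

cleaned <- clean_text("The 3 cats   ran to the park, and slept!", stop_words)
cleaned
```

Replacing punctuation with a space, rather than deleting it, keeps words separated by a comma or dash from being glued together. The trade-off is that contractions such as "don't" are split into two tokens.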
The most frequently occurring words were common English words such as “the”, “and”, and “to”.
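A word-frequency count of this kind can be sketched in base R with `table()`. The helper name `count_words` and the two demo sentences are assumptions, and the counts here are taken before any stop-word removal so that common words remain visible.

```r
# Hypothetical helper: count word frequencies, most frequent first.
count_words <- function(lines) {
  # Split on anything that is not a lowercase letter or an apostrophe.
  tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]
  freq <- table(tokens)
  counts <- setNames(as.integer(freq), names(freq))
  sort(counts, decreasing = TRUE)
}

demo <- c("The cat sat on the mat.",
          "The dog and the cat went to the park.")
word_counts <- count_words(demo)
head(word_counts, 3)
```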
The prediction algorithm will:

- Use n-gram models (unigrams, bigrams, trigrams)
- Apply a backoff strategy for prediction
- Focus on efficiency and accuracy
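
As a rough sketch of this plan, the base R code below counts unigrams, bigrams, and trigrams and backs off from the longest matching context to shorter ones. It is a plain count-based backoff with no smoothing or discounting, which the final algorithm may well add. The function names (`make_ngrams`, `train_model`, `predict_next`) and the toy corpus are assumptions made for illustration.

```r
# Collect all n-grams of order n as space-joined strings.
make_ngrams <- function(token_list, n) {
  unlist(lapply(token_list, function(tok) {
    if (length(tok) < n) return(character(0))
    starts <- seq_len(length(tok) - n + 1)
    vapply(starts, function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
}

# Count n-grams for n = 1..max_n, most frequent first.
train_model <- function(token_list, max_n = 3) {
  lapply(seq_len(max_n), function(n) {
    sort(table(make_ngrams(token_list, n)), decreasing = TRUE)
  })
}

# Try the trigram table first, then bigrams, then fall back to unigrams.
predict_next <- function(model, text, k = 3) {
  words <- strsplit(tolower(trimws(text)), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  for (n in rev(seq_along(model)[-1])) {
    if (length(words) < n - 1) next
    # Context = last (n - 1) typed words, followed by a space.
    prefix <- paste0(paste(tail(words, n - 1), collapse = " "), " ")
    grams <- names(model[[n]])
    hits <- grams[startsWith(grams, prefix)]
    if (length(hits) > 0) {
      # Keep only the final word of each matching n-gram.
      return(head(sub(".* ", "", hits), k))
    }
  }
  head(names(model[[1]]), k)
}

# Toy corpus of already tokenized sentences (illustrative only).
corpus <- list(
  c("the", "cat", "sat", "on", "the", "mat"),
  c("the", "cat", "ate", "the", "fish"),
  c("a", "dog", "sat", "on", "the", "rug")
)
model <- train_model(corpus)
predict_next(model, "sat on the")  # served by the trigram table
predict_next(model, "my cat")      # backs off to bigrams
predict_next(model, "zebra")       # backs off to unigrams
```

Scanning n-gram names with `startsWith()` is easy to follow but linear in the table size. For the full corpus, a keyed lookup structure would probably be needed to meet the efficiency goal.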
The Shiny app will:

- Take user text as input
- Predict the next word
- Display multiple suggestions
- Be simple and responsive
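
A minimal sketch of such an app is shown below, assuming the `shiny` package is installed. The predictor `suggest_words` is a hypothetical placeholder that returns fixed words, and the input and output IDs are illustrative. In the real app the placeholder would be replaced by a call to the n-gram model.

```r
library(shiny)

# Hypothetical placeholder predictor: a real app would query the n-gram model.
suggest_words <- function(text, k = 3) {
  if (!nzchar(trimws(text))) return(character(0))
  head(c("the", "and", "to"), k)
}

ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("user_text", "Type some text:"),
  tableOutput("suggestions")
)

server <- function(input, output, session) {
  # Re-runs automatically whenever the text input changes.
  output$suggestions <- renderTable({
    words <- suggest_words(input$user_text)
    data.frame(Rank = seq_along(words), Suggestion = words)
  })
}

app <- shinyApp(ui = ui, server = server)
# Launch interactively with: runApp(app)
```

Showing the suggestions in a ranked table is one simple way to display several candidates at once. Buttons that insert the chosen word would be a natural refinement.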
This report confirms that the data is suitable for building a text prediction model. The project is on track for successful completion.