This milestone report explores the SwiftKey dataset provided for the Data Science Capstone project. The goal is to analyze the structure of the text data and outline the plan for building the word prediction algorithm and Shiny application.
The dataset consists of text from three different sources: Blogs, News, and Twitter. Below are the summary statistics for the English datasets.
| File Source | Line Count | Word Count |
|---|---|---|
| Blogs | 899,288 | 37,334,131 |
| News | 1,010,242 | 34,372,530 |
| Twitter | 2,360,148 | 30,373,543 |
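
As a rough illustration of how line and word counts like those in the table could be computed, here is a minimal Python sketch. The function name `summarize_lines` is hypothetical, and counting whitespace-delimited tokens as words is an assumption; a different tokenizer would give somewhat different word totals.

```python
def summarize_lines(lines):
    """Return (line_count, word_count) for an iterable of text lines."""
    line_count = 0
    word_count = 0
    for line in lines:
        line_count += 1
        # Assumption: a "word" is any whitespace-delimited token.
        word_count += len(line.split())
    return line_count, word_count


# Tiny in-memory sample standing in for one of the corpus files.
sample = ["The quick brown fox", "jumps over", "the lazy dog"]
print(summarize_lines(sample))  # (3, 9)
```

For the real files, the same function could be passed an open file handle (for example one opened with `encoding="utf-8"`), so the text is streamed line by line rather than loaded into memory at once.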
Initial exploration reveals that the datasets require cleaning and preprocessing before modeling. Tasks include converting text to lowercase, removing punctuation and special characters, and filtering profanity. Word frequencies appear to follow Zipf’s Law: a small number of words account for a large share of all occurrences, while most words appear only rarely.
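
The cleaning tasks listed above might look roughly like the following Python sketch. The `clean_line` function and the tiny `PROFANITY` set are hypothetical placeholders, not the project's actual word list, and the choice to keep ASCII apostrophes inside words (so that "can't" survives) is an assumption.

```python
import re

# Hypothetical placeholder list; a real project would load a fuller profanity list.
PROFANITY = {"darn", "heck"}


def clean_line(text, banned=PROFANITY):
    """Lowercase a line, strip punctuation and special characters, drop banned words."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, apostrophe, or space.
    text = re.sub(r"[^a-z' ]+", " ", text)
    # Trim stray apostrophes left at token edges (e.g. from quoted words).
    tokens = [t.strip("'") for t in text.split()]
    return [t for t in tokens if t and t not in banned]


print(clean_line("Well, DARN it! I can't #believe it :)"))
# ['well', 'it', 'i', "can't", 'believe', 'it']
```

Note that this simple character filter also discards digits and non-ASCII letters, which may or may not be desirable depending on how the model should treat numbers and accented words.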
The final prediction model will utilize an N-gram approach combined with a Stupid Backoff algorithm.
The algorithm will analyze sequences of words such as bigrams, trigrams, and four-grams to predict the next likely word based on historical frequency.
If no matching four-gram is found for the most recent words entered, the model will back off to progressively shorter N-grams (trigrams, then bigrams) until a prediction can be generated.
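
The backoff idea described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the final model: the names `build_counts` and `predict` are hypothetical, the backoff factor of 0.4 is the value suggested in the original Stupid Backoff paper (Brants et al., 2007), and the sketch also falls back to unigram frequencies as a last resort.

```python
from collections import Counter, defaultdict

ALPHA = 0.4  # backoff factor suggested by Brants et al. (2007)


def build_counts(sentences, max_n=4):
    """Build counts[n][context][next_word] for n-grams of order 1..max_n."""
    counts = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                counts[n][gram[:-1]][gram[-1]] += 1
    return counts


def predict(counts, history, max_n=4, top_k=3):
    """Rank candidate next words; each word is scored at the highest order where it is seen."""
    scores = {}
    penalty = 1.0
    start = min(max_n, len(history) + 1)  # cannot use more context than we have
    for n in range(start, 0, -1):
        context = tuple(history[len(history) - (n - 1):])
        followers = counts[n].get(context)
        if followers:
            # Simplification: the context count is taken as the sum of its followers.
            total = sum(followers.values())
            for word, c in followers.items():
                if word not in scores:
                    scores[word] = penalty * c / total
        penalty *= ALPHA  # each step down to a shorter context is discounted
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


corpus = [
    "i love new york".split(),
    "i love new york".split(),
    "i love new shoes".split(),
    "we want new ideas".split(),
]
counts = build_counts(corpus)
print(predict(counts, ["i", "love", "new"]))    # four-gram match available
print(predict(counts, ["you", "need", "new"]))  # backs off to the bigram level
```

In both calls the top-ranked word is "york"; in the second, the scores are discounted because no four-gram or trigram context matches. Stupid Backoff produces relative scores rather than true probabilities, which is generally acceptable when only the ranking of candidates matters.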
The Shiny app will provide a simple user interface where users can enter text and receive a predicted next word instantly.
To improve speed and usability, rare words and infrequent phrases will be removed from the model.
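
One simple way to drop infrequent phrases is to prune N-grams below a count threshold, as in this Python sketch. The function name `prune_ngrams` and the default threshold of 2 are illustrative assumptions; a suitable cutoff would need to be tuned against model size and prediction accuracy.

```python
from collections import Counter


def prune_ngrams(ngram_counts, min_count=2):
    """Keep only n-grams observed at least min_count times."""
    return Counter({g: c for g, c in ngram_counts.items() if c >= min_count})


# Hypothetical bigram counts for illustration.
bigrams = Counter({("of", "the"): 5, ("new", "york"): 2, ("purple", "walrus"): 1})
pruned = prune_ngrams(bigrams)
print(len(bigrams), len(pruned))  # 3 2
```

Because word frequencies are heavily skewed, removing N-grams seen only once typically shrinks the lookup tables substantially while affecting few predictions.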
This report demonstrated successful loading and exploration of the SwiftKey datasets. Initial analysis showed differences in size and structure between Blogs, News, and Twitter datasets. The next stage of the project will focus on cleaning the data, building N-gram models, and deploying a Shiny application for next-word prediction.