This report presents an exploratory analysis of the SwiftKey text dataset used in the Data Science Capstone Project. The objective is to understand the structure and characteristics of the data and identify patterns that can be used to build a predictive text model and Shiny application.
The dataset consists of three English-language text sources: Blogs, News, and Twitter. Exploratory analysis was conducted to summarize the data and identify common word usage patterns.
The dataset contains text collected from three different sources: blogs, news articles, and Twitter posts.
Basic summary statistics indicate that the datasets contain millions of words and lines of text. The Twitter dataset contains shorter messages, while the Blogs dataset contains longer text entries. News articles provide more formal language structures.
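The kind of summary described above can be sketched in a few lines of Python. This is only an illustration, not the code used for this report: the `summarize` function, the `CorpusSummary` type, and the tiny sample sentences are all hypothetical stand-ins for the real files, and words are counted by simple whitespace splitting.

```python
from dataclasses import dataclass


@dataclass
class CorpusSummary:
    lines: int
    words: int
    mean_words_per_line: float


def summarize(lines):
    """Count lines and whitespace-delimited words in an iterable of strings."""
    n_lines = 0
    n_words = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())
    mean = n_words / n_lines if n_lines else 0.0
    return CorpusSummary(n_lines, n_words, mean)


# Made-up sample text standing in for the real Blogs, News, and Twitter files.
sample_sources = {
    "blogs": ["I spent the whole weekend repainting the kitchen and it was worth it"],
    "news": ["The council approved the budget on Tuesday", "Officials declined to comment"],
    "twitter": ["so tired today", "coffee time"],
}

summaries = {name: summarize(lines) for name, lines in sample_sources.items()}
for name, s in summaries.items():
    print(f"{name}: {s.lines} lines, {s.words} words, "
          f"{s.mean_words_per_line:.1f} words/line")
```

For the full dataset, each list would be replaced by an open file handle so that lines are streamed rather than loaded into memory at once.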
The following preprocessing steps were applied:
These steps help standardize the text and improve the quality of subsequent analysis.
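As a rough illustration of what such standardization can look like, the sketch below assumes a typical cleaning pipeline: lowercasing, removing URLs, dropping digits and punctuation (apostrophes are kept so contractions survive), and collapsing whitespace. These particular steps, and the names `clean_text` and `tokenize`, are assumptions made for the example and may differ from the steps actually applied in this analysis.

```python
import re

# Assumed cleaning rules, chosen for illustration only.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z' ]+")   # anything except a-z, apostrophe, space
SPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Lowercase, strip URLs, drop digits/punctuation, collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def tokenize(text):
    """Split cleaned text into word tokens."""
    return clean_text(text).split()


print(tokenize("Check THIS out: http://example.com/page?id=1 !!! It's 2 good"))
```

URLs are removed before punctuation so that fragments such as "http" or "com" do not leak into the vocabulary.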
Exploratory analysis revealed several important characteristics of the dataset.
Word frequency analysis showed that a small number of words occur very frequently, while most words appear only a few times. This long-tailed pattern is typical of natural language text and is commonly described by Zipf's law.
The distribution of words suggests that frequency-based prediction methods can effectively model language usage.
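One way to quantify this skew is to ask how many distinct words are needed to cover a given share of all word occurrences. The sketch below shows the idea on a toy sentence; the function names and the sample text are hypothetical, and the numbers it prints say nothing about the actual dataset.

```python
from collections import Counter


def word_frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def words_needed_for_coverage(freqs, coverage):
    """Smallest number of most-frequent distinct words whose combined
    counts reach the given fraction (0 to 1) of all tokens."""
    total = sum(freqs.values())
    if total == 0:
        return 0
    running = 0
    for rank, (_, count) in enumerate(freqs.most_common(), start=1):
        running += count
        if running / total >= coverage:
            return rank
    return len(freqs)


# Toy text, for illustration only.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
freqs = word_frequencies(tokens)

print(freqs.most_common(3))
print(words_needed_for_coverage(freqs, 0.5))
print(words_needed_for_coverage(freqs, 0.9))
```

On a real corpus, a small coverage rank relative to the vocabulary size indicates that rare words can be pruned from the model with little loss.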
Several visualizations were created:
The plots demonstrate that language usage follows predictable frequency patterns suitable for predictive modeling.
The prediction algorithm will be developed using n-gram language models.
The model will generate:
When users enter text, the algorithm will search for matching word sequences and predict the most probable next word based on observed frequencies.
When no exact match is found for the longest word sequence, a backoff strategy will fall back to progressively shorter sequences.
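The lookup-and-backoff idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the planned implementation: it uses raw counts with plain backoff (no discounting or smoothing), a maximum order of three words, and hypothetical names (`build_ngram_tables`, `predict_next`) with made-up training sentences.

```python
from collections import Counter, defaultdict


def build_ngram_tables(sentences, max_order=3):
    """Map each context tuple (0 to max_order - 1 preceding words)
    to a Counter of the words observed right after it."""
    tables = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for context_len in range(max_order):
                if i - context_len < 0:
                    break  # not enough preceding words for this context
                context = tuple(tokens[i - context_len:i])
                tables[context][word] += 1
    return tables


def predict_next(tables, text, max_order=3, k=3):
    """Return up to k candidate next words, trying the longest context
    first and backing off to shorter ones when nothing matches."""
    tokens = text.lower().split()
    longest = min(max_order - 1, len(tokens))
    for context_len in range(longest, -1, -1):
        context = tuple(tokens[len(tokens) - context_len:])
        candidates = tables.get(context)
        if candidates:
            return [word for word, _ in candidates.most_common(k)]
    return []


# Made-up training sentences, for illustration only.
sentences = [
    "thank you so much",
    "thank you for coming",
    "thank you so very much",
    "see you soon",
]
tables = build_ngram_tables(sentences)

print(predict_next(tables, "Thank you"))   # two-word context matches directly
print(predict_next(tables, "love you"))    # backs off to the one-word context
print(predict_next(tables, "zebra", k=1))  # backs off to overall word frequency
```

The empty context `()` holds plain word frequencies, so the function still returns a suggestion when none of the typed words have been seen before.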
A Shiny application will be developed to demonstrate the prediction algorithm.
Features of the application:
The application will provide suggested next words based on the trained language model.
This exploratory analysis summarized the SwiftKey dataset and identified the word frequency patterns that matter for prediction. The findings support the development of an n-gram-based predictive text model and an interactive Shiny application for next-word prediction.
Future work will focus on building, evaluating, and optimizing the prediction algorithm before deployment in the final application.