The goal of this project is to develop a predictive text model capable of suggesting the next word based on a sequence of words entered by a user. The final application will be implemented using R and Shiny and will demonstrate basic natural language processing techniques.
The dataset used for this project consists of text collected from three sources: blogs, news articles, and social media posts.
These datasets are provided in English and contain a large volume of real-world text suitable for language modeling.
The dataset contains text from:
| Source | Description |
|---|---|
| Blogs | Personal blog entries |
| News | News articles |
| Social media | Social media posts |
The combined dataset contains millions of words and thousands of lines of text.
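Line and word counts of this kind can be computed with base R alone. The sketch below is illustrative only: the three short character vectors are hypothetical stand-ins for the real files (which would normally be loaded with `readLines()`), and `count_words` is a helper name introduced here, not part of the project.

```r
# Hypothetical stand-ins for the three sources; in practice each vector
# would be read from its file with readLines().
corpus <- list(
  blogs  = c("The weather was lovely today", "I baked bread this morning"),
  news   = c("The council approved the new budget"),
  social = c("so excited for the weekend", "coffee first", "good morning all")
)

# Count whitespace-separated tokens in a character vector of lines.
count_words <- function(lines) {
  tokens <- unlist(strsplit(trimws(lines), "\\s+"))
  sum(nzchar(tokens))
}

# One row per source: number of lines and number of words.
summary_table <- data.frame(
  source = names(corpus),
  lines  = unname(vapply(corpus, length, integer(1))),
  words  = unname(vapply(corpus, count_words, integer(1))),
  stringsAsFactors = FALSE
)
print(summary_table)
```

Splitting on whitespace is a rough word count; a tokenizer that handles punctuation would give slightly different totals.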
Before analysis, the text data was cleaned by:
These preprocessing steps help standardize the text for analysis.
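As an illustration of what such standardization might look like, here is a minimal base R sketch. The specific steps (lowercasing, removing URLs, dropping digits and punctuation, collapsing whitespace) are assumptions chosen as typical examples and are not necessarily the steps applied in this project; `clean_text` is a hypothetical helper name.

```r
# Assumed cleaning steps, typical for English text; not necessarily the
# exact steps used in this project.
clean_text <- function(x) {
  x <- tolower(x)                      # normalize case
  x <- gsub("https?://\\S+", " ", x)   # remove URLs
  x <- gsub("[^a-z' ]", " ", x)        # keep only letters, apostrophes, spaces
  x <- gsub("\\s+", " ", x)            # collapse repeated whitespace
  trimws(x)                            # strip leading/trailing spaces
}

clean_text("Check THIS out: http://example.com  it's 100% free!!")
```

Note that the third step also discards non-ASCII letters, which is only acceptable because the corpus is in English.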
The most common words found in the corpus include:
These words appear frequently across all text sources.
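A frequency ranking like this can be produced by tokenizing the cleaned text and tabulating the tokens. The following is a minimal sketch on a toy input; the vector `cleaned` is a hypothetical example, not project data.

```r
# Toy, already-cleaned input standing in for the real corpus.
cleaned <- c("the cat sat on the mat", "the dog sat")

# Split each line into words and count how often each word occurs.
tokens    <- unlist(strsplit(cleaned, " ", fixed = TRUE))
word_freq <- sort(table(tokens), decreasing = TRUE)

# Show the three most frequent words.
head(word_freq, 3)

# A bar chart of the top words could then be drawn with, for example:
# barplot(head(word_freq, 10), las = 2)
```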
The figure below illustrates the most frequently occurring words in the dataset.
Bigrams represent pairs of consecutive words.
Examples include:
Trigrams represent sequences of three words.
Examples include:
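Both bigrams and trigrams can be extracted with the same sliding-window idea. The sketch below assumes cleaned, space-separated input; `make_ngrams` is a hypothetical helper and the two sample lines are invented for illustration.

```r
# Build all n-grams of size n from one line of cleaned text.
make_ngrams <- function(line, n) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))   # line too short for any n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Invented sample lines, not project data.
cleaned  <- c("thanks for the follow", "thanks for the help")
bigrams  <- unlist(lapply(cleaned, make_ngrams, n = 2))
trigrams <- unlist(lapply(cleaned, make_ngrams, n = 3))

# Frequency tables, most common first.
sort(table(bigrams), decreasing = TRUE)
sort(table(trigrams), decreasing = TRUE)
```

N-grams are built per line so that no n-gram spans the boundary between two unrelated lines.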
The next stage of the project will focus on:
The exploratory analysis provided valuable insight into the structure of the dataset. Common words, bigrams, and trigrams were identified and will be used to construct a predictive text model for the final SwiftKey application.
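To show how n-gram counts could feed a prediction, here is a deliberately simple sketch that uses bigrams only: it returns the word that most often followed the last typed word. This is an illustrative assumption, not the planned model; the training lines are invented, and `bigram_table` and `predict_next` are hypothetical names.

```r
# Invented training lines, not project data.
train <- c("thanks for the follow", "thanks for the help", "thanks for reading")

# One row per bigram: the first word and the word that followed it.
bigram_table <- do.call(rbind, lapply(strsplit(train, " ", fixed = TRUE), function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(first = head(w, -1), second = tail(w, -1), stringsAsFactors = FALSE)
}))

# Suggest the most frequent continuation of the last word in `phrase`.
# Returns NA when the last word was never seen as the start of a bigram.
predict_next <- function(phrase, bigram_table) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  last  <- tail(words, 1)
  candidates <- bigram_table$second[bigram_table$first == last]
  if (length(candidates) == 0) return(NA_character_)
  names(which.max(table(candidates)))
}

predict_next("Thanks for", bigram_table)
```

A fuller model would likely consult trigrams first and fall back to bigrams and single-word frequencies when no match is found.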