Introduction

The goal of this project is to develop a predictive text model capable of suggesting the next word based on a sequence of words entered by a user. The final application will be implemented using R and Shiny and will demonstrate basic natural language processing techniques.

The dataset used for this project consists of text collected from three sources: blogs, news articles, and Twitter.

These datasets are provided in English and contain a large volume of real-world text suitable for language modeling.

Data Summary

The dataset contains text from:

Source    Description
Blogs     Personal blog entries
News      News articles
Twitter   Social media posts

The combined dataset contains millions of words and thousands of lines of text.
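
Line and word counts like these can be computed directly from the raw text. The following is a minimal base-R sketch, assuming each source has already been read into a character vector (for example with readLines()). The small `corpus` list is hypothetical stand-in data, not the project's files, and `count_words` is an illustrative helper name.

```r
# Hypothetical stand-in data: in practice each element would come from readLines().
corpus <- list(
  blogs   = c("I went hiking last weekend", "The trail was muddy but fun"),
  news    = c("Officials announced a new budget on Tuesday"),
  twitter = c("so tired today", "coffee time", "see you at the game tonight")
)

# Count whitespace-separated words in each line of a character vector.
count_words <- function(lines) {
  lengths(strsplit(trimws(lines), "\\s+"))
}

# One row per source: number of lines and total number of words.
summary_table <- data.frame(
  source = names(corpus),
  lines  = vapply(corpus, length, integer(1)),
  words  = vapply(corpus, function(x) sum(count_words(x)), integer(1)),
  row.names = NULL
)

print(summary_table)
```

Splitting on whitespace is only a rough word count; a tokenizer that handles punctuation would give slightly different totals.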

Data Cleaning

Before analysis, the text data was cleaned by:

These preprocessing steps help standardize the text for analysis.
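
As an illustration of this kind of standardization, the sketch below applies a typical set of cleaning steps in base R. The steps shown (lowercasing, removing digits and punctuation, collapsing whitespace) are assumptions chosen for the example and may differ from the steps actually used in this project. `clean_text` is a hypothetical helper name.

```r
# Assumed cleaning steps (illustrative only): lowercase, drop digits,
# drop everything except a-z, apostrophes and spaces, collapse whitespace.
# Note: this also strips accented and other non-ASCII letters.
clean_text <- function(lines) {
  x <- tolower(lines)
  x <- gsub("[0-9]+", " ", x)     # remove numbers
  x <- gsub("[^a-z' ]", " ", x)   # remove punctuation and symbols
  x <- gsub("\\s+", " ", x)       # collapse runs of whitespace
  trimws(x)                       # strip leading/trailing spaces
}

cleaned <- clean_text(c("Hello, World!  It's 2024...", "R & Shiny: text-mining 101"))
print(cleaned)
# "hello world it's" "r shiny text mining"
```

Keeping apostrophes preserves contractions such as "it's" as single tokens, which tends to matter for next-word suggestions.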

Exploratory Analysis

The most common words found in the corpus include:

These words appear frequently across all text sources.
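
A word-frequency ranking of this kind can be produced with base R alone. The sketch below assumes the text has already been cleaned and uses a few hypothetical sample lines in place of the real corpus.

```r
# Hypothetical, already-cleaned sample lines standing in for the corpus.
lines <- c("the cat sat on the mat", "the dog and the cat", "a dog sat")

# Split every line on whitespace and pool the tokens.
tokens <- unlist(strsplit(lines, "\\s+"))

# Tabulate the tokens and sort from most to least frequent.
word_freq <- sort(table(tokens), decreasing = TRUE)

# Show the three most frequent words with their counts.
print(head(word_freq, 3))
```

On real text the top of this table is usually dominated by function words such as "the" and "and", which is expected for unfiltered English.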

Top Words

The figure below illustrates the most frequently occurring words in the dataset.

Bigram Analysis

Bigrams represent pairs of consecutive words.
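
To make the definition concrete, here is a small base-R sketch that forms bigrams by pairing each word with its successor. The `tokens` vector is a hypothetical example sentence, assumed to be already lowercased and split into words.

```r
# Hypothetical tokenized sentence (already lowercased and split into words).
tokens <- c("i", "want", "to", "go", "to", "the", "store")

# Pair each word with its successor: drop the last word for the left side
# and the first word for the right side, then paste the pairs together.
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

print(bigrams)
# "i want" "want to" "to go" "go to" "to the" "the store"
```

A sentence of k words yields k - 1 bigrams. In practice the pairing would be done line by line so that no bigram spans two unrelated lines.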

Examples include:

Trigram Analysis

Trigrams represent sequences of three words.
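
The same idea generalizes to any sequence length. The sketch below is one possible base-R implementation: it builds n-grams per line and counts them. `make_ngrams` is a hypothetical helper, and the sample lines are made up for illustration.

```r
# Build all n-grams from one tokenized line; returns character(0)
# when the line has fewer than n tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical cleaned lines; n-grams are built per line so that no
# trigram spans a line boundary.
lines <- c("thanks for the follow", "thanks for the help", "for the win")
token_list <- strsplit(lines, " ", fixed = TRUE)

trigrams <- unlist(lapply(token_list, make_ngrams, n = 3))
trigram_freq <- sort(table(trigrams), decreasing = TRUE)

print(trigram_freq)
```

Here "thanks for the" occurs twice and every other trigram once. Calling make_ngrams with n = 2 gives bigrams from the same code.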

Examples include:

Future Plans

The next stage of the project will focus on:

Conclusion

The exploratory analysis provided valuable insight into the structure of the dataset. Common words, bigrams, and trigrams were identified and will be used to construct a predictive text model for the final SwiftKey application.
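
As a rough illustration of how n-gram counts could drive next-word suggestions, the sketch below ranks the words that most often follow a given word. It is a deliberately simple bigram lookup under assumed sample data, not the model planned for the final application. `predict_next` and the training lines are hypothetical.

```r
# Hypothetical cleaned training lines.
lines <- c("i love you", "i love coffee", "i love you too", "you love me")

# Collect (previous word, next word) pairs line by line.
pairs <- do.call(rbind, lapply(strsplit(lines, " ", fixed = TRUE), function(tok) {
  if (length(tok) < 2) return(NULL)
  data.frame(prev = head(tok, -1), nxt = tail(tok, -1),
             stringsAsFactors = FALSE)
}))

# Suggest up to k words that most often follow `word` in the training lines.
predict_next <- function(word, k = 3) {
  followers <- pairs$nxt[pairs$prev == tolower(word)]
  if (length(followers) == 0) return(character(0))
  head(names(sort(table(followers), decreasing = TRUE)), k)
}

print(predict_next("love", k = 1))
# "you"
```

A word never seen in training returns an empty result here. A fuller model would typically fall back to shorter contexts or overall word frequencies in that case.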