SwiftKey Exploratory Data Analysis

Introduction

The goal of this project is to develop a predictive text model that suggests the next word given a sequence of words entered by a user. The final application will be built in R with Shiny and will demonstrate basic natural language processing techniques.

The dataset used for this project consists of text collected from three sources:

Blogs
News articles
Twitter posts

These datasets are provided in English and contain a large volume of real-world text suitable for language modeling.

Data Summary

The dataset contains text from:

Source	Description
Blogs	Personal blog entries
News	News articles
Twitter	Social media posts

The combined dataset contains millions of words and thousands of lines of text.
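
Size figures of this kind can be obtained by counting lines and whitespace-separated words for each source. The following is a minimal sketch in Python, used here only for illustration since the project itself targets R. The function name `summarize_lines` and the sample lines are assumptions, not the project's code or data.

```python
from typing import Iterable


def summarize_lines(lines: Iterable[str]) -> dict:
    """Count lines, words, and the longest line in an iterable of text lines."""
    n_lines = 0
    n_words = 0
    max_chars = 0
    for line in lines:
        text = line.rstrip("\n")
        n_lines += 1
        n_words += len(text.split())  # words = whitespace-separated tokens
        max_chars = max(max_chars, len(text))
    return {"lines": n_lines, "words": n_words, "max_chars": max_chars}


# Hypothetical sample standing in for one source file.
sample = ["Good morning everyone\n", "Thank you for the follow\n"]
summary = summarize_lines(sample)
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, which avoids loading a large source file into memory at once.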

Data Cleaning

Before analysis, the text data was cleaned by:

Converting text to lowercase
Removing punctuation
Removing numbers
Removing extra whitespace
Removing special characters

These preprocessing steps standardize the text so that surface variants of a word (for example, "The" and "the") are counted as the same token.
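
The cleaning steps listed above can be sketched as a single function. This is an illustrative Python version, not the project's R code. The function name `clean_text`, the handling of apostrophes, and the restriction to ASCII letters are assumptions made for the example.

```python
import re


def clean_text(text: str) -> str:
    """Apply the listed cleaning steps to one line of text."""
    text = text.lower()                    # convert to lowercase
    text = re.sub(r"['’]", "", text)       # assumption: "don't" becomes "dont"
    text = re.sub(r"[0-9]+", " ", text)    # remove numbers
    # Remove punctuation and special characters. Assumption: only ASCII
    # letters are kept, which also drops accented letters.
    text = re.sub(r"[^a-z\s]", " ", text)
    text = re.sub(r"\s+", " ", text)       # collapse extra whitespace
    return text.strip()


cleaned = clean_text("Hello, World!! It's 2014...  #NLP")
# cleaned == "hello world its nlp"
```

Replacing removed characters with a space rather than deleting them keeps words such as "well-known" from fusing into a single token. How to treat contractions is a judgment call that affects the resulting n-grams.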

Exploratory Analysis

The most common words in the corpus appear frequently across all three text sources.
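
Word frequencies can be tallied with a simple counter over the cleaned lines. The sketch below is in Python for illustration only. The function name `top_words` and the sample lines are hypothetical and are not drawn from the dataset.

```python
from collections import Counter


def top_words(lines, k=10):
    """Return the k most frequent tokens across already-cleaned lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts.most_common(k)


# Hypothetical cleaned lines, not taken from the dataset.
lines = ["the cat sat on the mat", "the dog sat"]
result = top_words(lines, k=2)
# result == [("the", 3), ("sat", 2)]
```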

Top Words

The figure below illustrates the most frequently occurring words in the dataset.

Bigram Analysis

Bigrams are pairs of consecutive words.

Examples include:

thank you
i love
good morning
how are
looking forward
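
Bigrams such as these can be extracted by sliding a window of two tokens over each line. The following Python sketch is illustrative only. The helper names `ngrams` and `count_ngrams` and the sample lines are assumptions. Passing `n=3` to the same helpers yields trigrams.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(lines, n):
    """Count n-grams line by line so that no n-gram spans two lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.split(), n))
    return counts


# Hypothetical cleaned lines, not taken from the dataset.
lines = ["thank you for coming", "thank you so much"]
bigram_counts = count_ngrams(lines, 2)
# bigram_counts[("thank", "you")] == 2
```

Counting within each line, rather than over the concatenated corpus, avoids creating spurious n-grams from the last word of one entry and the first word of the next.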

Trigram Analysis

Trigrams are sequences of three consecutive words.

Examples include:

thank you for
one of the
i want to
going to be
looking forward to

Future Plans

The next stage of the project will focus on:

Building n-gram language models
Creating a predictive text algorithm
Improving prediction accuracy
Deploying the model using Shiny
Publishing the application online
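
As a rough idea of how n-gram counts could drive prediction, the sketch below looks up the most frequent continuation of the last two words and backs off to the last single word when the longer context is unseen. It is written in Python for illustration, applies no smoothing, and is not necessarily the approach the final R and Shiny application will take. The names `build_model` and `predict_next` and the training lines are hypothetical.

```python
from collections import Counter, defaultdict


def build_model(lines, max_n=3):
    """Map each context of 1..max_n-1 words to a Counter of next words."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    return model


def predict_next(model, text, max_n=3):
    """Suggest the next word, backing off from longer to shorter contexts."""
    tokens = text.split()
    for size in range(max_n - 1, 0, -1):
        if len(tokens) < size:
            continue
        context = tuple(tokens[-size:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None  # no known context, so no suggestion


# Hypothetical training lines, not taken from the dataset.
model = build_model([
    "thank you for coming",
    "thank you for the gift",
    "looking forward to it",
])
suggestion = predict_next(model, "thank you")
# suggestion == "for"
```

The input passed to `predict_next` is assumed to have been cleaned the same way as the training text, so that user input and stored contexts match.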

Conclusion

The exploratory analysis provided valuable insight into the structure of the dataset. Common words, bigrams, and trigrams were identified and will be used to construct a predictive text model for the final SwiftKey application.