Exploratory Analysis of the SwiftKey Text Prediction Dataset

Introduction

This report presents an exploratory analysis of the SwiftKey text dataset used in the Data Science Capstone Project. The objective is to understand the structure and characteristics of the data and identify patterns that can be used to build a predictive text model and Shiny application.

The dataset consists of three English-language text sources: Blogs, News, and Twitter. Exploratory analysis was conducted to summarize the data and identify common word usage patterns.

Data Summary

The dataset contains text collected from three different sources:

  • Blogs
  • News
  • Twitter

Basic summary statistics (line counts, word counts, and average line length) indicate that the datasets contain millions of lines and words. Twitter entries are the shortest, blog entries are the longest, and news articles use more formal language structures.
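The summary statistics mentioned above can be computed with a few lines of code. The following is a minimal sketch in Python, chosen only for illustration since the report does not show its own code. The function name summarize_lines and the sample lines are hypothetical stand-ins for the real files.

```python
from statistics import mean


def summarize_lines(lines):
    """Return line count, word count, and mean words per line for one source."""
    word_counts = [len(line.split()) for line in lines]
    return {
        "lines": len(lines),
        "words": sum(word_counts),
        "mean_words_per_line": mean(word_counts) if word_counts else 0.0,
    }


# Hypothetical sample lines; the real analysis would read each source file.
samples = {
    "blogs": ["the weather was lovely so we walked along the river for hours"],
    "twitter": ["so tired lol", "cant wait for friday"],
}
for source, lines in samples.items():
    print(source, summarize_lines(lines))
```

Comparing mean_words_per_line across sources is one simple way to confirm that Twitter entries are shorter than blog entries.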

Data Cleaning

The following preprocessing steps were applied:

  • Conversion to lowercase
  • Removal of punctuation
  • Removal of numbers
  • Removal of extra white spaces
  • Removal of special characters

These steps help standardize the text and improve the quality of subsequent analysis.
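The preprocessing steps listed above can be sketched as a single function. This is an illustrative Python version, not the report's actual implementation. The name clean_text is hypothetical, and the choice to treat anything other than the letters a to z and white space as a "special character" is an assumption.

```python
import re


def clean_text(text):
    """Apply the listed preprocessing steps to one line of text."""
    text = text.lower()                       # conversion to lowercase
    text = re.sub(r"[0-9]+", " ", text)       # removal of numbers
    # Assumption: punctuation and special characters are everything
    # that is not a lowercase ASCII letter or white space.
    text = re.sub(r"[^a-z\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()  # removal of extra white space
    return text


print(clean_text("Hello, World!! 123  #NLP"))
```

Note that replacing punctuation with a space splits contractions (for example, "don't" becomes "don t"). Whether that is acceptable depends on how the prediction model should treat such words.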

Exploratory Data Analysis

Exploratory analysis revealed several important characteristics of the dataset.

Word frequency analysis showed that a small number of words occur very frequently, while most words appear only a few times. This long-tailed pattern is typical of natural-language corpora and is commonly described by Zipf's law.

Because a small vocabulary accounts for most of the text, frequency-based prediction methods are a reasonable starting point for modeling language usage.
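One way to quantify this skew is to count how many of the most frequent words are needed to cover a given share of all word occurrences. The sketch below is illustrative Python. The function names and the tiny sample corpus are hypothetical, and the input is assumed to be already cleaned and space-separated.

```python
from collections import Counter


def word_frequencies(lines):
    """Count how often each word occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def words_needed_for_coverage(counts, coverage):
    """Smallest number of top-ranked words whose occurrences reach
    the given fraction (0 to 1) of all word occurrences."""
    total = sum(counts.values())
    running = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= coverage * total:
            return rank
    return len(counts)


# Hypothetical toy corpus.
counts = word_frequencies(["the cat sat on the mat", "the dog sat"])
print(words_needed_for_coverage(counts, 0.5))
```

On a real corpus, a small result relative to the vocabulary size would confirm that a few words dominate the text.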

Key Observations

  • Blogs contain longer sentences and descriptive language.
  • Twitter data contains abbreviations, hashtags, and informal expressions.
  • News articles contain structured and formal language.
  • Common words dominate the corpus across all sources.
  • Frequently occurring word combinations can be used for next-word prediction.

Visualizations

Several visualizations were created:

  1. Histogram of word frequencies
  2. Bar chart of the most common words
  3. Bigram frequency plot
  4. Trigram frequency plot

The plots show that a small set of words and word sequences accounts for a large share of usage, a regularity that a frequency-based predictive model can exploit.
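The data behind the bigram and trigram plots is a table of n-gram counts. The following illustrative Python sketch shows how such a table might be produced. The function names and sample lines are hypothetical, n-grams are counted within each line only, and the plotting step itself is omitted.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(lines, n, k):
    """Return the k most frequent n-grams with their counts."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.split(), n))
    return counts.most_common(k)


# Hypothetical cleaned sample lines.
lines = ["i love new york", "i love new ideas"]
for gram, count in top_ngrams(lines, 3, 3):
    print(" ".join(gram), count)
```

The resulting (n-gram, count) pairs can be passed to any plotting library to draw the frequency bar charts.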

Prediction Algorithm Plan

The prediction algorithm will be developed using n-gram language models.

Frequency tables will be built for:

  • Unigrams
  • Bigrams
  • Trigrams
  • Four-grams

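When users enter text, the algorithm will match the last words typed (up to three, the context of a four-gram) against the stored word sequences and predict the most probable next word based on observed frequencies.

A backoff strategy will be applied when an exact match is unavailable: the earliest word of the context is dropped and the search is repeated with the next shorter n-gram, down to unigrams.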
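The lookup-and-backoff procedure described above might be sketched as follows. This is illustrative Python under stated assumptions, not the final algorithm: it uses raw counts with no smoothing or discounting, the names build_model and predict_next are hypothetical, and the training lines are assumed to be already cleaned.

```python
from collections import Counter, defaultdict


def build_model(lines, max_n=4):
    """Map each context (0 to max_n - 1 preceding words) to a Counter
    of the words observed right after that context."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for i, word in enumerate(tokens):
            for k in range(max_n):  # k = number of context words
                if i - k < 0:
                    break
                model[tuple(tokens[i - k:i])][word] += 1
    return model


def predict_next(model, text, max_n=4):
    """Try the longest available context first, then back off to
    shorter contexts; return None if nothing matches."""
    tokens = text.lower().split()
    for k in range(max_n - 1, -1, -1):
        if k > len(tokens):
            continue
        context = tuple(tokens[len(tokens) - k:])
        candidates = model.get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None


# Hypothetical training lines.
model = build_model(["i love new york", "i love new ideas", "we love new york"])
print(predict_next(model, "they love new"))
```

In this example the three-word context "they love new" was never observed, so the lookup backs off to "love new" and returns its most frequent continuation. A production model would likely weight or discount the shorter contexts rather than using raw counts.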

Shiny Application Plan

A Shiny application will be developed to demonstrate the prediction algorithm.

Features of the application:

  • User text input box
  • Real-time next word prediction
  • Simple and user-friendly interface
  • Fast prediction response

The application will provide suggested next words based on the trained language model.

Conclusion

This exploratory analysis identified the main language patterns in the SwiftKey dataset: a highly skewed word frequency distribution and clear differences in length and formality among the three sources. The findings support the development of an n-gram based predictive text model and an interactive Shiny application for next-word prediction.

Future work will focus on building, evaluating, and optimizing the prediction algorithm before deployment in the final application.