Introduction

This report summarizes the initial exploration of a large text dataset that will be used to build a predictive text application, similar to the auto-complete feature found on many smartphones. The goal of this application is to predict the next word a user is likely to type, based on the preceding words they have entered.

Data Source

The data consists of text collected from three different sources: blogs, news articles, and Twitter.

These three sources represent a diverse range of writing styles and topics, which is important for building a robust and versatile prediction model.

Data Summary

A preliminary analysis of the data reveals the following key statistics:

File       Lines      Words       Unique_Words
Blogs      899288     37334131    1103548
News       77259      2643969     197858
Twitter    2360148    30373583    1290173
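The three statistics in the table can be computed in a single pass over each file. The following is a minimal Python sketch, not the code used for this report: the function name `summarize` is hypothetical, and it assumes lowercased, whitespace-separated tokens, so its counts would not necessarily match the figures above.

```python
def summarize(lines):
    """Return (line count, word count, unique word count) for an iterable of lines.

    Assumption: words are lowercased, whitespace-separated tokens. The report
    does not state its tokenization rules, so real counts may differ.
    """
    n_lines = 0
    n_words = 0
    vocabulary = set()
    for line in lines:
        tokens = line.lower().split()
        n_lines += 1
        n_words += len(tokens)
        vocabulary.update(tokens)
    return n_lines, n_words, len(vocabulary)


sample = ["The cat sat on the mat", "the dog barked"]
print(summarize(sample))  # (2, 9, 7)
```

Passing an open file handle instead of a list would process a large file line by line without loading it into memory all at once.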

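Observations: Twitter contributes the most lines (2,360,148) but the shortest entries, averaging about 13 words per line, compared with roughly 42 for blogs and 34 for news. The news file is by far the smallest of the three, with 77,259 lines and about 2.6 million words.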

Word Frequency Analysis

An analysis of word frequencies reveals that a small number of words account for a large proportion of the text. This is a common characteristic of natural language.

Histogram of Top 20 Most Frequent Words

The histogram above shows the top 20 most frequent words in the combined dataset. These are mostly common function words such as “the,” “to,” “and,” and “a.”
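A ranking like this one comes from a plain frequency count. The sketch below uses Python for illustration only; `top_words` is a hypothetical helper, and the lowercased whitespace tokenization is an assumption rather than the report's actual preprocessing.

```python
from collections import Counter


def top_words(lines, k=20):
    """Return the k most frequent words as (word, count) pairs.

    Assumption: lowercased, whitespace-separated tokens.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts.most_common(k)


print(top_words(["a b a", "b a c"], k=2))  # [('a', 3), ('b', 2)]
```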

Cumulative Word Coverage

The plot above illustrates how many unique words are needed to cover a certain percentage of all word occurrences in the text.

## Words needed to cover 50% of instances: 804
## Words needed to cover 90% of instances: 15472

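Key Findings: only 804 distinct words are needed to cover 50% of all word occurrences, and 15,472 distinct words are needed to cover 90%.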

This finding suggests that we can potentially build a smaller and more efficient prediction model by focusing on the most frequent words.
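The coverage figures come from sorting words by frequency and accumulating counts until the target share is reached. A minimal Python sketch of that calculation follows; `words_needed` is a hypothetical name, and the toy counts are for illustration only, not taken from the dataset.

```python
from collections import Counter


def words_needed(counts, coverage):
    """Smallest number of most-frequent words whose counts add up to at
    least `coverage` (a fraction between 0 and 1) of all occurrences."""
    total = sum(counts.values())
    running = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= coverage * total:
            return rank
    return len(counts)


# Toy example: the=3, and=2, cat=1, dog=1, bird=1 (8 occurrences in total).
toy = Counter("the cat and the dog and the bird".split())
print(words_needed(toy, 0.5))  # 2
print(words_needed(toy, 0.9))  # 5
```

Applied to a full word-frequency table, the same loop would produce figures like the 50% and 90% values reported above.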

N-gram Analysis (Word Combinations)

In addition to individual words, we also analyzed common combinations of 2 words (bigrams) and 3 words (trigrams).

Top 5 Most Frequent Bigrams

feature            frequency
right_now          21735
last_night         14534
’_s                11362
feel_like          11293
looking_forward    10724

The table above shows the 5 most frequent bigrams and their counts.

Top 5 Most Frequent Trigrams

feature               frequency
let_us_know           2444
happy_new_year        1882
happy_mothers_day     1734
happy_mother’s_day    1638
new_york_city         1257

The table above shows the 5 most frequent trigrams and their counts.

These common word combinations are essential for building a model that can accurately predict the next word in a sequence.
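Counting n-grams amounts to sliding a window of n words over each line and tallying what it sees. The Python sketch below is illustrative only: `ngrams` and `count_ngrams` are hypothetical helpers, tokens are joined with an underscore to mirror the labels in the tables, and the whitespace tokenization (with no stopword or punctuation handling) is an assumption.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(lines, n):
    """Count n-grams per line, so no n-gram spans two lines."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()
        counts.update("_".join(gram) for gram in ngrams(tokens, n))
    return counts


sample = ["happy new year", "happy new day"]
print(count_ngrams(sample, 2)["happy_new"])       # 2
print(count_ngrams(sample, 3)["happy_new_year"])  # 1
```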

Plan for Prediction Algorithm and Shiny App

Algorithm

The core of the prediction algorithm will be an n-gram model, which estimates the probability of a word given the previous n-1 words (the context) from counts observed in the training data. We will combine quadgrams (4 words), trigrams (3 words), bigrams (2 words), and unigrams (single words) to make predictions.

A technique called backoff will be used to handle cases where a particular word combination is not found in the training data. In such cases, the model will “back off” to a lower-order n-gram (e.g., from a trigram to a bigram) to make a prediction.
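The backoff idea can be sketched in a few lines. The Python example below is a simplified illustration under stated assumptions, not the planned implementation: `train` and `predict` are hypothetical names, tokenization is by whitespace, and candidates are ranked by raw count within the longest matching context, with no discounting or smoothing.

```python
from collections import Counter, defaultdict


def train(lines, max_n=4):
    """Map each context (a tuple of 0 to max_n-1 words) to next-word counts."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        for i, word in enumerate(tokens):
            for size in range(max_n):  # size = number of context words
                if i - size < 0:
                    break
                model[tuple(tokens[i - size:i])][word] += 1
    return model


def predict(model, text, k=3, max_n=4):
    """Try the longest available context first, then back off to shorter ones.
    The empty context () holds unigram counts and is the last resort."""
    tokens = text.lower().split()
    for size in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - size:])
        if context in model:
            return [word for word, _ in model[context].most_common(k)]
    return []


corpus = [
    "i feel like pizza",
    "i feel like dancing",
    "i feel like pizza",
    "looking forward to it",
]
model = train(corpus)
print(predict(model, "i feel like"))     # ['pizza', 'dancing']
print(predict(model, "they feel like"))  # ['pizza', 'dancing'] via the bigram context
```

In the second call the three-word context was never seen in training, so the lookup falls back to the two-word context “feel like”. A production model would typically also weight or discount the lower-order estimates.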

Shiny App

The prediction algorithm will be deployed as a user-friendly web application using the Shiny framework. The app will have a simple interface:

  1. A text input box where the user can type.
  2. A display area that shows the top 3 predicted words in real time.

Efficiency Considerations

To ensure a responsive user experience, the model will be optimized for both size and speed:

  • Reduced Vocabulary: The model will focus on the most frequent words to minimize memory usage.
  • Efficient Data Structures: We will use optimized data structures (like data.table) for storing and retrieving n-gram probabilities.
  • Pre-computation: As much as possible, calculations will be pre-computed and stored to reduce prediction time.
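One way to combine these three ideas is to prune rare n-grams and store only the top few predictions per context, so that a prediction at run time is a single lookup. The Python sketch below uses a plain dictionary as a stand-in for a keyed data.table; `build_lookup`, the `min_count` threshold, and the toy counts are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_lookup(ngram_counts, top_k=3, min_count=2):
    """Precompute the top_k next words for every context.

    ngram_counts maps word tuples to counts; the last word of each tuple is
    the prediction target. N-grams seen fewer than min_count times are
    dropped to keep the table small (the threshold is an assumed value).
    """
    by_context = defaultdict(Counter)
    for gram, count in ngram_counts.items():
        if count >= min_count:
            by_context[gram[:-1]][gram[-1]] = count
    return {
        context: tuple(word for word, _ in followers.most_common(top_k))
        for context, followers in by_context.items()
    }


toy_counts = {
    ("feel", "like"): 5,
    ("feel", "good"): 3,
    ("feel", "odd"): 1,   # below min_count, so it is pruned
    ("right", "now"): 4,
}
lookup = build_lookup(toy_counts)
print(lookup[("feel",)])   # ('like', 'good')
print(lookup[("right",)])  # ('now',)
```

Because ranking happens once at build time, the app only needs to store and query the finished table.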

Conclusion

This exploratory analysis has provided valuable insights into the structure and characteristics of the text data. These insights will guide the development of an efficient and accurate predictive text application. The next steps involve refining the n-gram model, implementing smoothing techniques to improve prediction accuracy, and building the Shiny app for deployment.