This report summarizes the initial exploration of a large text dataset that will be used to build a predictive text application, similar to the auto-complete feature found on many smartphones. The goal of this application is to predict the next word a user is likely to type, based on the preceding words they have entered.
The data consists of text collected from three different sources.
These three sources represent a diverse range of writing styles and topics, which is important for building a robust and versatile prediction model.
A preliminary analysis of the data reveals the following key statistics:
| File | Lines | Words | Unique Words |
|---|---|---|---|
| Blogs | 899288 | 37334131 | 1103548 |
| News | 77259 | 2643969 | 197858 |
| | 2360148 | 30373583 | 1290173 |
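The three columns above can be computed in a single pass over each file. The following is a minimal sketch, not the code used for this report: the function name `corpus_stats` and the tokenization rule (lowercase, runs of letters and apostrophes) are assumptions, and a different tokenizer will give different word and unique-word counts.

```python
import re

def corpus_stats(lines):
    """Return (line count, word count, unique word count) for an iterable of text lines."""
    n_lines = 0
    n_words = 0
    vocab = set()
    for line in lines:
        n_lines += 1
        # Assumed tokenization: lowercase, keep runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", line.lower())
        n_words += len(tokens)
        vocab.update(tokens)
    return n_lines, n_words, len(vocab)

# Toy input; for a real file, pass an open file handle instead of a list.
sample = ["The cat sat on the mat.", "The dog barked!"]
print(corpus_stats(sample))
```

Because the function only iterates over its input, it can process a large file line by line without loading the whole file into memory.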
Observations:
An analysis of word frequencies reveals that a small number of words account for a large proportion of the text. This is a common characteristic of natural language.
Histogram of Top 20 Most Frequent Words
The histogram above shows the top 20 most frequent words in the combined dataset. These are mostly common function words such as “the,” “to,” “and,” and “a.”
Cumulative Word Coverage
The plot above illustrates how many unique words are needed to cover a certain percentage of all word occurrences in the text.
Key Findings: 804 unique words are needed to cover 50% of all word instances, and 15472 unique words are needed to cover 90%.
This finding suggests that we can potentially build a smaller and more efficient prediction model by focusing on the most frequent words.
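The coverage figures can be reproduced by sorting words by frequency and accumulating their counts until the target share is reached. The sketch below illustrates that idea on a toy token list; the function name `words_for_coverage` is hypothetical and the whitespace tokenization is an assumption.

```python
from collections import Counter

def words_for_coverage(tokens, target):
    """Smallest number of distinct words whose counts sum to at least
    `target` (a fraction between 0 and 1) of all token occurrences."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    # Walk the vocabulary from most to least frequent word.
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= target * total:
            return rank
    return len(counts)

tokens = "the cat and the dog and the bird saw a fish".split()
print(words_for_coverage(tokens, 0.5))
print(words_for_coverage(tokens, 0.9))
```

Keeping only the words up to the chosen rank is one way to cap the vocabulary size of the model.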
In addition to individual words, we also analyzed common combinations of 2 words (bigrams) and 3 words (trigrams).
Top 5 Most Frequent Bigrams

| Bigram | Frequency |
|---|---|
| right_now | 21735 |
| last_night | 14534 |
| ’_s | 11362 |
| feel_like | 11293 |
| looking_forward | 10724 |

The table above shows the 5 most frequent bigrams and their counts.

Top 5 Most Frequent Trigrams

| Trigram | Frequency |
|---|---|
| let_us_know | 2444 |
| happy_new_year | 1882 |
| happy_mothers_day | 1734 |
| happy_mother’s_day | 1638 |
| new_york_city | 1257 |

The table above shows the 5 most frequent trigrams and their counts.
These common word combinations are essential for building a model that can accurately predict the next word in a sequence.
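Counting n-grams amounts to sliding a window of n tokens over each line of text. The sketch below joins tokens with underscores to match the feature names in the tables; the function name `ngram_counts` and the whitespace tokenization are assumptions, not the pipeline used for this report.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams, written as underscore-joined tokens, across sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Slide a window of n tokens; n-grams never span two sentences.
        for i in range(len(tokens) - n + 1):
            counts["_".join(tokens[i:i + n])] += 1
    return counts

sentences = ["let us know right now", "right now let us know"]
bigrams = ngram_counts(sentences, 2)
trigrams = ngram_counts(sentences, 3)
print(bigrams["right_now"])
print(trigrams.most_common(1))
```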
The core of the prediction algorithm will be an n-gram model. This model calculates the probability of a word appearing, given the previous n-1 words (the context). We will use a combination of quadgrams (4 words), trigrams (3 words), bigrams (2 words), and unigrams (single words) to make predictions.
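One common way to estimate this probability is the maximum-likelihood estimate, which divides the count of the full n-gram by the count of its context. The report does not state which estimator will be used, so the formula below is an assumption; $C(\cdot)$ denotes the number of times a word sequence occurs in the training data.

$$
P(w_n \mid w_1, \dots, w_{n-1}) \approx \frac{C(w_1, \dots, w_{n-1}, w_n)}{C(w_1, \dots, w_{n-1})}
$$

For a quadgram model n = 4, so the context is the previous three words.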
A technique called backoff will be used to handle cases where a particular word combination is not found in the training data. In such cases, the model will “back off” to a lower-order n-gram (e.g., from a trigram to a bigram) to make a prediction.
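The backoff idea can be sketched as follows. This is plain count-based backoff with no discounting or smoothing, so it is a simplification rather than the final algorithm; the names `build_model` and `predict_next` are hypothetical, and whitespace tokenization is assumed.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=4):
    """Map each context (a tuple of 0 to max_n-1 words) to a Counter of next words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for k in range(max_n):  # k is the context length
                if i - k < 0:
                    break
                model[tuple(tokens[i - k:i])][word] += 1
    return model

def predict_next(model, text, max_n=4):
    """Try the longest available context first, then back off to shorter ones."""
    tokens = text.lower().split()
    for k in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - k:])
        if context in model:
            # Most frequent word seen after this context.
            return model[context].most_common(1)[0][0]
    return None  # only reached when the model is empty

training = [
    "happy new year to you",
    "happy new year everyone",
    "happy new car",
    "a brand new day",
]
model = build_model(training)
print(predict_next(model, "happy new"))  # context "happy new" was seen
print(predict_next(model, "my new"))     # backs off to the context "new"
```

The empty context holds the unigram counts, so the predictor always returns some word once the model has been trained on any text.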
The prediction algorithm will be deployed as a user-friendly web application with a simple interface, built using the Shiny framework.
To ensure a responsive user experience, the model will be optimized for both size and speed, using an efficient data structure (data.table) for storing and retrieving n-gram probabilities.

This exploratory analysis has provided valuable insights into the structure and characteristics of the text data. These insights will guide the development of an efficient and accurate predictive text application. The next steps involve refining the n-gram model, implementing smoothing techniques to improve prediction accuracy, and building the Shiny app for deployment.