Introduction

The goal of this project is to demonstrate familiarity with large text datasets and to perform an initial exploratory analysis in preparation for building a word prediction algorithm and a Shiny application.

The analysis focuses on three English-language datasets: Blogs, News, and Twitter.


Data Loading

The three datasets were downloaded and loaded successfully.
To avoid memory issues, the analysis relies on summary statistics and samples rather than on the full text. The file sizes in bytes (blogs, news, Twitter) are:

## [1] 210160014 205811889 167105338
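
One way to work from samples without holding a whole file in memory is to read it in chunks and keep each line with a fixed probability. The sketch below is an illustration only, not the code used for this report: the function name `sample_lines`, its parameters, and the temporary demo file are all assumptions.

```r
# Hypothetical helper: read a text file chunk by chunk and keep a random
# subset of lines, so the full file is never loaded at once.
sample_lines <- function(path, rate = 0.01, chunk_size = 10000, seed = 42) {
  set.seed(seed)                       # note: this resets the global RNG state
  con <- file(path, open = "r", encoding = "UTF-8")
  on.exit(close(con))
  kept <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE, skipNul = TRUE)
    if (length(chunk) == 0) break      # end of file
    keep <- rbinom(length(chunk), size = 1, prob = rate) == 1
    kept <- c(kept, chunk[keep])
  }
  kept
}

# Demo on a small temporary file (a stand-in for the real datasets)
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:1000), tmp)
demo_sample <- sample_lines(tmp, rate = 0.1, chunk_size = 100)
length(demo_sample)  # about a tenth of the lines
```

The sampling rate trades memory for coverage; a fixed seed keeps the sample reproducible between runs.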

Summary Statistics

Table 1 summarizes the size and number of lines of each dataset. Twitter contains the largest number of lines but is the smallest file, so its entries are the shortest: roughly 73 bytes per line on average, compared with about 234 for blogs and 204 for news.

## [1] 200.42 196.28 159.36
##   Dataset   Lines Size_MB
## 1   Blogs  898436  200.42
## 2    News 1010172  196.28
## 3 Twitter 2304374  159.36
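
A table of this shape can be produced with base R alone. The following is a minimal sketch under assumptions: `count_lines` and `summarize_file` are hypothetical names, the demo uses temporary files rather than the real datasets, and the size is converted with 1024^2 bytes per MB, which is consistent with the figures shown above.

```r
# Hypothetical helper: count lines in chunks to keep memory use low.
count_lines <- function(path, chunk_size = 100000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0
  repeat {
    got <- length(readLines(con, n = chunk_size, warn = FALSE, skipNul = TRUE))
    if (got == 0) break
    n <- n + got
  }
  n
}

# Hypothetical helper: one summary row per file.
summarize_file <- function(path, label) {
  data.frame(
    Dataset = label,
    Lines   = count_lines(path),
    Size_MB = round(file.size(path) / 1024^2, 2),
    stringsAsFactors = FALSE
  )
}

# Demo on two small temporary files
tmp_a <- tempfile(); writeLines(c("a", "b", "c"), tmp_a)
tmp_b <- tempfile(); writeLines("hello world", tmp_b)
demo_summary <- do.call(rbind, Map(summarize_file, c(tmp_a, tmp_b), c("A", "B")))
rownames(demo_summary) <- NULL
demo_summary
```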

Line Length Distributions

Key observations from the exploratory analysis include:

Twitter data consists of very short text entries.

Blog data contains extremely long lines.

There is high variability in text length across datasets.

The line-length distribution shows that tweets typically contain only a small number of words, which supports the use of short-context prediction models such as bigrams and trigrams.
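
Line lengths of this kind can be measured as words per line. The sketch below assumes whitespace-separated words and uses a hypothetical helper name, `words_per_line`; the demo lines are made up for illustration.

```r
# Hypothetical helper: number of whitespace-separated words in each line.
# trimws() avoids counting an empty leading token; empty lines count as 0.
words_per_line <- function(lines) {
  lengths(strsplit(trimws(lines), "\\s+"))
}

demo_lines <- c(
  "short tweet",
  "a much longer blog style line with many words",
  ""
)
demo_counts <- words_per_line(demo_lines)
demo_counts           # 2 9 0
summary(demo_counts)  # min, quartiles, mean, max in one call
```

Applied to a sample from each dataset, `summary()` or `hist()` on these counts gives a quick view of how the three sources differ.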

Word Frequency

## tokens
##  the   to    i    a  you  and  for   in   of   is 
## 1996 1675 1535 1300 1036  938  849  829  754  745

The most common words are short function words (articles, prepositions, pronouns, and conjunctions) rather than content words.
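
A frequency table like the one above can be obtained by splitting lowercased text into tokens and tabulating them. This is a minimal sketch, not the report's actual code: `top_words` is a hypothetical name, and splitting on anything other than a-z and apostrophes is a simplifying assumption that drops non-ASCII letters.

```r
# Hypothetical helper: the n most frequent tokens in a character vector.
top_words <- function(lines, n = 10) {
  tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]   # drop empty strings left by the split
  head(sort(table(tokens), decreasing = TRUE), n)
}

demo_top <- top_words(c("The cat and the dog", "the end"), n = 3)
demo_top  # "the" comes first with a count of 3
```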

Plan for the Prediction Algorithm

The prediction model will be based on n-gram language models, starting with simple bigrams and trigrams. The main objective is to balance prediction accuracy with computational efficiency to ensure fast responses.
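
To make the bigram idea concrete, the sketch below pairs each token with the one that follows it and predicts the most frequent followers of a given word. It is an illustration under assumptions: the function names and the toy token vector are hypothetical, and no smoothing or back-off to other n-gram orders is shown.

```r
# Hypothetical helper: one row per adjacent word pair.
# Pairs are formed across the whole vector, ignoring line boundaries.
build_bigrams <- function(tokens) {
  data.frame(
    first  = head(tokens, -1),
    second = tail(tokens, -1),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: the n most frequent words seen after `word`.
predict_next <- function(bigrams, word, n = 3) {
  followers <- bigrams$second[bigrams$first == word]
  if (length(followers) == 0) return(character(0))  # unseen word
  names(head(sort(table(followers), decreasing = TRUE), n))
}

demo_tokens <- c("i", "love", "you", "and", "i", "love", "cake",
                 "and", "i", "am", "here")
demo_bigrams <- build_bigrams(demo_tokens)
predict_next(demo_bigrams, "i")      # "love" "am"
predict_next(demo_bigrams, "zebra")  # character(0)
```

Scanning a data frame on every query is slow for large corpora; precomputing a table of the top followers for each word would better serve the stated goal of fast responses.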

Text preprocessing steps will include normalization, tokenization, and removal of noise such as punctuation and numbers.
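
These steps might look as follows in base R. This is a sketch only: `clean_text` and `tokenize` are hypothetical names, and replacing punctuation with spaces is one possible choice that also splits contractions such as "it's" into two tokens.

```r
# Hypothetical helper: normalize case, strip digits and punctuation,
# and collapse runs of whitespace.
clean_text <- function(lines) {
  x <- tolower(lines)
  x <- gsub("[[:digit:]]+", " ", x)
  x <- gsub("[[:punct:]]+", " ", x)   # also splits contractions like "it's"
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Hypothetical helper: cleaned text split into word tokens.
tokenize <- function(lines) {
  tokens <- unlist(strsplit(clean_text(lines), " ", fixed = TRUE))
  tokens[nzchar(tokens)]
}

demo_clean_tokens <- tokenize("Hello, World! 2 cats.")
demo_clean_tokens  # "hello" "world" "cats"
```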


Plan for the Shiny Application

A Shiny application will be developed to allow users to input text and receive predictions for the next word. The application will prioritize simplicity, responsiveness, and low memory usage.
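
A minimal layout for such an application is sketched below, assuming the shiny package is installed. Everything here is a placeholder for illustration: the toy `lookup` table stands in for a real n-gram model, and `predict_word` and `build_app` are hypothetical names.

```r
# Hypothetical toy model: maps the last typed word to a suggested next word.
lookup <- c(i = "am", thank = "you", of = "the")

predict_word <- function(text) {
  words <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  if (length(words) == 0) return("")          # nothing typed yet
  last <- words[length(words)]
  if (last %in% names(lookup)) lookup[[last]] else ""
}

# Builds the app object; shiny is only needed when this function is called.
build_app <- function() {
  ui <- shiny::fluidPage(
    shiny::titlePanel("Next Word Prediction"),
    shiny::textInput("phrase", "Type a phrase:", value = ""),
    shiny::textOutput("prediction")
  )
  server <- function(input, output, session) {
    output$prediction <- shiny::renderText(predict_word(input$phrase))
  }
  shiny::shinyApp(ui = ui, server = server)
}

# To launch interactively: shiny::runApp(build_app())
predict_word("Thank")  # "you"
```

Keeping the prediction logic in a plain function, separate from the UI, makes it possible to test the model without starting the application.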