The goal of this project is to demonstrate familiarity with large text datasets and to perform an initial exploratory analysis in preparation for building a word prediction algorithm and a Shiny application.
The analysis focuses on three English-language datasets:
- Blogs
- News
- Twitter
The datasets were downloaded and loaded successfully.
Only summary statistics and samples are used to avoid memory issues.
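One way to draw a fixed-size sample without loading a whole file into memory is reservoir sampling. The sketch below is in Python and is independent of the tooling used for this report; the function name `sample_lines`, the sample size, and the seed are illustrative assumptions, not the method actually used here.

```python
import random
from typing import Iterable, List


def sample_lines(lines: Iterable[str], k: int, seed: int = 42) -> List[str]:
    """Reservoir sampling (Algorithm R): keep k lines chosen uniformly at random.

    Only k lines are held in memory at any time, so the input can be a
    file object that is read line by line.
    """
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir


# Demo on an in-memory generator; a real run would pass an open file instead.
demo = sample_lines((f"line {i}" for i in range(1000)), k=5)
print(len(demo))
```

For a file on disk, the same function can be given the file object directly, for example `sample_lines(open(path, encoding="utf-8", errors="ignore"), 10000)`, where `path` is a placeholder for one of the three dataset files.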
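File sizes in bytes and in MB (1 MB = 1024^2 bytes), in the order Blogs, News, Twitter:
## [1] 210160014 205811889 167105338
## [1] 200.42 196.28 159.36
Table 1 summarizes the number of lines and the file size of each dataset. Twitter contains the largest number of lines, while blogs and news contain longer text entries: dividing bytes by lines gives roughly 234 bytes per line for blogs and 204 for news, against about 73 for Twitter.
##   Dataset   Lines Size_MB
## 1   Blogs  898436  200.42
## 2    News 1010172  196.28
## 3 Twitter 2304374  159.36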
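The figures in Table 1 can be cross-checked with a few lines of arithmetic. The Python sketch below copies the byte and line counts from the output above and derives the size in MB and the average number of bytes per line; the names `datasets` and `summarize` are illustrative, and bytes per line is only a rough proxy for entry length because multi-byte characters count more than once.

```python
# Byte and line counts copied from the report's output above.
datasets = {
    "Blogs": {"bytes": 210160014, "lines": 898436},
    "News": {"bytes": 205811889, "lines": 1010172},
    "Twitter": {"bytes": 167105338, "lines": 2304374},
}


def summarize(stats):
    """Return (name, lines, size in MB, average bytes per line) per dataset."""
    rows = []
    for name, s in stats.items():
        size_mb = round(s["bytes"] / 1024**2, 2)  # 1 MB taken as 1024^2 bytes
        avg_bytes = round(s["bytes"] / s["lines"], 1)
        rows.append((name, s["lines"], size_mb, avg_bytes))
    return rows


for row in summarize(datasets):
    print("%-8s %9d %8.2f %8.1f" % row)
```

The last column is the quantity behind the statement about entry length: it is largest for blogs and smallest for Twitter.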
Key observations from the exploratory analysis include:
- Twitter data consists of very short text entries.
- Blog data contains extremely long lines.
- There is high variability in text length across datasets.
The distribution of word counts per tweet shows that tweets typically contain a small number of words, which supports the use of short-context prediction models.
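A distribution of this kind can be summarized by counting whitespace-separated words per line. The following Python sketch shows one possible way to do it; the helper names and the three toy lines are made up for illustration and are not taken from the datasets.

```python
import statistics


def words_per_line(lines):
    """Number of whitespace-separated words in each line."""
    return [len(line.split()) for line in lines]


def length_summary(lines):
    """Minimum, median, mean, and maximum word count over the given lines."""
    counts = words_per_line(lines)
    return {
        "min": min(counts),
        "median": statistics.median(counts),
        "mean": statistics.mean(counts),
        "max": max(counts),
    }


# Toy input, invented for this example only.
toy_lines = ["so tired today", "lol", "see you"]
print(length_summary(toy_lines))
```

Running the same summary on a sample from each dataset would make the variability in text length directly comparable.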
## Word Frequency
## tokens
##  the   to    i    a  you  and  for   in   of   is
## 1996 1675 1535 1300 1036  938  849  829  754  745
The most common words are short function words such as "the", "to", and "and", which carry little meaning on their own.
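A frequency table like the one above can be produced with a simple counter. The Python sketch below is only an illustration: the tokenization rule (lowercase runs of letters and apostrophes) and the function name `top_words` are assumptions, so its counts would not necessarily match the output shown.

```python
import re
from collections import Counter


def top_words(lines, n=10):
    """Return the n most frequent lowercase tokens with their counts."""
    counts = Counter()
    for line in lines:
        # Treat runs of letters and apostrophes as tokens (a simplification).
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts.most_common(n)


# Toy input, invented for this example only.
print(top_words(["The cat and the dog", "the end"], n=3))
```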
The prediction model will be based on n-gram language models, starting with simple bigrams and trigrams. The main objective is to balance prediction accuracy with computational efficiency to ensure fast responses.
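To make the idea concrete, here is a minimal Python sketch of a count-based predictor that looks up the last two words in a trigram table and falls back to bigrams and then to overall word counts. The fall-back order, the class name `NgramPredictor`, and the absence of smoothing are assumptions made for this example; the final model may differ.

```python
from collections import Counter, defaultdict


class NgramPredictor:
    """Count-based next-word predictor with a trigram, bigram, unigram fall-back."""

    def __init__(self):
        self.unigrams = Counter()
        self.bigrams = defaultdict(Counter)   # word -> counts of the next word
        self.trigrams = defaultdict(Counter)  # (word1, word2) -> counts of the next word

    def train(self, sentences):
        for sentence in sentences:
            tokens = sentence.lower().split()
            self.unigrams.update(tokens)
            for a, b in zip(tokens, tokens[1:]):
                self.bigrams[a][b] += 1
            for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
                self.trigrams[(a, b)][c] += 1

    def predict(self, text, k=3):
        tokens = text.lower().split()
        context = tuple(tokens[-2:])
        if len(context) == 2 and context in self.trigrams:
            counts = self.trigrams[context]      # longest available context
        elif tokens and tokens[-1] in self.bigrams:
            counts = self.bigrams[tokens[-1]]    # fall back to one word of context
        else:
            counts = self.unigrams               # no usable context
        return [word for word, _ in counts.most_common(k)]


# Toy corpus, invented for this example only.
model = NgramPredictor()
model.train(["i love new york", "i love new shoes", "i love new york city", "we love pizza"])
print(model.predict("I love new", k=1))
```

Storing only counts keeps lookups fast, which fits the stated goal of balancing accuracy with quick responses; pruning rare n-grams would be one further way to reduce memory.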
Text preprocessing steps will include normalization, tokenization, and removal of noise such as punctuation and numbers.
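These steps might look as follows in a Python sketch. The specific rules are assumptions for illustration: text is lowercased, digits and punctuation are replaced by spaces, apostrophes are kept so that contractions survive, and anything outside a-z is dropped, which is a simplification for English text.

```python
import re


def preprocess(text):
    """Normalize a string and split it into word tokens."""
    text = text.lower()                      # normalization: case folding
    text = re.sub(r"\d+", " ", text)         # remove numbers
    text = re.sub(r"[^a-z'\s]", " ", text)   # remove punctuation and other symbols
    return text.split()                      # tokenization on whitespace


print(preprocess("Hello, World! It's 2024."))
```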
A Shiny application will be developed to allow users to input text and receive predictions for the next word. The application will prioritize simplicity, responsiveness, and low memory usage.