This report outlines the initial exploratory data analysis performed on the text datasets for a text prediction project. It gives a concise overview of the data and of the plans for building a prediction algorithm and a Shiny application, and is written for a manager without a data science background.
The project uses three text datasets: blogs, news, and Twitter. The data was downloaded and successfully loaded into R.
| Dataset | File Size (MB) | Number of Lines | Number of Characters | Number of Words |
|---|---|---|---|---|
| Blogs | 200.42 | 899288 | 206824505 | 37546250 |
| News | 196.28 | 1010242 | 203223159 | 34762395 |
| Twitter | 159.36 | 2360148 | 162096241 | 30093413 |
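
Summary figures like those in the table can be produced with a few lines of base R. The sketch below is one possible way to do it, not the code used for this report. The helper name `summarise_text_file` is hypothetical, and words are counted by splitting on whitespace, which is an assumption: a different tokenizer would give slightly different counts. A small temporary file stands in for the real corpora.

```r
# Sketch only: `summarise_text_file` is a hypothetical helper, and the
# whitespace-based word count is an assumed definition of "word".
summarise_text_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  # Split each line on runs of whitespace and count the non-empty tokens
  tokens <- strsplit(trimws(lines), "\\s+")
  words_per_line <- vapply(tokens, function(x) sum(nzchar(x)), integer(1))
  data.frame(
    size_mb = file.size(path) / 1024^2,
    n_lines = length(lines),
    # as.numeric avoids integer overflow on very large files
    n_chars = sum(as.numeric(nchar(lines, type = "chars", allowNA = TRUE)),
                  na.rm = TRUE),
    n_words = sum(words_per_line)
  )
}

# Usage on a tiny temporary file standing in for one of the datasets
tmp <- tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "jumps over", ""), tmp)
stats <- summarise_text_file(tmp)
print(stats)
```

Calling the same function once per file and binding the rows together with `rbind()` would give one table row per dataset.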
To understand the characteristics of the text data, the distribution of word counts per line was analyzed, treating each line of a dataset as one document. The analysis focused on the average number of words per line and how it varies across the three text sources.
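
As an illustration of this kind of analysis, the base R sketch below counts words per line and summarises the distribution. The helper `count_words` and the three sample lines are made up for the example and are not project data. The second half only divides the word and line totals reported in the summary table, so it shows the overall mean per source rather than the full distribution.

```r
# Illustrative only: `count_words` is a hypothetical helper and
# `sample_lines` is made-up text, not project data.
count_words <- function(lines) {
  # Count runs of non-whitespace characters in each line
  lengths(regmatches(lines, gregexpr("\\S+", lines)))
}

sample_lines <- c(
  "A longer blog-style line with quite a few words in it",
  "Short news line here",
  "tiny tweet"
)
wc <- count_words(sample_lines)
print(summary(wc))  # min, quartiles, mean and max of words per line

# Mean words per line implied by the totals in the summary table
totals <- data.frame(
  dataset = c("Blogs", "News", "Twitter"),
  lines   = c(899288, 1010242, 2360148),
  words   = c(37546250, 34762395, 30093413)
)
totals$mean_words_per_line <- round(totals$words / totals$lines, 2)
print(totals)
```

From the reported totals, this works out to roughly 42 words per line for blogs, 34 for news, and 13 for Twitter.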
The goal is to develop a text prediction algorithm and deploy it as a Shiny application.
A basic n-gram model with a backoff strategy is planned. If a higher-order n-gram (e.g., a trigram) isn’t found in the training data, the model will “back off” to a lower-order n-gram (e.g., a bigram) to make a prediction. This helps to handle unseen word combinations.
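
The backoff idea can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the planned implementation: the function names (`tokenize`, `train_backoff`, `predict_next`) are hypothetical, the tokenizer keeps only lowercase ASCII letters and apostrophes, candidates are ranked by raw counts with no smoothing, and the four training lines are invented for the example.

```r
# Minimal backoff sketch; all names and the toy corpus are assumptions.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words[nzchar(words)]
}

train_backoff <- function(lines, max_n = 3) {
  stopifnot(max_n >= 2)
  token_list <- lapply(lines, tokenize)
  # Unigram counts are the last-resort fallback
  unigrams <- sort(table(unlist(token_list)), decreasing = TRUE)
  tables <- list()
  for (n in 2:max_n) {
    context <- character(0)
    nxt <- character(0)
    for (tokens in token_list) {
      if (length(tokens) < n) next
      for (i in seq_len(length(tokens) - n + 1)) {
        # The first n - 1 words are the context, the n-th is the follower
        context <- c(context, paste(tokens[i:(i + n - 2)], collapse = " "))
        nxt <- c(nxt, tokens[i + n - 1])
      }
    }
    tables[[as.character(n)]] <- split(nxt, context)
  }
  list(unigrams = unigrams, tables = tables, max_n = max_n)
}

predict_next <- function(model, text, k = 3) {
  tokens <- tokenize(text)
  # Try the highest-order n-gram first, then back off to shorter contexts
  for (n in model$max_n:2) {
    if (length(tokens) < n - 1) next
    context <- paste(tail(tokens, n - 1), collapse = " ")
    followers <- model$tables[[as.character(n)]][[context]]
    if (!is.null(followers)) {
      counts <- sort(table(followers), decreasing = TRUE)
      return(head(names(counts), k))
    }
  }
  head(names(model$unigrams), k)
}

toy_lines <- c(
  "I love data science",
  "I love data analysis",
  "I love data science projects",
  "we love text prediction"
)
model <- train_backoff(toy_lines)
print(predict_next(model, "I love data"))  # trigram context "love data" is found
print(predict_next(model, "they love"))    # unseen trigram, backs off to bigram "love"
print(predict_next(model, "hello world"))  # nothing matches, falls back to unigrams
```

In this sketch a prediction comes from a single n-gram order, so it can return fewer than `k` words when the matched context has few followers. Topping up the list from lower orders is one possible refinement.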
The Shiny app will provide a user-friendly interface for text prediction. Users will be able to:

1. Enter text into an input field.
2. Receive the top three predicted next words.
3. Potentially explore some of the underlying data features through interactive plots or tables.

This approach balances accuracy with computational efficiency, providing a useful predictive text experience.
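
A possible skeleton for such an app is sketched below. It is only an outline of the shape the app might take. The input and output IDs (`phrase`, `predictions`) are arbitrary choices, and `predict_next_stub` is a placeholder that returns fixed words where the real model would be called. The `requireNamespace()` guard is there only so the snippet also runs on a machine without the shiny package installed.

```r
# Placeholder standing in for the real prediction model (assumption):
# it ignores its input and returns fixed words.
predict_next_stub <- function(text, k = 3) {
  head(c("the", "to", "and"), k)
}

if (requireNamespace("shiny", quietly = TRUE)) {
  ui <- shiny::fluidPage(
    shiny::titlePanel("Next-word prediction"),
    shiny::textInput("phrase", "Enter text:"),
    shiny::textOutput("predictions")
  )

  server <- function(input, output, session) {
    # Re-runs automatically whenever the text input changes
    output$predictions <- shiny::renderText({
      paste(predict_next_stub(input$phrase), collapse = ", ")
    })
  }

  app <- shiny::shinyApp(ui, server)
  # Launch interactively with: shiny::runApp(app)
}
```

Replacing the stub with the trained model's prediction function would connect the interface to the algorithm.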
The initial data exploration confirms that the three datasets load successfully and establishes their basic structure. The word count distributions show how the sources differ in line length. The plan to build an n-gram based prediction algorithm and deploy it via a Shiny app provides a solid foundation for the next stages of this project.