The purpose of this project is to build a predictive text model using Natural Language Processing techniques. The training data consists of blogs, news articles, and Twitter text. The final goal is to create a Shiny application that predicts the next word based on user input.
The dataset contains three files, one for each source: blogs, news articles, and Twitter text.
The data was successfully downloaded and loaded into R for analysis.
Basic exploratory analysis was performed on the datasets.
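As an illustration of what such a basic summary might look like, here is a minimal sketch in base R. The helper name `summarise_text` and the three sample lines are hypothetical stand-ins, not the project's actual code or data; in practice the character vector would come from reading one of the three files.

```r
# Hypothetical helper: summarise a character vector of text lines.
summarise_text <- function(lines) {
  # Split each line on runs of whitespace and count the non-empty tokens.
  tokens <- strsplit(trimws(lines), "\\s+")
  words_per_line <- vapply(tokens, function(x) sum(nzchar(x)), integer(1))
  data.frame(
    lines = length(lines),
    words = sum(words_per_line),
    max_chars = max(nchar(lines)),
    mean_words_per_line = mean(words_per_line)
  )
}

# Toy stand-in for one of the three files (not real data).
sample_lines <- c("I love this blog post", "Breaking news today", "so tired lol")
stats <- summarise_text(sample_lines)
print(stats)
```

Running the same helper on each source would give a small comparison table of line counts, word counts, and line lengths.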
Key observations:
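Word frequencies are highly skewed: common English words dominate the corpus, while rare words appear only a few times. This skew is useful for building an efficient prediction model, because the model can keep counts for the frequent words and N-grams and drop the rarest ones with little loss of coverage.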
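The frequency pattern described above can be checked with a short sketch in base R. The helper names `word_frequencies` and `words_for_coverage`, the cleaning rule (lower-casing and keeping only letters and apostrophes), and the sample lines are all assumptions made for illustration, not the project's actual preprocessing.

```r
# Hypothetical helper: lower-case, strip non-letters, and count words.
word_frequencies <- function(lines) {
  cleaned <- gsub("[^a-z' ]+", " ", tolower(lines))
  words <- unlist(strsplit(cleaned, "\\s+"))
  words <- words[nzchar(words)]
  sort(table(words), decreasing = TRUE)
}

# Smallest number of distinct words needed to cover a given share of all tokens.
words_for_coverage <- function(freq, share) {
  cumulative <- cumsum(as.numeric(freq)) / sum(freq)
  which(cumulative >= share)[1]
}

# Toy corpus (not real data).
sample_lines <- c("the cat sat on the mat", "the dog sat", "a cat and a dog")
freq <- word_frequencies(sample_lines)
print(head(freq, 3))
print(words_for_coverage(freq, 0.5))
```

On a real corpus, the number returned for a share such as 0.5 or 0.9 indicates how small a vocabulary the model could keep while still covering most of the text.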
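The prediction model will be based on N-grams, that is, sequences of N consecutive words. Given the text the user has typed so far, the model will look at the most recent words and predict the most likely next word. Techniques such as backoff (falling back to a shorter N-gram when the longer context has not been seen) and smoothing (reserving some probability for unseen N-grams) may be used to improve prediction accuracy.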
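To make the idea concrete, here is a minimal sketch of an N-gram predictor with a simple count-based backoff, written in base R. It is not the planned implementation: the function names (`tokenize`, `count_ngrams`, `build_model`, `predict_next`), the maximum order of 3, and the toy corpus are assumptions, and the sketch applies no smoothing or discounting. It tries the longest context first and falls back to shorter ones, ending with the most frequent single word.

```r
# Hypothetical helper: split text into lower-case word tokens.
tokenize <- function(text) {
  words <- unlist(strsplit(gsub("[^a-z' ]+", " ", tolower(text)), "\\s+"))
  words[nzchar(words)]
}

# Count every n-gram of order n, keyed as "context|next".
count_ngrams <- function(lines, n) {
  keys <- unlist(lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) {
      context <- paste(w[i + seq_len(n - 1) - 1], collapse = " ")
      paste(context, w[i + n - 1], sep = "|")
    }, character(1))
  }))
  table(keys)
}

# One count table per order, from unigrams up to max_n (assumed to be 3).
build_model <- function(lines, max_n = 3) {
  lapply(seq_len(max_n), function(n) count_ngrams(lines, n))
}

# Try the longest available context first, then back off to shorter ones.
predict_next <- function(model, text) {
  w <- tokenize(text)
  for (n in rev(seq_along(model))) {
    if (length(w) < n - 1) next
    context <- paste(tail(w, n - 1), collapse = " ")
    counts <- model[[n]]
    keys <- as.character(names(counts))
    hits <- counts[startsWith(keys, paste0(context, "|"))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub("^.*\\|", "", best))
    }
  }
  NA_character_
}

# Toy corpus (not real data).
lines <- c("the cat sat on the mat", "the cat sat on the sofa", "the dog sat on the mat")
model <- build_model(lines)
print(predict_next(model, "I saw the cat"))  # trigram context "the cat"
print(predict_next(model, "a purple dog"))   # backs off to bigram context "dog"
print(predict_next(model, "zebra"))          # backs off to the most frequent word
```

The linear scan over keys keeps the sketch short; a real application would likely need an indexed lookup structure and some form of smoothing to respond quickly and to rank unseen continuations sensibly.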
The exploratory analysis provided useful insights into the structure of the text data. These findings will support the development of an effective predictive text application.