The goal of this project is to perform an exploratory data analysis on the provided text datasets (Blogs, News, and Twitter) and outline a plan for creating a text prediction algorithm. This report is designed to be clear and concise for a non-data scientist manager.
The three files were downloaded and loaded into R. The table below summarizes each file by line count and word count.
##      File_Name Line_Count Word_Count
## 1   Blogs Data     899288   37546806
## 2    News Data    1010206   34761151
## 3 Twitter Data    2360148   30096690
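For readers who want to see how a summary like this could be produced, here is a minimal sketch in base R. The report does not say how the words were counted, so the approach below (splitting each line on whitespace) is an assumption, and the helper name `summarise_lines` and the toy input are hypothetical. Counts from a different tokenizer may differ from the figures above.

```r
# Hypothetical helper: summarise a character vector of text lines.
# Assumption: a "word" is any non-empty token separated by whitespace.
summarise_lines <- function(lines, label) {
  # Split each line on runs of whitespace.
  tokens <- strsplit(trimws(lines), "\\s+")
  # Count the non-empty tokens on each line and add them up.
  word_count <- sum(vapply(tokens, function(x) sum(nzchar(x)), integer(1)))
  data.frame(File_Name = label,
             Line_Count = length(lines),
             Word_Count = word_count,
             stringsAsFactors = FALSE)
}

# Toy stand-in for one file; the real input would come from readLines()
# on the corresponding data file.
sample_lines <- c("the quick brown fox", "hello world", "")
summary_row <- summarise_lines(sample_lines, "Sample Data")
print(summary_row)
```

Calling the helper once per file and combining the rows with `rbind()` would give a three-row table in the same shape as the one above.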
Algorithm: I will build an N-gram model (unigrams, bigrams, and trigrams) that counts which words frequently appear together. These counts will be used to predict the most likely next word from the words already typed.
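The idea can be illustrated with a small base R sketch. It is not the final algorithm: it assumes simple whitespace tokenization, uses a three-sentence toy corpus, and predicts from bigram counts only. The function names `count_ngrams` and `predict_next` are hypothetical.

```r
# Count n-grams (sequences of n consecutive words) in a vector of lines.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    words <- strsplit(tolower(trimws(line)), "\\s+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  # Most frequent n-grams first.
  sort(table(grams), decreasing = TRUE)
}

# Suggest the next word by finding the most frequent bigram that starts
# with the last word typed. Returns NA when no bigram matches.
predict_next <- function(phrase, bigram_counts) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(NA_character_)
  prefix <- paste0(words[length(words)], " ")
  hits <- bigram_counts[startsWith(names(bigram_counts), prefix)]
  if (length(hits) == 0) return(NA_character_)
  best <- names(hits)[which.max(hits)]
  sub(prefix, "", best, fixed = TRUE)
}

# Toy corpus (made up for illustration only).
toy_corpus <- c("i love data science", "i love data", "i love r")
bigrams <- count_ngrams(toy_corpus, 2)
suggestion <- predict_next("We love", bigrams)
print(suggestion)
```

In the toy corpus, "love data" appears twice and "love r" once, so the phrase "We love" yields "data". The same counting function can be called with `n = 1` or `n = 3` to obtain the unigram and trigram tables mentioned in the plan.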
Shiny App: I plan to create a simple, user-friendly interface. The user will type a phrase into a text box, and the app will use the algorithm to instantly suggest the most likely next word.