Introduction: The objective of this project is to build a predictive text model using the SwiftKey dataset. The dataset contains text from blogs, news, and Twitter. This report presents the exploratory data analysis performed on the dataset and outlines the plan for building the prediction algorithm.
Data Loading: The datasets were downloaded and loaded into R using the readLines() function.
Files used: en_US.blogs.txt, en_US.twitter.txt, en_US.news.txt
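The loading step might look like the sketch below. It is not necessarily the code used for this report: the helper name read_corpus and the directory "final/en_US" are assumptions for illustration. The connection is opened in binary mode as a precaution, since a text-mode read can stop early on some platforms when a file contains embedded control characters.

```r
# Read every line of a text file as UTF-8.
# Assumption: binary mode ("rb") is used so that embedded control characters
# do not end the read early; skipNul drops any embedded NUL bytes.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Assumed location of the three files; adjust to the local download directory.
data_dir <- "final/en_US"
files <- c(Blogs = "en_US.blogs.txt",
           News = "en_US.news.txt",
           Twitter = "en_US.twitter.txt")
paths <- file.path(data_dir, files)

# Only attempt the load when the files are actually present.
if (all(file.exists(paths))) {
  corpora <- lapply(paths, read_corpus)
  names(corpora) <- names(files)
  line_counts <- vapply(corpora, length, integer(1))
  print(line_counts)
}
```

If the counts printed this way differ from a text-mode read, the difference is worth checking before drawing conclusions about relative dataset sizes.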
Basic Statistics:
Dataset    Lines
Blogs      899288
News       77259
Twitter    2360148
A bar chart was created to compare the number of lines across datasets.
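One way such a chart could be produced with base R graphics is sketched below. The counts are the ones reported under Basic Statistics; the title, axis labels, and color are illustrative choices, not taken from the original chart.

```r
# Line counts as reported under Basic Statistics.
line_counts <- c(Blogs = 899288, News = 77259, Twitter = 2360148)

# Draw one bar per dataset; labels and color are illustrative.
barplot(line_counts,
        main = "Number of lines per dataset",
        xlab = "Dataset",
        ylab = "Lines",
        col = "steelblue")
```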
Exploratory Data Analysis (EDA):
1. The Twitter dataset contains the largest number of lines, followed by blogs and news.
2. Basic text processing techniques such as tokenization and word counting were explored.
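As a minimal sketch of tokenization and word counting in base R, assuming a simple rule that lowercases the text and keeps only ASCII letters and apostrophes (the report does not state which rules were actually used), the idea can be illustrated on two made-up sample lines:

```r
# Split lines into lowercase word tokens.
# Simplification: anything other than a-z, apostrophes, and spaces is
# treated as a separator, so accented letters and digits are dropped.
tokenize <- function(lines) {
  lines <- tolower(lines)
  lines <- gsub("[^a-z' ]+", " ", lines)
  tokens <- unlist(strsplit(lines, " +"))
  tokens[nzchar(tokens)]
}

# Made-up sample lines standing in for the real corpus.
sample_lines <- c("The cat sat on the mat.", "The dog sat, too!")

tokens <- tokenize(sample_lines)

# Count occurrences of each word, most frequent first.
word_counts <- sort(table(tokens), decreasing = TRUE)
print(head(word_counts, 3))
```

The same function can be applied to a random sample of lines from each file to keep memory use manageable on the full corpus.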
Interesting findings:
1. Twitter contains the highest number of text records.
2. The datasets vary significantly in size and structure.
3. These differences will influence the prediction model.
Prediction Algorithm Plan:
Shiny App Plan:
A Shiny application will be developed where users can enter text and receive predictions for the next word.
The app will provide a simple, interactive interface for demonstrating the prediction algorithm.
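A rough skeleton of such an app is sketched below, assuming the shiny package is installed. The function predict_next_word is a hypothetical stand-in that always suggests a fixed word, since the prediction algorithm has not been built yet; the input and output IDs are likewise illustrative.

```r
# Hypothetical stand-in for the prediction algorithm, which is not built yet.
# It returns a fixed common word so the interface can be exercised.
predict_next_word <- function(text) {
  if (!nzchar(trimws(text))) return("")
  "the"
}

# Build the app only when the shiny package is available.
if (requireNamespace("shiny", quietly = TRUE)) {
  ui <- shiny::fluidPage(
    shiny::titlePanel("Next Word Prediction"),
    shiny::textInput("phrase", "Enter text:"),
    shiny::textOutput("prediction")
  )

  server <- function(input, output) {
    # Re-run the predictor whenever the text box changes.
    output$prediction <- shiny::renderText(predict_next_word(input$phrase))
  }

  app <- shiny::shinyApp(ui = ui, server = server)

  # Launch only in an interactive session.
  if (interactive()) shiny::runApp(app)
}
```

Once the real model exists, only the body of predict_next_word would need to change; the interface code could stay as it is.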
Conclusion:
The datasets were successfully loaded and analyzed.
Initial exploratory analysis has been completed, and the next step is to develop the prediction algorithm and deploy it using a Shiny application.