Data Science Capstone Project

Introduction: The objective of this project is to build a predictive text model using the SwiftKey dataset. The dataset contains text from blogs, news, and Twitter. This report presents the exploratory data analysis performed on the dataset and outlines the plan for building the prediction algorithm.

Data Loading: The datasets were downloaded and loaded into R using the readLines() function.

Files used: en_US.blogs.txt, en_US.twitter.txt, en_US.news.txt
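The loading step could look roughly like the sketch below. The directory name `final/en_US` and the helper `read_corpus()` are hypothetical, introduced only for illustration. Opening the connection in binary mode is also an assumption rather than something the report states; it is a common precaution because text-mode reads can stop early at embedded control characters on some platforms, which would lower a line count.

```r
# Hypothetical helper: read a text file into a character vector of lines.
# Binary mode ("rb") and skipNul = TRUE are assumptions, chosen to make the
# read robust to stray control characters.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Hypothetical location of the three files listed above.
data_dir <- "final/en_US"
files <- c(blogs   = "en_US.blogs.txt",
           news    = "en_US.news.txt",
           twitter = "en_US.twitter.txt")
paths <- setNames(file.path(data_dir, files), names(files))

# Only attempt the read when the files are actually present.
if (all(file.exists(paths))) {
  corpora <- lapply(paths, read_corpus)
  print(sapply(corpora, length))  # number of lines per dataset
}
```

Keeping the three vectors in one named list makes it easy to apply the same summary function to each dataset.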

Basic Statistics:

Dataset   Lines
Blogs     899288
News      77259
Twitter   2360148

A bar chart was created to compare the number of lines across datasets.
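A chart of this kind can be drawn with base graphics, as in the sketch below. The counts are the ones reported under Basic Statistics; the title, axis labels, and colour are illustrative choices, not taken from the original chart.

```r
# Line counts as reported under Basic Statistics.
line_counts <- c(Blogs = 899288, News = 77259, Twitter = 2360148)

# barplot() returns the x positions of the bar midpoints.
bar_mids <- barplot(line_counts,
                    main = "Number of lines per dataset",
                    xlab = "Dataset",
                    ylab = "Lines",
                    col  = "steelblue")
```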

EDA (Exploratory Data Analysis)

1. The Twitter dataset contains the largest number of lines, followed by blogs and news.

2. Basic text processing techniques such as tokenization and word counting were explored.
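As an illustration of the tokenization and word counting mentioned above, here is a minimal base-R sketch. The `tokenize()` helper, its cleaning rules (lowercase, keep only letters and apostrophes), and the two sample sentences are assumptions made for the example, not the exact processing used in the analysis.

```r
# Hypothetical tokenizer: lowercase, replace anything that is not a letter,
# apostrophe, or space with a space, then split on runs of spaces.
tokenize <- function(lines) {
  text <- tolower(lines)
  text <- gsub("[^a-z' ]+", " ", text)
  tokens <- unlist(strsplit(text, " +"))
  tokens[nzchar(tokens)]  # drop empty strings left by leading spaces
}

# Made-up sample text standing in for lines from the corpus.
sample_lines <- c("The cat sat on the mat.", "The dog sat too!")
tokens <- tokenize(sample_lines)

# Word frequencies, most frequent first.
word_counts <- sort(table(tokens), decreasing = TRUE)
print(head(word_counts, 3))
```

On the full datasets the same approach would usually be run on a random sample of lines first, since the files are large.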

Interesting findings:

1. Twitter contains the highest number of text records.

2. The datasets vary significantly in size and structure.

3. These differences will influence the prediction model.

Prediction Algorithm Plan:

The prediction model will be based on N-grams: unigrams, bigrams, and trigrams.
A backoff model will be used when an exact N-gram match is not available.
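The plan above can be sketched as follows. This is a deliberately simple backoff (try trigrams, then bigrams, then the most frequent unigram) with no discounting or smoothing; the function names `make_ngrams()`, `build_model()`, and `predict_next()` and the toy token vector are hypothetical, and the final model may well differ.

```r
# Build all n-grams of order n from a token vector.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency tables for unigrams, bigrams, and trigrams, sorted by count.
build_model <- function(tokens) {
  lapply(1:3, function(n) sort(table(make_ngrams(tokens, n)), decreasing = TRUE))
}

# Predict the next word, backing off to a shorter context when no match exists.
predict_next <- function(model, phrase) {
  words <- strsplit(tolower(trimws(phrase)), " +")[[1]]
  words <- words[nzchar(words)]
  for (k in min(2, length(words)):0) {
    if (k == 0) return(names(model[[1]])[1])  # fall back to top unigram
    prefix <- paste0(paste(tail(words, k), collapse = " "), " ")
    grams  <- model[[k + 1]]
    hits   <- grams[startsWith(as.character(names(grams)), prefix)]
    if (length(hits) > 0) {
      # Tables are sorted by frequency, so the first hit is the most common.
      return(sub(prefix, "", names(hits)[1], fixed = TRUE))
    }
  }
}

# Toy corpus, made up for illustration.
toy_tokens <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat", "ran")
model <- build_model(toy_tokens)

print(predict_next(model, "on the"))       # trigram match
print(predict_next(model, "saw the"))      # backs off to bigrams
print(predict_next(model, "hello world"))  # backs off to unigrams
```

With the toy corpus, the three calls exercise the trigram, bigram, and unigram levels in turn, which makes the backoff order easy to check by hand.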

Shiny App Plan:

A Shiny application will be developed where users can enter text and receive predictions for the next word.
The app will provide a simple and interactive interface for demonstrating the prediction algorithm.
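A minimal sketch of such an app is shown below, assuming the shiny package is installed. The input and output IDs, the labels, and the `predict_next_stub()` function are hypothetical; the stub simply returns a fixed word and stands in for the planned N-gram model.

```r
library(shiny)

# Hypothetical placeholder for the real prediction function.
predict_next_stub <- function(phrase) {
  if (!nzchar(trimws(phrase))) "" else "the"
}

# User interface: one text box and one output area.
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter text:"),
  verbatimTextOutput("prediction")
)

# Server logic: recompute the prediction whenever the input changes.
server <- function(input, output, session) {
  output$prediction <- renderText({
    predict_next_stub(input$phrase)
  })
}

app <- shinyApp(ui, server)
# To launch the app in an interactive session: runApp(app)
```

Because `renderText()` is reactive, the prediction updates as the user types, without a separate submit button.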

Conclusion:

The datasets were successfully loaded and analyzed.
Initial exploratory analysis has been completed, and the next step is to develop the prediction algorithm and deploy it using a Shiny application.

Sangeetha

25/06/2026