Abstract

This milestone report summarizes the initial phases of developing a predictive text application as part of Coursera’s Data Science Capstone. It covers data acquisition, preprocessing, and exploratory analysis, and outlines the next steps toward building the predictive model and application.

Data Acquisition

We used three English-language text datasets provided by SwiftKey:

  • en_US.blogs.txt (blog posts)
  • en_US.news.txt (news articles)
  • en_US.twitter.txt (tweets)

These datasets were downloaded and stored locally for analysis.

Data Processing

Basic Statistics

Initial file characteristics:

# Packages for file reading, string statistics, tokenization,
# n-gram frequencies, plotting, and report tables
library(readr)
library(stringi)
library(quanteda)
library(quanteda.textstats)
library(ggplot2)
library(knitr)
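The figures in the table below were computed along the following lines (a minimal sketch; the data/ file paths are assumptions about the local layout):

files <- c(Blogs   = "data/en_US.blogs.txt",
           News    = "data/en_US.news.txt",
           Twitter = "data/en_US.twitter.txt")

stats <- do.call(rbind, lapply(names(files), function(src) {
  lines <- read_lines(files[[src]])                       # readr
  data.frame(Source      = src,
             Size_MB     = round(file.size(files[[src]]) / 1024^2, 2),
             Total_Lines = length(lines),
             Total_Words = sum(stri_count_words(lines)))  # stringi
}))
kable(stats, caption = "Basic Statistics of the Datasets")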
Basic Statistics of the Datasets

Source    Size (MB)   Total Lines   Total Words
Blogs        200.42       899,288    37,546,250
News         196.28     1,010,242    34,762,395
Twitter      159.36     2,360,148    30,093,372

Data Sampling

Because the full files are large, we sampled 10,000 random lines from each dataset (30,000 lines in total) for efficient analysis:
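A sketch of the sampling step (the seed value is illustrative; files is the vector defined in the earlier sketch):

set.seed(1234)                         # illustrative seed, for reproducibility
samples <- lapply(files, function(f) sample(read_lines(f), 10000))
combined <- unlist(samples, use.names = FALSE)
length(combined)                       # 30000 lines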

Summary of Sampled Data

Source             Size (MB)   Total Lines   Total Words
Combined Samples   6.8         30,000        889,456

Data Cleaning and Corpus Building

The sampled data were cleaned by the following steps (a code sketch follows the list):

  • Converting text to lowercase
  • Removing punctuation, numbers, extra whitespace, URLs, and profanity
  • Removing common English stop words
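A hedged sketch of this pipeline using quanteda; the profanity vector is a placeholder for an actual banned-word list:

profanity <- character(0)   # placeholder: substitute a real profanity word list

toks <- tokens(corpus(combined),
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_url     = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, pattern = c(stopwords("en"), profanity))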

Exploratory Data Analysis

Most Frequent Unigrams

The 15 most frequent words (excluding common stop words) were:
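A minimal sketch of how such a ranking can be produced from the cleaned tokens, assuming the toks object from the cleaning step above:

top_uni <- textstat_frequency(dfm(toks), n = 15)   # quanteda.textstats
ggplot(top_uni, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency", title = "Top 15 Unigrams")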

Most Frequent Bigrams

The 15 most frequent two-word phrases (bigrams) were:
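Bigrams can be derived from the same tokens object, for example:

toks_bi <- tokens_ngrams(toks, n = 2, concatenator = " ")
top_bi  <- textstat_frequency(dfm(toks_bi), n = 15)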

Observations

  • The Twitter dataset generally contains shorter messages than the blog and news datasets.
  • “Love” appears markedly more often than “hate,” hinting at an overall positive tone in the sampled text.
  • Removing common function words (stop words) makes the remaining frequency rankings more informative.

Plans for the Predictive Model

Our predictive model will:

  • Build n-gram frequency tables (unigrams, bigrams, and higher-order n-grams) from the cleaned corpus
  • Predict the next word from the longest matching context, backing off to shorter n-grams when a context has not been seen
  • Be optimized for memory use and response time so it can run inside a Shiny application
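As a rough illustration of the backoff idea (not the final implementation; ngram_freq and unigram_freq are hypothetical precomputed frequency tables, with ngram_freq keyed by the preceding words):

predict_next <- function(context, ngram_freq, unigram_freq) {
  words <- tail(strsplit(tolower(context), "\\s+")[[1]], 2)
  for (k in rev(seq_along(words))) {
    key  <- paste(tail(words, k), collapse = " ")   # longest context first
    cand <- ngram_freq[[key]]
    if (!is.null(cand)) return(names(which.max(cand)))
  }
  names(which.max(unigram_freq))   # no context matched: most frequent word
}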

Shiny Application Development

The resulting predictive model will be integrated into a Shiny application designed to:

  • Accept a word or phrase typed by the user
  • Display the predicted next word in real time
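A minimal sketch of such an interface, reusing the hypothetical predict_next() from the previous sketch:

library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)
    predict_next(input$phrase, ngram_freq, unigram_freq)
  })
}
shinyApp(ui, server)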

Next Steps

Our next steps are to implement and tune the predictive model described above, integrate it into the Shiny application, and evaluate prediction accuracy and response time. Feedback and suggestions are welcome to guide future improvements.