Abstract

This milestone report summarizes the initial phases of developing a predictive text application as part of Coursera’s Data Science Capstone. It covers data acquisition, preprocessing, and exploratory analysis, and outlines the next steps toward building the predictive model and application.

Data Acquisition

We used three English-language text datasets provided by SwiftKey:

  • en_US.blogs.txt (blog posts)
  • en_US.news.txt (news articles)
  • en_US.twitter.txt (tweets)

These datasets were downloaded and stored locally for analysis.

Data Processing

Basic Statistics

Initial file characteristics:

# Packages for file reading, string statistics, tokenization,
# n-gram frequencies, plotting, and report tables
library(readr)
library(stringi)
library(quanteda)
library(quanteda.textstats)
library(ggplot2)
library(knitr)
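The figures in the table below were computed along the following lines (a minimal sketch; the data/ file paths are assumptions about the local layout):

files <- c(Blogs   = "data/en_US.blogs.txt",
           News    = "data/en_US.news.txt",
           Twitter = "data/en_US.twitter.txt")

stats <- do.call(rbind, lapply(names(files), function(src) {
  lines <- read_lines(files[[src]])                       # readr
  data.frame(Source      = src,
             Size_MB     = round(file.size(files[[src]]) / 1024^2, 2),
             Total_Lines = length(lines),
             Total_Words = sum(stri_count_words(lines)))  # stringi
}))
kable(stats, caption = "Basic Statistics of the Datasets")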
Basic Statistics of the Datasets

Source    Size (MB)   Total Lines   Total Words
Blogs        200.42       899,288    37,546,250
News         196.28     1,010,242    34,762,395
Twitter      159.36     2,360,148    30,093,372

Data Sampling

Because the full files are large, we sampled 10,000 random lines from each dataset (30,000 lines in total) for efficient analysis:
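A sketch of the sampling step (the seed value is illustrative; files is the vector defined in the earlier sketch):

set.seed(1234)                         # illustrative seed, for reproducibility
samples <- lapply(files, function(f) sample(read_lines(f), 10000))
combined <- unlist(samples, use.names = FALSE)
length(combined)                       # 30000 lines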

Summary of Sampled Data

Source             Size (MB)   Total Lines   Total Words
Combined Samples   6.8         30,000        889,456

Data Cleaning and Corpus Building

The sampled data were cleaned by the following steps (a code sketch follows the list):

  • Converting text to lowercase
  • Removing punctuation, numbers, extra whitespace, URLs, and profanity
  • Removing common English stop words
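A hedged sketch of this pipeline using quanteda; the profanity vector is a placeholder for an actual banned-word list:

profanity <- character(0)   # placeholder: substitute a real profanity word list

toks <- tokens(corpus(combined),
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_url     = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, pattern = c(stopwords("en"), profanity))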

Exploratory Data Analysis

Most Frequent Unigrams

The 15 most frequent words (excluding common stop words) were:
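A minimal sketch of how such a ranking can be produced from the cleaned tokens, assuming the toks object from the cleaning step above:

top_uni <- textstat_frequency(dfm(toks), n = 15)   # quanteda.textstats
ggplot(top_uni, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency", title = "Top 15 Unigrams")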

Most Frequent Bigrams

The 15 most frequent two-word phrases (bigrams) were:
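Bigrams can be derived from the same tokens object, for example:

toks_bi <- tokens_ngrams(toks, n = 2, concatenator = " ")
top_bi  <- textstat_frequency(dfm(toks_bi), n = 15)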

Observations

  • The Twitter dataset generally contains shorter messages than the blog and news datasets.
  • “Love” appears markedly more often than “hate,” hinting at an overall positive tone in the sampled text.
  • Removing common function words (stop words) makes the remaining frequency rankings more informative.

Plans for the Predictive Model

Our predictive model will:

  • Build n-gram frequency tables (unigrams, bigrams, and higher-order n-grams) from the cleaned corpus
  • Predict the next word from the longest matching context, backing off to shorter n-grams when a context has not been seen
  • Be optimized for memory use and response time so it can run inside a Shiny application
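As a rough illustration of the backoff idea (not the final implementation; ngram_freq and unigram_freq are hypothetical precomputed frequency tables, with ngram_freq keyed by the preceding words):

predict_next <- function(context, ngram_freq, unigram_freq) {
  words <- tail(strsplit(tolower(context), "\\s+")[[1]], 2)
  for (k in rev(seq_along(words))) {
    key  <- paste(tail(words, k), collapse = " ")   # longest context first
    cand <- ngram_freq[[key]]
    if (!is.null(cand)) return(names(which.max(cand)))
  }
  names(which.max(unigram_freq))   # no context matched: most frequent word
}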

Shiny Application Development

The resulting predictive model will be integrated into a Shiny application designed to:

  • Accept a word or phrase typed by the user
  • Display the predicted next word in real time
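A minimal sketch of such an interface, reusing the hypothetical predict_next() from the previous sketch:

library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)
    predict_next(input$phrase, ngram_freq, unigram_freq)
  })
}
shinyApp(ui, server)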

Next Steps

Our next steps are to implement and tune the predictive model described above, integrate it into the Shiny application, and evaluate prediction accuracy and response time. Feedback and suggestions are welcome to guide future improvements.