Capstone Milestone Report

Introduction

This milestone report is part of the Capstone Project for the Data Science Specialization offered by Johns Hopkins University on Coursera. Its goal is to demonstrate basic data handling, present an exploratory data analysis, and outline the plans for building a prediction algorithm and a Shiny application.

Data Loading

The dataset is provided by SwiftKey and consists of three text files containing content from blogs, news, and Twitter.

library(stringi)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Load the data
blogs <- readLines("final/en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("final/en_US/en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("final/en_US/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")
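One practical caveat: large raw text files like these can contain embedded NUL or other control characters, and reading them in text mode may cut a file short on some platforms. The following is a minimal sketch of a more defensive read, assuming a hypothetical helper named read_corpus that is not part of the original analysis. It is demonstrated on a small temporary file rather than on the real corpus.

```r
# Hypothetical helper (not from the original report): read a text file
# through a binary connection so control characters do not end the read early.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # always release the connection
  readLines(con, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
}

# Demonstration on a small temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_corpus(tmp)
length(demo_lines)
```

If the line counts obtained with a plain readLines() call already match the expected totals, this extra step is not needed.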

Summary Statistics

data_summary <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter)))
)
knitr::kable(data_summary, caption = "Line and Word Counts per File")

Line and Word Counts per File

File        Lines       Words
Blogs      899288    37546806
News      1010206    34761151
Twitter   2360148    30096649
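The counts above also give a quick estimate of the average entry length per file. As a small sketch (the data frame name counts and the column WordsPerLine are illustrative choices, not part of the original analysis), dividing words by lines yields:

```r
# Counts copied from the summary table above
counts <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010206, 2360148),
  Words = c(37546806, 34761151, 30096649)
)

# Average number of words per line for each file
counts$WordsPerLine <- round(counts$Words / counts$Lines, 2)
counts$WordsPerLine
# 41.75 34.41 12.75
```

These averages cover the full files, whereas the histogram in the next section is based on a sample.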

Word Distribution by Line

For performance reasons, the distribution is estimated from a random sample of 5,000 lines per dataset, drawn with a fixed seed so that the result is reproducible.

set.seed(123)
sample_blogs <- sample(blogs, 5000)
sample_news <- sample(news, 5000)
sample_twitter <- sample(twitter, 5000)

sample_df <- data.frame(
  Source = rep(c("Blogs", "News", "Twitter"), each = 5000),
  Words = c(stri_count_words(sample_blogs),
            stri_count_words(sample_news),
            stri_count_words(sample_twitter))
)

ggplot(sample_df, aes(x = Words, fill = Source)) +
  geom_histogram(binwidth = 5, color = "black") +
  facet_wrap(~Source, scales = "free_y") +
  labs(title = "Word Count per Line Distribution",
       x = "Number of Words per Line", y = "Frequency")

Findings

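The Twitter dataset has the shortest entries on average (about 12.8 words per line, based on the line and word counts in the summary table), consistent with Twitter's character limit.
The Blogs dataset has the longest entries (about 41.8 words per line) and greater variability than the other two sources.
The News dataset has moderately long entries (about 34.4 words per line) that are more consistent in length.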

These differences will influence tokenization and model training.
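The differences in length and spread can also be put into numbers rather than read off the histogram. The sketch below uses a toy stand-in for sample_df with made-up values, so the figures it produces are illustrative only; the names toy_df, by_source and stats are assumptions introduced here.

```r
# Toy stand-in for sample_df (made-up values, for illustration only)
toy_df <- data.frame(
  Source = rep(c("Blogs", "News", "Twitter"), each = 4),
  Words  = c(10, 40, 90, 20,  30, 35, 40, 35,  8, 12, 15, 5)
)

# Mean, median and standard deviation of words per line, by source
by_source <- split(toy_df$Words, toy_df$Source)
stats <- data.frame(
  Source = names(by_source),
  Mean   = sapply(by_source, mean),
  Median = sapply(by_source, median),
  SD     = round(sapply(by_source, sd), 2)
)
stats
```

Applied to the real sample_df, the same split-and-summarise pattern would show whether Blogs does in fact have the largest standard deviation.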

Next Steps

Data Cleaning: Remove non-ASCII characters, punctuation, numbers, and profanity.
Tokenization: Create uni-, bi-, and tri-grams using the quanteda or tidytext package.
Modeling: Develop an n-gram model with smoothing or backoff (e.g., Stupid Backoff).
Shiny App: Build an interactive application that predicts the next word based on user input.
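To make the planned pipeline concrete, here is a minimal sketch of cleaning, n-gram counting and Stupid Backoff scoring on a three-sentence toy corpus. It is written in base R only so that it runs without extra packages; it does not use or represent the quanteda or tidytext APIs, it omits profanity filtering, and every helper name (clean_text, ngram_counts, count_of, stupid_backoff, predict_next) is hypothetical. The backoff factor 0.4 is the value commonly used with Stupid Backoff.

```r
# Illustrative base-R sketch; every helper name below is hypothetical.

# Cleaning: drop non-ASCII bytes, lower-case, keep letters, apostrophes, spaces
clean_text <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
  x <- gsub("[^a-z' ]", " ", tolower(x))
  trimws(gsub(" +", " ", x))
}

# Tokenization: count all n-grams of order n, one line at a time
ngram_counts <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(tok) {
    if (length(tok) < n) return(character(0))
    vapply(seq_len(length(tok) - n + 1),
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))
}

# Look up a count, returning 0 for unseen n-grams
count_of <- function(counts, key) {
  if (key %in% names(counts)) counts[[key]] else 0L
}

# Stupid Backoff score of `word` after a two-word `context`:
# use the trigram ratio if the trigram was seen, otherwise back off to the
# bigram and then the unigram, multiplying by alpha at each step.
stupid_backoff <- function(word, context, uni, bi, tri, alpha = 0.4) {
  w1 <- context[1]
  w2 <- context[2]
  tri_n <- count_of(tri, paste(w1, w2, word))
  if (tri_n > 0) return(tri_n / count_of(bi, paste(w1, w2)))
  bi_n <- count_of(bi, paste(w2, word))
  if (bi_n > 0) return(alpha * bi_n / count_of(uni, w2))
  alpha^2 * count_of(uni, word) / sum(uni)
}

# Score every known word and return the best one
predict_next <- function(context, uni, bi, tri) {
  scores <- sapply(names(uni), stupid_backoff, context = context,
                   uni = uni, bi = bi, tri = tri)
  names(scores)[which.max(scores)]
}

# Toy corpus (made up for illustration)
corpus <- clean_text(c("The cat sat on the mat.",
                       "The cat ate the fish!",
                       "A dog sat on the rug."))
uni <- ngram_counts(corpus, 1)
bi  <- ngram_counts(corpus, 2)
tri <- ngram_counts(corpus, 3)

predict_next(c("sat", "on"), uni, bi, tri)           # "the"
stupid_backoff("mat", c("on", "the"), uni, bi, tri)  # 0.5
```

Stupid Backoff returns relative scores rather than normalized probabilities, which is sufficient for ranking candidate next words. Scoring the whole vocabulary, as predict_next does here, is only practical for a toy corpus; on the full dataset the candidates would need to be restricted to n-grams that start with the given context.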

Conclusion

This report presented the initial exploratory analysis of the dataset. The next steps will involve preparing the data for modeling and implementing a predictive algorithm using n-grams, followed by deployment via a Shiny application.