Introduction

This milestone report is part of the Capstone Project for the Data Science Specialization by Johns Hopkins University on Coursera. It demonstrates basic data handling, presents an exploratory data analysis, and outlines the plan for building a prediction algorithm and a Shiny application.

Data Loading

The dataset is provided by SwiftKey and consists of three text files containing content from blogs, news, and Twitter.

library(stringi)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Load the data
blogs <- readLines("final/en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("final/en_US/en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("final/en_US/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")

Summary Statistics

data_summary <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter)))
)
knitr::kable(data_summary, caption = "Line and Word Counts per File")
Line and Word Counts per File

File       Lines      Words
Blogs      899288     37546806
News       1010206    34761151
Twitter    2360148    30096649
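
The counts in the table above also give the average line length per source. A minimal sketch, with the figures copied from the table into a hypothetical `counts` data frame rather than recomputed from the files:

```r
# Line and word counts copied from the summary table above.
counts <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010206, 2360148),
  Words = c(37546806, 34761151, 30096649)
)

# Average words per line, rounded to one decimal place.
counts$WordsPerLine <- round(counts$Words / counts$Lines, 1)
print(counts)
```

In the report itself the same column could be added to `data_summary` directly.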

Word Distribution by Line

For performance reasons, we draw a random sample of 5,000 lines from each file, without replacement and with a fixed seed so that the result is reproducible.

set.seed(123)
sample_blogs <- sample(blogs, 5000)
sample_news <- sample(news, 5000)
sample_twitter <- sample(twitter, 5000)

sample_df <- data.frame(
  Source = rep(c("Blogs", "News", "Twitter"), each = 5000),
  Words = c(stri_count_words(sample_blogs),
            stri_count_words(sample_news),
            stri_count_words(sample_twitter))
)
ggplot(sample_df, aes(x = Words, fill = Source)) +
  geom_histogram(binwidth = 5, color = "black") +
  facet_wrap(~Source, scales = "free_y") +
  labs(title = "Word Count per Line Distribution",
       x = "Number of Words per Line", y = "Frequency")
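
A numeric summary can complement the histograms. The sketch below computes per-source quartiles of words per line in base R. It assumes a data frame with the same `Source` and `Words` columns as `sample_df`; the `toy_df` values are made up for illustration and are not taken from the data.

```r
# Toy stand-in for sample_df; the values are hypothetical.
toy_df <- data.frame(
  Source = rep(c("Blogs", "News", "Twitter"), each = 4),
  Words  = c(12, 40, 55, 90,  20, 30, 35, 60,  3, 8, 12, 20)
)

# Quartiles of words per line for each source (one row per source).
word_quantiles <- t(sapply(split(toy_df$Words, toy_df$Source),
                           quantile, probs = c(0.25, 0.5, 0.75)))
print(word_quantiles)
```

Replacing `toy_df` with `sample_df` would give the same summary for the sampled lines.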

Findings

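The summary table shows clear differences between the sources. Twitter has the most lines (2,360,148) but the fewest words, about 13 words per line on average. Blogs have the fewest lines (899,288) but the most words, about 42 words per line. News falls in between at about 34 words per line. These differences will influence tokenization and model training.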

Next Steps

Conclusion

This report presented the initial exploratory analysis of the dataset. The next steps will involve preparing the data for modeling and implementing a predictive algorithm using n-grams, followed by deployment via a Shiny application.
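
To illustrate the n-gram idea mentioned above, here is a minimal base R sketch of a bigram predictor. It is not the final algorithm: the toy `corpus`, the `pairs` data frame, and the `predict_next` helper are all hypothetical names introduced for illustration, and the whitespace tokenization is a simplifying assumption.

```r
# Toy corpus; made up for illustration, not taken from the SwiftKey data.
corpus <- c("i love data science",
            "i love r",
            "i like data")

# Lowercase each line and split it on whitespace.
tokens <- strsplit(tolower(corpus), "\\s+")

# Collect (first, second) word pairs within each line.
pairs <- do.call(rbind, lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(first = head(w, -1), second = tail(w, -1),
             stringsAsFactors = FALSE)
}))

# Return the word that most often follows `word`, or NA if it was never seen.
# Ties are broken alphabetically because table() sorts its names.
predict_next <- function(word, pairs) {
  candidates <- pairs$second[pairs$first == tolower(word)]
  if (length(candidates) == 0) return(NA_character_)
  counts <- table(candidates)
  names(counts)[which.max(counts)]
}

print(predict_next("i", pairs))
```

A real model would likely use longer n-grams and some form of backoff for unseen words, but the counting step stays the same.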