This milestone report is part of the Capstone Project for the Data Science Specialization by Johns Hopkins University on Coursera. The goal of this report is to demonstrate basic data handling and exploratory data analysis, and to outline the plans for building a prediction algorithm and Shiny application.
The dataset is provided by SwiftKey and consists of three text files containing content from blogs, news, and Twitter.
library(stringi)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
# Load the data
blogs <- readLines("final/en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("final/en_US/en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("final/en_US/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")
data_summary <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter)))
)
knitr::kable(data_summary, caption = "Line and Word Counts per File")
| File | Lines | Words |
|---|---|---|
| Blogs | 899288 | 37546806 |
| News | 1010206 | 34761151 |
| Twitter | 2360148 | 30096649 |
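The totals in the table can be turned into an average line length per source, which makes the contrast between the three files easier to see. The sketch below is an illustrative addition rather than part of the original analysis: it re-enters the counts from the table by hand, and the `WordsPerLine` column name is an assumption introduced here.

```r
# Line and word totals copied from the summary table above
data_summary <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010206, 2360148),
  Words = c(37546806, 34761151, 30096649)
)

# Average number of words per line, rounded to two decimals
# (WordsPerLine is a hypothetical column name, not from the original report)
data_summary$WordsPerLine <- round(data_summary$Words / data_summary$Lines, 2)

print(data_summary)
```

On these counts, blog lines average roughly 42 words, news lines roughly 34, and tweets roughly 13.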
For performance reasons, we will sample 5,000 lines from each dataset.
set.seed(123)
sample_blogs <- sample(blogs, 5000)
sample_news <- sample(news, 5000)
sample_twitter <- sample(twitter, 5000)
sample_df <- data.frame(
  Source = rep(c("Blogs", "News", "Twitter"), each = 5000),
  Words = c(stri_count_words(sample_blogs),
            stri_count_words(sample_news),
            stri_count_words(sample_twitter))
)
ggplot(sample_df, aes(x = Words, fill = Source)) +
  geom_histogram(binwidth = 5, color = "black") +
  facet_wrap(~Source, scales = "free_y") +
  labs(title = "Word Count per Line Distribution",
       x = "Number of Words per Line", y = "Frequency")
The differences in line length across the three sources will influence tokenization and model training.
Tokenization and n-gram construction are planned with the quanteda or tidytext package.

This report presented the initial exploratory analysis of the dataset. The next steps will involve preparing the data for modeling and implementing a predictive algorithm using n-grams, followed by deployment via a Shiny application.
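To give a rough idea of what an n-gram predictor involves, here is a minimal bigram sketch in base R. It is not the planned implementation, which is expected to use quanteda or tidytext: the helper names `tokenize`, `build_bigrams`, and `predict_next`, the cleaning rule, and the toy sentences are all assumptions made for illustration.

```r
# Lowercase a line, replace anything that is not a letter, apostrophe or space,
# then split on whitespace (a deliberately simple, hypothetical cleaning rule)
tokenize <- function(line) {
  cleaned <- gsub("[^a-z' ]", " ", tolower(line))
  tokens <- strsplit(cleaned, "\\s+")[[1]]
  tokens[tokens != ""]
}

# Count every pair of adjacent words across a character vector of lines
build_bigrams <- function(lines) {
  bigrams <- unlist(lapply(lines, function(line) {
    tok <- tokenize(line)
    if (length(tok) < 2) return(character(0))
    paste(head(tok, -1), tail(tok, -1))
  }))
  counts <- table(bigrams)
  parts <- strsplit(names(counts), " ", fixed = TRUE)
  data.frame(
    first  = vapply(parts, `[`, character(1), 1),
    second = vapply(parts, `[`, character(1), 2),
    n      = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

# Return the most frequent follower of `word`, or NA if the word was never seen
predict_next <- function(bigrams, word) {
  candidates <- bigrams[bigrams$first == tolower(word), ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$second[which.max(candidates$n)]
}

# Toy corpus (invented sentences, not taken from the SwiftKey data)
toy_lines <- c("I love data science", "I love text mining", "I like data")
model <- build_bigrams(toy_lines)
print(predict_next(model, "I"))
```

A real model would also need higher-order n-grams and a fallback for unseen words, for example backing off to shorter contexts; this sketch simply returns `NA` in that case.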