Homework 1

1 Step 1. Data
2 Step 2. Data Quality
3 Step 3. Data Visualization

1 Step 1. Data

Title: Cryptocurrency Dataset (Hourly, 2025)

Author: ylmzasel

Publication: September 11, 2025

Location: Kaggle Datasets

Description: synthetic dataset with hourly OHLCV-style data and auxiliary fields for the year 2025, designed for exploration exercises and time-series modeling.

URL: https://www.kaggle.com/datasets/ylmzasel/synthetic-cryptocurrency-dataset-hourly-2025

How it was created: according to the Kaggle page, this is a synthetic but realistic dataset that replicates the typical structure of cryptocurrency prices and volumes at an hourly frequency in 2025, aiming to facilitate analysis and forecasting without relying on live APIs.

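library(tidyverse); library(naniar); library(knitr)  # readr/dplyr/ggplot2, missing-data helpers, kable()
data_set_crypto <- read_csv("/Users/arturoaguilarargueta/Desktop/Rstudio work/sentetik_kripto_veriseti_saatlik_2025.csv")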

## Rows: 70080 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (1): asset
## dbl  (15): open, high, low, close, volume, spread, num_trades, onchain_tx_co...
## dttm  (1): timestamp
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(data_set_crypto, 5)

2 Step 2. Data Quality

df <- data_set_crypto

2.1 Data size and data types

nrow(df); ncol(df)

## [1] 70080

## [1] 17

sapply(df, class)

## $timestamp
## [1] "POSIXct" "POSIXt" 
## 
## $asset
## [1] "character"
## 
## $open
## [1] "numeric"
## 
## $high
## [1] "numeric"
## 
## $low
## [1] "numeric"
## 
## $close
## [1] "numeric"
## 
## $volume
## [1] "numeric"
## 
## $spread
## [1] "numeric"
## 
## $num_trades
## [1] "numeric"
## 
## $onchain_tx_count
## [1] "numeric"
## 
## $sentiment
## [1] "numeric"
## 
## $exchange_net_flow
## [1] "numeric"
## 
## $returns_1h
## [1] "numeric"
## 
## $rolling_vol_24h
## [1] "numeric"
## 
## $rolling_vol_7d
## [1] "numeric"
## 
## $future_return_1h
## [1] "numeric"
## 
## $future_return_24h
## [1] "numeric"

str(df)

## spc_tbl_ [70,080 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ timestamp        : POSIXct[1:70080], format: "2023-09-11 09:00:00" "2023-09-11 10:00:00" ...
##  $ asset            : chr [1:70080] "ADA" "ADA" "ADA" "ADA" ...
##  $ open             : num [1:70080] 0.88 0.88 0.883 0.892 0.89 ...
##  $ high             : num [1:70080] 0.881 0.885 0.893 0.895 0.892 ...
##  $ low              : num [1:70080] 0.877 0.878 0.882 0.889 0.882 ...
##  $ close            : num [1:70080] 0.88 0.883 0.892 0.89 0.884 ...
##  $ volume           : num [1:70080] 57029449 57842442 61024970 42584523 60785144 ...
##  $ spread           : num [1:70080] 0.000488 0.000576 0.000585 0.0005 0.000471 0.000514 0.000732 0.00062 0.000668 0.000543 ...
##  $ num_trades       : num [1:70080] 97606 102197 118399 87912 132983 ...
##  $ onchain_tx_count : num [1:70080] 39042 41696 49708 35513 54909 ...
##  $ sentiment        : num [1:70080] 0.618 0.548 0.425 0.436 0.556 ...
##  $ exchange_net_flow: num [1:70080] 0 -344421 7494836 3225385 -8554658 ...
##  $ returns_1h       : num [1:70080] NA 0.004 0.00992 -0.00199 -0.00645 ...
##  $ rolling_vol_24h  : num [1:70080] NA NA NA NA NA NA NA NA NA NA ...
##  $ rolling_vol_7d   : num [1:70080] NA NA NA NA NA NA NA NA NA NA ...
##  $ future_return_1h : num [1:70080] 0.004 0.00992 -0.00199 -0.00645 -0.00855 ...
##  $ future_return_24h: num [1:70080] 0.01074 0.00644 -0.01258 -0.01514 -0.02045 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   timestamp = col_datetime(format = ""),
##   ..   asset = col_character(),
##   ..   open = col_double(),
##   ..   high = col_double(),
##   ..   low = col_double(),
##   ..   close = col_double(),
##   ..   volume = col_double(),
##   ..   spread = col_double(),
##   ..   num_trades = col_double(),
##   ..   onchain_tx_count = col_double(),
##   ..   sentiment = col_double(),
##   ..   exchange_net_flow = col_double(),
##   ..   returns_1h = col_double(),
##   ..   rolling_vol_24h = col_double(),
##   ..   rolling_vol_7d = col_double(),
##   ..   future_return_1h = col_double(),
##   ..   future_return_24h = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

The dataset contains 70,080 observations and 17 variables. It includes a date-time field (timestamp, POSIXct), a categorical asset identifier (asset, character), and 15 numeric variables (OHLCV, activity metrics, volatilities, and returns).
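The row count can be cross-checked against the time span. The sketch below assumes four assets on a complete hourly grid in UTC, with the first and last timestamps taken from the summary table in Section 2.3.1. The variable names are illustrative and not part of the original analysis.

```r
# Assumed endpoints (from the overall summary table) and assumed asset count
start_ts <- as.POSIXct("2023-09-11 09:00:00", tz = "UTC")
end_ts   <- as.POSIXct("2025-09-10 08:00:00", tz = "UTC")
n_assets <- 4

# One timestamp per hour, both endpoints included
hours_per_asset <- length(seq(start_ts, end_ts, by = "hour"))
hours_per_asset              # 17520
hours_per_asset * n_assets   # 70080, matching nrow(df)
```

If the product did not match nrow(df), that would suggest gaps or duplicated hours for at least one asset.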

2.2 Assess missing data

miss_summary <- miss_var_summary(df)  
kable(miss_summary, caption = "Missing values by variable (count and %)")

Missing values by variable (count and %)
variable	n_miss	pct_miss
rolling_vol_7d	672	0.959
rolling_vol_24h	96	0.137
future_return_24h	96	0.137
returns_1h	4	0.00571
future_return_1h	4	0.00571
timestamp	0	0
asset	0	0
open	0	0
high	0	0
low	0	0
close	0	0
volume	0	0
spread	0	0
num_trades	0	0
onchain_tx_count	0	0
sentiment	0	0
exchange_net_flow	0	0

total_na <- sum(is.na(df))
total_na

## [1] 872
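As a cross-check, the same summary can be reproduced in base R. The sketch below runs on a small hypothetical data frame (`toy`, not the real dataset). It shows that pct_miss is a percentage on a 0-100 scale, so the 0.959 reported for rolling_vol_7d is just under 1% of rows (672 / 70,080), and that the total is the sum of the per-column counts (672 + 96 + 96 + 4 + 4 = 872).

```r
# Hypothetical toy data frame (not the real dataset)
toy <- data.frame(
  close      = c(1.0, 1.1, 1.2, 1.3),
  returns_1h = c(NA, 0.10, 0.09, 0.08),
  vol_24h    = c(NA, NA, NA, 0.01)
)

n_miss   <- colSums(is.na(toy))          # missing count per column: 0, 1, 3
pct_miss <- 100 * colMeans(is.na(toy))   # missing share in percent: 0, 25, 75
data.frame(variable = names(n_miss), n_miss, pct_miss, row.names = NULL)

# The overall total equals the sum of the per-column counts
sum(is.na(toy)) == sum(n_miss)
```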

miss_var_summary(df)

set.seed(123)
df_small <- dplyr::slice_sample(df, n = 5000)

naniar::vis_miss(df_small)

naniar::gg_miss_var(df_small) +
  ggplot2::labs(title = "Percentage of missing values by variable (sample of 5k rows)")

2.3 Descriptive statistics

We use timestamp, asset, and close as the core variables, and add volume and num_trades. These columns contain no missing values and are directly relevant for describing and visualizing price trends. We exclude the derived variables returns_1h, rolling_vol_24h, rolling_vol_7d, future_return_1h, and future_return_24h: their missing values come from rolling-window calculations and time lags (the first or last hours of each series), and they would complicate the visualizations without adding to this analysis.
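The missing-value counts in Section 2.2 are consistent with this explanation. The sketch below simulates one hypothetical asset and counts the NAs that lags and trailing windows produce. How the dataset's author actually computed these columns is not documented here, so the helpers `roll_sd` and `future_return` are illustrative assumptions: a trailing standard deviation of hourly returns, and a simple forward percentage change.

```r
# One hypothetical asset with 200 simulated hourly closes (not the real data)
set.seed(1)
n <- 200
close <- 100 * cumprod(1 + rnorm(n, sd = 0.01))

# 1-hour return: undefined for the first hour (no previous close)
returns_1h <- c(NA, close[-1] / close[-n] - 1)

# Trailing standard deviation over the last `w` returns (NA until the window is full)
roll_sd <- function(x, w) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (i >= w) out[i] <- sd(x[(i - w + 1):i])
  }
  out
}

# Forward return over `h` hours: undefined for the last `h` hours
future_return <- function(x, h) {
  c(x[-seq_len(h)] / x[seq_len(length(x) - h)] - 1, rep(NA_real_, h))
}

na_per_asset <- c(
  returns_1h        = sum(is.na(returns_1h)),
  rolling_vol_24h   = sum(is.na(roll_sd(returns_1h, 24))),
  rolling_vol_7d    = sum(is.na(roll_sd(returns_1h, 168))),
  future_return_1h  = sum(is.na(future_return(close, 1))),
  future_return_24h = sum(is.na(future_return(close, 24)))
)
na_per_asset * 4   # four assets: 4, 96, 672, 4, 96
```

Under these assumptions the per-asset counts, multiplied by four assets, reproduce the n_miss column exactly, which supports reading the gaps as structural rather than as data errors.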

vars_keep <- c("timestamp","asset","close","volume","num_trades")
df_clean <- df %>%
  dplyr::select(dplyr::any_of(vars_keep)) %>%
  dplyr::arrange(asset, timestamp)

dim(df_clean); head(df_clean, 4)

## [1] 70080     5

2.3.1 Overall descriptive statistics

overall_stats <- df_clean %>%
  summarise(
    n_obs        = n(),
    start_ts     = min(timestamp, na.rm = TRUE),
    end_ts       = max(timestamp, na.rm = TRUE),
    mean_close   = mean(close, na.rm = TRUE),
    median_close = median(close, na.rm = TRUE),
    sd_close     = sd(close, na.rm = TRUE),
    min_close    = min(close, na.rm = TRUE),
    max_close    = max(close, na.rm = TRUE),
    total_volume = sum(volume, na.rm = TRUE),
    total_trades = sum(num_trades, na.rm = TRUE)
  )
kable(overall_stats, caption = "Overall descriptive statistics")

Overall descriptive statistics
n_obs	start_ts	end_ts	mean_close	median_close	sd_close	min_close	max_close	total_volume	total_trades
70080	2023-09-11 09:00:00	2025-09-10 08:00:00	24916.63	2151.56	38942.08	0.515325	144263.7	988431683311	20650320962

2.3.2 Descriptive statistics by asset

by_asset_stats <- df_clean %>%
  group_by(asset) %>%
  summarise(
    n_obs        = n(),
    mean_close   = mean(close, na.rm = TRUE),
    median_close = median(close, na.rm = TRUE),
    sd_close     = sd(close, na.rm = TRUE),
    min_close    = min(close, na.rm = TRUE),
    max_close    = max(close, na.rm = TRUE),
    total_volume = sum(volume, na.rm = TRUE),
    total_trades = sum(num_trades, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(asset)
kable(by_asset_stats, caption = "Descriptive statistics by asset")

Descriptive statistics by asset
asset	n_obs	mean_close	median_close	sd_close	min_close	max_close	total_volume	total_trades
ADA	17520	1.239871	1.12753	5.074697e-01	0.515325	2.801942e+00	970424967582	1940839178
BTC	17520	88394.071017	86060.73098	2.428430e+04	47528.375239	1.442637e+05	28062474	841288169
ETH	17520	11036.583132	11547.91887	4.913139e+03	3870.376797	2.194045e+04	380635097	15230505036
SOL	17520	234.629323	216.59040	6.252055e+01	134.561174	4.327438e+02	17598018158	2637688579

The dataset contains 70,080 observations spanning September 11, 2023 to September 10, 2025, that is, 17,520 hourly observations for each of four assets. Because the data are synthetic, they may include extreme values that have not occurred in real markets (for example, BTC reaching about 144.3k). The price ranking is nonetheless plausible: BTC is the most expensive, followed by ETH, SOL, and ADA. The pooled statistics in Section 2.3.1 mix four very different price scales, so the per-asset table is the more informative one. The gaps between the medians and the maxima, together with the large standard deviations, point to distributions with long upper tails and occasional outliers. Volume and number of trades also differ by orders of magnitude across assets, so logarithmic scales or per-asset normalization make the plots easier to read and keep the largest values from dominating.
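Two common ways to normalize within each asset are sketched below on hypothetical toy prices: indexing each series to 100 at its first observation, and z-scoring within asset. The column names `close_index` and `close_z` are illustrative, and the rows are assumed to be sorted by time within each asset.

```r
# Hypothetical toy prices for two assets on very different scales
toy <- data.frame(
  asset = rep(c("BTC", "ADA"), each = 4),
  close = c(50000, 52000, 51000, 55000, 0.50, 0.55, 0.52, 0.60)
)

# Option 1: index each asset to 100 at its first observation
toy$close_index <- ave(toy$close, toy$asset,
                       FUN = function(x) 100 * x / x[1])

# Option 2: z-score within each asset (mean 0, sd 1 per asset)
toy$close_z <- ave(toy$close, toy$asset,
                   FUN = function(x) (x - mean(x)) / sd(x))

toy
```

The index keeps the time path readable as a percentage of the starting level, while the z-score is more convenient when comparing the shapes of distributions.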

3 Step 3. Data Visualization

ggplot(df_clean, aes(x = close)) +
  geom_histogram(bins = 40) +
  facet_wrap(~ asset, scales = "free_x") +
  labs(title = "Distribution of Close by asset", x = "Close", y = "Count")

ADA, BTC, and SOL show a right skew, with most observations in the lower part of each asset's price range and a thinner tail toward higher prices. ADA clusters around 0.8-1.2, with a long tail extending beyond 2. BTC exhibits multiple peaks (around 70,000-85,000 and 110,000-125,000), suggesting distinct price regimes. ETH is bimodal, with peaks around 5,000-6,000 and 12,000-16,000; its mean (about 11,037) lies below its median (about 11,548) in Section 2.3.2, so it is not clearly right-skewed. Finally, SOL clusters near 160-220, with another peak around 280-340 and a tail extending beyond 400. These patterns suggest heterogeneous volatility and occasional price spikes.
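A rough numeric check on the direction of skew uses Pearson's second skewness coefficient, 3 * (mean - median) / sd, computed from the per-asset table in Section 2.3.2. This is a coarse summary chosen here for illustration, not a method used in the original analysis, and it is not a substitute for looking at the histograms.

```r
# Values copied from the per-asset table in Section 2.3.2
stats <- data.frame(
  asset  = c("ADA", "BTC", "ETH", "SOL"),
  mean   = c(1.239871, 88394.071017, 11036.583132, 234.629323),
  median = c(1.12753, 86060.73098, 11547.91887, 216.59040),
  sd     = c(0.5074697, 24284.30, 4913.139, 62.52055)
)

# Positive values point to a right skew, negative values to a left skew
stats$pearson_skew <- 3 * (stats$mean - stats$median) / stats$sd
stats[, c("asset", "pearson_skew")]
```

The coefficient is positive for ADA, BTC, and SOL and negative for ETH, whose mean lies below its median.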

ggplot(dplyr::filter(df_clean, close > 0), aes(x = close)) +
  geom_histogram(bins = 40) +
  facet_wrap(~ asset, scales = "free_x") +
  scale_x_log10() +
  labs(title = "Distribution of Close (log scale) by asset", x = "Close (log10)", y = "Count")

Using a logarithmic scale for the X-axis compresses extreme values and makes the shapes of the distributions easier to compare across assets. The right skew remains but is less pronounced, revealing a clearer multimodal structure for BTC and ETH and a more compact distribution for ADA and SOL. In summary, the log-scale view improves readability and confirms that price levels differ systematically by asset, while the distributions share a broadly similar asymmetric shape.

ggplot(df_clean, aes(x = asset, y = close)) +
  geom_boxplot() +
  labs(title = "Close by asset", x = "Asset", y = "Close")

BTC shows the highest price level, with a wide interquartile range and a long upper whisker, indicating large absolute price dispersion and occasional spikes. ETH sits far below BTC but still well above SOL and ADA, which have the lowest price levels and the narrowest ranges on this axis. The large gaps between the medians reflect the assets' very different price scales, and the outliers and long whiskers (especially for BTC) indicate a right-skewed distribution and greater dispersion in absolute terms.

ggplot(df_clean, aes(x = asset, y = volume)) +
  geom_boxplot() +
  labs(title = "Volume by asset", x = "Asset", y = "Volume")

ADA dominates the scale, with a much higher typical volume and numerous extreme values, suggesting frequent spikes in market activity. BTC, ETH, and SOL appear compressed near zero on the original scale, which makes them hard to compare. Volumes therefore differ by orders of magnitude across assets, and a linear axis is not suitable for comparing them.
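One possible reason for the mismatch is that volume is recorded in units of each asset (coins traded) rather than in a common currency. That is an assumption; the unit is not stated in this report. Under it, multiplying volume by close gives a notional value that is comparable across assets. The sketch uses hypothetical round numbers, and `notional` is an illustrative column name.

```r
# Hypothetical hourly rows; the unit of `volume` (coins traded) is an assumption
toy <- data.frame(
  asset  = c("ADA", "BTC"),
  close  = c(0.88, 88000),
  volume = c(57000000, 1600)
)

# Notional value traded = price x quantity
toy$notional <- toy$close * toy$volume
toy
```

Here the raw volumes differ by a factor of more than 35,000, while the notional values differ by less than a factor of 3.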

ggplot(dplyr::filter(df_clean, close > 0), aes(x = asset, y = close)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(title = "Close by asset (log scale)", x = "Asset", y = "Close (log10)")

Using a logarithmic scale for the Y-axis allows a direct comparison of the price levels of the different assets. BTC is at the top, followed by ETH, SOL, and ADA, with a clear separation in magnitude. For each asset, the price distribution is easier to interpret on a logarithmic scale. A log axis shows relative rather than absolute dispersion: BTC has the widest spread in absolute terms, but by the minima and maxima in Section 2.3.2, ETH and ADA cover the largest max/min ratios (about 5.7 and 5.4), against about 3.0 for BTC and 3.2 for SOL. The overall asymmetry is reduced but persists, especially at high price levels.
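Relative dispersion can be read from the per-asset table in Section 2.3.2. The sketch below computes two scale-free measures chosen here for illustration: the coefficient of variation (sd / mean) and the log10 of the max/min ratio, which is the vertical extent each asset occupies on a log10 axis.

```r
# Values copied from the per-asset table in Section 2.3.2
stats <- data.frame(
  asset = c("ADA", "BTC", "ETH", "SOL"),
  mean  = c(1.239871, 88394.071017, 11036.583132, 234.629323),
  sd    = c(0.5074697, 24284.30, 4913.139, 62.52055),
  min   = c(0.515325, 47528.375239, 3870.376797, 134.561174),
  max   = c(2.801942, 144263.7, 21940.45, 432.7438)
)

stats$cv          <- stats$sd / stats$mean          # relative spread
stats$log10_range <- log10(stats$max / stats$min)   # extent on a log10 axis
stats[order(-stats$log10_range), c("asset", "cv", "log10_range")]
```

On both measures ETH and ADA rank above BTC and SOL, so the ordering of dispersion on a log axis differs from the ordering on a linear axis.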

ggplot(dplyr::filter(df_clean, volume > 0), aes(x = asset, y = volume)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(title = "Volume by asset (log scale)", x = "Asset", y = "Volume (log10)")

On the logarithmic scale, ADA exhibits the highest trading volumes and the greatest variability, followed by SOL. ETH, on the other hand, has lower volumes, and BTC shows the lowest typical volumes in this dataset. The large number of data points above the box plots indicates frequent volume spikes, particularly for ADA and SOL. Overall, trading activity varies by several orders of magnitude across different assets, making the use of a logarithmic Y-axis essential for a fair comparison of the central ranges.
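The size of the gap can be quantified from the totals in Section 2.3.2. The sketch below divides each asset's total volume by its 17,520 hourly observations and expresses the largest-to-smallest ratio in orders of magnitude. The variable names are illustrative.

```r
# Totals copied from the per-asset table in Section 2.3.2
total_volume <- c(ADA = 970424967582, BTC = 28062474,
                  ETH = 380635097,    SOL = 17598018158)
n_obs <- 17520

# Mean hourly volume per asset
mean_hourly <- total_volume / n_obs
sort(mean_hourly, decreasing = TRUE)

# Gap between the largest and smallest mean, in powers of ten
log10(max(mean_hourly) / min(mean_hourly))   # about 4.5
```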

df_sc <- df_clean %>%
  filter(!is.na(close), !is.na(volume), close > 0, volume > 0) %>%
  mutate(log_close = log10(close), log_volume = log10(volume))

ggplot(df_sc, aes(x = log_close, y = log_volume, color = asset)) +
  geom_point(alpha = 0.35) +
  geom_smooth(se = FALSE, method = "loess") +
  labs(title = "log10(Volume) vs log10(Close)", x = "log10(Close)", y = "log10(Volume)")

## `geom_smooth()` using formula = 'y ~ x'

The scatter plot shows four distinct clusters, one per asset, because each trades on a different price and volume scale. Across clusters, the lower-priced assets (ADA and SOL) show higher trading volumes, while the higher-priced ones (BTC and ETH) sit in lower volume ranges. Within each cluster, the trend is nearly flat or slightly positive, suggesting that volume rises only slightly with that asset's price.
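The difference between the across-cluster and within-cluster patterns can be quantified by comparing a pooled regression slope with one slope per asset. The sketch below uses simulated data for two hypothetical assets, built to have a large gap in levels and a small positive within-asset slope. It illustrates the technique and is not a result from the real dataset.

```r
# Simulated data for two hypothetical assets (not the real dataset)
set.seed(42)
asset      <- rep(c("low_price", "high_price"), each = 50)
log_close  <- c(rnorm(50, 0, 0.1), rnorm(50, 4, 0.1))
level      <- ifelse(asset == "low_price", 7.5, 3.5)
log_volume <- level + 0.2 * (log_close - ave(log_close, asset)) +
  rnorm(100, 0, 0.02)
toy <- data.frame(asset, log_close, log_volume)

# Pooled slope: dominated by the gap between the two clusters
pooled <- coef(lm(log_volume ~ log_close, data = toy))[["log_close"]]

# Within-asset slopes: one regression per asset
within <- sapply(split(toy, toy$asset), function(d) {
  coef(lm(log_volume ~ log_close, data = d))[["log_close"]]
})

pooled   # negative
within   # both positive
```

With the real data, the same split-and-fit pattern could be applied to df_sc to see whether the within-asset slopes are in fact close to zero.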

ggplot(df_clean, aes(x = close, y = num_trades, color = asset)) +
  geom_point(alpha = 0.25) +
  scale_y_log10() +
  labs(title = "Number of trades vs Close", x = "Close", y = "Num trades (log scale)")

The number of trades varies by asset. ETH records the highest number of trades across its price range, while BTC records far fewer, even at high prices. ADA and SOL trade at lower prices with intermediate trade counts. Within each asset, the relationship between price and number of trades is flat or slightly positive: hours with higher prices tend to have a similar or slightly higher number of trades.

agg <- df_clean %>%
  group_by(asset) %>%
  summarise(
    mean_close    = mean(close, na.rm = TRUE),
    median_volume = median(volume, na.rm = TRUE),
    .groups = "drop"
  )

ggplot(agg, aes(x = asset, y = mean_close)) +
  geom_col() +
  labs(title = "Mean Close by asset", x = "Asset", y = "Mean Close")

ggplot(agg, aes(x = asset, y = median_volume)) +
  geom_col() +
  labs(title = "Median Volume by asset", x = "Asset", y = "Median Volume")

BTC has the highest mean closing price, followed by ETH, while SOL and ADA are far lower. These differences reflect the assets' very different price scales and are consistent with the box plots above.

ADA has the highest median trading volume, followed by SOL; ETH is considerably lower, and BTC is the lowest in this dataset. The gap between ADA and SOL is itself large: by the totals in Section 2.3.2, ADA's total volume is roughly 55 times SOL's. These differences suggest very different levels of recorded trading activity across assets.

Javier Aguilar

2025-09-23