RendonAlgecira_hwk1_SMWR_2025_II
1 Part I: Introduction to probability and statistics
1.1 From the IDEAM, CAR, IDIGER or other national or regional agencies, obtain a time series (TS) of hourly precipitation in mm in any place in Colombia. The series must have at least 15 yrs of continuous hourly records.
The station selected is SOCHA-AUT-24035360 from the Boyacá department. This data was obtained from datos.gov.co using the python API, the time interval is 10 minutes.
# Library
library(data.table)
library(ggplot2)
# Read the csv
ruta = normalizePath("D:/Users/Joshua/ME/hmwk1/Data.csv")
df = fread(ruta)
# Convert to POSIXct
df[, fechaobservacion := as.POSIXct(fechaobservacion)]
# Remove rows from years 2005, 2008, 2009, 2023, 2024, 2025
df = df[!format(fechaobservacion, "%Y") %in% c("2005", "2008", "2009", "2023", "2024", "2025")]
# Group the data hourly
TS_Hourly = df[, .(Hourly_precipitation = sum(valorobservado, na.rm = TRUE)),
by = .(Hour = format(fechaobservacion, "%Y-%m-%d %H:00:00"))]
TS_Hourly[, Hour := as.POSIXct(Hour)]
# Plot the data
ggplot(TS_Hourly, aes(x = Hour, y = Hourly_precipitation)) +
geom_point(alpha = 0.3, size = 0.5, color = "darkblue") +
labs(title = "Hourly Precipitation",
x = "Date",
y = "Precipitation (mm)") +
theme_minimal()
There is lack of data between October of 2008 and June of 2008, and between May of 2023 and September of 2024.
1.2 From the hourly TS, build a series of the following data: 1) duration (hours) of rainfall events, 2) intensity (mm of rainfall) of the continuous event and 3) time (hours) between rainfall events.
For this, is necessary to count the number of hours of continuous precipitation and sum the precipitation in this hours, then count the time between the end of the first event and the beginign of the next event.
# Create a binary indicator of rain (1 if rain > 0, else 0)
TS_Hourly[, rain := ifelse(Hourly_precipitation > 0, 1, 0)]
# Identify consecutive rain blocks
# rleid() creates a run ID for sequences of equal values
TS_Hourly[, event_id := rleid(rain)]
# Keep only rain events (rain == 1) and compute stats per event
Rainfall_Event = TS_Hourly[rain == 1,
.(duration_hours = .N, # number of rows = hours
total_precip = sum(Hourly_precipitation)/.N, # total precipitation
start_time = min(Hour)), # event start time
by = event_id]
# Compute time between consecutive events (in hours)
Rainfall_Event[, time_to_next := as.numeric(
shift(start_time, type = "lead") - start_time,
units = "hours"
)- duration_hours]
# Final result
Rainfall_Event
## event_id duration_hours total_precip start_time time_to_next
## <int> <int> <num> <POSc> <num>
## 1: 2 1 0.1000000 2006-01-05 07:00:00 38
## 2: 4 4 0.5750000 2006-01-06 22:00:00 1
## 3: 6 1 0.1000000 2006-01-07 03:00:00 11
## 4: 8 1 0.2000000 2006-01-07 15:00:00 5
## 5: 10 3 0.2666667 2006-01-07 21:00:00 1
## ---
## 4623: 9246 1 0.1000000 2022-12-06 16:00:00 8
## 4624: 9248 1 0.1000000 2022-12-07 01:00:00 192
## 4625: 9250 1 0.1000000 2022-12-15 02:00:00 354
## 4626: 9252 1 0.1000000 2022-12-29 21:00:00 5
## 4627: 9254 1 0.1000000 2022-12-30 03:00:00 NA
1.3 Accumulate the hourly TS to daily, monthly and annual precipitation.
To group daily, monthly and annual, is necessary sum the data in their corresponding times.
# Group Daily
TS_Daily = TS_Hourly[, .(Daily_precip = sum(Hourly_precipitation, na.rm = TRUE)),
by = .(Date = as.Date(Hour))]
# Group Monthly
TS_Monthly = TS_Hourly[, .(Monthly_precip = sum(Hourly_precipitation, na.rm = TRUE)),
by = .(YearMonth = format(Hour, "%Y-%m"))]
# Group Annual
TS_Annual = TS_Hourly[, .(Annual_precip = sum(Hourly_precipitation, na.rm = TRUE)),
by = .(Year = format(Hour, "%Y"))]
# Convert Year/Month to proper Date (first day of month)
TS_Monthly[, YearMonth := as.IDate(paste0(YearMonth, "-01"))]
# Convert Year to proper Date (first day of year)
TS_Annual[, Year := as.IDate(paste0(Year, "-01-01"))]
# Show results
TS_Daily
## Date Daily_precip
## <Date> <num>
## 1: 2006-01-01 0.0
## 2: 2006-01-02 0.0
## 3: 2006-01-03 0.0
## 4: 2006-01-04 0.0
## 5: 2006-01-05 0.1
## ---
## 5037: 2022-12-26 0.0
## 5038: 2022-12-27 0.0
## 5039: 2022-12-28 0.0
## 5040: 2022-12-29 0.0
## 5041: 2022-12-30 0.2
## YearMonth Monthly_precip
## <IDat> <num>
## 1: 2006-01-01 25.2
## 2: 2006-02-01 5.8
## 3: 2006-03-01 156.5
## 4: 2006-04-01 154.1
## 5: 2006-05-01 146.8
## ---
## 173: 2022-08-01 31.9
## 174: 2022-09-01 45.8
## 175: 2022-10-01 94.4
## 176: 2022-11-01 1.0
## 177: 2022-12-01 2.5
## Year Annual_precip
## <IDat> <num>
## 1: 2006-01-01 817.3
## 2: 2007-01-01 582.2
## 3: 2010-01-01 961.1
## 4: 2011-01-01 1181.3
## 5: 2012-01-01 595.5
## 6: 2013-01-01 891.4
## 7: 2014-01-01 408.1
## 8: 2015-01-01 441.7
## 9: 2016-01-01 409.8
## 10: 2017-01-01 264.2
## 11: 2018-01-01 581.4
## 12: 2019-01-01 690.5
## 13: 2020-01-01 601.2
## 14: 2021-01-01 455.7
## 15: 2022-01-01 221.9
1.4 Estimate the mean, the variance, the skewness, the kurtosis and the interquartile range for the original TS and the TS build in 2 and 3. Organize the results in tables and perform the analysis of the results.
A data frame with the statistical parameter as the vertical header, and the variable as the horizontal headers.
# Library
library(moments)
# Function to compute stats
get_stats = function(x) {
round(c(mean = mean(x, na.rm = TRUE),
variance = var(x, na.rm = TRUE),
standar_deviation = sqrt(var(x, na.rm = TRUE)),
skewness = skewness(x, na.rm = TRUE),
kurtosis = kurtosis(x, na.rm = TRUE),
IQR = IQR(x, na.rm = TRUE)),3)
}
# Apply function to each dataset
stats_hourly = get_stats(TS_Hourly$Hourly_precipitation)
stats_duration_hours = get_stats(Rainfall_Event$duration_hours)
stats_total_precip = get_stats(Rainfall_Event$total_precip)
stats_time_to_next = get_stats(Rainfall_Event$time_to_next)
stats_daily = get_stats(TS_Daily$Daily_precip)
stats_monthly = get_stats(TS_Monthly$Monthly_precip)
stats_annual = get_stats(TS_Annual$Annual_precip)
# Combine into one table
Stats_Table = rbind(
TS_Hourly = stats_hourly,
duration_hours = stats_duration_hours,
total_precip = stats_total_precip,
time_to_next = stats_time_to_next,
TS_Daily = stats_daily,
TS_Monthly = stats_monthly,
TS_Annual = stats_annual
)
Stats_Table <- as.data.table(Stats_Table, keep.rownames = "Time_Series")
Stats_Table$CV <- round(sqrt(Stats_Table$variance) / Stats_Table$mean * 100, 3)
Stats_Table
## Time_Series mean variance standar_deviation skewness kurtosis IQR
## <char> <num> <num> <num> <num> <num> <num>
## 1: TS_Hourly 0.079 0.398 0.631 26.422 1337.739 0.00
## 2: duration_hours 2.188 4.943 2.223 6.160 102.739 2.00
## 3: total_precip 0.612 1.459 1.208 8.956 152.400 0.55
## 4: time_to_next 29.993 83894.503 289.645 54.584 3301.621 18.00
## 5: TS_Daily 1.806 27.984 5.290 15.964 536.701 1.30
## 6: TS_Monthly 51.431 2503.449 50.034 1.936 8.739 46.50
## 7: TS_Annual 606.887 70228.841 265.007 0.580 2.689 328.15
## CV
## <num>
## 1: 798.573
## 2: 101.613
## 3: 197.368
## 4: 965.710
## 5: 292.912
## 6: 97.285
## 7: 43.667
1.4.1 TS_Hourly
# Create a continuous hourly sequence
full_time = data.table(Hour = seq(min(TS_Hourly$Hour),
max(TS_Hourly$Hour),
by = "hour"))
# Merge with TS_Hourly to introduce NA values for missing hours
TS_Hourly_full = merge(full_time, TS_Hourly, by = "Hour", all.x = TRUE)
# Plot as line chart (gaps shown where data is missing)
ggplot(TS_Hourly_full, aes(x = Hour, y = Hourly_precipitation)) +
geom_line(color = "darkblue", linewidth = 0.4, na.rm = FALSE) +
labs(title = "Hourly Precipitation",
x = "Date",
y = "Precipitation (mm)") +
theme_minimal()
The mean of this time series is near to zero with a value (0.079), because in most of the time there is not precipitation happening. The Variance has a low value (0.398) since the same logic of the mean. Additionally, The hourly precipitation presents a high coefficient of variation (798.573%) owing to most of the values are zero. Is important to denote IQR is (0) because as the past parameters there are most of the values like zero
The skewness for the hourly precipitation (26.422) says longer tail is on the right, that can be seen in the histogram, where the tails is on the right. The great kurtosis (1337.739) declare in this case a high peak near to the mean, that is depicted in the histogram where the biggest bin is located in the mean with a instantly decrease.
library(gridExtra)
# Convert to data.table
DT = as.data.table(TS_Hourly)
# Regular histogram (bars)
p1 = ggplot(DT, aes(x = Hourly_precipitation)) +
geom_histogram(binwidth = 1, fill = "skyblue", color = "black", alpha = 0.7) +
labs(title = "Histogram of Hourly Precipitation",
x = "Precipitation (mm)",
y = "Count") +
theme_minimal()
# Compute cumulative counts (raw)
hist_data = hist(DT$Hourly_precipitation,
breaks = seq(0, max(DT$Hourly_precipitation, na.rm = TRUE), by = 1),
plot = FALSE)
cum_counts = cumsum(hist_data$counts)
cum_df = data.frame(
mid = hist_data$mids,
cum_count = cum_counts)
# Cumulative histogram
p2 = ggplot(cum_df, aes(x = mid, y = cum_count)) +
geom_col(fill = "skyblue", color = "black", alpha = 0.7, width = 1) +
labs(title = "Cumulative Histogram ",
x = "Precipitation (mm)",
y = "Count") +
theme_minimal()
# Arrange plots side by side
grid.arrange(p1, p2, ncol = 2)
1.4.2 Duration hours of rainfall event
There are something to denote: first, it seems to be a slightly increase in the maximum duration hours of rainfall across the time, and second, there is a possible outlier, which has a value of almost 60 hours of duration.
ggplot(Rainfall_Event, aes(x = start_time, y = duration_hours)) +
# vertical lines to the baseline
geom_segment(aes(x = start_time, xend = start_time,
y = 0, yend = duration_hours),
color = "gray50", linewidth = 0.4) +
# points on top
geom_point(color = "darkblue", size = 2, alpha = 0.8) +
labs(title = "Rainfall Event Durations",
x = "Date",
y = "Duration (hours)") +
theme_minimal()
This variable count the time of each rainfall event in hours, the average time of rainfall event is (2.188) hours, with a standard deviation of (2.223), which is very close to the mean, this is evidenced in the Coefficient of variation with the value of (101.613%) , it alludes to a barely variation of the values respect to the mean. The IQR presents a relatively low value (2).
The Skewness (6.16) is high, which denotes long tail at the right as is shown in the histogram, and the high kurtosis (102.739) conveys to a peak near located in the mean with a fast decrease .
# Convert to data.table
DT = as.data.table(Rainfall_Event)
# Regular histogram (bars) of event duration
p1 = ggplot(DT, aes(x = duration_hours)) +
geom_histogram(binwidth = 1, fill = "skyblue", color = "black", alpha = 0.7) +
labs(title = "Histogram of Event Duration",
x = "Duration (hours)",
y = "Count") +
theme_minimal()
# Compute cumulative counts (raw) for event duration
hist_data = hist(DT$duration_hours,
breaks = seq(0, max(DT$duration_hours, na.rm = TRUE), by = 1),
plot = FALSE)
cum_counts = cumsum(hist_data$counts)
cum_df = data.frame(
mid = hist_data$mids,
cum_count = cum_counts
)
# Cumulative histogram (bars)
p2 = ggplot(cum_df, aes(x = mid, y = cum_count)) +
geom_col(fill = "skyblue", color = "black", alpha = 0.7, width = 1) +
labs(title = "Cumulative Histogram of Event Duration",
x = "Duration (hours)",
y = "Cumulative Count") +
theme_minimal()
# Arrange plots side by side
grid.arrange(p1, p2, ncol = 2)
1.4.3 Intensity of rainfall events
ggplot(Rainfall_Event, aes(x = start_time, y = total_precip)) +
# vertical lines to the baseline
geom_segment(aes(x = start_time, xend = start_time,
y = 0, yend = total_precip),
color = "gray50", linewidth = 0.4) +
# points on top
geom_point(color = "darkblue", size = 2, alpha = 0.8) +
labs(title = "Intensity",
x = "Date",
y = "Intensity (mm/hour)") +
theme_minimal()
The mean (0.612) of this variable is low but the standard deviation (1.208) is certainly high,