In prepping for a stats course, I was searching around for some data with a tangible interest for students. Turns out, the Geosciences department here at the university has a weather station, from which you can download minute-by-minute weather data! They only have the data compiled into year’s worth csvs between 2007 and 2015, but that’s still plenty to dig our teeth into.
library(tidyverse)
library(magrittr)
library(lubridate)
library(ggplot2)
library(mgcv)
library(itsadug)
I’ll suck down the csvs by
map()ping read_csv() ton the urls.weathers <- paste("https://www.geos.ed.ac.uk/~weather/jcmb_ws/JCMB_", seq(2007, 2015), ".csv", sep = "")
weather_list <- map(weathers, read_csv)
weather_df <- bind_rows(weather_list)
nrow(weather_df)
[1] 4572805
head(weather_df)
So, this is a pretty meaty data frame. We’ll convert date-time to an actual datetime format with one of the handy dandy lubridate functions.
weather_df <- weather_df%>%
mutate(`date-time`= ymd_hm(`date-time`))
All formats failed to parse. No formats found.
mean(diff(weather_df$`date-time`))
Time difference of 60.98193 secs
Yeah, so roughly one recording every minute. I was specifically interested in the temperature data, and in my first blush with it, there were some really values.
range(weather_df$`surface temperature (C)`)
[1] -9999.0 89.5
Sufficed to say, it has never been -9999 degrees celcius, nor 89 degrees celcius for even one minute in Edinburgh. There is some more weirdness at the very low temperatures though.
weather_df %>%
arrange(`surface temperature (C)`)%>%
slice(1:100)%>%
use_series("surface temperature (C)")
[1] -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00
[15] -9999.00 -9999.00 -9999.00 -9999.00 -6999.00 -40.00 -40.00 -40.00 -40.00 -40.00 -40.00 -40.00 -40.00 -40.00
[29] -40.00 -40.00 -34.97 -26.43 -9.28 -9.28 -9.27 -9.27 -9.24 -9.23 -9.23 -9.23 -9.23 -9.23
[43] -9.22 -9.22 -9.22 -9.22 -9.21 -9.20 -9.20 -9.19 -9.17 -9.17 -9.17 -9.16 -9.13 -9.13
[57] -9.12 -9.12 -9.12 -9.12 -9.11 -9.11 -9.11 -9.10 -9.10 -9.10 -9.09 -9.08 -9.08 -9.07
[71] -9.07 -9.06 -9.06 -9.06 -9.06 -9.06 -9.05 -9.05 -9.05 -9.05 -9.04 -9.04 -9.04 -9.04
[85] -9.04 -9.03 -9.03 -9.03 -9.01 -9.01 -9.00 -9.00 -8.97 -8.97 -8.97 -8.96 -8.96 -8.95
[99] -8.95 -8.93
The sudden step function brtween -26 and -9 is kind of strange, and -26 is not a very plausible temperature for this city.
Looking at the higher temperatures, 89 is the only strange outlier.
weather_df %>%
arrange(-`surface temperature (C)`)%>%
slice(1:100)%>%
use_series("surface temperature (C)")
[1] 89.50 28.98 28.86 28.84 28.84 28.83 28.80 28.79 28.79 28.76 28.76 28.76 28.76 28.76 28.74 28.72 28.72 28.71 28.71 28.71 28.70
[22] 28.70 28.70 28.69 28.69 28.69 28.68 28.68 28.68 28.68 28.67 28.66 28.66 28.65 28.65 28.64 28.64 28.63 28.63 28.63 28.63 28.62
[43] 28.62 28.62 28.62 28.61 28.61 28.61 28.60 28.60 28.59 28.59 28.59 28.58 28.58 28.58 28.58 28.58 28.58 28.57 28.57 28.57 28.57
[64] 28.57 28.57 28.56 28.56 28.56 28.55 28.55 28.55 28.55 28.54 28.54 28.54 28.54 28.54 28.54 28.53 28.53 28.53 28.53 28.53 28.52
[85] 28.52 28.52 28.52 28.52 28.51 28.51 28.51 28.51 28.50 28.50 28.50 28.50 28.50 28.50 28.49 28.49
I have to say that I don’t think 28 C is a very plausible temperature for the city either! But I’ll leave it in.
For the purpose of modelling, I filtered out these extrema, and summarized the data down to just average temperature per hour. I’ll also convert the date to day of the year, and year to a factor for the purpose of modelling.
weather_df %>%
filter(`surface temperature (C)` > -20,
`surface temperature (C)` < 40)%>%
mutate(date = date(`date-time`),
hour = hour(`date-time`)) %>%
group_by(date, hour)%>%
summarise(temp = mean(`surface temperature (C)`))%>%
ungroup() %>%
mutate(yday = yday(date),
year = year(date) %>% factor()) ->temps
summary(temps)
date hour temp yday year
Min. :2007-01-01 Min. : 0.0 Min. :-8.876 Min. : 1.0 2008 : 8770
1st Qu.:2009-03-07 1st Qu.: 5.0 1st Qu.: 5.036 1st Qu.: 89.0 2009 : 8760
Median :2011-06-12 Median :12.0 Median : 8.838 Median :180.0 2014 : 8753
Mean :2011-06-04 Mean :11.5 Mean : 8.751 Mean :180.4 2011 : 8718
3rd Qu.:2013-08-28 3rd Qu.:18.0 3rd Qu.:12.397 3rd Qu.:270.0 2007 : 8714
Max. :2015-11-02 Max. :23.0 Max. :28.560 Max. :366.0 2013 : 8687
(Other):23822
And here’s a few visualizations:
temps %>%
ggplot(aes(date, temp))+
geom_point(size = 1, alpha = 0.2)+
theme_minimal()
temps %>%
ggplot(aes(yday, temp))+
geom_point(size= 1, alpha=0.1, aes(color = hour))+
scale_color_gradient2(high = "#2b83ba",
low = "#2b83ba",
mid = "#d7191c",
midpoint = 12)+
theme_minimal()
temps %>%
ggplot(aes(hour, temp))+
geom_point(size= 1, alpha=0.1, aes(color = yday))+
scale_color_gradient2(high = "#2b83ba",
low = "#2b83ba",
mid = "#d7191c",
midpoint = 365/2)+
theme_minimal()
I did a bunch of messing around with the paramters for the generalized additive model, and I think this is the best I came up with.
I don’t want to include all that work. Here’s the model as best I got it.
model_acf <- bam(temp ~
s(yday, bs = 'cc', k = 25) +
s(hour, bs = 'cc', k = 24)+
ti(yday, hour, bs = c('cc', 'cc'), k = c(25, 24))+
s(year, bs = 're')+
s(yday, year, bs = 'fs', m = 1),
rho = 0.9780324,
data=temps)
summary(model_acf)
Family: gaussian
Link function: identity
Formula:
temp ~ s(yday, bs = "cc", k = 25) + s(hour, bs = "cc", k = 24) +
ti(yday, hour, bs = c("cc", "cc"), k = c(25, 24)) + s(year,
bs = "re") + s(yday, year, bs = "fs", m = 1)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.7221 0.1798 48.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(yday) 10.986 23 51.299 < 2e-16 ***
s(hour) 21.336 22 1317.969 < 2e-16 ***
ti(yday,hour) 301.655 506 16.646 < 2e-16 ***
s(year) 2.314 8 0.410 0.000319 ***
s(yday,year) 18.681 80 0.551 0.049769 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.699 Deviance explained = 70.1%
fREML = 59297 Scale est. = 6.3233 n = 76224
gam.check(model_acf)
Method: fREML Optimizer: perf newton
full convergence after 6 iterations.
Gradient range [-7.239578e-08,7.292326e-08]
(score 59296.99 & scale 6.323303).
eigenvalue range [-2.677858e-13,38111.84].
Model rank = 642 / 642
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(yday) 23.00 10.99 0.94 <2e-16 ***
s(hour) 22.00 21.34 1.01 0.73
ti(yday,hour) 506.00 301.66 0.96 <2e-16 ***
s(year) 9.00 2.31 NA NA
s(yday,year) 81.00 18.68 0.94 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So, it could be better, but whatever. Here’s plots of the partial effects.
Looks like for any given year we’ve got a temperature excursion in the neighborhood of ~12 degrees, and on any given day, ~3 degrees. But there’s a very big interaction between the time of day effect and day of year!
Here’s a plot
plot_smooth(model_acf, view = "hour",
cond = list(yday = 0,
year = 2010),
rm.ranef = T,
ylim = c(0,20),
col = "#e66101",
print.summary = F)
plot_smooth(model_acf, view = "hour",
cond = list(yday = 100,
year = 2010),
rm.ranef = T,
ylim = c(0,20),
add = T,
col = "#fdb863",
print.summary = F)
plot_smooth(model_acf, view = "hour",
cond = list(yday = 200,
year = 2010),
rm.ranef = T,
ylim = c(0,20),
add = T,
col = "#b2abd2",
print.summary = F)
plot_smooth(model_acf, view = "hour",
cond = list(yday = 300,
year = 2010),
rm.ranef = T,
ylim = c(0,20),
add = T,
col = "#5e3c99",
print.summary = F)
legend(x= 10,
y = 20,
c("1 Jan", "10 Apr", "19 Jul", "27 Oct"),
col = c("#e66101", "#fdb863", "#b2abd2", "#5e3c99"),
pch = 19,
ncol = 4)
Update
I was interested in whether the time of day of peak temperature shifted across the year at all, so I got the fitted values.
temp_fits <- get_predictions(model_acf, cond = list(yday = 0:365, hour = seq(0,23, length = 200)), rm.ranef = T)
Summary:
* yday : integer vector with 366 values; set to the value(s): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...
* hour : numeric predictor; with 200 values ranging from 0.000000 to 23.000000.
* year : factor; set to the value(s): 2008. (Might be canceled as random effect, check below.)
* NOTE : The following random effects columns are canceled: s(year),s(yday,year)
temp_fits%>%
group_by(yday)%>%
filter(fit == max(fit))%>%
ggplot(aes(yday, hour))+
geom_point()+
ylim(0,23)+
theme_minimal()
For the winter months, the peak temperature seems to be around 1pm, but for the summer it’s actually closer to 3pm.
Just for completeness sake, a heatmap of the fitted values:
temp_fits %>%
ggplot(aes(yday, hour))+
geom_tile(aes(fill=fit))+
theme_minimal()