Sydney and NSW are having a lot of rain this year, thought to be due to a La Nina weather event. But this raise a few questions:
To answer this question, I initially looked for river level data in NSW, but river levels over time can be affected by development of the river basin. A cleaner set of data is rainfall data, and on a BOM website I found a set of daily rainfall data for the Sydney Botanic Gardens (next to Circular Quay) that has continuous data back to 1910. The link to the data source is shown below, and the ‘station number’ is 066006.
Data source - http://www.bom.gov.au/climate/data/index.shtml
This data was downloaded from the BOM website on 15 August 2022, and had data up to 1st August 2022. I imported this data into the RStudio data science package. Altogether there were 50,250 observations of daily rainfall in these gardens. This included 201 missing values - I replaced these missing values with the rainfall data for the previous day.
setwd("G:/My Drive/Training and development/R sandpit/Sydney rainfall analysis")
# read file
rainfall <- read.csv(file = 'Daily_rainfall_data_15Aug2022.csv') # import rainfall data
rainfall %<>% rename(month_num = Month,
rainfall_mm = 6) # rename 2 columns
rainfall %<>% mutate(Month = case_when(
month_num == 1 ~ 'January',
month_num == 2 ~ 'February',
month_num == 3 ~ 'March',
month_num == 4 ~ 'April',
month_num == 5 ~ 'May',
month_num == 6 ~ 'June',
month_num == 7 ~ 'July',
month_num == 8 ~ 'August',
month_num == 9 ~ 'September',
month_num == 10 ~ 'October',
month_num == 11 ~ 'November',
month_num == 12 ~ 'December',
TRUE ~ 'error'),
year_month = paste0(Year,'-',Month))
rainfall %<>% filter(Year > 1909) # limit to data after 2009 - which is the continuous data
rainfall$Month <- factor(rainfall$Month, levels = c('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November','December'), ordered = TRUE) # create ordered factor variable for month
rainfall$rainfall_mm <- nafill(rainfall$rainfall_mm, type = 'locf') # replace missing values with the rainfall for the earlier day
rainfall %<>% mutate(wet_day =as.integer(if_else(rainfall_mm > 0,1,0)))
rainfall %<>% group_by(year_month) %>% mutate(wet_day_count = cumsum(wet_day)) %>%
mutate(wet_day_sum = max(wet_day_count))
#head(rainfall)
Looking at the data, the first thing I noted is that rainfall is, not surprisingly, affected by the month of the year, as shown below. In this diagram, the error bars show the 95% confidence intervals.
alpha <- 0.05
monthly_mean <- rainfall %>%
group_by(Month) %>%
summarize(mean = mean(rainfall_mm),
lower_rain = mean(rainfall_mm) - qt(1- alpha/2, (n() - 1))*sd(rainfall_mm)/sqrt(n()),
upper_rain = mean(rainfall_mm) + qt(1- alpha/2, (n() - 1))*sd(rainfall_mm)/sqrt(n()),
wet_days = mean(wet_day_sum),
lower_days = mean(wet_day_sum) - qt(1- alpha/2, (n() - 1))*sd(wet_day_sum)/sqrt(n()),
upper_days = mean(wet_day_sum) + qt(1- alpha/2, (n() - 1))*sd(wet_day_sum)/sqrt(n()))
monthly_rainfall_mean <- mean(monthly_mean$mean)
monthly_mean$rainfall_correction = monthly_rainfall_mean/monthly_mean$mean # determine correction factor
monthly_rainfall_count <- mean(monthly_mean$wet_days)
monthly_mean$wet_day_correction = monthly_rainfall_count/monthly_mean$wet_days
month_mean2 <- monthly_mean %>% select(Month, rainfall_correction,wet_day_correction)
monthly_mean %>%
ggplot(aes(x = Month, y = mean)) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_errorbar(aes(ymin = lower_rain, ymax = upper_rain), width=.2,
position=position_dodge(.9)) +
labs(title = "Fig. 1 - Average daily rainfall in Sydney Botanic Gardens peaks \nin March and June",
caption = "Data source: BOM",
x = "Month", y = "Average daily \nrainfall in mm") +
theme(axis.title.y=element_text(angle=0, vjust = 0.5, size = 10)) +
scale_x_discrete(breaks = function(x){x[c(TRUE, FALSE)]})
We can also look at the number of wet days:
monthly_mean %>%
ggplot(aes(x = Month, y = wet_days)) +
geom_bar(stat = "identity", fill = "darkolivegreen3") +
geom_errorbar(aes(ymin = lower_days, ymax = upper_days), width=.2,
position=position_dodge(.9)) +
labs(title = "Fig. 2 - Wet days in Sydney Botanic Gardens peaks in March",
caption = "Data source: BOM",
x = "Month", y = "Average number \nof wet days \nin month") +
theme(axis.title.y=element_text(angle=0, vjust = 0.5, size = 10)) +
scale_x_discrete(breaks = function(x){x[c(TRUE, FALSE)]})
Lets take a look at the long term trends. But since rainfall is so affected by the month of the year, and for 2022 we only have data up to the beginning of August, I will restrict this data for all available years for the months of January to July only.
So after making this data adjustment, what are the long term trends? The data for 2022 has been highlighted in red.
rainfall_year <- rainfall %>% filter (month_num < 8) %>%
group_by(Year) %>%
summarise(mean_rainfall = mean(rainfall_mm),
mean_wet_days = mean(wet_day_sum)) %>%
mutate(rainfall_colour = if_else(Year == 2022,'firebrick1','steelblue'),
wet_days_colour = if_else(Year == 2022,'firebrick1','darkolivegreen3'))
ggplot(data=rainfall_year, aes(x=Year, y=mean_rainfall)) +
geom_bar(stat = "identity", fill = rainfall_year$rainfall_colour) +
labs(title = "Fig 3. - Average daily rainfall in Sydney Botanic Gardens in 2022 \nis unusually high",
subtitle = "Data is limited to January to July only, and has been averaged over these months",
caption = "Data source: BOM",
x = "Year", y = "Average daily \nrainfall in mm") +
theme(axis.title.y=element_text(angle=0, vjust = 0.5, size = 10)) +
annotate("text", x = 2005, y = 9, label = "Red bar is 2022")
And if we zoom into this, we can see that the year 2022 is a clear outlier:
rainfall_year_2000 <- rainfall_year %>% filter (Year >1999)
ggplot(data=rainfall_year_2000, aes(x=Year, y=mean_rainfall)) +
geom_bar(stat = "identity", fill = rainfall_year_2000$rainfall_colour) +
labs(title = "Fig 4 - Average daily rainfall in Sydney Botanic Gardens \nover last 20 years",
subtitle = "Data is limited to January to July only, and has been averaged over these months",
caption = "Data source: BOM",
x = "Year", y = "Average daily \nrainfall in mm") +
theme(axis.title.y=element_text(angle=0, vjust = 0.5)) +
annotate("text", x = 2003, y = 9, label = "Red bar is 2022")
The result for 2022 looks to be very different to earlier years, but what does a statistical analysis show us? To test this, I ran an two-sided T-test on this data*, comparing the 2022 result to the earlier data.
The results is shown below - in English, these results show that the results were statistically significant at the 5% level, i.e there is a less than a 5% probability (‘p=value’) that the differences are due to random variation.
temp <- left_join(rainfall,month_mean2, by = 'Month')
rainfall <- temp
options(dplyr.summarise.inform = FALSE)
rainfall %<>% mutate(year_month = paste0(Year,'-',Month),
corrected_rainfall = rainfall_mm * rainfall_correction,
corrected_wet_days = wet_day_sum * wet_day_correction)
monthly_sum <- rainfall %>% filter(month_num < 8) %>%
group_by(Year, Month) %>%
summarise(corrected_mean_rainfall = mean(corrected_rainfall),
corrected_mean_wet_days = mean(corrected_wet_days))
Monthly_average_rainfall_before_2022 <- monthly_sum$corrected_mean_rainfall[monthly_sum$Year < 2022]
Monthly_average_rainfall_2022 <- monthly_sum$corrected_mean_rainfall[monthly_sum$Year == 2022]
library(ggpubr)
#shapiro.test(Monthly_average_rainfall_2022)
#shapiro.test(Monthly_average_rainfall_before_2022)
#ggqqplot(Monthly_average_rainfall_2022)
#ggqqplot(Monthly_average_rainfall_before_2022)
t.test(Monthly_average_rainfall_2022,Monthly_average_rainfall_before_2022, alternative = 'two.sided') # conduct two-sided T test on 2022 vs earlier data
##
## Welch Two Sample t-test
##
## data: Monthly_average_rainfall_2022 and Monthly_average_rainfall_before_2022
## t = 2.4471, df = 6.0252, p-value = 0.04982
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.006027623 11.182829658
## sample estimates:
## mean of x mean of y
## 8.999734 3.405305
* The T Test was run on the rainfall data after averaging it on a month-by-month basis. I did this to try to ‘normalise’ the data, as according to the ‘Central Limit Theorem’. Admittedly this was not completely successful as the monthly average for pre-2022 data failed the Shapiro test for normality. Also since we know that rainfall also varies on a monthly basis, we have corrected the means for each month based on the mean data reported in Fig. 1. For these reasons, we should regard the p-value above as preliminary, particular as the reported p-value of 0.0498 is only just less than the 0.05 cut-off. And finally, I used a T-test rather than a Z-test as there was a low number of samples for 2022.
We can also look at the average number of wet days per month, in each of these years:
ggplot(data=rainfall_year, aes(x=Year, y=mean_wet_days)) +
geom_bar(stat = "identity", fill = rainfall_year$wet_days_colour) +
labs(title = "Fig 5. - 2022 to date has had more wet days in Sydney Botanic gardens",
subtitle = "Data is limited to January to July only, and has been averaged over these months",
caption = "Data source: BOM",
x = "Year", y = "Average number \nof wet days \nin month") +
theme(axis.title.y=element_text(angle=0, vjust = 0.5, size = 10)) +
annotate("text", x = 2005, y = 17, label = "Red bar is 2022")
We can also apply the T-test test to the number of wet days in a month. The results shown below shows a lower p-value of 0.0113, which shows a lower probability that this variation was due to chance alone.
Monthly_average_wet_days_before_2022 <- monthly_sum$corrected_mean_wet_days[monthly_sum$Year < 2022]
Monthly_average_wet_days_2022 <- monthly_sum$corrected_mean_wet_days[monthly_sum$Year == 2022]
#shapiro.test(Monthly_average_rainfall_2022)
#shapiro.test(Monthly_average_rainfall_before_2022)
#ggqqplot(Monthly_average_rainfall_2022)
#ggqqplot(Monthly_average_rainfall_before_2022)
t.test(Monthly_average_wet_days_before_2022,Monthly_average_wet_days_2022, alternative = 'two.sided') # conduct two-sided T test on 2022 vs earlier data
##
## Welch Two Sample t-test
##
## data: Monthly_average_wet_days_before_2022 and Monthly_average_wet_days_2022
## t = -3.5894, df = 6.0667, p-value = 0.0113
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -12.06626 -2.29918
## sample estimates:
## mean of x mean of y
## 9.760983 16.943704
So a different finding to the analysis based on average daily rainfall. These are different things of course:
Apart from 2022, which were the next wettest years? The table below shows that the next wettest year was 1950, then 1990. This further suggests that, apart from 2022, there does not appear to be a recent trend towards wetter years.
# https://gt.rstudio.com/articles/intro-creating-gt-tables.html
top_ten <- rainfall_year %>% arrange(desc(mean_rainfall)) %>% slice(1:10) %>%
mutate(mean_rainfall = round(mean_rainfall, digits = 1),
mean_wet_days = round(mean_wet_days, digits = 1)) %>%
select(-wet_days_colour, -rainfall_colour)
library(gt)
top_ten %>% gt() %>%
tab_header(title = "Wettest known years for Sydney Botanical Gardens",
subtitle = "Average rain per day in mm - Jan to July") %>%
cols_label(
mean_rainfall = "Average daily rainfall, mm",
mean_wet_days = "Wet days per month"
)
| Wettest known years for Sydney Botanical Gardens | ||
|---|---|---|
| Average rain per day in mm - Jan to July | ||
| Year | Average daily rainfall, mm | Wet days per month |
| 2022 | 10.1 | 17.9 |
| 1950 | 8.5 | 14.6 |
| 1990 | 7.5 | 12.5 |
| 1956 | 7.0 | 13.5 |
| 1913 | 6.6 | 11.3 |
| 1963 | 6.4 | 11.5 |
| 1955 | 6.2 | 13.4 |
| 1989 | 6.0 | 11.9 |
| 1919 | 5.7 | 10.4 |
| 1949 | 5.7 | 10.4 |
We could also look at the wettest single days, which are shown in the table below:
rainfall2 <- rainfall %>% arrange(desc(rainfall_mm))
top_ten_days <- rainfall2[1:10,]
top_ten_days %<>% mutate(Date = paste0(Day,' ',Month,' ',Year)) %>%
select(Date, rainfall_mm) %>% select(2,3) %>%
mutate(rainfall_mm = round(rainfall_mm,0))
top_ten_days <- top_ten_days[,c(2:3)]
top_ten_days %>% gt() %>%
tab_header(title = "Wettest known days for Sydney Botanical Gardens") %>%
cols_label(
rainfall_mm = "Total rainfall, mm")
| Wettest known days for Sydney Botanical Gardens | |
|---|---|
| Date | Total rainfall, mm |
| 6 August 1986 | 340 |
| 28 March 1942 | 277 |
| 9 February 1992 | 264 |
| 9 November 1984 | 248 |
| 3 February 1990 | 238 |
| 1 May 1988 | 230 |
| 2 May 1953 | 226 |
| 11 March 1975 | 217 |
| 1 May 1955 | 193 |
| 11 February 1956 | 191 |
We can also look at rainfall events, or the total rain over a consecutive series of days. The table below shows that the single wettest event was in 1950, where over 17 days 578 mm of rain fell. The most recent event in this top ten list was in 2020.
rainfall3 <- rainfall %>% arrange(desc(Rainfall_in_sequence)) %>% distinct(Rainfall_sequence_number, .keep_all = TRUE)
top_ten_events <- rainfall3[1:10,]
top_ten_events %<>% mutate(Date = paste0(Day,' ',Month,' ',Year)) %>%
select(Date, Rainfall_in_sequence,Rainfall_sequence_days) %>%
mutate(Rainfall_in_sequence = round(Rainfall_in_sequence,0))
top_ten_events <- top_ten_events[,c(2:4)]
top_ten_events %>% gt() %>%
tab_header(title = "Wettest known weather events for Sydney Botanical Gardens") %>%
cols_label(
Date = "Started raining",
Rainfall_in_sequence = "Total rainfall, mm",
Rainfall_sequence_days = "# of wet days")
| Wettest known weather events for Sydney Botanical Gardens | ||
|---|---|---|
| Started raining | Total rainfall, mm | # of wet days |
| 27 June 1950 | 578 | 17 |
| 2 December 1961 | 524 | 17 |
| 9 May 1953 | 472 | 11 |
| 30 November 1961 | 471 | 15 |
| 8 August 1986 | 469 | 4 |
| 16 February 1992 | 456 | 14 |
| 9 November 1984 | 436 | 6 |
| 13 June 1964 | 434 | 8 |
| 10 February 2020 | 434 | 8 |
| 5 February 1990 | 418 | 5 |
None of these ten wettest days have occurred in the last 10 years.
We could look to see if there has a statistically significant trend line in the weather, again with data averaged on a month-by-month basis. The graph below shows the outcome of this analysis on average monthly rainfall. A separate regression analysis showed that the best fit line had a ‘p-value’ of 0.24, which means that we cannot consider this trend to be statistically significant. Note that this analysis was based on corrected mean monthly data, as discussed for the anova analysis.
monthly_sum %<>% mutate(Date = as.Date(paste0(1,'-',Month,'-',Year),format = "%d-%B-%Y"))
reg_model_a <- lm(corrected_mean_rainfall ~ Date, data = monthly_sum) # Run regression model for trial data
#summary(reg_model_a)
monthly_sum %>%
ggplot(aes(x = Date, y = corrected_mean_rainfall)) +
geom_point(color = "grey68") +
theme_bw() +
geom_smooth(data = monthly_sum, aes(x = Date, y = corrected_mean_rainfall, color = "deepskyblue3"), formula = y ~ x, method = "lm", fill = "light blue", size = 1.5) +
labs(title = "Fig 6. - No significant trend in average daily rainfall \nin Sydney Botanic Gardens",
subtitle = "Best fit line is shown, but the slight gradient was not statistically significant. \nData was averaged on a month by basis basis",
caption = "Data source: BOM",
x = "", y = "Average daily \nrainfall in mm") +
theme(axis.title.y=element_text(angle=0, vjust = 0.5, size = 10)) +
theme(legend.position="none")
However - when we looked at the number of wet days, there is a statistically significant trend, with the average number of wet days increasing from around 8 to around 11.
reg_model_b <- lm(corrected_mean_wet_days ~ Date, data = monthly_sum) # Run regression model for trial data
#summary(reg_model_b)
monthly_sum %>%
ggplot(aes(x = Date, y = corrected_mean_wet_days)) +
geom_point(color = "grey68") +
theme_bw() +
geom_smooth(data = monthly_sum, aes(x = Date, y = corrected_mean_wet_days, color = "deepskyblue3"), formula = y ~ x, method = "lm", fill = "light blue", size = 1.5) +
labs(title = "Fig 7. - It is raining more often at Sydney Botanic Gardens",
subtitle = "Best fit line is shown - trend was significant at 1% level.",
caption = "Data source: BOM",
x = "", y = "Number \nof wet days \nper month") +
theme(axis.title.y=element_text(angle=0, vjust = 0.5, size = 10)) +
theme(legend.position="none")
The earlier graphs show a bit of noise in long term trends, so to simplify this, we could group this data into five year blocks, as shown in the graph below. Again this analysis is based on corrected monthly data.
fives <- c(1910, 1915, 1920, 1925, 1930, 1935, 1940, 1945, 1950, 1955, 1960, 1965, 1970, 1975,
1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015, 2020)
rainfall$fives <- fives[findInterval(rainfall$Year, fives)]
alpha <- 0.05
five_sum <- rainfall %>%
group_by(fives) %>%
summarize(mean_rainfall = mean(corrected_rainfall),
lower_rainfall = mean(corrected_rainfall) - qt(1- alpha/2, (n() - 1))*sd(rainfall_mm)/sqrt(n()),
upper_rainfall = mean(corrected_rainfall) + qt(1- alpha/2, (n() - 1))*sd(rainfall_mm)/sqrt(n()),
mean_days = mean(wet_day_sum),
lower_days = mean(wet_day_sum) - qt(1- alpha/2, (n() - 1))*sd(wet_day_sum)/sqrt(n()),
upper_days = mean(wet_day_sum) + qt(1- alpha/2, (n() - 1))*sd(wet_day_sum)/sqrt(n()),
) %>%
mutate(colour = if_else(fives == 2020,'firebrick1','steelblue'))
five_sum %>%
ggplot(aes(x = fives, y = mean_rainfall)) +
geom_bar(stat = "identity", fill = five_sum$colour) +
geom_errorbar(aes(ymin = lower_rainfall, ymax = upper_rainfall), width=.2,
position=position_dodge(.9)) +
labs(title = "Fig 8. - Average daily rainfall in Sydney Botanic Gardens",
subtitle = "Data is grouped into 5 year blocks (1910 to 1914, 1915 to 1999, etc), \nand for Jan to July only.",
caption = "Data source: BOM",
x = "", y = "Average daily \nrainfall in mm") +
theme(axis.title.y=element_text(angle=0, vjust = 0.5, size = 10),
plot.title = element_text(size = 14)) +
annotate("text", x = 2000, y = 5.5, label = "Red bar is 2020 onwards")
The first thing we note is that that the most recent block (2020 to 2022 only remember) has a much higher average rainfall level than any previous five year block, although ideally we would wait until the end of this period.
This helps to confirm that there are no long term trends in this data. Rainfall is slightly lower around the 1930s and slightly higher in the 1950s.
On a similar basis, we can also investigate whether the variability in rainfall in increasing. In the graph below, I have determined the variability, as defined by standard deviation of the average (corrected) monthly rainfall, for each five year block since 2010:
monthly_sum$fives <- fives[findInterval(monthly_sum$Year, fives)]
alpha <- 0.05
five_sumb <- monthly_sum %>%
group_by(fives) %>%
summarize(sd_rainfall = sd(corrected_mean_rainfall),
sd_wet_days = sd(corrected_mean_wet_days)) %>%
mutate(colour = if_else(fives == 2020,'firebrick1','lightskyblue'))
five_sumb %>%
ggplot(aes(x = fives, y = sd_rainfall)) +
geom_bar(stat = "identity", fill = five_sumb$colour) +
labs(title = "Fig 9. - Variability in monthly average rainfall in Sydney Botanic Gardens",
subtitle = "Shows the standard deviation in average rainfall \non a month-by-month basis for each five year period.",
caption = "Data source: BOM",
x = "", y = "Standard deviation \n in monthly \nrainfall,mm") +
theme(axis.title.y=element_text(angle=0, vjust = 0.5, size = 10),
plot.title = element_text(size = 12)) +
annotate("text", x = 1995, y = 5, label = "Red bar is 2020 onwards")
This graph shows that while the most recent period (2020 to July 2022) has greater variability, there has no been no strong trend prior to this. The high standard deviation for the recent period will be due to the very high rainfall for 2022.
However, when we look at the varibility in the number of wet days over these five year blocks - the 2020 block is not greater than some other blocks.
five_sumb %>%
ggplot(aes(x = fives, y = sd_wet_days)) +
geom_bar(stat = "identity", fill = five_sumb$colour) +
labs(title = "Fig 10. - Variability in monthly wet days in Sydney Botanic Gardens",
subtitle = "Shows the standard deviation in average number of wet days \non a month-by-month basis for each five year period.",
caption = "Data source: BOM",
x = "", y = "Standard deviation \n in monthly count \n of wet days") +
theme(axis.title.y=element_text(angle=0, vjust = 0.5, size = 10),
plot.title = element_text(size = 12)) +
annotate("text", x = 1997, y = 1, label = "Red bar is 2020 onwards")
In the introduction, I asked:
This data analysis shows that the 2022 is much wetter than any previous year in Sydney. In terms of average monthly rainfall there is no long-term trend, but there is a long term increase in the number of wet days per month.
Regardless, it will be worth watching to see if 2022 is the start of a longer term trend.
# https://stackoverflow.com/questions/31591516/pubishing-to-rpubs-existing-html-file
#result <- rpubsUpload(title='SPC',contentFile='sydney-rainfall-data_clean_2', method=getOption('rpubs.upload.method','auto'))