Summary

This analysis was inspired by the walk-through found on r-bloggers: https://www.r-bloggers.com/author/pedro-m/. I have reused parts of the code in my own analysis and built upon it.

The data set consists of daily records of several meteorological parameters, measured in the city of Porto over the year 2014. The problem is set up as a regression problem but could also be framed as a binary or multiclass classification problem.

There are 365 observations for each of the following 14 variables:

day.count: number of days passed since the beginning of the year.

day: day of the month.

month: month of the year.

season: season of the year.

l.temp, h.temp, ave.temp: lowest, highest and average temperature for the day (in ºC).

l.temp.time, h.temp.time: hour of the day when l.temp and h.temp occurred.

rain: amount of precipitation (in mm).

ave.wind: average wind speed for the day (in km/h).

gust.wind: maximum wind speed for the day (in km/h).

gust.wind.time: hour of the day when gust.wind occurred.

dir.wind: dominant wind direction for the day.

Questions

  1. What are good predictors of rain?

  2. How accurately can the amount of rain be predicted by different techniques?

Synopsis

To answer the above questions, an analysis has been performed using weather data for the city of Porto. The weather data set has been processed to obtain the information needed. This processing includes cleaning the data and plotting the results.

Data Processing

The data used can be downloaded from https://github.com/JMFlin/Machine-Learning/tree/master/Data


Loading the data

Sys.setlocale("LC_TIME", "English")
## [1] "English_United States.1252"
library(psych)
library(ggplot2)
library(caret)
library(reshape2)
library(gridExtra)
library(GGally)
library(ggcorrplot)
library(rpart)
library(tidyr)
library(partykit)
weather <- read.csv("weather_2014.csv", sep=";", stringsAsFactors=FALSE, na.strings = c("Na", "NaN"))
describe(weather)
##                 vars   n   mean     sd median trimmed    mad min   max
## day.count          1 365 183.00 105.51  183.0  183.00 134.92 1.0 365.0
## day                2 365  15.72   8.81   16.0   15.71  11.86 1.0  31.0
## month              3 365   6.53   3.45    7.0    6.53   4.45 1.0  12.0
## season*            4 365    NaN     NA     NA     NaN     NA Inf  -Inf
## l.temp             5 365  12.65   4.21   12.9   12.73   5.19 3.1  22.6
## h.temp             6 365  19.19   5.20   19.1   18.89   6.38 9.8  31.5
## ave.temp           7 365  15.74   4.38   15.8   15.67   5.49 7.3  26.6
## l.temp.time*       8 365    NaN     NA     NA     NaN     NA Inf  -Inf
## h.temp.time*       9 365    NaN     NA     NA     NaN     NA Inf  -Inf
## rain              10 365   5.84  11.98    0.3    2.85   0.44 0.0  74.9
## ave.wind          11 365   4.04   2.43    3.5    3.78   2.08 0.0  16.6
## gust.wind         12 365  31.15  11.66   29.0   30.01  11.86 3.2  86.9
## gust.wind.time*   13 365    NaN     NA     NA     NaN     NA Inf  -Inf
## dir.wind*         14 365    NaN     NA     NA     NaN     NA Inf  -Inf
##                 range  skew kurtosis   se
## day.count       364.0  0.00    -1.21 5.52
## day              30.0  0.01    -1.20 0.46
## month            11.0 -0.01    -1.22 0.18
## season*          -Inf    NA       NA   NA
## l.temp           19.5 -0.14    -0.93 0.22
## h.temp           21.7  0.36    -0.89 0.27
## ave.temp         19.3  0.10    -1.01 0.23
## l.temp.time*     -Inf    NA       NA   NA
## h.temp.time*     -Inf    NA       NA   NA
## rain             74.9  2.99     9.88 0.63
## ave.wind         16.6  1.25     2.42 0.13
## gust.wind        83.7  1.06     1.70 0.61
## gust.wind.time*  -Inf    NA       NA   NA
## dir.wind*        -Inf    NA       NA   NA
str(weather)
## 'data.frame':    365 obs. of  14 variables:
##  $ day.count     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ day           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ season        : chr  "Winter" "Winter" "Winter" "Winter" ...
##  $ l.temp        : num  12.7 11.3 12.6 7.7 8.8 11.8 11.4 12.4 9.2 8.3 ...
##  $ h.temp        : num  14 14.7 14.7 13.9 14.6 14.4 14.8 15.6 18.4 14.8 ...
##  $ ave.temp      : num  13.4 13.5 13.6 11.3 13 13.1 13.5 14.1 12.9 11 ...
##  $ l.temp.time   : chr  "01:25" "07:30" "21:00" "10:35" ...
##  $ h.temp.time   : chr  "23:50" "11:15" "14:00" "01:50" ...
##  $ rain          : num  32 64.8 12.7 20.1 9.4 38.9 2 1.5 0 0 ...
##  $ ave.wind      : num  11.4 5.6 4.3 10.3 11.6 9.9 6.6 5.9 0.2 1.4 ...
##  $ gust.wind     : num  53.1 41.8 38.6 66 51.5 57.9 38.6 33.8 16.1 24.1 ...
##  $ gust.wind.time: chr  "15:45" "22:25" "00:00" "09:05" ...
##  $ dir.wind      : chr  "S" "S" "SSW" "SW" ...
table(is.na(weather))
## 
## FALSE 
##  5110


The variable dir.wind has quite a lot of different classes. It looks like we can combine some of them.

sort(round(prop.table(table(weather$dir.wind))*100,1),decreasing = TRUE)
## 
##   NW  NNW  SSE    S   NE   SE  WNW    N  SSW  ENE    E   SW  NNE    W  WSW 
## 29.6 10.1  8.5  7.1  6.8  6.6  6.6  4.9  4.7  4.1  3.0  3.0  2.2  1.4  0.8 
##  ESE 
##  0.5
weather$dir.wind.8 <- weather$dir.wind 

weather$dir.wind.8 <- ifelse(weather$dir.wind %in%  c("NNE","ENE"),
                                 "NE",as.character(weather$dir.wind.8)) 

weather$dir.wind.8 <- ifelse(weather$dir.wind %in% c("NNW","WNW"),
                               "NW",as.character(weather$dir.wind.8)) 

weather$dir.wind.8 <- ifelse(weather$dir.wind %in% c("WSW","SSW"),
                               "SW",as.character(weather$dir.wind.8)) 

weather$dir.wind.8 <- ifelse(weather$dir.wind %in% c("ESE","SSE"),
                               "SE",as.character(weather$dir.wind.8)) 

# create factors, ordered by "levels" 
weather$dir.wind.8 <- factor(weather$dir.wind.8,
                                 levels = c("N","NE","E","SE","S","SW","W","NW"))

## A 2-way table (direction vs season), with relative frequencies calculated over column
round(prop.table(table(weather$dir.wind.8,weather$season),margin = 2)*100,1)
##     
##      Autumn Spring Summer Winter
##   N    12.1    1.1    3.3    3.3
##   NE   20.9   14.1    5.4   12.2
##   E     5.5    0.0    0.0    6.7
##   SE   20.9   13.0   14.1   14.4
##   S     4.4    5.4   12.0    6.7
##   SW    2.2    6.5    8.7   16.7
##   W     1.1    2.2    0.0    2.2
##   NW   33.0   57.6   56.5   37.8


More processing is needed, such as making a date variable from the day.count variable and rounding the time variables to the nearest hour. We also convert the month and season variables to factors and create new hour-of-day variables out of the time variables.

weather$season <- factor(weather$season, levels = c("Spring","Summer","Autumn","Winter"))
weather$day <- as.factor(weather$day)
weather$month <- as.factor(weather$month)
weather$dir.wind <- as.factor(weather$dir.wind)

first.day <- "2014-01-01"
first.day <- as.Date(first.day)
weather$date  <- first.day + weather$day.count - 1 

l.temp.time.date <- as.POSIXlt(paste(weather$date, weather$l.temp.time), tz = "GMT")
#head(l.temp.time.date)

# Round to the nearest hour
l.temp.time.date <- round(l.temp.time.date,"hours")

#attributes(l.temp.time.date)

# Extract the value of the hour attribute as a number and add it to the data set
weather$l.temp.hour <- l.temp.time.date[["hour"]]

h.temp.time.date <- as.POSIXlt(paste(weather$date, weather$h.temp.time), tz = "GMT")
#head(h.temp.time.date)

h.temp.time.date <- round(h.temp.time.date,"hours")

weather$h.temp.hour <- h.temp.time.date[["hour"]]

gust.wind.time.date <- as.POSIXlt(paste(weather$date, weather$gust.wind.time), tz = "GMT")
#head(gust.wind.time.date)

gust.wind.time.date <- round(gust.wind.time.date,"hours")

weather$gust.wind.time.hour <- gust.wind.time.date[["hour"]]

weather$l.temp.hour <- as.numeric(weather$l.temp.hour)
weather$h.temp.hour <- as.numeric(weather$h.temp.hour)
weather$gust.wind.time.hour <- as.numeric(weather$gust.wind.time.hour)

weather$month <- factor(weather$month,
                       labels = c("Jan","Feb","Mar","Apr",
                                  "May","Jun","Jul","Aug","Sep",
                                  "Oct","Nov","Dec"))


It should be clear at this point that one possible approach would be to dichotomise the dependent variable (rain vs. no rain). Note that it is common to consider days with rain only those where the total amount was at least 1mm (to allow for measurement errors), and that’s the criterion we will adopt here.

weather$rained <- ifelse(weather$rain >= 1, "Yes", "No")
weather$rained <- as.factor(weather$rained)
table(weather$rained)
## 
##  No Yes 
## 219 146
prop.table(table(weather$rained)) 
## 
##  No Yes 
## 0.6 0.4


Since the daily high temperature variable is continuous, we can also transform it into a categorical one. A common strategy is to split a continuous variable into four groups of (roughly) the same size, i.e., the quartiles.

weather$h.temp.quant <- cut(weather$h.temp, breaks = quantile(weather$h.temp),
                            labels = c("Cool","Mild","Warm","Hot"),include.lowest = T)
table(weather$h.temp.quant)
## 
## Cool Mild Warm  Hot 
##   92   91   94   88
my_fn <- function(data, mapping, pts=list(), smt=list(), ...){
  ggplot(data = data, mapping = mapping, ...) + 
    do.call(geom_point, pts) +
    do.call(geom_smooth, smt) + 
    theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text()) 
}

ggpairs(weather[,sapply(weather,is.numeric)], 
        lower = list(continuous = 
                       wrap(my_fn,
                            pts=list(colour="black"), 
                            smt=list(method="loess", se=F, colour="blue"))),
        upper = list(continuous = 
                       wrap(my_fn,
                            pts=list(colour="black"), 
                            smt=list(method="loess", se=F, colour="blue"))))


In the histograms below, the blue line is the mean and the red line is the median. The histogram of rain shows that the distribution is extremely right-skewed.

temp <- weather[,sapply(weather,is.numeric)]
temp <- temp[,!names(temp) %in% "day.count"]
longtemp <- melt(temp, measure.vars = names(temp))
vline.dat.mean <- aggregate(longtemp[,2], list(longtemp$variable), mean)
vline.dat.median <- aggregate(longtemp[,2], list(longtemp$variable), median)
names(vline.dat.mean)[1] <- "variable"
names(vline.dat.median)[1] <- "variable" 

ggplot(longtemp,aes(x=value))+
  geom_histogram(aes(y = ..density..,  fill = variable), binwidth=1, colour = "blue", fill = "darkgrey")+
  geom_density()+ 
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text())+
  geom_vline(aes(xintercept = x), data = vline.dat.mean, linetype = "longdash", color = "blue")+
  geom_vline(aes(xintercept = x), data = vline.dat.median, linetype = "longdash", color = "red")+
  xlab("")+
  facet_wrap(~ variable, ncol = 3, scales = "free")

longtemp <- cbind(longtemp, weather[,"rained"])
names(longtemp) <- c("variable", "value", "rained")
vline.dat.mean <- aggregate(longtemp[,2], list(longtemp$variable), mean)
vline.dat.median <- aggregate(longtemp[,2], list(longtemp$variable), median)
names(vline.dat.mean)[1] <- "variable"
names(vline.dat.median)[1] <- "variable" 
ggplot(longtemp,aes(x=value))+
  geom_histogram(aes(y = ..density..,  fill = rained), colour = "black", binwidth=1)+geom_density() + 
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text())+
  geom_vline(aes(xintercept = x), data = vline.dat.mean, linetype = "longdash", color = "blue")+
  geom_vline(aes(xintercept = x), data = vline.dat.median, linetype = "longdash", color = "red")+
  xlab("")+ 
  facet_wrap(~ variable, ncol = 3, scales = "free")

ggplot(longtemp,aes(x=value, fill = rained))+
  geom_density(alpha = 0.1) + 
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text())+
  geom_vline(aes(xintercept = x), data = vline.dat.mean, linetype = "longdash", color = "blue")+
  geom_vline(aes(xintercept = x), data = vline.dat.median, linetype = "longdash", color = "red")+
  xlab("")+ 
  facet_wrap(~ variable, ncol = 3, scales = "free")


High and low temperatures by season and by wind direction.

a <- ggplot(data = weather, aes(x = l.temp, y= h.temp, colour = season)) + 
  geom_point()+ 
  ggtitle("Lowest vs highest temperature by season")+
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black")) 

b <- ggplot(data = weather, aes(x = l.temp, y= h.temp, colour = dir.wind.8)) + 
  geom_point()+ 
  ggtitle("Lowest vs highest temperature by wind direction")+
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black")) 

grid.arrange(a, b, ncol=1)


Time series plots of the average daily temperature, with a smoother curve.

a <-ggplot(weather,aes(x = date,y = ave.temp)) + 
  geom_point(aes(colour = ave.temp)) +
  scale_colour_gradient2(low = "blue", mid = "green" , high = "red", midpoint = mean(weather$ave.temp))+
  geom_smooth(color = "red", method = "loess") +
  scale_y_continuous(limits = c(5,30), breaks = seq(5,30,5)) +
  ggtitle ("Daily average temperature") +
  xlab("Date") +  ylab ("Average Temperature ( ºC )")+
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text()) 

b <-ggplot(weather,aes(x = date,y = ave.temp)) + 
  geom_point(aes(colour = rain)) +
  scale_colour_gradient2(low = "blue", mid = "blue" , high = "red", midpoint = mean(weather$rain)) + 
  geom_smooth(color = "red", method = "loess") +
  scale_y_continuous(limits = c(5,30), breaks = seq(5,30,5)) +
  ggtitle ("Daily average temperature") +
  xlab("Date") +  
  ylab ("Average Temperature ( ºC )")+ 
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text()) 

grid.arrange(a, b, ncol=1)


Distribution of the average temperature by month: a violin plot, with a jittered point layer on top in which the size of each circle is mapped to the amount of rain. A violin plot is sort of a mixture of a box plot and a histogram (or, even better, a rotated density plot).

c <-ggplot(weather,aes(x = date,y = ave.temp)) + 
  geom_point(aes(colour = rained)) +
  geom_smooth(color = "red", method = "loess") +
  scale_y_continuous(limits = c(5,30), breaks = seq(5,30,5)) +
  ggtitle ("Daily average temperature") +
  xlab("Date") +  
  ylab ("Average Temperature ( ºC )")+ 
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black")) 

d <- ggplot(weather,aes(x = month, y = ave.temp)) +
  geom_violin(fill = "orange") +
  geom_point(aes(size = rain), colour = "blue", position = "jitter") +
  ggtitle ("Temperature distribution by month") +
  xlab("Month") +  
  ylab ("Average temperature ( ºC )")+ 
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black"))

grid.arrange(c, d, ncol=1)


The spring and autumn seasons are often seen as the transition from cold to warm and from warm to cold days, respectively. The spread of their distributions reflects the high thermal amplitude of these seasons. On the other hand, winter and summer average temperatures are much more concentrated around a few values, hence the peaks shown on the graph.

ggplot(weather,aes(x = ave.temp, colour = season, fill = season)) +
  geom_density(alpha = 0.1) +
  scale_x_continuous(limits = c(5,30), breaks = seq(5,30,5)) +
  ggtitle ("Temperature distribution by season") +
  xlab("Average temperature ( ºC )") +  
  ylab("Probability")+ 
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black')) 


Scatter plot of low vs high daily temperatures, with a smoother curve for each season.

my_fn <- function(data, mapping, ...){
  p <- ggplot(data = weather, mapping = mapping) + # note: uses the full weather data set so that season is available
    geom_point(colour = "firebrick", alpha = 0.3) + 
    geom_smooth(aes(colour = season), se = F, method = 'loess') + # one smoother per season
    theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text())
  p
}

ggpairs(weather[,sapply(weather,is.numeric)], lower = list(continuous = my_fn))


Time series of the daily rain amount, with smoother curve. The time series plot shows that the daily rain amount varies wildly throughout the year. There are many dry days interspersed with stretches of consecutive wet ones, often severe, especially in the autumn and winter seasons. As shown below, the positive skewness remains even after removing all the days where it did not rain.

ggplot(weather, aes(date,rain)) +
  geom_point(aes(colour = rain)) +
  geom_smooth(colour = "blue", method = 'loess') +
  scale_colour_gradient2(low = "green", mid = "orange",high = "red", midpoint = mean(weather$rain)) +
  xlab("Month") +
  ylab("Rain (mm)") +
  ggtitle("Daily rain amount")+ 
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text()) 


Heavily right-skewed distribution.

describe(weather$rain)
##    vars   n mean    sd median trimmed  mad min  max range skew kurtosis
## X1    1 365 5.84 11.98    0.3    2.85 0.44   0 74.9  74.9 2.99     9.88
##      se
## X1 0.63


Right-skewness is still there after removing all the dry days.

describe(subset(weather, rain > 0)$rain)
##    vars   n  mean    sd median trimmed  mad min  max range skew kurtosis
## X1    1 191 11.17 14.67    5.1    7.96 7.12 0.3 74.9  74.6 2.04     4.16
##      se
## X1 1.06


Looking at the association between rain and season of the year, the time series plot seems to indicate that season of the year might be a good predictor for the occurrence of rain. Let’s start by investigating this relation, with both the continuous rain variable and the binary one. It would also be interesting to look at the wind direction a bit closer.

a <- ggplot(weather ,aes(weather$season, weather$rain)) + 
         ylab("Rain (mm)") + 
         xlab("Season") + 
         ggtitle("Daily rain amount by season") + 
         stat_boxplot(geom ='errorbar')+ 
         geom_boxplot(outlier.shape = NA) +  
         geom_jitter(aes(colour=weather$rain), position = position_jitter(width = 0.2)) +
         scale_colour_gradient2(low = "blue", mid = "red",high = "black", midpoint = mean(weather$rain), name="rain")+
theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black"))

b <- ggplot(weather ,aes(weather$dir.wind.8, weather$rain)) + 
         ylab("Rain (mm)") + 
         xlab("Wind Direction") + 
         ggtitle("Daily rain amount by wind direction") + 
         stat_boxplot(geom ='errorbar')+ 
         geom_boxplot(outlier.shape = NA) +  
         geom_jitter(aes(colour=weather$rain), position = position_jitter(width = 0.2)) +
         scale_colour_gradient2(low = "blue", mid = "red",high = "black", midpoint = mean(weather$rain), name="rain")+
theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black"))

grid.arrange(a, b, ncol = 1)


Here are the associated t-tests.

pairwise.t.test(weather$rain,weather$season)
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  weather$rain and weather$season 
## 
##        Spring  Summer  Autumn
## Summer 0.999   -       -     
## Autumn 0.057   0.057   -     
## Winter 9.7e-05 9.7e-05 0.114 
## 
## P value adjustment method: holm
pairwise.t.test(weather$rain,weather$dir.wind.8)
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  weather$rain and weather$dir.wind.8 
## 
##    N       NE      E       SE      S       SW      W      
## NE 1.00000 -       -       -       -       -       -      
## E  1.00000 1.00000 -       -       -       -       -      
## SE 0.01886 0.00026 0.12779 -       -       -       -      
## S  0.00090 1.2e-05 0.01025 1.00000 -       -       -      
## SW 0.00901 0.00026 0.05788 1.00000 1.00000 -       -      
## W  0.44786 0.38657 0.60837 1.00000 1.00000 1.00000 -      
## NW 1.00000 1.00000 1.00000 6.8e-06 1.2e-06 3.3e-05 0.43206
## 
## P value adjustment method: holm


What about the daily rain amount by temperature and by month?

a <- ggplot(weather ,aes(weather$h.temp.quant, weather$rain)) + 
         ylab("Rain (mm)") + 
         xlab("Temperature") + 
         ggtitle("Daily rain amount by temperature") + 
         stat_boxplot(geom ='errorbar')+ 
         geom_boxplot(outlier.shape = NA) +  
         geom_jitter(aes(colour=weather$rain), position = position_jitter(width = 0.2)) +
         scale_colour_gradient2(low = "blue", mid = "red",high = "black", midpoint = mean(weather$rain), name="rain")+
theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black"))


b <- ggplot(weather ,aes(weather$month, weather$rain)) + 
         ylab("Rain (mm)") + 
         xlab("Month") + 
         ggtitle("Daily rain amount by month") + 
         stat_boxplot(geom ='errorbar')+ 
         geom_boxplot(outlier.shape = NA) +  
         geom_jitter(aes(colour=weather$rain), position = position_jitter(width = 0.2)) +
         scale_colour_gradient2(low = "blue", mid = "red",high = "black", midpoint = mean(weather$rain), name="rain")+
theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black"))
  
grid.arrange(a, b, ncol = 1)


Here are the associated t-tests.

pairwise.t.test(weather$rain,weather$h.temp.quant)
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  weather$rain and weather$h.temp.quant 
## 
##      Cool    Mild    Warm   
## Mild 0.19743 -       -      
## Warm 0.00143 0.06821 -      
## Hot  2.1e-06 0.00059 0.19743
## 
## P value adjustment method: holm
pairwise.t.test(weather$rain,weather$month)
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  weather$rain and weather$month 
## 
##     Jan     Feb     Mar     Apr     May     Jun     Jul     Aug    
## Feb 1.00000 -       -       -       -       -       -       -      
## Mar 0.31158 0.00063 -       -       -       -       -       -      
## Apr 0.21294 0.00037 1.00000 -       -       -       -       -      
## May 0.05077 4.4e-05 1.00000 1.00000 -       -       -       -      
## Jun 0.00750 3.2e-06 1.00000 1.00000 1.00000 -       -       -      
## Jul 0.02840 1.8e-05 1.00000 1.00000 1.00000 1.00000 -       -      
## Aug 0.00690 2.7e-06 1.00000 1.00000 1.00000 1.00000 1.00000 -      
## Sep 1.00000 0.03869 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000
## Oct 1.00000 0.04715 1.00000 1.00000 1.00000 0.91224 1.00000 0.90322
## Nov 1.00000 0.96965 1.00000 0.86065 0.25216 0.04706 0.14635 0.04457
## Dec 0.00901 3.8e-06 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000
##     Sep     Oct     Nov    
## Feb -       -       -      
## Mar -       -       -      
## Apr -       -       -      
## May -       -       -      
## Jun -       -       -      
## Jul -       -       -      
## Aug -       -       -      
## Sep -       -       -      
## Oct 1.00000 -       -      
## Nov 1.00000 1.00000 -      
## Dec 1.00000 1.00000 0.05379
## 
## P value adjustment method: holm


We can see that most of the extreme values (outliers) occur in the winter and autumn. There are still, however, many dry days in both of these seasons, and hence the means end up quite close to each other. Fitting a model to predict the actual amount of rain (not the same as the probability of occurrence) based on the season alone might not give the greatest results.

a <- ggplot(weather,aes(season)) +
  geom_bar(aes(fill = rained), position = "fill", colour="black") +
  geom_hline(aes(yintercept = prop.table(table(weather$rained))["Yes"]),linetype = "dashed") +
  annotate("text", x = 1, y = 0.45, label = "Yearly avg") +
  xlab("Season") +
  ylab ("Proportion") +
  ggtitle("Proportion of days without and with rain, by season")+ 
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text()) 

b <- ggplot(weather,aes(month)) +
  geom_bar(aes(fill = rained), position = "fill", colour="black") +
  geom_hline(aes(yintercept = prop.table(table(weather$rained))["Yes"]),linetype = "dashed") +
  annotate("text", x = 1.5, y = 0.45, label = "Yearly avg") +
  xlab("Season") +
  ylab ("Proportion") +
  ggtitle("Proportion of days without and with rain, by month")+ 
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text()) 

grid.arrange(a, b, ncol = 1)

It appears that, when it comes to calculating the likelihood of rain on a particular day, the season of the year has some predictive power. This is especially true for the winter season, where the proportion of rainy days (63%) is well above the yearly average (40%). The breakdown by month confirms this finding.
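
As a quick check of these percentages, the same proportions can be tabulated directly (a convenience sketch; the bar chart above encodes the same information):

round(prop.table(table(weather$season, weather$rained), margin = 1)*100, 1)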

We are now going to calculate different types of correlations between the continuous outcome variable (daily rain amount) and all the numeric variables.

corr_num_both <- function(data, n = 5){
  plot(ggcorrplot(cor(data[,sapply(data,is.numeric)], method = c("pearson")), hc.order = TRUE, type = "lower", ggtheme = ggplot2::theme_gray, lab = TRUE) + ggtitle("Pearson Linear Correlation")+ theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text()))
  
  plot(ggcorrplot(cor(data[,sapply(data,is.numeric)], method = c("kendall")), hc.order = TRUE, type = "lower", ggtheme = ggplot2::theme_gray, lab = TRUE) + ggtitle("Kendall Rank Correlation")+ theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text()))
  
 plot(ggcorrplot(cor(data[,sapply(data,is.numeric)], method = c("spearman")), hc.order = TRUE, type = "lower", ggtheme = ggplot2::theme_gray, lab = TRUE) + ggtitle("Spearman Rank Correlation")+ theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text()))
}
corr_num_both(weather, 3)

Always bear in mind that correlation does not imply causation: while it is true that wind correlates with rain, this does not necessarily mean that the wind itself is causing the rain. It could be the other way around or, as is very common, both variables could be driven by a third one that we are not even considering.

There are also some negative correlations with the temperatures (especially the daily high) that, even though not as strong as the wind ones, are still worth looking at. It appears higher amounts of rain are correlated with lower high temperatures. But let’s think about it for a minute: we saw that it is in the winter when it rains the most, and we also saw that the temperatures were lower in the winter.

Here is an example of a potential (yet to be confirmed) interaction with a third variable: it may not be the lower temperature by itself that causes more rain, but the fact the both the precipitation and lower temperatures tend to occur during a specific period of the year.

Since the season seems to have an impact on these variables, I would like to explore it a bit further by calculating all these correlations by season and checking whether the values hold. If the correlation between rain and some other variable is very dissimilar across the seasons, then that is evidence of an interaction.

weather.num.season <- split(weather[,sapply(weather,is.numeric)],weather$season)
summary(weather.num.season)
##        Length Class      Mode
## Spring 10     data.frame list
## Summer 10     data.frame list
## Autumn 10     data.frame list
## Winter 10     data.frame list
attributes(weather.num.season)
## $names
## [1] "Spring" "Summer" "Autumn" "Winter"
sapply(weather.num.season, function (x) round(cor(x, method = c("pearson"))["rain",],2))
##                     Spring Summer Autumn Winter
## day.count            -0.20   0.19  -0.15  -0.30
## l.temp               -0.33   0.06   0.13   0.05
## h.temp               -0.45  -0.26  -0.07  -0.26
## ave.temp             -0.43  -0.14   0.06  -0.09
## rain                  1.00   1.00   1.00   1.00
## ave.wind              0.34   0.50   0.38   0.66
## gust.wind             0.37   0.45   0.66   0.71
## l.temp.hour           0.27   0.17   0.07   0.36
## h.temp.hour          -0.14  -0.25  -0.14  -0.31
## gust.wind.time.hour  -0.26  -0.34  -0.18   0.06
sapply(weather.num.season, function (x) round(cor(x, method = c("kendall"))["rain",],2))
##                     Spring Summer Autumn Winter
## day.count            -0.27   0.04  -0.05  -0.40
## l.temp               -0.35   0.05  -0.02   0.01
## h.temp               -0.54  -0.28  -0.17  -0.29
## ave.temp             -0.50  -0.14  -0.09  -0.11
## rain                  1.00   1.00   1.00   1.00
## ave.wind              0.17   0.31   0.21   0.49
## gust.wind             0.19   0.29   0.47   0.57
## l.temp.hour           0.03  -0.03  -0.04   0.11
## h.temp.hour          -0.05   0.02  -0.24  -0.21
## gust.wind.time.hour  -0.23  -0.20  -0.02  -0.02
sapply(weather.num.season, function (x) round(cor(x, method = c("spearman"))["rain",],2))
##                     Spring Summer Autumn Winter
## day.count            -0.36   0.06  -0.06  -0.56
## l.temp               -0.47   0.06  -0.03   0.01
## h.temp               -0.70  -0.36  -0.24  -0.43
## ave.temp             -0.64  -0.19  -0.13  -0.15
## rain                  1.00   1.00   1.00   1.00
## ave.wind              0.23   0.40   0.32   0.66
## gust.wind             0.25   0.37   0.61   0.75
## l.temp.hour           0.05  -0.03  -0.05   0.15
## h.temp.hour          -0.06   0.02  -0.30  -0.29
## gust.wind.time.hour  -0.29  -0.25  -0.04  -0.02

What do you conclude from the table above? The correlation between rain and wind varies but remains moderately strong regardless of the season of the year, making this the most promising variable in our data set. The correlation between rain and the daily high temperature, on the other hand, does not confirm what I had hypothesised above.

In fact, the correlation is even stronger in the spring than in the winter, and we would have to go even deeper if we really needed to understand what is going on (keep in mind we are just analysing one of the possible interactions - the season - but in practice there can be multiple ones). For the purpose of what we are doing now, it is enough to be aware that this correlation is not stable throughout the year, and actually goes to zero during the autumn.

Lastly, the correlations between rain and the hour of the events (low and high temperatures, and wind gust) are rather weak but show some stability (see the l.temp.hour and h.temp.hour). They might have some predictive power, at least in a linear model.

Now that we know that the wind has the strongest correlation with the rain, and that it holds true across all seasons, it’s time to plot these variables, because we want to learn something we still don’t know: what is the shape of this relation? Linear, piecewise linear, curvilinear?

Let's plot the variables with the strongest correlations.

ggplot(weather,aes(h.temp, rain)) +
  geom_point(colour = "firebrick") +
  geom_smooth(se = F, method = 'loess') +
  facet_wrap(~season) +
  xlab("Daily high temperature") +
  ylab ("Rain (mm)") +
  ggtitle("Amount of rain vs. daily high temperature, by season")+ 
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black")) 

ggplot(weather,aes(gust.wind,rain)) +
  geom_point(colour = "firebrick") +
  geom_smooth(se = F, method = 'loess') +
  facet_wrap(~season) +
  xlab("Maximum wind speed (km/h)") +
  ylab ("Rain (mm)") +
  ggtitle("Amount of rain vs. maximum wind speed, by season")+ 
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black")) 

This plot confirms what we had already discovered: there is a positive correlation between rain and wind, and the association holds regardless of the season. But now we know more: this correlation is non-linear. In fact, if we were to generalise, we could say there is no correlation at all when the maximum wind speed is below 25 km/h.

For values higher than that, there seems to be a linear association in the autumn and winter, not so linear in the spring, and definitely non-linear during the summer. If we wanted to model this relation, we would either fit a non-linear model (such as a regression tree) or we could try to force a piecewise linear model (linear spline), where the equation relating the outcome to the predictors would, itself, be different depending on the value of the wind.
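
To illustrate the piecewise linear idea, here is a minimal sketch of a linear spline with a single knot at 25 km/h (the knot location is an assumption read off the plot above, and this model is not part of the comparison carried out later):

# Slope below 25 km/h is the gust.wind coefficient; above 25 km/h it changes by
# the coefficient on the pmax() term.
spline.fit <- lm(rain ~ gust.wind + pmax(gust.wind - 25, 0), data = weather)
summary(spline.fit)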

Let’s check whether the variable that seemed to have some predictive power for the amount of rain (maximum wind speed) is also good in the case of a binary outcome (occurrence of rain). This time we will not only control for the season, but also for the daily high temperature (because, as we have seen before, this variable interacts with both the rain and the season). We will do this simultaneously, using the faceting technique on two variables.

#ggplot(weather, aes(h.temp.quant,rain)) +
#  geom_jitter(aes(colour=rain), position = position_jitter(width = 0.2)) +
#  scale_colour_gradient2(low = "blue", mid = "red",high = "black", midpoint = 30) +
#  xlab("Season") +
#  ylab ("Rain (mm)") +
#  ggtitle("Daily rain amount by temperature")+ theme(axis.line = element_line(), #axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), #legend.text=element_text(), legend.title=element_text()) 

ggplot(weather,aes(rained,gust.wind)) +
  stat_boxplot(geom ='errorbar')+
  geom_boxplot(aes(colour=rained)) +
  facet_grid(h.temp.quant~season) +
  xlab("Occurrence of rain") +
  ylab ("Maximum wind speed (km/h)") +
  ggtitle("Occurrence of rain, by season and daily high temperature")+ 
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black")) 

The graph reveals a clear pattern: the median of the maximum wind speed is always higher when it rains, and this is not affected by the range of the daily high temperature, even after controlling for the temperature variation within each season of the year. I think we now have a much better understanding of the data. We know which variables matter the most and which ones seem to be useless when it comes to predicting the rain, either the actual amount or the probability of its occurrence.

pcp_func <- function(data, Ycol, notXcols){
  
  if(missing(notXcols) == TRUE){
    a <- match(names(data[,sapply(data,is.numeric)]), names(data))
    remove <- c(Ycol)
    b <- a[!a %in% remove]
    
    data[,Ycol] <- as.factor(data[,Ycol])
    pcp <- ggparcoord(data = data, columns = b, groupColumn = Ycol,showPoints = TRUE, title = "Parallel Coordinate Plot", scale = "std")
    pcp <- pcp + theme(plot.title = element_text(size = rel(1.5)))+theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black"))
    pcp
    
  }else{
    
    a <- match(names(data[,sapply(data,is.numeric)]), names(data))
    remove <- c(notXcols, Ycol)
    b <- a[!a %in% remove]
    
    data[,Ycol] <- as.factor(data[,Ycol])
    pcp <- ggparcoord(data = data, columns = b, groupColumn = Ycol, showPoints = TRUE, title = "Parallel Coordinate Plot", scale = "std")
    pcp <- pcp + theme(plot.title = element_text(size = rel(1.5)))+ theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black"))
    pcp
  }
}
pcp_func(weather, Ycol = c(20), notXcols = c(10))

This confirms that wind and temperature are good predictors for rain.

The summary indicates that ten PCs were created: the number of possible PCs always equals the number of original variables. A “scree plot” allows a graphical assessment of the relative contribution of the PCs in explaining the variability of the data. The “Rotation” matrix contains the “loadings” of each of the original variables on the newly created PCs. We can also see that rain correlates with ave.wind, gust.wind and l.temp.hour.
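
The components can also be inspected directly with prcomp (a minimal sketch, separate from the plotting helper below):

pca <- prcomp(weather[,sapply(weather,is.numeric)], scale = TRUE, center = TRUE)
summary(pca)            # importance of the ten components
round(pca$rotation, 2)  # the "Rotation" matrix of loadings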

pca_num2 <- function(data, cumu = TRUE, ...){

  pca2 <- prcomp(data[,sapply(data,is.numeric)], scale = TRUE, center = TRUE)
  
  pr.var2 <- pca2$sdev^2
  pve2 <- pr.var2/sum(pr.var2)
  
  b <- summary(pca2)
  
  variances <- data.frame(variances=pca2$sdev**2, pcomp=1:length(pca2$sdev))
  varPlot <- ggplot(variances, aes(pcomp, variances))+ geom_bar(stat="identity", fill = "darkgrey", color = "blue")+
    geom_line() + 
    ggtitle("Scree plot with normalization")+
    xlab("Eigenvectors")+
    ylab("Eigenvalues")
    theme(plot.title = element_text(size = rel(1.5)))+
      theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text())
  plot(varPlot)
  
  par(mfrow = c(1,1))
  a2 <- data.frame(cumsum(pve2), Component = 1:length(pve2))
  my_plot <- ggplot(a2, aes(y=cumsum.pve2., x=Component)) + 
    geom_line() + geom_point() + 
    coord_cartesian(ylim = c(0,1))+
    ylab("")+ 
    ggtitle("Cumulative  Proportion of Variance  Explained") + 
    theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text())
  plot(my_plot)
  
  PCbiplot <- function(PC, x="PC1", y="PC2", mm) {
    data <- data.frame(obsnames="", PC$x)#data.frame(obsnames=my_data[,12], PC$x)
    plot <- ggplot(data, aes_string(x=x, y=y))# + geom_text(alpha=.4, size=3, aes(label=obsnames))
    plot <- plot + geom_hline(aes(0), size=.2, yintercept = 0) + geom_vline(aes(0), size=.2, xintercept = 0) + geom_point(alpha = 1,size = 1)+ theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text())
    datapc <- data.frame(varnames=rownames(PC$rotation), PC$rotation)
    mult <- min(
      (max(data[,y]) - min(data[,y])/(max(datapc[,y])-min(datapc[,y]))),
      (max(data[,x]) - min(data[,x])/(max(datapc[,x])-min(datapc[,x])))
    )
    datapc <- transform(datapc,
                        v1 = .7 * mult * (get(x)),
                        v2 = .7 * mult * (get(y))
    )
    plot <- plot + coord_equal() + geom_text(data=datapc, aes(x=v1, y=v2, label=varnames), size = 3.5, vjust=1, color="red")
    plot <- plot + geom_segment(data=datapc, aes(x=0, y=0, xend=v1, yend=v2), arrow=arrow(length=unit(0.2,"cm")), alpha=0.75, color="red")+
      labs(title = mm)+
      theme(plot.title = element_text(size = rel(1.5)))
    plot
  }
  
  numeric <- data[,sapply(data,is.numeric)]
  numeric <- numeric[,...]
  fit.1 <- prcomp(numeric, scale=TRUE, center = TRUE)#Zero mean and unit variance
  
  plot(PCbiplot(fit.1, mm = "PCA with normalization"))
  
  if(cumu == TRUE){
    return(b)
  }
}
pca_num2(weather, cumu = TRUE, -c(1))

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     1.8797 1.4486 1.0980 0.91854 0.89629 0.86866
## Proportion of Variance 0.3533 0.2098 0.1206 0.08437 0.08033 0.07546
## Cumulative Proportion  0.3533 0.5632 0.6837 0.76809 0.84843 0.92388
##                            PC7     PC8     PC9    PC10
## Standard deviation     0.70319 0.38512 0.33451 0.08033
## Proportion of Variance 0.04945 0.01483 0.01119 0.00065
## Cumulative Proportion  0.97333 0.98816 0.99935 1.00000

We will build several predictive models and evaluate their accuracy. Our dependent variable is continuous, and we will be predicting the daily amount of rain.

In both the continuous and binary cases, we will try to fit the following models. Baseline model - usually this means we assume there are no predictors (i.e., independent variables), so we have to make an educated guess (not a random one) based on the dependent variable alone: the majority class in classification and the mean/median in regression. Here I chose to use the median, because in the first histograms we saw that the median represents the data better than the mean.

Two models from inferential statistics - In the case of a continuous outcome, we will fit a multiple linear regression and a principal components regression.

Two models from machine learning - we will first build a decision tree (regression tree for the continuous outcome, and classification tree for the binary case); these models usually offer high interpretability and decent accuracy; then, we will build random forests, a very popular method, where there is often a gain in accuracy, at the expense of interpretability.

First we split the data into a testing and training set.

set.seed(123)
index <- createDataPartition(weather$rain, p = 2/3, list = FALSE)
train <- weather[ index,]
test  <- weather[-index,]
train$h.temp.hour <- as.numeric(train$h.temp.hour)
train$l.temp.hour <- as.numeric(train$l.temp.hour)
train$gust.wind.time.hour <- as.numeric(train$gust.wind.time.hour)
test$h.temp.hour <- as.numeric(test$h.temp.hour)
test$l.temp.hour <- as.numeric(test$l.temp.hour)
test$gust.wind.time.hour <- as.numeric(test$gust.wind.time.hour)

Here is a plot showing which points belong to which set (train or test).

group <- rep(NA,nrow(train) + nrow(test))
group <- ifelse(seq(1,nrow(train) + nrow(test)) %in% index,"Train","Test")
df <- data.frame(date=weather$date,rain=weather$rain, group)

ggplot(df,aes(x = date,y = rain, color = group)) + 
  geom_point() +
  scale_color_discrete(name="") + 
  ggtitle("Test vs Train set") + 
  theme(legend.position="right")+ 
  theme(axis.line = element_line(), axis.text=element_text(color='black'),axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black")) 

For the continuous outcome, the main error metric we will use to evaluate our models is the RMSE (root mean squared error). This error measure gives more weight to larger residuals than smaller ones. We will use the MAE (mean absolute error) as a secondary error metric. It gives equal weight to the residuals.
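
Both metrics are simple to compute; here is a small helper sketch (the code below computes the same quantities inline):

rmse <- function(actual, predicted) sqrt(mean((predicted - actual)^2))
mae  <- function(actual, predicted) mean(abs(predicted - actual))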

Baseline model - predict the median of the training data.

best.guess <- median(train$rain) 
RMSE.baseline <- sqrt(mean((best.guess-test$rain)^2))
RMSE.baseline
## [1] 13.9001
MAE.baseline <- mean(abs(best.guess-test$rain))
MAE.baseline
## [1] 6.101653
base.pred <- data.frame(date = test$date, actual = test$rain,
                       predicted = best.guess)
base.pred <- melt(base.pred, c("actual", "predicted"), id = "date")

ggplot(base.pred, aes(x = date, y = value, color = variable))+
   geom_point() + 
  scale_color_discrete(name="")+ 
  ggtitle("Actual vs Predicted for Baseline") + 
  theme(legend.position="right") + 
  ylab("Rain (mm)") + 
  xlab("Date")+ 
  theme(axis.line = element_line(), axis.text=element_text(color='black'),axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black")) 

We will now fit a (multiple) linear regression, which is probably the best known statistical model. Linear models do not require variables to have a Gaussian distribution (only the errors / residuals must be normally distributed); they do require, however, a linear relation between the dependent and independent variables.

As we saw, the distribution of the amount of rain is right-skewed, and its relation with some of the other variables is highly non-linear. For this reason, and also to help fix the problem of residuals having non-constant variance across the range of predictions (heteroscedasticity), we will apply the usual log transformation to the dependent variable. Since we have zeros (days without rain), we can’t do a simple ln(x) transformation, but we can do ln(x+1), where x is the rain amount.

Why do we choose to apply a logarithmic function? Simply because the regression coefficients can still be interpreted, although in a different way than in a pure linear regression. The model we will fit is often called a log-linear model.
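
To illustrate the transformation and its inverse (log1p() and expm1() are the base-R equivalents of log(x + 1) and exp(x) - 1; this is only an illustration):

x <- c(0, 0.3, 5.1, 74.9)  # example rain amounts, including a dry day
log1p(x)                   # log(x + 1): the zero stays finite
expm1(log1p(x))            # the back-transform recovers the original values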

I chose the variables for the models based on the exploratory data analysis in the previous sections. Some of the variables in our data are highly correlated (for instance, the minimum, average, and maximum temperature on a given day), which means that sometimes when we eliminate a non-significant variable from the model, another one that was previously non-significant becomes statistically significant. This iterative process of backward elimination stops when all the variables in the model are significant (in the case of factors, we consider that at least one level must be significant).
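
For comparison, a similar backward elimination could be automated with step(), starting from all the candidate predictors used elsewhere in this analysis; note that step() selects by AIC rather than by p-values, so this is only a sketch that approximates the manual process described above, not what was actually done here:

full.lm <- lm(log(rain + 1) ~ month + season + l.temp + h.temp + ave.temp +
                ave.wind + gust.wind + dir.wind.8 + h.temp.hour + l.temp.hour +
                gust.wind.time.hour, data = train)
step(full.lm, direction = "backward")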

Our dependent variable has lots of zeros and can only take positive values; an expert statistician might prefer very specific models that deal better with this kind of count-like data, such as Poisson, negative binomial, zero-inflated, hurdle, and mixture models.

Create a multiple (log)linear regression model using the training data.

ctrl <- trainControl(method = "cv",
                     number = 10,
                     savePredictions = TRUE)

set.seed(123)
lin.reg <- train(log(rain+1) ~ month + season + h.temp + gust.wind + dir.wind.8 + gust.wind.time.hour, 
                    data = train,
                    method = "lm",
                    metric = "RMSE",
                    trControl = ctrl)

test.pred.lin <- predict(lin.reg,test) 

RMSE.lin.reg <- sqrt(mean((test.pred.lin-test$rain)^2))
RMSE.lin.reg
## [1] 13.13634
summary(lin.reg)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.82793 -0.47738 -0.02618  0.35218  2.23611 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.275606   0.566641   2.251 0.025367 *  
## monthFeb            -0.151091   0.252160  -0.599 0.549669    
## monthMar            -0.099089   0.285126  -0.348 0.728531    
## monthApr             0.245362   0.422919   0.580 0.562402    
## monthMay            -0.041824   0.427523  -0.098 0.922157    
## monthJun             0.094972   0.456358   0.208 0.835337    
## monthJul             0.356892   0.508522   0.702 0.483536    
## monthAug             0.437194   0.511132   0.855 0.393295    
## monthSep             0.611763   0.477800   1.280 0.201768    
## monthOct             0.294092   0.482386   0.610 0.542718    
## monthNov             0.348814   0.430081   0.811 0.418222    
## monthDec            -0.780787   0.350675  -2.227 0.026997 *  
## seasonSummer         0.032643   0.365176   0.089 0.928853    
## seasonAutumn         0.558923   0.381369   1.466 0.144200    
## seasonWinter         0.226388   0.304774   0.743 0.458395    
## h.temp              -0.097725   0.019868  -4.919 1.71e-06 ***
## gust.wind            0.045788   0.005092   8.993  < 2e-16 ***
## dir.wind.8NE         0.313355   0.269765   1.162 0.246669    
## dir.wind.8E         -0.251950   0.362807  -0.694 0.488139    
## dir.wind.8SE         0.988400   0.274256   3.604 0.000388 ***
## dir.wind.8S          1.022222   0.303104   3.373 0.000881 ***
## dir.wind.8SW         0.914787   0.293759   3.114 0.002092 ** 
## dir.wind.8W          0.815369   0.418572   1.948 0.052695 .  
## dir.wind.8NW         0.260505   0.245409   1.062 0.289625    
## gust.wind.time.hour -0.053057   0.010934  -4.853 2.31e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.732 on 219 degrees of freedom
## Multiple R-squared:  0.6903, Adjusted R-squared:  0.6564 
## F-statistic: 20.34 on 24 and 219 DF,  p-value: < 2.2e-16
# What is the multiplicative effect of the wind variable?
exp(coef(lin.reg$finalModel)["gust.wind"])
## gust.wind 
##  1.046853
# Apply the model to the testing data (i.e., make predictions) ...
# (Don't forget to exponentiate the results to revert the log transformation)
test.pred.lin <- exp(predict(lin.reg,test))-1
test.pred.lin <- ifelse(test.pred.lin < 0 , 0, test.pred.lin)

RMSE.lin.reg <- sqrt(mean((test.pred.lin-test$rain)^2))
RMSE.lin.reg
## [1] 10.55173
MAE.lin.reg <- mean(abs(test.pred.lin-test$rain))
MAE.lin.reg
## [1] 4.286606
lm.pred <- data.frame(date = test$date, actual = test$rain,
                       predicted = test.pred.lin)
lm.pred <- melt(lm.pred, c("actual", "predicted"), id = "date")

ggplot(lm.pred, aes(x = date, y = value, color = variable))+
   geom_point() + 
  scale_color_discrete(name="")+ 
  ggtitle("Actual vs Predicted for Linear Regression") + 
  theme(legend.position="right") + 
  ylab("Rain (mm)") + 
  xlab("Date")+ 
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black")) 

The R-squared is 0.69, which means that 69% of the variance in our dependent variable can be explained by the set of predictors in the model; at the same time, the adjusted R-squared is not far from that number, meaning that the original R-squared has not been artificially increased by adding variables to the model. Note that the R-squared can only increase or stay the same by adding variables, whereas the adjusted R-squared can even decrease if the variable added doesn’t help the model more than what is expected by chance.
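
As a quick arithmetic check, the adjusted R-squared can be reproduced from the figures in the summary above (244 training observations and 24 predictor terms, hence 219 residual degrees of freedom):

r2 <- 0.6903
1 - (1 - r2)*(244 - 1)/(244 - 24 - 1)  # ~0.656, matching the reported adjusted R-squared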

Most of the variables are statistically significant (p < 0.05), as expected from the way the model was built, and the most significant predictor is the wind gust (p < 2e-16). An advantage of the log transformation is that, when a regression coefficient is small (roughly -0.1 to 0.1), a unit increase in the independent variable yields a change of approximately coeff*100 percent in the dependent variable. To be clear, the coefficient of the wind gust is 0.0458, which means that a unit increase in the gust wind (i.e., increasing the wind by 1 km/h) increases the predicted amount of rain by approximately 4.7% (exp(0.0458) ≈ 1.047, as computed above). By the same token, for each degree (ºC) increase in the daily high temperature, the predicted amount of rain is multiplied by exp(-0.0977) ≈ 0.91, i.e., it decreases by roughly 9%.
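
The exact multiplicative effects can be checked from the coefficients in the summary above:

exp(0.045788)   # ~1.047: each extra km/h of gust wind -> roughly 4.7% more predicted rain
exp(-0.097725)  # ~0.907: each extra degree of daily high -> roughly 9% less predicted rain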

How do we grow trees, then? In very simple terms, we start with a root node, which contains all the training data, and split it into two new nodes based on the most important variable (i.e., the variable that best separates the outcome into two groups). The process is then repeated recursively on each new node. A disadvantage of decision trees is that they tend to overfit.

set.seed(123)
rt <- rpart(rain ~ month + season + h.temp + gust.wind + dir.wind.8 + gust.wind.time.hour, data=train)

plot(as.party(rt))

test.pred.rtree <- predict(rt,test) 

RMSE.rtree <- sqrt(mean((test.pred.rtree-test$rain)^2))
RMSE.rtree
## [1] 9.837549
MAE.rtree <- mean(abs(test.pred.rtree-test$rain))
MAE.rtree
## [1] 4.714517
tree.pred_1 <- data.frame(date = test$date, actual = test$rain,
                        predicted = test.pred.rtree)

tree.pred_1 <- melt(tree.pred_1, c("actual", "predicted"), id = "date")

ggplot(tree.pred_1, aes(x = date, y = value, color = variable))+
  geom_point() + 
  scale_color_discrete(name="")+ 
  ggtitle("Actual vs Predicted for Full Tree") + 
  theme(legend.position="right") + 
  ylab("Rain (mm)") + 
  xlab("Date")+ 
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black")) 

Now that we have a full-grown tree, let’s see if it’s possible to prune it.

# Get the optimal CP programmatically...
min.xerror <- rt$cptable[which.min(rt$cptable[,"xerror"]),"CP"]

# ...and use it to prune the tree
rt.pruned <- prune(rt, cp = min.xerror)

plot(as.party(rt.pruned))

test.pred.rtree.p <- predict(rt.pruned,test)
RMSE.rtree.pruned <- sqrt(mean((test.pred.rtree.p-test$rain)^2))
RMSE.rtree.pruned
## [1] 9.773112
MAE.rtree.pruned <- mean(abs(test.pred.rtree.p-test$rain))
MAE.rtree.pruned
## [1] 5.020861
tree.pred <- data.frame(date = test$date, actual = test$rain,
                      predicted = test.pred.rtree.p)

tree.pred <- melt(tree.pred, c("actual", "predicted"), id = "date")

ggplot(tree.pred, aes(x = date, y = value, color = variable))+
  geom_point() + 
  scale_color_discrete(name="")+ 
  ggtitle("Actual vs Predicted for Pruned Tree") + 
  theme(legend.position="right") + ylab("Rain (mm)") + 
  xlab("Date")+ 
  theme(axis.line = element_line(), axis.text=element_text(color='black'),axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black")) 

As you can see, we were able to prune our tree, gaining simplicity and a small improvement in RMSE (although the MAE increased slightly). In the final tree, the wind gust speed is considered the most relevant predictor of the amount of rain on a given day. In terms of RMSE, the accuracy of this extremely simple model is better than that of the much more complicated linear regression.

Next I will build a Random Forest.

set.seed(123)

grid <- data.frame(mtry = seq(1, 21, 4))

rf <- train(rain ~ month + season + l.temp + h.temp + ave.temp + ave.wind +
                     gust.wind + dir.wind.8 + h.temp.hour + l.temp.hour +
                     gust.wind.time.hour, 
                   ntree=1000,
                   data = train,
                   method = "rf",
                   trControl = ctrl,
                   metric="RMSE",
                   tuneGrid = grid,
                   importance = TRUE)

ggplot(rf)+ 
  ggtitle("Random Forest Tuning Parameter") +
  theme(axis.line = element_line(), axis.text=element_text(color='black'),axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text())

imp <- varImp(rf)
ggplot(imp)+ 
  ggtitle("Random Forest Variable Importance") +
  theme(axis.line = element_line(), axis.text=element_text(color='black'),axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text()) 

test.pred.forest <- predict(rf,test)
RMSE.forest <- sqrt(mean((test.pred.forest-test$rain)^2))
RMSE.forest
## [1] 9.657945
MAE.forest <- mean(abs(test.pred.forest-test$rain))
MAE.forest
## [1] 4.73083
rf.pred <- data.frame(date = test$date, actual = test$rain,
                        predicted = test.pred.forest)
rf.pred <- melt(rf.pred, c("actual", "predicted"), id = "date")

ggplot(rf.pred, aes(x = date, y = value, color = variable))+
  geom_point() + 
  scale_color_discrete(name="")+ 
  ggtitle("Actual vs Predicted for Random Forest") + 
  theme(legend.position="right") + 
  ylab("Rain (mm)") + 
  xlab("Date")+ 
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black")) 

Principal components regression is a dimension-reduction method: the predictors are first transformed into principal components (linear combinations of the original variables), and the regression is then fit on a subset of those components.

set.seed(123)

grid <- data.frame(ncomp = seq(1, 21, 1))

pcr <- train(rain ~ month + season + l.temp + h.temp + ave.temp + ave.wind +
              gust.wind + dir.wind.8 + h.temp.hour + l.temp.hour +
              gust.wind.time.hour, 
            data = train,
            method = "pcr",
            trControl = ctrl,
            metric="RMSE",
            tuneGrid = grid)

ggplot(pcr)+ 
  ggtitle("Principal Components tuning parameter") +
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text()) 

test.pred.comp <- predict(pcr,test)
test.pred.comp <- ifelse(test.pred.comp < 0 , 0, test.pred.comp)
RMSE.comp <- sqrt(mean((test.pred.comp-test$rain)^2))
RMSE.comp
## [1] 9.448656
MAE.comp <- mean(abs(test.pred.comp-test$rain))
MAE.comp
## [1] 5.066825
pcr.pred <- data.frame(date = test$date, actual = test$rain,
                      predicted = test.pred.comp)
pcr.pred <- melt(pcr.pred, c("actual", "predicted"), id = "date")

ggplot(pcr.pred, aes(x = date, y = value, color = variable))+
  geom_point() + 
  scale_color_discrete(name="") + 
  ggtitle("Actual vs Predicted for Principal Components") + 
  theme(legend.position="right") + 
  ylab("Rain (mm)") + 
  xlab("Date")+ 
  theme(axis.line = element_line(), axis.text=element_text(color='black'),axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black")) 

We have just built and evaluated the accuracy of six different models: baseline, linear regression, fully-grown decision tree, pruned decision tree, random forest and principal components regression. Let’s create a data frame with the RMSE and MAE for each of these methods.

accuracy <- data.frame(Method = c("Baseline","Linear Regression","Full Tree","Pruned Tree","Random Forest", "Principal Components"),
                           RMSE   = c(RMSE.baseline,RMSE.lin.reg,RMSE.rtree,RMSE.rtree.pruned,RMSE.forest,RMSE.comp),
                           MAE    = c(MAE.baseline,MAE.lin.reg,MAE.rtree,MAE.rtree.pruned,MAE.forest,MAE.comp)) 

# Round the values and print the table
accuracy$RMSE <- round(accuracy$RMSE,2)
accuracy$MAE <- round(accuracy$MAE,2) 

accuracy[order(accuracy$RMSE,decreasing = FALSE), ]
##                 Method  RMSE  MAE
## 6 Principal Components  9.45 5.07
## 5        Random Forest  9.66 4.73
## 4          Pruned Tree  9.77 5.02
## 3            Full Tree  9.84 4.71
## 2    Linear Regression 10.55 4.29
## 1             Baseline 13.90 6.10
accuracy[order(accuracy$MAE,decreasing = FALSE), ]
##                 Method  RMSE  MAE
## 2    Linear Regression 10.55 4.29
## 3            Full Tree  9.84 4.71
## 5        Random Forest  9.66 4.73
## 4          Pruned Tree  9.77 5.02
## 6 Principal Components  9.45 5.07
## 1             Baseline 13.90 6.10

All methods beat the baseline, regardless of the error metric, with principal components regression and linear regression offering the best performance by RMSE and MAE, respectively. It would still be interesting to compare the fitted vs. actual values for each model.

all.predictions <- data.frame(actual = test$rain,
                              baseline = best.guess,
                              linear.regression = test.pred.lin,
                              full.tree = test.pred.rtree,
                              pruned.tree = test.pred.rtree.p,
                              random.forest = test.pred.forest,
                              principal.components = test.pred.comp)

# First six observations 
#head(all.predictions)

# Melt the columns with the gather() function. tidyr is an alternative to the reshape2 package. Gather the prediction variables (columns) into a single column (i.e., wide to long format). Recall that ggplot2 prefers the long data format.
all.predictions <- gather(all.predictions, key = model, value = predictions, 2:7)
#head(all.predictions)

# Predicted vs. actual for each model
ggplot(data = all.predictions, aes(x = actual, y = predictions)) + 
  geom_point(colour = "blue") + 
  geom_smooth(color = "black", method = "loess", size = 0.5)+
  geom_abline(intercept = 0, slope = 1, colour = "red") +
  facet_wrap(~ model,ncol = 2) + 
  ggtitle("Predicted vs. Actual, by model")+ 
  theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black")) 

The graph shows that most of the models cannot accurately predict values over 25 mm of daily rain. Principal components regression seems to be the method that best approximates the diagonal line (i.e., has the smallest prediction error). Even though these models cannot be considered more than fair, they still do a much better job than the baseline prediction.

This could be continued by building models, this time considering the rain as a binary outcome. This means all models will assign probabilities to the occurrence of rain, for each day in the test set.
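
Here is a minimal sketch of how the binary case could be set up with caret (this model was not fitted as part of the analysis above; the predictors simply mirror the linear regression):

set.seed(123)
log.reg <- train(rained ~ month + season + h.temp + gust.wind + dir.wind.8 + gust.wind.time.hour,
                 data = train,
                 method = "glm",
                 family = "binomial",
                 trControl = trainControl(method = "cv", number = 10, classProbs = TRUE))
head(predict(log.reg, test, type = "prob"))  # predicted probabilities of rain for each test day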

This could be further continued by trying different types of regressions, such as Poisson, quasi-Poisson, negative binomial, zero-inflated, hurdle, and mixture models, or an ordinal regression on Y binned into 5 categories (see the sketch after the list below). Zero-inflated Poisson regression, for instance, is used to model count data that has an excess of zero counts.

Poisson Regression

Negative Binomial Regression

Zero-Inflated Count Models (Zero-Inflated Poisson, Zero-Inflated Negative Binomial)

Zero-Truncated Count Models (Zero-Truncated Poisson, Zero-Truncated Negative Binomial)

Hurdle Models

Random-Effects Count Models (Mixture Models)
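
As a rough sketch of the zero-inflated idea (an assumption: the rain amount is rounded to whole millimetres so that it behaves like a count; none of these models were fitted above):

library(pscl)
zip <- zeroinfl(round(rain) ~ gust.wind + h.temp | season,
                data = train, dist = "poisson")
summary(zip)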

http://www.ats.ucla.edu/stat/r/dae/nbreg.htm

http://www.ats.ucla.edu/stat/r/dae/zipoisson.htm

http://www.ats.ucla.edu/stat/stata/seminars/count_presentation/count.htm

http://www.ats.ucla.edu/stat/dae/