Melbourne Rain Trends

By Gregory Baharis (s3545698), Harri Somakanthan (s3775653) & Syed Wajahath (s3750039)

Group/Individual Details

Introduction & Problem Statement

This investigation examines trends in Melbourne weather data from 2008 to 2017.

RPubs link information: RPubs Link

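Three primary questions drove our investigation:

1. What is the effect of Rainfall on the variation of temperature in Melbourne?

2. What is the effect of Sunshine on evaporation?

3. What is the effect of Humidity on temperature?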

The dataset that we used was the Rain in Australia dataset available on Kaggle: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package.

We will use statistics to analyse the relationship between the two variables in each question, in three steps. First, we collect the descriptive statistics and construct box plots to examine the spread of the data and check for any noticeable outlier patterns. Second, we complete a t-test (t.test) to determine the accuracy of the sample and visualise it using granova. Third, we construct scatter plots and complete a linear regression and correlation analysis to determine whether a relationship exists.

Data

Descriptive Statistics and Visualisation

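# Packages used below (CIr is assumed to come from the psychometric package)
library(data.table)    # fread
library(dplyr)         # %>%, filter, summarise
library(ggplot2)       # scatter plots
library(psychometric)  # CIr

setwd("C:/Users/Harri/OneDrive/Masters of Analytics/S1_MATH1324 Introduction to Statistics/")
rain <- fread(file = "weatherAUS.csv")
rain <- rain[,-c(8:13, 16:19, 22:24)]          # drop columns not used in this analysis
rain <- rain %>% filter(Location == "Melbourne")
rain$TempVar <- rain$MaxTemp - rain$MinTemp    # daily temperature range
rain <- rain[, c(1:4,12,5:11)] #rearranging
Rainonly <- rain %>% filter(Rainfall != 0)     # days with recorded rain only (NA rows are dropped too)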

TempVar <- rain %>%  summarise(Min = min(TempVar,na.rm = TRUE),
                   Q1 = quantile(TempVar,probs = .25,na.rm = TRUE),
                   Median = median(TempVar, na.rm = TRUE),
                   Q3 = quantile(TempVar,probs = .75,na.rm = TRUE),
                   Max = max(TempVar,na.rm = TRUE),
                   Mean = mean(TempVar, na.rm = TRUE),
                   SD = sd(TempVar, na.rm = TRUE),
                   n = n(),
                   Missing = sum(is.na(TempVar)))

Rainfall <- rain %>% summarise(Min = min(Rainfall,na.rm = TRUE),
                   Q1 = quantile(Rainfall,probs = .25,na.rm = TRUE),
                   Median = median(Rainfall, na.rm = TRUE),
                   Q3 = quantile(Rainfall,probs = .75,na.rm = TRUE),
                   Max = max(Rainfall,na.rm = TRUE),
                   Mean = mean(Rainfall, na.rm = TRUE),
                   SD = sd(Rainfall, na.rm = TRUE),
                   n = n(),
                   Missing = sum(is.na(Rainfall)))

Rainfallonlyrain <- Rainonly %>% summarise(Min = min(Rainfall,na.rm = TRUE),
                   Q1 = quantile(Rainfall,probs = .25,na.rm = TRUE),
                   Median = median(Rainfall, na.rm = TRUE),
                   Q3 = quantile(Rainfall,probs = .75,na.rm = TRUE),
                   Max = max(Rainfall,na.rm = TRUE),
                   Mean = mean(Rainfall, na.rm = TRUE),
                   SD = sd(Rainfall, na.rm = TRUE),
                   n = n(),
                   Missing = sum(is.na(Rainfall)))

Evaporation <- rain %>% summarise(Min = min(Evaporation,na.rm = TRUE),
                   Q1 = quantile(Evaporation,probs = .25,na.rm = TRUE),
                   Median = median(Evaporation, na.rm = TRUE),
                   Q3 = quantile(Evaporation,probs = .75,na.rm = TRUE),
                   Max = max(Evaporation,na.rm = TRUE),
                   Mean = mean(Evaporation, na.rm = TRUE),
                   SD = sd(Evaporation, na.rm = TRUE),
                   n = n(),
                   Missing = sum(is.na(Evaporation)))

Sunshine <- rain %>% summarise(Min = min(Sunshine,na.rm = TRUE),
                   Q1 = quantile(Sunshine,probs = .25,na.rm = TRUE),
                   Median = median(Sunshine, na.rm = TRUE),
                   Q3 = quantile(Sunshine,probs = .75,na.rm = TRUE),
                   Max = max(Sunshine,na.rm = TRUE),
                   Mean = mean(Sunshine, na.rm = TRUE),
                   SD = sd(Sunshine, na.rm = TRUE),
                   n = n(),
                   Missing = sum(is.na(Sunshine)))

Humidity9am <- rain %>% summarise(Min = min(Humidity9am,na.rm = TRUE),
                   Q1 = quantile(Humidity9am,probs = .25,na.rm = TRUE),
                   Median = median(Humidity9am, na.rm = TRUE),
                   Q3 = quantile(Humidity9am,probs = .75,na.rm = TRUE),
                   Max = max(Humidity9am,na.rm = TRUE),
                   Mean = mean(Humidity9am, na.rm = TRUE),
                   SD = sd(Humidity9am, na.rm = TRUE),
                   n = n(),
                   Missing = sum(is.na(Humidity9am)))

Humidity3pm <- rain %>% summarise(Min = min(Humidity3pm,na.rm = TRUE),
                   Q1 = quantile(Humidity3pm,probs = .25,na.rm = TRUE),
                   Median = median(Humidity3pm, na.rm = TRUE),
                   Q3 = quantile(Humidity3pm,probs = .75,na.rm = TRUE),
                   Max = max(Humidity3pm,na.rm = TRUE),
                   Mean = mean(Humidity3pm, na.rm = TRUE),
                   SD = sd(Humidity3pm, na.rm = TRUE),
                   n = n(),
                   Missing = sum(is.na(Humidity3pm)))

Temp9am <- rain %>% summarise(Min = min(Temp9am,na.rm = TRUE),
                   Q1 = quantile(Temp9am,probs = .25,na.rm = TRUE),
                   Median = median(Temp9am, na.rm = TRUE),
                   Q3 = quantile(Temp9am,probs = .75,na.rm = TRUE),
                   Max = max(Temp9am,na.rm = TRUE),
                   Mean = mean(Temp9am, na.rm = TRUE),
                   SD = sd(Temp9am, na.rm = TRUE),
                   n = n(),
                   Missing = sum(is.na(Temp9am)))

Temp3pm <- rain %>% summarise(Min = min(Temp3pm,na.rm = TRUE),
                   Q1 = quantile(Temp3pm,probs = .25,na.rm = TRUE),
                   Median = median(Temp3pm, na.rm = TRUE),
                   Q3 = quantile(Temp3pm,probs = .75,na.rm = TRUE),
                   Max = max(Temp3pm,na.rm = TRUE),
                   Mean = mean(Temp3pm, na.rm = TRUE),
                   SD = sd(Temp3pm, na.rm = TRUE),
                   n = n(),
                   Missing = sum(is.na(Temp3pm)))


# Display table of Summary Stats
weather <- c("TempVar", "Rainfall", "Rainfallonlyrain", "Evaporation", "Sunshine", "Humidity9am", "Humidity3pm", "Temp9am", "Temp3pm")
knitr::kable(cbind(weather, rbind.data.frame(TempVar, Rainfall, Rainfallonlyrain, Evaporation, Sunshine, Humidity9am, Humidity3pm, Temp9am, Temp3pm)))   
weather           Min    Q1  Median    Q3    Max       Mean         SD     n  Missing
TempVar           0.1   6.2     8.3  11.4   27.7   9.072021   4.148690  2435        1
Rainfall          0.0   0.0     0.0   1.2   82.2   1.837772   5.188695  2435      137
Rainfallonlyrain  0.2   0.6     2.0   5.4   82.2   4.671681   7.431709   904        0
Evaporation       0.0   2.0     4.0   6.4   23.8   4.595893   3.327234  2435        0
Sunshine          0.0   3.2     6.7   9.6   13.9   6.465776   3.918608  2435        1
Humidity9am      14.0  58.0    68.0  78.0  100.0  67.320312  14.733523  2435        3
Humidity3pm       6.0  41.0    50.0  61.0  100.0  50.897447  16.015694  2435        7
Temp9am           3.1  11.3    14.2  17.4   35.5  14.636046   4.805460  2435        2
Temp3pm           8.4  14.9    18.3  22.6   45.4  19.348992   5.764457  2435        4

Boxplot of outliers

boxplot(Rainonly$Rainfall, rain$Rainfall, names = c("Days of Rain Only", "All Days"), main="Rainfall", 
        ylab="Rain in mm")

boxplot(rain$TempVar, main="Temp Variance", ylab="Temp in Celsius")

boxplot(rain$Sunshine, main="Sunshine", ylab="Hours of Sunshine")

boxplot(rain$Evaporation, main="Evaporation", ylab = "Class A pan evaporation (mm)")

boxplot(rain$Humidity9am, rain$Humidity3pm, names = c("9am", "3pm"), main="Humidity", ylab="Humidity (percent)")

boxplot(rain$Temp9am, rain$Temp3pm, names = c("9am", "3pm"), main="Temp", ylab="Temperature (degrees C)")

Hypothesis Tests for Linear Relationships and Correlation.

###Analysis
#Does the amount of Rainfall impact the Variation of Temperature?
#H0: Rainfall has no impact on the variation of temperature.
#H1: Rainfall has an impact on the variation of temperature.

ggplot(Rainonly) + theme_minimal() + aes(TempVar, Rainfall) + geom_point(alpha = 0.8, size = 0.5) + 
  ggtitle( "Rainfall in Melbourne based on Temp Variance")  + theme(plot.title = element_text(hjust = 0.5)) +geom_smooth(method = "lm")

Rainonlylm <- lm(Rainonly$Rainfall~Rainonly$TempVar)
Rainonlylm %>% summary()
## 
## Call:
## lm(formula = Rainonly$Rainfall ~ Rainonly$TempVar)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.808 -3.936 -2.618  0.764 78.287 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.2479     0.6555   9.532  < 2e-16 ***
## Rainonly$TempVar  -0.2182     0.0841  -2.595  0.00961 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.408 on 902 degrees of freedom
## Multiple R-squared:  0.00741,    Adjusted R-squared:  0.00631 
## F-statistic: 6.734 on 1 and 902 DF,  p-value: 0.009614
##Significance test for the regression constant and slope 
Rainonlylm %>% confint()
##                       2.5 %     97.5 %
## (Intercept)       4.9614266  7.5343277
## Rainonly$TempVar -0.3833083 -0.0531826
# H0: r = 0  H1: r != 0
r <- cor(Rainonly$Rainfall, Rainonly$TempVar)
sprintf("The correlation coefficient for Rainfall vs TempVar is %g", r) %>% cat 
## The correlation coefficient for Rainfall vs TempVar is -0.0860813
cf <- CIr(r = r, n = 904, level = .95)  # cf holds the lower and upper bound
sprintf("\nThe 95%% confidence interval is [%g, %g]", cf[1], cf[2]) %>% cat 
## 
## The 95% confidence interval is [-0.15044, -0.0209959]
#Does the amount of sunshine impact evaporation?
#H0: Sunshine has no impact on evaporation.
#H1: Sunshine has an impact on evaporation.

DatanoNA <- na.omit(rain) # Filter out the NA values for the data

# x = Evaporation, y = Sunshine, matching the regression Sunshine ~ Evaporation fitted below
ggplot(rain) + theme_minimal() + aes(Evaporation, Sunshine) + geom_point(alpha = 0.8, size = 0.5) +
   ggtitle( "Sunshine vs Evaporation in Melbourne") + xlab("Class A pan evaporation (mm)") + ylab("Hours of Sunshine")  + theme(plot.title = element_text(hjust = 0.5)) + geom_smooth(method = "lm")
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

Rainlm1 <- lm(rain$Sunshine~rain$Evaporation)
Rainlm1 %>% summary()
## 
## Call:
## lm(formula = rain$Sunshine ~ rain$Evaporation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.1054  -2.9271   0.2605   3.1254   8.0760 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.91759    0.12981   37.88   <2e-16 ***
## rain$Evaporation  0.33687    0.02288   14.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.756 on 2432 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.08185,    Adjusted R-squared:  0.08147 
## F-statistic: 216.8 on 1 and 2432 DF,  p-value: < 2.2e-16
##Significance test for the regression constant and slope
Rainlm1 %>% confint()
##                      2.5 %    97.5 %
## (Intercept)      4.6630405 5.1721406
## rain$Evaporation 0.2920049 0.3817332
# H0: r = 0  H1: r != 0
r <- cor(DatanoNA$Sunshine, DatanoNA$Evaporation)
sprintf("\nThe correlation coefficient for Sunshine vs Evaporation is %g", r) %>% cat 
## 
## The correlation coefficient for Sunshine vs Evaporation is 0.289054
cf <- CIr(r = r, n = 2290, level = .95)  # cf holds the lower and upper bound
sprintf("\nThe 95%% confidence interval is [%g, %g]", cf[1], cf[2]) %>% cat
## 
## The 95% confidence interval is [0.251065, 0.326153]
#Does humidity impact temperature?
#H0: Humidity has no impact on Temperature.
#H1: Humidity has an impact on Temperature.

ggplot(rain) + theme_minimal() + aes(Temp9am, Humidity9am) + geom_point(alpha = 0.8, size = 0.5) +
   ggtitle( "Temperature vs Humidity at 9am in Melbourne") + ylab("Humidity") + xlab("Temp in Celsius")  + theme(plot.title = element_text(hjust = 0.5)) + geom_smooth(method = "lm")
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).

Rainlm2 <- lm(rain$Humidity9am~rain$Temp9am)
Rainlm2 %>% summary()
## 
## Call:
## lm(formula = rain$Humidity9am ~ rain$Temp9am)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.848  -8.666  -0.313   8.352  40.999 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  91.42440    0.80794   113.2   <2e-16 ***
## rain$Temp9am -1.64705    0.05245   -31.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.43 on 2430 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.2886, Adjusted R-squared:  0.2884 
## F-statistic:   986 on 1 and 2430 DF,  p-value: < 2.2e-16
##Significance test for the regression constant and slope
Rainlm2 %>% confint()
##                  2.5 %    97.5 %
## (Intercept)  89.840083 93.008718
## rain$Temp9am -1.749905 -1.544194
# H0: r = 0  H1: r != 0
r <- cor(DatanoNA$Humidity9am, DatanoNA$Temp9am)
sprintf("\nThe correlation coefficient for Humidity vs Temp at 9am is %g", r) %>% cat 
## 
## The correlation coefficient for Humidity vs Temp at 9am is -0.537186
cf <- CIr(r = r, n = 2290, level = .95)  # cf holds the lower and upper bound
sprintf("\nThe 95%% confidence interval is [%g, %g]", cf[1], cf[2]) %>% cat 
## 
## The 95% confidence interval is [-0.5657, -0.507389]
ggplot(rain) + theme_minimal() + aes(Temp3pm, Humidity3pm) + geom_point(alpha = 0.8, size = 0.5) +
   ggtitle( "Temperature vs Humidity at 3pm in Melbourne") + ylab("Humidity") + xlab("Temp in Celsius")  + theme(plot.title = element_text(hjust = 0.5)) +geom_smooth(method = "lm")
## Warning: Removed 7 rows containing non-finite values (stat_smooth).
## Warning: Removed 7 rows containing missing values (geom_point).

Rainlm3 <- lm(rain$Humidity3pm~rain$Temp3pm)
Rainlm3 %>% summary()
## 
## Call:
## lm(formula = rain$Humidity3pm ~ rain$Temp3pm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.077  -8.648  -1.246   7.507  50.815 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  84.87367    0.88263   96.16   <2e-16 ***
## rain$Temp3pm -1.75672    0.04374  -40.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.41 on 2426 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.3994, Adjusted R-squared:  0.3991 
## F-statistic:  1613 on 1 and 2426 DF,  p-value: < 2.2e-16
##Significance test for the regression constant and slope
Rainlm3 %>% confint()
##                  2.5 %    97.5 %
## (Intercept)  83.142894 86.604453
## rain$Temp3pm -1.842488 -1.670957
# H0: r = 0  H1: r != 0
r <- cor(DatanoNA$Humidity3pm, DatanoNA$Temp3pm)
sprintf("\nThe correlation coefficient for Humidity vs Temp at 3pm is %g", r) %>% cat 
## 
## The correlation coefficient for Humidity vs Temp at 3pm is -0.634058
cf <- CIr(r = r, n = 2290, level = .95)  # cf holds the lower and upper bound
sprintf("\nThe 95%% confidence interval is [%g, %g]", cf[1], cf[2]) %>% cat 
## 
## The 95% confidence interval is [-0.657932, -0.608911]

Interpretation

Summary Statistics

In line with the problem statement, we first computed descriptive statistics for the eight variables relevant to our questions: TempVar, Rainfall, Evaporation, Sunshine, Humidity (9am and 3pm) and Temp (9am and 3pm). TempVar is the day's maximum temperature minus its minimum temperature, that is, the daily temperature range rather than a statistical variance. All other variables are taken directly from the dataset.
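The nine near-identical summarise() calls could be collapsed into one helper. The following is a minimal base R sketch of that idea, not the code used to produce the table: the function name describe and the toy data frame are hypothetical, and the toy values are made up for illustration only.

```r
# Hypothetical helper: one row of descriptive statistics for a numeric vector.
describe <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  data.frame(
    Min     = min(x, na.rm = TRUE),
    Q1      = q[1],
    Median  = median(x, na.rm = TRUE),
    Q3      = q[2],
    Max     = max(x, na.rm = TRUE),
    Mean    = mean(x, na.rm = TRUE),
    SD      = sd(x, na.rm = TRUE),
    n       = length(x),        # row count, including missing values
    Missing = sum(is.na(x))
  )
}

# Made-up toy data standing in for the Melbourne rows (not real observations).
toy <- data.frame(
  MaxTemp = c(21.0, 18.5, 30.2, NA, 15.1),
  MinTemp = c(12.0, 10.5, 16.2, 9.0, 11.1)
)
toy$TempVar <- toy$MaxTemp - toy$MinTemp  # daily temperature range

stats <- describe(toy$TempVar)
print(stats)
```

With the real data, a call such as do.call(rbind, lapply(rain[, c("TempVar", "Rainfall")], describe)) would be one way to stack the rows, assuming rain is coerced to a plain data frame first.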

Boxplots

We constructed boxplots for each of the variables, plus an additional one for Rainfall restricted to days with rain. Days without rain pull the distribution towards 0 (the median Rainfall over all days is 0.0 mm, against 2.0 mm on days with rain), so the separate boxplot gives a more accurate picture of how much rain falls when it does rain.
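To make the effect of the zero-rain days concrete, the sketch below computes the usual 1.5 x IQR boxplot fences from the quartiles in the summary table. It is an approximation under stated assumptions: the helper name tukey_fences is hypothetical, and R's boxplot() works from hinges rather than these quantiles, so the plotted whiskers may differ slightly.

```r
# Tukey fences: points below Q1 - k*IQR or above Q3 + k*IQR are drawn as outliers.
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles taken from the summary statistics table.
all_days  <- tukey_fences(q1 = 0.0, q3 = 1.2)  # Rainfall, all days
rain_days <- tukey_fences(q1 = 0.6, q3 = 5.4)  # Rainfall, days with rain only

print(all_days)
print(rain_days)
```

On all days, any rainfall above roughly 3 mm falls outside the upper fence, whereas on days with rain the upper fence sits near 12.6 mm, which is why the combined boxplot shows so many more points as outliers.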

Outliers

We kept all the outliers in our dataset. The primary reason is to keep the observations paired: each question compares two variables measured on the same day, so deleting an outlying value in one variable would leave its partner without a corresponding value. For example, if a humidity data point were deleted, the temperature at 9:00 am or 3:00 pm on that day would have no matching humidity value, which does not make sense conceptually because humidity always takes some value on its 0-100 scale. A value should only be missing if it was never collected. A secondary reason is that all eight variables are numeric measurements of a single physical quantity. By contrast, if we were comparing the prices of goods between two supermarkets, the products selected (such as meats or dairy products) could vary dramatically in type, so an extreme price might reflect a different kind of product. Temperature cannot be divided into subcategories in this way.

Linear Regression

We constructed four scatter plots and then conducted a linear regression and correlation test on each pair: Rainfall vs TempVar, Sunshine vs Evaporation, Humidity (9am) vs Temp (9am) and Humidity (3pm) vs Temp (3pm). Sunshine vs Evaporation had a positive trend and the other three had negative trends. For every fitted line, the constant and slope terms are statistically significant (all p-values are below 0.01 and none of the 95% confidence intervals contains 0), so we reject H0 and conclude that a linear relationship exists for each pair. The correlation analysis gave the same outcome: Sunshine and Evaporation were positively correlated (r = 0.289), and the other three pairs were negatively correlated (r = -0.086 for Rainfall vs TempVar, r = -0.537 for Humidity vs Temp at 9am and r = -0.634 at 3pm). The value under H0 (r = 0, no correlation) lies outside every 95% confidence interval, which allows us to reject H0 and conclude that the correlation is statistically significant for each pair. Significance does not imply strength, however: TempVar explains less than 1% of the variation in Rainfall (R-squared = 0.007), whereas temperature explains about 29% of the variation in humidity at 9am and about 40% at 3pm.
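The confidence-interval test for r can be sketched in base R. This assumes that CIr uses the standard Fisher z-transformation with standard error 1/sqrt(n - 3); the helper name fisher_ci is hypothetical. Under that assumption, the sketch reproduces the interval reported for Rainfall vs TempVar to about four decimal places.

```r
# Confidence interval for a Pearson correlation via the Fisher z-transformation.
fisher_ci <- function(r, n, level = 0.95) {
  z    <- atanh(r)                      # Fisher z-transform of r
  se   <- 1 / sqrt(n - 3)               # approximate standard error of z
  crit <- qnorm(1 - (1 - level) / 2)    # two-sided normal critical value
  tanh(c(lower = z - crit * se, upper = z + crit * se))  # back-transform to r scale
}

# r and n as reported for Rainfall vs TempVar on days with rain.
r  <- -0.0860813
ci <- fisher_ci(r = r, n = 904)
print(ci)

# H0 (r = 0) is rejected when 0 lies outside the interval.
rejects_h0 <- ci[["lower"]] > 0 || ci[["upper"]] < 0

# In simple linear regression, R-squared equals the squared correlation.
r_squared <- r^2
```

Here r_squared comes to about 0.0074, which agrees with the Multiple R-squared in the regression summary for the same pair, so the two analyses describe the same weak linear association.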

Discussion

Major Findings

Strengths

The strength of our analysis is that it looks at the weather patterns of Melbourne as a whole. Melbourne is often said to have the most unpredictable weather, and this analysis helps show how individual weather factors relate to one another. From the results, readers could plan their activities accordingly.

Limitations

The limitations of our analysis are the sample size and the number of weather factors. We considered only a few factors: temperature, rainfall, sunshine, humidity and evaporation. We selected these because they are the most common factors that define the weather of a particular location.

Findings

1. What is the effect of Rainfall on the variation of temperature in Melbourne?

2. What is the effect of Sunshine on evaporation?

3. What is the effect of Humidity on temperature?

Proposal for future investigations

A take home message for the reader

Final Conclusion

References