MATH1324 Assignment 2

If weather determined whether you live in Melbourne or Sydney

Arif Mutluel s3400285

Last updated: 23 October, 2020

Introduction

It’s a rivalry as old as time, Melbourne vs. Sydney. Which is the greater city?
It is also an argument that haunts me personally every other day. My wife and I met in Sydney in 2013. She was born and raised in Sydney and I was based up there for what should’ve been a temporary work placement. We spent the next 5 years living in Sydney.
After many years of persuasion, and possibly begging, she agreed to move to Melbourne, my home city.
My arguments drew on facts such as the following:
- The Economist Intelligence Unit’s (https://www.eiu.com/topic/liveability) 2017 Global Liveability Index named Melbourne the world’s top city for the seventh year running.
- Melbourne has been ranked 2nd, behind Vienna, for the last two years.
- Domain (https://www.domain.com.au/research/house-price-report/march-2020/) has the median house price in Sydney at a whopping $1,168,806 compared to Melbourne’s more affordable median house price of $918,350
- we won’t even begin to discuss the traffic…
But according to her, none of the above matters. The only attribute that matters is the weather, and Sydney has better weather (apparently).

Problem Statement

We’re going to test the theory that Sydney has better weather than Melbourne.
It can be quite a subjective topic, so we’ll base our investigation on both scientific and social aspects.
- Rainfall - We know rainfall is essential to survive, but no one wants it to rain every day. So we will measure:
  - How frequently does it rain?
  - Measure against the Bureau of Meteorology definition of a drought (http://www.bom.gov.au/climate/glossary/drought.shtml)
- Wind - I couldn’t find a single positive for wind. So we will measure:
  - no. of days with strong wind
  - breakdown of wind speeds
- Sunshine - Sunshine is linked to many health factors, primarily mental health and Vitamin D. We will measure:
  - which city has more hours of sunshine per day
- Temperature - Australia is famous for its beautiful coastline and beaches, but we also know there’s plenty of danger involved with extreme heat. The same applies to extreme cold.
  - We’ll take a comparative view of average temperatures between the two cities and make an analysis taking into consideration extreme temperatures
We’ll also perform a two-sample t-test to compare the average maximum temperature in Melbourne vs Sydney, i.e. do Melbourne and Sydney have different average maximum temperatures?

Data - Weather

Daily weather data was sourced from the Bureau of Meteorology
- Data is for a 12 month period from 1/09/2019 to 31/08/2020
- Daily data has been downloaded in monthly chunks and then combined
- Data must be sourced separately for both Melbourne and Sydney and then combined
- I’ve added a column to distinguish between the 2 cities (City)
- Another column has been added to identify the season (Season), i.e. dates between:
  - 1/09/2019 - 30/11/2019 = Spring
  - 1/12/2019 - 29/02/2020 = Summer
  - 1/03/2020 - 31/05/2020 = Autumn
  - 1/06/2020 - 31/08/2020 = Winter
- Furthermore, a column was added to record the month (Month)
- Missing values in Sunshine, max temp and min temp were replaced with the mean
The daily weather observations can be sourced directly from http://www.bom.gov.au/climate/dwo/
Observations were drawn from Melbourne (Olympic Park) {station 086338}
Temperature, humidity and rainfall observations are from Sydney (Observatory Hill) {station 066214}. Pressure, cloud, evaporation and sunshine observations are from Sydney Airport AMO {station 066037}. Wind observations are from Fort Denison {station 066022}
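The preparation steps above can be sketched in a few lines of base R. This is only an illustration on a made-up six-row data frame (`weather_demo` and its short `Sunshine` column name are hypothetical, not the real BoM file), and it assumes missing values are replaced with the mean for the same city, which the list above does not specify.

```r
# Hypothetical stand-in for the combined BoM data (not the real observations)
weather_demo <- data.frame(
  Date = as.Date(c("2019-09-01", "2019-12-15", "2020-02-29",
                   "2020-04-10", "2020-07-20", "2019-10-05")),
  City = c("Melbourne", "Melbourne", "Melbourne", "Sydney", "Sydney", "Sydney"),
  Sunshine = c(5.0, NA, 9.0, 7.5, NA, 8.5)
)

# Month as a factor ordered from September to August, matching the study period
month_num <- as.integer(format(weather_demo$Date, "%m"))
weather_demo$Month <- factor(month.name[month_num],
                             levels = month.name[c(9:12, 1:8)])

# Season derived from the month number (Australian meteorological seasons)
season_of <- function(m) {
  ifelse(m %in% 9:11, "Spring",
  ifelse(m %in% c(12, 1, 2), "Summer",
  ifelse(m %in% 3:5, "Autumn", "Winter")))
}
weather_demo$Season <- factor(season_of(month_num),
                              levels = c("Spring", "Summer", "Autumn", "Winter"))

# Mean imputation. Assumption: the mean is taken within each city
weather_demo$Sunshine <- ave(
  weather_demo$Sunshine, weather_demo$City,
  FUN = function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x)
)

weather_demo
```

Deriving Season from the month number rather than from hard-coded date ranges keeps the rule valid for any year, including the leap day on 29/02/2020.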

Data - Weather

The dataset contains 24 variables and 366 daily observations per city (one year’s worth of data). Only nine variables are relevant for the purposes of this analysis. These variables are:
- Date: Date format DD/MM/YYYY
- Minimum temperature (C): Minimum temperature in the 24 hours to 9am in degrees Celsius (numeric)
- Maximum temperature (C): Maximum temperature in the 24 hours from 9am in degrees Celsius (numeric)
- Rainfall (mm): Precipitation (rainfall) in the 24 hours to 9am in millimetres (numeric)
- Sunshine (hours): Bright sunshine in the 24 hours to midnight in hours (numeric)
- Speed of maximum wind gust (km/h): Speed of strongest wind gust in the 24 hours to midnight in kilometres per hour (numeric)
- City: City of observation (factor) (Levels: Melbourne, Sydney)
- Season: Season depending on time of year (factor) (Levels: Spring, Summer, Autumn, Winter)
- Month: (factor) (Levels: September, October, November, December, January, February, March, April, May, June, July, August)
More information on variables and definitions can be found at: http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml. Note that the last three variables (City, Season and Month) were not in the original data set.

Descriptive Statistics and Visualisation - Rainfall

Total rainfall comparison

wcm_tot <- weather %>% group_by(City) %>% summarise(`Total Rainfall` = sum(`Rainfall (mm)`, na.rm = TRUE))
kableExtra::kable(wcm_tot)

City	Total Rainfall
Melbourne	713.8
Sydney	1331.2

Sydney’s definitely wetter, almost double the rainfall. Let’s look at rainfall on a monthly basis

wcm <- weather %>% group_by(City, Month) %>% summarise(`Total Rainfall (mm)` = sum(`Rainfall (mm)`, na.rm = TRUE))
ggplot(wcm, aes(x = Month, y = `Total Rainfall (mm)`, fill = City)) +  geom_col()

#A lot of green on the chart, but it looks like Sydney had a significant amount of rainfall in February which may be skewing the totals

Descriptive Statistics and Visualisation - Rainfall

Here we’ll look at the numbers without the outlier (February)

wcm_tot_clean <- weather %>% filter(Month != "February") %>% group_by(City) %>%summarise(`Total Rainfall` = sum(`Rainfall (mm)`, na.rm = TRUE))
kableExtra::kable(wcm_tot_clean)

City	Total Rainfall
Melbourne	637.6
Sydney	896.8

wcm_clean <- weather %>% filter(Month != "February") %>% group_by(City, Month) %>%summarise(`Total Rainfall` = sum(`Rainfall (mm)`, na.rm = TRUE))
ggplot(wcm_clean, aes(x = Month, y = `Total Rainfall`, fill = City)) +  geom_col()

#Better, but we can see from the table and chart that Sydney received more rain. At this stage we'd look at holidaying in Sydney in April

Descriptive Statistics and Visualisation - Rainfall

The good news is that neither city fits the BoM’s definition of drought, as the yearly rainfall is greater than the average rainfall for both Sydney and Melbourne. Which city is more likely to receive rain on any given day?

# First, we shall replace missing values in Rainfall with 0. Assuming no record means no rain. 
sum(is.na(weather$`Rainfall (mm)`))

## [1] 6

weather$`Rainfall (mm)`[is.na(weather$`Rainfall (mm)`)] <- 0

# First let's calculate the days without rain per city and then calculate probability of rain. Count of days with no rain for Melbourne 
weatherMel <- weather %>% filter(City == "Melbourne") %>% filter(`Rainfall (mm)` == 0) %>% summarise(Melb_NoRain_Day_Count = n())
kableExtra::kable(weatherMel)

Melb_NoRain_Day_Count
222

#Probability of rain during any given day in Melbourne (n = 366 (leap year), f = 366 - 222 = 144)
144 / 366

## [1] 0.3934426

#Repeat the same for Sydney. Count of days with no rain for Sydney
weatherSyd <- weather %>% filter(City == "Sydney") %>% filter(`Rainfall (mm)` == 0) %>% summarise(Syd_NoRain_Day_Count = n())
kableExtra::kable(weatherSyd)

Syd_NoRain_Day_Count
234

#Probability of rain during any given day in Sydney (n = 366 (leap year), f = 366 - 234 = 132)
132 / 366

## [1] 0.3606557

There is a higher probability of a rainy day in Melbourne; however, we’re likely to receive more rain overall in Sydney.
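The same probability can be computed in one step rather than by counting dry days and subtracting by hand. The sketch below uses a small made-up data frame (`rain_demo` and its values are hypothetical) and base R only; the share of days with rainfall above zero is simply the mean of a logical vector.

```r
# Hypothetical rainfall readings in mm (not the real BoM observations)
rain_demo <- data.frame(
  City = rep(c("Melbourne", "Sydney"), each = 5),
  Rainfall = c(0, 1.2, 0, 0.4, 3.0, 0, 0, 12.6, 0, NA)
)

# Same assumption as above: no record means no rain
rain_demo$Rainfall[is.na(rain_demo$Rainfall)] <- 0

# Proportion of days with any rain, per city
rain_prob <- tapply(rain_demo$Rainfall > 0, rain_demo$City, mean)
rain_prob

# The Melbourne figure above follows the same logic: wet days / all days
(366 - 222) / 366
```

Computing the proportion directly from the data avoids copying counts between steps, which is where hand arithmetic tends to go wrong.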

Descriptive Statistics and Visualisation - Wind

A seasonal comparison of wind breakdown between cities

ggplot(weather, aes(x = Date, y = `Speed of maximum wind gust (km/h)`, color = City)) + geom_point() + facet_wrap(~ Season)

#Looking at the diagram, Sydney records stronger winds more often. Let's break this down further

Descriptive Statistics and Visualisation - Wind

BoM classes winds of 26 knots or more as strong.

# First I'll create a variable to convert the wind measurement from (km/h) to knots
knots <- weather %>% mutate(knots1 = `Speed of maximum wind gust (km/h)`/ 1.852)
# Measure the number of days with strong wind
windy <- knots %>% group_by(City) %>% filter(knots1 >= 26) %>% summarise(strongwind = n())
kableExtra::kable(windy)

City	strongwind
Melbourne	38
Sydney	98

#Wind speeds in knots broken down proportionally into buckets of 10 knots, start = 0 knots & finish 60 knots
melbknots <- knots %>% filter (City == "Melbourne")
sydknots <- knots %>% filter (City == "Sydney")
melbcut <- cut(melbknots$knots1, breaks = seq(0,60,10))
sydcut <- cut(sydknots$knots1, breaks = seq(0,60,10))
melbcut %>% table() %>% prop.table()

## .
##      (0,10]     (10,20]     (20,30]     (30,40]     (40,50]     (50,60] 
## 0.033240997 0.601108033 0.318559557 0.044321330 0.002770083 0.000000000

sydcut %>% table() %>% prop.table()

## .
##      (0,10]     (10,20]     (20,30]     (30,40]     (40,50]     (50,60] 
## 0.005555556 0.438888889 0.355555556 0.172222222 0.019444444 0.008333333
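To read the two bucket tables as "share of days at or below 30 knots", the proportions can be accumulated with `cumsum()`. The sketch below re-enters the printed proportions by hand, so it is a check on the output above rather than a rerun on the data; `kmh_to_knots` is a hypothetical helper name for the same conversion used in the `mutate()` step.

```r
# Conversion used above: 1 knot = 1.852 km/h
kmh_to_knots <- function(kmh) kmh / 1.852
kmh_to_knots(48.152)  # the 26-knot strong-wind threshold expressed in km/h

# Proportions copied from the printed tables above
buckets <- c("(0,10]", "(10,20]", "(20,30]", "(30,40]", "(40,50]", "(50,60]")
melb_props <- setNames(c(0.033240997, 0.601108033, 0.318559557,
                         0.044321330, 0.002770083, 0.000000000), buckets)
syd_props  <- setNames(c(0.005555556, 0.438888889, 0.355555556,
                         0.172222222, 0.019444444, 0.008333333), buckets)

# Cumulative share of days with a maximum gust of 30 knots or less
melb_le30 <- cumsum(melb_props)[["(20,30]"]]
syd_le30  <- cumsum(syd_props)[["(20,30]"]]
round(c(Melbourne = melb_le30, Sydney = syd_le30), 3)
# Melbourne is about 0.953 and Sydney about 0.800
```

Because `cut()` builds right-closed intervals by default, the `(20,30]` bucket includes a gust of exactly 30 knots, so the cumulative figure is "30 knots or less" rather than "less than 30 knots".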

Descriptive Statistics and Visualisation - Hours of Sunshine

weather %>% group_by(City) %>% summarise(Min = min(`Sunshine (hours)`,na.rm = TRUE),
                                           Q1 = quantile(`Sunshine (hours)`,probs = .25,na.rm = TRUE),
                                           Median = median(`Sunshine (hours)`, na.rm = TRUE),
                                           Q3 = quantile(`Sunshine (hours)`,probs = .75,na.rm = TRUE),
                                           Max = max(`Sunshine (hours)`,na.rm = TRUE),
                                           Mean = mean(`Sunshine (hours)`, na.rm = TRUE),
                                           SD = sd(`Sunshine (hours)`, na.rm = TRUE),
                                           n = n(),
                                           Missing = sum(is.na(`Sunshine (hours)`))) -> tablesun
knitr::kable(tablesun)

City	Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
Melbourne	0	3.2	6.4	8.475	13.6	6.048087	3.544354	366	0
Sydney	0	3.7	7.9	10.200	12.9	6.940710	3.821792	366	0

# Using the Central Limit Theorem, what is the probability of randomly selecting a sample of 183 days (half the year) that has an average of at least 6 hours and 30 minutes of sunlight in Melbourne?

pnorm(q = 6.5, mean = 6.048087, sd = 3.544354/sqrt(183), lower.tail = FALSE)

## [1] 0.04228013

# Applying a similar approach to Sydney, what is the probability of randomly selecting a sample of 183 days (half the year) that has an average of at least 6 hours and 30 minutes of sunlight in Sydney?
pnorm(q = 6.5, mean = 6.940710, sd = 3.821792/sqrt(183), lower.tail = FALSE)

## [1] 0.9406145

# You're entitled to feel a bit flat at this point, Melburnians
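The two `pnorm()` calls share one idea: under the Central Limit Theorem the sample mean is treated as approximately normal with mean equal to the city mean and standard error SD / sqrt(n). A minimal sketch that wraps this in a helper (the function name `prob_mean_at_least` is my own, not from any package) and also shows the z-scores behind each probability:

```r
# Probability that the mean of n days is at least `threshold`,
# treating the city mean and SD from the summary table as population values
prob_mean_at_least <- function(threshold, mu, sigma, n) {
  se <- sigma / sqrt(n)  # standard error of the sample mean
  pnorm(threshold, mean = mu, sd = se, lower.tail = FALSE)
}

p_melb <- prob_mean_at_least(6.5, mu = 6.048087, sigma = 3.544354, n = 183)
p_syd  <- prob_mean_at_least(6.5, mu = 6.940710, sigma = 3.821792, n = 183)

# z-scores: how many standard errors the 6.5 hour threshold sits from each mean
z_melb <- (6.5 - 6.048087) / (3.544354 / sqrt(183))
z_syd  <- (6.5 - 6.940710) / (3.821792 / sqrt(183))

round(c(p_melb = p_melb, p_syd = p_syd, z_melb = z_melb, z_syd = z_syd), 4)
```

The threshold sits roughly 1.7 standard errors above Melbourne's mean and roughly 1.6 standard errors below Sydney's, which is why the two probabilities land at opposite ends of the scale.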

Descriptive Statistics and Visualisation - Hours of Sunshine

Let’s visualise Hours of Sunshine

weather %>% boxplot(`Sunshine (hours)` ~ City,data = ., na.rm=TRUE, main="Box Plot of Sunshine (hours) by City", 
                    ylab="City", xlab="Sunshine Hours",horizontal=TRUE, col = "orange")

The box plot makes the picture clearer: although Melbourne has a larger maximum, its upper quartile is noticeably lower than Sydney’s.

Descriptive Statistics and Visualisation - Temperature

# Before we measure temperature I'm going to tidy the data a little more and combine minimum and maximum temperatures
tidyw <- weather %>% select(Date, `Maximum temperature (C)`, `Minimum temperature (C)`, City) %>%
  gather(`Maximum temperature (C)`, `Minimum temperature (C)`, key = "Measurement", value = "Temperature")
# Plotting the Maximum and Minimum temperatures 
ggplot(tidyw, aes(x = Date, y = Temperature)) + geom_point(aes(color = City)) +
   expand_limits(y = 0) + facet_wrap(~ Measurement)

The chart shows that Melbourne’s minimum temperatures are noticeably colder than Sydney’s at the same point in time. There’s a bit more overlap with the maximum temperatures and we’ll have a further look into this. Melbourne also appears to have a larger range of maximum temperatures.

Descriptive Statistics and Visualisation - Temperature

weather %>% group_by(City) %>% summarise(Min = min(`Maximum temperature (C)`,na.rm = TRUE),
                                           Q1 = quantile(`Maximum temperature (C)`,probs = .25,na.rm = TRUE),
                                           Median = median(`Maximum temperature (C)`, na.rm = TRUE),
                                           Q3 = quantile(`Maximum temperature (C)`,probs = .75,na.rm = TRUE),
                                           Max = max(`Maximum temperature (C)`,na.rm = TRUE),
                                           Mean = mean(`Maximum temperature (C)`, na.rm = TRUE),
                                           SD = sd(`Maximum temperature (C)`, na.rm = TRUE),
                                           n = n(),
                                           Missing = sum(is.na(`Maximum temperature (C)`))) -> tabletemp
knitr::kable(tabletemp)

City	Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
Melbourne	10.3	15.325	18.6	22.925	43.5	19.93880	6.143188	366	0
Sydney	13.6	19.800	23.5	26.900	41.2	23.72158	4.899242	366	0

weather %>% boxplot(`Maximum temperature (C)` ~ City,data = ., na.rm=TRUE, main="Box Plot of Maximum temperature by City", 
                    ylab="City", xlab="Temperature", horizontal = TRUE, col = "red")

# The outliers in Melbourne accentuate the variability of Melbourne's maximum temperatures

Hypothesis Testing - Two-sample t-tests

It looks fairly clear that Sydney has a higher average maximum temperature than Melbourne; however, we will use a two-sample t-test to check whether the difference is statistically significant.
We don’t need to test the assumption of normality as n > 30
- we have 366 observations for each city
Homogeneity of Variance: Using Levene’s test

leveneTest(`Maximum temperature (C)` ~ City, data = weather)

The p-value for Levene’s test of equal variance for maximum temperature between Sydney and Melbourne was p=0.03474149. As p<.05, Levene’s test was statistically significant and we reject H0. It’s safe to assume unequal variance.
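For readers without the car package, the idea behind Levene's test can be sketched in base R: take each observation's absolute deviation from its group centre and run a one-way ANOVA on those deviations. The sketch uses simulated data (`temp_demo` is hypothetical) and centres on the group median, which I understand to be the default for `car::leveneTest`, so treat that detail as an assumption.

```r
set.seed(1324)

# Simulated maximum temperatures (not the real BoM observations)
temp_demo <- data.frame(
  City = factor(rep(c("Melbourne", "Sydney"), each = 100)),
  MaxTemp = c(rnorm(100, mean = 20, sd = 6), rnorm(100, mean = 24, sd = 5))
)

# Absolute deviation of each observation from its own city's median
city_median <- ave(temp_demo$MaxTemp, temp_demo$City, FUN = median)
temp_demo$AbsDev <- abs(temp_demo$MaxTemp - city_median)

# One-way ANOVA on the absolute deviations; its F test is the Levene statistic
levene_fit <- anova(lm(AbsDev ~ City, data = temp_demo))
levene_p <- levene_fit[["Pr(>F)"]][1]
levene_p
```

A small p-value here means the typical spread around the median differs between the cities, which is the condition that points towards the Welch version of the t-test.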

Hypothesis Testing - Two-sample t-tests

We need to use the Welch two-sample t-test as the assumption of equal variance was violated

t.test(
  `Maximum temperature (C)` ~ City,
  data = weather,
  var.equal = FALSE,
  alternative = "two.sided"
  )

## 
##  Welch Two Sample t-test
## 
## data:  Maximum temperature (C) by City
## t = -9.2101, df = 695.57, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.589189 -2.976384
## sample estimates:
## mean in group Melbourne    mean in group Sydney 
##                19.93880                23.72158

A Welch two-sample t-test was used to test for a significant difference between the mean maximum temperatures of Sydney and Melbourne. Given the large sample size in each group, the central limit theorem meant the t-test could be applied without testing for normality. Levene’s test of homogeneity of variance indicated that the assumption of equal variance was violated. The two-sample t-test assuming unequal variance found a statistically significant difference between the mean maximum temperatures of Sydney and Melbourne, t(df=695.57)=−9.21, p<.001, 95% CI for the difference in means [-4.59, -2.98]. The results of the investigation suggest that Sydney has a significantly higher average maximum temperature than Melbourne.
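As a cross-check, the Welch statistic, its degrees of freedom and the confidence interval can be rebuilt from the summary table alone (means, SDs and n per city). This is a sketch of the standard Welch formulas using the rounded figures reported earlier, so the last decimal places may differ slightly from the `t.test()` output.

```r
# Summary statistics for maximum temperature, copied from the table above
m_melb <- 19.93880; s_melb <- 6.143188; n_melb <- 366
m_syd  <- 23.72158; s_syd  <- 4.899242; n_syd  <- 366

# Squared standard error contributed by each city
v_melb <- s_melb^2 / n_melb
v_syd  <- s_syd^2 / n_syd
se_diff <- sqrt(v_melb + v_syd)

# Welch t statistic (Melbourne minus Sydney, matching the factor order)
t_stat <- (m_melb - m_syd) / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_melb + v_syd)^2 /
  (v_melb^2 / (n_melb - 1) + v_syd^2 / (n_syd - 1))

# 95% confidence interval for the difference in means
ci <- (m_melb - m_syd) + c(-1, 1) * qt(0.975, df_welch) * se_diff

round(c(t = t_stat, df = df_welch, lower = ci[1], upper = ci[2]), 2)
```

The negative sign only reflects the order of subtraction: Melbourne's mean minus Sydney's mean is negative because Sydney is warmer.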

Discussion

Sydney received almost double (1.86x) the amount of rain that Melbourne did between Sep 19 and Aug 20. Surprisingly, most of that rain came during summer. Sydney received more rain in February than in the winter months combined, a clear outlier. With February removed from the calculations, the two cities’ rainfall totals are more closely aligned.
Although wetter, Sydney had a lower probability of having a rainy day.
Sydney had significantly more days with stronger winds. Sydney observed 98 days of winds greater than 26 knots compared to Melbourne’s 38.
The probability of a Melbourne maximum gust being 30 knots or less was 0.953, compared to Sydney’s 0.800.
However, wind for Sydney was observed from Fort Denison, which is a little island in Sydney Harbour, so you’d expect open waters to record stronger winds.
On average, Sydney observed almost an hour more of sunlight daily than Melbourne. The probability of randomly selecting a sample of 183 days with an average of at least 6.5 hours of sunlight in Melbourne was only 0.042, compared to Sydney’s probability of 0.941.
During the summer months Melbourne had some extreme heat days, however on average Sydney consistently had warmer days.
In fact, Sydney’s average maximum temperature was hotter than Melbourne’s Q3 value.
The data is limited to 12 months of observations and is not an accurate guide to ongoing weather predictions or historical representations.
Large parts of Sydney and wider parts of NSW were experiencing a drought not many years ago.
We understand that weather is cyclical. Therefore, we would need several decades of data to draw accurate comparisons.

Discussion - continued

We also need to take into consideration the impacts of global warming and the likely impact this will have on weather going forward.
I measured 4 variables (rainfall, wind, hours of sunshine and temperature), and you could possibly draw an association between hours of sunshine and temperature. However, it would be a worthwhile exercise to do further analysis on each variable to draw more precise findings:
- Does it rain in catchment areas or regions where water is required the most (i.e. agricultural growth zones)?
- How does the wind compare in Fort Denison to Lidcombe/Auburn, which is the geographical centre of Sydney’s population?
- When is the sun shining? Is it during work hours or after hours?
- What time is sunrise and sundown?
- What temperatures are ideal for healthy lifestyles? What is the scientifically ideal temperature? What is a healthy temperature for the ecosystem?
The list goes on and on, however given the weather analysis that has taken place we observed;
- Melbourne has more rainy days but less rain, a tick for Sydney
- Sydney has windier weather, a tick for Melbourne
- Sydney has sunshine for longer than Melbourne, a tick for Sydney
- Melbourne has more extreme heat days but is consistently colder than Sydney, a tick for Sydney
The analysis strongly suggests that Sydney has better weather based on the criteria outlined, which are based on preference rather than science.

References

Bureau of Meteorology, Drought, http://www.bom.gov.au/climate/glossary/drought.shtml
Bureau of Meteorology, Daily Weather Observations, http://www.bom.gov.au/climate/dwo/
Bureau of Meteorology, Notes to accompany Daily Weather Observations, http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml
Bureau of Meteorology, Six things you need to know about wind warnings, http://www.bom.gov.au/marine/about/six-things-about-wind-warnings.shtml
dplyr, Create, modify, and delete columns, https://dplyr.tidyverse.org/reference/mutate.html
R Markdown: The Definitive Guide, https://bookdown.org/yihui/rmarkdown/markdown-syntax.html
Datanovia, How to Create a GGPlot with Multiple Lines, https://www.datanovia.com/en/blog/how-to-create-a-ggplot-with-multiple-lines/
Zevross, Tips and tricks for working with images and figures in R Markdown documents, http://zevross.com/blog/2017/06/19/tips-and-tricks-for-working-with-images-and-figures-in-r-markdown-documents/
R-Bloggers, R Function of the Day: cut, https://www.r-bloggers.com/2009/09/r-function-of-the-day-cut-2/
Baglin, J 2020, Module 7 Testing the Null: Data on Trial, Course Material, RMIT University, https://astral-theory-157510.appspot.com/secured/MATH1324_Module_07.html#overview
The Economist, The Global Liveability Index, https://www.eiu.com/topic/liveability
Domain, Domain House Price Report, https://www.domain.com.au/research/house-price-report/march-2020/