MATH1324 Assignment 2

If weather determined whether you live in Melbourne or Sydney

Arif Mutluel s3400285

Last updated: 23 October, 2020

Introduction

Problem Statement

Data - Weather

Descriptive Statistics and Visualisation - Rainfall

Total rainfall comparison

wcm_tot <- weather %>% group_by(City) %>% summarise(`Total Rainfall` = sum(`Rainfall (mm)`, na.rm = TRUE))
kableExtra::kable(wcm_tot)
City Total Rainfall
Melbourne 713.8
Sydney 1331.2

Sydney is clearly wetter, with almost double Melbourne’s total rainfall (1331.2 mm versus 713.8 mm). Let’s look at rainfall on a monthly basis.

wcm <- weather %>% group_by(City, Month) %>% summarise(`Total Rainfall (mm)` = sum(`Rainfall (mm)`, na.rm = TRUE))
ggplot(wcm, aes(x = Month, y = `Total Rainfall (mm)`, fill = City)) +  geom_col()

# A lot of green on the chart, but it looks like Sydney had an unusually large amount of rainfall in February, which may be skewing the totals.

Descriptive Statistics and Visualisation - Rainfall

Here we’ll look at the numbers without the outlying month (February).

wcm_tot_clean <- weather %>% filter(Month != "February") %>% group_by(City) %>% summarise(`Total Rainfall` = sum(`Rainfall (mm)`, na.rm = TRUE))
kableExtra::kable(wcm_tot_clean)
City Total Rainfall
Melbourne 637.6
Sydney 896.8
wcm_clean <- weather %>% filter(Month != "February") %>% group_by(City, Month) %>% summarise(`Total Rainfall` = sum(`Rainfall (mm)`, na.rm = TRUE))
ggplot(wcm_clean, aes(x = Month, y = `Total Rainfall`, fill = City)) +  geom_col()

# Better, but the table and chart still show that Sydney received more rain. At this stage we'd look at holidaying in Sydney in April.
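How much of each city's annual total fell in February can be read off the two tables above by subtraction. The sketch below only reuses the printed totals; the variable names are illustrative and not part of the original analysis.

```r
# Totals copied from the two tables above (mm).
total_all    <- c(Melbourne = 713.8, Sydney = 1331.2)  # full year
total_no_feb <- c(Melbourne = 637.6, Sydney = 896.8)   # February excluded

# February rainfall is the difference between the two totals.
feb_rain <- total_all - total_no_feb

# Share of the annual total that fell in February.
feb_share <- feb_rain / total_all

round(feb_rain, 1)
round(100 * feb_share, 1)
```

On these figures February accounts for roughly a third of Sydney's annual total but only about a tenth of Melbourne's, which is why removing it narrows the gap so much.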

Descriptive Statistics and Visualisation - Rainfall

The good news is that neither city fits the BoM’s definition of drought, as the yearly rainfall is greater than the average rainfall for both Sydney and Melbourne. Which city is more likely to receive rain on any given day?

# First, we shall replace missing values in Rainfall with 0. Assuming no record means no rain. 
sum(is.na(weather$`Rainfall (mm)`))
## [1] 6
weather$`Rainfall (mm)`[is.na(weather$`Rainfall (mm)`)] <- 0

# Next, count the days without rain per city and then calculate the probability of rain. Count of days with no rain for Melbourne
weatherMel <- weather %>% filter(City == "Melbourne") %>% filter(`Rainfall (mm)` == 0) %>% summarise(Melb_NoRain_Day_Count = n())
kableExtra::kable(weatherMel)
Melb_NoRain_Day_Count
222
#Probability of rain during any given day in Melbourne (n = 366 (leap year), f = 366 - 222 = 144)
144 / 366
## [1] 0.3934426
#Repeat the same for Sydney. Count of days with no rain for Sydney
weatherSyd <- weather %>% filter(City == "Sydney") %>% filter(`Rainfall (mm)` == 0) %>% summarise(Syd_NoRain_Day_Count = n())
kableExtra::kable(weatherSyd)
Syd_NoRain_Day_Count
234
# Probability of rain on any given day in Sydney (n = 366 (leap year), f = 366 - 234 = 132)
132 / 366
## [1] 0.3606557

Melbourne has the higher probability of rain on any given day; however, Sydney receives more rain in total.
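The per-city probability can also be computed directly from the data rather than by typing the counts in by hand, which avoids transcription slips. This is a minimal sketch on a made-up data frame: `toy_weather` and its values are hypothetical, and only the column names follow the analysis above.

```r
library(dplyr)

# Hypothetical stand-in for the `weather` data frame (values invented).
toy_weather <- data.frame(
  City = rep(c("Melbourne", "Sydney"), each = 5),
  `Rainfall (mm)` = c(0, 1.2, 0, 3.4, 0.2, 0, 0, 12.6, NA, 0),
  check.names = FALSE
)

rain_prob <- toy_weather %>%
  # Same assumption as above: a missing record means no rain.
  mutate(`Rainfall (mm)` = coalesce(`Rainfall (mm)`, 0)) %>%
  group_by(City) %>%
  summarise(
    days       = n(),
    rainy_days = sum(`Rainfall (mm)` > 0),
    p_rain     = rainy_days / days
  )

rain_prob
```

Applied to the real `weather` data, the same pipeline would give both probabilities in one table, with `days` equal to 366 per city.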

Descriptive Statistics and Visualisation - Wind

A seasonal comparison of maximum wind gust speeds between the two cities

ggplot(weather, aes(x = Date, y = `Speed of maximum wind gust (km/h)`, color = City)) + geom_point() + facet_wrap(~ Season)

# Looking at the plot, Sydney shows stronger wind gusts more often. Let's break this down further.

Descriptive Statistics and Visualisation - Wind

The BoM defines strong wind as 26 knots or more.

# First I'll create a variable to convert the wind measurement from (km/h) to knots
knots <- weather %>% mutate(knots1 = `Speed of maximum wind gust (km/h)`/ 1.852)
# Measure the number of days with strong wind
windy <- knots %>% group_by(City) %>% filter(knots1 >= 26) %>% summarise(strongwind = n())
kableExtra::kable(windy)
City strongwind
Melbourne 38
Sydney 98
#Wind speeds in knots broken down proportionally into buckets of 10 knots, start = 0 knots & finish 60 knots
melbknots <- knots %>% filter (City == "Melbourne")
sydknots <- knots %>% filter (City == "Sydney")
melbcut <- cut(melbknots$knots1, breaks = seq(0,60,10))
sydcut <- cut(sydknots$knots1, breaks = seq(0,60,10))
melbcut %>% table() %>% prop.table()
## .
##      (0,10]     (10,20]     (20,30]     (30,40]     (40,50]     (50,60] 
## 0.033240997 0.601108033 0.318559557 0.044321330 0.002770083 0.000000000
sydcut %>% table() %>% prop.table()
## .
##      (0,10]     (10,20]     (20,30]     (30,40]     (40,50]     (50,60] 
## 0.005555556 0.438888889 0.355555556 0.172222222 0.019444444 0.008333333
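The conversion and threshold used above can be checked on a few made-up gust speeds. In this sketch the vector `gusts_kmh` and the helper `kmh_to_knots` are hypothetical; only the 1.852 factor and the 26-knot threshold come from the analysis.

```r
KMH_PER_KNOT      <- 1.852  # 1 knot = 1.852 km/h
STRONG_WIND_KNOTS <- 26     # threshold used in the analysis above

# Hypothetical helper: convert km/h to knots.
kmh_to_knots <- function(kmh) kmh / KMH_PER_KNOT

# Invented gust speeds in km/h, including one missing reading.
gusts_kmh   <- c(20, 48, 50, 65, NA)
gusts_knots <- kmh_to_knots(gusts_kmh)

# Flag strong-wind days; the missing reading stays NA and is not counted.
is_strong   <- gusts_knots >= STRONG_WIND_KNOTS
strong_days <- sum(is_strong, na.rm = TRUE)

# The same threshold expressed in km/h.
threshold_kmh <- STRONG_WIND_KNOTS * KMH_PER_KNOT

# Proportions per 10-knot bucket, as in the tables above.
bucket_props <- prop.table(table(cut(gusts_knots, breaks = seq(0, 60, 10))))
```

A gust of 48 km/h falls just short of the threshold (about 25.9 knots), while 50 km/h clears it, so the cut-off in the original units is a little over 48 km/h.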

Descriptive Statistics and Visualisation - Hours of Sunshine

weather %>% group_by(City) %>% summarise(Min = min(`Sunshine (hours)`,na.rm = TRUE),
                                           Q1 = quantile(`Sunshine (hours)`,probs = .25,na.rm = TRUE),
                                           Median = median(`Sunshine (hours)`, na.rm = TRUE),
                                           Q3 = quantile(`Sunshine (hours)`,probs = .75,na.rm = TRUE),
                                           Max = max(`Sunshine (hours)`,na.rm = TRUE),
                                           Mean = mean(`Sunshine (hours)`, na.rm = TRUE),
                                           SD = sd(`Sunshine (hours)`, na.rm = TRUE),
                                           n = n(),
                                           Missing = sum(is.na(`Sunshine (hours)`))) -> tablesun
knitr::kable(tablesun)
City       Min  Q1   Median  Q3      Max   Mean      SD        n    Missing
Melbourne  0    3.2  6.4     8.475   13.6  6.048087  3.544354  366  0
Sydney     0    3.7  7.9     10.200  12.9  6.940710  3.821792  366  0
# Using the Central Limit Theorem: what is the probability that a random sample of 183 days (half the year) has an average of at least 6 hours and 30 minutes of sunshine in Melbourne?

pnorm(q = 6.5, mean = 6.048087, sd = 3.544354/sqrt(183), lower.tail = FALSE)
## [1] 0.04228013
# Applying the same approach to Sydney: what is the probability that a random sample of 183 days (half the year) has an average of at least 6 hours and 30 minutes of sunshine in Sydney?
pnorm(q = 6.5, mean = 6.940710, sd = 3.821792/sqrt(183), lower.tail = FALSE)
## [1] 0.9406145
# You're entitled to feel a bit flat at this point, Melburnians
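Both calculations follow the same pattern: under the Central Limit Theorem the sample mean is treated as approximately normal with mean equal to the city's mean and standard error SD / sqrt(n). A small wrapper makes that explicit. The function name `prob_mean_at_least` is hypothetical; the inputs are the Mean and SD values from the summary table above.

```r
# Hypothetical helper: P(sample mean >= threshold) under the CLT approximation.
prob_mean_at_least <- function(threshold, mu, sigma, n) {
  se <- sigma / sqrt(n)  # standard error of the sample mean
  pnorm(threshold, mean = mu, sd = se, lower.tail = FALSE)
}

# Mean and SD of daily sunshine hours, taken from the summary table.
p_melb <- prob_mean_at_least(6.5, mu = 6.048087, sigma = 3.544354, n = 183)
p_syd  <- prob_mean_at_least(6.5, mu = 6.940710, sigma = 3.821792, n = 183)

# z-scores of the 6.5-hour threshold, for interpretation.
z_melb <- (6.5 - 6.048087) / (3.544354 / sqrt(183))
z_syd  <- (6.5 - 6.940710) / (3.821792 / sqrt(183))
```

The threshold sits about 1.7 standard errors above Melbourne's mean and about 1.6 standard errors below Sydney's, which is why the two probabilities land at opposite ends of the scale.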

Descriptive Statistics and Visualisation - Hours of Sunshine

Let’s visualise Hours of Sunshine

weather %>% boxplot(`Sunshine (hours)` ~ City,data = ., na.rm=TRUE, main="Box Plot of Sunshine (hours) by City", 
                    ylab="City", xlab="Sunshine Hours",horizontal=TRUE, col = "orange")

The box plot makes the picture clearer: although Melbourne has the larger maximum (13.6 versus 12.9 hours), its upper quartile (8.475 hours) is noticeably lower than Sydney’s (10.2 hours).

Descriptive Statistics and Visualisation - Temperature

# Before we measure temperature I'm going to tidy the data a little more and combine minimum and maximum temperatures
tidyw <- weather %>% select(Date, `Maximum temperature (C)`, `Minimum temperature (C)`, City) %>%
  gather(`Maximum temperature (C)`, `Minimum temperature (C)`, key = "Measurement", value = "Temperature")
# Plotting the Maximum and Minimum temperatures 
ggplot(tidyw, aes(x = Date, y = Temperature)) + geom_point(aes(color = City)) +
   expand_limits(y = 0) + facet_wrap(~ Measurement)

The chart shows that Melbourne’s minimum temperatures are noticeably colder than Sydney’s at the same point in time. There’s a bit more overlap in the maximum temperatures, which we’ll look into further. Melbourne also appears to have a larger range of maximum temperatures.
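The reshaping step that combines the two temperature columns uses `gather()`, which tidyr now marks as superseded in favour of `pivot_longer()`. The sketch below shows the equivalent call on a made-up two-row data frame; `toy` and its values are hypothetical, and only the column names follow the analysis.

```r
library(tidyr)

# Hypothetical two-row stand-in for the selected weather columns.
toy <- data.frame(
  Date = as.Date(c("2020-01-01", "2020-01-01")),
  City = c("Melbourne", "Sydney"),
  `Maximum temperature (C)` = c(25.1, 27.3),
  `Minimum temperature (C)` = c(14.2, 19.8),
  check.names = FALSE
)

# One row per Date, City and Measurement, with the reading in Temperature.
tidy_toy <- pivot_longer(
  toy,
  cols      = c(`Maximum temperature (C)`, `Minimum temperature (C)`),
  names_to  = "Measurement",
  values_to = "Temperature"
)

tidy_toy
```

Each input row becomes two output rows, one per measurement, which is the shape `facet_wrap(~ Measurement)` needs.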

Descriptive Statistics and Visualisation - Temperature

weather %>% group_by(City) %>% summarise(Min = min(`Maximum temperature (C)`,na.rm = TRUE),
                                           Q1 = quantile(`Maximum temperature (C)`,probs = .25,na.rm = TRUE),
                                           Median = median(`Maximum temperature (C)`, na.rm = TRUE),
                                           Q3 = quantile(`Maximum temperature (C)`,probs = .75,na.rm = TRUE),
                                           Max = max(`Maximum temperature (C)`,na.rm = TRUE),
                                           Mean = mean(`Maximum temperature (C)`, na.rm = TRUE),
                                           SD = sd(`Maximum temperature (C)`, na.rm = TRUE),
                                           n = n(),
                                           Missing = sum(is.na(`Maximum temperature (C)`))) -> tabletemp
knitr::kable(tabletemp)
City       Min   Q1      Median  Q3      Max   Mean      SD        n    Missing
Melbourne  10.3  15.325  18.6    22.925  43.5  19.93880  6.143188  366  0
Sydney     13.6  19.800  23.5    26.900  41.2  23.72158  4.899242  366  0
weather %>% boxplot(`Maximum temperature (C)` ~ City,data = ., na.rm=TRUE, main="Box Plot of Maximum temperature by City", 
                    ylab="City", xlab="Temperature", horizontal = TRUE, col = "red")

# The outliers in Melbourne accentuate the variability in Melbourne's maximum temperatures
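Where the box plot draws its outliers can be approximated from the summary table with the usual 1.5 × IQR rule. This is a rough sketch only: `boxplot()` computes its box from hinges rather than the quantiles printed above, so the exact fence may differ slightly. The variable names are illustrative.

```r
# Quartiles and maxima of daily maximum temperature, from the summary table (°C).
q1       <- c(Melbourne = 15.325, Sydney = 19.800)
q3       <- c(Melbourne = 22.925, Sydney = 26.900)
max_temp <- c(Melbourne = 43.5,   Sydney = 41.2)

# Interquartile range and upper fence (Q3 + 1.5 * IQR).
iqr         <- q3 - q1
upper_fence <- q3 + 1.5 * iqr

# Does each city's hottest day lie beyond its fence?
beyond_fence <- max_temp > upper_fence

# Distance of the hottest day above the fence, in °C.
excess <- max_temp - upper_fence
```

Under this approximation Melbourne's fence sits near 34.3 °C and its hottest day is about 9 °C above it, compared with under 4 °C for Sydney, which is consistent with Melbourne's wider spread.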

Hypothesis Testing - Two-sample t-tests

leveneTest(`Maximum temperature (C)` ~ City, data = weather)

Hypothesis Testing - Two-sample t-tests

t.test(
  `Maximum temperature (C)` ~ City,
  data = weather,
  var.equal = FALSE,
  alternative = "two.sided"
  )
## 
##  Welch Two Sample t-test
## 
## data:  Maximum temperature (C) by City
## t = -9.2101, df = 695.57, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.589189 -2.976384
## sample estimates:
## mean in group Melbourne    mean in group Sydney 
##                19.93880                23.72158

A two-sample t-test was used to test for a significant difference between the mean daily maximum temperatures of Sydney and Melbourne. With 366 observations in each group, the central limit theorem justifies the t-test even if the underlying distributions are not normal. Levene’s test of homogeneity of variance indicated that the assumption of equal variance was violated, so the Welch two-sample t-test, which does not assume equal variance, was used. It found a statistically significant difference between the mean daily maximum temperatures of Melbourne and Sydney, t(df = 696) = −9.21, p < .001, 95% CI for the difference in means (Melbourne minus Sydney) [−4.59, −2.98]. The results suggest that Sydney has a significantly higher average daily maximum temperature than Melbourne.
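The reported statistics can be cross-checked from the summary table alone using the standard Welch formulas (the Welch-Satterthwaite approximation for the degrees of freedom). The function `welch_from_summary` is a hypothetical helper written for this check, not part of the original analysis.

```r
# Hypothetical helper: Welch two-sample t-test from summary statistics.
welch_from_summary <- function(m1, s1, n1, m2, s2, n2, conf = 0.95) {
  v1 <- s1^2 / n1                      # squared standard error, group 1
  v2 <- s2^2 / n2                      # squared standard error, group 2
  se <- sqrt(v1 + v2)                  # standard error of the difference
  t_stat <- (m1 - m2) / se
  # Welch-Satterthwaite degrees of freedom.
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  p_value <- 2 * pt(-abs(t_stat), df)  # two-sided p-value
  half_width <- qt(1 - (1 - conf) / 2, df) * se
  list(t = t_stat, df = df, p_value = p_value,
       conf_int = (m1 - m2) + c(-1, 1) * half_width)
}

# Melbourne first, then Sydney: Mean, SD and n from the temperature table.
res <- welch_from_summary(19.93880, 6.143188, 366,
                          23.72158, 4.899242, 366)
res
```

Up to rounding of the table values, this should reproduce the t statistic, degrees of freedom and confidence interval printed by `t.test()` above.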

Discussion

Discussion - continued

References