Arif Mutluel s3400285
Last updated: 23 October, 2020
Total rainfall comparison
wcm_tot <- weather %>% group_by(City) %>% summarise(`Total Rainfall` = sum(`Rainfall (mm)`, na.rm = TRUE))
kableExtra::kable(wcm_tot)

| City | Total Rainfall |
|---|---|
| Melbourne | 713.8 |
| Sydney | 1331.2 |
Sydney’s definitely wetter, with almost double Melbourne’s rainfall. Let’s look at rainfall on a monthly basis.
wcm <- weather %>% group_by(City, Month) %>% summarise(`Total Rainfall (mm)` = sum(`Rainfall (mm)`, na.rm = TRUE))
ggplot(wcm, aes(x = Month, y = `Total Rainfall (mm)`, fill = City)) + geom_col()

Here we’ll look at the numbers without the outlier (February).
wcm_tot_clean <- weather %>% filter(Month != "February") %>% group_by(City) %>% summarise(`Total Rainfall` = sum(`Rainfall (mm)`, na.rm = TRUE))
kableExtra::kable(wcm_tot_clean)

| City | Total Rainfall |
|---|---|
| Melbourne | 637.6 |
| Sydney | 896.8 |
wcm_clean <- weather %>% filter(Month != "February") %>% group_by(City, Month) %>% summarise(`Total Rainfall` = sum(`Rainfall (mm)`, na.rm = TRUE))
ggplot(wcm_clean, aes(x = Month, y = `Total Rainfall`, fill = City)) + geom_col()

The good news is that neither city fits the BoM’s definition of drought, as the yearly rainfall is greater than the average rainfall for both Sydney and Melbourne. Which city is more likely to receive rain on any given day?
# First, we shall replace missing values in Rainfall with 0. Assuming no record means no rain.
sum(is.na(weather$`Rainfall (mm)`))
## [1] 6
weather$`Rainfall (mm)`[is.na(weather$`Rainfall (mm)`)] <- 0
# Next, let's count the days without rain per city and then calculate the probability of rain. Count of days with no rain for Melbourne
weatherMel <- weather %>% filter(City == "Melbourne") %>% filter(`Rainfall (mm)` == 0) %>% summarise(Melb_NoRain_Day_Count = n())
kableExtra::kable(weatherMel)

| Melb_NoRain_Day_Count |
|---|
| 222 |
# Probability of rain on any given day in Melbourne (n = 366 (leap year), f = 366 - 222 = 144)
144 / 366
## [1] 0.3934426
#Repeat the same for Sydney. Count of days with no rain for Sydney
weatherSyd <- weather %>% filter(City == "Sydney") %>% filter(`Rainfall (mm)` == 0) %>% summarise(Syd_NoRain_Day_Count = n())
kableExtra::kable(weatherSyd)

| Syd_NoRain_Day_Count |
|---|
| 234 |
# Probability of rain on any given day in Sydney (n = 366 (leap year), f = 366 - 234 = 132, using the dry-day count from the table above)
132 / 366
## [1] 0.3606557
Melbourne has a higher probability of a rainy day; however, Sydney receives more rain overall.
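The same kind of probability can be computed without counting dry days by hand, because the proportion of rainy days is simply the mean of a logical vector. A minimal sketch in base R follows; the `toy` data frame and its values are made up for illustration and are not the report's records.

```r
# Toy data: three days per city (made-up values, not the BoM records)
toy <- data.frame(
  City = c("Melbourne", "Melbourne", "Melbourne", "Sydney", "Sydney", "Sydney"),
  rain_mm = c(0, 2.4, 0.2, 0, 0, 11.6)
)

# A day counts as rainy when rainfall is above zero;
# the mean of the logical vector is the proportion of rainy days per city
p_rain <- tapply(toy$rain_mm > 0, toy$City, mean)
print(p_rain)
```

In a dplyr pipeline the equivalent would be a `summarise()` over the same logical condition after `group_by(City)`.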
A seasonal comparison of wind breakdown between cities
ggplot(weather, aes(x = Date, y = `Speed of maximum wind gust (km/h)`, color = City)) + geom_point() + facet_wrap(~ Season)

The BoM defines strong wind as 26 knots or more.
# First I'll create a variable to convert the wind measurement from (km/h) to knots
knots <- weather %>% mutate(knots1 = `Speed of maximum wind gust (km/h)`/ 1.852)
# Measure the number of days with strong wind
windy <- knots %>% group_by(City) %>% filter(knots1 >= 26) %>% summarise(strongwind = n())
kableExtra::kable(windy)

| City | strongwind |
|---|---|
| Melbourne | 38 |
| Sydney | 98 |
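The conversion and threshold count can be checked on a handful of values. This is a small sketch with made-up gust speeds; the helper name `kmh_to_knots` is hypothetical and not part of the report's code.

```r
# One international knot is exactly 1.852 km/h
kmh_to_knots <- function(kmh) kmh / 1.852

# Made-up gust speeds in km/h (illustrative only)
gusts_kmh <- c(30, 45, 50, 65)
gusts_kn <- kmh_to_knots(gusts_kmh)

# Days at or above the 26-knot threshold used in the filter above
strong_days <- sum(gusts_kn >= 26)
print(round(gusts_kn, 1))
print(strong_days)
```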
#Wind speeds in knots broken down proportionally into buckets of 10 knots, start = 0 knots & finish 60 knots
melbknots <- knots %>% filter(City == "Melbourne")
sydknots <- knots %>% filter(City == "Sydney")
melbcut <- cut(melbknots$knots1, breaks = seq(0,60,10))
sydcut <- cut(sydknots$knots1, breaks = seq(0,60,10))
melbcut %>% table() %>% prop.table()
## .
##      (0,10]     (10,20]     (20,30]     (30,40]     (40,50]     (50,60]
## 0.033240997 0.601108033 0.318559557 0.044321330 0.002770083 0.000000000
sydcut %>% table() %>% prop.table()
## .
##      (0,10]     (10,20]     (20,30]     (30,40]     (40,50]     (50,60]
## 0.005555556 0.438888889 0.355555556 0.172222222 0.019444444 0.008333333
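One detail of `cut()` is worth keeping in mind when reading these proportions: by default the intervals are open on the left, so a value of exactly 0 falls outside `(0,10]`, and `table()` silently drops it along with any missing values. The first table's smallest non-zero proportion (0.002770083, which is 1/361) suggests a few days were dropped this way, although that is only an inference. A minimal sketch with made-up speeds:

```r
# Made-up gust speeds in knots; 0 and NA are included on purpose
speeds <- c(0, 5, 10, 15, 25, NA)
breaks <- seq(0, 60, 10)

# Default: intervals are (a, b], so 0 falls outside the first bucket
default_bins <- table(cut(speeds, breaks = breaks))

# include.lowest = TRUE closes the first interval, giving [0,10]
closed_bins <- table(cut(speeds, breaks = breaks, include.lowest = TRUE))

print(prop.table(default_bins))
print(prop.table(closed_bins))
```

With the default, the proportions are taken over four values; with `include.lowest = TRUE`, over five.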
weather %>% group_by(City) %>% summarise(Min = min(`Sunshine (hours)`, na.rm = TRUE),
    Q1 = quantile(`Sunshine (hours)`, probs = .25, na.rm = TRUE),
    Median = median(`Sunshine (hours)`, na.rm = TRUE),
    Q3 = quantile(`Sunshine (hours)`, probs = .75, na.rm = TRUE),
    Max = max(`Sunshine (hours)`, na.rm = TRUE),
    Mean = mean(`Sunshine (hours)`, na.rm = TRUE),
    SD = sd(`Sunshine (hours)`, na.rm = TRUE),
    n = n(),
    Missing = sum(is.na(`Sunshine (hours)`))) -> tablesun
knitr::kable(tablesun)

| City | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| Melbourne | 0 | 3.2 | 6.4 | 8.475 | 13.6 | 6.048087 | 3.544354 | 366 | 0 |
| Sydney | 0 | 3.7 | 7.9 | 10.200 | 12.9 | 6.940710 | 3.821792 | 366 | 0 |
# Using the Central Limit Theorem, what is the probability of randomly selecting a sample of 183 days (half the year) that has an average of at least 6 hours and 30 minutes of sunlight in Melbourne?
pnorm(q = 6.5, mean = 6.048087, sd = 3.544354/sqrt(183), lower.tail = FALSE)
## [1] 0.04228013
# Applying the same approach to Sydney: what is the probability of randomly selecting a sample of 183 days (half the year) that has an average of at least 6 hours and 30 minutes of sunlight in Sydney?
pnorm(q = 6.5, mean = 6.940710, sd = 3.821792/sqrt(183), lower.tail = FALSE)
## [1] 0.9406145
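The two `pnorm()` calls follow the same recipe: under the Central Limit Theorem the sample mean is approximately normal with standard error `sd / sqrt(n)`, so the threshold can be standardised and looked up in the upper tail. A small sketch that makes those steps explicit; the function name `p_mean_at_least` is hypothetical, and the inputs are the means and standard deviations from the sunshine summary table.

```r
# P(sample mean >= q) under the normal approximation from the CLT:
# sample mean ~ Normal(mu, sigma / sqrt(n))
p_mean_at_least <- function(q, mu, sigma, n) {
  se <- sigma / sqrt(n)          # standard error of the sample mean
  z <- (q - mu) / se             # standardised threshold
  pnorm(z, lower.tail = FALSE)   # upper-tail probability
}

# Means and standard deviations taken from the sunshine summary table
p_melb <- p_mean_at_least(6.5, mu = 6.048087, sigma = 3.544354, n = 183)
p_syd  <- p_mean_at_least(6.5, mu = 6.940710, sigma = 3.821792, n = 183)
print(c(Melbourne = p_melb, Sydney = p_syd))
```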
Let’s visualise Hours of Sunshine
weather %>% boxplot(`Sunshine (hours)` ~ City, data = ., na.rm = TRUE, main = "Box Plot of Sunshine (hours) by City",
    ylab = "City", xlab = "Sunshine Hours", horizontal = TRUE, col = "orange")

The box plot makes the picture clearer: although Melbourne has a larger maximum (13.6 vs 12.9 hours), its upper quartile (8.475 hours) is noticeably lower than Sydney’s (10.2 hours).
# Before we measure temperature I'm going to tidy the data a little more and combine minimum and maximum temperatures
tidyw <- weather %>% select(Date, `Maximum temperature (C)`, `Minimum temperature (C)`, City) %>%
    gather(`Maximum temperature (C)`, `Minimum temperature (C)`, key = "Measurement", value = "Temperature")
# Plotting the Maximum and Minimum temperatures
ggplot(tidyw, aes(x = Date, y = Temperature)) + geom_point(aes(color = City)) +
    expand_limits(y = 0) + facet_wrap(~ Measurement)

The chart shows that Melbourne’s minimum temperatures are noticeably colder than Sydney’s at the same point in time. There’s a bit more overlap in the maximum temperatures, which we’ll look into further. Melbourne also appears to have a larger range of maximum temperatures.
weather %>% group_by(City) %>% summarise(Min = min(`Maximum temperature (C)`, na.rm = TRUE),
    Q1 = quantile(`Maximum temperature (C)`, probs = .25, na.rm = TRUE),
    Median = median(`Maximum temperature (C)`, na.rm = TRUE),
    Q3 = quantile(`Maximum temperature (C)`, probs = .75, na.rm = TRUE),
    Max = max(`Maximum temperature (C)`, na.rm = TRUE),
    Mean = mean(`Maximum temperature (C)`, na.rm = TRUE),
    SD = sd(`Maximum temperature (C)`, na.rm = TRUE),
    n = n(),
    Missing = sum(is.na(`Maximum temperature (C)`))) -> tabletemp
knitr::kable(tabletemp)

| City | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| Melbourne | 10.3 | 15.325 | 18.6 | 22.925 | 43.5 | 19.93880 | 6.143188 | 366 | 0 |
| Sydney | 13.6 | 19.800 | 23.5 | 26.900 | 41.2 | 23.72158 | 4.899242 | 366 | 0 |
weather %>% boxplot(`Maximum temperature (C)` ~ City, data = ., na.rm = TRUE, main = "Box Plot of Maximum temperature by City",
    ylab = "City", xlab = "Temperature", horizontal = TRUE, col = "red")
t.test(
    `Maximum temperature (C)` ~ City,
    data = weather,
    var.equal = FALSE,
    alternative = "two.sided"
)
##
## Welch Two Sample t-test
##
## data: Maximum temperature (C) by City
## t = -9.2101, df = 695.57, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.589189 -2.976384
## sample estimates:
## mean in group Melbourne mean in group Sydney
## 19.93880 23.72158
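The t statistic, degrees of freedom and confidence interval in this output can be reproduced from the summary table alone, which is a useful sanity check on how the Welch test works. A minimal sketch using the means, standard deviations and counts reported in the maximum-temperature table; the variable names are illustrative.

```r
# Summary statistics from the maximum-temperature table (Melbourne, Sydney)
m <- c(19.93880, 23.72158)   # means
s <- c(6.143188, 4.899242)   # standard deviations
n <- c(366, 366)             # days per city

v <- s^2 / n                          # variance of each sample mean
se <- sqrt(sum(v))                    # standard error of the difference
t_stat <- (m[1] - m[2]) / se          # Welch t statistic
df <- sum(v)^2 / sum(v^2 / (n - 1))   # Welch-Satterthwaite degrees of freedom
ci <- (m[1] - m[2]) + c(-1, 1) * qt(0.975, df) * se   # 95% CI for the difference

print(round(c(t = t_stat, df = df, lower = ci[1], upper = ci[2]), 3))
```

The difference is taken as Melbourne minus Sydney, matching the group order in the `t.test()` output, which is why the statistic and both interval bounds are negative.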
A two-sample t-test was used to test for a significant difference between the mean maximum temperatures of Sydney and Melbourne. The central limit theorem ensured that the t-test could be applied, given the large sample size in each group (n = 366). Levene’s test of homogeneity of variance indicated that the assumption of equal variance was violated. The two-sample t-test assuming unequal variance found a statistically significant difference between the mean maximum temperatures of Melbourne and Sydney, t(df = 695.57) = −9.21, p < .001, 95% CI for the difference in means (Melbourne minus Sydney) [−4.59, −2.98]. The results of the investigation suggest that Sydney has significantly higher average maximum temperatures than Melbourne.
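The Levene’s test referred to above is not shown in the code. One way it could be run in base R is the median-centred (Brown-Forsythe) form, which is a one-way ANOVA on the absolute deviations from each group's median. The sketch below uses simulated data, and the helper name `levene_median` is hypothetical; it is not the author's original call.

```r
# Median-centred Levene test (Brown-Forsythe variant) in base R:
# a one-way ANOVA on absolute deviations from each group's median
levene_median <- function(y, g) {
  g <- factor(g)
  z <- abs(y - ave(y, g, FUN = median))
  anova(lm(z ~ g))
}

# Simulated data with clearly unequal spread (illustrative only)
set.seed(1)
sim <- data.frame(
  City = rep(c("Melbourne", "Sydney"), each = 366),
  temp = c(rnorm(366, mean = 20, sd = 6), rnorm(366, mean = 24, sd = 3))
)

res <- levene_median(sim$temp, sim$City)
p_value <- res[["Pr(>F)"]][1]
print(res)
```

A small p-value is evidence against equal variances, which is the situation where `var.equal = FALSE` in `t.test()` is the appropriate choice. The `car` package's `leveneTest()` provides a packaged version of the same test.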