Student Details
Name: “Karan Khurana”
ID: “s3998115”
Problem Statement
“This investigation’s goal is to assess whether a subset of Melbourne
and Sydney weather data is approximately normally distributed and to
offer suggestions for modelling these variables. The data set’s daily
wind speed and maximum temperature variables have been chosen.
Approach: The investigation includes the following steps:
R functions will be used to compute summary statistics for each
variable in Melbourne and Sydney: the mean, median, standard deviation,
first and third quartiles, interquartile range, and minimum and maximum
values.
For each variable in Melbourne and Sydney, histograms with overlays
of the normal distribution will be generated using the ggplot2 package
in R. A variable is considered roughly normally distributed if its
histogram resembles a bell-shaped curve. If the histogram is skewed, a
non-parametric model may need to be considered.
Based on the histograms, we shall offer suggestions for modelling
each variable in Melbourne and Sydney. For variables that are not
normally distributed, we would advise using a non-parametric model; for
variables that are roughly normally distributed, we might advise using a
parametric model (such as a normal distribution).”
Load Packages
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(ggplot2)
Data
Import the climate data and prepare it for analysis. Show your
code.
setwd("C:/Users/bhara/Downloads/Data-Applied Analytics/")
Warning: The working directory was changed to C:/Users/bhara/Downloads/Data-Applied Analytics inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
melbourne <- read.csv("Climate_Data_Melbourne.csv")
sydney <- read.csv("Climate_Data_Sydney.csv")
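The summary output later in this report shows one missing value (NA) in Sydney's Maximum.Temperature column, so part of preparing the data is checking for missing values. The following is a minimal sketch of such a check in base R; the helper names `count_missing` and `drop_missing_rows` are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: count missing values in each column of a data frame.
count_missing <- function(df) {
  colSums(is.na(df))
}

# Hypothetical helper: keep only rows where none of the named columns is missing.
drop_missing_rows <- function(df, cols) {
  df[complete.cases(df[, cols, drop = FALSE]), , drop = FALSE]
}
```

Assuming the data frames have been read in as above, `count_missing(sydney)` would list the number of NAs per column, and `drop_missing_rows(sydney, "Maximum.Temperature")` would return the rows with a recorded maximum temperature. Passing `na.rm = TRUE` to each function, as done in the report, is an equally valid choice.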
Summary Statistics
Calculate descriptive statistics (i.e., mean, median, standard
deviation, first and third quartile, interquartile range, minimum and
maximum values) of the selected variable grouped by city.
# This is a chunk for your Summary Statistics section.
# Summary of each Melbourne variable: minimum, 1st quartile, median, mean, 3rd quartile and maximum
print("Summary of melbourne's max temperature")
[1] "Summary of melbourne's max temperature"
summary(melbourne$Maximum.Temperature)
Min. 1st Qu. Median Mean 3rd Qu. Max.
16.10 21.43 25.10 25.64 29.05 41.30
print("Summary of melbourne's wind speed")
[1] "Summary of melbourne's wind speed"
summary(melbourne$Wind.speed)
Min. 1st Qu. Median Mean 3rd Qu. Max.
30.00 35.00 41.00 46.16 54.00 83.00
# printing standard deviation
print("Standard deviation of melbourne's max temperature")
[1] "Standard deviation of melbourne's max temperature"
sd(melbourne$Maximum.Temperature)
[1] 5.768589
print("Standard deviation of melbourne's wind speed")
[1] "Standard deviation of melbourne's wind speed"
sd(melbourne$Wind.speed)
[1] 13.33738
# printing Inter Quartile Range
print("inter Quartile Range of melbourne's max temperature")
[1] "inter Quartile Range of melbourne's max temperature"
IQR(melbourne$Maximum.Temperature)
[1] 7.625
print("inter Quartile Range of melbourne's wind speed")
[1] "inter Quartile Range of melbourne's wind speed"
IQR(melbourne$Wind.speed)
[1] 19
# Summary of each Sydney variable: minimum, 1st quartile, median, mean, 3rd quartile and maximum
print("Summary of Sydney's max temperature")
[1] "Summary of Sydney's max temperature"
summary(sydney$Maximum.Temperature)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
19.40 23.80 27.10 26.97 29.50 36.60 1
print("Summary of Sydney's Wind speed")
[1] "Summary of Sydney's Wind speed"
summary(sydney$maximum.wind.speed)
Min. 1st Qu. Median Mean 3rd Qu. Max.
17.00 31.00 33.00 35.02 37.00 72.00
# printing standard deviation
print("Standard deviation of Sydney's max temperature")
[1] "Standard deviation of Sydney's max temperature"
sd(sydney$Maximum.Temperature, na.rm = TRUE)
[1] 3.762337
print("Standard deviation of Sydney's wind speed")
[1] "Standard deviation of Sydney's wind speed"
sd(sydney$maximum.wind.speed)
[1] 8.055953
# printing Inter Quartile Range
print("inter Quartile Range of Sydney's Max temperature")
[1] "inter Quartile Range of Sydney's Max temperature"
IQR(sydney$Maximum.Temperature,na.rm = TRUE)
[1] 5.7
print("inter Quartile Range of Sydney's wind speed")
[1] "inter Quartile Range of Sydney's wind speed"
IQR(sydney$maximum.wind.speed)
[1] 6
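The statistics above are printed one call at a time. As a possible alternative, the sketch below collects all eight requested statistics for one variable into a single named vector, so the two cities can be placed side by side. The helper name `describe` is hypothetical; it uses only base R functions already used in this report (`quantile`, `median`, `mean`, `sd`, `IQR`).

```r
# Hypothetical helper: the descriptive statistics requested above for one numeric vector.
describe <- function(x) {
  x <- x[!is.na(x)]  # drop missing values before summarising
  c(
    Min    = min(x),
    Q1     = unname(quantile(x, 0.25)),
    Median = median(x),
    Mean   = mean(x),
    Q3     = unname(quantile(x, 0.75)),
    Max    = max(x),
    SD     = sd(x),
    IQR    = IQR(x)
  )
}
```

With the data loaded as above, `rbind(Melbourne = describe(melbourne$Maximum.Temperature), Sydney = describe(sydney$Maximum.Temperature))` would give one row per city; the same call with `melbourne$Wind.speed` and `sydney$maximum.wind.speed` would cover wind speed.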
Distribution Fitting
Compare the empirical distribution of selected variable to a normal
distribution separately in Melbourne and in Sydney. You need to do this
visually by plotting the histogram with normal distribution overlay.
Show your code.


# This is a chunk for your Distribution Fitting section.
library(ggplot2)
# Each plot is a density-scaled histogram. The blue geom_density() curve is a
# kernel density estimate of the data.
# after_stat(density) replaces the deprecated ..density.. notation.

# For Maximum temperature in Melbourne
ggplot(melbourne, aes(x = Maximum.Temperature)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, colour = "black", fill = "white") +
  geom_density(colour = "blue") +
  labs(x = "Maximum temperature (°C)", y = "Density", title = "Histogram of Maximum temperature in Melbourne")

# For Daily Wind speed in Melbourne
ggplot(melbourne, aes(x = Wind.speed)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, colour = "black", fill = "white") +
  geom_density(colour = "blue") +
  labs(x = "Daily Wind speed (km/h)", y = "Density", title = "Histogram of Daily Wind speed in Melbourne")

# For Daily Wind speed in Sydney
ggplot(sydney, aes(x = maximum.wind.speed)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, colour = "black", fill = "white") +
  geom_density(colour = "blue") +
  labs(x = "Daily Wind speed (km/h)", y = "Density", title = "Histogram of Daily Wind speed in Sydney")

# For Maximum temperature in Sydney
ggplot(sydney, aes(x = Maximum.Temperature)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, colour = "black", fill = "white") +
  geom_density(colour = "blue") +
  labs(x = "Maximum temperature (°C)", y = "Density", title = "Histogram of Maximum temperature in Sydney")
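The curve drawn by `geom_density()` in the chunk above is a kernel density estimate of the data, not a normal curve. One way to draw the normal distribution overlay that the task asks for is `stat_function()` with `dnorm`, using the sample mean and standard deviation as parameters. The following is a sketch under that assumption; the helper name `plot_normal_overlay` and the red colour are illustrative choices, not part of the original analysis.

```r
library(ggplot2)

# Hypothetical helper: density-scaled histogram of one column with a normal
# curve whose mean and standard deviation are estimated from the data.
plot_normal_overlay <- function(df, column, x_label, title, binwidth = 2) {
  values <- df[[column]]
  values <- values[!is.na(values)]  # drop missing values before estimating
  ggplot(data.frame(value = values), aes(x = value)) +
    geom_histogram(aes(y = after_stat(density)), binwidth = binwidth,
                   colour = "black", fill = "white") +
    # Normal density evaluated with the sample mean and sample SD
    stat_function(fun = dnorm,
                  args = list(mean = mean(values), sd = sd(values)),
                  colour = "red") +
    labs(x = x_label, y = "Density", title = title)
}
```

For example, `plot_normal_overlay(melbourne, "Maximum.Temperature", "Maximum temperature (°C)", "Maximum temperature in Melbourne")` would produce the Melbourne temperature plot; the other three plots follow by changing the data frame, column name and labels. The closer the histogram bars follow the red curve, the more plausible a normal model is.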

Interpretation
Going back to your problem statement, what insight has been gained
from the investigation?
Based on the histograms, we can see that the distribution of “Daily
Wind speed” in both Melbourne and Sydney is not normal. The distribution
of “Maximum temperature” in Melbourne is approximately normal, but the
distribution in Sydney is slightly skewed to the right. Therefore, we
can recommend using a non-parametric model for “Daily Wind speed”, a
parametric model (such as a normal distribution) for “Maximum
temperature” in Melbourne, and a non-parametric model for “Maximum
temperature” in Sydney.
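Skewness judged from a histogram can be sensitive to the chosen bin width, so it may help to back the visual impression with a number. The sketch below computes the moment coefficient of skewness in base R: a value near zero suggests symmetry, a positive value a longer right tail, and a negative value a longer left tail. The helper name `sample_skewness` is hypothetical and this check is not part of the original analysis. It may be worth running for Sydney in particular, since the summary statistics report a mean (26.97) slightly below the median (27.10).

```r
# Hypothetical helper: moment coefficient of skewness, m3 / m2^(3/2),
# where m2 and m3 are the second and third central sample moments.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]        # drop missing values
  centred <- x - mean(x)
  m2 <- mean(centred^2)
  m3 <- mean(centred^3)
  m3 / m2^1.5
}
```

With the data loaded, `sample_skewness(sydney$Maximum.Temperature)` and `sample_skewness(melbourne$Wind.speed)` would give one figure per variable to compare against the histograms.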
Non-parametric approaches, such as median-based summaries or
rank-based methods, can be more appropriate for non-normal distributions
because they do not assume a particular form for the underlying
distribution of the data. Parametric models, such as the normal
distribution, assume that the data follow a specific distribution, which
may not hold when the data are not normal.
In the case of “Maximum temperature” in Melbourne, the approximately
normal distribution indicates that a parametric model, such as a normal
distribution, may be appropriate. However, in the case of “Maximum
temperature” in Sydney, the slightly skewed distribution suggests that a
non-parametric model may be more appropriate.
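A normal Q-Q plot is a common companion to the histogram when judging normality: points that fall close to the reference line support a normal model, while systematic curvature at either end points to skewness or heavy tails. The following is a minimal sketch using ggplot2's `stat_qq()` and `stat_qq_line()`; the helper name `plot_normal_qq` is hypothetical and this plot was not part of the original analysis.

```r
library(ggplot2)

# Hypothetical helper: normal Q-Q plot for one column of a data frame.
plot_normal_qq <- function(df, column, title) {
  values <- df[[column]]
  values <- values[!is.na(values)]  # drop missing values
  ggplot(data.frame(value = values), aes(sample = value)) +
    stat_qq() +        # sample quantiles against theoretical normal quantiles
    stat_qq_line() +   # reference line through the first and third quartiles
    labs(x = "Theoretical quantiles", y = "Sample quantiles", title = title)
}
```

For instance, `plot_normal_qq(sydney, "Maximum.Temperature", "Normal Q-Q plot: Maximum temperature in Sydney")` could be compared with the corresponding Melbourne plot to see whether the two temperature variables really warrant different modelling choices.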
