Student Details

Name: “Karan Khurana”

ID: “s3998115”

Problem Statement

“The goal of this investigation is to assess whether selected variables in a subset of Melbourne and Sydney weather data are approximately normally distributed, and to suggest how these variables could be modelled. The Daily Wind speed and Maximum temperature variables have been chosen from the data set.

Approach: The investigation includes the following steps:

R functions will be used to compute summary statistics for each variable in Melbourne and Sydney: the mean, median, standard deviation, first and third quartiles, interquartile range, and minimum and maximum values.

For each variable in Melbourne and Sydney, histograms with overlays of the normal distribution will be generated using the ggplot2 package in R. If the histogram resembles a bell-shaped curve, the variable is treated as roughly normally distributed. If the histogram is skewed, a non-parametric model may need to be considered.

Based on the histograms, suggestions will be offered for modelling each variable in Melbourne and Sydney. For variables that are roughly normally distributed, a parametric model (such as a normal distribution) may be advised; for variables that are not normally distributed, a non-parametric model would be advised.”

Load Packages

library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
library(ggplot2)

Data

Import the climate data and prepare it for analysis. Show your code.

setwd("C:/Users/bhara/Downloads/Data-Applied Analytics/")
Warning: The working directory was changed to C:/Users/bhara/Downloads/Data-Applied Analytics inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
melbourne <- read.csv("Climate_Data_Melbourne.csv")
sydney <- read.csv("Climate_Data_Sydney.csv")
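
The import step can also be written without `setwd()`, which avoids the working-directory warning shown above. The following is a minimal sketch, not the notebook's actual import: it writes a tiny made-up CSV to a temporary folder (the folder, the file contents and the names `data_dir`, `toy_path` and `toy` are assumptions for illustration), reads it back with `read.csv()` via `file.path()`, and counts missing values per column as a basic preparation check.

```r
# Hypothetical data folder; a temporary directory stands in for the real one.
data_dir <- tempdir()

# Write a tiny made-up file so the example runs on its own.
toy_path <- file.path(data_dir, "toy_climate.csv")
writeLines(c("Date,Maximum.Temperature",
             "2023-01-01,27.1",
             "2023-01-02,NA",
             "2023-01-03,29.5"),
           toy_path)

# Build the path with file.path() instead of changing the working directory.
toy <- read.csv(toy_path)

# Inspect the structure and count missing values in each column.
str(toy)
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)
```

Counting missing values at import time shows early which columns will need `na.rm = TRUE` in later calculations.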

Summary Statistics

Calculate descriptive statistics (i.e., mean, median, standard deviation, first and third quartile, interquartile range, minimum and maximum values) of the selected variable grouped by city.

# This is a chunk for your Summary Statistics section. 
# summary() reports the minimum, 1st quartile, median, mean, 3rd quartile and maximum of the given Melbourne column
print("Summary of Melbourne's max temperature")
[1] "Summary of Melbourne's max temperature"
summary(melbourne$Maximum.Temperature)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  16.10   21.43   25.10   25.64   29.05   41.30 
print("Summary of Melbourne's wind speed")
[1] "Summary of Melbourne's wind speed"
summary(melbourne$Wind.speed)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  30.00   35.00   41.00   46.16   54.00   83.00 
# printing standard deviation
print("Standard deviation of Melbourne's max temperature")
[1] "Standard deviation of Melbourne's max temperature"
sd(melbourne$Maximum.Temperature)
[1] 5.768589
print("Standard deviation of Melbourne's wind speed")
[1] "Standard deviation of Melbourne's wind speed"
sd(melbourne$Wind.speed)
[1] 13.33738
# printing interquartile range
print("Interquartile range of Melbourne's max temperature")
[1] "Interquartile range of Melbourne's max temperature"
IQR(melbourne$Maximum.Temperature)
[1] 7.625
print("Interquartile range of Melbourne's wind speed")
[1] "Interquartile range of Melbourne's wind speed"
IQR(melbourne$Wind.speed)
[1] 19
# summary() reports the minimum, 1st quartile, median, mean, 3rd quartile and maximum of the given Sydney column, plus the number of missing values (NA's) if any
print("Summary of Sydney's max temperature")
[1] "Summary of Sydney's max temperature"
summary(sydney$Maximum.Temperature)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  19.40   23.80   27.10   26.97   29.50   36.60       1 
print("Summary of Sydney's wind speed")
[1] "Summary of Sydney's wind speed"
summary(sydney$maximum.wind.speed)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  17.00   31.00   33.00   35.02   37.00   72.00 
# printing standard deviation (na.rm = TRUE skips the one missing temperature value)
print("Standard deviation of Sydney's max temperature")
[1] "Standard deviation of Sydney's max temperature"
sd(sydney$Maximum.Temperature, na.rm = TRUE)
[1] 3.762337
print("Standard deviation of Sydney's wind speed")
[1] "Standard deviation of Sydney's wind speed"
sd(sydney$maximum.wind.speed)
[1] 8.055953
# printing interquartile range
print("Interquartile range of Sydney's max temperature")
[1] "Interquartile range of Sydney's max temperature"
IQR(sydney$Maximum.Temperature, na.rm = TRUE)
[1] 5.7
print("Interquartile range of Sydney's wind speed")
[1] "Interquartile range of Sydney's wind speed"
IQR(sydney$maximum.wind.speed)
[1] 6
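
Because the task asks for statistics grouped by city, the same quantities could also be collected in one table with dplyr, which is already loaded. The following is a minimal sketch on made-up toy values, not the real climate data: the data frames `melbourne_toy` and `sydney_toy`, the `City` label and the summary column names are assumptions for illustration.

```r
library(dplyr)

# Toy stand-ins for the two climate data frames (values are made up).
melbourne_toy <- data.frame(Maximum.Temperature = c(21.4, 25.1, 29.0, 41.3, 16.1))
sydney_toy    <- data.frame(Maximum.Temperature = c(23.8, 27.1, NA, 29.5, 36.6))

# Stack both cities into one data frame; .id stores the argument names in City.
temps <- bind_rows(Melbourne = melbourne_toy, Sydney = sydney_toy, .id = "City")

# One row per city, one column per statistic; na.rm = TRUE skips missing days.
temp_summary <- temps %>%
  group_by(City) %>%
  summarise(
    N_missing = sum(is.na(Maximum.Temperature)),
    Min       = min(Maximum.Temperature, na.rm = TRUE),
    Q1        = quantile(Maximum.Temperature, 0.25, na.rm = TRUE, names = FALSE),
    Median    = median(Maximum.Temperature, na.rm = TRUE),
    Mean      = mean(Maximum.Temperature, na.rm = TRUE),
    Q3        = quantile(Maximum.Temperature, 0.75, na.rm = TRUE, names = FALSE),
    Max       = max(Maximum.Temperature, na.rm = TRUE),
    SD        = sd(Maximum.Temperature, na.rm = TRUE),
    IQ_range  = IQR(Maximum.Temperature, na.rm = TRUE)
  )

print(temp_summary)
```

A single grouped table puts the two cities side by side and makes the handling of missing values explicit for every statistic.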

Distribution Fitting

Compare the empirical distribution of selected variable to a normal distribution separately in Melbourne and in Sydney. You need to do this visually by plotting the histogram with normal distribution overlay. Show your code.

# This is a chunk for your Distribution Fitting section. 
library(ggplot2)

# In each plot, geom_density() overlays a kernel density estimate of the data (blue line).
# after_stat(density) needs ggplot2 3.3.0 or later; it replaces the ..density.. notation deprecated in 3.4.0.

# For Maximum temperature in Melbourne
ggplot(melbourne, aes(x = Maximum.Temperature)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, colour = "black", fill = "white") +
  geom_density(colour = "blue") +
  labs(x = "Maximum temperature (°C)", y = "Density", title = "Histogram of Maximum temperature in Melbourne")

# For Daily Wind speed in Melbourne
ggplot(melbourne, aes(x = Wind.speed)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, colour = "black", fill = "white") +
  geom_density(colour = "blue") +
  labs(x = "Daily Wind speed (km/h)", y = "Density", title = "Histogram of Daily Wind speed in Melbourne")

# For Daily Wind speed in Sydney
ggplot(sydney, aes(x = maximum.wind.speed)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, colour = "black", fill = "white") +
  geom_density(colour = "blue") +
  labs(x = "Daily Wind speed (km/h)", y = "Density", title = "Histogram of Daily Wind speed in Sydney")

# For Maximum temperature in Sydney
ggplot(sydney, aes(x = Maximum.Temperature)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, colour = "black", fill = "white") +
  geom_density(colour = "blue") +
  labs(x = "Maximum temperature (°C)", y = "Density", title = "Histogram of Maximum temperature in Sydney")
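
The line drawn by `geom_density()` is a kernel density estimate of the data rather than a normal curve. For a direct comparison with a normal distribution, one option is to overlay `dnorm()` evaluated at the sample mean and standard deviation through `stat_function()`. The following is a minimal sketch on simulated values, not the climate data: the data frame `toy` and the names `mu`, `sigma` and `p` are assumptions for illustration, and `after_stat()` assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for a temperature column (hypothetical values).
toy <- data.frame(Maximum.Temperature = rnorm(200, mean = 25, sd = 5))

# Parameters of the reference normal curve: the sample mean and SD.
mu    <- mean(toy$Maximum.Temperature, na.rm = TRUE)
sigma <- sd(toy$Maximum.Temperature, na.rm = TRUE)

p <- ggplot(toy, aes(x = Maximum.Temperature)) +
  # Histogram on the density scale so it is comparable with the curve.
  geom_histogram(aes(y = after_stat(density)), binwidth = 2,
                 colour = "black", fill = "white") +
  # Normal density with the same mean and SD as the data (red line).
  stat_function(fun = dnorm, args = list(mean = mu, sd = sigma),
                colour = "red") +
  labs(x = "Maximum temperature (°C)", y = "Density",
       title = "Histogram with normal curve (simulated data)")

print(p)
```

With real data, bars that rise well above or fall well below the red curve on one side indicate where the distribution departs from normality.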

Interpretation

Going back to your problem statement, what insight has been gained from the investigation?

Based on the histograms, the distribution of “Daily Wind speed” is not normal in either Melbourne or Sydney. The summary statistics point the same way: the mean exceeds the median in both cities (46.16 vs 41.00 in Melbourne, 35.02 vs 33.00 in Sydney), which is consistent with a right skew. The distribution of “Maximum temperature” in Melbourne is approximately normal, but the distribution in Sydney is slightly skewed to the right. We therefore recommend a non-parametric model for “Daily Wind speed” in both cities, a parametric model (such as a normal distribution) for “Maximum temperature” in Melbourne, and a non-parametric model for “Maximum temperature” in Sydney.

Non-parametric methods, such as those based on medians or ranks, can be more appropriate for non-normal data because they do not assume that the data follow a particular distribution. Parametric models, such as the normal distribution, assume a specific distributional form, and their results can be misleading when the data depart from it.
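
The difference between the two approaches can be illustrated on skewed data. The following is a minimal sketch using simulated right-skewed values with a hard lower bound of 30, loosely mimicking a wind speed variable; the values and the names `wind`, `p_below_30_normal` and `p_below_30_empirical` are assumptions for illustration, not results from the climate data. It compares what a fitted normal model and the raw data say about values below 30.

```r
set.seed(42)
# Simulated right-skewed values: never below 30, with a long upper tail.
wind <- 30 + rexp(500, rate = 1 / 15)

# In right-skewed data the mean is pulled above the median.
mean_wind   <- mean(wind)
median_wind <- median(wind)

# Parametric view: a normal model with the sample mean and SD assigns
# noticeable probability to values below 30, which never occur here.
p_below_30_normal <- pnorm(30, mean = mean_wind, sd = sd(wind))

# Non-parametric view: the empirical proportion of values below 30.
p_below_30_empirical <- mean(wind < 30)

print(c(mean = mean_wind, median = median_wind))
print(c(normal = p_below_30_normal, empirical = p_below_30_empirical))
```

The normal model spreads probability symmetrically around the mean, so on skewed data it places weight where no observations exist, while the empirical proportion only reflects what was observed.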

For “Maximum temperature” in Melbourne, the approximately normal histogram indicates that a parametric model, such as a normal distribution, may be appropriate. For “Maximum temperature” in Sydney, the slight skew suggests that a non-parametric model may be the safer choice.

