Zhihan Jian (s3958653)
The data used in this assignment describe the climate of Melbourne and Sydney over the 2021 summer, from December 2021 to February 2022. Both data sets contain three variables: “Maximum wind speed”, “Solar Exposure” (the total solar energy falling on a horizontal surface in each city), and “Maximum Temperature” (measured in degrees Celsius). “Solar Exposure” and “Maximum Temperature” are chosen for this report. After the data are loaded and prepared, descriptive statistics are calculated with the summary function in R to summarize the characteristics of the data. Histograms with a normal distribution overlay are then used as a visual aid for discussing the distribution of each variable.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
In this section, the two data sets are loaded and prepared for further analysis.
# load the data and check its structure and components
# scan for missing values and address any that are found
mel <- read_csv("Climate Data-Melbourne-1.csv")
## Rows: 90 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (7): station number, Year, Month, Day, Maximum temperature (Degree C), s...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
syd <- read_csv("Climate Data-Sydney-1.csv")
## Rows: 90 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (7): station number, Year, Month, Day, Maximum temperature (Degree C), s...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(mel) # inspect the first rows and check whether any variable needs converting to a factor
head(syd) # inspect the first rows and check whether any variable needs converting to a factor
sum(is.na(mel)) # scan for missing values; the result shows there are none
## [1] 0
sum(is.na(syd)) # scan for missing values; the result shows there are none
## [1] 0
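A single total of zero is enough here, but when a data set does contain missing values it helps to see which column they sit in before deciding how to address them. The following is a minimal sketch of that step, using a made-up data frame `toy` (not the climate data) and dropping incomplete rows as one possible treatment:

```r
# Toy data frame standing in for `mel` or `syd`; the values are made up.
toy <- data.frame(
  temp  = c(25.1, NA, 30.4, 28.0),
  solar = c(20.5, 18.2, NA, NA)
)

# Count missing values per column instead of over the whole table.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# One possible treatment (an assumption, not the only option):
# keep only the rows that have no missing value in any column.
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

Whether dropping rows is appropriate depends on how many days would be lost; for daily climate data, imputation from neighbouring days may be preferable.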
At this stage, descriptive statistics are calculated for the chosen variables from the two data sets: mean, median, standard deviation, first and third quartiles, IQR (interquartile range), and minimum and maximum values.
# select the two columns "solar exposure" and "Maximum temperature (Degree C)"
# and check their summary statistics
summary(mel$`solar exposure`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.40 16.88 22.60 21.91 28.10 32.00
summary(syd$`solar exposure`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 15.88 20.65 20.09 25.88 32.20
summary(mel$`Maximum temperature (Degree C)`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.90 22.82 26.85 27.02 31.68 38.40
summary(syd$`Maximum temperature (Degree C)`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.70 24.32 26.85 27.01 29.20 37.80
IQR_mel_solar <- IQR(mel$`solar exposure`)
IQR_syd_solar <- IQR(syd$`solar exposure`)
IQR_mel_temp <- IQR(mel$`Maximum temperature (Degree C)`)
IQR_syd_temp <- IQR(syd$`Maximum temperature (Degree C)`)
IQR_mel_solar
## [1] 11.225
IQR_syd_solar
## [1] 10
IQR_mel_temp
## [1] 8.85
IQR_syd_temp
## [1] 4.875
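The summary function reports the quartiles, mean and extremes but not the standard deviation, which is also listed among the statistics for this section. A minimal sketch of one way to collect everything in a single call is given below; `describe` is a hypothetical helper name and `example_values` is made-up data used only to show the output shape:

```r
# Hypothetical helper: gathers the listed statistics for one numeric vector.
describe <- function(values) {
  q <- quantile(values, probs = c(0.25, 0.75), names = FALSE)
  c(
    mean   = mean(values),
    median = median(values),
    sd     = sd(values),
    q1     = q[1],
    q3     = q[2],
    iqr    = IQR(values),
    min    = min(values),
    max    = max(values)
  )
}

# Made-up values for illustration only.
example_values <- c(2, 4, 4, 5, 7, 9, 11)
example_stats <- describe(example_values)
print(round(example_stats, 2))
```

Applied to the report's data, the same helper could be called as `describe(mel$`solar exposure`)`, and likewise for the other three columns.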
In this part, four histograms are created, one for each variable in each city, and a normal distribution curve is overlaid on each figure to assess the distribution fit.
# create a histogram for each variable and compare it with a normal distribution
# 1 histogram of Melbourne solar exposure
hist(mel$`solar exposure`, breaks = 30, prob = TRUE, xlab = "Melbourne Solar Exposure", xlim = c(3.4, 32), main = "histogram of Melbourne solar exposure with normal curve", col = "lightblue")
# draw a random normal sample with the same mean, standard deviation and length as the variable
mean_mel_solar <- mean(mel$`solar exposure`)
sd_mel_solar <- sd(mel$`solar exposure`)
mel_solar_norm <- rnorm(length(mel$`solar exposure`), mean_mel_solar, sd_mel_solar)
# add the smoothed density of that sample as the normal reference line
lines(density(mel_solar_norm, adjust = 2), col = "blue", lwd = 2)
# 2 histogram of Sydney solar exposure
hist(syd$`solar exposure`, breaks = 30, prob = TRUE, xlab = "Sydney Solar Exposure", xlim = c(3, 32.2), main = "histogram of Sydney solar exposure with normal curve", col = "lightgreen")
# draw a random normal sample with the same mean, standard deviation and length as the variable
mean_syd_solar <- mean(syd$`solar exposure`)
sd_syd_solar <- sd(syd$`solar exposure`)
syd_solar_norm <- rnorm(length(syd$`solar exposure`), mean_syd_solar, sd_syd_solar)
# add the smoothed density of that sample as the normal reference line
lines(density(syd_solar_norm, adjust = 2), col = "blue", lwd = 2)
# 3 histogram of Melbourne max temperature
hist(mel$`Maximum temperature (Degree C)`, breaks = 30, prob = TRUE, xlab = "Melbourne Max Temperature", xlim = c(14.9, 38.4), main = "histogram of Melbourne Max Temperature with normal curve", col = "pink")
# draw a random normal sample with the same mean, standard deviation and length as the variable
mean_mel_temp <- mean(mel$`Maximum temperature (Degree C)`)
sd_mel_temp <- sd(mel$`Maximum temperature (Degree C)`)
mel_temp_norm <- rnorm(length(mel$`Maximum temperature (Degree C)`), mean_mel_temp, sd_mel_temp)
# add the smoothed density of that sample as the normal reference line
lines(density(mel_temp_norm, adjust = 2), col = "blue", lwd = 2)
# 4 histogram of Sydney max temperature
# the histogram must be drawn from the Sydney data (syd), not the Melbourne data (mel)
hist(syd$`Maximum temperature (Degree C)`, breaks = 60, prob = TRUE, xlab = "Sydney Max Temperature", xlim = c(19.7, 37.8), main = "histogram of Sydney Max Temperature with normal curve", col = "orange")
# draw a random normal sample with the same mean, standard deviation and length as the variable
mean_syd_temp <- mean(syd$`Maximum temperature (Degree C)`)
sd_syd_temp <- sd(syd$`Maximum temperature (Degree C)`)
syd_temp_norm <- rnorm(length(syd$`Maximum temperature (Degree C)`), mean_syd_temp, sd_syd_temp)
# add the smoothed density of that sample as the normal reference line
lines(density(syd_temp_norm, adjust = 2), col = "blue", lwd = 2)
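The blue lines above are smoothed densities of random samples, so their shape changes slightly each time the code is run. As an alternative sketch, the exact normal density with the variable's own mean and standard deviation can be drawn with `dnorm` and `curve`. The helper name `hist_with_normal` and the data in `toy_values` are assumptions made for illustration, not part of the original analysis:

```r
# Hypothetical helper: histogram on the density scale with the exact
# normal density N(mean(values), sd(values)) drawn on top.
hist_with_normal <- function(values, breaks = 30, ...) {
  m <- mean(values)
  s <- sd(values)
  hist(values, breaks = breaks, freq = FALSE, ...)
  # With add = TRUE, curve() evaluates the expression over the current x-axis range.
  curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "blue", lwd = 2)
  invisible(c(mean = m, sd = s))
}

# Made-up, deterministic data: 90 evenly spaced normal quantiles.
toy_values <- qnorm(ppoints(90), mean = 27, sd = 5)
fit <- hist_with_normal(
  toy_values,
  xlab = "Toy values",
  main = "Toy histogram with normal curve",
  col  = "lightblue"
)
print(fit)
```

Because nothing is sampled, the overlay is identical on every run, which makes the four figures easier to compare.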
In a normal distribution the mean and median are equal, the curve is symmetric, and the tails approach the horizontal axis asymptotically. We can use these properties as a benchmark to check the distributions.
* For “Solar Exposure”, the descriptive statistics show that the mean is below the median in both Melbourne (21.91 against 22.60) and Sydney (20.09 against 20.65). The histograms of both cities also show left skewness. Summer is the hottest season of the year and the amount of energy received from the sun stays high on most days, with a small number of low-exposure days forming the left tail, so this left skewness is reasonable.
* For “Maximum Temperature” in Australia's two biggest cities, the mean is slightly above the median (27.02 against 26.85 in Melbourne and 27.01 against 26.85 in Sydney). In terms of distribution shape, the Melbourne histogram fits the normal distribution most closely, while the Sydney histogram looks closer to a uniform distribution, in which the values occur roughly the same number of times.
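The comparison of mean and median can be made explicit with a few lines of arithmetic. The sketch below copies the means and medians from the summary output earlier in the report; the column names `gap` and `hint` are illustrative choices, and the sign of the gap is only a rough indicator of skewness, not a formal test:

```r
# Means and medians copied from the summary() output in this report.
reported <- data.frame(
  variable = c("solar exposure", "solar exposure",
               "maximum temperature", "maximum temperature"),
  city     = c("Melbourne", "Sydney", "Melbourne", "Sydney"),
  mean     = c(21.91, 20.09, 27.02, 27.01),
  median   = c(22.60, 20.65, 26.85, 26.85)
)

# A negative gap (mean below median) hints at left skew;
# any other gap is labelled as right skew here.
reported$gap  <- round(reported$mean - reported$median, 2)
reported$hint <- ifelse(reported$gap < 0, "left skew", "right skew")
print(reported)
```

The temperature gaps are much smaller than the solar exposure gaps, which is consistent with the temperature distributions being closer to symmetric.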