Student Details

Priya Krishnamurthi Chandra (s3939191)
Su Myat Noe Yee (s3913797)
Usman Khalid (s3914769)

Problem Statement

The objective of this analysis is to investigate the distribution of selected climate variables in Melbourne and Sydney.
The variables considered are ‘Maximum temperature’, which records the highest temperature in a day, and ‘Solar exposure’, which records the total solar energy falling on a horizontal surface in a day. Each variable is inspected separately for each city, and its distribution is visualized and compared with the normal distribution.
Normality is assessed graphically with histograms, density plots and Q-Q plots, and formally with the Shapiro-Wilk test.

Load Packages

Loaded the packages needed for the analysis.

    knitr::opts_knit$set(root.dir = normalizePath("~/Desktop/RMIT_Year1_Semester 1/Applied Analytics/Assignments")) 
library(readr) #Used for reading data
library(magrittr) #Used for Forward-pipe operator
library(dplyr) #Used for data manipulation

Data

# This is a chunk for your Data section. 
setwd("~/Desktop/RMIT_Year1_Semester 1/Applied Analytics/Assignments")
mel<- read_csv("Climate data-Melbourne.csv") 
Rows: 90 Columns: 7
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
dbl (7): station number, Year, Month, Day, Maximum temperature (Degree C), solar exposure, max wind speed

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
syd<- read_csv("Climate data-Sydney.csv")
Rows: 90 Columns: 7
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
dbl (7): station number, Year, Month, Day, Maximum temperature (Degree C), solar exposure, max wind speed

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mel_subset <- mel %>% select(-7)
syd_subset <- syd %>% select(-7) 
total <- rbind(mel_subset, syd_subset)

Summary Statistics

Calculated descriptive statistics (mean, median, standard deviation, first and third quartiles, interquartile range, minimum and maximum) of each selected variable, grouped by city (identified by station number).

# This is a chunk for your Summary Statistics section
# Summary Statistics of Maximum temperature group by city
total %>%
  group_by(`station number`) %>%
  summarise(`Mean Maximum temperature (Degree C)` = mean(`Maximum temperature (Degree C)`),
            median = median(`Maximum temperature (Degree C)`),
            sd = sd(`Maximum temperature (Degree C)`),
            `First Quartile` = quantile(`Maximum temperature (Degree C)`, 0.25),
            `Third Quartile` = quantile(`Maximum temperature (Degree C)`, 0.75),
            `Interquartile Range` = `Third Quartile` - `First Quartile`,
            `Minimum Value` = min(`Maximum temperature (Degree C)`),
            `Maximum Value` = max(`Maximum temperature (Degree C)`)
            )
# Summary Statistics of solar exposure group by city
total %>%
  group_by(`station number`) %>%
  summarise(`Mean solar exposure` = mean(`solar exposure`),
            median = median(`solar exposure`),
            sd = sd(`solar exposure`),
            `First Quartile` = quantile(`solar exposure`, 0.25),
            `Third Quartile` = quantile(`solar exposure`, 0.75),
            `Interquartile Range` = `Third Quartile` - `First Quartile`,
            `Minimum Value` = min(`solar exposure`),
            `Maximum Value` = max(`solar exposure`)
            )

Distribution Fitting

# This is a chunk for your Distribution Fitting section. 
mel_temp<-mel_subset$`Maximum temperature (Degree C)` 
mel_solar<-mel_subset$`solar exposure` 
syd_temp<-syd_subset$`Maximum temperature (Degree C)`
syd_solar<-syd_subset$`solar exposure` 

Melbourne Maximum temperature

# Histogram of Melbourne maximum temperature
h<-hist(mel_temp, breaks=15,  col="yellow", xlab="Maximum temperature",
        main="Histogram of Melbourne Maximum temperature", prob=TRUE)
lines(density(mel_temp))
xfit<-seq(min(mel_temp),max(mel_temp),length=40)
yfit<-dnorm(xfit,mean=mean(mel_temp),sd=sd(mel_temp))
lines(xfit, yfit, col="blue", lwd=2)

#Boxplot of Melbourne maximum temperature
boxplot(mel_temp, horizontal = TRUE)

#Shapiro test for Melbourne maximum temperature
shapiro.test(mel_temp)

    Shapiro-Wilk normality test

data:  mel_temp
W = 0.97913, p-value = 0.157
#Q-Q plot of Melbourne maximum temperature
qqnorm(mel_temp)
qqline(mel_temp)

Melbourne Solar Exposure

#Histogram of Melbourne solar exposure
h<-hist(mel_solar, breaks=15,  col="tomato", xlab="Solar exposure",
        main="Histogram of Melbourne solar exposure", prob=TRUE)
lines(density(mel_solar))
xfit<-seq(min(mel_solar),max(mel_solar),length=40)
yfit<-dnorm(xfit,mean=mean(mel_solar),sd=sd(mel_solar))
lines(xfit, yfit, col="blue", lwd=2)

#Boxplot of Melbourne solar exposure
boxplot(mel_solar, horizontal = TRUE)

#Shapiro test of Melbourne solar exposure
shapiro.test(mel_solar)

    Shapiro-Wilk normality test

data:  mel_solar
W = 0.93519, p-value = 0.0002275
#Q-Q plot of Melbourne solar exposure
qqnorm(mel_solar)
qqline(mel_solar)

Sydney Maximum temperature

#Histogram of Sydney maximum temperature 
h<-hist(syd_temp, breaks=15,  col="skyblue", xlab="Maximum temperature",
        main="Histogram of Sydney maximum temperature", prob=TRUE)
lines(density(syd_temp))
xfit<-seq(min(syd_temp),max(syd_temp),length=40)
yfit<-dnorm(xfit,mean=mean(syd_temp),sd=sd(syd_temp))
lines(xfit, yfit, col="blue", lwd=2)

#Boxplot of Sydney maximum temperature
boxplot(syd_temp, horizontal = TRUE)

#Shapiro test of Sydney maximum temperature
shapiro.test(syd_temp)

    Shapiro-Wilk normality test

data:  syd_temp
W = 0.98213, p-value = 0.2533
#Q-Q plot of Sydney maximum temperature
qqnorm(syd_temp)
qqline(syd_temp)

Sydney Solar exposure

#Histogram of Sydney solar exposure
h<-hist(syd_solar, breaks=15,  col="lightgreen", xlab="Solar exposure",
        main="Histogram of Sydney solar exposure", prob=TRUE)
lines(density(syd_solar))
xfit<-seq(min(syd_solar),max(syd_solar),length=40)
yfit<-dnorm(xfit,mean=mean(syd_solar),sd=sd(syd_solar))
lines(xfit, yfit, col="blue", lwd=2)

#Boxplot of Sydney solar exposure
boxplot(syd_solar, horizontal = TRUE)

#Q-Q plot of Sydney solar exposure
qqnorm(syd_solar)
qqline(syd_solar)

#Shapiro test of Sydney solar exposure
shapiro.test(syd_solar)

    Shapiro-Wilk normality test

data:  syd_solar
W = 0.97075, p-value = 0.03994

Interpretation

The descriptive statistics show that Sydney (station number 66212) and Melbourne (station number 86282) have about the same mean maximum temperature, but the standard deviation is greater for Melbourne, which indicates slightly higher variability in the maximum temperatures observed there. Maximum temperature in both cities is consistent with an approximately normal distribution, as indicated by the histogram overlays and the quantile-quantile plots; the Shapiro-Wilk test does not reject normality (p-values of 0.157 for Melbourne and 0.2533 for Sydney, both greater than 0.05).

The mean solar exposure is reported as higher for Melbourne than for Sydney, although the same value (21.90) is quoted for both cities, so the exact means should be read from the summary statistics output. The box plot of Sydney solar exposure indicates that the data are left-skewed, and the box plot of Melbourne solar exposure shows a similar left skew. The Shapiro-Wilk test also indicates a significant departure from normality, with p-values of 0.0002275 for Melbourne solar exposure and 0.03994 for Sydney solar exposure. Thus the solar exposure variable does not follow the normal distribution in either city. The maximum temperature

---
title: "MATH1324 Assignment 1"
subtitle: Statistical analysis of Climate data
output:
  html_notebook: default
---

## Student Details
Priya Krishnamurthi Chandra (s3939191)  
Su Myat Noe Yee (s3913797)  
Usman Khalid (s3914769)  


## Problem Statement
The objective of this analysis is to investigate the distribution of selected climate variables in Melbourne and Sydney.  
The variables considered are 'Maximum temperature', which records the highest temperature in a day, and 'Solar exposure', which records the total solar energy falling on a horizontal surface in a day. Each variable is inspected separately for each city, and its distribution is visualized and compared with the normal distribution.  
Normality is assessed graphically with histograms, density plots and Q-Q plots, and formally with the Shapiro-Wilk test.

## Load Packages
Loaded the packages needed for the analysis.

```{r setup}
    knitr::opts_knit$set(root.dir = normalizePath("~/Desktop/RMIT_Year1_Semester 1/Applied Analytics/Assignments")) 
```

```{r}
library(readr) #Used for reading data
library(magrittr) #Used for Forward-pipe operator
library(dplyr) #Used for data manipulation
```

## Data
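* Imported the climate data for Melbourne and Sydney.
* Dropped the seventh column (max wind speed), which is not analysed, keeping the variables of interest: Maximum temperature and Solar exposure.
* Combined the two city subsets into one data frame (`total`) for the grouped summaries.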


```{r}
# This is a chunk for your Data section. 
setwd("~/Desktop/RMIT_Year1_Semester 1/Applied Analytics/Assignments")
mel<- read_csv("Climate data-Melbourne.csv") 
syd<- read_csv("Climate data-Sydney.csv")

mel_subset <- mel %>% select(-7)
syd_subset <- syd %>% select(-7) 
total <- rbind(mel_subset, syd_subset)
```
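
Because the later summaries group by `station number`, a reader has to remember which number belongs to which city. The following is a minimal sketch of one way to attach a readable label, written in base R on a small made-up data frame. The `city` column, the `station_to_city` lookup and the `demo_total` data are illustrative names and values that do not appear in the original analysis; the station numbers are the ones quoted in the Interpretation section.

```r
# Hypothetical lookup from station number to city name.
# Station numbers are those quoted in the Interpretation section.
station_to_city <- c("66212" = "Sydney", "86282" = "Melbourne")

# Small made-up data frame standing in for `total` (values are not real data).
demo_total <- data.frame(
  `station number` = c(86282, 86282, 66212),
  `solar exposure` = c(20.1, 25.3, 18.7),
  check.names = FALSE
)

# Index the lookup by the station number converted to character.
demo_total$city <- unname(station_to_city[as.character(demo_total$`station number`)])

demo_total
```

Grouping by such a `city` column instead of `station number` would make the summary tables self-explanatory.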


## Summary Statistics
Calculated descriptive statistics (mean, median, standard deviation, first and third quartiles, interquartile range, minimum and maximum) of each selected variable, grouped by city (identified by station number).

```{r}
# This is a chunk for your Summary Statistics section
# Summary Statistics of Maximum temperature group by city
total %>%
  group_by(`station number`) %>%
  summarise(`Mean Maximum temperature (Degree C)` = mean(`Maximum temperature (Degree C)`),
            median = median(`Maximum temperature (Degree C)`),
            sd = sd(`Maximum temperature (Degree C)`),
            `First Quartile` = quantile(`Maximum temperature (Degree C)`, 0.25),
            `Third Quartile` = quantile(`Maximum temperature (Degree C)`, 0.75),
            `Interquartile Range` = `Third Quartile` - `First Quartile`,
            `Minimum Value` = min(`Maximum temperature (Degree C)`),
            `Maximum Value` = max(`Maximum temperature (Degree C)`)
            )
# Summary Statistics of solar exposure group by city
total %>%
  group_by(`station number`) %>%
  summarise(`Mean solar exposure` = mean(`solar exposure`),
            median = median(`solar exposure`),
            sd = sd(`solar exposure`),
            `First Quartile` = quantile(`solar exposure`, 0.25),
            `Third Quartile` = quantile(`solar exposure`, 0.75),
            `Interquartile Range` = `Third Quartile` - `First Quartile`,
            `Minimum Value` = min(`solar exposure`),
            `Maximum Value` = max(`solar exposure`)
            )
```
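
The two summary blocks above compute the same eight statistics for different columns. As a sketch of how that repetition could be factored out, the base R example below defines a hypothetical helper, `describe()`, and applies it per group. The helper name, the `demo` data frame and its values are assumptions made for illustration and are not part of the original notebook.

```r
# Hypothetical helper: descriptive statistics for one numeric vector.
describe <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))  # first and third quartiles
  c(mean = mean(x), median = median(x), sd = sd(x),
    q1 = q[1], q3 = q[2], iqr = q[2] - q[1],
    min = min(x), max = max(x))
}

# Made-up values standing in for one variable measured in two cities.
demo <- data.frame(
  city  = rep(c("Melbourne", "Sydney"), each = 5),
  value = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
)

# Split the values by city and build one row of statistics per city.
stats_by_city <- t(sapply(split(demo$value, demo$city), describe))
stats_by_city
```

The same helper could be applied to each variable of interest in turn, so the list of statistics is written only once.
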
## Distribution Fitting
* Compared the empirical distribution of each selected variable with the normal distribution, separately for Melbourne and Sydney.
* Plotted a histogram of each variable with a normal distribution overlay.
* Performed the Shapiro-Wilk test on each variable to check whether it follows the normal distribution (a p-value less than 0.05 indicates a significant departure from normality).
* Drew quantile-quantile plots of the sample quantiles against the theoretical quantiles of the normal distribution. If the points lie close to the reference line drawn by `qqline()`, which passes through the first and third quartiles, the sample is consistent with a normal distribution.

* Extracted each variable as a vector so that the normality checks can be applied to it directly, starting with the Maximum temperature (Degree C) variable of the Melbourne dataset.


```{r}
# This is a chunk for your Distribution Fitting section. 
mel_temp<-mel_subset$`Maximum temperature (Degree C)` 
mel_solar<-mel_subset$`solar exposure` 
syd_temp<-syd_subset$`Maximum temperature (Degree C)`
syd_solar<-syd_subset$`solar exposure` 
```
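
The decision rule described in the list above (a p-value below 0.05 signals a departure from normality) can be sketched as a small wrapper around `shapiro.test()`. The helper name `check_normality()` and both sample vectors are assumptions made for illustration; the samples are deterministic made-up values, not the climate data.

```r
# Hypothetical helper wrapping shapiro.test() with the 0.05 decision rule.
check_normality <- function(x, alpha = 0.05) {
  res <- shapiro.test(x)
  list(W = unname(res$statistic),
       p_value = res$p.value,
       reject_normality = res$p.value < alpha)
}

# Two deterministic made-up samples of 90 values, the size of each city's data.
near_normal <- qnorm(ppoints(90))   # evenly spaced normal quantiles
skewed      <- (1:90)^3             # strongly right-skewed values

check_normality(near_normal)$reject_normality
check_normality(skewed)$reject_normality
```

In the notebook the same call would take `mel_temp`, `mel_solar`, `syd_temp` or `syd_solar` as its argument. A non-rejection only means the test found no evidence against normality, which is why the plots are inspected as well.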

## Melbourne Maximum temperature

```{r}
# Histogram of Melbourne maximum temperature
h<-hist(mel_temp, breaks=15,  col="yellow", xlab="Maximum temperature",
        main="Histogram of Melbourne Maximum temperature", prob=TRUE)
lines(density(mel_temp))
xfit<-seq(min(mel_temp),max(mel_temp),length=40)
yfit<-dnorm(xfit,mean=mean(mel_temp),sd=sd(mel_temp))
lines(xfit, yfit, col="blue", lwd=2)
```
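
The histogram-with-overlay chunk is repeated for each of the four vectors with only the data, title and colour changing. A minimal sketch of how it might be wrapped in a function follows; `plot_normal_overlay()` is a hypothetical name, and the example draws a made-up sample rather than the climate data.

```r
# Hypothetical helper: histogram with kernel density and fitted normal curve.
plot_normal_overlay <- function(x, main, col, xlab) {
  h <- hist(x, breaks = 15, col = col, xlab = xlab, main = main, freq = FALSE)
  lines(density(x))                                  # empirical density estimate
  xfit <- seq(min(x), max(x), length.out = 40)
  yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))    # normal with sample mean and sd
  lines(xfit, yfit, col = "blue", lwd = 2)
  invisible(h)                                       # return the histogram object
}

# Made-up sample; in the notebook this would be, for example, mel_temp.
set.seed(1)
demo_x <- rnorm(90, mean = 25, sd = 5)
h <- plot_normal_overlay(demo_x, main = "Histogram of demo data",
                         col = "yellow", xlab = "Value")
```

Here `freq = FALSE` plays the role of `prob=TRUE` in the chunks of this notebook: both put the histogram on the density scale so that the two curves are comparable with the bars.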

```{r}
#Boxplot of Melbourne maximum temperature
boxplot(mel_temp, horizontal = TRUE)
```

```{r}
#Shapiro test for Melbourne maximum temperature
shapiro.test(mel_temp)
```
```{r}
#Q-Q plot of Melbourne maximum temperature
qqnorm(mel_temp)
qqline(mel_temp)
```
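
By default, `qqline()` draws its reference line through the first and third quartiles of the sample and of the normal distribution, rather than at a fixed 45-degree angle. The sketch below computes that line by hand to show where it comes from; `qq_reference_line()` is a hypothetical helper and `demo_x` is a made-up sample.

```r
# Slope and intercept of the line through the first and third quartiles,
# which is how qqline() places its reference line by default.
qq_reference_line <- function(x) {
  y <- unname(quantile(x, c(0.25, 0.75)))  # sample quartiles
  z <- qnorm(c(0.25, 0.75))                # matching standard normal quantiles
  slope <- diff(y) / diff(z)
  c(intercept = y[1] - slope * z[1], slope = slope)
}

# Made-up sample: normal quantiles rescaled to mean 5 and standard deviation 2.
demo_x <- 5 + 2 * qnorm(ppoints(90))
qq_reference_line(demo_x)
```

For a roughly normal sample the intercept and slope approximate the mean and standard deviation, so points that stay close to the line support normality, while systematic curvature away from it suggests skewness or heavy tails.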

## Melbourne Solar Exposure
```{r}
#Histogram of Melbourne solar exposure
h<-hist(mel_solar, breaks=15,  col="tomato", xlab="Solar exposure",
        main="Histogram of Melbourne solar exposure", prob=TRUE)
lines(density(mel_solar))
xfit<-seq(min(mel_solar),max(mel_solar),length=40)
yfit<-dnorm(xfit,mean=mean(mel_solar),sd=sd(mel_solar))
lines(xfit, yfit, col="blue", lwd=2)
```
```{r}
#Boxplot of Melbourne solar exposure
boxplot(mel_solar, horizontal = TRUE)
```
```{r}
#Shapiro test of Melbourne solar exposure
shapiro.test(mel_solar)
```
```{r}
#Q-Q plot of Melbourne solar exposure
qqnorm(mel_solar)
qqline(mel_solar)
```

## Sydney Maximum temperature
```{r}
#Histogram of Sydney maximum temperature 
h<-hist(syd_temp, breaks=15,  col="skyblue", xlab="Maximum temperature",
        main="Histogram of Sydney maximum temperature", prob=TRUE)
lines(density(syd_temp))
xfit<-seq(min(syd_temp),max(syd_temp),length=40)
yfit<-dnorm(xfit,mean=mean(syd_temp),sd=sd(syd_temp))
lines(xfit, yfit, col="blue", lwd=2)
```
```{r}
#Boxplot of Sydney maximum temperature
boxplot(syd_temp, horizontal = TRUE)
```
```{r}
#Shapiro test of Sydney maximum temperature
shapiro.test(syd_temp)
```
```{r}
#Q-Q plot of Sydney maximum temperature
qqnorm(syd_temp)
qqline(syd_temp)
```

## Sydney Solar exposure
```{r}
#Histogram of Sydney solar exposure
h<-hist(syd_solar, breaks=15,  col="lightgreen", xlab="Solar exposure",
        main="Histogram of Sydney solar exposure", prob=TRUE)
lines(density(syd_solar))
xfit<-seq(min(syd_solar),max(syd_solar),length=40)
yfit<-dnorm(xfit,mean=mean(syd_solar),sd=sd(syd_solar))
lines(xfit, yfit, col="blue", lwd=2)
```
```{r}
#Boxplot of Sydney solar exposure
boxplot(syd_solar, horizontal = TRUE)
```
```{r}
#Q-Q plot of Sydney solar exposure
qqnorm(syd_solar)
qqline(syd_solar)
```
```{r}
#Shapiro test of Sydney solar exposure
shapiro.test(syd_solar)
```

## Interpretation
The descriptive statistics show that Sydney (station number 66212) and Melbourne (station number 86282) have about the same mean maximum temperature, but the standard deviation is greater for Melbourne, which indicates slightly higher variability in the maximum temperatures observed there.
Maximum temperature in both cities is consistent with an approximately normal distribution, as indicated by the histogram overlays and the quantile-quantile plots; the Shapiro-Wilk test does not reject normality (p-values of 0.157 for Melbourne and 0.2533 for Sydney, both greater than 0.05).

The mean solar exposure is reported as higher for Melbourne than for Sydney, although the same value (21.90) is quoted for both cities, so the exact means should be read from the summary statistics output. The box plot of Sydney solar exposure indicates that the data are left-skewed, and the box plot of Melbourne solar exposure shows a similar left skew. The Shapiro-Wilk test also indicates a significant departure from normality, with p-values of 0.0002275 for Melbourne solar exposure and 0.03994 for Sydney solar exposure. Thus the solar exposure variable does not follow the normal distribution in either city.
The maximum temperature


