Required packages
library(dplyr)
library(tidyr)
library(readr)
library(lubridate)
library(outliers)
library(ggplot2)
library(forecast)
Registered S3 method overwritten by 'xts':
method from
as.zoo.xts zoo
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Registered S3 methods overwritten by 'forecast':
method from
fitted.fracdiff fracdiff
residuals.fracdiff fracdiff
This is forecast 8.9
Want to stay up-to-date? Read the Hyndsight blog:
https://robjhyndman.com/hyndsight/
Executive Summary
- In this task of Data Preprocessing, selecting two open data-sets and combining them and understanding the datatypes, applying proper type conversions. Further identifying weather the data-set is in Tidy format and satisfies tidy principles.
- Then we analyze the data for missing values, handles the missing values and along with that scanning for any outliers.
- Along with that verifying if the data is in normalized form and performing required suitable transformation to make it into normalized form. Performing appropriate methods to remove the outliers such that it helps creating the report more accurate.
- Concluding, we are making the data-set ready and suitable for any statistical analysis.
Data
Datasets from Bureau of Meteorology - Autralian Government have been considered for performing data preprocessing.
Source: http://bom.gov.au/climate/data
We have considered 2 datasets which consists data about Rainfall and Temperature in last 5 years.
In dataset 1 containing Rainfall data have variables Bureau of Meteorology station number, Year, Month, Rainfall Amount, Quality and Report Generation Date.
‘Bureau of Meteorology station number’ gives unique identification number for a weather station from which data is collected. ‘Year’ indicates the year in which the observation was recorded.
Month indicated month in which observation was recorded. ‘Rainfall Amount’ gives mean quantity of rainfall observed in that particular month in milimeters(mm) as unit.
‘Quality’ is a quality flag. Climate data pass through a number of stages in quality control which occurs over a period of time. If qulity flag = Y then all the minimum requirements are fulfilled and it ensures there is no errors. If quality flag = N then it indicates that all the minimun requirements have been fulfilled. However there exists certain error in the observations conidered. ‘Report Generation Date’ is the date on which the observations were published.
‘Outcome’ indicates High if mean value for that month is higher than Annual mean value.
In dataset 2 containing Temperature data have variables Bureau of Meteorology station number, Year, Month, Mean Maximum temperature, Quality and Report Generation Date.
‘Bureau of Meteorology station number’ gives unique identification number for a weather station from which data is collected. ‘Year’ indicates the year in which the observation was recorded. Month indicated month in which observation was recorded.
‘Mean Maximum temperature’ gives mean of maximum temperature observed in a day in a month in Celcius (C) as unit.
‘Quality’ is a quality flag. Climate data pass through a number of stages in quality control which occurs over a period of time. If qulity flag = Y then all the minimum requirements are fulfilled and it ensures there is no errors. If quality flag = N then it indicates that all the minimun requirements have been fulfilled. However there exists certain error in the observations conidered. ‘Report Generation Date’ is the date on which the observations were published.
‘Outcome’ indicates High if mean value for that month is higher than Annual mean value.
setwd("F:/DataPre/Assignment3")
data1 <- read_csv("Data/Data_rainfall_melbourne.csv")
Parsed with column specification:
cols(
`Bureau of Meteorology station number` = [32mcol_double()[39m,
Year = [32mcol_double()[39m,
Month = [32mcol_double()[39m,
`Rainfall Amount` = [32mcol_double()[39m,
Quality = [31mcol_character()[39m,
`Report Generation Date` = [31mcol_character()[39m,
Outcome = [31mcol_character()[39m
)
data2 <- read_csv("Data/Data_temperature_melbourne.csv")
Parsed with column specification:
cols(
`Bureau of Meteorology station number` = [32mcol_double()[39m,
Year = [32mcol_double()[39m,
Month = [32mcol_double()[39m,
`Mean maximum temperature` = [32mcol_double()[39m,
Quality = [31mcol_character()[39m,
`Report Generation Date` = [31mcol_character()[39m,
Outcome = [31mcol_character()[39m
)
Dataset 1
- Removing the Station number variable as it persistes as same throughout the dataset.
data1 <- data1 %>% select(-`Bureau of Meteorology station number`)
head(data1)
Dataset 2
- Removing the Station number variable as it persistes as same throughout the dataset.
data2 <- data2 %>% select(-`Bureau of Meteorology station number`)
head(data2)
Joining Datasets
- Combining the datasets using Year, Month and Report generation date as common attribute using left join to ensure that only those records from second dataset are combined which have same Year, Month and Report generation date as in first dataset.
data <- left_join(data1, data2 , by=c("Year","Month", "Report Generation Date"))
head(data)
Understand
- In this step we are typecasting variable Outcome.x and Outcome.y from character to an ordered factor using factor() function.
- We are converting ‘Report generation Date’ variable to Date type from character type using as.Date() method. and summarising Rainfall and Temperature we get to know that there are 6 and 5 missing values respectively and have mean rainfall of 1.6 mm and Temperature of 20.4 degree celcius monthly for past 5 years.
str(data)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 57 obs. of 9 variables:
$ Year : num 2015 2015 2015 2015 2015 ...
$ Month : num 1 2 3 4 5 6 7 8 9 10 ...
$ Rainfall Amount : num 1.832 1.478 0.967 1.038 1.38 ...
$ Quality.x : chr "N" "Y" "Y" "Y" ...
$ Report Generation Date : chr "01/02/2015" "01/03/2015" "01/04/2015" "01/05/2015" ...
$ Outcome.x : chr "High" "High" "Low" "Low" ...
$ Mean maximum temperature: num 25.9 26.4 22.7 19.2 17.1 NA 13.3 13.8 17.3 NA ...
$ Quality.y : chr "Y" "Y" "Y" "N" ...
$ Outcome.y : chr "High" "High" "High" "Low" ...
data$Outcome.x <- factor(data$Outcome.x, levels = c("Low","High"), ordered = TRUE)
data$Outcome.y <- factor(data$Outcome.y,levels = c("Low","High"), ordered = TRUE)
data$Quality.x <- factor(data$Quality.x,levels = c("N","Y"))
data$Quality.y <- factor(data$Quality.y,levels = c("N","Y"))
data$`Report Generation Date` <- as.Date(data$`Report Generation Date`, format = "%d/%m/%Y")
summary(data$`Rainfall Amount`)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0410 0.9605 1.5460 1.6048 2.0945 4.7560 6
summary(data$`Mean maximum temperature`)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
13.30 15.80 20.75 20.40 25.20 27.50 5
str(data)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 57 obs. of 9 variables:
$ Year : num 2015 2015 2015 2015 2015 ...
$ Month : num 1 2 3 4 5 6 7 8 9 10 ...
$ Rainfall Amount : num 1.832 1.478 0.967 1.038 1.38 ...
$ Quality.x : Factor w/ 2 levels "N","Y": 1 2 2 2 2 2 1 1 2 2 ...
$ Report Generation Date : Date, format: "2015-02-01" "2015-03-01" "2015-04-01" "2015-05-01" ...
$ Outcome.x : Ord.factor w/ 2 levels "Low"<"High": 2 2 1 1 2 1 2 NA 1 1 ...
$ Mean maximum temperature: num 25.9 26.4 22.7 19.2 17.1 NA 13.3 13.8 17.3 NA ...
$ Quality.y : Factor w/ 2 levels "N","Y": 2 2 2 1 2 1 2 2 2 1 ...
$ Outcome.y : Ord.factor w/ 2 levels "Low"<"High": 2 2 2 1 1 NA 1 1 1 NA ...
Tidy & Manipulate Data I
str(data)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 57 obs. of 9 variables:
$ Year : num 2015 2015 2015 2015 2015 ...
$ Month : num 1 2 3 4 5 6 7 8 9 10 ...
$ Rainfall Amount : num 1.832 1.478 0.967 1.038 1.38 ...
$ Quality.x : Factor w/ 2 levels "N","Y": 1 2 2 2 2 2 1 1 2 2 ...
$ Report Generation Date : Date, format: "2015-02-01" "2015-03-01" "2015-04-01" "2015-05-01" ...
$ Outcome.x : Ord.factor w/ 2 levels "Low"<"High": 2 2 1 1 2 1 2 NA 1 1 ...
$ Mean maximum temperature: num 25.9 26.4 22.7 19.2 17.1 NA 13.3 13.8 17.3 NA ...
$ Quality.y : Factor w/ 2 levels "N","Y": 2 2 2 1 2 1 2 2 2 1 ...
$ Outcome.y : Ord.factor w/ 2 levels "Low"<"High": 2 2 2 1 1 NA 1 1 1 NA ...
head(data)
- The dataset is Tidy as each variable have its own column, Each observation have its own row and Each value must have its own cell. Hence fulfilling the tidy principles we can say that the dataset is tidy.
Tidy & Manipulate Data II
- Mutating the variable Month and replacing the numerical representation of month number with name of month.
data <- mutate(data,Month=factor(Month,labels = c('Jan','Feb','Mar','April','May','June','July', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec')))
head(data)
Scan I
- Scanning the data for missing values
colSums(is.na(data))
Year Month Rainfall Amount Quality.x Report Generation Date
0 0 6 0 0
Outcome.x Mean maximum temperature Quality.y Outcome.y
6 5 0 5
sum(is.na(data))
[1] 22
- After Scanning the data for missing values and we can see there are 22 missing values in total from different columns.
new_data <- na.omit(data)
colSums(is.na(new_data))
Year Month Rainfall Amount Quality.x Report Generation Date
0 0 0 0 0
Outcome.x Mean maximum temperature Quality.y Outcome.y
0 0 0 0
sum(is.na(new_data))
[1] 0
- In this step, removing the missing values rather than going for other approaches like replacing values with a constant or mean or median as this can affect the ‘Outcome’ variable and give false wrong value.
Scan II
z.scores_rainfall <- new_data$`Rainfall Amount` %>% scores(type = "z")
length (which( abs(z.scores_rainfall) >3 ))
[1] 1
boxplot(new_data$`Rainfall Amount`, main = "BoxPlot of Mean Rainfall", ylab = "Rainfall Amount in (mm)" , col = "red" )

z.scores_temperature <- new_data$`Mean maximum temperature` %>% scores(type = "z")
length (which( abs(z.scores_temperature) >3 ))
[1] 0
boxplot(new_data$`Mean maximum temperature`, main = "BoxPlot of Mean Temperature", ylab = "Mean Temperature in (C)" , col = "blue" )

- In this step, after scanning for outliers using boxplot we can infer that there exist only one outlier in Rainfall data while no outliers in Temperature data.
clean_rainfall <- new_data$`Rainfall Amount`[ - which( abs(z.scores_rainfall) >3 )]
z.scores_rainfall <- clean_rainfall %>% scores(type = "z")
length (which( abs(z.scores_rainfall) >3 ))
[1] 0
Here we can see we have sucessfully removed all outliers from rainfall data
---
title: "MATH2349 Semester 2, 2019"
author: "Devansh Parmar - S3793557"
subtitle: Assignment 3
output:
  html_notebook: default
---


## Required packages 

```{r}
library(dplyr)
library(tidyr)
library(readr)
library(lubridate)
library(outliers)
library(ggplot2)
library(forecast)

```


## Executive Summary 

- In this task of Data Preprocessing, selecting two open data-sets and combining them and understanding the datatypes, applying proper type conversions. Further identifying weather the data-set is in Tidy format and satisfies tidy principles. 
- Then we analyze the data for missing values, handles the missing values and along with that scanning for any outliers. 
- Along with that verifying if the data is in normalized form and performing required suitable transformation to make it into normalized form. Performing appropriate methods to remove the outliers such that it helps creating the report more accurate. 
- Concluding, we are making the data-set ready and suitable for any statistical analysis.


## Data 

- Datasets from Bureau of Meteorology - Autralian Government have been considered for performing data preprocessing.

- Source: http://bom.gov.au/climate/data

- We have considered 2 datasets which consists data about Rainfall and Temperature in last 5 years.

- In dataset 1 containing Rainfall data have variables Bureau of Meteorology station number, Year, Month, Rainfall Amount, Quality and Report Generation Date. 

- 'Bureau of Meteorology station number' gives unique identification number for a weather station from which data is collected. 'Year' indicates the year in which the observation was recorded. 

- Month indicated month in which observation was recorded. 
'Rainfall Amount' gives mean quantity of rainfall observed in that particular month in milimeters(mm) as unit. 

- 'Quality' is a quality flag. Climate data pass through a number of stages in quality control which occurs over a period of time.  If qulity flag = Y then all the minimum requirements are fulfilled and it ensures there is no errors. If quality flag = N then it indicates that all the minimun requirements have been fulfilled. However there exists certain error in the observations conidered. 'Report Generation Date' is the date on which the observations were published.

- 'Outcome' indicates High if mean value for that month is higher than Annual mean value.

- In dataset 2 containing Temperature data have variables Bureau of Meteorology station number, Year, Month, Mean Maximum temperature, Quality and Report Generation Date.

- 'Bureau of Meteorology station number' gives unique identification number for a weather station from which data is collected. 'Year' indicates the year in which the observation was recorded. Month indicated month in which observation was recorded. 

- 'Mean Maximum temperature' gives mean of maximum temperature observed in a day in a month in Celcius (C) as unit. 

- 'Quality' is a quality flag. Climate data pass through a number of stages in quality control which occurs over a period of time.  If qulity flag = Y then all the minimum requirements are fulfilled and it ensures there is no errors. If quality flag = N then it indicates that all the minimun requirements have been fulfilled. However there exists certain error in the observations conidered. 'Report Generation Date' is the date on which the observations were published. 

- 'Outcome' indicates High if mean value for that month is higher than Annual mean value.

```{r}
setwd("F:/DataPre/Assignment3")
data1 <- read_csv("Data/Data_rainfall_melbourne.csv")
data2 <- read_csv("Data/Data_temperature_melbourne.csv")
```

## Dataset 1
- Removing the Station number variable as it persistes as same throughout the dataset.
```{r}
data1 <- data1 %>% select(-`Bureau of Meteorology station number`)
head(data1)
```

## Dataset 2
- Removing the Station number variable as it persistes as same throughout the dataset.
```{r}
data2 <- data2 %>% select(-`Bureau of Meteorology station number`)
head(data2)
```
## Joining Datasets

- Combining the datasets using Year, Month and Report generation date as common attribute using left join to ensure that only those records from second dataset are combined which have same Year, Month and Report generation date as in first dataset.

```{r}
data <- left_join(data1, data2 , by=c("Year","Month", "Report Generation Date"))
head(data)
```

## Understand 

- In this step we are typecasting variable Outcome.x and Outcome.y from character to an ordered factor using factor() function.
- We are converting 'Report generation Date' variable to Date type from character type using as.Date() method.
and summarising Rainfall and Temperature we get to know that there are 6 and 5 missing values respectively and have mean rainfall of 1.6 mm and Temperature of 20.4 degree celcius monthly for past 5 years.

```{r}
str(data)
data$Outcome.x <-  factor(data$Outcome.x, levels = c("Low","High"), ordered = TRUE)
data$Outcome.y <- factor(data$Outcome.y,levels = c("Low","High"), ordered = TRUE)
data$Quality.x <- factor(data$Quality.x,levels = c("N","Y"))
data$Quality.y <- factor(data$Quality.y,levels = c("N","Y"))
data$`Report Generation Date` <- as.Date(data$`Report Generation Date`, format = "%d/%m/%Y")
summary(data$`Rainfall Amount`)
summary(data$`Mean maximum temperature`)
str(data)
```

##	Tidy & Manipulate Data I 

```{r}
str(data)
head(data)
```

- The dataset is Tidy as each variable have its own column, Each observation have its own row and Each value must have its own cell. Hence fulfilling the tidy principles we can say that the dataset is tidy. 


##	Tidy & Manipulate Data II 

- Mutating the variable Month and replacing the numerical representation of month number with name of month.

```{r}
data <- mutate(data,Month=factor(Month,labels = c('Jan','Feb','Mar','April','May','June','July', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec')))
head(data)
```


##	Scan I 

- Scanning the data for missing values

```{r}
colSums(is.na(data))
sum(is.na(data))
```

- After Scanning the data for missing values and we can see there are 22 missing values in total from different columns.

```{r}
new_data <- na.omit(data) 
colSums(is.na(new_data)) 
sum(is.na(new_data))
```

- In this step, removing the missing values rather than going for other approaches like replacing values with a constant or mean or median as this can affect the 'Outcome' variable and give false wrong value.

##	Scan II

```{r}
z.scores_rainfall <- new_data$`Rainfall Amount` %>%  scores(type = "z")
length (which( abs(z.scores_rainfall) >3 ))
boxplot(new_data$`Rainfall Amount`, main = "BoxPlot of  Mean Rainfall", ylab = "Rainfall Amount in (mm)" , col = "red" )
z.scores_temperature <- new_data$`Mean maximum temperature` %>%  scores(type = "z")
length (which( abs(z.scores_temperature) >3 ))
boxplot(new_data$`Mean maximum temperature`, main = "BoxPlot of Mean Temperature", ylab = "Mean Temperature in (C)" , col = "blue" )
```

- In this step, after scanning for outliers using boxplot we can infer that there exist only one outlier in Rainfall data while no outliers in Temperature data.

```{r}
clean_rainfall <- new_data$`Rainfall Amount`[ - which( abs(z.scores_rainfall) >3 )]
z.scores_rainfall <- clean_rainfall %>%  scores(type = "z")
length (which( abs(z.scores_rainfall) >3 ))
```
Here we can see we have sucessfully removed all outliers from rainfall data

##	Transform 

```{r}
hist(new_data$`Rainfall Amount`, main = "Histogram of Temperature with vertical mean line", xlab = " Mean Rainfall in (mm)")
abline(v = mean(new_data$`Rainfall Amount`), col="red", lwd=3, lty=2)
hist(new_data$`Mean maximum temperature`, main = "Histogram of Temperature with vertical mean line", xlab = "Mean Temperature in (C)")
abline(v = mean(new_data$`Mean maximum temperature`), col="red", lwd=3, lty=2)
```

- We can see that both the data are slightly right skewed. Hence applying Boxcox Transformation in order to normalize the data.

```{r}
boxcox_rainfall <- BoxCox(new_data$`Rainfall Amount`, lambda = "auto")
hist(boxcox_rainfall, main="Histogram of Rainfall data after Transformation", xlab = " Mean Rainfall in (mm)" )
abline(v = mean(boxcox_rainfall), col="blue", lwd=3, lty=2)
boxcox_temperature <- BoxCox(new_data$`Mean maximum temperature`, lambda = "auto")
hist(boxcox_temperature, main="Histogram of Temperature data after Transformation", xlab = "Mean Temperature in (C)")
abline(v = mean(boxcox_temperature), col="blue", lwd=3, lty=2)
```
- Here after transformation through visual inspection we can see the data is in normalized form using mean line as reference.

<br>
<br>
