Required packages
library("dplyr")
library("tidyr")
library("readr")
Executive Summary
The aim of this report is to preprocess the combined data of emission rates and surface temperature of countries around the world across couple of centuries which could be used to find (if any) correlation between the two variables.
In the Tidy & Manipulate sections, we first tidy the tables separately by using tidyr functions like gather, separate and drop unwanted columns (variables). We make sure both the tables are in the same format so that we can join them correctly. After merging the tables, we carry out any type conversions that may be required. We then manipulate the data using mutate function to add new variables.
In the SCAN sections, we scan for missing values after we are done tidying and manipulating the data. We handle the missing values with the appropriate methods, we impute the missing values with the mean values of the remaining data. Scanning for outliers is the next part of the preprocessing and we find out that we require to Transform the data to make it more appropriate for our evaluation of the correlation between the emission rates and the surface temperature. After carrying out the transformation we again scan for outliers and deal with them with suitable methods.
Completing all these steps makes the data ready for evaluation and we end the report by displaying a scatter plot showing the relation of the emission rates and surface temperatures.
Data
This source has repackaged the data from compilation put together by the Berkeley Earth, which is affiliated with Lawrence Berkeley National Laboratory. The Berkeley Earth Surface Temperature Study combines 1.6 billion temperature reports from 16 pre-existing archives. The raw data comes from the Berkeley Earth data page.
This source has multiple files that give temperature data for the world. We select the CSV file - GlobalLandTemperaturesByCountry.csv as our first data source.
The GlobalLandTemperaturesByCountry data set consists of the following variables:
dt [date]: Date (yyyy-mm-dd) when the temperture was recorded, starts from the year 1743
AverageTemperature [numeric]: Average Land Temperature in Celsius
AverageTemperatureUncertainty [numeric]: the 95% confidence interval around the average temperature
Country [character]: name of the country for which the temperature is recorded
This dataset contains emission data of Green House Gases (in tonnes) of different countries from year 1750 - 2017 taken from https://ourworldindata.org/co2-and-other-greenhouse-gas-emissions
Our second data source is the CSV file - emission data.csv
The emission data data set consists of the following variables:
Country [character]: name of the country for which the emission data is recorded
1751, 1752,..., 2018, 2019 [numeric]: the year that data was collected. Gives the emission data recorded in tonnes.
Reading the Data:
Importing the data file emission data.csv:
emissionData <- read_csv("emission data.csv")
The first 6 observations of the emission data data set are:
head(emissionData)
Importing the data file GlobalLandTemperaturesByCountry.csv:
temperatureData <- read_csv("GlobalLandTemperaturesByCountry.csv")
The first 6 observations of the GlobalLandTemperaturesByCountry data set are:
head(temperatureData)
NOTE: We do not merge the datasets in this section as the datasets need to be tidied before we can proceed to merge them. The data sets are merged in the Merging Datasets section after Tidy & Manipulate Data I.
Understand
1. Checking the dimensions and structure of the data frame emissionData:
dim(emissionData)
[1] 231 268
str(emissionData[,1:8])
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 231 obs. of 8 variables:
$ Country: chr "Afghanistan" "Africa" "Albania" "Algeria" ...
$ 1751 : num 0 0 0 0 0 0 0 0 0 0 ...
$ 1752 : num 0 0 0 0 0 0 0 0 0 0 ...
$ 1753 : num 0 0 0 0 0 0 0 0 0 0 ...
$ 1754 : num 0 0 0 0 0 0 0 0 0 0 ...
$ 1755 : num 0 0 0 0 0 0 0 0 0 0 ...
$ 1756 : num 0 0 0 0 0 0 0 0 0 0 ...
$ 1757 : num 0 0 0 0 0 0 0 0 0 0 ...
The columns 9-268 all give the emission reading for a year and are of numeric class. These aren’t shown above to improve readability.
2. Checking the attributes of emissionData:
- Checking the data type of the
Country variable:
class(emissionData$Country)
[1] "character"
typeof(emissionData$Country)
[1] "character"
The variable Country gives the name of the country. The variable is of type character. No type conversions are needed.
- Checking the data type of the
1751 variable:
class(emissionData$`1751`)
[1] "numeric"
typeof(emissionData$`1751`)
[1] "double"
The variable 1751 gives emission readings for the year 1751. The variable is of type double. The emission readings are continuous numeric values and cannot be factored.
The remaining variables 1752 - 2019 are all numeric and give the emission values. We do not individually check the type of all the variables due to large number of variables.
3. Checking the dimensions and structure of the data frame temperatureData:
dim(temperatureData)
[1] 577462 4
str(temperatureData, give.attr = FALSE)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 577462 obs. of 4 variables:
$ dt : Date, format: "1743-11-01" "1743-12-01" "1744-01-01" "1744-02-01" ...
$ AverageTemperature : num 4.38 NA NA NA NA ...
$ AverageTemperatureUncertainty: num 2.29 NA NA NA NA ...
$ Country : chr "Ã…land" "Ã…land" "Ã…land" "Ã…land" ...
4. Checking the attributes of temperatureData:
- Checking the data type of the
dt variable:
class(temperatureData$dt)
[1] "Date"
typeof(temperatureData$dt)
[1] "double"
The variable dt is of class Date. dt gives the date when the temperature was recorded. The Date type is correct and no type conversion is made.
- Checking the data type of the
AverageTemperature variable:
class(temperatureData$AverageTemperature)
[1] "numeric"
typeof(temperatureData$AverageTemperature)
[1] "double"
The variable AverageTemperature gives average temperature value. The variable is numeric. The temperature readings are continuous numeric values and cannot be factored.
- Checking the data type of the
AverageTemperatureUncertainty variable:
class(temperatureData$AverageTemperatureUncertainty)
[1] "numeric"
typeof(temperatureData$AverageTemperatureUncertainty)
[1] "double"
The variable AverageTemperatureUncertainty is numeric. AverageTemperatureUncertainty has continuous numeric values and cannot be factored. The type numeric is correct. No changes are made.
- Checking the data type of the
Country variable:
class(temperatureData$Country)
[1] "character"
typeof(temperatureData$Country)
[1] "character"
The variable Country gives the name of the country. The variable is of type character. No type conversions are needed.
NOTE: As you can see we do not have any variables that we can factor. We factor year after we have gathered it in the Merging Datasets section after we merge the two datasets. We also create a new variable and factor it in section Tidy & Manipulate Data II.
Tidy & Manipulate Data I
1. Gather emissionData:
The columns 2-268 give the emission reading measured for the years 1751-2017. The columns 2-268 contain year data in the column names, these column names are gathered to form a new variable year.
-c(Country) denotes that we gather all the columns except the Country column.
emissionDataTidy <- emissionData %>% gather(-c(Country),key = "year", value = "emission")
emissionDataTidy
Dimensions of emissionData after gather():
dim(emissionDataTidy)
[1] 61677 3
3. Drop month, day and AverageTemperatureUncertainty columns from temperatureData:
We only need the year column, so we drop the month and day columns. The column AverageTemperatureUncertainty gives the 95% confidence interval around the average temperature. As we do not need this data for our evaluation, we drop this column too.
temperatureDataTidy <- temperatureDataTidy %>% select(-month, -day, -AverageTemperatureUncertainty)
temperatureDataTidy
Dimensions of temperatureDataTidy:
dim(emissionDataTidy)
[1] 61677 3
4. Group temperatureData by year and Country:
The temperatureDataTidy data set contains multiple rows for each country each corresponding to the same year. This is beacuse temperatureData had entries for different months of the same year. Now since we dropeed the month and kept only the year, we see multiple rows for the same year. To tidy the data, the data is grouped by year & Country to obtain the new average temperature value AverageTemperature for each year for each country.
temperatureDataTidy <- temperatureDataTidy %>% group_by(year, Country) %>% summarise(AverageTemperature = mean(AverageTemperature, na.rm = TRUE))
temperatureDataTidy
Dimensions of temperatureDataTidy after grouping:
dim(emissionDataTidy)
[1] 61677 3
Merging Datasets
We merge the two datasets - emissionDataTidy and temperatureDataTidy. We use the pair of variables Country and year as the key to join the tables using inner_join().
emissionTemperatureJoined <- inner_join(emissionDataTidy, temperatureDataTidy, by = c("Country","year"))
emissionTemperatureJoined
Dimensions of emissionTemperatureJoined:
dim(emissionTemperatureJoined)
[1] 37437 4
Factoring year variable
Examining the number of unique values for year
cat("Number of unique values for `year`:", length(unique(emissionTemperatureJoined$year)),
"\nNumber of rows in the data frame `emissionTemperatureJoined`:", nrow(emissionTemperatureJoined))
Number of unique values for `year`: 263
Number of rows in the data frame `emissionTemperatureJoined`: 37437
The number of unique values for year is much smaller than the size of the data frame emissionTemperatureJoined. Therefore we proceed to factor year.
We factor the year variable and order it.
emissionTemperatureJoined$year <- factor(emissionTemperatureJoined$year, levels= unique(emissionTemperatureJoined$year), ordered = TRUE)
str(emissionTemperatureJoined$year)
Ord.factor w/ 263 levels "1751"<"1752"<..: 1 1 1 1 1 1 1 1 1 1 ...
Tidy & Manipulate Data II
In this section we create a new variable thanMeanEmission, this variable will tell us if the emission value is higher or lower than the Mean Emission of all the countries for the corresponding year.
1. Calculating the mean emission value meanEmissionByYear for each year:
meanEmissionByYearData <- emissionTemperatureJoined %>%
group_by(year) %>%
summarise(meanEmissionByYear = mean(emission, na.rm = TRUE))
meanEmissionByYearData
2. Add the meanEmissionByYear variable to emissionTemperatureJoined joining by year:
The new data frame is named joinedData.
joinedData <- left_join(emissionTemperatureJoined, meanEmissionByYearData,"year")
joinedData
3. Use meanEmissionByYear to create a new variable thanMeanEmission using mutate():
Code referred from: Link
joinedData <- mutate(joinedData, thanMeanEmission = ifelse(emission > meanEmissionByYear, "Higher", "Lower"))
joinedData
4. Factoring thanMeanEmission :
joinedData$thanMeanEmission <- factor(joinedData$thanMeanEmission,
levels = c("Lower", "Higher"),
ordered = TRUE)
str(joinedData$thanMeanEmission)
Ord.factor w/ 2 levels "Lower"<"Higher": 1 1 1 1 1 1 1 1 1 1 ...
5. Dropping meanEmissionByYear:
We can see that the column meanEmissionByYear has the same value for multiple rows. Having repeated values isn’t desirable, so we drop this column. We just added this column to create the column thanMeanEmission.
Now that meanEmissionByYear has served its purpose, we drop it.
joinedData <- joinedData %>% select(-meanEmissionByYear)
str(joinedData)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 37437 obs. of 5 variables:
$ Country : chr "Albania" "Andorra" "Austria" "Belarus" ...
$ year : Ord.factor w/ 263 levels "1751"<"1752"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ emission : num 0 0 0 0 0 0 0 0 0 0 ...
$ AverageTemperature: num 13.57 12.2 7.18 6.69 10.15 ...
$ thanMeanEmission : Ord.factor w/ 2 levels "Lower"<"Higher": 1 1 1 1 1 1 1 1 1 1 ...
If we still do however need the mean emission for a year, we can obtain it from the data frame meanEmissionByYearData. This data frame has just 263 rows compared to the 37437 rows the variable meanEmissionByYear had in joinedData.
str(meanEmissionByYearData)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 263 obs. of 2 variables:
$ year : Ord.factor w/ 263 levels "1751"<"1752"<..: 1 2 3 4 5 6 7 8 9 10 ...
$ meanEmissionByYear: num 259737 519576 719459 959404 1199443 ...
Scan I
In this section we detect and deal with missing values.
1. Checking for Missing Values:
colSums(is.na(joinedData))
Country year emission AverageTemperature thanMeanEmission
0 0 0 1676 0
Only AverageTemperature column has missing values. We examine the missing values in the next step.
2. Examining the Missing Values:
joinedData %>% filter(is.na(joinedData$AverageTemperature))
sum(is.nan(joinedData$AverageTemperature))
[1] 1676
We can see that all the missing values are NaN, and they were generated as a result of calculation of mean_temp using mean() with na.rm = true. This resulted in an NaN value when all AverageTemperature values are NA for a particular country and year.
3. Calculating mean temperature for each Country to impute the missing values:
We choose to impute the missing values with the mean value of AverageTemperature. However, replacing the missing values with the mean of the all the Average temperatures isn’t wise. Instead, we impute the missing values with the mean Average temperatures of its corresponding country.
For example, the missing value of Average temperature for Bahamas for the year 1761 will be imputed with the mean temperatures of Bahamas for the rest of the years (1751 - 2019), ignoring any missing values, if any.
Here we calculate the mean temperature for each country through all the years (1751 - 2019) and not just one year.
countryMeanTemp <- joinedData %>%
group_by(Country) %>%
summarise(meanTempByCountry = mean(AverageTemperature, na.rm = TRUE))
countryMeanTemp
sum(is.na(countryMeanTemp))
[1] 0
The mean we calculated above doesn’t have any missing values. Now we can be sure that we do not impute any missing values with NA or NaN.
4. Combine countryMeanTemp with joinedData:
Before we impute the missing values, we need to combine the mean we calculated - countryMeanTemp with joinedData.
joinedData <- left_join(joinedData, countryMeanTemp, "Country")
joinedData
Checking the number of missing values before imputing them:
colSums(is.na(joinedData))
Country year emission AverageTemperature thanMeanEmission
0 0 0 1676 0
meanTempByCountry
0
5. Imputing missing values:
joinedData_imputed <- mutate(joinedData, AverageTemperature = ifelse(is.na(AverageTemperature),
meanTempByCountry,
AverageTemperature))
Checking the number of missing values after imputing them:
colSums(is.na(joinedData_imputed))
Country year emission AverageTemperature thanMeanEmission
0 0 0 0 0
meanTempByCountry
0
The output above shows us that all the missing values have been imputed successfully.
6. Dropping meanTempByCountry:
Just like meanEmissionByYear, meanTempByCountry also has same values for multiple rows. As having repeated values isn’t desirable, so we drop this column too. We just added this column to impute the missing values in AverageTemperature. Now that meanTempByCountry has served its purpose, we drop it.
joinedData_imputed <- joinedData_imputed %>% select(-meanTempByCountry)
str(joinedData_imputed)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 37437 obs. of 5 variables:
$ Country : chr "Albania" "Andorra" "Austria" "Belarus" ...
$ year : Ord.factor w/ 263 levels "1751"<"1752"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ emission : num 0 0 0 0 0 0 0 0 0 0 ...
$ AverageTemperature: num 13.57 12.2 7.18 6.69 10.15 ...
$ thanMeanEmission : Ord.factor w/ 2 levels "Lower"<"Higher": 1 1 1 1 1 1 1 1 1 1 ...
If we still do however need the mean tempereature for a country, we can obtain it from the data frame countryMeanTemp. This data frame has just 189 rows compared to the 37437 rows the variable meanTempByCountry had in joinedData_new_imputed.
str(countryMeanTemp)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 189 obs. of 2 variables:
$ Country : chr "Afghanistan" "Africa" "Albania" "Algeria" ...
$ meanTempByCountry: num 14.1 24.1 12.6 22.9 11.2 ...
Scan II
Lets take a look at our dataset:
str(joinedData_imputed)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 37437 obs. of 5 variables:
$ Country : chr "Albania" "Andorra" "Austria" "Belarus" ...
$ year : Ord.factor w/ 263 levels "1751"<"1752"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ emission : num 0 0 0 0 0 0 0 0 0 0 ...
$ AverageTemperature: num 13.57 12.2 7.18 6.69 10.15 ...
$ thanMeanEmission : Ord.factor w/ 2 levels "Lower"<"Higher": 1 1 1 1 1 1 1 1 1 1 ...
We have only two numeric values, emission and AverageTemperature
Examine emission variable for outliers:
joinedData_imputed %>% summary()
Country year emission AverageTemperature thanMeanEmission
Length:37437 1883 : 189 Min. :0.000e+00 Min. :-22.62 Lower :34654
Class :character 1884 : 189 1st Qu.:0.000e+00 1st Qu.: 9.99 Higher: 2783
Mode :character 1885 : 189 Median :0.000e+00 Median : 21.23
1886 : 189 Mean :1.327e+09 Mean : 17.65
1887 : 189 3rd Qu.:3.582e+07 3rd Qu.: 25.61
1888 : 189 Max. :3.780e+11 Max. : 30.74
(Other):36303
cat("Number of outliers: ", length(boxplot(joinedData_imputed$emission)$out))
Number of outliers: 7697

We find a very odd boxplot, this is because most of the data contains 0 as the emission rate. We even get median = 0. From the boxplot we can see that there are a huge number of outliers, this is because in our data, a larger number of entries have the emission value as 0, which is skewing our data towards 0. The zeroes could be entered to represent missing values, or the actual value of zero emission. The source was unclear about the nature of 0 value in emission column. We consider the zeros to represent 0 emission for the purpose of this assignment.
Examining the percentage of zero entries in emission:
joinedData_imputed %>% filter(emission == 0) %>% nrow()
[1] 21609
joinedData_imputed %>% nrow()
[1] 37437
cat("Percentage of rows with zero emission:", 21609/37437 *100)
Percentage of rows with zero emission: 57.72097
After analyzing the data closely, we find that for the data before 1950s, most countries had 0 emission rate. Also in the later years there are few countries such as USA, India and China that have exponentially huge values of emissions.
Since the application of our dataset could be to find the correlation (if any) between the rise in temperature and the emission rate, by removing the outliers we find (extremely large emission values), we may lose valuable results/insights of the impacts of the large emission by these handful of countries on the rise of global temperature.
As the dataset gives emission values beginning from the year 1751, the reason for the zeros could be the unavailability of data for such early years. Instead of excluding or imputing all the 0 values in our dataset, we choose to filter the data and consider the data from the year 1955 onwards, as almost all countries have 0 emission rates until the 1950s.
Filtering joinedData_imputed after the year 1955
dataAfter1955<- joinedData_imputed %>% filter(year >= 1955)
dataAfter1955 %>% filter(emission == 0) %>% nrow()
[1] 531
dataAfter1955 %>% nrow()
[1] 11151
cat("Percentage of rows with zero emission:", 531/11151 *100)
Percentage of rows with zero emission: 4.761905
Here we can see that the percentage of zeros has reduced drastically in the data.
Checking for outliers in the filtered data:
cat("Number of outliers in `dataAfter1955`: ", length(boxplot(dataAfter1955$emission)$out))
Number of outliers in `dataAfter1955`: 1768

NOTE: We first tranform our data and then deal with the outliers in Scan III after Transform.
Scan III
2. Examining with outliers for emission
length(outliers[outliers != 0])
[1] 6
length(outliers[outliers == 0])
[1] 531
The majority of the outliers 531/537 * 100 = 98.8826816% have the value zero. Zero corresponds to 1 (e^0 = 1) in the non-transformed data. Hence, by removing the outliers, we are effectively removing the rows with 0 as their emission, which is what we were aiming for.
3. Removing with outliers from emission
dataAfter1955 <- dataAfter1955[-(which(dataAfter1955$emission %in% outliers)),]
boxplot(dataAfter1955$emission)

4. Examine AverageTemperature variable for outliers:
Now we look at the second numeric variable - AverageTemperature. We consider the dataAfter1955 data frame iteslf.
cat("Number of outliers: " , length(boxplot(dataAfter1955$AverageTemperature)$out))
Number of outliers: 118

length(dataAfter1955$AverageTemperature)
[1] 10614
cat("Percentage of outliers: ", 118/10614 * 100)
Percentage of outliers: 1.111739
5. Dealing with outliers in AverageTemperature:
We find that there are 118 outliers, which is roughly around 1.11%. These outliers correspond to countries that are exteremely cold places with temperatures that drop below -10°C. Therefore deleting them would not be safe way to handle these outliers. Hence we have opted for using the Capping method to bring these extreme values to the closest quantile.
Referred from: Module 6 Notes | Stackoverflow
Capping the outliers:
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
dataAfter1955$AverageTemperature <- dataAfter1955$AverageTemperature %>% cap()
Boxplot after capping the outliers:
boxplot(dataAfter1955$AverageTemperature)

Examine variables
We finally show the scatter plot of emisison and temperature. We apply exp() on the emission variable to get back the untransformed data to plot against Temperature.
dataAfter1955 %>% plot(exp(emission) ~ AverageTemperature, data = ., ylab="Emission", xlab="Temperature", main="Emission by Temperature")

---
title: "MATH2349 Semester 2, 2019"
author:
- Maaz Shaikh - S3795603
- Vaishnavi Narayana Naik - S3797442
subtitle: Assignment 3
output:
  html_notebook: default
---

## Required packages 
```{r, echo = TRUE, warnings = FALSE, message = FALSE}
library("dplyr")
library("tidyr")
library("readr")
```

<hr style="border: 0.5px solid Silver;  align:center">

## Executive Summary 

The aim of this report is to preprocess the combined data of emission rates and surface temperature of countries around the world across couple of centuries which could be used to find (if any) correlation between the two variables.

In the `Tidy & Manipulate` sections, we first tidy the tables separately by using `tidyr` functions like gather, separate and drop unwanted columns (variables). We make sure both the tables are in the same format so that we can join them correctly. After merging the tables, we carry out any type conversions that may be required. We then manipulate the data using mutate function to add new variables.

In the `SCAN` sections, we scan for missing values after we are done tidying and manipulating the data. We handle the missing values with the appropriate methods, we impute the missing values with the mean values of the remaining data. Scanning for outliers is the next part of the preprocessing and we find out that we require to `Transform` the data to make it more appropriate for our evaluation of the correlation between the emission rates and the surface temperature. After carrying out the transformation we again scan for outliers and deal with them with suitable methods.

Completing all these steps makes the data ready for evaluation and we end the report by displaying a scatter plot showing the relation of the emission rates and surface temperatures.

<hr style="border: 0.5px solid Silver;  align:center">

## Data 
### Source 1: [Climate Change: Earth Surface Temperature Data](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data)

This source has repackaged the data from compilation put together by the **[Berkeley Earth](http://berkeleyearth.org/about/)**, which is affiliated with Lawrence Berkeley National Laboratory. The Berkeley Earth Surface Temperature Study combines 1.6 billion temperature reports from 16 pre-existing archives.
The raw data comes from the [Berkeley Earth data page](http://berkeleyearth.org/data/).

This source has multiple files that give temperature data for the world. **We select the CSV file - `GlobalLandTemperaturesByCountry.csv` as our first data source.**

The `GlobalLandTemperaturesByCountry` data set consists of the following variables:

  * `dt [date]`: **Date** (yyyy-mm-dd) when the temperture was recorded, starts from the year 1743
  
  * `AverageTemperature [numeric]`: **Average Land Temperature** in Celsius 
  
  * `AverageTemperatureUncertainty [numeric]`: the **95% confidence interval** around the average temperature
  
  * `Country [character]`: name of the country for which the temperature is recorded

##### Dataset Download Link: **[GlobalLandTemperaturesByCountry.csv ](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data/download/eM0xChhsTAba4VhLeD2K%2Fversions%2FfgruQrMFEPB6Vq19bCe2%2Ffiles%2FGlobalLandTemperaturesByCountry.csv?datasetVersionNumber=1)**

***
 
### Source 2: [CO2 and GHG emission data](https://www.kaggle.com/srikantsahu/co2-and-ghg-emission-data)

This dataset contains emission data of Green House Gases (in tonnes) of different countries from year 1750 - 2017 taken from https://ourworldindata.org/co2-and-other-greenhouse-gas-emissions


**Our second data source is the CSV file - `emission data.csv`**

The `emission data` data set consists of the following variables:

  * `Country [character]`: name of the country for which the emission data is recorded

  * `1751, 1752,..., 2018, 2019 [numeric]`: the **year** that data was collected. Gives the emission data recorded in tonnes.

##### Dataset Download Link: **[emission data.csv](https://www.kaggle.com/srikantsahu/co2-and-ghg-emission-data/download/sWqYUoyXGfrjWDy6Q3Yj%2Fversions%2FghykD5vDrBmp0fT16rbY%2Ffiles%2Femission%20data.csv?datasetVersionNumber=1)**

***

### Reading the Data:

#### Importing the data file `emission data.csv`:
```{r, echo = TRUE, warnings = FALSE, message = FALSE}
emissionData <- read_csv("emission data.csv")
```
<br>

The first 6 observations of the `emission data` data set are:
```{r, echo = TRUE, warnings = FALSE}
head(emissionData)
```

<br>

#### Importing the data file `GlobalLandTemperaturesByCountry.csv`:
```{r, echo = TRUE, warnings = FALSE, message = FALSE}
temperatureData <- read_csv("GlobalLandTemperaturesByCountry.csv")
```

<br>

The first 6 observations of the `GlobalLandTemperaturesByCountry` data set are:
```{r, echo = TRUE, warnings = FALSE}
head(temperatureData)
```
<br>

#### NOTE: We do not merge the datasets in this section as the datasets need to be tidied before we can proceed to merge them. The data sets are merged in the `Merging Datasets` section after `Tidy & Manipulate Data I`.

<hr style="border: 0.5px solid Silver;  align:center">

## Understand

### 1. Checking the dimensions and structure of the data frame `emissionData`:
```{r, echo = TRUE, warnings = FALSE}
dim(emissionData)
str(emissionData[,1:8])
```
The columns `9-268` all give the emission reading for a year and are of `numeric` class. These aren't shown above to improve readability.

***

### 2. Checking the attributes of `emissionData`:
*  **Checking the data type of the `Country` variable:**
```{r, echo = TRUE, warnings = FALSE}
class(emissionData$Country)
typeof(emissionData$Country)
```
The variable `Country` gives the name of the country. The variable is of type `character`. No type conversions are needed.

<br>

*  **Checking the data type of the `1751` variable:**
```{r, echo = TRUE, warnings = FALSE}
class(emissionData$`1751`)
typeof(emissionData$`1751`)
```
The variable `1751` gives emission readings for the year 1751. The variable is of type `double`. The emission readings are continuous numeric values and cannot be factored.

#### The remaining variables `1752 - 2019` are all numeric and give the emission values. We do not individually check the type of all the variables due to large number of variables. 

***

### 3. Checking the dimensions and structure of the data frame `temperatureData`:
```{r, echo = TRUE, warnings = FALSE}
dim(temperatureData)
str(temperatureData, give.attr = FALSE)
```

***

### 4. Checking the attributes of `temperatureData`:
*  **Checking the data type of the `dt` variable:**
```{r, echo = TRUE, warnings = FALSE}
class(temperatureData$dt)
typeof(temperatureData$dt)
```
The variable `dt` is of class `Date`. `dt` gives the date when the temperature was recorded. The `Date` type is correct and no type conversion is made.

<br>

*  **Checking the data type of the `AverageTemperature` variable:**
```{r, echo = TRUE, warnings = FALSE}
class(temperatureData$AverageTemperature)
typeof(temperatureData$AverageTemperature)
```
The variable `AverageTemperature` gives average temperature value. The variable is `numeric`. The temperature readings are continuous numeric values and cannot be factored. 

<br>

*  **Checking the data type of the `AverageTemperatureUncertainty` variable:**
```{r, echo = TRUE, warnings = FALSE}
class(temperatureData$AverageTemperatureUncertainty)
typeof(temperatureData$AverageTemperatureUncertainty)
```
The variable `AverageTemperatureUncertainty` is `numeric`. `AverageTemperatureUncertainty` has continuous numeric values and cannot be factored. The type `numeric` is correct. No changes are made. 

<br>

*  **Checking the data type of the `Country` variable:**
```{r, echo = TRUE, warnings = FALSE}
class(temperatureData$Country)
typeof(temperatureData$Country)
```
The variable `Country` gives the name of the country. The variable is of type `character`. No type conversions are needed.

<br>

#### NOTE: As you can see we do not have any variables that we can factor. We factor `year` after we have gathered it in the `Merging Datasets` section after we merge the two datasets. We also create a new variable and factor it in section `Tidy & Manipulate Data II`. 
<hr style="border: 0.5px solid Silver;  align:center">

##	Tidy & Manipulate Data I 

### 1. Gather `emissionData`:
The columns `2-268` give the emission reading measured for the years `1751-2017`. The columns `2-268` contain year data in the column names, these column names are gathered to form a new variable `year`.

`-c(Country)` denotes that we gather all the columns except the `Country` column.
```{r}
emissionDataTidy <- emissionData %>% gather(-c(Country),key = "year", value = "emission")
emissionDataTidy
```

<br>

##### __Dimensions of `emissionData` after `gather()`:__
```{r}
dim(emissionDataTidy)
```

***

### 2. Separate `dt` to extract `year` from the date in `temperatureData`:
```{r}
temperatureDataTidy <- temperatureData %>% separate(dt, into = c("year", "month", "day"), sep = '-')
temperatureDataTidy
```

***

### 3. Drop `month`, `day` and `AverageTemperatureUncertainty` columns from `temperatureData`:
We only need the `year` column, so we drop the `month` and `day` columns. The column `AverageTemperatureUncertainty` gives the 95% confidence interval around the average temperature. As we do not need this data for our evaluation, we drop this column too.
```{r}
temperatureDataTidy <- temperatureDataTidy %>% select(-month, -day, -AverageTemperatureUncertainty)
temperatureDataTidy
```

<br>

##### __Dimensions of `temperatureDataTidy`:__
```{r}
dim(emissionDataTidy)
```

***

### 4. Group `temperatureData` by `year` and `Country`:
The `temperatureDataTidy` data set contains multiple rows for each country each corresponding to the same year. This is beacuse `temperatureData` had entries for different months of the same year. Now since we dropeed the month and kept only the year, we see multiple rows for the same year. To tidy the data, the data is grouped by `year` & `Country` to obtain the new average temperature value `AverageTemperature` for each year for each country.
```{r}
temperatureDataTidy <- temperatureDataTidy %>% group_by(year, Country) %>% summarise(AverageTemperature = mean(AverageTemperature, na.rm = TRUE))
temperatureDataTidy
```

<br>

##### __Dimensions of `temperatureDataTidy` after grouping:__
```{r}
dim(emissionDataTidy)
```

<hr style="border: 0.5px solid Silver;  align:center">

## Merging Datasets
We merge the two datasets - `emissionDataTidy` and `temperatureDataTidy`. We use the pair of variables `Country` and `year` as the key to join the tables using `inner_join()`.
```{r}
emissionTemperatureJoined <- inner_join(emissionDataTidy, temperatureDataTidy, by = c("Country","year"))
emissionTemperatureJoined
```

<br>

##### __Dimensions of `emissionTemperatureJoined`:__
```{r}
dim(emissionTemperatureJoined)
```

***

### Factoring `year` variable

#### Examining the number of unique values for `year`
```{r, results='hold'}
cat("Number of unique values for `year`:", length(unique(emissionTemperatureJoined$year)),
    "\nNumber of rows in the data frame `emissionTemperatureJoined`:", nrow(emissionTemperatureJoined))

```
The number of unique values for `year` is much smaller than the size of the data frame `emissionTemperatureJoined`. Therefore we proceed to factor `year`.

<br>

#### We factor the `year` variable and order it.
```{r}
emissionTemperatureJoined$year <- factor(emissionTemperatureJoined$year, levels= unique(emissionTemperatureJoined$year), ordered = TRUE)
str(emissionTemperatureJoined$year)
```

<hr style="border: 0.5px solid Silver;  align:center">

##	Tidy & Manipulate Data II 

In this section we create a new variable `thanMeanEmission`, this variable will tell us if the emission value is higher or lower than the Mean Emission of all the countries for the corresponding year.

### 1. Calculating the mean emission value `meanEmissionByYear` for each `year`:
```{r}
meanEmissionByYearData <- emissionTemperatureJoined %>% 
  group_by(year) %>% 
  summarise(meanEmissionByYear = mean(emission, na.rm = TRUE))

meanEmissionByYearData
```

***

### 2. Add the `meanEmissionByYear` variable to `emissionTemperatureJoined` joining by `year`:
The new data frame is named `joinedData`.
```{r}
joinedData <- left_join(emissionTemperatureJoined, meanEmissionByYearData,"year")
joinedData
```

***

### 3. Use `meanEmissionByYear` to create a new variable `thanMeanEmission` using `mutate()`:

Code referred from: [Link](https://rstudio-pubs-static.s3.amazonaws.com/116317_e6922e81e72e4e3f83995485ce686c14.html#/5
)
```{r}
joinedData <- mutate(joinedData, thanMeanEmission = ifelse(emission > meanEmissionByYear, "Higher", "Lower"))
joinedData
```

***

### 4. Factoring `thanMeanEmission` :
```{r}
joinedData$thanMeanEmission <- factor(joinedData$thanMeanEmission, 
                                      levels = c("Lower", "Higher"), 
                                      ordered = TRUE)
str(joinedData$thanMeanEmission)
```

***

### 5. Dropping `meanEmissionByYear`:
We can see that the column `meanEmissionByYear` has the same value for multiple rows. Having repeated values isn't desirable, so we drop this column. We just added this column to create the column `thanMeanEmission`. 

Now that `meanEmissionByYear` has served its purpose, we drop it.

```{r}
joinedData <- joinedData %>% select(-meanEmissionByYear)
str(joinedData)
```

If we still do however need the mean emission for a year, we can obtain it from the data frame `meanEmissionByYearData`. This data frame has just 263 rows compared to the 37437 rows the variable `meanEmissionByYear` had in `joinedData`.

```{r}
str(meanEmissionByYearData)
```

<hr style="border: 0.5px solid Silver;  align:center">

##	Scan I 

#### In this section we detect and deal with missing values.
### 1. Checking for Missing Values:
```{r}
colSums(is.na(joinedData))
```
Only `AverageTemperature` column has missing values. We examine the missing values in the next step.

***

### 2. Examining the Missing Values:

```{r}
joinedData %>% filter(is.na(joinedData$AverageTemperature))
sum(is.nan(joinedData$AverageTemperature))
```
We can see that all the missing values are `NaN`, and they were generated as a result of calculation of `mean_temp` using `mean()` with `na.rm = true`. This resulted in an `NaN` value when all `AverageTemperature` values are `NA` for a particular country and year. 

***

### 3. Calculating mean temperature for each `Country` to impute the missing values:

We choose to impute the missing values with the mean value of `AverageTemperature`. However, replacing the missing values with the mean of the all the Average temperatures isn't wise. Instead, we impute the missing values with the mean Average temperatures of its corresponding country. 

For example, the missing value of Average temperature for `Bahamas` for the year `1761` will be imputed with the mean temperatures of `Bahamas` for the rest of the years (1751 - 2019), ignoring any missing values, if any.

<br>

##### Here we calculate the mean temperature for each country through all the years (1751 - 2019) and not just one year. 
```{r}
countryMeanTemp <- joinedData %>% 
  group_by(Country) %>% 
  summarise(meanTempByCountry = mean(AverageTemperature, na.rm = TRUE))
countryMeanTemp
```
```{r}
sum(is.na(countryMeanTemp))
```
The mean we calculated above doesn't have any missing values. Now we can be sure that we do not impute any missing values with `NA` or `NaN`.

***

### 4. Combine `countryMeanTemp` with `joinedData`:
Before we impute the missing values, we need to combine the mean we calculated - `countryMeanTemp` with `joinedData`.
```{r}
joinedData <- left_join(joinedData, countryMeanTemp, "Country")
joinedData
```
<br>

##### Checking the number of missing values before imputing them:
```{r}
colSums(is.na(joinedData))
```


***

### 5. Imputing missing values:
```{r}
joinedData_imputed <- mutate(joinedData, AverageTemperature = ifelse(is.na(AverageTemperature), 
                                                                     meanTempByCountry, 
                                                                     AverageTemperature))
```

<br>

##### Checking the number of missing values after imputing them:
```{r}
colSums(is.na(joinedData_imputed))
```
The output above shows us that all the missing values have been imputed successfully.

***

### 6. Dropping `meanTempByCountry`:
Just like `meanEmissionByYear`, **`meanTempByCountry`** also has same values for multiple rows. As having repeated values isn't desirable, so we drop this column too. We just added this column to impute the missing values in `AverageTemperature`. Now that `meanTempByCountry` has served its purpose, we drop it.

```{r}
joinedData_imputed <- joinedData_imputed %>% select(-meanTempByCountry)
str(joinedData_imputed)
```

If we still do however need the mean tempereature for a country, we can obtain it from the data frame `countryMeanTemp`. This data frame has just 189 rows compared to the 37437 rows the variable `meanTempByCountry` had in `joinedData_new_imputed`.

```{r}
str(countryMeanTemp)
```

<hr style="border: 0.5px solid Silver;  align:center">

##	Scan II

**Lets take a look at our dataset:**

```{r}
str(joinedData_imputed)
```

**We have only two numeric values, `emission` and `AverageTemperature`** 

***

### Examine `emission` variable for outliers:
```{r}
joinedData_imputed %>% summary()
cat("Number of outliers: ", length(boxplot(joinedData_imputed$emission)$out))
```

We find a very odd boxplot, this is because most of the data contains 0 as the emission rate. We even get median = `0`.
From the boxplot we can see that there are a huge number of outliers, this is because in our data, a larger number of entries have the emission value as `0`, which is skewing our data towards 0.
The zeroes could be entered to represent missing values, or the actual value of zero emission. The [source](https://www.kaggle.com/srikantsahu/co2-and-ghg-emission-data) was unclear about the nature of `0` value in `emission` column. We consider the zeros to represent 0 emission for the purpose of this assignment.

<br>

#### Examining the percentage of zero entries in `emission`:
```{r}
joinedData_imputed %>% filter(emission == 0) %>% nrow()
joinedData_imputed %>% nrow()
cat("Percentage of rows with zero emission:", 21609/37437 *100)
```

After analyzing the data closely, we find that for the data before 1950s, most countries had 0 emission rate. Also in the later years there are few countries such as USA, India and China that have exponentially huge values of emissions.

Since the application of our dataset could be to find the correlation (if any) between the rise in temperature and the emission rate, by removing the outliers we find (extremely large emission values), we may lose valuable results/insights of the impacts of the large emission by these handful of countries on the rise of global temperature.

As the dataset gives emission values beginning from the year 1751, the reason for the zeros could be the unavailability of data for such early years. Instead of excluding or imputing all the `0` values in our dataset, we choose to filter the data and consider the data from the year 1955 onwards, as almost all countries have 0 emission rates until the 1950s.

<br>

#### Filtering `joinedData_imputed` after the year `1955`
```{r}
dataAfter1955<- joinedData_imputed %>% filter(year >= 1955)
dataAfter1955 %>% filter(emission == 0) %>% nrow()
dataAfter1955 %>% nrow()
cat("Percentage of rows with zero emission:", 531/11151 *100)
```
Here we can see that the percentage of zeros has reduced drastically in the data.

<br>

#### Checking for outliers in the filtered data:
```{r}
cat("Number of outliers in `dataAfter1955`: ", length(boxplot(dataAfter1955$emission)$out))
```

<br>

#### NOTE: We first tranform our data and then deal with the outliers in `Scan III` after `Transform`.
<hr style="border: 0.5px solid Silver;  align:center">

##	Transform 

**In this section, to try and deal with the outliers we transform the data:**
```{r}
dataAfter1955 %>% summary()
```
<br>

#### Examine the histogram for `emission`:
```{r}
hist(dataAfter1955$emission, xlab = "Emission (in tonnes)", main = "Histogram for `Emission`")
length(dataAfter1955$emission)
```
From the histogram we can observe that the data is `strongly right skewed`. We transform the `emission` variable by applying a `Logarithmic Transformation`.

As we know, Log transformation cannot be applied to zero or negative values directly. In order to apply the Log transformation to `emission`, we add the non-negative constant `1` to all observations and then apply the tranformation. 
```{r}
length(dataAfter1955$emission)
dataAfter1955$emission <- dataAfter1955$emission + 1
dataAfter1955$emission <- log(dataAfter1955$emission)
hist(dataAfter1955$emission, 
     xlab = "log(emission)", 
     main = "Histogram for `Emission` after \nLog Transformation")
```
After the transformation, our data is more normally distrubuted.
<hr style="border: 0.5px solid Silver;  align:center">

## Scan III

### 1. Dealing with outliers from the transformed `emission` variable.
We find that there are `537` outliers, which is roughly around `4.81%`. Since the outliers form less than 5% of the the data, we can safely delete them. We examine the outliers before we delete them.

```{r}
outliers <- boxplot(dataAfter1955$emission)$out
cat("Number of outliers: ", length(outliers))
length(dataAfter1955$emission)
cat("Percentage of outliers: ", 537/11151 * 100)
```
***

### 2. Examining with outliers for `emission`
```{r}
length(outliers[outliers != 0])
length(outliers[outliers == 0])
```
The majority of the outliers 531/537 * 100 = ``r 531/537 * 100``% have the value zero. Zero corresponds to 1 (e^0 = 1) in the non-transformed data. Hence, by removing the outliers, we are effectively removing the rows with `0` as their emission, which is what we were aiming for.

***

### 3. Removing with outliers from `emission`
```{r}
dataAfter1955 <- dataAfter1955[-(which(dataAfter1955$emission %in% outliers)),]
boxplot(dataAfter1955$emission)
```

***

### 4. Examine `AverageTemperature` variable for outliers:
Now we look at the second numeric variable - `AverageTemperature`. We consider the `dataAfter1955` data frame iteslf.
```{r}
cat("Number of outliers: " , length(boxplot(dataAfter1955$AverageTemperature)$out))
length(dataAfter1955$AverageTemperature)

cat("Percentage of outliers: ", 118/10614 * 100)
```

***

### 5. Dealing with outliers in `AverageTemperature`:

We find that there are 118 outliers, which is roughly around `1.11%`. These outliers correspond to countries that are exteremely cold places with temperatures that drop below -10°C. Therefore deleting them would not be safe way to handle these outliers. Hence we have opted for using the **Capping method** to bring these extreme values to the closest quantile. 

Referred from: [Module 6 Notes](http://rare-phoenix-161610.appspot.com/secured/Module_06.html#capping_(aka_winsorising)) | [Stackoverflow](https://stackoverflow.com/questions/13339685/how-to-replace-outliers-with-the-5th-and-95th-percentile-values-in-r?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa)

#### Capping the outliers:
```{r}
cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}
dataAfter1955$AverageTemperature <- dataAfter1955$AverageTemperature %>% cap()
```

<br>

#### Boxplot after capping the outliers:
```{r}
boxplot(dataAfter1955$AverageTemperature) 
```

<hr style="border: 0.5px solid Silver;  align:center">

## Examine variables 

We finally show the scatter plot of emisison and temperature.
We apply `exp()` on the `emission` variable to get back the untransformed data to plot against Temperature.
```{r}
dataAfter1955 %>% plot(exp(emission) ~ AverageTemperature, data = ., ylab="Emission", xlab="Temperature", main="Emission by Temperature")
```
<br>
<br>
