MATH2349 Data Wrangling

Required packages

Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.

library(readr)
library(magrittr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

library(ggplot2)
library(outliers)
library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:readr':
## 
##     col_factor

Executive Summary

This preprocessing report prepares carbon dioxide emission data so that emissions can be classified by economy type. The preprocessing includes the following steps:

merge the two datasets into one,
subset to the required variables,
scan for missing values and remove them as required,
tidy the dataset by converting it from wide to long,
create new variables for manipulation,
check the data structure and variable types,
scan for outliers,
apply a data transformation to obtain an approximately normal distribution.

Data

The dataset named “CO2” includes carbon dioxide emitted from the burning of fossil fuels and the manufacture of cement in different countries from 1960 to 2016. The second dataset, named “economy”, includes classifications of the countries’ economies.

The CO2 dataset was collected by the Carbon Dioxide Information Analysis Center, Environmental Sciences Division, Oak Ridge National Laboratory, Tennessee, United States. The economy dataset is part of the World Development Indicators developed by the World Bank’s Development Data Group. Both datasets were downloaded from https://data.worldbank.org/indicator/EN.ATM.CO2E.KT.

The datasets are merged on the common variable “Country Code”: the economy dataset is left-joined to the CO2 dataset, and the merged dataset is named “carbon”. This lets us examine each country’s economy classification alongside its carbon dioxide emissions.

The carbon dataset has 70 variables. In this report, we focus on carbon emissions over the two decades from 1997 to 2016 and on the income group of each country. The carbon dataset is subset to the required variables, which are described below:

Country Name : Country Name

1997 - 2016: Carbon dioxide (CO2) emissions in kt (kilotons) for each year (1997 - 2016)

IncomeGroup : Classification of country economy according to World Bank definition

Country Name and IncomeGroup are renamed to Country and Economy, respectively.

CO2 <- read_csv("dataset research/API_EN.ATM.CO2E.KT_DS2_en_csv_v2_1345584/API_EN.ATM.CO2E.KT_DS2_en_csv_v2_1345584.csv", 
    skip = 3)

## Warning: Missing column names filled in: 'X65' [65]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `Indicator Name` = col_character(),
##   `Indicator Code` = col_character(),
##   `2017` = col_logical(),
##   `2018` = col_logical(),
##   `2019` = col_logical(),
##   X65 = col_logical()
## )

## See spec(...) for full column specifications.

economy <- read_csv("dataset research/API_EN.ATM.CO2E.KT_DS2_en_csv_v2_1345584/Metadata_Country_API_EN.ATM.CO2E.KT_DS2_en_csv_v2_1345584.csv")

## Warning: Missing column names filled in: 'X6' [6]

## Parsed with column specification:
## cols(
##   `Country Code` = col_character(),
##   Region = col_character(),
##   IncomeGroup = col_character(),
##   SpecialNotes = col_character(),
##   TableName = col_character(),
##   X6 = col_logical()
## )

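# left join the economy classification onto the CO2 data by country code
carbon <- CO2 %>% left_join(economy, by = "Country Code")

# keep the country name, the years 1997 to 2016 and the income group
carbon_sub <- carbon %>%
  select("Country Name", "1997":"2016", "IncomeGroup")

# rename Country Name and IncomeGroup
colnames(carbon_sub)[c(1, 22)] <- c("Country", "Economy")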

head(carbon_sub)
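As a sanity check on a left join of this kind, one could confirm that the join keeps exactly one row per row of the left table and list any keys with no match. The sketch below uses small made-up tables (`co2_toy`, `economy_toy` and the code "XYZ" are hypothetical, not part of the downloaded data).

```r
library(dplyr)

# Toy stand-ins for the CO2 and economy tables (all values are made up)
co2_toy <- tibble(`Country Code` = c("AUS", "NZL", "XYZ"),
                  `2016` = c(10, 20, 300))
economy_toy <- tibble(`Country Code` = c("AUS", "NZL"),
                      IncomeGroup = c("High income", "High income"))

# Left join keeps every row of co2_toy; unmatched keys get NA in IncomeGroup
joined <- left_join(co2_toy, economy_toy, by = "Country Code")

# Rows of the left table that have no match in the right table
unmatched <- anti_join(co2_toy, economy_toy, by = "Country Code")

nrow(joined) == nrow(co2_toy)
unmatched
```

If the row counts differ, the right-hand table probably contains duplicate keys.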

Scan I

Scanning the data shows that some values in the Country column are not actual countries and have missing values in the Economy column. Carbon emission values are also missing for some years. Because carbon emissions may follow a trend over the years, filling the gaps with the average carbon emission value may not be appropriate. In this analysis we therefore include only countries with no missing values, to avoid bias. After subsetting, we confirm that no NA values remain. We have 197 countries after the clean-up.

# remove all rows without an economy category (non-country rows)
carbon_sub <- carbon_sub[!is.na(carbon_sub$Economy), ]

# keep countries with complete cases only, to avoid bias
carbon_sub <- carbon_sub[complete.cases(carbon_sub), ]

# check whether any NA values remain
which(is.na(carbon_sub))

## integer(0)

#check dimension of the subset
dim(carbon_sub)

## [1] 197  22
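Before dropping rows, it can help to see where the missing values sit. The following is a minimal sketch on a made-up data frame (`toy` is hypothetical) that counts NAs per column and then applies the same complete-case filter used above.

```r
# Toy data frame with the same kind of layout (all values are made up)
toy <- data.frame(Country = c("A", "B", "C"),
                  `1997` = c(1, NA, 3),
                  `1998` = c(4, 5, NA),
                  Economy = c("High income", "Low income", NA),
                  check.names = FALSE)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Keep only rows with no missing values, then confirm none remain
complete_toy <- toy[complete.cases(toy), ]
anyNA(complete_toy)
```

`anyNA()` returns a single logical value, which is easier to read than an empty `which()` result.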

Tidy & Manipulate Data I

The dataset is not tidy because the CO2 emissions for the years 1997 to 2016 are spread across columns. We use the gather function to place all years in one column and the CO2 emissions in another, converting the data frame from wide to long. The result is named “carb_tidy”.

head(carbon_sub)

carb_tidy <- gather(carbon_sub, key = "year", value = "CO2_emission", 2:21)

head(carb_tidy)
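In tidyr 1.0.0 and later, `pivot_longer()` is the recommended replacement for `gather()`. The sketch below shows the equivalent reshaping on a made-up wide table (`wide_toy` is hypothetical). Note that `pivot_longer()` orders the output row by row, whereas `gather()` orders it column by column.

```r
library(tidyr)

# Toy wide table: one column per year (all values are made up)
wide_toy <- data.frame(Country = c("A", "B"),
                       `1997` = c(1, 2),
                       `1998` = c(3, 4),
                       Economy = c("High income", "Low income"),
                       check.names = FALSE)

# Move the year columns into a key column (year) and a value column
long_toy <- pivot_longer(wide_toy,
                         cols = `1997`:`1998`,
                         names_to = "year",
                         values_to = "CO2_emission")
long_toy
```

Selecting the year columns by name rather than by position (2:21) keeps the call valid if the column order changes.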

Tidy & Manipulate Data II

Because we analyse CO2 emissions by economy group, we drop the country variable and use the summarise function to create a new dataset, named “economy_grp”, with the following new variables:

avg_CO2: Average CO2 emission of the countries in each economy group in each year, rounded to the nearest kt

total_CO2: Total CO2 emission of each economy group in each year

no_country: The number of countries in each economy group

#create new df group by Economy
economy_grp<-carb_tidy %>% group_by(Economy,year) %>%
  summarise(avg_CO2=round(mean(CO2_emission),0),
            total_CO2=sum(CO2_emission),
            no_country=n())

## `summarise()` regrouping output by 'Economy' (override with `.groups` argument)

head(economy_grp)
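The regrouping message above can be avoided by stating the intended grouping of the result explicitly. A minimal sketch on made-up data (`long_toy` is hypothetical), assuming dplyr 1.0.0 or later where the `.groups` argument is available:

```r
library(dplyr)

# Toy long table (all values are made up)
long_toy <- tibble(Economy = c("High income", "High income",
                               "Low income", "Low income"),
                   year = c("1997", "1997", "1997", "1997"),
                   CO2_emission = c(10, 30, 1, 3))

# Same summary as above, but returning an ungrouped tibble
grp_toy <- long_toy %>%
  group_by(Economy, year) %>%
  summarise(avg_CO2 = round(mean(CO2_emission), 0),
            total_CO2 = sum(CO2_emission),
            no_country = n(),
            .groups = "drop")
grp_toy
```

An ungrouped result also prevents later verbs from silently operating per Economy group.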

Understand

We use the str function to check the data types and is.data.frame to confirm that “economy_grp” is a data frame. The Economy and year columns are character vectors and need to be converted to ordered factors. After factoring, the data types and levels are checked using the attributes function.

#check structure
str(economy_grp)

## tibble [80 × 5] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ Economy   : chr [1:80] "High income" "High income" "High income" "High income" ...
##  $ year      : chr [1:80] "1997" "1998" "1999" "2000" ...
##  $ avg_CO2   : num [1:80] 172138 170922 173433 179378 178555 ...
##  $ total_CO2 : num [1:80] 11705368 11622689 11793420 12197703 12141723 ...
##  $ no_country: int [1:80] 68 68 68 68 68 68 68 68 68 68 ...
##  - attr(*, "groups")= tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ Economy: chr [1:4] "High income" "Low income" "Lower middle income" "Upper middle income"
##   ..$ .rows  : list<int> [1:4] 
##   .. ..$ : int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ : int [1:20] 21 22 23 24 25 26 27 28 29 30 ...
##   .. ..$ : int [1:20] 41 42 43 44 45 46 47 48 49 50 ...
##   .. ..$ : int [1:20] 61 62 63 64 65 66 67 68 69 70 ...
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE

is.data.frame(economy_grp)

## [1] TRUE

# factorise economy_grp & year
economy_grp$Economy <- factor(economy_grp$Economy,
  levels = c("Low income", "Lower middle income", "Upper middle income", "High income"),
  labels = c("Low_income", "Low_mid_income", "Upp_mid_income", "High_income"),
  ordered = TRUE)

economy_grp$year <- factor(economy_grp$year,
  levels = c("1997", "1998", "1999", "2000", "2001", "2002", "2003",
             "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011",
             "2012", "2013", "2014", "2015", "2016"),
  ordered = TRUE)

head(economy_grp)

sapply(economy_grp,FUN = attributes)

## $Economy
## $Economy$levels
## [1] "Low_income"     "Low_mid_income" "Upp_mid_income" "High_income"   
## 
## $Economy$class
## [1] "ordered" "factor" 
## 
## 
## $year
## $year$levels
##  [1] "1997" "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005" "2006"
## [11] "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016"
## 
## $year$class
## [1] "ordered" "factor" 
## 
## 
## $avg_CO2
## NULL
## 
## $total_CO2
## NULL
## 
## $no_country
## NULL
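One practical effect of an ordered factor is that its values can be compared with relational operators. A small sketch on made-up vectors (`eco` and `years` are hypothetical names), which also shows a more compact way to write the year levels:

```r
# Ordered factor with the same levels and labels as above
eco <- factor(c("High income", "Low income", "Upper middle income"),
              levels = c("Low income", "Lower middle income",
                         "Upper middle income", "High income"),
              labels = c("Low_income", "Low_mid_income",
                         "Upp_mid_income", "High_income"),
              ordered = TRUE)

# Comparisons follow the level order, not alphabetical order
at_least_upper_mid <- eco >= "Upp_mid_income"

# as.character(1997:2016) generates the twenty year levels
years <- factor(c("1999", "1997"),
                levels = as.character(1997:2016),
                ordered = TRUE)
nlevels(years)
```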

Scan II

To check for outliers, we use both Tukey’s method and the z-score method.

Tukey’s method sets outlier fences, and outliers can be visualised using box plots. Box plots of total_CO2 grouped by year and by Economy, respectively, show no outliers in the total_CO2 variable.

The z-score method is also used, and it likewise finds no outliers in the total_CO2 variable: no absolute z-score exceeds 3.

ggplot(economy_grp, aes(x = year, y = total_CO2)) +
  geom_boxplot() +
  labs(title = "Box Plot of total CO2 in each year", x = "Year", y = "CO2 emission in kt (kiloton)") +
  theme_grey(base_size = 9) +
  scale_y_continuous(labels = comma)

ggplot(economy_grp, aes(x = Economy, y = total_CO2)) +
  geom_boxplot() +
  labs(title = "Box Plot of total CO2 in each Economy category", x = "Economy Category", y = "CO2 emission in kt (kiloton)") +
  theme_grey(base_size = 10) +
  scale_y_continuous(labels = comma)

z.scores <- economy_grp$total_CO2 %>%
  scores(type = "z")
z.scores %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.1758 -0.9298 -0.1501  0.0000  0.9082  1.7292

length(which(abs(z.scores) > 3))

## [1] 0
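The two rules can also be applied numerically in base R. The sketch below uses a made-up vector `x` (not the report's data) and assumes the usual 1.5 × IQR fences and a z-score cut-off of 3, matching the threshold used above.

```r
# Made-up values with one suspiciously large observation
x <- c(2, 4, 5, 5, 6, 7, 8, 40)

# Tukey's fences: 1.5 * IQR below Q1 and above Q3
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr
tukey_outliers <- x[x < lower_fence | x > upper_fence]

# z-scores: distance from the mean in standard deviations
z <- (x - mean(x)) / sd(x)
z_outliers <- x[abs(z) > 3]

tukey_outliers
z_outliers
```

On this small toy vector the two rules disagree: Tukey's fences flag 40, while no z-score exceeds 3, because a single extreme value inflates the standard deviation. This is one reason to apply both methods.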

Transform

The histogram of the total_CO2 variable is displayed using ggplot and shows a bimodal distribution. The data are transformed into “eco_trans” using the formula abs(total_CO2 - mean(total_CO2)), i.e. the absolute deviation from the mean, with the aim of obtaining an approximately normal distribution.

ggplot(economy_grp, aes(x = total_CO2)) +
  geom_histogram(color = "black", fill = "white", binwidth = 1000000) +
  scale_x_continuous(labels = comma) +
  labs(title = "Histogram of total CO2 emission", x = "CO2 emission in kt (kiloton)")

eco_trans <- abs(economy_grp$total_CO2 - mean(economy_grp$total_CO2))

ggplot(as.data.frame(eco_trans), aes(x = eco_trans)) +
  geom_histogram(color = "black", fill = "white", binwidth = 1000000) +
  scale_x_continuous(labels = comma) +
  labs(title = "Histogram of transformed total CO2 emission", x = "CO2 emission in kt (kiloton)")
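A histogram gives only a visual impression, so one could complement it with a formal check. The following is a minimal sketch on simulated bimodal data (the vector `x`, its parameters and the seed are assumptions for illustration, not the report's data) that applies the same absolute-deviation transformation and runs a Shapiro-Wilk test before and after.

```r
set.seed(1)

# Simulated bimodal sample: two well-separated clusters (made-up parameters)
x <- c(rnorm(40, mean = 2, sd = 0.5), rnorm(40, mean = 10, sd = 0.5))

# Same transformation as above: absolute deviation from the mean
x_trans <- abs(x - mean(x))

# Shapiro-Wilk test: a small p-value suggests a departure from normality
p_before <- shapiro.test(x)$p.value
p_after <- shapiro.test(x_trans)$p.value

c(before = p_before, after = p_after)
```

A normal Q-Q plot (`qqnorm()` followed by `qqline()`) on the transformed values would serve the same purpose graphically.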

References:

Data.worldbank.org. n.d. CO2 Emissions (Kt) | Data. [online] Available at: https://data.worldbank.org/indicator/EN.ATM.CO2E.KT [Accessed 5 October 2020].

Datatopics.worldbank.org. 2020. WDI - The World By Income And Region. [online] Available at: http://datatopics.worldbank.org/world-development-indicators/the-world-by-income-and-region.html [Accessed 5 October 2020].