Required packages

library(knitr)
library(ggplot2)
library(readr)
library(stringr)
library(dplyr)
library(tidyr)
library(validate)
library(outliers)
library(deductive)
library(lubridate)

Executive Summary

The aim of this assignment is to demostrate knowledge and understanding of data preprocessing. Data pre-processing is the process of cleaning and organising raw data to make it ready for analysis. The datasets used in this assignment concern countries’ carbon emissions and two electricity production outputs namely, output from renewable sources and output from fossil sources. They were obtained from World Bank website. The process starts with understanding the features and variables. It was noted that three datasets did not satisfy principles of tidy data (Wickam 2014), some column headers were variables. These datasets were converted to tidy formats. In three datasets, a new variable was generated that would identify each row and was used to combine the data sets. The four datasets was merged based on existing variables. The data was scanned for missing observations, inconsistencies, and obvious errors. The years where no data was recorded on one or more variables have been removed. There was one inconsistency, where a negative observation of the amount of carbon emissions, this was converting to a positive number. It was found that one variable was strongly skewed and had outliers, log transformation was applied to the variable and skewness and outliers were reduced.

Data

All the datasets used has been obtained from World bank. Following is the description and links of the datasets: -Renewable electricity output by country-The amount of electricity generated from renewable sources is annually as a percentage of the total electricity output in the countries listed.
Link: http://api.worldbank.org/v2/en/indicator/EG.ELC.RNEW.ZS?downloadformat=csv -Electricity produced from gas, oil, and coal- contains amount of electricity generated from gas, oil and coal sources as percentage of total electricity output in country per year.
Link: http://api.worldbank.org/v2/en/indicator/EG.ELC.FOSL.ZS?downloadformat=csv -Carbon emission data- contains metric tonnes per capita of carbon gas emitted in country each year.
Link: http://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=csv -Countries’ Income groups and Regions metadata- The data has World Bank income categories and geographical location of countries.
Link: http://api.worldbank.or.g/v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=csv Columns of the datasets are as follows:

Renewable electricity output by country Country Name: has the names of countries
Country Code: contains ISO codes for each country
Indicator Name: has the name of indicator which is Renewable electricity output as a % of total electricity output
Indicator Code: world Bank code for the indicator
years: The data has a column for each year since 1960 to 2019

Electricity production from gas, oil and coal sources
Country Name: contains names of countries
Country Code: contains ISO codes for each country
Indicator Name: has the name of indicator which in this case is Electricity production from oil, gas and coal sources as % of total
Indicator Code: world Bank code for the indicator
years: The data has a column for each year since 1960 to 2019

Carbon emission data world bank
Country Name: contains names of countries
Country Code: contains ISO codes for each country
Indicator Name: has the name of indicator which is CO2 emissions in metric tons per capita
Indicator Code: world Bank code for the indicator
years: The data has a column for each year since 1960 to 2019

Countries’ Income groups and Regions metadata Country Code: contains ISO alpha-3 codes for each country
Region: geographical region for each country
IncomeGroup: country’s income categories according to world Bank
SpecialNotes: contains notes about the country or region
TableName: contains full names of the countries

#Import data

carb_em<-read_csv("API_EN.ATM.CO2E.PC_DS2_en_csv_v2_1345311.csv",skip=4) #carbon emmission

# % of electricityproduced from fossil materials
fossil_elec<-read_csv("API_EG.ELC.FOSL.ZS_DS2_en_csv_v2_1349749.csv",skip=4)

# % of electricity produced from renewablesources
renew_elec<-read_csv("API_EG.ELC.RNEW.ZS_DS2_en_csv_v2_1345509.csv",skip=4)  

#countries income classification
countries<-read_csv("Metadata_Country_API_EN.ATM.CO2E.PC_DS2_en_csv_v2_1345311.csv")

Understand

We check attributes of the dataset and the variables.

Carbon emission data

To know the object type:

class(carb_em)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

To see dimensions of dataframe:

dim(carb_em) 
## [1] 264  65

The dataframe has 264 rows and 65 columns.

Overview of the first few observations in each colums and the datatypes:

head(carb_em)

The data set has variables of character, numeric and logical observations.

The dataframe is in wide format and should be converts to long format.

There is a blank column without a name (last column).

###Fossil electricity output To know the object type:

class(fossil_elec)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

To see dimensions of dataframe:

dim(fossil_elec) 
## [1] 264  65

The dataframe has 264 rows and 65 columns.

Overview of the first few observations in each colums and the datatypes:

head(fossil_elec)

The dataframe has character, numeric and logical observations.

This dataframe also is messy and should be converts to long format.

Some columns have NA observations.

A blank column without a name. R has named the column as X65.

Repeating the process for the renewable electricity output data The object type

class(renew_elec)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

Data dimensions:

dim(renew_elec) 
## [1] 264  65

The dataframe has 264 rows and 65 columns.

overview of the first few observations in each column and the datatypes:

head(renew_elec)

The data is messy.

Many columns have NA observations.

A blank column with no name.

###Countries’ income and region data.

Object type:

class(countries)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

Dataframe dimensions:

dim(countries) 
## [1] 263   6

We note that the dataframe has 263 rows and 6 columns.

Overview of the first few observations in each column and the datatypes:

head(countries)

all the columns have character as type, except the last one is blank and has no name.

###Understanding Data Summary

To get overview of the data, the following functions have been used: Class() - shows the the object types. All dataset objects are dataframes/tibbles.
dim()- shows the dataframe dimensions -number of rows and columns.
head()

We can see from the results that some columns have missing data, as well as columns with missing column names from the data. The data types in the dataframes are, numerical, characters and logical.

Other functions could be used, glimpse, str(), attributes() but their results woulde be similar or close to what we have obtained.

In the countries dataframe, we will convert IncomeGroup variable to an ordered factor at a later section.

Tidy & Manipulate Data I

The output of Carbon emission, fossil electricity, and renewable electricity datasets are messy. They do not satisfy the principles of tidy datasets (Wickham 2014) in that a number of columns are made up of a variable. We have columns of years, thus the data is in wide format. We can correct this by putting the years in a single column (making column a variable).

Carbon emission dataset

carb_em <- carb_em %>% 
  gather(key="Year", value ="CO2_metric_tons_per_capita", 5:65)

since Variable “X65” is undefined and has no data, it will be removed.

the “Indicator Name” and “Indicator Code” columns are not important for analysis, therefore, they can be removed.

#remove X65
carb_em=carb_em[!carb_em$Year=='X65',]
#remove unimportant columns
carb_em=select(carb_em,-c("Indicator Name" ,"Indicator Code"))

Viewing the resulting dataframe:

head(carb_em)

We remove duplicating records if any by running the following code:

carb_em=distinct(carb_em)

There were no duplicate records since the number of records remain unchanged.

###Fossil electricity dataset

# Fossil electricity output  data
fossil_elec <- fossil_elec %>% 
  gather(key="Year", value ="elec_producted_from_fossils", 5:65)

Variable “X65” can be removed, since it is undefined and has no data. the “Indicator Name” and “Indicator Code” columns are not important, thus, they can be removed.

#remove X65
fossil_elec=fossil_elec[!fossil_elec$Year=='X65',]
#remove unimportant columns
fossil_elec=select(fossil_elec,-c("Indicator Name" ,"Indicator Code"))

Viewing the resulting dataframe:

head(fossil_elec)

We remove duplicating records if any:

fossil_elec=distinct(fossil_elec)

There are no duplicate records.

###Renewable electricity dataset

# tidy up the dataset
renew_elec <- renew_elec %>% 
  gather(key="Year", value ="elec_producted_from_renewables", 5:65)

Dealing with “X65” as above:

#remove X65
renew_elec=renew_elec[!renew_elec$Year=='X65',]

Removing columns “Indicator Name” and “Indicator Code”:

#remove unimportant columns
renew_elec=select(renew_elec,-c("Indicator Name" ,"Indicator Code"))

Viewing the resulting dataframe:

head(renew_elec)

We remove duplicating records if any:

renew_elec=distinct(renew_elec)

There are no duplicate records.

###Countries’ income and region dataset Countries classification data is tidy but we can remove “SpecialNotes” and “X6” columns.

countries=select(countries,-c("SpecialNotes" ,"X6"))

The resulting dataframe:

head(countries)

Tidy & Manipulate Data II

We have created a new variable by joining country code and year in carbon emission, fossil electricity and renewable electricity datasets. since this variable will be unique for each row, we can use it to merge the datasets.

#Carbon emmission data
carb_em=carb_em %>%
  unite("country_year", "Country Code" ,  "Year" , sep="_",remove = FALSE)
#fossil production data
fossil_elec =fossil_elec %>%
  unite("country_year", "Country Code" ,  "Year" , sep="_",remove = FALSE)
#renewable production data
renew_elec =renew_elec %>%
  unite("country_year", "Country Code" ,  "Year" , sep="_",remove = FALSE)

We merge these three datasets:

carb_em_elec_data=merge(carb_em,fossil_elec,by.x = "country_year",by.y = "country_year")
carb_em_elec_data=merge(carb_em_elec_data,renew_elec,by.x = "country_year",by.y = "country_year")

The number of countries in the ‘countries’ dataframe is 263 while the other datasets have 264, now we check which country name is missing in the ‘countries’ dataset.

setdiff(carb_em_elec_data$`Country Code.x`,countries$`Country Code`)
## [1] "INX"

From the data country code “INX” is not defined and has no data so we can ignore it.

Removing redundant columns:

carb_em_elec_data=select(carb_em_elec_data,-c("Year.x","Year.y" ,"Country Name.x",
                                                "Country Name.y", "Country Code.x","Country Code.y"))

We add the countries classification data:

carb_em_elec_data=merge(countries,carb_em_elec_data, by.x = "Country Code",by.y = "Country Code")

We remove redundant column created as a result of merging:

carb_em_elec_data=select(carb_em_elec_data,-c("TableName"))

Viewing the resulting dataframe:

head(carb_em_elec_data)
#arranging the columns
carb_em_elec_data=carb_em_elec_data[,c("country_year","Country Code","Country Name","Region","IncomeGroup",                                        "Year","CO2_metric_tons_per_capita","elec_producted_from_fossils",
                                         "elec_producted_from_renewables")]

Converting “IncomeGroup” variable from character to a factor and ordering it:

carb_em_elec_data$IncomeGroup<- factor(carb_em_elec_data$IncomeGroup,levels=c("Low income",
                                                                                "Lower middle income",
                                                                                "Upper middle income",
                                                                                "High income"),
                                         labels =c("Low","Lower middle","Upper middle","High"),
                                         ordered=TRUE)

Scan I

Check that the numerical variables are greater than zero as expected:

chk <- check_that(carb_em_elec_data,CO2_metric_tons_per_capita>=0.0,
                    elec_producted_from_fossils>=0.0,elec_producted_from_renewables >=0.0 )

summary(chk)
barplot(chk,main="NA and Positive values Checks")

One observation in carbon emitted amount variable is negative. We will assume this is a typing error and convert the number to positive. All other numerical variables are equal to or greater than zero or NA.

#convert the variable to positive

carb_em_elec_data$CO2_metric_tons_per_capita<-abs(carb_em_elec_data$CO2_metric_tons_per_capita)

A check for nan observations

sapply(carb_em_elec_data, function(x) sum(is.nan(x)))
##                   country_year                   Country Code 
##                              0                              0 
##                   Country Name                         Region 
##                              0                              0 
##                    IncomeGroup                           Year 
##                              0                              0 
##     CO2_metric_tons_per_capita    elec_producted_from_fossils 
##                              0                              0 
## elec_producted_from_renewables 
##                              0

No NaN observations.

Checking missing values:

sapply(carb_em_elec_data, function(x) sum(is.na(x)))
##                   country_year                   Country Code 
##                              0                              0 
##                   Country Name                         Region 
##                              0                           2760 
##                    IncomeGroup                           Year 
##                           2760                              0 
##     CO2_metric_tons_per_capita    elec_producted_from_fossils 
##                           2804                           7760 
## elec_producted_from_renewables 
##                           8942

There is many missing data in the dataset.

Through a quick eyecheck, we note that the countries which regions and income groups are missing are not real countries, some are regions. For example, Europe & Central Asia, OECD members, North America and so on. Since we want data of a specific countries, we can delete these region names from “Country Name” column.

carb_em_elec_data=carb_em_elec_data %>% drop_na(Region)#drop rows where region is na
sapply(carb_em_elec_data, function(x) sum(is.na(x))) #run scan again
##                   country_year                   Country Code 
##                              0                              0 
##                   Country Name                         Region 
##                              0                              0 
##                    IncomeGroup                           Year 
##                              0                              0 
##     CO2_metric_tons_per_capita    elec_producted_from_fossils 
##                           2633                           7008 
## elec_producted_from_renewables 
##                           7378

The numerical columns whose data is missing can be considered as unavailable data.

Check if this data is missing in a systematic way by summarising the data.

#We sum the data by year(ignoring NAs)

sum_carb=carb_em_elec_data%>%group_by(Year)%>%
  summarize(total_carb=sum(CO2_metric_tons_per_capita,na.rm=TRUE)) #We summerise by sum
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(data=sum_carb, aes(x=Year, y=total_carb)) + 
  geom_bar(stat="identity")+
  theme(axis.text.x = element_text(size=9, angle=90))

average_fossil=carb_em_elec_data%>%group_by(Year)%>%
  summarize(mean_fossil=mean(elec_producted_from_fossils,na.rm=TRUE)) #we summerise by mean 
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(data=average_fossil, aes(x=Year, y=mean_fossil)) + 
  geom_bar(stat="identity")+
  theme(axis.text.x = element_text(size=9, angle=90))

average_renew=carb_em_elec_data%>%group_by(Year)%>%
  summarize(mean_renew=mean(elec_producted_from_renewables,na.rm = TRUE))#summarise by mean
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(data=average_renew, aes(x=Year, y=mean_renew)) + 
  geom_bar(stat="identity")+
  theme(axis.text.x = element_text(size=9, angle=90))

Carbon emission has no data in years 2016 to 2019. Fossil electricity output lacks 2015 to 2019 data. Renewable electricity has no data between 1960 to 1989 and 2016 to 2019.

The missing values could be replaced with a mean, a median or extrapolate the data. But since the range over which the data is missing is large, large error can be added to the analysis.

It is possible to remove or drop blank years for respective missing years in each dataset and stick with data from years 1990 to 2015, so all datasets will have full data. We can chose to leave the data as it is and drop NAs when needed during the analysis.

If we replace NA observations with zero in the remaining data, it will imply that the observation was zero which is likely to be incorrect hence we assume that no record was taken and leave it as NA.

We filter the data to keep the years 1990 to 2015 for all variables.

carb_em_elec_data<-carb_em_elec_data[which(carb_em_elec_data$Year>=1990&carb_em_elec_data$Year<2016),]

Scan II

summary(carb_em_elec_data)
##  country_year       Country Code       Country Name          Region         
##  Length:5642        Length:5642        Length:5642        Length:5642       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##        IncomeGroup       Year           CO2_metric_tons_per_capita
##  Low         : 754   Length:5642        Min.   : 0.0085           
##  Lower middle:1300   Class :character   1st Qu.: 0.5938           
##  Upper middle:1456   Mode  :character   Median : 2.4067           
##  High        :2132                      Mean   : 4.7727           
##                                         3rd Qu.: 6.8465           
##                                         Max.   :70.0422           
##                                         NA's   :462               
##  elec_producted_from_fossils elec_producted_from_renewables
##  Min.   :  0.00              Min.   :  0.00                
##  1st Qu.: 30.65              1st Qu.:  0.00                
##  Median : 64.68              Median : 11.95                
##  Mean   : 59.36              Mean   : 28.62                
##  3rd Qu.: 92.69              3rd Qu.: 54.09                
##  Max.   :100.00              Max.   :100.00                
##  NA's   :2051

The maximum value of ‘CO2_metric_tons_per_capita’ variable is relatively very high compared to the mean and third quartile. This implies that observations in this variable may be skewed.

Scatter plot Carbon emission amount:

par(mfrow=c(2,2))
plot(carb_em_elec_data$CO2_metric_tons_per_capita, xaxt = "n",ylab="CO2 tons per capita",
       xlab="",pch=16,cex=0.5,main="Scatter plot" )
plot(carb_em_elec_data$elec_producted_from_fossils , xaxt = "n",ylab="Fossil Electricity",
       xlab="",pch=16,cex=0.5,main="Scatter plot" )
plot(carb_em_elec_data$elec_producted_from_renewables , xaxt = "n",ylab="Renewable Electricity",
       xlab="",pch=16,cex=0.5,main="Scatter plot" )

There are few large data points in the carbon emission data. Observations in other variables are even.

We create our own function to compute coefficient of skewness:

#function to compute coefficient of skewness
myskewness <-  function(x) {  #we call the function myskewness
  m3 <- mean((x-mean(x,na.rm = TRUE))^3,na.rm=TRUE)
  skew <- m3/(sd(x,na.rm=TRUE)^3)
  skew
}

Checking skewness of the numerical varibles:

cat("Skewness of carbon emission variable is: ",
      myskewness(carb_em_elec_data$CO2_metric_tons_per_capita),"\n")
## Skewness of carbon emission variable is:  3.497234
cat("Skewness of fossil electricity variable is: ",
      myskewness(carb_em_elec_data$elec_producted_from_fossils ),"\n")
## Skewness of fossil electricity variable is:  -0.3722603
cat("Skewness of renewable electricity variable is: ",
      myskewness(carb_em_elec_data$elec_producted_from_renewables ),"\n")
## Skewness of renewable electricity variable is:  0.920487

Skewness of the carbon emission variable should be reduced

Using boxplot to check presence of outliers:

par(mfrow=c(2,2))
boxplot(CO2_metric_tons_per_capita ~ Year, data=carb_em_elec_data, main="Amount of Carbon Emission")

boxplot(elec_producted_from_fossils ~ Year, data=carb_em_elec_data, main="% Fossil Electricity")

boxplot(elec_producted_from_renewables ~ Year, data=carb_em_elec_data, main="% Renewable electricity")

Amount of carbon emission has many outlier observations. Other variables look okay.

outlier(carb_em_elec_data$CO2_metric_tons_per_capita, opposite = FALSE, logical = FALSE)
## [1] 70.04223

Computing z-scores:

data<-na.omit(carb_em_elec_data$CO2_metric_tons_per_capita)#remove the NAs
z_scores <- data %>% scores(type = "z")
z_scores %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.7286 -0.6391 -0.3619  0.0000  0.3172  9.9824

This show skewness in the data.

Transform

Applying transformation to carbon emission amount variable reduce skewness. We try log transformation.

# Creating log of Carbon emission variable to the datframe
carb_em_elec_data$log_CO2_metric_tons_per_capita<-log(carb_em_elec_data$CO2_metric_tons_per_capita)

Boxplot of the resulting variable:

boxplot(log_CO2_metric_tons_per_capita~Year, data=carb_em_elec_data, 
          main="Log Amount of carbon emission")

What is the skewness of the transformed variable?

cat("Skewness of Log carbon emission variable
    is:",myskewness(carb_em_elec_data$log_CO2_metric_tons_per_capita),"\n")
## Skewness of Log carbon emission variable
##     is: -0.5109248

Which is lower than the one for original variable

Computing z-scores for the transformed variable

#scores(carb_em_elec_data$CO2_metric_tons_per_capita, type = "z")
data2<-na.omit(carb_em_elec_data$log_CO2_metric_tons_per_capita)#remove the NAs
z_scores <- data2 %>% scores(type = "z")
z_scores %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3.2751 -0.6832  0.1700  0.0000  0.8074  2.2251

The transformed variable is less skewed