1. OVERVIEW

2. DATA PREPARATION

A. Import Libraries

We begin by importing the libraries needed for data manipulation, visualization, and analysis: lubridate for date and time manipulation, tidyr for data wrangling, and ggplot2 for static data visualization.

# for Date and time data manipulation
library(lubridate) 
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
# for data wrangling
library(tidyr)
# Static data visualization
library(ggplot2)

B. Import Data & Inspection

Next, we load the dataset and perform an initial inspection. This step involves understanding the structure of the data, identifying the types of variables, and checking for any immediate issues such as missing values or incorrect data types.

# Load the Data
CO_Emission <- read.csv("data_input/Agrofood_co2_emission.csv")
head(CO_Emission)
# Inspect the structure of the data
str(CO_Emission)
## 'data.frame':    4570 obs. of  32 variables:
##  $ Area                           : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Region                         : chr  "Middle East/Central Asia" "Middle East/Central Asia" "Middle East/Central Asia" "Middle East/Central Asia" ...
##  $ Year                           : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ Savanna.fires                  : num  14.7 14.7 14.7 14.7 14.7 ...
##  $ Forest.fires                   : num  0.0557 0.0557 0.0557 0.0557 0.0557 ...
##  $ Crop.Residues                  : num  206 209 197 231 242 ...
##  $ Rice.Cultivation               : num  686 678 686 686 706 ...
##  $ Drained.organic.soils..CO2.    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pesticides.Manufacturing       : num  11.8 11.7 11.7 11.7 11.7 ...
##  $ Food.Transport                 : num  63.1 61.2 53.3 54.4 54 ...
##  $ Forestland                     : num  -2389 -2389 -2389 -2389 -2389 ...
##  $ Net.Forest.conversion          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Food.Household.Consumption     : num  79.1 80.5 80.8 85.1 88.8 ...
##  $ Food.Retail                    : num  109.6 116.7 126.2 81.5 90.4 ...
##  $ On.farm.Electricity.Use        : num  14.27 11.42 9.28 9.06 8.4 ...
##  $ Food.Packaging                 : num  67.6 67.6 67.6 67.6 67.6 ...
##  $ Agrifood.Systems.Waste.Disposal: num  692 711 744 792 832 ...
##  $ Food.Processing                : num  252 252 252 252 252 ...
##  $ Fertilizers.Manufacturing      : num  12 12.9 13.5 14.1 15.1 ...
##  $ IPPU                           : num  210 217 222 201 182 ...
##  $ Manure.applied.to.Soils        : num  260 269 265 262 268 ...
##  $ Manure.left.on.Pasture         : num  1591 1657 1654 1643 1689 ...
##  $ Manure.Management              : num  319 342 349 352 368 ...
##  $ Fires.in.organic.soils         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fires.in.humid.tropical.forests: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ On.farm.energy.use             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Rural.population               : int  9655167 10230490 10995568 11858090 12690115 13401971 13952791 14373573 14733655 15137497 ...
##  $ Urban.population               : int  2593947 2763167 2985663 3237009 3482604 3697570 3870093 4008032 4130344 4266179 ...
##  $ Total.Population...Male        : int  5348387 5372959 6028494 7003641 7733458 8219467 8569175 8916862 9275541 9667811 ...
##  $ Total.Population...Female      : int  5346409 5372208 6028939 7000119 7722096 8199445 8537421 8871958 9217591 9595036 ...
##  $ total_emission                 : num  2199 2324 2356 2368 2501 ...
##  $ Average.Temperature.C          : num  0.5362 0.0207 -0.2596 0.1019 0.3723 ...
# Summary of the data set
summary(CO_Emission)
##      Area              Region               Year      Savanna.fires      
##  Length:4570        Length:4570        Min.   :1990   Min.   :     0.00  
##  Class :character   Class :character   1st Qu.:1998   1st Qu.:     0.05  
##  Mode  :character   Mode  :character   Median :2005   Median :     6.19  
##                                        Mean   :2005   Mean   :  1484.72  
##                                        3rd Qu.:2013   3rd Qu.:   298.60  
##                                        Max.   :2020   Max.   :114616.40  
##                                                                          
##   Forest.fires      Crop.Residues      Rice.Cultivation  
##  Min.   :    0.00   Min.   :    0.00   Min.   :     0.0  
##  1st Qu.:    0.00   1st Qu.:   21.56   1st Qu.:   126.2  
##  Median :    1.78   Median :  148.36   Median :   320.2  
##  Mean   : 1178.51   Mean   : 1189.19   Mean   :  5592.5  
##  3rd Qu.:  121.52   3rd Qu.:  486.82   3rd Qu.:  1191.2  
##  Max.   :52227.63   Max.   :33490.07   Max.   :164915.3  
##                     NA's   :362                          
##  Drained.organic.soils..CO2. Pesticides.Manufacturing Food.Transport    
##  Min.   :     0.00           Min.   :    0.0          Min.   :    0.37  
##  1st Qu.:     0.00           1st Qu.:    4.0          1st Qu.:   64.00  
##  Median :     8.25           Median :   20.0          Median :  322.05  
##  Mean   :  5080.59           Mean   :  408.8          Mean   : 2457.40  
##  3rd Qu.:  2461.47           3rd Qu.:  129.4          3rd Qu.: 1472.75  
##  Max.   :241025.07           Max.   :16459.0          Max.   :67945.76  
##                                                                         
##    Forestland        Net.Forest.conversion Food.Household.Consumption
##  Min.   :-797183.1   Min.   :     0.0      Min.   :     0.0          
##  1st Qu.:  -4619.1   1st Qu.:     0.0      1st Qu.:    24.6          
##  Median :   -170.9   Median :   156.3      Median :   283.9          
##  Mean   : -21499.5   Mean   : 12560.2      Mean   :  6607.2          
##  3rd Qu.:      0.0   3rd Qu.:  5667.6      3rd Qu.:  2118.4          
##  Max.   : 171121.1   Max.   :724602.2      Max.   :466288.2          
##  NA's   :93          NA's   :93            NA's   :177               
##   Food.Retail        On.farm.Electricity.Use Food.Packaging     
##  Min.   :     0.00   Min.   :     0.00       Min.   :     0.00  
##  1st Qu.:    58.32   1st Qu.:     8.70       1st Qu.:    50.86  
##  Median :   283.22   Median :    72.26       Median :    67.63  
##  Mean   :  2821.19   Mean   :  2361.44       Mean   :  2342.89  
##  3rd Qu.:  1329.63   3rd Qu.:   609.75       3rd Qu.:   372.89  
##  Max.   :133784.07   Max.   :165676.30       Max.   :175741.31  
##                                                                 
##  Agrifood.Systems.Waste.Disposal Food.Processing    Fertilizers.Manufacturing
##  Min.   :     0.34               Min.   :     0.0   Min.   :     0.0         
##  1st Qu.:   317.28               1st Qu.:   209.6   1st Qu.:   356.5         
##  Median :  1327.40               Median :   331.1   Median :   658.5         
##  Mean   :  7681.93               Mean   :  5456.2   Mean   :  3606.2         
##  3rd Qu.:  4835.84               3rd Qu.:  1612.6   3rd Qu.:  2327.9         
##  Max.   :213289.70               Max.   :274253.5   Max.   :170826.4         
##                                                                              
##       IPPU           Manure.applied.to.Soils Manure.left.on.Pasture
##  Min.   :      0.3   Min.   :    0.106       Min.   :    0.0       
##  1st Qu.:    142.9   1st Qu.:   28.468       1st Qu.:  213.6       
##  Median :   1605.3   Median :  147.864       Median :  862.6       
##  Mean   :  26440.4   Mean   : 1103.153       Mean   : 3518.3       
##  3rd Qu.:   9533.7   3rd Qu.:  550.386       3rd Qu.: 2628.3       
##  Max.   :1861640.7   Max.   :29865.389       Max.   :60880.4       
##  NA's   :186         NA's   :176                                   
##  Manure.Management  Fires.in.organic.soils Fires.in.humid.tropical.forests
##  Min.   :    0.57   Min.   :     0         Min.   :    0.00               
##  1st Qu.:   83.72   1st Qu.:     0         1st Qu.:    0.00               
##  Median :  379.87   Median :     0         Median :    0.00               
##  Mean   : 2866.95   Mean   :  1845         Mean   :  829.25               
##  3rd Qu.: 1540.21   3rd Qu.:     0         3rd Qu.:   14.03               
##  Max.   :70592.65   Max.   :991718         Max.   :51771.26               
##  NA's   :176                                                              
##  On.farm.energy.use  Rural.population    Urban.population   
##  Min.   :     0.03   Min.   :        0   Min.   :        0  
##  1st Qu.:    26.08   1st Qu.:   728122   1st Qu.:  1195061  
##  Median :   276.86   Median :  3394547   Median :  3780934  
##  Mean   :  4008.63   Mean   : 26026424   Mean   : 22506955  
##  3rd Qu.:  1676.45   3rd Qu.: 11812388   3rd Qu.: 12032732  
##  Max.   :139388.92   Max.   :900099113   Max.   :902077760  
##  NA's   :628                                                
##  Total.Population...Male Total.Population...Female total_emission   
##  Min.   :     2939       Min.   :     2966         Min.   :-391884  
##  1st Qu.:  1227770       1st Qu.:  1158133         1st Qu.:   6223  
##  Median :  4016625       Median :  4119207         Median :  15656  
##  Mean   : 24540823       Mean   : 24045410         Mean   :  77992  
##  3rd Qu.: 12874136       3rd Qu.: 12802408         3rd Qu.:  51095  
##  Max.   :743586579       Max.   :713341908         Max.   :3115114  
##                                                                     
##  Average.Temperature.C
##  Min.   :-1.3024      
##  1st Qu.: 0.5210      
##  Median : 0.8968      
##  Mean   : 0.9222      
##  3rd Qu.: 1.2904      
##  Max.   : 3.5581      
## 
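As a compact complement to str(), the class of every column can be tabulated in one call. The sketch below runs on a small made-up data frame (`toy` is a hypothetical stand-in, not the real CO_Emission data).

```r
# Toy data frame standing in for CO_Emission (values are made up)
toy <- data.frame(
  Area = c("A", "A", "B"),
  Year = c(1990L, 1991L, 1990L),
  total_emission = c(10.5, 11.2, NA),
  stringsAsFactors = FALSE
)

# Class of every column, returned as a named character vector
col_classes <- sapply(toy, class)
print(col_classes)

# Number of columns of each class
class_counts <- table(col_classes)
print(class_counts)
```

Applied to the real data, the same two calls would give a quick count of chr, int, and num columns without scrolling through the full str() listing.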

C. Data Cleansing & Coercions

-. Missing Value

Missing values can significantly affect the quality and accuracy of our analysis. Therefore, handling them appropriately is crucial. We can either remove rows with missing values, fill them with suitable values, or use advanced imputation methods.
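To make the three options concrete, here is a minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration only). It compares dropping rows, filling with a fixed value, and a simple median imputation.

```r
# Toy data frame with one missing value (values are made up)
toy <- data.frame(id = 1:4, x = c(2, NA, 6, 10))

# Option 1: drop every row that contains at least one NA
dropped <- na.omit(toy)

# Option 2: fill NA with a fixed value (here 0)
zero_filled <- toy
zero_filled$x[is.na(zero_filled$x)] <- 0

# Option 3: simple imputation with the column median
median_filled <- toy
median_filled$x[is.na(median_filled$x)] <- median(toy$x, na.rm = TRUE)

nrow(dropped)          # rows left after dropping
mean(zero_filled$x)    # mean after filling with 0
mean(median_filled$x)  # mean after filling with the median
```

In this toy case the zero fill gives a mean of 4.5 while the median fill gives 6, which shows how the choice of strategy can shift summary statistics.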

To determine whether the data contains any missing values, we can use the anyNA() function:

anyNA(CO_Emission)
## [1] TRUE

anyNA(CO_Emission) checks whether the CO_Emission data frame contains any NA (missing) values. The result [1] TRUE confirms that it does.

Step-by-Step Handling of Missing Values

-. Identify Missing Values by using the sapply() function, or the is.na() and colSums() functions

# Identify missing values
missing_values <- sapply(CO_Emission, function(x) sum(is.na(x)))
missing_values
##                            Area                          Region 
##                               0                               0 
##                            Year                   Savanna.fires 
##                               0                               0 
##                    Forest.fires                   Crop.Residues 
##                               0                             362 
##                Rice.Cultivation     Drained.organic.soils..CO2. 
##                               0                               0 
##        Pesticides.Manufacturing                  Food.Transport 
##                               0                               0 
##                      Forestland           Net.Forest.conversion 
##                              93                              93 
##      Food.Household.Consumption                     Food.Retail 
##                             177                               0 
##         On.farm.Electricity.Use                  Food.Packaging 
##                               0                               0 
## Agrifood.Systems.Waste.Disposal                 Food.Processing 
##                               0                               0 
##       Fertilizers.Manufacturing                            IPPU 
##                               0                             186 
##         Manure.applied.to.Soils          Manure.left.on.Pasture 
##                             176                               0 
##               Manure.Management          Fires.in.organic.soils 
##                             176                               0 
## Fires.in.humid.tropical.forests              On.farm.energy.use 
##                               0                             628 
##                Rural.population                Urban.population 
##                               0                               0 
##         Total.Population...Male       Total.Population...Female 
##                               0                               0 
##                  total_emission           Average.Temperature.C 
##                               0                               0

The same per-column counts can be obtained more concisely with the is.na() and colSums() functions.

colSums(is.na(CO_Emission))
##                            Area                          Region 
##                               0                               0 
##                            Year                   Savanna.fires 
##                               0                               0 
##                    Forest.fires                   Crop.Residues 
##                               0                             362 
##                Rice.Cultivation     Drained.organic.soils..CO2. 
##                               0                               0 
##        Pesticides.Manufacturing                  Food.Transport 
##                               0                               0 
##                      Forestland           Net.Forest.conversion 
##                              93                              93 
##      Food.Household.Consumption                     Food.Retail 
##                             177                               0 
##         On.farm.Electricity.Use                  Food.Packaging 
##                               0                               0 
## Agrifood.Systems.Waste.Disposal                 Food.Processing 
##                               0                               0 
##       Fertilizers.Manufacturing                            IPPU 
##                               0                             186 
##         Manure.applied.to.Soils          Manure.left.on.Pasture 
##                             176                               0 
##               Manure.Management          Fires.in.organic.soils 
##                             176                               0 
## Fires.in.humid.tropical.forests              On.farm.energy.use 
##                               0                             628 
##                Rural.population                Urban.population 
##                               0                               0 
##         Total.Population...Male       Total.Population...Female 
##                               0                               0 
##                  total_emission           Average.Temperature.C 
##                               0                               0
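Raw counts are easier to judge as a share of all rows. For example, 628 missing values out of 4570 rows is about 13.7%, and 362 out of 4570 is about 7.9%. The sketch below computes such percentages for every column on a made-up data frame (`toy` is hypothetical).

```r
# Toy data frame with different amounts of missing data (values are made up)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 3, 4),
  c = c(1, 2, 3, 4)
)

# Share of missing values per column, in percent
pct_missing <- colMeans(is.na(toy)) * 100

# Sort so the most affected columns come first
pct_missing <- sort(pct_missing, decreasing = TRUE)
print(pct_missing)
```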

From the str() output earlier, we know that our data consists of 4570 rows and 32 columns (4570 obs. of 32 variables). Additionally, we noticed that some columns contain empty character strings (""), which is.na() does not recognize as missing values.
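One way to make empty strings visible to is.na() is to convert them to NA in the character columns. The following is a minimal sketch on a made-up data frame (`toy` is hypothetical); an alternative, assuming the file is re-read, is to pass na.strings = c("", "NA") to read.csv().

```r
# Toy data frame where "" marks a missing label (values are made up)
toy <- data.frame(
  Area = c("A", "", "C"),
  Region = c("X", "Y", ""),
  value = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Empty strings are not counted as missing
na_before <- sum(is.na(toy))

# Convert "" to NA in every character column
is_chr <- sapply(toy, is.character)
toy[is_chr] <- lapply(toy[is_chr], function(col) {
  col[!is.na(col) & col == ""] <- NA
  col
})

# Now they are counted
na_after <- sum(is.na(toy))
c(before = na_before, after = na_after)
```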

-. Remove Rows with Many Missing Values:

In the previous step, we identified the columns that contain missing values, listed below. From this list we drop only the 2 columns with more than 300 missing values: Crop.Residues and On.farm.energy.use.

List of columns with missing values:

* Crop.Residues: 362 missing values
* Forestland: 93 missing values
* Net.Forest.conversion: 93 missing values
* Food.Household.Consumption: 177 missing values
* IPPU: 186 missing values
* Manure.applied.to.Soils: 176 missing values
* Manure.Management: 176 missing values
* On.farm.energy.use: 628 missing values

CO_Emission_Clean <- subset(CO_Emission, select = -c(On.farm.energy.use, Crop.Residues))
CO_Emission_Clean
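Instead of typing the column names by hand, the columns to drop can be derived from the missing-value counts. This is a sketch on a made-up data frame; `toy`, its column names, and the cutoff of 2 are assumptions (the text above uses a cutoff of 300 for the real data).

```r
# Toy data frame (values are made up)
toy <- data.frame(
  keep_me = c(1, 2, 3, 4, 5),
  few_na  = c(1, NA, 3, 4, 5),
  many_na = c(NA, NA, NA, 4, 5)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))

# Hypothetical cutoff for this toy example
threshold <- 2

# Names of columns whose NA count exceeds the cutoff
cols_to_drop <- names(na_counts)[na_counts > threshold]

# Keep everything else
toy_clean <- toy[, !(names(toy) %in% cols_to_drop), drop = FALSE]
names(toy_clean)
```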

The remaining columns still contain missing values (NA). These need to be addressed because they can affect the analysis, so we replace them with 0. Note that this treats a missing measurement as zero emission, which can bias sums and averages.

-. Replace Missing Values with 0:

# Replace missing values (NA) with 0

CO_Emission_Clean[is.na(CO_Emission_Clean)] <- 0
str(CO_Emission_Clean)
## 'data.frame':    4570 obs. of  30 variables:
##  $ Area                           : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Region                         : chr  "Middle East/Central Asia" "Middle East/Central Asia" "Middle East/Central Asia" "Middle East/Central Asia" ...
##  $ Year                           : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ Savanna.fires                  : num  14.7 14.7 14.7 14.7 14.7 ...
##  $ Forest.fires                   : num  0.0557 0.0557 0.0557 0.0557 0.0557 ...
##  $ Rice.Cultivation               : num  686 678 686 686 706 ...
##  $ Drained.organic.soils..CO2.    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pesticides.Manufacturing       : num  11.8 11.7 11.7 11.7 11.7 ...
##  $ Food.Transport                 : num  63.1 61.2 53.3 54.4 54 ...
##  $ Forestland                     : num  -2389 -2389 -2389 -2389 -2389 ...
##  $ Net.Forest.conversion          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Food.Household.Consumption     : num  79.1 80.5 80.8 85.1 88.8 ...
##  $ Food.Retail                    : num  109.6 116.7 126.2 81.5 90.4 ...
##  $ On.farm.Electricity.Use        : num  14.27 11.42 9.28 9.06 8.4 ...
##  $ Food.Packaging                 : num  67.6 67.6 67.6 67.6 67.6 ...
##  $ Agrifood.Systems.Waste.Disposal: num  692 711 744 792 832 ...
##  $ Food.Processing                : num  252 252 252 252 252 ...
##  $ Fertilizers.Manufacturing      : num  12 12.9 13.5 14.1 15.1 ...
##  $ IPPU                           : num  210 217 222 201 182 ...
##  $ Manure.applied.to.Soils        : num  260 269 265 262 268 ...
##  $ Manure.left.on.Pasture         : num  1591 1657 1654 1643 1689 ...
##  $ Manure.Management              : num  319 342 349 352 368 ...
##  $ Fires.in.organic.soils         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fires.in.humid.tropical.forests: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Rural.population               : int  9655167 10230490 10995568 11858090 12690115 13401971 13952791 14373573 14733655 15137497 ...
##  $ Urban.population               : int  2593947 2763167 2985663 3237009 3482604 3697570 3870093 4008032 4130344 4266179 ...
##  $ Total.Population...Male        : int  5348387 5372959 6028494 7003641 7733458 8219467 8569175 8916862 9275541 9667811 ...
##  $ Total.Population...Female      : int  5346409 5372208 6028939 7000119 7722096 8199445 8537421 8871958 9217591 9595036 ...
##  $ total_emission                 : num  2199 2324 2356 2368 2501 ...
##  $ Average.Temperature.C          : num  0.5362 0.0207 -0.2596 0.1019 0.3723 ...

To check again for missing values in each column, we can use the is.na() and colSums() functions.

colSums(is.na(CO_Emission_Clean))
##                            Area                          Region 
##                               0                               0 
##                            Year                   Savanna.fires 
##                               0                               0 
##                    Forest.fires                Rice.Cultivation 
##                               0                               0 
##     Drained.organic.soils..CO2.        Pesticides.Manufacturing 
##                               0                               0 
##                  Food.Transport                      Forestland 
##                               0                               0 
##           Net.Forest.conversion      Food.Household.Consumption 
##                               0                               0 
##                     Food.Retail         On.farm.Electricity.Use 
##                               0                               0 
##                  Food.Packaging Agrifood.Systems.Waste.Disposal 
##                               0                               0 
##                 Food.Processing       Fertilizers.Manufacturing 
##                               0                               0 
##                            IPPU         Manure.applied.to.Soils 
##                               0                               0 
##          Manure.left.on.Pasture               Manure.Management 
##                               0                               0 
##          Fires.in.organic.soils Fires.in.humid.tropical.forests 
##                               0                               0 
##                Rural.population                Urban.population 
##                               0                               0 
##         Total.Population...Male       Total.Population...Female 
##                               0                               0 
##                  total_emission           Average.Temperature.C 
##                               0                               0

The results of colSums(is.na(CO_Emission_Clean)) show that there are no missing values (NA) in any of the columns of the CO_Emission_Clean data frame.

-. Rename columns if needed

To rename columns we can use the make.names() function, which ensures that the names of objects, such as column names in a data frame, are syntactically valid variable names. This is particularly useful when the names contain spaces, special characters, or other syntax that is invalid in R variable names. Because read.csv() already applies this conversion by default (check.names = TRUE), the names below are unchanged.

# Rename columns if needed
colnames(CO_Emission_Clean) <- make.names(colnames(CO_Emission_Clean), unique = TRUE)
names(CO_Emission_Clean)
##  [1] "Area"                            "Region"                         
##  [3] "Year"                            "Savanna.fires"                  
##  [5] "Forest.fires"                    "Rice.Cultivation"               
##  [7] "Drained.organic.soils..CO2."     "Pesticides.Manufacturing"       
##  [9] "Food.Transport"                  "Forestland"                     
## [11] "Net.Forest.conversion"           "Food.Household.Consumption"     
## [13] "Food.Retail"                     "On.farm.Electricity.Use"        
## [15] "Food.Packaging"                  "Agrifood.Systems.Waste.Disposal"
## [17] "Food.Processing"                 "Fertilizers.Manufacturing"      
## [19] "IPPU"                            "Manure.applied.to.Soils"        
## [21] "Manure.left.on.Pasture"          "Manure.Management"              
## [23] "Fires.in.organic.soils"          "Fires.in.humid.tropical.forests"
## [25] "Rural.population"                "Urban.population"               
## [27] "Total.Population...Male"         "Total.Population...Female"      
## [29] "total_emission"                  "Average.Temperature.C"
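To see what make.names() actually changes, the sketch below applies it to a few raw header strings. These raw strings are guesses at what the CSV headers might look like, used only for illustration; the duplicated "Year" is added to show the effect of unique = TRUE.

```r
# Hypothetical raw headers (assumed, not read from the actual CSV)
raw_names <- c(
  "Total Population - Male",
  "Drained organic soils (CO2)",
  "total_emission",
  "Year",
  "Year"
)

# Invalid characters become dots; duplicates get a numeric suffix
clean_names <- make.names(raw_names, unique = TRUE)
print(clean_names)
```

Each space, hyphen, and parenthesis is replaced by a dot, which explains names such as Total.Population...Male in the output above.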

D. Convert data types

To convert data types we can use the functions below.

* as.character()
* as.Date()
* as.integer()
* as.numeric()
* as.factor()

From the str(CO_Emission_Clean) output, the data types of the dataset seem appropriate for the data they represent. However, there are some considerations for further analysis:

Area and Region: These columns are currently character (chr) types. Because they hold categorical data, converting them to factor types is beneficial for analysis and modeling.

Year: The int type is appropriate for the Year column.

All other columns: These are of type num or, for the population columns, int, which is appropriate for numeric data.

So, we only change Area and Region from chr to factor:

# Convert character columns to factors
CO_Emission_Clean$Area <- as.factor(CO_Emission_Clean$Area)
CO_Emission_Clean$Region <- as.factor(CO_Emission_Clean$Region)
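When there are many character columns, the conversion can be applied to all of them at once rather than one by one. A minimal sketch on a made-up data frame (`toy` is hypothetical):

```r
# Toy data frame (values are made up)
toy <- data.frame(
  Area = c("Afghanistan", "Albania", "Afghanistan"),
  Region = c("Asia", "Europe", "Asia"),
  Year = c(1990L, 1990L, 1991L),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each to a factor
chr_cols <- names(toy)[sapply(toy, is.character)]
toy[chr_cols] <- lapply(toy[chr_cols], as.factor)

nlevels(toy$Area)   # number of distinct areas
levels(toy$Region)  # distinct regions, sorted alphabetically
```

Non-character columns such as Year are left untouched by this approach.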

E. Check the Data After Changing the Data Types

To verify the data after the changes, we can use the str() function:

str(CO_Emission_Clean)
## 'data.frame':    4570 obs. of  30 variables:
##  $ Area                           : Factor w/ 153 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Region                         : Factor w/ 6 levels "Africa","Asia-Pacific",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Year                           : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ Savanna.fires                  : num  14.7 14.7 14.7 14.7 14.7 ...
##  $ Forest.fires                   : num  0.0557 0.0557 0.0557 0.0557 0.0557 ...
##  $ Rice.Cultivation               : num  686 678 686 686 706 ...
##  $ Drained.organic.soils..CO2.    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pesticides.Manufacturing       : num  11.8 11.7 11.7 11.7 11.7 ...
##  $ Food.Transport                 : num  63.1 61.2 53.3 54.4 54 ...
##  $ Forestland                     : num  -2389 -2389 -2389 -2389 -2389 ...
##  $ Net.Forest.conversion          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Food.Household.Consumption     : num  79.1 80.5 80.8 85.1 88.8 ...
##  $ Food.Retail                    : num  109.6 116.7 126.2 81.5 90.4 ...
##  $ On.farm.Electricity.Use        : num  14.27 11.42 9.28 9.06 8.4 ...
##  $ Food.Packaging                 : num  67.6 67.6 67.6 67.6 67.6 ...
##  $ Agrifood.Systems.Waste.Disposal: num  692 711 744 792 832 ...
##  $ Food.Processing                : num  252 252 252 252 252 ...
##  $ Fertilizers.Manufacturing      : num  12 12.9 13.5 14.1 15.1 ...
##  $ IPPU                           : num  210 217 222 201 182 ...
##  $ Manure.applied.to.Soils        : num  260 269 265 262 268 ...
##  $ Manure.left.on.Pasture         : num  1591 1657 1654 1643 1689 ...
##  $ Manure.Management              : num  319 342 349 352 368 ...
##  $ Fires.in.organic.soils         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fires.in.humid.tropical.forests: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Rural.population               : int  9655167 10230490 10995568 11858090 12690115 13401971 13952791 14373573 14733655 15137497 ...
##  $ Urban.population               : int  2593947 2763167 2985663 3237009 3482604 3697570 3870093 4008032 4130344 4266179 ...
##  $ Total.Population...Male        : int  5348387 5372959 6028494 7003641 7733458 8219467 8569175 8916862 9275541 9667811 ...
##  $ Total.Population...Female      : int  5346409 5372208 6028939 7000119 7722096 8199445 8537421 8871958 9217591 9595036 ...
##  $ total_emission                 : num  2199 2324 2356 2368 2501 ...
##  $ Average.Temperature.C          : num  0.5362 0.0207 -0.2596 0.1019 0.3723 ...

The CO_Emission_Clean dataset is now cleaned and structured for analysis. It consists of 4570 observations across 30 variables: the two columns with the most missing values (Crop.Residues and On.farm.energy.use) were dropped, and Area and Region were converted to factors. The numeric variables, covering emissions from various sources, population data, and temperature measurements, contain no missing values after the remaining NA entries were replaced with zeros. The dataset is ready for exploration and modeling of the relationships between emissions, population dynamics, and environmental factors across regions and years.

3. DATA EXPLORATION AND VISUALIZATION

Let’s look at our data with the head() function:

head(CO_Emission_Clean)

Let’s take a look at the CO_Emission_Clean data by year. We can use the summary() function:

summary(CO_Emission_Clean$Year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1990    1998    2005    2005    2013    2020

From the output, the CO_Emission_Clean dataset spans from 1990 to 2020. The median and mean year are both 2005, the midpoint of that range, which suggests that the observations are spread fairly evenly across the years.
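The summary only hints at how evenly the years are covered. With 153 areas and 31 years, a complete panel would have 4,743 rows, while the dataset has 4,570, so some area-year combinations must be absent. Counting rows per year makes the coverage explicit; the sketch below does this on a made-up data frame (`toy` is hypothetical).

```r
# Toy data frame (values are made up)
toy <- data.frame(
  Area = c("A", "A", "B", "B", "B"),
  Year = c(1990L, 1991L, 1990L, 1991L, 1992L)
)

# Number of observations recorded for each year
obs_per_year <- table(toy$Year)
print(obs_per_year)

# Earliest and latest year
year_range <- range(toy$Year)
print(year_range)
```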

Let’s see how many regions are in the CO_Emission_Clean dataset, using the unique() function:

# Get unique regions
unique_regions <- unique(CO_Emission_Clean$Region)

# Display the list of unique regions
print(unique_regions)
## [1] Middle East/Central Asia Northern/Eastern Europe  Africa                  
## [4] European Union           Asia-Pacific             North America           
## 6 Levels: Africa Asia-Pacific European Union ... Northern/Eastern Europe

The unique(CO_Emission_Clean$Region) call retrieves all unique regions listed in the Region column of the CO_Emission_Clean dataset. The dataset contains observations from six unique regions:

  1. Middle East/Central Asia
  2. Northern/Eastern Europe
  3. Africa
  4. European Union
  5. Asia-Pacific
  6. North America

With the time span (1990 to 2020) and the six regions established, let’s explore the data visually.

First, we want to know the total CO2 emissions by region and year. Here are the steps to prepare the data for the visualization:

  1. Load the necessary libraries: dplyr, ggplot2, and scales.
  2. Aggregate CO2 emissions by Region and Year using group_by() and summarise(). The summarise() function calculates Total_CO2_Emission as the sum of total_emission for each group.
  3. Arrange the data in descending order of Total_CO2_Emission using arrange().
  4. Use mutate(), which creates new columns or modifies existing ones in a data frame, to add a text column in which each row holds a string formatted as “Total_CO2_Emission: value_with_commas kt”. This is useful for creating human-readable labels or summaries within our data frame.
# Load necessary libraries
library(dplyr)  # for data manipulation
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)  # for plotting (optional)
library(scales)  # for comma formatting


# Aggregate CO2 emissions by Region and Year
CO2_agg <- CO_Emission_Clean %>%
  group_by(Region, Year) %>%
  summarise(Total_CO2_Emission = sum(total_emission, na.rm = TRUE), .groups = 'drop') %>%
  arrange(desc(Total_CO2_Emission)) %>%
  # Build a readable label, e.g. "Total_CO2_Emission: 1,234 kt"
  mutate(text = paste0("Total_CO2_Emission: ", comma(Total_CO2_Emission), " kt"))
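As a cross-check that does not depend on dplyr, the same aggregation can be sketched in base R with aggregate(). The data frame `toy` and its values are made up, and format() with big.mark is swapped in here for scales::comma(), so the labels may differ slightly in rounding from the pipeline above.

```r
# Toy data frame (values are made up)
toy <- data.frame(
  Region = c("Africa", "Africa", "Africa", "Asia-Pacific", "Asia-Pacific"),
  Year = c(1990L, 1990L, 1991L, 1990L, 1990L),
  total_emission = c(100, 50, 70, 1000, 234.5)
)

# Sum total_emission for each Region/Year combination
agg <- aggregate(total_emission ~ Region + Year, data = toy, FUN = sum)
names(agg)[names(agg) == "total_emission"] <- "Total_CO2_Emission"

# Sort in descending order of total emissions
agg <- agg[order(-agg$Total_CO2_Emission), ]

# Human-readable label with thousands separators
agg$text <- paste0(
  "Total_CO2_Emission: ",
  format(agg$Total_CO2_Emission, big.mark = ",", trim = TRUE),
  " kt"
)
print(agg)
```

Note that the formula interface of aggregate() drops rows with NA by default, which plays a role similar to na.rm = TRUE in the dplyr version.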

For visualization, we can use ggplot2, a popular package for creating visualizations in R, as follows:

ggplot(CO2_agg, aes(x = Year, y = Total_CO2_Emission, color = Region)) +
  geom_line(linewidth = 1.2) +  # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  geom_point(size = 2.5) +
  labs(title = "Total CO2 Emissions by Region per Year",
       x = "Year",
       y = "Total CO2 Emissions (kt)",
       color = "Region") +
  scale_y_continuous(labels = scales::comma) +
  scale_color_manual(values = c(
    "Middle East/Central Asia" = "#1f77b4",
    "Northern/Eastern Europe" = "#ff7f0e",
    "Africa" = "#2ca02c",
    "European Union" = "#d62728",
    "Asia-Pacific" = "#9467bd",
    "North America" = "#8c564b"
  )) +
  theme_minimal() +
  theme(
    panel.background = element_rect(fill = "#f0f0f0"),  # Panel background color
    plot.background = element_rect(fill = "#C4D5C5"),  # Plot background color
    panel.grid.major = element_line(linewidth = 0.5, linetype = "solid", color = "gray"),  # Major grid lines
    panel.grid.minor = element_line(linewidth = 0.2, linetype = "dotted", color = "gray"),  # Minor grid lines
    legend.position = "bottom",  # Position of the legend
    legend.title = element_text(face = "bold"),  # Title of the legend
    legend.text = element_text(size = 10),  # Text of the legend
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5, color = "#333333"),  # Centered bold title
    axis.text.y = element_text(size = 10, color = "black"),
    axis.title = element_text(size = 12, face = "bold", color = "#333333"),  # Adjust axis title
    axis.text = element_text(size = 10, color = "black")  # Adjust axis text
  )

The “Total CO2 Emissions by Region per Year” plot shows how CO2 emissions varied across regions from 1990 to 2020.

  1. Over the decades, emissions have fluctuated differently across regions, with some showing increasing trends while others fluctuate or decrease.
  2. Regions like Asia-Pacific generally exhibit higher emissions compared to others like Africa or the European Union.
  3. Changes in emissions trends might correlate with economic development, policy interventions, or environmental regulations in each region.

This exploratory visualization helps us understand the broad patterns and variations in CO2 emissions across regions over time, laying the foundation for more detailed analyses or policy discussions regarding climate change mitigation strategies.
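
To back the visual impression with numbers, one option is to compute each region's percentage change between its first and last year. The sketch below is only an illustration: the `toy_agg` data frame and its values are made up and stand in for `CO2_agg`, and the `region_change` name is hypothetical.

```r
library(dplyr)

# Made-up illustrative values, NOT taken from the dataset.
# The columns mirror those of CO2_agg (Region, Year, Total_CO2_Emission).
toy_agg <- data.frame(
  Region = rep(c("Region A", "Region B"), each = 3),
  Year = rep(c(1990, 2005, 2020), times = 2),
  Total_CO2_Emission = c(100, 150, 200, 80, 70, 60)
)

# For each region, compare the emission in the first and last available year
region_change <- toy_agg %>%
  group_by(Region) %>%
  arrange(Year, .by_group = TRUE) %>%
  summarise(
    first_year = first(Year),
    last_year = last(Year),
    first_emission = first(Total_CO2_Emission),
    last_emission = last(Total_CO2_Emission),
    .groups = "drop"
  ) %>%
  mutate(pct_change = 100 * (last_emission - first_emission) / first_emission) %>%
  arrange(desc(pct_change))

print(region_change)
```

Replacing `toy_agg` with `CO2_agg` should give one row per region, sorted from the largest increase to the largest decrease.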

The Asia-Pacific region stands out with higher emissions than the other regions, so we look more closely at the correlation between its emissions and the average temperature increase. The steps are as follows:

  1. Load the necessary libraries. These libraries are used for data manipulation and visualization.

ggplot2 is used for creating plots; glue for string interpolation, which lets us build informative text labels; scales for formatting numbers; and ggrepel for keeping text labels from overlapping in the plot.

library(ggplot2) # Static data visualization
library(glue) # String interpolation
library(scales)  # for comma formatting
library(ggrepel) # Non-overlapping text labels
## Warning: package 'ggrepel' was built under R version 4.4.1
  2. Filter and prepare the data for plotting.

The filter function selects the records for the “Asia-Pacific” region. The group_by function groups the data by Year. The summarise function calculates, for each year, the total CO2 emissions, the average temperature increase, and the total rural and urban populations. The arrange function sorts the data by Year. The mutate function creates two text labels (text and smooth_text) for annotating the plot.

# Data Preparation
CO_Emission_Region_Temperature <- CO_Emission_Clean %>%
  filter(Region == "Asia-Pacific") %>%
  group_by(Year) %>%
  summarise(
    total_emission = sum(total_emission),
    Average.Temperature = mean(Average.Temperature.C),
    Rural_Population = sum(Rural.population),
    Urban_Population = sum(Urban.population)
  ) %>%
  arrange(Year) %>%
  mutate(text = glue("Total_Emission: {comma(total_emission)} kt
                    Average Temperature Increase: {format(Average.Temperature, digits= 1, nsmall =2)} C"),
         smooth_text = glue("({format(Average.Temperature, digits = 2, nsmall = 2)} C, {comma(total_emission, accuracy = 0.01)} kt)"))
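
The label formatting in the mutate step can be tried in isolation. The sketch below uses two hypothetical values (`temperature` and `emission`, not taken from the dataset) to show how `format()` and `scales::comma()` shape the text that glue interpolates into `smooth_text`.

```r
library(glue)
library(scales)

# Hypothetical values for a single year, NOT taken from the dataset
temperature <- 0.5
emission <- 1234567.891

# format() pads the temperature to two decimals;
# comma() adds thousands separators and rounds to the given accuracy
label <- glue("({format(temperature, digits = 2, nsmall = 2)} C, {comma(emission, accuracy = 0.01)} kt)")
print(label)
```

Because glue evaluates each `{...}` expression as ordinary R code, the same pattern works row by row inside mutate.
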
  3. Create and customize the plot using ggplot2.

geom_smooth adds a smoothed line with the specified color, fill, line width, and transparency. geom_point adds points, mapping size to the combined rural and urban population and color to the year. geom_text_repel adds text labels that avoid overlapping. scale_y_continuous formats the y-axis labels with commas. scale_size_continuous adjusts the size range of the points. scale_color_viridis_c provides a color gradient for the year variable. labs sets the plot title and axis labels. theme_minimal applies a minimal theme, with further customizations for the title, background, grid lines, axis, and legend.

# Plotting with ggplot2
ggplot(data = CO_Emission_Region_Temperature, aes(x = Average.Temperature, y = total_emission)) +
  geom_smooth(col = "maroon", fill = "lightpink", linewidth = 1, alpha = 1) +
  geom_point(aes(size = Rural_Population + Urban_Population, color = Year)) +  # Map color to Year for legend
  geom_text_repel(aes(label = smooth_text), vjust = -0.5, hjust = 0.5, size = 1.5, color = "blue") +
  scale_y_continuous(labels = comma) +
  scale_size_continuous(range = c(1, 6), guide = "none") +
  scale_color_viridis_c(option = "A", direction = -1, labels = scales::number_format()) +
  labs(title = "Correlation between CO2 Emissions and Average Temperature Increase in Asia-Pacific",
       x = "Average Temperature Increase (\u00B0C)",  # Unicode for degree symbol
       y = "Total CO2 Emission (kt)",
       color = "Year") +  # Add color legend label
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 12, hjust = 0.5, color = "#333333"),
    panel.background = element_rect(fill = "#f9f9f9"),
    plot.background = element_rect(fill = "#C4D5C5"),
    panel.grid.major = element_line(colour = "grey"),
    axis.line = element_line(color = "grey"),
    axis.text = element_text(size = 10, colour = "black"),
    legend.position = "right"  # Add the legend back to the plot
  )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The code filters and aggregates CO2 emissions and average temperature data for the Asia-Pacific region, then draws a scatter plot of the relationship between the two variables. The plot includes a smoothed trend line, points sized by total population and colored by year (with a legend), and a text annotation for each point. ggrepel moves the labels apart to reduce overlap, which keeps the plot readable. Together, these elements illustrate how CO2 emissions and the average temperature increase move together in the Asia-Pacific region over time.

The visualization indicates that, in the Asia-Pacific region, CO2 emissions and the average temperature increase are correlated over the years. As CO2 emissions increase, the average temperature also tends to rise, suggesting a potential link between higher emissions and temperature increases.
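
The plot shows the relationship visually; a correlation coefficient would put a number on it. The sketch below is one possible way to do that with base R's `cor.test()` and `cor()`. The helper `emission_temperature_cor` and the `toy` data frame are hypothetical, and the values are made up for illustration; the column names follow `CO_Emission_Region_Temperature`.

```r
# Hypothetical helper: linear (Pearson) and rank (Spearman) correlation
# between the yearly temperature increase and total emissions
emission_temperature_cor <- function(df) {
  pearson <- cor.test(df$Average.Temperature, df$total_emission, method = "pearson")
  spearman <- cor(df$Average.Temperature, df$total_emission, method = "spearman")
  list(
    pearson_r = unname(pearson$estimate),
    pearson_p = pearson$p.value,
    spearman_rho = spearman
  )
}

# Made-up illustrative values, NOT taken from the dataset
toy <- data.frame(
  Year = 2000:2005,
  Average.Temperature = c(0.40, 0.55, 0.50, 0.70, 0.85, 0.90),
  total_emission = c(100, 120, 115, 140, 160, 175)
)

result <- emission_temperature_cor(toy)
print(result)
```

Calling `emission_temperature_cor(CO_Emission_Region_Temperature)` would apply the same check to the aggregated data. Since both series trend over the years, a high coefficient may partly reflect that shared time trend rather than a direct effect.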

4. SUMMARY

The findings from this analysis and visualization can serve as a foundation for further research and analysis, particularly in the context of “go green” initiatives. Policymakers and environmentalists can use these insights to design and implement strategies aimed at reducing CO2 emissions and mitigating climate change. Furthermore, the data can inform social and political policies, encouraging governments and organizations to adopt sustainable practices and prioritize environmental conservation. This analysis underscores the importance of understanding the impact of emissions on temperature increases, highlighting the need for concerted efforts to address climate change at both national and regional levels.