Week 3 Challenge

1. Introduction

This report uses the World Bank’s country-level global waste management data set to explore the relationship between a country’s GDP and its waste generation. The data set is available at https://datacatalog.worldbank.org/search/dataset/0039597/What-a-Waste-Global-Database

2. Prepare the programming environment

#config working directory
setwd("~/Desktop/mydata")

#load necessary libraries
library(tidyverse)   #attaches readr, dplyr and ggplot2, among others
library(ggthemes)    #provides theme_economist_white()

3. Load the dataset

#import data
data = read_csv("global_waste_data.csv")
head(data)

## # A tibble: 6 × 51
##   iso3c region_id country_name         income_id    gdp composition_food_organ…¹
##   <chr> <chr>     <chr>                <chr>      <dbl>                    <dbl>
## 1 ABW   LCN       Aruba                HIC       35563.                     NA  
## 2 AFG   SAS       Afghanistan          LIC        2057.                     NA  
## 3 AGO   SSF       Angola               LMC        8037.                     51.8
## 4 ALB   ECS       Albania              UMC       13724.                     51.4
## 5 AND   ECS       Andorra              HIC       43712.                     31.2
## 6 ARE   MEA       United Arab Emirates HIC       67119.                     39  
## # ℹ abbreviated name: ¹composition_food_organic_waste_percent
## # ℹ 45 more variables: composition_glass_percent <dbl>,
## #   composition_metal_percent <dbl>, composition_other_percent <dbl>,
## #   composition_paper_cardboard_percent <dbl>,
## #   composition_plastic_percent <dbl>,
## #   composition_rubber_leather_percent <dbl>, composition_wood_percent <dbl>,
## #   composition_yard_garden_green_waste_percent <dbl>, …

4. Slice the data with the right variables

Many columns in the data set are half-filled or less. Because such columns would not be useful in the analysis, a new data set is created that keeps only the columns needed.

#keep the eight columns used in the analysis (selected by position)
slice = data %>% select(3, 4, 5, 20, 6, 10, 11, 27)
names(slice)

## [1] "country_name"                           
## [2] "income_id"                              
## [3] "gdp"                                    
## [4] "population_population_number_of_people" 
## [5] "composition_food_organic_waste_percent" 
## [6] "composition_paper_cardboard_percent"    
## [7] "composition_plastic_percent"            
## [8] "total_msw_total_msw_generated_tons_year"
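
The "half-filled or less" criterion can be checked directly by computing the share of missing values in each column. The sketch below is only an illustration on a small made-up data frame (`toy` and its columns are hypothetical and not part of the World Bank file); with the real data, `toy` would be replaced by `data`.

```r
# Hypothetical stand-in for the imported data set
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  gdp     = c(1000, 2500, NA, 40000),
  glass   = c(NA, NA, NA, 4.2),
  plastic = c(10.1, NA, 12.3, 9.8)
)

# Share of missing values per column (0 = complete, 1 = entirely NA)
na_share <- colMeans(is.na(toy))

# Columns that are at least half empty
sparse_cols <- names(na_share)[na_share >= 0.5]

print(na_share)
print(sparse_cols)
```

Sorting `na_share` in decreasing order gives a quick overview of which columns are candidates for removal.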

5. Rename columns with shorter names

The column names of the sliced data set are too long to reference conveniently, so they are renamed to shorter ones.

slice = rename(slice, "population" = "population_population_number_of_people",
               "food&organic_waste(%)" = "composition_food_organic_waste_percent",
               "paper_waste(%)" = "composition_paper_cardboard_percent",
               "plastic_waste(%)" = "composition_plastic_percent",
               "annual_total_waste(tons)" = "total_msw_total_msw_generated_tons_year")
names(slice)

## [1] "country_name"             "income_id"               
## [3] "gdp"                      "population"              
## [5] "food&organic_waste(%)"    "paper_waste(%)"          
## [7] "plastic_waste(%)"         "annual_total_waste(tons)"
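
Names that contain characters such as `&`, `(` or `%` are non-syntactic in R, so they must be wrapped in backticks whenever they are referenced. A minimal sketch with a made-up data frame (`toy` and `plastic_pct` are hypothetical names used only for illustration):

```r
# check.names = FALSE keeps the column names exactly as written
toy <- data.frame(
  "plastic_waste(%)" = c(10.1, 12.3),
  plastic_pct        = c(10.1, 12.3),
  check.names = FALSE
)

# Non-syntactic name: backticks are required
mean_backtick <- mean(toy$`plastic_waste(%)`)

# Syntactic name: no quoting needed
mean_plain <- mean(toy$plastic_pct)

print(names(toy))
print(c(mean_backtick, mean_plain))
```

Whether the more descriptive name is worth the extra backticks is a matter of taste; both columns hold the same values here.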

6. Clean the data

This section removes rows that contain NA values. The impact on the analysis should be limited because the removed rows are mainly very small countries or territories. In addition, income_id is converted to a factor so that the income groups can be summarized in later sections.

#remove rows with NA values
slice = slice %>% filter(complete.cases(slice))    

#convert income_id to a factor, with levels ordered from low to high income
slice$income_id = factor(slice$income_id, levels = c("LIC", "LMC", "UMC", "HIC"))
head(slice)

## # A tibble: 6 × 8
##   country_name         income_id    gdp population `food&organic_waste(%)`
##   <chr>                <chr>      <dbl>      <dbl>                   <dbl>
## 1 Angola               LMC        8037.   25096150                    51.8
## 2 Albania              UMC       13724.    2854191                    51.4
## 3 Andorra              HIC       43712.      82431                    31.2
## 4 United Arab Emirates HIC       67119.    9770529                    39  
## 5 Argentina            HIC       23550.   42981516                    38.7
## 6 Armenia              UMC       11020.    2906220                    57  
## # ℹ 3 more variables: `paper_waste(%)` <dbl>, `plastic_waste(%)` <dbl>,
## #   `annual_total_waste(tons)` <dbl>
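
Explicit factor levels fix the order in which groups are sorted and displayed, which is the practical benefit of storing the income groups as a factor. A minimal sketch with a made-up vector (`income` is hypothetical; the low-to-high level order follows the income group legend used in this report):

```r
income <- c("HIC", "LIC", "UMC", "LMC", "HIC")

# As plain character data, sorting is alphabetical
alpha_order <- sort(unique(income))

# With explicit levels, sorting follows the income order instead
income_f <- factor(income, levels = c("LIC", "LMC", "UMC", "HIC"))
level_order <- as.character(sort(unique(income_f)))

print(alpha_order)
print(level_order)

# Counts per group are also reported in level order
print(table(income_f))
```

Note that the factor has to be assigned back to the column of the data frame; assigning it to a separate variable leaves the column unchanged.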

7. Summarize the data by GDP income group

Calculate the average waste by category for each income group to see the relationship between GDP and waste generation.

attach(slice)
slice %>%
  group_by(income_id) %>%
  summarize(avg_total = mean(`annual_total_waste(tons)`),
            food_organic_waste_avg = mean(`food&organic_waste(%)`),
            paper_waste_avg = mean(`paper_waste(%)`),
            plastic_waste_avg = mean(`plastic_waste(%)`)) %>%
  arrange(income_id)

## # A tibble: 4 × 5
##   income_id avg_total food_organic_waste_avg paper_waste_avg plastic_waste_avg
##   <chr>         <dbl>                  <dbl>           <dbl>             <dbl>
## 1 HIC       10037674.                   33.0           21.4              12.4 
## 2 LIC        2528783.                   51.3            7.29              7.97
## 3 LMC        6635725.                   50.4           10.7              11.3 
## 4 UMC       16920344.                   46.6           12.2              12.6

The table above supports the following conclusions:

Food and organic waste makes up a larger share of total waste in LICs (51.3%) and LMCs (50.4%) than in UMCs (46.6%) and HICs (33.0%), probably because their economies are more agricultural.
Paper waste makes up a larger share in HICs (21.4%) than in LICs (7.29%) and LMCs (10.7%), which may indicate a higher contribution to deforestation.
Plastic waste accounts for a similar share in LMCs, UMCs, and HICs (11.3% to 12.6%); only LICs are noticeably lower (7.97%).

**HIC: High Income Country | UMC: Upper Middle Income Country | LMC: Lower Middle Income Country | LIC: Low Income Country

8. Filter the data further for big economies (population > 10 million)

The summary by income group is repeated for countries with a population above 10 million.

8.1 Prep the new data

big_economies = slice %>% filter(population > 10000000) 
head(big_economies)

## # A tibble: 6 × 8
##   country_name income_id    gdp population `food&organic_waste(%)`
##   <chr>        <chr>      <dbl>      <dbl>                   <dbl>
## 1 Angola       LMC        8037.   25096150                    51.8
## 2 Argentina    HIC       23550.   42981516                    38.7
## 3 Australia    HIC       47784.   23789338                    48.4
## 4 Belgium      HIC       51915.   11484055                    14.2
## 5 Burkina Faso LIC        1925.   18110624                    21  
## 6 Bangladesh   LMC        3196.  155727056                    80.6
## # ℹ 3 more variables: `paper_waste(%)` <dbl>, `plastic_waste(%)` <dbl>,
## #   `annual_total_waste(tons)` <dbl>

detach(slice)
attach(big_economies)

## The following object is masked _by_ .GlobalEnv:
## 
##     income_id

## The following object is masked from package:tidyr:
## 
##     population

8.2 Summary Statistics for Countries with Population > 10 million

The summary statistics below largely confirm the earlier findings:

Food and organic waste still makes up a much smaller share in HICs (35.1%) than in the other groups (50.4% to 57.4%), although UMCs (52.3%) now rank slightly above LICs (50.4%).
Paper waste still makes up a larger share in HICs (22.9%) and UMCs (11.9%) than in LICs (8.33%) and LMCs (9.66%).

However, the big_economies data set yields one new finding:

Plastic waste makes up a comparable share in all four groups (9.43% to 12.1%), LICs included. This implies that large countries contribute to plastic waste generation regardless of income level.

**HIC: High Income Country | UMC: Upper Middle Income Country | LMC: Lower Middle Income Country | LIC: Low Income Country

big_economies %>%
  group_by(income_id) %>%
  summarize(avg_total = mean(`annual_total_waste(tons)`),
            food_organic_waste_avg = mean(`food&organic_waste(%)`),
            paper_waste_avg = mean(`paper_waste(%)`),
            plastic_waste_avg = mean(`plastic_waste(%)`)) %>%
  arrange(income_id)

## # A tibble: 4 × 5
##   income_id avg_total food_organic_waste_avg paper_waste_avg plastic_waste_avg
##   <chr>         <dbl>                  <dbl>           <dbl>             <dbl>
## 1 HIC       31260457.                   35.1           22.9              11.5 
## 2 LIC        3342614.                   50.4            8.33              9.43
## 3 LMC       11673521.                   57.4            9.66              9.44
## 4 UMC       40511531.                   52.3           11.9              12.1

9. Graphical Representation of the Findings

9.1 Food & Organic Waste and Paper Waste

Plotting GDP against each waste category further corroborates the previous findings:

GDP versus food & organic waste shows an inverse relationship: the food & organic share of waste tends to fall as GDP rises.
GDP versus paper waste shows a positive relationship: the paper share of waste tends to rise as GDP rises.

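#draw the two scatter plots side by side, reading the columns from big_economies explicitly
par(mfrow = c(1, 2))
with(big_economies, plot(x=`food&organic_waste(%)`, y=gdp, type="p", xlab="% in Total Waste Composition", ylab="GDP in US$", main="GDP VS. Food&Organic Waste", frame.plot=FALSE))
with(big_economies, plot(x=`paper_waste(%)`, y=gdp, type="p", xlab="% in Total Waste Composition", ylab="GDP in US$", main="GDP VS. Paper Waste", frame.plot=FALSE))
par(mfrow = c(1, 1))   #restore the default single-plot layout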
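
The direction and strength of the relationships seen in these scatter plots can also be summarized with a rank correlation. The sketch below uses made-up numbers (`toy` and its columns are hypothetical and are not taken from the World Bank data); with the real data, the same calls could be applied to the corresponding columns of `big_economies`.

```r
# Hypothetical values chosen only to illustrate the calculation
toy <- data.frame(
  gdp       = c(2000, 5000, 12000, 30000, 55000),
  food_pct  = c(58, 52, 47, 36, 30),
  paper_pct = c(6, 9, 12, 20, 24)
)

# Spearman's rho is rank-based, so it does not assume a linear relationship
rho_food  <- cor(toy$gdp, toy$food_pct,  method = "spearman")
rho_paper <- cor(toy$gdp, toy$paper_pct, method = "spearman")

print(rho_food)   # negative when the food share falls as GDP rises
print(rho_paper)  # positive when the paper share rises as GDP rises
```

A value near -1 or 1 indicates a consistently decreasing or increasing relationship, while a value near 0 indicates no monotonic relationship. If the columns contained missing values, `cor()` would also need the argument `use = "complete.obs"`.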

9.2 Plastic Waste

The scatter plot below shows no clear relationship between GDP and plastic waste. This indicates that all countries should contribute to plastic waste reduction.

ggplot(slice, aes(x=`plastic_waste(%)`, y=gdp)) +
  geom_point(color="#4cbea3") +
  labs(title = "Plastic Waste VS. Country's GDP",
       x = "% in Total Waste Composition",
       y = "GDP in US$") +
  theme_economist_white()