Introduction

Air pollution and a country’s socioeconomic status are interconnected and have serious consequences for human health and socioeconomic growth of the country. Air pollution is caused by different contaminants being emitted into the atmosphere. It is linked to respiratory and cardiovascular disorders, as well as mortality rates and environmental damage.

A country’s socioeconomic position, as assessed by metrics such as the Human Development Index (HDI), GDP, and poverty levels, influences environmental regulations, technological breakthroughs, and public health infrastructure and policies. In addition, the socioeconomic gaps between countries can increase the unequal distribution of air pollution loads, hurting the poorer section of society more.

Understanding the link between air pollution and a country’s socioeconomic position is necessary for developing successful environmental policies and public health infrastructure. With this study, We hope to discover trends and potential associations between the two by combining air quality measures such as the Air Quality Index (AQI) and particulate matter (PM2.5) data with socioeconomic variables such as HDI, GDP, and poverty rates. The findings of this study can be used to influence evidence-based policymaking, sustainable development planning, and initiatives to mitigate the effects of air pollution on public health and socioeconomic well-being.

By addressing the socio-economic factors of air pollution, we can work towards creating cleaner and healthier environments for current and future generations, and promote development that is inclusive in nature.

Methadology

In this study to investigate the relationship between air pollution and the socio-economic status of countries, we used three datasets that were downloaded from Kaggle.We cleaned and wrangled these datasets, then merged them as per our requirements.

Next, we conducted exploratory data analysis using data visualization techniques. For this, we used the ggplot2 and leaflet packages to generate graphs and interactive maps. Through these visualizations, we studied air pollution pattern and distribution across the globe, and explored its correlations with socio-economic factors such as Human development index.

Finally to quantify the relationship between air pollution and socio-economic status, we performed exploratory data analysis using box plots and then proceeded to conduct a one-way ANOVA test. With this we studied the potential differences in AQI means based on various HDI groups.

By utilizing box plots, we visually observed the variation in air pollution levels (AQI value) across different HDI groups, and thus gained insights into the variations and trends. Aftre this, we conducted a one-way ANOVA test to statistically assess whether there were significant differences in mean air pollution levels among the HDI groups.

The null hypothesis (H0) was that there’d be no difference in means based on the HDI groups, and the alternative hypothesis (H1) was that there will be differences. The ANOVA test provided a p-value which we used to determine the statistical significance of any observed differences.

In conclusion, our methodology involved merging datasets, conducting exploratory data visualizasion, and performing a one-way ANOVA test to examine the relationship between air pollution and socio-economic status. By doinso we were able to understand the interplay between air pollution and equitable development and obtained meaningful insights into the matter.

#load all libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)   
library(dplyr)     
library(lubridate) 
library(leaflet)   
library(shiny)     
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(rjson)
library(rgdal)
## Loading required package: sp
## The legacy packages maptools, rgdal, and rgeos, underpinning this package
## will retire shortly. Please refer to R-spatial evolution reports on
## https://r-spatial.org/r/2023/05/15/evolution4.html for details.
## This package is now running under evolution status 0 
## Please note that rgdal will be retired during October 2023,
## plan transition to sf/stars/terra functions using GDAL and PROJ
## at your earliest convenience.
## See https://r-spatial.org/r/2023/05/15/evolution4.html and https://github.com/r-spatial/evolution
## rgdal: version: 1.6-7, (SVN revision 1203)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.6.2, released 2023/01/02
## Path to GDAL shared files: C:/Users/tanis/AppData/Local/R/win-library/4.3/rgdal/gdal
##  GDAL does not use iconv for recoding strings.
## GDAL binary built with GEOS: TRUE 
## Loaded PROJ runtime: Rel. 9.2.0, March 1st, 2023, [PJ_VERSION: 920]
## Path to PROJ shared files: C:/Users/tanis/AppData/Local/R/win-library/4.3/rgdal/proj
## PROJ CDN enabled: FALSE
## Linking to sp version:1.6-1
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.
library(gganimate)
## No renderer backend detected. gganimate will default to writing frames to separate files
## Consider installing:
## - the `gifski` package for gif output
## - the `av` package for video output
## and restarting the R session
library(packcircles)
library(viridis)
## Loading required package: viridisLite
library(tidyverse)
library(sf)
## Linking to GEOS 3.11.2, GDAL 3.6.2, PROJ 9.2.0; sf_use_s2() is TRUE
library(leaflet)
library(shiny)
library(geojsonsf)
library(tmap)
library(tmaptools)
library(tidyverse)
library(repr)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
## ✔ broom        1.0.4     ✔ rsample      1.1.1
## ✔ dials        1.2.0     ✔ tune         1.1.1
## ✔ infer        1.0.4     ✔ workflows    1.1.3
## ✔ modeldata    1.1.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.0     ✔ yardstick    1.2.0
## ✔ recipes      1.0.6     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ plotly::filter()  masks dplyr::filter(), stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ infer::observe()  masks shiny::observe()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(tidyverse) 
library(ggplot2)  
library(dplyr)     
library(lubridate) 
library(leaflet)   
library(shiny)     
library(plotly)
library(rjson)
library(rgdal)
library(gganimate)
library(packcircles)
library(viridis)

Downloading, cleaning and wrangling datasets

#reading csv files downloaded from Kaggle
pollution <- read_csv("air_pollution.csv")
## Rows: 23463 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Country, City, AQI Category, CO AQI Category, Ozone AQI Category, N...
## dbl (5): AQI Value, CO AQI Value, Ozone AQI Value, NO2 AQI Value, PM2.5 AQI ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
capital <-  read_csv("country-capitals.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 245 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): CountryName, CapitalName, CapitalLatitude, CountryCode, ContinentName
## dbl (1): CapitalLongitude
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
HDI <- read_csv("HDI.csv")
## Rows: 195 Columns: 880
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (4): ISO3, Country, Human Development Groups, UNDP Developing Regions
## dbl (876): HDI Rank (2021), Human Development Index (1990), Human Developmen...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(HDI)
Pollution_df <- data.frame(pollution)
Capitals_df <- data.frame(capital)
HDI_df <- data.frame(HDI)
#Dataset cleaning and wrangling

colnames(Capitals_df)[colnames(Capitals_df) == "CountryName"] <- "Country"
colnames(Capitals_df)[colnames(Capitals_df) == "CapitalLatitude"] <- "latitude"
colnames(Capitals_df)[colnames(Capitals_df) == "CapitalName"] <- "City"
colnames(Capitals_df)[colnames(Capitals_df) == "CapitalLongitude"] <- "longitude"
colnames(Capitals_df)[colnames(Capitals_df) == "CapitalLongitude"] <- "longitude"


temp_HDI <- HDI_df |> select ("Country", "Human.Development.Groups", "HDI.Rank..2021.", "Human.Development.Index..2021.")

clean_HDI_df <- na.omit(temp_HDI)
#Generating new datasets as per our requirements by merging ones downloaded from kaggle 

Temp_Merge_1 <- merge(Pollution_df, Capitals_df,
                      by = "City", all = FALSE)


#Delete row with missing values
Clean_Pollution_df <-  na.omit(Temp_Merge_1)
colnames(Clean_Pollution_df)[colnames(Clean_Pollution_df) == "Country.x"] <- "Country"
#Remove rows with missing values


HDI_Pollution_df <-merge(Pollution_df, clean_HDI_df, by = "Country",
                         all = FALSE)
Clean_HDI_Pollution_df <-  na.omit(HDI_Pollution_df)
#Dataset cleaning and wrangling
colnames(Clean_Pollution_df)[colnames(Clean_Pollution_df) == "Country.x"] <- "Country"

Bar Plot of countries with highest AQI and countries with lowest AQI

We will now start visualising our datasets. First, let’s start by using ggplot to make bar plots for top 25 countries with highest AQI value, and bottom 25 countries with lowest AQI Value.

# Selecting top 25 countries based on AQI
highest_countries <- Clean_Pollution_df %>% top_n(25, wt = `AQI.Value`)

highest_countries
# Selecting bottom 25 countries based on AQI
lowest_countries <- Clean_Pollution_df %>%
  arrange(`AQI.Value`) %>%
  head(25)
# using ggplot for bar plot
lowest_plot <- ggplot(lowest_countries, aes(x = reorder(Country, `AQI.Value`), y = `AQI.Value`)) +
  geom_bar(stat = "identity", fill = "#0000FF") +
  labs(x = "Country", y = "AQI Value", title = "Countries with Lowest AQI (Bottom 25)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

lowest_plot

# using ggplot for make bar plot
highest_plot <- ggplot(highest_countries, aes(x = reorder(Country, `AQI.Value`), y = `AQI.Value`)) +
  geom_bar(stat = "identity", fill = "#FF0000") +
  labs(x = "Country", y = "AQI Value", title = "Countries with Highest AQI (top 25)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

highest_plot

Shiny Application – 1: countries and concentration of various pollutants

This app lets users to select an air quality parameter/pollutant and view its distribution across the world on an interactive map.

#load geojson file and convert it to sf object to get polygon data for map
countries.json <- fromJSON(file = "countries.geo.json")
countries <- geojson_sf(toJSON(countries.json))

# clean, Wrangle, and merge with pollution dataset to get appropriate dataset that we can use to make our map
colnames(countries)[colnames(countries) == "name"] <- "Country"
countries$Country <- gsub("United States of America", "United States", countries$Country)
merged <- merge(countries, Clean_Pollution_df, by = "Country", all = FALSE)

# UI shiny app
ui <- fluidPage(
  titlePanel(""),
  sidebarLayout(
    sidebarPanel(selectInput("column", "Select a column:", choices = c(
        "CO.AQI.Value", "Ozone.AQI.Value",  "NO2.AQI.Value", "PM2.5.AQI.Value"))),
    mainPanel( leafletOutput(outputId = 'map'), tags$p("Map 1 : AQI levels across the world") ) ))


#server for shiny app
server <- function(input, output) {
   map_df <- reactive({
    merged %>%
      select(latitude, longitude, column = input$column) %>%
      mutate(column = as.numeric(column)) %>%
      filter(!is.na(longitude) & !is.na(latitude)) %>%
      st_as_sf(coords = c("longitude", "latitude")) %>%
      st_set_crs(4326)})

  output$map <- renderLeaflet({
    data <- map_df()
    column_values <- data$column
    breaks <- quantile(column_values, probs = seq(0, 1, 0.05))
    color_palette <- colorNumeric(palette = "Spectral", domain = column_values)
    
    leaflet() %>%
      addTiles() %>%
      setView(lng = -5, lat = 55, zoom = 1) %>%
      addPolygons(data = data,
                  fillColor = ~color_palette(column),
                  color = "black",
                  weight = 1,
                  opacity = 1,
                  fillOpacity = 0.8) %>%
      addLegend("bottomleft", pal = color_palette, values = column_values, 
                title = input$column, opacity = 0.8)})}

# Run shiny app
shinyApp(ui, server)
Shiny applications not supported in static R Markdown documents

Now that we know what the distribution of various air pollutants looks like across the world, let’s begin to study the potential relationship between air pollution and the socio-economic status of countries.

Shiny Application – 2: world cities and relation between their country’s HDI rank and their AQI values

This shiny app has an interface with a slider input for selecting HDI rank ranges. Users can select the range of countries they want as per level of development (interval of lower ranks for developed countries, interval of higher ranks for less developed countries). The map has circle markers representing cities from within the chose HDI rank interval, colored based on their AQI values.

# loadind world cities dataset
World_Cities <- read_csv("worldcities.csv")
## Rows: 44691 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): city, city_ascii, country, iso2, iso3, admin_name, capital
## dbl (4): lat, lng, population, id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
World_Cities_temp <- data.frame(World_Cities)
World_Cities_df <- select(World_Cities_temp, c("city", "lat", "lng"))

# clean, wrangle and merge to get desirable data set
colnames(World_Cities_df)[colnames(World_Cities_df) == "lng"] <- "longitude"
colnames(World_Cities_df)[colnames(World_Cities_df) == "lat"] <- "latitude"
colnames(World_Cities_df)[colnames(World_Cities_df) == "city"] <- "City"

temp_for4 <-merge(Clean_HDI_Pollution_df, World_Cities_df, by = "City", all = FALSE)
clean_for4 <-  na.omit(temp_for4)



#Ui for shiny app 2
ui <- fluidPage(
  titlePanel(""),
  sidebarLayout(
    sidebarPanel(
      sliderInput('HDI.Rank..2021.', 'HDI.Rank..2021.', min = 1, max = 190, value = c(1,10)) ),
    mainPanel( leafletOutput(outputId = 'map'),
               tags$p("Map 2 : world cities and relation between their country's HDI rank and their AQI Values")) ))

# server for shiny app
server <- function(input, output) {
  map_df <- reactive({
    clean_for4 %>%
      filter(HDI.Rank..2021. > input$HDI.Rank..2021.[1] & HDI.Rank..2021. < input$HDI.Rank..2021.[2]) %>%
      filter(!is.na(longitude) & !is.na(latitude)) %>%
      st_as_sf(coords = c("longitude", "latitude")) %>%
      st_set_crs(4326)})

output$map <- renderLeaflet({
    colorPalette <- colorQuantile(palette =  c("yellow", "green", "orange", "red"), domain = map_df()$AQI.Value)
    
    leaflet() %>%
      addTiles() %>%
      setView(lng = -5, lat = 54, zoom = 1) %>%
      addCircleMarkers(data = map_df(), radius = 1, color = ~colorPalette(AQI.Value),      fillOpacity = 0.6) %>%
      addLegend(position = "bottomright", pal = colorPalette, values = map_df()$AQI.Value,
                title = "AQI value", opacity = 0.6) })}
shinyApp(ui, server)
Shiny applications not supported in static R Markdown documents

The key takeaway from this map is that there is a noticeable pattern where cities in countries known to be developed with very high HDI rankings, typically those in the West, Australia, and Europe, are marked with yellow/green circles indicating low AQI values. Another interesting thing to notice is that cities in extremely poor underdeveloped countries, typically those in Africa and some parts of Asia are also marked with yellow/green circles indicating low AQI values. On the other hand, cities in countries that fall in the mid-range of HDI ranks, mostly developing countries in South Asia, few in Africa, and some in South America, have higher AQI values and are marked with red/orange circles. This observation suggests that there may be a link between the socio-economic development of a country and its air quality since from this map we can see that developed countries with a well established infrastructure, and under-developed countries with fewer economic activities and industries generally have better AQI (green/yellow circles) as compared to developing countries with booming uncontrolled industrial expansion. Let’s study this link further.

Scatter plot – Visualise the relationship between PM2.5 AQI values and Ozone AQI values

This scatter plot displays the distribution of PM2.5 AQI values on the x-axis and Ozone AQI values on the y-axis. Each data point represents a city, and the color of the point indicates the Human Development Group of the country it belongs to.

ggplot(Clean_HDI_Pollution_df, aes(x = PM2.5.AQI.Value, y = Ozone.AQI.Value, color = Human.Development.Groups)) +
  geom_point(shape = 16, size = 1) +
  labs(x = "PPM", y = "AQI") +
  theme_minimal() 

The plot reveals distinct clusters of data points.The countries with lower pollution levels (bottom left) are primarily found in two clusters: those belonging to the low HDI group and those in the very high HDI group (Purple and Green). In this cluster there were also a few occurance of those in the High HDI group (Blue). The other cluster is observed as we move to the right-top section of the graph (ie. as ozone and PM level increases). This cluster mainly contains cities of the medium HDI group, and some of the high HDI group. The first cluster (ie. very high and low) predominantly constitute cities from developed and underdeveloped countries, while the second cluster (Ie. mainly medium, and some high) are the cities in developing countries. This clustering, and where they lie on the map is further evidence of there being a potential relationship between the socio-economic development of a country and its air quality, where developing countries with rapidly growing industries and urbanisation have the poorest air quality.

Statistical analysis

box plots to visualize the distribution of AQI values across Human Development Groups

Before conduction one-way Anova test, let’s plot a box plot to understand the relationship between HDI groups and air quality.

library(stats)

boxplot(Clean_HDI_Pollution_df$AQI.Value~factor(Clean_HDI_Pollution_df$Human.Development.Groups),
        xlab = "Human.Development.Groups", ylab = "AQI.Value")

We see that AQI values for High, very high, and low HDI groups (typically developed and under developed countries) is low, whereas it is the highest for HDI groups in the medium category (typically developing countries with a booming industrialisation). The whiskers also support the claim - developing countries have highest AQI values and air pollution.

ANOVA test

Finally we perform the ANOVA test to examine the relationship between AQI values and Human Development Groups. The AQI values are categorized according to the different Human Development Groups and the test is used to determine if there are significant differences in AQI values among these groups.

ANOVA_HDI_Pollution <- aov(AQI.Value ~ Human.Development.Groups, data = Clean_HDI_Pollution_df)

summary(ANOVA_HDI_Pollution)
##                             Df   Sum Sq Mean Sq F value Pr(>F)    
## Human.Development.Groups     3 16945902 5648634    2326 <2e-16 ***
## Residuals                22256 54051814    2429                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null hypothesis: There is no significant difference in the mean AQI values among the different Human Development Groups.

Alternative hypothesis: There is a significant difference in the mean AQI values among the different Human Development Groups.

Here we see that the P-value is 2e-16, which is very close to 0. This is strong evidence against the null hypothesis. Thus, we reject it and conclude that there are significant differences in the mean AQI values among the Human Development Groups. From the box plot we can also see that the highest mean AQI and range is for the medium HDI group typically constituting developing countries. With this we further establish that it is developing countries that face the most air pollution due to fast growing industries and urbanisation.

Limitations

  1. AQI may not be the best measure of air quality. AQI values can also vary across countries and regions due to differences in monitoring systems and pollutant sources.
  2. The datsets used didn’t have enough data on Russia, Canada, and Greenland.
  3. Air quality can vary over time due to seasonal variations, industrial activities, and policy interventions, and our analysis might not fully capture these temporal variations.
  4. Other limitations may include the exclusion of certain confounding factors,and any biases present in the data sources utilized.

Summarising what we found

In summary, our study revealed that there are variations in air pollution levels across different Human Development Groups (HDI), confirming there being a relation between a country’s socioeconomic status and air pollution levels. Countries with very high HDI, typically well developed countries in Europe, North America, and Australia, and countries with vert low level HDI, typically under developed countries in Africa and some parts of Asia demonstrated high levels of Air quality. In contrast those with medium HDI, typically developing countries in Asia, some parts of Africa and South America showed higher pollution levels. This can be attributed their booming industrial expansion and urbanisation. The ANOVA test confirmed a significant difference in AQI values among the HDI groups (p < 2e-16), allowing us to reject out Null hypotesis – There is no significant difference in the mean AQI values among the different Human Development Groups.

Implications and policy changes

From our study we conclude that policy makers in developing countries should attempt to control air pollution resulting from industrialization. By implementing environmental regulations, investing in cleaner technologies, and promoting sustainable practices, we can reduce air pollution associated with uncontrolled industrial growth and urbanization. These findings provide valuable insights for policy makers to develop necessary strategies and policies aimed at reducing air pollution, protecting public health, and ensuring a sustainable future for the next generation.