Introduction and Source:

Growing up in Vietnam, I carried a mask with me everywhere, day and night. I had always assumed that this was simply what the world and its air were supposed to be like. However, when I had the opportunity to leave my country, I learned that low- and middle-income countries like Vietnam suffer from the highest exposure to air pollution. Many people do not know that the air they breathe every day exceeds WHO guideline limits and contains high levels of pollutants. I want to bring this global health issue to the audience’s attention and raise awareness of it, which is why I chose this global air pollution dataset.

The data were sourced from https://www.kaggle.com/datasets/hasibalmuzdadid/global-air-pollution-dataset. They were collected from Elichens in 2022 by web scraping and lightly preprocessed by hand. The dataset contains AQI values of different pollutants for many cities around the world.

The global air pollution dataset has 23,463 observations and 12 variables. For each city, it provides geolocated information (country and city), the overall Air Quality Index (AQI) value and category, and the AQI value and category of each pollutant: carbon monoxide (CO), ozone (O3), nitrogen dioxide (NO2), and particulate matter 2.5 (PM2.5), which refers to particles with a diameter of 2.5 micrometers or less.

According to the World Health Organization (2023), air pollution is the contamination of the indoor or outdoor environment by any chemical, physical or biological agent that modifies the natural characteristics of the atmosphere. The U.S. AQI is EPA’s index for reporting air quality (AirNow, 2023). The higher the AQI value, the greater the level of air pollution and the greater the health concern. The AQI is divided into six categories which are good, moderate, unhealthy for sensitive groups, unhealthy, very unhealthy, and hazardous.
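As a minimal sketch of how the six categories relate to AQI values, the helper below bins values with base R's cut(). The function name aqi_category and the example values are hypothetical and not part of the original analysis; the breakpoints (0-50, 51-100, 101-150, 151-200, 201-300, 301 and above) are the ones published on the AirNow AQI Basics page cited in the bibliography.

```r
# Hypothetical helper: map U.S. AQI values to the six AQI categories.
# Breakpoints: 0-50, 51-100, 101-150, 151-200, 201-300, 301+.
aqi_category <- function(aqi) {
  cut(aqi,
      breaks = c(-Inf, 50, 100, 150, 200, 300, Inf),
      labels = c("Good", "Moderate", "Unhealthy for Sensitive Groups",
                 "Unhealthy", "Very Unhealthy", "Hazardous"),
      right = TRUE)  # intervals are closed on the right, e.g. (50, 100]
}

# Illustrative values only, not taken from the dataset
example_values <- c(29, 96, 163, 301)
as.character(aqi_category(example_values))
```

The same helper could be applied to a column of mean AQI values to attach a category label to each country.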

The question I ask in this final project is: “What is the overview of air quality around the world and among countries?” To answer it, I create two visualizations: a global AQI world map and a scatter plot of the distribution of PM2.5 AQI value by category. Plotly is used in both to add interactivity. Since each country has many cities, I use group_by to group the observations by country and create a new dataset with the mean AQI value and the mean AQI value of each pollutant. At first I used the world_map dataset, but its longitude and latitude were not accurate, so the generated map did not look right. I am thankful for my professor’s suggestion to try the rnaturalearth and rnaturalearthdata packages and plot the countries on a map using simple features. It worked! I then merged the two datasets after renaming sovereignt in the world dataset to Country, and started using the merged data for visualization. In addition, I create scatter plots to analyze the statistical relationships between different variables.

Loading needed packages

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
library(maps)
## 
## Attaching package: 'maps'
## 
## The following object is masked from 'package:purrr':
## 
##     map
library(tidyr)
library(reshape2)
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(rnaturalearth)
library(rnaturalearthdata)
## 
## Attaching package: 'rnaturalearthdata'
## 
## The following object is masked from 'package:rnaturalearth':
## 
##     countries110
library(sf)
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggmap':
## 
##     wind
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo 
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(grid)
library(jpeg)
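# Read the Google Maps API key from an environment variable instead of hard-coding it.
# GOOGLE_MAPS_API_KEY is an assumed variable name; set it outside the script (e.g. in .Renviron).
register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY"))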

Add an air pollution image

# Set the file path of the image
img_path <- "/Users/Linh/Desktop/Air-pollution-in-India.jpeg"

# Read in the image file
img <- readJPEG(img_path)

# Set the plot dimensions
par(mar = rep(0,4), xaxs = "i", yaxs = "i")
plot(0, 0, type = "n", xlim = c(0, 1), ylim = c(0, 1), xaxt = "n", yaxt = "n", bty = "n", ann = FALSE)

# Get the limits of the plot region: c(xleft, xright, ybottom, ytop)
usr <- par("usr")

# Display the image, stretched to fill the whole plot region
rasterImage(img, usr[1], usr[3], usr[2], usr[4], interpolate = FALSE)

The current 2005 guidelines do little to protect the estimated four million children per year who develop asthma due to NO2 exposure, say experts. Photo: Media India Group

Set working directory and load the dataset

setwd("/Users/Linh/Desktop/DATASETS ")
global_air_pollution <- read_csv("global air pollution dataset.csv")
## Rows: 23463 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Country, City, AQI Category, CO AQI Category, Ozone AQI Category, N...
## dbl (5): AQI Value, CO AQI Value, Ozone AQI Value, NO2 AQI Value, PM2.5 AQI ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Create a new dataset called grouped_global_air_pollution by grouping by Country and computing the mean of the overall AQI Value and of the CO, Ozone, NO2, and PM2.5 AQI Values.

grouped_global_air_pollution <- global_air_pollution %>%
  group_by(Country) %>%
  summarize(mean_AQI_Value = mean(`AQI Value`),
           mean_CO_AQI_Value = mean(`CO AQI Value`),
           mean_Ozone_AQI_Value = mean(`Ozone AQI Value`),
           mean_NO2_AQI_Value = mean(`NO2 AQI Value`),
           mean_PM2.5_AQI_Value = mean(`PM2.5 AQI Value`))
grouped_global_air_pollution
## # A tibble: 176 × 6
##    Country     mean_AQI_Value mean_CO_AQI_Value mean_Ozone_AQI_Value
##    <chr>                <dbl>             <dbl>                <dbl>
##  1 Afghanistan           96.0             0.592                 40.2
##  2 Albania               68.2             1                     42.1
##  3 Algeria               88.2             1.92                  47.2
##  4 Andorra               29.3             0.667                 29.3
##  5 Angola                83.9             3.15                  22.7
##  6 Argentina             28.2             0.353                 15.5
##  7 Armenia               53.6             0.864                 34.4
##  8 Aruba                163               0                     23  
##  9 Australia             33.6             0.212                 22.1
## 10 Austria               53.7             0.941                 36.0
## # ℹ 166 more rows
## # ℹ 2 more variables: mean_NO2_AQI_Value <dbl>, mean_PM2.5_AQI_Value <dbl>

Load world data

world <- ne_countries(scale = "medium", returnclass = "sf")

Rename the variable sovereignt to Country so that the world data can be merged with grouped_global_air_pollution later.

world_map <- world %>% rename(Country = sovereignt)

Merge the two datasets by Country

merged_data <- right_join(world_map, grouped_global_air_pollution, by = "Country")
merged_data <- merged_data[complete.cases(merged_data$scalerank), ]
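Because the join matches on country names, any country spelled differently in the two tables drops out of the map. A minimal sketch of how such mismatches could be listed with dplyr's anti_join is shown below. The two small tables and their country names are made up for illustration and are not taken from either dataset.

```r
library(dplyr)

# Toy stand-ins for the map table and the AQI table (hypothetical names)
toy_map <- tibble(Country = c("Alpha", "Beta Republic", "Gamma"))
toy_aqi <- tibble(Country = c("Alpha", "Republic of Beta", "Gamma"),
                  mean_AQI_Value = c(40, 120, 75))

# Rows of the AQI table whose Country has no match in the map table
unmatched <- anti_join(toy_aqi, toy_map, by = "Country")
unmatched$Country
```

With the real data, the same call on grouped_global_air_pollution and world_map (after dropping the geometry with sf::st_drop_geometry) would list the countries that need renaming before the merge.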

Create the Global Air Quality Index by Country

ggplot2 and sf are combined to create the map, which is filled by the mean AQI value. I add a title and a gradient color scale, and use plotly to show the country name and mean AQI value of each country.

p <- ggplot(data = merged_data) +
  geom_sf(aes(fill = mean_AQI_Value, text = Country), lwd = 0.1) +
  ggtitle("Global Air Quality Index by Country") +
  scale_fill_gradient(low = "green", high = "red")
## Warning in layer_sf(geom = GeomSf, data = data, mapping = mapping, stat = stat,
## : Ignoring unknown aesthetics: text
p <- ggplotly(p, tooltip = c("Country", "text", "mean_AQI_Value"))
p

A brief paragraph about what this map visualization shows:

This visualization uses ggplot2 and sf to create a world map displaying the mean Air Quality Index value by country. At first glance, the audience can see that Asia and Africa have high air pollution levels, followed by countries in the Eastern Mediterranean. Pakistan appears to be one of the countries with the worst air pollution. Notably, the air quality in Mauritania in Africa is considered unhealthy. Europe, the Americas, and the Western Pacific are observed to have the lowest air pollution.

Statistical Analysis

Create scatter plots to analyze the relationships between different variables

Find the correlation between the two variables AQI Value and PM2.5 AQI Values

cor(global_air_pollution$`AQI Value`, global_air_pollution$`PM2.5 AQI Value`)
## [1] 0.9843266
ggplot(global_air_pollution, aes(x = `AQI Value`, y = `PM2.5 AQI Value`)) + 
  geom_point() +
  xlab("AQI Value") +
  ylab("PM2.5 Value") +
  ggtitle("Scatter plot of AQI Value vs. PM2.5 Value")
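The coefficient returned by cor() is, by default, Pearson's r. As a small sketch of what that number measures, the snippet below computes r from its definition on two made-up vectors and compares it with cor(). The vectors x and y are illustrative only and are not drawn from the dataset.

```r
# Illustrative vectors (not from the dataset)
x <- c(10, 20, 30, 40, 50)
y <- c(12, 18, 33, 41, 48)

# Pearson's r: covariation of x and y divided by the product of their spreads
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# cor() uses method = "pearson" by default, so the two should agree
all.equal(r_manual, cor(x, y))
```

A value close to 1 means the points lie near an upward-sloping straight line, which is what the scatter plot of AQI Value against PM2.5 AQI Value suggests.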

Scatter plot of AQI Value vs. Ozone AQI Value

ggplot(global_air_pollution, aes(x = `AQI Value`, y = `Ozone AQI Value`)) + 
  geom_point() +
  xlab("AQI Value") +
  ylab("Ozone AQI Value") +
  ggtitle("Scatter plot of AQI Value vs. Ozone AQI Value")

Scatter plot of Ozone AQI Value vs. PM2.5 AQI Value

ggplot(global_air_pollution, aes(x = `Ozone AQI Value`, y = `PM2.5 AQI Value`)) + 
  geom_point() +
  xlab("Ozone AQI Value") +
  ylab("PM2.5 AQI Value") +
  ggtitle("Scatter plot of Ozone AQI Value vs. PM2.5 Value")

Filter the data to only include PM2.5 AQI values and remove NAs

PM2.5_data <- global_air_pollution %>% 
  filter(!is.na(`PM2.5 AQI Value`))

Create the second visualization about the Distribution of PM2.5 AQI Value by Category

There are too many countries, so the labels overlap and are hard to read. I use coord_flip and reduce the text size to make the country names more visible. Plotly is also used to show the country and PM2.5 value information.

ggplot(PM2.5_data, aes(x = Country, y = `PM2.5 AQI Value`, color = Country)) +
  geom_boxplot() +
  guides(color = "none") +
  labs(title = "Distribution of PM2.5 AQI Value by Country", y = "PM2.5 AQI Value", x = "Country") +
  coord_flip() +
  theme(text = element_text(size = 5))

Because the boxplot is unreadable, I decided to create a scatter plot showing PM2.5 AQI Value by Category instead.

gg <- ggplot(PM2.5_data, aes(x = `PM2.5 AQI Value`, y = `PM2.5 AQI Category`, color = `PM2.5 AQI Category`)) +
  geom_point(aes(text = paste("Country: ", Country, "<br>PM2.5 AQI Value: ", `PM2.5 AQI Value`, "<br>PM2.5 AQI Category: ", `PM2.5 AQI Category`))) +
  theme(text = element_text(size = 10), axis.text.y = element_text(size = 8)) +
  labs(title = "Distribution of PM2.5 AQI Value by Category") +
  scale_color_manual(values=c("Good" = "green", "Hazardous" = "red", "Moderate" = "orange", "Unhealthy" = "purple", "Unhealthy for Sensitive Groups" = "yellow", "Very Unhealthy" = "brown"))
## Warning in geom_point(aes(text = paste("Country: ", Country, "<br>PM2.5 AQI
## Value: ", : Ignoring unknown aesthetics: text
ggplotly(gg, tooltip = c("text"))

A brief paragraph about what this visualization shows:

The chart shows the AQI category of particulate matter with a diameter of 2.5 micrometers or less (PM2.5) for cities in different countries. In the Good category, air quality is satisfactory and air pollution poses little or no risk. Australia topped the list as the least polluted country, with a very low level of PM2.5, followed by European countries such as Finland and Sweden, and by Japan in Asia. In the Moderate category, air quality is acceptable. In the Unhealthy for Sensitive Groups category, members of sensitive groups may experience health effects. In the Unhealthy category, everyone may begin to experience health effects. Very Unhealthy corresponds to a health alert, and Hazardous to health warnings of emergency conditions. Mousing over the points shows that India, the Russian Federation, and Pakistan are the three countries with the highest PM2.5 AQI levels, which puts them in the Hazardous zone.

Conclusion

There is a strong linear relationship between AQI Value and PM2.5 AQI Value. The correlation coefficient between the two variables is 0.98, which indicates a strong positive correlation: as AQI Value increases, PM2.5 AQI Value tends to increase as well. Air pollution is one of the world’s largest health and environmental problems. The global map and the scatter plot give the audience a general picture of the world’s air pollution, with specific values for each country and region. Because the list of countries is so long, a map is the better way to present it. When I tried a boxplot, it did not really work because the country labels overlapped, and even the scatter plot takes quite some time to render. One thing I would like to improve is to make the world map more informative by also displaying the AQI category, so that, for example, an AQI value of 179 would also be shown as Unhealthy for that country.

Bibliography:

AirNow.gov, U.S. EPA. (n.d.). AQI Basics. AirNow.gov. Retrieved May 6, 2023, from https://www.airnow.gov/aqi/aqi-basics/

World Health Organization. (n.d.). Air Pollution. World Health Organization. Retrieved May 6, 2023, from https://www.who.int/health-topics/air-pollution#tab=tab_1