Introduction

This report provides an analysis of air quality data, focusing on four pollutants: PM10 (particulate matter), NO (nitric oxide), NO2 (nitrogen dioxide) and NdiOx (nitrogen oxides as NO2). The datasets cover 2018-2023 and were processed to highlight pollutant levels on specific dates and to obtain monthly pollutant averages for 2020. I also made my own observations by identifying long-term changes, particularly from 2021 to 2023, when the Clean Air Zone (CAZ) was implemented.

Flex Dashboard containing my 5 Plots

Click here for the interactive Flex Dashboard

Loading required libraries

library(flexdashboard) #: for implementing dashboard
library(knitr)         #: for report formatting
library(tidyverse)     #: for data manipulation
library(plotly)        #: for interactive visualization
library(viridis)       #: for color scale for plots
library(rmarkdown)     #: for report formatting
library(htmlwidgets)   #: for report formatting
library(tinytex)       #: for creating the PDF file from R Markdown

Step One: Data Cleaning

In this step, before importing my files, I examined the structure of the CSV files in Excel and observed that the first four rows contained metadata rather than actual data. According to Hadley Wickham’s R for Data Science [1], a recommended approach for handling such metadata is to use the skip parameter when reading the file, so I passed skip = 4 to the read.csv() function. Since the same cleaning and renaming steps were required for multiple CSV files, I wrapped them in a function, following recommendations from a tutorial by Jonathan Ng [2], to avoid redundancy and improve efficiency. For clarity and consistency, I replaced the original, complex column names with shorter labels (e.g., “PM.sub.10..sub..particulate.matter..Hourly.measured.” to “PM10”). Additionally, I converted instances of “24:00” to “00:00” so that the times could be parsed, created a datetime column for accurate timestamp tracking, and finally selected the relevant columns. References:
[1] H. Wickham, R for Data Science. Available at: https://r4ds.had.co.nz/data-import.html
[2] Q&A R Script to R Shiny Flexdashboard, ggplot, reproducibility, functional programming, rmarkdown by Jonathan Ng. Available at: https://youtu.be/9ka4cvA9GY0?si=aQrX_PVnMvzGvp3E
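
As a small illustration of the skip argument described above, the sketch below writes a made-up file with four metadata lines to a temporary location and reads it back. The file contents and the names raw, tmp and demo are assumptions for illustration only and are not taken from the real datasets.

```r
# Made-up file contents: four metadata lines, then a header row and two data rows.
raw <- c(
  "Site: example station",
  "Source: illustrative only",
  "Units: ug/m3",
  "Status: provisional",
  "Date,time,Nitric.oxide",
  "01-01-2020,01:00,5.2",
  "01-01-2020,02:00,4.8"
)

tmp <- tempfile(fileext = ".csv")
writeLines(raw, tmp)

# skip = 4 discards the metadata so that line 5 is read as the header.
demo <- read.csv(tmp, skip = 4)

# Rename the long column name to a short label (base R equivalent of rename()).
names(demo)[names(demo) == "Nitric.oxide"] <- "NO"
print(demo)
```

Without skip = 4, the first metadata line would be treated as the header and the column names would be wrong.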

#Function that reads one file, skips the four metadata rows, renames the
#pollutant columns and builds a datetime column
clean <- function(file) {
  read.csv(file, skip = 4) %>% 
    rename(PM10 = "PM.sub.10..sub..particulate.matter..Hourly.measured.",
           NO = "Nitric.oxide" ,
           NO2 = "Nitrogen.dioxide",
           NdiOx = "Nitrogen.oxides.as.nitrogen.dioxide") %>% 
    mutate(time = ifelse(time == "24:00", "00:00", time), 
           datetime = as.POSIXct(paste(Date, time), format="%d-%m-%Y %H:%M", 
                                 tz = "UTC")) %>% 
    select(datetime, PM10, NO, NO2, NdiOx)
}
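
One caveat with replacing “24:00” by “00:00” is that the reading is stamped at the start of the same day rather than at the end of it. If that matters for the analysis, a possible alternative is to roll such readings over to midnight of the following day. The helper parse_hourly below is hypothetical and not part of the original script; it is a minimal sketch that assumes the same day-month-year format used in clean().

```r
# Hypothetical helper: parse date and time strings, treating "24:00" as
# midnight at the start of the following day.
parse_hourly <- function(date, time) {
  is_24 <- time == "24:00"
  time_fixed <- ifelse(is_24, "00:00", time)
  ts <- as.POSIXct(paste(date, time_fixed),
                   format = "%d-%m-%Y %H:%M", tz = "UTC")
  # Add one day (86400 seconds) only to the readings that were "24:00".
  ts + ifelse(is_24, 86400, 0)
}

stamps <- parse_hourly(c("31-12-2018", "31-12-2018"), c("23:00", "24:00"))
format(stamps, "%Y-%m-%d %H:%M")
```

With this approach the last reading of 31 December 2018 is stamped at 00:00 on 1 January 2019 instead of 00:00 on 31 December 2018.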

Step Two: Importing Files, Combining and Handling Missing Data

In this step, I stored the paths of my CSV files in a vector, applied the clean function to each file and merged the results into one dataset. To address the missing values, I chose mean imputation, a technique in which missing values in a column are replaced with the mean of that column’s non-missing values. Mean imputation is particularly useful when only a small percentage of the data is missing and the affected rows cannot be removed without losing valuable insights. According to a Medium article by Pingsubhak [2], this approach is effective for datasets where missing values occur sporadically, especially when extreme outliers are absent. On examining my datasets, I found that pollutant readings were missing only at specific timestamps within a day, rather than over long unrecorded periods, and that the data did not contain extreme outliers. Replacing missing values with the column mean therefore filled the gaps without noticeably skewing the overall distribution. However, if an entire row consisted of missing values, I discarded it, as such rows would not contribute meaningful data to the analysis [1]. References:
[1] DataCamp, Handling Missing Values in R. Available at: https://www.datacamp.com/tutorial/na-rm-in-r
[2] Pingsubhak, Handling Missing Values in Datasets: 7 Methods You Need to Know [Medium]. Available at: https://medium.com/@pingsubhak/handling-missing-values-in-dataset-7-methods-that-you-need-to-know-5067d4e32b62

#Listing my dataset files
files <- c("C:\\Users\\WINDOWS11\\Desktop\\DataAVAssessement\\DataSets\\POAR_2018.csv",
           "C:\\Users\\WINDOWS11\\Desktop\\DataAVAssessement\\DataSets\\POAR_2019.csv",
           "C:\\Users\\WINDOWS11\\Desktop\\DataAVAssessement\\DataSets\\POAR_2020.csv",
           "C:\\Users\\WINDOWS11\\Desktop\\DataAVAssessement\\DataSets\\POAR_2021.csv",
           "C:\\Users\\WINDOWS11\\Desktop\\DataAVAssessement\\DataSets\\POAR_2022.csv",
           "C:\\Users\\WINDOWS11\\Desktop\\DataAVAssessement\\DataSets\\POAR_2023.csv")

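#Cleaning each file and combining all rows into one data frame.
#Only rows where every pollutant is missing are dropped here; drop_na() would
#remove every row with any NA and leave nothing for the imputation step below.
data_list <- files %>% 
  lapply(clean) %>% 
  bind_rows() %>% 
  filter(!if_all(c(PM10, NO, NO2, NdiOx), is.na))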
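
The file vector above hard-codes absolute Windows paths, which only work on one machine. A possible alternative, sketched below under stated assumptions, is to collect the files with list.files(). The folder name DataSets_demo and the empty placeholder files are hypothetical; a temporary directory stands in for the real DataSets folder.

```r
# Hypothetical folder standing in for the real DataSets directory.
data_dir <- file.path(tempdir(), "DataSets_demo")
dir.create(data_dir, showWarnings = FALSE)

# Create empty placeholder files so the example is self-contained.
file.create(file.path(data_dir, paste0("POAR_", 2018:2020, ".csv")))
file.create(file.path(data_dir, "notes.txt"))

# Keep only files named POAR_<four digits>.csv, returned with their full paths.
files_demo <- list.files(data_dir,
                         pattern = "^POAR_[0-9]{4}\\.csv$",
                         full.names = TRUE)
basename(files_demo)
```

The resulting vector can be passed to lapply(clean) in the same way as the hand-written one.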

#Using mean imputation to get rid of null values
data_list <- data_list %>% 
  mutate(PM10 = ifelse(is.na(PM10), mean(PM10, na.rm = TRUE), PM10),
         NO = ifelse(is.na(NO), mean(NO, na.rm = TRUE), NO),
         NO2= ifelse(is.na(NO2), mean(NO2, na.rm = TRUE), NO2),
         NdiOx = ifelse(is.na(NdiOx), mean(NdiOx, na.rm = TRUE), NdiOx)
  )
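
To make the two rules concrete (discard rows that are entirely missing, then fill the remaining gaps with the column mean), here is a minimal base R sketch on made-up readings. The data frame toy and its values are illustrative assumptions, not values from the datasets.

```r
# Made-up readings with gaps; row 4 has no pollutant values at all.
toy <- data.frame(
  PM10 = c(10, NA, 14, NA),
  NO   = c(2, 4, NA, NA),
  NO2  = c(NA, 8, 10, NA)
)
pollutants <- c("PM10", "NO", "NO2")

# Rule 1: drop rows in which every pollutant is missing.
all_missing <- rowSums(is.na(toy[pollutants])) == length(pollutants)
toy <- toy[!all_missing, ]

# Rule 2: replace each remaining NA with the mean of its column.
toy[pollutants] <- lapply(toy[pollutants], function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})
print(toy)
```

Note that the order matters: if every row containing any NA were removed first, there would be nothing left to impute.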

Step Three: Filtering

In this step, I filtered the combined data to keep only the observations recorded on the required dates.

#getting required dates
dates_req <- as.Date(c("2018-12-20", "2019-01-03", "2020-03-19",
                       "2020-03-26", "2020-06-29", "2020-11-10",
                       "2020-12-20", "2021-01-03", "2021-11-29",
                       "2022-07-25", "2023-07-24"))

#Getting cleaned dates from data frame
data_req <- data_list%>% 
  filter(as.Date(datetime) %in% dates_req)

Step Four: Extracting Monthly Average Pollutant Levels for 2020

In this step, I extracted the 2020 data and grouped it by month to calculate the average pollutant levels per month. I used format(datetime, “%B”), which converts each timestamp into its full month name, as recommended in R for Data Science [1], and then stored the month names as an ordered factor so that they were arranged chronologically rather than alphabetically, as they were initially. While attempting to visualize the data in a bar chart, I encountered an issue because my dataset was still in a wide format, which is not suitable for grouped bar charts in ggplot2: each pollutant was in a separate column rather than under a single column. To fix this, I applied pivot_longer() [2] so that each row represents a single pollutant for a month. References:
[1] Wickham, H., R for Data Science - Working with Dates. Available at: https://r4ds.had.co.nz/data-import.html
[2] Wickham, H., R for Data Science - Tidy Data Concepts. Available at: https://r4ds.had.co.nz/tidy-data.html

#Deriving monthly pollutants in 2020
avg2020 <- data_list %>% 
  filter(format(datetime, "%Y") == "2020") %>% 
  mutate(month = factor(format(datetime, "%B"), levels = month.name, ordered = TRUE))

monthlyAvg <- avg2020 %>% 
  group_by(month) %>% 
  summarise(
    PM10 = mean(PM10),
    NO = mean(NO),
    NO2 = mean(NO2),
    NdiOx = mean(NdiOx)
  ) %>% 
  pivot_longer(cols = c(PM10, 
                        NO, 
                        NO2, 
                        NdiOx), 
               names_to = "Pollutant",
               values_to = "averageValue")
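
The two ideas in this step, chronological month ordering and reshaping from wide to long, can be seen on a toy example. The sketch below assumes the tidyr package is installed; the data frames and values are made up. It also derives month names from the month number, which is one possible way to avoid depending on the system locale, since format(datetime, "%B") returns month names in the local language and these may not match month.name.

```r
library(tidyr)

# Locale-independent month names taken from the month number.
dt <- as.POSIXct(c("2020-03-19 10:00", "2020-01-05 08:00"), tz = "UTC")
month_f <- factor(month.name[as.integer(format(dt, "%m"))],
                  levels = month.name, ordered = TRUE)
as.character(sort(month_f))  # January sorts before March

# Made-up monthly means in wide format: one column per pollutant.
wide <- data.frame(
  month = factor(c("January", "February"), levels = month.name, ordered = TRUE),
  PM10  = c(15, 12),
  NO2   = c(30, 25)
)

# Long format: one row per month and pollutant, ready for fill = Pollutant.
long <- pivot_longer(wide, cols = c(PM10, NO2),
                     names_to = "Pollutant", values_to = "averageValue")
print(long)
```

The two wide rows become four long rows, one for each combination of month and pollutant.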

Step Five: Extracting Yearly Average PM10 Levels for 2021-2023

According to this case study, the Clean Air Zone (CAZ) began operations in late 2021. To assess its effectiveness, I analyzed pollutant levels over the years and noticed a significant spike in all pollutants in 2022, with PM10 peaking the highest, which is why I chose to visualize it. This observation prompted further research. According to Portsmouth City Council’s 2023 Annual Status Report on Air Quality [1], the increase in pollutants during 2022 was likely an aftereffect of the COVID-19 pandemic. The report suggests that pollutant levels initially dropped in 2020 due to reduced traffic but then rose again in 2021, peaking in 2022 as traffic volumes returned to pre-pandemic levels. Reference:
[1] Portsmouth City Council. 2023 Annual Status Report on Air Quality. Available at: https://democracy.portsmouth.gov.uk/documents/s49399/2023%202023%20Annual%20status%20report%20of%20air%20quality.pdf

#Deriving PM10 levels from 2021 to 2023 for 5th analysis
yearlyPM10 <- data_list %>% 
  filter(format(datetime, "%Y") %in%
           c("2021", "2022", "2023")) %>% 
  mutate(year = format(datetime, "%Y")) %>% 
  group_by(year) %>% 
  summarise(PM10 = mean(PM10))
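
To put a number on a year-to-year rise or fall, the yearly means can be turned into percentage changes. The base R sketch below does this on made-up values; the data frame toy, its PM10 readings and the pct_change column are illustrative assumptions and do not reproduce the real results.

```r
# Made-up readings, two per year, for illustration only.
toy <- data.frame(
  datetime = as.POSIXct(c("2021-05-01 01:00", "2021-05-01 02:00",
                          "2022-05-01 01:00", "2022-05-01 02:00",
                          "2023-05-01 01:00", "2023-05-01 02:00"), tz = "UTC"),
  PM10 = c(10, 14, 18, 22, 15, 17)
)

# Yearly mean, the base R counterpart of group_by(year) plus summarise().
toy$year <- format(toy$datetime, "%Y")
yearly <- aggregate(PM10 ~ year, data = toy, FUN = mean)

# Percentage change relative to the previous year (NA for the first year).
yearly$pct_change <- c(NA, diff(yearly$PM10) / head(yearly$PM10, -1) * 100)
print(yearly)
```

In this toy data the yearly means are 12, 20 and 16, so the change is about +66.7% from 2021 to 2022 and -20% from 2022 to 2023.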

Step Six: Plotting

In this step, I visualized my data using interactive plots with Plotly, selecting the most suitable chart types based on recommendations from GeeksforGeeks [1], R for Data Science [2], and Statology [3]. While designing the visualizations, I ensured accessibility by considering colorblind-friendly palettes, referencing Datanovia’s guide on effective color palettes [4]. Additionally, I created a flexdashboard in R and deployed it using the free RPubs service by Posit. During this process, I noticed that some variables displayed duplicate readings when hovering over them. To address this, I customized the tooltip settings, ensuring that only the most relevant information was displayed for each data point, improving readability and user experience. References:
[1] GeeksforGeeks. Data Visualization in R. Available at: https://www.geeksforgeeks.org/data-visualization-in-r/
[2] Wickham, H. R for Data Science - Data Visualization. Available at: https://r4ds.had.co.nz/data-visualisation.html
[3] Statology. How to Create a Bubble Chart in R. Available at: https://www.statology.org/bubble-chart-in-r/#:~:text=You%20can%20use%20the%20following%20basic%20syntax%20to,syntax%20to%20create%20a%20bubble%20chart%20in%20practice.
[4] Datanovia. Top R Color Palettes to Know for Great Data Visualization. Available at: https://www.datanovia.com/en/blog/top-r-color-palettes-to-know-for-great-data-visualization/#google_vignette

PM10 Levels for required dates

#1. Plot for pm10 levels
pm10levels <- data_req %>% 
  ggplot(aes(x = datetime, y = PM10, color = PM10, text = paste("Dates and time:", datetime, "<br>PM10:", PM10))) +
  geom_point()+
  scale_color_viridis(option = "D")+
  labs(title="Scatter Plot of PM10 Levels on Required Dates", x = "Dates and time", y = "PM10 (µg/m³)")+
  theme_gray(base_size = 14)
ggplotly(pm10levels, tooltip = "text")

NO Levels for required dates

#2. Plot for Nitric Oxide(NO)
noPlot <- data_req %>% 
  ggplot(aes(x = datetime, y = NO, color = NO, text = paste("Dates and time:", datetime, "<br>NO:", NO))) +
  geom_jitter() +
  scale_color_viridis(option = "C") +
  labs(title="Jitter Plot of NO Levels on Required Dates", x = "Dates and time", y = "NO (µg/m³)")+
  theme_gray(base_size = 14)
ggplotly(noPlot, tooltip = "text")

NdiOx Levels for required dates

#3. Plot for NdiOx levels
ndioxPlot <- data_req %>% 
  ggplot(aes(x = datetime, y = NdiOx, size = NdiOx, color = NdiOx, text = paste("Dates and time:", datetime, "<br>NdiOx:", NdiOx))) +
  geom_point(alpha = 0.5) + #no fixed size here, so the size = NdiOx mapping scales the bubbles
  scale_color_viridis(option = "H") +
  labs(title="Bubble Plot of NdiOx Levels on Required Dates", x = "Dates and time", y = "NdiOx (µg/m³)")+
  theme_gray(base_size = 14)
ggplotly(ndioxPlot, tooltip = "text")

Monthly Average of Pollutants for 2020

#4. Plot for Monthly Average of Pollutants for 2020
month_avgPlot <- monthlyAvg %>% 
  ggplot(aes(x = month, y =averageValue, fill = Pollutant))+
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_viridis_d(option = "C") +
  labs(title = "Monthly Averages for Pollutants in 2020", x = "Months", y = "Average Value(µg/m³)",
       fill = "Pollutants")+
  theme_gray(base_size = 14)
ggplotly(month_avgPlot)

PM10 Levels from 2021-2023

#5. PM10 Levels from 2021-2023
yearlyPM10_plot <- yearlyPM10 %>% 
  ggplot(aes(x = year, y = PM10, group = 1)) +
  geom_line(color = viridis(1, option = "D"))+
  geom_point(color = viridis(1, option = "B"))+
  labs(title = "PM10 Trends from 2021-2023", x = "Year", y = "PM10 Levels(µg/m³)")+
  theme_gray(base_size = 14)
ggplotly(yearlyPM10_plot)

Conclusion

This study provides insights into pollution trends, highlighting fluctuations in PM10, NO, NO2 and NdiOx over specific days, months, and years. Since the CAZ implementation, a notable decline in pollutant levels has been observed, suggesting a positive impact.

Additional Resources used for this project:

My Lesson Data Analysis with R programming Complete Course. Available here: https://youtu.be/x79bPHXCxlM?si=N7FuWNKI01Ndu_69

R programming in one hour - a crash course for beginners. Available here: https://youtu.be/eR-XRSKsuR4?si=UrhNbEFODFiLXg_f

freeCodeCamp R Programming. Available here: https://youtu.be/_V8eKsto3Ug?si=LYMUNcjICY6leFST

Generate a .pdf from RMarkdown file with R. Available here: https://www.geeksforgeeks.org/generate-pdf-from-rmarkdown-file-with-r/

Openchatai: used to go through some basic concepts in R. Chat available here: https://chatgpt.com/share/67ed2be4-431c-800c-ba50-cd0eb332bbfd

Openchatai: used to check through my report to ensure there were no errors. Chat available here: https://chatgpt.com/share/67ed2a90-554c-800c-bfef-de99c6206bdb

Report on Air Quality Assessment

Amarachi Ashley Okeke

2025-04-01