Air Quality Analysis

Sara Monica

July 19, 2022

Business Problem :

Into the data

According to the World Health Organization (WHO), COVID-19 is a global pandemic and a public health emergency. The infectious disease COVID-19 could survive for hours in the environment and was contagious. As a result, this study investigates the relationship between COVID-19 distribution in Jakarta, Indonesia, and air pollutants (PM2.5, PM10, CO, SO2, NO2, and O3). Additionally, it assesses how large-scale social restrictions (LSSR) affect the air pollution index (API). Data from COVID-19 and air pollutants were examined during four time periods.

The global COVID-19 pandemic has highlighted the close connection between human health and the environment. The governments of each country have carried out various lockdown strategies to contain the spread of COVID-19. In Indonesia, COVID-19 has significantly affected the economic mobility, industry, companies, environment, and government policy arrangements and caused severe economic losses. Restrictions on human activities during the lockdown have contributed to reducing global carbon emissions because decreasing human mobility or economic activities, that assumed to reduce air pollutants due to less traffic and fewer industrial activities due to the COVID-19 lockdowns.

Data was provided at https://www.airnow.gov/international/us-embassies-and-consulates/ and https://data.jakarta.go.id/dataset/

Data Input and Inspection

Importing the required packages and libraries

Data Input:

Read “.csv” from data folder and remove unnessecary columns

##FOR PM25
south_2019 <- read.csv("data-input/pm25/2019.csv", stringsAsFactors = T)
south_2019$date <- with(south_2019, ymd_h(paste(Year, Month, Day, Hour, sep= ' ')))
south_2019 <- south_2019  %>% select (-c("Parameter", "Date..LT.", "Day", "Hour", "Year", "Month", "Duration", "Conc..Unit"))

south_2020 <- read.csv("data-input/pm25/2020.csv", stringsAsFactors = T)
south_2020$date <- with(south_2020, ymd_h(paste(Year, Month, Day, Hour, sep= ' ')))
south_2020 <- south_2020  %>% select (-c("Parameter", "Date..LT.", "Day", "Hour", "Year", "Month", "Duration", "Conc..Unit"))

south_2021 <- read.csv("data-input/pm25/2021.csv", stringsAsFactors = T)
south_2021$date <- with(south_2021, ymd_h(paste(Year, Month, Day, Hour, sep= ' ')))
south_2021 <- south_2021  %>% select (-c("Parameter", "Date..LT.", "Day", "Hour", "Year", "Month", "Duration", "Conc..Unit"))
 
##FOR PM10 and Other Poluttan
tbl1920 <- list.files(path = "data-input/ispu",
               pattern = "*.csv", 
               full.names = T) %>% map_df(~read_csv(., col_types = cols(.default = "c")))

tbl1920 <-  select(tbl1920, -(11))

tbl21 <- list.files(path = "data-input/ispu/ispu_2021",
               pattern = "*.csv", 
               full.names = T) %>% 
        map_df(~read_csv(., col_types = cols(.default = "c")))
tbl21 <- tbl21 %>% select(-pm25)

Data Inspection, Transform, Manipulation :

Check the Missing Value of the data

#Check any Missing or Invalid or any N/A in the data frame
colSums(is.na(south_2019))
south_2019 <- south_2019[!(south_2019$QC.Name =="Missing" | south_2019$QC.Name =="Invalid"  | south_2019$AQI < 0 | south_2019$NowCast.Conc. < 0 | south_2019$Raw.Conc. < 0 ),]
#Remove the last row because if we bind the data frame it the day after the last day in that year and would be a duplicate 
south_2019 <- slice(south_2019, 1:(n() - 1))

#Check any Missing or Invalid or any N/A in the data frame
colSums(is.na(south_2020))
south_2020 <- south_2020[!(south_2020$QC.Name =="Missing" | south_2020$QC.Name =="Invalid" | south_2020$AQI < 0 | south_2020$NowCast.Conc. < 0 | south_2020$Raw.Conc. < 0 ),]
#Remove the last row because if we bind the data frame it the day after the last day in that year and would be a duplicate 
south_2020 <- slice(south_2020, 1:(n() - 1))

#Check any Missing or Invalid or any N/A in the data frame
colSums(is.na(south_2021))
south_2021 <- south_2021[!(south_2021$QC.Name =="Missing" | south_2021$QC.Name =="Invalid"  | south_2021$AQI < 0 | south_2021$NowCast.Conc. < 0 | south_2021$Raw.Conc. < 0 ),]
#Remove the last row because if we bind the data frame it the day after the last day in that year and would be a duplicate 
south_2021 <- slice(south_2021, 1:(n() - 1))

Bind the PM2.5 data into one data frame and arrange the date

tbl_pm25 <- bind_rows(south_2019, south_2020, south_2021)
tbl_pm25 <- tbl_pm25 %>% arrange(date)

Bind the PM10 and other Pollutan data into one data frame and arrange the date

tbl_indicator <- bind_rows(tbl1920, tbl21)

Summarise the mean of the indicator from one hour to a full day (24H)

#summarise the mean of the indicator from one hour to a full day (24H)
tbl_pm25 <- tbl_pm25 %>%
  mutate(tanggal = floor_date(date, "day")) %>%
  group_by(tanggal) %>%
  summarise(pm25 = mean(Raw.Conc.)) %>% 
  mutate(across(where(is.numeric), round, 0)) %>%
  ungroup()

Add the AQI Calculation and Change selected columuns to factor format

#Add the AQI of PM25 as our paramater to analyze this research
tbl_pm25$aqi <- con2aqi("pm25", tbl_pm25$pm25)
#Change the unmatched columns format
tbl_pm25 <- tbl_pm25 %>% 
            mutate(
               aqi = as.numeric(aqi),
               pm25 = as.numeric(pm25),
               tanggal = ymd(tanggal))

Replace any Misiing, N/A or Invalid Value in the data frame

library(naniar)
tbl_indicator <- tbl_indicator %>% replace_with_na_all(condition = ~.x == "---") %>% arrange(tanggal)

Check and Change the unmatched columns format, also remove unused columns

#columns format
glimpse(tbl_indicator)
## Rows: 5,480
## Columns: 10
## $ tanggal  <chr> "2019-01-01", "2019-01-01", "2019-01-01", "2019-01-01", "2019…
## $ stasiun  <chr> "DKI1 (Bunderan HI)", "DKI2 (Kelapa Gading)", "DKI3 (Jagakars…
## $ pm10     <chr> "29", "14", "21", "16", "10", "24", "20", "20", "17", "7", "1…
## $ so2      <chr> "15", "10", "13", "7", "10", "17", "12", "13", "7", "11", "16…
## $ co       <chr> "5", "7", "3", "6", "7", "5", "4", "4", "5", "6", "5", "4", "…
## $ o3       <chr> NA, "71", "40", "24", "44", NA, "79", "47", "25", "44", "29",…
## $ no2      <chr> "13", NA, "1", "3", "4", "6", NA, "2", "4", "3", "4", NA, "2"…
## $ max      <chr> "29", "71", "40", "24", "44", "24", "79", "47", "25", "44", "…
## $ critical <chr> "PM10", "O3", "O3", "O3", "O3", "PM10", "O3", "O3", "O3", "O3…
## $ categori <chr> "BAIK", "SEDANG", "BAIK", "BAIK", "BAIK", "BAIK", "SEDANG", "…
tbl_indicator <- tbl_indicator %>% mutate(stasiun = as.factor(stasiun),
               pm10 = as.numeric(pm10),
               co = as.numeric(co),
               so2 = as.numeric(so2),
               o3 = as.numeric(o3),
               no2 = as.numeric(no2),
               categori = as.factor(categori),
               tanggal = ymd(tanggal))

#remove columns
tbl_clean <- tbl_indicator %>% select(-c(max, critical, categori))

Merge all using left join by date(tanggal) into one data frame

tbl_south <- tbl_clean %>% filter(stasiun == "DKI3 (Jagakarsa)",) %>% arrange(tanggal) %>% select(-stasiun)
tbl_all <-  left_join(tbl_south, tbl_pm25, by="tanggal")
colSums(is.na(tbl_all))
## tanggal    pm10     so2      co      o3     no2    pm25     aqi 
##       0      41      29      25      60      41      34      34

Create the level of the AQI

tbl_all <- tbl_all  %>%  mutate(category = case_when(aqi <= 50 ~ "Good",
                  aqi >= 51 & aqi <= 100 ~ "Moderate",
                  aqi >= 101 & aqi <= 150 ~ "Unhealthy for Sensitive Groups",
                  aqi >= 151 & aqi <= 200 ~ "Unhealthy",
                  aqi >= 201 & aqi <= 300 ~ "Very Unhealthy",
                  aqi >= 301 ~ "Hazardous"),
                  category = as.factor(category)) 

Data Visualization :

Kind of Social-Disctancing Term in Indonesia

After the first case of COVID-19 was recorded in March, the government announced that PSBB, or partial lockdown, has been used as a preventative measure rather than total lockdown. However, only provinces in Indonesia with many cases implemented PSBB; the rest did not. DKI Jakarta introduced PSBB on April 10, 2020.

The government has announced its intention to close all schools and offices during the first phase of PSBB (from April 10 to June 4, 2020), except for strategic services like essential needs provision, defense and security operations, and health care. It has also decided to implement “work from home” policies for non-essential services. The DKI Jakarta Governor Regulation No. 33/2020 was the foundation for all regulations.

Transitional PSBB, or the second phase of PSBB, was fully implemented from June 5 to September 13, 2020. Universities and schools remained closed. Only half of the capacity is permitted for opening non-essential service providers, places of worship, public buildings, public transit, and recreation centers. The DKI Jakarta Governor Regulation No. 51/2020 was the foundation for all interim PSBB regulations. Depending on the situation, the implementations of PSBB and transitory PSBB were established alternatively.

pre_lssr <- tbl_all %>% filter(tanggal <= "2020-04-10" & tanggal >= "2020-01-01")
lssr <- tbl_all %>% filter(tanggal <= "2020-06-04" & tanggal >= "2020-04-11")
lssrt <- tbl_all %>% filter(tanggal <= "2021-01-11" & tanggal >= "2020-06-05")
ppkm <- tbl_all %>% filter(tanggal <= "2021-8-02" & tanggal >= "2021-01-12")

Air Quality Index Trend

plotaqi <- ggplot(tbl_all, aes(x=tanggal, y=aqi)) + 
    geom_point(size=1.5, aes(color = category)) +
    geom_smooth(method = 'gam', formula = y ~ s(x, bs = "cs")) +
    scale_y_continuous(breaks = seq(20,200,20), limits=c(0,200)) +
    scale_color_manual(values=c("Good" = "#26B52A",
                                "Moderate" = "#E2F224",
                                "Unhealthy for Sensitive Groups" = "#F2B224",
                                "Unhealthy" = "#F22424",
                                "Very Unhealthy" = "#7D18B4",
                                "Hazardous" = "#350e1d"))+
    labs(title = "Air Quality Index - PM25 from January 2019 until December 2021",
             x = "Year",
             y = "AQI(PM2.5)",
             fill = "Category") +
        theme(legend.title = element_blank(),
              axis.text.x = element_text(hjust = 1, angle = 45),
              plot.title = element_text(face = "bold"),
              panel.background = element_rect(fill = "#ffffff"),
              axis.line.y = element_line(colour = "grey"),
              axis.line.x = element_line(colour = "grey"))
ggplotly(plotaqi)

The air quality index (AQI) decreased significantly between the Q3 of 2020 until Q4 of 2020 when it was under large-scale social restrictions transition (PSBB Transition), as seen from the plot above. However, what is interesting is that the air quality index (AQI) increased, albeit barely, after large-scale social restrictions. After 2020, we can conclude that the AQI follows the trend line from the previous year but still declines from 2019.

library(zoo)
aqi_cat <- tbl_all %>% mutate(tahun = year(tbl_all$tanggal)) %>%
  group_by(tahun, category) %>%
  mutate(tahun = as.factor(tahun)) %>%
  summarise(freq = n()) %>%
  ungroup() %>% 
  mutate(label = glue("Total : {freq}
                       Level : {category}"))

plotcount <- ggplot(aqi_cat, aes(x=tahun, y = freq, fill = category, text = label)) + 
  geom_col(position = "dodge") +
  scale_fill_manual(values=c("Good" = "#26B52A",
                                "Moderate" = "#E2F224",
                                "Unhealthy for Sensitive Groups" = "#F2B224",
                                "Unhealthy" = "#F22424",
                                "Very Unhealthy" = "#7D18B4",
                                "Hazardous" = "#350e1d")) +
      labs(title = "Air Quality Index - PM2.5 Level on Each Year",
             x = "Year",
             y = NULL,
             fill = "Category") +
        theme(legend.title = element_blank(),
              axis.text.x = element_text(hjust = 1),
              plot.title = element_text(face = "bold"),
              panel.background = element_rect(fill = "#ffffff"),
              axis.line.y = element_line(colour = "grey"),
              axis.line.x = element_line(colour = "grey"))
ggplotly(plotcount, tooltip = "text")

Over the year, more days with moderate air quality than days with unhealthy air quality. Even yet, there were more days in 2020 with Unhealthy air quality for Vulnerable Groups than in 2019. Compared to the prior year, good air quality also appears to be improving, albeit not much. As opposed to 2020, 2021 shows a drop in unhealthy air quality and an increase in unhealthy air quality for vulnerable groups. However, the number of days with Good air quality experienced the opposite, which decreased by almost half the number of days with good air quality in 2020.

Recap of Air Quality Index from Januari 2019 to December 2021

library(viridis)

recap <- tbl_all %>% mutate(tahun = year(tbl_all$tanggal),
                   bulan = month(tbl_all$tanggal, label = T),
                   hari = day(tbl_all$tanggal)) %>% ggplot(aes(x = bulan , y = hari, fill= category)) + 
  geom_tile() +
  facet_grid(~tahun) +
  scale_y_continuous(breaks = seq(1,31)) +
  scale_fill_manual(values=c("Good" = "#26B52A",
                                "Moderate" = "#E2F224",
                                "Unhealthy for Sensitive Groups" = "#F2B224",
                                "Unhealthy" = "#F22424",
                                "Very Unhealthy" = "#7D18B4",
                                "Hazardous" = "#350e1d")) +
      labs(title = "Air Quality Index - PM2.5 Level on Each Year",
             x = "Year",
             y = NULL,
             fill = "Category") +
        theme(legend.title = element_blank(),
              axis.text.x = element_text(hjust = 1, angle = 90),
              plot.title = element_text(face = "bold"),
              panel.background = element_rect(fill = "#ffffff"),
              axis.line.y = element_line(colour = "grey"),
              axis.line.x = element_line(colour = "grey"))
ggplotly(recap)

Distribution of each Pollutant from January 2019 until December 2021

plotpoll <- tbl_all %>% 
    select(-c(category)) %>%
    pivot_longer(col = -tanggal, names_to = 'predictor') %>% 
    ggplot(aes(x = tanggal, y = value)) +
    geom_point() +
    geom_smooth(method = 'gam', formula = y ~ s(x, bs = "cs"), col = "red") +
    facet_wrap(~predictor, scales = 'free_x', nrow = 2, ncol = 4)  +
    labs(
        title = 'Distribution of each Pollutant from January 2019 until December 2021',
        x = 'Year',
        y = 'Values'
    )
ggplotly(plotpoll)

To get a deeper understanding, we may observe that the plot above comprises a variety of pollutants that could indicate the air quality. According to the distribution, O3 will considerably decrease between 2019 and 2021. We might infer that PM10 and 03 are displaying a descending line in Q3 to Q4 of 2020. On the other hand, beginning in the middle of Q2, PM2.5 levels start to fall until the end of the year. On the other hand, CO, SO2, and NO2 have some roller-coaster slope on several days in Q1 (Feb-Mar) and Q4 (Oct-Nov). We presume that it is a part of the outlier of the data, but we decided not to remove the outlier since it may limit the information on Air Quality.

Seasonality of Pollutant

PM10

plotpm10 <- ggplot(tbl_all, aes(x=tanggal, y=pm10)) + 
    geom_point(size=1.5, color = "black") +
    geom_point(data = pre_lssr, aes(x = tanggal, y = pm10), col = 'blue') + 
    geom_point(data = lssr, aes(x = tanggal, y = pm10), col = 'red') +
    geom_point(data = lssrt, aes(x = tanggal, y = pm10), col = 'green') +
    geom_point(data = ppkm, aes(x = tanggal, y = pm10), col = 'pink') +
    geom_smooth(method = 'gam', formula = y ~ s(x, bs = "cs"), col = "red") +
    scale_y_continuous(breaks = seq(20,100,20), limits=c(0,100)) +
    labs(title = "PM10 from January 2019 until December 2021",
             x = "Year",
             y = "PM10",
             fill = "Category") +
        theme(legend.title = element_blank(),
              axis.text.x = element_text(hjust = 1, angle = 45),
              plot.title = element_text(face = "bold"),
              panel.background = element_rect(fill = "#ffffff"),
              axis.line.y = element_line(colour = "grey"),
              axis.line.x = element_line(colour = "grey"))
ggplotly(plotpm10)

PM2.5

plotpm25 <- ggplot(tbl_all, aes(x=tanggal, y=pm25)) + 
    geom_point(size= 1.5, col = "black") +
    geom_point(data = pre_lssr, aes(x = tanggal, y = pm25), col = 'blue') + 
    geom_point(data = lssr, aes(x = tanggal, y = pm25), col = 'red') +
    geom_point(data = lssrt, aes(x = tanggal, y = pm25), col = 'green') +
    geom_point(data = ppkm, aes(x = tanggal, y = pm25), col = 'pink') +
    geom_smooth(method = 'gam', formula = y ~ s(x, bs = "cs"), col = "red") +
    scale_y_continuous(breaks = seq(25,125,25), limits=c(0,125)) +
    labs(title = "PM2.5 from January 2019 until December 2021",
             x = "Year",
             y = "PM2.5") +
        theme(legend.title = element_blank(),
              axis.text.x = element_text(hjust = 1, angle = 45),
              plot.title = element_text(face = "bold"),
              panel.background = element_rect(fill = "#ffffff"),
              axis.line.y = element_line(colour = "grey"),
              axis.line.x = element_line(colour = "grey"))
ggplotly(plotpm25)

SO2

plotso2 <-  tbl_all %>% 
    ggplot(aes(x = tanggal, y = so2)) +
    geom_point(size = 1) +
    geom_point(data = pre_lssr, aes(x = tanggal, y = so2), col = 'blue') + 
    geom_point(data = lssr, aes(x = tanggal, y = so2), col = 'red') +
    geom_point(data = lssrt, aes(x = tanggal, y = so2), col = 'green') +
    geom_point(data = ppkm, aes(x = tanggal, y = so2), col = 'pink') +
    geom_smooth(method = 'gam', formula = y ~ s(x, bs = "cs"), col = "red") +
    labs(title = "SO2 Level on Each Year",
             x = "Year",
             y = "SO2",
             fill = "Category") +
    theme(legend.title = element_blank(),
              axis.text.x = element_text(hjust = 1),
              plot.title = element_text(face = "bold"),
              panel.background = element_rect(fill = "#ffffff"),
              axis.line.y = element_line(colour = "grey"),
              axis.line.x = element_line(colour = "grey"))
ggplotly(plotso2)

NO2

plotno2 <-  tbl_all %>% 
    ggplot(aes(x = tanggal, y = no2)) +
    geom_point(size = 1) +
    geom_point(data = pre_lssr, aes(x = tanggal, y = no2), col = 'blue') + 
    geom_point(data = lssr, aes(x = tanggal, y = no2), col = 'red') +
    geom_point(data = lssrt, aes(x = tanggal, y = no2), col = 'green') +
    geom_point(data = lssrt, aes(x = tanggal, y = no2), col = 'green') +
    geom_point(data = ppkm, aes(x = tanggal, y = no2), col = 'pink') +
    labs(title = "NO2 Level on Each Year",
             x = "Year",
             y = "NO2",
             fill = "Category") +
    theme(legend.title = element_blank(),
              axis.text.x = element_text(hjust = 1),
              plot.title = element_text(face = "bold"),
              panel.background = element_rect(fill = "#ffffff"),
              axis.line.y = element_line(colour = "grey"),
              axis.line.x = element_line(colour = "grey"))
ggplotly(plotno2)

The plot above shows the trend and daily distribution of some pollutants from January 2019 to December 2020.

  • a: The blue dots represent the pre-lockdown period on January 1st, 2020 - April 10th, 2020.
  • b: The red dots represent the PSBB 1st phase period on April 11th, 2020 - June 4th, 2020
  • c: The pink dots represent the transitional PSBB including PSBB Ketat (2nd phase) period on June 5th, 2020 - January 11th, 2021
  • d: The green dots represent the PPKM period on January 11th, 2021 - August 2nd, 2021.

The line indicates a linear trend between pre and during the lockdown period.

As we know, Indonesia has two seasons, one of which is the Rainy Season, which starts in October - April and has the highest rainfall at the beginning of the year. The pollutants were related to precipitation patterns, with the highest rainfall in January and decreasing each month until it reached the driest month in August. So we can assume that Air Quality is not only affected by people’s mobility or activities but also by climate change, such as rainfall, wind speed, temperature, and humidity.

Conclusion :

Based on the data, we conclude that air quality changes during lockdowns were not as large and significant as measured by PM10, PM2.5, SO2, and CO2 during the lockdown period. Although human mobility decreases during PSBB/PPKM, it cannot significantly decrease air pollution in highly-populated and polluted cities such as Jakarta, as fossil fuel usage and industrial activities were not significantly reduced. In addition to people’s mobility, climate change factors, including rainfall, wind speed, temperature, and humidity, all impact air quality. Air Quality is not only affected by people’s mobility or activities but also by climate change, such as rainfall, wind speed, temperature, and humidity.