Project Description

This project investigates various air pollutants in South Korea using real-time and historical data obtained from the Air Korea Official Website. The primary objective is to analyze pollution trends, identify major pollutants, and assess their potential impact on public health and the environment.

Data

Data Source

The dataset was sourced from the Air Korea Official Website, which provides credible, comprehensive records of air quality data across South Korea. Its long-term coverage and real-time updates make it a reliable source for environmental research.

Data Collection Process

Due to website limitations, users can only download air quality data in 60-day intervals. As a result, the download had to be repeated many times to collect around 7 years of data. For clarity, I downloaded the data in one-month intervals. An example of an unprocessed monthly dataset is shown below:

Data Explanation

  • The columns consist of the following air pollutants:
    • PM10
    • PM2.5
    • O3 (ozone)
    • NO2 (nitrogen dioxide)
    • CO (carbon monoxide)
    • SO2 (sulfur dioxide)
  • Each row corresponds to a date

Data Issues

The raw data contains several significant issues that must be addressed before analysis.

  • Date Problems
    • The date entries are not sorted chronologically.
    • The date column is stored as a string rather than a proper date type, so it requires conversion.
  • Column Issues
    • Most column names are in Korean, complicating data manipulation and visualization.
    • The final column is completely empty and can be removed.
  • Row Issues
    • The first row contains only the measurement units.
    • All values are currently stored as character strings.
  • Missing Values
    • Some cells contain NA values, which require proper handling depending on the analysis.
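
The issues listed above can be checked programmatically before cleaning. The following is a minimal sketch in base R on a small stand-in data frame. The values, the subset of columns, and the column name `Empty` are assumptions made for illustration and do not come from the real Air Korea files.

```r
# Toy stand-in for one raw monthly file (made-up values, hypothetical layout).
raw <- data.frame(
  Date  = c(NA, "2018-01-03", "2018-01-01", "2018-01-02"),
  PM10  = c("ug/m3", "45", NA, "38"),   # first row holds the unit
  O3    = c("ppm", "0.021", "0.018", NA),
  Empty = c(NA, NA, NA, NA),            # stands in for the empty last column
  stringsAsFactors = FALSE
)

# Column types: the unit row forces the pollutant columns to be character.
col_types <- sapply(raw, class)

# Columns that contain nothing but NA can be dropped.
is_empty <- sapply(raw, function(x) all(is.na(x)))

# Ignore the unit row, then count missing values per column.
body <- raw[-1, ]
na_counts <- colSums(is.na(body))

# ISO-style date strings sort chronologically, so this detects unsorted dates.
is_sorted <- !is.unsorted(body$Date)

print(col_types)
print(is_empty)
print(na_counts)
print(is_sorted)
```

Running the same checks on a real monthly file would show which of the issues apply to it.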

Data Cleaning

The data cleaning workflow follows these steps:

  1. Import monthly data iteratively using a loop:
    • Read each month’s data and skip the first two header rows
    • Remove the first row, which contains the units
    • Rename columns from Korean to English for clarity
    • Drop the empty last column
    • Sort the data by the date column in chronological order
    • Append the cleaned data to an initially empty dataframe
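library(readxl)  # read_xls()
library(dplyr)   # rename(), select(), arrange(), %>%

total <- data.frame()
for (i in 1:12) {
  # Read month i, skipping the first two header rows
  wt <- read_xls(paste0('/Users/lionlucky7/Desktop/R Project/', i, '.xls'), skip = 2)
  wt <- wt[-1, ]  # drop the row that holds only the units
  wt <- wt %>%
    rename(
      Date = '날짜',
      O3 = '오 존',
      NO2 = '이산화질소',
      CO = '일산화탄소',
      SO2 = '아황산가스'
    ) %>%
    select(-'최종확정\n여부') %>%  # drop the empty last column
    arrange(Date)                  # sort by date
  total <- rbind(total, wt)        # append to the combined dataframe
}
total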
  2. After consolidating all monthly datasets into a single dataframe, further processing is performed as follows:
    • Transform the date column from string to actual Date format
    • Separate the Date column into Year, Month, and Day for easier grouping and analysis
    • Reorder the columns
    • Convert all pollutant columns to numeric type
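
The second step is described above without code. The following is a minimal sketch in base R, not necessarily how the original script does it. It assumes the date strings use the `YYYY-MM-DD` format and runs on a small made-up stand-in for `total`. The resulting column order (Date, Year, Month, Day, then the six pollutants) matches the `total[5:10]` indexing used later in the analysis.

```r
# Toy stand-in for `total` after the import loop: every column is still text.
# All values are made up for illustration.
total <- data.frame(
  Date  = c("2018-01-01", "2018-01-02", "2018-02-01"),
  PM10  = c("45", NA, "38"),
  PM2.5 = c("22", "19", NA),
  O3    = c("0.018", "0.021", "0.025"),
  NO2   = c("0.030", "0.027", "0.022"),
  CO    = c("0.6", "0.5", "0.4"),
  SO2   = c("0.004", "0.003", "0.003"),
  stringsAsFactors = FALSE
)

# 1. Convert the date column from string to Date (format is an assumption).
total$Date <- as.Date(total$Date, format = "%Y-%m-%d")

# 2. Split the date into Year, Month, and Day for grouping.
total$Year  <- as.integer(format(total$Date, "%Y"))
total$Month <- as.integer(format(total$Date, "%m"))
total$Day   <- as.integer(format(total$Date, "%d"))

# 3. Reorder so the date parts come first and the pollutants sit in columns 5 to 10.
pollutants <- c("PM10", "PM2.5", "O3", "NO2", "CO", "SO2")
total <- total[, c("Date", "Year", "Month", "Day", pollutants)]

# 4. Convert the pollutant columns from character to numeric; NA stays NA.
total[pollutants] <- lapply(total[pollutants], as.numeric)

str(total)
```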

Exploratory Data Analysis

Correlation

This heat map shows the correlation between air pollutants.

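library(ggplot2)
library(ggcorrplot)

# Pairwise correlations between the six pollutant columns, skipping NA values
corr <- cor(total[5:10], use = "pairwise.complete.obs")
ggcorrplot(corr, outline.col = "white",
           colors = c("#bf3022", "white", "#2967ff"), lab = TRUE) +
  ggtitle("Correlation Matrix of Air Pollutants in South Korea (2018)") +
  theme(plot.title = element_text(hjust = 0.5, size = 12))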

The correlation heat map reveals several features of the dataset:

  • These columns are positively correlated:
    • SO2 and (PM10, PM2.5, NO2, CO)
    • CO and (PM10, PM2.5, NO2, SO2)
    • NO2 and (PM10, PM2.5, CO, SO2)
    • PM2.5 and (PM10, NO2, CO, SO2)
    • PM10 and (PM2.5, NO2, CO, SO2)
    • This implies that all air pollutants except O3 are closely related to each other
  • These columns are negatively correlated:
    • O3 and (NO2, CO)
  • These columns show relatively little correlation:
    • O3 and (PM10, PM2.5, SO2)
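
Groupings like the ones above can also be read off a correlation matrix in code rather than from the heat map. The following is a minimal sketch in base R that flattens the matrix into a list of pairs sorted by strength. The toy values are made up to show the mechanics only and do not reproduce the project's results.

```r
# Toy data with three pollutant columns (made-up values).
toy <- data.frame(
  PM10  = c(30, 45, 60, 52, 38, 70),
  PM2.5 = c(15, 24, 33, 27, 18, 40),
  O3    = c(0.040, 0.028, 0.015, 0.022, 0.035, 0.010)
)

corr <- cor(toy, use = "pairwise.complete.obs")

# Keep each pair once by taking only the upper triangle of the matrix.
idx <- which(upper.tri(corr), arr.ind = TRUE)
pairs <- data.frame(
  a = rownames(corr)[idx[, "row"]],
  b = colnames(corr)[idx[, "col"]],
  r = corr[idx]
)

# Strongest relationships first, regardless of sign.
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs)
```

Applied to the real data, `toy` would be replaced by `total[5:10]`. A positive `r` indicates a positive correlation, a negative `r` a negative one, and values near zero indicate little linear relationship.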
total %>% 
  mutate(season = case_when(
    Month %in% c(12, 1, 2) ~ 'Winter',
    Month %in% c(3,4,5) ~ 'Spring', 
    Month %in% c(6,7,8) ~ 'Summer',
    Month %in% c(9, 10, 11) ~ 'Fall'
  )) %>%
  pivot_longer(cols = c(PM10, PM2.5, O3, NO2, CO, SO2),
               names_to = 'matters',
               values_to = 'number') %>%
  ggplot(aes(y=number, color=season)) +
  geom_boxplot() +
  theme_igray() +
  facet_wrap(vars(matters), scales = 'free_y') +
  labs(title = 'Seasonal Distribution of Air Pollutants', 
       y = 'Measurement Data') +
  theme(plot.title = element_text(hjust=0.76, size=17, face='bold'),
    axis.text.x = element_blank(), 
    axis.title.y = element_text(size=15))

Average of each air pollutant per month

total %>%  
  group_by(Month) %>%
  summarise(
    PM10 = mean(PM10, na.rm=TRUE), 
    PM2.5 = mean(PM2.5,na.rm=TRUE),
    O3 = mean(O3, na.rm=TRUE), 
    NO2 = mean(NO2, na.rm=TRUE),
    CO = mean(CO, na.rm=TRUE),
    SO2 = mean(SO2, na.rm=TRUE)
  ) %>% 
  pivot_longer(!Month, names_to = 'matter', values_to = 'number') %>%
  ggplot(aes(x=Month, y=number)) +
  facet_wrap(vars(matter), scales='free_y',
             nrow=3) +
  geom_col() +
  scale_x_continuous(breaks = 1:12, 
                     labels = c(
                       "Jan", "Feb", "Mar", "Apr", "May", "Jun",
                       "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
                     )) +
  theme_igray() +
  labs(
    y = 'Measurement Data',
    title = 'Monthly Average Concentrations of Major Air Pollutants'
  ) +
  theme(
    plot.title = element_text(hjust=0.5, face='bold', size=18),
    panel.grid.major.x = element_blank(),
    axis.text.x = element_text(size=8),
    axis.title.x = element_text(size=15),
    axis.title.y = element_text(size=15)
  )

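PMs <- total %>% select(Date, Year, Month, Day, PM10, PM2.5)
# PM10 and PM2.5 levels over time, one line per pollutant
PMs %>%
  ggplot(aes(x=Date)) +
  geom_line(aes(y=PM10, color='PM10'), linewidth=0.5) +   # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  geom_line(aes(y=PM2.5, color='PM2.5'), linewidth=0.5) +
  scale_color_manual(values=c('PM10'="#CC6666", 'PM2.5'="#9999CC")) +
  labs(y = 'PM level(㎍/㎥)')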

The monthly average concentrations of major air pollutants reveal a general trend:

  • Except for O3, all pollutants increase during the winter and decrease in the summer.
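
A seasonal pattern like this can be checked numerically by averaging a pollutant per season. The following is a minimal sketch in base R that uses the same month-to-season grouping as the box plot code. The helper name `season_of` and the toy values are assumptions made for illustration.

```r
# Map a month number (1-12) to a season, using the same grouping as the
# seasonal box plot in this document. `season_of` is a hypothetical helper.
season_of <- function(month) {
  ifelse(month %in% c(12, 1, 2), "Winter",
  ifelse(month %in% 3:5,         "Spring",
  ifelse(month %in% 6:8,         "Summer", "Fall")))
}

# Toy data: made-up values, only to show the mechanics.
toy <- data.frame(
  Month = c(1, 1, 7, 7, 12, 8),
  PM10  = c(60, 55, 30, NA, 58, 28)
)
toy$season <- season_of(toy$Month)

# The formula interface of aggregate() drops rows with NA before averaging.
seasonal_mean <- aggregate(PM10 ~ season, data = toy, FUN = mean)
print(seasonal_mean)
```

With the real data, replacing `toy` with `total` and repeating the call for each pollutant would give one seasonal mean table per pollutant.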