Required packages

# This is the R chunk for the required packages
library(readr) 
library(knitr) 
library(dplyr)
library(tidyr)
library(stringr)
library(magrittr)
library(lubridate)
library(outliers)
library(ggplot2)
library(editrules)
library(forecast)

Executive Summary

This paper presents the data wrangling processes performed on the air pollution measurement dataset of Seoul, South Korea. The goal is to explore the air pollution status and statistics relating to six pollutants for the year 2019 at station code 101 (Jongno-gu, Seoul). Two datasets, Measurement_info.csv and Measurement_item_info.csv, were used: Measurement_info contains the air pollution measurements, and Measurement_item_info captures information on the pollutant measurement items. The datasets were first imported into R, and head() was used to check that the data loaded correctly. The Measurement_info dataset was filtered to year 2019 and station code 101 as per the objective, and Measurement_item_info was then left joined with the filtered Ms_info dataset to create the research dataset. The structure of the dataset was checked and the necessary type conversions were applied to the variables. To explore each pollutant's status on a particular day and hour, a new column, Pollution_status, was mutated by comparing the average pollutant values with the pollutant standard ranges; this showed that PM2.5 and PM10 are the pollutants causing unhealthy air quality. The Unit_of_measurement and Item_name columns were united to reduce the dimension of the dataset, and the resulting Pollutant_Unit column was then spread as the 'key' with Average_value as the 'value', forming tidy data. Afterwards, missing values, special values and obvious errors were checked and corrected. Boxplots were used to detect outliers in the numeric columns; identified outliers were capped, replacing observations below the lower limit with the 5th percentile value and those above the upper limit with the 95th percentile value. Because four of the pollutants are measured in ppm and two in Microgramperm3, with very different value ranges, data normalisation was applied to create a matching scale across all numeric columns whilst maintaining the underlying data distributions. Lastly, a square root transformation was used to reduce the right skewness of the variables' distributions; to demonstrate the effect, the transformations of the PM10_Microgramperm3 and PM2.5_Microgramperm3 variables are shown.

Data

The datasets used in this assignment deal with air pollution measurement information in Seoul, South Korea. They were downloaded from Kaggle (2020), Air Pollution in Seoul: https://www.kaggle.com/bappekim/air-pollution-in-seoul.

For data pre-processing, two datasets, Measurement_info.csv and Measurement_item_info.csv, were used. They contain air pollution measurements and measurement item information for Seoul, as follows:

First Dataset

Measurement info dataset: It contains air pollution measurement information, with 3,885,066 observations of 5 variables. The definitions of these variables are given below:

Measurement date: Year, month, day and hour of measurement, recorded hourly between 2017 and 2019.

Station code: Codes of the 25 air pollution measurement stations, from 101 to 125.

Item code: Item codes for pollutants where 1: SO2, 3: NO2, 5: CO, 6: O3, 8: PM10, 9: PM2.5.

Average value: 1-hour average measurement value of six air pollutants.

Instrument status: 0: Normal, 1: Need for calibration, 2: Abnormal, 4: Power cut off, 8: Under repair, 9: Abnormal data.

Second Dataset

Measurement item info dataset: Information on air pollution measurement items. It has 6 observations of 7 variables as follows:

Item code: Item codes for pollutants where 1: SO2, 3: NO2, 5: CO, 6: O3, 8: PM10, 9: PM2.5.

Item name: Names of pollutants: SO2, NO2, CO, O3, PM10, PM2.5.

Unit of measurement: Scales of the measurement where ppm: Parts per million, Microgram/m3: Micrograms per cubic meter.

Good: Pollutant standard value as good level of pollution.

Normal: Pollutant standard value as normal (fair) level of pollution.

Bad: Pollutant standard value as bad (poor) level of pollution.

Very bad: Pollutant standard value as very bad (unhealthy) level of pollution.

The datasets were imported using the readr package's read_csv() function. Column names were supplied by skipping the first row and passing the col_names argument with a clearer naming convention for easy recall. To ensure the data loaded correctly, head() was used to inspect the first few rows of each dataset. Measurement_info.csv was then filtered to Measurement_date year 2019 and Station_code 101 using filter() from the dplyr package, the first few records were checked with head(), and the result was saved as the Ms_info dataset. For the date filter, the comparison string was converted to POSIXct using as.POSIXct() so that it could be compared with the Measurement_date column (parsed by readr as dttm, i.e. POSIXct).

The Measurement_item_info.csv dataset was then left joined with the filtered Ms_info dataset using left_join() from the dplyr package, with "Item_code" as the key variable, and the result was named df_join. The first few records of the merged dataset were again checked using head().

# This is the R chunk for the Data Section
Ms_whole <- read_csv("C:/Users/tanto/Desktop/Graduate cert-Data Science/Data Wrangling/Assignments/Assignment 2/Dataset/Measurement_info.csv", skip=1,
              col_names = c( "Measurement_date", "Station_code", "Item_code", "Average_value", "Instrument_status"))
## Parsed with column specification:
## cols(
##   Measurement_date = col_datetime(format = ""),
##   Station_code = col_double(),
##   Item_code = col_double(),
##   Average_value = col_double(),
##   Instrument_status = col_double()
## )
head(Ms_whole, 3)
Ms_item_info <- read_csv("C:/Users/tanto/Desktop/Graduate cert-Data Science/Data Wrangling/Assignments/Assignment 2/Dataset/Measurement_item_info.csv", skip=1,
              col_names = c("Item_code", "Item_name", "Unit_of_measurement", "Good", "Normal", "Bad", "Very_bad"))
## Parsed with column specification:
## cols(
##   Item_code = col_double(),
##   Item_name = col_character(),
##   Unit_of_measurement = col_character(),
##   Good = col_double(),
##   Normal = col_double(),
##   Bad = col_double(),
##   Very_bad = col_double()
## )
head(Ms_item_info,3)
Ms_info <- Ms_whole %>% filter(Measurement_date >= as.POSIXct("2019-01-01 00:00:00", tz="UTC"), Station_code == 101)
head(Ms_info,3)
df_join <- Ms_info %>% left_join(Ms_item_info, by ="Item_code")
head(df_join, 3)
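
A quick verification step (a sketch, not part of the original workflow) could also confirm that the filter behaved as intended:

# Sketch: confirm the filtered data covers only 2019 and station 101
range(Ms_info$Measurement_date)
unique(Ms_info$Station_code)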

Understand

To check the structure, dimensions and data types of the merged dataset df_join, glimpse() was used; the dataset has 50,310 observations of 11 variables. It was also observed that Station_code and Item_code were parsed as type double even though they hold integer values, so as.integer() was applied to both and the conversions were confirmed with typeof(). Next, the Instrument_status variable was converted from double to factor using factor(), specifying its levels and labels as per the source description. Similarly, Item_name and Unit_of_measurement were parsed as character, so both were converted to factor with factor(), specifying levels for Item_name, and both levels and labels for Unit_of_measurement as required by the source description. The class() and levels() functions were used to check the converted types and the levels of the factor variables. During this phase, a spelling error in the Unit_of_measurement variable, "Mircrogram/m3", was found and corrected by passing the corrected spelling "Microgramperm3" to the labels argument of factor().

# This is the R chunk for the Understand Section
glimpse(df_join)
## Rows: 50,310
## Columns: 11
## $ Measurement_date    <dttm> 2019-01-01 01:00:00, 2019-01-01 01:00:00, 2019...
## $ Station_code        <dbl> 101, 101, 101, 101, 101, 101, 101, 101, 101, 10...
## $ Item_code           <dbl> 1, 3, 6, 8, 5, 9, 1, 3, 5, 6, 8, 9, 1, 3, 5, 6,...
## $ Average_value       <dbl> 0.004, 0.061, 0.002, 37.000, 1.100, 22.000, 0.0...
## $ Instrument_status   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Item_name           <chr> "SO2", "NO2", "O3", "PM10", "CO", "PM2.5", "SO2...
## $ Unit_of_measurement <chr> "ppm", "ppm", "ppm", "Mircrogram/m3", "ppm", "M...
## $ Good                <dbl> 0.02, 0.03, 0.03, 30.00, 2.00, 15.00, 0.02, 0.0...
## $ Normal              <dbl> 0.05, 0.06, 0.09, 80.00, 9.00, 35.00, 0.05, 0.0...
## $ Bad                 <dbl> 0.15, 0.20, 0.15, 150.00, 15.00, 75.00, 0.15, 0...
## $ Very_bad            <dbl> 1.0, 2.0, 0.5, 600.0, 50.0, 500.0, 1.0, 2.0, 50...
df_join$Station_code <-  as.integer(df_join$Station_code)
typeof(df_join$Station_code)
## [1] "integer"
df_join$Item_code <- as.integer(df_join$Item_code)
typeof(df_join$Item_code)
## [1] "integer"
df_join$Instrument_status <- df_join$Instrument_status %>% 
  factor(levels = c(0, 1, 2, 4, 8, 9),
         labels = c("Normal", "Need for calibration", "Abnormal",
                    "Power cut off", "Under repair", "Abnormal data"))
class(df_join$Instrument_status)
## [1] "factor"
levels(df_join$Instrument_status)
## [1] "Normal"               "Need for calibration" "Abnormal"            
## [4] "Power cut off"        "Under repair"         "Abnormal data"
unique(df_join$Item_name)
## [1] "SO2"   "NO2"   "O3"    "PM10"  "CO"    "PM2.5"
df_join$Item_name <- df_join$Item_name %>% 
                      factor(levels=c("SO2","NO2","CO","O3","PM10","PM2.5"))

class(df_join$Item_name)
## [1] "factor"
levels(df_join$Item_name)
## [1] "SO2"   "NO2"   "CO"    "O3"    "PM10"  "PM2.5"
unique(df_join$Unit_of_measurement)
## [1] "ppm"           "Mircrogram/m3"
df_join$Unit_of_measurement <- df_join$Unit_of_measurement %>% 
  factor(levels = c("ppm", "Mircrogram/m3"),
         labels = c("ppm", "Microgramperm3"))
class(df_join$Unit_of_measurement)
## [1] "factor"
levels(df_join$Unit_of_measurement)
## [1] "ppm"            "Microgramperm3"
glimpse(df_join)
## Rows: 50,310
## Columns: 11
## $ Measurement_date    <dttm> 2019-01-01 01:00:00, 2019-01-01 01:00:00, 2019...
## $ Station_code        <int> 101, 101, 101, 101, 101, 101, 101, 101, 101, 10...
## $ Item_code           <int> 1, 3, 6, 8, 5, 9, 1, 3, 5, 6, 8, 9, 1, 3, 5, 6,...
## $ Average_value       <dbl> 0.004, 0.061, 0.002, 37.000, 1.100, 22.000, 0.0...
## $ Instrument_status   <fct> Normal, Normal, Normal, Normal, Normal, Normal,...
## $ Item_name           <fct> SO2, NO2, O3, PM10, CO, PM2.5, SO2, NO2, CO, O3...
## $ Unit_of_measurement <fct> ppm, ppm, ppm, Microgramperm3, ppm, Microgrampe...
## $ Good                <dbl> 0.02, 0.03, 0.03, 30.00, 2.00, 15.00, 0.02, 0.0...
## $ Normal              <dbl> 0.05, 0.06, 0.09, 80.00, 9.00, 35.00, 0.05, 0.0...
## $ Bad                 <dbl> 0.15, 0.20, 0.15, 150.00, 15.00, 75.00, 0.15, 0...
## $ Very_bad            <dbl> 1.0, 2.0, 0.5, 600.0, 50.0, 500.0, 1.0, 2.0, 50...
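
With the conversions in place, a frequency table gives a quick check that every status level was mapped; a minimal sketch using base R:

table(df_join$Instrument_status)  # counts per instrument status after the factor conversion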

Tidy & Manipulate Data I (Mutating New Column)
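
To explore each pollutant's status on a particular day and hour, a new column, Pollution_status, was mutated by comparing each Average_value against the pollutant's standard thresholds (Good, Normal, Bad and Very_bad) from the item information. The statuses were then ordered as a factor and tabulated, which showed that PM10 and PM2.5 are the pollutants behind the Unhealthy and Hazardous records.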

# This is the R chunk for the Tidy & Manipulate Data I 
df_join <- df_join %>% mutate(Pollution_status = "")

# Generating the pollution status from Average_value and the standard threshold values
for (i in 1:nrow(df_join)) {
  ifelse(df_join$Average_value[i] <= df_join$Good[i],
         df_join$Pollution_status[i] <- "Good",
  ifelse(df_join$Average_value[i] > df_join$Good[i] &
         df_join$Average_value[i] <= df_join$Normal[i],
         df_join$Pollution_status[i] <- "Fair",
  ifelse(df_join$Average_value[i] > df_join$Normal[i] &
         df_join$Average_value[i] <= df_join$Bad[i],
         df_join$Pollution_status[i] <- "Poor",
  ifelse(df_join$Average_value[i] > df_join$Bad[i] &
         df_join$Average_value[i] <= df_join$Very_bad[i],
         df_join$Pollution_status[i] <- "Unhealthy",
         df_join$Pollution_status[i] <- "Hazardous"))))
}

df_join$Pollution_status <- df_join$Pollution_status %>% 
  factor(levels = c("Hazardous", "Unhealthy", "Poor", "Fair", "Good"), ordered = TRUE)

head(df_join, 3)
df_join$Pollution_status %>% table() 
## .
## Hazardous Unhealthy      Poor      Fair      Good 
##         1       300      1964     13207     34838
#Pollutants' counts across Hazardous & Unhealthy pollution status:
pollution_status_count <- df_join %>% 
  group_by(Item_name, Pollution_status) %>% 
  summarise(count = n()) %>% 
  arrange(desc(count)) %>% 
  filter(Pollution_status == "Hazardous" | Pollution_status == "Unhealthy")
## `summarise()` regrouping output by 'Item_name' (override with `.groups` argument)
#Pollutants causing Hazardous & Unhealthy air quality
pollution_status_count
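
As a side note, the nested ifelse() calls above depend on assignment side effects inside a row-wise loop. A vectorised equivalent, sketched here with dplyr's case_when() (same thresholds, same resulting ordered factor), would be:

# Sketch: vectorised alternative to the row-wise ifelse() loop
df_join <- df_join %>%
  mutate(Pollution_status = case_when(
    Average_value <= Good     ~ "Good",
    Average_value <= Normal   ~ "Fair",
    Average_value <= Bad      ~ "Poor",
    Average_value <= Very_bad ~ "Unhealthy",
    TRUE                      ~ "Hazardous") %>%
      factor(levels = c("Hazardous", "Unhealthy", "Poor", "Fair", "Good"),
             ordered = TRUE))

Because case_when() evaluates its conditions in order, each row takes the first matching label, reproducing the nested comparisons without the loop.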

Tidy & Manipulate Data II

To reduce the dimension of the data frame and to attach the 'Unit_of_measurement' variable to the pollutant name variable 'Item_name', the unite() function from the tidyr package was used to combine the Item_name and Unit_of_measurement columns into a single column named "Pollutant_Unit", using the separator "_". As a quick check for missing values, sum(is.na()) was applied and confirmed that there were no missing values in the dataset at this point.

In the current format of the df_join dataset, multiple variables (the pollutants SO2, NO2, CO, O3, PM10 and PM2.5) are stored in a single column, Pollutant_Unit. To analyse the relationships among the pollutants at a given time, to assess the underlying distribution of each pollutant, and to apply predictive machine learning models for the amount of each pollutant in the air, the dataset needs to satisfy the three precepts of tidy data, which it currently violates:

 * Each variable must have its own column.
 * Each observation must have its own row.
 * Each value must have its own cell.

Therefore, for further analysis of the distribution of each pollutant and the relationships among them, the tidyr package's spread() function was used with the Pollutant_Unit column as the 'key' and the Average_value column as the 'value', creating one new column per pollutant. The resulting wide-format data frame was named tidy_df. Before spreading, select() from the dplyr package was used to keep only the columns required for further analysis: Measurement_date, Average_value, Pollutant_Unit and Instrument_status.

# This is the R chunk for the Tidy & Manipulate Data II 
#Uniting item name & unit of measurement columns
df_join <- df_join %>% unite("Pollutant_Unit",
                             Item_name, Unit_of_measurement, sep = "_")

sum(is.na(df_join)) #At this point no NA in the data frame.
## [1] 0
# Selecting and arranging required variables as per tidy rules using select() and spread()
tidy_df <- df_join %>% 
  select(Measurement_date, Average_value, Pollutant_Unit, Instrument_status) %>%
  spread(key = Pollutant_Unit, value = Average_value)

head(tidy_df, 3)
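
Note that spread() has been superseded by pivot_wider() in tidyr 1.0.0 and later; an equivalent reshape, as a sketch, would be:

# Sketch: the same wide reshape with the newer pivot_wider()
tidy_df <- df_join %>% 
  select(Measurement_date, Average_value, Pollutant_Unit, Instrument_status) %>%
  pivot_wider(names_from = Pollutant_Unit, values_from = Average_value)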

Scan I

Special Values

The new data frame was then scanned for special values such as -Inf, Inf and NaN. To perform this scan, a function named is.special was written to flag infinite and NaN values in numeric columns. sapply() combined with sum() was then applied over the data frame with is.special to obtain the total number of special values (if present) in each column; all counts were zero.

Missing Values

Next, to find the total number of missing values (NA), sum(is.na()) was applied, revealing 1446 NAs in the data frame. These NAs were introduced during the reshaping of the data frame into tidy format. Missing values are handled and explained in later steps.
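
To see where these NAs sit, a per-column count could be taken, for example:

colSums(is.na(tidy_df))  # NA counts per column; the NAs come from the reshape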

Obvious Inconsistencies or Errors

Obvious inconsistencies (if present) were checked using an edit rules text file and the editrules package in R. The rules file contains the following rules:

Numerical rules:

CO_ppm > 0
NO2_ppm > 0
O3_ppm > 0
PM10_Microgramperm3 > 0
PM2.5_Microgramperm3 > 0
SO2_ppm > 0

Categorical rule:

Instrument_status %in% c("Normal", "Need for calibration", "Abnormal", "Power cut off", "Under repair", "Abnormal data")

The violatedEdits() function was used to search for violated edits, and the result was summarised with summary(), which showed violations of the numerical rules only. Using plot(), the violation frequencies were plotted for the 7 edits. The count versus number-of-violations plot showed that 8137 of the 8626 records had no violation.

To explore the violations, the dataset was piped and filtered for values less than or equal to 0 in all six numeric pollutant columns using filter(). Examining these rows showed -1 and 0 values in the pollutant averages, recorded when the instrument status was "Abnormal", "Need for calibration", "Power cut off" or "Abnormal data". These -1 and 0 values were therefore removed with filter(), and the result was stored in a new data frame named tidy_clean.

Once again, the sum of NA values and the violated edit rules were checked; the tidy_clean dataset contained no NAs (missing values) and no violations.

# This is the R chunk for the Scan I

is.special <- function(x){
  # Flag Inf and NaN entries; non-numeric columns return all-FALSE so sum() is 0
  if (is.numeric(x)) is.infinite(x) | is.nan(x) else rep(FALSE, length(x))
}

sapply(tidy_df, function(x) sum(is.special(x)))
##     Measurement_date    Instrument_status               CO_ppm 
##                    0                    0                    0 
##              NO2_ppm               O3_ppm  PM10_Microgramperm3 
##                    0                    0                    0 
## PM2.5_Microgramperm3              SO2_ppm 
##                    0                    0
sum(is.na(tidy_df)) #After reshaping the data frame to tidy format, NAs were introduced.
## [1] 1446
Rules <- editfile("rules.txt", type = "all")
Rules
## 
## Data model:
## dat1 : Instrument_status %in% c('Abnormal', 'Abnormal data', 'Need for calibration', 'Normal', 'Power cut off', 'Under repair') 
## 
## Edit set:
## num1 : 0 < CO_ppm
## num2 : 0 < NO2_ppm
## num3 : 0 < O3_ppm
## num4 : 0 < PM10_Microgramperm3
## num5 : 0 < PM2.5_Microgramperm3
## num6 : 0 < SO2_ppm
Violated <- violatedEdits(Rules, tidy_df)

# summary of violated rules
summary(Violated)
## Edit violations, 8626 observations, 0 completely missing (0%):
## 
##  editname freq  rel
##      num5   24 0.3%
##      num4   16 0.2%
##      num6    8 0.1%
##      num1    4   0%
##      num2    4   0%
##      num3    2   0%
## 
## Edit violations per record:
## 
##  errors freq   rel
##       0 8137 94.3%
##       1  184  2.1%
##       2   56  0.6%
##       3    2    0%
##       4   54  0.6%
##       5  172    2%
##       6   21  0.2%
plot(Violated)

check <- tidy_df %>%  
  filter((SO2_ppm <= 0) | (CO_ppm <= 0) | (O3_ppm <= 0) | (NO2_ppm <= 0) |
         (PM10_Microgramperm3 <= 0) | (PM2.5_Microgramperm3 <= 0))

head(check, 3) # -1 and 0 values are mostly extracted when the instrument status was "Abnormal", "Need for calibration", "Power cut off" or "Abnormal data".
tidy_clean <- tidy_df %>% 
  filter(!((SO2_ppm <= 0) | (CO_ppm <= 0) | (O3_ppm <= 0) | (NO2_ppm <= 0) |
           (PM10_Microgramperm3 <= 0) | (PM2.5_Microgramperm3 <= 0)))

sum(is.na(tidy_clean)) #NA's were also cleaned along with -1 & 0 values.
## [1] 0
head(tidy_clean, 3)
Violated_clean <- violatedEdits(Rules, tidy_clean)
summary(Violated_clean)
## No violations detected, 0 checks evaluated to NA
## NULL
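
As a final sanity check on this step, the number of records dropped by the cleaning filter can be confirmed directly, e.g.:

nrow(tidy_df) - nrow(tidy_clean)  # records removed along with the -1/0 values and NAs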

Scan II

Tukey's method of outlier detection (boxplots) was used to detect outliers in the six numeric pollutant columns, as these variables have non-symmetric (right-skewed) distributions. The boxplots detected outliers as follows:

CO_ppm had 369, NO2_ppm had 55, O3_ppm had 88, PM10_Microgramperm3 had 543, PM2.5_Microgramperm3 had 621 and SO2_ppm had 478 outliers, as per the boxplot fences.

To plot the boxplots, the dataset was first subset to the six numeric pollutant columns using select(). A for loop with boxplot() was then used to draw the six boxplots, and inside the same loop the length of each boxplot object's $out component was used to count the outliers in each column.

All these outliers were recorded while the instrument status was Normal, and they amount to more than 5% of the observations in the dataset, so dropping them would be an inappropriate approach. At the same time, outliers can adversely affect various statistical and machine learning models. To handle them, the capping (winsorising) method was used: observations below the lower limit were replaced with the 5th percentile value and those above the upper limit with the 95th percentile value. Before replacing the outliers, summary() was used to check the descriptive statistics of all six pollutant variables. A cap() function was written and applied with sapply() to cap all six numeric pollutant columns.

Boxplots were plotted again to check the modified distributions after capping. The descriptive summaries of the columns were also rechecked: the min, max and mean values changed while the median values stayed the same.

# This is the R chunk for the Scan II
tidy_clean_sub <- tidy_clean %>% select(-(1:2))

col_names <- colnames(tidy_clean_sub)

# Tukey's method of outlier detection: boxplot of each pollutant
par(mfrow = c(2, 3))
for (i in 1:6) {
  bp <- boxplot(unlist(tidy_clean_sub[, i]), col = "dodgerblue",
                main = paste0("Boxplot of ", col_names[i]))
  cat(col_names[i], "has", length(bp$out), "outliers\n")
}
## CO_ppm has 369 outliers
## NO2_ppm has 55 outliers
## O3_ppm has 88 outliers
## PM10_Microgramperm3 has 543 outliers
## PM2.5_Microgramperm3 has 621 outliers
## SO2_ppm has 478 outliers

# The pollutants have right-skewed distributions as per the boxplots.
# All these outliers were recorded when the instrument status was Normal. Moreover, the outliers are more than 5% of the total observations, so dropping them would be inappropriate.

summary(tidy_clean_sub)
##      CO_ppm           NO2_ppm            O3_ppm        PM10_Microgramperm3
##  Min.   : 0.1000   Min.   :0.00200   Min.   :0.00100   Min.   :  3.00     
##  1st Qu.: 0.4000   1st Qu.:0.01800   1st Qu.:0.00900   1st Qu.: 21.00     
##  Median : 0.5000   Median :0.02700   Median :0.02100   Median : 31.00     
##  Mean   : 0.5973   Mean   :0.03014   Mean   :0.02294   Mean   : 38.43     
##  3rd Qu.: 0.7000   3rd Qu.:0.04100   3rd Qu.:0.03300   3rd Qu.: 46.00     
##  Max.   :11.0000   Max.   :0.08900   Max.   :0.13700   Max.   :516.00     
##  PM2.5_Microgramperm3    SO2_ppm        
##  Min.   :  1.00       Min.   :0.001000  
##  1st Qu.: 12.00       1st Qu.:0.003000  
##  Median : 18.00       Median :0.004000  
##  Mean   : 23.38       Mean   :0.003869  
##  3rd Qu.: 27.00       3rd Qu.:0.004000  
##  Max.   :513.00       Max.   :0.096000
cap <- function(x){
    quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
    # Below the lower Tukey fence (Q1 - 1.5*IQR): replace with the 5th percentile
    x[x < quantiles[2] - 1.5*IQR(x)] <- quantiles[1]
    # Above the upper Tukey fence (Q3 + 1.5*IQR): replace with the 95th percentile
    x[x > quantiles[3] + 1.5*IQR(x)] <- quantiles[4]
    x
}

# Apply a user defined function "cap" to a data frame
AQ_capped <- sapply(tidy_clean_sub, FUN = cap)

par(mfrow = c(2, 3))
for (i in 1:6) {
  boxplot(AQ_capped[, i], col = "dodgerblue",
          main = paste0("Boxplot after outliers\ncapped: ", col_names[i]))
}

summary(AQ_capped)
##      CO_ppm          NO2_ppm            O3_ppm        PM10_Microgramperm3
##  Min.   :0.1000   Min.   :0.00200   Min.   :0.00100   Min.   : 3.00      
##  1st Qu.:0.4000   1st Qu.:0.01800   1st Qu.:0.00900   1st Qu.:21.00      
##  Median :0.5000   Median :0.02700   Median :0.02100   Median :31.00      
##  Mean   :0.5803   Mean   :0.03001   Mean   :0.02262   Mean   :36.56      
##  3rd Qu.:0.7000   3rd Qu.:0.04100   3rd Qu.:0.03300   3rd Qu.:46.00      
##  Max.   :1.1000   Max.   :0.07500   Max.   :0.06900   Max.   :91.00      
##  PM2.5_Microgramperm3    SO2_ppm        
##  Min.   : 1.00        Min.   :0.002000  
##  1st Qu.:12.00        1st Qu.:0.003000  
##  Median :18.00        Median :0.004000  
##  Mean   :21.85        Mean   :0.003722  
##  3rd Qu.:27.00        3rd Qu.:0.004000  
##  Max.   :58.00        Max.   :0.006000
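
That the medians are untouched by capping (only tail values are replaced) can be verified directly, for instance:

# Sketch: medians before and after capping should match
sapply(tidy_clean_sub, median)
apply(AQ_capped, 2, median)  # AQ_capped is still a matrix at this point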

Transform

Huge differences in scale across variables can cause problems in statistical analysis and data modelling; for example, distance-based machine learning methods are sensitive to the scale of the variables, which can impact prediction accuracy. In this dataset, four pollutants (SO2, NO2, O3, CO) are measured in ppm and two (PM10, PM2.5) in Microgramperm3, with very different value ranges. Therefore, data normalisation was applied by writing a minmaxnormalise() function and applying it with lapply(), creating a matching scale across all columns whilst maintaining the underlying data distribution of each variable. The normalised data was stored as a data frame named 'norm_AQ_capped' using as.data.frame().

Afterwards, different transformation techniques were tried, and the square root transformation was found to work well in reducing the skewness of the underlying distributions. It was therefore applied to reduce the right skewness of the PM10 and PM2.5 columns (due to page constraints, only these two columns' transformations are shown), bringing their distributions closer to normal.

# This is the R chunk for the Transform Section
AQ_capped <- as.data.frame(AQ_capped)
head(AQ_capped, 3) 
#Data Normalisation:
minmaxnormalise <- function(x){(x- min(x)) /(max(x)-min(x))}
norm_AQ_capped <- as.data.frame(lapply(AQ_capped, minmaxnormalise))
head(norm_AQ_capped, 3)
par(mfrow = c(1, 2))
hist(norm_AQ_capped$PM10_Microgramperm3, main = "Hist of PM10_Microgram/m3", xlab = "PM10_Microgram/m3", col = "dodgerblue")
hist(norm_AQ_capped$PM2.5_Microgramperm3, main = "Hist of PM2.5_Microgram/m3", xlab = "PM2.5_Microgram/m3", col = "dodgerblue")

#Square Root transformation:
par(mfrow=c(1,2))
sqrt_PM10 <- sqrt(norm_AQ_capped$PM10_Microgramperm3)
hist(sqrt_PM10, main = "Hist of SQRT\nPM10_Microgram/m3", xlab = "Sqrt of PM10_Microgram/m3", col = "dodgerblue")
sqrt_PM2.5 <- sqrt(norm_AQ_capped$PM2.5_Microgramperm3)
hist(sqrt_PM2.5, main = "Hist of SQRT\nPM2.5_Microgram/m3", xlab = "Sqrt of PM2.5_Microgram/m3", col = "dodgerblue")
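
Min-max normalisation is one scaling choice; where a model benefits from zero-centred inputs, z-score standardisation would be an alternative, sketched here with base R's scale():

# Sketch: alternative scaling via z-score standardisation of the capped data
z_AQ_capped <- as.data.frame(scale(AQ_capped))
head(z_AQ_capped, 3)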

References

Kaggle (2020). Air Pollution in Seoul. https://www.kaggle.com/bappekim/air-pollution-in-seoul