Required Packages

library(readr)     # To import data from CSV files
library(dplyr)     # To manipulate data
library(tidyr)     # To tidy data
library(lubridate) # To deal with date/time data
library(editrules) # To check for obvious errors and missing values
library(stringr)   # To deal with string/character data
library(knitr)     # To create nice tables
library(formatR)   # For tidy code chunks

Executive Summary

• Data pre-processing is an important step in the data science process because the analysis results depend on the quality of the input data: the less garbage goes into the analysis, the less garbage comes out.

• This document describes the data pre-processing steps performed, prior to analysis, on data sets containing air pollution measurement information for Seoul, South Korea.

• The analysis aims to gain insights into pollution during 2019 at station code 101, based on the average measured values of the pollutants.

• The data required for this analysis are available in the following 2 data sets, measured every hour between 2017 and 2019.

Measurement info: Air pollution measurement information

Measurement item info: Information on air pollution measurement items

• Hence, these 2 data sets were loaded into R, and their structures and contents were checked for better understanding.

• The necessary data were then filtered and the 2 tables joined to obtain the pollutant information along with the measured average values.

• The data were then converted to the correct data types, and missing values and obvious errors were checked and corrected.

• A new column describing pollution status was then mutated by comparing the average measured values of the pollutants with the pollutant standard index ranges, to determine the pollution status and which pollutants were causing pollution.

• Checking the counts of each pollutant across the pollution status levels showed that PM10 and PM2.5 are the main pollutants.

• Tidy data steps (spreading the pollutants across columns) were then carried out for further analysis, and the data were checked for possible outliers.

• Data normalisation and transformations were applied to reduce the effect of outliers and to reduce the skewness. A cube root power transformation worked well to reduce the skewness of the PM10 and PM2.5 variables.

• The data are now ready for analysis.

Data

The data sets considered for this analysis deal with air pollution measurement information in Seoul, South Korea. The data provide average values for six pollutants (SO2, NO2, CO, O3, PM10, PM2.5), measured every hour between 2017 and 2019 across 25 districts in Seoul.

1st Data set

Measurement info: Air pollution measurement information

It has 3885066 observations and 5 variables as below.

Measurement date: measurement date and time (date and time data stored in POSIXct date-time format)

Station code: code of the measurement station (integer data stored as double/numeric)

Item code: code of the pollutant (integer data stored as double/numeric)

Average value: measured average value of the pollutant (double/numeric data stored as double/numeric)

Instrument status: status of the instrument used for measurement (categorical data stored as double/numeric)

(0: Normal, 1: Need for calibration, 2: Abnormal, 4: Power cut off, 8: Under repair, 9: Abnormal data)

2nd Data set

Measurement item info: Information on air pollution measurement items

It has 6 observations and 7 variables, as below.

Item code: code of the pollutant (integer data stored as double/numeric)

Item name: name of the pollutant (character data stored as character)

Unit of measurement: scale of the measurement (ppm, microgram/cubic meter) (character data stored as character)

Good: pollutant standard index good range (double/numeric data stored as double/numeric)

Normal: pollutant standard index normal range (double/numeric data stored as double/numeric)

Bad: pollutant standard index bad range (double/numeric data stored as double/numeric)

Very bad: pollutant standard index very bad range (double/numeric data stored as double/numeric)

Loading, checking structures and joining data sets

The required data sets are available as CSV files, so they were loaded into R using the read_csv() function from the readr package.

For easier reference, the column names were changed by supplying a vector of names to the col_names argument of read_csv(), without changing their meaning.

First, we loaded the measurement information data set.

To ensure the data loaded correctly and to know the dimensions of the data frame, its structure was checked using the str() and dim() functions. (The str() output is not shown due to the page constraint.)

Then the required data for the year 2019 and station 101 were filtered using the filter() function from the dplyr package, and the dimensions, structure and first few records of the data frame were checked using dim(), str() and head(). (The str() output is not shown due to the page constraint.)

The measurement item data were also loaded using read_csv() with the column names supplied, and their dimensions, structure and first few records were checked using dim(), str() and head(). (The str() output is not shown due to the page constraint.)

The information data set was then joined with the item data set using left_join(), as the information set is the main measurement data. The structure and first few records of the joined data were checked using str() and head(). (A quick sanity check on the join is sketched after the code below.)

#Setting the working directory to the files' location
setwd("C:/Users/sneha/Desktop")

#Loading measurement information data
info <- read_csv("Measurement_info.csv",skip=1,
                 col_names=c("Measurement_date","Station_code","Item_code",
                             "Average_value","Instrument_status"))
## Parsed with column specification:
## cols(
##   Measurement_date = col_datetime(format = ""),
##   Station_code = col_double(),
##   Item_code = col_double(),
##   Average_value = col_double(),
##   Instrument_status = col_double()
## )
#Checking dimensions of info data frame
dim(info)
## [1] 3885066       5
#Checking structure of info data frame
#str(info)                 

#Filtering required data
info_2019_101 <- info %>% filter(year(Measurement_date)== 2019 & Station_code== 101)

#Checking dimensions of the filtered data frame
dim(info_2019_101)
## [1] 50310     5
#Checking the structure and dimensions of the filtered data
#str(info_2019_101)

#Checking first 3 records of the filtered data
head(info_2019_101,n=3)
#Loading measurements items data
item <- read_csv("Measurement_item_info.csv",skip=1,
                 col_names=c("Item_code","Item_name","Unit_of_measurement",
                             "Good","Normal","Bad","Very_bad"))
## Parsed with column specification:
## cols(
##   Item_code = col_double(),
##   Item_name = col_character(),
##   Unit_of_measurement = col_character(),
##   Good = col_double(),
##   Normal = col_double(),
##   Bad = col_double(),
##   Very_bad = col_double()
## )
#Checking dimensions of item data frame
dim(item)
## [1] 6 7
#Checking item data structure and dimensions
#str(item)

#Checking first 3 records of the item data set
head(item,n=3)
#Joining both data sets to get the required data set for analysis
item_info <- info_2019_101 %>% left_join(item,by = "Item_code")

#Checking the structure and dimensions of the joined data set
str(item_info)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 50310 obs. of  11 variables:
##  $ Measurement_date   : POSIXct, format: "2019-01-01 01:00:00" "2019-01-01 01:00:00" ...
##  $ Station_code       : num  101 101 101 101 101 101 101 101 101 101 ...
##  $ Item_code          : num  1 3 6 8 5 9 1 3 5 6 ...
##  $ Average_value      : num  0.004 0.061 0.002 37 1.1 22 0.003 0.054 0.9 0.002 ...
##  $ Instrument_status  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Item_name          : chr  "SO2" "NO2" "O3" "PM10" ...
##  $ Unit_of_measurement: chr  "ppm" "ppm" "ppm" "Mircrogram/m3" ...
##  $ Good               : num  0.02 0.03 0.03 30 2 15 0.02 0.03 2 0.03 ...
##  $ Normal             : num  0.05 0.06 0.09 80 9 35 0.05 0.06 9 0.09 ...
##  $ Bad                : num  0.15 0.2 0.15 150 15 75 0.15 0.2 15 0.15 ...
##  $ Very_bad           : num  1 2 0.5 600 50 500 1 2 50 0.5 ...
#Checking first 3 records of the joined data set
#print(head(item_info,n=3),width=7)
head(item_info,3)
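
For a quick sanity check on the join (not included in the original output), the joined data should have the same number of rows as the filtered measurement data, since left_join() keeps every measurement record; a minimal sketch:

#Sketch: confirming the left join neither added nor dropped measurement records
nrow(item_info) == nrow(info_2019_101)  #expected to be TRUE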

Understand

From the parsed column information/structures of the data sets above, we can see that Station_code and Item_code were stored as double/numeric data. Their data types were therefore checked using the typeof() function, converted to integer using as.integer() (as they are integer data), and checked again to confirm the changes.

Instrument status is categorical data stored as numeric data, so it was converted to a factor using the factor() function, specifying its levels and labels as per the source description.

To confirm the changes, the frequencies of each instrument status were checked using the table() function before and after the conversion, and the structure and first few records of the column were checked using str() and head().

Now the data set contains date, character, integer, double and factor variables, and the factor variable is levelled and labelled. (A quick check of the column classes is sketched after the code below.)

#Checking data type of item_code
typeof(item_info$Item_code)
## [1] "double"
#Changing item_code as integer
item_info$Item_code <- as.integer(item_info$Item_code)

#Ensuring the change
typeof(item_info$Item_code)
## [1] "integer"
#Checking data type of station_code
typeof(item_info$Station_code)
## [1] "double"
#Changing station_code as integer
item_info$Station_code <- as.integer(item_info$Station_code)

#Ensuring the change
typeof(item_info$Station_code)
## [1] "integer"
#Checking the frequencies of different instrument status values
item_info$Instrument_status %>% table()
## .
##     0     1     2     4     9 
## 49729   314   168    48    51
#Converting instrument_status as factor 
item_info$Instrument_status <- item_info$Instrument_status %>% 
  factor(levels=c(0,1,2,4,8,9),
  labels=c("Normal","Need for calibration","Abnormal",
           "Power cut off","Under repair","abnormal data"))

#Checking the frequencies of different instrument status values after converting it into a factor
item_info$Instrument_status %>% table()
## .
##               Normal Need for calibration             Abnormal 
##                49729                  314                  168 
##        Power cut off         Under repair        abnormal data 
##                   48                    0                   51
#Checking the structure of the data to ensure the changes
str(item_info$Instrument_status)
##  Factor w/ 6 levels "Normal","Need for calibration",..: 1 1 1 1 1 1 1 1 1 1 ...
#Ensuring the changes done to instrument_status
head(item_info$Instrument_status)
## [1] Normal Normal Normal Normal Normal Normal
## 6 Levels: Normal Need for calibration Abnormal Power cut off ... abnormal data
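
As a further check that the conversions above produced the intended types, the class of every column could be listed; a minimal sketch (not part of the original output):

#Sketch: listing the class of each column after the type conversions
sapply(item_info, class)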

Scan I

Checking for Missing values

Before exploring the data further, each column of the data set was checked for missing values using the is.na() and sum() functions inside sapply().

Checking for obvious errors

As there were no missing values, the data were checked for obvious errors using the violatedEdits() function from the editrules package.

To find the errors, an edit rules text file was created with the required rules, as below. (A sketch of writing this file from R is shown after the rules.)

Edit rules

Numerical Rules

Station_code==101

Item_code %in% c(1,3,5,6,8,9)

Average_value>=0

Good >0

Normal>0

Bad>0

Very_bad>0

Categorical Rules

Instrument_status %in% c("Normal","Need for calibration","Abnormal","Power cut off","Under repair","abnormal data")

Item_name %in% c("SO2","NO2","O3","PM10","CO","PM2.5")

Unit_of_measurement %in% c("ppm","Microgram/m3")

Mixed Rules

if (Item_name %in% c("SO2","NO2","O3","CO")) Unit_of_measurement == "ppm"

if (Item_name %in% c("PM10","PM2.5")) Unit_of_measurement == "Microgram/m3"
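
For reference, the Edit_Rules.txt file containing the rules above can be created directly from R; the following is a minimal sketch (illustrative only; the original file may have been prepared in a text editor):

#Sketch: writing the edit rules listed above to Edit_Rules.txt
rules_text <- c(
  "Station_code == 101",
  "Item_code %in% c(1,3,5,6,8,9)",
  "Average_value >= 0",
  "Good > 0",
  "Normal > 0",
  "Bad > 0",
  "Very_bad > 0",
  "Instrument_status %in% c('Normal','Need for calibration','Abnormal','Power cut off','Under repair','abnormal data')",
  "Item_name %in% c('SO2','NO2','O3','PM10','CO','PM2.5')",
  "Unit_of_measurement %in% c('ppm','Microgram/m3')",
  "if (Item_name %in% c('SO2','NO2','O3','CO')) Unit_of_measurement == 'ppm'",
  "if (Item_name %in% c('PM10','PM2.5')) Unit_of_measurement == 'Microgram/m3'"
)
writeLines(rules_text, "Edit_Rules.txt")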

Fixing the errors

Using the edit rules with the violatedEdits() function, errors were found in Unit_of_measurement and Average_value.

The reasons for the errors were found by checking the frequencies of Unit_of_measurement and the unique average values, and by filtering the average values less than 0. (Due to the page constraint, the code for the unique values is not included.)

There was a typo in Unit_of_measurement, “Mircrogram/m3” instead of “Microgram/m3”, so the typo was replaced with the correct value using the str_replace() function from the stringr package.

The error with Average_value was negative readings. On checking the reasons for the negative readings, they were found to be due to instrument error (the “Need for calibration” status), so they were replaced with 0s, as we are not sure about the calibration value of the instrument used for the measurement.

The data were then rechecked for obvious errors to confirm the changes.

#Checking for missing values in all columns
sapply(item_info,function(x) sum(is.na(x)))
##    Measurement_date        Station_code           Item_code       Average_value 
##                   0                   0                   0                   0 
##   Instrument_status           Item_name Unit_of_measurement                Good 
##                   0                   0                   0                   0 
##              Normal                 Bad            Very_bad 
##                   0                   0                   0
#Setting up the rules to find out obvious errors
rules <- editfile("Edit_Rules.txt",type="all")
rules
## 
## Data model:
## dat1 : Instrument_status %in% c('Abnormal', 'abnormal data', 'Need for calibration', 'Normal', 'Power cut off', 'Under repair')
## dat2 : Item_code %in% c('1', '3', '5', '6', '8', '9')
## dat3 : Item_name %in% c('CO', 'NO2', 'O3', 'PM10', 'PM2.5', 'SO2')
## dat4 : Unit_of_measurement %in% c('Microgram/m3', 'ppm') 
## 
## Edit set:
## num1 : Station_code == 101
## num2 : 0 <= Average_value
## num3 : 0 < Good
## num4 : 0 < Normal
## num5 : 0 < Bad
## num6 : 0 < Very_bad
## cat7 : if( Item_name %in% c('CO', 'NO2', 'O3', 'SO2') ) Unit_of_measurement != 'Microgram/m3'
## cat8 : if( Item_name %in% c('PM10', 'PM2.5') ) Unit_of_measurement != 'ppm'
#Finding out violated rules
violated <- violatedEdits(rules,item_info)
violated %>% summary()
## Edit violations, 50310 observations, 0 completely missing (0%):
## 
##  editname  freq   rel
##      dat4 16770 33.3%
##      num2     7    0%
## 
## Edit violations per record:
## 
##  errors  freq   rel
##       0 33540 66.7%
##       1 16763 33.3%
##       2     7    0%
#Finding out the causes of errors
item_info$Unit_of_measurement %>% table()
## .
## Mircrogram/m3           ppm 
##         16770         33540
#Replacing the typo error with correct value 
item_info$Unit_of_measurement <- str_replace(item_info$Unit_of_measurement,
                                        pattern="Mircrogram/m3",
                                        replacement="Microgram/m3")
#Checking unique values of Average value
#item_info$Average_value %>% sort() %>% unique()

#Checking the negative values and the instrument_status for those values
item_info %>% filter (Average_value<0)
#Replacing the negative values with 0's
for (i in (1:nrow(item_info))){
  if(item_info$Average_value[i] < 0){
    item_info$Average_value[i] <- 0
  }
}

#Ensuring of the changes  
item_info %>% filter (Average_value<0)
#Rechecking for the obvious errors and missing values
violated <- violatedEdits(rules,item_info)
violated %>% summary()
## No violations detected, 0 checks evaluated to NA
## NULL

Tidy & Manipulate Data I - Mutating a new column

To understand which pollutant causes the most pollution and the overall status of the pollution, a new column indicating the pollution status was added.

To add this new column, we used the mutate() function from the dplyr package and initially assigned placeholder values to the column. Then, using a for() loop and if-else statements, the pollution status values were generated by comparing the average values of the pollutants with the pollutant standard index columns Good, Normal, Bad and Very_bad.

To ensure the data were created correctly, the first few records of pollution_status, Average_value and the standard index columns were checked.

As the pollution status has different levels indicating increasing pollution, it was converted to a factor with labels and ordered levels.

To confirm the type change, its structure was checked before and after using str(). (A vectorised alternative to the loop is sketched at the end of this section.)

# Adding a column to data frame using mutate function
item_info <- item_info %>% mutate("pollution_status"= "NA")

#Generating the pollution status from average and standard index values
for (i in (1:nrow(item_info))){
  if(item_info$Average_value[i]<= item_info$Good[i]){
    item_info$pollution_status[i]="No Pollution"
  } else if (item_info$Average_value[i]>= item_info$Good[i] & 
             item_info$Average_value[i]<=item_info$Normal[i]){
    item_info$pollution_status[i]="Normal Pollution"
  }else if (item_info$Average_value[i]>=item_info$Normal[i] & 
            item_info$Average_value[i]<= item_info$Bad[i]){
    item_info$pollution_status[i]="Bad Pollution"
  }else if(item_info$Average_value[i]>= item_info$Bad[i] & 
           item_info$Average_value[i]<=item_info$Very_bad[i]){
    item_info$pollution_status[i]="Very Bad Pollution"
  }else{
    item_info$pollution_status[i]="Extremely Bad Pollution"
  }
}
#Checking the generated pollution status values
item_info[c(4,8,9,10,11,12)] %>% head(n=3)
#Checking the frequencies of pollution status 
item_info$pollution_status %>% table()
## .
##           Bad Pollution Extremely Bad Pollution            No Pollution 
##                    1964                       1                   34838 
##        Normal Pollution      Very Bad Pollution 
##                   13207                     300
#Checking the type and structure of pollution status
item_info$pollution_status %>% str()
##  chr [1:50310] "No Pollution" "Bad Pollution" "No Pollution" ...
#Converting pollution status to factor data with levels , labels and order
item_info$pollution_status <- item_info$pollution_status %>% factor(
  levels=c("No Pollution","Normal Pollution","Bad Pollution",
           "Very Bad Pollution","Extremely Bad Pollution"),
  labels=c("No Pollution","Normal Pollution","Bad Pollution",
           "Very Bad Pollution","Extremely Bad Pollution"),
  ordered=TRUE
)

#Re-checking the frequencies of pollution status
item_info$pollution_status %>% table()
## .
##            No Pollution        Normal Pollution           Bad Pollution 
##                   34838                   13207                    1964 
##      Very Bad Pollution Extremely Bad Pollution 
##                     300                       1
#Ensuring of the type changes to pollution status
item_info$pollution_status %>% str()
##  Ord.factor w/ 5 levels "No Pollution"<..: 1 3 1 2 1 2 1 2 1 1 ...
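
For reference, the for loop used above could be written as a single vectorised mutate() call with dplyr's case_when(); this is a sketch of an equivalent alternative (it would replace the loop, before the factor conversion), not the code used in this report:

#Sketch: a vectorised equivalent of the for loop above using case_when()
item_info <- item_info %>%
  mutate(pollution_status = case_when(
    Average_value <= Good     ~ "No Pollution",
    Average_value <= Normal   ~ "Normal Pollution",
    Average_value <= Bad      ~ "Bad Pollution",
    Average_value <= Very_bad ~ "Very Bad Pollution",
    TRUE                      ~ "Extremely Bad Pollution"
  ))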

Exploration of pollution status

To understand the different pollution statuses across the pollutants, counts of pollution status grouped by pollutant were generated using the group_by() and summarise() functions, and arranged in descending order of pollutant and pollution_status using the arrange() function from the dplyr package. The result is stored in pollutants_status_counts.

Filtering pollutants_status_counts on the condition that the pollution status is very bad or extremely bad, using the filter() function from the dplyr package, shows that PM10 and PM2.5 are the pollutants causing very bad pollution.

#Checking the pollutants counts across different pollution status
pollutants_status_counts <- item_info %>% 
  group_by(Item_name,pollution_status) %>% 
  summarise(count=n()) %>% 
  arrange(desc(Item_name),desc(pollution_status))

#Checking pollutants_status_counts
pollutants_status_counts %>% head(3)
#Getting the pollutants causing very bad pollution
pollutants_status_counts %>% filter(pollution_status=="Very Bad Pollution" |
                             pollution_status=="Extremely Bad Pollution")

Tidy & Manipulate Data II - changing data as per Tidy rules for further analysis

To assess the overall pollution in a given hour based on the average values of all the pollutants, or to explore any one pollutant or the relation between any two pollutants, our data are not in a suitable format. So, following the tidy data rules, we spread the pollutants across columns using the spread() function from the tidyr package, with the average values as the values under these columns.

As we will work with the pollutants' average values for each hour, only the measurement date, item name and average value were selected.

Before that, we united the Item_name and Unit_of_measurement columns using the unite() function from the tidyr package, to avoid redundancy and to keep the unit of measurement information.

However, some of the variables are measured in ppm and some in micrograms per cubic meter.

So we used the min-max data normalisation technique, by defining a function for it. Data normalisation also reduces the effect of outliers, if present. (A quick check of the normalised ranges is sketched after the code below.)

#Uniting item name and unit of measurement
item_info <- item_info %>% unite("ItemName_Measurement",
                                 Item_name,Unit_of_measurement,sep="_")

#Selecting and arranging the required variables as per tidy rules using select() and spread()
pollutants <- item_info %>% 
  select(Measurement_date,ItemName_Measurement,Average_value) %>%
  spread(key=ItemName_Measurement,value=Average_value)

#Checking data after tidying it for further analysis
pollutants %>% head(3)
#Defining a function for min-max transformation
minmaxnormalise <- function(x){(x- min(x)) /(max(x)-min(x))}

#Applying min-max transformation to all the numeric variables in the data set
pollutants[,2:7] <- as.data.frame(lapply(pollutants[,2:7],function(x) minmaxnormalise(x)))

#Checking data after normalisation
pollutants %>% head(3)
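
As a quick check that the min-max normalisation behaved as expected, the range of each normalised column can be inspected; a minimal sketch (not part of the original output):

#Sketch: each normalised column should now span 0 to 1
sapply(pollutants[,2:7], range)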

Scan II

Checking for possible outliers

To explore the pollutants PM10 and PM2.5, which are causing the most pollution, and to find the relation between these two pollutants, we need to check for possible outliers. We used Tukey's method of outlier detection, which flags values lying more than 1.5 times the interquartile range below the first quartile or above the third quartile.

So, we plot boxplots for both variables and check the number of outliers from the boxplot results. (The fences themselves are sketched after the boxplot output.)

#Changing the layout of visualisation
par(mfrow=c(1,2))

#Getting number of outliers for each column using tukey's method
for (i in c(5,6)){
  result=boxplot(pollutants[i],main=colnames(pollutants[i]))
  cat(colnames(pollutants[i]),"has",length(result$out),"outliers\n")
}
## PM10_Microgram/m3 has 582 outliers

## PM2.5_Microgram/m3 has 636 outliers
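
For reference, the Tukey fences used by boxplot() can also be computed directly from the quartiles; a minimal sketch for PM10 (illustrative only; the count should closely match the boxplot result above):

#Sketch: computing Tukey's fences (Q1 - 1.5*IQR and Q3 + 1.5*IQR) for PM10
pm10 <- pollutants$`PM10_Microgram/m3`
q <- quantile(pm10, probs = c(0.25, 0.75))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
sum(pm10 < lower | pm10 > upper)  #number of possible outliers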

Handling outliers

From the above boxplots we can see the number of possible outliers for both variables. However, before handling the outliers we need to check their values and their cause/source; for that we inspect the unique values or the summary statistics of the two variables. (Due to the page constraint, the unique values are not provided.)

From the unique values of each column we can see that the possible outliers are not data entry errors. Some of them might be due to instrument errors; however, as we are not sure about the calibration of the instrument, we cannot remove or replace/impute these values with any other value. To reduce the effect of these possible outliers, that is, to reduce the skewness of the data, we apply transformations.

#Checking the sorted unique values of each column
#lapply(pollutants[,c(5,6)], function(x) sort(unique(x)))
#Checking summary statistics for min and max values
pollutants$`PM10_Microgram/m3` %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.04070 0.06008 0.07406 0.08721 1.00000
pollutants$`PM2.5_Microgram/m3` %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.02339 0.03509 0.04536 0.05263 1.00000

Transform

Transformations to reduce the skewness

Due to the possible outliers, the data for the variables PM10 and PM2.5 are skewed, as shown below in the histograms of both variables produced with the hist() function. To reduce the skewness, we need to apply transformations. Different transformations were tried for the PM10 and PM2.5 variables, and a cube root power transformation worked well to reduce the skewness, as below (a sketch comparing candidate transformations follows). The distributions of both transformed variables were then checked with hist() to confirm the reduction in skewness.
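
To illustrate how candidate transformations can be compared, the sketch below computes a simple moment-based skewness measure for PM10 under a few power transformations (the skewness() helper is defined here for illustration and is not part of the original code):

#Sketch: comparing the skewness of PM10 under candidate power transformations
#(simple moment-based skewness helper, defined for illustration only)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
x <- pollutants$`PM10_Microgram/m3`
sapply(list(raw = x, sqrt = sqrt(x), cube_root = x^(1/3)), skewness)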

#Changing the layout of visualisation
par(mfrow=c(1,2))
#Checking the distribution of variables with histogram
hist(pollutants$`PM10_Microgram/m3`,main="Histogram of PM10")
hist(pollutants$`PM2.5_Microgram/m3`,main="Histogram of PM2.5")

#Applying cube root transformation to both variables
pollutants$`PM10_Microgram/m3` <-(pollutants$`PM10_Microgram/m3`)^(1/3)
pollutants$`PM2.5_Microgram/m3` <- (pollutants$`PM2.5_Microgram/m3`) ^(1/3)

#Checking the distribution of variables with histogram after transformation
hist(pollutants$`PM10_Microgram/m3`,main="Histogram of PM10")
hist(pollutants$`PM2.5_Microgram/m3`,main="Histogram of PM2.5")

Now the data are ready for analysing the relation between the pollutants PM2.5 and PM10.

References:

Module Notes and Demo slides