The necessary packages to reproduce the report are loaded using library() function. It is assumed that the packages were already installed previously. The knitr::opts_knit$set is used to set working directory to the Project’s root directory to allow separate subfolders for data and documentation. The file to be processed is located in “data” subfolder while the R Markdown template file is located inside “doc” subfolder.
library(dplyr)
library(readr)
library(magrittr)
library(knitr)
library(tidyr)
library(stringr)
# set the global root directory to the Project's root directory in setup chunk
# so that it is recognised in all subsequent chunks
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
The aim of this report is to provide series of tasks necessary to preprocess some data pertaining to PM2.5 concentration in the air and its calculated impact to human’s mortality rate. The raw data are taken from 3 datasets in CSV format downloadable from publicly accessible websites maintained by Health Effect Institute and the World Bank.
The preprocessing involves 5 major tasks from data preprocessing framework namely get, understand, tidy & manipulate, scan, and transform. First of all, the raw data imported into R Studio and renaming and dropping of some redundant variables performed, followed by joining of datasets. Next, the structure and data types of variables were inspected and necessary data type conversion was performed. In the subsequent tasks, dataset was reshaped, proper handling of data inconsistencies applied and a new explanatory variable was added. The process continued with scanning of missing and any special values followed with appropriate handling. The detection of potential outliers followed through, however, the potential outliers were not handled immediately at this point due to distribution of the variables appeared to be right-skewed and hence, further investigation performed by using transformation which is part of the last task. The necessary data transformation performed using logarithms approach. After transformed, the numbers of potential outliers significantly reduced which suggests that they are rather appeared to be outliers due to the non-normality of the data. Data transformation concludes the whole data preprocessing task and the final dataset can be used as a base version for further statistical investigation and inference.
This report uses 3 datasets related to air pollutant concentration reading and its impact to mortality/death of human worldwide. The first dataset (air_pm25.csv) contains the annual mean of atmospheric particulate matter (PM) that have a diameter of less than 2.5 micrometers, abbreviated PM2.5, in 189 countries from year 1990 until 2018 while the second dataset (death_pm25.csv) contains the annual mean of age-standardized death rate in the same list of countries and period of the first dataset. These datasets can be downloaded from State of Global Air.
Original file of first dataset consists of 14 variables, while the second dataset consists of 22 variables. However for simplicity of this report, some variables which are not of interest were purposely removed from data file prior of being imported into R. The third dataset (population.csv) contains the yearly human population in 264 countries worldwide from 1960 up until 2018. This dataset can be downloaded from World Bank Open Data. Original file of third dataset consists of 63 variables. However for simplicity of this report, some variables which are not of interest were purposely removed from data file prior of being imported into R.
This report focused only on PM2.5 pollutant reading because this type of pollutant is considered to be the most health-harmful air pollutant, contributing to 20% of 3.8 million premature deaths of related respiratory illness and cancers [1]. The annual age-standardized death rate is defined as number of deaths (per 100,000 people) in a given year that likely occurred earlier than would be expected in the absence of air pollution. It is calculated value following “burden of disease” framework from Global Burden Disease (GBD) 2017 project [2].
The first dataset (air_pm25.csv) consists of 2,280 observations of 6 variables. The following is brief description of each variable:
The second dataset (death_pm25.csv) consists of 2,280 observations of 6 variables. The following is brief description of each variable:
The third dataset (population.csv) consists of 264 observations of 61 variables. The following is brief description of each variable:
The working directory is set to a predefined path in a physical directory in the computer using setwd() function. This step is actually redundant because the Project’s root directory already set previously in the Required Packages section (see above) and also has no effect in the subsequent notebook chunk as working directory will be reset when the chunk is finished running. It is put here to show an example of setting working directory when working outside R markdown environment.
setwd("~/Documents/Study/MATH2349/Assignments/Assignment3")
The data files (in CSV format) are read from subfolder “data” and imported into R using read_csv() function and named as “air”, “death”, and “population” respectively. The output then converted to tbl_df class using the as.tbl() function to suppress the automatic row numbering printed when displaying the output of the data frame. Upon reading “population” file, the first 4 rows need to be skipped because they contain non-data information.
air <- read_csv(file="data/air_pm25.csv") %>% as.tbl()
death <- read_csv(file="data/death_pm25.csv") %>% as.tbl()
population <- read_csv(file="data/population.csv", skip = 4) %>% as.tbl()
For air dataset, the variable “Name” renamed to “Region” using rename() function to give a more proper name that reflects their values. The dim() function is used to check and display the dimension of the data. Noted that air dataset consists of 2,280 observations with 6 variables. Header and first few observations of the data are displayed thereafter using head() function.
air <- rename(air, Region = Name)
dim(air)
## [1] 2280 6
head(air)
For death dataset, the variable “Name” renamed to “Region” while “Exposure Mean” renamed to “Death Mean” to give a more proper name that reflects their values. Noted that the dimension of death dataset is 2,280 observations with 6 variables. Header and first few observations of the data are displayed thereafter.
death <- rename(death, Region = Name, "Death Mean" = "Exposure Mean")
dim(death)
## [1] 2280 6
head(death)
For population dataset, noticed that it contains a redundant variable i.e. “Country Name” because it already contains “Country Code” variable which is the international standard organization (ISO) code for country and is sufficiently unique to identify a country. Furthermore, the first data set (i.e. air) already has a variable that stores the name for each country. Therefore, the population dataset being subset to exclude this variable. Noted that the dimension of population dataset is now 264 observations with 60 variables.
population <- population %>% select(-"Country Name")
dim(population)
## [1] 264 60
head(population)
Next process is the joining of datasets. As the join operation involves 3 datasets, then it needs to be performed as two-steps join. The first join to be performed by adding columns of second dataset (death) into first dataset (air) and the second join to be performed by adding columns of third dataset into the result of first join.
For the first join, the identified common columns used are “ISO3”, “Year”, and “Pollutant”. Prior of joining, the data type of these common columns in both datasets are checked to ensure its uniformity. This is checked using function str().
str(air, give.attr = FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2280 obs. of 6 variables:
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ ISO3 : chr "AFG" "AFG" "AFG" "AFG" ...
## $ Region : chr "Asia" "Afghanistan" "Afghanistan" "Asia" ...
## $ Exposure Mean: num 65 65 65 65 65 65 65 65 65 65 ...
## $ Year : num 1990 1990 1995 1995 2000 ...
## $ Pollutant : chr "pm25" "pm25" "pm25" "pm25" ...
str(death, give.attr = FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2280 obs. of 6 variables:
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ ISO3 : chr "AFG" "AFG" "AFG" "AFG" ...
## $ Region : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Year : num 1990 1995 2000 2005 2010 ...
## $ Death Mean: num 46 45 45 46 47 48 48 46 45 47 ...
## $ Pollutant : chr "pm25" "pm25" "pm25" "pm25" ...
It can be observed from the output above that the data types of identified common columns are consistent in both datasets and therefore can be safely used for join operation. It is also observed that the second dataset contains other columns that are redundant with the first dataset i.e. “Country” and “Pollutant”. Therefore, the second dataset being subset to exclude these variables before the joining. Noted that after subset, dimension of death dataset is now 2,280 observations with 4 variables.
death <- death %>% select(-Country, -Pollutant)
dim(death)
## [1] 2280 4
The join is performed using left_join() function because this report assumes the first dataset (air) as the main information. The resulting dataframe is named as air_death. From the first join operation, it can be seen that dimension of resulting dataframe (air_death) is 2,280 observations with 7 variables. Header and first few observations of the data are displayed thereafter.
air_death <- left_join(air, death, by = c("ISO3", "Region", "Year"))
dim(air_death)
## [1] 2280 7
head(air_death)
The second join to be performed between the resulting dataset from first join (air_death) with third dataset (population). However, upon inspection of the population data, it can be seen that it is in wide format whereby values for each year stored as individual columns. Such structure is considered untidy and is not suitable for joining with other datasets. As such, the joining of third dataset will be performed after necessary datatype conversion, tidy and manipulate process which will be elaborated in the respective sections below.
The function str() is used to check the types of variables and structures of each dataset.
str(air_death, give.attr = FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2280 obs. of 7 variables:
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ ISO3 : chr "AFG" "AFG" "AFG" "AFG" ...
## $ Region : chr "Asia" "Afghanistan" "Afghanistan" "Asia" ...
## $ Exposure Mean: num 65 65 65 65 65 65 65 65 65 65 ...
## $ Year : num 1990 1990 1995 1995 2000 ...
## $ Pollutant : chr "pm25" "pm25" "pm25" "pm25" ...
## $ Death Mean : num 46 46 45 45 45 45 46 46 47 47 ...
Noted that air_death contains multiple data types and at least 1 variable (i.e. “Region”) suitable for data type conversion from character to factor. It is also observed that variable “Year” is in numeric data type, however because it is a qualitative (categorical) rather than quantitative (numerical) data, then it is converted to character. The following code performs the appropriate data type conversion. Conversion to factor uses as.factor() while function as.character() used to convert numeric to character type.
air_death$Region <- as.factor(air_death$Region)
air_death$Year <- as.character(air_death$Year)
str(air_death, give.attr = FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2280 obs. of 7 variables:
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ ISO3 : chr "AFG" "AFG" "AFG" "AFG" ...
## $ Region : Factor w/ 5 levels "Afghanistan",..: 4 1 1 4 1 4 4 1 4 1 ...
## $ Exposure Mean: num 65 65 65 65 65 65 65 65 65 65 ...
## $ Year : chr "1990" "1990" "1995" "1995" ...
## $ Pollutant : chr "pm25" "pm25" "pm25" "pm25" ...
## $ Death Mean : num 46 46 45 45 45 45 46 46 47 47 ...
It is also observed that this dataset contains a variable called “Exposure Mean” which stores mean of PM2.5 concentration in a given year. For a better insight of reading this variable, there is a need for a new variable to explain the level of PM2.5 according to Air Quality Index (AQI), a system that translates pollutant concentration measurement into scale that represent the air quality [3]. As the value of this new variable is index in a scale, naturally its data type will be in ordinal factor. The creation of this new variable will be elaborated in section Tidy & Manipulate Data II of this report.
Next, the structure and data type of variables in population dataset is observed using the code below. Since the dataset contains 60 variables, only the first 10 variables are displayed.
str(population, give.attr = FALSE, list.len = 10)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 264 obs. of 60 variables:
## $ Country Code: chr "ABW" "AFG" "AGO" "ALB" ...
## $ 1960 : num 54211 8996973 5454933 1608800 13411 ...
## $ 1961 : num 55438 9169410 5531472 1659800 14375 ...
## $ 1962 : num 56225 9351441 5608539 1711319 15370 ...
## $ 1963 : num 56695 9543205 5679458 1762621 16412 ...
## $ 1964 : num 57032 9744781 5735044 1814135 17469 ...
## $ 1965 : num 57360 9956320 5770570 1864791 18549 ...
## $ 1966 : num 57715 10174836 5781214 1914573 19647 ...
## $ 1967 : num 58055 10399926 5774243 1965598 20758 ...
## $ 1968 : num 58386 10637063 5771652 2022272 21890 ...
## [list output truncated]
In the population dataset, the data type of variables are considered proper and no conversion required at this juncture. However, as the structure is considered untidy and requires reshaphing, there could be data type conversion needed thereafter. The tidy and manipulation process for this dataset will be elaborated in the subsequent section below.
Upon inspection of the air_death data, it is oberved that column “Region” contains inconsistent value i.e. “Afghanistan” which is not inside the set of of “Four World Regions”. There are 12 observations identified with this incorrect region name.
air_regions <- air_death %>% group_by(Region) %>% count()
air_regions
Further inspection performed using filter() function to select all observations with “Afghanistan” as their region name. It can be seen that there is only 1 country which is “Afghanistan”.
filter(air_death, Region == "Afghanistan")$Country %>% unique()
## [1] "Afghanistan"
A method that uses unite(), spread(), and identical() functions able to confirm that the observations with “Afghanistan” as their region name are actually duplicate observations with that of same countries with “Asia” as its region name. The output of identical() function is TRUE which means that the observations of country “Afghanistan” are duplicate in the incorrect Region i.e. “Afghanistan”.
air_death_sub <- filter(air_death, (Country == "Afghanistan"))
air_death_sub <- air_death_sub %>% unite(Info, `Exposure Mean`, `Death Mean`, Pollutant, sep = "-")
air_death_sub <- air_death_sub %>% spread(key = Region, value = Info)
identical(air_death_sub$Afghanistan, air_death_sub$Asia)
## [1] TRUE
Based on the above, it is safe to remove all observations with “Afghanistan” as their region name and keep the observations with Region “Asia”. The function droplevels() is needed to remove no longer used level(s) after subseting a dataset. Noted after subsetting, dimension of air_death dataset is 2,268 observations with 7 variables. Header and first few observations of the data are displayed thereafter.
air_death <- filter(air_death, Region != "Afghanistan") %>% droplevels()
dim(air_death)
## [1] 2268 7
head(air_death)
Removal of the duplicate and inconsistent observations was succesfull as shown in the output of code below.
air_regions <- air_death %>% group_by(Region) %>% count()
air_regions
Recalled from previous section that population dataset is considered untidy and is not suitable for joining with other datasets. As such, the population dataset need to be reshaped into long format with new variable called “Year” as a single column storing all the years and “Total_Population” variable storing the value of population for a given year. This is done using function gather().
population <- population %>% gather(c(2:60), key = Year, value = Total_Population)
dim(population)
## [1] 15576 3
head(population)
Noted that after converted to long format, the dimension of population is now 15,576 observations with 3 variables. Header and first few observations of the data are displayed thereafter. It is observed that the new variable “Year” is in character data type which is already consistent with air_death dataset.
str(population, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame': 15576 obs. of 3 variables:
## $ Country Code : chr "ABW" "AFG" "AGO" "ALB" ...
## $ Year : chr "1960" "1960" "1960" "1960" ...
## $ Total_Population: num 54211 8996973 5454933 1608800 13411 ...
After the tasks above, the air_death and population datasets are considered in tidy format because they already satisfied the tidy data principles which defined as follows: (i) each variables must have its own column, (ii) each observation must have its own row, and (iii) each value must have its own cell.
The last step in this section is to perform the second join of datasets which is the final step in the whole datasets join process. The first join already performed in section Data Loading above between first dataset (air) and second dataset (death) which results in air_death dataset. This second and last join is performed between tidied air_death and population datasets. It is also performed using left_join() function because this report assumes the first dataset as the main information. The resulting dataset is named as air_death_pop. After this final join operation, it can be seen that dimension of air_death_pop dataset is 2,268 observations with 8 variables. Header and first few observations of the data are displayed thereafter.
air_death_pop <- left_join(air_death, population, by = c(ISO3="Country Code", "Year"))
dim(air_death_pop)
## [1] 2268 8
head(air_death_pop)
str(air_death_pop, give.attr = FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2268 obs. of 8 variables:
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ ISO3 : chr "AFG" "AFG" "AFG" "AFG" ...
## $ Region : Factor w/ 4 levels "Africa","America",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Exposure Mean : num 65 65 65 65 65 67 66 61 59 61 ...
## $ Year : chr "1990" "1995" "2000" "2005" ...
## $ Pollutant : chr "pm25" "pm25" "pm25" "pm25" ...
## $ Death Mean : num 46 45 45 46 47 48 48 46 45 47 ...
## $ Total_Population: num 12412308 18110657 20779953 25654277 29185507 ...
From the output above, it can be seen that air_death_pop dataset is considered in tidy format.
Recalled from previous section (see Understand), there is a need to create a new explanatory variable in the dataset based on the value of PM2.5 exposure (i.e. Exposure Mean). The new variable is created using mutate() and a custom, helper function assignAQILevel().
# custom, helper function
assignAQILevel <- function(pm)
{
ifelse(pm <= 12, 1,
ifelse(pm <= 35.4, 2,
ifelse(pm <= 55.4, 3,
ifelse(pm <= 150.4, 4,
ifelse(pm <= 250.4, 5,
ifelse(pm <= 500.4, 6, NA)
)
)
)
)
)
}
air_death_pop <- air_death_pop %>%
mutate(AQI_PM25_Level = assignAQILevel(air_death_pop$"Exposure Mean"))
Noted that a new variable created with name “AQI_PM25_Level”. The value of this variable is numeric index from 1 to 6 that corresponds to AQI scale which defined as 1 = “Good”, 2 = “Moderate”, 3 = “Unhealthy for Sensitive Groups”, 4 = “Unhealthy”, 5 = “Very Unhealthy”, and 6 = “Hazardous”. Based on this definition, the variable “AQI_PM25_Level” needs to be converted to ordinal factor and labelled accordingly.
air_death_pop$AQI_PM25_Level <- factor(air_death_pop$AQI_PM25_Level,
levels = c(1:6),
labels = c("Good", "Moderate", "Unhealthy for Sensitive Groups",
"Unhealthy", "Very Unhealthy", "Hazardous"),
ordered = TRUE)
str(air_death_pop, give.attr = FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2268 obs. of 9 variables:
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ ISO3 : chr "AFG" "AFG" "AFG" "AFG" ...
## $ Region : Factor w/ 4 levels "Africa","America",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Exposure Mean : num 65 65 65 65 65 67 66 61 59 61 ...
## $ Year : chr "1990" "1995" "2000" "2005" ...
## $ Pollutant : chr "pm25" "pm25" "pm25" "pm25" ...
## $ Death Mean : num 46 45 45 46 47 48 48 46 45 47 ...
## $ Total_Population: num 12412308 18110657 20779953 25654277 29185507 ...
## $ AQI_PM25_Level : Ord.factor w/ 6 levels "Good"<"Moderate"<..: 4 4 4 4 4 4 4 4 4 4 ...
As can be seen from the above, the final dataset air_death_pop has 2,268 observations with 9 variables. All the variables are in correct data type and necessary factorisation has been applied.
The dataset being scanned for missing values and special values such as NaN and Infinity the using chained any() and apply() function. It can be seen from the output that the dataset does not have any special values, but however, contains missing values.
any(apply(air_death_pop, 2, function(x) any(is.na(x))))
## [1] TRUE
any(apply(air_death_pop, 2, function(x) any(is.nan(x))))
## [1] FALSE
any(apply(air_death_pop, 2, function(x) any(is.infinite(x))))
## [1] FALSE
Further zoom into missing values scanning in each variable/column, showed that that there are 18 observations having missing values in Total_Population variable.
colSums(is.na(air_death_pop))
## Country ISO3 Region Exposure Mean
## 0 0 0 0
## Year Pollutant Death Mean Total_Population
## 0 0 0 18
## AQI_PM25_Level
## 0
Next, the total missing values for each Country is calculated. The total missing values for Total_Population variable is calculated in the summarise() function using the sum() function with nested is.na() function, while the total number of observations calculated using length() function. The mean of Total_Population by country also calculated because it will be used to impute the missing values. The mean is calculated using mean() function with argument na.rm = TRUE. The result saved into a variable named adp_NA.
adp_NA <- air_death_pop %>%
group_by(Country) %>%
summarise(total_obs = length(Total_Population),
total_NA = sum(is.na(Total_Population)),
sum_pop = sum(Total_Population, na.rm = TRUE),
mean_pop = mean(Total_Population, na.rm = TRUE))
By filtering the result, it can be seen that there are 2 countries with missing total population values. Country “Eritrea” has 6 missing values out of total 12 observations while country “Taiwan” has all its 12 observation missing total population value. Note, for country “Taiwan”, the mean of total population was calculated as NaN because its total number of observations is equal to the total of missing values. In other words, all observations of this country were missing value in its Total_Population column.
adp_NA[adp_NA$total_NA > 0,]
Imputation is performed by using the mutate() function with nested ifelse() function. For each observations, the ifelse() function checks whether the value in column Total_Population is missing - if yes, it replaces with calculated mean values, otherwise will remain the value as-is. The result then was saved into air_death_pop_imp data frame. Noted that the dimension of the dataset after imputation is intact, i.e. remain as 2,268 observations with 9 variables. In other words, no changes of dimension before and after imputation.
air_death_pop_imp <- air_death_pop %>%
group_by(Country) %>%
mutate(Total_Population = ifelse(is.na(Total_Population),
mean(Total_Population, na.rm = TRUE),
Total_Population))
dim(air_death_pop_imp)
## [1] 2268 9
Last step in this section is to check whether the imputation was successful or not. For this, summarised values by country is calculated again and saved in a new variable named adp_NA_imputed. It is similar to the previous operation when calculating the adp_NA but for this one, the imputed dataset is used instead.
adp_NA_imputed <- air_death_pop_imp %>%
group_by(Country) %>%
summarise(total_obs_after = length(Total_Population),
total_NA_after = sum(is.na(Total_Population)),
sum_pop_after = sum(Total_Population, na.rm = TRUE),
mean_pop_after = mean(Total_Population, na.rm = TRUE))
The determination whether imputation was successfull is a two-fold process. First process is by checking on the total missing values (NA) after imputation. Second process was by comparing the mean values before and after imputation. For the ease of this checking process, the summary before imputation (adp_NA) dataframe is joined with the summary after imputation (adp_NA_imputed) dataframe to create a “side-by-side” view. Additionally, the delta of the mean (mean_pop_after minus mean_pop) was added to the joined data frame under column delta_mean. Note that prior of constructing the “side-by-side” view, the dataframes being subset to only include the countries that identified having missing values, in this case, “Eritrea” and “Taiwan”.
adp_NA <- filter(adp_NA, Country %in% c("Eritrea", "Taiwan"))
adp_NA_imputed <- filter(adp_NA_imputed, Country %in% c("Eritrea", "Taiwan"))
adp_joined <- inner_join(adp_NA, adp_NA_imputed, by = "Country")
adp_joined <- adp_joined[c("Country", "total_NA", "total_NA_after", "mean_pop", "mean_pop_after")]
adp_joined <- mutate(adp_joined, delta_mean = mean_pop_after - mean_pop)
adp_joined
It can be observed from the above that for country “Eritrea” the total missing value is now 0 which means no more missing values. It can also be observed their delta_mean are also 0 which means mean value for this country remains the same before and after imputation.
It is also noted that for country “Taiwan” , number of total missing values shown remain the same after imputation and its mean is calculated as NaN. This is because the missing value of Total_Population was replaced (imputed) with the calculated mean where NAs are excluded (using argument rm.na = TRUE). Mathematically, mean is calculated by taking the sum of the Total_Population value and divide it with total number of observations. Since all Total_Population for this country are all NAs and NAs are excluded in the computation, then the mean was calculated as 0 divided by 0, which resulted NaN value. The missing value for country “Taiwan” to be left as-is because there are no applicable and logical way to impute them. These observations also cannot be removed from dataset because they have other values in other variables.
There are 3 variables with numeric data type in the dataset, namely “Exposure Mean”, “Death Mean”, and “Total Population”. All these variables need to be scan for outliers. This task is assuming that “Exposure Mean”, “Death Mean”, and “Total Population” are independent variable. Therefore, univariate outliers detection method is used to detect and handle the outliers.
First step is to check whether the distribution of the variables follows an approximately normal distribution. This is necessary to determine whether outlier detection methods that rely on the normality assumption (such as z-score method) are safe and suitable to be used [4]. The check is done using visual technique i.e. histogram plots.
par(mfrow=c(1,2))
hist(air_death_pop_imp$`Exposure Mean`, main = "Annual Exposure Mean",
xlab = "PM2.5 air pollution (µg/m3)",
col = "seagreen")
hist(air_death_pop_imp$`Death Mean`, main = "Annual Death Mean",
xlab = "Age-standardized Deaths/100,000",
col = "seagreen")
par(mfrow=c(1,1))
hist(air_death_pop_imp$Total_Population, main = "Annual Total_Population",
xlab = "Persons",
col = "seagreen")
It can be clearly seen from the histograms above, distribution of those variables are right-skewed, which means outlier detection methods that rely on normality assumption e.g. z-score method is not suitable to be used. As such, Box Plot with Tukey’s method was selected instead. Tukey’s method makes no distributional assumptions and does not depend on the mean or standard deviation, therefore it is applicable to skewed data or non bell-shaped data [5].
For Tukey’s method, boxplot provides visual detection of outliers, as shown below.
par(mfrow=c(1,2))
exposure_outliers <- air_death_pop_imp$`Exposure Mean` %>%
boxplot(main = "Annual Exposure Mean",
ylab = "PM2.5 air pollution (µg/m3)",
col = "coral")
death_outliers <- air_death_pop_imp$`Death Mean` %>%
boxplot(main = "Annual Exposure Death",
ylab = "Age-standardized Deaths/100,000",
col = "coral")
par(mfrow=c(1,1))
pop_outliers <- air_death_pop_imp$Total_Population %>%
boxplot(main = "Annual Total Population",
ylab = "Persons",
col = "coral")
The data points that suspected as outliers for each of the variables captured in the out list of the boxplots translated to percentage and displayed below. Note, missing values are omitted in the calculation of total number of rows of population because there are missing values in country “Taiwan”.
length(exposure_outliers$out)/nrow(air_death_pop_imp)*100
## [1] 4.62963
length(death_outliers$out)/nrow(air_death_pop_imp)*100
## [1] 1.984127
length(pop_outliers$out)/nrow(na.omit(air_death_pop_imp))*100
## [1] 11.61348
From the above, the percentage of suspected outliers for each variables is rather significant i.e. 4.63%, 1.98%, and 11.61% for “Exposure Mean”, “Death Mean”, and “Total Population” respectively. Based on these figures and the skewness of these variables’ distribution, it is consider necessary to apply proper data transformation prior of making a decision whether these are true outliers. A determination that outliers are present within a data set may be due to the non-normality rather than actual outliers if the normality assumption for that data is not valid [4]. The data transformation is elaborated in the subsequent section.
In the previous section, the histogram plots for variable “Exposure Mean”, “Death Mean”, and “Total Population” appear as right skewed distribution. In order to decrease the skewness, proper data transformation approach is applied, in this case the logarithms approach, using log() function.
air_death_pop_imp$`Exposure Mean` <- log(air_death_pop_imp$`Exposure Mean`)
air_death_pop_imp$`Death Mean` <- log(air_death_pop_imp$`Death Mean`)
air_death_pop_imp$Total_Population <- log(air_death_pop_imp$Total_Population)
After transformation, the histograms of these variables are plotted again. It can be seen that the logarithms transformation is able to reduce the skewness of the distributions.
par(mfrow=c(1,2))
hist(air_death_pop_imp$`Exposure Mean`, main = "Annual Exposure Mean",
xlab = "PM2.5 air pollution (µg/m3)",
col = "seagreen")
hist(air_death_pop_imp$`Death Mean`, main = "Annual Death Mean",
xlab = "Age-standardized Deaths/100,000",
col = "seagreen")
par(mfrow=c(1,1))
hist(air_death_pop_imp$Total_Population, main = "Annual Total Population",
xlab = "Persons",
col = "seagreen")
Outliers detection then performed against the transformed distribution using boxplot approach.
par(mfrow=c(1,2))
exposure_outliers <- air_death_pop_imp$`Exposure Mean` %>%
boxplot(main = "Annual Exposure Mean",
ylab = "PM2.5 air pollution (µg/m3)",
col = "coral")
death_outliers <- air_death_pop_imp$`Death Mean` %>%
boxplot(main = "Annual Death Mean",
ylab = "Age-standardized Deaths/100,000",
col = "coral")
par(mfrow=c(1,1))
pop_outliers <- air_death_pop_imp$Total_Population %>%
boxplot(main = "Annual Total Population",
ylab = "Persons",
col = "coral")
The data points that suspected as outliers for each of the variables after transformation, translated to percentage, displayed below.
length(exposure_outliers$out)/nrow(air_death_pop_imp)*100
## [1] 0
length(death_outliers$out)/nrow(air_death_pop_imp)*100
## [1] 0
length(pop_outliers$out)/nrow(na.omit(air_death_pop_imp))*100
## [1] 1.108156
It can be seen that 0 (zero) outliers detected for variable “Exposure Mean” and “Death Mean”. There are still suspected outliers detected for “Total Population” but the percentage is very small (i.e. 1.1%), hence they will be left as-is in the dataset (no further action required). As the conclusion, the outliers detected previously were not true outliers. They are rather appeared to be outliers due to the non-normality of the data.