Data Wrangling: Dataset challenge

Introduction
Data
Understand
Tidy & Manipulate Data I
Tidy & Manipulate Data II
Scan I
Scan II
Transform
Conclusion
References

Introduction

Australia generally experiences relatively clean air in global comparison.The country is however vulnerable to short-term extreme air pollution which occur due to bushfires and dust storms (Keywood et al, 2017). This can pose serious health impacts to its residents even at relatively low levels.The health impacts of Australia air pollution include increasing people’s risk of developing cardiovascular disease, respiratory infections, chronic obstructive pulmonary disease (COPD) and lung cancer, shortened life expectancy and premature death, while short-term effects can also include aggravation of asthma, and eye, nose and throat irritation (WHO, 2021). Exposure to Australia air pollution is estimated to result in 4,880 premature deaths every year (Walter,2019)

Australia measures national air quality against its own air quality standards, called the National Environment Protection Measure for Ambient Air Quality (Air NEPM). The NEPM measures 7 pollutants to which most Australians are exposed: particulate matter (PM2.5 and PM10), carbon monoxide (CO), lead, nitrogen dioxide (NO2), ozone (O3), and sulphur dioxide (SO2)(IQAir, 2022).

In South Australia, the main pollutants of concern vary by area. In Whyalla, Port Pirie and the capital Adelaide, particulate matter pollution is a particular concern; wood smoke in Mount Gambier and Mount Barker; sulfur dioxide at Oliver Street in Port Pirie; lead in Port Pirie, while ozone levels are of concern in Elizabeth (IQAir, 2019).

Data

The dataset was sourced from DATA.SA with following details.

Dataset1 contains validated particle pollution data for the year 2020 for the Spencer Gulf region in monthly files (January)
Dataset2 contains validated particle pollution data for the year 2021 for the Spencer Gulf region in monthly files (January)
Data was collected as 1 hour averages
Sampling method: Tapered Element Oscillating Microbalance (TEOM)
Field labels: PM10 TEOM µg/m3 – Particulate Matter 10 microns or less, PM2.5 TEOM µg/m3 – Particulate Matter 2.5 microns or less, Temperature Deg C – Ambient Temperature, Barometric Pressure atm – Ambient Pressure
PM10 values expressed in micrograms per cubic metre (µg/m3), Ambient Temperature expressed in Degrees Celsius (°C) & Ambient Barometric Pressure expressed in Atmospheres (atm). Time is expressed as Time Ending Averages.

dataset1 <- read.csv("WHY07p_1hr202001.csv", header = TRUE, strip.white = TRUE,  na = c("", "NA"))
dataset2 <- read.csv("WHY07p_1hr202101.csv", header = TRUE, strip.white = TRUE,na = c("", "NA"))
head(dataset1)     # Confirm if importing is successful

##        Date.Time PM10.TEOM.ug.m3 PM2.5.TEOM.ug.m3 Temperature.Deg.C
## 1 1/01/2020 1:00            25.5               NA              18.4
## 2 1/01/2020 2:00            19.9               NA              17.8
## 3 1/01/2020 3:00            23.1               NA              17.3
## 4 1/01/2020 4:00            22.1               NA              17.0
## 5 1/01/2020 5:00            21.9               NA              16.7
## 6 1/01/2020 6:00            33.5               NA              14.5
##   Barometric.Pressure.atm
## 1                   1.001
## 2                   1.000
## 3                   1.000
## 4                   1.000
## 5                   1.001
## 6                   1.002

head(dataset2)     # Confirm if importing is successful

##        Date.Time PM10.TEOM.ug.m3 PM2.5.TEOM.ug.m3 Temperature.Deg.C
## 1 1/01/2021 1:00            26.2               NA              19.5
## 2 1/01/2021 2:00            24.6               NA              18.6
## 3 1/01/2021 3:00            26.3               NA              18.1
## 4 1/01/2021 4:00            25.1               NA              17.9
## 5 1/01/2021 5:00            24.1               NA              17.5
## 6 1/01/2021 6:00            27.9               NA              17.7
##   Barometric.Pressure.atm
## 1                   0.993
## 2                   0.992
## 3                   0.992
## 4                   0.992
## 5                   0.992
## 6                   0.992

names(dataset1)     # Check names

## [1] "Date.Time"               "PM10.TEOM.ug.m3"        
## [3] "PM2.5.TEOM.ug.m3"        "Temperature.Deg.C"      
## [5] "Barometric.Pressure.atm"

names(dataset2)     # Check names

## [1] "Date.Time"               "PM10.TEOM.ug.m3"        
## [3] "PM2.5.TEOM.ug.m3"        "Temperature.Deg.C"      
## [5] "Barometric.Pressure.atm"

intersect(dataset1 %>% names(), # Identify common variable
 dataset2 %>% names())

## [1] "Date.Time"               "PM10.TEOM.ug.m3"        
## [3] "PM2.5.TEOM.ug.m3"        "Temperature.Deg.C"      
## [5] "Barometric.Pressure.atm"

df_join <- dataset1%>% 
full_join(dataset2,by = c("Date.Time", "PM10.TEOM.ug.m3", "PM2.5.TEOM.ug.m3", "Temperature.Deg.C", "Barometric.Pressure.atm"), all=TRUE) # Join data set 1 and dataset 2

The join functions in the dplyr package make sure that the tables are combined so that matching rows are together. As shown in this Air quality dataset, it is not always the case that each row in one table has a matching row in the other. Therefore, it is crucial to chose the most appropriate joining method among several versions of join. The two air quality data set in this case has shown overlapping variable names using intersect() functions but they different in observations. Thus, the full join method has been chosen to keep all the rows from both data frames with all=TRUE arguemnts. The data has been succesffuly joined and now ready further exploration.

Understand

The air quality data has been inspected to get first impressions of the data structure, data quality, missing values, inconsistent variable names, and attributes. This would help us understand the data and its attributes.The data has been checked for its structure using str() and class() functions to better understand it.

str(df_join)           # Examine the data structure

## 'data.frame':    1488 obs. of  5 variables:
##  $ Date.Time              : chr  "1/01/2020 1:00" "1/01/2020 2:00" "1/01/2020 3:00" "1/01/2020 4:00" ...
##  $ PM10.TEOM.ug.m3        : num  25.5 19.9 23.1 22.1 21.9 33.5 35.2 12.6 22.3 26.9 ...
##  $ PM2.5.TEOM.ug.m3       : logi  NA NA NA NA NA NA ...
##  $ Temperature.Deg.C      : num  18.4 17.8 17.3 17 16.7 14.5 17 20.3 21.4 21.8 ...
##  $ Barometric.Pressure.atm: num  1 1 1 1 1 ...

class(df_join)         # Examine the class

## [1] "data.frame"

df_join1 <- df_join$Date.Time%>% 
            as.POSIXct(df_join$Date.Time, format="%d/%m/%Y %H:%M",tz="UTC + 9:30") # Convert strings into date-time format
str(df_join1)

##  POSIXct[1:1488], format: "2020-01-01 01:00:00" "2020-01-01 02:00:00" "2020-01-01 03:00:00" ...

As shown in the output, the air quality data contains 1488 abservations collected over two months period in 2020 and 2021. This observations has been measured for eight variables representing Date.Time, particulate mater (PM10 and PM2.5), temperature and atmospheric pressure. This data set contain character, numerical and logical data data types of which the the charcter data type was incorrectly assigned to the variable "Date.Time". Thus, the character (string) format in variable " Date.Time" has been converted into a date-time format that is recognized by R steps to represent categorical data. This conversion supports efficient plotting, subsetting and time series analysis of air pollution.

Tidy & Manipulate Data I

According to Wickham (2014), untidy data column headers are values, not variable names, multiple variables are stored in one column, variables are stored in both rows and columns. In addition to this, in untidy data multiple types of observational units are stored in the same table and a single observational unit is stored in multiple tables.The air quality data relatively untidy though not a total mess data as it holds so many missing abservations which can skew anything and bias estimates that lead to invalid results. This gaps in data might be due to data intentionally being omitted or not prperly collected or could be created due to transformations of the data. Therefore, any data reshape steps as part of tidy data procedure using pivot_wider() and pivot_longer functions have not been required for this particular data set. As part of data tidying process,some of the columns such as "PM10.TEOM.ug.m3,"Temperature.Deg.C", and "Barometric.Pressure.atm" has been modified to more descriptive and follow column naming standards and conventions.

air_quality <- df_join %>% 
                    rename(
                    observation_date = Date.Time, particulate10_ug_m3 = PM10.TEOM.ug.m3, particulate2.5_ug_m3 = PM2.5.TEOM.ug.m3,temperature_oC =                   Temperature.Deg.C, pressure_atm = Barometric.Pressure.atm)

Tidy & Manipulate Data II

The air quality data in this work was great in sense that it may reflect the trend that the climate change and increasing temperatures may contribute to higher levels of particulate mater over time. The information to convey the hazards of various levels of pollutants for example 30 ug/m3 of particulate10 was not provided in this particular data. Thus, the best way to make more useful this data is to add another variable so that the public can quickly understand and take actions. This additional variable is created based on particulate10 1-hour rolling average (µg/m3) according to Environmrntal Protection Australia (South Australia) air quality category system (EPA, 2022).

air_quality2<-air_quality %>%             # Create a new variable   
  mutate(Air_quality_rating = case_when(particulate10_ug_m3 < 50 ~ "Good",
                                   particulate10_ug_m3 < 100 ~ "Fair",
                                   particulate10_ug_m3 < 200 ~ "Poor",
                                   particulate10_ug_m3 <600 ~ "Very poor",
                                   particulate10_ug_m3 >= 600 ~ "Extremely poor"))
air_quality2$Air_quality_rating <- factor(air_quality2$Air_quality_rating,          # label and order Air_quality_rating variable
 levels = c("Good", "Fair", "Poor","Very poor", "Extremely poor"),
 labels = c("5","4", "3","2", "1"),
 ordered = TRUE,
 exclude = NA)
str(air_quality2)

## 'data.frame':    1488 obs. of  6 variables:
##  $ observation_date    : chr  "1/01/2020 1:00" "1/01/2020 2:00" "1/01/2020 3:00" "1/01/2020 4:00" ...
##  $ particulate10_ug_m3 : num  25.5 19.9 23.1 22.1 21.9 33.5 35.2 12.6 22.3 26.9 ...
##  $ particulate2.5_ug_m3: logi  NA NA NA NA NA NA ...
##  $ temperature_oC      : num  18.4 17.8 17.3 17 16.7 14.5 17 20.3 21.4 21.8 ...
##  $ pressure_atm        : num  1 1 1 1 1 ...
##  $ Air_quality_rating  : Ord.factor w/ 5 levels "5"<"4"<"3"<"2"<..: 1 1 1 1 1 1 1 1 1 1 ...

As shown above, the new variable called air_quality_rating is succefully created that needs to be labelled and ordered. The structure of the newly created data frame was confirmed to avoid coercion in the succeful steps, with Air_quality_rating being as factor data type. As part of cleaning this air quality data, let's have a look on the missing values in the next steps.

Scan I

Missing value is a common phenmonen in any data set which might be intriduced due to several reasons.The most crucial part is how we deal and handle the missing data to avoid invalid conclusions. The first important step we followed was identfiying this missing values in the air quality data set using numerical approaches. Detecting missing values numerically using sum() and colSums() function in every column of the air quality data set would give an idea about the distribution of missing values.

sum(is.na(air_quality2))   # Check the count of missing values

## [1] 1564

colSums(is.na(air_quality2)) # Check the count of missing values in the columns

##     observation_date  particulate10_ug_m3 particulate2.5_ug_m3 
##                    0                   38                 1488 
##       temperature_oC         pressure_atm   Air_quality_rating 
##                    0                    0                   38

which(is.na(air_quality2$particulate10_ug_m3)) # Locate the NA in variable particulate10

##  [1]   13   14   19   48   56  103  195  196  201  228  229  231  232  236  260
## [16]  297  298  299  301  304  306  307  525  689  725  726  727  741  742  744
## [31]  927  950  951  969  975  987 1049 1050

which(is.na(air_quality2$Air_quality_rating)) # Locate the NA in variable Air_quality_rating

##  [1]   13   14   19   48   56  103  195  196  201  228  229  231  232  236  260
## [16]  297  298  299  301  304  306  307  525  689  725  726  727  741  742  744
## [31]  927  950  951  969  975  987 1049 1050

As shown in th output above, 38 values have been missing in particulate matter 10 and Air-quality variables. On the contrary, no null values has been detected in the temeprature and pressure variables. One of the most unexpected part of this data was that all values for particulate matter 2.5 (particulate2.5_ug_m3) are null values despite particulate matter 2.5 is the main pollutants of concern for its impact on human health due to its continuous presence in the air and the numerous sources (DWEP, 2018). There has been no information provided why these measurements are not taken during the month of January in 2020 and 2021 from the source. However, we believe that the data has been intentionally omittted. During, the month of January bushfires are very common which contributed most to particulate matter 10 instead of particulate matter 2.5. Thus The particulate matter 2.5 has been excludedin this analysis as shown below.

Now, let's visualize the distribution of the data so that we can choose the most reasonable guess to impute each missing value with statistical estimates of the missing values.

air_quality2 %>% summarise(mean_particulate10_ug_m3 = mean(particulate10_ug_m3), # Verify data distribution, 
                           sd_particulate10_ug_m3 = sd(particulate10_ug_m3),
                           mean_temperature_oC = mean(temperature_oC),
                           sd_temperature_oC = sd(temperature_oC),
                           mean_pressure_atm = mean(pressure_atm),
                           sd_pressure_atm = sd(pressure_atm)
 )

##   mean_particulate10_ug_m3 sd_particulate10_ug_m3 mean_temperature_oC
## 1                       NA                     NA             23.1752
##   sd_temperature_oC mean_pressure_atm sd_pressure_atm
## 1          6.024048         0.9941129     0.006143161

hist(air_quality2$particulate10_ug_m3,breaks = 50)  # Plot the histogram for particulate10_ug_m3 variable

hist(air_quality2$temperature_oC,breaks = 50)

hist(air_quality2$pressure_atm,breaks = 50)

boxplot(air_quality2$Air_quality_rating)

As shown above from the histplot the the data for particulate matter 10 is positively skewed and long tailed, indicating deviations from normality and the most frequent values are low. On the other hand, atmospheric pressure data is negatively skewed and the most frequent values are high, indicating its non-normal distribution.

Therfore, data imputation with the mean and mode depends on the data type and distribution though k-nearest neighbour algorithms estimate and predicted value based on regression line can be used to replace missing values.

summary(air_quality2)

##  observation_date   particulate10_ug_m3 particulate2.5_ug_m3 temperature_oC 
##  Length:1488        Min.   :   1.60     Mode:logical         Min.   :11.90  
##  Class :character   1st Qu.:  15.10     NA's:1488            1st Qu.:19.00  
##  Mode  :character   Median :  20.90                          Median :21.80  
##                     Mean   :  26.34                          Mean   :23.18  
##                     3rd Qu.:  28.00                          3rd Qu.:25.60  
##                     Max.   :1304.30                          Max.   :44.40  
##                     NA's   :38                                              
##   pressure_atm    Air_quality_rating
##  Min.   :0.9750   5   :1358         
##  1st Qu.:0.9890   4   :  75         
##  Median :0.9950   3   :  13         
##  Mean   :0.9941   2   :   3         
##  3rd Qu.:0.9990   1   :   1         
##  Max.   :1.0080   NA's:  38         
##

air_quality2$particulate10_ug_m3 <- impute(air_quality2$particulate10_ug_m3, fun = median) # Replace the null values with median 
sum(is.na(air_quality2$particulate10_ug_m3))  # Confirm if it is replaced by the median

## [1] 0

air_quality2$Air_quality_rating <- impute(air_quality2$Air_quality_rating, fun = mode)   # Replace the null values with mode
sum(is.na(air_quality2$Air_quality_rating))     # Confirm if it is replaced by the mode

## [1] 0

air_quality3 <- air_quality2 %>% select(-particulate2.5_ug_m3)   # Exclude variable particulate2.5_ug_m3
head(air_quality3)

##   observation_date particulate10_ug_m3 temperature_oC pressure_atm
## 1   1/01/2020 1:00                25.5           18.4        1.001
## 2   1/01/2020 2:00                19.9           17.8        1.000
## 3   1/01/2020 3:00                23.1           17.3        1.000
## 4   1/01/2020 4:00                22.1           17.0        1.000
## 5   1/01/2020 5:00                21.9           16.7        1.001
## 6   1/01/2020 6:00                33.5           14.5        1.002
##   Air_quality_rating
## 1                  5
## 2                  5
## 3                  5
## 4                  5
## 5                  5
## 6                  5

sum(is.na(air_quality3))

## [1] 0

The data type for particulate matter 10 is numerical and has skewed distribution and hence the missing values were replaced using median, which is preferable for skewed distribution(right or left) unlike mean imputation in normally-distributed data.Mode imputation works better in the case of Air quality rating which is a categorical variable.

Scan II

In adressing the potential outliers, performing simple descriptive statistics in particular the minimum and maximum using summary() function has been the first step. Visualizing the data using boxplots and histplots has been also another basic way of scanning potential outliers. Interquartile range has been calculated from the summary for each numeric data subsets. As you can see from the summary output below the the particulate matter 10 ranges from the minimum size of 1.6 to a maximum of 1304.3 with mean = 26.2 indicating, a greater variance among the abservations.Similary, temperature has also shown greater variabilty but realistic measurement with minimum temeprature of 11.9 oc and maximum temeparture of 44.4 oc.

air_quality4 <- air_quality3 %>% 
            select(particulate10_ug_m3,temperature_oC, pressure_atm)   # Select quantitative variables
summary(air_quality4)

## 
##  38 values imputed to 20.9

##  particulate10_ug_m3 temperature_oC   pressure_atm   
##  Min.   :   1.6      Min.   :11.90   Min.   :0.9750  
##  1st Qu.:  15.3      1st Qu.:19.00   1st Qu.:0.9890  
##  Median :  20.9      Median :21.80   Median :0.9950  
##  Mean   :  26.2      Mean   :23.18   Mean   :0.9941  
##  3rd Qu.:  27.9      3rd Qu.:25.60   3rd Qu.:0.9990  
##  Max.   :1304.3      Max.   :44.40   Max.   :1.0080

Let's calculate the interquartile range for the numerical variables.

IQR1 <-IQR(air_quality4$particulate10_ug_m3) # Apply the Interquartile Range, IQR()
IQR1

## [1] 12.6

IQR2 <-IQR(air_quality4$temperature_oC) # Apply the Interquartile Range, IQR()
IQR2

## [1] 6.6

IQR3 <-IQR(air_quality4$pressure_atm) # Apply the Interquartile Range, IQR()
IQR3

## [1] 0.01

LowerInnerFence1 <- 15.3- 1.5 * IQR1        # Compute the lower and upper tukey fences
UpperInnerFence1 <- 27.9 + 1.5 * IQR1
print(LowerInnerFence1); print(UpperInnerFence1)

## [1] -3.6

## [1] 46.8

LowerInnerFence2 <- 19.0- 1.5 * IQR2        # Compute the lower and upper tukey fences
UpperInnerFence2 <- 25.6 + 1.5 * IQR2
print(LowerInnerFence2); print(UpperInnerFence2)

## [1] 9.1

## [1] 35.5

LowerInnerFence3 <- 0.9890- 1.5 * IQR3       # Compute the lower and upper tukey fences
UpperInnerFence3 <- 0.9990 + 1.5 * IQR3
print(LowerInnerFence3); print(UpperInnerFence3)

## [1] 0.974

## [1] 1.014

outliers1 = air_quality4$particulate10_ug_m3[(air_quality4$particulate10_ug_m3 > UpperInnerFence1) | (air_quality4$particulate10_ug_m3 < LowerInnerFence1)] # Check all the points above upper and lower tukey fences


outliers2 = air_quality4$temperature_oC[(air_quality4$temperature_oC > UpperInnerFence2) | (air_quality4$temperature_oC < LowerInnerFence2)] # Check all the points above upper and lower tukey fences

outliers3 = air_quality4$pressure_atm[(air_quality4$pressure_atm > UpperInnerFence3) | (air_quality4$pressure_atm < LowerInnerFence3)] # Check all the points above upper and lower tukey fences

In particulate matter 10, the lower and upper limit of non-outliers is -3.6 and 46.8 respectively. Thus, particulate matters above 46.8 or below -3.6 can be considered as putative outliers in normally distributed data. The exact location of this data points are located using indexing and found that at least 104 data points are outside these range. Similarly, the lower and upper limit of non-outliers is 9.1 and 35.5.respectively in the ambient temerature. as it can be seen from the output, about 80 data points are outside this range. However, we have not decided that this great number of data points are outliers as our data is skewed. The underlying idea of majority outlier detection procedures is that the data follows normal distribution.Looking at pattern of this particular variable, there is a tendency of that sizes of particulate matter can vary greatly during bushfires season contributing for skeweness. The best solution to this is then to apply a transformation on particulate 10 and temeperature variables such as the logarithmic transformation or generally a Box-Cox transformation which might drew some of these points closer to the bulk of observations. There has been no potential outlires detected in the atmospheric pressure.

boxplot(air_quality4, # Check for any outliers using univaraite non-parametric outlier detection
 main = "Plot of air_quality4", xlab = "air_quality4")

The output displays numerous possible outliers for particulate matter 10 and temperature above the tukeys whisker coroborating previous results.

z-scores method are used detect outliers if the data are is assumed normal distribution. We have used z-scores in this skewed data just to confirm the distribution further but not to detect outliers.

z.scores_particulate <- air_quality4$particulate10_ug_m3%>% # Locate the outliers using the z-score approach 
 outliers::scores(type = "z")
z.scores_particulate %>% summary()

## 
##  38 values imputed to -0.1255504

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.58300 -0.25828 -0.12555  0.00000  0.04037 30.29395

z.scores_temperature <- air_quality4$temperature_oC%>% # Locate the outliers using the z-score approach 
 outliers::scores(type = "z")
z.scores_temperature %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.8717 -0.6931 -0.2283  0.0000  0.4025  3.5233

z.scores_pressure <- air_quality4$pressure_atm %>% # Locate the outliers using the z-score approach 
 outliers::scores(type = "z")
z.scores_pressure %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3.1112 -0.8323  0.1444  0.0000  0.7955  2.2606

hist(z.scores_particulate, freq = FALSE, breaks = 50) # Compare it with normal distribution
x <- seq(-50, 50, by = 0.001) 
lines(x = x, y = dnorm(x), col = "red")

hist(z.scores_temperature, freq = FALSE, breaks = 50) # Compare it with normal distribution
x <- seq(-50, 50, by = 0.001) 
lines(x = x, y = dnorm(x), col = "red")

hist(z.scores_pressure, freq = FALSE, breaks = 10) # Compare it with normal distribution
x <- seq(-50, 50, by = 0.001) 
lines(x = x, y = dnorm(x), col = "red")

qqnorm(z.scores_particulate, main = "Q-Q Plot of particulate") # Check the distribution with parametric normality test
qqline(z.scores_particulate, col = "red", lwd = 1, lty = 2)

qqnorm(z.scores_temperature, main = "Q-Q Plot of temperature") # Check the distribution with parametric normality test
qqline(z.scores_temperature, col = "red", lwd = 1, lty = 2)

qqnorm(z.scores_pressure, main = "Q-Q Plot of pressure") # Check the distribution with parametric normality test
qqline(z.scores_pressure, col = "red", lwd = 1, lty = 2)

We have also computed the skewness in each numerical variable to understand the data better as shown below.

skewness(air_quality4$particulate10_ug_m3, na.rm = TRUE)   # Compute skewness

## [1] 21.30869

skewness(air_quality4$temperature_oC, na.rm = TRUE)

## [1] 1.257667

skewness(air_quality4$pressure_atm, na.rm = TRUE)

## [1] -0.4791756

As observed above the distribution of particulate10_ug_m3 and temperature variables are positivly skewed with tail is toward the high values (on the right-hand side) while the pressure is negatively skewed. This skewness information would help us to chose a better transformation method for example the particulate matter 10 is severly skewed(right) with skewness of 21.3, and hence log10 transformation performs better. While pressure is negatevily skewed (-0.47), and sqrt transformation can perfrom better.

Transform

In this work, we transformed the numeric data with aim to decrease the skewness or convert the distribution into a normal distribution.By doing that we can also understand and treat the outliers instead of deleting as a signficant proportion of data points falls outside upper limit of non-outliers using IQR. Moreover, many prediction models performs better when features are normally distributed.Hence, the numerical data has been transformed using log,sqrt and box-cox transformations.

Log10_transformed_particulate <- log10(air_quality4$particulate10_ug_m3)  # log10 for severe positively skewed data
log10_transformed_temperature <- log10(air_quality4$temperature_oC)       # log10 transform for positive skewed data
sqrt_tranformed_pressure <-sqrt(max(air_quality4$pressure_atm +1) - air_quality4$pressure_atm)  # for moderately negative skewed data

skewness(Log10_transformed_particulate, na.rm = TRUE)   # Confirm skewness

## [1] 0.5460248

skewness(log10_transformed_temperature, na.rm = TRUE)   # Confirm skewness

## [1] 0.6039926

skewness(sqrt_tranformed_pressure, na.rm = TRUE)       # Confirm skewness

## [1] 0.4713253

In the results above, we can see that the new transformed values has shown low skewness for example the skewness of particulate matter 10 has been reduced from a 21.3 to a skewness value of 0.55 in transformed data. This indicated that the log10 transformation has peformed well in reducing the skewness of the data and normalizing the distribution. Similarly, the atmospheric pressure data has been transformed with skewness of 0.47 from a negatively skewed original data. Let's the central tendency measurements of the distribution using summary()functions.

summary(Log10_transformed_particulate) # Perform a descriptive statistics of the stransformed data

## 
##  38 values imputed to 1.320146

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2041  1.1847  1.3201  1.3220  1.4456  3.1154

summary(log10_transformed_temperature)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.076   1.279   1.338   1.352   1.408   1.647

summary(sqrt_tranformed_pressure)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.004   1.006   1.007   1.009   1.016

In the output above, the transformed data mean and median of all variables has been shown. The mean and mode are approximately equal, implying that the transformed data are approximately equal to normaly disributed; the mean taking closer to the central area of the distribution. This results are also coroborated with histplot below to get better visualization.

hist(Log10_transformed_particulate)  # Plot transformed particulate10

hist(log10_transformed_temperature)  # Plot transformed temperature

hist(sqrt_tranformed_pressure)       # Plot transformed pressure

From the above transformation outcome, we can assume now that the data is approximately normal and we can use the Z-score to get the count and location of outliers.

z.scores1 <- Log10_transformed_particulate%>%     # Locate the outliers using the z-score approach once we confirm the normal distribution
 outliers::scores(type = "z")
z.scores1 %>% summary()

## 
##  38 values imputed to -0.007257654

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -4.424098 -0.543340 -0.007258  0.000000  0.489261  7.097638

length(which(abs(z.scores1) >3 ))               # Check the count of outliers

## [1] 17

z.scores2 <- log10_transformed_temperature%>%        # Locate the outliers using the z-score approach once we confirm the normal distribution
 outliers::scores(type = "z")
z.scores1 %>% summary()

## 
##  38 values imputed to -0.007257654

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -4.424098 -0.543340 -0.007258  0.000000  0.489261  7.097638

length(which(abs(z.scores2) >3 ))   # Check the count of outliers

## [1] 0

z.scores3 <- sqrt_tranformed_pressure%>%        # Locate the outliers using the z-score approach once we confirm the normal distribution
 outliers::scores(type = "z")
z.scores1 %>% summary()

## 
##  38 values imputed to -0.007257654

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -4.424098 -0.543340 -0.007258  0.000000  0.489261  7.097638

length(which(abs(z.scores3) >3 ))   # Check the count of outliers

## [1] 2

As you we can see the results above, the outliers are reduced into 17 in particulate matter10 and the number of potential outliers in the ambient temeprature reduced to zero, indicating that transformation draw some of the skewed observation closer to the bulk of observations.

boxcox_particulate10_ug_m3 <- BoxCox(air_quality4$particulate10_ug_m3, lambda = "auto")   # Transform using Box-Cox 
head(boxcox_particulate10_ug_m3)

## [1] 2.730584 2.553759 2.660665 2.629117 2.622617 2.919749
## attr(,"lambda")
## [1] -0.1085599

boxcox_temperature_oC <- BoxCox(air_quality4$temperature_oC, lambda = "auto")   # Transform using Box-Cox 
head(boxcox_temperature_oC)

## [1] 2.503781 2.479431 2.458434 2.445512 2.432334 2.326896

lambda <- attr(boxcox_temperature_oC, which = "lambda")   # Extract the "lambda"
lambda

## [1] -0.1065588

boxcox_pressure_atm <- BoxCox(air_quality4$pressure_atm, lambda = "auto")   # Transform using Box-Cox 
head(boxcox_pressure_atm)

## [1] 0.0010005 0.0000000 0.0000000 0.0000000 0.0010005 0.0020020

lambda1 <- attr(boxcox_pressure_atm, which = "lambda")    # Extract the "lambda"
lambda1

## [1] 1.999924

Let's explore further using box-cox method if outliers can be further reduced.

summary(boxcox_particulate10_ug_m3)      # Perform a descriptive statistics of the stransformed data

## 
##  38 values imputed to 2.589101

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4582  2.3610  2.5891  2.5791  2.7936  4.9836

summary(boxcox_temperature_oC)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.177   2.527   2.627   2.647   2.742   3.120

summary(boxcox_pressure_atm)

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.0246875 -0.0109395 -0.0049875 -0.0058509 -0.0009995  0.0080320

hist(boxcox_particulate10_ug_m3, 
 main = bquote("Histogram of Box-Cox Transformed Particulate10 (" ~ lambda == -0.1085599 ~ ")"), # Plot transformed particulate10
 xlab = "Transformed particulate10_ug_m3")

hist(boxcox_temperature_oC, 
 main = bquote("Histogram of Box-Cox Transformed Temperature(" ~ lambda == -0.1065588 ~ ")"),  # Plot transformed Temperature
 xlab = "Transformed temperature_oC")

hist(boxcox_pressure_atm, 
 main = bquote("Histogram of Box-Cox Transformed Pressure (" ~ lambda == 1.999924 ~ ")"),      # Plot transformed Pressure
 xlab = "Transformed Pressure_atm")

Box-Cox power transformation has also revealed similar results similar to the log10 and sqrt transformationas as shown above. The mean and mode are approximately equal, implying that the transformed data are closer to normaly disributed data.The mean has not been equal to zero as in a perfect normal distribution. The follwing z-score plot can further provide a better visualization regarding this tranformation and the number of potential outliers.

z.scores_boxcox_particulate10 <- boxcox_particulate10_ug_m3%>%     # Locate the outliers using the z-score approach once we confirm the normal distribution
 outliers::scores(type = "z")
z.scores_boxcox_particulate10 %>% summary()

## 
##  38 values imputed to 0.02425207

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -5.12396 -0.52677  0.02425  0.00000  0.51823  5.80935

length(which(abs(z.scores_boxcox_particulate10) >3 ))

## [1] 20

z.scores_boxcox_temperature <- boxcox_temperature_oC%>%        # Locate the outliers using the z-score approach once we confirm the normal distribution
 outliers::scores(type = "z")
z.scores_boxcox_temperature %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.7611 -0.7040 -0.1189  0.0000  0.5543  2.7757

length(which(abs(z.scores_boxcox_temperature) >3 ))           # Check the count of outliers

## [1] 0

z.scores_boxcox_pressure_atm <- boxcox_pressure_atm%>%        # Locate the outliers using the z-score approach once we confirm the normal distribution
 outliers::scores(type = "z")
z.scores_boxcox_pressure_atm %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3.0890 -0.8345  0.1416  0.0000  0.7956  2.2766

length(which(abs(z.scores_boxcox_pressure_atm) >3 ))         # Check the count of outliers

## [1] 2

hist(z.scores_boxcox_particulate10, freq = FALSE, breaks = 50) # Compare it with normal distribution
x <- seq(-30, 30, by = 0.001) 
lines(x = x, y = dnorm(x), col = "red")

hist(z.scores_boxcox_temperature, freq = FALSE, breaks = 50) # Compare it with normal distribution
x <- seq(-30, 30, by = 0.001) 
lines(x = x, y = dnorm(x), col = "red")

hist(z.scores_boxcox_pressure_atm, freq = FALSE, breaks = 10) # Compare it with normal distribution
x <- seq(-30, 30, by = 0.001) 
lines(x = x, y = dnorm(x), col = "red")

In comparisonto log10 and sqrt transformation, the Box-Cox power transformation has increased the count of potential outliers in particulate matter10 from 17 to 20, indicationg that log10 and sqrt transformed data is better if parametric statstical analysis is to be performed on this data set. Though data transformation is one of the method to enhance normality, it has to be handled causeously (Vélez et al 2015) not to lose some observations. In practical terms for example the data points in the variable particulate matter 10 is highly skewed and most those data points of higher concentration are tend to be outliers. On the contrary, the higher the concentration of particulate matter 10, the higher the health risk and the lower air quality rating. Therefore, excluding these data points from analysis can mislead the public. In the case of this air quality data, instead of eliminating the those observations, we suggested to accomodate all observations into the distribution that tends to be less skewed than the original.

Conclusion

In this report, we have utitilized the versatility of R which contains various packages which can be used to tidy, manupulate and transform the data to make it more usable and valuable for a variety of downstream application including modeling. We have shown how we can retrieve different data from different data sources and join them together to create a new data frame. In this process, the data set is ready by removing outliers that can potentially skew our results, changes any null values, standardize and transform to improve quality and consistency.

References

DELWP (Victoria Department of Environment, Land, Water and Planning)((2018), Estimating the health costs of air pollution in Victoria. DELWP Economics working paper to inform the Independent Expert Panel on Interim Targets, https://www.climatechange.vic.gov.au/__data/assets/pdf_file/0022/421717/Final_Health-costs-of-air-pollution-in-Victoria.pdf (Accessed on 17 August 2022).

EPA(Environmental Protection Austaralia) (2022), Air quality monitoring, https://www.epa.sa.gov.au/environmental_info/air_quality/new-air-quality-monitoring (Accessed on 15 August 2022).

IQAir (2019), 2019 World Air Quality Report, pp35

Keywood D, Hibberd F, & Emmerson M (2017). Australia state of the environment 2016: atmosphere, independent report to the Australian Government Minister for the Environment and Energy, Australian Government Department of the Environment and Energy, Canberra, doi:10.4226/94/58b65c70bc372.

Walter C, (2019), The public health impacts of air pollution in Australia: Research, policy and planning disconnects, The University of Melbourne website, https://www.climatecollege.unimelb.edu.au/seminar/public-health-impacts-air-pollution-australia-research-policy-and-planning-disconnects (Accessed on 15 August 2022).

Wickham H (2014), 'Tidy Data', Journal of Statistical Software, 59(10), pp. 1–23. doi: 10.18637/jss.v059.i10

WHO (World Health Organization)(2021), Ambient (outdoor) air pollution, 22 September 2021, https://www.who.int/news-room/fact-sheets/detail/ambient-%28outdoor%29-air-quality-and-health (Accessed on 15 August 2022).