The following packages were installed and used in this assignment:
install.packages("readxl")
Installing package into <U+393C><U+3E31>C:/Users/Jamriska/Documents/R/win-library/3.4<U+393C><U+3E32>
(as <U+393C><U+3E31>lib<U+393C><U+3E32> is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/readxl_1.1.0.zip'
Content type 'application/zip' length 1488008 bytes (1.4 MB)
downloaded 1.4 MB
package readxl successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Jamriska\AppData\Local\Temp\RtmpsvBBKC\downloaded_packages
install.packages("readr")
Installing package into <U+393C><U+3E31>C:/Users/Jamriska/Documents/R/win-library/3.4<U+393C><U+3E32>
(as <U+393C><U+3E31>lib<U+393C><U+3E32> is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/readr_1.1.1.zip'
Content type 'application/zip' length 1254896 bytes (1.2 MB)
downloaded 1.2 MB
package readr successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Jamriska\AppData\Local\Temp\RtmpsvBBKC\downloaded_packages
install.packages("dplyr")
Error in install.packages : Updating loaded packages
install.packages("tidyr")
Installing package into <U+393C><U+3E31>C:/Users/Jamriska/Documents/R/win-library/3.4<U+393C><U+3E32>
(as <U+393C><U+3E31>lib<U+393C><U+3E32> is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/tidyr_0.8.1.zip'
Content type 'application/zip' length 899213 bytes (878 KB)
downloaded 878 KB
package tidyr successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Jamriska\AppData\Local\Temp\RtmpsvBBKC\downloaded_packages
install.packages("stringr")
Error in install.packages : Updating loaded packages
install.packages("Hmisc")
Installing package into <U+393C><U+3E31>C:/Users/Jamriska/Documents/R/win-library/3.4<U+393C><U+3E32>
(as <U+393C><U+3E31>lib<U+393C><U+3E32> is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/Hmisc_4.1-1.zip'
Content type 'application/zip' length 1817136 bytes (1.7 MB)
downloaded 1.7 MB
package Hmisc successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Jamriska\AppData\Local\Temp\RtmpsvBBKC\downloaded_packages
install.packages("forecast")
Restarting R session...
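After installation the packages are attached with library() for the rest of the session. The chunk below is only a sketch, assuming the packages installed above are the ones used throughout the report (lubridate and outliers are loaded later in the sections where they are needed):
library(readxl)    # reading Excel files
library(readr)     # reading flat files
library(dplyr)     # data manipulation
library(tidyr)     # reshaping / tidying data
library(stringr)   # string manipulation
library(Hmisc)     # miscellaneous utilities (e.g., minor.tick for plots)
library(forecast)  # time series utilities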
Air quality has a significant effect on public health, as exposure to air pollutants is associated with a cost in human lives, a burden on health services, socio-economic effects and environmental impacts. As such, there is a strong need to understand, control and mitigate air pollution. To achieve these objectives, various agencies have been tasked with monitoring and characterising air pollution and exposure to atmospheric pollutants. This is also of interest to the broader research community and to the public in general.
Although the availability of and access to air quality data have been improving, challenges remain; for example, access to historical data for a range of air pollutants monitored across different geographical locations is still limited. Often data are available but not in a complete form, or they require further effort to retrieve, collate and pre-process. This was the aim of this assignment: to demonstrate some approaches for finding, retrieving, collating and pre-processing data on air quality and pollution levels in QLD.
In practical terms, I found and retrieved air quality data for 2016 from the QLD Government data repository. The hourly average data from different monitoring sites were collated, assessed and modified by applying data pre-processing practices. This included viewing the data, assessing their structure and class, tidying, and changing types and formats. The data were assessed for outliers, skewness and normality. Completeness was also assessed, and where appropriate values were removed or substituted. In general, these activities aim at preparing good-quality data which are readily available for further analysis and applications. The assignment does not attempt a complete and in-depth application of the various means and techniques available for data pre-processing; rather, it outlines some approaches and areas which can be explored in future applications.
The air quality data used in this assignment were obtained from the Queensland Government Data website (QGDW) (https://data.qld.gov.au/dataset/air-quality-monitoring-2016-grouped-by-pollutant). The site provides historical records of air quality monitoring data for QLD, including data for a broad range of air pollutants measured at monitoring sites across the state. The QGDW also provides current air quality updates; however, review of historical data is limited. There is a need to access and review historical data, for example for health research applications. This feature (retrieval and processing of historical data) is not directly available to the general public, or is rather limited (e.g., manual retrieval of data for individual days). The aim of this assignment is to develop and demonstrate a methodology for obtaining, collating and pre-processing historical air quality data which can then be used for further analysis.
To demonstrate the methodology, the assignment aims at reconstructing the Air Quality Index (AQI) for six pollutants listed in the Air National Environment Protection Measure (Air NEPM). The Air NEPM is a legislative standard for air quality, specifying time-averaged concentration and exposure standards for different air pollutants. The standard also presents the methodology for assessing air quality levels, including calculation of the Air Quality Index. The methodology used in this assignment is available at http://www.environment.gov.au/protection/air-quality/air-quality-standards.
The selected data are hourly averages of six air pollutant indicators measured in 2016 at different air quality monitoring sites spread across QLD. The focus here was to select, retrieve, collate and pre-process the hourly average pollutant concentrations measured at the South Brisbane air monitoring site. This information (a collated data set) is not readily available. The following pollutants and information were collated:
The Air Quality Index for each pollutant is calculated as AQI_x = (pol_x / standard_x) * 100 [%], where pol_x is the concentration of pollutant x averaged over one hour (e.g., the 1 pm-2 pm average gives the pol_x record at 2 pm).
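For illustration only, the formula can be wrapped in a small helper function (a sketch; the name aqi and its arguments are hypothetical and are not part of the original scripts):
aqi <- function(conc, standard) {
  # AQI expressed as a percentage of the relevant air quality standard
  (conc / standard) * 100
}
# Example: an hourly PM2.5 reading of 30 ug/m3 against a 25 ug/m3 standard gives an AQI of 120
# aqi(30, 25)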
The following are the URL addresses from which the air quality files (csv format) were obtained. Due to a technical problem with downloading these files directly from the URLs, and due to time constraints, the files were first downloaded into a local directory and then loaded into RStudio. The section below provides the URLs and the (commented) scripts which were originally used.
# PM2.5
#url_PM2.5 <- "https://data.qld.gov.au/dataset/7844efd9-5116-4dba-8459-c79703d82b39/resource/b5cde0fa-1063-41af-a01d-64e46a221985/download/pm2-5qld2016.csv"
# PM2.5_data_all_sites<-read.csv(url_PM2.5,stringsAsFactors = FALSE)
# PM10
#url_PM10<- "https://data.qld.gov.au/dataset/7844efd9-5116-4dba-8459-c79703d82b39/resource/ab620383-de88-4a8a-86ac-ed388beb3554/download/pm10qld2016.csv"
#PM10_data_all_sites<-read.csv(url_PM10,stringsAsFactors = FALSE)
# Carbon Monoxide
#url_CO <-"https://data.qld.gov.au/dataset/7844efd9-5116-4dba-8459-c79703d82b39/resource/8b6d6550-2941-427c-b1af-181e191c7abb/download/carbonmonoxideqld2016.csv"
#CO_data_all_sites <- read.csv(url_CO, stringsAsFactors = FALSE)
# Nitrogen dioxide
#url_NO2 <- "https://data.qld.gov.au/dataset/7844efd9-5116-4dba-8459-c79703d82b39/resource/e28cb1f2-b262-4723-8261-babe407f9b20/download/nitrogendioxideqld2016.csv"
#NO2_data_all_sites <- read.csv(url_NO2, stringsAsFactors = FALSE)
# Ozone
#url_O3<- "https://data.qld.gov.au/dataset/7844efd9-5116-4dba-8459-c79703d82b39/resource/040293d8-f98b-4748-9253-e50282c14cb5/download/ozoneqld2016.csv"
#O3_data_all_sites <- read.csv(url_O3, stringsAsFactors = FALSE)
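# Visibility reducing particles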
#url_visibility <- "https://data.qld.gov.au/dataset/7844efd9-5116-4dba-8459-c79703d82b39/resource/0b23e381-da8b-4ad1-8099-f8ab0688e2e7/download/visibilityreducingparticlesqld2016.csv"
#Visibility_data_all_sites <- read.csv(url_visibility, stringsAsFactors = FALSE)
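If direct downloading is attempted again, a guarded read that falls back to the local copy could be used. This is only a sketch; the helper read_pollutant is hypothetical and was not part of the original workflow:
# Hypothetical helper: try the URL first, fall back to the local csv copy on failure
read_pollutant <- function(url, local_file) {
  tryCatch(read.csv(url, stringsAsFactors = FALSE),
           error = function(e) read.csv(local_file, stringsAsFactors = FALSE))
}
# Example (commented): PM2.5_data_all_sites <- read_pollutant(url_PM2.5, "pm2-5qld2016.csv")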
## READ the data file for EACH POLLUTANT from the working directory (files already downloaded from the internet)
getwd()
[1] "C:/Users/Jamriska/Documents/RMIT_COURSE Materials/RMIT-2018-S1/MATH2349_Course/Assignments/Assign_3"
setwd("C:/Users/Jamriska/Documents/RMIT_COURSE Materials/RMIT-2018-S1/MATH2349_Course/Data")
The working directory was changed to C:/Users/Jamriska/Documents/RMIT_COURSE Materials/RMIT-2018-S1/MATH2349_Course/Data inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
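As the message above suggests, rather than calling setwd() inside a chunk, the working directory for all notebook chunks can be set once via the knitr root.dir option in the setup chunk; a sketch (assuming the same local data folder):
knitr::opts_knit$set(root.dir = "C:/Users/Jamriska/Documents/RMIT_COURSE Materials/RMIT-2018-S1/MATH2349_Course/Data")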
# PM2.5 (Particulate Matter Smaller than 2.5micron, pol1)
PM2.5.file<-"pm2-5qld2016.csv" # Name of the PM2.5 date file on local directory
PM2.5_data_all_sites<-read.csv(PM2.5.file,stringsAsFactors = FALSE) # load the file into RStudio
head(PM2.5_data_all_sites)
#str(PM2.5_data_all_sites)
# PM10 (Particulate Matter Smaller than 10micron, pol2)
PM10.file<-"pm10qld2016.csv"
PM10_data_all_sites<- read.csv(PM10.file,stringsAsFactors = FALSE)
head(PM10_data_all_sites)
#str(PM10_data_all_sites)
# CO (Carbon Monoxide CO, pol3)
CO.file<- "carbonmonoxideqld2016.csv"
CO_data_all_sites<- read.csv(CO.file,stringsAsFactors = FALSE)
head(CO_data_all_sites)
#str(CO_data_all_sites)
# NO2 (Nitrogen dioxide, pol4)
NO2.file<- "nitrogendioxideqld2016.csv"
NO2_data_all_sites<- read.csv(NO2.file,stringsAsFactors = FALSE)
head(NO2_data_all_sites)
#str(NO2_data_all_sites)
# O3 (Ozone, pol5)
O3.file <- "ozoneqld2016.csv"
O3_data_all_sites<- read.csv(O3.file,stringsAsFactors = FALSE)
head(O3_data_all_sites)
#str(O3_data_all_sites)
# Visibility (Bsp, pol6)
Visibility.file<-"visibilityreducingparticlesqld2016.csv"
Vis_data_all_sites<- read.csv(Visibility.file,stringsAsFactors = FALSE)
head(Vis_data_all_sites)
#str(Vis_data_all_sites)
# --------------- ALL DATA DOWNLOADED -------------------------------------------------------------
To gain an understanding of the data structure, types of variables, dimensions, etc., each file was explored using View(), str(), class(), colnames() and other functions. Since each file contains data for several monitoring sites, only the data for South Brisbane (the site of interest) were selected, and the selected column was renamed to indicate the pollutant name and units. The approach involved the following steps: 1. a dataframe pol1, pol2, ..., pol6 was created by assigning each loaded csv file; 2. the variable names in each file were checked (these represent the monitoring sites where the pollutant was measured); 3. Date, Time and the South Brisbane (or nearest) monitoring site were selected; 4. the column containing the South Brisbane data was identified; 5. the selected concentration column was renamed (e.g., colnames(pol1)[3] <- "PM2.5_(ug/m3)").
This section summarises the types of variables and the data structures, and checks the attributes of the data (minimum requirements 2-4). The R code, outputs and a brief explanation of the steps taken are shown below.
# 1. Particulate Matter smaller than 2.5micron PM2.5 data denoted as pol1
pol1<-PM2.5_data_all_sites # assign name for pollutants file subset
#View(pol1) # view and inspect data, select column with South Brisb data
colnames(pol1) # list names of the variables in pol1
[1] "Date" "Time"
[3] "Cannon.Hill..South.East.Queensland...ug.m.3." "Luscombe..South.East.Queensland...ug.m.3."
[5] "Lytton..South.East.Queensland...ug.m.3." "Rocklea..South.East.Queensland...ug.m.3."
[7] "South.Brisbane..South.East.Queensland...ug.m.3." "Springwood..South.East.Queensland...ug.m.3."
[9] "Wynnum.North..South.East.Queensland...ug.m.3." "Wynnum.West..South.East.Queensland...ug.m.3."
[11] "Jondaryan..Jondaryan...ug.m.3." "Boat.Creek..Gladstone...ug.m.3."
[13] "Boyne.Island..Gladstone...ug.m.3." "Clinton..Gladstone...ug.m.3."
[15] "Fisherman.s.Landing..Gladstone...ug.m.3." "South.Gladstone..Gladstone...ug.m.3."
[17] "Targinie..Gladstone...ug.m.3."
pol1 <-pol1 %>% # Select columns Date, Time and South Brisb for pol1
dplyr::select(contains("Date"),contains("Time"),contains("South.Brisb"))
grep(colnames(pol1)[3], colnames(PM2.5_data_all_sites )) # find which column was selected (i.e., with pol1 conc. data)
[1] 7
colnames(pol1)[3] <- "PM2.5_(ug/m3)" # change name of 3rd column (with pol1 conc) to PM2.5_(ug/m3)
# 2. Particulate Matter smaller than 10micron PM10 data denoted as pol2
pol2<-PM10_data_all_sites # assign name for pollutants file subset
pol2 <-pol2 %>%
dplyr::select(contains("Date"),contains("time"),contains("South.Brisb"))
colnames(pol2)[3] <- "PM10_(ug/m3)" # change the name of column
# 3. Carbon monoxide CO data denoted as pol3
pol3 <- CO_data_all_sites # assign name for pollutants file subset
pol3 <-pol3 %>%
dplyr::select(contains("Date"),contains("time"),contains("South.Brisb"))
colnames(pol3)[3] <- "CO_(ppm)" # change the name of column
# 4. Nitrogen dioxide NO2 data denoted as pol4
pol4 <- NO2_data_all_sites # assign name for pollutants file subset
pol4 <-pol4 %>%
dplyr::select(contains("Date"),contains("time"),contains("South.Brisb"))
colnames(pol4)[3] <- "NO2_(ppm)" # change the name of column
# 5. Ozone (O3) data denoted as pol5
pol5 <- O3_data_all_sites # assign name for pollutants file subset
colnames(pol5) # South Brisbane data unavailable, the closest monitoring station is "Rocklea"
[1] "Date" "Time"
[3] "Deception.Bay..South.East.Queensland...ppm." "Flinders.View..South.East.Queensland...ppm."
[5] "Mountain.Creek..South.East.Queensland...ppm." "Mutdapilly..South.East.Queensland...ppm."
[7] "North.Maclean..South.East.Queensland...ppm." "Rocklea..South.East.Queensland...ppm."
[9] "Springwood..South.East.Queensland...ppm." "Memorial.Park..Gladstone...ppm."
[11] "Pimlico..Townsville...ppm."
pol5 <-pol5 %>%
dplyr::select(contains("Date"),contains("time"),contains("Rocklea")) #
colnames(pol5)[3] <- "O3_(ppm)" # change the name of column
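The visibility data (pol6) were prepared in the same way; the chunk is not shown above, so the following is only a sketch consistent with the str(pol6) output further below (the assumption that a South Brisbane column was available in the visibility file is not confirmed in the original):
# 6. Visibility reducing particles (Bsp) data denoted as pol6
pol6 <- Vis_data_all_sites                 # assign name for pollutant file subset
pol6 <- pol6 %>%
  dplyr::select(contains("Date"), contains("Time"), contains("South.Brisb"))
colnames(pol6)[3] <- "Vis_(1/Mm)"          # change the name of the 3rd column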
The next step is to review the selected dataframes pol1, pol2, ..., pol6 in terms of variable types and structure.
In this section the data are checked against the tidy data principles and, where needed, reshaped into a tidy format (minimum requirement 5). Each step is explained alongside the R code and outputs.
cat("# 1. Particulate Matter smaller than 2.5micron PM2.5 data denoted as pol1 \n")
# 1. Particulate Matter smaller than 2.5micron PM2.5 data denoted as pol1
str(pol1)
'data.frame': 8784 obs. of 3 variables:
$ Date : chr "01/01/2016" "01/01/2016" "01/01/2016" "01/01/2016" ...
$ Time : chr "00:00" "01:00" "02:00" "03:00" ...
$ PM2.5_(ug/m3): num 2.4 7.1 9.7 7.3 4 0.8 1.9 2.2 3.4 4.8 ...
cat("# 2. Particulate Matter smaller than 10micron PM10 data denoted as pol2 \n ")
# 2. Particulate Matter smaller than 10micron PM10 data denoted as pol2
str(pol2)
'data.frame': 8784 obs. of 3 variables:
$ Date : chr "01/01/2016" "01/01/2016" "01/01/2016" "01/01/2016" ...
$ Time : chr "00:00" "01:00" "02:00" "03:00" ...
$ PM10_(ug/m3): num 12.9 16.8 18.4 15 12.1 9.5 11.2 12.3 14.6 16.3 ...
cat("# 3. Carbon monoxide CO data denoted as pol3 \n ")
# 3. Carbon monoxide CO data denoted as pol3
str(pol3)
'data.frame': 8784 obs. of 3 variables:
$ Date : chr "01/01/2016" "01/01/2016" "01/01/2016" "01/01/2016" ...
$ Time : chr "00:00" "01:00" "02:00" "03:00" ...
$ CO_(ppm): num NA 0.2 0.1 0.1 0.1 0.1 0.1 0 0 0.1 ...
cat("# 4. Nitrogen dioxide NO2 data denoted as pol4 \n ")
# 4. Nitrogen dioxide NO2 data denoted as pol4
str(pol4)
'data.frame': 8784 obs. of 3 variables:
$ Date : chr "01/01/2016" "01/01/2016" "01/01/2016" "01/01/2016" ...
$ Time : chr "00:00" "01:00" "02:00" "03:00" ...
$ NO2_(ppm): num NA 0.012 0.008 0.006 0.004 0.005 0.004 0.002 0.004 0.006 ...
cat("# 5. oZONE O3 data denoted as pol5 \n ")
# 5. oZONE O3 data denoted as pol5
str(pol5)
'data.frame': 8784 obs. of 3 variables:
$ Date : chr "01/01/2016" "01/01/2016" "01/01/2016" "01/01/2016" ...
$ Time : chr "00:00" "01:00" "02:00" "03:00" ...
$ O3_(ppm): num NA 0.008 0.011 0.012 0.007 0.01 0.013 0.016 0.018 0.019 ...
cat("# 6. Visibility data denoted as pol6 \n ")
# 6. Visibility data denoted as pol6
str(pol6)
'data.frame': 8784 obs. of 3 variables:
$ Date : chr "01/01/2016" "01/01/2016" "01/01/2016" "01/01/2016" ...
$ Time : chr "00:00" "01:00" "02:00" "03:00" ...
$ Vis_(1/Mm): int NA 21 22 19 18 18 18 14 13 12 ...
**SUMMARY** The results above (output of the str() command) show that the variables Date and Time are present in each dataframe pol1, pol2, ..., pol6 and are of character type. The third variable (e.g., PM2.5, PM10, etc.) is numeric. Date and Time are common to every dataframe and can therefore be used for joining them.
In this section at least one new variable is created/mutated from the existing variables (minimum requirement 6), with the R code, outputs and explanation shown below.
Since the concentration data are in separate dataframes, they need to be joined. As indicated above, the variables Date and Time (character type) are present in each dataframe and can be used as the join keys. I used the left_join() function, which adds each of the pollutants pol1 to pol6 as an individual column; the Date and Time variables form the first two columns. The script is presented below.
# JOINING dataframes for ALL 6 POLLUTANT INTO ONE DATAFRAME
# Joining by Date and Time Variables (they are still character type)
pol1_6 <- pol1 %>% left_join(pol2) %>%
left_join(pol3) %>%
left_join(pol4) %>%
left_join(pol5) %>%
left_join(pol6)
Joining, by = c("Date", "Time")
Joining, by = c("Date", "Time")
Joining, by = c("Date", "Time")
Joining, by = c("Date", "Time")
Joining, by = c("Date", "Time")
class(pol1_6)
[1] "data.frame"
str(pol1_6)
'data.frame': 8784 obs. of 8 variables:
$ Date : chr "01/01/2016" "01/01/2016" "01/01/2016" "01/01/2016" ...
$ Time : chr "00:00" "01:00" "02:00" "03:00" ...
$ PM2.5_(ug/m3): num 2.4 7.1 9.7 7.3 4 0.8 1.9 2.2 3.4 4.8 ...
$ PM10_(ug/m3) : num 12.9 16.8 18.4 15 12.1 9.5 11.2 12.3 14.6 16.3 ...
$ CO_(ppm) : num NA 0.2 0.1 0.1 0.1 0.1 0.1 0 0 0.1 ...
$ NO2_(ppm) : num NA 0.012 0.008 0.006 0.004 0.005 0.004 0.002 0.004 0.006 ...
$ O3_(ppm) : num NA 0.008 0.011 0.012 0.007 0.01 0.013 0.016 0.018 0.019 ...
$ Vis_(1/Mm) : int NA 21 22 19 18 18 18 14 13 12 ...
head(pol1_6)
**Summary** The resulting dataframe has 8784 observations (rows) and 8 variables. The variables Date and Time are of character type; the remaining variables (columns 3-8) are numeric or integer. The created dataframe is in tidy format: each column represents a unique variable, each row is a unique observation, and each cell holds a single value. The dataframe can be improved by changing the Date and Time variables from character type to date/time types; this is done later. These two variables can also be merged into a dd-mm-yyyy hh:mm:ss format, which allows easier manipulation.
In the next step the pol1_6 dataframe was checked for the presence of missing values (NA), inconsistencies (e.g., negative values for pollutant concentrations) and other issues.
This section scans the data for missing values, inconsistencies and obvious errors (minimum requirement 7) and explains how these values were dealt with.
# is.na(pol1_6) # check if there are NAS
nu_Total_NA <-sum(is.na(pol1_6)) # count the number of NA
nu_Total_non_NA <-sum(!is.na(pol1_6)) # number of non- NA values
nu_Total <- (nu_Total_NA + nu_Total_non_NA)
(fract_NA <-nu_Total_NA/nu_Total) # FRaction of NA data in total
[1] 0.05510018
**SUMMARY** About 5.5% of the values in the merged dataframe are NA.
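The chunks below work on a copy named dat1. Its creation is not shown above, but the summaries that follow make clear that dat1 is the joined pol1_6 dataframe, so the (elided) assignment is assumed to be:
dat1 <- pol1_6   # working copy of the joined dataframe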
is.na(dat1) <- do.call(cbind,lapply(dat1, is.infinite))
sum(do.call(cbind,lapply(dat1, is.infinite)))
[1] 0
sum(do.call(cbind,lapply(dat1, is.na)))
[1] 3872
sum(do.call(cbind,lapply(dat1, is.nan)))
[1] 0
**SUMMARY** The results from the chunk above show that: 1. the number of NA values in pol1_6 is 3872; 2. the number of NaN values is zero; and 3. the number of infinite values is zero.
This includes checking whether any concentration values in the pol1_6 dataframe are infinite, negative or otherwise invalid (note: pollutant concentrations cannot be negative). The following R script identifies negative values, which are then replaced with NA (using dplyr) and assigned to the dat2 dataframe (i.e., dat2 is pol1_6 with negative values replaced by NA).
dat2 <- dat1 %>%
mutate_all(funs(replace(., .<0, NA)))
(nu_NA_in_dat1 <-sum(do.call(cbind,lapply(dat1, is.na)))) # number of NA in dat1
[1] 3872
(nu_NA_in_dat2 <-sum(do.call(cbind,lapply(dat2, is.na)))) # number of NA in dat2
[1] 4044
(nu_Neg_Values_in_dat1 <- (nu_NA_in_dat2-nu_NA_in_dat1))
[1] 172
**SUMMARY** The number of NA values in the original dat1 (i.e., the pol1_6 dataframe) is 3872. The number of negative values found in dat1 was 172. The new dat2 dataframe (negative values replaced by NA) therefore has 4044 NA values in total.
The following script identifies the NA values in dat2 and replaces them with median values. The modified dataset is assigned to the dat3 dataframe (i.e., dat3 is the original pol1_6 dataset with all NA and negative values replaced by the median value of each pollutant; e.g., all missing and negative values for CO are replaced by the median of the CO data measured in 2016).
dat3 <- data.frame(lapply(dat2,function(x) {
if(is.numeric(x)) ifelse(is.na(x),median(x,na.rm=T),x) else x}))
# Count of na in dat3
(sum(do.call(cbind,lapply(dat3, is.na))))
[1] 0
**SUMMARY** All missing, NA and negative values in dat3 have been replaced by median values; no NA values remain.
Assessment of data completeness for the joined dataframe: often data may not be available for a given monitoring site and/or pollutant. This matters when calculating AQI_Overall, which requires data for every pollutant at a given date/time. A sketch of the completeness check is shown below.
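The completeness check itself is not shown above; a sketch consistent with the results summarised below uses complete.cases():
# Number and fraction of complete cases (rows with no NA) in each dataframe
sum(complete.cases(dat1)); mean(complete.cases(dat1))   # no treatment applied
sum(complete.cases(dat2)); mean(complete.cases(dat2))   # negative values replaced with NA
sum(complete.cases(dat3)); mean(complete.cases(dat3))   # NA and negatives replaced with medians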
**SUMMARY** Analysis of complete cases shows that the fraction of complete cases is: 1. 75.15% for dat1 (no data treatment applied; 6602 complete observations); 2. 73.7% for dat2 (negative values replaced with NA; 6476 complete observations); and 3. 100% for dat3 (NA and negative values replaced by medians; 8784 complete observations).
**NOTE:** All further data pre-processing and manipulation is done using the dat3 dataframe (i.e., NA and negative values replaced with the median value for each pollutant).
str(dat3)
'data.frame': 8784 obs. of 8 variables:
$ Date : chr "01/01/2016" "01/01/2016" "01/01/2016" "01/01/2016" ...
$ Time : chr "00:00" "01:00" "02:00" "03:00" ...
$ PM2.5_.ug.m3.: num 2.4 7.1 9.7 7.3 4 0.8 1.9 2.2 3.4 4.8 ...
$ PM10_.ug.m3. : num 12.9 16.8 18.4 15 12.1 9.5 11.2 12.3 14.6 16.3 ...
$ CO_.ppm. : num 0 0.2 0.1 0.1 0.1 0.1 0.1 0 0 0.1 ...
$ NO2_.ppm. : num 0.013 0.012 0.008 0.006 0.004 0.005 0.004 0.002 0.004 0.006 ...
$ O3_.ppm. : num 0.017 0.008 0.011 0.012 0.007 0.01 0.013 0.016 0.018 0.019 ...
$ Vis_.1.Mm. : num 15 21 22 19 18 18 18 14 13 12 ...
**SUMMARY** During the process above the type of the Date and Time variables was changed from character to factor; the following script converts them back to character.
#library(dplyr)
dat3 %>%
dplyr::mutate_if(is.factor, as.character) -> dat3
str(dat3)
'data.frame': 8784 obs. of 8 variables:
$ Date : chr "01/01/2016" "01/01/2016" "01/01/2016" "01/01/2016" ...
$ Time : chr "00:00" "01:00" "02:00" "03:00" ...
$ PM2.5_.ug.m3.: num 2.4 7.1 9.7 7.3 4 0.8 1.9 2.2 3.4 4.8 ...
$ PM10_.ug.m3. : num 12.9 16.8 18.4 15 12.1 9.5 11.2 12.3 14.6 16.3 ...
$ CO_.ppm. : num 0 0.2 0.1 0.1 0.1 0.1 0.1 0 0 0.1 ...
$ NO2_.ppm. : num 0.013 0.012 0.008 0.006 0.004 0.005 0.004 0.002 0.004 0.006 ...
$ O3_.ppm. : num 0.017 0.008 0.011 0.012 0.007 0.01 0.013 0.016 0.018 0.019 ...
$ Vis_.1.Mm. : num 15 21 22 19 18 18 18 14 13 12 ...
**SUMMARY** The type of the Date and Time variables was successfully changed back to character.
A new variable Date_Time was created by combining the Date and Time variables using the paste() function.
dat4<-dat3 # dat3 was assigned to dat4 dataframe as the working dataset to keep dat3 unchanged
# Note: dat4 contains data from pol1_6 with NA and negative values replaced with Median value for each pol1-6
# Create a new variable using paste function
dat4$Date_Time <- paste(dat4$Date, dat4$Time) # var still character type, in a format e.g.;"19/02/2016 06:00"
#str(dat3)
str(dat4)
'data.frame': 8784 obs. of 9 variables:
$ Date : chr "01/01/2016" "01/01/2016" "01/01/2016" "01/01/2016" ...
$ Time : chr "00:00" "01:00" "02:00" "03:00" ...
$ PM2.5_.ug.m3.: num 2.4 7.1 9.7 7.3 4 0.8 1.9 2.2 3.4 4.8 ...
$ PM10_.ug.m3. : num 12.9 16.8 18.4 15 12.1 9.5 11.2 12.3 14.6 16.3 ...
$ CO_.ppm. : num 0 0.2 0.1 0.1 0.1 0.1 0.1 0 0 0.1 ...
$ NO2_.ppm. : num 0.013 0.012 0.008 0.006 0.004 0.005 0.004 0.002 0.004 0.006 ...
$ O3_.ppm. : num 0.017 0.008 0.011 0.012 0.007 0.01 0.013 0.016 0.018 0.019 ...
$ Vis_.1.Mm. : num 15 21 22 19 18 18 18 14 13 12 ...
$ Date_Time : chr "01/01/2016 00:00" "01/01/2016 01:00" "01/01/2016 02:00" "01/01/2016 03:00" ...
#head(dat4)
The order of the variables in the dat5 dataframe was changed for easier processing and analysis; the new order is: 1. Date, 2. Time, 3. Date_Time, 4. pol1, ..., 9. pol6. A sketch of this step is shown below.
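The chunk creating dat5 is not shown; a minimal sketch consistent with the str(dat5) output below would be:
# Sketch: create dat5 from dat4 with the columns reordered (Date, Time, Date_Time first)
dat5 <- dat4 %>%
  dplyr::select(Date, Time, Date_Time, everything())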
str(dat5)
'data.frame': 8784 obs. of 9 variables:
$ Date : chr "01/01/2016" "01/01/2016" "01/01/2016" "01/01/2016" ...
$ Time : chr "00:00" "01:00" "02:00" "03:00" ...
$ Date_Time : chr "01/01/2016 00:00" "01/01/2016 01:00" "01/01/2016 02:00" "01/01/2016 03:00" ...
$ PM2.5_.ug.m3.: num 2.4 7.1 9.7 7.3 4 0.8 1.9 2.2 3.4 4.8 ...
$ PM10_.ug.m3. : num 12.9 16.8 18.4 15 12.1 9.5 11.2 12.3 14.6 16.3 ...
$ CO_.ppm. : num 0 0.2 0.1 0.1 0.1 0.1 0.1 0 0 0.1 ...
$ NO2_.ppm. : num 0.013 0.012 0.008 0.006 0.004 0.005 0.004 0.002 0.004 0.006 ...
$ O3_.ppm. : num 0.017 0.008 0.011 0.012 0.007 0.01 0.013 0.016 0.018 0.019 ...
$ Vis_.1.Mm. : num 15 21 22 19 18 18 18 14 13 12 ...
The variable dat5$Date_Time was converted from character type to a date-time (POSIXct) type, in the format "2016-06-21 01:00:00".
#install.packages("lubridate")
library(lubridate)
dat5$Date_Time <- dmy_hm(dat4$Date_Time, tz = "Australia/Queensland")
str(dat5)
'data.frame': 8784 obs. of 9 variables:
$ Date : chr "01/01/2016" "01/01/2016" "01/01/2016" "01/01/2016" ...
$ Time : chr "00:00" "01:00" "02:00" "03:00" ...
$ Date_Time : POSIXct, format: "2016-01-01 00:00:00" "2016-01-01 01:00:00" "2016-01-01 02:00:00" ...
$ PM2.5_.ug.m3.: num 2.4 7.1 9.7 7.3 4 0.8 1.9 2.2 3.4 4.8 ...
$ PM10_.ug.m3. : num 12.9 16.8 18.4 15 12.1 9.5 11.2 12.3 14.6 16.3 ...
$ CO_.ppm. : num 0 0.2 0.1 0.1 0.1 0.1 0.1 0 0 0.1 ...
$ NO2_.ppm. : num 0.013 0.012 0.008 0.006 0.004 0.005 0.004 0.002 0.004 0.006 ...
$ O3_.ppm. : num 0.017 0.008 0.011 0.012 0.007 0.01 0.013 0.016 0.018 0.019 ...
$ Vis_.1.Mm. : num 15 21 22 19 18 18 18 14 13 12 ...
head(dat5)
#str(dat5$Date_Time)
#class(dat5$Date_Time)
New variables were created by extracting date/time attributes from the variable dat5$Date_Time. The new variables are: year, month, day, day of week (wday), hour and minute.
# library(lubridate)
dat5$Year <-year(dat5$Date_Time) # create new variable year
dat5$month <-month(dat5$Date_Time) # create new variable month
dat5$day <-day(dat5$Date_Time) # create new variable day
dat5$wday <-wday(dat5$Date_Time, label = TRUE) # create new variable week day (Factor Type)
dat5$hour <-hour(dat5$Date_Time) # create new variable hour
dat5$minute <-minute(dat5$Date_Time) # create new variable minute
str(dat5)
'data.frame': 8784 obs. of 15 variables:
$ Date : chr "01/01/2016" "01/01/2016" "01/01/2016" "01/01/2016" ...
$ Time : chr "00:00" "01:00" "02:00" "03:00" ...
$ Date_Time : POSIXct, format: "2016-01-01 00:00:00" "2016-01-01 01:00:00" "2016-01-01 02:00:00" ...
$ PM2.5_.ug.m3.: num 2.4 7.1 9.7 7.3 4 0.8 1.9 2.2 3.4 4.8 ...
$ PM10_.ug.m3. : num 12.9 16.8 18.4 15 12.1 9.5 11.2 12.3 14.6 16.3 ...
$ CO_.ppm. : num 0 0.2 0.1 0.1 0.1 0.1 0.1 0 0 0.1 ...
$ NO2_.ppm. : num 0.013 0.012 0.008 0.006 0.004 0.005 0.004 0.002 0.004 0.006 ...
$ O3_.ppm. : num 0.017 0.008 0.011 0.012 0.007 0.01 0.013 0.016 0.018 0.019 ...
$ Vis_.1.Mm. : num 15 21 22 19 18 18 18 14 13 12 ...
$ Year : num 2016 2016 2016 2016 2016 ...
$ month : num 1 1 1 1 1 1 1 1 1 1 ...
$ day : int 1 1 1 1 1 1 1 1 1 1 ...
$ wday : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 6 6 6 6 6 6 6 6 6 6 ...
$ hour : int 0 1 2 3 4 5 6 7 8 9 ...
$ minute : int 0 0 0 0 0 0 0 0 0 0 ...
head(dat5)
The variables Date and Time (character type), which are no longer required, were removed from the dat5 dataframe.
dat5 <- subset(dat5, select = -c(Date, Time)) # drop the Date and Time columns from the dat5 dataframe
str(dat5)
'data.frame': 8784 obs. of 13 variables:
$ Date_Time : POSIXct, format: "2016-01-01 00:00:00" "2016-01-01 01:00:00" "2016-01-01 02:00:00" ...
$ PM2.5_.ug.m3.: num 2.4 7.1 9.7 7.3 4 0.8 1.9 2.2 3.4 4.8 ...
$ PM10_.ug.m3. : num 12.9 16.8 18.4 15 12.1 9.5 11.2 12.3 14.6 16.3 ...
$ CO_.ppm. : num 0 0.2 0.1 0.1 0.1 0.1 0.1 0 0 0.1 ...
$ NO2_.ppm. : num 0.013 0.012 0.008 0.006 0.004 0.005 0.004 0.002 0.004 0.006 ...
$ O3_.ppm. : num 0.017 0.008 0.011 0.012 0.007 0.01 0.013 0.016 0.018 0.019 ...
$ Vis_.1.Mm. : num 15 21 22 19 18 18 18 14 13 12 ...
$ Year : num 2016 2016 2016 2016 2016 ...
$ month : num 1 1 1 1 1 1 1 1 1 1 ...
$ day : int 1 1 1 1 1 1 1 1 1 1 ...
$ wday : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 6 6 6 6 6 6 6 6 6 6 ...
$ hour : int 0 1 2 3 4 5 6 7 8 9 ...
$ minute : int 0 0 0 0 0 0 0 0 0 0 ...
head(dat5)
The order of the columns (variables) in the dat5 dataframe was rearranged to make viewing and handling the data easier; a sketch of the rearrangement follows.
**SUMMARY** The dataframe dat5 is now in tidy form (columns = variables, rows = observations, cells = values) and will be used for further evaluation.
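The rearrangement chunk is not shown; a sketch consistent with the column order reported later (see colnames(dat6)) would be:
# Sketch: date/time attributes first, pollutant readings last (columns 8-13)
dat5 <- dat5 %>%
  dplyr::select(Date_Time, Year, month, day, wday, hour, minute, everything())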
This section scans the numeric data for outliers (minimum requirement 8) and explains how these values were dealt with.
The dataframe dat5 was checked for the presence of outliers, starting with methods for detecting univariate outliers. As an example, the script below checks for PM2.5 and PM10 outliers using boxplots and a scatter plot (looking at the relationship between the PM10 and PM2.5 variables).
par(mfrow=c(1,3))
boxplot(dat4$PM2.5_.ug.m3.)
boxplot(dat4$PM10_.ug.m3.)
plot(dat4$PM10_.ug.m3.,dat4$PM2.5_.ug.m3.)
**SUMMARY** The boxplots and scatter plot show the presence of quite a number of outliers for the PM2.5 and PM10 variables. Similar results (not presented here) were observed for the other pollutants, i.e., CO, NO2, O3 and visibility.
library(outliers)
z.scores <- dat5$PM2.5_.ug.m3. %>% scores(type = "z") #Checking outliers for PM2.5 pollutant
z.scores %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.7692 -0.6277 -0.1626 0.0000 0.4082 7.6170
**SUMMARY** The z-scores method indicates the presence of outliers for PM2.5 (|z-score| > 3). Using the same approach, I checked the z-scores for the other pollutants, either one by one or, as presented below, using a for loop.
for(i in 8:13)
{
  print("--------------", quote = FALSE)
  colnames(dat5[i])                           # name of the pollutant column being checked
  z.scores <- dat5[i] %>% scores(type = "z")  # z-scores for the i-th pollutant
  print(z.scores %>% summary())
  a <- sum(abs(z.scores) > 3)                 # number of observations with |z-score| > 3
  fract_outl <- a/dim(dat5)[1]                # fraction of outliers in the column
  cat("Fraction of outliers:", fract_outl, "\n")
}
[1] --------------
PM2.5_.ug.m3.
Min. :-1.7692
1st Qu.:-0.6277
Median :-0.1626
Mean : 0.0000
3rd Qu.: 0.4082
Max. : 7.6170
Fraction of outliers: 0.01468579
[1] --------------
PM10_.ug.m3.
Min. :-2.2170
1st Qu.:-0.6683
Median :-0.1102
Mean : 0.0000
3rd Qu.: 0.5177
Max. : 7.0893
Fraction of outliers: 0.0120674
[1] --------------
CO_.ppm.
Min. :-0.5587
1st Qu.:-0.5587
Median :-0.5587
Mean : 0.0000
3rd Qu.: 0.2703
Max. :10.2182
Fraction of outliers: 0.01673497
[1] --------------
NO2_.ppm.
Min. :-1.7025
1st Qu.:-0.7505
Median :-0.1555
Mean : 0.0000
3rd Qu.: 0.5586
Max. : 5.3187
Fraction of outliers: 0.008652095
[1] --------------
O3_.ppm.
Min. :-1.56025
1st Qu.:-0.74721
Median :-0.02451
Mean : 0.00000
3rd Qu.: 0.69820
Max. : 5.03442
Fraction of outliers: 0.004439891
[1] --------------
Vis_.1.Mm.
Min. :-1.4169
1st Qu.:-0.5212
Median :-0.1954
Mean : 0.0000
3rd Qu.: 0.2117
Max. :19.2670
Fraction of outliers: 0.01650729
**SUMMARY** The results indicate that each of the pol1-pol6 variables has a small fraction of outliers, less than 2%.
A subset containing the pol1, pol2, ..., pol6 columns was created from the dat5 dataframe; this subset is used for the detection and comparison of outliers.
pol1_6_names<- c("PM2.5_.ug.m3.","PM10_.ug.m3.","CO_.ppm.","NO2_.ppm.","O3_.ppm.","Vis_.1.Mm.")
#pol1_2_names<- c("PM2.5_.ug.m3.","PM10_.ug.m3.")
#str(dat5)
#colnames(dat4) checking the columns names
dat5_pol1_6 <-dat5 %>%
dplyr::select(pol1_6_names)
head(dat5_pol1_6) # preview the pollutant subset
This approach allows the outliers for the different pollutants to be checked and compared side by side; a boxplot sketch is shown below.
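The plot call itself is not shown above; a sketch of the comparison whose results are summarised next:
boxplot(dat5_pol1_6, las = 2, main = "Hourly pollutant readings, South Brisbane 2016")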
**SUMMARY** Outliers are clearly present for PM2.5, PM10 and visibility; however, for CO, NO2 and O3 the boxplot scale is too narrow to judge. For this reason the data were normalised to allow a better boxplot comparison.
This section applies an appropriate transformation to at least one of the variables (minimum requirement 9), with the R code, outputs and explanation shown below.
Here I applied min-max normalisation to each of the pol1-pol6 variables in the dat5 dataframe; a sketch of the chunk is shown below.
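The normalisation chunk is not shown here; the following is a sketch consistent with the object name dat5_pol1_6_norm used further below, using the min-max normalisation defined later in the report:
normalize <- function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
dat5_pol1_6_norm <- as.data.frame(lapply(dat5_pol1_6, normalize))   # each pollutant rescaled to [0, 1]
boxplot(dat5_pol1_6_norm, las = 2, main = "Normalised hourly pollutant readings")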
SUMMARY Not surprisingly, the normalised data for pol1-pol6 also show the presence of outliers for the CO, NO2 and O3 concentrations.
I also tried applying multivariate outlier detection using the Mahalanobis distance with Q-Q plots. This was not straightforward due to the large number of data points (my dataset has more than 5000 rows, above the limit of the tool used) and the number of variables involved (pol1-pol6). For illustration, an example of the (commented) code which can be used for smaller datasets is presented below.
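As an illustration only, a commented base-R sketch using stats::mahalanobis (a plain chi-square cut-off approach, not necessarily the method originally intended):
# md <- mahalanobis(dat5_pol1_6,
#                   center = colMeans(dat5_pol1_6),
#                   cov    = cov(dat5_pol1_6))          # squared Mahalanobis distances
# cutoff <- qchisq(0.975, df = ncol(dat5_pol1_6))       # chi-square cut-off for 6 variables
# sum(md > cutoff)                                      # count of multivariate outliers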
An attempt was made to clean the dataset by excluding data points with |z-score| > 3; due to time constraints this was not completed in full.
# Remove rows where any pollutant has |z-score| > 3
z.scores_all <- as.data.frame(lapply(dat5_pol1_6, function(x) scores(x, type = "z")))
dat5_pol1_6_norm_clean <- dat5_pol1_6_norm[!apply(abs(z.scores_all) > 3, 1, any), ]
boxplot(dat5_pol1_6_norm_clean)
SUMMARY The cleaned dataset still has data points well beyond the outer fence; this would require more exploration.
**Note:** The National Air NEPM standards used to calculate the AQI are based on the following averaging periods: 1 day for PM10 and PM2.5; 8 hours for CO; and 1 hour for NO2, O3 and visibility. Here, for simplicity, all standards were applied to the 1-hour averages.
The Air NEPM standards for selected pollutants are set as follow:
pol1_standard = 25 # [ug/m3] PM2.5 (24 hour avg) National Air NEPM Standard
pol2_standard = 50 # [ug/m3] PM10 (24 hour avg) National Air NEPM Standard
pol3_standard = 9 # [ppm] CO ( 8 hour avg) National Air NEPM Standard
pol4_standard = 0.12 # [ppm] NO2 ( 1 hour avg) National Air NEPM Standard
pol5_standard = 0.10 # [ppm] O3 ( 1 hour avg) National Air NEPM Standard
pol6_standard = 210 # [Mm^-1] B_sp Visibility ( 1 hour avg) NSW Standard
The AQI was calculated using the formula AQI_pol = (pol_reading / pol_standard) * 100. For further processing, the concentration values for each pollutant were first assigned to new variables (later merged into a dataframe), and a dataframe with AQI_pol1, ..., AQI_pol6 was created; see the sketch after the concentration assignments below.
pol1_conc<-dat5[,8] # pol1 PM2.5
pol2_conc<-dat5[,9] # pol2 PM10
pol3_conc<-dat5[,10] # pol3 CO
pol4_conc<-dat5[,11] # pol4 NO2
pol5_conc<-dat5[,12] # pol5 O3
pol6_conc<-dat5[,13] # pol6 Visibility
Create dataframes with pol1_conc, ..., pol6_conc and with AQI_pol1, ..., AQI_pol6; a sketch of these chunks follows.
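The chunks building these objects are not shown above; the following sketch is consistent with the object names (Conc_pol1_6, AQI_pol1, ..., AQI_pol6, AQI_pol1_6) and with the summaries reported below:
# Concentrations for all six pollutants in one dataframe
Conc_pol1_6 <- data.frame(pol1_conc, pol2_conc, pol3_conc, pol4_conc, pol5_conc, pol6_conc)
# AQI for each pollutant: (concentration / standard) * 100 [%]
AQI_pol1 <- (pol1_conc / pol1_standard) * 100   # PM2.5
AQI_pol2 <- (pol2_conc / pol2_standard) * 100   # PM10
AQI_pol3 <- (pol3_conc / pol3_standard) * 100   # CO
AQI_pol4 <- (pol4_conc / pol4_standard) * 100   # NO2
AQI_pol5 <- (pol5_conc / pol5_standard) * 100   # O3
AQI_pol6 <- (pol6_conc / pol6_standard) * 100   # Visibility (Bsp)
AQI_pol1_6 <- data.frame(AQI_pol1, AQI_pol2, AQI_pol3, AQI_pol4, AQI_pol5, AQI_pol6)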
Assessing the pol1-6 and AQI_pol1-6 data using the summary() function and plots.
summary(pol1_conc) # summary for pol1 PM2.5
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 5.400 7.600 8.369 10.300 44.400
summary(AQI_pol1) # summary for AQI_pol1
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 21.60 30.40 33.48 41.20 177.60
plot(pol1_conc) # plot pol 1 conc
Assessment of the concentration and AQI data for pollutants pol2-pol6.
summary(pol2_conc) # summary for pol2 PM10
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 11.10 15.10 15.89 19.60 66.70
summary(AQI_pol2) # summary for AQI_pol2
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 22.20 30.20 31.78 39.20 133.40
plot(pol2_conc) # plot pol_2 conc
summary(pol3_conc) # summary for pol3 CO
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.0674 0.1000 1.3000
summary(AQI_pol3) # summary for AQI_pol3
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.7488 1.1111 14.4444
plot(pol3_conc) # plot pol_3 conc
summary(pol4_conc) # summary for pol4 NO2
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.00800 0.01300 0.01431 0.01900 0.05900
summary(AQI_pol4) # summary for AQI_pol4
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 6.667 10.833 11.922 15.833 49.167
plot(pol4_conc) # plot pol_4 conc
summary(pol5_conc) # summary for pol5 O3
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.00900 0.01700 0.01727 0.02500 0.07300
summary(AQI_pol5) # summary for AQI_pol5
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 9.00 17.00 17.27 25.00 73.00
plot(pol5_conc) # plot pol_5 conc
summary(pol6_conc) # summary for pol6 Visibility
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 11.0 15.0 17.4 20.0 254.0
summary(AQI_pol6) # summary for AQI_pol6
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 5.238 7.143 8.286 9.524 120.952
plot(pol6_conc) # plot pol_6 conc
plot(AQI_pol6) # plot AQI_pol6
**SUMMARY** The summaries and scatter plots for pol1-pol6 indicate that the data are quite noisy. Some of the data (e.g., the straight line for pol6 above, data points with index between 5000 and 7000) correspond to the median values used as replacements for missing or negative data. This would probably need to be improved, perhaps by removing the cases with NA or negative values instead of imputing them.
#par(mfrow=c(2, 3))
par(mfrow=c(3, 3))
for (i in 1:3) {
plot(Conc_pol1_6[,i], main = paste0("conc_pol",i))
hist(Conc_pol1_6[,i],main = paste0("conc_pol",i))
boxplot(Conc_pol1_6[,i],main = paste0("conc_pol",i))
}
#par(mfrow=c(2, 3))
par(mfrow=c(3, 3))
for (i in 4:6) {
plot(Conc_pol1_6[,i], main = paste0("conc_pol",i))
hist(Conc_pol1_6[,i],main = paste0("conc_pol",i))
boxplot(Conc_pol1_6[,i],main = paste0("conc_pol",i))
}
par(mfrow=c(1, 3))
for (i in 1:3) {
plot(Conc_pol1_6[,i], main = paste0("conc_pol",i))
hist(Conc_pol1_6[,i],main = paste0("conc_pol",i))
boxplot(AQI_pol1_6[,i],main = paste0("AQI_pol",i))
}
par(mfrow=c(1, 3))
for (i in 4:6) {
plot(Conc_pol1_6[,i], main = paste0("conc_pol",i))
hist(Conc_pol1_6[,i],main = paste0("conc_pol",i))
boxplot(AQI_pol1_6[,i],main = paste0("AQI_pol",i))
}
An air quality index is created by converting the measured pollutant concentrations into index values, based on the Air NEPM (pol1-5) and the NSW EPA standard (pol6); see https://www.qld.gov.au/environment/pollution/monitoring/air-monitoring/air-quality-index. The AQI categories are: VERY GOOD 0-33, GOOD 34-66, FAIR 67-99, POOR 100-149, BAD >150.
# define multiple graphs plotting
par(mfrow=c(2, 3))
hist(pol1_conc) # pol1 PM2.5
hist(pol2_conc) # pol2 PM10
hist(pol3_conc) # pol3 CO
hist(pol4_conc) # pol4 NO2
hist(pol5_conc) # pol5 O3
hist(pol6_conc) # pol6 Visibility
par(mfrow=c(2, 3))
hist(AQI_pol1) # pol1 PM2.5
hist(AQI_pol2) # pol2 PM10
hist(AQI_pol3) # pol3 CO
hist(AQI_pol4) # pol4 NO2
hist(AQI_pol5) # pol5 O3
hist(AQI_pol6) # pol6 Visibility
par(mfrow=c(1, 2))
boxplot(pol1_conc,pol2_conc,pol3_conc,pol4_conc,pol5_conc,pol6_conc)
boxplot(AQI_pol1,AQI_pol2,AQI_pol3,AQI_pol4,AQI_pol5,AQI_pol6)
Min-max normalisation is defined as (X - min(X)) / (max(X) - min(X)). Define a normalisation function:
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
dat5_Norm <- as.data.frame(lapply(dat5[8:13], normalize))
Creating a dataframe CONC_pol1_6 with the variables pol1_conc, ..., pol6_conc for later use (the chunk is commented out here, as the equivalent Conc_pol1_6 dataframe was already created above).
#CONC_pol1_6 <-data.frame(cbind(pol1_conc,pol2_conc,pol3_conc,pol4_conc,pol5_conc,pol6_conc))
#colnames(CONC_pol1_6)[1] <- "PM2.5_(ug/m3)" # change the name of column
#colnames(CONC_pol1_6)[2] <- "PM10_(ug/m3))" # change the name of column
#colnames(CONC_pol1_6)[3] <- "CO_(ppm)" # change the name of column
#colnames(CONC_pol1_6)[4] <- "NO2_(ppm)" # change the name of column
#colnames(CONC_pol1_6)[5] <- "O3_(ppm)" # change the name of column
#colnames(CONC_pol1_6)[6] <- "Vis_(1/Mm)" # change the name of column
#head(CONC_pol1_6)
Creating the dat6 dataframe by binding dat5 with the AQI_pol1, ..., AQI_pol6 variables.
dat6 <-data.frame(cbind(dat5,AQI_pol1,AQI_pol2,AQI_pol3,AQI_pol4,AQI_pol5,AQI_pol6))
head(dat6)
colnames(dat6)
[1] "Date_Time" "Year" "month" "day" "wday" "hour"
[7] "minute" "PM2.5_.ug.m3." "PM10_.ug.m3." "CO_.ppm." "NO2_.ppm." "O3_.ppm."
[13] "Vis_.1.Mm." "AQI_pol1" "AQI_pol2" "AQI_pol3" "AQI_pol4" "AQI_pol5"
[19] "AQI_pol6"
Changing the names of the columns for AQI pol1-6.
AQI_pol1_6_names <-c("AQI_PM2.5","AQI_PM10","AQI_CO","AQI_NO2","AQI_O3","AQI_Vis")
for(i in 1:6){
i_pol=i+13
colnames(dat6)[i_pol] <- AQI_pol1_6_names[i]
}
colnames(dat6)
[1] "Date_Time" "Year" "month" "day" "wday" "hour"
[7] "minute" "PM2.5_.ug.m3." "PM10_.ug.m3." "CO_.ppm." "NO2_.ppm." "O3_.ppm."
[13] "Vis_.1.Mm." "AQI_PM2.5" "AQI_PM10" "AQI_CO" "AQI_NO2" "AQI_O3"
[19] "AQI_Vis"
AQI_Overall for the South Brisbane monitoring site is defined, for each hour, as the maximum of the AQI values across the individual pollutants.
AQI_Overall <- apply(dat6[, 14:19], 1, max) # row-wise maximum across the six AQI columns (columns 14-19)
class(AQI_Overall) # AIQ_Overall is numeric
[1] "numeric"
summary(AQI_Overall)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.857 24.600 30.833 33.494 40.000 133.400
Data normality was assessed for the original dataset (i.e., no transformation applied) using a histogram and a Q-Q plot.
hist(AQI_Overall)
# the histogram for AQI_Overall shows right skewness
qqnorm(AQI_Overall);qqline(AQI_Overall, col = 2) # checking data normality using q-q plot
The AQI_Overall data were transformed using log10, natural log and other functions to see whether these improve the normality of the data.
As above, transformation using the natural log function:
ln_AQI_Overall <- log(AQI_Overall) # natural log of AQI_Overall
hist(ln_AQI_Overall)
qqnorm(ln_AQI_Overall);qqline(ln_AQI_Overall, col = 2) # improved result-> closer to normal distribution
qqnorm(AQI_Overall);qqline(AQI_Overall, col = 2) # not normal distribution
log10_AQI_Overall <- log10(AQI_Overall) # log10 transformation of AQI_Overall
qqnorm(log10_AQI_Overall);qqline(log10_AQI_Overall, col = 2) # improved result -> closer to normal distribution
SUMMARY There is not much difference between the log10 and natural log transformations applied to the AQI_Overall variable.
Mean centering
center_AQI_Overall <-scale(AQI_Overall, center = TRUE, scale = FALSE)
# center_AQI_Overall
hist(center_AQI_Overall)
qqnorm(center_AQI_Overall);qqline(center_AQI_Overall, col = 2) # did not improve normality
# Other transformation, e.g, min-max, z Score Standardisation also tested, providing similar results (not-normal distribution)
An air quality index is created by converting the measured pollutant concentrations into index values, based on the Air NEPM (pol1-5) and the NSW EPA standard (pol6); for details see https://www.qld.gov.au/environment/pollution/monitoring/air-monitoring/air-quality-index.
Definition of the AQI levels: VERY GOOD 0-33, GOOD 34-66, FAIR 67-99, POOR 100-149, BAD >150.
The AQI_Overall values were assigned to air quality levels with cut-points (ranges) determined by the standard.
AQI_Overall_Levels <- cut(AQI_Overall, breaks=c(0,33,66,99,149,Inf),
labels=c("VERY GOOD","GOOD","FAIR","POOR","BAD"))
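A quick tabulation of the resulting factor shows how the hours are distributed across the levels (a sketch only; the counts are not reproduced here):
table(AQI_Overall_Levels)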
In some cases it may be useful to present the relationship between AQI_Overall and the corresponding air quality levels in a simple graphical form, as demonstrated below using the dataset prepared in this assignment.
par(mfrow=c(1, 1))
hist(AQI_Overall)
dat7<-cbind(dat6,AQI_Overall, AQI_Overall_Levels)
colnames(dat6)
[1] "Date_Time" "Year" "month" "day" "wday" "hour"
[7] "minute" "PM2.5_.ug.m3." "PM10_.ug.m3." "CO_.ppm." "NO2_.ppm." "O3_.ppm."
[13] "Vis_.1.Mm." "AQI_PM2.5" "AQI_PM10" "AQI_CO" "AQI_NO2" "AQI_O3"
[19] "AQI_Vis"
colnames(dat7)
[1] "Date_Time" "Year" "month" "day"
[5] "wday" "hour" "minute" "PM2.5_.ug.m3."
[9] "PM10_.ug.m3." "CO_.ppm." "NO2_.ppm." "O3_.ppm."
[13] "Vis_.1.Mm." "AQI_PM2.5" "AQI_PM10" "AQI_CO"
[17] "AQI_NO2" "AQI_O3" "AQI_Vis" "AQI_Overall"
[21] "AQI_Overall_Levels"
head(dat7)
plot(dat7$AQI_Overall,
main = "Hourly Air Quality Index for Brisbane in 2016",
xlab = "hourly index",
ylab = "AQI [-]",
xlim=c(0, 9000),
ylim=c(0, 200)
)
#minor.tick(nx = 4, ny=5, tick.ratio=0.3)
legend("topright", inset=.05, title="AQI Range (upper bounds)",
c("VERY GOOD","GOOD","FAIR","BAD"), col=c("green", "blue", "yellow", "red"),
lty=c(2,2,2,2), lwd=c(3, 3,3,3),
cex = 0.77,
horiz=TRUE)
abline(h=c(33,66,99,149), col=c("green", "blue", "yellow", "red"), lty=c(2,2,2,2), lwd=c(3, 3,3,3))
SUMMARY The graph above shows the hourly Air Quality Index (AQI_Overall) observed at South Brisbane over the entire year 2016. The results indicate that the AQI was mostly at VERY GOOD or GOOD levels (i.e., the majority of values lie below the green and blue dashed lines); however, there was also a significant number of hours at FAIR levels and several at POOR or BAD levels. These would be of particular interest, for example from a source apportionment point of view, or for studying the conditions under which such events occur.
This type of data analysis and graphical representation is useful for further assessment and action in relation to air quality management.