In the 21st century, the largest single threat to the planet’s ecology and biodiversity is global climate disruption and widespread environmental degradation caused by the buildup of human-generated greenhouse gases and air pollutants in the atmosphere. Although all air pollutants are harmful to humans, the most dangerous are fine and coarse particulate matter (PM2.5 and PM10), because they can penetrate deep into the lungs and bloodstream unfiltered. Once there, they accelerate the ageing of the lungs and reduce lung function, and they increase the risk of chronic obstructive pulmonary disease (COPD), permanent DNA mutations, heart attacks and cardiovascular disease, premature delivery, birth defects, low birth weight and premature death. The prevalence of these health effects has been observed to be strongly connected with the weather and meteorological conditions at a given location.
Continuous changes in weather and climate extremes have had a large impact on the prevalence of local air pollutants. According to climate scientists, sunshine, precipitation, wind, humidity and atmospheric pressure can all affect the air quality in an area.
During heavy rain in the wet season, the air becomes stagnant and traps emitted PM pollutants, and the rain then washes out water-soluble PM, often resulting in a decrease in their atmospheric concentration. On the other hand, in the dry season, intense sunshine produces heat waves that dry out vegetation and provide more fuel for wildfires and natural fire outbreaks, whose smoke is a serious air pollutant.
The dry season also raises the atmospheric temperature, which affects pollutant concentrations because it speeds up the atmospheric chemical reactions that form harmful compounds and sometimes produce smog.
Atmospheric scientists are highly confident that wind speed and direction play a significant role in the concentration of air pollutants. Wind carries air contaminants away from their sources, lowering the concentration there and raising it in the area downwind. For example, a northerly wind carries pollutants from the north to the south, and the westerlies carry them from west to east. Furthermore, high wind speeds can also raise clouds of dust, especially in the dry season, which is a problem in dry, windy rural areas. Generally, the higher the wind speed, the more contaminants are dispersed and the lower the pollutant concentration in the zone.
This piece of work aims to give a first intuitive understanding of the direct effects of those weather conditions, mainly wind speed and direction, on regional concentrations of harmful PM through data visualization.
The data set comes from one of the hundreds of air quality monitoring sites in the United Kingdom.
It is publicly available and draws on data archived in the London Air Quality Archive http://www.londonair.org.uk and on the openair project website http://www.openair-project.org.
We shall use several sets of R packages for different tasks:
# Import packages
aligning_plots_packages <- c("gridExtra", "grid")
data_exploration_packages <- c("tidyverse", "plotly", "openxlsx")
table_formating_packages <- c("knitr","kableExtra")
descriptive_statistics_packages <- c("table1","arsenal","pastecs")
air_quality_packages <- c("openair", "worldmet")
NA_treatment_packages <- c("mice", "VIM")
if (!require(install.load)) {install.packages("install.load")}
install.load::install_load(c(aligning_plots_packages,data_exploration_packages,table_formating_packages, descriptive_statistics_packages,air_quality_packages,NA_treatment_packages))
Let’s get the dataset first, then check the available variables and their names.
# import data from the UK automatic urban and rural network in Marylebone site
#library(openair) #uncomment to run
#mary <- importAURN(site = "my1", year = 1998:2005)#uncomment to run
# Import data from openair package
air_quality_dataset <- openair::mydata
head(air_quality_dataset)
This dataset contains hourly measurements of different air pollutant concentrations in \(\mu g\,m^{-3}\), together with wind speed and direction measured in \(m\,s^{-1}\) and degrees respectively. It was collected at the Marylebone air quality monitoring station (London, UK) from 1st January 1998 to 23rd June 2005.
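Because wind direction is stored in degrees, it can help to translate a few values into compass sectors when reading the plots later on. The following is a minimal sketch, assuming the usual meteorological convention that the angle gives the direction the wind blows from (0 or 360 for north, 90 for east). The helper `wd_to_sector()` is a hypothetical name introduced here for illustration and is not part of openair.

```r
# Hypothetical helper (not an openair function): map wind direction in
# degrees to one of eight compass sectors. The angle is assumed to be the
# direction the wind blows FROM (0 or 360 = north, 90 = east).
wd_to_sector <- function(wd) {
  sectors <- c("N", "NE", "E", "SE", "S", "SW", "W", "NW")
  # Each sector spans 45 degrees and is centred on its compass bearing.
  idx <- (round((wd %% 360) / 45) %% 8) + 1
  sectors[idx]
}

# A few example bearings, including the wrap-around near north
example_sectors <- wd_to_sector(c(0, 44, 90, 210, 350, 360))
print(example_sectors)
```

With this convention, a reading of 210 degrees is a south-westerly wind, that is, one that carries air from the south-west towards the north-east.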
The dataset has \(65,533\) observations and \(10\) variables. The first column, date, is unique and uses the year-month-day hour:minute:second format. The remaining \(9\) features are described below:
Since data preparation is one of the most sensitive steps in data analytics, let’s prepare our data set for further analysis.
# matrixplot() is provided by the VIM package loaded above
# Data matrix plot indicating missing values in red
matrixplot(air_quality_dataset, sortby = 2, ylim = c(0,900), font.axis = 4)
The above graph draws each data value as a rectangle: available data are shown on a grey colour scale, while missing data are shown in a clearly distinguishable red.
## date ws wd nox no2 o3 pm10 so2 co pm25
## 0 632 219 2423 2438 2589 2162 10450 1936 8775
The above table gives the exact number of missing values for each variable in the data set. To get a meaningful insight into the missing values, we also need to check the percentage of missing values in the entire dataset and in our target variables (PM2.5 and PM10).
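The code that produced the per-variable counts is not shown in the text. A minimal sketch of one way to obtain such counts and the corresponding rates in base R is given below. The data frame `toy` and its values are made up for illustration and merely stand in for `air_quality_dataset`.

```r
# Toy data frame standing in for air_quality_dataset (values are made up)
toy <- data.frame(
  ws   = c(1.2, NA, 3.4, 2.0),
  pm10 = c(30, 41, NA, NA),
  pm25 = c(18, 22, 25, NA)
)

na_count <- colSums(is.na(toy))   # number of missing values per column
na_share <- colMeans(is.na(toy))  # share of rows missing, per column
overall  <- mean(is.na(toy))      # share of all cells that are missing

print(na_count)
print(round(100 * na_share, 1))   # per-column rates as percentages
print(overall)
```

Note the different denominators: a per-column rate divides by the number of rows, while the overall rate divides by the number of cells (rows times columns).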
# Rate of missing values in data set
sum(is.na(air_quality_dataset))/(nrow(air_quality_dataset)*ncol(air_quality_dataset))
## [1] 0.0482566
The overall rate of missing values is about \(4.8\%\), which is low relative to the amount of data available and is unlikely, on its own, to bias our analysis.
The chart below shows the missing values in the coarse and fine particulate matter variables.
library(VIM)
#a scatterplot with additional information on the missing values
par(
# Change the colors
col.main = "#336600", col.lab = "#0033FF", col.axis = "#333000",
# Titles in italic and bold
font.main = 4, font.lab = 4, font.axis = 4,
# Change font size
cex.main = 1.2, cex.lab = 1, cex.axis = 1
)
marginplot(air_quality_dataset[,c("pm10","pm25")],pch = 16 , cex = 1.5 ,numbers = T, xlim = c(0,800), ylim = c(0,400), main = "Scatterplot with missing values information", xlab = "Hourly PM10 Concentration ", ylab = "Hourly PM2.5 Concentration ")
The above graph shows the points with no missing values as a standard scatterplot in blue. Observations for which PM10 is missing are shown in red along the y-axis, and those for which PM2.5 is missing in red along the x-axis. In addition, boxplots of both variables are drawn along the axes, split by whether the other variable is missing (red) or observed (blue).
So, at what rate are PM2.5 and PM10 missing?
# Rate of missing values for PM10 (share of rows where pm10 is NA)
sum(is.na(air_quality_dataset$pm10))/nrow(air_quality_dataset)
## [1] 0.03299101
# Rate of missing values for PM2.5 (share of rows where pm25 is NA)
sum(is.na(air_quality_dataset$pm25))/nrow(air_quality_dataset)
## [1] 0.133902
This shows that about \(3.3\%\) of the PM10 observations and \(13.4\%\) of the PM2.5 observations are missing.
Given that the number of observations is large and all variables are sufficiently represented, and since missing values can bias the analysis, especially multivariate estimates (i.e. correlation or regression estimates), we exclude (delete) the observations with missing values (NA) from our dataset for a smoother analysis.
## Deleting the missing observations
We remove the observations with missing data, view the remaining data set, and present its final descriptive statistics for clarification.
# Deleting of NA in data set
cleaned_air_quality_dataset <- na.omit(air_quality_dataset)
head(cleaned_air_quality_dataset) # Check first 6 rows after NA deletion
# Graphical presentation of the cleaned data set
# matrixplot() is provided by the VIM package loaded above
matrixplot(cleaned_air_quality_dataset, sortby = 2, ylim = c(0,900), font.axis = 4)
## date ws wd nox no2 o3 pm10 so2 co pm25
## 0 0 0 0 0 0 0 0 0 0
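Listwise deletion with `na.omit()` drops every row that has a missing value in any column, so it is worth checking how much data survives. In this document, the summary table reports n = 42524 out of the original 65,533 rows, which is roughly 65%. The sketch below shows the check on a made-up data frame; `toy` and its values are assumptions for illustration only.

```r
# Toy data frame (made-up values) with NAs spread over different rows
toy <- data.frame(
  pm10 = c(30, NA, 28, 35),
  pm25 = c(18, 22, NA, 20)
)

complete <- na.omit(toy)                 # keep only fully observed rows
retained <- nrow(complete) / nrow(toy)   # share of rows that survive

print(complete)
print(retained)
```

Because each column here loses one value in a different row, two of the four rows are dropped even though each column is only 25% missing. The same effect explains why the share of rows lost can exceed the missing rate of any single variable.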
# Viewing the descriptive statistics of PM2.5 & PM10
#we are first relabelling our columns for aesthetics.
table1::label(cleaned_air_quality_dataset$pm25) <- "Fine Particulate Matter (PM2.5)"
table1::label(cleaned_air_quality_dataset$pm10) <- "Coarse Particulate Matter (PM10)"
table1::label(cleaned_air_quality_dataset$wd) <- "Wind direction"
#Then we are creating the table with only one line of code.
table1::table1(~pm25 + pm10 + wd, data = cleaned_air_quality_dataset)
| | Overall (n=42524) |
|---|---|
| Fine Particulate Matter (PM2.5) | |
| Mean (SD) | 22.0 (12.5) |
| Median [Min, Max] | 20.0 [0.00, 381] |
| Coarse Particulate Matter (PM10) | |
| Mean (SD) | 35.1 (21.4) |
| Median [Min, Max] | 32.0 [1.00, 800] |
| Wind direction | |
| Mean (SD) | 197 (94.4) |
| Median [Min, Max] | 210 [0.00, 360] |
No missing values remain in the data. As the above table shows, the mean concentrations are \(22.0\ \mu g\,m^{-3}\) for PM2.5 and \(35.1\ \mu g\,m^{-3}\) for PM10, which are below the international standards of \(65.4\) and \(150\ \mu g\,m^{-3}\) for PM2.5 and PM10 respectively.
Now that the data set is cleaned and free of missing values, we can go ahead with the first step of data analytics (visualization).
Because particulate matter (PM2.5 and PM10) consists of microscopic solids or liquid droplets that are small enough to be inhaled easily and cause serious health problems, we are highly motivated to present graphically their concentration, which is strongly affected by the aforementioned weather conditions prevailing in the region.
# Wind speed Vs PM2.5
Plot1 <- ggplot(cleaned_air_quality_dataset, aes(x = ws, y = pm25)) + geom_point(col = "blue", size = 2) + theme_light() + labs(title = "Effect of wind speed on PM2.5 concentration", x = "Hourly wind speed ", y = "Hourly PM2.5 concentration", caption = "@mgisa")
# Wind speed Vs hourly PM10
Plot2 <- ggplot(cleaned_air_quality_dataset, aes(x = ws, y = pm10)) + geom_point(col = "cyan", size = 2) + theme_light() + labs(title = "Effect of wind speed on PM10 concentration", x = "Hourly wind speed ", y = "Hourly PM10 concentration",caption = "@mgisa")
# aligning two plots
grid.arrange(Plot1,Plot2, nrow = 1, top = "The wind speed effects on PM concentration", bottom = textGrob("Plotted on July 30, 2019 11: 11:30",gp = gpar(fontface = 3, fontsize = 9),hjust = 1, x = 1))
The above graphs indicate that as wind speed increases, the concentrations of both fine and coarse atmospheric particulate matter decrease.
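A scatterplot with tens of thousands of points can hide the trend, so one way to back up the visual impression is to average PM within wind speed bins and to compute a rank correlation. The sketch below uses synthetic data only: the exponential decay and the noise level are assumptions chosen for illustration and are not estimates from the Marylebone data.

```r
set.seed(1)

# Synthetic data (NOT the Marylebone measurements): PM2.5 decays with
# wind speed, plus random noise.
ws   <- runif(500, min = 0, max = 12)
pm25 <- 40 * exp(-0.15 * ws) + rnorm(500, sd = 3)

# Mean PM2.5 within 2 m/s wind speed bins
bins      <- cut(ws, breaks = seq(0, 12, by = 2), include.lowest = TRUE)
bin_means <- tapply(pm25, bins, mean)

# Spearman rank correlation: negative if PM tends to fall as wind rises
rho <- cor(ws, pm25, method = "spearman")

print(round(bin_means, 1))
print(rho)
```

On the real data, the same steps could be applied to the `ws`, `pm25` and `pm10` columns of `cleaned_air_quality_dataset`. The Spearman coefficient is used here because the relationship need not be linear.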
#Yearly PM2.5 concentration
polarPlot(cleaned_air_quality_dataset, pollutant = "pm25", type = "year", layout = c(4,2), key.header = "Mean of PM2.5", cols = c("#003300","#0000FF", "#FF9933", "#FF36FF", "#FF3300"), main = " Yearly PM2.5 Concentration")
#Yearly PM10 concentration
polarPlot(cleaned_air_quality_dataset, pollutant = "pm10", type = "year", layout = c(4,2), key.header = "Mean of PM10", cols = c("#009900", "#33FFFF", "#FFFF33","#FF9933", "#990033"), main = " Yearly PM10 Concentration")
The plots show that the PM concentration decreases over time and that the highest concentrations occur in three consecutive years (\(1999-2001\)) at this site. They appear to be driven by northerly winds, which blow from north to south and concentrate the pollutants in the south-western part of the site. The blue outer concentric circle represents the maximum wind speed, which corresponds to a low concentration of PM air pollutants.
Since we have identified that high PM concentrations occur in the year 2000, let’s present both its daily and monthly PM concentrations.
# Year 2000, daily&monthly PM2.5 concentration
calendarPlot(cleaned_air_quality_dataset, pollutant = "pm25",year = "2000", cols = c("yellow","magenta", "green", "red"),key.header = "Mean of PM2.5",main = " Year 2000, daily and monthly PM2.5 distribution")
# Year 2000, daily&monthly PM10 concentration
calendarPlot(cleaned_air_quality_dataset, pollutant = "pm10", year = "2000", cols = c("yellow","cyan", "blue", "red"),key.header = "Mean of PM10",main = " Year 2000, daily and monthly PM10 distribution")
Now it is possible to see that PM tends to be concentrated in the last week of the month, especially in January, March, June and July. In addition, lower concentrations occur in the last quarter of the year 2000, especially on 25th, 26th and 27th December 2000.
# Seasonal PM2.5 concentration
pollutionRose(cleaned_air_quality_dataset, pollutant = "pm25", key.header = "PM2.5 concentration",cols = c("yellow", "green", "blue", "black", "red"), type = 'season', legend_title = "PM2.5 concentration",legend.title.align = .5, angle = 45, width = 1,grid.line = list(value = 10, lty = 5, col = "purple"),main = "Seasonal concentration of PM2.5")
# Seasonal PM10 concentration
pollutionRose(cleaned_air_quality_dataset, pollutant = "pm10", key.header = "PM10 concentration",cols = c("yellow", "green", "blue", "black", "red"), type = 'season', legend_title = "PM10 concentration",legend.title.align = .5, angle = 45, width = 1,grid.line = list(value = 10, lty = 5, col = "red"),main = "Seasonal concentration of PM10")
The above graphs give a concise picture of how PM concentration is typically affected by the wind speed and direction prevailing in a given area in different seasons. They also show how the percentage of time that the wind blows from a particular direction (the four cardinal directions N, S, W and E are indicated) changes with the PM distribution.
In fact, the above wind roses show that most of the time the wind at this site blows from the south-west, as indicated by the long spoke in the south-west direction. The highest wind frequency occurs in the summer season, which corresponds to the dry south-west monsoon wind. This kind of wind concentrates particulate matter in the region and may sometimes bring heavy rain and storms. Such changes in the weather can produce flooding and even raise the wildfire threat, resulting in smog and smoke that are harmful to life in the area.
On the other hand, low concentrations of both PM2.5 and PM10 occur in the remaining seasons, with estimated concentrations of \(0-10\) and \(0-20\ \mu g\,m^{-3}\) respectively and reasonably low wind speeds prevailing in each season.
Let’s assess the direct impact of each season, together with the prevailing wind, on the PM concentration in the region:
# PM2.5 in weekdays and weekend
polarPlot(cleaned_air_quality_dataset, pollutant = "pm25", type = "weekend",cols = c("yellow", "blue", "magenta"),key.header = "Mean of PM25",main = "PM25 distribution in weekdays and weekend")
# PM10 in weekdays and weekend
polarPlot(cleaned_air_quality_dataset, pollutant = "pm10", type = "weekend",cols = c("yellow", "blue", "red"),key.header = "Mean of PM10",main = "PM10 distribution in weekdays and weekend")
The above bivariate polar plots indicate that both fine and coarse particulate matter are highly concentrated in the south and a small part of the north-east during weekdays, due to the high frequency of wind blowing from the north-east. Weekends show lower PM concentrations for various reasons, including reduced traffic and fewer PM-emitting human activities in the region.
# Daily PM2.5 concentration
polarPlot(cleaned_air_quality_dataset, pollutant = "pm25", type = "weekday",cols = c("yellow", "blue", "magenta"),key.header = "Mean of PM2.5",main = " Daily PM2.5 distribution")
# Daily PM10 concentration
polarPlot(cleaned_air_quality_dataset, pollutant = "pm10", type = "weekday",cols = c("yellow", "blue", "red"),key.header = "Mean of PM10",main = " Daily PM10 distribution")
The above plots indicate that atmospheric particulate matter is higher on weekdays than at the weekend, for reasons that include heavier traffic and more intensive human activity on working days.
# Day&nighttime PM2.5 concentration
polarPlot(cleaned_air_quality_dataset, pollutant = "pm25", type = "daylight",cols = c("yellow", "green", "magenta"),key.header = "Mean of PM2.5",main = " Day and night PM2.5 distribution")
# Day&nighttime PM10 concentration
polarPlot(cleaned_air_quality_dataset, pollutant = "pm10", type = "daylight",cols = c("yellow", "blue", "red"),key.header = "Mean of PM10",main = " Day and night PM10 distribution")
At this air quality monitoring site, PM concentrations are higher during the daytime than at night, and highest when the wind blows from the south.
Below, the Pearson correlation coefficient between the two pollutants is calculated and plotted in order to identify their possible sources based on the strength of their relationship.
# Pearson Correlation btn PM2.5 and PM10
polarPlot(cleaned_air_quality_dataset, pollutant = c("pm25","pm10"), statistic = "r",cols = c("yellow", "blue", "green", "red"), key.header = "Correlation coefficient", main = " Pearson correlation between PM2.5 and PM10")
The above graph shows that the two pollutants are highly correlated, with a correlation coefficient above \(0.7\). On that basis, we identify their main sources using the working hypothesis that “highly correlated pollutants originate from the same sources”.
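For reference, the Pearson coefficient measures the strength of the linear relationship between two variables and lies between -1 and 1. The sketch below computes it by hand and with the built-in `cor()` on a few made-up values. The vectors `x` and `y` are assumptions for illustration, and the way `polarPlot()` evaluates the statistic across wind speed and direction bins is not covered here.

```r
# Made-up paired concentrations, for illustration only
x <- c(12, 18, 25, 31, 40)   # stands in for PM2.5
y <- c(20, 27, 41, 45, 66)   # stands in for PM10

# Pearson r: covariance of x and y scaled by both standard deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Same quantity from base R
r_builtin <- cor(x, y, method = "pearson")

print(r_manual)
print(r_builtin)
```

A high coefficient says the two series rise and fall together. It is consistent with shared sources but does not prove them, so the source attribution that follows should be read as a hypothesis.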
The main sources of particulate matter at the Marylebone site were fuel burning in automobiles, industrial processes, and windblown dust.
Here we evaluated and presented graphically the impact of weather conditions on air pollutant concentrations, specifically the most harmful particulate matter (PM), at the Marylebone air quality monitoring site in the United Kingdom (UK). We plotted the PM concentration over different periods of time and found that the weather conditions (wind, temperature, precipitation) have significant effects on air pollutant concentrations in each period considered.
We applied the Pearson correlation to measure the strength of the linear relationship between coarse and fine particulate matter in order to identify and apportion their main sources at the site. Having identified the sources, we strongly recommend serious abatement of air pollutants and advise people, especially vulnerable groups (children, elderly people and those with pulmonary and cardiovascular diseases), to avoid exposure during outdoor physical exertion.
Finally, we expect that in the near future Machine Learning algorithms will be widely used to investigate the statistical dependency between weather conditions and atmospheric particulate matter concentration at the Marylebone air quality monitoring site.