1 General introduction

In the 21st century, the largest single threat to the planet's ecology and biodiversity is global climate disruption and mass environmental degradation caused by the buildup of human-generated greenhouse gases and air pollutants in the atmosphere. Although all air pollutants are harmful to humans, the most dangerous are fine and coarse particulate matter (PM₂.₅ and PM₁₀), because they can penetrate deep into the lungs and the bloodstream unfiltered. There they accelerate the ageing of the lungs and reduce lung function, raise the risk of COPD (chronic obstructive pulmonary disease), and contribute to permanent DNA mutations, heart attacks and cardiovascular disease, premature delivery, birth defects, low birth weight and premature death. The prevalence of these health effects has been observed to be strongly connected with the weather and meteorological conditions at a given location.

2 The effects of weather conditions on local air pollutants concentration

The continuous changes in weather and climate extremes have massively affected the prevalence of local air pollutants. According to climate scientists, sunshine, precipitation, wind, humidity, and atmospheric pressure can all affect the air quality in an area.

During heavy rain in the wet season, the air becomes stagnant and traps emitted PM, and the rain then washes out water-soluble PM, which often lowers its atmospheric concentration. On the other hand, intense sunshine in the dry season produces heat waves that dry out vegetation, providing more fuel for wildfires and natural fire outbreaks whose smoke is a serious air pollutant.

The dry season also raises the atmospheric temperature, which affects pollutant concentrations because it speeds up the atmospheric chemical reactions that form harmful compounds and sometimes produce smog.

Atmospheric scientists are highly confident that wind speed and direction play a significant role in air pollutant concentrations. Wind carries air contaminants away from their sources, lowering the concentration near the source and raising it downwind. For example, a northerly wind carries pollutants from the north to the south, and the westerlies carry them from west to east. High wind speeds can also raise clouds of dust, especially in the dry season, which is a problem in dry, windy rural areas. In general, the higher the wind speed, the more contaminants are dispersed and the lower the air pollutant concentration in the zone.

This work aims to give a first intuitive understanding of the direct effects of these weather conditions, mainly wind speed and direction, on regional concentrations of harmful PM through data visualization.

3 Employed data set

The data set used comes from one of the hundreds of air quality monitoring sites in the United Kingdom.

3.1 Source of data

The data set is publicly available from the London Air Quality Archive (http://www.londonair.org.uk) and from the openair project website (http://www.openair-project.org).

3.2 Load the libraries and packages

We shall use several sets of R packages for different tasks:

Plot alignment packages
Data preparation and exploration packages
Table formatting packages
Descriptive statistics packages
Missing data treatment packages
The openair package, to implement the bivariate polar plots applied to air pollution problems

# Import packages

aligning_plots_packages <- c("gridExtra", "grid")
data_exploration_packages <- c("tidyverse", "plotly", "openxlsx")
table_formating_packages <- c("knitr","kableExtra")
descriptive_statistics_packages <- c("table1","arsenal","pastecs")
air_quality_packages <- c("openair", "worldmet")
NA_treatment_packages <- c("mice", "VIM")

if (!require(install.load)) {install.packages("install.load")}

install.load::install_load(c(aligning_plots_packages,data_exploration_packages,table_formating_packages, descriptive_statistics_packages,air_quality_packages,NA_treatment_packages))

3.3 Import and view the dataset

Let’s load the dataset first, then check the available variables and their names.

# import data from the UK automatic urban and rural network in Marylebone site
#library(openair) #uncomment to run
#mary <- importAURN(site = "my1", year = 1998:2005)#uncomment to run

# Import data from openair package
air_quality_dataset <- openair::mydata
head(air_quality_dataset)

3.4 Dataset description

This dataset contains hourly measurements of air pollutant concentrations in \(\mu g\,m^{-3}\), with wind speed and direction measured in \(m\,s^{-1}\) and degrees respectively. It was collected at the Marylebone air quality monitoring station (London, UK) from 1 January 1998 to 23 June 2005.

The dataset has \(65,533\) observations and \(10\) variables. The first column, date, holds a unique timestamp in year-month-day hour:minute:second format. The remaining \(9\) features are described below:

ws (type: dbl): the wind speed, in numeric values with decimal points.
wd (type: int): the wind direction, as an integer measured in degrees.
nox (type: int): nitrogen oxides, air pollutants formed by the reaction of nitrogen and oxygen gases in the air during combustion, especially at high temperatures.
no2 (type: int): nitrogen dioxide, an air pollutant produced by road traffic and other fossil fuel combustion processes.
o3 (type: int): ozone, an air pollutant formed when pollutants emitted by cars, power plants, industrial boilers, refineries, chemical plants, and other sources react chemically in the presence of sunlight.
pm10 (type: int): atmospheric particulate matter (PM₁₀) with a diameter of less than 10 micrometres.
so2 (type: int): sulfur dioxide (SO₂), an air pollutant mainly emitted by the burning of fossil fuels such as coal, oil, and diesel, or of other materials that contain sulfur.
co (type: int): carbon monoxide (CO), a toxic air pollutant produced by the incomplete combustion of carbon-containing fuels such as gasoline, natural gas, oil, coal, and wood.
pm25 (type: int): atmospheric particulate matter (PM₂.₅) with a diameter of less than 2.5 micrometres.
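Because wind direction is stored in degrees, it can help to translate it into compass labels when reading the plots in later sections. The following is a minimal sketch in base R; the helper name `wd_to_compass` and the 8-sector split are assumptions made for illustration and are not part of the openair package.

```r
# Hypothetical helper: map wind direction in degrees (the direction the
# wind blows FROM) to one of 8 compass sectors, each 45 degrees wide.
wd_to_compass <- function(wd) {
  sectors <- c("N", "NE", "E", "SE", "S", "SW", "W", "NW")
  # Nearest multiple of 45 degrees; 360 wraps back to the first sector
  idx <- (round(wd / 45) %% 8) + 1
  sectors[idx]
}

wd_to_compass(c(0, 90, 225, 360))
# "N"  "E"  "SW" "N"
```

Missing directions stay missing, so the helper can be applied to the raw column before any cleaning.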

4 Data preparation and preprocessing

Since data preparation is one of the most sensitive steps in data analytics, let’s prepare our data set for further analysis.

4.1 Tracing of missing values

# matrixplot() is provided by the VIM package (loaded above)
# Data matrix plot indicating missing values in red
 
matrixplot(air_quality_dataset, sortby = 2, ylim = c(0,900), font.axis = 4)

The graph above draws each data cell as a rectangle: available values are shown on a grey scale, while missing values are shown in a clearly distinguishable red.

#Number of missing values by columns

colSums(is.na(air_quality_dataset))

##  date    ws    wd   nox   no2    o3  pm10   so2    co  pm25 
##     0   632   219  2423  2438  2589  2162 10450  1936  8775

The table above gives the exact number of missing values for each variable in the data set. To get more meaningful insight, we also need to check the percentage of missing values in the entire dataset and in our target variables (PM₂.₅ and PM₁₀).

# Rate of missing values in data set
sum(is.na(air_quality_dataset))/(nrow(air_quality_dataset)*ncol(air_quality_dataset))

## [1] 0.0482566

The rate of missing values in the data is \(4.8\%\), which is low relative to the amount of available data and is unlikely to bias our analysis.
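Missing-value rates can also be read per column, which is often easier to interpret than a single overall figure. A minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not the Marylebone data):

```r
# Hypothetical toy data with a few missing values
toy <- data.frame(
  pm10 = c(30, NA, 41, 28, NA),
  pm25 = c(18, 20, NA, 15, 17),
  ws   = c(2.1, 3.4, 4.0, 1.2, 5.5)
)

# Share of missing values within each column (NA count / number of rows)
col_share <- colMeans(is.na(toy))

# Share of missing values over all cells (NA count / (rows * columns))
overall_share <- mean(is.na(toy))

col_share      # pm10 0.4, pm25 0.2, ws 0.0
overall_share  # 0.2
```

The per-column share divides by the number of rows only, so it answers "how much of this variable is missing", while the overall share divides by the number of cells.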

The chart below shows the missing values in both the coarse and fine particulate matter variables.

library(VIM)
#a scatterplot with additional information on the missing values

par(
  # Change the colors
  col.main = "#336600", col.lab = "#0033FF", col.axis = "#333000",
  # Titles in italic and bold
  font.main = 4, font.lab = 4, font.axis = 4,
  # Change font size
  cex.main = 1.2, cex.lab = 1, cex.axis = 1
)
marginplot(
  air_quality_dataset[, c("pm10", "pm25")],
  pch = 16, cex = 1.5, numbers = TRUE,
  xlim = c(0, 800), ylim = c(0, 400),
  main = "Scatterplot with missing values information",
  xlab = "Hourly PM10 Concentration ", ylab = "Hourly PM2.5 Concentration "
)

In the graph above, the points with no missing values are drawn as a standard scatterplot in blue. The points for which PM₁₀ is missing are shown in red along the y-axis, and those for which PM₂.₅ is missing are shown in red along the x-axis. In addition, boxplots of both variables are drawn in the margins: the red boxplot shows the distribution of one variable where the other is missing, and the blue boxplot shows its distribution where the other is observed.

So, at what rate are PM₂.₅ and PM₁₀ values missing?

# Rate of missing values for PM10 (share of PM10 observations that are NA)
mean(is.na(air_quality_dataset$pm10))

## [1] 0.03299101

# Rate of missing values for PM2.5 (share of PM2.5 observations that are NA)
mean(is.na(air_quality_dataset$pm25))

## [1] 0.133902

The missing value rates of PM₁₀ and PM₂.₅ are \(3.3\%\) and \(13.4\%\) of their observations respectively.

Because the data set has a large number of observations and every variable remains well represented, and because missing values can bias multivariate estimates (e.g. correlation or regression estimates), we exclude the observations with missing values (NA) from the dataset before the analysis.

4.2 Deleting the missing observations

We remove the observations with missing values, view the remaining data set, and present its final descriptive statistics.

# Deleting of NA in data set
cleaned_air_quality_dataset <- na.omit(air_quality_dataset)
head(cleaned_air_quality_dataset) # Check first 6 rows after NA deletion

# Graphical presentation of the cleaned data set
# matrixplot() is provided by the VIM package
matrixplot(cleaned_air_quality_dataset, sortby = 2, ylim = c(0,900), font.axis = 4)

# Checking whether there is no remaining NA in data set
colSums(is.na(cleaned_air_quality_dataset))

## date   ws   wd  nox  no2   o3 pm10  so2   co pm25 
##    0    0    0    0    0    0    0    0    0    0
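Listwise deletion with `na.omit()` drops a row as soon as any column is missing. As a hedged illustration of how much that choice matters, the sketch below compares it with keeping the rows that are complete for the two PM columns only; the data frame `toy` is hypothetical, and the second variant is shown for comparison, not as the approach taken in this analysis.

```r
# Hypothetical toy data: so2 is missing more often than the PM columns
toy <- data.frame(
  pm10 = c(30, NA, 41, 28, 35),
  pm25 = c(18, 20, NA, 15, 17),
  so2  = c(NA, 4, 5, NA, 6)
)

# Rows complete in every column (what na.omit() keeps)
all_cols <- na.omit(toy)

# Rows complete in the PM columns only
pm_only <- toy[complete.cases(toy[, c("pm10", "pm25")]), ]

nrow(all_cols)  # 1
nrow(pm_only)   # 3
```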

# Viewing the descriptive statistics of PM2.5 & PM10
#we are first relabelling our columns for aesthetics.
table1::label(cleaned_air_quality_dataset$pm25) <- "Fine Particulate Matter (PM2.5)"
table1::label(cleaned_air_quality_dataset$pm10) <- "Coarse Particulate Matter (PM10)"
table1::label(cleaned_air_quality_dataset$wd) <- "Wind direction"
#Then we are creating the table with only one line of code. 
table1::table1(~pm25 + pm10 + wd, data = cleaned_air_quality_dataset)

                                      Overall (n=42524)
Fine Particulate Matter (PM2.5)
    Mean (SD)                         22.0 (12.5)
    Median [Min, Max]                 20.0 [0.00, 381]
Coarse Particulate Matter (PM10)
    Mean (SD)                         35.1 (21.4)
    Median [Min, Max]                 32.0 [1.00, 800]
Wind direction
    Mean (SD)                         197 (94.4)
    Median [Min, Max]                 210 [0.00, 360]

No missing values remain in the data. As the table above shows, the mean concentrations are \(22.0\ \mu g\,m^{-3}\) for PM₂.₅ and \(35.1\ \mu g\,m^{-3}\) for PM₁₀, which are below the international standards of \(65.4\) and \(150\ \mu g\,m^{-3}\) respectively.

Now that the data set is clean, with no missing values or other anomalies, we can proceed to the first step of the analysis: visualization.

5 Visualize the impact of weather conditions on PM concentration

Because particulate matter (PM₂.₅ and PM₁₀) consists of microscopic solids or liquid droplets that are small enough to be inhaled and cause serious health problems, we present graphically how its concentration is affected by the weather conditions prevailing in the region.

5.1 Wind speed Vs PM concentration

# Wind speed Vs PM2.5
Plot1 <- ggplot(cleaned_air_quality_dataset, aes(x = ws, y = pm25)) + geom_point(col = "blue", size = 2) + theme_light() + labs(title = "Effect of wind speed on PM2.5 concentration", x = "Hourly wind speed ", y = "Hourly PM2.5 concentration", caption = "@mgisa")

# Wind speed Vs hourly PM10
Plot2 <- ggplot(cleaned_air_quality_dataset, aes(x = ws, y = pm10)) + geom_point(col = "cyan", size = 2) + theme_light() + labs(title = "Effect of wind speed on PM10 concentration", x = "Hourly wind speed ", y = "Hourly PM10 concentration",caption = "@mgisa")

# aligning two plots
grid.arrange(Plot1,Plot2, nrow = 1, top = "The wind speed effects on PM concentration", bottom = textGrob("Plotted on July 30, 2019 11: 11:30",gp = gpar(fontface = 3, fontsize = 9),hjust = 1, x = 1))

The graphs above indicate that as wind speed increases, the concentrations of both fine and coarse particulate matter decrease.
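One hedged way to put a number on this visual impression is to average the concentration within wind speed bins and to compute a rank correlation. The sketch below uses a made-up, noise-free series (`ws`, `pm25` and the bin edges are hypothetical), so it illustrates the technique rather than the Marylebone results.

```r
# Hypothetical data: concentration falls monotonically with wind speed
ws   <- seq(0.5, 12, by = 0.5)
pm25 <- 10 + 40 / (1 + ws)

# Mean concentration within wind speed bins (intervals closed on the right)
bins      <- cut(ws, breaks = c(0, 2, 4, 6, 8, 12))
bin_means <- as.numeric(tapply(pm25, bins, mean))

# Spearman rank correlation: -1 for a strictly decreasing relationship
rho <- cor(ws, pm25, method = "spearman")

bin_means
rho
```

On real hourly data the relationship is noisy, so the binned means are usually more informative than a single scatterplot with tens of thousands of overlapping points.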

5.2 Yearly PM concentration

#Yearly PM2.5 concentration
polarPlot(cleaned_air_quality_dataset, pollutant = "pm25", type = "year", layout = c(4,2), key.header = "Mean of PM2.5",  cols = c("#003300","#0000FF", "#FF9933", "#FF36FF", "#FF3300"), main = " Yearly PM2.5 Concentration")

#Yearly PM10 concentration
polarPlot(cleaned_air_quality_dataset, pollutant = "pm10", type = "year", layout = c(4,2), key.header = "Mean of PM10", cols = c("#009900", "#33FFFF", "#FFFF33","#FF9933", "#990033"), main = " Yearly PM10 Concentration")

PM concentration decreases over time, and the highest concentrations at this site occur in three consecutive years (\(1999-2001\)). They are attributed to northerly winds blowing from north to south, which concentrate PM in the south-western part of the site. The blue outer concentric ring represents the maximum wind speeds, which coincide with low concentrations of PM.
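Conceptually, a bivariate polar plot summarises the concentration jointly by wind speed and wind direction. The sketch below shows only that binning idea, with a crude two-sector split on hypothetical values; openair additionally smooths the binned surface, which is not reproduced here.

```r
# Hypothetical hourly records
toy <- data.frame(
  ws   = c(1, 1.5, 6, 7, 2, 6.5),
  wd   = c(200, 220, 210, 230, 20, 10),
  pm10 = c(60, 50, 20, 24, 30, 10)
)

# Bin wind speed and split direction into two broad sectors
toy$speed_bin <- cut(toy$ws, breaks = c(0, 4, 8), labels = c("0-4", "4-8"))
toy$sector    <- ifelse(toy$wd >= 90 & toy$wd < 270, "southerly", "northerly")

# Mean concentration in each (sector, speed bin) cell
grid_means <- tapply(toy$pm10, list(toy$sector, toy$speed_bin), mean)
grid_means
#           0-4 4-8
# northerly  30  10
# southerly  55  22
```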

5.3 The year 2000, daily and monthly PM concentration.

Since we have identified high PM concentrations in the year 2000, let’s present its daily and monthly PM concentrations.

# Year 2000, daily&monthly PM2.5 concentration

 calendarPlot(cleaned_air_quality_dataset, pollutant = "pm25",year = "2000", cols = c("yellow","magenta", "green", "red"),key.header = "Mean of PM2.5",main = " Year 2000, daily and monthly PM2.5 distribution")

# Year 2000, daily&monthly PM10 concentration
 calendarPlot(cleaned_air_quality_dataset, pollutant = "pm10", year = "2000", cols = c("yellow","cyan", "blue", "red"),key.header = "Mean of PM10",main = " Year 2000, daily and monthly PM10 distribution")

PM tends to be concentrated in the last week of the month, especially in January, March, June and July. Lower concentrations occur in the last quarter of 2000, especially on 25, 26 and 27 December 2000.
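The calendar plot colours each day by a daily summary of the hourly values. Assuming the daily mean is the summary of interest, the aggregation can be sketched in base R on a hypothetical two-day series (`hourly` and its values are made up for illustration):

```r
# Hypothetical hourly series covering two full days
hourly <- data.frame(
  date = seq(as.POSIXct("2000-01-01 00:00", tz = "UTC"),
             by = "hour", length.out = 48),
  pm10 = c(rep(20, 24), rep(40, 24))
)

# Collapse the timestamps to calendar days, then average within each day
hourly$day <- as.Date(hourly$date, tz = "UTC")
daily <- aggregate(pm10 ~ day, data = hourly, FUN = mean)
daily
```

The resulting table has one row per day, which is the granularity at which the statements about specific dates should be checked.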

5.4 Seasonal PM concentration

# Seasonal PM2.5 concentration
pollutionRose(cleaned_air_quality_dataset, pollutant = "pm25", key.header = "PM2.5 concentration",cols = c("yellow", "green", "blue", "black", "red"), type = 'season', legend_title = "PM2.5 concentration",legend.title.align = .5, angle = 45, width = 1,grid.line = list(value = 10, lty = 5, col = "purple"),main = "Seasonal concentration of PM2.5")

# Seasonal PM10 concentration
pollutionRose(cleaned_air_quality_dataset, pollutant = "pm10", key.header = "PM10 concentration",cols = c("yellow", "green", "blue", "black", "red"), type = 'season', legend_title = "PM10 concentration",legend.title.align = .5, angle = 45, width = 1,grid.line = list(value = 10, lty = 5, col = "red"),main = "Seasonal concentration of PM10")

The graphs above give concise information on how PM concentration is typically affected by the wind speed and direction prevailing in the area in different seasons. They also show how the percentage of time that the wind blows from a particular direction (the four cardinal directions N, S, W and E are indicated) varies with the PM distribution.

The pollution roses show that most of the time the wind at this site blows from the south-west, as indicated by the long spokes in that direction. The highest wind frequency occurs in summer, which is the south-west dry monsoon wind. This wind concentrates particulate matter in the region and may sometimes bring heavy rain and storms. Such changes in the weather can produce flooding and even raise the wildfire threat, resulting in smog and smoke that are harmful to life in the area.

On the other hand, low concentrations of both PM₂.₅ and PM₁₀ occur in the remaining seasons, with estimated concentrations of \(0-10\) and \(0-20\ \mu g\,m^{-3}\) respectively and reasonably low wind speeds prevailing in each season.

Let’s assess the direct impact of each season, together with the prevailing wind, on the PM concentration in the region:
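Seasonal panels of this kind rely on a mapping from month to season. The sketch below assumes northern-hemisphere meteorological seasons (December to February as winter, and so on); the helper `month_to_season` is hypothetical and only illustrates the grouping.

```r
# Hypothetical helper: meteorological season for the northern hemisphere
month_to_season <- function(date) {
  m <- as.integer(format(date, "%m"))
  lookup <- c("winter", "winter", "spring", "spring", "spring", "summer",
              "summer", "summer", "autumn", "autumn", "autumn", "winter")
  lookup[m]
}

month_to_season(as.Date(c("2000-01-15", "2000-07-15", "2000-12-15")))
# "winter" "summer" "winter"
```

Grouping a pollutant by this label, for example with `tapply()`, gives the seasonal means that the roses display by direction.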

5.5 The weekdays and weekend PM concentration.

# PM2.5 in weekdays and weekend
polarPlot(cleaned_air_quality_dataset, pollutant = "pm25", type = "weekend",cols = c("yellow", "blue", "magenta"),key.header = "Mean of PM25",main = "PM25 distribution in weekdays and weekend")

# PM10 in weekdays and weekend
polarPlot(cleaned_air_quality_dataset, pollutant = "pm10", type = "weekend",cols = c("yellow", "blue", "red"),key.header = "Mean of PM10",main = "PM10 distribution in weekdays and weekend")

The bivariate polar plots above indicate that both fine and coarse particulate matter are highly concentrated in the south and a small part of the north-east during weekdays, due to the high frequency of wind blowing from the north-east. The weekend shows lower PM concentrations for various reasons, including reduced traffic and fewer PM-emitting human activities in the region.
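The weekday and weekend contrast can also be checked numerically. A minimal sketch, assuming Saturday and Sunday count as the weekend; the dates and concentrations are hypothetical:

```r
# Hypothetical daily values; 2000-01-08 and 2000-01-09 fall on a weekend
dates <- as.Date(c("2000-01-07", "2000-01-08", "2000-01-09", "2000-01-10"))
pm10  <- c(50, 30, 28, 52)

# POSIXlt weekday numbers are locale independent: 0 = Sunday, 6 = Saturday
is_weekend <- as.POSIXlt(dates)$wday %in% c(0, 6)
day_type   <- ifelse(is_weekend, "weekend", "weekday")

# Mean concentration for each day type
type_means <- tapply(pm10, day_type, mean)
type_means
# weekday weekend
#      51      29
```

Using the numeric weekday avoids depending on the language of the weekday names returned by `weekdays()`.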

5.6 Daily PM concentration

# Daily PM2.5 concentration
 polarPlot(cleaned_air_quality_dataset, pollutant = "pm25", type = "weekday",cols = c("yellow", "blue", "magenta"),key.header = "Mean of PM2.5",main = " Daily PM2.5 distribution")

# Daily PM10 concentration
 polarPlot(cleaned_air_quality_dataset, pollutant = "pm10", type = "weekday",cols = c("yellow", "blue", "red"),key.header = "Mean of PM10",main = " Daily PM10 distribution")

The plots above indicate that atmospheric particulate matter is more concentrated on weekdays than at the weekend, for reasons including heavier traffic and more intensive human activity on working days.

6 Daytime and Nighttime PM concentration

# Day&nighttime PM2.5 concentration
 polarPlot(cleaned_air_quality_dataset, pollutant = "pm25", type = "daylight",cols = c("yellow", "green", "magenta"),key.header = "Mean of PM2.5",main = " Day and night PM2.5 distribution")

# Day&nighttime PM10 concentration
 polarPlot(cleaned_air_quality_dataset, pollutant = "pm10", type = "daylight",cols = c("yellow", "blue", "red"),key.header = "Mean of PM10",main = " Day and night PM10 distribution")

At this air quality monitoring site, PM concentrations are higher during the daytime than at night, and they are highest when the wind blows from the south.

6.1 Weighted Pearson Correlation between PM₂.₅ and PM₁₀

Below, the Pearson correlation coefficient between the two pollutants is calculated and plotted in order to identify their possible sources based on the strength of their relationship.

# Pearson Correlation btn PM2.5 and PM10

 polarPlot(cleaned_air_quality_dataset, pollutant = c("pm25","pm10"), statistic = "r",cols = c("yellow", "blue", "green", "red"), key.header = "Correlation coefficient", main = " Pearson correlation between PM2.5 and PM10")

The graph above shows that the two pollutants are highly correlated, with a correlation coefficient above \(0.7\). This supports identifying their main sources under the working hypothesis that “highly correlated pollutants originate from the same sources”.

The main sources of particulate matter at the Marylebone site were automobile fuel burning, industrial processes, and windblown dust.
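To make the term "weighted Pearson correlation" concrete, the sketch below computes one with `stats::cov.wt()` from base R. The helper `weighted_cor`, the data and the kernel weights are assumptions for illustration only; they are not the exact weighting scheme that openair applies inside `polarPlot()`.

```r
# Hypothetical helper: Pearson correlation with observation weights
weighted_cor <- function(x, y, w) {
  w <- w / sum(w)  # normalise the weights so that they sum to 1
  stats::cov.wt(cbind(x, y), wt = w, cor = TRUE)$cor[1, 2]
}

# Hypothetical paired concentrations and wind speeds
x  <- c(1, 2, 3, 4, 5)
y  <- c(2.1, 3.9, 6.2, 8.1, 9.8)
ws <- c(1, 2, 3, 4, 5)

# Equal weights reproduce the ordinary Pearson correlation
weighted_cor(x, y, rep(1, 5))

# Kernel weights emphasise observations with wind speed near 3 m/s
w_kernel <- exp(-(ws - 3)^2 / 2)
weighted_cor(x, y, w_kernel)
```

Repeating such a calculation around each wind speed and direction cell is one way to picture how a correlation surface can be built up across a polar plot.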

7 Conclusion

Here we evaluated and presented graphically the impact of weather conditions on air pollutant concentrations, specifically the most harmful pollutant, particulate matter (PM), at the Marylebone air quality monitoring site in the United Kingdom (UK). We plotted the PM concentration over different time periods and found that weather conditions (wind, temperature, precipitation) have significant effects on air pollutant concentrations in each period considered.

We applied the Pearson correlation to measure the strength of the linear relationship between coarse and fine particulate matter, in order to identify and apportion their main sources at the site. Having identified the sources, we strongly recommend serious abatement of air pollutants and advise people, especially vulnerable groups (children, elderly people and those with pulmonary and cardiovascular diseases), to avoid exposure during outdoor physical exertion.

Finally, we expect that machine learning algorithms will soon be widely used to investigate the statistical dependency between weather conditions and atmospheric particulate matter concentration at the Marylebone air quality monitoring site.

Impact of Weather Conditions on Air Pollutants Concentration

Murera Gisa

Kigali, July 30, 2019

