Based on atmospheric quality, road traffic and the levels of the main pollutants, we will try to identify the correlation links between these different parameters.
As part of the Open Data Policy, the City of Zurich has made a number of documents available for data scientists. Three of which have caught our attention. It is the air quality measurements that I would like to compare with the weather and the traffic data. As it happens, the sensors used are all in one place, which is very convenient for this analysis.
The first data sets1 are the air quality measurements taken every hour since 1983. They contain measures of ozone (O3), nitrogen oxides (NOx), nitrogen monoxide (NO), nitrogen dioxide (NO2), particulate matter (PM10 and PM2.5), carbon monoxide (CO) and sulfur dioxide (SO2).
The second dataset2 contains hourly weather records since 1992 at the same places. It measures air pressure (p), precipitation duration (RainDur), global radiation (StrGlo), temperature (T), relative humidity (Hr), wind direction, vector and scalar wind speed. These data will allow us to see if there is a correlation between pollutant concentration and the meteorologic events.
The third dataset3 contains the hourly traffic on the Stampfenbachstrasse street since 2007. Some of the weather and air quality measurements are also done on this street, which will allow us to analyze geographically consistent samples.
The questions we will try to answer are the following:
To find these answers, we will collect the hourly data. We will establish the level of granularity (hourly, daily, monthly) best suited to its analysis. The result must be readable, interpretable, complete and not contain too many material resources to be produced.
We then establish correlations between parameters and then observe the joint evolution of correlated parameters over the period.
First we load the necessary packages:
To begin, we will analyse the 2019 data:
year <- "2012"
years <- c(2007,2008)
# year <- "2019"
type <- ".csv"
filename <- "ugz_ogd_air_h1_"
url_path <- "https://data.stadt-zuerich.ch/dataset/ugz_luftschadstoffmessung_stundenwerte/download/"
fileUrl <- paste0(url_path, filename, year, type, sep = "", recycle0 = TRUE)
locfilename <- paste0("./data/", filename, year, type, sep = "", recycle0 = TRUE)
m_filename <- "ugz_ogd_meteo_h1_"
m_url_path <- "https://data.stadt-zuerich.ch/dataset/ugz_meteodaten_stundenmittelwerte/download/"
m_fileUrl <- paste0(m_url_path, m_filename, year, type, sep = "", recycle0 = TRUE)
m_locfilename <- paste0("./data/", m_filename, year, type, sep = "", recycle0 = TRUE)
# Traffic informations since 2007
t_filename <- "ugz_ogd_traffic_h1_"
t_url_path <- "https://data.stadt-zuerich.ch/dataset/ugz_verkehrsdaten_stundenwerte_stampfenbachstrasse/download/"
t_fileUrl <- paste0(t_url_path, t_filename, year, type, sep = "", recycle0 = TRUE)
t_locfilename <- paste0("./data/", t_filename, year, type, sep = "", recycle0 = TRUE)
Download the first files:
# Collect and access the data from internet:
if(!exists("./data")){dir.create("./data")}
if(!exists(locfilename)){
download.file(fileUrl, destfile = locfilename, method = "curl")}
if(!exists(m_locfilename)){
download.file(m_fileUrl, destfile = m_locfilename, method = "curl")}
if(!exists(t_locfilename)){
download.file(t_fileUrl, destfile = t_locfilename, method = "curl")}
Create the data frame:
DFPol <- read.csv(locfilename)
dim(DFPol)
## [1] 140544 7
DFMet <- read.csv(m_locfilename)
dim(DFMet)
## [1] 122976 7
DFTra <- read.csv(t_locfilename)
dim(DFTra)
## [1] 52704 9
str(DFPol)
## 'data.frame': 140544 obs. of 7 variables:
## $ Datum : chr "2012-01-01T00:00+0100" "2012-01-01T00:00+0100" "2012-01-01T00:00+0100" "2012-01-01T00:00+0100" ...
## $ Standort : chr "Zch_Heubeeribüel" "Zch_Heubeeribüel" "Zch_Heubeeribüel" "Zch_Heubeeribüel" ...
## $ Parameter: chr "NO2" "NO" "NOx" "O3" ...
## $ Intervall: chr "h1" "h1" "h1" "h1" ...
## $ Einheit : chr "µg/m3" "µg/m3" "ppb" "µg/m3" ...
## $ Wert : num 14.31 1.06 8.33 36.52 39.96 ...
## $ Status : chr "bereinigt" "bereinigt" "bereinigt" "bereinigt" ...
str(DFMet)
## 'data.frame': 122976 obs. of 7 variables:
## $ Datum : chr "2012-01-01T00:00+0100" "2012-01-01T00:00+0100" "2012-01-01T00:00+0100" "2012-01-01T00:00+0100" ...
## $ Standort : chr "Zch_Schimmelstrasse" "Zch_Schimmelstrasse" "Zch_Schimmelstrasse" "Zch_Schimmelstrasse" ...
## $ Parameter: chr "Hr" "RainDur" "T" "WD" ...
## $ Intervall: chr "h1" "h1" "h1" "h1" ...
## $ Einheit : chr "%Hr" "min" "°C" "°" ...
## $ Wert : num 90.84 6.27 7.92 39.67 0.35 ...
## $ Status : chr "bereinigt" "bereinigt" "bereinigt" "bereinigt" ...
str(DFTra)
## 'data.frame': 52704 obs. of 9 variables:
## $ Datum : chr "2012-01-01T00:00+0100" "2012-01-01T00:00+0100" "2012-01-01T00:00+0100" "2012-01-01T00:00+0100" ...
## $ Standort : chr "Zch_Stampfenbachstrasse" "Zch_Stampfenbachstrasse" "Zch_Stampfenbachstrasse" "Zch_Stampfenbachstrasse" ...
## $ Richtung : chr "Schaffhauserplatz" "Schaffhauserplatz" "Schaffhauserplatz" "Stampfenbachplatz" ...
## $ Spur : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Klasse.ID : int 101 102 103 101 102 103 101 102 103 101 ...
## $ Klasse.Text: chr "Zweirad" "Personenwagen" "Lastwagen" "Zweirad" ...
## $ Intervall : chr "h1" "h1" "h1" "h1" ...
## $ Anzahl : int NA NA NA 4 104 2 NA NA NA 3 ...
## $ Status : chr "provisorisch" "provisorisch" "provisorisch" "provisorisch" ...
The first two data sets (DFPol and DFMet) are very convenient. They have the same data structure with the same parameters.
Here is the list of the unique Parameter values:
unique(DFPol$Parameter)
## [1] "NO2" "NO" "NOx" "O3" "PM10" "CO" "SO2"
unique(DFMet$Parameter)
## [1] "Hr" "RainDur" "T" "WD" "WVv" "p" "WVs"
## [8] "StrGlo"
Datum is the date and hour of the measurement
Parameter gives the type of measurement:
For the air quality:
O3 : Ozone
NOx: nitrogen oxides
NO: nitrogen monoxide
NO2: nitrogen dioxide
PM10 and PM2.5: particulate matter
CO: carbon monoxide
SO2: sulphur dioxide
For the weather measurements:
p: air pressure
RainDur: precipitation duration
StrGlo: global radiation
T: temperature
Hr: relative humidity
WD: wind direction
WVs: scalar wind speed
WVv: vector wind speed
Intervall is the time intervals between each measurement. In both cases, it will always be one hour. So we will ignore it.
Einheit is the unit used.
Wert is the numeric value measured.
Status states the quality of the data, we will ignore it.
The common streets in the two data frames are :
intersect(DFPol$Standort, DFMet$Standort)
## [1] "Zch_Schimmelstrasse" "Zch_Stampfenbachstrasse"
The traffic dataset contains data dispatched in several columns. We will sum up all vehicle classes, all categories together, under a “traffic” parameter that we will then insert later into the merged table of pollution and weather data.
head(DFTra)
## Datum Standort Richtung Spur
## 1 2012-01-01T00:00+0100 Zch_Stampfenbachstrasse Schaffhauserplatz 0
## 2 2012-01-01T00:00+0100 Zch_Stampfenbachstrasse Schaffhauserplatz 0
## 3 2012-01-01T00:00+0100 Zch_Stampfenbachstrasse Schaffhauserplatz 0
## 4 2012-01-01T00:00+0100 Zch_Stampfenbachstrasse Stampfenbachplatz 0
## 5 2012-01-01T00:00+0100 Zch_Stampfenbachstrasse Stampfenbachplatz 0
## 6 2012-01-01T00:00+0100 Zch_Stampfenbachstrasse Stampfenbachplatz 0
## Klasse.ID Klasse.Text Intervall Anzahl Status
## 1 101 Zweirad h1 NA provisorisch
## 2 102 Personenwagen h1 NA provisorisch
## 3 103 Lastwagen h1 NA provisorisch
## 4 101 Zweirad h1 4 provisorisch
## 5 102 Personenwagen h1 104 provisorisch
## 6 103 Lastwagen h1 2 provisorisch
The “Datum” parameter is the same format as those of the Air quality and weather data sets.
Let’s have a look at the other potentially factor columns:
unique(DFTra$Klasse.ID)
## [1] 101 102 103
unique(DFTra$Klasse.Text)
## [1] "Zweirad" "Personenwagen" "Lastwagen"
unique(DFTra$Intervall)
## [1] "h1"
unique(DFTra$Status)
## [1] "provisorisch"
The “Standort” (place), “Richtung” (direction), “Spur” (lane), “Klasse.ID” (vehicle class id), “Klasse.Text” (vehicle class), “Intervall” (time intervals. Always hourly in thess datasets) and “Status” will be ignored because they have unique values or are classifications that we don’t need in this specific context.
The class of the vehicles (Klasse.Text and Klasse.ID) will not be used in this analysis and will be consolidated.
To facilitate the use of the weather data, we will replace the acronyms with full words (in German, if you don’t mind).
DFMet[DFMet$Parameter == "p", ]$Parameter <- "Luftdruck"
DFMet[DFMet$Parameter == "RainDur", ]$Parameter <- "Niederschlagsdauer"
DFMet[DFMet$Parameter == "StrGlo", ]$Parameter <- "Globalstrahlung"
DFMet[DFMet$Parameter == "T", ]$Parameter <- "Temperatur"
DFMet[DFMet$Parameter == "Hr", ]$Parameter <- "Luftfeuchtigkeit"
DFMet[DFMet$Parameter == "WD", ]$Parameter <- "Windrichtung"
DFMet[DFMet$Parameter == "WVv", ]$Parameter <- "Vector Windgeschwindigkeit"
DFMet[DFMet$Parameter == "WVs", ]$Parameter <- "Skalar Windgeschwindigkeit"
We only keep the necessary columns:
DFMet <- DFMet[, c(1,2,3,5,6)]
DFPol <- DFPol[, c(1,2,3,5,6)]
DFTra1 <- DFTra[, c(1,8)]
To reduce the volume of data, we will group it by day and keep the mean of the values:
DFPol1 <- DFPol
DFMet1 <- DFMet
# I delete the hours to keep only the date in Datum
DFPol1$Datum <- as.Date(substr(DFPol1$Datum, 1, 10), format = "%Y-%m-%d")
DFMet1$Datum <- as.Date(substr(DFMet1$Datum, 1, 10), format = "%Y-%m-%d")
DFTra1$Datum <- as.Date(substr(DFTra1$Datum, 1, 10), format = "%Y-%m-%d")
# I delete the NA created
DFPol1 <- DFPol1[!is.na(DFPol1$Wert), ]
DFMet1 <- DFMet1[!is.na(DFMet1$Wert), ]
DFTra1 <- DFTra1[!is.na(DFTra1$Anzahl), ]
# Then group the data per day:
PolbyDay <- DFPol1 %>% group_by(Datum, Standort, Parameter) %>% summarise(total = mean(Wert, na.rm=T))
MetbyDay <- DFMet1 %>% group_by(Datum, Standort, Parameter) %>% summarise(total = mean(Wert, na.rm=T))
head(PolbyDay)
## # A tibble: 6 × 4
## # Groups: Datum, Standort [2]
## Datum Standort Parameter total
## <date> <chr> <chr> <dbl>
## 1 2012-01-01 Zch_Heubeeribüel NO 1.62
## 2 2012-01-01 Zch_Heubeeribüel NO2 12.7
## 3 2012-01-01 Zch_Heubeeribüel NOx 7.94
## 4 2012-01-01 Zch_Heubeeribüel O3 35.3
## 5 2012-01-01 Zch_Schimmelstrasse NO 10.4
## 6 2012-01-01 Zch_Schimmelstrasse NO2 25.5
head(MetbyDay)
## # A tibble: 6 × 4
## # Groups: Datum, Standort [1]
## Datum Standort Parameter total
## <date> <chr> <chr> <dbl>
## 1 2012-01-01 Zch_Schimmelstrasse Luftdruck 972.
## 2 2012-01-01 Zch_Schimmelstrasse Luftfeuchtigkeit 82.5
## 3 2012-01-01 Zch_Schimmelstrasse Niederschlagsdauer 0.426
## 4 2012-01-01 Zch_Schimmelstrasse Temperatur 8.84
## 5 2012-01-01 Zch_Schimmelstrasse Vector Windgeschwindigkeit 0.738
## 6 2012-01-01 Zch_Schimmelstrasse Windrichtung 143.
TrabyDay <- DFTra1 %>% group_by(Datum, Parameter = "Traffic") %>% summarise(total = sum(Anzahl, na.rm=T))
head(TrabyDay)
## # A tibble: 6 × 3
## # Groups: Datum [6]
## Datum Parameter total
## <date> <chr> <int>
## 1 2012-01-01 Traffic 2075
## 2 2012-01-02 Traffic 1843
## 3 2012-01-03 Traffic 3273
## 4 2012-01-04 Traffic 3485
## 5 2012-01-05 Traffic 3580
## 6 2012-01-06 Traffic 3909
To save some space in memory, we will keep only one of the 3 common locations:
length(unique(PolbyDay[PolbyDay$Standort == "Zch_Schimmelstrasse", ]$Parameter))
## [1] 5
length(unique(PolbyDay[PolbyDay$Standort == "Zch_Stampfenbachstrasse", ]$Parameter)) # winner
## [1] 7
length(unique(PolbyDay[PolbyDay$Standort == "Zch_Heubeeribüel", ]$Parameter))
## [1] 4
length(unique(MetbyDay[MetbyDay$Standort == "Zch_Schimmelstrasse", ]$Parameter))
## [1] 6
length(unique(MetbyDay[MetbyDay$Standort == "Zch_Stampfenbachstrasse", ]$Parameter)) # winner
## [1] 8
length(unique(MetbyDay[MetbyDay$Standort == "Zch_Heubeeribüel", ]$Parameter))
## [1] 0
# This location is the one that gets the type of measurements in 2019
# Which is good because it's the street where the traffic was measured since 2007
PolbyDay <- PolbyDay[PolbyDay$Standort == "Zch_Stampfenbachstrasse", ]
MetbyDay <- MetbyDay[MetbyDay$Standort == "Zch_Stampfenbachstrasse", ]
Now the column “Standort” is of no use:
PolbyDay <- PolbyDay[, c(1,3,4)]
MetbyDay <- MetbyDay[, c(1,3,4)]
It will be useful to know which data are related to air quality and which are weather information:
PolbyDay$type <- "Pollution"
MetbyDay$type <- "Meteo"
TrabyDay$type <- "Traffic"
And we regroup these 3 data frames in one unique data frame:
byDay <- rbind(PolbyDay, MetbyDay, TrabyDay)
byDay <- byDay[order(byDay$Datum, byDay$Parameter),]
head(byDay, 20)
## # A tibble: 20 × 4
## # Groups: Datum [2]
## Datum Parameter total type
## <date> <chr> <dbl> <chr>
## 1 2012-01-01 CO 0.312 Pollution
## 2 2012-01-01 Globalstrahlung 39.6 Meteo
## 3 2012-01-01 Luftdruck 969. Meteo
## 4 2012-01-01 Luftfeuchtigkeit 86.9 Meteo
## 5 2012-01-01 Niederschlagsdauer 3.98 Meteo
## 6 2012-01-01 NO 7.83 Pollution
## 7 2012-01-01 NO2 26.5 Pollution
## 8 2012-01-01 NOx 20.2 Pollution
## 9 2012-01-01 O3 22.2 Pollution
## 10 2012-01-01 PM10 18.0 Pollution
## 11 2012-01-01 Skalar Windgeschwindigkeit 2.02 Meteo
## 12 2012-01-01 SO2 1.78 Pollution
## 13 2012-01-01 Temperatur 8.12 Meteo
## 14 2012-01-01 Traffic 2075 Traffic
## 15 2012-01-01 Vector Windgeschwindigkeit 1.91 Meteo
## 16 2012-01-01 Windrichtung 195. Meteo
## 17 2012-01-02 CO 0.236 Pollution
## 18 2012-01-02 Globalstrahlung 12.6 Meteo
## 19 2012-01-02 Luftdruck 965. Meteo
## 20 2012-01-02 Luftfeuchtigkeit 83.9 Meteo
unique(byDay$Parameter)
## [1] "CO" "Globalstrahlung"
## [3] "Luftdruck" "Luftfeuchtigkeit"
## [5] "Niederschlagsdauer" "NO"
## [7] "NO2" "NOx"
## [9] "O3" "PM10"
## [11] "Skalar Windgeschwindigkeit" "SO2"
## [13] "Temperatur" "Traffic"
## [15] "Vector Windgeschwindigkeit" "Windrichtung"
tail(MetbyDay, 15)
## # A tibble: 15 × 4
## # Groups: Datum [2]
## Datum Parameter total type
## <date> <chr> <dbl> <chr>
## 1 2012-12-30 Luftdruck 972. Meteo
## 2 2012-12-30 Luftfeuchtigkeit 60.2 Meteo
## 3 2012-12-30 Niederschlagsdauer 3.00 Meteo
## 4 2012-12-30 Skalar Windgeschwindigkeit 2.83 Meteo
## 5 2012-12-30 Temperatur 7.72 Meteo
## 6 2012-12-30 Vector Windgeschwindigkeit 2.79 Meteo
## 7 2012-12-30 Windrichtung 226. Meteo
## 8 2012-12-31 Globalstrahlung 61.7 Meteo
## 9 2012-12-31 Luftdruck 970. Meteo
## 10 2012-12-31 Luftfeuchtigkeit 69.7 Meteo
## 11 2012-12-31 Niederschlagsdauer 0 Meteo
## 12 2012-12-31 Skalar Windgeschwindigkeit 1.62 Meteo
## 13 2012-12-31 Temperatur 5.14 Meteo
## 14 2012-12-31 Vector Windgeschwindigkeit 1.59 Meteo
## 15 2012-12-31 Windrichtung 183. Meteo
Now let’s group data per month, and keep the mean of the daily “Wert“ value:
byMonth <- byDay %>% mutate(Monat = format(Datum, "%Y-%m")) %>% mutate(Monat = paste0(Monat, "-01")) %>% group_by(Monat, Parameter, type) %>% summarise(total = mean(total, na.rm=T))
byMonth$Monat <- factor(byMonth$Monat)
byMonth$Parameter <- factor(byMonth$Parameter)
head(byMonth, 20)
## # A tibble: 20 × 4
## # Groups: Monat, Parameter [20]
## Monat Parameter type total
## <fct> <fct> <chr> <dbl>
## 1 2012-01-01 CO Pollution 0.393
## 2 2012-01-01 Globalstrahlung Meteo 37.5
## 3 2012-01-01 Luftdruck Meteo 972.
## 4 2012-01-01 Luftfeuchtigkeit Meteo 78.6
## 5 2012-01-01 Niederschlagsdauer Meteo 12.2
## 6 2012-01-01 NO Pollution 22.1
## 7 2012-01-01 NO2 Pollution 32.6
## 8 2012-01-01 NOx Pollution 34.7
## 9 2012-01-01 O3 Pollution 30.9
## 10 2012-01-01 PM10 Pollution 18.0
## 11 2012-01-01 Skalar Windgeschwindigkeit Meteo 2.37
## 12 2012-01-01 SO2 Pollution 2.37
## 13 2012-01-01 Temperatur Meteo 3.52
## 14 2012-01-01 Traffic Traffic 3723.
## 15 2012-01-01 Vector Windgeschwindigkeit Meteo 2.35
## 16 2012-01-01 Windrichtung Meteo 176.
## 17 2012-02-01 CO Pollution 0.532
## 18 2012-02-01 Globalstrahlung Meteo 86.5
## 19 2012-02-01 Luftdruck Meteo 974.
## 20 2012-02-01 Luftfeuchtigkeit Meteo 73.1
First, let’s get rid of NA’s in byDay and byMonth:
anyNA(byDay)
## [1] FALSE
anyNA(byMonth)
## [1] FALSE
Our data per month and day are messy. The column parameter contains data that should be contained in separate columns to conduct a clustering analysis:
byColDay <- byDay[,-c(4)] %>% spread(Parameter, total)
# Reorder columns, so the result will be more convenient (1. Air quality, then weather and traffic columns)
byColDay <- byColDay[,c(1, 2, 7, 8,9,10,11,13,3,4,5,6,12,14,16,17,15)]
# Actually I have NA's after this operation, so:
na.omit(byColDay)
## # A tibble: 358 × 17
## # Groups: Datum [358]
## Datum CO NO NO2 NOx O3 PM10 SO2 Global…¹ Luftd…² Luftf…³
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2012-01-01 0.312 7.83 26.5 20.2 22.2 18.0 1.78 39.6 969. 86.9
## 2 2012-01-02 0.236 3.51 17.9 12.2 36.1 5.34 1.64 12.6 965. 83.9
## 3 2012-01-03 0.245 5.85 19.6 15.0 48.9 6.92 1.40 63.4 973. 67.2
## 4 2012-01-04 0.216 2.59 13.3 9.00 59.5 7.49 1.11 43.0 971. 69.0
## 5 2012-01-05 0.205 2.23 10.1 7.08 62.6 4.92 0.902 11.2 957. 75.6
## 6 2012-01-06 0.239 3.20 17.7 11.8 46.1 7.11 1.10 18.6 968. 82.4
## 7 2012-01-07 0.236 2.36 13.2 8.81 47.5 7.42 1.15 11.5 973. 76.8
## 8 2012-01-08 0.226 2.35 13.9 9.17 48.9 8.47 1.26 13.2 973. 82.8
## 9 2012-01-09 0.332 15.4 30.9 28.5 32.2 14.6 1.59 29.2 979. 78.1
## 10 2012-01-10 0.539 43.5 46.1 59.0 17.7 18.9 2.28 35.5 982. 82.4
## # … with 348 more rows, 6 more variables: Niederschlagsdauer <dbl>,
## # `Skalar Windgeschwindigkeit` <dbl>, Temperatur <dbl>,
## # `Vector Windgeschwindigkeit` <dbl>, Windrichtung <dbl>, Traffic <dbl>, and
## # abbreviated variable names ¹Globalstrahlung, ²Luftdruck, ³Luftfeuchtigkeit
byColMonth <- byMonth[,-c(3)] %>% spread(Parameter, total)
byColMonth <- byColMonth[,c(1, 2, 7, 8,9,10,11,13,3,4,5,6,12,14,16,17,15)]
na.omit(byColMonth)
## # A tibble: 12 × 17
## # Groups: Monat [12]
## Monat CO NO NO2 NOx O3 PM10 SO2 Global…¹ Luftd…² Luftf…³
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2012-01-01 0.393 22.1 32.6 34.7 30.9 18.0 2.37 37.5 972. 78.6
## 2 2012-02-01 0.532 24.9 49.8 46.0 26.6 39.3 5.75 86.5 974. 73.1
## 3 2012-03-01 0.464 26.3 47.0 45.6 35.5 28.8 3.57 163. 975. 59.5
## 4 2012-04-01 0.316 11.9 29.1 24.8 57.4 14.3 2.30 152. 956. 65.9
## 5 2012-05-01 0.299 10.3 26.9 22.4 68.4 15.2 1.38 236. 966. 61.8
## 6 2012-06-01 0.268 9.52 24.4 20.4 61.0 16.4 1.24 228. 965. 67.7
## 7 2012-07-01 0.265 9.69 23.8 20.2 57.2 15.3 1.31 213. 967. 67.2
## 8 2012-08-01 0.339 10.4 27.0 22.5 65.2 17.9 0.829 217. 968. 65.4
## 9 2012-09-01 0.394 18.6 33.9 32.6 39.2 17.9 2.09 136. 967. 76.1
## 10 2012-10-01 0.443 30.3 36.8 43.5 20.0 18.2 3.21 79.9 963. 82.1
## 11 2012-11-01 0.443 35.4 40.9 49.8 15.4 20.8 2.04 47.4 964. 82.7
## 12 2012-12-01 0.364 20.5 36.5 35.5 31.2 15.3 2.01 33.4 965. 80.4
## # … with 6 more variables: Niederschlagsdauer <dbl>,
## # `Skalar Windgeschwindigkeit` <dbl>, Temperatur <dbl>,
## # `Vector Windgeschwindigkeit` <dbl>, Windrichtung <dbl>, Traffic <dbl>, and
## # abbreviated variable names ¹Globalstrahlung, ²Luftdruck, ³Luftfeuchtigkeit
names(byColDay)
## [1] "Datum" "CO"
## [3] "NO" "NO2"
## [5] "NOx" "O3"
## [7] "PM10" "SO2"
## [9] "Globalstrahlung" "Luftdruck"
## [11] "Luftfeuchtigkeit" "Niederschlagsdauer"
## [13] "Skalar Windgeschwindigkeit" "Temperatur"
## [15] "Vector Windgeschwindigkeit" "Windrichtung"
## [17] "Traffic"
names(byColMonth)
## [1] "Monat" "CO"
## [3] "NO" "NO2"
## [5] "NOx" "O3"
## [7] "PM10" "SO2"
## [9] "Globalstrahlung" "Luftdruck"
## [11] "Luftfeuchtigkeit" "Niederschlagsdauer"
## [13] "Skalar Windgeschwindigkeit" "Temperatur"
## [15] "Vector Windgeschwindigkeit" "Windrichtung"
## [17] "Traffic"
Note: Traffic figures were not recorded during the construction work on Stampfenbachstrasse in 2018 and 2019.
So if year is 2018 or 2019, we will have to clean NAs, because traffic was no measured during several months:
# Cleaning the NA requires to put the Parameter data in columns
if(year == "2019" | year == "2018") {
# We delete NA lines
byColDay <- byColDay[!is.na(byColDay$Traffic),]
byColMonth <- byColMonth[!is.na(byColMonth$Traffic),]
} else {
print("These years do not need NA deleting!")
}
## [1] "These years do not need NA deleting!"
dim(byColDay)
## [1] 366 17
dim(byColMonth)
## [1] 12 17
We must transform the data into a matrix in order to be able to calculate the svd derivated matrix. Then we will display a heatmap of the datamatrix and plot the variance graphs so we will see if some parameters explain significantly some variations.
# Transforms the data frame into a matrix
dataMatrix <- data.matrix(byColMonth[,-1])
hh <- hclust(dist(dataMatrix)) # Find the diagonal matrix
# str(hh)
dataMatrixOrdered <- dataMatrix[hh$order,]
svd1 <- svd(scale(dataMatrixOrdered))
#str(svd1)
par(mfrow=c(1,3))
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1], main="Heatmap")
plot(colMeans(dataMatrixOrdered),xlab="Column index",ylab="Column Mean",pch=19, main="Mean per column")
plot((svd1$d^2/sum(svd1$d^2)), xlab = "Columns", ylab="Prop. of variance explained", pch=19, main="Variance explained")
The 3 first columns explain around 80% of the variations.
What are the 3 first columns?
names(byColMonth[,c(9,10,11,15,16)])
## [1] "Globalstrahlung" "Luftdruck"
## [3] "Luftfeuchtigkeit" "Vector Windgeschwindigkeit"
## [5] "Windrichtung"
The 6 more influential data are 2 pollution measurement and 4 weather. The same matrix with one significant pollutant only, amid all the weather information would help to identify wich connection he has with the weather phenomena:
The very useful getSvdMostInfluential4 function will clarify that:
if("Gmisc" %in% rownames(installed.packages()) == FALSE) {
install.packages("Gmisc")
}
library(Gmisc)
## Le chargement a nécessité le package : Rcpp
## Le chargement a nécessité le package : htmlTable
getSvdMostInfluential(dataMatrix,
quantile=.8,
similarity_threshold = .9,
plot_threshold = .05,
plot_selection = TRUE)
Let’s see if the daily measurements have the same results:
# Transforms the data frame into a matrix
# byColDay1 <- byColDay[,c(1,2,13,3,16,5,6,7,8,9,10,11,12,14,15,17,4)]
dataMatrix_day <- data.matrix(byColDay[,-c(1)])
hhd <- hclust(dist(dataMatrix_day)) #Maintenant je calcule la matrice diagonale
# str(hhd)
dataMatrix_dayOrdered <- na.omit(dataMatrix_day[hhd$order,])
svd1_day <- svd(scale(dataMatrix_dayOrdered))
#str(svd1_day)
par(mfrow=c(1,3))
image(t(dataMatrix_dayOrdered)[,nrow(dataMatrix_dayOrdered):1], main="Heatmap")
plot(colMeans(dataMatrix_dayOrdered),xlab="Column index",ylab="Column Mean",pch=19, main="Mean per column")
plot((svd1_day$d^2/sum(svd1_day$d^2)), xlab = "Columns per ratio", ylab="Prop. of variance explained", pch=19, main="Variance explained")
There are no major difference between the daily and monthly values, but we see that the variance explained gives more importance to the 6 first values than the monthly values.
This does not help to identify correlations between weather or traffic information and the air quality, because the maximum contributors are pollutants that are together, of course, more present when the air quality is low. These only shows that when the atmosphere is more polluted, there are more pollutants in it…
library(Hmisc)
library(lattice)
library(survival)
library(Formula)
# cor(byColDay[,-c(1,2)], method=c("pearson", "kendall", "spearman"))
Maybe the correlation ratios can help to find relationships between weather or traffic and some pollutants:
The rcorr function will allow us to measure the correlation between each pair in the matrix. We exclude the first columns containing the dates.
rcorr(as.matrix(byColDay[,-c(1)]), type="pearson")$r[-c(1,2,3,4,5,6,7), -c(8,9,10,11,12,13,14,15,16,17)]
## CO NO NO2 NOx
## Globalstrahlung -0.2031445 -0.20857820 -0.16861371 -0.20319557
## Luftdruck 0.3265544 0.19185046 0.27722293 0.23118433
## Luftfeuchtigkeit 0.1704773 0.22549334 0.10565246 0.19188523
## Niederschlagsdauer -0.2671892 -0.19136108 -0.20622130 -0.20510591
## Skalar Windgeschwindigkeit -0.4146066 -0.40432166 -0.48954362 -0.45287751
## Temperatur -0.3958797 -0.31427284 -0.37381742 -0.34958647
## Vector Windgeschwindigkeit -0.4072990 -0.39865464 -0.48379332 -0.44693282
## Windrichtung -0.4134102 -0.26012772 -0.31323000 -0.29074009
## Traffic -0.1081340 0.07972877 0.01324629 0.05912409
## O3 PM10 SO2
## Globalstrahlung 0.616125016 0.004252051 -0.18239016
## Luftdruck -0.208863123 0.381974768 0.27339067
## Luftfeuchtigkeit -0.601980232 -0.093581899 -0.04179275
## Niederschlagsdauer -0.003143124 -0.418879840 -0.24677663
## Skalar Windgeschwindigkeit 0.294992967 -0.260309200 -0.05989382
## Temperatur 0.581125732 -0.275044396 -0.58877054
## Vector Windgeschwindigkeit 0.286626022 -0.255867898 -0.05482406
## Windrichtung 0.216714974 -0.356378953 -0.31888024
## Traffic 0.137477810 -0.144623818 -0.33958424
The columns shows the various measured pollutants, and the rows the weather and traffic data.
This matrix displays the r pearson correlation coefficients of the values in our matrix. It measures the strength (0 to 1) and the direction of the relationship between the variables (a positive value for variables moving in the same direction, a negative value if they act opposite).
res <- rcorr(as.matrix(byColDay[,-c(1)]), type="pearson")
flattened <- flattenCorrMatrix(res$r, res$P)
flattened
## row column cor p
## 1 CO NO 0.85 0.00
## 2 CO NO2 0.90 0.00
## 3 NO NO2 0.82 0.00
## 4 CO NOx 0.90 0.00
## 5 NO NOx 0.98 0.00
## 6 NO2 NOx 0.92 0.00
## 7 CO O3 -0.70 0.00
## 8 NO O3 -0.71 0.00
## 9 NO2 O3 -0.68 0.00
## 10 NOx O3 -0.73 0.00
## 11 CO PM10 0.77 0.00
## 12 NO PM10 0.51 0.00
## 13 NO2 PM10 0.68 0.00
## 14 NOx PM10 0.60 0.00
## 15 O3 PM10 -0.36 0.00
## 16 CO SO2 0.67 0.00
## 17 NO SO2 0.47 0.00
## 18 NO2 SO2 0.61 0.00
## 19 NOx SO2 0.54 0.00
## 20 O3 SO2 -0.43 0.00
## 21 PM10 SO2 0.68 0.00
## 22 CO Globalstrahlung -0.20 0.00
## 23 NO Globalstrahlung -0.21 0.00
## 24 NO2 Globalstrahlung -0.17 0.00
## 25 NOx Globalstrahlung -0.20 0.00
## 26 O3 Globalstrahlung 0.62 0.00
## 27 PM10 Globalstrahlung 0.00 0.94
## 28 SO2 Globalstrahlung -0.18 0.00
## 29 CO Luftdruck 0.33 0.00
## 30 NO Luftdruck 0.19 0.00
## 31 NO2 Luftdruck 0.28 0.00
## 32 NOx Luftdruck 0.23 0.00
## 33 O3 Luftdruck -0.21 0.00
## 34 PM10 Luftdruck 0.38 0.00
## 35 SO2 Luftdruck 0.27 0.00
## 36 Globalstrahlung Luftdruck 0.04 0.44
## 37 CO Luftfeuchtigkeit 0.17 0.00
## 38 NO Luftfeuchtigkeit 0.23 0.00
## 39 NO2 Luftfeuchtigkeit 0.11 0.04
## 40 NOx Luftfeuchtigkeit 0.19 0.00
## 41 O3 Luftfeuchtigkeit -0.60 0.00
## 42 PM10 Luftfeuchtigkeit -0.09 0.08
## 43 SO2 Luftfeuchtigkeit -0.04 0.43
## 44 Globalstrahlung Luftfeuchtigkeit -0.78 0.00
## 45 Luftdruck Luftfeuchtigkeit -0.14 0.01
## 46 CO Niederschlagsdauer -0.27 0.00
## 47 NO Niederschlagsdauer -0.19 0.00
## 48 NO2 Niederschlagsdauer -0.21 0.00
## 49 NOx Niederschlagsdauer -0.21 0.00
## 50 O3 Niederschlagsdauer 0.00 0.95
## 51 PM10 Niederschlagsdauer -0.42 0.00
## 52 SO2 Niederschlagsdauer -0.25 0.00
## 53 Globalstrahlung Niederschlagsdauer -0.48 0.00
## 54 Luftdruck Niederschlagsdauer -0.32 0.00
## 55 Luftfeuchtigkeit Niederschlagsdauer 0.54 0.00
## 56 CO Skalar Windgeschwindigkeit -0.41 0.00
## 57 NO Skalar Windgeschwindigkeit -0.40 0.00
## 58 NO2 Skalar Windgeschwindigkeit -0.49 0.00
## 59 NOx Skalar Windgeschwindigkeit -0.45 0.00
## 60 O3 Skalar Windgeschwindigkeit 0.29 0.00
## 61 PM10 Skalar Windgeschwindigkeit -0.26 0.00
## 62 SO2 Skalar Windgeschwindigkeit -0.06 0.25
## 63 Globalstrahlung Skalar Windgeschwindigkeit -0.19 0.00
## 64 Luftdruck Skalar Windgeschwindigkeit -0.11 0.04
## 65 Luftfeuchtigkeit Skalar Windgeschwindigkeit -0.03 0.51
## 66 Niederschlagsdauer Skalar Windgeschwindigkeit 0.23 0.00
## 67 CO Temperatur -0.40 0.00
## 68 NO Temperatur -0.31 0.00
## 69 NO2 Temperatur -0.37 0.00
## 70 NOx Temperatur -0.35 0.00
## 71 O3 Temperatur 0.58 0.00
## 72 PM10 Temperatur -0.28 0.00
## 73 SO2 Temperatur -0.59 0.00
## 74 Globalstrahlung Temperatur 0.68 0.00
## 75 Luftdruck Temperatur -0.16 0.00
## 76 Luftfeuchtigkeit Temperatur -0.42 0.00
## 77 Niederschlagsdauer Temperatur -0.16 0.00
## 78 Skalar Windgeschwindigkeit Temperatur -0.23 0.00
## 79 CO Vector Windgeschwindigkeit -0.41 0.00
## 80 NO Vector Windgeschwindigkeit -0.40 0.00
## 81 NO2 Vector Windgeschwindigkeit -0.48 0.00
## 82 NOx Vector Windgeschwindigkeit -0.45 0.00
## 83 O3 Vector Windgeschwindigkeit 0.29 0.00
## 84 PM10 Vector Windgeschwindigkeit -0.26 0.00
## 85 SO2 Vector Windgeschwindigkeit -0.05 0.30
## 86 Globalstrahlung Vector Windgeschwindigkeit -0.19 0.00
## 87 Luftdruck Vector Windgeschwindigkeit -0.10 0.06
## 88 Luftfeuchtigkeit Vector Windgeschwindigkeit -0.03 0.54
## 89 Niederschlagsdauer Vector Windgeschwindigkeit 0.22 0.00
## 90 Skalar Windgeschwindigkeit Vector Windgeschwindigkeit 1.00 0.00
## 91 Temperatur Vector Windgeschwindigkeit -0.24 0.00
## 92 CO Windrichtung -0.41 0.00
## 93 NO Windrichtung -0.26 0.00
## 94 NO2 Windrichtung -0.31 0.00
## 95 NOx Windrichtung -0.29 0.00
## 96 O3 Windrichtung 0.22 0.00
## 97 PM10 Windrichtung -0.36 0.00
## 98 SO2 Windrichtung -0.32 0.00
## 99 Globalstrahlung Windrichtung -0.05 0.35
## 100 Luftdruck Windrichtung -0.26 0.00
## 101 Luftfeuchtigkeit Windrichtung 0.04 0.45
## 102 Niederschlagsdauer Windrichtung 0.27 0.00
## 103 Skalar Windgeschwindigkeit Windrichtung 0.05 0.31
## 104 Temperatur Windrichtung 0.17 0.00
## 105 Vector Windgeschwindigkeit Windrichtung 0.04 0.39
## 106 CO Traffic -0.11 0.04
## 107 NO Traffic 0.08 0.13
## 108 NO2 Traffic 0.01 0.80
## 109 NOx Traffic 0.06 0.26
## 110 O3 Traffic 0.14 0.01
## 111 PM10 Traffic -0.14 0.01
## 112 SO2 Traffic -0.34 0.00
## 113 Globalstrahlung Traffic 0.34 0.00
## 114 Luftdruck Traffic -0.14 0.01
## 115 Luftfeuchtigkeit Traffic -0.04 0.44
## 116 Niederschlagsdauer Traffic -0.07 0.21
## 117 Skalar Windgeschwindigkeit Traffic -0.25 0.00
## 118 Temperatur Traffic 0.53 0.00
## 119 Vector Windgeschwindigkeit Traffic -0.25 0.00
## 120 Windrichtung Traffic 0.13 0.01
Let’s build the correlation matrix:
mcor <- cor(na.omit(byColDay[,-c(1)]))
head(mcor)
## CO NO NO2 NOx O3 PM10
## CO 1.0000000 0.8442662 0.8967970 0.9004475 -0.6989366 0.7701098
## NO 0.8442662 1.0000000 0.8177938 0.9780536 -0.7096441 0.5120767
## NO2 0.8967970 0.8177938 1.0000000 0.9197559 -0.6818326 0.6803786
## NOx 0.9004475 0.9780536 0.9197559 1.0000000 -0.7308127 0.5955489
## O3 -0.6989366 -0.7096441 -0.6818326 -0.7308127 1.0000000 -0.3617111
## PM10 0.7701098 0.5120767 0.6803786 0.5955489 -0.3617111 1.0000000
## SO2 Globalstrahlung Luftdruck Luftfeuchtigkeit Niederschlagsdauer
## CO 0.6701832 -0.206041760 0.3340554 0.17245977 -0.266728733
## NO 0.4683684 -0.214425363 0.1908097 0.23023934 -0.190587474
## NO2 0.6059638 -0.172675141 0.2814683 0.10929338 -0.205904754
## NOx 0.5388000 -0.208750394 0.2320314 0.19658832 -0.204523143
## O3 -0.4248422 0.623174693 -0.2150114 -0.60784536 -0.009108881
## PM10 0.6848203 0.003768149 0.3837374 -0.09335306 -0.419277005
## Skalar Windgeschwindigkeit Temperatur Vector Windgeschwindigkeit
## CO -0.4155812 -0.4005473 -0.4083642
## NO -0.4068303 -0.3195898 -0.4011903
## NO2 -0.4949809 -0.3787525 -0.4893812
## NOx -0.4566520 -0.3550770 -0.4507784
## O3 0.2935876 0.5883707 0.2853889
## PM10 -0.2593049 -0.2761583 -0.2548478
## Windrichtung Traffic
## CO -0.4232931 -0.113633167
## NO -0.2699444 0.075354250
## NO2 -0.3247486 0.007539406
## NOx -0.3016673 0.054118776
## O3 0.2227125 0.144205304
## PM10 -0.3579222 -0.144623818
Now we use the corrplot package to display a readable matrix:
corrplot(mcor, type="lower", order="hclust", tl.col="black", tl.srt=45, main="Air quality, traffic and weather correlation", mar=c(0,0,1,0))
In red we have the negative correlations, in blue the positive.
The strongest coefficients occur for the
O3 (Ozone) levels ,that seems to be :
positively correlated to Globalstrahlung (the global radiation) and Temperatur (the temperature) with respectively 0.62 and 0.58 coefficients
negatively correlated to Luftfeuchtigkeit (the air humidity) with a coefficient -0.6
SO2 (Sulfur dioxide)
The pearson correlation coefficients for the other pollutants are all lower than 0.5
Conclusion, the traffic seems to have a minor influence on pollution. The traffic is lightly correlated with NO, NOx and NO2 levels, which are more correlated with the levels of O3. The more O3, the less NO, NOx, NO2 and CO. Is it due to some chemical reactions or only because one is created in the conditions that weakens the others ?
The weather condition that are more correlated to air quality are
the speed of the wind for PM10, NO2 and CO.
PM10 seems also influenced by the precipitation duration
O3 seems correlated with global radiation, the temperature and the air humidity
Given this, the parameters that I would focus on would be:
For the Air quality measurements: O3 (negatively correlated to NO, NOx, NO2 and CO, so we can ignore them), PM10 and SO2. PM2.5 is not measured on all our samples, so I will ignore it.
For the Weather conditions: Globalstrahlung, Skalar (strongly correlated with Vector) Windgeschwindigkeit, Niederschlagsdauer, Temperatur and Luftfeuchtigkeit.
I would keep an eye on traffic levels too, because it could have impact during the COVID period.
The purpose here is to gather all the data collected since 2007 in a single data frame, keeping only the previously indexed information, in order to limit the consumption of resources.
We initialize the data frame that will collect all the data (DFAll)
DFAll <- data.frame(matrix(ncol=4,nrow=0, dimnames=list(NULL, c("Datum", "Parameter", "Wert", "Type"))))
DFAllPol will contain a larger scope of data for further investigations
DFAllPol <- data.frame(matrix(ncol=4,nrow=0, dimnames=list(NULL, c("Datum", "Parameter", "Wert", "Type"))))
Download the files and import and transform all the necessary data in DFAll.
## [1] 2007
## [1] 2008
## [1] 2009
## [1] 2010
## [1] 2011
## [1] 2012
## [1] 2013
## [1] 2014
## [1] 2015
## [1] 2016
## [1] 2017
## [1] 2018
## [1] 2019
## [1] 2020
## [1] 2021
totLines <- dim(DFAll)[1]
totLines
## [1] 46390
So we got 46390 rows of data that we will now explore.
head(DFAll)
## # A tibble: 6 × 4
## # Groups: Datum [1]
## Datum Parameter total type
## <date> <chr> <dbl> <chr>
## 1 2007-01-01 Globalstrahlung 6.49 Meteo
## 2 2007-01-01 Luftfeuchtigkeit 77.7 Meteo
## 3 2007-01-01 Niederschlagsdauer 24.8 Meteo
## 4 2007-01-01 O3 48.4 Pollution
## 5 2007-01-01 PM10 14.8 Pollution
## 6 2007-01-01 Skalar Windgeschwindigkeit 3.42 Meteo
We need to explode the various Parameter into disctinct columns to build our diagrams.
DFAllc <- DFAll[,-c(4)] %>% spread(Parameter, total)
The columns of DFAllc contain measured values in disparate units and scales. We will use the R scale function to make them consistent and to compare them.
DFAllcs <- DFAllc
DFAllcs[,-1] <- scale(DFAllc[,-1], scale = TRUE)
head(DFAllcs)
## # A tibble: 6 × 10
## # Groups: Datum [6]
## Datum Globalstrahl…¹ Luftf…² Niede…³ O3 PM10 Skala…⁴ SO2 Tempe…⁵
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2007-01-01 -1.32 0.645 1.96 0.168 -0.377 2.19 1.45 -0.337
## 2 2007-01-02 -1.18 0.891 2.08 0.700 NA 2.39 0.176 -1.02
## 3 2007-01-03 -0.892 0.397 -0.386 -0.0379 -0.276 0.964 0.698 -0.901
## 4 2007-01-04 -1.21 0.148 0.309 -0.0469 -0.741 2.11 1.21 -0.704
## 5 2007-01-05 -1.13 0.0332 -0.0745 0.557 -0.767 1.80 0.336 -0.610
## 6 2007-01-06 -1.08 -0.181 -0.615 -0.178 -0.613 0.823 1.58 -0.493
## # … with 1 more variable: Traffic <dbl>, and abbreviated variable names
## # ¹Globalstrahlung, ²Luftfeuchtigkeit, ³Niederschlagsdauer,
## # ⁴`Skalar Windgeschwindigkeit`, ⁵Temperatur
Let’s see if the result in a diagram coincides with what the correlation matrix calculated above
We need first to load the TSstudio package which allows to draw and manipulate time series
if("TSstudio" %in% rownames(installed.packages()) == FALSE) {
install.packages("TSstudio")
}
library(TSstudio)
Then we can draw our time series
# First convert Datum into a date
DFAllcs$Datum <- as.Date(DFAllcs$Datum, "%Y-%m-%d")
# DFAllbmc$Monat <- as.Date(DFAllbmc$Monat, "%Y-%m-%d")
t <- ts_plot(DFAllcs,
title = "Air quality, weather and traffic progression since 2007",
Xtitle = "Time",
Ytitle = "Scaled value",
slider=TRUE)
t
You can manipulate the graph with the scale bar under the picture and select the values to display in the legend.
If you select only “O3”, Globalstrahlung” and “Luftfeuchtigkeit”, you will note a positive correlation between “O3” levels and the “Globalstrahlung” and a negative correlation with “Luftfeuchtigkeit”.
But this details level is quite confusing, and the result will be more obvious if we draw the same diagram on a monthly basis:
In this view, you can focus on the parameters and period that you wish.
The following schemes are focused on the pollutants and weather conditions evocated formerly:
The O3 was calculated to be correlated with global radiation, temperature and the air humidity with 0.62, 0.58 and -0.6 coefficients. Here is the curve of these observations since 2007:
t <- ts_plot(DFAllbmcs[,c(1,2,3,5,9)],
title = "O3 vs global radiation, temperature and air humidity",
line.mode="lines+markers",
Xtitle = "Time",
Ytitle = "Scaled value",
slider=TRUE)
t
Ozone levels are strongly and sustainably correlated with humidity, global radiation and temperature levels.
We noticed that O3 was strongly correlated with negatively correlated to NO, NOx, NO2 and CO. So we could infer that what affects the O3 levels is although affecting these pollutants levels. It would be interesting to see if these correlation levels are stable on the long term.
To do that, we have to reintroduce these measurements in the data frame and scale it again.
We already prepared these data in DFAllPol
# To do that, I have to reintroduce these measurements in the data frame and scale it again:
DFAllPolbm <- DFAllPol %>% mutate(Monat = format(Datum, "%Y-%m")) %>% mutate(Monat = paste0(Monat, "-01")) %>% group_by(Monat, Parameter, type) %>% summarise(total = mean(total, na.rm=T))
# Convert the Monat and Parameter columns into factors for incoming manipulations
DFAllPolbm$Monat <- factor(DFAllPolbm$Monat)
DFAllPolbm$Parameter <- factor(DFAllPolbm$Parameter)
# head(DFAllbm, 20)
DFAllPolbmc <- DFAllPolbm[,-c(3)] %>% spread(Parameter, total)
DFAllPolbmcs <- DFAllPolbmc
DFAllPolbmcs[,-1] <- scale(DFAllPolbmc[,-1], scale = TRUE)
#head(DFAllbmcs)
#dim(DFAllbmcs)
# First convert Monat into a date
DFAllPolbmcs$Monat <- as.Date(DFAllPolbmcs$Monat, "%Y-%m-%d")
# DFAllbmc$Monat <- as.Date(DFAllbmc$Monat, "%Y-%m-%d")
t <- ts_plot(DFAllPolbmcs[,c(1,2,6,7,8,9)],
title = "O3 vs NO, NO2, NOx and CO levels",
line.mode="lines+markers",
Xtitle = "Time",
Ytitle = "Scaled value",
slider=TRUE)
t
The correlation of the different pollutants with O3 is confirmed on the long term except for CO which seems to weaken between 2018 and 2021, where the O3 curve continues as sinusoidal with a large amplitude while that of CO seems to settle.
We noticed that SO2 was negatively correlated to temperature with a 0.59 coefficient, let’s see this on the long duration:
t <- ts_plot(DFAllbmcs[,c(1,8,9)],
title = "SO2 vs temperature",
line.mode="lines+markers",
Xtitle = "Time",
Ytitle = "Scaled value",
slider=TRUE)
t
Looking at the curves of SO2 versus temperatures here above, you can notice that the correlation is flagrant from 2007 to a date around 2015. But after this point of the time, the variations of the temperatures follow the same pace, while SO2 remains lower and seems to stabilize.
I deduce from this that this correlation was in no way a causal relationship, or that a third factor, linked to the level of temperatures, such as the use of home heating for example, disappeared or was transformed in order to emit less sulfur dioxide.
PM10 is supposed to be correlated with the speed of the wind (-0.26), it’s direction (-0.36) and the precipitation duration (-0.42), atmospheric pressure (0.38) and temperature (-0.28). Traffic is quite low with a -0.14 coefficient. So we see a weak correlation with several events…
Then, for PM10, there are no dominant factor. Precipitation duration and the atmospheric pressure seems to be more influential than the other parameters, but if we guess that it would have positive effect only combined with the others.
t <- ts_plot(DFAllbmcs[,c(1,4,6,7,9,10)],
title = "PM10 vs temperature, precipitation duration, speed of the wind and traffic",
line.mode="lines+markers",
Xtitle = "Time",
Ytitle = "Scaled value",
slider=TRUE)
t
Maybe the sum of weak correlations gives a major correlation with a set of different intrans? We will have a look on the result when we add the absolute value of the environmental data together:
DFAllbmcs[is.na(DFAllbmcs)] <- 0
DFAllbmcs$total <- abs(DFAllbmcs$Temperatur) + abs(DFAllbmcs$Traffic) + abs(DFAllbmcs$Niederschlagsdauer) + abs(DFAllbmcs$`Skalar Windgeschwindigkeit`)
t <- ts_plot(DFAllbmcs[,c(1,4,6,7,9,10)],
title = "PM10 vs temperature, precipitation duration, speed of the wind and traffic",
line.mode="lines+markers",
Xtitle = "Time",
Ytitle = "Scaled value",
slider=TRUE)
t
If you select separately each value with the PM10 levels, you will see that there is no constant correlation on the whole period. This explain the poor correlation calculated for each element.
cor(DFAllbmcs$PM10, DFAllbmcs$total)
## [1] -0.0255194
A -0.02 coefficient factor is very low value that explains the messy curves that we get here over.
But I still hope to find a combination of factors that will allow us to obtain a significant coefficient of correlation and thus a means of acting against this pollution.
Let’s see the coefficients we get by combining these factors two by two with respect to PM10 levels. We have four potential factors, which gives us six groups of two to evaluate:
cor(DFAllbmcs$PM10, abs(DFAllbmcs$Temperatur) + abs(DFAllbmcs$Traffic))
## [1] 0.1406732
cor(DFAllbmcs$PM10, abs(DFAllbmcs$Temperatur) + abs(DFAllbmcs$`Skalar Windgeschwindigkeit`))
## [1] 0.007228167
cor(DFAllbmcs$PM10, abs(DFAllbmcs$Temperatur) + abs(DFAllbmcs$Niederschlagsdauer))
## [1] 0.04768694
cor(DFAllbmcs$PM10, abs(DFAllbmcs$Niederschlagsdauer) + abs(DFAllbmcs$`Skalar Windgeschwindigkeit`))
## [1] -0.1382358
cor(DFAllbmcs$PM10, abs(DFAllbmcs$Niederschlagsdauer) + abs(DFAllbmcs$Traffic))
## [1] -0.04316475
cor(DFAllbmcs$PM10, abs(DFAllbmcs$Traffic) + abs(DFAllbmcs$`Skalar Windgeschwindigkeit`))
## [1] -0.07167835
Well, the values are still much too low to be significant.
The combinations of
temperature and traffic only have a 0.14 coefficient.
temperature associated with the wind speed only get 0.007
temperature and precipitation duration obtain only 0.04
etc…
Perhaps by combining our parameters in groups of 3 we will get better results?
cor(DFAllbmcs$PM10, abs(DFAllbmcs$Temperatur) + abs(DFAllbmcs$Traffic) + abs(DFAllbmcs$`Skalar Windgeschwindigkeit`))
## [1] 0.01762356
cor(DFAllbmcs$PM10, abs(DFAllbmcs$Temperatur) + abs(DFAllbmcs$Traffic) + abs(DFAllbmcs$Niederschlagsdauer))
## [1] 0.05157836
cor(DFAllbmcs$PM10, abs(DFAllbmcs$Temperatur) + abs(DFAllbmcs$Niederschlagsdauer) + abs(DFAllbmcs$`Skalar Windgeschwindigkeit`))
## [1] -0.04156098
cor(DFAllbmcs$PM10, abs(DFAllbmcs$Niederschlagsdauer) + abs(DFAllbmcs$Traffic) + abs(DFAllbmcs$`Skalar Windgeschwindigkeit`))
## [1] -0.1007072
The results are final. We cannot conclude that there is a correlation and even less causality between PM10 levels and our weather or traffic factors. So the truth is elsewhere.
It would have been surprising if this very easy-to-access data had provided us with a simple analysis revolutionary information on the causal links between certain weather events or the level of traffic and the levels of different pollutants measured in the air of the city of Zurich.
We were able, however, to observe that the levels of many pollutants are highly correlated, leaving the hope that finding the cause of the increase in one of them might help to combat others.
We noticed that NO, NOx, NO2 and CO are strongly positive correlated. We can hope that there is a common factor explaining their presence. But as O3 is negatively correlated to these pollutants and revealed strongly and sustainably correlated with humidity, global radiation and temperature levels, we can imagine that these factors have an impact too on all of them. Unfortunately, these pollutants have a strong negative correlation with ozone levels. It is therefore conceivable that if a lever could be found to lower the level of Ozone, it would irreparably cause the levels of NOx, NO, NO2 and carbon monoxide to increase.
But these correlations are not causal links. Weather events that fluctuate as pollution levels change are only factors that aggregate or mitigate the effects of pollution from elsewhere.
We hoped to find an easy culprit in car traffic, at least for the pollutants known to be highly emitted by exhaust gases. But the decorrelation between traffic and pollution levels, even if it does not deny its harmful nature for health and the environment, prove that one or more other causes not measured in this study are at work.
The SO2 levels are an interesting case. Strongly correlated with the temperature levels between 2007 and 2015, they then seem to unravel. As temperature variations continue over the same range, the SO2 level appears to gradually decline and have lower peaks. Perhaps policies to combat emissions of this gas have borne fruit, or this correlation before 2015 was just fortuitous.
But what this analysis seems to reveal is that variations in weather events and pollution levels are linked. Weather acts as a catalyst or inhibitor on the presence of pollutants, and that in a changing climate environment in which extreme episodes amplify and multiply, the same levels of pollutant emissions are likely to cause greater harm in the future. But this is only a hypothesis that will have to be verified with precise data on the effects of pollutants, in particular on health.
library(dplyr)
library(tidyr)
library(ggplot2)
# install.packages("corrplot")
if("corrplot" %in% rownames(installed.packages()) == FALSE) {
install.packages("corrplot")
}
library(corrplot)
# Download each file
if(!exists("./data")){dir.create("./data")}
## Warning in dir.create("./data"): '.\data' existe déjà
for(i in years) {
print(i)
fileUrl <- paste0(url_path, filename, i, type, sep = "", recycle0 = TRUE)
locfilename <- paste0("./data/", filename, i, type, sep = "", recycle0 = TRUE)
m_fileUrl <- paste0(m_url_path, m_filename, i, type, sep = "", recycle0 = TRUE)
m_locfilename <- paste0("./data/", m_filename, i, type, sep = "", recycle0 = TRUE)
t_fileUrl <- paste0(t_url_path, t_filename, i, type, sep = "", recycle0 = TRUE)
t_locfilename <- paste0("./data/", t_filename, i, type, sep = "", recycle0 = TRUE)
# Collect and access the data from internet:
if(!exists(locfilename)){
download.file(fileUrl, destfile = locfilename, method = "curl")}
if(!exists(m_locfilename)){
download.file(m_fileUrl, destfile = m_locfilename, method = "curl")}
if(!exists(t_locfilename)){
download.file(t_fileUrl, destfile = t_locfilename, method = "curl")}
# Collect the data
DFPol <- read.csv(locfilename)
dim(DFPol)
DFMet <- read.csv(m_locfilename)
dim(DFMet)
DFTra <- read.csv(t_locfilename)
dim(DFTra)
# Then clean data
# Rename the parameters
DFMet[DFMet$Parameter == "p", ]$Parameter <- "Luftdruck"
DFMet[DFMet$Parameter == "RainDur", ]$Parameter <- "Niederschlagsdauer"
DFMet[DFMet$Parameter == "StrGlo", ]$Parameter <- "Globalstrahlung"
DFMet[DFMet$Parameter == "T", ]$Parameter <- "Temperatur"
DFMet[DFMet$Parameter == "Hr", ]$Parameter <- "Luftfeuchtigkeit"
DFMet[DFMet$Parameter == "WD", ]$Parameter <- "Windrichtung"
DFMet[DFMet$Parameter == "WVv", ]$Parameter <- "Vector Windgeschwindigkeit"
DFMet[DFMet$Parameter == "WVs", ]$Parameter <- "Skalar Windgeschwindigkeit"
# Exclude the unecessary columns
DFMet <- DFMet[, c(1,2,3,5,6)]
DFPol <- DFPol[, c(1,2,3,5,6)]
DFTra1 <- DFTra[, c(1,8)]
# I delete the hours to keep only the date in Datum
DFPol$Datum <- as.Date(substr(DFPol$Datum, 1, 10), format = "%Y-%m-%d")
DFMet$Datum <- as.Date(substr(DFMet$Datum, 1, 10), format = "%Y-%m-%d")
DFTra$Datum <- as.Date(substr(DFTra$Datum, 1, 10), format = "%Y-%m-%d")
# I delete the NA created
DFPol <- DFPol[!is.na(DFPol$Wert), ]
DFMet <- DFMet[!is.na(DFMet$Wert), ]
DFTra <- DFTra[!is.na(DFTra$Anzahl), ]
# Then group the data per day:
PolbyDay <- DFPol %>% group_by(Datum, Standort, Parameter) %>% summarise(total = mean(Wert, na.rm=T))
MetbyDay <- DFMet %>% group_by(Datum, Standort, Parameter) %>% summarise(total = mean(Wert, na.rm=T))
head(PolbyDay)
head(MetbyDay)
TrabyDay <- DFTra %>% group_by(Datum, Parameter = "Traffic") %>% summarise(total = sum(Anzahl, na.rm=T))
# Only keep the "Zch_Stampfenbachstrasse" place:
PolbyDay <- PolbyDay[PolbyDay$Standort == "Zch_Stampfenbachstrasse", ]
MetbyDay <- MetbyDay[MetbyDay$Standort == "Zch_Stampfenbachstrasse", ]
# The data frame with more pollutants for further investigations
AllPolbyDay <- PolbyDay
# Only keep the parameters on which we decided to focus:
# O3, PM10, SO2 for air quality
# "Globalstrahlung", "Skalar Windgeschwindigkeit", "Niederschlagsdauer", "Temperatur", "Luftfeuchtigkeit" for Meteo
PolbyDay <- PolbyDay[(PolbyDay$Parameter == "O3" | PolbyDay$Parameter == "PM10" | PolbyDay$Parameter == "SO2"), ]
MetbyDay <- MetbyDay[(MetbyDay$Parameter == "Globalstrahlung" | MetbyDay$Parameter == "Skalar Windgeschwindigkeit" | MetbyDay$Parameter == "Niederschlagsdauer" | MetbyDay$Parameter == "Temperatur" | MetbyDay$Parameter == "Luftfeuchtigkeit"), ]
# Get rid of the "Standort" column
AllPolbyDay <- AllPolbyDay[, c(1,3,4)]
PolbyDay <- PolbyDay[, c(1,3,4)]
MetbyDay <- MetbyDay[, c(1,3,4)]
# Add the origin label
AllPolbyDay$type <- "Pollution"
PolbyDay$type <- "Pollution"
MetbyDay$type <- "Meteo"
TrabyDay$type <- "Traffic"
# add it in the DFAll
DFAll <- rbind(DFAll, PolbyDay, MetbyDay, TrabyDay)
# And reorder the data
DFAll <- DFAll[order(DFAll$Datum, DFAll$Parameter),]
# Same keeping all pollutants for further investigations
DFAllPol <- rbind(DFAllPol, AllPolbyDay, MetbyDay, TrabyDay)
DFAllPol <- DFAllPol[order(DFAllPol$Datum, DFAllPol$Parameter),]
}
## [1] 2007
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.
## [1] 2008
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.
## [1] 2009
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.
## [1] 2010
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.
## [1] 2011
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.
## [1] 2012
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.
## [1] 2013
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.
## [1] 2014
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.
## [1] 2015
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.
## [1] 2016
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.
## [1] 2017
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.
## [1] 2018
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.
## [1] 2019
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.
## [1] 2020
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.
## [1] 2021
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.
# Daily results are messy. Let's try monthly
DFAllbm <- DFAll %>% mutate(Monat = format(Datum, "%Y-%m")) %>% mutate(Monat = paste0(Monat, "-01")) %>% group_by(Monat, Parameter, type) %>% summarise(total = mean(total, na.rm=T))
## `summarise()` has grouped output by 'Monat', 'Parameter'. You can override
## using the `.groups` argument.
# Convert the Monat and Parameter columns into factors for incoming manipulations
DFAllbm$Monat <- factor(DFAllbm$Monat)
DFAllbm$Parameter <- factor(DFAllbm$Parameter)
# head(DFAllbm, 20)
DFAllbmc <- DFAllbm[,-c(3)] %>% spread(Parameter, total)
DFAllbmcs <- DFAllbmc
DFAllbmcs[,-1] <- scale(DFAllbmc[,-1], scale = TRUE)
#head(DFAllbmcs)
#dim(DFAllbmcs)
# First convert Monat into a date
DFAllbmcs$Monat <- as.Date(DFAllbmcs$Monat, "%Y-%m-%d")
# DFAllbmc$Monat <- as.Date(DFAllbmc$Monat, "%Y-%m-%d")
t <- ts_plot(DFAllbmcs,
title = "Air quality, weather and traffic monthly progression since 2007",
line.mode="lines+markers",
Xtitle = "Time",
Ytitle = "Scaled value",
slider=TRUE)
t
https://data.stadt-zuerich.ch/dataset/ugz_luftschadstoffmessung_stundenwerte↩︎
https://data.stadt-zuerich.ch/dataset/ugz_meteodaten_stundenmittelwerte↩︎
https://data.stadt-zuerich.ch/dataset/ugz_verkehrsdaten_stundenwerte_stampfenbachstrasse↩︎
https://www.r-bloggers.com/2013/04/using-the-svd-to-find-the-needle-in-the-haystack/↩︎
http://www.sthda.com/french/wiki/matrice-de-correlation-guide-simple-pour-analyser-formater-et-visualiser#cest-quoi-une-matrice-de-correlation↩︎