Context

Based on atmospheric quality, road traffic and the levels of the main pollutants, we will try to identify the correlation links between these different parameters.

As part of the Open Data Policy, the City of Zurich has made a number of documents available for data scientists. Three of which have caught our attention. It is the air quality measurements that I would like to compare with the weather and the traffic data. As it happens, the sensors used are all in one place, which is very convenient for this analysis.

The first data sets ¹ are the air quality measurements taken every hour since 1983. They contain measures of ozone (O3), nitrogen oxides (NOx), nitrogen monoxide (NO), nitrogen dioxide (NO2), particulate matter (PM10 and PM2.5), carbon monoxide (CO) and sulfur dioxide (SO2).

The second dataset ² contains hourly weather records since 1992 at the same places. It measures air pressure (p), precipitation duration (RainDur), global radiation (StrGlo), temperature (T), relative humidity (Hr), wind direction, vector and scalar wind speed. These data will allow us to see if there is a correlation between pollutant concentration and the meteorologic events.

The third dataset ³ contains the hourly traffic on the Stampfenbachstrasse street since 2007. Some of the weather and air quality measurements are also done on this street, which will allow us to analyze geographically consistent samples.

The questions we will try to answer are the following:

Are there connections between the traffic or the weather data and some pollution levels ?
Is it possible to find a cause or aggravating factor common to all or a representative part of these pollution.

Synopsis

To find these answers, we will collect the hourly data. We will establish the level of granularity (hourly, daily, monthly) best suited to its analysis. The result must be readable, interpretable, complete and not contain too many material resources to be produced.

We then establish correlations between parameters and then observe the joint evolution of correlated parameters over the period.

First we load the necessary packages:

Exploratory analysis

Initialization of the data

To begin, we will analyse the 2019 data:

year <- "2012"
years <- c(2007,2008)
# year <- "2019"
type <- ".csv"
filename <- "ugz_ogd_air_h1_"
url_path <- "https://data.stadt-zuerich.ch/dataset/ugz_luftschadstoffmessung_stundenwerte/download/"
fileUrl <- paste0(url_path, filename, year, type, sep = "", recycle0 = TRUE)
locfilename <- paste0("./data/", filename, year, type, sep = "", recycle0 = TRUE)

m_filename <- "ugz_ogd_meteo_h1_"
m_url_path <- "https://data.stadt-zuerich.ch/dataset/ugz_meteodaten_stundenmittelwerte/download/"
m_fileUrl <- paste0(m_url_path, m_filename, year, type, sep = "", recycle0 = TRUE)
m_locfilename <- paste0("./data/", m_filename, year, type, sep = "", recycle0 = TRUE)

# Traffic informations since 2007
t_filename <- "ugz_ogd_traffic_h1_"
t_url_path <- "https://data.stadt-zuerich.ch/dataset/ugz_verkehrsdaten_stundenwerte_stampfenbachstrasse/download/"
t_fileUrl <- paste0(t_url_path, t_filename, year, type, sep = "", recycle0 = TRUE)
t_locfilename <- paste0("./data/", t_filename, year, type, sep = "", recycle0 = TRUE)

Download the first files:

# Collect and access the data from internet:
if(!exists("./data")){dir.create("./data")}
if(!exists(locfilename)){
    download.file(fileUrl, destfile = locfilename, method = "curl")}

if(!exists(m_locfilename)){
    download.file(m_fileUrl, destfile = m_locfilename, method = "curl")}

if(!exists(t_locfilename)){
    download.file(t_fileUrl, destfile = t_locfilename, method = "curl")}

Create the data frame:

DFPol <- read.csv(locfilename)
dim(DFPol)

## [1] 140544      7

DFMet <- read.csv(m_locfilename)
dim(DFMet)

## [1] 122976      7

DFTra <- read.csv(t_locfilename)
dim(DFTra)

## [1] 52704     9

Data exploration

str(DFPol)

## 'data.frame':    140544 obs. of  7 variables:
##  $ Datum    : chr  "2012-01-01T00:00+0100" "2012-01-01T00:00+0100" "2012-01-01T00:00+0100" "2012-01-01T00:00+0100" ...
##  $ Standort : chr  "Zch_Heubeeribüel" "Zch_Heubeeribüel" "Zch_Heubeeribüel" "Zch_Heubeeribüel" ...
##  $ Parameter: chr  "NO2" "NO" "NOx" "O3" ...
##  $ Intervall: chr  "h1" "h1" "h1" "h1" ...
##  $ Einheit  : chr  "µg/m3" "µg/m3" "ppb" "µg/m3" ...
##  $ Wert     : num  14.31 1.06 8.33 36.52 39.96 ...
##  $ Status   : chr  "bereinigt" "bereinigt" "bereinigt" "bereinigt" ...

str(DFMet)

## 'data.frame':    122976 obs. of  7 variables:
##  $ Datum    : chr  "2012-01-01T00:00+0100" "2012-01-01T00:00+0100" "2012-01-01T00:00+0100" "2012-01-01T00:00+0100" ...
##  $ Standort : chr  "Zch_Schimmelstrasse" "Zch_Schimmelstrasse" "Zch_Schimmelstrasse" "Zch_Schimmelstrasse" ...
##  $ Parameter: chr  "Hr" "RainDur" "T" "WD" ...
##  $ Intervall: chr  "h1" "h1" "h1" "h1" ...
##  $ Einheit  : chr  "%Hr" "min" "°C" "°" ...
##  $ Wert     : num  90.84 6.27 7.92 39.67 0.35 ...
##  $ Status   : chr  "bereinigt" "bereinigt" "bereinigt" "bereinigt" ...

str(DFTra)

## 'data.frame':    52704 obs. of  9 variables:
##  $ Datum      : chr  "2012-01-01T00:00+0100" "2012-01-01T00:00+0100" "2012-01-01T00:00+0100" "2012-01-01T00:00+0100" ...
##  $ Standort   : chr  "Zch_Stampfenbachstrasse" "Zch_Stampfenbachstrasse" "Zch_Stampfenbachstrasse" "Zch_Stampfenbachstrasse" ...
##  $ Richtung   : chr  "Schaffhauserplatz" "Schaffhauserplatz" "Schaffhauserplatz" "Stampfenbachplatz" ...
##  $ Spur       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Klasse.ID  : int  101 102 103 101 102 103 101 102 103 101 ...
##  $ Klasse.Text: chr  "Zweirad" "Personenwagen" "Lastwagen" "Zweirad" ...
##  $ Intervall  : chr  "h1" "h1" "h1" "h1" ...
##  $ Anzahl     : int  NA NA NA 4 104 2 NA NA NA 3 ...
##  $ Status     : chr  "provisorisch" "provisorisch" "provisorisch" "provisorisch" ...

Pollution and weather data

The first two data sets (DFPol and DFMet) are very convenient. They have the same data structure with the same parameters.

Here is the list of the unique Parameter values:

unique(DFPol$Parameter)

## [1] "NO2"  "NO"   "NOx"  "O3"   "PM10" "CO"   "SO2"

unique(DFMet$Parameter)

## [1] "Hr"      "RainDur" "T"       "WD"      "WVv"     "p"       "WVs"    
## [8] "StrGlo"

Datum is the date and hour of the measurement
Parameter gives the type of measurement:
- For the air quality:
  - O3 : Ozone
  - NOx: nitrogen oxides
  - NO: nitrogen monoxide
  - NO2: nitrogen dioxide
  - PM10 and PM2.5: particulate matter
  - CO: carbon monoxide
  - SO2: sulphur dioxide
- For the weather measurements:
  - p: air pressure
  - RainDur: precipitation duration
  - StrGlo: global radiation
  - T: temperature
  - Hr: relative humidity
  - WD: wind direction
  - WVs: scalar wind speed
  - WVv: vector wind speed
Intervall is the time intervals between each measurement. In both cases, it will always be one hour. So we will ignore it.
Einheit is the unit used.
Wert is the numeric value measured.
Status states the quality of the data, we will ignore it.

The common streets in the two data frames are :

intersect(DFPol$Standort, DFMet$Standort)

## [1] "Zch_Schimmelstrasse"     "Zch_Stampfenbachstrasse"

Traffic data:

The traffic dataset contains data dispatched in several columns. We will sum up all vehicle classes, all categories together, under a “traffic” parameter that we will then insert later into the merged table of pollution and weather data.

head(DFTra)

##                   Datum                Standort          Richtung Spur
## 1 2012-01-01T00:00+0100 Zch_Stampfenbachstrasse Schaffhauserplatz    0
## 2 2012-01-01T00:00+0100 Zch_Stampfenbachstrasse Schaffhauserplatz    0
## 3 2012-01-01T00:00+0100 Zch_Stampfenbachstrasse Schaffhauserplatz    0
## 4 2012-01-01T00:00+0100 Zch_Stampfenbachstrasse Stampfenbachplatz    0
## 5 2012-01-01T00:00+0100 Zch_Stampfenbachstrasse Stampfenbachplatz    0
## 6 2012-01-01T00:00+0100 Zch_Stampfenbachstrasse Stampfenbachplatz    0
##   Klasse.ID   Klasse.Text Intervall Anzahl       Status
## 1       101       Zweirad        h1     NA provisorisch
## 2       102 Personenwagen        h1     NA provisorisch
## 3       103     Lastwagen        h1     NA provisorisch
## 4       101       Zweirad        h1      4 provisorisch
## 5       102 Personenwagen        h1    104 provisorisch
## 6       103     Lastwagen        h1      2 provisorisch

The “Datum” parameter is the same format as those of the Air quality and weather data sets.

Let’s have a look at the other potentially factor columns:

unique(DFTra$Klasse.ID)

## [1] 101 102 103

unique(DFTra$Klasse.Text)

## [1] "Zweirad"       "Personenwagen" "Lastwagen"

unique(DFTra$Intervall)

## [1] "h1"

unique(DFTra$Status)

## [1] "provisorisch"

The “Standort” (place), “Richtung” (direction), “Spur” (lane), “Klasse.ID” (vehicle class id), “Klasse.Text” (vehicle class), “Intervall” (time intervals. Always hourly in thess datasets) and “Status” will be ignored because they have unique values or are classifications that we don’t need in this specific context.

The class of the vehicles (Klasse.Text and Klasse.ID) will not be used in this analysis and will be consolidated.

Cleaning data

To facilitate the use of the weather data, we will replace the acronyms with full words (in German, if you don’t mind).

DFMet[DFMet$Parameter == "p", ]$Parameter <- "Luftdruck"
DFMet[DFMet$Parameter == "RainDur", ]$Parameter <- "Niederschlagsdauer"
DFMet[DFMet$Parameter == "StrGlo", ]$Parameter <- "Globalstrahlung"
DFMet[DFMet$Parameter == "T", ]$Parameter <- "Temperatur"
DFMet[DFMet$Parameter == "Hr", ]$Parameter <- "Luftfeuchtigkeit"
DFMet[DFMet$Parameter == "WD", ]$Parameter <- "Windrichtung"
DFMet[DFMet$Parameter == "WVv", ]$Parameter <- "Vector Windgeschwindigkeit"
DFMet[DFMet$Parameter == "WVs", ]$Parameter <- "Skalar Windgeschwindigkeit"

We only keep the necessary columns:

DFMet <- DFMet[, c(1,2,3,5,6)]
DFPol <- DFPol[, c(1,2,3,5,6)]
DFTra1 <- DFTra[, c(1,8)]

To reduce the volume of data, we will group it by day and keep the mean of the values:

DFPol1 <- DFPol
DFMet1 <- DFMet

# I delete the hours to keep only the date in Datum
DFPol1$Datum <- as.Date(substr(DFPol1$Datum, 1, 10), format = "%Y-%m-%d")
DFMet1$Datum <- as.Date(substr(DFMet1$Datum, 1, 10), format = "%Y-%m-%d")
DFTra1$Datum <- as.Date(substr(DFTra1$Datum, 1, 10), format = "%Y-%m-%d")


# I delete the NA created
DFPol1 <- DFPol1[!is.na(DFPol1$Wert), ]
DFMet1 <- DFMet1[!is.na(DFMet1$Wert), ]
DFTra1 <- DFTra1[!is.na(DFTra1$Anzahl), ]

# Then group the data per day:
PolbyDay <- DFPol1 %>% group_by(Datum, Standort, Parameter) %>% summarise(total = mean(Wert, na.rm=T))
MetbyDay <- DFMet1 %>% group_by(Datum, Standort, Parameter) %>% summarise(total = mean(Wert, na.rm=T))
head(PolbyDay)

## # A tibble: 6 × 4
## # Groups:   Datum, Standort [2]
##   Datum      Standort            Parameter total
##   <date>     <chr>               <chr>     <dbl>
## 1 2012-01-01 Zch_Heubeeribüel    NO         1.62
## 2 2012-01-01 Zch_Heubeeribüel    NO2       12.7 
## 3 2012-01-01 Zch_Heubeeribüel    NOx        7.94
## 4 2012-01-01 Zch_Heubeeribüel    O3        35.3 
## 5 2012-01-01 Zch_Schimmelstrasse NO        10.4 
## 6 2012-01-01 Zch_Schimmelstrasse NO2       25.5

head(MetbyDay)

## # A tibble: 6 × 4
## # Groups:   Datum, Standort [1]
##   Datum      Standort            Parameter                    total
##   <date>     <chr>               <chr>                        <dbl>
## 1 2012-01-01 Zch_Schimmelstrasse Luftdruck                  972.   
## 2 2012-01-01 Zch_Schimmelstrasse Luftfeuchtigkeit            82.5  
## 3 2012-01-01 Zch_Schimmelstrasse Niederschlagsdauer           0.426
## 4 2012-01-01 Zch_Schimmelstrasse Temperatur                   8.84 
## 5 2012-01-01 Zch_Schimmelstrasse Vector Windgeschwindigkeit   0.738
## 6 2012-01-01 Zch_Schimmelstrasse Windrichtung               143.

TrabyDay <- DFTra1 %>% group_by(Datum, Parameter = "Traffic") %>% summarise(total = sum(Anzahl, na.rm=T))
head(TrabyDay)

## # A tibble: 6 × 3
## # Groups:   Datum [6]
##   Datum      Parameter total
##   <date>     <chr>     <int>
## 1 2012-01-01 Traffic    2075
## 2 2012-01-02 Traffic    1843
## 3 2012-01-03 Traffic    3273
## 4 2012-01-04 Traffic    3485
## 5 2012-01-05 Traffic    3580
## 6 2012-01-06 Traffic    3909

To save some space in memory, we will keep only one of the 3 common locations:

length(unique(PolbyDay[PolbyDay$Standort == "Zch_Schimmelstrasse", ]$Parameter))

## [1] 5

length(unique(PolbyDay[PolbyDay$Standort == "Zch_Stampfenbachstrasse", ]$Parameter)) # winner

## [1] 7

length(unique(PolbyDay[PolbyDay$Standort == "Zch_Heubeeribüel", ]$Parameter))

## [1] 4

length(unique(MetbyDay[MetbyDay$Standort == "Zch_Schimmelstrasse", ]$Parameter))

## [1] 6

length(unique(MetbyDay[MetbyDay$Standort == "Zch_Stampfenbachstrasse", ]$Parameter)) # winner

## [1] 8

length(unique(MetbyDay[MetbyDay$Standort == "Zch_Heubeeribüel", ]$Parameter))

## [1] 0

# This location is the one that gets the type of measurements in 2019
# Which is good because it's the street where the traffic was measured since 2007
PolbyDay <- PolbyDay[PolbyDay$Standort == "Zch_Stampfenbachstrasse", ]
MetbyDay <- MetbyDay[MetbyDay$Standort == "Zch_Stampfenbachstrasse", ]

Now the column “Standort” is of no use:

PolbyDay <- PolbyDay[, c(1,3,4)]
MetbyDay <- MetbyDay[, c(1,3,4)]

It will be useful to know which data are related to air quality and which are weather information:

PolbyDay$type <- "Pollution"
MetbyDay$type <- "Meteo"
TrabyDay$type <- "Traffic"

And we regroup these 3 data frames in one unique data frame:

byDay <- rbind(PolbyDay, MetbyDay, TrabyDay)
byDay <- byDay[order(byDay$Datum, byDay$Parameter),]
head(byDay, 20)

## # A tibble: 20 × 4
## # Groups:   Datum [2]
##    Datum      Parameter                     total type     
##    <date>     <chr>                         <dbl> <chr>    
##  1 2012-01-01 CO                            0.312 Pollution
##  2 2012-01-01 Globalstrahlung              39.6   Meteo    
##  3 2012-01-01 Luftdruck                   969.    Meteo    
##  4 2012-01-01 Luftfeuchtigkeit             86.9   Meteo    
##  5 2012-01-01 Niederschlagsdauer            3.98  Meteo    
##  6 2012-01-01 NO                            7.83  Pollution
##  7 2012-01-01 NO2                          26.5   Pollution
##  8 2012-01-01 NOx                          20.2   Pollution
##  9 2012-01-01 O3                           22.2   Pollution
## 10 2012-01-01 PM10                         18.0   Pollution
## 11 2012-01-01 Skalar Windgeschwindigkeit    2.02  Meteo    
## 12 2012-01-01 SO2                           1.78  Pollution
## 13 2012-01-01 Temperatur                    8.12  Meteo    
## 14 2012-01-01 Traffic                    2075     Traffic  
## 15 2012-01-01 Vector Windgeschwindigkeit    1.91  Meteo    
## 16 2012-01-01 Windrichtung                195.    Meteo    
## 17 2012-01-02 CO                            0.236 Pollution
## 18 2012-01-02 Globalstrahlung              12.6   Meteo    
## 19 2012-01-02 Luftdruck                   965.    Meteo    
## 20 2012-01-02 Luftfeuchtigkeit             83.9   Meteo

unique(byDay$Parameter)

##  [1] "CO"                         "Globalstrahlung"           
##  [3] "Luftdruck"                  "Luftfeuchtigkeit"          
##  [5] "Niederschlagsdauer"         "NO"                        
##  [7] "NO2"                        "NOx"                       
##  [9] "O3"                         "PM10"                      
## [11] "Skalar Windgeschwindigkeit" "SO2"                       
## [13] "Temperatur"                 "Traffic"                   
## [15] "Vector Windgeschwindigkeit" "Windrichtung"

tail(MetbyDay, 15)

## # A tibble: 15 × 4
## # Groups:   Datum [2]
##    Datum      Parameter                   total type 
##    <date>     <chr>                       <dbl> <chr>
##  1 2012-12-30 Luftdruck                  972.   Meteo
##  2 2012-12-30 Luftfeuchtigkeit            60.2  Meteo
##  3 2012-12-30 Niederschlagsdauer           3.00 Meteo
##  4 2012-12-30 Skalar Windgeschwindigkeit   2.83 Meteo
##  5 2012-12-30 Temperatur                   7.72 Meteo
##  6 2012-12-30 Vector Windgeschwindigkeit   2.79 Meteo
##  7 2012-12-30 Windrichtung               226.   Meteo
##  8 2012-12-31 Globalstrahlung             61.7  Meteo
##  9 2012-12-31 Luftdruck                  970.   Meteo
## 10 2012-12-31 Luftfeuchtigkeit            69.7  Meteo
## 11 2012-12-31 Niederschlagsdauer           0    Meteo
## 12 2012-12-31 Skalar Windgeschwindigkeit   1.62 Meteo
## 13 2012-12-31 Temperatur                   5.14 Meteo
## 14 2012-12-31 Vector Windgeschwindigkeit   1.59 Meteo
## 15 2012-12-31 Windrichtung               183.   Meteo

Now let’s group data per month, and keep the mean of the daily “Wert“ value:

byMonth <- byDay %>%  mutate(Monat = format(Datum, "%Y-%m")) %>% mutate(Monat = paste0(Monat, "-01")) %>% group_by(Monat, Parameter, type) %>% summarise(total = mean(total, na.rm=T))

byMonth$Monat <- factor(byMonth$Monat)
byMonth$Parameter <- factor(byMonth$Parameter)

head(byMonth, 20)

## # A tibble: 20 × 4
## # Groups:   Monat, Parameter [20]
##    Monat      Parameter                  type         total
##    <fct>      <fct>                      <chr>        <dbl>
##  1 2012-01-01 CO                         Pollution    0.393
##  2 2012-01-01 Globalstrahlung            Meteo       37.5  
##  3 2012-01-01 Luftdruck                  Meteo      972.   
##  4 2012-01-01 Luftfeuchtigkeit           Meteo       78.6  
##  5 2012-01-01 Niederschlagsdauer         Meteo       12.2  
##  6 2012-01-01 NO                         Pollution   22.1  
##  7 2012-01-01 NO2                        Pollution   32.6  
##  8 2012-01-01 NOx                        Pollution   34.7  
##  9 2012-01-01 O3                         Pollution   30.9  
## 10 2012-01-01 PM10                       Pollution   18.0  
## 11 2012-01-01 Skalar Windgeschwindigkeit Meteo        2.37 
## 12 2012-01-01 SO2                        Pollution    2.37 
## 13 2012-01-01 Temperatur                 Meteo        3.52 
## 14 2012-01-01 Traffic                    Traffic   3723.   
## 15 2012-01-01 Vector Windgeschwindigkeit Meteo        2.35 
## 16 2012-01-01 Windrichtung               Meteo      176.   
## 17 2012-02-01 CO                         Pollution    0.532
## 18 2012-02-01 Globalstrahlung            Meteo       86.5  
## 19 2012-02-01 Luftdruck                  Meteo      974.   
## 20 2012-02-01 Luftfeuchtigkeit           Meteo       73.1

Let’s try clustering

First, let’s get rid of NA’s in byDay and byMonth:

anyNA(byDay)

## [1] FALSE

anyNA(byMonth)

## [1] FALSE

Our data per month and day are messy. The column parameter contains data that should be contained in separate columns to conduct a clustering analysis:

byColDay <- byDay[,-c(4)] %>% spread(Parameter, total)
# Reorder columns, so the result will be more convenient (1. Air quality, then weather and traffic columns)
byColDay <- byColDay[,c(1, 2, 7, 8,9,10,11,13,3,4,5,6,12,14,16,17,15)]

# Actually I have NA's after this operation, so:
na.omit(byColDay)

## # A tibble: 358 × 17
## # Groups:   Datum [358]
##    Datum         CO    NO   NO2   NOx    O3  PM10   SO2 Global…¹ Luftd…² Luftf…³
##    <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>   <dbl>   <dbl>
##  1 2012-01-01 0.312  7.83  26.5 20.2   22.2 18.0  1.78      39.6    969.    86.9
##  2 2012-01-02 0.236  3.51  17.9 12.2   36.1  5.34 1.64      12.6    965.    83.9
##  3 2012-01-03 0.245  5.85  19.6 15.0   48.9  6.92 1.40      63.4    973.    67.2
##  4 2012-01-04 0.216  2.59  13.3  9.00  59.5  7.49 1.11      43.0    971.    69.0
##  5 2012-01-05 0.205  2.23  10.1  7.08  62.6  4.92 0.902     11.2    957.    75.6
##  6 2012-01-06 0.239  3.20  17.7 11.8   46.1  7.11 1.10      18.6    968.    82.4
##  7 2012-01-07 0.236  2.36  13.2  8.81  47.5  7.42 1.15      11.5    973.    76.8
##  8 2012-01-08 0.226  2.35  13.9  9.17  48.9  8.47 1.26      13.2    973.    82.8
##  9 2012-01-09 0.332 15.4   30.9 28.5   32.2 14.6  1.59      29.2    979.    78.1
## 10 2012-01-10 0.539 43.5   46.1 59.0   17.7 18.9  2.28      35.5    982.    82.4
## # … with 348 more rows, 6 more variables: Niederschlagsdauer <dbl>,
## #   `Skalar Windgeschwindigkeit` <dbl>, Temperatur <dbl>,
## #   `Vector Windgeschwindigkeit` <dbl>, Windrichtung <dbl>, Traffic <dbl>, and
## #   abbreviated variable names ¹Globalstrahlung, ²Luftdruck, ³Luftfeuchtigkeit

byColMonth <- byMonth[,-c(3)] %>% spread(Parameter, total)
byColMonth <- byColMonth[,c(1, 2, 7, 8,9,10,11,13,3,4,5,6,12,14,16,17,15)]

na.omit(byColMonth)

## # A tibble: 12 × 17
## # Groups:   Monat [12]
##    Monat         CO    NO   NO2   NOx    O3  PM10   SO2 Global…¹ Luftd…² Luftf…³
##    <fct>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>   <dbl>   <dbl>
##  1 2012-01-01 0.393 22.1   32.6  34.7  30.9  18.0 2.37      37.5    972.    78.6
##  2 2012-02-01 0.532 24.9   49.8  46.0  26.6  39.3 5.75      86.5    974.    73.1
##  3 2012-03-01 0.464 26.3   47.0  45.6  35.5  28.8 3.57     163.     975.    59.5
##  4 2012-04-01 0.316 11.9   29.1  24.8  57.4  14.3 2.30     152.     956.    65.9
##  5 2012-05-01 0.299 10.3   26.9  22.4  68.4  15.2 1.38     236.     966.    61.8
##  6 2012-06-01 0.268  9.52  24.4  20.4  61.0  16.4 1.24     228.     965.    67.7
##  7 2012-07-01 0.265  9.69  23.8  20.2  57.2  15.3 1.31     213.     967.    67.2
##  8 2012-08-01 0.339 10.4   27.0  22.5  65.2  17.9 0.829    217.     968.    65.4
##  9 2012-09-01 0.394 18.6   33.9  32.6  39.2  17.9 2.09     136.     967.    76.1
## 10 2012-10-01 0.443 30.3   36.8  43.5  20.0  18.2 3.21      79.9    963.    82.1
## 11 2012-11-01 0.443 35.4   40.9  49.8  15.4  20.8 2.04      47.4    964.    82.7
## 12 2012-12-01 0.364 20.5   36.5  35.5  31.2  15.3 2.01      33.4    965.    80.4
## # … with 6 more variables: Niederschlagsdauer <dbl>,
## #   `Skalar Windgeschwindigkeit` <dbl>, Temperatur <dbl>,
## #   `Vector Windgeschwindigkeit` <dbl>, Windrichtung <dbl>, Traffic <dbl>, and
## #   abbreviated variable names ¹Globalstrahlung, ²Luftdruck, ³Luftfeuchtigkeit

names(byColDay)

##  [1] "Datum"                      "CO"                        
##  [3] "NO"                         "NO2"                       
##  [5] "NOx"                        "O3"                        
##  [7] "PM10"                       "SO2"                       
##  [9] "Globalstrahlung"            "Luftdruck"                 
## [11] "Luftfeuchtigkeit"           "Niederschlagsdauer"        
## [13] "Skalar Windgeschwindigkeit" "Temperatur"                
## [15] "Vector Windgeschwindigkeit" "Windrichtung"              
## [17] "Traffic"

names(byColMonth)

##  [1] "Monat"                      "CO"                        
##  [3] "NO"                         "NO2"                       
##  [5] "NOx"                        "O3"                        
##  [7] "PM10"                       "SO2"                       
##  [9] "Globalstrahlung"            "Luftdruck"                 
## [11] "Luftfeuchtigkeit"           "Niederschlagsdauer"        
## [13] "Skalar Windgeschwindigkeit" "Temperatur"                
## [15] "Vector Windgeschwindigkeit" "Windrichtung"              
## [17] "Traffic"

Note: Traffic figures were not recorded during the construction work on Stampfenbachstrasse in 2018 and 2019.

So if year is 2018 or 2019, we will have to clean NAs, because traffic was no measured during several months:

# Cleaning the NA requires to put the Parameter data in columns
if(year == "2019" | year == "2018") { 
    # We delete NA lines
    byColDay <- byColDay[!is.na(byColDay$Traffic),]
    byColMonth <- byColMonth[!is.na(byColMonth$Traffic),]
} else { 
        print("These years do not need NA deleting!")
    }

## [1] "These years do not need NA deleting!"

dim(byColDay)

## [1] 366  17

dim(byColMonth)

## [1] 12 17

Clustering per Month:

We must transform the data into a matrix in order to be able to calculate the svd derivated matrix. Then we will display a heatmap of the datamatrix and plot the variance graphs so we will see if some parameters explain significantly some variations.

# Transforms the data frame into a matrix
dataMatrix <- data.matrix(byColMonth[,-1])
hh <- hclust(dist(dataMatrix)) # Find the diagonal matrix
# str(hh)
dataMatrixOrdered <- dataMatrix[hh$order,]


svd1 <- svd(scale(dataMatrixOrdered))
#str(svd1)



par(mfrow=c(1,3))
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1], main="Heatmap")

plot(colMeans(dataMatrixOrdered),xlab="Column index",ylab="Column Mean",pch=19, main="Mean per column")
plot((svd1$d^2/sum(svd1$d^2)), xlab = "Columns", ylab="Prop. of variance explained", pch=19, main="Variance explained")

The 3 first columns explain around 80% of the variations.

What are the 3 first columns?

names(byColMonth[,c(9,10,11,15,16)])

## [1] "Globalstrahlung"            "Luftdruck"                 
## [3] "Luftfeuchtigkeit"           "Vector Windgeschwindigkeit"
## [5] "Windrichtung"

The 6 more influential data are 2 pollution measurement and 4 weather. The same matrix with one significant pollutant only, amid all the weather information would help to identify wich connection he has with the weather phenomena:

The very useful getSvdMostInfluential⁴ function will clarify that:

if("Gmisc" %in% rownames(installed.packages()) == FALSE) {
    install.packages("Gmisc")
    }
library(Gmisc)

## Le chargement a nécessité le package : Rcpp

## Le chargement a nécessité le package : htmlTable

getSvdMostInfluential(dataMatrix, 
                      quantile=.8, 
                      similarity_threshold = .9,
                      plot_threshold = .05,
                      plot_selection = TRUE)

Let’s see if the daily measurements have the same results:

# Transforms the data frame into a matrix
# byColDay1 <- byColDay[,c(1,2,13,3,16,5,6,7,8,9,10,11,12,14,15,17,4)]

dataMatrix_day <- data.matrix(byColDay[,-c(1)])
hhd <- hclust(dist(dataMatrix_day)) #Maintenant je calcule la matrice diagonale
# str(hhd)
dataMatrix_dayOrdered <- na.omit(dataMatrix_day[hhd$order,])


svd1_day <- svd(scale(dataMatrix_dayOrdered))
#str(svd1_day)

par(mfrow=c(1,3))
image(t(dataMatrix_dayOrdered)[,nrow(dataMatrix_dayOrdered):1], main="Heatmap")
plot(colMeans(dataMatrix_dayOrdered),xlab="Column index",ylab="Column Mean",pch=19, main="Mean per column")
plot((svd1_day$d^2/sum(svd1_day$d^2)), xlab = "Columns per ratio", ylab="Prop. of variance explained", pch=19, main="Variance explained")

There are no major difference between the daily and monthly values, but we see that the variance explained gives more importance to the 6 first values than the monthly values.

This does not help to identify correlations between weather or traffic information and the air quality, because the maximum contributors are pollutants that are together, of course, more present when the air quality is low. These only shows that when the atmosphere is more polluted, there are more pollutants in it…

Let’s try to find correlations:

library(Hmisc)
library(lattice)
library(survival)
library(Formula)

# cor(byColDay[,-c(1,2)], method=c("pearson", "kendall", "spearman"))

Maybe the correlation ratios can help to find relationships between weather or traffic and some pollutants:

Correlation ratios:

The rcorr function will allow us to measure the correlation between each pair in the matrix. We exclude the first columns containing the dates.

rcorr(as.matrix(byColDay[,-c(1)]), type="pearson")$r[-c(1,2,3,4,5,6,7), -c(8,9,10,11,12,13,14,15,16,17)]

##                                    CO          NO         NO2         NOx
## Globalstrahlung            -0.2031445 -0.20857820 -0.16861371 -0.20319557
## Luftdruck                   0.3265544  0.19185046  0.27722293  0.23118433
## Luftfeuchtigkeit            0.1704773  0.22549334  0.10565246  0.19188523
## Niederschlagsdauer         -0.2671892 -0.19136108 -0.20622130 -0.20510591
## Skalar Windgeschwindigkeit -0.4146066 -0.40432166 -0.48954362 -0.45287751
## Temperatur                 -0.3958797 -0.31427284 -0.37381742 -0.34958647
## Vector Windgeschwindigkeit -0.4072990 -0.39865464 -0.48379332 -0.44693282
## Windrichtung               -0.4134102 -0.26012772 -0.31323000 -0.29074009
## Traffic                    -0.1081340  0.07972877  0.01324629  0.05912409
##                                      O3         PM10         SO2
## Globalstrahlung             0.616125016  0.004252051 -0.18239016
## Luftdruck                  -0.208863123  0.381974768  0.27339067
## Luftfeuchtigkeit           -0.601980232 -0.093581899 -0.04179275
## Niederschlagsdauer         -0.003143124 -0.418879840 -0.24677663
## Skalar Windgeschwindigkeit  0.294992967 -0.260309200 -0.05989382
## Temperatur                  0.581125732 -0.275044396 -0.58877054
## Vector Windgeschwindigkeit  0.286626022 -0.255867898 -0.05482406
## Windrichtung                0.216714974 -0.356378953 -0.31888024
## Traffic                     0.137477810 -0.144623818 -0.33958424

The columns shows the various measured pollutants, and the rows the weather and traffic data.

This matrix displays the r pearson correlation coefficients of the values in our matrix. It measures the strength (0 to 1) and the direction of the relationship between the variables (a positive value for variables moving in the same direction, a negative value if they act opposite).

We can make the result more readable thanks to STHDA⁵:

res <- rcorr(as.matrix(byColDay[,-c(1)]), type="pearson")
flattened <- flattenCorrMatrix(res$r, res$P)
flattened

##                            row                     column   cor    p
## 1                           CO                         NO  0.85 0.00
## 2                           CO                        NO2  0.90 0.00
## 3                           NO                        NO2  0.82 0.00
## 4                           CO                        NOx  0.90 0.00
## 5                           NO                        NOx  0.98 0.00
## 6                          NO2                        NOx  0.92 0.00
## 7                           CO                         O3 -0.70 0.00
## 8                           NO                         O3 -0.71 0.00
## 9                          NO2                         O3 -0.68 0.00
## 10                         NOx                         O3 -0.73 0.00
## 11                          CO                       PM10  0.77 0.00
## 12                          NO                       PM10  0.51 0.00
## 13                         NO2                       PM10  0.68 0.00
## 14                         NOx                       PM10  0.60 0.00
## 15                          O3                       PM10 -0.36 0.00
## 16                          CO                        SO2  0.67 0.00
## 17                          NO                        SO2  0.47 0.00
## 18                         NO2                        SO2  0.61 0.00
## 19                         NOx                        SO2  0.54 0.00
## 20                          O3                        SO2 -0.43 0.00
## 21                        PM10                        SO2  0.68 0.00
## 22                          CO            Globalstrahlung -0.20 0.00
## 23                          NO            Globalstrahlung -0.21 0.00
## 24                         NO2            Globalstrahlung -0.17 0.00
## 25                         NOx            Globalstrahlung -0.20 0.00
## 26                          O3            Globalstrahlung  0.62 0.00
## 27                        PM10            Globalstrahlung  0.00 0.94
## 28                         SO2            Globalstrahlung -0.18 0.00
## 29                          CO                  Luftdruck  0.33 0.00
## 30                          NO                  Luftdruck  0.19 0.00
## 31                         NO2                  Luftdruck  0.28 0.00
## 32                         NOx                  Luftdruck  0.23 0.00
## 33                          O3                  Luftdruck -0.21 0.00
## 34                        PM10                  Luftdruck  0.38 0.00
## 35                         SO2                  Luftdruck  0.27 0.00
## 36             Globalstrahlung                  Luftdruck  0.04 0.44
## 37                          CO           Luftfeuchtigkeit  0.17 0.00
## 38                          NO           Luftfeuchtigkeit  0.23 0.00
## 39                         NO2           Luftfeuchtigkeit  0.11 0.04
## 40                         NOx           Luftfeuchtigkeit  0.19 0.00
## 41                          O3           Luftfeuchtigkeit -0.60 0.00
## 42                        PM10           Luftfeuchtigkeit -0.09 0.08
## 43                         SO2           Luftfeuchtigkeit -0.04 0.43
## 44             Globalstrahlung           Luftfeuchtigkeit -0.78 0.00
## 45                   Luftdruck           Luftfeuchtigkeit -0.14 0.01
## 46                          CO         Niederschlagsdauer -0.27 0.00
## 47                          NO         Niederschlagsdauer -0.19 0.00
## 48                         NO2         Niederschlagsdauer -0.21 0.00
## 49                         NOx         Niederschlagsdauer -0.21 0.00
## 50                          O3         Niederschlagsdauer  0.00 0.95
## 51                        PM10         Niederschlagsdauer -0.42 0.00
## 52                         SO2         Niederschlagsdauer -0.25 0.00
## 53             Globalstrahlung         Niederschlagsdauer -0.48 0.00
## 54                   Luftdruck         Niederschlagsdauer -0.32 0.00
## 55            Luftfeuchtigkeit         Niederschlagsdauer  0.54 0.00
## 56                          CO Skalar Windgeschwindigkeit -0.41 0.00
## 57                          NO Skalar Windgeschwindigkeit -0.40 0.00
## 58                         NO2 Skalar Windgeschwindigkeit -0.49 0.00
## 59                         NOx Skalar Windgeschwindigkeit -0.45 0.00
## 60                          O3 Skalar Windgeschwindigkeit  0.29 0.00
## 61                        PM10 Skalar Windgeschwindigkeit -0.26 0.00
## 62                         SO2 Skalar Windgeschwindigkeit -0.06 0.25
## 63             Globalstrahlung Skalar Windgeschwindigkeit -0.19 0.00
## 64                   Luftdruck Skalar Windgeschwindigkeit -0.11 0.04
## 65            Luftfeuchtigkeit Skalar Windgeschwindigkeit -0.03 0.51
## 66          Niederschlagsdauer Skalar Windgeschwindigkeit  0.23 0.00
## 67                          CO                 Temperatur -0.40 0.00
## 68                          NO                 Temperatur -0.31 0.00
## 69                         NO2                 Temperatur -0.37 0.00
## 70                         NOx                 Temperatur -0.35 0.00
## 71                          O3                 Temperatur  0.58 0.00
## 72                        PM10                 Temperatur -0.28 0.00
## 73                         SO2                 Temperatur -0.59 0.00
## 74             Globalstrahlung                 Temperatur  0.68 0.00
## 75                   Luftdruck                 Temperatur -0.16 0.00
## 76            Luftfeuchtigkeit                 Temperatur -0.42 0.00
## 77          Niederschlagsdauer                 Temperatur -0.16 0.00
## 78  Skalar Windgeschwindigkeit                 Temperatur -0.23 0.00
## 79                          CO Vector Windgeschwindigkeit -0.41 0.00
## 80                          NO Vector Windgeschwindigkeit -0.40 0.00
## 81                         NO2 Vector Windgeschwindigkeit -0.48 0.00
## 82                         NOx Vector Windgeschwindigkeit -0.45 0.00
## 83                          O3 Vector Windgeschwindigkeit  0.29 0.00
## 84                        PM10 Vector Windgeschwindigkeit -0.26 0.00
## 85                         SO2 Vector Windgeschwindigkeit -0.05 0.30
## 86             Globalstrahlung Vector Windgeschwindigkeit -0.19 0.00
## 87                   Luftdruck Vector Windgeschwindigkeit -0.10 0.06
## 88            Luftfeuchtigkeit Vector Windgeschwindigkeit -0.03 0.54
## 89          Niederschlagsdauer Vector Windgeschwindigkeit  0.22 0.00
## 90  Skalar Windgeschwindigkeit Vector Windgeschwindigkeit  1.00 0.00
## 91                  Temperatur Vector Windgeschwindigkeit -0.24 0.00
## 92                          CO               Windrichtung -0.41 0.00
## 93                          NO               Windrichtung -0.26 0.00
## 94                         NO2               Windrichtung -0.31 0.00
## 95                         NOx               Windrichtung -0.29 0.00
## 96                          O3               Windrichtung  0.22 0.00
## 97                        PM10               Windrichtung -0.36 0.00
## 98                         SO2               Windrichtung -0.32 0.00
## 99             Globalstrahlung               Windrichtung -0.05 0.35
## 100                  Luftdruck               Windrichtung -0.26 0.00
## 101           Luftfeuchtigkeit               Windrichtung  0.04 0.45
## 102         Niederschlagsdauer               Windrichtung  0.27 0.00
## 103 Skalar Windgeschwindigkeit               Windrichtung  0.05 0.31
## 104                 Temperatur               Windrichtung  0.17 0.00
## 105 Vector Windgeschwindigkeit               Windrichtung  0.04 0.39
## 106                         CO                    Traffic -0.11 0.04
## 107                         NO                    Traffic  0.08 0.13
## 108                        NO2                    Traffic  0.01 0.80
## 109                        NOx                    Traffic  0.06 0.26
## 110                         O3                    Traffic  0.14 0.01
## 111                       PM10                    Traffic -0.14 0.01
## 112                        SO2                    Traffic -0.34 0.00
## 113            Globalstrahlung                    Traffic  0.34 0.00
## 114                  Luftdruck                    Traffic -0.14 0.01
## 115           Luftfeuchtigkeit                    Traffic -0.04 0.44
## 116         Niederschlagsdauer                    Traffic -0.07 0.21
## 117 Skalar Windgeschwindigkeit                    Traffic -0.25 0.00
## 118                 Temperatur                    Traffic  0.53 0.00
## 119 Vector Windgeschwindigkeit                    Traffic -0.25 0.00
## 120               Windrichtung                    Traffic  0.13 0.01

Or better readable in the correlation graph:

Let’s build the correlation matrix:

mcor <- cor(na.omit(byColDay[,-c(1)]))
head(mcor)

##              CO         NO        NO2        NOx         O3       PM10
## CO    1.0000000  0.8442662  0.8967970  0.9004475 -0.6989366  0.7701098
## NO    0.8442662  1.0000000  0.8177938  0.9780536 -0.7096441  0.5120767
## NO2   0.8967970  0.8177938  1.0000000  0.9197559 -0.6818326  0.6803786
## NOx   0.9004475  0.9780536  0.9197559  1.0000000 -0.7308127  0.5955489
## O3   -0.6989366 -0.7096441 -0.6818326 -0.7308127  1.0000000 -0.3617111
## PM10  0.7701098  0.5120767  0.6803786  0.5955489 -0.3617111  1.0000000
##             SO2 Globalstrahlung  Luftdruck Luftfeuchtigkeit Niederschlagsdauer
## CO    0.6701832    -0.206041760  0.3340554       0.17245977       -0.266728733
## NO    0.4683684    -0.214425363  0.1908097       0.23023934       -0.190587474
## NO2   0.6059638    -0.172675141  0.2814683       0.10929338       -0.205904754
## NOx   0.5388000    -0.208750394  0.2320314       0.19658832       -0.204523143
## O3   -0.4248422     0.623174693 -0.2150114      -0.60784536       -0.009108881
## PM10  0.6848203     0.003768149  0.3837374      -0.09335306       -0.419277005
##      Skalar Windgeschwindigkeit Temperatur Vector Windgeschwindigkeit
## CO                   -0.4155812 -0.4005473                 -0.4083642
## NO                   -0.4068303 -0.3195898                 -0.4011903
## NO2                  -0.4949809 -0.3787525                 -0.4893812
## NOx                  -0.4566520 -0.3550770                 -0.4507784
## O3                    0.2935876  0.5883707                  0.2853889
## PM10                 -0.2593049 -0.2761583                 -0.2548478
##      Windrichtung      Traffic
## CO     -0.4232931 -0.113633167
## NO     -0.2699444  0.075354250
## NO2    -0.3247486  0.007539406
## NOx    -0.3016673  0.054118776
## O3      0.2227125  0.144205304
## PM10   -0.3579222 -0.144623818

Now we use the corrplot package to display a readable matrix:

corrplot(mcor, type="lower", order="hclust", tl.col="black", tl.srt=45, main="Air quality, traffic and weather correlation", mar=c(0,0,1,0))

In red we have the negative correlations, in blue the positive.

The strongest coefficients occur for the

O3 (Ozone) levels ,that seems to be :
- positively correlated to Globalstrahlung (the global radiation) and Temperatur (the temperature) with respectively 0.62 and 0.58 coefficients
- negatively correlated to Luftfeuchtigkeit (the air humidity) with a coefficient -0.6
SO2 (Sulfur dioxide)
- negatively correlated to Temperatur (temperature) with a -0.59 value.

The pearson correlation coefficients for the other pollutants are all lower than 0.5

Conclusion, the traffic seems to have a minor influence on pollution. The traffic is lightly correlated with NO, NOx and NO2 levels, which are more correlated with the levels of O3. The more O3, the less NO, NOx, NO2 and CO. Is it due to some chemical reactions or only because one is created in the conditions that weakens the others ?

The weather condition that are more correlated to air quality are

the speed of the wind for PM10, NO2 and CO.
PM10 seems also influenced by the precipitation duration
O3 seems correlated with global radiation, the temperature and the air humidity

Given this, the parameters that I would focus on would be:

For the Air quality measurements: O3 (negatively correlated to NO, NOx, NO2 and CO, so we can ignore them), PM10 and SO2. PM2.5 is not measured on all our samples, so I will ignore it.
For the Weather conditions: Globalstrahlung, Skalar (strongly correlated with Vector) Windgeschwindigkeit, Niederschlagsdauer, Temperatur and Luftfeuchtigkeit.
I would keep an eye on traffic levels too, because it could have impact during the COVID period.

Evolution since 2007

The purpose here is to gather all the data collected since 2007 in a single data frame, keeping only the previously indexed information, in order to limit the consumption of resources.

We initialize the data frame that will collect all the data (DFAll)

DFAll <- data.frame(matrix(ncol=4,nrow=0, dimnames=list(NULL, c("Datum", "Parameter", "Wert", "Type"))))

DFAllPol will contain a larger scope of data for further investigations

DFAllPol <- data.frame(matrix(ncol=4,nrow=0, dimnames=list(NULL, c("Datum", "Parameter", "Wert", "Type"))))

Collect and reprocess the data

Download the files and import and transform all the necessary data in DFAll.

## [1] 2007
## [1] 2008
## [1] 2009
## [1] 2010
## [1] 2011
## [1] 2012
## [1] 2013
## [1] 2014
## [1] 2015
## [1] 2016
## [1] 2017
## [1] 2018
## [1] 2019
## [1] 2020
## [1] 2021

totLines <- dim(DFAll)[1]
totLines

## [1] 46390

So we got 46390 rows of data that we will now explore.

Explore the chronology

head(DFAll)

## # A tibble: 6 × 4
## # Groups:   Datum [1]
##   Datum      Parameter                  total type     
##   <date>     <chr>                      <dbl> <chr>    
## 1 2007-01-01 Globalstrahlung             6.49 Meteo    
## 2 2007-01-01 Luftfeuchtigkeit           77.7  Meteo    
## 3 2007-01-01 Niederschlagsdauer         24.8  Meteo    
## 4 2007-01-01 O3                         48.4  Pollution
## 5 2007-01-01 PM10                       14.8  Pollution
## 6 2007-01-01 Skalar Windgeschwindigkeit  3.42 Meteo

We need to explode the various Parameter into disctinct columns to build our diagrams.

DFAllc <- DFAll[,-c(4)] %>% spread(Parameter, total)

The columns of DFAllc contain measured values in disparate units and scales. We will use the R scale function to make them consistent and to compare them.

DFAllcs <- DFAllc
DFAllcs[,-1] <- scale(DFAllc[,-1], scale = TRUE)
head(DFAllcs)

## # A tibble: 6 × 10
## # Groups:   Datum [6]
##   Datum      Globalstrahl…¹ Luftf…² Niede…³      O3   PM10 Skala…⁴   SO2 Tempe…⁵
##   <date>              <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl> <dbl>   <dbl>
## 1 2007-01-01         -1.32   0.645   1.96    0.168  -0.377   2.19  1.45   -0.337
## 2 2007-01-02         -1.18   0.891   2.08    0.700  NA       2.39  0.176  -1.02 
## 3 2007-01-03         -0.892  0.397  -0.386  -0.0379 -0.276   0.964 0.698  -0.901
## 4 2007-01-04         -1.21   0.148   0.309  -0.0469 -0.741   2.11  1.21   -0.704
## 5 2007-01-05         -1.13   0.0332 -0.0745  0.557  -0.767   1.80  0.336  -0.610
## 6 2007-01-06         -1.08  -0.181  -0.615  -0.178  -0.613   0.823 1.58   -0.493
## # … with 1 more variable: Traffic <dbl>, and abbreviated variable names
## #   ¹Globalstrahlung, ²Luftfeuchtigkeit, ³Niederschlagsdauer,
## #   ⁴`Skalar Windgeschwindigkeit`, ⁵Temperatur

Let’s see if the result in a diagram coincides with what the correlation matrix calculated above

We need first to load the TSstudio package which allows to draw and manipulate time series

if("TSstudio" %in% rownames(installed.packages()) == FALSE) {
    install.packages("TSstudio")
    }
library(TSstudio)

Then we can draw our time series

# First convert Datum into a date
DFAllcs$Datum <- as.Date(DFAllcs$Datum, "%Y-%m-%d")
# DFAllbmc$Monat <- as.Date(DFAllbmc$Monat, "%Y-%m-%d")

t <- ts_plot(DFAllcs,
             title = "Air quality, weather and traffic progression since 2007",
             Xtitle = "Time",
             Ytitle = "Scaled value",
             slider=TRUE)
t

You can manipulate the graph with the scale bar under the picture and select the values to display in the legend.

If you select only “O3”, Globalstrahlung” and “Luftfeuchtigkeit”, you will note a positive correlation between “O3” levels and the “Globalstrahlung” and a negative correlation with “Luftfeuchtigkeit”.

But this details level is quite confusing, and the result will be more obvious if we draw the same diagram on a monthly basis:

In this view, you can focus on the parameters and period that you wish.

Specific pollutants views

The following schemes are focused on the pollutants and weather conditions evocated formerly:

O3 levels

The O3 was calculated to be correlated with global radiation, temperature and the air humidity with 0.62, 0.58 and -0.6 coefficients. Here is the curve of these observations since 2007:

t <- ts_plot(DFAllbmcs[,c(1,2,3,5,9)],
        title = "O3 vs global radiation, temperature and air humidity",
        line.mode="lines+markers",
        Xtitle = "Time",
        Ytitle = "Scaled value",
        slider=TRUE)
t

Ozone levels are strongly and sustainably correlated with humidity, global radiation and temperature levels.

We noticed that O3 was strongly correlated with negatively correlated to NO, NOx, NO2 and CO. So we could infer that what affects the O3 levels is although affecting these pollutants levels. It would be interesting to see if these correlation levels are stable on the long term.

Correlation between O3 and the NO, NOx, NO2 and CO pollutants

To do that, we have to reintroduce these measurements in the data frame and scale it again.

We already prepared these data in DFAllPol

# To do that, I have to reintroduce these measurements in the data frame and scale it again:
DFAllPolbm <- DFAllPol %>%  mutate(Monat = format(Datum, "%Y-%m")) %>% mutate(Monat = paste0(Monat, "-01")) %>% group_by(Monat, Parameter, type) %>% summarise(total = mean(total, na.rm=T))

# Convert the Monat and Parameter columns into factors for incoming manipulations
DFAllPolbm$Monat <- factor(DFAllPolbm$Monat)
DFAllPolbm$Parameter <- factor(DFAllPolbm$Parameter)

# head(DFAllbm, 20)

DFAllPolbmc <- DFAllPolbm[,-c(3)] %>% spread(Parameter, total)
DFAllPolbmcs <- DFAllPolbmc
DFAllPolbmcs[,-1] <- scale(DFAllPolbmc[,-1], scale = TRUE)

#head(DFAllbmcs)
#dim(DFAllbmcs)

# First convert Monat into a date
DFAllPolbmcs$Monat <- as.Date(DFAllPolbmcs$Monat, "%Y-%m-%d")
# DFAllbmc$Monat <- as.Date(DFAllbmc$Monat, "%Y-%m-%d")

t <- ts_plot(DFAllPolbmcs[,c(1,2,6,7,8,9)],
        title = "O3 vs NO, NO2, NOx and CO levels",
        line.mode="lines+markers",
        Xtitle = "Time",
        Ytitle = "Scaled value",
        slider=TRUE)
t

The correlation of the different pollutants with O3 is confirmed on the long term except for CO which seems to weaken between 2018 and 2021, where the O3 curve continues as sinusoidal with a large amplitude while that of CO seems to settle.

SO2 levels

We noticed that SO2 was negatively correlated to temperature with a 0.59 coefficient, let’s see this on the long duration:

t <- ts_plot(DFAllbmcs[,c(1,8,9)],
        title = "SO2 vs temperature",
        line.mode="lines+markers",
        Xtitle = "Time",
        Ytitle = "Scaled value",
        slider=TRUE)
t

Looking at the curves of SO2 versus temperatures here above, you can notice that the correlation is flagrant from 2007 to a date around 2015. But after this point of the time, the variations of the temperatures follow the same pace, while SO2 remains lower and seems to stabilize.

I deduce from this that this correlation was in no way a causal relationship, or that a third factor, linked to the level of temperatures, such as the use of home heating for example, disappeared or was transformed in order to emit less sulfur dioxide.

PM10 levels

PM10 is supposed to be correlated with the speed of the wind (-0.26), it’s direction (-0.36) and the precipitation duration (-0.42), atmospheric pressure (0.38) and temperature (-0.28). Traffic is quite low with a -0.14 coefficient. So we see a weak correlation with several events…

Then, for PM10, there are no dominant factor. Precipitation duration and the atmospheric pressure seems to be more influential than the other parameters, but if we guess that it would have positive effect only combined with the others.

t <- ts_plot(DFAllbmcs[,c(1,4,6,7,9,10)],
        title = "PM10 vs temperature, precipitation duration, speed of the wind and traffic",
        line.mode="lines+markers",
        Xtitle = "Time",
        Ytitle = "Scaled value",
        slider=TRUE)
t

Maybe the sum of weak correlations gives a major correlation with a set of different intrans? We will have a look on the result when we add the absolute value of the environmental data together:

DFAllbmcs[is.na(DFAllbmcs)] <- 0
DFAllbmcs$total <- abs(DFAllbmcs$Temperatur) + abs(DFAllbmcs$Traffic) + abs(DFAllbmcs$Niederschlagsdauer) + abs(DFAllbmcs$`Skalar Windgeschwindigkeit`)

t <- ts_plot(DFAllbmcs[,c(1,4,6,7,9,10)],
        title = "PM10 vs temperature, precipitation duration, speed of the wind and traffic",
        line.mode="lines+markers",
        Xtitle = "Time",
        Ytitle = "Scaled value",
        slider=TRUE)
t

If you select separately each value with the PM10 levels, you will see that there is no constant correlation on the whole period. This explain the poor correlation calculated for each element.

cor(DFAllbmcs$PM10, DFAllbmcs$total)

## [1] -0.0255194

A -0.02 coefficient factor is very low value that explains the messy curves that we get here over.

But I still hope to find a combination of factors that will allow us to obtain a significant coefficient of correlation and thus a means of acting against this pollution.

Let’s see the coefficients we get by combining these factors two by two with respect to PM10 levels. We have four potential factors, which gives us six groups of two to evaluate:

cor(DFAllbmcs$PM10, abs(DFAllbmcs$Temperatur) + abs(DFAllbmcs$Traffic))

## [1] 0.1406732

cor(DFAllbmcs$PM10, abs(DFAllbmcs$Temperatur) + abs(DFAllbmcs$`Skalar Windgeschwindigkeit`))

## [1] 0.007228167

cor(DFAllbmcs$PM10, abs(DFAllbmcs$Temperatur) + abs(DFAllbmcs$Niederschlagsdauer))

## [1] 0.04768694

cor(DFAllbmcs$PM10, abs(DFAllbmcs$Niederschlagsdauer) + abs(DFAllbmcs$`Skalar Windgeschwindigkeit`))

## [1] -0.1382358

cor(DFAllbmcs$PM10, abs(DFAllbmcs$Niederschlagsdauer) + abs(DFAllbmcs$Traffic))

## [1] -0.04316475

cor(DFAllbmcs$PM10, abs(DFAllbmcs$Traffic) + abs(DFAllbmcs$`Skalar Windgeschwindigkeit`))

## [1] -0.07167835

Well, the values are still much too low to be significant.

The combinations of

temperature and traffic only have a 0.14 coefficient.
temperature associated with the wind speed only get 0.007
temperature and precipitation duration obtain only 0.04
etc…

Perhaps by combining our parameters in groups of 3 we will get better results?

cor(DFAllbmcs$PM10, abs(DFAllbmcs$Temperatur) + abs(DFAllbmcs$Traffic) + abs(DFAllbmcs$`Skalar Windgeschwindigkeit`))

## [1] 0.01762356

cor(DFAllbmcs$PM10, abs(DFAllbmcs$Temperatur) + abs(DFAllbmcs$Traffic) + abs(DFAllbmcs$Niederschlagsdauer))

## [1] 0.05157836

cor(DFAllbmcs$PM10, abs(DFAllbmcs$Temperatur) + abs(DFAllbmcs$Niederschlagsdauer) + abs(DFAllbmcs$`Skalar Windgeschwindigkeit`))

## [1] -0.04156098

cor(DFAllbmcs$PM10, abs(DFAllbmcs$Niederschlagsdauer) + abs(DFAllbmcs$Traffic) + abs(DFAllbmcs$`Skalar Windgeschwindigkeit`))

## [1] -0.1007072

The results are final. We cannot conclude that there is a correlation and even less causality between PM10 levels and our weather or traffic factors. So the truth is elsewhere.

Conclusion

It would have been surprising if this very easy-to-access data had provided us with a simple analysis revolutionary information on the causal links between certain weather events or the level of traffic and the levels of different pollutants measured in the air of the city of Zurich.

We were able, however, to observe that the levels of many pollutants are highly correlated, leaving the hope that finding the cause of the increase in one of them might help to combat others.

We noticed that NO, NOx, NO2 and CO are strongly positive correlated. We can hope that there is a common factor explaining their presence. But as O3 is negatively correlated to these pollutants and revealed strongly and sustainably correlated with humidity, global radiation and temperature levels, we can imagine that these factors have an impact too on all of them. Unfortunately, these pollutants have a strong negative correlation with ozone levels. It is therefore conceivable that if a lever could be found to lower the level of Ozone, it would irreparably cause the levels of NOx, NO, NO2 and carbon monoxide to increase.

But these correlations are not causal links. Weather events that fluctuate as pollution levels change are only factors that aggregate or mitigate the effects of pollution from elsewhere.

We hoped to find an easy culprit in car traffic, at least for the pollutants known to be highly emitted by exhaust gases. But the decorrelation between traffic and pollution levels, even if it does not deny its harmful nature for health and the environment, prove that one or more other causes not measured in this study are at work.

The SO2 levels are an interesting case. Strongly correlated with the temperature levels between 2007 and 2015, they then seem to unravel. As temperature variations continue over the same range, the SO2 level appears to gradually decline and have lower peaks. Perhaps policies to combat emissions of this gas have borne fruit, or this correlation before 2015 was just fortuitous.

But what this analysis seems to reveal is that variations in weather events and pollution levels are linked. Weather acts as a catalyst or inhibitor on the presence of pollutants, and that in a changing climate environment in which extreme episodes amplify and multiply, the same levels of pollutant emissions are likely to cause greater harm in the future. But this is only a hypothesis that will have to be verified with precise data on the effects of pollutants, in particular on health.

Appendix

library(dplyr)
library(tidyr)
library(ggplot2)
# install.packages("corrplot")
if("corrplot" %in% rownames(installed.packages()) == FALSE) {
    install.packages("corrplot")
    }
library(corrplot)

# Download each file
if(!exists("./data")){dir.create("./data")}

## Warning in dir.create("./data"): '.\data' existe déjà

for(i in years) {
    print(i)
    fileUrl <- paste0(url_path, filename, i, type, sep = "", recycle0 = TRUE)
    locfilename <- paste0("./data/", filename, i, type, sep = "", recycle0 = TRUE)
    
    m_fileUrl <- paste0(m_url_path, m_filename, i, type, sep = "", recycle0 = TRUE)
    
    m_locfilename <- paste0("./data/", m_filename, i, type, sep = "", recycle0 = TRUE)
    
    t_fileUrl <- paste0(t_url_path, t_filename, i, type, sep = "", recycle0 = TRUE)
    t_locfilename <- paste0("./data/", t_filename, i, type, sep = "", recycle0 = TRUE)
    
    # Collect and access the data from internet:
    if(!exists(locfilename)){
        download.file(fileUrl, destfile = locfilename, method = "curl")}
    
    if(!exists(m_locfilename)){
        download.file(m_fileUrl, destfile = m_locfilename, method = "curl")}
    
    if(!exists(t_locfilename)){
        download.file(t_fileUrl, destfile = t_locfilename, method = "curl")}
    
    # Collect the data
    DFPol <- read.csv(locfilename)
    dim(DFPol)
        
    DFMet <- read.csv(m_locfilename)
    dim(DFMet)
        
    DFTra <- read.csv(t_locfilename)
    dim(DFTra)
    
    # Then clean data
    
    # Rename the parameters
    DFMet[DFMet$Parameter == "p", ]$Parameter <- "Luftdruck"
    DFMet[DFMet$Parameter == "RainDur", ]$Parameter <- "Niederschlagsdauer"
    DFMet[DFMet$Parameter == "StrGlo", ]$Parameter <- "Globalstrahlung"
    DFMet[DFMet$Parameter == "T", ]$Parameter <- "Temperatur"
    DFMet[DFMet$Parameter == "Hr", ]$Parameter <- "Luftfeuchtigkeit"
    DFMet[DFMet$Parameter == "WD", ]$Parameter <- "Windrichtung"
    DFMet[DFMet$Parameter == "WVv", ]$Parameter <- "Vector Windgeschwindigkeit"
    DFMet[DFMet$Parameter == "WVs", ]$Parameter <- "Skalar Windgeschwindigkeit"

    # Exclude the unecessary columns
    DFMet <- DFMet[, c(1,2,3,5,6)]
    DFPol <- DFPol[, c(1,2,3,5,6)]
    DFTra1 <- DFTra[, c(1,8)]
    
    # I delete the hours to keep only the date in Datum
    DFPol$Datum <- as.Date(substr(DFPol$Datum, 1, 10), format = "%Y-%m-%d")
    DFMet$Datum <- as.Date(substr(DFMet$Datum, 1, 10), format = "%Y-%m-%d")
    DFTra$Datum <- as.Date(substr(DFTra$Datum, 1, 10), format = "%Y-%m-%d")
    
    # I delete the NA created
    DFPol <- DFPol[!is.na(DFPol$Wert), ]
    DFMet <- DFMet[!is.na(DFMet$Wert), ]
    DFTra <- DFTra[!is.na(DFTra$Anzahl), ]
    
    # Then group the data per day:
    PolbyDay <- DFPol %>% group_by(Datum, Standort, Parameter) %>% summarise(total = mean(Wert, na.rm=T))
    MetbyDay <- DFMet %>% group_by(Datum, Standort, Parameter) %>% summarise(total = mean(Wert, na.rm=T))
    head(PolbyDay)
    head(MetbyDay)
    TrabyDay <- DFTra %>% group_by(Datum, Parameter = "Traffic") %>% summarise(total = sum(Anzahl, na.rm=T))
    
    # Only keep the "Zch_Stampfenbachstrasse" place:
    PolbyDay <- PolbyDay[PolbyDay$Standort == "Zch_Stampfenbachstrasse", ]
    MetbyDay <- MetbyDay[MetbyDay$Standort == "Zch_Stampfenbachstrasse", ]
    
    # The data frame with more pollutants for further investigations
    AllPolbyDay <- PolbyDay
    
    # Only keep the parameters on which we decided to focus:
    # O3, PM10, SO2 for air quality
    # "Globalstrahlung", "Skalar Windgeschwindigkeit", "Niederschlagsdauer", "Temperatur", "Luftfeuchtigkeit" for Meteo
    PolbyDay <- PolbyDay[(PolbyDay$Parameter == "O3" | PolbyDay$Parameter == "PM10" | PolbyDay$Parameter == "SO2"), ]
    MetbyDay <- MetbyDay[(MetbyDay$Parameter == "Globalstrahlung" | MetbyDay$Parameter == "Skalar Windgeschwindigkeit" | MetbyDay$Parameter == "Niederschlagsdauer" | MetbyDay$Parameter == "Temperatur" | MetbyDay$Parameter == "Luftfeuchtigkeit"), ]
    
    
    
    # Get rid of the "Standort" column
    AllPolbyDay <- AllPolbyDay[, c(1,3,4)]
    PolbyDay <- PolbyDay[, c(1,3,4)]
    MetbyDay <- MetbyDay[, c(1,3,4)]
    
    # Add the origin label
    AllPolbyDay$type <- "Pollution"
    PolbyDay$type <- "Pollution"
    MetbyDay$type <- "Meteo"
    TrabyDay$type <- "Traffic"
    
    # add it in the DFAll
    DFAll <- rbind(DFAll, PolbyDay, MetbyDay, TrabyDay)
    
    # And reorder the data
    DFAll <- DFAll[order(DFAll$Datum, DFAll$Parameter),]
    
    # Same keeping all pollutants for further investigations
    DFAllPol <- rbind(DFAllPol, AllPolbyDay, MetbyDay, TrabyDay)
    DFAllPol <- DFAllPol[order(DFAllPol$Datum, DFAllPol$Parameter),]
}

## [1] 2007

## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.

## [1] 2008

## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.

## [1] 2009

## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.

## [1] 2010

## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.

## [1] 2011

## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.

## [1] 2012

## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.

## [1] 2013

## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.

## [1] 2014

## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.

## [1] 2015

## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.

## [1] 2016

## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.

## [1] 2017

## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.

## [1] 2018

## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.

## [1] 2019

## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.

## [1] 2020

## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.

## [1] 2021

## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum', 'Standort'. You can override using
## the `.groups` argument.
## `summarise()` has grouped output by 'Datum'. You can override using the
## `.groups` argument.

# Daily results are messy. Let's try monthly
DFAllbm <- DFAll %>%  mutate(Monat = format(Datum, "%Y-%m")) %>% mutate(Monat = paste0(Monat, "-01")) %>% group_by(Monat, Parameter, type) %>% summarise(total = mean(total, na.rm=T))

## `summarise()` has grouped output by 'Monat', 'Parameter'. You can override
## using the `.groups` argument.

# Convert the Monat and Parameter columns into factors for incoming manipulations
DFAllbm$Monat <- factor(DFAllbm$Monat)
DFAllbm$Parameter <- factor(DFAllbm$Parameter)

# head(DFAllbm, 20)

DFAllbmc <- DFAllbm[,-c(3)] %>% spread(Parameter, total)
DFAllbmcs <- DFAllbmc
DFAllbmcs[,-1] <- scale(DFAllbmc[,-1], scale = TRUE)

#head(DFAllbmcs)
#dim(DFAllbmcs)

# First convert Monat into a date
DFAllbmcs$Monat <- as.Date(DFAllbmcs$Monat, "%Y-%m-%d")
# DFAllbmc$Monat <- as.Date(DFAllbmc$Monat, "%Y-%m-%d")

t <- ts_plot(DFAllbmcs,
        title = "Air quality, weather and traffic monthly progression since 2007",
        line.mode="lines+markers",
        Xtitle = "Time",
        Ytitle = "Scaled value",
        slider=TRUE)
t

Impact of traffic and weather on pollution levels in Zurich

Hervé Schilling

2023-01-14

Context

Synopsis

Exploratory analysis

Initialization of the data

Data exploration

Pollution and weather data

Traffic data:

Cleaning data

Let’s try clustering

Clustering per Month:

Let’s try to find correlations:

Correlation ratios:

We can make the result more readable thanks to STHDA⁵:

Or better readable in the correlation graph:

Evolution since 2007

Collect and reprocess the data

Explore the chronology

Specific pollutants views

O3 levels

Correlation between O3 and the NO, NOx, NO2 and CO pollutants

SO2 levels

PM10 levels

Conclusion

Appendix

Impact of traffic and weather on pollution levels in Zurich

Hervé Schilling

2023-01-14

Context

Synopsis

Exploratory analysis

Initialization of the data

Data exploration

Pollution and weather data

Traffic data:

Cleaning data

Let’s try clustering

Clustering per Month:

Let’s try to find correlations:

Correlation ratios:

We can make the result more readable thanks to STHDA5:

Or better readable in the correlation graph:

Evolution since 2007

Collect and reprocess the data

Explore the chronology

Specific pollutants views

O3 levels

Correlation between O3 and the NO, NOx, NO2 and CO pollutants

SO2 levels

PM10 levels

Conclusion

Appendix

We can make the result more readable thanks to STHDA⁵: