Processing data report for JEFF station

Introduction

The pollution data measured at this station were provided by Prof. Ouarzazi in 2013. They comprise hourly values of CO, NO2, wind speed, temperature, PM10, SO2, solar radiation and ozone.

setwd("~/git/ouarzazi")
suppressPackageStartupMessages(library(googleVis))
suppressPackageStartupMessages(library(xtable))
suppressPackageStartupMessages(library(Peaks))
suppressPackageStartupMessages(library(magic))
suppressPackageStartupMessages(library(segmented))
suppressPackageStartupMessages(library(fftw))
suppressPackageStartupMessages(library(FITSio))
suppressPackageStartupMessages(library(stringr))
suppressPackageStartupMessages(library(utils))
suppressPackageStartupMessages(library(e1071))
suppressPackageStartupMessages(library(randomForest))
suppressPackageStartupMessages(library(rpart))
suppressPackageStartupMessages(library(nnet))
suppressPackageStartupMessages(library(party))
suppressPackageStartupMessages(library(partykit))

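# Load the cached data if available; otherwise build it from the raw CSV.
if (file.exists("JEF.RData")) {
    load("JEF.RData")
} else {
    # The CSV uses ";" as field separator and "," as decimal mark.
    JEF <- read.csv(file = "JEF_data.csv", sep = ";", dec = ",", header = TRUE)
    # Keep only rows whose first column parses as a dd/mm/yy date, and the first 11 columns.
    DJEF <- JEF[!is.na(as.Date(as.character(JEF[, 1]), format = "%d/%m/%y")), 1:11]
    # Combine date (column 1) and hour (column 2) into a single timestamp.
    newc <- paste(as.character(DJEF[, 1]), paste(DJEF[, 2], ":00:00", sep = ""), 
        sep = " ")
    newd <- strptime(newc, "%d/%m/%y %H:%M:%S")
    # Drop the hour column and store the timestamp in column 1.
    NJEF <- DJEF[, -2]
    NJEF[, 1] <- as.POSIXct(newd)
    save(JEF, DJEF, NJEF, file = "JEF.RData")
}
# The raw table is no longer needed.
rm(JEF)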
Date: Min. 2009-06-01 01:00:00, 1st Qu. 2009-10-01 00:00:00, Median 2010-02-13 20:00:00, Mean 2010-02-09 15:54:17, 3rd Qu. 2010-06-15 21:00:00, Max. 2010-11-17 19:00:00

Variable   Min.    1st Qu.  Median  Mean    3rd Qu.  Max.
CO.JEF     0.010   0.030    0.070   0.159   0.140    4.640
NO2.JEF    1.0     12.0     21.0    23.4    32.0     108.0
O3.JEF     1.0     28.0     43.0    46.1    62.0     157.0
PM10.JEF   1       32       47      55      69       457
SO2.JEF    1.0     8.0      9.0     10.3    13.0     48.0
HR.JEF     6.0     37.0     53.0    53.6    71.0     100.0
TC.JEF     5.0     16.6     21.0    21.7    26.2     43.9
Vent.JEF   0.100   0.600    0.800   0.902   1.100    5.000
Rad.Sol    0.0     0.0      11.1    214.9   427.3    1092.0

Numerical treatment is performed with the well-known open source statistical environment R (http://www.r-project.org).
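The timestamp construction used when loading the data can be illustrated on a toy data frame. This is only a sketch: the column names `day` and `hour`, the values, and the fixed UTC time zone are assumptions made for the example, not properties of the original CSV.

```r
# Toy stand-in for the raw file: one date column and one hour column
# (hypothetical names and values).
raw <- data.frame(
  day  = c("01/06/09", "01/06/09", "02/06/09"),
  hour = c(1L, 2L, 13L),
  O3   = c(40, 38, 61)
)

# Build "dd/mm/yy HH:00:00" strings and parse them in a fixed time zone.
stamp  <- paste(raw$day, sprintf("%02d:00:00", raw$hour))
parsed <- as.POSIXct(strptime(stamp, "%d/%m/%y %H:%M:%S", tz = "UTC"))

# Replace the two raw columns by a single timestamp column.
clean <- data.frame(Date = parsed, O3 = raw$O3)
print(clean)
```

Fixing the time zone explicitly avoids daylight-saving gaps when hourly stamps are parsed in a local zone.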

Processing

To compare with Prof. Ouarzazi's result (correlation of 0.84) for a local model of O3 as a function of the remaining variables over the same period, we will apply several modelling techniques.

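The basic methodology is to fit each of the following model families in turn and to compare the correlation between measured and predicted O3.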

Linear approach as reference

A linear model is taken as the reference against which the other methods are compared; it also indicates how linear the relationship is.


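# Fit the linear model with 10-fold cross-validation (e1071::tune),
# or reuse the cached result if it exists.
if (file.exists("JEF_lm.RData")) {
    load("JEF_lm.RData")
} else {
    # Column 1 (Date) is excluded from the predictors.
    objJEF <- tune(lm, O3.JEF ~ ., data = NJEF[, -1], tunecontrol = tune.control(sampling = "cross", 
        cross = 10))
    JEF.lm <- objJEF
    pO3 <- predict(JEF.lm$best.model, NJEF[, -c(1, 4)])
    cr.lm <- cor(as.numeric(NJEF[, 4]), as.numeric(pO3))
    save(JEF.lm, NJEF, cr.lm, file = "JEF_lm.RData")
}
# Predict O3 (column 4) from the remaining variables and correlate
# the predictions with the measurements (same data as used for fitting).
pO3 <- predict(JEF.lm$best.model, NJEF[, -c(1, 4)])
cr.lm <- cor(as.numeric(NJEF[, 4]), as.numeric(pO3))
# Scatter plot of measured versus predicted O3 with the 1:1 line.
par(mgp = c(2.2, 0.45, 0), tcl = 0.4, mar = c(3.3, 3.6, 1.1, 1.1))
plot(as.numeric(NJEF[, 4]), as.numeric(pO3), main = expression(O[3] ~ Prediction. ~ 
    JEF ~ station), xlab = expression(O[3] ~ real), ylab = expression(O[3] ~ 
    LM ~ predicted))
lines(c(0, 200), c(0, 200), col = 2)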

[Figure: LM-predicted versus measured O3 at the JEF station, with the 1:1 reference line]

         dummyparameter  error  dispersion
Min.     0               252    24.3
1st Qu.  0               252    24.3
Median   0               252    24.3
Mean     0               252    24.3
3rd Qu.  0               252    24.3
Max.     0               252    24.3

The resulting correlation is 0.7644. It will be taken as the reference.
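To make the "error" and "dispersion" columns easier to read, the following base R sketch reproduces by hand what a 10-fold cross-validation of a linear model computes. According to the e1071 documentation, tune() by default reports the mean squared error for numeric responses and aggregates the per-fold values with mean and sd; the synthetic data and the variable names below are assumptions made for illustration only.

```r
set.seed(1)

# Synthetic stand-in for the station data (hypothetical relationship).
n <- 200
d <- data.frame(NO2 = runif(n, 1, 100), TC = runif(n, 5, 40))
d$O3 <- 20 + 1.5 * d$TC - 0.3 * d$NO2 + rnorm(n, sd = 5)

# Assign each row to one of 10 folds at random.
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds and score the held-out fold with the mean squared error.
fold_mse <- sapply(seq_len(k), function(i) {
  fit  <- lm(O3 ~ ., data = d[fold != i, ])
  pred <- predict(fit, newdata = d[fold == i, ])
  mean((d$O3[fold == i] - pred)^2)
})

cv_error      <- mean(fold_mse)  # analogous to the "error" column
cv_dispersion <- sd(fold_mse)    # analogous to the "dispersion" column

# Correlation between measured and fitted values on the full data,
# computed in the same way as cr.lm in the report.
fit_all <- lm(O3 ~ ., data = d)
cr <- cor(d$O3, predict(fit_all, d))
print(c(error = cv_error, dispersion = cv_dispersion, cor = cr))
```

Read this way, an error of 252 corresponds to a root mean squared error of about 15.9 in the units of O3.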

SVM approach

A tuning wrapper for SVM-based regressors is applied to search for the best learning parameters.

[Figure: SVM results for the JEF station]

         gamma  cost  error  dispersion
Min.     0.125  2.00  127    9.12
1st Qu.  0.125  2.00  130    9.80
Median   0.250  4.00  133    10.13
Mean     0.292  4.67  137    10.02
3rd Qu.  0.500  8.00  143    10.34
Max.     0.500  8.00  151    10.77

The resulting correlation is 0.9509, which exceeds the 0.84 obtained by Prof. Ouarzazi.
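The parameter search described in this section can be sketched with the same tune() wrapper used for the linear model. The code of the original chunk is not shown in the report, so this is an assumption: the grid (gamma in 2^-3 to 2^-1, cost in 2^1 to 2^3) is merely chosen to be consistent with the summary table above, and the data are synthetic.

```r
library(e1071)
set.seed(1)

# Synthetic stand-in for the station data (hypothetical relationship).
n <- 150
d <- data.frame(NO2 = runif(n, 1, 100),
                TC  = runif(n, 5, 40),
                Rad = runif(n, 0, 1000))
d$O3 <- 20 + 30 * sin(d$TC / 8) + 0.02 * d$Rad - 0.2 * d$NO2 + rnorm(n, sd = 4)

# Grid search over gamma and cost with 10-fold cross-validation.
svm_tune <- tune(svm, O3 ~ ., data = d,
                 ranges = list(gamma = 2^(-3:-1), cost = 2^(1:3)),
                 tunecontrol = tune.control(sampling = "cross", cross = 10))

print(svm_tune$best.parameters)

# Correlation between measured values and predictions of the best model.
cr_svm <- cor(d$O3, predict(svm_tune$best.model, d))
print(cr_svm)
```

The `performances` element of the result holds one row per grid point, which is what the summary table condenses.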

RandomForest

Next we test the random forest approach.

[Figure: random forest results for the JEF station]

         mtry  ntree  error  dispersion
Min.     3.00  300    124    8.67
1st Qu.  3.75  600    125    8.94
Median   4.50  900    125    9.04
Mean     4.50  900    125    9.07
3rd Qu.  5.25  1200   126    9.20
Max.     6.00  1500   126    9.50

The resulting correlation is 0.9829.
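A possible form of the random forest search is sketched below. The original chunk is not shown, so the grid (mtry from 3 to 6, ntree from 300 to 1500 in steps of 300) is an assumption chosen to be consistent with the summary table; the data are synthetic, and 3-fold cross-validation is used only to keep the example quick. The sketch also contrasts predictions on the training rows with out-of-bag predictions, since it is not stated which of the two the reported correlation is based on.

```r
library(e1071)
library(randomForest)
set.seed(1)

# Eight hypothetical predictors X1..X8, as many as the station data has.
n <- 150
d <- data.frame(matrix(runif(n * 8), ncol = 8))
d$O3 <- 50 + 40 * d$X1 * d$X2 - 20 * d$X3 + rnorm(n, sd = 3)

# Grid search over mtry and ntree.
rf_tune <- tune(randomForest, O3 ~ ., data = d,
                ranges = list(mtry = 3:6, ntree = seq(300, 1500, by = 300)),
                tunecontrol = tune.control(sampling = "cross", cross = 3))

print(rf_tune$best.parameters)

# Predictions on the training rows versus out-of-bag predictions
# (predict() without newdata returns the out-of-bag values).
cr_in  <- cor(d$O3, predict(rf_tune$best.model, d))
cr_oob <- cor(d$O3, predict(rf_tune$best.model))
print(c(in_sample = cr_in, out_of_bag = cr_oob))
```

The out-of-bag figure is usually the lower and more conservative of the two.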

FFNN: MLP

Next we test a multilayer perceptron (MLP), a feed-forward neural network trained by backpropagation.

[Figure: MLP results for the JEF station]

Parameter   Min.    1st Qu.  Median  Mean    3rd Qu.  Max.
size        4       8        12      12      16       20
maxit       50000   50000    50000   50000   50000    50000
decay       0.0156  0.0156   0.0156  0.0156  0.0156   0.0156
abstol      1e-06   1e-06    1e-06   1e-06   1e-06    1e-06
reltol      1e-07   1e-07    1e-07   1e-07   1e-07    1e-07
rang        0       0        0       0       0        0
Var9        7       7        7       7       7        7
error       187     200      206     205     210      220
dispersion  10.7    14.8     17.1    19.3    24.4     31.0

Logical parameters (17 configurations, no NA's): linout = TRUE, trace = FALSE, skip = TRUE.

The resulting correlation is 0.8329.
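The column names in the table above (size, decay, linout, skip, maxit, ...) match arguments of nnet() from the nnet package, so the search may have looked roughly like the sketch below. This is an assumption, not the original chunk: the data are synthetic, and the grid is reduced to three hidden-layer sizes with maxit = 500 and 3-fold cross-validation so that the example runs quickly, whereas the table suggests sizes 4 to 20 and maxit = 50000.

```r
library(e1071)
library(nnet)
set.seed(1)

# Synthetic stand-in data with four hypothetical predictors X1..X4.
n <- 150
d <- data.frame(matrix(runif(n * 4), ncol = 4))
d$O3 <- 1 + sin(3 * d$X1) + d$X2 * d$X3 + rnorm(n, sd = 0.1)

# Search over the hidden layer size; linear output unit for regression,
# skip-layer connections enabled, weight decay 2^-6 (about 0.0156).
mlp_tune <- tune(nnet, O3 ~ ., data = d,
                 ranges = list(size = c(4, 12, 20), decay = 2^-6),
                 linout = TRUE, skip = TRUE, maxit = 500, trace = FALSE,
                 tunecontrol = tune.control(sampling = "cross", cross = 3))

print(mlp_tune$best.parameters)

# Correlation between measured values and predictions of the best network.
cr_mlp <- cor(d$O3, as.numeric(predict(mlp_tune$best.model, d)))
print(cr_mlp)
```

Because the network weights are initialised at random, results vary between runs unless the seed is fixed.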

CART solution

Now we will use classification and regression trees to have a look at their capabilities for this particular problem.

[Figure: CART results for the JEF station]

         error  dispersion
Min.     289    14.3
1st Qu.  289    14.3
Median   289    14.3
Mean     289    14.3
3rd Qu.  289    14.3
Max.     289    14.3

control: list of 9 elements with lengths 1, 1, 10, 1, 1, 1, 1, 1, 1 (class -none-, mode numeric)

The resulting correlation is 0.72.
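A regression tree can be tuned through the same wrapper. The sketch below is an assumption about how such a search might look: the report's table shows a single configuration with a fixed control object, whereas here the complexity parameter cp of rpart is varied over hypothetical values on synthetic data.

```r
library(e1071)
library(rpart)
set.seed(1)

# Synthetic stand-in data with a threshold effect (hypothetical).
n <- 200
d <- data.frame(TC = runif(n, 5, 40), Rad = runif(n, 0, 1000))
d$O3 <- 30 + 25 * (d$TC > 25) + 0.02 * d$Rad + rnorm(n, sd = 4)

# Search over the complexity parameter with 10-fold cross-validation.
cart_tune <- tune(rpart, O3 ~ ., data = d,
                  ranges = list(cp = c(0.001, 0.01, 0.05)),
                  tunecontrol = tune.control(sampling = "cross", cross = 10))

print(cart_tune$best.parameters)

# Correlation between measured values and predictions of the best tree.
cr_cart <- cor(d$O3, predict(cart_tune$best.model, d))
print(cr_cart)
```

A tree predicts one constant per leaf, so its scatter plot of predicted against measured values shows horizontal bands rather than a cloud.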

Conclusions

After this short analysis we can conclude that:

   Method  Cor
1  LM      0.76
2  SVM     0.95
3  RF      0.98
4  MLP     0.83
5  CART    0.72

From the figures, it is clear that RF tends to underestimate the higher values, probably because the data set is imbalanced in density, with few observations at high concentrations. In this particular respect, SVM performs rather well.
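The underestimation of high values can be quantified rather than read off the scatter plots, for instance as the mean prediction error per quartile of the measured values. The sketch below uses synthetic data in which the predictions are deliberately shrunk towards the mean; the vectors `measured` and `predicted` are hypothetical stand-ins for the measured O3 column and a model's predictions.

```r
set.seed(1)

# Hypothetical measured values and predictions shrunk towards the mean.
measured  <- rgamma(500, shape = 4, scale = 12)
predicted <- 46 + 0.8 * (measured - 46) + rnorm(500, sd = 5)

# Split the measured values into quartile bins.
bins <- cut(measured, breaks = quantile(measured, 0:4 / 4), include.lowest = TRUE)

# Mean of (predicted - measured) per bin: negative means underestimation.
bias <- tapply(predicted - measured, bins, mean)
print(round(bias, 1))
```

A clearly negative value in the top bin together with a positive one in the bottom bin is the pattern described above for RF.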

Overall, the best fit was obtained with the RF method, with a correlation of 0.9829.