Data processing report for the JEF station

Introduction

The pollution data measured at this station were provided by Prof. Ouarzazi in 2013. They consist of hourly records of CO, NO2, wind speed, temperature, PM10, SO2, solar radiation and ozone. The summary below also lists an HR.JEF column, presumably relative humidity.

setwd("~/git/ouarzazi")
suppressPackageStartupMessages(library(googleVis))
suppressPackageStartupMessages(library(xtable))
suppressPackageStartupMessages(library(Peaks))
suppressPackageStartupMessages(library(magic))
suppressPackageStartupMessages(library(segmented))
suppressPackageStartupMessages(library(fftw))
suppressPackageStartupMessages(library(FITSio))
suppressPackageStartupMessages(library(stringr))
suppressPackageStartupMessages(library(utils))
suppressPackageStartupMessages(library(e1071))
suppressPackageStartupMessages(library(randomForest))
suppressPackageStartupMessages(library(rpart))
suppressPackageStartupMessages(library(nnet))
suppressPackageStartupMessages(library(party))
suppressPackageStartupMessages(library(partykit))

if (file.exists("JEF.RData")) {
    load("JEF.RData")
} else {
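    # Semicolon-separated file with decimal commas
    JEF <- read.csv(file = "JEF_data.csv", sep = ";", dec = ",", header = TRUE)
    # Keep the rows whose first field parses as a date, and the first 11 columns
    DJEF <- JEF[!is.na(as.Date(as.character(JEF[, 1]))), 1:11]
    # Build a timestamp from the date (column 1) and the hour (column 2)
    newc <- paste(as.character(DJEF[, 1]), paste(DJEF[, 2], ":00:00", sep = ""), 
        sep = " ")
    newd <- strptime(newc, "%d/%m/%y %H:%M:%S")
    antes <- newd - 3600  # timestamp one hour earlier (not used below)
    # Drop the hour column and store the timestamp in the first column
    NJEF <- DJEF[, -2]
    NJEF[, 1] <- as.data.frame(newd)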
    save(JEF, DJEF, NJEF, file = "JEF.RData")
}
rm(JEF)

	Date	CO.JEF	NO2.JEF	O3.JEF	PM10.JEF	SO2.JEF	HR.JEF	TC.JEF	Vent.JEF	Rad.Sol
1	Min. :2009-06-01 01:00:00	Min. :0.010	Min. : 1.0	Min. : 1.0	Min. : 1	Min. : 1.0	Min. : 6.0	Min. : 5.0	Min. :0.100	Min. : 0.0
2	1st Qu.:2009-10-01 00:00:00	1st Qu.:0.030	1st Qu.: 12.0	1st Qu.: 28.0	1st Qu.: 32	1st Qu.: 8.0	1st Qu.: 37.0	1st Qu.:16.6	1st Qu.:0.600	1st Qu.: 0.0
3	Median :2010-02-13 20:00:00	Median :0.070	Median : 21.0	Median : 43.0	Median : 47	Median : 9.0	Median : 53.0	Median :21.0	Median :0.800	Median : 11.1
4	Mean :2010-02-09 15:54:17	Mean :0.159	Mean : 23.4	Mean : 46.1	Mean : 55	Mean :10.3	Mean : 53.6	Mean :21.7	Mean :0.902	Mean : 214.9
5	3rd Qu.:2010-06-15 21:00:00	3rd Qu.:0.140	3rd Qu.: 32.0	3rd Qu.: 62.0	3rd Qu.: 69	3rd Qu.:13.0	3rd Qu.: 71.0	3rd Qu.:26.2	3rd Qu.:1.100	3rd Qu.: 427.3
6	Max. :2010-11-17 19:00:00	Max. :4.640	Max. :108.0	Max. :157.0	Max. :457	Max. :48.0	Max. :100.0	Max. :43.9	Max. :5.000	Max. :1092.0

The numerical analysis is performed with R, the well-known open-source statistical environment (http://www.r-project.org).

Processing

Prof. Ouarzazi obtained a correlation of 0.84 with a local model of O3 against the remaining variables over the same period. To compare with that result, we apply several modelling techniques.

The basic methodology is:

To apply cross-validation during learning, as it is more robust than a fixed 70%/15%/15% split.
To validate the best model, once selected, against the full data set, as Prof. Ouarzazi did.
The hourly model was selected to learn what can be achieved, even though, depending on the applicable regulation, \( O_3 \) should be assessed by its daily maximum and/or by its dose over 8-hour periods.
Uncertainty about future predictor values does not arise, because ozone is not predicted with any lag.
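
The cross-validation scheme named in the first point can be sketched in base R as follows. This is a minimal illustration on synthetic data, not the JEF records: the data frame d, its columns x1, x2 and y, and the variable names are assumptions made for the example, and k = 10 simply mirrors the cross = 10 setting used in this report.

```r
# Synthetic stand-in for the station data (hypothetical values)
set.seed(1)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 2 * d$x1 - d$x2 + rnorm(n, sd = 0.5)

# Assign every row to one of k folds at random
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, measure the mean squared error on the held-out fold
fold_mse <- sapply(seq_len(k), function(i) {
    fit <- lm(y ~ ., data = d[fold != i, ])
    pred <- predict(fit, newdata = d[fold == i, ])
    mean((d$y[fold == i] - pred)^2)
})

cv_error <- mean(fold_mse)      # average error over the folds
cv_dispersion <- sd(fold_mse)   # spread of the error across the folds
```

Under the default settings of tune.control in e1071, the mean and the standard deviation of the per-fold errors should correspond to the error and dispersion columns of the tuning summaries in this report. Every observation is used for validation exactly once, which is why the estimate depends less on one particular split than a fixed 70%/15%/15% partition does.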

Linear approach as reference

A linear model is taken as the reference: comparing the other results against it indicates how far the problem departs from linearity.


if (file.exists("JEF_lm.RData")) {
    load("JEF_lm.RData")
} else {
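    # Linear model assessed by 10-fold cross-validation; the date column is dropped
    objJEF <- tune(lm, O3.JEF ~ ., data = NJEF[, -1], tunecontrol = tune.control(sampling = "cross", 
        cross = 10))
    JEF.lm <- objJEF
    # Predict over the whole data set and correlate with the measured O3 (column 4)
    pO3 <- predict(JEF.lm$best.model, NJEF[, -c(1, 4)])
    cr.lm <- cor(as.numeric(NJEF[, 4]), as.numeric(pO3))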
    save(JEF.lm, NJEF, cr.lm, file = "JEF_lm.RData")
}
# Recompute the predictions: pO3 is not stored in JEF_lm.RData
pO3 <- predict(JEF.lm$best.model, NJEF[, -c(1, 4)])
cr.lm <- cor(as.numeric(NJEF[, 4]), as.numeric(pO3))
# Scatter plot of predicted against measured O3
par(mgp = c(2.2, 0.45, 0), tcl = 0.4, mar = c(3.3, 3.6, 1.1, 1.1))
plot(as.numeric(NJEF[, 4]), as.numeric(pO3), main = expression(O[3] ~ Prediction. ~ 
    JEF ~ station), xlab = expression(O[3] ~ real), ylab = expression(O[3] ~ 
    LM ~ predicted))
# 1:1 reference line
lines(c(0, 200), c(0, 200), col = 2)

[Figure: LM-predicted O3 versus real O3, JEF station, with the 1:1 reference line]

	dummyparameter	error	dispersion
1	Min. :0	Min. :252	Min. :24.3
2	1st Qu.:0	1st Qu.:252	1st Qu.:24.3
3	Median :0	Median :252	Median :24.3
4	Mean :0	Mean :252	Mean :24.3
5	3rd Qu.:0	3rd Qu.:252	3rd Qu.:24.3
6	Max. :0	Max. :252	Max. :24.3

The resulting correlation is 0.7644. It will be considered as the reference.

SVM approach

A tuning wrapper for SVM-based regressors is applied to search for the best learning parameters.

[Figure: SVM results for the JEF station]

	gamma	cost	error	dispersion
1	Min. :0.125	Min. :2.00	Min. :127	Min. : 9.12
2	1st Qu.:0.125	1st Qu.:2.00	1st Qu.:130	1st Qu.: 9.80
3	Median :0.250	Median :4.00	Median :133	Median :10.13
4	Mean :0.292	Mean :4.67	Mean :137	Mean :10.02
5	3rd Qu.:0.500	3rd Qu.:8.00	3rd Qu.:143	3rd Qu.:10.34
6	Max. :0.500	Max. :8.00	Max. :151	Max. :10.77

The resulting correlation is 0.9509, which outperforms the value of 0.84 obtained by Prof. Ouarzazi.
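
The SVM tuning code is not reproduced in this report. A minimal sketch of how such a search could be run with tune from e1071 is given below. It assumes a grid of gamma in {0.125, 0.25, 0.5} and cost in {2, 4, 8}, which is what the summary table above suggests; the synthetic data frame d, its column names and the variable names svm_tuned and cr_svm are placeholders, not the station data or the original code.

```r
library(e1071)

# Synthetic stand-in for the station data (hypothetical values)
set.seed(42)
n <- 200
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$O3 <- 3 * d$x1 + d$x2^2 + rnorm(n, sd = 0.1)

# Grid search over gamma and cost, each combination scored by 10-fold cross-validation
svm_tuned <- tune(svm, O3 ~ ., data = d,
    ranges = list(gamma = 2^(-3:-1), cost = 2^(1:3)),
    tunecontrol = tune.control(sampling = "cross", cross = 10))

# One row per parameter combination, with its error and dispersion
summary(svm_tuned$performances)

# Predict over the whole data set with the best model and correlate with the target
p <- predict(svm_tuned$best.model, d[, c("x1", "x2")])
cr_svm <- cor(d$O3, as.numeric(p))
```

The final two lines follow the same evaluation used for the linear reference: the best model is applied to the full data set and the correlation with the measured values is reported.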

RandomForest

Next, we test the random forest (RF) technique.

[Figure: random forest results for the JEF station]

	mtry	ntree	error	dispersion
1	Min. :3.00	Min. : 300	Min. :124	Min. :8.67
2	1st Qu.:3.75	1st Qu.: 600	1st Qu.:125	1st Qu.:8.94
3	Median :4.50	Median : 900	Median :125	Median :9.04
4	Mean :4.50	Mean : 900	Mean :125	Mean :9.07
5	3rd Qu.:5.25	3rd Qu.:1200	3rd Qu.:126	3rd Qu.:9.20
6	Max. :6.00	Max. :1500	Max. :126	Max. :9.50

The resulting correlation is 0.9829.
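
The random forest tuning code is likewise not shown. The summary table above is consistent with a grid of mtry from 3 to 6 and ntree from 300 to 1500 in steps of 300. The sketch below runs the same kind of search on a reduced, hypothetical grid so that it executes quickly; the synthetic data frame d and all variable names are assumptions.

```r
library(e1071)
library(randomForest)

# Synthetic stand-in with four predictors (hypothetical values)
set.seed(7)
n <- 200
d <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n), x4 = runif(n))
d$O3 <- 3 * d$x1 + d$x2^2 + rnorm(n, sd = 0.1)

# Reduced grid: mtry cannot exceed the number of predictors (4 here)
rf_tuned <- tune(randomForest, O3 ~ ., data = d,
    ranges = list(mtry = 2:3, ntree = c(100, 200)),
    tunecontrol = tune.control(sampling = "cross", cross = 5))

rf_tuned$best.parameters

# In-sample predictions of the best model over the whole data set
p <- predict(rf_tuned$best.model, d)
cr_rf <- cor(d$O3, as.numeric(p))
```

Passing the training rows as newdata gives in-sample predictions, whereas calling predict() on a randomForest object without newdata returns out-of-bag predictions. The two can differ markedly, so it is worth stating which one a reported correlation is based on.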

FFNN: MLP

Next, we test a multilayer perceptron (MLP), a feed-forward neural network trained by backpropagation.

[Figure: MLP results for the JEF station]

	linout	size	maxit	decay	abstol	reltol	trace	rang	Var9	skip	error	dispersion
1	Mode:logical	Min. : 4	Min. :50000	Min. :0.0156	Min. :1e-06	Min. :1e-07	Mode :logical	Min. :0	Min. :7	Mode:logical	Min. :187	Min. :10.7
2	TRUE:17	1st Qu.: 8	1st Qu.:50000	1st Qu.:0.0156	1st Qu.:1e-06	1st Qu.:1e-07	FALSE:17	1st Qu.:0	1st Qu.:7	TRUE:17	1st Qu.:200	1st Qu.:14.8
3	NA’s:0	Median :12	Median :50000	Median :0.0156	Median :1e-06	Median :1e-07	NA’s :0	Median :0	Median :7	NA’s:0	Median :206	Median :17.1
4		Mean :12	Mean :50000	Mean :0.0156	Mean :1e-06	Mean :1e-07		Mean :0	Mean :7		Mean :205	Mean :19.3
5		3rd Qu.:16	3rd Qu.:50000	3rd Qu.:0.0156	3rd Qu.:1e-06	3rd Qu.:1e-07		3rd Qu.:0	3rd Qu.:7		3rd Qu.:210	3rd Qu.:24.4
6		Max. :20	Max. :50000	Max. :0.0156	Max. :1e-06	Max. :1e-07		Max. :0	Max. :7		Max. :220	Max. :31.0

The resulting correlation is 0.8329.
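
The summary table above suggests a search over the hidden layer size (4 to 20 units) with linear output, skip-layer connections and a weight decay of about 0.0156. A minimal sketch of such a search with nnet and tune is given below. The reduced size grid, the lower maxit, the synthetic data frame d and all variable names are assumptions chosen to keep the example fast.

```r
library(e1071)
library(nnet)

# Synthetic stand-in for the station data (hypothetical values)
set.seed(3)
n <- 200
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$O3 <- 3 * d$x1 + d$x2^2 + rnorm(n, sd = 0.1)

# Only the number of hidden units is tuned; the other settings are fixed
mlp_tuned <- tune(nnet, O3 ~ ., data = d,
    ranges = list(size = c(2, 4)),
    linout = TRUE, skip = TRUE, decay = 2^-6, maxit = 500, trace = FALSE,
    tunecontrol = tune.control(sampling = "cross", cross = 5))

mlp_tuned$best.parameters

# predict() returns a one-column matrix for nnet, hence as.numeric()
p <- predict(mlp_tuned$best.model, d)
cr_mlp <- cor(d$O3, as.numeric(p))
```

Setting linout = TRUE gives a linear output unit, which is what a regression target such as an ozone concentration requires; the default logistic output would restrict predictions to the interval from 0 to 1.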

CART solution

Next, we use classification and regression trees (CART) to assess their capabilities on this particular problem.

[Figure: CART results for the JEF station]

	control.Length control.Class control.Mode	error	dispersion
1	1 -none- numeric	Min. :289	Min. :14.3
2	1 -none- numeric	1st Qu.:289	1st Qu.:14.3
3	10 -none- numeric	Median :289	Median :14.3
4	1 -none- numeric	Mean :289	Mean :14.3
5	1 -none- numeric	3rd Qu.:289	3rd Qu.:14.3
6	1 -none- numeric	Max. :289	Max. :14.3
7	1 -none- numeric
8	1 -none- numeric
9	1 -none- numeric

The resulting correlation is 0.72.
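
The summary table above shows a single error value, which suggests that no parameter grid was searched for the tree. As an illustration only, the sketch below tunes the complexity parameter cp of rpart by cross-validation. The cp grid, the synthetic data frame d and all variable names are hypothetical and are not taken from the original analysis.

```r
library(e1071)
library(rpart)

# Synthetic stand-in for the station data (hypothetical values)
set.seed(5)
n <- 300
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$O3 <- 3 * d$x1 + d$x2^2 + rnorm(n, sd = 0.1)

# Smaller cp values allow deeper trees; each value is scored by 10-fold cross-validation
cart_tuned <- tune(rpart, O3 ~ ., data = d,
    ranges = list(cp = c(0.001, 0.01, 0.05)),
    tunecontrol = tune.control(sampling = "cross", cross = 10))

cart_tuned$best.parameters

# Predict over the whole data set with the best tree
p <- predict(cart_tuned$best.model, d)
cr_cart <- cor(d$O3, as.numeric(p))
```

A regression tree predicts one constant per leaf, so its predictions take only as many distinct values as the tree has leaves. This limits the attainable correlation on a continuous target.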

Conclusions

After this short analysis we can conclude that:

	Method	Cor
1	LM	0.76
2	SVM	0.95
3	RF	0.98
4	MLP	0.83
5	CART	0.72

From the figures, RF clearly tends to underestimate the higher values, probably because the data set is imbalanced in density (high ozone values are comparatively rare). In this particular respect, the SVM performs notably well.
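
The underestimation of high values can also be checked numerically rather than by eye. The sketch below computes the mean prediction error within bands of the observed value. The helper error_by_band, the band limits and the synthetic obs and pred vectors are all hypothetical; with the real data, obs would be the measured O3 and pred the predictions of the model under study.

```r
# Hypothetical helper: mean of (predicted - observed) within quantile bands of the observed value
error_by_band <- function(obs, pred, probs = c(0, 0.5, 0.9, 1)) {
    breaks <- quantile(obs, probs = probs)
    band <- cut(obs, breaks = breaks, include.lowest = TRUE)
    tapply(pred - obs, band, mean)
}

# Synthetic example: right-skewed observations, predictions shrunk towards the mean
set.seed(11)
obs <- rgamma(1000, shape = 4, scale = 12)
pred <- mean(obs) + 0.8 * (obs - mean(obs)) + rnorm(1000, sd = 3)

bias <- error_by_band(obs, pred)
bias
```

A negative value in the top band indicates that the highest observations are underestimated on average. Comparing this figure across the RF and SVM predictions would quantify the difference described above.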

Overall, the best fit was obtained with the RF method, with a correlation of 0.9829.