The pollution data measured at this station were provided by Prof. Ouarzazi in 2013. They consist of hourly records of CO, NO2, wind speed, temperature, PM10, SO2, solar radiation and ozone.
```r
setwd("~/git/ouarzazi")
suppressPackageStartupMessages(library(googleVis))
suppressPackageStartupMessages(library(xtable))
suppressPackageStartupMessages(library(Peaks))
suppressPackageStartupMessages(library(magic))
suppressPackageStartupMessages(library(segmented))
suppressPackageStartupMessages(library(fftw))
suppressPackageStartupMessages(library(FITSio))
suppressPackageStartupMessages(library(stringr))
suppressPackageStartupMessages(library(utils))
suppressPackageStartupMessages(library(e1071))
suppressPackageStartupMessages(library(randomForest))
suppressPackageStartupMessages(library(rpart))
suppressPackageStartupMessages(library(nnet))
suppressPackageStartupMessages(library(party))
suppressPackageStartupMessages(library(partykit))
if (file.exists("JEF.RData")) {
  # Reuse the cached, already parsed data.
  load("JEF.RData")
} else {
  JEF <- read.csv(file = "JEF_data.csv", sep = ";", dec = ",", header = TRUE)
  # Keep the rows whose first column parses as a dd/mm/yy date, and the first 11 columns.
  DJEF <- JEF[!is.na(as.Date(as.character(JEF[, 1]), format = "%d/%m/%y")), 1:11]
  # Build one time stamp per row from the date and hour columns.
  newc <- paste(as.character(DJEF[, 1]), paste0(DJEF[, 2], ":00:00"))
  newd <- strptime(newc, "%d/%m/%y %H:%M:%S")
  # Drop the hour column and store the time stamp as POSIXct in the date column.
  NJEF <- DJEF[, -2]
  NJEF[, 1] <- as.POSIXct(newd)
  save(JEF, DJEF, NJEF, file = "JEF.RData")
}
rm(JEF)
```
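The time stamp construction used when loading the data can be tried in isolation. The following is a minimal sketch on a toy data frame; the column names `Date` and `Hour`, the values and the UTC time zone are assumptions made for illustration, since the layout of `JEF_data.csv` is only known from the loading code.

```r
# Toy stand-in for the date and hour columns of JEF_data.csv (invented values).
toy <- data.frame(Date = c("01/06/09", "01/06/09", "02/06/09"),
                  Hour = c(1, 2, 13))

# Combine date and hour into one string, e.g. "01/06/09 1:00:00".
stamp <- paste(toy$Date, paste0(toy$Hour, ":00:00"))

# Parse with an explicit format and time zone, then convert to POSIXct,
# which can be stored directly as a data frame column.
when <- as.POSIXct(strptime(stamp, "%d/%m/%y %H:%M:%S", tz = "UTC"))
print(when)
```

Fixing the time zone explicitly avoids shifts around daylight saving changes, which may matter for hourly series.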
| | Date | CO.JEF | NO2.JEF | O3.JEF | PM10.JEF | SO2.JEF | HR.JEF | TC.JEF | Vent.JEF | Rad.Sol |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Min. :2009-06-01 01:00:00 | Min. :0.010 | Min. : 1.0 | Min. : 1.0 | Min. : 1 | Min. : 1.0 | Min. : 6.0 | Min. : 5.0 | Min. :0.100 | Min. : 0.0 |
| 2 | 1st Qu.:2009-10-01 00:00:00 | 1st Qu.:0.030 | 1st Qu.: 12.0 | 1st Qu.: 28.0 | 1st Qu.: 32 | 1st Qu.: 8.0 | 1st Qu.: 37.0 | 1st Qu.:16.6 | 1st Qu.:0.600 | 1st Qu.: 0.0 |
| 3 | Median :2010-02-13 20:00:00 | Median :0.070 | Median : 21.0 | Median : 43.0 | Median : 47 | Median : 9.0 | Median : 53.0 | Median :21.0 | Median :0.800 | Median : 11.1 |
| 4 | Mean :2010-02-09 15:54:17 | Mean :0.159 | Mean : 23.4 | Mean : 46.1 | Mean : 55 | Mean :10.3 | Mean : 53.6 | Mean :21.7 | Mean :0.902 | Mean : 214.9 |
| 5 | 3rd Qu.:2010-06-15 21:00:00 | 3rd Qu.:0.140 | 3rd Qu.: 32.0 | 3rd Qu.: 62.0 | 3rd Qu.: 69 | 3rd Qu.:13.0 | 3rd Qu.: 71.0 | 3rd Qu.:26.2 | 3rd Qu.:1.100 | 3rd Qu.: 427.3 |
| 6 | Max. :2010-11-17 19:00:00 | Max. :4.640 | Max. :108.0 | Max. :157.0 | Max. :457 | Max. :48.0 | Max. :100.0 | Max. :43.9 | Max. :5.000 | Max. :1092.0 |
The numerical treatment is carried out in R (http://www.r-project.org), the well-known open-source statistical environment.
To compare with Prof. Ouarzazi's result (corr = 0.84) for a local model of O3 as a function of the remaining variables over the same period, several techniques are applied.
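Basic methodology: each technique is tuned by cross-validation and then scored by the correlation between observed and predicted O3.
A linear model serves as the reference against which the other results are compared; it also indicates how linear the relationship is.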
```r
if (file.exists("JEF_lm.RData")) {
  load("JEF_lm.RData")
} else {
  # 10-fold cross-validated linear model of O3 on all remaining variables
  # (column 1, the time stamp, is excluded).
  JEF.lm <- tune(lm, O3.JEF ~ ., data = NJEF[, -1],
                 tunecontrol = tune.control(sampling = "cross", cross = 10))
  save(JEF.lm, NJEF, file = "JEF_lm.RData")
}
# In-sample prediction; column 4 of NJEF is O3.JEF.
pO3 <- predict(JEF.lm$best.model, NJEF[, -c(1, 4)])
cr.lm <- cor(as.numeric(NJEF[, 4]), as.numeric(pO3))
par(mgp = c(2.2, 0.45, 0), tcl = 0.4, mar = c(3.3, 3.6, 1.1, 1.1))
plot(as.numeric(NJEF[, 4]), as.numeric(pO3),
     main = expression(O[3] ~ Prediction. ~ JEF ~ station),
     xlab = expression(O[3] ~ real),
     ylab = expression(O[3] ~ LM ~ predicted))
abline(0, 1, col = 2)  # identity line: perfect prediction
```
| | dummyparameter | error | dispersion |
|---|---|---|---|
| 1 | Min. :0 | Min. :252 | Min. :24.3 |
| 2 | 1st Qu.:0 | 1st Qu.:252 | 1st Qu.:24.3 |
| 3 | Median :0 | Median :252 | Median :24.3 |
| 4 | Mean :0 | Mean :252 | Mean :24.3 |
| 5 | 3rd Qu.:0 | 3rd Qu.:252 | 3rd Qu.:24.3 |
| 6 | Max. :0 | Max. :252 | Max. :24.3 |
The resulting correlation is 0.7644. It will be considered as the reference.
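The `error` and `dispersion` columns reported by `tune` correspond, with its default settings, to the mean and the standard deviation of the per-fold mean squared error. The sketch below reproduces that computation by hand for a linear model. The data are synthetic and the column names `TC`, `HR` and `O3` are placeholders, so the numbers are not comparable with the table above.

```r
set.seed(1)
n <- 200
# Synthetic stand-in for NJEF[, -1]; predictors and coefficients are invented.
toy <- data.frame(TC = runif(n, 5, 40), HR = runif(n, 10, 100))
toy$O3 <- 20 + 1.5 * toy$TC - 0.2 * toy$HR + rnorm(n, sd = 5)

k <- 10
fold <- sample(rep(seq_len(k), length.out = n))  # random fold labels 1..k

# Mean squared error on each held-out fold.
mse <- sapply(seq_len(k), function(i) {
  fit  <- lm(O3 ~ ., data = toy[fold != i, ])
  pred <- predict(fit, newdata = toy[fold == i, ])
  mean((toy$O3[fold == i] - pred)^2)
})

cv_error      <- mean(mse)  # analogous to the "error" column
cv_dispersion <- sd(mse)    # analogous to the "dispersion" column
print(c(error = cv_error, dispersion = cv_dispersion))
```

Note that the correlation quoted in the text is computed on the full data set, whereas these two quantities are computed on held-out folds.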
A tuning wrapper for SVM-based regression is applied to search for the best learning parameters (gamma and cost).
| | gamma | cost | error | dispersion |
|---|---|---|---|---|
| 1 | Min. :0.125 | Min. :2.00 | Min. :127 | Min. : 9.12 |
| 2 | 1st Qu.:0.125 | 1st Qu.:2.00 | 1st Qu.:130 | 1st Qu.: 9.80 |
| 3 | Median :0.250 | Median :4.00 | Median :133 | Median :10.13 |
| 4 | Mean :0.292 | Mean :4.67 | Mean :137 | Mean :10.02 |
| 5 | 3rd Qu.:0.500 | 3rd Qu.:8.00 | 3rd Qu.:143 | 3rd Qu.:10.34 |
| 6 | Max. :0.500 | Max. :8.00 | Max. :151 | Max. :10.77 |
The resulting correlation is 0.9509, which exceeds the 0.84 reported by Prof. Ouarzazi.
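The SVM search can be sketched with `tune.svm` from the e1071 package. The grid below (gamma in 0.125, 0.25, 0.5 and cost in 2, 4, 8) is inferred from the summary table; the default radial kernel, the 10-fold setting and the synthetic data are assumptions, since the original chunk is not shown.

```r
library(e1071)

set.seed(1)
n <- 120
# Synthetic stand-in for NJEF[, -1]; predictors and relationship are invented.
toy <- data.frame(TC = runif(n, 5, 40), HR = runif(n, 10, 100))
toy$O3 <- 20 + 30 * sin(toy$TC / 8) - 0.2 * toy$HR + rnorm(n, sd = 3)

# Grid search over gamma and cost with 10-fold cross-validation.
svm_tune <- tune.svm(O3 ~ ., data = toy,
                     gamma = 2^(-3:-1), cost = 2^(1:3),
                     tunecontrol = tune.control(sampling = "cross", cross = 10))

print(summary(svm_tune$performances))

# In-sample correlation of the best model, as reported in the text.
pred_svm <- predict(svm_tune$best.model, toy)
cor_svm  <- cor(toy$O3, pred_svm)
print(cor_svm)
```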
Next, the random forest technique (randomForest package) is tested.
| | mtry | ntree | error | dispersion |
|---|---|---|---|---|
| 1 | Min. :3.00 | Min. : 300 | Min. :124 | Min. :8.67 |
| 2 | 1st Qu.:3.75 | 1st Qu.: 600 | 1st Qu.:125 | 1st Qu.:8.94 |
| 3 | Median :4.50 | Median : 900 | Median :125 | Median :9.04 |
| 4 | Mean :4.50 | Mean : 900 | Mean :125 | Mean :9.07 |
| 5 | 3rd Qu.:5.25 | 3rd Qu.:1200 | 3rd Qu.:126 | 3rd Qu.:9.20 |
| 6 | Max. :6.00 | Max. :1500 | Max. :126 | Max. :9.50 |
The resulting correlation is 0.9829.
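A comparable search for random forests is available through `tune.randomForest` in e1071. The summary table suggests a grid of `mtry` from 3 to 6 and `ntree` from 300 to 1500; the sketch below uses a much smaller grid so that it runs quickly, and both the reduced grid and the synthetic predictors `x1` to `x6` are assumptions for illustration.

```r
library(e1071)
library(randomForest)

set.seed(1)
n <- 100
# Synthetic stand-in for NJEF[, -1]: six invented predictors and a response.
toy <- as.data.frame(matrix(runif(n * 6), ncol = 6))
names(toy) <- paste0("x", 1:6)
toy$O3 <- 50 * toy$x1 + 20 * toy$x2^2 + rnorm(n, sd = 2)

# Reduced grid (assumption); the report's table points to mtry 3..6, ntree 300..1500.
rf_tune <- tune.randomForest(O3 ~ ., data = toy,
                             mtry = c(3, 6), ntree = c(100, 300),
                             tunecontrol = tune.control(sampling = "cross", cross = 10))

print(rf_tune$performances)

# In-sample correlation of the best model.
pred_rf <- predict(rf_tune$best.model, toy)
cor_rf  <- cor(toy$O3, pred_rf)
print(cor_rf)
```

Random forests tend to fit their training data very closely, so an in-sample correlation is likely to be optimistic compared with the cross-validated error.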
Next, a multilayer perceptron neural network trained by backpropagation is tested.
| | linout | size | maxit | decay | abstol | reltol | trace | rang | Var9 | skip | error | dispersion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Mode:logical | Min. : 4 | Min. :50000 | Min. :0.0156 | Min. :1e-06 | Min. :1e-07 | Mode :logical | Min. :0 | Min. :7 | Mode:logical | Min. :187 | Min. :10.7 |
| 2 | TRUE:17 | 1st Qu.: 8 | 1st Qu.:50000 | 1st Qu.:0.0156 | 1st Qu.:1e-06 | 1st Qu.:1e-07 | FALSE:17 | 1st Qu.:0 | 1st Qu.:7 | TRUE:17 | 1st Qu.:200 | 1st Qu.:14.8 |
| 3 | NA’s:0 | Median :12 | Median :50000 | Median :0.0156 | Median :1e-06 | Median :1e-07 | NA’s :0 | Median :0 | Median :7 | NA’s:0 | Median :206 | Median :17.1 |
| 4 | | Mean :12 | Mean :50000 | Mean :0.0156 | Mean :1e-06 | Mean :1e-07 | | Mean :0 | Mean :7 | | Mean :205 | Mean :19.3 |
| 5 | | 3rd Qu.:16 | 3rd Qu.:50000 | 3rd Qu.:0.0156 | 3rd Qu.:1e-06 | 3rd Qu.:1e-07 | | 3rd Qu.:0 | 3rd Qu.:7 | | 3rd Qu.:210 | 3rd Qu.:24.4 |
| 6 | | Max. :20 | Max. :50000 | Max. :0.0156 | Max. :1e-06 | Max. :1e-07 | | Max. :0 | Max. :7 | | Max. :220 | Max. :31.0 |
The resulting correlation is 0.8329.
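The column names in the table above (`linout`, `size`, `decay`, `skip`, `maxit`) match arguments of `nnet` from the nnet package, so the search may have looked roughly like the sketch below. This is an assumption: the hidden-layer sizes are reduced to two values, `maxit` is lowered for speed, and the data are synthetic.

```r
library(e1071)
library(nnet)

set.seed(1)
n <- 150
# Synthetic stand-in for NJEF[, -1]; predictors are already on a 0..1 scale.
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$O3 <- 10 + 5 * sin(2 * pi * toy$x1) + 3 * toy$x2 + rnorm(n, sd = 0.5)

# Search over the hidden-layer size with a fixed weight decay of 2^-6 (about 0.0156).
# linout = TRUE gives a linear output unit (regression); skip = TRUE adds
# direct input-to-output connections.
mlp_tune <- tune(nnet, O3 ~ ., data = toy,
                 ranges = list(size = c(4, 8), decay = 2^-6),
                 linout = TRUE, skip = TRUE, maxit = 500, trace = FALSE,
                 tunecontrol = tune.control(sampling = "cross", cross = 10))

print(mlp_tune$performances)

# In-sample correlation of the best model.
pred_mlp <- as.numeric(predict(mlp_tune$best.model, toy))
cor_mlp  <- cor(toy$O3, pred_mlp)
print(cor_mlp)
```

Neural network fits depend on the random initial weights, so repeated runs with different seeds may give noticeably different errors.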
Next, classification and regression trees (CART) are applied to examine their capabilities for this particular problem.
| | control (Length, Class, Mode) | error | dispersion |
|---|---|---|---|
| 1 | 1 -none- numeric | Min. :289 | Min. :14.3 |
| 2 | 1 -none- numeric | 1st Qu.:289 | 1st Qu.:14.3 |
| 3 | 10 -none- numeric | Median :289 | Median :14.3 |
| 4 | 1 -none- numeric | Mean :289 | Mean :14.3 |
| 5 | 1 -none- numeric | 3rd Qu.:289 | 3rd Qu.:14.3 |
| 6 | 1 -none- numeric | Max. :289 | Max. :14.3 |
| 7 | 1 -none- numeric | | |
| 8 | 1 -none- numeric | | |
| 9 | 1 -none- numeric | | |
The resulting correlation is 0.72.
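A regression tree can be evaluated through the same `tune` wrapper. The table above shows a single configuration, so the small grid over the complexity parameter `cp` in the sketch below is an assumption added for illustration, as are the synthetic data.

```r
library(e1071)
library(rpart)

set.seed(1)
n <- 200
# Synthetic stand-in for NJEF[, -1] with a step-like response that suits a tree.
toy <- data.frame(TC = runif(n, 5, 40), HR = runif(n, 10, 100))
toy$O3 <- ifelse(toy$TC > 25, 60, 30) - 0.1 * toy$HR + rnorm(n, sd = 4)

# 10-fold cross-validation over a few values of the pruning parameter cp.
cart_tune <- tune(rpart, O3 ~ ., data = toy,
                  ranges = list(cp = c(0.001, 0.01, 0.1)),
                  tunecontrol = tune.control(sampling = "cross", cross = 10))

print(cart_tune$performances)

# In-sample correlation of the best tree.
pred_cart <- predict(cart_tune$best.model, toy)
cor_cart  <- cor(toy$O3, pred_cart)
print(cor_cart)
```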
The correlations obtained in this short analysis are summarised below:
| | Method | Cor |
|---|---|---|
| 1 | LM | 0.76 |
| 2 | SVM | 0.95 |
| 3 | RF | 0.98 |
| 4 | MLP | 0.83 |
| 5 | CART | 0.72 |
From the figures, it is clear that RF somewhat underestimates the higher values, probably because high values are sparsely represented in the data set. In this particular respect, the SVM technique performs rather well.
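One way to quantify such an underestimation, rather than reading it off a scatter plot, is to compare the mean residual across ranges of the observed value. The sketch below does this on synthetic data in which the predictions are deliberately shrunk towards the mean; the vectors `obs` and `pred` are invented and do not come from the fitted models.

```r
set.seed(1)
# Skewed synthetic observations: many moderate values, few high ones.
obs  <- rgamma(500, shape = 4, scale = 12)
# Invented predictions that are pulled towards the overall mean.
pred <- 0.8 * obs + 0.2 * mean(obs) + rnorm(500, sd = 3)

# Split observations at the median and the 90th percentile.
bins <- cut(obs, breaks = quantile(obs, c(0, 0.5, 0.9, 1)),
            include.lowest = TRUE, labels = c("low", "mid", "high"))

# Mean residual per range; a negative value means underestimation.
bias <- tapply(pred - obs, bins, mean)
print(bias)
```

Applied to the real predictions of each model, a clearly negative mean residual in the `high` bin would support the observation made from the figures.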
Overall, we can conclude that the best fit was obtained with the RF method, with a correlation of 0.9829.