Representación gráfica de dos variables y ajustes de modelos

Nuestros datos

Queremos representar gráficamente la relación entre dos variabloes en un gráfico de dispersión. Nuestra matriz consta de un total de 6 variables medidas en 29 sitios diferentes.

Veamos como son nuestros datos:

#Fijamos el directorio de trabajo
setwd(dir = "C:/R/MARKDOWN/graph-models")

#leemos el fichero de datos indicando que la primera columna no es una variable row.names
graphmodels<-read.csv("graph-models.csv", sep=";",row.names = 1)
str(graphmodels)

## 'data.frame':    29 obs. of  7 variables:
##  $ altura       : int  1034 704 1093 952 263 1023 1219 821 207 552 ...
##  $ diversidad   : int  2 15 3 10 16 1 2 14 15 10 ...
##  $ abundancia   : int  100 459 110 412 599 101 102 515 601 699 ...
##  $ precipitacion: int  100 120 156 155 12 122 126 100 41 59 ...
##  $ uso          : int  1 4 1 2 7 1 1 3 7 6 ...
##  $ incendios    : int  1 2 5 6 8 7 4 5 1 2 ...
##  $ otros        : int  0 0 0 9 8 7 6 5 4 3 ...

head(graphmodels)

##   altura diversidad abundancia precipitacion uso incendios otros
## 1   1034          2        100           100   1         1     0
## 2    704         15        459           120   4         2     0
## 3   1093          3        110           156   1         5     0
## 4    952         10        412           155   2         6     9
## 5    263         16        599            12   7         8     8
## 6   1023          1        101           122   1         7     7

Representación gráfica

Ahora vamos a representar la primera variable (altura) con el resto. Cargamos todo lo que se necesita:

library("ggplot2")

## Warning: package 'ggplot2' was built under R version 3.5.3

library("ggpubr")

## Warning: package 'ggpubr' was built under R version 3.5.3

## Loading required package: magrittr

library("ModelMetrics")

## Warning: package 'ModelMetrics' was built under R version 3.5.3

## 
## Attaching package: 'ModelMetrics'

## The following object is masked from 'package:base':
## 
##     kappa

library("mgcv")

## Loading required package: nlme

## This is mgcv 1.8-26. For overview type 'help("mgcv-package")'.

G1<-ggplot(graphmodels, aes(altura, diversidad) ) +
  geom_point()
G1

Representación gráfica mejorada

Ahora mejoramos lo anterior incluyendo modelo y su incertidumbre:

G1better<-ggplot(graphmodels, aes(altura, diversidad) ) +
  geom_point() +
  stat_smooth()
G1better

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Hacemos lo mismo para el resto de variables:

library("ggplot2")
G2<-ggplot(graphmodels, aes(altura, abundancia) ) +
  geom_point() +
  stat_smooth()
G2

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

G3<-ggplot(graphmodels, aes(altura, precipitacion) ) +
  geom_point() +
  stat_smooth()
G3

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

G4<-ggplot(graphmodels, aes(altura, uso) ) +
  geom_point() +
  stat_smooth()
G4

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

G5<-ggplot(graphmodels, aes(altura, incendios) ) +
  geom_point() +
  stat_smooth()
G5

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

G6<-ggplot(graphmodels, aes(altura, otros) ) +
  geom_point() +
  stat_smooth()
G6

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Todas las representaciones juntas

Ponemos todos juntos:

ggarrange(G1better, G2, G3, G4, G5, G6 + rremove("x.text"), 
          labels = c("A", "B", "C", "D", "E", "F"),
          ncol = 3, nrow = 2, font.label=list(size=10,face="bold", color="red"))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Representación varios modelos

Las líneas azules nos pueden dar una idea de la relación que existe entre las variables. Vamos a ver la relación entre altura y precipitacion sin ningún tipo de representación:

G3free<-ggplot(graphmodels, aes(altura, precipitacion) ) +
  geom_point()
G3free

A simple vista puede parecer una relación más o menos lineal (‘lm’) positiva:

G3line<-ggplot(graphmodels, aes(altura, precipitacion) ) +
  geom_point() +
  stat_smooth(method="lm")
G3line

Podría ser glm = generalized linear regression:

G3glm<-ggplot(graphmodels, aes(altura, precipitacion) ) +
  geom_point() +
  stat_smooth(method="glm")
G3glm

Podría ser gam = generalized additive model:

G3gam<-ggplot(graphmodels, aes(altura, precipitacion) ) +
  geom_point() +
  stat_smooth(method="gam")
G3gam

Podría ser loess = local regression:

G3loess<-ggplot(graphmodels, aes(altura, precipitacion) ) +
  geom_point() +
  stat_smooth(method="loess")
G3loess

Ahora los cuatro anteriores juntos:

ggarrange(G3line, G3glm, G3gam, G3loess + rremove("x.text"), 
          labels = c("Line", "GLM", "GAM", "loess"),
          ncol = 2, nrow = 2, font.label=list(size=10,face="bold", color="red"))

Ejemplo de ajuste a dos modelos: lineal y polinómico

Podemos intentar probar varios modelos y quedarnos con el que mejor se ajuste a nuestros datos. Para ello tenemos que realizar los diferentes modelos. Por ejemplo para el modelo lineal:

lineal<-ggplot(graphmodels, aes(altura, precipitacion) ) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ x)
lineal

lin<-lm(precipitacion ~ altura, data = graphmodels)
summary(lin)

## 
## Call:
## lm(formula = precipitacion ~ altura, data = graphmodels)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.677 -21.726  -1.487  21.091  43.032 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.62340   12.10193   0.547    0.589    
## altura       0.11670    0.01548   7.539 4.14e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.44 on 27 degrees of freedom
## Multiple R-squared:  0.6779, Adjusted R-squared:  0.666 
## F-statistic: 56.84 on 1 and 27 DF,  p-value: 4.139e-08

La precipitación cambia con la altura y el R es relativamente alto. Tiene buena pinta. Vamos a hacer las predicciones y ajustes del modelo:

predictionslin <- lin %>% predict(graphmodels)

predictionslin #los valores predichos por el modelo

##         1         2         3         4         5         6         7 
## 127.28751  88.77769 134.17260 117.71841  37.31456 126.00385 148.87635 
##         8         9        10        11        12        13        14 
## 102.43117  30.77956  71.03983  59.48689  38.94831 111.06671  60.07037 
##        15        16        17        18        19        20        21 
##  79.44198 151.67707  94.14573 115.96796 134.05591  30.19608  51.55153 
##        22        23        24        25        26        27        28 
## 100.09724  80.72564 144.90867  69.75617  73.84055  65.20501  99.98055 
##        29 
##  82.47608

Representamos juntos los predichos en negro y los observados en rojo:

plot(graphmodels$altura, predictionslin, xlim=range(graphmodels$altura, graphmodels$altura), ylim=range(predictionslin, graphmodels$precipitacion)) 
points(graphmodels$altura, graphmodels$precipitacion, col='red')

Ahora toca evaluar la diferencia entre lo observado y lo predicho, a menor diferencia mejor ajuste. Además el R2 nos informa también sobre el grado de ajuste:

rmse(predictionslin, graphmodels$precipitacion) #Error de mínimos cuadrados

## [1] 24.54729

summary(lin)

## 
## Call:
## lm(formula = precipitacion ~ altura, data = graphmodels)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.677 -21.726  -1.487  21.091  43.032 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.62340   12.10193   0.547    0.589    
## altura       0.11670    0.01548   7.539 4.14e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.44 on 27 degrees of freedom
## Multiple R-squared:  0.6779, Adjusted R-squared:  0.666 
## F-statistic: 56.84 on 1 and 27 DF,  p-value: 4.139e-08

Ahora probamos con un modelo polinomial:

poli<-lm(precipitacion ~ poly(altura, 2, raw = TRUE), data = graphmodels)
summary(poli)

## 
## Call:
## lm(formula = precipitacion ~ poly(altura, 2, raw = TRUE), data = graphmodels)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.712 -20.410  -2.391  22.887  41.078 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                  -1.134e+01  2.518e+01  -0.450   0.6561  
## poly(altura, 2, raw = TRUE)1  1.774e-01  7.612e-02   2.331   0.0278 *
## poly(altura, 2, raw = TRUE)2 -4.214e-05  5.169e-05  -0.815   0.4223  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.6 on 26 degrees of freedom
## Multiple R-squared:  0.686,  Adjusted R-squared:  0.6618 
## F-statistic:  28.4 on 2 and 26 DF,  p-value: 2.888e-07

predictionspol <- poli %>% predict(graphmodels)
predictionspol #los valores predichos por el modelo

##         1         2         3         4         5         6         7 
## 127.07765  92.69072 132.25852 119.38993  32.41003 126.07928 142.34052 
##         8         9        10        11        12        13        14 
## 105.93256  23.58256  73.76472  60.39084  34.57560 113.71217  61.08609 
##        15        16        17        18        19        20        21 
##  82.97238 144.10921  98.03453 117.92233 132.17315  22.78154  50.72623 
##        22        23        24        25        26        27        28 
## 103.75075  84.34063 139.75179  72.31952  76.88248  67.11347 103.64077 
##        29 
##  86.19000

plot(graphmodels$altura, predictionspol, xlim=range(graphmodels$altura, graphmodels$altura), ylim=range(predictionspol, graphmodels$precipitacion)) 
points(graphmodels$altura, graphmodels$precipitacion, col='red')

rmse(predictionspol, graphmodels$precipitacion) #Error de mínimos cuadrados

## [1] 24.23941

summary(poli)

## 
## Call:
## lm(formula = precipitacion ~ poly(altura, 2, raw = TRUE), data = graphmodels)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.712 -20.410  -2.391  22.887  41.078 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                  -1.134e+01  2.518e+01  -0.450   0.6561  
## poly(altura, 2, raw = TRUE)1  1.774e-01  7.612e-02   2.331   0.0278 *
## poly(altura, 2, raw = TRUE)2 -4.214e-05  5.169e-05  -0.815   0.4223  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.6 on 26 degrees of freedom
## Multiple R-squared:  0.686,  Adjusted R-squared:  0.6618 
## F-statistic:  28.4 on 2 and 26 DF,  p-value: 2.888e-07

Los dos (lineal y polinómico) son muy parecidos:

Lineal RMSE=24.55 pero R2 es 67.8
Polinómico RMSE=24.24 pero R2 es 68.6

De estos dos es mejor el polinómico (a efectos prácticos son iguales).

De esta forma se probarían varios modelos en función de la forma de la relación entre cada par de variables que nos interesan para nuestro estudio.