In regression, the relevant measure of model quality is predictive accuracy: in other words, how close the model's predictions are to what actually happens.
A common and serious mistake when measuring predictive accuracy is to make predictions with the training data and then compare those predictions against the target values of that same training data.
There are many metrics for summarizing model quality; the most widely used are the mean absolute error (MAE) and the coefficient of determination (R²).
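For reference, for observations \(y_i\), predictions \(\hat{y}_i\), and mean observed value \(\bar{y}\), these metrics (together with the RMSE that caret reports below) can be written in their standard form as:
\[ \operatorname{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert, \qquad \operatorname{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \]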
The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for:
data splitting
pre-processing
feature selection
model tuning using resampling
variable importance estimation
Models available in caret: https://topepo.github.io/caret/available-models.html
In this video I explain everything.
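Before the worked example, here is a minimal sketch of the typical caret workflow, using the built-in mtcars data; the dataset and formula are illustrative placeholders, not part of the example developed below.
library(caret)
idx <- createDataPartition(mtcars$mpg, p = 0.8, list = FALSE)   # split the data
entrenamiento <- mtcars[idx, ]; prueba <- mtcars[-idx, ]        # training / testing sets
ctrl   <- trainControl(method = "cv", number = 5)               # resampling scheme
ajuste <- train(mpg ~ wt + hp, data = entrenamiento, method = "lm", trControl = ctrl)
postResample(predict(ajuste, newdata = prueba), prueba$mpg)     # accuracy on unseen data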
library(caret)        # hundreds of modeling functions
library(DT)           # interactive tables (datatable)
library(tidyr)        # drop_na() and the %>% pipe
library(performance)  # model diagnostics (check_* functions)
library(car)          # outlierTest()
library(effects)      # predictorEffects()
library(equatiomatic) # extract_eq()
Datos <- airquality          # built-in daily air-quality measurements
data  <- Datos %>% drop_na() # name your data "data"; drop rows with missing values
attach(data)
datatable(data)
# in case the response needs to be normalized
variable <- Ozone                        # write the name of your response variable "y"
lambda <- BoxCoxTrans(variable)$lambda   # Box-Cox lambda estimated by caret
data$variable_box <- variable^lambda     # power-transformed response
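As an optional check, not part of the original script, one could compare the distribution of the response before and after the transformation:
hist(variable)                    # original response
hist(data$variable_box)           # transformed response
shapiro.test(data$variable_box)   # formal normality test of the transformed values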
The ultimate goal of a predictive model is to predict data it has never seen before. To assess that, we split the original dataset in two: 1) "training", used to fit the model, and 2) "testing", used to validate the model's performance on new data.
set.seed(123456)
samp <- createDataPartition(data$Ozone, p = 0.8, list = FALSE)  # partition on the response variable
training <- data[samp,]; datatable(training)  # training set
testing  <- data[-samp,]; testing             # validation (testing) set
## Ozone Solar.R Wind Temp Month Day variable_box
## 13 34 307 12.0 66 5 17 2.024397
## 17 1 8 9.7 59 5 21 1.000000
## 21 23 13 12.0 67 5 28 1.872171
## 22 45 252 14.9 81 5 29 2.141127
## 25 29 127 9.7 82 6 7 1.961009
## 31 20 37 9.2 65 6 18 1.820564
## 35 49 248 9.2 85 7 2 2.177906
## 44 27 175 14.9 81 7 13 1.933182
## 51 16 7 6.9 74 7 21 1.741101
## 53 108 223 8.0 85 7 25 2.550849
## 54 20 81 8.6 82 7 26 1.820564
## 61 9 24 13.8 81 8 2 1.551846
## 65 110 207 8.0 90 8 9 2.560227
## 70 59 51 6.3 79 8 17 2.260322
## 73 44 190 10.3 78 8 20 2.131526
## 78 73 215 8.0 86 8 26 2.358656
## 81 84 237 6.3 96 8 30 2.425805
## 83 96 167 6.9 91 9 1 2.491462
## 99 18 224 13.8 67 9 17 1.782602
## 100 13 27 10.3 76 9 18 1.670278
## 110 18 131 8.0 76 9 29 1.782602
str(training)  # how many observations are in this set?
## 'data.frame': 90 obs. of 7 variables:
## $ Ozone : int 41 36 12 18 23 19 8 16 11 14 ...
## $ Solar.R : int 190 118 149 313 299 99 19 256 290 274 ...
## $ Wind : num 7.4 8 12.6 11.5 8.6 13.8 20.1 9.7 9.2 10.9 ...
## $ Temp : int 67 72 74 62 65 59 61 69 66 68 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 7 8 9 12 13 14 ...
## $ variable_box: num 2.1 2.05 1.64 1.78 1.87 ...
str(testing)  # how many observations are in this set?
## 'data.frame': 21 obs. of 7 variables:
## $ Ozone : int 34 1 23 45 29 20 49 27 16 108 ...
## $ Solar.R : int 307 8 13 252 127 37 248 175 7 223 ...
## $ Wind : num 12 9.7 12 14.9 9.7 9.2 9.2 14.9 6.9 8 ...
## $ Temp : int 66 59 67 81 82 65 85 81 74 85 ...
## $ Month : int 5 5 5 5 6 6 7 7 7 7 ...
## $ Day : int 17 21 28 29 7 18 2 13 21 25 ...
## $ variable_box: num 2.02 1 1.87 2.14 1.96 ...
modelo <- lm(variable_box ~ Temp + Wind, data = training)
summary(modelo)  # ordinary least-squares fit
##
## Call:
## lm(formula = variable_box ~ Temp + Wind, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45481 -0.10772 0.00203 0.10324 0.51774
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.755515 0.236531 3.194 0.00195 **
## Temp 0.019908 0.002519 7.904 7.74e-12 ***
## Wind -0.029201 0.006467 -4.516 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1944 on 87 degrees of freedom
## Multiple R-squared: 0.6547, Adjusted R-squared: 0.6467
## F-statistic: 82.46 on 2 and 87 DF, p-value: < 2.2e-16
outlierTest(modelo, cutoff=Inf, n.max=5)  # Bonferroni test to detect outliers
## rstudent unadjusted p-value Bonferroni p
## 77 2.842095 0.0055962 0.50366
## 30 2.688110 0.0086272 0.77645
## 45 -2.451994 0.0162290 NA
## 62 -2.287895 0.0245970 NA
## 23 2.249548 0.0270310 NA
check_heteroskedasticity(modelo)  # check homoscedasticity of the residuals
## OK: Error variance appears to be homoscedastic (p = 0.260).
check_autocorrelation(modelo)  # check residual autocorrelation
## OK: Residuals appear to be independent and not autocorrelated (p = 0.680).
check_collinearity(modelo)  # check multicollinearity
## # Check for Multicollinearity
##
## Low Correlation
##
## Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
## Temp 1.37 [1.15, 1.93] 1.17 0.73 [0.52, 0.87]
## Wind 1.37 [1.15, 1.93] 1.17 0.73 [0.52, 0.87]
check_normality(modelo)  # check normality of the residuals
## OK: residuals appear as normally distributed (p = 0.504).
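If preferred, the performance package can also display these diagnostics in a single panel; a one-line sketch, assuming the companion see package is installed for the plots:
check_model(modelo)  # combined visual diagnostics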
plot(predictorEffects(modelo))  # partial-effect plots for each predictor
extract_eq(modelo)  # typeset the fitted equation
\[ \operatorname{variable\_box} = \alpha + \beta_{1}(\operatorname{Temp}) + \beta_{2}(\operatorname{Wind}) + \epsilon \]
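Because modelo was fitted on the Box-Cox scale (variable_box), here is a minimal sketch of how predictions could be mapped back to the original ozone units, using the lambda estimated above; this step is an addition, not part of the original script:
pred_box   <- predict(modelo, newdata = testing)  # predictions on the transformed scale
pred_ozone <- pred_box^(1 / lambda)               # invert the power transform
plot(testing$Ozone, pred_ozone)                   # observed vs. predicted on the testing set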
modelo_ml <- train(Ozone ~ Temp + Wind, data = training, method = "lm")
summary(modelo_ml)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.227 -13.259 -3.723 9.561 97.763
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.0629 26.5148 -2.152 0.0342 *
## Temp 1.7156 0.2823 6.077 3.16e-08 ***
## Wind -3.4314 0.7249 -4.733 8.49e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.8 on 87 degrees of freedom
## Multiple R-squared: 0.5851, Adjusted R-squared: 0.5755
## F-statistic: 61.34 on 2 and 87 DF, p-value: < 2.2e-16
modelo_ml  # view the resampling details
## Linear Regression
##
## 90 samples
## 2 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 22.23063 0.5835891 16.76466
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
modelo_ml$results  # view the fit metrics
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 22.23063 0.5835891 16.76466 3.3381 0.08442931 2.125653
modelo_ml$resample  # view the metrics for each resample
## RMSE Rsquared MAE Resample
## 1 27.01189 0.4044461 20.84990 Resample01
## 2 24.43105 0.5340034 17.78975 Resample02
## 3 23.07315 0.5846296 16.46979 Resample03
## 4 19.14682 0.5596073 16.07436 Resample04
## 5 27.82705 0.5959700 19.02352 Resample05
## 6 21.18121 0.6286857 16.49026 Resample06
## 7 14.90604 0.7433432 11.65520 Resample07
## 8 26.25651 0.4806278 17.22014 Resample08
## 9 18.69297 0.6438457 14.09572 Resample09
## 10 24.00819 0.6222857 15.24180 Resample10
## 11 18.21353 0.6506005 13.53251 Resample11
## 12 19.28408 0.6482115 14.84779 Resample12
## 13 21.93684 0.6288234 15.61944 Resample13
## 14 18.24770 0.7510430 15.69221 Resample14
## 15 24.77278 0.4467025 19.80525 Resample15
## 16 22.83910 0.5529851 15.83201 Resample16
## 17 20.92122 0.5959891 16.69518 Resample17
## 18 20.56027 0.5736169 16.54761 Resample18
## 19 21.70820 0.5009911 18.08607 Resample19
## 20 25.07390 0.5965782 16.98100 Resample20
## 21 21.15156 0.5853800 17.29343 Resample21
## 22 21.82742 0.6804282 17.62581 Resample22
## 23 26.10257 0.5103134 19.00201 Resample23
## 24 27.32717 0.4743590 20.53402 Resample24
## 25 19.26464 0.5962612 16.11170 Resample25
modelo_ml$finalModel  # view the coefficients
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Coefficients:
## (Intercept) Temp Wind
## -57.063 1.716 -3.431
dotPlot(varImp(modelo_ml))  # variable importance
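Consistent with the idea of measuring predictive accuracy on data the model has never seen, a short sketch of how modelo_ml could be evaluated on the held-out testing set (an addition to the original script):
pred_test <- predict(modelo_ml, newdata = testing)   # predictions for unseen data
postResample(pred = pred_test, obs = testing$Ozone)  # RMSE, R2 and MAE on the testing set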
# WRITE THE OBJECT HERE. IMPORTANT: it is NOT an lm object
que_escribir <- modelo$call; que_escribir  # recall the formula used above
## lm(formula = variable_box ~ Temp + Wind, data = training)
objeto <- variable_box ~ Temp + Wind  # use this form (a formula object)
# Define training control
set.seed(123)
metodo_boot <- trainControl(method = "boot", number = 100)
# fit a regression model
model_boot <- train(objeto,                   # the model formula
                    data = testing,           # the data
                    method = "lm",            # the method
                    trControl = metodo_boot)  # the resampling scheme
# view summary of the bootstrap results
print(model_boot)
## Linear Regression
##
## 21 samples
## 2 predictor
##
## No pre-processing
## Resampling: Bootstrapped (100 reps)
## Summary of sample sizes: 21, 21, 21, 21, 21, 21, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.3035441 0.4987824 0.257456
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Define training control
set.seed(123)
metodo_cv <- trainControl(method = "cv", number = 10)
# fit a regression model
model_cv <- train(objeto,                 # the model formula
                  data = testing,         # the data
                  method = "lm",          # the method
                  trControl = metodo_cv)  # the resampling scheme
# view summary of the cross-validation results
print(model_cv)
## Linear Regression
##
## 21 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 18, 19, 19, 19, 19, 19, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.2504776 0.9990052 0.2334293
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Define training control
set.seed(123)
metodo_LOOCV <- trainControl(method = "LOOCV")
# fit a regression model
model_LOOCV <- train(objeto,                    # the model formula
                     data = testing,            # the data
                     method = "lm",             # the method
                     trControl = metodo_LOOCV)  # the resampling scheme
# view summary of the LOOCV results
print(model_LOOCV)
## Linear Regression
##
## 21 samples
## 2 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 20, 20, 20, 20, 20, 20, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.2899657 0.3987553 0.2478356
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Define training control
set.seed(123)
metodo_repeatedcv <- trainControl(method = "repeatedcv", repeats = 10)
# fit a regression model
model_repeatedcv <- train(objeto,                         # the model formula
                          data = testing,                 # the data
                          method = "lm",                  # the method
                          trControl = metodo_repeatedcv)  # the resampling scheme
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
# view summary of the repeated cross-validation results
print(model_repeatedcv)
## Linear Regression
##
## 21 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 18, 19, 19, 19, 19, 19, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.2632832 0.9575559 0.2429467
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
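To close, one convenient way to put the four resampling estimates side by side is to bind the results tables of the fitted objects; this comparison is an addition, assuming the four models above are in the session:
comparacion <- rbind(
  boot       = model_boot$results[, c("RMSE", "Rsquared", "MAE")],
  cv         = model_cv$results[, c("RMSE", "Rsquared", "MAE")],
  LOOCV      = model_LOOCV$results[, c("RMSE", "Rsquared", "MAE")],
  repeatedcv = model_repeatedcv$results[, c("RMSE", "Rsquared", "MAE")]
)
comparacion   # mean RMSE, R2 and MAE under each resampling scheme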