Package created by Max Kuhn
The caret package (Classification And REgression Training) is a set of functions that aim to streamline the process of building predictive models. The package contains tools for:
data splitting
pre-processing
feature selection
model tuning using resampling
variable importance estimation
Many examples of the scope of this library can be found at the following link: https://topepo.github.io/caret/index.html
install.packages("caret")
install.packages("mlbench")
install.packages("e1071")
install.packages("caTools")
#install.packages("rattle")
library(ggplot2)
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(caret)
Loading required package: lattice
library(mlbench)
library(caTools)
#library(rattle)
The train() function
train() is the function that wraps every model in the caret library.
It takes the following parameters:
data = the dataset to use.
method = the algorithm to use. The following link lists every algorithm that caret wraps: http://topepo.github.io/caret/train-models-by-tag.html
trControl = a list containing the control parameters.
preProcess = a vector naming the pre-processing steps to apply to the data.
metric = a string defining which metric is used to select the optimal model.
maximize = a logical value indicating whether the metric should be maximized or minimized.
tuneGrid = a data frame containing the tuning parameters and the values to try.
?train
set.seed(37)
setControl <- trainControl(
  method = "cv",       ## resampling method
  number = 10,         ## number of folds
  verboseIter = TRUE   ## print the training log
)
fit <- train(
  price ~ . ,            ## formula
  diamonds,              ## data
  method = "lm",         ## method
  trControl = setControl ## control
)
+ Fold01: intercept=TRUE
- Fold01: intercept=TRUE
+ Fold02: intercept=TRUE
- Fold02: intercept=TRUE
+ Fold03: intercept=TRUE
- Fold03: intercept=TRUE
+ Fold04: intercept=TRUE
- Fold04: intercept=TRUE
+ Fold05: intercept=TRUE
- Fold05: intercept=TRUE
+ Fold06: intercept=TRUE
- Fold06: intercept=TRUE
+ Fold07: intercept=TRUE
- Fold07: intercept=TRUE
+ Fold08: intercept=TRUE
- Fold08: intercept=TRUE
+ Fold09: intercept=TRUE
- Fold09: intercept=TRUE
+ Fold10: intercept=TRUE
- Fold10: intercept=TRUE
Aggregating results
Fitting final model on full training set
fit
Linear Regression
53940 samples
9 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 48547, 48547, 48545, 48546, 48545, 48546, ...
Resampling results:
RMSE Rsquared
1155.916 0.9159475
Tuning parameter 'intercept' was held constant at a value
of TRUE
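Predictions from the fitted train object come from predict(); a quick sketch reusing the training data (new observations would normally go here):
head(predict(fit, newdata = diamonds))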
summary(fit)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-21376.0 -592.4 -183.5 376.4 10694.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5753.762 396.630 14.507 < 2e-16 ***
carat 11256.978 48.628 231.494 < 2e-16 ***
cut.L 584.457 22.478 26.001 < 2e-16 ***
cut.Q -301.908 17.994 -16.778 < 2e-16 ***
cut.C 148.035 15.483 9.561 < 2e-16 ***
`cut^4` -20.794 12.377 -1.680 0.09294 .
color.L -1952.160 17.342 -112.570 < 2e-16 ***
color.Q -672.054 15.777 -42.597 < 2e-16 ***
color.C -165.283 14.725 -11.225 < 2e-16 ***
`color^4` 38.195 13.527 2.824 0.00475 **
`color^5` -95.793 12.776 -7.498 6.59e-14 ***
`color^6` -48.466 11.614 -4.173 3.01e-05 ***
clarity.L 4097.431 30.259 135.414 < 2e-16 ***
clarity.Q -1925.004 28.227 -68.197 < 2e-16 ***
clarity.C 982.205 24.152 40.668 < 2e-16 ***
`clarity^4` -364.918 19.285 -18.922 < 2e-16 ***
`clarity^5` 233.563 15.752 14.828 < 2e-16 ***
`clarity^6` 6.883 13.715 0.502 0.61575
`clarity^7` 90.640 12.103 7.489 7.06e-14 ***
depth -63.806 4.535 -14.071 < 2e-16 ***
table -26.474 2.912 -9.092 < 2e-16 ***
x -1008.261 32.898 -30.648 < 2e-16 ***
y 9.609 19.333 0.497 0.61918
z -50.119 33.486 -1.497 0.13448
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1130 on 53916 degrees of freedom
Multiple R-squared: 0.9198, Adjusted R-squared: 0.9198
F-statistic: 2.688e+04 on 23 and 53916 DF, p-value: < 2.2e-16
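Note: the next call combines repeats with method = "cv", where the argument has no effect (recent versions of caret reject the combination outright). Genuinely repeated cross-validation is requested with method = "repeatedcv", e.g.:
setControl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
The call below is kept as originally run: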
setControl <- trainControl(
  method = "cv",
  number = 10,
  repeats = 5,       ## ignored with method = "cv" (see the note above)
  verboseIter = TRUE
)
fit <- train(
price ~ . , diamonds,
method = "lm",
trControl = setControl
)
+ Fold01: intercept=TRUE
- Fold01: intercept=TRUE
+ Fold02: intercept=TRUE
- Fold02: intercept=TRUE
+ Fold03: intercept=TRUE
- Fold03: intercept=TRUE
+ Fold04: intercept=TRUE
- Fold04: intercept=TRUE
+ Fold05: intercept=TRUE
- Fold05: intercept=TRUE
+ Fold06: intercept=TRUE
- Fold06: intercept=TRUE
+ Fold07: intercept=TRUE
- Fold07: intercept=TRUE
+ Fold08: intercept=TRUE
- Fold08: intercept=TRUE
+ Fold09: intercept=TRUE
- Fold09: intercept=TRUE
+ Fold10: intercept=TRUE
- Fold10: intercept=TRUE
Aggregating results
Fitting final model on full training set
fit
Linear Regression
53940 samples
9 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 48545, 48547, 48546, 48547, 48545, 48546, ...
Resampling results:
RMSE Rsquared
1131.204 0.9196058
Tuning parameter 'intercept' was held constant at a value
of TRUE
Classification example: the Sonar dataset from the mlbench package. createDataPartition() builds a stratified split, preserving the class proportions between train and test.
data(Sonar)
train_index <- createDataPartition(Sonar$Class,
                                   p = 0.7,
                                   list = FALSE,
                                   times = 1)
train <- Sonar[train_index,]
test <- Sonar[-train_index,]
table(train$Class) %>% prop.table()
M R
0.5342466 0.4657534
table(test$Class) %>% prop.table()
M R
0.5322581 0.4677419
nrow(train)
[1] 146
nrow(test)
[1] 62
glm_model <- glm(Class ~ ., family = "binomial", train)
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
p<-predict(glm_model, test, type = "response")
p_class <- ifelse(p>0.5, "M", "R")
confusionMatrix(p_class,test$Class)
Confusion Matrix and Statistics
Reference
Prediction M R
M 11 17
R 22 12
Accuracy : 0.371
95% CI : (0.2516, 0.5031)
No Information Rate : 0.5323
P-Value [Acc > NIR] : 0.9963
Kappa : -0.2503
Mcnemar's Test P-Value : 0.5218
Sensitivity : 0.3333
Specificity : 0.4138
Pos Pred Value : 0.3929
Neg Pred Value : 0.3529
Prevalence : 0.5323
Detection Rate : 0.1774
Detection Prevalence : 0.4516
Balanced Accuracy : 0.3736
'Positive' Class : M
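The below-chance accuracy (0.371, negative Kappa) is the symptom of flipped labels: for a binomial glm, the fitted probability refers to the second level of the response factor, which for Sonar$Class is "R", not "M". A corrected mapping, also converting to a factor as confusionMatrix() expects, would be:
p_class <- factor(ifelse(p > 0.5, "R", "M"), levels = levels(test$Class))
confusionMatrix(p_class, test$Class)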
colAUC(p,test$Class, plotROC = TRUE)
[,1]
M vs. R 0.6567398
glm_model <- train(Class ~ . ,
data=Sonar,
method = "glm",
trControl = trainControl(
method = "cv",
number = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = TRUE
)
)
The metric "Accuracy" was not in the result set. ROC will be used instead.
+ Fold01: parameter=none
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
- Fold01: parameter=none
+ Fold02: parameter=none
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
- Fold02: parameter=none
+ Fold03: parameter=none
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
- Fold03: parameter=none
+ Fold04: parameter=none
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
- Fold04: parameter=none
+ Fold05: parameter=none
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
- Fold05: parameter=none
+ Fold06: parameter=none
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
- Fold06: parameter=none
+ Fold07: parameter=none
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
- Fold07: parameter=none
+ Fold08: parameter=none
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
- Fold08: parameter=none
+ Fold09: parameter=none
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
- Fold09: parameter=none
+ Fold10: parameter=none
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
- Fold10: parameter=none
Aggregating results
Fitting final model on full training set
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm_model
Generalized Linear Model
208 samples
60 predictor
2 classes: 'M', 'R'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 188, 187, 187, 187, 187, 188, ...
Resampling results:
ROC Sens Spec
0.7481566 0.7856061 0.69
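Since classProbs = TRUE was set, the trained model can also return class probabilities directly; a quick usage sketch:
head(predict(glm_model, newdata = Sonar, type = "prob"))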
# install.packages("doMC")
library(doMC)
Loading required package: foreach
foreach: simple, scalable parallel programming from Revolution Analytics
Use Revolution R for scalability, fault tolerance and more.
http://www.revolutionanalytics.com
Loading required package: iterators
Loading required package: parallel
nucleos <- 4 # this value depends on the machine
registerDoMC(nucleos)
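Note: doMC is based on forking and is not available on Windows; there the doParallel package is the usual alternative. A minimal sketch:
library(doParallel)
cl <- makeCluster(4)   # number of workers, machine-dependent
registerDoParallel(cl)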
Imputation
Reference: http://machinelearningmastery.com/pre-process-your-dataset-in-r/
Randomly missing data.
data(mtcars)
set.seed(42)
mtcars[sample(1:nrow(mtcars), 10), "hp"] <- NA
mtcars_x <- mtcars[,-1]
mtcars_y <- mtcars[,1]
mtcars_x$hp
[1] 110 110 93 NA 175 105 245 62 NA 123 123 180 180 180
[15] NA NA NA NA 52 NA 97 150 150 245 NA 66 91 113
[29] 264 NA 335 NA
mtcars_impute <- preProcess(mtcars_x,method = "medianImpute")
mtcars_impute <- predict(mtcars_impute,mtcars)
data.frame(mtcars_impute$hp,mtcars_x$hp)
Non-randomly missing data.
mtcars_impute <- preProcess(mtcars_x,method = "knnImpute")
mtcars_impute <- predict(mtcars_impute,mtcars)
data.frame(mtcars_impute$hp,mtcars$hp)
## install.packages("RANN")
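Note that "knnImpute" requires the RANN package (hence the install line above) and, as a side effect, centers and scales every predictor before searching for neighbors; this shows up in the pre-processing line of the model output further below.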
data(mtcars)
set.seed(42)
mtcars[sample(1:nrow(mtcars), 10), "hp"] <- NA
Y <- mtcars$mpg
X <- mtcars[, 2:4]
setControl <- trainControl(
  method = "cv",
  number = 5,
  repeats = 20,       ## ignored with method = "cv" (see the note above)
  verboseIter = FALSE
)
model <- train(x = X, y = Y,
               method = "glm",
               preProcess = "medianImpute",
               trControl = setControl)
model
Generalized Linear Model
32 samples
3 predictor
Pre-processing: median imputation (3)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 25, 26, 26, 26, 25
Resampling results:
RMSE Rsquared
3.023511 0.7887627
model2 <- train(x = X, y = Y,
                method = "glm",
                preProcess = "knnImpute",
                trControl = setControl)
model2
Generalized Linear Model
32 samples
3 predictor
Pre-processing: nearest neighbor imputation (3),
centered (3), scaled (3)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 26, 24, 26, 26, 26
Resampling results:
RMSE Rsquared
3.258643 0.8061497
"center" subtracts the mean of the data.
mtcars_impute <- preProcess(mtcars_x,method = c("medianImpute","center"))
mtcars_impute <- predict(mtcars_impute,mtcars_x)
data.frame(mtcars_impute$hp,mtcars$hp)
"scale" divides the data by its standard deviation.
mtcars_impute <- preProcess(mtcars_x,method = c("medianImpute","scale"))
mtcars_impute <- predict(mtcars_impute,mtcars_x)
data.frame(mtcars_impute$hp,mtcars$hp)
summary(mtcars_impute)
cyl disp hp
Min. :2.240 Min. :0.5737 Min. :0.7169
1st Qu.:2.240 1st Qu.:0.9749 1st Qu.:1.5166
Median :3.360 Median :1.5838 Median :1.6958
Mean :3.465 Mean :1.8616 Mean :1.9298
3rd Qu.:4.479 3rd Qu.:2.6303 3rd Qu.:2.1543
Max. :4.479 Max. :3.8083 Max. :4.6188
drat wt qsec
Min. :5.162 Min. :1.546 Min. : 8.114
1st Qu.:5.760 1st Qu.:2.638 1st Qu.: 9.453
Median :6.911 Median :3.398 Median : 9.911
Mean :6.727 Mean :3.288 Mean : 9.988
3rd Qu.:7.332 3rd Qu.:3.689 3rd Qu.:10.577
Max. :9.220 Max. :5.543 Max. :12.815
vs am gear
Min. :0.000 Min. :0.0000 Min. :4.066
1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:4.066
Median :0.000 Median :0.0000 Median :5.421
Mean :0.868 Mean :0.8141 Mean :4.998
3rd Qu.:1.984 3rd Qu.:2.0040 3rd Qu.:5.421
Max. :1.984 Max. :2.0040 Max. :6.777
carb
Min. :0.6191
1st Qu.:1.2382
Median :1.2382
Mean :1.7413
3rd Qu.:2.4765
Max. :4.9529
mtcars_impute <- preProcess(mtcars_x,method = c("medianImpute","center","scale"))
mtcars_impute <- predict(mtcars_impute,mtcars_x)
data.frame(mtcars_impute$hp,mtcars$hp)
summary(mtcars_impute)
cyl disp hp
Min. :-1.225 Min. :-1.2879 Min. :-1.3192
1st Qu.:-1.225 1st Qu.:-0.8867 1st Qu.:-0.5195
Median :-0.105 Median :-0.2777 Median :-0.3403
Mean : 0.000 Mean : 0.0000 Mean :-0.1063
3rd Qu.: 1.015 3rd Qu.: 0.7688 3rd Qu.: 0.1181
Max. : 1.015 Max. : 1.9468 Max. : 2.5826
drat wt qsec
Min. :-1.5646 Min. :-1.7418 Min. :-1.87401
1st Qu.:-0.9661 1st Qu.:-0.6500 1st Qu.:-0.53513
Median : 0.1841 Median : 0.1101 Median :-0.07765
Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
3rd Qu.: 0.6049 3rd Qu.: 0.4014 3rd Qu.: 0.58830
Max. : 2.4939 Max. : 2.2553 Max. : 2.82675
vs am gear
Min. :-0.868 Min. :-0.8141 Min. :-0.9318
1st Qu.:-0.868 1st Qu.:-0.8141 1st Qu.:-0.9318
Median :-0.868 Median :-0.8141 Median : 0.4236
Mean : 0.000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 1.116 3rd Qu.: 1.1899 3rd Qu.: 0.4236
Max. : 1.116 Max. : 1.1899 Max. : 1.7789
carb
Min. :-1.1222
1st Qu.:-0.5030
Median :-0.5030
Mean : 0.0000
3rd Qu.: 0.7352
Max. : 3.2117
model3 <- train(x = X, y = Y,
method = "glm",
preProcess = c("medianImpute","center","scale"),
trControl = setControl)
model3
Generalized Linear Model
32 samples
3 predictor
Pre-processing: median imputation (3), centered (3),
scaled (3)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 24, 27, 25, 26, 26
Resampling results:
RMSE Rsquared
3.062007 0.7844348
model4 <- train(x = X, y = Y,
                method = "glm",
                preProcess = c("knnImpute", "center", "scale"),
                trControl = setControl)
model4
Generalized Linear Model
32 samples
3 predictor
Pre-processing: nearest neighbor imputation (3),
centered (3), scaled (3)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 27, 25, 25, 25, 26
Resampling results:
RMSE Rsquared
3.056672 0.7762586
The Box-Cox ("BoxCox") transformation re-expresses the data so that it becomes more linear.
mtcars_impute <- preProcess(mtcars_x,method = c("medianImpute","BoxCox"))
mtcars_impute <- predict(mtcars_impute,mtcars_x)
par(mfrow=c(1,2))
plot(mtcars_impute$disp,mtcars_y)
plot(mtcars$disp,mtcars$mpg)
boxcox_fit1<-lm(mtcars_y~mtcars$disp)
boxcox_fit2<-lm(mtcars_y~mtcars_impute$disp)
data.frame(fit1=summary(boxcox_fit1)$r.squared,
fit2=summary(boxcox_fit2)$r.squared)
If the data contains negative values, the Box-Cox transformation cannot be applied; the Yeo-Johnson ("YeoJohnson") transformation should be used instead.
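A minimal sketch of the Yeo-Johnson alternative, reusing the same X, Y, and setControl defined above:
model_yj <- train(x = X, y = Y,
                  method = "glm",
                  preProcess = c("medianImpute", "center", "scale", "YeoJohnson"),
                  trControl = setControl)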
model5 <- train(x = X, y = Y,
method = "glm",
preProcess = c("medianImpute","center","scale","BoxCox"),
trControl = setControl)
model5
Generalized Linear Model
32 samples
3 predictor
Pre-processing: median imputation (3), centered (3),
scaled (3), Box-Cox transformation (3)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 24, 26, 26, 26, 26
Resampling results:
RMSE Rsquared
2.832923 0.838246
model6 <- train(x = X, y = Y,
method = "glm",
preProcess = c("knnImpute","center","scale","BoxCox"),
trControl = setControl)
model6
Generalized Linear Model
32 samples
3 predictor
Pre-processing: nearest neighbor imputation (3),
centered (3), scaled (3), Box-Cox transformation (3)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 26, 25, 27, 25, 25
Resampling results:
RMSE Rsquared
2.708689 0.8331811
model_list <- list(mod1 = model,
                   mod2 = model2,
                   mod3 = model3,
                   mod4 = model4,
                   mod5 = model5,
                   mod6 = model6)
resamples <- resamples(model_list)
dotplot(resamples,metric = "RMSE")
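Beyond the dot plot, the resamples object supports other standard comparison views:
summary(resamples)
bwplot(resamples, metric = "RMSE")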