The caret package is a framework that unifies different methods for building predictive models. Its manual is available on the project's GitHub page.
set.seed(1)
library(caret)
library(readr)
library(tidyr)
library(dplyr)  # provides select() and %>%, both used later in this section
raw_data <- read.csv("train.csv", stringsAsFactors=FALSE)
raw_data
train.csv is our only file containing observations, so we use it both to train and to evaluate our machine learning model. First we split train.csv in two with the createDataPartition function. The function takes as parameters the class (Survived in this case) and the proportion in which to split the dataset. Here we use an 80/20 split: 80% of the dataset for training and 20% for testing.
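The split described above does not appear as code in this section, so here is a minimal sketch of how it could look. The toy data frame `toy` stands in for `raw_data`, and the names `train_idx`, `toy_train`, and `toy_test` are illustrative assumptions; with the real data you would pass `raw_data$Survived` instead.

```r
library(caret)

set.seed(1)

# Toy stand-in for raw_data (assumption): 100 rows with a binary outcome.
toy <- data.frame(
  Survived = rep(c(0, 1), times = c(60, 40)),
  Fare     = runif(100, min = 5, max = 100)
)

# Stratified sampling on the outcome: about 80% of each class goes to training.
# as.factor() makes createDataPartition sample within each class.
train_idx <- createDataPartition(as.factor(toy$Survived), p = 0.8, list = FALSE)

toy_train <- toy[train_idx, ]   # rows selected for training
toy_test  <- toy[-train_idx, ]  # remaining rows, held out for testing

nrow(toy_train)
nrow(toy_test)
```

Because the sampling is stratified, the proportion of survivors should be roughly the same in both subsets.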
Several techniques are available for tuning model parameters, and all of them work on train_data. Typically these are resampling techniques, which split the data in different ways in order to train a model and then validate the results. caret handles this whole process automatically through trainControl. Here we tell it to apply 5-fold cross-validation: train_data is split into 5 parts, and in each round 4 parts are used to train a model and the remaining part is used to validate it.
train_control <- trainControl(method="cv", number=5)
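To make the fold structure concrete, the following sketch uses caret's `createFolds` helper on a toy outcome vector. The vector `y` and the loop are illustrative assumptions; conceptually, `train()` carries out an equivalent split on its own when it receives a `trainControl` object with `method = "cv"`.

```r
library(caret)

set.seed(1)

# Toy outcome (assumption): 50 observations in two classes.
y <- factor(rep(c("no", "yes"), times = c(30, 20)))

# Each list element holds the row indices held out for validation in one round.
folds <- createFolds(y, k = 5, list = TRUE, returnTrain = FALSE)

for (name in names(folds)) {
  held_out <- folds[[name]]
  cat(name, ": train on", length(y) - length(held_out),
      "rows, validate on", length(held_out), "rows\n")
}
```

Every row lands in exactly one validation fold, so each observation is used for validation once and for training in the other four rounds.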
The Survived field takes the values 0 and 1, which can be incorrectly treated as numbers when they actually represent categories. It is therefore necessary to convert them to a factor.
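A minimal sketch of that conversion, using a toy data frame as a stand-in for train_data. The labels "No" and "Yes" are an assumption chosen for illustration; level names that are valid R variable names matter if class probabilities are later requested through trainControl.

```r
# Toy stand-in for train_data (assumption).
toy <- data.frame(Survived = c(0, 1, 1, 0, 1))

# Map 0/1 to a two-level factor so the outcome is treated as a class, not a number.
toy$Survived <- factor(toy$Survived, levels = c(0, 1), labels = c("No", "Yes"))

str(toy$Survived)
levels(toy$Survived)
```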
The train function trains a model. Its first argument, Survived ~ ., uses what is known as formula syntax in R. It tells the function to use Survived as the class to predict, and the . means that ALL the remaining attributes (columns) are used during training.
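To see what the `.` expands to, base R's `terms()` can resolve a formula against a data frame. The toy data frame and its values below are assumptions for illustration only.

```r
# Toy data frame (assumption) with one outcome and two predictors.
toy <- data.frame(
  Survived = c(0, 1, 1),
  Pclass   = c(3, 1, 2),
  Fare     = c(7.25, 71.28, 13.00)
)

# terms() expands the dot into every column except the left-hand side.
expanded   <- terms(Survived ~ ., data = toy)
predictors <- attr(expanded, "term.labels")

print(predictors)
```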
glm_model1 <- train(Survived ~ ., data = train_data, method = "glm", family = "binomial",trainControl=train_control)
Error in na.fail.default(list(Survived = c(0L, 1L, 1L, 1L, 0L, 0L, 0L, :
missing values in object
The model fails because some records contain missing values (NA). We can verify this:
sapply(raw_data, anyNA)
PassengerId    Survived      Pclass        Name         Sex         Age       SibSp       Parch      Ticket        Fare       Cabin
      FALSE       FALSE       FALSE       FALSE       FALSE        TRUE       FALSE       FALSE       FALSE       FALSE       FALSE
   Embarked
      FALSE
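Note that anyNA only detects values stored as NA. With stringsAsFactors=FALSE, blank text fields are read as empty strings rather than NA, so a column can be incomplete and still report FALSE. The sketch below counts both kinds of gaps on a toy data frame; the data and the names `na_count` and `empty_count` are assumptions for illustration.

```r
# Toy stand-in (assumption) mixing real NAs with empty strings.
toy <- data.frame(
  Age      = c(22, NA, 35, NA),
  Cabin    = c("C85", "", "", "E46"),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Values stored as NA, per column.
na_count <- colSums(is.na(toy))

# Values that are present but empty, per column.
empty_count <- sapply(toy, function(col) sum(!is.na(col) & col == ""))

print(na_count)
print(empty_count)
```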
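The output flags only Age as containing NA values. The Cabin and Embarked fields are also incomplete, but presumably as empty strings rather than NA, which is why anyNA reports FALSE for them. We could try to fix this, but we will come back to it later. For now we simply remove these columns.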
train_data_removed<-train_data %>% select(-Age,-Cabin,-Embarked)
glm_model1 <- train(Survived ~ ., data = train_data_removed, method = "glm", family = "binomial",trainControl=train_control)
You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
model fit failed for Resample01: parameter=none Error in glm.control(trainControl = list(method = "cv", number = 5, repeats = NA, :
unused argument (trainControl = list("cv", 5, NA, "grid", 0.75, NULL, 1, TRUE, 0, FALSE, TRUE, "final", FALSE, FALSE, function (data, lev = NULL, model = NULL)
{
if (is.character(data$obs)) data$obs <- factor(data$obs, levels = lev)
postResample(data[, "pred"], data[, "obs"])
}, "best", list(0.95, 3, 5, 19, 10, 0.9), NULL, NULL, NULL, NULL, 0, c(FALSE, FALSE), NA, list(5, 0.05, "gls", TRUE), FALSE, TRUE))
[The same "model fit failed" message repeats for Resample02 through Resample25.]
There were missing values in resampled performance measures.
Something is wrong; all the RMSE metric values are missing:
RMSE Rsquared MAE
Min. : NA Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA Median : NA
Mean :NaN Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA Max. : NA
NA's :1 NA's :1 NA's :1
Error: Stopping
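The train() function fails again, and the output above points to two causes. First, the resampling argument of train() is named trControl; the misspelled trainControl is forwarded to glm(), which rejects it as an unused argument. This is also why 25 resamples (the default) are attempted instead of the 5 folds we configured. Second, Survived is still numeric, so caret attempts a regression instead of a classification. In addition, the data still contains non-numeric attributes (columns) holding free text, such as Name and Ticket, which are not usable as predictors in their current form. This can also be solved, but for now we simply remove them.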
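# Drop the free-text and incomplete columns (Sex is dropped as well for now).
train_data_removed <- train_data %>% select(-Name, -Ticket, -Age, -Cabin, -Embarked, -Sex)
# Note the argument name: trControl, not trainControl.
glm_model1 <- train(Survived ~ ., data = train_data_removed, method = "glm", family = "binomial", trControl = train_control)
# Survived is still numeric (0/1) here, so the predictions are probabilities that we threshold at 0.5.
PredTrain <- predict(glm_model1, newdata = train_data_removed, type = "raw")
table(train_data_removed$Survived, PredTrain > 0.5)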
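The confusionMatrix() function from the caret package computes the confusion matrix along with other metrics. Its first argument holds the predictions and the second the reference (true) values.
confusionMatrix(data = factor(ifelse(PredTrain > 0.5, 1, 0), levels = c(0, 1)), reference = factor(train_data_removed$Survived, levels = c(0, 1)))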
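To show where some of those metrics come from, here is a minimal base R sketch that derives accuracy, sensitivity, and specificity from a confusion table. The `actual` and `predicted` vectors are made-up values, and class 1 is treated as the positive class by assumption; caret itself uses the first factor level as the positive class unless the `positive` argument says otherwise.

```r
# Made-up labels for illustration (assumption), not results from the model above.
actual    <- factor(c(0, 0, 1, 1, 1, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(0, 1, 1, 1, 0, 0, 1, 0), levels = c(0, 1))

# Rows are predictions, columns are true classes.
cm <- table(Predicted = predicted, Actual = actual)

accuracy    <- sum(diag(cm)) / sum(cm)          # correct predictions over all predictions
sensitivity <- cm["1", "1"] / sum(cm[, "1"])    # true positives over all actual positives
specificity <- cm["0", "0"] / sum(cm[, "0"])    # true negatives over all actual negatives

print(cm)
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```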