The caret package is a framework that unifies different methods for building predictive models. The manual is available on the package's GitHub page.
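set.seed(1)  # fix the random seed so the split and the resampling are reproducible
library(caret)
library(readr)
library(tidyr)
library(dplyr)  # provides select(), used further below together with %>%
# Load the training data, keeping text columns as character vectors
raw_data <- read.csv("train.csv", stringsAsFactors = FALSE)
raw_data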
Our train.csv file is the only file that contains observations, so we will use it both to train and to evaluate our machine learning model. First we split train.csv in two with the createDataPartition function. The function takes as parameters the class (Survived in this case) and the percentage used to split the dataset. Here we use an 80/20 split: 80% of the dataset is used for training and 20% for testing.
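The split itself is not shown in this notebook, so here is a minimal sketch of how it might look. It runs on a small synthetic data frame (toy, an assumption so the block works without train.csv). The name train_data matches the one used later in the text, while train_index and test_data are assumed names. The class is passed as a factor so that caret samples within each class level.

```r
library(caret)

set.seed(1)

# Synthetic stand-in for raw_data (assumption: the real rows come from train.csv)
toy <- data.frame(
  Survived = rep(c(0, 1), times = c(60, 40)),
  Fare     = runif(100, min = 5, max = 100)
)

# Stratified 80/20 split on the class; list = FALSE returns the row indices
train_index <- createDataPartition(factor(toy$Survived), p = 0.8, list = FALSE)

train_data <- toy[train_index, ]   # 80% of the rows, used for training
test_data  <- toy[-train_index, ]  # remaining 20%, held out for testing

nrow(train_data)
nrow(test_data)
```

Because the sampling is done per class, the share of survivors in train_data stays the same as in the full toy data.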
To tune the parameters of the models, several techniques are applied to train_data. Usually these are resampling techniques, which partition the data in different ways in order to train and then validate the results. caret handles this whole process automatically through trainControl. Here we request 5-fold cross-validation (method="cv", number=5). This means that train_data is split into 5 parts: 4 parts are used to train a model and the remaining one to validate it, rotating so that each part is used for validation once.
train_control <- trainControl(method="cv", number=5)
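caret builds the folds internally, so they are never visible in the notebook. As an illustration only, the sketch below uses caret's createFolds on a hypothetical class vector (y, an assumption standing in for train_data$Survived) to show what a partition with number = 5 looks like.

```r
library(caret)

set.seed(1)

# Hypothetical class vector standing in for train_data$Survived
y <- factor(rep(c(0, 1), times = c(60, 40)))

# Assign every row to exactly one of 5 folds
folds <- createFolds(y, k = 5, list = TRUE, returnTrain = FALSE)

# Each fold is held out once for validation; the other rows are used for training
for (i in seq_along(folds)) {
  held_out <- folds[[i]]
  training <- setdiff(seq_along(y), held_out)
  cat(sprintf("fold %d: train on %d rows, validate on %d rows\n",
              i, length(training), length(held_out)))
}
```

The reported performance is then the average of the metric over the held-out folds.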
The Survived field takes the values 0 and 1, which can be mistaken for numbers when they are actually categories. It is therefore necessary to convert them to factors.
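A minimal sketch of that conversion, using a hypothetical five-row data frame (toy) in place of the real data:

```r
# Hypothetical outcome column coded as integers, as it comes out of a CSV file
toy <- data.frame(Survived = c(0L, 1L, 1L, 0L, 1L))

class(toy$Survived)  # "integer": a model would treat it as a quantity

# Convert to a factor with two levels, so it is treated as a category
toy$Survived <- factor(toy$Survived, levels = c(0, 1))

class(toy$Survived)
levels(toy$Survived)
table(toy$Survived)
```

With a factor outcome caret treats the problem as classification, and predict() returns class labels rather than numbers.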
The train function trains a model. The first parameter, Survived ~ ., is what is known as formula syntax in R. It tells the function to use Survived as the class to predict, and the . means that ALL the other attributes (columns) are used during training.
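To make the meaning of the dot concrete, this base R sketch expands a formula against a hypothetical data frame (toy, with assumed values) and compares it with the same formula written out by hand.

```r
# Hypothetical data frame: one outcome and three other columns
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Pclass   = c(3, 1, 2, 3),
  Fare     = c(7.25, 71.28, 13.00, 8.05),
  SibSp    = c(1, 1, 0, 0)
)

# terms() resolves the dot against the data: every column except the outcome
expanded <- terms(Survived ~ ., data = toy)
attr(expanded, "term.labels")

# The same model with the predictors spelled out
explicit <- terms(Survived ~ Pclass + Fare + SibSp)
identical(attr(expanded, "term.labels"), attr(explicit, "term.labels"))
```

A practical consequence is that any column left in the data frame, including identifiers or free text, becomes a predictor.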
glm_model1 <- train(Survived ~ ., data = train_data, method = "glm", family = "binomial",trainControl=train_control)
Error in na.fail.default(list(Survived = c(0L, 1L, 1L, 1L, 0L, 0L, 0L, :
missing values in object
The model should raise an error, since some records have missing values (NA). We can verify this:
sapply(raw_data, anyNA)
PassengerId    Survived      Pclass        Name         Sex         Age       SibSp       Parch      Ticket        Fare       Cabin
      FALSE       FALSE       FALSE       FALSE       FALSE        TRUE       FALSE       FALSE       FALSE       FALSE       FALSE
   Embarked
      FALSE
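The output flags only Age as containing NA values. Cabin and Embarked also hold incomplete data, but their blanks were presumably read as empty strings rather than NA, so anyNA does not report them. We could try to fix this, but we will look at that later. For now, let's simply remove these three columns.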
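The anyNA output above marks only Age, even though blank text fields are also a form of missing data. The following base R sketch, on a hypothetical data frame (toy) with assumed values and an assumed helper count_missing, shows how blanks go unnoticed and one way to count them.

```r
# Hypothetical rows mimicking how read.csv() loads blank text fields
toy <- data.frame(
  Age      = c(22, NA, 26, 35),
  Cabin    = c("", "C85", "", "C123"),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# anyNA() only detects true NA values, so the text columns look complete
sapply(toy, anyNA)

# Count NA values and empty strings together (count_missing is an assumed helper)
count_missing <- function(x) sum(is.na(x) | x == "")
sapply(toy, count_missing)
```

An alternative is to pass na.strings = c("NA", "") to read.csv(), so that blanks are loaded as NA from the start.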
train_data_removed<-train_data %>% select(-Age,-Cabin,-Embarked)
glm_model1 <- train(Survived ~ ., data = train_data_removed, method = "glm", family = "binomial",trainControl=train_control)
You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
model fit failed for Resample01: parameter=none Error in glm.control(trainControl = list(method = "cv", number = 5, repeats = NA, :
unused argument (trainControl = list("cv", 5, NA, "grid", 0.75, NULL, 1, TRUE, 0, FALSE, TRUE, "final", FALSE, FALSE, function (data, lev = NULL, model = NULL)
{
if (is.character(data$obs)) data$obs <- factor(data$obs, levels = lev)
postResample(data[, "pred"], data[, "obs"])
}, "best", list(0.95, 3, 5, 19, 10, 0.9), NULL, NULL, NULL, NULL, 0, c(FALSE, FALSE), NA, list(5, 0.05, "gls", TRUE), FALSE, TRUE))
[The same "model fit failed" message repeats for Resample02 through Resample25.]
There were missing values in resampled performance measures.
Something is wrong; all the RMSE metric values are missing:
      RMSE        Rsquared        MAE
 Min.   : NA   Min.   : NA   Min.   : NA
 1st Qu.: NA   1st Qu.: NA   1st Qu.: NA
 Median : NA   Median : NA   Median : NA
 Mean   :NaN   Mean   :NaN   Mean   :NaN
 3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA
 Max.   : NA   Max.   : NA   Max.   : NA
 NA's   :1     NA's   :1     NA's   :1
Error: Stopping
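The train() function fails again, and the log points to two causes. First, the argument name is misspelled: train() expects trControl, so trainControl is forwarded to glm() and rejected as an unused argument in every resample (the 25 resamples are caret's default bootstrap, because train_control was never applied). Second, Survived is still numeric, so caret attempts a regression, which is why the missing metric is RMSE. In addition, non-numeric attributes (columns) such as Name and Ticket are of little use as raw model inputs. This can also be solved, but for now we will simply remove them.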
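The log above reports an unused argument named trainControl. The base R sketch below shows one way a misspelled argument name can travel through ... and be rejected by an inner function. Both wrapper and inner_fit are hypothetical stand-ins, not caret's actual code.

```r
# Hypothetical inner function with a fixed argument list (stands in for a fitter)
inner_fit <- function(x, family = "gaussian") {
  paste("fitting with family", family)
}

# Hypothetical wrapper: it knows trControl and forwards everything else via ...
wrapper <- function(x, trControl = NULL, ...) {
  inner_fit(x, ...)
}

# Correct name: consumed by the wrapper, never reaches inner_fit()
ok <- wrapper(1:3, trControl = list(number = 5), family = "binomial")
ok

# Misspelled name: it falls into ... and inner_fit() rejects it
failed <- tryCatch(
  wrapper(1:3, trainControl = list(number = 5), family = "binomial"),
  error = function(e) conditionMessage(e)
)
failed
```

The wrapper never complains about the unknown name itself, which is why the error surfaces only deep inside the fitting call.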
train_data_removed<-train_data %>% select(-Name,-Ticket,-Age,-Cabin,-Embarked,-Sex)
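# Fit the logistic regression; the formula is written out because no object named `formula` was defined
glm_model1 <- train(Survived ~ ., data = train_data_removed, method = "glm", family = binomial, trControl = train_control)
# Predict on the training rows; this assumes Survived is still numeric 0/1, so the predictions are probabilities
PredTrain <- predict(glm_model1, newdata = train_data_removed, type = "raw")
# Cross-tabulate the actual classes against the predictions thresholded at 0.5
table(train_data_removed$Survived, PredTrain > 0.5)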
The confusionMatrix() function from the caret package computes the confusion matrix together with other metrics.
# Predictions go in `data`, the true classes in `reference`
confusionMatrix(data = as.factor(ifelse(PredTrain > 0.5, 1, 0)), reference = as.factor(train_data_removed$Survived))
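To make those metrics concrete, this base R sketch computes three of them by hand from a hypothetical set of ten labels (the values are invented for illustration). Class 1 is treated as the positive class here, which is an assumption: confusionMatrix() uses the first factor level as the positive class unless its positive argument is set.

```r
# Hypothetical labels: 1 = survived, 0 = did not survive
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0, 0, 1)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1)

# Rows are the predicted classes, columns the actual classes
cm <- table(Predicted = predicted, Actual = actual)
cm

tp <- cm["1", "1"]  # predicted 1, actually 1
tn <- cm["0", "0"]  # predicted 0, actually 0
fp <- cm["1", "0"]  # predicted 1, actually 0
fn <- cm["0", "1"]  # predicted 0, actually 1

accuracy    <- (tp + tn) / sum(cm)  # share of all rows classified correctly
sensitivity <- tp / (tp + fn)       # share of actual survivors that were found
specificity <- tn / (tn + fp)       # share of non-survivors correctly identified

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```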