library("tidyverse")
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.2.1     v purrr   0.3.3
v tibble  2.1.3     v dplyr   0.8.3
v tidyr   1.0.0     v stringr 1.4.0
v readr   1.3.1     v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
library("ggplot2")
library("GGally")
Attaching package: 'GGally'
The following object is masked from 'package:dplyr':
    nasa
library("dplyr")
library("tidyr")
library("broom")
library("modelr")
Attaching package: 'modelr'
The following object is masked from 'package:broom':
    bootstrap
library("OneR")
library("rlang")
Attaching package: 'rlang'
The following objects are masked from 'package:purrr':
    %@%, as_function, flatten, flatten_chr,
    flatten_dbl, flatten_int, flatten_lgl,
    flatten_raw, invoke, list_along, modify,
    prepend, splice
library("purrr")
library("pROC")
Type 'citation("pROC")' for a citation.
Attaching package: 'pROC'
The following objects are masked from 'package:stats':
    cov, smooth, var
library("caret")
Loading required package: lattice
Attaching package: 'caret'
The following object is masked from 'package:purrr':
    lift
#install.packages('e1071', dependencies=TRUE)
titanic_train = read.csv("titanic_complete_train.csv")
Data dictionary
Let's look at a summary of the data and the distributions of some of the variables.
summary(titanic_train)
PassengerId Survived Pclass
Min. : 1.0 Min. :0.0000 Min. :1.000
1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000
Median :446.0 Median :0.0000 Median :3.000
Mean :446.0 Mean :0.3838 Mean :2.309
3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
Max. :891.0 Max. :1.0000 Max. :3.000
Name Sex
Abbing, Mr. Anthony : 1 female:314
Abbott, Mr. Rossmore Edward : 1 male :577
Abbott, Mrs. Stanton (Rosa Hunt) : 1
Abelson, Mr. Samuel : 1
Abelson, Mrs. Samuel (Hannah Wizosky): 1
Adahl, Mr. Mauritz Nils Martin : 1
(Other) :885
Age SibSp Parch
Min. : 0.42 Min. :0.000 Min. :0.0000
1st Qu.:21.75 1st Qu.:0.000 1st Qu.:0.0000
Median :26.51 Median :0.000 Median :0.0000
Mean :29.32 Mean :0.523 Mean :0.3816
3rd Qu.:36.00 3rd Qu.:1.000 3rd Qu.:0.0000
Max. :80.00 Max. :8.000 Max. :6.0000
Ticket Fare Cabin
1601 : 7 Min. : 0.00 B96 B98 : 4
347082 : 7 1st Qu.: 7.91 C23 C25 C27: 4
CA. 2343: 7 Median : 14.45 G6 : 4
3101295 : 6 Mean : 32.20 C22 C26 : 3
347088 : 6 3rd Qu.: 31.00 D : 3
CA 2144 : 6 Max. :512.33 (Other) :186
(Other) :852 NA's :687
Embarked
C:168
Q: 77
S:646
The dataset has 891 records. There are more men than women, third class is by far the largest group, and more people died than survived. The middle 50% of passengers are between roughly 22 and 36 years old.
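As a quick sanity check, these figures can be reproduced directly on the raw data frame (a minimal sketch in base R, using the titanic_train object loaded above):
# Quick counts behind the summary above (sketch)
table(titanic_train$Sex)                                  # 577 male vs 314 female
table(titanic_train$Pclass)                               # 3rd class is the largest group
prop.table(table(titanic_train$Survived))                 # ~62% died, ~38% survived
quantile(titanic_train$Age, c(0.25, 0.75), na.rm = TRUE)  # middle 50% of ages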
titanic_train_selec = titanic_train %>% select(c(PassengerId, Survived, Pclass, Sex, Age, SibSp,Parch, Fare, Embarked))
titanic_train_selec$Survived = as_factor(titanic_train_selec$Survived)
titanic_train_selec$Pclass = as_factor(titanic_train_selec$Pclass)
titanic_train_selec$Embarked = as_factor(titanic_train_selec$Embarked)
ggpairs(titanic_train_selec %>% select(c( Survived, Pclass, Sex, Age , Fare)), mapping = aes(colour= Survived))
The proportion of deaths among men is much higher than among women, and the same holds for third class.
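To put numbers on that observation, a short dplyr sketch (using the titanic_train_selec prepared above) computes the survival rate by Sex and Pclass:
# Survival rate by sex and class (sketch); Survived is a factor here
titanic_train_selec %>%
  group_by(Sex, Pclass) %>%
  summarise(n = n(), surv_rate = mean(Survived == "1")) %>%
  ungroup() %>%
  arrange(surv_rate)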
plot(titanic_train_selec$Survived)
# 70/30 train/validation split
spec = c(train = .7, validate = .3)
g = sample(cut(
seq(nrow(titanic_train_selec)),
nrow(titanic_train_selec)*cumsum(c(0,spec)),
labels = names(spec)
))
res = split(titanic_train_selec, g)
plot(res[['train']]$Survived)
plot(res[['validate']]$Survived)
prop = data.frame(
"complete.Survived0" = nrow(titanic_train_selec %>% filter(Survived == 0)) / nrow(titanic_train_selec),
"complete.Survived1" = nrow(titanic_train_selec %>% filter(Survived == 1)) / nrow(titanic_train_selec),
"train_prop.Survived0" = nrow(res[['train']] %>% filter(Survived == 0)) / nrow(res[['train']]),
"train_prop.Survived1" = nrow(res[['train']] %>% filter(Survived == 1)) / nrow(res[['train']]),
"valid_prop.Survived0" = nrow(res[['train']] %>% filter(Survived == 0)) / nrow(res[['train']]),
"valid_prop.Survived1" = nrow(res[['train']] %>% filter(Survived == 1)) / nrow(res[['train']])
)
prop
After splitting the dataset, the class proportions remain similar across partitions.
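An equivalent, more compact check of the class balance in each partition (a sketch relying on the res list created above):
# Class balance per partition (sketch)
bind_rows(
  complete = count(titanic_train_selec, Survived),
  train    = count(res[["train"]], Survived),
  validate = count(res[["validate"]], Survived),
  .id = "partition"
) %>%
  group_by(partition) %>%
  mutate(prop = n / sum(n))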
glm.fit = glm(Survived ~ Pclass + Sex + Age, data = res[['train']], family = binomial)
summary(glm.fit)
Call:
glm(formula = Survived ~ Pclass + Sex + Age, family = binomial,
data = res[["train"]])
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8248 -0.6364 -0.4445 0.6381 2.4815
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.061364 0.474382 8.561 < 2e-16 ***
Pclass2 -1.259315 0.314522 -4.004 6.23e-05 ***
Pclass3 -2.392876 0.305697 -7.828 4.97e-15 ***
Sexmale -2.670144 0.225457 -11.843 < 2e-16 ***
Age -0.045113 0.009313 -4.844 1.27e-06 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 833.29 on 622 degrees of freedom
Residual deviance: 558.85 on 618 degrees of freedom
AIC: 568.85
Number of Fisher Scoring iterations: 5
All coefficients are significant. Being male appears to decrease the odds of survival relative to the baseline (female). Belonging to class 2 or 3 also decreases the odds of survival relative to the baseline (class 1). The older the passenger, the lower the probability of survival.
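Exponentiating the coefficients turns those log-odds into odds ratios, which makes the reading above more concrete (a sketch; exact values depend on the random split):
# Odds ratios with confidence intervals (values below 1 reduce the odds of survival)
tidy(glm.fit, exponentiate = TRUE, conf.int = TRUE)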
rose = predict(glm.fit,newdata = data.frame(Pclass = '1', Sex = "female", Age = 17),type = "response")
jack = predict(glm.fit,newdata = data.frame(Pclass = '3', Sex = "male", Age = 20),type = "response")
#glm.pred = ifelse(glm.probs > 0.5, "Up", "Down")
rose
1
0.9642374
jack
1
0.1296649
Clearly Jack had little chance of surviving, barely 13%, while Rose had roughly a 96% chance.
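The same probabilities can be recovered by hand from the linear predictor: predict() with type = "link" returns the log-odds, and plogis() maps them back to a probability (a small sketch for Rose):
# Log-odds for Rose, then the logistic transformation applied by hand
log_odds_rose = predict(glm.fit, newdata = data.frame(Pclass = '1', Sex = "female", Age = 17), type = "link")
plogis(log_odds_rose)   # matches the type = "response" prediction above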
f = formulas(.response = ~Survived,
Familia= ~SibSp+Parch+Sex+Age,
Embarque=~Embarked+Parch+Sex+Age,
Todo=~Pclass+Sex+Age+SibSp+Parch+Fare+Embarked
)
modelos = tibble(f) %>%
mutate(modelos = names(f),
expresion = paste(f),
mod = map(f, ~glm(., family = 'binomial', data = res[['train']])))
modelos = modelos %>%
mutate(glance = map(mod,glance)) %>%
mutate(pred= map(mod,augment, type.predict = "response"))
modelos %>%
unnest(glance) %>%
mutate(perc_explained_dev = 1 - deviance/null.deviance) %>%
select(-c(modelos, df.null, AIC, BIC)) %>%
arrange(deviance)
According to deviance, the best model is the one that uses all the variables, with an explained-deviance fraction of 0.35. The worst model is the one with formula Survived ~ SibSp + Parch + Sex + Age, with an explained fraction of 0.24.
I will choose the full model, since it explains the largest share of the deviance.
modelos <- modelos %>%
mutate(pred = map(mod, augment, newdata = res[["validate"]], type.predict = "response"))
prediction_full <- modelos %>%
filter(modelos == "Todo") %>%
unnest(pred)
roc_full <- roc(response=prediction_full$Survived, predictor=prediction_full$.fitted)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
ggroc(list(full=roc_full), size=1) +
geom_abline(slope = 1, intercept = 1, linetype='dashed') +
theme_minimal() + labs(title=paste('Curva ROC AUC:',round(roc_full$auc,2)), color='Modelo Completo')
We can see the trade-off between specificity (1 - FPR) and sensitivity (TPR). The curve lies well above the random-guess diagonal, with an area under the curve of 0.86.
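The AUC can also be read directly from the roc object, and pROC's coords() suggests an operating point (a sketch; "best" uses Youden's index by default):
auc(roc_full)
coords(roc_full, "best", ret = c("threshold", "specificity", "sensitivity"))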
ggplot(prediction_full, aes(x=Survived, y=.fitted, group=Survived,fill=factor(Survived))) +
geom_violin() +
labs(title='Violin plot', subtitle='Modelo completo', y='Predicted probability')
There is a large concentration of non-survivors with predicted probabilities below 0.25, although a few have probabilities above 0.75. For survivors, the density rises considerably above 0.5, and likewise there are survivors with predicted probabilities below 0.125. The cutoff could reasonably be anywhere between 0.25 and 0.5.
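A rough numerical counterpart of the violin plot (a sketch): quartiles of the predicted probability within each observed class.
prediction_full %>%
  group_by(Survived) %>%
  summarise(q25 = quantile(.fitted, 0.25),
            median = median(.fitted),
            q75 = quantile(.fitted, 0.75))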
#models_validate = glm(f$Todo, family = 'binomial', data = res[['validate']])
#modelo_todo = modelos %>% select(mod)
#prediction_validate = augment(x = models_validate, type.predict = 'response')
#prediction_todo(x = modelo_todo, type.predict = 'response')
# Computes accuracy, sensitivity, specificity, precision and recall for a given cutoff
prediction_metrics <- function(cutoff, predictions=prediction_full){
table <- predictions %>%
mutate(predicted_class=if_else(.fitted>cutoff, 1, 0) %>% as.factor(),
Survived= factor(Survived))
confusionMatrix(table$predicted_class, table$Survived, positive = "1") %>%
tidy() %>%
select(term, estimate) %>%
filter(term %in% c('accuracy', 'sensitivity', 'specificity', 'precision','recall')) %>%
mutate(cutoff=cutoff)
}
cutoffs = seq(0.01,0.95,0.01)
logit_pred= map_dfr(cutoffs, prediction_metrics)%>% mutate(term=as.factor(term))
ggplot(logit_pred, aes(cutoff,estimate, group=term, color=term)) + geom_line(size=1) +
theme_minimal() +
labs(title= 'Accuracy, Sensitivity, Specificity, Recall y Precision', subtitle= 'Modelo completo', color="")
I would pick a cutoff close to 0.36 (a bit before the middle of the [0.25, 0.5] range), where most of the curves cross. The decision ultimately depends on which metric we want to maximize; here I prefer a balance across all of them.
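One way to locate that crossing point numerically (a sketch using the logit_pred table built above): find the cutoff where sensitivity and specificity are closest.
logit_pred %>%
  select(cutoff, term, estimate) %>%
  spread(term, estimate) %>%
  mutate(gap = abs(sensitivity - specificity)) %>%
  arrange(gap) %>%
  head(3)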
corte = 0.37
table= prediction_full %>% mutate(predicted_class=if_else(.fitted>corte, 1, 0) %>% as.factor(),Survived= factor(Survived))
confusionMatrix(table(table$predicted_class, table$Survived), positive = "1")
Confusion Matrix and Statistics
          Reference
Prediction   0   1
         0 132  22
         1  37  77
Accuracy : 0.7799
95% CI : (0.7254, 0.828)
No Information Rate : 0.6306
P-Value [Acc > NIR] : 1.06e-07
Kappa : 0.5418
Mcnemar's Test P-Value : 0.06836
Sensitivity : 0.7778
Specificity : 0.7811
Pos Pred Value : 0.6754
Neg Pred Value : 0.8571
Prevalence : 0.3694
Detection Rate : 0.2873
Detection Prevalence : 0.4254
Balanced Accuracy : 0.7794
'Positive' Class : 1
This gives an accuracy of 0.7799. The model is more precise for the negative class (did not survive) than for the positive one, as the Pos Pred Value and Neg Pred Value show.
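Those two values follow directly from the confusion-matrix counts (a quick check by hand):
tp = 77; fp = 37; tn = 132; fn = 22
tp / (tp + fp)   # Pos Pred Value ~ 0.675
tn / (tn + fn)   # Neg Pred Value ~ 0.857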
titanic_test = read.csv("titanic_complete_test.csv",sep = ",")
titanic_test = titanic_test %>%
select(PassengerId, Survived, Pclass, Sex, Age, SibSp,Parch, Fare, Embarked) %>%
mutate(Survived = as.factor(Survived)) %>%
mutate(Pclass = as.factor(Pclass)) %>%
mutate(Embarked = as.factor(Embarked))
summary(titanic_test)
PassengerId Survived Pclass Sex Age SibSp Parch Fare
Min. : 892.0 0:261 1:107 female:152 Min. : 0.17 Min. :0.0000 Min. :0.0000 Min. : 0.000
1st Qu.: 996.2 1:157 2: 93 male :266 1st Qu.:23.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 7.896
Median :1100.5 3:218 Median :25.00 Median :0.0000 Median :0.0000 Median : 14.454
Mean :1100.5 Mean :29.42 Mean :0.4474 Mean :0.3923 Mean : 35.577
3rd Qu.:1204.8 3rd Qu.:36.38 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.: 31.472
Max. :1309.0 Max. :76.00 Max. :8.0000 Max. :9.0000 Max. :512.329
Embarked
C:102
Q: 46
S:270
model <- glm(f$Todo, family = 'binomial', data = res[["train"]])
prediction_test <- augment(x = model, newdata = titanic_test, type.predict = 'response')
table <- prediction_test %>%
mutate(predicted_class=if_else(.fitted>corte, 1, 0) %>% as.factor(),
Survived= factor(Survived))
confusionMatrix(table(table$predicted_class, table$Survived), positive = "1")
Confusion Matrix and Statistics
          Reference
Prediction   0   1
         0 192  39
         1  69 118
Accuracy : 0.7416
95% CI : (0.6968, 0.7829)
No Information Rate : 0.6244
P-Value [Acc > NIR] : 2.496e-07
Kappa : 0.4694
Mcnemar's Test P-Value : 0.005262
Sensitivity : 0.7516
Specificity : 0.7356
Pos Pred Value : 0.6310
Neg Pred Value : 0.8312
Prevalence : 0.3756
Detection Rate : 0.2823
Detection Prevalence : 0.4474
Balanced Accuracy : 0.7436
'Positive' Class : 1
Accuracy dropped on the test set, which is expected if the model is not overfitting. The same happened with the ability to predict the positive class; the negative class is still predicted better than the positive one.
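For reference, a small side-by-side of the two evaluations above (numbers copied from the confusion matrices; a sketch, not a new computation):
data.frame(
  partition      = c("validation", "test"),
  accuracy       = c(0.7799, 0.7416),
  pos_pred_value = c(0.6754, 0.6310)
)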