Logistic regression is a statistical model for binary classification. It estimates the probability that an event occurs (value 1) rather than not (value 0) as a function of the independent variables.
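In symbols, the definition above can be written as follows. The notation is introduced here for illustration and does not appear in the original analysis: $y$ is the binary outcome, $x_1, \dots, x_k$ are the independent variables, and $\beta_0, \dots, \beta_k$ are the coefficients to be estimated.

```latex
\[
p = P(y = 1 \mid x_1, \dots, x_k)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}},
\qquad
\log\frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k .
\]
```

The second form shows that the model is linear on the log-odds scale, which is the scale on which `glm()` with `family = binomial` reports its coefficients.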
# Install once if needed: install.packages(c("titanic", "caret", "tidyverse"))
library(titanic)
library(caret)
library(tidyverse)
df <- read.csv("C:/Users/joseo/Downloads/titanic (1).csv")
summary(df)
##      pclass         survived         name               sex
##  Min.   :1.000   Min.   :0.000   Length:1310        Length:1310
##  1st Qu.:2.000   1st Qu.:0.000   Class :character   Class :character
##  Median :3.000   Median :0.000   Mode  :character   Mode  :character
##  Mean   :2.295   Mean   :0.382
##  3rd Qu.:3.000   3rd Qu.:1.000
##  Max.   :3.000   Max.   :1.000
##  NA's   :1       NA's   :1
##       age              sibsp            parch          ticket
##  Min.   : 0.1667   Min.   :0.0000   Min.   :0.000   Length:1310
##  1st Qu.:21.0000   1st Qu.:0.0000   1st Qu.:0.000   Class :character
##  Median :28.0000   Median :0.0000   Median :0.000   Mode  :character
##  Mean   :29.8811   Mean   :0.4989   Mean   :0.385
##  3rd Qu.:39.0000   3rd Qu.:1.0000   3rd Qu.:0.000
##  Max.   :80.0000   Max.   :8.0000   Max.   :9.000
##  NA's   :264       NA's   :1        NA's   :1
##       fare            cabin             embarked             boat
##  Min.   :  0.000   Length:1310        Length:1310        Length:1310
##  1st Qu.:  7.896   Class :character   Class :character   Class :character
##  Median : 14.454   Mode  :character   Mode  :character   Mode  :character
##  Mean   : 33.295
##  3rd Qu.: 31.275
##  Max.   :512.329
##  NA's   :2
##       body        home.dest
##  Min.   :  1.0   Length:1310
##  1st Qu.: 72.0   Class :character
##  Median :155.0   Mode  :character
##  Mean   :160.8
##  3rd Qu.:256.0
##  Max.   :328.0
##  NA's   :1189
str(df)
## 'data.frame': 1310 obs. of 14 variables:
## $ pclass : int 1 1 1 1 1 1 1 1 1 1 ...
## $ survived : int 1 1 0 0 0 1 1 0 1 0 ...
## $ name : chr "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ...
## $ sex : chr "female" "male" "female" "male" ...
## $ age : num 29 0.917 2 30 25 ...
## $ sibsp : int 0 1 1 1 1 0 1 0 2 0 ...
## $ parch : int 0 2 2 2 2 0 0 0 0 0 ...
## $ ticket : chr "24160" "113781" "113781" "113781" ...
## $ fare : num 211 152 152 152 152 ...
## $ cabin : chr "B5" "C22 C26" "C22 C26" "C22 C26" ...
## $ embarked : chr "S" "S" "S" "S" ...
## $ boat : chr "2" "11" "" "" ...
## $ body : int NA NA NA 135 NA NA NA NA NA 22 ...
## $ home.dest: chr "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
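# Keep only the response and the three predictors used in the model
df <- df[, c("survived", "pclass", "sex", "age")]
# Drop rows with missing values (mostly missing age)
df <- na.omit(df)
# Encode the response and the categorical predictors as factors
df$survived <- as.factor(df$survived)
df$pclass <- as.factor(df$pclass)
df$sex <- as.factor(df$sex)
# Fit a logistic regression of survival on all remaining columns
modelo <- glm(survived ~ ., data = df, family = binomial)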
summary(modelo)
##
## Call:
## glm(formula = survived ~ ., family = binomial, data = df)
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  3.522074   0.326702  10.781  < 2e-16 ***
## pclass2     -1.280570   0.225538  -5.678 1.36e-08 ***
## pclass3     -2.289661   0.225802 -10.140  < 2e-16 ***
## sexmale     -2.497845   0.166037 -15.044  < 2e-16 ***
## age         -0.034393   0.006331  -5.433 5.56e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1414.62 on 1045 degrees of freedom
## Residual deviance: 982.45 on 1041 degrees of freedom
## AIC: 992.45
##
## Number of Fisher Scoring iterations: 4
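# Two example passengers: a 25-year-old woman in 1st class and a 40-year-old man in 3rd class
prueba <- data.frame(pclass = as.factor(c(1, 3)), sex = as.factor(c("female", "male")), age = c(25, 40))
# type = "response" returns probabilities instead of log-odds
probabilidad <- predict(modelo, newdata = prueba, type = "response")
cbind(prueba, Probabilidad_Sobrevive = probabilidad)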
##   pclass    sex age Probabilidad_Sobrevive
## 1      1 female  25             0.93476160
## 2      3   male  40             0.06653593
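As a check, the two predicted probabilities can be reproduced by hand from the printed coefficients. This is a minimal sketch that assumes the rounded estimates shown in the `summary(modelo)` output; the names `b`, `eta_woman`, `eta_man` and `p_manual` are illustrative and not part of the original analysis.

```r
# Coefficients copied (rounded) from the summary(modelo) output
b <- c(intercept = 3.522074, pclass2 = -1.280570, pclass3 = -2.289661,
       sexmale = -2.497845, age = -0.034393)

# Linear predictor (log-odds) for each passenger. 1st class and female are
# the reference levels, so they contribute no extra term.
eta_woman <- b[["intercept"]] + b[["age"]] * 25
eta_man   <- b[["intercept"]] + b[["pclass3"]] + b[["sexmale"]] + b[["age"]] * 40

# plogis() is the inverse logit: 1 / (1 + exp(-eta))
p_manual <- plogis(c(woman = eta_woman, man = eta_man))
round(p_manual, 4)
```

Up to rounding of the coefficients, these values should agree with the `predict()` output above.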
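In this model the probability of survival depends mainly on sex, passenger class, and age; all three predictors are significant (p < 0.001).
The highest survival probability corresponds to a young woman travelling in 1st class.
Being male strongly reduces the probability of survival: `sexmale` has the largest coefficient in absolute value (-2.50).
2nd class has a lower survival probability than 1st class (coefficient -1.28), other variables held fixed.
3rd class shows the largest class-related reduction in survival (coefficient -2.29).
The older the passenger, the lower the probability of survival (-0.034 on the log-odds scale per additional year).
A 25-year-old woman in 1st class has a high predicted probability of survival (about 0.93).
A 40-year-old man in 3rd class has a very low predicted probability of survival (about 0.07).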
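The coefficients behind these statements are on the log-odds scale, so their size is easier to read after exponentiating them into odds ratios. The sketch below assumes the rounded estimates from the `summary(modelo)` output; the names `coefs`, `odds_ratios` and `age_10` are illustrative and not part of the original analysis.

```r
# Coefficients (log-odds scale) copied from the summary(modelo) output
coefs <- c(pclass2 = -1.280570, pclass3 = -2.289661,
           sexmale = -2.497845, age = -0.034393)

# Exponentiating a coefficient gives the multiplicative change in the odds
# of survival, holding the other predictors fixed
odds_ratios <- exp(coefs)
round(odds_ratios, 3)

# Odds multiplier associated with a 10-year increase in age
age_10 <- exp(10 * coefs[["age"]])
round(age_10, 3)
```

An odds ratio below 1 means lower odds of survival than the reference level (1st class, female) or, for age, per additional year. With the fitted object available, `exp(coef(modelo))` should give the same values without copying numbers by hand.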