Preliminary analysis of the Titanic data (practice mode).

Twitter: @yimmyeman

R libraries:

library(tidyverse)
library(reticulate)
library(gmodels)

Importing the data:

titanic_df <- read.csv("Data/titanicdf.csv", 
                       stringsAsFactors = FALSE, 
                       row.names = "X")

Variable names:

names(titanic_df)
##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"

Meaning of the variables:

  • PassengerId: Row number
  • Survived: Whether the passenger survived (0 = no, 1 = yes)
  • Pclass: Ticket class (first, second or third class)
  • Name: Passenger name
  • Sex: Sex
  • Age: Age
  • SibSp: Number of siblings/spouses aboard the Titanic
  • Parch: Number of parents/children aboard the Titanic
  • Ticket: Ticket number
  • Fare: Passenger fare
  • Cabin: Cabin number
  • Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Examining the structure of the data:

str(titanic_df)
## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
summary(titanic_df)
##   PassengerId      Survived          Pclass          Name          
##  Min.   :   1   Min.   :0.0000   Min.   :1.000   Length:1309       
##  1st Qu.: 328   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median : 655   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   : 655   Mean   :0.3774   Mean   :2.295                     
##  3rd Qu.: 982   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :1309   Max.   :1.0000   Max.   :3.000                     
##                                                                    
##      Sex                 Age            SibSp            Parch      
##  Length:1309        Min.   : 0.17   Min.   :0.0000   Min.   :0.000  
##  Class :character   1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.000  
##  Mode  :character   Median :28.00   Median :0.0000   Median :0.000  
##                     Mean   :29.88   Mean   :0.4989   Mean   :0.385  
##                     3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.000  
##                     Max.   :80.00   Max.   :8.0000   Max.   :9.000  
##                     NA's   :263                                     
##     Ticket               Fare            Cabin             Embarked        
##  Length:1309        Min.   :  0.000   Length:1309        Length:1309       
##  Class :character   1st Qu.:  7.896   Class :character   Class :character  
##  Mode  :character   Median : 14.454   Mode  :character   Mode  :character  
##                     Mean   : 33.295                                        
##                     3rd Qu.: 31.275                                        
##                     Max.   :512.329                                        
##                     NA's   :1

Which variables are categorical?

Converting the categorical variables to factors.

titanic_df$Survived <- factor(titanic_df$Survived, 
                            levels = c(0,1), 
                            labels = c("Dead", "Survived"))

titanic_df$Sex <- factor(titanic_df$Sex,
                       levels = c("male","female"),
                       labels = c("Male","Female"))

titanic_df$Embarked <- factor(titanic_df$Embarked, levels = c("C", "Q", "S"))

titanic_df$Pclass <- factor(titanic_df$Pclass, 
                          levels = c(1,2,3),
                          labels = c("First Class", 
                                     "Second Class", 
                                     "Third Class"))

Which variables are numeric?

Handling missing data:

The summary shows that the Age variable has a total of 263 NA's.

We will write a function that replaces the missing values in Age with values sampled at random from the observed ages.

rand.impute <- function(x) {
  missing <- is.na(x) 
  n.missing <- sum(missing)
  x.obs <- x[!missing]
  imputed <- x
  imputed[missing] <- sample(x.obs, n.missing, replace = TRUE)
  return (imputed)
}
titanic_df$Age <- rand.impute(titanic_df$Age)
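
Because the imputation draws values at random, each run fills the gaps with slightly different ages unless a seed is fixed (in R, with `set.seed()`). The following is only a minimal Python sketch of the same idea, not the code used for the results in this document; the function name `rand_impute`, the `seed` parameter and the use of `None` for NA are assumptions made for illustration.

```python
import random

def rand_impute(values, seed=None):
    """Replace None entries with random draws (with replacement) from the observed values."""
    rng = random.Random(seed)  # a fixed seed makes the imputation repeatable
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]

# First ten Age values from the str() output above, with NA written as None
ages = [22, 38, 26, 35, 35, None, 54, 2, 27, 14]
imputed = rand_impute(ages, seed=42)
print(imputed)
```

Calling the function twice with the same seed returns the same list, which makes any figure derived from the imputed ages reproducible.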

Analyzing how the variables relate to Survived

Survived by Sex:

Women were clearly given priority: 82.62% of them survived, compared with 12.9% of men.

CrossTable(titanic_df$Sex, titanic_df$Survived, dnn = c("Sex", "Survived"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1309 
## 
##  
##              | Survived 
##          Sex |      Dead |  Survived | Row Total | 
## -------------|-----------|-----------|-----------|
##         Male |       734 |       109 |       843 | 
##              |    83.333 |   137.483 |           | 
##              |     0.871 |     0.129 |     0.644 | 
##              |     0.901 |     0.221 |           | 
##              |     0.561 |     0.083 |           | 
## -------------|-----------|-----------|-----------|
##       Female |        81 |       385 |       466 | 
##              |   150.751 |   248.709 |           | 
##              |     0.174 |     0.826 |     0.356 | 
##              |     0.099 |     0.779 |           | 
##              |     0.062 |     0.294 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       815 |       494 |      1309 | 
##              |     0.623 |     0.377 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
titanic_df %>% 
  ggplot() +
  geom_bar(mapping = aes(Sex, fill = Survived))
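
The survival rates quoted for each sex are the "N / Row Total" entries of the table above. As a quick check, the sketch below recomputes them from the raw counts; the dictionary layout and the helper name `survival_rate` are assumptions made for illustration.

```python
# Counts copied from the CrossTable output for Sex vs Survived
counts = {
    "Male": {"Dead": 734, "Survived": 109},
    "Female": {"Dead": 81, "Survived": 385},
}

def survival_rate(row):
    """Percentage of the row total that survived."""
    return 100 * row["Survived"] / sum(row.values())

rates = {sex: round(survival_rate(row), 2) for sex, row in counts.items()}
print(rates)
```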

Survived by Pclass:

The number of survivors was similar in first and third class (186 and 191), with third class slightly ahead, but the survival rates differ sharply: 57.6% in first class, 42.2% in second and 26.9% in third. Third class also had by far the highest death rate (73.1%).

CrossTable(titanic_df$Pclass, titanic_df$Survived, dnn = c("Pclass", "Survived"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1309 
## 
##  
##              | Survived 
##       Pclass |      Dead |  Survived | Row Total | 
## -------------|-----------|-----------|-----------|
##  First Class |       137 |       186 |       323 | 
##              |    20.434 |    33.712 |           | 
##              |     0.424 |     0.576 |     0.247 | 
##              |     0.168 |     0.377 |           | 
##              |     0.105 |     0.142 |           | 
## -------------|-----------|-----------|-----------|
## Second Class |       160 |       117 |       277 | 
##              |     0.901 |     1.486 |           | 
##              |     0.578 |     0.422 |     0.212 | 
##              |     0.196 |     0.237 |           | 
##              |     0.122 |     0.089 |           | 
## -------------|-----------|-----------|-----------|
##  Third Class |       518 |       191 |       709 | 
##              |    13.281 |    21.911 |           | 
##              |     0.731 |     0.269 |     0.542 | 
##              |     0.636 |     0.387 |           | 
##              |     0.396 |     0.146 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       815 |       494 |      1309 | 
##              |     0.623 |     0.377 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
titanic_df %>% 
  ggplot()+
  geom_bar(mapping = aes(x = Pclass, fill = Survived))
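
Each cell of the table also reports a "Chi-square contribution", which is (observed - expected)² / expected, where the expected count is row total × column total / table total. The sketch below reproduces the value for the First Class / Dead cell from the counts shown above; the variable and function names are assumptions made for illustration.

```python
# (Dead, Survived) counts copied from the CrossTable output for Pclass vs Survived
table = {
    "First Class": (137, 186),
    "Second Class": (160, 117),
    "Third Class": (518, 191),
}

grand_total = sum(dead + survived for dead, survived in table.values())
dead_total = sum(dead for dead, _ in table.values())

def contribution(observed, row_total, col_total, total):
    """Chi-square contribution of one cell: (O - E)^2 / E."""
    expected = row_total * col_total / total
    return (observed - expected) ** 2 / expected

dead, survived = table["First Class"]
first_dead = contribution(dead, dead + survived, dead_total, grand_total)
print(round(first_dead, 3))
```

The result should agree with the 20.434 printed in the table for that cell.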

Survived by Age:

We will create a new variable called Age.cat to group passengers by age.

The groups are defined as follows:

  • 0-14: Niño (child).
  • 15-24: Joven (young).
  • 25-64: Adulto (adult).
  • 65 and over: Mayor (senior).
titanic_df$Age.cat <- cut(titanic_df$Age, 
                          breaks = c(0, 14, 24, 64, Inf),
                          labels = c("Niño", "Joven", "Adulto", "Mayor"))
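
By default `cut()` builds intervals that are closed on the right, so the breaks above give (0, 14], (14, 24], (24, 64] and (64, Inf]: an age of exactly 14 is a Niño, while 14.5 is already a Joven. The sketch below mimics that rule in Python with `bisect_left`; the names `BREAKS`, `LABELS` and `age_category` are assumptions made for illustration, and values at or below 0 return `None` in the same way that `cut()` would yield NA.

```python
from bisect import bisect_left

BREAKS = [14, 24, 64]                          # upper bounds of the first three groups
LABELS = ["Niño", "Joven", "Adulto", "Mayor"]  # same labels as in the R code

def age_category(age):
    """Map an age to its group using right-closed intervals."""
    if age <= 0:
        return None  # outside (0, Inf], comparable to NA in R
    # bisect_left keeps an age equal to a break in the lower group
    return LABELS[bisect_left(BREAKS, age)]

for age in (0.17, 14, 14.5, 24, 64, 80):
    print(age, age_category(age))
```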

Most passengers fall in the Adulto group.

Survival in the Niño group was roughly balanced, close to 50%.

The Mayor group had the lowest chance of survival, with a survival rate of 31.25% in the run shown below. The exact figures vary between runs because the missing ages are imputed at random.

round(prop.table(table(titanic_df$Age.cat, 
                 titanic_df$Survived,
                 dnn =  c("Age", "Survived(%)")), 1)*100,2)
##         Survived(%)
## Age       Dead Survived
##   Niño   51.82    48.18
##   Joven  63.32    36.68
##   Adulto 63.45    36.55
##   Mayor  68.75    31.25
titanic_df %>% 
  ggplot()+
  geom_bar(aes(Age.cat, fill = Survived))

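Survived by Embarked:

Port S (Southampton) had the lowest survival rate (33.4%), even though it is the port where most passengers boarded (914, or 69.9%). The table covers 1307 observations because two passengers have an empty Embarked value, which becomes NA when the factor is built with the levels C, Q and S.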

CrossTable(titanic_df$Embarked, 
           titanic_df$Survived, 
           dnn = c("Embarked", "Survived"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1307 
## 
##  
##              | Survived 
##     Embarked |      Dead |  Survived | Row Total | 
## -------------|-----------|-----------|-----------|
##            C |       137 |       133 |       270 | 
##              |     5.842 |     9.678 |           | 
##              |     0.507 |     0.493 |     0.207 | 
##              |     0.168 |     0.270 |           | 
##              |     0.105 |     0.102 |           | 
## -------------|-----------|-----------|-----------|
##            Q |        69 |        54 |       123 | 
##              |     0.773 |     1.280 |           | 
##              |     0.561 |     0.439 |     0.094 | 
##              |     0.085 |     0.110 |           | 
##              |     0.053 |     0.041 |           | 
## -------------|-----------|-----------|-----------|
##            S |       609 |       305 |       914 | 
##              |     2.677 |     4.435 |           | 
##              |     0.666 |     0.334 |     0.699 | 
##              |     0.747 |     0.620 |           | 
##              |     0.466 |     0.233 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       815 |       492 |      1307 | 
##              |     0.624 |     0.376 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
titanic_df %>% 
  ggplot()+
  geom_bar(mapping = aes(x = Embarked, fill = Survived))

Survived by SibSp and Parch:

Most passengers had no relatives aboard; in both cases, SibSp and Parch, the death rate exceeds 60%.

titanic_df %>% 
  ggplot()+
  geom_bar(mapping = aes(x = SibSp, fill = Survived))

titanic_df %>% 
  ggplot()+
  geom_bar(mapping = aes(x = Parch, fill = Survived))

Survived by Name:

To categorize the Name variable we will extract each person's title and determine its survival rate.

This requires regular expressions, for which we will use the Python programming language.

df = r.titanic_df

We will add a new column called Title.

import re

Title = []
patron_title = re.compile(r' ([A-Za-z]+)\.')  # raw string: a space, the title letters, then a literal period

for name in df["Name"]:
  extract = patron_title.findall(name)
  Title.append(extract[0])
  
df["Title"] = Title

# Titles present:
print(df["Title"].unique())
## ['Mr' 'Mrs' 'Miss' 'Master' 'Don' 'Rev' 'Dr' 'Mme' 'Ms' 'Major' 'Lady'
##  'Sir' 'Mlle' 'Col' 'Capt' 'Countess' 'Jonkheer' 'Dona']
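
To see what the pattern captures, the sketch below applies it to three names taken from the str() output earlier in this document. It uses `search` with a guard for names that contain no title, which is a defensive variation on the loop above rather than the code that produced these results; the function name `extract_title` is an assumption made for illustration.

```python
import re

# A space, one or more letters (captured), then a literal period
TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")

def extract_title(name):
    """Return the first title found in a passenger name, or None if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]
titles = [extract_title(name) for name in names]
print(titles)
```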

Next we group the titles:

df["Title"] = df["Title"].replace(["Mlle", "Ms", "Lady", "Countess"], "Miss")
df["Title"] = df["Title"].replace(["Sir", "Don"], "Mr")
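# Note: "Mme" (Madame) is mapped to "Mr" here, although "Mrs" looks like the intended group;
# the mapping is left unchanged so that the counts shown below still match.
df["Title"] = df["Title"].replace("Mme", "Mr")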
df["Title"] = df["Title"].replace(['Rev', 'Dr', 'Col', 'Major','Capt', 'Jonkheer', 'Dona'], 'Rare')

We pass the data frame back to R:

titanic_df <- py$df

We convert the new Title variable to a factor.

unique(titanic_df$Title)
## [1] "Mr"     "Mrs"    "Miss"   "Master" "Rare"
titanic_df$Title <- factor(titanic_df$Title)

The title Mr clearly has the lowest survival rate (10.9%), as it corresponds to adult men.

CrossTable(titanic_df$Title, titanic_df$Survived, dnn = c("Title","Survived"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1309 
## 
##  
##              | Survived 
##        Title |      Dead |  Survived | Row Total | 
## -------------|-----------|-----------|-----------|
##       Master |        38 |        23 |        61 | 
##              |     0.000 |     0.000 |           | 
##              |     0.623 |     0.377 |     0.047 | 
##              |     0.047 |     0.047 |           | 
##              |     0.029 |     0.018 |           | 
## -------------|-----------|-----------|-----------|
##         Miss |        55 |       211 |       266 | 
##              |    73.880 |   121.887 |           | 
##              |     0.207 |     0.793 |     0.203 | 
##              |     0.067 |     0.427 |           | 
##              |     0.042 |     0.161 |           | 
## -------------|-----------|-----------|-----------|
##           Mr |       677 |        83 |       760 | 
##              |    87.789 |   144.833 |           | 
##              |     0.891 |     0.109 |     0.581 | 
##              |     0.831 |     0.168 |           | 
##              |     0.517 |     0.063 |           | 
## -------------|-----------|-----------|-----------|
##          Mrs |        26 |       171 |       197 | 
##              |    76.166 |   125.659 |           | 
##              |     0.132 |     0.868 |     0.150 | 
##              |     0.032 |     0.346 |           | 
##              |     0.020 |     0.131 |           | 
## -------------|-----------|-----------|-----------|
##         Rare |        19 |         6 |        25 | 
##              |     0.758 |     1.250 |           | 
##              |     0.760 |     0.240 |     0.019 | 
##              |     0.023 |     0.012 |           | 
##              |     0.015 |     0.005 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       815 |       494 |      1309 | 
##              |     0.623 |     0.377 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
titanic_df %>% 
  ggplot()+
  geom_bar(mapping = aes(x = Title, fill = Survived))