Preliminary analysis of the Titanic (practice mode).
Twitter: @yimmyeman
library(tidyverse)   # ggplot2, dplyr and the %>% pipe
library(reticulate)  # bridge between R and Python
library(gmodels)     # CrossTable()
titanic_df <- read.csv("Data/titanicdf.csv",
                       stringsAsFactors = FALSE,
                       row.names = "X")
names(titanic_df)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"
str(titanic_df)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
summary(titanic_df)
## PassengerId Survived Pclass Name
## Min. : 1 Min. :0.0000 Min. :1.000 Length:1309
## 1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median : 655 Median :0.0000 Median :3.000 Mode :character
## Mean : 655 Mean :0.3774 Mean :2.295
## 3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1309 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:1309 Min. : 0.17 Min. :0.0000 Min. :0.000
## Class :character 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000
## Mode :character Median :28.00 Median :0.0000 Median :0.000
## Mean :29.88 Mean :0.4989 Mean :0.385
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
## NA's :263
## Ticket Fare Cabin Embarked
## Length:1309 Min. : 0.000 Length:1309 Length:1309
## Class :character 1st Qu.: 7.896 Class :character Class :character
## Mode :character Median : 14.454 Mode :character Mode :character
## Mean : 33.295
## 3rd Qu.: 31.275
## Max. :512.329
## NA's :1
titanic_df$Survived <- factor(titanic_df$Survived,
                              levels = c(0, 1),
                              labels = c("Dead", "Survived"))
titanic_df$Sex <- factor(titanic_df$Sex,
                         levels = c("male", "female"),
                         labels = c("Male", "Female"))
titanic_df$Embarked <- factor(titanic_df$Embarked, levels = c("C", "Q", "S"))
titanic_df$Pclass <- factor(titanic_df$Pclass,
                            levels = c(1, 2, 3),
                            labels = c("First Class",
                                       "Second Class",
                                       "Third Class"))
The summary shows that the variable Age has a total of 263 missing values (NAs).
We write a function that replaces the missing values of Age with random draws from the observed ages.
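rand.impute <- function(x) {
  missing <- is.na(x)       # flag the missing entries
  n.missing <- sum(missing)
  x.obs <- x[!missing]      # observed values to draw from
  imputed <- x
  # Fill each gap with a random draw (with replacement) from the observed values
  imputed[missing] <- sample(x.obs, n.missing, replace = TRUE)
  return(imputed)
}
titanic_df$Age <- rand.impute(titanic_df$Age)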
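The same idea can be sketched in plain Python, the language this analysis uses later through reticulate. This is only an illustration under stated assumptions: the function name `rand_impute`, the `seed` parameter and the sample `ages` list are hypothetical and not part of the original analysis. The R function above draws without a fixed seed, so each run imputes different ages; fixing a seed (in R, with `set.seed()`) makes the result repeatable.

```python
import math
import random


def rand_impute(values, seed=None):
    """Replace missing entries (None or NaN) with random draws from the observed ones."""
    rng = random.Random(seed)  # a fixed seed makes the draws repeatable

    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    observed = [v for v in values if not is_missing(v)]
    if not observed:
        raise ValueError("no observed values to sample from")
    # Draw with replacement, mirroring sample(x.obs, n.missing, replace = TRUE)
    return [rng.choice(observed) if is_missing(v) else v for v in values]


# Hypothetical toy data: two of the six ages are missing
ages = [22.0, 38.0, None, 35.0, float("nan"), 54.0]
imputed = rand_impute(ages, seed=42)
print(imputed)
```

Random imputation keeps the overall distribution of Age roughly intact, but any statistic that depends on Age will vary between unseeded runs.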
Survived as a function of Sex: Women were clearly given priority: 82.62% of them survived, compared with 12.9% of men.
CrossTable(titanic_df$Sex, titanic_df$Survived, dnn = c("Sex", "Survived"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1309
##
##
## | Survived
## Sex | Dead | Survived | Row Total |
## -------------|-----------|-----------|-----------|
## Male | 734 | 109 | 843 |
## | 83.333 | 137.483 | |
## | 0.871 | 0.129 | 0.644 |
## | 0.901 | 0.221 | |
## | 0.561 | 0.083 | |
## -------------|-----------|-----------|-----------|
## Female | 81 | 385 | 466 |
## | 150.751 | 248.709 | |
## | 0.174 | 0.826 | 0.356 |
## | 0.099 | 0.779 | |
## | 0.062 | 0.294 | |
## -------------|-----------|-----------|-----------|
## Column Total | 815 | 494 | 1309 |
## | 0.623 | 0.377 | |
## -------------|-----------|-----------|-----------|
##
##
titanic_df %>%
  ggplot() +
  geom_bar(mapping = aes(Sex, fill = Survived))
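To make the five numbers in each CrossTable cell easier to read, the following Python sketch recomputes them from the four counts in the Sex table. The helper name `cross_table_cell` and the `counts` dictionary are hypothetical. The chi-square contribution is taken as (observed - expected)^2 / expected, with expected = row total × column total / table total, which is consistent with the values printed in the table.

```python
def cross_table_cell(counts, row, col):
    """Return CrossTable-style statistics for one cell of a table of counts."""
    row_total = sum(counts[row].values())
    col_total = sum(r[col] for r in counts.values())
    table_total = sum(sum(r.values()) for r in counts.values())
    n = counts[row][col]
    # Count expected in this cell if row and column were independent
    expected = row_total * col_total / table_total
    return {
        "n": n,
        "chi_sq": (n - expected) ** 2 / expected,
        "row_prop": n / row_total,      # N / Row Total
        "col_prop": n / col_total,      # N / Col Total
        "table_prop": n / table_total,  # N / Table Total
    }


# Counts copied from the Sex x Survived table
counts = {
    "Male": {"Dead": 734, "Survived": 109},
    "Female": {"Dead": 81, "Survived": 385},
}
female_survived = cross_table_cell(counts, "Female", "Survived")
print(round(female_survived["row_prop"], 3))  # share of women who survived
```

The row proportion answers "what share of this group survived", while the column proportion answers "what share of all survivors belongs to this group"; the two are easy to confuse when reading the table.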
Survived as a function of Pclass: In absolute numbers, first class (186) and third class (191) had a similar count of survivors, but the survival rate fell sharply with class: 57.6% in first, 42.2% in second and 26.9% in third. Third class also accounts for 63.6% of all deaths.
CrossTable(titanic_df$Pclass, titanic_df$Survived, dnn = c("Pclass", "Survived"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1309
##
##
## | Survived
## Pclass | Dead | Survived | Row Total |
## -------------|-----------|-----------|-----------|
## First Class | 137 | 186 | 323 |
## | 20.434 | 33.712 | |
## | 0.424 | 0.576 | 0.247 |
## | 0.168 | 0.377 | |
## | 0.105 | 0.142 | |
## -------------|-----------|-----------|-----------|
## Second Class | 160 | 117 | 277 |
## | 0.901 | 1.486 | |
## | 0.578 | 0.422 | 0.212 |
## | 0.196 | 0.237 | |
## | 0.122 | 0.089 | |
## -------------|-----------|-----------|-----------|
## Third Class | 518 | 191 | 709 |
## | 13.281 | 21.911 | |
## | 0.731 | 0.269 | 0.542 |
## | 0.636 | 0.387 | |
## | 0.396 | 0.146 | |
## -------------|-----------|-----------|-----------|
## Column Total | 815 | 494 | 1309 |
## | 0.623 | 0.377 | |
## -------------|-----------|-----------|-----------|
##
##
titanic_df %>%
  ggplot() +
  geom_bar(mapping = aes(x = Pclass, fill = Survived))
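Survived as a function of Age: We create a new variable, Age.cat, to group passengers into age bands.
The classification is as follows: Niño (child) up to 14, Joven (young) over 14 and up to 24, Adulto (adult) over 24 and up to 64, and Mayor (senior) over 64.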
titanic_df$Age.cat <- cut(titanic_df$Age,
                          breaks = c(0, 14, 24, 64, Inf),
                          labels = c("Niño", "Joven", "Adulto", "Mayor"))
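By default `cut()` builds intervals that are open on the left and closed on the right, so an age of exactly 14 falls in Niño and an age of 0 would get no category. A small Python sketch of the same binning rule, assuming the breaks and labels above (the names `BREAKS`, `LABELS` and `age_category` are hypothetical):

```python
from bisect import bisect_left

BREAKS = [14, 24, 64]  # upper bounds of the first three bands
LABELS = ["Niño", "Joven", "Adulto", "Mayor"]


def age_category(age):
    """Map an age to its band using the intervals (0,14], (14,24], (24,64], (64,Inf)."""
    if age is None or age <= 0:
        return None  # outside every interval, like NA in R
    # bisect_left counts the upper bounds strictly below age,
    # so a value equal to a bound stays in the lower band
    return LABELS[bisect_left(BREAKS, age)]


for age in (0.17, 14, 14.5, 24, 64, 80):
    print(age, age_category(age))
```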
Most passengers fall in the Adulto band.
Survival in the Niño band was balanced, at close to 50% (48.18%).
The Mayor band had the lowest chance of survival, with a survival rate of 31.25%. Because the missing ages are imputed at random, these percentages change from run to run.
round(prop.table(table(titanic_df$Age.cat,
                       titanic_df$Survived,
                       dnn = c("Age", "Survived(%)")), 1) * 100, 2)
## Survived(%)
## Age Dead Survived
## Niño 51.82 48.18
## Joven 63.32 36.68
## Adulto 63.45 36.55
## Mayor 68.75 31.25
titanic_df %>%
  ggplot() +
  geom_bar(aes(Age.cat, fill = Survived))
Survived as a function of Embarked: Port S had the lowest survival rate (33.4%), even though it is the port where most passengers boarded (69.9%). The table covers 1307 passengers because two records have no port of embarkation.
CrossTable(titanic_df$Embarked,
           titanic_df$Survived,
           dnn = c("Embarked", "Survived"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1307
##
##
## | Survived
## Embarked | Dead | Survived | Row Total |
## -------------|-----------|-----------|-----------|
## C | 137 | 133 | 270 |
## | 5.842 | 9.678 | |
## | 0.507 | 0.493 | 0.207 |
## | 0.168 | 0.270 | |
## | 0.105 | 0.102 | |
## -------------|-----------|-----------|-----------|
## Q | 69 | 54 | 123 |
## | 0.773 | 1.280 | |
## | 0.561 | 0.439 | 0.094 |
## | 0.085 | 0.110 | |
## | 0.053 | 0.041 | |
## -------------|-----------|-----------|-----------|
## S | 609 | 305 | 914 |
## | 2.677 | 4.435 | |
## | 0.666 | 0.334 | 0.699 |
## | 0.747 | 0.620 | |
## | 0.466 | 0.233 | |
## -------------|-----------|-----------|-----------|
## Column Total | 815 | 492 | 1307 |
## | 0.624 | 0.376 | |
## -------------|-----------|-----------|-----------|
##
##
titanic_df %>%
  ggplot() +
  geom_bar(mapping = aes(x = Embarked, fill = Survived))
Survived as a function of SibSp and Parch: Most passengers had no relatives on board, and the death rate exceeds 60% for both SibSp and Parch.
titanic_df %>%
  ggplot() +
  geom_bar(mapping = aes(x = SibSp, fill = Survived))
titanic_df %>%
  ggplot() +
  geom_bar(mapping = aes(x = Parch, fill = Survived))
Survived as a function of Name: To turn Name into a categorical variable, we extract each passenger's title and use it to determine the survival rate.
This requires regular expressions, for which we switch to Python.
df = r.titanic_df
We add a new column called Title.
import re
Title = []
# A title is a run of letters preceded by a space and followed by a period
patron_title = re.compile(r' ([A-Za-z]+)\.')
for name in df["Name"]:
    extract = patron_title.findall(name)
    Title.append(extract[0])  # keep the first match
df["Title"] = Title
# Titles present:
print(df["Title"].unique())
## ['Mr' 'Mrs' 'Miss' 'Master' 'Don' 'Rev' 'Dr' 'Mme' 'Ms' 'Major' 'Lady'
## 'Sir' 'Mlle' 'Col' 'Capt' 'Countess' 'Jonkheer' 'Dona']
We then group the titles:
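# Fold the less common titles into the four main ones plus "Rare"
df["Title"] = df["Title"].replace(["Mlle", "Ms", "Lady", "Countess"], "Miss")
df["Title"] = df["Title"].replace(["Sir", "Don"], "Mr")
# Note: "Mme" (Madame) is a female title; it is mapped to "Mr" here and the table below reflects that
df["Title"] = df["Title"].replace("Mme", "Mr")
df["Title"] = df["Title"].replace(['Rev', 'Dr', 'Col', 'Major', 'Capt', 'Jonkheer', 'Dona'], 'Rare')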
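The extraction and grouping steps can also be written as one small function that needs only the standard library. The sketch below is a possible variant, not the method used in this analysis: the names `TITLE_GROUPS` and `extract_title` and the `default` fallback are hypothetical. It differs from the code above in two ways. It returns a fallback value instead of failing when a name contains no title, and it maps Mme (Madame) to Mrs, which is an assumption about the intended grouping.

```python
import re

# Same pattern as above: letters preceded by a space and followed by a period
TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")

# Hypothetical grouping table; titles not listed here are kept unchanged
TITLE_GROUPS = {
    "Mlle": "Miss", "Ms": "Miss", "Lady": "Miss", "Countess": "Miss",
    "Sir": "Mr", "Don": "Mr",
    "Mme": "Mrs",  # assumption: treated as the married female title
    "Rev": "Rare", "Dr": "Rare", "Col": "Rare", "Major": "Rare",
    "Capt": "Rare", "Jonkheer": "Rare", "Dona": "Rare",
}


def extract_title(name, default="Unknown"):
    """Return the grouped title found in a passenger name, or default if none is found."""
    match = TITLE_PATTERN.search(name)
    if match is None:
        return default
    title = match.group(1)
    return TITLE_GROUPS.get(title, title)


# The first three names come from the data shown earlier; the last two are made up
for name in ["Braund, Mr. Owen Harris",
             "Heikkinen, Miss. Laina",
             "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
             "Example, Dr. Jane",
             "Name without a title"]:
    print(name, "->", extract_title(name))
```

Keeping the grouping in a single dictionary makes each decision visible in one place, which helps when reviewing questionable assignments.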
We pass the data frame back to R:
titanic_df <- py$df
We convert the new Title variable to a factor.
unique(titanic_df$Title)
## [1] "Mr" "Mrs" "Miss" "Master" "Rare"
titanic_df$Title <- factor(titanic_df$Title)
The title Mr clearly has the lowest survival rate (10.9%), as expected for the adult male title, while Mrs (86.8%) and Miss (79.3%) have the highest.
CrossTable(titanic_df$Title, titanic_df$Survived, dnn = c("Title","Survived"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1309
##
##
## | Survived
## Title | Dead | Survived | Row Total |
## -------------|-----------|-----------|-----------|
## Master | 38 | 23 | 61 |
## | 0.000 | 0.000 | |
## | 0.623 | 0.377 | 0.047 |
## | 0.047 | 0.047 | |
## | 0.029 | 0.018 | |
## -------------|-----------|-----------|-----------|
## Miss | 55 | 211 | 266 |
## | 73.880 | 121.887 | |
## | 0.207 | 0.793 | 0.203 |
## | 0.067 | 0.427 | |
## | 0.042 | 0.161 | |
## -------------|-----------|-----------|-----------|
## Mr | 677 | 83 | 760 |
## | 87.789 | 144.833 | |
## | 0.891 | 0.109 | 0.581 |
## | 0.831 | 0.168 | |
## | 0.517 | 0.063 | |
## -------------|-----------|-----------|-----------|
## Mrs | 26 | 171 | 197 |
## | 76.166 | 125.659 | |
## | 0.132 | 0.868 | 0.150 |
## | 0.032 | 0.346 | |
## | 0.020 | 0.131 | |
## -------------|-----------|-----------|-----------|
## Rare | 19 | 6 | 25 |
## | 0.758 | 1.250 | |
## | 0.760 | 0.240 | 0.019 |
## | 0.023 | 0.012 | |
## | 0.015 | 0.005 | |
## -------------|-----------|-----------|-----------|
## Column Total | 815 | 494 | 1309 |
## | 0.623 | 0.377 | |
## -------------|-----------|-----------|-----------|
##
##
titanic_df %>%
  ggplot() +
  geom_bar(mapping = aes(x = Title, fill = Survived))