The titanic.csv file contains information on 891 passengers who were aboard the Titanic on the day of the tragedy. The dataset includes, among others, the following variables: survival, ticket class, age in years, number of siblings/spouses aboard the Titanic, and number of parents/children aboard the Titanic.
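## Import the data
library(tidyverse)   # read_csv, glimpse, dplyr verbs, ggplot2
library(knitr)       # kable
library(kableExtra)  # kable_styling
titanic1 <- read_csv("titanic (1).csv")
glimpse(titanic1)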
## Rows: 891
## Columns: 12
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
We observe that some variables were not read with an appropriate data type: Survived, Pclass, Sex, SibSp, Parch, Embarked, and Cabin. These should be treated as factors.
titanic1$Survived <- as.factor(titanic1$Survived)
titanic1$Pclass <- as.factor(titanic1$Pclass)
titanic1$Sex <- as.factor(titanic1$Sex)
titanic1$SibSp <- as.factor(titanic1$SibSp)
titanic1$Parch <- as.factor(titanic1$Parch)
titanic1$Embarked <- as.factor(titanic1$Embarked)
titanic1$Cabin <- as.factor(titanic1$Cabin)
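The same conversion can be written in a single step. The following is a minimal sketch, assuming dplyr 1.0 or later is available; the `toy` tibble and the `factor_cols` vector are hypothetical stand-ins used only for illustration, not the real data.

```r
library(dplyr)

# Toy stand-in for titanic1 (values are illustrative, not the real data).
toy <- tibble(
  Survived = c(0, 1, 1),
  Pclass   = c(3, 1, 3),
  Sex      = c("male", "female", "female"),
  Age      = c(22, 38, 26)
)

# Columns to treat as categorical; Age stays numeric.
factor_cols <- c("Survived", "Pclass", "Sex")

# Apply as.factor() to every listed column at once.
toy <- toy %>%
  mutate(across(all_of(factor_cols), as.factor))

# Check which columns are now factors.
sapply(toy, is.factor)
```

Listing the column names in one vector keeps the set of categorical variables in a single place, which may be easier to maintain than one assignment per column.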
1.1 Missing values.
# Share of missing values in each column
valores_ausentes <- sapply(titanic1, function(x) sum(is.na(x)) / length(x))
# Keep only the columns that have missing values
valores_ausentes <- valores_ausentes[valores_ausentes > 0]
q1 <- ggplot() +
  geom_bar(aes(x = names(valores_ausentes), y = valores_ausentes),
           stat = "identity", show.legend = FALSE,
           width = 0.7, fill = "orange")
q1 +
  scale_y_continuous(limits = c(0, 0.8),
                     labels = function(x) paste(x * 100, "%"),
                     expand = c(0, 0)) +
  xlab("Variables") + ylab("Percentage") +
  ggtitle("Missing values") + theme_linedraw()
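The share of missing values per column can also be computed without an explicit anonymous function. This is a small base R sketch on a hypothetical `toy` data frame (the values are made up for illustration); it relies on `is.na()` returning a logical matrix when applied to a data frame.

```r
# Toy data frame with some missing entries (illustrative values only).
toy <- data.frame(
  Age   = c(22, NA, 26, NA),
  Cabin = c(NA, "C85", NA, NA),
  Fare  = c(7.25, 71.28, 7.92, 53.10)
)

# is.na() gives a logical matrix; column means are the missing shares.
missing_share <- colMeans(is.na(toy))

# Keep only the columns that have at least one missing value.
missing_share[missing_share > 0]
```

In this toy example Age is missing in half of the rows and Cabin in three quarters, while Fare is dropped because it is complete.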
# Mean and median age of the passengers who survived (Survived == 1)
t1 <- titanic1 %>%
  select(Survived, Age) %>%
  filter(Survived == 1) %>%
  summarise(media = mean(Age, na.rm = TRUE),
            mediana = median(Age, na.rm = TRUE))
row.names(t1) <- c("Age")
t1 %>%
  kable(digits = 2, caption = "Age", align = c("c", "c")) %>%
  kable_styling(position = "center")
| media | mediana |
|---|---|
| 28.34 | 28 |
# Number of passengers in each ticket class
t2 <- titanic1 %>%
  group_by(Pclass) %>%
  summarise(cantidad = n())
t2 %>%
  kable(align = c("l", "c"),
        caption = "Number of passengers travelling in each class") %>%
  kable_styling(position = "center")
| Pclass | cantidad |
|---|---|
| 1 | 216 |
| 2 | 184 |
| 3 | 491 |
# Number of passengers by ticket class and survival status
t4 <- titanic1 %>%
  group_by(Pclass, Survived) %>%
  summarise(Cantidad = n())
t4 %>%
  kable(align = c("c", "c", "c")) %>%
  kable_styling(position = "center")
| Pclass | Survived | Cantidad |
|---|---|---|
| 1 | 0 | 80 |
| 1 | 1 | 136 |
| 2 | 0 | 97 |
| 2 | 1 | 87 |
| 3 | 0 | 372 |
| 3 | 1 | 119 |
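The counts are easier to compare across classes when expressed as survival rates. The sketch below uses base R only and copies the counts from the table above; the names `counts`, `totals`, `survivors`, and `survival_rate` are introduced here purely for illustration.

```r
# Counts copied from the table above.
counts <- data.frame(
  Pclass   = c(1, 1, 2, 2, 3, 3),
  Survived = c(0, 1, 0, 1, 0, 1),
  Cantidad = c(80, 136, 97, 87, 372, 119)
)

# Total passengers per class.
totals <- tapply(counts$Cantidad, counts$Pclass, sum)

# Survivors per class (rows with Survived == 1).
survivors <- with(counts[counts$Survived == 1, ],
                  tapply(Cantidad, Pclass, sum))

# Share of each class that survived.
survival_rate <- survivors / totals
round(survival_rate, 3)
```

With these counts the rates are roughly 0.63 for first class, 0.47 for second class, and 0.24 for third class.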
# Cheapest, most expensive, and average ticket fare
titanic1 %>%
  summarise(masEconomico = min(Fare),
            masCostoso = max(Fare),
            Promedio = mean(Fare))
## # A tibble: 1 × 3
## masEconomico masCostoso Promedio
## <dbl> <dbl> <dbl>
## 1 0 512. 32.2
According to the data, the minimum ticket fare is 0, which is confusing: this value may represent a complimentary ticket, or it may simply stand in for a missing value.
# Same summary, excluding the tickets with a fare of 0
titanic1 %>%
  filter(Fare > 0) %>%
  summarise(masEconomico = min(Fare),
            masCostoso = max(Fare),
            Promedio = mean(Fare))
## # A tibble: 1 × 3
## masEconomico masCostoso Promedio
## <dbl> <dbl> <dbl>
## 1 4.01 512. 32.8
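If the zero fares are read as missing rather than as free tickets, one option is to recode them as NA before summarising. This is only a sketch under that assumption, using a hypothetical `fares` vector with made-up values rather than the real Fare column.

```r
# Illustrative fares; the zeros stand in for the ambiguous entries.
fares <- c(0, 7.25, 71.28, 0, 8.05)

# How many fares are exactly zero?
n_zero <- sum(fares == 0)
n_zero

# Assumption: treat zero as "unknown" by recoding it to NA.
fares_na <- ifelse(fares == 0, NA, fares)

mean(fares)                   # zeros pull the mean down
mean(fares_na, na.rm = TRUE)  # mean of the positive fares only
```

Recoding keeps the rows in the data set, so the other variables of those passengers remain available, whereas filtering drops them from the pipeline entirely.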
# Number of passengers by sex
titanic1 %>%
  ggplot(aes(x = Sex)) +
  geom_bar(stat = "count", width = 0.5, fill = "orange") +
  theme_linedraw() +
  xlab("Sex") +
  scale_y_continuous(limits = c(0, 600), expand = c(0, 0))
titanic1 %>% ggplot(aes(x=Pclass)) + geom_bar(stat = "count")
8. Draw a box plot of the age data.
How do you interpret what the plot shows?
titanic1 %>% ggplot(aes(y = Age, x = "Age")) + geom_boxplot() + geom_jitter()
plot(density(titanic1$Age, na.rm = TRUE))
shapiro.test( na.omit(titanic1$Age) )
##
## Shapiro-Wilk normality test
##
## data: na.omit(titanic1$Age)
## W = 0.98146, p-value = 7.337e-08
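The reported p-value (7.337e-08) is far below the conventional 0.05 level, so the test rejects the hypothesis that Age is normally distributed. The sketch below shows how such a result might be read programmatically; it uses simulated, deliberately skewed data rather than the Titanic ages, and the 0.05 threshold is an assumption, not something stated in the report.

```r
set.seed(1)  # reproducible simulated data, not the Titanic ages

# A right-skewed sample, which should be flagged as non-normal.
skewed <- rexp(200)

result <- shapiro.test(skewed)
alpha  <- 0.05  # conventional significance level (an assumption)

# TRUE means we reject the null hypothesis of normality.
reject_normality <- result$p.value < alpha
reject_normality
```

With several hundred observations, as is the case for Age, the Shapiro-Wilk test can flag even moderate departures from normality, so it is usually worth reading the result alongside the density plot.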