library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych)
## 
## Attaching package: 'psych'
## 
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
train <- read_csv("train.csv")
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Q1

head(train)
## # A tibble: 6 × 12
##   PassengerId Survived Pclass Name    Sex     Age SibSp Parch Ticket  Fare Cabin
##         <dbl>    <dbl>  <dbl> <chr>   <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
## 1           1        0      3 Braund… male     22     1     0 A/5 2…  7.25 <NA> 
## 2           2        1      1 Cuming… fema…    38     1     0 PC 17… 71.3  C85  
## 3           3        1      3 Heikki… fema…    26     0     0 STON/…  7.92 <NA> 
## 4           4        1      1 Futrel… fema…    35     1     0 113803 53.1  C123 
## 5           5        0      3 Allen,… male     35     0     0 373450  8.05 <NA> 
## 6           6        0      3 Moran,… male     NA     0     0 330877  8.46 <NA> 
## # ℹ 1 more variable: Embarked <chr>
train %>% 
  distinct(PassengerId)
## # A tibble: 891 × 1
##    PassengerId
##          <dbl>
##  1           1
##  2           2
##  3           3
##  4           4
##  5           5
##  6           6
##  7           7
##  8           8
##  9           9
## 10          10
## # ℹ 881 more rows

PassengerId is a qualitative variable measured at the nominal level, while Age is a quantitative (numerical) variable.
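
One way to back up the nominal classification is to confirm that the ID takes a different value on every row, so it only labels passengers, whereas arithmetic on Age is meaningful. The following is a minimal sketch; the data frame `toy` and its values are made up for illustration and stand in for `train`.

```r
# Hypothetical stand-in for `train`; the values are made up for illustration.
toy <- data.frame(
  PassengerId = c(1, 2, 3, 4),
  Age         = c(22, 38, NA, 35)
)

# An identifier should have exactly one distinct value per row.
id_is_unique <- length(unique(toy$PassengerId)) == nrow(toy)

# A mean is meaningful for Age; for an ID it would have no interpretation.
mean_age <- mean(toy$Age, na.rm = TRUE)

id_is_unique
mean_age
```

The same check on the real data would use `train$PassengerId` and `nrow(train)`.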

Q2

colSums(is.na(train))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2

Cabin is the variable with the most NAs (687 of 891 rows), followed by Age (177) and Embarked (2).

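# Replace missing ages with the median of the observed ages.
train$Age[is.na(train$Age)] <- median(train$Age, na.rm = TRUE)
# SibSp and Parch have no NAs (see the counts above), so these two lines leave the data unchanged.
train$SibSp[is.na(train$SibSp)] <- median(train$SibSp, na.rm = TRUE)
train$Parch[is.na(train$Parch)] <- median(train$Parch, na.rm = TRUE)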
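
Median imputation keeps the median where it was but tends to shrink the spread, because every filled-in value sits at the centre of the distribution. This matters when reading the Age summary later. The sketch below illustrates the effect on a made-up vector `age` (hypothetical values, not taken from `train`).

```r
# Hypothetical ages with missing values; made up for illustration.
age <- c(2, 20, 28, 40, 60, NA, NA, NA)

# Replace each NA with the median of the observed values.
age_imputed <- age
age_imputed[is.na(age_imputed)] <- median(age, na.rm = TRUE)

# The median is unchanged, but the standard deviation drops because the
# imputed values add no spread.
median(age, na.rm = TRUE)
median(age_imputed)
sd(age, na.rm = TRUE)
sd(age_imputed)
```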

Q3

describe(train[, c("Age", "SibSp", "Parch")])
##       vars   n  mean    sd median trimmed mad  min max range skew kurtosis   se
## Age      1 891 29.36 13.02     28   28.83 8.9 0.42  80 79.58 0.51     0.97 0.44
## SibSp    2 891  0.52  1.10      0    0.27 0.0 0.00   8  8.00 3.68    17.73 0.04
## Parch    3 891  0.38  0.81      0    0.18 0.0 0.00   6  6.00 2.74     9.69 0.03

The Titanic training set contains a total of 891 cases. Because the three variables measure different things, some insights to note for each include:

The average age was about 29 years, with the youngest passenger about 5 months old (0.42 years) and the oldest 80 years old. Passengers span the full range of ages, and the kurtosis is fairly low at 0.97. These figures include the 177 missing ages that were replaced with the median.

Some passengers had no siblings or spouse aboard, while at the other extreme some had as many as 8. Most passengers did not board with a sibling or spouse: the median of SibSp is 0 and the mean is only 0.52. I read the very high kurtosis (17.73) and the positive skew (3.68) as a sharp peak at the low end of the range with a long right tail.

Some passengers did not board with any parents or children (the median of Parch is 0), but at least one passenger had as many as 6 parents/children aboard.
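
To see why a pile-up at zero plus a few large values produces high skew and kurtosis, the moments can be computed by hand. This is only a sketch: `x` is a made-up vector, and `skew_simple()` and `excess_kurtosis_simple()` are hypothetical helpers that use plain moment formulas, so their results may differ slightly from what `describe()` reports.

```r
# Hypothetical counts shaped like SibSp: mostly zeros with a long right tail.
x <- c(rep(0, 14), rep(1, 4), 3, 8)

# Central moment of order k (dividing by n).
central_moment <- function(v, k) mean((v - mean(v))^k)

# Moment-based skewness: third moment scaled by the variance^(3/2).
skew_simple <- function(v) central_moment(v, 3) / central_moment(v, 2)^1.5

# Moment-based excess kurtosis: fourth moment scaled by variance^2, minus 3
# so that a normal distribution scores 0.
excess_kurtosis_simple <- function(v) {
  central_moment(v, 4) / central_moment(v, 2)^2 - 3
}

median(x)                  # the typical value
mean(x == 0)               # share of zeros
skew_simple(x)             # positive: tail on the right
excess_kurtosis_simple(x)  # positive: sharper peak and heavier tail than normal
```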

Q4

table(train$Survived, train$Sex)
##    
##     female male
##   0     81  468
##   1    233  109
prop.table(table(train$Survived, train$Sex))
##    
##         female       male
##   0 0.09090909 0.52525253
##   1 0.26150393 0.12233446

Overall, the tables show that only 38.38% of the passengers (342/891 cases) survived the Titanic, and that more women survived than men. In other words, 61.62% of passengers (549/891 cases) lost their lives, and most of them were men: 468 men died, which is about 52.53% of all passengers.

Looking more closely at the differences by gender, a majority of the women survived, while more than half of the men did not. The survival rate among women is about 74.20% (233/(81+233)), but the survival rate among men is only about 18.89% (109/(468+109)). About 85.25% of the passengers who did not survive were men (468/(81+468)).
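
The within-group rates quoted here can also be obtained directly with the `margin` argument of `prop.table()` instead of by hand. The sketch below rebuilds the 2 x 2 table from the counts printed above so it runs on its own; the object names `counts`, `by_sex`, and `by_status` are just illustrative.

```r
# Counts copied from the table() output above (rows: Survived, columns: Sex).
counts <- matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# margin = 2: proportions within each sex (each column sums to 1).
by_sex <- prop.table(counts, margin = 2)

# margin = 1: proportions within each survival status (each row sums to 1).
by_status <- prop.table(counts, margin = 1)

# Overall share of survivors across all passengers.
overall_survival <- sum(counts["1", ]) / sum(counts)

round(100 * by_sex, 2)
round(100 * by_status, 2)
round(100 * overall_survival, 2)
```

With the real data, `prop.table(table(train$Survived, train$Sex), margin = 2)` should give the same within-sex rates.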

Q5

boxplot(Age ~ Survived, data = train,
        notch = TRUE,
        horizontal = TRUE,
        main = "Age Distribution by Survival Status",
        xlab = "Age",
        ylab = "Survival Status (0 = No, 1 = Yes)",
        pch = 16)

For both survivors and non-survivors, ages span nearly the whole range from 0 to about 80. The median ages of the two groups are very similar.

There are more outliers among the passengers who did not survive than among those who survived. I assume there was an effort to save the younger passengers, since the boxplot for the survivors covers a wider range, from 0 years old up to around the mid-50s.
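
Readings taken from a boxplot by eye can be checked numerically. The sketch below computes the median and the number of outliers per group with `tapply()` and `boxplot.stats()`, which uses the same 1.5 x IQR whisker rule as `boxplot()` by default. The data frame `toy` and its values are made up for illustration and are not the Titanic data.

```r
# Hypothetical ages and survival flags; made up for illustration.
toy <- data.frame(
  Age      = c(20, 22, 24, 25, 26, 28, 30, 70,
               1, 18, 24, 27, 29, 33, 36, 52),
  Survived = rep(c(0, 1), each = 8)
)

# Median age within each survival group.
medians <- tapply(toy$Age, toy$Survived, median)

# boxplot.stats() returns the points beyond the whiskers in `$out`;
# counting them gives the number of outliers per group.
n_outliers <- tapply(
  toy$Age, toy$Survived,
  function(x) length(boxplot.stats(x)$out)
)

medians
n_outliers
```

Replacing `toy` with `train` would give the group medians and outlier counts behind the plot above.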