EDA of Titanic data set

I recommend reading http://rpubs.com/vikassinha167/530654 before this article.

Once you have read it, we can explore the classic “Titanic” case.

# Uncomment the lines below if the packages are not installed
# install.packages("tidyverse")
# install.packages("Hmisc")
# install.packages("funModeling")
# Load the libraries
library(Hmisc)
library(funModeling)
library(tidyverse)

Let us check the structure of the Titanic dataset using two methods, str() and glimpse().

str(Titanic)

##  'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
##  - attr(*, "dimnames")=List of 4
##   ..$ Class   : chr [1:4] "1st" "2nd" "3rd" "Crew"
##   ..$ Sex     : chr [1:2] "Male" "Female"
##   ..$ Age     : chr [1:2] "Child" "Adult"
##   ..$ Survived: chr [1:2] "No" "Yes"

glimpse(Titanic)

##  'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
##  - attr(*, "dimnames")=List of 4
##   ..$ Class   : chr [1:4] "1st" "2nd" "3rd" "Crew"
##   ..$ Sex     : chr [1:2] "Male" "Female"
##   ..$ Age     : chr [1:2] "Child" "Adult"
##   ..$ Survived: chr [1:2] "No" "Yes"

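Both calls print the same output because glimpse() falls back to str() for objects that are not data frames. The Titanic dataset is a “table” (a four-dimensional array of counts), not a data frame, matrix, or plain vector.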

class(Titanic)

## [1] "table"

print(Titanic)

## , , Age = Child, Survived = No
## 
##       Sex
## Class  Male Female
##   1st     0      0
##   2nd     0      0
##   3rd    35     17
##   Crew    0      0
## 
## , , Age = Adult, Survived = No
## 
##       Sex
## Class  Male Female
##   1st   118      4
##   2nd   154     13
##   3rd   387     89
##   Crew  670      3
## 
## , , Age = Child, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st     5      1
##   2nd    11     13
##   3rd    13     14
##   Crew    0      0
## 
## , , Age = Adult, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st    57    140
##   2nd    14     80
##   3rd    75     76
##   Crew  192     20

Let us convert this “table” to a more user-friendly “data frame”.

TitanicDataFrame <- as.data.frame(Titanic)

Now check the class and data within this data frame.

class(TitanicDataFrame)

## [1] "data.frame"

print(TitanicDataFrame)

##    Class    Sex   Age Survived Freq
## 1    1st   Male Child       No    0
## 2    2nd   Male Child       No    0
## 3    3rd   Male Child       No   35
## 4   Crew   Male Child       No    0
## 5    1st Female Child       No    0
## 6    2nd Female Child       No    0
## 7    3rd Female Child       No   17
## 8   Crew Female Child       No    0
## 9    1st   Male Adult       No  118
## 10   2nd   Male Adult       No  154
## 11   3rd   Male Adult       No  387
## 12  Crew   Male Adult       No  670
## 13   1st Female Adult       No    4
## 14   2nd Female Adult       No   13
## 15   3rd Female Adult       No   89
## 16  Crew Female Adult       No    3
## 17   1st   Male Child      Yes    5
## 18   2nd   Male Child      Yes   11
## 19   3rd   Male Child      Yes   13
## 20  Crew   Male Child      Yes    0
## 21   1st Female Child      Yes    1
## 22   2nd Female Child      Yes   13
## 23   3rd Female Child      Yes   14
## 24  Crew Female Child      Yes    0
## 25   1st   Male Adult      Yes   57
## 26   2nd   Male Adult      Yes   14
## 27   3rd   Male Adult      Yes   75
## 28  Crew   Male Adult      Yes  192
## 29   1st Female Adult      Yes  140
## 30   2nd Female Adult      Yes   80
## 31   3rd Female Adult      Yes   76
## 32  Crew Female Adult      Yes   20

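As you can see above, the Titanic “table” has been coerced to a “data frame”: each row is one combination of Class, Sex, Age and Survived, and Freq holds the number of people in that combination. Let us check the structure of this data frame.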

glimpse(TitanicDataFrame)

## Observations: 32
## Variables: 5
## $ Class    <fct> 1st, 2nd, 3rd, Crew, 1st, 2nd, 3rd, Crew, 1st, 2nd, 3...
## $ Sex      <fct> Male, Male, Male, Male, Female, Female, Female, Femal...
## $ Age      <fct> Child, Child, Child, Child, Child, Child, Child, Chil...
## $ Survived <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N...
## $ Freq     <dbl> 0, 0, 35, 0, 0, 0, 17, 0, 118, 154, 387, 670, 4, 13, ...

str(TitanicDataFrame)

## 'data.frame':    32 obs. of  5 variables:
##  $ Class   : Factor w/ 4 levels "1st","2nd","3rd",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ Sex     : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : Factor w/ 2 levels "Child","Adult": 1 1 1 1 1 1 1 1 2 2 ...
##  $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq    : num  0 0 35 0 0 0 17 0 118 154 ...
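
The data frame holds 32 aggregated rows rather than one row per person. If a per-person layout is needed, the rows can be repeated according to Freq. This is a sketch in base R under that assumption; TitanicPassengers and row_index are hypothetical names introduced here for illustration.

```r
TitanicDataFrame <- as.data.frame(Titanic)

# Repeat each row index Freq times; combinations with Freq == 0 drop out
row_index <- rep(seq_len(nrow(TitanicDataFrame)), times = TitanicDataFrame$Freq)

# Keep the four categorical columns; Freq is no longer needed once rows are expanded
TitanicPassengers <- TitanicDataFrame[row_index, c("Class", "Sex", "Age", "Survived")]
rownames(TitanicPassengers) <- NULL

nrow(TitanicPassengers)            # one row per person
table(TitanicPassengers$Survived)  # counts now refer to people, not combinations
```

With this layout, row-counting tools report passenger counts directly, at the cost of a larger data frame.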

Let us check the status of the data, i.e. how many values are zero, NA, or infinite.

df_status(TitanicDataFrame)

##   variable q_zeros p_zeros q_na p_na q_inf p_inf    type unique
## 1    Class       0       0    0    0     0     0  factor      4
## 2      Sex       0       0    0    0     0     0  factor      2
## 3      Age       0       0    0    0     0     0  factor      2
## 4 Survived       0       0    0    0     0     0  factor      2
## 5     Freq       8      25    0    0     0     0 numeric     22
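
The zero, NA and infinite counts can be cross-checked without funModeling. The sketch below uses base R only; count_if_numeric and status_check are hypothetical names, and the result is not meant to reproduce every column of df_status().

```r
TitanicDataFrame <- as.data.frame(Titanic)

# Apply a counting function to numeric columns only; factors report 0
count_if_numeric <- function(x, f) if (is.numeric(x)) sum(f(x), na.rm = TRUE) else 0L

status_check <- data.frame(
  variable = names(TitanicDataFrame),
  zeros    = sapply(TitanicDataFrame, count_if_numeric, f = function(v) v == 0),
  na       = sapply(TitanicDataFrame, function(x) sum(is.na(x))),
  infinite = sapply(TitanicDataFrame, count_if_numeric, f = is.infinite),
  row.names = NULL
)
status_check
```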

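We are mostly concerned with NA and infinite values. There are none in this case, so we consider our data to be clean. Let us plot the frequency of each level of our categorical variables. Note that freq() counts rows, and each row here is one combination of levels rather than one person, so every level appears equally often.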

freq(TitanicDataFrame)

##   Class frequency percentage cumulative_perc
## 1   1st         8         25              25
## 2   2nd         8         25              50
## 3   3rd         8         25              75
## 4  Crew         8         25             100

##      Sex frequency percentage cumulative_perc
## 1   Male        16         50              50
## 2 Female        16         50             100

##     Age frequency percentage cumulative_perc
## 1 Child        16         50              50
## 2 Adult        16         50             100

##   Survived frequency percentage cumulative_perc
## 1       No        16         50              50
## 2      Yes        16         50             100

## [1] "Variables processed: Class, Sex, Age, Survived"
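
To get frequencies in terms of people rather than rows, the counts can be weighted by Freq. This is a small sketch using base R's xtabs(); class_totals and survival_by_class are illustrative names that do not appear in the original analysis.

```r
TitanicDataFrame <- as.data.frame(Titanic)

# Sum Freq within each level of Class instead of counting rows
class_totals <- xtabs(Freq ~ Class, data = TitanicDataFrame)
class_totals

# Share of people in each class, as percentages
round(100 * prop.table(class_totals), 1)

# Survival proportions within each class (each row sums to 1)
survival_by_class <- prop.table(
  xtabs(Freq ~ Class + Survived, data = TitanicDataFrame),
  margin = 1
)
round(survival_by_class, 3)
```

The same formula interface works for Sex and Age, or for any pair of variables.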

Let us profile Freq, the only numerical variable in the Titanic data set.

plot_num(TitanicDataFrame)

profiling_num(TitanicDataFrame)

##   variable     mean  std_dev variation_coef p_01 p_05 p_25 p_50 p_75
## 1     Freq 68.78125 135.9959       1.977224    0    0 0.75 13.5   77
##     p_95   p_99 skewness kurtosis   iqr    range_98   range_80
## 1 279.75 582.27 3.224419 13.77958 76.25 [0, 582.27] [0, 152.6]
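
Several of the figures above can be reproduced with base R, which helps in reading the column names. The sketch assumes that variation_coef is the standard deviation divided by the mean and that the percentiles use R's default quantile method; both assumptions agree with the printed values.

```r
freq_values <- as.data.frame(Titanic)$Freq

freq_mean <- mean(freq_values)   # mean
freq_sd   <- sd(freq_values)     # std_dev (sample standard deviation)
freq_cv   <- freq_sd / freq_mean # variation_coef, assumed to be sd / mean

# p_25, p_50 and p_75 with R's default quantile method (type 7)
freq_quartiles <- quantile(freq_values, probs = c(0.25, 0.50, 0.75))
freq_iqr <- IQR(freq_values)     # iqr = p_75 - p_25

freq_mean
freq_cv
freq_quartiles
freq_iqr
```

The mean lies far above the median, which is a first hint that a few very large counts dominate the distribution.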

describe(TitanicDataFrame)

## TitanicDataFrame 
## 
##  5  Variables      32  Observations
## ---------------------------------------------------------------------------
## Class 
##        n  missing distinct 
##       32        0        4 
##                               
## Value       1st  2nd  3rd Crew
## Frequency     8    8    8    8
## Proportion 0.25 0.25 0.25 0.25
## ---------------------------------------------------------------------------
## Sex 
##        n  missing distinct 
##       32        0        2 
##                         
## Value        Male Female
## Frequency      16     16
## Proportion    0.5    0.5
## ---------------------------------------------------------------------------
## Age 
##        n  missing distinct 
##       32        0        2 
##                       
## Value      Child Adult
## Frequency     16    16
## Proportion   0.5   0.5
## ---------------------------------------------------------------------------
## Survived 
##        n  missing distinct 
##       32        0        2 
##                   
## Value       No Yes
## Frequency   16  16
## Proportion 0.5 0.5
## ---------------------------------------------------------------------------
## Freq 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       32        0       22    0.984    68.78    106.4     0.00     0.00 
##      .25      .50      .75      .90      .95 
##     0.75    13.50    77.00   152.60   279.75 
## 
## lowest :   0   1   3   4   5, highest: 140 154 192 387 670
## ---------------------------------------------------------------------------

Let us also plot the Freq variable and check its distribution.

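# Map the Freq column by name inside aes() rather than with TitanicDataFrame$Freq
ggplot(TitanicDataFrame, mapping = aes(x = Freq)) + geom_density()

# after_stat(density) replaces the deprecated ..density.. notation (needs ggplot2 3.3.0 or later)
ggplot(TitanicDataFrame, mapping = aes(x = Freq)) +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "brown")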

Let us confirm the shape of this distribution using descriptive statistics.

dispersion <- profiling_num(TitanicDataFrame$Freq)
kurtosis <- dispersion$kurtosis
kurtosis

## [1] 13.77958

skewness <- dispersion$skewness
skewness

## [1] 3.224419

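Both the plots and the values of skewness (about 3.2, against 0 for a symmetric distribution) and kurtosis (about 13.8, against 3 for a normal distribution) indicate that Freq is not normally distributed; it is right-skewed with a heavy right tail.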
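
For readers who want to see where these two numbers come from, here is a sketch that computes them from central moments in base R. It assumes the moment-based definitions (skewness = m3 / m2^1.5, kurtosis = m4 / m2^2, with moments divided by n); central_moment, skew and kurt are illustrative names.

```r
freq_values <- as.data.frame(Titanic)$Freq

# k-th central moment: mean of (x - mean(x))^k, dividing by n rather than n - 1
central_moment <- function(x, k) mean((x - mean(x))^k)

# Skewness: third central moment scaled by the variance to the power 3/2
skew <- central_moment(freq_values, 3) / central_moment(freq_values, 2)^1.5

# Kurtosis (not excess kurtosis): fourth central moment over the squared variance
kurt <- central_moment(freq_values, 4) / central_moment(freq_values, 2)^2

skew
kurt
```

Under these definitions a normal distribution has skewness 0 and kurtosis 3, so both values for Freq are far from what normality would give.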

Vikas Sinha (G.O.D.S - https://www.gods-online.com)

September 21, 2019