Author: Hui-Ju Huang


Descriptive statistics of the Titanic dataset

1 Description of data set used in the analysis

1.1 Source

mydata <- read.table("./train.csv", header = TRUE, sep = ",", dec = ".")

head(mydata) 
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

1.2 Explanation of the dataset

  • Unit of Observation: Each row in the dataset represents a passenger who was aboard the Titanic.

  • Sample Size: The sample size is 891, which is the total number of rows (passengers) in the dataset.

  • Definition of All Variables

    • PassengerId: A unique identifier assigned to each passenger.
    • Survived: Indicates whether the passenger survived the disaster (0 = No, 1 = Yes).
    • Pclass: Ticket class of the passenger (1st, 2nd, or 3rd class).
    • Name: Name of the passenger.
    • Sex: Gender of the passenger (male or female).
    • Age: Age of the passenger in years.
    • SibSp: Number of siblings/spouses aboard the Titanic.
    • Parch: Number of parents/children aboard the Titanic.
    • Ticket: Ticket number of the passenger.
    • Fare: Fare paid by the passenger.
    • Cabin: Cabin number of the passenger.
    • Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
  • Measurement Type and Units

    • PassengerId: Numeric (sequential integer values).
    • Survived: Categorical (0 or 1).
    • Pclass: Categorical, ordinal (1 = 1st, 2 = 2nd, 3 = 3rd class).
    • Name: Text.
    • Sex: Categorical (male or female).
    • Age: Numeric (years).
    • SibSp: Numeric (count of siblings/spouses).
    • Parch: Numeric (count of parents/children).
    • Ticket: Text or alphanumeric.
    • Fare: Numeric (currency).
    • Cabin: Text or alphanumeric.
    • Embarked: Categorical (C, Q, or S).

1.3 Data manipulation

  • Check the structure of the dataset
    • There are 891 observations and 12 variables.
str(mydata)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
  • Summary of the dataset
summary(mydata)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 
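  • Check for missing values
    • The variable “Age” has 177 missing values (NA).
    • Empty strings, such as the blank entries in “Cabin”, are not stored as NA and are therefore not counted here.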
colSums(is.na(mydata))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
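The count above only covers values that R stores as NA. A minimal sketch, assuming blank fields should be treated as missing, is to pass na.strings when importing. The inline CSV text and the object toy are hypothetical stand-ins for train.csv, not part of the original analysis:

```r
# Hypothetical three-row CSV standing in for train.csv
csv_text <- "PassengerId,Age,Cabin,Embarked
1,22,,S
2,38,C85,C
3,,,
"

# na.strings = c("NA", "") tells read.table to store blank fields as NA
toy <- read.table(text = csv_text, header = TRUE, sep = ",", dec = ".",
                  na.strings = c("NA", ""))

# Blank Cabin and Embarked entries are now counted as missing
# (Age: 1, Cabin: 2, Embarked: 1 in this toy input)
colSums(is.na(toy))
```

With the real file, the same na.strings argument could be added to the read.table() call in section 1.1.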
  • Convert categorical variables to factors
mydata$Survived <- factor(mydata$Survived)
mydata$Pclass <- factor(mydata$Pclass)
mydata$Sex <- factor(mydata$Sex)
mydata$Embarked <- factor(mydata$Embarked)
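  • Check the structure of the dataset after conversion
    • “Embarked” now has four levels because two passengers have an empty string ("") instead of a port code.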
str(mydata)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
  • Summary of the dataset after conversion
summary(mydata)
##   PassengerId    Survived Pclass      Name               Sex     
##  Min.   :  1.0   0:549    1:216   Length:891         female:314  
##  1st Qu.:223.5   1:342    2:184   Class :character   male  :577  
##  Median :446.0            3:491   Mode  :character               
##  Mean   :446.0                                                   
##  3rd Qu.:668.5                                                   
##  Max.   :891.0                                                   
##                                                                  
##       Age            SibSp           Parch           Ticket         
##  Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891        
##  1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   Class :character  
##  Median :28.00   Median :0.000   Median :0.0000   Mode  :character  
##  Mean   :29.70   Mean   :0.523   Mean   :0.3816                     
##  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000                     
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000                     
##  NA's   :177                                                        
##       Fare           Cabin           Embarked
##  Min.   :  0.00   Length:891          :  2   
##  1st Qu.:  7.91   Class :character   C:168   
##  Median : 14.45   Mode  :character   Q: 77   
##  Mean   : 32.20                      S:644   
##  3rd Qu.: 31.00                              
##  Max.   :512.33                              
## 
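The factor conversion in this section keeps the raw codes ("0", "1", "") as level names. A possible variant, sketched here on a hypothetical toy data frame, names the levels explicitly. The labels "No"/"Yes" and "1st"/"2nd"/"3rd" are assumptions chosen for readability; values not listed in levels (such as the empty string) become NA:

```r
# Hypothetical toy rows; the third passenger has a blank port code
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Pclass   = c(3, 1, 3, 2),
  Embarked = c("S", "C", "", "Q"),
  stringsAsFactors = FALSE
)

# Readable labels instead of the raw 0/1 codes
toy$Survived <- factor(toy$Survived, levels = c(0, 1), labels = c("No", "Yes"))

# Ticket class has a natural order, so it is stored as an ordered factor
toy$Pclass <- factor(toy$Pclass, levels = c(1, 2, 3),
                     labels = c("1st", "2nd", "3rd"), ordered = TRUE)

# Only C, Q and S are valid levels; the blank entry turns into NA
toy$Embarked <- factor(toy$Embarked, levels = c("C", "Q", "S"),
                       labels = c("Cherbourg", "Queenstown", "Southampton"))

str(toy)
```

Listing the levels explicitly avoids the extra empty level seen in the summary above, at the cost of turning those entries into missing values.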

2 Descriptive statistics

  • Select variables
    • Select four numeric variables: “Age”, “Fare”, “SibSp”, and “Parch”.
selected_vars <- c("Age", "Fare", "SibSp", "Parch")
selected_data <- mydata[, selected_vars]
  • Parameter estimation (sample statistics used as point estimates)
    • The mean of Age is 29.70, based on the 714 passengers with a recorded age.
    • The median of Fare is 14.45.
    • The standard deviation of SibSp is 1.10.
    • The standard deviation of Parch is 0.81.
# install.packages("psych")  # run once if the package is not installed
library(psych)
describe(selected_data)
##       vars   n  mean    sd median trimmed   mad  min    max  range skew
## Age      1 714 29.70 14.53  28.00   29.27 13.34 0.42  80.00  79.58 0.39
## Fare     2 891 32.20 49.69  14.45   21.38 10.24 0.00 512.33 512.33 4.77
## SibSp    3 891  0.52  1.10   0.00    0.27  0.00 0.00   8.00   8.00 3.68
## Parch    4 891  0.38  0.81   0.00    0.18  0.00 0.00   6.00   6.00 2.74
##       kurtosis   se
## Age       0.16 0.54
## Fare     33.12 1.66
## SibSp    17.73 0.04
## Parch     9.69 0.03
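To make the columns n, mean, sd, median and se easier to interpret, the same quantities can be computed in base R. This is a minimal sketch on hypothetical toy values, not the implementation used by psych; the helper name summarise_numeric is an assumption introduced for illustration:

```r
# Hypothetical toy data with one missing Age, mimicking the real columns
toy <- data.frame(
  Age  = c(22, 38, 26, 35, NA, 54),
  Fare = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583)
)

# Illustrative helper: drop NAs first, then compute each statistic
summarise_numeric <- function(x) {
  x <- x[!is.na(x)]
  c(n      = length(x),
    mean   = mean(x),
    sd     = sd(x),
    median = median(x),
    se     = sd(x) / sqrt(length(x)))  # standard error of the mean
}

# One row per variable, one column per statistic
stats <- t(sapply(toy, summarise_numeric))
round(stats, 2)
```

Dropping missing values per variable is why Age reports n = 714 in the table above while the other variables report n = 891.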

3 Graphical representations

  • Load necessary libraries for plotting
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
Histogram of Age
  • The histogram of Age shows the distribution of ages among passengers.
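  • The distribution is mildly right-skewed (skewness 0.39): half of the passengers with a recorded age are between about 20 and 38, with a thinner tail of older passengers.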
ggplot(mydata, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "lightblue", color = "gray") +
  labs(title = "Distribution of Age",
       x = "Age",
       y = "Frequency")
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_bin()`).

Histogram of Fare
  • The histogram of Fare displays the distribution of fares paid by passengers, providing insights into the fare structure aboard the Titanic.
  • The distribution is strongly right-skewed (skewness 4.77): the median fare is 14.45, while the maximum is 512.33.
ggplot(mydata, aes(x = Fare)) +
  geom_histogram(binwidth = 10, fill = "lightgreen", color = "gray") +
  labs(title = "Distribution of Fare",
       x = "Fare",
       y = "Frequency")
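The right skew seen in the histograms can also be checked numerically. The sketch below uses one common definition of sample skewness (third central moment divided by the cubed standard deviation). The helper sample_skew and the toy fares are assumptions for illustration, and the result is not claimed to match the psych output digit for digit:

```r
# Illustrative helper: third central moment divided by sd^3
sample_skew <- function(x) {
  x  <- x[!is.na(x)]
  m3 <- mean((x - mean(x))^3)
  m3 / sd(x)^3
}

# Hypothetical toy fares: several cheap tickets and two expensive ones
fares <- c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583)

# A positive value indicates a longer right tail
sample_skew(fares)

# A mean above the median points in the same direction
mean(fares) > median(fares)
```

A value near zero would indicate a roughly symmetric distribution, and a negative value a longer left tail.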

Scatterplot of Age vs. Fare
  • The scatterplot of Age vs. Fare shows the relationship between age and fare paid by passengers.
  • It helps identify patterns or correlations between the two variables. The 177 passengers with a missing Age are not plotted.
ggplot(mydata, aes(x = Age, y = Fare)) +
  geom_point(color = "blue") +
  labs(title = "Scatterplot of Age vs. Fare",
       x = "Age",
       y = "Fare")
## Warning: Removed 177 rows containing missing values or values outside the scale range
## (`geom_point()`).
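A correlation coefficient can complement the visual impression from the scatterplot. The following is a minimal sketch on hypothetical toy values, not a result from the full dataset. It assumes that rows with a missing Age should be dropped pairwise, which is what use = "complete.obs" does:

```r
# Hypothetical toy data; the fifth passenger has no recorded age
toy <- data.frame(
  Age  = c(22, 38, 26, 35, NA, 54),
  Fare = c(7.25, 71.2833, 7.925, 53.1, 8.4583, 51.8625)
)

# Pearson correlation measures linear association
pearson <- cor(toy$Age, toy$Fare, use = "complete.obs")

# Spearman correlation is rank-based and less sensitive to extreme fares
spearman <- cor(toy$Age, toy$Fare, use = "complete.obs", method = "spearman")

c(pearson = pearson, spearman = spearman)
```

Because Fare is strongly right-skewed, the rank-based Spearman coefficient may be the more robust of the two summaries.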