Examine the Titanic Data Set (train.csv)


1. Load the data.

Set the path and import the .csv file containing the data set (train.csv). Account for a header row, specify the separator/delimiter as comma, and treat strings as factor (qualitative) variables at the outset.

setwd("/Users/whinton/src/rstudio/tim8501")
train <- read.csv("train.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)

2. Assess the dataframe and structure of variables.

Check the number of objects (rows) with nrow(), and the number of columns with length() or ncol().

nrow(train)
## [1] 891
cat("length():",length(train), ", ncol():",ncol(train))
## length(): 12 , ncol(): 12
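
length() and ncol() agree here because a data frame is stored as a list of columns, so its length is its column count. A minimal sketch on a made-up data frame (toy is hypothetical and not part of train.csv), which also shows dim() reporting both dimensions at once:

```r
# Hypothetical toy data frame, used only to illustrate the point
toy <- data.frame(id = 1:3, sex = c("male", "female", "female"))

length(toy)  # number of columns, because a data frame is a list of columns
ncol(toy)    # same value as length(toy)
dim(toy)     # rows and columns together
```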

Take a peek at the first few rows with head().

head(train)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

Take a peek at the last few rows with tail().

tail(train)
##     PassengerId Survived Pclass                                     Name    Sex
## 886         886        0      3     Rice, Mrs. William (Margaret Norton) female
## 887         887        0      2                    Montvila, Rev. Juozas   male
## 888         888        1      1             Graham, Miss. Margaret Edith female
## 889         889        0      3 Johnston, Miss. Catherine Helen "Carrie" female
## 890         890        1      1                    Behr, Mr. Karl Howell   male
## 891         891        0      3                      Dooley, Mr. Patrick   male
##     Age SibSp Parch     Ticket   Fare Cabin Embarked
## 886  39     0     5     382652 29.125              Q
## 887  27     0     0     211536 13.000              S
## 888  19     0     0     112053 30.000   B42        S
## 889  NA     1     2 W./C. 6607 23.450              S
## 890  26     0     0     111369 30.000  C148        C
## 891  32     0     0     370376  7.750              Q

Examine a summary of the dataframe with summary().

summary(train)
##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                                  
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked
##             :687    :  2   
##  B96 B98    :  4   C:168   
##  C23 C25 C27:  4   Q: 77   
##  G6         :  4   S:644   
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :186

3. Assess the variable types with str().

str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

4. Redefine variable types with cut() and as.factor().

The data set’s first two pertinent variables, Survived and Pclass, are stored as integers. Convert them to factor variables so that later output is human-readable.

Convert Survived from the numeric codes 0, 1 to the factor labels “NO”, “YES”.

str(train$Survived)
##  int [1:891] 0 1 1 1 0 0 0 0 1 1 ...
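# cut() bins the intervals (-1,0] and (0,1], so 0 maps to "NO" and 1 to "YES"
train$Survived <- cut(train$Survived, breaks=c(-1,0,1), labels=c("NO","YES"))
# cut() already returns a factor, so as.factor() leaves the result unchanged
train$Survived <- as.factor(train$Survived)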
str(train$Survived)
##  Factor w/ 2 levels "NO","YES": 1 2 2 2 1 1 1 1 2 2 ...

Convert Pclass from the numeric codes 1, 2, 3 to the factor labels “Upper”, “Middle”, “Lower”.

str(train$Pclass)
##  int [1:891] 3 1 3 1 3 3 1 3 3 2 ...
train$Pclass <- cut(train$Pclass, breaks=c(0,1,2,3), labels=c("Upper","Middle","Lower"))
train$Pclass <- as.factor(train$Pclass)
str(train$Pclass)
##  Factor w/ 3 levels "Upper","Middle",..: 3 1 3 1 3 3 1 3 3 2 ...
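
As an alternative to cut(), the same relabeling can be sketched with factor(), which maps each listed code directly to a label instead of binning intervals. The vectors below are made-up stand-ins for train$Survived and train$Pclass, and the names survived_f and pclass_f are hypothetical:

```r
# Toy vectors standing in for train$Survived and train$Pclass (made-up values)
survived <- c(0, 1, 1, 0)
pclass   <- c(3, 1, 2, 3)

# factor() maps each listed level to the label in the same position
survived_f <- factor(survived, levels = c(0, 1), labels = c("NO", "YES"))
pclass_f   <- factor(pclass, levels = c(1, 2, 3),
                     labels = c("Upper", "Middle", "Lower"))

levels(survived_f)
levels(pclass_f)
table(pclass_f)
```

One difference worth knowing: with factor(), a code that is not in levels becomes NA rather than being absorbed into a neighboring interval.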

5. Redefine variable factor levels.

For this step in the algorithm, we’ll just record the number of NAs and blanks, then drop the blank factor levels (e.g., in Embarked). As a follow-up, such values can be handled with extractions or some imputation method.

Handling blanks and NAs as two separate kinds of missing value is awkward, so first convert the blanks to NAs so that only one kind remains.

df <- train  # work on a copy of the original data frame for this step
# Loop through columns
for (i in seq_along(df)) {
  # Loop through rows
  for (j in seq_len(nrow(df))) {
    # Replace empty strings with a real NA value (not the string "NA");
    # cells that are already NA are left unchanged
    if (!is.na(df[j, i]) && df[j, i] == "") {
      df[j, i] <- NA
    }
  }
}

na_counts <- colSums(is.na(df))
print(na_counts)
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2
str(df$Embarked)
##  Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
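
The cell-by-cell loop can also be written column by column, which is usually faster on larger data. This is a minimal sketch on a made-up data frame (toy is hypothetical), not a replacement for the step above:

```r
# Hypothetical toy data frame with one blank and one NA
toy <- data.frame(
  Age      = c(22, NA, 26),
  Embarked = c("S", "", "C"),
  stringsAsFactors = TRUE
)

# Column by column, set entries equal to "" to NA; existing NAs are left alone
toy[] <- lapply(toy, function(col) {
  col[!is.na(col) & col == ""] <- NA
  col
})

colSums(is.na(toy))
levels(toy$Embarked)  # the "" level is still listed until it is dropped
```

As in the loop version, the empty "" level stays in the factor's level set after the values are replaced, which is why the levels are dropped in a separate step.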

Drop the blank factor levels for categorical variables. Notice in particular Cabin and Embarked, whose blank values are now NAs: the NA counts are unchanged, but the empty "" level is gone. Embarked previously had 4 levels, of which only 3 were readable values (C, Q, S).

df$Survived <- as.factor(as.character(df$Survived))
df$Pclass <- as.factor(as.character(df$Pclass))
df$Name <- as.factor(as.character(df$Name))
df$Sex <- as.factor(as.character(df$Sex))
df$Ticket <- as.factor(as.character(df$Ticket))
df$Cabin <- as.factor(as.character(df$Cabin))
df$Embarked <- as.factor(as.character(df$Embarked))

na_counts_base <- colSums(is.na(df))
print(na_counts_base)
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2
str(df$Embarked)
##  Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
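
Base R's droplevels() is another way to remove unused levels. Unlike the as.factor(as.character()) round trip, which re-sorts levels alphabetically, it keeps the existing level order. A minimal sketch on made-up factors (embarked and pclass are hypothetical stand-ins for the df columns):

```r
# Hypothetical toy factors mirroring Embarked and Pclass after blanks became NA
embarked <- factor(c("S", NA, "C", "Q"), levels = c("", "C", "Q", "S"))
pclass   <- factor(c("Lower", "Upper", "Middle"),
                   levels = c("Upper", "Middle", "Lower"))

levels(droplevels(embarked))             # the unused "" level is dropped
levels(droplevels(pclass))               # order Upper, Middle, Lower is kept
levels(as.factor(as.character(pclass)))  # round trip re-sorts alphabetically
```

droplevels() also accepts a whole data frame, so droplevels(df) would handle every factor column in one call.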

6. Consider removing variables that require no calculation, such as Name and Cabin.

For this data set, at this time, the passenger’s Name and Cabin don’t seem to have statistical significance, and Cabin has many missing values (687 of 891). So we can consider removing those columns. If a variable like Age becomes relevant and we want to deal with the missing/NA values, then such values can be handled with row or column removals or some sort of imputation method.

str(train$Cabin)
##  Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
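
If the columns are to be removed, one way to do it is to subset by column name. This is a sketch on a made-up data frame; toy, drop_cols, and reduced are hypothetical names, and the walkthrough itself keeps both columns:

```r
# Hypothetical toy data frame reusing a few of the column names discussed
toy <- data.frame(
  PassengerId = 1:3,
  Name  = c("A", "B", "C"),
  Cabin = c(NA, "C85", NA),
  Fare  = c(7.25, 71.28, 7.92)
)

drop_cols <- c("Name", "Cabin")  # columns to remove
reduced <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

names(reduced)
```

Selecting by name rather than by position keeps the code valid if the column order changes.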

Take a final look at the variable structure and data summary.


summary(df)
##   PassengerId    Survived     Pclass   
##  Min.   :  1.0   NO :549   Lower :491  
##  1st Qu.:223.5   YES:342   Middle:184  
##  Median :446.0             Upper :216  
##  Mean   :446.0                         
##  3rd Qu.:668.5                         
##  Max.   :891.0                         
##                                        
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked  
##  B96 B98    :  4   C   :168  
##  C23 C25 C27:  4   Q   : 77  
##  G6         :  4   S   :644  
##  C22 C26    :  3   NA's:  2  
##  D          :  3             
##  (Other)    :186             
##  NA's       :687
str(df)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "NO","YES": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "Lower","Middle",..: 1 3 1 3 1 1 3 1 1 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

Try a Plot

This write-up covers only the first 6 steps of the EDA algorithm, but as a learning exercise I also attempt some plotting.


Plot of the Titanic data set categorical/factor variable Sex (male, female).
Shows that the number of male passengers (577) is nearly double the number of female passengers (314).
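
The plot itself is not reproduced here. A bar chart of this kind could be drawn with base R's barplot(); the sketch below uses the counts from the summary() output, and the title and axis labels are assumptions rather than the original figure's:

```r
# Counts taken from the summary() output; with the data loaded,
# table(df$Sex) would give the same numbers
sex_counts <- c(female = 314, male = 577)

barplot(sex_counts,
        main = "Passengers by Sex",
        xlab = "Sex",
        ylab = "Number of passengers")

# Ratio of male to female passengers
sex_counts[["male"]] / sex_counts[["female"]]
```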


Plot of the Titanic data set categorical/factor variable Embarked (C, Q, S).
Shows that the port of embarkation for most passengers (644 of the 889 with a recorded port) was Southampton.
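
As with Sex, the Embarked plot is not reproduced here. A comparable bar chart could be sketched as follows, using the counts from the summary() output; the full port names and the labels are assumptions added for readability:

```r
# Counts taken from the summary() output; the 2 NA values are excluded
embarked_counts <- c(C = 168, Q = 77, S = 644)

barplot(embarked_counts,
        names.arg = c("Cherbourg", "Queenstown", "Southampton"),
        main = "Passengers by Port of Embarkation",
        xlab = "Port",
        ylab = "Number of passengers")

# Share of passengers per port, in percent
round(100 * embarked_counts / sum(embarked_counts), 1)
```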