Assess, Filter, Clean, Present the Data Set (titanic.csv)


1. Load the data.

Set the path and import the .csv file containing the data set (titanic.csv). Account for a header row, specify the separator/delimiter as comma, and treat strings as factor (qualitative) variables at the outset.

setwd("/Users/whinton/src/rstudio/tim8501")
titanic <- read.csv("titanic.csv", header = TRUE, sep= ",",stringsAsFactors = TRUE)

2. ASSESS the dataframe, structure, counts and statistics of variables.

Check the number of objects as rows using nrow(), and the number of columns as lenth() or ncol().

df <- titanic ## make copy of original dataset to data frame df
cat("Number of rows:", nrow(df),"")
## Number of rows: 891
cat("Number of cols:",length(df), "")
## Number of cols: 12

Take a peek at first few rows with head().

head(df)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

Take a peek at last few rows with tail().

tail(df)
##     PassengerId Survived Pclass                                     Name    Sex
## 886         886        0      3     Rice, Mrs. William (Margaret Norton) female
## 887         887        0      2                    Montvila, Rev. Juozas   male
## 888         888        1      1             Graham, Miss. Margaret Edith female
## 889         889        0      3 Johnston, Miss. Catherine Helen "Carrie" female
## 890         890        1      1                    Behr, Mr. Karl Howell   male
## 891         891        0      3                      Dooley, Mr. Patrick   male
##     Age SibSp Parch     Ticket   Fare Cabin Embarked
## 886  39     0     5     382652 29.125              Q
## 887  27     0     0     211536 13.000              S
## 888  19     0     0     112053 30.000   B42        S
## 889  NA     1     2 W./C. 6607 23.450              S
## 890  26     0     0     111369 30.000  C148        C
## 891  32     0     0     370376  7.750              Q

Assess the variable types with summary(), str() and psych::describe().

Examine a summary of the dataframe with summary() and str().

summary(df)
##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                                  
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked
##             :687    :  2   
##  B96 B98    :  4   C:168   
##  C23 C25 C27:  4   Q: 77   
##  G6         :  4   S:644   
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :186
str(df)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Assess the variable types with psych().

psych::describe(df)
##             vars   n   mean     sd median trimmed    mad  min    max  range
## PassengerId    1 891 446.00 257.35 446.00  446.00 330.62 1.00 891.00 890.00
## Survived       2 891   0.38   0.49   0.00    0.35   0.00 0.00   1.00   1.00
## Pclass         3 891   2.31   0.84   3.00    2.39   0.00 1.00   3.00   2.00
## Name*          4 891 446.00 257.35 446.00  446.00 330.62 1.00 891.00 890.00
## Sex*           5 891   1.65   0.48   2.00    1.68   0.00 1.00   2.00   1.00
## Age            6 714  29.70  14.53  28.00   29.27  13.34 0.42  80.00  79.58
## SibSp          7 891   0.52   1.10   0.00    0.27   0.00 0.00   8.00   8.00
## Parch          8 891   0.38   0.81   0.00    0.18   0.00 0.00   6.00   6.00
## Ticket*        9 891 339.52 200.83 338.00  339.65 268.35 1.00 681.00 680.00
## Fare          10 891  32.20  49.69  14.45   21.38  10.24 0.00 512.33 512.33
## Cabin*        11 891  18.63  38.14   1.00    8.29   0.00 1.00 148.00 147.00
## Embarked*     12 891   3.53   0.80   4.00    3.66   0.00 1.00   4.00   3.00
##              skew kurtosis   se
## PassengerId  0.00    -1.20 8.62
## Survived     0.48    -1.77 0.02
## Pclass      -0.63    -1.28 0.03
## Name*        0.00    -1.20 8.62
## Sex*        -0.62    -1.62 0.02
## Age          0.39     0.16 0.54
## SibSp        3.68    17.73 0.04
## Parch        2.74     9.69 0.03
## Ticket*      0.00    -1.28 6.73
## Fare         4.77    33.12 1.66
## Cabin*       2.09     3.07 1.28
## Embarked*   -1.27    -0.16 0.03

Our assessment initially shows 891 row observations with 12 column variables. Of the variables, 7 are presently numeric types (int, decimal) and 5 are categorical factor variables (strings treated as factors upon loading).

Additionally, 2 of the numeric variables (Survived, Pclass) should be converted to categorical factor variables for improved interpretation and insight. Better human-readability can achieved by labeling as:
- Survived (YES=1, NO=1; instead of merely 1, 0)
- Pclass ( Upper, Middle, Lower; instead of 1,2,3)

Missing values seem to be indicated by blanks or NAs. The variables: Name, Cabin and Embarked are factor variables that have missing values. The variable: Age is a numeric that has 177 NAs. Missing values are more detectable in the form of NAs rather than blanks. So let’s transform blanks to NAs.


Assess and start the filter process by making missing values more visible.

Handling both blanks and NAs is not simple so first eliminate some of those. Eliminate the blanks and change them to NAs.

 # Loop through columns
   for (i in 1:length(df)) {
     # Loop through rows
     for (j in 1:nrow(df)) {
       # Check for empty strings or NA values
       if (df[j, i] == "" | is.na(df[j, i])) {
         # Replace with actual NA value (not a string "NA")
         df[j, i] <- NA
       }
     }
   }

na_counts <- colSums(is.na(df))
print(na_counts)
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2

Here we can visually see the missing values in form of NAs. All missing values reside in 3 three columns.
- Cabin
- Embarked
- Age

missmap(df)

Calculate NA percentages and assess statistical relevance before moving on to filtering.

  # Count rows with NAs
  rows_with_nas <- sum(rowSums(is.na(df)) > 0)
  Percent_row_NA <- percent(rows_with_nas/nrow(df))
  cat("Rows with NAs:",rows_with_nas,"Percent of rows w/ NA:",Percent_row_NA,"  ")
## Rows with NAs: 708 Percent of rows w/ NA: 79%
  # The proportion of the rows with NAs is large, but don't want to drop more that 5%.
  # Would impact the integrity of the dataset.
  
  # Count columns with NAs
  cols_with_nas <- sum(colSums(is.na(df)) > 0)
  cabin_nas <- sum(is.na(df$Cabin))
  percent_cabin_nas <- percent((cabin_nas / nrow(df)), accuracy = 0.01)
  embark_nas <- sum(is.na(df$Embarked))
  percent_embark_nas <- percent((embark_nas / nrow(df)), accuracy = 0.01)
  age_nas <- sum(is.na(df$Age))
  percent_age_nas <- percent((age_nas / nrow(df)), accuracy = 0.01)
                             
  cat("Columns w/ NAs:",cols_with_nas,"  ")
## Columns w/ NAs: 3
  cat("Col percent w/ NAs for Cabin:",percent_cabin_nas,"  ")  # likely remove column
## Col percent w/ NAs for Cabin: 77.10%
  cat("Col percent w/ NAs for Embarked:",percent_embark_nas,"  ") # impute with mode
## Col percent w/ NAs for Embarked: 0.22%
  cat("Col percent w/ NAs for Age:",percent_age_nas,"  ")  # impute with mean
## Col percent w/ NAs for Age: 19.87%

3. FILTER by handling missing values, re-classifying types and start the cleaning process by removing statistically irrelevant variables.

The data set’s first two pertinent variables are represented as integers (e.g., Survived, Pclass). They should be converted to factor variables for later human-readable insights.

Convert Survived from numerics of 0,1 to factor labels of “NO”, “YES”.

str(df$Survived)
##  int [1:891] 0 1 1 1 0 0 0 0 1 1 ...
df$Survived <- cut(df$Survived, breaks=c(-1,0,1), labels=c("NO","YES"))
df$Survived <- as.factor(df$Survived)
str(df$Survived)
##  Factor w/ 2 levels "NO","YES": 1 2 2 2 1 1 1 1 2 2 ...

Convert Pclass from numerics of 1,2,3 to factor labels of “Upper”, “Middle”, “Lower”.

str(df$Pclass)
##  int [1:891] 3 1 3 1 3 3 1 3 3 2 ...
df$Pclass <- cut(df$Pclass, breaks=c(0,1,2,3), labels=c("Upper","Middle","Lower"))
df$Pclass <- as.factor(df$Pclass)
str(df$Pclass)
##  Factor w/ 3 levels "Upper","Middle",..: 3 1 3 1 3 3 1 3 3 2 ...

4. CLEAN variables that require no calculation such as Name and Cabin .

For this data set, at this time, the passenger’s Name and Cabin don’t seem to have statistical significance have many missing values. So we can remove those columns.

df <- select(df, -Name)
df <- select(df, -Cabin)

Cleanup the NAs. Use mean and mode imputation variables Age and Embarked.

The Age variable is numeric and can have a fractional value and the missing value percentage is nearly 20. So I elected to use mean imputation and retain all rows. The Embarked variable is categorical factor with only 2 missing values where the mode is clearly Southhampton. So I elected to use mode imputation for those two missing values.

  for (col in names(df)) {
    if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
      if (sum(!is.na(df[[col]])) > 10) {
        # If more than 10 non-NA values, use mean
        df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
      } else {
        # Otherwise, use linear interpolation for imputation
        df[[col]][is.na(df[[col]])] <- approx(seq_along(df[[col]]), df[[col]], n = length(df[[col]]))[["y"]][is.na(df[[col]])]
      }
    } else if (is.factor(df[[col]])) {
      mode_val <- names(sort(-table(df[[col]])))[1]
      df[[col]][is.na(df[[col]])] <- mode_val
    } else if (is.character(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- "NA"
    }
  }

Performed imputation techniques for variables. Particularly notice variable Cabin is gone and Embarked which previously had NAs that are now mean imputed and Embarked previously had 4 levels, but only 3 actual readable values.

df$Survived <- as.factor(as.character(df$Survived))
df$Pclass <- as.factor(as.character(df$Pclass))
##df$Name <- as.factor(as.character(df$Name))
df$Sex <- as.factor(as.character(df$Sex))
df$Ticket <- as.factor(as.character(df$Ticket))
##df$Cabin <- as.factor(as.character(df$Cabin))
df$Embarked <- as.factor(as.character(df$Embarked))

na_counts_base <- colSums(is.na(df))
print(na_counts_base)
## PassengerId    Survived      Pclass         Sex         Age       SibSp 
##           0           0           0           0           0           0 
##       Parch      Ticket        Fare    Embarked 
##           0           0           0           0

5. PRESENT the clean data with a final look at the structure and summary.


The below illustration now shows 10 variables that all have values and some statistical relevance and the original count of rows/objects is still in tact.

summary(df)
##   PassengerId    Survived     Pclass        Sex           Age       
##  Min.   :  1.0   NO :549   Lower :491   female:314   Min.   : 0.42  
##  1st Qu.:223.5   YES:342   Middle:184   male  :577   1st Qu.:22.00  
##  Median :446.0             Upper :216                Median :29.70  
##  Mean   :446.0                                       Mean   :29.70  
##  3rd Qu.:668.5                                       3rd Qu.:35.00  
##  Max.   :891.0                                       Max.   :80.00  
##                                                                     
##      SibSp           Parch             Ticket         Fare        Embarked
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00   C:168   
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91   Q: 77   
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45   S:646   
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20           
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00           
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33           
##                                   (Other) :852
str(df)
## 'data.frame':    891 obs. of  10 variables:
##  $ PassengerId: num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "NO","YES": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "Lower","Middle",..: 1 3 1 3 1 1 3 1 1 2 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 ...
##  $ SibSp      : num  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : num  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
colSums(is.na(df))
## PassengerId    Survived      Pclass         Sex         Age       SibSp 
##           0           0           0           0           0           0 
##       Parch      Ticket        Fare    Embarked 
##           0           0           0           0
missmap(df)


Try a single (univariate) variable plot

Examining the next steps of the EDA algorithm here, so attempting a plot of a single variable.


Plot of a single categorical/factor variable sex (male,female) .
Shows that the number of male passengers is nearly double the number of females.


Plot of a single categorical/factor variable Embarked (C,Q,S).
Shows that the embarkment port for nearly all passengers was Southhampton.