Set the path and import the .csv file containing the data set (titanic.csv). Account for a header row, specify the separator/delimiter as comma, and treat strings as factor (qualitative) variables at the outset.
setwd("/Users/whinton/src/rstudio/tim8501")
titanic <- read.csv("titanic.csv", header = TRUE, sep= ",",stringsAsFactors = TRUE)
Check the number of objects as rows using nrow(), and the number of columns as lenth() or ncol().
df <- titanic ## make copy of original dataset to data frame df
cat("Number of rows:", nrow(df),"")
## Number of rows: 891
cat("Number of cols:",length(df), "")
## Number of cols: 12
Take a peek at first few rows with head().
head(df)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
Take a peek at last few rows with tail().
tail(df)
## PassengerId Survived Pclass Name Sex
## 886 886 0 3 Rice, Mrs. William (Margaret Norton) female
## 887 887 0 2 Montvila, Rev. Juozas male
## 888 888 1 1 Graham, Miss. Margaret Edith female
## 889 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female
## 890 890 1 1 Behr, Mr. Karl Howell male
## 891 891 0 3 Dooley, Mr. Patrick male
## Age SibSp Parch Ticket Fare Cabin Embarked
## 886 39 0 5 382652 29.125 Q
## 887 27 0 0 211536 13.000 S
## 888 19 0 0 112053 30.000 B42 S
## 889 NA 1 2 W./C. 6607 23.450 S
## 890 26 0 0 111369 30.000 C148 C
## 891 32 0 0 370376 7.750 Q
Examine a summary of the dataframe with summary() and str().
summary(df)
## PassengerId Survived Pclass
## Min. : 1.0 Min. :0.0000 Min. :1.000
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000
## Median :446.0 Median :0.0000 Median :3.000
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885 NA's :177
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## :687 : 2
## B96 B98 : 4 C:168
## C23 C25 C27: 4 Q: 77
## G6 : 4 S:644
## C22 C26 : 3
## D : 3
## (Other) :186
str(df)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
psych::describe(df)
## vars n mean sd median trimmed mad min max range
## PassengerId 1 891 446.00 257.35 446.00 446.00 330.62 1.00 891.00 890.00
## Survived 2 891 0.38 0.49 0.00 0.35 0.00 0.00 1.00 1.00
## Pclass 3 891 2.31 0.84 3.00 2.39 0.00 1.00 3.00 2.00
## Name* 4 891 446.00 257.35 446.00 446.00 330.62 1.00 891.00 890.00
## Sex* 5 891 1.65 0.48 2.00 1.68 0.00 1.00 2.00 1.00
## Age 6 714 29.70 14.53 28.00 29.27 13.34 0.42 80.00 79.58
## SibSp 7 891 0.52 1.10 0.00 0.27 0.00 0.00 8.00 8.00
## Parch 8 891 0.38 0.81 0.00 0.18 0.00 0.00 6.00 6.00
## Ticket* 9 891 339.52 200.83 338.00 339.65 268.35 1.00 681.00 680.00
## Fare 10 891 32.20 49.69 14.45 21.38 10.24 0.00 512.33 512.33
## Cabin* 11 891 18.63 38.14 1.00 8.29 0.00 1.00 148.00 147.00
## Embarked* 12 891 3.53 0.80 4.00 3.66 0.00 1.00 4.00 3.00
## skew kurtosis se
## PassengerId 0.00 -1.20 8.62
## Survived 0.48 -1.77 0.02
## Pclass -0.63 -1.28 0.03
## Name* 0.00 -1.20 8.62
## Sex* -0.62 -1.62 0.02
## Age 0.39 0.16 0.54
## SibSp 3.68 17.73 0.04
## Parch 2.74 9.69 0.03
## Ticket* 0.00 -1.28 6.73
## Fare 4.77 33.12 1.66
## Cabin* 2.09 3.07 1.28
## Embarked* -1.27 -0.16 0.03
Our assessment initially shows 891 row observations with 12 column variables. Of the variables, 7 are presently numeric types (int, decimal) and 5 are categorical factor variables (strings treated as factors upon loading).
Additionally, 2 of the numeric variables (Survived, Pclass) should be
converted to categorical factor variables for improved interpretation
and insight. Better human-readability can achieved by labeling as:
- Survived (YES=1, NO=1; instead of merely 1, 0)
- Pclass ( Upper, Middle, Lower; instead of 1,2,3)
Missing values seem to be indicated by blanks or NAs. The variables: Name, Cabin and Embarked are factor variables that have missing values. The variable: Age is a numeric that has 177 NAs. Missing values are more detectable in the form of NAs rather than blanks. So let’s transform blanks to NAs.
Handling both blanks and NAs is not simple so first eliminate some of those. Eliminate the blanks and change them to NAs.
# Loop through columns
for (i in 1:length(df)) {
# Loop through rows
for (j in 1:nrow(df)) {
# Check for empty strings or NA values
if (df[j, i] == "" | is.na(df[j, i])) {
# Replace with actual NA value (not a string "NA")
df[j, i] <- NA
}
}
}
na_counts <- colSums(is.na(df))
print(na_counts)
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
Here we can visually see the missing values in form of NAs. All
missing values reside in 3 three columns.
- Cabin
- Embarked
- Age
missmap(df)
# Count rows with NAs
rows_with_nas <- sum(rowSums(is.na(df)) > 0)
Percent_row_NA <- percent(rows_with_nas/nrow(df))
cat("Rows with NAs:",rows_with_nas,"Percent of rows w/ NA:",Percent_row_NA," ")
## Rows with NAs: 708 Percent of rows w/ NA: 79%
# The proportion of the rows with NAs is large, but don't want to drop more that 5%.
# Would impact the integrity of the dataset.
# Count columns with NAs
cols_with_nas <- sum(colSums(is.na(df)) > 0)
cabin_nas <- sum(is.na(df$Cabin))
percent_cabin_nas <- percent((cabin_nas / nrow(df)), accuracy = 0.01)
embark_nas <- sum(is.na(df$Embarked))
percent_embark_nas <- percent((embark_nas / nrow(df)), accuracy = 0.01)
age_nas <- sum(is.na(df$Age))
percent_age_nas <- percent((age_nas / nrow(df)), accuracy = 0.01)
cat("Columns w/ NAs:",cols_with_nas," ")
## Columns w/ NAs: 3
cat("Col percent w/ NAs for Cabin:",percent_cabin_nas," ") # likely remove column
## Col percent w/ NAs for Cabin: 77.10%
cat("Col percent w/ NAs for Embarked:",percent_embark_nas," ") # impute with mode
## Col percent w/ NAs for Embarked: 0.22%
cat("Col percent w/ NAs for Age:",percent_age_nas," ") # impute with mean
## Col percent w/ NAs for Age: 19.87%
The data set’s first two pertinent variables are represented as integers (e.g., Survived, Pclass). They should be converted to factor variables for later human-readable insights.
Convert Survived from numerics of 0,1 to factor labels of “NO”, “YES”.
str(df$Survived)
## int [1:891] 0 1 1 1 0 0 0 0 1 1 ...
df$Survived <- cut(df$Survived, breaks=c(-1,0,1), labels=c("NO","YES"))
df$Survived <- as.factor(df$Survived)
str(df$Survived)
## Factor w/ 2 levels "NO","YES": 1 2 2 2 1 1 1 1 2 2 ...
Convert Pclass from numerics of 1,2,3 to factor labels of “Upper”, “Middle”, “Lower”.
str(df$Pclass)
## int [1:891] 3 1 3 1 3 3 1 3 3 2 ...
df$Pclass <- cut(df$Pclass, breaks=c(0,1,2,3), labels=c("Upper","Middle","Lower"))
df$Pclass <- as.factor(df$Pclass)
str(df$Pclass)
## Factor w/ 3 levels "Upper","Middle",..: 3 1 3 1 3 3 1 3 3 2 ...
For this data set, at this time, the passenger’s Name and Cabin don’t seem to have statistical significance have many missing values. So we can remove those columns.
df <- select(df, -Name)
df <- select(df, -Cabin)
The Age variable is numeric and can have a fractional value and the missing value percentage is nearly 20. So I elected to use mean imputation and retain all rows. The Embarked variable is categorical factor with only 2 missing values where the mode is clearly Southhampton. So I elected to use mode imputation for those two missing values.
for (col in names(df)) {
if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
if (sum(!is.na(df[[col]])) > 10) {
# If more than 10 non-NA values, use mean
df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
} else {
# Otherwise, use linear interpolation for imputation
df[[col]][is.na(df[[col]])] <- approx(seq_along(df[[col]]), df[[col]], n = length(df[[col]]))[["y"]][is.na(df[[col]])]
}
} else if (is.factor(df[[col]])) {
mode_val <- names(sort(-table(df[[col]])))[1]
df[[col]][is.na(df[[col]])] <- mode_val
} else if (is.character(df[[col]])) {
df[[col]][is.na(df[[col]])] <- "NA"
}
}
Performed imputation techniques for variables. Particularly notice variable Cabin is gone and Embarked which previously had NAs that are now mean imputed and Embarked previously had 4 levels, but only 3 actual readable values.
df$Survived <- as.factor(as.character(df$Survived))
df$Pclass <- as.factor(as.character(df$Pclass))
##df$Name <- as.factor(as.character(df$Name))
df$Sex <- as.factor(as.character(df$Sex))
df$Ticket <- as.factor(as.character(df$Ticket))
##df$Cabin <- as.factor(as.character(df$Cabin))
df$Embarked <- as.factor(as.character(df$Embarked))
na_counts_base <- colSums(is.na(df))
print(na_counts_base)
## PassengerId Survived Pclass Sex Age SibSp
## 0 0 0 0 0 0
## Parch Ticket Fare Embarked
## 0 0 0 0
The below illustration now shows 10 variables that all have values and some statistical relevance and the original count of rows/objects is still in tact.
summary(df)
## PassengerId Survived Pclass Sex Age
## Min. : 1.0 NO :549 Lower :491 female:314 Min. : 0.42
## 1st Qu.:223.5 YES:342 Middle:184 male :577 1st Qu.:22.00
## Median :446.0 Upper :216 Median :29.70
## Mean :446.0 Mean :29.70
## 3rd Qu.:668.5 3rd Qu.:35.00
## Max. :891.0 Max. :80.00
##
## SibSp Parch Ticket Fare Embarked
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00 C:168
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91 Q: 77
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45 S:646
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
str(df)
## 'data.frame': 891 obs. of 10 variables:
## $ PassengerId: num 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "NO","YES": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "Lower","Middle",..: 1 3 1 3 1 1 3 1 1 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 ...
## $ SibSp : num 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : num 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
colSums(is.na(df))
## PassengerId Survived Pclass Sex Age SibSp
## 0 0 0 0 0 0
## Parch Ticket Fare Embarked
## 0 0 0 0
missmap(df)
Examining the next steps of the EDA algorithm here, so attempting a plot of a single variable.
Plot of a single categorical/factor variable sex
(male,female) .
Shows that the number of male passengers is nearly double the number of
females.
Plot of a single categorical/factor variable Embarked
(C,Q,S).
Shows that the embarkment port for nearly all passengers was
Southhampton.