Set the working directory and import the .csv file containing the data set (train.csv). Account for the header row, specify a comma as the separator/delimiter, and treat strings as factor (qualitative) variables from the outset.
setwd("/Users/whinton/src/rstudio/tim8501")
train <- read.csv("train.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
Check the number of objects (rows) with nrow(), and the number of columns with length() or ncol().
nrow(train)
## [1] 891
cat("length():",length(train), ", ncol():",ncol(train))
## length(): 12 , ncol(): 12
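As a small aside, dim() reports both counts in one call. The sketch below uses a hypothetical three-row data frame named toy (not part of the Titanic data) so that it runs on its own; with the real data the call would simply be dim(train).

```r
# Hypothetical toy data frame standing in for train (illustration only)
toy <- data.frame(id = 1:3, sex = c("male", "female", "male"))

# dim() returns c(rows, columns) in one call
dims <- dim(toy)
print(dims)

# length() on a data frame counts its columns, so it agrees with ncol()
same_cols <- length(toy) == ncol(toy)
print(same_cols)
```

length() and ncol() agree because a data frame is stored as a list of columns.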
Take a peek at the first few rows with head().
head(train)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q
Take a peek at the last few rows with tail().
tail(train)
##     PassengerId Survived Pclass                                     Name    Sex
## 886         886        0      3     Rice, Mrs. William (Margaret Norton) female
## 887         887        0      2                    Montvila, Rev. Juozas   male
## 888         888        1      1             Graham, Miss. Margaret Edith female
## 889         889        0      3 Johnston, Miss. Catherine Helen "Carrie" female
## 890         890        1      1                    Behr, Mr. Karl Howell   male
## 891         891        0      3                      Dooley, Mr. Patrick   male
##     Age SibSp Parch     Ticket   Fare Cabin Embarked
## 886  39     0     5     382652 29.125              Q
## 887  27     0     0     211536 13.000              S
## 888  19     0     0     112053 30.000   B42        S
## 889  NA     1     2 W./C. 6607 23.450              S
## 890  26     0     0     111369 30.000  C148        C
## 891  32     0     0     370376  7.750              Q
Examine a summary of the data frame with summary(), then its structure with str().
summary(train)
## PassengerId Survived Pclass
## Min. : 1.0 Min. :0.0000 Min. :1.000
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000
## Median :446.0 Median :0.0000 Median :3.000
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885 NA's :177
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## :687 : 2
## B96 B98 : 4 C:168
## C23 C25 C27: 4 Q: 77
## G6 : 4 S:644
## C22 C26 : 3
## D : 3
## (Other) :186
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The data set’s first two pertinent variables (Survived and Pclass) are stored as integers. They should be converted to factor variables so that later output is human-readable.
Convert Survived from the numeric codes 0, 1 to the factor labels “NO”, “YES”.
str(train$Survived)
## int [1:891] 0 1 1 1 0 0 0 0 1 1 ...
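# cut() bins 0 into (-1,0] and 1 into (0,1], then labels the bins "NO" and "YES"
train$Survived <- cut(train$Survived, breaks=c(-1,0,1), labels=c("NO","YES"))
# cut() already returns a factor, so as.factor() leaves it unchanged
train$Survived <- as.factor(train$Survived)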
str(train$Survived)
## Factor w/ 2 levels "NO","YES": 1 2 2 2 1 1 1 1 2 2 ...
Convert Pclass from the numeric codes 1, 2, 3 to the factor labels “Upper”, “Middle”, “Lower”.
str(train$Pclass)
## int [1:891] 3 1 3 1 3 3 1 3 3 2 ...
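# cut() bins 1, 2, 3 into (0,1], (1,2], (2,3] and labels the bins in that order
train$Pclass <- cut(train$Pclass, breaks=c(0,1,2,3), labels=c("Upper","Middle","Lower"))
# cut() already returns a factor, so as.factor() leaves it unchanged
train$Pclass <- as.factor(train$Pclass)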
str(train$Pclass)
## Factor w/ 3 levels "Upper","Middle",..: 3 1 3 1 3 3 1 3 3 2 ...
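The two conversions above rely on cut(), which is designed for binning continuous values. For integer codes, a more direct route may be factor() with explicit levels and labels. The sketch below is an alternative, not the method used in this report; survived_codes and pclass_codes are hypothetical toy vectors that mimic the integer columns.

```r
# Hypothetical toy vectors mimicking the integer codes in train (illustration only)
survived_codes <- c(0L, 1L, 1L, 0L)
pclass_codes   <- c(3L, 1L, 2L, 3L)

# factor() maps each code in 'levels' to the label in the same position
survived <- factor(survived_codes, levels = c(0, 1), labels = c("NO", "YES"))

# ordered = TRUE additionally records the level order given in 'labels'
pclass <- factor(pclass_codes, levels = c(1, 2, 3),
                 labels = c("Upper", "Middle", "Lower"), ordered = TRUE)

str(survived)
str(pclass)
```

Whether Pclass should be an ordered factor is a modelling choice; the ordered = TRUE argument is included here only to show that the option exists.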
For this particular step in the algorithm, we’ll just record the number of NAs and blanks, then drop the blank factor levels (e.g., in Embarked). As a follow-up, such values can be handled with extractions or some imputation method.
Blanks and NAs are two different representations of a missing value, which makes them awkward to count together. So first convert the blanks to NAs, leaving a single kind of missing value.
df <- train ## take a copy of the original data frame to illustrate this step in the algorithm
# Loop through columns
for (i in seq_along(df)) {
  # Loop through rows
  for (j in seq_len(nrow(df))) {
    # Check for NA first so the string comparison never sees a missing value
    if (is.na(df[j, i]) || df[j, i] == "") {
      # Replace with an actual NA value (not the string "NA")
      df[j, i] <- NA
    }
  }
}
na_counts <- colSums(is.na(df))
print(na_counts)
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
str(df$Embarked)
## Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
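The cell-by-cell loop above works, but the same result can usually be had at import time. The sketch below is an alternative under stated assumptions: it uses the na.strings argument of read.csv() to read blank fields as NA, and csv_text is a hypothetical three-row stand-in for train.csv so that the example is self-contained.

```r
# Hypothetical three-row CSV standing in for train.csv (illustration only)
csv_text <- "Id,Cabin,Embarked
1,,S
2,C85,C
3,,"

# na.strings lists the raw field values that should be read as NA
toy <- read.csv(text = csv_text, header = TRUE, sep = ",",
                stringsAsFactors = TRUE, na.strings = c("", "NA"))

# Count missing values per column, as done for df above
na_counts_toy <- colSums(is.na(toy))
print(na_counts_toy)

# The blank never becomes a factor level in the first place
print(levels(toy$Embarked))
```

Because the blanks are converted before the factors are built, no empty level has to be dropped afterwards.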
Drop the unused blank factor level from the categorical variables. Note in particular Cabin and Embarked: their blank ("") level is now removed, while their NA counts are unchanged (687 and 2). Embarked previously had 4 levels but only 3 actual readable values (C, Q, S), and now has exactly those 3 levels.
df$Survived <- as.factor(as.character(df$Survived))
df$Pclass <- as.factor(as.character(df$Pclass))
df$Name <- as.factor(as.character(df$Name))
df$Sex <- as.factor(as.character(df$Sex))
df$Ticket <- as.factor(as.character(df$Ticket))
df$Cabin <- as.factor(as.character(df$Cabin))
df$Embarked <- as.factor(as.character(df$Embarked))
na_counts_base <- colSums(is.na(df))
print(na_counts_base)
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
str(df$Embarked)
## Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
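One side effect of the as.factor(as.character(...)) round trip is that levels are rebuilt in alphabetical order, which is why Pclass is listed as Lower, Middle, Upper in the later str(df) and summary(df) output. A possible alternative, sketched here on hypothetical toy factors, is droplevels(), which removes unused levels and keeps the remaining ones in their existing order.

```r
# Hypothetical toy factors (illustration only)
embarked <- factor(c("S", "C", NA, "Q"), levels = c("", "C", "Q", "S"))
pclass   <- factor(c("Lower", "Upper", "Middle"),
                   levels = c("Upper", "Middle", "Lower"))

# droplevels() removes levels that no value uses and keeps the remaining order
embarked_clean <- droplevels(embarked)
pclass_clean   <- droplevels(pclass)
print(levels(embarked_clean))
print(levels(pclass_clean))

# The character round trip re-sorts the levels alphabetically
pclass_roundtrip <- as.factor(as.character(pclass))
print(levels(pclass_roundtrip))
```

droplevels() also accepts a whole data frame, in which case it is applied to every factor column.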
For this data set, at this stage, the passenger’s Cabin variable does not appear useful for analysis because most of its values are missing (687 of 891). So we can consider removing that column. If a variable like Age becomes relevant and we want to deal with its missing/NA values, they can be handled by removing rows or columns, or by some sort of imputation method.
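To make the two options concrete, here is a minimal sketch on a hypothetical toy data frame (the values are made up for illustration). It drops a mostly missing column and fills missing ages with the median of the observed ages. Median imputation is only one simple choice and is not something this report commits to.

```r
# Hypothetical toy data frame with missing Age and Cabin values (illustration only)
toy <- data.frame(Age   = c(22, NA, 26, 35),
                  Cabin = c(NA, "C85", NA, NA),
                  Fare  = c(7.25, 71.28, 7.92, 53.10))

# Option 1: drop a column that is mostly missing
toy$Cabin <- NULL

# Option 2: fill missing ages with the median of the observed ages
age_median <- median(toy$Age, na.rm = TRUE)
toy$Age[is.na(toy$Age)] <- age_median

print(toy)
```

Removing rows instead would be na.omit(toy) or a filter on complete.cases(toy), at the cost of discarding the other values in those rows.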
str(train$Cabin)
## Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
summary(df)
## PassengerId Survived Pclass
## Min. : 1.0 NO :549 Lower :491
## 1st Qu.:223.5 YES:342 Middle:184
## Median :446.0 Upper :216
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885 NA's :177
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## B96 B98 : 4 C :168
## C23 C25 C27: 4 Q : 77
## G6 : 4 S :644
## C22 C26 : 3 NA's: 2
## D : 3
## (Other) :186
## NA's :687
str(df)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "NO","YES": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "Lower","Middle",..: 1 3 1 3 1 1 3 1 1 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
This section examines the first 6 steps of the EDA algorithm; as a learning exercise, I also attempt some plotting.
Plot of the Titanic data set categorical/factor variable Sex (male, female). It shows that the number of male passengers (577) is nearly double the number of female passengers (314).
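The plotting code itself is not shown in this report. A bar chart of this kind could be produced with base R's barplot(), as in the sketch below. It hard-codes the counts from the summary(train) output so that it runs on its own; with the data loaded, barplot(table(train$Sex)) would be the likely equivalent.

```r
# Counts copied from the summary(train) output earlier in this report
sex_counts <- c(female = 314, male = 577)

# Base R bar chart with one bar per category
barplot(sex_counts, main = "Passengers by Sex", ylab = "Count")

# Ratio of male to female passengers, to check the "nearly double" reading
male_to_female <- sex_counts[["male"]] / sex_counts[["female"]]
print(round(male_to_female, 2))
```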
Plot of the Titanic data set categorical/factor variable Embarked (C, Q, S). It shows that the port of embarkation for most passengers (644 of the 889 with a recorded port, about 72%) was Southampton.
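The share of each port can be checked with prop.table(). The sketch below again hard-codes the counts from the summary(df) output rather than reading the data; the two passengers with no recorded port are left out, which is an assumption about how the plot treats them.

```r
# Counts copied from the summary(df) output (2 passengers have no port recorded)
embarked_counts <- c(C = 168, Q = 77, S = 644)

# Share of each port among passengers with a known port
embarked_share <- prop.table(embarked_counts)
print(round(embarked_share, 3))

# Base R bar chart with one bar per port
barplot(embarked_counts, main = "Passengers by Port of Embarkation", ylab = "Count")
```

With the data loaded, prop.table(table(df$Embarked)) would give the same shares, since table() excludes NA values by default.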