Intro

I enjoy machine learning very much, and one place I spend a lot of time these days is Kaggle. If I had to sum up why in one sentence, it would be this: it’s a great way to see some of the best data mining, machine learning, and most beautiful R code in the world. Winners of major competitions ($10,000, $50,000, and more) frequently post their winning solutions on the site for the community to learn from. To me this represents a gold mine of creative genius in the world of machine learning, and so I spend hours each day simply reading code from other brilliant people. In that vein, I will say that these resources inspired me to think in very new ways about applications of machine learning using Decision Trees & Random Forests.

Data

The data for this project comes from Kaggle’s Titanic dataset. The goal is to train a model on the given data such that it can predict whether each passenger in the test set survived. The features present are: PassengerId, Survived (training set only), Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.

Processing

# Load the necessary packages
library(dplyr)
library(randomForest)
library(rpart)
library(ggplot2)
library(rattle)
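
One housekeeping note: the chunks below refer to train and test objects, and the missing-value output further down implies the test set already carried a placeholder Survived column before the two sets were combined. A minimal sketch of that loading step, assuming the standard Kaggle file names and a placeholder value of 0:

# Read in the Kaggle Titanic data (assumes train.csv and test.csv sit in the
# working directory; stringsAsFactors keeps character columns as factors)
train <- read.csv("train.csv", stringsAsFactors = TRUE)
test <- read.csv("test.csv", stringsAsFactors = TRUE)

# Placeholder labels so the columns line up when we rbind() later; the real
# test labels are what we're trying to predict
test$Survived <- 0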

My first steps for examining a large dataset are as follows: 1) check for missing data; 2) after finding the missing values, combine the test and training sets into one large dataset; and 3) impute the missing values using various methods.*

*If the dataset were large enough, I might instead drop the observations with incomplete variable sets. In my work you will likely see both approaches.

# Check for missing data
check.missing <- function(x) return(length(which(is.na(x))))
data.frame(sapply(train,check.missing))
data.frame(sapply(test,check.missing))

# Check blanks
check.blank <- function(x) return(length(which(x == "")))
data.frame(sapply(train, check.blank))
data.frame(sapply(test, check.blank)) 


# Combine the test and train sets.  rbind() matches columns by name, so this
# relies on test carrying the placeholder Survived column added above
combined <- rbind(train, test)
##             sapply.train..check.missing.
## PassengerId                            0
## Survived                               0
## Pclass                                 0
## Name                                   0
## Sex                                    0
## Age                                  177
## SibSp                                  0
## Parch                                  0
## Ticket                                 0
## Fare                                   0
## Cabin                                  0
## Embarked                               0
##             sapply.test..check.missing.
## PassengerId                           0
## Pclass                                0
## Name                                  0
## Sex                                   0
## Age                                  86
## SibSp                                 0
## Parch                                 0
## Ticket                                0
## Fare                                  1
## Cabin                                 0
## Embarked                              0
## Survived                              0
##             sapply.train..check.blank.
## PassengerId                          0
## Survived                             0
## Pclass                               0
## Name                                 0
## Sex                                  0
## Age                                  0
## SibSp                                0
## Parch                                0
## Ticket                               0
## Fare                                 0
## Cabin                              687
## Embarked                             2
##             sapply.test..check.blank.
## PassengerId                         0
## Pclass                              0
## Name                                0
## Sex                                 0
## Age                                 0
## SibSp                               0
## Parch                               0
## Ticket                              0
## Fare                                0
## Cabin                             327
## Embarked                            0
## Survived                            0

As you can see from the output of these functions, there are quite a few missing values in the dataset. For the variables with only a handful of missing entries I’ll use simple imputation: the mode for the missing Embarked locations and the mean for the missing Fare.

# Fill the blanks in the Embarked column.  We'll use "S" because most people
# embarked at Southampton
combined$Embarked[which(combined$Embarked == "")] <- "S"

# The passenger on row 1044 has NA for Fare.  We'll set this to the mean Fare paid.
combined$Fare[which(is.na(combined$Fare))] <- mean(combined$Fare, na.rm = TRUE)
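
Hardcoding "S" works here because Southampton dominates, but for completeness the mode can be computed rather than eyeballed. A small equivalent sketch (my addition, not in the original):

# Compute the modal embarkation port directly from the table of counts
embarked_mode <- names(which.max(table(combined$Embarked[combined$Embarked != ""])))
combined$Embarked[which(combined$Embarked == "")] <- embarked_mode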

Imputing the columns with larger amounts of missing data presented a more complex problem. First, I wanted to understand the degree to which each column might be related to the column I want to impute; this would give me a sense of how to structure and test other methods. To do that I took the numeric and integer columns of the dataset and created a correlation matrix, which you can see below:

# Find all the factor columns in combined
factor_index <- sapply(combined, is.factor)

# Subset combined to the non-factor columns
cor_sub <- combined[, !factor_index]

# Return only the complete cases of cor_sub
cor_sub2 <- cor_sub[complete.cases(cor_sub),]


# Change Age column to a numeric in cor_sub2
cor_sub2$Age <- as.numeric(cor_sub2$Age)

# Find a correlation matrix of cor_sub2
cor(cor_sub2)
##             PassengerId     Survived      Pclass         Age        SibSp
## PassengerId  1.00000000 -0.324201904 -0.06409668  0.02881445 -0.050699916
## Survived    -0.32420190  1.000000000 -0.24671423 -0.06609628 -0.006633255
## Pclass      -0.06409668 -0.246714232  1.00000000 -0.40810623  0.047221482
## Age          0.02881445 -0.066096276 -0.40810623  1.00000000 -0.243698998
## SibSp       -0.05069992 -0.006633255  0.04722148 -0.24369900  1.000000000
## Parch       -0.02109583  0.078907801  0.01722391 -0.15091709  0.374456041
## Fare         0.04785385  0.168637377 -0.56506908  0.17822909  0.141195081
##                   Parch        Fare
## PassengerId -0.02109583  0.04785385
## Survived     0.07890780  0.16863738
## Pclass       0.01722391 -0.56506908
## Age         -0.15091709  0.17822909
## SibSp        0.37445604  0.14119508
## Parch        1.00000000  0.21672597
## Fare         0.21672597  1.00000000

Next I wanted to see how the Age column itself was distributed, so I plotted a histogram and summary statistics (Not Shown). This led me to my first few conclusions: 1) Age has a pronounced rightward skew: there are many more young passengers than older ones; 2) the median ages of passengers differ between genders by about 3 years, meaning we might gain some accuracy by separating the genders if we use a simple metric (like the mean or median) to impute Age.
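
For reference, that exploration looked roughly like the sketch below (my reconstruction using the ggplot2 package loaded earlier; the original plot isn't shown):

# Histogram of Age by gender in the training data
ggplot(train[!is.na(train$Age), ], aes(x = Age, fill = Sex)) +
    geom_histogram(binwidth = 5, position = "identity", alpha = 0.5)

# Summary statistics of Age, split by gender
by(train$Age, train$Sex, summary)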

That line of thinking led me to the following (since-abandoned) code:

# Calculate the mean age by gender to predict missing ages.  
# Replaced: See decision tree method below.  
#train %>% select(Age, Sex) %>% group_by(Sex) %>% summarise(mean(Age, na.rm = TRUE))


# maleAge <- combined$Sex == "male" & is.na(combined$Age) == TRUE
#   combined[maleAge, "Age"] <- 31.00
# femaleAge <- combined$Sex == "female" & is.na(combined$Age) == TRUE 
#   combined[femaleAge, "Age"] <- 28.00

As you can see from the comment on the second line, I abandoned this method in favor of using a decision tree to impute these values, on the advice of a sage data mining pro. His point of view, which I agree with: choosing the central tendency is going to work great until it doesn’t. Assuming the unknown population mirrors the known one, it will likely do well for the younger passengers. But because the right tail is so long for the older passengers (the maximum Age is 80), you’ll severely underestimate the ages of a large portion of the older ones. It would be wiser to compute these values programmatically using another method, which I implement below.

Cabin also has a significant amount of missing data, but I’ll actually take care of that in the feature engineering section that follows.

# Use a decision tree to impute the missing values in Age
ageTree <- rpart(Age ~ PassengerId + Survived + Pclass + Sex + SibSp + Parch + Fare + Embarked, 
                 data = combined[!is.na(combined$Age),], method = "anova") 
agePrediction <- predict(ageTree, combined[is.na(combined$Age),])
combined$Age[which(is.na(combined$Age))] <- agePrediction
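
A quick sanity check on the imputation (my addition, not part of the original pipeline): inspect the fitted tree and confirm that no NA values remain in Age.

# Visualize the age-imputation tree and verify Age is now complete
fancyRpartPlot(ageTree)
summary(combined$Age)
sum(is.na(combined$Age)) # should be 0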

Data Mining & Feature Engineering Using Tree Methods

With a complete dataset finally in hand, the part I find particularly interesting begins: mining the data for insights. I’m going to take a moment to explain my method, if only to show how I think about it; if this doesn’t interest you, feel free to skip ahead to the next code section.

In the end, tree methods are looking for discrete, homogeneous subsets to split on. If it makes it easier to understand my thought process, in my mind I imagine something like this (Not Shown):

Feature engineering, to me, is discovering that splitting on green is related to randomly splitting on red, which is related to white, which is what I want to find. Where am I going with this? This is the logic that leads to mining the Name of each passenger for their Title.

After naively running a decision tree on all of the training data to predict survival, we notice Pclass and Sex are very important. If we can pull more information about class and sex out of the data, we may learn more about survival. Titles (e.g., Mr., Duchess, Count, Dr.) imply both social status and sex, but in a very nuanced way. If we can extract these values and split on them, we might surface more information about class and sex, which in turn helps us predict survival.
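
That naive run looks something like the sketch below (my reconstruction, since the original isn't shown); rpart exposes the importance of each variable directly:

# Naive survival tree on the raw training variables
naiveTree <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
                   data = train, method = "class")

# Variable importance: Sex and Pclass should rank near the top
naiveTree$variable.importance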

In the same vein, I engineered a column for whether a passenger was a mother, a new Cabin-derived column that relates to class, and a family-size column. You can see this below:

# Convert names column to characters
combined$Name <- as.character(combined$Name)


# Pull the titles out of the names (the title sits between the comma and the period)
combined$Title <- sapply(combined$Name, function(x) strsplit(x, '[.,]')[[1]][2])
# Strip only the leading space so multi-word titles like "the Countess" survive
combined$Title <- sub('^ ', '', combined$Title)
combined$Title[combined$Title %in% c('Capt', 'Don', 'Major', 'Sir')] <- 'Sir'
combined$Title[combined$Title %in% c('Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'
combined$Name <- as.factor(combined$Name)
combined$Title <- as.factor(combined$Title)
# Cabin information may be useful.  Extract the first letter (the deck) from
# Cabin, code it as an integer, and treat missing cabins as their own level (0).
combined$CabinTest <- 0 
pop_first <- function(x) substring(x, 1, 1)
combined$CabinTest <- sapply(combined$Cabin, pop_first)
combined$CabinTest[which(combined$CabinTest == "")] <- NA
combined$CabinTest <- as.integer(as.factor(combined$CabinTest))
#cor.test(combined$Pclass, combined$CabinTest)
combined$CabinTest[which(is.na(combined$CabinTest))] <- 0
combined$CabinTest <- as.factor(combined$CabinTest)


# Family size column
combined$FamilySize <- 0 
combined$FamilySize <- combined$SibSp + combined$Parch


# Mother column.  Parch counts parents/children aboard, so Parch > 0 (not
# SibSp) is the right test for traveling with children
combined$Mother <- 0 # Initialize an empty new column
motherTest <- combined$Sex == "female" & combined$Parch > 0 & combined$Age >= 18
combined[motherTest, "Mother"] <- 1
combined$Mother <- as.factor(combined$Mother)
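
Before splitting the data back apart, a quick tabulation (my addition) verifies that the engineered features look sensible:

# Sanity-check the engineered features
table(combined$Title)
table(combined$CabinTest)
table(combined$Mother)
summary(combined$FamilySize)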

We combined the datasets earlier, and now we have to take a moment to pull them back apart before we run our tests.

# Split combined back into train_new and test_new
train_new <- combined[1:891, ]
test_new <- combined[892:1309, ]
test_new$Survived <- NULL # drop the placeholder labels

From here we fit the model, predict the values using our tree, and write a csv submission. It might seem like I’m now “done”, but the reality is far from it. I’ll do post-hoc diagnostics of my model, tweak its sensitivity in certain areas, fit various other machine learning methods to see if they pick up more signal, and more. I’ll save that for next time.

# Create a decision tree model using the imputed Age plus the engineered
# Mother, Title, CabinTest, and FamilySize columns
myTree1 <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked +
                     Mother + Title + CabinTest + FamilySize,
                 data = train_new, method = "class")

# View the Decision Tree
fancyRpartPlot(myTree1)

# Predict the outcome
my_prediction <- predict(myTree1, test_new, type = "class")


# Create a dataframe with my solution and PassengerId
my_solution <- data.frame(PassengerId = test_new$PassengerId, Survived = my_prediction)


# Write a csv with the solution 
write.csv(my_solution, file = "3DecisionTree.csv", row.names = FALSE)
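
As a preview of that follow-up work, here is a minimal sketch of swapping in the randomForest package loaded at the top. This is my own untuned addition, not the model behind the submission above; note that randomForest needs a factor response for classification and complete predictors, both of which we now have:

# Fit a random forest on the same engineered features (illustrative sketch)
set.seed(42)
myForest <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
                             Fare + Embarked + Mother + Title + CabinTest + FamilySize,
                         data = train_new, importance = TRUE, ntree = 500)

# Out-of-bag error and variable importance give a first read on the fit
print(myForest)
varImpPlot(myForest)

forest_prediction <- predict(myForest, test_new)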