You can see the script of the project here.

The script is based on the work of Jun Li and Barış Can Esmer.

1 Introduction

Decision trees are among the most popular supervised machine learning algorithms. Their popularity is mainly due to their interpretability and simplicity, which stem from the divide-and-conquer strategy used to build them (Wu et al. (2008)). It is worth mentioning that decision trees can be used for classification or regression problems; however, the algorithm chosen for this report (C4.5) only works with classification problems.

The C4.5 algorithm was developed by Ross Quinlan as an extension of the ID3 algorithm; therefore, the construction of the tree is similar but with some improvements. This report implements a simplified version of the C4.5 algorithm, without missing-value handling, cost differentiation or pruning.

Consider a dataset \(S=[X,Y]'\), where \(X\) are the data attributes and \(Y\) is the target, defined as follows:

  • \(X=[X_1,\ldots,X_k]'\), where \(k\) is the number of attributes and \(X_j\) is the jth attribute.
  • \(Y = [y_1,\ldots,y_n]'\), where \(n\) is the number of observations and \(y_i\) is the class of the ith observation.
  • \(y_i \in \{1,\ldots,c\}\), where \(c\) is the number of classes.

There is a function, \(f_T\) that captures the relationship between \(X\) and \(Y\) \((f_T: X ⟼Y).\)

The C4.5 algorithm searches iteratively through the hypothesis space to find the hypothesis that best represents \(f_T\), using the information gain as a guideline. It starts with a single node, where \(S\) is split into child nodes based on the attribute (\(X_j\)) that maximises the information gain. The process is then repeated recursively, using the partition of \(S\) in each child node, until a stop condition is satisfied. The terminal nodes (also called leaves) define the decision of the branch based on the majority class in their partition of \(S\).

For simplicity, \(S\) will be considered as the subset of the original dataset \(S\) inside a node. Similarly, \(X\) and \(Y\) will be the subsets of the data attributes and target inside the node, respectively.

This report presents the implementation of a decision tree construction algorithm for a multiclass classification problem, with both continuous and string type attributes using a simplified version of the C4.5 algorithm. Additionally, some hyperparameters were added to avoid overfitting.

The data used to train and evaluate the algorithm correspond to the wine and iris datasets from the scikit-learn library. It is worth mentioning that the iris dataset was modified to create a mix of continuous and string type data. Both datasets have a multiclass target variable.

2 Algorithm

The algorithm consists of four main modules: Entropy Computation, Information Gain Computation, Recursive Partitioning and Attribute Splitting. Additionally, eight supporting modules complete the implementation: Initialisation, Fit, Prediction, Data Validation, Getting Majority Class, Printing, Getting Parameters and Setting Parameters.

2.1 Initialisation

This module initialises the variables inside the class object, where some could be defined by the user. The variables that the user can define are the following:

  • min_samples_split (int): Indicates the minimum length of \(S\) required to split the data in a node. The default value is 2.
  • max_depth (int): Indicates the maximum depth of the tree, which also can be seen as the maximum number of consecutive splits. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. The default value is None.
  • seed (int): It controls the randomness of the model. When a decision needs to be made using the majority class, but the classes have an equal number of observations, the decision is random. This seed can control the output of that decision for reproducibility. The default value is 2.
  • verbose (boolean): It controls the printing outputs of the functions inside the class object. True means print output and False means no printing. The default value is False.

Additionally, the function initialises other important variables needed to store values in the future:

  • children (dictionary): It is defined as an empty dictionary. After a split, it will store the child node objects.
  • split_feat_name (string): It is defined as None. It stores the name of the feature used for the split.
  • threshold (double): It is defined as None. When a numeric feature is selected to do the split, it stores the threshold that maximises the information gain.
  • decision (string or int): It is defined as None. When a terminal node or leaf is reached, it stores the most frequent class c in Y.
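
The report does not reproduce the import block of the script; the list below is a reconstruction based on the names used in the snippets (pd, np, random, itertools, Number and is_string_dtype) and is shown only so that the snippets can be read as runnable code.

#Imports assumed by the snippets in this report (reconstructed from usage)
import random     #Tie-breaking in getMajClass and shuffling in crossvalidation
import itertools  #Expanding the hyperparameter grid in crossvalidation
from numbers import Number  #Distinguishing numeric values in predict

import numpy as np
import pandas as pd
from pandas.api.types import is_string_dtype  #Detecting string attributes
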
class TreeNode:
#Initialization of the class, with some default parameters
    def __init__(self, min_samples_split=2, max_depth=None, seed=2, 
                 verbose=False):
        # Sub nodes -- recursive, elements of the same type (TreeNode)
        self.children = {} 
        self.decision = None # Undecided
        self.split_feat_name = None # Splitting feature
        self.threshold = None #Where to split the feature
        #Minimum number of samples required to do a split
        self.min_samples_split=min_samples_split
        #Maximum depth of the tree (consecutive splits); None => no limit
        self.max_depth=max_depth
        self.seed=seed #Seed for random numbers
        self.verbose=verbose #True to print the splits

2.2 Starting the Fitting Process

The fitting process needs two arguments given by the user, the sample attributes (data) and the sample target (target), which are \(X\) and \(Y\), respectively. Firstly, the function calls validate_data to check that data and target have the correct data types. In this case, \(X\) needs to be a DataFrame object and \(Y\) a Series object, both with only numeric and string columns (categorical columns are not accepted). If everything is fine, it sets current_depth to 0 (because the tree does not have any depth yet) and calls the recursiveGenerateTree function. A sketch of the fit function is shown after the validate_data snippet below.

#This function validates the data types and the equal length of data and target
    def validate_data(self, data, target):
        #Validating the data type of the sample attributes (X)
        if not isinstance(data,(list,pd.core.series.Series,np.ndarray, 
                                pd.DataFrame)):
            return False
        #Validating the data type of the target (y)
        if not isinstance(target,(list,pd.core.series.Series,np.ndarray)):
            return False
        #Validating that X and y have the same length
        if len(data) != len(target):
            return False
        return True
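
The fit entry point itself is only described above; a minimal sketch consistent with that description (validate the inputs, then start the recursion at depth 0) could look like the following. The error handling is an assumption, since the report does not specify how invalid inputs are reported.

#Sketch of the fit entry point described above; the error handling is an 
#assumption, as the report only describes the validation and the recursion
    def fit(self, data, target):
        #Check the data types and lengths of data (X) and target (y)
        if not self.validate_data(data, target):
            raise ValueError("data must be a DataFrame and target a Series "
                             "of the same length")
        #The tree has no depth yet, so the recursion starts at depth 0
        self.recursiveGenerateTree(data, target, current_depth=0)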

2.3 Recursive Partitioning

The recursiveGenerateTree function is called for the first time by the fit function. Its objective is to make a decision in the node when possible (see Section 2.3.1) and, when no decision can be made, to split \(S\) and call recursiveGenerateTree for each partition (see Section 2.3.2).

2.3.1 Stop conditions

When \(S\) in a node fulfils a stop condition, that node is transformed into a leaf. In this custom algorithm, four stop conditions are implemented:

Unique class in the target: When \(Y\) inside a node has only one class, the node is transformed into a terminal node with decision equal to the value of the unique class in \(Y\).

#This is a recursive function, which selects the decision if possible. 
#If not, create children nodes
    def recursiveGenerateTree(self, sample_data, sample_target, current_depth):
        #If there is only one possible outcome, select that as the decision
        if len(sample_target.unique())==1:
            self.decision = sample_target.unique()[0]

Number of observations less than min_samples_split: When the length of \(S\) is less than min_samples_split, the decision is made using the majority class of \(Y\). The function getMajClass is called to determine the majority class and make the decision (see Section 2.7 for more information about the function).

#If the sample has less than min_samples_split select the majority class
        elif len(sample_target)<self.min_samples_split:
            self.decision = self.getMajClass(sample_target)

Reaching the max depth: When the depth of the tree at the node reaches max_depth, the decision is made using the majority class in \(Y\) (see Section 2.7 for more about getMajClass).

#If the depth of the current branch is equal to max_depth \
#select the majority class
        elif current_depth == self.max_depth:
            self.decision = self.getMajClass(sample_target)

Node with no observations: During the split of the sample (Section 2.3.2), the algorithm checks whether each partition of the parent sample \(S_p\) (\(S_1,\ldots,S_n\)) has observations. If a partition has no observations, its node is created with the same sample as the parent node (\(S_p\)), but simulating that max_depth has been reached. Therefore, the decision in the child node with no observations is made using the majority class of the parent sample \(S_p\). Details of the code are shown in Section 2.3.2.

2.3.2 Create children nodes

If no stop condition is satisfied, the algorithm calls splitAttribute to decide which attribute \(X_j\) and threshold \(h\) (only in the case of a continuous attribute) to use to split \(S\). The function splitAttribute returns the splitter variable, which is the same as \(X_j\) for string attributes; for continuous ones it is a transformation of \(X_j\) that indicates whether each observation \(x_{ij}\) is greater or less than the threshold. See Section 2.4 for more details about splitAttribute.

After selecting the attribute, several objects are stored in the TreeNode class. Firstly, an empty dictionary to store the child nodes is created. Then, the name of the best attribute \(X_j\) and the threshold \(h\) are stored. Finally, before creating the child nodes, current_depth is increased by one because the child nodes will sit one level deeper in the tree.

#Call the function to select best_attribute to split
            best_attribute,best_threshold,splitter = self.splitAttribute(sample_data,
                                                                         sample_target)
            self.children = {} #Initializing a dictionary with the children nodes
            self.split_feat_name = best_attribute  #Name of the feature
            self.threshold = best_threshold #Threshold for continuous variable
            current_depth += 1 #Increase the depth by 1

Finally, for each value of the splitter, a child node with the corresponding partition of \(S\) is created. There are two possible outcomes for each child node, depending on how many observations its partition has:

Length of \(S\) is greater than 0: When the partition of \(S\) has samples inside, a new TreeNode is initialised with the same hyperparameters as the current node. This TreeNode is stored in the children dictionary created in the previous step. Finally, the function recursiveGenerateTree is called on the new TreeNode, giving the current_depth and the partitions of \(X\) and \(Y\) as arguments. Therefore, a new child node will be created using the same process described in Section 2.3.

#Create a new node for each class of the best feature
            for v in splitter.unique():
                index = splitter == v #Select the indexes of each class
                #If there is data in the node, create a new tree node with that partition
                if len(sample_data[index])>0:
                    self.children[v] = TreeNode(min_samples_split = self.min_samples_split,
                                                max_depth=self.max_depth,
                                                seed=self.seed,
                                                verbose=self.verbose)
                    self.children[v].recursiveGenerateTree(sample_data[index],
                                                           sample_target[index],
                                                           current_depth)

Length of \(S\) is 0: When the partition of \(S\) has no observations inside, a new TreeNode is initialised with the same hyperparameters as the current node, except for max_depth, which is set to 1. recursiveGenerateTree is also called on that node, but this time current_depth is set to one and, instead of the partitions of \(X\) and \(Y\), the entire \(X\) and \(Y\) of the current node are used as arguments. Consequently, the child node will have max_depth equal to current_depth and a stop condition will be triggered (see Section 2.3.1: Node with no observations), making the decision based on the majority class in \(Y\) (the target of the parent node).

#If there is no data in the node, use the previous node data (this one) \
#and make a decision based on the majority class
                else:
                    #Force to make a decision based on majority class \
                    #simulating that it reached max_depth
                    self.children[v] = TreeNode(min_samples_split = self.min_samples_split,
                                                max_depth=1,
                                                seed=self.seed,
                                                verbose=self.verbose)
                    self.children[v].recursiveGenerateTree(sample_data,                                          
                                                           sample_target,
                                                           current_depth=1)

2.4 Selecting Best Attribute and Threshold

This function identifies which attribute \(X_j\) of \(S\) maximises the information gain when \(S\) is partitioned (see Section 2.5 for more details about the concept of information gain). It only needs two arguments, \(X\) (sample_data) and \(Y\) (sample_target).

The first part of the function defines variables to work with:

  • info_gain_max: The variable is initialised with negative infinity (the lowest possible value). It will store the information gain obtained when the best attribute \(X_j\) found so far is used as the splitter.
  • splitter: It is initialised with an empty string Series (a one-dimensional labelled array used in the pandas library of Python). It will store the values of the best attribute \(X_j\).
  • best_attribute: It will store the name of the best attribute \(X_j\).
  • best_threshold: When \(X_j\) is continuous, it will store the threshold \(h\) that maximises the information gain.
#This function defines which is the best attribute to split on (string \
#or continuous)
    def splitAttribute(self, sample_data,sample_target):
        info_gain_max = -1*float("inf") #Info gain set to a minimum
        #Creating a blank Series to store the variable on which the split is based
        splitter = pd.Series(dtype='str') 
        best_attribute = None #Setting the attribute to split on to None
        best_threshold = None #Setting the threshold to None
        #Iterate over every attribute in sample_data
        for attribute in sample_data.keys():

After defining these variables, the function iterates through each attribute \(X_j\) of \(X\) to find the one with the highest information gain (the for loop in the snippet above). Depending on the type of the attribute, the process is divided into two cases: string variables and continuous variables.

2.4.1 String Variable

When the attribute \(X_j\) is of string type, the information gain is calculated using the function compute_info_gain, which takes \(X_j\) and \(Y\) as arguments. Once the information gain is computed, the algorithm checks whether it is greater than the stored info_gain_max. If it is, \(X_j\) is selected as the current best attribute and its information gain is stored as the new maximum.

#If the attribute is a string
            if is_string_dtype(sample_data[attribute]):
                #Compute information gain using that attribute to split the target
                aig = self.compute_info_gain(sample_data[attribute], sample_target)
                #If the information gain is more than the previous one, store
                if aig > info_gain_max:
                    splitter = sample_data[attribute] #Store the variable
                    info_gain_max = aig #Store the information gain
                    best_attribute = attribute #Store the name of the attribute
                    #In this case there is no threshold (string)
                    best_threshold = None 

2.4.2 Continuous Variable

When the attribute \(X_j\) is continuous, additional steps are needed before computing the information gain. Firstly, new variables are created to store sorted versions of \(X_j\) and \(Y\), ordered by the values of \(X_j\) in ascending order. The algorithm then iterates through the sorted \(X_j\) and creates a candidate threshold \(h\) as the mean of each pair of consecutive distinct values; for example, if the sorted values of \(X_j\) are 1.0, 1.5 and 2.0, the candidate thresholds are 1.25 and 1.75.

Once a threshold \(h\) is chosen, the algorithm classifies \(X_j\) into two categories, greater or less than the threshold, which is stored in the variable classification. Then, the information gain is computed using compute_info_gain, with classification (a categorical version of \(X_j\) given the threshold) and \(Y\) as arguments. Once the information gain is computed, the algorithm checks whether it is greater than the stored info_gain_max. If it is, \(X_j\) is selected as the current best attribute with threshold \(h\), and its information gain is stored as the new maximum. The iteration continues until every mean between two consecutive distinct values of the sorted \(X_j\) has been tried as a threshold.

It is worth mentioning that classification (not the raw values of \(X_j\)) is what is stored in splitter to perform the split.

#If the attribute is continuous
            else:
                #Sort the continuous variable in ascending order. Reorder the target \
                # based on that
                sorted_index = sample_data[attribute].sort_values(ascending=True).index
                sorted_sample_data = sample_data[attribute][sorted_index]
                sorted_sample_target = sample_target[sorted_index]
                #Iterate over each sample, except the last one
                for j in range(0, len(sorted_sample_data) - 1):
                    #Create a blank Series to store the classification (less or greater)
                    classification = pd.Series(dtype='str')
                    #If two consecutive samples are not the same, use their mean as \
                    #a threshold
                    if sorted_sample_data.iloc[j] != sorted_sample_data.iloc[j+1]:
                        threshold = (sorted_sample_data.iloc[j] + 
                                     sorted_sample_data.iloc[j+1]) / 2
                        #Assign greater or less according to the threshold
                        classification = sample_data[attribute] > threshold
                        classification[classification] = 'greater'
                        classification[classification == False] = 'less'
                        #Calculate the information gain using the previous variable \
                        # (now categorical)
                        aig = self.compute_info_gain(classification, sample_target)
                        #If the information gain is more than the previous one, store
                        if aig >= info_gain_max:
                            splitter = classification #Store the variable
                            info_gain_max = aig #Store the information gain
                            best_attribute = attribute #Store the name of the attribute
                            best_threshold = threshold #Store the threshold 

2.4.3 Returning Results

After iterating through every \(X_j\) and every possible threshold \(h\), the function returns the name of the best attribute \(X_j\), the threshold \(h\) (in the case of a continuous attribute) and the values of the best \(X_j\) (which, for a continuous attribute, is the categorical version of \(X_j\)).

#If verbose is true print the result of the split
        if self.verbose:
            if is_string_dtype(sample_data[best_attribute]):
                print(f"Split by {best_attribute}, IG: {info_gain_max:.2f}")
            else:
                print(f"Split by {best_attribute}, at {best_threshold}, IG: {info_gain_max:.2f}")
        return (best_attribute,best_threshold,splitter)

2.5 Entropy

Entropy measures the amount of information, in terms of uncertainty, in a categorical variable (\(Y_v\)) with \(c\) unique classes. It can also be understood as the number of bits needed to encode the classes of the target attribute. It is computed as follows:

\[ H(Y_v):= \sum_{i=1}^{c} -p_i \log_2p_i \]

Where \(p_i\) is the proportion of the number of elements in class \(i\in1\ldots c\ \) to the number of elements in \(Y_v\).

The algorithm uses the function compute_entropy to compute the entropy, which only needs the partition of the target attribute (\(Y_v\), defined in Section 2.6) as an argument. Firstly, the function checks the number of observations in \(Y_v\). When this number is less than two, the entropy is set to zero because there is no uncertainty or information in the system.

On the other hand, when there are two or more observations in \(Y_v\), the entropy is computed using the equation above, where freq plays the role of \(p_i\). It is worth mentioning that 1e-6 is added inside the logarithm to avoid an error when freq is numerically zero.

#This function calculates the entropy based on the distribution of \
#the target split
    def compute_entropy(self, sample_target_split):
        #If there are fewer than two observations, the entropy is 0
        if len(sample_target_split) < 2:
            return 0
        #If not calculate the entropy
        else:
            freq = np.array(sample_target_split.value_counts(normalize=True))
            return -(freq * np.log2(freq + 1e-6)).sum()
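
As a quick sanity check of compute_entropy (this small example is not part of the original report), a target with two equally frequent classes has an entropy of roughly one bit, while a pure target gives a value of approximately zero; the 1e-6 term introduces a negligible deviation.

#Illustrative check of compute_entropy; values are approximate because of 
#the 1e-6 term inside the logarithm
node = TreeNode()
print(node.compute_entropy(pd.Series(['a', 'a', 'b', 'b'])))  #close to 1.0
print(node.compute_entropy(pd.Series(['a', 'a', 'a', 'a'])))  #close to 0.0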

2.6 Information Gain

The information gain measures the expected reduction in the entropy of a variable \(Y\) given knowledge of the variable \(X_j\). In other words, it measures how much uncertainty is removed by partitioning \(Y\) according to the values of \(X_j\). It is calculated by subtracting the entropy of \(Y|X_j\) from the entropy of \(Y\), as follows:

\[ IG\left(Y;X_j\right) := H(Y) - \sum_{v \in Values(X_j)} \frac{|X_{j,v}|}{|X_j|} H(Y_v) \]

Where:

  • \(X_{j,v}\) is the partition of \(X_j\) whose values are equal to \(v\).
  • \(Y_v\) is the partition of \(Y\) with the same indexes as \(X_{j,v}\).
  • \(|X_{j,v}|/|X_j|\) is the proportion of \(X_j\) that has \(v\) as its value.

The algorithm uses the function compute_info_gain to compute the information gain, which needs the values of \(Y\) (sample_target) and \(X_j\) (sample_attribute) as arguments. Firstly, the function creates a variable called values, which stores the unique values of \(X_j\) (\(v\)) and their proportions in the sample \((|X_{j,v}|/|X_j|)\). Secondly, the entropy of \(Y|X_j\) is computed by adding the entropy of each \(Y_v\) multiplied by its sample proportion in an iterative process. Finally, this value is subtracted from the entropy of \(Y\) and returned as the information gain of \(Y\) given the attribute \(X_j\).

#This function computes the information gain using a specific \
#attribute to split the target
    def compute_info_gain(self, sample_attribute, sample_target):
        #Compute the proportion of each class in the attribute
        values = sample_attribute.value_counts(normalize=True)
        split_ent = 0 #Set the entropy to 0
        #Iterate for each class of the sample attribute
        for v, fr in values.items():
            #Calculate the entropy for sample target corresponding to the class
            index = sample_attribute==v
            sub_ent = self.compute_entropy(sample_target[index])
            #Weighted sum of the entropies
            split_ent += fr * sub_ent
        #Compute the entropy without any split
        ent = self.compute_entropy(sample_target)
        #Return the information gain of the split
        return ent - split_ent
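
The behaviour of compute_info_gain can be illustrated with a small example (not part of the original report): an attribute that separates the classes perfectly recovers the full entropy of \(Y\), whereas an attribute unrelated to \(Y\) gives a gain of approximately zero.

#Illustrative check of compute_info_gain
node = TreeNode()
y = pd.Series(['yes', 'yes', 'no', 'no'])
perfect = pd.Series(['a', 'a', 'b', 'b'])  #Separates the classes perfectly
useless = pd.Series(['a', 'b', 'a', 'b'])  #Carries no information about y
print(node.compute_info_gain(perfect, y))  #close to 1.0, the entropy of y
print(node.compute_info_gain(useless, y))  #close to 0.0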

2.7 Getting Majority Class

The function getMajClass chooses the majority class of the target variable \(Y\); thus, a decision can be made even if the node is not pure. It is only triggered when a stop condition is met (see Section 2.3.1).

The function starts by computing the frequency of each class in \(Y\) and then selects the one with the most observations in the sample. If two or more classes tie for the maximum, the function randomly selects one of them. Finally, the majority class is returned as the decision for the node.

#This function selects the majority class of the target to make a decision
    def getMajClass(self, sample_target):
        #Compute the number of records per class and order it (desc)
        freq = sample_target.value_counts().sort_values(ascending=False)
        #Select the name of the class (classes) that has the max number of records
        MajClass = freq.keys()[freq==freq.max()]
        #If two or more classes have the maximum number of records, select one randomly
        if len(MajClass) > 1:
            decision = MajClass[random.Random(self.seed).randint(0,len(MajClass)-1)]
        #If there is only one, select it
        else:
            decision = MajClass[0]
        return decision

3 Model Evaluation

The model was evaluated on the Wine dataset and a modified version of the Iris dataset, both from the scikit-learn library on Python (explained in Section 3.2).

The following steps were done to evaluate model performance:

  • Split the datasets into test and train set (Section 3.2)
  • Select best hyperparameters using cross-validation (Section 3.3)
  • Train the model with the best hyperparameters and evaluate predictions in the test set (Section 3.4)
  • Compare the results with other implementations (Section 3.5)

3.1 Prediction

Before evaluating the model, it is necessary to understand how a prediction is computed. According to Section 2.1, several variables are stored in each node, including the decision, the threshold, and the child nodes when there is no decision. All these objects are used to predict the label of a given \(X\) using the function predict inside the class TreeNode.

The function predict starts at the root node, where it verifies whether a decision exists. If there is a decision, the function returns its value; if not, the process continues. Depending on the type of the attribute \(X_j\) used to split the sample at this node, the function takes one of two paths: a string type split or a numeric type split.

3.1.1 String type split

If the \(X_j\) used to split the sample is of string type, the function selects the child node that matches the value of \(X_j\) and stores it in a variable called child. Finally, the function predict is called again recursively, this time on the child node.

3.1.2 Numeric type split

If the \(X_j\) used to split the sample is numeric, its value is compared with the threshold stored in the node, with two possible outcomes: greater or less. Knowing whether the value is greater or less than the threshold, the function selects the corresponding child node and continues looking for a decision.

#This function returns the class or prediction given an X
    def predict(self, sample):
        #If there is a decision in the node, return it
        if self.decision is not None:
            #Print when verbose is true
            if self.verbose:
                print("Decision:", self.decision)
            return self.decision #Return decision
        #If not, it means that it is an internal node
        else:
            #Select the value of the split attribute in the given data
            attr_val = sample[self.split_feat_name]
            #Print if verbose is true
            if self.verbose:
                print("Attribute value:", attr_val)
            #If the value for the feature is not numeric just go to the\
            # corresponding child node and print
            if not isinstance(attr_val, Number):
                child = self.children[attr_val]
                if self.verbose:
                    print("Testing ", self.split_feat_name, "->", attr_val)
            #If the value is numeric see if it is greater or less than the \
            #threshold
            else:
                if attr_val > self.threshold:
                    child = self.children['greater']
                    if self.verbose:
                        print("Testing ", self.split_feat_name, "->",
                              'greater than ', self.threshold)
                else:
                    child = self.children['less']
                    if self.verbose:
                        print("Testing ", self.split_feat_name, "->",
                              'less than or equal', self.threshold)
            #Do it again with the child until finding the terminal node
            return child.predict(sample)

3.2 Data Preparation

Wine dataset: The data is the result of a chemical analysis of three different types of wine, where 13 continuous variables were recorded for 178 observations. The objective is to predict the wine grape cultivar (https://archive.ics.uci.edu/ml/datasets/wine).

Iris dataset: The data contains 150 observations of four characteristics of three different species of the iris plant (https://archive.ics.uci.edu/ml/datasets/iris). Because all four attributes are numeric, two of them were converted to string type using three quantiles (see the code snippet below). This transformation was done so the algorithm could be tested with more than just continuous attributes.

#Transforming two continuous variables into strings using quantiles
X['sepal length (cm)']=pd.qcut(X['sepal length (cm)'], 
                               q=3, 
                               labels=["small","medium","large"]).astype(str)
X['petal length (cm)']=pd.qcut(X['petal length (cm)'], 
                               q=3, 
                               labels=["small","medium","large"]).astype(str)

Both datasets are split into two partitions, one for training and the other for evaluation, as sketched below. The evaluation set contains about 30% of the original dataset.

It is worth mentioning that, because both are toy datasets, they do not have quality issues such as missing values or outliers.
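
The splitting step itself is not shown in the report; a sketch using train_test_split from scikit-learn, assuming the raw attributes and targets are loaded as X_wine/y_wine and X_iris/y_iris (these names and the random seed are assumptions), would look like this.

#Sketch of the 70/30 train/test split described above; the input variable 
#names and the random seed are assumptions
from sklearn.model_selection import train_test_split

X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(
    X_wine, y_wine, test_size=0.3, random_state=4)
X_iris_train, X_iris_test, y_iris_train, y_iris_test = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=4)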

3.3 Cross-Validation

Cross-validation was used to find the best hyperparameters for the decision tree construction.

The cross-validation function takes \(X\), \(Y\), the model, a number of folds and a grid of hyperparameters, and computes the accuracy for each hyperparameter combination, returning the one with the best metric. For each combination, the data indexes are randomly partitioned into n folds. The model is then trained n-1 times, once on each fold except the first, and each time it is evaluated on the first fold, which acts as the test set. The combination with the best average accuracy is selected as the best set of hyperparameters.

#This function performs hyperparameter tuning using cross-validation
def crossvalidation(X, y, model, n_folds, params, seed=4):
    #Creating combinations of all hyperparameters (like an expand grid)
    keys, values = zip(*params.items())
    params = [dict(zip(keys, v)) for v in itertools.product(*values)]
    #Initialise a list to store the accuracy for each set of parameters
    accuracy = []
    #Iterate over each set of parameters
    for j in range(0, len(params)):
        #Creating the fold indexes
        indexes=np.array(X.index)
        #Shuffle the indexes randomly
        random.Random(seed).shuffle(indexes)
        #Save the indexes into the folds
        folds = np.array_split(indexes,n_folds)
        acc = [] #Initialise a local accuracy list for each fold
        #Compute the accuracy of each fold using the first one as the test set
        for k in range(1,n_folds):
            #Change the parameters of the model
            model.set_params(params[j])
            #Train the model with the fold
            model.fit(X.loc[folds[k]], y.loc[folds[k]])
            #Predict using the first fold as the test set
            prediction = X.loc[folds[0]].apply(lambda row : model.predict(row), axis = 1)
            #Store the accuracy of the fold
            acc.append(sum((prediction  ==  y.loc[folds[0]]))/len(y.loc[folds[0]]))
        #Compute the average accuracy over the folds for the set of parameters
        accuracy.append(sum(acc)/len(acc))
        #Print the results for the set of parameters
        print(f"Parameters: {params[j]}, Accuracy: {accuracy[j]}")
        #Change the seed to create different folds for the next set
        seed=seed+1
    #Select, print and return the best set of hyperparameters
    max_value = max(accuracy)
    max_index = accuracy.index(max_value)
    print('Best hyperparameters:')
    print(params[max_index])
    return params[max_index]

This process was applied to both datasets, with the following results:

Wine dataset:

Best hyperparameters: {'min_samples_split': 4, 'max_depth': None}

#Defining hyperparameters
param_grid = {'min_samples_split': [2,3,4,5], 'max_depth': [2,3,4,None], 
              'seed': [2]}
#Initialization of the model
tree_wine = TreeNode()
#Cross validation to find the best hyperparameters
crossvalidation(X_wine_train, y_wine_train, tree_wine, 5, param_grid, seed=12)

Iris dataset:

Best hyperparameters: {'min_samples_split': 4, 'max_depth': 3}

#Defining hyperparameters
param_grid = {'min_samples_split': [2,3,4,5], 'max_depth': [2,3,4,None], 
              'seed': [2]}
#Initialization of the model
tree_iris = TreeNode()
#Cross validation to find the best hyperparameters
crossvalidation(X_iris_train, y_iris_train, tree_iris, 5, param_grid)

3.4 Results Evaluation

Using the best hyperparameters from Section 3.3, one model was trained for each dataset. These models were used to predict the target variable in the test set, and the predictions were compared with the real values of the target.

This section gives a detailed discussion of the results of each model.

3.4.1 Wine results

The trained model was used to predict the labels of the test set, resulting in 83% of observations being correctly labelled. However, there were differences in how well the model predicted each class, as shown in Table 3.1.

According to this table, while the precision is similar across the three classes, the recall varies widely, from 1.00 (class 0) to 0.57 (class 2). Therefore, the main problem of the model occurs when predicting the label of class 2, where several false negatives were encountered. This can also be seen in the confusion matrix (Table 3.2).
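
The metrics in Table 3.1 and Table 3.2 can be obtained from the test-set predictions with scikit-learn's metrics module; the sketch below assumes the tree_wine model has been refitted with the best hyperparameters and that the wine test split is named as in the sketch of Section 3.2.

#Sketch of how the evaluation metrics below can be computed (assumes the 
#fitted tree_wine model and the wine test split from Section 3.2)
from sklearn.metrics import classification_report, confusion_matrix

pred_wine = X_wine_test.apply(lambda row: tree_wine.predict(row), axis=1)
print(classification_report(y_wine_test, pred_wine))
print(confusion_matrix(y_wine_test, pred_wine))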

library(kableExtra)

temp<-data.frame(Label=c(0,1,2),
                 Precision=c(0.83,0.82,0.89),
                 Recall=c(1.00,0.86,0.57))

temp %>% kbl(caption = "Evaluation metrics for the wine model") %>% 
    kable_paper(full_width = F) %>%
    column_spec(1, width = "8em") %>%
    column_spec(2, width = "8em") %>%
    column_spec(3, width = "8em")
Table 3.1: Evaluation metrics for the wine model
Label Precision Recall
0 0.83 1.00
1 0.82 0.86
2 0.89 0.57
temp<-data.frame("Y"=c("True labels","True labels","True labels"),
                 "X"=c(0,1,2),
                 "A"=c(19,2,2),
                 "B"=c(0,18,4),
                 "C"=c(0,1,8))

temp %>% kbl(caption = "Confusion matrix for the wine model",
             col.names = c("","","0","1","2")) %>% 
  kable_paper(full_width = F) %>%
  add_header_above(c("","","Predicted labels"=3)) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2, bold = TRUE, width = "3em") %>%
  column_spec(3, width = "6em", border_left = T, border_right = T) %>%
  column_spec(4, width = "6em", border_left = T, border_right = T) %>%
  column_spec(5, width = "6em", border_left = T, border_right = T) %>%
  collapse_rows(columns = 1, latex_hline = "major", valign = "middle")
Table 3.2: Confusion matrix for the wine model
Predicted labels
0 1 2
True labels 0 19 0 0
1 2 18 1
2 2 4 8

A possible explanation for the low recall in class 2 could be that this class has characteristics similar to the other two classes in the training set. Therefore, the algorithm decided that some of those characteristics belong to classes 0 and 1, resulting in some false negatives for class 2.

3.4.2 Iris results

When the trained model is used to predict the labels in the test set, 100% of the observations are correctly predicted. Additionally, Table 3.3 and Table 3.4 show no errors in the prediction of \(Y\) in the test set. This could be due to the simplicity of the data, where each class may have very distinct characteristics.

temp<-data.frame(Label=c(0,1,2),
                 Precision=c(1.00,1.00,1.00),
                 Recall=c(1.00,1.00,1.00))

temp %>% kbl(caption = "Evaluation metrics for the iris model") %>% 
    kable_paper(full_width = F) %>%
    column_spec(1, width = "8em") %>%
    column_spec(2, width = "8em") %>%
    column_spec(3, width = "8em")
Table 3.3: Evaluation metrics for the iris model
Label Precision Recall
0 1 1
1 1 1
2 1 1
temp<-data.frame("Y"=c("True labels","True labels","True labels"),
                 "X"=c(0,1,2),
                 "A"=c(19,0,0),
                 "B"=c(0,13,0),
                 "C"=c(0,0,13))

temp %>% kbl(caption = "Confusion matrix for the iris model",
             col.names = c("","","0","1","2")) %>% 
  kable_paper(full_width = F) %>%
  add_header_above(c("","","Predicted labels"=3)) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2, bold = TRUE, width = "3em") %>%
  column_spec(3, width = "6em", border_left = T, border_right = T) %>%
  column_spec(4, width = "6em", border_left = T, border_right = T) %>%
  column_spec(5, width = "6em", border_left = T, border_right = T) %>%
  collapse_rows(columns = 1, latex_hline = "major", valign = "middle")
Table 3.4: Confusion matrix for the iris model
Predicted labels
0 1 2
True labels 0 19 0 0
1 0 13 0
2 0 0 13

3.5 Comparing with other implementations

Because the iris model made no errors, only the wine model was compared with a widely distributed implementation, in this case the decision tree of scikit-learn.

The same process described in Section 3 was applied using scikit-learn, with some differences. Firstly, the function GridSearchCV from scikit-learn was used to find the best hyperparameters (which were slightly different from the ones found in Section 3.3). Then, the model was trained on the training split using the best hyperparameters, but this time with DecisionTreeClassifier from scikit-learn. It is worth mentioning that entropy was used as the splitting criterion for comparison purposes.
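
A sketch of this comparison, using the GridSearchCV and DecisionTreeClassifier classes mentioned above, is shown below. The grid values, the random seed and the test-split variable names are assumptions; the report does not list the exact grid used.

#Sketch of the scikit-learn comparison described above; the grid values, 
#seed and test-split variable names are assumptions
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

grid = {'min_samples_split': [2, 3, 4, 5], 'max_depth': [2, 3, 4, None]}
search = GridSearchCV(DecisionTreeClassifier(criterion='entropy', random_state=2),
                      grid, cv=5)
search.fit(X_wine_train, y_wine_train)
sk_pred = search.best_estimator_.predict(X_wine_test)
print((sk_pred == y_wine_test).mean())  #Accuracy on the test set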

The accuracy of this model was the same as that of our model, with identical evaluation metrics (Table 3.5) and confusion matrix (Table 3.6). This only supports the correctness of the algorithm implementation; however, it does not mean that both implementations are equal. The scikit-learn implementation is much more flexible and is based on a more efficient algorithm than ID3 or C4.5.

library(kableExtra)

temp<-data.frame(Label=c(0,1,2),
                 Precision=c(0.83,0.82,0.89),
                 Recall=c(1.00,0.86,0.57))

temp %>% kbl(caption = "Evaluation metrics for the wine model using scikit-learn") %>% 
    kable_paper(full_width = F) %>%
    column_spec(1, width = "8em") %>%
    column_spec(2, width = "8em") %>%
    column_spec(3, width = "8em")
Table 3.5: Evaluation metrics for the wine model using scikit-learn
Label Precision Recall
0 0.83 1.00
1 0.82 0.86
2 0.89 0.57
temp<-data.frame("Y"=c("True labels","True labels","True labels"),
                 "X"=c(0,1,2),
                 "A"=c(19,2,2),
                 "B"=c(0,18,4),
                 "C"=c(0,1,8))

temp %>% kbl(caption = "Confusion matrix for the wine model using scikit-learn",
             col.names = c("","","0","1","2")) %>% 
  kable_paper(full_width = F) %>%
  add_header_above(c("","","Predicted labels"=3)) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2, bold = TRUE, width = "3em") %>%
  column_spec(3, width = "6em", border_left = T, border_right = T) %>%
  column_spec(4, width = "6em", border_left = T, border_right = T) %>%
  column_spec(5, width = "6em", border_left = T, border_right = T) %>%
  collapse_rows(columns = 1, latex_hline = "major", valign = "middle")
Table 3.6: Confusion matrix for the wine model using scikit-learn
Predicted labels
0 1 2
True labels 0 19 0 0
1 2 18 1
2 2 4 8

4 Conclusion

This report presented the implementation of a simplified version of the C4.5 algorithm, whose construction is based on entropy, the core concept of information theory. Additionally, this implementation added some hyperparameters that can be used to avoid overfitting, such as the maximum depth of the tree or the minimum sample size for a split. However, this algorithm still cannot be considered equal to the C4.5 algorithm because it lacks some functionalities (such as pruning and missing-value handling). Moreover, it is not yet robust enough to be used widely, due to the restrictions on the input format and the lack of error handling.

Despite these constraints, it was demonstrated that the algorithm can handle numeric and string type data with successful results on two toy datasets, reaching an accuracy of at least 83%. Moreover, the algorithm produces the same results as the decision tree from scikit-learn when entropy is used as the criterion and the hyperparameters are limited. Still, there is a lot of potential for improvement, starting with the speed of the algorithm, which is much slower than the scikit-learn implementation.

References

Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., … Steinberg, D. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1–37. https://doi.org/10.1007/s10115-007-0114-2