You can see the script of the project here.
The script is based on the work of Jun Li and Barış Can Esmer.
Decision trees are among the most popular supervised machine learning algorithms. Their popularity is mainly based on their interpretability and the simplicity of the underlying divide-and-conquer algorithm (Wu et al. (2008)). It is worth mentioning that decision trees can be used for classification or regression problems; however, the algorithm chosen for this report (C4.5) only works with classification problems.
The C4.5 algorithm was developed by Ross Quinlan as an extension of the ID3 algorithm; therefore, the construction of the tree is similar but with some improvements. This report implements a simplified version of the C4.5 algorithm, without missing-value handling, cost differentiation or pruning.
Consider a dataset \(S=[X,Y]'\), where \(X\) are the data attributes and \(Y\) is the target, defined as follows:
\(X=[X_1,\ldots,X_k]'\), where \(k\) is the number of attributes and \(X_j\) is the \(j\)th attribute. \(Y = [y_1,\ldots,y_n]'\), where \(n\) is the number of observations and \(y_i\) is the class of the \(i\)th observation, with \(y_i \in \{1,\ldots,c\}\), where \(c\) is the number of classes.
There is a function \(f_T\) that captures the relationship between \(X\) and \(Y\) \((f_T: X \mapsto Y)\).
The C4.5 algorithm searches iteratively through the hypothesis space to find the hypothesis that best represents \(f_T\), using information gain as a guide. It starts with a single node, where \(S\) is split into child nodes based on the attribute \(X_j\) that maximises the information gain. The process is then repeated recursively on the partition of \(S\) in each child node until a stop condition is satisfied. The terminal nodes (also called leaves) define the decision of the branch based on the majority class in their partition of \(S\).
For simplicity, \(S\) will be considered the subset of the original dataset inside a node. Similarly, \(X\) and \(Y\) will denote the subsets of the data attributes and the target inside the node, respectively.
This report presents the implementation of a decision tree construction algorithm for a multiclass classification problem, with both continuous and string type attributes using a simplified version of the C4.5 algorithm. Additionally, some hyperparameters were added to avoid overfitting.
The data used to train and evaluate the algorithm corresponds to the wine and iris datasets from the scikit-learn library. It is worth mentioning that the iris dataset was modified to create a mix of continuous and string-type attributes. Both datasets have a multiclass target variable.
The algorithm consists of four main modules: Entropy Computation, Information Gain Computation, Recursive Partitioning and Attribute Splitting. Additionally, there are eight supporting modules: Initialisation, Fit, Prediction, Data Validation, Getting Majority Class, Printing, Getting Parameters and Setting Parameters.
This module initialises the variables inside the class object, some of which can be defined by the user: min_samples_split (the minimum number of samples required to split a node), max_depth (the maximum depth of the tree), seed (the seed for random numbers) and verbose (whether to print the splits).
Additionally, the function initialises other important variables needed to store values later: children (the dictionary of child nodes), decision (the class assigned to a terminal node), split_feat_name (the splitting attribute) and threshold (the threshold used to split a continuous attribute).
class TreeNode:
    #Initialization of the class, with some default parameters
    def __init__(self, min_samples_split=2, max_depth=None, seed=2,
                 verbose=False):
        # Sub nodes -- recursive, those elements of the same type (TreeNode)
        self.children = {}
        self.decision = None  # Undecided
        self.split_feat_name = None  # Splitting feature
        self.threshold = None  # Where to split the feature
        # Minimum number of samples required to split a node
        self.min_samples_split = min_samples_split
        # Maximum depth of the tree: depth 0 => root node, None => unlimited
        self.max_depth = max_depth
        self.seed = seed  # Seed for random numbers
        self.verbose = verbose  # True to print the splits

The fitting process needs two arguments given by the user: the sample attributes (data) and the sample target (target), which are \(X\) and \(Y\), respectively. Firstly, the function calls validate_data to check that data and target have an accepted data type and the same length. In this implementation, \(X\) needs to be a DataFrame object and \(Y\) a Series object, both with only numeric and string columns (categorical columns are not accepted). If everything is fine, it sets current_depth to 0 (because the tree does not have any depth yet) and calls the recursiveGenerateTree function.
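A minimal sketch of the fit method, consistent with the description above, is shown below (the original script may differ in details; in particular, how invalid input is reported is an assumption):

    #Sketch of the fit method (method of TreeNode); the error handling is an assumption
    def fit(self, data, target):
        #Check that data and target have accepted types and the same length
        if not self.validate_data(data, target):
            raise ValueError("data and target must have accepted types and equal length")
        #The tree has no depth yet, so the recursion starts at depth 0
        self.recursiveGenerateTree(data, target, current_depth=0)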
    #This function validates the data types and equal length of data and target
    def validate_data(self, data, target):
        #Validating data type for sample (X)
        if not isinstance(data, (list, pd.core.series.Series, np.ndarray,
                                 pd.DataFrame)):
            return False
        #Validating data type for target (y)
        if not isinstance(target, (list, pd.core.series.Series, np.ndarray)):
            return False
        #Validating same length of X and y
        if len(data) != len(target):
            return False
        return True

The recursiveGenerateTree function is called for the first time by the fit function. Its objective is to make a decision in the node when it is possible (see Section 2.3.1) and, when no decision can be made, to split \(S\) and call recursiveGenerateTree for each partition (see Section 2.3.2).
When \(S\) in a node fulfils a stop condition, the node is transformed into a leaf. In this custom algorithm, four stop conditions are implemented:
Unique class in the target: When \(Y\) inside a node has only one class, the node is transformed into a terminal node with decision equal to the value of the unique class in \(Y\).
    #This is a recursive function, which selects the decision if possible.
    #If not, it creates children nodes
    def recursiveGenerateTree(self, sample_data, sample_target, current_depth):
        #If there is only one possible outcome, select that as the decision
        if len(sample_target.unique()) == 1:
            self.decision = sample_target.unique()[0]

Number of observations less than min_samples_split: When the length of \(S\) is less than min_samples_split, the decision is made using the majority class inside \(Y\). The function getMajClass is called to define which is the majority class and make the decision (see Section 2.7 for more information about the function).
        #If the sample has fewer than min_samples_split observations,
        #select the majority class
        elif len(sample_target) < self.min_samples_split:
            self.decision = self.getMajClass(sample_target)

Reaching the max depth: When the depth of the tree at the node reaches max_depth, the decision is made using the majority class in \(Y\) (see Section 2.7 for more about getMajClass).
        #If the depth of the current branch is equal to max_depth,
        #select the majority class
        elif current_depth == self.max_depth:
            self.decision = self.getMajClass(sample_target)

Node with no observations: During the split of the sample (Section 2.3.2), the algorithm checks whether the partitions of \(S_p\) (\(S_1,\ldots,S_n\)) contain observations. If a partition has no observations, the child node is created with the same sample as the parent node (\(S_p\)), but simulating that max_depth has been reached. Therefore, the decision in a child node with no observations is made using the majority class of the parent node \(S_p\). Details of the code are shown in Section 2.3.2.
If no stop condition is satisfied, the algorithm calls splitAttribute to decide which attribute \(X_j\) and threshold \(h\) (only in the case of a continuous attribute) to use to split \(S\). The function splitAttribute returns the splitter variable, which is the same as \(X_j\) for string attributes; for continuous ones, it is a transformation of \(X_j\) indicating whether each observation \(x_{ij}\) is greater or less than the threshold. See Section 2.4 for more details about splitAttribute.
After selecting the attribute, several objects are stored in the TreeNode object. Firstly, an empty dictionary to store the child nodes is created. Then, the name of the best attribute \(X_j\) and the threshold \(h\) are stored. Finally, before creating the child nodes, current_depth is increased by one because a new level of child nodes will be created.
        #If no stop condition is met, split the node
        else:
            #Call the function to select best_attribute to split
            best_attribute, best_threshold, splitter = self.splitAttribute(sample_data,
                                                                           sample_target)
            self.children = {}  #Initializing a dictionary with the children nodes
            self.split_feat_name = best_attribute  #Name of the feature
            self.threshold = best_threshold  #Threshold for continuous variable
            current_depth += 1  #Increase the depth by 1

Finally, for each class of \(X_j\) (i.e., each unique value of splitter) a child node with the corresponding partition of \(S\) is created. There are two possible outcomes for each child node, depending on how many observations the partition has:
Length of \(S\) greater than 0: When the partition of \(S\) has samples inside, a new TreeNode is initialised with the same hyperparameters as the current node. This TreeNode is stored in the children dictionary created in the previous step. Finally, recursiveGenerateTree is called from that TreeNode, passing current_depth and the partitions of \(X\) and \(Y\) as arguments. Therefore, a new child node is created using the same process described in this section.
            #Create a new node for each class of the best feature
            for v in splitter.unique():
                index = splitter == v  #Select the indexes of each class
                #If there is data in the node, create a new tree node with that partition
                if len(sample_data[index]) > 0:
                    self.children[v] = TreeNode(min_samples_split=self.min_samples_split,
                                                max_depth=self.max_depth,
                                                seed=self.seed,
                                                verbose=self.verbose)
                    self.children[v].recursiveGenerateTree(sample_data[index],
                                                           sample_target[index],
                                                           current_depth)

Length of \(S\) equal to 0: When the partition of \(S\) has no observations inside, a new TreeNode is initialised with the same hyperparameters as the current node, except for max_depth, which is set to 1. recursiveGenerateTree is also called from that node, but this time current_depth is set to one and, instead of the partitions of \(X\) and \(Y\), the entire \(X\) and \(Y\) of the current node are used as arguments. Consequently, the child node has max_depth equal to current_depth, so a stop condition is triggered (see Section 2.3.1: Node with no observations), and the decision is made based on the majority class in \(Y\) (the target of the parent node).
                #If there is no data in the node, use the parent node data (this one)
                #and make a decision based on the majority class
                else:
                    #Force a decision based on the majority class by
                    #simulating that max_depth has been reached
                    self.children[v] = TreeNode(min_samples_split=self.min_samples_split,
                                                max_depth=1,
                                                seed=self.seed,
                                                verbose=self.verbose)
                    self.children[v].recursiveGenerateTree(sample_data,
                                                           sample_target,
                                                           current_depth=1)

This function identifies which attribute \(X_j\) of \(S\) maximises the information gain when \(S\) is partitioned (see Section 2.5 for more details about the concept of information gain). It only needs two arguments, \(X\) (sample_data) and \(Y\) (sample_target).
The first part of the function defines variables to work with:
    #This function defines which is the best attribute to split on (string
    #or continuous)
    def splitAttribute(self, sample_data, sample_target):
        info_gain_max = -1 * float("inf")  #Info gain set to a minimum
        #Creating a blank series to store the variable on which the split is based
        splitter = pd.Series(dtype='str')
        best_attribute = None  #Setting the attribute to split on to None
        best_threshold = None  #Setting the threshold to None
        #Iterate over every attribute in the sample_data
        for attribute in sample_data.keys():

After defining the variables, the function iterates through each attribute \(X_j\) of \(X\) (the for loop above) to find the one with the highest information gain. Depending on the type of the attribute, the process is divided into two cases: string attributes and continuous attributes.
When the attribute \(X_j\) is of string type, the information gain is calculated using the function compute_info_gain, which needs \(X_j\) and \(Y\) as arguments. Once the information gain is computed, the algorithm checks whether it is greater than the stored info_gain_max. If it is, \(X_j\) is selected as the temporary best attribute, and its information gain is stored as the current maximum.
            #If the attribute is a string
            if is_string_dtype(sample_data[attribute]):
                #Compute information gain using that attribute to split the target
                aig = self.compute_info_gain(sample_data[attribute], sample_target)
                #If the information gain is more than the previous maximum, store it
                if aig > info_gain_max:
                    splitter = sample_data[attribute]  #Store the variable
                    info_gain_max = aig  #Store the information gain
                    best_attribute = attribute  #Store the name of the attribute
                    #In this case there is no threshold (string)
                    best_threshold = None

When the attribute \(X_j\) is continuous, additional steps are needed before computing the information gain. Firstly, new variables are created to store the sorted versions of \(X_j\) and \(Y\), ordered by the values of \(X_j\) in ascending order. Then the algorithm iterates through the sorted \(X_j\) and creates a threshold \(h\) based on the mean of two consecutive values.
Once the threshold \(h\) is chosen, the algorithm classifies \(X_j\) into two categories, greater or less than the threshold, which is stored in the variable classification. Then, the information gain is computed using compute_info_gain, with classification (a categorical version of \(X_j\) given the threshold) and \(Y\) as arguments. Once the information gain is computed, the algorithm checks whether it is greater than the stored info_gain_max. If it is, \(X_j\) is selected as the temporary best attribute with threshold \(h\), and its information gain is stored as the current maximum. The iteration continues until every possible mean of two consecutive values of the sorted \(X_j\) has been used as a threshold.
It is worth mentioning that classification (the categorical version of the best \(X_j\)) is what is stored in splitter to do the split.
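For instance, if the sorted values of \(X_j\) in a node are 1.0, 1.2 and 3.5, the candidate thresholds are \(h = 1.1\) and \(h = 2.35\); each candidate is evaluated and the one yielding the highest information gain is kept.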
            #If the attribute is continuous
            else:
                #Sort the continuous variable in ascending order and reorder
                #the target accordingly
                sorted_index = sample_data[attribute].sort_values(ascending=True).index
                sorted_sample_data = sample_data[attribute][sorted_index]
                sorted_sample_target = sample_target[sorted_index]
                #Iterate between each sample, except the last one
                for j in range(0, len(sorted_sample_data) - 1):
                    #Create a blank series to store the classification (less or greater)
                    classification = pd.Series(dtype='str')
                    #If two consecutive samples are not the same, use their mean as
                    #a threshold
                    if sorted_sample_data.iloc[j] != sorted_sample_data.iloc[j + 1]:
                        threshold = (sorted_sample_data.iloc[j] +
                                     sorted_sample_data.iloc[j + 1]) / 2
                        #Assign greater or less according to the threshold
                        classification = sample_data[attribute] > threshold
                        classification[classification] = 'greater'
                        classification[classification == False] = 'less'
                        #Calculate the information gain using the previous variable
                        #(now categorical)
                        aig = self.compute_info_gain(classification, sample_target)
                        #If the information gain is more than the previous maximum, store it
                        if aig >= info_gain_max:
                            splitter = classification  #Store the variable
                            info_gain_max = aig  #Store the information gain
                            best_attribute = attribute  #Store the name of the attribute
                            best_threshold = threshold  #Store the threshold

After iterating through every \(X_j\) and every possible threshold \(h\), the function returns the name of the best attribute \(X_j\), the threshold \(h\) (in the case of a continuous attribute) and the values of the best \(X_j\) (which, for a continuous attribute, is its categorical version).
        #If verbose is true, print the result of the split
        if self.verbose:
            if is_string_dtype(sample_data[best_attribute]):
                print(f"Split by {best_attribute}, IG: {info_gain_max:.2f}")
            else:
                print(f"Split by {best_attribute}, at {best_threshold}, IG: {info_gain_max:.2f}")
        return (best_attribute, best_threshold, splitter)

Entropy measures the amount of information, in terms of uncertainty, in a categorical variable (\(Y_v\)) with \(c\) unique classes. It can also be understood as the number of bits needed to encode the classes of the target attribute. It is computed as follows:
\[ H(Y_v):= \sum_{i=1}^{c} -p_i \log_2p_i \]
where \(p_i\) is the proportion of the number of elements in class \(i \in \{1,\ldots,c\}\) to the number of elements in \(Y_v\).
The algorithm uses the function compute_entropy to compute the entropy, which only needs the partition of the target attribute (\(Y_v\), defined in Section 2.6) as an argument. Firstly, the function checks the number of observations in \(Y_v\). When this number is less than two, the entropy is equal to zero because there is no uncertainty or information in the system.
On the other hand, when there is more than one observation in \(Y_v\), the entropy is computed using the equation above, where freq corresponds to \(p_i\). It is worth mentioning that 1e-6 is added inside the logarithm to avoid an error when freq is equal to zero.
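As an illustration (a standalone snippet, not part of the class), the same calculation can be reproduced for a hypothetical toy target with three observations of one class and one of another:

import numpy as np
import pandas as pd

#Hypothetical toy target: three observations of class 'a' and one of class 'b'
toy_target = pd.Series(['a', 'a', 'a', 'b'])
freq = np.array(toy_target.value_counts(normalize=True))  #[0.75, 0.25]
entropy = -(freq * np.log2(freq + 1e-6)).sum()
print(round(entropy, 3))  #approximately 0.811 bits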
    #This function calculates the entropy based on the distribution of
    #the target split
    def compute_entropy(self, sample_target_split):
        #If there are fewer than two observations, the entropy is 0
        if len(sample_target_split) < 2:
            return 0
        #If not, calculate the entropy
        else:
            freq = np.array(sample_target_split.value_counts(normalize=True))
            return -(freq * np.log2(freq + 1e-6)).sum()

The information gain measures the expected reduction in the entropy of a variable \(Y\) given knowledge of the variable \(X_j\). In other words, it measures how much uncertainty is reduced by partitioning \(Y\) according to the values of \(X_j\). It is calculated by subtracting the entropy of \(Y|X_j\) from the entropy of \(Y\), as follows:
\[ IG(Y;X_j) := H(Y) - \sum_{v \in Values(X_j)} \frac{|X_{j,v}|}{|X_j|} H(Y_v) \]
where:

* \(X_{j,v}\) is the partition of \(X_j\) whose values are equal to \(v\),
* \(Y_v\) is the partition of \(Y\) with the same indexes as \(X_{j,v}\),
* \(|X_{j,v}|/|X_j|\) is the proportion of \(X_j\) whose value is \(v\).
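For example, if \(Y=[A,A,B,B]\) and an attribute \(X_j\) sends both \(A\) observations to one partition and both \(B\) observations to the other, then \(H(Y)=1\) bit, both child entropies \(H(Y_v)\) are 0, and \(IG(Y;X_j)=1-0=1\), the maximum possible reduction in uncertainty.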
The algorithm uses the function compute_info_gain to compute the information gain, which needs the values of \(Y\) (sample_target) and \(X_j\) (sample_attribute) as arguments. Firstly, the function creates a variable called values, which stores the unique values \(v\) of \(X_j\) and their proportions in the sample \((|X_{j,v}|/|X_j|)\). Secondly, the entropy of \(Y|X_j\) is computed by iteratively adding the entropy of each \(Y_v\) multiplied by its sample proportion. Finally, this value is subtracted from the entropy of \(Y\) and returned as the information gain of \(Y\) given the attribute \(X_j\).
    #This function computes the information gain using a specific
    #attribute to split the target
    def compute_info_gain(self, sample_attribute, sample_target):
        #Compute the proportion of each class in the attribute
        values = sample_attribute.value_counts(normalize=True)
        split_ent = 0  #Set the entropy to 0
        #Iterate over each class of the sample attribute
        for v, fr in values.items():
            #Calculate the entropy of the sample target corresponding to the class
            index = sample_attribute == v
            sub_ent = self.compute_entropy(sample_target[index])
            #Weighted sum of the entropies
            split_ent += fr * sub_ent
        #Compute the entropy without any split
        ent = self.compute_entropy(sample_target)
        #Return the information gain of the split
        return ent - split_ent

The function getMajClass chooses the majority class given the target variable \(Y\); thus, a decision can be made even if the node is not pure. It is only triggered when a stop condition is met (see Section 2.3.1).
The function starts by computing the frequency of each class in \(Y\) and then selects the one with the most observations in the sample. If two or more classes are tied with the most observations, the function randomly selects one of them. Finally, the majority class is returned as the decision for the node.
    #This function selects the majority class of the target to make a decision
    def getMajClass(self, sample_target):
        #Compute the number of records per class and order it (desc)
        freq = sample_target.value_counts().sort_values(ascending=False)
        #Select the name of the class (or classes) with the max number of records
        MajClass = freq.keys()[freq == freq.max()]
        #If two or more classes have an equal number of records, select one randomly
        if len(MajClass) > 1:
            decision = MajClass[random.Random(self.seed).randint(0, len(MajClass) - 1)]
        #If there is only one, select that
        else:
            decision = MajClass[0]
        return decision

The model was evaluated on the Wine dataset and a modified version of the Iris dataset, both from the scikit-learn library in Python (described in Section 3.2).
The following subsections describe the steps taken to evaluate the model performance.
Before evaluating the model, it is necessary to understand how a prediction is computed. According to Section 2.1, several variables are stored in each node, including the decision, the threshold and the child nodes when there is no decision. All these objects are used to predict the label of a given \(X\) using the function predict inside the class TreeNode.
The function predict starts at the root node, where it verifies whether a decision exists. If there is a decision, the function returns its value; if there is not, the process continues. Depending on the type of the \(X_j\) used to split the sample in this node, the function takes one of two paths: a numeric-type split or a string-type split.
If the \(X_j\) used to split the sample is of string type, the function selects the child node that matches the value of \(X_j\) and stores it in a variable called child. Finally, the function predict is called again recursively, this time using the variables stored in the child node.
If the \(X_j\) used to split the sample is numeric, its value is compared with the threshold stored in the node, with two possible outcomes: greater or less. Knowing whether the value is greater or less than the threshold, the function selects the corresponding child node to keep looking for the decision.
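As a minimal usage sketch (hypothetical toy data; it assumes the class defined above together with the fit method described in Section 2.2), training and prediction could look as follows:

import pandas as pd

#Hypothetical toy sample with one string attribute and one continuous attribute
X_toy = pd.DataFrame({'colour': ['red', 'red', 'green', 'green'],
                      'size': [1.0, 1.2, 3.5, 3.8]})
y_toy = pd.Series(['class_a', 'class_a', 'class_b', 'class_b'])

tree = TreeNode(min_samples_split=2, max_depth=None, verbose=True)
tree.fit(X_toy, y_toy)

#predict expects a single observation (one row of the DataFrame)
new_observation = pd.Series({'colour': 'green', 'size': 3.6})
print(tree.predict(new_observation))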
    #This function returns the class or prediction given an X
    def predict(self, sample):
        #If there is a decision in the node, return it
        if self.decision is not None:
            #Print when verbose is true
            if self.verbose:
                print("Decision:", self.decision)
            return self.decision  #Return decision
        #If not, it means that it is an internal node
        else:
            #Select the value of the split attribute in the given data
            attr_val = sample[self.split_feat_name]
            #Print if verbose is true
            if self.verbose:
                print("Attribute value:", attr_val)
            #If the value for the feature is not numeric, just go to the
            #corresponding child node and print
            if not isinstance(attr_val, Number):
                child = self.children[attr_val]
                if self.verbose:
                    print("Testing ", self.split_feat_name, "->", attr_val)
            #If the value is numeric, see whether it is greater or less than
            #the threshold
            else:
                if attr_val > self.threshold:
                    child = self.children['greater']
                    if self.verbose:
                        print("Testing ", self.split_feat_name, "->",
                              'greater than ', self.threshold)
                else:
                    child = self.children['less']
                    if self.verbose:
                        print("Testing ", self.split_feat_name, "->",
                              'less than or equal', self.threshold)
            #Do it again with the child until finding the terminal node
            return child.predict(sample)

Wine dataset: The data is the result of a chemical analysis of three different types of wine, where 13 continuous variables were recorded for 178 observations. The objective is to predict the wine grape cultivar (https://archive.ics.uci.edu/ml/datasets/wine).
Iris dataset: The data contains 150 observations of four characteristics of three different species of the iris plant (https://archive.ics.uci.edu/ml/datasets/iris). Because all four attributes are numeric, two of them were converted to string type using three quantiles (see the code snippet below). This transformation was done to test the algorithm with string attributes as well as continuous ones.
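The datasets might be loaded as in the following sketch (hypothetical code; the variable names X, y, X_wine and y_wine are assumptions, and the original script may load the data differently), before the quantile transformation shown below is applied:

from sklearn.datasets import load_iris, load_wine

#Hypothetical loading step; variable names are assumptions
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target          #X: DataFrame of attributes, y: Series target

wine = load_wine(as_frame=True)
X_wine, y_wine = wine.data, wine.target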
#Transforming two continuous variables into strings using quantiles
X['sepal length (cm)'] = pd.qcut(X['sepal length (cm)'],
                                 q=3,
                                 labels=["small", "medium", "large"]).astype(str)
X['petal length (cm)'] = pd.qcut(X['petal length (cm)'],
                                 q=3,
                                 labels=["small", "medium", "large"]).astype(str)

Both datasets were split into two partitions, one for training and the other for evaluation. The evaluation set is about 30% of each original dataset.
It is worth mentioning that, because both are toy datasets, they do not have quality issues such as missing values or outliers.
Cross-validation was used to find the best hyperparameters for the decision tree construction.
The cross-validation function takes \(X\), \(Y\), the model, a number of folds and a set of hyperparameter values, computes the accuracy for each hyperparameter combination and returns the one with the best metric. This is done by randomly partitioning the data into n folds for every group of hyperparameters. For every set of hyperparameters, the model is trained n-1 times and evaluated each time using one of the folds as the test set. The set of hyperparameters with the best average accuracy is then selected as the best group of hyperparameters.
#This function performs hyperparameter tuning using cross-validation
def crossvalidation(X, y, model, n_folds, params, seed=4):
    #Creating combinations of all hyperparameters (like an expand grid)
    keys, values = zip(*params.items())
    params = [dict(zip(keys, v)) for v in itertools.product(*values)]
    #Initialise a list to store the accuracy for each set of parameters
    accuracy = []
    #Iterate over each set of parameters
    for j in range(0, len(params)):
        #Creating the fold indexes
        indexes = np.array(X.index)
        #Shuffle the indexes randomly
        random.Random(seed).shuffle(indexes)
        #Split the indexes into the folds
        folds = np.array_split(indexes, n_folds)
        acc = []  #Initialise a local accuracy for each fold
        #Compute the accuracy of each fold using the first one as a test set
        for k in range(1, n_folds):
            #Change the parameters of the model
            model.set_params(params[j])
            #Train the model with the fold
            model.fit(X.loc[folds[k]], y.loc[folds[k]])
            #Predict using the first fold as test set
            prediction = X.loc[folds[1]].apply(lambda row: model.predict(row), axis=1)
            #Store the accuracy of the fold
            acc.append(sum(prediction == y.loc[folds[1]]) / len(y.loc[folds[1]]))
        #Compute the accuracy over all the folds for the set of parameters
        accuracy.append(sum(acc) / len(acc))
        #Print the results for the set of parameters
        print(f"Parameters: {params[j]}, Accuracy: {accuracy[j]}")
        #Change the seed to create different folds
        seed = seed + 1
    #Select and print the set of best hyperparameters
    max_value = max(accuracy)
    max_index = accuracy.index(max_value)
    print('Best hyperparameters:')
    print(params[max_index])

This process was done with both datasets, with the following results:
Wine dataset:
Best hyperparameters: {'min_samples_split': 4, 'max_depth': None}
#Defining hyperparameters
param_grid = {'min_samples_split': [2, 3, 4, 5], 'max_depth': [2, 3, 4, None],
              'seed': [2]}
#Initialization of the model
tree_wine = TreeNode()
#Cross-validation to find the best hyperparameters
crossvalidation(X_wine_train, y_wine_train, tree_wine, 5, param_grid, seed=12)

Iris dataset:
Best hyperparameters: {'min_samples_split': 4, 'max_depth': 3}
Using the best hyperparameters found in Section 3.3, one model was trained for each dataset. These models were used to predict the target variable in the test set, and the predictions were compared with the real values of the target.
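For the Wine model, this step might look like the following sketch (assuming the variable names from the earlier sketches; the exact code in the original script may differ):

#Hypothetical final training and evaluation for the Wine model
tree_wine = TreeNode(min_samples_split=4, max_depth=None, seed=2)
tree_wine.fit(X_wine_train, y_wine_train)
#Predict each test observation and compute the accuracy
predictions = X_wine_test.apply(lambda row: tree_wine.predict(row), axis=1)
accuracy = (predictions == y_wine_test).mean()
print(f"Wine test accuracy: {accuracy:.2f}")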
This section gives a detailed discussion of the results of each model.
The trained model was used to predict the labels of the test set, resulting in 83% of the observations being labelled correctly. However, there were differences in how the model predicted the labels for each class, as shown in Table 3.1.
According to this table, while the precision is similar across the three classes, the recall varies widely, from 1.00 (class 0) to 0.57 (class 2). Therefore, the main problem of the model occurs when predicting the label of class 2, where several false negatives were encountered. This can also be seen in the confusion matrix (Table 3.2).
library(kableExtra)
temp<-data.frame(Label=c(0,1,2),
Precision=c(0.83,0.82,0.89),
Recall=c(1.00,0.86,0.57))
temp %>% kbl(caption = "Evaluation metrics for the wine model") %>%
kable_paper(full_width = F) %>%
column_spec(1, width = "8em") %>%
column_spec(2, width = "8em") %>%
column_spec(3, width = "8em")

| Label | Precision | Recall |
|---|---|---|
| 0 | 0.83 | 1.00 |
| 1 | 0.82 | 0.86 |
| 2 | 0.89 | 0.57 |
temp<-data.frame("Y"=c("True labels","True labels","True labels"),
"X"=c(0,1,2),
"A"=c(19,2,2),
"B"=c(0,18,4),
"C"=c(0,1,8))
temp %>% kbl(caption = "Confusion matrix for the wine model",
col.names = c("","","1","2","3")) %>%
kable_paper(full_width = F) %>%
add_header_above(c("","","Predicted labels"=3)) %>%
column_spec(1, bold = TRUE) %>%
column_spec(2, bold = TRUE, width = "3em") %>%
column_spec(3, width = "6em", border_left = T, border_right = T) %>%
column_spec(4, width = "6em", border_left = T, border_right = T) %>%
column_spec(5, width = "6em", border_left = T, border_right = T) %>%
collapse_rows(columns = 1, latex_hline = "major", valign = "middle")

|  |  | Predicted 1 | Predicted 2 | Predicted 3 |
|---|---|---|---|---|
| True labels | 0 | 19 | 0 | 0 |
|  | 1 | 2 | 18 | 1 |
|  | 2 | 2 | 4 | 8 |
A possible explanation for the low recall in class 2 could be that this class has characteristics similar to those of the other two classes in the training set. Therefore, the algorithm assigned some of those characteristics to classes 0 and 1, resulting in some false negatives for class 2.
When the trained model was used to predict the labels in the test set, 100% of the observations were correctly predicted. Additionally, Table 3.3 and Table 3.4 show no errors in the prediction of \(Y\) in the test set. This could be due to the simplicity of the data, where each class may have very distinct characteristics.
temp<-data.frame(Label=c(0,1,2),
Precision=c(1.00,1.00,1.00),
Recall=c(1.00,1.00,1.00))
temp %>% kbl(caption = "Evaluation metrics for the iris model") %>%
kable_paper(full_width = F) %>%
column_spec(1, width = "8em") %>%
column_spec(2, width = "8em") %>%
column_spec(3, width = "8em")

| Label | Precision | Recall |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 1 | 1 |
| 2 | 1 | 1 |
temp<-data.frame("Y"=c("True labels","True labels","True labels"),
"X"=c(0,1,2),
"A"=c(19,0,0),
"B"=c(0,13,0),
"C"=c(0,0,13))
temp %>% kbl(caption = "Confusion matrix for the iris model",
col.names = c("","","1","2","3")) %>%
kable_paper(full_width = F) %>%
add_header_above(c("","","Predicted labels"=3)) %>%
column_spec(1, bold = TRUE) %>%
column_spec(2, bold = TRUE, width = "3em") %>%
column_spec(3, width = "6em", border_left = T, border_right = T) %>%
column_spec(4, width = "6em", border_left = T, border_right = T) %>%
column_spec(5, width = "6em", border_left = T, border_right = T) %>%
collapse_rows(columns = 1, latex_hline = "major", valign = "middle")

|  |  | Predicted 1 | Predicted 2 | Predicted 3 |
|---|---|---|---|---|
| True labels | 0 | 19 | 0 | 0 |
|  | 1 | 0 | 13 | 0 |
|  | 2 | 0 | 0 | 13 |
Because the Iris model did not make any errors, only the Wine model was compared with a widely distributed implementation, which in this case is the decision tree of scikit-learn.
The same process described in Section 3 was applied using scikit-learn, which implies some differences. Firstly, the GridSearchCV function from scikit-learn was used to find the best hyperparameters (which were slightly different from the ones found in Section 3.3). Then, the model was trained using the best hyperparameters and the training split, but this time using DecisionTreeClassifier from scikit-learn. It is worth mentioning that entropy was used as the splitting criterion for comparison purposes.
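A sketch of this comparison is shown below (the hyperparameter grid, random state and test-set variable names are assumptions based on the earlier sections, not the exact code of the original script):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

#Hypothetical scikit-learn comparison; grid and random_state are assumptions
param_grid = {'min_samples_split': [2, 3, 4, 5], 'max_depth': [2, 3, 4, None]}
sk_tree = DecisionTreeClassifier(criterion='entropy', random_state=2)
search = GridSearchCV(sk_tree, param_grid, cv=5, scoring='accuracy')
search.fit(X_wine_train, y_wine_train)
print(search.best_params_)
print(search.best_estimator_.score(X_wine_test, y_wine_test))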
The accuracy of this model was the same as our model's, even having the same evaluation metrics (Table 3.5) and confusion matrix (Table 3.6). This only shows the correctness of the algorithm implementation; it does not mean that both implementations are equal. The scikit-learn implementation is much more flexible, and it is based on a more efficient algorithm than ID3 or C4.5.
library(kableExtra)
temp<-data.frame(Label=c(0,1,2),
Precision=c(0.83,0.82,0.89),
Recall=c(1.00,0.86,0.57))
temp %>% kbl(caption = "Evaluation metrics for the wine model using scikit-learn") %>%
kable_paper(full_width = F) %>%
column_spec(1, width = "8em") %>%
column_spec(2, width = "8em") %>%
column_spec(3, width = "8em")

| Label | Precision | Recall |
|---|---|---|
| 0 | 0.83 | 1.00 |
| 1 | 0.82 | 0.86 |
| 2 | 0.89 | 0.57 |
temp<-data.frame("Y"=c("True labels","True labels","True labels"),
"X"=c(0,1,2),
"A"=c(19,2,2),
"B"=c(0,18,4),
"C"=c(0,1,8))
temp %>% kbl(caption = "Confusion matrix for the wine model using scikit-learn",
col.names = c("","","1","2","3")) %>%
kable_paper(full_width = F) %>%
add_header_above(c("","","Predicted labels"=3)) %>%
column_spec(1, bold = TRUE) %>%
column_spec(2, bold = TRUE, width = "3em") %>%
column_spec(3, width = "6em", border_left = T, border_right = T) %>%
column_spec(4, width = "6em", border_left = T, border_right = T) %>%
column_spec(5, width = "6em", border_left = T, border_right = T) %>%
collapse_rows(columns = 1, latex_hline = "major", valign = "middle")

|  |  | Predicted 1 | Predicted 2 | Predicted 3 |
|---|---|---|---|---|
| True labels | 0 | 19 | 0 | 0 |
|  | 1 | 2 | 18 | 1 |
|  | 2 | 2 | 4 | 8 |
This report presented the implementation of a simplified version of the C4.5 algorithm, whose construction is based on entropy, the core concept of information theory. Additionally, this implementation added some hyperparameters that can be used to avoid overfitting, such as the maximum depth of the tree and the minimum sample size required to split a node. However, this algorithm still cannot be considered equal to the C4.5 algorithm because it lacks some functionalities (such as pruning and missing-value handling). Moreover, the algorithm is not yet robust enough to be used widely, due to the restrictions on the input format and the lack of error handling.
Despite these constraints, it was demonstrated that the algorithm can handle numeric and string-type data with successful results on two toy datasets, reaching an accuracy of at least 83%. Moreover, the algorithm produces the same results as the decision tree from scikit-learn when entropy is used as the criterion and the hyperparameters are limited. Still, there is a lot of potential for improvement, starting with the speed of the algorithm, which is much slower than the scikit-learn implementation.
Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., … Steinberg, D. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1–37. https://doi.org/10.1007/s10115-007-0114-2