CSCI/DASC 6020: Written Assignment 04

Author

Mir Quddus

Published

October 24, 2023

Assignment Goal

The goal of this assignment is to demonstrate your understanding of how the concepts from information theory can be used to build machine learning models for predictive tasks.

1 Calculating Entropy

The figure below shows a set of eight Scrabble pieces.

  1. What is the entropy (in bits) of the letters in this set?

    The eight tiles spell OXYMORON, a multiset with duplicate elements: three O's plus the five distinct letters X, Y, M, R, and N. (The calculations for parts 1-3 are checked in a short R sketch at the end of this section.)

Entropy of the word OXYMORON: 2.405639 bits 
  1. What would be the reduction in entropy (i.e., the information gain) in bits if we split these letters into two sets, one containing the vowels and the other containing the consonants?

    Information Gain = Original Entropy of the set - Remaining Entropy

information gain due to splitting letters into two sets : 0.954434 bits 
  1. What is the maximum possible entropy (in bits) for a set of eight Scrabble pieces?

    Entropy is maximized when every element in the set is distinct. For a set of eight Scrabble pieces this means eight different letters, so the maximum possible entropy is log2(8) = 3 bits. For comparison, a uniform distribution over the 6 distinct letters of OXYMORON, {O, X, Y, M, R, N}, would have entropy log2(6) = 2.584963 bits.
  1. In general, which is preferable when you are playing Scrabble: a set of letters with high entropy, or a set of letters with low entropy?

    A set with high entropy is preferable. High entropy means the letters are mostly distinct, which increases the chance of combining them into useful words; with low entropy the set contains duplicates, so fewer of the letters are useful. Thus, I prefer high entropy to increase my chances of winning.
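A minimal R check of parts 1-3, assuming (as above) that the eight tiles spell OXYMORON and treating Y as a consonant:

letters_set <- c("O", "X", "Y", "M", "O", "R", "O", "N")

entropy <- function(x) {
  p <- table(x) / length(x)
  -sum(p * log2(p))
}

# Part 1: entropy of the full multiset
entropy(letters_set)                      # 2.405639 bits

# Part 2: information gain from splitting into vowels and consonants
is_vowel   <- letters_set %in% c("A", "E", "I", "O", "U")
vowels     <- letters_set[is_vowel]
consonants <- letters_set[!is_vowel]
remaining  <- (length(vowels) / 8) * entropy(vowels) +
              (length(consonants) / 8) * entropy(consonants)
entropy(letters_set) - remaining          # 0.954434 bits

# Part 3: maximum entropy for eight tiles (all letters distinct)
log2(8)                                   # 3 bits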

2 Decision Tree Construction

A convicted criminal who reoffends after release is known as a recidivist. The Table below lists a dataset that describes prisoners released on parole, and whether they reoffended within two years of release.

ID Good Behavior Age Below 30 Drug Dependent Recidivist
1 false true false true
2 false false false false
3 false true false true
4 true false false false
5 true false true true
6 true false false false
Prisoners on parole.

This dataset lists six instances where prisoners were granted parole. Each of these instances are described in terms of three binary descriptive features (Good Behavior, Age Below 30, Drug Dependent) and a binary target feature, Recidivist. The Good Behavior feature has a value of true if the prisoner had not committed any infringements during incarceration, the Age Below 30 has a value of true if the prisoner was under 30 years of age when granted parole, and the Drug Dependent feature is true if the prisoner had a drug addiction at the time of parole. The target feature, Recidivist, has a true value if the prisoner was arrested within two years of being released; otherwise it has a value of false.

  1. Using this dataset, construct the decision tree that would be generated by the ID3 algorithm, using entropy-based information gain.

# Calculate the entropy of the entire dataset with respect to the target feature Recidivist (c6)
p_recid <- sum(df1$c6 == "true") / nrow(df1)
p_not_recid <- sum(df1$c6 == "false") / nrow(df1)
entropy_original <- -p_recid * log2(p_recid) - p_not_recid * log2(p_not_recid)

# Entropy of the target within a subset of rows
subset_entropy <- function(target) {
  p <- table(target) / length(target)
  -sum(p * log2(p))
}

# Remaining (weighted) entropy of the target after splitting on a descriptive feature
remaining_entropy <- function(feature, target) {
  sum(sapply(unique(feature), function(v) {
    idx <- feature == v
    (sum(idx) / length(feature)) * subset_entropy(target[idx])
  }))
}

# Information Gain of the "good behavior" attribute (c3)
information_gain_good_behavior <- entropy_original - remaining_entropy(df1$c3, df1$c6)

# Information Gain of the "age below 30" attribute (c4)
information_gain_below_30 <- entropy_original - remaining_entropy(df1$c4, df1$c6)

# Information Gain of the "drug dependent" attribute (c5)
information_gain_drug_dependent <- entropy_original - remaining_entropy(df1$c5, df1$c6)

# Create a named vector of Information Gain values for each attribute
information_gains <- c(
  "Good Behavior" = information_gain_good_behavior,
  "Age Below 30" = information_gain_below_30,
  "Drug Dependent" = information_gain_drug_dependent
)

# Print the Information Gain values with attribute names in two columns
cat("Attribute Name\tInformation Gain\n")
Attribute Name  Information Gain
for (attr_name in names(information_gains)) {
  cat(attr_name, "\t", information_gains[attr_name], "\n")
}
Good Behavior    0.08170417 
Age Below 30     0.4591479 
Drug Dependent   0.1908745 
# Find the attribute with the highest Information Gain
highest_ig_attribute <- names(information_gains[which.max(information_gains)])

# Report the attribute with the highest IG
cat('we will use the "', highest_ig_attribute, '" as the root node', '\n')
we will use the " Age Below 30 " as the root node 

Splitting on Age Below 30 gives a pure partition for Age Below 30 = true (both instances, d1 and d3, are recidivists). For Age Below 30 = false (d2, d4, d5, d6), repeating the information gain calculation on that subset selects Drug Dependent, which separates the remaining instances perfectly. The resulting ID3 tree is:

Age Below 30 = true  -> Recidivist = true
Age Below 30 = false and Drug Dependent = true  -> Recidivist = true
Age Below 30 = false and Drug Dependent = false -> Recidivist = false
  1. What prediction will the decision tree generated in part (a) of this question return for the following query?

    Good Behavior = false, Age Below 30 = false, Drug Dependent = true

    Following the decision tree developed above (Age Below 30 = false, then Drug Dependent = true), the prediction is Recidivist = true.

  1. What prediction will the decision tree generated in part (a) of this question return for the following query?

    Good Behavior = true, Age Below 30 = true, Drug Dependent = false

    Following the decision tree developed above (Age Below 30 = true), the prediction is Recidivist = true.
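A minimal R sketch of the tree derived in part (a), applied to the two queries above (the function name classify_recidivist is illustrative):

# ID3 tree from part (a): root = Age Below 30, then Drug Dependent
classify_recidivist <- function(good_behavior, age_below_30, drug_dependent) {
  if (age_below_30) {
    TRUE                   # Age Below 30 = true   -> Recidivist = true
  } else if (drug_dependent) {
    TRUE                   # Drug Dependent = true -> Recidivist = true
  } else {
    FALSE                  # otherwise             -> Recidivist = false
  }
}

classify_recidivist(good_behavior = FALSE, age_below_30 = FALSE, drug_dependent = TRUE)   # TRUE
classify_recidivist(good_behavior = TRUE,  age_below_30 = TRUE,  drug_dependent = FALSE)  # TRUE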

3 Information Gain

The Table below lists a sample of data from a census.

ID Age Education Marital Status Occupation Annual Income
1 39 bachelors never married transport 25K-50K
2 50 bachelors married professional 25K-50K
3 18 high school never married agriculture below 25K
4 28 bachelors married professional 25K-50K
5 37 high school married agriculture 25K-50K
6 24 high school never married armed forces below 25K
7 52 high school divorced transport 25K-50K
8 40 doctorate married professional over 50K
Census data.

There are four descriptive features and one target feature in this dataset:

  • Age, a continuous feature listing the age of the individual.
  • Education, a categorical feature listing the highest education award achieved by the individual (high school, bachelors, doctorate).
  • Marital Status (never married, married, divorced).
  • Occupation (transport = works in the transportation industry; professional = doctors, lawyers, etc.; agriculture = works in the agricultural industry; armed forces = is a member of the armed forces).
  • Annual Income, the target feature with 3 levels (<25K, 25K-50K, >50K).
  1. Calculate the entropy for this dataset.

    The entropy is computed with respect to the target feature, Annual Income, which takes the values below 25K (2 instances), 25K-50K (5 instances), and over 50K (1 instance).

entropy of the total dataset:  1.298795 
  1. Calculate the Gini index for this dataset.

    The Gini index is calculated as gini_index = 1 - sum(p_i^2), where the sum runs over the levels of the target feature.

gini index of the total dataset:  0.53125 
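A minimal R sketch of both calculations, with the income column transcribed from the table above:

income <- c("25K-50K", "25K-50K", "below 25K", "25K-50K",
            "25K-50K", "below 25K", "25K-50K", "over 50K")
p <- table(income) / length(income)

-sum(p * log2(p))   # entropy = 1.298795 bits
1 - sum(p^2)        # Gini index = 0.53125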
  1. When building a decision tree, the easiest way to handle a continuous feature is to define a threshold around which splits will be made. What would be the optimal threshold to split the continuous Age feature (use information gain based on entropy as the feature selection measure)?

    From the table we can see that the target feature Annual Income has three levels: below 25K, 25K-50K, and over 50K. After sorting the table by Age, we identify candidate split thresholds at the points where the target label changes between consecutive instances:

Sorted Census data.
ID Age Education Marital Status Occupation Annual Income
3 18 high school never married agriculture below 25K
6 24 high school never married armed forces below 25K
4 28 bachelors married professional 25K-50K
5 37 high school married agriculture 25K-50K
1 39 bachelors never married transport 25K-50K
8 40 doctorate married professional over 50K
2 50 bachelors married professional 25K-50K
7 52 high school divorced transport 25K-50K
Now we can locate the transitions of the target feature with respect to Age: at each point where the target label changes, the candidate threshold is the midpoint of the two adjacent ages:

below 25K -> 25K-50K boundary: (24 + 28) / 2 = 26
25K-50K -> over 50K boundary:  (39 + 40) / 2 = 39.5
over 50K -> 25K-50K boundary:  (40 + 50) / 2 = 45

From the sorted table, each candidate threshold partitions the instances into two subsets:

Threshold 26:
        D1 (Age < 26)  = d3, d6
        D2 (Age >= 26) = d4, d5, d1, d8, d2, d7
Threshold 39.5:
        D3 (Age < 39.5)  = d3, d6, d4, d5, d1
        D4 (Age >= 39.5) = d8, d2, d7
Threshold 45:
        D5 (Age < 45)  = d3, d6, d4, d5, d1, d8
        D6 (Age >= 45) = d2, d7
part_entropy_D1= 0 bits 
part_entropy_D2= 0.6500224 bits 
part_entropy_D3= 0.9709506 bits 
part_entropy_D4= 0.9182958 bits 
part_entropy_D5= 1.459148 bits 
part_entropy_D6= 0 bits 
The remaining entropy after a split is calculated as Sum(proportion of subset * entropy of subset):
remaining_entropy_D1&2= 0.4875168 bits 
remaining_entropy_D3&4= 0.9512051 bits 
remaining_entropy_D5&6= 1.094361 bits 
Information gain = original entropy - remaining entropy:
information_gain (threshold 26)   = 1.298795 - 0.4875168 = 0.8112781 bits 
information_gain (threshold 39.5) = 1.298795 - 0.9512051 = 0.3475899 bits 
information_gain (threshold 45)   = 1.298795 - 1.094361  = 0.204434 bits 
The largest information gain comes from the split at age 26, so 26 is the optimal threshold for the continuous Age feature.
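A minimal R sketch of this threshold search, with ages and incomes transcribed from the table above:

age    <- c(39, 50, 18, 28, 37, 24, 52, 40)
income <- c("25K-50K", "25K-50K", "below 25K", "25K-50K",
            "25K-50K", "below 25K", "25K-50K", "over 50K")

entropy <- function(x) { p <- table(x) / length(x); -sum(p * log2(p)) }

info_gain_at <- function(threshold) {
  below <- income[age < threshold]
  above <- income[age >= threshold]
  remaining <- (length(below) / length(age)) * entropy(below) +
               (length(above) / length(age)) * entropy(above)
  entropy(income) - remaining
}

sapply(c(26, 39.5, 45), info_gain_at)   # 0.8112781 0.3475899 0.2044340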
  1. Calculate information gain (based on entropy) for the Education, Marital Status, and Occupation features.

    Information gain is calculated with the same formula, IG = original entropy - remaining entropy. The partition entropies and remaining entropies for these three categorical features are worked through in detail under the next question (which also requires them for the gain ratio); the resulting information gains are:

information gain for education = 0.7987949 bits 
information gain for marital status = 0.5487949 bits 
information gain for occupation = 0.704434 bits 
  1. Calculate the information gain ratio (based on entropy) for Education, Marital Status, and Occupation features.

    We need to calculate the information gain ratio for education. In order to do this, we need to follow the same steps as above:

We need to sort the table by education column:
Sorted Census data.
ID Age Education Marital Status Occupation Annual Income
1 39 bachelors never married transport 25K-50K
2 50 bachelors married professional 25K-50K
4 28 bachelors married professional 25K-50K
8 40 doctorate married professional over 50K
3 18 high school never married agriculture below 25K
5 37 high school married agriculture 25K-50K
6 24 high school never married armed forces below 25K
7 52 high school divorced transport 25K-50K

We find the following information from the table:

Education:
HS:     d3, d5, d6, d7
BS:     d1, d2, d4
PhD.:   d8
part_entropy_HS= 1 bits 
part_entropy_BS= 0 bits 
part_entropy_PhD= 0 bits 
  we need to calculate remaining entropy for Education:
remaining entropy for education = 0.5 bits 
Information Gain for Education:
information gain for education = 0.7987949 bits 
We need to calculate the information gain ratio for Marital Status. In order to do this, we need to follow the same steps as above: 

We need to calculate the Information Gain for Marital Status:

Sorted Census data.
ID Age Education Marital Status Occupation Annual Income
7 52 high school divorced transport 25K-50K
2 50 bachelors married professional 25K-50K
4 28 bachelors married professional 25K-50K
5 37 high school married agriculture 25K-50K
8 40 doctorate married professional over 50K
1 39 bachelors never married transport 25K-50K
3 18 high school never married agriculture below 25K
6 24 high school never married armed forces below 25K

from the above table we can find the following categories:
Divorced (DV):          d7
Married (MR):           d2, d4, d5, d8 
Never Married (NM):     d1, d3, d6
part_entropy_DV= 0 bits 
part_entropy_MR= 0.8112781 bits 
part_entropy_NM= 0.9182958 bits 
calculate remaining entropy of Marital Status:
remaining_entropy_mar= 0.75 bits 
Information Gain for Marital Status:
information gain for marital status = 0.5487949 bits 
We need to calculate the information gain ratio for Occupation. In order to do this, we need to follow the same steps as above: 

We need to calculate Information Gain for Occupation:

Sorted Census data.
ID Age Education Marital Status Occupation Annual Income
3 18 high school never married agriculture below 25K
5 37 high school married agriculture 25K-50K
6 24 high school never married armed forces below 25K
2 50 bachelors married professional 25K-50K
4 28 bachelors married professional 25K-50K
8 40 doctorate married professional over 50K
1 39 bachelors never married transport 25K-50K
7 52 high school divorced transport 25K-50K

from the above table we can find the following categories:
Agriculture (AG):         d3, d5
Armed Forces (AF):        d6 
Professional (PF):        d2, d4, d8
Transport (TP):           d1, d7
part_entropy_AG= 1 bits 
part_entropy_AF= 0 bits 
part_entropy_PF= 0.9182958 bits 
part_entropy_TP= 0 bits 
calculate remaining entropy of Occupation:
remaining_entropy_occ= 0.5943609 bits 
Information Gain for Occupation:
information gain for occupation = 0.704434 bits 
Information gain ratio = information gain / split information, where the split information is the entropy of the feature's own value distribution (1.405639 bits for Education, 1.405639 bits for Marital Status, and 1.905639 bits for Occupation); the ratio itself is dimensionless. Now we can calculate the IG ratio for Education, Marital Status and Occupation:
information gain ratio for education = 0.5682788 
information gain ratio for marital status = 0.3904238 
information gain ratio for occupation = 0.3696576 
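A minimal R sketch of the gain-ratio calculation (feature vectors transcribed from the census table; split information is the entropy of the feature's own distribution):

income     <- c("25K-50K", "25K-50K", "below 25K", "25K-50K",
                "25K-50K", "below 25K", "25K-50K", "over 50K")
education  <- c("bachelors", "bachelors", "high school", "bachelors",
                "high school", "high school", "high school", "doctorate")
marital    <- c("never married", "married", "never married", "married",
                "married", "never married", "divorced", "married")
occupation <- c("transport", "professional", "agriculture", "professional",
                "agriculture", "armed forces", "transport", "professional")

entropy <- function(x) { p <- table(x) / length(x); -sum(p * log2(p)) }

gain_ratio <- function(feature) {
  remaining <- sum(sapply(unique(feature), function(v) {
    idx <- feature == v
    (sum(idx) / length(feature)) * entropy(income[idx])
  }))
  info_gain  <- entropy(income) - remaining
  split_info <- entropy(feature)    # entropy of the feature's own distribution
  c(info_gain = info_gain, gain_ratio = info_gain / split_info)
}

gain_ratio(education)    # info_gain 0.7987949, gain_ratio 0.5682788
gain_ratio(marital)      # info_gain 0.5487949, gain_ratio 0.3904238
gain_ratio(occupation)   # info_gain 0.7044340, gain_ratio 0.3696576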
  1. Calculate information gain using the Gini index for the Education, Marital Status, and Occupation features.

    Gini index of the entire dataset, from part (b): 0.53125.

Education:
HS:     d3, d5, d6, d7
BS:     d1, d2, d4
PhD:    d8
partition gini index for HS = 0.5 
partition gini index for BS = 0 
partition gini index for PhD = 0 
Marital Status:
Divorced (DV):          d7
Married (MR):           d2, d4, d5, d8 
Never Married (NM):     d1, d3, d6
partition gini index for DV = 0 
partition gini index for MR = 0.375 
partition gini index for NM = 0.4444444 
Occupation:
Agriculture (AG):         d3, d5
Armed Forces (AF):        d6 
Professional (PF):        d2, d4, d8
Transport (TP):           d1, d7
partition gini index for AG = 0.5 
partition gini index for AF = 0 
partition gini index for PF = 0.4444444 
partition gini index for TP = 0 
Calculate the remaining Gini index and the Gini-based information gain for Education: 
# proportion of instances in each partition
P_HS <- (4/8)
P_BS <- (3/8)
P_PhD <- (1/8)

# partition Gini indices for Education, computed above
gini_HS <- 0.5
gini_BS <- 0
gini_PhD <- 0

remaining_gini_edu <- sum((P_HS * gini_HS), (P_BS * gini_BS), (P_PhD * gini_PhD))
IG_gini_edu <- 0.53125 - remaining_gini_edu

# Print the remaining Gini index and the information gain
cat('remaining gini index for education =', remaining_gini_edu, '\n')
remaining gini index for education = 0.25 
cat('information gain from education =', IG_gini_edu, '\n')
information gain from education = 0.28125 
Applying the same calculation to Marital Status and Occupation:

remaining gini index for marital status = (1/8)(0) + (4/8)(0.375) + (3/8)(0.4444444) = 0.3541667
information gain from marital status = 0.53125 - 0.3541667 = 0.1770833
remaining gini index for occupation = (2/8)(0.5) + (1/8)(0) + (3/8)(0.4444444) + (2/8)(0) = 0.2916667
information gain from occupation = 0.53125 - 0.2916667 = 0.2395833

Information gain (Gini) by Education, Marital Status and Occupation: Education = 0.28125, Occupation = 0.2395833, Marital Status = 0.1770833. 
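A minimal R sketch of the Gini-based information gain for all three features (feature vectors transcribed from the census table):

income     <- c("25K-50K", "25K-50K", "below 25K", "25K-50K",
                "25K-50K", "below 25K", "25K-50K", "over 50K")
education  <- c("bachelors", "bachelors", "high school", "bachelors",
                "high school", "high school", "high school", "doctorate")
marital    <- c("never married", "married", "never married", "married",
                "married", "never married", "divorced", "married")
occupation <- c("transport", "professional", "agriculture", "professional",
                "agriculture", "armed forces", "transport", "professional")

gini <- function(x) { p <- table(x) / length(x); 1 - sum(p^2) }

gini_gain <- function(feature) {
  remaining <- sum(sapply(unique(feature), function(v) {
    idx <- feature == v
    (sum(idx) / length(feature)) * gini(income[idx])
  }))
  gini(income) - remaining
}

gini_gain(education)    # 0.28125
gini_gain(marital)      # 0.1770833
gini_gain(occupation)   # 0.2395833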

4 Decision Tree Error Pruning

The figure below shows a decision tree for predicting heart disease. The descriptive features in this domain describe whether the patient suffers from chest pain (Chest Pain) and the patient's blood pressure (Blood Pressure). The binary target feature is Heart Disease. The table below the diagram lists a pruning set from this domain.

ID Chest Pain Blood Pressure Heart Disease
1 false high false
2 true low true
3 false low false
4 true high true
5 false high false
Pruning set.

Using the pruning set, apply reduced error pruning to the decision tree. Assume that the algorithm is applied in a bottom-up, left-to-right fashion. For each iteration of the algorithm, indicate the subtrees considered as pruning candidates, explain why the algorithm chooses to prune or leave these subtrees in the tree, and illustrate the tree that results from each iteration.

Option 1: Using the decision tree given above, we compare its predictions against the ground truth from the pruning set:

Chest Pain (false), Blood Pressure (high): 2 misclassifications
Chest Pain (false), Blood Pressure (low):  0 misclassifications
Total misclassifications = 2

So the tree can be pruned at the Blood Pressure level, given below:

Option 2:
If we choose to prune further to the root node:
We will have misclassification error: 3

Thus, we will opt to prune the tree as per Option 1. 

5 Random Forest

The Table below lists a dataset containing the details of five participants in a heart disease study. The target feature Risk describes their risk of heart disease. Each patient is described in terms of four binary descriptive features:

  • Exercise - how regularly do they exercise
  • Smoker - do they smoke
  • Obese - are they overweight
  • Family - did any of their parents or siblings suffer from heart disease
ID Exercise Smoker Obese Family Risk
1 daily false false yes low
2 weekly true false yes high
3 daily false false no low
4 rarely true true yes high
5 rarely true true no high
Heart disease study dataset.

As part of the study, the researchers have decided to create a predictive model to screen participants based on their risk of heart disease. You have been asked to implement this screening model using a random forest. The tables below list three bootstrap samples that have been generated from the above dataset.

ID Exercise Family Risk
1 daily yes low
2 weekly yes high
2 weekly yes high
5 rarely yes high
5 rarely no high
Heart disease study dataset bootstrap sample A.
ID Smoker Obese Risk
1 false false low
2 true false high
2 true false high
4 true true high
5 true true high
Heart disease study dataset bootstrap sample B.
ID Obese Family Risk
1 false yes low
1 false yes low
2 false yes high
4 true yes high
5 true no high
Heart disease study dataset bootstrap sample C.
  1. Using these bootstrap samples create the decision trees that will be in the random forest model (use entropy based information gain as the feature selection criterion).

    In order to build the decision trees, we first calculate the entropy of the dataset with respect to the target feature Risk, which has two levels: low (2 instances) and high (3 instances).

total entropy of risk 0.9709506 bits 
Now we calculate the partition entropies for each attribute, beginning with Exercise:

Exercise:
Sort the table by Exercise column:
Heart disease study dataset sorted by Exercise.
ID Exercise Smoker Obese Family Risk
1 daily false false yes low
3 daily false false no low
4 rarely true true yes high
5 rarely true true no high
2 weekly true false yes high

Exercise (from the table): 
Daily (D) = d1, d3
Weekly (W) = d2
Rarely (R) = d4, d5

Now, calculate partition entropy of Exercise:
entropy of the exercise daily = 0 bits 
entropy of the exercise weekly = 0 bits 
entropy of the exercise rarely = 0 bits 
Smoker:
Sort the table by Smoker column:
Heart disease study dataset sorted by Smoker.
ID Exercise Smoker Obese Family Risk
1 daily false false yes low
3 daily false false no low
2 weekly true false yes high
4 rarely true true yes high
5 rarely true true no high

Smoker (from the table): 
smoker_false (SF) = d1, d3
smoker_true (ST) = d2, d4, d5

Now, calculate partition entropy of Smoker:
entropy of the smoker true = 0 bits 
entropy of the smoker false = 0 bits 
Obese:
Sort the table by Obese column:
Heart disease study dataset sorted by Obese.
ID Exercise Smoker Obese Family Risk
1 daily false false yes low
2 weekly true false yes high
3 daily false false no low
4 rarely true true yes high
5 rarely true true no high

Obese (from the table): 
obese_true (OT) = d4, d5
obese_false (OF) = d1, d2, d3

Now, calculate partition entropy of Obese:
entropy of the obese true = 0 bits 
entropy of the obese false = 0.9182958 bits 
Family:
Sort the table by Family column:
Heart disease study dataset sorted by Family.
ID Exercise Smoker Obese Family Risk
3 daily false false no low
5 rarely true true no high
1 daily false false yes low
2 weekly true false yes high
4 rarely true true yes high

Family (from the table): 
family_yes (FY) = d1, d2, d4
family_no (FN) = d3, d5

Now, calculate partition entropy of Family:
entropy of the family yes = 0.9182958 bits 
entropy of the family no = 1 bits 
Now, calculate the remaining entropy using this formula, Sum(probability of subsets * entropy of subsets)
Information Gain (IG) = Total entropy - remaining entropy
remaining_entropy_exercise= 0 bits 
information_gain_exercise= 0.9709506 bits 
Now, calculate the remaining entropy and information gain for Smoker:
remaining_entropy_smoker = 0 bits 
information_gain_smoker = 0.9709506 bits 
Now, calculate the remaining entropy and information gain for Obese:
remaining_entropy_obese = 0.5509775 bits 
information_gain_obese = 0.4199731 bits 
Now, calculate the remaining entropy and information gain for Family:
remaining_entropy_family = 0.9509775 bits 
information_gain_family = 0.01997309 bits 

Now, we compare the IG from these attributes:

information gain for exercise = 0.9709506 bits 
information gain for smoker = 0.9709506 bits 
information gain for obese = 0.4199731 bits 
information gain for family = 0.01997309 bits 
Each bootstrap sample contains only two of the descriptive features, so the root of each tree is the available feature with the higher information gain:

Bootstrap sample A (Exercise, Family): root = Exercise. The split is pure: daily -> low; weekly -> high; rarely -> high.

Bootstrap sample B (Smoker, Obese): root = Smoker. The split is pure: false -> low; true -> high.

Bootstrap sample C (Obese, Family): root = Obese. Obese = true -> high; Obese = false covers d1, d1, d2 (two low, one high), and Family cannot separate these rows (all have Family = yes), so that leaf predicts the majority class, low.
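Strictly speaking, each tree in the random forest is grown from its own bootstrap sample, so the information gain should be computed within each sample. The minimal R sketch below does this with the sample data transcribed from the tables above; it selects the same root features (Exercise for A, Smoker for B, Obese for C):

entropy <- function(x) { p <- table(x) / length(x); -sum(p * log2(p)) }

info_gain <- function(feature, target) {
  remaining <- sum(sapply(unique(feature), function(v) {
    idx <- feature == v
    (sum(idx) / length(feature)) * entropy(target[idx])
  }))
  entropy(target) - remaining
}

# Bootstrap sample A (Exercise, Family)
risk_A     <- c("low", "high", "high", "high", "high")
exercise_A <- c("daily", "weekly", "weekly", "rarely", "rarely")
family_A   <- c("yes", "yes", "yes", "yes", "no")
info_gain(exercise_A, risk_A)   # 0.7219281 -> root of tree A
info_gain(family_A, risk_A)     # 0.07290559

# Bootstrap sample B (Smoker, Obese)
risk_B   <- c("low", "high", "high", "high", "high")
smoker_B <- c("false", "true", "true", "true", "true")
obese_B  <- c("false", "false", "false", "true", "true")
info_gain(smoker_B, risk_B)     # 0.7219281 -> root of tree B
info_gain(obese_B, risk_B)      # 0.1709506

# Bootstrap sample C (Obese, Family)
risk_C   <- c("low", "low", "high", "high", "high")
obese_C  <- c("false", "false", "false", "true", "true")
family_C <- c("yes", "yes", "yes", "yes", "no")
info_gain(obese_C, risk_C)      # 0.4199731 -> root of tree C
info_gain(family_C, risk_C)     # 0.1709506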

  1. Assuming the random forest model you have created uses majority voting, what prediction will it return for the following query:

    Exercise=rarely, Smoker=false, Obese=true, Family=yes

    1st tree (sample A): Exercise = rarely -> Risk = high
    2nd tree (sample B): Smoker = false -> Risk = low
    3rd tree (sample C): Obese = true -> Risk = high

    Two of the three trees predict high, so by majority voting the random forest predicts Risk = high for this participant.