CSCI/DASC 6020: Written Assignment 04
Assignment Goal
The goal of this assignment is to demonstrate your understanding of how the concepts from information theory can be used to build machine learning models for predictive tasks.
1 Calculating Entropy
The figure below shows a set of eight Scrabble pieces.

What is the entropy (in bits) of the letters in this set?
The letters form a multiset with duplicates: O appears three times, and X, Y, M, R, N each appear once.
Entropy of the letters in this set: 2.405639 bits
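A quick check of this value in R (the letter frequencies are taken from the word OXYMORON; the variable names are just illustrative):
# Entropy of the letter multiset {O, X, Y, M, O, R, O, N}
letters_set <- c("O", "X", "Y", "M", "O", "R", "O", "N")
p <- table(letters_set) / length(letters_set)    # relative frequency of each letter
entropy_letters <- -sum(p * log2(p))             # Shannon entropy in bits
cat("Entropy of the letters:", entropy_letters, "bits\n")   # 2.405639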
What would be the reduction in entropy (i.e., the information gain) in bits if we split these letters into two sets, one containing the vowels and the other containing the consonants?
Information gain = original entropy of the set - remaining entropy, where the remaining entropy is the weighted average of the entropies of the two subsets (vowels: O, O, O; consonants: X, Y, M, R, N).
Information gain due to splitting the letters into two sets: 0.954434 bits
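The same split can be checked with a short R sketch (the two groups are the vowels and consonants described above; helper names are illustrative):
# Entropy helper (bits)
entropy_bits <- function(x) { p <- table(x) / length(x); -sum(p * log2(p)) }
vowels <- c("O", "O", "O")
consonants <- c("X", "Y", "M", "R", "N")
all_letters <- c(vowels, consonants)
# Remaining entropy = weighted average of the two partition entropies
remaining <- (length(vowels) / 8) * entropy_bits(vowels) +
             (length(consonants) / 8) * entropy_bits(consonants)
information_gain <- entropy_bits(all_letters) - remaining   # 2.405639 - 1.451205
cat("Information gain:", information_gain, "bits\n")        # 0.954434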
What is the maximum possible entropy (in bits) for a set of eight Scrabble pieces?
We get maximum entropy when every element in the set is distinct. For a set of eight Scrabble pieces this means eight different letters, each with probability 1/8.
Maximum possible entropy for a set of eight pieces: log2(8) = 3 bits
In general, which is preferable when you are playing Scrabble: a set of letters with high entropy, or a set of letters with low entropy?
High entropy is preferable. A high-entropy set contains mostly distinct letters, which gives more options for forming useful words. A low-entropy set contains many duplicates, so fewer distinct letters are available. I would therefore prefer high entropy to increase my chances of winning.
2 Decision Tree Construction
A convicted criminal who reoffends after release is known as a recidivist. The Table below lists a dataset that describes prisoners released on parole, and whether they reoffended within two years of release.
| ID | Good Behavior | Age Below 30 | Drug Dependent | Recidivist |
|---|---|---|---|---|
| 1 | false | true | false | true |
| 2 | false | false | false | false |
| 3 | false | true | false | true |
| 4 | true | false | false | false |
| 5 | true | false | true | true |
| 6 | true | false | false | false |
This dataset lists six instances where prisoners were granted parole. Each instance is described in terms of three binary descriptive features (Good Behavior, Age Below 30, Drug Dependent) and a binary target feature, Recidivist. The Good Behavior feature has a value of true if the prisoner committed no infringements during incarceration, the Age Below 30 feature has a value of true if the prisoner was under 30 years of age when granted parole, and the Drug Dependent feature is true if the prisoner had a drug addiction at the time of parole. The target feature, Recidivist, has a value of true if the prisoner was arrested within two years of being released; otherwise it has a value of false.
- Using this dataset, construct the decision tree that would be generated by the ID3 algorithm, using entropy-based information gain.
Based on entropy-based information gain (computed below), Age Below 30 has the highest gain and becomes the root node. The partition with Age Below 30 = true (instances 1 and 3) is pure, so it becomes a Recidivist = true leaf. The partition with Age Below 30 = false (instances 2, 4, 5, 6) is then split on Drug Dependent, whose two partitions are also pure. The resulting ID3 tree is:
- Age Below 30 = true: Recidivist = true
- Age Below 30 = false and Drug Dependent = true: Recidivist = true
- Age Below 30 = false and Drug Dependent = false: Recidivist = false
The supporting calculation in R:
# Build the parole dataset (features and target as logicals)
df1 <- data.frame(
  good_behavior  = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE),
  age_below_30   = c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE),
  drug_dependent = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE),
  recidivist     = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)
# Entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}
# Calculate the entropy of the entire dataset with respect to the target, Recidivist
entropy_original <- entropy(df1$recidivist)   # 3 true / 3 false -> 1 bit
# Remaining entropy after splitting on a feature:
# weighted average of the target entropy within each feature value
remaining_entropy <- function(feature, target) {
  sum(sapply(split(target, feature),
             function(part) (length(part) / length(target)) * entropy(part)))
}
# Information Gain = original entropy - remaining entropy, for each attribute
information_gains <- c(
  "Good Behavior"  = entropy_original - remaining_entropy(df1$good_behavior,  df1$recidivist),
  "Age Below 30"   = entropy_original - remaining_entropy(df1$age_below_30,   df1$recidivist),
  "Drug Dependent" = entropy_original - remaining_entropy(df1$drug_dependent, df1$recidivist)
)
# Print the Information Gain values with attribute names in two columns
cat("Attribute Name\tInformation Gain\n")
for (attr_name in names(information_gains)) {
  cat(attr_name, "\t", information_gains[attr_name], "\n")
}
# Find and report the attribute with the highest Information Gain
highest_ig_attribute <- names(information_gains)[which.max(information_gains)]
cat('we will use the "', highest_ig_attribute, '" as the root node', '\n')
Attribute Name  Information Gain
Good Behavior    0.08170417
Age Below 30     0.4591479
Drug Dependent   0.1908745
we will use the " Age Below 30 " as the root node
What prediction will the decision tree generated in part (a) of this question return for the following query?
Good Behavior = false, Age Below 30 = false, Drug Dependent = true
Following the decision tree developed above: Age Below 30 = false leads to the Drug Dependent test, and Drug Dependent = true gives Recidivist = true, so the tree predicts this person will reoffend.
What prediction will the decision tree generated in part (a) of this question return for the following query?
Good Behavior = true, Age Below 30 = true, Drug Dependent = false
Following the decision tree developed above: Age Below 30 = true leads directly to a Recidivist = true leaf, so the tree predicts this person will reoffend.
3 Information Gain
The Table below lists a sample of data from a census.
| ID | Age | Education | Marital Status | Occupation | Annual Income |
|---|---|---|---|---|---|
| 1 | 39 | bachelors | never married | transport | 25K-50K |
| 2 | 50 | bachelors | married | professional | 25K-50K |
| 3 | 18 | high school | never married | agriculture | below 25K |
| 4 | 28 | bachelors | married | professional | 25K-50K |
| 5 | 37 | high school | married | agriculture | 25K-50K |
| 6 | 24 | high school | never married | armed forces | below 25K |
| 7 | 52 | high school | divorced | transport | 25K-50K |
| 8 | 40 | doctorate | married | professional | over 50K |
There are four descriptive features and one target feature in this dataset:
- Age, a continuous feature listing the age of the individual.
- Education, a categorical feature listing the highest education award achieved by the individual (high school, bachelors, doctorate).
- Marital Status (never married, married, divorced).
- Occupation (transport = works in the transportation industry; professional = doctors, lawyers, etc.; agriculture = works in the agricultural industry; armed forces = is a member of the armed forces).
- Annual Income, the target feature with 3 levels (<25K, 25K-50K, >50K).
Calculate the entropy for this dataset.
The entropy is calculated over the target feature, Annual Income (2 instances of below 25K, 5 of 25K-50K, 1 of over 50K).
entropy of the total dataset: 1.298795 bits
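A short R check of this value (the income values are copied from the table above):
# Entropy of the census dataset with respect to the target, Annual Income
income <- c("25K-50K", "25K-50K", "below 25K", "25K-50K",
            "25K-50K", "below 25K", "25K-50K", "over 50K")
p <- table(income) / length(income)    # 2/8 below 25K, 5/8 25K-50K, 1/8 over 50K
cat("Entropy:", -sum(p * log2(p)), "bits\n")   # 1.298795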
Calculate the Gini index for this dataset.
The Gini index is calculated as gini_index = 1 - sum(p^2), where p are the proportions of the target levels.
gini index of the total dataset: 0.53125
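And the corresponding R check (standalone, with the same income vector as in the previous snippet):
# Gini index of the census dataset: 1 minus the sum of squared class proportions
income <- c("25K-50K", "25K-50K", "below 25K", "25K-50K",
            "25K-50K", "below 25K", "25K-50K", "over 50K")
p <- table(income) / length(income)
cat("Gini index:", 1 - sum(p^2), "\n")   # 0.53125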
When building a decision tree, the easiest way to handle a continuous feature is to define a threshold around which splits will be made. What would be the optimal threshold to split the continuous Age feature (use information gain based on entropy as the feature selection measure)?
From the table we can see that the target feature, Annual Income, has three levels: below 25K, 25K-50K, and over 50K.
After sorting the table by Age, candidate thresholds lie at the points where the target label changes:
| ID | Age | Education | Marital Status | Occupation | Annual Income |
|---|---|---|---|---|---|
| 3 | 18 | high school | never married | agriculture | below 25K |
| 6 | 24 | high school | never married | armed forces | below 25K |
| 4 | 28 | bachelors | married | professional | 25K-50K |
| 5 | 37 | high school | married | agriculture | 25K-50K |
| 1 | 39 | bachelors | never married | transport | 25K-50K |
| 8 | 40 | doctorate | married | professional | over 50K |
| 2 | 50 | bachelors | married | professional | 25K-50K |
| 7 | 52 | high school | divorced | transport | 25K-50K |
The target label changes between ages 24 and 28 (below 25K to 25K-50K), between 39 and 40 (25K-50K to over 50K), and between 40 and 50 (over 50K to 25K-50K). The candidate thresholds are the midpoints of these transitions: (24 + 28)/2 = 26, (39 + 40)/2 = 39.5, and (40 + 50)/2 = 45.
From the sorted table, each candidate threshold produces the following partitions:
Threshold 26:
D1 (Age < 26) = d3, d6
D2 (Age >= 26) = d4, d5, d1, d8, d2, d7
Threshold 39.5:
D3 (Age < 39.5) = d3, d6, d4, d5, d1
D4 (Age >= 39.5) = d8, d2, d7
Threshold 45:
D5 (Age < 45) = d3, d6, d4, d5, d1, d8
D6 (Age >= 45) = d2, d7
part_entropy_D1= 0 bits
part_entropy_D2= 0.6500224 bits
part_entropy_D3= 0.9709506 bits
part_entropy_D4= 0.9182958 bits
part_entropy_D5= 1.459148 bits
part_entropy_D6= 0 bits
Remaining entropy is calculated as the sum over partitions of (partition proportion * partition entropy):
remaining_entropy_D1&2= 0.4875168 bits
remaining_entropy_D3&4= 0.9512051 bits
remaining_entropy_D5&6= 1.094361 bits
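A compact R sketch of this threshold search (ages and incomes copied from the table above; helper names are illustrative):
# Entropy helper (bits)
entropy_bits <- function(x) { p <- table(x) / length(x); -sum(p * log2(p)) }
age    <- c(39, 50, 18, 28, 37, 24, 52, 40)
income <- c("25K-50K", "25K-50K", "below 25K", "25K-50K",
            "25K-50K", "below 25K", "25K-50K", "over 50K")
h_total <- entropy_bits(income)    # 1.298795
for (thr in c(26, 39.5, 45)) {
  below <- income[age < thr]
  above <- income[age >= thr]
  remaining <- (length(below) / 8) * entropy_bits(below) +
               (length(above) / 8) * entropy_bits(above)
  cat("threshold", thr, ": remaining =", remaining, " IG =", h_total - remaining, "\n")
}
# threshold 26   : remaining = 0.4875168  IG = 0.8112781
# threshold 39.5 : remaining = 0.9512051  IG = 0.3475899
# threshold 45   : remaining = 1.094361   IG = 0.204434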
Information gain for each candidate threshold = original entropy - remaining entropy:
information gain for threshold 26 (D1, D2) = 1.298795 - 0.4875168 = 0.8112781 bits
information gain for threshold 39.5 (D3, D4) = 1.298795 - 0.9512051 = 0.3475899 bits
information gain for threshold 45 (D5, D6) = 1.298795 - 1.094361 = 0.204434 bits
The threshold of 26 gives the highest information gain, so it is the optimal split point for the continuous Age feature.
Calculate information gain (based on entropy) for the Education, Marital Status, and Occupation features.
Calculate the information gain ratio (based on entropy) for Education, Marital Status, and Occupation features.
The information gain values computed here answer the previous question as well; dividing each by the feature's split information then gives the gain ratio. Starting with Education, sort the table by the Education column:
| ID | Age | Education | Marital Status | Occupation | Annual Income |
|---|---|---|---|---|---|
| 1 | 39 | bachelors | never married | transport | 25K-50K |
| 2 | 50 | bachelors | married | professional | 25K-50K |
| 4 | 28 | bachelors | married | professional | 25K-50K |
| 8 | 40 | doctorate | married | professional | over 50K |
| 3 | 18 | high school | never married | agriculture | below 25K |
| 5 | 37 | high school | married | agriculture | 25K-50K |
| 6 | 24 | high school | never married | armed forces | below 25K |
| 7 | 52 | high school | divorced | transport | 25K-50K |
We find the following partitions in the table:
Education:
HS: d3, d5, d6, d7
BS: d1, d2, d4
PhD.: d8
part_entropy_HS= 1 bits
part_entropy_BS= 0 bits
part_entropy_PhD= 0 bits
we need to calculate remaining entropy for Education:
remaining entropy for education = 0.5 bits
Information Gain for Education:
information gain for education = 0.7987949 bits
Next, Marital Status: following the same steps, sort the table by the Marital Status column and calculate its information gain:
| ID | Age | Education | Marital Status | Occupation | Annual Income |
|---|---|---|---|---|---|
| 7 | 52 | high school | divorced | transport | 25K-50K |
| 2 | 50 | bachelors | married | professional | 25K-50K |
| 4 | 28 | bachelors | married | professional | 25K-50K |
| 5 | 37 | high school | married | agriculture | 25K-50K |
| 8 | 40 | doctorate | married | professional | over 50K |
| 1 | 39 | bachelors | never married | transport | 25K-50K |
| 3 | 18 | high school | never married | agriculture | below 25K |
| 6 | 24 | high school | never married | armed forces | below 25K |
From the above table we find the following partitions:
Divorced (DV): d7
Married (MR): d2, d4, d5, d8
Never Married (NM): d1, d3, d6
part_entropy_DV= 0 bits
part_entropy_MR= 0.8112781 bits
part_entropy_NM= 0.9182958 bits
calculate remaining entropy of Marital Status:
remaining_entropy_mar= 0.75 bits
Information Gain for Marital Status:
information gain for martial status = 0.5487949 bits
Finally, Occupation: following the same steps, sort the table by the Occupation column and calculate its information gain:
| ID | Age | Education | Marital Status | Occupation | Annual Income |
|---|---|---|---|---|---|
| 3 | 18 | high school | never married | agriculture | below 25K |
| 5 | 37 | high school | married | agriculture | 25K-50K |
| 6 | 24 | high school | never married | armed forces | below 25K |
| 2 | 50 | bachelors | married | professional | 25K-50K |
| 4 | 28 | bachelors | married | professional | 25K-50K |
| 8 | 40 | doctorate | married | professional | over 50K |
| 1 | 39 | bachelors | never married | transport | 25K-50K |
| 7 | 52 | high school | divorced | transport | 25K-50K |
From the above table we find the following partitions:
Agriculture (AG): d3, d5
Armed Forces (AF): d6
Professional (PF): d2, d4, d8
Transport (TP): d1, d7
part_entropy_AG= 1 bits
part_entropy_AF= 0 bits
part_entropy_PF= 0.9182958 bits
part_entropy_TP= 0 bits
calculate remaining entropy of Occupation:
remaining_entropy_occupation = (2/8)(1) + (1/8)(0) + (3/8)(0.9182958) + (2/8)(0) = 0.5943609 bits
Information Gain for Occupation:
information gain for occupation = 0.704434 bits
The information gain ratio divides each feature's information gain by the entropy of that feature's own value distribution (its split information): Education H(4/8, 3/8, 1/8) = 1.405639 bits, Marital Status H(1/8, 4/8, 3/8) = 1.405639 bits, Occupation H(2/8, 1/8, 3/8, 2/8) = 1.905639 bits. The resulting ratios are:
information gain ratio for education = 0.7987949 / 1.405639 = 0.5682788
information gain ratio for marital status = 0.5487949 / 1.405639 = 0.3904238
information gain ratio for occupation = 0.704434 / 1.905639 = 0.3696576
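These information gain and gain ratio values can be reproduced with a short R sketch (feature values copied from the table above; helper names are illustrative):
entropy_bits <- function(x) { p <- table(x) / length(x); -sum(p * log2(p)) }
income     <- c("25K-50K", "25K-50K", "below 25K", "25K-50K",
                "25K-50K", "below 25K", "25K-50K", "over 50K")
education  <- c("bachelors", "bachelors", "high school", "bachelors",
                "high school", "high school", "high school", "doctorate")
marital    <- c("never married", "married", "never married", "married",
                "married", "never married", "divorced", "married")
occupation <- c("transport", "professional", "agriculture", "professional",
                "agriculture", "armed forces", "transport", "professional")
h_total <- entropy_bits(income)
features <- list(Education = education, "Marital Status" = marital, Occupation = occupation)
for (name in names(features)) {
  f <- features[[name]]
  # remaining entropy: weighted target entropy within each feature level
  remaining <- sum(sapply(split(income, f),
                          function(part) (length(part) / length(income)) * entropy_bits(part)))
  ig <- h_total - remaining
  split_info <- entropy_bits(f)    # entropy of the feature's own distribution
  cat(name, ": IG =", ig, " gain ratio =", ig / split_info, "\n")
}
# Education : IG = 0.7987949  gain ratio = 0.5682788
# Marital Status : IG = 0.5487949  gain ratio = 0.3904238
# Occupation : IG = 0.704434  gain ratio = 0.3696576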
Calculate information gain using the Gini index for the Education, Marital Status, and Occupation features.
Gini index of the entire dataset from part (b): 0.53125
Education partitions:
HS: d3, d5, d6, d7
BS: d1, d2, d4
PhD: d8
partition gini index for HS = 0.5
partition gini index for BS = 0
partition gini index for PhD = 0
Marital Status:
Divorced (DV): d7
Married (MR): d2, d4, d5, d8
Never Married (NM): d1, d3, d6
partition gini index for DV = 0
partition gini index for MR = 0.375
partition gini index for NM = 0.4444444
Occupation:
Agriculture (AG): d3, d5
Armed Forces (AF): d6
Professional (PF): d2, d4, d8
Transport (TP): d1, d7
partition gini index for AG = 0.5
partition gini index for AF = 0
partition gini index for PF = 0.4444444
partition gini index for TP = 0
Calculate the remaining Gini index and the information gain for Education:
# Gini index of each Education partition (from above)
gini_HS <- 0.5
gini_BS <- 0
gini_PhD <- 0
# probability (weight) of each subset
P_HS <- 4/8
P_BS <- 3/8
P_PhD <- 1/8
# remaining (weighted) Gini index after splitting on Education
remaining_gini_edu <- sum(P_HS * gini_HS, P_BS * gini_BS, P_PhD * gini_PhD)
IG_gini_edu <- 0.53125 - remaining_gini_edu
cat('remaining Gini index for education =', remaining_gini_edu, '\n')
cat('information gain (Gini) from education =', IG_gini_edu, '\n')
remaining Gini index for education = 0.25
information gain (Gini) from education = 0.28125
The same weighted-average calculation for the other two features gives:
remaining Gini index for marital status = (1/8)(0) + (4/8)(0.375) + (3/8)(0.4444444) = 0.3541667, so information gain (Gini) = 0.53125 - 0.3541667 = 0.1770833
remaining Gini index for occupation = (2/8)(0.5) + (1/8)(0) + (3/8)(0.4444444) + (2/8)(0) = 0.2916667, so information gain (Gini) = 0.53125 - 0.2916667 = 0.2395833
Information gain (Gini index) by feature: Education = 0.28125, Marital Status = 0.1770833, Occupation = 0.2395833.
4 Decision Tree Error Pruning
The figure below shows a decision tree for predicting heart disease. The descriptive features in this domain describe whether the patient suffers from chest pain (Chest Pain) and the patient's blood pressure (Blood Pressure). The binary target feature is Heart Disease. The table below the diagram lists a pruning set from this domain.
| ID | Chest Pain | Blood Pressure | Heart Disease |
|---|---|---|---|
| 1 | false | high | false |
| 2 | true | low | true |
| 3 | false | low | false |
| 4 | true | high | true |
| 5 | false | high | false |
Using the pruning set, apply reduced error pruning to the decision tree. Assume that the algorithm is applied in a bottom-up, left-to-right fashion. For each iteration of the algorithm, indicate the subtrees considered as pruning candidates, explain why the algorithm chooses to prune or leave these subtrees in the tree, and illustrate the tree that results from each iteration.
Option 1: Using the decision tree given above, we check its predictions against the ground truth in the pruning set:
Chest Pain (false) + Blood Pressure (high): 2 misclassifications
Chest Pain (false) + Blood Pressure (low): 0 misclassifications
Total misclassifications = 2
So the tree can be pruned at the Blood Pressure level, given below:
Option 2:
If we prune further, all the way to the root node, the misclassification error rises to 3.
Thus, we will opt to prune the tree as per Option 1.
5 Random Forest
The Table below lists a dataset containing the details of five participants in a heart disease study. The target feature Risk describes their risk of heart disease. Each patient is described in terms of four descriptive features:
- Exercise - how regularly do they exercise
- Smoker - do they smoke
- Obese - are they overweight
- Family - did any of their parents or siblings suffer from heart disease
| ID | Exercise | Smoker | Obese | Family | Risk |
|---|---|---|---|---|---|
| 1 | daily | false | false | yes | low |
| 2 | weekly | true | false | yes | high |
| 3 | daily | false | false | no | low |
| 4 | rarely | true | true | yes | high |
| 5 | rarely | true | true | no | high |
As part of the study, researchers have decided to create a predictive model to screen participants based on their risk of heart disease. You have been asked to implement this screening model using a random forest. The tables below list three bootstrap samples (A, B, and C) that have been generated from the above dataset.
Bootstrap Sample A:
| ID | Exercise | Family | Risk |
|---|---|---|---|
| 1 | daily | yes | low |
| 2 | weekly | yes | high |
| 2 | weekly | yes | high |
| 5 | rarely | yes | high |
| 5 | rarely | no | high |
Bootstrap Sample B:
| ID | Smoker | Obese | Risk |
|---|---|---|---|
| 1 | false | false | low |
| 2 | true | false | high |
| 2 | true | false | high |
| 4 | true | true | high |
| 5 | true | true | high |
Bootstrap Sample C:
| ID | Obese | Family | Risk |
|---|---|---|---|
| 1 | false | yes | low |
| 1 | false | yes | low |
| 2 | false | yes | high |
| 4 | true | yes | high |
| 5 | true | no | high |
Using these bootstrap samples create the decision trees that will be in the random forest model (use entropy based information gain as the feature selection criterion).
To build the decision trees, we first calculate the entropy of the entire dataset with respect to the target feature Risk, which has two levels: low (2 instances) and high (3 instances).
total entropy of risk 0.9709506 bits
Now we calculate the partition entropies for each attribute, beginning with Exercise:
Exercise:
Sort the table by Exercise column:
| ID | Exercise | Smoker | Obese | Family | Risk |
|---|---|---|---|---|---|
| 1 | daily | false | false | yes | low |
| 3 | daily | false | false | no | low |
| 4 | rarely | true | true | yes | high |
| 5 | rarely | true | true | no | high |
| 2 | weekly | true | false | yes | high |
Exercise groups (from the sorted table):
Daily (D) = d1, d3
Weekly (W) = d2
Rarely (R) = d4, d5
Now, calculate partition entropy of Exercise:
entropy of the exercise daily = 0 bits
entropy of the exercise weekly = 0 bits
entropy of the exercise rarely = 0 bits
Smoker:
Sort the table by Smoker column:
| ID | Exercise | Smoker | Obese | Family | Risk |
|---|---|---|---|---|---|
| 1 | daily | false | false | yes | low |
| 3 | daily | false | false | no | low |
| 2 | weekly | true | false | yes | high |
| 4 | rarely | true | true | yes | high |
| 5 | rarely | true | true | no | high |
Smoker (from the table):
smoker_true (ST) = d2, d4, d5
smoker_false (SF) = d1, d3
Now, calculate partition entropy of Smoker:
entropy of the smoker true = 0 bits
entropy of the smoker false = 0 bits
Obese:
Sort the table by Obese column:
| ID | Exercise | Smoker | Obese | Family | Risk |
|---|---|---|---|---|---|
| 1 | daily | false | false | yes | low |
| 2 | weekly | true | false | yes | high |
| 3 | daily | false | false | no | low |
| 4 | rarely | true | true | yes | high |
| 5 | rarely | true | true | no | high |
Obese (from the table):
obese_true (OT) = d4, d5
obese_false (OF) = d1, d2, d3
Now, calculate partition entropy of Obese:
entropy of the obese true = 0 bits
entropy of the obese false = 0.9182958 bits
Family:
Sort the table by Family column:
| ID | Exercise | Smoker | Obese | Family | Risk |
|---|---|---|---|---|---|
| 3 | daily | false | false | no | low |
| 5 | rarely | true | true | no | high |
| 1 | daily | false | false | yes | low |
| 2 | weekly | true | false | yes | high |
| 4 | rarely | true | true | yes | high |
Family (from the table):
family_yes (FY) = d1, d2, d4
family_no (FN) = d3, d5
Now, calculate partition entropy of Family:
entropy of the family yes = 0.9182958 bits
entropy of the family no = 1 bit
Now calculate the remaining entropy for each feature, using remaining entropy = sum over subsets of (subset proportion * subset entropy).
Information Gain (IG) = Total entropy - remaining entropy
remaining_entropy_exercise= 0 bits
information_gain_exercise= 0.9709506 bits
Now, calculate remaining entropy and information gain for Smoker:
remaining_entropy_smoker = 0 bits
information_gain_smoker = 0.9709506 bits
Now, calculate remaining entropy and information gain for Obese:
remaining_entropy_obese = 0.5509775 bits
information_gain_obese = 0.4199731 bits
Now, calculate remaining entropy and information gain for Family:
remaining_entropy_family = 0.9509775 bits
information_gain_family = 0.01997309 bits
Now, we compare the IG from these attributes:
information gain exercise = 0.9709506 bits
information_gain_smoker = 0.9709506 bits
information_gain_obese = 0.4199731 bits
information gain family = 0.01997309 bits
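These information gain values can be reproduced with a short R sketch (computed, as above, on the full five-instance table; helper and variable names are illustrative):
entropy_bits <- function(x) { p <- table(x) / length(x); -sum(p * log2(p)) }
heart <- data.frame(
  exercise = c("daily", "weekly", "daily", "rarely", "rarely"),
  smoker   = c(FALSE, TRUE, FALSE, TRUE, TRUE),
  obese    = c(FALSE, FALSE, FALSE, TRUE, TRUE),
  family   = c("yes", "yes", "no", "yes", "no"),
  risk     = c("low", "high", "low", "high", "high")
)
h_total <- entropy_bits(heart$risk)    # 0.9709506
for (feature in c("exercise", "smoker", "obese", "family")) {
  # remaining entropy: weighted target entropy within each feature value
  remaining <- sum(sapply(split(heart$risk, heart[[feature]]),
                          function(part) (length(part) / nrow(heart)) * entropy_bits(part)))
  cat(feature, ": IG =", h_total - remaining, "\n")
}
# exercise : IG = 0.9709506
# smoker : IG = 0.9709506
# obese : IG = 0.4199731
# family : IG = 0.01997309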
Based on this information gain ranking and the features available in each bootstrap sample, the root nodes are Exercise for Bootstrap Sample A, Smoker for Bootstrap Sample B, and Obese for Bootstrap Sample C. Reading the leaf labels off each sample's partitions, the three trees are:
Tree A (root Exercise): daily -> low, weekly -> high, rarely -> high
Tree B (root Smoker): false -> low, true -> high
Tree C (root Obese): true -> high, false -> low (majority of that partition)
Assuming the random forest model you have created uses majority voting, what prediction will it return for the following query:
Exercise=rarely, Smoker=false, Obese=true, Family=yes
Tree A: Exercise = rarely -> Risk = high
Tree B: Smoker = false -> Risk = low
Tree C: Obese = true -> Risk = high
By majority vote (two high votes versus one low), the random forest predicts a high risk of heart disease for this participant.
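A minimal sketch of the majority vote in R (the three tree predictions are taken from the analysis above):
# Predictions from the three trees for the query
# Exercise = rarely, Smoker = false, Obese = true, Family = yes
votes <- c(tree_A = "high", tree_B = "low", tree_C = "high")
prediction <- names(which.max(table(votes)))   # level with the most votes
cat("Random forest prediction: Risk =", prediction, "\n")   # "high"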