CSCI/DASC 6020: Written Assignment 04
Assignment Goal
The goal of this assignment is to demonstrate your understanding of how the concepts from information theory can be used to build machine learning models for predictive tasks.
1 Calculating Entropy
The figure below shows a set of eight Scrabble pieces.

What is the entropy (in bits) of the letters in this set?
The letters form a multiset with duplicates: O appears three times, and X, Y, M, R, N each appear once.
Entropy of the letters in this set: 2.405639 bits
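A quick check of this value in R (the letter frequencies are taken from the word OXYMORON; the variable names are just illustrative):
# Entropy of the letter multiset {O, X, Y, M, O, R, O, N}
letters_set <- c("O", "X", "Y", "M", "O", "R", "O", "N")
p <- table(letters_set) / length(letters_set)    # relative frequency of each letter
entropy_letters <- -sum(p * log2(p))             # Shannon entropy in bits
cat("Entropy of the letters:", entropy_letters, "bits\n")   # 2.405639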
What would be the reduction in entropy (i.e., the information gain) in bits if we split these letters into two sets, one containing the vowels and the other containing the consonants?
Information gain = original entropy of the set - remaining entropy, where the remaining entropy is the weighted average of the entropies of the two subsets (vowels: O, O, O; consonants: X, Y, M, R, N).
Information gain due to splitting the letters into two sets: 0.954434 bits
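The same split can be checked with a short R sketch (the two groups are the vowels and consonants described above; helper names are illustrative):
# Entropy helper (bits)
entropy_bits <- function(x) { p <- table(x) / length(x); -sum(p * log2(p)) }
vowels <- c("O", "O", "O")
consonants <- c("X", "Y", "M", "R", "N")
all_letters <- c(vowels, consonants)
# Remaining entropy = weighted average of the two partition entropies
remaining <- (length(vowels) / 8) * entropy_bits(vowels) +
             (length(consonants) / 8) * entropy_bits(consonants)
information_gain <- entropy_bits(all_letters) - remaining   # 2.405639 - 1.451205
cat("Information gain:", information_gain, "bits\n")        # 0.954434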
What is the maximum possible entropy (in bits) for a set of eight Scrabble pieces?
We get maximum entropy when every element in the set is distinct. For a set of eight Scrabble pieces this means eight different letters, each with probability 1/8.
Maximum possible entropy for a set of eight pieces: log2(8) = 3 bits
In general, which is preferable when you are playing Scrabble: a set of letters with high entropy, or a set of letters with low entropy?
High entropy is preferable. A high-entropy set contains mostly distinct letters, which gives more options for forming useful words. A low-entropy set contains many duplicates, so fewer distinct letters are available. I would therefore prefer high entropy to increase my chances of winning.
2 Decision Tree Construction
A convicted criminal who reoffends after release is known as a recidivist. The Table below lists a dataset that describes prisoners released on parole, and whether they reoffended within two years of release.
| ID | Good Behavior | Age Below 30 | Drug Dependent | Recidivist |
|---|---|---|---|---|
| 1 | false | true | false | true |
| 2 | false | false | false | false |
| 3 | false | true | false | true |
| 4 | true | false | false | false |
| 5 | true | false | true | true |
| 6 | true | false | false | false |
This dataset lists six instances where prisoners were granted parole. Each instance is described in terms of three binary descriptive features (Good Behavior, Age Below 30, Drug Dependent) and a binary target feature, Recidivist. The Good Behavior feature has a value of true if the prisoner committed no infringements during incarceration, the Age Below 30 feature has a value of true if the prisoner was under 30 years of age when granted parole, and the Drug Dependent feature is true if the prisoner had a drug addiction at the time of parole. The target feature, Recidivist, has a value of true if the prisoner was arrested within two years of being released; otherwise it has a value of false.
- Using this dataset, construct the decision tree that would be generated by the ID3 algorithm, using entropy-based information gain.
Based on entropy-based information gain (computed below), Age Below 30 has the highest gain and becomes the root node. The partition with Age Below 30 = true (instances 1 and 3) is pure, so it becomes a Recidivist = true leaf. The partition with Age Below 30 = false (instances 2, 4, 5, 6) is then split on Drug Dependent, whose two partitions are also pure. The resulting ID3 tree is:
- Age Below 30 = true: Recidivist = true
- Age Below 30 = false and Drug Dependent = true: Recidivist = true
- Age Below 30 = false and Drug Dependent = false: Recidivist = false
The supporting calculation in R:
# Build the parole dataset (features and target as logicals)
df1 <- data.frame(
  good_behavior  = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE),
  age_below_30   = c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE),
  drug_dependent = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE),
  recidivist     = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)
# Entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}
# Calculate the entropy of the entire dataset with respect to the target, Recidivist
entropy_original <- entropy(df1$recidivist)   # 3 true / 3 false -> 1 bit
# Remaining entropy after splitting on a feature:
# weighted average of the target entropy within each feature value
remaining_entropy <- function(feature, target) {
  sum(sapply(split(target, feature),
             function(part) (length(part) / length(target)) * entropy(part)))
}
# Information Gain = original entropy - remaining entropy, for each attribute
information_gains <- c(
  "Good Behavior"  = entropy_original - remaining_entropy(df1$good_behavior,  df1$recidivist),
  "Age Below 30"   = entropy_original - remaining_entropy(df1$age_below_30,   df1$recidivist),
  "Drug Dependent" = entropy_original - remaining_entropy(df1$drug_dependent, df1$recidivist)
)
# Print the Information Gain values with attribute names in two columns
cat("Attribute Name\tInformation Gain\n")
for (attr_name in names(information_gains)) {
  cat(attr_name, "\t", information_gains[attr_name], "\n")
}
# Find and report the attribute with the highest Information Gain
highest_ig_attribute <- names(information_gains)[which.max(information_gains)]
cat('we will use the "', highest_ig_attribute, '" as the root node', '\n')
Attribute Name  Information Gain
Good Behavior    0.08170417
Age Below 30     0.4591479
Drug Dependent   0.1908745
we will use the " Age Below 30 " as the root node
What prediction will the decision tree generated in part (a) of this question return for the following query?
Good Behavior = false, Age Below 30 = false, Drug Dependent = true
Following the decision tree developed above: Age Below 30 = false leads to the Drug Dependent test, and Drug Dependent = true gives Recidivist = true, so the tree predicts this person will reoffend.
What prediction will the decision tree generated in part (a) of this question return for the following query?
Good Behavior = true, Age Below 30 = true, Drug Dependent = false
Following the decision tree developed above: Age Below 30 = true leads directly to a Recidivist = true leaf, so the tree predicts this person will reoffend.
3 Information Gain
The Table below lists a sample of data from a census.
| ID | Age | Education | Marital Status | Occupation | Annual Income |
|---|---|---|---|---|---|
| 1 | 39 | bachelors | never married | transport | 25K-50K |
| 2 | 50 | bachelors | married | professional | 25K-50K |
| 3 | 18 | high school | never married | agriculture | below 25K |
| 4 | 28 | bachelors | married | professional | 25K-50K |
| 5 | 37 | high school | married | agriculture | 25K-50K |
| 6 | 24 | high school | never married | armed forces | below 25K |
| 7 | 52 | high school | divorced | transport | 25K-50K |
| 8 | 40 | doctorate | married | professional | over 50K |
There are four descriptive features and one target feature in this dataset:
- Age, a continuous feature listing the age of the individual.
- Education, a categorical feature listing the highest education award achieved by the individual (high school, bachelors, doctorate).
- Marital Status (never married, married, divorced).
- Occupation (transport = works in the transportation industry; professional = doctors, lawyers, etc.; agriculture = works in the agricultural industry; armed forces = is a member of the armed forces).
- Annual Income, the target feature with 3 levels (<25K, 25K-50K, >50K).
Calculate the entropy for this dataset.
The entropy is calculated over the target feature, Annual Income (2 instances of below 25K, 5 of 25K-50K, 1 of over 50K).
entropy of the total dataset: 1.298795 bits
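A short R check of this value (the income values are copied from the table above):
# Entropy of the census dataset with respect to the target, Annual Income
income <- c("25K-50K", "25K-50K", "below 25K", "25K-50K",
            "25K-50K", "below 25K", "25K-50K", "over 50K")
p <- table(income) / length(income)    # 2/8 below 25K, 5/8 25K-50K, 1/8 over 50K
cat("Entropy:", -sum(p * log2(p)), "bits\n")   # 1.298795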
Calculate the Gini index for this dataset.
The Gini index is calculated as gini_index = 1 - sum(p^2), where p are the proportions of the target levels.
gini index of the total dataset: 0.53125
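And the corresponding R check (standalone, with the same income vector as in the previous snippet):
# Gini index of the census dataset: 1 minus the sum of squared class proportions
income <- c("25K-50K", "25K-50K", "below 25K", "25K-50K",
            "25K-50K", "below 25K", "25K-50K", "over 50K")
p <- table(income) / length(income)
cat("Gini index:", 1 - sum(p^2), "\n")   # 0.53125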
When building a decision tree, the easiest way to handle a continuous feature is to define a threshold around which splits will be made. What would be the optimal threshold to split the continuous Age feature (use information gain based on entropy as the feature selection measure)?
From the table we can see that the target feature, Annual Income, has three levels: below 25K, 25K-50K, and over 50K.
After sorting the table by Age, candidate thresholds lie at the points where the target label changes:
| ID | Age | Education | Marital Status | Occupation | Annual Income |
|---|---|---|---|---|---|
| 3 | 18 | high school | never married | agriculture | below 25K |
| 6 | 24 | high school | never married | armed forces | below 25K |
| 4 | 28 | bachelors | married | professional | 25K-50K |
| 5 | 37 | high school | married | agriculture | 25K-50K |
| 1 | 39 | bachelors | never married | transport | 25K-50K |
| 8 | 40 | doctorate | married | professional | over 50K |
| 2 | 50 | bachelors | married | professional | 25K-50K |
| 7 | 52 | high school | divorced | transport | 25K-50K |
The target label changes between ages 24 and 28 (below 25K to 25K-50K), between 39 and 40 (25K-50K to over 50K), and between 40 and 50 (over 50K to 25K-50K). The candidate thresholds are the midpoints of these transitions: (24 + 28)/2 = 26, (39 + 40)/2 = 39.5, and (40 + 50)/2 = 45.
From the sorted table, each candidate threshold produces the following partitions:
Threshold 26:
D1 (Age < 26) = d3, d6
D2 (Age >= 26) = d4, d5, d1, d8, d2, d7
Threshold 39.5:
D3 (Age < 39.5) = d3, d6, d4, d5, d1
D4 (Age >= 39.5) = d8, d2, d7
Threshold 45:
D5 (Age < 45) = d3, d6, d4, d5, d1, d8
D6 (Age >= 45) = d2, d7
part_entropy_D1= 0 bits
part_entropy_D2= 0.6500224 bits
part_entropy_D3= 0.9709506 bits
part_entropy_D4= 0.9182958 bits
part_entropy_D5= 1.459148 bits
part_entropy_D6= 0 bits
Remaining entropy is calculated as the sum over partitions of (partition proportion * partition entropy):
remaining_entropy_D1&2= 0.4875168 bits
remaining_entropy_D3&4= 0.9512051 bits
remaining_entropy_D5&6= 1.094361 bits
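A compact R sketch of this threshold search (ages and incomes copied from the table above; helper names are illustrative):
# Entropy helper (bits)
entropy_bits <- function(x) { p <- table(x) / length(x); -sum(p * log2(p)) }
age    <- c(39, 50, 18, 28, 37, 24, 52, 40)
income <- c("25K-50K", "25K-50K", "below 25K", "25K-50K",
            "25K-50K", "below 25K", "25K-50K", "over 50K")
h_total <- entropy_bits(income)    # 1.298795
for (thr in c(26, 39.5, 45)) {
  below <- income[age < thr]
  above <- income[age >= thr]
  remaining <- (length(below) / 8) * entropy_bits(below) +
               (length(above) / 8) * entropy_bits(above)
  cat("threshold", thr, ": remaining =", remaining, " IG =", h_total - remaining, "\n")
}
# threshold 26   : remaining = 0.4875168  IG = 0.8112781
# threshold 39.5 : remaining = 0.9512051  IG = 0.3475899
# threshold 45   : remaining = 1.094361   IG = 0.204434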
Information gain for each candidate threshold = original entropy - remaining entropy:
information gain for threshold 26 (D1, D2) = 1.298795 - 0.4875168 = 0.8112781 bits
information gain for threshold 39.5 (D3, D4) = 1.298795 - 0.9512051 = 0.3475899 bits
information gain for threshold 45 (D5, D6) = 1.298795 - 1.094361 = 0.204434 bits
The threshold of 26 gives the highest information gain, so it is the optimal split point for the continuous Age feature.
Calculate information gain (based on entropy) for the Education, Marital Status, and Occupation features.
Calculate the information gain ratio (based on entropy) for Education, Marital Status, and Occupation features.
The information gain values computed here answer the previous question as well; dividing each by the feature's split information then gives the gain ratio. Starting with Education, sort the table by the Education column:
| ID | Age | Education | Marital Status | Occupation | Annual Income |
|---|---|---|---|---|---|
| 1 | 39 | bachelors | never married | transport | 25K-50K |
| 2 | 50 | bachelors | married | professional | 25K-50K |
| 4 | 28 | bachelors | married | professional | 25K-50K |
| 8 | 40 | doctorate | married | professional | over 50K |
| 3 | 18 | high school | never married | agriculture | below 25K |
| 5 | 37 | high school | married | agriculture | 25K-50K |
| 6 | 24 | high school | never married | armed forces | below 25K |
| 7 | 52 | high school | divorced | transport | 25K-50K |
We find the following partitions in the table:
Education:
HS: d3, d5, d6, d7
BS: d1, d2, d4
PhD.: d8
part_entropy_HS= 1 bits
part_entropy_BS= 0 bits
part_entropy_PhD= 0 bits
we need to calculate remaining entropy for Education:
remaining entropy for education = 0.5 bits
Information Gain for Education:
information gain for education = 0.7987949 bits
Next, Marital Status: following the same steps, sort the table by the Marital Status column and calculate its information gain:
| ID | Age | Education | Marital Status | Occupation | Annual Income |
|---|---|---|---|---|---|
| 7 | 52 | high school | divorced | transport | 25K-50K |
| 2 | 50 | bachelors | married | professional | 25K-50K |
| 4 | 28 | bachelors | married | professional | 25K-50K |
| 5 | 37 | high school | married | agriculture | 25K-50K |
| 8 | 40 | doctorate | married | professional | over 50K |
| 1 | 39 | bachelors | never married | transport | 25K-50K |
| 3 | 18 | high school | never married | agriculture | below 25K |
| 6 | 24 | high school | never married | armed forces | below 25K |
From the above table we find the following partitions:
Divorced (DV): d7
Married (MR): d2, d4, d5, d8
Never Married (NM): d1, d3, d6
part_entropy_DV= 0 bits
part_entropy_MR= 0.8112781 bits
part_entropy_NM= 0.9182958 bits
calculate remaining entropy of Marital Status:
remaining_entropy_mar= 0.75 bits
Information Gain for Marital Status:
information gain for martial status = 0.5487949 bits
Finally, Occupation: following the same steps, sort the table by the Occupation column and calculate its information gain:
| ID | Age | Education | Marital Status | Occupation | Annual Income |
|---|---|---|---|---|---|
| 3 | 18 | high school | never married | agriculture | below 25K |
| 5 | 37 | high school | married | agriculture | 25K-50K |
| 6 | 24 | high school | never married | armed forces | below 25K |
| 2 | 50 | bachelors | married | professional | 25K-50K |
| 4 | 28 | bachelors | married | professional | 25K-50K |
| 8 | 40 | doctorate | married | professional | over 50K |
| 1 | 39 | bachelors | never married | transport | 25K-50K |
| 7 | 52 | high school | divorced | transport | 25K-50K |
From the above table we find the following partitions:
Agriculture (AG): d3, d5
Armed Forces (AF): d6
Professional (PF): d2, d4, d8
Transport (TP): d1, d7
part_entropy_AG= 1 bits
part_entropy_AF= 0 bits
part_entropy_PF= 0.9182958 bits
part_entropy_TP= 0 bits
calculate remaining entropy of Occupation:
remaining_entropy_occupation = (2/8)(1) + (1/8)(0) + (3/8)(0.9182958) + (2/8)(0) = 0.5943609 bits
Information Gain for Occupation:
information gain for occupation = 0.704434 bits
The information gain ratio divides each feature's information gain by the entropy of that feature's own value distribution (its split information): Education H(4/8, 3/8, 1/8) = 1.405639 bits, Marital Status H(1/8, 4/8, 3/8) = 1.405639 bits, Occupation H(2/8, 1/8, 3/8, 2/8) = 1.905639 bits. The resulting ratios are:
information gain ratio for education = 0.7987949 / 1.405639 = 0.5682788
information gain ratio for marital status = 0.5487949 / 1.405639 = 0.3904238
information gain ratio for occupation = 0.704434 / 1.905639 = 0.3696576
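These information gain and gain ratio values can be reproduced with a short R sketch (feature values copied from the table above; helper names are illustrative):
entropy_bits <- function(x) { p <- table(x) / length(x); -sum(p * log2(p)) }
income     <- c("25K-50K", "25K-50K", "below 25K", "25K-50K",
                "25K-50K", "below 25K", "25K-50K", "over 50K")
education  <- c("bachelors", "bachelors", "high school", "bachelors",
                "high school", "high school", "high school", "doctorate")
marital    <- c("never married", "married", "never married", "married",
                "married", "never married", "divorced", "married")
occupation <- c("transport", "professional", "agriculture", "professional",
                "agriculture", "armed forces", "transport", "professional")
h_total <- entropy_bits(income)
features <- list(Education = education, "Marital Status" = marital, Occupation = occupation)
for (name in names(features)) {
  f <- features[[name]]
  # remaining entropy: weighted target entropy within each feature level
  remaining <- sum(sapply(split(income, f),
                          function(part) (length(part) / length(income)) * entropy_bits(part)))
  ig <- h_total - remaining
  split_info <- entropy_bits(f)    # entropy of the feature's own distribution
  cat(name, ": IG =", ig, " gain ratio =", ig / split_info, "\n")
}
# Education : IG = 0.7987949  gain ratio = 0.5682788
# Marital Status : IG = 0.5487949  gain ratio = 0.3904238
# Occupation : IG = 0.704434  gain ratio = 0.3696576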
Calculate information gain using the Gini index for the Education, Marital Status, and Occupation features.
Gini index of the entire dataset from part (b): 0.53125
Education partitions:
HS: d3, d5, d6, d7
BS: d1, d2, d4
PhD: d8
partition gini index for HS = 0.5
partition gini index for BS = 0
partition gini index for PhD = 0
Marital Status:
Divorced (DV): d7
Married (MR): d2, d4, d5, d8
Never Married (NM): d1, d3, d6
partition gini index for DV = 0
partition gini index for MR = 0.375
partition gini index for NM = 0.4444444
Occupation:
Agriculture (AG): d3, d5
Armed Forces (AF): d6
Professional (PF): d2, d4, d8
Transport (TP): d1, d7
partition gini index for AG = 0.5
partition gini index for AF = 0
partition gini index for PF = 0.4444444
partition gini index for TP = 0
Calculate the remaining Gini index and the information gain for Education:
# Gini index of each Education partition (from above)
gini_HS <- 0.5
gini_BS <- 0
gini_PhD <- 0
# probability (weight) of each subset
P_HS <- 4/8
P_BS <- 3/8
P_PhD <- 1/8
# remaining (weighted) Gini index after splitting on Education
remaining_gini_edu <- sum(P_HS * gini_HS, P_BS * gini_BS, P_PhD * gini_PhD)
IG_gini_edu <- 0.53125 - remaining_gini_edu
cat('remaining Gini index for education =', remaining_gini_edu, '\n')
cat('information gain (Gini) from education =', IG_gini_edu, '\n')
remaining Gini index for education = 0.25
information gain (Gini) from education = 0.28125
The same weighted-average calculation for the other two features gives:
remaining Gini index for marital status = (1/8)(0) + (4/8)(0.375) + (3/8)(0.4444444) = 0.3541667, so information gain (Gini) = 0.53125 - 0.3541667 = 0.1770833
remaining Gini index for occupation = (2/8)(0.5) + (1/8)(0) + (3/8)(0.4444444) + (2/8)(0) = 0.2916667, so information gain (Gini) = 0.53125 - 0.2916667 = 0.2395833
Information gain (Gini index) by feature: Education = 0.28125, Marital Status = 0.1770833, Occupation = 0.2395833.
4 Decision Tree Error Pruning
The figure below shows a decision tree for predicting heart disease. The descriptive features in this domain describe whether the patient suffers from chest pain (Chest Pain) and the patient's blood pressure (Blood Pressure). The binary target feature is Heart Disease. The table below the diagram lists a pruning set from this domain.
| ID | Chest Pain | Blood Pressure | Heart Disease |
|---|---|---|---|
| 1 | false | high | false |
| 2 | true | low | true |
| 3 | false | low | false |
| 4 | true | high | true |
| 5 | false | high | false |
Using the pruning set, apply reduced error pruning to the decision tree. Assume that the algorithm is applied in a bottom-up, left-to-right fashion. For each iteration of the algorithm, indicate the subtrees considered as pruning candidates, explain why the algorithm chooses to prune or leave these subtrees in the tree, and illustrate the tree that results from each iteration.
Option 1: Using the decision tree given above, we check its predictions against the ground truth in the pruning set:
Chest Pain (false) + Blood Pressure (high): 2 misclassifications
Chest Pain (false) + Blood Pressure (low): 0 misclassifications
Total misclassifications = 2
So the tree can be pruned at the Blood Pressure level, given below:
Option 2:
If we prune further, all the way to the root node, the misclassification error rises to 3.
Thus, we will opt to prune the tree as per Option 1.
5 Random Forest
The Table below lists a dataset containing the details of five participants in a heart disease study. The target feature Risk describes their risk of heart disease. Each patient is described in terms of four descriptive features:
- Exercise - how regularly do they exercise
- Smoker - do they smoke
- Obese - are they overweight
- Family - did any of their parents or siblings suffer from heart disease
| ID | Exercise | Smoker | Obese | Family | Risk |
|---|---|---|---|---|---|
| 1 | daily | false | false | yes | low |
| 2 | weekly | true | false | yes | high |
| 3 | daily | false | false | no | low |
| 4 | rarely | true | true | yes | high |
| 5 | rarely | true | true | no | high |
As part of the study, researchers have decided to create a predictive model to screen participants based on their risk of heart disease. You have been asked to implement this screening model using a random forest. The tables below list three bootstrap samples (A, B, and C) that have been generated from the above dataset.
Bootstrap Sample A:
| ID | Exercise | Family | Risk |
|---|---|---|---|
| 1 | daily | yes | low |
| 2 | weekly | yes | high |
| 2 | weekly | yes | high |
| 5 | rarely | yes | high |
| 5 | rarely | no | high |
Bootstrap Sample B:
| ID | Smoker | Obese | Risk |
|---|---|---|---|
| 1 | false | false | low |
| 2 | true | false | high |
| 2 | true | false | high |
| 4 | true | true | high |
| 5 | true | true | high |
Bootstrap Sample C:
| ID | Obese | Family | Risk |
|---|---|---|---|
| 1 | false | yes | low |
| 1 | false | yes | low |
| 2 | false | yes | high |
| 4 | true | yes | high |
| 5 | true | no | high |
Using these bootstrap samples create the decision trees that will be in the random forest model (use entropy based information gain as the feature selection criterion).
To build the decision trees, we first calculate the entropy of the entire dataset with respect to the target feature Risk, which has two levels: low (2 instances) and high (3 instances).
total entropy of risk 0.9709506 bits
Now we calculate the partition entropies for each attribute, beginning with Exercise:
Exercise:
Sort the table by Exercise column:
| ID | Exercise | Smoker | Obese | Family | Risk |
|---|---|---|---|---|---|
| 1 | daily | false | false | yes | low |
| 3 | daily | false | false | no | low |
| 4 | rarely | true | true | yes | high |
| 5 | rarely | true | true | no | high |
| 2 | weekly | true | false | yes | high |
Exercise groups (from the sorted table):
Daily (D) = d1, d3
Weekly (W) = d2
Rarely (R) = d4, d5
Now, calculate partition entropy of Exercise:
entropy of the exercise daily = 0 bits
entropy of the exercise weekly = 0 bits
entropy of the exercise rarely = 0 bits
Smoker:
Sort the table by Smoker column:
| ID | Exercise | Smoker | Obese | Family | Risk |
|---|---|---|---|---|---|
| 1 | daily | false | false | yes | low |
| 3 | daily | false | false | no | low |
| 2 | weekly | true | false | yes | high |
| 4 | rarely | true | true | yes | high |
| 5 | rarely | true | true | no | high |
Smoker (from the table):
smoker_true (ST) = d2, d4, d5
smoker_false (SF) = d1, d3
Now, calculate partition entropy of Smoker:
entropy of the smoker true = 0 bits
entropy of the smoker false = 0 bits
Obese:
Sort the table by Obese column:
| ID | Exercise | Smoker | Obese | Family | Risk |
|---|---|---|---|---|---|
| 1 | daily | false | false | yes | low |
| 2 | weekly | true | false | yes | high |
| 3 | daily | false | false | no | low |
| 4 | rarely | true | true | yes | high |
| 5 | rarely | true | true | no | high |
Obese (from the table):
obese_true (OT) = d4, d5
obese_false (OF) = d1, d2, d3
Now, calculate partition entropy of Obese:
entropy of the obese true = 0 bits
entropy of the obese false = 0.9182958 bits
Family:
Sort the table by Family column:
| ID | Exercise | Smoker | Obese | Family | Risk |
|---|---|---|---|---|---|
| 3 | daily | false | false | no | low |
| 5 | rarely | true | true | no | high |
| 1 | daily | false | false | yes | low |
| 2 | weekly | true | false | yes | high |
| 4 | rarely | true | true | yes | high |
Family (from the table):
family_yes (FY) = d1, d2, d4
family_no (FN) = d3, d5
Now, calculate partition entropy of Family:
entropy of the family yes = 0.9182958 bits
entropy of the family no = 1 bit
Now calculate the remaining entropy for each feature, using remaining entropy = sum over subsets of (subset proportion * subset entropy).
Information Gain (IG) = Total entropy - remaining entropy
remaining_entropy_exercise= 0 bits
information_gain_exercise= 0.9709506 bits
Now, calculate remaining entropy and information gain for Smoker:
remaining_entropy_smoker = 0 bits
information_gain_smoker = 0.9709506 bits
Now, calculate remaining entropy and information gain for Obese:
remaining_entropy_obese = 0.5509775 bits
information_gain_obese = 0.4199731 bits
Now, calculate remaining entropy and information gain for Family:
remaining_entropy_family = 0.9509775 bits
information_gain_family = 0.01997309 bits
Now, we compare the IG from these attributes:
information gain exercise = 0.9709506 bits
information_gain_smoker = 0.9709506 bits
information_gain_obese = 0.4199731 bits
information gain family = 0.01997309 bits
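These information gain values can be reproduced with a short R sketch (computed, as above, on the full five-instance table; helper and variable names are illustrative):
entropy_bits <- function(x) { p <- table(x) / length(x); -sum(p * log2(p)) }
heart <- data.frame(
  exercise = c("daily", "weekly", "daily", "rarely", "rarely"),
  smoker   = c(FALSE, TRUE, FALSE, TRUE, TRUE),
  obese    = c(FALSE, FALSE, FALSE, TRUE, TRUE),
  family   = c("yes", "yes", "no", "yes", "no"),
  risk     = c("low", "high", "low", "high", "high")
)
h_total <- entropy_bits(heart$risk)    # 0.9709506
for (feature in c("exercise", "smoker", "obese", "family")) {
  # remaining entropy: weighted target entropy within each feature value
  remaining <- sum(sapply(split(heart$risk, heart[[feature]]),
                          function(part) (length(part) / nrow(heart)) * entropy_bits(part)))
  cat(feature, ": IG =", h_total - remaining, "\n")
}
# exercise : IG = 0.9709506
# smoker : IG = 0.9709506
# obese : IG = 0.4199731
# family : IG = 0.01997309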
Based on this information gain ranking and the features available in each bootstrap sample, the root nodes are Exercise for Bootstrap Sample A, Smoker for Bootstrap Sample B, and Obese for Bootstrap Sample C. Reading the leaf labels off each sample's partitions, the three trees are:
Tree A (root Exercise): daily -> low, weekly -> high, rarely -> high
Tree B (root Smoker): false -> low, true -> high
Tree C (root Obese): true -> high, false -> low (majority of that partition)
Assuming the random forest model you have created uses majority voting, what prediction will it return for the following query:
Exercise=rarely, Smoker=false, Obese=true, Family=yes
Tree A: Exercise = rarely -> Risk = high
Tree B: Smoker = false -> Risk = low
Tree C: Obese = true -> Risk = high
By majority vote (two high votes versus one low), the random forest predicts a high risk of heart disease for this participant.
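A minimal sketch of the majority vote in R (the three tree predictions are taken from the analysis above):
# Predictions from the three trees for the query
# Exercise = rarely, Smoker = false, Obese = true, Family = yes
votes <- c(tree_A = "high", tree_B = "low", tree_C = "high")
prediction <- names(which.max(table(votes)))   # level with the most votes
cat("Random forest prediction: Risk =", prediction, "\n")   # "high"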