R Notebook

Dylan Guzman

Question 1

Calculating initial entropy. Low: 3/11. Medium: 6/11. High: 2/11.

initial_entropy = -1*(3/11*log(3/11, base = 2) + 6/11*log(6/11, base = 2) + 2/11*log(2/11, base = 2))
print(initial_entropy)

## [1] 1.435371

Calculating gain ratio on the department attribute. Sales: 3/11, with 1 medium and 2 low. Systems: 4/11, with 2 high and 2 medium. Marketing: 2/11, with 2 medium. Secretary: 2/11, with 1 low and 1 medium.

sales_entropy = -3/11*(1/3*log(1/3, base = 2) + 2/3*log(2/3, base = 2))
systems_entropy = -4/11*(1/2*log(1/2, base = 2) + 1/2*log(1/2, base = 2))
marketing_entropy = -2/11*(2/2*log(2/2, base = 2))
secretary_entropy = -2/11*(1/2*log(1/2, base = 2) + 1/2*log(1/2, base = 2))
department_entropy = sales_entropy+systems_entropy+marketing_entropy+secretary_entropy

department_IG = initial_entropy - department_entropy
department_split = -(3/11*log(3/11, base = 2) + 4/11*log(4/11, base = 2) + 2/11*log(2/11, base = 2) + 2/11*log(2/11, base = 2))
department_gain_ratio = department_IG/department_split
print(department_gain_ratio)

## [1] 0.3302617

Calculating gain ratio on the status attribute. Junior: 6/11, with 3 low and 3 medium. Senior: 5/11, with 3 medium and 2 high.

junior_entropy = -6/11*(1/2*log(1/2,base = 2) + 1/2*log(1/2, base = 2))
senior_entropy = -5/11*(3/5*log(3/5,base = 2) + 2/5*log(2/5, base = 2))
status_entropy = junior_entropy + senior_entropy

status_IG = initial_entropy - status_entropy
status_split = -(5/11*log(5/11,base=2) + 6/11*log(6/11, base=2))
status_gain_ratio = status_IG/status_split
print(status_gain_ratio)

## [1] 0.4512697

Calculating gain ratio on the age attribute. Twenties: 4/11, with 2 low and 2 medium. Thirties: 5/11, with 3 medium, 1 low, and 1 high. Forties: 2/11, with 1 high and 1 medium.

twenties_entropy = -4/11*(1/2*log(1/2, base=2) + 1/2*log(1/2, base = 2))
thirties_entropy = -5/11*(3/5*log(3/5,base=2) + 1/5*log(1/5,base=2) + 1/5*log(1/5,base=2))
forties_entropy = -2/11*(1/2*log(1/2, base=2) + 1/2*log(1/2, base = 2))
age_entropy = twenties_entropy+thirties_entropy+forties_entropy

age_IG = initial_entropy - age_entropy
age_split = -(4/11*log(4/11,base=2) + 5/11*log(5/11,base=2) + 2/11*log(2/11,base=2))
age_gain_ratio = age_IG/age_split
print(age_gain_ratio)

## [1] 0.1784428

Choosing the attribute by gain ratio

print(paste("Department Gain Ratio:", round(department_gain_ratio, 7)))

## [1] "Department Gain Ratio: 0.3302617"

print(paste("Status Gain Ratio:", round(status_gain_ratio, 7)))

## [1] "Status Gain Ratio: 0.4512697"

print(paste("Age Gain Ratio:", round(age_gain_ratio, 7)))

## [1] "Age Gain Ratio: 0.1784428"

Status has the highest gain ratio. Therefore it should be selected for the first split.
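As a cross-check, the three gain ratios can be recomputed from count tables instead of hand-typed fractions, so that every proportion shares the same denominator. The sketch below is not part of the original solution: the helper names `entropy_bits` and `gain_ratio` are hypothetical, and the count matrices are transcribed from the per-attribute breakdowns above, with columns ordered low, medium, high.

```r
# Shannon entropy in bits of a vector of counts (hypothetical helper).
entropy_bits <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]              # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# Gain ratio for one attribute (hypothetical helper).
# `tab` has one row per attribute value and one column per salary class.
gain_ratio <- function(tab) {
  weights <- rowSums(tab) / sum(tab)
  conditional <- sum(weights * apply(tab, 1, entropy_bits))
  info_gain <- entropy_bits(colSums(tab)) - conditional
  split_info <- entropy_bits(rowSums(tab))
  info_gain / split_info
}

# Columns: low, medium, high.
department <- rbind(sales = c(2, 1, 0), systems = c(0, 2, 2),
                    marketing = c(0, 2, 0), secretary = c(1, 1, 0))
status <- rbind(junior = c(3, 3, 0), senior = c(0, 3, 2))
age <- rbind(twenties = c(2, 2, 0), thirties = c(1, 3, 1), forties = c(0, 1, 1))

ratios <- c(department = gain_ratio(department),
            status = gain_ratio(status),
            age = gain_ratio(age))
print(ratios)
print(names(which.max(ratios)))
```

Working from counts avoids retyping the total in each formula, which is where a mismatched denominator is easiest to introduce.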

Question 2

Part a

play = 9/14
no_play = 5/14

sunny_play = 2/9
sunny_not = 3/5
overcast_play = 4/9
overcast_not = 0/5
rainy_play = 3/9
rainy_not = 2/5

hot_play = 2/9
hot_not = 2/5
mild_play = 4/9
mild_not = 2/5
cool_play = 3/9
cool_not = 1/5

high_play = 3/9
high_not = 4/5
normal_play = 6/9
normal_not = 1/5

windy_play = 3/9
windy_not = 3/5
not_windy_play = 6/9
not_windy_not = 2/5

Part b

conditions_given_play = sunny_play*cool_play*high_play*windy_play
conditions_given_not_play = sunny_not*cool_not*high_not*windy_not

prob_play = conditions_given_play*play/(conditions_given_play*play + conditions_given_not_play*no_play)
prob_not_play = conditions_given_not_play*no_play/(conditions_given_play*play + conditions_given_not_play*no_play)
prob_play

## [1] 0.2045827

prob_not_play

## [1] 0.7954173

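prob_play is the posterior probability of playing given the conditions in 2b (sunny, cool, high humidity, windy). By Bayes' rule, it is the probability of the conditions given that tennis is played, times the prior probability that tennis is played, divided by the probability of the conditions. Under the naive Bayes assumption, the conditional probability factors into a product of per-attribute probabilities, and the denominator is the sum of the play and no-play numerators.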
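The same calculation can be written with named vectors, which makes the prior, likelihood, and normalization steps explicit. This is only a sketch: the variable names `likelihood_play`, `likelihood_no_play`, `prior`, `score`, and `posterior` are my own, and the fractions are copied from Part a.

```r
# Per-attribute conditional probabilities for the query
# (sunny, cool, high humidity, windy), copied from Part a.
likelihood_play    <- c(sunny = 2/9, cool = 3/9, high = 3/9, windy = 3/9)
likelihood_no_play <- c(sunny = 3/5, cool = 1/5, high = 4/5, windy = 3/5)
prior <- c(play = 9/14, no_play = 5/14)

# Unnormalized scores: prior times the product of the likelihoods.
score <- c(play    = prior[["play"]]    * prod(likelihood_play),
           no_play = prior[["no_play"]] * prod(likelihood_no_play))

# Dividing by the sum of the scores plays the role of the evidence term.
posterior <- score / sum(score)
print(posterior)
```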

Part c

I check whether p(a and b) = p(a)*p(b), where a ranges over the values of humidity and b ranges over the values of windy.

windy_high = 3/14
windy_normal = 3/14
not_windy_high = 4/14
not_windy_normal = 4/14

windy = 6/14
not_windy = 8/14
high = 7/14
normal = 7/14

windy*high == windy_high

## [1] TRUE

windy*normal == windy_normal

## [1] TRUE

not_windy*high == not_windy_high

## [1] TRUE

not_windy*normal == not_windy_normal

## [1] TRUE

Since p(a and b) = p(a)*p(b) holds for every value a of humidity and every value b of windy, the attributes windy and humidity are independent.
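Comparing floating-point products with == happens to work for these fractions, but it can fail on values that are equal only up to rounding. A more tolerant version of the check, sketched below, compares the whole joint table against the outer product of its marginals with `all.equal`. The names `joint`, `expected`, and `independent` are assumptions of this sketch.

```r
# Joint distribution of windy (rows) and humidity (columns), from Part c.
joint <- matrix(c(3, 3,
                  4, 4) / 14,
                nrow = 2, byrow = TRUE,
                dimnames = list(windy = c("true", "false"),
                                humidity = c("high", "normal")))

# Under independence, each cell equals the product of its marginals.
expected <- outer(rowSums(joint), colSums(joint))

# all.equal compares within a small numeric tolerance; attributes are
# ignored because outer() does not carry over the dimension names.
independent <- isTRUE(all.equal(joint, expected, check.attributes = FALSE))
print(independent)
```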

Part d

I check whether p(a and b given c) = p(a given c) * p(b given c), where a ranges over the values of humidity, b over the values of windy, and c over the values of play.

windy_high_play = 1/9
windy_normal_play = 2/9
not_windy_high_play = 2/9
not_windy_normal_play = 4/9

windy_high_no_play = 2/5
windy_normal_no_play = 1/5
not_windy_high_no_play = 2/5
not_windy_normal_no_play = 0/5

windy_play*high_play == windy_high_play

## [1] TRUE

windy_play*normal_play == windy_normal_play

## [1] TRUE

windy_not*high_not == windy_high_no_play

## [1] FALSE

windy_not*normal_not == windy_normal_no_play

## [1] FALSE

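The equality holds when play is yes: the two combinations checked above match, and the remaining two do as well (6/9 * 3/9 = 2/9 and 6/9 * 6/9 = 4/9). Humidity and windy are therefore conditionally independent given play = yes. The equality fails when play is no, so they are not conditionally independent given play = no. Because conditional independence given play requires the equality for every value of play, it does not hold overall.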
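All four combinations within each class can be checked at once by building the within-class table and comparing it to the outer product of its marginals. This is a sketch under my own naming (`play_counts`, `no_play_counts`, `is_independent`, and `result` are hypothetical), with counts taken from the numerators of the Part d fractions.

```r
# Counts within each class. Rows: windy true/false. Columns: humidity high/normal.
play_counts    <- matrix(c(1, 2,
                           2, 4), nrow = 2, byrow = TRUE)
no_play_counts <- matrix(c(2, 1,
                           2, 0), nrow = 2, byrow = TRUE)

# TRUE when every cell of the normalized table equals the product of its
# row and column marginals, up to numeric tolerance.
is_independent <- function(counts) {
  joint <- counts / sum(counts)
  expected <- outer(rowSums(joint), colSums(joint))
  isTRUE(all.equal(joint, expected))
}

result <- c(play = is_independent(play_counts),
            no_play = is_independent(no_play_counts))
print(result)
```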

Question 3

First, I would find the new total number of instances by summing the count column, which gives 165. Then I would update the probabilities for all outcomes, weighting each row by its count. For instance, the probability of a low salary is now 86/165, medium is 71/165, and high is 8/165. Hence, the updated initial entropy is -1*(86/165*log(86/165, base = 2) + 71/165*log(71/165, base = 2) + 8/165*log(8/165, base = 2)), which is 1.225166. I would then recompute the remaining entropies the same way to find an updated gain ratio for every attribute.
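The entropy figure can be checked directly from the three class totals. This is a minimal sketch: `entropy_bits` is a hypothetical helper, and only the totals 86, 71, and 8 quoted above are used, since the full count table is not reproduced here.

```r
# Shannon entropy in bits of a vector of counts (hypothetical helper).
entropy_bits <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]              # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# Class totals after weighting each row by its count column.
weighted_counts <- c(low = 86, medium = 71, high = 8)
total_instances <- sum(weighted_counts)
weighted_entropy <- entropy_bits(weighted_counts)

print(total_instances)
print(weighted_entropy)
```

The same helper applies to each attribute value once its rows are summed by count, which is all the weighted version of the gain ratio changes.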
