Dylan Guzman
Question 1
Calculating initial entropy. Low: 3/11. Medium: 6/11. High: 2/11.
initial_entropy = -1*(3/11*log(3/11, base = 2) + 6/11*log(6/11, base = 2) + 2/11*log(2/11, base = 2))
print(initial_entropy)
## [1] 1.435371
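The same calculation can be written as a small reusable function. This is only a sketch: the name `entropy` and the `salary_counts` vector are illustrative and not part of the original notebook; the counts are the 3 low, 6 medium, and 2 high instances listed above.

```r
# Hypothetical helper: Shannon entropy (base 2) of a vector of class counts.
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)  # drop empty classes, since 0 * log(0) is treated as 0
  -sum(p * log2(p))
}

# Class counts taken from the text above (assumed order: low, medium, high).
salary_counts <- c(low = 3, medium = 6, high = 2)
initial_entropy_check <- entropy(salary_counts)
print(initial_entropy_check)
```

Working from counts rather than hand-typed fractions makes it harder to mistype a denominator.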
Calculating gain ratio on the department attribute. Sales: 3/11, with 1 medium and 2 low. Systems: 4/11, with 2 high and 2 medium. Marketing: 2/11, with 2 medium. Secretary: 2/11, with 1 low and 1 medium.
sales_entropy = -3/11*(1/3*log(1/3, base = 2) + 2/3*log(2/3, base = 2))
systems_entropy = -4/11*(1/2*log(1/2, base = 2) + 1/2*log(1/2, base = 2))
# Marketing has a single class (2 of 2 medium), so its entropy is 0
marketing_entropy = -2/11*(2/2*log(2/2, base = 2))
secretary_entropy = -2/11*(1/2*log(1/2, base = 2) + 1/2*log(1/2, base = 2))
department_entropy = sales_entropy+systems_entropy+marketing_entropy+secretary_entropy
department_IG = initial_entropy - department_entropy
# Split information: entropy of the department proportions (out of 11 instances)
department_split = -(3/11*log(3/11, base = 2) + 4/11*log(4/11, base = 2) + 2/11*log(2/11, base = 2) + 2/11*log(2/11, base = 2))
department_gain_ratio = department_IG/department_split
print(department_gain_ratio)
## [1] 0.3302617
Calculating gain ratio on the status attribute. Junior: 6/11, with 3 low and 3 medium. Senior: 5/11, with 3 medium and 2 high.
junior_entropy = -6/11*(1/2*log(1/2,base = 2) + 1/2*log(1/2, base = 2))
senior_entropy = -5/11*(3/5*log(3/5,base = 2) + 2/5*log(2/5, base = 2))
status_entropy = junior_entropy + senior_entropy
status_IG = initial_entropy - status_entropy
status_split = -(5/11*log(5/11,base=2) + 6/11*log(6/11, base=2))
status_gain_ratio = status_IG/status_split
print(status_gain_ratio)
## [1] 0.4512697
Calculating gain ratio on the age attribute. Twenties: 4/11, with 2 low and 2 medium. Thirties: 5/11, with 3 medium, 1 low, and 1 high. Forties: 2/11, with 1 high and 1 medium.
twenties_entropy = -4/11*(1/2*log(1/2, base=2) + 1/2*log(1/2, base = 2))
thirties_entropy = -5/11*(3/5*log(3/5,base=2) + 1/5*log(1/5,base=2) + 1/5*log(1/5,base=2))
forties_entropy = -2/11*(1/2*log(1/2, base=2) + 1/2*log(1/2, base = 2))
age_entropy = twenties_entropy+thirties_entropy+forties_entropy
age_IG = initial_entropy - age_entropy
age_split = -(4/11*log(4/11,base=2) + 5/11*log(5/11,base=2) + 2/11*log(2/11,base=2))
age_gain_ratio = age_IG/age_split
print(age_gain_ratio)
## [1] 0.1784428
Choosing gain ratio
print(paste("Department Gain Ratio:", round(department_gain_ratio, 7)))
## [1] "Department Gain Ratio: 0.3302617"
print(paste("Status Gain Ratio:", round(status_gain_ratio, 7)))
## [1] "Status Gain Ratio: 0.4512697"
print(paste("Age Gain Ratio:", round(age_gain_ratio, 7)))
## [1] "Age Gain Ratio: 0.1784428"
Status has the highest gain ratio. Therefore it should be selected for the first split.
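As a cross-check, the gain ratio can also be computed directly from a table of counts (attribute values in rows, salary classes in columns). The sketch below is illustrative only: `entropy`, `gain_ratio`, and `status_tab` are hypothetical names, and the counts are the junior and senior breakdowns stated above.

```r
# Hypothetical helper: Shannon entropy (base 2) of a vector of counts.
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

# Hypothetical helper: gain ratio of an attribute from its contingency table.
gain_ratio <- function(tab) {
  weights <- rowSums(tab) / sum(tab)                      # proportion of instances per attribute value
  conditional <- sum(weights * apply(tab, 1, entropy))    # weighted entropy after the split
  info_gain <- entropy(colSums(tab)) - conditional        # entropy before minus entropy after
  split_info <- entropy(rowSums(tab))                     # entropy of the split itself
  info_gain / split_info
}

# Status attribute: junior has 3 low and 3 medium; senior has 3 medium and 2 high.
status_tab <- matrix(c(3, 3, 0,
                       0, 3, 2),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("junior", "senior"), c("low", "medium", "high")))
status_check <- gain_ratio(status_tab)
print(status_check)
```

Because the split information is derived from the same table as the entropies, the proportions always share one denominator.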
Question 2
Part a
play = 9/14
no_play = 5/14
sunny_play = 2/9
sunny_not = 3/5
overcast_play = 4/9
overcast_not = 0/5
rainy_play = 3/9
rainy_not = 2/5
hot_play = 2/9
hot_not = 2/5
mild_play = 4/9
mild_not = 2/5
cool_play = 3/9
cool_not = 1/5
high_play = 3/9
high_not = 4/5
normal_play = 6/9
normal_not = 1/5
windy_play = 3/9
windy_not = 3/5
not_windy_play = 6/9
not_windy_not = 2/5
Part b
conditions_given_play = sunny_play*cool_play*high_play*windy_play
conditions_given_not_play = sunny_not*cool_not*high_not*windy_not
prob_play = conditions_given_play*play/(conditions_given_play*play + conditions_given_not_play*no_play)
prob_not_play = conditions_given_not_play*no_play/(conditions_given_play*play + conditions_given_not_play*no_play)
prob_play
## [1] 0.2045827
prob_not_play
## [1] 0.7954173
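prob_play is the posterior probability of playing given the conditions in part b (sunny, cool, high humidity, windy). By Bayes' rule, it is the probability of those conditions given that tennis is played, times the prior probability that tennis is played, divided by the total probability of the conditions. The probability of the conditions given each class is the product of the per-attribute probabilities from part a, which assumes the attributes are conditionally independent given the class.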
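The normalisation step can be sketched as a small function that handles both classes at once. The name `naive_bayes_posterior` and its argument layout are assumptions made for illustration; the probabilities are the ones defined in part a.

```r
# Hypothetical helper: posterior over classes from priors and per-attribute likelihoods.
naive_bayes_posterior <- function(priors, likelihoods) {
  # priors: named numeric vector, one entry per class
  # likelihoods: named list, one numeric vector of P(attribute value | class) per class
  scores <- sapply(names(priors), function(cls) priors[[cls]] * prod(likelihoods[[cls]]))
  scores / sum(scores)  # dividing by the sum is dividing by P(conditions)
}

posterior <- naive_bayes_posterior(
  priors = c(play = 9/14, no_play = 5/14),
  likelihoods = list(
    play    = c(sunny = 2/9, cool = 3/9, high = 3/9, windy = 3/9),
    no_play = c(sunny = 3/5, cool = 1/5, high = 4/5, windy = 3/5)
  )
)
print(posterior)
```

The two entries of `posterior` sum to 1 by construction, which is a quick sanity check on the hand calculation.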
Part c
To test independence, I check whether p(a and b) = p(a)*p(b), where a ranges over the values of humidity and b over the values of windy.
windy_high = 3/14
windy_normal = 3/14
not_windy_high = 4/14
not_windy_normal = 4/14
windy = 6/14
not_windy = 8/14
high = 7/14
normal = 7/14
windy*high == windy_high
## [1] TRUE
windy*normal == windy_normal
## [1] TRUE
not_windy*high == not_windy_high
## [1] TRUE
not_windy*normal == not_windy_normal
## [1] TRUE
Since p(a and b) = p(a)*p(b) holds for every value a of humidity and every value b of windy, the attributes windy and humidity are independent.
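One caveat about comparing probabilities with `==`: floating-point products are not always bit-for-bit equal to the fraction they represent on paper. The sketch below shows a tolerance-based comparison with base R's `all.equal`; the variable names ending in `_check` or starting with `exact_` and `tolerant_` are illustrative only.

```r
# Exact comparison can fail for values that are equal on paper.
exact_equal <- (0.1 + 0.2 == 0.3)                      # FALSE in double-precision arithmetic
tolerant_equal <- isTRUE(all.equal(0.1 + 0.2, 0.3))    # TRUE: equal within a small tolerance

# The same tolerance-based check applied to one of the independence tests above.
windy <- 6/14
high <- 7/14
windy_high <- 3/14
windy_high_check <- isTRUE(all.equal(windy * high, windy_high))

print(c(exact_equal, tolerant_equal, windy_high_check))
```

The exact comparisons in this part happened to return TRUE, but the tolerance-based form does not depend on that.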
Part d
To test conditional independence, I check whether p(a and b given c) = p(a given c) * p(b given c), where a ranges over the values of humidity, b over the values of windy, and c over the values of play.
windy_high_play = 1/9
windy_normal_play = 2/9
not_windy_high_play = 2/9
not_windy_normal_play = 4/9
windy_high_no_play = 2/5
windy_normal_no_play = 1/5
not_windy_high_no_play = 2/5
not_windy_normal_no_play = 0/5
windy_play*high_play == windy_high_play
## [1] TRUE
windy_play*normal_play == windy_normal_play
## [1] TRUE
windy_not*high_not == windy_high_no_play
## [1] FALSE
windy_not*normal_not == windy_normal_no_play
## [1] FALSE
Since the equality holds when play is yes, humidity and windy are conditionally independent given play = yes. However, the equality fails when play is no, so they are not conditionally independent given play = no. Humidity and windy are therefore not conditionally independent given play in general.
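The check can be run over all four windy and humidity combinations for each class at once by storing the counts as one table per class. This is a sketch under assumptions: `is_independent` and `windy_by_humidity` are hypothetical names, and the counts are the numerators of the conditional probabilities defined above (out of 9 for play, out of 5 for no play).

```r
# Hypothetical helper: TRUE when every joint proportion equals the product of its marginals.
is_independent <- function(tab, tol = 1e-12) {
  joint <- tab / sum(tab)
  expected <- outer(rowSums(joint), colSums(joint))  # product of marginals for each cell
  max(abs(joint - expected)) < tol
}

# Rows: windy yes / no. Columns: humidity high / normal.
windy_by_humidity <- list(
  play = matrix(c(1, 2,
                  2, 4), nrow = 2, byrow = TRUE,
                dimnames = list(windy = c("yes", "no"), humidity = c("high", "normal"))),
  no_play = matrix(c(2, 1,
                     2, 0), nrow = 2, byrow = TRUE,
                   dimnames = list(windy = c("yes", "no"), humidity = c("high", "normal")))
)

conditional_independence <- sapply(windy_by_humidity, is_independent)
print(conditional_independence)
```

Checking every cell, rather than two per class, rules out the case where the first few products match by coincidence.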
Question 3
First, I would find the new total number of instances by summing the count column, which is 165. Next, I would update the probabilities for all outcomes. For instance, the probability of a low salary is now 86/165, medium is 71/165, and high is 8/165. Hence, the updated initial entropy formula is -1*(86/165*log(86/165, base = 2) + 71/165*log(71/165, base = 2) + 8/165*log(8/165, base = 2)), which is 1.225166. I would then do the same for the rest of the entropies to find an updated gain ratio for all attributes.
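A minimal sketch of the count-weighted version, assuming the weights have already been summed per salary class (86, 71, and 8, as stated above). The function name `weighted_entropy` is hypothetical; with the full table, the same function would take one label and one count per row.

```r
# Hypothetical helper: entropy where each row contributes its count instead of 1.
weighted_entropy <- function(labels, weights) {
  totals <- tapply(weights, labels, sum)  # total count per class
  p <- totals / sum(totals)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Per-class totals taken from the text above.
updated_entropy <- weighted_entropy(c("low", "medium", "high"), c(86, 71, 8))
print(updated_entropy)
```

The same weighting would apply inside each attribute value when recomputing the conditional entropies and split information.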