Problem 1.1 - Predicting B or not B
Before building models, let’s consider a baseline method that always predicts the most frequent outcome, which is “not B”. What is the accuracy of this baseline method on the test set?
library(caTools)
letters = read.csv("data/letters_ABPR.csv")
letters$isB = as.factor(letters$letter == "B")
set.seed(1000)
spl = sample.split(letters$isB, SplitRatio = 0.5)
train = subset(letters, spl == TRUE)
test = subset(letters, spl == FALSE)
table(train$isB)
FALSE TRUE
1175 383
table(test$isB)
FALSE TRUE
1175 383
1175/(1175+383)
[1] 0.754172
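The same baseline can be derived from the class counts instead of typing the numbers by hand. A minimal sketch, assuming only the test-set counts printed above; `counts`, `majority` and `baseline_acc` are illustrative names that do not appear in the original notebook:

```r
# Test-set class counts copied from the table(test$isB) output above
counts <- c("FALSE" = 1175, "TRUE" = 383)

# The baseline always predicts the most frequent class
majority <- names(which.max(counts))

# Its accuracy is the share of observations in that class
baseline_acc <- max(counts) / sum(counts)

print(majority)
print(baseline_acc)
```

With the data frame loaded, the equivalent one-liner would be `max(table(test$isB)) / nrow(test)`.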
Problem 1.2 - Predicting B or not B
What is the accuracy of the CART model on the test set? (Use type="class" when making predictions on the test set.)
library(rpart)
CARTb = rpart(isB ~ . - letter, data=train, method="class")
predictions = predict(CARTb, newdata=test, type="class")
table(test$isB, predictions)
       predictions
        FALSE TRUE
  FALSE  1118   57
  TRUE     43  340
(1118+340)/nrow(test)
[1] 0.9358151
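Accuracy is the sum of the diagonal of the confusion matrix divided by the total number of observations. A small sketch that rebuilds the CART confusion matrix from the output above and computes the accuracy without hard-coding the cells; `cm` and `accuracy` are hypothetical names introduced here for illustration:

```r
# Confusion matrix copied from the table(test$isB, predictions) output above
# (rows = actual isB, columns = predicted isB)
cm <- matrix(c(1118,  57,
                 43, 340),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("FALSE", "TRUE"),
                             predicted = c("FALSE", "TRUE")))

# Accuracy = correctly classified observations / all observations
accuracy <- function(m) sum(diag(m)) / sum(m)

acc <- accuracy(cm)
print(acc)
```

In the notebook itself, `accuracy(table(test$isB, predictions))` should give the same value, since `table()` returns an object that `diag()` and `sum()` accept.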
Problem 1.3 - Predicting B or Not B
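What is the accuracy of the random forest model on the test set?
library(randomForest)
set.seed(1000)
RFb = randomForest(isB ~ . - letter, data=train)
predictions = predict(RFb, newdata=test)
table(test$isB, predictions)
(1163+375)/nrow(test)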
[1] 0.987163
Problem 2.1 - Predicting the letters A, B, P, R
What is the baseline accuracy on the testing set?
set.seed(2000)
spl = sample.split(letters$letter, SplitRatio = 0.5)
train2 = subset(letters, spl == TRUE)
test2 = subset(letters, spl == FALSE)
table(train2$letter)
A B P R
394 383 402 379
table(test2$letter)
A B P R
395 383 401 379
401/nrow(test2)
[1] 0.2573813
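### The baseline predicts P, the letter with the most observations in the test set, because that gives the highest accuracy. The split is 50-50, but A and P have odd totals, so their train and test counts differ.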
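The train and test counts differ for A and P because a 50-50 split cannot divide an odd total evenly. A short sketch that checks this from the counts printed above and recomputes the baseline; the vector names are illustrative and not taken from the notebook:

```r
# Class counts copied from the table(train2$letter) and table(test2$letter) output
train2_counts <- c(A = 394, B = 383, P = 402, R = 379)
test2_counts  <- c(A = 395, B = 383, P = 401, R = 379)

# Total observations per letter in the full data set
totals <- train2_counts + test2_counts

# Letters with an odd total cannot be split into two equal halves
odd_letters <- names(totals)[totals %% 2 == 1]
print(odd_letters)

# Baseline: always predict the most frequent letter in the test set
baseline_letter <- names(which.max(test2_counts))
baseline_acc <- max(test2_counts) / sum(test2_counts)
print(baseline_letter)
print(baseline_acc)
```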
Problem 2.2 - Predicting the letters A, B, P, R
CARTletter = rpart(letter ~ . - isB, data=train2, method="class")
predictLetter = predict(CARTletter, newdata=test2, type="class")
table(test2$letter, predictLetter)
   predictLetter
      A   B   P   R
  A 348   4   0  43
  B   8 318  12  45
  P   2  21 363  15
  R  10  24   5 340
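(348+318+363+340)/nrow(test2)
[1] 0.8786906
### Dividing by nrow(test2) works because it is the total number of observations in the test data frame, and the entries on the table's diagonal (348, 318, 363, 340) are counts of correctly classified letters.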
Problem 2.3 - Predicting the letters A, B, P, R
What is the test set accuracy of your random forest model?
set.seed(1000)
RFletter = randomForest(letter ~ . - isB, data=train2)
predictLetter = predict(RFletter, newdata=test2)
table(test2$letter, predictLetter)
   predictLetter
      A   B   P   R
  A 391   0   3   1
  B   0 380   1   2
  P   0   6 394   1
  R   3  14   0 362
(391+380+394+362)/nrow(test2)
[1] 0.9801027
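Overall accuracy hides which letters each model struggles with. The sketch below rebuilds the two four-class confusion matrices from the output above and compares per-class recall (the share of each actual letter predicted correctly). The helper names `make_cm`, `accuracy` and `recall` are assumptions made for this example, not functions from `rpart` or `randomForest`:

```r
classes <- c("A", "B", "P", "R")

# Build a labelled confusion matrix (rows = actual, columns = predicted)
make_cm <- function(values) {
  matrix(values, nrow = 4, byrow = TRUE,
         dimnames = list(actual = classes, predicted = classes))
}

# Cell counts copied from the CART and random forest tables above
cart_cm <- make_cm(c(348,   4,   0,  43,
                       8, 318,  12,  45,
                       2,  21, 363,  15,
                      10,  24,   5, 340))
rf_cm   <- make_cm(c(391,   0,   3,   1,
                       0, 380,   1,   2,
                       0,   6, 394,   1,
                       3,  14,   0, 362))

accuracy <- function(m) sum(diag(m)) / sum(m)   # correct / total
recall   <- function(m) diag(m) / rowSums(m)    # correct / actual, per class

print(c(CART = accuracy(cart_cm), RF = accuracy(rf_cm)))
print(round(rbind(CART = recall(cart_cm), RF = recall(rf_cm)), 3))
```

Both models make their largest share of errors on R, though the random forest misclassifies far fewer of them (17 of 379, against 39 for CART).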