Menyusun Model Prediksi itu menantang dan menyenangkan, Permasalahan yang sering dihadapi adalah bagaimana prediksi itu bisa dihitung dan diperkirakan. Penjelasan melalui model “Black Box” bukanlah soal yang mudah, oleh karenanya kita butuh penjelasan bagaimana proses itu dihitung dengan gamblang.
Kita coba dengan data hasil ujian PISA 2012 yang datanya bisa di download dan diinstall librarynya pada [Data Pisa] (https://github.com/pbiecek/PISA2012lite)
Untuk kasus ini dipilih : Indonesia, Malaysia, Thailand, Singapore, Vietnam, dan Cina
## 'data.frame': 18002 obs. of 14 variables:
## $ CNT : Factor w/ 5 levels "Indonesia","Malaysia",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ OECD : Factor w/ 2 levels "Non-OECD","OECD": 1 1 1 1 1 1 1 1 1 1 ...
## $ ST04Q01 : Factor w/ 2 levels "Female","Male": 1 2 1 1 1 1 1 2 1 1 ...
## $ ST05Q01 : Factor w/ 3 levels "No ","Yes, for one year or less ",..: 3 3 3 3 3 2 2 1 2 1 ...
## $ ST08Q01 : Factor w/ 4 levels "None ","One or two times ",..: 1 1 1 1 1 2 2 1 2 3 ...
## $ ST09Q01 : Factor w/ 4 levels "None ","One or two times ",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ST115Q01: num 1 1 1 1 1 2 1 1 2 2 ...
## $ ST20Q01 : Factor w/ 2 levels "Country of test",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ST27Q01 : Factor w/ 4 levels "None","One","Two",..: 2 4 4 4 4 4 2 1 2 3 ...
## $ ST27Q02 : Factor w/ 4 levels "None","One","Two",..: 1 4 4 3 4 2 2 1 2 2 ...
## $ ST46Q04 : Factor w/ 4 levels "Strongly agree",..: 2 2 2 2 1 2 2 1 2 1 ...
## $ WEALTH : num -1.87 0.05 -0.96 -1.41 -0.23 -1.05 -4.34 -4.34 -3.67 -3.16 ...
## $ ISCEDO : Factor w/ 4 levels "General","Pre-Vocational",..: 1 1 1 1 1 3 3 3 3 3 ...
## $ PV1MATH : num 408 385 529 388 482 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:180945] 4 6 7 9 10 11 17 18 19 20 ...
## .. ..- attr(*, "names")= chr [1:180945] "4" "6" "7" "9" ...
## CNT OECD ST04Q01
## Indonesia:3535 Non-OECD:18002 Female:9596
## Malaysia :3262 OECD : 0 Male :8406
## Singapore:3648
## Thailand :4333
## Vietnam :3224
##
## ST05Q01 ST08Q01
## No : 2823 None :13271
## Yes, for one year or less : 3462 One or two times : 3625
## Yes, for more than one year:11717 Three or four times : 688
## Five or more times : 418
##
##
## ST09Q01 ST115Q01 ST20Q01
## None :15158 Min. :1.000 Country of test:17420
## One or two times : 2318 1st Qu.:1.000 Other country : 582
## Three or four times : 351 Median :1.000
## Five or more times : 175 Mean :1.231
## 3rd Qu.:1.000
## Max. :4.000
## ST27Q01 ST27Q02 ST46Q04
## None : 354 None : 517 Strongly agree :3158
## One : 2261 One :8681 Agree :9440
## Two : 3113 Two :5739 Disagree :4926
## Three or more:12274 Three or more:3065 Strongly disagree: 478
##
##
## WEALTH ISCEDO PV1MATH
## Min. :-6.650 General :16223 Min. :107.7
## 1st Qu.:-2.300 Pre-Vocational: 0 1st Qu.:384.3
## Median :-1.320 Vocational : 1757 Median :453.5
## Mean :-1.369 Modular : 22 Mean :464.9
## 3rd Qu.:-0.410 3rd Qu.:538.1
## Max. : 3.030 Max. :885.9
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## CNT OECD ST04Q01 ST05Q01 ST08Q01 ST09Q01 ST20Q01 ST27Q01 ST27Q02
## 5 1 2 3 4 4 2 4 4
## ST46Q04 ISCEDO
## 4 3
Tampilan Data dalam tabulasi
library(knitr)
library(dplyr)
pisa[1:10,] %>%
knitr::kable(caption = "First 10 rows")
PV1MATH | CNT | ST04Q01 | ST05Q01 | ST08Q01 | ST09Q01 | ST115Q01 | ST20Q01 | ST27Q01 | ST27Q02 | ST46Q04 | WEALTH | ISCEDO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
221390 | 407.5479 | Indonesia | Female | Yes, for more than one year | None | None | 1 | Country of test | One | None | Agree | -1.87 | General |
221392 | 385.1924 | Indonesia | Male | Yes, for more than one year | None | None | 1 | Country of test | Three or more | Three or more | Agree | 0.05 | General |
221394 | 528.9843 | Indonesia | Female | Yes, for more than one year | None | None | 1 | Country of test | Three or more | Three or more | Agree | -0.96 | General |
221395 | 388.2303 | Indonesia | Female | Yes, for more than one year | None | None | 1 | Country of test | Three or more | Two | Agree | -1.41 | General |
221396 | 482.3259 | Indonesia | Female | Yes, for more than one year | None | None | 1 | Country of test | Three or more | Three or more | Strongly agree | -0.23 | General |
221397 | 263.6781 | Indonesia | Female | Yes, for one year or less | One or two times | None | 2 | Country of test | Three or more | One | Agree | -1.05 | Vocational |
221398 | 368.7568 | Indonesia | Female | Yes, for one year or less | One or two times | None | 1 | Country of test | One | One | Agree | -4.34 | Vocational |
221401 | 387.2176 | Indonesia | Male | No | None | None | 1 | Country of test | None | None | Strongly agree | -4.34 | Vocational |
221403 | 430.6823 | Indonesia | Female | Yes, for one year or less | One or two times | None | 2 | Country of test | One | One | Agree | -3.67 | Vocational |
221404 | 413.2341 | Indonesia | Female | No | Three or four times | None | 2 | Country of test | Two | One | Strongly agree | -3.16 | Vocational |
Tampilan Norma
library(OneR)
library(ggplot2)
pisa$PIV1MATH.l <- bin(pisa$PV1MATH, nbins = 3, method = "content")
intervals <- paste(levels(pisa$PIV1MATH.l), collapse = " ")
intervals <- gsub("\\(|]", "", intervals)
intervals <- gsub(",", " ", intervals)
intervals <- as.numeric(unique(strsplit(intervals, " ")[[1]]))
pisa %>%
ggplot() +
geom_density(aes(x = PV1MATH), color = "blue", fill = "blue", alpha = 0.4) +
geom_vline(xintercept = intervals[2]) +
geom_vline(xintercept = intervals[3])
intervals[2]
## [1] 407
intervals[3]
## [1] 505
Tampilan Grafik perbandingan Negara
pisa <- select(pisa, -PV1MATH) %>%
mutate(PIV1MATH.l = plyr::revalue(PIV1MATH.l, c("(107,407]" = "low", "(407,505]" = "medium", "(505,887]" = "high")))
pisa %>%
ggplot(aes(x = CNT, fill = PIV1MATH.l)) +
geom_bar(position = "dodge", alpha = 0.7) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
plot.margin = unit(c(0, 0, 0, 1.5), "cm")) +
scale_fill_brewer(palette = "Set1")
str(pisa)
## 'data.frame': 18002 obs. of 13 variables:
## $ CNT : Factor w/ 5 levels "Indonesia","Malaysia",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ST04Q01 : Factor w/ 2 levels "Female","Male": 1 2 1 1 1 1 1 2 1 1 ...
## $ ST05Q01 : Factor w/ 3 levels "No ","Yes, for one year or less ",..: 3 3 3 3 3 2 2 1 2 1 ...
## $ ST08Q01 : Factor w/ 4 levels "None ","One or two times ",..: 1 1 1 1 1 2 2 1 2 3 ...
## $ ST09Q01 : Factor w/ 4 levels "None ","One or two times ",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ST115Q01 : num 1 1 1 1 1 2 1 1 2 2 ...
## $ ST20Q01 : Factor w/ 2 levels "Country of test",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ST27Q01 : Factor w/ 4 levels "None","One","Two",..: 2 4 4 4 4 4 2 1 2 3 ...
## $ ST27Q02 : Factor w/ 4 levels "None","One","Two",..: 1 4 4 3 4 2 2 1 2 2 ...
## $ ST46Q04 : Factor w/ 4 levels "Strongly agree",..: 2 2 2 2 1 2 2 1 2 1 ...
## $ WEALTH : num -1.87 0.05 -0.96 -1.41 -0.23 -1.05 -4.34 -4.34 -3.67 -3.16 ...
## $ ISCEDO : Factor w/ 4 levels "General","Pre-Vocational",..: 1 1 1 1 1 3 3 3 3 3 ...
## $ PIV1MATH.l: Factor w/ 3 levels "low","medium",..: 2 1 3 1 2 1 1 1 2 2 ...
library(caret)
## Loading required package: lattice
library(foreach)
set.seed(42)
index <- createDataPartition(pisa$PIV1MATH.l, p = 0.7, list = FALSE)
train_data <- pisa[index, ]
test_data <- pisa[-index, ]
#linier Model
model <- lm(PIV1MATH.l~., data = train_data)
## Warning in model.response(mf, "numeric"): using type = "numeric" with a
## factor response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
Kita mencoba memprediksi untuk siswa dengan data urutan ke 4500
library(breakDown)
library(ggplot2)
new_observation <- test_data[4500,]
explain_1 <- broken(model, new_observation)
explain_1
## contribution
## (Intercept) 2.0e+00
## CNT = Vietnam 4.6e-01
## ST05Q01 = Yes, for one year or less -7.0e-02
## ST08Q01 = None 4.2e-02
## ST09Q01 = None 3.3e-02
## ST27Q01 = Three or more 1.9e-02
## ISCEDO = General 1.6e-02
## ST04Q01 = Male 1.2e-02
## WEALTH = -1.32 5.4e-03
## ST46Q04 = Disagree 4.3e-03
## ST27Q02 = Two -3.4e-03
## ST20Q01 = Country of test -2.8e-03
## ST115Q01 = 1 7.7e-05
## final_prognosis 2.5e+00
## baseline: 0
plot(explain_1) + ggtitle("breakDown plot for random forest predictors of PISA Score")
print(new_observation)
## CNT ST04Q01 ST05Q01 ST08Q01 ST09Q01 ST115Q01
## 15028 Vietnam Male Yes, for one year or less None None 1
## ST20Q01 ST27Q01 ST27Q02 ST46Q04 WEALTH ISCEDO
## 15028 Country of test Three or more Two Disagree -1.32 General
## PIV1MATH.l
## 15028 high
Dari perkiraan model score akan jatuh di level tinggi = 2.5 , sedangkan data nyatanya ke 4500 nyatanya ada di tingkat + High tinggi juga artinya model dan prediksi dan data sudah cocok
Kita mencoba memprediksi untuk siswa dengan data urutan ke 750
new_observation <- pisa[750,]
explain_1 <- broken(model, new_observation)
explain_1
## contribution
## (Intercept) 2.0e+00
## CNT = Indonesia -3.7e-01
## WEALTH = -2.75 -1.8e-01
## ST08Q01 = One or two times -1.0e-01
## ST05Q01 = Yes, for more than one year 7.5e-02
## ST09Q01 = None 3.3e-02
## ST27Q01 = Two -2.9e-02
## ISCEDO = General 1.6e-02
## ST46Q04 = Strongly agree 1.3e-02
## ST04Q01 = Male 1.2e-02
## ST20Q01 = Country of test -2.8e-03
## ST27Q02 = One -2.2e-03
## ST115Q01 = 1 7.7e-05
## final_prognosis 1.5e+00
## baseline: 0
plot(explain_1) + ggtitle("breakDown plot for linear predictors of PISA SCore")
print(new_observation)
## CNT ST04Q01 ST05Q01 ST08Q01
## 750 Indonesia Male Yes, for more than one year One or two times
## ST09Q01 ST115Q01 ST20Q01 ST27Q01 ST27Q02 ST46Q04 WEALTH
## 750 None 1 Country of test Two One Strongly agree -2.75
## ISCEDO PIV1MATH.l
## 750 General low
Dari model perkiraan dengan data yang ada diprediksi di level rendah, sedangkan angka nyatanya juga di level rendah. terbukti lagi model prediksinya sudah cocok.