Data yang digunakan adalah data dari proyek-proyek di kickstarter.
Saya akan membuat model yang dapat memprediksi apakah proyek tersebut akan sukses terdanai sesuai target (successful) atau gagal (failed).
library(caret)
library(tidyverse)
library(dplyr)
library(car)
library(ggplot2)
library(class)
library(pROC)
kickst <- data.frame(read_csv("kickstarter18.csv"))
Pada fase ini data dibersihkan dengan cara:
1. Menghilangkan NA menggunakan fungsi na.omit
2. Mengecek keunikan data menggunakan fungsi unique dan mencocokkan ID
3. Menstandardisasi format tanggal menggunakan as.date
4. Mengubah “state” menjadi factor karena variabel tsb adalah target variabelnya
kickst <- na.omit(kickst)
kickst <- kickst[match(kickst$ID,unique(kickst$ID)),]
kickst$launched <- as.Date(kickst$launched)
kickst$state <- as.factor(kickst$state)
prop.table(table(kickst$state))
##
## canceled failed live successful suspended
## 0.103390599 0.527159473 0.007464120 0.357069306 0.004916502
Setelah dilihat menggunakan tabel, variabel “state” memiliki 5 tipe, tetapi untuk analisa model disini kita hanya menggunakan 2 tipe saja yaitu “failed” dan “successful”.
Pada fase ini:
1. Data di filter untuk baris yang memiliki state “successful” dan “failed”
2. Data kolom “deadline” dan “launched” diubah menjadi durasi dari waktu pendanaan proyek. Perubahan ditujukan agar didapatkan variabel berbentuk numerik yang relevan terhadap model.
kickst1 <- kickst %>%
filter(state %in% c("successful","failed")) %>%
mutate(duration = deadline - launched,
state = factor(state,levels = c("failed","successful"))) %>%
select(-ID,-name,-deadline,-launched,-category,-main_category,-currency,-country)
kickst1$duration <- as.numeric(kickst1$duration)
table(kickst1$state)
##
## failed successful
## 197611 133851
kick <- kickst1
set.seed(212)
index <- sample(nrow(kick),nrow(kick)*0.8)
kickTrain <- kick[index, ]
kickTest <- kick[-index, ]
Membuat data alternatif dengan melakukan downsample agar target variabel menjadi seimbang.
table(kickTrain$state)
##
## failed successful
## 158030 107139
kickTrainD <- downSample(x = kickTrain[,-3],y = kickTrain$state,yname = "state")
kickTestD <- kickTest
table(kickTrainD$state)
##
## failed successful
## 107139 107139
modA <- glm(state~.,kickTrain,family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
modB <- glm(state~.,kickTrainD,family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
modA dimodelkan menggunakan data yang tidak di downsample
modB dimodelkan menggunakan data yang di downsample
vif(modA)
## goal pledged backers usd.pledged
## 1.233075e+06 1.233864e+06 1.631058e+00 5.704568e+01
## usd_pledged_real usd_goal_real duration
## 5.193391e+05 5.185004e+05 1.015981e+00
vif(modB)
## goal pledged backers usd.pledged
## 7.803311e+05 7.794623e+05 1.499561e+00 4.620274e+02
## usd_pledged_real usd_goal_real duration
## 1.776186e+05 1.891139e+05 1.016647e+00
Kedua model telah memenuhi asumsi multicolinearity vif < 10
Membuat prediksi dari model kedalam bentuk probabilitas lalu kedalam bentuk target variabelnya. Untuk kasus ini, menurut saya lebih baik memiliki false negative dibandingkan false positive. Maka dari itu nilai dari thresholdnya saya buat cukup besar agar memperkecil kemungkinan adanya false positive
kickTrain$prob <- predict(modA,kickTrain,type = "response")
kickTrain$pred <- as.factor(ifelse(kickTrain$prob > 0.999,"successful","failed"))
kickTest$prob <- predict(modA,kickTest,type = "response")
kickTest$pred <- as.factor(ifelse(kickTest$prob > 0.999,"successful","failed"))
kickTrainD$prob <- predict(modB,kickTrainD,type = "response")
kickTrainD$pred <- as.factor(ifelse(kickTrainD$prob > 0.999,"successful","failed"))
kickTestD$prob <- predict(modB,kickTestD,type = "response")
kickTestD$pred <- as.factor(ifelse(kickTestD$prob > 0.999,"successful","failed"))
modelATr <- confusionMatrix(kickTrain$pred,kickTrain$state)
modelATs <- confusionMatrix(kickTest$pred,kickTest$state)
plot(roc(kickTest$state,kickTest$prob))
## Setting levels: control = failed, case = successful
## Setting direction: controls < cases
Dari plot ROC diatas bahwa model A memiliki tingkat akurasi yang baik.
performa <- function(cutoff, prob, ref, postarget, negtarget)
{
predict <- as.factor(ifelse(prob >= cutoff, postarget, negtarget))
conf <- caret::confusionMatrix(predict , ref, positive = postarget)
acc <- conf$overall[1]
rec <- conf$byClass[1]
prec <- conf$byClass[3]
spec <- conf$byClass[2]
mat <- t(as.matrix(c(rec , acc , prec, spec)))
colnames(mat) <- c("recall", "accuracy", "precicion", "specificity")
return(mat)
}
co <- seq(0.01,0.80,length=100)
result <- matrix(0,100,4)
for(i in 1:100){
result[i,] = performa(cutoff = co[i],
prob = kickTest$prob,
ref = kickTest$state,
postarget = "successful",
negtarget = "failed")
}
data_frame("Recall" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "performa", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = performa)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_x_continuous(breaks = seq(0,1,0.1)) +
labs(title = "Tradeoff model perfomance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank())
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.
Dapat dilihat dari plot diatas. Bahwa semakin tinggi cutoffnya maka akan semakin tinggi tingkat Specificitynya. Karena saya ingin meminimalisir False Positive dengan cara memperbesar Specificitynya, maka nilai cutoff threshold menggunakan 0.999
modelBTr <- confusionMatrix(kickTrainD$pred,kickTrainD$state)
modelBTs <- confusionMatrix(kickTestD$pred,kickTestD$state)
plot(roc(kickTestD$state,kickTestD$prob))
## Setting levels: control = failed, case = successful
## Setting direction: controls < cases
Dari plot ROC diatas bahwa model B memiliki tingkat akurasi yang baik.
performa <- function(cutoff, prob, ref, postarget, negtarget)
{
predict <- as.factor(ifelse(prob >= cutoff, postarget, negtarget))
conf <- caret::confusionMatrix(predict , ref, positive = postarget)
acc <- conf$overall[1]
rec <- conf$byClass[1]
prec <- conf$byClass[3]
spec <- conf$byClass[2]
mat <- t(as.matrix(c(rec , acc , prec, spec)))
colnames(mat) <- c("recall", "accuracy", "precicion", "specificity")
return(mat)
}
co <- seq(0.01,0.80,length=100)
result <- matrix(0,100,4)
for(i in 1:100){
result[i,] = performa(cutoff = co[i],
prob = kickTestD$prob,
ref = kickTestD$state,
postarget = "successful",
negtarget = "failed")
}
data_frame("Recall" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "performa", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = performa)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_x_continuous(breaks = seq(0,1,0.1)) +
labs(title = "Tradeoff model perfomance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank())
Dapat dilihat dari plot diatas. Bahwa semakin tinggi cutoffnya maka akan semakin tinggi tingkat Specificitynya. Karena saya ingin meminimalisir False Positive dengan cara memperbesar Specificitynya, maka nilai cutoff threshold menggunakan 0.999
Karena model KNN sangat sensitif terhadap outlier dan jumlah data, maka perlu dihilangkan Outlier dan pengurangan dari data yang digunakan untuk membangun model.
kick2 <- kickst1
summary(kick2)
## goal pledged state
## Min. : 0 Min. : 0 failed :197611
## 1st Qu.: 2000 1st Qu.: 50 successful:133851
## Median : 5000 Median : 782
## Mean : 44266 Mean : 10581
## 3rd Qu.: 15000 3rd Qu.: 4658
## Max. :100000000 Max. :20338986
## backers usd.pledged usd_pledged_real
## Min. : 0.0 Min. : 0 Min. : 0
## 1st Qu.: 2.0 1st Qu.: 25 1st Qu.: 50
## Median : 15.0 Median : 502 Median : 788
## Mean : 116.5 Mean : 7685 Mean : 9940
## 3rd Qu.: 63.0 3rd Qu.: 3421 3rd Qu.: 4609
## Max. :219382.0 Max. :20338986 Max. :20338986
## usd_goal_real duration
## Min. : 0 Min. : 1.00
## 1st Qu.: 2000 1st Qu.:30.00
## Median : 5000 Median :30.00
## Mean : 41523 Mean :33.96
## 3rd Qu.: 15000 3rd Qu.:36.00
## Max. :166361391 Max. :92.00
Setelah melihat summary data, maka data di seleksi menggunakan code dibawah
kick2 <- kick2 %>%
filter(goal <= 15000,
pledged <= 5000,
backers <= 65,
usd.pledged <= 3500,
usd_pledged_real <= 5000,
usd_goal_real <= 15000,
duration <= 36)
Ukuran data dikecilkan menjadi 60% dari data sebelumnya
table(kick2$state)
##
## failed successful
## 97800 39205
set.seed(212)
index2 <- sample(nrow(kick2),nrow(kick2)*0.6)
kick3 <- kick2[index2,]
table(kick3$state)
##
## failed successful
## 58510 23693
range(kick2[,-3])
## [1] 0 15000
normalize <- function(x){
return (
(x - min(x))/(max(x) - min(x))
)
}
kick2 <- kick2 %>%
mutate(duration = as.numeric(duration)) %>%
mutate_if(is.numeric,normalize)
range(kick2[,-3])
## [1] 0 1
Setelah dilakukan scaling, range dari variabel prediktor yang awalnya 0-15000 turun menjadi 0-1
set.seed(212)
index3 <- sample(nrow(kick3),nrow(kick3)*0.8)
kick2Train <- kick3[index3,]
kick2Test <- kick3[-index3,]
kick3Train <- kick3[index3,]
kick3Test <- kick3[-index3,]
Membuat data alternatif dengan melakukan downsample agar target variabel menjadi seimbang.
table(kick2Train$state)
##
## failed successful
## 46754 19008
kick2TrainD <- downSample(x = kick2Train[,-3],y = kick2Train$state,yname = "state")
kick2TestD <- kick2Test
kick3TrainD <- kick2TrainD
kick3TestD <- kick2Test
table(kick2TrainD$state)
##
## failed successful
## 19008 19008
Model KNN dengan menggunakan k = 200
sqrt(nrow(kick2Train))
## [1] 256.441
Dari nilai akar dari jumlah baris, kita mendapatkan k = 257. Tetapi saat di run dengan k = 257. Model menyakatakan “too many ties”. Maka saya coba dengan menurunkan nilai k = 200
kick2Test$predict <- knn(train = kick2Train[ ,-3],
test = kick2Test[ ,-3],
cl = kick2Train[ ,3],
k = 200)
modelC <- confusionMatrix(kick2Test$predict,kick2Test$state,positive="successful")
Model KNN dengan menggunakan k = 50
kick3Test$predict <- knn(train = kick3Train[ ,-3],
test = kick3Test[ ,-3],
cl = kick3Train[ ,3],
k = 50)
modelD <- confusionMatrix(kick3Test$predict,kick3Test$state,positive="successful")
Model KNN dengan menggunakan k = 200 dan data yang sudah di downsample
kick2TestD$predict <- knn(train = kick2TrainD[ ,-8],
test = kick2TestD[ ,-3],
cl = kick2TrainD[ ,8],
k = 200)
modelE <- confusionMatrix(kick2TestD$predict,kick2TestD$state,positive="successful")
Model KNN dengan menggunakan k = 50 dan data yang sudah di downsample
kick3TestD$predict <- knn(train = kick3TrainD[ ,-8],
test = kick3TestD[ ,-3],
cl = kick3TrainD[ ,8],
k = 50)
modelF <- confusionMatrix(kick3TestD$predict,kick3TestD$state,positive="successful")
Model A = Logistic Regression dengan data biasa
list(modelATs$overall[1],modelATs$byClass[1],modelATs$byClass[2])
## [[1]]
## Accuracy
## 0.978776
##
## [[2]]
## Sensitivity
## 0.9999495
##
## [[3]]
## Specificity
## 0.9474019
Model B = Logistic Regression dengan data downsampled
list(modelBTs$overall[1],modelBTs$byClass[1],modelBTs$byClass[2])
## [[1]]
## Accuracy
## 0.9783688
##
## [[2]]
## Sensitivity
## 0.9999495
##
## [[3]]
## Specificity
## 0.9463911
Model C = kNN dengan data biasa, k = 200
list(modelC$overall[1],modelC$byClass[1],modelC$byClass[2])
## [[1]]
## Accuracy
## 0.9945867
##
## [[2]]
## Sensitivity
## 0.9978655
##
## [[3]]
## Specificity
## 0.99328
Model D = kNN dengan data biasa, k = 50
list(modelD$overall[1],modelD$byClass[1],modelD$byClass[2])
## [[1]]
## Accuracy
## 0.9972629
##
## [[2]]
## Sensitivity
## 0.9995731
##
## [[3]]
## Specificity
## 0.9963423
Model E = kNN dengan data downsampled, k = 200
list(modelE$overall[1],modelE$byClass[1],modelE$byClass[2])
## [[1]]
## Accuracy
## 0.9920321
##
## [[2]]
## Sensitivity
## 1
##
## [[3]]
## Specificity
## 0.9888568
Model F = kNN dengan data downsampled, k = 50
list(modelF$overall[1],modelF$byClass[1],modelF$byClass[2])
## [[1]]
## Accuracy
## 0.9951341
##
## [[2]]
## Sensitivity
## 1
##
## [[3]]
## Specificity
## 0.993195
mod.name <- c("A","B","C","D","E","F")
accuracy <- c(modelATs$overall[1],modelBTs$overall[1],modelC$overall[1],modelD$overall[1],modelE$overall[1],modelF$overall[1])
sensitivity <- c(modelATs$byClass[1],modelBTs$byClass[1],modelC$byClass[1],modelD$byClass[1],modelE$byClass[1],modelF$byClass[1])
specificity <- c(modelATs$byClass[2],modelBTs$byClass[2],modelC$byClass[2],modelD$byClass[2],modelE$byClass[2],modelF$byClass[2])
data.frame(mod.name,accuracy,sensitivity,specificity)
## mod.name accuracy sensitivity specificity
## 1 A 0.9787760 0.9999495 0.9474019
## 2 B 0.9783688 0.9999495 0.9463911
## 3 C 0.9945867 0.9978655 0.9932800
## 4 D 0.9972629 0.9995731 0.9963423
## 5 E 0.9920321 1.0000000 0.9888568
## 6 F 0.9951341 1.0000000 0.9931950
Berdasarkan evaluasi performa diatas maka saya memutuskan untuk memilih Model D (kNN dengan data biasa dan k = 50).
Model D memiliki nilai Akurasi dan Specificity yang tertinggi dibanding model lainnya.
Terimakasih