The data used here comes from Kickstarter projects.

I will build a model that predicts whether a project reaches its funding target (successful) or not (failed).

1 Libraries

library(caret)
library(tidyverse)
library(dplyr)
library(car)
library(ggplot2)
library(class)
library(pROC)

2 Data Input and Cleaning

kickst <- data.frame(read_csv("kickstarter18.csv"))

In this phase the data is cleaned by:
1. Removing rows with NA values using na.omit
2. Checking for duplicate projects by matching the IDs against the unique IDs
3. Standardising the date format using as.Date
4. Converting "state" to a factor, since it is the target variable

kickst <- na.omit(kickst)
kickst <- kickst[match(unique(kickst$ID), kickst$ID), ]  # keep the first row for each ID
kickst$launched <- as.Date(kickst$launched)
kickst$state <- as.factor(kickst$state)
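
As an optional sanity check (not part of the original output), we can confirm how many rows survived cleaning and that no duplicate IDs remain:

# Optional check: rows remaining after cleaning and whether any ID is duplicated
nrow(kickst)
anyDuplicated(kickst$ID)   # 0 means every ID is unique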

prop.table(table(kickst$state))
## 
##    canceled      failed        live  successful   suspended 
## 0.103390599 0.527159473 0.007464120 0.357069306 0.004916502

The table shows that "state" has five levels; for the model we keep only two of them, "failed" and "successful".

In this phase:
1. The data is filtered to rows whose state is "successful" or "failed"
2. The "deadline" and "launched" columns are combined into the funding duration of each project, which gives a numeric variable that is more relevant for the model

kickst1 <- kickst %>% 
  filter(state %in% c("successful","failed")) %>% 
  mutate(duration = deadline - launched,
         state = factor(state,levels = c("failed","successful"))) %>% 
  select(-ID,-name,-deadline,-launched,-category,-main_category,-currency,-country)

kickst1$duration <- as.numeric(kickst1$duration)

table(kickst1$state)
## 
##     failed successful 
##     197611     133851

3 Logistic Regression

3.1 Train-Test Split

kick <- kickst1

set.seed(212)
index <-  sample(nrow(kick),nrow(kick)*0.8) 

kickTrain <- kick[index, ] 
kickTest <-  kick[-index, ]
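
As a quick check (not shown in the original report), the class proportions in the two splits can be compared to make sure the random split did not skew the target variable:

# Quick check: class balance should be similar in train and test
prop.table(table(kickTrain$state))
prop.table(table(kickTest$state))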

3.2 Alternative Data (Downsampling)

Create an alternative training set by downsampling, so that the target variable becomes balanced.

table(kickTrain$state)
## 
##     failed successful 
##     158030     107139
kickTrainD <- downSample(x = kickTrain[,-3],y = kickTrain$state,yname = "state")
kickTestD <- kickTest

table(kickTrainD$state)
## 
##     failed successful 
##     107139     107139
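
If discarding rows is a concern, caret also provides upSample() with the same interface; a sketch of that alternative (not used in this analysis):

# Alternative sketch: oversample the minority class instead of downsampling
kickTrainU <- upSample(x = kickTrain[, -3], y = kickTrain$state, yname = "state")
table(kickTrainU$state)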

3.3 Modelling

modA <- glm(state ~ ., data = kickTrain, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
modB <- glm(state ~ ., data = kickTrainD, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

modA is fitted on the original (non-downsampled) data.

modB is fitted on the downsampled data.

3.4 Multicollinearity Assumption

vif(modA)
##             goal          pledged          backers      usd.pledged 
##     1.233075e+06     1.233864e+06     1.631058e+00     5.704568e+01 
## usd_pledged_real    usd_goal_real         duration 
##     5.193391e+05     5.185004e+05     1.015981e+00
vif(modB)
##             goal          pledged          backers      usd.pledged 
##     7.803311e+05     7.794623e+05     1.499561e+00     4.620274e+02 
## usd_pledged_real    usd_goal_real         duration 
##     1.776186e+05     1.891139e+05     1.016647e+00

Only backers and duration have VIF < 10; the amount-related predictors (goal, pledged, usd.pledged, usd_pledged_real, usd_goal_real) show very large VIF values in both models, which is expected since they are near-duplicates of each other expressed in different currencies. Strictly speaking, the no-multicollinearity assumption is therefore not satisfied by either model.
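
One way to address this (a sketch only, not part of the original models) would be to keep only the USD-converted amounts, drop the raw goal/pledged columns, and re-check the VIF after refitting:

# Sketch: drop the near-duplicate amount columns and re-check multicollinearity
modA_reduced <- glm(state ~ backers + usd_pledged_real + usd_goal_real + duration,
                    data = kickTrain, family = "binomial")
vif(modA_reduced)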

3.5 Prediction

The models' predicted probabilities are converted back into the target classes. For this case I consider a false negative preferable to a false positive (a false positive here is a project predicted to be successful that actually fails), so I set the threshold quite high to reduce the chance of false positives.

3.5.1 Model A

kickTrain$prob <- predict(modA,kickTrain,type = "response")
kickTrain$pred <- as.factor(ifelse(kickTrain$prob > 0.999,"successful","failed"))

kickTest$prob <- predict(modA,kickTest,type = "response")
kickTest$pred <- as.factor(ifelse(kickTest$prob > 0.999,"successful","failed"))

3.5.2 Model B

kickTrainD$prob <- predict(modB,kickTrainD,type = "response")
kickTrainD$pred <- as.factor(ifelse(kickTrainD$prob > 0.999,"successful","failed"))

kickTestD$prob <- predict(modB,kickTestD,type = "response")
kickTestD$pred <- as.factor(ifelse(kickTestD$prob > 0.999,"successful","failed"))

3.6 Model Plots

3.6.1 Model A

modelATr <- confusionMatrix(kickTrain$pred,kickTrain$state)
modelATs <- confusionMatrix(kickTest$pred,kickTest$state)
plot(roc(kickTest$state,kickTest$prob))
## Setting levels: control = failed, case = successful
## Setting direction: controls < cases

The ROC curve above suggests that Model A separates the two classes well.
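
To put a number on this, the area under the ROC curve can also be computed with pROC (not run in the original report):

# AUC for Model A on the test set
auc(roc(kickTest$state, kickTest$prob))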

performa <- function(cutoff, prob, ref, postarget, negtarget) 
{
  predict <- as.factor(ifelse(prob >= cutoff, postarget, negtarget))
  conf <- caret::confusionMatrix(predict , ref, positive = postarget)
  acc <- conf$overall[1]
  rec <- conf$byClass[1]
  prec <- conf$byClass[3]
  spec <- conf$byClass[2]
  mat <- t(as.matrix(c(rec , acc , prec, spec))) 
  colnames(mat) <- c("recall", "accuracy", "precision", "specificity")
  return(mat)
}

co <- seq(0.01,0.80,length=100)
result <- matrix(0,100,4)

for(i in 1:100){
  result[i,] = performa(cutoff = co[i], 
                     prob = kickTest$prob, 
                     ref = kickTest$state, 
                     postarget = "successful", 
                     negtarget = "failed")
}

tibble("Recall" = result[,1],
       "Accuracy" = result[,2],
       "Precision" = result[,3],
       "Specificity" = result[,4],
       "Cutoff" = co) %>% 
  gather(key = "performa", value = "value", 1:4) %>% 
  ggplot(aes(x = Cutoff, y = value, col = performa)) +
  geom_line(lwd = 1.5) +
  scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
  scale_x_continuous(breaks = seq(0,1,0.1)) +
  labs(title = "Model performance tradeoff") +
  theme_minimal() +
  theme(legend.position = "top",
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank())

As the plot shows, the higher the cutoff, the higher the specificity. Since I want to minimise false positives by pushing specificity up, I use a cutoff of 0.999.
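
As a rough illustration (the 0.99 target below is arbitrary, not a value from the original analysis), the result matrix could also be used to find the smallest cutoff in co that reaches a desired specificity:

# Sketch: smallest cutoff in `co` whose specificity (column 4 of `result`) reaches a target
target_spec <- 0.99
ok <- which(result[, 4] >= target_spec)
chosen_cutoff <- if (length(ok) > 0) co[min(ok)] else NA
chosen_cutoff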

3.6.2 Model B

modelBTr <- confusionMatrix(kickTrainD$pred,kickTrainD$state)
modelBTs <- confusionMatrix(kickTestD$pred,kickTestD$state)
plot(roc(kickTestD$state,kickTestD$prob))
## Setting levels: control = failed, case = successful
## Setting direction: controls < cases

The ROC curve above suggests that Model B also separates the two classes well.


co <- seq(0.01,0.80,length=100)
result <- matrix(0,100,4)

for(i in 1:100){
  result[i,] = performa(cutoff = co[i], 
                     prob = kickTestD$prob, 
                     ref = kickTestD$state, 
                     postarget = "successful", 
                     negtarget = "failed")
}

tibble("Recall" = result[,1],
       "Accuracy" = result[,2],
       "Precision" = result[,3],
       "Specificity" = result[,4],
       "Cutoff" = co) %>% 
  gather(key = "performa", value = "value", 1:4) %>% 
  ggplot(aes(x = Cutoff, y = value, col = performa)) +
  geom_line(lwd = 1.5) +
  scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
  scale_x_continuous(breaks = seq(0,1,0.1)) +
  labs(title = "Model performance tradeoff") +
  theme_minimal() +
  theme(legend.position = "top",
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank())

As with Model A, the higher the cutoff, the higher the specificity, so I again use a cutoff of 0.999 to minimise false positives.

4 K-Nearest Neighbors

Because a KNN model is very sensitive to outliers and to the size of the data, the outliers are removed and the amount of data used to build the model is reduced.

4.1 Outlier Removal

kick2 <- kickst1
summary(kick2)
##       goal              pledged                state       
##  Min.   :        0   Min.   :       0   failed    :197611  
##  1st Qu.:     2000   1st Qu.:      50   successful:133851  
##  Median :     5000   Median :     782                      
##  Mean   :    44266   Mean   :   10581                      
##  3rd Qu.:    15000   3rd Qu.:    4658                      
##  Max.   :100000000   Max.   :20338986                      
##     backers          usd.pledged       usd_pledged_real  
##  Min.   :     0.0   Min.   :       0   Min.   :       0  
##  1st Qu.:     2.0   1st Qu.:      25   1st Qu.:      50  
##  Median :    15.0   Median :     502   Median :     788  
##  Mean   :   116.5   Mean   :    7685   Mean   :    9940  
##  3rd Qu.:    63.0   3rd Qu.:    3421   3rd Qu.:    4609  
##  Max.   :219382.0   Max.   :20338986   Max.   :20338986  
##  usd_goal_real          duration    
##  Min.   :        0   Min.   : 1.00  
##  1st Qu.:     2000   1st Qu.:30.00  
##  Median :     5000   Median :30.00  
##  Mean   :    41523   Mean   :33.96  
##  3rd Qu.:    15000   3rd Qu.:36.00  
##  Max.   :166361391   Max.   :92.00

Based on the summary above, the data is filtered with the code below.

kick2 <- kick2 %>% 
  filter(goal <= 15000,
         pledged <= 5000,
         backers <= 65,
         usd.pledged <= 3500,
         usd_pledged_real <= 5000,
         usd_goal_real <= 15000,
         duration <= 36)
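
The thresholds above are read off the summary by hand; a more automatic alternative (a sketch only, not what was used to produce the results below) would be the usual Q3 + 1.5 * IQR rule applied to every numeric column:

# Sketch: IQR-based outlier filter as an alternative to hand-picked thresholds
iqr_cap <- function(x) quantile(x, 0.75) + 1.5 * IQR(x)
num_cols <- sapply(kickst1, is.numeric)
keep <- Reduce(`&`, lapply(kickst1[, num_cols], function(x) x <= iqr_cap(x)))
kick2_iqr <- kickst1[keep, ]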

4.2 Data Resizing

The data is reduced to 60% of its previous size.

table(kick2$state)
## 
##     failed successful 
##      97800      39205
set.seed(212)
index2 <-  sample(nrow(kick2),nrow(kick2)*0.6) 

kick3 <- kick2[index2,]

table(kick3$state)
## 
##     failed successful 
##      58510      23693

4.3 Scaling

range(kick2[,-3])
## [1]     0 15000
normalize <- function(x){
  return ( 
    (x - min(x))/(max(x) - min(x)) 
           )
}

kick2 <- kick2 %>% 
  mutate(duration = as.numeric(duration)) %>% 
  mutate_if(is.numeric,normalize) 
 

range(kick2[,-3])
## [1] 0 1

After scaling, the range of the numeric predictors drops from 0-15000 to 0-1.
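
In a stricter setup the test split would be rescaled using the training split's minimum and maximum, so the preprocessing uses no information from the test set; a minimal sketch of that idea (the helper norm_with is hypothetical, not part of the original code):

# Sketch: normalise new data against a reference (training) column
norm_with <- function(x, ref) (x - min(ref)) / (max(ref) - min(ref))
# usage: scaled_test_col <- norm_with(test_column, train_column)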

4.4 Train-Test Split

set.seed(212)
index3 <-  sample(nrow(kick3),nrow(kick3)*0.8)

kick2Train <- kick3[index3,]
kick2Test <- kick3[-index3,]

kick3Train <- kick3[index3,]
kick3Test <- kick3[-index3,]

4.5 Alternative Data (Downsampling)

Create an alternative training set by downsampling, so that the target variable becomes balanced.

table(kick2Train$state)
## 
##     failed successful 
##      46754      19008
kick2TrainD <- downSample(x = kick2Train[,-3],y = kick2Train$state,yname = "state")
kick2TestD <- kick2Test

kick3TrainD <- kick2TrainD
kick3TestD <- kick2Test

table(kick2TrainD$state)
## 
##     failed successful 
##      19008      19008

4.6 Modelling

4.6.1 Model C

A KNN model with k = 200.

sqrt(nrow(kick2Train))
## [1] 256.441

Taking the square root of the number of training rows suggests k ≈ 257, but running knn() with k = 257 produces a "too many ties" error, so I lower k to 200.
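
Before committing to a single value, a few candidate k values could also be compared on the test split; a rough sketch (the candidate grid is illustrative, not from the original analysis):

# Rough sketch: compare a few candidate k values by test-set accuracy
for (k_try in c(50, 100, 150, 200)) {
  pred_k <- knn(train = kick2Train[, -3],
                test = kick2Test[, -3],
                cl = kick2Train[, 3],
                k = k_try)
  cat("k =", k_try, "accuracy:", mean(pred_k == kick2Test$state), "\n")
}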

kick2Test$predict <- knn(train = kick2Train[ ,-3],
                         test = kick2Test[ ,-3],
                         cl =  kick2Train[ ,3],
                         k = 200)

modelC <- confusionMatrix(kick2Test$predict,kick2Test$state,positive="successful")

4.6.2 Model D

A KNN model with k = 50.

kick3Test$predict <- knn(train = kick3Train[ ,-3],
                         test = kick3Test[ ,-3],
                         cl =  kick3Train[ ,3],
                         k = 50)

modelD <- confusionMatrix(kick3Test$predict,kick3Test$state,positive="successful")

4.6.3 Model E

A KNN model with k = 200 on the downsampled data.

kick2TestD$predict <- knn(train = kick2TrainD[ ,-8],
                         test = kick2TestD[ ,-3],
                         cl =  kick2TrainD[ ,8],
                         k = 200)

modelE <- confusionMatrix(kick2TestD$predict,kick2TestD$state,positive="successful")

4.6.4 Model F

A KNN model with k = 50 on the downsampled data.

kick3TestD$predict <- knn(train = kick3TrainD[ ,-8],
                         test = kick3TestD[ ,-3],
                         cl =  kick3TrainD[ ,8],
                         k = 50)

modelF <- confusionMatrix(kick3TestD$predict,kick3TestD$state,positive="successful")

5 Model Evaluation

Model A = logistic regression on the original data

list(modelATs$overall[1],modelATs$byClass[1],modelATs$byClass[2])
## [[1]]
## Accuracy 
## 0.978776 
## 
## [[2]]
## Sensitivity 
##   0.9999495 
## 
## [[3]]
## Specificity 
##   0.9474019

Model B = logistic regression on the downsampled data

list(modelBTs$overall[1],modelBTs$byClass[1],modelBTs$byClass[2])
## [[1]]
##  Accuracy 
## 0.9783688 
## 
## [[2]]
## Sensitivity 
##   0.9999495 
## 
## [[3]]
## Specificity 
##   0.9463911

Model C = KNN on the original data, k = 200

list(modelC$overall[1],modelC$byClass[1],modelC$byClass[2])
## [[1]]
##  Accuracy 
## 0.9945867 
## 
## [[2]]
## Sensitivity 
##   0.9978655 
## 
## [[3]]
## Specificity 
##     0.99328

Model D = KNN on the original data, k = 50

list(modelD$overall[1],modelD$byClass[1],modelD$byClass[2])
## [[1]]
##  Accuracy 
## 0.9972629 
## 
## [[2]]
## Sensitivity 
##   0.9995731 
## 
## [[3]]
## Specificity 
##   0.9963423

Model E = KNN on the downsampled data, k = 200

list(modelE$overall[1],modelE$byClass[1],modelE$byClass[2])
## [[1]]
##  Accuracy 
## 0.9920321 
## 
## [[2]]
## Sensitivity 
##           1 
## 
## [[3]]
## Specificity 
##   0.9888568

Model F = KNN on the downsampled data, k = 50

list(modelF$overall[1],modelF$byClass[1],modelF$byClass[2])
## [[1]]
##  Accuracy 
## 0.9951341 
## 
## [[2]]
## Sensitivity 
##           1 
## 
## [[3]]
## Specificity 
##    0.993195
mod.name <- c("A","B","C","D","E","F")
accuracy <- c(modelATs$overall[1],modelBTs$overall[1],modelC$overall[1],modelD$overall[1],modelE$overall[1],modelF$overall[1])
sensitivity <- c(modelATs$byClass[1],modelBTs$byClass[1],modelC$byClass[1],modelD$byClass[1],modelE$byClass[1],modelF$byClass[1])
specificity <- c(modelATs$byClass[2],modelBTs$byClass[2],modelC$byClass[2],modelD$byClass[2],modelE$byClass[2],modelF$byClass[2])

data.frame(mod.name,accuracy,sensitivity,specificity)
##   mod.name  accuracy sensitivity specificity
## 1        A 0.9787760   0.9999495   0.9474019
## 2        B 0.9783688   0.9999495   0.9463911
## 3        C 0.9945867   0.9978655   0.9932800
## 4        D 0.9972629   0.9995731   0.9963423
## 5        E 0.9920321   1.0000000   0.9888568
## 6        F 0.9951341   1.0000000   0.9931950

Based on the performance evaluation above, I choose Model D (KNN on the original data with k = 50).

Model D has the highest accuracy and specificity of all the models.

Thank you.