DonorsChoose.org
Xuejia Xu & Qufei Wang
April. 21st 2018
Predict whether teachers' projects proposals are accepted by DonorsChoose.org
resource: https://www.kaggle.com/c/donorschoose-application-screening
1.How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible
2.How to increase the consistency of project vetting across different volunteers to improve the experience for teachers
3.How to focus volunteer time on the applications that need the most assistance
In this project, we will build different types of models to predict project proposal approval rate!
The training dataset includes about 182,080 observations and 16 variables

We create valuable new features to support our investigation
####Create time Features
day,weekday, month, year
badMonth_dv, isThur
#### Join with resources
types_of_items
#### Adding essay word count
wordcount1,2,3,4
count_essay1, count_essay2
#### Adding grade dummy variable
grade_dv
#### Adding prefix dummy variable
isDr, isTeacher
####Subject dummy variable
literacy_dv
library(caret)
trainingDataIndex <- createDataPartition(train$project_is_approved, p = 0.8,
list = FALSE)
model_train <- train[trainingDataIndex, ]
validation <- train[-trainingDataIndex, ]
[1] 182076 14
set.seed(825)
model2 <- train(as.factor(project_is_approved) ~ total_quantity + total_price +
isTeacher + badMonth + types_of_items + grade_dv + literacy_dv + count_essay1 +
count_essay2 + teacher_number_of_previously_posted_projects, data = model_train,
method = "gbm", trControl = fitControl, verbose = FALSE, na.action = na.omit)
Feature “types_of_items” and “count_essay2”, and “# of previously posted projects” are the most important variable.
xgb <- xgboost(data = m1, label = label, max_depth = 4,
eta = 1, nthread = 2, nrounds = 20,objective = "binary:logistic")
[1] train-error:0.152318
[2] train-error:0.152318
[3] train-error:0.152323
[4] train-error:0.152625
[5] train-error:0.152444
[6] train-error:0.152323
[7] train-error:0.152296
[8] train-error:0.152191
[9] train-error:0.152296
[10] train-error:0.152191
[11] train-error:0.152285
[12] train-error:0.152230
[13] train-error:0.152092
[14] train-error:0.152016
[15] train-error:0.151785
[16] train-error:0.151807
[17] train-error:0.151746
[18] train-error:0.151725
[19] train-error:0.151675
[20] train-error:0.151560
The importance plot are pretty the same as what gbm give us and further confirms that gbm and XGBoost generate very similar result. However, one advantage of XGBoost is that it runs much faster than other models
GBM
[1] "Accuracy 0.843942223198594"
XGBoost
[1] "Accuracy 0.850285588752197"
library(randomForest)
model3 <- randomForest(as.factor(project_is_approved) ~ total_quantity + total_price +
isTeacher + badMonth + types_of_items + literacy_dv + count_essay1 + count_essay2,
data = model_train, ntree = 50, mtry = 5, importance = TRUE, na.action = randomForest::na.roughfix,
replace = FALSE)
prediction3 <- predict(model3, newdata = validation)
confusionMatrix(prediction3, validation$project_is_approved)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 375 708
1 5133 30197
Accuracy : 0.8396
95% CI : (0.8358, 0.8433)
No Information Rate : 0.8487
P-Value [Acc > NIR] : 1
Kappa : 0.0674
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.06808
Specificity : 0.97709
Pos Pred Value : 0.34626
Neg Pred Value : 0.85471
Prevalence : 0.15126
Detection Rate : 0.01030
Detection Prevalence : 0.02974
Balanced Accuracy : 0.52259
'Positive' Class : 0
knn5.pred<-knn(knn_train, validation_knn, class, k = 5)
table(knn5.pred, validation_knn$project_is_approved)
knn5.pred 0 1
0 250 929
1 5258 29976
Accuracy rate:
[1] 0.8507691
Based on consideration of speed and accuracy, we choose XGBoost as the final model and we scored top 60% on the Kaggle competition
variable_test <- test[, c("total_price", "isDr", "isTeacher", "isThur", "badMonth",
"literacy_dv", "grade_dv", "count_essay1", "count_essay2", "wordcount1",
"wordcount2", "teacher_number_of_previously_posted_projects", "types_of_items")]
variable_test = as.matrix(variable_test)
m1_test <- as(variable_test, "dgCMatrix")
pred <- predict(xgb, m1_test)