DonorsChoose.org

Application Screening

Xuejia Xu & Qufei Wang
April 21st, 2018

Agenda

Predict whether teachers' project proposals are accepted by DonorsChoose.org

Resource: https://www.kaggle.com/c/donorschoose-application-screening

  • Introduction to Dataset
  • Exploratory Data Analysis
  • Create New Features
  • Models & Evaluations
  • Conclusion

Our Problem

  1. How to scale current manual processes and resources to screen 500,000 projects so they can be posted as quickly and efficiently as possible

  2. How to increase the consistency of project vetting across different volunteers to improve the experience for teachers

  3. How to focus volunteer time on the applications that need the most assistance

In this project, we build several types of models to predict whether a project proposal will be approved!

Introduction to the Dataset

The training dataset includes 182,080 observations and 16 variables.

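The variable overview was shown as an image; below is a sketch of loading the competition data and checking its shape (file names follow the Kaggle data page):

train <- read.csv("train.csv", stringsAsFactors = FALSE)
test <- read.csv("test.csv", stringsAsFactors = FALSE)
resources <- read.csv("resources.csv", stringsAsFactors = FALSE)
dim(train)  # 182080 rows, 16 columns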

EDA---Percentage of approved projects vs. category


EDA---State Distribution


EDA---Boxplots of project approval rate vs. month

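A sketch of one such boxplot, assuming month and day have already been derived from project_submitted_datetime (see the Features slide):

library(ggplot2)
library(dplyr)
# daily approval rate, grouped by month
daily <- train %>%
    group_by(month, day) %>%
    summarise(rate = mean(project_is_approved))
ggplot(daily, aes(x = factor(month), y = rate)) + geom_boxplot()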

EDA---Boxplots of project approval rate vs. weekday


EDA---Boxplots of project approval rate vs. teacher_prefix


EDA---Boxplots of project approval rate vs. grade


EDA---Boxplots of project approval rate vs. word count


EDA---Sentiment Analysis

Word cloud for approved projects:


Word cloud for rejected projects:

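A sketch of how such word clouds could be built, assuming the tm and wordcloud packages:

library(tm)
library(wordcloud)
# a subsample of approved essays keeps the sketch tractable
approved <- train$project_essay_1[train$project_is_approved == 1][1:1000]
corpus <- VCorpus(VectorSource(approved))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
tdm <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(freq), freq, max.words = 100)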

Features

We created several new features to support our investigation (a sketch of the time-feature construction follows the list):

  • Time features: day, weekday, month, year, badMonth_dv, isThur
  • Join with resources: types_of_items
  • Essay word counts: wordcount1–wordcount4, count_essay1, count_essay2
  • Grade dummy variable: grade_dv
  • Prefix dummy variables: isDr, isTeacher
  • Subject dummy variable: literacy_dv
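
A minimal sketch of the time-feature construction, assuming the raw project_submitted_datetime column from the competition data (the months flagged as badMonth_dv are illustrative):

library(lubridate)
train$submitted <- ymd_hms(train$project_submitted_datetime)
train$day <- day(train$submitted)
train$weekday <- wday(train$submitted, label = TRUE)
train$month <- month(train$submitted)
train$year <- year(train$submitted)
train$isThur <- as.integer(train$weekday == "Thu")
train$badMonth_dv <- as.integer(train$month %in% c(6, 7))  # assumed low-approval months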

Preparing training, validation and testing datasets

Splitting the training data into training and validation sets:

library(caret)
# hold out 20% of the training data as a validation set
trainingDataIndex <- createDataPartition(train$project_is_approved, p = 0.8, 
    list = FALSE)
model_train <- train[trainingDataIndex, ]
validation <- train[-trainingDataIndex, ]

Model1---Logistic Regression Model

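The model summary was shown as an image; a minimal sketch of how such a model could be fit (the exact predictor set is not recoverable from the slide, so the formula below mirrors the later models):

model1 <- glm(project_is_approved ~ total_quantity + total_price + isTeacher + 
    badMonth + types_of_items + grade_dv + literacy_dv + count_essay1 + 
    count_essay2 + teacher_number_of_previously_posted_projects, 
    data = model_train, family = binomial)
summary(model1)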

Model1---ROC Curve


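A sketch of how the ROC curve could be produced, assuming the pROC package:

library(pROC)
prob1 <- predict(model1, newdata = validation, type = "response")
roc1 <- roc(validation$project_is_approved, prob1)
plot(roc1)
auc(roc1)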

Model2---GBM

set.seed(825)
# fitControl was defined earlier in the analysis; a typical choice (assumed
# here) is 5-fold cross-validation:
fitControl <- trainControl(method = "cv", number = 5)
model2 <- train(as.factor(project_is_approved) ~ total_quantity + total_price + 
    isTeacher + badMonth + types_of_items + grade_dv + literacy_dv + count_essay1 + 
    count_essay2 + teacher_number_of_previously_posted_projects, data = model_train, 
    method = "gbm", trControl = fitControl, verbose = FALSE, na.action = na.omit)

Model2---GBM Variable Importance

Features “types_of_items”, “count_essay2”, and “teacher_number_of_previously_posted_projects” are the most important variables.

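The plot was produced from caret's variable-importance ranking; a minimal sketch:

gbmImp <- varImp(model2, scale = TRUE)
plot(gbmImp, top = 10)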

Model3---XGBoost
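
The xgboost call below expects a sparse feature matrix m1 and a 0/1 label vector; a sketch of their construction, assuming the same feature set used for the test matrix at the end of the deck:

library(xgboost)
library(Matrix)
variable_train <- model_train[, c("total_price", "isDr", "isTeacher", "isThur", 
    "badMonth", "literacy_dv", "grade_dv", "count_essay1", "count_essay2", 
    "wordcount1", "wordcount2", "teacher_number_of_previously_posted_projects", 
    "types_of_items")]
m1 <- as(as.matrix(variable_train), "dgCMatrix")
label <- model_train$project_is_approved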

xgb <- xgboost(data = m1, label = label, max_depth = 4,
               eta = 1, nthread = 2, nrounds = 20,objective = "binary:logistic")
[1] train-error:0.152318 
[2] train-error:0.152318 
[3] train-error:0.152323 
[4] train-error:0.152625 
[5] train-error:0.152444 
[6] train-error:0.152323 
[7] train-error:0.152296 
[8] train-error:0.152191 
[9] train-error:0.152296 
[10]    train-error:0.152191 
[11]    train-error:0.152285 
[12]    train-error:0.152230 
[13]    train-error:0.152092 
[14]    train-error:0.152016 
[15]    train-error:0.151785 
[16]    train-error:0.151807 
[17]    train-error:0.151746 
[18]    train-error:0.151725 
[19]    train-error:0.151675 
[20]    train-error:0.151560 

Model3---XGBoost Variable Importance

The importance plot is nearly the same as the one GBM gives us, further confirming that GBM and XGBoost produce very similar results. One advantage of XGBoost, however, is that it runs much faster than the other models.

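A sketch of how the XGBoost importance plot could be drawn with xgboost's built-in helpers:

imp <- xgb.importance(model = xgb)
xgb.plot.importance(imp, top_n = 10)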

Evaluation of XGBoost and GBM

GBM

[1] "Accuracy 0.843942223198594"

XGBoost

[1] "Accuracy 0.850285588752197"

Model4---Random Forest

library(randomForest)
# 50 trees; missing values are imputed with randomForest::na.roughfix
model3 <- randomForest(as.factor(project_is_approved) ~ total_quantity + total_price + 
    isTeacher + badMonth + types_of_items + literacy_dv + count_essay1 + count_essay2, 
    data = model_train, ntree = 50, mtry = 5, importance = TRUE, na.action = randomForest::na.roughfix, 
    replace = FALSE)
prediction3 <- predict(model3, newdata = validation)

Model4---Random Forest Confusion Matrix

confusionMatrix(prediction3, validation$project_is_approved) 
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0   375   708
         1  5133 30197

               Accuracy : 0.8396          
                 95% CI : (0.8358, 0.8433)
    No Information Rate : 0.8487          
    P-Value [Acc > NIR] : 1               

                  Kappa : 0.0674          
 Mcnemar's Test P-Value : <2e-16          

            Sensitivity : 0.06808         
            Specificity : 0.97709         
         Pos Pred Value : 0.34626         
         Neg Pred Value : 0.85471         
             Prevalence : 0.15126         
         Detection Rate : 0.01030         
   Detection Prevalence : 0.02974         
      Balanced Accuracy : 0.52259         

       'Positive' Class : 0               

Model5---KNN

library(class)
# `class` holds the training labels corresponding to knn_train
knn5.pred <- knn(knn_train, validation_knn, cl = class, k = 5)
table(knn5.pred, validation_knn$project_is_approved)

knn5.pred     0     1
        0   250   929
        1  5258 29976

Accuracy rate:

[1] 0.8507691

Making Predictions

Considering both speed and accuracy, we chose XGBoost as our final model; it scored in the top 60% of the Kaggle competition.

# select the same features used in training and convert to a sparse matrix
variable_test <- test[, c("total_price", "isDr", "isTeacher", "isThur", "badMonth", 
    "literacy_dv", "grade_dv", "count_essay1", "count_essay2", "wordcount1", 
    "wordcount2", "teacher_number_of_previously_posted_projects", "types_of_items")]
variable_test <- as.matrix(variable_test)
m1_test <- as(variable_test, "dgCMatrix")
pred <- predict(xgb, m1_test)
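
A sketch of assembling the submission file (column names follow the competition's sample_submission.csv):

submission <- data.frame(id = test$id, project_is_approved = pred)
write.csv(submission, "submission.csv", row.names = FALSE)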

Conclusion

  1. Due to the constraint of computation speed, we could not finish feature engineering based on sentiment analysis of all the applications
  2. The dataset contains too few observations of rejected projects, so our models do not predict rejected applications well
  3. In the future, we could take greater advantage of hyperparameter tuning