Harel Lustiger
October, 2017
New Zealand's Jacinda Ardern sets out priorities: climate, inequality and women.
What should we measure to determine whether social inequality has been tackled successfully by the end of the PM's term?
GiniCoeff <- function(solution, submission){
    df <- data.frame(solution = solution, submission = submission)
    # Sort the examples from the highest to the lowest predicted score
    df <- df[order(df$submission, decreasing = TRUE), ]
    df$uniform <- (1:nrow(df)) / nrow(df) # = (1/n, 2/n, ..., 1)
    totalPos <- sum(df$solution) # how many times does '1' appear in reality?
    # Cumulative number of positive examples found
    # (used for computing the "Model Lorenz" curve)
    df$cumPosFound <- cumsum(df$solution)
    # Cumulative proportion of positive examples found ("Model Lorenz")
    df$Lorenz <- df$cumPosFound / totalPos
    # Lorenz minus uniform
    df$Gini <- df$Lorenz - df$uniform
    return(sum(df$Gini))
}
To learn more about “Estimation of the Gini coefficient” see this link (p. 14)
NormGiniCoeff <- function(solution, submission){
    # Divide by the Gini of a perfect ranking so that the best possible score is 1
    GiniCoeff(solution, submission) / GiniCoeff(solution, solution)
}
Featured Prediction Competition:
Porto Seguro's Safe Driver Prediction.
Problem:
Predict if a driver will file an insurance claim next year.
Objective:
Maximize the (Normalized) Gini Coefficient.
Prediction Type:
Classification with emphasis on scoring classifiers.
Given 6 members from the insurance company's database and their true future claim results, we observe the outputs of two algorithms:
| Member Name | Ground Truth | 1st Algorithm Score | 2nd Algorithm Score |
|---|---|---|---|
| A | No | 1 | 3 |
| B | No | 2 | 2 |
| C | No | 3 | 1 |
| D | Yes | 4 | 6 |
| E | Yes | 5 | 4 |
| F | Yes | 6 | 5 |
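Do the two algorithms achieve the same Gini score? YES, both yield the same Gini coefficient value.
Do the two algorithms rank the members identically? NO, but who cares?! The rankings differ only within each class, and the Gini coefficient depends only on how the "Yes" members are ranked relative to the "No" members.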
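The claim can be checked directly on the six members in the table. The sketch below restates the two Gini functions in compact form so that it runs on its own; the variable names (`truth`, `score1`, `score2`) are illustrative and the ground truth is encoded as No = 0, Yes = 1.

```r
# Compact restatement of the Gini functions so this block is self-contained.
GiniCoeff <- function(solution, submission) {
  ord <- order(submission, decreasing = TRUE)       # best score first
  lorenz <- cumsum(solution[ord]) / sum(solution)   # share of positives found
  uniform <- seq_along(solution) / length(solution) # (1/n, 2/n, ..., 1)
  sum(lorenz - uniform)
}
NormGiniCoeff <- function(solution, submission) {
  GiniCoeff(solution, submission) / GiniCoeff(solution, solution)
}

# Members A..F from the table (No = 0, Yes = 1).
truth  <- c(0, 0, 0, 1, 1, 1)
score1 <- c(1, 2, 3, 4, 5, 6)  # 1st algorithm
score2 <- c(3, 2, 1, 6, 4, 5)  # 2nd algorithm

gini1 <- NormGiniCoeff(truth, score1)
gini2 <- NormGiniCoeff(truth, score2)

# The two rankings are not identical member by member ...
same_ranks <- identical(rank(score1), rank(score2))

# ... yet both algorithms place every "Yes" above every "No".
print(c(gini1 = gini1, gini2 = gini2, same_ranks = same_ranks))
```

Both algorithms reach a normalized Gini of 1 because each one scores D, E and F above A, B and C; shuffling members within a class leaves the cumulative count of positives unchanged.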
| | 0 | 1 |
|---|---|---|
| Frequency | 573518 | 21694 |
| Proportion (%) | 96 | 4 |
Why is it important?
Supervised learning rests on the assumption that the initial training set is going to produce a useful model for discriminating between classes.
Furthermore, in their basic form, most classifiers do not behave well on imbalanced data sets. Instead, they have a predictive preference for the class with the greater proportion of examples.
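A small calculation illustrates the problem, using the class frequencies from the table above. The "always predict 0" baseline is a hypothetical classifier introduced only for illustration.

```r
# Class counts taken from the frequency table.
n_neg <- 573518  # target = 0
n_pos <- 21694   # target = 1

# Hypothetical baseline: always predict the majority class (0).
accuracy_majority <- n_neg / (n_neg + n_pos)

# The baseline never predicts a 1, so it finds none of the positives.
recall_majority <- 0 / n_pos

print(round(100 * accuracy_majority, 1))  # accuracy in percent
print(recall_majority)
```

The baseline is correct on roughly 96% of the examples while identifying no positive at all, which is why a ranking-based metric such as the Gini coefficient is more informative here than plain accuracy.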
xgboost models
model <- xgb.train(data = dX_bs,
                   # Parameters for Tree Booster
                   max_depth = 6,
                   eta = 0.02,
                   gamma = 1,
                   subsample = 0.95,
                   colsample_bytree = 0.8,
                   min_child_weight = 20,
                   # Early Stopping to Avoid Overfitting
                   nrounds = 1000,
                   early_stopping_rounds = 10,
                   watchlist = list(train = dX_bs, test = dX_ev),
                   # Task Parameters
                   objective = "binary:logistic",
                   # eval_metric is one of "error", "auc", "map", "rmse"
                   eval_metric = eval_metric,
                   seed = 2145)
Correlation Plots between Gini and selected
\[ \text{AUC}=(\text{Gini}+1)/2 \quad\leftrightarrow\quad \text{Gini}=2\times \text{AUC}-1 \]
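The identity can be checked numerically. The sketch below assumes that "Gini" here means the normalized Gini coefficient and that no two scores are tied; the helper `AucFromRanks` and the toy data are illustrative, and the Gini functions are restated in compact form so the block runs on its own.

```r
# Compact restatement of the Gini functions so this block is self-contained.
GiniCoeff <- function(solution, submission) {
  ord <- order(submission, decreasing = TRUE)
  lorenz <- cumsum(solution[ord]) / sum(solution)
  uniform <- seq_along(solution) / length(solution)
  sum(lorenz - uniform)
}
NormGiniCoeff <- function(solution, submission) {
  GiniCoeff(solution, submission) / GiniCoeff(solution, solution)
}

# AUC from the Mann-Whitney rank statistic (assumes no tied scores).
AucFromRanks <- function(solution, submission) {
  n_pos <- sum(solution == 1)
  n_neg <- sum(solution == 0)
  r <- rank(submission)  # rank 1 = lowest score
  (sum(r[solution == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data: one positive (score 0.35) is ranked below one negative (score 0.40).
solution   <- c(0, 0, 1, 0, 1, 1)
submission <- c(0.10, 0.40, 0.35, 0.20, 0.80, 0.70)

auc  <- AucFromRanks(solution, submission)   # 8 of 9 pairs ordered correctly
gini <- NormGiniCoeff(solution, submission)

print(c(auc = auc, gini = gini, two_auc_minus_one = 2 * auc - 1))
```

On this toy data the AUC is 8/9 and the normalized Gini is 7/9, which agrees with Gini = 2 × AUC − 1. With tied scores the two quantities can differ, because `GiniCoeff` breaks ties by row order.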
Boxplots of selected
Stop xgboost training