First we have to design the sampling method and size. As we are going to predict the policy price on Seguros del Estado, we need to bring enough samples (quotes) to our sample set so we can train our model correctly. We have to bare in mind the company’s budget for ‘dedomatics’ (2.000.000 COP) and the ‘correct’ amount of samples we need (rule of the thumb: al least ten samples per variable/column).
Let’s start loading all the R libraries we will be using:
library(dplyr) # Data manipulation
library(RPostgreSQL) # SQL data import froum our servers
library(lubridate) # Date manipulation
library(stringr) # Text manipulation
library(googlesheets) # Import Google Sheets data
library(Amelia) # Viz for Missing Values
library(ggplot2) # Viz for everything
library(xgboost) # XGBoost
library(data.table) # Data Table convert
library(caTools) # Modeling tools
library(gridExtra) # Viz tool to reorder features
library(caret) # Modeling tools
library(tidyr) # Used it somewhere
Now we query out database so we can select the more relevant vehicles. We will be bringing all the cars that have quoted in our website on the last 8 months:
drv <- dbDriver("PostgreSQL")
# Define con <- dbConnect() with real credentials (not shown for security reasons)
q <- "SELECT
car.id as id,
car.vehicle_brand as brand,
car.vehicle_line as line,
car.vehicle_reference as reference,
car.vehicle_model as model,
car.vehicle_brand || ' ' || car.vehicle_line || ' ' || car.vehicle_reference as vehicle,
car.vehicle_commercial_value as value,
car.fasecolda_code as fasecolda,
car.vehicle_body as body,
car.identification as client_id,
car.identification_type as client_id_type,
car.vehicle_complete_reference as complete,
UPPER(car.vehicle_registration) as plate,
car.date_of_birth
FROM applications_carinsuranceapplication as car
WHERE car.created > CURRENT_DATE - INTERVAL '8 months'
AND car.vehicle_brand IS NOT NULL
AND car.vehicle_has_registration = 1
AND car.identification != '0000000001'
AND car.date_of_birth IS NOT NULL"
q <- dbSendQuery(con, q)
cars <- fetch(q, n = -1)
Once we have this ‘cars’ data frame, we do some features fixes and we can select the cars to use, generate random years and ages in different ranges for our manual quotes and, finally, create the database file that will be used to train the model:
# Which n cars?
chosen <- cars %>%
filter(!body %in% c("MOTOCICLETA","MOTO")) %>% # We do not want bikes
group_by(brand,vehicle) %>%
summarise (n = n()) %>%
mutate(freq=round(100*n/sum(n),1)) %>%
arrange(brand,desc(n)) %>%
mutate(cum=cumsum(freq),row=row_number()) %>%
filter(row <= 3 & n > nrow(cars)*0.0003) %>% # Change % untill you get the top 3 of n brads you want
select(-row) %>% data.frame() ; chosen
## brand vehicle n freq cum
## 1 BMW BMW X3 [E83] xDrive30i Executive 25 4.9 4.9
## 2 BMW BMW 320i E90 20 3.9 8.8
## 3 BMW BMW 120i E82 COUPE 18 3.5 12.3
## 4 CHANGAN CHANGAN CS35 LUXURY 16 39.0 39.0
## 5 CHERY CHERY TIGGO [FL] 1.6L 110 28.4 28.4
## 6 CHERY CHERY YOYA VAN PASS 39 10.1 38.5
## 7 CHERY CHERY YOYA VAN PASS [2] 27 7.0 45.5
## 8 CHEVROLET CHEVROLET SAIL LS 1098 8.5 8.5
## 9 CHEVROLET CHEVROLET SPARK GT [M300] 936 7.2 15.7
## 10 CHEVROLET CHEVROLET SPARK LT [M200] 782 6.1 21.8
## 11 DAIHATSU DAIHATSU TERIOS OKii 74 55.2 55.2
## 12 DAIHATSU DAIHATSU SIRION 1.3L 18 13.4 68.6
## 13 DFSK/DFM/DFZL DFSK/DFM/DFZL VAN CARGA EQ5021XXYF Y01 19 23.8 23.8
## 14 DODGE DODGE JOURNEY SE [FL] 117 30.6 30.6
## 15 DODGE DODGE JOURNEY SE/EXPRESS [FL] 69 18.1 48.7
## 16 DODGE DODGE JOURNEY SXT [FL] 39 10.2 58.9
## 17 FIAT FIAT 500 SPORT BZ 17 8.5 8.5
## 18 FIAT FIAT PALIO ELX 17 8.5 17.0
## 19 FORD FORD FIESTA SPORTBACK 311 11.7 11.7
## 20 FORD FORD FIESTA SPORTBACK TITANIUM 255 9.6 21.3
## 21 FORD FORD FIESTA SPORTBACK SE 209 7.9 29.2
## 22 FOTON FOTON TUNLAND BJ2037Y3MDV 16 11.2 11.2
## 23 GREAT WALL MOTOR GREAT WALL MOTOR HAVAL M4 22 21.2 21.2
## 24 GREAT WALL MOTOR GREAT WALL MOTOR HAVAL H5 20 19.2 40.4
## 25 HONDA HONDA CRV EX 71 21.6 21.6
## 26 HONDA HONDA CIVIC LX 42 12.8 34.4
## 27 HONDA HONDA CRV LX 40 12.2 46.6
## 28 HYUNDAI HYUNDAI ACCENT i25 1.6L 289 10.4 10.4
## 29 HYUNDAI HYUNDAI TUCSON IX 35 2.0L 279 10.1 20.5
## 30 HYUNDAI HYUNDAI ACCENT i25 1.4L 196 7.1 27.6
## 31 JAC JAC S2 HFC7151EAV 46 20.4 20.4
## 32 JEEP JEEP COMPASS [FL] LIMITED 31 17.1 17.1
## 33 JEEP JEEP WRANGLER SPORT 30 16.6 33.7
## 34 JEEP JEEP GRAND CHEROKEE LIMITED 23 12.7 46.4
## 35 KENWORTH KENWORTH T800 FULL FILTROS 18 47.4 47.4
## 36 KIA KIA PICANTO ION XTREM 1.25 331 7.9 7.9
## 37 KIA KIA PICANTO ION 1.0 292 7.0 14.9
## 38 KIA KIA SPORTAGE REVOLUTION 243 5.8 20.7
## 39 MAZDA MAZDA 3 TOURING 222 6.8 6.8
## 40 MAZDA MAZDA 3 GRAND TOURING 210 6.4 13.2
## 41 MAZDA MAZDA 2 15HA8 134 4.1 17.3
## 42 MERCEDES BENZ MERCEDES BENZ A 200 [W176] 34 11.8 11.8
## 43 MINI MINI COOPER R56 1.6 COUPE 22 40.7 40.7
## 44 MITSUBISHI MITSUBISHI MONTERO V11 43 10.2 10.2
## 45 MITSUBISHI MITSUBISHI LANCER AVANZADO 27 6.4 16.6
## 46 MITSUBISHI MITSUBISHI MONTERO V43 WAGON 23 5.5 22.1
## 47 NISSAN NISSAN VERSA ADVANCE 183 7.5 7.5
## 48 NISSAN NISSAN MARCH ACTIVE 154 6.3 13.8
## 49 NISSAN NISSAN NP 300 FRONTIER 2.4L 150 6.1 19.9
## 50 PEUGEOT PEUGEOT 206 XR 48 16.4 16.4
## 51 PEUGEOT PEUGEOT 207 COMPACT 19 6.5 22.9
## 52 PEUGEOT PEUGEOT 206 XRA 17 5.8 28.7
## 53 RENAULT RENAULT DUSTER DYNAMIQUE 492 5.1 5.1
## 54 RENAULT RENAULT LOGAN F.II ENTRY [FAMILIER] 476 5.0 10.1
## 55 RENAULT RENAULT DUSTER EXPRESSION 386 4.0 14.1
## 56 SEAT SEAT IBIZA STYLE 17 17.3 17.3
## 57 SSANGYONG SSANGYONG ACTYON A230 60 20.1 20.1
## 58 SSANGYONG SSANGYONG KORANDO C 50 16.7 36.8
## 59 SSANGYONG SSANGYONG KYRON [FL] M200 XDI 28 9.4 46.2
## 60 SUZUKI SUZUKI GRAND VITARA SZ 127 16.2 16.2
## 61 SUZUKI SUZUKI SWIFT LIVE 1.2L 93 11.8 28.0
## 62 SUZUKI SUZUKI GRAND VITARA [FL] SZ GLX SPORT 78 9.9 37.9
## 63 TOYOTA TOYOTA FORTUNER [FL] 2.7L 126 9.4 9.4
## 64 TOYOTA TOYOTA FORTUNER 2.7L 115 8.6 18.0
## 65 TOYOTA TOYOTA HILUX IMV 86 6.4 24.4
## 66 VOLKSWAGEN VOLKSWAGEN JETTA TRENDLINE 272 16.0 16.0
## 67 VOLKSWAGEN VOLKSWAGEN GOL COMFORTLINE 158 9.3 25.3
## 68 VOLKSWAGEN VOLKSWAGEN JETTA EUROPA 157 9.2 34.5
## 69 ZOTYE ZOTYE NOMADA 1.6L 25 48.1 48.1
# View counter per brand
chosen %>% group_by(brand) %>% summarise(count = n()) %>% data.frame() %>% arrange(desc(count))
## brand count
## 1 BMW 3
## 2 CHERY 3
## 3 CHEVROLET 3
## 4 DODGE 3
## 5 FORD 3
## 6 HONDA 3
## 7 HYUNDAI 3
## 8 JEEP 3
## 9 KIA 3
## 10 MAZDA 3
## 11 MITSUBISHI 3
## 12 NISSAN 3
## 13 PEUGEOT 3
## 14 RENAULT 3
## 15 SSANGYONG 3
## 16 SUZUKI 3
## 17 TOYOTA 3
## 18 VOLKSWAGEN 3
## 19 DAIHATSU 2
## 20 FIAT 2
## 21 GREAT WALL MOTOR 2
## 22 CHANGAN 1
## 23 DFSK/DFM/DFZL 1
## 24 FOTON 1
## 25 JAC 1
## 26 KENWORTH 1
## 27 MERCEDES BENZ 1
## 28 MINI 1
## 29 SEAT 1
## 30 ZOTYE 1
# Start loop for each car (brand-line-reference) combined with model and age
n <- 1 # scalability
year <- year(Sys.Date())
year_max <- 15 # Max per insurance company http://bit.ly/2t3IdQ3
output <- data.frame(vehicle=character(),
model=integer(),
client_year=integer())
for (i in 1:nrow(chosen)){
# MODELS (VEHICLE'S AGE)
years <- cars %>% filter(vehicle == chosen$vehicle[i]) %>% select(model) %>% distinct()
years_dim <- nrow(years)
years <- ifelse(years_dim >= 6,
sample_n(years,6,replace=F),
sample_n(years,6,replace=T))
years <- as.integer(years[[1]])
# PERSON'S AGE
ages <- c(
sample(18:24,n), # Prices only change with the 24 years old threshold
sample(25:70,n)
)
# Concat all combinations
all <- list(a = chosen$vehicle[i], b = years, c = year-ages)
all <- do.call(expand.grid, all) %>% arrange(a,desc(b),c)
output <- rbind(output,all)
}
colnames(output) <- c('vehicle','model','client_year')
# Lastly, we delete all duplicates (because might have less than 6 models per car)
output <- distinct(output)
head(output,10)
## vehicle model client_year
## 1 BMW X3 [E83] xDrive30i Executive 2011 1966
## 2 BMW X3 [E83] xDrive30i Executive 2011 1995
## 3 BMW X3 [E83] xDrive30i Executive 2010 1966
## 4 BMW X3 [E83] xDrive30i Executive 2010 1995
## 5 BMW 320i E90 2009 1962
## 6 BMW 320i E90 2009 1994
## 7 BMW 320i E90 2008 1962
## 8 BMW 320i E90 2008 1994
## 9 BMW 320i E90 2007 1962
## 10 BMW 320i E90 2007 1994
# Add car's data (price, fasecolda, complete reference)
info <- cars %>%
filter(vehicle %in% chosen$vehicle) %>%
select(vehicle,brand,model,fasecolda,value,complete) %>%
mutate(unique = paste(vehicle,model)) %>%
arrange(vehicle,desc(model),desc(value))
prices <- info %>%
group_by(unique) %>%
summarise(mean_value = mean(value)) %>%
left_join(info %>% select(brand,unique,complete,fasecolda), by = "unique") %>%
group_by(unique,complete) %>% mutate(row=row_number()) %>%
filter(row == 1) %>% select(-row)
output <- output %>%
mutate(unique = paste(vehicle,model)) %>%
left_join(info, by = "unique") %>%
group_by(vehicle.x,client_year,model.x) %>%
mutate(row=row_number()) %>%
filter(row==1) %>% select(-row,-unique,-vehicle.y,-model.y) %>%
rename(vehicle=vehicle.x,model=model.x) %>% ungroup() %>%
mutate(plate = c("CXB162"), cc = c("1015471239"), car_value = c("0"), policy_rate = c("0"), id = row_number(), fasecolda = str_pad(fasecolda, 8, pad = "0")) %>%
select(id, car_value, policy_rate, client_year, fasecolda, model, cc, brand, plate, vehicle, complete) %>% data.frame()
Finally, we get a sample set of (n) rows to be quotes manually:
nrow(output)
## [1] 676
For more details on this data see: http://bit.ly/pi-sde-01
# Connect Google Sheet to import data
my_sheets <- gs_ls()
## Auto-refreshing stale OAuth token.
df <- "Pricing Model - Seguros del Estado"
df <- gs_title(df)
## Sheet successfully identified: "Pricing Model - Seguros del Estado"
df <- df %>% gs_read(ws = "Llenado", range = cell_cols(1:22)) %>% data.frame()
## Accessing worksheet titled 'Llenado'.
## Parsed with column specification:
## cols(
## .default = col_character(),
## ID = col_integer(),
## Modelo = col_integer(),
## `Año nacimiento` = col_integer(),
## vehicle_age = col_integer(),
## age = col_integer(),
## engine_size = col_integer(),
## power = col_integer(),
## person_capacity = col_integer(),
## doors = col_integer(),
## is_imported = col_integer()
## )
## See spec(...) for full column specifications.
df$Valor.Asegurado <- as.integer(gsub("[^0-9]", "", df$Valor.Asegurado))
df$Prima <- as.integer(gsub("[^0-9]", "", df$Prima))
df$prime01 <- as.integer(gsub("[^0-9]", "", df$prime01))
df$prime02 <- as.integer(gsub("[^0-9]", "", df$prime02))
df$prime03 <- as.integer(gsub("[^0-9]", "", df$prime03))
df$prime04 <- as.integer(gsub("[^0-9]", "", df$prime04))
df <- df %>%
select(-Año.nacimiento, -Vehículo, -Referencia.completa, -body_name) %>%
rename(id=ID,vehicle_value=Valor.Asegurado,net_policy_value=Prima,fasecolda=Fasecolda,model=Modelo)
scrap <- df
# Missing values
missmap(scrap, legend=FALSE, rank.order=TRUE) #NONE
# Distribution of vehicles
ggplot(scrap, aes(model, fill=brand)) + geom_histogram(bins=100) + facet_grid(body_alias ~.)
# Vehicle vs Policy Prices
ggplot(scrap, aes(vehicle_value, net_policy_value, color=model)) + geom_point()
First we have to prepare our data set: delete the features we will not use, split data and tran sets, one hot encodign for categorical features, and convert data tables into XGBoost objects.
set.seed(1) # Always useful for reproduction
# Delete some data:
df <- select(df,-prime01,-prime02,-prime03,-prime04) # Later: Experiences
df$fasecolda <- NULL # Good predictor but it's noise
df$body_alias <- NULL # No need
df$model <- NULL # We need the vehicle's age so it'll always work
# Spaces fix from brands:
df$brand <- gsub('\\s+', '_',df$brand)
# Split data: test & train
split <- sample.split(df$net_policy_value, SplitRatio = 0.7) #Split 70/30
train <- subset(df, split == TRUE) #Training
test <- subset(df, split == FALSE) #Testing
setDT(train); setDT(test)
#Using one hot encoding
labels <- train$net_policy_value #Target train
ts_label <- test$net_policy_value #Target test
new_tr <- model.matrix(~.+0,data = train[,-c("net_policy_value","id"),with=F]) # Train
new_ts <- model.matrix(~.+0,data = test[,-c("net_policy_value","id"),with=F]) # Test
#Convert data table into a matrix (xgb.DMatrix):
dtrain <- xgb.DMatrix(data = new_tr,label = labels)
dtest <- xgb.DMatrix(data = new_ts,label = ts_label)
After the data preparation, we can start our trainings. First, we have to define our parameters and we validate with Cross-Validation that we are not overfitting our training set.
#Default parameters:
params <- list(
booster = "gbtree", objective = "reg:linear",
metrics = "rmse",
max_depth=4,
eta=0.02, gamma=0.1,
min_child_weight=1, subsample=0.9, colsample_bytree=0.9)
# Cross-validation:
xgbcv <- xgb.cv(params = params, data = dtrain,
nrounds = 10000, nfold = 5,
print_every_n = 1000,
early_stopping_rounds = 20,
prediction = F)
## [1] train-rmse:2682676.800000+57075.971800 test-rmse:2675216.000000+227020.938107
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 20 rounds.
##
## [1001] train-rmse:38962.126562+1957.277331 test-rmse:167705.818750+29977.573693
## [2001] train-rmse:16665.787891+1463.827168 test-rmse:158118.201563+29906.788844
## [3001] train-rmse:9123.931445+973.769899 test-rmse:155070.939062+29799.545273
## [4001] train-rmse:5551.006543+707.186433 test-rmse:153965.982813+29910.255909
## [5001] train-rmse:3558.466846+572.453720 test-rmse:153409.904688+29871.410016
## Stopping. Best iteration:
## [5510] train-rmse:2880.456006+507.839925 test-rmse:153240.198437+29876.498531
bestn <- xgbcv$best_iteration; bestn
## [1] 5510
Once we know our best iteration or n number of trees, we can train our whole training set.
#Model training on our test set:
xgb1 <- xgb.train(
params = params, data = dtrain,
nrounds = bestn,
watchlist = list(val=dtest,train=dtrain),
print_every_n = 1000
)
## [1] val-rmse:2701682.250000 train-rmse:2683069.750000
## [1001] val-rmse:143434.656250 train-rmse:43590.875000
## [2001] val-rmse:133379.421875 train-rmse:20522.781250
## [3001] val-rmse:130540.304688 train-rmse:12117.409180
## [4001] val-rmse:129294.414062 train-rmse:7805.442383
## [5001] val-rmse:128592.773438 train-rmse:5337.421875
## [5510] val-rmse:128378.296875 train-rmse:4366.730469
val_rmse <- min(xgb1$evaluation_log$val_rmse)
train_rmse <- min(xgb1$evaluation_log$train_rmse)
paste("rmse: ",round(val_rmse,1)," @ iter:",bestn,sep="")
## [1] "rmse: 128377.4 @ iter:5510"
Let’s see how good is our model predicting the test set:
# Predictions
test_preds <- cbind(real = test$net_policy_value,
pred = predict(xgb1,dtest)) %>% data.frame()
test_preds <- mutate(test_preds,
dif = abs(test_preds$real-test_preds$pred),
dif_perc = round(100*dif/real,2))
head(test_preds)
## real pred dif dif_perc
## 1 4138632 4285598 146965.50 3.55
## 2 3934584 3999427 64843.00 1.65
## 3 2724030 2697309 26721.00 0.98
## 4 1134200 1177055 42854.62 3.78
## 5 1073160 1105470 32309.62 3.01
## 6 1342608 1307043 35564.62 2.65
dif_perc <- summary(test_preds$dif_perc); dif_perc
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.010 0.770 1.690 2.613 2.970 33.020
# Plot predictions vs real values
ggplot(test_preds, aes(x=real/1000,y=pred/1000)) +
geom_point() + xlim(0,6000) + ylim(0,6000) + geom_smooth(method="lm") +
xlab("Precio Póliza Real (x1000 COP)") + ylab("Precio Póliza Predicción (x1000 COP)") + guides(color=FALSE) +
ggtitle("Valores de Pólizas para Seguros del Estado
Predicción vs Reales del Test Set")
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
# Importance matrix
imp <- xgb.importance(feature_names = colnames(new_tr), model = xgb1)
n <- ifelse(ncol(new_tr) <= 10, ncol(new_tr), 10)
grid.arrange(arrangeGrob(
ggplot(imp[1:n,1:2],
aes(x=reorder(Feature,Gain),
y=Gain,
label=round(Gain,2),fill=as.numeric(Gain))) +
geom_col() + coord_flip() + xlab('') + ylab('Importancia') +
guides(fill=FALSE) + geom_text(hjust=-0.5)))
# Correlations
library(corrplot)
cor10 <- cor(as.data.frame(
cbind(new_tr[,colnames(new_tr) %in% unlist(c(imp[1:10,1]))],
X=labels)))
corrplot(cor10,method="square",type="upper",order="AOE")
# Deciles (percetage difference between pred and real)
deciles <- quantile(test_preds$dif_perc, probs = seq(0.1, 0.9, length = 9), names = TRUE)
deciles <- data.frame(cbind(Deciles=row.names(as.data.frame(deciles)),Threshold=as.data.frame(deciles)),row.names=NULL)
ggplot(deciles,aes(x=Deciles,y=deciles,label=round(deciles,2),fill=as.numeric(deciles))) +
geom_col() + geom_text(vjust=-1) + guides(fill=FALSE) +
xlab('Deciles de errores por predicción') + ylab('Error porcentual máximo') +
ggtitle("Deciles por error absoluto de predicciones del test set")
# PLot CV learning curve
xgbcv$evaluation_log %>%
select(1,2,4) %>%
gather(type,value, -iter) %>%
ggplot(aes(iter,value)) +
geom_line(aes(colour = type))
# EXPORT TEXT RESULTS
export <- list(
"Nota:"="estos valores predecidos son sin IVA y para una persona 'sin experiencia' (siniestros)",
"Mejor iteración:"=bestn,
"RMSE test:"=val_rmse,
"RMSE train:"=train_rmse,
"Cuartiles (%):"=dif_perc,
"Importancia variables"=imp,
"Parámetros:"=(xgb1$params),
"Campos:"=as.data.frame(cbind(Campos=colnames(new_tr)))); export
## $`Nota:`
## [1] "estos valores predecidos son sin IVA y para una persona 'sin experiencia' (siniestros)"
##
## $`Mejor iteración:`
## [1] 5510
##
## $`RMSE test:`
## [1] 128377.4
##
## $`RMSE train:`
## [1] 4366.73
##
## $`Cuartiles (%):`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.010 0.770 1.690 2.613 2.970 33.020
##
## $`Importancia variables`
## Feature Gain Cover Frequency
## 1: vehicle_value 9.095660e-01 0.365798111 0.433889225
## 2: age 4.646833e-02 0.123269052 0.204596312
## 3: brandRENAULT 1.230105e-02 0.024686901 0.012625974
## 4: engine_size 8.566470e-03 0.074700096 0.061545498
## 5: brandKIA 8.029029e-03 0.019371673 0.012266632
## 6: brandVOLKSWAGEN 2.714405e-03 0.023121687 0.011596949
## 7: brandCHEVROLET 2.443803e-03 0.019017166 0.011074269
## 8: vehicle_age 1.962566e-03 0.064673451 0.094670304
## 9: brandNISSAN 1.338536e-03 0.016994853 0.008869216
## 10: brandCHERY 1.324925e-03 0.023090138 0.013377326
## 11: brandGREAT_WALL_MOTOR 1.096990e-03 0.011447636 0.005161459
## 12: brandHONDA 7.434955e-04 0.010723442 0.006141483
## 13: brandMAZDA 5.692681e-04 0.011567307 0.008297535
## 14: brandJEEP 3.676412e-04 0.008645148 0.006566160
## 15: brandFORD 3.333981e-04 0.008718089 0.004361106
## 16: brandPEUGEOT 3.200264e-04 0.008322071 0.004491776
## 17: brandSSANGYONG 2.998464e-04 0.009744723 0.006958169
## 18: brandJAC 2.671315e-04 0.041917064 0.017493426
## 19: brandBMW 2.522487e-04 0.017889244 0.012070627
## 20: brandHYUNDAI 1.927471e-04 0.013965594 0.011237607
## 21: brandDODGE 1.885250e-04 0.007997571 0.005847476
## 22: brandFIAT 1.023138e-04 0.005795811 0.002989073
## 23: brandMERCEDES_BENZ 8.530916e-05 0.005668312 0.002580729
## 24: brandSEAT 8.105068e-05 0.015736704 0.007725855
## 25: brandBYD 7.809781e-05 0.003928987 0.001617039
## 26: brandMITSUBISHI 6.817014e-05 0.011243638 0.006860167
## 27: brandZOTYE 6.405249e-05 0.008015717 0.003903762
## 28: brandDAIHATSU 4.789049e-05 0.005581256 0.003332081
## 29: brandMINI 4.698002e-05 0.013689129 0.006353821
## 30: brandTOYOTA 4.197172e-05 0.005763669 0.003038074
## 31: brandSUZUKI 2.194815e-05 0.013026252 0.006190484
## 32: brandDFSK/DFM/DFZL 1.578993e-05 0.005889508 0.002270389
## Feature Gain Cover Frequency
##
## $`Parámetros:`
## $`Parámetros:`$booster
## [1] "gbtree"
##
## $`Parámetros:`$objective
## [1] "reg:linear"
##
## $`Parámetros:`$metrics
## [1] "rmse"
##
## $`Parámetros:`$max_depth
## [1] 4
##
## $`Parámetros:`$eta
## [1] 0.02
##
## $`Parámetros:`$gamma
## [1] 0.1
##
## $`Parámetros:`$min_child_weight
## [1] 1
##
## $`Parámetros:`$subsample
## [1] 0.9
##
## $`Parámetros:`$colsample_bytree
## [1] 0.9
##
## $`Parámetros:`$silent
## [1] 1
##
##
## $`Campos:`
## Campos
## 1 vehicle_value
## 2 vehicle_age
## 3 age
## 4 brandBMW
## 5 brandBYD
## 6 brandCHERY
## 7 brandCHEVROLET
## 8 brandDAIHATSU
## 9 brandDFSK/DFM/DFZL
## 10 brandDODGE
## 11 brandFIAT
## 12 brandFORD
## 13 brandGREAT_WALL_MOTOR
## 14 brandHONDA
## 15 brandHYUNDAI
## 16 brandJAC
## 17 brandJEEP
## 18 brandKIA
## 19 brandMAZDA
## 20 brandMERCEDES_BENZ
## 21 brandMINI
## 22 brandMITSUBISHI
## 23 brandNISSAN
## 24 brandPEUGEOT
## 25 brandRENAULT
## 26 brandSEAT
## 27 brandSSANGYONG
## 28 brandSUZUKI
## 29 brandTOYOTA
## 30 brandVOLKSWAGEN
## 31 brandZOTYE
## 32 engine_size
## 33 power
## 34 person_capacity
## 35 doors
## 36 is_imported
Now we can export our binary file so we can implement the prediction model in production:
# EXPORTED MODEL
#xgb.save(xgb1, fname="Models/PI-SDE.01/price_intelligence_v01.model")
# CHECK BINARY FILE
#pred <- predict(xgb1,dtrain, ntreelimit = bestn)
#xgb2 <- xgb.load("Models/PI-SDE.01/price_intelligence_v01.model")
#pred2 <- predict(xgb2, dtrain, ntreelimit = bestn)
#sum(abs(pred2-pred)) #0?
The last step is add experience factor (claims and years insured) to the predicted values:
df <- scrap %>%
filter(!is.na(prime01)) %>%
mutate(p01 = prime01/net_policy_value,
p02 = prime02/net_policy_value,
p03 = prime03/net_policy_value,
p04 = prime04/net_policy_value) %>%
select(id:net_policy_value,model:prime04,p01:p04)
x <- df %>% gather()
p0 <- filter(x, grepl('p0', key))
vv <- filter(x, grepl('vehicle_value', key))
vv <- rbind(vv,vv,vv,vv)
x <- data.frame(cbind(p0,vv)) %>%
rename(vv=value.1) %>% select(-key.1)
values <- x %>% group_by(key) %>%
summarise(mean = mean(value),
max = max(value),
min = min(value))
a <- paste("means: ",paste(round(values$mean,4),collapse=" | "),sep="")
b <- paste("maxs: ",paste(round(values$max,4),collapse=" | "),sep="")
c <- paste("mins: ",paste(round(values$min,4),collapse=" | "),sep="")
ggplot(x,aes(x=key,y=value,color=vv/1000000)) +
geom_boxplot() + geom_point() +
ggtitle("Fracción del valor póliza vs experiencia y valor vehículo") +
xlab("Años sin siniestros") + ylab("Fracción") +
annotate("text", x="p03", y=0.900, label= b) +
annotate("text", x="p03", y=0.880, label= a) +
annotate("text", x="p03", y=0.860, label= c)
This means that, for people who have had experiences, we will show the predicted value (for each of the p0 ranges) times X, where X is:
The total value spent to collect all data on manual quotes was 180.000 COP (9% of the budget) thanks to the optimal number of samples needed and because of the testings done before sending the actual requests.