Sampling

First we have to design the sampling method and size. As we are going to predict the policy price on Seguros del Estado, we need to bring enough samples (quotes) to our sample set so we can train our model correctly. We have to bare in mind the company’s budget for ‘dedomatics’ (2.000.000 COP) and the ‘correct’ amount of samples we need (rule of the thumb: al least ten samples per variable/column).

Let’s start loading all the R libraries we will be using:

library(dplyr)          # Data manipulation
library(RPostgreSQL)    # SQL data import froum our servers
library(lubridate)      # Date manipulation
library(stringr)        # Text manipulation
library(googlesheets)   # Import Google Sheets data
library(Amelia)         # Viz for Missing Values
library(ggplot2)        # Viz for everything
library(xgboost)        # XGBoost
library(data.table)     # Data Table convert
library(caTools)        # Modeling tools
library(gridExtra)      # Viz tool to reorder features
library(caret)          # Modeling tools
library(tidyr)          # Used it somewhere

Now we query out database so we can select the more relevant vehicles. We will be bringing all the cars that have quoted in our website on the last 8 months:

drv <- dbDriver("PostgreSQL")
# Define con <- dbConnect() with real credentials (not shown for security reasons)
q <- "SELECT
car.id as id, 
car.vehicle_brand as brand,
car.vehicle_line as line,
car.vehicle_reference as reference,
car.vehicle_model as model,
car.vehicle_brand || ' ' || car.vehicle_line || ' ' || car.vehicle_reference as vehicle,
car.vehicle_commercial_value as value,
car.fasecolda_code as fasecolda,
car.vehicle_body as body,
car.identification as client_id,
car.identification_type as client_id_type,
car.vehicle_complete_reference as complete,
UPPER(car.vehicle_registration) as plate,
car.date_of_birth
FROM applications_carinsuranceapplication as car
WHERE car.created > CURRENT_DATE - INTERVAL '8 months'
AND car.vehicle_brand IS NOT NULL
AND car.vehicle_has_registration = 1
AND car.identification != '0000000001'
AND car.date_of_birth IS NOT NULL"

q <- dbSendQuery(con, q)
cars <- fetch(q, n = -1)

Once we have this ‘cars’ data frame, we do some features fixes and we can select the cars to use, generate random years and ages in different ranges for our manual quotes and, finally, create the database file that will be used to train the model:

# Which n cars?
chosen <- cars %>% 
  filter(!body %in% c("MOTOCICLETA","MOTO")) %>% # We do not want bikes
  group_by(brand,vehicle) %>% 
  summarise (n = n()) %>% 
  mutate(freq=round(100*n/sum(n),1)) %>% 
  arrange(brand,desc(n)) %>%
  mutate(cum=cumsum(freq),row=row_number()) %>% 
  filter(row <= 3 & n > nrow(cars)*0.0003) %>%  # Change % untill you get the top 3 of n brads you want
  select(-row) %>% data.frame() ; chosen
##               brand                                vehicle    n freq  cum
## 1               BMW       BMW X3 [E83] xDrive30i Executive   25  4.9  4.9
## 2               BMW                           BMW 320i E90   20  3.9  8.8
## 3               BMW                     BMW 120i E82 COUPE   18  3.5 12.3
## 4           CHANGAN                    CHANGAN CS35 LUXURY   16 39.0 39.0
## 5             CHERY                  CHERY TIGGO [FL] 1.6L  110 28.4 28.4
## 6             CHERY                    CHERY YOYA VAN PASS   39 10.1 38.5
## 7             CHERY                CHERY YOYA VAN PASS [2]   27  7.0 45.5
## 8         CHEVROLET                      CHEVROLET SAIL LS 1098  8.5  8.5
## 9         CHEVROLET              CHEVROLET SPARK GT [M300]  936  7.2 15.7
## 10        CHEVROLET              CHEVROLET SPARK LT [M200]  782  6.1 21.8
## 11         DAIHATSU                   DAIHATSU TERIOS OKii   74 55.2 55.2
## 12         DAIHATSU                   DAIHATSU SIRION 1.3L   18 13.4 68.6
## 13    DFSK/DFM/DFZL DFSK/DFM/DFZL VAN CARGA EQ5021XXYF Y01   19 23.8 23.8
## 14            DODGE                  DODGE JOURNEY SE [FL]  117 30.6 30.6
## 15            DODGE          DODGE JOURNEY SE/EXPRESS [FL]   69 18.1 48.7
## 16            DODGE                 DODGE JOURNEY SXT [FL]   39 10.2 58.9
## 17             FIAT                      FIAT 500 SPORT BZ   17  8.5  8.5
## 18             FIAT                         FIAT PALIO ELX   17  8.5 17.0
## 19             FORD                  FORD FIESTA SPORTBACK  311 11.7 11.7
## 20             FORD         FORD FIESTA SPORTBACK TITANIUM  255  9.6 21.3
## 21             FORD               FORD FIESTA SPORTBACK SE  209  7.9 29.2
## 22            FOTON              FOTON TUNLAND BJ2037Y3MDV   16 11.2 11.2
## 23 GREAT WALL MOTOR              GREAT WALL MOTOR HAVAL M4   22 21.2 21.2
## 24 GREAT WALL MOTOR              GREAT WALL MOTOR HAVAL H5   20 19.2 40.4
## 25            HONDA                           HONDA CRV EX   71 21.6 21.6
## 26            HONDA                         HONDA CIVIC LX   42 12.8 34.4
## 27            HONDA                           HONDA CRV LX   40 12.2 46.6
## 28          HYUNDAI                HYUNDAI ACCENT i25 1.6L  289 10.4 10.4
## 29          HYUNDAI              HYUNDAI TUCSON IX 35 2.0L  279 10.1 20.5
## 30          HYUNDAI                HYUNDAI ACCENT i25 1.4L  196  7.1 27.6
## 31              JAC                      JAC S2 HFC7151EAV   46 20.4 20.4
## 32             JEEP              JEEP COMPASS [FL] LIMITED   31 17.1 17.1
## 33             JEEP                    JEEP WRANGLER SPORT   30 16.6 33.7
## 34             JEEP            JEEP GRAND CHEROKEE LIMITED   23 12.7 46.4
## 35         KENWORTH             KENWORTH T800 FULL FILTROS   18 47.4 47.4
## 36              KIA             KIA PICANTO ION XTREM 1.25  331  7.9  7.9
## 37              KIA                    KIA PICANTO ION 1.0  292  7.0 14.9
## 38              KIA                KIA SPORTAGE REVOLUTION  243  5.8 20.7
## 39            MAZDA                        MAZDA 3 TOURING  222  6.8  6.8
## 40            MAZDA                  MAZDA 3 GRAND TOURING  210  6.4 13.2
## 41            MAZDA                          MAZDA 2 15HA8  134  4.1 17.3
## 42    MERCEDES BENZ             MERCEDES BENZ A 200 [W176]   34 11.8 11.8
## 43             MINI              MINI COOPER R56 1.6 COUPE   22 40.7 40.7
## 44       MITSUBISHI                 MITSUBISHI MONTERO V11   43 10.2 10.2
## 45       MITSUBISHI             MITSUBISHI LANCER AVANZADO   27  6.4 16.6
## 46       MITSUBISHI           MITSUBISHI MONTERO V43 WAGON   23  5.5 22.1
## 47           NISSAN                   NISSAN VERSA ADVANCE  183  7.5  7.5
## 48           NISSAN                    NISSAN MARCH ACTIVE  154  6.3 13.8
## 49           NISSAN            NISSAN NP 300 FRONTIER 2.4L  150  6.1 19.9
## 50          PEUGEOT                         PEUGEOT 206 XR   48 16.4 16.4
## 51          PEUGEOT                    PEUGEOT 207 COMPACT   19  6.5 22.9
## 52          PEUGEOT                        PEUGEOT 206 XRA   17  5.8 28.7
## 53          RENAULT               RENAULT DUSTER DYNAMIQUE  492  5.1  5.1
## 54          RENAULT    RENAULT LOGAN F.II ENTRY [FAMILIER]  476  5.0 10.1
## 55          RENAULT              RENAULT DUSTER EXPRESSION  386  4.0 14.1
## 56             SEAT                       SEAT IBIZA STYLE   17 17.3 17.3
## 57        SSANGYONG                  SSANGYONG ACTYON A230   60 20.1 20.1
## 58        SSANGYONG                    SSANGYONG KORANDO C   50 16.7 36.8
## 59        SSANGYONG          SSANGYONG KYRON [FL] M200 XDI   28  9.4 46.2
## 60           SUZUKI                 SUZUKI GRAND VITARA SZ  127 16.2 16.2
## 61           SUZUKI                 SUZUKI SWIFT LIVE 1.2L   93 11.8 28.0
## 62           SUZUKI SUZUKI GRAND VITARA  [FL] SZ GLX SPORT   78  9.9 37.9
## 63           TOYOTA              TOYOTA FORTUNER [FL] 2.7L  126  9.4  9.4
## 64           TOYOTA                   TOYOTA FORTUNER 2.7L  115  8.6 18.0
## 65           TOYOTA                       TOYOTA HILUX IMV   86  6.4 24.4
## 66       VOLKSWAGEN             VOLKSWAGEN JETTA TRENDLINE  272 16.0 16.0
## 67       VOLKSWAGEN             VOLKSWAGEN GOL COMFORTLINE  158  9.3 25.3
## 68       VOLKSWAGEN                VOLKSWAGEN JETTA EUROPA  157  9.2 34.5
## 69            ZOTYE                      ZOTYE NOMADA 1.6L   25 48.1 48.1
# View counter per brand
chosen %>% group_by(brand) %>% summarise(count = n()) %>% data.frame() %>% arrange(desc(count))
##               brand count
## 1               BMW     3
## 2             CHERY     3
## 3         CHEVROLET     3
## 4             DODGE     3
## 5              FORD     3
## 6             HONDA     3
## 7           HYUNDAI     3
## 8              JEEP     3
## 9               KIA     3
## 10            MAZDA     3
## 11       MITSUBISHI     3
## 12           NISSAN     3
## 13          PEUGEOT     3
## 14          RENAULT     3
## 15        SSANGYONG     3
## 16           SUZUKI     3
## 17           TOYOTA     3
## 18       VOLKSWAGEN     3
## 19         DAIHATSU     2
## 20             FIAT     2
## 21 GREAT WALL MOTOR     2
## 22          CHANGAN     1
## 23    DFSK/DFM/DFZL     1
## 24            FOTON     1
## 25              JAC     1
## 26         KENWORTH     1
## 27    MERCEDES BENZ     1
## 28             MINI     1
## 29             SEAT     1
## 30            ZOTYE     1
# Start loop for each car (brand-line-reference) combined with model and age
n <- 1 # scalability
year <- year(Sys.Date())
year_max <- 15 # Max per insurance company http://bit.ly/2t3IdQ3
output <- data.frame(vehicle=character(),
                     model=integer(),
                     client_year=integer())
for (i in 1:nrow(chosen)){
  # MODELS (VEHICLE'S AGE)
  years <- cars %>% filter(vehicle == chosen$vehicle[i]) %>% select(model) %>% distinct()
  years_dim <- nrow(years)
  years <- ifelse(years_dim >= 6, 
                  sample_n(years,6,replace=F), 
                  sample_n(years,6,replace=T))
  years <- as.integer(years[[1]])
  # PERSON'S AGE
  ages <- c(
    sample(18:24,n), # Prices only change with the 24 years old threshold
    sample(25:70,n)
  )
  # Concat all combinations
  all <- list(a = chosen$vehicle[i], b = years, c = year-ages)
  all <- do.call(expand.grid, all) %>% arrange(a,desc(b),c)
  output <- rbind(output,all)
}
colnames(output) <- c('vehicle','model','client_year')
# Lastly, we delete all duplicates (because might have less than 6 models per car)
output <- distinct(output)
head(output,10)
##                             vehicle model client_year
## 1  BMW X3 [E83] xDrive30i Executive  2011        1966
## 2  BMW X3 [E83] xDrive30i Executive  2011        1995
## 3  BMW X3 [E83] xDrive30i Executive  2010        1966
## 4  BMW X3 [E83] xDrive30i Executive  2010        1995
## 5                      BMW 320i E90  2009        1962
## 6                      BMW 320i E90  2009        1994
## 7                      BMW 320i E90  2008        1962
## 8                      BMW 320i E90  2008        1994
## 9                      BMW 320i E90  2007        1962
## 10                     BMW 320i E90  2007        1994
# Add car's data (price, fasecolda, complete reference)
info <- cars %>% 
  filter(vehicle %in% chosen$vehicle) %>%
  select(vehicle,brand,model,fasecolda,value,complete) %>%
  mutate(unique = paste(vehicle,model)) %>%
  arrange(vehicle,desc(model),desc(value))
prices <- info %>%
  group_by(unique) %>% 
  summarise(mean_value = mean(value)) %>%
  left_join(info %>% select(brand,unique,complete,fasecolda), by = "unique") %>% 
  group_by(unique,complete) %>% mutate(row=row_number()) %>%
  filter(row == 1) %>% select(-row)
output <- output %>% 
  mutate(unique = paste(vehicle,model)) %>%
  left_join(info, by = "unique") %>% 
  group_by(vehicle.x,client_year,model.x) %>%
  mutate(row=row_number()) %>% 
  filter(row==1) %>% select(-row,-unique,-vehicle.y,-model.y) %>% 
  rename(vehicle=vehicle.x,model=model.x) %>% ungroup() %>%
  mutate(plate = c("CXB162"), cc = c("1015471239"), car_value = c("0"), policy_rate = c("0"), id = row_number(), fasecolda = str_pad(fasecolda, 8, pad = "0")) %>% 
  select(id, car_value, policy_rate, client_year, fasecolda, model, cc, brand, plate, vehicle, complete) %>% data.frame()

Finally, we get a sample set of (n) rows to be quotes manually:

nrow(output)
## [1] 676

For more details on this data see: http://bit.ly/pi-sde-01

IMPORT FILLED SAMPLES

# Connect Google Sheet to import data
my_sheets <- gs_ls()
## Auto-refreshing stale OAuth token.
df <- "Pricing Model - Seguros del Estado"
df <- gs_title(df)
## Sheet successfully identified: "Pricing Model - Seguros del Estado"
df <- df %>% gs_read(ws = "Llenado", range = cell_cols(1:22)) %>% data.frame()
## Accessing worksheet titled 'Llenado'.
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   ID = col_integer(),
##   Modelo = col_integer(),
##   `Año nacimiento` = col_integer(),
##   vehicle_age = col_integer(),
##   age = col_integer(),
##   engine_size = col_integer(),
##   power = col_integer(),
##   person_capacity = col_integer(),
##   doors = col_integer(),
##   is_imported = col_integer()
## )
## See spec(...) for full column specifications.
df$Valor.Asegurado <- as.integer(gsub("[^0-9]", "", df$Valor.Asegurado))
df$Prima <- as.integer(gsub("[^0-9]", "", df$Prima))
df$prime01 <- as.integer(gsub("[^0-9]", "", df$prime01))
df$prime02 <- as.integer(gsub("[^0-9]", "", df$prime02))
df$prime03 <- as.integer(gsub("[^0-9]", "", df$prime03))
df$prime04 <- as.integer(gsub("[^0-9]", "", df$prime04))
df <- df %>% 
  select(-Año.nacimiento, -Vehículo, -Referencia.completa, -body_name) %>%
  rename(id=ID,vehicle_value=Valor.Asegurado,net_policy_value=Prima,fasecolda=Fasecolda,model=Modelo)

scrap <- df

INSPECT THE DATA

# Missing values
missmap(scrap, legend=FALSE, rank.order=TRUE) #NONE

# Distribution of vehicles
ggplot(scrap, aes(model, fill=brand)) + geom_histogram(bins=100) + facet_grid(body_alias ~.)

# Vehicle vs Policy Prices
ggplot(scrap, aes(vehicle_value, net_policy_value, color=model)) + geom_point()

CREATE THE MODEL

First we have to prepare our data set: delete the features we will not use, split data and tran sets, one hot encodign for categorical features, and convert data tables into XGBoost objects.

set.seed(1) # Always useful for reproduction

# Delete some data:
df <- select(df,-prime01,-prime02,-prime03,-prime04) # Later: Experiences
df$fasecolda <- NULL # Good predictor but it's noise
df$body_alias <- NULL # No need
df$model <- NULL # We need the vehicle's age so it'll always work

# Spaces fix from brands:
df$brand <- gsub('\\s+', '_',df$brand)

# Split data: test & train
split <- sample.split(df$net_policy_value, SplitRatio = 0.7) #Split 70/30
train <- subset(df, split == TRUE) #Training
test <- subset(df, split == FALSE) #Testing
setDT(train); setDT(test)

#Using one hot encoding
labels <- train$net_policy_value #Target train
ts_label <- test$net_policy_value #Target test
new_tr <- model.matrix(~.+0,data = train[,-c("net_policy_value","id"),with=F]) # Train
new_ts <- model.matrix(~.+0,data = test[,-c("net_policy_value","id"),with=F]) # Test

#Convert data table into a matrix (xgb.DMatrix):
dtrain <- xgb.DMatrix(data = new_tr,label = labels)
dtest <- xgb.DMatrix(data = new_ts,label = ts_label)

After the data preparation, we can start our trainings. First, we have to define our parameters and we validate with Cross-Validation that we are not overfitting our training set.

#Default parameters:
params <- list(
  booster = "gbtree", objective = "reg:linear", 
  metrics = "rmse",
  max_depth=4, 
  eta=0.02, gamma=0.1,
  min_child_weight=1, subsample=0.9, colsample_bytree=0.9)
# Cross-validation:
xgbcv <- xgb.cv(params = params, data = dtrain,
                nrounds = 10000, nfold = 5, 
                print_every_n = 1000, 
                early_stopping_rounds = 20, 
                prediction = F)
## [1]  train-rmse:2682676.800000+57075.971800  test-rmse:2675216.000000+227020.938107 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 20 rounds.
## 
## [1001]   train-rmse:38962.126562+1957.277331 test-rmse:167705.818750+29977.573693 
## [2001]   train-rmse:16665.787891+1463.827168 test-rmse:158118.201563+29906.788844 
## [3001]   train-rmse:9123.931445+973.769899   test-rmse:155070.939062+29799.545273 
## [4001]   train-rmse:5551.006543+707.186433   test-rmse:153965.982813+29910.255909 
## [5001]   train-rmse:3558.466846+572.453720   test-rmse:153409.904688+29871.410016 
## Stopping. Best iteration:
## [5510]   train-rmse:2880.456006+507.839925   test-rmse:153240.198437+29876.498531
bestn <- xgbcv$best_iteration; bestn
## [1] 5510

Once we know our best iteration or n number of trees, we can train our whole training set.

#Model training on our test set:
xgb1 <- xgb.train(
  params = params, data = dtrain,
  nrounds = bestn,
  watchlist = list(val=dtest,train=dtrain),
  print_every_n = 1000
)
## [1]  val-rmse:2701682.250000 train-rmse:2683069.750000 
## [1001]   val-rmse:143434.656250  train-rmse:43590.875000 
## [2001]   val-rmse:133379.421875  train-rmse:20522.781250 
## [3001]   val-rmse:130540.304688  train-rmse:12117.409180 
## [4001]   val-rmse:129294.414062  train-rmse:7805.442383 
## [5001]   val-rmse:128592.773438  train-rmse:5337.421875 
## [5510]   val-rmse:128378.296875  train-rmse:4366.730469
val_rmse <- min(xgb1$evaluation_log$val_rmse)
train_rmse <- min(xgb1$evaluation_log$train_rmse)
paste("rmse: ",round(val_rmse,1)," @ iter:",bestn,sep="")
## [1] "rmse: 128377.4 @ iter:5510"

Let’s see how good is our model predicting the test set:

# Predictions
test_preds <- cbind(real = test$net_policy_value,
                    pred = predict(xgb1,dtest)) %>% data.frame()
test_preds <- mutate(test_preds, 
                     dif = abs(test_preds$real-test_preds$pred),
                     dif_perc = round(100*dif/real,2))
head(test_preds)
##      real    pred       dif dif_perc
## 1 4138632 4285598 146965.50     3.55
## 2 3934584 3999427  64843.00     1.65
## 3 2724030 2697309  26721.00     0.98
## 4 1134200 1177055  42854.62     3.78
## 5 1073160 1105470  32309.62     3.01
## 6 1342608 1307043  35564.62     2.65
dif_perc <- summary(test_preds$dif_perc); dif_perc
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.010   0.770   1.690   2.613   2.970  33.020
# Plot predictions vs real values
ggplot(test_preds, aes(x=real/1000,y=pred/1000)) + 
  geom_point() + xlim(0,6000) + ylim(0,6000) + geom_smooth(method="lm") +
  xlab("Precio Póliza Real (x1000 COP)") + ylab("Precio Póliza Predicción (x1000 COP)") + guides(color=FALSE) + 
  ggtitle("Valores de Pólizas para Seguros del Estado
          Predicción vs Reales del Test Set")
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).

# Importance matrix
imp <- xgb.importance(feature_names = colnames(new_tr), model = xgb1)
n <- ifelse(ncol(new_tr) <= 10, ncol(new_tr), 10)
grid.arrange(arrangeGrob(
  ggplot(imp[1:n,1:2],
         aes(x=reorder(Feature,Gain),
             y=Gain,
             label=round(Gain,2),fill=as.numeric(Gain))) + 
    geom_col() + coord_flip() + xlab('') + ylab('Importancia') + 
    guides(fill=FALSE) + geom_text(hjust=-0.5)))

# Correlations
library(corrplot)
cor10 <- cor(as.data.frame(
  cbind(new_tr[,colnames(new_tr) %in% unlist(c(imp[1:10,1]))],
        X=labels)))
corrplot(cor10,method="square",type="upper",order="AOE")

# Deciles (percetage difference between pred and real)
deciles <- quantile(test_preds$dif_perc, probs = seq(0.1, 0.9, length = 9), names = TRUE)
deciles <- data.frame(cbind(Deciles=row.names(as.data.frame(deciles)),Threshold=as.data.frame(deciles)),row.names=NULL)
ggplot(deciles,aes(x=Deciles,y=deciles,label=round(deciles,2),fill=as.numeric(deciles))) + 
  geom_col() + geom_text(vjust=-1) + guides(fill=FALSE) + 
  xlab('Deciles de errores por predicción') + ylab('Error porcentual máximo') +
  ggtitle("Deciles por error absoluto de predicciones del test set")

# PLot CV learning curve
xgbcv$evaluation_log %>% 
  select(1,2,4) %>% 
  gather(type,value, -iter) %>% 
  ggplot(aes(iter,value)) + 
  geom_line(aes(colour = type))

# EXPORT TEXT RESULTS
export <- list(
  "Nota:"="estos valores predecidos son sin IVA y para una persona 'sin experiencia' (siniestros)",
  "Mejor iteración:"=bestn,
  "RMSE test:"=val_rmse,
  "RMSE train:"=train_rmse,
  "Cuartiles (%):"=dif_perc,
  "Importancia variables"=imp,
  "Parámetros:"=(xgb1$params),
  "Campos:"=as.data.frame(cbind(Campos=colnames(new_tr)))); export
## $`Nota:`
## [1] "estos valores predecidos son sin IVA y para una persona 'sin experiencia' (siniestros)"
## 
## $`Mejor iteración:`
## [1] 5510
## 
## $`RMSE test:`
## [1] 128377.4
## 
## $`RMSE train:`
## [1] 4366.73
## 
## $`Cuartiles (%):`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.010   0.770   1.690   2.613   2.970  33.020 
## 
## $`Importancia variables`
##                   Feature         Gain       Cover   Frequency
##  1:         vehicle_value 9.095660e-01 0.365798111 0.433889225
##  2:                   age 4.646833e-02 0.123269052 0.204596312
##  3:          brandRENAULT 1.230105e-02 0.024686901 0.012625974
##  4:           engine_size 8.566470e-03 0.074700096 0.061545498
##  5:              brandKIA 8.029029e-03 0.019371673 0.012266632
##  6:       brandVOLKSWAGEN 2.714405e-03 0.023121687 0.011596949
##  7:        brandCHEVROLET 2.443803e-03 0.019017166 0.011074269
##  8:           vehicle_age 1.962566e-03 0.064673451 0.094670304
##  9:           brandNISSAN 1.338536e-03 0.016994853 0.008869216
## 10:            brandCHERY 1.324925e-03 0.023090138 0.013377326
## 11: brandGREAT_WALL_MOTOR 1.096990e-03 0.011447636 0.005161459
## 12:            brandHONDA 7.434955e-04 0.010723442 0.006141483
## 13:            brandMAZDA 5.692681e-04 0.011567307 0.008297535
## 14:             brandJEEP 3.676412e-04 0.008645148 0.006566160
## 15:             brandFORD 3.333981e-04 0.008718089 0.004361106
## 16:          brandPEUGEOT 3.200264e-04 0.008322071 0.004491776
## 17:        brandSSANGYONG 2.998464e-04 0.009744723 0.006958169
## 18:              brandJAC 2.671315e-04 0.041917064 0.017493426
## 19:              brandBMW 2.522487e-04 0.017889244 0.012070627
## 20:          brandHYUNDAI 1.927471e-04 0.013965594 0.011237607
## 21:            brandDODGE 1.885250e-04 0.007997571 0.005847476
## 22:             brandFIAT 1.023138e-04 0.005795811 0.002989073
## 23:    brandMERCEDES_BENZ 8.530916e-05 0.005668312 0.002580729
## 24:             brandSEAT 8.105068e-05 0.015736704 0.007725855
## 25:              brandBYD 7.809781e-05 0.003928987 0.001617039
## 26:       brandMITSUBISHI 6.817014e-05 0.011243638 0.006860167
## 27:            brandZOTYE 6.405249e-05 0.008015717 0.003903762
## 28:         brandDAIHATSU 4.789049e-05 0.005581256 0.003332081
## 29:             brandMINI 4.698002e-05 0.013689129 0.006353821
## 30:           brandTOYOTA 4.197172e-05 0.005763669 0.003038074
## 31:           brandSUZUKI 2.194815e-05 0.013026252 0.006190484
## 32:    brandDFSK/DFM/DFZL 1.578993e-05 0.005889508 0.002270389
##                   Feature         Gain       Cover   Frequency
## 
## $`Parámetros:`
## $`Parámetros:`$booster
## [1] "gbtree"
## 
## $`Parámetros:`$objective
## [1] "reg:linear"
## 
## $`Parámetros:`$metrics
## [1] "rmse"
## 
## $`Parámetros:`$max_depth
## [1] 4
## 
## $`Parámetros:`$eta
## [1] 0.02
## 
## $`Parámetros:`$gamma
## [1] 0.1
## 
## $`Parámetros:`$min_child_weight
## [1] 1
## 
## $`Parámetros:`$subsample
## [1] 0.9
## 
## $`Parámetros:`$colsample_bytree
## [1] 0.9
## 
## $`Parámetros:`$silent
## [1] 1
## 
## 
## $`Campos:`
##                   Campos
## 1          vehicle_value
## 2            vehicle_age
## 3                    age
## 4               brandBMW
## 5               brandBYD
## 6             brandCHERY
## 7         brandCHEVROLET
## 8          brandDAIHATSU
## 9     brandDFSK/DFM/DFZL
## 10            brandDODGE
## 11             brandFIAT
## 12             brandFORD
## 13 brandGREAT_WALL_MOTOR
## 14            brandHONDA
## 15          brandHYUNDAI
## 16              brandJAC
## 17             brandJEEP
## 18              brandKIA
## 19            brandMAZDA
## 20    brandMERCEDES_BENZ
## 21             brandMINI
## 22       brandMITSUBISHI
## 23           brandNISSAN
## 24          brandPEUGEOT
## 25          brandRENAULT
## 26             brandSEAT
## 27        brandSSANGYONG
## 28           brandSUZUKI
## 29           brandTOYOTA
## 30       brandVOLKSWAGEN
## 31            brandZOTYE
## 32           engine_size
## 33                 power
## 34       person_capacity
## 35                 doors
## 36           is_imported

Now we can export our binary file so we can implement the prediction model in production:

# EXPORTED MODEL
#xgb.save(xgb1, fname="Models/PI-SDE.01/price_intelligence_v01.model")
# CHECK BINARY FILE
#pred <- predict(xgb1,dtrain, ntreelimit = bestn)
#xgb2 <- xgb.load("Models/PI-SDE.01/price_intelligence_v01.model")
#pred2 <- predict(xgb2, dtrain, ntreelimit = bestn)
#sum(abs(pred2-pred)) #0?

EXPERIENCES

The last step is add experience factor (claims and years insured) to the predicted values:

df <- scrap %>% 
  filter(!is.na(prime01)) %>%
  mutate(p01 = prime01/net_policy_value,
         p02 = prime02/net_policy_value,
         p03 = prime03/net_policy_value,
         p04 = prime04/net_policy_value) %>%
  select(id:net_policy_value,model:prime04,p01:p04)

x <- df %>% gather()
p0 <- filter(x, grepl('p0', key))
vv <- filter(x, grepl('vehicle_value', key))
vv <- rbind(vv,vv,vv,vv)
x <- data.frame(cbind(p0,vv)) %>% 
  rename(vv=value.1) %>% select(-key.1)

values <- x %>% group_by(key) %>% 
  summarise(mean = mean(value), 
            max = max(value),
            min = min(value))

a <- paste("means: ",paste(round(values$mean,4),collapse=" | "),sep="")
b <- paste("maxs: ",paste(round(values$max,4),collapse=" | "),sep="")
c <- paste("mins: ",paste(round(values$min,4),collapse=" | "),sep="")

ggplot(x,aes(x=key,y=value,color=vv/1000000)) + 
  geom_boxplot() + geom_point() +
  ggtitle("Fracción del valor póliza vs experiencia y valor vehículo") +
  xlab("Años sin siniestros") + ylab("Fracción") +
  annotate("text", x="p03", y=0.900, label= b) +
  annotate("text", x="p03", y=0.880, label= a) + 
  annotate("text", x="p03", y=0.860, label= c)

This means that, for people who have had experiences, we will show the predicted value (for each of the p0 ranges) times X, where X is:

The total value spent to collect all data on manual quotes was 180.000 COP (9% of the budget) thanks to the optimal number of samples needed and because of the testings done before sending the actual requests.