This project is from the Allstate Claims Severity Kaggle competition located here: https://www.kaggle.com/c/allstate-claims-severity. The goal of this project is to use Allstate's data to predict the severity of insurance claims, scoring the model on minimizing root mean squared error (RMSE). Despite the competition being closed, I chose this dataset because I have 5+ years of experience in the insurance industry and am comfortable with the domain.
The packages used in this project are: caret, dplyr, tidyr, ggplot2, corrplot, e1071, lattice, MASS, CORElearn, randomForest, gbm, catboost, and glmnet.
Data was provided by Allstate here: https://www.kaggle.com/c/allstate-claims-severity/data
The data contains a training set to train the model and a test set for submitting predictions. Columns beginning with “cat” are categorical and columns beginning with “cont” are continuous. This matters because categorical variables are treated differently from continuous ones, and only certain models can handle both.
Since I do not have access to the code book, I cannot make any assumptions about the data and must rely solely on the model to judge variable importance. I also do not have access to data entry personnel to determine whether outliers are data entry errors. Since the competition is over, my goal for this project is to fine-tune a single regression model.
Considering there are categorical variables with 53+ levels, I found the best-suited models to be boosting algorithms or neural networks. Since neural networks have very long training times, I will be using the catboost algorithm provided by Yandex.
# Load in the data; separate out the id column, which is not a predictor
final_test <- read.csv("test.csv")
raw_train <- read.csv("train.csv")
raw_train <- raw_train[-1]      # drop id from the training data
test_id <- final_test[, "id"]   # keep test ids for the submission file
test_set <- final_test[-1]      # drop id from the test features
Looking at the raw training data, there are 131 features and 188,318 observations. The test data has 125,546 observations and every feature except the response variable, loss. The loss variable ranges from $0.67 to $121,012, with a median of $2,115 and a mean of $3,037.
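These figures can be checked directly from the loaded data; a quick verification:

## Dimensions of the two sets and a summary of the response
dim(raw_train)           # 188318 rows, 131 columns (id already dropped)
dim(test_set)            # 125546 rows, features only (no loss)
summary(raw_train$loss)  # min, median, and mean of the loss variable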
I will work under the assumption that all data is entered correctly, and the insurance company would be responsible for all losses listed (assuming no deductible).
First, using the str() function (output excluded to save on length), it is confirmed that the categorical and continuous features are coded to their respective classes (factor and numeric).
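A more compact check than printing the full str() output is to tally the column classes, assuming read.csv's (pre-R 4.0) default of reading strings as factors:

## Count columns by class: 116 factors (cat*) and 15 numerics (cont* plus loss)
table(sapply(raw_train, class))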
It’s best to analyze categorical and continuous variables separately. Since there are 116 categorical variables, no graphs were printed for them. From my analysis, there were columns that could be removed due to low variance, but I chose to leave them in the model since I have no information regarding their importance.
cont_train <- raw_train[, -grep("^cat", colnames(raw_train))]    # continuous features + loss
cat_train <- raw_train[, -grep("^cont", colnames(raw_train))]    # categorical features + loss
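For reference, caret's nearZeroVar() is one way to flag the low-variance columns mentioned above; a minimal sketch:

## Flag near-zero-variance predictors; they are reported here but intentionally kept
low_var <- nearZeroVar(raw_train, saveMetrics = TRUE)
rownames(low_var)[low_var$nzv]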
First, we look at highly correlated pairs among the continuous features.
## Correlation matrix of the continuous features (loss included)
correlated <- cor(cont_train)
corrplot(correlated, method = "square", order = "original", type = "upper", diag = FALSE)
As the plot shows, some pairs are highly positively correlated. For model-building purposes, I am hesitant to remove them without knowing what the features actually represent. Highly correlated predictors can destabilize linear regression models (multicollinearity), which is one reason I chose a boosting model; tree-based methods are largely insensitive to correlated inputs.
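If removal were desired, caret's findCorrelation() can list the offending columns; a sketch using an illustrative 0.9 cutoff:

## Continuous features with pairwise |correlation| above 0.9 (cutoff is illustrative)
high_corr <- findCorrelation(cor(cont_train[, names(cont_train) != "loss"]),
                             cutoff = 0.9, names = TRUE)
high_corr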
Next, we can take a look at the distributions of the continuous variables. This will give us an idea of whether they are bimodal, multimodal, or heavily skewed.
## Density plot of each continuous feature (uses tidyr's gather() and ggplot2)
cont_train %>%
  gather(-loss, key = "key", value = "value") %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free", ncol = 3) +
  geom_density()
There are definitely some variables, such as cont2, that have a categorical look to them (clusters of repeated values rather than a smooth density).
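Counting distinct values is a quick way to check this:

## A near-categorical "continuous" feature takes relatively few distinct values
length(unique(cont_train$cont2))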
Next, we can look at each variable's relationship with loss. We compare against log(loss) because the loss distribution is heavily skewed and contains large outliers.
## Scatter of each continuous feature against log(loss), with a linear fit
cont_train %>%
  gather(-loss, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = log(loss))) +
  geom_point() +
  geom_smooth(method = lm) +
  facet_wrap(~ var, scales = "free", ncol = 3)
This also supports the suspicion that cont2 is categorical in nature. As stated previously, no features were removed; given more information about the features, this step would be revisited.
I chose to use the catboost package, which can be found here: https://github.com/catboost/catboost
CatBoost is a newer gradient boosting technique known for its fast, native handling of categorical variables. It also integrates with the caret package, a common wrapper for model training and cross-validation.
The first step in building the model is separating the training data into a main training set and a validation set. Since there are many observations, I chose a typical 70%/30% split.
set.seed(1185)
train_index <- createDataPartition(raw_train$loss, p = .7, list = FALSE, times = 1)
train_set <- raw_train[train_index, ]    # 70% for model training
valid_set <- raw_train[-train_index, ]   # 30% held out for validation
I chose to fit the model with 5-fold cross-validation.
Modeling is an iterative process of tuning parameters to improve prediction, and the goal of this model is to minimize RMSE. Since each model takes roughly 30 minutes to run and the competition is finalized, I decided to stop when RMSE improvements hit diminishing returns. Other models and/or weighted averages with other models might improve the RMSE further, but that is outside the scope of this project.
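For illustration only, an earlier tuning round might search a wider grid before narrowing to the final values below (these alternative values are hypothetical, not the actual search history):

## Hypothetical wider grid; every added value multiplies the number of
## ~30-minute training runs, so the search was narrowed quickly
wide_grid <- expand.grid(depth = c(4, 6, 8, 10),
                         learning_rate = c(.05, .1, .3),
                         iterations = 100,
                         l2_leaf_reg = c(.05, 3),
                         rsm = .95,
                         border_count = 65)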
Here are the tuning parameters used in the final model:
## 5-fold cross-validation
fit_control <- trainControl(method = "cv", number = 5, classProbs = FALSE)
## Final grid: cross-validation selects among tree depths 2, 6, and 8
grid <- expand.grid(depth = c(2, 6, 8),   # tree depth
                    learning_rate = .1,   # shrinkage applied each iteration
                    iterations = 100,     # number of boosting rounds
                    l2_leaf_reg = .05,    # L2 regularization on leaf values
                    rsm = .95,            # fraction of features sampled per split
                    border_count = 65)    # split count for numeric quantization
The next step is to train the model on the training set; this is the process that takes the bulk of the time.
## Model inputs: predictors and response
response <- c("loss")
x <- train_set[, !names(train_set) %in% response]
y <- train_set$loss
## Training model
set.seed(1185)
final_model <- train(x, as.vector(y), method = catboost.caret,
                     logging_level = "Silent", preProc = NULL,
                     tuneGrid = grid, trControl = fit_control)
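Once training finishes, the cross-validated results and the winning parameter combination can be inspected from the caret object:

## Cross-validated RMSE and R-squared for each depth tried, plus the chosen tune
final_model$results[, c("depth", "RMSE", "Rsquared")]
final_model$bestTune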
Once the model is trained, we can predict on the validation set and measure performance with RMSE and R2.
## Prediction on the held-out validation set
valid_pred <- predict(final_model, valid_set)
## Measures
final_RMSE <- RMSE(pred = valid_pred, obs = valid_set$loss)
final_R2 <- R2(pred = valid_pred, obs = valid_set$loss)
Thus, the model ended with a validation set RMSE of 1897.80 and R2 of 0.56.
To score future test sets, the trained model would be reused. Note that the same preprocessing (here, dropping the id column) must be applied to any additional data set; otherwise the model may not predict accurately.
## Predict on the competition test set and assemble the submission table
final_pred <- predict(final_model, newdata = test_set)
my_submission <- data_frame("ID" = test_id, "loss" = final_pred)
head(my_submission)
## # A tibble: 6 x 2
## ID loss
## <int> <dbl>
## 1 4 1728.
## 2 6 2189.
## 3 9 9545.
## 4 12 5884.
## 5 15 1047.
## 6 17 2468.
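To produce a file in Kaggle's submission format, the table can be written out (the file name is arbitrary):

## Write the submission; Kaggle expects an id column and a loss column
write.csv(my_submission, "catboost_submission.csv", row.names = FALSE)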
Without full knowledge of the data, it’s difficult to create a robust model. CatBoost does a decent job of efficiently handling the variables blindly, without any feature selection. With knowledge of the domain, better accuracy could likely be obtained. Neural networks are an alternative approach that might lead to better predictions, but their training times are usually very long (24+ hours). Since I do not have that computational capacity, they are not an efficient model for me to tune iteratively for this project.