The Actuaries Institute (Australia) 2016 Kaggle competition is now live, running until 31-Jan-2017. The competition can be found at Actuaries2016-VicRoads

Unlike last year, there is more significant money at stake this year: $5000 in prizes, donated by IAG. To compete, though, you need to know a member of the Actuaries Institute and persuade them to join your team.

The Data

The data used for the competition has been provided by VicRoads. It consists of the current road infrastructure on all Victorian A, B and C category roads (Lane Width, Curvature, Rumble Strips, Speed Limits, Traffic Volumes,…) as well as the accidents that have occurred on those roads since 2006. The accident details include the numbers of different types of road users killed and injured, as well as the types of vehicles involved.

What you need to predict is the “Cost” of road accidents on each section of road in a calendar quarter. The Cost is just a formula based on numbers and types of victims as well as numbers and types of vehicles. This allows a complex model to be built predicting each type of casualty independently if you so wish. You can also leave out the detail and just predict total costs.
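As a rough illustration of the "predict each casualty type, then combine" approach, here is a minimal sketch in base R. It assumes the Cost behaves like a weighted sum of predicted counts. The casualty names and unit costs below are hypothetical placeholders, not the competition's actual formula, which you should take from the competition page.

```r
# Hypothetical unit costs per casualty type (placeholders, NOT the real formula).
unit_cost <- c(killed = 100, serious = 20, minor = 1)

# Toy per-type predictions for three road sections in one quarter,
# e.g. the outputs of three separate count models.
pred <- data.frame(
  killed  = c(0.01, 0.00, 0.02),
  serious = c(0.10, 0.05, 0.30),
  minor   = c(0.50, 0.20, 1.00)
)

# Combine the per-type predictions into one predicted cost per road section.
pred$cost <- drop(as.matrix(pred[names(unit_cost)]) %*% unit_cost)
```

The alternative is to compute the cost on the training data first and fit a single model to it directly. That is simpler, but it discards the detail about which casualty types drive the cost.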

External Information

Unlike almost every other Kaggle competition, this one encourages you to use other data, so long as it's not accident data. So you are free to gather other information that may be beneficial to your predictions. For example:

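# Uses the PLOT function from the Sample Code; acc is assumed to be the accident data loaded as a data.table
library(data.table)
library(ggplot2)
# Accident IDs for postcode 3220 (Geelong)
accidents <- acc[POSTCODE == 3220, ACCIDENT_NO]
# Coordinates of those accidents
lats <- acc[ACCIDENT_NO %in% accidents, Latitude]
lons <- acc[ACCIDENT_NO %in% accidents, Longitude]
PLOT(LON = lons, LAT = lats, acc.id = accidents) + ggtitle('Geelong is pretty bad')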

Testing and Training Data

The test set has not been derived just by a random selection of road points. If it were, you could build a model based on accident history at a neighbouring point in the training set and not know whether the predicted cost was due to the centre divide, traffic volume or speed limit. To prevent this type of model from working, the roads were chopped into lengths and each length was subject to the random selection for the test set. In addition, all roads in the last 4 calendar quarters are in the test set. This means that time trends could be important in building the winning model.
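If you want a local validation set that behaves like the test set, one option is to mimic the same design: hold out whole lengths of road rather than individual points, plus the most recent quarters. The sketch below uses base R on toy data. The column names, the segment length of 5 points and the 25% hold-out rate are all assumptions for illustration, not details of how VicRoads built the real split.

```r
set.seed(42)

# Toy data: 20 road points on each of 3 roads, observed over 12 quarters.
# (Hypothetical columns; the real data will look different.)
obs <- expand.grid(point_id = 1:20, road = c("A1", "B2", "C3"),
                   quarter = 1:12, stringsAsFactors = FALSE)

# Chop each road into lengths of 5 consecutive points.
obs$segment <- paste(obs$road, (obs$point_id - 1) %/% 5, sep = "_")

# Hold out whole segments at random (about 25% of them) ...
segments <- unique(obs$segment)
held_out <- sample(segments, size = ceiling(0.25 * length(segments)))

# ... plus every road point in the last 4 quarters.
last_quarters <- tail(sort(unique(obs$quarter)), 4)
is_valid <- obs$segment %in% held_out | obs$quarter %in% last_quarters

train <- obs[!is_valid, ]
valid <- obs[is_valid, ]
```

A model validated this way cannot lean on the accident history of an immediate neighbour in the same length of road, and it has to extrapolate forward in time, which is closer to what the leaderboard will measure.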

Lots of food for thought.