Introduction

ETL process

Relational database

Picture of database

ETL workflow

Picture of ETL workflow

Event Data

Final datasets

Dow Jones dataset

Gold dataset

Oil dataset

Dow Jones

Total number of casualties

Gold

Total number of casualties

Oil

Total number of casualties

Statistical analysis

Analysis

  • Variable selection
  • Numerical prediction
    • Linear regression
    • Generalized additive model
    • Multivariate adaptive regression splines
    • Random forest
  • Classification
    • Logistic regression
    • Random forest
    • K-nearest neighbours
  • Model training
    • Total deaths
    • Lagged variables
    • Other variables
  • Model selection
    • Predictions vs. Test set
    • Mean squared error
    • Accuracy
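
The lagged variables listed under model training can be sketched as follows. This is a minimal illustration in Python, assuming that lag k of a series is simply its value k observations earlier. The series name `deaths` and the helper `lag` are hypothetical, and the original analysis may have built its lags differently.

```python
from typing import List, Optional, Sequence


def lag(values: Sequence[float], k: int) -> List[Optional[float]]:
    """Return the series shifted k observations later (None where no earlier value exists)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return [values[i - k] if i >= k else None for i in range(len(values))]


# Hypothetical series with purely illustrative values (not taken from the datasets).
deaths = [10.0, 20.0, 30.0, 40.0, 50.0]

# One feature column per lag, named like the lag1..lag3 terms in the regression output.
features = {f"lag{k}": lag(deaths, k) for k in (1, 2, 3)}

# Keep only the observations for which every lag is available.
complete_rows = [
    i for i in range(len(deaths))
    if all(features[name][i] is not None for name in features)
]
```

The first three observations have no complete set of lags, so a model trained on lag1 to lag3 would typically drop them.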

Examples

Linear regression - without variable selection

term               estimate    std.error   statistic   p.value
(Intercept)        3.62        1.16        3.11        0.00186
capital            0.00977     0.0206      0.474       0.636
regionAmericas     -0.0377     0.0189      -1.99       0.0463
regionAsia         -0.0257     0.0109      -2.35       0.0186
regionEurope       -0.0488     0.0189      -2.59       0.00965
regionMiddle East  -0.0272     0.0148      -1.84       0.0658
total_deaths       8.43e-07    1.29e-06    0.653       0.514
deaths_civilians   -3.18e-05   2.74e-05    -1.16       0.245
best               3.17e-05    2.72e-05    1.16        0.244
year               -0.00178    0.00058     -3.06       0.00221
lag1               1.77e-06    8.05e-07    2.19        0.0284
lag2               -1.94e-06   1.21e-06    -1.6        0.11
lag3               -3.49e-05   4.53e-06    -7.72       1.23e-14
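
As a reading aid for the coefficient table above: the statistic column is the estimate divided by its standard error, and the p-value is the two-sided tail probability of that statistic. The sketch below recomputes both for three rows. It assumes a normal approximation to the t distribution, which is reasonable only when the residual degrees of freedom are large (the table does not state them), and the function names are hypothetical.

```python
import math


def t_statistic(estimate: float, std_error: float) -> float:
    """Ratio of a coefficient estimate to its standard error."""
    return estimate / std_error


def two_sided_p_normal(statistic: float) -> float:
    """Two-sided p-value under a standard normal approximation (assumption: large sample)."""
    return math.erfc(abs(statistic) / math.sqrt(2))


# (estimate, std.error) pairs copied from the table above.
rows = {
    "year": (-0.00178, 0.00058),
    "lag1": (1.77e-06, 8.05e-07),
    "lag3": (-3.49e-05, 4.53e-06),
}

for term, (estimate, std_error) in rows.items():
    t = t_statistic(estimate, std_error)
    print(f"{term}: statistic={t:.2f}, approx p={two_sided_p_normal(t):.3g}")
```

The recomputed values differ slightly from the reported ones because the table shows rounded estimates and standard errors. For `lag3`, for example, the ratio comes out at about -7.70 against the reported -7.72.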

Linear regression - with variable selection

term               estimate    std.error   statistic   p.value
(Intercept)        3.71        1.16        3.2         0.00138
regionAmericas     -0.0385     0.0189      -2.04       0.0414
regionAsia         -0.0267     0.0108      -2.47       0.0135
regionEurope       -0.0495     0.0189      -2.62       0.00869
regionMiddle East  -0.028      0.0147      -1.91       0.0564
year               -0.00182    0.000579    -3.15       0.00165
lag1               1.8e-06     8.04e-07    2.24        0.0249
lag2               -1.93e-06   1.21e-06    -1.59       0.113
lag3               -3.48e-05   4.52e-06    -7.7        1.41e-14

Model comparisons

MSE scores for numerical prediction

Random forest achieves the lowest MSE for all three dependent variables.

Dependent variable   lm     rf       gam    mars
Dow                  1.13   0.0502   1.12   1.13
Gold                 1.1    0.0887   1.1    1.1
Oil                  2.93   0.162    2.85   2.87
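
For reference, mean squared error is the average of the squared differences between test-set values and predictions, so lower is better. The sketch below defines it in plain Python and then selects the lowest-scoring model per dependent variable from the MSE scores reported above. The function name and the dictionary layout are illustrative assumptions, not the code used for the analysis.

```python
from typing import Dict, Sequence


def mean_squared_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average squared difference between actual and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# MSE scores copied from the comparison table above.
mse_scores: Dict[str, Dict[str, float]] = {
    "Dow": {"lm": 1.13, "rf": 0.0502, "gam": 1.12, "mars": 1.13},
    "Gold": {"lm": 1.1, "rf": 0.0887, "gam": 1.1, "mars": 1.1},
    "Oil": {"lm": 2.93, "rf": 0.162, "gam": 2.85, "mars": 2.87},
}

# For each dependent variable, pick the model with the smallest MSE.
best_model = {target: min(scores, key=scores.get) for target, scores in mse_scores.items()}
print(best_model)
```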

Accuracy scores for classification

Random forest again achieves the highest accuracy for all three dependent variables.

Dependent variable   log     rf      knn
Dow                  0.497   0.99    0.844
Gold                 0.541   0.987   0.777
Oil                  0.528   0.992   0.838

Final results

Numerical prediction - Dow Jones

Numerical prediction - Gold

Numerical prediction - Oil

Classification - Dow Jones

Dow Jones
            0       1
Decrease    9048    91
Increase    108     10277

Classification - Gold

Gold
            0       1
Decrease    10931   98
Increase    168     9229

Classification - Oil

Oil
            0       1
Decrease    6509    49
Increase    55      5879
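
The three confusion matrices above can be tied back to the reported accuracy scores. The sketch below assumes that column 0 corresponds to Decrease and column 1 to Increase, so the diagonal cells hold the correctly classified cases. Under that assumption the computed accuracies round to the random forest values reported earlier (0.99, 0.987 and 0.992). The function name is hypothetical.

```python
from typing import Dict, List


def accuracy(matrix: List[List[int]]) -> float:
    """Share of cases on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Counts copied from the matrices above; rows are Decrease, Increase.
confusion: Dict[str, List[List[int]]] = {
    "Dow Jones": [[9048, 91], [108, 10277]],
    "Gold": [[10931, 98], [168, 9229]],
    "Oil": [[6509, 49], [55, 5879]],
}

for name, matrix in confusion.items():
    print(f"{name}: accuracy={accuracy(matrix):.3f}")
```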

Variable importance - Dow Jones

Variable importance - Gold

Variable importance - Oil