AToShiDe

Pascal
22-09-2018

Another Toy Shiny Demonstration [あとしで]

Context: Coursera JHU, Developing Data Products

Purpose

  • Apply different algorithms to the same dataset and compare their results using metrics such as accuracy or kappa,
  • Extend this process to several classification datasets obtained from the UCI repository,
  • Split each dataset into two subsets: training and test,
  • Choose the type of cross-validation and the metric,
  • Train a model with a given algorithm and use the result to calculate accuracy and kappa (the metrics) and the out-of-sample error (oose) on the test set,
  • Report the metrics, the oose and the user time taken to execute the algorithm,
  • Report the best results according to the chosen metric,
  • Display basic plots of the metrics.
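
As a rough illustration of the split and metric steps listed above, here is a minimal sketch in plain Python. It is not the app's own code, and the helper names (split_rows, accuracy, cohen_kappa, out_of_sample_error), the seed value and the toy labels are all assumptions made for this example. It treats the out-of-sample error as 1 minus the test accuracy, which matches the oose and acc_test columns reported in this deck.

```python
import math
import random
from collections import Counter


def split_rows(rows, train_fraction, seed):
    """Shuffle a copy of `rows` with a fixed seed, then cut it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def accuracy(actual, predicted):
    """Fraction of predictions that match the true labels."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


def cohen_kappa(actual, predicted):
    """Agreement corrected for the agreement expected by chance."""
    n = len(actual)
    observed = accuracy(actual, predicted)
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(
        actual_counts[label] * predicted_counts[label] for label in actual_counts
    ) / (n * n)
    if expected == 1.0:
        return math.nan  # kappa is undefined when only one class ever appears
    return (observed - expected) / (1.0 - expected)


def out_of_sample_error(actual, predicted):
    """Misclassification rate on held-out data: 1 - accuracy."""
    return 1.0 - accuracy(actual, predicted)


# Toy data (hypothetical): ten row ids and eight test-set labels.
train_ids, test_ids = split_rows(range(10), train_fraction=0.6, seed=1234)

actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 1]

acc = accuracy(actual, predicted)
kappa = cohen_kappa(actual, predicted)
oose = out_of_sample_error(actual, predicted)
print(len(train_ids), len(test_ids), acc, kappa, oose)
```

Passing the same seed to split_rows always yields the same partition, which is what makes results comparable across algorithms.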

In practice, as a proof of concept

About tradeoffs

The tradeoff is between fluidity/responsiveness and long computation times.

  • As the dataset grows, and depending on the selected parameters and algorithm, a result can take a while to compute.
  • To keep the app fluid, I relied on caching: a subset of the possible results was pre-computed using a fixed seed.
  • However, not every combination was pre-computed, so you can see the difference for yourself (for example with a 60% split, …).
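
The caching idea above can be sketched as a lookup table keyed by the chosen parameters. This is only an illustration in plain Python, not the app's implementation. The dataset name, the seed value, the pre-computed split values (0.7 and 0.8) and the function names are assumptions; only the algorithm labels and the uncached 60% split come from this deck.

```python
from itertools import product

SEED = 1234  # hypothetical fixed seed, so cached and fresh results stay comparable
ALGORITHMS = ["C5.0", "GBM", "BCART", "RF"]
PRECOMPUTED_SPLITS = [0.7, 0.8]  # hypothetical subset chosen for pre-computation

cache = {}
slow_calls = 0  # counts how often the expensive step actually runs


def train_and_evaluate(dataset, algorithm, train_fraction):
    """Stand-in for the slow step (training plus metric computation)."""
    global slow_calls
    slow_calls += 1
    return {
        "dataset": dataset,
        "algorithm": algorithm,
        "train_fraction": train_fraction,
        "seed": SEED,
    }


def get_result(dataset, algorithm, train_fraction):
    """Return the cached result if present, otherwise compute and store it."""
    key = (dataset, algorithm, train_fraction)
    if key not in cache:
        cache[key] = train_and_evaluate(*key)
    return cache[key]


# Warm the cache for a subset of the possible parameter combinations.
for algorithm, fraction in product(ALGORITHMS, PRECOMPUTED_SPLITS):
    get_result("example_dataset", algorithm, fraction)
calls_after_warmup = slow_calls

get_result("example_dataset", "RF", 0.7)  # cached: returns immediately
calls_after_hit = slow_calls

get_result("example_dataset", "RF", 0.6)  # not cached: the slow step runs
calls_after_miss = slow_calls

print(calls_after_warmup, calls_after_hit, calls_after_miss)
```

A cached combination never triggers the slow step again, while an uncached one, such as the 60% split, pays the full computation cost the first time it is requested.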

Example of pre-computed/cached result:

algorithm  acc_train  kappa_train  acc_test   kappa_test  oose       utime
C5.0       0.9265884  0.8369122    0.9038462  0.7823357   0.0961538  2.899
GBM        0.9348367  0.8566328    0.8942308  0.7620632   0.1057692  1.528
BCART      0.8901735  0.7640245    0.9038462  0.7823357   0.0961538  1.959
RF         0.9432551  0.8741583    0.8557692  0.6834416   0.1442308  3.078
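
To show how a "best result according to the chosen metric" could be read off such a table, here is a small sketch in plain Python that uses the four rows above as data. The best_by helper and its tie handling (returning every algorithm that reaches the best value) are assumptions made for illustration, not necessarily what the app does.

```python
# The cached results shown above, one dict per algorithm.
COLUMNS = ["acc_train", "kappa_train", "acc_test", "kappa_test", "oose", "utime"]
RAW = {
    "C5.0": [0.9265884, 0.8369122, 0.9038462, 0.7823357, 0.0961538, 2.899],
    "GBM": [0.9348367, 0.8566328, 0.8942308, 0.7620632, 0.1057692, 1.528],
    "BCART": [0.8901735, 0.7640245, 0.9038462, 0.7823357, 0.0961538, 1.959],
    "RF": [0.9432551, 0.8741583, 0.8557692, 0.6834416, 0.1442308, 3.078],
}
rows = [{"algorithm": name, **dict(zip(COLUMNS, values))} for name, values in RAW.items()]


def best_by(rows, metric, higher_is_better=True):
    """Return every algorithm that reaches the best value of `metric`."""
    pick = max if higher_is_better else min
    target = pick(row[metric] for row in rows)
    return [row["algorithm"] for row in rows if row[metric] == target]


best_test_accuracy = best_by(rows, "acc_test")
best_train_accuracy = best_by(rows, "acc_train")
fastest = best_by(rows, "utime", higher_is_better=False)

# In this table the oose column equals 1 - acc_test, up to rounding.
oose_matches = all(abs(row["oose"] - (1.0 - row["acc_test"])) < 1e-7 for row in rows)

print(best_test_accuracy, best_train_accuracy, fastest, oose_matches)
```

On this table, C5.0 and BCART tie on test accuracy, RF has the highest training accuracy but the lowest test accuracy, and GBM has the shortest user time.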

Snapshot

[Screenshot of the AToShiDe app]