Data mining is a craft. It involves application of science, technology and art as well.
The model structure.
Process diagram showing the relationship between the different phases of CRISP-DM.
Identify the business problem and characterize it. Link to the objectives of the organization.
High level understanding of the fundamental operations of a business is important. Engage analysts with organizational/ institutional knowledge.
What is the expected value of the outcome?
What data do we need to achieve objective 1?
What is the cost of this data?
Is additional investment in obtaining data merited?
Data conversion.
The success of this step depends on how well the problem was structured as well as in variable selection.
Actual application of data mining techniques to the data prepared under step 3.
Heavy application of science and technology.
Data mining techniques and available algorithms.
Rigorous assessment of data mining results. Ensure validity and reliability of results.
Test whether extracted patterns are true regularities and not idiosynracies or sample anomalies.
Model generalization
Controlled testing of model before deployment.
Ensure results match the required solution to the business problem identified under step 1 and support the required decision making process
Check the economic feasibility of the outcome (too many false negatives or positives and their associated costs).
The results and techniques of data mining are put t0 practical use.
Test compatibility with the existing systems, and evaluate the need to recode the model for implementaion into the production environment.
Return to step 1. The results of data mining and the output of the deployed system could lead to additional insights into ways of improving business performance, new ventures or additional lines of business worthy of a discussion at the strategic level.
Meets the expected value set at the business evaluation stage. provides a clear link between the goal (business problem/objective) and the mining outcome.
Achieves a balance between the cost incurred in its creation and the benefits reaped from its output or its application into the foreseable future.
Meets or outperforms baseline (benchmark) requirements, or performs within a range of the benchmark that is acceptable.
Understandability vs simplicity. Ensure a balance between model complexity and its understandability. The decision makers will need to signoff on the deployment.
Strikes a balance between complexity overfit and accuracy.
Achieves a balance between the cost (false +ves and -ves) and the benefits that acumulates on use of the model over time.
Avoid plain accuracy measures of the model due to their simplicity (the simpicity is their down fall).