Statistical methods in scientific research

"Krung Sinapiromsaran"
"1/24/2015"

Outline

♦ Role of Statistics in Science

♦ My research

♦ Statistical applications in research publications

♦ Conclusion

Scientific research

Most scientific research uses the inductive-deductive approach.

  • The scientific method entails the formulation of hypotheses from observed facts, followed by deduction and verification, repeated in a cyclical process.

  • Facts are observations taken to be true, while a hypothesis is a tentative conjecture regarding the phenomenon under consideration.

  • Deductions are made from the hypotheses through logical argument and are in turn verified through objective methods.

Scientific research (cont.)

  • The process of verification may lead to further hypotheses, deductions and verification in a long chain in the course of which scientific theories, principles and laws emerge.

Two main features of the scientific method are repeatability and objectivity.

  • However, many physical processes and biological phenomena are characterised by variation and uncertainty.

  • Experiments repeated under similar conditions need not yield identical results, being subject to fluctuations of a random nature.

  • Observations on the complete set of individuals in a population are often impossible, so only a sample is considered.

Role of Statistics in Scientific research

The science of statistics is helpful in

  • objectively selecting a sample,
  • making valid generalisations from the sample of observations, and
  • quantifying the degree of uncertainty in the conclusions drawn.
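As a small illustration of all three roles, the sketch below (Python with NumPy and SciPy; the population and sample sizes are made-up numbers) draws a simple random sample from a hypothetical population and quantifies the uncertainty of the resulting estimate with a 95% confidence interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical population of 100,000 measurements (made-up numbers).
population = rng.normal(loc=50.0, scale=10.0, size=100_000)

# Objectively select a simple random sample.
sample = rng.choice(population, size=100, replace=False)

# Generalise from the sample and quantify the uncertainty of the estimate.
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"sample mean = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```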

Statistics

Two major practical aspects of scientific investigations are collection of data and interpretation of the collected data.

  • The data may be generated through a sample survey on a naturally existing population or a designed experiment on a hypothetical population.
  • The collected data are condensed and useful information extracted through techniques of statistical inference.

My Academic Research

Research summary

♦ I currently work in the areas of optimization, forecasting and knowledge discovery.

♦ Ongoing research includes (1) improving current linear programming algorithms, (2) constructing new forecasting models, (3) identifying outliers via different “outlier degrees”, (4) deriving different techniques for class-imbalanced problems, and (5) creating new classifiers in KDD.

♦ I incorporate mathematics, mathematical programming concepts, mathematical logic, machine intelligence and statistical techniques to solve problems in broad areas.

See a list of my current publications

Fields of Interests

Optimization research

  • Formulating and solving nonlinear programs
  • Pivot rules for Simplex method
  • Simplex method without artificial variables

Algorithm research

  • Successive difference
  • Sort

Knowledge Discovery research

  • Association analysis
  • Classifier using the extreme pole
  • Cluster using the extreme pole
  • Class imbalance

Class Imbalance

Motivation

Most machine learning algorithms optimize the overall classification accuracy.

  • They work well with balanced data sets.
  • On imbalanced data, the decision boundary is biased toward the majority class.
  • They tend to treat samples from the minority class as noise.

For example, on a dataset whose minority class makes up 10% of the instances, a classifier that always predicts the majority class already attains 90% accuracy while detecting no minority case at all.

Existing approaches

Recognition-based approach

  • One-class learning approach: build a model on the examples of the target class only, ignoring the other classes

Cost-sensitive learning approach

  • Emphasis on misclassification cost

Sampling approach (see the sketch below)

  • Under-sampling
  • Over-sampling
  • Advanced sampling

Ensemble-learning approach
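As a concrete sketch of the two basic sampling approaches above, the hypothetical function below (Python with NumPy; the name `random_resample` and its parameters are illustrative, not from any cited paper) randomly duplicates minority instances or randomly discards majority instances until the two classes are balanced.

```python
import numpy as np

def random_resample(X, y, minority_label, oversample=True, seed=0):
    """Balance a two-class dataset by randomly duplicating minority
    instances (over-sampling) or randomly discarding majority instances
    (under-sampling)."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    if oversample:
        # Draw minority indices with replacement until both classes match.
        extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx))
        keep = np.concatenate([min_idx, maj_idx, extra])
    else:
        # Keep only as many majority instances as there are minority ones.
        kept_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
        keep = np.concatenate([min_idx, kept_maj])
    return X[keep], y[keep]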

Undersampling

Oversampling

Advanced sampling

Up Weighting

Down Weighting

SMOTE Family

  • SMOTE: Synthetic Minority Over-sampling Technique (2002); its core interpolation step is sketched after this list

  • Borderline-SMOTE (2005)

  • Safe-Level-SMOTE (2009) Chumphol

  • SMOUTE (2010) Panote

  • Minority-based Decision tree (July 2011) Kesinee

  • MUTE (December 2011) Chumphol

  • DBSMOTE (2012) Chumphol

  • Safe Level Graph for Majority Under-sampling Techniques (2014)
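The interpolation step shared by this family is easy to state: each synthetic instance is placed at a random point on the line segment between a minority instance and one of its k nearest minority neighbours (SMOTE, 2002). Below is a minimal sketch assuming NumPy arrays and scikit-learn's NearestNeighbors; the function name and parameters are illustrative, not from the papers above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, k=5, n_synthetic=100, seed=0):
    """Generate synthetic minority instances by interpolating between a
    randomly selected minority instance and one of its k nearest
    minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is returned as its own neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))     # a random minority instance
        j = rng.choice(idx[i][1:])       # one of its k minority neighbours
        gap = rng.random()               # uniform in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```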

SMOTE

Borderline SMOTE

Safe-Level-SMOTE

  • Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-sampling Technique for handling the class imbalanced problem

    The 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2009), 27-30 April 2009, Bangkok, Thailand

Noise, Borderline, Safe

Region       Definition
Noise        \( n = k \)
Borderline   \( \frac{k}{2} \leq n < k \)
Safe         \( 0 \leq n < \frac{k}{2} \)

\( n \) is the number of negative instances among the \( k \) nearest neighbors of the positive instance under consideration.
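Under these definitions, labelling a positive instance reduces to counting the negative instances among its k nearest neighbours. A minimal sketch, assuming scikit-learn's NearestNeighbors and that the query instance itself is not contained in X; the function name and parameters are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def region(X, y, x_query, positive_label, k=5):
    """Classify a positive instance as 'noise', 'borderline' or 'safe'
    from n, the number of negative instances among its k nearest
    neighbours (x_query is assumed not to be contained in X)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    _, idx = nn.kneighbors([x_query])
    n = int(np.sum(y[idx[0]] != positive_label))
    if n == k:
        return "noise"
    if n >= k / 2:
        return "borderline"          # k/2 <= n < k
    return "safe"                    # 0 <= n < k/2
```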

Noise, Borderline, Safe

SMOTE vs. Borderline-SMOTE

Safe-Level SMOTE

Safe Level

Safe Level Ratio

Safe Level SMOTE

UCI Data set

Name       Instances   Attributes   Positive   Negative   %Minority
Satimage   6,435       37           626        5,809      9.73
Haberman   306         4            81         225        26.47

Decision tree: Haberman

Naive Bayes: Satimage

SVM: Haberman

Safe-Level SMOTE comment 1 (2009)

  • Nice paper, although there are several issues that prevent a better rating:

    • Discussion of over-sampling vs. cost-sensitive methods: the former do not require changes to the classifier software and can be used as a preprocessing technique. Yet the proposed method requires setting k and the amount of synthetic instances to be generated. You should discuss this in the conclusions.
    • And what about multi-class tasks? How can you address these?
    • Technical aspect: only tested with one value k = 5 and two UCI datasets. How does the algorithm behave when you change the k value? Also, UCI contains hundreds of datasets, most of them classification problems and several of them imbalanced. Hence, why Satimage and Haberman? Why not others?
    • Statistical confidence: I understand that there was still a high number of experiments, but 10-fold results can change since the 10-fold splitting is made randomly. Thus, ideally, several 10-fold runs should be applied and the results presented in terms of the mean/median plus a confidence interval…
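What the reviewer asks for in the last point can be sketched in a few lines: repeat the 10-fold cross-validation with different random splits and report the mean score with a confidence interval. The classifier, scoring metric and run count below are illustrative choices, not the paper's actual setup.

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def repeated_cv_auc(X, y, n_runs=10, seed=0):
    """Repeat 10-fold cross-validation with different random splits and
    report the mean AUC together with a 95% confidence interval."""
    run_means = []
    for run in range(n_runs):
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed + run)
        scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X, y, cv=cv, scoring="roc_auc")
        run_means.append(scores.mean())
    run_means = np.asarray(run_means)
    ci = stats.t.interval(0.95, df=len(run_means) - 1,
                          loc=run_means.mean(), scale=stats.sem(run_means))
    return run_means.mean(), ci
```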

Safe-Level SMOTE comment 2 (2009)

  • The main complaint is about the poor experimental results. Since the paper does not provide any theoretical framework and the results are primarily based on empirical evidence, it is important that the authors demonstrate the superiority of the algorithm using at least 5-6 standard datasets.

  • There are many datasets that contain less than 10% minority data. I encourage the authors to look carefully into the papers on SMOTE and others that they cite, to get the appropriate datasets. Hence, I recommend a reject for this paper.

Safe-Level SMOTE comment 3 (2009)

  • Overall the paper is well written. Regarding the over-sampling, I am wondering whether the result would be the same if you kept only the original data for testing. When you use 10-fold cross-validation on the new dataset for evaluation, the over-sampled positive instances are included in the performance testing. Because of the increase in positive instances, the performance will certainly change. Have you tried testing the performance using only the original data? What is the result?

  • The English needs to be improved. There are many typos such as “positive stances” (it should read “positive instances”). The figures are a bit small to read.

Safe-Level SMOTE comment 4 (2009)

  • Figs. 3-5 are not very convincing. There does not seem to be much improvement.

  • Often in practice I find a random forest is good at under-represented positives. Is it better to use SMOTE to pre-process, or just rely on RF?

  • Would confidence levels around the performance wipe out any differences in performance?

  • It would be better if ORG did not show a line in the figure legends, as there is no line; it was confusing to start with.

  • Requires a good proofread to fix the grammar.

  • The results are not very convincing.

Statistical concerns

  • For a particular dataset, how well does this new method perform compared with existing classifiers?

  • For a particular classifier, how effectively does this new method perform on various datasets?

  • For a future unknown dataset, would you recommend this methodology, and why?

Solution 1: Student's t-test

  • A t-test can be used to determine whether two sets of data are significantly different from each other.

  • It can be used to answer the question “For a particular dataset, how well does this new method perform compared with existing classifiers?”

  • By repeatedly sampling the same dataset many times, it is easy to compute the paired t-statistic and its p-value.
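A minimal sketch with SciPy; the AUC values below are made-up numbers standing in for ten paired runs of the new method and a baseline on the same dataset.

```python
from scipy import stats

# Made-up AUC scores of the new method and an existing classifier on the
# same dataset over ten repeated sampling runs, paired by run.
new_method = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.81, 0.80, 0.82]
baseline   = [0.76, 0.77, 0.79, 0.75, 0.78, 0.74, 0.80, 0.77, 0.76, 0.78]

t_stat, p_value = stats.ttest_rel(new_method, baseline)  # paired t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```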

Solution 2: Wilcoxon signed-rank test

  • The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used when comparing two related samples, matched samples, or repeated measurements on a single sample to assess whether their population mean ranks differ. (Wikipedia)

  • It can be used to answer the question “For a particular classifier, how effectively does this new method perform on various datasets?”

  • When experiments are performed across various datasets, the normality assumption cannot be relied on.
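A minimal sketch with SciPy; the AUC values are made-up numbers standing in for one classifier evaluated with and without the new method on eight datasets, paired by dataset.

```python
from scipy import stats

# Made-up AUC scores of one classifier evaluated with and without the new
# sampling method on eight datasets, paired by dataset.
with_method    = [0.81, 0.74, 0.90, 0.68, 0.77, 0.85, 0.72, 0.88]
without_method = [0.76, 0.73, 0.86, 0.66, 0.75, 0.84, 0.69, 0.85]

w_stat, p_value = stats.wilcoxon(with_method, without_method)
print(f"W = {w_stat:.1f}, p = {p_value:.4f}")
```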

Solution 3: Wilcoxon signed-rank test

  • When both datasets and classifiers vary, hypothesis testing cannot rely on the normality assumption, so a non-parametric test must be used.

  • It can be used to answer “For a future unknown dataset, would you recommend this methodology, and why?”