Repróbkowanie(Resampling)
Test error versus Training error
Zbiór testowy wyodrębniony z próby prostej: the Validation Set Approach
Drawbacks to the Validation Approach
- Cross Validation
###Leave-One-Out Cross-Validation (LOOCV)
LOOCV versus the Validation Method
- k-fold Cross Validation
k-fold CV
Bias-Variance Trade-Off for k-Fold Cross-Validation
Bias-Variance Trade-Off for k-Fold Cross-Validation
What about the variance?
Bias-Variance Tradeoff for k-fold CV
Cross-Validation on Classification Problems
Cross-Validation on Classification Problems
An Application to the Auto data set
An Application to the Auto data set
Data set up
Linear regression
Linear regression
LOOCV for Auto data set
k-fold for Auto data set
delta for CV
- Bootstrap
Klasyfikatory złożone
Agregacja bootstrapowa (bagging - bootstrap aggregation)
Lasy losowe
Boosting
- Adaboost - adaptive boosting

Repróbkowanie(Resampling)

Cross Validation - can be used to estimate the test error associated with a statistical learning method to evaluate its performance or to select a model’s level of flexibility
Bootstrap - most commonly measures the accuracy of a parameter estimate or of a given statistical learning method

Metoda estymacji błędu predykcji.

Niech będzie dana populacja, której cechę chcemy odkryć.

Metoda – estymator oparty na losowej próbce obserwacji, na przykład średnia z próby

Mając wartość, skąd wiemy czy to jest wartość, której szukamy, jak odległa jest ta wartość od rzeczywistej? Nie wiemy.

Badamy rozkład statystyki (sampling distribution) i na tej podstawie podejmujemy stwierdzenia

Problem: w większości przypadków rozkład próby (sampling distribution)jest nieznany, można nałożyć restrykcyjne założenia rozkładu dla populacji, np. rozkład normalny w populacji skutkuje rozkładem normalnej średniej z próby o tej samej wartości oczekiwanej ale mniejszym odchyleniu standardowym

Alternatywa: Jeżeli nie znamy rozkładu, oszacujmy go tak, jak każdy inny parametr poprzez empiryczną skumulowaną dystrybuantę.

Test error versus Training error

The test error is the average error that results from using a statistical learning method to predict the response on a new observation.
The test error can be easily calculated if a designated test set is available. (Sometimes this is not the case).
The training error is the average error that results from using a statistical learning method to predict the response on the data that was used to train the method.

Zbiór testowy wyodrębniony z próby prostej: the Validation Set Approach

Klasyczne podejśćie: Jeżeli dysponujemy wystarczająco liczną próbą, wyodrębniamy z niej ciąg walidujący (testowy).

Randomly divide the available set of observations into two parts, a training set and a validation set or hold-out set.
Fit the model on the training set.
Use the resulting fitted model to predict the responses for the observations in the validation set.
The resulting validation set error rate is typically assessed using the MSE in the case of a quantitative response. This provides an estimate of the test error rate.

library(ISLR)
head(Auto)

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

train.idx <- sample(1:nrow(Auto), round(0.7*nrow(Auto)))
train<-Auto[train.idx,]
test<-Auto[-train.idx,]

lmfit<-glm(mpg~horsepower+cylinders+ weight, data=train)
summary(lmfit)

## 
## Call:
## glm(formula = mpg ~ horsepower + cylinders + weight, data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -11.3824   -2.7576   -0.4008    2.0039   16.6402  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 45.3951898  0.9666258  46.963  < 2e-16 ***
## horsepower  -0.0365340  0.0145392  -2.513   0.0126 *  
## cylinders   -0.4116783  0.3658297  -1.125   0.2614    
## weight      -0.0054095  0.0007983  -6.777  7.7e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 17.72113)
## 
##     Null deviance: 15761.0  on 273  degrees of freedom
## Residual deviance:  4784.7  on 270  degrees of freedom
## AIC: 1571.2
## 
## Number of Fisher Scoring iterations: 2

mse=mean((test$mpg - predict(lmfit, test))^2)

#Estymacja błędu predykcji na podstawie ciągu testowego: 
mse

## [1] 18.6897

Drawbacks to the Validation Approach

The validation estimate of the test error rate can be highly variable, which depends on which observations are in the training and validations sets.
Only a subset of the observations (training set) are used to fit the model.

The validation set error may tend to over-estimate the test error rate for the model fit on the entire data set.

CV addresses both issues above!

Cross Validation

Zbiór dzielony jest na \({k}\) partycji.

K-fold : jedna z \({k}\) partycji jest używana do testowania, pozostałe \({k-1}\) partrycje służą do budowy modelu. Procedura powtarzana \({k}\) - krotnie (dla każdej partycji wykorzystywanej jako testowa)

LOOCV: szczególny przypadek K-fold, w którym liczba partycji odpowiada liczbie obserwacji.

se_total=0
for (i in 1:nrow(Auto)){
  train = Auto[-i,]
  test = Auto[i,] 
  lmfit<-glm(mpg~horsepower+cylinders+ weight, data=train)
  se=(test$mpg - predict(lmfit, test))^2
  se_total = se_total + se 
}
 mse = se_total / nrow(Auto)
 mse

##        1 
## 18.10962

To samo działanie z wykorzystaniem biblioteki boot:

library(ISLR)
library(boot)
lmfit<-glm(mpg~horsepower+cylinders+ weight, data=Auto)

#LOOCV: 
k=nrow(Auto)
m<-cv.glm(data=Auto, lmfit, K=k)

#Błąd:
mse = m$delta[1]
mse

## [1] 18.10962

###Leave-One-Out Cross-Validation (LOOCV)

LOOCV does not create two subsets of comparable size. Instead it does the following:

A single observation \((x_1,y_1)\) is used for the validation set.
The remaining observations \(\{(x_2, y_2), \ldots , (x_n, y_n)\}\) make up the training set.
The statistical learning method is fit on the \(n - 1\) training observations, and a prediction \(y_1\) is made for the excluded observation, using its value \(x_1\).
Since \((x_1,y_1)\) was not used to fit the model, \(\text{MSE}_1 = (y_1 - \hat{y}_1)^2\) provides an approximately unbiased estimate for the test error.

This isn’t enough though, because it’s only for one single observation.

We repeat the procedure by selecting \((x_2,y_2)\) on the validation set.
We train the model on the \(n-1\) observations that are leftover.
We compute the MSE\(_2\) as we did before, now using \((x_2,y_2)\).

We now repeat this process approach \(n\) times to produce \(n\) squared errors, \[\text{MSE}_1 \ldots, \text{MSE}_n.\]

The LOOCV estimate for the test MSE is the average of these \(n\) test error estimates:

\[\begin{align} CV_{(n)} &= \frac{1}{n}\sum_{i=1}^n MSE_i \end{align}\]

LOOCV versus the Validation Method

LOOCV has a couple of major advantages over the validation set approach.

It has far less bias.

LOOCV: Repeatedly fit the statistical learning method using training sets that contain \(n - 1\) observations, there are almost as many as are in the entire data set.
This contrasts the validation method, where the training set is about half the size of the original data set.
LOOCV approach tends not to overestimate the test error rate as much as the validation set approach does.

Performing LOOCV multiple times produces similar results. This is not typically true for the validation method.

Remark: LOOCV has the potential to be expensive to implement, since the model has to be fit \(n\) times. This can be very time consuming if \(n\) is large, and if each individual model is slow to fit.

k-fold Cross Validation

lmfit<-glm(mpg~horsepower+cylinders+ weight, data=Auto)

#LOOCV: 
m<-cv.glm(data=Auto, lmfit, K=5)

#Błąd wyznaczony w oparciu o średnią z 5: 
mse = m$delta[1]
mse

## [1] 18.25384

k-fold CV

This approach involves randomly dividing the set of observations into \(k\) groups, or folds, of approximately equal size.

The first fold is treated as a validation set, and the method is fit on the remaining \(k - 1\) folds.
The mean squared error, MSE\(_1\), is then computed on the observations in the held-out fold.
This procedure is repeated k times;

Each time, a different group of observations is treated as a validation set.
This process results in \(k\) estimates of the test error, \(\text{MSE}_1, \ldots, \text{MSE}_k.\)

The k-fold CV estimate is computed by averaging these values: \[\begin{align} CV_{(k)} &= \frac{1}{k}\sum_{i=1}^k \text{MSE}_i \end{align}\]

Bias-Variance Trade-Off for k-Fold Cross-Validation

Recall that k-fold CV with \(k < n\) has a computational advantage to LOOCV.
But putting computational issues aside, a less obvious but potentially more important advantage of k-fold CV is that it often gives more accurate estimates of the test error rate than does LOOCV.
This has to do with a bias-variance trade-off.

Bias-Variance Trade-Off for k-Fold Cross-Validation

Recall that the validation set approach can lead to overestimates of the test error rate, since in this approach the training set used to fit the statistical learning method contains only half the observations of the entire data set.
Using this logic, it is not hard to see that LOOCV will give approximately unbiased estimates of the test error, since each training set contains \(n - 1\) observations, which is almost as many as the number of observations in the full data set.
Performing k-fold CV for, say, k = 5 or k = 10 will lead to an intermediate level of bias, since each training set contains \(\frac{(k - 1)n}{k}\) observations — fewer than in the LOOCV approach, but substantially more than in the validation set approach.

Therefore, from the perspective of bias reduction, LOOCV is preferred to k-fold CV.

What about the variance?

Bias is not the only source for concern in an estimating procedure; we must also consider the procedure’s variance.

It turns out that LOOCV has higher variance than does k-fold CV with \(k < n.\)

When we perform LOOCV, we are in effect averaging the outputs of \(n\) fitted models, each of which is trained on an almost identical set of observations.
These outputs are highly (positively) correlated with each other.
In contrast, when we perform k-fold CV with \(k < n,\) we are averaging the outputs of k fitted models that are somewhat less correlated with each other, since the overlap between the training sets in each model is smaller.
Since the mean of many highly correlated quantities has higher variance than does the mean of many quantities that are not as highly correlated, the test error estimate resulting from LOOCV tends to have higher variance than does the test error estimate resulting from k-fold CV.

Bias-Variance Tradeoff for k-fold CV

There is a bias-variance trade-off associated with the choice of \(k\) in k-fold cross-validation.

Typically, given these considerations, one performs k-fold cross-validation using \(k = 5\) or \(k = 10,\) as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

Cross-Validation on Classification Problems

We have illustrated the use of CV in the regression setting where the outcome Y is quantitative, and so have used MSE to quantify test error.
CV can also be a very useful approach in the classification setting when \(Y\) is qualitative.
CV works the same as above except that rather than using MSE to quantify test error, we instead use it to quantify the number of misclassified observations.

Cross-Validation on Classification Problems

For instance, in the classification setting, the LOOCV error rate takes the form

\[\begin{align} CV_{(n)} = \frac{1}{n}\sum_{i=1}^n \text{Err}_i, \end{align}\]

where \(\text{Err}_i = I(y_i \neq \hat{y}_i).\)

The k-fold CV error rate and validation set error rates are defined analogously.

See Section 5.1.5 for more details and further reading.

An Application to the Auto data set

We explore the use of the validation set approach in order to estimate the test error rates that result from fitting various linear models on the Auto data set.

We first do some data set up and pre-processing before we begin.

An Application to the Auto data set

We set a seed to make our methods reproducible
Use the sample() function to split the set of observations into two equal parts.
We randomly select a random subset of 196 observations out of the original data set of 392 observations. We refer to this new sample as the training set

Data set up

library(ISLR)
set.seed (1)
# randomly select 196 units 
# from the original data set. 
train <- sample(392,196)

Linear regression

We now run a linear regression using the training data.

lm.fit <- lm(mpg~horsepower , data = Auto, 
             subset=train)

Linear regression

We now use predict() to estimate the response for all 392 observations, and we use the mean() function to calculate the MSE of the 196 observations in the validation set.

Note that the -train index below selects only the observations that are not in the training set.

attach(Auto)
mean((mpg-predict(lm.fit,Auto))[-train]^2)

## [1] 26.14142

Therefore, the estimated test MSE for the linear regression fit is 26.14.

LOOCV for Auto data set

The LOOCV estimate can be automatically computed for any generalized linear model using the glm() and cv.glm() functions.

Recall if we do not give the glm() a family, then it will perform linear regression.

The cv.glm() function is part of the boot library.

# run a linear model using glm package
library(boot)
glm.fit=glm(mpg~horsepower ,data=Auto)
cv.err=cv.glm(Auto,glm.fit)
cv.err$delta

## [1] 24.23151 24.23114

The cv.glm() function produces a list with several components.

The two numbers in the delta vector contain the cross-validation results. (These coorrespond to the LOOCV statistic, equation 5.1).

Here, they are both the same. (They can differ).

Our cross-validation estimate for the test error is approximately 24.23.

k-fold for Auto data set

# run a linear model using glm package
library(boot)
glm.fit=glm(mpg~horsepower ,data=Auto)
cv.err=cv.glm(Auto,glm.fit, K=10)
cv.err$delta

## [1] 24.15099 24.13997

delta for CV

We saw that the two numbers associated with delta are essentially the same when LOOCV is performed.
When we instead perform k-fold CV, then the two numbers associated with delta differ slightly.
The first is the standard k-fold CV estimate, the second is a bias-corrected version.
On this data set, the two estimates are very similar to each other.

Bootstrap

pull oneself up by one’s bootstrap - wydobyć się z opresji używając własnych sił

Metoda szacowania rozkładów estymatorów, w sytuacji gdy nie nakładamy żadnych założeń na rozkład w zmiennych w populacji

Nie znamy rozkładu F
Dysponujemy jedynie pojedynczą realizacją prostej próby losowej z tego rozkładu

Estymacja rozkładu statystyki metodą bootstrap

Ogólny zarys procedury:

Sampling distribution statystyki S:

\(S=S(x_1, x_2, \cdots, x_n)\)

w oparciu o prostą próbę losową:

\(\mathbf{X}=(X_1, X_2, \cdots, X_n)\)

Utwórz próbki poprzez zwracanie z losowaniem: \(\mathbf{x_1^{*}}, \cdots, \mathbf {x_M^{*}}\) : próba boostrap
Wyznacz interesującą statystykę \(S(\mathbf{x_1^{*}}), \cdots,S(\mathbf{x_M^{*}})\) dla każdej próbki. Rozkład statystyki otrzymanej na próbkach to bootstrap distribution.
Bootstrap distribution daje nam informację o rozkładzie cechy S

Bootstrap - szacowanie rozkładów estymatorów

library(ISLR)
head(Auto)

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

lmfit<-glm(mpg~horsepower+cylinders+ weight, data=Auto)
summary(lmfit)

## 
## Call:
## glm(formula = mpg ~ horsepower + cylinders + weight, data = Auto)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -11.5260   -2.7955   -0.3559    2.2567   16.3209  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 45.7368172  0.7959566  57.461  < 2e-16 ***
## horsepower  -0.0427277  0.0116196  -3.677 0.000269 ***
## cylinders   -0.3889745  0.2988302  -1.302 0.193806    
## weight      -0.0052723  0.0006424  -8.208 3.37e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 17.947)
## 
##     Null deviance: 23819.0  on 391  degrees of freedom
## Residual deviance:  6963.4  on 388  degrees of freedom
## AIC: 2250.3
## 
## Number of Fisher Scoring iterations: 2

coef<-matrix(nrow=30, ncol=4)
nrow(Auto)

## [1] 392

for (i in 1:30){
  index <- sample(1:nrow(Auto), nrow(Auto), replace=TRUE) 
  train <- Auto[index,]
  test  <- Auto[-index,]
  #uczenie sieci , linear,output =FALSE, gdyż zmienne binarne
  lmfit<-glm(mpg~horsepower+cylinders+ weight, data=train)
  coef[i,]<-lmfit$coefficients
}

coefficients<-colMeans(coef)
coef_std <- apply(coef, MARGIN =2, FUN=sd )

coefficients

## [1] 45.806301283 -0.040590264 -0.422373404 -0.005298561

coef_std

## [1] 0.7417564288 0.0128691145 0.3726833936 0.0005549664

hist(coef[,1])

head(Auto)

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

train<-Auto
lmfit<-glm(mpg~horsepower+cylinders+ weight, data=train)
summary(lmfit)

## 
## Call:
## glm(formula = mpg ~ horsepower + cylinders + weight, data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -11.5260   -2.7955   -0.3559    2.2567   16.3209  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 45.7368172  0.7959566  57.461  < 2e-16 ***
## horsepower  -0.0427277  0.0116196  -3.677 0.000269 ***
## cylinders   -0.3889745  0.2988302  -1.302 0.193806    
## weight      -0.0052723  0.0006424  -8.208 3.37e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 17.947)
## 
##     Null deviance: 23819.0  on 391  degrees of freedom
## Residual deviance:  6963.4  on 388  degrees of freedom
## AIC: 2250.3
## 
## Number of Fisher Scoring iterations: 2

Klasyfikacja przy pomocy sieci neuronowych

library(neuralnet)
library(nnet)
data(iris)


#one-hot encoder - kodowanie binarne zmiennej 
iris$setosa<-  ifelse(iris$Species == "setosa",1,0)
iris$versicolor<-ifelse(iris$Species == "versicolor",1,0)
iris$virginica<-ifelse(iris$Species == "virginica",1,0)

#zamiast tego kodu można użyć z biblioteki nnet: 
species_enc<-class.ind(iris$Species)


#normalizujemy atrybuty wejsciowe i dolaczamy trzy binarne kolumny wyjsciowe 
iris<-cbind(scale(iris[,1:4]), iris[,6:8])


#przyklad budowy sieci neuronowej  z dwiema ukrytymi warstwami neuronow 
f<-as.formula("setosa+versicolor+virginica~Sepal.Width + Sepal.Length + Petal.Width + Petal.Length")
nn<-neuralnet(f, hidden=c(1), data=iris, linear.output=FALSE)
plot(nn)
pred<-compute(nn,iris[,1:4])$net.result
pred_idx<-max.col(pred)
actuals<-iris[,5:7] 
actuals_idx<-max.col(actuals)
#dokladnosc modelu procentowa: dla zgodny rekordow mjets TRUE(1), dla błędaów FALSE(0)
accuracy<- mean(actuals_idx==pred_idx)
accuracy

## [1] 0.6666666667

Bootstrap 0.632 w klasyfikacji

\(n\) - liczba obserwacji w próbie danych, losujemy ze zwracaniem \(n\) obserwacji.

Prawdopodobieństwo nie wylosowania obserwacji wynosi: \((1- \frac{1}{n})^n \approx e^{-1} \approx 0.368\),

więc oczekiwana liczba unikalnych obserwacji w wylosowanym zbiorze to \(0.632n\). Model będzie budowany z wykorzystaniem jedynie 63,2% obserwacji, a więc estymata błędu na ciągu testowym będzie zawyżona (upward bias). Z drugiej strony, estymata błędu wyznaczona na danych treningowych jest zaniżona.

Szacowanie dokładności jest wykonywane mając \(b\) próbek bootstrapowych, niech \(acc_{i}\) będzie estymatą dokładności klasyfikacji dla \({i}\) - tej próbki bootstrapowej.

Oszacowanie błędu może być kombinacją tych dwóch:

\(acc_{boot} = \frac{1}{b} \sum_{i=1}^{b}(0.632*acc^{test}_{i}+0.368*acc^{train}_i)\)

#przyklad #0.532 bootstrap do szacowania błędu klasyfikacji 
acc_table = NULL 
acc_table_test= NULL 
acc_table_train = NULL
acc_table_all_rows = NULL 

for (i in 1:10){
  index <- sample(1:nrow(iris), nrow(iris), replace=TRUE) 
  train <- iris[index,]
  test  <- iris[-index,]
  #uczenie sieci , linear,output =FALSE, gdyż zmienne binarne
  nn<-neuralnet(f, hidden=c(1),  data=iris, linear.output = FALSE)
  
  
  
  pred_<-compute(nn,iris[,1:4])$net.result
  pred_label<-max.col(pred)

  actuals_label<-max.col(iris[,5:7])
  accuracy_all_rows<- mean(pred_label==actuals_label)
  
  
   #wyznaczenie prognoz na zbiorze testowym 

   accuracy_test<- mean(pred_label[-index]==actuals_label[-index])
  
   #wyznaczenie prognoz na zbiorze treningowym 
   accuracy_train<- mean(pred_label[index]==actuals_label[index])
  
   
   
   acc_table_all_rows[i] <- accuracy_all_rows
   acc_table_test[i] <- accuracy_test
   acc_table_train[i] <- accuracy_train
   
   acc_table[i] <- mean(0.632*accuracy_test+0.368*accuracy_train)

  
}
acc_table_all_rows

##  [1] 0.6666666667 0.6666666667 0.6666666667 0.6666666667 0.6666666667
##  [6] 0.6666666667 0.6666666667 0.6666666667 0.6666666667 0.6666666667

acc_table

##  [1] 0.6604325926 0.6318249057 0.6715733333 0.6789333333 0.6702988506
##  [6] 0.6513169231 0.6738521212 0.6998194872 0.6187428571 0.6389155556

acc_table_test

##  [1] 0.6296296296 0.6037735849 0.6666666667 0.6666666667 0.6724137931
##  [6] 0.6346153846 0.7090909091 0.7307692308 0.5714285714 0.6111111111

acc_table_train

##  [1] 0.7133333333 0.6800000000 0.6800000000 0.7000000000 0.6666666667
##  [6] 0.6800000000 0.6133333333 0.6466666667 0.7000000000 0.6866666667

#kalkulacja bledu predykcji 


acc_boot<- mean(acc_table)
acc_boot

## [1] 0.659570996

Klasyfikatory złożone

Załóżmy, że mamy do czynienia z problemem klasyfikacji z dwiema klasami: g=2.

Dysponujemy nie jednym, lecz C klasyfikatorami, które są od siebie statystycznie niezależne, czyli że klasyfikacje prze znei dokjonywane są od siebie niezależne.

Prawdopodobieństwo dokonania błędnej klasyfikacji jest taki samo i wynosi niewiele mniej niż 0.5 (np. 0.45)

Oczekiwana liczba poprawnych decyzji wśród wszystkich C klasyfikatorów wynosi \(0.55 \cdot C\) ( proste doświadczenie dwumianowe), wariancja liczby poprawnych decyzji wynosi \(0.55 \cdot 0.45 \cdot C = 0.2475 \cdot C\), odchylenie standardowe \(0.4975 \sqrt{C}\)

Jeżeli liczba klasyfiaktorów jest wystarczająco duża ( np. 1000 ) to możemy być “prawie” pewni, że większość klasyfikatorów dokona poprawnej klasyfikacji.

W rzeczywistości klasyfikatory, powstałe na podstawie tej samej próby uczącej są z reguły lepsze od klasyfikatora, który myli się się z prawdopodobieństwem 0.45, ale są oczeywiście od siebie statystycznie zależne.

Tym niemniej, opisana intuicja stanowi podstawę do skonstruowania rodzin klasyfikatorów o dobrych właściwościach.

Agregacja bootstrapowa (bagging - bootstrap aggregation)

Bootstrap - w statystyce metoda szacowania rozkładu błędów estymacji za pomocą wielokrotnego losowania ze zwracaniem z próby.

Dane pobierane są z oryginalnego zbioru danych wielokrotnie, tak że tworzą S zbiorów uczących: Każdy ze zbiorów posiada ten sam rozmiar co zbiór wejściowy Obserwacje do zbiorów są losowane ze zwracaniem ze zbioru oryginalnego

Budujemy S klasyfikatorów Wyznaczenie oceny przez zastosowanie S klasyfikatorów do tego samego rekordu i wybranie klasy przez głosowanie (klasa wskazana przez największą liczbę klasyfikatorów)

Pakiet: adabag, ada

library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

## 
## Attaching package: 'ROCR'

## The following object is masked from 'package:neuralnet':
## 
##     prediction

url="http://freakonometrics.free.fr/german_credit.csv"
GermanCredit=read.csv(url, header = TRUE, sep = ",")

names(GermanCredit)

##  [1] "Creditability"                    
##  [2] "Account.Balance"                  
##  [3] "Duration.of.Credit..month."       
##  [4] "Payment.Status.of.Previous.Credit"
##  [5] "Purpose"                          
##  [6] "Credit.Amount"                    
##  [7] "Value.Savings.Stocks"             
##  [8] "Length.of.current.employment"     
##  [9] "Instalment.per.cent"              
## [10] "Sex...Marital.Status"             
## [11] "Guarantors"                       
## [12] "Duration.in.Current.address"      
## [13] "Most.valuable.available.asset"    
## [14] "Age..years."                      
## [15] "Concurrent.Credits"               
## [16] "Type.of.apartment"                
## [17] "No.of.Credits.at.this.Bank"       
## [18] "Occupation"                       
## [19] "No.of.dependents"                 
## [20] "Telephone"                        
## [21] "Foreign.Worker"

F<-c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20)
for(i in F) GermanCredit[,i]=as.factor(GermanCredit[,i])
credit<-GermanCredit



i_test=sample(1:nrow(GermanCredit),size=333)
i_train=(1:nrow(GermanCredit))[-i_test]

LogisticModel <- glm(Creditability ~ Account.Balance + Payment.Status.of.Previous.Credit + Purpose + 
Length.of.current.employment + 
Sex...Marital.Status, family=binomial, 
data = credit[i_train,])


fitLog <- predict(LogisticModel,type="response",  newdata=credit[i_test,])

pred = prediction( fitLog, credit$Creditability[i_test])
perf <- performance(pred, "tpr", "fpr")
plot(perf)

AUCLog1=performance(pred, measure = "auc")@y.values[[1]]
cat("AUC: ",AUCLog1)

## AUC:  0.7621279163

fitClass <- ifelse(predict(LogisticModel,type="response",  newdata=credit[i_test,])<0.5,0,1)
accuracy <- mean(fitClass == credit[i_test,]$Creditability)


#rozkład w oryginalnym zbiorze danych 
prop.table(table(credit[i_test,]$Creditability))

## 
##            0            1 
## 0.3393393393 0.6606606607

#Algorytm bagging 
library(foreach)
sample_proportion<-1
iterations<-10

set.seed(113)
predictions<-foreach(m=1:iterations) %do% {
  
i_train_bagg =sample(i_train, size=sample_proportion * length(i_train), replace=TRUE)
i_test_bagg  =i_train[!(i_train %in% i_train_bagg)]

LogisticModel <- glm(Creditability ~ Account.Balance + Payment.Status.of.Previous.Credit + Purpose + 
Length.of.current.employment + 
Sex...Marital.Status, family=binomial, 
data = credit[i_train_bagg,])


fitClass <- ifelse(predict(LogisticModel,type="response",  newdata=credit[i_test_bagg,])<0.5,0,1)
accuracy <- mean(fitClass == credit[i_test_bagg,]$Creditability)
result<-list(accuracy, LogisticModel)
result
} 




acc = lapply(predictions, function(l) l[[1]])
acc_vector<-unlist(acc)
mean(acc_vector)

## [1] 0.7269990291

## a teraz predykcja przy pomocy złożonego klasyfikatora: 
models = lapply(predictions, function(l) l[[2]])
pred2<-lapply (models, function(x) predict(x,type="response",  newdata=credit[i_test,] ))
df<-data.frame(pred2)

colnames(df)<-paste('classifier_', 1:iterations)
pred_matrix<-as.matrix(df)

class_matrix<-ifelse(pred_matrix< 0.5, 0, 1)
final_pred<-ifelse(rowMeans(class_matrix) < 0.5, 0, 1)

accuracy <- mean(final_pred == credit[i_test,]$Creditability)
accuracy

## [1] 0.7117117117

pred = prediction( rowMeans(class_matrix), credit$Creditability[i_test])
perf <- performance(pred, "tpr", "fpr")
plot(perf)

AUCLog1=performance(pred, measure = "auc")@y.values[[1]]
cat("AUC: ",AUCLog1)

## AUC:  0.7090707965

library(ipred)  #bagging for classification trees 
library(rpart)

##  Classification: german credit
url="http://freakonometrics.free.fr/german_credit.csv"
GermanCredit=read.csv(url, header = TRUE, sep = ",")

names(GermanCredit)

##  [1] "Creditability"                    
##  [2] "Account.Balance"                  
##  [3] "Duration.of.Credit..month."       
##  [4] "Payment.Status.of.Previous.Credit"
##  [5] "Purpose"                          
##  [6] "Credit.Amount"                    
##  [7] "Value.Savings.Stocks"             
##  [8] "Length.of.current.employment"     
##  [9] "Instalment.per.cent"              
## [10] "Sex...Marital.Status"             
## [11] "Guarantors"                       
## [12] "Duration.in.Current.address"      
## [13] "Most.valuable.available.asset"    
## [14] "Age..years."                      
## [15] "Concurrent.Credits"               
## [16] "Type.of.apartment"                
## [17] "No.of.Credits.at.this.Bank"       
## [18] "Occupation"                       
## [19] "No.of.dependents"                 
## [20] "Telephone"                        
## [21] "Foreign.Worker"

class(GermanCredit$Creditability)

## [1] "integer"

# Convert target variable to factor
GermanCredit$Creditability <- factor(GermanCredit$Creditability)





German.bagging <- bagging(Creditability ~.,
                           data=GermanCredit,
                           mfinal=15, 
                           control=rpart.control(maxdepth=5, minsplit=15))


# score the dataset
GermanCredit$pred.class <- predict(German.bagging,
                                   GermanCredit)



# Load Library or packages
library(e1071)
library(caret)

## Loading required package: lattice

## 
## Attaching package: 'lattice'

## The following object is masked from 'package:boot':
## 
##     melanoma

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'Auto':
## 
##     mpg

# Create Confusion Matrix
confusionMatrix(data=factor(GermanCredit$pred.class),
                reference=factor(GermanCredit$Creditability),
                positive='1')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 151  54
##          1 149 646
##                                                 
##                Accuracy : 0.797                 
##                  95% CI : (0.7707117, 0.8215264)
##     No Information Rate : 0.7                   
##     P-Value [Acc > NIR] : 0.000000000002492293  
##                                                 
##                   Kappa : 0.4685864             
##  Mcnemar's Test P-Value : 0.000000000041813003  
##                                                 
##             Sensitivity : 0.9228571             
##             Specificity : 0.5033333             
##          Pos Pred Value : 0.8125786             
##          Neg Pred Value : 0.7365854             
##              Prevalence : 0.7000000             
##          Detection Rate : 0.6460000             
##    Detection Prevalence : 0.7950000             
##       Balanced Accuracy : 0.7130952             
##                                                 
##        'Positive' Class : 1                     
##

Lasy losowe

Breiman

Wylosuj ze zwracaniem z oryginalnej n-elementowej próby uczącej n wektorów obserwacji do pseudopróby uczącej, na podstawie której zostanie zbudowane drzewo
W każdym węźle budowanego drzewa podział podpróby, która dotarła do tego węzła odbywa się następująco: niezależnie od innych losowań wylosuj m spośórd p atrybutów wektora obserwacji (losowanie bez zwracania m elementów spośród p elementów); następnie zastosuj przyjętą regułę podziału do wylosowanych m-atrybutow (podział jest oparty na m wylosowanych atrybutach, a nie na wszystkich p atrybutach i wśród tych m atrybutóW odbywa się poszukiwanie najlepszego podziału na na najlepszym atrybucie m<<p).
Drzewo jest budowane bez przycinania, jeżeli to możliwe aż do otrzymania liście o elementach pseudopróby uczącej tylko z jednej klasy.

Klasyfikacja odbywa się tak, jak w przypadku algorytmu bagging - dany wektor obserwacji zostaje poddany klasyfikacji przez wszystkie drzewa lasu i ostatecznie zaklasyfikowany do klasy, która osiągnęła większość głosów.

Dotychczasowe doświadczenie wskazuje, że wartością dającą dobre wyniki jest \(m=\sqrt{p}\)

Boosting

Adaboost - adaptive boosting

Klasyfikatory są uczone sekwencyjnie –kolejny klasyfikator jest uczony w oparciu o błędnie zaklasyfikowane dane Wynik jest sumą ważoną wszystkich klasyfikatorów, przy czym wagi odpowiadają jakości klasyfikatora

Bootstratp

Marcin Mazurek

2017-11-30

Repróbkowanie(Resampling)

Test error versus Training error

Zbiór testowy wyodrębniony z próby prostej: the Validation Set Approach

Drawbacks to the Validation Approach

Cross Validation

###Leave-One-Out Cross-Validation (LOOCV)

LOOCV versus the Validation Method

k-fold Cross Validation

k-fold CV

Bias-Variance Trade-Off for k-Fold Cross-Validation

Bias-Variance Trade-Off for k-Fold Cross-Validation

What about the variance?

Bias-Variance Tradeoff for k-fold CV

Cross-Validation on Classification Problems

Cross-Validation on Classification Problems

An Application to the Auto data set

An Application to the Auto data set

Data set up

Linear regression

Linear regression

LOOCV for Auto data set

k-fold for Auto data set

delta for CV

Bootstrap

Estymacja rozkładu statystyki metodą bootstrap

Bootstrap - szacowanie rozkładów estymatorów

Klasyfikacja przy pomocy sieci neuronowych

Bootstrap 0.632 w klasyfikacji

Klasyfikatory złożone

Agregacja bootstrapowa (bagging - bootstrap aggregation)

Lasy losowe

Boosting

Adaboost - adaptive boosting