#setwd("/home/harpo/Dropbox/ongoing-work/git-repos/labeling-datasets/")
#lees el archivo
dataset=read.csv(file="./characteristicConnectionVector-full.csv",header=F)
names(dataset)=c("sp","wp","wnp","snp","ds","dm","dl","ss","sm","sl","class")
library(lattice)

# Scatter-plot matrix of the ten features, coloured by class,
# with a kernel density estimate drawn in the diagonal panels
splom(~dataset[, 1:10], data = dataset,
      groups = dataset$class,
      diag.panel = function(x, ...) {
        yrng <- current.panel.limits()$ylim
        d <- density(x, na.rm = TRUE)
        d$y <- with(d, yrng[1] + 0.95 * diff(yrng) * y / max(y))
        panel.lines(d)
        diag.panel.splom(x, ...)
      },
      cex = 0.5, xlab = "", ylab = "", auto.key = TRUE,
      cex.labels = 0.1, pscales = 0, alpha = 0.5, varname.cex = 0.8)
Given a connection with a single letter, the value for each of its three attributes will be 1.
Example:
R.R.R.R
would produce a vector with a 1 for SNP, a 1 for DS, a 1 for SS, and 0 for all the other heatmap features.
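A minimal sketch of this construction is shown below. Only the mapping for the letter `R` (SNP, DS, SS) is taken from the example above; the lookup table and helper name are hypothetical and the real feature extraction may differ.

```r
# Sketch of the characteristic-vector construction described above.
# ASSUMPTION: only the entry for 'R' (snp / ds / ss) comes from the text;
# other letters would need their own rows in this hypothetical lookup table.
letter_map <- list(
  R = c(periodicity = "snp", duration = "ds", size = "ss")
)

connection_vector <- function(state) {
  features <- c("sp", "wp", "wnp", "snp", "ds", "dm", "dl", "ss", "sm", "sl")
  vec <- setNames(rep(0, length(features)), features)
  letters_seen <- strsplit(gsub("\\.", "", state), "")[[1]]   # drop the '.' separators
  for (l in letters_seen) {
    buckets <- letter_map[[l]]
    if (!is.null(buckets)) vec[buckets] <- vec[buckets] + 1
  }
  vec / length(letters_seen)   # proportions, so a single repeated letter gives 1s
}

connection_vector("R.R.R.R")
# snp, ds and ss are 1; every other feature is 0
```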
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
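A quick way to check the variance-explained claim on this dataset is a plain `prcomp` call on the ten feature columns (a sketch that complements the caret pipeline below; it assumes the features are the first ten columns, as in the scatter-plot matrix above):

```r
# Sketch: base-R PCA on the ten numeric features, to inspect how much
# variance each principal component captures.
pca <- prcomp(dataset[, 1:10], center = TRUE, scale. = TRUE)
summary(pca)   # standard deviation and proportion of variance per component
```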
#pca=prcomp(dataset[,1:ncol(dataset)-1], center = TRUE, scale. = TRUE)
#plot(pca$x[,1],pca$x[,2],col=c("red","blue","green"))
library(caret)

# Box-Cox transform, centre, scale and project onto the principal components.
# predict() applies the full preprocessing chain before the rotation.
pca_caret <- preProcess(dataset[, 1:(ncol(dataset) - 1)],
                        method = c("BoxCox", "center", "scale", "pca"))
pca_x <- predict(pca_caret, dataset[, 1:(ncol(dataset) - 1)])
plot(pca_x[, 1], pca_x[, 2],
     col = c("red", "green", "black")[factor(dataset$class)],  # colour by class
     xlab = "PC1", ylab = "PC2")
In k-means clustering we have to specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. Then the algorithm iterates through two steps:
Reassign data points to the cluster whose centroid is closest, then recalculate the centroid of each cluster. These two steps are repeated until the within-cluster variation cannot be reduced any further. The within-cluster variation is calculated as the sum of the squared Euclidean distances between the data points and their respective cluster centroids.
\[ SS(k)=\sum_{i=1}^{n_k}\sum_{j=1}^{p}\left(x_{ij} - \bar{x}_{kj}\right)^{2} \]
where the outer sum runs over the \(n_k\) observations assigned to cluster \(k\) and \(\bar{x}_{kj}\) is the \(j\)-th coordinate of the centroid of cluster \(k\).
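A common way to use this quantity in practice is the elbow method: compute the total within-cluster sum of squares for several values of k and look for the point where it stops dropping sharply. A minimal sketch (the range k = 1..10 and `nstart = 10` are arbitrary choices):

```r
# Sketch: total within-cluster sum of squares for k = 1..10 (elbow method)
wss <- sapply(1:10, function(k) {
  kmeans(scale(dataset[1:10]), centers = k, nstart = 10)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```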
#library(rattle)
library(dplyr)

# Standardise the ten features and run k-means with k = 3
dataset.stand <- scale(dataset[1:10])
k.means.fit <- kmeans(dataset.stand, 3)  # k = 3

# Attach the cluster assignment and cross-tabulate it against the known class
dataset <- cbind(dataset, k.means.fit$cluster)
names(dataset)[12] <- "cluster"
group_by(dataset, cluster, class) %>% summarise(n = n())
## Source: local data frame [6 x 3]
## Groups: cluster [?]
##
## cluster class n
## <int> <fctr> <int>
## 1 1 Botnet 53
## 2 1 Normal 171
## 3 2 Botnet 240
## 4 2 Normal 22
## 5 3 Botnet 255
## 6 3 Normal 37
set.seed(1492)

# Keep only Botnet and Normal connections and split 80/20 into train and test
datasample <- filter(dataset, class == 'Botnet' | class == 'Normal')
datasample$class <- factor(datasample$class)
trainindex <- createDataPartition(datasample$class, p = 0.80, list = FALSE)
train <- datasample[trainindex, ]
test  <- datasample[-trainindex, ]
Random forests or random decision forests[1][2] are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.[3]:587–588
The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set \(X = x_1, ..., x_n\) with responses \(Y = y_1,..., y_n\), bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples:
For \(b = 1, ..., B\): sample, with replacement, \(n\) training examples from \(X\), \(Y\); call these \(X_b\), \(Y_b\). Train a decision or regression tree \(f_b\) on \(X_b\), \(Y_b\). After training, predictions for unseen samples \(x'\) can be made by averaging the predictions from all the individual regression trees on \(x'\):
\[ \hat{f} = \frac{1}{B}\sum_{b=1}^{B}\hat{f}_{b}(x') \]
or by taking the majority vote in the case of decision trees.
This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.
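The bagging step can be illustrated directly with a few bootstrapped classification trees. This is a sketch using rpart rather than the randomForest implementation used below; `B = 25`, the helper names and the commented usage lines are illustrative only.

```r
library(rpart)

# Sketch of bagging: B trees, each fit on a bootstrap sample, combined by majority vote
bagged_trees <- function(formula, data, B = 25) {
  lapply(seq_len(B), function(b) {
    boot_idx <- sample(nrow(data), replace = TRUE)   # n rows drawn with replacement
    rpart(formula, data = data[boot_idx, ], method = "class")
  })
}

predict_bagged <- function(trees, newdata) {
  # each tree votes for a class; the mode of the votes is the ensemble prediction
  votes <- sapply(trees, function(tr) as.character(predict(tr, newdata, type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}

# trees <- bagged_trees(class ~ ., train)
# table(predict_bagged(trees, test), test$class)
```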
# Validation method: 10-fold cross-validation, keeping class probabilities
# so that ROC-based summaries can be computed
ctrl_fast <- trainControl(method = "cv",
                          number = 10,
                          summaryFunction = twoClassSummary,
                          verboseIter = FALSE,
                          classProbs = TRUE,
                          allowParallel = TRUE)
# Random Forest
rfFit <- train(class ~ .,
               data = train,
               metric = "ROC",
               method = "rf",
               trControl = ctrl_fast)
rfFit
## Random Forest
##
## 623 samples
## 11 predictors
## 2 classes: 'Botnet', 'Normal'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 561, 560, 561, 561, 561, 560, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.9878927 0.9748943 0.9076023
## 6 0.9812587 0.9658034 0.9020468
## 11 0.9740533 0.9589852 0.8909357
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
rfFit$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 4.01%
## Confusion matrix:
## Botnet Normal class.error
## Botnet 428 11 0.02505695
## Normal 14 170 0.07608696
We treat as Botnet any connection with a probability greater than 0.7 of belonging to the Botnet class.
# Class probabilities on the test set; apply the 0.7 threshold to the Botnet probability
predsrfprobs <- predict(rfFit, test, type = 'prob')
predsrf <- ifelse(predsrfprobs$Botnet > 0.7, 'Botnet', 'Normal')
predsrf <- factor(predsrf, levels = levels(test$class))
print(caret::confusionMatrix(predsrf, test$class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Botnet Normal
## Botnet 100 3
## Normal 9 43
##
## Accuracy : 0.9226
## 95% CI : (0.8687, 0.9594)
## No Information Rate : 0.7032
## P-Value [Acc > NIR] : 2.046e-11
##
## Kappa : 0.8213
## Mcnemar's Test P-Value : 0.1489
##
## Sensitivity : 0.9174
## Specificity : 0.9348
## Pos Pred Value : 0.9709
## Neg Pred Value : 0.8269
## Prevalence : 0.7032
## Detection Rate : 0.6452
## Detection Prevalence : 0.6645
## Balanced Accuracy : 0.9261
##
## 'Positive' Class : Botnet
##
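The 0.7 cut-off chosen above trades sensitivity against precision; the same held-out probabilities can also be summarised threshold-free with an ROC curve, for instance via the pROC package (a sketch; pROC is an extra dependency not used elsewhere in this analysis):

```r
# Sketch: ROC curve and AUC for the Botnet probability on the test set
library(pROC)
roc_test <- roc(response = test$class,
                predictor = predsrfprobs$Botnet,
                levels = c("Normal", "Botnet"))   # controls, cases
auc(roc_test)    # area under the curve
plot(roc_test)   # sensitivity/specificity trade-off across all thresholds
```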
#pca=prcomp(datasample[,1:ncol(datasample)-1], center = TRUE, scale. = TRUE)
#plot(pca$x[,1],pca$x[,2],col=c("red","blue","green"))