setwd("/home/harpo/Dropbox/ongoing-work/git-repos/labeling-datasets/")
#lees el archivo
dataset=read.csv(file="./conecction_charactersticVector.csv",header=F)
names(dataset)=c("sp","wp","wnp","snp","ds","dm","dl","ss","sm","sl","class")



library(lattice)
splom(~dataset[, 1:10], data = dataset,
      groups = dataset$class,
      diag.panel = function(x, ...) {
        # overlay a kernel density curve on each diagonal panel
        yrng <- current.panel.limits()$ylim
        d <- density(x, na.rm = TRUE)
        d$y <- with(d, yrng[1] + 0.95 * diff(yrng) * y / max(y))
        panel.lines(d)
        diag.panel.splom(x, ...)
      },
      cex = 0.5, xlab = "", ylab = "", auto.key = TRUE,
      cex.labels = 0.1, pscales = 0, alpha = 0.5, varname.cex = 0.8)

Given a connection with a single distinct letter, the value for each of its 3 attributes will be 1.

Example:

R.R.R.R

would produce a vector with 1 for SNP, 1 for DS, 1 for SS, and 0 for all the other heatmap features.
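
As a rough illustration, here is a minimal R sketch of how such a characteristic vector could be computed. The letter-to-attribute lookup table is hypothetical except for 'R', which the example above maps to SNP, DS and SS; the function name and the normalization by letter count are assumptions, not the original extraction code.

features <- c("sp","wp","wnp","snp","ds","dm","dl","ss","sm","sl")

# Hypothetical mapping: each behavioural letter contributes to one
# periodicity, one duration and one size attribute. Only 'R' is taken
# from the example above; other letters would need their own entries.
letter_map <- list(R = c("snp", "ds", "ss"))

characteristic_vector <- function(conn) {
  chars <- strsplit(gsub("[^A-Za-z]", "", conn), "")[[1]]
  v <- setNames(numeric(length(features)), features)
  for (ch in chars) {
    attrs <- letter_map[[ch]]
    if (!is.null(attrs)) v[attrs] <- v[attrs] + 1
  }
  v / max(length(chars), 1)  # assumed normalization by letter count
}

characteristic_vector("R.R.R.R")  # snp, ds, ss = 1; everything else 0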

PCA

From Wikipedia:

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.

#pca <- prcomp(dataset[, -ncol(dataset)], center = TRUE, scale. = TRUE)
#plot(pca$x[,1], pca$x[,2], col = c("red","blue","green")[dataset$class])

library(caret)
pca_caret <- preProcess(dataset[, -ncol(dataset)],
                        method = c("BoxCox", "center", "scale", "pca"))

# predict() applies the full chain (BoxCox, centering, scaling, rotation);
# multiplying the raw data by the rotation alone would skip the first steps
pca_x <- predict(pca_caret, dataset[, -ncol(dataset)])
plot(pca_x[, 1], pca_x[, 2], col = c("red", "green", "black")[dataset$class])
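
To see how much variance the leading components actually capture, one can inspect base R's prcomp directly (a quick sketch, mirroring the commented-out alternative above):

pca <- prcomp(dataset[, -ncol(dataset)], center = TRUE, scale. = TRUE)
prop_var <- pca$sdev^2 / sum(pca$sdev^2)  # variance explained per component
round(cumsum(prop_var), 3)                # cumulative proportion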

K-MEANS CLUSTERING

In k-means clustering we have to specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. Then the algorithm iterates between two steps:

1. Reassign data points to the cluster whose centroid is closest.
2. Recalculate the centroid of each cluster.

These two steps are repeated until the within-cluster variation cannot be reduced any further. The within-cluster variation is calculated as the sum of the squared Euclidean distances between the data points and their respective cluster centroids:

\[ SS(k) = \sum_{i \in C_k} \sum_{j=1}^{p} \left( x_{ij} - \bar{x}_{kj} \right)^2 \]

where \( C_k \) is the set of observations assigned to cluster \( k \), \( p \) is the number of features, and \( \bar{x}_{kj} \) is the \( j \)-th coordinate of the centroid of cluster \( k \).

library(dplyr)
dataset.stand <- scale(dataset[, 1:10])           # standardize the features
k.means.fit <- kmeans(dataset.stand, centers = 3) # k = 3
dataset <- cbind(dataset, cluster = k.means.fit$cluster)
group_by(dataset, cluster, class) %>% summarise(n = n())
## Source: local data frame [9 x 3]
## Groups: cluster [?]
## 
##   cluster      class     n
##     <int>     <fctr> <int>
## 1       1     Botnet   172
## 2       1     Normal    13
## 3       1 Unlabelled    64
## 4       2     Botnet    42
## 5       2     Normal   112
## 6       2 Unlabelled    45
## 7       3     Botnet   190
## 8       3     Normal    17
## 9       3 Unlabelled    74
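
k = 3 was fixed up front; a common heuristic for choosing it (not part of the original analysis) is an elbow plot of the total within-cluster sum of squares over a range of k:

# Elbow heuristic: total within-cluster SS for k = 1..10
wss <- sapply(1:10, function(k)
  kmeans(dataset.stand, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")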

2D representation

library(cluster)
clusplot(dataset.stand, k.means.fit$cluster, main='2D representation of the Cluster solution',
         color=TRUE, shade=TRUE,
         labels=0, lines=0)

We apply Random Forest

We keep only the labelled part of the dataset and split it into train and test sets (80/20).

set.seed(1492)
# keep only the labelled observations
datasample <- filter(dataset, class == 'Botnet' | class == 'Normal')
datasample$class <- factor(datasample$class)  # drop the unused 'Unlabelled' level
# stratified 80/20 split on the class label
trainindex <- createDataPartition(datasample$class, p = 0.80, list = FALSE)
train <- datasample[trainindex, ]
test <- datasample[-trainindex, ]
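
createDataPartition stratifies on the class label, which can be verified quickly:

# the class proportions should be (approximately) equal in both splits
prop.table(table(train$class))
prop.table(table(test$class))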

We train a random forest

From Wikipedia:

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.

# Validation method: 10-fold cross-validation
# (the original passed repeats=1, which has no meaning for method="cv")
ctrl_fast <- trainControl(method = "cv",
                          number = 10,
                          summaryFunction = twoClassSummary,
                          verboseIter = FALSE,
                          classProbs = TRUE,
                          allowParallel = TRUE)

# Random Forest
rfFit <- train(class ~ .,
               data = train,
               metric="ROC",
               method = "rf",
               trControl = ctrl_fast)
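
Before looking at test results, it can be informative to see which of the ten behavioural features drive the model; caret exposes this via varImp (not shown in the original run):

# variable importance from the fitted random forest
plot(varImp(rfFit))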

Results on the test set

We label as botnet everything with a probability > 0.6 of belonging to the Botnet class.

predsrfprobs <- predict(rfFit, test, type = 'prob')
# apply the 0.6 decision threshold to the Botnet class probability
predsrf <- factor(ifelse(predsrfprobs$Botnet > 0.6, 'Botnet', 'Normal'),
                  levels = levels(test$class))
print(caret::confusionMatrix(predsrf, test$class))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Botnet Normal
##     Botnet     80      1
##     Normal      0     27
##                                           
##                Accuracy : 0.9907          
##                  95% CI : (0.9495, 0.9998)
##     No Information Rate : 0.7407          
##     P-Value [Acc > NIR] : 3.257e-13       
##                                           
##                   Kappa : 0.9756          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9643          
##          Pos Pred Value : 0.9877          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.7407          
##          Detection Rate : 0.7407          
##    Detection Prevalence : 0.7500          
##       Balanced Accuracy : 0.9821          
##                                           
##        'Positive' Class : Botnet          
## 
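
Since the model was tuned on ROC, the test-set AUC can also be reported straight from the predicted probabilities; a sketch using the pROC package (not loaded in the original):

library(pROC)
# AUC of the Botnet class probability against the true labels
roc_obj <- roc(response = test$class, predictor = predsrfprobs$Botnet,
               levels = c('Normal', 'Botnet'))
auc(roc_obj)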