setwd("/home/harpo/Dropbox/ongoing-work/git-repos/labeling-datasets/")
# read the connection characteristic-vector file
dataset=read.csv(file="./conecction_charactersticVector.csv",header=F)
names(dataset)=c("sp","wp","wnp","snp","ds","dm","dl","ss","sm","sl","class")
library(lattice)
# Scatterplot matrix of the 10 connection features, coloured by class,
# with a kernel density estimate drawn on the diagonal panels
splom(~dataset[, 1:10], data = dataset,
      groups = dataset$class,
      diag.panel = function(x, ...){
        yrng <- current.panel.limits()$ylim
        d <- density(x, na.rm = TRUE)
        d$y <- with(d, yrng[1] + 0.95 * diff(yrng) * y / max(y))
        panel.lines(d)
        diag.panel.splom(x, ...)
      },
      cex = 0.5, xlab = "", ylab = "", auto.key = TRUE, cex.labels = 0.1,
      pscales = 0, alpha = 0.5, varname.cex = 0.8)
Given a connection with a single letter, the value for each of its 3 attributes will be 1.
Example:
R.R.R.R
would produce a vector with 1 for SNP, 1 for DS, 1 for SS, and 0 for all the other heatmap features.
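As a sketch of this construction (only the mapping for the letter R follows from the example above; the lookup table and the helper name vector_for_letter are illustrative assumptions, and other letters would need their own rows):
# Hypothetical sketch: build the 10-feature vector for a single-letter connection
features <- c("sp","wp","wnp","snp","ds","dm","dl","ss","sm","sl")
letter_map <- list(
  R = c("snp", "ds", "ss")   # taken from the R.R.R.R example above
)
vector_for_letter <- function(letter) {
  v <- setNames(rep(0, length(features)), features)  # all features start at 0
  v[letter_map[[letter]]] <- 1                        # the letter's 3 attributes are set to 1
  v
}
vector_for_letter("R")   # snp, ds and ss are 1, everything else is 0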
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
#pca=prcomp(dataset[,1:ncol(dataset)-1], center = TRUE, scale. = TRUE)
#plot(pca$x[,1],pca$x[,2],col=c("red","blue","green"))
library(caret)
# PCA via caret: Box-Cox transform, centre, scale, then project onto the
# principal components (the class column is excluded from the predictors)
pca_caret <- preProcess(dataset[, 1:(ncol(dataset) - 1)],
                        method = c("BoxCox", "center", "scale", "pca"))
# predict() applies the Box-Cox/centre/scale steps before rotating
pca_x <- predict(pca_caret, dataset[, 1:(ncol(dataset) - 1)])
# First two principal components, one colour per class
plot(pca_x[, 1:2], col = c("red", "green", "black")[as.integer(factor(dataset$class))])
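As a quick check of the properties described above (a sketch over the pca_x object from the previous chunk), the resulting components should be pairwise (near-)uncorrelated and their variances should decrease from the first component onwards:
# Principal components should be uncorrelated and ordered by decreasing variance
round(cor(pca_x), 2)
apply(pca_x, 2, var)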
In k-means clustering, we have to specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. Then the algorithm iterates through two steps:
Reassign data points to the cluster whose centroid is closest, and calculate the new centroid of each cluster. These two steps are repeated until the within-cluster variation cannot be reduced any further. The within-cluster variation is calculated as the sum of the squared Euclidean distances between the data points and their respective cluster centroids:
\[ SS(k)=\sum_{i \in C_k}\sum_{j=1}^{p}\left(x_{ij} - \bar{x}_{kj}\right)^2 \]
where \(C_k\) is the set of observations assigned to cluster \(k\), \(p\) is the number of features, and \(\bar{x}_{kj}\) is the \(j\)-th coordinate of the centroid of cluster \(k\).
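A small sketch of how SS(k) behaves as the number of clusters grows (the range 1 to 10 and nstart = 25 are illustrative choices; the standardization matches the clustering chunk below):
# Total within-cluster sum of squares for k = 1..10 on the standardized features;
# the "elbow" of this curve is a common heuristic for choosing k
wss <- sapply(1:10, function(k)
  kmeans(scale(dataset[1:10]), centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS(k)")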
#library(rattle)
dataset.stand <- scale(dataset[1:10])
k.means.fit <- kmeans(dataset.stand, 3) # k = 3
dataset=cbind(dataset,k.means.fit$cluster)
names(dataset)[12]="cluster"
library(dplyr)
# Cross-tabulate cluster assignments against the original labels
group_by(dataset, cluster, class) %>% summarise(n = n())
## Source: local data frame [9 x 3]
## Groups: cluster [?]
##
## cluster class n
## <int> <fctr> <int>
## 1 1 Botnet 172
## 2 1 Normal 13
## 3 1 Unlabelled 64
## 4 2 Botnet 42
## 5 2 Normal 112
## 6 2 Unlabelled 45
## 7 3 Botnet 190
## 8 3 Normal 17
## 9 3 Unlabelled 74
library(cluster)
clusplot(dataset.stand, k.means.fit$cluster,
         main = '2D representation of the Cluster solution',
         color = TRUE, shade = TRUE, labels = 0, lines = 0)
set.seed(1492)
datasample=filter(dataset,class =='Botnet' | class =='Normal')
datasample$class=factor(datasample$class)
trainindex <- createDataPartition(datasample$class, p=0.80, list=F)
train <- datasample[trainindex, ]
test <- datasample[-trainindex, ]
Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
# Validation method
ctrl_fast <- trainControl(method = "cv",
                          number = 10,
                          repeats = 1,
                          summaryFunction = twoClassSummary,
                          verboseIter = FALSE,
                          classProbs = TRUE,
                          allowParallel = TRUE)
# Random Forest
rfFit <- train(class ~ .,
               data = train,
               metric = "ROC",
               method = "rf",
               trControl = ctrl_fast)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
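To connect the fitted model back to the description above, a few inspection calls on the caret object (these are illustrative and not part of the original analysis):
# Number of trees whose predictions are aggregated, the cross-validated ROC for
# each tried mtry value, and the features the forest relies on most
rfFit$finalModel$ntree
rfFit$results
varImp(rfFit)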
We label as Botnet anything with a probability greater than 0.6 of belonging to the Botnet class.
# Any connection with P(Botnet) > 0.6 is labelled Botnet, the rest Normal
predsrfprobs <- predict(rfFit, test, type = 'prob')
predsrf <- factor(ifelse(predsrfprobs$Botnet > 0.6, 'Botnet', 'Normal'),
                  levels = levels(test$class))
print(caret::confusionMatrix(predsrf, test$class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Botnet Normal
## Botnet 80 1
## Normal 0 27
##
## Accuracy : 0.9907
## 95% CI : (0.9495, 0.9998)
## No Information Rate : 0.7407
## P-Value [Acc > NIR] : 3.257e-13
##
## Kappa : 0.9756
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 1.0000
## Specificity : 0.9643
## Pos Pred Value : 0.9877
## Neg Pred Value : 1.0000
## Prevalence : 0.7407
## Detection Rate : 0.7407
## Detection Prevalence : 0.7500
## Balanced Accuracy : 0.9821
##
## 'Positive' Class : Botnet
##
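Since the model was tuned on ROC, a test-set ROC curve can complement the confusion matrix. This sketch uses the pROC package (which caret's twoClassSummary already depends on) and is not part of the original analysis:
library(pROC)
# ROC curve for the Botnet probabilities on the held-out test set
# (Normal is the control level, Botnet the case level)
roc_rf <- roc(response = test$class, predictor = predsrfprobs$Botnet,
              levels = c("Normal", "Botnet"))
plot(roc_rf)
auc(roc_rf)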
#pca=prcomp(datasample[,1:ncol(datasample)-1], center = TRUE, scale. = TRUE)
#plot(pca$x[,1],pca$x[,2],col=c("red","blue","green"))