These packages are needed for this assignment:

pkg <- c("ggplot2", "RColorBrewer")
new.pkg <- pkg[!(pkg %in% rownames(installed.packages()))]
if (length(new.pkg)) {
  install.packages(new.pkg)  
}

Let us first make sure we are working in a directory we feel comfortable with. I will start by checking that my working directory is the one listed below.

# Let's check our working directory
getwd()
[1] "C:/Users/antho/OneDrive/Documents/School/4.DataSecurity&Governance"

Let us now read the file and start exploring the dataset memproc.

mydata1 <- read.csv("memproc.csv", header=TRUE)
summary(mydata1)
     host                proc              mem             state          
 Length:247         Min.   :-3.1517   Min.   :-3.5939   Length:247        
 Class :character   1st Qu.:-1.2056   1st Qu.:-1.4202   Class :character  
 Mode  :character   Median :-0.4484   Median :-0.6212   Mode  :character  
                    Mean   :-0.4287   Mean   :-0.5181                     
                    3rd Qu.: 0.3689   3rd Qu.: 0.2413                     
                    Max.   : 3.1428   Max.   : 3.2184                     

We can see we were successful in loading the memproc.csv data into a data frame we named “mydata1”, which has 4 fields. Let's get more info on the dataset by using the head function and inspecting the columns and top 6 rows.

head(mydata1)

We can see that there are two fields (“host” & “state”) that look like character data and two fields (“proc” & “mem”) that look like numeric data of the dbl (double) type.

In order to explore this dataset in more detail, let us create a plot to compare the processor and memory usage, and differentiate it based on the malware state.
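The column types can also be checked directly rather than read off the printed rows. The sketch below uses a small made-up data frame called toy (a hypothetical stand-in with the same four column names as mydata1, not the real data) and assumes nothing beyond base R.

```r
# Hypothetical stand-in for mydata1 with the same four columns (values are made up)
toy <- data.frame(
  host  = c("h1", "h2", "h3"),
  proc  = c(-0.4, 1.2, 0.3),
  mem   = c(-0.6, 0.9, 0.1),
  state = c("Normal", "Infected", "Normal"),
  stringsAsFactors = FALSE
)
# class of each column: character for host/state, numeric for proc/mem
col.types <- sapply(toy, class)
print(col.types)
# how many rows fall in each state
state.counts <- table(toy$state)
print(state.counts)
```

On the real data the same two calls would be sapply(mydata1, class) and table(mydata1$state).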

# we use the ggplot function to create a graph comparing the proc (x) and mem (y) fields, and map the "state" field to color so it appears in the legend
library(ggplot2)
Warning: package ‘ggplot2’ was built under R version 4.2.1
gg <- ggplot(mydata1, aes(proc, mem, color=state))
gg <- gg + scale_color_brewer(palette="Set2") # we can change the color here
gg <- gg + geom_point(size=3) + theme_bw() # we can change the size of the dots here
print(gg)

I suggest you now create a new graph, this time including a title in the chart. The title would be “Memory vs Processor Usage as function of the Malware state”.

# We print the original plot (gg), adding the string title we want by inserting it using the ggplot2 function ggtitle()
print(gg + ggtitle("Memory vs Processor Usage as function of the Malware state"))


set.seed(1492)
# count how many in the overall sample
n <- nrow(mydata1)
# set the test.size to be 1/3rd
test.size <- as.integer(n/3)
# randomly sample the rows for test set
testset <- sample(n, test.size)
# now split the data into test and train
test <- mydata1[testset, ]
train <- mydata1[-testset, ]
# pull out proc and mem columns for infected then normal
# then use colMeans() to compute the means of the columns
inf <- colMeans(train[train$state=="Infected", c("proc", "mem")])
nrm <- colMeans(train[train$state=="Normal", c("proc", "mem")])

We create two variables storing the mean “proc” and “mem” values for the “Infected” and “Normal” hosts in the train data that we split earlier.

print(inf)
     proc       mem 
0.9354513 1.0868010 
print(nrm)
      proc        mem 
-0.7907962 -0.9352974 

By using the colMeans function we are able to see the mean of the “infected” and “normal” states for the processor (proc) and memory (mem) fields in the train data.

predict.malware <- function(data) {
  # get 'proc' and 'mem' as numeric values
  proc <- as.numeric(data[['proc']])
  mem <- as.numeric(data[['mem']])
  # set up infected comparison
  inf.a <- inf['proc'] - proc
  inf.b <- inf['mem'] - mem
  # pythagorean distance c = sqrt(a^2 + b^2)
  inf.dist <- sqrt(inf.a^2 + inf.b^2)
  # repeat for normal systems
  nrm.a <- nrm['proc'] - proc
  nrm.b <- nrm['mem'] - mem
  nrm.dist <- sqrt(nrm.a^2 + nrm.b^2)
  # assign a label of the closest (smallest)
  ifelse(inf.dist<nrm.dist,"Infected", "Normal")
}
# could test with these if you uncomment them
# (the function takes a single argument with 'proc' and 'mem' entries)
# predict.malware(inf)
# expect "Infected"
# predict.malware(nrm)
# expect "Normal"

We create a function called predict.malware to classify a host: it computes the distance from the host to the “infected” mean and to the “normal” mean of the train data, and assigns the label of whichever is closer. It is an interesting algorithm that uses the Pythagorean (Euclidean) distance between a host's processor and memory values and the mean values for the infected and normal states.
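The same nearest-mean rule can be written so that it labels a whole data frame in one call instead of going row by row through apply(). This is only a sketch: the function name nearest.centroid, the test points in pts, and the rounded centroid values are assumptions made for illustration.

```r
# Centroids rounded from the means printed earlier (rounding is an assumption)
inf.c <- c(proc = 0.94, mem = 1.09)
nrm.c <- c(proc = -0.79, mem = -0.94)

# Hypothetical helper: label every row by its closer centroid
nearest.centroid <- function(df, inf.c, nrm.c) {
  # Euclidean distance from each row to the infected centroid
  inf.dist <- sqrt((df$proc - inf.c[["proc"]])^2 + (df$mem - inf.c[["mem"]])^2)
  # Euclidean distance from each row to the normal centroid
  nrm.dist <- sqrt((df$proc - nrm.c[["proc"]])^2 + (df$mem - nrm.c[["mem"]])^2)
  ifelse(inf.dist < nrm.dist, "Infected", "Normal")
}

# Made-up points to classify
pts <- data.frame(proc = c(1.0, -1.0, 0.5), mem = c(1.0, -1.0, -0.2))
labels <- nearest.centroid(pts, inf.c, nrm.c)
print(labels)
```

Because the arithmetic is vectorized over the columns, no conversion to character and back (as happens inside apply() on a data frame with text columns) is needed.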

prediction <- apply(test, 1, predict.malware)
sum(test$state==prediction)/nrow(test)
[1] 0.902439

Above is our prediction accuracy, which is quite good at about 90%.

Let's compute a slope and intercept so we can add a diagonal line to the chart and better visualize the divide between states.

# Figure 9-2 #########################################################
slope <- -1*(1/((inf['mem']-nrm['mem'])/(inf['proc']-nrm['proc'])))
intercept <- mean(c(inf['mem'], nrm['mem'])) - (slope*mean(c(inf['proc'], nrm['proc'])))
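The slope and intercept above describe the perpendicular bisector of the segment joining the two means: the slope is the negative reciprocal of the slope between the means, and the line passes through their midpoint. The sketch below re-derives it from the printed means and checks that a point on the line is equally far from both; the variable names (inf.c, nrm.c, b.slope, b.intercept, x0) are illustrative assumptions.

```r
# Means copied from the print(inf) and print(nrm) output
inf.c <- c(proc = 0.9354513, mem = 1.0868010)
nrm.c <- c(proc = -0.7907962, mem = -0.9352974)

# slope of the segment joining the two means
join.slope <- (inf.c[["mem"]] - nrm.c[["mem"]]) / (inf.c[["proc"]] - nrm.c[["proc"]])
# the boundary is perpendicular to that segment: negative reciprocal
b.slope <- -1 / join.slope
# and it passes through the midpoint of the two means
mid <- (inf.c + nrm.c) / 2
b.intercept <- mid[["mem"]] - b.slope * mid[["proc"]]

# pick an arbitrary x on the boundary and compare distances to both means
x0 <- 2
p <- c(x0, b.intercept + b.slope * x0)
d.inf <- sqrt(sum((p - inf.c)^2))
d.nrm <- sqrt(sum((p - nrm.c)^2))
print(c(slope = b.slope, intercept = b.intercept))
print(all.equal(d.inf, d.nrm))
```

Any host above the line is closer to the infected mean and any host below it is closer to the normal mean, which is why the line matches the decision made by predict.malware.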

Let's create a result variable that records whether our predictions were accurate or not.

result <- cbind(test, predict=prediction)
result$Accurate <- ifelse(result$state==result$predict, "Yes", "No")
result$Accurate <- factor(result$Accurate, levels=c("Yes", "No"), ordered=T)

Now we recreate the previous graph but add accuracy to the legend by shape and size. We also use the geom_abline() function to draw the diagonal line from the slope and intercept we created above.

# notice here we set the legend mappings (color, size, shape) inside aes() on the next line, where we create the chart
gg <- ggplot(result, aes(proc, mem, color=state, size=Accurate, shape=Accurate)) 
gg <- gg + scale_shape_manual(values=c(16, 8))
gg <- gg + scale_size_manual(values=c(3, 6))
gg <- gg + scale_color_brewer(palette="Set2")
gg <- gg + geom_point() + theme_bw()
gg <- gg + geom_abline(intercept = intercept, slope = slope, color="gray80")
print(gg)

We can see that we have fewer false negatives than false positives, which means infected hosts are being predicted more accurately than non-infected ones, so those with a threat are being identified. The only issue is that there are six times as many false positives, that is, hosts being told they are infected when they are not.
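False positives and false negatives can be counted exactly with a confusion matrix instead of being read off the chart. The labels below are made up purely to show the mechanics (they are not taken from memproc.csv), and the names actual, predicted, and cm are assumptions.

```r
# Hypothetical labels, for illustration only
actual    <- c("Infected", "Infected", "Normal", "Normal",   "Normal", "Normal")
predicted <- c("Infected", "Normal",   "Normal", "Infected", "Normal", "Normal")

# rows are the known state, columns are the predicted state
cm <- table(actual = actual, predicted = predicted)
print(cm)

# normal hosts flagged as infected
false.pos <- cm["Normal", "Infected"]
# infected hosts that were missed
false.neg <- cm["Infected", "Normal"]
# share of hosts on the diagonal (correct predictions)
accuracy <- sum(diag(cm)) / sum(cm)
print(c(false.pos = false.pos, false.neg = false.neg, accuracy = accuracy))
```

With the objects from this notebook, the equivalent call would be table(actual = result$state, predicted = result$predict).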

set.seed(1)
x <- runif(200, min=-10, max=10)
y <- 1.377*(x^3) + 0.92*(x^2) + .3*x + rnorm(200, sd=250) + 1572
x <- x + 10
smooth <- ggplot(data.frame(x,y), aes(x, y)) + geom_point() + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 3), size = 1, se=F) + 
  theme_bw()
print(smooth)

Here we fit a regression model to random x, y values that we generated from a cubic polynomial plus noise, which gives the points an S-shaped trend. It is still a linear regression because the model is linear in its coefficients. Notice how we use the geom_smooth() function to define and draw the regression fit in nearly one line. This is useful for predicting quantitative values.
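geom_smooth() only draws the fit. To actually get predicted values, the same model can be fit with lm() and queried with predict(). This is a sketch that regenerates the same simulated data; the names dat, fit, new.x, pred, and r2 are assumptions introduced here.

```r
# Regenerate the simulated data used for the plot
set.seed(1)
x <- runif(200, min = -10, max = 10)
y <- 1.377 * x^3 + 0.92 * x^2 + 0.3 * x + rnorm(200, sd = 250) + 1572
x <- x + 10
dat <- data.frame(x, y)

# Same formula that geom_smooth() was given: a degree-3 polynomial in x
fit <- lm(y ~ poly(x, 3), data = dat)

# Predict y at a few new x values (chosen arbitrarily)
new.x <- data.frame(x = c(5, 10, 15))
pred <- predict(fit, newdata = new.x)
print(pred)

# Share of the variance in y explained by the fit
r2 <- summary(fit)$r.squared
print(r2)
```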

memproc <- read.csv("memproc.csv", header=T)
memproc$infected <- ifelse(memproc$state=="Infected", 1, 0)
set.seed(1492)
n <- nrow(memproc)
test.size <- as.integer(n/3)
testset <- sample(n, test.size)
test <- memproc[testset, ]
train <- memproc[-testset, ]

Since this data is being used to classify a qualitative field (“state”) which has only two values, it would benefit us to convert these two string values into binary digits so that they are easier to use as a target value for logistic regression.

You will see below how we target only the infected state when creating this logistic regression model. Logistic regression is done using the “generalized linear models” glm() function, which I read can also do Poisson regression through its family argument. Note that the model below is fit to the test split, so the probabilities plotted afterward are in-sample fits rather than predictions on held-out data.

glm.out = glm(infected ~ proc + mem, data=test, family=binomial(logit))
summary(glm.out)

Call:
glm(formula = infected ~ proc + mem, family = binomial(logit), 
    data = test)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.51110  -0.17718  -0.08015  -0.01132   2.28073  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -2.1905     0.6943  -3.155  0.00160 **
proc          2.1378     0.7192   2.972  0.00295 **
mem           1.6530     0.5682   2.909  0.00362 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 68.275  on 81  degrees of freedom
Residual deviance: 25.348  on 79  degrees of freedom
AIC: 31.348

Number of Fisher Scoring iterations: 8

modelog <- predict.glm(glm.out, test, type="response")
gg <- ggplot(data.frame(x=modelog, y=ifelse(test$infected>0.5, "Yes", "No")), aes(x, y)) +
  geom_point(size=3, fill="steelblue", color="black", shape=4) + 
  ylab("Known Infected Host") +
  xlab("Estimated Probability of Infected Host") + theme_bw()
print(gg)

Notice this chart is well suited to showing the probability behind a classification. Somewhat like a confusion matrix, it compares the estimated probability of infection against each of the two known values of a binary classification. Notice that the hosts with a higher probability of infection group mostly on the “Yes” line of the y axis. Also notice that the majority of records fall near 0 probability and on the “No” known infected line.
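The estimated probabilities on the x axis come from pushing the linear predictor through the logistic function. As a rough check, the coefficients printed by summary(glm.out) above can be applied by hand; the function name infection.prob and the two example hosts are hypothetical.

```r
# Coefficients copied from the summary(glm.out) output
b0     <- -2.1905
b.proc <-  2.1378
b.mem  <-  1.6530

# Hypothetical helper: probability of infection for given proc and mem values
infection.prob <- function(proc, mem) {
  # plogis(z) is 1 / (1 + exp(-z)), the inverse of the logit link
  plogis(b0 + b.proc * proc + b.mem * mem)
}

# a host at proc = 0, mem = 0 and a host at proc = 1, mem = 1
p.origin <- infection.prob(0, 0)
p.high   <- infection.prob(1, 1)
print(c(p.origin = p.origin, p.high = p.high))
```

A host would be labeled infected when this probability exceeds a chosen cutoff; 0.5 is a common default, though the cutoff is a choice and not something the model fixes.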

set.seed(1) # repeatable
x <- c(rnorm(200), rnorm(400)+2, rnorm(400)-2)
y <- c(rnorm(200), rnorm(200)+2, rnorm(200)-2, rnorm(200)+2, rnorm(200)-2)
randata <- data.frame(x=x, y=y)
out <- list()
for(i in c(3,4,5,6)) {
  km <- kmeans(randata, i)
  centers <- data.frame(x=km$centers[ ,1], y=km$centers[ ,2], cluster=1)
  randata$cluster <- factor(km$cluster)
  gg <- ggplot(randata, aes(x, y, color=cluster)) + geom_point(size=2)
  gg <- gg + geom_point(data=centers, aes(x, y), shape=8, color="black", size=4)
  gg <- gg + scale_x_continuous(expand=c(0,0.1))
  gg <- gg + scale_y_continuous(expand=c(0,0.1))
  gg <- gg + ggtitle(paste("k-means with", i, "clusters"))
  gg <- gg + theme(panel.grid = element_blank(),
                   panel.background = element_rect(colour = "black", fill=NA),
                   axis.text = element_blank(),
                   axis.title = element_blank(),
                   legend.position = "none",
                   axis.ticks = element_blank())
  out[[i-2]] <- gg
}
print(out[[1]])

print(out[[2]])

print(out[[3]])

print(out[[4]])

Here we create a random scatter plot to demonstrate k-means clustering. The k stands for the number of clusters you want to create. Notice that we made 4 different charts, with 3 clusters in the first and ascending to 6 clusters in the last. The starting cluster centers are chosen at random and the number of clusters needs to be specified by you, which is why hierarchical clustering can help in eliminating the guessing as to how many clusters suit your data, maybe for dimension reduction.
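One common way to reduce the guessing is to run k-means for a range of k and look at how the total within-cluster sum of squares falls off (often called the elbow method). This is a swapped-in heuristic, not something the code above does; the sketch regenerates the same random points, and the names pts, ks, and wss plus the settings nstart = 20 and iter.max = 50 are assumptions.

```r
# Regenerate the random points used for the k-means charts
set.seed(1)
x <- c(rnorm(200), rnorm(400) + 2, rnorm(400) - 2)
y <- c(rnorm(200), rnorm(200) + 2, rnorm(200) - 2, rnorm(200) + 2, rnorm(200) - 2)
pts <- data.frame(x = x, y = y)

# Try k = 1..8; nstart reruns each k from several random starts and keeps the best
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(pts, centers = k, nstart = 20, iter.max = 50)$tot.withinss
})

# Total within-cluster sum of squares for each k
print(data.frame(k = ks, wss = round(wss, 1)))
```

The k where the drop in wss starts to flatten is a reasonable candidate for the number of clusters, though the choice remains a judgment call.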

