These packages are needed for this assignment:
pkg <- c("ggplot2", "RColorBrewer")
new.pkg <- pkg[!(pkg %in% rownames(installed.packages()))]
if (length(new.pkg)) {
install.packages(new.pkg)
}
Let us first make sure we are working in a directory we are
comfortable with, so I will start by checking my working directory,
which is the one listed below.
# Let's check our working directory
getwd()
[1] "C:/Users/antho/OneDrive/Documents/School/4.DataSecurity&Governance"
Let us now read the file and start exploring the dataset
memproc.
mydata1<- read.csv("memproc.csv", header=T)
summary(mydata1)
     host                proc              mem             state
 Length:247         Min.   :-3.1517   Min.   :-3.5939   Length:247
 Class :character   1st Qu.:-1.2056   1st Qu.:-1.4202   Class :character
 Mode  :character   Median :-0.4484   Median :-0.6212   Mode  :character
                    Mean   :-0.4287   Mean   :-0.5181
                    3rd Qu.: 0.3689   3rd Qu.: 0.2413
                    Max.   : 3.1428   Max.   : 3.2184
We can see we were successful in loading the memproc.csv data into a
data frame we named “mydata1”, which has 4 fields. Let's get more info
on the dataset by using the head function to inspect the columns and
the top 6 rows.
head(mydata1)
We can see that there are two fields (“host” & “state”) that look
like characters and two fields (“proc” & “mem”) that look like
numeric data stored as doubles (dbl).
In order to explore this dataset more in detail, let us create a plot
to compare the processor and memory usage, and differentiate it based on
the malware state.
# you will see that we use the ggplot function to create graph comparing the proc(x) and mem(y) fields and set the details/legend of the "state" field by color.
library(ggplot2)
Warning: package ‘ggplot2’ was built under R version 4.2.1
gg <- ggplot(mydata1, aes(proc, mem, color=state))
gg <- gg + scale_color_brewer(palette="Set2") # we can change the color here
gg <- gg + geom_point(size=3) + theme_bw() # we can change the size of the dots here
print(gg)

I suggest now creating a new graph, this time including a title in
the chart. The title would be “Memory vs Processor Usage as function of
the Malware state”.
# We print the original plot (gg), adding the string title we want by inserting it using the ggplot2 function ggtitle()
print(gg + ggtitle("Memory vs Processor Usage as function of the Malware state"))

We print the original plot (gg) and add the title string we want with
the ggplot2 function ggtitle().
set.seed(1492)
# count how many in the overall sample
n <- nrow(mydata1)
# set the test.size to be 1/3rd
test.size <- as.integer(n/3)
# randomly sample the rows for test set
testset <- sample(n, test.size)
# now split the data into test and train
test <- mydata1[testset, ]
train <- mydata1[-testset, ]
# pull out proc and mem columns for infected then normal
# then use colMeans() to compute the means of the columns
inf <- colMeans(train[train$state=="Infected", c("proc", "mem")])
nrm <- colMeans(train[train$state=="Normal", c("proc", "mem")])
We create two variables storing the mean proc and mem values for the
“Infected” and “Normal” hosts in the train data that we split earlier.
print(inf)
proc mem
0.9354513 1.0868010
print(nrm)
proc mem
-0.7907962 -0.9352974
By using the colMeans function we are able to see the mean of the
“infected” and “normal” states for the processor (proc) and memory (mem)
fields in the train data.
predict.malware <- function(data) {
# get 'proc' and 'mem' as numeric values
proc <- as.numeric(data[['proc']])
mem <- as.numeric(data[['mem']])
# set up infected comparison
inf.a <- inf['proc'] - proc
inf.b <- inf['mem'] - mem
# pythagorean distance c = sqrt(a^2 + b^2)
inf.dist <- sqrt(inf.a^2 + inf.b^2)
# repeat for normal systems
nrm.a <- nrm['proc'] - proc
nrm.b <- nrm['mem'] - mem
nrm.dist <- sqrt(nrm.a^2 + nrm.b^2)
# assign a label of the closest (smallest)
ifelse(inf.dist<nrm.dist,"Infected", "Normal")
}
# could test with these if you uncomment them
# (the function takes one argument holding both 'proc' and 'mem')
# predict.malware(inf)
# expect "Infected"
# predict.malware(nrm)
# expect "Normal"
We create a function called predict.malware that labels a host by
computing the Pythagorean (Euclidean) distance from its processor and
memory values to the mean of the infected hosts and to the mean of the
normal hosts in the train data, then assigning the label of whichever
mean is closer. It is interesting how little code this algorithm needs.
prediction <- apply(test, 1, predict.malware)
sum(test$state==prediction)/nrow(test)
[1] 0.902439
Above is our prediction accuracy, which is quite good at about
90%.
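Let's compute a slope and intercept so we can add a diagonal line to
the chart and better visualize the divide between the states. The line
is perpendicular to the segment joining the two means and passes through
its midpoint.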
# Figure 9-2 #########################################################
slope <- -1*(1/((inf['mem']-nrm['mem'])/(inf['proc']-nrm['proc'])))
intercept <- mean(c(inf['mem'], nrm['mem'])) - (slope*mean(c(inf['proc'], nrm['proc'])))
Let's create a result variable that records whether our
predictions were accurate or not.
result <- cbind(test, predict=prediction)
result$Accurate <- ifelse(result$state==result$predict, "Yes", "No")
result$Accurate <- factor(result$Accurate, levels=c("Yes", "No"), ordered=T)
Now recreate the previous graph, but add the accuracy to the legend
by shape and size. We also use the geom_abline() function to draw the
diagonal line from the slope and intercept we computed above.
# notice here we set the detail/legend at the end of this next line where we create the chart.
gg <- ggplot(result, aes(proc, mem, color=state, size=Accurate, shape=Accurate))
gg <- gg + scale_shape_manual(values=c(16, 8))
gg <- gg + scale_size_manual(values=c(3, 6))
gg <- gg + scale_color_brewer(palette="Set2")
gg <- gg + geom_point() + theme_bw()
gg <- gg + geom_abline(intercept = intercept, slope = slope, color="gray80")
print(gg)

We can see that we have fewer false negatives than false positives,
which means we classify infected hosts more accurately than non-infected
ones, so hosts with a threat are being identified. The only issue is
that six times more hosts are told they are infected when they are
not.
set.seed(1)
x <- runif(200, min=-10, max=10)
y <- 1.377*(x^3) + 0.92*(x^2) + .3*x + rnorm(200, sd=250) + 1572
x <- x + 10
smooth <- ggplot(data.frame(x,y), aes(x, y)) + geom_point() +
geom_smooth(method = "lm", formula = y ~ poly(x, 3), size = 1, se=F) +
theme_bw()
print(smooth)

Here we fit a linear regression model to random x, y values generated
from a cubic polynomial plus noise, which gives an S-shaped curve. It is
interesting to notice how we use the geom_smooth() function to define
and fit the linear regression model in nearly one line. This is
useful for predicting quantitative values.
memproc <- read.csv("memproc.csv", header=T)
memproc$infected <- ifelse(memproc$state=="Infected", 1, 0)
set.seed(1492)
n <- nrow(memproc)
test.size <- as.integer(n/3)
testset <- sample(n, test.size)
test <- memproc[testset, ]
train <- memproc[-testset, ]
Since this data is being used to classify a qualitative field
(“state”) which has only two values, it benefits us to convert these
two string values into binary digits so that they are easier to use as a
target value for logistic regression.
You will see below that we use the binary infected column as the target
when creating this logistic regression model. Logistic regression is
fit using the “generalized linear models” glm() function, which I read
can also fit other models such as Poisson regression.
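# note: the model is fit on the test split and then evaluated on the same rows;
# fitting on train instead would give a fairer estimate of performance
glm.out <- glm(infected ~ proc + mem, data=test, family=binomial(logit))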
summary(glm.out)
Call:
glm(formula = infected ~ proc + mem, family = binomial(logit),
    data = test)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.51110  -0.17718  -0.08015  -0.01132   2.28073

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.1905     0.6943  -3.155  0.00160 **
proc          2.1378     0.7192   2.972  0.00295 **
mem           1.6530     0.5682   2.909  0.00362 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 68.275  on 81  degrees of freedom
Residual deviance: 25.348  on 79  degrees of freedom
AIC: 31.348

Number of Fisher Scoring iterations: 8
modelog <- predict.glm(glm.out, test, type="response")
gg <- ggplot(data.frame(x=modelog, y=ifelse(test$infected>0.5, "Yes", "No")), aes(x, y)) +
geom_point(size=3, fill="steelblue", color="black", shape=4) +
ylab("Known Infected Host") +
xlab("Estimated Probability of Infected Host") + theme_bw()
print(gg)

This chart is well suited to showing the estimated probability behind
each classification: somewhat like a confusion matrix, it compares the
model's output against the two known values of a binary classification.
Notice that the hosts with a higher estimated probability of infection
group mostly on the “Yes” line of the y axis. Also notice that the
majority of records fall near 0 probability on the “No” line of known
infected hosts.
set.seed(1) # repeatable
x <- c(rnorm(200), rnorm(400)+2, rnorm(400)-2)
y <- c(rnorm(200), rnorm(200)+2, rnorm(200)-2, rnorm(200)+2, rnorm(200)-2)
randata <- data.frame(x=x, y=y)
out <- list()
for(i in c(3,4,5,6)) {
# cluster on x and y only; after the first pass randata also holds a cluster column
km <- kmeans(randata[, c("x", "y")], i)
centers <- data.frame(x=km$centers[ ,1], y=km$centers[ ,2], cluster=1)
randata$cluster <- factor(km$cluster)
gg <- ggplot(randata, aes(x, y, color=cluster)) + geom_point(size=2)
gg <- gg + geom_point(data=centers, aes(x, y), shape=8, color="black", size=4)
gg <- gg + scale_x_continuous(expand=c(0,0.1))
gg <- gg + scale_y_continuous(expand=c(0,0.1))
gg <- gg + ggtitle(paste("k-means with", i, "clusters"))
gg <- gg + theme(panel.grid = element_blank(),
panel.background = element_rect(colour = "black", fill=NA),
axis.text = element_blank(),
axis.title = element_blank(),
legend.position = "none",
axis.ticks = element_blank())
out[[i-2]] <- gg
}
print(out[[1]])

print(out[[2]])

print(out[[3]])

print(out[[4]])

Here we create a random scatter plot to demonstrate k-means
clustering. The k stands for the number of clusters you want to create.
Notice that we made 4 different charts, with 3 clusters in the first
and ascending to 6 clusters in the last. The starting cluster centers
are chosen at random and the number of clusters needs to be specified by
you, which is why hierarchical clustering can help eliminate the
guessing as to how many clusters suit your data.
---
title: "Machine_Learning"
output: html_notebook
---
These packages are needed for this assignment:

```{r}
pkg <- c("ggplot2", "RColorBrewer")
new.pkg <- pkg[!(pkg %in% rownames(installed.packages()))]
if (length(new.pkg)) {
  install.packages(new.pkg)  
}
```


Let us first make sure we are working in a directory we are comfortable with, so I will start by checking my working directory, which is the one listed below.

```{r}
# Let's check our working directory
getwd()

```

Let us now read the file and start exploring the dataset *memproc*.

```{r}
mydata1<- read.csv("memproc.csv", header=T)
summary(mydata1)
```

We can see we were successful in loading the memproc.csv data into a data frame we named "mydata1", which has 4 fields.
Let's get more info on the dataset by using the head function to inspect the columns and the top 6 rows. 

```{r}
head(mydata1)
```

We can see that there are two fields ("host" & "state") that look like characters and two fields ("proc" & "mem") that look like numeric data stored as doubles (dbl).
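
Rather than reading the types off the printed table, we can also ask R for them directly. The chunk below is a minimal sketch using a small made-up data frame (`toy_hosts`, with invented values) as a stand-in for `mydata1`; `sapply()` with `class` reports one type per column.

```{r}
# Hypothetical stand-in for mydata1; the values are invented for illustration
toy_hosts <- data.frame(
  host  = c("h1", "h2", "h3"),
  proc  = c(-0.5, 1.2, 0.3),
  mem   = c(-0.7, 0.9, 0.1),
  state = c("Normal", "Infected", "Normal"),
  stringsAsFactors = FALSE
)

# One class per column: character for host/state, numeric for proc/mem
col_types <- sapply(toy_hosts, class)
print(col_types)
```

Running the same `sapply(mydata1, class)` call on the real data should confirm what `head()` suggests.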

In order to explore this dataset more in detail, let us create a plot to compare the processor and memory usage, and differentiate it based on the malware state.

```{r}
# you will see that we use the ggplot function to create graph comparing the proc(x) and mem(y) fields and set the details/legend of the "state" field by color. 
library(ggplot2)
gg <- ggplot(mydata1, aes(proc, mem, color=state))
gg <- gg + scale_color_brewer(palette="Set2") # we can change the color here
gg <- gg + geom_point(size=3) + theme_bw() # we can change the size of the dots here
print(gg)
```

I suggest now creating a new graph, this time including a title in the chart. The title would be "Memory vs Processor Usage as function of the Malware state".


```{r}
print(gg + ggtitle("Memory vs Processor Usage as function of the Malware state"))
```
We print the original plot (gg) and add the title string we want with the ggplot2 function ggtitle().


```{r}
set.seed(1492)
# count how many in the overall sample
n <- nrow(mydata1)
# set the test.size to be 1/3rd
test.size <- as.integer(n/3)
# randomly sample the rows for test set
testset <- sample(n, test.size)
# now split the data into test and train
test <- mydata1[testset, ]
train <- mydata1[-testset, ]
```
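
As a quick sanity check on the split logic, the sketch below repeats the same steps on row indices only, assuming the 247 rows reported by `summary()`. The `toy_` names are hypothetical and chosen so they do not overwrite `test` or `train`. Because `sample()` draws without replacement, the two index sets should not overlap.

```{r}
set.seed(1492)
toy_n <- 247                                  # row count reported by summary()
toy_test_size <- as.integer(toy_n / 3)        # integer part of one third
toy_test_idx <- sample(toy_n, toy_test_size)  # rows held out for testing
toy_train_idx <- setdiff(seq_len(toy_n), toy_test_idx)

# Sizes of the two sets, and a check that no row is in both
print(c(test = length(toy_test_idx), train = length(toy_train_idx)))
print(length(intersect(toy_test_idx, toy_train_idx)))
```

This gives 82 test rows and 165 train rows, with no shared rows.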



```{r}
# pull out proc and mem columns for infected then normal
# then use colMeans() to compute the means of the columns
inf <- colMeans(train[train$state=="Infected", c("proc", "mem")])
nrm <- colMeans(train[train$state=="Normal", c("proc", "mem")])
```


We create two variables storing the mean proc and mem values for the "Infected" and "Normal" hosts in the train data that we split earlier. 

```{r}
print(inf)
print(nrm)
```
By using the colMeans function we are able to see the mean of the "infected" and "normal" states for the processor (proc) and memory (mem) fields in the train data. 


```{r}
predict.malware <- function(data) {
  # get 'proc' and 'mem' as numeric values
  proc <- as.numeric(data[['proc']])
  mem <- as.numeric(data[['mem']])
  # set up infected comparison
  inf.a <- inf['proc'] - proc
  inf.b <- inf['mem'] - mem
  # pythagorean distance c = sqrt(a^2 + b^2)
  inf.dist <- sqrt(inf.a^2 + inf.b^2)
  # repeat for normal systems
  nrm.a <- nrm['proc'] - proc
  nrm.b <- nrm['mem'] - mem
  nrm.dist <- sqrt(nrm.a^2 + nrm.b^2)
  # assign a label of the closest (smallest)
  ifelse(inf.dist<nrm.dist,"Infected", "Normal")
}
# could test with these if you uncomment them
# (the function takes one argument holding both 'proc' and 'mem')
# predict.malware(inf)
# expect "Infected" 
# predict.malware(nrm)
# expect "Normal"
```

We create a function called predict.malware that labels a host by computing the Pythagorean (Euclidean) distance from its processor and memory values to the mean of the infected hosts and to the mean of the normal hosts in the train data, then assigning the label of whichever mean is closer. It is interesting how little code this algorithm needs. 
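
The same nearest-mean idea can be sketched in vectorized form, without `apply()`. This is only an illustration: the two class means are rounded from the values printed earlier, and the three hosts in `toy_pts` are made up.

```{r}
# Class means rounded from the printed output (treated here as assumptions)
ctr_inf <- c(proc = 0.94, mem = 1.09)
ctr_nrm <- c(proc = -0.79, mem = -0.94)

# Made-up hosts to classify
toy_pts <- data.frame(proc = c(1.5, -1.0, 0.0), mem = c(1.0, -0.5, -0.2))

# Euclidean distance from every row to each class mean
d_inf <- sqrt((toy_pts$proc - ctr_inf[["proc"]])^2 + (toy_pts$mem - ctr_inf[["mem"]])^2)
d_nrm <- sqrt((toy_pts$proc - ctr_nrm[["proc"]])^2 + (toy_pts$mem - ctr_nrm[["mem"]])^2)

# Label each host with the closer mean
toy_label <- ifelse(d_inf < d_nrm, "Infected", "Normal")
print(toy_label)
```

The first host sits near the infected mean and the other two near the normal mean, so the labels come out as "Infected", "Normal", "Normal".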

```{r}
prediction <- apply(test, 1, predict.malware)
```

```{r}
sum(test$state==prediction)/nrow(test)
```
Above is our prediction accuracy, which is quite good at about 90%.

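Let's compute a slope and intercept so we can add a diagonal line to the chart and better visualize the divide between the states. The line is perpendicular to the segment joining the two means and passes through its midpoint. 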

```{r}
# Figure 9-2 #########################################################
slope <- -1*(1/((inf['mem']-nrm['mem'])/(inf['proc']-nrm['proc'])))
intercept <- mean(c(inf['mem'], nrm['mem'])) - (slope*mean(c(inf['proc'], nrm['proc'])))
```
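
To see why this line separates the two predicted labels, the sketch below rebuilds the slope and intercept from two assumed means (rounded from the printed values) and checks that a point on the line is equally far from both. The `bis_` and `ctr_` names are hypothetical.

```{r}
ctr_a <- c(proc = 0.94, mem = 1.09)    # assumed "infected" mean (rounded)
ctr_b <- c(proc = -0.79, mem = -0.94)  # assumed "normal" mean (rounded)

# Negative reciprocal of the slope of the segment joining the two means
bis_slope <- -(ctr_a[["proc"]] - ctr_b[["proc"]]) / (ctr_a[["mem"]] - ctr_b[["mem"]])
# The line passes through the midpoint of the two means
bis_intercept <- mean(c(ctr_a[["mem"]], ctr_b[["mem"]])) -
  bis_slope * mean(c(ctr_a[["proc"]], ctr_b[["proc"]]))

# Pick any proc value, find the matching mem value on the line
bis_p <- 2
bis_q <- bis_slope * bis_p + bis_intercept

# Distances from that point to each mean should agree
dist_a <- sqrt((bis_p - ctr_a[["proc"]])^2 + (bis_q - ctr_a[["mem"]])^2)
dist_b <- sqrt((bis_p - ctr_b[["proc"]])^2 + (bis_q - ctr_b[["mem"]])^2)
print(all.equal(dist_a, dist_b))
```

Points on one side of the line are closer to one mean and points on the other side are closer to the other, which is exactly the rule the nearest-mean classifier applies.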

Let's create a result variable that records whether our predictions were accurate or not.

```{r}
result <- cbind(test, predict=prediction)
result$Accurate <- ifelse(result$state==result$predict, "Yes", "No")
result$Accurate <- factor(result$Accurate, levels=c("Yes", "No"), ordered=T)
```

Now recreate the previous graph, but add the accuracy to the legend by shape and size. We also use the geom_abline() function to draw the diagonal line from the slope and intercept we computed above. 


```{r}
# notice here we set the detail/legend at the end of this next line where we create the chart. 
gg <- ggplot(result, aes(proc, mem, color=state, size=Accurate, shape=Accurate)) 
gg <- gg + scale_shape_manual(values=c(16, 8))
gg <- gg + scale_size_manual(values=c(3, 6))
gg <- gg + scale_color_brewer(palette="Set2")
gg <- gg + geom_point() + theme_bw()
gg <- gg + geom_abline(intercept = intercept, slope = slope, color="gray80")
print(gg)
```

We can see that we have fewer false negatives than false positives, which means we classify infected hosts more accurately than non-infected ones, so hosts with a threat are being identified. The only issue is that six times more hosts are told they are infected when they are not. 
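
Counts of false positives and false negatives can be read from a confusion table instead of from the plot. The sketch below uses made-up labels (not the notebook's results) to show the pattern; on the real data the same `table()` call would take `result$state` and `result$predict`.

```{r}
# Made-up labels for illustration only
toy_actual    <- c("Infected", "Normal", "Normal", "Infected", "Normal", "Normal")
toy_predicted <- c("Infected", "Infected", "Normal", "Normal", "Normal", "Infected")

# Rows are the known state, columns are the predicted state
toy_cm <- table(actual = toy_actual, predicted = toy_predicted)
print(toy_cm)

# False positive: actually Normal but predicted Infected
toy_fp <- toy_cm["Normal", "Infected"]
# False negative: actually Infected but predicted Normal
toy_fn <- toy_cm["Infected", "Normal"]
print(c(false_positives = toy_fp, false_negatives = toy_fn))
```

In this toy example there are 2 false positives and 1 false negative.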


```{r}
set.seed(1)
x <- runif(200, min=-10, max=10)
y <- 1.377*(x^3) + 0.92*(x^2) + .3*x + rnorm(200, sd=250) + 1572
x <- x + 10
smooth <- ggplot(data.frame(x,y), aes(x, y)) + geom_point() + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 3), size = 1, se=F) + 
  theme_bw()
print(smooth)
```
Here we fit a linear regression model to random x, y values generated from a cubic polynomial plus noise, which gives an S-shaped curve. 
It is interesting to notice how we use the geom_smooth() function to define and fit the linear regression model in nearly one line. 
This is useful for predicting quantitative values.  
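
`geom_smooth()` only draws the fitted curve. To actually predict values, one option is to fit the same formula with `lm()` and call `predict()`. The sketch below is an assumption-laden illustration: it regenerates similar data inside a hypothetical `toy_df` (without the later `x + 10` shift) and predicts at three made-up x values.

```{r}
set.seed(1)
# Cubic signal plus Gaussian noise, same coefficients as the chunk above
toy_df <- data.frame(x = runif(200, min = -10, max = 10))
toy_df$y <- 1.377 * toy_df$x^3 + 0.92 * toy_df$x^2 + 0.3 * toy_df$x +
  rnorm(200, sd = 250) + 1572

# Same model formula that geom_smooth() was given
toy_fit <- lm(y ~ poly(x, 3), data = toy_df)

# Predict the response at new x values
toy_new <- data.frame(x = c(-5, 0, 5))
toy_fitted <- predict(toy_fit, newdata = toy_new)
print(toy_fitted)
```

The model is "linear" in the sense that it is linear in its coefficients, even though the fitted curve is a cubic in x.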

```{r}
memproc <- read.csv("memproc.csv", header=T)
memproc$infected <- ifelse(memproc$state=="Infected", 1, 0)
set.seed(1492)
n <- nrow(memproc)
test.size <- as.integer(n/3)
testset <- sample(n, test.size)
test <- memproc[testset, ]
train <- memproc[-testset, ]
```

Since this data is being used to classify a qualitative field ("state") which has only two values, it benefits us to convert these two string values into binary digits so that they are easier to use as a target value for logistic regression. 

You will see below that we use the binary infected column as the target when creating this logistic regression model.
Logistic regression is fit using the "generalized linear models" glm() function, which I read can also fit other models such as Poisson regression. 
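
To make the connection between glm() and probabilities concrete, the sketch below fits a logistic regression on simulated data and checks that the `"response"` predictions are just the logistic function (`plogis()`) applied to the linear predictor. The data frame `toy_glm_df` and its generating process are invented for illustration and are not the memproc data.

```{r}
set.seed(42)
# Simulated predictors and a noisy binary outcome (hypothetical data)
toy_glm_df <- data.frame(proc = rnorm(100), mem = rnorm(100))
toy_glm_df$infected <- rbinom(100, 1, plogis(toy_glm_df$proc + toy_glm_df$mem))

toy_glm <- glm(infected ~ proc + mem, data = toy_glm_df,
               family = binomial(link = "logit"))

toy_eta <- predict(toy_glm, type = "link")      # b0 + b1*proc + b2*mem
toy_p   <- predict(toy_glm, type = "response")  # estimated P(infected = 1)

# The response is the logistic transform of the linear predictor
print(all.equal(unname(plogis(toy_eta)), unname(toy_p)))
```

This is why the predictions always fall between 0 and 1, which makes them usable as estimated probabilities of infection.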

```{r}
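# note: the model is fit on the test split and then evaluated on the same rows;
# fitting on train instead would give a fairer estimate of performance
glm.out <- glm(infected ~ proc + mem, data=test, family=binomial(logit))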
summary(glm.out)
modelog <- predict.glm(glm.out, test, type="response")
gg <- ggplot(data.frame(x=modelog, y=ifelse(test$infected>0.5, "Yes", "No")), aes(x, y)) +
  geom_point(size=3, fill="steelblue", color="black", shape=4) + 
  ylab("Known Infected Host") +
  xlab("Estimated Probability of Infected Host") + theme_bw()
print(gg)
```

This chart is well suited to showing the estimated probability behind each classification: somewhat like a confusion matrix, it compares the model's output against the two known values of a binary classification. 
Notice that the hosts with a higher estimated probability of infection group mostly on the "Yes" line of the y axis. 
Also notice that the majority of records fall near 0 probability on the "No" line of known infected hosts. 
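
To turn estimated probabilities into a yes/no classification, a cutoff is needed. The sketch below assumes a cutoff of 0.5 and uses made-up probabilities and known labels; with the real model the inputs would be `modelog` and `test$infected`.

```{r}
# Made-up probabilities and known labels, for illustration only
toy_prob  <- c(0.02, 0.10, 0.45, 0.55, 0.91, 0.97)
toy_known <- c(0, 0, 1, 0, 1, 1)

# Assumed cutoff: call a host infected when the probability exceeds 0.5
toy_class <- as.integer(toy_prob > 0.5)

# Share of hosts where the thresholded prediction matches the known label
toy_acc <- mean(toy_class == toy_known)
print(toy_acc)
```

Four of the six toy hosts are classified correctly here. Moving the cutoff trades false positives against false negatives.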

```{r}
set.seed(1) # repeatable
x <- c(rnorm(200), rnorm(400)+2, rnorm(400)-2)
y <- c(rnorm(200), rnorm(200)+2, rnorm(200)-2, rnorm(200)+2, rnorm(200)-2)
randata <- data.frame(x=x, y=y)
out <- list()
for(i in c(3,4,5,6)) {
  # cluster on x and y only; after the first pass randata also holds a cluster column
  km <- kmeans(randata[, c("x", "y")], i)
  centers <- data.frame(x=km$centers[ ,1], y=km$centers[ ,2], cluster=1)
  randata$cluster <- factor(km$cluster)
  gg <- ggplot(randata, aes(x, y, color=cluster)) + geom_point(size=2)
  gg <- gg + geom_point(data=centers, aes(x, y), shape=8, color="black", size=4)
  gg <- gg + scale_x_continuous(expand=c(0,0.1))
  gg <- gg + scale_y_continuous(expand=c(0,0.1))
  gg <- gg + ggtitle(paste("k-means with", i, "clusters"))
  gg <- gg + theme(panel.grid = element_blank(),
                   panel.background = element_rect(colour = "black", fill=NA),
                   axis.text = element_blank(),
                   axis.title = element_blank(),
                   legend.position = "none",
                   axis.ticks = element_blank())
  out[[i-2]] <- gg
}
print(out[[1]])
print(out[[2]])
print(out[[3]])
print(out[[4]])
```

Here we create a random scatter plot to demonstrate k-means clustering. The k stands for the number of clusters you want to create. Notice that we made 4 different charts, with 3 clusters in the first and ascending to 6 clusters in the last. The starting cluster centers are chosen at random and the number of clusters needs to be specified by you, which is why hierarchical clustering can help eliminate the guessing as to how many clusters suit your data.
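
Another common way to reduce the guessing, sketched here as an aside and not something the notebook itself does, is the "elbow" heuristic: run k-means for several values of k and compare the total within-cluster sum of squares. The data in `toy_xy` are simulated with three well-separated groups, and `nstart = 10` is an assumed setting to make the random starts less influential.

```{r}
set.seed(1)
# Three simulated groups centered near (0, 0), (4, 4) and (-4, -4)
toy_xy <- data.frame(
  x = c(rnorm(50), rnorm(50) + 4, rnorm(50) - 4),
  y = c(rnorm(50), rnorm(50) + 4, rnorm(50) - 4)
)

# Total within-cluster sum of squares for k = 1..6
toy_wss <- sapply(1:6, function(k) {
  kmeans(toy_xy, centers = k, nstart = 10)$tot.withinss
})
print(round(toy_wss, 1))
```

The value drops sharply up to the true number of groups and then flattens, and that bend is the suggested k.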
