In this report I carry out the analysis of the time line of the ISIS attacks and try to build a predictive model around it using neural networks.

Reference for this work is the R-Blogger article on Neural Networks: http://goo.gl/uF7F9i

The full time line of ISIS attacks can be found here: https://goo.gl/Xc31Vm (Based on the data available from The Wilson Center)

GENERALIZED LINEAR MODELS

First we load the required libraries and the data set. We split our data set into training and test data sets using a 60:40 ratio.

We next subset the data set to the required number of columns (day of week of attack, day index on time series, score of the attack and cumulative score of the attack). This data is available here: https://goo.gl/Xc31Vm

train <- train[,c("DayOfWeek","Day","Score","Cumulative.Score")]
test <- test[,c("DayOfWeek","Day","Score","Cumulative.Score")]

GENERALIZED LINEAR REGRESSION MODEL 1

First we fit a generalized linear model to the time series and try to predict the index of the day using this model.

lm.fit <- glm(Day~Score+DayOfWeek+Cumulative.Score, data=train)
summary(lm.fit)
## 
## Call:
## glm(formula = Day ~ Score + DayOfWeek + Cumulative.Score, data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2108.54   -212.50     63.65    310.16    705.22  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2300.19661  195.10549  11.790   <2e-16 ***
## Score               0.37106    0.20457   1.814   0.0744 .  
## DayOfWeek         -47.75624   31.87173  -1.498   0.1389    
## Cumulative.Score    0.11015    0.00917  12.012   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 204755)
## 
##     Null deviance: 43101713  on 67  degrees of freedom
## Residual deviance: 13104319  on 64  degrees of freedom
## AIC: 1030.5
## 
## Number of Fisher Scoring iterations: 2
pr.lm <- predict(lm.fit, test)

MSE.lm1 <- sum((pr.lm-test$Day)^2/nrow(test))

plot(pr.lm ~ test$Day)

pr.lm.test.day <- test$Day
test <- as.data.frame(test)
test$predday <- pr.lm

The plot of predicted day vs actual day as given in the test data set is shown below. The prediction is not very accurate and the mean square error is large.

g<- ggplot(test, aes(x=predday, y=Day))
g = g + geom_point(size=2, colour="black")+geom_point()+geom_smooth()
g = g + xlab("PREDICTED DAY") + ylab("ACTUAL DAY")
g = g + geom_abline(data=test,intercept=coef(lm.fit)[1],slope=coef(lm.fit)[2],size=0.5)
g

Next we do cross validation to get the average prediction error for this model. The error here is due to the large dynamic range of the data set with few outliers.

library(boot)
set.seed(200)
lm.fit <- glm(Day~Score+DayOfWeek+Cumulative.Score,data=train)
cv.glm(train,lm.fit,K=10)$delta[1]
## [1] 227971.7

GENERALIZED LINEAR REGRESSION MODEL 2

If we restrict the number of data points to after 2014, we get a more accurate fit as shown next.

Below are the details of the regression model developed with the reduced data set.

lm.fit <- glm(Day~Score+DayOfWeek+Cumulative.Score, data=train)
summary(lm.fit)
## 
## Call:
## glm(formula = Day ~ Score + DayOfWeek + Cumulative.Score, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -240.56   -25.29    10.64    46.74   109.75  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.217e+03  3.620e+01  88.869   <2e-16 ***
## Score            -7.134e-04  3.072e-02  -0.023    0.982    
## DayOfWeek        -3.841e+00  5.121e+00  -0.750    0.456    
## Cumulative.Score  5.364e-02  1.926e-03  27.848   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 4935.991)
## 
##     Null deviance: 4471773  on 61  degrees of freedom
## Residual deviance:  286287  on 58  degrees of freedom
## AIC: 709.08
## 
## Number of Fisher Scoring iterations: 2
pr.lm <- predict(lm.fit, test)

MSE.lm2 <- sum((pr.lm-test$Day)^2/nrow(test))

plot(pr.lm ~ test$Day)

pr.lm.test.day <- test$Day
test <- as.data.frame(test)
test$predday <- pr.lm

g<- ggplot(test, aes(x=predday, y=Day))
g = g + geom_point(size=2, colour="black")+geom_point()+geom_smooth()
g = g + xlab("PREDICTED DAY") + ylab("ACTUAL DAY")
g = g + geom_abline(data=test,intercept=coef(lm.fit)[1],slope=coef(lm.fit)[2],size=0.5)
g

The cross validation error here is less than before as the numbers are closer to each other.

library(boot)
set.seed(200)
lm.fit <- glm(Day~Score+DayOfWeek+Cumulative.Score,data=train)
cv.glm(train,lm.fit,K=10)$delta[1]
## [1] 5607.325

GENERALIZED LINEAR REGRESSION MODEL 3

The next model tries to predict the interval between the given attacks. Given the index of last day of attack, current strength of ISIS and an estimate of the size of the attack, the model tries to get an estimate of interval to next attack by ISIS. Here I have subsetted the training set to time interval between attacks of less than 20 days.

training <- subset(training, training$Day>3000)
training <- subset(training, training$DaysToNext<20)

inTrain <- createDataPartition(training$Location, p=0.6, list=FALSE)
myTraining <- training[inTrain, ]
myTesting <- training[-inTrain, ]
dim(myTraining); dim(myTesting)
## [1] 54  7
## [1] 22  7
##neuralnetwork


train <- myTraining
test <- myTesting
##train$Day = log(train$Day)
##test$Day = log(test$Day)

train <- train[,c("DayOfWeek","Day","DaysToNext","Score","Cumulative.Score")]
test <- test[,c("DayOfWeek","Day","DaysToNext","Score","Cumulative.Score")]

lm.fit <- glm(DaysToNext~Score+Cumulative.Score+Day, data=train)
summary(lm.fit)
## 
## Call:
## glm(formula = DaysToNext ~ Score + Cumulative.Score + Day, data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -6.062  -4.846  -1.598   2.380  13.057  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -3.350e+01  4.918e+01  -0.681    0.499
## Score            -2.446e-04  2.775e-03  -0.088    0.930
## Cumulative.Score -5.726e-04  7.542e-04  -0.759    0.451
## Day               1.197e-02  1.488e-02   0.805    0.425
## 
## (Dispersion parameter for gaussian family taken to be 32.95656)
## 
##     Null deviance: 1671.5  on 53  degrees of freedom
## Residual deviance: 1647.8  on 50  degrees of freedom
## AIC: 347.83
## 
## Number of Fisher Scoring iterations: 2
pr.lm <- predict(lm.fit, test[,c(1,2,4,5)])

MSE.lm3 <- sum((pr.lm-test$DaysToNext)^2/nrow(test))

plot(pr.lm ~ test$DaysToNext)

pr.lm.test.daystonext <- test$DaysToNext

test <- as.data.frame(test)
test$predday <- pr.lm

g<- ggplot(test, aes(x=predday, y=DaysToNext))
g = g + geom_point(size=2, colour="black")+geom_point()+geom_smooth()
g = g + xlab("PREDICTED DAY") + ylab("ACTUAL DAY")
g = g + geom_abline(data=test,intercept=coef(lm.fit)[1],slope=coef(lm.fit)[2],size=0.5)
g

library(boot)
set.seed(200)
lm.fit <- glm(DaysToNext~Score+Cumulative.Score+Day,data=train)
cv.glm(train,lm.fit,K=10)$delta[1]
## [1] 37.53655

As you will note in the above model, the data is pretty random and regression fit is not acurate.

NEURAL NETWORKS

Next I build these same models using neural networks with help of the NeuralNet package in R.

An important first step in using Neural Networks is to normalize and scale the data. The first model of neural network uses 3 input parameters: estimated size of predicted attack, week of the day of estimated attack and current strength of ISIS and predicts the day index of the probable attack.

data <- training[,c("DayOfWeek","Day","Score","Cumulative.Score")]

##data<-subset(data, data$Day>3000)

maxs <- apply(data, 2, max)
mins <- apply(data, 2, min)

scaled <- as.data.frame(scale(data, center=mins, scale=maxs-mins))

train <- scaled[inTrain,]
test <- scaled[-inTrain,]

library(neuralnet)
n<- names(train)
f <- as.formula(paste("Day~",paste(n[!n %in% "Day"],collapse="+")))

nn <-neuralnet(f, data=train, hidden=c(3,2), linear.output= T)

plot(nn)
pr.nn <- compute(nn,test[,c(1,3,4)])

NEURAL NETWORK MODEL 1

The neural network developed uses 2 hidden layers with 3 and 2 neurons each. The predicted result of the neural network is compared with the same result for the GLM model and the MSE is less by a factor of 100.

pr.nn_ <- pr.nn$net.result*(max(data$Day)-min(data$Day))+min(data$Day)
test.r <- (test$Day)*(max(data$Day)-min(data$Day))+min(data$Day)

MSE.nn <- sum((test.r - pr.nn_)^2/nrow(test))

print(paste(MSE.lm1, MSE.nn))
## [1] "206698.586030661 3380.98135265871"

A cross validation is done below on this model to get average error.

## Loading required package: grid
## Loading required package: MASS
## [1] "Mean Error for Neural Network based model"
## [1] "4218.59021153826"

NEURAL NETWORK MODEL 2

I repeat the same analysis this time with data from 2014 onwards.

data <- training[,c("DayOfWeek","Day","Score","Cumulative.Score")]

##data<-subset(data, data$Day>3000)

maxs <- apply(data, 2, max)
mins <- apply(data, 2, min)

scaled <- as.data.frame(scale(data, center=mins, scale=maxs-mins))

train <- scaled[inTrain,]
test <- scaled[-inTrain,]

library(neuralnet)
n<- names(train)
f <- as.formula(paste("Day~",paste(n[!n %in% "Day"],collapse="+")))

nn <-neuralnet(f, data=train, hidden=c(3,2), linear.output= T)

plot(nn)
pr.nn <- compute(nn,test[,c(1,3,4)])


pr.nn_ <- pr.nn$net.result*(max(data$Day)-min(data$Day))+min(data$Day)
test.r <- (test$Day)*(max(data$Day)-min(data$Day))+min(data$Day)

MSE.nn <- sum((test.r - pr.nn_)^2/nrow(test))

print(paste(MSE.lm2, MSE.nn))
## [1] "4036.60362100505 3812.83193599303"
library(neuralnet)
set.seed(450)
cv.error <- NULL
k <- 10
data <- training[,c("DayOfWeek","Day","Score","Cumulative.Score")]
data <- subset(data, data$Day>3000)


library(plyr) 
##pbar <- create_progress_bar('text')
##pbar$init(k)

for(i in 1:k){
  index <- sample(1:nrow(data),round(0.6*nrow(data)))
  train.cv <- scaled[index,]
  test.cv <- scaled[-index,]
  
  nn <- neuralnet(f,data=train.cv,hidden=c(3,2),linear.output=T)
  
  pr.nn <- compute(nn,test.cv[,c(1,3,4)])
  pr.nn_ <- pr.nn$net.result*(max(data$Day)-min(data$Day))+min(data$Day)
  
  test.cv.r <- (test.cv$Day)*(max(data$Day)-min(data$Day))+min(data$Day)
  
  cv.error[i] <- sum((test.cv.r - pr.nn_)^2)/nrow(test.cv)
  
  ##pbar$step()
}

paste("Mean Error for Neural Network based model")
## [1] "Mean Error for Neural Network based model"
paste(mean(cv.error))
## [1] "4218.59021153826"

NEURAL NETWORK MODEL 3

The second neural network model developed predicts the time interval to next ISIS attack given the given day index in time series, strength of ISIS and an estimate of size of the attack. This compares to the third generalized linear regression model developed above.

Here too the data is from the last 2 years and is limited to time intervals of 20 days or less.

training <- read.table("isisdata.csv",header = TRUE, sep = ",", na.strings=c("NA","#DIV/0!",""))

data <- training[,c("DayOfWeek","Day","DaysToNext","Score","Cumulative.Score")]
data <- subset(data, data$Day>3000)
data<-subset(data, data$DaysToNext<20)

maxs <- apply(data, 2, max)
mins <- apply(data, 2, min)

scaled <- as.data.frame(scale(data, center=mins, scale=maxs-mins))

train <- scaled[inTrain,]
test <- scaled[-inTrain,]

library(neuralnet)
n<- names(train)
f <- as.formula(paste("DaysToNext~",paste(n[!n %in% "DaysToNext"],collapse="+")))

nn <-neuralnet(f, data=train, hidden=c(3,2), linear.output= T)

plot(nn)

pr.nn <- compute(nn,test[,c(1,2,4,5)])

##pr.nn$net.result

pr.nn_ <- pr.nn$net.result*(max(data$DaysToNext)-min(data$DaysToNext))+min(data$DaysToNext)
test.r <- (test$Day)*(max(data$DaysToNext)-min(data$DaysToNext))+min(data$DaysToNext)

MSE.nn <- sum((test.r - pr.nn_)^2/nrow(test))

print(paste(MSE.lm3, MSE.nn))
## [1] "41.6322182848707 43.6751336350364"

The GLM and Neural Network predictions are compared here. The errors of both the models are very close. Adding more prediction parameters and data points can potentially improve the performance of the neural network.

par(mfrow=c(1,2))

plot(test.r,pr.nn_,col='red',main='Real vs predicted NN',pch=18,cex=0.7)
abline(0,1,lwd=2)
legend('bottomright',legend='NN',pch=18,col='red', bty='n')

plot(pr.lm.test.daystonext,pr.lm,col='blue',main='Real vs predicted lm',pch=18, cex=0.7)
abline(0,1,lwd=2)
legend('bottomright',legend='LM',pch=18,col='blue', bty='n', cex=.95)

Cross Validation is done to get average error.

## [1] "Mean error of Neural Network for Model 2 based on cross validation"
## [1] "196.938382626231"