We need to fit a neural network to the Default data using a single hidden layer with 10 units and dropout regularization.
library(ISLR2)
library(keras)
library(tensorflow)
Per the ISLR package we know that Default is a data frame with 10,000 observations on the following 4 variables.
default: a factor with levels No and Yes indicating whether the customer defaulted on their debt
student: a factor with levels No and Yes indicating whether the customer is a student
balance: the average balance that the customer has remaining on their credit card after making their monthly payment
income: Income of customer
summary(Default)
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
We need to first split our data into training and validation sets.
set.seed(1)
n <- nrow(Default)
ntest <- trunc(n/5)
testid <- sample(1:n, ntest)
y_test <- Default$default[testid] == 'Yes'
One thing to note above is that our default variable is No (if the Default is FALSE) and Yes (if the Default is TRUE).
Let’s train our linear logistic regression model and then assess it on the validation set. In our model it will predict Default Yes if the probability of Default is greater than 0.5. In other words, our model will predict \(P(\text{Default} | X)\). If this probability is greater than 0.5, then our model will classify \(X\) as a Default TRUE. Otherwise, it will classify \(X\) as a Default FALSE.
ll.reg <- glm(default~student+balance+income,family="binomial",data=Default[-testid,])
ll.pred <- predict(ll.reg, data=Default[testid,], type='response') > 0.5
ll.accuracy = mean(ll.pred == y_test)
ll.accuracy
## [1] 0.95225
Our linear logistic regression model performs well on this validation set with accuracy around 95%.
Let us now try to fit a neural network.
x = model.matrix(default ~. -1, data=Default)
x_train <- x[-testid,]
g_train <- Default$default[-testid]=='Yes'
x_test <- x[testid,]
g_test <- Default$default[testid] == 'Yes'
modnn <- keras_model_sequential() %>%
layer_dense(units=10, activation='relu', input_shape=ncol(x)) %>%
layer_dropout(rate=0.4) %>%
layer_dense(units = 1, activation='sigmoid')
Now let’s compile our model.
modnn %>% compile(
optimizer=optimizer_rmsprop(),
loss='binary_crossentropy',
metrics='accuracy')
Now let’s fit our model to our data.
history <- modnn %>% fit(
x = x_train,
y = g_train,
epochs=30,
batch_size=128)
Let’s check our neural network accuracy.
nnpred <- predict(modnn, x_test) > 0.5
nn.accuracy <- mean(nnpred == g_test)
nn.accuracy
## [1] 0.965
As we can see, our neural network test set accuracy is about 96%. This is comparable to our linear logistic regression which had 95% accuracy.
library(ISLR2)
xdata <- data.matrix(NYSE[, c("DJ_return", "log_volume", "log_volatility")])
istrain <- NYSE[,"train"]
xdata <- scale(xdata)
For the NYSE data we will use the past five trading days to predict the next day’s trading volume. Hence, we use \(L=5\), and since \(T=6,051\) we can create 6,046 such \((X,Y)\) pairs.
we will write functions to create lagged versions of the three time series. We start with a function that takes as input a data matrix and a lag \(L\), and returns a lagged version of the matrix. It simply inserts \(L\) rows of NA at the top, and truncates the bottom.
lagm <- function(x, k=1) {
n <- nrow(x)
pad <- matrix(NA, k, ncol(x))
rbind(pad, x[1:(n-k),])
}
arframe <- data.frame(log_volume=xdata[, "log_volume"],
L1 = lagm(xdata,1), L2=lagm(xdata,2),
L3 = lagm(xdata,3), L4=lagm(xdata,4),
L5 = lagm(xdata,5))
We can see that our arframe has missing values in the lagged variables (due to the construction of them). We remove these rows and adjust istrain (our train index) accordingly.
arframe <- arframe[-(1:5), ]
istrain <- istrain[-(1:5)]
We now fit the linear AR model to the training data using lm() and predict on the test data
arfit <- lm(log_volume ~., data=arframe[istrain, ])
arpred <- predict(arfit, arframe[!istrain, ])
V0 <- var(arframe[!istrain, "log_volume"])
1 - mean((arpred - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.413223
The last two lines compute the \(R^2\) on the test data, as defined in section 3.17 of the textbook. In other words, our model explains 41% of the variation seen in the data.
We’ll add a factor variable month_number to our data because we have a hunch that month_number has a strong association with trading volume.
library(lubridate) #for extracting month from date object
string.dates <- NYSE[-c(1:5), "date"]
dates <- as.Date(string.dates)
months <- as.factor(month(dates))
arframe.month <- data.frame(months, arframe)
arfit.month <- lm(log_volume ~., data=arframe.month[istrain, ])
arpred.month <- predict(arfit.month, arframe.month[!istrain, ])
1 - mean((arpred.month - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.4170418
It appears that including this 12-level factor representation doesn’t significantly improve the performance of the model. Before, without the 12-level month factor, the \(R^2\) was equal to 0.413. Now, with the 12-level month factor, the \(R^2\) was equal to 0.417. This is hardly an improvement.
Now let’s fit our RNN. The RNN model expects a sequence of \(L=5\) feature vectors \(X = {X_{\ell}}_{1}^{L}\) These are lagged versions of the time series going back \(L\) time points.
n <- nrow(arframe)
xrnn <- data.matrix(arframe[, -1])
xrnn <- array(xrnn, c(n, 3, 5))
xrnn <- xrnn[,, 5:1]
xrnn <- aperm(xrnn, c(1,3,2))
dim(xrnn)
## [1] 6046 5 3
model.rnn <- keras_model_sequential() %>%
layer_flatten(input_shape=c(5,3)) %>%
layer_dense(units=1)
Compile our model.
model.rnn %>% compile(loss='mse', optimizer=optimizer_rmsprop())
Fit our model to the data.
history <- model.rnn %>% fit(
xrnn[istrain,, ], arframe[istrain, "log_volume"], epochs=200,
batch_size = 64, validation_data =
list(xrnn[!istrain,, ], arframe[!istrain, "log_volumne"])
)
Now let’s check our test set \(R^2\).
npred <- predict(model.rnn, xrnn[!istrain,,])
test.Rsquared <- 1 - mean((arframe[!istrain, "log_volume"] - npred)^2)/V0
test.Rsquared
## [1] 0.4124667
It appears that our \(R^2\) for this RNN linear AR model is the same as that of just a simple lm() function call.
The advantage of using lm() is that the interpretation of the coefficients is easier to communicate.The disadvantage of using lm() is that we assume a normality condition on the residuals, which may or may not be true.
Let’s reload the data incase we mutated anything.
library(ISLR2)
xdata <- data.matrix(NYSE[, c("DJ_return", "log_volume", "log_volatility")])
istrain <- NYSE[,"train"]
xdata <- scale(xdata)
We will need to create the AR dataframe and then add day of week as a one-hot vector (of length 5) for each observation.
arframe <- data.frame(log_volume=xdata[, "log_volume"],
L1 = lagm(xdata,1), L2=lagm(xdata,2),
L3 = lagm(xdata,3), L4=lagm(xdata,4),
L5 = lagm(xdata,5))
arframe <- arframe[-(1:5), ]
istrain <- istrain[-(1:5)]
day=NYSE[-(1:5), "day_of_week"]
arframed <- data.frame(day=NYSE[-(1:5), "day_of_week"], arframe)
x <- model.matrix(log_volume ~. -1, data=arframed)
#first five columns are the factor one-hot vector for day of week.
day.feature = x[,c(1:5)]
arframe.day <- data.frame(arframe$log_volume,
arframe[,c(2:4)], day.feature,
arframe[,c(5:7)], day.feature,
arframe[,c(8:10)], day.feature,
arframe[,c(11:13)], day.feature,
arframe[,c(14:16)], day.feature
)
Let’s set up our RNN data structure.
xrnn.day <- data.matrix(arframe.day[, -1])
xrnn.day <- array(xrnn.day, c(n, 8, 5))
xrnn.day <- xrnn.day[,, 5:1]
xrnn.day <- aperm(xrnn.day, c(1,3,2))
Now let’s compile our model.
arnn.day <- keras_model_sequential() %>%
layer_simple_rnn(units = 12,
input_shape = list(5,8),
dropout = 0.1, recurrent_dropout = 0.1) %>%
layer_dense(units = 1)
arnn.day %>% compile(optimizer = optimizer_rmsprop(),
loss = 'mse')
Now let’s fit our model
history <- arnn.day %>% fit(
xrnn.day[istrain,,], arframe[istrain, "log_volume"],
batch_size = 64, epochs = 200,
validation_data =
list(xrnn.day[!istrain,,], arframe[!istrain, "log_volume"])
)
Retry a \(R^2\). Let’s see how our model does on the test set.
kpred <- predict(arnn.day, xrnn.day[!istrain,,])
1 - mean((kpred - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.4561137
As we can see, this new RNN model with variable day_of_week included explains about 5% more of the variation than the RNN model without variable day_of_week that the textbook completed on page 457.
There we used a dictionary of size 10,000. Consider the effects of varying the dictionary size. Try the values
1,000
3,000
5,000
10,000
and compare the results.
Load our data and one_hot() code our documents into feature vectors which then can joined into a feature matrix. Since we are dealing with a sparse matrix we will use the Matrix library to create a sparseMatrix() object that will save considerable memory space
max_features <- 1000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb
library(Matrix)
one_hot <- function(sequences, dimension) {
seqlen <- sapply(sequences, length)
n <- length(seqlen)
rowind <- rep(1:n, seqlen)
colind <- unlist(sequences)
sparseMatrix(i = rowind, j=colind, dims = c(n, dimension))
}
x_train_1h <- one_hot(x_train, 1000)
x_test_1h <- one_hot(x_test, 1000)
dim(x_train_1h)
## [1] 25000 1000
We can see above that our x_train_1h is a matrix of size 25,000 x 1,000.
Let’s set up our training and validation sets. We’ll create a size 2,000 validation set, which leaves 23,000 documents to train on.
set.seed(3)
ival <- sample(seq(along=y_train), 2000)
Next we fit a fully-connected neural network with two hidden layers, each with 16 units and ReLU activation.
Define model and compile it.
model <- keras_model_sequential() %>%
layer_dense(units = 16, activation ='relu', input_shape=c(1000)) %>%
layer_dense(units=16, activation='relu') %>%
layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('accuracy'))
Fit to our data:
history <- model %>% fit(x_train_1h[-ival,], y_train[-ival], epochs=20, batch_size=512, validation_data = list(x_train_1h[ival, ], y_train[ival]))
Now we can check the accuracy of the model on the validation set.
predict_x=model %>% predict(x_train_1h[ival,])
predict_class_x <- as.numeric(predict_x > 0.5)
correct.accuracy.1000 <- mean(y_train[ival] == predict_class_x)
correct.accuracy.1000
## [1] 0.8435
We see that our test accuracy when we use a dictionary of size 1,000 is approximately 0.851.
We repeat the steps above. Fewer comments will be made.
max_features <- 3000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb
x_train_1h <- one_hot(x_train, max_features)
x_test_1h <- one_hot(x_test, max_features)
model <- keras_model_sequential() %>%
layer_dense(units = 16, activation ='relu', input_shape=c(max_features)) %>%
layer_dense(units=16, activation='relu') %>%
layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('accuracy'))
history <- model %>% fit(x_train_1h[-ival,], y_train[-ival], epochs=20, batch_size=512, validation_data = list(x_train_1h[ival, ], y_train[ival]))
predict_x=model %>% predict(x_train_1h[ival,])
predict_class_x <- as.numeric(predict_x > 0.5)
correct.accuracy.3000 <- mean(y_train[ival] == predict_class_x)
correct.accuracy.3000
## [1] 0.857
We see that our test accuracy when we use a dictionary of size 3,000 is approximately 0.857.
max_features <- 5000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb
x_train_1h <- one_hot(x_train, max_features)
x_test_1h <- one_hot(x_test, max_features)
model <- keras_model_sequential() %>%
layer_dense(units = 16, activation ='relu', input_shape=c(max_features)) %>%
layer_dense(units=16, activation='relu') %>%
layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('accuracy'))
history <- model %>% fit(x_train_1h[-ival,], y_train[-ival], epochs=20, batch_size=512, validation_data = list(x_train_1h[ival, ], y_train[ival]))
predict_x=model %>% predict(x_train_1h[ival,])
predict_class_x <- as.numeric(predict_x > 0.5)
correct.accuracy.5000 <- mean(y_train[ival] == predict_class_x)
correct.accuracy.5000
## [1] 0.862
We see that our test accuracy when we use a dictionary of size 5,000 is approximately 0.8625.
max_features <- 10000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb
x_train_1h <- one_hot(x_train, max_features)
x_test_1h <- one_hot(x_test, max_features)
model <- keras_model_sequential() %>%
layer_dense(units = 16, activation ='relu', input_shape=c(max_features)) %>%
layer_dense(units=16, activation='relu') %>%
layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('accuracy'))
history <- model %>% fit(x_train_1h[-ival,], y_train[-ival], epochs=20, batch_size=512, validation_data = list(x_train_1h[ival, ], y_train[ival]))
predict_x=model %>% predict(x_train_1h[ival,])
predict_class_x <- as.numeric(predict_x > 0.5)
correct.accuracy.10000 <- mean(y_train[ival] == predict_class_x)
correct.accuracy.10000
## [1] 0.8585
We see that our test accuracy when we use a dictionary of size 10,000 is approximately 0.866.
Model | Accuracy |
---|---|
Dictionary Size 1,000 | 0.8435 |
Dictionary Size 3,000 | 0.8570 |
Dictionary Size 5,000 | 0.8620 |
Dictionary Size 10,000 | 0.8585 |
We can see that as the dictionary size increases, so does the accuracy, albeit marginally.