General Instructions

You should submit this exercise as an R Markdown document like this one (the source file, Assignment2.Rmd, is in the Teams file share folder). The code should be visible, included in chunks, and should run with no errors. When you are confident, knit the document into an HTML file and submit it. Use any of the Rmd files provided as class material as a guide.

Training a logistic regression model

The Data

We want to train a logistic regression model on a dataset with 2 feature variables. The data is in the file ex2data2.txt in the usual Teams file share folder for the class. Here is a plot of the 2 features, with each point marked by the binary target variable Y (crosses are 1s and circles are 0s).

#Read the data
library(data.table)
library(ggplot2)

dataset<-as.matrix(fread("ex2data2.txt"))
colnames(dataset)<-c("x1","x2","y")

d<-as.data.frame(dataset)

d$y<-as.factor(d$y)

ggplot(d, aes(x = x1, y = x2, colour = y)) + geom_point(aes(shape = y), size = 3, stroke = 2) + scale_shape_manual(values = c(1, 3))

The training function

In the Overfitting.Rmd file in the same shared folder you will find a number of functions for logistic regression, regularization and related tasks. Use the training function to fit the classifier on a subsample of the data (the training set), then use the learned parameters to predict the labels on the test set. Use a 70/30 training/test split.

Use the code in the Overfitting.Rmd file to do the split. Use a seed of 100 before splitting.

set.seed(100)
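As a rough illustration of what a 70/30 split looks like, here is a minimal base R sketch. It is not the code from Overfitting.Rmd, which may work differently; the `toy` data frame and the names `train_idx`, `train_set` and `test_set` are assumptions made up for this example.

```r
# Sketch of a 70/30 split using base R only (an illustration,
# not the split code from Overfitting.Rmd).
set.seed(100)

# Toy stand-in for the real data: hypothetical values, same column names
toy <- data.frame(x1 = rnorm(20), x2 = rnorm(20), y = rbinom(20, 1, 0.5))

n <- nrow(toy)

# Draw 70% of the row indices at random, without replacement
train_idx <- sample(seq_len(n), size = round(0.7 * n))

train_set <- toy[train_idx, ]   # rows used for fitting
test_set  <- toy[-train_idx, ]  # remaining rows, held out for evaluation

nrow(train_set)
nrow(test_set)
```

Setting the seed before calling `sample` is what makes the split reproducible from one knit to the next.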

Report the parameter estimates for the solution. Then use the functions provided for calculating classification metrics (TPR, FPR, Precision, Recall) to report your classification accuracy for the training and test sets.
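For reference, the metrics named above all come from the four confusion-matrix counts. The sketch below shows the arithmetic on made-up labels; `classification_metrics_sketch` is a hypothetical helper written for this illustration, not one of the functions in Overfitting.Rmd, and it assumes labels coded as 0 and 1. Note that Recall is the same quantity as TPR.

```r
# Hypothetical helper: confusion-matrix counts and derived metrics
# for 0/1 labels. Not the function provided in Overfitting.Rmd.
classification_metrics_sketch <- function(actual, predicted) {
  tp <- sum(predicted == 1 & actual == 1)  # true positives
  fp <- sum(predicted == 1 & actual == 0)  # false positives
  tn <- sum(predicted == 0 & actual == 0)  # true negatives
  fn <- sum(predicted == 0 & actual == 1)  # false negatives
  c(
    TPR       = tp / (tp + fn),            # also called recall
    FPR       = fp / (fp + tn),
    Precision = tp / (tp + fp),
    Accuracy  = (tp + tn) / length(actual)
  )
}

# Made-up labels, purely to show the call
actual    <- c(1, 1, 1, 0, 0, 0, 0, 1)
predicted <- c(1, 0, 1, 0, 1, 0, 0, 1)
classification_metrics_sketch(actual, predicted)
```

Computing the same quantities separately on the training and test predictions makes the two sets directly comparable.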

What is your comment on the model performance? Was it expected given the data and the fitted function, and why?

Optional: Extra points for graphically representing your decision boundary on the training and test sets - use the last part of the Overfitting.Rmd for ideas on how to do that.

Adding Polynomial expansion terms and Overfitting

Improving the Model performance

Let’s try adding polynomial functions of the features to the hypothesis function, to make it more complex and hopefully better suited to discriminating between the 0s and 1s in the data. You should already have seen that the linear hypothesis did a pretty bad job at that.

The add.poly.features function in the Overfitting.Rmd collection of functions does that: given the desired polynomial degree, it produces matrices in which polynomial terms of the initial features are added as extra features. For example, if we ask for degree 2, the feature columns for squared X1, squared X2 and the cross-product X1X2 will be added to the matrix.
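To make the degree-2 case concrete, here is a small sketch of the expansion for two features. `poly2_sketch` is a hypothetical function written for this illustration; the actual add.poly.features may order or name its columns differently and handles higher degrees as well.

```r
# Hypothetical sketch of a degree-2 expansion for two features.
# add.poly.features in Overfitting.Rmd may order or name columns differently.
poly2_sketch <- function(X) {
  x1 <- X[, 1]
  x2 <- X[, 2]
  # Original features, their squares, and the cross-product
  cbind(x1 = x1, x2 = x2, x1_sq = x1^2, x2_sq = x2^2, x1_x2 = x1 * x2)
}

# Two made-up observations: (1, 2) and (3, 4)
X <- matrix(c(1, 2,
              3, 4), ncol = 2, byrow = TRUE)
poly2_sketch(X)
```

Each extra column is a fixed function of the original two features, so the model stays linear in its parameters while the decision boundary in the (x1, x2) plane can curve.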

Try degrees 2, 3, 4, 5 and 6. Report the classification accuracy metrics for each degree of polynomial expansion, for the training and (most importantly) the test set. Comment on your results, on the gap between training and test accuracy, and on the reason behind it.

Optional: Provide a plot of your decision boundary for each polynomial degree used, on the training set.

Plot in a simple graph the polynomial degrees used (e.g. 1, 2, 3, 4) against the corresponding training and test accuracy figures. What do you conclude about overfitting the data?