Question 1

Which of the following are steps in building a machine learning algorithm?

Solution:

Evaluating the prediction.

Question 2

Suppose we build a prediction algorithm on a data set and it is 100% accurate on that data set. Why might the algorithm not work well if we collect a new data set?

Solution:

Our algorithm may be overfitting the training data, predicting both the signal and the noise.
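A minimal sketch of this effect, under assumed synthetic data: the label follows a simple rule (`x > 0.5`, the signal) but is flipped 20% of the time (the noise). The 1-nearest-neighbor "memorizer" and every name below are illustrative choices, not part of the quiz. Because each training point is its own nearest neighbor, the memorizer reproduces the training labels exactly, noise included.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def make_data(n):
    """Generate (x, label) pairs: signal is x > 0.5, with 20% of labels flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5            # the signal
        if rng.random() < 0.2:     # the noise: flip 20% of labels
            label = not label
        data.append((x, label))
    return data

def nearest_neighbor_predict(train, x):
    """Return the label of the training point closest to x (memorizes the training set)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def accuracy(predict, data):
    """Fraction of points whose predicted label matches the recorded label."""
    return sum(predict(x) == label for x, label in data) / len(data)

train = make_data(300)
new = make_data(2000)   # a freshly collected data set from the same process

def memorizer(x):
    return nearest_neighbor_predict(train, x)

def simple_rule(x):
    return x > 0.5      # captures only the signal

train_acc_memorizer = accuracy(memorizer, train)
new_acc_memorizer = accuracy(memorizer, new)
new_acc_simple = accuracy(simple_rule, new)

print("memorizer on training data:", train_acc_memorizer)
print("memorizer on new data:     ", new_acc_memorizer)
print("simple rule on new data:   ", new_acc_simple)
```

The memorizer is perfect on the data it was built from but loses accuracy on the new sample, while the rule that ignores the noise should hold up better there.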

Question 3

What are typical sizes for the training and test sets?

Solution:

60% in the training set, 40% in the testing set. If our sample size is quite large, we could instead use 60% for training and 20% each for the test set and the validation set.
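As a rough illustration of such a split, the sketch below shuffles row indices and cuts them into three groups. The function name `split_indices`, its parameters, and the sample size of 1000 are assumptions made for the example, not something the quiz prescribes.

```python
import random

def split_indices(n, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle the indices 0..n-1 and cut them into train / test / validation lists.

    Whatever is left after the training and test fractions goes to validation.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # shuffle so the split is random
    n_train = round(n * train_frac)
    n_test = round(n * test_frac)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]
    return train, test, validation

train, test, validation = split_indices(1000)
print(len(train), len(test), len(validation))  # 600 200 200
```

Passing `test_frac=0.4` gives the plain 60/40 split with an empty validation set.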

Question 4

What are some common error rates for predicting binary variables (i.e. variables with two possible values like yes/no, disease/normal, clicked/didn’t click)?

Solution:

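Positive predictive value (the predictive value of a positive), which is defined as \(\frac{TP}{TP + FP}\), where \(TP\) is the number of true positives and \(FP\) the number of false positives.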
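For context, here is a small sketch that computes this measure alongside other standard error rates from the four cells of a confusion matrix. The function name and the example counts are made up for illustration.

```python
def binary_error_rates(tp, fp, fn, tn):
    """Compute common error rates from confusion-matrix counts.

    tp / fp: true / false positives, fn / tn: false / true negatives.
    """
    return {
        "sensitivity": tp / (tp + fn),                # true positive rate
        "specificity": tn / (tn + fp),                # true negative rate
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical counts, chosen only to exercise the function.
rates = binary_error_rates(tp=40, fp=10, fn=20, tn=30)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```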

Question 5

Suppose that we have created a machine learning algorithm that predicts whether a link will be clicked with 99% sensitivity and 99% specificity. The link is clicked on 1/1000 of visits to a website. If we predict the link will be clicked on a specific visit, what is the probability it will actually be clicked?

Solution:

Suppose there are 100000 visits under study. Based on the information given, we can construct the table below.

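| | Truth: Clicked | Truth: Not clicked | Total |
| --- | --- | --- | --- |
| Predicted: Clicked | 99 | 999 | 1098 |
| Predicted: Not clicked | 1 | 98901 | 98902 |
| Total | 100 | 99900 | 100000 |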

So the answer is 99 / (99 + 999) = 99 / 1098 ≈ 9%.
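The same figure follows from Bayes' rule without fixing a number of visits. The sketch below is one way to check the arithmetic; the function name and argument names are illustrative.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(actually clicked | predicted clicked), via Bayes' rule."""
    true_pos = sensitivity * prevalence                # P(predicted clicked and clicked)
    false_pos = (1 - specificity) * (1 - prevalence)   # P(predicted clicked and not clicked)
    return true_pos / (true_pos + false_pos)

ppv = positive_predictive_value(sensitivity=0.99, specificity=0.99, prevalence=0.001)
print(f"{ppv:.4f}")  # 0.0902
```

Even with 99% sensitivity and specificity, the low click rate means false positives outnumber true positives by roughly ten to one.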