Which of the following are steps in building a machine learning algorithm?
Solution:
Evaluating the prediction.
Suppose we build a prediction algorithm on a data set and it is 100% accurate on that data set. Why might the algorithm not work well if we collect a new data set?
Solution:
Our algorithm may be overfitting the training data, predicting both the signal and the noise.
What are typical sizes for the training and test sets?
Solution:
60% in the training set, 40% in the testing set. If our sample size ius quite large, we could have 20% each for test set and validation set.
What are some common error rates for predicting binary variables (i.e. variables with two possible values like yes/no, disease/normal, clicked/didn’t click)?
Solution:
Predictive value of a positive, which is defined as \(\frac{TP}{TP + FP}\).
Suppose that we have created a machine learning algorithm that predicts whether a link will be clicked with 99% sensitivity and 99% specificity. The rate the link is clicked is 1/1000 of visits to a website. If we predict the link will be clicked on a specific visit, what is the probability it will actually be clicked?
Solution:
Suppose there are 100000 links under study. Based on the information, we can construct a table below
Truth | ||||
---|---|---|---|---|
Clicked | Not Clicked | |||
Predict | Clicked | 99 | 999 | 1098 |
Not clicked | 1 | 98901 | 98902 | |
100 | 99900 | 100000 |
So the answer will be 99 / (99 + 999) = 9%.