Which of the following are components in building a machine learning algorithm?
Answer:
Asking the right question.
Suppose we build a prediction algorithm on a data set and it is 100% accurate on that data set. Why might the algorithm not work well if we collect a new data set?
Answer:
Our algorithm may be overfitting the training data, predicting both the signal and the noise.
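As a rough illustration of this answer, the sketch below simulates a data set with a simple signal (`x > 0.5`) plus 20% label noise, then "trains" a 1-nearest-neighbour rule that memorizes every training point. The simulated data, the noise level, and the helper names `make_data` and `predict_1nn` are all assumptions made for this example; they are not part of the quiz.

```r
set.seed(1)

# Simulate data: the signal is x > 0.5, and 20% of labels are flipped (noise).
make_data <- function(n) {
  x <- runif(n)
  signal <- as.integer(x > 0.5)
  flip <- rbinom(n, 1, 0.2)
  y <- ifelse(flip == 1, 1L - signal, signal)
  data.frame(x = x, y = y)
}

training <- make_data(200)
new_data <- make_data(200)

# 1-nearest-neighbour: predict the label of the closest training point.
# On the training set this simply memorizes every label, noise included.
predict_1nn <- function(x_new, train) {
  sapply(x_new, function(x) train$y[which.min(abs(train$x - x))])
}

train_acc <- mean(predict_1nn(training$x, training) == training$y)
new_acc <- mean(predict_1nn(new_data$x, training) == new_data$y)

c(training = train_acc, new_data = new_acc)
```

The training accuracy is 100% because each point is its own nearest neighbour, while accuracy on the freshly simulated data is noticeably lower, since the memorized noise does not carry over.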
What are typical sizes for the training and test sets?
Answer:
60% in the training set, 40% in the testing set.
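A 60/40 split like this can be sketched in base R as follows. The built-in `iris` data set and the seed are stand-ins chosen only for illustration; any data frame could be split the same way.

```r
set.seed(42)

n <- nrow(iris)

# Randomly pick 60% of the row indices for the training set.
train_idx <- sample(seq_len(n), size = round(0.6 * n))

training <- iris[train_idx, ]
testing <- iris[-train_idx, ]  # the remaining 40%

c(training = nrow(training), testing = nrow(testing))
```

Sampling row indices at random, rather than taking the first 60% of rows, avoids carrying any ordering in the data (here, `iris` is sorted by species) into the split.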
What are some common error rates for predicting binary variables (i.e. variables with two possible values like yes/no, disease/normal, clicked/didn’t click)? Check the correct answer(s).
Answer:
Predictive value of a positive \(PPV = \frac{TP}{TP + FP}\)
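For reference, PPV and the other usual binary error rates can all be computed from the four cells of a confusion matrix. The function name `binary_error_rates` and the counts passed to it below are made up for illustration only.

```r
# Compute common error rates from confusion-matrix counts.
# tp/fp/fn/tn = true positives, false positives, false negatives, true negatives.
binary_error_rates <- function(tp, fp, fn, tn) {
  c(
    sensitivity = tp / (tp + fn),  # P(predicted + | actual +)
    specificity = tn / (tn + fp),  # P(predicted - | actual -)
    ppv         = tp / (tp + fp),  # P(actual + | predicted +)
    npv         = tn / (tn + fn),  # P(actual - | predicted -)
    accuracy    = (tp + tn) / (tp + fp + fn + tn)
  )
}

# Illustrative counts, not taken from any real data set.
rates <- binary_error_rates(tp = 40, fp = 10, fn = 5, tn = 45)
round(rates, 3)
```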
Suppose that we have created a machine learning algorithm that predicts whether a link will be clicked with 99% sensitivity and 99% specificity. The rate the link is clicked is 1/1000 of visits to a website. If we predict the link will be clicked on a specific visit, what is the probability it will actually be clicked?
Answer:
For a hypothetical 100,000 visits (100 of which end in a click), the confusion matrix will be:
| | Actual + | Actual - |
|---|---|---|
| Predicted + | TP = 99 | FP = 999 |
| Predicted - | FN = 1 | TN = 98901 |
By the formula from the previous question, the positive predictive value (PPV) will be
```r
ppv <- 99 / (99 + 999)
ppv_percentage <- ppv * 100
ppv_percentage
## [1] 9.016393
```
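So the PPV is about 9%. Even with 99% sensitivity and 99% specificity, most predicted clicks are false positives, because actual clicks are so rare.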
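The same figure can be reached without building a confusion matrix by applying Bayes' theorem directly to the sensitivity, specificity, and click rate given in the question. The variable names below are chosen for this sketch.

```r
sensitivity <- 0.99      # P(predicted click | actual click)
specificity <- 0.99      # P(predicted no click | no click)
prevalence  <- 1 / 1000  # P(actual click)

# Bayes' theorem: P(actual click | predicted click)
true_pos_rate  <- sensitivity * prevalence
false_pos_rate <- (1 - specificity) * (1 - prevalence)
ppv_bayes <- true_pos_rate / (true_pos_rate + false_pos_rate)

round(ppv_bayes * 100, 2)
```

This gives roughly 9%, in agreement with the count-based calculation, and makes it easy to see how the answer would change for a different click rate.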