Question 1

Which of the following are components in building a machine learning algorithm?

Answer : Collecting data to answer the question

Question 2

Suppose we build a prediction algorithm on a data set and it is 100% accurate on that data set. Why might the algorithm not work well if we collect a new data set?

Answer : Our algorithm may be overfitting the training data, predicting both the signal and the noise.
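
As a rough illustration of this answer, here is a minimal sketch on simulated data. Everything in it is an assumption made for the example: the signal is "y = 1 when x > 0.5", 20% of the labels are flipped as noise, and the overfitting model is a 1-nearest-neighbour rule that simply memorizes the training points.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def make_data(n, noise=0.2):
    """Signal: y = 1 when x > 0.5. Noise: each label is flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

def predict_1nn(train, x):
    """Memorizing model: copy the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def predict_threshold(x):
    """Simple model that captures only the signal."""
    return int(x > 0.5)

def accuracy(predict, data):
    """Fraction of points whose predicted label matches the observed label."""
    return sum(predict(x) == y for x, y in data) / len(data)

train = make_data(200)
new = make_data(2000)  # stands in for a newly collected data set

train_acc_1nn = accuracy(lambda x: predict_1nn(train, x), train)
new_acc_1nn = accuracy(lambda x: predict_1nn(train, x), new)
new_acc_threshold = accuracy(predict_threshold, new)

print(train_acc_1nn, new_acc_1nn, new_acc_threshold)
```

The memorizing model is perfect on the data it was built on because it reproduces the flipped labels too, and that is what costs it accuracy on the new sample compared with the simple threshold rule.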

Question 3

What are typical sizes for the training and test sets?

Answer : 80% training set, 20% test set.
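
One way such a split could be made is sketched below. The helper name `train_test_split`, the seed, and the toy list of 100 rows are assumptions for the example, not part of the quiz.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows, then cut it into a training and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # shuffle so the split is random, not ordered
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # hypothetical data set of 100 observations
train, test = train_test_split(rows)
print(len(train), len(test))  # 80 20
```

Shuffling before cutting matters when the rows are stored in a meaningful order (for example by date or by class), since otherwise the test set would not resemble the training set.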

Question 4

What are some common error rates for predicting binary variables (i.e. variables with two possible values like yes/no, disease/normal, clicked/didn’t click)? Check the correct answer(s).

Answer : Predictive value of a positive.
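
For reference, the predictive value of a positive and the related rates can be computed from the four cells of a confusion matrix. The sketch below uses made-up counts (TP=40, FN=10, FP=5, TN=45) purely for illustration, and the function name is hypothetical.

```python
def binary_error_rates(tp, fn, fp, tn):
    """Common rates for a binary predictor, from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),                # true positives among actual positives
        "specificity": tn / (tn + fp),                # true negatives among actual negatives
        "positive_predictive_value": tp / (tp + fp),  # true positives among predicted positives
        "negative_predictive_value": tn / (tn + fn),  # true negatives among predicted negatives
    }

# Hypothetical counts, not taken from any real data set.
rates = binary_error_rates(tp=40, fn=10, fp=5, tn=45)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```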

Question 5

Suppose that we have created a machine learning algorithm that predicts whether a link will be clicked with 99% sensitivity and 99% specificity. The rate the link is clicked is 1/1000 of visits to a website. If we predict the link will be clicked on a specific visit, what is the probability it will actually be clicked?

Answer : By definition we have:

* sensitivity = TP/(TP+FN)
* specificity = TN/(TN+FP)
* prevalence = (TP+FN)/(TP+FN+TN+FP)

and we know that :

  1. TP = (TP+FN) × sensitivity, FP = (TN+FP) × (1−specificity)
  2. sensitivity × prevalence = TP/(TP+FN+TN+FP)
  3. (1−specificity) × (1−prevalence) = FP/(TP+FN+TN+FP)

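We want to compute the probability that the link is actually clicked given that we predicted a click, i.e. the predictive value of a positive: p = Pr(click + | test click +) = TP/(TP+FP)

Dividing the numerator and denominator by TP+FN+TN+FP and substituting 2 and 3 gives p = sensitivity × prevalence / (sensitivity × prevalence + (1−specificity) × (1−prevalence)).

So p = (10^-3 × 0.99)/(10^-3 × 0.99 + 0.01 × 0.999) ≈ 9%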
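
The arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation above; the function name `positive_predictive_value` is chosen for the example.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Pr(actual positive | predicted positive), from the three rates."""
    true_pos = sensitivity * prevalence               # TP / total
    false_pos = (1 - specificity) * (1 - prevalence)  # FP / total
    return true_pos / (true_pos + false_pos)

p = positive_predictive_value(sensitivity=0.99, specificity=0.99, prevalence=0.001)
print(f"{p:.1%}")  # 9.0%
```

The same figure falls out of plain counts: in 1,000,000 visits there are 1,000 clicks, of which 990 are predicted (TP), and 999,000 non-clicks, of which 9,990 are wrongly predicted as clicks (FP), so p = 990/(990+9,990) ≈ 9%. Because clicks are so rare, false positives outnumber true positives roughly ten to one even with 99% specificity.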