Data Engineering and Mining II Paul Brown
Fall 2022
Assignment 8B
Performance Evaluation - Part 2 - Section A
Directions: Complete the following and then go on to upload the assignment in GT Canvas.
Fill-in-the-blanks:
1. A confusion matrix provides a summary of all of the ________ made compared to the _______.
Predictions, actuals
TRUE or FALSE:
2. The counts of the predicted class values are summarized by columns and the count of the actual values for each class value are predicted by rows.
False
Fill-in-the-blank:
3. A perfect set of predictions is shown as a _________ ___________ from the top left to the bottom right of the matrix.
diagonal line
4. Given the confusion matrix, what is the second index (going across)?
the column for predicted values
TRUE or FALSE:
5. Data mining is about building models from data.
True
6. Why do we build models in data mining?
To gain insights into the world and how the world works so we can predict how things behave
7. What does predictive classification consist of?
Obtaining models designed with the goal of forecasting the value of a nominal target variable using information on a set of predictors
8. When dealing with predictive classification, the models are obtained by using what?
A set of labeled observations of the phenomenon under study
9. Interpret the following, all of which are related to the confusion matrix: TP, TN, FP, FN.
TP – True Positive
TN – True Negative
FP – False Positive
FN – False Negative
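As an illustration of how these four counts arise, here is a minimal sketch that tallies them from paired lists of actual and predicted labels. The function name `confusion_counts`, the `positive` argument, and the sample labels are assumptions made up for this example; they do not come from the assignment.

```python
def confusion_counts(actual, predicted, positive):
    """Tally TP, TN, FP, FN for a two-class problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p != positive and a != positive:
            tn += 1  # predicted negative, actually negative
        elif p == positive and a != positive:
            fp += 1  # predicted positive, actually negative
        else:
            fn += 1  # predicted negative, actually positive
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Made-up labels, for illustration only.
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]
counts = confusion_counts(actual, predicted, positive="yes")
print(counts)  # {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 1}
```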
10. What is the relation between accuracy and the error rate?
accuracy = 1 – error rate
11. The confusion matrix is used to define what?
Used to define the performance of a classification algorithm.
12. In the formula for accuracy, the expression N + P or (TP+FP+FN+TN) represents what value?
Total number of samples
13. Using the formula for accuracy, where TP = 56; TN = 15; FP = 3; and FN = 10, find and list the value of the accuracy. Show all work.
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (56 + 15) / (56 + 15 + 3 + 10) = 71/84 ≈ 0.8452
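The arithmetic for question 13 can be checked with a few lines of Python. This sketch also computes the error rate, to show the relation from question 10 (accuracy = 1 – error rate) on the same numbers; the variable names are chosen for this example only.

```python
# Counts given in question 13.
tp, tn, fp, fn = 56, 15, 3, 10

total = tp + tn + fp + fn      # total number of samples
accuracy = (tp + tn) / total   # correctly classified / all samples
error_rate = (fp + fn) / total # misclassified / all samples

print(total)                 # 84
print(round(accuracy, 4))    # 0.8452
print(round(error_rate, 4))  # 0.1548
```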
14. Look at the confusion matrix where benign tissue is called healthy and malignant tissue is considered cancerous. Which measurement metric of the classifier is considered a Type II error? What does it represent?
FN – False Negative – represents the number of patients classified as benign who are actually malignant
15. List the performance metrics along with accuracy.
precision, recall, and F1 score
16. In the accuracy formula of the confusion matrix related to benign and malignant, what does TP + TN mean?
True Positives + True Negatives – represents those items that are correctly classified
17. In the confusion matrix related to the benign and malignant, how is the precision of an algorithm represented?
It is represented as the ratio of correctly classified patients with the disease (TP) to the total patients predicted to have the disease (TP+FP).
18. In the confusion matrix related to benign and malignant, how is the recall metric defined?
This is the proportion of positive data points that are correctly considered as positive.
19. What is the perception behind recall?
Of all the patients who actually have the disease, how many have been classified as having the disease.
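To make the difference between precision and recall concrete, the sketch below computes both from raw counts. It reuses the counts from question 13 purely for illustration; the assignment does not ask for these values, and the function names are assumptions of this example.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive: TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that were found: TP / (TP + FN)."""
    return tp / (tp + fn)


# Counts borrowed from question 13, for illustration only.
tp, fp, fn = 56, 3, 10
p = precision(tp, fp)  # 56 / 59
r = recall(tp, fn)     # 56 / 66
print(round(p, 4))     # 0.9492
print(round(r, 4))     # 0.8485
```

Precision divides by everything the model predicted as positive, while recall divides by everything that actually is positive, which is why the two can differ on the same data.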
TRUE or FALSE:
20. Specificity is called the true positive (TP) rate.
False
Fill-in-the-blank:
21.______________ is also called the true negative (TN) rate.
Specificity
List the formula for sensitivity.
Sensitivity (true positive rate) = TP/(FN+TP)
TRUE or FALSE:
22. Specificity (true negative rate) is the percentage of negative cases that are correctly predicted.
True
23. Sensitivity has values in the range of _______________ .
Zero to one
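A short sketch of the two rates, assuming the standard definitions sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP). The counts are again borrowed from question 13 for illustration only. Because each numerator is part of its denominator, both values always fall between zero and one.

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


# Counts borrowed from question 13, for illustration only.
sens = sensitivity(tp=56, fn=10)  # 56 / 66
spec = specificity(tn=15, fp=3)   # 15 / 18
print(round(sens, 4))  # 0.8485
print(round(spec, 4))  # 0.8333
```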
24. Is the F1 score a model performance metric? Yes or No
Yes
25. What is the range of the F1 score?
[0,1]
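The range can be seen from the usual definition of the F1 score as the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below assumes that definition; returning 0.0 when both inputs are zero is a convention chosen for this example to avoid division by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall derived from the counts in question 13,
# used here for illustration only.
f1 = f1_score(56 / 59, 56 / 66)
print(round(f1, 3))        # 0.896
print(f1_score(1.0, 1.0))  # 1.0, the upper end of the range
print(f1_score(0.0, 0.0))  # 0.0, the lower end of the range
```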
26. What is a classifier?
It is a supervised function (a machine learning tool), where the learned (target) attribute is categorical (“nominal”).
Fill-in-the-blank:
27. The ___________ tells you how precise your classifier is, as well as how _____________ it is.
F-measure, robust
28. We use the metrics accuracy and kappa when dealing with categorical problems. When do we use the metric RMSE?
When dealing with numerical (real-valued) problems.
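For reference, RMSE (root mean squared error) is the square root of the average squared difference between actual and predicted values. A minimal sketch, with made-up numbers and a function name chosen for this example:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length numeric lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values, for illustration only.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]
error = rmse(actual, predicted)  # sqrt((1 + 0 + 4 + 1) / 4)
print(round(error, 4))           # 1.2247
```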
29. The area under the ROC curve is used to measure the quality of a classification model. What happens to the performance of the model as the area under the ROC curve gets larger?
Performance improves
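One way to see why a larger area means better performance: the area under the ROC curve equals the fraction of (positive, negative) pairs in which the model gives the positive case the higher score, with ties counted as one half. The sketch below computes the area that way rather than by plotting the curve. The function name `auc` and the scores are made up for this example.

```python
def auc(actual, scores, positive=1):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, for illustration only.
labels = [1, 1, 0, 0]
print(auc(labels, [0.9, 0.4, 0.6, 0.2]))  # 0.75, one pair is mis-ranked
print(auc(labels, [0.9, 0.8, 0.3, 0.1]))  # 1.0, every pair is ranked correctly
```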
30. Sensitivity and specificity have values in the range ____________ .
[0,1]
31. The F1 score has perfect precision and recall at which value?
1 (the F1 score ranges between 0 and 1, and reaches 1 only when precision and recall are both perfect)
32. Which metric is the percent of negative cases correctly predicted?
Specificity (True Negative Rate or TNR)
Fill-in-the-blank:
33. The ___________ and ___________ class values are the key to evaluation.
Actual, predicted
34. In the confusion matrix (or error matrix) for the decision tree model on weather.csv [test] (%), what does the true positive tell us?
That on 10 of the 56 days (18%), rain was predicted and it did rain
35. When working with the confusion matrix, the predictions are compared with ________ _______ , and the true and false positives and negatives are calculated.
Actual values
36. What is the random forest algorithm used for [in classification]?
Random forest is a supervised machine learning algorithm that is widely used in classification problems. It builds decision trees on different samples of the data and takes their majority vote as the predicted class.
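The two ideas in this answer, sampling the data differently for each tree and then voting, can be sketched without any library. This is not a full random forest: no trees are trained, and the per-tree predictions are made-up stand-ins. The helper names `bootstrap_sample` and `majority_vote` are assumptions of this example.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, one such sample per tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(tree_predictions):
    """Return the class predicted by the most trees.

    On a tie, Counter.most_common keeps the class seen first.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)     # fixed seed so the sketch is repeatable
rows = list(range(10))     # stand-in for ten training rows
sample = bootstrap_sample(rows, rng)
print(len(sample))         # 10, same size as the original data

# Made-up predictions from three hypothetical trees for one day.
votes = ["rain", "no rain", "rain"]
print(majority_vote(votes))  # rain
```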