Assignment 8B - Part I

Data Engineering and Mining II                      Paul Brown
Fall 2022                                                                                               
                                                              Assignment 8B
Performance Evaluation - Part 2 - Section A

Directions: Complete the following, then upload the assignment to GT Canvas.
 
    Fill-in-the-blanks:
1. A confusion matrix provides a summary of all of the ________ made compared to the _______.
    Predictions, actuals
    
    TRUE or FALSE:
2. The counts of the predicted class values are summarized by columns and the count of the actual values for each class value are predicted by rows.
    False
    
    Fill-in-the-blank:
3. A perfect set of predictions is shown as a _________  ___________  from the top left to the bottom right of the matrix.
    diagonal line
    
4.  Given the confusion matrix, what is the second index (going across)?
    the column for predicted values
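
    As an illustration only (not part of the assignment), the sketch below tallies a 2x2 confusion matrix in plain Python. It assumes the convention from question 4, where the second index (the column) is the predicted value and the first index (the row) is the actual value. The labels and the two lists are made-up example data.

```python
# Hypothetical example data; these labels are not taken from the assignment.
actual    = ["yes", "no", "yes", "yes", "no", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]

labels = ["no", "yes"]
index = {label: i for i, label in enumerate(labels)}

# matrix[row][col]: row = actual class, col = predicted class
matrix = [[0] * len(labels) for _ in labels]
for a, p in zip(actual, predicted):
    matrix[index[a]][index[p]] += 1

for label, row in zip(labels, matrix):
    print(label, row)
# no [2, 1]
# yes [1, 2]
```

    The correct predictions (2 and 2) sit on the diagonal from top left to bottom right; the off-diagonal cells are the errors.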
  
  TRUE or FALSE:
5. Data mining is about building models from data.
    True
  
6. Why do we build models in data mining?
     To gain insights into the world and how the world works so we can predict how things behave
  
7. What does predictive classification consist of?
     Obtaining models designed with the goal of forecasting the value of a nominal target variable using information on a set of predictors
  
8.  When dealing with predictive classification, the models are obtained by using what?
    A set of labeled observations of the phenomenon under study

9. Interpret the following, all of which are related to the confusion matrix: TP, TN, FP, FN.
    TP – True Positive
    TN – True Negative
    FP – False Positive
    FN – False Negative

10. What is the relation between accuracy and the error rate?
    accuracy = 1 – error rate

11. The confusion matrix is used to define what?
    Used to define the performance of a classification algorithm.
 
12.  In the formula for accuracy, the expression N + P or (TP+FP+FN+TN) represents what value?
    Total number of samples
 
13. Using the formula for accuracy, where TP = 56; TN = 15; FP = 3; and FN = 10, find and list the value of the accuracy. Show all work.
    Accuracy = (TP + TN) / (TP + TN + FP + FN)
             = (56 + 15) / (56 + 15 + 3 + 10)
             = 71 / 84
             ≈ 0.8452
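
    The same arithmetic can be checked with a short Python sketch. The variable names are chosen here for illustration; the counts are the ones given in question 13. It also confirms the relation from question 10, accuracy = 1 – error rate.

```python
# Counts from question 13.
tp, tn, fp, fn = 56, 15, 3, 10

total = tp + tn + fp + fn          # total number of samples
accuracy = (tp + tn) / total       # correctly classified / all samples
error_rate = (fp + fn) / total     # misclassified / all samples

print(total)                 # 84
print(round(accuracy, 4))    # 0.8452
print(round(error_rate, 4))  # 0.1548
```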
      
14. Look at the confusion matrix where benign tissue is called healthy and malignant tissue is considered cancerous. Which measurement metric of the classifier is considered a Type II error? What does it represent?
    FN – False Negative: represents the number of patients classified as benign who are actually malignant
 
15. List the performance metrics along with accuracy.
    precision, recall, and F1 score
 
16.  In the accuracy formula of the confusion matrix related to benign and malignant, what does TP + TN mean?
     True Positives + True Negatives – represents those items that are correctly classified

17. In the confusion matrix related to the benign and malignant, how is the precision of an algorithm represented?
     It is represented as the ratio of patients correctly classified as having the disease (TP) to the total number of patients predicted to have the disease (TP+FP).
     
18.  In the confusion matrix related to benign and malignant, how is the recall metric defined?
    This is the proportion of positive data points that are correctly classified as positive: TP/(TP+FN).
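
    As a hedged illustration, precision and recall can be computed in Python. Reusing the counts from question 13 is an assumption made here for the example; the assignment does not ask for these values.

```python
# Counts borrowed from question 13 purely for illustration.
tp, tn, fp, fn = 56, 15, 3, 10

precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much was found

print(round(precision, 4))  # 0.9492
print(round(recall, 4))     # 0.8485
```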

19. What is the intuition behind recall?
      Of the patients who actually have the disease, how many have been classified as having the disease.

    TRUE or FALSE:
20. Specificity is called the true positive (TP) rate.  
    False
    
    Fill-in-the-blank:
21. ______________ is also called the true negative (TN) rate.
    Specificity
    List the formula for sensitivity.
    Sensitivity (true positive rate) = TP/(FN+TP)
    
    TRUE or FALSE: 
22. Specificity (true negative rate) is the percentage of negative cases that are correctly predicted.
    True
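
    A minimal sketch of both rates as Python functions follows. The function names are hypothetical, and the counts passed in are again those of question 13, used only as sample input.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: share of actual positives predicted positive."""
    return tp / (fn + tp)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: share of actual negatives predicted negative."""
    return tn / (tn + fp)


print(round(sensitivity(tp=56, fn=10), 4))  # 0.8485
print(round(specificity(tn=15, fp=3), 4))   # 0.8333
```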
    
23. Sensitivity has values in the range of _______________ .
    Zero to one

24. Is the F1 score a model performance metric? Yes or No
    Yes
    
25. What is the range of the F1 score?
    [0,1]
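
    For reference, the F1 score is the harmonic mean of precision and recall. The sketch below assumes the counts from question 13 as sample input; the variable names are illustrative.

```python
# Counts borrowed from question 13 purely for illustration.
tp, fp, fn = 56, 3, 10

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Harmonic mean of precision and recall; always falls in [0, 1].
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 3))  # 0.896
```

    The same value follows from the equivalent count form 2TP / (2TP + FP + FN) = 112 / 125.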

26. What is a classifier?
    It is a supervised function (a machine learning tool), where the learned (target) attribute is categorical (“nominal”).

    Fill-in-the-blank:
27. The ___________ tells you how precise your classifier is, as well as how _____________ it is.
    F-measure, robust
    
28. We use the metrics accuracy and kappa when dealing with categorical problems. When do we use the metric RMSE?
    when dealing with numerical or real number problems.
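
    As a small illustration, RMSE (root mean squared error) can be computed as follows. The two lists are made-up numeric values, not data from the course.

```python
import math

# Hypothetical numeric targets and predictions.
actual = [3.0, 5.0, 2.0]
predicted = [2.0, 5.0, 4.0]

squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

print(round(rmse, 4))  # 1.291
```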
    
29. The area under the ROC curve is used to measure the quality of a classification model. What happens to the performance of the model as the area under the ROC curve gets larger?
    Performance improves

30. Sensitivity and specificity have values in the range ____________ .
    [0,1]
    
31. The F1 score has perfect precision and recall at which value?
    1 (the F1 score ranges between 0 and 1, and 1 is its best value)
    
32. Which metric is the percent of negative cases correctly predicted?

    Specificity (True Negative Rate or TNR)

    Fill-in-the-blank:
33. The ___________ and ___________ class values are the key to evaluation.
    Actual, predicted
    
34. In the confusion matrix (or error matrix) for the decision tree model on weather.csv [test] (%), what does the true positive tell us?
    That on 10 of the 56 days (18%), rain was predicted and it did rain
    
35. When working with the confusion matrix, the predictions are compared with ________ _______ , and the true and false positives and negatives are calculated.
    Actual values

36. What is the random forest algorithm used for [in classification]?
     Random forest is a supervised machine learning algorithm that is used widely in classification problems. It builds decision trees on different samples and takes their majority vote for classification.

2/10/2023