2023-04-12

LTS from last class before we go back to Evaluation Metrics

Review Exercise. LTS (Left to Student, from last class). Recall that we can use the predict function with type="response" when we have a logistic regression model to get the probability of the outcome equaling 1, given some data.

  • What are we doing below with a Shoptastic Customer?
prob_of_purchase_given_data <- predict(shop_model,
                                       newdata = data.frame(Time_Spent = 25,
                                                            Items_In_Cart = 1,
                                                            Promo_Given = 1,
                                                            Pages_Viewed = 2),
                                       type = "response")
  • How can I leverage the predict command to show that the odds decrease by a factor of Q = 0.9915395 when a customer spends another minute on the website and nothing else changes about their data? (One possible approach is sketched after the chunk below.)
#do your work here.
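One way to check your answer afterward (a sketch, not necessarily how we did it in class): predict the probability at Time_Spent = 25 and again at 26 with everything else held fixed, convert each probability to odds, and take the ratio. It should match exp of the Time_Spent coefficient.

# Compare the odds at Time_Spent = 25 vs. 26, all else fixed.
p_25 <- predict(shop_model,
                newdata = data.frame(Time_Spent = 25, Items_In_Cart = 1,
                                     Promo_Given = 1, Pages_Viewed = 2),
                type = "response")
p_26 <- predict(shop_model,
                newdata = data.frame(Time_Spent = 26, Items_In_Cart = 1,
                                     Promo_Given = 1, Pages_Viewed = 2),
                type = "response")

odds_25 <- p_25 / (1 - p_25)
odds_26 <- p_26 / (1 - p_26)
odds_26 / odds_25                      # should come out to Q = 0.9915395
exp(coef(shop_model)["Time_Spent"])    # same number, straight from the coefficient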

Predicting if a follower will like an Instagram post: Separating the ‘double-tappers’ from the ‘scrollers’

Exercise 1 Using a threshold probability of 0.5, calculate the accuracy of the model. (A sketch covering Exercises 1 and 2 appears after Exercise 2 below.)

Exercise 2 Still using this same threshold probability, calculate:

  • the sensitivity (also known as true positive rate or recall). After you find it, store your answer in TPR_half
#TPR_half<-
  • the proportion of instances where the model correctly predicted that a follower will not like a post, among all instances where the follower actually did not like the post. After you find it, subtract it from 1 and store it in FPR_half.

    #FPR_half<- 1 - #proportion of TN who were correctly 
                    #identified as such by the model.
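Here is a hedged sketch covering Exercises 1 and 2. The object names below (insta_model, insta_test, and the 0/1 outcome column liked) are placeholders, so swap in whatever names we actually used for the Instagram model and data.

# Hypothetical names: insta_model (fitted logistic regression), insta_test (data),
# liked (0/1 outcome: 1 = liked the post, 0 = did not).
prob_like <- predict(insta_model, newdata = insta_test, type = "response")
pred_like <- ifelse(prob_like > 0.5, 1, 0)     # threshold probability of 0.5

# Exercise 1: accuracy = proportion of correct predictions
mean(pred_like == insta_test$liked)

# Exercise 2: sensitivity (TPR), and FPR = 1 - (proportion of actual 0s predicted as 0)
TPR_half <- sum(pred_like == 1 & insta_test$liked == 1) / sum(insta_test$liked == 1)
FPR_half <- 1 - sum(pred_like == 0 & insta_test$liked == 0) / sum(insta_test$liked == 0)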

Evaluation Metrics that Are EZ to Get from a Confusion Matrix

              Actual 0   Actual 1
Predicted 0      TN         FN
Predicted 1      FP         TP
  • Accuracy: The proportion of correctly classified instances out of the total instances.

    • Exercise Given the definition above, what is the formula for accuracy then in terms of TP, TN, FP, and FN? (TP=True positive, TN=true negative, FP=false positive and FN=false negative)

    Accuracy = (TP+TN)/(TN+TP+FN+FP) (Answer- done last class)

  • Precision: The fraction of true positives among all instances predicted as positive.

    • Exercise What is the formula for precision (given the definition above) in terms of TP, TN, FP, and FN?

    Precision = TP/(FP+TP) (Answer- done last class)

  • Sensitivity (also known as: Recall or True Positive Rate (TPR)): The fraction of true positives among all actual positive instances.

    • Exercise What is the formula for sensitivity, then, in terms of TP, TN, FP, and FN?

    Sensitivity = TP/(FN+TP) (done last class)

  • F1 Score: \(F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)

    • The closer F1 is to 1, the better the model (means the model has high precision and recall). The F1 score is useful when the false positives and false negatives have different costs, and you want to find a model that strikes a good balance between them.

Below is new!

  • Specificity: The proportion of true negative instances among all actual negative instances.

    • Exercise What is the formula for specificity then in terms of TP, TN, FP, and FN?

Corollary. 1 - Specificity = false positive rate (how often the model classifies actual negative instances as positive, i.e., the FPR).

  • Proof: Do it! Hint: it’s regular algebra… (A sketch follows below if you want to check your work afterward.)
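In case you want to check your algebra afterward, here is a sketch of how the proof can go; it uses the standard formula Specificity = TN/(TN+FP), which is also the answer to the exercise above, so try that exercise first.

\[
1 - \text{Specificity} \;=\; 1 - \frac{TN}{TN+FP} \;=\; \frac{(TN+FP) - TN}{TN+FP} \;=\; \frac{FP}{TN+FP} \;=\; \text{FPR}.
\]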

Area Under Curve (AUC): Another Evaluation Metric

  • Area under curve (AUC): AUC is a measure of how well a classifier model can distinguish between positive (presence of characteristic) and negative (absence of characteristic) instances. For example, if the AUC of our heart disease model is 0.80, it means that if we randomly select a patient with heart disease and a patient without heart disease, there is an 80% chance that our model will correctly rank the patient with heart disease as having a higher risk than the patient without. We will calculate the AUC of our model below.

  • It is for binary classifiers.

  • If you have a multi-class classifier (we’ll see some multi-class ones today), you can still compute an AUC, but you need to do it in a special way: treat each class as a separate binary classification problem, with the other classes combined into a single negative class. You can then calculate the AUC for each of those binary problems and average the results to get an overall AUC for the multi-class classifier. You’ll get to try that out in HW 10 problem 2 (a small sketch appears after this list).

    • Remark 1 The closer the AUC is to 1, the better the classifier model’s performance. An AUC close to 1 means that the model achieves a higher TPR for a given FPR.

    • Remark 2 An AUC value below 0.5 suggests that the classifier is performing worse than random chance.

    • Remark 3 Below I calculated the AUC for the test set of logitModel, but you should do it for the train set as well and compare:

library(pROC)
#roc_obj_test <- roc(testSet$num, predictions_prob_test)
#roc_obj_test$auc 
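For the LTS part of Remark 3, a sketch of the train-set version, assuming the train-set predicted probabilities live in an object like predictions_prob_train (a hypothetical name mirroring the test-set one):

#roc_obj_train <- roc(trainSet$num, predictions_prob_train)
#roc_obj_train$auc   #compare with roc_obj_test$auc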
  • It turns out that AUC is the area under a curve called the Receiver Operating Characteristic (ROC) curve. The ROC curve is a plot of the true positive rate against the false positive rate at different threshold probabilities.
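And for the multi-class, one-vs-rest approach mentioned above, here is a rough sketch with made-up names (a data frame multi_preds holding the true class in actual plus one predicted-probability column per class); HW 10 problem 2 will have its own data and column names.

library(pROC)
# Hypothetical objects: multi_preds$actual (true class) and prob_A / prob_B / prob_C
# (the model's predicted probability for each class).
classes <- c("A", "B", "C")
aucs <- sapply(classes, function(cl) {
  binary_actual <- as.integer(multi_preds$actual == cl)   # this class vs. the rest
  as.numeric(roc(binary_actual, multi_preds[[paste0("prob_", cl)]])$auc)
})
aucs        # one AUC per one-vs-rest problem
mean(aucs)  # simple average = overall multi-class AUC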

Exercise: Calculate the accuracy, sensitivity, specificity, precision, and false positive rate for the test set of the heart logit model. (A sketch follows the LTS bullet below.)

  • LTS Do it also on the train set…
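Here is a sketch for the test set, reusing the object names from the ROC chunk above (testSet and predictions_prob_test) and assuming num is the 0/1 heart-disease outcome; for the LTS bullet, repeat the same steps with the train set.

pred_test <- ifelse(predictions_prob_test > 0.5, 1, 0)   # threshold of 0.5

TP <- sum(pred_test == 1 & testSet$num == 1)
TN <- sum(pred_test == 0 & testSet$num == 0)
FP <- sum(pred_test == 1 & testSet$num == 0)
FN <- sum(pred_test == 0 & testSet$num == 1)

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)
FPR         <- 1 - specificity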

More on Precision and Recall and F1 scores

  • High Precision: Precision is the proportion of true positive instances among all instances predicted as positive. A high precision model is desirable when the cost of false positives is high, i.e., when it is more important to minimize the number of false positives. Some situations where high precision is important include:

    • Spam email filtering: In this case, it’s more important to ensure that legitimate emails are not classified as spam (false positives), even if it means letting some spam emails through (false negatives).

    • Fraud detection: When detecting fraudulent transactions, it is crucial to minimize the number of false alarms (false positives) to reduce the inconvenience caused to customers whose legitimate transactions are flagged as suspicious.

    • Medical diagnosis: In certain medical tests, it’s essential to minimize false positives to reduce unnecessary medical procedures or treatments that may result from a false positive diagnosis.

versus

  • High Sensitivity (=High TPR = High Recall): Sensitivity is the proportion of true positive instances among all actual positive instances. A high sensitivity model is desirable when the cost of false negatives is high, i.e., when it is more important to minimize the number of false negatives. Some situations where high recall is important include:

    • Disease screening: In the context of screening for life-threatening diseases, it is crucial to identify as many positive cases as possible (high recall) to ensure timely treatment, even if it means having some false positives that require further testing.
    • Search engines: For search engines, it’s important to retrieve as many relevant documents as possible (high recall) to provide users with comprehensive search results, even if it means including some irrelevant documents (false positives).
    • Legal document discovery: In legal cases, it’s essential to find all relevant documents (high recall) to avoid missing critical information, even if it means reviewing some irrelevant documents (false positives).

Depending on the problem context, either high precision or high recall might be more important. In some cases, a balance between the two is desired, and the F1 score can be used to evaluate the model’s performance in such situations.

Visualize Your Way to a Good Laugh: A Punny Exploration of Logistic Regression Plots

funny_data<-read_csv("../data/laugh.csv")

Exercise Above, in the R chunk that generated this plot, I stored the actual points from the data and their predictions in the dataframe outcome_data. Run all the R chunks above if you have not already, and use outcome_data to compute the accuracy. It should match the value you see in the title above.

#so take a look first at outcome_data to get started...
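If you get stuck, here is a minimal sketch. The column names below (actual and predicted) are guesses, so run names(outcome_data) first and adjust them.

# Hypothetical column names -- check names(outcome_data) and adjust.
mean(outcome_data$predicted == outcome_data$actual)   # accuracy = proportion classified correctly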

Another Example Showing Logistic Regression Visualized

Imagine you work as a data analyst for a power company responsible for maintaining the power grid. The company wants to understand the relationship between daily average temperature and the likelihood of power grid failures within a day. You have collected data on the daily average temperature and whether there was a power grid failure on each day. Your goal is to analyze this data and determine if daily average temperature has an impact on power grid failure.

Let’s see your data. (It turns out this is simulated data that I made up.)

data_daily_temp<-read_csv("../data/daily_temp_.csv")
head(data_daily_temp,1)
## # A tibble: 1 × 2
##   temperature success
##         <dbl>   <dbl>
## 1        88.2       0

The temperature variable represents the daily average temperature (in Fahrenheit), and the success variable is a binary outcome (1 = there was not a power grid failure on the day, 0 = there was a power grid failure on the day).

Let’s recode it so the outcome is in terms of failure:

data_daily_temp<-data_daily_temp%>%mutate(failure=1-success)
data_daily_temp<-select(data_daily_temp,-success)
head(data_daily_temp,1)
## # A tibble: 1 × 2
##   temperature failure
##         <dbl>   <dbl>
## 1        88.2       1

In this example, you model the relationship using logistic regression. The accuracy of the logistic regression model is high, demonstrating that the model can successfully predict power grid failures based on daily average temperature. See below (for full details, see the .Rmd file):

## [1] "Logistic Regression Accuracy using a threshold prob of 0.5 on the test set: 0.883333333333333"
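The fitting chunk itself lives in the .Rmd; a rough sketch of what it presumably looks like, with hypothetical names for the train/test split (train_temp, test_temp). The exact split and seed will differ, so your accuracy may not match the printed value exactly.

# A sketch of the omitted chunk (hypothetical split names).
set.seed(1)
n          <- nrow(data_daily_temp)
train_idx  <- sample(n, size = round(0.8 * n))
train_temp <- data_daily_temp[train_idx, ]
test_temp  <- data_daily_temp[-train_idx, ]

temp_logit <- glm(failure ~ temperature, data = train_temp, family = binomial)
prob_fail  <- predict(temp_logit, newdata = test_temp, type = "response")
mean(ifelse(prob_fail > 0.5, 1, 0) == test_temp$failure)   # accuracy at threshold 0.5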

Looking at the graph, the relationship between daily temperature and power grid failure follows the S-shaped curve reasonably well, though we could probably do even better with a different model. So we can use logistic regression here, but, as in the previous example, we may be able to get away with another type of classification model!

Logistic Regression Limitations: When Binary Outcome Data Really Doesn’t Fit the S-Shaped Curve

Imagine you work for a company that manufactures electronic devices. The company is interested in understanding the factors that may lead to device failures within a year. You have collected data on the daily usage of devices (in hours) and whether the devices failed within a year or not. Your goal is to analyze this data and determine if the daily usage has an impact on device failure within a year. So let’s grab the data (I simulated it):

data_daily_usage<-read_csv("../data/daily_usage.csv")
tail(data_daily_usage,1)
## # A tibble: 1 × 2
##   daily_usage failure
##         <dbl>   <dbl>
## 1        1.17       0

In the dataset, daily_usage represents the average number of hours the device is used daily, and the failure variable is a binary outcome (1 = the device failed within a year, 0 = the device did not fail within a year).

So, you first attempt to model the relationship using logistic regression. However, when visualizing the S-shaped curve from the logistic regression model, it becomes clear that the model does not fit the data well:

And as a result, it’s no surprise that the accuracy of the logistic regression model is low. Below I compute the accuracy on our test set:

## [1] "Logistic Regression Accuracy on test set:0.605"

Notice that many data points were misclassified because a single linear decision boundary (one cutoff on daily_usage) cannot separate the failures from the non-failures.

What to do??? We can try a new technique: decision trees!

Quick demo of rpart on iris dataset!
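In case you want to reproduce the demo afterward, here is a minimal sketch of a classification tree on iris (rpart.plot is an extra package, assumed installed, used only to draw the tree):

library(rpart)
library(rpart.plot)   # assumed installed; used only for drawing the tree

iris_tree <- rpart(Species ~ ., data = iris, method = "class")
rpart.plot(iris_tree)                            # visualize the splits
predict(iris_tree, head(iris), type = "class")   # predicted species for the first few rows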

How to Implement a Decision Tree Model w/ the Previous Example

  • Now that we’ve seen that the logistic regression model doesn’t fit the data well, let’s explore another approach: decision trees.

  • Decision trees are a popular machine learning technique used for classification and regression problems.

  • They allow us to create a model that can make predictions by splitting the data into smaller groups based on a set of conditions.

  • Below we go over how to generate a decision tree model to predict failure in terms of daily_usage on train_data:
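A sketch of the fit, assuming train_data and test_data are the same train/test split of data_daily_usage used for the logistic regression above (those names are my assumption):

library(rpart)

tree_model <- rpart(as.factor(failure) ~ daily_usage,
                    data   = train_data,
                    method = "class")   # classification (not regression) tree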

Compute the accuracy of this model:
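A sketch for the accuracy, using class predictions on the held-out test_data (assumed name):

tree_pred_class <- predict(tree_model, newdata = test_data, type = "class")
mean(tree_pred_class == test_data$failure)   # proportion of correct predictions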

Compute the AUC of this model:
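And a sketch for the AUC; note that it needs the predicted probability of failure rather than the class label:

library(pROC)

tree_pred_prob <- predict(tree_model, newdata = test_data, type = "prob")[, "1"]
roc(test_data$failure, tree_pred_prob)$auc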

Decision Trees: The Tree-mendous Tool for Data Science!

Welcome to our lesson on decision trees! Get ready to branch out into the exciting world of classification trees. And don’t worry, we’ll make sure it’s a root-ing good time!

The above image was created with the assistance of DALL·E 2

There are two types of trees in ML: trees for Classification and trees for Regression.

  • Classification trees classify items into 2 or more discrete categories.

  • Regression trees try to predict continuous values.

We’ll only cover classification trees in the HW, but I may do a regression tree example next week, as it could help out a team that I wrote to last night!

Loose Ends…

  • Be ready for next class, scheduled for Tuesday (not Monday), by reviewing this material.

  • HW 9 is due by Friday at 6:30 pm

  • HW 10 is posted and is on Decision Trees (only classification ones) and a little review. Due next week on Friday

  • We will finish decision trees next class and start on unsupervised learning! Clustering in particular!

  • I am reading your proposals and will be giving feedback on or by Monday

  • Very happy to meet to talk about that or anything Data Sci!