Question 1 (2.5 pts): The following two confusion matrices represent the performance of two different classifiers, C1 and C2, on the same validation dataset (which had 100 data points). Both classifiers were built to predict whether the person is likely to buy a luxury car. Compare the two classifiers based on their predictive accuracy as well as precision, recall, and F measure (for class “Yes”, i.e., for the purchase outcome). Show the calculation for each metric (i.e., don’t just report which classifier has higher performance).
Classifier 1 Confusion Matrix
TP1 = 20 # Actual = yes; predicted = yes
FP1 = 8 # Actual = no; predicted = yes
FN1 = 12 # Actual = yes; predicted = no
TN1 = 60 # Actual = no; predicted = no
# accuracy
print("Accuracy for classifier 1")
## [1] "Accuracy for classifier 1"
(ac1 = ((TP1+TN1)/(TP1+FP1+TN1+FN1)))
## [1] 0.8
# precision yes
print("Precision Yes for classifier 1")
## [1] "Precision Yes for classifier 1"
(pr1yes = (TP1/(TP1+FP1)))
## [1] 0.7142857
# precision no
print("Precision No for classifier 1")
## [1] "Precision No for classifier 1"
(pr1no = (TN1/(TN1+FN1)))
## [1] 0.8333333
# recall yes
print("Recall Yes for classifier 1")
## [1] "Recall Yes for classifier 1"
(re1yes = (TP1/(TP1+FN1)))
## [1] 0.625
# recall no
print("Recall No for classifier 1")
## [1] "Recall No for classifier 1"
(re1no = (TN1/(TN1+FP1)))
## [1] 0.8823529
# f-measure yes
print("F-Measure Yes for classifier 1")
## [1] "F-Measure Yes for classifier 1"
(fm1yes = (2*((pr1yes*re1yes)/(pr1yes+re1yes))))
## [1] 0.6666667
Classifier 2 Confusion Matrix
TP2 = 25 # Actual = yes; predicted = yes
FP2 = 18 # Actual = no; predicted = yes
FN2 = 7 # Actual = yes; predicted = no
TN2 = 50 # Actual = no; predicted = no
# accuracy
print("Accuracy for classifier 2")
## [1] "Accuracy for classifier 2"
(ac2 = ((TP2+TN2)/(TP2+FP2+TN2+FN2)))
## [1] 0.75
# precision yes
print("Precision Yes for classifier 2")
## [1] "Precision Yes for classifier 2"
(pr2yes = (TP2/(TP2+FP2)))
## [1] 0.5813953
# precision no
print("Precision No for classifier 2")
## [1] "Precision No for classifier 2"
(pr2no = (TN2/(TN2+FN2)))
## [1] 0.877193
# recall yes
print("Recall Yes for classifier 2")
## [1] "Recall Yes for classifier 2"
(re2yes = (TP2/(TP2+FN2)))
## [1] 0.78125
# recall no
print("Recall No for classifier 2")
## [1] "Recall No for classifier 2"
(re2no = (TN2/(TN2+FP2)))
## [1] 0.7352941
# f-measure yes
print("F-Measure Yes for classifier 2")
## [1] "F-Measure Yes for classifier 2"
(fm2yes = (2*((pr2yes*re2yes)/(pr2yes+re2yes))))
## [1] 0.6666667
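For reference, a compact helper (a sketch, not part of the required output) that computes all four metrics from the raw confusion-matrix counts; it reproduces the numbers above and could also streamline the comparisons below. The function name class_metrics is my own.
# Sketch: accuracy, precision, recall, and F-measure for the "Yes" class,
# computed directly from the confusion-matrix counts.
class_metrics <- function(tp, fp, fn, tn) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(accuracy  = (tp + tn) / (tp + fp + fn + tn),
    precision = precision,
    recall    = recall,
    f_measure = 2 * precision * recall / (precision + recall))
}
class_metrics(20, 8, 12, 60)  # classifier 1: 0.8000 0.7143 0.6250 0.6667
class_metrics(25, 18, 7, 50)  # classifier 2: 0.7500 0.5814 0.7813 0.6667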
Comparing the two classifiers
# accuracy
if(ac1 > ac2){
print("Accuracy for classifier 1 is equal to or Greater than classifier 2 with a difference of:")
} else {
print("Accuracy for classifier 2 is equal to or Greater than classifier 1 with a difference of:")
}
## [1] "Accuracy for classifier 1 is equal to or Greater than classifier 2 with a difference of:"
if(ac1 > ac2){
print(ac1-ac2)
} else {
print(ac2-ac1)
}
## [1] 0.05
# precision yes
if(pr1yes > pr2yes){
print("Precision (Yes) for classifier 1 is equal to or Greater than classifier 2 with a difference of:")
} else {
print("Precision (Yes) for classifier 2 is equal to or Greater than classifier 1 with a difference of:")
}
## [1] "Precision (Yes) for classifier 1 is equal to or Greater than classifier 2 with a difference of:"
if(pr1yes > pr2yes){
print(pr1yes-pr2yes)
} else {
print(pr2yes-pr1yes)
}
## [1] 0.1328904
# precision no
if(pr1no > pr2no){
print("Precision (No) for classifier 1 is equal to or Greater than classifier 2 with a difference of:")
} else {
print("Precision (No) for classifier 2 is equal to or Greater than classifier 1 with a difference of:")
}
## [1] "Precision (No) for classifier 2 is equal to or Greater than classifier 1 with a difference of:"
if(pr1no > pr2no){
print(pr1no-pr2no)
} else {
print(pr2no-pr1no)
}
## [1] 0.04385965
# recall yes
if(re1yes > re2yes){
print("Recall (Yes) for classifier 1 is equal to or Greater than classifier 2 with a difference of:")
} else {
print("Recall (Yes) for classifier 2 is equal to or Greater than classifier 1 with a difference of:")
}
## [1] "Recall (Yes) for classifier 2 is equal to or Greater than classifier 1 with a difference of:"
if(re1yes > re2yes){
print(re1yes-re2yes)
} else {
print(re2yes-re1yes)
}
## [1] 0.15625
# recall no
if(re1no > re2no){
print("Recall (No) for classifier 1 is equal to or Greater than classifier 2 with a difference of:")
} else {
print("Recall (No) for classifier 2 is equal to or Greater than classifier 1 with a difference of:")
}
## [1] "Recall (No) for classifier 1 is equal to or Greater than classifier 2 with a difference of:"
if(re1no > re2no){
print(re1no-re2no)
} else {
print(re2no-re1no)
}
## [1] 0.1470588
# f-measure yes
if(fm1yes > fm2yes){
print("F-Measure (Yes) for classifier 1 is equal to or Greater than classifier 2 with a difference of:")
} else {
print("F-Measure (Yes) for classifier 2 is equal to or Greater than classifier 1 with a difference of:")
}
## [1] "F-Measure (Yes) for classifier 2 is equal to or Greater than classifier 1 with a difference of:"
if(fm1yes > fm2yes){
print(fm1yes-fm2yes)
} else {
print(fm2yes-fm1yes)
}
## [1] 0
Note: Also, compute the accuracy of the naive (majority) rule on this validation dataset. Hint: you may want to first draw the confusion matrix that you would get with naive/majority rule, to help you with accuracy calculation.
# classifier 1 dataframe: first element = predicted class, second element = actual class
tp= c(1,1) # predicted yes; actual yes
fp= c(1,0) # predicted yes; actual no
fn= c(0,1) # predicted no;  actual yes
tn= c(0,0) # predicted no;  actual no
# create replicated rows
r1 = data.frame(sapply(tp, rep.int, times=20))
r2 = data.frame(sapply(fp, rep.int, times=8))
r3 = data.frame(sapply(fn, rep.int, times=12))
r4 = data.frame(sapply(tn, rep.int, times=60))
# bind rows
c1df <- rbind(r1, r2, r3, r4)
# determine majority
table(c1df$X2) # the majority class is 0 ("no"): 68 of the 100 records
##
## 0 1
## 68 32
# apply the naive rule: every record is predicted 0 ("no"), so the second
# column is replaced by the naive prediction (all 0)
# classifier 1 dataframe w/ naive rule
tp= c(1,0)
fp= c(1,0)
fn= c(0,0)
tn= c(0,0)
# create replicated rows
r1 = data.frame(sapply(tp, rep.int, times=20))
r2 = data.frame(sapply(fp, rep.int, times=8))
r3 = data.frame(sapply(fn, rep.int, times=12))
r4 = data.frame(sapply(tn, rep.int, times=60))
# bind rows
c1df2 <- rbind(r1, r2, r3, r4)
# verify values have changed
table(c1df2$X2)
##
## 0
## 100
# for reference, classifier 1's original predictions: 28 "yes", 72 "no"
table(c1df2$X1)
##
## 0 1
## 72 28
# With the naive rule every record is predicted "no", so there are no
# predicted positives: TP = 0 and FP = 0. The 32 actual "yes" records become
# false negatives, and the 68 actual "no" records are true negatives.
tp = 0  # predicted yes; actual yes
fp = 0  # predicted yes; actual no
fn = 32 # predicted no;  actual yes
tn = 68 # predicted no;  actual no
# accuracy of naive (majority)
print("Accuracy for classifier 1 after applying majority rule:")
## [1] "Accuracy for classifier 1 after applying majority rule:"
(ac = ((tp+tn)/(tp+fp+tn+fn)))
## [1] 0.68
# classifier 2 dataframe: first element = predicted class, second element = actual class
tp= c(1,1) # predicted yes; actual yes
fp= c(1,0) # predicted yes; actual no
fn= c(0,1) # predicted no;  actual yes
tn= c(0,0) # predicted no;  actual no
# create replicated rows
r1 = data.frame(sapply(tp, rep.int, times=25))
r2 = data.frame(sapply(fp, rep.int, times=18))
r3 = data.frame(sapply(fn, rep.int, times=7))
r4 = data.frame(sapply(tn, rep.int, times=50))
# bind rows
c1df <- rbind(r1, r2, r3, r4)
# determine majority
table(c1df$X2) # the majority class is again 0 ("no"): 68 of the 100 records
##
## 0 1
## 68 32
# apply the naive rule: every record is predicted 0 ("no"), so the second
# column is replaced by the naive prediction (all 0)
# classifier 2 dataframe w/ naive rule
tp= c(1,0)
fp= c(1,0)
fn= c(0,0)
tn= c(0,0)
# create replicated rows
r1 = data.frame(sapply(tp, rep.int, times=25))
r2 = data.frame(sapply(fp, rep.int, times=18))
r3 = data.frame(sapply(fn, rep.int, times=7))
r4 = data.frame(sapply(tn, rep.int, times=50))
# bind rows
c1df2 <- rbind(r1, r2, r3, r4)
# verify values have changed
table(c1df2$X2)
##
## 0
## 100
# for reference, classifier 2's original predictions: 43 "yes", 57 "no"
table(c1df2$X1)
##
## 0 1
## 57 43
# With the naive rule the confusion matrix is identical to classifier 1's,
# since both classifiers share the same actuals (32 "yes", 68 "no"):
tp = 0  # predicted yes; actual yes
fp = 0  # predicted yes; actual no
fn = 32 # predicted no;  actual yes
tn = 68 # predicted no;  actual no
# accuracy of naive (majority)
print("Accuracy for classifier 2 after applying majority rule:")
## [1] "Accuracy for classifier 2 after applying majority rule:"
(ac = ((tp+tn)/(tp+fp+tn+fn)))
## [1] 0.68
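As a sanity check, the naive accuracy can also be computed directly, without rebuilding any dataframes: both classifiers were evaluated on the same 100 records (32 "yes", 68 "no"), so a minimal sketch is:
# Sketch: the naive/majority rule predicts "no" for every record, so its
# accuracy is simply the proportion of the majority class.
n_yes = 32  # actual "yes" (TP + FN for either classifier)
n_no  = 68  # actual "no"  (FP + TN for either classifier)
(naive_acc = max(n_yes, n_no) / (n_yes + n_no))  # 0.68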
Question 2 (0.5 pts): You are using the kNN algorithm to classify new data based on a historical training set. You encounter a new observation, represented below by the green circle. Answer the following question.
Question 2.1 (0.5 pts): If your algorithm employs a k value of 5 in the majority voting, how would this new observation be classified (red triangle, or blue square)?
Answer: With k = 5, the new observation would be classified as a blue square: among its 5 nearest neighbors there are 3 blue squares and only 2 red triangles, so blue square wins the majority vote.
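To make the vote explicit, a small sketch (the neighbor labels are the counts read off the figure, as stated above):
# Sketch: majority vote among the k = 5 nearest neighbors.
neighbors = c("blue square", "blue square", "blue square",
              "red triangle", "red triangle")
names(which.max(table(neighbors)))  # "blue square"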
Question 3 (3 pts): You have a fraud detection task (predicting whether a given credit card transaction is “fraud” vs. “non fraud”) and you built a classification model for this purpose. For any credit card transaction, your model estimates the probability that this transaction is “fraud”. The following table represents the probabilities that your model estimated for the validation dataset containing 10 records.
Code to build table
Act_Class <- c("fraud", "fraud", "fraud", "non-fraud", "fraud", "non-fraud", "fraud", "non-fraud", "non-fraud", "non-fraud")
# binary representation for actual class
Act_Class_bi <- c(1, 1, 1, 0, 1, 0, 1, 0, 0, 0)
Pr_fraud <- c(0.95,0.91,0.75,0.67,0.61,0.46,0.42,0.25,0.09,0.04)
df <- as.data.frame(Act_Class)
df$Act_Class_bi <- Act_Class_bi
df$Pr_fraud <- Pr_fraud
df
Based on the above information, answer the following three questions:
Question 3.1 (1 pt). What is the overall accuracy of your model, if the chosen probability threshold value is 0.5?
thresh = 0.5
# define new variable by threshold
df$Pr_fraud_bi = Pr_fraud >= thresh
# Confusion matrix
TP4 = length(which(df$Act_Class_bi ==1 & df$Pr_fraud_bi == TRUE))  # 4
FP4 = length(which(df$Act_Class_bi ==0 & df$Pr_fraud_bi == TRUE))  # 1
FN4 = length(which(df$Act_Class_bi ==1 & df$Pr_fraud_bi == FALSE)) # 1
TN4 = length(which(df$Act_Class_bi ==0 & df$Pr_fraud_bi == FALSE)) # 4
# accuracy
print("Accuracy for model with threshold of 0.5 is:")
## [1] "Accuracy for model with threshold of 0.5 is:"
(ac4 = ((TP4+TN4)/(TP4+FP4+TN4+FN4)))
## [1] 0.8
Question 3.2 (1 pt). What probability threshold value should you choose, in order to have Precision(fraud) = 100% for your model? (Explain.) What is the overall accuracy of your model in this case?
# For Precision(fraud) = 100% we need zero false positives. The highest
# predicted probability among actual non-fraud records is 0.67 (record 4),
# so any threshold above 0.67 (and at or below 0.75) works; we use 0.68.
thresh = 0.68
# define new variable by threshold
df$Pr_fraud_bi = Pr_fraud >= thresh
# Confusion matrix
TP5 = length(which(df$Act_Class_bi ==1 & df$Pr_fraud_bi == TRUE))  # 3
FP5 = length(which(df$Act_Class_bi ==0 & df$Pr_fraud_bi == TRUE))  # 0
FN5 = length(which(df$Act_Class_bi ==1 & df$Pr_fraud_bi == FALSE)) # 2
TN5 = length(which(df$Act_Class_bi ==0 & df$Pr_fraud_bi == FALSE)) # 5
# precision fraud
print("Precision(fraud) for model with 0.68 threshold is:")
## [1] "Precision(fraud) for model with 0.68 threshold is:"
(pr5yes = (TP5/(TP5+FP5)))
## [1] 1
# accuracy
print("Accuracy for model with threshold of 0.68 is:")
## [1] "Accuracy for model with threshold of 0.68 is:"
(ac5 = ((TP5+TN5)/(TP5+FP5+TN5+FN5)))
## [1] 0.8
Explanation: To get Precision(fraud) = 100%, the threshold must be above 0.67, the highest probability the model assigned to an actual non-fraud record (record 4). At a threshold of 0.68 the false positive from record 4 is eliminated, while the fraud records with probabilities 0.61 and 0.42 fall below the threshold and become false negatives. The overall accuracy is (3 TP + 5 TN) / 10 = 80%.
Question 3.3 (1 pt). What probability threshold value should you choose, in order to have Recall(fraud) = 100% for your model? (Explain.) What is the overall accuracy of your model in this case?
# recall yes
print("Recall(fraud) for model with original 0.5 threshold is:")
## [1] "Recall(fraud) for model with original 0.5 threshold is:"
(re4yes = (TP4/(TP4+FN4)))
## [1] 0.8
# For Recall(fraud) = 100% we need zero false negatives. The lowest predicted
# probability among actual fraud records is 0.42 (record 7), so any threshold
# at or below 0.42 works; we use 0.4.
thresh = 0.4
# define new variable by threshold
df$Pr_fraud_bi = Pr_fraud >= thresh
# Confusion matrix
TP6 = length(which(df$Act_Class_bi ==1 & df$Pr_fraud_bi == TRUE))  # 5
FP6 = length(which(df$Act_Class_bi ==0 & df$Pr_fraud_bi == TRUE))  # 2
FN6 = length(which(df$Act_Class_bi ==1 & df$Pr_fraud_bi == FALSE)) # 0
TN6 = length(which(df$Act_Class_bi ==0 & df$Pr_fraud_bi == FALSE)) # 3
# recall fraud
print("Recall(fraud) for model with 0.4 threshold is:")
## [1] "Recall(fraud) for model with 0.4 threshold is:"
(re6yes = (TP6/(TP6+FN6)))
## [1] 1
# accuracy
print("Accuracy for model with threshold of 0.4 is:")
## [1] "Accuracy for model with threshold of 0.4 is:"
(ac6 = ((TP6+TN6)/(TP6+FP6+TN6+FN6)))
## [1] 0.8
Explanation: To get Recall(fraud) = 100%, the threshold must be at or below 0.42, the lowest probability the model assigned to an actual fraud record (record 7). At a threshold of 0.4 all five fraud records are flagged (no false negatives), while the non-fraud records with probabilities 0.67 and 0.46 become false positives. The overall accuracy is (5 TP + 3 TN) / 10 = 80%.
Note that for Questions 3.2 and 3.3, if you think there are multiple possible thresholds, you only need to pick and report one.
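The thresholds in 3.2 and 3.3 can also be found by sweeping candidates programmatically; a sketch using the vectors defined above:
# Sketch: precision and recall for class "fraud" at a few candidate thresholds.
for (t in c(0.4, 0.5, 0.68)) {
  pred = Pr_fraud >= t
  tp = sum(Act_Class_bi == 1 & pred)
  fp = sum(Act_Class_bi == 0 & pred)
  fn = sum(Act_Class_bi == 1 & !pred)
  cat(sprintf("threshold %.2f: precision = %.2f, recall = %.2f\n",
              t, tp / (tp + fp), tp / (tp + fn)))
}
# threshold 0.40: precision = 0.71, recall = 1.00
# threshold 0.50: precision = 0.80, recall = 0.80
# threshold 0.68: precision = 1.00, recall = 0.60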
Question 4 (1 pts): Recall the breast cancer classification problem that you encountered in Homework 2. You were asked to classify cancer cases as either benign or malignant. Now, think about the cost sensitive classification we talked about in class. Answer the following two sub questions:
Question 4.1 (0.5 pts). In this case of cancer classification, which kind of mis-classification is more costly to patients? Note that “cost” here refers to not just financial cost, but cost to the patients in general.
Answer: Misclassifying a malignant case as benign (a false negative for the "malignant" class) is far more costly to the patient: a missed malignant diagnosis can delay treatment while the cancer progresses. A false positive leads to additional testing and anxiety, but is much less harmful.
Question 4.2 (0.5 pts). Suppose you can control the cutoff value for classifying a case as “malignant”, should you increase or decrease that cutoff, in order to reduce overall misclassification cost? Explain your answer in 1 to 2 sentences.
Answer: We should decrease the cutoff for classifying a case as "malignant". A lower cutoff flags more cases as malignant, which reduces the very costly false negatives (missed malignant cases) at the price of additional, far less costly, false positives.
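A quick sketch of the trade-off (the per-error costs and error counts here are hypothetical, for illustration only; they are not from the assignment):
# Sketch: total misclassification cost = cost_fn*FN + cost_fp*FP. Assume a
# missed malignant case (FN) is 20x as costly as a false alarm (FP), with
# hypothetical error counts at two cutoffs:
cost_fn = 20; cost_fp = 1
(cost_low  = cost_fn * 1 + cost_fp * 10)  # low cutoff:  1 FN, 10 FP -> 30
(cost_high = cost_fn * 8 + cost_fp * 2)   # high cutoff: 8 FN,  2 FP -> 162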
Question 5 (3 pts): You are a marketing manager for a big financial company. The lift chart (depicted below) was obtained by running your predictive model on a validation sample of customers (representative of general population) from your historical data. The horizontal axis denotes the percentage of solicitation mailings, and the vertical axis denotes the percentage of responders. For the next marketing campaign, you obtained a list of 1,000,000 potential customers (again, representative of the general population). Based on your past data you know that in the general population the average response rate is 0.1%. In other words, in the general population of 1,000 customers, there is 1 responder on average. Now you are trying to decide between the following four marketing strategies:
Which of these four marketing strategies should you use in each of the following scenarios? (Explain your answer)
cost = 1
prof_per_resp = 2000
n = 1000000
avg_resp = 0.001
# response percent
ten_resp = 0.4
twenty_resp = 0.6
forty_resp = 0.79
hundred_resp = 1.0
# solicit percentage
solicit10 = 0.1
solicit20 = 0.2
solicit40 = 0.4
solicit100 = 1.0
# 10% calculated
ten_calc = ((n * ten_resp * avg_resp) * prof_per_resp) - ((n * solicit10) * cost)
print("Total profit with mail solicitations to top 10% of customers:")
## [1] "Total profit with mail solicitations to top 10% of customers:"
ten_calc
## [1] 7e+05
# 20% calculated
twenty_calc = ((n * twenty_resp * avg_resp) * prof_per_resp) - ((n * solicit20) * cost)
print("Total profit with mail solicitations to top 20% of customers:")
## [1] "Total profit with mail solicitations to top 20% of customers:"
twenty_calc
## [1] 1e+06
# 40% calculated
forty_calc = ((n * forty_resp* avg_resp) * prof_per_resp) - ((n * solicit40) * cost)
print("Total profit with mail solicitations to top 40% of customers:")
## [1] "Total profit with mail solicitations to top 40% of customers:"
forty_calc
## [1] 1180000
# 100% calculated
hundred_calc = ((n * hundred_resp* avg_resp) * prof_per_resp) - ((n * solicit100) * cost)
print("Total profit with mail solicitations to top 100% of customers:")
## [1] "Total profit with mail solicitations to top 100% of customers:"
hundred_calc
## [1] 1e+06
print("The observation with the most profit is number:")
## [1] "The observation with the most profit is number:"
which.max(c(ten_calc, twenty_calc, forty_calc, hundred_calc))
## [1] 3
Answer: For each solicitation percentage, total profit is computed with the formula below:
profit = (1,000,000 × lift-curve response fraction × 0.001) × profit per responder − (1,000,000 × fraction solicited) × cost per mailing
With a mailing cost of $1, the code above shows the most profitable option is to mail the top 40% of customers, for a total profit of $1,180,000.
cost = 3
prof_per_resp = 2000
n = 1000000
avg_resp = 0.001
# response percent
ten_resp = 0.4
twenty_resp = 0.6
forty_resp = 0.79
hundred_resp = 1.0
# solicit percentage
solicit10 = 0.1
solicit20 = 0.2
solicit40 = 0.4
solicit100 = 1.0
# 10% calculated
ten_calc = ((n * ten_resp * avg_resp) * prof_per_resp) - ((n * solicit10) * cost)
print("Total profit with mail solicitations to top 10% of customers:")
## [1] "Total profit with mail solicitations to top 10% of customers:"
ten_calc
## [1] 5e+05
# 20% calculated
twenty_calc = ((n * twenty_resp * avg_resp) * prof_per_resp) - ((n * solicit20) * cost)
print("Total profit with mail solicitations to top 20% of customers:")
## [1] "Total profit with mail solicitations to top 20% of customers:"
twenty_calc
## [1] 6e+05
# 40% calculated
forty_calc = ((n * forty_resp* avg_resp) * prof_per_resp) - ((n * solicit40) * cost)
print("Total profit with mail solicitations to top 40% of customers:")
## [1] "Total profit with mail solicitations to top 40% of customers:"
forty_calc
## [1] 380000
# 100% calculated
hundred_calc = ((n * hundred_resp* avg_resp) * prof_per_resp) - ((n * solicit100) * cost)
print("Total profit with mail solicitations to top 100% of customers:")
## [1] "Total profit with mail solicitations to top 100% of customers:"
hundred_calc
## [1] -1e+06
print("The observation with the most profit is number:")
## [1] "The observation with the most profit is number:"
which.max(c(ten_calc, twenty_calc, forty_calc, hundred_calc))
## [1] 2
Answer: Using the same profit formula:
profit = (1,000,000 × lift-curve response fraction × 0.001) × profit per responder − (1,000,000 × fraction solicited) × cost per mailing
With a mailing cost of $3, the most profitable option is to mail the top 20% of customers, for a total profit of $600,000.
cost = 0.5
prof_per_resp = 2000
n = 1000000
avg_resp = 0.001
# response percent
ten_resp = 0.4
twenty_resp = 0.6
forty_resp = 0.79
hundred_resp = 1.0
# solicit percentage
solicit10 = 0.1
solicit20 = 0.2
solicit40 = 0.4
solicit100 = 1.0
# 10% calculated
ten_calc = ((n * ten_resp * avg_resp) * prof_per_resp) - ((n * solicit10) * cost)
print("Total profit with mail solicitations to top 10% of customers:")
## [1] "Total profit with mail solicitations to top 10% of customers:"
ten_calc
## [1] 750000
# 20% calculated
twenty_calc = ((n * twenty_resp * avg_resp) * prof_per_resp) - ((n * solicit20) * cost)
print("Total profit with mail solicitations to top 20% of customers:")
## [1] "Total profit with mail solicitations to top 20% of customers:"
twenty_calc
## [1] 1100000
# 40% calculated
forty_calc = ((n * forty_resp* avg_resp) * prof_per_resp) - ((n * solicit40) * cost)
print("Total profit with mail solicitations to top 40% of customers:")
## [1] "Total profit with mail solicitations to top 40% of customers:"
forty_calc
## [1] 1380000
# 100% calculated
hundred_calc = ((n * hundred_resp* avg_resp) * prof_per_resp) - ((n * solicit100) * cost)
print("Total profit with mail solicitations to top 100% of customers:")
## [1] "Total profit with mail solicitations to top 100% of customers:"
hundred_calc
## [1] 1500000
print("The observation with the most profit is number:")
## [1] "The observation with the most profit is number:"
which.max(c(ten_calc, twenty_calc, forty_calc, hundred_calc))
## [1] 4
Answer: Using the same profit formula:
profit = (1,000,000 × lift-curve response fraction × 0.001) × profit per responder − (1,000,000 × fraction solicited) × cost per mailing
With a mailing cost of $0.50, the most profitable option is to mail all 1,000,000 customers (the full 100%), for a total profit of $1,500,000.
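Since the three scenarios differ only in the cost per mailing, the whole comparison can also be written once as a vectorized sketch (the profit function name is my own):
# Sketch: profit for each strategy (top 10%, 20%, 40%, 100%) at a given
# cost per mailing, using the response fractions read off the lift chart.
profit = function(cost, n = 1e6, base_rate = 0.001, prof_per_resp = 2000,
                  resp = c(0.4, 0.6, 0.79, 1.0),
                  solicit = c(0.1, 0.2, 0.4, 1.0)) {
  (n * resp * base_rate) * prof_per_resp - (n * solicit) * cost
}
profit(1)    #  700000 1000000 1180000  1000000 -> top 40% is best
profit(3)    #  500000  600000  380000 -1000000 -> top 20% is best
profit(0.5)  #  750000 1100000 1380000  1500000 -> mail everyone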
Part 3. Classification in R
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(rpart)
library(rpart.plot)
library(class)
library(e1071)
Download “breast_cancer.csv” data file from Canvas. Make sure you read the description of Question 6 (in HW2 pdf file) carefully before working on this part. Import the data into R
cancer <- read.csv("C:/Users/CSFic/Desktop/school/DSA8590 - Advanced Analytics/data/breast_cancer.csv")
head(cancer)
Question 6. After you import the data, we need to convert the “Class” variable into a factor, so that R treats it as a categorical variable instead of a numeric one.
# convert to factor
cancer$Class= factor(cancer$Class)
set.seed(1)
library(caret)
train_rows = createDataPartition(y = cancer$Class, p = 0.80, list = FALSE)
cancer_train = cancer[train_rows,]
cancer_test = cancer[-train_rows,]
6.2. Build a decision tree model
set.seed(1)
library(rpart)
library(rpart.plot)
tree <- rpart(formula = Class~., data = cancer_train,
control = rpart.control(maxdepth = 3, minbucket = 1, cp = 0))
prp(tree, faclen = 0, cex = 0.8, extra = 1)
6.3. Plot the tree, and then answer the following questions:
6.3.1. How many decision nodes are there in your tree?
Answer: 5
6.3.2. Pick one decision rule from your tree and interpret it
Answer: One decision rule to interpret comes from the Bare.Nuclei node, whose rule is Bare.Nuclei < 5. The easiest way to read it is as an if-then statement: IF Bare.Nuclei is less than 5, THEN follow the branch that next splits on normal.nuclei; ELSE follow the branch that next splits on Uniformity.of.Cell.Sh.
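The full set of root-to-leaf rules can also be printed directly; a one-line sketch using rpart.plot's rule printer:
# Sketch: print each path through the tree as a readable if-then rule.
rpart.rules(tree)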
6.4. Evaluate the performance of your tree. Specifically, report the following metrics: (1) confusion matrix; (2) accuracy; (3) precision, recall, f-measure for “malignant” class; (4) AUC for “malignant” class
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
pred_tree = predict(tree, cancer_test, type = "class")
#1, 2
confusionMatrix(pred_tree, as.factor(cancer_test[,10]))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 2 4
## 2 84 4
## 4 4 43
##
## Accuracy : 0.9407
## 95% CI : (0.8866, 0.9741)
## No Information Rate : 0.6519
## P-Value [Acc > NIR] : 1.347e-15
##
## Kappa : 0.8694
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9545
## Specificity : 0.9149
## Pos Pred Value : 0.9545
## Neg Pred Value : 0.9149
## Prevalence : 0.6519
## Detection Rate : 0.6222
## Detection Prevalence : 0.6519
## Balanced Accuracy : 0.9347
##
## 'Positive' Class : 2
##
# confusion matrix taken from the confusionMatrix() output. Note that
# confusionMatrix() treats class 2 (benign) as the positive class, so the
# variables below are named from the benign ("non-malignant") perspective;
# for the malignant class (4), the TP/TN and FP/FN roles swap.
tp2= 84
fp2= 4
fn2= 4
tn2= 43
#3
# accuracy
print("Accuracy for tree:")
## [1] "Accuracy for tree:"
(ac7 = ((tp2+tn2)/(tp2+fp2+tn2+fn2)))
## [1] 0.9407407
# precision(non-malignant)
print("Precision(non-malignant) for tree:")
## [1] "Precision(non-malignant) for tree:"
(pr7yes = (tp2/(tp2+fp2)))
## [1] 0.9545455
# precision(malignant)
print("Precision(malignant) for tree:")
## [1] "Precision(malignant) for tree:"
(pr7no = (tn2/(tn2+fn2)))
## [1] 0.9148936
# recall (non-malignant)
print("Recall(non-malignant) for tree:")
## [1] "Recall(non-malignant) for tree:"
(re7yes = (tp2/(tp2+fn2)))
## [1] 0.9545455
# recall (malignant)
print("Recall(malignant) for tree:")
## [1] "Recall(malignant) for tree:"
(re7no = (tn2/(tn2+fp2)))
## [1] 0.9148936
# f-measure malignant
print("F-Measure(malignant) for tree:")
## [1] "F-Measure(malignant) for tree:"
(fm7no = (2*((pr7no*re7no)/(pr7no+re7no))))
## [1] 0.9148936
#4
prob_pred_tree = predict(tree, cancer_test, type = 'prob')
roc_tree = roc(response = ifelse(cancer_test$Class == "4", 1, 0),
predictor = prob_pred_tree[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
#Report AUC
auc(roc_tree)
## Area under the curve: 0.9351
6.5. Now, let’s consider using K-NN to do the classification. Is there any need to normalize the data? Why or why not? If you think normalization is needed, write your code below to do so. Feel free to re-use code from in class exercise. If you think normalization is not necessary, explain why and you do not need to write any code.
Answer: Normalization is not necessary here, because all of the predictor variables are already on the same 1-10 scale. Normalization matters when features have very different numeric ranges, since kNN's distance calculation would otherwise be dominated by the features with the largest ranges.
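For reference, if the predictors had been on different scales, min-max normalization would be the usual fix; a sketch, assuming every column other than Class is a numeric predictor (ranges are learned on the training set and applied to both sets):
# Sketch: min-max normalize the predictor columns using training-set ranges.
num_cols = setdiff(names(cancer_train), "Class")
mins = sapply(cancer_train[num_cols], min)
maxs = sapply(cancer_train[num_cols], max)
cancer_train_n = cancer_train
cancer_test_n  = cancer_test
cancer_train_n[num_cols] = scale(cancer_train[num_cols], center = mins, scale = maxs - mins)
cancer_test_n[num_cols]  = scale(cancer_test[num_cols],  center = mins, scale = maxs - mins)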
6.6. Build a K-NN model with your own choice of k value, and evaluate the performance of your KNN model. Does it have higher or lower AUC than your decision tree model?
library(pROC)
set.seed(1)
# choice of k value = 5
# build model
k_model = knn3(Class~., data=cancer_train,k=5)
pred_knn = predict(k_model, cancer_test, type="prob")
# AUC
roc_knn = roc(response = ifelse(cancer_test$Class == "4", 1, 0),
predictor = pred_knn[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(roc_knn)
## Area under the curve: 0.9911
print("Difference between knn and decision tree models:")
## [1] "Difference between knn and decision tree models:"
(0.9911 - 0.9351)
## [1] 0.056
Answer: The kNN model has a higher AUC than the decision tree: 0.9911 for kNN versus 0.9351 for the tree, a difference of 0.056.
6.7. Try several different k values, report the AUC of each one you tried. Also, report which k value gives you the highest AUC. Try using a for loop for this task.
library(pROC)
set.seed(1)
for (i in 1:12) {
print(i)
k_model = knn3(Class~., data = cancer_train, k=i)
prob_pred_knn = predict(k_model , cancer_test, type="prob")
roc_knn = roc(response = ifelse(cancer_test$Class == "4", 1, 0),predictor = prob_pred_knn[,2])
print(auc(roc_knn))
}
## [1] 1
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.9248
## [1] 2
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.9869
## [1] 3
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.9863
## [1] 4
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.986
## [1] 5
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.9911
## [1] 6
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.9902
## [1] 7
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.99
## [1] 8
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.9904
## [1] 9
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.9901
## [1] 10
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.9898
## [1] 11
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.9895
## [1] 12
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.9894
Answer: I tried k values 1 through 12. The highest AUC, 0.9911, came from k = 5, the same value I used in problem 6.6.
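A variant of the loop (a sketch) that stores each AUC so the best k is reported programmatically instead of by eye; the quiet = TRUE argument suppresses pROC's "Setting levels/direction" messages:
# Sketch: collect the AUC for each k, then pick the best.
ks = 1:12
aucs = numeric(length(ks))
for (i in seq_along(ks)) {
  m = knn3(Class ~ ., data = cancer_train, k = ks[i])
  p = predict(m, cancer_test, type = "prob")
  r = roc(response = ifelse(cancer_test$Class == "4", 1, 0),
          predictor = p[, 2], quiet = TRUE)
  aucs[i] = auc(r)
}
ks[which.max(aucs)]  # best k (5 in the runs above)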
6.8. Build a naive bayes model, and evaluate its performance on the same testing data. Does it have higher or lower AUC than your best decision tree and kNN models?
# build model
NB_model = naiveBayes(Class ~ ., data = cancer_train)
# class probability predictions
prob_pred_nb = predict(NB_model, cancer_test, type = "raw")
roc_nb = roc(response = ifelse(cancer_test$Class == "4", 1, 0), predictor = prob_pred_nb[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# AUC
auc(roc_nb)
## Area under the curve: 0.9741
The naive Bayes model's AUC of 0.9741 is higher than the decision tree's (0.9351) but lower than the best kNN model's (0.9911 at k = 5).
6.9. Take your best model in terms of AUC, and plot the lift curve. What is the lift ratio at top 10% of cases with highest “malignant” probability as predicted by your model? Interpret the meaning of that lift ratio.
# best model was knn with k = 5
library(pROC)
library(dplyr)
set.seed(1)
# choice of k value = 5
# build model
k_model = knn3(Class~., data=cancer_train,k=5)
pred_knn = predict(k_model, cancer_test, type="prob")
# AUC
roc_knn = roc(response = ifelse(cancer_test$Class == "4", 1, 0),
predictor = pred_knn[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
cancer_lift = cancer_test %>%
mutate(prob = pred_knn[,2]) %>% arrange(desc(prob)) %>%
mutate(y = ifelse(Class == "4", 1, 0)) %>%
# the following two lines make the lift curve
mutate(x = row_number()/nrow(cancer_test),
y = (cumsum(y)/sum(y))/x)
plot(cancer_lift$x, cancer_lift$y, main = "Lift Curve for KNN Model with k = 5", col = "steel blue", xlab = "Proportion of cases (ranked by predicted probability)", ylab = "Lift")
# lift ratio at top 10%
x_val <- median(which(cancer_lift$x > 0.09 & cancer_lift$x < 0.11))
cancer_lift$y[x_val]
## [1] 2.651391
Answer: Reading the curve near x = 0.10 (taking the median of the points with x between 0.09 and 0.11) gives a lift ratio of roughly 2.65. This means that the top 10% of test cases, ranked by predicted malignant probability, contain about 2.65 times as many actual malignant cases as a randomly selected 10% of cases would.
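The lift at exactly x = 0.10 can also be read off with linear interpolation instead of the median-of-a-window approach; a sketch:
# Sketch: interpolate the lift curve at the 10% mark.
approx(cancer_lift$x, cancer_lift$y, xout = 0.10)$y  # roughly 2.65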
6.10. Again take your best model in terms of AUC, and plot the ROC curve for class “malignant”.
library(pROC)
set.seed(1)
# choice of k value = 5
# build model
k_model = knn3(Class~., data=cancer_train,k=5)
pred_knn = predict(k_model, cancer_test, type="prob")
# AUC
roc_knn = roc(response = ifelse(cancer_test$Class == "4", 1, 0),
predictor = pred_knn[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_knn, main = "ROC Curve for Malignant Class")