When the number of features p is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when p is large. We will now investigate this curse.
(a) Suppose that we have a set of observations, each with measurements on p = 1 feature, X. We assume that X is uniformly (evenly) distributed on [0, 1]. Associated with each observation is a response value. Suppose that we wish to predict a test observation’s response using only observations that are within 10% of the range of X closest to that test observation. For instance, in order to predict the response for a test observation with X = 0.6, we will use observations in the range [0.55, 0.65]. On average, what fraction of the available observations will we use to make the prediction?
*Ans: To predict the response for a test observation using observations within 10% of the range of X, the interval used depends on where the test observation falls: (i) if 0.05 <= X <= 0.95 -> [X-0.05, X+0.05] (length = 0.10); (ii) if 0 <= X < 0.05 -> [0, X+0.05] (length = X+0.05); (iii) if 0.95 < X <= 1 -> [X-0.05, 1] (length = 1.05-X). Since X is uniformly distributed on [0,1], the average fraction of observations used equals the average interval length:
f1 <- function(x) x + 0.05  # near the left boundary: interval [0, x+0.05]
f2 <- function(x) 0.1       # interior: full interval [x-0.05, x+0.05]
f3 <- function(x) 1.05 - x  # near the right boundary: interval [x-0.05, 1]
m = integrate(f1, 0, 0.05)$value + integrate(Vectorize(f2), 0.05, 0.95)$value + integrate(f3, 0.95, 1)$value
print(paste("Average fraction of the interval used to make prediction is", m))
## [1] "Average fraction of the interval used to make prediction is 0.0975"
(b) Now suppose that we have a set of observations, each with measurements on p = 2 features, X1 and X2. We assume that (X1,X2) are uniformly distributed on [0, 1] × [0, 1]. We wish to predict a test observation’s response using only observations that are within 10% of the range of X1 and within 10% of the range of X2 closest to that test observation. For instance, in order to predict the response for a test observation with X1 = 0.6 and X2 = 0.35, we will use observations in the range [0.55, 0.65] for X1 and in the range [0.3, 0.4] for X2. On average, what fraction of the available observations will we use to make the prediction?
*Ans: Since X1 and X2 are independent and each contributes an average fraction of 0.0975, the fraction of the available observations we will use to make the prediction is the product:
l = 0.0975 * 0.0975
print(paste(l, "is the fraction of the available observations we will use to make the prediction."))
## [1] "0.00950625 is the fraction of the available observations we will use to make the prediction."
Suppose we collect data for a group of students in a statistics class with variables X1 = hours studied, X2 = undergrad GPA, and Y = receive an A. We fit a logistic regression and produce estimated coefficients β̂0 = −6, β̂1 = 0.05, β̂2 = 1.
(a) Estimate the probability that a student who studies for 40 h and has an undergrad GPA of 3.5 gets an A in the class.
X1 = 40
X2 = 3.5
y = -6 + (0.05*X1) + (1*X2)   # linear predictor (log-odds)
prob = (exp(y))/(1 + exp(y))  # logistic transformation to a probability
print(paste("Probability that the student gets an A is", prob))
## [1] "Probability that the student gets an A is 0.377540668798145"
(b) How many hours would the student in part (a) need to study to have a 50% chance of getting an A in the class?
prob = 0.5
# The model gives log-odds y = -6 + (0.05*X1) + (1*3.5)
Log_Odds = log(0.5/(1-0.5))  # natural log, since logistic regression works on the log-odds scale; equals 0
## Equating y and Log_Odds, we get
# (0.05*X1) - 2.5 = Log_Odds
X1 = (Log_Odds + 2.5)/0.05
print(paste(X1, "hours the student needs to study to have a 50% chance of getting an A in the class"))
## [1] "50 hours the student needs to study to have a 50% chance of getting an A in the class"
This problem has to do with odds.
(a) On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?
Ans: Odds = Probability of Event Happening / Probability of Event Not Happening, i.e. odds = p/(1-p), so p = odds/(1 + odds).
p = ((0.37)/(1 + 0.37))*100
print(paste(p, "percent of people with an odds of 0.37 of defaulting on their credit card payment will in fact default"))
## [1] "27.007299270073 percent of people with an odds of 0.37 of defaulting on their credit card payment will in fact default"
(b) Suppose that an individual has a 16% chance of defaulting on her credit card payment. What are the odds that she will default?
p = 0.16/(1-0.16)
print(paste("Odds for defaulting is", p))
## [1] "Odds for defaulting is 0.19047619047619"
## Question 4.8.16 (Difficulty Level - 4)
Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.
Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.
library(ISLR2)
dat = Boston
median_crim = median(dat$crim)
# Creating a new variable "New_Crim" that is discrete with two categories "0" and "1"
# "0" means crim is at or below the median and "1" otherwise
dat["New_Crim"] <- rep(0, nrow(dat))
dat$New_Crim[dat$crim > median_crim] = 1
dat$New_Crim <- as.factor(dat$New_Crim)
# creating logistic regression model with all the variables to check which variables are significant
# will drop "crim" variable from the dataset as "New_Crim" variable has been coded using "crim".
dat1 <- dat[,-1]
summary(glm(New_Crim ~., data = dat1, family = binomial))
##
## Call:
## glm(formula = New_Crim ~ ., family = binomial, data = dat1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0082 -0.1688 -0.0004 0.0027 3.4324
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -38.832417 6.086280 -6.380 1.77e-10 ***
## zn -0.086228 0.034090 -2.529 0.0114 *
## indus -0.052438 0.042817 -1.225 0.2207
## chas 0.619914 0.722150 0.858 0.3907
## nox 47.913820 7.344213 6.524 6.84e-11 ***
## rm -0.271941 0.676239 -0.402 0.6876
## age 0.021474 0.012105 1.774 0.0761 .
## dis 0.669991 0.214618 3.122 0.0018 **
## rad 0.669240 0.151742 4.410 1.03e-05 ***
## tax -0.006165 0.002622 -2.351 0.0187 *
## ptratio 0.326433 0.116296 2.807 0.0050 **
## lstat 0.053537 0.047105 1.137 0.2557
## medv 0.147987 0.064347 2.300 0.0215 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 701.46 on 505 degrees of freedom
## Residual deviance: 218.75 on 493 degrees of freedom
## AIC: 244.75
##
## Number of Fisher Scoring iterations: 9
From the summary above we can observe that only the variables “zn”, “nox”, “dis”, “rad”, “tax”, “ptratio”, and “medv” are significant at the 5% level. We drop the remaining variables for the purposes of this question.
# Creating a new dataset basis the selected variables
dat2 <- data.frame(dat[,c("New_Crim","zn","nox","dis","rad","tax","ptratio","medv")])
# We will now fit all the models asked for in the question and evaluate each one based on accuracy, precision, recall, and F1 score.
# Checking for data imbalance. Both the classes in New_Crim have equal observations with 253 each so it is balanced.
table(dat2$New_Crim)
##
## 0 1
## 253 253
# Splitting the data into 80:20 ratio
set.seed(420)
q <- sample(nrow(dat2), nrow(dat2)*0.8)
train <- dat2[q, ]
test <- dat2[-q, ]
# Training all models using train data
log1 <- glm(New_Crim ~.,data = train, family = binomial)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
lda1 <- lda(New_Crim ~.,data = train)
library(e1071)
nab1 <- naiveBayes(New_Crim ~., data = train)
# KNN model with k = 10. (A common rule of thumb sets k near the square root of the
# training size, which here would suggest k of about 20; k = 10 is used below.)
# The response column must be excluded from the predictor matrices passed to knn().
library(class)
knn1_pred <- knn(train[, -1], test[, -1], cl = train$New_Crim, k = 10)
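Note that KNN is distance-based, so unscaled predictors with large ranges (e.g. tax) dominate the Euclidean distance. A hedged variant that standardizes the predictors using training-set means and standard deviations (not used for the results reported below):
train_x <- scale(train[, -1])
test_x <- scale(test[, -1], center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))
knn1_scaled_pred <- knn(train_x, test_x, cl = train$New_Crim, k = 10)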
# Storing the predictions of Logistic Regression, LDA, and Naive Bayes using test data
log1_prob <- predict(log1, test, type = "response")
log1_pred <- as.factor(ifelse(log1_prob > 0.5, 1, 0))
lda1_pred <- predict(lda1, test)$class
nab1_pred <- predict(nab1, test[, -1])  # all seven predictor columns (2:8), excluding only the response
# Creating Confusion Matrix
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
log1cf <- confusionMatrix(log1_pred, test$New_Crim)
lda1cf <- confusionMatrix(lda1_pred, test$New_Crim)
nab1cf <- confusionMatrix(nab1_pred, test$New_Crim)
knn1cf <- confusionMatrix(knn1_pred, test$New_Crim)
# Calculating accuracies of all the models; confusion-matrix cells are [1]=TN, [2]=FP, [3]=FN, [4]=TP, with class "1" treated as positive
log1_accu <- (log1cf$table[[1]] + log1cf$table[[4]])/length(test$New_Crim)
lda1_accu <- (lda1cf$table[[1]] + lda1cf$table[[4]])/length(test$New_Crim)
nab1_accu <- (nab1cf$table[[1]] + nab1cf$table[[4]])/length(test$New_Crim)
knn1_accu <- (knn1cf$table[[1]] + knn1cf$table[[4]])/length(test$New_Crim)
# Calculating Precision for all models
log1_prec <- (log1cf$table[[4]])/ (log1cf$table[[4]] + log1cf$table[[2]])
lda1_prec <- (lda1cf$table[[4]])/ (lda1cf$table[[4]] + lda1cf$table[[2]])
nab1_prec <- (nab1cf$table[[4]])/ (nab1cf$table[[4]] + nab1cf$table[[2]])
knn1_prec <- (knn1cf$table[[4]])/ (knn1cf$table[[4]] + knn1cf$table[[2]])
# Calculating Recall for all models
log1_rec <- (log1cf$table[[4]])/ (log1cf$table[[4]] + log1cf$table[[3]])
lda1_rec <- (lda1cf$table[[4]])/ (lda1cf$table[[4]] + lda1cf$table[[3]])
nab1_rec <- (nab1cf$table[[4]])/ (nab1cf$table[[4]] + nab1cf$table[[3]])
knn1_rec <- (knn1cf$table[[4]])/ (knn1cf$table[[4]] + knn1cf$table[[3]])
# Calculating F1 score for all the models
log1_f1 <- 2*((log1_prec * log1_rec)/(log1_prec + log1_rec))
lda1_f1 <- 2*((lda1_prec * lda1_rec)/(lda1_prec + lda1_rec))
nab1_f1 <- 2*((nab1_prec * nab1_rec)/(nab1_prec + nab1_rec))
knn1_f1 <- 2*((knn1_prec * knn1_rec)/(knn1_prec + knn1_rec))
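In recent versions of caret, these metrics can also be read directly from the confusionMatrix object once the positive class is set to "1"; a sketch for the logistic model:
cm <- confusionMatrix(log1_pred, test$New_Crim, positive = "1")
cm$byClass[c("Precision", "Recall", "F1")]  # should agree with the manual calculations above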
eval <- data.frame("Model" = c("Logistic Regression", "LDA", "Naive Bayes", "KNN"),
"Accuracy" = c(log1_accu, lda1_accu, nab1_accu, knn1_accu),
"Precision" = c(log1_prec, lda1_prec, nab1_prec, knn1_prec),
"Recall" = c(log1_rec, lda1_rec, nab1_rec, knn1_rec),
"F1 Score" = c(log1_f1, lda1_f1, nab1_f1, knn1_f1))
eval
## Model Accuracy Precision Recall F1.Score
## 1 Logistic Regression 0.9411765 0.9215686 0.9591837 0.940000
## 2 LDA 0.9411765 0.9215686 0.9591837 0.940000
## 3 Naive Bayes 0.8235294 0.8039216 0.8367347 0.820000
## 4 KNN 0.9705882 0.9423077 1.0000000 0.970297
Ans: Across all models, KNN with k = 10 gives the best accuracy. Logistic regression and linear discriminant analysis (LDA) show identical performance across all evaluation metrics on this test set. KNN also gives the best F1 score. For this problem, the objective is to identify the positive class, i.e. tracts where the crime rate is above the median: if a high-crime tract is classified as below the median, the requisite crime-reducing policies might not be implemented in that area. It is therefore critical to maximize recall (i.e. minimize false negatives). The KNN model has a recall of 1, which means that none of the positive-class tracts, i.e. those with crime rates above the median, were misclassified. Hence, KNN is the best performing model.
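Since k = 10 was fixed by assumption, a natural extension (a sketch, not run for the results above; the grid of k values is arbitrary) is to compare test-set accuracy across several choices of k:
sapply(c(1, 3, 5, 10, 15, 20), function(k) {
  pred <- knn(train[, -1], test[, -1], cl = train$New_Crim, k = k)
  mean(pred == test$New_Crim)  # test-set accuracy for this k
})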