Description: A data frame with 768 observations on 9 variables.
pregnant: Number of times pregnant
glucose: Plasma glucose concentration (glucose tolerance test)
pressure: Diastolic blood pressure (mm Hg)
triceps: Triceps skin fold thickness (mm)
insulin: 2-Hour serum insulin (mu U/ml)
mass: Body mass index
pedigree: Diabetes pedigree function
age: Age (years)
diabetes: Class variable (test for diabetes)
kyphosis: Data on Children who have had Corrective Spinal Surgery
Description: The kyphosis data frame has 81 rows and 4 columns.
Kyphosis: A factor with levels absent and present, indicating whether a kyphosis (a type of deformation) was present after the operation.
Age: Age in months
Number: The number of vertebrae involved
Start: The number of the first (topmost) vertebra operated on.
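# Packages assumed by the code in this section (the original setup chunk is not shown)
library(tidymodels) # rsample, recipes, parsnip, workflows, yardstick, dplyr, ggplot2
library(discrim)    # discrim_linear(), discrim_quad()
library(mlbench)    # PimaIndiansDiabetes
library(pROC)       # ggroc()
library(rpart.plot) # rpart.plot()
library(mgcv)       # gam(), s()
library(visreg)     # visreg()
data(PimaIndiansDiabetes)

set.seed(123)

# Split the data into training and testing sets
diabetes_split <- initial_split(PimaIndiansDiabetes, prop = 0.8)
diabetes_train <- training(diabetes_split)
diabetes_test <- testing(diabetes_split)

# Create a recipe for preprocessing
diabetes_recipe <- recipe(diabetes ~ age + glucose, data = diabetes_train) %>%
  step_normalize(all_predictors())

# Create a grid of points to visualize the decision boundary
age_range <- seq(min(PimaIndiansDiabetes$age) - 0.5, max(PimaIndiansDiabetes$age) + 0.5, length.out = 200)
glucose_range <- seq(min(PimaIndiansDiabetes$glucose) - 0.5, max(PimaIndiansDiabetes$glucose) + 0.5, length.out = 200)
grid <- expand.grid(age = age_range, glucose = glucose_range)

metrics <- metric_set(yardstick::accuracy, yardstick::sensitivity, yardstick::specificity)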
Logistic Regression
# Define the logistic regression model
logistic_model <- logistic_reg() %>%
  set_engine("glm")

# Create a workflow
diabetes_workflow <- workflow() %>%
  add_recipe(diabetes_recipe) %>%
  add_model(logistic_model)

# Fit the logistic regression model
logistic_fit <- fit(diabetes_workflow, data = diabetes_train)

# Make predictions on the test set
logistic_predictions <- predict(logistic_fit, diabetes_test, type = "prob") %>%
  bind_cols(diabetes_test)

# Add predictions as a factor to the dataset
logistic_predictions <- logistic_predictions %>%
  mutate(.pred_class = factor(ifelse(.pred_pos >= 0.5, "pos", "neg"), levels = levels(diabetes)))

# Confusion matrix
conf_mat <- logistic_predictions %>%
  conf_mat(truth = diabetes, estimate = .pred_class) %>%
  print

ggroc(roc_auc) +
  annotate("segment", x = 1, y = 0, xend = 0, yend = 1, col = "red", linetype = 3) +
  annotate("text", label = round(roc_auc$auc, 3), x = 0.5, y = 0.65, col = "darkblue")
# Predict the class for each point in the grid
grid_predictions <- predict(logistic_fit, new_data = grid, type = "class")
grid <- grid %>%
  mutate(Predicted = grid_predictions$.pred_class)

# Plot the decision boundary
ggplot() +
  geom_tile(data = grid, aes(x = age, y = glucose, fill = Predicted), alpha = 0.5) + # Background color
  geom_point(data = diabetes_train, aes(x = age, y = glucose, color = diabetes), size = 1) + # Training points
  scale_fill_manual(values = c("neg" = "red", "pos" = "blue")) + # Area colors
  scale_color_manual(values = c("neg" = "red", "pos" = "blue")) + # Point colors
  labs(title = "Logistic Decision Boundary", x = "age", y = "glucose", fill = "Prediction") +
  theme_minimal()
Bayes Discriminant Rule (Naive Bayes)
Suppose there are G groups, with \(P (obs \in g^{th} \text{ group})=\pi_g\) (prior probability).
Let \(\mathbf X\) have the conditional probability mass/density function \(f_{\mathbf X}(\mathbf x|G=g)\).
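By Bayes' theorem, the posterior probability of \(obs \in g^{th} \text{ group} | \mathbf {X=x}\) is \(P(G=g|\mathbf {X=x})=\frac{\pi_g f_{\mathbf X}(\mathbf x|G=g)}{\sum_{k=1}^{G}\pi_k f_{\mathbf X}(\mathbf x|G=k)}\). The Bayes rule allocates an observation to the group with the largest posterior probability.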
If \(f_{\mathbf X}(\mathbf x|G=g)\) is \(MVN(\mathbf \mu_g , \mathbf \Sigma)\), then \(d(\mathbf x) = \arg\max_g ~\mathbf \mu_{g}^{T} \mathbf \Sigma^{-1} \mathbf x - 0.5 \times \mathbf \mu_{g}^{T} \mathbf \Sigma^{-1} \mathbf \mu_{g} + \log \pi_g\).
If G=2, an observation is allocated to the first group if and only if \(\mathbf a^T(\mathbf x - \mathbf h) > \log \frac{\pi_2}{\pi_1}\), where \(\mathbf a=\mathbf \Sigma^{-1}(\mathbf \mu_1-\mathbf \mu_2)\) and \(\mathbf h=\frac{(\mathbf \mu_1+\mathbf \mu_2)}{2}\). This follows from requiring the group-1 score to exceed the group-2 score, since \(0.5 \times (\mathbf \mu_{1}^{T} \mathbf \Sigma^{-1} \mathbf \mu_{1} - \mathbf \mu_{2}^{T} \mathbf \Sigma^{-1} \mathbf \mu_{2}) = \mathbf a^T \mathbf h\).
If \(f_{\mathbf X}(\mathbf x|G=g)\) is \(MVN(\mathbf \mu_g , \mathbf \Sigma)\) and \(G \sim \text{Unif}\) (equal priors), then ‘Linear discriminant analysis’ can be used.
If \(f_{\mathbf X}(\mathbf x|G=g)\) is \(MVN(\mathbf \mu_g , \mathbf \Sigma_g)\) and \(G \sim \text{Unif}\) (equal priors), then ‘Quadratic discriminant analysis’ can be used.
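The two-group rule can be checked numerically. The sketch below uses made-up values for \(\mathbf \mu_1\), \(\mathbf \mu_2\), \(\mathbf \Sigma\) and the priors (all hypothetical, not estimated from any data set) and confirms that the arg-max rule and the rule \(\mathbf a^T(\mathbf x - \mathbf h) > \log(\pi_2/\pi_1)\) give the same allocation.

```r
# Hypothetical parameters for two groups (illustrative values only)
mu1   <- c(1, 2)
mu2   <- c(3, 1)
Sigma <- matrix(c(2.0, 0.5,
                  0.5, 1.0), nrow = 2, byrow = TRUE)
prior <- c(0.3, 0.7)

Sigma_inv <- solve(Sigma)
a <- Sigma_inv %*% (mu1 - mu2)
h <- (mu1 + mu2) / 2

# Linear discriminant score of one group at point x
lin_score <- function(x, mu, pi_g) {
  drop(t(mu) %*% Sigma_inv %*% x - 0.5 * t(mu) %*% Sigma_inv %*% mu + log(pi_g))
}

# Rule 1: allocate to the group with the largest score
classify_argmax <- function(x) {
  which.max(c(lin_score(x, mu1, prior[1]), lin_score(x, mu2, prior[2])))
}

# Rule 2: allocate to group 1 iff a'(x - h) > log(pi_2 / pi_1)
classify_two_group <- function(x) {
  if (drop(t(a) %*% (x - h)) > log(prior[2] / prior[1])) 1L else 2L
}

# Compare the two rules on 100 random points
set.seed(1)
pts <- matrix(rnorm(200, mean = 2, sd = 2), ncol = 2)
by_argmax <- apply(pts, 1, classify_argmax)
by_two    <- apply(pts, 1, classify_two_group)
all(by_argmax == by_two)
```

With equal priors the threshold is \(\log 1 = 0\), so the boundary passes through the midpoint \(\mathbf h\).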
Fisher’s linear discriminant analysis
Fisher approach:
Looks for linear discriminant functions.
Assumes no particular distribution for each population.
The total covariance matrix (S) decomposes into:
Within-class covariance matrix
Between-class covariance matrix
Project the data onto a lower-dimensional space Z=Xa that is optimal for classification.
Solution: the eigenvector of \(W^{-1}B\) corresponding to its largest eigenvalue.
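A minimal base-R sketch of this solution follows. It uses the built-in iris data purely as a stand-in, and works with within- and between-class sums of squares and products rather than covariance matrices (an assumption of this sketch; the scaling does not change the direction of the eigenvector).

```r
X <- as.matrix(iris[, 1:4])
g <- iris$Species

grand_mean <- colMeans(X)
p <- ncol(X)
W <- matrix(0, p, p)  # within-class sums of squares and products
B <- matrix(0, p, p)  # between-class sums of squares and products

for (lev in levels(g)) {
  Xg <- X[g == lev, , drop = FALSE]
  mg <- colMeans(Xg)
  Xc <- sweep(Xg, 2, mg)          # centre each class at its own mean
  W  <- W + t(Xc) %*% Xc
  d  <- mg - grand_mean
  B  <- B + nrow(Xg) * (d %o% d)  # class size times outer product
}

# The total matrix decomposes as W + B
Xt <- sweep(X, 2, grand_mean)
T_mat <- t(Xt) %*% Xt
max(abs(T_mat - (W + B)))  # numerically zero

# Fisher's direction: eigenvector of W^{-1} B with the largest eigenvalue
eig <- eigen(solve(W) %*% B)
a <- Re(eig$vectors[, 1])
lambda1 <- Re(eig$values[1])

# Projected scores z = X a, and the between/within ratio along a direction
z <- drop(X %*% a)
ratio <- function(v) drop(t(v) %*% B %*% v) / drop(t(v) %*% W %*% v)
c(largest_eigenvalue = lambda1, ratio_along_a = ratio(a))
```

The ratio of between-class to within-class variation along `a` equals the largest eigenvalue, which is why this direction is the best single projection for separating the groups.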
The function \(L(\mathbf x)=\mathbf a^\top \mathbf x\) is called Fisher’s linear discriminant function. Allocate an obs. to the population g if (discriminant rule):
# Define the LDA model
lda_model <- discrim_linear() %>%
  set_engine("MASS") %>% # Use the MASS package for LDA
  set_mode("classification")

# Create a workflow for LDA
lda_workflow <- workflow() %>%
  add_recipe(diabetes_recipe) %>%
  add_model(lda_model)

# Fit the LDA model
lda_fit <- fit(lda_workflow, data = diabetes_train)

# Predict class labels using LDA
lda_predictions <- predict(lda_fit, diabetes_test, type = "class") %>%
  bind_cols(predict(lda_fit, diabetes_test, type = "prob")) %>%
  bind_cols(diabetes_test)

# Evaluate the LDA model
accuracy <- lda_predictions %>%
  metrics(truth = diabetes, estimate = .pred_class) %>%
  print

ggroc(roc_auc) +
  annotate("segment", x = 1, y = 0, xend = 0, yend = 1, col = "red", linetype = 3) +
  annotate("text", label = round(roc_auc$auc, 3), x = 0.5, y = 0.65, col = "darkblue")

# Predict the class for each point in the grid
grid_predictions <- predict(lda_fit, new_data = grid, type = "class")
grid <- grid %>%
  mutate(Predicted = grid_predictions$.pred_class)

# Plot the decision boundary
ggplot() +
  geom_tile(data = grid, aes(x = age, y = glucose, fill = Predicted), alpha = 0.5) + # Background color
  geom_point(data = diabetes_train, aes(x = age, y = glucose, color = diabetes), size = 1) + # Training points
  scale_fill_manual(values = c("neg" = "red", "pos" = "blue")) + # Area colors
  scale_color_manual(values = c("neg" = "red", "pos" = "blue")) + # Point colors
  labs(title = "LDA decision Boundary", x = "age", y = "glucose", fill = "Prediction") +
  theme_minimal()
Quadratic discriminant analysis
Quadratic Discriminant Analysis (QDA) is a statistical classification technique used in machine learning and pattern recognition to categorize observations into predefined classes. It extends the ideas of Linear Discriminant Analysis (LDA) by allowing for more flexible decision boundaries between classes. QDA is particularly useful when the assumption of equal covariance matrices across classes (as in LDA) does not hold.
Comparison with Linear Discriminant Analysis (LDA)
Covariance Assumption:
LDA: Assumes that all classes share the same covariance matrix (\(\boldsymbol{\Sigma}_k = \boldsymbol{\Sigma}\) for all \(k\)). This leads to linear decision boundaries.
QDA: Allows each class to have its own covariance matrix (\(\boldsymbol{\Sigma}_k\) can differ across classes), resulting in quadratic decision boundaries.
Flexibility:
LDA: Less flexible due to the shared covariance assumption. May perform poorly if the covariance matrices are significantly different across classes.
QDA: More flexible and can model more complex class distributions.
Parameter Estimation:
LDA: Estimates fewer parameters since the covariance matrix is shared.
QDA: Estimates a separate covariance matrix for each class, requiring more data to avoid overfitting.
Limitations of QDA
Higher Complexity: Requires estimating a separate covariance matrix for each class, leading to a larger number of parameters.
Data Requirements: Needs more training data to accurately estimate the covariance matrices, especially in high-dimensional feature spaces.
Risk of Overfitting: With limited data, QDA can overfit the training set, reducing its generalization performance.
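To make the difference in complexity concrete, the sketch below counts the mean and covariance parameters each method estimates for p features and K classes. The helper name `count_params` is hypothetical, and the K - 1 free prior probabilities are left out because they are the same for both methods.

```r
# Number of mean and covariance parameters for LDA and QDA
count_params <- function(p, K) {
  cov_params <- p * (p + 1) / 2      # free entries of one symmetric p x p matrix
  c(LDA = K * p + cov_params,        # K mean vectors + one shared covariance matrix
    QDA = K * p + K * cov_params)    # K mean vectors + K covariance matrices
}

count_params(p = 2, K = 2)  # two predictors, two classes
count_params(p = 8, K = 2)  # eight predictors, two classes
```

The gap grows quadratically in p, which is why QDA needs more data as the number of features increases.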
Example: Pima Indians Diabetes
# Define the QDA model
qda_model <- discrim_quad() %>%
  set_engine("MASS") %>% # Use the MASS package for QDA
  set_mode("classification")

# Create a workflow for QDA
qda_workflow <- workflow() %>%
  add_recipe(diabetes_recipe) %>%
  add_model(qda_model)

# Fit the QDA model
qda_fit <- fit(qda_workflow, data = diabetes_train)

# Predict class labels using QDA
qda_predictions <- predict(qda_fit, diabetes_test, type = "class") %>%
  bind_cols(predict(qda_fit, diabetes_test, type = "prob")) %>%
  bind_cols(diabetes_test)

# Evaluate the QDA model
accuracy <- qda_predictions %>%
  metrics(truth = diabetes, estimate = .pred_class) %>%
  print

ggroc(roc_auc) +
  annotate("segment", x = 1, y = 0, xend = 0, yend = 1, col = "red", linetype = 3) +
  annotate("text", label = round(roc_auc$auc, 3), x = 0.5, y = 0.65, col = "darkblue")

# Predict the class for each point in the grid
grid_predictions <- predict(qda_fit, new_data = grid, type = "class")
grid <- grid %>%
  mutate(Predicted = grid_predictions$.pred_class)

# Plot the decision boundary
ggplot() +
  geom_tile(data = grid, aes(x = age, y = glucose, fill = Predicted), alpha = 0.5) + # Background color
  geom_point(data = diabetes_train, aes(x = age, y = glucose, color = diabetes), size = 1) + # Training points
  scale_fill_manual(values = c("neg" = "red", "pos" = "blue")) + # Area colors
  scale_color_manual(values = c("neg" = "red", "pos" = "blue")) + # Point colors
  labs(title = "QDA Decision Boundary", x = "age", y = "glucose", fill = "Prediction") +
  theme_minimal()
K-Nearest Neighbors
KNN is one of the simplest and most intuitive machine learning algorithms.
KNN is used for both classification and regression tasks.
KNN can be remarkably effective, especially when the decision boundary is irregular.
KNN operates on the principle that similar data points lie close to each other.
KNN predicts the class of a new point from the k closest sample points.
KNN does not build an explicit model.
KNN Algorithm
Choose the number of neighbors (k):
Compute the distance: (e.g., Euclidean, Manhattan).
Identify the nearest neighbors:
Make a prediction:
For classification: Determine the majority class among the k neighbors.
For regression: Compute the average (or weighted average) of the target values of the k neighbors.
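The steps above can be sketched in a few lines of base R. This is an illustrative implementation, not the kknn engine used elsewhere in these notes: it assumes Euclidean distance and unweighted majority voting, uses the built-in iris data as a stand-in, and the function name `knn_predict` is hypothetical.

```r
# Classify each row of new_x by majority vote among its k nearest training points
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  train_x <- as.matrix(train_x)
  new_x   <- as.matrix(new_x)
  apply(new_x, 1, function(x0) {
    d  <- sqrt(rowSums(sweep(train_x, 2, x0)^2))  # Euclidean distance to every training point
    nn <- order(d)[seq_len(k)]                    # indices of the k nearest neighbours
    votes <- table(train_y[nn])                   # class counts among the neighbours
    names(votes)[which.max(votes)]                # majority class (ties go to the first level)
  })
}

set.seed(42)
idx <- sample(nrow(iris), 100)

# Standardise with the training means and standard deviations only
train_x <- scale(iris[idx, 1:4])
test_x  <- scale(iris[-idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

pred <- knn_predict(train_x, iris$Species[idx], test_x, k = 5)
test_accuracy <- mean(pred == iris$Species[-idx])
test_accuracy
```

Scaling matters here because the distance treats all coordinates equally; a predictor measured on a larger scale would otherwise dominate the neighbour search.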
# Predict the class for each point in the grid
grid_predictions <- predict(knn1_fit, new_data = grid, type = "class")
grid <- grid %>%
  mutate(Predicted = grid_predictions$.pred_class)

# Plot the decision boundary
ggplot() +
  geom_tile(data = grid, aes(x = age, y = glucose, fill = Predicted), alpha = 0.5) + # Background color
  geom_point(data = diabetes_train, aes(x = age, y = glucose, color = diabetes), size = 1) + # Training points
  scale_fill_manual(values = c("neg" = "red", "pos" = "blue")) + # Area colors
  scale_color_manual(values = c("neg" = "red", "pos" = "blue")) + # Point colors
  labs(title = "k-NN Decision Boundary (k=1)", x = "age", y = "glucose", fill = "Prediction") +
  theme_minimal()
Let K=5
# Define the k-NN model
knn_model <- nearest_neighbor(neighbors = 5, weight_func = "rectangular") %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Create a workflow
knn_workflow <- workflow() %>%
  add_recipe(diabetes_recipe) %>%
  add_model(knn_model)

# Fit the k-NN model
knn5_fit <- fit(knn_workflow, data = diabetes_train)

# Make predictions on the test set
knn5_predictions <- predict(knn5_fit, diabetes_test, type = "class") %>%
  bind_cols(diabetes_test)

# Confusion matrix
conf_mat <- knn5_predictions %>%
  conf_mat(truth = diabetes, estimate = .pred_class) %>%
  print

# Predict the class for each point in the grid
grid_predictions <- predict(knn5_fit, new_data = grid, type = "class")
grid <- grid %>%
  mutate(Predicted = grid_predictions$.pred_class)

# Plot the decision boundary
ggplot() +
  geom_tile(data = grid, aes(x = age, y = glucose, fill = Predicted), alpha = 0.5) + # Background color
  geom_point(data = diabetes_train, aes(x = age, y = glucose, color = diabetes), size = 1) + # Training points
  scale_fill_manual(values = c("neg" = "red", "pos" = "blue")) + # Area colors
  scale_color_manual(values = c("neg" = "red", "pos" = "blue")) + # Point colors
  labs(title = "k-NN Decision Boundary (k=5)", x = "age", y = "glucose", fill = "Prediction") +
  theme_minimal()
Let K=50
# Define the k-NN model
knn_model <- nearest_neighbor(neighbors = 50, weight_func = "rectangular") %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Create a workflow
knn_workflow <- workflow() %>%
  add_recipe(diabetes_recipe) %>%
  add_model(knn_model)

# Fit the k-NN model
knn50_fit <- fit(knn_workflow, data = diabetes_train)

# Make predictions on the test set
knn50_predictions <- predict(knn50_fit, diabetes_test, type = "class") %>%
  bind_cols(diabetes_test)

# Confusion matrix
conf_mat <- knn50_predictions %>%
  conf_mat(truth = diabetes, estimate = .pred_class) %>%
  print

# Predict the class for each point in the grid
grid_predictions <- predict(knn50_fit, new_data = grid, type = "class")
grid <- grid %>%
  mutate(Predicted = grid_predictions$.pred_class)

# Plot the decision boundary
ggplot() +
  geom_tile(data = grid, aes(x = age, y = glucose, fill = Predicted), alpha = 0.5) + # Background color
  geom_point(data = diabetes_train, aes(x = age, y = glucose, color = diabetes), size = 1) + # Training points
  scale_fill_manual(values = c("neg" = "red", "pos" = "blue")) + # Area colors
  scale_color_manual(values = c("neg" = "red", "pos" = "blue")) + # Point colors
  labs(title = "k-NN Decision Boundary (k=50)", x = "age", y = "glucose", fill = "Prediction") +
  theme_minimal()
Classification Tree.
Statistical Problem
Let:
Y be a response variable,
X1, … , Xp, be a set of p predictors,
The X’s are treated as fixed, and Y is a random variable.
Then the problem is to establish a relationship between Y and the X’s.
Mathematically, the aim is to estimate the conditional probability, or a function of it such as the conditional expectation.
The statistician’s job is to extract important patterns and trends, and to understand “what the data says” (learning from data).
Methods for supervised learning lie between two ends of a spectrum:
Models on one side rely heavily on strong assumptions (low variance and high bias).
Logistic regression.
Models on the other side do not rely on any rigorous assumptions (high variance and low bias).
KNN
Tree elements
The tree has 3 layers of nodes.
1st layer (depth=1): the root node.
2nd layer (depth=2), one internal and one external node.
3rd layer (depth=3) two terminal nodes.
Parent: an internal node that is partitioned into two nodes.
Daughters/offspring/descendants: the two nodes obtained from a parent.
Partitioning of subjects
Y: preterm/term delivery.
Covariates, X1 (age) & X13 (amount of drinking)
Every node is a subset of the sample.
The terminal nodes partition the sample.
Node (I) x13 ≤ c2.
Node (II) x13 > c2 and x1 ≤ c1.
Node (III) x13 > c2 and x1 > c1.
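The three terminal nodes can be written as a small assignment function. The cut-offs `c1` and `c2` and the four subjects below are hypothetical values chosen only to illustrate that every subject falls into exactly one terminal node.

```r
# Assign each subject to terminal node I, II or III
assign_node <- function(x1, x13, c1, c2) {
  ifelse(x13 <= c2, "I",
         ifelse(x1 <= c1, "II", "III"))
}

# Hypothetical subjects: x1 = age, x13 = amount of drinking
women <- data.frame(x1  = c(22, 35, 28, 41),
                    x13 = c(0, 3, 5, 1))
women$node <- assign_node(women$x1, women$x13, c1 = 30, c2 = 1)
women
table(women$node)
```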
Aim of the partitioning
Complete homogeneity of terminal nodes is an ideal that is rarely realized.
The aim is to partition the subjects so that the terminal nodes are as homogeneous as possible in terms of the response, based on the X’s.
Node homogeneity is quantified through the node impurity.
Node homogeneity = \(\frac{\text{n. of women having a preterm delivery in that node}}{\text{Total n. of women in that node}}\)
The closer this ratio is to 0 or 1, the more homogeneous is the node.
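A short sketch of how this ratio can be turned into an impurity value follows. The notes do not say which impurity function is used, so the Gini index and the entropy are shown here as two common choices (an assumption), and the counts are hypothetical.

```r
# Proportion of events in a node and two common impurity measures
node_impurity <- function(n_event, n_total) {
  p <- n_event / n_total
  plogp <- function(q) ifelse(q == 0, 0, q * log(q))  # convention: 0 * log(0) = 0
  c(proportion = p,
    gini       = 2 * p * (1 - p),
    entropy    = -(plogp(p) + plogp(1 - p)))
}

node_impurity(0, 40)   # perfectly homogeneous node: impurity 0
node_impurity(36, 40)  # mostly preterm: low impurity
node_impurity(20, 40)  # evenly mixed node: maximal impurity
```

Both measures are 0 when the ratio is 0 or 1 and largest when it is 0.5, matching the statement that ratios near 0 or 1 indicate a homogeneous node.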
Splitting a Node
For each predictor variable:
Enumerate all allowable splits.
Select the best split based on a goodness-of-split criterion.
Compare the best split of different variables and select among them (the best of the best).
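The search can be sketched as follows. This is an illustration under stated assumptions: predictors are numeric, allowable splits are the midpoints between consecutive observed values, and goodness of split is the reduction in Gini impurity (the notes do not fix a criterion). The helper names are hypothetical and the built-in iris data stand in for a real problem.

```r
# Gini impurity of a set of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Best split of one numeric predictor: the cut with the largest impurity reduction
best_split <- function(x, y) {
  vals <- sort(unique(x))
  cuts <- (head(vals, -1) + tail(vals, -1)) / 2  # midpoints between observed values
  gain <- sapply(cuts, function(cut) {
    left  <- y[x <= cut]
    right <- y[x > cut]
    gini(y) - (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  list(cut = cuts[which.max(gain)], gain = max(gain))
}

# Binary response: is the flower a setosa?
y <- factor(iris$Species == "setosa")

# Best split for each predictor, then the best of the best
splits   <- lapply(iris[, 1:4], best_split, y = y)
gains    <- sapply(splits, `[[`, "gain")
best_var <- names(which.max(gains))
gains
best_var
```

Here both petal measurements separate the classes perfectly, so their gains tie at the parent impurity and `which.max` returns the first of them.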
Example: Pima Indians Diabetes
With only 2 variables
# Specify the decision tree model
tree_model <- decision_tree() %>% # Use decision_tree() from the 'parsnip' package
  set_engine("rpart") %>% # Set the engine to 'rpart'
  set_mode("classification") # Set the mode to classification

# Create a workflow
tree_wf <- workflow() %>%
  add_model(tree_model) %>%
  add_recipe(diabetes_recipe)

# Fit the model to the training data
tree_fit <- tree_wf %>%
  fit(data = diabetes_train)

# Extract the fitted tree
fitted_tree <- tree_fit %>%
  extract_fit_engine()

# Plot the decision tree
rpart.plot(fitted_tree)

# Make predictions on the test data
tree_predictions <- predict(tree_fit, new_data = diabetes_test, type = "prob") %>%
  bind_cols(diabetes_test)
tree_predictions <- tree_predictions %>%
  mutate(.pred_class = factor(ifelse(.pred_pos >= 0.5, "pos", "neg"), levels = levels(diabetes)))

# Evaluate the model
accuracy <- tree_predictions %>%
  metrics(truth = diabetes, estimate = .pred_class) %>%
  print

ggroc(roc_auc) +
  annotate("segment", x = 1, y = 0, xend = 0, yend = 1, col = "red", linetype = 3) +
  annotate("text", label = round(roc_auc$auc, 3), x = 0.5, y = 0.65, col = "darkblue")

# Predict the class for each point in the grid
grid_predictions <- predict(tree_fit, new_data = grid, type = "class")
grid <- grid %>%
  mutate(Predicted = grid_predictions$.pred_class)

# Plot the decision boundary
ggplot() +
  geom_tile(data = grid, aes(x = age, y = glucose, fill = Predicted), alpha = 0.5) + # Background color
  geom_point(data = diabetes_train, aes(x = age, y = glucose, color = diabetes), size = 1) + # Training points
  scale_fill_manual(values = c("neg" = "red", "pos" = "blue")) + # Area colors
  scale_color_manual(values = c("neg" = "red", "pos" = "blue")) + # Point colors
  labs(title = "Classification Tree", x = "age", y = "glucose", fill = "Prediction") +
  theme_minimal()
With All variables
diabetes_recipe <- recipe(diabetes ~ ., data = diabetes_train) %>%
  step_normalize(all_predictors())

# Specify the decision tree model
tree_model <- decision_tree() %>% # Use decision_tree() from the 'parsnip' package
  set_engine("rpart") %>% # Set the engine to 'rpart'
  set_mode("classification") # Set the mode to classification

# Create a workflow
tree_wf <- workflow() %>%
  add_model(tree_model) %>%
  add_recipe(diabetes_recipe)

# Fit the model to the training data
tree_fit <- tree_wf %>%
  fit(data = diabetes_train)

# Extract the fitted tree
fitted_tree <- tree_fit %>%
  extract_fit_engine()

# Plot the decision tree
rpart.plot(fitted_tree)

# Make predictions on the test data
tree_predictions <- predict(tree_fit, new_data = diabetes_test, type = "prob") %>%
  bind_cols(diabetes_test)
tree_predictions <- tree_predictions %>%
  mutate(.pred_class = factor(ifelse(.pred_pos >= 0.5, "pos", "neg"), levels = levels(diabetes)))

# Evaluate the model
accuracy <- tree_predictions %>%
  metrics(truth = diabetes, estimate = .pred_class) %>%
  print

ggroc(roc_auc) +
  annotate("segment", x = 1, y = 0, xend = 0, yend = 1, col = "red", linetype = 3) +
  annotate("text", label = round(roc_auc$auc, 3), x = 0.5, y = 0.65, col = "darkblue")
Logistic Regression vs. LDA vs. KNN vs. Classification Tree.
1. Logistic Regression
Models the probability of a class using a logistic (sigmoid) function.
Assumptions:
Linearity: The log-odds of the outcome is a linear combination of the input features.
Independence: Observations are independent.
No or little multicollinearity: Predictors are not highly correlated.
Strengths:
Interpretability: Coefficients can be interpreted in terms of odds ratios.
Efficiency: Computationally less intensive; scales well with large datasets.
Probabilistic Outputs: Provides probabilities for class memberships.
Regularization: Extensions like Lasso or Ridge can handle multicollinearity and feature selection.
Handles Mixed Data Types: Can manage numerical and categorical features.
Robust: makes no assumption about a distribution for X.
Weaknesses:
Linearity Assumption: May not capture complex relationships.
Sensitive to Outliers: Can be skewed by extreme values.
Feature extraction: May require transformation or interaction terms to capture non-linearities.
2. Linear Discriminant Analysis (LDA)
A classification method that projects data onto a lower-dimensional space while preserving class separability.
Assumptions:
Normality: Features are normally distributed within each class.
Equal Covariance Matrices: All classes share the same covariance matrix.
Linearity: Decision boundaries are linear.
Strengths:
Efficiency: Works well when assumptions are met; computationally inexpensive.
Dimensionality Reduction: Can reduce feature space, improving visualization and reducing noise.
Robustness: Performs well with small sample sizes relative to feature dimensions.
Weaknesses:
Assumption Sensitivity: Performance can degrade if the normality or equal-covariance assumptions are violated.
Linearity: Cannot capture non-linear relationships between features and classes.
Scaling: Sensitive to the scale of the features; requires standardization.
3. k-Nearest Neighbors (kNN)
A non-parametric, instance-based learning algorithm that classifies based on the majority class among the k closest training examples.
Assumptions:
Locality: Assumes that similar instances are near each other in feature space.
Strengths:
Simplicity: Easy to understand and implement.
Flexibility: Can handle multi-class problems and adapt to complex decision boundaries.
No Training Phase: Immediate classification without a separate training phase.
4. Classification Tree
Splits the feature space into regions based on feature values to make predictions.
Assumptions:
Strengths:
Interpretability: Results in a model that’s easy to visualize and understand.
Handles Mixed Data Types: Can manage numerical and categorical features.
Non-linear Relationships and Interactions
Feature Importance: Can provide insights into feature importance.
Weaknesses:
Overfitting: Especially with deep trees.
Instability: Small changes in data can lead to different tree structures.
Scalability: Can become computationally expensive with very large datasets.
Generalized Additive Models (GAMs)
Some predictors have an unknown or complicated relationship with the response. The linear effect of those explanatory variables can be replaced by smooth functions (a more general structure).
+ \(\rightarrow\) Fewer potentially incorrect conclusions due to model misspecification
- \(\rightarrow\) Need to choose among a potentially infinite number of smooth forms
- \(\rightarrow\) The number of parameters is much larger
- \(\rightarrow\) Overfitting
GLM: extends LM
- Distribution of response \(\rightarrow\) Exponential family
- Link function \(\rightarrow\) Any differentiable monotone function
GAM: extends GLM
- Linear predictor function \(\rightarrow\) Smooth but additive predictor function
\(g(\mu_i)=\alpha+s_1(x_{i1})+\dots+s_p(x_{ip})\)
What are Generalized Additive Models (GAMs)?
Definition:
GAMs are an extension of Generalized Linear Models (GLMs) that allow for non-linear relationships between predictors and the response variable.
Key Feature:
Use smooth functions \(s_j\) to model the relationship between the jth predictor and the response.
Flexibility:
Capture complex patterns without specifying the exact form of the non-linearity.
Components of GAMs
Response Variable (\(Y\)):
Has distribution from exponential family
Link function
Predictors:
Each predictor can have a non-linear effect modeled by a smooth function \(s_j\).
Goal: Build a GAM to predict the probability of diabetes based on age and glucose.
# Fit the GAM model on training data
gam_fit1 <- gam(diabetes ~ s(glucose) + s(age), family = binomial(link = "logit"), data = diabetes_train)

# Summary of the model
summary(gam_fit1)

ggroc(roc_auc) +
  annotate("segment", x = 1, y = 0, xend = 0, yend = 1, col = "red", linetype = 3) +
  annotate("text", label = round(roc_auc$auc, 3), x = 0.5, y = 0.65, col = "darkblue")

# Predict the class for each point in the grid
grid_predictions <- predict(gam_fit1, newdata = grid, type = "response")
grid <- grid %>%
  mutate(Predicted = factor(ifelse(grid_predictions > 0.5, "pos", "neg")))

# Plot the decision boundary
ggplot() +
  geom_tile(data = grid, aes(x = age, y = glucose, fill = Predicted), alpha = 0.5) + # Background color
  geom_point(data = diabetes_train, aes(x = age, y = glucose, color = diabetes), size = 1) + # Training points
  scale_fill_manual(values = c("neg" = "red", "pos" = "blue")) + # Area colors
  scale_color_manual(values = c("neg" = "red", "pos" = "blue")) + # Point colors
  labs(title = "GAM decision Boundary", x = "age", y = "glucose", fill = "Prediction") +
  theme_minimal()
# Example: Visualize the effect of Glucose on Outcome
visreg(gam_fit1, "glucose", scale = "response", gg = TRUE) +
  ggtitle("Effect of Glucose on Diabetes Probability") +
  theme_minimal()

# Visualize the effect of Age on Outcome
visreg(gam_fit1, "age", scale = "response", gg = TRUE) +
  ggtitle("Effect of Age on Diabetes Probability") +
  theme_minimal()
The Bias/Variance Tradeoff
Smoothing methods can provide different degrees of smoothing (controlled by tuning parameter(s)).
Greater smoothness \(\rightarrow\) lower variance, higher bias.
Selecting an effective df for \(s_j\) \(\rightarrow\) controls the degree of smoothness.
effective df = 3 is somewhat similar in overall complexity to a third-degree polynomial.
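The link between the basis dimension and the effective df can be checked on simulated data. The sketch below assumes mgcv's `gam()` with an unpenalized smooth (`fx = TRUE`), in which case the smooth uses k - 1 degrees of freedom (one is absorbed by the identifiability constraint). The data-generating model is made up for illustration.

```r
library(mgcv)

# Simulated binary response with a non-linear effect of x (hypothetical example)
set.seed(123)
n <- 300
x <- runif(n, 0, 10)
y <- rbinom(n, 1, plogis(sin(x)))

# Fit unpenalized smooths with increasing basis dimension k and record the edf
edf_for_k <- sapply(3:7, function(k) {
  fit <- gam(y ~ s(x, fx = TRUE, k = k), family = binomial(link = "logit"))
  summary(fit)$edf  # effective df of the smooth term
})

data.frame(k = 3:7, edf = edf_for_k)
```

Larger k gives a wigglier fitted curve (lower bias, higher variance); smaller k gives a smoother one.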
visreg(update(gam_fit1, . ~ s(glucose, fx = TRUE, k = 3)), "glucose", scale = "response", gg = TRUE) +
  ggtitle("Effect of Glucose on Diabetes Probability (edf=2)") +
  theme_minimal()

visreg(update(gam_fit1, . ~ s(glucose, fx = TRUE, k = 4)), "glucose", scale = "response", gg = TRUE) +
  ggtitle("Effect of Glucose on Diabetes Probability (edf=3)") +
  theme_minimal()

visreg(update(gam_fit1, . ~ s(glucose, fx = TRUE, k = 5)), "glucose", scale = "response", gg = TRUE) +
  ggtitle("Effect of Glucose on Diabetes Probability (edf=4)") +
  theme_minimal()

visreg(update(gam_fit1, . ~ s(glucose, fx = TRUE, k = 6)), "glucose", scale = "response", gg = TRUE) +
  ggtitle("Effect of Glucose on Diabetes Probability (edf=5)") +
  theme_minimal()

visreg(update(gam_fit1, . ~ s(glucose, fx = TRUE, k = 7)), "glucose", scale = "response", gg = TRUE) +
  ggtitle("Effect of Glucose on Diabetes Probability (edf=6)") +
  theme_minimal()
Regularization for High-Dimensional Categorical Data
Emphasizing Logistic Regression
1. Introduction
High-Dimensional Data: Situations where the number of predictors (p) is large relative to the number of observations (n).
Regularization: Techniques to prevent overfitting by adding a penalty to the model complexity.
High-dimensional data are increasingly common in applications in genomics, biomedical imaging and so on.
2. Penalized-Likelihood Methods and Lq-Norm Smoothing
Maximum Likelihood Estimation (MLE)
MLE aims to find parameter values that maximize the likelihood function.
In high dimensions, MLE can lead to overfitting.
overfitting: Model captures noise instead of the underlying pattern.
Penalized Likelihood
Objective: Penalize the log-likelihood with a penalty term, \(s(\beta)\), to reduce model complexity.
\(s(\beta)\) is smaller when the elements of \(\beta\) are smoother (closer to 0). An example is the \(L_q\)-norm smoothing function, which generally shrinks the ML estimates toward 0.
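Written out, a penalized log-likelihood with an \(L_q\)-norm penalty takes the following standard form. The symbols \(L^{*}\), \(L\) and the tuning parameter \(\lambda\) are notation assumed here, since the notes only name \(s(\beta)\).

```latex
L^{*}(\boldsymbol\beta) = L(\boldsymbol\beta) - s(\boldsymbol\beta),
\qquad
s(\boldsymbol\beta) = \lambda \sum_{j} |\beta_j|^{q},
\qquad \lambda \ge 0 .
```

Here \(L(\boldsymbol\beta)\) is the ordinary log-likelihood. Setting \(\lambda = 0\) recovers ML estimation, \(q = 1\) gives the lasso, and \(q = 2\) gives ridge regression; larger \(\lambda\) means stronger shrinkage.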
9 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) -5.937235394
pregnant 0.067571906
glucose 0.026149025
pressure .
triceps .
insulin .
mass 0.046798167
pedigree 0.268841839
age 0.004099019
cat("\nselected lambda:\t", 0.1,"\n")
selected lambda: 0.1
coef(lasso_model, s = .1)
9 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) -2.99982384
pregnant .
glucose 0.01738912
pressure .
triceps .
insulin .
mass 0.00721427
pedigree .
age .
4. Why Shrink Maximum Likelihood Estimates Toward 0?
In a study with a large number of X’s \(\rightarrow\) most of them have no or only minor effects.
For instance, in genetic association studies (thousands of genes as X’s and whether a person has a particular disease as Y), \(|MLE(\beta_j)|\) tends to be much larger than \(|\beta_j|\) (because of sampling variation, collinearity, quasi-complete or complete separation, sparseness, and so on).
Benefits of Shrinkage
Variance-Bias Tradeoff: Shrinking coefficients reduces variance at the cost of a small increase in bias.
Improved Prediction Accuracy: By reducing variance.
Model Interpretability: Simplifies the model by selecting relevant features.
Numerical Stability: Avoids issues with multicollinearity.
Disadvantage of Lasso
May overly penalize \(\beta_j\) that is truly large.
The \(\hat{\beta_j}\) obtained from lasso methods do not have an approximately normal sampling distribution.
The \(\hat{\beta_j}\) obtained from lasso methods are biased.
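The over-penalization and the bias are easiest to see in a simplified setting. The sketch below is not logistic regression: it assumes a least-squares problem with an orthonormal design, where the lasso solution is known to be the soft-thresholded least-squares estimate. The coefficient values and \(\lambda\) are hypothetical.

```r
# Lasso solution under an orthonormal design: soft-thresholding of the OLS estimates
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

# Hypothetical unpenalized estimates
b_ols <- c(big = 5, medium = 1.2, small = 0.3, negative = -0.8)

b_lasso <- soft_threshold(b_ols, lambda = 0.5)
rbind(ols = b_ols, lasso = b_lasso)
```

Every non-zero estimate is pulled toward 0 by the same amount \(\lambda\), so even the truly large effect is shrunk, while the small one is set exactly to 0.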
Variable-elimination methods: stepwise methods and the lasso.
Replace the p explanatory variables with a small set of artificial uncorrelated variables (principal components).
Identify the relevant effects using standard significance tests, but with an adjustment for the possibly huge number of tests conducted. One way to do this uses the false discovery rate (FDR).
6. Controlling the False Discovery Rate
Multiple Comparisons Problem
If p is large and few variables have an effect, consider testing \(H_0: \beta_j = 0\) at \(\alpha = 0.05\) for each explanatory variable:
Without adjustment \(\rightarrow\) the probability of at least one type I error is far above the nominal level \(\rightarrow\) most of the detected effects are type I errors.
With Bonferroni adjustment \(\rightarrow\) power may be very low for detecting true effects.
Recall: if g is the number of tests, Bonferroni’s method uses \(\alpha/g\) as the size of each individual test.
False Discoveries: Incorrectly identifying a variable as significant.
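The effect of the two adjustments can be compared on simulated p-values using base R's `p.adjust()`. The setup is hypothetical: 1000 two-sided z-tests, of which only the first 50 have a true effect, with the Benjamini-Hochberg method standing in as one common way to control the FDR.

```r
set.seed(2024)
g <- 1000  # number of tests

# Hypothetical z-statistics: 50 true effects, 950 nulls
z <- c(rnorm(50, mean = 3.5), rnorm(g - 50, mean = 0))
p <- 2 * pnorm(-abs(z))  # two-sided p-values

alpha  <- 0.05
p_bonf <- p.adjust(p, method = "bonferroni")  # equals min(1, g * p)
p_bh   <- p.adjust(p, method = "BH")          # Benjamini-Hochberg FDR adjustment

n_rejected <- c(unadjusted = sum(p < alpha),
                bonferroni = sum(p_bonf < alpha),
                BH         = sum(p_bh < alpha))
n_rejected
```

Bonferroni never rejects more hypotheses than BH, and BH never rejects more than the unadjusted tests, which reflects the trade-off between false discoveries and power described above.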