Previously on Stat412:

  • Dimension Reduction

  • Categorical Data Analysis

Today, we’ll focus on logistic regression and robust statistics.

Logistic Regression

Logistic regression is a statistical method for estimating the probability that an event occurs. It is commonly used when the dependent variable (the outcome being predicted) is categorical with only two possible outcomes (e.g., yes/no, success/failure, 0/1). It can also be extended to settings with more than two outcomes, either unordered (multinomial logistic regression) or ordered (ordinal logistic regression). Unlike linear regression, which models a continuous response, logistic regression predicts probabilities for categorical outcomes, which are bounded between 0 and 1.
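
Before working with real data, it may help to see the link function itself. The short sketch below (illustrative only, not tied to the heart data) plots the logistic (sigmoid) function, which maps any linear predictor onto a probability between 0 and 1.

# Illustrative sketch: the logistic link maps the linear predictor (log odds)
# onto the (0, 1) probability scale.
logistic = function(eta) 1 / (1 + exp(-eta))
eta = seq(-6, 6, by = 0.1)
plot(eta, logistic(eta), type = "l",
     xlab = "Linear predictor (log odds)", ylab = "P(Y = 1)")
abline(h = c(0, 1), lty = 2)  # probabilities never leave (0, 1)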

The data set “heart” is used to perform logistic regression. It contains information on factors associated with heart disease in a population. Assume that you are trying to use this knowledge to better predict who has heart disease so you can intervene and help improve their health and quality of life. In other words, you want to predict the likelihood, or probability, that a person with certain characteristics has heart disease. The binary target variable is “heart_disease” (yes or no).

heart= read.csv("heart_data.csv",sep = ",")
class(heart) #see class of the data
## [1] "data.frame"
dim(heart)  #10000 rows & 4 columns
## [1] 10000     4

Data set info:

  • heart_disease: an indicator field corresponding to whether an individual has heart disease (1 = yes, heart disease; 0 = no heart disease)

  • coffee_drinker: an indicator field corresponding to whether an individual drinks coffee regularly (1 = yes, coffee drinker; 0 = not a coffee drinker)

  • fast_food_spend: a numerical field corresponding to the annual spend of each individual on fast food

  • income: a numerical field corresponding to the individual’s annual income

head(heart)
heart_disease coffee_drinker fast_food_spend    income
            0              0        1823.816 44361.625
            0              1        2042.951 12106.135
            0              0        2683.873 31767.139
            0              0        1323.127 35704.494
            0              0        1964.140 38463.496
            0              1        2298.971  7491.559
summary(heart) 
##  heart_disease    coffee_drinker   fast_food_spend     income     
##  Min.   :0.0000   Min.   :0.0000   Min.   :   0    Min.   :  772  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1204    1st Qu.:21340  
##  Median :0.0000   Median :0.0000   Median :2059    Median :34553  
##  Mean   :0.0333   Mean   :0.2944   Mean   :2088    Mean   :33517  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:2916    3rd Qu.:43808  
##  Max.   :1.0000   Max.   :1.0000   Max.   :6636    Max.   :73554

Do you see any issue regarding the variable types?

Considering the data dictionary, the variables heart_disease and coffee_drinker must be converted to factors.

str(heart)
## 'data.frame':    10000 obs. of  4 variables:
##  $ heart_disease  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ coffee_drinker : int  0 1 0 0 0 1 0 1 0 0 ...
##  $ fast_food_spend: num  1824 2043 2684 1323 1964 ...
##  $ income         : num  44362 12106 31767 35704 38463 ...
heart$heart_disease=as.factor(heart$heart_disease)
heart$coffee_drinker=as.factor(heart$coffee_drinker)

Now, all variable classes look reasonable, so we can move on to partitioning the data set. We use the validation set approach to assess model validity.

library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(1)
ind = createDataPartition(heart$heart_disease, p = 0.8, list = FALSE)
train  = heart[ind, ] 
test = heart[-ind, ]

80% of the data belongs to the training set, while the remaining 20% is set aside as the test set.

prop.table(table(train$heart_disease))
## 
##          0          1 
## 0.96662917 0.03337083
table(train$heart_disease)
## 
##    0    1 
## 7734  267
prop.table(table(test$heart_disease))
## 
##          0          1 
## 0.96698349 0.03301651
table(test$heart_disease)
## 
##    0    1 
## 1933   66

Both sets have similar percentages with respect to the heart_disease status.

Visualizations

Let’s start to understand heart disease by plotting heart disease, a categorical variable with values of 1 (yes) or 0 (no), against annual fast food spend, a numeric variable.

ggplot(heart) +
  geom_point(aes(fast_food_spend, factor(heart_disease)), color = "blue") +
  labs(x = 'Fast Food Spend', y = 'Heart Disease')

From this chart, it looks as if higher fast food spending is associated with a higher likelihood of heart disease.

ggplot(heart, aes(x = factor(heart_disease), y = fast_food_spend, fill = factor(heart_disease))) + 
  geom_boxplot() + 
  xlab("Heart Disease (Y/N)") + 
  ylab("Annual Fast Food Spend") + 
  ggtitle("Food Spend and Heart Disease")

ggplot(heart, aes(x = factor(heart_disease), y = income, fill = factor(heart_disease))) + 
  geom_boxplot() + 
  xlab("Heart Disease (Y/N)") + 
  ylab("Income") + 
  ggtitle("Income and Heart Disease")

According to the figure above, income does not appear to play much of a role in heart disease.

Building the models

Simple Logistic Regression

Simple logistic regression includes the dependent variable and only one predictor. Let’s select the variable fast_food_spend to predict heart disease status. To fit a logistic model, the function glm() can be used with the family specified as “binomial”. The argument family = binomial specifies a two-class categorical response; family = gaussian would instead fit standard linear regression.

simple_model = glm(heart_disease~ fast_food_spend, data = train, family = "binomial")
summary(simple_model)
## 
## Call:
## glm(formula = heart_disease ~ fast_food_spend, family = "binomial", 
##     data = train)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -1.079e+01  4.095e-01  -26.34   <2e-16 ***
## fast_food_spend  2.228e-03  9.934e-05   22.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2340.6  on 8000  degrees of freedom
## Residual deviance: 1240.3  on 7999  degrees of freedom
## AIC: 1244.3
## 
## Number of Fisher Scoring iterations: 8

Is the model overall statistically significant?

anova(simple_model, test = "Chisq")  # tests whether the drop in deviance is statistically significant
                Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL            NA       NA      8000   2340.628       NA
fast_food_spend  1 1100.369      7999   1240.260        0

The overall model is significant. Next, the covariate is examined: fast food spend has a significantly positive effect on the log odds. The expected change in the log odds of heart disease for a one-unit increase in fast food spend is 0.00223. Equivalently, a one-unit increase in spend multiplies the odds of heart disease by e^0.00223 ≈ 1.00223.
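
If you prefer to read the effect directly on the odds scale, one option is to exponentiate the estimates; a quick sketch using the fitted simple_model (output not shown):

exp(coef(simple_model))             # coefficients as odds ratios
exp(confint.default(simple_model))  # Wald confidence intervals on the odds scale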

Let’s apply multiple logistic regression.

Multiple Logistic Regression

Let’s build a model including all predictors and dependent variable heart disease.

log_model = glm(heart_disease~ factor(coffee_drinker)+fast_food_spend+income,
                 family = binomial,
                 data = train)

summary(log_model)
## 
## Call:
## glm(formula = heart_disease ~ factor(coffee_drinker) + fast_food_spend + 
##     income, family = binomial, data = train)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -1.096e+01  5.532e-01 -19.817   <2e-16 ***
## factor(coffee_drinker)1 -6.709e-01  2.677e-01  -2.506   0.0122 *  
## fast_food_spend          2.325e-03  1.048e-04  22.182   <2e-16 ***
## income                   1.716e-06  9.233e-06   0.186   0.8526    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2340.6  on 8000  degrees of freedom
## Residual deviance: 1221.3  on 7997  degrees of freedom
## AIC: 1229.3
## 
## Number of Fisher Scoring iterations: 8

Notice that this output (logistic regression) contains a little less information than in the linear model case. The distribution of the deviance residuals is not as important here, and there is no R-squared because it is not defined for logistic regression. We still have coefficient estimates and associated p-values. The coefficient estimates, as noted above, correspond to the change in the log odds for a one-unit increase in the predictor. The p-values are interpreted similarly to the linear regression setting: small p-values indicate that we reject the null hypothesis that the response does not depend on the predictor.

Coffee Drinker: Coffee drinking has an estimated coefficient of -0.671, which means that coffee drinking is associated with a decrease in the odds that a person has heart disease. Specifically, it decreases the odds of heart disease (relative to non-coffee drinkers) by a multiplicative factor of e^(-0.671) ≈ 0.51, holding all other predictors fixed. Therefore, coffee drinking is also associated with a decrease in the probability of having heart disease. This association appears to be significant.

Fast Food Spend: A one-unit increase in fast food spend is associated with a 0.00233 increase in the log odds of having heart disease. This association appears to be highly significant.

Income: Based on a p-value of 0.85, we conclude that income is not significantly associated with heart disease.

We may prefer to exclude the variable income, which is not statistically significant, from the model.

log_model_2 = glm(heart_disease~ factor(coffee_drinker)+fast_food_spend,
                 family = binomial,
                 data = train)

summary(log_model_2)
## 
## Call:
## glm(formula = heart_disease ~ factor(coffee_drinker) + fast_food_spend, 
##     family = binomial, data = train)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -1.090e+01  4.191e-01 -25.998  < 2e-16 ***
## factor(coffee_drinker)1 -7.097e-01  1.673e-01  -4.241 2.23e-05 ***
## fast_food_spend          2.326e-03  1.048e-04  22.198  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2340.6  on 8000  degrees of freedom
## Residual deviance: 1221.3  on 7998  degrees of freedom
## AIC: 1227.3
## 
## Number of Fisher Scoring iterations: 8
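
Before settling on the reduced model, one sanity check we could run is a likelihood ratio test comparing it with the full model; a short sketch (output not shown):

# Compares the reduced model (without income) to the full model; a large
# p-value indicates that income does not significantly improve the fit.
anova(log_model_2, log_model, test = "Chisq")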

Now, all predictors are statistically significant. We can move on to predictions.

pred = predict(log_model_2, type="response")
head(pred)
##            2            3            4            5            6            7 
## 0.0010538836 0.0094343472 0.0004019906 0.0017827014 0.0019099266 0.0022465855

Note that these are probabilities, not classifications. To obtain classifications, we compare the predicted probabilities to a chosen cutoff value with an ifelse() statement.

glm_pred = ifelse(predict(log_model_2, test, type = "response") > 0.5, 1, 0)  #Threshold as 0.5

Performance Measures

calc_class_err = function(actual, predicted) {
  mean(actual != predicted)
}

calc_class_err(actual = test$heart_disease, predicted = glm_pred)
## [1] 0.02751376

As you can see, the classification error rate for the test data is approximately 3%. Note, however, that only about 3% of individuals in the data have heart disease, so a model that always predicts “no heart disease” would do almost as well; the error rate alone is not enough here. In addition to this metric, we should consider accuracy, sensitivity and specificity.

REMEMBER:

A test with low specificity is more likely to produce a high number of false positives and may incorrectly identify disease or illness in individuals when it is not present. When a test’s sensitivity is high, it is more likely to give a true positive result and correctly detect the disease or illness when it is present.

test_tab= table(predicted = glm_pred,actual = test$heart_disease)
accuracy=(test_tab[1, 1] + test_tab[2, 2]) / sum(test_tab)
library(caret)
glm_pred=as.factor(glm_pred)
sensitivity=sensitivity(test$heart_disease, glm_pred)
specificity=specificity(test$heart_disease,glm_pred)
data.frame(accuracy,sensitivity,specificity)
 accuracy sensitivity specificity
0.9724862   0.9747219   0.7619048
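
As an alternative, caret’s confusionMatrix() reports these metrics in a single call; a sketch that treats “1” (heart disease) as the positive class, so the sensitivity and specificity it reports will differ from the table above:

# confusionMatrix() returns accuracy, sensitivity, specificity, etc. in one call.
# Here the heart-disease class ("1") is declared as the positive class.
confusionMatrix(data = glm_pred, reference = test$heart_disease, positive = "1")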

What is AUC - ROC curve?

The AUC - ROC curve is a performance measure for classification problems across threshold settings. The ROC curve is a probability curve, and the AUC represents the degree of separability: it tells how well the model can distinguish between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.

To draw this curve, we can consider pROC package.

library(pROC)
## Warning: package 'pROC' was built under R version 4.3.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
test_prob=predict(log_model_2, type = "response",newdata = test)
test_roc = roc(test$heart_disease ~ test_prob, plot = TRUE, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

A ROC curve is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR). The ROC curve shows the trade-off between sensitivity (or TPR) and specificity (1 – FPR). Classifiers that give curves closer to the top-left corner indicate a better performance.
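
To make the construction concrete, here is a minimal sketch of what roc() is doing behind the scenes: sweep over a grid of cutoffs and compute the TPR and FPR at each one (using the test_prob values from above).

# Manual ROC sketch: for each cutoff, TPR = share of true cases above the cutoff,
# FPR = share of true non-cases above the cutoff.
cutoffs = seq(0, 1, by = 0.01)
tpr = sapply(cutoffs, function(th) mean(test_prob[test$heart_disease == "1"] > th))
fpr = sapply(cutoffs, function(th) mean(test_prob[test$heart_disease == "0"] > th))
plot(fpr, tpr, type = "l", xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # reference line for a no-skill classifier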

as.numeric(test_roc$auc)
## [1] 0.9324648

A good model will have a high AUC, that is, a sensitivity and specificity that are both as high as possible.
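
Since the 0.5 cutoff used earlier is arbitrary (and the classes are quite imbalanced), pROC can also suggest a data-driven threshold; a sketch using the fitted test_roc object (output not shown):

# "best" picks the Youden-optimal cutoff, i.e. the one maximizing
# sensitivity + specificity - 1.
coords(test_roc, "best", ret = c("threshold", "sensitivity", "specificity"))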

Robust Statistics

Robust analysis refers to the application of statistical techniques that are resistant to outliers or violations of underlying assumptions. Data can often contain outliers or may not follow the assumptions of traditional statistical methods like normality or homoscedasticity. Robust analysis techniques aim to address these challenges by offering approaches that are less sensitive to such issues. Robust analysis encompasses a range of statistical methods, including robust regression, robust estimation of location and scale parameters (e.g., median and median absolute deviation), robust hypothesis testing, and robust measures of correlation and covariance.
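
As a toy illustration (simulated data, not part of the lecture data sets), the sketch below shows how a single gross outlier distorts the mean and standard deviation while the median and MAD barely move.

set.seed(412)                          # arbitrary seed for reproducibility
x = rnorm(50, mean = 10, sd = 2)
x_out = c(x, 100)                      # add one gross outlier
c(mean = mean(x), sd = sd(x), median = median(x), mad = mad(x))
c(mean = mean(x_out), sd = sd(x_out), median = median(x_out), mad = mad(x_out))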

We use the “duncan” data set from the bcgam package to perform the robust analysis.

library(bcgam)
## Warning: package 'bcgam' was built under R version 4.3.3
## Loading required package: nimble
## Warning: package 'nimble' was built under R version 4.3.3
## nimble version 1.1.0 is loaded.
## For more information on NIMBLE and a User Manual,
## please visit https://R-nimble.org.
## 
## Note for advanced users who have written their own MCMC samplers:
##   As of version 0.13.0, NIMBLE's protocol for handling posterior
##   predictive nodes has changed in a way that could affect user-defined
##   samplers in some situations. Please see Section 15.5.1 of the User Manual.
## 
## Attaching package: 'nimble'
## The following object is masked from 'package:stats':
## 
##     simulate
data("duncan") #Load the data
head(duncan)
           type income education prestige
accountant prof     62        86       82
pilot      prof     72        76       83
architect  prof     75        92       90
author     prof     55        90       76
chemist    prof     64        86       90
minister   prof     21        84       87

The duncan data frame has 45 rows and 4 columns, containing data on the prestige and other characteristics of 45 U.S. occupations in 1950.

Info about the data:

  • type: Type of occupation. A factor with the following levels: prof, professional and managerial; wc, white-collar; bc, blue-collar.

  • income: Percent of males in occupation earning $3500 or more in 1950.

  • education: Percent of males in occupation in 1950 who were high-school graduates.

  • prestige: Percent of raters in NORC study rating occupation as excellent or good in prestige.

str(duncan) #seems OK.
## 'data.frame':    45 obs. of  4 variables:
##  $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 3 2 ...
##  $ income   : int  62 72 75 55 64 21 64 80 67 72 ...
##  $ education: int  86 76 92 90 86 84 93 100 87 86 ...
##  $ prestige : int  82 83 90 76 90 87 93 90 52 88 ...
summary(duncan)
##    type        income        education         prestige    
##  bc  :21   Min.   : 7.00   Min.   :  7.00   Min.   : 3.00  
##  prof:18   1st Qu.:21.00   1st Qu.: 26.00   1st Qu.:16.00  
##  wc  : 6   Median :42.00   Median : 45.00   Median :41.00  
##            Mean   :41.87   Mean   : 52.56   Mean   :47.69  
##            3rd Qu.:64.00   3rd Qu.: 84.00   3rd Qu.:81.00  
##            Max.   :81.00   Max.   :100.00   Max.   :97.00

To detect the influential points or outliers, we examine the scatter plot of income vs education.

ggplot(duncan, aes(x = education, y = income)) + 
  geom_point() + stat_smooth(method = "lm", col = "red") + theme_minimal() +
  ggtitle("Income vs. education")
## `geom_smooth()` using formula = 'y ~ x'

fitLS = lm(income ~ education, data = duncan)
opar = par(mfrow = c(2,2))
plot(fitLS)

library(olsrr)
## Warning: package 'olsrr' was built under R version 4.3.3
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_plot_cooksd_bar(fitLS)

REMEMBER: Cook’s distance (denoted as D_i) is a measure used in regression analysis to identify influential observations in a dataset. Typically, observations with Cook’s distance greater than 4/n (where n is the number of observations) are considered influential. However, the threshold value may vary depending on the specific context and the goals of the analysis.
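
The same check can be done without olsrr by computing Cook’s distances directly from the fitted lm object; a short sketch:

cd = cooks.distance(fitLS)
which(cd > 4 / nrow(duncan))  # observations exceeding the 4/n rule of thumb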

According to the plot above, observations 6, 16, and 27 are influential points.

Robust Regression

Robust regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables in the presence of outliers or violations of typical assumptions such as normality and homoscedasticity.

M-estimators: These are robust estimators of the regression coefficients that use a robust objective function to balance the influence of data points. Examples include Huber, Tukey’s biweight, and Hampel M-estimators.

Least absolute deviations (LAD) regression: This method minimizes the sum of the absolute residuals instead of the squared residuals used in OLS regression. It is less sensitive to outliers and provides robust estimates of the regression coefficients.

S-estimators: These are a class of M-estimators that use robust estimates of location and scale to provide robust estimates of the regression coefficients.

Robust regression with iteratively reweighted least squares (IRLS): This method iteratively reweights the observations based on their residuals to downweight the influence of outliers. It can be used with different types of robust estimators, such as M-estimators and LAD estimators.
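
Of the methods listed above, only Huber, S, and MM estimation are fitted below. As an aside, LAD (median) regression could be fitted with the quantreg package (assumed to be installed); a minimal sketch:

library(quantreg)
# rq() with tau = 0.5 minimizes the sum of absolute residuals (median regression).
fitLAD = rq(income ~ education, tau = 0.5, data = duncan)
coef(fitLAD)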

Now, we will apply several robust estimation techniques.

Huber Estimation:

library(MASS)
## Warning: package 'MASS' was built under R version 4.3.3
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:olsrr':
## 
##     cement
fitH = rlm(income ~ education, data = duncan) 
fitH$w
##  [1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.4298673 1.0000000
##  [8] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.8418708
## [15] 0.9497151 0.3721334 1.0000000 1.0000000 1.0000000 1.0000000 0.7046921
## [22] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.3127197 1.0000000
## [29] 1.0000000 0.8301872 1.0000000 1.0000000 0.9496258 1.0000000 1.0000000
## [36] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [43] 1.0000000 1.0000000 0.9023018
hub = data.frame( resid = fitH$resid, weight = fitH$w)
hub2 = hub[order(fitH$w), ]
head(hub2,10)
                        resid    weight
RR.engineer         56.177043 0.3127197
conductor           47.207885 0.3721334
minister           -40.868431 0.4298673
bookkeeper         -24.930115 0.7046921
plumber             21.161622 0.8301872
welfare.worker     -20.868431 0.8418708
waiter             -19.469062 0.9023018
streetcar.motorman  18.500096 0.9496258
teacher            -18.499115 0.9497151
accountant          -1.191484 1.0000000

The Huber weights in the output above show that the outliers (observations 6, 16, and 27) receive the lowest weights. Hence, cases with large residuals tend to be down-weighted.
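
The down-weighting pattern is easy to see by plotting the weights against the residuals; a one-line sketch:

plot(fitH$resid, fitH$w, xlab = "Residual", ylab = "Huber weight")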

S-Estimation:

fitS = lqs(income ~ education, data = duncan, method = "S")

MM-Estimation:

fitMM = rlm(income ~ education, data = duncan, method = "MM")

Visual Comparison:

plot(duncan$education, duncan$income,
      xlab = "education", ylab = "income", type = "p", 
      pch = 20, cex = .8)
abline(fitLS, col = 1) 
abline(fitH, col = 2) 
abline(fitS, col = 3) 
abline(fitMM, col = 4) 
legend(0, 80, c("Least Square", "Huber estimation", "S-estimation","MM-estimation" ),
           lty = rep(1, 4), bty = "n",
           col = c(1, 2, 3, 4))

Performance Measures

#Validation set approach
set.seed(1)
ind_dun = sample(1:nrow(duncan), 0.8 * nrow(duncan))  
train_dun = duncan[ind_dun, ] 
test_dun = duncan[-ind_dun, ] 

fitLS = lm(income ~ education, data = train_dun)
fitH = rlm(income ~ education, data = train_dun) 
fitS = lqs(income ~ education, data = train_dun, method = "S")
fitMM = rlm(income ~ education, data = train_dun, method = "MM")

lm_pred = predict(fitLS, test_dun)
h_pred = predict(fitH, test_dun)
s_pred = predict(fitS, test_dun)
mm_pred = predict(fitMM, test_dun)

#RMSEs
lm_rmse=sqrt(mean((lm_pred - test_dun$income)^2))
h_rmse=sqrt(mean((h_pred - test_dun$income)^2))
s_rmse=sqrt(mean((s_pred - test_dun$income)^2))
mm_rmse=sqrt(mean((mm_pred - test_dun$income)^2))

data.frame(rmse=c(lm_rmse,h_rmse,s_rmse,mm_rmse),est_type=c("ls","h","s","mm"))
     rmse est_type
11.806127       ls
10.913416        h
 9.398217        s
10.575218       mm

Considering the RMSE scores, the robust estimators fit the test data better than traditional linear regression.
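
As a complementary (and itself more outlier-tolerant) metric, we could also compare the models on mean absolute error; a sketch using the same test-set predictions (output not shown):

mae = function(pred, obs) mean(abs(pred - obs))  # mean absolute error helper
data.frame(mae = c(mae(lm_pred, test_dun$income), mae(h_pred, test_dun$income),
                   mae(s_pred, test_dun$income), mae(mm_pred, test_dun$income)),
           est_type = c("ls", "h", "s", "mm"))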