Previously on Stat412:
Dimension Reduction
Categorical Data Analysis
Today, we’ll focus on logistic regression and robust statistics.
Logistic regression is a statistical method for estimating the probability that an event occurs. It is commonly employed when the dependent variable (the outcome being predicted) is categorical with only two possible outcomes (e.g., yes/no, success/failure, 0/1). The approach can also be extended to situations with more than two outcomes, either unordered (multinomial logistic regression) or ordered (ordinal logistic regression). Unlike linear regression, which models a continuous response, logistic regression predicts probabilities for categorical outcomes, which are bounded between 0 and 1.
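Concretely, with a single predictor X, the model links the probability p of the event to a linear predictor through the logit: log(p / (1 - p)) = β0 + β1X, or equivalently p = e^(β0 + β1X) / (1 + e^(β0 + β1X)), which keeps p between 0 and 1 for any value of the linear predictor.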
The data set “heart” is used to perform logistic regression. It contains information on factors that lead to heart disease in a population. Assume that you are trying to use this knowledge to better predict who has heart disease so you can intervene and help improve their health and quality of life. In other words, you want to be able to predict the likelihood or probability that a person with certain characteristics will have heart disease. The binary target variable is “heart_disease” (yes or no).
## [1] "data.frame"
## [1] 10000 4
Data set info:
heart_disease: an indicator field corresponding to whether an individual has heart disease (1 = yes, heart disease; 0 = no heart disease)
coffee_drinker: an indicator field corresponding to whether an individual drinks coffee regularly (1 = yes, coffee drinker; 0 = not a coffee drinker)
fast_food_spend: a numerical field corresponding to the annual spend of each individual on fast food
income: a numerical field corresponding to the individual’s annual income
| heart_disease | coffee_drinker | fast_food_spend | income |
|---|---|---|---|
| 0 | 0 | 1823.816 | 44361.625 |
| 0 | 1 | 2042.951 | 12106.135 |
| 0 | 0 | 2683.873 | 31767.139 |
| 0 | 0 | 1323.127 | 35704.494 |
| 0 | 0 | 1964.140 | 38463.496 |
| 0 | 1 | 2298.971 | 7491.559 |
## heart_disease coffee_drinker fast_food_spend income
## Min. :0.0000 Min. :0.0000 Min. : 0 Min. : 772
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1204 1st Qu.:21340
## Median :0.0000 Median :0.0000 Median :2059 Median :34553
## Mean :0.0333 Mean :0.2944 Mean :2088 Mean :33517
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:2916 3rd Qu.:43808
## Max. :1.0000 Max. :1.0000 Max. :6636 Max. :73554
Do you see any issue regarding the variable types?
Considering the data dictionary, the variables heart_disease and coffee_drinker must be converted to factors.
## 'data.frame': 10000 obs. of 4 variables:
## $ heart_disease : int 0 0 0 0 0 0 0 0 0 0 ...
## $ coffee_drinker : int 0 1 0 0 0 1 0 1 0 0 ...
## $ fast_food_spend: num 1824 2043 2684 1323 1964 ...
## $ income : num 44362 12106 31767 35704 38463 ...
heart$heart_disease = as.factor(heart$heart_disease)
heart$coffee_drinker = as.factor(heart$coffee_drinker)
Now all variable classes look reasonable, so we can move on to partitioning the data set. We prefer the validation set approach to assess model validity.
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(1)
ind = createDataPartition(heart$heart_disease, p = 0.8, list = FALSE)
train = heart[ind, ]
test = heart[-ind, ]
80% of the data goes to the training set, while the remaining 20% is set aside as the test set.
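The class-balance summaries below can be reproduced with calls along these lines (a sketch; the original chunk is not echoed):
prop.table(table(train$heart_disease))  # class proportions in the training set
table(train$heart_disease)              # class counts in the training set
prop.table(table(test$heart_disease))   # class proportions in the test set
table(test$heart_disease)               # class counts in the test set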
##
## 0 1
## 0.96662917 0.03337083
##
## 0 1
## 7734 267
##
## 0 1
## 0.96698349 0.03301651
##
## 0 1
## 1933 66
Both sets have similar percentages with respect to the heart_disease status.
Let’s start to understand heart disease by plotting heart disease, a categorical variable with values of 1 (yes) or 0 (no), against annual fast food spend, a numeric variable.
ggplot(heart) +
geom_point(aes(fast_food_spend, factor(heart_disease)), color = "blue") +
  labs(x = 'Fast Food Spend', y = 'Heart Disease')
From this chart it looks as if higher fast food spending is associated with a higher likelihood of heart disease.
ggplot(heart, aes(x = factor(heart_disease), y = fast_food_spend, fill = factor(heart_disease))) +
geom_boxplot() +
xlab("Heart Disease (Y/N)") +
ylab("Annual Fast Food Spend") +
ggtitle("Food Spend and Heart Disease")ggplot(heart, aes(x = factor(heart_disease), y = income, fill = factor(heart_disease))) +
geom_boxplot() +
xlab("Heart Disease (Y/N)") +
ylab("Income") +
ggtitle("Income and Heart Disease")According to the figure above, it doesn’t appear that income plays much of a role in heart disease
Simple logistic regression includes the dependent variable and only one predictor. Let’s select the variable fast_food_spend to predict heart disease status. To fit a logistic model, the function glm() can be used with the family specified as binomial. The argument family = "binomial" specifies a two-class categorical response; family = "gaussian" would specify standard linear regression instead.
simple_model = glm(heart_disease ~ fast_food_spend, data = train, family = "binomial")
summary(simple_model)
##
## Call:
## glm(formula = heart_disease ~ fast_food_spend, family = "binomial",
## data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.079e+01 4.095e-01 -26.34 <2e-16 ***
## fast_food_spend 2.228e-03 9.934e-05 22.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2340.6 on 8000 degrees of freedom
## Residual deviance: 1240.3 on 7999 degrees of freedom
## AIC: 1244.3
##
## Number of Fisher Scoring iterations: 8
Is the model overall statistically significant?
anova(simple_model, test = "Chisq") # it investigates whether difference between the deviances are statistically significant.| Df | Deviance | Resid. Df | Resid. Dev | Pr(>Chi) | |
|---|---|---|---|---|---|
| NULL | NA | NA | 8000 | 2340.628 | NA |
| fast_food_spend | 1 | 1100.369 | 7999 | 1240.260 | 0 |
It is seen that the overall model is significant. Next, the covariate is examined. As you can see, fast food spend has a significantly positive effect on the log odds. The expected change in the log odds of heart disease for a one-unit increase in annual fast food spend is 0.00223. Equivalently, a one-unit increase in spend multiplies the odds of heart disease by e^0.00223 ≈ 1.00223.
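The odds-ratio scale can be read off directly by exponentiating the fitted coefficients, for example:
exp(coef(simple_model))  # multiplicative change in the odds per one-unit increase in spend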
Let’s apply multiple logistic regression.
Let’s build a model including all predictors and dependent variable heart disease.
log_model = glm(heart_disease~ factor(coffee_drinker)+fast_food_spend+income,
family = binomial,
data = train)
summary(log_model)
##
## Call:
## glm(formula = heart_disease ~ factor(coffee_drinker) + fast_food_spend +
## income, family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.096e+01 5.532e-01 -19.817 <2e-16 ***
## factor(coffee_drinker)1 -6.709e-01 2.677e-01 -2.506 0.0122 *
## fast_food_spend 2.325e-03 1.048e-04 22.182 <2e-16 ***
## income 1.716e-06 9.233e-06 0.186 0.8526
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2340.6 on 8000 degrees of freedom
## Residual deviance: 1221.3 on 7997 degrees of freedom
## AIC: 1229.3
##
## Number of Fisher Scoring iterations: 8
Notice that this output (logistic regression) contains a little less information than in the linear model case. The distribution of the deviance residuals is not as important here, and there is no R-squared because it is not defined for logistic regression. We still have coefficient estimates and associated p-values. The coefficient estimates, as noted above, correspond to the change in the log odds for a one-unit increase in the predictor. The p-values are interpreted in a similar way to the linear regression setting: small p-values indicate that we reject the null hypothesis that the response does not depend on the predictor.
Coffee Drinker: Coffee drinking has an estimated coefficient of -0.671, which means that coffee drinking is associated with a decrease in the odds that a person has heart disease. Specifically, it decreases the odds of heart disease (relative to non-coffee drinkers) by a multiplicative factor of e^(-0.671) ≈ 0.51, holding all other predictors fixed. Therefore, coffee drinking is also associated with a decrease in the probability of having heart disease. This association appears to be significant.
Fast Food Spend: A one-unit increase in fast food spend is associated with a 0.00233 increase in the log odds of having heart disease. This association appears to be heavily significant.
Income: Based on a p-value of 0.85, we conclude that income is not significantly associated with heart disease.
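If odds ratios with confidence intervals are wanted for the multiple model, one option (a sketch; not part of the original output) is:
# Odds ratios with Wald-type 95% confidence intervals for log_model
exp(cbind(OR = coef(log_model), confint.default(log_model)))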
We may prefer to exclude the variable income, which is not statistically significant, from the model.
log_model_2 = glm(heart_disease~ factor(coffee_drinker)+fast_food_spend,
family = binomial,
data = train)
summary(log_model_2)
##
## Call:
## glm(formula = heart_disease ~ factor(coffee_drinker) + fast_food_spend,
## family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.090e+01 4.191e-01 -25.998 < 2e-16 ***
## factor(coffee_drinker)1 -7.097e-01 1.673e-01 -4.241 2.23e-05 ***
## fast_food_spend 2.326e-03 1.048e-04 22.198 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2340.6 on 8000 degrees of freedom
## Residual deviance: 1221.3 on 7998 degrees of freedom
## AIC: 1227.3
##
## Number of Fisher Scoring iterations: 8
Now, all predictors are statistically significant. We can move on to predictions.
## 2 3 4 5 6 7
## 0.0010538836 0.0094343472 0.0004019906 0.0017827014 0.0019099266 0.0022465855
Note that these are probabilities, not classifications. To obtain classifications, we need to compare them to a chosen cutoff value with an ifelse() statement.
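A minimal sketch of how such classifications could be obtained, assuming a 0.5 cutoff (the chunk that produced these predictions and glm_pred is not echoed):
test_prob = predict(log_model_2, newdata = test, type = "response")  # predicted probabilities on the test set
glm_pred = ifelse(test_prob > 0.5, "1", "0")                         # classify with a 0.5 cutoff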
calc_class_err = function(actual, predicted) {
mean(actual != predicted)
}
calc_class_err(actual = test$heart_disease, predicted = glm_pred)
## [1] 0.02751376
As you see, the classification error rate on the test data is approximately 3%, which is acceptably low. In addition to this metric, we should consider accuracy, sensitivity, and specificity.
REMEMBER:
A test with low specificity is more likely to produce a high number of false positives and may incorrectly identify disease or illness in individuals when it is not present. When a test’s sensitivity is high, it is more likely to give a true positive result and correctly detect the disease or illness when it is present.
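In terms of confusion-matrix counts (TP, FP, TN, FN): sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), and accuracy = (TP + TN) / (TP + TN + FP + FN).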
test_tab= table(predicted = glm_pred,actual = test$heart_disease)
accuracy=(test_tab[1, 1] + test_tab[2, 2]) / sum(test_tab)
library(caret)
glm_pred=as.factor(glm_pred)
sensitivity=sensitivity(test$heart_disease, glm_pred)
specificity=specificity(test$heart_disease,glm_pred)
data.frame(accuracy, sensitivity, specificity)
| accuracy | sensitivity | specificity |
|---|---|---|
| 0.9724862 | 0.9747219 | 0.7619048 |
What is the AUC-ROC curve?
The AUC-ROC curve is a performance measure for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree of separability: it tells how well the model can distinguish between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
To draw this curve, we can consider pROC package.
## Warning: package 'pROC' was built under R version 4.3.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
test_prob=predict(log_model_2, type = "response",newdata = test)
test_roc = roc(test$heart_disease ~ test_prob, plot = TRUE, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
A ROC curve is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR). The ROC curve shows the trade-off between sensitivity (or TPR) and specificity (1 – FPR). Classifiers that give curves closer to the top-left corner indicate a better performance.
## [1] 0.9324648
A good model has a high AUC, i.e., it achieves high sensitivity and high specificity as often as possible.
Robust analysis refers to the application of statistical techniques that are resistant to outliers or violations of underlying assumptions. Data can often contain outliers or may not follow the assumptions of traditional statistical methods like normality or homoscedasticity. Robust analysis techniques aim to address these challenges by offering approaches that are less sensitive to such issues. Robust analysis encompasses a range of statistical methods, including robust regression, robust estimation of location and scale parameters (e.g., median and median absolute deviation), robust hypothesis testing, and robust measures of correlation and covariance.
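As a tiny illustration with a made-up vector, one extreme value shifts the mean and standard deviation substantially, while the median and the median absolute deviation (MAD) barely move:
x = c(10, 12, 11, 13, 12, 95)        # hypothetical sample with one outlier
c(mean = mean(x), sd = sd(x))        # strongly affected by the outlier
c(median = median(x), mad = mad(x))  # robust location and scale estimates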
We use the “duncan” data set from the bcgam package for the robust analysis.
## Warning: package 'bcgam' was built under R version 4.3.3
## Loading required package: nimble
## Warning: package 'nimble' was built under R version 4.3.3
## nimble version 1.1.0 is loaded.
## For more information on NIMBLE and a User Manual,
## please visit https://R-nimble.org.
##
## Note for advanced users who have written their own MCMC samplers:
## As of version 0.13.0, NIMBLE's protocol for handling posterior
## predictive nodes has changed in a way that could affect user-defined
## samplers in some situations. Please see Section 15.5.1 of the User Manual.
##
## Attaching package: 'nimble'
## The following object is masked from 'package:stats':
##
## simulate
| type | income | education | prestige | |
|---|---|---|---|---|
| accountant | prof | 62 | 86 | 82 |
| pilot | prof | 72 | 76 | 83 |
| architect | prof | 75 | 92 | 90 |
| author | prof | 55 | 90 | 76 |
| chemist | prof | 64 | 86 | 90 |
| minister | prof | 21 | 84 | 87 |
The duncan data frame has 45 rows and 4 columns, containing data on the prestige and other characteristics of 45 U.S. occupations in 1950.
Info about the data:
type: Type of occupation. A factor with the following levels: prof, professional and managerial; wc, white-collar; bc, blue-collar.
income: Percent of males in occupation earning $3500 or more in 1950.
education: Percent of males in occupation in 1950 who were high-school graduates.
prestige: Percent of raters in NORC study rating occupation as excellent or good in prestige.
## 'data.frame': 45 obs. of 4 variables:
## $ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 3 2 ...
## $ income : int 62 72 75 55 64 21 64 80 67 72 ...
## $ education: int 86 76 92 90 86 84 93 100 87 86 ...
## $ prestige : int 82 83 90 76 90 87 93 90 52 88 ...
## type income education prestige
## bc :21 Min. : 7.00 Min. : 7.00 Min. : 3.00
## prof:18 1st Qu.:21.00 1st Qu.: 26.00 1st Qu.:16.00
## wc : 6 Median :42.00 Median : 45.00 Median :41.00
## Mean :41.87 Mean : 52.56 Mean :47.69
## 3rd Qu.:64.00 3rd Qu.: 84.00 3rd Qu.:81.00
## Max. :81.00 Max. :100.00 Max. :97.00
To detect the influential points or outliers, we examine the scatter plot of income vs education.
ggplot(duncan, aes(x = education, y = income)) +
geom_point() + stat_smooth(method = "lm", col = "red") + theme_minimal() +
ggtitle("Income vs. education")## `geom_smooth()` using formula = 'y ~ x'
## Warning: package 'olsrr' was built under R version 4.3.3
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
REMEMBER: Cook’s distance (denoted D) is a measure used in regression analysis to identify influential observations in a dataset. Typically, observations with Cook’s distance greater than 4/n (where n is the number of observations) are considered influential. However, the threshold value may vary depending on the specific context and the goals of the analysis.
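The influence diagnostics can be reproduced along these lines (a sketch; the fitted model income ~ education and the object name ols_fit are assumptions, since the original chunk is not echoed):
ols_fit = lm(income ~ education, data = duncan)  # OLS fit used for the diagnostics
cd = cooks.distance(ols_fit)                     # Cook's distance for each observation
which(cd > 4 / nrow(duncan))                     # observations above the 4/n threshold
ols_plot_cooksd_bar(ols_fit)                     # olsrr's Cook's distance bar chart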
According to the plot above, observations 6, 16, and 27 are influential points.
Robust regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables in the presence of outliers or violations of typical assumptions such as normality and homoscedasticity.
M-estimators: These are robust estimators of the regression coefficients that use a robust objective function to balance the influence of data points. Examples include Huber, Tukey’s biweight, and Hampel M-estimators.
Least absolute deviations (LAD) regression: This method minimizes the sum of the absolute residuals instead of the squared residuals used in OLS regression. It is less sensitive to outliers and provides robust estimates of the regression coefficients.
S-estimators: These are a class of M-estimators that use robust estimates of location and scale to provide robust estimates of the regression coefficients.
Robust regression with iteratively reweighted least squares (IRLS): This method iteratively reweights the observations based on their residuals to downweight the influence of outliers. It can be used with different types of robust estimators, such as M-estimators and LAD estimators.
Now, we will apply several robust estimation techniques.
Huber Estimation:
## Warning: package 'MASS' was built under R version 4.3.3
##
## Attaching package: 'MASS'
## The following object is masked from 'package:olsrr':
##
## cement
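The fitting code is not echoed in the output; a minimal sketch of the Huber fit whose weights are listed below (rlm() from MASS uses Huber's psi function by default, and the object names fitLS and fitH are assumed to match the comparison plot further down):
fitLS = lm(income ~ education, data = duncan)   # ordinary least-squares baseline
fitH  = rlm(income ~ education, data = duncan)  # Huber M-estimation (default psi)
fitH$w                                          # final IRLS weights for each observation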
## [1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.4298673 1.0000000
## [8] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.8418708
## [15] 0.9497151 0.3721334 1.0000000 1.0000000 1.0000000 1.0000000 0.7046921
## [22] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.3127197 1.0000000
## [29] 1.0000000 0.8301872 1.0000000 1.0000000 0.9496258 1.0000000 1.0000000
## [36] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [43] 1.0000000 1.0000000 0.9023018
| resid | weight | |
|---|---|---|
| RR.engineer | 56.177043 | 0.3127197 |
| conductor | 47.207885 | 0.3721334 |
| minister | -40.868431 | 0.4298673 |
| bookkeeper | -24.930115 | 0.7046921 |
| plumber | 21.161622 | 0.8301872 |
| welfare.worker | -20.868431 | 0.8418708 |
| waiter | -19.469062 | 0.9023018 |
| streetcar.motorman | 18.500096 | 0.9496258 |
| teacher | -18.499115 | 0.9497151 |
| accountant | -1.191484 | 1.0000000 |
Looking at the Huber weights in the output above, notice that the outliers (6, 16, 27) receive lower weights. Hence, cases with large residuals tend to be down-weighted.
S-Estimation:
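Again a sketch of the fit assumed for the comparison plot below; S-estimation is available through lqs() in MASS:
fitS = lqs(income ~ education, data = duncan, method = "S")  # S-estimator
fitS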
MM-Estimation:
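Similarly, MM-estimation can be requested from rlm() (a sketch, matching the fitMM object used in the plot):
fitMM = rlm(income ~ education, data = duncan, method = "MM")  # MM-estimator
summary(fitMM)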
Visual Comparison:
plot(duncan$education, duncan$income,
xlab = "education", ylab = "income", type = "p",
pch = 20, cex = .8)
abline(fitLS, col = 1)
abline(fitH, col = 2)
abline(fitS, col = 3)
abline(fitMM, col = 4)
legend(0, 80, c("Least Square", "Huber estimation", "S-estimation","MM-estimation" ),
lty = rep(1, 4), bty = "n",
       col = c(1, 2, 3, 4))

# Validation set approach
set.seed(1)
ind_dun = sample(1:nrow(duncan), 0.8 * nrow(duncan))
train_dun = duncan[ind_dun, ]
test_dun = duncan[-ind_dun, ]
fitLS = lm(income ~ education, data = train_dun)
fitH = rlm(income ~ education, data = train_dun)
fitS = lqs(income ~ education, data = train_dun, method = "S")
fitMM = rlm(income ~ education, data = train_dun, method = "MM")
lm_pred = predict(fitLS, test_dun)
h_pred = predict(fitH, test_dun)
s_pred = predict(fitS, test_dun)
mm_pred = predict(fitMM, test_dun)
#RMSEs
lm_rmse=sqrt(mean((lm_pred - test_dun$income)^2))
h_rmse=sqrt(mean((h_pred - test_dun$income)^2))
s_rmse=sqrt(mean((s_pred - test_dun$income)^2))
mm_rmse=sqrt(mean((mm_pred - test_dun$income)^2))
data.frame(rmse = c(lm_rmse, h_rmse, s_rmse, mm_rmse), est_type = c("ls", "h", "s", "mm"))
| rmse | est_type |
|---|---|
| 11.806127 | ls |
| 10.913416 | h |
| 9.398217 | s |
| 10.575218 | mm |
Considering the RMSE scores, the robust fits predict the test data better than traditional linear regression, with the S-estimator giving the lowest test RMSE here.