Data Dive - GLMs

options(repos = c(CRAN = "https://cran.r-project.org/"))

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(broom)

Loading dataset

Obesity <- read.csv('/Users/ankit/Downloads/Obesity.csv')

Select an interesting binary column of data, or one which can be reasonably converted into a binary variable. This should be something worth modeling.
Build a logistic regression model for this variable, using between 1-4 explanatory variables.
Interpret the coefficients, and explain what they mean in your notebook.
(Bonus) Using the Standard Error for at least one coefficient, build a C.I. for that coefficient, and interpret its meaning.

Answer: Outcome: BinaryWeight. Explanatory variables: Gender, family_history_with_overweight, SMOKE.

Manipulating data

unique(Obesity$NObeyesdad)

## [1] "Normal_Weight"       "Overweight_Level_I"  "Overweight_Level_II"
## [4] "Obesity_Type_I"      "Insufficient_Weight" "Obesity_Type_II"    
## [7] "Obesity_Type_III"

Obesity$WeightCategory <- ifelse(Obesity$NObeyesdad %in% c("Obesity_Type_I", "Overweight_Level_II", "Overweight_Level_I", "Insufficient_Weight", "Obesity_Type_II", "Obesity_Type_III"), "Obesed", "Normal Weight")

head(Obesity)

##   Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female  21   1.62   64.0                            yes   no    2   3
## 2 Female  21   1.52   56.0                            yes   no    3   3
## 3   Male  23   1.80   77.0                            yes   no    2   3
## 4   Male  27   1.80   87.0                             no   no    3   3
## 5   Male  22   1.78   89.8                             no   no    2   1
## 6   Male  29   1.62   53.0                             no  yes    2   3
##        CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
## 1 Sometimes    no    2  no   0   1         no Public_Transportation
## 2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
## 3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
## 4 Sometimes    no    2  no   2   0 Frequently               Walking
## 5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
## 6 Sometimes    no    2  no   0   0  Sometimes            Automobile
##            NObeyesdad WeightCategory
## 1       Normal_Weight  Normal Weight
## 2       Normal_Weight  Normal Weight
## 3       Normal_Weight  Normal Weight
## 4  Overweight_Level_I         Obesed
## 5 Overweight_Level_II         Obesed
## 6       Normal_Weight  Normal Weight

Converting the outcome variable WeightCategory to binary

Obesity$BinaryWeight <- ifelse(Obesity$WeightCategory == "Normal Weight", 0, 1)
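Because every category other than Normal_Weight is coded as 1, it can help to check how the two classes are balanced before fitting a model. The following is a minimal sketch on a made-up vector of labels; `labels`, `binary`, `class_counts`, and `class_props` are illustrative names that do not appear in the notebook.

```r
# Hypothetical toy labels standing in for Obesity$NObeyesdad
labels <- c("Normal_Weight", "Obesity_Type_I", "Insufficient_Weight",
            "Overweight_Level_I", "Normal_Weight", "Obesity_Type_III")

# Same rule as the notebook: Normal_Weight -> 0, every other category -> 1
binary <- ifelse(labels == "Normal_Weight", 0, 1)

# Counts and proportions of each class
class_counts <- table(binary)
class_props <- prop.table(class_counts)
print(class_counts)
print(round(class_props, 3))
```

On the real data the same two calls could be applied to Obesity$BinaryWeight; a heavily lopsided split is a warning that accuracy at a 0.5 cutoff may be misleading.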

Building a logistic regression model

model <- glm(BinaryWeight ~ Gender + family_history_with_overweight + SMOKE, data = Obesity, family = "binomial")

summary(model)

## 
## Call:
## glm(formula = BinaryWeight ~ Gender + family_history_with_overweight + 
##     SMOKE, family = "binomial", data = Obesity)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2536   0.4055   0.4421   0.4421   1.4526  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         0.7438     0.1212   6.137  8.4e-10 ***
## GenderMale                         -0.1811     0.1348  -1.343 0.179163    
## family_history_with_overweightyes   1.7133     0.1390  12.323  < 2e-16 ***
## SMOKEyes                           -1.1898     0.3555  -3.347 0.000818 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1678.5  on 2110  degrees of freedom
## Residual deviance: 1525.9  on 2107  degrees of freedom
## AIC: 1533.9
## 
## Number of Fisher Scoring iterations: 5

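INTERPRETATION: Here the outcome 1 (“Obesed”) covers every category other than Normal_Weight, including Insufficient_Weight. The results suggest that having a family history of overweight is a strong positive predictor of being “Obesed,” while smoking is associated with lower odds. The coefficient for being male is also negative, but it is not statistically significant in this model (p = 0.18). The residual deviance (1525.9) is 152.6 lower than the null deviance (1678.5) on 3 degrees of freedom, so the predictors improve on the intercept-only model, although substantial residual variability remains. The AIC (1533.9) does not measure goodness of fit on its own; it is only useful for comparing this model with alternatives fitted to the same data.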
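One way to judge whether the three predictors jointly improve on an intercept-only model is a likelihood-ratio test on the drop in deviance. The sketch below copies the deviances and degrees of freedom from the summary output above; the variable names are illustrative.

```r
# Deviances and degrees of freedom copied from summary(model) above
null_dev <- 1678.5
null_df <- 2110
resid_dev <- 1525.9
resid_df <- 2107

# Likelihood-ratio statistic: drop in deviance, compared to a chi-squared
# distribution with df equal to the number of added parameters
lr_stat <- null_dev - resid_dev
lr_df <- null_df - resid_df
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

cat("LR statistic:", lr_stat, "on", lr_df, "df; p-value:", p_value, "\n")
```

With the fitted object available, comparing it against an intercept-only fit with anova(..., test = "Chisq") should reproduce the same test.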

INTERPRETATION OF COEFFICIENTS

coef(summary(model))

##                                     Estimate Std. Error   z value     Pr(>|z|)
## (Intercept)                        0.7438484  0.1212024  6.137241 8.396679e-10
## GenderMale                        -0.1811184  0.1348271 -1.343338 1.791625e-01
## family_history_with_overweightyes  1.7133389  0.1390349 12.323081 6.804383e-35
## SMOKEyes                          -1.1897537  0.3555183 -3.346533 8.182907e-04

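INTERPRETATION: The coefficients are on the log-odds scale. Holding the other variables fixed, a family history of overweight multiplies the odds of being “Obesed” by roughly exp(1.71) ≈ 5.5 and is highly significant (p < 2e-16). Smoking also has a significant effect (p ≈ 0.0008), multiplying the odds by roughly exp(-1.19) ≈ 0.30, but its coefficient is smaller in magnitude than that of family history and is estimated less precisely (standard error 0.36). Gender, on the other hand, is not a statistically significant predictor in this model (p ≈ 0.18).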
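Log-odds coefficients are often easier to read as odds ratios. The sketch below exponentiates the estimates copied from the table above and converts the intercept into a predicted probability for the reference group (female, no family history, non-smoker, as implied by the coefficient names). The names `est`, `odds_ratios`, and `baseline_prob` are illustrative.

```r
# Estimates copied from coef(summary(model)) above
est <- c("(Intercept)" = 0.7438484,
         "GenderMale" = -0.1811184,
         "family_history_with_overweightyes" = 1.7133389,
         "SMOKEyes" = -1.1897537)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratios <- exp(est)
print(round(odds_ratios, 2))

# Inverse logit of the intercept: predicted probability for the reference group
baseline_prob <- plogis(est[["(Intercept)"]])
cat("Baseline probability:", round(baseline_prob, 3), "\n")
```

With the model object in memory, exp(coef(model)) would give the same odds ratios without retyping the numbers.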

CONFIDENCE INTERVALS

confint(model)

## Waiting for profiling to be done...

##                                        2.5 %      97.5 %
## (Intercept)                        0.5087846  0.98433048
## GenderMale                        -0.4462913  0.08267482
## family_history_with_overweightyes  1.4409213  1.98633706
## SMOKEyes                          -1.8637834 -0.46020529

INTERPRETATION: The 95% confidence intervals give a range of plausible values for each true coefficient on the log-odds scale. The intervals for the intercept, family history, and smoking do not include zero, indicating statistical significance at the 5% level, while the gender coefficient’s interval (-0.45 to 0.08) crosses zero, suggesting non-significance. These results are consistent with the earlier interpretation of the coefficients.
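The bonus task asks for an interval built directly from a standard error. A minimal sketch of a Wald interval (estimate ± z × SE) for the family history coefficient follows, using the estimate and standard error copied from the coefficient table above. Note that confint(model) reports profile-likelihood intervals, so the two can differ slightly; the variable names here are illustrative.

```r
# Estimate and standard error for family_history_with_overweightyes,
# copied from coef(summary(model)) above
est <- 1.7133389
se <- 0.1390349

# 97.5th percentile of the standard normal, for a two-sided 95% interval
z <- qnorm(0.975)

# Wald interval on the log-odds scale
wald_ci <- c(lower = est - z * se, upper = est + z * se)
print(round(wald_ci, 3))

# The same interval expressed as odds ratios
print(round(exp(wald_ci), 2))
```

The interval lies entirely above zero on the log-odds scale (above 1 as an odds ratio), which agrees with the significance reported for this coefficient. With the model object available, confint.default(model) computes Wald intervals for all coefficients.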

The significant predictors are having a family history of overweight and smoking, which are positively and negatively associated with being “Obesed,” respectively. Gender does not appear to be a significant predictor in this model.

# Predict probabilities of the positive class (e.g., 1) using your model
predicted_probs <- predict(model, type = "response")

# Install 'pROC' only if it is not already available, then load it
if (!requireNamespace("pROC", quietly = TRUE)) install.packages("pROC")

## 
## The downloaded binary packages are in
##  /var/folders/1t/lvl69_w12vj1sz_yxkxrvt7w0000gn/T//RtmpQgN2hf/downloaded_packages

library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

# Create a ROC object
roc_obj <- roc(Obesity$BinaryWeight, predicted_probs)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

# Plot the ROC curve
plot(roc_obj, main = "ROC Curve")

# Calculate the AUC (Area Under the Curve)
auc_value <- auc(roc_obj)
cat("AUC:", auc_value, "\n")

## AUC: 0.6718626

# Confusion matrix
predicted_classes <- ifelse(predicted_probs >= 0.5, 1, 0)
conf_matrix <- table(Actual = Obesity$BinaryWeight, Predicted = predicted_classes)
print(conf_matrix)

##       Predicted
## Actual    0    1
##      0    4  283
##      1    2 1822

# Calculate sensitivity and specificity
sensitivity <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
specificity <- conf_matrix[1, 1] / sum(conf_matrix[1, ])
cat("Sensitivity:", sensitivity, "\n")

## Sensitivity: 0.9989035

cat("Specificity:", specificity, "\n")

## Specificity: 0.01393728

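At the 0.5 threshold the model classifies almost every observation as positive (1): sensitivity is 99.9%, but specificity is only 1.4%, with just 4 of the 287 negative cases (0s) correctly identified. Because about 86% of observations belong to the positive class, the overall accuracy of roughly 86% is essentially what predicting 1 for everyone would achieve, and the AUC of 0.67 indicates only modest discrimination. Further model refinement, such as additional predictors or a different classification threshold, may be needed to improve its overall performance.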
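To put the confusion matrix in context, overall accuracy can be compared with the accuracy of always predicting the majority class. The sketch below rebuilds the matrix from the counts printed above; `baseline` and `balanced_accuracy` are illustrative names, and balanced accuracy is used here as one possible summary rather than the only reasonable choice.

```r
# Confusion matrix rebuilt from the counts printed above (filled column-wise)
conf_matrix <- matrix(c(4, 2, 283, 1822), nrow = 2,
                      dimnames = list(Actual = c("0", "1"),
                                      Predicted = c("0", "1")))

# Overall accuracy: correct predictions over all observations
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)

# Accuracy of always predicting the most common actual class
baseline <- max(rowSums(conf_matrix)) / sum(conf_matrix)

# Balanced accuracy: mean of sensitivity and specificity
sensitivity <- conf_matrix["1", "1"] / sum(conf_matrix["1", ])
specificity <- conf_matrix["0", "0"] / sum(conf_matrix["0", ])
balanced_accuracy <- (sensitivity + specificity) / 2

cat("Accuracy:", round(accuracy, 4), "\n")
cat("Majority-class baseline:", round(baseline, 4), "\n")
cat("Balanced accuracy:", round(balanced_accuracy, 4), "\n")
```

When accuracy barely exceeds the majority-class baseline and balanced accuracy sits near 0.5, the 0.5 cutoff is adding little beyond the class proportions.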

Jagriti Mahajan

2023-10-31