options(repos = c(CRAN = "https://cran.r-project.org/"))
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(broom)
Loading dataset
Obesity <- read.csv('/Users/ankit/Downloads/Obesity.csv')
Select an interesting binary column of data, or one which can be reasonably converted into a binary variable. This should be something worth modeling. Build a logistic regression model for this variable, using between 1 and 4 explanatory variables. Interpret the coefficients, and explain what they mean in your notebook. (Bonus) Using the Standard Error for at least one coefficient, build a C.I. for that coefficient, and interpret its meaning.
Answer: Outcome: BinaryWeight (1 = any category other than Normal_Weight, 0 = Normal_Weight). Explanatory variables: Gender, family_history_with_overweight, SMOKE
Manipulating data
unique(Obesity$NObeyesdad)
## [1] "Normal_Weight" "Overweight_Level_I" "Overweight_Level_II"
## [4] "Obesity_Type_I" "Insufficient_Weight" "Obesity_Type_II"
## [7] "Obesity_Type_III"
Obesity$WeightCategory <- ifelse(Obesity$NObeyesdad %in% c("Obesity_Type_I", "Overweight_Level_II", "Overweight_Level_I", "Insufficient_Weight", "Obesity_Type_II", "Obesity_Type_III"), "Obesed", "Normal Weight")
head(Obesity)
## Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female 21 1.62 64.0 yes no 2 3
## 2 Female 21 1.52 56.0 yes no 3 3
## 3 Male 23 1.80 77.0 yes no 2 3
## 4 Male 27 1.80 87.0 no no 3 3
## 5 Male 22 1.78 89.8 no no 2 1
## 6 Male 29 1.62 53.0 no yes 2 3
## CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS
## 1 Sometimes no 2 no 0 1 no Public_Transportation
## 2 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation
## 3 Sometimes no 2 no 2 1 Frequently Public_Transportation
## 4 Sometimes no 2 no 2 0 Frequently Walking
## 5 Sometimes no 2 no 0 0 Sometimes Public_Transportation
## 6 Sometimes no 2 no 0 0 Sometimes Automobile
## NObeyesdad WeightCategory
## 1 Normal_Weight Normal Weight
## 2 Normal_Weight Normal Weight
## 3 Normal_Weight Normal Weight
## 4 Overweight_Level_I Obesed
## 5 Overweight_Level_II Obesed
## 6 Normal_Weight Normal Weight
Converting the outcome variable WeightCategory to binary
Obesity$BinaryWeight <- ifelse(Obesity$WeightCategory == "Normal Weight", 0, 1)
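The two recoding steps above can be collapsed into one, and it is worth tabulating the result to see how unbalanced the outcome is. The following is a minimal sketch on a small made-up vector (`toy_levels` is hypothetical and stands in for `Obesity$NObeyesdad`). Note that under this coding `Insufficient_Weight` also counts as 1.

```r
# Hypothetical stand-in for Obesity$NObeyesdad (illustrative values only)
toy_levels <- c("Normal_Weight", "Overweight_Level_I", "Obesity_Type_I",
                "Insufficient_Weight", "Normal_Weight", "Obesity_Type_III")

# 1 = any category other than Normal_Weight, 0 = Normal_Weight
toy_binary <- as.integer(toy_levels != "Normal_Weight")

# Class counts and proportions; a heavily unbalanced outcome affects how
# a 0.5 classification threshold behaves later on
class_counts <- table(toy_binary)
print(class_counts)
print(prop.table(class_counts))
```

On the real data the same check would be `table(Obesity$BinaryWeight)`.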
Building a logistic regression model
model <- glm(BinaryWeight ~ Gender + family_history_with_overweight + SMOKE, data = Obesity, family = "binomial")
summary(model)
##
## Call:
## glm(formula = BinaryWeight ~ Gender + family_history_with_overweight +
## SMOKE, family = "binomial", data = Obesity)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2536 0.4055 0.4421 0.4421 1.4526
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.7438 0.1212 6.137 8.4e-10 ***
## GenderMale -0.1811 0.1348 -1.343 0.179163
## family_history_with_overweightyes 1.7133 0.1390 12.323 < 2e-16 ***
## SMOKEyes -1.1898 0.3555 -3.347 0.000818 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1678.5 on 2110 degrees of freedom
## Residual deviance: 1525.9 on 2107 degrees of freedom
## AIC: 1533.9
##
## Number of Fisher Scoring iterations: 5
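INTERPRETATION: The coefficients are on the log-odds scale. Having a family history of overweight is a strong predictor of being “Obesed” (estimate 1.71, about 5.5 times the odds), while smoking is associated with lower odds (estimate -1.19, about 0.30 times the odds). Being male also has a negative estimate (-0.18), but it is not statistically significant in this model (p = 0.18). Note that “Obesed” here means any category other than Normal_Weight, so it also includes Insufficient_Weight. The residual deviance (1525.9 on 2107 degrees of freedom) is about 153 lower than the null deviance (1678.5 on 2110), so the three predictors together improve on an intercept-only model. The AIC (1533.9) is not a measure of fit on its own; it is only meaningful when compared against other candidate models fitted to the same data.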
INTERPRETATION OF COEFFICIENTS
coef(summary(model))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.7438484 0.1212024 6.137241 8.396679e-10
## GenderMale -0.1811184 0.1348271 -1.343338 1.791625e-01
## family_history_with_overweightyes 1.7133389 0.1390349 12.323081 6.804383e-35
## SMOKEyes -1.1897537 0.3555183 -3.346533 8.182907e-04
INTERPRETATION: Family history of overweight is a highly significant predictor of being “Obesed” (p < 0.001). Smoking is also significant (p ≈ 0.0008), with a smaller coefficient in absolute value (1.19 versus 1.71 on the log-odds scale). Gender is not a statistically significant predictor in this model (p ≈ 0.18).
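Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are usually easier to read. A small sketch, with the estimates copied by hand from the `coef(summary(model))` output above (on the fitted object itself, `exp(coef(model))` would give the same numbers; `est`, `odds_ratios` and `baseline_prob` are names introduced here for illustration):

```r
# Estimates copied from the coefficient table above (log-odds scale)
est <- c("(Intercept)" = 0.7438484,
         "GenderMale" = -0.1811184,
         "family_history_with_overweightyes" = 1.7133389,
         "SMOKEyes" = -1.1897537)

# Odds ratios: multiplicative change in the odds of BinaryWeight == 1
odds_ratios <- exp(est)
print(round(odds_ratios, 2))

# Intercept as a probability for the reference group
# (Female, no family history, non-smoker)
baseline_prob <- plogis(est[["(Intercept)"]])
print(round(baseline_prob, 3))
```

An odds ratio above 1 (family history) means higher odds of the 1 class, and one below 1 (smoking, male) means lower odds, holding the other predictors fixed.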
CONFIDENCE INTERVALS
confint(model)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.5087846 0.98433048
## GenderMale -0.4462913 0.08267482
## family_history_with_overweightyes 1.4409213 1.98633706
## SMOKEyes -1.8637834 -0.46020529
INTERPRETATION: These are 95% confidence intervals for the coefficients on the log-odds scale. The intervals for the intercept, family history, and smoking coefficients do not include zero, indicating statistical significance at the 5% level, while the gender coefficient’s interval (-0.45 to 0.08) crosses zero, suggesting non-significance. These results are consistent with the earlier interpretation of the coefficients.
The significant predictors are having a family history of overweight and smoking, which are positively and negatively associated with being “Obesed,” respectively. Gender does not appear to be a significant predictor in this model.
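`confint()` on a `glm` reports profile-likelihood intervals. The interval asked for in the bonus, built directly from the standard error, is the Wald interval: estimate ± z × SE. A minimal sketch for the family-history coefficient, with the estimate and standard error copied from the coefficient table above (the variable names are introduced here for illustration):

```r
# Estimate and standard error for family_history_with_overweightyes,
# copied from coef(summary(model)) above
estimate  <- 1.7133389
std_error <- 0.1390349

# 95% Wald interval on the log-odds scale: estimate +/- z * SE
z <- qnorm(0.975)
wald_ci <- c(lower = estimate - z * std_error,
             upper = estimate + z * std_error)
print(round(wald_ci, 4))

# Same interval on the odds-ratio scale
print(round(exp(wald_ci), 2))
```

The Wald interval should land very close to the profiled one reported by `confint(model)`. On a fitted model, `confint.default(model)` computes Wald intervals for all coefficients at once.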
# Predict probabilities of the positive class (e.g., 1) using your model
predicted_probs <- predict(model, type = "response")
# Install 'pROC' only if it is not already available
if (!requireNamespace("pROC", quietly = TRUE)) install.packages("pROC")
##
## The downloaded binary packages are in
## /var/folders/1t/lvl69_w12vj1sz_yxkxrvt7w0000gn/T//RtmpQgN2hf/downloaded_packages
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Create a ROC object
roc_obj <- roc(Obesity$BinaryWeight, predicted_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot the ROC curve
plot(roc_obj, main = "ROC Curve")
# Calculate the AUC (Area Under the Curve)
auc_value <- auc(roc_obj)
cat("AUC:", auc_value, "\n")
## AUC: 0.6718626
# Confusion matrix
predicted_classes <- ifelse(predicted_probs >= 0.5, 1, 0)
conf_matrix <- table(Actual = Obesity$BinaryWeight, Predicted = predicted_classes)
print(conf_matrix)
## Predicted
## Actual 0 1
## 0 4 283
## 1 2 1822
# Calculate sensitivity and specificity
sensitivity <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
specificity <- conf_matrix[1, 1] / sum(conf_matrix[1, ])
cat("Sensitivity:", sensitivity, "\n")
## Sensitivity: 0.9989035
cat("Specificity:", specificity, "\n")
## Specificity: 0.01393728
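At the 0.5 threshold the model predicts class 1 for all but 6 of the 2111 observations. Sensitivity is therefore very high (0.999), but specificity is only 0.014: just 4 of the 287 normal-weight cases are identified. Overall accuracy (1826/2111, about 86.5%) is essentially the same as always predicting 1 (about 86.4%), and the AUC of 0.67 indicates only modest discrimination. Further refinement, such as additional predictors or a different classification threshold, may be needed to improve overall performance.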
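About 86% of the observations are in class 1, and the fitted probabilities fall below 0.5 for only a handful of them, which is why the specificity is so low. Below is a hedged sketch of how sensitivity and specificity can be compared across cutoffs. The helper `sens_spec()` and the simulated `toy_actual` and `toy_probs` data are hypothetical and not part of the original analysis. Fixing the factor levels keeps the confusion matrix 2×2 even when one class is never predicted, which plain `table()` indexing would not guarantee.

```r
# Hypothetical helper: sensitivity and specificity at a given cutoff
sens_spec <- function(actual, probs, threshold) {
  # Fixed levels keep the table 2x2 even if a class is never predicted
  predicted <- factor(as.integer(probs >= threshold), levels = c(0, 1))
  actual    <- factor(actual, levels = c(0, 1))
  cm <- table(Actual = actual, Predicted = predicted)
  c(threshold   = threshold,
    sensitivity = cm["1", "1"] / sum(cm["1", ]),
    specificity = cm["0", "0"] / sum(cm["0", ]))
}

# Simulated, unbalanced toy data (illustrative only, not the Obesity data)
set.seed(1)
toy_actual <- rbinom(500, size = 1, prob = 0.85)
toy_probs  <- plogis(1.2 + 1.0 * toy_actual + rnorm(500))

# Compare a few cutoffs, including one near the class prevalence
cutoffs <- c(0.5, 0.7, 0.85)
results <- t(sapply(cutoffs, function(th) sens_spec(toy_actual, toy_probs, th)))
print(round(results, 3))
```

In this notebook the same helper could be called as `sens_spec(Obesity$BinaryWeight, predicted_probs, 0.85)`. Because the model has only three binary predictors, the fitted probabilities take at most eight distinct values, so only a few cutoffs give different results.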