This assignment builds on the Week 6 lab notes.

# Load necessary libraries
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Load the dataset
obesity <- read.csv("C:\\Users\\saisr\\Downloads\\statistics using R\\estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition\\obesity.csv")
# View the first few rows of the dataset
head(obesity)
##   Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female  21   1.62   64.0                            yes   no    2   3
## 2 Female  21   1.52   56.0                            yes   no    3   3
## 3   Male  23   1.80   77.0                            yes   no    2   3
## 4   Male  27   1.80   87.0                             no   no    3   3
## 5   Male  22   1.78   89.8                             no   no    2   1
## 6   Male  29   1.62   53.0                             no  yes    2   3
##        CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
## 1 Sometimes    no    2  no   0   1         no Public_Transportation
## 2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
## 3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
## 4 Sometimes    no    2  no   2   0 Frequently               Walking
## 5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
## 6 Sometimes    no    2  no   0   0  Sometimes            Automobile
##            NObeyesdad
## 1       Normal_Weight
## 2       Normal_Weight
## 3       Normal_Weight
## 4  Overweight_Level_I
## 5 Overweight_Level_II
## 6       Normal_Weight

Goal 1: Business Scenario

Customer or Audience:

  • Health and wellness companies aiming to provide personalized obesity management strategies.
  • Dietitians and fitness trainers, who can use the data to tailor dietary and workout plans for their clients.
  • Health insurers, who can use the data to assess obesity-related risks and design insurance policies accordingly.

Problem Statement:

The company classifies individuals into obesity categories (e.g., underweight, normal weight, overweight, or obese) based on demographic, physical, and lifestyle factors. This classification will help:

  • Tailor diet and exercise plans.
  • Identify high-risk groups for targeted interventions.

Scope:

The UCI Obesity Dataset includes attributes such as:

  • Physical Characteristics: Age, height, and weight (used to calculate BMI; a short sketch follows this list).
  • Lifestyle Habits: Frequency of physical activity, type of food consumption, and daily calorie intake.
  • Demographics: Gender.
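A minimal sketch of the BMI derivation mentioned above, assuming (as the head() output suggests) that Height is in metres and Weight in kilograms; the standard WHO cut points (18.5, 25, 30) are used for illustration, and the result is kept in a copy so later chunks are unaffected.

# Derive BMI and a simple BMI category on a copy of the data
obesity_bmi <- obesity
obesity_bmi$BMI <- obesity_bmi$Weight / obesity_bmi$Height^2
obesity_bmi$BMI_category <- cut(obesity_bmi$BMI,
                                breaks = c(-Inf, 18.5, 25, 30, Inf),
                                labels = c("Underweight", "Normal", "Overweight", "Obese"))
table(obesity_bmi$BMI_category)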

Assumptions:

  • The dataset represents individuals from the target population.
  • External factors like genetics or environmental influences are not included and must be addressed in future work.

Analysis Scope:

  • Build a predictive model to classify individuals into obesity levels.
  • Use variables related to dietary habits, exercise frequency, and demographics to make the classifications.
  • Employ visualizations to identify trends, such as the relationship between activity levels and BMI categories (an example plot follows this list).
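An exploratory sketch of the kind of visualization meant here, using the reported obesity level (NObeyesdad) as the category variable and physical activity frequency (FAF) from the dataset loaded above.

# Physical activity frequency across obesity levels (one boxplot per category)
ggplot(obesity, aes(x = NObeyesdad, y = FAF)) +
  geom_boxplot() +
  labs(x = "Obesity level", y = "Physical activity frequency (FAF)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))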

Objective:

Success will be measured by:

  • Model Accuracy: A high-performing classification model, evaluated with metrics such as precision and recall.
  • Factor Analysis: Identifying the most critical predictors.
  • Actionable Insights: Providing clear recommendations for personalized interventions based on classification results.

Goal 2: Model Critique

Initial Lab Assumptions and Issues

The lab uses basic models such as logistic regression or decision trees without deeper exploration. Potential shortcomings are:

  • Inadequate feature selection.
  • Ignoring class imbalances in the dataset.
  • Relying on a single metric like accuracy, which may not reveal issues with minority class predictions.

Proposed Improvements

## 1. Feature Selection and Engineering

  • Issue: The lab includes all variables in the model, which can lead to overfitting and noisy predictors.
  • Improvement: Use techniques such as Recursive Feature Elimination (RFE) or LASSO regression to identify the most relevant features (a LASSO sketch follows the insights below).
  • Benefit: Reduces noise and improves model performance.
# Load required libraries
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Convert target variable 'Gender' to a factor for classification
obesity$Gender <- as.factor(obesity$Gender)

# Define predictor variables (all columns except 'Gender')
predictors <- obesity[, !names(obesity) %in% "Gender"]

# Define the target variable
target <- obesity$Gender

# Fit a random forest model to determine feature importance
rf_model <- randomForest(predictors, target, importance = TRUE)

# Print the importance of each feature
print(rf_model$importance)
##                                      Female         Male MeanDecreaseAccuracy
## Age                            0.0410611300 3.007294e-02         3.551229e-02
## Height                         0.1931341222 1.576535e-01         1.750837e-01
## Weight                         0.1842265194 1.097658e-01         1.465963e-01
## family_history_with_overweight 0.0124381994 8.890179e-03         1.063526e-02
## FAVC                           0.0018143012 8.966803e-04         1.345695e-03
## FCVC                           0.1173114228 4.547788e-02         8.090677e-02
## NCP                            0.0237439892 1.278535e-02         1.821531e-02
## CAEC                           0.0133175987 8.023460e-03         1.063173e-02
## SMOKE                          0.0001596736 9.203189e-07         8.081875e-05
## CH2O                           0.0218017250 2.156011e-02         2.167183e-02
## SCC                            0.0010398854 1.101589e-03         1.069188e-03
## FAF                            0.0324812515 1.547392e-02         2.391810e-02
## TUE                            0.0275297370 9.668658e-03         1.849097e-02
## CALC                           0.0120518016 6.638550e-03         9.300916e-03
## MTRANS                         0.0123819071 4.373122e-03         8.325643e-03
## NObeyesdad                     0.1200982264 6.355443e-02         9.147538e-02
##                                MeanDecreaseGini
## Age                                   66.558691
## Height                               332.453392
## Weight                               221.003224
## family_history_with_overweight        15.034904
## FAVC                                   6.894314
## FCVC                                 113.293256
## NCP                                   35.063217
## CAEC                                  18.638783
## SMOKE                                  2.098514
## CH2O                                  45.386595
## SCC                                    4.194708
## FAF                                   49.514057
## TUE                                   32.286566
## CALC                                  14.376022
## MTRANS                                13.035739
## NObeyesdad                            80.787493
# Optionally, plot the feature importance
varImpPlot(rf_model)

Insights

  • Most Important Features: Based on the importance output, Height, Weight, FCVC, and NObeyesdad contribute most to the model’s prediction of the target variable (Gender), with FAF and CH2O showing moderate importance.
  • Least Important Features: On the other hand, features like SMOKE and SCC have relatively lower importance, meaning they might not contribute significantly to predicting Gender in this dataset.
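Since the improvement above also names LASSO, here is a minimal sketch of that route, assuming the glmnet package is installed and keeping Gender as the target so the result is comparable with the random-forest importance check.

# LASSO feature selection sketch (assumes glmnet is installed)
library(glmnet)

# Dummy-code all predictors and drop the intercept column
x <- model.matrix(Gender ~ ., data = obesity)[, -1]
y <- obesity$Gender

# Cross-validated LASSO logistic regression; predictors shrunk to zero can be dropped
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef(cv_fit, s = "lambda.min")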

## 2. Addressing Class Imbalance

  • Issue: If some obesity categories (e.g., underweight) are underrepresented, models might fail to predict them accurately.
  • Improvement: Use oversampling or assign class weights during training to balance predictions (a class-weight sketch follows the insights below).
  • Benefit: Ensures better performance for minority classes, improving recall and precision.
# Install caret once if it is not already available (avoids re-installing on every run)
if (!requireNamespace("caret", quietly = TRUE)) {
  install.packages("caret", repos = "https://cran.rstudio.com/")
}
# Load the caret package
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
# The obesity data frame was already loaded from the CSV above, so no further loading is needed
# Set seed for reproducibility
set.seed(42)

# Split the dataset into training and testing sets (for example)
trainIndex <- createDataPartition(obesity$Gender, p = 0.8, 
                                  list = FALSE, 
                                  times = 1)
train_data <- obesity[ trainIndex,]
test_data  <- obesity[-trainIndex,]

# Oversample the minority class (up-sampling); exclude the outcome 'Gender' from the predictors
balanced_data_up <- upSample(x = train_data[, names(train_data) != "Gender"], 
                             y = train_data$Gender)

# Undersample the majority class (down-sampling)
balanced_data_down <- downSample(x = train_data[, names(train_data) != "Gender"], 
                                 y = train_data$Gender)

# View balanced dataset using oversampling
table(balanced_data_up$Class)  # 'Class' is the new outcome variable
## 
## Female   Male 
##    855    855

Insights

  • Balanced Representation of Genders: After up-sampling, the training data contains exactly 855 rows for each gender, so a model trained on it will not be biased toward the majority class. If this dataset represents a real-world scenario (e.g., health analysis or behavior prediction), this balance helps improve the fairness and generalization of the model.
  • Up-Sampling Impact: caret's upSample() balances the classes by randomly duplicating rows of the minority class until its count matches the majority class. SMOTE, which instead generates synthetic minority-class examples, is a common alternative when simple duplication risks overfitting.
  • No More Class Imbalance: With an equal number of samples per class, model evaluation is more reliable and predictions are no longer skewed toward the majority class.
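As a complement to resampling, here is a minimal sketch of the class-weight option mentioned in the improvement bullet above, using randomForest's classwt argument on the training split; the inverse-frequency weights are illustrative, not tuned.

# Inverse-frequency class weights computed from the training split
class_counts <- table(train_data$Gender)
class_weights <- as.numeric(max(class_counts) / class_counts)
names(class_weights) <- names(class_counts)

# Weighted random forest: the minority class receives a proportionally larger weight
rf_weighted <- randomForest(Gender ~ ., data = train_data, classwt = class_weights)
print(rf_weighted)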

## 3. Enhanced Model Evaluation

  • Issue: Sole reliance on accuracy can be misleading, especially on imbalanced datasets.
  • Improvement: Evaluate models with additional metrics such as precision, recall, F1-score, and ROC-AUC. For example:
    • Precision helps assess false positives.
    • Recall ensures true positives are captured effectively.
    • F1-score balances precision and recall.
    • ROC-AUC evaluates the overall performance of the classifier.

# Load necessary libraries
library(randomForest)
library(caret)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Assuming 'obesity' is your dataset
set.seed(123)  # For reproducibility

# Split the data into training and testing sets (70-30 split)
train_index <- createDataPartition(obesity$Gender, p = 0.7, list = FALSE)
train_data <- obesity[train_index, ]
test_data <- obesity[-train_index, ]

# Ensure factor levels for Gender are the same between training and testing data
train_data$Gender <- factor(train_data$Gender)
test_data$Gender <- factor(test_data$Gender, levels = levels(train_data$Gender))

# Train a Random Forest model
rf_model <- randomForest(Gender ~ ., data = train_data, ntree = 100)

# Make predictions on the test data (probabilities for the 'Male' class)
predictions <- predict(rf_model, test_data, type = "prob")[, "Male"]

# Generate the ROC curve
roc_curve <- roc(test_data$Gender, predictions)
## Setting levels: control = Female, case = Male
## Setting direction: controls < cases
# Plot the ROC curve
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
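To complement the ROC plot, here is a minimal sketch of the remaining metrics listed above, reusing the rf_model, test_data, and roc_curve objects from this section.

# Hard-class predictions for the threshold-based metrics
class_preds <- predict(rf_model, test_data, type = "response")

# Precision, recall, and F1 via caret's confusion matrix ('Male' treated as the positive class)
confusionMatrix(class_preds, test_data$Gender, positive = "Male", mode = "prec_recall")

# Numeric AUC for the ROC curve computed above
auc(roc_curve)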

Insights

  • Model Performance: The ROC curve is close to the top-left corner, indicating that the classification model is performing very well. The high sensitivity (True Positive Rate) with a low False Positive Rate suggests the model effectively distinguishes between the two classes.
  • Random vs. Fitted Model: The diagonal line in the graph represents a random classifier (AUC = 0.5). The fitted model’s curve deviates substantially from this line, showing much better classification capability.
  • Observation: The plot does not display the Area Under the Curve (AUC) value explicitly, but given the shape of the curve it is likely high (above 0.9); AUC values closer to 1 indicate strong model performance.

Goal 3: Ethical and Epistemological Concerns

Overcoming Biases

  • Concern: The dataset may not represent all population groups, such as underrepresented ethnicities or regions with unique dietary habits.
  • Mitigation:
    • Validate the model on diverse datasets to improve generalizability.
    • Clearly communicate population limitations when presenting findings.

Risks and Societal Implications

  • Concern: Misclassification can lead to inappropriate interventions, causing psychological or physical harm to individuals.
  • Mitigation:
    • Include human oversight in the classification process.
    • Use models to augment human decision-making rather than replace it entirely.

Crucial Issues Not Measured

  • Concern: Variables like mental health, socioeconomic status, and genetics—which are critical in obesity research—are not included in the dataset.
  • Mitigation:
    • Highlight these limitations in the study.
    • Suggest augmenting the dataset with external sources or longitudinal studies.

Stakeholder Impact

  • Groups Affected:
    • Customers: Incorrect classifications can erode trust or lead to ineffective interventions.
    • Healthcare Providers: Over-reliance on the model may result in suboptimal treatment plans.
    • Policy Makers: Misinterpretation of findings could lead to ineffective public health policies.
  • Mitigation:
    • Ensure transparency in the model’s decision-making process.
    • Provide stakeholders with tools to interpret results.