My final project focuses on finding the distinctions between those who are obese and those who are not. The project will be split into three primary parts: lifestyle habits, family history, and physical activity. The obesity dataset will be used in this research because it contains a variety of characteristics, such as family history, physical activity levels, and other lifestyle-related behaviors, which makes it perfect for determining the main distinctions between those who are obese and those who are not.

Obesity is a chronic condition characterized by excessive body fat, impacting over 42% of adults in the United States, as per CDC data. It is linked to severe health risks such as heart disease, type 2 diabetes, and some cancers.

key sources:

Dataset is loaded

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ✔ readr     2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(car)
## Warning: package 'car' was built under R version 4.4.2
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:purrr':
## 
##     some
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
obesity <- read.csv("C:/Users/saisr/Downloads/statistics using R/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition/ObesityDataSet_raw_and_data_sinthetic.csv")

# View the first few rows of the dataset
head(obesity)
##   Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female  21   1.62   64.0                            yes   no    2   3
## 2 Female  21   1.52   56.0                            yes   no    3   3
## 3   Male  23   1.80   77.0                            yes   no    2   3
## 4   Male  27   1.80   87.0                             no   no    3   3
## 5   Male  22   1.78   89.8                             no   no    2   1
## 6   Male  29   1.62   53.0                             no  yes    2   3
##        CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
## 1 Sometimes    no    2  no   0   1         no Public_Transportation
## 2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
## 3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
## 4 Sometimes    no    2  no   2   0 Frequently               Walking
## 5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
## 6 Sometimes    no    2  no   0   0  Sometimes            Automobile
##            NObeyesdad
## 1       Normal_Weight
## 2       Normal_Weight
## 3       Normal_Weight
## 4  Overweight_Level_I
## 5 Overweight_Level_II
## 6       Normal_Weight

Data preprocessing

# Convert categorical variables to factors
categorical_vars <- c("family_history_with_overweight", "FAVC", "CAEC", 
                     "SMOKE", "SCC", "CALC", "MTRANS", "NObeyesdad")
obesity[categorical_vars] <- lapply(obesity[categorical_vars], as.factor)
 
# Create numeric BMI categories for regression
obesity_levels <- c("Insufficient_Weight" = 1, 
                   "Normal_Weight" = 2,
                   "Overweight_Level_I" = 3,
                   "Overweight_Level_II" = 4,
                   "Obesity_Type_I" = 5,
                   "Obesity_Type_II" = 6,
                   "Obesity_Type_III" = 7)
 
obesity$BMI_level <- as.numeric(factor(obesity$NObeyesdad, 
                                          levels = names(obesity_levels), 
                                          labels = obesity_levels))

Calculated BMI and added as new column

obesity$Weight <- as.numeric(as.character(obesity$Weight))
obesity$Height <- as.numeric(as.character(obesity$Height))

# Check for NA values
cat("NA in Weight:", sum(is.na(obesity$Weight_kg)), "\n")
## NA in Weight: 0
obesity <- na.omit(obesity)

# Calculate BMI
obesity$BMI <- obesity$Weight / (obesity$Height)

# Verify the calculation
head(obesity)
##   Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female  21   1.62   64.0                            yes   no    2   3
## 2 Female  21   1.52   56.0                            yes   no    3   3
## 3   Male  23   1.80   77.0                            yes   no    2   3
## 4   Male  27   1.80   87.0                             no   no    3   3
## 5   Male  22   1.78   89.8                             no   no    2   1
## 6   Male  29   1.62   53.0                             no  yes    2   3
##        CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
## 1 Sometimes    no    2  no   0   1         no Public_Transportation
## 2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
## 3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
## 4 Sometimes    no    2  no   2   0 Frequently               Walking
## 5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
## 6 Sometimes    no    2  no   0   0  Sometimes            Automobile
##            NObeyesdad BMI_level      BMI
## 1       Normal_Weight         2 39.50617
## 2       Normal_Weight         2 36.84211
## 3       Normal_Weight         2 42.77778
## 4  Overweight_Level_I         3 48.33333
## 5 Overweight_Level_II         4 50.44944
## 6       Normal_Weight         2 32.71605

BMI Distribution

# Histogram for BMI
ggplot(obesity, aes(x = BMI)) +
  geom_histogram(binwidth = 2, fill = "blue", color = "black") +
  labs(title = "BMI Distribution", x = "BMI", y = "Count") +
  theme_minimal()

Insights

  • The BMI distribution shows a rather symmetrical trend with a rightward tail. This implies that although the majority of people have a modest BMI, some outliers have much higher numbers.
  • There are multiple peaks in the histogram, indicating a multimodal distribution.
  • A large number of individuals fall in the BMI range of 30–40 and 70–80, potentially indicating specific clusters of overweight or obese individuals.
  • The distribution suggests variability, with certain BMI ranges (e.g., 20–30) having significantly fewer individuals.
  • Outliers might be present at the extreme ends of the BMI scale.

Correlation Analysis

# Load required packages
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
# Calculate correlations
cor_matrix <- cor(obesity %>% select_if(is.numeric))

# Melt the correlation matrix into long format
cor_matrix_melted <- melt(cor_matrix)

# Create heatmap with ggplot2
ggplot(data = cor_matrix_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "red", high = "green", mid = "white", midpoint = 0) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Correlation Heatmap", x = "Variables", y = "Variables")

Insights

  • Strong Correlations: BMI and Weight exhibit a strong positive correlation (dark green), as expected. Height and Age show a high correlation, potentially due to natural growth patterns.
  • Negative Correlations: BMI and TUE (Time of Exercise) have a negative correlation, suggesting that higher exercise time may be associated with lower BMI.
  • The strong relationship between BMI and Weight indicates that weight significantly contributes to BMI variation.
  • The inverse relationship between BMI and exercise time highlights the importance of physical activity in maintaining a healthy BMI.
  • Variables with weak correlations might require further examination or exclusion from predictive models.

Conclusion from these 2 visualizations

The study reveals a high frequency of overweight in the dataset and confirms established findings that activity and weight are important factors influencing BMI. Predictive models can be improved and targeted health interventions can be created with these findings.


Exploratory Data Analysis

so we have 2 hypotheses which have to be tested here.

Hypotheses 1: Higher levels of physical activity are associated with lower levels of obesity.

summary(obesity$FAF)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1245  1.0000  1.0103  1.6667  3.0000
hist(obesity$FAF, breaks = 30, main = "Distribution of FAF", xlab = "FAF")

# Adjust the breaks to better distribute the data
obesity$FAF_aggregated <- cut(obesity$FAF,
                              breaks = c(-Inf, 0.5, 1.5, Inf),
                              labels = c("Low Activity", "Moderate Activity", "High Activity"),
                              right = TRUE)

# Check the new groups
table(obesity$FAF_aggregated)
## 
##      Low Activity Moderate Activity     High Activity 
##               720               776               615
# Recreate the contingency table
table_aggregated <- table(obesity$FAF_aggregated, obesity$NObeyesdad)

# View the contingency table
print(table_aggregated)
##                    
##                     Insufficient_Weight Normal_Weight Obesity_Type_I
##   Low Activity                       72            80            131
##   Moderate Activity                  72            97            123
##   High Activity                     128           110             97
##                    
##                     Obesity_Type_II Obesity_Type_III Overweight_Level_I
##   Low Activity                   69              187                 84
##   Moderate Activity             165               68                126
##   High Activity                  63               69                 80
##                    
##                     Overweight_Level_II
##   Low Activity                       97
##   Moderate Activity                 125
##   High Activity                      68
# Perform a Chi-squared test
chi_sq_test_aggregated <- chisq.test(table_aggregated, simulate.p.value = TRUE)
chi_sq_test_aggregated
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  table_aggregated
## X-squared = 195.77, df = NA, p-value = 0.0004998

Insights for test output:

  • This distribution shows that most of the values for FAF are clustered between 0 and 1, with the highest values (close to 3) being relatively rare. The median and mean are very close to each other suggests a relatively symmetric distribution, but there may be some individuals with very high FAF values, which could explain the wide range.
    • Low Activity: 720 observations
    • Moderate Activity: 776 observations
    • High Activity: 615 observations
    • Distribution across weight categories is more representative, eliminating the earlier issue of zero counts in any group.
  • Chi-Squared Test: -Test Statistic (X-squared): 195.77
    • Simulated p-value: 0.0004998
    • Degrees of Freedom (df): Not applicable due to simulated p-value calculation.
  • Interpretation:
    • The p-value (0.0004998) indicates a statistically significant association between activity levels and weight categories.
    • This suggests that physical activity levels are significantly related to weight outcomes.

Insights for the visualization:

  • Skewed Distribution: The distribution is heavily skewed towards the left, with a large number of individuals having low FAF values (close to 0). This indicates that many individuals engage in very little or no physical activity.

  • Peaks in Activity: There are noticeable peaks around values of 1.0 and 2.0, which suggests that some individuals engage in structured physical activity approximately 1-2 times per week. These could correspond to distinct groups based on activity routines.

  • High Activity Levels (FAF > 2.5): A smaller but consistent number of individuals fall in the high activity range (FAF > 2.5). These individuals likely engage in frequent or vigorous physical activity.

  • Sedentary Behavior: The largest bar corresponds to FAF = 0, representing individuals with no recorded physical activity. This highlights a significant sedentary population.

  • Implications for Analysis: The concentration of low and moderate FAF values suggests the need for targeted interventions to encourage physical activity. Segmentation into categories (e.g., Low, Moderate, High Activity) should consider these peaks to create balanced groups for analysis.


Hypotheses 2: Individuals with a family history of obesity are more likely to be obese themselves.

# Combine similar obesity categories
obesity$NObeyesdad_combined <- as.factor(ifelse(obesity$NObeyesdad %in% c("Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"), 
                                                "Obesity", 
                                                ifelse(obesity$NObeyesdad %in% c("Overweight_Level_I", "Overweight_Level_II"), 
                                                       "Overweight", 
                                                       obesity$NObeyesdad)))

# Combine similar family history categories (if necessary)
obesity$family_history_with_overweight_combined <- as.factor(ifelse(obesity$family_history_with_overweight == "Unknown", 
                                                                   "No", 
                                                                   obesity$family_history_with_overweight))


table_combined <- table(obesity$family_history_with_overweight_combined, obesity$NObeyesdad_combined)


print(table_combined)
##    
##       1   2 Obesity Overweight
##   1 146 132       8         99
##   2 126 155     964        481
chi_square_result_combined <- chisq.test(table_combined)

print(chi_square_result_combined)
## 
##  Pearson's Chi-squared test
## 
## data:  table_combined
## X-squared = 575.57, df = 3, p-value < 2.2e-16
# Visualization with ggplot2 for combined categories
library(ggplot2)

ggplot(as.data.frame(table_combined), aes(x = Var1, y = Freq, fill = Var2)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Family History and Obesity Levels (Combined)", 
       x = "Family History with Overweight", 
       y = "Frequency") +
  scale_fill_brewer(palette = "Set1")

Insights for test output:

  • Family History: Yes (row 1): Obesity (8): This indicates that 8 individuals with a family history of overweight are classified as having obesity. Overweight (99): 99 individuals with a family history of overweight are classified as overweight.

  • Family History: No (row 2): Obesity (964): A much larger number, 964 individuals without a family history of overweight are classified as obese. Overweight (481): 481 individuals without a family history of overweight are classified as overweight.

  • Chi-Square Test:

    • Test Statistic (X-squared = 575.57): This value compares the observed data with what we would expect if there was no association between the two variables.
    • Degrees of Freedom (df = 3): Degrees of freedom is calculated as (rows - 1) * (columns - 1) and represents the number of values that are free to vary. Here, it’s 3.
    • p-value (< 2.2e-16): The p-value indicates the probability of observing the data given that there is no association between the two variables. A p-value this low (close to zero) is extremely statistically significant, meaning the evidence against the null hypothesis is very strong.
  • Since the p-value is much less than 0.05, we reject the null hypothesis. This means that there is a significant relationship between family history of overweight and obesity levels.

Insights for the visualization:

  • Family History and Obesity: Individuals with a family history of being overweight (X-axis value = 2) are in the “Obesity” category (green bar). This suggests a strong correlation between family history and a higher likelihood of obesity. For individuals without a family history (X-axis value = 1), the frequencies are significantly lower across all categories.

  • Family History and Overweight: For those with a family history (X-axis value = 2), a substantial number are also classified as “Overweight” (purple bar). However, this count is less than the “Obesity” group. Individuals without a family history (X-axis value = 1) show fewer cases of being “Overweight,” but their frequency is still higher than other categories like “1” or “2.”

  • Categories “1” and “2” (Red and Blue Bars): These categories (red and blue) show much smaller frequencies compared to “Obesity” and “Overweight.” Both categories are more prominent for individuals with no family history (X-axis value = 1).


Linear regression model

# Build the regression model
model <- lm(BMI_level ~ Age + Height + Weight + FCVC + NCP + CH2O + FAF + TUE +
            family_history_with_overweight + FAVC + CAEC + SMOKE + SCC + CALC + MTRANS,
           data = obesity)
 
# Model summary
summary(model)
## 
## Call:
## lm(formula = BMI_level ~ Age + Height + Weight + FCVC + NCP + 
##     CH2O + FAF + TUE + family_history_with_overweight + FAVC + 
##     CAEC + SMOKE + SCC + CALC + MTRANS, data = obesity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.19988 -0.26664  0.00545  0.27558  1.65665 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        8.7878680  0.4989088  17.614  < 2e-16 ***
## Age                                0.0216445  0.0021093  10.262  < 2e-16 ***
## Height                            -7.1890208  0.1328661 -54.107  < 2e-16 ***
## Weight                             0.0758524  0.0005535 137.029  < 2e-16 ***
## FCVC                              -0.0163026  0.0189735  -0.859 0.390313    
## NCP                                0.0298228  0.0127952   2.331 0.019860 *  
## CH2O                              -0.0036785  0.0165400  -0.222 0.824022    
## FAF                               -0.0795688  0.0123792  -6.428 1.60e-10 ***
## TUE                               -0.0329602  0.0168000  -1.962 0.049905 *  
## family_history_with_overweightyes  0.2766560  0.0298100   9.281  < 2e-16 ***
## FAVCyes                            0.0378651  0.0321660   1.177 0.239259    
## CAECFrequently                    -0.1795771  0.0663742  -2.706 0.006875 ** 
## CAECno                            -0.0263665  0.0878912  -0.300 0.764215    
## CAECSometimes                      0.2135752  0.0615779   3.468 0.000534 ***
## SMOKEyes                          -0.0748620  0.0671943  -1.114 0.265359    
## SCCyes                            -0.0740481  0.0475170  -1.558 0.119302    
## CALCFrequently                     0.0273749  0.4405923   0.062 0.950464    
## CALCno                             0.0285001  0.4379239   0.065 0.948117    
## CALCSometimes                     -0.0850226  0.4381576  -0.194 0.846159    
## MTRANSBike                         0.0630842  0.1660370   0.380 0.704028    
## MTRANSMotorbike                    0.2230207  0.1334289   1.671 0.094781 .  
## MTRANSPublic_Transportation        0.2193844  0.0307639   7.131 1.37e-12 ***
## MTRANSWalking                      0.0692133  0.0656660   1.054 0.291996    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4315 on 2088 degrees of freedom
## Multiple R-squared:  0.9532, Adjusted R-squared:  0.9527 
## F-statistic:  1935 on 22 and 2088 DF,  p-value: < 2.2e-16

Insights

  • Key Predictors of BMI: Age, Height, Weight, Family History of Overweight, and Physical Activity Frequency (FAF) are the most impactful predictors of BMI. Increasing physical activity (FAF) or reducing weight has a significant negative impact on BMI, aligning with known health recommendations.

  • The strongest predictor of obesity is current weight, which makes intuitive sense and validates the model’s basic accuracy.

  • Physical activity level (FAF) is the second most important factor - people who exercise more tend to have lower obesity levels, even after controlling for other factors.

  • Family history plays a crucial role - people with a family history of overweight/obesity are more likely to have higher obesity levels.

  • Transportation choices matter - people who walk or bike tend to have lower obesity levels compared to those who primarily use cars.Public transportation shows a significant positive association with BMI, which could be due to sedentary behavior or limited access to active transportation.

  • Age shows a positive correlation - older individuals in the dataset tend to have slightly higher obesity levels.

  • Insignificant Factors: Variables like vegetable consumption, water intake, and calorie drink consumption seem to have no strong association with BMI. This might reflect data limitations or complex interactions not captured by this model.


Model Diagnostics

# Residual plots
par(mfrow = c(2,2))
plot(model)
## Warning: not plotting observations with leverage one:
##   27

Key Diagnostic Findings:

  1. Linearity Assumption:
  1. Normality of Residuals:
  1. Homoscedasticity:
  1. Multicollinearity:
  1. Influential Points:

Supporting visualizations- Family History and Gender Impact on Obesity Levels

# Stacked bar chart for family history and gender
ggplot(obesity, aes(x = family_history_with_overweight, fill = Gender)) +
  geom_bar(position = "fill") +
  labs(title = "Obesity Levels by Family History and Gender",
       x = "Family History",
       y = "Proportion",
       fill = "Gender") +
  theme_minimal()

Insights

  • Gender Proportions Are Balanced: The distribution of males and females is similar within both groups: those with (“yes”) and without (“no”) a family history of obesity.

  • No Strong Differences Observed: Family history of obesity does not appear to disproportionately affect one gender over the other.

  • Focus on Proportions: The chart focuses on gender proportions rather than specific obesity levels or severity, limiting deeper insights into how family history affects actual obesity status.


Limitations

  • Self-reported data: Accuracy of data on diet, exercise, and family history may be compromised due to biases.
  • BMI limitations: BMI does not account for muscle mass or fat distribution.
  • Demographic limitations: The dataset may not represent all populations, as it is region-specific (Mexico, Peru, Colombia).

Conclusion

  • Key Findings: Physical activity, family history, and diet significantly influence obesity risk.
  • Actionable Insights: Focus on increasing physical activity and creating educational programs for high-risk groups.
  • Future Research: Include more diverse populations and consider alternative health indicators beyond BMI.

Recommendations

  • Increase Physical Activity: Encourage exercise through public health initiatives.
  • Target Families with Obesity History: Develop programs for high-risk families to educate and prevent obesity.
  • Promote Healthy Eating: Encourage vegetable consumption and reduce sedentary behavior.