My final project focuses on finding the distinctions between those who are obese and those who are not. The project will be split into three primary parts: lifestyle habits, family history, and physical activity. The obesity dataset will be used in this research because it contains a variety of characteristics, such as family history, physical activity levels, and other lifestyle-related behaviors, which makes it perfect for determining the main distinctions between those who are obese and those who are not.
Obesity is a chronic condition characterized by excessive body fat, impacting over 42% of adults in the United States, as per CDC data. It is linked to severe health risks such as heart disease, type 2 diabetes, and some cancers.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
## Warning: package 'car' was built under R version 4.4.2
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:purrr':
##
## some
##
## The following object is masked from 'package:dplyr':
##
## recode
obesity <- read.csv("C:/Users/saisr/Downloads/statistics using R/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition/ObesityDataSet_raw_and_data_sinthetic.csv")
# View the first few rows of the dataset
head(obesity)
## Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female 21 1.62 64.0 yes no 2 3
## 2 Female 21 1.52 56.0 yes no 3 3
## 3 Male 23 1.80 77.0 yes no 2 3
## 4 Male 27 1.80 87.0 no no 3 3
## 5 Male 22 1.78 89.8 no no 2 1
## 6 Male 29 1.62 53.0 no yes 2 3
## CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS
## 1 Sometimes no 2 no 0 1 no Public_Transportation
## 2 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation
## 3 Sometimes no 2 no 2 1 Frequently Public_Transportation
## 4 Sometimes no 2 no 2 0 Frequently Walking
## 5 Sometimes no 2 no 0 0 Sometimes Public_Transportation
## 6 Sometimes no 2 no 0 0 Sometimes Automobile
## NObeyesdad
## 1 Normal_Weight
## 2 Normal_Weight
## 3 Normal_Weight
## 4 Overweight_Level_I
## 5 Overweight_Level_II
## 6 Normal_Weight
# Convert categorical variables to factors
categorical_vars <- c("family_history_with_overweight", "FAVC", "CAEC",
"SMOKE", "SCC", "CALC", "MTRANS", "NObeyesdad")
obesity[categorical_vars] <- lapply(obesity[categorical_vars], as.factor)
# Create numeric BMI categories for regression
obesity_levels <- c("Insufficient_Weight" = 1,
"Normal_Weight" = 2,
"Overweight_Level_I" = 3,
"Overweight_Level_II" = 4,
"Obesity_Type_I" = 5,
"Obesity_Type_II" = 6,
"Obesity_Type_III" = 7)
obesity$BMI_level <- as.numeric(factor(obesity$NObeyesdad,
levels = names(obesity_levels),
labels = obesity_levels))
obesity$Weight <- as.numeric(as.character(obesity$Weight))
obesity$Height <- as.numeric(as.character(obesity$Height))
# Check for NA values
cat("NA in Weight:", sum(is.na(obesity$Weight_kg)), "\n")
## NA in Weight: 0
obesity <- na.omit(obesity)
# Calculate BMI
obesity$BMI <- obesity$Weight / (obesity$Height)
# Verify the calculation
head(obesity)
## Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female 21 1.62 64.0 yes no 2 3
## 2 Female 21 1.52 56.0 yes no 3 3
## 3 Male 23 1.80 77.0 yes no 2 3
## 4 Male 27 1.80 87.0 no no 3 3
## 5 Male 22 1.78 89.8 no no 2 1
## 6 Male 29 1.62 53.0 no yes 2 3
## CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS
## 1 Sometimes no 2 no 0 1 no Public_Transportation
## 2 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation
## 3 Sometimes no 2 no 2 1 Frequently Public_Transportation
## 4 Sometimes no 2 no 2 0 Frequently Walking
## 5 Sometimes no 2 no 0 0 Sometimes Public_Transportation
## 6 Sometimes no 2 no 0 0 Sometimes Automobile
## NObeyesdad BMI_level BMI
## 1 Normal_Weight 2 39.50617
## 2 Normal_Weight 2 36.84211
## 3 Normal_Weight 2 42.77778
## 4 Overweight_Level_I 3 48.33333
## 5 Overweight_Level_II 4 50.44944
## 6 Normal_Weight 2 32.71605
# Histogram for BMI
ggplot(obesity, aes(x = BMI)) +
geom_histogram(binwidth = 2, fill = "blue", color = "black") +
labs(title = "BMI Distribution", x = "BMI", y = "Count") +
theme_minimal()
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
# Calculate correlations
cor_matrix <- cor(obesity %>% select_if(is.numeric))
# Melt the correlation matrix into long format
cor_matrix_melted <- melt(cor_matrix)
# Create heatmap with ggplot2
ggplot(data = cor_matrix_melted, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "red", high = "green", mid = "white", midpoint = 0) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Correlation Heatmap", x = "Variables", y = "Variables")
The study reveals a high frequency of overweight in the dataset and confirms established findings that activity and weight are important factors influencing BMI. Predictive models can be improved and targeted health interventions can be created with these findings.
so we have 2 hypotheses which have to be tested here.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1245 1.0000 1.0103 1.6667 3.0000
# Adjust the breaks to better distribute the data
obesity$FAF_aggregated <- cut(obesity$FAF,
breaks = c(-Inf, 0.5, 1.5, Inf),
labels = c("Low Activity", "Moderate Activity", "High Activity"),
right = TRUE)
# Check the new groups
table(obesity$FAF_aggregated)
##
## Low Activity Moderate Activity High Activity
## 720 776 615
# Recreate the contingency table
table_aggregated <- table(obesity$FAF_aggregated, obesity$NObeyesdad)
# View the contingency table
print(table_aggregated)
##
## Insufficient_Weight Normal_Weight Obesity_Type_I
## Low Activity 72 80 131
## Moderate Activity 72 97 123
## High Activity 128 110 97
##
## Obesity_Type_II Obesity_Type_III Overweight_Level_I
## Low Activity 69 187 84
## Moderate Activity 165 68 126
## High Activity 63 69 80
##
## Overweight_Level_II
## Low Activity 97
## Moderate Activity 125
## High Activity 68
# Perform a Chi-squared test
chi_sq_test_aggregated <- chisq.test(table_aggregated, simulate.p.value = TRUE)
chi_sq_test_aggregated
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: table_aggregated
## X-squared = 195.77, df = NA, p-value = 0.0004998
Skewed Distribution: The distribution is heavily skewed towards the left, with a large number of individuals having low FAF values (close to 0). This indicates that many individuals engage in very little or no physical activity.
Peaks in Activity: There are noticeable peaks around values of 1.0 and 2.0, which suggests that some individuals engage in structured physical activity approximately 1-2 times per week. These could correspond to distinct groups based on activity routines.
High Activity Levels (FAF > 2.5): A smaller but consistent number of individuals fall in the high activity range (FAF > 2.5). These individuals likely engage in frequent or vigorous physical activity.
Sedentary Behavior: The largest bar corresponds to FAF = 0, representing individuals with no recorded physical activity. This highlights a significant sedentary population.
Implications for Analysis: The concentration of low and moderate FAF values suggests the need for targeted interventions to encourage physical activity. Segmentation into categories (e.g., Low, Moderate, High Activity) should consider these peaks to create balanced groups for analysis.
# Combine similar obesity categories
obesity$NObeyesdad_combined <- as.factor(ifelse(obesity$NObeyesdad %in% c("Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"),
"Obesity",
ifelse(obesity$NObeyesdad %in% c("Overweight_Level_I", "Overweight_Level_II"),
"Overweight",
obesity$NObeyesdad)))
# Combine similar family history categories (if necessary)
obesity$family_history_with_overweight_combined <- as.factor(ifelse(obesity$family_history_with_overweight == "Unknown",
"No",
obesity$family_history_with_overweight))
table_combined <- table(obesity$family_history_with_overweight_combined, obesity$NObeyesdad_combined)
print(table_combined)
##
## 1 2 Obesity Overweight
## 1 146 132 8 99
## 2 126 155 964 481
##
## Pearson's Chi-squared test
##
## data: table_combined
## X-squared = 575.57, df = 3, p-value < 2.2e-16
# Visualization with ggplot2 for combined categories
library(ggplot2)
ggplot(as.data.frame(table_combined), aes(x = Var1, y = Freq, fill = Var2)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Family History and Obesity Levels (Combined)",
x = "Family History with Overweight",
y = "Frequency") +
scale_fill_brewer(palette = "Set1")
Family History: Yes (row 1): Obesity (8): This indicates that 8 individuals with a family history of overweight are classified as having obesity. Overweight (99): 99 individuals with a family history of overweight are classified as overweight.
Family History: No (row 2): Obesity (964): A much larger number, 964 individuals without a family history of overweight are classified as obese. Overweight (481): 481 individuals without a family history of overweight are classified as overweight.
Chi-Square Test:
Since the p-value is much less than 0.05, we reject the null hypothesis. This means that there is a significant relationship between family history of overweight and obesity levels.
Family History and Obesity: Individuals with a family history of being overweight (X-axis value = 2) are in the “Obesity” category (green bar). This suggests a strong correlation between family history and a higher likelihood of obesity. For individuals without a family history (X-axis value = 1), the frequencies are significantly lower across all categories.
Family History and Overweight: For those with a family history (X-axis value = 2), a substantial number are also classified as “Overweight” (purple bar). However, this count is less than the “Obesity” group. Individuals without a family history (X-axis value = 1) show fewer cases of being “Overweight,” but their frequency is still higher than other categories like “1” or “2.”
Categories “1” and “2” (Red and Blue Bars): These categories (red and blue) show much smaller frequencies compared to “Obesity” and “Overweight.” Both categories are more prominent for individuals with no family history (X-axis value = 1).
# Build the regression model
model <- lm(BMI_level ~ Age + Height + Weight + FCVC + NCP + CH2O + FAF + TUE +
family_history_with_overweight + FAVC + CAEC + SMOKE + SCC + CALC + MTRANS,
data = obesity)
# Model summary
summary(model)
##
## Call:
## lm(formula = BMI_level ~ Age + Height + Weight + FCVC + NCP +
## CH2O + FAF + TUE + family_history_with_overweight + FAVC +
## CAEC + SMOKE + SCC + CALC + MTRANS, data = obesity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.19988 -0.26664 0.00545 0.27558 1.65665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.7878680 0.4989088 17.614 < 2e-16 ***
## Age 0.0216445 0.0021093 10.262 < 2e-16 ***
## Height -7.1890208 0.1328661 -54.107 < 2e-16 ***
## Weight 0.0758524 0.0005535 137.029 < 2e-16 ***
## FCVC -0.0163026 0.0189735 -0.859 0.390313
## NCP 0.0298228 0.0127952 2.331 0.019860 *
## CH2O -0.0036785 0.0165400 -0.222 0.824022
## FAF -0.0795688 0.0123792 -6.428 1.60e-10 ***
## TUE -0.0329602 0.0168000 -1.962 0.049905 *
## family_history_with_overweightyes 0.2766560 0.0298100 9.281 < 2e-16 ***
## FAVCyes 0.0378651 0.0321660 1.177 0.239259
## CAECFrequently -0.1795771 0.0663742 -2.706 0.006875 **
## CAECno -0.0263665 0.0878912 -0.300 0.764215
## CAECSometimes 0.2135752 0.0615779 3.468 0.000534 ***
## SMOKEyes -0.0748620 0.0671943 -1.114 0.265359
## SCCyes -0.0740481 0.0475170 -1.558 0.119302
## CALCFrequently 0.0273749 0.4405923 0.062 0.950464
## CALCno 0.0285001 0.4379239 0.065 0.948117
## CALCSometimes -0.0850226 0.4381576 -0.194 0.846159
## MTRANSBike 0.0630842 0.1660370 0.380 0.704028
## MTRANSMotorbike 0.2230207 0.1334289 1.671 0.094781 .
## MTRANSPublic_Transportation 0.2193844 0.0307639 7.131 1.37e-12 ***
## MTRANSWalking 0.0692133 0.0656660 1.054 0.291996
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4315 on 2088 degrees of freedom
## Multiple R-squared: 0.9532, Adjusted R-squared: 0.9527
## F-statistic: 1935 on 22 and 2088 DF, p-value: < 2.2e-16
Key Predictors of BMI: Age, Height, Weight, Family History of Overweight, and Physical Activity Frequency (FAF) are the most impactful predictors of BMI. Increasing physical activity (FAF) or reducing weight has a significant negative impact on BMI, aligning with known health recommendations.
The strongest predictor of obesity is current weight, which makes intuitive sense and validates the model’s basic accuracy.
Physical activity level (FAF) is the second most important factor - people who exercise more tend to have lower obesity levels, even after controlling for other factors.
Family history plays a crucial role - people with a family history of overweight/obesity are more likely to have higher obesity levels.
Transportation choices matter - people who walk or bike tend to have lower obesity levels compared to those who primarily use cars.Public transportation shows a significant positive association with BMI, which could be due to sedentary behavior or limited access to active transportation.
Age shows a positive correlation - older individuals in the dataset tend to have slightly higher obesity levels.
Insignificant Factors: Variables like vegetable consumption, water intake, and calorie drink consumption seem to have no strong association with BMI. This might reflect data limitations or complex interactions not captured by this model.
## Warning: not plotting observations with leverage one:
## 27
# Stacked bar chart for family history and gender
ggplot(obesity, aes(x = family_history_with_overweight, fill = Gender)) +
geom_bar(position = "fill") +
labs(title = "Obesity Levels by Family History and Gender",
x = "Family History",
y = "Proportion",
fill = "Gender") +
theme_minimal()
Gender Proportions Are Balanced: The distribution of males and females is similar within both groups: those with (“yes”) and without (“no”) a family history of obesity.
No Strong Differences Observed: Family history of obesity does not appear to disproportionately affect one gender over the other.
Focus on Proportions: The chart focuses on gender proportions rather than specific obesity levels or severity, limiting deeper insights into how family history affects actual obesity status.