Loading Libraries and Importing the Data
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
setwd("C:/Users/tonge/Downloads")
obesity <- read_csv("ObesityDataSet_raw_and_data_sinthetic.csv")
## Rows: 2111 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Gender, family_history_with_overweight, FAVC, CAEC, SMOKE, SCC, CAL...
## dbl (8): Age, Height, Weight, FCVC, NCP, CH2O, FAF, TUE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Research Question - How well do age, water intake, technology use, physical activity, and number of daily meals predict BMI in females?
This data set was found on the UC Irvine Machine Learning Repository. Originally, the data was published in Data Brief by Fabio Mendoza Palechor and Alexis De la Hoz Manotas in 2019. It estimates obesity levels based off the eating habits and physical aspects of people in Mexico, Peru, and Colombia. The data contains 17 variables and 2111 observations. I chose this topic because with the advancement of technology, there is a growing concern that people are getting more lazy. I was curious to see what factors can help predict or encourage physical activity. My research question focuses on females because Women’s health is often over looked and understudied.
Link to Data: https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition
The Variables I will use include:
First, to clean my data I checked each column for NAs. To answer my research question, I used the “weight” and “height” variables to create a new BMI variable using the formula weight(kg)/height(m). Next, I filtered the data to only include the observations of females. Lastly, I converted all my variable names into lowercase letters to make coding easier.
head(obesity)
## # A tibble: 6 × 17
## Gender Age Height Weight family_history_with_overw…¹ FAVC FCVC NCP CAEC
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 Female 21 1.62 64 yes no 2 3 Some…
## 2 Female 21 1.52 56 yes no 3 3 Some…
## 3 Male 23 1.8 77 yes no 2 3 Some…
## 4 Male 27 1.8 87 no no 3 3 Some…
## 5 Male 22 1.78 89.8 no no 2 1 Some…
## 6 Male 29 1.62 53 no yes 2 3 Some…
## # ℹ abbreviated name: ¹family_history_with_overweight
## # ℹ 8 more variables: SMOKE <chr>, CH2O <dbl>, SCC <chr>, FAF <dbl>, TUE <dbl>,
## # CALC <chr>, MTRANS <chr>, NObeyesdad <chr>
str(obesity)
## spc_tbl_ [2,111 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Gender : chr [1:2111] "Female" "Female" "Male" "Male" ...
## $ Age : num [1:2111] 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num [1:2111] 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num [1:2111] 64 56 77 87 89.8 53 55 53 64 68 ...
## $ family_history_with_overweight: chr [1:2111] "yes" "yes" "yes" "no" ...
## $ FAVC : chr [1:2111] "no" "no" "no" "no" ...
## $ FCVC : num [1:2111] 2 3 2 3 2 2 3 2 3 2 ...
## $ NCP : num [1:2111] 3 3 3 3 1 3 3 3 3 3 ...
## $ CAEC : chr [1:2111] "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
## $ SMOKE : chr [1:2111] "no" "yes" "no" "no" ...
## $ CH2O : num [1:2111] 2 3 2 2 2 2 2 2 2 2 ...
## $ SCC : chr [1:2111] "no" "yes" "no" "no" ...
## $ FAF : num [1:2111] 0 3 2 2 0 0 1 3 1 1 ...
## $ TUE : num [1:2111] 1 0 1 0 0 0 0 0 1 1 ...
## $ CALC : chr [1:2111] "no" "Sometimes" "Frequently" "Frequently" ...
## $ MTRANS : chr [1:2111] "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
## $ NObeyesdad : chr [1:2111] "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
## - attr(*, "spec")=
## .. cols(
## .. Gender = col_character(),
## .. Age = col_double(),
## .. Height = col_double(),
## .. Weight = col_double(),
## .. family_history_with_overweight = col_character(),
## .. FAVC = col_character(),
## .. FCVC = col_double(),
## .. NCP = col_double(),
## .. CAEC = col_character(),
## .. SMOKE = col_character(),
## .. CH2O = col_double(),
## .. SCC = col_character(),
## .. FAF = col_double(),
## .. TUE = col_double(),
## .. CALC = col_character(),
## .. MTRANS = col_character(),
## .. NObeyesdad = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
No NAs
colSums(is.na(obesity))
## Gender Age
## 0 0
## Height Weight
## 0 0
## family_history_with_overweight FAVC
## 0 0
## FCVC NCP
## 0 0
## CAEC SMOKE
## 0 0
## CH2O SCC
## 0 0
## FAF TUE
## 0 0
## CALC MTRANS
## 0 0
## NObeyesdad
## 0
Creating a bmi variable
obesity1 <- obesity |>
mutate(bmi = Weight/(Height ^ 2))
Converting the gender column to factor since I was having problems filtering
obesity1$Gender <- as.factor(obesity1$Gender)
Filtering the data set to only include females
obesity2 <- obesity1 |>
filter(Gender == "Female")
Ensuring all column names only have lowercase letters
names(obesity2) <- tolower(names(obesity2)) # makes all names
head(obesity2) #verify
## # A tibble: 6 × 18
## gender age height weight family_history_with_overw…¹ favc fcvc ncp caec
## <fct> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 Female 21 1.62 64 yes no 2 3 Some…
## 2 Female 21 1.52 56 yes no 3 3 Some…
## 3 Female 23 1.5 55 yes yes 3 3 Some…
## 4 Female 21 1.72 80 yes yes 2 3 Freq…
## 5 Female 22 1.7 66 yes no 3 3 Alwa…
## 6 Female 29 1.53 78 no yes 2 1 Some…
## # ℹ abbreviated name: ¹family_history_with_overweight
## # ℹ 9 more variables: smoke <chr>, ch2o <dbl>, scc <chr>, faf <dbl>, tue <dbl>,
## # calc <chr>, mtrans <chr>, nobeyesdad <chr>, bmi <dbl>
I will be creating a Multiple Linear Regression model to test how well various daily habits predict BMI in females. This is the most appropriate method since my interest in solely in quantitative variables and their contributions to BMI. Furthermore, it shows each variable and whether their impact is strong, minor, positive, or negative. Lastly, it results states the strongest predictor for BMI. My research question is dedicated to finding which behavior has the greatest impact on BMI, so it helps produce a answer to my question and can even help come up with health interventions.
multiple_model1 <- lm(bmi ~ ncp + ch2o + faf + tue + age, data = obesity2)
# View the model summary
summary(multiple_model1)
##
## Call:
## lm(formula = bmi ~ ncp + ch2o + faf + tue + age, data = obesity2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.2616 -6.8309 -0.1373 5.4653 21.5825
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.73878 1.77147 8.885 < 2e-16 ***
## ncp 2.12142 0.32943 6.440 1.83e-10 ***
## ch2o 4.47822 0.42609 10.510 < 2e-16 ***
## faf -2.15815 0.33262 -6.488 1.34e-10 ***
## tue -0.53658 0.50363 -1.065 0.2869
## age 0.09540 0.04611 2.069 0.0388 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.623 on 1037 degrees of freedom
## Multiple R-squared: 0.1624, Adjusted R-squared: 0.1584
## F-statistic: 40.22 on 5 and 1037 DF, p-value: < 2.2e-16
Model Summary:
Equation: BMI = = 15.74 + 2.12(ncp) + 4.48(ch2o) − 2.15(faf) − 0.54(tue) + 0.095(age)
Intercept: Predicted BMI when all predictors are 0 is 15.74 Just a mathematical calculation for the intercept, but not realistic.
Coefficients:
ncp - Positive coefficient(2.12). The more meals a female consumes, the higher the BMI.
ch2o - Positive coefficient(4.48). The more water a female consumes, the higher the BMI.
faf - Negative coefficient(-2.15). The more a female works out, the lower their BMI.
tue - Negative coefficient(-0.54). The more time spent on technology, the lower the BMI in females. However the number is very small so the effect is not very strong.
age - Positive coefficient(0.095). The older a female is, the higher the BMI. However, the number is small, so the effect is not to strong.
Adjust R Squared: About 15.8% of BMI variance can by explained by this model
P Values: The most significant value is in ch20, which is 0.0000000000000002. Other statically significant p values are seen in faf, ncp, and age
Linearity Check
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
crPlots(multiple_model1)
Interpretation:
ncp - Mostly flat with a slight curve, showing minor non linearity. ch2o - There is a slight upward trend, showing linearity is reasonable. faf - There is a slight downward trend, showing that linearity is reasonable. tue - There is a slight curve, but it is mostly flat. There, there may be minor non linearity. age - There is one curve, but no clear upward or downward. Therefore, there may be a minor non linearity. Overall - Linear is reasonable, except for ncp, tue, and age.
Independence
plot(resid(multiple_model1), type="b",
main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)
Interpretation: The residuals are not all centered at 0. There are bursts around indices 200-400, and once more around 800-1000. Therefore, the independence assumption is not fully met in this model.
Core diagnostics (covers: linearity, homoscedasticity, normality, influence)
par(mfrow=c(2,2)); plot(multiple_model1); par(mfrow=c(1,1))
Interpretation:
Residuals vs Fitted - Mostly flat, showing minor non-linearity.
Homoscedasticity (equal variance) - The spread of residuals is not consistent. It fans out, showing minor heteroscedasticity.
Scale–Location - There is more fanning toward fitted values between 25-35, minor heteroscedasticity.
Q–Q - There is barely any deviations at the tails, showing very minor non-normality.
Residuals vs Leverage - No extreme or influential outliers.
Check for Multicolinearity
cor(obesity2[, c("ncp", "ch2o", "age", "tue", "faf")], use = "complete.obs")
## ncp ch2o age tue faf
## ncp 1.00000000 0.060436052 -0.021654127 0.01590627 0.1045791
## ch2o 0.06043605 1.000000000 0.002099812 -0.05682945 0.1223811
## age -0.02165413 0.002099812 1.000000000 -0.36959746 -0.2509605
## tue 0.01590627 -0.056829448 -0.369597463 1.00000000 0.1036779
## faf 0.10457909 0.122381054 -0.250960525 0.10367789 1.0000000
Interpretation: The variables age and tue have the most correlation with a value of -0.37. Since most correlation values are below 1 there is very mild multicolinearity.
# Calculate residuals
residuals_multiple <- resid(multiple_model1)
# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 8.598182
Interpretation:
This means the multiple model’s predictions miss by ~8.6 BMI on average.
We can conclude that water consumption is the strongest predictor of BMI in females, and that there have a positive relationship. Another predictor that is slightly less strong and also has a positive relationship is the number of meals eaten in a day. Meanwhile, the physical activity variable has a negative correlation, meaning the more a female exercises, the lower their BMI.
The results can help find the possible causes of high/low BMI, help doctors give advise, and even can assist in coming up with intervention strategies for women with an unhealthy weight. Especially since the data set highlights that exercise can reduce BMI. Furthermore, the model pointed out that technology use had minimal impact on BMI and that other daily habits, like meal consumption ,are more effective, This surprised me since technology is often blamed for the decline in bodily movement and outdoor activity.
In the future, I can create a Logistic Regression Model to test the categorical variables of the data set. Such as alcohol consumption, vegetable consumption, smoking status, and their effect on BMI. Another option is using an ANOVA test to see the mean days of physical activity vary across different age groups, To do this, I would need to categorizing the numerical variable(ages) into group like child, teen, and adult. If the results were to find teens are drastically less active than other groups, it woul encourage interventions strategies could be put into place to make youth more healthy.
Estimation of Obesity Levels Based On Eating Habits and Physical Condition [Dataset]. (2019). UCI Machine Learning Repository. https://doi.org/10.24432/C5H31Z.