Introduction

Loading Libraries and Importing the Data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
setwd("C:/Users/tonge/Downloads")
obesity <- read_csv("ObesityDataSet_raw_and_data_sinthetic.csv")
## Rows: 2111 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Gender, family_history_with_overweight, FAVC, CAEC, SMOKE, SCC, CAL...
## dbl (8): Age, Height, Weight, FCVC, NCP, CH2O, FAF, TUE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Research Question - How well do age, water intake, technology use, physical activity, and number of daily meals predict BMI in females?

This data set was found on the UC Irvine Machine Learning Repository. Originally, the data was published in Data Brief by Fabio Mendoza Palechor and Alexis De la Hoz Manotas in 2019. It estimates obesity levels based off the eating habits and physical aspects of people in Mexico, Peru, and Colombia. The data contains 17 variables and 2111 observations. I chose this topic because with the advancement of technology, there is a growing concern that people are getting more lazy. I was curious to see what factors can help predict or encourage physical activity. My research question focuses on females because Women’s health is often over looked and understudied.

Link to Data: https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition

The Variables I will use include:

Data Analysis

First, to clean my data I checked each column for NAs. To answer my research question, I used the “weight” and “height” variables to create a new BMI variable using the formula weight(kg)/height(m). Next, I filtered the data to only include the observations of females. Lastly, I converted all my variable names into lowercase letters to make coding easier.

head(obesity)
## # A tibble: 6 × 17
##   Gender   Age Height Weight family_history_with_overw…¹ FAVC   FCVC   NCP CAEC 
##   <chr>  <dbl>  <dbl>  <dbl> <chr>                       <chr> <dbl> <dbl> <chr>
## 1 Female    21   1.62   64   yes                         no        2     3 Some…
## 2 Female    21   1.52   56   yes                         no        3     3 Some…
## 3 Male      23   1.8    77   yes                         no        2     3 Some…
## 4 Male      27   1.8    87   no                          no        3     3 Some…
## 5 Male      22   1.78   89.8 no                          no        2     1 Some…
## 6 Male      29   1.62   53   no                          yes       2     3 Some…
## # ℹ abbreviated name: ¹​family_history_with_overweight
## # ℹ 8 more variables: SMOKE <chr>, CH2O <dbl>, SCC <chr>, FAF <dbl>, TUE <dbl>,
## #   CALC <chr>, MTRANS <chr>, NObeyesdad <chr>
str(obesity)
## spc_tbl_ [2,111 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Gender                        : chr [1:2111] "Female" "Female" "Male" "Male" ...
##  $ Age                           : num [1:2111] 21 21 23 27 22 29 23 22 24 22 ...
##  $ Height                        : num [1:2111] 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
##  $ Weight                        : num [1:2111] 64 56 77 87 89.8 53 55 53 64 68 ...
##  $ family_history_with_overweight: chr [1:2111] "yes" "yes" "yes" "no" ...
##  $ FAVC                          : chr [1:2111] "no" "no" "no" "no" ...
##  $ FCVC                          : num [1:2111] 2 3 2 3 2 2 3 2 3 2 ...
##  $ NCP                           : num [1:2111] 3 3 3 3 1 3 3 3 3 3 ...
##  $ CAEC                          : chr [1:2111] "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
##  $ SMOKE                         : chr [1:2111] "no" "yes" "no" "no" ...
##  $ CH2O                          : num [1:2111] 2 3 2 2 2 2 2 2 2 2 ...
##  $ SCC                           : chr [1:2111] "no" "yes" "no" "no" ...
##  $ FAF                           : num [1:2111] 0 3 2 2 0 0 1 3 1 1 ...
##  $ TUE                           : num [1:2111] 1 0 1 0 0 0 0 0 1 1 ...
##  $ CALC                          : chr [1:2111] "no" "Sometimes" "Frequently" "Frequently" ...
##  $ MTRANS                        : chr [1:2111] "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
##  $ NObeyesdad                    : chr [1:2111] "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Gender = col_character(),
##   ..   Age = col_double(),
##   ..   Height = col_double(),
##   ..   Weight = col_double(),
##   ..   family_history_with_overweight = col_character(),
##   ..   FAVC = col_character(),
##   ..   FCVC = col_double(),
##   ..   NCP = col_double(),
##   ..   CAEC = col_character(),
##   ..   SMOKE = col_character(),
##   ..   CH2O = col_double(),
##   ..   SCC = col_character(),
##   ..   FAF = col_double(),
##   ..   TUE = col_double(),
##   ..   CALC = col_character(),
##   ..   MTRANS = col_character(),
##   ..   NObeyesdad = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

No NAs

colSums(is.na(obesity))
##                         Gender                            Age 
##                              0                              0 
##                         Height                         Weight 
##                              0                              0 
## family_history_with_overweight                           FAVC 
##                              0                              0 
##                           FCVC                            NCP 
##                              0                              0 
##                           CAEC                          SMOKE 
##                              0                              0 
##                           CH2O                            SCC 
##                              0                              0 
##                            FAF                            TUE 
##                              0                              0 
##                           CALC                         MTRANS 
##                              0                              0 
##                     NObeyesdad 
##                              0

Creating a bmi variable

obesity1 <- obesity |>
    mutate(bmi = Weight/(Height ^ 2))

Converting the gender column to factor since I was having problems filtering

  obesity1$Gender <- as.factor(obesity1$Gender)

Filtering the data set to only include females

obesity2 <- obesity1  |>
filter(Gender == "Female")

Ensuring all column names only have lowercase letters

names(obesity2) <- tolower(names(obesity2))         # makes all names
head(obesity2) #verify
## # A tibble: 6 × 18
##   gender   age height weight family_history_with_overw…¹ favc   fcvc   ncp caec 
##   <fct>  <dbl>  <dbl>  <dbl> <chr>                       <chr> <dbl> <dbl> <chr>
## 1 Female    21   1.62     64 yes                         no        2     3 Some…
## 2 Female    21   1.52     56 yes                         no        3     3 Some…
## 3 Female    23   1.5      55 yes                         yes       3     3 Some…
## 4 Female    21   1.72     80 yes                         yes       2     3 Freq…
## 5 Female    22   1.7      66 yes                         no        3     3 Alwa…
## 6 Female    29   1.53     78 no                          yes       2     1 Some…
## # ℹ abbreviated name: ¹​family_history_with_overweight
## # ℹ 9 more variables: smoke <chr>, ch2o <dbl>, scc <chr>, faf <dbl>, tue <dbl>,
## #   calc <chr>, mtrans <chr>, nobeyesdad <chr>, bmi <dbl>

Statistical Analysis - Multiple Linear Regression Model

I will be creating a Multiple Linear Regression model to test how well various daily habits predict BMI in females. This is the most appropriate method since my interest in solely in quantitative variables and their contributions to BMI. Furthermore, it shows each variable and whether their impact is strong, minor, positive, or negative. Lastly, it results states the strongest predictor for BMI. My research question is dedicated to finding which behavior has the greatest impact on BMI, so it helps produce a answer to my question and can even help come up with health interventions.

multiple_model1 <- lm(bmi ~ ncp + ch2o + faf + tue + age, data = obesity2)

# View the model summary
summary(multiple_model1)
## 
## Call:
## lm(formula = bmi ~ ncp + ch2o + faf + tue + age, data = obesity2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.2616  -6.8309  -0.1373   5.4653  21.5825 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.73878    1.77147   8.885  < 2e-16 ***
## ncp          2.12142    0.32943   6.440 1.83e-10 ***
## ch2o         4.47822    0.42609  10.510  < 2e-16 ***
## faf         -2.15815    0.33262  -6.488 1.34e-10 ***
## tue         -0.53658    0.50363  -1.065   0.2869    
## age          0.09540    0.04611   2.069   0.0388 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.623 on 1037 degrees of freedom
## Multiple R-squared:  0.1624, Adjusted R-squared:  0.1584 
## F-statistic: 40.22 on 5 and 1037 DF,  p-value: < 2.2e-16

Model Summary:

Equation: BMI = = 15.74 + 2.12(ncp) + 4.48(ch2o) − 2.15(faf) − 0.54(tue) + 0.095(age)

Intercept: Predicted BMI when all predictors are 0 is 15.74 Just a mathematical calculation for the intercept, but not realistic.

Coefficients:

Adjust R Squared: About 15.8% of BMI variance can by explained by this model

P Values: The most significant value is in ch20, which is 0.0000000000000002. Other statically significant p values are seen in faf, ncp, and age

Check Assumptions of Linear Regression

Linearity Check

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
crPlots(multiple_model1)

Interpretation:

ncp - Mostly flat with a slight curve, showing minor non linearity. ch2o - There is a slight upward trend, showing linearity is reasonable. faf - There is a slight downward trend, showing that linearity is reasonable. tue - There is a slight curve, but it is mostly flat. There, there may be minor non linearity. age - There is one curve, but no clear upward or downward. Therefore, there may be a minor non linearity. Overall - Linear is reasonable, except for ncp, tue, and age.

Independence

plot(resid(multiple_model1), type="b",
     main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)

Interpretation: The residuals are not all centered at 0. There are bursts around indices 200-400, and once more around 800-1000. Therefore, the independence assumption is not fully met in this model.

Core diagnostics (covers: linearity, homoscedasticity, normality, influence)

par(mfrow=c(2,2)); plot(multiple_model1); par(mfrow=c(1,1))

Interpretation:

Residuals vs Fitted - Mostly flat, showing minor non-linearity.

Homoscedasticity (equal variance) - The spread of residuals is not consistent. It fans out, showing minor heteroscedasticity.

Scale–Location - There is more fanning toward fitted values between 25-35, minor heteroscedasticity.

Q–Q - There is barely any deviations at the tails, showing very minor non-normality.

Residuals vs Leverage - Outliers are at observation 109, 116, and 70. Observation 70 may effect the regression model since it has a higher leverage.

Check for Multicolinearity

cor(obesity2[, c("ncp", "ch2o", "age", "tue", "faf")], use = "complete.obs")
##              ncp         ch2o          age         tue        faf
## ncp   1.00000000  0.060436052 -0.021654127  0.01590627  0.1045791
## ch2o  0.06043605  1.000000000  0.002099812 -0.05682945  0.1223811
## age  -0.02165413  0.002099812  1.000000000 -0.36959746 -0.2509605
## tue   0.01590627 -0.056829448 -0.369597463  1.00000000  0.1036779
## faf   0.10457909  0.122381054 -0.250960525  0.10367789  1.0000000

Interpretation: The variables age and tue have the most correlation with a value of -0.37. Since most correlation values are below 1 there is very mild multicolinearity.

Diagnose Model Fit with Metrics

# Calculate residuals
residuals_multiple <- resid(multiple_model1)

# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 8.598182

Interpretation:

This means the multiple model’s predictions miss by ~8.6 BMI on average.

Conclusion

We can conclude that water consumption is the strongest predictor of BMI in females, and that there have a positive relationship. Another predictor that is slightly less strong and also has a positive relationship is the number of meals eaten in a day. Meanwhile, the physical activity variable has a negative correlation, meaning the more a female exercises, the lower their BMI.

The results can help find the possible causes of high/low BMI, help doctors give advise, and even can assist in coming up with intervention strategies for women with an unhealthy weight. Especially since the data set highlights that exercise can reduce BMI. Furthermore, the model pointed out that technology use had minimal impact on BMI and that other daily habits, like meal consumption ,are more effective, This surprised me since technology is often blamed for the decline in bodily movement and outdoor activity.

In the future, I can create a Logistic Regression Model to test the categorical variables of the data set. Such as alcohol consumption, vegetable consumption, smoking status, and their effect on BMI. Another option is using an ANOVA test to see the mean days of physical activity vary across different age groups, To do this, I would need to categorizing the numerical variable(ages) into group like child, teen, and adult. If the results were to find teens are drastically less active than other groups, it woul encourage interventions strategies could be put into place to make youth more healthy.

References(APA)

Estimation of Obesity Levels Based On Eating Habits and Physical Condition [Dataset]. (2019). UCI Machine Learning Repository. https://doi.org/10.24432/C5H31Z.