Obesity Levels with Eating & Physical habits Dataset

a regression analysis done in the tidyverse

by Bonnie Cooper

The following analysis explores the “Estimation of obesity levels based on eating habits and physical condition Data Set”, a dataset of features that quantify the eating and physical habits of individuals, along with obesity-level estimates and basic body metrics. The dataset is hosted on the UCI Machine Learning Repository, an excellent resource for well-curated datasets for testing analysis and modeling techniques.

Before we get started, we will load the libraries used in this analysis into the R environment:

library( tidyverse ) #dplyr, ggplot2 & other data wrangling functions
library( ggplot2 ) #plotting (already attached by tidyverse; loaded explicitly for clarity)
library( ggfortify ) #plot diagnostics of our lin. model
library( tidymodels ) #data splitting (rsample) & model metrics (yardstick)

And now to load the data. The .csv file from the UCI Machine Learning Repository has been uploaded to the author’s GitHub account for use with the following code:

obs_url <- 'https://raw.githubusercontent.com/SmilodonCub/DATA605/master/ObesityDataSet_raw_and_data_sinthetic.csv'
obs_df <- read.csv( obs_url ) #read data in as an r dataframe
paste( 'Obesity dataset dimensions:' )

## [1] "Obesity dataset dimensions:"

dim( obs_df ) #print the dimensions of the dataset

## [1] 2111   17

colnames( obs_df )#print the feature (column) names of the dataset

##  [1] "Gender"                         "Age"                           
##  [3] "Height"                         "Weight"                        
##  [5] "family_history_with_overweight" "FAVC"                          
##  [7] "FCVC"                           "NCP"                           
##  [9] "CAEC"                           "SMOKE"                         
## [11] "CH2O"                           "SCC"                           
## [13] "FAF"                            "TUE"                           
## [15] "CALC"                           "MTRANS"                        
## [17] "NObeyesdad"

The resulting dataframe consists of 2111 observations (rows of data), and each row holds values for 17 columns of feature attributes. The names of the columns are somewhat obscure. However, we can find a more complete description in a data article in the journal Data in Brief, where the authors fully document the data curation and collection methods.

The following code uses dplyr methods to rename the columns with more intuitive labels, create a new column, bmi, that calculates the Body Mass Index, and create another column, weight_cat_num, that maps the weight_category values from strings to numeric values. With Weight in kilograms and Height in meters: \[\mbox{Body Mass Index} = \frac{\mbox{Weight}}{\mbox{Height}^2}\]

obs_df <- obs_df %>% rename( eats_high_calor_food = FAVC, eats_veggies = FCVC, 
                             num_meals = NCP, eats_snacks = CAEC, drinks_water = CH2O, 
                             counts_calories = SCC, exercises_often = FAF, 
                             time_using_tech = TUE, drinks_alcohol = CALC, 
                             method_trans = MTRANS, weight_category = NObeyesdad ) %>%
  mutate( bmi = Weight / Height^2 ) %>%
  mutate( weight_cat_num = case_when( ( weight_category == "Insufficient_Weight" ) ~ -1,
          ( weight_category == "Normal_Weight" ) ~ 0,
          ( weight_category == "Overweight_Level_I" ) ~ 1,
          ( weight_category == "Overweight_Level_II" ) ~ 2,
          ( weight_category == "Obesity_Type_I" ) ~ 3,
          ( weight_category == "Obesity_Type_II" ) ~ 4,
          ( weight_category == "Obesity_Type_III" ) ~ 5 ) )
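The same string-to-number mapping can also be written as a named lookup vector. The following is a minimal base R sketch, not the code used in this analysis; the names cat_levels, toy_categories and toy_scores are illustrative, and the input strings are made up rather than taken from the dataset.

```r
# Named lookup vector: category label -> ordinal score
# (same values as the case_when() mapping above)
cat_levels <- c( Insufficient_Weight = -1, Normal_Weight = 0,
                 Overweight_Level_I = 1, Overweight_Level_II = 2,
                 Obesity_Type_I = 3, Obesity_Type_II = 4, Obesity_Type_III = 5 )

# Hypothetical toy input, not rows from the dataset
toy_categories <- c( "Normal_Weight", "Obesity_Type_II", "Insufficient_Weight" )

# Index the lookup by name; unname() drops the labels from the result
toy_scores <- unname( cat_levels[ toy_categories ] )
toy_scores
```

A lookup vector keeps the label-to-score pairs in one place, whereas case_when() makes each condition explicit; either approach gives the same ordinal scale.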

And now to take a glimpse at the dataframe with the tidyverse function glimpse():

glimpse( obs_df )

## Rows: 2,111
## Columns: 19
## $ Gender                         <chr> "Female", "Female", "Male", "Male", "M…
## $ Age                            <dbl> 21, 21, 23, 27, 22, 29, 23, 22, 24, 22…
## $ Height                         <dbl> 1.62, 1.52, 1.80, 1.80, 1.78, 1.62, 1.…
## $ Weight                         <dbl> 64.0, 56.0, 77.0, 87.0, 89.8, 53.0, 55…
## $ family_history_with_overweight <chr> "yes", "yes", "yes", "no", "no", "no",…
## $ eats_high_calor_food           <chr> "no", "no", "no", "no", "no", "yes", "…
## $ eats_veggies                   <dbl> 2, 3, 2, 3, 2, 2, 3, 2, 3, 2, 3, 2, 3,…
## $ num_meals                      <dbl> 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ eats_snacks                    <chr> "Sometimes", "Sometimes", "Sometimes",…
## $ SMOKE                          <chr> "no", "yes", "no", "no", "no", "no", "…
## $ drinks_water                   <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3,…
## $ counts_calories                <chr> "no", "yes", "no", "no", "no", "no", "…
## $ exercises_often                <dbl> 0, 3, 2, 2, 0, 0, 1, 3, 1, 1, 2, 2, 2,…
## $ time_using_tech                <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 2, 1, 0,…
## $ drinks_alcohol                 <chr> "no", "Sometimes", "Frequently", "Freq…
## $ method_trans                   <chr> "Public_Transportation", "Public_Trans…
## $ weight_category                <chr> "Normal_Weight", "Normal_Weight", "Nor…
## $ bmi                            <dbl> 24.38653, 24.23823, 23.76543, 26.85185…
## $ weight_cat_num                 <dbl> 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 3, 2, 0,…
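As a quick sanity check, the BMI formula can be applied by hand to the first four Height and Weight values printed by glimpse(). This is a small sketch with the values typed in directly; height_m, weight_kg and bmi_check are illustrative names.

```r
# First four Height (m) and Weight (kg) values as printed by glimpse()
height_m  <- c( 1.62, 1.52, 1.80, 1.80 )
weight_kg <- c( 64.0, 56.0, 77.0, 87.0 )

# BMI = weight / height^2
bmi_check <- weight_kg / height_m^2
round( bmi_check, 5 )
```

The results should agree with the first four entries of the bmi column shown above.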

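Later code chunks model this dataset, so we will use tidymodels methods to split the data into training and testing sets. With no further arguments, initial_split() uses its default proportion, which keeps about three quarters of the rows for training.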

set.seed( 138 )
obs_split <- initial_split( obs_df ) 
train_obs <- training( obs_split )
test_obs <- testing( obs_split )
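Conceptually, a random train/test split is just a random draw of row indices. The sketch below illustrates the idea in base R on a stand-in data frame; it is not rsample's implementation, the 3/4 proportion is an assumption matching the default, and toy_df, train_idx, toy_train and toy_test are hypothetical names.

```r
set.seed( 138 )                       # any fixed seed makes the split repeatable

# Hypothetical stand-in data frame with the same number of rows as obs_df
toy_df <- data.frame( id = 1:2111, x = rnorm( 2111 ) )

# Draw roughly 3/4 of the row indices for training, without replacement
n_train   <- floor( 0.75 * nrow( toy_df ) )
train_idx <- sample( nrow( toy_df ), size = n_train )

toy_train <- toy_df[ train_idx, ]     # sampled rows
toy_test  <- toy_df[ -train_idx, ]    # everything else

c( train = nrow( toy_train ), test = nrow( toy_test ) )
```

The exact row counts from initial_split() may differ by a row depending on how the package rounds, and the rows selected will not match this sketch even with the same seed.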

Using the training data, visualize the two new attributes we introduced to the data: bmi as a function of weight_cat_num.

ggplot( train_obs, aes( y = bmi, x = factor( weight_cat_num ) ) ) +
  geom_boxplot() +
  ggtitle( 'BMI ~ Weight Category' ) +
  xlab( 'Weight Category' ) +
  scale_x_discrete( breaks = c( '-1', '0', '1', '2', '3', '4', '5' ) , labels = c( "Insuff. Wgt", "Normal Wgt", "Overweight L1", "Overweight L2", "Obesity T1", "Obesity T2", "Obesity T3" ) )

The figure above clearly shows that as the weight category increases from Insufficient Weight up to Obesity Type III, there is a readily apparent increase in BMI. This should come as no surprise, because BMI was used by the creators of the dataset to delineate obesity categories. But do the ranges defined by the authors yield a linear fit to the data?

Next we fit and evaluate a linear regression of the training dataset:

obs_lm <- lm( bmi ~ weight_cat_num, data = train_obs )
summary( obs_lm )

## 
## Call:
## lm(formula = bmi ~ weight_cat_num, data = train_obs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4184 -1.1455 -0.0118  0.8505  9.6666 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    21.37177    0.06185   345.5   <2e-16 ***
## weight_cat_num  3.95469    0.02140   184.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.689 on 1582 degrees of freedom
## Multiple R-squared:  0.9557, Adjusted R-squared:  0.9557 
## F-statistic: 3.416e+04 on 1 and 1582 DF,  p-value: < 2.2e-16

The linear model yields a fit to the data described by the following equation: \[\mbox{BMI} = 21.37 + 3.95 \times \mbox{Weight Category}\] The $R^2$ value is very high, 0.9557, which indicates that weight category explains about 96% of the variance in BMI. Additionally, the p-value is very small (< 2.2e-16), so the linear relationship is highly statistically significant.
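To make the fitted equation concrete, the sketch below plugs each category score into it using the coefficients copied from the summary output. The names intercept, slope and fitted_bmi are illustrative.

```r
# Coefficients copied from the summary( obs_lm ) output above
intercept <- 21.37177
slope     <- 3.95469

# Ordinal scores used for weight_cat_num
weight_cat_num <- c( Insufficient_Weight = -1, Normal_Weight = 0,
                     Overweight_Level_I = 1, Overweight_Level_II = 2,
                     Obesity_Type_I = 3, Obesity_Type_II = 4, Obesity_Type_III = 5 )

# Fitted BMI for each category: intercept + slope * score
fitted_bmi <- intercept + slope * weight_cat_num
round( fitted_bmi, 2 )
```

The fitted value for Normal_Weight is simply the intercept, and each step up the category scale adds roughly 3.95 BMI points.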

Here, the ggfortify method autoplot() is used to assess the appropriateness of the linear model for this data.

autoplot( obs_lm, which = 1:4 ) + theme_minimal()

## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

The diagnostic plots above suggest that the linear fit holds well for all but the highest category, ‘Obesity Type III’. This is not surprising, because this category takes any BMI over 40, whereas the other categories have set intervals. Generally, however, a linear model does very well at describing how BMI increases with weight category. Therefore, going forward, we will use BMI as a continuous variable to explore the relationship of other data attributes with obesity. For example, the following code plots the distribution of bmi values for the categorical variable eats_high_calor_food:

ggplot( train_obs, aes( y = bmi, x = factor( eats_high_calor_food ) ) ) +
  geom_boxplot() +
  ggtitle( 'BMI ~ Eats High Calorie Foods' ) +
  xlab( 'Eats High Calorie Foods' )

The figure above shows that subjects who answer ‘yes’ when asked if they regularly consume high calorie food tend to have a higher BMI than subjects who responded ‘no’. However, is this difference significant? Let’s fit a linear model to the data:

bmi_by_hical <- lm( bmi ~ eats_high_calor_food, data = train_obs )
summary( bmi_by_hical )

## 
## Call:
## lm(formula = bmi ~ eats_high_calor_food, data = train_obs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4438  -5.0678  -0.0239   5.8217  20.3693 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              24.1533     0.5632   42.89   <2e-16 ***
## eats_high_calor_foodyes   6.2891     0.6003   10.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.763 on 1582 degrees of freedom
## Multiple R-squared:  0.06487,    Adjusted R-squared:  0.06428 
## F-statistic: 109.7 on 1 and 1582 DF,  p-value: < 2.2e-16
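One way to read a coefficient table like the one above: with a single two-level predictor, the intercept is the mean of the reference group (‘no’) and the other coefficient is the difference between the two group means. The sketch below checks this on made-up numbers; toy, toy_fit and group_means are hypothetical names, and the values are not from the dataset.

```r
# Hypothetical toy data: BMI values for a 'no' group and a 'yes' group
toy <- data.frame(
  eats_high_calor_food = c( "no", "no", "no", "yes", "yes", "yes" ),
  bmi                  = c( 22, 24, 26, 28, 30, 35 )
)

toy_fit <- lm( bmi ~ eats_high_calor_food, data = toy )

# Group means computed directly
group_means <- tapply( toy$bmi, toy$eats_high_calor_food, mean )

coef( toy_fit )                               # intercept and 'yes' coefficient
group_means[ "no" ]                           # equals the intercept
group_means[ "yes" ] - group_means[ "no" ]    # equals the 'yes' coefficient
```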

The summary of this regression shows that, on average, subjects who responded ‘yes’ have a BMI 6.29 points higher than subjects who responded ‘no’. Additionally, the p-value is very small, which suggests that the difference is significant. However, the $R^2$ value is tiny (about 0.065): this attribute alone describes only around 6.5% of the variance in BMI. Can we do better? Let’s see what happens when we perform a multiple linear regression by including several other data features. For simplicity, this analysis will only use the binary (two-level) attributes in the dataset:

bmi_by_multi <- lm( bmi ~ eats_high_calor_food + family_history_with_overweight + 
                      SMOKE + counts_calories + Gender, data = train_obs )
summary( bmi_by_multi )

## 
## Call:
## lm(formula = bmi ~ eats_high_calor_food + family_history_with_overweight + 
##     SMOKE + counts_calories + Gender, data = train_obs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.7909  -4.8362  -0.1406   5.1931  18.6397 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        20.1925     0.6023  33.523  < 2e-16 ***
## eats_high_calor_foodyes             3.4526     0.5543   6.228 6.03e-10 ***
## family_history_with_overweightyes   9.1796     0.4658  19.709  < 2e-16 ***
## SMOKEyes                            0.6516     1.1325   0.575    0.565    
## counts_caloriesyes                 -3.4889     0.8164  -4.273 2.04e-05 ***
## GenderMale                         -1.7421     0.3496  -4.984 6.93e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.863 on 1578 degrees of freedom
## Multiple R-squared:  0.271,  Adjusted R-squared:  0.2687 
## F-statistic: 117.3 on 5 and 1578 DF,  p-value: < 2.2e-16

The overall p-value remains the same (< 2.2e-16). However, the adjusted $R^2$ value has increased from 0.06428 to 0.2687, suggesting that by including more features as parameters for our linear model fit, we are able to model about 20.4 percentage points more of the variance in BMI (0.2687 - 0.06428 = 0.20442). The coefficients tell us that eating high calorie food, smoking and having a family history of overweight have a positive association with BMI, whereas being male and counting calories have a negative association with BMI. Looking at the p-values for the coefficients, we see that all are significant except the SMOKE parameter (p = 0.565). Let’s drop this term from the linear model:

bmi_by_multi <- lm( bmi ~ eats_high_calor_food + family_history_with_overweight + 
                      counts_calories + Gender, data = train_obs )
summary( bmi_by_multi )

## 
## Call:
## lm(formula = bmi ~ eats_high_calor_food + family_history_with_overweight + 
##     counts_calories + Gender, data = train_obs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.8114  -4.8543  -0.1314   5.2155  18.5864 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        20.2049     0.6018  33.572  < 2e-16 ***
## eats_high_calor_foodyes             3.4389     0.5537   6.211 6.73e-10 ***
## family_history_with_overweightyes   9.1888     0.4654  19.745  < 2e-16 ***
## counts_caloriesyes                 -3.4604     0.8147  -4.247 2.29e-05 ***
## GenderMale                         -1.7297     0.3488  -4.958 7.87e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.861 on 1579 degrees of freedom
## Multiple R-squared:  0.2709, Adjusted R-squared:  0.269 
## F-statistic: 146.6 on 4 and 1579 DF,  p-value: < 2.2e-16

Notice how shedding the extra parameter leads to a slightly higher adjusted $R^2$ value (0.2687 to 0.269), even though the multiple $R^2$ dips slightly (0.271 to 0.2709).
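The adjusted $R^2$ penalizes each extra predictor, which is why it can rise when an uninformative term is removed. The sketch below applies the standard formula to the rounded $R^2$ values printed in the two summaries above; adj_rsq is an illustrative helper, and because the inputs are rounded the results only match the printed values approximately.

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
# n = number of training rows, p = number of predictors (excluding the intercept)
adj_rsq <- function( rsq, n, p ) {
  1 - ( 1 - rsq ) * ( n - 1 ) / ( n - p - 1 )
}

n_train <- 1584   # residual df + number of coefficients: 1578 + 6 (or 1579 + 5)

with_smoke    <- adj_rsq( rsq = 0.2710, n = n_train, p = 5 )
without_smoke <- adj_rsq( rsq = 0.2709, n = n_train, p = 4 )

round( c( with_smoke = with_smoke, without_smoke = without_smoke ), 4 )
```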

Now to see how this model performs with the test subset of the data that we set aside at the beginning of the analysis:

bmi_predict <- bmi_by_multi %>% predict( test_obs )
test_rsq <- rsq_vec( test_obs$bmi, bmi_predict )
test_rsq

## [1] 0.2875665

The $R^2$ value for the test data (0.2876) is comparable to the value for the training data (0.2709).
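A note on the metric: as documented in yardstick, rsq_vec() reports the squared correlation between observed and predicted values, which is not always the same number as the traditional $1 - SSE/SST$ form. The base R sketch below computes both on made-up values; truth, estimate, rsq_cor and rsq_trad are illustrative names, and the numbers are not from the dataset.

```r
# Hypothetical observed and predicted values, for illustration only
truth    <- c( 22.1, 27.5, 31.0, 24.3, 36.8, 29.4 )
estimate <- c( 25.0, 28.2, 29.5, 26.1, 31.7, 30.3 )

# Squared Pearson correlation between observed and predicted values
rsq_cor <- cor( truth, estimate )^2

# Traditional coefficient of determination: 1 - SSE / SST
rsq_trad <- 1 - sum( ( truth - estimate )^2 ) / sum( ( truth - mean( truth ) )^2 )

c( rsq_cor = rsq_cor, rsq_trad = rsq_trad )
```

The traditional form can never exceed the squared correlation, so comparing the two on held-out data is one way to spot predictions that are well correlated with the truth but poorly calibrated.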

This multiple linear regression is a start towards modelling BMI data with a few of the binary attributes in the dataset. However, modelling is an iterative process, so the next step is to see if adding other parameters can improve our model’s ability to describe BMI data.

You made it this far. Wow. Thank you for reading!