DataDive-11

library(boot)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(broom)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ✔ readr     2.1.5

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(purrr)
library(lindia)

df <- read.csv("~/Downloads/ObesityDataSet_raw_and_data_sinthetic.csv", header=TRUE)

# BMI = weight / height^2
df$BMI <- df$Weight / df$Height^2

# Indicator: 1 if BMI is above the obesity cutoff of 30, otherwise 0
df$is_obese <- ifelse(df$BMI > 30, 1, 0)

Build a linear (or generalized linear) model as you like

model_1 <- lm(BMI~Age + CALC, df)
summary(model_1)

## 
## Call:
## lm(formula = BMI ~ Age + CALC, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3099  -5.8148  -0.7309   5.3913  20.7737 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    15.89403    7.55426   2.104   0.0355 *  
## Age             0.31416    0.02594  12.112   <2e-16 ***
## CALCFrequently  2.56050    7.58989   0.337   0.7359    
## CALCno          3.58899    7.54092   0.476   0.6342    
## CALCSometimes   7.52914    7.53775   0.999   0.3180    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.535 on 2106 degrees of freedom
## Multiple R-squared:  0.1172, Adjusted R-squared:  0.1155 
## F-statistic: 69.87 on 4 and 2106 DF,  p-value: < 2.2e-16
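
The `is_obese` indicator created above is binary, so it could serve as the response of a generalized linear model. The following is a minimal sketch of a logistic regression, not part of the original analysis. It runs on a simulated stand-in data frame (`sim`, with made-up values and the same column names as `df`) so that it is self-contained; `model_glm` and `odds_ratio_age` are illustrative names.

```r
# Simulated stand-in for df: column names follow the document, values are made up
set.seed(1)
n <- 500
sim <- data.frame(Age = runif(n, 15, 60))
sim$BMI <- 20 + 0.3 * sim$Age + rnorm(n, sd = 6)
sim$is_obese <- ifelse(sim$BMI > 30, 1, 0)

# Logistic regression: log-odds of being obese as a linear function of Age
model_glm <- glm(is_obese ~ Age, data = sim, family = binomial)

# Exponentiated coefficient: multiplicative change in the odds per extra unit of Age
odds_ratio_age <- exp(coef(model_glm)["Age"])
print(odds_ratio_age)
```

On the real data the same call would take `data = df`, and further predictors such as CALC could be added to the formula.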

Use whatever response variable and explanatory variables you prefer

As an experiment, Age and CALC (alcohol consumption frequency) are used as explanatory variables to model BMI as the response variable.

df |> ggplot(mapping = aes(x = Age, y = BMI))+ geom_point() + theme_minimal() +
  geom_smooth(method = 'lm', se = FALSE, color = 'red')

## `geom_smooth()` using formula = 'y ~ x'

magecalc <- lm(Age~CALC, df)
summary(magecalc)

## 
## Call:
## lm(formula = Age ~ CALC, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.132  -4.342  -1.526   1.744  33.859 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      21.000      6.328   3.319  0.00092 ***
## CALCFrequently    6.141      6.373   0.964  0.33536    
## CALCno            3.132      6.333   0.494  0.62102    
## CALCSometimes     3.256      6.330   0.514  0.60704    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.328 on 2107 degrees of freedom
## Multiple R-squared:  0.007019,   Adjusted R-squared:  0.005605 
## F-statistic: 4.965 on 3 and 2107 DF,  p-value: 0.001954
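
In the output above, the standard error of the intercept (6.328) equals the residual standard error (6.328). With dummy coding, the intercept's standard error is the residual standard error divided by the square root of the number of rows in the reference level, so this pattern is what a reference level with a single observation would produce. That would also be consistent with the large standard errors on the CALC terms in model_1. The sketch below reproduces the effect on simulated data and shows one possible remedy, re-levelling the factor. The data frame `sim`, its level counts, and the choice of "Sometimes" as the new reference are assumptions made for illustration.

```r
set.seed(2)
# Simulated stand-in: the level "Always" deliberately has a single row
sim <- data.frame(
  CALC = c("Always", rep("Frequently", 50), rep("no", 200), rep("Sometimes", 400))
)
sim$Age <- rnorm(nrow(sim), mean = 24, sd = 6)

# Rows per level; a count of 1 flags a fragile reference level
level_counts <- table(sim$CALC)
print(level_counts)

# Default coding: the first level in sort order ("Always") is the reference
fit_default <- lm(Age ~ CALC, data = sim)
se_default <- summary(fit_default)$coefficients["(Intercept)", "Std. Error"]
sigma_default <- summary(fit_default)$sigma

# Use the most frequent level as the reference instead
sim$CALC <- relevel(factor(sim$CALC), ref = "Sometimes")
fit_releveled <- lm(Age ~ CALC, data = sim)
se_releveled <- summary(fit_releveled)$coefficients["(Intercept)", "Std. Error"]

print(c(default = se_default, releveled = se_releveled, sigma = sigma_default))
```

On the real data, `table(df$CALC)` is the quickest way to check whether this applies.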

Use the tools from previous weeks to diagnose the model

Residual Histogram

residuals <- resid(model_1)
ggplot(data = data.frame(residuals), aes(x = residuals)) +
  geom_histogram(binwidth = 1, fill = "orange", color = "black", alpha = 0.7) +
  theme_minimal()

Residuals vs fitted values

gg_resfitted(model_1) +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

QQ-Plot

gg_qqplot(model_1)

Cook’s D by Observation

gg_cooksd(model_1)

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_segment()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).

model_1

## 
## Call:
## lm(formula = BMI ~ Age + CALC, data = df)
## 
## Coefficients:
##    (Intercept)             Age  CALCFrequently          CALCno   CALCSometimes  
##        15.8940          0.3142          2.5605          3.5890          7.5291

Highlight any issues with the model

The residual histogram shows that the residuals are not normally distributed, so the fifth assumption, ‘errors are normally distributed over the predicted line’, does not hold.
The residuals vs fitted plot shows that the residual variance is not constant across the fitted values, which violates the second assumption, ‘errors have constant variance across all predictions’.
The QQ-plot shows how far the residuals deviate from the theoretical normal distribution. Here both the lower and the upper quantiles deviate heavily from the reference line.
The Cook’s D plot shows that data point 134 has a large influence on ‘model_1’: removing it would noticeably change the fitted model.
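
The visual checks above can be backed up with numbers. The following is a rough sketch using only base R: a Shapiro-Wilk test on the residuals, Cook's distances with the common 4/n rule of thumb, and a refit without the flagged rows. It runs on simulated data (`sim`, with deliberately skewed errors) rather than on `df`; the 4/n cutoff and all object names are assumptions for illustration, not part of the original analysis.

```r
set.seed(3)
n <- 300
sim <- data.frame(Age = runif(n, 15, 60))
# Right-skewed, mean-zero errors so the normality check has something to detect
sim$BMI <- 20 + 0.3 * sim$Age + (rexp(n, rate = 1 / 5) - 5)
fit <- lm(BMI ~ Age, data = sim)

# Shapiro-Wilk test: a small p-value suggests non-normal residuals
sw <- shapiro.test(resid(fit))
print(sw$p.value)

# Cook's distance per observation; 4/n is a rule-of-thumb cutoff, not a strict test
cooks <- cooks.distance(fit)
keep <- cooks <= 4 / n

# Refit without the flagged rows and compare coefficients side by side
fit_trimmed <- lm(BMI ~ Age, data = sim[keep, ])
print(cbind(full = coef(fit), trimmed = coef(fit_trimmed)))
```

If the two coefficient columns differ noticeably, the flagged observations are driving the fit. Logical indexing with `keep` is used so that the refit still works when no row is flagged.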

Interpret at least one of the coefficients

summary(model_1)

## 
## Call:
## lm(formula = BMI ~ Age + CALC, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3099  -5.8148  -0.7309   5.3913  20.7737 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    15.89403    7.55426   2.104   0.0355 *  
## Age             0.31416    0.02594  12.112   <2e-16 ***
## CALCFrequently  2.56050    7.58989   0.337   0.7359    
## CALCno          3.58899    7.54092   0.476   0.6342    
## CALCSometimes   7.52914    7.53775   0.999   0.3180    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.535 on 2106 degrees of freedom
## Multiple R-squared:  0.1172, Adjusted R-squared:  0.1155 
## F-statistic: 69.87 on 4 and 2106 DF,  p-value: < 2.2e-16

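The coefficient of Age is 0.31416: for every one-unit increase in Age, BMI is estimated to increase by about 0.31 units, holding CALC fixed.
The very low p-value (< 2e-16) for the Age coefficient is strong evidence against the hypothesis that BMI has no linear association with Age.
None of the CALC coefficients is statistically significant, and the model explains only about 12% of the variance in BMI (R-squared 0.1172), so it represents the relationship between BMI and the Age and CALC variables only loosely.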
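
To attach an uncertainty range to the Age coefficient, a 95% confidence interval can be worked out from the numbers in the summary output. The sketch below copies the estimate, standard error, and residual degrees of freedom from that output. The variable names are illustrative, and the 10-unit comparison assumes Age is measured in years.

```r
# Values copied from the summary(model_1) output above
est_age <- 0.31416
se_age <- 0.02594
df_resid <- 2106

# 95% confidence interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df = df_resid)
ci_age <- est_age + c(-1, 1) * t_crit * se_age
print(round(ci_age, 3))

# Estimated BMI difference for a 10-unit difference in Age, CALC held fixed
diff_10 <- 10 * est_age
print(diff_10)
```

With the fitted object available, `confint(model_1, "Age")` should give the same interval directly.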


2024-11-13
