library(boot)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(broom)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(purrr)
library(lindia)
df <- read.csv("~/Downloads/ObesityDataSet_raw_and_data_sinthetic.csv", header=TRUE)
# BMI = weight / height^2 (assuming Weight is in kg and Height in m)
df$BMI <- df$Weight / df$Height^2
# Indicator for BMI above 30
df$is_obese <- ifelse(df$BMI > 30, 1, 0)
model_1 <- lm(BMI~Age + CALC, df)
summary(model_1)
##
## Call:
## lm(formula = BMI ~ Age + CALC, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.3099 -5.8148 -0.7309 5.3913 20.7737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.89403 7.55426 2.104 0.0355 *
## Age 0.31416 0.02594 12.112 <2e-16 ***
## CALCFrequently 2.56050 7.58989 0.337 0.7359
## CALCno 3.58899 7.54092 0.476 0.6342
## CALCSometimes 7.52914 7.53775 0.999 0.3180
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.535 on 2106 degrees of freedom
## Multiple R-squared: 0.1172, Adjusted R-squared: 0.1155
## F-statistic: 69.87 on 4 and 2106 DF, p-value: < 2.2e-16
df |>
  ggplot(mapping = aes(x = Age, y = BMI)) +
  geom_point() +
  theme_minimal() +
  geom_smooth(method = 'lm', se = FALSE, color = 'red')
## `geom_smooth()` using formula = 'y ~ x'
magecalc <- lm(Age~CALC, df)
summary(magecalc)
##
## Call:
## lm(formula = Age ~ CALC, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.132 -4.342 -1.526 1.744 33.859
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.000 6.328 3.319 0.00092 ***
## CALCFrequently 6.141 6.373 0.964 0.33536
## CALCno 3.132 6.333 0.494 0.62102
## CALCSometimes 3.256 6.330 0.514 0.60704
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.328 on 2107 degrees of freedom
## Multiple R-squared: 0.007019, Adjusted R-squared: 0.005605
## F-statistic: 4.965 on 3 and 2107 DF, p-value: 0.001954
residuals <- resid(model_1)
ggplot(data = data.frame(residuals), aes(x = residuals)) +
geom_histogram(binwidth = 1, fill = "orange", color = "black", alpha = 0.7) +
theme_minimal()
gg_resfitted(model_1) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
gg_qqplot(model_1)
gg_cooksd(model_1)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_segment()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
model_1
##
## Call:
## lm(formula = BMI ~ Age + CALC, data = df)
##
## Coefficients:
## (Intercept) Age CALCFrequently CALCno CALCSometimes
## 15.8940 0.3142 2.5605 3.5890 7.5291
The residual histogram does not show the bell shape expected of normally distributed residuals.
This means the fifth assumption, ‘errors are normally distributed around the predicted line’, is violated.
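The visual reading of the histogram can be backed by a formal test. The following is a minimal sketch using the Shapiro-Wilk test from base R. The data here are simulated stand-ins (the names `x`, `y`, and `fit` are hypothetical); in this analysis one would pass `resid(model_1)` instead.

```r
# Simulated stand-in data (hypothetical); in the analysis above, use resid(model_1).
set.seed(42)
n <- 500
x <- runif(n, 15, 60)
y <- 16 + 0.3 * x + rexp(n, rate = 1 / 5)  # skewed, deliberately non-normal errors
fit <- lm(y ~ x)

# shapiro.test() accepts between 3 and 5000 values.
sw <- shapiro.test(resid(fit))
sw$statistic  # W close to 1 is consistent with normality
sw$p.value    # a small p-value is evidence against normality
```

With a couple of thousand observations the test tends to reject even for mild departures, so it is best read alongside the histogram and QQ-plot rather than in place of them.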
The residuals vs. fitted plot shows that the spread of the residuals changes across the fitted values, so the residuals do not have constant variance. This violates the second assumption, ‘errors have constant variance across all predictions’.
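Non-constant variance can also be checked numerically. The sketch below computes a studentized Breusch-Pagan statistic by hand in base R: regress the squared residuals on the predictors and compare n times the R-squared of that regression to a chi-squared distribution. The data and the names `fit`, `aux`, and `lm_stat` are hypothetical; for `model_1` the auxiliary regression would use `Age + CALC`. The `lmtest` package offers `bptest()` for the same purpose, if it is installed.

```r
# Simulated stand-in data (hypothetical) whose error spread grows with x.
set.seed(1)
n <- 400
x <- runif(n, 15, 60)
y <- 16 + 0.3 * x + rnorm(n, sd = 0.1 * x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictors.
aux <- lm(resid(fit)^2 ~ x)

# LM statistic = n * R^2, chi-squared with one df per predictor.
lm_stat <- n * summary(aux)$r.squared
df_aux  <- length(coef(aux)) - 1
p_value <- pchisq(lm_stat, df = df_aux, lower.tail = FALSE)
p_value  # a small p-value is evidence against constant variance
```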
The QQ-plot shows how far the residuals deviate from the theoretical normal distribution. Here the lower and upper quantiles deviate heavily from the reference line.
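The comparison a QQ-plot draws can be tabulated directly. This is a small sketch, on simulated stand-in data, that sets the quantiles of the standardized residuals beside the matching standard normal quantiles; the names `fit`, `probs`, and `qq_table` are hypothetical, and `model_1` would take the place of `fit` in this analysis.

```r
# Simulated stand-in data (hypothetical) with heavy-tailed errors.
set.seed(7)
n <- 1000
x <- runif(n, 15, 60)
y <- 16 + 0.3 * x + 3 * rt(n, df = 3)
fit <- lm(y ~ x)

# Standardized residuals are on the same scale as standard normal quantiles.
z <- rstandard(fit)
probs <- c(0.01, 0.05, 0.50, 0.95, 0.99)

qq_table <- data.frame(
  prob        = probs,
  sample      = unname(quantile(z, probs)),
  theoretical = qnorm(probs)
)
qq_table  # large gaps in the first and last rows correspond to tail deviations
```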
In the Cook’s D diagnostic plot, data point 134 has a large influence on the linear model ‘model_1’.
That is, removing data point 134 would significantly alter the fitted model.
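One way to see what "influence" means in practice is to refit the model without the flagged observation and compare coefficients. The sketch below does this on simulated stand-in data with one planted outlier; the names `dat`, `fit`, `worst`, and `fit_drop` are hypothetical. For this analysis the equivalent would be refitting `model_1` on `df[-134, ]`.

```r
# Simulated stand-in data (hypothetical) with one planted high-leverage outlier.
set.seed(3)
n <- 100
x <- runif(n, 15, 60)
y <- 16 + 0.3 * x + rnorm(n, sd = 2)
x[n] <- 120  # far outside the range of the other x values
y[n] <- 0    # far below the trend
dat <- data.frame(x, y)
fit <- lm(y ~ x, data = dat)

# Cook's distance for every observation; pick the most influential row.
cd <- cooks.distance(fit)
worst <- unname(which.max(cd))

# Refit without that row and compare the coefficients side by side.
fit_drop <- lm(y ~ x, data = dat[-worst, ])
cbind(full = coef(fit), dropped = coef(fit_drop))
```

If the coefficients barely move after the refit, the point is unusual but not influential; a clear shift supports the reading from the Cook’s D plot.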
summary(model_1)
##
## Call:
## lm(formula = BMI ~ Age + CALC, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.3099 -5.8148 -0.7309 5.3913 20.7737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.89403 7.55426 2.104 0.0355 *
## Age 0.31416 0.02594 12.112 <2e-16 ***
## CALCFrequently 2.56050 7.58989 0.337 0.7359
## CALCno 3.58899 7.54092 0.476 0.6342
## CALCSometimes 7.52914 7.53775 0.999 0.3180
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.535 on 2106 degrees of freedom
## Multiple R-squared: 0.1172, Adjusted R-squared: 0.1155
## F-statistic: 69.87 on 4 and 2106 DF, p-value: < 2.2e-16