library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(purrr)
library(boot)
library(lindia)
df <- read.csv("~/Downloads/ObesityDataSet_raw_and_data_sinthetic.csv", header=TRUE)
# BMI = weight / height^2
df$BMI <- df$Weight / df$Height^2
df |> group_by(MTRANS) |>summarise( count = n())
## # A tibble: 5 × 2
## MTRANS count
## <chr> <int>
## 1 Automobile 457
## 2 Bike 7
## 3 Motorbike 11
## 4 Public_Transportation 1580
## 5 Walking 56
ggplot(df, aes(x = MTRANS, y = BMI)) +
geom_boxplot() +
labs(title = "Boxplot of BMI by Means of Transportation",
x = "Means of Transportation",
y = "BMI") +
theme_minimal()
Analysis of Variance:
m <- aov(BMI ~ MTRANS, data = df)
summary(m)
## Df Sum Sq Mean Sq F value Pr(>F)
## MTRANS 4 2741 685.2 10.88 9.97e-09 ***
## Residuals 2106 132682 63.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pairwise.t.test(df$BMI, df$MTRANS, p.adjust.method = 'bonferroni')
##
## Pairwise comparisons using t tests with pooled SD
##
## data: df$BMI and df$MTRANS
##
## Automobile Bike Motorbike Public_Transportation
## Bike 1.00 - - -
## Motorbike 1.00 1.00 - -
## Public_Transportation 0.29 1.00 0.70 -
## Walking 9.4e-06 1.00 1.00 2.7e-08
##
## P value adjustment method: bonferroni
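The p-value of 9.97e-09 is far below any conventional significance level alpha (the tolerated false positive rate, e.g. 0.05), so the null hypothesis that mean BMI is the same for every means of transportation can be rejected.
The Bonferroni-adjusted pairwise tests show where the difference lies: the Walking group differs significantly from both the Automobile group (p = 9.4e-06) and the Public_Transportation group (p = 2.7e-08). No other pair differs significantly, including Automobile versus Public_Transportation (p = 0.29). The Bike and Motorbike groups have only 7 and 11 observations, so comparisons involving them have little power.
Therefore, BMI and means of transportation are related and not independent.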
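As a cross-check, the F statistic and p-value in the ANOVA table can be recomputed from its sums of squares and degrees of freedom. The sketch below uses the rounded values printed in that table, so it agrees with the reported figures only up to rounding. The variable names are illustrative, and the eta squared line is an extra summary that is not part of the original analysis.

```r
# Values copied (rounded) from the summary(m) output
ss_between <- 2741     # Sum Sq for MTRANS
ss_within  <- 132682   # Sum Sq for Residuals
df_between <- 4
df_within  <- 2106

# F is the ratio of the between-group to the within-group mean square
f_stat <- (ss_between / df_between) / (ss_within / df_within)

# The upper tail of the F distribution gives the p-value
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Share of total BMI variance attributable to MTRANS (eta squared)
eta_sq <- ss_between / (ss_between + ss_within)

round(f_stat, 2)
signif(p_value, 3)
round(eta_sq, 3)
```

An eta squared of roughly 0.02 would suggest that, although the group difference is statistically significant, MTRANS accounts for only about 2% of the variation in BMI.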
Next, consider CH2O as the explanatory variable. The null hypothesis for the linear regression is that the average amount of water a person consumes per day and that person's BMI are independent, so the slope of the regression line is zero.
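A null hypothesis of this kind is usually tested through the slope's t-test in the coefficient table of `summary()`. The following is a minimal sketch on synthetic data, not the obesity dataset: the data frame `toy`, its generating parameters, and the 5% threshold are all assumptions made for illustration.

```r
# Synthetic data for illustration only; not the obesity dataset
set.seed(1)
toy <- data.frame(CH2O = runif(200, min = 1, max = 3))
toy$BMI <- 26 + 1.9 * toy$CH2O + rnorm(200, sd = 8)

fit <- lm(BMI ~ CH2O, data = toy)

# Row "CH2O" of the coefficient table holds the slope estimate,
# its standard error, the t statistic, and the p-value for H0: slope = 0
slope_row <- coef(summary(fit))["CH2O", ]
slope_row

# Reject H0 at the 5% level if the p-value falls below 0.05
unname(slope_row["Pr(>|t|)"]) < 0.05
```

Note that a zero slope only rules out a linear association; it is a weaker statement than full independence.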
df |> ggplot(mapping = aes(x = CH2O, y = BMI)) +
geom_point(size = 0.5) +
geom_smooth(method = 'lm', se = FALSE, color = 'Orange') +
geom_hline(yintercept = mean(df$BMI), linetype = 'dashed') +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
CH2O values represent the amount of water consumed in a day:
1 -> less than a liter
2 -> between 1 L and 2 L
3 -> more than 2 L
Therefore, it can be assumed that a CH2O value between 1 and 2, or between 2 and 3, roughly equals the number of liters of water consumed. The endpoints are open-ended categories: CH2O = 1 means less than 1 L, and CH2O = 3 means more than 2 L, so the true amount may well exceed 3 L.
linear_regress <- lm(df$BMI~df$CH2O)
linear_regress$coefficients
## (Intercept) df$CH2O
## 25.915648 1.884706
The intercept (beta-0) of 25.915 is the model's estimated BMI for an individual who drinks no water (CH2O = 0). This does not happen in reality, and it lies outside the CH2O range of 1 to 3, so it is an extrapolation.
Since 25 is the upper BMI cutoff for normal weight and the slope is positive, an intercept of 25.915 means that every fitted value exceeds 25: the model implicitly claims there are no normal-weight or underweight individuals in the dataset. This is incorrect and does not reflect the actual data distribution.
The coefficient of the explanatory variable (beta-1) is 1.885: for every 1-unit increase in CH2O, i.e. roughly one additional liter of water per day, the model's estimated BMI increases by 1.885 BMI units.
According to the model, the maximum estimated BMI (at CH2O = 3) is beta-1 * 3 + beta-0 = 31.57.
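The fitted values across the whole CH2O range can be reproduced by hand from the two coefficients. This is a small sketch that copies the coefficients from the `lm()` output; the names `beta0`, `beta1`, and `fitted_bmi` are illustrative.

```r
# Coefficients copied from the lm() output
beta0 <- 25.915648
beta1 <- 1.884706

# CH2O only takes values between 1 and 3 in this dataset
ch2o <- c(1, 2, 3)
fitted_bmi <- beta0 + beta1 * ch2o

round(fitted_bmi, 2)
# 27.80 29.69 31.57
```

All three fitted values lie above 25, which is why the model never predicts a normal-weight BMI.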
A better model could be developed if CH2O were recorded as true numeric values (actual liters consumed) spanning a wider range, at least 1 to 15, with decent variance across the observations.
df|> group_by(CH2O) |>filter(CH2O == 3)|> summarise(count = n(), mean = mean(BMI), max = max(BMI), min = min(BMI), sd = sd(BMI))
## # A tibble: 1 × 6
## CH2O count mean max min sd
## <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 3 162 27.2 40.6 13.3 4.97
df |> ggplot(mapping = aes(x = Weight, y = BMI)) + geom_point(size = 0.5) + geom_smooth(method = 'lm', se = FALSE ) + theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
lm_weight <- lm(df$BMI~df$Weight)
lm_weight
##
## Call:
## lm(formula = df$BMI ~ df$Weight)
##
## Coefficients:
## (Intercept) df$Weight
## 4.9419 0.2859
The intercept of the linear regression model is 4.9419, which indicates that when Weight is 0 (a value with no real-world meaning) the estimated BMI is 4.9419, whereas the correct BMI would be 0 (or undefined, if Height is also 0).
The coefficient of the explanatory variable (Weight) is 0.2859: for every unit change in Weight, the model estimates a 0.2859 change in BMI. Ideally, keeping Height constant, a unit change in Weight changes BMI by exactly 1 / Height^2, since BMI = Weight / Height^2; for a height of 1.70 this is about 0.346.
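Because BMI is computed as Weight / Height^2, the effect of Weight at a fixed height can be checked directly. The sketch below uses hypothetical values (a single height of 1.70 and a grid of weights, with metres and kilograms as assumed units) and suggests that the slope at fixed height is 1 / Height^2, with a zero intercept.

```r
# Hypothetical example: one fixed height, several weights
height <- 1.70                    # metres (assumed unit)
weight <- seq(50, 120, by = 10)   # kilograms (assumed unit)
bmi    <- weight / height^2

fit <- lm(bmi ~ weight)

# With height fixed, BMI is exactly proportional to weight:
# the slope equals 1 / height^2 and the intercept is (numerically) zero
unname(coef(fit)["weight"])
1 / height^2
```

In the real data Height varies between individuals, so the fitted slope of 0.2859 and the non-zero intercept reflect an average over many heights rather than a single 1 / Height^2 value.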
On the downside, the scatter plot shows relatively low error variance at high Weight, so the spread around the line is not constant. But overall a linear regression model does a good