DataDive_week

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(purrr)
library(boot)
library(lindia)

df <- read.csv("~/Downloads/ObesityDataSet_raw_and_data_sinthetic.csv", header=TRUE)

# BMI = weight / height^2 (assuming Weight is in kg and Height is in meters)
df$BMI <- df$Weight / df$Height^2

Select a continuous (or ordered integer) column of data that seems most “valuable” given the context of your data, and call this your response variable.

The response variable for this dataset is a person's BMI, computed above from Weight and Height.

Select a categorical column of data (explanatory variable) that you expect might influence the response variable.

The explanatory variable chosen is MTRANS, the means of transportation an individual usually uses.

Devise a null hypothesis for an ANOVA test given this situation. Test this hypothesis using ANOVA, and summarize your results. Be clear about how the R output relates to your conclusions. If there are more than 10 categories, consolidate them before running the test using the methods we’ve learned in class.

Null Hypothesis: A person’s BMI is independent of their means of transportation, i.e. the mean BMI is the same in every MTRANS group.

df |> group_by(MTRANS) |> summarise(count = n())

## # A tibble: 5 × 2
##   MTRANS                count
##   <chr>                 <int>
## 1 Automobile              457
## 2 Bike                      7
## 3 Motorbike                11
## 4 Public_Transportation  1580
## 5 Walking                  56

ggplot(df, aes(x = MTRANS, y = BMI)) +
  geom_boxplot() +
  labs(title = "Boxplot of BMI by Means of Transportation",
       x = "Means of Transportation",
       y = "BMI") +
  theme_minimal()

Analysis of Variance:

m <- aov(BMI ~ MTRANS, data = df)
summary(m)

##               Df Sum Sq Mean Sq F value   Pr(>F)    
## MTRANS         4   2741   685.2   10.88 9.97e-09 ***
## Residuals   2106 132682    63.0                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
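
The F value in the table above is the ratio of the two mean squares, and the p-value is the upper tail of the F distribution. A minimal sketch that recomputes both from the rounded sums of squares printed in the table (the variable names are illustrative, and because the inputs are rounded the result only approximates the printed p-value):

```r
# Values copied from the ANOVA table above (rounded as printed).
ss_between <- 2741     # Sum Sq for MTRANS
ss_within  <- 132682   # Sum Sq for Residuals
df_between <- 4        # number of groups - 1
df_within  <- 2106     # number of rows - number of groups

# Mean squares are sums of squares divided by their degrees of freedom.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F statistic and its upper-tail probability under the null hypothesis.
f_value <- ms_between / ms_within
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(round(f_value, 2))
print(p_value)
```

A large F value means the variation between group means is large relative to the variation within groups, which is what drives the small p-value.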

The F value (10.88) is well above 1 and the p-value (9.97e-09) is far below any conventional significance level, so we reject the null hypothesis: mean BMI is not the same across all MTRANS groups.
This implies that the mean BMI of at least one MTRANS group differs from the others.
To determine which groups differ, let's perform pairwise t-tests.

pairwise.t.test(df$BMI, df$MTRANS, p.adjust.method = 'bonferroni')

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  df$BMI and df$MTRANS 
## 
##                       Automobile Bike Motorbike Public_Transportation
## Bike                  1.00       -    -         -                    
## Motorbike             1.00       1.00 -         -                    
## Public_Transportation 0.29       1.00 0.70      -                    
## Walking               9.4e-06    1.00 1.00      2.7e-08              
## 
## P value adjustment method: bonferroni

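With the Bonferroni adjustment, the Walking group differs significantly from the Automobile (p = 9.4e-06) and Public_Transportation (p = 2.7e-08) groups. No other pair is significant at the 0.05 level, including Automobile vs. Public_Transportation (p = 0.29). The comparisons involving Bike (7 rows) and Motorbike (11 rows) rest on very small groups, so they have little power to detect a difference.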
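
With five groups there are choose(5, 2) = 10 pairwise comparisons, and the Bonferroni method multiplies each raw p-value by that count, capping the result at 1. A small sketch of that adjustment, using made-up raw p-values (hypothetical numbers for illustration, not taken from this dataset):

```r
# Number of pairwise comparisons among 5 transportation groups.
n_groups <- 5
n_pairs  <- choose(n_groups, 2)

# Hypothetical raw p-values, one per comparison (illustrative only).
raw_p <- c(0.0005, 0.004, 0.02, 0.04, 0.08, 0.15, 0.3, 0.5, 0.7, 0.9)
stopifnot(length(raw_p) == n_pairs)

# Bonferroni via p.adjust(), and the same thing written out by hand:
# multiply by the number of tests and cap at 1.
adj_p  <- p.adjust(raw_p, method = "bonferroni")
manual <- pmin(1, raw_p * n_pairs)

print(adj_p)
print(all.equal(adj_p, manual))
```

This is why so many entries in the table above are exactly 1.00: any raw p-value of 0.1 or more is capped at 1 after multiplying by 10.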

Explain what this might mean for people who may be interested in your data. E.g., “there is not enough evidence to conclude [—-], so it would be safe to assume that we can [——]”.

The ANOVA p-value of 9.97e-09 is far smaller than any conventional significance level alpha (for example 0.05), so the null hypothesis can be rejected.
The pairwise tests show that this result is driven mainly by the Walking group, whose mean BMI differs from both the Automobile and Public_Transportation groups; the remaining pairs are not significantly different.
Therefore, there is strong evidence that BMI and means of transportation are related and not independent.

Find a single continuous (or ordered integer, non-binary) column of data that might influence the response variable. Make sure the relationship between this variable and the response is roughly linear.

Most of the continuous variables in the dataset are not linearly related to BMI or Weight.
CH2O is used for this linear regression model because, among the candidate variables, it has a reasonably uniform error variance across the predictions.

With CH2O as the explanatory variable, the null hypothesis for the linear regression is that the average amount of water a person consumes per day has no linear relationship with that person's BMI (the slope is zero).

Build a linear regression model of the response using just this column, and evaluate its fit.

df |>
  ggplot(mapping = aes(x = CH2O, y = BMI)) +
  geom_point(size = 0.5) +
  geom_smooth(method = 'lm', se = FALSE, color = 'Orange') +
  geom_hline(yintercept = mean(df$BMI), linetype = 'dashed') +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

CH2O values represent the amount of water consumed in a day:
1 -> less than a liter
2 -> between 1 L and 2 L
3 -> more than 2 L

Therefore, it can be assumed that a CH2O value between 1 and 3 roughly tracks the number of liters of water consumed per day. At the ends of the scale, CH2O = 1 means less than 1 L and CH2O = 3 means more than 2 L.

linear_regress <- lm(df$BMI~df$CH2O)
linear_regress$coefficients

## (Intercept)     df$CH2O 
##   25.915648    1.884706
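
The task also asks for an evaluation of the fit, which the coefficients alone do not give. One common check is R-squared together with the slope's standard error and p-value from summary(). A sketch on simulated data (the data frame `sim`, its coefficients, and its noise level are assumptions chosen for illustration, not the obesity dataset):

```r
set.seed(42)

# Simulated stand-in for the real data: CH2O in [1, 3], noisy linear BMI.
# The coefficients and noise sd here are illustrative assumptions.
sim <- data.frame(CH2O = runif(500, min = 1, max = 3))
sim$BMI <- 26 + 1.9 * sim$CH2O + rnorm(500, sd = 8)

fit <- lm(BMI ~ CH2O, data = sim)
fit_summary <- summary(fit)

# Share of BMI variance explained by CH2O (between 0 and 1).
r_squared <- fit_summary$r.squared

# Estimate, standard error, t value and p-value for each coefficient.
coef_table <- fit_summary$coefficients

print(r_squared)
print(coef_table)
```

The same two extractions could be applied to `linear_regress` to see how much of the BMI variance CH2O actually explains in the real data.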

Interpret the coefficients of your model, and explain how they relate to the context of your data. For example, can you make any recommendations about an optimal way of doing something?

The intercept (beta-0) of 25.915 is the model's estimated BMI for an individual with CH2O = 0, i.e. someone who drinks no water. This does not happen in reality and lies outside the coded range of 1 to 3, so the intercept is an extrapolation.
Since 25 is the usual cutoff between normal weight and overweight, an intercept of 25.915 with a positive slope means every fitted value is above 25, so the model never predicts a normal-weight or underweight BMI. This does not reflect the actual data distribution, which does include such individuals (the minimum BMI at CH2O = 3 in the summary below is 13.3).
The coefficient of the explanatory variable (beta-1) is 1.884: for every 1-unit increase in CH2O, i.e. roughly one more liter of water per day, the estimated BMI increases by 1.884 BMI units.
According to the model, the maximum estimated BMI (at CH2O = 3) is (beta-1)*3 + beta-0 = 31.57. For comparison, the summary below shows that the 162 rows with CH2O exactly 3 have a mean BMI of 27.2.
A better model could likely be developed if CH2O were recorded as true numeric values over a wider range (for example 1 to 15), with reasonably even variance across the predictions.
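
The fitted line can be evaluated by hand across the coded range of CH2O. A short sketch using the coefficients printed in the model output (the variable names are illustrative):

```r
# Coefficients as printed in the model output above.
beta_0 <- 25.915648
beta_1 <- 1.884706

# CH2O is coded from 1 to 3, so these are the endpoints and the midpoint.
ch2o <- c(1, 2, 3)
predicted_bmi <- beta_0 + beta_1 * ch2o

print(round(predicted_bmi, 2))
```

Over the whole observed range the model's estimates stay between roughly 27.8 and 31.57, a narrow band compared with the spread of BMI in the data.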

df |>
  group_by(CH2O) |>
  filter(CH2O == 3) |>
  summarise(count = n(), mean = mean(BMI), max = max(BMI), min = min(BMI), sd = sd(BMI))

## # A tibble: 1 × 6
##    CH2O count  mean   max   min    sd
##   <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1     3   162  27.2  40.6  13.3  4.97

Another linear regression model

Another obvious linear relationship is the one between Weight and BMI.
Since BMI is derived from Weight, regressing one on the other does not reveal anything new.
However, the relationship is also affected by the Height variable, which spreads the data points around the regression line and produces the errors seen in the plot.

df |>
  ggplot(mapping = aes(x = Weight, y = BMI)) +
  geom_point(size = 0.5) +
  geom_smooth(method = 'lm', se = FALSE) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Parameters of the line:

lm_weight <- lm(df$BMI~df$Weight)
lm_weight

## 
## Call:
## lm(formula = df$BMI ~ df$Weight)
## 
## Coefficients:
## (Intercept)    df$Weight  
##      4.9419       0.2859

The intercept of the linear regression model is 4.9419, which indicates that when Weight is 0 (which has no real-world meaning) the estimated BMI is 4.9419, whereas the correct BMI would be 0 (or undefined, if Height is also 0).
The coefficient of the explanatory variable (Weight) is 0.2859: for every unit change in Weight, the estimated BMI changes by 0.2859. Holding Height constant, the exact change in BMI per unit of Weight is 1/Height^2, which is about 0.346 for a height of 1.7 m (assuming Height is in meters).
On the downside, the scatter shows relatively lower error variance at high Weight, so the spread of the residuals is not constant. But overall a linear regression model does a good job.
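
To see what slope to expect when Height is held constant, BMI can be computed for a few weights at one fixed height. A minimal sketch, assuming Weight is in kg and Height is in meters (the height of 1.7 and the weights are illustrative values, not rows from the dataset):

```r
# Height fixed at 1.7 m (an illustrative value), weight in kg.
height <- 1.7
weight <- c(60, 61, 62)

bmi <- weight / height^2

# Change in BMI per extra kilogram at this fixed height.
bmi_per_kg <- diff(bmi)

print(bmi_per_kg)
print(1 / height^2)
```

Both printed values agree, so at a fixed height the slope is 1/Height^2. The fitted slope of 0.2859 is a single compromise across all the heights in the data, which is one reason the points scatter around the line.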

Further Questions

What if the Walking group is left out of the first hypothesis test? Does this change the evidence against the null hypothesis?
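
One way to explore this question is to rerun the ANOVA after dropping the Walking rows and compare the two p-values. The sketch below does this on simulated data (the data frame `sim`, its group sizes, and its group means are assumptions for illustration only, so its output says nothing about the real dataset); the same steps could be applied to `df`:

```r
set.seed(1)

# Simulated stand-in for df: group sizes and means are illustrative assumptions.
groups <- c("Automobile", "Public_Transportation", "Walking")
sim <- data.frame(
  MTRANS = rep(groups, times = c(200, 400, 50)),
  BMI    = c(rnorm(200, mean = 30, sd = 8),
             rnorm(400, mean = 30, sd = 8),
             rnorm(50,  mean = 24, sd = 8))
)

# ANOVA on all groups, then again with the Walking rows removed.
fit_all        <- aov(BMI ~ MTRANS, data = sim)
no_walk        <- subset(sim, MTRANS != "Walking")
fit_no_walking <- aov(BMI ~ MTRANS, data = no_walk)

# Pull the p-value for MTRANS out of each ANOVA table.
p_all        <- summary(fit_all)[[1]][["Pr(>F)"]][1]
p_no_walking <- summary(fit_no_walking)[[1]][["Pr(>F)"]][1]

print(c(all_groups = p_all, without_walking = p_no_walking))
```

If the p-value rises sharply once Walking is removed, that would suggest the original result depends mostly on that one group.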

DataDive_week_8

2024-10-21