library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(purrr)
library(boot)
library(lindia)
df <- read.csv("~/Downloads/ObesityDataSet_raw_and_data_sinthetic.csv", header=TRUE)

# BMI = weight (kg) / height (m)^2
df$BMI <- df$Weight / df$Height^2
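As a quick sanity check of the BMI formula, the sketch below computes BMI for one hypothetical person. The weight and height values are made up for illustration and are not taken from the dataset; the units (kilograms and meters) are assumed.

```r
# Hypothetical values for illustration only, not rows from the dataset
weight_kg <- 70
height_m <- 1.75

# BMI = weight / height^2 (assuming kilograms and meters)
bmi <- weight_kg / height_m^2
print(round(bmi, 2))
```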

Select a continuous (or ordered integer) column of data that seems most “valuable” given the context of your data, and call this your response variable.

Select a categorical column of data (explanatory variable) that you expect might influence the response variable.

Devise a null hypothesis for an ANOVA test given this situation. Test this hypothesis using ANOVA, and summarize your results. Be clear about how the R output relates to your conclusions. If there are more than 10 categories, consolidate them before running the test using the methods we’ve learned in class.

df |> group_by(MTRANS) |> summarise(count = n())
## # A tibble: 5 × 2
##   MTRANS                count
##   <chr>                 <int>
## 1 Automobile              457
## 2 Bike                      7
## 3 Motorbike                11
## 4 Public_Transportation  1580
## 5 Walking                  56
ggplot(df, aes(x = MTRANS, y = BMI)) +
  geom_boxplot() +
  labs(title = "Boxplot of BMI by Means of Transportation",
       x = "Means of Transportation",
       y = "BMI") +
  theme_minimal()

Analysis of Variance:

m <- aov(BMI ~ MTRANS, data = df)
summary(m)
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## MTRANS         4   2741   685.2   10.88 9.97e-09 ***
## Residuals   2106 132682    63.0                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
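To show how the columns of the ANOVA table relate to each other, the sketch below rebuilds the mean squares, the F value, and the p-value from the printed sums of squares and degrees of freedom. It also computes eta squared (the share of total variance attributed to MTRANS), which is an added illustration and not something the original analysis reports. Because the inputs are the rounded values printed above, the results may differ slightly from the table.

```r
# Values copied from the summary(m) output (rounded as printed)
ss_between <- 2741     # Sum Sq for MTRANS
ss_resid   <- 132682   # Sum Sq for Residuals
df_between <- 4
df_resid   <- 2106

# Mean square = sum of squares / degrees of freedom
ms_between <- ss_between / df_between
ms_resid   <- ss_resid / df_resid

# F = between-group mean square / residual mean square
f_value <- ms_between / ms_resid

# Upper-tail probability of the F distribution
p_value <- pf(f_value, df_between, df_resid, lower.tail = FALSE)

# Eta squared: proportion of total variation explained by the grouping
eta_sq <- ss_between / (ss_between + ss_resid)

print(c(F = f_value, p = p_value, eta_sq = eta_sq))
```

Under these numbers, MTRANS accounts for roughly 2% of the variation in BMI, so the very small p-value reflects the large sample size as much as the size of the group differences.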
pairwise.t.test(df$BMI, df$MTRANS, p.adjust.method = 'bonferroni')
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  df$BMI and df$MTRANS 
## 
##                       Automobile Bike Motorbike Public_Transportation
## Bike                  1.00       -    -         -                    
## Motorbike             1.00       1.00 -         -                    
## Public_Transportation 0.29       1.00 0.70      -                    
## Walking               9.4e-06    1.00 1.00      2.7e-08              
## 
## P value adjustment method: bonferroni
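The Bonferroni adjustment multiplies each raw p-value by the number of comparisons and caps the result at 1. The sketch below illustrates this using four of the adjusted values printed above; the raw p-values are back-calculated from them (an assumption that holds only for values that were not capped at 1), and the vector names are hypothetical labels.

```r
# Adjusted p-values copied from the pairwise.t.test() output
adjusted <- c(auto_vs_public    = 0.29,
              moto_vs_public    = 0.70,
              auto_vs_walking   = 9.4e-06,
              public_vs_walking = 2.7e-08)

# Five transport groups give choose(5, 2) pairwise comparisons
n_comparisons <- choose(5, 2)

# Back-calculate raw p-values (valid because none of these were capped at 1)
raw <- adjusted / n_comparisons

# Re-apply the adjustment to confirm it reproduces the printed values
readjusted <- p.adjust(raw, method = "bonferroni", n = n_comparisons)

# Pairs that remain significant at the 0.05 level after adjustment
significant <- names(adjusted)[adjusted < 0.05]
print(significant)
```

Only the comparisons involving Walking fall below 0.05 here, which matches the printed table: Walking differs from Automobile and from Public_Transportation, while the other pairs do not show a detectable difference.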

Explain what this might mean for people who may be interested in your data. E.g., “there is not enough evidence to conclude [—-], so it would be safe to assume that we can [——]”.

Find a single continuous (or ordered integer, non-binary) column of data that might influence the response variable. Make sure the relationship between this variable and the response is roughly linear.

Taking CH2O as the explanatory variable, the null hypothesis for the linear regression is that the average amount of water a person consumes per day and that person's BMI are independent, that is, the slope on CH2O is zero.

Build a linear regression model of the response using just this column, and evaluate its fit.

df |> ggplot(mapping = aes(x = CH2O, y = BMI)) +
  geom_point(size = 0.5) +
  geom_smooth(method = 'lm', se = FALSE, color = 'Orange') +
  geom_hline(yintercept = mean(df$BMI), linetype = 'dashed') + theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

CH2O values represent the amount of water consumed in a day:
1 -> less than 1 L
2 -> between 1 L and 2 L
3 -> more than 2 L

Therefore, it can be assumed that a CH2O value between 1 and 3 roughly corresponds to the number of liters of water consumed. At the endpoints, CH2O = 1 means less than 1 L and CH2O = 3 means more than 2 L.

linear_regress <- lm(df$BMI~df$CH2O)
linear_regress$coefficients
## (Intercept)     df$CH2O 
##   25.915648    1.884706
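To make the coefficients concrete, the sketch below plugs the three coded CH2O levels into the fitted line, BMI = intercept + slope × CH2O, using the coefficient values printed above. The variable names are chosen for illustration.

```r
# Coefficients copied from linear_regress$coefficients
intercept <- 25.915648
slope     <- 1.884706

# The three coded water-intake levels
ch2o <- c(1, 2, 3)

# Predicted BMI at each level from the fitted line
predicted_bmi <- intercept + slope * ch2o
print(data.frame(CH2O = ch2o, predicted_BMI = round(predicted_bmi, 2)))
```

Each one-unit increase in CH2O is associated with a predicted BMI about 1.88 higher. The intercept corresponds to CH2O = 0, which lies outside the coded range of 1 to 3, so it should not be read as a meaningful BMI on its own. The model describes an association and does not by itself show that drinking more water changes BMI.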

Interpret the coefficients of your model, and explain how they relate to the context of your data. For example, can you make any recommendations about an optimal way of doing something?

df |> group_by(CH2O) |> filter(CH2O == 3) |> summarise(count = n(), mean = mean(BMI), max = max(BMI), min = min(BMI), sd = sd(BMI))
## # A tibble: 1 × 6
##    CH2O count  mean   max   min    sd
##   <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1     3   162  27.2  40.6  13.3  4.97

Another linear regression model

df |> ggplot(mapping  = aes(x = Weight, y = BMI)) + geom_point(size = 0.5) + geom_smooth(method = 'lm', se = FALSE ) + theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

lm_weight <- lm(df$BMI~df$Weight)
lm_weight
## 
## Call:
## lm(formula = df$BMI ~ df$Weight)
## 
## Coefficients:
## (Intercept)    df$Weight  
##      4.9419       0.2859
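Since BMI is computed directly from Weight and Height, it can help to see how the fitted line compares with the defining formula. The sketch below evaluates the fitted line at a few hypothetical weights and sets the results beside the exact BMI for a single assumed height of 1.70 m. The weights and the height are illustrative assumptions, not values from the dataset.

```r
# Coefficients copied from the lm_weight output
intercept <- 4.9419
slope     <- 0.2859

# Hypothetical weights (kg) and one assumed height (m), for illustration only
weight_kg <- c(60, 80, 100)
height_m  <- 1.70

# BMI predicted by the fitted line, which ignores height
fitted_bmi <- intercept + slope * weight_kg

# BMI from the definition, for the assumed height
exact_bmi <- weight_kg / height_m^2

print(data.frame(weight_kg,
                 fitted = round(fitted_bmi, 2),
                 exact  = round(exact_bmi, 2)))
```

The two columns are close but not identical, because the regression averages over the heights in the data while the definition uses a specific height. A strong fit here is expected by construction and says little about what drives BMI.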

Further Questions