library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)
library(ggthemes)

Reading the CSV file

Obesity_data_set <- read.csv("C:\\Users\\admin\\Desktop\\Ankitaa\\Assignments\\Intro to R\\obesity_data_set.csv")

head(Obesity_data_set)
##   Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female  21   1.62   64.0                            yes   no    2   3
## 2 Female  21   1.52   56.0                            yes   no    3   3
## 3   Male  23   1.80   77.0                            yes   no    2   3
## 4   Male  27   1.80   87.0                             no   no    3   3
## 5   Male  22   1.78   89.8                             no   no    2   1
## 6   Male  29   1.62   53.0                             no  yes    2   3
##        CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
## 1 Sometimes    no    2  no   0   1         no Public_Transportation
## 2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
## 3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
## 4 Sometimes    no    2  no   2   0 Frequently               Walking
## 5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
## 6 Sometimes    no    2  no   0   0  Sometimes            Automobile
##            NObeyesdad
## 1       Normal_Weight
## 2       Normal_Weight
## 3       Normal_Weight
## 4  Overweight_Level_I
## 5 Overweight_Level_II
## 6       Normal_Weight

TASK 1

Selecting response and explanatory variables

My data is about obesity. In the models fitted below, Weight is the response variable, and NObeyesdad and MTRANS are the explanatory variables. NObeyesdad records the obesity level, with the values Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III. MTRANS records the mode of transportation (for example public transportation, walking or automobile), and Weight holds each individual's weight.

Response variable: Weight. Explanatory variables: NObeyesdad, MTRANS

Build a linear (or generalized linear) model

Obesity_data_set$NObeyesdad <- as.factor(Obesity_data_set$NObeyesdad)

obesity_model <- glm(Obesity_data_set$Weight ~ Obesity_data_set$NObeyesdad + Obesity_data_set$MTRANS, data = Obesity_data_set)
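
The same kind of model can be written with bare column names, which keeps the coefficient labels shorter. The sketch below is only an illustration: `toy` is a small simulated data frame (an assumption, not the obesity data), and the group means in it are invented. It also checks that `glm()` with its default gaussian family returns the same coefficients as `lm()`.

```r
# Hypothetical stand-in data; the real analysis uses Obesity_data_set.
set.seed(1)
toy <- data.frame(
  NObeyesdad = factor(rep(c("Insufficient_Weight", "Normal_Weight",
                            "Obesity_Type_I"), each = 20)),
  MTRANS     = factor(rep(c("Automobile", "Walking"), times = 30))
)
# Simulated weights: an invented group mean plus random noise.
toy$Weight <- c(50, 62, 93)[as.integer(toy$NObeyesdad)] + rnorm(60, sd = 5)

# Bare column names are looked up in `data`, so `$` is not needed.
toy_glm <- glm(Weight ~ NObeyesdad + MTRANS, data = toy)

# With the default gaussian family, glm() matches an ordinary linear model.
toy_lm <- lm(Weight ~ NObeyesdad + MTRANS, data = toy)
all.equal(coef(toy_glm), coef(toy_lm))
```

With this form the coefficients are labelled `NObeyesdadNormal_Weight`, `MTRANSWalking` and so on, without the data frame prefix.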

Performing an ANOVA test

# ANOVA test

Obesity_data_set$NObeyesdad <- as.factor(Obesity_data_set$NObeyesdad)
anova_obesity <- aov(Weight ~ NObeyesdad, data = Obesity_data_set)

summary(anova_obesity)
##               Df  Sum Sq Mean Sq F value Pr(>F)    
## NObeyesdad     6 1228371  204729    1967 <2e-16 ***
## Residuals   2104  219041     104                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F value is 1967 and the p-value is below 2e-16 (far below 0.001), which is highly significant. So at least one group's mean weight differs from the others.
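
As a quick check on where the F value comes from, it can be recomputed from the sums of squares and degrees of freedom printed in the ANOVA table. The variable names below are illustrative only, and the inputs are the rounded values shown in the table, so the result agrees with the printed F value only up to rounding.

```r
# Values copied from the summary(anova_obesity) table.
ss_between <- 1228371; df_between <- 6      # NObeyesdad row
ss_within  <- 219041;  df_within  <- 2104   # Residuals row

ms_between <- ss_between / df_between   # mean square between groups
ms_within  <- ss_within / df_within     # residual mean square
f_value    <- ms_between / ms_within    # ratio of the two mean squares

round(f_value)

# Upper-tail probability of the F distribution at the observed value.
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)
p_value < 2e-16
```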

TASK 2: Diagnostic plots

# For linearity: Residuals vs Fitted plot

plot(residuals(obesity_model) ~ fitted(obesity_model), main = "Residuals vs Fitted plot",
     xlab = "Fitted values", ylab = "Residuals")

# For homoscedasticity: Residuals vs Fitted plot
plot(fitted(obesity_model), residuals(obesity_model), main = "Homoscedasticity Check",
     xlab = "Fitted values", ylab = "Residuals")

# For normality: Q-Q plot
qqnorm(residuals(obesity_model), main = "Normal Q-Q Plot")

# Residuals vs. Leverage plot
plot(obesity_model, which = 5, id.n = 3, main = "Residuals vs. Leverage",
     pch = 19, col = "red", cex = 1.5)

Normal Q-Q plot: the points suggest positive skewness, meaning the residuals are skewed to the right and are not perfectly normally distributed.
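
A reference line and a numeric skewness value can make this reading of the Q-Q plot easier to verify. The sketch below uses simulated right-skewed values (`toy_resid`, an assumption) in place of `residuals(obesity_model)`, and a simple moment-based skewness formula rather than any particular package.

```r
# Hypothetical right-skewed residuals standing in for residuals(obesity_model).
set.seed(42)
toy_resid <- rexp(500) - 1   # exponential noise, centred near zero

pdf(NULL)                    # null device so that no plot file is written
qqnorm(toy_resid, main = "Normal Q-Q Plot")
qqline(toy_resid, col = "red")   # reference line through the quartiles
invisible(dev.off())

# Moment-based sample skewness; values above 0 indicate a right skew.
skewness <- mean((toy_resid - mean(toy_resid))^3) / sd(toy_resid)^3
skewness > 0
```

In an interactive session the `pdf(NULL)` and `dev.off()` lines can be dropped so the plot appears on screen.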

Residuals vs. Leverage plot: no points lie outside the dashed Cook's distance lines, indicating the absence of influential outliers.
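
The visual check against the dashed lines can also be done numerically with `cooks.distance()`. This is a minimal sketch on simulated data (`toy` and `toy_model` are assumptions standing in for the obesity data and `obesity_model`). The cutoff of 0.5 is the lower of the two default contour levels (0.5 and 1) that `plot()` draws for `which = 5`.

```r
# Hypothetical data and model standing in for obesity_model.
set.seed(7)
toy <- data.frame(group = factor(rep(c("a", "b", "c"), each = 30)))
toy$y <- c(50, 60, 90)[as.integer(toy$group)] + rnorm(90, sd = 5)
toy_model <- glm(y ~ group, data = toy)

# Cook's distance for every observation.
cooks <- cooks.distance(toy_model)

# Observations beyond the 0.5 contour would count as potentially influential.
influential <- which(cooks > 0.5)
length(influential)   # 0 means no point lies beyond the 0.5 contour
```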

TASK 3: Model summary

summary(obesity_model)
## 
## Call:
## glm(formula = Obesity_data_set$Weight ~ Obesity_data_set$NObeyesdad + 
##     Obesity_data_set$MTRANS, data = Obesity_data_set)
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                     50.5921     0.7732  65.430
## Obesity_data_set$NObeyesdadNormal_Weight        11.8290     0.8758  13.506
## Obesity_data_set$NObeyesdadObesity_Type_I       42.8954     0.8274  51.843
## Obesity_data_set$NObeyesdadObesity_Type_II      65.3220     0.8593  76.017
## Obesity_data_set$NObeyesdadObesity_Type_III     71.2817     0.8437  84.490
## Obesity_data_set$NObeyesdadOverweight_Level_I   24.2401     0.8610  28.152
## Obesity_data_set$NObeyesdadOverweight_Level_II  32.0390     0.8642  37.072
## Obesity_data_set$MTRANSBike                      3.1053     3.8962   0.797
## Obesity_data_set$MTRANSMotorbike                -0.7684     3.1242  -0.246
## Obesity_data_set$MTRANSPublic_Transportation    -0.9357     0.5619  -1.665
## Obesity_data_set$MTRANSWalking                   3.2176     1.4821   2.171
##                                                Pr(>|t|)    
## (Intercept)                                      <2e-16 ***
## Obesity_data_set$NObeyesdadNormal_Weight         <2e-16 ***
## Obesity_data_set$NObeyesdadObesity_Type_I        <2e-16 ***
## Obesity_data_set$NObeyesdadObesity_Type_II       <2e-16 ***
## Obesity_data_set$NObeyesdadObesity_Type_III      <2e-16 ***
## Obesity_data_set$NObeyesdadOverweight_Level_I    <2e-16 ***
## Obesity_data_set$NObeyesdadOverweight_Level_II   <2e-16 ***
## Obesity_data_set$MTRANSBike                       0.426    
## Obesity_data_set$MTRANSMotorbike                  0.806    
## Obesity_data_set$MTRANSPublic_Transportation      0.096 .  
## Obesity_data_set$MTRANSWalking                    0.030 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 103.7417)
## 
##     Null deviance: 1447412  on 2110  degrees of freedom
## Residual deviance:  217857  on 2100  degrees of freedom
## AIC: 15803
## 
## Number of Fisher Scoring iterations: 2

The coefficient for NObeyesdadNormal_Weight is 11.8290, so individuals classified as ‘Normal Weight’ weigh on average 11.8290 units more than the reference category (Insufficient Weight, the level with no coefficient of its own), holding mode of transportation fixed.

The coefficient for NObeyesdadObesity_Type_I is 42.8954, so individuals classified as ‘Obesity Type I’ weigh on average 42.8954 units more than the reference category.

The coefficient for MTRANSWalking is 3.2176, so individuals who use ‘Walking’ as their mode of transportation weigh on average 3.2176 units more than those using the reference mode (Automobile) within the same obesity category. This difference is significant at the 5% level (p = 0.030).

The coefficient for MTRANSBike is 3.1053, so individuals who use ‘Bike’ as their mode of transportation weigh on average 3.1053 units more than those using the reference mode, but this difference is not statistically significant (p = 0.426).
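
The reference category in these comparisons is the first level of each factor, which `as.factor()` sets alphabetically. The sketch below shows how to inspect it and how `relevel()` could pick a different baseline. The level names are taken from the output in this report, except `Insufficient_Weight`, whose underscore spelling is an assumption based on the other names. Using `Normal_Weight` as the new baseline is an illustrative choice only.

```r
# Level names as they appear in this report (Insufficient_Weight is assumed).
obesity_levels <- c("Insufficient_Weight", "Normal_Weight", "Obesity_Type_I",
                    "Obesity_Type_II", "Obesity_Type_III",
                    "Overweight_Level_I", "Overweight_Level_II")
nobeyesdad <- as.factor(obesity_levels)

# as.factor() sorts the levels; the first one is the reference category.
levels(nobeyesdad)[1]

# relevel() moves a chosen level to the front so it becomes the baseline.
nobeyesdad_alt <- relevel(nobeyesdad, ref = "Normal_Weight")
levels(nobeyesdad_alt)[1]
```

Refitting the model after such a change would express every obesity-level coefficient as a difference from the new baseline instead.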