library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)
library(ggthemes)
Reading the CSV file
Obesity_data_set<- read.csv("C:\\Users\\admin\\Desktop\\Ankitaa\\Assignments\\Intro to R\\obesity_data_set.csv")
head(Obesity_data_set)
## Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female 21 1.62 64.0 yes no 2 3
## 2 Female 21 1.52 56.0 yes no 3 3
## 3 Male 23 1.80 77.0 yes no 2 3
## 4 Male 27 1.80 87.0 no no 3 3
## 5 Male 22 1.78 89.8 no no 2 1
## 6 Male 29 1.62 53.0 no yes 2 3
## CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS
## 1 Sometimes no 2 no 0 1 no Public_Transportation
## 2 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation
## 3 Sometimes no 2 no 2 1 Frequently Public_Transportation
## 4 Sometimes no 2 no 2 0 Frequently Walking
## 5 Sometimes no 2 no 0 0 Sometimes Public_Transportation
## 6 Sometimes no 2 no 0 0 Sometimes Automobile
## NObeyesdad
## 1 Normal_Weight
## 2 Normal_Weight
## 3 Normal_Weight
## 4 Overweight_Level_I
## 5 Overweight_Level_II
## 6 Normal_Weight
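Before modelling, it can help to confirm the column types and check for missing values. The following is a minimal sketch that re-enters the six rows printed above by hand (a subset of the columns only); the name `obesity_head` is hypothetical and stands in for the full data set.

```r
# Six rows copied from the head() output above (subset of columns).
# `obesity_head` is a hypothetical stand-in for the full data frame.
obesity_head <- data.frame(
  Gender = c("Female", "Female", "Male", "Male", "Male", "Male"),
  Age    = c(21, 21, 23, 27, 22, 29),
  Height = c(1.62, 1.52, 1.80, 1.80, 1.78, 1.62),
  Weight = c(64.0, 56.0, 77.0, 87.0, 89.8, 53.0),
  MTRANS = c("Public_Transportation", "Public_Transportation",
             "Public_Transportation", "Walking",
             "Public_Transportation", "Automobile"),
  NObeyesdad = c("Normal_Weight", "Normal_Weight", "Normal_Weight",
                 "Overweight_Level_I", "Overweight_Level_II",
                 "Normal_Weight"),
  stringsAsFactors = FALSE
)

# Column types at a glance
str(obesity_head)

# Number of missing values in each column
n_missing <- colSums(is.na(obesity_head))
print(n_missing)

# Number of rows in each weight class
class_counts <- table(obesity_head$NObeyesdad)
print(class_counts)
```

The same three calls can be run on `Obesity_data_set` itself once the file has been read.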
TASK 1
Selecting response and explanatory variables
My data is about obesity, so I will model Weight as the response variable, with NObeyesdad and MTRANS as the explanatory variables. NObeyesdad is the obesity level, with the values Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III. MTRANS is the mode of transportation, such as public transportation, walking and automobile, and Weight holds each person's body weight.
Response variable: Weight. Explanatory variables: NObeyesdad, MTRANS
Build a linear (or generalized linear) model
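# Treat the obesity class as a categorical predictor
Obesity_data_set$NObeyesdad <- as.factor(Obesity_data_set$NObeyesdad)
# glm() defaults to the gaussian family, so this is an ordinary linear model
obesity_model <- glm(Obesity_data_set$Weight ~ Obesity_data_set$NObeyesdad + Obesity_data_set$MTRANS, data = Obesity_data_set)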
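A more idiomatic way to write this kind of formula is with bare column names, which R looks up in the `data` argument. The sketch below uses simulated data (the `toy` data frame and its values are assumptions, not the real data set) to show that form, and that a default `glm()` gives the same estimates as `lm()`.

```r
set.seed(1)

# Simulated stand-in data; only the column names mirror the real data set.
toy <- data.frame(
  NObeyesdad = factor(rep(c("Insufficient_Weight", "Normal_Weight",
                            "Obesity_Type_I"), each = 20)),
  MTRANS = factor(rep(c("Automobile", "Walking"), times = 30))
)
toy$Weight <- 50 +
  12 * (toy$NObeyesdad == "Normal_Weight") +
  43 * (toy$NObeyesdad == "Obesity_Type_I") +
  rnorm(60, sd = 5)

# Bare column names are resolved inside `data`, so no `toy$` prefix is needed.
fit_glm <- glm(Weight ~ NObeyesdad + MTRANS, data = toy)  # gaussian by default
fit_lm  <- lm(Weight ~ NObeyesdad + MTRANS, data = toy)

# Both fits produce the same coefficient estimates.
print(all.equal(coef(fit_glm), coef(fit_lm)))

# Coefficient names are shorter and easier to read in this form.
print(names(coef(fit_glm)))
```

Writing the formula this way also lets `predict()` work with a `newdata` argument, which is not the case when the formula refers to columns through `$`.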
Performing an ANOVA TEST
# One-way ANOVA of Weight across the obesity classes
Obesity_data_set$NObeyesdad <- as.factor(Obesity_data_set$NObeyesdad)
anova_obesity <- aov(Weight ~ NObeyesdad, data = Obesity_data_set)
summary(anova_obesity)
## Df Sum Sq Mean Sq F value Pr(>F)
## NObeyesdad 6 1228371 204729 1967 <2e-16 ***
## Residuals 2104 219041 104
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
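The F value is 1967 and the p-value is below 2e-16, which is highly significant. We therefore reject the null hypothesis that all obesity classes have the same mean weight: at least one group mean is different from the others.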
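The ANOVA only says that some group means differ, not which ones. As a rough sketch of one possible follow-up, the code below pulls the F value and p-value out of the summary table and runs Tukey's HSD for pairwise comparisons. It uses simulated data with three hypothetical groups, not the obesity data.

```r
set.seed(42)

# Simulated data: three hypothetical groups with different means.
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 30)),
  y = c(rnorm(30, mean = 50, sd = 10),
        rnorm(30, mean = 62, sd = 10),
        rnorm(30, mean = 93, sd = 10))
)

fit <- aov(y ~ group, data = toy)

# summary() of an aov fit is a list; its first element is the ANOVA table.
tab <- summary(fit)[[1]]
f_value <- tab[["F value"]][1]
p_value <- tab[["Pr(>F)"]][1]
print(c(F = f_value, p = p_value))

# Tukey's HSD: one row per pair of groups, with an adjusted p-value.
tukey <- TukeyHSD(fit)
print(tukey$group)
```

On the real data the equivalent call would be `TukeyHSD(anova_obesity)`, which compares every pair of the seven obesity classes.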
TASK 2: Diagnostic plots
# For linearity: Residuals vs Fitted plot
plot(residuals(obesity_model) ~ fitted(obesity_model),
     main = "Residuals vs Fitted plot",
     xlab = "Fitted values", ylab = "Residuals")
# For homoscedasticity: Residuals vs Fitted plot
plot(fitted(obesity_model), residuals(obesity_model),
     main = "Homoscedasticity Check",
     xlab = "Fitted values", ylab = "Residuals")
# For normality: Q-Q plot
qqnorm(residuals(obesity_model), main = "Normal Q-Q Plot")
# Residuals vs. Leverage plot
plot(obesity_model, which = 5, id.n = 3, main = "Residuals vs. Leverage",
     pch = 19, col = "red", cex = 1.5)
Normal Q-Q plot: it suggests positive skewness in the residuals, meaning that the residuals are skewed to the right and not perfectly normally distributed.
Residuals vs. Leverage plot: no points lie outside the dashed Cook's distance lines, indicating the absence of influential outliers.
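The visual readings above can be backed up with numbers. The following sketch, on simulated data with deliberately right-skewed errors (an assumption for illustration), adds a reference line to the Q-Q plot, computes the sample skewness of the residuals, and lists observations whose Cook's distance exceeds 4/n. The 4/n cutoff is a common rule of thumb, not a formal test.

```r
set.seed(7)

# Simulated data with right-skewed (exponential) errors; hypothetical stand-in.
x <- factor(rep(c("a", "b", "c"), each = 40))
y <- 50 + 10 * as.integer(x) + rexp(120, rate = 0.2)
fit <- lm(y ~ x)
res <- residuals(fit)

# Q-Q plot with a reference line, which makes departures easier to judge.
qqnorm(res, main = "Normal Q-Q Plot")
qqline(res, col = "red")

# Sample skewness of the residuals: values above 0 indicate a right skew.
skew <- mean((res - mean(res))^3) / sd(res)^3
print(skew)

# Cook's distance for each observation, flagged against the 4/n rule of thumb.
cooks <- cooks.distance(fit)
flagged <- which(cooks > 4 / length(cooks))
print(flagged)
```

For the model in this report, the same calls would take `obesity_model` in place of `fit`.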
TASK 3: Model summary
summary(obesity_model)
##
## Call:
## glm(formula = Obesity_data_set$Weight ~ Obesity_data_set$NObeyesdad +
## Obesity_data_set$MTRANS, data = Obesity_data_set)
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 50.5921 0.7732 65.430
## Obesity_data_set$NObeyesdadNormal_Weight 11.8290 0.8758 13.506
## Obesity_data_set$NObeyesdadObesity_Type_I 42.8954 0.8274 51.843
## Obesity_data_set$NObeyesdadObesity_Type_II 65.3220 0.8593 76.017
## Obesity_data_set$NObeyesdadObesity_Type_III 71.2817 0.8437 84.490
## Obesity_data_set$NObeyesdadOverweight_Level_I 24.2401 0.8610 28.152
## Obesity_data_set$NObeyesdadOverweight_Level_II 32.0390 0.8642 37.072
## Obesity_data_set$MTRANSBike 3.1053 3.8962 0.797
## Obesity_data_set$MTRANSMotorbike -0.7684 3.1242 -0.246
## Obesity_data_set$MTRANSPublic_Transportation -0.9357 0.5619 -1.665
## Obesity_data_set$MTRANSWalking 3.2176 1.4821 2.171
## Pr(>|t|)
## (Intercept) <2e-16 ***
## Obesity_data_set$NObeyesdadNormal_Weight <2e-16 ***
## Obesity_data_set$NObeyesdadObesity_Type_I <2e-16 ***
## Obesity_data_set$NObeyesdadObesity_Type_II <2e-16 ***
## Obesity_data_set$NObeyesdadObesity_Type_III <2e-16 ***
## Obesity_data_set$NObeyesdadOverweight_Level_I <2e-16 ***
## Obesity_data_set$NObeyesdadOverweight_Level_II <2e-16 ***
## Obesity_data_set$MTRANSBike 0.426
## Obesity_data_set$MTRANSMotorbike 0.806
## Obesity_data_set$MTRANSPublic_Transportation 0.096 .
## Obesity_data_set$MTRANSWalking 0.030 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 103.7417)
##
## Null deviance: 1447412 on 2110 degrees of freedom
## Residual deviance: 217857 on 2100 degrees of freedom
## AIC: 15803
##
## Number of Fisher Scoring iterations: 2
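The deviance figures in this output can be turned into an overall measure of fit. For a gaussian model with an identity link, one minus the ratio of residual to null deviance equals the usual R-squared. A short sketch using only the numbers printed above (the variable names are illustrative):

```r
# Values copied from the summary output above
null_dev  <- 1447412   # null deviance, on 2110 degrees of freedom
resid_dev <- 217857    # residual deviance, on 2100 degrees of freedom
resid_df  <- 2100

# Proportion of the variation in Weight explained by the model (R-squared)
prop_explained <- 1 - resid_dev / null_dev
print(round(prop_explained, 3))

# Dispersion estimate (residual variance) and residual standard deviation
dispersion <- resid_dev / resid_df
resid_sd   <- sqrt(dispersion)
print(round(c(dispersion = dispersion, resid_sd = resid_sd), 2))
```

The model explains roughly 85% of the variation in weight, and the recomputed dispersion agrees with the reported value of about 103.74.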
The coefficient for NObeyesdadNormal_Weight is 11.8290, so individuals classified as ‘Normal Weight’ weigh on average 11.8290 units more than those in the reference category (Insufficient Weight, the level that does not appear in the coefficient table), holding mode of transportation constant.
The coefficient for NObeyesdadObesity_Type_I is 42.8954, so individuals classified as ‘Obesity Type I’ weigh on average 42.8954 units more than those in the reference category, holding mode of transportation constant.
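Because R's default treatment contrasts measure each level against the first level, the intercept is the fitted mean weight for the baseline combination, and adding a coefficient gives the fitted mean for that class. The baseline is taken here to be Insufficient Weight and Automobile, since those are the levels missing from the coefficient table. A worked sketch using the estimates from the summary:

```r
# Estimates copied from the summary output; the baseline class has effect 0.
intercept <- 50.5921
class_effects <- c(
  Insufficient_Weight = 0,
  Normal_Weight       = 11.8290,
  Overweight_Level_I  = 24.2401,
  Overweight_Level_II = 32.0390,
  Obesity_Type_I      = 42.8954,
  Obesity_Type_II     = 65.3220,
  Obesity_Type_III    = 71.2817
)

# Fitted mean weight for each class at the baseline mode of transportation
fitted_means <- intercept + class_effects
print(round(fitted_means, 2))
```

For example, the fitted mean for Normal Weight is 50.5921 + 11.8290 = 62.4211, and for Obesity Type I it is 50.5921 + 42.8954 = 93.4875.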
The coefficient for MTRANSWalking is 3.2176, so individuals who use ‘Walking’ as their mode of transportation weigh on average 3.2176 units more than those using the reference mode (Automobile), within the same obesity class. This difference is significant at the 5% level (p = 0.030).
The coefficient for MTRANSBike is 3.1053, so individuals who use ‘Bike’ as their mode of transportation weigh on average 3.1053 units more than those using the reference mode. With p = 0.426, however, this difference is not statistically significant.
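The size of a coefficient should be read together with its uncertainty. As a rough sketch, approximate 95% confidence intervals can be built from the printed estimates and standard errors as estimate ± t × SE, using the 2100 residual degrees of freedom. The variable names below are illustrative.

```r
# Estimates and standard errors copied from the summary output
est <- c(Walking = 3.2176, Bike = 3.1053)
se  <- c(Walking = 1.4821, Bike = 3.8962)

# Critical t value for a two-sided 95% interval on 2100 residual df
crit <- qt(0.975, df = 2100)

lower <- est - crit * se
upper <- est + crit * se
ci <- round(cbind(estimate = est, lower = lower, upper = upper), 3)
print(ci)
```

An interval that excludes zero matches a p-value below 0.05. That is the case for Walking but not for Bike, whose large standard error leaves the direction of the effect uncertain.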