library(tidyverse)
## Warning: package 'readr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(ggrepel)
library(effsize)
library(pwrss)
##
## Attaching package: 'pwrss'
##
## The following object is masked from 'package:stats':
##
## power.t.test
library(boot)
library(broom)
library(lindia)
library(vcd)
## Warning: package 'vcd' was built under R version 4.3.2
## Loading required package: grid
I believe that a few factors in our day-to-day lives affect our obesity levels without our noticing. By monitoring and controlling these factors, we can keep our weight under control.
The data set has the following attributes:
Gender: Female or Male.
Age
Height
Weight
family_history_with_overweight: Categorical variable indicating whether there is a family history of overweight (e.g., yes, no).
FAVC: Categorical variable indicating frequent consumption of high-caloric food (e.g., yes, no).
FCVC: Numerical variable indicating the frequency of consumption of vegetables.
NCP: Numerical variable representing the number of main meals.
CAEC: Categorical variable indicating the consumption of food between meals (e.g., Sometimes, Frequently).
SMOKE: Categorical variable indicating if the individual smokes (e.g., yes, no).
CH2O: Numerical variable representing daily water consumption.
SCC: Categorical variable indicating if the individual monitors calories (e.g., yes, no).
FAF: Numerical variable representing physical activity frequency.
TUE: Numerical variable representing time using technological devices.
CALC: Categorical variable indicating alcohol consumption (e.g., Sometimes, Frequently).
MTRANS: Categorical variable indicating mode of transportation (e.g., Public Transportation, Walking).
NObeyesdad: Categorical variable indicating obesity level (e.g., Normal Weight, Overweight_Level_I, Overweight_Level_II).
In this analysis I will focus mainly on these attributes:
family_history_with_overweight, FAVC, FAF, MTRANS
obesity_data <- read.csv("C:\\Users\\admin\\Desktop\\Ankitaa\\Assignments\\Intro to R\\data\\Obesity_data_set.csv")
head(obesity_data)
## Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female 21 1.62 64.0 yes no 2 3
## 2 Female 21 1.52 56.0 yes no 3 3
## 3 Male 23 1.80 77.0 yes no 2 3
## 4 Male 27 1.80 87.0 no no 3 3
## 5 Male 22 1.78 89.8 no no 2 1
## 6 Male 29 1.62 53.0 no yes 2 3
## CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS
## 1 Sometimes no 2 no 0 1 no Public_Transportation
## 2 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation
## 3 Sometimes no 2 no 2 1 Frequently Public_Transportation
## 4 Sometimes no 2 no 2 0 Frequently Walking
## 5 Sometimes no 2 no 0 0 Sometimes Public_Transportation
## 6 Sometimes no 2 no 0 0 Sometimes Automobile
## NObeyesdad
## 1 Normal_Weight
## 2 Normal_Weight
## 3 Normal_Weight
## 4 Overweight_Level_I
## 5 Overweight_Level_II
## 6 Normal_Weight
obesity_data <- obesity_data %>%
  mutate(BMI = Weight / (Height^2))
head(obesity_data)
## Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female 21 1.62 64.0 yes no 2 3
## 2 Female 21 1.52 56.0 yes no 3 3
## 3 Male 23 1.80 77.0 yes no 2 3
## 4 Male 27 1.80 87.0 no no 3 3
## 5 Male 22 1.78 89.8 no no 2 1
## 6 Male 29 1.62 53.0 no yes 2 3
## CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS
## 1 Sometimes no 2 no 0 1 no Public_Transportation
## 2 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation
## 3 Sometimes no 2 no 2 1 Frequently Public_Transportation
## 4 Sometimes no 2 no 2 0 Frequently Walking
## 5 Sometimes no 2 no 0 0 Sometimes Public_Transportation
## 6 Sometimes no 2 no 0 0 Sometimes Automobile
## NObeyesdad BMI
## 1 Normal_Weight 24.38653
## 2 Normal_Weight 24.23823
## 3 Normal_Weight 23.76543
## 4 Overweight_Level_I 26.85185
## 5 Overweight_Level_II 28.34238
## 6 Normal_Weight 20.19509
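The new BMI column is weight divided by the square of height. As a quick sanity check, the sketch below recomputes the first three rows by hand from the values printed above. The small `check` data frame is a hypothetical stand-in for `obesity_data`, and the units are assumed to be kilograms and metres, which the source does not state explicitly.

```r
# Hypothetical stand-in for the first three rows of obesity_data
# (values copied from the head() output; units assumed to be kg and m).
check <- data.frame(
  Height = c(1.62, 1.52, 1.80),
  Weight = c(64, 56, 77)
)

# BMI = weight / height^2
check$BMI <- check$Weight / check$Height^2

# These should agree with the BMI column printed for rows 1 to 3.
print(round(check$BMI, 5))
```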
#Dimensions of our data
dim(obesity_data)
## [1] 2111 18
str(obesity_data)
## 'data.frame': 2111 obs. of 18 variables:
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Age : num 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
## $ family_history_with_overweight: chr "yes" "yes" "yes" "no" ...
## $ FAVC : chr "no" "no" "no" "no" ...
## $ FCVC : num 2 3 2 3 2 2 3 2 3 2 ...
## $ NCP : num 3 3 3 3 1 3 3 3 3 3 ...
## $ CAEC : chr "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
## $ SMOKE : chr "no" "yes" "no" "no" ...
## $ CH2O : num 2 3 2 2 2 2 2 2 2 2 ...
## $ SCC : chr "no" "yes" "no" "no" ...
## $ FAF : num 0 3 2 2 0 0 1 3 1 1 ...
## $ TUE : num 1 0 1 0 0 0 0 0 1 1 ...
## $ CALC : chr "no" "Sometimes" "Frequently" "Frequently" ...
## $ MTRANS : chr "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
## $ NObeyesdad : chr "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
## $ BMI : num 24.4 24.2 23.8 26.9 28.3 ...
summary(obesity_data)
## Gender Age Height Weight
## Length:2111 Min. :14.00 Min. :1.450 Min. : 39.00
## Class :character 1st Qu.:19.95 1st Qu.:1.630 1st Qu.: 65.47
## Mode :character Median :22.78 Median :1.700 Median : 83.00
## Mean :24.31 Mean :1.702 Mean : 86.59
## 3rd Qu.:26.00 3rd Qu.:1.768 3rd Qu.:107.43
## Max. :61.00 Max. :1.980 Max. :173.00
## family_history_with_overweight FAVC FCVC
## Length:2111 Length:2111 Min. :1.000
## Class :character Class :character 1st Qu.:2.000
## Mode :character Mode :character Median :2.386
## Mean :2.419
## 3rd Qu.:3.000
## Max. :3.000
## NCP CAEC SMOKE CH2O
## Min. :1.000 Length:2111 Length:2111 Min. :1.000
## 1st Qu.:2.659 Class :character Class :character 1st Qu.:1.585
## Median :3.000 Mode :character Mode :character Median :2.000
## Mean :2.686 Mean :2.008
## 3rd Qu.:3.000 3rd Qu.:2.477
## Max. :4.000 Max. :3.000
## SCC FAF TUE CALC
## Length:2111 Min. :0.0000 Min. :0.0000 Length:2111
## Class :character 1st Qu.:0.1245 1st Qu.:0.0000 Class :character
## Mode :character Median :1.0000 Median :0.6253 Mode :character
## Mean :1.0103 Mean :0.6579
## 3rd Qu.:1.6667 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.0000
## MTRANS NObeyesdad BMI
## Length:2111 Length:2111 Min. :13.00
## Class :character Class :character 1st Qu.:24.33
## Mode :character Mode :character Median :28.72
## Mean :29.70
## 3rd Qu.:36.02
## Max. :50.81
ggplot(obesity_data, aes(x = Weight)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Weight")
# Keep only the numeric columns
num_vars <- select(obesity_data, where(is.numeric))

# Central tendency
mean_values   <- sapply(num_vars, mean, na.rm = TRUE)
median_values <- sapply(num_vars, median, na.rm = TRUE)

# Mode: the most frequent value (the first one encountered if there are ties)
calculate_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
mode_values <- sapply(num_vars, calculate_mode)

# Spread
range_values    <- sapply(num_vars, function(x) diff(range(x, na.rm = TRUE)))
variance_values <- sapply(num_vars, var, na.rm = TRUE)
std_dev_values  <- sapply(num_vars, sd, na.rm = TRUE)

# Collect everything in one table, one row per variable
des_stat <- data.frame(
  Mean     = mean_values,
  Median   = median_values,
  Mode     = mode_values,
  Range    = range_values,
  Variance = variance_values,
  Std_Dev  = std_dev_values
)
print(des_stat)
## Mean Median Mode Range Variance Std_Dev
## Age 24.3125999 22.777890 18.00000 47.00000 4.027131e+01 6.34596827
## Height 1.7016774 1.700499 1.70000 0.53000 8.705789e-03 0.09330482
## Weight 86.5860581 83.000000 80.00000 134.00000 6.859775e+02 26.19117175
## FCVC 2.4190431 2.385502 3.00000 2.00000 2.850776e-01 0.53392658
## NCP 2.6856280 3.000000 3.00000 3.00000 6.053441e-01 0.77803865
## CH2O 2.0080114 2.000000 2.00000 2.00000 3.757119e-01 0.61295345
## FAF 1.0102977 1.000000 0.00000 3.00000 7.235075e-01 0.85059243
## TUE 0.6578659 0.625350 0.00000 2.00000 3.707924e-01 0.60892726
## BMI 29.7001588 28.719089 26.67276 37.81307 6.418151e+01 8.01133661
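The Mode column depends on how `calculate_mode` breaks ties, so it may help to see the function on two tiny vectors. This is a minimal sketch with made-up inputs (`tied` and `continuous` are hypothetical); it suggests that for near-continuous variables such as BMI, where few values repeat, the reported mode carries little information.

```r
calculate_mode <- function(x) {
  ux <- unique(x)                        # distinct values, in order of first appearance
  ux[which.max(tabulate(match(x, ux)))]  # value with the highest count
}

# Ties: 3 and 1 both occur twice; 3 appears first, so it is returned.
tied <- c(3, 1, 1, 3, 2)
print(calculate_mode(tied))

# No repeats: every value occurs once, so the "mode" is simply the first value.
continuous <- c(26.67, 31.02, 24.39, 28.34)
print(calculate_mode(continuous))
```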
ggplot(obesity_data, aes(x = family_history_with_overweight, y = Weight, color = family_history_with_overweight)) +
  geom_jitter(alpha = 0.8, width = 0.4) +
  labs(x = "Family History with Overweight", y = "Weight") +
  ggtitle("Trend of Weight based on Family History") +
  scale_color_discrete(name = "Family History", labels = c("No", "Yes")) +
  theme_minimal()
Weight_and_transport <- obesity_data %>%
  group_by(MTRANS) %>%
  summarise(mean_weight = mean(Weight, na.rm = TRUE))
# linewidth replaces the size aesthetic for lines (deprecated in ggplot2 3.4.0)
ggplot(Weight_and_transport, aes(x = MTRANS, y = mean_weight, group = 1)) +
  geom_line(color = "blue", linewidth = 1) +
  geom_point(color = "blue", size = 3) +
  labs(x = "Mode of Transportation", y = "Mean Weight") +
  ggtitle("Mean Weight by Mode of Transportation") +
  theme_minimal()
# FAVC is a yes/no variable, so the x axis shows two groups rather than a frequency scale
ggplot(obesity_data, aes(x = FAVC, y = Weight)) +
  geom_point(color = "green", alpha = 0.7) +
  labs(x = "Frequent High-Calorie Food Consumption (FAVC: yes/no)", y = "Weight") +
  ggtitle("Relationship Between Weight and High-Calorie Food Consumption") +
  theme_minimal()
# Hypothesis Tests #
ANOVA test for MTRANS and Weight
Null Hypothesis (H0): There is no difference in mean weights among individuals using different modes of transportation.
Alternative Hypothesis (H1): There is a difference in mean weights among individuals using different modes of transportation.
anova_result <- aov(Weight ~ MTRANS, data = obesity_data)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## MTRANS 4 18495 4624 6.815 1.9e-05 ***
## Residuals 2106 1428917 678
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-value is 6.815 and the p-value is 1.9e-05, far below the 0.05 significance level. This is strong evidence against the null hypothesis, so we reject it.
Conclusion: Mean weight differs for at least one mode of transportation. The ANOVA alone does not show which modes differ.
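A significant ANOVA only says that some group mean differs. One common follow-up, which is not part of the original analysis, is Tukey's HSD from base R's `stats` package; it compares every pair of groups. The sketch below runs on simulated data with made-up group means, so `sim` and its numbers are assumptions for illustration, not results from the obesity data.

```r
set.seed(42)

# Simulated stand-in for obesity_data: three hypothetical transport groups.
sim <- data.frame(
  MTRANS = factor(rep(c("Automobile", "Public_Transportation", "Walking"), each = 50)),
  Weight = c(rnorm(50, mean = 86, sd = 10),
             rnorm(50, mean = 87, sd = 10),
             rnorm(50, mean = 70, sd = 10))
)

fit <- aov(Weight ~ MTRANS, data = sim)

# Pairwise differences in mean weight with family-wise 95% intervals
# and adjusted p-values, one row per pair of groups.
posthoc <- TukeyHSD(fit)
print(posthoc)
```

With the real data, the same call on `anova_result` would list all ten pairs of the five transport modes.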
Null Hypothesis (H0): There is no difference in mean weights between individuals who consume high caloric food frequently (FAVC=yes) and those who don’t (FAVC=no).
Alternative Hypothesis (H1): There is a difference in mean weights between individuals with and without high caloric food consumption frequency.
favc_yes <- obesity_data[obesity_data$FAVC == "yes", ]$Weight
favc_no <- obesity_data[obesity_data$FAVC == "no", ]$Weight
t_test <- t.test(favc_yes, favc_no)
print(t_test)
##
## Welch Two Sample t-test
##
## data: favc_yes and favc_no
## t = 17.834, df = 410.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 19.80748 24.71505
## sample estimates:
## mean of x mean of y
## 89.16967 66.90841
The p-value is below 2.2e-16, the t-value is large (17.834), and the 95% confidence interval for the difference in means runs from 19.80748 to 24.71505, which excludes zero.
This is strong evidence against the null hypothesis, so we reject it. Conclusion: Individuals who frequently consume high-caloric food weigh more on average (89.17 vs. 66.91) than those who do not.
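`t.test()` runs Welch's version of the test by default, which is why the degrees of freedom in the output are fractional. The sketch below recomputes the statistic and the degrees of freedom by hand and compares them with `t.test()`. It uses simulated vectors (`yes_grp` and `no_grp` are hypothetical stand-ins for `favc_yes` and `favc_no`), so the numbers it prints will not match the output above.

```r
set.seed(1)

# Simulated stand-ins for favc_yes and favc_no (hypothetical means and sizes).
yes_grp <- rnorm(200, mean = 89, sd = 26)
no_grp  <- rnorm(60,  mean = 67, sd = 18)

# Welch statistic: difference in means over the unpooled standard error.
se2_yes  <- var(yes_grp) / length(yes_grp)
se2_no   <- var(no_grp)  / length(no_grp)
t_manual <- (mean(yes_grp) - mean(no_grp)) / sqrt(se2_yes + se2_no)

# Welch-Satterthwaite degrees of freedom (usually not a whole number).
df_manual <- (se2_yes + se2_no)^2 /
  (se2_yes^2 / (length(yes_grp) - 1) + se2_no^2 / (length(no_grp) - 1))

welch <- t.test(yes_grp, no_grp)  # var.equal = FALSE is the default
print(c(t_manual = t_manual, t_builtin = unname(welch$statistic)))
print(c(df_manual = df_manual, df_builtin = unname(welch$parameter)))
```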
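Null Hypothesis (H0): There is no difference in mean weights between individuals with and without a family history of overweight.
Alternative Hypothesis (H1): There is a difference in mean weights between individuals with and without a family history of overweight.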
history <- obesity_data[obesity_data$family_history_with_overweight == "yes", ]$Weight
no_history <- obesity_data[obesity_data$family_history_with_overweight == "no", ]$Weight
test_result <- t.test(history, no_history)
print(test_result)
##
## Welch Two Sample t-test
##
## data: history and no_history
## t = 36.273, df = 956.71, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 31.86643 35.51170
## sample estimates:
## mean of x mean of y
## 92.73020 59.04114
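t-value: 36.273
p-value: below 2.2e-16
Confidence interval: the 95% confidence interval for the difference in means is 31.86643 to 35.51170, which excludes zero.
Therefore, we reject the null hypothesis.
Conclusion: Individuals with a family history of overweight weigh significantly more on average (92.73 vs. 59.04). This is an association; the test alone does not establish that family history causes the difference.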
Linear regression model for family history and weight
model <- lm(Weight ~ family_history_with_overweight, data = obesity_data)
summary(model)
##
## Call:
## lm(formula = Weight ~ family_history_with_overweight, data = obesity_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.730 -14.988 -2.791 17.270 80.270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.041 1.159 50.95 <2e-16 ***
## family_history_with_overweightyes 33.689 1.281 26.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.74 on 2109 degrees of freedom
## Multiple R-squared: 0.2468, Adjusted R-squared: 0.2465
## F-statistic: 691.2 on 1 and 2109 DF, p-value: < 2.2e-16
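The p-value is below 2.2e-16.
The coefficient indicates that individuals with a family history of overweight weigh, on average, about 33.689 units more than those without one. The intercept (59.041) is the mean weight of the group without a family history. The R-squared of 0.2468 means that family history alone accounts for about 25% of the variation in weight.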
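With a single yes/no predictor, the regression restates the two-group comparison: the intercept is the mean of the reference group and the slope is the difference between the group means. That is why 59.041 and 33.689 match the t-test's sample estimates (59.04114, and 92.73020 minus 59.04114). The sketch below checks this identity on simulated data; `sim` and its group means are assumptions for illustration only.

```r
set.seed(7)

# Simulated stand-in for obesity_data (hypothetical group sizes and means).
sim <- data.frame(
  family_history = rep(c("no", "yes"), times = c(80, 320)),
  Weight = c(rnorm(80, mean = 59, sd = 15), rnorm(320, mean = 93, sd = 24))
)

fit <- lm(Weight ~ family_history, data = sim)
group_means <- tapply(sim$Weight, sim$family_history, mean)

# Intercept = mean of the reference group ("no");
# slope     = mean("yes") - mean("no").
print(coef(fit))
print(c(mean_no = unname(group_means["no"]),
        diff    = unname(group_means["yes"] - group_means["no"])))
```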
model <- lm(Weight ~ MTRANS, data = obesity_data)
summary(model)
##
## Call:
## lm(formula = Weight ~ MTRANS, data = obesity_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.487 -20.170 -3.146 20.669 85.513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.908 1.218 70.504 < 2e-16 ***
## MTRANSBike -9.193 9.920 -0.927 0.354
## MTRANSMotorbike -12.817 7.948 -1.613 0.107
## MTRANSPublic_Transportation 1.579 1.384 1.141 0.254
## MTRANSWalking -15.312 3.688 -4.152 3.43e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.05 on 2106 degrees of freedom
## Multiple R-squared: 0.01278, Adjusted R-squared: 0.0109
## F-statistic: 6.815 on 4 and 2106 DF, p-value: 1.9e-05
R-squared: 0.01278
F-statistic: 6.815
p-value: 1.9e-05
The R-squared value of 0.01278 means that mode of transportation alone explains only about 1.28% of the variation in weight. Although the overall F-test is significant, the effect is small compared with other factors that influence weight. Each coefficient compares one mode with the reference level (Automobile, the intercept); only Walking differs significantly from it, at about 15.3 units lighter on average.
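The F-statistic and p-value here are the same as in the earlier ANOVA because both fit the same model. The R-squared can also be recovered from the ANOVA table as the between-group sum of squares divided by the total. The sketch below does this arithmetic with the rounded values printed by `summary(anova_result)`, so it should agree with the regression output only up to rounding.

```r
# Sums of squares and degrees of freedom copied from summary(anova_result).
ss_between <- 18495
ss_within  <- 1428917
df_between <- 4
df_within  <- 2106

# R-squared: share of the total variation in Weight explained by MTRANS.
r_squared <- ss_between / (ss_between + ss_within)

# F statistic: between-group mean square over within-group mean square.
f_stat <- (ss_between / df_between) / (ss_within / df_within)

print(round(c(r_squared = r_squared, f_stat = f_stat), 4))
```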
model <- lm(Weight ~ FAVC, data = obesity_data)
summary(model)
##
## Call:
## lm(formula = Weight ~ FAVC, data = obesity_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.170 -19.170 -2.212 19.092 83.830
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.908 1.610 41.55 <2e-16 ***
## FAVCyes 22.261 1.713 13.00 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.21 on 2109 degrees of freedom
## Multiple R-squared: 0.07415, Adjusted R-squared: 0.07371
## F-statistic: 168.9 on 1 and 2109 DF, p-value: < 2.2e-16
The p-value is below 2.2e-16 (p < 0.001).
The coefficient estimate indicates that individuals who frequently consume high-caloric food weigh about 22.26 units more on average than those who do not. FAVC is the only predictor in this model, so the estimate is not adjusted for any other factor. The R-squared of 0.07415 shows that FAVC alone explains about 7.4% of the variation in weight.
In conclusion, frequent consumption of high-caloric food, like a family history of overweight, is associated with higher individual weights.
The project aimed to investigate the relationship between several everyday factors and weight. Knowing which factors are associated with higher weight can help individuals watch them and reduce their risk of obesity.
In conclusion, the study found statistically significant associations between weight and three factors: family history, high-caloric food consumption, and mode of transportation. Family history showed the strongest relationship (R-squared 0.2468) and mode of transportation the weakest (R-squared 0.01278).
Healthcare: Healthcare providers can use these insights with patients who struggle with weight management. Understanding how factors like family history, high-caloric food consumption, and transportation relate to weight can inform better counseling, treatment plans, and lifestyle modifications.
Public Health Programs: Public health initiatives that help individuals struggling with obesity can benefit from these findings.
Research and Academia: The analysis can serve as a foundation for further academic research in the field of obesity, weight management, and associated factors.
Fitness and Nutrition Industries: Insights into the factors affecting weight can also be useful in the fitness and nutrition sectors. Companies developing fitness apps, dietary plans, or products focused on weight management can leverage this analysis to design more targeted and evidence-based solutions.