library(tidyverse)

## Warning: package 'readr' was built under R version 4.3.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(dplyr)
library(ggrepel)
library(effsize)
library(pwrss)

## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test

library(boot)
library(broom)
library(lindia)
library(vcd)

## Warning: package 'vcd' was built under R version 4.3.2

## Loading required package: grid

The problem and goals

I believe that a few everyday factors affect our obesity levels without our noticing. By monitoring and controlling these factors, we can keep our weight in check.

The data set has the following attributes:

Gender (Female, Male)
Age
Height
Weight
family_history_with_overweight: Categorical variable indicating whether there is a family history of overweight (e.g., yes, no).
FAVC: Categorical variable indicating frequent consumption of high-caloric food (e.g., yes, no).
FCVC: Numerical variable indicating the frequency of consumption of vegetables.
NCP: Numerical variable representing the number of main meals.
CAEC: Categorical variable indicating the consumption of food between meals (e.g., Sometimes, Frequently).
SMOKE: Categorical variable indicating whether the individual smokes (e.g., yes, no).
CH2O: Numerical variable representing daily water consumption.
SCC: Categorical variable indicating whether the individual monitors calories (e.g., yes, no).
FAF: Numerical variable representing physical activity frequency.
TUE: Numerical variable representing time spent using technological devices.
CALC: Categorical variable indicating alcohol consumption (e.g., Sometimes, Frequently).
MTRANS: Categorical variable indicating mode of transportation (e.g., Public Transportation, Walking).
NObeyesdad: Categorical variable indicating obesity level (e.g., Normal Weight, Overweight_Level_I, Overweight_Level_II).

In this analysis I focus mainly on these attributes:

family_history_with_overweight
FAVC
FAF
MTRANS

Loading the data

obesity_data <-read.csv("C:\\Users\\admin\\Desktop\\Ankitaa\\Assignments\\Intro to R\\data\\Obesity_data_set.csv")

head(obesity_data)

##   Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female  21   1.62   64.0                            yes   no    2   3
## 2 Female  21   1.52   56.0                            yes   no    3   3
## 3   Male  23   1.80   77.0                            yes   no    2   3
## 4   Male  27   1.80   87.0                             no   no    3   3
## 5   Male  22   1.78   89.8                             no   no    2   1
## 6   Male  29   1.62   53.0                             no  yes    2   3
##        CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
## 1 Sometimes    no    2  no   0   1         no Public_Transportation
## 2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
## 3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
## 4 Sometimes    no    2  no   2   0 Frequently               Walking
## 5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
## 6 Sometimes    no    2  no   0   0  Sometimes            Automobile
##            NObeyesdad
## 1       Normal_Weight
## 2       Normal_Weight
## 3       Normal_Weight
## 4  Overweight_Level_I
## 5 Overweight_Level_II
## 6       Normal_Weight

# Add body mass index: weight divided by height squared
obesity_data <- obesity_data %>%
  mutate(BMI = Weight / (Height^2))

head(obesity_data)

##   Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female  21   1.62   64.0                            yes   no    2   3
## 2 Female  21   1.52   56.0                            yes   no    3   3
## 3   Male  23   1.80   77.0                            yes   no    2   3
## 4   Male  27   1.80   87.0                             no   no    3   3
## 5   Male  22   1.78   89.8                             no   no    2   1
## 6   Male  29   1.62   53.0                             no  yes    2   3
##        CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
## 1 Sometimes    no    2  no   0   1         no Public_Transportation
## 2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
## 3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
## 4 Sometimes    no    2  no   2   0 Frequently               Walking
## 5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
## 6 Sometimes    no    2  no   0   0  Sometimes            Automobile
##            NObeyesdad      BMI
## 1       Normal_Weight 24.38653
## 2       Normal_Weight 24.23823
## 3       Normal_Weight 23.76543
## 4  Overweight_Level_I 26.85185
## 5 Overweight_Level_II 28.34238
## 6       Normal_Weight 20.19509
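As a quick check on the new BMI column, the first row can be recomputed by hand. This is only a sketch: it assumes Weight is recorded in kilograms and Height in metres (the usual BMI convention, and consistent with the values shown), and the variable names are illustrative.

```r
# First row of the data: Weight = 64.0, Height = 1.62
weight_kg <- 64.0   # assumed unit: kilograms
height_m  <- 1.62   # assumed unit: metres

# BMI = weight / height^2, the same formula used in mutate()
bmi_row1 <- weight_kg / height_m^2
print(round(bmi_row1, 5))
```

The result should agree with the BMI printed for row 1 of the table.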

#Dimensions of our data
dim(obesity_data)

## [1] 2111   18

str(obesity_data)

## 'data.frame':    2111 obs. of  18 variables:
##  $ Gender                        : chr  "Female" "Female" "Male" "Male" ...
##  $ Age                           : num  21 21 23 27 22 29 23 22 24 22 ...
##  $ Height                        : num  1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
##  $ Weight                        : num  64 56 77 87 89.8 53 55 53 64 68 ...
##  $ family_history_with_overweight: chr  "yes" "yes" "yes" "no" ...
##  $ FAVC                          : chr  "no" "no" "no" "no" ...
##  $ FCVC                          : num  2 3 2 3 2 2 3 2 3 2 ...
##  $ NCP                           : num  3 3 3 3 1 3 3 3 3 3 ...
##  $ CAEC                          : chr  "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
##  $ SMOKE                         : chr  "no" "yes" "no" "no" ...
##  $ CH2O                          : num  2 3 2 2 2 2 2 2 2 2 ...
##  $ SCC                           : chr  "no" "yes" "no" "no" ...
##  $ FAF                           : num  0 3 2 2 0 0 1 3 1 1 ...
##  $ TUE                           : num  1 0 1 0 0 0 0 0 1 1 ...
##  $ CALC                          : chr  "no" "Sometimes" "Frequently" "Frequently" ...
##  $ MTRANS                        : chr  "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
##  $ NObeyesdad                    : chr  "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
##  $ BMI                           : num  24.4 24.2 23.8 26.9 28.3 ...

To start, here is a basic summary of the data.

summary(obesity_data)

##     Gender               Age            Height          Weight      
##  Length:2111        Min.   :14.00   Min.   :1.450   Min.   : 39.00  
##  Class :character   1st Qu.:19.95   1st Qu.:1.630   1st Qu.: 65.47  
##  Mode  :character   Median :22.78   Median :1.700   Median : 83.00  
##                     Mean   :24.31   Mean   :1.702   Mean   : 86.59  
##                     3rd Qu.:26.00   3rd Qu.:1.768   3rd Qu.:107.43  
##                     Max.   :61.00   Max.   :1.980   Max.   :173.00  
##  family_history_with_overweight     FAVC                FCVC      
##  Length:2111                    Length:2111        Min.   :1.000  
##  Class :character               Class :character   1st Qu.:2.000  
##  Mode  :character               Mode  :character   Median :2.386  
##                                                    Mean   :2.419  
##                                                    3rd Qu.:3.000  
##                                                    Max.   :3.000  
##       NCP            CAEC              SMOKE                CH2O      
##  Min.   :1.000   Length:2111        Length:2111        Min.   :1.000  
##  1st Qu.:2.659   Class :character   Class :character   1st Qu.:1.585  
##  Median :3.000   Mode  :character   Mode  :character   Median :2.000  
##  Mean   :2.686                                         Mean   :2.008  
##  3rd Qu.:3.000                                         3rd Qu.:2.477  
##  Max.   :4.000                                         Max.   :3.000  
##      SCC                 FAF              TUE             CALC          
##  Length:2111        Min.   :0.0000   Min.   :0.0000   Length:2111       
##  Class :character   1st Qu.:0.1245   1st Qu.:0.0000   Class :character  
##  Mode  :character   Median :1.0000   Median :0.6253   Mode  :character  
##                     Mean   :1.0103   Mean   :0.6579                     
##                     3rd Qu.:1.6667   3rd Qu.:1.0000                     
##                     Max.   :3.0000   Max.   :2.0000                     
##     MTRANS           NObeyesdad             BMI       
##  Length:2111        Length:2111        Min.   :13.00  
##  Class :character   Class :character   1st Qu.:24.33  
##  Mode  :character   Mode  :character   Median :28.72  
##                                        Mean   :29.70  
##                                        3rd Qu.:36.02  
##                                        Max.   :50.81

Distribution of weight

ggplot(obesity_data, aes(x=Weight)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Weight")

EDA

# Keep only the numeric columns
num_vars <- select_if(obesity_data, is.numeric)

mean_values <- sapply(num_vars, mean, na.rm = TRUE)
median_values <- sapply(num_vars, median, na.rm = TRUE)

# Mode: the most frequent value (the first one encountered if there is a tie)
calculate_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
mode_values <- sapply(num_vars, calculate_mode)

# Range is reported as max - min
range_values <- sapply(num_vars, function(x) diff(range(x, na.rm = TRUE)))
variance_values <- sapply(num_vars, var, na.rm = TRUE)
std_dev_values <- sapply(num_vars, sd, na.rm = TRUE)

des_stat <- data.frame(
  Mean = mean_values,
  Median = median_values,
  Mode = mode_values,
  Range = range_values,
  Variance = variance_values,
  Std_Dev = std_dev_values
)

print(des_stat)

##              Mean    Median     Mode     Range     Variance     Std_Dev
## Age    24.3125999 22.777890 18.00000  47.00000 4.027131e+01  6.34596827
## Height  1.7016774  1.700499  1.70000   0.53000 8.705789e-03  0.09330482
## Weight 86.5860581 83.000000 80.00000 134.00000 6.859775e+02 26.19117175
## FCVC    2.4190431  2.385502  3.00000   2.00000 2.850776e-01  0.53392658
## NCP     2.6856280  3.000000  3.00000   3.00000 6.053441e-01  0.77803865
## CH2O    2.0080114  2.000000  2.00000   2.00000 3.757119e-01  0.61295345
## FAF     1.0102977  1.000000  0.00000   3.00000 7.235075e-01  0.85059243
## TUE     0.6578659  0.625350  0.00000   2.00000 3.707924e-01  0.60892726
## BMI    29.7001588 28.719089 26.67276  37.81307 6.418151e+01  8.01133661
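The Mode column comes from the small helper defined above. The sketch below repeats that helper on a toy vector (the vector is made up for illustration) to show how it behaves when two values are equally frequent: it returns whichever of the tied values appears first.

```r
# Same helper as in the analysis, repeated so the example is self-contained
calculate_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Toy data: 3 and 1 both occur twice
toy <- c(3, 1, 3, 2, 1)
toy_mode <- calculate_mode(toy)
print(toy_mode)  # 3, because 3 is the first of the tied values
```

For near-continuous columns such as BMI, where most values occur only once or twice, the mode is therefore sensitive to row order and is best read with caution.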

ggplot(obesity_data, aes(x = family_history_with_overweight, y = Weight, color = family_history_with_overweight)) +
  geom_jitter(alpha = 0.8, width = 0.4) +
  labs(x = "Family History with Overweight", y = "Weight") +
  ggtitle("Trend of Weight based on Family History") +
  scale_color_discrete(name = "Family History", labels = c("No", "Yes")) +
  theme_minimal()

Weight_and_transport <- obesity_data %>%
  group_by(MTRANS) %>%
  summarise(mean_weight = mean(Weight, na.rm = TRUE))

ggplot(Weight_and_transport, aes(x = MTRANS, y = mean_weight, group = 1)) +
  # linewidth replaces size for lines (size is deprecated there since ggplot2 3.4.0)
  geom_line(color = "blue", linewidth = 1) +
  geom_point(color = "blue", size = 3) +
  labs(x = "Mode of Transportation", y = "Mean Weight") +
  ggtitle("Mean Weight by Mode of Transportation") +
  theme_minimal()

# FAVC is a yes/no variable, so the x-axis shows two groups rather than a frequency
ggplot(obesity_data, aes(x = FAVC, y = Weight)) +
  geom_point(color = "green", alpha = 0.7) +
  labs(x = "Frequent High-Calorie Food Consumption (FAVC)", y = "Weight") +
  ggtitle("Relationship Between Weight and High-Calorie Food Consumption") +
  theme_minimal()

Hypothesis tests

ANOVA test for MTRANS and Weight

Null Hypothesis (H0): There is no difference in mean weights between individuals using different modes of transportation.

Alternative Hypothesis (H1): There is a difference in mean weights between individuals using different modes of transportation.

anova_result <- aov(Weight ~ MTRANS, data = obesity_data)

summary(anova_result)

##               Df  Sum Sq Mean Sq F value  Pr(>F)    
## MTRANS         4   18495    4624   6.815 1.9e-05 ***
## Residuals   2106 1428917     678                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-value is 6.815 and the p-value is 1.9e-05, which is far below the usual 0.05 threshold. This is strong evidence against the null hypothesis, so we reject it.

Conclusion: Mean weight differs for at least one mode of transportation. The test does not say which modes differ from each other.
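To see where the F-value comes from, it can be recomputed from the sums of squares and degrees of freedom printed in the ANOVA table. This is a minimal sketch using the rounded numbers from that table, so the result should match the reported 6.815 only approximately; the variable names are illustrative.

```r
# Values copied from the ANOVA table (rounded as printed)
ss_between <- 18495     # Sum Sq for MTRANS
df_between <- 4
ss_within  <- 1428917   # Sum Sq for Residuals
df_within  <- 2106

# Mean squares, then F = between-group / within-group variability
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within
f_value    <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(round(f_value, 2))
print(p_value)
```

The F-value is large because the variation between transport groups is several times the variation within groups, which is what drives the small p-value.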

t-test for Weight and high-caloric food consumption

Null Hypothesis (H0): There is no difference in mean weights between individuals who consume high caloric food frequently (FAVC=yes) and those who don’t (FAVC=no).

Alternative Hypothesis (H1): There is a difference in mean weights between individuals with and without high caloric food consumption frequency.

favc_yes <- obesity_data[obesity_data$FAVC == "yes", ]$Weight
favc_no <- obesity_data[obesity_data$FAVC == "no", ]$Weight

t_test <- t.test(favc_yes, favc_no)

print(t_test)

## 
##  Welch Two Sample t-test
## 
## data:  favc_yes and favc_no
## t = 17.834, df = 410.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  19.80748 24.71505
## sample estimates:
## mean of x mean of y 
##  89.16967  66.90841

The p-value is below 2.2e-16, which is very low. The t-value is large (17.834), and the 95% confidence interval for the difference in means runs from 19.80748 to 24.71505, so it does not include 0.

This is strong evidence against the null hypothesis, so we reject it. Conclusion: There is a difference in mean weights between individuals who frequently consume high-caloric food and those who do not.
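The pieces of the t-test output fit together arithmetically. The sketch below rebuilds the confidence interval from the two group means, the t-value, and the Welch degrees of freedom shown in the output. It uses only the rounded printed values, so small discrepancies in the last digits are expected; the variable names are illustrative.

```r
# Values copied from the Welch t-test output
mean_yes <- 89.16967   # mean weight, FAVC = yes
mean_no  <- 66.90841   # mean weight, FAVC = no
t_stat   <- 17.834
df_welch <- 410.8

# Difference in means and the standard error implied by t = diff / SE
diff_means <- mean_yes - mean_no
std_error  <- diff_means / t_stat

# 95% interval: difference +/- critical t-value * standard error
margin <- qt(0.975, df_welch) * std_error
ci <- c(diff_means - margin, diff_means + margin)

print(round(diff_means, 3))
print(round(ci, 2))
```

The interval should land close to the one reported by `t.test()`, and it is centred on the difference between the two sample means.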

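t-test for Weight and Family History

Null Hypothesis (H0): There is no difference in mean weights between individuals with and without a family history of overweight.

Alternative Hypothesis (H1): There is a difference in mean weights between individuals with and without a family history of overweight.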

history <- obesity_data[obesity_data$family_history_with_overweight == "yes", ]$Weight
no_history <- obesity_data[obesity_data$family_history_with_overweight == "no", ]$Weight

test_result <- t.test(history, no_history)

print(test_result)

## 
##  Welch Two Sample t-test
## 
## data:  history and no_history
## t = 36.273, df = 956.71, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  31.86643 35.51170
## sample estimates:
## mean of x mean of y 
##  92.73020  59.04114

t-value: The t-value is 36.273.

p-value: The p-value is below 2.2e-16.

Confidence interval: The 95% confidence interval for the difference in means is between 31.86643 and 35.51170.

Therefore, we reject the null hypothesis.

Conclusion: Individuals with a family history of overweight weigh significantly more on average than those without one.

Linear regression model

Linear regression model for family history and weight

model <- lm(Weight ~ family_history_with_overweight, data = obesity_data)

summary(model)

## 
## Call:
## lm(formula = Weight ~ family_history_with_overweight, data = obesity_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.730 -14.988  -2.791  17.270  80.270 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         59.041      1.159   50.95   <2e-16 ***
## family_history_with_overweightyes   33.689      1.281   26.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.74 on 2109 degrees of freedom
## Multiple R-squared:  0.2468, Adjusted R-squared:  0.2465 
## F-statistic: 691.2 on 1 and 2109 DF,  p-value: < 2.2e-16

The p-value is below 2.2e-16.

The coefficient shows that individuals with a family history of overweight weigh, on average, about 33.689 units more than those without. The intercept (59.041) is the mean weight of the group without a family history.
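With a single yes/no predictor, the regression reproduces the group means from the t-test: 59.041 is the mean of the "no" group, and 59.041 + 33.689 = 92.730 is the mean of the "yes" group. The sketch below shows the same relationship on a small made-up data frame (the data and names are hypothetical, not taken from the obesity data set).

```r
# Hypothetical toy data: three people without and four with a family history
toy <- data.frame(
  weight  = c(60, 62, 58, 90, 95, 91, 92),
  history = c("no", "no", "no", "yes", "yes", "yes", "yes")
)

fit <- lm(weight ~ history, data = toy)
group_means <- tapply(toy$weight, toy$history, mean)

# Intercept = mean of the reference group ("no")
# Slope     = mean("yes") - mean("no")
print(coef(fit))
print(group_means)
```

Here the intercept equals the mean of the "no" group (60) and the `historyyes` coefficient equals the gap between the two group means (32).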

model <-lm(Weight~MTRANS, data=obesity_data)

summary(model)

## 
## Call:
## lm(formula = Weight ~ MTRANS, data = obesity_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.487 -20.170  -3.146  20.669  85.513 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   85.908      1.218  70.504  < 2e-16 ***
## MTRANSBike                    -9.193      9.920  -0.927    0.354    
## MTRANSMotorbike              -12.817      7.948  -1.613    0.107    
## MTRANSPublic_Transportation    1.579      1.384   1.141    0.254    
## MTRANSWalking                -15.312      3.688  -4.152 3.43e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.05 on 2106 degrees of freedom
## Multiple R-squared:  0.01278,    Adjusted R-squared:  0.0109 
## F-statistic: 6.815 on 4 and 2106 DF,  p-value: 1.9e-05

R-squared: 0.01278

F-statistic: 6.815

p-value: 1.9e-05

The R-squared value of 0.01278 means that mode of transportation alone explains only about 1.28% of the variance in weight. Although the overall F-test is significant, the influence of transportation choice on weight is small compared with the other factors examined here. Among the individual coefficients, only Walking differs significantly from the reference category (Automobile): people who walk weigh about 15.3 units less on average.
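The R-squared value is the share of the total sum of squares attributed to MTRANS, so it can be recovered from the ANOVA table for the same model. A minimal sketch, again using the rounded sums of squares as printed (variable names are illustrative):

```r
# Sums of squares copied from the ANOVA table for Weight ~ MTRANS
ss_model <- 18495     # explained by MTRANS
ss_resid <- 1428917   # left unexplained (Residuals)

# R-squared = explained / total variation
r_squared <- ss_model / (ss_model + ss_resid)
print(round(r_squared, 5))
```

The value should match the Multiple R-squared in the regression summary up to rounding.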

model <-lm(Weight~ FAVC,data= obesity_data)

summary(model)

## 
## Call:
## lm(formula = Weight ~ FAVC, data = obesity_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.170 -19.170  -2.212  19.092  83.830 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   66.908      1.610   41.55   <2e-16 ***
## FAVCyes       22.261      1.713   13.00   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.21 on 2109 degrees of freedom
## Multiple R-squared:  0.07415,    Adjusted R-squared:  0.07371 
## F-statistic: 168.9 on 1 and 2109 DF,  p-value: < 2.2e-16

p-value < 0.001

The coefficient estimate suggests that individuals who frequently consume high-caloric food (FAVC = yes) weigh approximately 22.26 units more on average than those who do not. FAVC is the only predictor in this model, so the estimate is not adjusted for any other factors. The R-squared of 0.07415 means FAVC alone explains about 7.4% of the variance in weight.

In conclusion, frequent consumption of high-caloric food is associated with higher individual weights.

Analysis Overview:

The project aimed to investigate the relationship between various factors and weight in individuals. Understanding these relationships can help in preventing obesity.

Conclusion:

In conclusion, the study found that family history of overweight, frequent consumption of high-caloric food, and mode of transportation are all associated with an individual's weight. Family history shows the strongest association, while transportation mode explains only a small share of the variation.

Future Use:

Healthcare: Healthcare providers can use these insights for patients struggling with weight management. Understanding the role of factors like family history, high-caloric food consumption, and transportation in weight can inform better counseling, treatment plans, and lifestyle modifications.

Public Health Programs: Public health initiatives that help individuals struggling with obesity can benefit from these findings.

Research and Academia: The analysis can serve as a foundation for further academic research in the field of obesity, weight management, and associated factors.

Fitness and Nutrition Industries: Insights into the factors affecting weight can also be useful in the fitness and nutrition sectors. Companies developing fitness apps, dietary plans, or products focused on weight management can leverage this analysis to design more targeted and evidence-based solutions.

Final_Stat

Ankita

2023-12-03