Obesity_data_set <- read.csv("C:/Users/admin/Desktop/Ankitaa/Assignments/Intro to R/obesity_data_set.csv")

#Task 1: Select a continuous (or ordered integer) column of data that seems most “valuable” given the context of your data, and call this your response variable. For example, in the Ames housing data, the price of the house is likely of the most value to both buyers and sellers. This is the thing most people will ask about when it comes to houses.

Answer: For the ‘Obesity dataset’, I have selected ‘Weight’ as the response variable.

head(Obesity_data_set$Weight)
## [1] 64.0 56.0 77.0 87.0 89.8 53.0

#Task 2: Select a categorical column of data (explanatory variable) that you expect might influence the response variable. Devise a null hypothesis for an ANOVA test given this situation. Test this hypothesis using ANOVA, and summarize your results. Be clear about how the R output relates to your conclusions. If there are more than 10 categories, consider consolidating them before running the test using the methods we’ve learned in class. Explain what this might mean for people who may be interested in your data. E.g., “there is not enough evidence to conclude [—-], so it would be safe to assume that we can [——]”.

Answer:

For the explanatory variable, I have selected ‘MTRANS’, which records each person’s usual mode of transportation. The modes of transportation are: Public Transportation, Automobile, Walking, Motorbike, and Bike.

#NULL HYPOTHESIS

Null Hypothesis (H0): The mean weight is the same across all transportation modes.

#ALTERNATIVE HYPOTHESIS

Alternative Hypothesis (HA): The mean weight differs for at least one transportation mode.

#ANOVA TEST:

anovaobesity <- aov(Obesity_data_set$Weight ~ Obesity_data_set$MTRANS, data = Obesity_data_set)

summary(anovaobesity)
##                           Df  Sum Sq Mean Sq F value  Pr(>F)    
## Obesity_data_set$MTRANS    4   18495    4624   6.815 1.9e-05 ***
## Residuals               2106 1428917     678                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

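#RESULTS

F value: 6.815

P value: 1.9e-05

Common significance level: 0.05

The p-value (1.9e-05) is far smaller than the common significance level (0.05), therefore, we REJECT THE NULL HYPOTHESIS. The F value itself is not compared with the p-value; it is the test statistic from which the p-value is computed.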
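To make the link between the R output and the conclusion explicit, the F value and p-value can be recomputed by hand from the sums of squares and degrees of freedom printed in the table. This is only a sketch that copies the rounded numbers from the output, so the last digits may differ slightly from what `summary()` reports; the variable names are my own.

```r
# Values copied from the summary(anovaobesity) table.
ss_between <- 18495    # Sum Sq for MTRANS
ss_within  <- 1428917  # Sum Sq for Residuals
df_between <- 4        # number of transportation modes minus 1
df_within  <- 2106     # observations minus number of modes

# Mean squares are sums of squares divided by their degrees of freedom.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# The F statistic is the ratio of the two mean squares.
f_value <- ms_between / ms_within

# The p-value is the upper-tail probability of the F distribution.
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(round(f_value, 3))
print(signif(p_value, 2))
print(p_value < 0.05)  # TRUE means H0 is rejected at the 0.05 level
```

The decision rule only ever compares the p-value with the significance level.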

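#SUMMARY: There is enough evidence to conclude that the mean weight differs across transportation modes. The ANOVA output (F = 6.815 on 4 and 2106 degrees of freedom, p = 1.9e-05) indicates that differences this large would be very unlikely to arise by random chance if all modes had the same mean weight. The test does not show which modes differ, and on its own it shows an association between weight and transportation mode rather than proving that the mode causes the difference. For people interested in this data, it would be safe to assume that transportation mode carries some information about weight.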
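A significant ANOVA says that at least one group mean differs, but not which one. A common follow-up is Tukey's HSD, which compares every pair of groups. The sketch below uses simulated data, not the obesity data: the data frame `toy`, its three mode labels, and the chosen means are all assumptions made purely for illustration.

```r
set.seed(42)  # make the simulated data reproducible

# Hypothetical stand-in for the obesity data: three modes, 40 people each.
toy <- data.frame(
  MTRANS = factor(rep(c("Automobile", "Public_Transportation", "Walking"),
                      each = 40)),
  Weight = c(rnorm(40, mean = 90, sd = 10),
             rnorm(40, mean = 88, sd = 10),
             rnorm(40, mean = 70, sd = 10))
)

# Group means show the direction of the differences.
group_means <- tapply(toy$Weight, toy$MTRANS, mean)
print(round(group_means, 1))

# One-way ANOVA followed by Tukey's HSD for all pairwise comparisons.
toy_aov <- aov(Weight ~ MTRANS, data = toy)
tukey <- TukeyHSD(toy_aov)
print(tukey)
```

Each row of the Tukey table gives the difference between two group means, a 95% confidence interval, and an adjusted p-value. Running the same two calls on `anovaobesity` would be one way to see which transportation modes drive the result.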

#Task 3: Find at least one other continuous (or ordered integer) column of data that might influence the response variable. Make sure the relationship between this variable and the response is roughly linear. Build a linear regression model of the response using just this column, and evaluate its fit. Run appropriate hypothesis tests and summarize their results. Use diagnostic plots to identify any issues with your model. Interpret the coefficients of your model, and explain how they relate to the context of your data. For example, can you make any recommendations about an optimal way of doing something?

Answer: I am selecting ‘FAF’ (physical activity frequency) as another continuous variable.

#Linear regression model:

lin_model <- lm(Obesity_data_set$Weight ~ Obesity_data_set$FAF, data = Obesity_data_set)


summary(lin_model)
## 
## Call:
## lm(formula = Obesity_data_set$Weight ~ Obesity_data_set$FAF, 
##     data = Obesity_data_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.186 -20.519  -3.356  20.711  87.981 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           88.1862     0.8843  99.730   <2e-16 ***
## Obesity_data_set$FAF  -1.5838     0.6696  -2.365   0.0181 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.16 on 2109 degrees of freedom
## Multiple R-squared:  0.002646,   Adjusted R-squared:  0.002173 
## F-statistic: 5.595 on 1 and 2109 DF,  p-value: 0.01811

#HYPOTHESIS TEST: ANOVA

anova(lin_model)
## Analysis of Variance Table
## 
## Response: Obesity_data_set$Weight
##                        Df  Sum Sq Mean Sq F value  Pr(>F)  
## Obesity_data_set$FAF    1    3829  3829.4  5.5946 0.01811 *
## Residuals            2109 1443583   684.5                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

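#Summary: The analysis of variance tests whether FAF explains a significant share of the variation in Weight.

F(1, 2109) = 5.5946

p = 0.01811

Sum of squares (FAF) = 3829.4

Residual sum of squares = 1443583

Mean square error = 684.5

Since p < 0.05, we reject the null hypothesis that the slope of FAF is zero. However, the multiple R-squared of 0.002646 means that FAF explains only about 0.26% of the variation in Weight, so the fit is very weak.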
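The numbers in the ANOVA table and the regression summary are tied together, and a short calculation shows how. This sketch copies the rounded values from the output above (variable names are my own) and recovers the R-squared and the F statistic from them.

```r
# Values copied from the anova(lin_model) and summary(lin_model) output.
ss_faf      <- 3829.4    # sum of squares explained by FAF
ss_residual <- 1443583   # sum of squares left in the residuals
t_faf       <- -2.365    # t value for the FAF slope

# R-squared is the explained share of the total sum of squares.
r_squared <- ss_faf / (ss_faf + ss_residual)
print(round(r_squared, 6))

# With a single predictor, the F statistic equals the squared t statistic.
print(round(t_faf^2, 2))
```

This is why the p-value of the F-test (0.01811) matches the p-value of the FAF slope in the coefficient table: with one predictor they are the same test.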

#DIAGNOSTIC PLOTS

plot(lin_model)
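By default, `plot()` on an `lm` object draws four plots: Residuals vs Fitted (checks linearity), Normal Q-Q (checks normality of residuals), Scale-Location (checks constant variance), and Residuals vs Leverage (flags influential points). The sketch below shows one way to view all four at once. It uses simulated data in a hypothetical data frame `toy`, and the `4 / n` cutoff for Cook's distance is only a rule of thumb, not a formal test.

```r
set.seed(1)  # reproducible simulated data

# Hypothetical data with a genuinely linear relationship plus noise.
toy <- data.frame(x = runif(200, min = 0, max = 3))
toy$y <- 85 - 1.5 * toy$x + rnorm(200, sd = 5)
toy_fit <- lm(y ~ x, data = toy)

# Draw the four default diagnostic plots in one 2 x 2 panel.
old_par <- par(mfrow = c(2, 2))
plot(toy_fit)
par(old_par)  # restore the previous plotting layout

# Numeric companions to the plots: residual spread and influential points.
print(summary(residuals(toy_fit)))
print(sum(cooks.distance(toy_fit) > 4 / nrow(toy)))
```

On well-behaved data like this, the residual plots should show no clear pattern and the Q-Q points should sit close to the line. Departures from that on the real model would point to the issues the task asks about.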

#COEFFICIENTS

coefficient <- coef(lin_model)
print(coefficient)
##          (Intercept) Obesity_data_set$FAF 
##            88.186177            -1.583809

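Intercept: 88.186177

Coefficient (FAF): -1.583809

The intercept is the expected weight when the physical activity frequency is zero.

The coefficient shows that for every one-unit increase in ‘FAF’, the expected weight decreases by approximately 1.58 units.

On average, people with a higher ‘FAF’ have slightly lower weights, but the effect is small compared with the residual standard error of 26.16, so FAF alone is a poor predictor of an individual’s weight.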
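To see what these coefficients imply in practice, the fitted line can be evaluated at a few activity levels, and the uncertainty in the slope can be expressed as a confidence interval. This is a sketch built from the printed estimates; the FAF values 0 to 3 are hypothetical inputs I chose for illustration, and the variable names are my own.

```r
# Coefficients and standard error copied from summary(lin_model).
intercept <- 88.186177
slope     <- -1.583809
slope_se  <- 0.6696
df_resid  <- 2109

# Fitted mean weight at a few activity levels (hypothetical FAF values).
faf_values <- c(0, 1, 2, 3)
fitted_weight <- intercept + slope * faf_values
print(data.frame(FAF = faf_values, Weight = round(fitted_weight, 2)))

# Approximate 95% confidence interval for the slope.
t_crit <- qt(0.975, df_resid)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se
print(round(slope_ci, 2))
```

The interval excludes zero, which agrees with the p-value of 0.0181, but it is wide relative to the estimate. Calling `confint(lin_model)` on the fitted model should give the same interval without the manual arithmetic.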

#Task 4: Include at least one other variable into your regression model (e.g., you might use the one from the ANOVA), and evaluate how it helps (or doesn’t). Maybe include an interaction term, but explain why you included it. You can add up to 4 variables if you like.

Answer: I have included the ‘FCVC’ column in the linear regression model. FCVC = frequency of consumption of vegetables. What we eat plausibly affects our weight, and its effect may depend on physical activity as well, so I also included an interaction term between FAF and FCVC.

#Regression model:

library(stats)
reg_model <- lm(Obesity_data_set$Weight ~ Obesity_data_set$FAF * Obesity_data_set$FCVC, data = Obesity_data_set)
summary(reg_model)
## 
## Call:
## lm(formula = Obesity_data_set$Weight ~ Obesity_data_set$FAF * 
##     Obesity_data_set$FCVC, data = Obesity_data_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.223 -19.657  -0.934  16.727  82.627 
## 
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                  59.681      3.852  15.495  < 2e-16
## Obesity_data_set$FAF                          1.188      2.910   0.408    0.683
## Obesity_data_set$FCVC                        11.847      1.561   7.589 4.82e-14
## Obesity_data_set$FAF:Obesity_data_set$FCVC   -1.204      1.176  -1.024    0.306
##                                               
## (Intercept)                                ***
## Obesity_data_set$FAF                          
## Obesity_data_set$FCVC                      ***
## Obesity_data_set$FAF:Obesity_data_set$FCVC    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.54 on 2107 degrees of freedom
## Multiple R-squared:  0.05029,    Adjusted R-squared:  0.04894 
## F-statistic: 37.19 on 3 and 2107 DF,  p-value: < 2.2e-16

#Explanation:

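Intercept: 59.681 is the expected weight when both FAF and FCVC are zero.

FAF: 1.188. Because the model contains an interaction, this is the change in expected weight per unit increase in the physical activity frequency when FCVC is zero. It is not statistically significant (p value = 0.683).

FCVC: 11.847. For each unit increase in FCVC, the expected weight increases by 11.847 units when FAF is zero (p value < 0.001).

Interaction term: -1.204. For each unit increase in FCVC, the slope of FAF changes by -1.204, and vice versa. It is not statistically significant (p value = 0.306).

Multiple R-squared: 0.05029 (adjusted R-squared: 0.04894)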
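With an interaction in the model, the effect of FAF is no longer a single number: it depends on the value of FCVC. The sketch below works this out from the printed coefficients. The FCVC values 1, 2, and 3 are hypothetical inputs chosen for illustration, and the variable names are my own.

```r
# Coefficients copied from summary(reg_model).
b_faf         <- 1.188    # FAF slope when FCVC is 0
b_fcvc        <- 11.847   # FCVC slope when FAF is 0
b_interaction <- -1.204   # change in each slope per unit of the other variable

# Slope of Weight on FAF at a few hypothetical FCVC values.
fcvc_values <- c(1, 2, 3)
faf_slope <- b_faf + b_interaction * fcvc_values
print(data.frame(FCVC = fcvc_values, FAF_slope = round(faf_slope, 3)))
```

The FAF slope moves from roughly zero toward more negative values as FCVC increases. Since the interaction term is not significant, these differences in slope should be read as a description of the fitted model rather than as established effects.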

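#Summary: Individuals who consume vegetables more often have higher weights on average in this model.

FAF doesn’t have a significant relationship with weight once FCVC and the interaction are included (p value = 0.683 > significance level of 0.05).

The interaction between FAF and FCVC does not significantly influence weight in this model (p value = 0.306).

Adding FCVC raised the adjusted R-squared from 0.002173 to 0.04894, so it helps, but the model still explains only about 5% of the variation in weight.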
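One way to evaluate whether each added term helps is to fit the nested models and compare them with `anova()`, which runs an F-test on the reduction in residual sum of squares. The sketch below uses simulated data, not the obesity data: the data frame `toy`, the value ranges, and the true effect sizes are all assumptions made for illustration.

```r
set.seed(7)  # reproducible simulated data

# Hypothetical data in which the second predictor matters and the first does not.
n <- 300
toy <- data.frame(
  FAF  = runif(n, min = 0, max = 3),
  FCVC = runif(n, min = 1, max = 3)
)
toy$Weight <- 60 + 12 * toy$FCVC + rnorm(n, sd = 20)

# Three nested models: one predictor, two predictors, two plus interaction.
fit_one  <- lm(Weight ~ FAF, data = toy)
fit_two  <- lm(Weight ~ FAF + FCVC, data = toy)
fit_full <- lm(Weight ~ FAF * FCVC, data = toy)

# Sequential F-tests: does each added term reduce the residual sum of squares?
comparison <- anova(fit_one, fit_two, fit_full)
print(comparison)

# Adjusted R-squared penalises extra terms, so it is comparable across models.
adj_r2 <- sapply(list(fit_one, fit_two, fit_full),
                 function(fit) summary(fit)$adj.r.squared)
print(round(adj_r2, 4))
```

Writing the formulas with bare column names and `data = toy` also keeps the coefficient labels short and lets `predict()` work with new data. If the interaction row is not significant, dropping it in favour of the simpler additive model is a reasonable choice.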