library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
READING THE DATASET
Obesity <- read.csv('/Users/ankit/Downloads/Obesity.csv')
PROBLEM 1:
Select a continuous (or ordered integer) column of data that seems most “valuable” given the context of your data, and call this your response variable. For example, in the Ames housing data, the price of the house is likely of the most value to both buyers and sellers. This is the thing most people will ask about when it comes to houses. Select a categorical column of data (explanatory variable) that you expect might influence the response variable. Devise a null hypothesis for an ANOVA test given this situation. Test this hypothesis using ANOVA, and summarize your results. Be clear about how the R output relates to your conclusions. If there are more than 10 categories, consider consolidating them before running the test using the methods we’ve learned in class. Explain what this might mean for people who may be interested in your data. E.g., “there is not enough evidence to conclude [—-], so it would be safe to assume that we can [——]”.
SOLUTION 1:
Response variable: Weight. Explanatory variable: family_history_with_overweight.
Null Hypothesis: Mean weight is the same for people with and without a family history of overweight.
Alternate Hypothesis: Mean weight differs between people with and without a family history of overweight.
# Performing ANOVA
model <- aov(Weight ~ family_history_with_overweight, data = Obesity)
# Summarizing the ANOVA results
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## family_history_with_overweight 1 357266 357266 691.2 <2e-16 ***
## Residuals 2109 1090147 517
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The F value is 691.2 on 1 and 2109 degrees of freedom, meaning the variation in weight between the two groups is far larger than the variation within them. The p-value is below 2e-16, well under 0.05, so we reject the null hypothesis: there is a significant difference in mean weight between individuals with and without a family history of overweight. For anyone using this data, family history is a useful indicator of weight, although the ANOVA alone shows only that the group means differ, not why they differ.
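As a check on how the ANOVA table relates to the conclusion, the F value and the share of variance explained can be recomputed from the sums of squares reported above. This is a sketch using only the printed (rounded) numbers; the variable names are illustrative.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_between <- 357266   # Sum Sq for family_history_with_overweight
ss_within  <- 1090147  # Sum Sq for Residuals
df_between <- 1
df_within  <- 2109

# F is the ratio of the between-group to the within-group mean square
f_value <- (ss_between / df_between) / (ss_within / df_within)

# Eta-squared: share of total variation in Weight attributable to the grouping
eta_sq <- ss_between / (ss_between + ss_within)

round(f_value, 1)  # close to the reported 691.2
round(eta_sq, 4)   # about 0.2468
```

About a quarter of the variation in weight is associated with family history, so the difference is statistically clear but most of the variation lies within the two groups.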
PROBLEM 2:
Find at least one other continuous (or ordered integer) column of data that might influence the response variable. Make sure the relationship between this variable and the response is roughly linear. Build a linear regression model of the response using just this column, and evaluate its fit. Run appropriate hypothesis tests and summarize their results. Use diagnostic plots to identify any issues with your model. Interpret the coefficients of your model, and explain how they relate to the context of your data. For example, can you make any recommendations about an optimal way of doing something?
SOLUTION 2:
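This solution tests the same hypothesis as Solution 1, because the dataset has few continuous columns that make sense as predictors of weight. Because family_history_with_overweight is a two-level categorical variable, the regression is equivalent to the ANOVA above and gives the same F-statistic (691.2).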
# Linear regression model
model <- lm(Weight ~ family_history_with_overweight, data=Obesity)
summary(model)
##
## Call:
## lm(formula = Weight ~ family_history_with_overweight, data = Obesity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.730 -14.988 -2.791 17.270 80.270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.041 1.159 50.95 <2e-16 ***
## family_history_with_overweightyes 33.689 1.281 26.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.74 on 2109 degrees of freedom
## Multiple R-squared: 0.2468, Adjusted R-squared: 0.2465
## F-statistic: 691.2 on 1 and 2109 DF, p-value: < 2.2e-16
# diagnostic plots
par(mfrow=c(1,1))
plot(model)
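Called with a 1 x 1 layout, plot(model) draws its diagnostic plots one after another. A minimal sketch of viewing the four default plots in one grid follows. It uses simulated stand-in data (the data frame `toy` and its values are hypothetical, not the Obesity data) so that it runs on its own.

```r
# Simulated stand-in data: NOT the Obesity dataset, just a runnable example
set.seed(1)
toy <- data.frame(
  family_history = factor(rep(c("no", "yes"), times = c(40, 160)))
)
toy$Weight <- ifelse(toy$family_history == "yes", 93, 59) + rnorm(200, sd = 22)

toy_model <- lm(Weight ~ family_history, data = toy)

# Show the four default diagnostic plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(toy_model)
par(old_par)  # restore the previous layout

# With one binary predictor there are only two fitted values, so the
# residual plots show two vertical strips; compare their spread directly
tapply(residuals(toy_model), toy$family_history, sd)
```

If the residual spread differs strongly between the two groups, the equal-variance assumption behind the reported standard errors is doubtful.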
Interpretation: The model shows that family history of overweight is a highly significant predictor of weight (t = 26.29, p < 2e-16). The intercept, 59.041, is the estimated mean weight of individuals without a family history of overweight; the coefficient 33.689 means that individuals with such a history weigh about 33.7 units more on average, for an estimated mean of 92.730. The model as a whole is highly significant (F = 691.2, p < 2.2e-16), but it explains only 24.68% of the variability in weight, and the residual standard error of 22.74 shows that individual weights vary widely around the two group means.
In practical terms, this means that individuals with a family history of overweight tend to have higher weights, on average, compared to those without such a family history. This relationship is statistically significant and can be useful for understanding and predicting weight in the context of family history of overweight.
Because the p-value is far below 0.05, we can reject the null hypothesis.
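The coefficients translate directly into the two group means, and with a single two-level predictor the squared t value of the slope equals the model F-statistic. The arithmetic below uses only the rounded values printed in the summary; the variable names are illustrative.

```r
intercept <- 59.041  # estimated mean Weight, no family history
slope     <- 33.689  # estimated difference for family history = yes
t_slope   <- 26.29   # t value of the slope, as printed

mean_no  <- intercept
mean_yes <- intercept + slope

c(no = mean_no, yes = mean_yes)  # 59.041 and 92.730
round(t_slope^2, 1)              # about 691.2, matching the F-statistic
```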
PROBLEM 3:
Include at least one other variable into your regression model (e.g., you might use the one from the ANOVA), and evaluate how it helps (or doesn’t). Maybe include an interaction term, but explain why you included it.
SOLUTION 3:
# Fit the multiple linear regression model with an interaction term
model <- lm(Weight ~ family_history_with_overweight * MTRANS, data = Obesity)
# Summary of the regression model
summary(model)
##
## Call:
## lm(formula = Weight ~ family_history_with_overweight * MTRANS,
## data = Obesity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.056 -14.352 -2.196 15.685 77.944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.841 3.153 23.099 < 2e-16 ***
## family_history_with_overweightyes 14.672 3.341 4.391 1.18e-05 ***
## MTRANSBike -10.841 16.079 -0.674 0.5003
## MTRANSMotorbike -1.041 10.459 -0.100 0.9207
## MTRANSPublic_Transportation -16.489 3.399 -4.851 1.32e-06 ***
## MTRANSWalking -10.051 6.009 -1.673 0.0946 .
## family_history_with_overweightyes:MTRANSBike 5.928 18.953 0.313 0.7545
## family_history_with_overweightyes:MTRANSMotorbike -12.306 13.909 -0.885 0.3764
## family_history_with_overweightyes:MTRANSPublic_Transportation 24.032 3.628 6.623 4.45e-11 ***
## family_history_with_overweightyes:MTRANSWalking -2.857 7.125 -0.401 0.6885
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.3 on 2101 degrees of freedom
## Multiple R-squared: 0.2783, Adjusted R-squared: 0.2752
## F-statistic: 90.02 on 9 and 2101 DF, p-value: < 2.2e-16
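Interpretation: Only one of the four interaction terms is statistically significant: family history combined with Public_Transportation (estimate 24.032, p = 4.45e-11). The interactions with Bike, Motorbike, and Walking have large standard errors and p-values of 0.75, 0.38, and 0.69, so there is no evidence that the effect of family history differs for those modes relative to the reference mode (the MTRANS level omitted from the coefficient table).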
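The multiple R-squared (0.2783) is the proportion of variance in weight explained by the model, up from 0.2468 for the model with family history alone; the adjusted R-squared rises from 0.2465 to 0.2752. The F-statistic tests the overall significance of the model and is highly significant (p-value < 2.2e-16), so the model predicts weight better than an intercept-only model. That does not by itself make it a good fit: about 72% of the variance remains unexplained, so adding MTRANS and the interaction gives only a modest improvement.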
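One way to judge whether the extra terms help is a partial F-test comparing this model with the family-history-only model from Solution 2. With both fitted models available, anova() on the two fits does this directly. The sketch below instead approximates the test from the rounded R-squared values printed in the two summaries, so the result is approximate and the variable names are illustrative.

```r
r2_reduced <- 0.2468  # Weight ~ family_history_with_overweight
r2_full    <- 0.2783  # Weight ~ family_history_with_overweight * MTRANS
df_extra   <- 9 - 1   # extra coefficients in the full model
df_resid   <- 2101    # residual degrees of freedom of the full model

# Partial F: gain in explained variance per extra term, relative to
# the unexplained variance per residual degree of freedom
f_partial <- ((r2_full - r2_reduced) / df_extra) / ((1 - r2_full) / df_resid)
p_value   <- pf(f_partial, df_extra, df_resid, lower.tail = FALSE)

round(f_partial, 2)
p_value < 0.001
```

With these inputs the statistic comes out near 11.5 on 8 and 2101 degrees of freedom, which suggests that MTRANS and its interaction jointly improve on the simpler model even though the gain in R-squared is only about three percentage points.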
In the reference transportation mode, a family history of overweight is associated with a weight about 14.7 units higher (p = 1.18e-05). Among individuals without a family history, public transportation users weigh about 16.5 units less than the reference group (p = 1.32e-06), while the differences for Bike, Motorbike, and Walking are not significant at the 0.05 level. The effect of family history is much larger among public transportation users (14.672 + 24.032 = 38.704 units), which suggests that the association between family history and weight is moderated by transportation mode, at least for that group.
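To see what the interaction means in practice, the estimated group means can be rebuilt from the coefficients. This sketch uses the rounded estimates from the summary for the reference mode and for Public_Transportation; the helper name `cell_mean` is hypothetical.

```r
b0      <- 72.841   # (Intercept): no family history, reference mode
b_fh    <- 14.672   # family_history_with_overweightyes
b_pt    <- -16.489  # MTRANSPublic_Transportation
b_fh_pt <- 24.032   # interaction: family history x Public_Transportation

# Estimated mean Weight for a combination of the two indicators (0 or 1)
cell_mean <- function(fh, pt) b0 + b_fh * fh + b_pt * pt + b_fh_pt * fh * pt

# Columns are filled in order: reference mode (no, yes), then public transport
means <- matrix(
  c(cell_mean(0, 0), cell_mean(1, 0), cell_mean(0, 1), cell_mean(1, 1)),
  nrow = 2,
  dimnames = list(family_history = c("no", "yes"),
                  mode = c("reference", "Public_Transportation"))
)
means

# Family-history effect within each mode
means["yes", ] - means["no", ]  # 14.672 and 38.704
```

The gap between the two rows is the family-history effect, and it is more than twice as large in the Public_Transportation column as in the reference column.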