Data Dive - Regression

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

READING THE DATASET

Obesity <- read.csv('/Users/ankit/Downloads/Obesity.csv')

PROBLEM 1:

Select a continuous (or ordered integer) column of data that seems most “valuable” given the context of your data, and call this your response variable. For example, in the Ames housing data, the price of the house is likely of the most value to both buyers and sellers. This is the thing most people will ask about when it comes to houses. Select a categorical column of data (explanatory variable) that you expect might influence the response variable. Devise a null hypothesis for an ANOVA test given this situation. Test this hypothesis using ANOVA, and summarize your results. Be clear about how the R output relates to your conclusions. If there are more than 10 categories, consider consolidating them before running the test using the methods we’ve learned in class. Explain what this might mean for people who may be interested in your data. E.g., “there is not enough evidence to conclude [—-], so it would be safe to assume that we can [——]”.

SOLUTION 1:

Response variable: Weight

Explanatory variable: family_history_with_overweight

Null Hypothesis: Mean weight is the same for people with and without a family history of overweight.

Alternate Hypothesis: Mean weight differs between people with and without a family history of overweight.

# Performing ANOVA
model <- aov(Weight ~ family_history_with_overweight, data = Obesity)

# Summarizing the ANOVA results
summary(model)

##                                  Df  Sum Sq Mean Sq F value Pr(>F)    
## family_history_with_overweight    1  357266  357266   691.2 <2e-16 ***
## Residuals                      2109 1090147     517                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation: The F-value is 691.2 on 1 and 2109 degrees of freedom, which means the variation in weight between the two groups is very large relative to the variation within each group. The p-value is below 2e-16, far smaller than 0.001, so the result is highly significant and we can reject the null hypothesis. There is a significant difference in mean weight between individuals with and without a family history of overweight. Individuals with a family history of overweight tend to have a different weight profile from those without one; the ANOVA alone does not show the direction or size of that difference.
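
As a check on how the ANOVA table above fits together, the F-value and the share of variance explained can be recomputed from the printed sums of squares. This is a minimal sketch that uses only the rounded numbers from the output, so the results match the table only to rounding; the variable names are illustrative.

```r
# Values copied from the ANOVA table printed above (rounded as shown there)
ss_between <- 357266    # Sum Sq for family_history_with_overweight
ss_within  <- 1090147   # Sum Sq for Residuals
df_between <- 1
df_within  <- 2109

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# The F-value is the ratio of between-group to within-group mean squares
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Eta-squared: share of total variation in Weight attributed to the grouping
eta_sq <- ss_between / (ss_between + ss_within)

print(round(c(F = f_value, eta_squared = eta_sq), 4))
print(p_value < 2e-16)
```

The ratio comes out at about 691.2 and eta-squared at about 0.247, so family history accounts for roughly a quarter of the variation in weight.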

PROBLEM 2:

Find at least one other continuous (or ordered integer) column of data that might influence the response variable. Make sure the relationship between this variable and the response is roughly linear. Build a linear regression model of the response using just this column, and evaluate its fit. Run appropriate hypothesis tests and summarize their results. Use diagnostic plots to identify any issues with your model. Interpret the coefficients of your model, and explain how they relate to the context of your data. For example, can you make any recommendations about an optimal way of doing something?

SOLUTION 2:

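This solution tests the same hypothesis as Solution 1, because the dataset has few continuous columns that make sense as predictors of weight. The explanatory variable is therefore still the categorical family_history_with_overweight rather than a continuous column.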

# Linear regression model
model <- lm(Weight ~ family_history_with_overweight, data=Obesity)

summary(model)

## 
## Call:
## lm(formula = Weight ~ family_history_with_overweight, data = Obesity)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.730 -14.988  -2.791  17.270  80.270 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         59.041      1.159   50.95   <2e-16 ***
## family_history_with_overweightyes   33.689      1.281   26.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.74 on 2109 degrees of freedom
## Multiple R-squared:  0.2468, Adjusted R-squared:  0.2465 
## F-statistic: 691.2 on 1 and 2109 DF,  p-value: < 2.2e-16

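# Diagnostic plots: plot() on an lm object draws the default residual diagnostics, one plot per figure with mfrow = c(1, 1)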
par(mfrow=c(1,1))
plot(model)
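
With a single two-level predictor, the fitted values can only take two values (one per group), so the residual plots show two vertical strips rather than a cloud. The sketch below illustrates this and adds two numeric checks of the equal-variance and normality assumptions. It runs on simulated stand-in data, not on the Obesity dataset; the data frame `toy`, its columns, and the group sizes are assumptions made for illustration.

```r
set.seed(1)

# Simulated stand-in data (NOT the Obesity dataset): two groups, common spread
toy <- data.frame(
  family_history = factor(rep(c("no", "yes"), times = c(60, 240))),
  weight = c(rnorm(60, mean = 60, sd = 20), rnorm(240, mean = 90, sd = 20))
)

fit <- lm(weight ~ family_history, data = toy)

# A two-level factor yields exactly two distinct fitted values (the group means)
n_fitted <- length(unique(round(fitted(fit), 8)))

# Equal-variance check: residual spread within each group
sd_by_group <- tapply(resid(fit), toy$family_history, sd)

# Bartlett's test of equal variances across the groups
var_test <- bartlett.test(weight ~ family_history, data = toy)

# Shapiro-Wilk test of normality of the residuals
norm_test <- shapiro.test(resid(fit))

print(n_fitted)
print(sd_by_group)
print(c(bartlett_p = var_test$p.value, shapiro_p = norm_test$p.value))
```

On the real data, the same calls would take Weight and family_history_with_overweight from Obesity. Large p-values in these tests would be consistent with the model assumptions; small ones would flag unequal spread or non-normal residuals.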

Interpretation: The results of this linear regression model indicate that having a family history of overweight is a highly significant predictor of an individual’s weight (t = 26.29, p < 2e-16). The intercept, 59.041, is the estimated mean weight of individuals without a family history of overweight, and the coefficient, 33.689, is the estimated difference in mean weight for individuals with one, giving a mean of about 92.73 for that group. The model as a whole is highly significant (F = 691.2 on 1 and 2109 degrees of freedom, the same F-value as the ANOVA in Solution 1, because a regression on a single two-level factor is the same test), but it explains only a modest share of the variability in weight (R-squared = 24.68%), with a residual standard error of 22.74.

In practical terms, this means that individuals with a family history of overweight tend to have higher weights, on average, compared to those without such a family history. This relationship is statistically significant and can be useful for understanding and predicting weight in the context of family history of overweight.

Also, the p-value (< 2.2e-16) is far below 0.05, so we can reject the null hypothesis.
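
The reason the regression and the ANOVA agree can be shown directly: with one two-level factor, the intercept is the mean of the reference group, the slope is the difference between the group means, and the squared t statistic of the slope equals the ANOVA F-value. The sketch below checks these three identities on simulated stand-in data, not on the Obesity dataset; the data frame `toy` and its column names are assumptions made for illustration.

```r
set.seed(42)

# Simulated stand-in data (NOT the Obesity dataset)
toy <- data.frame(
  family_history = factor(rep(c("no", "yes"), times = c(50, 200))),
  weight = c(rnorm(50, mean = 59, sd = 22), rnorm(200, mean = 93, sd = 22))
)

fit_lm  <- lm(weight ~ family_history, data = toy)
fit_aov <- aov(weight ~ family_history, data = toy)

# Observed mean weight in each group
mean_no  <- mean(toy$weight[toy$family_history == "no"])
mean_yes <- mean(toy$weight[toy$family_history == "yes"])

# Intercept: estimate for the reference level ("no")
intercept <- unname(coef(fit_lm)["(Intercept)"])
# Slope: estimated shift for the "yes" level relative to "no"
slope <- unname(coef(fit_lm)["family_historyyes"])

# t statistic of the slope and F-value of the one-way ANOVA
t_slope <- summary(fit_lm)$coefficients["family_historyyes", "t value"]
f_anova <- summary(fit_aov)[[1]][["F value"]][1]

print(all.equal(intercept, mean_no))
print(all.equal(slope, mean_yes - mean_no))
print(all.equal(t_slope^2, f_anova))
```

The same relations hold in the output above: 26.29 squared is about 691.2, and 59.041 + 33.689 gives a mean of 92.730 for the group with a family history.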

PROBLEM 3:

Include at least one other variable into your regression model (e.g., you might use the one from the ANOVA), and evaluate how it helps (or doesn’t). Maybe include an interaction term, but explain why you included it.

SOLUTION 3:

# Fit the multiple linear regression model with an interaction term
model <- lm(Weight ~ family_history_with_overweight * MTRANS, data = Obesity)

# Summary of the regression model
summary(model)

## 
## Call:
## lm(formula = Weight ~ family_history_with_overweight * MTRANS, 
##     data = Obesity)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.056 -14.352  -2.196  15.685  77.944 
## 
## Coefficients:
##                                                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                                                     72.841      3.153  23.099  < 2e-16 ***
## family_history_with_overweightyes                               14.672      3.341   4.391 1.18e-05 ***
## MTRANSBike                                                     -10.841     16.079  -0.674   0.5003    
## MTRANSMotorbike                                                 -1.041     10.459  -0.100   0.9207    
## MTRANSPublic_Transportation                                    -16.489      3.399  -4.851 1.32e-06 ***
## MTRANSWalking                                                  -10.051      6.009  -1.673   0.0946 .  
## family_history_with_overweightyes:MTRANSBike                     5.928     18.953   0.313   0.7545    
## family_history_with_overweightyes:MTRANSMotorbike              -12.306     13.909  -0.885   0.3764    
## family_history_with_overweightyes:MTRANSPublic_Transportation   24.032      3.628   6.623 4.45e-11 ***
## family_history_with_overweightyes:MTRANSWalking                 -2.857      7.125  -0.401   0.6885    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.3 on 2101 degrees of freedom
## Multiple R-squared:  0.2783, Adjusted R-squared:  0.2752 
## F-statistic: 90.02 on 9 and 2101 DF,  p-value: < 2.2e-16

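INTERPRETATION: In this model, only one of the four interaction terms is statistically significant: family history combined with public transportation (estimate 24.032, p = 4.45e-11). The interactions with Bike, Motorbike, and Walking have large standard errors and p-values of 0.75, 0.38, and 0.69 respectively, so the data give no evidence that the effect of family history differs for those modes. The effect of family history on weight therefore differs significantly between public transportation users and the baseline transportation mode (the MTRANS level that does not appear in the coefficient table).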

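The multiple R-squared value (0.2783) is the proportion of variance in weight explained by the model, and the F-statistic tests the overall significance of the model. The F-statistic is highly significant (F = 90.02 on 9 and 2101 degrees of freedom, p-value < 2.2e-16), which shows that the model predicts weight better than an intercept-only model; it does not by itself show that the fit is good. Compared with the single-predictor model in Solution 2, the adjusted R-squared rises from 0.2465 to 0.2752 and the residual standard error falls from 22.74 to 22.3, so adding MTRANS and the interaction helps, but only modestly, and most of the variability in weight remains unexplained.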
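
One way to judge whether MTRANS and the interaction add anything beyond family history alone is a partial F-test between the Solution 2 model and this one. The sketch below computes it from the two printed R-squared values, so it is accurate only to the rounding of those values; the variable names are illustrative.

```r
# Values copied from the two model summaries (rounded as printed)
r2_reduced <- 0.2468   # Weight ~ family_history_with_overweight
r2_full    <- 0.2783   # Weight ~ family_history_with_overweight * MTRANS
df_resid_reduced <- 2109
df_resid_full    <- 2101

# Number of extra coefficients in the full model (4 MTRANS + 4 interaction terms)
df_extra <- df_resid_reduced - df_resid_full

# Partial F: gain in explained variance per extra coefficient,
# relative to unexplained variance per residual degree of freedom
f_partial <- ((r2_full - r2_reduced) / df_extra) /
  ((1 - r2_full) / df_resid_full)

# Upper-tail probability of the F distribution gives the p-value
p_partial <- pf(f_partial, df_extra, df_resid_full, lower.tail = FALSE)

print(round(f_partial, 2))
print(p_partial)
```

The statistic is roughly 11.5 on 8 and 2101 degrees of freedom, so the extra terms are jointly significant even though the gain in R-squared is small. With the data loaded, calling anova() on the two fitted lm objects performs the same comparison from the exact residual sums of squares.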

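The presence of a family history of overweight is associated with higher weight: 14.672 units higher for the baseline transportation mode, and about 38.7 units higher (14.672 + 24.032) for public transportation users. Among the transportation modes, only Public_Transportation differs significantly from the baseline for individuals without a family history (-16.489, p = 1.32e-06); Walking is marginal (p = 0.0946), and Bike and Motorbike are not significant. The association between family history and weight is thus moderated by transportation mode, chiefly through public transportation. These are associations in observational data and do not show that transportation mode causes a change in weight.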
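
To make the interaction concrete, the fitted mean weight for each combination of family history and transportation mode is the sum of the coefficients that apply to it. The sketch below does this arithmetic for the baseline mode and for public transportation, using the estimates printed in the coefficient table; the variable names are illustrative, and "baseline" stands for whichever MTRANS level R used as the reference.

```r
# Estimates copied from the coefficient table of the interaction model
b0     <- 72.841    # (Intercept): no family history, baseline mode
b_fh   <- 14.672    # family_history_with_overweightyes
b_pt   <- -16.489   # MTRANSPublic_Transportation
b_fhpt <- 24.032    # family_history_with_overweightyes:MTRANSPublic_Transportation

# Fitted mean for each cell is the sum of the coefficients that apply to it
fitted_means <- c(
  no_history_baseline  = b0,
  yes_history_baseline = b0 + b_fh,
  no_history_public    = b0 + b_pt,
  yes_history_public   = b0 + b_fh + b_pt + b_fhpt
)

# Family-history gap within each transportation mode
gap_baseline <- fitted_means[["yes_history_baseline"]] -
  fitted_means[["no_history_baseline"]]
gap_public <- fitted_means[["yes_history_public"]] -
  fitted_means[["no_history_public"]]

print(round(fitted_means, 3))
print(round(c(gap_baseline = gap_baseline, gap_public = gap_public), 3))
```

The fitted means range from about 56.4 (no family history, public transportation) to about 95.1 (family history, public transportation), which is where the significant interaction term shows up.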

Jagriti Mahajan

2023-10-23