Chapter 5 - The Many Variables & the Spurious Waffles

This chapter introduced multiple regression, a way of constructing descriptive models for how the mean of a measurement is associated with more than one predictor variable. The defining question of multiple regression is: What is the value of knowing each predictor, once we already know the other predictors? The answer to this question does not by itself provide any causal information. Causal inference requires additional assumptions. Simple directed acyclic graph (DAG) models of causation are one way to represent those assumptions.

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them.

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.

Questions

5-1. Which of the linear models below are multiple linear regressions? \[\begin{align} {μ_i = α + βx_i} \tag{1}\\ μ_i = β_xx_i + β_zz_i \tag{2} \\ μ_i = α + β(x_i − z_i) \tag{3} \\ μ_i = α + β_xx_i + β_zz_i \tag{4} \\ \end{align}\]

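## Model (1) is a simple (bivariate) linear regression, not a multiple linear regression, because its definition contains a single predictor variable, x, with a single slope.

### Models (2) and (4) are the multiple linear regressions: each has two predictor variables, x and z, with a separate slope parameter for each (model (2) simply omits the intercept). Model (3) is not a multiple regression even though both x and z appear in it. It has only one slope, β, applied to the single derived predictor (x_i − z_i). Expanding it gives μ_i = α + βx_i − βz_i, which forces the two coefficients to be equal and opposite instead of estimating them separately.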

5-2. Write down a multiple regression to evaluate the claim: Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree. Write down the model definition and indicate which side of zero each slope parameter should be on.

## Multiple regression model for evaluating the above claim:

# T_i ~ Normal(μ_i, σ)

# μ_i = α + β_F F_i + β_S S_i

## In this model definition, "T" is the time to PhD degree, "F" is the amount of funding, and "S" is the size of the laboratory.

## The claim states that both predictors are positively associated with time to degree once the other is known, so both slopes, "β_F" and "β_S", should be on the positive side of zero. For neither predictor to look useful by itself, funding and laboratory size must also be negatively correlated with each other, so that each one masks the association of the other in a single-predictor regression.
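The masking pattern described in this answer can be illustrated with a small simulation. This is only a sketch on synthetic data: the effect sizes, the negative correlation of -0.9 between funding and lab size, and the helper names `fit_one`, `fit_two`, and `simulate_masking` are all assumptions made for illustration, not part of the assignment.

```python
import random

def fit_one(x, y):
    """Least-squares slope of y on a single predictor x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def fit_two(x, z, y):
    """Least-squares slopes (b_x, b_z) of y on x and z together."""
    n = len(y)
    mx, mz, my = sum(x) / n, sum(z) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((a - mz) ** 2 for a in z)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    szy = sum((a - mz) * (b - my) for a, b in zip(z, y))
    det = sxx * szz - sxz ** 2  # determinant of the 2x2 normal equations
    return (szz * sxy - sxz * szy) / det, (sxx * szy - sxz * sxy) / det

def simulate_masking(n=2000, seed=1):
    """Synthetic standardized data: F and S are negatively correlated (assumed),
    and both raise time to degree T with slope +1 (assumed)."""
    rng = random.Random(seed)
    F = [rng.gauss(0, 1) for _ in range(n)]
    S = [-0.9 * f + (1 - 0.9 ** 2) ** 0.5 * rng.gauss(0, 1) for f in F]
    T = [f + s + rng.gauss(0, 1) for f, s in zip(F, S)]
    return F, S, T

F, S, T = simulate_masking()
b_f, b_s = fit_two(F, S, T)
print("T ~ F alone, slope of F:", round(fit_one(F, T), 2))
print("T ~ S alone, slope of S:", round(fit_one(S, T), 2))
print("T ~ F + S, slopes:", round(b_f, 2), round(b_s, 2))
```

Under these assumptions the single-predictor slopes should come out close to zero, while both slopes in the joint model should come out clearly positive, which is the pattern the claim describes.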

5-3. It is sometimes observed that the best predictor of fire risk is the presence of firefighters— States and localities with many firefighters also have more fires. Presumably firefighters do not cause fires. Nevertheless, this is not a spurious correlation. Instead fires cause firefighters. Consider the same reversal of causal inference in the context of the divorce and marriage data. How might a high divorce rate cause a higher marriage rate? Can you think of a way to evaluate this relationship, using multiple regression?

# In the divorce and marriage context, a high divorce rate returns more people to the pool of unmarried adults who are available for remarriage, and those remarriages count toward the overall marriage rate. This is how a high divorce rate could cause a higher marriage rate. Age at divorce is likely to matter as well: it plausibly influences both the divorce rate and the marriage rate, and if people divorce at younger ages, the remarriage rate, and with it the overall marriage rate, should in theory be higher.

# One way to evaluate this with multiple regression is to add age at divorce to a model that predicts the marriage (or remarriage) rate from the divorce rate. If the divorce rate remains clearly associated with the marriage rate after age at divorce is included, that is consistent with the divorce rate driving part of the marriage rate. If the association disappears, it is more plausibly explained by age at divorce alone.


## For this scenario, with the estimand being the causal effect of the divorce rate on the marriage rate, one example of a multiple linear regression that estimates the direct effect of the divorce rate on the marriage rate, closing the fork through age at divorce by stratifying on it, is given below.

## In this multiple regression model, "A" is the age at divorce, "M" is the overall marriage rate (including remarriages), and "D" is the divorce rate.


#  M = f(A, D)

# M_i ~ Normal(μ_i, σ)

# μ_i = α + β_A A_i + β_D D_i
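To see what stratifying by age at divorce does in a regression of this form, the sketch below simulates a hypothetical world in which age at divorce drives both rates and the divorce rate has no direct effect on the marriage rate. All data are synthetic, and the coefficients and the helper names `slope_alone`, `slopes_joint`, and `simulate_fork` are illustrative assumptions.

```python
import random

def slope_alone(x, y):
    """Least-squares slope of y on a single predictor x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def slopes_joint(x, z, y):
    """Least-squares slopes (b_x, b_z) of y on x and z together."""
    n = len(y)
    mx, mz, my = sum(x) / n, sum(z) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((a - mz) ** 2 for a in z)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    szy = sum((a - mz) * (b - my) for a, b in zip(z, y))
    det = sxx * szz - sxz ** 2  # determinant of the 2x2 normal equations
    return (szz * sxy - sxz * szy) / det, (sxx * szy - sxz * sxy) / det

def simulate_fork(n=2000, seed=1):
    """Synthetic fork A -> D and A -> M, with NO direct D -> M effect (assumed)."""
    rng = random.Random(seed)
    A = [rng.gauss(0, 1) for _ in range(n)]            # age at divorce
    D = [-0.6 * a + rng.gauss(0, 0.8) for a in A]      # divorce rate
    M = [-0.7 * a + rng.gauss(0, 0.7) for a in A]      # marriage rate
    return A, D, M

A, D, M = simulate_fork()
b_a, b_d = slopes_joint(A, D, M)
print("M ~ D alone, slope of D:", round(slope_alone(D, M), 2))
print("M ~ A + D, slope of A:", round(b_a, 2), "slope of D:", round(b_d, 2))
```

In this simulated world the slope of D is clearly positive when D is the only predictor and shrinks toward zero once A is included. With real data, a slope of D that stays away from zero after including A would instead be consistent with the divorce rate contributing to the marriage rate.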

5-4. Suppose you have a single categorical predictor with 4 levels (unique values), labeled A, B, C and D. Let Ai be an indicator variable that is 1 where case i is in category A. Also suppose Bi, Ci, and Di for the other categories. Now which of the following linear models are inferentially equivalent ways to include the categorical variable in a regression? Models are inferentially equivalent when it’s possible to compute one posterior distribution from the posterior distribution of another model. \[\begin{align} μ_i = α + β_AA_i + β_BB_i + β_DD_i \tag{1}\\ μ_i = α + β_AA_i + β_BB_i + β_CC_i + β_DD_i \tag{2}\\ μ_i = α + β_BB_i + β_CC_i + β_DD_i \tag{3}\\ μ_i = α_AA_i + α_BB_i + α_CC_i + α_DD_i \tag{4}\\ μ_i = α_A(1 − B_i − C_i − D_i) + α_BB_i + α_CC_i + α_DD_i \tag{5}\\ \end{align}\]

## In the above scenario, models (1), (3), (4), and (5) are inferentially equivalent: the posterior distribution of any one of them can be computed from the posterior distribution of any other.

## Model # 1:

# Model (1) has a single intercept, which stands for category "C" (the omitted level), plus one slope each for categories "A", "B", and "D" that measures the difference from "C". It therefore describes all four category means.

## Model # 2:

# Model (2) includes an intercept and a slope for all four levels "A", "B", "C", and "D". That is one parameter more than the four category means can determine, so the model is non-identifiable: adding a constant to the intercept and subtracting it from every slope leaves the predictions unchanged. It is therefore not inferentially equivalent to the other models.

## Model # 3:

# Model (3) is inferentially equivalent to model (1). It also has a single intercept, here standing for category "A", plus one slope each for categories "B", "C", and "D".


## Model # 4, and 5:

# Models (4) and (5) are also inferentially equivalent to models (1) and (3). Model (4) uses the index (unique intercept) approach: a distinct intercept for each of the categories "A", "B", "C", and "D", and no slopes. Model (5) is the same model rewritten: it multiplies the intercept "α_A" by (1 − B_i − C_i − D_i), which equals 1 when case i belongs to category "A" and 0 otherwise, so that term is just the indicator A_i.
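The equivalence can be made concrete by mapping one parameterization onto another. The sketch below is an illustration with made-up parameter values and hypothetical function names: it converts a single draw of model (3)'s parameters into the parameters of models (1) and (4), and shows that two different parameter sets for model (2) give identical category means. Applying the same mapping to every posterior sample would convert a whole posterior.

```python
def means_model1(alpha, b_A, b_B, b_D):
    """Model (1): intercept is category C; slopes are differences from C."""
    return {"A": alpha + b_A, "B": alpha + b_B, "C": alpha, "D": alpha + b_D}

def means_model2(alpha, b_A, b_B, b_C, b_D):
    """Model (2): intercept plus a slope for every category (redundant)."""
    return {"A": alpha + b_A, "B": alpha + b_B, "C": alpha + b_C, "D": alpha + b_D}

def means_model3(alpha, b_B, b_C, b_D):
    """Model (3): intercept is category A; slopes are differences from A."""
    return {"A": alpha, "B": alpha + b_B, "C": alpha + b_C, "D": alpha + b_D}

def means_model4(a_A, a_B, a_C, a_D):
    """Models (4) and (5): one intercept per category.
    Model (5) is identical because 1 - B_i - C_i - D_i equals A_i."""
    return {"A": a_A, "B": a_B, "C": a_C, "D": a_D}

def model3_to_model1(alpha, b_B, b_C, b_D):
    """Re-express model (3) parameters with C as the reference category."""
    m = means_model3(alpha, b_B, b_C, b_D)
    return (m["C"], m["A"] - m["C"], m["B"] - m["C"], m["D"] - m["C"])

def model3_to_model4(alpha, b_B, b_C, b_D):
    """Re-express model (3) parameters as four category intercepts."""
    m = means_model3(alpha, b_B, b_C, b_D)
    return (m["A"], m["B"], m["C"], m["D"])

# One illustrative draw of model (3) parameters (values are made up)
draw3 = (2.0, 0.5, -1.0, 1.5)
same_1 = means_model1(*model3_to_model1(*draw3)) == means_model3(*draw3)
same_4 = means_model4(*model3_to_model4(*draw3)) == means_model3(*draw3)
print("model (1) reproduces model (3):", same_1)
print("model (4) reproduces model (3):", same_4)

# Model (2): shifting the intercept by a constant and every slope by the
# opposite amount leaves all category means unchanged.
shifted_same = means_model2(2.0, 0.0, 0.5, -1.0, 1.5) == means_model2(12.0, -10.0, -9.5, -11.0, -8.5)
print("two model (2) parameter sets give the same means:", shifted_same)
```

Because many parameter sets of model (2) give the same means, the data cannot tell them apart, which is why that model is set aside as non-identifiable.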

5-5. One way to reason through multiple causation hypotheses is to imagine detailed mechanisms through which predictor variables may influence outcomes. For example, it is sometimes argued that the price of gasoline (predictor variable) is positively associated with lower obesity rates (outcome variable). However, there are at least two important mechanisms by which the price of gas could reduce obesity. First, it could lead to less driving and therefore more exercise. Second, it could lead to less driving, which leads to less eating out, which leads to less consumption of huge restaurant meals. Can you outline one or more multiple regressions that address these two mechanisms? Assume you can have any predictor data you need.

## The scenario describes two mechanisms. Mechanism # 1: an increase in gas prices leads to less driving and therefore more exercise. Mechanism # 2: an increase in gas prices leads to less driving and therefore less consumption of huge restaurant meals.

## Additional predictors that could address these mechanisms include "Predictor_a", the average daily walking rate or daily calories burned, which should rise as higher gas prices reduce driving and so be associated with a lower obesity rate; and "Predictor_b", the average alcohol consumption, which may fall as less driving reduces access to alcohol, lowering the intake of empty calories and possibly supporting more balanced testosterone levels, and so also be associated with a lower obesity rate.

## Sample regression model for the scenario's hypothesis:

# In this model, "O" is the obesity rate, "E" is the exercise rate, and "R" is the rate of consumption of huge restaurant meals. If both mechanisms operate, "β_E" should be negative and "β_R" should be positive.
# O = f(E, R)

# O_i ~ Normal(μ_i, σ)

# μ_i = α + β_E E_i + β_R R_i
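One way to probe both mechanisms together is to compare a regression of obesity on gas price alone with one that also includes exercise and restaurant meals. The sketch below does this on synthetic data. The gas price variable "G", the causal chain, the effect sizes, and the helper names `ols` and `simulate_gas` are all assumptions introduced for illustration; they are not part of the answer above.

```python
import random

def ols(columns, y):
    """Least-squares coefficients [intercept, b_1, ...] from the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented system (X'X | X'y)
    A = [[sum(row[r] * row[c] for row in X) for c in range(k)]
         + [sum(row[r] * yi for row, yi in zip(X, y))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for p in range(k):
        pivot = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[pivot] = A[pivot], A[p]
        for r in range(k):
            if r != p:
                factor = A[r][p] / A[p][p]
                A[r] = [a - factor * b for a, b in zip(A[r], A[p])]
    return [A[r][k] / A[r][r] for r in range(k)]

def simulate_gas(n=2000, seed=1):
    """Synthetic chain (all effect sizes assumed):
    gas price -> driving -> exercise -> obesity, and
    gas price -> driving -> restaurant meals -> obesity."""
    rng = random.Random(seed)
    G = [rng.gauss(0, 1) for _ in range(n)]                 # gas price
    V = [-0.8 * g + rng.gauss(0, 0.6) for g in G]           # driving
    E = [-0.7 * v + rng.gauss(0, 0.7) for v in V]           # exercise
    R = [0.7 * v + rng.gauss(0, 0.7) for v in V]            # restaurant meals
    O = [-0.5 * e + 0.5 * r + rng.gauss(0, 0.7) for e, r in zip(E, R)]
    return G, E, R, O

G, E, R, O = simulate_gas()
total = ols([G], O)
full = ols([G, E, R], O)
print("O ~ G, slope of G:", round(total[1], 2))
print("O ~ G + E + R, slopes of G, E, R:", [round(b, 2) for b in full[1:]])
```

In this simulated setting the gas price has a negative slope when it is the only predictor, and that slope shrinks toward zero once exercise and restaurant meals are included, while the slopes of E and R keep the signs expected under the two mechanisms.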