The statement is partially correct. We do get a more accurate estimate of the effect of weight on health by replacing the marginal regression (health vs. weight) with a regression of health vs. the residuals from the regression of weight on height. This is because the predictor now measures excess weight, that is, the deviation from the weight expected for a given height, so the model better accounts for the correlation between weight and height.
However, we would get an even better estimate by regressing the residuals of health vs. height against excess weight (i.e., an added variable plot). This measures how healthy a person is for their height, given how heavy they are for their height. The main reason this is better is that in the first model the response is unadjusted, so height still has an (implicit) effect on it. In the second model we average out that effect by subtracting the expected health under the linear model of health vs. height.
Another strength of the AV plot is that it captures all the information in the multiple regression \( health \) vs. \( \{height, weight\} \) that is relevant to the effect of weight. It can be thought of as three projections: the weight vs. height fit projects to the x-axis of the AV plot, the health vs. height fit projects to the y-axis, and the \( health \) vs. \( \{weight, height\} \) plane projects to the line of best fit. Therefore the coefficient estimated from the AV plot is equivalent to the one we get from the multiple regression.
The same cannot be said of the first model (health vs. excess weight) - it might lead to a different conclusion from the multiple regression.
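As a quick check of this equivalence, here is a minimal R sketch on simulated data (the variable names and coefficient values are hypothetical, chosen only for illustration); the AV plot slope and the multiple regression coefficient for weight come out identical:
set.seed(1)
height <- rnorm(200, mean = 170, sd = 10)
weight <- 0.8 * height + rnorm(200, sd = 8) #weight correlated with height
health <- 50 - 0.3 * weight + 0.2 * height + rnorm(200, sd = 5)
excess.wt <- resid(lm(weight ~ height)) #x-axis of the AV plot
health.res <- resid(lm(health ~ height)) #y-axis of the AV plot
coef(lm(health.res ~ excess.wt))["excess.wt"] #slope of the AV plot
coef(lm(health ~ weight + height))["weight"] #multiple regression coefficient: identical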
We have different opinions on this question.
One point of view is that the measurement of coffee consumption is more important. In this research, the relationship between coffee and health is studied through quantitative data on caffeine and health. Common sense suggests simply recording how many cups of coffee each respondent consumes every day and measuring the effect of the number of cups on health. However, caffeine, the main substance in coffee that affects health, is not found only in coffee: a large variety of drinks, foods and even some workout supplements contain caffeine. So measuring the real “consumption” of coffee means computing the actual caffeine intake from everything the respondents consume, rather than just writing down how many cups of coffee they drink and what type of coffee it is. Otherwise, the outcomes may be far from our expectations and we may draw wrong conclusions. For example, we might find that people who drink more cups of coffee are healthier than those who drink fewer cups, when in fact the people who drink fewer cups take in much more caffeine from other sources, which is why they are less healthy.
However, it is difficult to measure caffeine consumption accurately, for two reasons. First, a wide range of drinks, foods and workout supplements contain caffeine (http://www.cspinet.org/new/cafchart.htm). Second, the stated caffeine content may not be accurate due to exaggeration by manufacturers.
Another view is that measuring “stress” is also important for analyzing the correlation between coffee and health. In this research, stress may be a confounding factor that correlates with both coffee consumption and health: a high level of stress may cause both greater coffee consumption and worse health. It may therefore lead us to overestimate the negative relationship between coffee intake and health. To preclude this misleading estimate, we have to measure the stress level accurately and try to eliminate its confounding effect. However, stress is hard to quantify and always involves inter-individual differences, so its measurement should take into account differences among respondents, measurement errors and any other influencing factors.
The statement is blatantly false. Classes with more students are going to have more people reporting the larger class size, whereas every class will have one professor reporting regardless of size. Hence, unless all classes are the same size, there will be a discrepancy in the two estimates of the mean.
More formally, let \( X \) be a random variable measuring the class size. Assuming \( n \) classes taught by \( n \) professors, each professor estimates the class size leading to \( n \) realizations of \( X \): \( x_1 \ldots x_n \). The professors' estimate of the class size is just the mean:
\[ \theta_1 = \frac{1}{n} \sum_{i=1}^n x_i \]
For a class with \( x_i \) students, however, the students report \( x_i \) estimates of the class size. The overall average of the students' estimates is therefore:
\[ \theta_2 = \frac{1}{N}\sum_{i=1}^{n} x_i^2, \quad \text{where } N = \sum_{i=1}^n x_i. \]
Therefore, the average of the students' estimates of class size will not generally equal the professors' estimates.
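In fact, the students' average is never smaller than the professors': by the Cauchy-Schwarz inequality,
\[
\theta_2 = \frac{\sum_{i=1}^n x_i^2}{\sum_{i=1}^n x_i} \ge \frac{\frac{1}{n}\left(\sum_{i=1}^n x_i\right)^2}{\sum_{i=1}^n x_i} = \frac{1}{n}\sum_{i=1}^n x_i = \theta_1,
\]
with equality only when all the class sizes \( x_i \) are equal.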
The above analysis assumes that both the students and the professors estimated the class size correctly each time. The conclusion still holds as long as the students' and professors' estimates have the same expectation, which is a reasonable assumption.
The statement is false; the discrepancy could have been caused by reasons other than a systematic underreporting of income.
By the law of large numbers (LLN), if the survey responses are independently and identically drawn from the population equity distribution, then the sample mean equity converges in probability to the population mean equity. This means that for a sufficiently large sample size, the probability that the sample mean differs substantially from the population mean is small.
There are two issues with this statement:
The survey responses may not be a truly random sample from the underlying distribution. The sample will inevitably exclude people who didn’t respond to the survey, or who didn’t get the survey in the first place. If there is a systematic difference in equity between this group and the respondents, the sample mean equity will be biased.
Even if the survey sample is truly random, the convergence in the LLN could be very slow, meaning that a larger sample size is needed. This is especially an issue if the population distribution is very positively skewed: most of the probability mass then lies to the left of the mean, so most sampled individuals will have below-average equity, and in a small sample the sample mean will typically fall below the population mean (even though it is unbiased in expectation).
A simple simulation illustrates the second pathology. Let's assume the population equity \( X \) follows a very positively skewed gamma distribution:
\[
X \sim Gamma(shape = 1/10, scale = 10^6)
\]
The mean of the gamma is the product of the shape and the scale, in this case \( 100,000 \). First let's simulate a typical sample, and plot the histogram together with the true density:
m.true <- 1e+05 #pop'n mean equity
fac <- 10 #scale factor
sam.eq <- rgamma(2000, shape = 1/fac, scale = fac * m.true)
hist(sam.eq, probability = TRUE, xlab = "Equity", main = "Equity Probability Distribution")
curve(dgamma(x, shape = 1/fac, scale = fac * m.true), add = TRUE, col = 2)
mean(sam.eq)
## [1] 89830
Now, let's try simulating samples of different sizes, from 10 to 2000, and compute the sample mean equity for each:
for (sim.num in 1:3) {
  m <- NULL
  sam.sizes <- c(seq(10, 90, by = 10), seq(100, 2000, by = 25))
  for (n in sam.sizes) {
    sam.eq <- rgamma(n, shape = 1/fac, scale = fac * m.true)
    m <- c(m, mean(sam.eq)) #sample mean for this sample size
  }
  plot(sam.sizes, m, type = "l", xlab = "Sample size", ylab = "Mean equity",
       main = paste("Sample mean equities (simulation #", sim.num, ")"))
  abline(a = m.true, b = 0, col = "blue") #true population mean, for reference
}
In the above plots, the (true) population mean equity is shown in blue. We can see that while the sample mean does converge as per LLN, for small \( n \), the sample mean can be a severe underestimate of the population mean equity.
The statement may be true or false, depending on the situation.
If \( x_2 \), \( x_3 \), or both are correlated with \( x_1 \), it may be unsafe to drop \( x_2 \) or \( x_3 \) based only on their p-values. Correlated predictors convey essentially similar information, and if both are included in the model, neither may contribute significantly. Correlation among predictors often results in statistically insignificant regression coefficients with large standard errors for one or both variables, so this correlation should be checked before dropping anything. One simple check is to look at the correlation between each predictor (\( x_1 \), \( x_2 \) and \( x_3 \)) and the dependent variable \( Y \).
If \( x_2 \) and \( x_3 \) are correlated with each other, or there is correlation among all three predictors, it may still be safe to drop \( x_2 \) and \( x_3 \), provided they are not significant and their confidence intervals are narrow.
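A small simulated example (all data and coefficients here are hypothetical) shows how correlated predictors can each look insignificant even though they matter jointly:
set.seed(42)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.05) #x2 nearly collinear with x1
x3 <- rnorm(100) #independent predictor
y <- x1 + x2 + x3 + rnorm(100)
summary(lm(y ~ x1 + x2 + x3)) #x1 and x2 get large standard errors and weak p-values
Dropping either \( x_1 \) or \( x_2 \) on the basis of its p-value alone would discard a predictor that clearly belongs in the model.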
Such a perfect result may not appear for the following reasons.
1) There are differences between the short-term and long-term physical condition of people who quit smoking. At the beginning of quitting, respondents may feel uneasy and uncomfortable; they may be tired, drowsy and irritable, and some may even become sick. These effects arise from being unaccustomed to not smoking, and they can create the false impression that people who quit smoking are less healthy than the group that continues smoking. So both short-term and long-term health data should be recorded and analyzed.
2) Even if more emphasis is placed on long-term health data, the outcome may still be far from our expectation. The reason is that people who quit smoking may take in other harmful substances as substitutes. One example is caffeine, which may also harm people's health: people who quit smoking may increase the number of cups of coffee they drink per day, which makes them less healthy. The analysis may then be misleading if such substitutes are not taken into account. Therefore, in observational data, the respondents' records should include as many details as possible.
3) Even if all the above factors are considered and excluded, averaging itself can produce an unexpected result. If only a few respondents who quit smoking have abnormal reactions, their scores may pull the group's average health down, making it appear that the group that quit smoking is less healthy. Such outliers should therefore be examined closely.
This statement holds in some situations but not in others.
If there is no multicollinearity among the predictors and the coefficient of a term is close to zero (with a narrow confidence interval), the coefficients of the other predictors will not change much if I drop it.
But if there is multicollinearity among the predictors, the statement may be wrong, because it disregards the influence of the removed variable. The coefficients of the remaining variables can be biased when a variable is removed from the regression, even if the removed variable's confidence interval is wide. A biased estimate of a remaining variable will of course change its coefficient value, and could change its p-value if the bias is large. This occurs when the omitted coefficient is large and the covariance between the omitted and remaining variables is large.
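The size of this omitted-variable bias can be written down explicitly. If the true model is \( Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \) and we drop \( x_2 \), then
\[
E[\hat{\beta}_1] = \beta_1 + \beta_2 \, \delta, \qquad \text{where } \delta \text{ is the slope from regressing } x_2 \text{ on } x_1,
\]
so the bias is large exactly when the omitted coefficient \( \beta_2 \) and the association between \( x_1 \) and \( x_2 \) are both large.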
We have different opinions on how to check the correlations of predictors and how to drop them. The first opinion is to regress the dependent variable on each predictor separately before running the multiple regression, and to compute the correlation between every pair of predictors. This shows which predictors are significantly related to the dependent variable and whether there are correlations among the predictors. The second opinion is to run the multiple regression and examine the coefficients and p-values of all the predictors. Before removing an insignificant predictor, one should check that its coefficient is likely to be zero or small (i.e., it has a narrow confidence interval). This helps prevent the remaining coefficients from becoming biased and inconsistent when a variable is omitted from the regression.
Stepwise regression is a widely used principled approach for variable selection, but it can run into severe problems if it is not used with caution.
Forward stepwise regression starts with the null model and adds one variable at a time, measuring its contribution with a metric such as AIC.
Since the effect of a variable is always measured relative to a given model, the procedure is highly sensitive to the order in which the variables are added, and will not generally give the same decision if the order is changed.
The optimal strategy is to try all possible models given a set of variables, but this is computationally very expensive: for \( D \) predictors, there are \( 2^D \) possible models. When \( D \) is very high, however, even the stepwise approach becomes prohibitively expensive.
We conclude then that there are two scenarios where the stepwise procedure is a good choice:
I. We have a good intuition about which predictors might be more informative, and hence the order is meaningful.
II. The predictor set is large enough to rule out the option of testing every model, but small enough so running the stepwise procedure is practical.
If these conditions aren’t met, other strategies should be employed to select variables.
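For concreteness, a forward stepwise fit in R might look like the following (the data are simulated and purely illustrative; step() uses AIC by default):
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
dat$y <- 2 * dat$x1 + 0.5 * dat$x2 + rnorm(100)
null.fit <- lm(y ~ 1, data = dat) #start from the null model
full.fit <- lm(y ~ x1 + x2 + x3, data = dat) #largest model considered
step(null.fit, scope = formula(full.fit), direction = "forward") #adds one term at a time by AIC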
Another point to consider is whether the question we are trying to answer is causal or predictive in nature. The above analysis assumed the question was causal. For a predictive question, a better strategy might be to split the data into training and test sets using cross-validation and compare the different models by their average prediction error on the test sets. The strength of this approach is that it is not sensitive to the order of the variables and is computationally efficient (it scales reasonably with \( D \)). Unlike the stepwise approach, it also does not require deciding whether to keep a variable using an (arbitrary) threshold on the p-value.
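As a minimal sketch of this predictive strategy (a hand-rolled 5-fold cross-validation on simulated data; everything here is hypothetical), we could compare candidate models by their average held-out squared error:
set.seed(2)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- dat$x1 + rnorm(100)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat))) #assign each row to a fold
cv.err <- function(form) {
  mean(sapply(1:k, function(i) {
    fit <- lm(form, data = dat[folds != i, ]) #train on the other folds
    mean((dat$y[folds == i] - predict(fit, dat[folds == i, ]))^2) #error on the held-out fold
  }))
}
cv.err(y ~ x1) #smaller model
cv.err(y ~ x1 + x2) #larger model; prefer whichever has the lower error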
It is true that it’s dangerous to interpret the main effects of a regression when you have an interaction term. The interaction term, \( XD \) for example, gives the conditions under which you can interpret the coefficient of \( X \) as the effect of \( X \) on \( Y \): namely, when \( D = 0 \). So you cannot interpret the main effect of \( X \) without first inspecting \( D \).
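Writing the model out makes this explicit:
\[
Y = \beta_0 + \beta_1 X + \beta_2 D + \beta_3 XD + \varepsilon
\quad\Rightarrow\quad
\frac{\partial\, E[Y]}{\partial X} = \beta_1 + \beta_3 D,
\]
so \( \beta_1 \) on its own is the effect of \( X \) on \( Y \) only when \( D = 0 \).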
It is not true that interpreting the main effect of \( X \), without interpreting \( D \), is always safe when \( XD \) is insignificant. \( XD \) may have a wide confidence interval that happens to include 0 (making the coefficient insignificant) but also includes values that are large relative to the size of \( X \). You would have to examine the full range of values within the confidence interval for \( XD \) before interpreting the effect of \( X \) on \( Y \) when \( D \neq 0 \).
If \( XD \) has a very wide confidence interval that includes zero, the best thing to do would be to collect more data to try to narrow down that confidence interval. If that is not feasible, we should include both models (for example, \( D=0 \) and \( D=1 \)) in the analysis.
This is not true. What we are discussing here is presumably an experiment in which test cases are separated into three groups: control (no treatment), treatment A, and treatment B (and potentially some other treatment groups that we will ignore). Let’s say treatment A is the worst treatment and treatment B is the best, and assume that no group receives multiple treatments. To estimate the effect of treatment A we perform a univariate regression of Health on a dummy variable (call it \( D_A \)) that indicates whether participants are in the control group or in group A; the coefficient of \( D_A \) then represents the effect of A on Health. An analogous dummy \( D_B \) gives the effect of B. The difference in the coefficients of \( D_A \) and \( D_B \) could be construed as the difference between the effects of A and B on health.
The difference in the coefficients is not the difference in the mean outcomes for A and B, however. Consider treatment A, and assume that the expected value of the Health outcomes for the control group is 0. If the coefficient of \( D_A \) were equal to the slope of the regression line, then the regression line would pass through the vertices of the data ellipse. As we saw in class, this does not occur: the regression line (dashed) has a lower slope than the line through the vertices (solid).
A similar result holds for B: the mean outcome is greater than the regression coefficient. The difference in the mean outcomes between A and B will therefore not give you the difference in the regression coefficients.
This is a well-known fallacy in statistics. The presence of a correlation between variables is irrelevant to interaction. Interaction has to do with the joint impact that two variables have on a third variable: how does the value of A affect the impact that B has on Y. It does not measure the impact that the value of A has on the value of B (which is what correlation measures).
This is true. Significance Magazine (linked from our course wiki) defines confounding as “results being affected by other factors which researchers cannot easily take into account”. The R.A. Fisher debate surrounding tobacco was a case of this: “the researchers who first found the link between smoking and lung cancer were for a long time – and incorrectly – accused of being misled by confounding factors.” (http://www.significancemagazine.org/details/webexclusive/1000531/The-effect-of-tobacco-smoking-within-minutes.html)
When a researcher suggests that \( X \to Y \), a hidden factor \( C \) would have to do the following for it to qualify as a confounding factor:
\[
\begin{cases}
C \to X \\
C \to Y
\end{cases} \]
A confounding factor is likely to occur only in observational data. In any event, for \( C \to X \) to be true, one would have to observe (or posit) a correlation between \( C \) and \( X \). So the question statement is true.
The statement is generally not true.
The interaction term measures how one predictor, say \( x_1 \), affects the response at different levels of another predictor, say \( x_2 \), and vice versa. The correlation between \( x_1 \) and \( x_2 \), on the other hand, only measures the linear relationship between the two. We could have a case where \( x_1 \) and \( x_2 \) were uncorrelated, but the effect of \( x_1 \) on the response was very different at each level of \( x_2 \).
We can show this with a very simple simulation. We generate two samples of 100 iid standard normal random variables (the predictors) and compute the response from a linear model with an interaction term, plus added Gaussian noise. As expected, the interaction term is significant:
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- x1 + x2 + x1 * x2 + rnorm(100, mean = 0, sd = 0.5)
fit1 <- lm(y ~ x1 + x2 + x1 * x2)
summary(fit1) #note the strong interaction
##
## Call:
## lm(formula = y ~ x1 + x2 + x1 * x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1663 -0.3480 0.0422 0.2606 1.4998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0847 0.0499 1.7 0.093 .
## x1 1.1759 0.0570 20.6 <2e-16 ***
## x2 0.9624 0.0483 19.9 <2e-16 ***
## x1:x2 1.0136 0.0477 21.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.492 on 96 degrees of freedom
## Multiple R-squared: 0.934, Adjusted R-squared: 0.932
## F-statistic: 454 on 3 and 96 DF, p-value: <2e-16
However, there is no correlation between the two predictors:
plot(x1, x2)
fit2 <- lm(x2 ~ x1)
abline(fit2, col = "blue")
summary(fit2) #no correlation
##
## Call:
## lm(formula = x2 ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4548 -0.7837 0.0242 0.5988 2.9115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.123 0.104 1.18 0.24
## x1 0.121 0.119 1.01 0.31
##
## Residual standard error: 1.04 on 98 degrees of freedom
## Multiple R-squared: 0.0103, Adjusted R-squared: 0.000251
## F-statistic: 1.02 on 1 and 98 DF, p-value: 0.314
Geometrically, adding an interaction term introduces curvature to the plane of best fit. A correlation, on the other hand, means that when the points are projected onto the \( x_1 \)-\( x_2 \) plane, they lie close to a line. The former does not imply the latter: in the situation above, a curved surface (i.e., a model with an interaction) is a very good fit even though the data project to an uncorrelated cloud of points.
Another point to consider is that the correlation only measures the linear relationship between two variables, whereas interactions can be nonlinear (e.g. \( x_1^2 x_2 \)). Thus, there could be a strong nonlinear interaction, but no strong correlation.
I would choose the appropriate method according to the situation, including the distribution of the final scores, the relationship between the other students' mid-term and final scores, and so on.
1) If the regression model fitted to the other students shows a negative relationship between mid-term and final exam scores, it suggests that students who scored lower on the mid-term tended to work harder and score higher on the final, so the final scores do depend on the mid-term grades. In that case I would choose method d to calculate the student's mid-term grade.
2) If there is a positive relationship between mid-term and final scores, it may indicate that diligent students who do well on the mid-term also perform well on the final. In that case the final score can predict the mid-term score through the same trend, and I would choose method a to calculate the student's mid-term grade.
3) If the distributions of the mid-term and final scores are both very close to normal and each student's relative standing is fairly consistent across the two exams, I would choose method c to estimate the student's mid-term score.
4) If the situation is more complicated (far from a normal distribution) and the relationship between the mid-term and final scores is unclear, I may take into account the results of methods a, b and d to decide the student's mid-term grade.