In the previous chapter we studied simple linear regression with a single explanatory variable, but the response usually depends on several explanatory (predictor) variables, which leads to multiple regression. We also restricted the response (outcome) variable to be numerical; in this chapter we additionally introduce logistic regression, which is used for predicting categorical outcomes with two levels (a binary response variable).
Rather than having just one predictor, we now want to determine the value of a continuous response variable \(y\) given the values of \(p > 1\) independent explanatory variables \(x_1, x_2,... x_p\) (predictors). The linear model is defined as:
\[\begin{equation} \hat{y} = E(Y|X_1=x_1,X_2=x_2,...X_p=x_p) = b_0 + b_1 x_1 +b_2 x_2+ ... + b_p x_p \end{equation}\] so that average predicted \(\hat{y}\) is a linear function of predictors \(x_1, x_2,... x_p\) with estimated regression coefficients (slopes) \(b_1, b_2,... b_p\). Note that each slope \(b_i\) is the average change in the response variable \(y\) corresponding to the unit change in variable \(x_i\) while all other explanatory variables are held constant.
In simple linear regression with only one predictor variable, the goal was to find the “line of best fit.” Now, the data is multidimensional and the relationship between response and predictors is a multidimensional plane (hyper-plane), which we will not be able to plot as a simple scatter-plot with a line as before. We want to find the hyper-plane that best fits the multivariate data in terms of minimizing the squared distance between the plane and given y-values. Specifically, the \(b_1, b_2,... b_p\) are found as the values that minimize the sum of squared errors (SSE) or residual sum of squares (RSS): \[\begin{equation} SSE=RSS=\sum_{i=1}^n [y_i -(b_0 + b_1 x_{1,i} +b_2 x_{2,i}+ ... + b_p x_{p,i})]^2 \end{equation}\]
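To make the least-squares criterion concrete, here is a minimal sketch (using the built-in mtcars data, chosen purely for illustration) that computes the SSE of a fitted two-predictor model by hand and checks that perturbing one of the least-squares slopes can only increase it.
# Minimal sketch (built-in mtcars data, for illustration only):
# compute the SSE of a fitted two-predictor model by hand
fit = lm(mpg ~ hp + wt, data = mtcars)
b = coef(fit)                                      # b0, b1, b2
yhat = b[1] + b[2]*mtcars$hp + b[3]*mtcars$wt      # fitted values by hand
SSE = sum((mtcars$mpg - yhat)^2)
c(by.hand = SSE, from.lm = sum(residuals(fit)^2))  # the two SSE values agree
# any other slope for hp can only increase the SSE
SSE.alt = sum((mtcars$mpg - (b[1] + (b[2]+0.01)*mtcars$hp + b[3]*mtcars$wt))^2)
SSE.alt > SSE                                      # TRUE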
The assumption analysis is in many ways similar to simple regression. There is, however, one additional important and subtle assumption for multiple regression - independence of the predictor variables from each other. If the variables are dependent, they are called collinear. Such variables are strongly correlated, which makes it difficult to separate out the individual effects of collinear variables on the response. Collinearity causes the standard error (SE) of each coefficient \(b_i\) to grow. The t-statistic for each predictor is \(b_i/SE\), so \(t\) declines and the p-value grows; therefore we may fail to reject H0 : \(\beta_i = 0\). This means that the power of the hypothesis test (the probability of correctly detecting a non-zero coefficient) is reduced by collinearity. In short, collinearity reduces the accuracy of the estimates of the regression coefficients.
A first step to detect collinearity is to consider the correlation matrix of the predictors. High correlation would indicate a collinearity problem in the data. However, it is possible for collinearity to exist between three or more variables even if pairwise correlations are moderate - multicollinearity. A better way to assess multicollinearity is to compute the variance inflation factor (VIF) which is the ratio of the variance of \(b_j\) when fitting the full model divided by the variance of \(b_j\) if fit on its own. The smallest possible VIF is 1 if no collinearity is observed. As a rule of thumb, a VIF value that exceeds 5 indicates a problematic amount of collinearity. If collinearity is high, we can drop or combine such variables since they are redundant in the presence of the other variables.
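To see what the VIF measures, here is a brief sketch (using the built-in mtcars data, an illustrative choice): \(VIF_j = 1/(1-R^2_j)\), where \(R^2_j\) comes from regressing predictor \(j\) on the remaining predictors; the car package's vif() function, used later in this chapter, automates the computation.
# Sketch of the VIF definition (built-in mtcars data, for illustration):
# regress one predictor on the others and convert its R^2 to a VIF
r2.wt = summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
1/(1 - r2.wt)                                  # VIF for wt computed by hand
library(car)
vif(lm(mpg ~ wt + hp + disp, data = mtcars))   # VIFs for all predictors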
High VIF values, however, are not always a big problem. For example, later in this chapter we consider polynomial regression with variables \(x\), \(x^2\), \(x^3\), which have high VIFs, but the p-values in this case are not affected by the multicollinearity. Analogously, if the model includes the interaction of two variables \(x\), \(y\), written as the product \(x \cdot y\), the VIFs are high but, again, the p-values are not affected. Finally, in the case of a categorical variable with three or more categories, if the proportion of cases in the reference category is small, the indicator variables will necessarily have high VIFs. The solution is simply to change the reference category.
Suppose, for example, that a marital status variable has three categories: currently married, never married, and formerly married. You choose formerly married as the reference category, with indicator variables for the other two. What happens is that the correlation between those two indicators gets more negative as the fraction of people in the reference category gets smaller. For example, if 45 percent of people are never married, 45 percent are married, and 10 percent are formerly married, the VIFs for the married and never-married indicators will be at least 3.0.
Is this a problem? Well, it does mean that p-values for the indicator variables may be high. But the overall test that all indicators have coefficients of zero is unaffected by the high VIFs. And nothing else in the regression is affected. If you really want to avoid the high VIFs, just choose a reference category with a larger fraction of the cases. That may be desirable in order to avoid situations where none of the individual indicators is statistically significant even though the overall set of indicators is significant.
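A small sketch of how the reference category can be changed in R with relevel() (the marital factor below is a hypothetical stand-in for the example above):
# Hypothetical factor mirroring the marital status example above
marital = factor(c("formerly","married","never","married","never","married"))
levels(marital)                               # "formerly" is the reference (alphabetical order)
marital2 = relevel(marital, ref = "married")  # switch the reference to the largest category
levels(marital2)                              # "married" is now the reference level
# a model such as lm(y ~ marital2, ...) would now code indicators against "married"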
Note that we used \(r^2\) in simple regression as the fraction of variability in the response that was explained by the linear model: \[ r^2 = 1 - \frac{variability \ in \ residuals}{variability \ in \ outcome} = 1-\frac{SSE}{SST} = 1-\frac{\sum (y_i-\hat{y}_i)^2}{\sum (y_i-\overline{y})^2} \] This strategy is acceptable for just one predictor variable; however, for many predictors, \(r^2\) is a biased estimate of the variability explained by the model when applied to a new sample of data. \(r^2\) always increases when variables are added, whether or not the new variables are related to the response. To fix this, an adjusted \(r^2\) is introduced. \[\begin{equation} r^2_{adj} = 1-\frac{\frac{SSE}{n-p-1}}{\frac{SST}{n-1}} \end{equation}\] where \(n\) is the number of cases used to fit the model and \(p\) is the number of predictor variables in the model, i.e. each variance is divided by its degrees of freedom. To maximize \(r^2_{adj}\), we need to minimize \(\frac{SSE}{n-p-1}\). The \(SSE\) always decreases as we add new variables to the model (significant or not), but \(\frac{SSE}{n-p-1}\) may increase or decrease, due to the increasing \(p\) in the denominator. Once all of the relevant variables have been included in the model, adding additional noise variables results in only a very small decrease in \(SSE\), so \(\frac{SSE}{n-p-1}\) will increase, resulting in a decreasing \(r^2_{adj}\). Therefore, the \(r^2_{adj}\) statistic pays a price for the inclusion of unnecessary variables in the model, and such variables would not be included. In addition, since we are generally interested in making predictions for new data, the unadjusted \(r^2\) tends to be overly optimistic, while the adjusted \(r^2_{adj}\) helps to correct this bias.
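As a quick check of the formula, the sketch below (built-in mtcars data, for illustration) computes \(r^2\) and \(r^2_{adj}\) from SSE and SST directly and compares them with the values reported by summary():
# Sketch: adjusted r^2 from its definition (built-in mtcars data)
fit = lm(mpg ~ hp + wt, data = mtcars)
n = nrow(mtcars); p = 2                      # n cases, p predictors
SSE = sum(residuals(fit)^2)
SST = sum((mtcars$mpg - mean(mtcars$mpg))^2)
c(r2     = 1 - SSE/SST,
  r2.adj = 1 - (SSE/(n - p - 1))/(SST/(n - 1)))
c(summary(fit)$r.squared, summary(fit)$adj.r.squared)   # same values from lm()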
Example
Investigate a multiple linear regression model of cesd on some likely predictors in the HELPrct data file.
The code below fits a linear model of the cesd depression score on several other variables. The variables age, mcs (mental score), pcs (physical score), and pss_fr (perceived support system, friends) are numerical, while sex, homeless, and substance are categorical and must be converted to indicator variables against reference levels. The diagnostic figures show no pattern in the residuals, and the constant-variability assumption is satisfied because there is no funneling of the residuals. The QQ plot shows that the residuals are almost perfectly normally distributed (no outliers at all, let alone extreme ones). The assumption of independent observations (no time series) is also satisfied (the Durbin-Watson statistic is close to 2). The Residuals vs. Leverage plot likewise shows no extreme behavior. Most importantly, the correlations of the numerical variables and the variance inflation factors show no sign of collinearity among the explanatory variables. Thus, the linear model is applicable.
# load dataset
url="https://raw.githubusercontent.com/leonkag/Statistics0/main/HELPrct.csv"
mydata = read.csv(url,stringsAsFactors = TRUE) # save as mydata file
cat('\nData Frame dimensions: ',dim(mydata))
Data Frame dimensions: 453 31
mydata = mydata[,c('cesd','age','mcs','pcs','pss_fr','homeless','substance','sex')];
cat('\nAfter choosing variables for our study, the data Frame dimensions: ',dim(mydata))
After choosing variables for our study, the data Frame dimensions: 453 8
summary(mydata)
cesd age mcs pcs
Min. : 1.00 Min. :19.00 Min. : 6.763 Min. :14.07
1st Qu.:25.00 1st Qu.:30.00 1st Qu.:21.676 1st Qu.:40.38
Median :34.00 Median :35.00 Median :28.602 Median :48.88
Mean :32.85 Mean :35.65 Mean :31.677 Mean :48.05
3rd Qu.:41.00 3rd Qu.:40.00 3rd Qu.:40.941 3rd Qu.:56.95
Max. :60.00 Max. :60.00 Max. :62.175 Max. :74.81
pss_fr homeless substance sex
Min. : 0.000 homeless:209 alcohol:177 female:107
1st Qu.: 3.000 housed :244 cocaine:152 male :346
Median : 7.000 heroin :124
Mean : 6.706
3rd Qu.:10.000
Max. :14.000
The correlation coefficients:
library(Hmisc)
rcorr(as.matrix( subset(mydata,select=c(cesd,age, mcs,pcs,pss_fr)) ))
cesd age mcs pcs pss_fr
cesd 1.00 0.01 -0.68 -0.29 -0.18
age 0.01 1.00 0.04 -0.23 0.08
mcs -0.68 0.04 1.00 0.11 0.14
pcs -0.29 -0.23 0.11 1.00 0.08
pss_fr -0.18 0.08 0.14 0.08 1.00
n= 453
P
cesd age mcs pcs pss_fr
cesd 0.8590 0.0000 0.0000 0.0000
age 0.8590 0.3434 0.0000 0.0884
mcs 0.0000 0.3434 0.0187 0.0033
pcs 0.0000 0.0000 0.0187 0.1038
pss_fr 0.0000 0.0884 0.0033 0.1038
We can visualize the correlations with correlograms:
# install.packages("corrgram")
library(corrgram)
corrgram(subset(mydata,select=c(cesd,age, mcs,pcs,pss_fr)),
order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt)
Also, the pairs plot can be useful:
library(GGally)
ggpairs(subset(mydata,select=c(cesd,age, mcs,pcs,pss_fr)))
Linear model:
mymodel = lm('cesd~age+mcs+pcs+pss_fr+homeless+substance+sex', data=mydata)
summary(mymodel)
Call:
lm(formula = "cesd~age+mcs+pcs+pss_fr+homeless+substance+sex",
data = mydata)
Residuals:
Min 1Q Median 3Q Max
-26.5873 -6.0981 0.1166 5.7769 26.0104
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 70.35402 3.32954 21.130 < 2e-16 ***
age -0.04944 0.05661 -0.873 0.38299
mcs -0.61901 0.03283 -18.857 < 2e-16 ***
pcs -0.22871 0.04050 -5.647 2.91e-08 ***
pss_fr -0.24555 0.10508 -2.337 0.01989 *
homelesshoused -0.10293 0.84835 -0.121 0.90349
substancecocaine -2.72873 0.99994 -2.729 0.00661 **
substanceheroin -2.10500 1.06741 -1.972 0.04922 *
sexmale -2.55642 0.98619 -2.592 0.00985 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.624 on 444 degrees of freedom
Multiple R-squared: 0.5335, Adjusted R-squared: 0.5251
F-statistic: 63.47 on 8 and 444 DF, p-value: < 2.2e-16
confint(mymodel)
2.5 % 97.5 %
(Intercept) 63.8104059 76.897641431
age -0.1606894 0.061819425
mcs -0.6835285 -0.554499955
pcs -0.3082977 -0.149112469
pss_fr -0.4520570 -0.039034926
homelesshoused -1.7702093 1.564357074
substancecocaine -4.6939362 -0.763514382
substanceheroin -4.2027999 -0.007204202
sexmale -4.4946027 -0.618234214
Variance inflation factors
library(car)
vif(mymodel)
GVIF Df GVIF^(1/(2*Df))
age 1.157716 1 1.075972
mcs 1.079507 1 1.038993
pcs 1.159264 1 1.076691
pss_fr 1.070903 1 1.034844
homeless 1.089299 1 1.043695
substance 1.210434 2 1.048903
sex 1.068666 1 1.033763
Diagnostic Plots:
par(mfrow=c(2,2))
plot(mymodel)
par(mfrow=c(1,1))
The overall ANOVA F-statistic test is highly significant. For the multiple regression, a hypothesis test is performed for each coefficient (slope):
H0: \(\beta_j = 0\) - zero coefficient (slope) for \(x_j\), no significant relationship.
H1: \(\beta_j \ne 0\) - non-zero coefficient (slope) for \(x_j\), significant relationship.
The output provides a t-test and p-value for each variable and for each indicator variable of the categorical predictors. Some of them are significant, others are not. The 95% confidence intervals for the coefficients (slopes) confirm the t-tests, with the significant CIs not containing 0.
Let's remove the predictor with the highest p-value - homeless. Age is also non-significant, but we have to remove variables one at a time and examine the summary after each removal, because all the coefficients and their p-values change each time.
mymodel2 = update(mymodel, . ~ . - homeless)
summodel2 = summary(mymodel2); summodel2
Call:
lm(formula = cesd ~ age + mcs + pcs + pss_fr + substance + sex,
data = mydata)
Residuals:
Min 1Q Median 3Q Max
-26.5458 -6.1094 0.0646 5.7992 25.9648
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 70.32743 3.31864 21.192 < 2e-16 ***
age -0.04915 0.05650 -0.870 0.38479
mcs -0.61926 0.03273 -18.922 < 2e-16 ***
pcs -0.22906 0.04035 -5.678 2.47e-08 ***
pss_fr -0.24744 0.10379 -2.384 0.01754 *
substancecocaine -2.74372 0.99117 -2.768 0.00587 **
substanceheroin -2.12466 1.05387 -2.016 0.04439 *
sexmale -2.54439 0.98011 -2.596 0.00974 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.615 on 445 degrees of freedom
Multiple R-squared: 0.5335, Adjusted R-squared: 0.5261
F-statistic: 72.69 on 7 and 445 DF, p-value: < 2.2e-16
confint(mymodel2)
2.5 % 97.5 %
(Intercept) 63.8052804 76.84957305
age -0.1601861 0.06188454
mcs -0.6835795 -0.55493905
pcs -0.3083555 -0.14977316
pss_fr -0.4514275 -0.04345840
substancecocaine -4.6916867 -0.79576014
substanceheroin -4.1958390 -0.05348082
sexmale -4.4706196 -0.61816959
Age is still not significant, so we remove it as well and obtain a model in which all p-values are < 0.05, i.e., all variables are significant.
mymodel3 = update(mymodel2, . ~ . - age)
summodel3 = summary(mymodel3); summodel3
Call:
lm(formula = cesd ~ mcs + pcs + pss_fr + substance + sex, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-26.2617 -6.1196 0.2528 5.5920 25.4253
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.15424 2.18417 31.204 < 2e-16 ***
mcs -0.62044 0.03269 -18.979 < 2e-16 ***
pcs -0.22088 0.03922 -5.632 3.17e-08 ***
pss_fr -0.25717 0.10316 -2.493 0.01303 *
substancecocaine -2.58936 0.97490 -2.656 0.00819 **
substanceheroin -1.88331 1.01642 -1.853 0.06456 .
sexmale -2.52852 0.97968 -2.581 0.01017 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.612 on 446 degrees of freedom
Multiple R-squared: 0.5327, Adjusted R-squared: 0.5264
F-statistic: 84.73 on 6 and 446 DF, p-value: < 2.2e-16
confint(mymodel3)
2.5 % 97.5 %
(Intercept) 63.8616930 72.44678630
mcs -0.6846868 -0.55619276
pcs -0.2979585 -0.14379448
pss_fr -0.4599081 -0.05442439
substancecocaine -4.5053251 -0.67339988
substanceheroin -3.8808841 0.11425842
sexmale -4.4538723 -0.60316392
There is also a somewhat more formal way to compare such nested models using a partial F-statistic. Assume we have one model nested within the other: \[ \hat{y}_{full} = b_0+b_1x_1+ b_2x_2+... +b_px_p+... +b_qx_q \] \[ \hat{y}_{part} = b_0+b_1x_1+ b_2x_2+... +b_px_p\] The partial F-statistic is defined as: \[\begin{equation} F_{partial} = \frac{(R^2_{full}-R^2_{part})(n-q-1)}{(1-R^2_{full})(q-p)} (\#eq:PartialF) \end{equation}\] where the \(R^2\) denote the respective coefficients of determination and \(n\) is the sample size of the data. \(F_{partial}\) follows an F-distribution with \(df_1 = q - p\), \(df_2 = n - q - 1\) degrees of freedom. The anova() function runs this test. The p-value is given by the right tail of this F-distribution; if it is significant, it indicates that the improvement in goodness-of-fit is large enough to make the additional complexity worth it. We illustrate it with the code below:
anova(mymodel,mymodel2,mymodel3)
Analysis of Variance Table
Model 1: cesd ~ age + mcs + pcs + pss_fr + homeless + substance + sex
Model 2: cesd ~ age + mcs + pcs + pss_fr + substance + sex
Model 3: cesd ~ mcs + pcs + pss_fr + substance + sex
Res.Df RSS Df Sum of Sq F Pr(>F)
1 444 33024
2 445 33025 -1 -1.095 0.0147 0.9035
3 446 33081 -1 -56.167 0.7552 0.3853
As expected, mymodel2 and mymodel3 are not significantly different from mymodel.
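As a check, the same comparison of the full model mymodel with the reduced mymodel3 can be done directly from the partial F formula above (a sketch; here \(q = 8\) and \(p = 6\) count the fitted predictor terms):
# Sketch: partial F computed from the definition (full mymodel vs. reduced mymodel3)
r2.full = summary(mymodel)$r.squared
r2.part = summary(mymodel3)$r.squared
n = nrow(mydata); q = 8; p = 6
F.partial = ((r2.full - r2.part)*(n - q - 1)) / ((1 - r2.full)*(q - p))
F.partial
pf(F.partial, df1 = q - p, df2 = n - q - 1, lower.tail = FALSE)  # p-value
# anova(mymodel3, mymodel) reports the same F statistic and p-value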
Now, let’s remove one of the variables with low p-value, say pss_fr:
mymodel4 = update(mymodel3, . ~ . - pss_fr)
anova(mymodel,mymodel2,mymodel3,mymodel4)
Analysis of Variance Table
Model 1: cesd ~ age + mcs + pcs + pss_fr + homeless + substance + sex
Model 2: cesd ~ age + mcs + pcs + pss_fr + substance + sex
Model 3: cesd ~ mcs + pcs + pss_fr + substance + sex
Model 4: cesd ~ mcs + pcs + substance + sex
Res.Df RSS Df Sum of Sq F Pr(>F)
1 444 33024
2 445 33025 -1 -1.09 0.0147 0.90349
3 446 33081 -1 -56.17 0.7552 0.38532
4 447 33542 -1 -460.94 6.1972 0.01316 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Removing pss_fr, however, does make a significant difference, so it should be kept in the model.
As we were removing insignificant variables, the adjusted \(r^2_{adj}\) increased very slightly, \(0.5251 < 0.5261 < 0.5264\), which indicates a (modest) improvement of the model. Note that the other measures, \(AIC\) and \(BIC\), explained in the next section, change more noticeably.
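For reference, base R can report these criteria directly for the fitted nested models (a quick sketch; lower values indicate the preferred model):
# AIC and BIC for the nested models; the smallest value indicates the preferred model
AIC(mymodel, mymodel2, mymodel3)
BIC(mymodel, mymodel2, mymodel3)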
Note that even though substance:heroin has a p-value slightly above 0.05, substance:cocaine has a p-value < 0.05, so we cannot remove the substance variable. Using this model, we can illustrate the meaning of the coefficients (slopes). If all other variables are kept constant, males have average cesd scores \(2.53\) below the reference level of females. Analogously, with all other variables kept constant, cocaine abusers have an average cesd score \(2.59\) below the reference level of alcohol abusers, and heroin abusers have an average cesd score \(1.88\) below the same reference level. Each extra unit of mcs score reduces the average cesd score by \(0.62\), again with all other variables kept constant, etc.
We can also use this reduced model to predict cesd score for, say, the male below:
newdata = data.frame(sex="male",substance="alcohol",mcs=30,pcs = 34, pss_fr=10)
predict(mymodel3, newdata, interval = "prediction")
fit lwr upr
1 36.93106 19.89935 53.96278
predict(mymodel3, newdata, interval = "confidence")
fit lwr upr
1 36.93106 35.03477 38.82736
Example
Investigate a multiple linear regression model of WT on some likely predictors in the MHEALTH data file.
The code below fits a linear model of the weight WT on several predictor variables. This time all the variables are numerical, so no indicator variables are needed. The diagnostic figures show a nonlinear pattern in the residuals, which invalidates the linear model. In addition, there is high correlation between WAIST and BMI (among others), and the resulting VIF coefficients exceed 5 for WAIST and BMI. There is no unique answer as to which variable to remove first, but in this case it should be clear that keeping both WAIST and BMI, which is computed directly from WT, is not reasonable, so we remove BMI first. In addition, the QQ plot shows that the residuals are not normally distributed. The assumption of independent observations (no time series) is satisfied. The Residuals vs. Leverage plot shows no extreme behavior.
# MHEALTH wage multiple regression
rm(list=ls())
url="https://raw.githubusercontent.com/leonkag/Statistics0/main/MHEALTH.csv"
mydata = read.csv(url,stringsAsFactors = TRUE) # save as mydata file
cat('\nData Frame dimensions: ',dim(mydata))
Data Frame dimensions: 40 14
mydata = subset(mydata, select = -c(MALE)); # remove the MALE column
cat('\nAfter choosing variables for our study, the data Frame dimensions: ',dim(mydata))
After choosing variables for our study, the data Frame dimensions: 40 13
summary(mydata)
AGE HT WT WAIST
Min. :17.00 Min. :61.30 Min. :119.5 Min. : 75.20
1st Qu.:25.75 1st Qu.:66.30 1st Qu.:152.4 1st Qu.: 84.38
Median :32.50 Median :68.30 Median :169.9 Median : 91.20
Mean :35.48 Mean :68.33 Mean :172.6 Mean : 91.28
3rd Qu.:44.50 3rd Qu.:70.08 3rd Qu.:189.6 3rd Qu.: 99.90
Max. :73.00 Max. :76.20 Max. :237.1 Max. :108.70
PULSE SYS DIAS CHOL
Min. :56.0 Min. : 95.0 Min. :44.00 Min. : 31.0
1st Qu.:60.0 1st Qu.:111.5 1st Qu.:67.50 1st Qu.: 163.8
Median :66.0 Median :117.0 Median :75.00 Median : 282.5
Mean :69.4 Mean :118.9 Mean :73.22 Mean : 395.2
3rd Qu.:76.0 3rd Qu.:125.0 3rd Qu.:81.00 3rd Qu.: 619.2
Max. :96.0 Max. :153.0 Max. :87.00 Max. :1252.0
BMI LEG ELBOW WRIST ARM
Min. :19.60 Min. :36.00 Min. :6.500 Min. :5.2 Min. :25.90
1st Qu.:23.73 1st Qu.:40.73 1st Qu.:7.000 1st Qu.:5.6 1st Qu.:30.70
Median :26.20 Median :42.65 Median :7.300 Median :5.8 Median :32.05
Mean :26.00 Mean :42.57 Mean :7.295 Mean :5.8 Mean :32.39
3rd Qu.:27.50 3rd Qu.:44.42 3rd Qu.:7.500 3rd Qu.:6.0 3rd Qu.:33.77
Max. :33.20 Max. :48.40 Max. :8.300 Max. :6.7 Max. :41.10
# The correlation coefficients:
library(Hmisc)
rcorr(as.matrix( mydata ))
AGE HT WT WAIST PULSE SYS DIAS CHOL BMI LEG ELBOW WRIST
AGE 1.00 0.29 0.29 0.43 -0.12 0.39 0.52 -0.02 0.14 -0.05 0.23 0.20
HT 0.29 1.00 0.52 0.27 -0.13 0.15 0.17 -0.17 -0.09 0.60 0.53 0.24
WT 0.29 0.52 1.00 0.89 0.06 0.35 0.39 -0.03 0.80 0.37 0.64 0.52
WAIST 0.43 0.27 0.89 1.00 0.14 0.41 0.53 0.10 0.86 0.15 0.41 0.55
PULSE -0.12 -0.13 0.06 0.14 1.00 0.19 -0.02 0.09 0.18 0.05 -0.09 -0.02
SYS 0.39 0.15 0.35 0.41 0.19 1.00 0.55 0.02 0.30 -0.10 0.29 0.28
DIAS 0.52 0.17 0.39 0.53 -0.02 0.55 1.00 0.06 0.33 -0.01 0.10 0.18
CHOL -0.02 -0.17 -0.03 0.10 0.09 0.02 0.06 1.00 0.11 -0.05 -0.19 -0.05
BMI 0.14 -0.09 0.80 0.86 0.18 0.30 0.33 0.11 1.00 0.01 0.36 0.43
LEG -0.05 0.60 0.37 0.15 0.05 -0.10 -0.01 -0.05 0.01 1.00 0.24 -0.08
ELBOW 0.23 0.53 0.64 0.41 -0.09 0.29 0.10 -0.19 0.36 0.24 1.00 0.45
WRIST 0.20 0.24 0.52 0.55 -0.02 0.28 0.18 -0.05 0.43 -0.08 0.45 1.00
ARM 0.07 0.19 0.82 0.73 -0.03 0.18 0.27 0.09 0.82 0.20 0.53 0.45
ARM
AGE 0.07
HT 0.19
WT 0.82
WAIST 0.73
PULSE -0.03
SYS 0.18
DIAS 0.27
CHOL 0.09
BMI 0.82
LEG 0.20
ELBOW 0.53
WRIST 0.45
ARM 1.00
n= 40
P
AGE HT WT WAIST PULSE SYS DIAS CHOL BMI LEG
AGE 0.0708 0.0742 0.0058 0.4569 0.0124 0.0006 0.9250 0.3986 0.7572
HT 0.0708 0.0005 0.0938 0.4159 0.3428 0.2971 0.2910 0.5846 0.0000
WT 0.0742 0.0005 0.0000 0.7322 0.0259 0.0134 0.8739 0.0000 0.0183
WAIST 0.0058 0.0938 0.0000 0.3825 0.0081 0.0005 0.5298 0.0000 0.3410
PULSE 0.4569 0.4159 0.7322 0.3825 0.2415 0.8927 0.5931 0.2641 0.7452
SYS 0.0124 0.3428 0.0259 0.0081 0.2415 0.0002 0.9138 0.0572 0.5536
DIAS 0.0006 0.2971 0.0134 0.0005 0.8927 0.0002 0.6933 0.0349 0.9389
CHOL 0.9250 0.2910 0.8739 0.5298 0.5931 0.9138 0.6933 0.4981 0.7372
BMI 0.3986 0.5846 0.0000 0.0000 0.2641 0.0572 0.0349 0.4981 0.9493
LEG 0.7572 0.0000 0.0183 0.3410 0.7452 0.5536 0.9389 0.7372 0.9493
ELBOW 0.1567 0.0004 0.0000 0.0090 0.6020 0.0699 0.5278 0.2368 0.0228 0.1345
WRIST 0.2088 0.1369 0.0006 0.0002 0.8877 0.0817 0.2584 0.7414 0.0057 0.6286
ARM 0.6479 0.2493 0.0000 0.0000 0.8374 0.2655 0.0918 0.5786 0.0000 0.2117
ELBOW WRIST ARM
AGE 0.1567 0.2088 0.6479
HT 0.0004 0.1369 0.2493
WT 0.0000 0.0006 0.0000
WAIST 0.0090 0.0002 0.0000
PULSE 0.6020 0.8877 0.8374
SYS 0.0699 0.0817 0.2655
DIAS 0.5278 0.2584 0.0918
CHOL 0.2368 0.7414 0.5786
BMI 0.0228 0.0057 0.0000
LEG 0.1345 0.6286 0.2117
ELBOW 0.0032 0.0005
WRIST 0.0032 0.0038
ARM 0.0005 0.0038
library(corrgram)
corrgram(mydata, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt)
library(GGally)
ggpairs(subset(mydata,select=c(AGE,HT,WT,WAIST,BMI)))
# Linear model
mymodel = lm(WT~AGE+HT+WAIST+BMI+PULSE, data=mydata)
summary(mymodel)
Call:
lm(formula = WT ~ AGE + HT + WAIST + BMI + PULSE, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-5.7891 -0.8003 0.1076 0.8003 5.2434
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -349.75463 9.56153 -36.579 <2e-16 ***
AGE -0.01881 0.03026 -0.621 0.5384
HT 5.18833 0.14743 35.191 <2e-16 ***
WAIST 0.02807 0.10395 0.270 0.7887
BMI 6.51491 0.26826 24.286 <2e-16 ***
PULSE -0.05053 0.02972 -1.700 0.0983 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.002 on 34 degrees of freedom
Multiple R-squared: 0.995, Adjusted R-squared: 0.9942
F-statistic: 1341 on 5 and 34 DF, p-value: < 2.2e-16
confint(mymodel)
2.5 % 97.5 %
(Intercept) -369.18600149 -3.303233e+02
AGE -0.08030589 4.269300e-02
HT 4.88870480 5.487948e+00
WAIST -0.18317906 2.393288e-01
BMI 5.96974158 7.060087e+00
PULSE -0.11093769 9.876556e-03
library(car)
vif(mymodel)
AGE HT WAIST BMI PULSE
1.727504 1.927632 10.221581 8.238270 1.096788
par(mfrow=c(2,2))
plot(mymodel)
par(mfrow=c(1,1))
First remove BMI:
mymodel2 = update(mymodel, . ~ . - BMI)
summodel2 = summary(mymodel2); summodel2
Call:
lm(formula = WT ~ AGE + HT + WAIST + PULSE, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-18.5385 -4.6193 0.3536 5.6446 15.0012
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -222.0183 33.7097 -6.586 1.32e-07 ***
AGE -0.3775 0.1115 -3.385 0.00177 **
HT 2.9085 0.4799 6.060 6.44e-07 ***
WAIST 2.3818 0.1587 15.010 < 2e-16 ***
PULSE -0.1184 0.1249 -0.947 0.34992
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.454 on 35 degrees of freedom
Multiple R-squared: 0.9075, Adjusted R-squared: 0.8969
F-statistic: 85.81 on 4 and 35 DF, p-value: < 2.2e-16
vif(mymodel2)
AGE HT WAIST PULSE
1.316074 1.146114 1.336336 1.087104
PULSE is still not significant, so we remove it as well and obtain a model in which all p-values are < 0.05, i.e., all variables are significant.
mymodel3 = update(mymodel2, . ~ . -PULSE)
summodel3 = summary(mymodel3); summodel3
Call:
lm(formula = WT ~ AGE + HT + WAIST, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-19.521 -3.729 1.875 3.978 16.087
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -232.0936 31.9428 -7.266 1.47e-08 ***
AGE -0.3591 0.1097 -3.275 0.00234 **
HT 2.9739 0.4743 6.271 3.03e-07 ***
WAIST 2.3460 0.1539 15.243 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.442 on 36 degrees of freedom
Multiple R-squared: 0.9051, Adjusted R-squared: 0.8972
F-statistic: 114.4 on 3 and 36 DF, p-value: < 2.2e-16
CIb1 = confint(mymodel3); CIb1
2.5 % 97.5 %
(Intercept) -296.8764881 -167.3106159
AGE -0.5814702 -0.1366994
HT 2.0120757 3.9357859
WAIST 2.0339086 2.6581771
vif(mymodel3)
AGE HT WAIST
1.276228 1.122369 1.260754
Now that we have obtained a model with all significant predictors and all VIFs well below 5, we rerun the diagnostic plots and see that the non-linearity of the residuals has been removed, the normality of the residuals has improved, and none of the points on the Residuals vs. Leverage graph crosses the Cook's distance curves. Thus, the latest linear model is reasonable.
par(mfrow=c(2,2))
plot(mymodel3)
par(mfrow=c(1,1))
Using this model, we can illustrate the meaning of the coefficients (slopes). If all other variables are kept constant, each extra year of age, somewhat surprisingly, lowers weight by 0.36 lb on average. Each extra inch of height on average increases weight by 2.97 lb. Finally, each extra centimeter of WAIST increases WT by 2.34 lb on average.
We can also use this reduced model to predict WT below:
newdata = data.frame(AGE = 40,HT = 69,WAIST = 90)
predict(mymodel3, newdata, interval = "prediction")
fit lwr upr
1 169.8881 152.5054 187.2709
predict(mymodel3, newdata, interval = "confidence")
fit lwr upr
1 169.8881 166.8803 172.896
Sometimes we observe clear curvature in a scatter plot, and a linear model is inadequate for the data. This can often be fixed by including quadratic or cubic terms in the model. For example, a cubic polynomial relationship is represented by \[ \hat{y} = b_0+b_1\cdot x+b_2\cdot x^2+b_3\cdot x^3 \] At order 1, the linear relationship allows no curvature (a straight line). At order 2, a quadratic function (parabola) allows one bend. At order 3, the model can cope with two bends in the relationship, etc. Much higher order polynomials, however, introduce artificial wiggles and are to be avoided. It is very important to realize that while such a model is non-linear in x, it is still linear in the coefficients \(b_0, b_1,...\). A model that is truly non-linear in the coefficients requires a different approach.
Generally, a numeric transformation refers to the application of a mathematical function to numeric observations to rescale them; it is not limited to polynomials and could be any function of the \(x_i\)'s.
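As a brief illustration of a non-polynomial transformation (a sketch using the built-in mtcars data, which also appears in the example that follows), we can model mpg as a linear function of log(hp):
# Sketch: a logarithmic transformation of the predictor
logmodel = lm(mpg ~ log(hp), data = mtcars)
summary(logmodel)$coefficients
plot(mpg ~ hp, data = mtcars, pch = 20)
hp.seq = seq(50, 350, length = 200)
lines(hp.seq, predict(logmodel, newdata = data.frame(hp = hp.seq)), lwd = 2)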
Example
Consider the built-in mtcars data set. Let’s investigate the dependence of mpg (miles per gallon) vs. hp (horsepower) of a car. The scatter plot shown in the Figure below has a noticeable curve in the relationship, so the regression line shown in this figure appears to be a poor fit. The diagnostic plots also show problems with the pattern in residuals. Note that we don’t need to check VIF’s - they are going to be high because these are the powers of the same variable.
# Polynomial fit mtcars
plot(mpg ~ hp, data=mtcars, pch = 20)
mymodel1 = lm(mpg ~ hp, data=mtcars)
summary(mymodel1)
Call:
lm(formula = mpg ~ hp, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-5.7121 -2.1122 -0.8854 1.5819 8.2360
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
hp -0.06823 0.01012 -6.742 1.79e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.863 on 30 degrees of freedom
Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
abline(mymodel1,lwd=3,col="red")
par(mfrow=c(2,2))
plot(mymodel1)
par(mfrow=c(1,1))
Note that even though the scatterplot and residual plot both show a rather poor fit, the p-value of the slope is still close to 0, which indicates statistical evidence of a negative linear effect of higher horsepower on gas mileage. Therefore, based purely on the regression output, we would have to reject the H0 of no linear relationship. Thus, as we said before, you should always examine the scatterplot.
Add a quadratic term in hp via I(hp^2). Note that R requires the I() function, which changes the class of an object to indicate that it should be treated ‘as is’. In a formula, it is used to inhibit the interpretation of operators such as “+”, “-”, “*” and “^” as formula operators, so that they are used as arithmetic operators.
mymodel2 = lm(mpg ~ hp+I(hp^2), data=mtcars)
summary(mymodel2)
Call:
lm(formula = mpg ~ hp + I(hp^2), data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5512 -1.6027 -0.6977 1.5509 8.7213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.041e+01 2.741e+00 14.744 5.23e-15 ***
hp -2.133e-01 3.488e-02 -6.115 1.16e-06 ***
I(hp^2) 4.208e-04 9.844e-05 4.275 0.000189 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.077 on 29 degrees of freedom
Multiple R-squared: 0.7561, Adjusted R-squared: 0.7393
F-statistic: 44.95 on 2 and 29 DF, p-value: 1.301e-09
x.seq <- seq(50,350,length=1000) # sequence of values in the range of x variable
mymodel2.pred <- predict(mymodel2,newdata=data.frame(hp=x.seq))
plot(mpg ~ hp, data=mtcars, pch = 20)
abline(mymodel1,lwd=3,col="red")
lines(x.seq,mymodel2.pred,lty=2,col="blue",lwd=3)
The Figure above shows both the linear and quadratic fits. To add the quadratic prediction we used the seq() function to create a finely spaced set of values over the entire range of hp, the predict() function to predict mpg at those values, and lines() to add the quadratic curve. The summary(mymodel2) indicates a small p-value (< 0.05) for the quadratic term, so the contribution of the squared component is statistically significant. This conclusion is also supported by a higher adjusted \(R^2\).
Would the cubic term fit the data even better? No. The code below shows that the cubic term is not significant (p-value > 0.05) and the Figure below displays linear, quadratic, and cubic fits. The cubic fit is basically the same as quadratic.
mymodel3 = lm(mpg ~ hp+I(hp^2)+I(hp^3), data=mtcars)
summary(mymodel3)
Call:
lm(formula = mpg ~ hp + I(hp^2) + I(hp^3), data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.8605 -1.3972 -0.5736 1.6461 9.0738
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.422e+01 5.961e+00 7.419 4.43e-08 ***
hp -2.945e-01 1.178e-01 -2.500 0.0185 *
I(hp^2) 9.115e-04 6.863e-04 1.328 0.1949
I(hp^3) -8.701e-07 1.204e-06 -0.722 0.4760
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.103 on 28 degrees of freedom
Multiple R-squared: 0.7606, Adjusted R-squared: 0.7349
F-statistic: 29.65 on 3 and 28 DF, p-value: 7.769e-09
mymodel3.pred <- predict(mymodel3,newdata=data.frame(hp=x.seq))
plot(mpg ~ hp, data=mtcars, pch = 20)
abline(mymodel1,lwd=3,col="red")
lines(x.seq,mymodel2.pred,lty=2,col="blue" ,lwd=3)
lines(x.seq,mymodel3.pred,lty=3,col="green",lwd=3)
legend("topright",lty=1:3,
legend=c("order 1 (linear)","order 2 (quadratic)","order 3 (cubic)"),
col=c("red","blue","green"),lwd=3)
From the R coding point of view, there is actually a better way to fit a polynomial of any order, using the poly() function with the appropriate order. The results are the same.
mymodel3a = lm(mpg ~ poly(hp,3), data=mtcars)
summary(mymodel3a)
Call:
lm(formula = mpg ~ poly(hp, 3), data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.8605 -1.3972 -0.5736 1.6461 9.0738
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.0906 0.5485 36.625 < 2e-16 ***
poly(hp, 3)1 -26.0456 3.1030 -8.394 3.95e-09 ***
poly(hp, 3)2 13.1546 3.1030 4.239 0.000221 ***
poly(hp, 3)3 -2.2419 3.1030 -0.722 0.475987
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.103 on 28 degrees of freedom
Multiple R-squared: 0.7606, Adjusted R-squared: 0.7349
F-statistic: 29.65 on 3 and 28 DF, p-value: 7.769e-09
The relative statistical significance of the nested models used here can also be assessed using the anova() function:
anova(mymodel1,mymodel2,mymodel3)
Analysis of Variance Table
Model 1: mpg ~ hp
Model 2: mpg ~ hp + I(hp^2)
Model 3: mpg ~ hp + I(hp^2) + I(hp^3)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 30 447.67
2 29 274.63 1 173.043 17.971 0.0002205 ***
3 28 269.61 1 5.026 0.522 0.4759872
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
It shows that the quadratic term is significant but the cubic term is not, in agreement with the individual p-values. It should also be noted that the residual diagnostic plots for the linear model show a quadratic pattern in the residuals, while the quadratic model fixes this problem: it has a random scatter of residuals and a better Residuals vs. Leverage plot, with none of the data points crossing the Cook's distance curves.
par(mfrow=c(2,2))
plot(mymodel1)
plot(mymodel2)
par(mfrow=c(1,1))
Note that polynomial and most other non-linear fits are only valid in the data region (interpolation). They can behave completely differently outside of the data range (extrapolation). For example, plotting our linear, quadratic, and cubic approximations over a much wider range, we obtain the Figure below, where the quadratic and cubic extrapolations go wildly off course.
x.seq <- seq(50,500,length=1000) # sequence of values extending well beyond the observed hp range
mymodel2.pred <- predict(mymodel2,newdata=data.frame(hp=x.seq))
mymodel3.pred <- predict(mymodel3,newdata=data.frame(hp=x.seq))
plot(mpg ~ hp, data=mtcars, pch = 20, xlim=c(0,500),ylim=c(0,40))
abline(mymodel1,lwd=3,col="red")
lines(x.seq,mymodel2.pred,lty=2,col="blue" ,lwd=3)
lines(x.seq,mymodel3.pred,lty=3,col="green",lwd=3)
legend("topright",lty=1:3,
legend=c("order 1 (linear)","order 2 (quadratic)","order 3 (cubic)"),
col=c("red","blue","green"),lwd=3)
So far, for regression, we have looked only at the additive main effects of predictors on the outcome. Now we consider interactions between predictors. An interaction effect occurs when the effect of one variable depends on the value of another variable. For example, medicine interaction effects are very common: statins used for cholesterol reduction interact negatively with grapefruit juice because it contains chemical compounds that interfere with how the drug is metabolized. Interactions can occur between categorical variables, numeric variables, or both. We commonly study interactions between two variables (two-way); higher-order interactions are rarely considered, as they are hard to interpret. Note also that we don't need to check for multicollinearity with VIFs here - they will be high because the product term is, by construction, correlated with the variables it is built from. Including the interaction term explicitly accounts for it.
Let's first consider the interaction between one categorical and one continuous variable. The categorical variable changes the slope of the continuous predictor at its non-reference levels. A categorical variable with \(k\) levels has \(k - 1\) main-effect terms, so there will also be \(k - 1\) interaction terms with the continuous variable.
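To see these \(k-1\) main-effect and \(k-1\) interaction columns explicitly, model.matrix() can display the design matrix (a sketch using the built-in iris data, which is also the subject of the example below):
# Sketch: design-matrix columns created by a continuous * categorical interaction
head(model.matrix(~ Petal.Width * Species, data = iris), 3)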
Consider an example using the classical iris data file, which contains 3 classes of 50 instances each, where each class refers to a type of iris plant and its measurements. Just as for the interaction of terms in two-way ANOVA, we use the multiplication sign \(*\) to introduce an interaction between the terms, which automatically includes the main effects of the terms. The ggplot2 package is used to produce an informative plot of the regression lines broken out by the categorical variable Species, as shown in the Figure below. We begin with the additive (main-effects-only) model, in which Species can shift only the intercept of the Petal.Length vs. Petal.Width line.
# Interaction one continuous one categorical
mymodel = lm(Petal.Length ~ Petal.Width + Species, data = iris)
summary(mymodel)
Call:
lm(formula = Petal.Length ~ Petal.Width + Species, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.02977 -0.22241 -0.01514 0.18180 1.17449
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.21140 0.06524 18.568 < 2e-16 ***
Petal.Width 1.01871 0.15224 6.691 4.41e-10 ***
Speciesversicolor 1.69779 0.18095 9.383 < 2e-16 ***
Speciesvirginica 2.27669 0.28132 8.093 2.08e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3777 on 146 degrees of freedom
Multiple R-squared: 0.9551, Adjusted R-squared: 0.9542
F-statistic: 1036 on 3 and 146 DF, p-value: < 2.2e-16
library(ggplot2)
ggplot(iris, aes(x = Petal.Width, y = Petal.Length, color = Species,fill=Species)) +
geom_point() +
geom_smooth(method = "lm")
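The summary above is for the additive fit. To let the Petal.Width slope itself differ by Species, we refit the model including the interaction term; the coefficient estimates discussed below come from this interaction fit (the name mymodel.int is introduced here so the additive fit mymodel is kept intact):
# Refit with the interaction so each Species gets its own slope for Petal.Width
mymodel.int = lm(Petal.Length ~ Petal.Width * Species, data = iris)
summary(mymodel.int)$coefficients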
The interaction fit includes main effects for Petal.Width and Species as well as interaction terms for each non-reference level of Species. The main effect for Petal.Width is not significant (p-value > 0.05), Speciesversicolor is not significant compared to the reference level of setosa (p-value > 0.05), but Speciesvirginica is significant (p-value << 0.05) compared to setosa. Note that if at least one of the non-reference levels of a categorical variable is significant, the entire effect is significant. The interaction term Petal.Width:Speciesversicolor is significant (p-value < 0.05). Note that even though the main effect of Petal.Width is not significant, we cannot drop this variable because there is a significant interaction with Species. The Figure above shows that the slopes differ by Species, which confirms the significance of the interaction. The resulting model (coefficients rounded) is: \[Petal.Length =1.33+0.54 \cdot Petal.Width + 0.45 \cdot Speciesversicolor + 2.91 \cdot Speciesvirginica \] \[ + 1.33 \cdot Petal.Width:Speciesversicolor + 0.10 \cdot Petal.Width:Speciesvirginica\] For the reference level of the categorical predictor Species (setosa), the fitted model is simply \[Petal.Length =1.33+0.54 \cdot Petal.Width\] First consider the effects without the interactions. Changing Species to the level \(Speciesversicolor = 1\) (and the others 0) adds a term: \[Petal.Length =1.33+0.54 \cdot Petal.Width + 0.45 \cdot 1 = 1.78+0.54 \cdot Petal.Width \] Note that it changes only the y-intercept, not the slope. Analogously, for the level \(Speciesvirginica = 1\) (and the others 0): \[Petal.Length =1.33+0.54 \cdot Petal.Width + 2.91 \cdot 1 = 4.24+0.54 \cdot Petal.Width \] Now let's consider the interaction terms. They add not only to the intercept, but also to the slope. For a plant with \(Speciesversicolor = 1\), \[Petal.Length =1.33+0.54 \cdot Petal.Width + 0.45 \cdot 1 + 1.33 \cdot Petal.Width \cdot 1 = 1.78+1.87 \cdot Petal.Width \] Thus, we obtain both a higher y-intercept and a higher slope for \(Petal.Width\).
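A short sketch recovering the three species-specific lines directly from the coefficients of the interaction fit (this assumes mymodel.int from the refit above):
# Sketch: per-species intercepts and slopes from the interaction fit mymodel.int
b = coef(mymodel.int)
out = rbind(
  setosa     = c(b[["(Intercept)"]], b[["Petal.Width"]]),
  versicolor = c(b[["(Intercept)"]] + b[["Speciesversicolor"]],
                 b[["Petal.Width"]]  + b[["Petal.Width:Speciesversicolor"]]),
  virginica  = c(b[["(Intercept)"]] + b[["Speciesvirginica"]],
                 b[["Petal.Width"]]  + b[["Petal.Width:Speciesvirginica"]]))
colnames(out) = c("intercept", "slope")
out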
The predict() function works in exactly the same way whether or not the model includes interaction terms; for example, with the model mymodel:
newdata = data.frame(Petal.Width=2,Species="virginica")
predict(mymodel, newdata, interval = "prediction")
fit lwr upr
1 5.525513 4.77148 6.279547
predict(mymodel, newdata, interval = "confidence")
fit lwr upr
1 5.525513 5.419644 5.631383
Next, let's consider the interaction between two continuous variables. Two continuous main effects fit a plane, and an interaction term modifies the slopes in a continuous way. We use the state.x77 data set built into R, modeling (average) Income vs. the Illiteracy and Murder rates, including their interaction term.
states <- as.data.frame(state.x77);
mymodel <- lm(Income ~ Illiteracy * Murder, data = states)
summary(mymodel)
Call:
lm(formula = Income ~ Illiteracy * Murder, data = states)
Residuals:
Min 1Q Median 3Q Max
-955.20 -325.99 10.66 299.96 1892.12
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3822.61 405.33 9.431 2.54e-12 ***
Illiteracy 617.34 434.85 1.420 0.16245
Murder 146.82 50.33 2.917 0.00544 **
Illiteracy:Murder -117.10 40.13 -2.918 0.00544 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 520.1 on 46 degrees of freedom
Multiple R-squared: 0.3273, Adjusted R-squared: 0.2834
F-statistic: 7.461 on 3 and 46 DF, p-value: 0.000359
The main effect of the Illiteracy rate is not significant (p-value > 0.05), but the Murder rate is significantly related to Income (p-value < 0.05). The interaction of these two rates is significant (p-value < 0.05). The model is written as: \[Income = 3822.61 + 617.34\cdot Illiteracy + 146.82 \cdot Murder -117.10 \cdot Illiteracy:Murder \] \[ Income = 3822.61 + 617.34\cdot Illiteracy + 146.82 \cdot Murder -117.10 \cdot Illiteracy \cdot Murder\] Note how the second line writes the interaction as the actual product of the two predictor variables. Note that we could have written the interaction of the continuous and categorical variables in the previous example as a product as well; the dummy coding of 0 and 1 would multiply correctly (0 = term is absent, 1 = term is present).
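Because the model is just a sum of products, the slope of Murder implied by the fit depends on the level of Illiteracy: \(146.82 - 117.10 \cdot Illiteracy\). A short sketch evaluating this conditional slope at a few illiteracy rates:
# Sketch: the slope of Murder implied by the interaction fit at several Illiteracy levels
b = coef(mymodel)
illit = c(0.5, 1.5, 2.5)
data.frame(Illiteracy = illit,
           Murder.slope = b[["Murder"]] + b[["Illiteracy:Murder"]]*illit)
With these coefficients the Murder slope is positive at low illiteracy rates and turns negative at higher ones.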
The interpretation of an interaction between continuous variables depends on the sign \(\pm\) of the interaction coefficient. If it is negative (as it is here), then as the predictors increase together, the mean response is reduced beyond what the main effects alone would give. If it is positive, the mean response is additionally amplified as the predictors increase. Here, as both the illiteracy and murder rates increase, the average income in a state decreases. It is interesting to compare the model with interaction above with the model without interaction and with the individual models:
mymodel2 <- lm(Income ~ Illiteracy + Murder, data = states)
summary(mymodel2)
Call:
lm(formula = Income ~ Illiteracy + Murder, data = states)
Residuals:
Min 1Q Median 3Q Max
-880.90 -397.26 -51.31 333.13 1960.68
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4890.45 187.64 26.063 < 2e-16 ***
Illiteracy -548.74 184.60 -2.973 0.00465 **
Murder 25.40 30.48 0.833 0.40895
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 560.2 on 47 degrees of freedom
Multiple R-squared: 0.2028, Adjusted R-squared: 0.1689
F-statistic: 5.979 on 2 and 47 DF, p-value: 0.004861
mymodel3 <- lm(Income ~ Illiteracy, data = states)
summary(mymodel3)
Call:
lm(formula = Income ~ Illiteracy, data = states)
Residuals:
Min 1Q Median 3Q Max
-948.89 -376.20 -49.77 347.00 2024.60
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4951.3 172.3 28.739 < 2e-16 ***
Illiteracy -440.6 130.9 -3.367 0.00151 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 558.4 on 48 degrees of freedom
Multiple R-squared: 0.191, Adjusted R-squared: 0.1742
F-statistic: 11.34 on 1 and 48 DF, p-value: 0.001505
mymodel4 <- lm(Income ~ Murder, data = states)
summary(mymodel4)
Call:
lm(formula = Income ~ Murder, data = states)
Residuals:
Min 1Q Median 3Q Max
-1141.64 -483.71 -19.01 403.31 2029.40
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4718.36 192.51 24.510 <2e-16 ***
Murder -38.30 23.38 -1.638 0.108
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 604.2 on 48 degrees of freedom
Multiple R-squared: 0.05294, Adjusted R-squared: 0.03321
F-statistic: 2.683 on 1 and 48 DF, p-value: 0.108
newdata = data.frame(Illiteracy=0.6,Murder=4)
predict(mymodel, newdata, interval = "prediction")
fit lwr upr
1 4499.252 3425.301 5573.204
predict(mymodel, newdata, interval = "confidence")
fit lwr upr
1 4499.252 4260.184 4738.321
We can see that without the interaction the sign of the Illiteracy effect is reversed, and the individual models show negative effects of Illiteracy and Murder, respectively.
Finally, let's consider the interaction between two categorical variables. We have actually already considered interactions between two categorical explanatory variables in the two-way ANOVA, so the same ToothGrowth file from the base R data sets is used. This file contains data on the effect of vitamin C on tooth growth in guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (OJ vs. VC).
# Interaction two categorical
mydata = ToothGrowth
mydata$dose = factor(mydata$dose)
mymodel = lm(len ~ supp*dose,data=mydata)
summary(mymodel)
Call:
lm(formula = len ~ supp * dose, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-8.20 -2.72 -0.27 2.65 8.27
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.230 1.148 11.521 3.60e-16 ***
suppVC -5.250 1.624 -3.233 0.00209 **
dose1 9.470 1.624 5.831 3.18e-07 ***
dose2 12.830 1.624 7.900 1.43e-10 ***
suppVC:dose1 -0.680 2.297 -0.296 0.76831
suppVC:dose2 5.330 2.297 2.321 0.02411 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.631 on 54 degrees of freedom
Multiple R-squared: 0.7937, Adjusted R-squared: 0.7746
F-statistic: 41.56 on 5 and 54 DF, p-value: < 2.2e-16
library(ggpubr);
ggline(mydata, x = "dose", y = "len", color = "supp",
add = c("mean_se", "dotplot"))
interaction.plot(x.factor = mydata$dose, #x-axis variable
trace.factor = mydata$supp, #variable for lines
response = mydata$len, #y-axis variable
fun = median, #metric to plot
ylab = "tooth length", xlab = "dose",
col = c("red", "blue"), trace.label = "supp")
There is a term for each non-reference level of the first predictor combined with each non-reference level of the second predictor. The delivery method VC is significant (p-value < 0.05) compared to the reference level OJ. The dose1 and dose2 levels are both significant compared to the reference dose of 0.5 (p-value << 0.05). The interaction of suppVC with dose1 is not significant (p-value > 0.05), but suppVC with dose2 is significant (p-value < 0.05).
These results provide the same conclusion as the ANOVA analysis shown
below:
anovamod = aov(len ~ supp*dose,data=mydata)
summary(anovamod)
Df Sum Sq Mean Sq F value Pr(>F)
supp 1 205.4 205.4 15.572 0.000231 ***
dose 2 2426.4 1213.2 92.000 < 2e-16 ***
supp:dose 2 108.3 54.2 4.107 0.021860 *
Residuals 54 712.1 13.2
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The model is: \[len = 13.23-5.25 \cdot suppVC + 9.47\cdot dose1 + 12.83\cdot dose2 \] \[ -0.68 \cdot suppVC \cdot dose1 + 5.33 \cdot suppVC \cdot dose2\] The model is seen as a series of additive terms to the overall intercept. Depending on a particular choice of the predictors, we get the estimated average response. For example, for a guinea pig given the supplement via ascorbic acid (\(VC\)) with \(dose = 2\), we obtain the following predictions:
newdata = data.frame(supp="VC",dose="2")
predict(mymodel, newdata, interval = "prediction")
fit lwr upr
1 26.14 18.5041 33.7759
predict(mymodel, newdata, interval = "confidence")
fit lwr upr
1 26.14 23.83769 28.44231
At the beginning of this multiple regression chapter, we started with a number of predictors which might be related to the response variable and removed them one by one using the highest (least significant) p-value. Keeping such irrelevant variables would only hinder the accuracy and explanatory power of the model. There are many other selection strategies which help us eliminate (prune out) unnecessary predictors from the model to obtain a better one. A model that includes all available explanatory variables is called the full model.
The main goal of fitting a statistical model is to faithfully represent the data and the relationships held within them. It is a balancing act between goodness-of-fit and complexity. Complexity is tied to the number of terms and additional functions (transformations, interactions, etc.). The principle of parsimony is the principle of model selection: find a model that is as simple as possible (relatively low complexity) without sacrificing goodness-of-fit in a major way. Of course, assessing the significance of the effect of predictors, or functions of predictors, on the response plays the major role in these decisions. The number of possibilities for variable selection grows exponentially with the number of variables, so systematic selection algorithms are essential in this analysis. There are a number of model selection algorithms, and they often lead to different final models. As George Box once said, “All models are wrong, but some are useful.”
It should also be noted that the data needs to be split into a training part, on which the model is created, and a testing part, on which we test the model. If we test the model on the same data on which it was created, the fit tends to be too optimistic: overfitting to the particular data set tends to occur, with worse results on new test data.
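A minimal sketch of such a split (using the built-in mtcars data for illustration), fitting on the training part and computing the mean squared error on both parts:
# Sketch: random 70/30 train/test split (built-in mtcars data)
set.seed(1)                                       # for reproducibility
train.idx = sample(nrow(mtcars), size = round(0.7*nrow(mtcars)))
train = mtcars[train.idx, ]; test = mtcars[-train.idx, ]
fit = lm(mpg ~ hp + wt, data = train)
c(train.MSE = mean((train$mpg - predict(fit, train))^2),
  test.MSE  = mean((test$mpg  - predict(fit, test))^2))
# the test MSE is typically larger than the (overly optimistic) training MSE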
The general guidelines on predictor variable inclusion/exclusion are:
1) Individual insignificant levels of a categorical predictor cannot be removed if at least one of the non-reference levels is significant.
2) If an interaction is present in the model, all lower-order main effects of the interacting predictors cannot be removed, even if they are not statistically significant.
3) In polynomial transformation models, keep all lower-order polynomial terms in the model if the highest-order term is deemed significant.
Prediction Accuracy
Consider the expected prediction error at a point \(x\): \[ Error(x) = E[( Y - \hat{f}(x))^2]\] It can be decomposed into three terms: \[ Error(x) = (E[\hat{f}(x)]-f(x))^2 + E[(\hat{f}(x)-E[\hat{f}(x)])^2] + \sigma_{\epsilon}^2\] \[ Error(x) = Bias^2 + Variance + Irreducible \ Error\]
The bias is the error from approximating a real-life problem by a model. For example, linear regression assumes that there is a linear relationship between the response and the predictors. If the true \(f(x)\) is substantially non-linear, it will not be possible to produce an accurate estimate using linear regression - high bias (underfitting). Generally, more flexible methods result in less bias.
The variance is the change in \(\hat{f}\) if it is estimated with a different training data set. More flexible statistical methods have higher variance. In a flexible nonlinear fit, changing any data point may cause the estimate \(\hat{f}\) to change considerably resulting in high variance (overfitting). A least squares line fit is relatively inflexible and has low variance, because moving any single observation slightly will likely cause only a small shift in the position of the line.
The ideas of Bias and Variance are well illustrated in the Figure below:
# install.packages("imager")
library(imager)
file = "https://raw.githubusercontent.com/leonkag/Statistics0/main/VarianceBiasTarget.JPG"
im <- load.image(file)
plot(im)
As we increase flexibility of the methods, the variance will increase and the bias will decrease. However, the bias tends to initially decrease faster than the variance increases. Consequently, the expected test Mean Squared Error (\(MSE\)) declines. However, at some point increasing flexibility provides only small improvements to the bias but significantly increases the variance. Then the test \(MSE\) increases. We can see this bias-variance trade-off in the Figure below.
library(imager)
file = "https://raw.githubusercontent.com/leonkag/Statistics0/main/BiasVarianceTradeoff.JPG"
im <- load.image(file)
plot(im)
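The same trade-off can be reproduced in a small simulation (a sketch with an assumed data-generating model: a smooth nonlinear function plus noise, fit by polynomials of increasing degree):
# Sketch: simulated bias-variance trade-off for polynomial fits
set.seed(1)
n = 200
x = runif(n, 0, 3)
y = sin(2*x) + rnorm(n, sd = 0.3)        # true f(x) = sin(2x) plus noise
train = sample(n, n/2)                   # half the data for training
test.mse = sapply(1:10, function(d) {
  fit = lm(y ~ poly(x, d), subset = train)
  mean((y[-train] - predict(fit, newdata = data.frame(x = x[-train])))^2)
})
plot(1:10, test.mse, type = "b",
     xlab = "polynomial degree (flexibility)", ylab = "test MSE")
# the test MSE typically drops at first (bias falls) and then rises (variance grows)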
Which criteria can be used in the selection algorithms? The least squares model is designed to minimize the Residual Sum of Squares which we defined before, but we repeat it again here: \[RSS=SSE=\sum_{i=1}^n [y_i -(b_0 + b_1 x_{1,i} +b_2 x_{2,i}+ ... + b_p x_{p,i})]^2\]
Generally, the training set Mean Squared Error (\(MSE\)) is an underestimate of the \(MSE\) for the test data set. The reason is that the model is fit to the training data by least squares: the regression coefficients are chosen to minimize the training \(RSS\). The training \(RSS\) always decreases as we add more features to the model, but the test error may increase. Therefore the training \(RSS\) and \(r^2\) may not be used for selecting the best model without an adjustment for this underestimation.
Adding new predictors (even irrelevant ones) can only reduce \(RSS\) and increase the regular coefficient of determination, so we defined an adjusted \(r^2_{adj}\): \[r^2_{adj} = 1-\frac{\frac{SSE}{n-p-1}}{\frac{SST}{n-1}} \] However, it tends to vary only slightly when we reduce our models. There are many other, more efficient measures used for model selection.
For example, Mallows’s Cp \[C_p=\frac{1}{n}(RSS +2p\hat{\sigma}^2) \] where \(RSS\) is the residual sum of squares on a training set of data, \(p\) is the number of predictors and \(\hat{\sigma }^{2}\) refers to an estimate of the variance associated with each response in the linear model (estimated on a model containing all predictors). A small value of \(C_p\) means that the model is relatively precise. The \(C_p\) adds a penalty of \(2p\hat{\sigma}^2\) to the training \(RSS\) in order to adjust for the fact that the training error tends to underestimate the new data test error. Only if the decrease in \(RSS\) is more than increase in penalty term, will \(C_p\) go down.
For our least squares model, Mallows's \(C_p\) is equivalent to the Akaike information criterion (\(AIC\)), defined for a large class of models fit by maximum likelihood \(\hat{L}\): \[AIC = 2p-2\ln(\hat{L})\] which for the least squares model is, up to irrelevant constants, \[AIC = \frac{1}{n\hat{\sigma}^2}(RSS +2p\hat{\sigma}^2)\] The Bayes Information Criterion (\(BIC\)) is derived from Bayesian probability theory. For the least squares model with \(p\) predictors, the \(BIC\) is, up to irrelevant constants, given by: \[BIC=\frac{1}{n}(RSS +\log(n)p\hat{\sigma}^2) \] Like \(C_p\), the \(BIC\) is smaller for models with lower test error, so we are looking for the model with the lowest BIC value. Since \(\log(n) > 2\) for any \(n > 7\), BIC places a bigger penalty on models with many variables and therefore tends to produce smaller models than \(C_p\).
Best Subset Selection - a least squares regression is fit for each possible combination of the \(p\) predictors. There are \(p\) models with one predictor, \(\frac{p(p-1)}{2}\) models with two predictors, etc. Each predictor can be in or out of the model, so there are \(2^p\) possible models. Say, for \(p=10\) predictors, there are \(2^{10}=1024\) models to investigate. The algorithm proceeds as follows:
1) Let \(M_0\) be the null model with no predictors. It simply predicts the sample mean for each observation.
2) For \(k = 1, 2, ..., p\), fit all \(\binom{p}{k}\) models with exactly \(k\) predictors. Pick the best according to the smallest \(RSS\) and call it \(M_k\).
3) Select the best model from among \(M_0,..., M_p\) using \(C_p\) (\(AIC\)), \(BIC\), \(r^2_{adj}\), or cross-validated prediction error.
Step 2 identifies the best model for the training data, then step 3 picks the best using criteria that penalize the inclusion of non-relevant predictors.
The exponential growth of the number of models is a serious computational issue. Also, models with a very large number of predictors are hard to interpret. Therefore, forward and backward stepwise selection is often used instead. Forward stepwise selection begins with the no-predictors model and adds, one at a time, the predictor that gives the greatest additional improvement to the \(RSS\) fit. Backward stepwise selection begins with the full model containing all \(p\) predictors and then iteratively removes the least useful predictor, one at a time.
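Below is a brief sketch of stepwise selection using base R's step(), which adds or drops terms to minimize AIC (the built-in mtcars data is used purely for illustration; best subset selection itself is available in, e.g., the leaps package's regsubsets() function):
# Sketch: backward and forward stepwise selection by AIC with step()
full = lm(mpg ~ ., data = mtcars)      # full model with all available predictors
back = step(full, direction = "backward", trace = 0)
formula(back)                          # predictors kept by backward selection
null = lm(mpg ~ 1, data = mtcars)      # null (intercept-only) model
fwd  = step(null, scope = formula(full), direction = "forward", trace = 0)
formula(fwd)                           # predictors added by forward selection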