Visualization of Interaction in Multiple Regression Model with package moonBook2

Keon-Woong Moon

2016-06-05

Package installation

Package moonBook is avaiable on CRAN and github. Package moonBook2 is available only on github. Please install moonBook2 package using the following R code.

install.packages("devtools")
devtools::install_github("cardiomoon/moonBook")
devtools::install_github("cardiomoon/moonBook2")

In this vignette, I will show you how to visualize the interaction between two predictor variables.

  1. ggAncova(): Visualization of One-way ANOVA with 1 covariate(ANCOVA)
  2. ggEffect(): Visualization of interaction bewteen two predictor variables in multiple regression model
  3. ggCatepillar(): Visualization of interaction between two independent (one or two categorical) variables

ggAncova()

In ANCOVA model, there are one categorical variable with one continuous variable(covariate). With mtcars data, we can make a ANCOVA model in which the mpg(mile per gallon) is a response variable and the wt(weight) and the number of cylinders(cyl) are explanatory variables. The number of cylinders are recorded as number. To make a ANCOVA model, we make a new categorical variable cyl1.

mtcars$engine=factor(mtcars$cyl)

You can easily calculate the mean and SD of mpg by engine.

require(moonBook)
require(moonBook2)
require(ggplot2)
summarySE(mtcars,"mpg","engine")
  engine  n      mpg       sd        se       ci
1      4 11 26.66364 4.509828 1.3597642 3.029743
2      6  7 19.74286 1.453567 0.5493967 1.344325
3      8 14 15.10000 2.560048 0.6842016 1.478128

The summarySE() function is made by slight modification of summarySE() function in “R graphics cookbook” by Winston Chang. You can make plot comparing the average of three groups.

str(mtcars)
'data.frame':   32 obs. of  12 variables:
 $ mpg   : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl   : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp  : num  160 160 108 258 360 ...
 $ hp    : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat  : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt    : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec  : num  16.5 17 18.6 19.4 17 ...
 $ vs    : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am    : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear  : num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb  : num  4 4 1 1 2 1 4 2 2 4 ...
 $ engine: Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
ggCatepillar(mpg~engine,data=mtcars,interactive=TRUE)
fit=lm(mpg~wt+engine,data=mtcars)
summary(fit)

Call:
lm(formula = mpg ~ wt + engine, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5890 -1.2357 -0.5159  1.3845  5.7915 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  33.9908     1.8878  18.006  < 2e-16 ***
wt           -3.2056     0.7539  -4.252 0.000213 ***
engine6      -4.2556     1.3861  -3.070 0.004718 ** 
engine8      -6.0709     1.6523  -3.674 0.000999 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.557 on 28 degrees of freedom
Multiple R-squared:  0.8374,    Adjusted R-squared:   0.82 
F-statistic: 48.08 on 3 and 28 DF,  p-value: 3.594e-11

In summary, we can obtain regression equation for engine 4,6 and 8.

engine4 : mpg = -3.2056*wt+33.9908
engine6 : mpg = -3.2056*wt+33.9908-4.2556
engine8 : mpg = -3.2056*wt+33.9908-6.0709

We can easly visualize this model with ggAncova() function.

ggAncova(fit,interactive=TRUE)
ggAncova(mpg~wt+engine,data=mtcars,interactive=TRUE)

If you are not familiar with the formula, You can enter data and variables in order. If you enter two continuous variables as explanatory variables, the last one is converted into categorical variable and make the ANCOVA plot.

ggAncova(mtcars,mpg,wt,cyl,interactive=TRUE)

In radial data in moonBook package, the degree of atherosclerosis are measured as the normalized total atheroma volume(NTAV). With this data we can make an ANCOVA model in which age and sex are used as explanatory variables.

ggAncova(NTAV~age+sex,data=radial,interactive=TRUE)

As you can see, the NTAV is increasing with age in both sex, but men have larger NTAV thae women.

ggEffect() : Multiple linear regression with interaction

One of the most interesting research findings are those involving interactions among predictor variables. If you are interested in th impact of horse power and automoble weight on mileage, you can fit a regression model with the following formula.

fit2=lm(mpg~hp*wt,data=mtcars)
summary(fit2)

Call:
lm(formula = mpg ~ hp * wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0632 -1.6491 -0.7362  1.4211  4.5513 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 49.80842    3.60516  13.816 5.01e-14 ***
hp          -0.12010    0.02470  -4.863 4.04e-05 ***
wt          -8.21662    1.26971  -6.471 5.20e-07 ***
hp:wt        0.02785    0.00742   3.753 0.000811 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.153 on 28 degrees of freedom
Multiple R-squared:  0.8848,    Adjusted R-squared:  0.8724 
F-statistic: 71.66 on 3 and 28 DF,  p-value: 2.981e-13

In formula, A*B is interpreted as A + B + A:B. A:B is the interaction of A and B. The relationship between mileage(mpg) and horse power(hp) varies by car weight(wt). The quantile of weight are as follows.

summary(mtcars$wt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.513   2.581   3.325   3.217   3.610   5.424 

The minimal value of weight is 1.513 and th maimal value of weight is 5.424. The regression equation between mpg and wt varies with hp. If the coefficients were rounded to the decimal point two digits, the regression equation is as follows

mpg=49.81 - 0.12 * hp- 8.22 * wt + 0.03 * hp * wt 

If the weight is 3.2(mean), the regression equation will be:

mpg=49.81-0.12 * hp - 8.22 * 3.2 + 0.03 * hp * 3.2 

mpg=23.51 - 0.03 * hp 

You can calculate the regression slope and y intercept with the mean of weight(3.2) and one standard deviation below and above the maen(2.2,and 4.2 respectively) as follows:

A<-c(2.2,3.2,4.2)        
coef<-fit2$coef
labels=as.character(A)
intercept=coef[1]+coef[2]*A
slope=coef[3]+coef[4]*A
intercept
[1] 49.54420 49.42410 49.30399
slope
[1] -8.155358 -8.127510 -8.099662

With ggEffect() function, you can visualize this interaction easily.

ggEffect(mpg~hp*wt,data=mtcars,interactive=TRUE)

By default, three regression lines are displayed at c(0.10,0.5,0.9) percentiles. You can get other regression lines by changing the probs parameter.

ggEffect(mpg~hp*wt,data=mtcars,probs=c(0.2,0.4,0.6,0.8),interactive=TRUE)

You can get regression lines with the desired value. If you wanted to get regression lines with weight 2.2,3.2 and 4.2, try this :

ggEffect(mpg~hp*wt,data=mtcars,xvalue=c(2.2,3.2,4.2),interactive=TRUE)

You can use the ggEffect() function with categorical variables. You can see the impact of intweraction between aging and smoking status on the progression of atherosclerosis, try this:

ggEffect(NTAV~age*smoking,data=radial,interactive=TRUE)

You can clearly see that the atheroscerosis progresses rapidly among smokers.

Two-way ANOVA

Another interesting examples comes from acs data(moonBook package). In acs data, 857 patients with acute coronary syndrome(acs) were enrolled. If you are interested the age of disease onset and diabetes and smoking status, you can summarize the onset of age grouped by smoking and DM status as follows.

summarySE(acs,"age",c("DM","smoking"))
   DM   smoking   n      age       sd        se       ci
1  No Ex-smoker 135 66.54074 11.79853 1.0154562 2.008395
2  No     Never 195 66.92821 11.05166 0.7914257 1.560903
3  No    Smoker 223 58.61435 11.60437 0.7770861 1.531409
4 Yes Ex-smoker  69 64.18841 10.76797 1.2963118 2.586750
5 Yes     Never 137 65.91241 10.09038 0.8620795 1.704815
6 Yes    Smoker  98 58.10204 10.45904 1.0565227 2.096905

You can visualize this as follows:

ggEffect(age~DM*smoking,data=acs,interactive=TRUE)
ggEffect(age~DM*smoking,data=acs,x=2,interactive=TRUE)

You can perform two-way ANOVA with interaction.

fit=aov(age~DM*smoking,data=acs)
summary(fit)
             Df Sum Sq Mean Sq F value Pr(>F)    
DM            1     45      45   0.364  0.546    
smoking       2  12378    6189  50.359 <2e-16 ***
DM:smoking    2     95      48   0.387  0.679    
Residuals   851 104586     123                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the two-way ANOVA, the impact of smoking status is significant, whereas the impacts of DM status and the interaction between DM and smoking are insignificant. You can perform multiple comparison by computinig Tukey Honest Significant Differences.

result=TukeyHSD(fit)
result
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = age ~ DM * smoking, data = acs)

$DM
             diff       lwr     upr     p adj
Yes-No -0.4777232 -2.031286 1.07584 0.5463034

$smoking
                      diff        lwr       upr     p adj
Never-Ex-smoker   0.799488  -1.515954  3.114930 0.6965132
Smoker-Ex-smoker -7.302890  -9.633383 -4.972397 0.0000000
Smoker-Never     -8.102378 -10.139752 -6.065004 0.0000000

$`DM:smoking`
                                 diff        lwr       upr     p adj
Yes:Ex-smoker-No:Ex-smoker -2.3523349  -7.038172  2.333502 0.7063073
No:Never-No:Ex-smoker       0.3874644  -3.157700  3.932628 0.9996051
Yes:Never-No:Ex-smoker     -0.6283320  -4.468240  3.211576 0.9972215
No:Smoker-No:Ex-smoker     -7.9263910 -11.379305 -4.473477 0.0000000
Yes:Smoker-No:Ex-smoker    -8.4386999 -12.640753 -4.236646 0.0000002
No:Never-Yes:Ex-smoker      2.7397993  -1.695507  7.175106 0.4895283
Yes:Never-Yes:Ex-smoker     1.7240030  -2.950251  6.398257 0.8994286
No:Smoker-Yes:Ex-smoker    -5.5740560  -9.935979 -1.212133 0.0037712
Yes:Smoker-Yes:Ex-smoker   -6.0863650 -11.062409 -1.110321 0.0066345
Yes:Never-No:Never         -1.0157964  -4.545636  2.514043 0.9634988
No:Smoker-No:Never         -8.3138554 -11.418286 -5.209425 0.0000000
Yes:Smoker-No:Never        -8.8261643 -12.746895 -4.905434 0.0000000
No:Smoker-Yes:Never        -7.2980590 -10.735237 -3.860881 0.0000000
Yes:Smoker-Yes:Never       -7.8103679 -11.999501 -3.621235 0.0000019
Yes:Smoker-No:Smoker       -0.5123090  -4.349828  3.325210 0.9989544

You can visualize the result with ggHSD() function. Because the result of HSD test in thsi case is a list of length 3, you can select the second list.

ggHSD(result,2,interactive=TRUE)

Followings are othe examples of ggEffect()

ggEffect(age~Dx*sex,data=acs,interactive=TRUE)

As you can see, male patient with acs are younger than female patient with acs

This plot is identical with the following plot

ggCatepillar(age~Dx+sex,data=acs,interactive=TRUE)

How to save the plots

The static plots(default: interactive=FALSE) are standard ggplots. You can save the plots with ggsave() function. You can make a static plot interactive by ggiraph() function. Interactive plots are htmlwidgets. You can save the plot with htmlwidgets::saveWidget() function.

p<-ggCatepillar(age~Dx+sex,data=acs)
ggsave("ggCatepillar.pdf",p)

p1<-ggiraph(code=print(p))
require(htmlwidgets)
saveWidget(p1,"ggCatepillar.html")

Alternatively, you can make a interactive plot by setting the parameter interactive=TRUE. In this case, you can save as html file with saveWidget() function.

p2<-ggEffect(NTAV~age*smoking,data=radial,interactive=TRUE)
saveWidget(p2,"ggEffect.html")