library(modelr)
library(tidyverse)
library(ggiraphExtra)
library(broom)
library(kableExtra)

Introduction

My first discussion board post for Data 621 was about visualizing regression models (Exhibit A). Unfortunately my post got no response from any of my fellow classmates - so sad. Despite the apparent lack of interest in this subject, I have decided to dedicate my final Blog post to Visualizing Regression models. For this post we will utilize the ggiraphExtra package and ggpredict to make the case for visualizing regression models. We’ll look at some examples from Simple, Multiple and Logistic Regression. For this post we won’t spend much time talking about the data. This post is about the Viz. Let’s get started

Exhibit A - The Unknown Post

As I made my way through week 1’s readings, I was struck by the prominent role that plots played. There’s rarely a page that does not include one plot or another. To truly master business analytics and analysis, I believe one must also master ggplot2. This is important not only for the preliminary analysis but also for effectively communicating your findings. I’ve taken data 608 and feel comfortable with ggplot, but I wonder if it would be useful to have a module in a course like Data 621 that was devoted to plotting for linear regression analysis. There seems to be numerous areas of opportunity ranging from error plots, exploratory plot, plots of many models and how to make the important models standout.

What are your thoughts? How important do you think plots are to linear regression analysis. Would a course like this benefit from a tailor-made plotting session or am I making much to do about nothing?

Simple Linear Regression.

For a simple linear regression, you can use the summary function to print the model’s paramenters.

Typical (and boring) Simple Linear Regression

## 
## Call:
## lm(formula = NTAV ~ age, data = radial)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.231 -14.626  -4.803   9.685 100.961 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  44.3398    14.6251   3.032  0.00302 **
## age           0.3848     0.2271   1.694  0.09302 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.94 on 113 degrees of freedom
## Multiple R-squared:  0.02477,    Adjusted R-squared:  0.01614 
## F-statistic:  2.87 on 1 and 113 DF,  p-value: 0.09302

Simple Regression Viz

term estimate std.error statistic p.value
(Intercept) 44.3397932 14.6250549 3.031769 0.0030154
age 0.3847666 0.2271353 1.693997 0.0930204

Interactive Simple Regression Viz

Another alternative is to make you plot interactive. A lot of you might think Ploty, but are you familiar with ggPredict and the ggiraphExtra package. This plot you to identify the points in the plot and to see the regression equation with their mouse. Highlight the plot line or any data point to see what I mean.

## Warning: package 'gdtools' was built under R version 3.5.3

Multiple Liinear Regression

We can also create some interesting eye candy for multiple linear regression. Our next model include a categorical variable with no interaction. A typical summary is provide.

## 
## Call:
## lm(formula = NTAV ~ age + sex, data = radial)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.025 -12.687  -1.699   5.784  89.419 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.8697    14.3846   1.242  0.21673    
## age           0.6379     0.2134   2.989  0.00344 ** 
## sexM         20.5476     4.1943   4.899 3.27e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.82 on 112 degrees of freedom
## Multiple R-squared:  0.1969, Adjusted R-squared:  0.1825 
## F-statistic: 13.73 on 2 and 112 DF,  p-value: 4.659e-06

Multiple Regression Viz - ggpredict version

Again, the ggpredict version is interactive and looks great.

These plots are also possible with models with interactions. Now you can use age and DM(diabetes mellitus) and interaction between age and DM as predcitor variables.

## 
## Call:
## lm(formula = NTAV ~ age * DM, data = radial)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.094 -15.115  -4.093   9.102 102.024 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  49.6463    16.8660   2.944  0.00395 **
## age           0.2925     0.2648   1.105  0.27174   
## DM          -20.8618    34.8936  -0.598  0.55115   
## age:DM        0.3453     0.5353   0.645  0.52026   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.1 on 111 degrees of freedom
## Multiple R-squared:  0.0292, Adjusted R-squared:  0.002966 
## F-statistic: 1.113 on 3 and 111 DF,  p-value: 0.347

Multiple Regression with Interaction - ggpredict

Logistic Regression

We can use the glm function to make logistic regression model. We’ll use data from a German Breast Cancer study and will try to predict survival with number of positive nodes and hormonal therapy. Model summary included.

## 
## Call:
## glm(formula = cens ~ pnodes * horTh, family = binomial, data = GBSG2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7892  -1.0208  -0.7573   1.2288   1.6667  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -0.55368    0.13942  -3.971 7.15e-05 ***
## pnodes           0.08672    0.02172   3.993 6.53e-05 ***
## horThyes        -0.69833    0.25394  -2.750  0.00596 ** 
## pnodes:horThyes  0.06306    0.03899   1.617  0.10582    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 939.68  on 685  degrees of freedom
## Residual deviance: 887.69  on 682  degrees of freedom
## AIC: 895.69
## 
## Number of Fisher Scoring iterations: 4

ggpredict Multiple Logistic Regression

Wrap-up

That’s all I have for now. I hope you picked up something you can use in a future analysis. I’ve provided some useful links below.

ggpredict - https://cran.r-project.org/web/packages/ggiraphExtra/vignettes/ggPredict.html

Annotate ggplot - https://sejohnston.com/2012/08/09/a-quick-and-easy-function-to-plot-lm-results-in-r/