My first discussion board post for Data 621 was about visualizing regression models (Exhibit A). Unfortunately my post got no response from any of my fellow classmates - so sad. Despite the apparent lack of interest in this subject, I have decided to dedicate my final Blog post to Visualizing Regression models. For this post we will utilize the ggiraphExtra package and ggpredict to make the case for visualizing regression models. We’ll look at some examples from Simple, Multiple and Logistic Regression. For this post we won’t spend much time talking about the data. This post is about the Viz. Let’s get started
As I made my way through week 1’s readings, I was struck by the prominent role that plots played. There’s rarely a page that does not include one plot or another. To truly master business analytics and analysis, I believe one must also master ggplot2. This is important not only for the preliminary analysis but also for effectively communicating your findings. I’ve taken data 608 and feel comfortable with ggplot, but I wonder if it would be useful to have a module in a course like Data 621 that was devoted to plotting for linear regression analysis. There seems to be numerous areas of opportunity ranging from error plots, exploratory plot, plots of many models and how to make the important models standout.
What are your thoughts? How important do you think plots are to linear regression analysis. Would a course like this benefit from a tailor-made plotting session or am I making much to do about nothing?
For a simple linear regression, you can use the summary function to print the model’s paramenters.
##
## Call:
## lm(formula = NTAV ~ age, data = radial)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.231 -14.626 -4.803 9.685 100.961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.3398 14.6251 3.032 0.00302 **
## age 0.3848 0.2271 1.694 0.09302 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.94 on 113 degrees of freedom
## Multiple R-squared: 0.02477, Adjusted R-squared: 0.01614
## F-statistic: 2.87 on 1 and 113 DF, p-value: 0.09302
ggplotRegression <- function (fit) {
require(ggplot2)
require(ggthemes)
ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) +
geom_point() +
stat_smooth(method = "lm", col = "red") +
labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5),
"Intercept =",signif(fit$coef[[1]],5 ),
" Slope =",signif(fit$coef[[2]], 5),
" P =",signif(summary(fit)$coef[2,4], 5)))
}| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 44.3397932 | 14.6250549 | 3.031769 | 0.0030154 |
| age | 0.3847666 | 0.2271353 | 1.693997 | 0.0930204 |
Another alternative is to make you plot interactive. A lot of you might think Ploty, but are you familiar with ggPredict and the ggiraphExtra package. This plot you to identify the points in the plot and to see the regression equation with their mouse. Highlight the plot line or any data point to see what I mean.
## Warning: package 'gdtools' was built under R version 3.5.3
We can also create some interesting eye candy for multiple linear regression. Our next model include a categorical variable with no interaction. A typical summary is provide.
##
## Call:
## lm(formula = NTAV ~ age + sex, data = radial)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.025 -12.687 -1.699 5.784 89.419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.8697 14.3846 1.242 0.21673
## age 0.6379 0.2134 2.989 0.00344 **
## sexM 20.5476 4.1943 4.899 3.27e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.82 on 112 degrees of freedom
## Multiple R-squared: 0.1969, Adjusted R-squared: 0.1825
## F-statistic: 13.73 on 2 and 112 DF, p-value: 4.659e-06
equation1=function(x){coef(fit1)[2]*x+coef(fit1)[1]}
equation2=function(x){coef(fit1)[2]*x+coef(fit1)[1]+coef(fit1)[3]}
ggplot(radial,aes(y=NTAV,x=age,color=sex))+geom_point()+
stat_function(fun=equation1,geom="line",color=scales::hue_pal()(2)[1])+
stat_function(fun=equation2,geom="line",color=scales::hue_pal()(2)[2])Again, the ggpredict version is interactive and looks great.
These plots are also possible with models with interactions. Now you can use age and DM(diabetes mellitus) and interaction between age and DM as predcitor variables.
##
## Call:
## lm(formula = NTAV ~ age * DM, data = radial)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.094 -15.115 -4.093 9.102 102.024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.6463 16.8660 2.944 0.00395 **
## age 0.2925 0.2648 1.105 0.27174
## DM -20.8618 34.8936 -0.598 0.55115
## age:DM 0.3453 0.5353 0.645 0.52026
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.1 on 111 degrees of freedom
## Multiple R-squared: 0.0292, Adjusted R-squared: 0.002966
## F-statistic: 1.113 on 3 and 111 DF, p-value: 0.347
We can use the glm function to make logistic regression model. We’ll use data from a German Breast Cancer study and will try to predict survival with number of positive nodes and hormonal therapy. Model summary included.
require(TH.data) # for use data GBSG2
fit5=glm(cens~pnodes*horTh,data=GBSG2,family=binomial)
summary(fit5)##
## Call:
## glm(formula = cens ~ pnodes * horTh, family = binomial, data = GBSG2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7892 -1.0208 -0.7573 1.2288 1.6667
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.55368 0.13942 -3.971 7.15e-05 ***
## pnodes 0.08672 0.02172 3.993 6.53e-05 ***
## horThyes -0.69833 0.25394 -2.750 0.00596 **
## pnodes:horThyes 0.06306 0.03899 1.617 0.10582
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 939.68 on 685 degrees of freedom
## Residual deviance: 887.69 on 682 degrees of freedom
## AIC: 895.69
##
## Number of Fisher Scoring iterations: 4
That’s all I have for now. I hope you picked up something you can use in a future analysis. I’ve provided some useful links below.
ggpredict - https://cran.r-project.org/web/packages/ggiraphExtra/vignettes/ggPredict.html
Annotate ggplot - https://sejohnston.com/2012/08/09/a-quick-and-easy-function-to-plot-lm-results-in-r/