Introduction
Students at the beginning of PREDICT 411 (generalized linear models) tend to lean towards reporting adjusted R-squared as their model evaluation criterion, whereas grades are assigned according to some other, mysterious, error measurement computed by the professor.
The goal of adjusted R-squared is to account for the amount of variance explained by a given model without encouraging over-fitting, which occurs when the model fits the sample data very well but fits poorly out of sample. Unfortunately, over-fitting can still happen when using adjusted R-squared, as opposed to something like cross-validated error. The task here is to compare adjusted R-squared to this error measure and evaluate any patterns we see.
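For reference, the adjustment penalizes plain R-squared for the number of predictors p relative to the sample size n. A minimal sketch in R (the adj_r2 helper and the example numbers are illustrative, not from the course materials):

# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}
# With n = 50 observations, piling on predictors erodes the adjusted
# value even when raw R-squared stays at 0.60:
adj_r2(0.60, n = 50, p = 3)   # ~0.57
adj_r2(0.60, n = 50, p = 20)  # ~0.32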
The data
Prof. Wilck has made 45 example records available that contain his error metric and past students' self-reported adjusted R-squared values.
library(readr)
library(dplyr)
setwd('C:/Users/joshy/Google Drive/northwestern/PREDICT-411/general-docs/arsq-eval/')
dat <- read_csv('Adjusted_Rsquared.csv')
Parsed with column specification:
cols(
  ADJ_R2 = col_double(),
  SCORE = col_double()
)
head(dat)
Plotting the model health scores (using log scales to normalize the distributions) demonstrates that students may actually be selling themselves short by reporting adjusted R-squared! Adjusted R-squared spans a wide range of values, while Prof. Wilck's score sits more consistently at the low end of its distribution.
library(ggplot2)
library(yaztheme)

# cor(log(dat))^2
scatter <- ggplot(dat, aes(log(ADJ_R2), log(SCORE))) +
  geom_point(size = 3, alpha = .7, color = yaz_cols[3]) +
  theme_yaz() +
  labs(title = 'Distributions of Model Health Metrics')
scatter

At the risk of being too on-the-nose, the R-squared value for the line of best fit on the above scatter plot is just .14, meaning adjusted R-squared explains only 14% of the variance in Prof. Wilck's out-of-sample error metric!
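This mirrors the commented-out cor(log(dat))^2 line in the plotting chunk; either expression below reproduces the figure, since the squared correlation of two variables equals the R-squared of the simple linear fit between them:

# Squared correlation of the logged metrics...
cor(log(dat$ADJ_R2), log(dat$SCORE))^2
# ...equivalently, R-squared from a simple linear regression
summary(lm(log(SCORE) ~ log(ADJ_R2), data = dat))$r.squared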
Why does this matter?
By definition, predictive models are meant to be deployed out of sample. If a model fits in-sample but fails when applied to new data, we risk making bad decisions. On the other hand, if a model appears to fit poorly in-sample but performs well out of sample, we risk the opportunity cost of passing up a good tool in our decision-making process. My preferred model health metric is some form of hold-out sample error testing, such as root mean squared error (RMSE) computed on data the model never saw during fitting.
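A minimal sketch of that workflow in R, using the built-in mtcars data (the 70/30 split and the mpg model are illustrative stand-ins, not part of the course assignment):

set.seed(411)
# Hold out 30% of the rows before fitting anything
train_idx <- sample(nrow(mtcars), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]
holdout <- mtcars[-train_idx, ]

# Fit in-sample, then predict on the unseen hold-out rows
fit <- lm(mpg ~ wt + hp, data = train)
preds <- predict(fit, newdata = holdout)

# Root mean squared error on the hold-out set
sqrt(mean((holdout$mpg - preds)^2))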