require(car)
## Loading required package: car
## Loading required package: carData
require(lmtest)
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
require(Hmisc)
## Loading required package: Hmisc
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
require(ResourceSelection)
## Loading required package: ResourceSelection
## ResourceSelection 0.3-6 2023-06-27
require(corrplot)
## Loading required package: corrplot
## corrplot 0.95 loaded
For the Week 6 Assignment, I was hoping that most of you would identify that the log transformation wasn’t the best choice in this case, and that you cannot compare fit between models whose dependent variables are on different scales (transformed vs. untransformed). In fact, we could have investigated Box-Cox power transformations to help.
Let’s look at the graphs of the data.
mydata=read.csv('C:/Users/lfult/Documents/_Courses/Boston College/Data Analysis/week 6 data.csv', stringsAsFactors = T)
kdepairs(mydata)
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
subdata=mydata[,c(1,3)]
kdepairs(subdata)
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
corrplot(cor(mydata), type="lower", is.corr=T)
In these kernel density estimate plots, you see the histogram and density for each variable on the diagonal. Below the diagonal are scatterplots overlaid with bivariate kernel density contours. It is often helpful to get these as close to bivariate normal as possible (the shape of an oval) through transformations. Above the diagonal are the correlation coefficients, sized by magnitude. We can see from the graphs that the data are a little messy. The panel for RVUs and Expenditures shows a compressed, oval-ish shape at the bottom right. The correlogram confirms powerful relationships, but are they linear? Can we do better?
LM=function(x,y){
  mylm=lm(y~x)                        # fit the simple linear regression y ~ x
  plot(y~x); abline(mylm,col="red")   # scatterplot with the fitted line
  myplot2=hist(mylm$residuals)        # histogram of the residuals
  print(myplot2)                      # print the histogram object (breaks, counts, density, ...)
  par(mfrow=c(2,3), ask=FALSE)        # 2x3 grid for the diagnostics; don't pause between plots
  print(plot(mylm, which=1:6))        # all six lm() diagnostic plots
  return(summary(mylm))
}
I originally just asked for a linear model of Expenditures ~ RVUs. Let’s look at that.
LM(mydata$RVUs,mydata$Expenditures)
## $breaks
## [1] -2e+08 -1e+08 0e+00 1e+08 2e+08 3e+08 4e+08 5e+08 6e+08 7e+08
##
## $counts
## [1] 18 148 204 8 2 2 1 0 1
##
## $density
## [1] 4.687500e-10 3.854167e-09 5.312500e-09 2.083333e-10 5.208333e-11
## [6] 5.208333e-11 2.604167e-11 0.000000e+00 2.604167e-11
##
## $mids
## [1] -1.5e+08 -5.0e+07 5.0e+07 1.5e+08 2.5e+08 3.5e+08 4.5e+08 5.5e+08
## [9] 6.5e+08
##
## $xname
## [1] "mylm$residuals"
##
## $equidist
## [1] TRUE
## NULL
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -185723026 -14097620 2813431 11919781 642218316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.785e+06 4.413e+06 -0.858 0.392
## x 2.351e+02 5.061e+00 46.449 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67350000 on 382 degrees of freedom
## Multiple R-squared: 0.8496, Adjusted R-squared: 0.8492
## F-statistic: 2157 on 1 and 382 DF, p-value: < 2.2e-16
This diagnostic analysis clearly indicates multiple severe violations of the linear regression assumptions. Heteroskedasticity is evident from the distinct fan-shaped pattern in the residuals vs. fitted values plot, and this observation is strongly reinforced by the scale-location plot, where the variability of the residuals systematically increases with larger predicted values. The Q-Q plot shows clear non-normality, with significant deviations from the theoretical normal quantiles. Additionally, influential points are identified through Cook’s distance: observation 256 notably surpasses the conventional threshold of 1, marking it as an influential outlier. Observations 135 and 384, although visually prominent, fall short of this threshold and thus aren’t considered influential by standard guidelines. Overall, this model fails substantially to meet key regression assumptions, suggesting a need for transformations, robust regression methods, or alternative modeling approaches.
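If you want a numeric complement to these visual diagnostics, something like the following sketch works (mylm1 is just a hypothetical name; it refits the same model outside LM() so bptest() and cooks.distance() can see it).
mylm1=lm(Expenditures~RVUs, data=mydata)  # same model LM() fits above
bptest(mylm1)                             # Breusch-Pagan test; null is homoskedastic
which(cooks.distance(mylm1)>1)            # observations over the D > 1 rule of thumb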
Then I asked for ln(Expenditures) ~ RVUs.
LM(mydata$RVUs,log(mydata$Expenditures))
## $breaks
## [1] -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5
##
## $counts
## [1] 2 12 56 100 147 63 4
##
## $density
## [1] 0.01041667 0.06250000 0.29166667 0.52083333 0.76562500 0.32812500 0.02083333
##
## $mids
## [1] -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25
##
## $xname
## [1] "mylm$residuals"
##
## $equidist
## [1] TRUE
## NULL
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.59439 -0.29504 0.06135 0.35333 1.20871
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.730e+01 3.325e-02 520.11 <2e-16 ***
## x 1.349e-06 3.814e-08 35.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5076 on 382 degrees of freedom
## Multiple R-squared: 0.7661, Adjusted R-squared: 0.7655
## F-statistic: 1251 on 1 and 382 DF, p-value: < 2.2e-16
Clearly, I’ve done it to myself again—this regression model exhibits multiple violations, despite the apparent effort to fit a curvilinear relationship. The residuals vs. fitted plot shows a pronounced curve, indicating model misspecification and a failure to adequately linearize the relationship between predictors and response. This is further corroborated by the scale-location plot, which shows increasing variance across fitted values—clear evidence of heteroskedasticity. The variance isn’t constant; it grows, meaning standard OLS assumptions are broken.
However, the Q-Q plot tells a more reassuring story—residuals fall mostly along the diagonal, suggesting that the normality assumption of the residuals isn’t dramatically violated. This indicates that the error structure is somewhat stable in terms of distributional shape, even if not variance.
Cook’s distance plot flags a few observations (e.g., 127, 255, and 383), but all are well below the critical threshold of 1, so none of them individually exerts undue influence on the model. Still, it’s worth noting their presence, particularly because observations 382 and 383 also show up with higher leverage in the residuals vs. leverage and leverage-Cook’s distance plots. These data points don’t trigger concern on their own but could be symptomatic of deeper structural issues in the model.
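To see those leverage values directly, a quick sketch (loglm is a hypothetical name; it refits the same log model outside LM()):
loglm=lm(log(Expenditures)~RVUs, data=mydata)   # same model as the LM() call above
head(sort(hatvalues(loglm), decreasing=TRUE))   # the largest leverage (hat) values
which(cooks.distance(loglm)>1)                  # confirms none exceed the D > 1 rule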
Then I asked you to investigate a log-log transformation. (Note the argument order in the call below: LM() takes x first, so this fits log(RVUs) on log(Expenditures).)
LM(log(mydata$Expenditures),log(mydata$RVUs))
## $breaks
## [1] -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0
##
## $counts
## [1] 1 2 15 25 46 96 95 67 31 5 1
##
## $density
## [1] 0.01302083 0.02604167 0.19531250 0.32552083 0.59895833 1.25000000
## [7] 1.23697917 0.87239583 0.40364583 0.06510417 0.01302083
##
## $mids
## [1] -1.1 -0.9 -0.7 -0.5 -0.3 -0.1 0.1 0.3 0.5 0.7 0.9
##
## $xname
## [1] "mylm$residuals"
##
## $equidist
## [1] TRUE
## NULL
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1107 -0.1667 0.0239 0.2249 0.8099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.22680 0.28033 -22.21 <2e-16 ***
## x 1.04240 0.01552 67.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3183 on 382 degrees of freedom
## Multiple R-squared: 0.9219, Adjusted R-squared: 0.9217
## F-statistic: 4512 on 1 and 382 DF, p-value: < 2.2e-16
Much better, right? This log-log model represents a noticeable improvement over previous versions, but subtle issues remain. The residuals vs. fitted plot shows a more symmetric spread of residuals, though a slight curve persists, suggesting minor misspecification. The Q-Q plot indicates that the residuals are approximately normally distributed, with only slight deviations in the tails—much improved from earlier fits. The scale-location plot supports this, showing relatively constant variance across fitted values, which suggests that heteroskedasticity has largely been addressed by the log transformation. Cook’s distance highlights a few observations (128, 256, and 384), but none exceed the conventional threshold of 1, indicating that no individual point is unduly influential. Similarly, the residuals vs. leverage plot shows no obvious pattern or concern, and the Cook’s distance vs. leverage plot reinforces this by showing that even high-leverage points fall well within acceptable Cook’s distance bounds. In short, the log-log transformation has substantially stabilized variance, improved normality, and reduced leverage concerns—but a slight curvature remains, suggesting that a Box-Cox transformation might further optimize model fit by fine-tuning the functional form.
Finally, I asked you to look at a bivariate Box-Cox transformation of both variables.
$$
y^{(\lambda)} = I_{\lambda>0}\,\frac{y^{\lambda}-1}{\lambda} + (1-I_{\lambda>0})\log(y)
$$
Here, \(I_{\lambda >0}\) is an indicator variable for \(\lambda>0\). When \(\lambda>0\), the value becomes \(\frac{y^\lambda-1}{\lambda}\). Otherwise, the value is \(\log(y)\).
So we optimize this function with respect to \(\lambda\). And we can do this for a matrix \(Y\) or a vector \(y\). For a matrix, we are simultaneously optimizing \(\lambda\) for each column.
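As a minimal sketch of the piecewise definition above (car's bcPower() is the packaged version; this scalar helper is purely illustrative):
bc=function(y, lambda){             # scaled power transform from the formula above
  if (lambda>0) (y^lambda-1)/lambda # power branch when lambda > 0
  else log(y)                       # log branch otherwise
}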
(myt=powerTransform(cbind(mydata$Expenditures,mydata$RVUs)~1)) # Box-Cox transformation; the variables must be strictly positive
## Estimated transformation parameters
## Y1 Y2
## -0.11521100 0.03179065
testTransform(myt, myt$lambda)
## LRT df pval
## LR test, lambda = (-0.12 0.03) 0 2 1
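Note that testing lambda at its own estimate, as above, trivially returns an LRT of 0 and a p-value of 1. A more informative check is whether the convenient log-log pair \(\lambda=(0,0)\) would be rejected:
testTransform(myt, c(0,0))  # LR test of lambda = (0, 0), i.e., the log-log transform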
LM(mydata$Expenditures^myt$lambda[1],mydata$RVUs^myt$lambda[2])
## $breaks
## [1] -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04
##
## $counts
## [1] 1 1 6 27 57 85 106 76 20 5
##
## $density
## [1] 0.2604167 0.2604167 1.5625000 7.0312500 14.8437500 22.1354167
## [7] 27.6041667 19.7916667 5.2083333 1.3020833
##
## $mids
## [1] -0.055 -0.045 -0.035 -0.025 -0.015 -0.005 0.005 0.015 0.025 0.035
##
## $xname
## [1] "mylm$residuals"
##
## $equidist
## [1] TRUE
## NULL
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.052317 -0.009508 0.000982 0.010563 0.035449
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.938212 0.006413 302.25 <2e-16 ***
## x -3.535821 0.050504 -70.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0146 on 382 degrees of freedom
## Multiple R-squared: 0.9277, Adjusted R-squared: 0.9275
## F-statistic: 4902 on 1 and 382 DF, p-value: < 2.2e-16
The Box-Cox transformation appears to have effectively resolved nearly all prior issues, producing a model that satisfies the major assumptions of linear regression. (One note on the coefficients: the estimated \(\lambda\) for Expenditures is negative, so Expenditures raised to that power decreases as Expenditures grows, which is why the slope is negative even though the underlying relationship is positive.) The residuals vs. fitted plot shows no clear pattern or curvature, with residuals symmetrically scattered around zero, suggesting that linearity has finally been achieved. The Q-Q plot further confirms this: residuals align closely with the theoretical quantiles, indicating that the assumption of normality is well met, with only minor deviation at the lower tail. The scale-location plot reveals a flat red trend line and a fairly even spread of √|standardized residuals|, which together suggest homoscedasticity has been restored. The Cook’s distance chart highlights a few points (e.g., 237, 293, 321), but all are well below the critical threshold of 1 and pose no undue influence. Residuals vs. leverage and the Cook’s distance vs. leverage plots reinforce this: while some leverage exists, it is moderate and, combined with low Cook’s values, presents no real threat to the model’s stability. Altogether, this Box-Cox model is well-behaved, statistically sound, and ready for inference or forecasting. This is a strong stopping point: finally, a model we can trust.
The recommended parameters?
Estimated transformation parameters: Y1 (Expenditures) = -0.11521100, Y2 (RVUs) = 0.03179065.
Since \(\lambda=0\) is defined as the log transform and both estimates are close to zero, the log-log is a good transform for interpretability. But I did ask for histograms!
par(mfrow=c(1,2))
hist(mydata$Expenditures^(-0.11521100)) # Expenditures at its estimated lambda
hist(mydata$RVUs^0.03179065)            # RVUs at its estimated lambda
Not terrible by themselves but beautiful together…
You can also run some statistical tests, but note that these are sensitive to sample size.
#shapiro.test(mylm$residuals)  # normality of residuals; null is normal
#fivenum(mylm$residuals)       # five-number summary of the residuals
#bptest(mylm)                  # Breusch-Pagan test; null is homoskedastic
#bp=length(x)*summary(lm(mylm$residuals^2~x))$r.squared # manual studentized Breusch-Pagan statistic
#(c(bp, 1-pchisq(bp,1)))       # the statistic and its p-value (chi-squared, 1 df)
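A runnable version for the final Box-Cox model might look like this (bcfit is a hypothetical name; it refits the model from the last LM() call so the residuals are in scope):
bcfit=lm(mydata$RVUs^myt$lambda[2]~I(mydata$Expenditures^myt$lambda[1]))
shapiro.test(residuals(bcfit))  # Shapiro-Wilk; null is normal residuals
bptest(bcfit)                   # Breusch-Pagan; null is homoskedastic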
So we have normally distributed residuals centered at zero but some heteroskedasticity. We might address this with splines or additional variables, but the eyeball test says it’s ok for our purposes.
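If we did want to chase that remaining heteroskedasticity, one option is a natural spline on the transformed predictor; here is a sketch (df=3 is an arbitrary choice, and splinefit is a hypothetical name):
require(splines)  # for ns()
splinefit=lm(mydata$RVUs^myt$lambda[2]~ns(mydata$Expenditures^myt$lambda[1], df=3))
summary(splinefit)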