diamonds <- read.csv(url("http://gattonweb.uky.edu/sheather/book/docs/datasets/diamonds.txt"),sep = '')
plot(Price ~ Size, data=diamonds )
The data look reasonably linearly related when plotted as a scatter. There appears to be some clustering / local variability around certain points. I wonder whether this may be suggestive of non-constant variance
m <- lm(Price ~ Size, data=diamonds )
summary(m)
##
## Call:
## lm(formula = Price ~ Size, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.654 -21.503 -1.203 16.797 79.295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -258.05 16.94 -15.23 <2e-16 ***
## Size 3715.02 80.41 46.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.6 on 47 degrees of freedom
## Multiple R-squared: 0.9785, Adjusted R-squared: 0.978
## F-statistic: 2135 on 1 and 47 DF, p-value: < 2.2e-16
plot(m$residuals,main="Residuals")
plot(rstandard(m),main="Standardized Residuals")
plot(rstandard(m) ~ diamonds$Size,main="Standardized Residuals vs Size")
hist(m$residuals,main="Residuals Histogram")
qqnorm(m$residuals)
qqline(m$residuals)
The model actually looks reasonably decent without any transformation of the data: the summary shows that it’s predictive, the residuals look reasonably normally distributed (although the dispersion appears as though it may get narrower as Size/Price rises), the residuals look reasonably normally distributed based on the histogram and qqnorm plot also.
We do, however, see strong evidence of non-constant variance indicated in the “Standard Residuals vs. Size” plot. This gives us some guidance as to how we might want to modify the data
We have multiple observations of “Price” at the same “Size”, so given this, i’ll try taking the sqrt of price. Note that i tried numerous other things too, but this one caused the least degredation.
m2 <- lm(sqrt(Price) ~ Size, data=diamonds )
summary(m2)
##
## Call:
## lm(formula = sqrt(Price) ~ Size, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.98882 -0.44362 -0.06248 0.45273 1.87691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8926 0.4029 14.63 <2e-16 ***
## Size 78.4663 1.9122 41.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7515 on 47 degrees of freedom
## Multiple R-squared: 0.9728, Adjusted R-squared: 0.9723
## F-statistic: 1684 on 1 and 47 DF, p-value: < 2.2e-16
plot(m2$residuals,main="Residuals")
plot(rstandard(m2),main="Standardized Residuals")
plot(rstandard(m2) ~ diamonds$Size,main="Standardized Residuals vs Size")
hist(m2$residuals,main="Residuals Histogram")
qqnorm(m2$residuals)
qqline(m2$residuals)
The model produces more-or-less the same results as without the transform. We see a slight degredaiton in model performance as measured by R2, a slight improvement in the histogram of the residuals (it looks more normal) and the qqnorm plot looks a little bit worse, but very similar
For this particular application, I would opt for the simpler model (m1) given that the outputs of the 2 models are extremely similar, but the simple model is less complicated.