"" ""

Load and Inspect the Data

diamonds <-  read.csv(url("http://gattonweb.uky.edu/sheather/book/docs/datasets/diamonds.txt"),sep = '')

plot(Price ~ Size, data=diamonds )

The data look reasonably linearly related when plotted as a scatter. There appears to be some clustering / local variability around certain points. I wonder whether this may be suggestive of non-constant variance

Building Model 1

m <- lm(Price ~ Size, data=diamonds )

summary(m)
## 
## Call:
## lm(formula = Price ~ Size, data = diamonds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.654 -21.503  -1.203  16.797  79.295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -258.05      16.94  -15.23   <2e-16 ***
## Size         3715.02      80.41   46.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.6 on 47 degrees of freedom
## Multiple R-squared:  0.9785, Adjusted R-squared:  0.978 
## F-statistic:  2135 on 1 and 47 DF,  p-value: < 2.2e-16
plot(m$residuals,main="Residuals")

plot(rstandard(m),main="Standardized Residuals")

plot(rstandard(m) ~ diamonds$Size,main="Standardized Residuals vs Size")

hist(m$residuals,main="Residuals Histogram")

qqnorm(m$residuals)
qqline(m$residuals) 

The model actually looks reasonably decent without any transformation of the data: the summary shows that it’s predictive, the residuals look reasonably normally distributed (although the dispersion appears as though it may get narrower as Size/Price rises), the residuals look reasonably normally distributed based on the histogram and qqnorm plot also.

We do, however, see strong evidence of non-constant variance indicated in the “Standard Residuals vs. Size” plot. This gives us some guidance as to how we might want to modify the data

Building Model 2

We have multiple observations of “Price” at the same “Size”, so given this, i’ll try taking the sqrt of price. Note that i tried numerous other things too, but this one caused the least degredation.

m2 <- lm(sqrt(Price) ~ Size, data=diamonds )

summary(m2)
## 
## Call:
## lm(formula = sqrt(Price) ~ Size, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.98882 -0.44362 -0.06248  0.45273  1.87691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.8926     0.4029   14.63   <2e-16 ***
## Size         78.4663     1.9122   41.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7515 on 47 degrees of freedom
## Multiple R-squared:  0.9728, Adjusted R-squared:  0.9723 
## F-statistic:  1684 on 1 and 47 DF,  p-value: < 2.2e-16
plot(m2$residuals,main="Residuals")

plot(rstandard(m2),main="Standardized Residuals")

plot(rstandard(m2) ~ diamonds$Size,main="Standardized Residuals vs Size")

hist(m2$residuals,main="Residuals Histogram")

qqnorm(m2$residuals)
qqline(m2$residuals) 

The model produces more-or-less the same results as without the transform. We see a slight degredaiton in model performance as measured by R2, a slight improvement in the histogram of the residuals (it looks more normal) and the qqnorm plot looks a little bit worse, but very similar

Selecting a Model

For this particular application, I would opt for the simpler model (m1) given that the outputs of the 2 models are extremely similar, but the simple model is less complicated.