Day 22 Homework

Exercise 3.25:

Get data and fit complete second order model:

setwd("/Users/traves/Dropbox/SM339/day22")
data = read.csv("Diamonds.csv")
head(data)

##   Carat Color Clarity Depth PricePerCt TotalPrice
## 1  1.08     E     VS1  68.6       6693     7228.8
## 2  0.31     F    VVS1  61.9       3159      979.3
## 3  0.31     H     VS1  62.1       1755      544.1
## 4  0.32     F    VVS1  60.8       3159     1010.9
## 5  0.33     D      IF  60.8       4759     1570.4
## 6  0.33     G    VVS1  61.5       2896      955.6

attach(data)
C2 = lm(TotalPrice ~ 1 + Carat + Depth + I(Carat^2) + I(Depth^2) + I(Carat * 
    Depth))
C2woD = lm(TotalPrice ~ 1 + Carat + I(Carat^2))
anova(C2woD, C2)

## Analysis of Variance Table
## 
## Model 1: TotalPrice ~ 1 + Carat + I(Carat^2)
## Model 2: TotalPrice ~ 1 + Carat + Depth + I(Carat^2) + I(Depth^2) + I(Carat * 
##     Depth)
##   Res.Df      RSS Df Sum of Sq    F  Pr(>F)    
## 1    348 1.57e+09                              
## 2    345 1.45e+09  3  1.19e+08 9.43 5.2e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The hypothesis test is determining whether all the terms involving Depth actually help the model fit better (Null hypothesis is that extra terms don't help, Alternative hypothesis is that extra terms do help). Since the p-value is tiny, we reject the null hypothesis and conclude that removing the Depth terms significantly impair the complete model's effectiveness. It is harder to see this in the adjusted $ R^2 $ statistics since both are so high:

summary(C2)

## 
## Call:
## lm(formula = TotalPrice ~ 1 + Carat + Depth + I(Carat^2) + I(Depth^2) + 
##     I(Carat * Depth))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12196   -653    -39    486  10582 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      24338.82   30297.91    0.80    0.422    
## Carat             7573.62    3040.79    2.49    0.013 *  
## Depth             -728.70     904.44   -0.81    0.421    
## I(Carat^2)        4761.59     330.25   14.42   <2e-16 ***
## I(Depth^2)           5.28       6.73    0.78    0.433    
## I(Carat * Depth)   -83.89      53.53   -1.57    0.118    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2050 on 345 degrees of freedom
## Multiple R-squared:  0.931,  Adjusted R-squared:  0.93 
## F-statistic:  936 on 5 and 345 DF,  p-value: <2e-16

summary(C2woD)

## 
## Call:
## lm(formula = TotalPrice ~ 1 + Carat + I(Carat^2))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10207   -712   -168    355  12147 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -523        466   -1.12   0.2631    
## Carat           2386        752    3.17   0.0017 ** 
## I(Carat^2)      4498        263   17.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2130 on 348 degrees of freedom
## Multiple R-squared:  0.926,  Adjusted R-squared:  0.925 
## F-statistic: 2.17e+03 on 2 and 348 DF,  p-value: <2e-16

Exercise 3.26

Fit the quadratic model in Carat:

quad = lm(TotalPrice ~ 1 + Carat + I(Carat^2))
summary(quad)

## 
## Call:
## lm(formula = TotalPrice ~ 1 + Carat + I(Carat^2))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10207   -712   -168    355  12147 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -523        466   -1.12   0.2631    
## Carat           2386        752    3.17   0.0017 ** 
## I(Carat^2)      4498        263   17.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2130 on 348 degrees of freedom
## Multiple R-squared:  0.926,  Adjusted R-squared:  0.925 
## F-statistic: 2.17e+03 on 2 and 348 DF,  p-value: <2e-16

a. Fit:

new = data.frame(Carat = c(0.5), Depth = 62)
predict(quad, newdata = new)

##    1 
## 1795

The model predicts a total price of $1,794.84.

b. Predict mean total price (confidence interval):

new = data.frame(Carat = c(0.5), Depth = 62)
predict(quad, newdata = new, interval = "confidence", level = 0.95)

##    fit  lwr  upr
## 1 1795 1424 2165

The model gives a 95% confidence interval for the mean total price of such diamonds between $1,424.30 and $2165.40. We are 95% confident that the average total price for all such diamonds is between $1,424.30 and $2165.40.

c. Predict

new = data.frame(Carat = c(0.5), Depth = 62)
predict(quad, newdata = new, interval = "predict", level = 0.95)

##    fit   lwr  upr
## 1 1795 -2404 5994

The model gives a 95% confidence interval for the total price of an individual such diamond between $-2,404.46 and $5,994.15. We are 95% confident that the price for such a diamond will be less than $5,994.15.

d. First fit the log scale model:

logm = lm(log(TotalPrice) ~ 1 + Carat + Depth + I(Carat^2) + I(Depth^2) + I(Carat * 
    Depth))
new = data.frame(Carat = c(0.5), Depth = 62)
p1 = predict(logm, newdata = new, interval = "confidence", level = 0.95)
exp(p1)

##    fit  lwr  upr
## 1 1856 1781 1934

p2 = predict(logm, newdata = new, interval = "predict", level = 0.95)
exp(p2)

##    fit  lwr  upr
## 1 1856 1177 2926

Using the log model, we are 95% confident that the average total price of such diamonds is between $1,781 and $1934. We are also 95% confident that the total price of the individal diamond is between $1,177 and $2,926. Note how much more helpful this second interval is, compared with the interval in part c.