Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
This analysis looks at the diamonds dataset and measures the relationship between the carat and price attributes. The diamonds dataset contains the prices and attributes of more than 54,000 diamonds.
library(ggplot2)  # provides the diamonds dataset
plot(diamonds$carat, diamonds$price, main="Diamonds",
xlab="Carat", ylab="Price")
diamonds.lm <- lm(price ~ carat, data=diamonds)
summary(diamonds.lm)
##
## Call:
## lm(formula = price ~ carat, data = diamonds)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -18585.3   -804.8    -18.9    537.4  12731.7
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2256.36      13.06  -172.8   <2e-16 ***
## carat        7756.43      14.07   551.4   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
## F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
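To make the fitted line concrete, the short sketch below plugs the coefficients from the summary output into the equation price = intercept + slope * carat. The helper name `predict_price` is hypothetical and exists only for illustration; the coefficient values are copied from the output above.

```r
# Coefficients copied from the summary output above.
intercept <- -2256.36
slope <- 7756.43

# Hypothetical helper: predicted price for a given carat weight,
# using the fitted line price = intercept + slope * carat.
predict_price <- function(carat) {
  intercept + slope * carat
}

# Predicted prices for a few example carat weights.
print(predict_price(c(0.5, 1, 2)))

# Carat weight at which the fitted line crosses zero. Below this
# weight the line predicts a negative price, which is one reason
# to inspect the residuals before trusting the model.
print(-intercept / slope)
```

Because the intercept is negative, the line gives impossible (negative) prices for very small stones, so the fit is best read as a description of the overall trend rather than a pricing rule.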
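# Residuals against fitted values: look for curvature or a changing spread
plot(fitted(diamonds.lm),resid(diamonds.lm))
# Normal Q-Q plot of the residuals with a reference line
qqnorm(resid(diamonds.lm))
qqline(resid(diamonds.lm))
# The four standard lm diagnostic plots in a 2x2 grid
par(mfrow=c(2,2))
plot(diamonds.lm)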
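The diagnostic plots above are read by eye. As a rough numeric companion, the sketch below groups the residuals into bins of fitted value and reports the mean and standard deviation of the residuals in each bin. The function name `residual_spread` and the choice of five bins are assumptions made for illustration, not part of the original analysis. Under the linear model's assumptions, the bin means should sit near zero and the bin standard deviations should be roughly equal.

```r
# Hypothetical helper: summarise residuals within quantile bins of the
# fitted values. Works on any model that supports resid() and fitted().
residual_spread <- function(model, n_bins = 5) {
  res <- resid(model)
  fit <- fitted(model)
  # Quantile-based break points; unique() guards against tied quantiles.
  breaks <- unique(quantile(fit, probs = seq(0, 1, length.out = n_bins + 1)))
  bins <- cut(fit, breaks = breaks, include.lowest = TRUE)
  data.frame(
    bin = levels(bins),
    n = as.vector(table(bins)),
    mean_resid = as.vector(tapply(res, bins, mean)),
    sd_resid = as.vector(tapply(res, bins, sd))
  )
}

# Usage on the model from this analysis (runs only if ggplot2 is installed).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  diamonds.lm <- lm(price ~ carat, data = ggplot2::diamonds)
  print(residual_spread(diamonds.lm))
}
```

A standard deviation that grows steadily from the lowest bin to the highest would suggest non-constant variance, and bin means that drift away from zero would suggest curvature that a straight line does not capture.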
7756.43/14.07
## [1] 551.2743
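The slope estimate divided by its standard error gives a ratio of 551.2743. This is the t statistic for carat (the summary reports 551.4; the small difference comes from rounding in the displayed estimate and standard error). The large ratio tells us that the slope is estimated very precisely relative to its sampling variability. Because the p-value is less than 0.05, we can conclude that there is a statistically significant linear relationship between a diamond's carat and its price, and the R-squared of 0.8493 indicates that carat explains about 85% of the variation in price. For this pair of attributes the linear model was appropriate.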
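The ratio discussed above can be tied back to the p-value in the summary table. The sketch below recomputes the t statistic from the reported estimate and standard error, derives a two-sided p-value from the t distribution with 53938 residual degrees of freedom, and adds an approximate 95% confidence interval for the slope. The variable names are illustrative, and the inputs are the rounded values printed in the summary, so the results are approximate.

```r
# Values copied from the summary output (rounded as displayed).
estimate <- 7756.43   # slope for carat
std_error <- 14.07    # standard error of the slope
df_resid <- 53938     # residual degrees of freedom

# t statistic: estimate divided by its standard error.
t_value <- estimate / std_error

# Two-sided p-value from the t distribution. For a statistic this
# large the tail probability is far smaller than the 2e-16 threshold
# that summary() prints.
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# Approximate 95% confidence interval for the slope.
ci <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

print(t_value)
print(p_value)
print(ci)
```

With the original model object available, `confint(diamonds.lm)` returns the same kind of interval from the unrounded estimates.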