The dataset I will use describes characteristics of the diamond market in Singapore in the year 2000. It comes from the 'Ecdat' package, which contains many datasets related to econometrics.
The independent variable I am using is the carat of the diamond, and the dependent variable is price. My \( H_0 \) is that the carat of a given diamond has no effect on its listed price.
The data are already in data.frame format.
# load in data
library(Ecdat)
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
data(Diamond)
# variables within the dataset
colnames(Diamond)
## [1] "carat" "colour" "clarity" "certification"
## [5] "price"
My linear model tests whether there is a linear relationship between the carats of a diamond (a measure of its weight) and its price. If the model explains enough of the variance, one could use it to predict the price of a diamond from its carats.
fit <- lm(price ~ carat, data = Diamond)
plot(Diamond$carat, Diamond$price, main = "Diamond Carats vs Price", xlab = 'Carat', ylab = 'Price', pch = 21, bg = 'gold', ylim = c(0,16000))
abline(fit, lwd = 2)
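# Dashed red lines use the lower and upper 95% limits of both coefficients at once.
# They sit below and above the fitted line for positive carat values,
# but they are not a pointwise confidence band for the fitted mean.
abline(confint(fit)[,1],col="red", lty = 2, lwd = 2)
abline(confint(fit)[,2],col="red", lty = 2, lwd = 2)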
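A pointwise confidence band for the fitted line is usually obtained from predict() rather than from the coefficient limits. The following is a minimal sketch of that approach, not part of the original analysis. The helper name conf_band is hypothetical, and the demonstration uses the built-in cars data so that it runs without Ecdat installed.

```r
# Hypothetical helper: pointwise confidence band for a simple linear model.
conf_band <- function(fit, data, xname, n = 100, level = 0.95) {
  # Evenly spaced grid over the observed range of the predictor
  grid <- data.frame(seq(min(data[[xname]]), max(data[[xname]]), length.out = n))
  names(grid) <- xname
  # With interval = "confidence", predict() returns columns fit, lwr and upr
  band <- predict(fit, newdata = grid, interval = "confidence", level = level)
  cbind(grid, band)
}

# Demonstration on the built-in 'cars' data (assumption: stands in for Diamond)
demo_fit  <- lm(dist ~ speed, data = cars)
demo_band <- conf_band(demo_fit, cars, "speed")
head(demo_band)

# For the diamond model in this report, the band could be drawn with:
# band <- conf_band(fit, Diamond, "carat")
# lines(band$carat, band$lwr, col = "red", lty = 2, lwd = 2)
# lines(band$carat, band$upr, col = "red", lty = 2, lwd = 2)
```

The band is narrowest near the mean of the predictor and widens toward the ends of the observed range.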
Linear Model Summary
summary(fit)
##
## Call:
## lm(formula = price ~ carat, data = Diamond)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2264.7 -604.3 -116.1 435.1 6591.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2298.4 158.5 -14.50 <2e-16 ***
## carat 11598.9 230.1 50.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1118 on 306 degrees of freedom
## Multiple R-squared: 0.8925, Adjusted R-squared: 0.8922
## F-statistic: 2541 on 1 and 306 DF, p-value: < 2.2e-16
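The row for carat in the coefficient table is what bears on \( H_0 \). As a small check, assuming only the rounded values printed above, the t value and two-sided p-value can be recomputed by hand. The variable names here are my own.

```r
# Values copied from the summary(fit) output above (rounded as printed)
estimate  <- 11598.9   # slope for carat
std_error <- 230.1     # its standard error
df_resid  <- 306       # residual degrees of freedom

# t statistic for H0: slope = 0
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(t_value, 2)    # matches the printed t value of 50.41
p_value < 2e-16      # TRUE, consistent with the printed "<2e-16"

# With a single predictor, the F statistic is the square of the slope's t value
round(t_value^2)     # 2541, matching the printed F-statistic
```

A p-value this small means the data are very hard to reconcile with a slope of zero.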
\( b_0 \) is -2298.4 Singapore dollars. This would be the predicted price of a diamond of 0 carats. A price cannot be negative, so the model is inaccurate here; the intercept is an extrapolation to a weight that no real diamond has.
\( b_1 \) is 11598.9 Singapore dollars. This is the estimated increase in price for each additional carat. The line seems to fit the data well, especially in the lower to middle range of carats.
\( r^2 \) is 0.8925, meaning that about 89% of the variance in price is explained by carat. This is very high.
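To make the two coefficients concrete, here is a minimal sketch that plugs the rounded estimates into the fitted equation. The function name predict_price and the example carat values are my own choices for illustration.

```r
# Rounded coefficients copied from the summary(fit) output
b0 <- -2298.4    # intercept, Singapore dollars
b1 <- 11598.9    # slope, Singapore dollars per carat

# Fitted line: price = b0 + b1 * carat
predict_price <- function(carat) b0 + b1 * carat

# Predicted prices for hypothetical 0.5 and 1.0 carat diamonds
predict_price(c(0.5, 1.0))   # 3501.05 and 9300.50

# Carat value at which the fitted line crosses a price of zero
zero_price_carat <- -b0 / b1
round(zero_price_carat, 3)   # 0.198
```

Below roughly 0.2 carats the fitted line gives a negative price, which is the same problem seen in the intercept.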
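As a consistency check, \( r^2 \) can be recovered from the F statistic and its degrees of freedom, and the adjusted value from \( r^2 \) and the sample size. This is a small sketch using only the rounded numbers printed in the summary; the variable names are my own.

```r
# Values copied from the summary(fit) output (rounded as printed)
f_stat   <- 2541   # F statistic
df_model <- 1      # numerator degrees of freedom (one predictor)
df_resid <- 306    # residual degrees of freedom

# R-squared from F: R^2 = (p * F) / (p * F + residual df)
r2 <- (df_model * f_stat) / (df_model * f_stat + df_resid)
round(r2, 4)       # 0.8925

# Adjusted R-squared, with n = residual df + number of coefficients
n      <- df_resid + df_model + 1
adj_r2 <- 1 - (1 - 0.8925) * (n - 1) / df_resid
round(adj_r2, 4)   # 0.8922
```

The two values are nearly identical because the model has only one predictor and 308 observations.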
One observation I would note here is that the model does not predict as well at the higher carats. I am guessing that this is because other factors come into play at this level, such as the clarity and certification of a given diamond.
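One way to check this observation is to compare the size of the residuals across ranges of the predictor. The following is a minimal sketch, assuming equal-width bins are a reasonable grouping. The helper name rmse_by_bin is hypothetical, and the demonstration uses the built-in cars data so that it runs without Ecdat installed.

```r
# Hypothetical helper: root mean squared residual within bins of the predictor
rmse_by_bin <- function(fit, x, bins = 3) {
  # Cut the predictor into equal-width bins over its observed range
  groups <- cut(x, breaks = bins, include.lowest = TRUE)
  # Root mean squared residual within each bin
  tapply(residuals(fit), groups, function(r) sqrt(mean(r^2)))
}

# Demonstration on the built-in 'cars' data (assumption: stands in for Diamond)
demo_fit  <- lm(dist ~ speed, data = cars)
demo_rmse <- rmse_by_bin(demo_fit, cars$speed)
demo_rmse

# For the diamond model in this report, the same check would be:
# rmse_by_bin(fit, Diamond$carat)
# A model with the other suspected factors could be fitted with:
# lm(price ~ carat + clarity + certification, data = Diamond)
```

If the value for the highest bin is clearly larger than for the others, that would support the idea that the model predicts less well for larger diamonds.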