The dataset I will use describes diamonds on the Singapore market in the year 2000. It comes from the 'Ecdat' package, which contains many datasets related to econometrics. The dataset has 308 observations of 5 variables: carat, colour, clarity, certification, and price of the diamonds on the market at that time.
Load in Data
The data are already in data.frame format.
# load in data
library(Ecdat)
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
data(Diamond)
# variables within the dataset
summary(Diamond)
##      carat        colour  clarity   certification     price
##  Min.   :0.1800   D:16    IF  :44   GIA:151       Min.   :  638
##  1st Qu.:0.3500   E:44    VS1 :81   HRD: 79       1st Qu.: 1625
##  Median :0.6200   F:82    VS2 :53   IGI: 78       Median : 4215
##  Mean   :0.6309   G:65    VVS1:52                 Mean   : 5019
##  3rd Qu.:0.8500   H:61    VVS2:78                 3rd Qu.: 7446
##  Max.   :1.1000   I:40                            Max.   :16008
The independent variable is the carat of the diamond, and the dependent variable is its price. My guess is that the carat of a diamond will be the best predictor of its price on the market.
My \( H_0 \) is that the carat of a given diamond has no linear effect on its listed price.
My linear model tests whether there is a linear relationship between the carat (a measure of weight) of a diamond and its price. If the model explains enough of the variance, it could be used to predict the price of a diamond from its carat.
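Written out, the model and hypotheses might look like the following. The symbols \( \beta_0 \), \( \beta_1 \), and \( \varepsilon_i \) are notation introduced here for illustration, and the two-sided alternative is an assumption (it matches the t-test that `summary()` reports for each coefficient):

\[
\text{price}_i = \beta_0 + \beta_1 \, \text{carat}_i + \varepsilon_i,
\qquad
H_0 : \beta_1 = 0
\quad \text{vs.} \quad
H_A : \beta_1 \neq 0
\]

Here \( b_0 \) and \( b_1 \) in the output below are the least-squares estimates of \( \beta_0 \) and \( \beta_1 \).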
fit <- lm(price ~ carat, data = Diamond)
plot(Diamond$carat, Diamond$price, main = "Diamond Carats vs Price", xlab = 'Carat', ylab = 'Price', pch = 21, bg = 'gold', ylim = c(0,16000))
abline(fit, lwd = 2)
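# 95% confidence band for the fitted line; confint() bounds describe the coefficients, not the line
carat_grid <- data.frame(carat = seq(min(Diamond$carat), max(Diamond$carat), length.out = 100))
band <- predict(fit, newdata = carat_grid, interval = "confidence")
matlines(carat_grid$carat, band[, c("lwr", "upr")], col = "red", lty = 2, lwd = 2)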
Linear Model Summary
summary(fit)
##
## Call:
## lm(formula = price ~ carat, data = Diamond)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2264.7 -604.3 -116.1 435.1 6591.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2298.4 158.5 -14.50 <2e-16 ***
## carat 11598.9 230.1 50.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1118 on 306 degrees of freedom
## Multiple R-squared: 0.8925, Adjusted R-squared: 0.8922
## F-statistic: 2541 on 1 and 306 DF, p-value: < 2.2e-16
\( b_0 \) is -2298.4 Singapore dollars. This would be the predicted price of a diamond of 0 carats, which is an extrapolation, since the smallest diamond in the data is 0.18 carats. Prices cannot be negative, so the linear model is not reliable for very small diamonds.
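As a rough check of what the fitted line implies, the sketch below plugs a few carat values into the rounded coefficients printed by `summary(fit)`. The helper `predict_price` is a hypothetical name used only for this illustration, and the carat values are the minimum, median, and maximum from `summary(Diamond)`.

```r
# Coefficients copied from the summary(fit) output (rounded as printed)
b0 <- -2298.4   # intercept, Singapore dollars
b1 <- 11598.9   # slope, Singapore dollars per carat

# Hypothetical helper, not part of the original analysis
predict_price <- function(carat) b0 + b1 * carat

# Minimum, median, and maximum carat from summary(Diamond)
carats <- c(0.18, 0.62, 1.10)
predicted <- predict_price(carats)
print(data.frame(carat = carats, predicted_price = predicted))

# Carat weight at which the fitted line crosses a price of zero
zero_crossing <- -b0 / b1
print(zero_crossing)
```

With these rounded coefficients the line crosses zero at just under 0.2 carats, so the fitted price is already negative for the smallest diamond in the data (0.18 carats), not only at 0 carats.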
\( b_1 \) is 11598.9 Singapore dollars per carat: the predicted price rises by this amount for each one-carat increase in weight, or about 1160 Singapore dollars per 0.1 carat. The fitted line seems to follow the data fairly well, especially in the lower to middle carat range. The residual standard error is 1118, meaning that observed prices typically deviate from the fitted line by roughly 1118 Singapore dollars.
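To show how the slope's t value and uncertainty follow from the printed table, here is a minimal sketch using only the rounded numbers from `summary(fit)`. The variable names are chosen for this illustration, and the results may differ slightly from `confint(fit)` because of rounding.

```r
# Values copied from the coefficient table and residual degrees of freedom
b1       <- 11598.9   # slope estimate
se_b1    <- 230.1     # standard error of the slope
df_resid <- 306       # residual degrees of freedom

# t value = estimate / standard error
t_value <- b1 / se_b1
print(t_value)

# Approximate 95% confidence interval for the slope
t_crit <- qt(0.975, df = df_resid)
ci_slope <- b1 + c(-1, 1) * t_crit * se_b1
print(ci_slope)
```

The t value reproduces the 50.41 shown in the table, and the interval gives a plausible range for the price increase per carat under the model's assumptions.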
\( r^2 \) is 0.8925, meaning that about 89% of the variance in price is explained by carat. This is very high.
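The other fit statistics in the summary can be recovered from \( r^2 \) and the sample size. The sketch below uses the standard formulas for a model with one predictor and the rounded \( r^2 \) from the output, so the results match the printed values only approximately.

```r
# Inputs taken from the summary output
r2 <- 0.8925   # multiple R-squared (rounded as printed)
n  <- 308      # number of observations
p  <- 1        # number of predictors (carat)

df_resid <- n - p - 1   # residual degrees of freedom

# Adjusted R-squared penalises for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
print(adj_r2)

# F statistic expressed through R-squared
f_stat <- (r2 / (1 - r2)) * (df_resid / p)
print(f_stat)

# Upper-tail p-value of the F statistic
p_value <- pf(f_stat, p, df_resid, lower.tail = FALSE)
print(p_value < 2.2e-16)
```

With a single predictor, the adjusted value barely differs from \( r^2 \), and the F statistic is simply the square of the slope's t value.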
One thing I would note is that the model predicts less well at higher carat weights. My guess is that other factors, such as the clarity and certification of a given diamond, come into play at this level. The p-value for the F-test on the regression model is very low (< 2.2e-16), so I reject \( H_0 \) and conclude that there is a statistically significant linear relationship between carat and price.
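One way to check the impression that the fit worsens for larger diamonds is to compare the spread of the residuals across carat ranges. The sketch below assumes the Ecdat package is installed, as above. Splitting at the quartiles of carat is an arbitrary choice made for illustration, and the names `carat_group` and `resid_sd` are introduced here.

```r
library(Ecdat)   # assumes Ecdat is installed, as in the analysis above
data(Diamond)

# Same model as in the analysis
fit <- lm(price ~ carat, data = Diamond)

# Illustrative grouping: split carat at its quartiles (an assumption)
carat_group <- cut(Diamond$carat,
                   breaks = quantile(Diamond$carat, probs = seq(0, 1, 0.25)),
                   include.lowest = TRUE)

# Standard deviation of the residuals within each carat group
resid_sd <- tapply(resid(fit), carat_group, sd)
print(round(resid_sd))

# Residuals against carat; a widening spread suggests non-constant variance
plot(Diamond$carat, resid(fit), xlab = "Carat", ylab = "Residual",
     main = "Residuals vs Carat", pch = 21, bg = "gold")
abline(h = 0, lty = 2)
```

If the residual spread grows with carat, that would support the idea that weight alone is less informative for larger diamonds.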