Simple Linear Regression Project

Sam Rose

Dataset

The dataset I will use describes the diamond market in Singapore in the year 2000. It comes from the 'Ecdat' package, which collects many datasets related to econometrics.

Hypothesis

The independent variable is the carat of the diamond, and the dependent variable is its price. My \( H_0 \) is that the carat of a given diamond has no effect on its listed price.
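
Stated formally, the model and hypotheses can be sketched as follows. Here \( \beta_0 \) and \( \beta_1 \) are my notation for the population intercept and slope, and the two-sided alternative \( H_1 \) is an assumption, since only \( H_0 \) is stated above.

\[
\text{price}_i = \beta_0 + \beta_1 \, \text{carat}_i + \varepsilon_i,
\qquad
H_0 : \beta_1 = 0
\quad \text{vs.} \quad
H_1 : \beta_1 \neq 0
\]

Rejecting \( H_0 \) would mean the data show a linear association between carat and price.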

Load in data

Data already in data.frame format

# load in data
library(Ecdat)
## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
data(Diamond)
# variables within the dataset
colnames(Diamond)
## [1] "carat"         "colour"        "clarity"       "certification"
## [5] "price"

Fit linear model

My linear model tests whether there is a linear relationship between the carats of a diamond (a measure of its weight) and its price. If the model explains enough of the variance, one could use it to predict the price of a diamond from its carats.

fit <- lm(price ~ carat, data = Diamond)
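
To look at the estimates before the full summary, the coefficients and their 95% confidence intervals can be pulled from the fitted object. This is a small sketch that repeats the load and fit steps so that it runs on its own; it assumes the 'Ecdat' package is installed.

```r
library(Ecdat)
data(Diamond)
fit <- lm(price ~ carat, data = Diamond)

# point estimates for the intercept and the carat slope
coef(fit)

# 95% confidence intervals, one row per coefficient
confint(fit, level = 0.95)
```

Each row of the `confint()` result describes the uncertainty in one coefficient on its own, not a band around the fitted line.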

Plots

plot(Diamond$carat, Diamond$price, main = "Diamond Carats vs Price", xlab = 'Carat', ylab = 'Price', pch = 21, bg = 'gold', ylim = c(0,16000))
abline(fit, lwd = 2)
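# 95% confidence band for the fitted line; lines drawn through the confint()
# bounds of the coefficients would not form a valid band
carat_grid <- data.frame(carat = seq(min(Diamond$carat), max(Diamond$carat), length.out = 100))
band <- predict(fit, newdata = carat_grid, interval = "confidence")
matlines(carat_grid$carat, band[, c("lwr", "upr")], col = "red", lty = 2, lwd = 2)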

[Figure: scatter plot "Diamond Carats vs Price" with the fitted regression line]

Linear Model Summary

summary(fit)
## 
## Call:
## lm(formula = price ~ carat, data = Diamond)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2264.7  -604.3  -116.1   435.1  6591.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2298.4      158.5  -14.50   <2e-16 ***
## carat        11598.9      230.1   50.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1118 on 306 degrees of freedom
## Multiple R-squared:  0.8925, Adjusted R-squared:  0.8922 
## F-statistic:  2541 on 1 and 306 DF,  p-value: < 2.2e-16
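
As a quick consistency check, several figures in this summary can be rebuilt from the others. The sketch below uses only the rounded values printed above, so the results agree with the summary only up to rounding.

```r
# values copied from the summary output
slope    <- 11598.9
slope_se <- 230.1
resid_df <- 306

t_value <- slope / slope_se   # t statistic for the carat coefficient
f_value <- t_value^2          # with a single predictor, F equals t squared
n_obs   <- resid_df + 2       # two estimated coefficients: intercept and slope

round(t_value, 2)
round(f_value)
n_obs
```

The t value and F-statistic come out at the reported 50.41 and 2541, and the residual degrees of freedom imply 308 diamonds in the dataset.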

Interpretation

\( b_0 \) is -2298.4 Singapore dollars, the predicted price of a diamond of 0 carats. A price cannot be negative, and no diamond weighs 0 carats, so the intercept is an extrapolation rather than a meaningful price.

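\( b_1 \) is 11598.9 Singapore dollars, the expected change in price for each 1-carat change in weight. Its p-value is below 2e-16, so I reject \( H_0 \): carat has a significant linear relationship with price. The line seems to fit the data pretty well, especially in the lower to middle range.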
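
To make the two coefficients concrete, here is a minimal sketch that plugs a few carat values into the fitted line. The carat values are hypothetical, picked only for illustration, and the coefficients are the rounded ones from the summary.

```r
# coefficients copied from the summary output
b0 <- -2298.4
b1 <- 11598.9

# hypothetical carat values chosen for illustration
carats <- c(0.25, 0.5, 1.0)
predicted_price <- b0 + b1 * carats
data.frame(carat = carats, predicted_price = predicted_price)

# carat at which the fitted line crosses a price of zero
zero_price_carat <- -b0 / b1
round(zero_price_carat, 3)
```

The line crosses zero at roughly 0.2 carats, so any diamond lighter than that would get a negative predicted price. With the fitted object at hand, `predict(fit, newdata = data.frame(carat = carats))` gives the same kind of prediction from the unrounded coefficients.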

\( r^2 \) is 0.8925, meaning that about 89% of the variance in price is explained by carat. This is very high.
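
In a regression with a single predictor, \( r^2 \) is the square of the Pearson correlation between the two variables. A short sketch to confirm this on the data, assuming the 'Ecdat' package is installed:

```r
library(Ecdat)
data(Diamond)

# Pearson correlation between carat and price
r <- cor(Diamond$carat, Diamond$price)

# squaring it gives the R-squared of the simple regression
r^2
```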

One thing to note is that the model does not predict as well for higher-carat diamonds. My guess is that other factors, such as the clarity and certification of a given diamond, come into play at this level.
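
One way to probe this guess is to plot the residuals against carat and then compare the fit with a model that also uses the other columns. This is only a sketch: the fuller model `fit_full` is hypothetical and not part of the analysis above, and the code assumes the 'Ecdat' package is installed.

```r
library(Ecdat)
data(Diamond)
fit <- lm(price ~ carat, data = Diamond)

# residuals against carat: a widening or curved pattern at high carats
# would support the observation above
plot(Diamond$carat, resid(fit), xlab = 'Carat', ylab = 'Residual', pch = 21, bg = 'gold')
abline(h = 0, lty = 2)

# hypothetical fuller model that adds the other columns as predictors
fit_full <- lm(price ~ carat + colour + clarity + certification, data = Diamond)

# compare how much variance each model explains
c(carat_only = summary(fit)$r.squared, full = summary(fit_full)$r.squared)
```

Adding predictors can never lower \( r^2 \), so the adjusted value in `summary(fit_full)` is the fairer figure for comparing the two models.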