Simple Linear Regression Project

Sam Rose

2/26/2015

1. Dataset

The dataset I will use describes the diamond market in Singapore during the year 2000. It comes from the 'Ecdat' package, which contains many datasets related to econometrics. The data set has 308 observations of 5 variables: carat, colour, clarity, certification, and price of diamonds on the market at that time.

Load in Data

Data already in data.frame format

# load in data
library(Ecdat)

## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange

data(Diamond)
# variables within the dataset
summary(Diamond)

##      carat        colour clarity   certification     price      
##  Min.   :0.1800   D:16   IF  :44   GIA:151       Min.   :  638  
##  1st Qu.:0.3500   E:44   VS1 :81   HRD: 79       1st Qu.: 1625  
##  Median :0.6200   F:82   VS2 :53   IGI: 78       Median : 4215  
##  Mean   :0.6309   G:65   VVS1:52                 Mean   : 5019  
##  3rd Qu.:0.8500   H:61   VVS2:78                 3rd Qu.: 7446  
##  Max.   :1.1000   I:40                           Max.   :16008

2. Hypothesis

The independent variable I am using is carat of the diamond, and the dependent variable I am using is price. My guess is that the carat of a diamond will be the best predictor of its price on the market.

My \( H_0 \) is that the carat of a given diamond has no effect on its listed price.

Fit linear model

My linear model tests whether there is a linear relationship between the carats of a diamond (a measure of its weight) and its price. If the model explains enough of the variance, one could use it to predict the price of a diamond from its carats.

fit <- lm(price ~ carat, data = Diamond)
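Written out, the model fitted here and the test behind \( H_0 \) can be stated as below. The symbols \( \beta_0 \), \( \beta_1 \) and \( \varepsilon_i \) are notation introduced for this sketch (they do not appear in the R output); \( b_0 \) and \( b_1 \) reported later are their estimates.

\[
\text{price}_i = \beta_0 + \beta_1 \, \text{carat}_i + \varepsilon_i,
\qquad
H_0 : \beta_1 = 0
\quad \text{vs.} \quad
H_1 : \beta_1 \neq 0
\]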

3. Plots

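# scatter plot of price against carat, with the fitted regression line
plot(Diamond$carat, Diamond$price, main = "Diamond Carats vs Price",
     xlab = "Carat", ylab = "Price", pch = 21, bg = "gold", ylim = c(0, 16000))
abline(fit, lwd = 2)
# red dashed lines use the lower and upper 95% confidence limits of the
# intercept and slope; they are not a pointwise confidence band for the mean
abline(confint(fit)[, 1], col = "red", lty = 2, lwd = 2)
abline(confint(fit)[, 2], col = "red", lty = 2, lwd = 2)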

Figure: scatter plot of price against carat with the fitted regression line (solid black) and two red dashed lines.
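The red dashed lines in the plot are built from the confidence limits of the two coefficients. If a pointwise confidence band for the mean price is wanted instead, one possible approach is predict() with interval = "confidence". The sketch below is only an illustration: the helper name conf_band and its arguments are made up for this example and are not part of any package.

```r
# conf_band is a hypothetical helper: it evaluates a fitted simple linear
# model on an even grid of predictor values and returns the fitted mean
# together with its lower and upper confidence limits.
conf_band <- function(fit, xname, n = 100, level = 0.95) {
  x <- model.frame(fit)[[xname]]                 # observed predictor values
  grid <- data.frame(seq(min(x), max(x), length.out = n))
  names(grid) <- xname                           # predict() matches on this name
  band <- predict(fit, newdata = grid,
                  interval = "confidence", level = level)
  cbind(grid, band)                              # columns: <xname>, fit, lwr, upr
}

# Possible usage with the model from this report (assumes `fit` exists
# and the scatter plot has already been drawn):
# band <- conf_band(fit, "carat")
# lines(band$carat, band$lwr, col = "red", lty = 2, lwd = 2)
# lines(band$carat, band$upr, col = "red", lty = 2, lwd = 2)
```

A band computed this way is narrowest near the mean carat and widens toward the ends of the observed range.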

Linear Model Summary

summary(fit)

## 
## Call:
## lm(formula = price ~ carat, data = Diamond)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2264.7  -604.3  -116.1   435.1  6591.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2298.4      158.5  -14.50   <2e-16 ***
## carat        11598.9      230.1   50.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1118 on 306 degrees of freedom
## Multiple R-squared:  0.8925, Adjusted R-squared:  0.8922 
## F-statistic:  2541 on 1 and 306 DF,  p-value: < 2.2e-16

4. Interpretation

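\( b_0 \) is -2298.4 Singapore dollars; this would be the predicted price of a diamond of 0 carats. A price cannot be negative, and 0 carats lies outside the observed range (the smallest diamond in the data is 0.18 carats), so the intercept is an extrapolation and should not be read as a real price.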

\( b_1 \) is 11598.9 Singapore dollars; this is the predicted change in price for a change of 1 carat. Since the diamonds in the data range from 0.18 to 1.10 carats, it is easier to read as roughly 1,160 dollars per 0.1 carat. The line seems to fit the data pretty well, especially in the lower to middle range. The residual standard error is 1118, which means the observed prices typically differ from the fitted values by about 1,118 dollars.
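As a rough illustration of how the two coefficients turn into predictions, the sketch below plugs the rounded estimates from the summary into the fitted line. The helper name predict_price is made up for this example; with the model object available, predict(fit, newdata = ...) would do the same job with unrounded coefficients.

```r
# Rounded coefficient estimates copied from the summary(fit) output
b0 <- -2298.4   # intercept
b1 <- 11598.9   # slope: change in price per carat

# predict_price is a hypothetical helper, not part of any package
predict_price <- function(carat) b0 + b1 * carat

# smallest observed carat, a half-carat stone, and a one-carat stone
predict_price(c(0.18, 0.5, 1.0))

# carat weight at which the fitted line crosses a price of zero
-b0 / b1
```

The fitted line crosses zero just under 0.2 carats, so the prediction at the smallest observed weight (0.18 carats) comes out negative even though the cheapest diamond in the data costs 638.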

\( r^2 \) is 0.8925, meaning that about 89% of the variance in price is explained by carat. This is very high.
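In a regression with a single predictor, \( r^2 \) is tied directly to the F-statistic and to the t value of the slope, so the numbers in the summary can be cross-checked against each other. The sketch below does this with the rounded values from the output; the variable names are chosen for this example only.

```r
# Values copied from the summary(fit) output
r2       <- 0.8925   # Multiple R-squared
df_resid <- 306      # residual degrees of freedom
t_carat  <- 50.41    # t value for the carat coefficient

# With one predictor, F = R^2 / (1 - R^2) * residual df
f_from_r2 <- r2 / (1 - r2) * df_resid

# With one predictor, F is also the square of the slope's t value
f_from_t <- t_carat^2

round(c(f_from_r2 = f_from_r2, f_from_t = f_from_t))
```

Both values round to 2541, in line with the F-statistic in the summary; the small differences before rounding come from the rounded inputs.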

One observation I would note here is that the model does not predict as well at the higher carats. I am guessing that this is because other factors come into play at this level, such as the clarity and certification of a given diamond.

The p-value for the F-test on the regression model is very low (< 2.2e-16), so I reject \( H_0 \): there is strong evidence of a linear relationship between price and carat.
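One way to check the impression that the fit is worse at higher carats is to compare the size of the residuals across ranges of the predictor. The sketch below is a minimal illustration: resid_by_bin is a hypothetical helper written for this example, and the choice of quantile-based bins is an assumption, not something taken from the analysis above.

```r
# resid_by_bin is a hypothetical helper: it splits the predictor into
# quantile-based bins and returns the root-mean-square residual per bin.
resid_by_bin <- function(fit, xname, bins = 4) {
  x <- model.frame(fit)[[xname]]                       # observed predictor values
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = bins + 1)))
  grp <- cut(x, breaks = breaks, include.lowest = TRUE)
  tapply(resid(fit), grp, function(e) sqrt(mean(e^2)))
}

# Possible usage with the model from this report (assumes `fit` exists):
# resid_by_bin(fit, "carat")
```

If the values grow from the lowest bin to the highest, that would support the reading that the line predicts less well for larger diamonds.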