August 25, 2018

Scenario

The goal of this project is to predict the price of diamonds with a linear regression model. The model is built and tested on the "diamonds" dataset, which ships with the ggplot2 package for R.
The data frame has 53,940 rows and 10 variables:
- price: price in US dollars ($326–$18,823)
- carat: weight of the diamond (0.2–5.01)
- cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color: diamond colour, from J (worst) to D (best)
- clarity: a measurement of how clear the diamond is
- x: length in mm (0–10.74)
- y: width in mm (0–58.9)
- z: depth in mm (0–31.8)
- depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
- table: width of top of diamond relative to widest point (43–95)
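The variable list above can be checked directly against the data. The following is a minimal sketch, assuming the ggplot2 package is installed (it is only used here as the source of the dataset); the choice of inspection functions is illustrative, not necessarily what the original analysis ran.

```r
# Assumption: ggplot2 is installed; loading it makes `diamonds` available.
library(ggplot2)

dim(diamonds)            # number of rows and columns
str(diamonds)            # type of each variable (cut, color, clarity are ordered factors)
summary(diamonds$price)  # range and quartiles of price in US dollars
summary(diamonds$carat)  # range and quartiles of carat

price_range <- range(diamonds$price)  # minimum and maximum price
```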

Overall distribution of diamonds price
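A histogram is one simple way to look at the price distribution. This is a sketch using base R graphics; the number of breaks and the axis labels are assumptions made for illustration, and the original plot may have been drawn differently.

```r
# Assumption: ggplot2 is installed; it is loaded only to access `diamonds`.
library(ggplot2)

# hist() draws the plot and also returns the bin counts.
# breaks = 50 is an arbitrary suggestion for the number of bins.
price_hist <- hist(
  diamonds$price,
  breaks = 50,
  main   = "Overall distribution of diamond prices",
  xlab   = "Price (USD)",
  ylab   = "Number of diamonds"
)
```

The returned object (here called price_hist, a name chosen for this example) holds the bin boundaries and counts, which is handy when the exact frequencies are needed.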

Overall correlation between price and carat
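The relationship between the two variables can be summarised with the Pearson correlation coefficient and a scatter plot. The sketch below assumes ggplot2 is installed for the data; the plotting options are illustrative choices.

```r
# Assumption: ggplot2 is installed; it is loaded only to access `diamonds`.
library(ggplot2)

# Pearson correlation between carat and price (cor() uses Pearson by default).
carat_price_cor <- cor(diamonds$carat, diamonds$price)
carat_price_cor

# Scatter plot; pch = "." keeps the ~54k points readable.
plot(
  diamonds$carat, diamonds$price,
  pch  = ".",
  xlab = "Carat",
  ylab = "Price (USD)",
  main = "Price vs. carat"
)
```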

Model establishment

"carat" is selected as the only predictor in the regression.
The fitted model is then evaluated on a test dataset.
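The fit-and-test procedure can be sketched as follows, for a model of the form price = b0 + b1 * carat. The text does not say how the test dataset was obtained, so the random 80/20 split, the seed, and the error measures (RMSE and R-squared on the test set) are all assumptions made for this example.

```r
# Assumption: ggplot2 is installed; it is loaded only to access `diamonds`.
library(ggplot2)

set.seed(2018)  # arbitrary seed, only so the split is reproducible

# Assumed split: 80% of rows for training, the remaining 20% for testing.
n         <- nrow(diamonds)
train_idx <- sample(n, size = floor(0.8 * n))
train_set <- diamonds[train_idx, ]
test_set  <- diamonds[-train_idx, ]

# Simple linear regression with carat as the only predictor.
fit <- lm(price ~ carat, data = train_set)
coef(fit)  # intercept (b0) and slope (b1)

# Predict prices for the test rows and measure the error.
pred <- predict(fit, newdata = test_set)
rmse <- sqrt(mean((test_set$price - pred)^2))
r2_test <- 1 - sum((test_set$price - pred)^2) /
  sum((test_set$price - mean(test_set$price))^2)

rmse     # typical prediction error in US dollars
r2_test  # share of test-set price variance explained by carat
```

Comparing r2_test with summary(fit)$r.squared from the training data gives a quick indication of whether the model generalises beyond the rows it was fitted on.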