Objective: This project looks at a dataset of diamonds that records both the size (in carats) and the price (in dollars) of each diamond. The objective is to quantify how well we can predict the price of a diamond given only its size.
To begin, we will bring in some libraries and load the dataset into memory.
library(dplyr)
library(ggplot2)
library(caret)
data("diamonds")
Description: The diamonds dataset contains information about 53940 individual diamonds. For each diamond it records a number of attributes, such as its cut, its clarity, its depth and table measurements, its color rating, and its x/y/z dimensions in millimeters. For the purposes of this analysis, the size in carats and the price in dollars will be singled out.
Method To determine how well our model performs, we will split the data into a randomly selected training set consisting of 75% of the data, and use the remaining 25% to evaluate the final model. To ensure that our analysis is reproducible, the random seed is set.
set.seed(99)
idx_train <- as.vector(createDataPartition(y = 1:nrow(diamonds),
                                           times = 1,
                                           p = 0.75,
                                           list = FALSE))
The full dataset is then split into a training and a test set using these randomly chosen indices as the cutting mechanism.
train_data <- diamonds %>%
  mutate(idx_row = 1:n()) %>%
  filter(idx_row %in% idx_train)
test_data <- diamonds %>%
  mutate(idx_row = 1:n()) %>%
  filter(!(idx_row %in% idx_train))
dim(train_data)
## [1] 40456 11
dim(test_data)
## [1] 13484 11
Evaluating the dimensions of the split data shows that both sets still contain all of the original columns, and that the original 53940 rows have been split into a training set of 40456 records and a test set of 13484 records for evaluation.
Exploration To get a visual sense of the relationship that may exist between these two variables, we will plot each diamond as a point according to its size and price.
ggplot(data = train_data) +
  geom_point(mapping = aes(x = carat, y = price),
             color = 'steelblue', alpha = 0.1) +
  labs(x = "Diamond Size (carats)", y = "Diamond Price (dollars)") +
  theme_minimal()
Next, the two sets are formatted for the model we will build. Note that we deliberately hold back the "answers" (the price labels) for the test set until after the model has made its predictions.
features_train <- train_data %>%
  select(carat)
features_test <- test_data %>%
  select(carat)
labels_train <- train_data$price
Pre-Modeling Exploration
summary(labels_train)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 949 2405 3934 5317 18818
Exploring the value we are trying to predict shows that it is a continuous numerical value. We will therefore build a regression model and evaluate how much it reduces the error we would otherwise incur by guessing at random.
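To make "guessing at random" concrete, we can compute the error of the naive strategy of always predicting the mean training price. A minimal sketch (the baseline_pred and baseline_rmse names here are ours, for illustration only):
# Baseline: always predict the mean training price, and measure the
# root-mean-square error of that constant guess in dollars.
baseline_pred <- mean(labels_train)
baseline_rmse <- sqrt(mean((labels_train - baseline_pred) ^ 2))
baseline_rmse
Any useful model should produce an RMSE well below this baseline.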
Model For this regression, we will use a linear regression model as built by the caret R package. Linear regression fits a linear function to the input data, describing the relationship between the input feature and the output value we are trying to predict.
model <- train(x = features_train, y = labels_train, method = 'lm')
model$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 1551.281 0.8485327 1009.541 14.80946 0.003237813 7.831418
model$finalModel$coefficients
## (Intercept) carat
## -2255.126 7760.331
Initial Results The initial results report an R squared value (the "Coefficient of Determination") of 0.8485327: the proportion of the variance in price that the model can account for using carat size alone, relative to simply guessing the mean for future data.
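For intuition, an in-sample version of this value can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. A sketch of that calculation (the variable names are illustrative; the result should be close to, though not identical with, the resampled estimate caret reports):
# Coefficient of determination: 1 - SS_res / SS_tot,
# computed here on the training data.
fit_pred <- predict(model, features_train)
ss_res <- sum((labels_train - fit_pred) ^ 2)
ss_tot <- sum((labels_train - mean(labels_train)) ^ 2)
1 - ss_res / ss_tot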
The coefficients give us the ingredients necessary to superimpose the fitted line over the original scatter chart for comparison.
ggplot(data = train_data) +
  geom_point(mapping = aes(x = carat, y = price),
             color = 'steelblue', alpha = 0.1) +
  geom_abline(intercept = model$finalModel$coefficients[[1]],
              slope = model$finalModel$coefficients[[2]],
              color = 'dark orange', size = 1.5) +
  labs(x = "Diamond Size (carats)", y = "Diamond Price (dollars)") +
  theme_minimal()
Testing the Model To test the model, we will make predictions on the test data using the model we just built. We'll inspect the first few to make sure the data type is as expected.
pred <- as.vector(predict(model, features_test))
head(pred)
## [1] -470.25027 72.97288 72.97288 72.97288 72.97288 -470.25027
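These values, including the negative predictions for the smallest diamonds, follow directly from the fitted line. As a quick sanity check, the first prediction can be reproduced by hand from the coefficients reported above:
# For a 0.23 carat diamond: -2255.126 + 7760.331 * 0.23,
# which is approximately -470.25, matching the first prediction.
coefs <- model$finalModel$coefficients
coefs[[1]] + coefs[[2]] * 0.23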
Next we will attach this vector of predictions to the test data so we can compare them to the actual prices. We will compute the error, the absolute error, and the squared error.
test_data <- test_data %>%
  select(carat, price) %>%
  mutate(model_pred = pred) %>%
  mutate(error = (pred - price),
         abs_error = abs(pred - price),
         sq_error = (pred - price) ^ 2)
test_data
## # A tibble: 13,484 x 6
## carat price model_pred error abs_error sq_error
## <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.230 327 -470. -797. 797. 635608.
## 2 0.300 339 73.0 -266. 266. 70770.
## 3 0.300 351 73.0 -278. 278. 77299.
## 4 0.300 351 73.0 -278. 278. 77299.
## 5 0.300 351 73.0 -278. 278. 77299.
## 6 0.230 354 -470. -824. 824. 679389.
## 7 0.240 355 -393. -748. 748. 558976.
## 8 0.230 402 -470. -872. 872. 760821.
## 9 0.230 402 -470. -872. 872. 760821.
## 10 0.330 403 306. -97.2 97.2 9451.
## # ... with 13,474 more rows
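With the per-row errors in place, aggregate test-set metrics comparable to the resampled training RMSE and MAE reported earlier can be computed. A minimal sketch:
# Summarise overall test-set error in dollars.
test_data %>%
  summarise(rmse = sqrt(mean(sq_error)),
            mae = mean(abs_error))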
Evaluation First we will plot our model against the test data to get a visual sense of how closely it fits.
ggplot(data = test_data) +
  geom_point(mapping = aes(x = carat, y = price),
             color = 'steelblue', alpha = 0.1) +
  geom_abline(intercept = model$finalModel$coefficients[[1]],
              slope = model$finalModel$coefficients[[2]],
              color = 'dark orange', size = 1.5) +
  labs(x = "Diamond Size (carats)", y = "Diamond Price (dollars)") +
  theme_minimal()
While the results look as expected, a histogram of the errors lets us home in on the extent to which the model performed well.
ggplot(data = test_data) +
  geom_histogram(mapping = aes(x = error), bins = 50,
                 fill = 'steelblue') +
  labs(x = "Diamond Price Error", y = "Frequency of errors per bin") +
  theme_minimal()
Since we are not specifically concerned with whether the model's guess was off in the positive or negative direction, concentrating on the absolute error gives us a clearer picture of the landscape.
ggplot(data = test_data) +
  geom_histogram(mapping = aes(x = abs_error), bins = 50,
                 fill = 'dark blue') +
  labs(x = "Absolute Diamond Price Error", y = "Frequency of errors per bin") +
  theme_minimal()
Conclusion While a seemingly large share of the variation in price can be accounted for by the carat size of a diamond alone, a considerable amount remains that is not directly attributable to this factor. Other attributes of a diamond (such as its cut, its clarity, its specific dimensions, and its color rating) may account for some of the remaining variation. Further analysis would determine whether this is so and to what degree.
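As a starting point for that further analysis, the same caret workflow could be reused with a wider feature set. A hedged sketch (the choice of features here is illustrative only, not a recommendation):
# Illustrative only: a linear model that adds cut, color, and clarity
# alongside carat. The formula interface expands these factor columns
# into numeric contrasts automatically.
model_full <- train(price ~ carat + cut + color + clarity,
                    data = train_data,
                    method = 'lm')
model_full$results
Comparing model_full's resampled RMSE and R squared against the carat-only model would indicate how much of the remaining variation these attributes explain.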