Introduction

Methodology

In Phase 1 of this project, we have explore the diamond.csv dataset and in this phase, we will further investigate effect of the diamonds’ cut and a multinomial logistic regression model was formulated. The model will be fitted on the subsetted dataset inorder not to fry the computer. The variable “CUT” was uused as the response variable. The latter has 5 levels and these are “Fair”, “Good”, “Very Good”, “Premium” and “Ideal”. All the other variables will be used in the analysis, however, the variables x, y and z will be combined to create a new variable known as Volume so as to eliminate interactions between these 3 variables.

Loading packages

library(nnet)
library(mlr)
## Loading required package: ParamHelpers
## 'mlr' is in maintenance mode since July 2019. Future development
## efforts will go into its successor 'mlr3' (<https://mlr3.mlr-org.com>).
library(knitr)
library(dplyr)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(broom)

Modification of Data

Adding the new variable “Volume” and dropping the variables x, y and z.

sub_diamond2 <- sub_diamond %>% mutate(volume = x*y*z) %>% select(-x,-y,-z) %>% mutate(obs_num = 1:n())

Data Summary

mlr::summarizeColumns(sub_diamond2) %>% 
  select(-mad,-disp) %>% 
  knitr::kable(caption = "Summary of the features of Diamond")
Summary of the features of Diamond
name type na mean median min max nlevs
carat numeric 0 0.792182 0.7000 0.20000 3.0400 0
cut factor 0 NA NA 157.00000 1974.0000 5
color factor 0 NA NA 244.00000 1030.0000 7
clarity factor 0 NA NA 0.00000 1190.0000 8
depth numeric 0 61.738460 61.8000 52.30000 72.2000 0
table numeric 0 57.469580 57.0000 43.00000 70.0000 0
price numeric 0 3897.760400 2332.0000 334.00000 18780.0000 0
volume numeric 0 128.903532 114.2554 33.46742 505.6011 0
obs_num integer 0 2500.500000 2500.5000 1.00000 5000.0000 0

Statistical Modelling

Model Fitting

We will start the model by using all the variables.

cut_model <- multinom(cut~., data=sub_diamond2, maxit=1000)
## # weights:  120 (92 variable)
## initial  value 8047.189562 
## iter  10 value 7139.851414
## iter  20 value 6570.954502
## iter  30 value 6002.671806
## iter  40 value 5710.830558
## iter  50 value 5535.042800
## iter  60 value 5445.864421
## iter  70 value 5328.252592
## iter  80 value 5245.502706
## iter  90 value 5223.235594
## iter 100 value 5212.458532
## final  value 5212.428172 
## converged
summary(cut_model)
## Warning in sqrt(diag(vc)): NaNs produced
## Call:
## multinom(formula = cut ~ ., data = sub_diamond2, maxit = 1000)
## 
## Coefficients:
##           (Intercept)     carat      colorE      colorF    colorG      colorH
## Good         54.21420  -3.76824 -0.19258822 -0.16411939 0.2446273 -0.22591634
## Very Good    85.83917 -15.98443 -0.05711157 -0.08226506 0.2789349 -0.08732381
## Premium      85.52025 -23.97448 -0.16867930  0.11409894 0.6188116  0.27512372
## Ideal       158.06632 -25.26972 -0.09698909 -0.01455872 0.3865093  0.11968282
##              colorI    colorJ clarityIF clarityVVS1 clarityVVS2 clarityVS1
## Good      0.1719059 0.4403532  6.289421    7.052726     6.18418   7.765773
## Very Good 0.3537943 0.6210180  9.847532   11.241705    10.57907  11.745240
## Premium   0.5586258 0.6716991  9.730120   10.730215    10.19800  11.824645
## Ideal     0.6094552 0.8311882 19.347339   20.384699    19.44746  20.632122
##           clarityVS2 claritySI1 claritySI2 clarityI1 clarityI2 clarityI3
## Good        7.095028   6.748835   7.060304  6.017937         0         0
## Very Good  10.794342  10.646713  10.942620 10.041942         0         0
## Premium    11.044026  10.655650  10.998687 10.338914         0         0
## Ideal      19.922847  19.434776  19.648178 19.248900         0         0
##                depth      table        price      volume       obs_num
## Good      -0.7222614 -0.2241240 0.0002574625 0.009383132 -6.229005e-05
## Very Good -1.0587536 -0.4605102 0.0003905921 0.079481440  5.423326e-06
## Premium   -1.2197973 -0.2869896 0.0004077296 0.129024236 -2.606915e-05
## Ideal     -1.5450560 -1.3467004 0.0004791622 0.131464683  1.171036e-05
## 
## Std. Errors:
##            (Intercept)        carat       colorE       colorF       colorG
## Good      7.437509e-05 2.330397e-05 3.754204e-05 2.674331e-05 8.550715e-05
## Very Good 5.561293e-05 2.543335e-05 6.746020e-05 6.225066e-05 8.197512e-05
## Premium   6.446960e-05 3.523882e-05 1.228747e-04 9.252711e-05 1.240807e-04
## Ideal     4.849397e-05 4.303036e-05 7.985738e-05 1.625500e-04 1.834308e-04
##                 colorH       colorI       colorJ    clarityIF  clarityVVS1
## Good      3.714494e-05 7.612905e-06 6.459727e-05 1.016769e-05 6.979354e-06
## Very Good 4.915003e-05 2.543297e-05 3.788062e-05 1.528406e-05 4.423823e-05
## Premium   1.027705e-04 4.529901e-05 4.042992e-05 5.048592e-05 5.199741e-05
## Ideal     6.327803e-05 4.706848e-05 3.018999e-05 7.290056e-05 8.493166e-05
##            clarityVVS2   clarityVS1   clarityVS2   claritySI1   claritySI2
## Good      1.195546e-05 6.923849e-05 3.753880e-05 1.295417e-04 4.471006e-05
## Very Good 2.757687e-05 3.071345e-05 5.489049e-05 1.072948e-04 5.970518e-05
## Premium   5.155278e-05 4.088751e-05 7.527618e-05 7.287233e-05 8.603199e-05
## Ideal     8.203525e-05 5.013821e-05 1.246556e-04 2.052041e-04 1.476617e-04
##              clarityI1    clarityI2 clarityI3      depth      table
## Good      2.986070e-05 4.463401e-18       NaN 0.01452978 0.01532820
## Very Good 2.457243e-05 0.000000e+00         0 0.01031760 0.01094469
## Premium   2.589225e-05 0.000000e+00         0 0.01028870 0.01079183
## Ideal     1.205969e-05 0.000000e+00         0 0.01257817 0.01373851
##                  price      volume      obs_num
## Good      7.910916e-05 0.003534778 7.634524e-05
## Very Good 7.706729e-05 0.003452049 7.474795e-05
## Premium   7.699940e-05 0.003447068 7.492335e-05
## Ideal     7.897400e-05 0.003580757 7.640460e-05
## 
## Residual Deviance: 10424.86 
## AIC: 10584.86

The full response is somehow difficult to interpret. Lets give a try!

The model execution shows that it has converged after 130 iterations and it includes a final negative log-likelihood 5220.25. When multiply by 2, the latter gives the Residual Deviance. The AIC is 10592.49 and it gives can be used to compare with other related model. The smallest AIC should be selected but however unlike the adjusted R-squared, this number itself is not meaningful.

We will now calculate the Z score and the p-value for all the variables in the model.

z_score <- summary(cut_model)$coefficients/summary(cut_model)$standard.errors
## Warning in sqrt(diag(vc)): NaNs produced

## Warning in sqrt(diag(vc)): NaNs produced
z_score
##           (Intercept)     carat     colorE      colorF   colorG    colorH
## Good         728929.6 -161699.5 -5129.9342 -6136.83930 2860.899 -6082.022
## Very Good   1543511.1 -628482.9  -846.5965 -1321.51300 3402.677 -1776.679
## Premium     1326520.5 -680342.7 -1372.7753  1233.14071 4987.169  2677.068
## Ideal       3259504.6 -587253.2 -1214.5289   -89.56455 2107.112  1891.380
##             colorI    colorJ clarityIF clarityVVS1 clarityVVS2 clarityVS1
## Good      22580.85  6816.901  618569.5   1010512.7    517268.1   112159.8
## Very Good 13910.85 16394.082  644300.6    254117.5    383621.3   382413.6
## Premium   12331.97 16613.911  192729.4    206360.5    197816.6   289199.5
## Ideal     12948.27 27531.908  265393.6    240013.0    237062.2   411505.0
##           clarityVS2 claritySI1 claritySI2 clarityI1 clarityI2 clarityI3
## Good        189005.2   52097.78   157913.1  201533.7         0       NaN
## Very Good   196652.3   99228.58   183277.6  408667.1       NaN       NaN
## Premium     146713.4  146223.54   127844.2  399305.4       NaN       NaN
## Ideal       159823.1   94709.47   133062.1 1596135.3       NaN       NaN
##                depth     table    price    volume     obs_num
## Good       -49.70904 -14.62168 3.254522  2.654518 -0.81589962
## Very Good -102.61629 -42.07612 5.068196 23.024427  0.07255485
## Premium   -118.55703 -26.59322 5.295230 37.430141 -0.34794425
## Ideal     -122.83631 -98.02376 6.067341 36.714213  0.15326776
p_value <- (1 - pnorm(abs(z_score), 0, 1))*2
p_value
##           (Intercept) carat colorE colorF colorG colorH colorI colorJ clarityIF
## Good                0     0      0      0      0      0      0      0         0
## Very Good           0     0      0      0      0      0      0      0         0
## Premium             0     0      0      0      0      0      0      0         0
## Ideal               0     0      0      0      0      0      0      0         0
##           clarityVVS1 clarityVVS2 clarityVS1 clarityVS2 claritySI1 claritySI2
## Good                0           0          0          0          0          0
## Very Good           0           0          0          0          0          0
## Premium             0           0          0          0          0          0
## Ideal               0           0          0          0          0          0
##           clarityI1 clarityI2 clarityI3 depth table        price      volume
## Good              0         1       NaN     0     0 1.135833e-03 0.007942184
## Very Good         0       NaN       NaN     0     0 4.016045e-07 0.000000000
## Premium           0       NaN       NaN     0     0 1.188664e-07 0.000000000
## Ideal             0       NaN       NaN     0     0 1.300452e-09 0.000000000
##             obs_num
## Good      0.4145575
## Very Good 0.9421604
## Premium   0.7278821
## Ideal     0.8781871

The p-Value shows that all the variables are significant. However at the project, we will show which model will be best fitted through a stepwise selection.

tidy(cut_model, exponentiate=FALSE, conf.int = TRUE) %>% kable(digits = 3, format = "markdown")
## Warning in sqrt(diag(vc)): NaNs produced
## Warning in sqrt(diag(vcov(object))): NaNs produced
y.level term estimate std.error statistic p.value conf.low conf.high
Good (Intercept) 54.214 0.000 728929.568 0.000 54.214 54.214
Good carat -3.768 0.000 -161699.476 0.000 -3.768 -3.768
Good colorE -0.193 0.000 -5129.934 0.000 -0.193 -0.193
Good colorF -0.164 0.000 -6136.839 0.000 -0.164 -0.164
Good colorG 0.245 0.000 2860.899 0.000 0.244 0.245
Good colorH -0.226 0.000 -6082.022 0.000 -0.226 -0.226
Good colorI 0.172 0.000 22580.855 0.000 0.172 0.172
Good colorJ 0.440 0.000 6816.901 0.000 0.440 0.440
Good clarityIF 6.289 0.000 618569.504 0.000 6.289 6.289
Good clarityVVS1 7.053 0.000 1010512.656 0.000 7.053 7.053
Good clarityVVS2 6.184 0.000 517268.131 0.000 6.184 6.184
Good clarityVS1 7.766 0.000 112159.772 0.000 7.766 7.766
Good clarityVS2 7.095 0.000 189005.208 0.000 7.095 7.095
Good claritySI1 6.749 0.000 52097.779 0.000 6.749 6.749
Good claritySI2 7.060 0.000 157913.091 0.000 7.060 7.060
Good clarityI1 6.018 0.000 201533.714 0.000 6.018 6.018
Good clarityI2 0.000 0.000 0.000 1.000 0.000 0.000
Good clarityI3 0.000 NaN NaN NaN NaN NaN
Good depth -0.722 0.015 -49.709 0.000 -0.751 -0.694
Good table -0.224 0.015 -14.622 0.000 -0.254 -0.194
Good price 0.000 0.000 3.255 0.001 0.000 0.000
Good volume 0.009 0.004 2.655 0.008 0.002 0.016
Good obs_num 0.000 0.000 -0.816 0.415 0.000 0.000
Very Good (Intercept) 85.839 0.000 1543511.075 0.000 85.839 85.839
Very Good carat -15.984 0.000 -628482.930 0.000 -15.984 -15.984
Very Good colorE -0.057 0.000 -846.597 0.000 -0.057 -0.057
Very Good colorF -0.082 0.000 -1321.513 0.000 -0.082 -0.082
Very Good colorG 0.279 0.000 3402.677 0.000 0.279 0.279
Very Good colorH -0.087 0.000 -1776.679 0.000 -0.087 -0.087
Very Good colorI 0.354 0.000 13910.852 0.000 0.354 0.354
Very Good colorJ 0.621 0.000 16394.082 0.000 0.621 0.621
Very Good clarityIF 9.848 0.000 644300.585 0.000 9.848 9.848
Very Good clarityVVS1 11.242 0.000 254117.453 0.000 11.242 11.242
Very Good clarityVVS2 10.579 0.000 383621.291 0.000 10.579 10.579
Very Good clarityVS1 11.745 0.000 382413.591 0.000 11.745 11.745
Very Good clarityVS2 10.794 0.000 196652.314 0.000 10.794 10.794
Very Good claritySI1 10.647 0.000 99228.580 0.000 10.647 10.647
Very Good claritySI2 10.943 0.000 183277.583 0.000 10.943 10.943
Very Good clarityI1 10.042 0.000 408667.124 0.000 10.042 10.042
Very Good clarityI2 0.000 0.000 NaN NaN 0.000 0.000
Very Good clarityI3 0.000 0.000 NaN NaN 0.000 0.000
Very Good depth -1.059 0.010 -102.616 0.000 -1.079 -1.039
Very Good table -0.461 0.011 -42.076 0.000 -0.482 -0.439
Very Good price 0.000 0.000 5.068 0.000 0.000 0.001
Very Good volume 0.079 0.003 23.024 0.000 0.073 0.086
Very Good obs_num 0.000 0.000 0.073 0.942 0.000 0.000
Premium (Intercept) 85.520 0.000 1326520.545 0.000 85.520 85.520
Premium carat -23.974 0.000 -680342.726 0.000 -23.975 -23.974
Premium colorE -0.169 0.000 -1372.775 0.000 -0.169 -0.168
Premium colorF 0.114 0.000 1233.141 0.000 0.114 0.114
Premium colorG 0.619 0.000 4987.169 0.000 0.619 0.619
Premium colorH 0.275 0.000 2677.068 0.000 0.275 0.275
Premium colorI 0.559 0.000 12331.966 0.000 0.559 0.559
Premium colorJ 0.672 0.000 16613.911 0.000 0.672 0.672
Premium clarityIF 9.730 0.000 192729.388 0.000 9.730 9.730
Premium clarityVVS1 10.730 0.000 206360.547 0.000 10.730 10.730
Premium clarityVVS2 10.198 0.000 197816.614 0.000 10.198 10.198
Premium clarityVS1 11.825 0.000 289199.473 0.000 11.825 11.825
Premium clarityVS2 11.044 0.000 146713.422 0.000 11.044 11.044
Premium claritySI1 10.656 0.000 146223.538 0.000 10.656 10.656
Premium claritySI2 10.999 0.000 127844.165 0.000 10.999 10.999
Premium clarityI1 10.339 0.000 399305.356 0.000 10.339 10.339
Premium clarityI2 0.000 0.000 NaN NaN 0.000 0.000
Premium clarityI3 0.000 0.000 NaN NaN 0.000 0.000
Premium depth -1.220 0.010 -118.557 0.000 -1.240 -1.200
Premium table -0.287 0.011 -26.593 0.000 -0.308 -0.266
Premium price 0.000 0.000 5.295 0.000 0.000 0.001
Premium volume 0.129 0.003 37.430 0.000 0.122 0.136
Premium obs_num 0.000 0.000 -0.348 0.728 0.000 0.000
Ideal (Intercept) 158.066 0.000 3259504.602 0.000 158.066 158.066
Ideal carat -25.270 0.000 -587253.201 0.000 -25.270 -25.270
Ideal colorE -0.097 0.000 -1214.529 0.000 -0.097 -0.097
Ideal colorF -0.015 0.000 -89.565 0.000 -0.015 -0.014
Ideal colorG 0.387 0.000 2107.112 0.000 0.386 0.387
Ideal colorH 0.120 0.000 1891.380 0.000 0.120 0.120
Ideal colorI 0.609 0.000 12948.267 0.000 0.609 0.610
Ideal colorJ 0.831 0.000 27531.908 0.000 0.831 0.831
Ideal clarityIF 19.347 0.000 265393.577 0.000 19.347 19.347
Ideal clarityVVS1 20.385 0.000 240012.970 0.000 20.385 20.385
Ideal clarityVVS2 19.447 0.000 237062.201 0.000 19.447 19.448
Ideal clarityVS1 20.632 0.000 411504.992 0.000 20.632 20.632
Ideal clarityVS2 19.923 0.000 159823.147 0.000 19.923 19.923
Ideal claritySI1 19.435 0.000 94709.475 0.000 19.434 19.435
Ideal claritySI2 19.648 0.000 133062.114 0.000 19.648 19.648
Ideal clarityI1 19.249 0.000 1596135.319 0.000 19.249 19.249
Ideal clarityI2 0.000 0.000 NaN NaN 0.000 0.000
Ideal clarityI3 0.000 0.000 NaN NaN 0.000 0.000
Ideal depth -1.545 0.013 -122.836 0.000 -1.570 -1.520
Ideal table -1.347 0.014 -98.024 0.000 -1.374 -1.320
Ideal price 0.000 0.000 6.067 0.000 0.000 0.001
Ideal volume 0.131 0.004 36.714 0.000 0.124 0.138
Ideal obs_num 0.000 0.000 0.153 0.878 0.000 0.000

Residuals Analysis

cut_residuals <- as_tibble(residuals(cut_model)) %>% 
  setNames(paste('resid', names(.), sep = "")) %>%
  mutate(obs_num = 1:n())
  
cut_residuals %>% slice(1:10)
## # A tibble: 10 x 6
##     residFair residGood `residVery Good` residPremium residIdeal obs_num
##         <dbl>     <dbl>            <dbl>        <dbl>      <dbl>   <int>
##  1 -0.00284    -0.0592           -0.211       -0.127     0.400         1
##  2 -0.00754    -0.143            -0.373        0.759    -0.235         2
##  3 -0.0000918  -0.0209           -0.179        0.622    -0.422         3
##  4 -0.0000419  -0.00326          -0.0277      -0.0302    0.0612        4
##  5 -0.0655     -0.217            -0.294        0.618    -0.0410        5
##  6 -0.00165    -0.0605           -0.334       -0.191     0.587         6
##  7 -0.000271   -0.0238           -0.141       -0.128     0.293         7
##  8 -0.000653   -0.0180           -0.188       -0.164     0.371         8
##  9 -0.0102     -0.178            -0.391        0.691    -0.112         9
## 10 -0.319       0.638            -0.141       -0.175    -0.00402      10
plot(cut_residuals)

Response Analysis

cut_pred <- as_tibble(predict(cut_model, type = "probs")) %>% mutate(obs_num = 1:n())

cut_pred %>% slice(1:10)
## # A tibble: 10 x 6
##         Fair    Good `Very Good` Premium   Ideal obs_num
##        <dbl>   <dbl>       <dbl>   <dbl>   <dbl>   <int>
##  1 0.00284   0.0592       0.211   0.127  0.600         1
##  2 0.00754   0.143        0.373   0.241  0.235         2
##  3 0.0000918 0.0209       0.179   0.378  0.422         3
##  4 0.0000419 0.00326      0.0277  0.0302 0.939         4
##  5 0.0655    0.217        0.294   0.382  0.0410        5
##  6 0.00165   0.0605       0.334   0.191  0.413         6
##  7 0.000271  0.0238       0.141   0.128  0.707         7
##  8 0.000653  0.0180       0.188   0.164  0.629         8
##  9 0.0102    0.178        0.391   0.309  0.112         9
## 10 0.319     0.362        0.141   0.175  0.00402      10

Goodness of fit - talk about residual deviance

As states earlier, the residual deviance of the full model plot is 10437.33, which is somehow a large residual deviance which indicates a bad fit of the model.

In the model final model selection, we will see if the residual deviance can be reduced.

Confidence Interval and Hypothesis test

Anova(cut_model)
## Analysis of Deviance Table (Type II tests)
## 
## Response: cut
##         LR Chisq Df Pr(>Chisq)    
## carat      41.40  4  2.223e-08 ***
## color      26.82 24     0.3127    
## clarity    59.84 40     0.0226 *  
## depth     433.41  4  < 2.2e-16 ***
## table    1683.87  4  < 2.2e-16 ***
## price      29.08  4  7.536e-06 ***
## volume     34.80  4  5.099e-07 ***
## obs_num     4.28  4     0.3689    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

At a 95% confidence interval, only the variable volume is not significant. However the variable depth, table and price are highly significant.

Sensitivity Analysis

exp(coef(cut_model))
##            (Intercept)        carat    colorE    colorF   colorG    colorH
## Good      3.506948e+23 2.309266e-02 0.8248215 0.8486407 1.277145 0.7977848
## Very Good 1.903165e+37 1.143011e-07 0.9444887 0.9210278 1.321721 0.9163803
## Premium   1.383484e+37 3.872734e-11 0.8447798 1.1208630 1.856720 1.3166936
## Ideal     4.439462e+68 1.060475e-11 0.9075659 0.9855467 1.471834 1.1271393
##             colorI   colorJ    clarityIF  clarityVVS1  clarityVVS2   clarityVS1
## Good      1.187566 1.553256 5.388414e+02 1.156006e+03 4.850149e+02 2.358481e+03
## Very Good 1.424462 1.860821 1.891162e+04 7.624487e+04 3.930365e+04 1.261516e+05
## Premium   1.748268 1.957561 1.681658e+04 4.571650e+04 2.684934e+04 1.365771e+05
## Ideal     1.839429 2.296045 2.526054e+08 7.127912e+08 2.792052e+08 9.128865e+08
##             clarityVS2   claritySI1   claritySI2    clarityI1 clarityI2
## Good      1.205956e+03 8.530642e+02 1.164799e+03 4.107304e+02         1
## Very Good 4.874421e+04 4.205412e+04 5.653529e+04 2.296995e+04         1
## Premium   6.256902e+04 4.243164e+04 5.979560e+04 3.091244e+04         1
## Ideal     4.491409e+08 2.756869e+08 3.412679e+08 2.289238e+08         1
##           clarityI3     depth     table    price   volume   obs_num
## Good              1 0.4856528 0.7992161 1.000257 1.009427 0.9999377
## Very Good         1 0.3468879 0.6309617 1.000391 1.082725 1.0000054
## Premium           1 0.2952900 0.7505195 1.000408 1.137718 0.9999739
## Ideal             1 0.2132999 0.2600971 1.000479 1.140498 1.0000117

The sensitivity analysis gives the odds ratio of the model. For example, we can see if an increase of 1 unit of fair cut vs. good cut, there is an increase of 0.523 odds ratio in depth and 0.840 odds ratio in table.

Stepwise Model Selection

EMPTY.MOD <- multinom(formula =  cut ~ 1, data = sub_diamond2)
## # weights:  10 (4 variable)
## initial  value 8047.189562 
## iter  10 value 6883.189708
## iter  10 value 6883.189707
## final  value 6883.189707 
## converged
FULL.MOD <- multinom(formula = cut ~., data = sub_diamond2)
## # weights:  120 (92 variable)
## initial  value 8047.189562 
## iter  10 value 7139.851414
## iter  20 value 6570.954502
## iter  30 value 6002.671806
## iter  40 value 5710.830558
## iter  50 value 5535.042800
## iter  60 value 5445.864421
## iter  70 value 5328.252592
## iter  80 value 5245.502706
## iter  90 value 5223.235594
## iter 100 value 5212.458532
## final  value 5212.458532 
## stopped after 100 iterations
forw.sel <- step(object=EMPTY.MOD, scope = list(upper = FULL.MOD), 
                 direction = "forward", k = log(nrow(sub_diamond2)), trace = TRUE)
## Start:  AIC=13800.45
## cut ~ 1
## 
## trying + carat 
## # weights:  15 (8 variable)
## initial  value 8047.189562 
## iter  10 value 6854.244107
## final  value 6782.216571 
## converged
## trying + color 
## # weights:  40 (28 variable)
## initial  value 8047.189562 
## iter  10 value 6975.418728
## iter  20 value 6890.656768
## iter  30 value 6861.764224
## final  value 6861.499688 
## converged
## trying + clarity 
## # weights:  60 (44 variable)
## initial  value 8047.189562 
## iter  10 value 6907.155372
## iter  20 value 6731.875138
## iter  30 value 6694.463493
## iter  40 value 6691.918023
## iter  40 value 6691.918007
## iter  40 value 6691.918007
## final  value 6691.918007 
## converged
## trying + depth 
## # weights:  15 (8 variable)
## initial  value 8047.189562 
## iter  10 value 6650.040899
## final  value 6494.705299 
## converged
## trying + table 
## # weights:  15 (8 variable)
## initial  value 8047.189562 
## iter  10 value 5861.002316
## final  value 5837.796192 
## converged
## trying + price 
## # weights:  15 (8 variable)
## initial  value 8047.189562 
## iter  10 value 6874.878704
## final  value 6845.303860 
## converged
## trying + volume 
## # weights:  15 (8 variable)
## initial  value 8047.189562 
## iter  10 value 6817.137241
## final  value 6793.262218 
## converged
## trying + obs_num 
## # weights:  15 (8 variable)
## initial  value 8047.189562 
## iter  10 value 6890.030010
## final  value 6881.051217 
## converged
##            Df      AIC
## + +table    8 11691.59
## + +depth    8 13005.41
## + +clarity 32 13447.84
## + +carat    8 13580.43
## + +volume   8 13602.52
## + +price    8 13706.61
## <none>      4 13774.38
## + +obs_num  8 13778.10
## + +color   28 13779.00
## # weights:  15 (8 variable)
## initial  value 8047.189562 
## iter  10 value 5861.002316
## final  value 5837.796192 
## converged
## 
## Step:  AIC=11743.73
## cut ~ table
## 
## trying + carat 
## # weights:  20 (12 variable)
## initial  value 8047.189562 
## iter  10 value 6260.439237
## iter  20 value 5802.816581
## final  value 5802.711204 
## converged
## trying + color 
## # weights:  45 (32 variable)
## initial  value 8047.189562 
## iter  10 value 6433.859032
## iter  20 value 5993.920295
## iter  30 value 5849.795267
## iter  40 value 5823.047204
## final  value 5823.010269 
## converged
## trying + clarity 
## # weights:  65 (48 variable)
## initial  value 8047.189562 
## iter  10 value 6523.344685
## iter  20 value 5958.431821
## iter  30 value 5743.891472
## iter  40 value 5704.319242
## final  value 5703.918756 
## converged
## trying + depth 
## # weights:  20 (12 variable)
## initial  value 8047.189562 
## iter  10 value 5988.726502
## iter  20 value 5383.968078
## iter  30 value 5331.076816
## iter  40 value 5330.908627
## iter  50 value 5330.815527
## final  value 5330.815331 
## converged
## trying + price 
## # weights:  20 (12 variable)
## initial  value 8047.189562 
## iter  10 value 6153.044517
## iter  20 value 5829.134869
## final  value 5829.119094 
## converged
## trying + volume 
## # weights:  20 (12 variable)
## initial  value 8047.189562 
## iter  10 value 6227.149179
## iter  20 value 5808.453735
## final  value 5807.658860 
## converged
## trying + obs_num 
## # weights:  20 (12 variable)
## initial  value 8047.189562 
## iter  10 value 6212.252276
## iter  20 value 5835.180775
## final  value 5835.170249 
## converged
##            Df      AIC
## + +depth   12 10685.63
## + +clarity 36 11479.84
## + +carat   12 11629.42
## + +volume  12 11639.32
## + +price   12 11682.24
## <none>      8 11691.59
## + +obs_num 12 11694.34
## + +color   32 11710.02
## # weights:  20 (12 variable)
## initial  value 8047.189562 
## iter  10 value 5988.726502
## iter  20 value 5383.968078
## iter  30 value 5331.076816
## iter  40 value 5330.908627
## iter  50 value 5330.815527
## final  value 5330.815331 
## converged
## 
## Step:  AIC=10763.84
## cut ~ table + depth
## 
## trying + carat 
## # weights:  25 (16 variable)
## initial  value 8047.189562 
## iter  10 value 6471.260617
## iter  20 value 5512.786436
## iter  30 value 5314.780493
## final  value 5314.526366 
## converged
## trying + color 
## # weights:  50 (36 variable)
## initial  value 8047.189562 
## iter  10 value 6512.272810
## iter  20 value 5910.312870
## iter  30 value 5714.025138
## iter  40 value 5489.418847
## iter  50 value 5318.647762
## final  value 5318.567617 
## converged
## trying + clarity 
## # weights:  70 (52 variable)
## initial  value 8047.189562 
## iter  10 value 6408.543105
## iter  20 value 6057.803860
## iter  30 value 5733.973944
## iter  40 value 5456.316407
## iter  50 value 5266.783722
## final  value 5266.439906 
## converged
## trying + price 
## # weights:  25 (16 variable)
## initial  value 8047.189562 
## iter  10 value 6685.968796
## iter  20 value 5528.105337
## iter  30 value 5322.241810
## final  value 5322.159440 
## converged
## trying + volume 
## # weights:  25 (16 variable)
## initial  value 8047.189562 
## iter  10 value 6624.182926
## iter  20 value 5515.253692
## iter  30 value 5314.765013
## final  value 5314.531774 
## converged
## trying + obs_num 
## # weights:  25 (16 variable)
## initial  value 8047.189562 
## iter  10 value 6520.541175
## iter  20 value 5528.468801
## iter  30 value 5328.660963
## final  value 5328.579830 
## converged
##            Df      AIC
## + +clarity 40 10612.88
## + +carat   16 10661.05
## + +volume  16 10661.06
## + +price   16 10676.32
## <none>     12 10685.63
## + +obs_num 16 10689.16
## + +color   36 10709.14
## # weights:  70 (52 variable)
## initial  value 8047.189562 
## iter  10 value 6408.543105
## iter  20 value 6057.803860
## iter  30 value 5733.973944
## iter  40 value 5456.316407
## iter  50 value 5266.783722
## final  value 5266.439906 
## converged
## 
## Step:  AIC=10873.57
## cut ~ table + depth + clarity

Critiques & Limitations

From the final model selection, we can see that the best model that should have been fitted, should be cut ~ table + depth + clarity. But for this project, we have analyse the full model to gives a better understanding of how the model works. The best model selected in the last step stills having a very large AIC and confirm the bad fit of the model.

To improve the model, maybe cut observations could be reduce to only very good, premium and ideal inorder to reduce the value of aic.

Conclusion

From the preliminary analysis, we can find that only one categorical variable does not help to predict the price or depth or carat, it is a combination of the categorical variables which helps to determine other factor. Subsequently, the cut variable has not been a good indicator. Perhaps, using color as should be a better determinant.

Some further investion are necessary to see how the variables interact between themselves and create a way better model for prediction.