In Phase 1 of this project, we have explore the diamond.csv dataset and in this phase, we will further investigate effect of the diamonds’ cut and a multinomial logistic regression model was formulated. The model will be fitted on the subsetted dataset inorder not to fry the computer. The variable “CUT” was uused as the response variable. The latter has 5 levels and these are “Fair”, “Good”, “Very Good”, “Premium” and “Ideal”. All the other variables will be used in the analysis, however, the variables x, y and z will be combined to create a new variable known as Volume so as to eliminate interactions between these 3 variables.
library(nnet)
library(mlr)
## Loading required package: ParamHelpers
## 'mlr' is in maintenance mode since July 2019. Future development
## efforts will go into its successor 'mlr3' (<https://mlr3.mlr-org.com>).
library(knitr)
library(dplyr)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(broom)
Adding the new variable “Volume” and dropping the variables x, y and z.
sub_diamond2 <- sub_diamond %>% mutate(volume = x*y*z) %>% select(-x,-y,-z) %>% mutate(obs_num = 1:n())
mlr::summarizeColumns(sub_diamond2) %>%
select(-mad,-disp) %>%
knitr::kable(caption = "Summary of the features of Diamond")
| name | type | na | mean | median | min | max | nlevs |
|---|---|---|---|---|---|---|---|
| carat | numeric | 0 | 0.792182 | 0.7000 | 0.20000 | 3.0400 | 0 |
| cut | factor | 0 | NA | NA | 157.00000 | 1974.0000 | 5 |
| color | factor | 0 | NA | NA | 244.00000 | 1030.0000 | 7 |
| clarity | factor | 0 | NA | NA | 0.00000 | 1190.0000 | 8 |
| depth | numeric | 0 | 61.738460 | 61.8000 | 52.30000 | 72.2000 | 0 |
| table | numeric | 0 | 57.469580 | 57.0000 | 43.00000 | 70.0000 | 0 |
| price | numeric | 0 | 3897.760400 | 2332.0000 | 334.00000 | 18780.0000 | 0 |
| volume | numeric | 0 | 128.903532 | 114.2554 | 33.46742 | 505.6011 | 0 |
| obs_num | integer | 0 | 2500.500000 | 2500.5000 | 1.00000 | 5000.0000 | 0 |
We will start the model by using all the variables.
cut_model <- multinom(cut~., data=sub_diamond2, maxit=1000)
## # weights: 120 (92 variable)
## initial value 8047.189562
## iter 10 value 7139.851414
## iter 20 value 6570.954502
## iter 30 value 6002.671806
## iter 40 value 5710.830558
## iter 50 value 5535.042800
## iter 60 value 5445.864421
## iter 70 value 5328.252592
## iter 80 value 5245.502706
## iter 90 value 5223.235594
## iter 100 value 5212.458532
## final value 5212.428172
## converged
summary(cut_model)
## Warning in sqrt(diag(vc)): NaNs produced
## Call:
## multinom(formula = cut ~ ., data = sub_diamond2, maxit = 1000)
##
## Coefficients:
## (Intercept) carat colorE colorF colorG colorH
## Good 54.21420 -3.76824 -0.19258822 -0.16411939 0.2446273 -0.22591634
## Very Good 85.83917 -15.98443 -0.05711157 -0.08226506 0.2789349 -0.08732381
## Premium 85.52025 -23.97448 -0.16867930 0.11409894 0.6188116 0.27512372
## Ideal 158.06632 -25.26972 -0.09698909 -0.01455872 0.3865093 0.11968282
## colorI colorJ clarityIF clarityVVS1 clarityVVS2 clarityVS1
## Good 0.1719059 0.4403532 6.289421 7.052726 6.18418 7.765773
## Very Good 0.3537943 0.6210180 9.847532 11.241705 10.57907 11.745240
## Premium 0.5586258 0.6716991 9.730120 10.730215 10.19800 11.824645
## Ideal 0.6094552 0.8311882 19.347339 20.384699 19.44746 20.632122
## clarityVS2 claritySI1 claritySI2 clarityI1 clarityI2 clarityI3
## Good 7.095028 6.748835 7.060304 6.017937 0 0
## Very Good 10.794342 10.646713 10.942620 10.041942 0 0
## Premium 11.044026 10.655650 10.998687 10.338914 0 0
## Ideal 19.922847 19.434776 19.648178 19.248900 0 0
## depth table price volume obs_num
## Good -0.7222614 -0.2241240 0.0002574625 0.009383132 -6.229005e-05
## Very Good -1.0587536 -0.4605102 0.0003905921 0.079481440 5.423326e-06
## Premium -1.2197973 -0.2869896 0.0004077296 0.129024236 -2.606915e-05
## Ideal -1.5450560 -1.3467004 0.0004791622 0.131464683 1.171036e-05
##
## Std. Errors:
## (Intercept) carat colorE colorF colorG
## Good 7.437509e-05 2.330397e-05 3.754204e-05 2.674331e-05 8.550715e-05
## Very Good 5.561293e-05 2.543335e-05 6.746020e-05 6.225066e-05 8.197512e-05
## Premium 6.446960e-05 3.523882e-05 1.228747e-04 9.252711e-05 1.240807e-04
## Ideal 4.849397e-05 4.303036e-05 7.985738e-05 1.625500e-04 1.834308e-04
## colorH colorI colorJ clarityIF clarityVVS1
## Good 3.714494e-05 7.612905e-06 6.459727e-05 1.016769e-05 6.979354e-06
## Very Good 4.915003e-05 2.543297e-05 3.788062e-05 1.528406e-05 4.423823e-05
## Premium 1.027705e-04 4.529901e-05 4.042992e-05 5.048592e-05 5.199741e-05
## Ideal 6.327803e-05 4.706848e-05 3.018999e-05 7.290056e-05 8.493166e-05
## clarityVVS2 clarityVS1 clarityVS2 claritySI1 claritySI2
## Good 1.195546e-05 6.923849e-05 3.753880e-05 1.295417e-04 4.471006e-05
## Very Good 2.757687e-05 3.071345e-05 5.489049e-05 1.072948e-04 5.970518e-05
## Premium 5.155278e-05 4.088751e-05 7.527618e-05 7.287233e-05 8.603199e-05
## Ideal 8.203525e-05 5.013821e-05 1.246556e-04 2.052041e-04 1.476617e-04
## clarityI1 clarityI2 clarityI3 depth table
## Good 2.986070e-05 4.463401e-18 NaN 0.01452978 0.01532820
## Very Good 2.457243e-05 0.000000e+00 0 0.01031760 0.01094469
## Premium 2.589225e-05 0.000000e+00 0 0.01028870 0.01079183
## Ideal 1.205969e-05 0.000000e+00 0 0.01257817 0.01373851
## price volume obs_num
## Good 7.910916e-05 0.003534778 7.634524e-05
## Very Good 7.706729e-05 0.003452049 7.474795e-05
## Premium 7.699940e-05 0.003447068 7.492335e-05
## Ideal 7.897400e-05 0.003580757 7.640460e-05
##
## Residual Deviance: 10424.86
## AIC: 10584.86
The full response is somehow difficult to interpret. Lets give a try!
The model execution shows that it has converged after 130 iterations and it includes a final negative log-likelihood 5220.25. When multiply by 2, the latter gives the Residual Deviance. The AIC is 10592.49 and it gives can be used to compare with other related model. The smallest AIC should be selected but however unlike the adjusted R-squared, this number itself is not meaningful.
We will now calculate the Z score and the p-value for all the variables in the model.
z_score <- summary(cut_model)$coefficients/summary(cut_model)$standard.errors
## Warning in sqrt(diag(vc)): NaNs produced
## Warning in sqrt(diag(vc)): NaNs produced
z_score
## (Intercept) carat colorE colorF colorG colorH
## Good 728929.6 -161699.5 -5129.9342 -6136.83930 2860.899 -6082.022
## Very Good 1543511.1 -628482.9 -846.5965 -1321.51300 3402.677 -1776.679
## Premium 1326520.5 -680342.7 -1372.7753 1233.14071 4987.169 2677.068
## Ideal 3259504.6 -587253.2 -1214.5289 -89.56455 2107.112 1891.380
## colorI colorJ clarityIF clarityVVS1 clarityVVS2 clarityVS1
## Good 22580.85 6816.901 618569.5 1010512.7 517268.1 112159.8
## Very Good 13910.85 16394.082 644300.6 254117.5 383621.3 382413.6
## Premium 12331.97 16613.911 192729.4 206360.5 197816.6 289199.5
## Ideal 12948.27 27531.908 265393.6 240013.0 237062.2 411505.0
## clarityVS2 claritySI1 claritySI2 clarityI1 clarityI2 clarityI3
## Good 189005.2 52097.78 157913.1 201533.7 0 NaN
## Very Good 196652.3 99228.58 183277.6 408667.1 NaN NaN
## Premium 146713.4 146223.54 127844.2 399305.4 NaN NaN
## Ideal 159823.1 94709.47 133062.1 1596135.3 NaN NaN
## depth table price volume obs_num
## Good -49.70904 -14.62168 3.254522 2.654518 -0.81589962
## Very Good -102.61629 -42.07612 5.068196 23.024427 0.07255485
## Premium -118.55703 -26.59322 5.295230 37.430141 -0.34794425
## Ideal -122.83631 -98.02376 6.067341 36.714213 0.15326776
p_value <- (1 - pnorm(abs(z_score), 0, 1))*2
p_value
## (Intercept) carat colorE colorF colorG colorH colorI colorJ clarityIF
## Good 0 0 0 0 0 0 0 0 0
## Very Good 0 0 0 0 0 0 0 0 0
## Premium 0 0 0 0 0 0 0 0 0
## Ideal 0 0 0 0 0 0 0 0 0
## clarityVVS1 clarityVVS2 clarityVS1 clarityVS2 claritySI1 claritySI2
## Good 0 0 0 0 0 0
## Very Good 0 0 0 0 0 0
## Premium 0 0 0 0 0 0
## Ideal 0 0 0 0 0 0
## clarityI1 clarityI2 clarityI3 depth table price volume
## Good 0 1 NaN 0 0 1.135833e-03 0.007942184
## Very Good 0 NaN NaN 0 0 4.016045e-07 0.000000000
## Premium 0 NaN NaN 0 0 1.188664e-07 0.000000000
## Ideal 0 NaN NaN 0 0 1.300452e-09 0.000000000
## obs_num
## Good 0.4145575
## Very Good 0.9421604
## Premium 0.7278821
## Ideal 0.8781871
The p-Value shows that all the variables are significant. However at the project, we will show which model will be best fitted through a stepwise selection.
tidy(cut_model, exponentiate=FALSE, conf.int = TRUE) %>% kable(digits = 3, format = "markdown")
## Warning in sqrt(diag(vc)): NaNs produced
## Warning in sqrt(diag(vcov(object))): NaNs produced
| y.level | term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|---|
| Good | (Intercept) | 54.214 | 0.000 | 728929.568 | 0.000 | 54.214 | 54.214 |
| Good | carat | -3.768 | 0.000 | -161699.476 | 0.000 | -3.768 | -3.768 |
| Good | colorE | -0.193 | 0.000 | -5129.934 | 0.000 | -0.193 | -0.193 |
| Good | colorF | -0.164 | 0.000 | -6136.839 | 0.000 | -0.164 | -0.164 |
| Good | colorG | 0.245 | 0.000 | 2860.899 | 0.000 | 0.244 | 0.245 |
| Good | colorH | -0.226 | 0.000 | -6082.022 | 0.000 | -0.226 | -0.226 |
| Good | colorI | 0.172 | 0.000 | 22580.855 | 0.000 | 0.172 | 0.172 |
| Good | colorJ | 0.440 | 0.000 | 6816.901 | 0.000 | 0.440 | 0.440 |
| Good | clarityIF | 6.289 | 0.000 | 618569.504 | 0.000 | 6.289 | 6.289 |
| Good | clarityVVS1 | 7.053 | 0.000 | 1010512.656 | 0.000 | 7.053 | 7.053 |
| Good | clarityVVS2 | 6.184 | 0.000 | 517268.131 | 0.000 | 6.184 | 6.184 |
| Good | clarityVS1 | 7.766 | 0.000 | 112159.772 | 0.000 | 7.766 | 7.766 |
| Good | clarityVS2 | 7.095 | 0.000 | 189005.208 | 0.000 | 7.095 | 7.095 |
| Good | claritySI1 | 6.749 | 0.000 | 52097.779 | 0.000 | 6.749 | 6.749 |
| Good | claritySI2 | 7.060 | 0.000 | 157913.091 | 0.000 | 7.060 | 7.060 |
| Good | clarityI1 | 6.018 | 0.000 | 201533.714 | 0.000 | 6.018 | 6.018 |
| Good | clarityI2 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 |
| Good | clarityI3 | 0.000 | NaN | NaN | NaN | NaN | NaN |
| Good | depth | -0.722 | 0.015 | -49.709 | 0.000 | -0.751 | -0.694 |
| Good | table | -0.224 | 0.015 | -14.622 | 0.000 | -0.254 | -0.194 |
| Good | price | 0.000 | 0.000 | 3.255 | 0.001 | 0.000 | 0.000 |
| Good | volume | 0.009 | 0.004 | 2.655 | 0.008 | 0.002 | 0.016 |
| Good | obs_num | 0.000 | 0.000 | -0.816 | 0.415 | 0.000 | 0.000 |
| Very Good | (Intercept) | 85.839 | 0.000 | 1543511.075 | 0.000 | 85.839 | 85.839 |
| Very Good | carat | -15.984 | 0.000 | -628482.930 | 0.000 | -15.984 | -15.984 |
| Very Good | colorE | -0.057 | 0.000 | -846.597 | 0.000 | -0.057 | -0.057 |
| Very Good | colorF | -0.082 | 0.000 | -1321.513 | 0.000 | -0.082 | -0.082 |
| Very Good | colorG | 0.279 | 0.000 | 3402.677 | 0.000 | 0.279 | 0.279 |
| Very Good | colorH | -0.087 | 0.000 | -1776.679 | 0.000 | -0.087 | -0.087 |
| Very Good | colorI | 0.354 | 0.000 | 13910.852 | 0.000 | 0.354 | 0.354 |
| Very Good | colorJ | 0.621 | 0.000 | 16394.082 | 0.000 | 0.621 | 0.621 |
| Very Good | clarityIF | 9.848 | 0.000 | 644300.585 | 0.000 | 9.848 | 9.848 |
| Very Good | clarityVVS1 | 11.242 | 0.000 | 254117.453 | 0.000 | 11.242 | 11.242 |
| Very Good | clarityVVS2 | 10.579 | 0.000 | 383621.291 | 0.000 | 10.579 | 10.579 |
| Very Good | clarityVS1 | 11.745 | 0.000 | 382413.591 | 0.000 | 11.745 | 11.745 |
| Very Good | clarityVS2 | 10.794 | 0.000 | 196652.314 | 0.000 | 10.794 | 10.794 |
| Very Good | claritySI1 | 10.647 | 0.000 | 99228.580 | 0.000 | 10.647 | 10.647 |
| Very Good | claritySI2 | 10.943 | 0.000 | 183277.583 | 0.000 | 10.943 | 10.943 |
| Very Good | clarityI1 | 10.042 | 0.000 | 408667.124 | 0.000 | 10.042 | 10.042 |
| Very Good | clarityI2 | 0.000 | 0.000 | NaN | NaN | 0.000 | 0.000 |
| Very Good | clarityI3 | 0.000 | 0.000 | NaN | NaN | 0.000 | 0.000 |
| Very Good | depth | -1.059 | 0.010 | -102.616 | 0.000 | -1.079 | -1.039 |
| Very Good | table | -0.461 | 0.011 | -42.076 | 0.000 | -0.482 | -0.439 |
| Very Good | price | 0.000 | 0.000 | 5.068 | 0.000 | 0.000 | 0.001 |
| Very Good | volume | 0.079 | 0.003 | 23.024 | 0.000 | 0.073 | 0.086 |
| Very Good | obs_num | 0.000 | 0.000 | 0.073 | 0.942 | 0.000 | 0.000 |
| Premium | (Intercept) | 85.520 | 0.000 | 1326520.545 | 0.000 | 85.520 | 85.520 |
| Premium | carat | -23.974 | 0.000 | -680342.726 | 0.000 | -23.975 | -23.974 |
| Premium | colorE | -0.169 | 0.000 | -1372.775 | 0.000 | -0.169 | -0.168 |
| Premium | colorF | 0.114 | 0.000 | 1233.141 | 0.000 | 0.114 | 0.114 |
| Premium | colorG | 0.619 | 0.000 | 4987.169 | 0.000 | 0.619 | 0.619 |
| Premium | colorH | 0.275 | 0.000 | 2677.068 | 0.000 | 0.275 | 0.275 |
| Premium | colorI | 0.559 | 0.000 | 12331.966 | 0.000 | 0.559 | 0.559 |
| Premium | colorJ | 0.672 | 0.000 | 16613.911 | 0.000 | 0.672 | 0.672 |
| Premium | clarityIF | 9.730 | 0.000 | 192729.388 | 0.000 | 9.730 | 9.730 |
| Premium | clarityVVS1 | 10.730 | 0.000 | 206360.547 | 0.000 | 10.730 | 10.730 |
| Premium | clarityVVS2 | 10.198 | 0.000 | 197816.614 | 0.000 | 10.198 | 10.198 |
| Premium | clarityVS1 | 11.825 | 0.000 | 289199.473 | 0.000 | 11.825 | 11.825 |
| Premium | clarityVS2 | 11.044 | 0.000 | 146713.422 | 0.000 | 11.044 | 11.044 |
| Premium | claritySI1 | 10.656 | 0.000 | 146223.538 | 0.000 | 10.656 | 10.656 |
| Premium | claritySI2 | 10.999 | 0.000 | 127844.165 | 0.000 | 10.999 | 10.999 |
| Premium | clarityI1 | 10.339 | 0.000 | 399305.356 | 0.000 | 10.339 | 10.339 |
| Premium | clarityI2 | 0.000 | 0.000 | NaN | NaN | 0.000 | 0.000 |
| Premium | clarityI3 | 0.000 | 0.000 | NaN | NaN | 0.000 | 0.000 |
| Premium | depth | -1.220 | 0.010 | -118.557 | 0.000 | -1.240 | -1.200 |
| Premium | table | -0.287 | 0.011 | -26.593 | 0.000 | -0.308 | -0.266 |
| Premium | price | 0.000 | 0.000 | 5.295 | 0.000 | 0.000 | 0.001 |
| Premium | volume | 0.129 | 0.003 | 37.430 | 0.000 | 0.122 | 0.136 |
| Premium | obs_num | 0.000 | 0.000 | -0.348 | 0.728 | 0.000 | 0.000 |
| Ideal | (Intercept) | 158.066 | 0.000 | 3259504.602 | 0.000 | 158.066 | 158.066 |
| Ideal | carat | -25.270 | 0.000 | -587253.201 | 0.000 | -25.270 | -25.270 |
| Ideal | colorE | -0.097 | 0.000 | -1214.529 | 0.000 | -0.097 | -0.097 |
| Ideal | colorF | -0.015 | 0.000 | -89.565 | 0.000 | -0.015 | -0.014 |
| Ideal | colorG | 0.387 | 0.000 | 2107.112 | 0.000 | 0.386 | 0.387 |
| Ideal | colorH | 0.120 | 0.000 | 1891.380 | 0.000 | 0.120 | 0.120 |
| Ideal | colorI | 0.609 | 0.000 | 12948.267 | 0.000 | 0.609 | 0.610 |
| Ideal | colorJ | 0.831 | 0.000 | 27531.908 | 0.000 | 0.831 | 0.831 |
| Ideal | clarityIF | 19.347 | 0.000 | 265393.577 | 0.000 | 19.347 | 19.347 |
| Ideal | clarityVVS1 | 20.385 | 0.000 | 240012.970 | 0.000 | 20.385 | 20.385 |
| Ideal | clarityVVS2 | 19.447 | 0.000 | 237062.201 | 0.000 | 19.447 | 19.448 |
| Ideal | clarityVS1 | 20.632 | 0.000 | 411504.992 | 0.000 | 20.632 | 20.632 |
| Ideal | clarityVS2 | 19.923 | 0.000 | 159823.147 | 0.000 | 19.923 | 19.923 |
| Ideal | claritySI1 | 19.435 | 0.000 | 94709.475 | 0.000 | 19.434 | 19.435 |
| Ideal | claritySI2 | 19.648 | 0.000 | 133062.114 | 0.000 | 19.648 | 19.648 |
| Ideal | clarityI1 | 19.249 | 0.000 | 1596135.319 | 0.000 | 19.249 | 19.249 |
| Ideal | clarityI2 | 0.000 | 0.000 | NaN | NaN | 0.000 | 0.000 |
| Ideal | clarityI3 | 0.000 | 0.000 | NaN | NaN | 0.000 | 0.000 |
| Ideal | depth | -1.545 | 0.013 | -122.836 | 0.000 | -1.570 | -1.520 |
| Ideal | table | -1.347 | 0.014 | -98.024 | 0.000 | -1.374 | -1.320 |
| Ideal | price | 0.000 | 0.000 | 6.067 | 0.000 | 0.000 | 0.001 |
| Ideal | volume | 0.131 | 0.004 | 36.714 | 0.000 | 0.124 | 0.138 |
| Ideal | obs_num | 0.000 | 0.000 | 0.153 | 0.878 | 0.000 | 0.000 |
cut_residuals <- as_tibble(residuals(cut_model)) %>%
setNames(paste('resid', names(.), sep = "")) %>%
mutate(obs_num = 1:n())
cut_residuals %>% slice(1:10)
## # A tibble: 10 x 6
## residFair residGood `residVery Good` residPremium residIdeal obs_num
## <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 -0.00284 -0.0592 -0.211 -0.127 0.400 1
## 2 -0.00754 -0.143 -0.373 0.759 -0.235 2
## 3 -0.0000918 -0.0209 -0.179 0.622 -0.422 3
## 4 -0.0000419 -0.00326 -0.0277 -0.0302 0.0612 4
## 5 -0.0655 -0.217 -0.294 0.618 -0.0410 5
## 6 -0.00165 -0.0605 -0.334 -0.191 0.587 6
## 7 -0.000271 -0.0238 -0.141 -0.128 0.293 7
## 8 -0.000653 -0.0180 -0.188 -0.164 0.371 8
## 9 -0.0102 -0.178 -0.391 0.691 -0.112 9
## 10 -0.319 0.638 -0.141 -0.175 -0.00402 10
plot(cut_residuals)
cut_pred <- as_tibble(predict(cut_model, type = "probs")) %>% mutate(obs_num = 1:n())
cut_pred %>% slice(1:10)
## # A tibble: 10 x 6
## Fair Good `Very Good` Premium Ideal obs_num
## <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 0.00284 0.0592 0.211 0.127 0.600 1
## 2 0.00754 0.143 0.373 0.241 0.235 2
## 3 0.0000918 0.0209 0.179 0.378 0.422 3
## 4 0.0000419 0.00326 0.0277 0.0302 0.939 4
## 5 0.0655 0.217 0.294 0.382 0.0410 5
## 6 0.00165 0.0605 0.334 0.191 0.413 6
## 7 0.000271 0.0238 0.141 0.128 0.707 7
## 8 0.000653 0.0180 0.188 0.164 0.629 8
## 9 0.0102 0.178 0.391 0.309 0.112 9
## 10 0.319 0.362 0.141 0.175 0.00402 10
As states earlier, the residual deviance of the full model plot is 10437.33, which is somehow a large residual deviance which indicates a bad fit of the model.
In the model final model selection, we will see if the residual deviance can be reduced.
Anova(cut_model)
## Analysis of Deviance Table (Type II tests)
##
## Response: cut
## LR Chisq Df Pr(>Chisq)
## carat 41.40 4 2.223e-08 ***
## color 26.82 24 0.3127
## clarity 59.84 40 0.0226 *
## depth 433.41 4 < 2.2e-16 ***
## table 1683.87 4 < 2.2e-16 ***
## price 29.08 4 7.536e-06 ***
## volume 34.80 4 5.099e-07 ***
## obs_num 4.28 4 0.3689
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
At a 95% confidence interval, only the variable volume is not significant. However the variable depth, table and price are highly significant.
exp(coef(cut_model))
## (Intercept) carat colorE colorF colorG colorH
## Good 3.506948e+23 2.309266e-02 0.8248215 0.8486407 1.277145 0.7977848
## Very Good 1.903165e+37 1.143011e-07 0.9444887 0.9210278 1.321721 0.9163803
## Premium 1.383484e+37 3.872734e-11 0.8447798 1.1208630 1.856720 1.3166936
## Ideal 4.439462e+68 1.060475e-11 0.9075659 0.9855467 1.471834 1.1271393
## colorI colorJ clarityIF clarityVVS1 clarityVVS2 clarityVS1
## Good 1.187566 1.553256 5.388414e+02 1.156006e+03 4.850149e+02 2.358481e+03
## Very Good 1.424462 1.860821 1.891162e+04 7.624487e+04 3.930365e+04 1.261516e+05
## Premium 1.748268 1.957561 1.681658e+04 4.571650e+04 2.684934e+04 1.365771e+05
## Ideal 1.839429 2.296045 2.526054e+08 7.127912e+08 2.792052e+08 9.128865e+08
## clarityVS2 claritySI1 claritySI2 clarityI1 clarityI2
## Good 1.205956e+03 8.530642e+02 1.164799e+03 4.107304e+02 1
## Very Good 4.874421e+04 4.205412e+04 5.653529e+04 2.296995e+04 1
## Premium 6.256902e+04 4.243164e+04 5.979560e+04 3.091244e+04 1
## Ideal 4.491409e+08 2.756869e+08 3.412679e+08 2.289238e+08 1
## clarityI3 depth table price volume obs_num
## Good 1 0.4856528 0.7992161 1.000257 1.009427 0.9999377
## Very Good 1 0.3468879 0.6309617 1.000391 1.082725 1.0000054
## Premium 1 0.2952900 0.7505195 1.000408 1.137718 0.9999739
## Ideal 1 0.2132999 0.2600971 1.000479 1.140498 1.0000117
The sensitivity analysis gives the odds ratio of the model. For example, we can see if an increase of 1 unit of fair cut vs. good cut, there is an increase of 0.523 odds ratio in depth and 0.840 odds ratio in table.
EMPTY.MOD <- multinom(formula = cut ~ 1, data = sub_diamond2)
## # weights: 10 (4 variable)
## initial value 8047.189562
## iter 10 value 6883.189708
## iter 10 value 6883.189707
## final value 6883.189707
## converged
FULL.MOD <- multinom(formula = cut ~., data = sub_diamond2)
## # weights: 120 (92 variable)
## initial value 8047.189562
## iter 10 value 7139.851414
## iter 20 value 6570.954502
## iter 30 value 6002.671806
## iter 40 value 5710.830558
## iter 50 value 5535.042800
## iter 60 value 5445.864421
## iter 70 value 5328.252592
## iter 80 value 5245.502706
## iter 90 value 5223.235594
## iter 100 value 5212.458532
## final value 5212.458532
## stopped after 100 iterations
forw.sel <- step(object=EMPTY.MOD, scope = list(upper = FULL.MOD),
direction = "forward", k = log(nrow(sub_diamond2)), trace = TRUE)
## Start: AIC=13800.45
## cut ~ 1
##
## trying + carat
## # weights: 15 (8 variable)
## initial value 8047.189562
## iter 10 value 6854.244107
## final value 6782.216571
## converged
## trying + color
## # weights: 40 (28 variable)
## initial value 8047.189562
## iter 10 value 6975.418728
## iter 20 value 6890.656768
## iter 30 value 6861.764224
## final value 6861.499688
## converged
## trying + clarity
## # weights: 60 (44 variable)
## initial value 8047.189562
## iter 10 value 6907.155372
## iter 20 value 6731.875138
## iter 30 value 6694.463493
## iter 40 value 6691.918023
## iter 40 value 6691.918007
## iter 40 value 6691.918007
## final value 6691.918007
## converged
## trying + depth
## # weights: 15 (8 variable)
## initial value 8047.189562
## iter 10 value 6650.040899
## final value 6494.705299
## converged
## trying + table
## # weights: 15 (8 variable)
## initial value 8047.189562
## iter 10 value 5861.002316
## final value 5837.796192
## converged
## trying + price
## # weights: 15 (8 variable)
## initial value 8047.189562
## iter 10 value 6874.878704
## final value 6845.303860
## converged
## trying + volume
## # weights: 15 (8 variable)
## initial value 8047.189562
## iter 10 value 6817.137241
## final value 6793.262218
## converged
## trying + obs_num
## # weights: 15 (8 variable)
## initial value 8047.189562
## iter 10 value 6890.030010
## final value 6881.051217
## converged
## Df AIC
## + +table 8 11691.59
## + +depth 8 13005.41
## + +clarity 32 13447.84
## + +carat 8 13580.43
## + +volume 8 13602.52
## + +price 8 13706.61
## <none> 4 13774.38
## + +obs_num 8 13778.10
## + +color 28 13779.00
## # weights: 15 (8 variable)
## initial value 8047.189562
## iter 10 value 5861.002316
## final value 5837.796192
## converged
##
## Step: AIC=11743.73
## cut ~ table
##
## trying + carat
## # weights: 20 (12 variable)
## initial value 8047.189562
## iter 10 value 6260.439237
## iter 20 value 5802.816581
## final value 5802.711204
## converged
## trying + color
## # weights: 45 (32 variable)
## initial value 8047.189562
## iter 10 value 6433.859032
## iter 20 value 5993.920295
## iter 30 value 5849.795267
## iter 40 value 5823.047204
## final value 5823.010269
## converged
## trying + clarity
## # weights: 65 (48 variable)
## initial value 8047.189562
## iter 10 value 6523.344685
## iter 20 value 5958.431821
## iter 30 value 5743.891472
## iter 40 value 5704.319242
## final value 5703.918756
## converged
## trying + depth
## # weights: 20 (12 variable)
## initial value 8047.189562
## iter 10 value 5988.726502
## iter 20 value 5383.968078
## iter 30 value 5331.076816
## iter 40 value 5330.908627
## iter 50 value 5330.815527
## final value 5330.815331
## converged
## trying + price
## # weights: 20 (12 variable)
## initial value 8047.189562
## iter 10 value 6153.044517
## iter 20 value 5829.134869
## final value 5829.119094
## converged
## trying + volume
## # weights: 20 (12 variable)
## initial value 8047.189562
## iter 10 value 6227.149179
## iter 20 value 5808.453735
## final value 5807.658860
## converged
## trying + obs_num
## # weights: 20 (12 variable)
## initial value 8047.189562
## iter 10 value 6212.252276
## iter 20 value 5835.180775
## final value 5835.170249
## converged
## Df AIC
## + +depth 12 10685.63
## + +clarity 36 11479.84
## + +carat 12 11629.42
## + +volume 12 11639.32
## + +price 12 11682.24
## <none> 8 11691.59
## + +obs_num 12 11694.34
## + +color 32 11710.02
## # weights: 20 (12 variable)
## initial value 8047.189562
## iter 10 value 5988.726502
## iter 20 value 5383.968078
## iter 30 value 5331.076816
## iter 40 value 5330.908627
## iter 50 value 5330.815527
## final value 5330.815331
## converged
##
## Step: AIC=10763.84
## cut ~ table + depth
##
## trying + carat
## # weights: 25 (16 variable)
## initial value 8047.189562
## iter 10 value 6471.260617
## iter 20 value 5512.786436
## iter 30 value 5314.780493
## final value 5314.526366
## converged
## trying + color
## # weights: 50 (36 variable)
## initial value 8047.189562
## iter 10 value 6512.272810
## iter 20 value 5910.312870
## iter 30 value 5714.025138
## iter 40 value 5489.418847
## iter 50 value 5318.647762
## final value 5318.567617
## converged
## trying + clarity
## # weights: 70 (52 variable)
## initial value 8047.189562
## iter 10 value 6408.543105
## iter 20 value 6057.803860
## iter 30 value 5733.973944
## iter 40 value 5456.316407
## iter 50 value 5266.783722
## final value 5266.439906
## converged
## trying + price
## # weights: 25 (16 variable)
## initial value 8047.189562
## iter 10 value 6685.968796
## iter 20 value 5528.105337
## iter 30 value 5322.241810
## final value 5322.159440
## converged
## trying + volume
## # weights: 25 (16 variable)
## initial value 8047.189562
## iter 10 value 6624.182926
## iter 20 value 5515.253692
## iter 30 value 5314.765013
## final value 5314.531774
## converged
## trying + obs_num
## # weights: 25 (16 variable)
## initial value 8047.189562
## iter 10 value 6520.541175
## iter 20 value 5528.468801
## iter 30 value 5328.660963
## final value 5328.579830
## converged
## Df AIC
## + +clarity 40 10612.88
## + +carat 16 10661.05
## + +volume 16 10661.06
## + +price 16 10676.32
## <none> 12 10685.63
## + +obs_num 16 10689.16
## + +color 36 10709.14
## # weights: 70 (52 variable)
## initial value 8047.189562
## iter 10 value 6408.543105
## iter 20 value 6057.803860
## iter 30 value 5733.973944
## iter 40 value 5456.316407
## iter 50 value 5266.783722
## final value 5266.439906
## converged
##
## Step: AIC=10873.57
## cut ~ table + depth + clarity
From the final model selection, we can see that the best model that should have been fitted, should be cut ~ table + depth + clarity. But for this project, we have analyse the full model to gives a better understanding of how the model works. The best model selected in the last step stills having a very large AIC and confirm the bad fit of the model.
To improve the model, maybe cut observations could be reduce to only very good, premium and ideal inorder to reduce the value of aic.
From the preliminary analysis, we can find that only one categorical variable does not help to predict the price or depth or carat, it is a combination of the categorical variables which helps to determine other factor. Subsequently, the cut variable has not been a good indicator. Perhaps, using color as should be a better determinant.
Some further investion are necessary to see how the variables interact between themselves and create a way better model for prediction.