Introduction

Methodology

In Phase 1 of this project, we explored the diamond.csv dataset. In this phase, we further investigate the effect of the diamonds’ cut by formulating a multinomial logistic regression model. The model is fitted on a subset of the dataset in order to keep the computation manageable. The variable “cut” is used as the response variable. It has 5 levels: “Fair”, “Good”, “Very Good”, “Premium” and “Ideal”. All the other variables are used in the analysis; however, the variables x, y and z are combined into a new variable, volume, so as to eliminate interactions between these 3 variables.
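
As a rough illustration of what a multinomial logistic regression with a baseline category does, the sketch below turns a set of linear predictors (log-odds of each level versus “Fair”) into class probabilities. The numeric values of `eta` and the helper name `multinom_probs` are made up for illustration and do not come from the fitted model.

```r
# Hypothetical log-odds of each non-baseline level versus the baseline "Fair"
eta <- c(Good = 0.5, `Very Good` = 1.2, Premium = 1.0, Ideal = 1.8)

# Baseline-category logit: the baseline has log-odds 0, i.e. exp(0) = 1
multinom_probs <- function(eta, baseline = "Fair") {
  expo <- c(setNames(1, baseline), exp(eta))
  expo / sum(expo)  # normalise so the probabilities sum to 1
}

probs <- multinom_probs(eta)
print(round(probs, 3))
```

Each fitted coefficient shifts one of these log-odds, which is why the coefficient table has one row per non-baseline level of cut.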

Loading packages

library(nnet)
library(mlr)

## Loading required package: ParamHelpers

## 'mlr' is in maintenance mode since July 2019. Future development
## efforts will go into its successor 'mlr3' (<https://mlr3.mlr-org.com>).

library(knitr)
library(dplyr)
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(broom)

Modification of Data

Adding the new variable “volume” (the product of x, y and z), dropping the variables x, y and z, and adding an observation index, obs_num.

sub_diamond2 <- sub_diamond %>%
  mutate(volume = x * y * z) %>%
  select(-x, -y, -z) %>%
  mutate(obs_num = 1:n())

Data Summary

mlr::summarizeColumns(sub_diamond2) %>% 
  select(-mad,-disp) %>% 
  knitr::kable(caption = "Summary of the features of Diamond")

Summary of the features of Diamond
name	type	mean	median	min	max	nlevs
carat	numeric	0.792182	0.7000	0.20000	3.0400	0
cut	factor	NA	NA	157.00000	1974.0000	5
color	factor	NA	NA	244.00000	1030.0000	7
clarity	factor	NA	NA	0.00000	1190.0000	8
depth	numeric	61.738460	61.8000	52.30000	72.2000	0
table	numeric	57.469580	57.0000	43.00000	70.0000	0
price	numeric	3897.760400	2332.0000	334.00000	18780.0000	0
volume	numeric	128.903532	114.2554	33.46742	505.6011	0
obs_num	integer	2500.500000	2500.5000	1.00000	5000.0000	0

Statistical Modelling

Model Fitting

We start by fitting the model with all the variables.

cut_model <- multinom(cut~., data=sub_diamond2, maxit=1000)

## # weights:  120 (92 variable)
## initial  value 8047.189562 
## iter  10 value 7139.851414
## iter  20 value 6570.954502
## iter  30 value 6002.671806
## iter  40 value 5710.830558
## iter  50 value 5535.042800
## iter  60 value 5445.864421
## iter  70 value 5328.252592
## iter  80 value 5245.502706
## iter  90 value 5223.235594
## iter 100 value 5212.458532
## final  value 5212.428172 
## converged

summary(cut_model)

## Warning in sqrt(diag(vc)): NaNs produced

## Call:
## multinom(formula = cut ~ ., data = sub_diamond2, maxit = 1000)
## 
## Coefficients:
##           (Intercept)     carat      colorE      colorF    colorG      colorH
## Good         54.21420  -3.76824 -0.19258822 -0.16411939 0.2446273 -0.22591634
## Very Good    85.83917 -15.98443 -0.05711157 -0.08226506 0.2789349 -0.08732381
## Premium      85.52025 -23.97448 -0.16867930  0.11409894 0.6188116  0.27512372
## Ideal       158.06632 -25.26972 -0.09698909 -0.01455872 0.3865093  0.11968282
##              colorI    colorJ clarityIF clarityVVS1 clarityVVS2 clarityVS1
## Good      0.1719059 0.4403532  6.289421    7.052726     6.18418   7.765773
## Very Good 0.3537943 0.6210180  9.847532   11.241705    10.57907  11.745240
## Premium   0.5586258 0.6716991  9.730120   10.730215    10.19800  11.824645
## Ideal     0.6094552 0.8311882 19.347339   20.384699    19.44746  20.632122
##           clarityVS2 claritySI1 claritySI2 clarityI1 clarityI2 clarityI3
## Good        7.095028   6.748835   7.060304  6.017937         0         0
## Very Good  10.794342  10.646713  10.942620 10.041942         0         0
## Premium    11.044026  10.655650  10.998687 10.338914         0         0
## Ideal      19.922847  19.434776  19.648178 19.248900         0         0
##                depth      table        price      volume       obs_num
## Good      -0.7222614 -0.2241240 0.0002574625 0.009383132 -6.229005e-05
## Very Good -1.0587536 -0.4605102 0.0003905921 0.079481440  5.423326e-06
## Premium   -1.2197973 -0.2869896 0.0004077296 0.129024236 -2.606915e-05
## Ideal     -1.5450560 -1.3467004 0.0004791622 0.131464683  1.171036e-05
## 
## Std. Errors:
##            (Intercept)        carat       colorE       colorF       colorG
## Good      7.437509e-05 2.330397e-05 3.754204e-05 2.674331e-05 8.550715e-05
## Very Good 5.561293e-05 2.543335e-05 6.746020e-05 6.225066e-05 8.197512e-05
## Premium   6.446960e-05 3.523882e-05 1.228747e-04 9.252711e-05 1.240807e-04
## Ideal     4.849397e-05 4.303036e-05 7.985738e-05 1.625500e-04 1.834308e-04
##                 colorH       colorI       colorJ    clarityIF  clarityVVS1
## Good      3.714494e-05 7.612905e-06 6.459727e-05 1.016769e-05 6.979354e-06
## Very Good 4.915003e-05 2.543297e-05 3.788062e-05 1.528406e-05 4.423823e-05
## Premium   1.027705e-04 4.529901e-05 4.042992e-05 5.048592e-05 5.199741e-05
## Ideal     6.327803e-05 4.706848e-05 3.018999e-05 7.290056e-05 8.493166e-05
##            clarityVVS2   clarityVS1   clarityVS2   claritySI1   claritySI2
## Good      1.195546e-05 6.923849e-05 3.753880e-05 1.295417e-04 4.471006e-05
## Very Good 2.757687e-05 3.071345e-05 5.489049e-05 1.072948e-04 5.970518e-05
## Premium   5.155278e-05 4.088751e-05 7.527618e-05 7.287233e-05 8.603199e-05
## Ideal     8.203525e-05 5.013821e-05 1.246556e-04 2.052041e-04 1.476617e-04
##              clarityI1    clarityI2 clarityI3      depth      table
## Good      2.986070e-05 4.463401e-18       NaN 0.01452978 0.01532820
## Very Good 2.457243e-05 0.000000e+00         0 0.01031760 0.01094469
## Premium   2.589225e-05 0.000000e+00         0 0.01028870 0.01079183
## Ideal     1.205969e-05 0.000000e+00         0 0.01257817 0.01373851
##                  price      volume      obs_num
## Good      7.910916e-05 0.003534778 7.634524e-05
## Very Good 7.706729e-05 0.003452049 7.474795e-05
## Premium   7.699940e-05 0.003447068 7.492335e-05
## Ideal     7.897400e-05 0.003580757 7.640460e-05
## 
## Residual Deviance: 10424.86 
## AIC: 10584.86

The full output is somewhat difficult to interpret, so let us walk through it.

The output shows that the model converged after just over 100 iterations, with a final negative log-likelihood of 5212.43. Multiplying this value by 2 gives the residual deviance of 10424.86. The AIC is 10584.86 and can be used to compare this model with other related models: the model with the smallest AIC should be preferred. Unlike the adjusted R-squared, however, the AIC value is not meaningful on its own.
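
The relationship between the negative log-likelihood, the residual deviance and the AIC can be checked with a few lines of arithmetic. This is only a sketch using the values printed in the model output above; the variable names are illustrative.

```r
# Values copied from the multinom() output above
neg_loglik <- 5212.428172  # "final value" reported by the optimiser
res_dev    <- 10424.86     # printed residual deviance
aic        <- 10584.86     # printed AIC

# The residual deviance is twice the negative log-likelihood
dev_check <- 2 * neg_loglik

# AIC = deviance + 2 * (effective number of parameters), so the gap
# between the two printed values implies the parameter count used
n_par <- (aic - res_dev) / 2

print(round(dev_check, 2))
print(n_par)
```

The implied parameter count is derived purely from the two printed numbers, not read from the model object.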

We will now calculate the z-score and the p-value for all the coefficients in the model.

z_score <- summary(cut_model)$coefficients/summary(cut_model)$standard.errors

## Warning in sqrt(diag(vc)): NaNs produced

## Warning in sqrt(diag(vc)): NaNs produced

z_score

##           (Intercept)     carat     colorE      colorF   colorG    colorH
## Good         728929.6 -161699.5 -5129.9342 -6136.83930 2860.899 -6082.022
## Very Good   1543511.1 -628482.9  -846.5965 -1321.51300 3402.677 -1776.679
## Premium     1326520.5 -680342.7 -1372.7753  1233.14071 4987.169  2677.068
## Ideal       3259504.6 -587253.2 -1214.5289   -89.56455 2107.112  1891.380
##             colorI    colorJ clarityIF clarityVVS1 clarityVVS2 clarityVS1
## Good      22580.85  6816.901  618569.5   1010512.7    517268.1   112159.8
## Very Good 13910.85 16394.082  644300.6    254117.5    383621.3   382413.6
## Premium   12331.97 16613.911  192729.4    206360.5    197816.6   289199.5
## Ideal     12948.27 27531.908  265393.6    240013.0    237062.2   411505.0
##           clarityVS2 claritySI1 claritySI2 clarityI1 clarityI2 clarityI3
## Good        189005.2   52097.78   157913.1  201533.7         0       NaN
## Very Good   196652.3   99228.58   183277.6  408667.1       NaN       NaN
## Premium     146713.4  146223.54   127844.2  399305.4       NaN       NaN
## Ideal       159823.1   94709.47   133062.1 1596135.3       NaN       NaN
##                depth     table    price    volume     obs_num
## Good       -49.70904 -14.62168 3.254522  2.654518 -0.81589962
## Very Good -102.61629 -42.07612 5.068196 23.024427  0.07255485
## Premium   -118.55703 -26.59322 5.295230 37.430141 -0.34794425
## Ideal     -122.83631 -98.02376 6.067341 36.714213  0.15326776

p_value <- (1 - pnorm(abs(z_score), 0, 1))*2
p_value

##           (Intercept) carat colorE colorF colorG colorH colorI colorJ clarityIF
## Good                0     0      0      0      0      0      0      0         0
## Very Good           0     0      0      0      0      0      0      0         0
## Premium             0     0      0      0      0      0      0      0         0
## Ideal               0     0      0      0      0      0      0      0         0
##           clarityVVS1 clarityVVS2 clarityVS1 clarityVS2 claritySI1 claritySI2
## Good                0           0          0          0          0          0
## Very Good           0           0          0          0          0          0
## Premium             0           0          0          0          0          0
## Ideal               0           0          0          0          0          0
##           clarityI1 clarityI2 clarityI3 depth table        price      volume
## Good              0         1       NaN     0     0 1.135833e-03 0.007942184
## Very Good         0       NaN       NaN     0     0 4.016045e-07 0.000000000
## Premium           0       NaN       NaN     0     0 1.188664e-07 0.000000000
## Ideal             0       NaN       NaN     0     0 1.300452e-09 0.000000000
##             obs_num
## Good      0.4145575
## Very Good 0.9421604
## Premium   0.7278821
## Ideal     0.8781871
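
Many of the p-values above print as exactly 0. One likely reason is numerical: for large z-scores, `pnorm(abs(z))` rounds to 1, so `1 - pnorm(abs(z))` becomes exactly 0. The sketch below compares that form with the equivalent lower-tail form `2 * pnorm(-abs(z))`, using the z-score of price for “Good” from the table above and an arbitrary large value of 10 chosen for illustration.

```r
# One z-score from the table above and one arbitrary large value
z <- c(price_good = 3.254522, large = 10)

# Form used in the report: subtracting from 1 loses tiny tail probabilities
p_naive <- (1 - pnorm(abs(z))) * 2

# Equivalent form that evaluates the lower tail directly
p_stable <- 2 * pnorm(-abs(z))

print(p_naive)
print(p_stable)
```

For moderate z-scores the two forms agree; they differ only when the tail probability is smaller than machine precision.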

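The p-values suggest that almost all the coefficients are significant. The exceptions are obs_num, which is not significant for any level of cut, and the clarity terms I2 and I3, whose coefficients are 0 and for which the test statistic is not available (NaN). The NaN warnings and the extremely small standard errors also mean that these p-values should be read with caution. Later in the project, we will use stepwise selection to find the best-fitting model.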

tidy(cut_model, exponentiate=FALSE, conf.int = TRUE) %>% kable(digits = 3, format = "markdown")

## Warning in sqrt(diag(vc)): NaNs produced

## Warning in sqrt(diag(vcov(object))): NaNs produced

y.level	term	estimate	std.error	statistic	p.value	conf.low	conf.high
Good	(Intercept)	54.214	0.000	728929.568	0.000	54.214	54.214
Good	carat	-3.768	0.000	-161699.476	0.000	-3.768	-3.768
Good	colorE	-0.193	0.000	-5129.934	0.000	-0.193	-0.193
Good	colorF	-0.164	0.000	-6136.839	0.000	-0.164	-0.164
Good	colorG	0.245	0.000	2860.899	0.000	0.244	0.245
Good	colorH	-0.226	0.000	-6082.022	0.000	-0.226	-0.226
Good	colorI	0.172	0.000	22580.855	0.000	0.172	0.172
Good	colorJ	0.440	0.000	6816.901	0.000	0.440	0.440
Good	clarityIF	6.289	0.000	618569.504	0.000	6.289	6.289
Good	clarityVVS1	7.053	0.000	1010512.656	0.000	7.053	7.053
Good	clarityVVS2	6.184	0.000	517268.131	0.000	6.184	6.184
Good	clarityVS1	7.766	0.000	112159.772	0.000	7.766	7.766
Good	clarityVS2	7.095	0.000	189005.208	0.000	7.095	7.095
Good	claritySI1	6.749	0.000	52097.779	0.000	6.749	6.749
Good	claritySI2	7.060	0.000	157913.091	0.000	7.060	7.060
Good	clarityI1	6.018	0.000	201533.714	0.000	6.018	6.018
Good	clarityI2	0.000	0.000	0.000	1.000	0.000	0.000
Good	clarityI3	0.000	NaN	NaN	NaN	NaN	NaN
Good	depth	-0.722	0.015	-49.709	0.000	-0.751	-0.694
Good	table	-0.224	0.015	-14.622	0.000	-0.254	-0.194
Good	price	0.000	0.000	3.255	0.001	0.000	0.000
Good	volume	0.009	0.004	2.655	0.008	0.002	0.016
Good	obs_num	0.000	0.000	-0.816	0.415	0.000	0.000
Very Good	(Intercept)	85.839	0.000	1543511.075	0.000	85.839	85.839
Very Good	carat	-15.984	0.000	-628482.930	0.000	-15.984	-15.984
Very Good	colorE	-0.057	0.000	-846.597	0.000	-0.057	-0.057
Very Good	colorF	-0.082	0.000	-1321.513	0.000	-0.082	-0.082
Very Good	colorG	0.279	0.000	3402.677	0.000	0.279	0.279
Very Good	colorH	-0.087	0.000	-1776.679	0.000	-0.087	-0.087
Very Good	colorI	0.354	0.000	13910.852	0.000	0.354	0.354
Very Good	colorJ	0.621	0.000	16394.082	0.000	0.621	0.621
Very Good	clarityIF	9.848	0.000	644300.585	0.000	9.848	9.848
Very Good	clarityVVS1	11.242	0.000	254117.453	0.000	11.242	11.242
Very Good	clarityVVS2	10.579	0.000	383621.291	0.000	10.579	10.579
Very Good	clarityVS1	11.745	0.000	382413.591	0.000	11.745	11.745
Very Good	clarityVS2	10.794	0.000	196652.314	0.000	10.794	10.794
Very Good	claritySI1	10.647	0.000	99228.580	0.000	10.647	10.647
Very Good	claritySI2	10.943	0.000	183277.583	0.000	10.943	10.943
Very Good	clarityI1	10.042	0.000	408667.124	0.000	10.042	10.042
Very Good	clarityI2	0.000	0.000	NaN	NaN	0.000	0.000
Very Good	clarityI3	0.000	0.000	NaN	NaN	0.000	0.000
Very Good	depth	-1.059	0.010	-102.616	0.000	-1.079	-1.039
Very Good	table	-0.461	0.011	-42.076	0.000	-0.482	-0.439
Very Good	price	0.000	0.000	5.068	0.000	0.000	0.001
Very Good	volume	0.079	0.003	23.024	0.000	0.073	0.086
Very Good	obs_num	0.000	0.000	0.073	0.942	0.000	0.000
Premium	(Intercept)	85.520	0.000	1326520.545	0.000	85.520	85.520
Premium	carat	-23.974	0.000	-680342.726	0.000	-23.975	-23.974
Premium	colorE	-0.169	0.000	-1372.775	0.000	-0.169	-0.168
Premium	colorF	0.114	0.000	1233.141	0.000	0.114	0.114
Premium	colorG	0.619	0.000	4987.169	0.000	0.619	0.619
Premium	colorH	0.275	0.000	2677.068	0.000	0.275	0.275
Premium	colorI	0.559	0.000	12331.966	0.000	0.559	0.559
Premium	colorJ	0.672	0.000	16613.911	0.000	0.672	0.672
Premium	clarityIF	9.730	0.000	192729.388	0.000	9.730	9.730
Premium	clarityVVS1	10.730	0.000	206360.547	0.000	10.730	10.730
Premium	clarityVVS2	10.198	0.000	197816.614	0.000	10.198	10.198
Premium	clarityVS1	11.825	0.000	289199.473	0.000	11.825	11.825
Premium	clarityVS2	11.044	0.000	146713.422	0.000	11.044	11.044
Premium	claritySI1	10.656	0.000	146223.538	0.000	10.656	10.656
Premium	claritySI2	10.999	0.000	127844.165	0.000	10.999	10.999
Premium	clarityI1	10.339	0.000	399305.356	0.000	10.339	10.339
Premium	clarityI2	0.000	0.000	NaN	NaN	0.000	0.000
Premium	clarityI3	0.000	0.000	NaN	NaN	0.000	0.000
Premium	depth	-1.220	0.010	-118.557	0.000	-1.240	-1.200
Premium	table	-0.287	0.011	-26.593	0.000	-0.308	-0.266
Premium	price	0.000	0.000	5.295	0.000	0.000	0.001
Premium	volume	0.129	0.003	37.430	0.000	0.122	0.136
Premium	obs_num	0.000	0.000	-0.348	0.728	0.000	0.000
Ideal	(Intercept)	158.066	0.000	3259504.602	0.000	158.066	158.066
Ideal	carat	-25.270	0.000	-587253.201	0.000	-25.270	-25.270
Ideal	colorE	-0.097	0.000	-1214.529	0.000	-0.097	-0.097
Ideal	colorF	-0.015	0.000	-89.565	0.000	-0.015	-0.014
Ideal	colorG	0.387	0.000	2107.112	0.000	0.386	0.387
Ideal	colorH	0.120	0.000	1891.380	0.000	0.120	0.120
Ideal	colorI	0.609	0.000	12948.267	0.000	0.609	0.610
Ideal	colorJ	0.831	0.000	27531.908	0.000	0.831	0.831
Ideal	clarityIF	19.347	0.000	265393.577	0.000	19.347	19.347
Ideal	clarityVVS1	20.385	0.000	240012.970	0.000	20.385	20.385
Ideal	clarityVVS2	19.447	0.000	237062.201	0.000	19.447	19.448
Ideal	clarityVS1	20.632	0.000	411504.992	0.000	20.632	20.632
Ideal	clarityVS2	19.923	0.000	159823.147	0.000	19.923	19.923
Ideal	claritySI1	19.435	0.000	94709.475	0.000	19.434	19.435
Ideal	claritySI2	19.648	0.000	133062.114	0.000	19.648	19.648
Ideal	clarityI1	19.249	0.000	1596135.319	0.000	19.249	19.249
Ideal	clarityI2	0.000	0.000	NaN	NaN	0.000	0.000
Ideal	clarityI3	0.000	0.000	NaN	NaN	0.000	0.000
Ideal	depth	-1.545	0.013	-122.836	0.000	-1.570	-1.520
Ideal	table	-1.347	0.014	-98.024	0.000	-1.374	-1.320
Ideal	price	0.000	0.000	6.067	0.000	0.000	0.001
Ideal	volume	0.131	0.004	36.714	0.000	0.124	0.138
Ideal	obs_num	0.000	0.000	0.153	0.878	0.000	0.000

Residuals Analysis

cut_residuals <- as_tibble(residuals(cut_model)) %>% 
  setNames(paste('resid', names(.), sep = "")) %>%
  mutate(obs_num = 1:n())
  
cut_residuals %>% slice(1:10)

## # A tibble: 10 x 6
##     residFair residGood `residVery Good` residPremium residIdeal obs_num
##         <dbl>     <dbl>            <dbl>        <dbl>      <dbl>   <int>
##  1 -0.00284    -0.0592           -0.211       -0.127     0.400         1
##  2 -0.00754    -0.143            -0.373        0.759    -0.235         2
##  3 -0.0000918  -0.0209           -0.179        0.622    -0.422         3
##  4 -0.0000419  -0.00326          -0.0277      -0.0302    0.0612        4
##  5 -0.0655     -0.217            -0.294        0.618    -0.0410        5
##  6 -0.00165    -0.0605           -0.334       -0.191     0.587         6
##  7 -0.000271   -0.0238           -0.141       -0.128     0.293         7
##  8 -0.000653   -0.0180           -0.188       -0.164     0.371         8
##  9 -0.0102     -0.178            -0.391        0.691    -0.112         9
## 10 -0.319       0.638            -0.141       -0.175    -0.00402      10

plot(cut_residuals)
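
The printed residuals are consistent with response residuals, that is, an indicator for the observed class minus the predicted probability of each class. The sketch below reproduces the first row of the residual table from the first row of predicted probabilities reported in the Response Analysis section. The observed class is an assumption, inferred from the only positive residual in that row.

```r
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# Predicted probabilities for observation 1 (rounded values from the report)
probs_1 <- setNames(c(0.00284, 0.0592, 0.211, 0.127, 0.600), cut_levels)

# Assumed observed class: inferred from the positive residual in row 1
observed_1 <- "Ideal"

# Response residual = 1 for the observed class, 0 otherwise, minus the probability
resid_1 <- as.numeric(cut_levels == observed_1) - probs_1
print(resid_1)
```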

Response Analysis

cut_pred <- as_tibble(predict(cut_model, type = "probs")) %>% mutate(obs_num = 1:n())

cut_pred %>% slice(1:10)

## # A tibble: 10 x 6
##         Fair    Good `Very Good` Premium   Ideal obs_num
##        <dbl>   <dbl>       <dbl>   <dbl>   <dbl>   <int>
##  1 0.00284   0.0592       0.211   0.127  0.600         1
##  2 0.00754   0.143        0.373   0.241  0.235         2
##  3 0.0000918 0.0209       0.179   0.378  0.422         3
##  4 0.0000419 0.00326      0.0277  0.0302 0.939         4
##  5 0.0655    0.217        0.294   0.382  0.0410        5
##  6 0.00165   0.0605       0.334   0.191  0.413         6
##  7 0.000271  0.0238       0.141   0.128  0.707         7
##  8 0.000653  0.0180       0.188   0.164  0.629         8
##  9 0.0102    0.178        0.391   0.309  0.112         9
## 10 0.319     0.362        0.141   0.175  0.00402      10
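
A simple way to read these probabilities is to check that each row sums to 1 and to take the most probable level as the predicted cut. The sketch below does this for three rows copied (rounded) from the output above; it is an illustration on a hand-typed matrix, not a step from the original analysis.

```r
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# Rows 1, 2 and 10 of the predicted probabilities shown above (rounded)
probs <- rbind(
  `1`  = c(0.00284, 0.0592, 0.211, 0.127, 0.600),
  `2`  = c(0.00754, 0.143,  0.373, 0.241, 0.235),
  `10` = c(0.319,   0.362,  0.141, 0.175, 0.00402)
)
colnames(probs) <- cut_levels

# Each row should sum to 1, up to rounding of the printed values
row_sums <- rowSums(probs)

# Predicted class = level with the highest probability in each row
pred_class <- cut_levels[max.col(probs, ties.method = "first")]

print(round(row_sums, 3))
print(pred_class)
```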

Goodness of Fit

As stated earlier, the residual deviance of the full model is 10424.86, which is rather large and indicates a poor fit of the model.

In the final model selection, we will see whether the residual deviance can be reduced.
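
A residual deviance is easier to judge relative to a reference model. As a hedged sketch, the code below compares the full model with the intercept-only model (cut ~ 1) fitted in the stepwise section, using the two "final value" figures printed in this report. McFadden's pseudo R-squared is used here as one possible summary; it is an addition for illustration and not part of the original analysis.

```r
# Negative log-likelihoods copied from the printed optimiser output
nll_null <- 6883.189707  # intercept-only model, cut ~ 1
nll_full <- 5212.428172  # full model, cut ~ .

# Likelihood-ratio statistic: drop in deviance from null to full model
lr_stat <- 2 * (nll_null - nll_full)

# McFadden's pseudo R-squared: 1 - logLik(full) / logLik(null)
mcfadden_r2 <- 1 - nll_full / nll_null

print(round(lr_stat, 2))
print(round(mcfadden_r2, 3))
```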

Confidence Interval and Hypothesis test

Anova(cut_model)

## Analysis of Deviance Table (Type II tests)
## 
## Response: cut
##         LR Chisq Df Pr(>Chisq)    
## carat      41.40  4  2.223e-08 ***
## color      26.82 24     0.3127    
## clarity    59.84 40     0.0226 *  
## depth     433.41  4  < 2.2e-16 ***
## table    1683.87  4  < 2.2e-16 ***
## price      29.08  4  7.536e-06 ***
## volume     34.80  4  5.099e-07 ***
## obs_num     4.28  4     0.3689    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

At the 5% significance level, the variables color and obs_num are not significant. Clarity is significant at the 5% level, while carat, depth, table, price and volume are highly significant.
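
The p-values in the analysis of deviance table can be reproduced from the LR Chisq and Df columns with the chi-squared upper tail. The sketch below retypes those two columns by hand (so the results match the table only up to rounding) and flags which terms fall below 0.05.

```r
# LR Chisq and Df columns copied from the Anova() table above
lr_table <- data.frame(
  term     = c("carat", "color", "clarity", "depth",
               "table", "price", "volume", "obs_num"),
  lr_chisq = c(41.40, 26.82, 59.84, 433.41, 1683.87, 29.08, 34.80, 4.28),
  df       = c(4, 24, 40, 4, 4, 4, 4, 4)
)

# Upper-tail chi-squared probability for each likelihood-ratio statistic
lr_table$p_value <- pchisq(lr_table$lr_chisq, df = lr_table$df,
                           lower.tail = FALSE)
lr_table$significant_5pct <- lr_table$p_value < 0.05

print(lr_table)
```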

Sensitivity Analysis

exp(coef(cut_model))

##            (Intercept)        carat    colorE    colorF   colorG    colorH
## Good      3.506948e+23 2.309266e-02 0.8248215 0.8486407 1.277145 0.7977848
## Very Good 1.903165e+37 1.143011e-07 0.9444887 0.9210278 1.321721 0.9163803
## Premium   1.383484e+37 3.872734e-11 0.8447798 1.1208630 1.856720 1.3166936
## Ideal     4.439462e+68 1.060475e-11 0.9075659 0.9855467 1.471834 1.1271393
##             colorI   colorJ    clarityIF  clarityVVS1  clarityVVS2   clarityVS1
## Good      1.187566 1.553256 5.388414e+02 1.156006e+03 4.850149e+02 2.358481e+03
## Very Good 1.424462 1.860821 1.891162e+04 7.624487e+04 3.930365e+04 1.261516e+05
## Premium   1.748268 1.957561 1.681658e+04 4.571650e+04 2.684934e+04 1.365771e+05
## Ideal     1.839429 2.296045 2.526054e+08 7.127912e+08 2.792052e+08 9.128865e+08
##             clarityVS2   claritySI1   claritySI2    clarityI1 clarityI2
## Good      1.205956e+03 8.530642e+02 1.164799e+03 4.107304e+02         1
## Very Good 4.874421e+04 4.205412e+04 5.653529e+04 2.296995e+04         1
## Premium   6.256902e+04 4.243164e+04 5.979560e+04 3.091244e+04         1
## Ideal     4.491409e+08 2.756869e+08 3.412679e+08 2.289238e+08         1
##           clarityI3     depth     table    price   volume   obs_num
## Good              1 0.4856528 0.7992161 1.000257 1.009427 0.9999377
## Very Good         1 0.3468879 0.6309617 1.000391 1.082725 1.0000054
## Premium           1 0.2952900 0.7505195 1.000408 1.137718 0.9999739
## Ideal             1 0.2132999 0.2600971 1.000479 1.140498 1.0000117

The sensitivity analysis gives the odds ratios of the model, with “Fair” as the reference level. For example, for “Good” versus “Fair”, a one-unit increase in depth multiplies the odds by 0.486, and a one-unit increase in table multiplies the odds by 0.799, holding the other variables constant.
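
The odds ratios are simply the exponentiated coefficients. As a small worked example, the sketch below takes the depth and table coefficients for “Good” from the summary output and converts them to odds ratios and to percentage changes in the odds of “Good” versus “Fair”.

```r
# Coefficients for "Good" (versus "Fair") copied from summary(cut_model)
coef_good <- c(depth = -0.7222614, table = -0.2241240)

# Odds ratio for a one-unit increase in each predictor
odds_ratio <- exp(coef_good)

# The same effect expressed as a percentage change in the odds
pct_change <- (odds_ratio - 1) * 100

print(round(odds_ratio, 3))
print(round(pct_change, 1))
```

An odds ratio below 1 means the odds of “Good” relative to “Fair” fall as the predictor increases.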

Stepwise Model Selection

EMPTY.MOD <- multinom(formula =  cut ~ 1, data = sub_diamond2)

## # weights:  10 (4 variable)
## initial  value 8047.189562 
## iter  10 value 6883.189708
## iter  10 value 6883.189707
## final  value 6883.189707 
## converged

FULL.MOD <- multinom(formula = cut ~., data = sub_diamond2)

## # weights:  120 (92 variable)
## initial  value 8047.189562 
## iter  10 value 7139.851414
## iter  20 value 6570.954502
## iter  30 value 6002.671806
## iter  40 value 5710.830558
## iter  50 value 5535.042800
## iter  60 value 5445.864421
## iter  70 value 5328.252592
## iter  80 value 5245.502706
## iter  90 value 5223.235594
## iter 100 value 5212.458532
## final  value 5212.458532 
## stopped after 100 iterations

forw.sel <- step(object=EMPTY.MOD, scope = list(upper = FULL.MOD), 
                 direction = "forward", k = log(nrow(sub_diamond2)), trace = TRUE)

## Start:  AIC=13800.45
## cut ~ 1
## 
## trying + carat 
## # weights:  15 (8 variable)
## initial  value 8047.189562 
## iter  10 value 6854.244107
## final  value 6782.216571 
## converged
## trying + color 
## # weights:  40 (28 variable)
## initial  value 8047.189562 
## iter  10 value 6975.418728
## iter  20 value 6890.656768
## iter  30 value 6861.764224
## final  value 6861.499688 
## converged
## trying + clarity 
## # weights:  60 (44 variable)
## initial  value 8047.189562 
## iter  10 value 6907.155372
## iter  20 value 6731.875138
## iter  30 value 6694.463493
## iter  40 value 6691.918023
## iter  40 value 6691.918007
## iter  40 value 6691.918007
## final  value 6691.918007 
## converged
## trying + depth 
## # weights:  15 (8 variable)
## initial  value 8047.189562 
## iter  10 value 6650.040899
## final  value 6494.705299 
## converged
## trying + table 
## # weights:  15 (8 variable)
## initial  value 8047.189562 
## iter  10 value 5861.002316
## final  value 5837.796192 
## converged
## trying + price 
## # weights:  15 (8 variable)
## initial  value 8047.189562 
## iter  10 value 6874.878704
## final  value 6845.303860 
## converged
## trying + volume 
## # weights:  15 (8 variable)
## initial  value 8047.189562 
## iter  10 value 6817.137241
## final  value 6793.262218 
## converged
## trying + obs_num 
## # weights:  15 (8 variable)
## initial  value 8047.189562 
## iter  10 value 6890.030010
## final  value 6881.051217 
## converged
##            Df      AIC
## + +table    8 11691.59
## + +depth    8 13005.41
## + +clarity 32 13447.84
## + +carat    8 13580.43
## + +volume   8 13602.52
## + +price    8 13706.61
## <none>      4 13774.38
## + +obs_num  8 13778.10
## + +color   28 13779.00
## # weights:  15 (8 variable)
## initial  value 8047.189562 
## iter  10 value 5861.002316
## final  value 5837.796192 
## converged
## 
## Step:  AIC=11743.73
## cut ~ table
## 
## trying + carat 
## # weights:  20 (12 variable)
## initial  value 8047.189562 
## iter  10 value 6260.439237
## iter  20 value 5802.816581
## final  value 5802.711204 
## converged
## trying + color 
## # weights:  45 (32 variable)
## initial  value 8047.189562 
## iter  10 value 6433.859032
## iter  20 value 5993.920295
## iter  30 value 5849.795267
## iter  40 value 5823.047204
## final  value 5823.010269 
## converged
## trying + clarity 
## # weights:  65 (48 variable)
## initial  value 8047.189562 
## iter  10 value 6523.344685
## iter  20 value 5958.431821
## iter  30 value 5743.891472
## iter  40 value 5704.319242
## final  value 5703.918756 
## converged
## trying + depth 
## # weights:  20 (12 variable)
## initial  value 8047.189562 
## iter  10 value 5988.726502
## iter  20 value 5383.968078
## iter  30 value 5331.076816
## iter  40 value 5330.908627
## iter  50 value 5330.815527
## final  value 5330.815331 
## converged
## trying + price 
## # weights:  20 (12 variable)
## initial  value 8047.189562 
## iter  10 value 6153.044517
## iter  20 value 5829.134869
## final  value 5829.119094 
## converged
## trying + volume 
## # weights:  20 (12 variable)
## initial  value 8047.189562 
## iter  10 value 6227.149179
## iter  20 value 5808.453735
## final  value 5807.658860 
## converged
## trying + obs_num 
## # weights:  20 (12 variable)
## initial  value 8047.189562 
## iter  10 value 6212.252276
## iter  20 value 5835.180775
## final  value 5835.170249 
## converged
##            Df      AIC
## + +depth   12 10685.63
## + +clarity 36 11479.84
## + +carat   12 11629.42
## + +volume  12 11639.32
## + +price   12 11682.24
## <none>      8 11691.59
## + +obs_num 12 11694.34
## + +color   32 11710.02
## # weights:  20 (12 variable)
## initial  value 8047.189562 
## iter  10 value 5988.726502
## iter  20 value 5383.968078
## iter  30 value 5331.076816
## iter  40 value 5330.908627
## iter  50 value 5330.815527
## final  value 5330.815331 
## converged
## 
## Step:  AIC=10763.84
## cut ~ table + depth
## 
## trying + carat 
## # weights:  25 (16 variable)
## initial  value 8047.189562 
## iter  10 value 6471.260617
## iter  20 value 5512.786436
## iter  30 value 5314.780493
## final  value 5314.526366 
## converged
## trying + color 
## # weights:  50 (36 variable)
## initial  value 8047.189562 
## iter  10 value 6512.272810
## iter  20 value 5910.312870
## iter  30 value 5714.025138
## iter  40 value 5489.418847
## iter  50 value 5318.647762
## final  value 5318.567617 
## converged
## trying + clarity 
## # weights:  70 (52 variable)
## initial  value 8047.189562 
## iter  10 value 6408.543105
## iter  20 value 6057.803860
## iter  30 value 5733.973944
## iter  40 value 5456.316407
## iter  50 value 5266.783722
## final  value 5266.439906 
## converged
## trying + price 
## # weights:  25 (16 variable)
## initial  value 8047.189562 
## iter  10 value 6685.968796
## iter  20 value 5528.105337
## iter  30 value 5322.241810
## final  value 5322.159440 
## converged
## trying + volume 
## # weights:  25 (16 variable)
## initial  value 8047.189562 
## iter  10 value 6624.182926
## iter  20 value 5515.253692
## iter  30 value 5314.765013
## final  value 5314.531774 
## converged
## trying + obs_num 
## # weights:  25 (16 variable)
## initial  value 8047.189562 
## iter  10 value 6520.541175
## iter  20 value 5528.468801
## iter  30 value 5328.660963
## final  value 5328.579830 
## converged
##            Df      AIC
## + +clarity 40 10612.88
## + +carat   16 10661.05
## + +volume  16 10661.06
## + +price   16 10676.32
## <none>     12 10685.63
## + +obs_num 16 10689.16
## + +color   36 10709.14
## # weights:  70 (52 variable)
## initial  value 8047.189562 
## iter  10 value 6408.543105
## iter  20 value 6057.803860
## iter  30 value 5733.973944
## iter  40 value 5456.316407
## iter  50 value 5266.783722
## final  value 5266.439906 
## converged
## 
## Step:  AIC=10873.57
## cut ~ table + depth + clarity

Critiques & Limitations

From the stepwise selection, we can see that the best model to fit would have been cut ~ table + depth + clarity. For this project, however, we analysed the full model to give a better understanding of how the model works. The model selected in the last step still has a very large AIC, which confirms the poor fit of the model.

To improve the model, the cut observations could perhaps be reduced to only “Very Good”, “Premium” and “Ideal” in order to reduce the AIC.
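
Restricting the response to three levels could be done by filtering the rows and dropping the unused factor levels before refitting. The sketch below shows the idea on a small made-up data frame (`toy` and its values are hypothetical); the same steps would apply to sub_diamond2.

```r
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# Hypothetical toy data standing in for sub_diamond2
toy <- data.frame(
  cut   = factor(c("Fair", "Ideal", "Good", "Premium", "Very Good", "Ideal"),
                 levels = cut_levels),
  table = c(60, 56, 61, 58, 57, 55)
)

keep_levels <- c("Very Good", "Premium", "Ideal")

# Keep only the three levels and drop the now-empty "Fair" and "Good" levels
toy_top <- droplevels(toy[toy$cut %in% keep_levels, , drop = FALSE])

print(levels(toy_top$cut))
print(nrow(toy_top))
```

Dropping the empty levels matters because the model would otherwise still carry terms for levels with no observations.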

Conclusion

From the preliminary analysis, we found that a single categorical variable on its own does not help to predict the price, depth or carat; rather, it is a combination of the categorical variables that helps to determine the other factors. Consequently, the cut variable has not been a good indicator. Perhaps color would be a better determinant.

Further investigation is necessary to see how the variables interact with one another and to build a much better model for prediction.

Effects of the cut of Diamonds

Project Phase 2 MATH1298

Project Group 30: Mary Legrand (s3815368)

06/11/2020