Diamond Price

About Dataset

Real time prices in the diamond market are reflected by the so-called Diamond Financial Index (DFX) which is available on a daily base since April 2018.

As diamond prices are influenced by many factors like trade barriers, political instability, operational disruptions like mine closures or economic downturns resp. upturns, it is not an easy task to predict the development of future diamond prices.

To predict prices, indicators are needed. Empirical findings support the argument that diamond prices respond to economic downturns resp. upturns and are therefore also correlated with inflation rates and interest rates resp. fed rates. Also gold prices could be an indicator for the development of diamond prices.

Libraries and Setup

We’ll set-up caching for this notebook given how computationally expensive some of the code we will write can get.

this workspace using the library() function:

library(dplyr)
library(lubridate)
library(ggplot2)
library(plotly)
library(glue)
library(ggpubr)
library(scales)
library(tidyr)
library(GGally)

written library is very useful for the results of the analysis

Dataset

data = read.csv("E:/Algoritma/4_lbb/diamond_data1.csv")
data$date = sort(data$date)
head(data)

Because the US are playing quite a big role in the diamond business, the following US rates can be considered:

  • ‘date’ : Date of data collection
  • ‘diamond.price’ : Price of Diamond
  • ‘inflation.rate’ : 10-year breakeven inflation rate
  • ‘interest.rate’ : 10-year treasury inflation-indexed security, constant maturity, risk-free
  • ‘fed.rate’ : effective federal funds rate
  • ‘gold.price’ : Price of Gold

Descriptive Statistics

c(min(data$date), max(data$date))
## [1] "2018-04-28" "2021-06-04"

Data used April 28, 2018 to June 4, 2021

summary(data$diamond.price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9272    9766   10005   10016   10305   10805

The average diamond price is 10005 usd with the highest price being 10805 usd and the lowest being 9272 usd

summary(data$inflation.rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.500   1.630   1.780   1.806   2.100   2.540

The highest inflation.rate value is 2.54% and the lowest is 0.5%

summary(data$interest.rate)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.08000 -0.75750  0.12000  0.01936  0.75000  1.17000

The average interest rate is 0.01936% with the highest interest rate being 1.17% and the lowest being -1.08%. negative value means he reduced interest rate

summary(data$fed.rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.04    0.09    1.58    1.26    2.19    2.45

The average interest rate is 1.26% with the highest interest rate being 2.45% and the lowest being 0.04%

summary(data$gold.price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1170    1293    1502    1538    1776    2064

The average gold price is 1538 usd with the highest price being 2064 usd and the lowest being 1170 usd

Data Type

The following is the data type before the changes are made

glimpse(data)
## Rows: 1,134
## Columns: 6
## $ date           <chr> "2018-04-28", "2018-04-29", "2018-04-30", "2018-05-01",…
## $ diamond.price  <dbl> 10030.58, 10032.95, 10038.62, 10050.07, 10047.16, 10046…
## $ inflation.rate <dbl> 2.17, 2.17, 2.17, 2.16, 2.17, 2.16, 2.16, 2.17, 2.17, 2…
## $ interest.rate  <dbl> 0.78, 0.78, 0.78, 0.81, 0.80, 0.78, 0.79, 0.78, 0.78, 0…
## $ fed.rate       <dbl> 1.70, 1.70, 1.69, 1.70, 1.70, 1.70, 1.70, 1.70, 1.70, 1…
## $ gold.price     <dbl> 1322.80, 1322.80, 1315.38, 1300.58, 1302.13, 1312.23, 1…

The following is the data type after the changes are made

data =data %>%
  mutate(date = as.Date(date))
glimpse(data)
## Rows: 1,134
## Columns: 6
## $ date           <date> 2018-04-28, 2018-04-29, 2018-04-30, 2018-05-01, 2018-0…
## $ diamond.price  <dbl> 10030.58, 10032.95, 10038.62, 10050.07, 10047.16, 10046…
## $ inflation.rate <dbl> 2.17, 2.17, 2.17, 2.16, 2.17, 2.16, 2.16, 2.17, 2.17, 2…
## $ interest.rate  <dbl> 0.78, 0.78, 0.78, 0.81, 0.80, 0.78, 0.79, 0.78, 0.78, 0…
## $ fed.rate       <dbl> 1.70, 1.70, 1.69, 1.70, 1.70, 1.70, 1.70, 1.70, 1.70, 1…
## $ gold.price     <dbl> 1322.80, 1322.80, 1315.38, 1300.58, 1302.13, 1312.23, 1…

Trend Data

trend_diamond <- data %>% 
  filter(date > "2021-03-04" & date <= "2021-06-04") %>% 
  group_by(date) %>% 
  summarise(rate = diamond.price) %>% 
  ungroup()%>%
  mutate(
    label = glue(
      "date: {date}
      price : {comma(rate)}")
  )

plot_trend <- ggplot(data = trend_diamond, mapping =  aes(x = date, 
                                         y = rate)) +
  geom_line(color = 'blue') +
  geom_point(aes(text = label), color='blue')+
  scale_x_date(date_labels = "%b %Y")+
  labs(
    title = "Trend Last 3 Months",
    x = "date",
    y = "Diamond Price")+
  theme_minimal()


ggplotly(plot_trend, tooltip = "text")

The following is the trend of Diamond prices for the last 3 months from the dataset. On March 5, 2021 to March 9, 2021 the Diamond price tend to be flat and Starting to rise sharply on March 10, 2021 until June 4, 2021. The possibility of rising diamond prices due to political or economic policies in America which have an impact on rising diamond prices

Correlation

ggcorr(data, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

Plot ‘ggcorr’ above shows the correlation value between the predictor variable and the Diamond Price. the highest correlation is the gold price with a correlation value of -0.8. which means that the gold price is negatively correlated with the diamond price, meaning that when the price of diamonds increases, the price of gold will decrease and vice versa if the price of diamonds decreases, the price of gold will increase.

Data Preperation

Cleansing Data

data.NA <- data %>%
  filter(!complete.cases(.))
data.NA

the dataset that we have does not have a missing value so that further analysis can be carried out and no missing value handling is required

Cross Validation

data_clean = data %>%
  select(-c(date))
head(data_clean)

Dataset used is time series data, so in this case the deleted variable is ‘date’, because the author wants to eliminate the time factor, and if using date, it is necessary to know the events and policies that apply at a certain time that cause a trend in the data. date will be useful when doing time series analysis

nrow(data_clean)
## [1] 1134
nrow(data_clean)*0.9
## [1] 1020.6

The distribution of training and testing data is 90:10 separated sequentially because the data used is time series data.

data_train = data_clean[1:1021,]
data_tes = data_clean[1022:1134,]
nrow(data_train)
## [1] 1021
nrow(data_tes)
## [1] 113

After the separation, the training data obtained is 1021 rows and the testing is 113 rows

Modelling Data

At this chapter, linear regression modeling is carried out with several models and selection of the best model based on the smallest AIC value

All Variabel

model_all <- lm(diamond.price ~ ., data_train) 
summary(model_all) 
## 
## Call:
## lm(formula = diamond.price ~ ., data = data_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -472.52  -69.21   17.68   89.93  413.34 
## 
## Coefficients:
##                   Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)    11196.09760   150.66733  74.310 < 0.0000000000000002 ***
## inflation.rate   591.51716    15.45582  38.271 < 0.0000000000000002 ***
## interest.rate      3.30532    31.04143   0.106                0.915    
## fed.rate         -75.85469    10.11936  -7.496    0.000000000000143 ***
## gold.price        -1.40823     0.09196 -15.313 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 134.8 on 1016 degrees of freedom
## Multiple R-squared:  0.902,  Adjusted R-squared:  0.9016 
## F-statistic:  2339 on 4 and 1016 DF,  p-value: < 0.00000000000000022

modeling that uses all predictor variables in the training data. the r-square value is 0.9016, which means that the model obtained is good with predictor variables that have no effect is interest.rate .

with model formed is y = 11196.097 + 591.517(inflation.rate) + 3.305(interest.rate) - 75.85469(fed.rate) - 1.40823(gold.price)

  • if the predictor variable is considered constant there is no change then diamond price is equal to 11196.097 usd
  • if the inflation rate increases by one unit, it will increase the diamond price by 591.517 usd with the other predictor variables constant
  • if the fed rate increases by one unit, it will increase the diamond price by 3.305 usd with the other predictor variables constant
  • if the interest rate increases by one unit, it will decrease the diamond price by 75.85469 usd with the other predictor variables constant
  • if the gold price increases by one unit, it will decrease the diamond price by 1.40823 usd with the other predictor variables constant

Base on Correlation

model_corr <- lm(diamond.price ~ inflation.rate + interest.rate + gold.price, data_train) 
summary(model_corr) 
## 
## Call:
## lm(formula = diamond.price ~ inflation.rate + interest.rate + 
##     gold.price, data = data_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -436.25  -87.63    5.99   98.93  379.84 
## 
## Coefficients:
##                   Estimate  Std. Error t value            Pr(>|t|)    
## (Intercept)    10811.05627   145.43341  74.337 <0.0000000000000002 ***
## inflation.rate   580.13900    15.79294  36.734 <0.0000000000000002 ***
## interest.rate    -15.71708    31.76588  -0.495               0.621    
## gold.price        -1.20877     0.09039 -13.373 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 138.4 on 1017 degrees of freedom
## Multiple R-squared:  0.8966, Adjusted R-squared:  0.8963 
## F-statistic:  2940 on 3 and 1017 DF,  p-value: < 0.00000000000000022

This modeling uses a predictor variable that has a correlation value of more than 0.6, so fed.rate is not used for modeling. the r-square value is 0.8963, which means that the model obtained is good with predictor variables that have no effect is interest.rate

with model formed is y = 10811.056 + 580.139(inflation.rate) - 15.71708(interest.rate) - 1.2087(gold.price)

  • if the predictor variable is considered constant there is no change then diamond price is equal to 10811.056 usd
  • if the inflation rate increases by one unit, it will increase the diamond price by 580.139 usd with the other predictor variables constant
  • if the interest rate increases by one unit, it will decrease the diamond price by 15.71708 usd with the other predictor variables constant
  • if the gold price increases by one unit, it will decrease the diamond price by 1.2087 usd with the other predictor variables constant

Backward

model_backward <- step(object = model_all,
                       direction = "backward",
                       trace = FALSE)

summary(model_backward)
## 
## Call:
## lm(formula = diamond.price ~ inflation.rate + fed.rate + gold.price, 
##     data = data_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -472.10  -68.67   17.30   89.59  413.16 
## 
## Coefficients:
##                   Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)    11209.80700    78.21858 143.314 < 0.0000000000000002 ***
## inflation.rate   591.48715    15.44574  38.295 < 0.0000000000000002 ***
## fed.rate         -75.76661    10.08058  -7.516    0.000000000000124 ***
## gold.price        -1.41712     0.03852 -36.793 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 134.7 on 1017 degrees of freedom
## Multiple R-squared:  0.902,  Adjusted R-squared:  0.9017 
## F-statistic:  3121 on 3 and 1017 DF,  p-value: < 0.00000000000000022

modeling that uses Model Backward. the r-square value is 0.9017, which means that the model obtained is good with variables that have an influence is inflation.rate, fed.rate, and gold.price .

with model formed is y = 11209.807 + 591.48715(inflation.rate) - 75.76661(fed.rate) - 1.41712(gold.price)

  • if the predictor variable is considered constant there is no change then diamond price is equal to 11209.807 usd
  • if the inflation rate increases by one unit, it will increase the diamond price by 591.48715 usd with the other predictor variables constant
  • if the fed rate increases by one unit, it will decrease the diamond price by 75.76661 usd with the other predictor variables constant
  • if the gold price increases by one unit, it will decrease the diamond price by 1.41712 usd with the other predictor variables constant

Forward

model_none <- lm(diamond.price ~ 1, data_train)

model_forward <- step(object = model_none, 
                      direction = "forward", 
                      scope = list(lower = model_none, upper = model_all))
## Start:  AIC=12381.71
## diamond.price ~ 1
## 
##                  Df Sum of Sq       RSS   AIC
## + gold.price      1 142998442  45315542 10929
## + interest.rate   1 137159640  51154343 11053
## + fed.rate        1 110331360  77982623 11484
## + inflation.rate  1 100281889  88032095 11607
## <none>                        188313983 12382
## 
## Step:  AIC=10929.34
## diamond.price ~ gold.price
## 
##                  Df Sum of Sq      RSS   AIC
## + inflation.rate  1  25841947 19473595 10069
## + fed.rate        1    264336 45051206 10925
## <none>                        45315542 10929
## + interest.rate   1     14582 45300960 10931
## 
## Step:  AIC=10069.01
## diamond.price ~ gold.price + inflation.rate
## 
##                 Df Sum of Sq      RSS   AIC
## + fed.rate       1   1024783 18448811 10016
## <none>                       19473595 10069
## + interest.rate  1      4686 19468908 10071
## 
## Step:  AIC=10015.81
## diamond.price ~ gold.price + inflation.rate + fed.rate
## 
##                 Df Sum of Sq      RSS   AIC
## <none>                       18448811 10016
## + interest.rate  1    205.88 18448606 10018
summary(model_forward)
## 
## Call:
## lm(formula = diamond.price ~ gold.price + inflation.rate + fed.rate, 
##     data = data_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -472.10  -68.67   17.30   89.59  413.16 
## 
## Coefficients:
##                   Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)    11209.80700    78.21858 143.314 < 0.0000000000000002 ***
## gold.price        -1.41712     0.03852 -36.793 < 0.0000000000000002 ***
## inflation.rate   591.48715    15.44574  38.295 < 0.0000000000000002 ***
## fed.rate         -75.76661    10.08058  -7.516    0.000000000000124 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 134.7 on 1017 degrees of freedom
## Multiple R-squared:  0.902,  Adjusted R-squared:  0.9017 
## F-statistic:  3121 on 3 and 1017 DF,  p-value: < 0.00000000000000022

modeling that uses Model Forward. the r-square value is 0.9017, which means that the model obtained is good with variables that have an influence is inflation.rate, fed.rate, and gold.price .

with model formed is y = 11209.807 + 591.48715(inflation.rate) - 75.76661(fed.rate) - 1.41712(gold.price)

  • if the predictor variable is considered constant there is no change then diamond price is equal to 11209.807 usd
  • if the inflation rate increases by one unit, it will increase the diamond price by 591.48715 usd with the other predictor variables constant
  • if the fed rate increases by one unit, it will decrease the diamond price by 75.76661 usd with the other predictor variables constant
  • if the gold price increases by one unit, it will decrease the diamond price by 1.41712 usd with the other predictor variables constant

Both

model_both <- step(object = model_none,
                   direction = "both",
                   scope = list(upper = model_all))
## Start:  AIC=12381.71
## diamond.price ~ 1
## 
##                  Df Sum of Sq       RSS   AIC
## + gold.price      1 142998442  45315542 10929
## + interest.rate   1 137159640  51154343 11053
## + fed.rate        1 110331360  77982623 11484
## + inflation.rate  1 100281889  88032095 11607
## <none>                        188313983 12382
## 
## Step:  AIC=10929.34
## diamond.price ~ gold.price
## 
##                  Df Sum of Sq       RSS   AIC
## + inflation.rate  1  25841947  19473595 10069
## + fed.rate        1    264336  45051206 10925
## <none>                         45315542 10929
## + interest.rate   1     14582  45300960 10931
## - gold.price      1 142998442 188313983 12382
## 
## Step:  AIC=10069.01
## diamond.price ~ gold.price + inflation.rate
## 
##                  Df Sum of Sq      RSS   AIC
## + fed.rate        1   1024783 18448811 10016
## <none>                        19473595 10069
## + interest.rate   1      4686 19468908 10071
## - inflation.rate  1  25841947 45315542 10929
## - gold.price      1  68558500 88032095 11607
## 
## Step:  AIC=10015.81
## diamond.price ~ gold.price + inflation.rate + fed.rate
## 
##                  Df Sum of Sq      RSS   AIC
## <none>                        18448811 10016
## + interest.rate   1       206 18448606 10018
## - fed.rate        1   1024783 19473595 10069
## - gold.price      1  24556706 43005518 10878
## - inflation.rate  1  26602394 45051206 10925
summary(model_both)
## 
## Call:
## lm(formula = diamond.price ~ gold.price + inflation.rate + fed.rate, 
##     data = data_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -472.10  -68.67   17.30   89.59  413.16 
## 
## Coefficients:
##                   Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)    11209.80700    78.21858 143.314 < 0.0000000000000002 ***
## gold.price        -1.41712     0.03852 -36.793 < 0.0000000000000002 ***
## inflation.rate   591.48715    15.44574  38.295 < 0.0000000000000002 ***
## fed.rate         -75.76661    10.08058  -7.516    0.000000000000124 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 134.7 on 1017 degrees of freedom
## Multiple R-squared:  0.902,  Adjusted R-squared:  0.9017 
## F-statistic:  3121 on 3 and 1017 DF,  p-value: < 0.00000000000000022

modeling that uses Model Forward. the r-square value is 0.9017, which means that the model obtained is good with variables that have an influence is inflation.rate, fed.rate, and gold.price .

with model formed is y = 11209.807 + 591.48715(inflation.rate) - 75.76661(fed.rate) - 1.41712(gold.price)

  • if the predictor variable is considered constant there is no change then diamond price is equal to 11209.807 usd
  • if the inflation rate increases by one unit, it will increase the diamond price by 591.48715 usd with the other predictor variables constant
  • if the fed rate increases by one unit, it will decrease the diamond price by 75.76661 usd with the other predictor variables constant
  • if the gold price increases by one unit, it will decrease the diamond price by 1.41712 usd with the other predictor variables constant

Comparison

library(performance)
#install.packages("performance")
comparison <- compare_performance(model_none, model_all, model_backward, model_forward, model_both, model_corr)

as.data.frame(comparison)

from the model obtained, the lowest aic value is obtained in the model_backward, model_forward, and model_bothwith a value of 12915.29 . so that for the next used is model_backward.

Prediction

pred_model <- predict(
  object = model_backward,
  newdata = data_tes,
  interval = "prediction",
  level = 0.95)

tail(pred_model)
##           fit      lwr      upr
## 1129 9955.385 9689.110 10221.66
## 1130 9960.416 9694.151 10226.68
## 1131 9967.098 9700.849 10233.35
## 1132 9944.980 9678.775 10211.19
## 1133 9984.719 9718.665 10250.77
## 1134 9945.100 9679.063 10211.14

the results above are the predicted values ​​of the test data using model_backward

Illustration of Predict

mydata <- cbind(data_tes, pred_model)

ggplot(data = mydata, aes(x = inflation.rate, y = diamond.price)) +
  geom_point()+
  labs(title = "Linear Regression of Diamond Price by GDP") +
  geom_line(aes(y = fit), color = "blue") +
  geom_line(aes(y = lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y = upr), color = "red", linetype = "dashed") +
  theme_minimal() 

this plot shows the predicted values obtained by setting the upper and lower bounds. obtained several points more than the specified limit. This value occurs because the model created cannot accommodate data that has never been learn

Asumsi Linier Reggresion

Linearity Test

White Test

library(lmtest)

resettest(data_clean$diamond.price ~ data_clean$inflation.rate)
## 
##  RESET test
## 
## data:  data_clean$diamond.price ~ data_clean$inflation.rate
## RESET = 139.13, df1 = 2, df2 = 1130, p-value < 0.00000000000000022
resettest(data_clean$diamond.price ~ data_clean$interest.rate)
## 
##  RESET test
## 
## data:  data_clean$diamond.price ~ data_clean$interest.rate
## RESET = 139.27, df1 = 2, df2 = 1130, p-value < 0.00000000000000022
resettest(data_clean$diamond.price ~ data_clean$fed.rate)
## 
##  RESET test
## 
## data:  data_clean$diamond.price ~ data_clean$fed.rate
## RESET = 108.34, df1 = 2, df2 = 1130, p-value < 0.00000000000000022
resettest(data_clean$diamond.price ~ data_clean$gold.price)
## 
##  RESET test
## 
## data:  data_clean$diamond.price ~ data_clean$gold.price
## RESET = 273.73, df1 = 2, df2 = 1130, p-value < 0.00000000000000022

it can be seen that in each variable in white test has p-value < 0.05 so that it can be said that the data is not linear. to be sure see plot below

resact <- data.frame(residual = model_backward$residuals, fitted = model_backward$fitted.values)

resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) + 
    geom_smooth() + theme(panel.grid = element_blank(), panel.background = element_blank())

from the resulting plot it can be said that the data is not linear and has a polynomial pattern. a good model to use should be a non-linear regression model. However, in this case the author gives a limit to using a linear regression model

Normality Test

Result

hist(model_backward$residuals, freq=FALSE)
lines(density(model_backward$residuals), col = "red")

Based on the histogram plot, it can be seen that the shape of the plot is close to the bell shape, but this can’t be concluded that the residuals are normally distributed. further tests need to be carried out, one of which is the Shapiro-Wilk test

shapiro.test(model_backward$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_backward$residuals
## W = 0.95076, p-value < 0.00000000000000022

based on the results of the Shapiro-Wilk test, the p-value < 0.05 was obtained. it can be concluded that the residuals from the model used are not normally distributed, so further handling is needed

Assumption Handling

Delete Outlier

sort(boxplot.stats(data$diamond.price)$out)
## numeric(0)
sort(boxplot.stats(data$inflation.rate)$out)
##  [1] 0.50 0.63 0.63 0.73 0.73 0.73 0.75 0.80 0.80 0.80 0.87 0.90 0.91 0.92
sort(boxplot.stats(data$interest.rate)$out)
## numeric(0)
sort(boxplot.stats(data$fed.rate)$out)
## numeric(0)
sort(boxplot.stats(data$gold.price)$out)
## numeric(0)

It can be seen from the results above that the inflation.rate variable has an outlier, to ensure the outlier can be seen in the boxplot below

boxplot(data$inflation.rate, horizontal = TRUE)

from the boxplot above it can be seen that the outlier value is on the left side of the minimum value, so for deletion of data only on the left side of the minimum value

data_out <- data %>% 
  select(-c(date)) %>%
  filter(inflation.rate > 0.92)

data_out is data set that has no outliers

Transform with sqrt

data_out <- data_out %>%
  mutate(
    diamond.price = sqrt(diamond.price),
    inflation.rate = sqrt(inflation.rate),
    interest.rate = sqrt(interest.rate),
    fed.rate = sqrt(fed.rate),
    gold.price = sqrt(gold.price)
  )

The next step is to transform the data. for the transformation formula used based on the shape of the residual histogram plot obtained.

supposed to handle this assumption, it is enough to transform, there is no need to remove outliers, because the data obtained is very valuable, but for this writing the author wants to show how to remove outlier data.

Remodeling

nrow(data_out)
## [1] 1120
nrow(data_out)*0.9
## [1] 1008

with the same stage of dividing the data with the proportion of 90 : 10

out_train = data_out[1:1008,]
out_tes = data_out[1009:1120,]
trans_model_all <- lm(diamond.price ~ ., out_train) 
trans_backward <- step(object = trans_model_all,
                       direction = "backward",
                       trace = FALSE)

shapiro.test(trans_backward$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  trans_backward$residuals
## W = 0.97243, p-value = 0.000000001725

After remodeling with model_backward, we get p-value3 < 0.05, which means that the residuals are still not normally distributed even though the transformation has been carried out. Cases like this can occur because the model used is not quite right

Homoscedasticity of Residuals

Result

to test the Homoscedasticity of Residuals you can use the Breusch-Pagan test

library(lmtest)
bptest(model_backward)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_backward
## BP = 78.424, df = 3, p-value < 0.00000000000000022

the results of the Breusch-Pagan test obtained a p-value < 0.05 which means that the residual spread is not constant or heteroscedasticity

Transformasi BoxCox

handling for heteroscedasticity can use box-cox transformation

library(caret)
library(lmtest)
bcy <- BoxCoxTrans(data_clean$diamond.price)
ybc=predict(bcy, data_clean$diamond.price)
box_trans = cbind(data_clean, ybc)
box_trans = box_trans %>%
  select(-c(diamond.price))
bcy
## Box-Cox Transformation
## 
## 1134 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9272    9766   10005   10016   10305   10805 
## 
## Largest/Smallest: 1.17 
## Sample Skewness: 0.0115 
## 
## Estimated Lambda: 0.7

the result of the box-cox transformation was obtained Lambda = 0.7. and henceforth it can be re-modeled with transformation data

Remodeling

nrow(box_trans)
## [1] 1134
nrow(box_trans)*0.9
## [1] 1020.6

with the same stage of dividing the data with the proportion of 90 : 10

box_train = box_trans[1:1021,]
box_tes = box_trans[1022:1134,]
box_model_all <- lm(ybc ~ ., box_train) 
box_backward <- step(object = box_model_all,
                       direction = "backward",
                       trace = FALSE)

bptest(box_backward)
## 
##  studentized Breusch-Pagan test
## 
## data:  box_backward
## BP = 76.906, df = 3, p-value < 0.00000000000000022

After remodeling with model_backward, we get p-value < 0.05, which means that the residual spread is not constant or heteroscedasticity even though the transformation box-cox has been carried out. Cases like this can occur because the model used is not quite right

Non-Multicollinearity

library(car)
vif(model_backward)
## inflation.rate       fed.rate     gold.price 
##       1.280387       5.183588       5.238544

the value of VIF for each variable is < 10 so it can be concluded that there is no multicollinearity between variables

Autocorrelation

Result

for autocorrelation assumptions can use the Durbin-Watson test

durbinWatsonTest(model_backward)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.9644079    0.05930637       0
##  Alternative hypothesis: rho != 0

From the output we can see that he corresponding p-value is 0. Since this p-value is less than 0.05, we can reject the null hypothesis and conclude that the residuals in this regression model are autocorrelated.

if the residuals are not autocorrelated then a Cochrane-Orcutt Estimation can be performed

Conclusion

based on the analysis that has been done it can be concluded that :

  • The model used is the model_backward y = 11209.807 + 591.48715(inflation.rate) - 75.76661(fed.rate) - 1.41712(gold.price)

  • However, the linear regression model obtained is not good for interpreting diamond price data because the data has a polynomial pattern.

  • the results obtained are biased and unreliable, so it is not recommended to use a linear regression model

  • if you want to modelling diamond price data properly, it’s recommended to use a non-linear regression model