Real time prices in the diamond market are reflected by the so-called Diamond Financial Index (DFX) which is available on a daily base since April 2018.
As diamond prices are influenced by many factors like trade barriers, political instability, operational disruptions like mine closures or economic downturns resp. upturns, it is not an easy task to predict the development of future diamond prices.
To predict prices, indicators are needed. Empirical findings support the argument that diamond prices respond to economic downturns resp. upturns and are therefore also correlated with inflation rates and interest rates resp. fed rates. Also gold prices could be an indicator for the development of diamond prices.
We’ll set-up caching for this notebook given how computationally expensive some of the code we will write can get.
this workspace using the library() function:
library(dplyr)
library(lubridate)
library(ggplot2)
library(plotly)
library(glue)
library(ggpubr)
library(scales)
library(tidyr)
library(GGally)written library is very useful for the results of the analysis
data = read.csv("E:/Algoritma/4_lbb/diamond_data1.csv")
data$date = sort(data$date)
head(data)Because the US are playing quite a big role in the diamond business, the following US rates can be considered:
c(min(data$date), max(data$date))## [1] "2018-04-28" "2021-06-04"
Data used April 28, 2018 to June 4, 2021
summary(data$diamond.price)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9272 9766 10005 10016 10305 10805
The average diamond price is 10005 usd with the highest price being 10805 usd and the lowest being 9272 usd
summary(data$inflation.rate)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.500 1.630 1.780 1.806 2.100 2.540
The highest inflation.rate value is 2.54% and the lowest is 0.5%
summary(data$interest.rate)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.08000 -0.75750 0.12000 0.01936 0.75000 1.17000
The average interest rate is 0.01936% with the highest interest rate being 1.17% and the lowest being -1.08%. negative value means he reduced interest rate
summary(data$fed.rate)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04 0.09 1.58 1.26 2.19 2.45
The average interest rate is 1.26% with the highest interest rate being 2.45% and the lowest being 0.04%
summary(data$gold.price)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1170 1293 1502 1538 1776 2064
The average gold price is 1538 usd with the highest price being 2064 usd and the lowest being 1170 usd
The following is the data type before the changes are made
glimpse(data)## Rows: 1,134
## Columns: 6
## $ date <chr> "2018-04-28", "2018-04-29", "2018-04-30", "2018-05-01",…
## $ diamond.price <dbl> 10030.58, 10032.95, 10038.62, 10050.07, 10047.16, 10046…
## $ inflation.rate <dbl> 2.17, 2.17, 2.17, 2.16, 2.17, 2.16, 2.16, 2.17, 2.17, 2…
## $ interest.rate <dbl> 0.78, 0.78, 0.78, 0.81, 0.80, 0.78, 0.79, 0.78, 0.78, 0…
## $ fed.rate <dbl> 1.70, 1.70, 1.69, 1.70, 1.70, 1.70, 1.70, 1.70, 1.70, 1…
## $ gold.price <dbl> 1322.80, 1322.80, 1315.38, 1300.58, 1302.13, 1312.23, 1…
The following is the data type after the changes are made
data =data %>%
mutate(date = as.Date(date))
glimpse(data)## Rows: 1,134
## Columns: 6
## $ date <date> 2018-04-28, 2018-04-29, 2018-04-30, 2018-05-01, 2018-0…
## $ diamond.price <dbl> 10030.58, 10032.95, 10038.62, 10050.07, 10047.16, 10046…
## $ inflation.rate <dbl> 2.17, 2.17, 2.17, 2.16, 2.17, 2.16, 2.16, 2.17, 2.17, 2…
## $ interest.rate <dbl> 0.78, 0.78, 0.78, 0.81, 0.80, 0.78, 0.79, 0.78, 0.78, 0…
## $ fed.rate <dbl> 1.70, 1.70, 1.69, 1.70, 1.70, 1.70, 1.70, 1.70, 1.70, 1…
## $ gold.price <dbl> 1322.80, 1322.80, 1315.38, 1300.58, 1302.13, 1312.23, 1…
trend_diamond <- data %>%
filter(date > "2021-03-04" & date <= "2021-06-04") %>%
group_by(date) %>%
summarise(rate = diamond.price) %>%
ungroup()%>%
mutate(
label = glue(
"date: {date}
price : {comma(rate)}")
)
plot_trend <- ggplot(data = trend_diamond, mapping = aes(x = date,
y = rate)) +
geom_line(color = 'blue') +
geom_point(aes(text = label), color='blue')+
scale_x_date(date_labels = "%b %Y")+
labs(
title = "Trend Last 3 Months",
x = "date",
y = "Diamond Price")+
theme_minimal()
ggplotly(plot_trend, tooltip = "text")The following is the trend of Diamond prices for the last 3 months from the dataset. On March 5, 2021 to March 9, 2021 the Diamond price tend to be flat and Starting to rise sharply on March 10, 2021 until June 4, 2021. The possibility of rising diamond prices due to political or economic policies in America which have an impact on rising diamond prices
ggcorr(data, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)Plot ‘ggcorr’ above shows the correlation value between the predictor variable and the Diamond Price. the highest correlation is the gold price with a correlation value of -0.8. which means that the gold price is negatively correlated with the diamond price, meaning that when the price of diamonds increases, the price of gold will decrease and vice versa if the price of diamonds decreases, the price of gold will increase.
data.NA <- data %>%
filter(!complete.cases(.))
data.NAthe dataset that we have does not have a missing value so that further analysis can be carried out and no missing value handling is required
data_clean = data %>%
select(-c(date))
head(data_clean)Dataset used is time series data, so in this case the deleted variable is ‘date’, because the author wants to eliminate the time factor, and if using date, it is necessary to know the events and policies that apply at a certain time that cause a trend in the data. date will be useful when doing time series analysis
nrow(data_clean)## [1] 1134
nrow(data_clean)*0.9## [1] 1020.6
The distribution of training and testing data is 90:10 separated sequentially because the data used is time series data.
data_train = data_clean[1:1021,]
data_tes = data_clean[1022:1134,]
nrow(data_train)## [1] 1021
nrow(data_tes)## [1] 113
After the separation, the training data obtained is 1021 rows and the testing is 113 rows
At this chapter, linear regression modeling is carried out with several models and selection of the best model based on the smallest AIC value
model_all <- lm(diamond.price ~ ., data_train)
summary(model_all) ##
## Call:
## lm(formula = diamond.price ~ ., data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -472.52 -69.21 17.68 89.93 413.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11196.09760 150.66733 74.310 < 0.0000000000000002 ***
## inflation.rate 591.51716 15.45582 38.271 < 0.0000000000000002 ***
## interest.rate 3.30532 31.04143 0.106 0.915
## fed.rate -75.85469 10.11936 -7.496 0.000000000000143 ***
## gold.price -1.40823 0.09196 -15.313 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134.8 on 1016 degrees of freedom
## Multiple R-squared: 0.902, Adjusted R-squared: 0.9016
## F-statistic: 2339 on 4 and 1016 DF, p-value: < 0.00000000000000022
modeling that uses all predictor variables in the training data. the
r-square value is 0.9016, which means that the model obtained is good
with predictor variables that have no effect is
interest.rate .
with model formed is
y = 11196.097 + 591.517(inflation.rate) + 3.305(interest.rate) - 75.85469(fed.rate) - 1.40823(gold.price)
inflation rate increases by one unit, it will
increase the diamond price by 591.517 usd with the other predictor
variables constantfed rate increases by one unit, it will increase
the diamond price by 3.305 usd with the other predictor variables
constantinterest rate increases by one unit, it will
decrease the diamond price by 75.85469 usd with the other predictor
variables constantgold price increases by one unit, it will
decrease the diamond price by 1.40823 usd with the other predictor
variables constantmodel_corr <- lm(diamond.price ~ inflation.rate + interest.rate + gold.price, data_train)
summary(model_corr) ##
## Call:
## lm(formula = diamond.price ~ inflation.rate + interest.rate +
## gold.price, data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -436.25 -87.63 5.99 98.93 379.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10811.05627 145.43341 74.337 <0.0000000000000002 ***
## inflation.rate 580.13900 15.79294 36.734 <0.0000000000000002 ***
## interest.rate -15.71708 31.76588 -0.495 0.621
## gold.price -1.20877 0.09039 -13.373 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 138.4 on 1017 degrees of freedom
## Multiple R-squared: 0.8966, Adjusted R-squared: 0.8963
## F-statistic: 2940 on 3 and 1017 DF, p-value: < 0.00000000000000022
This modeling uses a predictor variable that has a correlation value
of more than 0.6, so fed.rate is not used for modeling. the r-square
value is 0.8963, which means that the model obtained is good with
predictor variables that have no effect is
interest.rate
with model formed is
y = 10811.056 + 580.139(inflation.rate) - 15.71708(interest.rate) - 1.2087(gold.price)
inflation rate increases by one unit, it will
increase the diamond price by 580.139 usd with the other predictor
variables constantinterest rate increases by one unit, it will
decrease the diamond price by 15.71708 usd with the other predictor
variables constantgold price increases by one unit, it will
decrease the diamond price by 1.2087 usd with the other predictor
variables constantmodel_backward <- step(object = model_all,
direction = "backward",
trace = FALSE)
summary(model_backward)##
## Call:
## lm(formula = diamond.price ~ inflation.rate + fed.rate + gold.price,
## data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -472.10 -68.67 17.30 89.59 413.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11209.80700 78.21858 143.314 < 0.0000000000000002 ***
## inflation.rate 591.48715 15.44574 38.295 < 0.0000000000000002 ***
## fed.rate -75.76661 10.08058 -7.516 0.000000000000124 ***
## gold.price -1.41712 0.03852 -36.793 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134.7 on 1017 degrees of freedom
## Multiple R-squared: 0.902, Adjusted R-squared: 0.9017
## F-statistic: 3121 on 3 and 1017 DF, p-value: < 0.00000000000000022
modeling that uses Model Backward. the r-square value is 0.9017,
which means that the model obtained is good with variables that have an
influence is inflation.rate, fed.rate, and
gold.price .
with model formed is
y = 11209.807 + 591.48715(inflation.rate) - 75.76661(fed.rate) - 1.41712(gold.price)
inflation rate increases by one unit, it will
increase the diamond price by 591.48715 usd with the other predictor
variables constantfed rate increases by one unit, it will decrease
the diamond price by 75.76661 usd with the other predictor variables
constantgold price increases by one unit, it will
decrease the diamond price by 1.41712 usd with the other predictor
variables constantmodel_none <- lm(diamond.price ~ 1, data_train)
model_forward <- step(object = model_none,
direction = "forward",
scope = list(lower = model_none, upper = model_all))## Start: AIC=12381.71
## diamond.price ~ 1
##
## Df Sum of Sq RSS AIC
## + gold.price 1 142998442 45315542 10929
## + interest.rate 1 137159640 51154343 11053
## + fed.rate 1 110331360 77982623 11484
## + inflation.rate 1 100281889 88032095 11607
## <none> 188313983 12382
##
## Step: AIC=10929.34
## diamond.price ~ gold.price
##
## Df Sum of Sq RSS AIC
## + inflation.rate 1 25841947 19473595 10069
## + fed.rate 1 264336 45051206 10925
## <none> 45315542 10929
## + interest.rate 1 14582 45300960 10931
##
## Step: AIC=10069.01
## diamond.price ~ gold.price + inflation.rate
##
## Df Sum of Sq RSS AIC
## + fed.rate 1 1024783 18448811 10016
## <none> 19473595 10069
## + interest.rate 1 4686 19468908 10071
##
## Step: AIC=10015.81
## diamond.price ~ gold.price + inflation.rate + fed.rate
##
## Df Sum of Sq RSS AIC
## <none> 18448811 10016
## + interest.rate 1 205.88 18448606 10018
summary(model_forward)##
## Call:
## lm(formula = diamond.price ~ gold.price + inflation.rate + fed.rate,
## data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -472.10 -68.67 17.30 89.59 413.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11209.80700 78.21858 143.314 < 0.0000000000000002 ***
## gold.price -1.41712 0.03852 -36.793 < 0.0000000000000002 ***
## inflation.rate 591.48715 15.44574 38.295 < 0.0000000000000002 ***
## fed.rate -75.76661 10.08058 -7.516 0.000000000000124 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134.7 on 1017 degrees of freedom
## Multiple R-squared: 0.902, Adjusted R-squared: 0.9017
## F-statistic: 3121 on 3 and 1017 DF, p-value: < 0.00000000000000022
modeling that uses Model Forward. the r-square value is 0.9017, which
means that the model obtained is good with variables that have an
influence is inflation.rate, fed.rate, and
gold.price .
with model formed is
y = 11209.807 + 591.48715(inflation.rate) - 75.76661(fed.rate) - 1.41712(gold.price)
inflation rate increases by one unit, it will
increase the diamond price by 591.48715 usd with the other predictor
variables constantfed rate increases by one unit, it will decrease
the diamond price by 75.76661 usd with the other predictor variables
constantgold price increases by one unit, it will
decrease the diamond price by 1.41712 usd with the other predictor
variables constantmodel_both <- step(object = model_none,
direction = "both",
scope = list(upper = model_all))## Start: AIC=12381.71
## diamond.price ~ 1
##
## Df Sum of Sq RSS AIC
## + gold.price 1 142998442 45315542 10929
## + interest.rate 1 137159640 51154343 11053
## + fed.rate 1 110331360 77982623 11484
## + inflation.rate 1 100281889 88032095 11607
## <none> 188313983 12382
##
## Step: AIC=10929.34
## diamond.price ~ gold.price
##
## Df Sum of Sq RSS AIC
## + inflation.rate 1 25841947 19473595 10069
## + fed.rate 1 264336 45051206 10925
## <none> 45315542 10929
## + interest.rate 1 14582 45300960 10931
## - gold.price 1 142998442 188313983 12382
##
## Step: AIC=10069.01
## diamond.price ~ gold.price + inflation.rate
##
## Df Sum of Sq RSS AIC
## + fed.rate 1 1024783 18448811 10016
## <none> 19473595 10069
## + interest.rate 1 4686 19468908 10071
## - inflation.rate 1 25841947 45315542 10929
## - gold.price 1 68558500 88032095 11607
##
## Step: AIC=10015.81
## diamond.price ~ gold.price + inflation.rate + fed.rate
##
## Df Sum of Sq RSS AIC
## <none> 18448811 10016
## + interest.rate 1 206 18448606 10018
## - fed.rate 1 1024783 19473595 10069
## - gold.price 1 24556706 43005518 10878
## - inflation.rate 1 26602394 45051206 10925
summary(model_both)##
## Call:
## lm(formula = diamond.price ~ gold.price + inflation.rate + fed.rate,
## data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -472.10 -68.67 17.30 89.59 413.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11209.80700 78.21858 143.314 < 0.0000000000000002 ***
## gold.price -1.41712 0.03852 -36.793 < 0.0000000000000002 ***
## inflation.rate 591.48715 15.44574 38.295 < 0.0000000000000002 ***
## fed.rate -75.76661 10.08058 -7.516 0.000000000000124 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134.7 on 1017 degrees of freedom
## Multiple R-squared: 0.902, Adjusted R-squared: 0.9017
## F-statistic: 3121 on 3 and 1017 DF, p-value: < 0.00000000000000022
modeling that uses Model Forward. the r-square value is 0.9017, which
means that the model obtained is good with variables that have an
influence is inflation.rate, fed.rate, and
gold.price .
with model formed is
y = 11209.807 + 591.48715(inflation.rate) - 75.76661(fed.rate) - 1.41712(gold.price)
inflation rate increases by one unit, it will
increase the diamond price by 591.48715 usd with the other predictor
variables constantfed rate increases by one unit, it will decrease
the diamond price by 75.76661 usd with the other predictor variables
constantgold price increases by one unit, it will
decrease the diamond price by 1.41712 usd with the other predictor
variables constantlibrary(performance)
#install.packages("performance")
comparison <- compare_performance(model_none, model_all, model_backward, model_forward, model_both, model_corr)
as.data.frame(comparison)from the model obtained, the lowest aic value is obtained in the
model_backward, model_forward, and
model_bothwith a value of 12915.29 . so that for the next
used is model_backward.
pred_model <- predict(
object = model_backward,
newdata = data_tes,
interval = "prediction",
level = 0.95)
tail(pred_model)## fit lwr upr
## 1129 9955.385 9689.110 10221.66
## 1130 9960.416 9694.151 10226.68
## 1131 9967.098 9700.849 10233.35
## 1132 9944.980 9678.775 10211.19
## 1133 9984.719 9718.665 10250.77
## 1134 9945.100 9679.063 10211.14
the results above are the predicted values of the test data using model_backward
mydata <- cbind(data_tes, pred_model)
ggplot(data = mydata, aes(x = inflation.rate, y = diamond.price)) +
geom_point()+
labs(title = "Linear Regression of Diamond Price by GDP") +
geom_line(aes(y = fit), color = "blue") +
geom_line(aes(y = lwr), color = "red", linetype = "dashed") +
geom_line(aes(y = upr), color = "red", linetype = "dashed") +
theme_minimal() this plot shows the predicted values obtained by setting the upper and lower bounds. obtained several points more than the specified limit. This value occurs because the model created cannot accommodate data that has never been learn
library(lmtest)
resettest(data_clean$diamond.price ~ data_clean$inflation.rate)##
## RESET test
##
## data: data_clean$diamond.price ~ data_clean$inflation.rate
## RESET = 139.13, df1 = 2, df2 = 1130, p-value < 0.00000000000000022
resettest(data_clean$diamond.price ~ data_clean$interest.rate)##
## RESET test
##
## data: data_clean$diamond.price ~ data_clean$interest.rate
## RESET = 139.27, df1 = 2, df2 = 1130, p-value < 0.00000000000000022
resettest(data_clean$diamond.price ~ data_clean$fed.rate)##
## RESET test
##
## data: data_clean$diamond.price ~ data_clean$fed.rate
## RESET = 108.34, df1 = 2, df2 = 1130, p-value < 0.00000000000000022
resettest(data_clean$diamond.price ~ data_clean$gold.price)##
## RESET test
##
## data: data_clean$diamond.price ~ data_clean$gold.price
## RESET = 273.73, df1 = 2, df2 = 1130, p-value < 0.00000000000000022
it can be seen that in each variable in white test has p-value < 0.05 so that it can be said that the data is not linear. to be sure see plot below
resact <- data.frame(residual = model_backward$residuals, fitted = model_backward$fitted.values)
resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) +
geom_smooth() + theme(panel.grid = element_blank(), panel.background = element_blank())from the resulting plot it can be said that the data is not linear
and has a polynomial pattern. a good model to use should be a
non-linear regression model. However, in this case the
author gives a limit to using a linear regression model
hist(model_backward$residuals, freq=FALSE)
lines(density(model_backward$residuals), col = "red")Based on the histogram plot, it can be seen that the shape of the plot is close to the bell shape, but this can’t be concluded that the residuals are normally distributed. further tests need to be carried out, one of which is the Shapiro-Wilk test
shapiro.test(model_backward$residuals)##
## Shapiro-Wilk normality test
##
## data: model_backward$residuals
## W = 0.95076, p-value < 0.00000000000000022
based on the results of the Shapiro-Wilk test, the p-value < 0.05 was obtained. it can be concluded that the residuals from the model used are not normally distributed, so further handling is needed
sort(boxplot.stats(data$diamond.price)$out)## numeric(0)
sort(boxplot.stats(data$inflation.rate)$out)## [1] 0.50 0.63 0.63 0.73 0.73 0.73 0.75 0.80 0.80 0.80 0.87 0.90 0.91 0.92
sort(boxplot.stats(data$interest.rate)$out)## numeric(0)
sort(boxplot.stats(data$fed.rate)$out)## numeric(0)
sort(boxplot.stats(data$gold.price)$out)## numeric(0)
It can be seen from the results above that the
inflation.rate variable has an outlier, to ensure the
outlier can be seen in the boxplot below
boxplot(data$inflation.rate, horizontal = TRUE)from the boxplot above it can be seen that the outlier value is on the left side of the minimum value, so for deletion of data only on the left side of the minimum value
data_out <- data %>%
select(-c(date)) %>%
filter(inflation.rate > 0.92)data_out is data set that has no outliers
data_out <- data_out %>%
mutate(
diamond.price = sqrt(diamond.price),
inflation.rate = sqrt(inflation.rate),
interest.rate = sqrt(interest.rate),
fed.rate = sqrt(fed.rate),
gold.price = sqrt(gold.price)
)The next step is to transform the data. for the transformation formula used based on the shape of the residual histogram plot obtained.
supposed to handle this assumption, it is enough to transform, there is no need to remove outliers, because the data obtained is very valuable, but for this writing the author wants to show how to remove outlier data.
nrow(data_out)## [1] 1120
nrow(data_out)*0.9## [1] 1008
with the same stage of dividing the data with the proportion of 90 : 10
out_train = data_out[1:1008,]
out_tes = data_out[1009:1120,]
trans_model_all <- lm(diamond.price ~ ., out_train)
trans_backward <- step(object = trans_model_all,
direction = "backward",
trace = FALSE)
shapiro.test(trans_backward$residuals)##
## Shapiro-Wilk normality test
##
## data: trans_backward$residuals
## W = 0.97243, p-value = 0.000000001725
After remodeling with model_backward, we get p-value3 < 0.05, which means that the residuals are still not normally distributed even though the transformation has been carried out. Cases like this can occur because the model used is not quite right
to test the Homoscedasticity of Residuals you can use the Breusch-Pagan test
library(lmtest)
bptest(model_backward)##
## studentized Breusch-Pagan test
##
## data: model_backward
## BP = 78.424, df = 3, p-value < 0.00000000000000022
the results of the Breusch-Pagan test obtained a p-value < 0.05 which means that the residual spread is not constant or heteroscedasticity
handling for heteroscedasticity can use box-cox transformation
library(caret)
library(lmtest)
bcy <- BoxCoxTrans(data_clean$diamond.price)
ybc=predict(bcy, data_clean$diamond.price)
box_trans = cbind(data_clean, ybc)
box_trans = box_trans %>%
select(-c(diamond.price))
bcy## Box-Cox Transformation
##
## 1134 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9272 9766 10005 10016 10305 10805
##
## Largest/Smallest: 1.17
## Sample Skewness: 0.0115
##
## Estimated Lambda: 0.7
the result of the box-cox transformation was obtained Lambda = 0.7. and henceforth it can be re-modeled with transformation data
nrow(box_trans)## [1] 1134
nrow(box_trans)*0.9## [1] 1020.6
with the same stage of dividing the data with the proportion of 90 : 10
box_train = box_trans[1:1021,]
box_tes = box_trans[1022:1134,]
box_model_all <- lm(ybc ~ ., box_train)
box_backward <- step(object = box_model_all,
direction = "backward",
trace = FALSE)
bptest(box_backward)##
## studentized Breusch-Pagan test
##
## data: box_backward
## BP = 76.906, df = 3, p-value < 0.00000000000000022
After remodeling with model_backward, we get p-value < 0.05, which means that the residual spread is not constant or heteroscedasticity even though the transformation box-cox has been carried out. Cases like this can occur because the model used is not quite right
library(car)
vif(model_backward)## inflation.rate fed.rate gold.price
## 1.280387 5.183588 5.238544
the value of VIF for each variable is < 10 so it can be concluded that there is no multicollinearity between variables
for autocorrelation assumptions can use the Durbin-Watson test
durbinWatsonTest(model_backward)## lag Autocorrelation D-W Statistic p-value
## 1 0.9644079 0.05930637 0
## Alternative hypothesis: rho != 0
From the output we can see that he corresponding p-value is 0. Since this p-value is less than 0.05, we can reject the null hypothesis and conclude that the residuals in this regression model are autocorrelated.
if the residuals are not autocorrelated then a Cochrane-Orcutt Estimation can be performed
based on the analysis that has been done it can be concluded that :
The model used is the model_backward
y = 11209.807 + 591.48715(inflation.rate) - 75.76661(fed.rate) - 1.41712(gold.price)
However, the linear regression model obtained is not good for interpreting diamond price data because the data has a polynomial pattern.
the results obtained are biased and unreliable, so it is not recommended to use a linear regression model
if you want to modelling diamond price data properly, it’s recommended to use a non-linear regression model