Introduction

This analysis builds a regression model to predict diamond prices from the diamonds' characteristics. The dataset is downloaded from Kaggle: https://www.kaggle.com/shivam2503/diamonds

This analysis is part of the Algoritma LBB project for the Regression class.

This project consists of:

  1. Data Wrangling

  2. Exploratory Data Analysis

  3. Modeling

First, we will load the libraries needed for this analysis.
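The report does not list the packages explicitly, so here is a sketch of the libraries that the later steps appear to rely on (lmtest, bestNormalize, and MLmetrics are attached further down, as their loading messages show):

```r
# Assumed package set, inferred from the functions used later in this report
library(tidyverse)  # data wrangling and readr::read_csv()
library(skimr)      # skim() data summaries
library(GGally)     # ggpairs() and correlation plots
library(MASS)       # studres()
library(car)        # vif()
library(olsrr)      # ols_plot_cooksd_bar()
```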

Reading the dataset file.
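The exact path is not shown in the report; a minimal sketch, assuming the Kaggle file is saved as diamonds.csv in the working directory:

```r
# The unnamed index column is read in as X1, as seen in the glimpse below
diamonds <- read_csv("diamonds.csv")
```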

  - price: price in US dollars ($326–$18,823)
  - carat: weight of the diamond (0.2–5.01)
  - cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
  - color: diamond colour, from J (worst) to D (best)
  - clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
  - x: length in mm (0–10.74)
  - y: width in mm (0–58.9)
  - z: depth in mm (0–31.8)
  - depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
  - table: width of top of diamond relative to widest point (43–95)

Now we will take a glimpse at the data.

## Observations: 53,940
## Variables: 11
## $ X1      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0…
## $ cut     <chr> "Ideal", "Premium", "Good", "Premium", "Good", "Very Good", "…
## $ color   <chr> "E", "E", "E", "I", "J", "J", "I", "H", "E", "H", "J", "J", "…
## $ clarity <chr> "SI2", "SI1", "VS1", "VS2", "SI2", "VVS2", "VVS1", "SI1", "VS…
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 6…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 5…
## $ price   <dbl> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 3…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2…

Now we will do some data transformation, converting several columns into the appropriate type. In this case, the cut, color, and clarity columns should be factors, not characters.
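A minimal sketch of that conversion:

```r
# Convert the character columns cut, color, and clarity into factors
diamonds <- diamonds %>% 
  mutate(
    cut     = as.factor(cut),
    color   = as.factor(color),
    clarity = as.factor(clarity)
  )
```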

Personally, I prefer the skim() function from the skimr package over glimpse() or summary() because it gives more detailed information about the data, including the number of missing values, the complete rate, the number of unique values, and the column type. For numeric columns, skim() also returns the five-number summary with a cute little histogram.
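The summary below comes from a call along the lines of:

```r
skim(diamonds)
```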

Data summary

|                        |          |
|:-----------------------|:---------|
| Name                   | diamonds |
| Number of rows         | 53940    |
| Number of columns      | 11       |
| Column type frequency: |          |
| factor                 | 3        |
| numeric                | 8        |
| Group variables        | None     |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|:--------------|----------:|--------------:|:--------|---------:|:-----------|
| cut     | 0 | 1 | FALSE | 5 | Ide: 21551, Pre: 13791, Ver: 12082, Goo: 4906 |
| color   | 0 | 1 | FALSE | 7 | G: 11292, E: 9797, F: 9542, H: 8304 |
| clarity | 0 | 1 | FALSE | 8 | SI1: 13065, VS2: 12258, SI2: 9194, VS1: 8171 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|:--------------|----------:|--------------:|-----:|---:|---:|----:|----:|----:|-----:|:-----|
| X1    | 0 | 1 | 26970.50 | 15571.28 | 1.0 | 13485.75 | 26970.50 | 40455.25 | 53940.00 | ▇▇▇▇▇ |
| carat | 0 | 1 | 0.80 | 0.47 | 0.2 | 0.40 | 0.70 | 1.04 | 5.01 | ▇▂▁▁▁ |
| depth | 0 | 1 | 61.75 | 1.43 | 43.0 | 61.00 | 61.80 | 62.50 | 79.00 | ▁▁▇▁▁ |
| table | 0 | 1 | 57.46 | 2.23 | 43.0 | 56.00 | 57.00 | 59.00 | 95.00 | ▁▇▁▁▁ |
| price | 0 | 1 | 3932.80 | 3989.44 | 326.0 | 950.00 | 2401.00 | 5324.25 | 18823.00 | ▇▂▁▁▁ |
| x | 0 | 1 | 5.73 | 1.12 | 0.0 | 4.71 | 5.70 | 6.54 | 10.74 | ▁▁▇▃▁ |
| y | 0 | 1 | 5.73 | 1.14 | 0.0 | 4.72 | 5.71 | 6.54 | 58.90 | ▇▁▁▁▁ |
| z | 0 | 1 | 3.54 | 0.71 | 0.0 | 2.91 | 3.53 | 4.04 | 31.80 | ▇▁▁▁▁ |

From the information above, we can see that some observations contain 0 in certain variables (x, y, and z). Let's see what those rows look like.

Now we will remove them, because they would add unnecessary noise to our analysis.
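A sketch of that filtering step, assuming the zeros occur only in the dimension columns x, y, and z:

```r
# Keep only rows whose physical dimensions are all non-zero
diamonds <- diamonds %>% 
  filter(x > 0, y > 0, z > 0)
```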

We will split the dataset into 80% training data and 20% test data. We will use the set.seed() function to make sure that R produces the same random numbers, so this report is reproducible.
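A minimal sketch of the split (the actual seed value is not stated in the report):

```r
set.seed(123)  # assumed seed, for reproducibility
idx   <- sample(nrow(diamonds), size = round(0.8 * nrow(diamonds)))
train <- diamonds[idx, ]
test  <- diamonds[-idx, ]
```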

Our EDA will only use the training dataset, to avoid peeking at the test data.

Now we want to look at the correlations and distributions in the data. I will use the ggpairs() function to explore them. To save time, I will only use 2,000 rows of the data; this is not a problem since it is only exploratory.
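A sketch of that exploration, assuming a simple random sample of the training rows:

```r
# Pairwise distributions and correlations on 2,000 sampled training rows
train %>% 
  sample_n(2000) %>% 
  dplyr::select(carat, depth, table, price, x, y, z) %>% 
  ggpairs()
```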

We will see the correlation of the numerical variables.
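The matrix below can be produced with a plain cor() call on the numeric columns:

```r
cor(dplyr::select(train, carat, depth, table, price, x, y, z))
```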

##            carat       depth      table       price           x           y
## carat 1.00000000  0.02588547  0.1821634  0.92179780  0.97758001  0.94856258
## depth 0.02588547  1.00000000 -0.2995327 -0.01068854 -0.02775043 -0.03143387
## table 0.18216342 -0.29953269  1.0000000  0.12707642  0.19752901  0.18461968
## price 0.92179780 -0.01068854  0.1270764  1.00000000  0.88760093  0.86335956
## x     0.97758001 -0.02775043  0.1975290  0.88760093  1.00000000  0.96947100
## y     0.94856258 -0.03143387  0.1846197  0.86335956  0.96947100  1.00000000
## z     0.97060583  0.09471147  0.1561591  0.87764439  0.98579920  0.96268263
##                z
## carat 0.97060583
## depth 0.09471147
## table 0.15615910
## price 0.87764439
## x     0.98579920
## y     0.96268263
## z     1.00000000

The raw correlation matrix above is not very pleasant to read, so we will create a correlation plot below.
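The report does not show which plotting function was used; one possibility is ggcorr() from GGally:

```r
# Correlation heat map of the numeric variables (assumed implementation)
ggcorr(dplyr::select(train, carat, depth, table, price, x, y, z),
       label = TRUE, label_round = 2)
```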

Now, we will look in depth at how the diamonds' numerical characteristics affect the price.

We also want to look in depth at how the diamonds' categorical characteristics affect the price.

From domain knowledge, we are certain that the price of a diamond is affected by its carat weight.

Now let's try to create a simple regression model that predicts price using only one feature: carat.

Modeling
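A sketch of the single-predictor model (the object name model_carat is assumed):

```r
# Simple linear regression: price explained by carat only
model_carat <- lm(price ~ carat, data = train)
summary(model_carat)
```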

R-squared is a statistical measure that represents the proportion of the variance of the dependent variable that is explained by the independent variable(s) in a regression model. It means that by using carat alone, we can explain about 85% of the variance in diamond prices.

Now let’s try to throw all variables into the model.
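A sketch of the full model (model_all is an assumed name; excluding the X1 row index is also an assumption):

```r
# Full model with every predictor except the X1 index column
model_all <- lm(price ~ . - X1, data = train)
summary(model_all)
```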

The full model has a higher R-squared (92%) and a lower RMSE. A good sign! But in statistics, a model with fewer variables that explains the data about as well as a model with more variables is preferable.

Feature Selection

Next, we will do feature selection by applying the step() function to our full model, as shown in the sketch below. We will use:

  - the backward option, which starts with all predictors (the full model as we input it) and iteratively removes the least contributive predictors;

  - the forward option, which starts with no predictors and iteratively adds the most contributive predictors;

  - the both option, which combines the backward and forward methods.
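A sketch of the three step() calls, assuming the full model from the previous section is called model_all:

```r
# Backward: start from the full model and drop the least useful predictors
model_backward <- step(model_all, direction = "backward", trace = 0)

# Forward: start from an intercept-only model and add predictors one by one
model_none    <- lm(price ~ 1, data = train)
model_forward <- step(model_none, scope = formula(model_all),
                      direction = "forward", trace = 0)

# Both: allow predictors to be added and removed at each iteration
model_both <- step(model_none, scope = formula(model_all),
                   direction = "both", trace = 0)
```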

## [1] "Step Backward:  price ~ carat + cut + color + clarity + depth + table + x + z"
## [1] "Step Forward:  price ~ carat + clarity + color + x + cut + depth + table + z"
## [1] "Step Both:  price ~ carat + clarity + color + x + cut + depth + table + z"

Here we can see that all three methods produce the same formula, so we will fit that formula and assign the result to model_bo.
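A minimal sketch of that assignment:

```r
# Fit the formula that all three step() directions agreed on
model_bo <- lm(price ~ carat + cut + color + clarity + depth + table + x + z,
               data = train)
```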

Now we will check whether multicollinearity exists in our model by using the Variance Inflation Factor (VIF). A general guideline is that a VIF greater than 10 may indicate high collinearity and is worth further inspection (Algoritma Lec. Notes).
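The GVIF table below can be reproduced with vif() from the car package:

```r
car::vif(model_bo)
```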

##              GVIF Df GVIF^(1/(2*Df))
## carat   25.133260  1        5.013308
## clarity  1.348197  7        1.021570
## color    1.180001  6        1.013889
## x       98.499490  1        9.924691
## cut      1.935357  4        1.086038
## depth    2.566888  1        1.602151
## table    1.788154  1        1.337219
## z       76.050877  1        8.720715

We see that the adjusted GVIF value for x is almost 10. I decided to simply remove x and recalculate the VIF.
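A sketch of that refit, using the model_bo_vif name that appears in the later output:

```r
# Drop x from the model and recompute the (G)VIF values
model_bo_vif <- update(model_bo, . ~ . - x)
car::vif(model_bo_vif)
```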

##              GVIF Df GVIF^(1/(2*Df))
## carat   19.411236  1        4.405818
## cut      1.920970  4        1.085026
## color    1.176628  6        1.013647
## clarity  1.333244  7        1.020756
## depth    1.457653  1        1.207333
## table    1.788128  1        1.337209
## z       19.455282  1        4.410814

Now all of our model's adjusted GVIF values are under five. I will use this model going forward.

Testing the Normality Assumption

One of the assumptions of linear regression is that the residuals are normally distributed. Now we will plot the studentized residuals of our model.
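A minimal sketch of that plot, using studres() from MASS (the exact plot type used in the report is not shown):

```r
# Histogram of the studentized residuals of model_bo_vif
hist(MASS::studres(model_bo_vif), breaks = 50,
     main = "Studentized residuals", xlab = "Studentized residual")
```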

Looking at the graph, it seems that our residuals are not normally distributed. To make sure, we should test this statistically. Because our sample size exceeds 5,000, the maximum allowed by shapiro.test(), we use the Kolmogorov-Smirnov test instead.

## Warning in ks.test(studres(model_bo_vif), train$price): p-value will be
## approximate in the presence of ties
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  studres(model_bo_vif) and train$price
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

H0: The data follow a specified distribution
HA: The data do not follow the specified distribution

Because our p-value is below 0.05, H0 is rejected, so our residuals are not normally distributed.

Heteroskedasticity Test
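The test below is the studentized Breusch-Pagan test from the lmtest package; a minimal call that matches the output:

```r
library(lmtest)
bptest(model_bo_vif)
```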

## Warning: package 'lmtest' was built under R version 3.6.2
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
##  studentized Breusch-Pagan test
## 
## data:  model_bo_vif
## BP = 9575.7, df = 21, p-value < 2.2e-16

H0: Residual homoscedasticity
HA: Residual heteroscedasticity

Because our p-value is below 0.05, H0 is rejected, so the errors are heteroscedastic, which means this regression model is not entirely suitable for our data.

This is the visualization of our model's residuals. Looking at the Residuals vs Fitted plot, the points form a fan shape, which means that the larger the fitted values, the larger the residuals. We should try to normalize the variables. Also, looking at the Residuals vs Leverage plot, we can see some outlier points, so we will perform an outlier analysis.
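The panels described here are the standard lm() diagnostics, which can be reproduced with:

```r
# Four diagnostic panels, including Residuals vs Fitted and Residuals vs Leverage
par(mfrow = c(2, 2))
plot(model_bo_vif)
par(mfrow = c(1, 1))
```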

Normalization using bestNormalize()

Reference: https://cran.r-project.org/web/packages/bestNormalize/vignettes/bestNormalize.html

## Warning: package 'bestNormalize' was built under R version 3.6.2
## 
## Attaching package: 'bestNormalize'
## The following object is masked from 'package:MASS':
## 
##     boxcox

Before we normalize the variables, we will plot their distributions first.

I will normalize them using the orderNorm() function from the bestNormalize package.

For more info about orderNorm(), see this documentation: https://www.rdocumentation.org/packages/bestNormalize/versions/1.4.3/topics/orderNorm

Quoted from above webpage, “The Ordered Quantile (ORQ) normalization transformation, orderNorm(), is a rank-based procedure by which the values of a vector are mapped to their percentile, which is then mapped to the same percentile of the normal distribution. Without the presence of ties, this essentially guarantees that the transformation leads to a uniform distribution.”

We will create new variables with the suffix BN to indicate the normalized versions.
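A sketch of that transformation; the exact code is not shown, but the x.t element of an orderNorm() object holds the transformed values:

```r
# ORQ-normalize each numeric variable and store it in a *BN column
train$priceBN <- orderNorm(train$price)$x.t
train$caratBN <- orderNorm(train$carat)$x.t
train$depthBN <- orderNorm(train$depth)$x.t
train$tableBN <- orderNorm(train$table)$x.t
train$xBN     <- orderNorm(train$x)$x.t
train$yBN     <- orderNorm(train$y)$x.t
train$zBN     <- orderNorm(train$z)$x.t
```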

Now we will create a plot for the normalized variables.

Now we will recreate the full model, this time using the normalized variables.

## 
## Call:
## lm(formula = priceBN ~ caratBN + depthBN + tableBN + xBN + yBN + 
##     zBN + cut + clarity + color, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8571 -0.1333 -0.0022  0.1320  2.1312 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.690751   0.012261 -56.336  < 2e-16 ***
## caratBN       0.099842   0.018032   5.537  3.1e-08 ***
## depthBN       0.047107   0.002115  22.276  < 2e-16 ***
## tableBN       0.033082   0.001639  20.183  < 2e-16 ***
## xBN           1.126113   0.018371  61.300  < 2e-16 ***
## yBN          -0.333227   0.016961 -19.646  < 2e-16 ***
## zBN           0.177328   0.009888  17.933  < 2e-16 ***
## cutGood       0.078556   0.008218   9.559  < 2e-16 ***
## cutIdeal      0.145786   0.008069  18.068  < 2e-16 ***
## cutPremium    0.103346   0.007836  13.189  < 2e-16 ***
## cutVery Good  0.139240   0.007865  17.704  < 2e-16 ***
## clarityIF     1.231945   0.012529  98.326  < 2e-16 ***
## claritySI1    0.609340   0.010686  57.020  < 2e-16 ***
## claritySI2    0.439314   0.010746  40.881  < 2e-16 ***
## clarityVS1    0.843692   0.010918  77.276  < 2e-16 ***
## clarityVS2    0.770712   0.010743  71.738  < 2e-16 ***
## clarityVVS1   1.124040   0.011572  97.136  < 2e-16 ***
## clarityVVS2   1.041521   0.011250  92.583  < 2e-16 ***
## colorE       -0.039177   0.004378  -8.948  < 2e-16 ***
## colorF       -0.078180   0.004418 -17.694  < 2e-16 ***
## colorG       -0.171255   0.004334 -39.517  < 2e-16 ***
## colorH       -0.278430   0.004605 -60.465  < 2e-16 ***
## colorI       -0.425689   0.005148 -82.687  < 2e-16 ***
## colorJ       -0.581566   0.006335 -91.801  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.247 on 43122 degrees of freedom
## Multiple R-squared:  0.939,  Adjusted R-squared:  0.939 
## F-statistic: 2.886e+04 on 23 and 43122 DF,  p-value: < 2.2e-16

Now we perform feature selection again by applying the step() function.

## 
## Call:
## lm(formula = priceBN ~ xBN + clarity + color + depthBN + zBN + 
##     yBN + cut + tableBN + caratBN, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8571 -0.1333 -0.0022  0.1320  2.1312 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.690751   0.012261 -56.336  < 2e-16 ***
## xBN           1.126113   0.018371  61.300  < 2e-16 ***
## clarityIF     1.231945   0.012529  98.326  < 2e-16 ***
## claritySI1    0.609340   0.010686  57.020  < 2e-16 ***
## claritySI2    0.439314   0.010746  40.881  < 2e-16 ***
## clarityVS1    0.843692   0.010918  77.276  < 2e-16 ***
## clarityVS2    0.770712   0.010743  71.738  < 2e-16 ***
## clarityVVS1   1.124040   0.011572  97.136  < 2e-16 ***
## clarityVVS2   1.041521   0.011250  92.583  < 2e-16 ***
## colorE       -0.039177   0.004378  -8.948  < 2e-16 ***
## colorF       -0.078180   0.004418 -17.694  < 2e-16 ***
## colorG       -0.171255   0.004334 -39.517  < 2e-16 ***
## colorH       -0.278430   0.004605 -60.465  < 2e-16 ***
## colorI       -0.425689   0.005148 -82.687  < 2e-16 ***
## colorJ       -0.581566   0.006335 -91.801  < 2e-16 ***
## depthBN       0.047107   0.002115  22.276  < 2e-16 ***
## zBN           0.177328   0.009888  17.933  < 2e-16 ***
## yBN          -0.333227   0.016961 -19.646  < 2e-16 ***
## cutGood       0.078556   0.008218   9.559  < 2e-16 ***
## cutIdeal      0.145786   0.008069  18.068  < 2e-16 ***
## cutPremium    0.103346   0.007836  13.189  < 2e-16 ***
## cutVery Good  0.139240   0.007865  17.704  < 2e-16 ***
## tableBN       0.033082   0.001639  20.183  < 2e-16 ***
## caratBN       0.099842   0.018032   5.537  3.1e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.247 on 43122 degrees of freedom
## Multiple R-squared:  0.939,  Adjusted R-squared:  0.939 
## F-statistic: 2.886e+04 on 23 and 43122 DF,  p-value: < 2.2e-16
##               GVIF Df GVIF^(1/(2*Df))
## xBN     238.564764  1       15.445542
## clarity   1.337832  7        1.021007
## color     1.150554  6        1.011756
## depthBN   3.158787  1        1.777298
## zBN      69.103870  1        8.312874
## yBN     203.367975  1       14.260714
## cut       2.185745  4        1.102681
## tableBN   1.831019  1        1.353152
## caratBN 228.513764  1       15.116672

Testing the Normality Assumption

Now we will plot the studentized residuals of our model.

## Warning in ks.test(studres(model_BN_bo), dnorm(xfit)): p-value will be
## approximate in the presence of ties
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  studres(model_BN_bo) and dnorm(xfit)
## D = 0.50417, p-value = 3.004e-09
## alternative hypothesis: two-sided

H0: The data follow a specified distribution
HA: The data do not follow the specified distribution

Because our p-value is below 0.05, H0 is rejected, so our residuals are not normally distributed.

Heteroskedasticity Test

## 
##  studentized Breusch-Pagan test
## 
## data:  model_BN_bo
## BP = 2088.2, df = 23, p-value < 2.2e-16

H0: Residual homoscedasticity
HA: Residual heteroscedasticity

Because our p-value is below 0.05, H0 is rejected, so the errors are heteroscedastic, which means this regression model is not entirely suitable for our data.

Conclusion: after normalization, the graph shows a clear change in the normality of the residuals. Statistically, however, they are still not normally distributed.

This transformation is used only for exploration, not for prediction.

In the next step, we will try to handle the outliers. There are many ways to handle outliers; for this analysis, I will simply remove them.

Outlier Handling Method: Cook's Distance

I will analyze the outliers by calculating Cook's distance. It is commonly used to estimate the influence of each observation in a regression analysis and is very useful for finding influential outliers.

I will draw the chart using the ols_plot_cooksd_bar() function from the olsrr package and save it into an object, so that we can extract the outlier data from it; see the sketch below.
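A minimal sketch of that call (the object name cooksd_plot is mine):

```r
# Cook's distance bar chart, saved so the flagged observations can be reused
cooksd_plot <- ols_plot_cooksd_bar(model_bo_vif)
```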

Now we will extract the outlier data.

After extracting the outlier data, we recreate the model without the outliers, as in the sketch below.
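Because the internal structure of the saved olsrr object can differ between package versions, here is an equivalent base-R sketch that flags influential rows with cooks.distance() and the conventional 4/n threshold, then refits the model; train_clean and model_bo_clean follow the names in the output below, while outlier_rows is mine:

```r
# Flag observations whose Cook's distance exceeds the usual 4/n cut-off
cd           <- cooks.distance(model_bo_vif)
outlier_rows <- which(cd > 4 / nrow(train))

# Drop the influential observations and refit the same formula
train_clean    <- train[-outlier_rows, ]
model_bo_clean <- lm(formula(model_bo_vif), data = train_clean)
```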

Testing the Normality Assumption

## Warning in ks.test(model_bo_clean$residuals, y = train_clean$price): p-value
## will be approximate in the presence of ties
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  model_bo_clean$residuals and train_clean$price
## D = 0.79989, p-value < 2.2e-16
## alternative hypothesis: two-sided

H0: The data follow a specified distribution
HA: The data do not follow the specified distribution

Because our p-value is below 0.05, H0 is rejected, so our residuals are not normally distributed.

Heteroskedasticity Test

## 
##  studentized Breusch-Pagan test
## 
## data:  model_bo_clean
## BP = 7456.8, df = 21, p-value < 2.2e-16

H0: Residual homoscedasticity
HA: Residual heteroscedasticity

Because our p-value is below 0.05, H0 is rejected, so the errors are heteroscedastic, which means this regression model is not entirely suitable for our data.

Visually, there is not much improvement from removing the outliers. In the next section, we will generate predictions with all of the models we have built.

Prediction

Create objects that contain predicted values.

Now we will compute the root mean squared error (RMSE) between each model's predictions and the test data, as in the sketch below.
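A sketch of the prediction and scoring steps, assuming the model names used in the earlier sketches:

```r
# Predict diamond prices on the hold-out test set with each model
pred_carat <- predict(model_carat,    newdata = test)
pred_all   <- predict(model_all,      newdata = test)
pred_bo    <- predict(model_bo_vif,   newdata = test)  # feature-selected, VIF-cleaned model
pred_clean <- predict(model_bo_clean, newdata = test)

# Root mean squared error of each set of predictions
library(MLmetrics)
RMSE(pred_carat, test$price)
RMSE(pred_all,   test$price)
RMSE(pred_bo,    test$price)
RMSE(pred_clean, test$price)
```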

## Warning: package 'MLmetrics' was built under R version 3.6.2
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall
## [1] 1545.377
## [1] 1110.493
## [1] 1181.229
## [1] 1431.266

Conclusion

Our model summary:

  1. Model with a single predictor: carat
  2. Model with all predictors
  3. Model with feature selection (from step() using the both method)
  4. Model with the outliers removed

Looking at the RMSE values, we would choose either model 2 or model 3 as our final model.