## 'data.frame': 5000 obs. of 7 variables:
## $ carat : num 0.41 0.5 1.03 1.1 1.51 0.3 0.87 1.05 1 2.01 ...
## $ depth : num 62.3 62.8 65.2 62.1 63.3 62.1 61.4 63.3 64 63.8 ...
## $ table : num 61 57 56 57 61 55 57 57 57 57 ...
## $ width : num 4.72 5.05 6.42 6.6 7.24 4.3 6.17 6.45 6.29 7.95 ...
## $ length: num 4.75 5.08 6.35 6.64 7.17 4.33 6.14 6.4 6.33 7.91 ...
## $ height: num 2.95 3.18 4.16 4.11 4.56 2.68 3.78 4.07 4.04 5.06 ...
## $ price : int 638 1402 3530 5037 13757 457 2321 5657 4372 13976 ...
## carat cut color clarity
## Min. :0.2000 Length:53940 Length:53940 Length:53940
## 1st Qu.:0.4000 Class :character Class :character Class :character
## Median :0.7000 Mode :character Mode :character Mode :character
## Mean :0.7979
## 3rd Qu.:1.0400
## Max. :5.0100
## depth table width length
## Min. :43.00 Min. :43.00 Min. : 0.000 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 4.710 1st Qu.: 4.720
## Median :61.80 Median :57.00 Median : 5.700 Median : 5.710
## Mean :61.75 Mean :57.46 Mean : 5.731 Mean : 5.735
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :10.740 Max. :58.900
## height price
## Min. : 0.000 Min. : 326
## 1st Qu.: 2.910 1st Qu.: 950
## Median : 3.530 Median : 2401
## Mean : 3.539 Mean : 3933
## 3rd Qu.: 4.040 3rd Qu.: 5324
## Max. :31.800 Max. :18823
I found my data set on kaggle.com. It has about 54,000 observations and 10 variables (3 categorical and 7 quantitative). I have chosen the quantitative variable price as my response variable for the supervised learning methods. For the analyses below I work with a random sample of 5,000 observations.
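A minimal sketch of how the data might have been loaded and sampled, assuming the file is named `diamonds.csv` and already uses the column names shown above; the seed and object names are illustrative:

```r
library(dplyr)

# Load the Kaggle diamonds data (file name assumed)
diamonds <- read.csv("diamonds.csv", stringsAsFactors = FALSE)

# Work with a random sample of 5000 rows, matching the str() output above
set.seed(1)
diamonds.sample <- sample_n(diamonds, 5000)

# Quantitative variables only, for str() and the later PCA/clustering
quant <- select(diamonds.sample,
                carat, depth, table, width, length, height, price)

str(quant)
summary(diamonds)
```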
##
## Fair Good Ideal Premium Very Good
## 148 450 1916 1291 1195
##
## D E F G H I J
## 611 904 859 1056 825 483 262
##
## I1 IF SI1 SI2 VS1 VS2 VVS1 VVS2
## 70 152 1193 860 764 1129 332 500
There seems to be a linear relationship between carat and price: as carat increases, the price of the diamond tends to increase. There also seems to be an outlier, which I have not removed purely for time reasons, but I would in future work.
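A sketch of the plot being described, assuming the `diamonds.sample` data frame from the loading step above:

```r
library(ggplot2)

# Price vs. carat with a fitted line; geom_smooth() prints the
# "using formula 'y ~ x'" message seen in the rendered output
ggplot(diamonds.sample, aes(x = carat, y = price)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm")
```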
There also seems to be a linear relationship between height and price: as height increases, the price tends to increase. Again there appears to be an outlier that I have left in for now.
The distributions of both price and carat are strongly right-skewed. This tells us that inexpensive, low-carat diamonds are far more common than expensive, high-carat ones, which is consistent with the positive association between price and carat seen in the scatterplot.
The histogram of length shows that most diamonds have a length between 4 and 7.
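The histograms described here could be produced along these lines (the bin count is an arbitrary choice):

```r
# Right-skewed distributions of price and carat, and the length histogram
ggplot(diamonds.sample, aes(x = price))  + geom_histogram(bins = 30)
ggplot(diamonds.sample, aes(x = carat))  + geom_histogram(bins = 30)
ggplot(diamonds.sample, aes(x = length)) + geom_histogram(bins = 30)
```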
It looks like the better the cut of the diamond, the higher the price. This suggests a positive relationship between cut and price, in addition to the one between carat and price.
This scatterplot shows that better clarity is associated with a higher price.
Diamond color is graded alphabetically, and the plot shows that letters later in the alphabet are associated with lower prices; in other words, there is a negative association between price and color grade.
The variables width and height have the largest positive correlation, 0.981. The variables table and depth have the largest negative correlation, -0.300. Length, width, and height are all highly correlated with one another, which indicates substantial overlap: once one of the three is in the model, adding the others has little additional effect.
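The correlation figure was drawn with GGally; the exact function is an assumption, but a `ggcorr()` call on the quantitative variables would produce a comparable plot:

```r
library(GGally)

# Pairwise correlations among the quantitative variables
ggcorr(quant, label = TRUE, label_round = 3)
```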
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.1763 1.1360 0.82682 0.43217 0.25874 0.16359 0.09536
## Proportion of Variance 0.6766 0.1844 0.09766 0.02668 0.00956 0.00382 0.00130
## Cumulative Proportion 0.6766 0.8610 0.95863 0.98531 0.99488 0.99870 1.00000
## PC1 PC2 PC3 PC4 PC5
## carat 0.452571010 -0.0347417656 0.01056185 -0.13785434 0.34195521
## depth 0.003671826 -0.7328475217 -0.66918152 -0.04226632 -0.06954696
## table 0.105140273 0.6726498703 -0.72861229 -0.06296629 -0.03744078
## width 0.453750638 0.0059819808 0.03844655 0.20547037 0.38055910
## length 0.440437752 0.0001785218 0.05830084 0.46767480 -0.74709650
## height 0.450290429 -0.0896560039 -0.04422833 0.25224097 0.29327596
## price 0.425998822 -0.0345782681 0.11989748 -0.80664843 -0.29637619
## PC6 PC7
## carat -0.78567571 -0.2015872753
## depth -0.02038929 0.0898786980
## table 0.01565260 0.0004868328
## width 0.19812443 0.7525289671
## length -0.15873557 0.0233599116
## height 0.50486626 -0.6186420208
## price 0.25042512 0.0414809759
Because price is on a much larger scale than the other variables, we standardize each variable by its standard deviation before running PCA.
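A sketch of the PCA call, using `prcomp()` with `scale. = TRUE` so that each variable is divided by its standard deviation:

```r
# PCA on the standardized quantitative variables
pca <- prcomp(quant, center = TRUE, scale. = TRUE)

summary(pca)   # importance of components, as shown above
pca$rotation   # the loadings, as shown above
```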
The first PC accounts for about 67.66% of the variation in the data set, and the first two PCs together account for about 86.10%.
All variables have positive loadings on PC1. On PC2, table, width, and length have positive loadings, while carat, depth, height, and price have negative loadings. Carat, width, length, height, and price have the largest loadings (in absolute value) on PC1; depth and table have the largest loadings (in absolute value) on PC2.
All the variables have the same (positive) sign on PC1 (the x-axis), which is why every arrow in the biplot points to the right. On PC2 (the y-axis), carat, depth, height, and price are negative and the rest are positive, so those four arrows point below the (imaginary) horizontal line at PC2 = 0 while the rest point above it. The directions and relative lengths of the arrows match the loading values above.
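The biplot being interpreted can be drawn directly from the `prcomp` object:

```r
# Biplot of the first two PCs; arrow directions follow the loading signs
biplot(pca, scale = 0)
```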
## carat depth table width length height price
## 4880 0.30 51.0 67 4.67 4.62 2.37 945
## 149 5.01 65.5 59 10.74 10.54 6.98 18018
Since observation 149 lies well to the right of the vertical line at PC1 = 0 and observation 4880 lies to the left of it, the linear combination of the variables (driven mostly by carat, width, length, height, and price) differs sharply between these two observations.
I would choose 4 PCs: the decrease in explained variance after PC4 is small relative to the decreases over the first four components, and 4 PCs already explain about 98.53% of the variation.
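A sketch of how the first four PC scores might be collected for clustering; the data-frame name is illustrative but matches the `str()` output below:

```r
# Keep the first four PC scores as features for clustering
pc.data <- as.data.frame(pca$x[, 1:4])
str(pc.data)
```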
## 'data.frame': 5000 obs. of 4 variables:
## $ PC1: num -1.68 -1.3 1.05 1.46 3.72 ...
## $ PC2: num 0.921 -0.5841 -2.2804 -0.4144 0.0268 ...
## $ PC3: num -1.5675 -0.4501 -1.1116 0.0687 -1.5069 ...
## $ PC4: num -0.122 0.069 0.547 0.413 -1.137 ...
## clusters2
## 1 2
## 2983 2017
## clusters3
## 1 2 3
## 2019 657 2324
## clusters4
## 1 2 3 4
## 633 1560 627 2180
## clusters5
## 1 2 3 4 5
## 1789 611 1339 553 708
## clusters6
## 1 2 3 4 5 6
## 1241 1075 551 563 517 1053
## clusters7
## 1 2 3 4 5 6 7
## 492 473 562 846 473 1230 924
## clusters8
## 1 2 3 4 5 6 7 8
## 904 720 931 824 319 442 398 462
## clusters9
## 1 2 3 4 5 6 7 8 9
## 378 224 759 934 319 367 886 425 708
## clusters10
##    1    2    3    4    5    6    7    8    9   10
##  720  442  331  863  907  666  221  298  189  363
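A sketch of the K-means runs that produced the cluster-size tables above (the seed and `nstart` value are assumptions):

```r
# K-means on the PC scores for K = 2 through 10
set.seed(1)
for (k in 2:10) {
  km <- kmeans(pc.data, centers = k, nstart = 20)
  print(table(km$cluster))
}
```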
We choose K = 2 because it has the average silhouette width closest to 1.
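A sketch of the silhouette comparison that motivates K = 2, using the `cluster` package:

```r
library(cluster)

# Average silhouette width for each candidate K; closer to 1 is better
d <- dist(pc.data)
avg.sil <- sapply(2:10, function(k) {
  km <- kmeans(pc.data, centers = k, nstart = 20)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
plot(2:10, avg.sil, type = "b",
     xlab = "K", ylab = "Average silhouette width")
```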
Using a cut height of 14, we get 4 clusters. The maximum distance
between any two observations within the same cluster is 12.23235, which
is less than the height at which we made our cut.
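A sketch of the hierarchical clustering and the cut; the linkage method is an assumption (the object name `clus.comp` in the output below suggests complete linkage):

```r
# Complete-linkage hierarchical clustering on the PC scores
hc <- hclust(dist(pc.data), method = "complete")

# Cut the tree at height 14, giving 4 clusters
clus.comp <- cutree(hc, h = 14)
table(clus.comp)

# Maximum pairwise distance within a given cluster, e.g. cluster 1
dm <- as.matrix(dist(pc.data))
max(dm[clus.comp == 1, clus.comp == 1])
```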
## clus.comp
## 1 2 3 4
## 3701 500 798 1
## [1] 10.79723
## [1] 12.22427
## [1] 12.23235
Ideal is the only cut category present in cluster 4, which contains a single observation.
Here the colors represent the 4 clusters: the majority of observations belong to the green cluster, and only one observation belongs to the lime-green cluster.
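A sketch of the colored dendrogram using `dendextend`:

```r
library(dendextend)

# Color the branches by the four clusters from the height-14 cut
dend <- color_branches(as.dendrogram(hc), h = 14)
plot(dend)
```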
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20171.9 -575.8 -178.3 395.2 8058.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3153.673 1365.747 2.309 0.020978 *
## carat 10775.924 140.775 76.547 < 2e-16 ***
## cutGood 490.217 107.065 4.579 4.79e-06 ***
## cutIdeal 891.308 106.706 8.353 < 2e-16 ***
## cutPremium 818.524 102.998 7.947 2.35e-15 ***
## `cutVery Good` 778.135 102.492 7.592 3.74e-14 ***
## colorE -171.209 57.416 -2.982 0.002879 **
## colorF -228.132 58.304 -3.913 9.24e-05 ***
## colorG -383.597 56.942 -6.737 1.80e-11 ***
## colorH -934.412 59.554 -15.690 < 2e-16 ***
## colorI -1312.167 68.619 -19.122 < 2e-16 ***
## colorJ -2220.470 83.048 -26.737 < 2e-16 ***
## clarityIF 5283.519 162.840 32.446 < 2e-16 ***
## claritySI1 3767.375 137.263 27.446 < 2e-16 ***
## claritySI2 2847.646 137.685 20.682 < 2e-16 ***
## clarityVS1 4700.363 140.020 33.569 < 2e-16 ***
## clarityVS2 4366.603 137.921 31.660 < 2e-16 ***
## clarityVVS1 5077.313 148.641 34.158 < 2e-16 ***
## clarityVVS2 5030.638 143.586 35.036 < 2e-16 ***
## depth -84.946 16.301 -5.211 1.95e-07 ***
## table -35.487 9.372 -3.786 0.000155 ***
## width -956.072 111.665 -8.562 < 2e-16 ***
## length -20.813 42.648 -0.488 0.625558
## height 202.667 161.272 1.257 0.208929
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1093 on 4976 degrees of freedom
## Multiple R-squared: 0.9246, Adjusted R-squared: 0.9242
## F-statistic: 2652 on 23 and 4976 DF, p-value: < 2.2e-16
## Linear Regression
##
## 5000 samples
## 9 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 5000, 5000, 5000, 5000, 5000, 5000, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1302.482 0.8914509 748.5153
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## Linear Regression
##
## 5000 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 4500, 4500, 4500, 4500, 4500, 4500, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1112.698 0.9216218 729.3563
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## 1
## 6053.88
We expect the price to increase by about 10775.924 units for every one-unit increase in carat, on average, when all other variables are held constant.
Relative to the baseline cut Fair, price is 490.217 units higher on average for cut == Good, 891.308 units higher for cut == Ideal, 818.524 units higher for cut == Premium, and 778.135 units higher for cut == Very Good, holding all other variables constant.
Relative to the baseline color D, price is 171.209 units lower on average for color == E, 228.132 units lower for color == F, 383.597 units lower for color == G, 934.412 units lower for color == H, 1312.167 units lower for color == I, and 2220.470 units lower for color == J, holding all other variables constant.
Relative to the baseline clarity I1, price is 5283.519 units higher on average for clarity == IF, 3767.375 units higher for clarity == SI1, 2847.646 units higher for clarity == SI2, 4700.363 units higher for clarity == VS1, 4366.603 units higher for clarity == VS2, 5077.313 units higher for clarity == VVS1, and 5030.638 units higher for clarity == VVS2, holding all other variables constant.
We expect the price to decrease by about 84.946 units for every one-unit increase in depth, on average, holding all other variables constant. Similarly, we expect the price to decrease by about 35.487 units for every one-unit increase in table, and by about 956.072 units for every one-unit increase in width. The coefficients for length (a decrease of about 20.813 units, p = 0.626) and height (an increase of about 202.667 units, p = 0.209) are not statistically significant.
The dummy variables are arranged alphabetically, so the baseline levels are Fair for cut, D for color, and I1 for clarity. Our 10-fold cross-validated RMSE is 1112.698.
When predicting the price of a diamond with carat = 1, cut = "Ideal", color = "I", clarity = "VVS1", depth = 62, table = 63, width = 6, length = 5, and height = 4, we get $6053.88.
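A sketch of the caret workflow behind the cross-validated results and the prediction above (the seed and control settings are assumptions; the bootstrap results further up come from caret's default resampling):

```r
library(caret)

# Linear regression with 10-fold cross-validation
set.seed(1)
lm.fit <- train(price ~ ., data = diamonds.sample,
                method = "lm",
                trControl = trainControl(method = "cv", number = 10))

summary(lm.fit$finalModel)   # coefficient table shown above
lm.fit                       # cross-validated RMSE shown above

# Predict the diamond described above
new.diamond <- data.frame(carat = 1, cut = "Ideal", color = "I",
                          clarity = "VVS1", depth = 62, table = 63,
                          width = 6, length = 5, height = 4)
predict(lm.fit, newdata = new.diamond)
```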
## k-Nearest Neighbors
##
## 5000 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 4500, 4500, 4500, 4500, 4500, 4500, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 1628.429 0.8333856 902.0552
## 3 1365.549 0.8832345 757.9656
## 5 1331.222 0.8923463 734.3735
## 7 1323.197 0.8952139 728.3290
## 9 1323.155 0.8973306 718.2109
## 11 1318.958 0.8986176 718.0017
## 13 1317.807 0.8993923 717.0025
## 15 1321.600 0.8995483 721.3069
## 17 1322.795 0.9000728 721.4926
## 19 1324.665 0.9004483 722.7665
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 13.
## [1] 3198.308
Our final kNN model used K = 13, with an RMSE of 1317.807. Using it to predict the price of the same diamond as before (carat = 1, cut = "Ideal", color = "I", clarity = "VVS1", depth = 62, table = 63, width = 6, length = 5, height = 4), we get $3198.308. Centering and scaling the predictors did not improve the model, so we report the fit without preprocessing; even so, kNN does a worse job than the linear regression model.
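A sketch of the kNN fit; the tuning grid of odd K values from 1 to 19 matches the results table above, while the seed is an assumption:

```r
# kNN regression with 10-fold CV, tuning K over 1, 3, ..., 19
set.seed(1)
knn.fit <- train(price ~ ., data = diamonds.sample,
                 method = "knn",
                 tuneGrid = data.frame(k = seq(1, 19, by = 2)),
                 trControl = trainControl(method = "cv", number = 10))

knn.fit
predict(knn.fit, newdata = new.diamond)
```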
## Random Forest
##
## 5000 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 4500, 4500, 4500, 4500, 4500, 4500, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1160.6855 0.9306338 692.8170
## 3 955.6816 0.9434891 488.2799
## 4 913.2948 0.9473364 448.5679
## 5 897.9391 0.9488828 435.4641
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
## 1
## 4670.545
Since we have 9 potential predictors, we tune mtry over values from 2 to 5. The smallest RMSE, 897.9391 at mtry = 5, is the lowest among all the models we have fit, so the random forest is the best model so far.
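A sketch of the random-forest fit; the tuning grid matches the table above, the seed is an assumption, and `method = "rf"` requires the randomForest package:

```r
# Random forest with 10-fold CV, tuning mtry over 2-5
set.seed(1)
rf.fit <- train(price ~ ., data = diamonds.sample,
                method = "rf",
                tuneGrid = data.frame(mtry = 2:5),
                trControl = trainControl(method = "cv", number = 10),
                importance = TRUE)

rf.fit
varImp(rf.fit)               # variable importance, shown below
predict(rf.fit, newdata = new.diamond)
```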
## rf variable importance
##
## only 20 most important variables shown (out of 23)
##
## Overall
## length 100.0000
## carat 87.0947
## width 81.0782
## height 62.3751
## claritySI2 4.1006
## depth 3.6472
## table 2.4803
## colorJ 1.9343
## clarityVVS2 1.8820
## colorI 1.4921
## claritySI1 1.4096
## clarityVVS1 1.3200
## clarityVS1 0.8448
## colorG 0.7710
## clarityVS2 0.7555
## clarityIF 0.7308
## colorH 0.7210
## colorF 0.6689
## colorE 0.6401
## cutIdeal 0.5365
We do a much better job of predicting the price of a diamond with the random forest model than with the linear model or the k-nearest-neighbors model, because its RMSE is the lowest. Since outliers were not dealt with in the preprocessing phase and the distributions are heavily right-skewed (as the earlier histograms showed), the median error may be a better summary of prediction accuracy than the RMSE in this case. Predicting the price of a diamond with carat = 1, cut = "Ideal", color = "I", clarity = "VVS1", depth = 62, table = 63, width = 6, length = 5, and height = 4 gives $4670.545. The most important variable was length and the least important was cut == Very Good; for this regression problem, performance was measured by root mean squared error.