Submitted by: Joseph Ricafort

Assignment briefing

Explain the prices of the diamonds using information from the other variables. Which variables are most relevant for explaining the price?

#> # A tibble: 53,940 × 10
#>    carat cut       color clarity depth table price     x     y     z
#>    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#>  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
#>  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
#>  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#>  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#>  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
#>  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#>  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
#>  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
#>  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
#> 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
#> # … with 53,930 more rows

Answer to assignment starts here

A description of the diamonds dataset is available on bookdown.org.

Exploring the dataset

The diamonds dataset becomes available once we load the tidyverse library (it ships with ggplot2)

library(tidyverse)

as_tibble(diamonds)

We can see that cut, color and clarity are ordinal variables (ordered factors), while the rest are numeric: price is stored as an integer and the others as doubles
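The column types can also be checked directly rather than read off the tibble header. This is a minimal sketch; `col_classes` is a name I made up for illustration.

```r
library(tidyverse)

# First class of each column; ordered factors report "ordered"
col_classes <- map_chr(diamonds, ~ class(.x)[1])
print(col_classes)

# Level order of the three ordinal variables, lowest level first
levels(diamonds$cut)
levels(diamonds$color)
levels(diamonds$clarity)
```

Printing the levels is useful later on, because the boxplots draw the categories in exactly this order.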

summary(diamonds)
     carat               cut        color        clarity          depth           table           price      
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00   Min.   :43.00   Min.   :  326  
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950  
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80   Median :57.00   Median : 2401  
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75   Mean   :57.46   Mean   : 3933  
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324  
 Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00   Max.   :95.00   Max.   :18823  
                                    J: 2808   (Other): 2531                                                  
       x                y                z         
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
 Median : 5.700   Median : 5.710   Median : 3.530  
 Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
 3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
 Max.   :10.740   Max.   :58.900   Max.   :31.800  
                                                   

The summary above shows the range and quartiles of every numeric variable and the number of observations per level of every ordinal variable
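One thing the summary hints at: x, y and z all have a minimum of 0, which cannot be a real measurement. Below is a minimal sketch for finding such rows. Treating a zero as a missing value is my assumption, and `zero_dims` is a hypothetical name.

```r
library(tidyverse)

# Rows where at least one physical dimension is recorded as 0
zero_dims <- diamonds %>%
  filter(x == 0 | y == 0 | z == 0)

# How many rows are affected, and what they look like
nrow(zero_dims)
zero_dims
```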

Defining exploration questions

What are the most interesting questions to ask of the dataset?

  1. How is the price of a diamond determined by its characteristics?
  2. Can we rank the diamonds’ characteristics by how strongly each one determines the price?

The dataset may not be able to fully answer the questions above

Restructure the dataset

I’ll group the characteristics by how closely they relate to each other:

  • Numerical - carat (weight), x, y, z (Physical attributes)
  • Ordinal - color, clarity, cut (Visual attributes)
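The grouping above could be expressed in code roughly as follows. This is only a sketch: `physical_vars`, `visual_vars`, `physical` and `visual` are names I introduce here, and price is kept in both groups so each can be compared against it.

```r
library(tidyverse)

# Hypothetical helper vectors naming the two groups of characteristics
physical_vars <- c("carat", "x", "y", "z")
visual_vars   <- c("color", "clarity", "cut")

# One table per group, each keeping price for comparison
physical <- diamonds %>% select(price, all_of(physical_vars))
visual   <- diamonds %>% select(price, all_of(visual_vars))

glimpse(physical)
glimpse(visual)
```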

Visualize

Carat (weight) is the variable most relevant to the price of a diamond

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.05) +
  geom_smooth()
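To put a number next to the plot, one option is the correlation of each numeric variable with price. This is an illustrative check rather than part of the original analysis, and Pearson correlation only captures linear association, so it should be read as a rough ranking.

```r
library(tidyverse)

# Keep only the numeric columns (price is an integer, so it is included)
num_vars <- diamonds %>% select(where(is.numeric))

# Correlation of every numeric variable with price, strongest first
price_cor <- cor(num_vars)[, "price"]
sort(price_cor, decreasing = TRUE)
```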

Distribution of the diamond observations across the price range

Most of the observations lie in the lower price range of the dataset
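A histogram is one way to show this distribution. The sketch below assumes a bin width of 500 (in the dataset's price units), which is a choice rather than a requirement, and `price_hist` is just an illustrative name.

```r
library(tidyverse)

# Histogram of price; binwidth = 500 is an assumed, adjustable choice
price_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500, alpha = 0.75)
price_hist

# Share of diamonds priced below the mean price
mean(diamonds$price < mean(diamonds$price))
```

The median price being well below the mean in the summary points the same way: the distribution is skewed towards cheaper diamonds.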

Exploring other variables

A higher-quality cut does not guarantee a higher price

Note also that the observations are not evenly distributed across the cut qualities

ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot(alpha = 0.5) +
  geom_jitter(alpha = 0.01) +
  scale_y_log10()
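The uneven distribution across cuts can be tabulated alongside the plot. In this sketch the median carat column is my own addition, included on the assumption that weight differences between the groups may help explain why a better cut is not always pricier; `cut_summary` is a hypothetical name.

```r
library(tidyverse)

# Count, median weight and median price for each cut quality
cut_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    n            = n(),
    median_carat = median(carat),
    median_price = median(price)
  )
cut_summary
```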

Given that colors D to J are ordered from best to worst, we can say that the best-colored diamonds are, again, not the most expensive

ggplot(diamonds, aes(x = color, y = price)) +
  geom_jitter(alpha = 0.01) +
  geom_boxplot(alpha = 0.5) +
  scale_y_log10()

Ordered from I1 (worst) to IF (best), SI2 (second worst) has the highest median price

ggplot(diamonds, aes(x = clarity, y = price)) +
  geom_jitter(alpha = 0.01) +
  geom_boxplot(alpha = 0.5) +
  scale_y_log10()
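Both the order of the clarity levels and the price comparison can be checked in code. This is a minimal sketch; `clarity_summary` is an illustrative name, and the mean is shown next to the median because the boxplot itself only draws medians.

```r
library(tidyverse)

# Factor levels run from the lowest clarity grade to the highest
levels(diamonds$clarity)

# Count, mean and median price per clarity grade, highest median first
clarity_summary <- diamonds %>%
  group_by(clarity) %>%
  summarise(
    n            = n(),
    mean_price   = mean(price),
    median_price = median(price)
  ) %>%
  arrange(desc(median_price))
clarity_summary
```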
