Submitted by: Joseph Ricafort
Assignment briefing
Explain the prices of the diamonds using information from the other
variables. Which variables are the most relevant for explaining the
price?
#> # A tibble: 53,940 × 10
#>    carat cut       color clarity depth table price     x     y     z
#>    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#>  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
#>  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
#>  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#>  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#>  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
#>  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#>  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
#>  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
#>  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
#> 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
#> # … with 53,930 more rows
- diamonds is a well-known dataset that has been analyzed on many
websites. You can borrow ideas from there.
- If you want, you can use either quarto or Rmarkdown to generate your
document.
Answer to assignment starts here
A description of the diamonds dataset is available on bookdown.org:
https://bookdown.org/yih_huynh/Guide-to-R-Book/diamonds.html
Exploring the dataset
The diamonds dataset becomes available once we load the tidyverse
library (it ships with ggplot2)
library(tidyverse)
as_tibble(diamonds)
The output shows that cut, color and clarity are ordinal variables
(ordered factors), while the remaining variables are numeric: price is
stored as an integer and the others as doubles
summary(diamonds)
carat           cut              color    clarity        depth          table          price
Min.   :0.2000  Fair     : 1610  D: 6775  SI1    :13065  Min.   :43.00  Min.   :43.00  Min.   :  326
1st Qu.:0.4000  Good     : 4906  E: 9797  VS2    :12258  1st Qu.:61.00  1st Qu.:56.00  1st Qu.:  950
Median :0.7000  Very Good:12082  F: 9542  SI2    : 9194  Median :61.80  Median :57.00  Median : 2401
Mean   :0.7979  Premium  :13791  G:11292  VS1    : 8171  Mean   :61.75  Mean   :57.46  Mean   : 3933
3rd Qu.:1.0400  Ideal    :21551  H: 8304  VVS2   : 5066  3rd Qu.:62.50  3rd Qu.:59.00  3rd Qu.: 5324
Max.   :5.0100                   I: 5422  VVS1   : 3655  Max.   :79.00  Max.   :95.00  Max.   :18823
                                 J: 2808  (Other): 2531

x               y               z
Min.   : 0.000  Min.   : 0.000  Min.   : 0.000
1st Qu.: 4.710  1st Qu.: 4.720  1st Qu.: 2.910
Median : 5.700  Median : 5.710  Median : 3.530
Mean   : 5.731  Mean   : 5.735  Mean   : 3.539
3rd Qu.: 6.540  3rd Qu.: 6.540  3rd Qu.: 4.040
Max.   :10.740  Max.   :58.900  Max.   :31.800
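The summary shows the distribution and the range of values for every
variable. Note that x, y and z each have a minimum of 0, and that the
maxima of y (58.9) and z (31.8) lie far above their third quartiles,
which suggests a few erroneous records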
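The summary output lists a minimum of 0 for x, y and z and unusually large maxima for y and z. A minimal sketch for inspecting those rows follows; the cut-off of 20 used for "unusually large" is an arbitrary assumption of this sketch, not a value from the assignment.

```r
library(tidyverse)

# Rows where at least one measured dimension is recorded as zero
zero_dims <- diamonds %>%
  filter(x == 0 | y == 0 | z == 0)

nrow(zero_dims)

# Rows with a very large y or z (threshold of 20 is an assumption)
large_dims <- diamonds %>%
  filter(y > 20 | z > 20) %>%
  select(carat, x, y, z, price)

large_dims
```

Whether to drop or keep these rows is a judgment call; the plots in this document use the full dataset.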
Defining exploration questions
The most interesting questions to ask of the dataset are:
1. How is the price of the diamond defined by its characteristics?
  - Are heavier diamonds (more carats) more expensive?
  - Does the quality of the diamond define its price?
  - Are some colors more expensive than others?
  - Are bigger diamonds more expensive?
2. Can we rank the diamonds’ characteristics by how strongly each of
them defines the price of the diamond?
The dataset may not be able to answer all of the questions above
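One possible way to approach the last question is a linear model of price on the main characteristics. The sketch below is an assumption of this edit rather than part of the original analysis: the choice of a log-log model and of the four predictors is illustrative only.

```r
library(tidyverse)

# Log-log linear model (an illustrative assumption, not the author's method)
fit <- lm(log(price) ~ log(carat) + cut + color + clarity, data = diamonds)

# Share of the variance in log(price) explained by the model
summary(fit)$r.squared

# Sequential (type I) ANOVA table; the split between terms depends on
# the order in which they enter the formula
anova(fit)
```

Because the ANOVA is sequential, putting carat first attributes to it everything it shares with the other variables, so the table is a rough guide rather than a definitive ranking.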
Restructure the dataset
I’ll group the characteristics by how closely they relate to each
other:
- Numerical - carat (weight), x, y, z (physical attributes)
- Ordinal - color, clarity, cut (visual attributes)
Visualize
The number of carats is the variable most relevant to the price of
the diamond
# Scatter plot of price against carat with a smoothed trend line
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.05) +
  geom_smooth()
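The claim about carat can be checked numerically for the numeric columns. This is a minimal sketch that ranks them by their Pearson correlation with price; the object name `price_cor` is introduced here for illustration, and correlation only captures linear association.

```r
library(tidyverse)

# Pearson correlation of every numeric column with price, strongest first
price_cor <- diamonds %>%
  select(where(is.numeric)) %>%
  cor() %>%
  as_tibble(rownames = "variable") %>%
  select(variable, price) %>%
  filter(variable != "price") %>%
  arrange(desc(abs(price)))

price_cor
```

The ordinal variables (cut, color, clarity) are left out here because a Pearson correlation is not meaningful for them.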

Distribution of the diamond observations across the price range
Most of the observations lie at the lower end of the price range of
the dataset
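A histogram of price is one way to see this distribution. The sketch below uses a bin width of 500 and an alpha of 0.75, as in the notebook source; the object name `price_hist` is only for illustration.

```r
library(tidyverse)

# Histogram of price in bins of 500 price units
price_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500, alpha = 0.75)

price_hist
```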

Exploring other variables
A higher quality of cut does not guarantee a higher
price
Note also that the observations are not evenly distributed across the
cut grades
# Price by cut on a log scale: box plots with jittered observations on top
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot(alpha = 0.5) +
  geom_jitter(alpha = 0.01) +
  scale_y_log10()
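The uneven distribution across cut grades can be made explicit with a grouped summary. This is a minimal sketch; adding the median carat per grade is an assumption of this edit, included because differences in weight between grades are one plausible reason why a better cut does not show a higher price.

```r
library(tidyverse)

# Number of diamonds, median price and median weight per cut grade
cut_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    n = n(),
    median_price = median(price),
    median_carat = median(carat)
  )

cut_summary
```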

Colors are ordered from D (best) to J (worst); again, the
best-colored diamonds are not the
most expensive
# Price by color on a log scale: jittered observations with box plots on top
ggplot(diamonds, aes(x = color, y = price)) +
  geom_jitter(alpha = 0.01) +
  geom_boxplot(alpha = 0.5) +
  scale_y_log10()
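Since carat is so strongly tied to price, one way to look at color on its own is to compare diamonds of similar weight. The sketch below keeps stones close to one carat; the band of 0.95 to 1.05 carat is an arbitrary assumption of this sketch.

```r
library(tidyverse)

# Hold weight roughly constant: keep stones close to one carat
# (the 0.95 to 1.05 band is an arbitrary choice for this sketch)
one_carat <- diamonds %>%
  filter(carat >= 0.95, carat <= 1.05)

# Number of diamonds and median price per color within that band
color_in_band <- one_carat %>%
  group_by(color) %>%
  summarise(n = n(), median_price = median(price))

color_in_band
```

The same filter could be reused for cut and clarity to compare them at a fixed weight.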

Clarity is ordered from I1 (worst) to IF (best); SI2 (second worst) has
the highest mean price
# Price by clarity on a log scale: jittered observations with box plots on top
ggplot(diamonds, aes(x = clarity, y = price)) +
  geom_jitter(alpha = 0.01) +
  geom_boxplot(alpha = 0.5) +
  scale_y_log10()
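The ordering of the clarity grades and the mean price per grade can both be checked directly. This is a minimal sketch: the factor levels run from the lowest grade to the highest, and the box plot shows medians, so the summary reports both the mean and the median. The median carat column is an addition of this edit.

```r
library(tidyverse)

# Clarity levels, listed from the lowest grade to the highest
levels(diamonds$clarity)

# Mean and median price, plus median weight, per clarity grade
clarity_summary <- diamonds %>%
  group_by(clarity) %>%
  summarise(
    n = n(),
    mean_price = mean(price),
    median_price = median(price),
    median_carat = median(carat)
  )

clarity_summary
```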
