Title: Week_5_Data-Dive
Output: html document
library(ggplot2)
library(skimr)
head(diamonds)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
skim(diamonds)
| Name | diamonds |
| Number of rows | 53940 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| cut | 0 | 1 | TRUE | 5 | Ide: 21551, Pre: 13791, Ver: 12082, Goo: 4906 |
| color | 0 | 1 | TRUE | 7 | G: 11292, E: 9797, F: 9542, H: 8304 |
| clarity | 0 | 1 | TRUE | 8 | SI1: 13065, VS2: 12258, SI2: 9194, VS1: 8171 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| carat | 0 | 1 | 0.80 | 0.47 | 0.2 | 0.40 | 0.70 | 1.04 | 5.01 | ▇▂▁▁▁ |
| depth | 0 | 1 | 61.75 | 1.43 | 43.0 | 61.00 | 61.80 | 62.50 | 79.00 | ▁▁▇▁▁ |
| table | 0 | 1 | 57.46 | 2.23 | 43.0 | 56.00 | 57.00 | 59.00 | 95.00 | ▁▇▁▁▁ |
| price | 0 | 1 | 3932.80 | 3989.44 | 326.0 | 950.00 | 2401.00 | 5324.25 | 18823.00 | ▇▂▁▁▁ |
| x | 0 | 1 | 5.73 | 1.12 | 0.0 | 4.71 | 5.70 | 6.54 | 10.74 | ▁▁▇▃▁ |
| y | 0 | 1 | 5.73 | 1.14 | 0.0 | 4.72 | 5.71 | 6.54 | 58.90 | ▇▁▁▁▁ |
| z | 0 | 1 | 3.54 | 0.71 | 0.0 | 2.91 | 3.53 | 4.04 | 31.80 | ▇▁▁▁▁ |
cut:
type: ordered factor
This column contains values like “Fair,” “Good,” “Very Good,” “Premium,”
and “Ideal.” Without documentation, it’s unclear what these represent in
terms of diamond quality. Reading the documentation
reveals they indicate cut quality, with “Ideal” being the highest and
“Fair” the lowest.
Reason for encoding:
This encoding aligns with industry standards, making it familiar to
those in the diamond trade.
Consequences of not reading the documentation:
Misinterpreting these values could lead to incorrect assumptions about
diamond quality, for example treating “Premium” as the top grade when it
ranks below “Ideal.”
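The ordering described above can be checked directly in R. This is a minimal sketch; the object name better_than_good is only illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# Confirm that cut is stored as an ordered factor.
is.ordered(diamonds$cut)

# Levels are listed from lowest to highest quality.
levels(diamonds$cut)

# Ordered factors support comparisons, so rows can be filtered by rank.
better_than_good <- subset(diamonds, cut > "Good")
table(droplevels(better_than_good$cut))
```

Because the factor is ordered, the comparison cut > "Good" keeps only “Very Good,” “Premium,” and “Ideal” diamonds.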
depth:
This column has numerical values, and their meaning may not be obvious.
Documentation clarifies that this is the total depth percentage of the
diamond: its height (z) divided by the mean of its length (x) and width
(y), that is 2 * z / (x + y), expressed as a percentage.
Reason for encoding:
This encoding emphasizes the diamond’s proportions, crucial for light
reflection and brilliance.
Consequences of not reading the documentation:
Without this definition, depth could be mistaken for an absolute
measurement rather than a percentage, leading to wrong conclusions about
the proportions that affect sparkle.
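As a rough check of the documented definition, depth can be recomputed from the x, y, and z columns. This is a sketch under the assumption that rows with a zero dimension (the skim output shows minimums of 0 for x, y, and z) are data-entry problems and can be set aside. The names valid and recomputed are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# Drop rows where any dimension is zero, since the ratio is undefined or meaningless there.
valid <- subset(diamonds, x > 0 & y > 0 & z > 0)

# Documented definition: depth = z / mean(x, y) = 2 * z / (x + y), as a percentage.
recomputed <- 100 * 2 * valid$z / (valid$x + valid$y)

# Compare the recomputed values with the stored depth column.
summary(recomputed - valid$depth)
```

Differences close to zero for most rows would be consistent with the documented formula; large differences point to rows worth inspecting.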
table:
Similar to depth, this column’s numerical values lack immediate meaning.
Referring to the documentation reveals it represents the width of the
top facet relative to the diamond’s widest point, expressed as a
percentage. This ratio influences light return and overall appearance.
Consequences of not reading the documentation:
Without explanation, this value could be misinterpreted, impacting
assessments of a diamond’s beauty and value.
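Unlike depth, table cannot be recomputed from x, y, and z, because the width of the top facet is not recorded in the dataset. A small sketch that summarizes the column instead (the name table_range is illustrative):

```r
library(ggplot2)  # provides the diamonds dataset

# Smallest and largest table percentages in the data.
table_range <- range(diamonds$table)
table_range

# Share of diamonds whose table lies between the quartiles
# reported by skim() above (56 and 59).
mean(diamonds$table >= 56 & diamonds$table <= 59)
```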
ggplot(diamonds, aes(x = clarity, y = price)) +
  geom_boxplot()
While external resources or specialized diamond knowledge might clarify the clarity codes, users relying solely on the built-in documentation could face ambiguity. This could lead to limited understanding, misinterpretation of results, and difficulty comparing diamonds.
unique(diamonds$clarity)
## [1] SI2 SI1 VS1 VS2 VVS2 VVS1 I1 IF
## Levels: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF
Color each clarity grade differently, with more saturated colors marking the grades the documentation explains least (VS1, VS2, SI1, SI2, I1). This draws visual attention to the grades that need further exploration.
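# Names must match levels(diamonds$clarity); the VVS1 and VVS2 shades are assumed values.
palette <- c("IF" = "#00BFFF", "VVS1" = "#ADD8E6", "VVS2" = "#A3D1F2",
"VS1" = "#99CCFF", "VS2" = "#66B2FF",
"SI1" = "#3399FF", "SI2" = "#0073E6",
"I1" = "#004C99")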
ggplot(data = diamonds, aes(x = clarity, y = price, fill = clarity)) +
  geom_boxplot() +
  #geom_text(data=diamonds,aes(label=clarity,y=price),position = position_dodge(width = 0.75), vjust = -0.5, size = 5)+
  labs(x = "Clarity Grade", y = "Price", title = "Price Distribution by Clarity Grade") +
  scale_fill_manual(values = palette) +
  theme_classic()
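When a named vector is passed to scale_fill_manual(), the names are matched against the factor levels, and any level without a matching name is drawn in the NA fill color. One way to guard against that is to build the palette from the levels themselves. This is a minimal sketch: the two endpoint colors and the name clarity_palette are assumptions for illustration.

```r
library(ggplot2)  # provides diamonds and the plotting functions

clarity_levels <- levels(diamonds$clarity)  # I1 (worst) through IF (best)

# Interpolate one color per level, from dark blue (I1) to light blue (IF).
clarity_palette <- setNames(
  colorRampPalette(c("#004C99", "#ADD8E6"))(length(clarity_levels)),
  clarity_levels
)

# Any level listed here would be missing a color.
setdiff(clarity_levels, names(clarity_palette))

p <- ggplot(diamonds, aes(x = clarity, y = price, fill = clarity)) +
  geom_boxplot() +
  labs(x = "Clarity Grade", y = "Price", title = "Price Distribution by Clarity Grade") +
  scale_fill_manual(values = clarity_palette) +
  theme_classic()
p
```

Generating the names from levels() means the palette stays in sync with the data even if the set of grades changes.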
The exact nature and severity of inclusions within each grade (VS1, VS2, SI1, SI2, I1) remain unclear without external references.
3. Significant Risks:
1. Misinterpretation of clarity grades leads to inaccurate valuation
or purchase decisions.
2. Overlooking specific inclusions that could affect a diamond’s beauty
or durability.
3. Difficulty comparing diamonds across datasets with different clarity
coding systems.
Reducing Negative Consequences:
Provide additional resources: Include links to detailed
diamond grading explanations within the dataset documentation or as a
separate resource guide.
Standardize data: Advocate for the adoption of a
universal clarity coding system with detailed explanations across
datasets.
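One lightweight form such a resource guide could take is a lookup table stored alongside the data. The sketch below is hypothetical: the names clarity_guide and diamonds_described are illustrative, and the descriptions are the commonly used expansions of the grade abbreviations, not text from the dataset documentation.

```r
library(ggplot2)  # provides the diamonds dataset

# Hypothetical lookup table: one row per clarity code, worst to best.
clarity_guide <- data.frame(
  clarity = c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"),
  meaning = c(
    "Included, level 1",
    "Slightly Included, level 2",
    "Slightly Included, level 1",
    "Very Slightly Included, level 2",
    "Very Slightly Included, level 1",
    "Very, Very Slightly Included, level 2",
    "Very, Very Slightly Included, level 1",
    "Internally Flawless"
  ),
  stringsAsFactors = FALSE
)

# Make sure the guide covers every code that appears in the data.
stopifnot(setequal(clarity_guide$clarity, levels(diamonds$clarity)))

# Attach the description to a copy of the data without altering the clarity factor.
diamonds_described <- diamonds
diamonds_described$clarity_meaning <-
  clarity_guide$meaning[match(as.character(diamonds$clarity), clarity_guide$clarity)]

head(diamonds_described[, c("clarity", "clarity_meaning", "price")])
```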