Title: Week_5_Data-Dive
Output: html document

library(ggplot2)
library(skimr)
head(diamonds)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
skim(diamonds)
Data summary
Name diamonds
Number of rows 53940
Number of columns 10
_______________________
Column type frequency:
factor 3
numeric 7
________________________
Group variables None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
cut            0          1              TRUE     5         Ide: 21551, Pre: 13791, Ver: 12082, Goo: 4906
color          0          1              TRUE     7         G: 11292, E: 9797, F: 9542, H: 8304
clarity        0          1              TRUE     8         SI1: 13065, VS2: 12258, SI2: 9194, VS1: 8171

Variable type: numeric

skim_variable  n_missing  complete_rate  mean     sd       p0     p25     p50      p75      p100      hist
carat          0          1              0.80     0.47     0.2    0.40    0.70     1.04     5.01      ▇▂▁▁▁
depth          0          1              61.75    1.43     43.0   61.00   61.80    62.50    79.00     ▁▁▇▁▁
table          0          1              57.46    2.23     43.0   56.00   57.00    59.00    95.00     ▁▇▁▁▁
price          0          1              3932.80  3989.44  326.0  950.00  2401.00  5324.25  18823.00  ▇▂▁▁▁
x              0          1              5.73     1.12     0.0    4.71    5.70     6.54     10.74     ▁▁▇▃▁
y              0          1              5.73     1.14     0.0    4.72    5.71     6.54     58.90     ▇▁▁▁▁
z              0          1              3.54     0.71     0.0    2.91    3.53     4.04     31.80     ▇▁▁▁▁
  1. Unclear columns in the diamonds dataset

cut:
type: ordered factor
This column contains the values “Fair,” “Good,” “Very Good,” “Premium,” and “Ideal.” Without documentation, it is unclear what these labels measure or how they rank against each other. The documentation shows that they describe the quality of the cut, ordered from “Fair” (lowest) to “Ideal” (highest).
Reason for encoding:
These labels match the terms used in the diamond trade, so they are familiar to people in the industry.
Consequences of not reading the documentation:
Misreading these values, for example by assuming “Premium” ranks above “Ideal,” could lead to incorrect conclusions about diamond quality.
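
The ordering can also be checked directly in R rather than inferred from the labels. The following is a minimal sketch, assuming ggplot2 is installed; it only inspects the factor and changes nothing in the data.

```r
library(ggplot2)

# Levels are stored from lowest to highest cut quality.
levels(diamonds$cut)

# TRUE for an ordered factor, so comparisons such as < and >= are meaningful.
is.ordered(diamonds$cut)

# Count the diamonds graded "Very Good" or better.
n_very_good_or_better <- sum(diamonds$cut >= "Very Good")
n_very_good_or_better
```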

depth:
This column holds numeric values whose meaning is not obvious from the name alone. The documentation clarifies that it is the total depth percentage: the diamond’s depth (z) divided by the mean of its length (x) and width (y), that is, 2 * z / (x + y), expressed as a percentage.
Reason for encoding:
This encoding emphasizes the diamond’s proportions, which are crucial for light reflection and brilliance.
Consequences of not reading the documentation:
Reading depth as an absolute measurement rather than a proportion could lead to wrong conclusions about the characteristics that affect sparkle.
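
The documented formula can be checked against the stored column. This is a rough sketch, not part of the original analysis: dropping rows where x, y, or z is zero is an assumption made here (the skim output shows a minimum of 0 for all three, which presumably marks missing measurements), and the name depth_calc is purely illustrative.

```r
library(ggplot2)

# Keep only rows with usable dimensions; zeros would break the ratio.
d <- subset(diamonds, x > 0 & y > 0 & z > 0)

# Total depth percentage: depth (z) over the mean of length (x) and width (y).
depth_calc <- with(d, 100 * 2 * z / (x + y))

# Absolute difference between the recomputed and the stored depth.
depth_diff <- abs(depth_calc - d$depth)
summary(depth_diff)

# Share of rows where the two values agree to within one percentage point.
mean(depth_diff < 1)
```

Rows with a large difference are worth a closer look, since they may point to data-entry problems in x, y, or z rather than to a different definition of depth.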

table:
As with depth, the numeric values in this column have no obvious meaning on their own. The documentation explains that it is the width of the diamond’s top facet relative to its widest point; in this dataset the values range from 43 to 95. This ratio influences light return and overall appearance.
Consequences of not reading the documentation:
Without this explanation, the value is easily misread, which would distort any assessment of a diamond’s beauty and value.
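
To tie the table percentage to something more familiar, it can be summarized against cut. This is a small sketch only; using the median as the summary is an assumption made here, not something the documentation prescribes, and table_by_cut is an illustrative name.

```r
library(ggplot2)

# Observed range of the table percentage.
range(diamonds$table)

# Median table percentage within each cut grade.
table_by_cut <- aggregate(table ~ cut, data = diamonds, FUN = median)
table_by_cut
```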

  2. Clarity: The clarity column in the diamonds dataset uses codes such as “IF,” “VVS1,” and “VS1.” The ggplot2 documentation describes it as a measurement of how clear the diamond is and lists the codes in order from I1 (worst) to IF (best), but it does not spell out what the abbreviations stand for or what separates one grade from the next. The “FL” (flawless) grade used elsewhere in the diamond trade does not appear in this dataset.
# Price distribution within each clarity grade
ggplot(diamonds, aes(x = clarity, y = price)) +
  geom_boxplot()

While external resources or specialized diamond knowledge can clarify these codes, users who rely solely on the built-in documentation are left with ambiguity. This can lead to limited understanding, misinterpretation of results, and difficulty comparing diamonds across sources.

unique(diamonds$clarity)
## [1] SI2  SI1  VS1  VS2  VVS2 VVS1 I1   IF  
## Levels: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF
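
A quick way to see which clarity codes the data actually uses is to compare its levels against a list of codes from the trade. This is a sketch under an assumption: trade_codes is a hand-typed list of commonly used clarity codes, not something provided by ggplot2.

```r
library(ggplot2)

# Hand-typed list of common clarity codes (assumption, not from ggplot2).
trade_codes <- c("FL", "IF", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2", "I1")

# Codes from the list that never occur as a level in the dataset.
missing_codes <- setdiff(trade_codes, levels(diamonds$clarity))
missing_codes

# Number of diamonds in each clarity grade that does occur.
table(diamonds$clarity)
```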

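Color each clarity grade differently, with darker, more saturated shades marking the lower grades (VS1, VS2, SI1, SI2, I1), whose meaning is hardest to infer without outside references. This draws attention to the grades that need further explanation. The palette must name every level that occurs in the data; a level without an entry is not given one of these colors.

# One entry per clarity level in the data (there is no "FL" level).
palette <- c("IF" = "#00BFFF", "VVS1" = "#ADD8E6", "VVS2" = "#B3D9FF",
             "VS1" = "#99CCFF", "VS2" = "#66B2FF",
             "SI1" = "#3399FF", "SI2" = "#0073E6",
             "I1" = "#004C99")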


ggplot(data = diamonds, aes(x = clarity, y = price, fill = clarity)) +
  geom_boxplot() +
  #geom_text(data=diamonds,aes(label=clarity,y=price),position = position_dodge(width = 0.75), vjust = -0.5, size = 5)+
  labs(x = "Clarity Grade", y = "Price", title = "Price Distribution by Clarity Grade") +
  scale_fill_manual(values = palette) +
  theme_classic()
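
One way to guard against a mismatch between palette names and factor levels is to build the palette from the levels themselves. The sketch below is an alternative under stated assumptions, not the approach used above: the name level_palette and the two endpoint colors are illustrative choices, and colorRampPalette() simply interpolates between them.

```r
library(ggplot2)

# Levels run from I1 (lowest clarity) to IF (highest clarity).
clarity_levels <- levels(diamonds$clarity)

# Interpolate from dark blue (lowest grade) to light blue (highest grade)
# and name each color after the level it belongs to.
level_palette <- setNames(
  colorRampPalette(c("#004C99", "#ADD8E6"))(length(clarity_levels)),
  clarity_levels
)

# Fail early if any level lacks a color.
stopifnot(setequal(names(level_palette), clarity_levels))

ggplot(diamonds, aes(x = clarity, y = price, fill = clarity)) +
  geom_boxplot() +
  scale_fill_manual(values = level_palette) +
  labs(x = "Clarity Grade", y = "Price",
       title = "Price Distribution by Clarity Grade") +
  theme_classic()
```

Because the names come from levels(), the palette stays complete even if the set of clarity grades changes.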

The plot shows how price varies across grades, but the exact nature and severity of the inclusions within each grade (VS1, VS2, SI1, SI2, I1) remain unclear without external references.

  3. Significant Risks:

1. Misinterpreting clarity grades can lead to inaccurate valuations or purchase decisions.
2. Specific inclusions that affect a diamond’s beauty or durability can be overlooked.
3. Diamonds are difficult to compare across datasets that use different clarity coding systems.

Reducing Negative Consequences:
Provide additional resources: Include links to detailed diamond grading explanations within the dataset documentation or as a separate resource guide.
Standardize data: Advocate for the adoption of a universal clarity coding system with detailed explanations across datasets.
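
As one illustration of the first suggestion, a lookup table can attach readable names to the clarity codes. This is only a sketch: the full names below follow the GIA clarity scale as commonly described and are not taken from the ggplot2 documentation, so they should be verified against a grading reference before being relied on. The names clarity_labels and clarity_label are illustrative.

```r
library(ggplot2)

# Code-to-name lookup (assumption: names follow the GIA clarity scale).
clarity_labels <- c(
  I1   = "Included 1",
  SI2  = "Slightly Included 2",
  SI1  = "Slightly Included 1",
  VS2  = "Very Slightly Included 2",
  VS1  = "Very Slightly Included 1",
  VVS2 = "Very, Very Slightly Included 2",
  VVS1 = "Very, Very Slightly Included 1",
  IF   = "Internally Flawless"
)

# The lookup must cover the dataset's levels, in the same order.
stopifnot(identical(names(clarity_labels), levels(diamonds$clarity)))

# Add a readable, still ordered, label column to a copy of the data.
d <- as.data.frame(diamonds)
d$clarity_label <- factor(
  unname(clarity_labels[as.character(d$clarity)]),
  levels = unname(clarity_labels),
  ordered = TRUE
)

# Median price per grade, reported with the readable labels.
median_price <- aggregate(price ~ clarity_label, data = d, FUN = median)
median_price
```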