library(ggplot2)
data(diamonds)
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

DATASET DESCRIPTION

carat : weight of the diamond
cut : quality of the cut
color : diamond color
clarity : measurement of how clear the diamond is
depth : total depth percentage
table : width of the top of the diamond relative to the widest point
price : price in US dollars
x : length in mm
y : width in mm
z : depth in mm

summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

Interpretation:

carat: This represents the weight of the diamond in carats. It ranges from a minimum of 0.20 carats to a maximum of 5.01 carats, with a mean of about 0.80 carats. The distribution is right-skewed (a long tail of large stones): the mean (0.80) exceeds the median (0.70), and the middle half of the diamonds lies between 0.40 carats (Q1) and 1.04 carats (Q3), far below the maximum.

depth: This is the total depth percentage: the diamond’s height (z) expressed as a percentage of its average diameter (the mean of x and y). It ranges from 43% to 79%, with a mean of 61.75%. The middle half of the values spans only 61.00% (Q1) to 62.50% (Q3), so most diamonds fall within a very narrow range of depths.
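The ggplot2 documentation for this dataset defines depth as z / mean(x, y). The following is a minimal sketch that recomputes it from the three dimensions and checks how often the result agrees with the recorded column; the variable names and the one-percentage-point tolerance are illustrative choices, not part of the original analysis.

```r
library(ggplot2)

# Skip rows where x + y is zero, since the ratio is undefined there
ok <- with(diamonds, x + y > 0)

# depth (%) = 100 * z / mean(x, y) = 100 * 2 * z / (x + y)
depth_check <- with(diamonds[ok, ], 100 * 2 * z / (x + y))

# Share of rows where the recomputed depth is within 1 point of the recorded one
agree <- mean(abs(depth_check - diamonds$depth[ok]) < 1)
print(agree)
```

Rows that disagree are worth inspecting, because they usually involve a mis-recorded dimension.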

table: This represents the width of the diamond’s table (the flat top) relative to its diameter, expressed as a percentage. It ranges from 43% to 95%, with an average of 57.46%. Similar to depth, the IQR (56.00% - 59.00%) indicates most tables fall within a limited range.

price: This shows the price of the diamond in USD. It ranges from $326 to $18,823, with a mean of $3,933. The distribution is right-skewed: the mean is well above the median ($2,401), and the maximum is more than three times the third quartile ($5,324).
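A quick way to check the skew suggested by the summary is to compare the mean with the median for carat and price; a mean above the median points to a long right tail. This is a small sketch, and `skew_check` is a name introduced here for illustration.

```r
library(ggplot2)

# Mean and median for the two variables discussed above
skew_check <- sapply(
  diamonds[c("carat", "price")],
  function(v) c(mean = mean(v), median = median(v))
)
print(round(skew_check, 2))

# TRUE for a column means its mean exceeds its median (consistent with right skew)
print(skew_check["mean", ] > skew_check["median", ])
```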

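x, y, z: These represent the length, width, and depth of the diamond in millimeters, respectively. The means are about 5.73 mm (x), 5.73 mm (y), and 3.54 mm (z). The minimum of 0 mm in all three columns cannot be a real measurement and points to missing or mis-recorded values. Likewise, the maxima of 58.9 mm (y) and 31.8 mm (z) are implausible next to third quartiles of 6.54 mm and 4.04 mm, and are probably data-entry errors.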
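The zero and very large dimensions can be counted and, if desired, filtered out before modeling. The sketch below is one possible approach; the 20 mm cutoff is an arbitrary illustrative threshold, and `zero_dims` and `cleaned` are hypothetical names.

```r
library(ggplot2)

# Number of rows with a recorded dimension of exactly 0 mm, per column
zero_dims <- colSums(diamonds[c("x", "y", "z")] == 0)
print(zero_dims)

# Drop zero dimensions and implausibly large y or z values (assumed cutoff: 20 mm)
cleaned <- subset(diamonds, x > 0 & y > 0 & z > 0 & y < 20 & z < 20)
print(nrow(diamonds) - nrow(cleaned))  # rows removed
```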

cut: This describes the quality of the diamond’s cut, in five ordered categories: “Fair,” “Good,” “Very Good,” “Premium,” and “Ideal.” “Ideal” cuts are the most prevalent (21,551 diamonds), followed by “Premium” (13,791) and “Very Good” (12,082).

color: This indicates the color grade of the diamond, ranging from “D” (colorless, best) to “J” (slightly yellow, worst). “G” (11,292 diamonds) and “E” (9,797) are the most common grades, and “J” (2,808) is the rarest.

clarity: This represents the presence of inclusions or blemishes within the diamond, with grades like “SI1” (Slightly Included 1), “VS2” (Very Slightly Included 2), etc. “SI1” is the most frequent clarity grade, followed by “VS2.”
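Reading the most frequent levels off the summary table is error-prone, so a sorted frequency table for each factor can help. This is a minimal sketch; `level_counts` is a name introduced here.

```r
library(ggplot2)

# Frequency of each level, most common first, for the three ordered factors
level_counts <- lapply(
  diamonds[c("cut", "color", "clarity")],
  function(f) sort(table(f), decreasing = TRUE)
)
print(level_counts)
```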

Diamond Data Dive: Interactions and Impact

The provided summary of the diamonds dataset offers a glimpse into how different characteristics might influence a diamond’s price. Let’s delve deeper!

1. Distribution of Diamond Prices: This histogram shows the distribution of diamond prices in the dataset. It can help identify the range of prices, as well as the most common price points.


In the real world, this information can be useful for diamond retailers and consumers to understand the typical pricing structure and set appropriate expectations.

ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Diamond Prices", x = "Price (USD)", y = "Count")

Interpretation:
Most diamonds are priced below $5,000, and the counts fall off sharply as price increases. The long right tail of expensive diamonds extends to the dataset maximum of $18,823.

2. Price vs. Carat Weight: This scatterplot with a linear regression line shows the relationship between carat weight and price. As expected, higher carat weights generally correspond to higher prices.

This information can help diamond buyers estimate the cost of diamonds based on their desired carat weight and can also guide pricing strategies for diamond sellers.

ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red", formula = y~x) +
  labs(title = "Price vs. Carat Weight", x = "Carat Weight", y = "Price (USD)")

Interpretation:
This scatterplot depicts the relationship between carat weight and price, with a linear regression line overlaid. There is a strong positive association: diamonds with higher carat weights tend to have higher prices. The red line is a straight-line fit that summarizes the overall trend, but the point cloud curves upward, with price rising faster than proportionally as carat weight increases, so the linear fit is only a rough approximation.
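A straight line on the raw scale may not describe the relationship well if price grows faster than proportionally with carat weight. One common check, sketched here as an option rather than as part of the original analysis, is to put both variables on log scales and fit the line there; `log_fit` and `p_loglog` are illustrative names.

```r
library(ggplot2)

# Linear model on the log-log scale: log(price) = a + b * log(carat)
log_fit <- lm(log(price) ~ log(carat), data = diamonds)
print(coef(log_fit))  # a slope b above 1 means price grows faster than carat

# Same scatterplot with log10 axes; geom_smooth fits on the transformed data
p_loglog <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", se = FALSE, color = "red", formula = y ~ x) +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Price vs. Carat Weight (log scales)",
       x = "Carat Weight (log scale)", y = "Price in USD (log scale)")
print(p_loglog)
```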

3. Price Distribution by Cut Quality: This boxplot shows how the distribution of prices varies across different cut qualities.


It can help identify the impact of cut quality on diamond pricing and guide buyers and sellers in making informed decisions based on their budget or desired profit margins.

ggplot(data = diamonds, aes(x = cut, y = price)) +
  geom_boxplot(aes(fill = cut), alpha = 0.7) +
  labs(title = "Price Distribution by Cut Quality", x = "Cut Quality", y = "Price (USD)") +
  theme(legend.position = "none")

Interpretation:
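This boxplot displays the distribution of diamond prices across the cut quality categories. Contrary to what one might expect, median price does not rise with cut quality: “Ideal” diamonds have the lowest median price, while “Fair” diamonds have one of the highest. Every category spans most of the price range and has many high-priced outliers. The likely explanation is confounding with carat weight: better-cut diamonds in this dataset tend to be smaller, so cut has to be compared at a fixed carat weight before its effect on price can be judged.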
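The medians behind the boxplot can be tabulated together with the median carat weight for each cut, which makes it easier to see whether stone size differs between categories. This is a minimal sketch, with `cut_medians` as an illustrative name.

```r
library(ggplot2)

# Median price and median carat weight for each cut category
cut_medians <- aggregate(cbind(price, carat) ~ cut, data = diamonds, FUN = median)
print(cut_medians)
```

If the categories with lower median prices also have lower median carat weights, the raw price comparison mostly reflects stone size rather than cut.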

4. Price Distribution by Color Grade: The violin plot visualizes the distribution of prices for different color grades.


It can help understand the premium associated with rarer color grades and guide pricing strategies accordingly.

ggplot(data = diamonds, aes(x = color, y = price)) +
  geom_violin(trim = FALSE, aes(fill = color)) +
  labs(title = "Price Distribution by Color Grade", x = "Color Grade", y = "Price (USD)") +
  theme(legend.position = "none")

Interpretation:
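This violin plot visualizes the distribution of diamond prices for each color grade. The raw distributions run opposite to the grading scale: the poorest grades (I, J) show higher typical prices than the best grades (D, E). This does not mean that poorer color is worth more. Diamonds with poorer color in this dataset tend to be larger, and carat weight dominates price, so color has to be compared at a fixed carat weight.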
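To separate the effect of color from the effect of size, one simple option is to compare prices within a narrow carat band. The sketch below uses stones between 0.9 and 1.1 carats; the band limits are an arbitrary illustrative choice, and the variable names are introduced here.

```r
library(ggplot2)

# Mean price by color over the whole dataset (not adjusted for carat)
overall_means <- tapply(diamonds$price, diamonds$color, mean)
print(round(overall_means))

# Median price by color among stones of roughly one carat (assumed band: 0.9 to 1.1)
one_carat <- subset(diamonds, carat >= 0.9 & carat <= 1.1)
band_medians <- tapply(one_carat$price, one_carat$color, median)
print(round(band_medians))
```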

5. Depth vs. Table (colored by Price): This scatterplot shows the relationship between depth, table, and price.


It can help identify ideal combinations of depth and table that correspond to higher prices, which can be useful for diamond cutters and graders.

ggplot(data = diamonds, aes(x = depth, y = table, color = price)) +
  geom_point(alpha = 0.5) +
  labs(title = "Depth vs. Table (colored by Price)", x = "Depth", y = "Table") +
  scale_color_gradientn(colors = terrain.colors(10))

Interpretation:
The scatterplot shows depth percentage against table percentage, with each point colored by price. With the terrain.colors palette, low prices appear green and higher prices shade through yellow and tan toward near-white. Most points cluster at depths of roughly 60-65%, and this dense region is dominated by green, lower-priced diamonds, largely because most diamonds in the dataset are both typically proportioned and inexpensive. Higher-priced diamonds are spread across the depth range rather than concentrated at one depth and table combination, so the plot gives no clear evidence that depth and table alone determine price. Heavy overplotting also hides much of the structure in the central cluster.

6. Carat Weight vs. Price (colored by Clarity): This scatterplot shows the relationship between carat weight and price, with the points colored by clarity grade.


It can help buyers understand the price premium associated with higher clarity grades for their desired carat weight.

ggplot(data = diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.5) +
  labs(title = "Carat Weight vs. Price (colored by Clarity)", x = "Carat Weight", y = "Price (USD)") +
  scale_color_brewer(type = "div", palette = "RdYlBu")

Interpretation:
This scatterplot shows the relationship between carat weight and price, with the points colored by clarity grade. It demonstrates that diamonds with higher clarity grades (e.g., IF, VVS1) tend to have higher prices for a given carat weight compared to those with lower clarity grades (e.g., SI2, I1).

7. Price vs. Carat Weight by Clarity: These faceted scatterplots show the relationship between carat weight and price for different clarity grades.


It can help buyers understand the price premium associated with higher clarity grades for their desired carat weight.

ggplot(data = diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ clarity, ncol = 3) +
  labs(title = "Price vs. Carat Weight by Clarity", x = "Carat Weight", y = "Price (USD)")

Interpretation:
These faceted scatterplots show price (USD) against carat weight, with one panel per clarity grade. Price rises with carat weight within every grade, and the curve becomes steeper from the lowest grade (I1) to the highest (IF): at the same carat weight, clearer diamonds tend to cost more.

8. Depth vs. Table (colored by Price) by Cut Quality: These faceted scatterplots show the relationship between depth, table, and price for different cut qualities.


This information can be useful for diamond cutters and graders to optimize the cut proportions for maximum value.

ggplot(data = diamonds, aes(x = depth, y = table, color = price)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ cut, ncol = 2) +
  labs(title = "Depth vs. Table (colored by Price) by Cut Quality", x = "Depth", y = "Table")

Interpretation:
These faceted scatterplots show depth against table percentage, colored by price, with one panel per cut grade. The spread of depth and table values narrows as cut quality improves: “Fair” diamonds cover a wide range of proportions, while “Ideal” diamonds are concentrated in a tight band. This is plausible, since cut grades presumably depend in large part on proportions. Within a panel, price (color) shows no obvious pattern with depth or table.

9. Carat Weight vs. Depth (colored by Price) by Cut Quality: These faceted scatterplots show the relationship between carat weight, depth, and price for different cut qualities.


This information can help diamond cutters and graders optimize the proportions for maximum value based on the cut quality.

ggplot(data = diamonds, aes(x = carat, y = depth, color = price)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ cut, ncol = 2) +
  labs(title = "Carat Weight vs. Depth (colored by Price) by Cut Quality", x = "Carat Weight", y = "Depth")

Interpretation:
These faceted scatterplots display carat weight against depth, with color indicating price and one panel per cut grade. Within every panel the color changes mainly along the carat axis, so price tracks carat weight far more closely than depth. “Fair” diamonds cover the widest range of depths, while “Premium” and “Ideal” diamonds sit in a narrow depth band.

10. Price Distribution by Cut and Clarity: The faceted density plots show the price distributions for different combinations of cut and clarity.


This information can guide pricing strategies based on these two key attributes and help identify the most valuable combinations.

ggplot(data = diamonds, aes(x = price, fill = cut)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~ clarity, ncol = 3) +
  labs(title = "Price Distribution by Cut and Clarity", x = "Price (USD)", y = "Density")

Interpretation:
These density plots show the price distribution for each cut, with one panel per clarity grade. All of the distributions are right-skewed: density peaks at low prices and tails off toward the maximum. Because the curves do not control for carat weight, differences in peak position between cuts or clarity grades largely reflect differences in stone size. A high-clarity panel such as “IF” (Internally Flawless) can peak at low prices simply because many of those stones are small, not because they are worth less per carat.

11. Carat Weight Distribution by Cut and Clarity: The faceted density plots show the distribution of carat weights for different combinations of cut and clarity.

ggplot(data = diamonds, aes(x = carat, fill = cut)) +
  geom_density(alpha = 0.5) +
  facet_grid(cut ~ clarity) +
  labs(title = "Carat Weight Distribution by Cut and Clarity", x = "Carat Weight", y = "Density")

Interpretation:
This grid of density plots displays the carat weight distribution of diamonds for each combination of cut quality and clarity. Across the combinations, the distributions are skewed to the right (a long tail toward large stones), indicating that most diamonds have lower carat weights. The peaks occur around 0.3 to 0.7 carats, with the density decreasing as carat weight increases. This information helps retailers understand the availability patterns of different carat weight ranges based on the diamond’s cut and clarity attributes.

CORRELATION:

library(corrplot)
## corrplot 0.92 loaded
# Select numeric columns only
numeric_columns <- Filter(is.numeric, diamonds)

# Calculate the correlation matrix
cor_table <- cor(numeric_columns)
print(cor_table)
##            carat       depth      table      price           x           y
## carat 1.00000000  0.02822431  0.1816175  0.9215913  0.97509423  0.95172220
## depth 0.02822431  1.00000000 -0.2957785 -0.0106474 -0.02528925 -0.02934067
## table 0.18161755 -0.29577852  1.0000000  0.1271339  0.19534428  0.18376015
## price 0.92159130 -0.01064740  0.1271339  1.0000000  0.88443516  0.86542090
## x     0.97509423 -0.02528925  0.1953443  0.8844352  1.00000000  0.97470148
## y     0.95172220 -0.02934067  0.1837601  0.8654209  0.97470148  1.00000000
## z     0.95338738  0.09492388  0.1509287  0.8612494  0.97077180  0.95200572
##                z
## carat 0.95338738
## depth 0.09492388
## table 0.15092869
## price 0.86124944
## x     0.97077180
## y     0.95200572
## z     1.00000000
# Plot the correlation matrix
corrplot(cor_table)
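To read the matrix with price in mind, the price row can be extracted and sorted. The sketch below also computes a Spearman rank correlation for carat and price as a check that does not assume a straight-line relationship; `price_cor` and `spearman_carat` are names introduced here.

```r
library(ggplot2)

numeric_columns <- Filter(is.numeric, diamonds)
cor_table <- cor(numeric_columns)

# Pearson correlations of every numeric variable with price, strongest first
price_cor <- sort(cor_table["price", colnames(cor_table) != "price"], decreasing = TRUE)
print(round(price_cor, 3))

# Rank-based correlation between carat and price
spearman_carat <- cor(diamonds$carat, diamonds$price, method = "spearman")
print(round(spearman_carat, 3))
```

Carat and the three dimensions (x, y, z) are correlated with one another at about 0.95 or higher in the matrix above, so they carry largely the same information about size.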

ANOVA Analysis:

# Perform ANOVA for cut vs. price
cut_anova <- aov(price ~ cut, data = diamonds)
summary(cut_anova)
##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## cut             4 1.104e+10 2.760e+09   175.7 <2e-16 ***
## Residuals   53935 8.474e+11 1.571e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Perform ANOVA for color vs. price
color_anova <- aov(price ~ color, data = diamonds)
summary(color_anova)
##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## color           6 2.685e+10 4.475e+09   290.2 <2e-16 ***
## Residuals   53933 8.316e+11 1.542e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Perform ANOVA for clarity vs. price
clarity_anova <- aov(price ~ clarity, data = diamonds)
summary(clarity_anova)
##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## clarity         7 2.331e+10 3.330e+09     215 <2e-16 ***
## Residuals   53932 8.352e+11 1.549e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The one-way Analysis of Variance (ANOVA) tests above determine whether mean diamond price differs significantly across the levels of cut quality, color, and clarity.

The ANOVA results show that all three factors (cut, color, and clarity) have very small p-values (< 2e-16, which is essentially 0) in the Pr(>F) column. This means that the null hypothesis, which assumes no difference in mean prices across the different levels of each factor, can be rejected at any reasonable significance level.

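In other words, the ANOVA confirms that mean price differs significantly across the levels of cut quality, color, and clarity. The tests do not say which levels differ, in which direction, or by how much, and because each one ignores carat weight, the differences may partly reflect stone size. With that caveat, the information is valuable for several real-world applications: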

  1. Pricing strategy: Retailers and wholesalers can use these findings to develop pricing models that reflect the value contribution of each diamond attribute. The size and direction of each premium should come from a model that also accounts for carat weight, such as the regression below.

  2. Inventory management: Understanding the relationship between diamond attributes and prices can help retailers optimize their inventory mix. They can stock more diamonds with attributes that are in higher demand and command premium prices.

  3. Consumer education: The ANOVA results confirm that cut, color, and clarity are all associated with a diamond’s price. Retailers can use this information to educate consumers on the factors that influence diamond prices, helping them make informed purchasing decisions.

  4. Quality control: Diamond suppliers and manufacturers can use the ANOVA insights to prioritize and optimize the cutting, polishing, and sorting processes to achieve the desired levels of cut quality, color, and clarity that maximize value.

Overall, these ANOVA analyses provide statistical evidence of the significant impact of diamond attributes on pricing, which can be leveraged by various stakeholders in the diamond industry to make more informed decisions and enhance their operations and customer experiences.
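With nearly 54,000 observations, even small differences produce tiny p-values, so it helps to look at effect size as well. The sketch below computes eta squared (the share of the total sum of squares attributed to the factor) for each one-way fit. `eta_squared` is a helper defined here for illustration, not a function from a package. Based on the sums of squares printed above, the shares are only about 1% for cut and about 3% for color and for clarity.

```r
library(ggplot2)

# Eta squared for a one-way aov fit: SS_factor / (SS_factor + SS_residuals)
eta_squared <- function(fit) {
  ss <- summary(fit)[[1]][["Sum Sq"]]
  ss[1] / sum(ss)
}

fits <- list(
  cut     = aov(price ~ cut, data = diamonds),
  color   = aov(price ~ color, data = diamonds),
  clarity = aov(price ~ clarity, data = diamonds)
)

eta <- sapply(fits, eta_squared)
print(round(eta, 4))
```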

Linear Regression Model and Hypothesis Testing

# Fit linear regression model
model <- lm(price ~ carat + cut + color + clarity, data = diamonds)

# Summary of the fitted model
summary(model)
## 
## Call:
## lm(formula = price ~ carat + cut + color + clarity, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16813.5   -680.4   -197.6    466.4  10394.9 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -3710.603     13.980 -265.414  < 2e-16 ***
## carat        8886.129     12.034  738.437  < 2e-16 ***
## cut.L         698.907     20.335   34.369  < 2e-16 ***
## cut.Q        -327.686     17.911  -18.295  < 2e-16 ***
## cut.C         180.565     15.557   11.607  < 2e-16 ***
## cut^4          -1.207     12.458   -0.097    0.923    
## color.L     -1910.288     17.712 -107.853  < 2e-16 ***
## color.Q      -627.954     16.121  -38.952  < 2e-16 ***
## color.C      -171.960     15.070  -11.410  < 2e-16 ***
## color^4        21.678     13.840    1.566    0.117    
## color^5       -85.943     13.076   -6.572 5.00e-11 ***
## color^6       -49.986     11.889   -4.205 2.62e-05 ***
## clarity.L    4217.535     30.831  136.794  < 2e-16 ***
## clarity.Q   -1832.406     28.827  -63.565  < 2e-16 ***
## clarity.C     923.273     24.679   37.411  < 2e-16 ***
## clarity^4    -361.995     19.739  -18.339  < 2e-16 ***
## clarity^5     216.616     16.109   13.447  < 2e-16 ***
## clarity^6       2.105     14.037    0.150    0.881    
## clarity^7     110.340     12.383    8.910  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1157 on 53921 degrees of freedom
## Multiple R-squared:  0.9159, Adjusted R-squared:  0.9159 
## F-statistic: 3.264e+04 on 18 and 53921 DF,  p-value: < 2.2e-16
# Hypothesis testing - testing individual coefficients

# Hypothesis 1: Is there a significant relationship between carat weight and diamond price?
carat_p_value <- summary(model)$coefficients["carat", "Pr(>|t|)"]
if (carat_p_value < 0.05) {
  print("The coefficient for carat weight is statistically significant.")
} else {
  print("The coefficient for carat weight is not statistically significant.")
}
## [1] "The coefficient for carat weight is statistically significant."
# Hypothesis 2: Are certain levels of cut associated with significantly different diamond prices?
cut_p_values <- summary(model)$coefficients[grep("^cut", rownames(summary(model)$coefficients)), "Pr(>|t|)"]
alpha <- 0.05
significant_cut <- names(cut_p_values)[which(cut_p_values < alpha)]
if (length(significant_cut) > 0) {
  print(paste("Significant cuts:", paste(significant_cut, collapse = ", ")))
} else {
  print("No significant cuts found.")
}
## [1] "Significant cuts: cut.L, cut.Q, cut.C"
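The check above tests each cut term separately. To ask whether cut as a whole improves the model, one option is a joint F-test that compares the full model with a model that omits cut. This is a sketch of that approach; `full`, `reduced`, and `cut_test` are illustrative names.

```r
library(ggplot2)

full <- lm(price ~ carat + cut + color + clarity, data = diamonds)

# Same model without the cut terms
reduced <- update(full, . ~ . - cut)

# F-test for the 4 cut terms jointly
cut_test <- anova(reduced, full)
print(cut_test)
```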

Interpretation:

This output shows the results of a multiple linear regression analysis performed on the diamond dataset, with the price as the dependent variable and carat weight, cut quality, color, and clarity as independent variables.

The coefficients table provides the following insights:

  1. Carat weight: The coefficient for carat weight is 8886.129, with a very small p-value (< 2e-16), indicating that carat weight has a statistically significant positive effect on diamond prices. Holding cut, color, and clarity fixed, each additional carat is associated with an increase of about $8,886 in price.

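  2. Cut quality: Because cut is an ordered factor, R encodes it with orthogonal polynomial contrasts rather than one indicator per level. The terms cut.L, cut.Q, cut.C, and cut^4 are the linear, quadratic, cubic, and quartic trends across the ordered levels Fair < Good < Very Good < Premium < Ideal; they do not correspond to individual cut grades. The positive linear term (698.907, p < 2e-16) indicates that price tends to rise from “Fair” to “Ideal” once carat, color, and clarity are held fixed. The quadratic (-327.686) and cubic (180.565) terms are also significant and describe curvature in that trend, while the quartic term (-1.207, p = 0.923) is not statistically significant.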

  3. Color: Color is also an ordered factor, with levels running from D (best) to J (worst), so color.L through color^6 are polynomial trend terms, not individual grades. The negative linear term (-1910.288, p < 2e-16) indicates that price tends to fall as color moves from D toward J at a fixed carat weight, cut, and clarity. The quadratic (-627.954) and cubic (-171.960) terms are also significant. Of the higher-order terms, color^4 is not significant (p = 0.117), while color^5 and color^6 are.

  4. Clarity: Clarity is an ordered factor with eight levels running from I1 (worst) to IF (best), so clarity.L through clarity^7 are again polynomial trend terms. The positive linear term (4217.535, p < 2e-16) indicates that price tends to rise from “I1” to “IF” with the other variables held fixed. The quadratic (-1832.406), cubic (923.273), clarity^4 (-361.995), clarity^5 (216.616), and clarity^7 (110.340) terms are all significant and describe how that increase departs from a straight line. Only clarity^6 (2.105, p = 0.881) is not significant.

The model has an adjusted R-squared value of 0.9159, which means it explains approximately 91.59% of the variance in diamond prices based on the included variables.

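The hypothesis checks printed after the model summary confirm that the coefficient for carat weight is statistically significant at the 5% level and that three of the four cut terms (cut.L, cut.Q, and cut.C) are significant. These are the linear, quadratic, and cubic polynomial trend terms R uses for ordered factors, not the grades “Ideal,” “Premium,” and “Very Good.”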
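If per-grade effects are easier to communicate than polynomial trends, one option is to refit the same model with unordered factors, so that R uses its default treatment contrasts and reports each level as a difference from a baseline. The sketch below does this on a copy of the data; `diamonds_unordered` and `model_treatment` are names introduced here, and the fit itself (fitted values, R-squared) is unchanged by the recoding.

```r
library(ggplot2)

# Polynomial contrast matrix R uses for a 5-level ordered factor such as cut
print(round(contr.poly(5), 3))

# Copy the data and drop the ordering so lm() uses treatment contrasts
diamonds_unordered <- as.data.frame(diamonds)
for (v in c("cut", "color", "clarity")) {
  diamonds_unordered[[v]] <- factor(diamonds_unordered[[v]], ordered = FALSE)
}

model_treatment <- lm(price ~ carat + cut + color + clarity,
                      data = diamonds_unordered)

# Each coefficient is now a price difference from the baseline level (Fair)
cut_effects <- coef(model_treatment)[grep("^cut", names(coef(model_treatment)))]
print(round(cut_effects, 1))
```

The baseline levels are the first level of each factor: Fair for cut, D for color, and I1 for clarity.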

CONCLUSION: