library(ggplot2)
data(diamonds)
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
DATASET DESCRIPTION
carat : weight of the diamond
cut : quality of the cut
color : diamond color
clarity : measurement of how clear the diamond is
depth : total depth percentage
table : width of the top of the diamond relative to the widest
point
price : price in US dollars
x : length in mm
y : width in mm
z : depth in mm
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
Interpretation:
carat: This represents the weight of the diamond in carats. It
ranges from a minimum of 0.20 carats to a maximum of 5.01 carats, with
a mean of about 0.80 carats and a median of 0.70 carats. The mean
exceeds the median, and the 3rd quartile (Q3) is only 1.04 carats against
a maximum of 5.01, so the distribution is right-skewed and dominated by
smaller diamonds.
depth: This is the total depth percentage: the height of the diamond (z)
divided by the mean of its length and width, 2 * z / (x + y). It ranges
from 43% to 79%, with an average of 61.75%. The interquartile range (IQR)
from Q1 (61.00%) to Q3 (62.50%) shows that most depths fall within a
narrow range.
table: This represents the width of the diamond’s table
(the flat top) relative to its widest point, expressed as a percentage. It
ranges from 43% to 95%, with an average of 57.46%. Similar to depth, the
IQR (56.00% - 59.00%) indicates most tables fall within a limited
range.
price: This shows the price of the diamond in USD. It
ranges from $326 to $18,823, with a mean of $3,933 and a median of
$2,401. The mean sits well above the median, which indicates a
right-skewed distribution: most diamonds are relatively cheap and a
small number are very expensive.
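x, y, z: These represent the length, width, and height
of the diamond in millimeters, respectively. Length and width both average
about 5.7mm, while height averages about 3.5mm. All three have a minimum
of 0mm, which is physically impossible and points to missing or miscoded
measurements. The maximums for y (58.9mm) and z (31.8mm) are far above
the 3rd quartiles and are likely data-entry errors as well.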
cut: This describes the quality of the diamond’s
cut, with categories “Fair,” “Good,” “Very Good,” “Premium,” and
“Ideal.” “Ideal” cuts are the most prevalent (21,551 diamonds), followed
by “Premium” (13,791 diamonds) and “Very Good” (12,082 diamonds).
color: This indicates the color grade of the diamond,
ranging from “D” (colorless) to “J” (slightly yellow). “G” (11,292
diamonds) and “E” (9,797 diamonds) are the most common grades.
clarity: This represents the presence of inclusions or
blemishes within the diamond, with grades like “SI1” (Slightly Included
1), “VS2” (Very Slightly Included 2), etc. “SI1” is the most frequent
clarity grade, followed by “VS2.”
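The zero minimums and very large maximums in x, y, and z can be inspected directly. The following is a minimal sketch; the 20mm cut-off for "implausibly large" is an assumption chosen for illustration, not a value from the dataset documentation.

```r
library(ggplot2)  # provides the diamonds dataset

# Rows where any physical dimension is recorded as 0 mm
zero_dims <- subset(diamonds, x == 0 | y == 0 | z == 0)
nrow(zero_dims)

# Rows with a suspiciously large width or height
# (the 20 mm threshold is an illustrative assumption)
big_dims <- subset(diamonds, y > 20 | z > 20)
big_dims[, c("carat", "x", "y", "z")]
```

Rows flagged this way are candidates for exclusion or recoding to NA before any analysis that relies on the dimensions.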
The summary gives a first glimpse of how different characteristics might influence a diamond’s price. The plots below explore these relationships in more detail.
1. Distribution of Diamond Prices: This histogram shows the distribution of diamond prices in the dataset. It can help identify the range of prices, as well as the most common price points.
In the real world,
this information can be useful for diamond retailers and
consumers to understand the typical pricing structure and set
appropriate expectations.
ggplot(data = diamonds, aes(x = price)) +
geom_histogram(bins = 30, fill = "steelblue", color = "black") +
labs(title = "Distribution of Diamond Prices", x = "Price (USD)", y = "Count")
Interpretation:
The histogram shows that most diamonds are priced below $5,000, with
frequency falling off sharply as price increases. A long right tail of
expensive diamonds extends up to the dataset maximum of $18,823.
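The skew in the price distribution can be checked numerically, and a log-scaled axis makes the shape of the lower price range easier to see. This is a sketch only; the variable names and the $5,000 threshold are chosen here for illustration.

```r
library(ggplot2)

# Mean vs. median and the share of diamonds under $5,000
price_mean <- mean(diamonds$price)
price_median <- median(diamonds$price)
share_under_5000 <- mean(diamonds$price < 5000)
c(mean = price_mean, median = price_median, share_under_5000 = share_under_5000)

# Same histogram with price on a log10 axis
p_log <- ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  scale_x_log10() +
  labs(title = "Distribution of Diamond Prices (log scale)",
       x = "Price (USD, log10 scale)", y = "Count")
print(p_log)
```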
2. Price vs. Carat Weight: This scatterplot with a
linear regression line shows the relationship between carat weight and
price. As expected, higher carat weights generally correspond to higher
prices.
This information can help diamond buyers estimate the cost of
diamonds based on their desired carat weight and can also guide pricing
strategies for diamond sellers.
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE, color = "red", formula = y~x) +
labs(title = "Price vs. Carat Weight", x = "Carat Weight", y = "Price (USD)")
Interpretation:
This scatterplot depicts the relationship between carat weight and
price, with a linear regression line overlaid. There is a strong
positive correlation (about 0.92): diamonds with higher carat weights
tend to have higher prices. The red line is a straight-line fit, so it
summarizes the average trend rather than demonstrating that the
relationship is linear. The points curve upward and fan out at higher
carat weights, which a single straight line does not capture.
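One way to probe whether a straight line is adequate is to compare it with a fit on log-transformed variables. The sketch below is an illustrative alternative, not part of the original analysis. The two R-squared values are measured on different response scales, so they are only a rough guide.

```r
library(ggplot2)

# Straight-line fit on the original scale
fit_linear <- lm(price ~ carat, data = diamonds)

# Power-law fit: log(price) as a linear function of log(carat)
fit_loglog <- lm(log(price) ~ log(carat), data = diamonds)

r2_linear <- summary(fit_linear)$r.squared
r2_loglog <- summary(fit_loglog)$r.squared
c(linear = r2_linear, loglog = r2_loglog)

# Slope on the log-log scale: a value above 1 means price grows
# more than proportionally with carat weight
coef(fit_loglog)[["log(carat)"]]
```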
3. Price Distribution by Cut Quality: This boxplot shows how the distribution of prices varies across different cut qualities.
It can help identify the impact of cut quality on diamond
pricing and guide buyers and sellers in making informed decisions based
on their budget or desired profit margins.
ggplot(data = diamonds, aes(x = cut, y = price)) +
geom_boxplot(aes(fill = cut), alpha = 0.7) +
labs(title = "Price Distribution by Cut Quality", x = "Cut Quality", y = "Price (USD)") +
theme(legend.position = "none")
Interpretation:
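This boxplot displays the distribution of diamond prices across
cut quality categories. Contrary to what one might expect, median price
does not rise with cut quality: Ideal diamonds have the lowest median
price, while Fair and Premium diamonds have some of the highest. The
likely explanation is carat weight, since better-cut diamonds in this
dataset tend to be smaller, so the effect of cut only becomes clear once
carat is held fixed. Every category has a long upper tail of expensive
diamonds.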
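The boxplot can be cross-checked by tabulating the median price next to the median carat weight for each cut. This is a small sketch using base R's aggregate(); the object name by_cut is chosen here for illustration.

```r
library(ggplot2)

# Median price and median carat weight for each cut grade
by_cut <- aggregate(cbind(price, carat) ~ cut, data = diamonds, FUN = median)
by_cut
```

Reading the two columns side by side shows whether differences in price between cuts line up with differences in diamond size.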
4. Price Distribution by Color Grade: The violin plot visualizes the distribution of prices for different color grades.
It can help understand the premium associated with rarer color
grades and guide pricing strategies accordingly.
ggplot(data = diamonds, aes(x = color, y = price)) +
geom_violin(trim = FALSE, aes(fill = color)) +
labs(title = "Price Distribution by Color Grade", x = "Color Grade", y = "Price (USD)") +
theme(legend.position = "none")
Interpretation:
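This violin plot visualizes the distribution of diamond prices for
each color grade.
Taken at face value it runs against expectation: the best color grades
(e.g., D, E) have most of their mass at low prices, while the lower
grades (e.g., I, J) have higher typical prices. This reflects carat
weight, because diamonds with lower color grades in this dataset tend to
be larger. Once carat is held fixed, as in the regression model later in
this report, better color grades are associated with higher prices.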
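Because color and carat weight are related in this dataset, it helps to compare prices by color within a narrow carat band as well as overall. The sketch below uses a band of 0.9 to 1.1 carats, which is an arbitrary choice made for illustration.

```r
library(ggplot2)

# Median price by color grade across all diamonds
raw_median <- tapply(diamonds$price, diamonds$color, median)

# Median price by color grade for diamonds of roughly one carat
# (the 0.9-1.1 band is an illustrative assumption)
near_one <- subset(diamonds, carat >= 0.9 & carat <= 1.1)
band_median <- tapply(near_one$price, near_one$color, median)

rbind(all_diamonds = raw_median, carat_0.9_to_1.1 = band_median)
```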
5. Depth vs. Table (colored by Price): This scatterplot shows the relationship between depth, table, and price.
It can help identify ideal combinations of depth and table that
correspond to higher prices, which can be useful for diamond cutters and
graders.
ggplot(data = diamonds, aes(x = depth, y = table, color = price)) +
geom_point(alpha = 0.5) +
labs(title = "Depth vs. Table (colored by Price)", x = "Depth", y = "Table") +
scale_color_gradientn(colors = terrain.colors(10))
Interpretation:
The scatterplot shows depth against table, with each point colored by
price (green for the lowest prices, shading through yellow and orange
toward the highest). Most diamonds cluster around depth values of 60-65%
and table values of roughly 55-60%, in line with the narrow
interquartile ranges in the summary. Within that cluster the colors
appear mixed: cheap and expensive diamonds share the same depth and
table values. Price therefore shows no clear dependence on either
proportion, which agrees with the weak correlations with price (about
-0.01 for depth and 0.13 for table) reported in the correlation section.
6. Carat Weight vs. Price (colored by Clarity): This scatterplot shows the relationship between carat weight and price, with points colored by clarity grade.
It can help buyers understand the price premium associated with
higher clarity grades for their desired carat weight.
ggplot(data = diamonds, aes(x = carat, y = price, color = clarity)) +
geom_point(alpha = 0.5) +
labs(title = "Carat Weight vs. Price (colored by Clarity)", x = "Carat Weight", y = "Price (USD)") +
scale_color_brewer(type = "div", palette = "RdYlBu")
Interpretation:
This scatterplot shows the relationship between carat weight and price,
with the points colored by clarity grade. It demonstrates that diamonds
with higher clarity grades (e.g., IF, VVS1) tend to have higher prices
for a given carat weight compared to those with lower clarity grades
(e.g., SI2, I1).
7. Price vs. Carat Weight by Clarity: These faceted scatterplots show the relationship between carat weight and price for different clarity grades.
It can help buyers understand the price premium associated with
higher clarity grades for their desired carat weight.
ggplot(data = diamonds, aes(x = carat, y = price, color = clarity)) +
geom_point(alpha = 0.5) +
facet_wrap(~ clarity, ncol = 3) +
labs(title = "Price vs. Carat Weight by Clarity", x = "Carat Weight", y = "Price (USD)")
Interpretation:
The faceted scatterplots show price (USD) against carat weight,
with one panel per clarity grade. Price increases with carat weight
within every grade, and the curve rises more steeply as clarity improves
from I1 to IF: at a given carat weight, higher clarity grades reach
higher prices.
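The per-panel trends can be summarized by fitting a separate straight line in each clarity grade and comparing the slopes. This is a rough sketch: a straight line is a simplification of the curved relationship, and the name slope_by_clarity is chosen here for illustration.

```r
library(ggplot2)

# Slope of price on carat (USD per carat), fitted separately per clarity grade
slope_by_clarity <- sapply(
  split(diamonds, diamonds$clarity),
  function(d) coef(lm(price ~ carat, data = d))[["carat"]]
)
round(slope_by_clarity)
```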
8. Depth vs. Table (colored by Price) by Cut Quality: These faceted scatterplots show the relationship between depth, table, and price for different cut qualities.
This information can be useful for diamond cutters and graders
to optimize the cut proportions for maximum value.
ggplot(data = diamonds, aes(x = depth, y = table, color = price)) +
geom_point(alpha = 0.5) +
facet_wrap(~ cut, ncol = 2) +
labs(title = "Depth vs. Table (colored by Price) by Cut Quality", x = "Depth", y = "Table")
Interpretation:
These faceted scatterplots show the relationship between depth and
table percentages of diamonds, colored by price, with one panel per
cut quality grade. Higher cut grades such as Ideal tend to occupy a
narrower band of depth and table values, while Fair diamonds are spread
more widely, which is expected given that cut grade reflects a diamond’s
proportions. Price colors are mixed within each panel, so depth and
table alone say little about price.
9. Carat Weight vs. Depth (colored by Price) by Cut Quality: These faceted scatterplots show the relationship between carat weight, depth, and price for different cut qualities.
This information can help diamond cutters and graders optimize
the proportions for maximum value based on the cut quality.
ggplot(data = diamonds, aes(x = carat, y = depth, color = price)) +
geom_point(alpha = 0.5) +
facet_wrap(~ cut, ncol = 2) +
labs(title = "Carat Weight vs. Depth (colored by Price) by Cut Quality", x = "Carat Weight", y = "Depth")
Interpretation:
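The faceted scatterplots display the relationship between carat weight
and depth of diamonds, with the color indicating price and one panel per
cut grade. In every panel, price increases mainly along the carat axis
rather than the depth axis. Any price premium for better cuts at similar
carat weight and depth is hard to read from color alone and is better
assessed with the regression model later in this report.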
10. Price Distribution by Cut and Clarity: The faceted density plots show the price distributions for different combinations of cut and clarity.
This information can guide pricing strategies based on these two
key attributes and help identify the most valuable
combinations.
ggplot(data = diamonds, aes(x = price, fill = cut)) +
geom_density(alpha = 0.5) +
facet_wrap(~ clarity, ncol = 3) +
labs(title = "Price Distribution by Cut and Clarity", x = "Price (USD)", y = "Density")
Interpretation:
These faceted density plots show the price distribution for each cut
grade, with one panel per clarity grade. In every panel the density is
concentrated at the lower end of the price range, with a long right
tail. The panels do not control for carat weight, and diamonds with
better cut or higher clarity tend to be smaller in this dataset, so
differences in where the curves peak should not be read as price
premiums for quality.
11. Carat Weight Distribution by Cut and Clarity: The faceted density plots show the distribution of carat weights for different combinations of cut and clarity.
ggplot(data = diamonds, aes(x = carat, fill = cut)) +
geom_density(alpha = 0.5) +
facet_grid(cut ~ clarity) +
labs(title = "Carat Weight Distribution by Cut and Clarity", x = "Carat Weight", y = "Density")
Interpretation:
This series of density plots displays the carat weight
distribution of diamonds categorized by cut quality and clarity levels.
Across all cut and clarity combinations, the distributions are skewed to
the right (positively skewed), with a long tail toward heavier stones,
indicating that most diamonds have lower carat weights. The
peaks occur around 0.3 to 0.7 carats, with the frequency decreasing as
carat weight increases. This information helps retailers
understand the availability patterns of different carat weight ranges
based on the diamond’s cut and clarity attributes.
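The direction of the skew can be confirmed with a moment-based skewness coefficient, where a positive value means a longer right tail. Base R has no built-in skewness function, so the helper below is defined here for illustration.

```r
library(ggplot2)

# Moment-based sample skewness: third central moment over
# the 1.5 power of the second central moment
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

# Overall skewness of carat weight, then separately by cut grade
carat_skew <- skewness(diamonds$carat)
carat_skew
tapply(diamonds$carat, diamonds$cut, skewness)
```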
CORRELATION:
library(corrplot)
## corrplot 0.92 loaded
# Select numeric columns only
numeric_columns <- Filter(is.numeric, diamonds)
# Calculate the correlation matrix
cor_table <- cor(numeric_columns)
print(cor_table)
## carat depth table price x y
## carat 1.00000000 0.02822431 0.1816175 0.9215913 0.97509423 0.95172220
## depth 0.02822431 1.00000000 -0.2957785 -0.0106474 -0.02528925 -0.02934067
## table 0.18161755 -0.29577852 1.0000000 0.1271339 0.19534428 0.18376015
## price 0.92159130 -0.01064740 0.1271339 1.0000000 0.88443516 0.86542090
## x 0.97509423 -0.02528925 0.1953443 0.8844352 1.00000000 0.97470148
## y 0.95172220 -0.02934067 0.1837601 0.8654209 0.97470148 1.00000000
## z 0.95338738 0.09492388 0.1509287 0.8612494 0.97077180 0.95200572
## z
## carat 0.95338738
## depth 0.09492388
## table 0.15092869
## price 0.86124944
## x 0.97077180
## y 0.95200572
## z 1.00000000
# Plot the correlation matrix
corrplot(cor_table)
Interpretation of the Matrix:
Each cell in the matrix represents the correlation coefficient between two variables.
The diagonal cells (from top-left to bottom-right) show the correlation of each variable with itself, which is always 1 (perfect correlation).
For example, the correlation between carat and price is 0.9216, indicating a strong positive correlation. This suggests that as the carat weight of a diamond increases, its price tends to increase as well.
Similarly, the correlation between depth and table is -0.2958, indicating a weak-to-moderate negative correlation. This suggests that as the depth percentage increases, the table percentage tends to decrease, and vice versa.
Correlation between x, y, and z dimensions of the diamond is also quite high, indicating strong positive correlations among these variables.
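The strongly correlated pairs can be pulled out of the matrix programmatically instead of being read off by eye. The sketch below recomputes the matrix so it runs on its own; the 0.9 threshold is an assumption chosen for illustration.

```r
library(ggplot2)

numeric_columns <- Filter(is.numeric, diamonds)
cor_table <- cor(numeric_columns)

# Positions in the upper triangle with |r| above the threshold
# (0.9 is an illustrative cut-off)
idx <- which(upper.tri(cor_table) & abs(cor_table) > 0.9, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_table)[idx[, "row"]],
  var2 = colnames(cor_table)[idx[, "col"]],
  r = cor_table[idx]
)
high_pairs[order(-high_pairs$r), ]
```

Pairs that appear here, such as carat with x, y, and z, carry largely overlapping information, which matters if several of them are used as predictors in the same regression model.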
ANOVA Analysis:
# Perform ANOVA for cut vs. price
cut_anova <- aov(price ~ cut, data = diamonds)
summary(cut_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## cut 4 1.104e+10 2.760e+09 175.7 <2e-16 ***
## Residuals 53935 8.474e+11 1.571e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Perform ANOVA for color vs. price
color_anova <- aov(price ~ color, data = diamonds)
summary(color_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## color 6 2.685e+10 4.475e+09 290.2 <2e-16 ***
## Residuals 53933 8.316e+11 1.542e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Perform ANOVA for clarity vs. price
clarity_anova <- aov(price ~ clarity, data = diamonds)
summary(clarity_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## clarity 7 2.331e+10 3.330e+09 215 <2e-16 ***
## Residuals 53932 8.352e+11 1.549e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Analysis of Variance (ANOVA) tests determine whether there are statistically significant differences in mean diamond price across the levels of cut quality, color, and clarity.
The ANOVA results show that all three factors (cut, color, and clarity) have very small p-values (< 2e-16, which is essentially 0) in the Pr(>F) column. This means that the null hypothesis, which assumes no difference in mean prices across the different levels of each factor, can be rejected at any reasonable significance level.
In other words, mean price differs significantly across the levels of cut, color, and clarity. The tests do not by themselves show the direction or size of those differences, and they do not account for carat weight. With that caveat, the findings are relevant to several real-world applications:
Pricing strategy: Retailers and wholesalers can use these findings to develop pricing models that reflect the contribution of each diamond attribute. Such models should also include carat weight, since the premium for better cut, color, and clarity only becomes visible once diamond size is held fixed.
Inventory management: Understanding the relationship between diamond attributes and prices can help retailers optimize their inventory mix. They can stock more diamonds with attributes that are in higher demand and command premium prices.
Consumer education: The ANOVA results show that cut, color, and clarity are each associated with price. Retailers can use this information to educate consumers on the factors that influence diamond prices, helping them make informed purchasing decisions.
Quality control: Diamond suppliers and manufacturers can use the ANOVA insights to prioritize and optimize the cutting, polishing, and sorting processes to achieve the desired levels of cut quality, color, and clarity that maximize value.
Overall, these ANOVA analyses provide statistical evidence of the significant impact of diamond attributes on pricing, which can be leveraged by various stakeholders in the diamond industry to make more informed decisions and enhance their operations and customer experiences.
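With nearly 54,000 observations, very small p-values are expected even for modest effects, so it is useful to report an effect size alongside them. The sketch below computes eta-squared, the share of total price variance explained by each factor on its own. The helper eta_squared is defined here for illustration and is not a base R function.

```r
library(ggplot2)

# Eta-squared for a one-way ANOVA: between-group sum of squares
# divided by the total sum of squares
eta_squared <- function(fit) {
  ss <- summary(fit)[[1]][["Sum Sq"]]
  ss[1] / sum(ss)
}

effects <- c(
  cut     = eta_squared(aov(price ~ cut, data = diamonds)),
  color   = eta_squared(aov(price ~ color, data = diamonds)),
  clarity = eta_squared(aov(price ~ clarity, data = diamonds))
)
round(effects, 3)
```

Each factor taken alone explains only a few percent of the variance in price, which is consistent with carat weight being the dominant driver.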
Linear Regression Model and Hypothesis Testing
# Fit linear regression model
model <- lm(price ~ carat + cut + color + clarity, data = diamonds)
# Summary of the fitted model
summary(model)
##
## Call:
## lm(formula = price ~ carat + cut + color + clarity, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16813.5 -680.4 -197.6 466.4 10394.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3710.603 13.980 -265.414 < 2e-16 ***
## carat 8886.129 12.034 738.437 < 2e-16 ***
## cut.L 698.907 20.335 34.369 < 2e-16 ***
## cut.Q -327.686 17.911 -18.295 < 2e-16 ***
## cut.C 180.565 15.557 11.607 < 2e-16 ***
## cut^4 -1.207 12.458 -0.097 0.923
## color.L -1910.288 17.712 -107.853 < 2e-16 ***
## color.Q -627.954 16.121 -38.952 < 2e-16 ***
## color.C -171.960 15.070 -11.410 < 2e-16 ***
## color^4 21.678 13.840 1.566 0.117
## color^5 -85.943 13.076 -6.572 5.00e-11 ***
## color^6 -49.986 11.889 -4.205 2.62e-05 ***
## clarity.L 4217.535 30.831 136.794 < 2e-16 ***
## clarity.Q -1832.406 28.827 -63.565 < 2e-16 ***
## clarity.C 923.273 24.679 37.411 < 2e-16 ***
## clarity^4 -361.995 19.739 -18.339 < 2e-16 ***
## clarity^5 216.616 16.109 13.447 < 2e-16 ***
## clarity^6 2.105 14.037 0.150 0.881
## clarity^7 110.340 12.383 8.910 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1157 on 53921 degrees of freedom
## Multiple R-squared: 0.9159, Adjusted R-squared: 0.9159
## F-statistic: 3.264e+04 on 18 and 53921 DF, p-value: < 2.2e-16
# Hypothesis testing - testing individual coefficients
# Hypothesis 1: Is there a significant relationship between carat weight and diamond price?
carat_p_value <- summary(model)$coefficients["carat", "Pr(>|t|)"]
if (carat_p_value < 0.05) {
print("The coefficient for carat weight is statistically significant.")
} else {
print("The coefficient for carat weight is not statistically significant.")
}
## [1] "The coefficient for carat weight is statistically significant."
# Hypothesis 2: Which cut terms are statistically significant? (cut is an ordered factor, so these rows are polynomial contrasts, not individual cut levels.)
cut_p_values <- summary(model)$coefficients[grep("^cut", rownames(summary(model)$coefficients)), "Pr(>|t|)"]
alpha <- 0.05
significant_cut <- names(cut_p_values)[which(cut_p_values < alpha)]
if (length(significant_cut) > 0) {
print(paste("Significant cuts:", paste(significant_cut, collapse = ", ")))
} else {
print("No significant cuts found.")
}
## [1] "Significant cuts: cut.L, cut.Q, cut.C"
Interpretation:
This output shows the results of a multiple linear regression analysis performed on the diamond dataset, with the price as the dependent variable and carat weight, cut quality, color, and clarity as independent variables.
The coefficients table provides the following insights:
Carat weight: The coefficient for carat weight is 8886.129, with a very small p-value (< 2e-16), indicating that carat weight has a statistically significant positive effect on diamond prices. As carat weight increases, the price tends to increase significantly.
Cut quality: Because cut is an ordered factor, R fits orthogonal polynomial contrasts rather than one coefficient per level. The rows cut.L, cut.Q, cut.C, and cut^4 are the linear, quadratic, cubic, and quartic trend components across the ordered levels Fair < Good < Very Good < Premium < Ideal; they do not correspond to individual cut grades. The linear term is positive (698.907, p < 2e-16), indicating that price tends to rise with cut quality when carat, color, and clarity are held fixed. The quadratic (-327.686) and cubic (180.565) terms are also significant, so the increase is not evenly spaced across grades. The quartic term (-1.207, p = 0.923) is not statistically significant.
Color: The rows color.L through color^6 are likewise polynomial contrasts across the ordered grades D < E < F < G < H < I < J. The large negative linear term (-1910.288, p < 2e-16) means that price falls as color moves from D (colorless) toward J, so better color grades are associated with higher prices once the other variables are held fixed. The quadratic (-627.954) and cubic (-171.960) terms are also significant, color^4 is not (p = 0.117), and color^5 and color^6 are small but significant.
Clarity: The rows clarity.L through clarity^7 are polynomial contrasts across the ordered grades from I1 to IF. The large positive linear term (4217.535, p < 2e-16) means that price rises as clarity improves. The quadratic (-1832.406), cubic (923.273), fourth-order (-361.995), and fifth-order (216.616) terms are all significant, clarity^6 is not (2.105, p = 0.881), and clarity^7 (110.340) is significant.
The model has an adjusted R-squared value of 0.9159, which means it explains approximately 91.59% of the variance in diamond prices based on the included variables.
The hypothesis tests printed above confirm that the coefficient for carat weight is statistically significant and that the significant cut terms are the linear, quadratic, and cubic contrasts (cut.L, cut.Q, cut.C), not specific cut grades.
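To obtain one coefficient per level, each measured against a baseline grade, the ordered factors can be converted to unordered factors before fitting. The sketch below is an alternative parameterization of the same model, not the fit reported above. The names d and model_trt are chosen here for illustration.

```r
library(ggplot2)

# Copy the data and drop the ordering so lm() uses treatment contrasts,
# with the first level of each factor (Fair, D, I1) as the baseline
d <- as.data.frame(diamonds)
for (v in c("cut", "color", "clarity")) {
  d[[v]] <- factor(d[[v]], ordered = FALSE)
}

model_trt <- lm(price ~ carat + cut + color + clarity, data = d)
round(coef(model_trt), 1)
```

The fitted values and R-squared are unchanged; only the coefficients are re-expressed. For example, the row cutIdeal is the estimated price difference between an Ideal and a Fair diamond with the same carat weight, color, and clarity.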
CONCLUSION: