We started by looking at the distribution of all numerical variables (price, carat, depth, table, x, y, z). Both price and carat were right-skewed, meaning most diamonds are small and cheap, while a few are large and very expensive. This made us wonder if there’s a strong link between them.
A scatter plot confirmed a strong, positive, nonlinear relationship: as carat increases, price rises rapidly. This suggested that carat is the main factor affecting price. To check this across the full dataset, we drew a hexbin heat map, which showed the same pattern: price grows sharply with carat size.
We also calculated price per carat to see if bigger diamonds are more valuable per unit weight. The result showed that price per carat increases with size, meaning larger diamonds are disproportionately expensive.
When we checked the other numeric variables (depth, table, x, y, z), only x, y, and z showed some correlation with price, but this is because they depend on carat. Depth and table had almost no effect.
Next, we looked at categorical factors like cut, color, and clarity. Boxplots showed that cut doesn’t change price much. “Premium” and “Very Good” are slightly higher, but the difference is small. Color and clarity also didn’t show clear trends, because their effect is often mixed with carat size.
Finally, we used a heat map for color and clarity, and an ECDF plot to compare price distributions by color. Both confirmed that carat size dominates, while color and clarity only slightly influence the final price.
In short:
* Carat strongly drives diamond prices.
* Bigger diamonds are much more expensive, not just heavier.
* Cut, color, and clarity have smaller, secondary effects.
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ lubridate 1.9.4 ✔ tibble 3.3.0
✔ purrr 1.1.0 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(hexbin)
library(scales)
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
library(ggridges)
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
diamonds <- diamonds
view(diamonds)
# Purpose: See skewness, outliers, and typical ranges.
diamonds %>%
  select(price, carat, depth, table, x, y, z) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 40) +
  facet_wrap(~name, scales = "free") +
  labs(title = "Distributions of Numerical Variables")
Both carat and price are heavily right-skewed: low values are common and high values are rare in both. The similar shapes hint that price may rise with carat, but histograms alone cannot show this, so the relationship needs to be tested directly.
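The skew seen in the histograms can also be checked numerically. The sketch below is an illustration rather than part of the original analysis: `skewness()` is a hypothetical helper defined here (a simple moment-based estimate), and positive values indicate a right-skewed distribution.

```r
library(ggplot2)  # provides the diamonds dataset

# Hypothetical helper (not from the original analysis):
# moment-based sample skewness; > 0 means a longer right tail.
skewness <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

# Skewness of a few numeric columns
skew <- sapply(diamonds[c("price", "carat", "depth", "table")], skewness)
round(skew, 2)
```

Values well above zero for price and carat would agree with what the histograms show.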
# Expectation: The strongest relationship is generally with carat.
# Heteroskedasticity and nonlinearity may be observed.
set.seed(1)
diam_samp <- diamonds %>% sample_n(10000)
ggplot(diam_samp, aes(carat, price)) +
  geom_point(alpha = 0.2) +
  geom_smooth(se = FALSE, color = "black") +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Price ~ Carat (model)")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The scatter plot shows a strong positive relationship between carat and price. As carat increases, price rises and its variance also grows, indicating heteroskedasticity. This supports that carat is the main factor influencing diamond prices.
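The growing spread can be summarised in a small table. This is a rough sketch, not part of the original analysis: the 0.5-carat bin width and the 2.5-carat cutoff are arbitrary choices made for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Median and standard deviation of price within 0.5-carat bins.
# base::cut is spelled out because diamonds also has a column named `cut`.
spread_by_carat <- diamonds %>%
  filter(carat <= 2.5) %>%
  mutate(carat_bin = base::cut(carat, breaks = seq(0, 2.5, by = 0.5))) %>%
  group_by(carat_bin) %>%
  summarise(
    n = n(),
    median_price = median(price),
    sd_price = sd(price),
    .groups = "drop"
  )

spread_by_carat
```

If `sd_price` grows along with `median_price` across the bins, that matches the fan shape in the scatter plot.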
# This heat map visualizes the relationship between diamond carat and price,
# where color intensity represents the number of diamonds in each area.
ggplot(diamonds, aes(carat, price)) +
  geom_hex(bins = 40) +
  scale_y_log10(labels = scales::comma) +
  labs(title = "Price ~ Carat (hexbin, log10 price)") +
  guides(fill = guide_colorbar(title = "Count"))
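This hexbin plot shows the same two variables as the scatter plot, but it uses the full dataset and a log10 price axis, with color indicating how many diamonds fall in each cell.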
# This heat map shows how the price per carat changes with carat, using a
# log-scaled y-axis where color intensity indicates the number of diamonds
# in each area.
diamonds %>%
  mutate(ppc = price / carat) %>%
  ggplot(aes(carat, ppc)) +
  geom_hex(bins = 40) +
  scale_y_log10(labels = scales::comma) +
  labs(title = "Price per carat ~ Carat", y = "price/carat (log)")
We already know from the scatter plot of price and carat that price increases with carat. However, that only tells us bigger stones cost more, which is obvious.
What we really want to know is: “Do larger diamonds also have a higher price per carat, or are they just heavier?” The price-per-carat plot shows that price per carat rises with carat, so the relationship between carat and price is nonlinear. Large diamonds are much less common and therefore command a premium.
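One way to put a number on this question, offered as a sketch rather than as part of the original analysis: if price roughly follows a power law, price ≈ a · carat^b, then price per carat scales as carat^(b − 1), so an exponent b above 1 means larger stones cost more per carat. The power-law form is an assumption made here for illustration; b is estimated as the slope of a log-log regression.

```r
library(ggplot2)  # provides the diamonds dataset

# Log-log regression: log(price) = log(a) + b * log(carat)
fit <- lm(log(price) ~ log(carat), data = diamonds)

# Estimated exponent b; b > 1 implies price per carat increases with carat
b <- coef(fit)[["log(carat)"]]
b
```

An estimate of b above 1 would support the reading that bigger diamonds are disproportionately expensive, not just heavier.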
# This faceted scatter plot compares diamond price (log-scaled) with each
# numerical variable (carat, x, y, z, depth, and table), showing individual
# data points and smooth trend lines for each relationship.
set.seed(123)
diam_samp <- diamonds %>% sample_n(5000)
num_vars <- c("carat", "x", "y", "z", "depth", "table")
diam_samp %>%
  pivot_longer(all_of(num_vars)) %>%
  ggplot(aes(value, price)) +
  geom_point(alpha = .3) +
  geom_smooth(se = FALSE, color = "black") +
  scale_y_log10(labels = scales::comma) +
  facet_wrap(~name, scales = "free_x") +
  labs(title = "Price ~ Numerical Variables (Sample 5000)")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The faceted scatter plots reveal that carat has by far the strongest and most nonlinear relationship with diamond price, confirming its dominant role in price determination. The physical dimensions (x, y, z) are also positively correlated with price but largely redundant due to their strong dependence on carat. In contrast, depth and table show very weak or negligible effects, indicating that diamond proportions have limited direct impact on pricing compared to size and weight.
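The redundancy of x, y, and z can be checked with a plain correlation matrix. This is a minimal sketch using Pearson correlations on the full dataset; it was not part of the original analysis and does not remove the few implausible dimension values the data contains.

```r
library(ggplot2)  # provides the diamonds dataset

# Pearson correlations among the numeric variables
num <- diamonds[c("price", "carat", "x", "y", "z", "depth", "table")]
cors <- cor(num)

# How strongly each variable tracks price and carat
round(cors[, c("price", "carat")], 2)
```

High correlations between carat and x, y, z, together with near-zero values for depth, would be consistent with the interpretation above.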
# This boxplot shows how diamond prices (on a log10 scale) vary across
# different cut categories, with cuts ordered by their median price.
ggplot(diamonds, aes(x = reorder(cut, price, median), y = price)) +
  geom_boxplot(outlier.alpha = .15) +
  scale_y_log10(labels = comma) +  # or labels = label_comma()
  labs(title = "Price ~ Cut (boxplot, log10)", x = "cut (ordered by median price)")
The boxplot indicates that diamond prices vary widely within each cut category, and the median price differences between cuts are relatively small. While Premium and Very Good cuts tend to have slightly higher prices, the overall trend suggests that cut quality has a weaker influence on price compared to carat size.
# This loop creates separate boxplots to visualize how diamond prices (on a
# log10 scale) differ across each category of color and clarity, helping
# identify which quality levels are associated with higher or lower prices.
for (cat in c("color", "clarity")) {
  print(
    ggplot(diamonds, aes(x = reorder(.data[[cat]], price, median), y = price)) +
      geom_boxplot(outlier.alpha = .15) +
      scale_y_log10(labels = scales::comma) +
      labs(title = paste("Price ~", cat, "(boxplot, log10 price)"), x = cat)
  )
}
The boxplots for color and clarity reveal that these categorical quality attributes have relatively weak effects on diamond prices compared to carat. While higher clarity or better color theoretically increase value, the total price is largely driven by size. Larger diamonds with slightly lower color or clarity grades often cost more overall, indicating that carat is the dominant factor in determining diamond prices.
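To see the effect of clarity without size getting in the way, one option is to hold carat roughly constant. The sketch below is an illustration, not part of the original analysis; the 1.00 to 1.05 carat band is an arbitrary choice.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Keep only stones of about one carat so that size is nearly fixed.
# The band limits are an illustrative assumption.
one_carat <- diamonds %>%
  filter(carat >= 1, carat <= 1.05)

# Median price per clarity grade within that band
by_clarity <- one_carat %>%
  group_by(clarity) %>%
  summarise(n = n(), median_price = median(price), .groups = "drop")

by_clarity
```

Within a fixed size band, a clearer ordering of prices by clarity grade would suggest that the weak pattern in the boxplots comes from mixing stones of different sizes.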
# This heat map shows the median diamond price for each combination of color
# and clarity, allowing quick visual comparison of how these two quality
# factors interact to influence pricing.
diamonds %>%
  group_by(color, clarity) %>%
  summarise(median_price = median(price), .groups = "drop") %>%
  ggplot(aes(color, clarity, fill = median_price)) +
  geom_tile() +
  scale_fill_continuous(labels = scales::comma) +
  labs(title = "Median Price Heat Map: color × clarity", fill = "Median price")
The heat map of median prices across color and clarity combinations shows no consistent upward trend in price with higher-quality grades. Some lower-quality combinations still have high median prices. Because this plot does not control for carat, it shows that color and clarity on their own are weak predictors of total price, most likely because their effect is mixed with size.
# This faceted scatter plot shows how diamond price (on a log10 scale) relates
# to carat size within each cut category, with smooth trend lines revealing
# how the price–carat relationship varies by cut quality.
ggplot(diam_samp, aes(carat, price)) +
  geom_point(alpha = .2) +
  geom_smooth(se = FALSE, color = "black") +
  scale_y_log10(labels = scales::comma) +
  facet_wrap(~cut) +
  labs(title = "Price ~ Carat by cut")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The plot shows that price increases steeply and nonlinearly with carat across all cut types. While higher-quality cuts (Ideal, Premium) have more stable price patterns, lower-quality cuts (Fair, Good) show greater variation. This confirms that carat is the main driver of price, with cut having only a minor effect.
# This ECDF (Empirical Cumulative Distribution Function) plot was created to
# compare how diamond prices are distributed across different color grades,
# showing the cumulative proportion of diamonds up to each price level.
ggplot(diamonds, aes(price, color = color)) +
  stat_ecdf(geom = "step") +
  scale_x_log10(labels = scales::comma) +
  labs(title = "Price ECDF: color groups")
The ECDF plot compares price distributions across color grades. Lower-quality colors (I–J) appear shifted to the right, meaning they include higher-priced stones overall, likely due to larger carat sizes. This confirms that carat outweighs color in determining total price.
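The suggestion that the lower color grades contain larger stones can be checked directly. This is a small sketch, not part of the original analysis, comparing median carat and median price per color grade.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Median size and median price for each color grade (D is best, J is worst)
carat_by_color <- diamonds %>%
  group_by(color) %>%
  summarise(
    n = n(),
    median_carat = median(carat),
    median_price = median(price),
    .groups = "drop"
  )

carat_by_color
```

If median carat rises from D toward J alongside median price, the rightward shift in the ECDF curves is explained by size rather than by color.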
# This scatter plot visualizes how diamond price (on a log10 scale) increases
# with carat size, colored by cut quality to show how cut influences the
# price–carat relationship.
ggplot(diam_samp, aes(carat, price, color = cut)) +
  geom_point(alpha = .25) +
  scale_y_log10(labels = scales::comma) +
  labs(title = "Price ~ Carat")
The scatter plot shows a clear, steep, nonlinear increase in price with carat size, confirming that carat is the dominant factor affecting diamond price. Cut quality adds only minor variation, as points from different cut categories largely overlap along the same trend.