Data choice - diamonds

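library(tidyverse)  # ggplot2 (with the diamonds data), dplyr, tibble
library(GGally)     # ggpairs()
library(MASS)       # kde2d()
library(viridis)    # viridis()

head(diamonds)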
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Pair Plot

Prepare the data

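# Keep only carat and the numeric size-related columns.
# The dplyr:: prefix avoids any clash with MASS::select().
diamonds_1 <- as_tibble(diamonds) %>%
  dplyr::select(carat, x, y, z, depth, table)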

Create the pair plot

ggpairs(diamonds_1, progress = FALSE)

Covariance & Correlation

cov(diamonds_1)
##            carat          x           y          z       depth      table
## carat 0.22468666  0.5184841  0.51524782 0.31891684  0.01916653  0.1923645
## x     0.51848413  1.2583472  1.24878933 0.76848748 -0.04064130  0.4896429
## y     0.51524782  1.2487893  1.30447161 0.76731958 -0.04800857  0.4689723
## z     0.31891684  0.7684875  0.76731958 0.49801086  0.09596797  0.2379960
## depth 0.01916653 -0.0406413 -0.04800857 0.09596797  2.05240384 -0.9468399
## table 0.19236452  0.4896429  0.46897228 0.23799604 -0.94683994  4.9929481
cor(diamonds_1)
##            carat           x           y          z       depth      table
## carat 1.00000000  0.97509423  0.95172220 0.95338738  0.02822431  0.1816175
## x     0.97509423  1.00000000  0.97470148 0.97077180 -0.02528925  0.1953443
## y     0.95172220  0.97470148  1.00000000 0.95200572 -0.02934067  0.1837601
## z     0.95338738  0.97077180  0.95200572 1.00000000  0.09492388  0.1509287
## depth 0.02822431 -0.02528925 -0.02934067 0.09492388  1.00000000 -0.2957785
## table 0.18161755  0.19534428  0.18376015 0.15092869 -0.29577852  1.0000000

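Comment: The weight of a diamond (carat) has a strong relationship with its size: the correlations between carat and x, y and z are all above 0.95. However, the depth percentage (depth) has very little relationship with the length (x) and the width (y); those correlations are close to zero (about -0.03). Note that depth is not the same variable as z: z is the depth in mm, and it is strongly correlated with x and y. Judging by the variances on the diagonal of the covariance matrix, depth (2.05) and table (4.99) are more spread out than length (1.26), width (1.30) and z (0.50). Since depth and table are percentages while x, y and z are in mm, this comparison is only rough.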
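
The link between the two matrices above can be checked by hand: a correlation is a covariance divided by the product of the two standard deviations. A minimal sketch using the values printed above for carat and x (the object names are only for illustration):

```r
# Values copied from the cov() output above
var_carat   <- 0.22468666
var_x       <- 1.2583472
cov_carat_x <- 0.51848413

# Correlation = covariance / (sd of carat * sd of x)
r_manual <- cov_carat_x / sqrt(var_carat * var_x)
round(r_manual, 4)

# The same conversion for a whole matrix, using stats::cov2cor()
S <- matrix(c(var_carat,   cov_carat_x,
              cov_carat_x, var_x),
            nrow = 2,
            dimnames = list(c("carat", "x"), c("carat", "x")))
cov2cor(S)
```

The result should agree with the carat/x entry of cor(diamonds_1), up to the rounding of the printed values.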

2D Density Plots

ggplot(diamonds, aes(x = depth, y = table)) +
  geom_density2d_filled() +
  labs(title = "2D Density Plot - depth Vs table")

ggplot(diamonds, aes(x = carat, y = depth)) +
  geom_density2d_filled() +
  labs(title = "2D Density Plot - carat Vs depth")

ggplot(diamonds, aes(x = carat, y = x)) +
  geom_density2d_filled() +
  labs(title = "2D Density Plot - carat Vs x")

ggplot(diamonds, aes(x = z, y = depth)) +
  geom_density2d_filled() +
  labs(title = "2D Density Plot - z Vs depth")

Comment: Most of these density plots do not have the single "eye" shape (an ellipse tilted from bottom left to top right) that a bivariate normal distribution with positive correlation would produce, so they do not look normal to me. Each of them has a rather small high-density center (the highlighted region). Diamonds’ weight (carat) against length (x) looks closer to normal than the others: it has a bullseye shape, but its center is split into separate peaks.
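
For comparison, here is a sketch of what the estimated density of a truly bivariate normal sample looks like. The simulated data, the correlation of 0.8 and the object names are assumptions chosen for illustration; they are not taken from the diamonds data.

```r
library(MASS)  # mvrnorm() and kde2d()

set.seed(1)

# Simulate 5000 points from a bivariate normal with correlation 0.8
Sigma <- matrix(c(1.0, 0.8,
                  0.8, 1.0), nrow = 2)
sim <- mvrnorm(n = 5000, mu = c(0, 0), Sigma = Sigma)

# Estimate the 2D density on a 100 x 100 grid and draw its contours
dens <- kde2d(sim[, 1], sim[, 2], n = 100)
contour(dens, xlab = "variable 1", ylab = "variable 2")

# Location of the highest density value on the grid
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)
peak_xy <- c(dens$x[peak[1, 1]], dens$y[peak[1, 2]])
peak_xy
```

The contours should form nested ellipses around a single peak near the mean, tilted from bottom left to top right. Several separate peaks or non-elliptical contours, as in the plots above, point away from normality.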

Kernel Density Estimation (KDE)

kde <- kde2d(diamonds_1$carat, diamonds_1$table, n = 1000)
image(kde, col = viridis(50))
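
A short sketch of what kde2d() returns may help when reading the image above. It uses simulated data and made-up object names (assumptions for illustration, not the diamonds data) and a small grid. The grid has n x n cells, so n = 1000 means one million density values, and a smaller n is usually enough for a quick look.

```r
library(MASS)  # kde2d()

set.seed(2)
u <- rnorm(2000)
v <- rnorm(2000)

# 50 x 50 grid; the limits are wide enough to hold almost all of the mass
toy_kde <- kde2d(u, v, n = 50, lims = c(-5, 5, -5, 5))

# A list with the grid coordinates (x, y) and the density matrix (z)
str(toy_kde)

# A density should integrate to about 1: approximate the integral
# by summing z over the grid and multiplying by the cell area
dx <- diff(toy_kde$x)[1]
dy <- diff(toy_kde$y)[1]
total_mass <- sum(toy_kde$z) * dx * dy
total_mass
```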

Inverse sampling & Scatterplots

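# Work with a random subset of 2000 diamonds
old_data <- diamonds_1 %>% slice_sample(n = 2000)

# Inverse transform sampling: draw uniform probabilities and map them
# through the empirical quantile function of a single column
inverse_sample <- function(data, n_new) {
  u <- runif(n_new)
  new_value <- quantile(data, probs = u, type = 8, names = FALSE)
  return(new_value)
}

# Each column is sampled independently of the others
new_data <- as.data.frame(lapply(old_data, inverse_sample, n_new = 2000))

old_data$Source <- "Original Data"
new_data$Source <- "New Samples"

combined_data <- bind_rows(old_data, new_data)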
ggplot(combined_data, aes(x = carat, y = x, color = Source)) +
  geom_point(alpha = 0.4) +
  labs(title = "Carat Vs X")

ggplot(combined_data, aes(x = carat, y = table, color = Source)) +
  geom_point(alpha = 0.4) +
  labs(title = "Carat Vs table")

Comment: In the first plot, although the inverse sampling correctly reproduces the marginal distribution of each variable, it destroys the correlation between carat and the length (x) of the diamonds, because every column is sampled independently of the others. The sampling does a better job in the second plot, probably because the correlation between carat and table is weak to begin with (about 0.18), so losing it changes little.
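
The effect can be reproduced on a small simulated example. The sketch below uses made-up data (toy, with columns a and b) rather than the diamonds, and compares column-wise inverse sampling with resampling whole rows. Row resampling is a different technique (a simple bootstrap), shown here only as a contrast because it keeps each pair of values together.

```r
set.seed(123)
n <- 2000

# Two strongly correlated toy variables (hypothetical data)
a <- rexp(n)
toy <- data.frame(a = a, b = a + rnorm(n, sd = 0.2))

# Same idea as the function used above: uniform draws pushed
# through the empirical quantile function of one column
inverse_sample <- function(data, n_new) {
  quantile(data, probs = runif(n_new), type = 8, names = FALSE)
}

# Column-wise inverse sampling: each column gets its own uniform draws
independent <- as.data.frame(lapply(toy, inverse_sample, n_new = n))

# Row-wise resampling: whole rows are drawn with replacement
rows <- toy[sample(nrow(toy), n, replace = TRUE), ]

cor(toy$a, toy$b)                  # strong correlation in the original
cor(independent$a, independent$b)  # close to zero
cor(rows$a, rows$b)                # still strong
```

The marginal quantiles of independent$a stay close to those of toy$a, which is why the new samples cover the same range on each axis while the scatterplot loses its diagonal shape.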