a) Find a data set of your interest with at least five meaningful continuous variables. The sample size should be larger than 1000. Introduce your data set and five selected continuous variables.

Diamond:

  • Price: price in US dollars ($326–$18,823)
  • Carat: weight of the diamond (0.2–5.01)
  • Depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
  • Table: width of top of diamond relative to widest point (43–95)
  • x: length in mm (0–10.74)

b) Create a pair plot of your continuous variables. Also compute the covariance and correlation matrices. Comment on your findings.

Plot

library(GGally)
## Loading required package: ggplot2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ lubridate 1.9.4     ✔ tibble    3.3.1
## ✔ purrr     1.2.1     ✔ tidyr     1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Select continuous variables
diamonds1 <- dplyr::select(diamonds, price, carat, depth, table, x)

# Create pairs plot
ggpairs(diamonds1)

Cov

cov(diamonds1)
##               price        carat        depth        table            x
## price  1.591563e+07 1.742765e+03 -60.85371214 1133.3180641 3958.0214908
## carat  1.742765e+03 2.246867e-01   0.01916653    0.1923645    0.5184841
## depth -6.085371e+01 1.916653e-02   2.05240384   -0.9468399   -0.0406413
## table  1.133318e+03 1.923645e-01  -0.94683994    4.9929481    0.4896429
## x      3.958021e+03 5.184841e-01  -0.04064130    0.4896429    1.2583472

Cor

cor(diamonds1)
##            price      carat       depth      table           x
## price  1.0000000 0.92159130 -0.01064740  0.1271339  0.88443516
## carat  0.9215913 1.00000000  0.02822431  0.1816175  0.97509423
## depth -0.0106474 0.02822431  1.00000000 -0.2957785 -0.02528925
## table  0.1271339 0.18161755 -0.29577852  1.0000000  0.19534428
## x      0.8844352 0.97509423 -0.02528925  0.1953443  1.00000000

Comment

Price and carat show a very strong positive correlation (r ≈ 0.92), indicating that larger diamonds cost significantly more. Similarly, carat and x (length) are highly correlated (r ≈ 0.98), which makes sense as larger carats mean physically larger stones

c) Create a few 2D density plots of joint distributions of some selected variable pairs. Do they look normal to you? Give comments.

Price & carat:

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth()+
  labs(title = "2D Density: Price vs Carat")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Price & x

ggplot(diamonds, aes(x = x, y = price)) +
  geom_point() +
  geom_smooth()+
  labs(title = "2D Density: Price vs x")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Depth & Table

ggplot(diamonds, aes(x = depth, y = table)) +
  geom_point() +
  geom_smooth()+
  labs(title = "2D Density: Depth vs Table")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Comment:

All of them have some meaning. For the first one, we can see the price is vary for diferent carat, but the larger the carat the more expensice it would be and they haev smaller range of price.

For the second on, we can absolutely tell price increase sharply with a longer length, especially from 3mm.

For the last one, we can tell the relationship is kinda inverse relationship with the larger the table the smaller the persentage depth

d) Select a pair of variables, use the Gaussian KDE method to approximate a joint distribution. Create a density plot.

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_density_2d_filled() +
  labs(
    title = "Gaussian KDE of Joint Distribution: Carat vs Price",
    x = "Carat",
    y = "Price"
  )

e) Use the inverse sampling method to create new samples for all the five variables together (study how to do it, you can use generative AI for help). Then create scatterplots of new samples on top of old samples for a few variable pairs to see how good the sampling is

library(tidyverse)
library(MASS)  # for kde2d
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# Step 1: Select your 5 variables
diamonds_subset <- diamonds1

# Step 2: Univariate Inverse Sampling for Each Variable

inverse_sample_univariate <- function(data, n_samples = 1000) {
  # Estimate density
  kde <- density(data, n = 512)
  
  # Create CDF from density
  x_grid <- kde$x
  dx <- diff(x_grid)[1]
  cdf <- cumsum(kde$y) * dx
  cdf <- cdf / max(cdf)
  
  # Sample using inverse transform
  u <- runif(n_samples)
  sampled_data <- approx(cdf, x_grid, u)$y
  
  return(sampled_data)
}

# Generate new samples for each variable independently
set.seed(123)
new_price <- inverse_sample_univariate(diamonds_subset$price, 1000)
new_carat <- inverse_sample_univariate(diamonds_subset$carat, 1000)
new_depth <- inverse_sample_univariate(diamonds_subset$depth, 1000)
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
new_table <- inverse_sample_univariate(diamonds_subset$table, 1000)
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
new_x <- inverse_sample_univariate(diamonds_subset$x, 1000)
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
# Step 3: Create comparison plots

# Price comparison
df_price <- data.frame(
  value = c(diamonds_subset$price, new_price),
  type = rep(c("Original", "Sampled"), c(nrow(diamonds_subset), 1000))
)

ggplot(df_price, aes(x = value, fill = type)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 50) +
  labs(title = "Price: Original vs Sampled", x = "Price") +
  theme_minimal()

# Step 4: Scatterplots comparing variable pairs

# Create data frames for plotting
original_df <- data.frame(
  price = diamonds_subset$price,
  carat = diamonds_subset$carat,
  depth = diamonds_subset$depth,
  table = diamonds_subset$table,
  x = diamonds_subset$x,
  type = "Original"
)

sampled_df <- data.frame(
  price = new_price,
  carat = new_carat,
  depth = new_depth,
  table = new_table,
  x = new_x,
  type = "Sampled"
)

combined_df <- rbind(original_df, sampled_df)

# Scatterplot 1: Price vs Carat
ggplot(combined_df, aes(x = carat, y = price, color = type)) +
  geom_point(alpha = 0.3) +
  labs(title = "Price vs Carat: Original vs Sampled") +
  theme_minimal()

# Scatterplot 2: Carat vs Depth
ggplot(combined_df, aes(x = depth, y = carat, color = type)) +
  geom_point(alpha = 0.3) +
  labs(title = "Carat vs Depth: Original vs Sampled") +
  theme_minimal()

# Scatterplot 3: Price vs X (length)
ggplot(combined_df, aes(x = x, y = price, color = type)) +
  geom_point(alpha = 0.3) +
  labs(title = "Price vs X: Original vs Sampled") +
  theme_minimal()