week_4_Data_Dive.knit

Title: Week_4_Data_Dive
Output: html document

Objective
1. Creating 5 samples
2. Comparing the samples
3. Monte Carlo Simulation

Installing necessary libraries

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(viridisLite)
library(viridis)

loading the data

data(diamonds)
head(diamonds)

## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

We are creating 5 samples from our diamonds datase with replacement. Each sample contains roughly 50 percent of the diamonds dataset.

sample_size <- round(nrow(diamonds)*0.5) #rounding to the nearest number
samples <- list()
for (i in 1:5){
  samples[[i]] <- diamonds[sample(nrow(diamonds),size = sample_size,replace = TRUE),]
}
d1<-samples[[1]]
d2<-samples[[2]]
d3<-samples[[3]]
d4<-samples[[4]]
d5<-samples[[5]]

#verifying the dimensions of the samples
print(dim(d1))

## [1] 26970    10

print(dim(d2))

## [1] 26970    10

print(dim(d3))

## [1] 26970    10

print(dim(d4))

## [1] 26970    10

print(dim(d5))

## [1] 26970    10

all our samples contain 50 percent of the data.

Let us compare the prices of the diamonds for the carat of all the 5 samples that we have created. let’s explore using visualization.

options(repr.plot.width =20,repr.plot.height = 16)
mean_price_by_carat <-diamonds|>
  group_by(carat)|>
  summarize(mean_price = mean(price))
mean_price_by_d1carat <-d1|>
  group_by(carat)|>
  summarize(mean_price = mean(price))
mean_price_by_d2carat <-d2|>
  group_by(carat)|>
  summarize(mean_price = mean(price))
mean_price_by_d3carat <- d3|>
  group_by(carat)|>
  summarize(mean_price = mean(price))
mean_price_by_d4carat <-d4|>
  group_by(carat)|>
  summarize(mean_price = mean(price))
mean_price_by_d5carat <-d5|>
  group_by(carat)|>
  summarize(mean_price = mean(price))

#combining the results into new dataframe
diamonds_combined<- bind_rows(
  mutate(mean_price_by_carat,sample='diamonds'),
  mutate(mean_price_by_d1carat,sample='d1'),
  mutate(mean_price_by_d2carat,sample='d2'),
  mutate(mean_price_by_d3carat,sample='d3'),
  mutate(mean_price_by_d4carat,sample='d4'),
  mutate(mean_price_by_d5carat,sample='d5')
)

#plotting the data
ggplot(data=diamonds_combined,aes(x=carat,y=mean_price,fill=sample))+
  geom_bar(stat = 'identity', position='dodge')+
  labs(title = 'comparision between original and samples',fill='sample')+
  facet_wrap(~sample ,scales = 'free_y',ncol = 1)+
  theme_classic()

The prices of the diamonds for the carat of all samples are almost similar to the original diamonds dataframe, with little outliers present in one sample but not present in another.

For example, only sample 3 contains the data of price of the diamond concerning carat having value 5, which is an anomaly.

Now, let’s explore even deeper by comparing the mean prices of diamonds concerning the carat and cut of the diamonds. Carat and Cut are the most important factors that drives the price of the diamonds.

# Create a list of samples
samples_list <- list(d1, d2, d3, d4, d5)

# Calculate mean price by carat and cut for each sample
mean_pp_list <- lapply(samples_list, function(sample) {
  group_data <- sample %>%
    group_by(carat, cut) %>%
    summarize(mean_price = mean(price), .groups = 'keep')
  return(group_data)
})

# Set the color scale using viridis
color_palette <- viridis_pal()(length(levels(diamonds$cut)))

# Plotting for each sample with color for each level of the cut variable
plots_list <- lapply(1:length(mean_pp_list), function(i) {
  ggplot(data = mean_pp_list[[i]], aes(x = carat, y = mean_price, color = cut)) +
    geom_point() +
    facet_wrap(~cut, scales = 'free') +
    scale_color_manual(values = color_palette) + 
    theme_classic() +
    ggtitle(paste("Sample", i))
})

# Display the plots
plots_list

## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

If we can observe the mean prices for the very good cut in sample 1 and sample 2, in sample 2, it doesn’t have values for carat 3 and 4, whereas in sample 1 it has values up to carat 4 but it can be considered an outlier.
In sample 5, the mean price of the diamonds of very good cut and carat is present(basically an outlier), an anomaly that is absent in another sample.

Monte Carlo Simulation

let’s take a price column and specify the number of simulations. we calculate the sample mean for each simulation and visualize it.

set.seed(123)
num_samples <- 1000
sample_means <- numeric(num_samples)
for (i in 1:num_samples) {
  sample_data <- sample(diamonds$price, replace = TRUE)
  
  sample_means[i] <- mean(sample_data)
}

head(sample_means)

## [1] 3940.745 3921.476 3943.889 3922.540 3927.049 3930.295

plot_data <- data.frame(
  sample_mean = sample_means
)
ggplot(plot_data, aes(x = sample_mean)) +
  geom_histogram(binwidth = 1, color = "skyblue") +
 theme_classic()

it follows a normal distribution, we can also check whether it follows normal distribution or not by calculating the average of the sample means.

print(mean(sample_means))

## [1] 3933.536

print(mean(diamonds$price))

## [1] 3932.8

The mean of the sample mean is almost equal to the mean of the price of the diamonds.