Title: Week_4_Data_Dive
Output: html document
Objective
1. Creating 5 samples
2. Comparing the samples
3. Monte Carlo Simulation
Installing necessary libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(viridisLite)
library(viridis)
loading the data
data(diamonds)
head(diamonds)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
sample_size <- round(nrow(diamonds)*0.5) #rounding to the nearest number
samples <- list()
for (i in 1:5){
samples[[i]] <- diamonds[sample(nrow(diamonds),size = sample_size,replace = TRUE),]
}
d1<-samples[[1]]
d2<-samples[[2]]
d3<-samples[[3]]
d4<-samples[[4]]
d5<-samples[[5]]
#verifying the dimensions of the samples
print(dim(d1))
## [1] 26970 10
print(dim(d2))
## [1] 26970 10
print(dim(d3))
## [1] 26970 10
print(dim(d4))
## [1] 26970 10
print(dim(d5))
## [1] 26970 10
all our samples contain 50 percent of the data.
options(repr.plot.width =20,repr.plot.height = 16)
mean_price_by_carat <-diamonds|>
group_by(carat)|>
summarize(mean_price = mean(price))
mean_price_by_d1carat <-d1|>
group_by(carat)|>
summarize(mean_price = mean(price))
mean_price_by_d2carat <-d2|>
group_by(carat)|>
summarize(mean_price = mean(price))
mean_price_by_d3carat <- d3|>
group_by(carat)|>
summarize(mean_price = mean(price))
mean_price_by_d4carat <-d4|>
group_by(carat)|>
summarize(mean_price = mean(price))
mean_price_by_d5carat <-d5|>
group_by(carat)|>
summarize(mean_price = mean(price))
#combining the results into new dataframe
diamonds_combined<- bind_rows(
mutate(mean_price_by_carat,sample='diamonds'),
mutate(mean_price_by_d1carat,sample='d1'),
mutate(mean_price_by_d2carat,sample='d2'),
mutate(mean_price_by_d3carat,sample='d3'),
mutate(mean_price_by_d4carat,sample='d4'),
mutate(mean_price_by_d5carat,sample='d5')
)
#plotting the data
ggplot(data=diamonds_combined,aes(x=carat,y=mean_price,fill=sample))+
geom_bar(stat = 'identity', position='dodge')+
labs(title = 'comparision between original and samples',fill='sample')+
facet_wrap(~sample ,scales = 'free_y',ncol = 1)+
theme_classic()
The prices of the diamonds for the carat of all samples are almost similar to the original diamonds dataframe, with little outliers present in one sample but not present in another.
For example, only sample 3 contains the data of price of the diamond concerning carat having value 5, which is an anomaly.
Now, let’s explore even deeper by comparing the mean prices of diamonds concerning the carat and cut of the diamonds. Carat and Cut are the most important factors that drives the price of the diamonds.
# Create a list of samples
samples_list <- list(d1, d2, d3, d4, d5)
# Calculate mean price by carat and cut for each sample
mean_pp_list <- lapply(samples_list, function(sample) {
group_data <- sample %>%
group_by(carat, cut) %>%
summarize(mean_price = mean(price), .groups = 'keep')
return(group_data)
})
# Set the color scale using viridis
color_palette <- viridis_pal()(length(levels(diamonds$cut)))
# Plotting for each sample with color for each level of the cut variable
plots_list <- lapply(1:length(mean_pp_list), function(i) {
ggplot(data = mean_pp_list[[i]], aes(x = carat, y = mean_price, color = cut)) +
geom_point() +
facet_wrap(~cut, scales = 'free') +
scale_color_manual(values = color_palette) +
theme_classic() +
ggtitle(paste("Sample", i))
})
# Display the plots
plots_list
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
If we can observe the mean prices for the very good cut in sample 1
and sample 2, in sample 2, it doesn’t have values for carat 3 and 4,
whereas in sample 1 it has values up to carat 4 but it can be considered
an outlier.
In sample 5, the mean price of the diamonds of very good cut and carat
is present(basically an outlier), an anomaly that is absent in another
sample.
let’s take a price column and specify the number of simulations. we calculate the sample mean for each simulation and visualize it.
set.seed(123)
num_samples <- 1000
sample_means <- numeric(num_samples)
for (i in 1:num_samples) {
sample_data <- sample(diamonds$price, replace = TRUE)
sample_means[i] <- mean(sample_data)
}
head(sample_means)
## [1] 3940.745 3921.476 3943.889 3922.540 3927.049 3930.295
plot_data <- data.frame(
sample_mean = sample_means
)
ggplot(plot_data, aes(x = sample_mean)) +
geom_histogram(binwidth = 1, color = "skyblue") +
theme_classic()
it follows a normal distribution, we can also check whether it
follows normal distribution or not by calculating the average of the
sample means.
print(mean(sample_means))
## [1] 3933.536
print(mean(diamonds$price))
## [1] 3932.8
The mean of the sample mean is almost equal to the mean of the price
of the diamonds.