Title : Week_6_DataDive
Output : html document
Loading the necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
View(diamonds) #show the data
head(diamonds) #display the first 6 rows
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Creating new variables and visualizing them
# Pair 1: Price vs. Carat Weight (with calculated price per carat)
diamonds$price_per_carat <- diamonds$price / diamonds$carat
# Scatter plot of price against carat; position = 'dodge' is dropped because
# dodging does not apply to points on a continuous axis and only raised a warning
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.5) +
  labs(x = "Carat Weight", y = "Price", title = "Diamond Price vs Carat Weight")
As carat weight increases, the price of a diamond tends to increase.
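The `price_per_carat` column is computed above but does not appear in the scatter plot. A minimal sketch of how it could be summarised with dplyr follows; the half-carat bands and the names `carat_band` and `carat_summary` are assumptions made here for illustration, not part of the original analysis.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Median price and median price per carat within carat bands.
# The band edges (0, 0.5, 1, 1.5, 2, Inf) are an illustrative assumption.
carat_summary <- diamonds %>%
  mutate(
    price_per_carat = price / carat,
    carat_band = cut(carat, breaks = c(0, 0.5, 1, 1.5, 2, Inf))
  ) %>%
  group_by(carat_band) %>%
  summarise(
    n = n(),
    median_price = median(price),
    median_price_per_carat = median(price_per_carat)
  )

carat_summary
```

Comparing `median_price` with `median_price_per_carat` across bands shows whether heavier diamonds cost more only because they are larger or also because each carat is priced higher.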
# Pair 2: Cut Quality vs. Price (with calculated price difference from the mean Ideal-cut price)
diamonds$price_diff_ideal <- diamonds$price - mean(diamonds$price[diamonds$cut == "Ideal"])
ggplot(diamonds, aes(x = cut, y = price_diff_ideal, fill = cut)) +
  geom_boxplot() +
  labs(x = "Cut Quality", y = "Price Difference from Ideal Cut", title = "Price Difference by Cut Quality") +
  scale_fill_brewer(palette = "Set3")
If the median line of a box is above zero, more than half of the
diamonds of that cut quality are priced higher than the mean price of
diamonds with an Ideal cut, whereas if it is below zero, more than half
are priced lower.
The median line for every cut quality is below zero, so most diamonds
of each cut are priced below the mean Ideal-cut price. There are,
however, many high-price outliers in every cut, which pull the mean
price upward and may distort comparisons of price across cut qualities.
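The reading of the boxplot can be checked numerically. The sketch below recomputes the price difference and tabulates its median and mean for each cut; the names `ideal_mean` and `diff_by_cut` are introduced here for illustration only.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Reference value: mean price of Ideal-cut diamonds
ideal_mean <- mean(diamonds$price[diamonds$cut == "Ideal"])

# Median and mean of the price difference for each cut quality
diff_by_cut <- diamonds %>%
  mutate(price_diff_ideal = price - ideal_mean) %>%
  group_by(cut) %>%
  summarise(
    median_diff = median(price_diff_ideal),
    mean_diff = mean(price_diff_ideal)
  )

diff_by_cut
```

A `median_diff` below zero alongside a larger `mean_diff` is the numerical signature of the right-skewed prices and high outliers seen in the boxplot.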
# Pair 3: Color and Clarity vs. Price (with calculated price per clarity score)
# Clarity score runs from 1 (I1, lowest clarity) to 8 (IF, highest clarity)
diamonds$clarity_score <- as.integer(factor(diamonds$clarity, levels = c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF")))
diamonds$price_color_clarity <- diamonds$price / diamonds$clarity_score
ggplot(diamonds, aes(x = color, y = price_color_clarity, fill = clarity)) +
  geom_boxplot() +
  labs(x = "Color", y = "Price per Clarity Score", title = "Price per Clarity Score by Color and Clarity") +
  scale_fill_discrete(name = "Clarity")
The median line within each boxplot indicates the central tendency of
the price per clarity score distribution for diamonds of a particular
color and clarity. A higher median suggests that, on average, diamonds
of that color and clarity combination have a higher price per clarity
score.
The height of the box and the spread of the whiskers show the
variability in price per clarity score within each color and clarity
group. A larger spread indicates greater variability in prices among
diamonds of the same color and clarity.
# Pearson correlation between price and carat
cor_price_carat <- cor(diamonds$price, diamonds$carat)
# Cut is converted to its integer level codes (1 = Fair, ..., 5 = Ideal)
cor_cut_price <- cor(as.numeric(factor(diamonds$cut)), diamonds$price_diff_ideal)
avg_price_color <- tapply(diamonds$price, diamonds$color, mean)
# Color is converted to its integer level codes (1 = D, ..., 7 = J)
color_numeric <- as.numeric(factor(diamonds$color, levels = names(avg_price_color), labels = seq_along(avg_price_color)))
# Note: this correlates the color codes with the clarity score, not with price
cor_color_clarity_price <- cor(color_numeric, diamonds$clarity_score)
cat("Correlation coefficient for Color (numerical representation) vs. Clarity Score:", cor_color_clarity_price, "\n")
## Correlation coefficient for Color (numerical representation) vs. Clarity Score: 0.02563128
cat("Correlation coefficient for Price vs. Carat Weight:", cor_price_carat, "\n")
## Correlation coefficient for Price vs. Carat Weight: 0.9215913
cat("Correlation coefficient for Cut Quality vs. Price Difference from Ideal Cut:", cor_cut_price, "\n")
## Correlation coefficient for Cut Quality vs. Price Difference from Ideal Cut: -0.05349066
Let's visualize the correlations among the numeric variables.
cor_matrix <- cor(diamonds[c("price", "carat", "price_diff_ideal", "clarity_score","price_color_clarity")])
# Create correlation matrix plot
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method = "circle", type = "upper", tl.cex = 0.7, tl.col = "black")
The correlation coefficient for price vs. carat is 0.92, indicating a strong positive linear relationship between price and carat. The correlation coefficient between the numeric color codes and the clarity score is 0.026, so there is essentially no linear relationship between the two. The correlation coefficient for Cut Quality vs. Price Difference from Ideal Cut is -0.053, which is also close to zero, so cut quality has almost no linear relationship with the price difference.
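Cut and clarity are ordinal, so a Pearson correlation on their integer codes treats the levels as equally spaced. One possible cross-check, shown here as a sketch rather than as part of the original analysis, is Spearman's rank correlation, which uses only the ordering of the values. The variable names below are illustrative.

```r
library(ggplot2)  # provides the diamonds data set

# cut and clarity are ordered factors, so as.integer() returns their level codes
cut_code <- as.integer(diamonds$cut)
clarity_code <- as.integer(diamonds$clarity)

# Spearman's rank correlation depends only on the ordering of the values
rho_cut_price <- cor(cut_code, diamonds$price, method = "spearman")
rho_clarity_price <- cor(clarity_code, diamonds$price, method = "spearman")

c(cut_vs_price = rho_cut_price, clarity_vs_price = rho_clarity_price)
```

If the Spearman and Pearson values agree closely, the equal-spacing assumption is not driving the result.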
Now we find 95% confidence intervals for the means of the newly created variables.
price_per_carat_ci <- t.test(diamonds$price_per_carat)$conf.int
cut_price_ci <- t.test(diamonds$price_diff_ideal)$conf.int
price_color_clarity_ci <- t.test(diamonds$price_color_clarity)$conf.int
price_per_carat_ci
## [1] 3991.409 4025.380
## attr(,"conf.level")
## [1] 0.95
We are 95% confident that the true mean price per carat of diamonds in the population lies between $3991.41 and $4025.38. This interval quantifies the uncertainty around our estimate of the mean price per carat, given the variability in the data.
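The interval returned by `t.test()` is the sample mean plus or minus a t critical value times the standard error. A minimal sketch that reproduces it by hand is shown below; the names `x`, `se`, `t_crit`, and `manual_ci` are introduced here for illustration.

```r
library(ggplot2)  # provides the diamonds data set

x <- diamonds$price / diamonds$carat  # price per carat
n <- length(x)

se <- sd(x) / sqrt(n)            # standard error of the mean
t_crit <- qt(0.975, df = n - 1)  # two-sided 95% critical value

# Lower and upper bounds of the 95% confidence interval
manual_ci <- mean(x) + c(-1, 1) * t_crit * se
manual_ci
```

With more than 50,000 observations the standard error is small, which is why the interval is narrow relative to the spread of individual prices.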
cut_price_ci
## [1] 441.5900 508.9255
## attr(,"conf.level")
## [1] 0.95
We are 95% confident that the true mean difference between a diamond's price and the mean price of Ideal-cut diamonds lies between $441.59 and $508.93. Because the whole interval is above zero, diamonds are, on average, priced higher than the mean Ideal-cut price.
price_color_clarity_ci
## [1] 1222.482 1247.622
## attr(,"conf.level")
## [1] 0.95
We are 95% confident that the true mean price per clarity score of diamonds in the population lies between $1222.48 and $1247.62.
ci_data <- data.frame(
  variable = c("Price per Carat", "Cut Price Difference", "Price Color Clarity"),
  lower_ci = c(price_per_carat_ci[1], cut_price_ci[1], price_color_clarity_ci[1]),
  upper_ci = c(price_per_carat_ci[2], cut_price_ci[2], price_color_clarity_ci[2])
)
# Plotting confidence intervals
ggplot(ci_data, aes(x = variable, y = (lower_ci + upper_ci) / 2, ymin = lower_ci, ymax = upper_ci)) +
  geom_pointrange() +
  labs(x = "Variable", y = "Confidence Interval") +
  ggtitle("Confidence Intervals for Different Variables")
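The three intervals sit at very different magnitudes, so on a shared axis each one collapses to almost a single point. One possible alternative, sketched here under the assumption that a separate panel per variable is acceptable, is `facet_wrap()` with free scales. The block recomputes the intervals so that it runs on its own; `values`, `estimate`, and `ci_plot` are illustrative names.

```r
library(ggplot2)  # provides the diamonds data set

# Recompute the three derived variables so this block is self-contained.
# as.integer(clarity) assumes the built-in level order I1 < ... < IF.
ideal_mean <- mean(diamonds$price[diamonds$cut == "Ideal"])
values <- list(
  "Price per Carat" = diamonds$price / diamonds$carat,
  "Cut Price Difference" = diamonds$price - ideal_mean,
  "Price Color Clarity" = diamonds$price / as.integer(diamonds$clarity)
)

# One row per variable: sample mean and 95% confidence bounds
ci_data <- do.call(rbind, lapply(names(values), function(nm) {
  ci <- t.test(values[[nm]])$conf.int
  data.frame(variable = nm, estimate = mean(values[[nm]]),
             lower_ci = ci[1], upper_ci = ci[2])
}))

# One panel per variable, each with its own axis range
ci_plot <- ggplot(ci_data, aes(x = variable, y = estimate,
                               ymin = lower_ci, ymax = upper_ci)) +
  geom_pointrange() +
  facet_wrap(~ variable, scales = "free") +
  labs(x = NULL, y = "Mean with 95% confidence interval")

ci_plot
```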