Title : Week_6_DataDive
Output : html document

Loading the necessary libraries

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
View(diamonds) # open the data in the interactive viewer
head(diamonds) # display the first 6 rows

## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Creating new variables and visualizing them

# Pair 1: Price vs. Carat Weight (with calculated price per carat)
diamonds$price_per_carat <- diamonds$price / diamonds$carat
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.5) +
  labs(x = "Carat Weight", y = "Price", title = "Diamond Price vs Carat Weight")

As carat weight increases, the price of the diamonds tends to increase.
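One way to put a number on this trend is a log-log regression of price on carat. This is a minimal sketch, not part of the original analysis: the names fit and slope are illustrative, and the linear-in-logs model is an assumption about the form of the relationship.

```r
library(ggplot2) # provides the diamonds data set

# Assumed model: log(price) is linear in log(carat)
fit <- lm(log(price) ~ log(carat), data = diamonds)

# The slope is an elasticity: the approximate percent change in price
# associated with a 1% increase in carat weight
slope <- coef(fit)[["log(carat)"]]
slope
```

A slope above 1 would mean that price grows faster than proportionally with carat weight.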

# Pair 2: Cut Quality vs. Price (with calculated difference from the mean price of Ideal-cut diamonds)
diamonds$price_diff_ideal <- diamonds$price - mean(diamonds$price[diamonds$cut == "Ideal"])
ggplot(diamonds, aes(x = cut, y = price_diff_ideal, fill = cut)) +
  geom_boxplot() +
  labs(x = "Cut Quality", y = "Price Difference from Ideal Cut", title = "Price Difference by Cut Quality") +
  scale_fill_brewer(palette = "Set3")

If the median line of a box is above zero, a typical diamond of that cut quality is priced higher than the average Ideal-cut diamond; if it is below zero, it is priced lower.
The median line for every cut quality is below zero, so most diamonds in each group are priced below the average Ideal-cut price. Each group also has many high-priced outliers, which pull the group means upward and can obscure the effect of cut quality on price.
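The reading of the boxplot can be checked numerically. The following is a sketch that recomputes the difference from the mean Ideal-cut price and summarizes it per cut; the names ideal_mean, price_diff, median_diff and mean_diff are illustrative and do not appear in the original analysis.

```r
library(ggplot2) # provides the diamonds data set

# Reference value: mean price of Ideal-cut diamonds
ideal_mean <- mean(diamonds$price[diamonds$cut == "Ideal"])
price_diff <- diamonds$price - ideal_mean

# Median and mean difference within each cut quality
median_diff <- tapply(price_diff, diamonds$cut, median)
mean_diff <- tapply(price_diff, diamonds$cut, mean)
round(rbind(median = median_diff, mean = mean_diff), 1)
```

Comparing the two rows shows how far the high-priced outliers pull each group mean away from its median.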

# Pair 3: Color and Clarity vs. Price (with calculated price per clarity score)
# Map clarity to an integer score from 1 (I1, lowest) to 8 (IF, highest)
diamonds$clarity_score <- as.integer(factor(diamonds$clarity, levels = c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF")))
diamonds$price_color_clarity <- diamonds$price / diamonds$clarity_score
ggplot(diamonds, aes(x = color, y = price_color_clarity, fill = clarity)) +
  geom_boxplot() +
  labs(x = "Color", y = "Price per Clarity Score", title = "Price per Clarity Score by Color and Clarity") +
  scale_fill_discrete(name = "Clarity")

The median line within each boxplot marks the center of the price per clarity score distribution for diamonds of a particular color and clarity. A higher median means that a typical diamond of that color and clarity combination has a higher price per clarity score.
The height of the box and the spread of the whiskers show the variability in price per clarity score within each color and clarity group. A larger spread indicates greater variability among diamonds of the same color and clarity.
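The clarity score depends entirely on the order of the factor levels. As a small illustration on a toy vector (the four values below are hypothetical, not rows from the data), as.integer() returns each value's position in the level order:

```r
# Same level order as used for the clarity score, lowest to highest
clarity_levels <- c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF")

# Hypothetical clarity grades, for illustration only
clarity <- factor(c("SI2", "IF", "I1", "VS1"), levels = clarity_levels, ordered = TRUE)

# as.integer() on a factor gives the position of each value in the level order
clarity_score <- as.integer(clarity)
clarity_score # 2 8 1 5
```

Dividing price by this score treats the grades as equally spaced steps, which is a modeling assumption rather than a property of the grading scale.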

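# Pearson correlation between price and carat weight
cor_price_carat <- cor(diamonds$price, diamonds$carat)
# Cut is converted to its integer level code (1 = Fair, ..., 5 = Ideal)
cor_cut_price <- cor(as.numeric(factor(diamonds$cut)), diamonds$price_diff_ideal)
# Mean price per color grade; its names (D to J) supply the color levels below
avg_price_color <- tapply(diamonds$price, diamonds$color, mean)

# Color is converted to its integer level code (1 = D, ..., 7 = J)
color_numeric <- as.numeric(factor(diamonds$color, levels = names(avg_price_color), labels = 1:length(avg_price_color)))
# Note: this correlates the color code with the clarity score, not with a price variable
cor_color_clarity_price <- cor(color_numeric, diamonds$clarity_score)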



cat("Correlation coefficient for Color (numerical representation) vs. Clarity Score:", cor_color_clarity_price,"\n")

## Correlation coefficient for Color (numerical representation) vs. Clarity Score: 0.02563128

cat("Correlation coefficient for Price vs. Carat Weight:", cor_price_carat, "\n")

## Correlation coefficient for Price vs. Carat Weight: 0.9215913

cat("Correlation coefficient for Cut Quality vs. Price Difference from Ideal Cut:", cor_cut_price, "\n")

## Correlation coefficient for Cut Quality vs. Price Difference from Ideal Cut: -0.05349066

Let's visualize the correlation coefficients above.

cor_matrix <- cor(diamonds[c("price", "carat", "price_diff_ideal", "clarity_score","price_color_clarity")])

# Create correlation matrix plot
library(corrplot)

## corrplot 0.92 loaded

corrplot(cor_matrix, method = "circle", type = "upper", tl.cex = 0.7, tl.col = "black")

The correlation coefficient for price vs. carat is 0.92, indicating a strong positive linear relationship between price and carat weight. The coefficient of 0.026 reported for color is computed between the numeric color code and the clarity score, cor(color_numeric, diamonds$clarity_score), so it shows that color grade and clarity grade are almost uncorrelated; it does not involve price per clarity score. The correlation coefficient for cut quality vs. price difference from Ideal cut is -0.053, which is also close to zero and indicates a negligible linear relationship.
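Cut, color and clarity are ordered categories, so a rank-based coefficient is a common alternative to Pearson's r on integer codes. The sketch below uses Spearman's method as a swapped-in technique; it is not what the analysis above computes, and the variable names are illustrative.

```r
library(ggplot2) # provides the diamonds data set

# Integer level codes of the ordered factors
cut_code <- as.numeric(diamonds$cut)
color_code <- as.numeric(diamonds$color)
clarity_code <- as.numeric(diamonds$clarity)

# Spearman's rho depends only on ranks, so using price here gives the same
# value as using price minus a constant (such as the mean Ideal-cut price)
spearman_cut_price <- cor(cut_code, diamonds$price, method = "spearman")
spearman_color_clarity <- cor(color_code, clarity_code, method = "spearman")

c(cut_vs_price = spearman_cut_price, color_vs_clarity = spearman_color_clarity)
```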

Now we find the confidence intervals for the newly created variables.

price_per_carat_ci <- t.test(diamonds$price_per_carat)$conf.int 
cut_price_ci <- t.test(diamonds$price_diff_ideal)$conf.int 
price_color_clarity_ci <- t.test(diamonds$price_color_clarity)$conf.int

price_per_carat_ci

## [1] 3991.409 4025.380
## attr(,"conf.level")
## [1] 0.95

We are 95% confident that the true average price per carat of diamonds in the population falls between $3,991.41 and $4,025.38 per carat. This interval expresses the uncertainty around our estimate of the average price per carat, given the variability in the data.
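For reference, the interval returned by t.test() is the sample mean plus or minus a t critical value times the standard error. The sketch below reproduces that calculation by hand on a small hypothetical sample (the values in x are made up for illustration and are not taken from the data).

```r
# Hypothetical sample of price-per-carat values, for illustration only
x <- c(3200, 4100, 3900, 4500, 3700, 4300, 3600, 4000)

n <- length(x)
se <- sd(x) / sqrt(n)            # standard error of the mean
t_crit <- qt(0.975, df = n - 1)  # two-sided 95% critical value
manual_ci <- mean(x) + c(-1, 1) * t_crit * se

# The hand-computed interval should agree with t.test()
builtin_ci <- as.numeric(t.test(x)$conf.int)
all.equal(manual_ci, builtin_ci)
```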

cut_price_ci

## [1] 441.5900 508.9255
## attr(,"conf.level")
## [1] 0.95

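We are 95% confident that the true average difference between a diamond's price and the average Ideal-cut price falls between $441.59 and $508.93. The interval lies entirely above zero, so on average diamonds are priced above the average Ideal-cut diamond, even though the median of every cut group is below zero.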

price_color_clarity_ci

## [1] 1222.482 1247.622
## attr(,"conf.level")
## [1] 0.95

We are 95% confident that the true average price per clarity score of diamonds in the population falls between $1,222.48 and $1,247.62.

ci_data <- data.frame(
  variable = c("Price per Carat", "Cut Price Difference", "Price Color Clarity"),
  lower_ci = c(price_per_carat_ci[1], cut_price_ci[1], price_color_clarity_ci[1]),
  upper_ci = c(price_per_carat_ci[2], cut_price_ci[2], price_color_clarity_ci[2])
)


# Plotting confidence intervals
ggplot(ci_data, aes(x = variable, y = (lower_ci + upper_ci) / 2, ymin = lower_ci, ymax = upper_ci)) +
  geom_pointrange() +
  labs(x = "Variable", y = "Confidence Interval") +
  ggtitle("Confidence Intervals for Different Variables")
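The three means sit on very different scales while each interval is only tens of dollars wide, so on a shared axis the ranges are hard to see. One possible variant, sketched here with the interval endpoints copied from the output above, gives each variable its own panel and axis; the names estimate and ci_plot are illustrative.

```r
library(ggplot2)

# Interval endpoints copied from the t.test() output above
ci_data <- data.frame(
  variable = c("Price per Carat", "Cut Price Difference", "Price Color Clarity"),
  lower_ci = c(3991.409, 441.5900, 1222.482),
  upper_ci = c(4025.380, 508.9255, 1247.622)
)
# A t interval is symmetric, so its midpoint is the sample mean
ci_data$estimate <- (ci_data$lower_ci + ci_data$upper_ci) / 2

# One panel per variable, each with its own axis range
ci_plot <- ggplot(ci_data, aes(x = variable, y = estimate, ymin = lower_ci, ymax = upper_ci)) +
  geom_pointrange() +
  facet_wrap(~ variable, scales = "free") +
  labs(x = NULL, y = "Mean with 95% Confidence Interval")
ci_plot
```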