
Title: Week_8_Data_Dive
Output: html_document

# Load the necessary libraries
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(skimr)

# Preview the dataset (diamonds ships with ggplot2)
diamonds

## # A tibble: 53,940 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # ℹ 53,930 more rows

Response variable: Price
Explanatory variable: Cut

Null Hypothesis: The mean price is the same for every diamond cut.
Alternative Hypothesis: The mean price differs for at least one cut.
Let us run a one-way ANOVA to see whether the data give evidence against the null hypothesis.
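
Before running the test, it can help to look at the quantities the hypotheses refer to. The following is a minimal sketch, assuming the `diamonds` data from ggplot2; the name `price_by_cut` is only illustrative and not part of the original analysis.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Sample size, mean, and standard deviation of price within each cut
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    n          = n(),
    mean_price = mean(price),
    sd_price   = sd(price)
  )

price_by_cut
```

The ANOVA asks whether the spread among these five group means is larger than the within-group variation would produce by chance.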

ANOVA

#response variable  = price
#explanatory variable  = cut

data <- aov(price~cut,data=diamonds)
data

## Call:
##    aov(formula = price ~ cut, data = diamonds)
## 
## Terms:
##                          cut    Residuals
## Sum of Squares   11041745359 847431390159
## Deg. of Freedom            4        53935
## 
## Residual standard error: 3963.847
## Estimated effects may be unbalanced

summary(data)

##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## cut             4 1.104e+10 2.760e+09   175.7 <2e-16 ***
## Residuals   53935 8.474e+11 1.571e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation:

The very small p-value (< 2e-16) and the large F-statistic (175.7 on 4 and 53,935 degrees of freedom) are strong evidence against the null hypothesis, so we reject the claim that the mean price is the same for all diamond cuts.
For a consumer who wants to buy diamonds, this means that the average price differs by cut.
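
Statistical significance does not by itself say how much of the variation in price is accounted for by cut. One common effect-size measure is eta squared, the between-group sum of squares divided by the total. From the sums of squares printed above this is roughly 1.104e10 / (1.104e10 + 8.474e11), or about 0.013. The sketch below computes it from the fitted model; the names `fit`, `anova_table`, and `eta_sq` are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

fit <- aov(price ~ cut, data = diamonds)

# summary() of an aov fit returns a list whose first element is the ANOVA table
anova_table <- summary(fit)[[1]]
ss <- anova_table[["Sum Sq"]]   # c(SS for cut, SS for residuals)

# Eta squared: share of the total variation in price attributable to cut
eta_sq <- ss[1] / sum(ss)
eta_sq
```

A value this small suggests that, taken alone, cut explains only a small part of the variation in price, even though the differences in means are statistically significant.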

Let us run a post hoc analysis (Tukey's HSD test) to determine which specific cut levels differ significantly in price from each other.

library(agricolae)

# Tukey HSD post hoc test on the one-way ANOVA model
model <- aov(price ~ cut, data = diamonds)
out <- HSD.test(model, "cut", group = TRUE, console = TRUE,
                main = "price vs cut post hoc")

## 
## Study: price vs cut post hoc
## 
## HSD Test for price 
## 
## Mean Square Error:  15712087 
## 
## cut,  means
## 
##              price      std     r       se Min   Max     Q25    Q50     Q75
## Fair      4358.758 3560.387  1610 98.78795 337 18574 2050.25 3282.0 5205.50
## Good      3928.864 3681.590  4906 56.59175 327 18788 1145.00 3050.5 5028.00
## Ideal     3457.542 3808.401 21551 27.00121 326 18806  878.00 1810.0 4678.50
## Premium   4584.258 4349.205 13791 33.75352 326 18823 1046.00 3185.0 6296.00
## Very Good 3981.760 3935.862 12082 36.06181 336 18818  912.00 2648.0 5372.75
## 
## Alpha: 0.05 ; DF Error: 53935 
## Critical Value of Studentized Range: 3.857656 
## 
## Groups according to probability of means differences and alpha level( 0.05 )
## 
## Treatments with the same letter are not significantly different.
## 
##              price groups
## Premium   4584.258      a
## Fair      4358.758      a
## Very Good 3981.760      b
## Good      3928.864      b
## Ideal     3457.542      c

Interpretation:
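“Premium” and “Fair” cuts have the highest mean prices (about $4,584 and $4,359) and share group a, so they do not differ significantly from each other. “Very Good” and “Good” (about $3,982 and $3,929) share group b. “Ideal” has the lowest mean price (about $3,458) and stands alone in group c. This comparison does not adjust for carat weight or any other characteristic.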
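
The letter groups show which means differ, but not by how much. As a cross-check, base R's `TukeyHSD()` reports each pairwise difference with a confidence interval and an adjusted p-value. This is a sketch alongside the agricolae output, not a replacement for it; `fit` and `tukey` are illustrative names.

```r
library(ggplot2)  # provides the diamonds dataset

fit <- aov(price ~ cut, data = diamonds)

# Tukey honest significant differences for all pairs of cut levels
tukey <- TukeyHSD(fit, "cut", conf.level = 0.95)

# Matrix with one row per pair and columns: diff, lwr, upr, p adj
tukey$cut
```

Pairs whose interval (`lwr`, `upr`) contains zero correspond to cuts that share a letter in the grouping table.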

Let us visualize the result.

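ggplot(diamonds, aes(x = cut, y = price, fill = cut)) +
  geom_boxplot() +
  # Boxplots draw medians; mark each group mean with a diamond-shaped point
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3) +
  labs(title = "Price Distribution by Cut (diamond marker = mean)",
       x = "Cut", y = "Price") +
  theme_minimal()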

LINEAR REGRESSION

Let’s fit a linear regression of price on carat.

model1 <- lm(price ~ carat, diamonds)
model1$coefficients

## (Intercept)       carat 
##   -2256.361    7756.426

summary(model1)

## 
## Call:
## lm(formula = price ~ carat, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18585.3   -804.8    -18.9    537.4  12731.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2256.36      13.06  -172.8   <2e-16 ***
## carat        7756.43      14.07   551.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared:  0.8493, Adjusted R-squared:  0.8493 
## F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16

Intercept: The intercept (-2256.36) is the model’s predicted price for a diamond with a carat weight of zero. It has no practical interpretation: no such diamond exists, a negative price is impossible, and zero lies outside the range of carat weights in the data. It is best read as a mathematical anchor for the fitted line.

Carat: The coefficient for carat (7756.43) means that each additional carat of weight is associated with an expected price increase of about $7,756.43. Because carat is the only predictor in this model, the estimate does not hold any other characteristic constant. The coefficient is highly significant (p < 0.001), and the model explains about 85% of the variation in price (R-squared = 0.8493), which indicates a strong positive linear relationship between carat weight and price. Larger diamonds cost more than smaller ones, as one would expect.
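
To make the two coefficients concrete, the fitted line can be evaluated at a few carat weights. With the rounded estimates above, a 1-carat stone is predicted at about -2256.36 + 7756.43 × 1 ≈ 5,500, while a 0.2-carat stone comes out at about -2256.36 + 7756.43 × 0.2 ≈ -705, a negative price. The sketch below does the same with `predict()`; the carat values and the names `fit` and `new_stones` are chosen only for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

fit <- lm(price ~ carat, data = diamonds)

# Hypothetical carat weights chosen for illustration
new_stones <- data.frame(carat = c(0.2, 0.5, 1.0, 1.5))
new_stones$predicted_price <- predict(fit, newdata = new_stones)

new_stones
```

The negative prediction for the smallest stone shows that a straight line is only an approximation, and a poor one at the low end of the carat range.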

Let us visualize the result.

# Create a scatter plot with regression line
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.5) +  # Scatter plot of points
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  # Regression line
  labs(x = "Carat", y = "Price", title = "Diamond Price vs. Carat Weight") +  
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Let us compute the model’s predicted price for each diamond and compare it with the observed price.

predicted <- predict(model1)

# Scatter plot of observed vs. predicted values
plot(diamonds$price, predicted, 
     xlab = "Observed Price", ylab = "Predicted Price", 
     main = "Observed vs. Predicted Price")
abline(0, 1, col = "red")
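
The plot can be complemented with numeric summaries of the prediction error. This is a minimal sketch, assuming the same simple model; `rmse`, `mae`, and `share_negative` are illustrative names, not part of the original analysis.

```r
library(ggplot2)  # provides the diamonds dataset

fit <- lm(price ~ carat, data = diamonds)
predicted <- predict(fit)

# Root mean squared error and mean absolute error, in the same units as price
rmse <- sqrt(mean((diamonds$price - predicted)^2))
mae  <- mean(abs(diamonds$price - predicted))

# Share of fitted values that are negative, which a real price can never be
share_negative <- mean(predicted < 0)

c(rmse = rmse, mae = mae, share_negative = share_negative)
```

The RMSE should be close to the residual standard error of 1549 reported in the model summary, since the two differ only in the divisor used.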

OBSERVATION:
Consider Carat Weight Carefully: Carat weight has a large effect on price. Under the fitted model, every 0.1 carat less corresponds to roughly $776 less in expected price, so buyers on a budget can cut costs substantially by choosing a slightly smaller stone while still getting a visually appealing gem.

Factor in Other Characteristics: Carat weight is a major determinant of price, but cut, color, and clarity also influence a diamond’s appearance and value. The ANOVA above shows that mean price differs by cut, and the regression uses carat alone, so an informed purchase should weigh all of these characteristics together.
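
One way to weigh these characteristics together is a multiple regression. The following is a hedged sketch of that extension, not an analysis performed above; the model names `fit_carat` and `fit_full` are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# Simple model from above, refit here so the block is self-contained
fit_carat <- lm(price ~ carat, data = diamonds)

# Extended model adding the three quality grades.
# cut, color and clarity are ordered factors, so R uses polynomial
# contrasts by default and labels the terms .L, .Q, .C, ...
fit_full <- lm(price ~ carat + cut + color + clarity, data = diamonds)

# Compare explained variance, then run a nested-model F test
c(carat_only = summary(fit_carat)$r.squared,
  full       = summary(fit_full)$r.squared)
anova(fit_carat, fit_full)
```

In the extended model, the carat coefficient is an estimate that holds cut, color, and clarity fixed, which the simple regression cannot provide.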