Title: Week_8_Data_Dive
Output: html_document
#loading the necessary libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(skimr)
#loading the dataset
diamonds
## # A tibble: 53,940 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # ℹ 53,930 more rows
Response variable: price
Explanatory variable: cut
Null hypothesis: The mean price is the same for all
diamond cuts.
Alternative hypothesis: The mean price differs for
at least one diamond cut.
Let us perform an ANOVA test to see whether the data provide evidence against the null hypothesis.
ANOVA
#response variable = price
#explanatory variable = cut
data <- aov(price~cut,data=diamonds)
data
## Call:
## aov(formula = price ~ cut, data = diamonds)
##
## Terms:
## cut Residuals
## Sum of Squares 11041745359 847431390159
## Deg. of Freedom 4 53935
##
## Residual standard error: 3963.847
## Estimated effects may be unbalanced
summary(data)
## Df Sum Sq Mean Sq F value Pr(>F)
## cut 4 1.104e+10 2.760e+09 175.7 <2e-16 ***
## Residuals 53935 8.474e+11 1.571e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation:
The very small p-value (< 2e-16) and the large F-statistic (175.7) provide strong evidence against the null hypothesis, so we reject the claim that the mean price is the same for all diamond cuts.
For a consumer who wants to buy diamonds, this means that average price differs significantly by cut. The test alone does not say which cuts differ or by how much, and with 53,940 observations even modest differences in means can be statistically significant.
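Because significance alone does not measure the size of the effect, a quick sketch of eta-squared (the share of total price variance attributed to cut) can help. The object names `fit`, `tab`, `ss` and `eta_sq` are illustrative and not part of the original analysis.

```r
library(ggplot2)  # provides the diamonds dataset

# Same one-way ANOVA as above, refit under a separate (illustrative) name
fit <- aov(price ~ cut, data = diamonds)

# summary() returns a list whose first element is the ANOVA table
tab <- summary(fit)[[1]]
ss  <- tab[["Sum Sq"]]        # sums of squares: cut first, residuals second

# Eta-squared: between-group sum of squares over total sum of squares
eta_sq <- ss[1] / sum(ss)
eta_sq
```

With the sums of squares printed above (about 1.104e+10 for cut and 8.474e+11 for residuals), this ratio comes out at roughly 0.013, so cut on its own accounts for only a small share of the variation in price.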
Let us do a post hoc analysis to determine which specific cut levels differ significantly in price from each other.
library(agricolae)
model<-aov(price~cut, data=diamonds)
out <- HSD.test(model,"cut", group=TRUE,console=TRUE,
main="price vs cut post hoc")
##
## Study: price vs cut post hoc
##
## HSD Test for price
##
## Mean Square Error: 15712087
##
## cut, means
##
## price std r se Min Max Q25 Q50 Q75
## Fair 4358.758 3560.387 1610 98.78795 337 18574 2050.25 3282.0 5205.50
## Good 3928.864 3681.590 4906 56.59175 327 18788 1145.00 3050.5 5028.00
## Ideal 3457.542 3808.401 21551 27.00121 326 18806 878.00 1810.0 4678.50
## Premium 4584.258 4349.205 13791 33.75352 326 18823 1046.00 3185.0 6296.00
## Very Good 3981.760 3935.862 12082 36.06181 336 18818 912.00 2648.0 5372.75
##
## Alpha: 0.05 ; DF Error: 53935
## Critical Value of Studentized Range: 3.857656
##
## Groups according to probability of means differences and alpha level( 0.05 )
##
## Treatments with the same letter are not significantly different.
##
## price groups
## Premium 4584.258 a
## Fair 4358.758 a
## Very Good 3981.760 b
## Good 3928.864 b
## Ideal 3457.542 c
Interpretation:
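“Premium” and “Fair” cuts have the highest average prices and share group “a”, so their means are not significantly different from each other.
“Very Good” and “Good” share group “b”, while “Ideal” has the lowest average price and sits alone in group “c”.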
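As a cross-check on the letter groupings, base R's `TukeyHSD()` reports each pairwise difference with a confidence interval and an adjusted p-value. This is a sketch of an alternative to `HSD.test()`, not the method used above; the names `fit`, `tukey` and `pairs_tbl` are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

fit <- aov(price ~ cut, data = diamonds)

# Tukey's honest significant differences for every pair of cut levels
tukey <- TukeyHSD(fit, "cut", conf.level = 0.95)

# One row per pair, with columns diff, lwr, upr and "p adj"
pairs_tbl <- as.data.frame(tukey$cut)

# Sort so that the least clearly separated pairs appear last
pairs_tbl[order(pairs_tbl[["p adj"]]), ]
```

Pairs whose interval (`lwr`, `upr`) contains zero correspond to cuts that share a letter in the grouping above.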
Let us visualize the result.
ggplot(diamonds, aes(x = cut, y = price, fill = cut)) +
geom_boxplot() +
labs(title = "Distribution of Price by Cut",
x = "Cut", y = "Price") +
theme_minimal()
LINEAR REGRESSION
Let’s perform a linear regression of diamond price on carat.
model1 <- lm(price ~ carat, diamonds)
model1$coefficients
## (Intercept) carat
## -2256.361 7756.426
summary(model1)
##
## Call:
## lm(formula = price ~ carat, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18585.3 -804.8 -18.9 537.4 12731.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2256.36 13.06 -172.8 <2e-16 ***
## carat 7756.43 14.07 551.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
## F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
Intercept: The intercept term (-2256.36) is the price the fitted line predicts for a diamond of zero carats. It has no practical interpretation: a zero-carat diamond does not exist and a price cannot be negative. The value is an extrapolation beyond the observed data and simply fixes where the line sits.
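One common way to get an interpretable intercept is to center the predictor. The sketch below refits the same model with carat measured as a deviation from its mean; `d`, `carat_c` and `fit_c` are illustrative names that do not appear in the original analysis.

```r
library(ggplot2)  # provides the diamonds dataset

d <- as.data.frame(diamonds)

# Center carat so that 0 means "a diamond of average weight"
d$carat_c <- d$carat - mean(d$carat)

fit_c <- lm(price ~ carat_c, data = d)

# The slope is unchanged; the intercept is now the predicted price
# at the average carat, which equals the mean price in the data
coef(fit_c)
```

Centering changes only the intercept, so the slope and the fit of the model stay the same as in `model1`.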
Carat: The coefficient for carat (7756.43) suggests
that for each additional carat in weight, the price of the diamond is
expected to increase by approximately $7756.43 on average. Carat is the
only predictor in this model, so no other variables are being held
constant. The coefficient is highly significant (p < 0.001), and the
model explains about 85% of the variation in price (R-squared = 0.8493),
indicating a strong positive relationship between carat weight and price.
Larger diamonds are more expensive than smaller ones, as expected in the diamond market.
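To make the slope concrete, the sketch below asks the fitted line for prices at a few carat weights and for a confidence interval on the slope. The carat values in `new_stones` are hypothetical, chosen only for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

model1 <- lm(price ~ carat, data = diamonds)

# Hypothetical stones, not taken from the dataset
new_stones <- data.frame(carat = c(0.5, 1.0, 1.5))

# Point predictions with 95% prediction intervals (columns fit, lwr, upr)
preds <- predict(model1, newdata = new_stones, interval = "prediction")
cbind(new_stones, preds)

# 95% confidence interval for the price increase per carat
confint(model1, "carat")
```

Each step of 0.5 carat moves the fitted price by half the slope, which is what a straight-line model implies by construction.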
Let us visualize the result
# Create a scatter plot with regression line
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point(alpha = 0.5) + # Scatter plot of points
geom_smooth(method = "lm", se = FALSE, color = "blue") + # Regression line
labs(x = "Carat", y = "Price", title = "Diamond Price vs. Carat Weight") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Let us compute the fitted prices from the model and compare them with the observed prices.
predicted <- predict(model1)
# Scatter plot of observed vs. predicted values
plot(diamonds$price, predicted,
xlab = "Observed Price", ylab = "Predicted Price",
main = "Observed vs. Predicted Price")
abline(0, 1, col = "red")
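The plot can be complemented with a few numbers. This sketch computes the root mean squared error and R-squared directly from the fitted values, and counts how many fitted prices are negative, which is a known limitation of a straight line with a negative intercept. The names `resid_vals`, `rmse`, `r_squared` and `n_negative` are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

model1    <- lm(price ~ carat, data = diamonds)
predicted <- predict(model1)          # fitted values for the training data

# Residuals: observed minus fitted price
resid_vals <- diamonds$price - predicted

# Typical size of a prediction error, in dollars
rmse <- sqrt(mean(resid_vals^2))

# Share of price variance explained by the model
r_squared <- 1 - sum(resid_vals^2) /
  sum((diamonds$price - mean(diamonds$price))^2)

# Number of diamonds for which the line predicts a negative price
n_negative <- sum(predicted < 0)

c(rmse = rmse, r_squared = r_squared, n_negative = n_negative)
```

The hand-computed R-squared should agree with the value reported by `summary(model1)`.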
OBSERVATION:
Consider Carat Weight Carefully: For consumers looking to
purchase diamonds, understanding the impact of carat weight on price is
crucial. The fitted slope implies a difference of roughly $776 for every
0.1 carat, so if budget is a concern, choosing a slightly lower carat
weight can reduce the cost noticeably while still yielding a visually
appealing gem.
Factor in Other Characteristics: While carat weight
is a significant determinant of price, it’s essential to consider other
factors like cut, color, and clarity. These characteristics also
influence a diamond’s appearance and value. Finding the right balance
among these factors is key to making an informed purchasing
decision.
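One way to start factoring in other characteristics is to put cut and carat in the same model. The sketch below is an assumption about a reasonable next step, not part of the original analysis; `d` and `model2` are illustrative names, and cut is converted to an unordered factor so that its coefficients read as differences from the "Fair" baseline.

```r
library(ggplot2)  # provides the diamonds dataset

d <- as.data.frame(diamonds)
d$cut <- factor(d$cut, ordered = FALSE)   # treatment contrasts, baseline "Fair"

# Average carat weight within each cut, to see whether cuts differ in size
aggregate(carat ~ cut, data = d, FUN = mean)

# Carat-only model and a model that also includes cut
model1 <- lm(price ~ carat, data = d)
model2 <- lm(price ~ carat + cut, data = d)

# Does adding cut improve the fit beyond carat alone?
anova(model1, model2)
coef(model2)
```

If the cuts differ in average carat weight, that would help explain why the raw mean prices by cut look different from what the quality ranking of the cuts would suggest.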