Title: Week_11_Data_Dive
Output: HTML document
#installing the necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(patchwork)
library(broom)
library(lindia)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
options(scipen = 6)
theme_set(theme_minimal())
#loading the data
data(diamonds)
head(diamonds)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#converting the categorical columns into numeric
convert_to_numeric <- function(data,columns) {
for(col in columns){
data[[col]] <- as.integer(factor(data[[col]]))
}
return(data)
}
diamonds <- convert_to_numeric(diamonds,c("cut","color","clarity"))
diamonds
## # A tibble: 53,940 × 10
## carat cut color clarity depth table price x y z
## <dbl> <int> <int> <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 5 2 2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 4 2 3 59.8 61 326 3.89 3.84 2.31
## 3 0.23 2 2 5 56.9 65 327 4.05 4.07 2.31
## 4 0.29 4 6 4 62.4 58 334 4.2 4.23 2.63
## 5 0.31 2 7 2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 3 7 6 62.8 57 336 3.94 3.96 2.48
## 7 0.24 3 6 7 62.3 57 336 3.95 3.98 2.47
## 8 0.26 3 5 3 61.9 55 337 4.07 4.11 2.53
## 9 0.22 1 2 4 65.1 61 337 3.87 3.78 2.49
## 10 0.23 3 5 5 59.4 61 338 4 4.05 2.39
## # ℹ 53,930 more rows
correlation_specific <- cor(diamonds[c("carat","cut","color","clarity", "price")])
print(correlation_specific)
## carat cut color clarity price
## carat 1.0000000 -0.13496702 0.29143675 -0.35284057 0.92159130
## cut -0.1349670 1.00000000 -0.02051852 0.18917474 -0.05349066
## color 0.2914368 -0.02051852 1.00000000 0.02563128 0.17251093
## clarity -0.3528406 0.18917474 0.02563128 1.00000000 -0.14680007
## price 0.9215913 -0.05349066 0.17251093 -0.14680007 1.00000000
Interpretation:
It seems that carat has strong positive correlation with
price.
Performing Linear model
model <- lm(price~carat+cut+color+clarity,data=diamonds)
summary(model)
##
## Call:
## lm(formula = price ~ carat + cut + color + clarity, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19770.4 -693.4 -169.2 548.8 9721.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4661.173 27.590 -168.94 <2e-16 ***
## carat 8783.772 12.692 692.09 <2e-16 ***
## cut 155.700 4.863 32.01 <2e-16 ***
## color -319.673 3.302 -96.81 <2e-16 ***
## clarity 524.843 3.527 148.80 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1235 on 53935 degrees of freedom
## Multiple R-squared: 0.9042, Adjusted R-squared: 0.9042
## F-statistic: 1.272e+05 on 4 and 53935 DF, p-value: < 2.2e-16
Interpretation:
1.approximately 90.42% of the variability in “price” is
explained by the predictors.
2. For each one-unit increase in “carat,” the estimated increase in
“price” is approximately $8783.77.
3. For each one-unit increase in the categorical variable “cut,” the
estimated increase in “price” is approximately $155.70
4. For each one-unit increase in the categorical variable “color,” the
estimated decrease in “price” is approximately $319.67
5. For each one-unit increase in the categorical variable “clarity,” the
estimated increase in “price” is approximately $524.84
let us diagnose the model
plot(model,which=1)
It looks like funnel shape, we can see that it is not normally
distributed which is in violation of assumption of linear model.
let us perform power transformation and identify the lambda value
pT <- powerTransform(model, family = "bcnPower")
pT$lambda
## [1] 0.4134432
let us perform box-cox transformation on price using this lambda value
diamonds$price_transformed <- (diamonds$price^0.413 - 1) / 0.413
now performing the linear model with transformed price variable
model_transformed <- lm(price_transformed ~ carat + cut + color + clarity, data = diamonds)
summary(model_transformed)
##
## Call:
## lm(formula = price_transformed ~ carat + cut + color + clarity,
## data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -169.792 -3.518 0.094 3.442 42.082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.17568 0.15792 45.44 <2e-16 ***
## carat 62.10489 0.07264 854.92 <2e-16 ***
## cut 0.72267 0.02784 25.96 <2e-16 ***
## color -2.30246 0.01890 -121.82 <2e-16 ***
## clarity 3.01029 0.02019 149.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.069 on 53935 degrees of freedom
## Multiple R-squared: 0.936, Adjusted R-squared: 0.936
## F-statistic: 1.973e+05 on 4 and 53935 DF, p-value: < 2.2e-16
Interpretation:
The coefficient of 62.10489 for carat indicates that for a one-unit
increase in carat, the transformed price_transformed is expected to
increase by 62.10489 units, holding all other variables (cut, color, and
clarity) constant.
.
let us diagnose this model
plot(model_transformed,which=1)
Interpretation:
Constant variance (homoscedasticity): The spread of the
residuals appears to be more consistent across the range of fitted
values, indicating that the assumption of constant variance is better
satisfied. The funnel-shaped pattern observed in the previous plot is no
longer evident.
Linearity: The red dashed line, representing the smoothed residuals, is closer to a horizontal line, suggesting that the linear model with the transformed response variable is capturing the relationship with the explanatory variables more accurately.
Potential outliers: There are a few observations with relatively large positive and negative residuals, such as observations 27131, 27631, and 27416. These could be potential outliers or influential observations that may need further investigation.
plot(model_transformed,which=2)
Interpretation:
Departure from normality: The residuals deviate from the
theoretical quantile (the dashed diagonal line) at both tails of the
distribution. This suggests that the residuals may not be perfectly
normally distributed, as the tails appear to be heavier than expected
under normality.