week_11.knit

Title: Week_11_Data_Dive
Output: HTML document

#installing the necessary libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(patchwork)
library(broom)
library(lindia)
library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

options(scipen = 6)
theme_set(theme_minimal())

#loading the data
data(diamonds)
head(diamonds)

## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

#converting the categorical columns into numeric
convert_to_numeric <- function(data,columns) {
  for(col in columns){
    data[[col]] <- as.integer(factor(data[[col]]))
  }
  return(data)
}

diamonds <- convert_to_numeric(diamonds,c("cut","color","clarity"))
diamonds

## # A tibble: 53,940 × 10
##    carat   cut color clarity depth table price     x     y     z
##    <dbl> <int> <int>   <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23     5     2       2  61.5    55   326  3.95  3.98  2.43
##  2  0.21     4     2       3  59.8    61   326  3.89  3.84  2.31
##  3  0.23     2     2       5  56.9    65   327  4.05  4.07  2.31
##  4  0.29     4     6       4  62.4    58   334  4.2   4.23  2.63
##  5  0.31     2     7       2  63.3    58   335  4.34  4.35  2.75
##  6  0.24     3     7       6  62.8    57   336  3.94  3.96  2.48
##  7  0.24     3     6       7  62.3    57   336  3.95  3.98  2.47
##  8  0.26     3     5       3  61.9    55   337  4.07  4.11  2.53
##  9  0.22     1     2       4  65.1    61   337  3.87  3.78  2.49
## 10  0.23     3     5       5  59.4    61   338  4     4.05  2.39
## # ℹ 53,930 more rows

correlation_specific <- cor(diamonds[c("carat","cut","color","clarity", "price")])
print(correlation_specific)

##              carat         cut       color     clarity       price
## carat    1.0000000 -0.13496702  0.29143675 -0.35284057  0.92159130
## cut     -0.1349670  1.00000000 -0.02051852  0.18917474 -0.05349066
## color    0.2914368 -0.02051852  1.00000000  0.02563128  0.17251093
## clarity -0.3528406  0.18917474  0.02563128  1.00000000 -0.14680007
## price    0.9215913 -0.05349066  0.17251093 -0.14680007  1.00000000

Interpretation:
It seems that carat has strong positive correlation with price.

Performing Linear model

model <- lm(price~carat+cut+color+clarity,data=diamonds)
summary(model)

## 
## Call:
## lm(formula = price ~ carat + cut + color + clarity, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19770.4   -693.4   -169.2    548.8   9721.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4661.173     27.590 -168.94   <2e-16 ***
## carat        8783.772     12.692  692.09   <2e-16 ***
## cut           155.700      4.863   32.01   <2e-16 ***
## color        -319.673      3.302  -96.81   <2e-16 ***
## clarity       524.843      3.527  148.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1235 on 53935 degrees of freedom
## Multiple R-squared:  0.9042, Adjusted R-squared:  0.9042 
## F-statistic: 1.272e+05 on 4 and 53935 DF,  p-value: < 2.2e-16

Interpretation:
1.approximately 90.42% of the variability in “price” is explained by the predictors.
2. For each one-unit increase in “carat,” the estimated increase in “price” is approximately $8783.77.
3. For each one-unit increase in the categorical variable “cut,” the estimated increase in “price” is approximately $155.70
4. For each one-unit increase in the categorical variable “color,” the estimated decrease in “price” is approximately $319.67
5. For each one-unit increase in the categorical variable “clarity,” the estimated increase in “price” is approximately $524.84

let us diagnose the model

plot(model,which=1)

It looks like funnel shape, we can see that it is not normally distributed which is in violation of assumption of linear model.

let us perform power transformation and identify the lambda value

pT <- powerTransform(model, family = "bcnPower")
pT$lambda

## [1] 0.4134432

let us perform box-cox transformation on price using this lambda value

diamonds$price_transformed <- (diamonds$price^0.413 - 1) / 0.413

now performing the linear model with transformed price variable

model_transformed <- lm(price_transformed ~ carat + cut + color + clarity, data = diamonds)
summary(model_transformed)

## 
## Call:
## lm(formula = price_transformed ~ carat + cut + color + clarity, 
##     data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -169.792   -3.518    0.094    3.442   42.082 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.17568    0.15792   45.44   <2e-16 ***
## carat       62.10489    0.07264  854.92   <2e-16 ***
## cut          0.72267    0.02784   25.96   <2e-16 ***
## color       -2.30246    0.01890 -121.82   <2e-16 ***
## clarity      3.01029    0.02019  149.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.069 on 53935 degrees of freedom
## Multiple R-squared:  0.936,  Adjusted R-squared:  0.936 
## F-statistic: 1.973e+05 on 4 and 53935 DF,  p-value: < 2.2e-16

Interpretation:
The coefficient of 62.10489 for carat indicates that for a one-unit increase in carat, the transformed price_transformed is expected to increase by 62.10489 units, holding all other variables (cut, color, and clarity) constant.
.
let us diagnose this model

plot(model_transformed,which=1)

Interpretation:
Constant variance (homoscedasticity): The spread of the residuals appears to be more consistent across the range of fitted values, indicating that the assumption of constant variance is better satisfied. The funnel-shaped pattern observed in the previous plot is no longer evident.

Linearity: The red dashed line, representing the smoothed residuals, is closer to a horizontal line, suggesting that the linear model with the transformed response variable is capturing the relationship with the explanatory variables more accurately.

Potential outliers: There are a few observations with relatively large positive and negative residuals, such as observations 27131, 27631, and 27416. These could be potential outliers or influential observations that may need further investigation.

plot(model_transformed,which=2)

Interpretation:
Departure from normality: The residuals deviate from the theoretical quantile (the dashed diagonal line) at both tails of the distribution. This suggests that the residuals may not be perfectly normally distributed, as the tails appear to be heavier than expected under normality.