Title: Week_9_Data_Dive
Output: HTML Document

Objectives: 1. Multiple Linear Regression, 2. Multicollinearity, 3. Diagnostic Plots

#loading the necessary libraries 
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
#viewing the data 
head(diamonds)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Multiple Linear Regression

# Fit a linear regression model with Price as the target variable and three predictor variables 
model <- lm(price ~ carat + depth + table, data = diamonds)  
# Extract coefficients 
coefficients <- coef(model)  
coefficients 
## (Intercept)       carat       depth       table 
##  13003.4405   7858.7705   -151.2363   -104.4728

Interpretation:
Intercept: When carat, depth, and table are all zero, the expected price of a diamond is $13,003.44. However, since no diamond can have zero carat, depth, or table values, the intercept has no practical interpretation here.

carat: The coefficient for “carat” is 7858.7705. For every one-unit increase in carat weight, the expected price of the diamond increases by $7,858.77, holding all other variables constant.
depth: The coefficient for “depth” is -151.2363. For every one-unit increase in depth percentage, the expected price of the diamond decreases by $151.24, holding all other variables constant.
table: The coefficient for “table” is -104.4728. For every one-unit increase in table percentage, the expected price of the diamond decreases by $104.47, holding all other variables constant.
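To make the “holding all other variables constant” reading concrete, here is a small sketch (the two example diamonds are hypothetical, with depth and table held at typical values):

# Predicted prices for two hypothetical diamonds that differ only in carat
new_diamonds <- data.frame(carat = c(1, 2), depth = 61.5, table = 57)
predict(model, newdata = new_diamonds)
# The difference between the two predictions equals the carat coefficient, ~$7,858.77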

Multiple linear regression with an interaction term

# Fit a linear regression model with an interaction term between carat and clarity 
model2 <- lm(price ~ carat * clarity, data = diamonds) 
coefficients <- coef(model2) 
# Create scatter plot with regression line 
ggplot(diamonds, aes(x = carat, y = price, color = clarity)) + 
  geom_point() +  # Scatter plot   
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +  # One fitted line per clarity grade
  labs(title = "Price vs. Carat with Interaction Term", x = "Carat", y = "Price") +
  theme_minimal()  # Use a minimal theme

Key Takeaways:
1. Carat weight has a positive effect on price: Overall, the price increases as the carat weight goes up. This is evident from the general upward trend of the regression lines.
2. Clarity grade matters: The price also depends on the clarity grade. The colored data points show how price varies within each clarity group.
3. Interaction effect: The key point is the interaction between carat weight and clarity. The separate regression lines for the clarity grades show that the price increase per carat is steeper for diamonds with higher clarity (VVS2, VVS1, IF) than for lower clarity grades (SI2, I1). In simpler terms, a one-carat increase in weight raises the price more for a high-clarity diamond than for a low-clarity one.
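A quick way to verify the per-grade slopes directly (a base-R sketch; output not shown):

# Slope of price on carat within each clarity grade; the slopes grow
# steeper as clarity improves, which is the interaction effect
sapply(split(diamonds, diamonds$clarity),
       function(d) coef(lm(price ~ carat, data = d))["carat"])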

MLR with the binary term:

diamonds$binary_clarity <- ifelse(diamonds$clarity %in% c("IF", "VVS1", "VVS2", "VS1", "VS2"), 1, 0)
model3 <- lm(price ~ carat + cut + binary_clarity, data = diamonds)
summary(model3)
## 
## Call:
## lm(formula = price ~ carat + cut + binary_clarity, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18545.8   -656.8    -85.4    466.4  12041.3 
## 
## Coefficients:
##                Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)    -3677.86      16.87 -218.019   <2e-16 ***
## carat           8233.96      13.20  623.617   <2e-16 ***
## cut.L            993.07      23.91   41.542   <2e-16 ***
## cut.Q           -523.10      21.08  -24.810   <2e-16 ***
## cut.C            330.16      18.43   17.917   <2e-16 ***
## cut^4             13.23      14.81    0.893    0.372    
## binary_clarity  1326.17      12.65  104.832   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1378 on 53933 degrees of freedom
## Multiple R-squared:  0.8808, Adjusted R-squared:  0.8808 
## F-statistic: 6.64e+04 on 6 and 53933 DF,  p-value: < 2.2e-16

Interpretation:
Multiple R-squared (0.8808): This indicates that the model explains 88.08% of the variance in price.
Overall, this model suggests that carat weight, cut grade, and clarity all play a significant role in determining diamond price. Price increases with carat weight and with higher clarity. Because cut is an ordered factor, the cut.L, cut.Q, and cut.C terms are polynomial trend contrasts rather than individual cut levels; the significant positive linear contrast (cut.L) indicates that price tends to rise with better cut grades, holding the other variables constant.
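A quick check shows where the .L/.Q/.C/^4 labels come from (output not shown):

# Ordered factors use contr.poly by default: the columns .L (linear),
# .Q (quadratic), .C (cubic), and ^4 (quartic) match the coefficient names
contrasts(diamonds$cut)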


MLR with continuous variable “depth”

#final model 
model4 <- lm(price ~ carat + cut + binary_clarity + depth, data = diamonds) 
summary(model4)
## 
## Call:
## lm(formula = price ~ carat + cut + binary_clarity + depth, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18498.5   -660.9    -85.2    466.0  12024.7 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1516.859    275.940  -5.497 3.88e-08 ***
## carat           8234.529     13.196 624.001  < 2e-16 ***
## cut.L            930.729     25.178  36.966  < 2e-16 ***
## cut.Q           -483.890     21.657 -22.344  < 2e-16 ***
## cut.C            329.036     18.418  17.865  < 2e-16 ***
## cut^4             22.326     14.850   1.503    0.133    
## binary_clarity  1322.792     12.651 104.564  < 2e-16 ***
## depth            -34.701      4.423  -7.846 4.37e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1377 on 53932 degrees of freedom
## Multiple R-squared:  0.8809, Adjusted R-squared:  0.8809 
## F-statistic: 5.699e+04 on 7 and 53932 DF,  p-value: < 2.2e-16

Key Takeaways:

1. Carat, Cut Grade, Clarity, and Depth All Influence Price: All of these terms have significant p-values (much less than 0.05), indicating they contribute meaningfully to explaining price variation.
2. Carat Weight Remains the Strongest Effect: The carat coefficient (8234.5) is the largest, suggesting the strongest positive effect on price.
3. Cut Grade Matters: As in the previous model, the polynomial contrasts for cut are significant. The positive linear contrast (cut.L) indicates that price rises with better cut grades, while the quadratic (cut.Q) and cubic (cut.C) contrasts capture curvature in that trend; these are trend components, not coefficients for individual grades such as “Ideal” or “Fair”.
4. Clarity Matters: The positive and significant coefficient for binary_clarity (1322.7) confirms that higher clarity diamonds are generally more expensive.
5. Depth Matters: The negative coefficient for depth (-34.7) indicates a price decrease with increasing depth. This suggests shallower diamonds (lower depth values) tend to be more valuable.
6. Model Fit Statistics Remain Strong: The R-squared values (around 0.88) suggest the model explains a high proportion of the variance in diamond prices. The F-statistic (highly significant p-value) confirms the model significantly improves price prediction compared to a model with just the intercept.

Observation: The coefficient for the quartic contrast (cut^4) is not significant, so a simpler treatment of the cut grades might be sufficient.
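Since model3 is nested inside model4, a partial F-test can check whether adding depth significantly improves the fit (a sketch; output not shown):

# Compare the models with and without depth; a small p-value indicates
# depth improves the fit beyond carat, cut, and binary_clarity
anova(model3, model4)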


Multicollinearity check

# Create a binary clarity variable
diamonds$binary_clarity <- ifelse(diamonds$clarity %in% c("IF", "VVS1", "VVS2", "VS1", "VS2"), 1, 0)  
# Select relevant columns
selected_columns <- c("carat", "binary_clarity", "depth") 
# Calculate correlation matrix 
correlation_matrix <- cor(diamonds[selected_columns]) 
print("correlational matrix")
## [1] "correlational matrix"
print(correlation_matrix) 
##                      carat binary_clarity       depth
## carat           1.00000000    -0.28614079  0.02822431
## binary_clarity -0.28614079     1.00000000 -0.06000247
## depth           0.02822431    -0.06000247  1.00000000
# Compute variance inflation factors (VIF) from an auxiliary regression
# with carat as the response, so carat itself is not assessed here
vif_values <- vif(lm(carat ~ binary_clarity + cut + depth, data = diamonds))
print("variance inflation factors")
## [1] "variance inflation factors"
print(vif_values) 
##                    GVIF Df GVIF^(1/(2*Df))
## binary_clarity 1.037399  1        1.018528
## cut            1.179120  4        1.020810
## depth          1.142257  1        1.068764

Correlation Matrix:

carat and binary_clarity: The negative correlation (-0.286) suggests larger diamonds are less likely to have high clarity grades. This aligns with findings from the previous analysis.
carat and depth: Weak positive correlation (0.028), indicating a slight tendency for larger diamonds to have greater depth.
binary_clarity and depth: Weak negative correlation (-0.060), suggesting higher clarity diamonds might have slightly lower depth.

Variance Inflation Factors (VIF):

All GVIF^(1/(2*Df)) values are close to 1, well below the common rule-of-thumb thresholds (around 5 or 10) for problematic collinearity, suggesting minimal inflation of coefficient variances due to multicollinearity among binary_clarity, cut, and depth.
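Note that the VIFs above come from an auxiliary regression with carat as the response. A more conventional check (a sketch; output not shown) applies vif() to the fitted price model directly:

# Generalized VIFs for the predictors of the actual price model;
# car::vif() reports GVIFs here because cut contributes 4 degrees of freedom
vif(model4)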

Diagnostic Plots

# Residuals vs. Fitted Values Plot
plot(model4, which = 1)

Interpretation:

  1. Non-linear pattern: The plot shows a distinct curved pattern, suggesting that the relationship between the response variable (price) and the predictor variables (carat, cut, binary_clarity, and depth) may not be entirely linear. This violates the linearity assumption of the linear regression model.

  2. Unequal variance (heteroscedasticity): The spread of the residuals appears to increase as the fitted values increase, indicating that the assumption of equal variance (homoscedasticity) of the residuals may be violated.

  3. Outliers: Several points appear to be outliers, particularly at the higher end of the fitted values. These outliers can have a significant influence on the regression coefficients and may need further investigation or potential removal from the analysis.

  4. Influential observations: The plot highlights three observations (16284, 25999, and 27416) with very large fitted values and residuals. These observations may be highly influential and could potentially distort the regression model; they are inspected in the sketch below.
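To look at the flagged rows (the labels in the plot are row numbers):

# Inspect the observations flagged in the residuals vs. fitted plot
diamonds[c(16284, 25999, 27416), ]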

# Normal Q-Q Plot 
plot(model4, which = 2) 

Interpretation:
1. Severe deviation from normality: The residuals deviate significantly from the straight diagonal line, particularly in the tails of the distribution. This pattern is a clear indication that the normality assumption for the residuals is violated.
2. Identification of outliers: The plot highlights two potential outliers (observations 27416 and 16284) that deviate substantially from the majority of the residuals. These outliers can influence the regression coefficients and may need further investigation or potential removal from the analysis.

# Scale-Location Plot 
plot(model4, which = 3)

Interpretation:
1. Violation of homoscedasticity: The plot exhibits a distinct funnel (megaphone) shape: the spread of the residuals increases as the fitted values increase. This indicates that the assumption of homoscedasticity (constant variance of residuals) is violated; the residuals have a higher variance at larger fitted values, a clear sign of heteroscedasticity.
2. Presence of outliers and influential observations: The plot highlights several potential outliers, particularly at the higher end of the fitted values, that deviate substantially from the majority of the residuals.
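As a formal complement to the visual check, the car package (already loaded) provides a score test for non-constant variance (a sketch; output not shown):

# Breusch-Pagan-type score test; a tiny p-value corroborates the
# heteroscedasticity visible in the scale-location plot
ncvTest(model4)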

# Residuals vs. Leverage Plot 
plot(model4, which = 5)

Interpretation:
1. Identification of influential observations: The plot highlights several observations with relatively high leverage values, indicating that they have a greater influence on the regression coefficients.
2. Presence of outliers: The plot shows a concentration of points with large negative residuals at the lower end of the leverage values. These points represent potential outliers, as they deviate substantially from the majority of the observations.
3. Cook’s distance: The dashed contour lines in the plot mark levels of Cook’s distance, a measure of the combined influence of an observation’s leverage and residual on the regression model. Observations with a Cook’s distance greater than a common threshold (typically 1, or 4/n, where n is the number of observations) are considered influential and may need further investigation or potential removal. While the plot does not print Cook’s distance values, the observations combining high leverage with large residuals are the likeliest to have high values; a quick count against the 4/n rule is sketched below.
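A quick count against the 4/n rule of thumb (a sketch; output not shown):

# Count observations whose Cook's distance exceeds 4/n
cd <- cooks.distance(model4)
sum(cd > 4 / nrow(diamonds))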

# Cook's Distance Plot
plot(model4, which = 6) 

Interpretation:
1. Observation 27416 has a very high Cook’s distance, suggesting it is an influential outlier.
2. Several other points, such as observations 27631 and 23645, also exhibit relatively high Cook’s distance values, indicating they may be influential.
3. The majority of points are clustered near the origin, with low leverage and Cook’s distance values, suggesting they are not overly influential.
4. The dashed contour lines serve as reference levels of Cook’s distance for flagging influential points.

Observation:
Overall, the diagnostic plots suggest that the linear regression model may not be appropriate for this dataset, as the assumptions of linearity and homoscedasticity are violated. The presence of outliers and influential observations further complicates the interpretation of the model.


Addressing Model Assumptions:
To address these issues, we could transform the response variable or the predictor variables to achieve linearity and homoscedasticity; one such transformation is sketched below. We could also investigate the outliers and influential observations to determine whether they should be removed or whether there are legitimate reasons for their presence in the data. Alternatively, we could explore more flexible models, such as polynomial regression or non-linear regression techniques, to better capture the relationships in the data.
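One common remedy for this dataset (a sketch, assuming a log-log specification is appropriate) is to model log(price) on log(carat) plus the other predictors, which often straightens the relationship and stabilizes the residual variance:

# Refit on the log scale and re-inspect the residuals vs. fitted plot
model_log <- lm(log(price) ~ log(carat) + cut + binary_clarity + depth,
                data = diamonds)
plot(model_log, which = 1)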