Title : Week_9_Data_Dive
Output : HTML Document
#loading the necessary libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
#viewing the data
head(diamonds)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Multiple Linear Regression
# Fit a linear regression model with Price as the target variable and three predictor variables
model <- lm(price ~ carat + depth + table, data = diamonds)
# Extract coefficients
coefficients <- coef(model)
coefficients
## (Intercept) carat depth table
## 13003.4405 7858.7705 -151.2363 -104.4728
Interpretation:
It suggests that when the carat, depth, and table values are zero, the
expected price of the diamond is $13,003.44. However, since it’s
unlikely for a diamond to have zero carat, depth, or table values, the
interpretation of the intercept might not have practical relevance.
carat: The coefficient for “carat” is 7858.7705.
This indicates that for every one-unit increase in carat weight, the
expected price of the diamond increases by $7858.77, holding all other
variables constant.
depth: The coefficient for “depth” is -151.2363. This
suggests that for every one-unit increase in depth percentage, the
expected price of the diamond decreases by $151.24 and vice , holding
all other variables constant.
table: The coefficient for “table” is -104.4728. This
implies that for every one-unit increase in table percentage, the
expected price of the diamond decreases by $104.47 and vice versa,
holding all other variables constant.
Multiple linear regression with an interaction term
# Fit a linear regression model with an interaction term between carat and clarity
model2 <- lm(price ~ carat * clarity, data = diamonds)
coefficients <- coef(model2)
# Create scatter plot with regression line
ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
geom_point() + # Scatter plot
geom_smooth(method = "lm", se = FALSE,, formula = y~x) +
labs(title = "Price vs. Carat with Interaction Term", x = "Carat", y = "Price") + theme_minimal() # Use a minimal theme
Key Takeaways:
1. Carat weight has a positive effect on price: Overall, the
price increases as the carat weight goes up. This is evident by the
general upward trend of the black regression line.
2. Clarity grade matters: The price
also depends on the clarity grade. The colored data points show how
price varies within each clarity group.
3. Interaction effect:The key point is
the interaction between carat weight and clarity. The separate
regression lines for different clarity grades (suggested by the color
gradient) show that the price increase per carat is steeper for diamonds
with lower clarity (SI2) compared to higher clarity grades (VVS2, VVS1,
VS2, VS1). In simpler terms, a one-carat increase in weight will raise
the price more for a diamond with lower clarity (SI2) than for a diamond
with higher clarity (like VVS2).
MLR with the binary term:
diamonds$binary_clarity <- ifelse(diamonds$clarity %in% c("IF", "VVS1", "VVS2", "VS1", "VS2"), 1, 0)
model3 <- lm(price ~ carat + cut + binary_clarity, data = diamonds)
summary(model3)
##
## Call:
## lm(formula = price ~ carat + cut + binary_clarity, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18545.8 -656.8 -85.4 466.4 12041.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3677.86 16.87 -218.019 <2e-16 ***
## carat 8233.96 13.20 623.617 <2e-16 ***
## cut.L 993.07 23.91 41.542 <2e-16 ***
## cut.Q -523.10 21.08 -24.810 <2e-16 ***
## cut.C 330.16 18.43 17.917 <2e-16 ***
## cut^4 13.23 14.81 0.893 0.372
## binary_clarity 1326.17 12.65 104.832 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1378 on 53933 degrees of freedom
## Multiple R-squared: 0.8808, Adjusted R-squared: 0.8808
## F-statistic: 6.64e+04 on 6 and 53933 DF, p-value: < 2.2e-16
Interpretation:
Multiple R-squared (0.8808): This indicates that the model explains
88.08% of the variance in price.
Overall, this model suggests that carat weight, cut grade, and clarity
all play a significant role in determining diamond price. The price
increases with carat weight and higher clarity, while cut quality also
affects price with “Ideal” and “Very Good” cuts commanding a
premium.
MLR with continuous variable “depth”
#final model
model4 <- lm(price ~ carat + cut + binary_clarity + depth, data = diamonds)
summary(model4)
##
## Call:
## lm(formula = price ~ carat + cut + binary_clarity + depth, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18498.5 -660.9 -85.2 466.0 12024.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1516.859 275.940 -5.497 3.88e-08 ***
## carat 8234.529 13.196 624.001 < 2e-16 ***
## cut.L 930.729 25.178 36.966 < 2e-16 ***
## cut.Q -483.890 21.657 -22.344 < 2e-16 ***
## cut.C 329.036 18.418 17.865 < 2e-16 ***
## cut^4 22.326 14.850 1.503 0.133
## binary_clarity 1322.792 12.651 104.564 < 2e-16 ***
## depth -34.701 4.423 -7.846 4.37e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1377 on 53932 degrees of freedom
## Multiple R-squared: 0.8809, Adjusted R-squared: 0.8809
## F-statistic: 5.699e+04 on 7 and 53932 DF, p-value: < 2.2e-16
Key Takeaways:
1.Carat, Cut Grade, Clarity, and Depth all
Influence Price: All these terms have significant p-values
(much less than 0.05), indicating they contribute meaningfully to
explaining price variations.
2. Carat Weight Remains the Strongest Effect: The carat
coefficient (8234.5) is the largest, suggesting the strongest positive
effect on price.
3. Cut Grade Matters: Similar to the previous model,
cut grades “Ideal” (cut.L), “Very Good” (cut.C), and “Fair” (cut.Q)
influence price. Positive coefficients for “Ideal” and “Very Good”
indicate a premium, while the negative coefficient for “Fair” suggests
lower prices.
4. Clarity Matters: The positive and significant
coefficient for binary_clarity (1322.7) confirms that higher clarity
diamonds are generally more expensive.
5. Depth Matters: The negative coefficient for depth
(-34.7) indicates a price decrease with increasing depth. This suggests
shallower diamonds (lower depth values) tend to be more valuable.
6. Model Fit Statistics Remain Strong: The R-squared
values (around 0.88) suggest the model explains a high proportion of the
variance in diamond prices. The F-statistic (highly significant p-value)
confirms the model significantly improves price prediction compared to a
model with just the intercept.
observation:The coefficient for the cut^4 term is not significant, a simpler model with just the cut grade categories might be sufficient.
Multicollinearity check
# Create a binary clarity variable
diamonds$binary_clarity <- ifelse(diamonds$clarity %in% c("IF", "VVS1", "VVS2", "VS1", "VS2"), 1, 0)
# Select relevant columns
selected_columns <- c("carat", "binary_clarity", "depth")
# Calculate correlation matrix
correlation_matrix <- cor(diamonds[selected_columns])
print("correlational matrix")
## [1] "correlational matrix"
print(correlation_matrix)
## carat binary_clarity depth
## carat 1.00000000 -0.28614079 0.02822431
## binary_clarity -0.28614079 1.00000000 -0.06000247
## depth 0.02822431 -0.06000247 1.00000000
# Compute variance inflation factors (VIF)
vif_values <- vif(lm(carat ~ binary_clarity + cut + depth, data = diamonds))
print("variance inflation factors")
## [1] "variance inflation factors"
print(vif_values)
## GVIF Df GVIF^(1/(2*Df))
## binary_clarity 1.037399 1 1.018528
## cut 1.179120 4 1.020810
## depth 1.142257 1 1.068764
Correlation Matrix:
carat and binary_clarity: Negative correlation
(-0.286) suggests larger diamonds are less likely to have high clarity
scores. This aligns with findings in previous analysis.
carat and depth: Weak positive correlation (0.028),
indicating a slight tendency for larger diamonds to have greater
depth.
binary_clarity and depth: Weak negative correlation
(-0.060), suggesting higher clarity diamonds might have slightly lower
depth.
Variance Inflation Factors (VIF):
All VIF values are below 2, which is a common threshold for indicating problematic collinearity. VIFs for clarity, cut, and depth are all close to 1, suggesting minimal inflation of variances due to multicollinearity.
Diagnostic Plots
# Residuals vs. Fitted Values Plot
plot(model4, which = 1)
Interpretation:
Non-linear pattern: The plot shows a distinct curved pattern, suggesting that the relationship between the response variable (price) and the predictor variables (carat, cut, binary_clarity, and depth) may not be entirely linear. This violates the linearity assumption of the linear regression model.
Unequal variance (heteroscedasticity): The spread of the residuals appears to increase as the fitted values increase, indicating that the assumption of equal variance (homoscedasticity) of the residuals may be violated.
Outliers: Several points appear to be outliers, particularly at the higher end of the fitted values. These outliers can have a significant influence on the regression coefficients and may need further investigation or potential removal from the analysis.
Influential observations: The plot highlights three observations (16284,25999 and 27416) with very large fitted values and residuals. These observations may be highly influential and could potentially distort the regression model.
# Normal Q-Q Plot
plot(model4, which = 2)
Interpretation:
1. Severe deviation from normality:
The residuals deviate significantly from the straight diagonal
line, particularly in the tails of the distribution. This pattern is a
clear indication of a violation of the normality assumption of the
residuals. 2. Identification of outliers:
The plot highlights two potential outliers (observations 27416 and
16284) as points that deviate substantially from the majority of the
residuals. These outliers can influence the regression coefficients and
may need further investigation or potential removal from the
analysis.
# Scale-Location Plot
plot(model4, which = 3)
Interpretation:
1.Violation of homoscedasticity: The plot exhibits a distinct
funnel shape or megaphone pattern, indicating that the spread of the
residuals increases as the fitted values increase. This pattern suggests
that the assumption of homoscedasticity (constant variance of residuals)
is violated. The residuals appear to have a higher variance for larger
fitted values, which is a clear indication of heteroscedasticity
(non-constant variance).
2. Presence of outliers and influential observations:
The plot highlights several potential outliers, particularly at the
higher end of the fitted values, as points that deviate substantially
from the majority of the residuals.
# Residuals vs. Leverage Plot
plot(model4, which = 5)
Interpretation:
1.Identification of influential observations: The plot
highlights several observations with relatively high leverage values,
indicating that they have a greater influence on the regression
coefficients. 2.Presence of outliers: The plot shows a
concentration of points with large negative residuals at the lower end
of the leverage values. These points represent potential outliers, as
they deviate substantially from the majority of the observations.
3. Cook’s distance: The horizontal line in the plot
represents the Cook’s distance, which is a measure of the combined
influence of an observation’s leverage and residual on the regression
model. Observations with a Cook’s distance greater than a certain
threshold (typically 1 or 4/n, where n is the number of observations)
are considered influential and may need further investigation or
potential removal. While the plot does not explicitly show the Cook’s
distance values, the observations with high leverage and large residuals
are likely to have high Cook’s distance values.
# Cook's Distance Plot
plot(model4, which = 6)
Interpretation:
1. There is a point with a very high Cook’s distance around 0.27416,
suggesting it is an influential outlier. 2. Several other points like
0.27631 and 0.23645 also exhibit relatively high Cook’s distance values,
indicating they may be influential observations.
3. The majority of points are clustered near the origin, with low
leverage and Cook’s distance values, suggesting they are not overly
influential.
4. The dashed line at a Cook’s distance of 6 appears to be a reference
line, potentially indicating a threshold for identifying influential
points.
Observation:
Overall, the diagnostic plots suggests that the linear
regression model may not be appropriate for this dataset, as it violates
the assumptions of linearity and homoscedasticity. The presence of
outliers and influential observations further complicates the
interpretation of the model.
Addressing Model Assumptions:
To address these issues, we should consider transforming the response
variable or the predictor variables to achieve linearity and
homoscedasticity. Additionally, we could investigate the outliers and
influential observations to determine if they should be removed or if
there are legitimate reasons for their presence in the data.
Alternatively, we can explore more complex regression models, such as
polynomial regression or non-linear regression techniques, to better
capture the relationships in the data.