Week 12 Task:

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Diamonds Dataset

For the purposes of this assignment, I have decided to work with the diamonds dataset that ships with the ggplot2 package. This dataset records information about diamonds, such as their price, carat weight, cut, color, clarity, and physical dimensions.

library(ggplot2)
data(diamonds)
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

I began by loading the dataset. The str() output shows a tibble with 53,940 rows and 10 variables.

Building a Regression Model:

diamonds$carat_squared <- diamonds$carat^2 #quadratic term for carat
diamonds$ideal_cut <- ifelse(diamonds$cut == "Ideal", 1, 0) #dichotomous variable for "ideal" quality 
model <- lm(price ~ carat + carat_squared + ideal_cut + carat:ideal_cut, data = diamonds) #regression model

To build this regression model, I first created a squared carat column to serve as the quadratic term, and decided to look at the interaction between carat and cut quality. I chose cut quality because it largely determines how well a diamond sparkles. A heavy diamond with a poor cut can look dull and may therefore sell for less than its carat weight alone would suggest. In this dataset, there are five levels of cut quality: Fair, Good, Very Good, Premium, and Ideal. I coded the highest quality, Ideal, as 1 and every other level as 0.

Hypotheses:

Null Hypothesis: There is no difference in the mean price between diamonds with an ‘Ideal’ cut and diamonds with non-‘Ideal’ cut qualities (fair, good, very good, premium).

Alternate Hypothesis: There is a difference in the mean price between diamonds with an ‘Ideal’ cut and diamonds with non-‘Ideal’ cut qualities (fair, good, very good, premium).
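
The hypotheses above, as worded, compare raw mean prices without accounting for carat weight. A minimal sketch of how that unadjusted comparison could be checked is a Welch two-sample t-test. The names is_ideal, group_means, and welch are illustrative and not part of the original analysis, and this test is separate from the regression model built above.

```r
library(ggplot2)  # provides the diamonds data

# Illustrative logical flag: TRUE for 'Ideal' cut, FALSE for all other cuts
is_ideal <- diamonds$cut == "Ideal"

# Raw mean price in each group (FALSE = non-Ideal, TRUE = Ideal)
group_means <- tapply(diamonds$price, is_ideal, mean)
print(group_means)

# Welch two-sample t-test of H0: the two groups have equal mean price
welch <- t.test(diamonds$price[is_ideal], diamonds$price[!is_ideal])
print(welch)
```

Because this comparison ignores carat weight, its result can differ in sign from the cut effect estimated by the regression, which compares diamonds of the same weight.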

Analysis

summary(model)
## 
## Call:
## lm(formula = price ~ carat + carat_squared + ideal_cut + carat:ideal_cut, 
##     data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26066.6   -709.0    -39.5    437.8  13085.9 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1835.65      25.67 -71.520  < 2e-16 ***
## carat            6446.93      49.09 131.328  < 2e-16 ***
## carat_squared     542.67      20.82  26.066  < 2e-16 ***
## ideal_cut         -77.19      26.45  -2.918  0.00352 ** 
## carat:ideal_cut   668.00      29.84  22.389  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1519 on 53935 degrees of freedom
## Multiple R-squared:  0.855,  Adjusted R-squared:  0.855 
## F-statistic: 7.953e+04 on 4 and 53935 DF,  p-value: < 2.2e-16

Coefficients:

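The fitted equation is price = -1835.65 + 6446.93*carat + 542.67*carat^2 - 77.19*ideal_cut + 668.00*carat*ideal_cut. Because carat appears in three of these terms, its coefficients cannot be interpreted one at a time with "all other variables constant": carat_squared cannot stay fixed while carat changes.

The intercept, -1835.65, is the predicted price of a non-‘Ideal’ diamond weighing 0 carats. No such diamond exists, so the intercept only anchors the curve. The coefficient for ‘carat’ is 6446.93. This is the slope of price with respect to carat at 0 carats for non-‘Ideal’ diamonds, not a constant increase of $6446.93 per carat. For the quadratic term, carat squared, the coefficient is 542.67. Because it is positive, the curve bends upward: the price rises at an increasing rate as the carat weight increases. For a non-‘Ideal’ diamond the slope at a given weight is 6446.93 + 2(542.67)(carat), which is about $7,532 per carat at 1 carat.

Looking at the ideal cut, or the dichotomous variable, the coefficient is -77.19. This is the predicted price difference between ‘Ideal’ and non-‘Ideal’ diamonds at 0 carats only. For the interaction term, the coefficient is 668.00. This means that the slope of price with respect to carat is $668.00 per carat steeper for ‘Ideal’ diamonds than for all other non-‘Ideal’ cut diamonds. Taken together, the predicted difference at a given weight is -77.19 + 668.00(carat), which is about +$591 at 1 carat and is positive for any diamond heavier than about 0.12 carats. The smallest diamond in the dataset weighs 0.2 carats, so the negative ideal_cut coefficient does not go against the idea that cut quality matters for pricing: at equal carat weight, the model predicts a higher price for ‘Ideal’ diamonds.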
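
Since the carat and cut coefficients only make sense in combination, it can help to evaluate them at a specific carat weight. The sketch below copies the estimates from the summary(model) output; the helper names carat_slope and ideal_gap are illustrative and not part of the original analysis.

```r
# Coefficient estimates copied from the summary(model) output above
b_carat  <- 6446.93
b_carat2 <- 542.67
b_ideal  <- -77.19
b_inter  <- 668.00

# Slope of predicted price with respect to carat
# (derivative of the fitted equation; ideal is 0 or 1)
carat_slope <- function(carat, ideal) {
  b_carat + 2 * b_carat2 * carat + b_inter * ideal
}

# Predicted price gap, 'Ideal' minus non-'Ideal', at the same carat weight
ideal_gap <- function(carat) {
  b_ideal + b_inter * carat
}

carat_slope(1, ideal = 0)  # 6446.93 + 1085.34 = 7532.27 dollars per carat
carat_slope(1, ideal = 1)  # 7532.27 + 668.00  = 8200.27 dollars per carat
ideal_gap(1)               # -77.19 + 668.00   = 590.81 dollars
-b_ideal / b_inter         # carat weight where the gap is zero, about 0.116
```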

F-statistic and p-value:

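The F-statistic is 7.953e+04 on 4 and 53935 DF, with a p-value < 2.2e-16. This overall F-test checks whether all four slope coefficients are zero at once, so the p-value being less than 0.05 means that the model as a whole predicts price better than the mean price alone. It does not by itself test the hypotheses about cut. Those are addressed by the t-tests on ideal_cut (p = 0.00352) and carat:ideal_cut (p < 2e-16), both of which are below 0.05. This is strong evidence for rejecting the null hypothesis in favor of the alternate hypothesis: at the same carat weight, there is a difference in the mean pricing of ‘Ideal’ cut diamonds and non-‘Ideal’ cut diamonds. With 53,940 observations even small effects reach statistical significance, so the size of the difference matters more than the p-value.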
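
One way to test the two cut-related terms jointly is a nested-model F-test with anova(), comparing a model that contains only the carat terms against the model with the cut terms added. This is a sketch under some assumptions: the names d, reduced, full, and cut_test are illustrative, and I(carat^2) is used in place of the carat_squared column, which gives the same fit.

```r
library(ggplot2)  # provides the diamonds data

# Work on a copy so the original data stays untouched
d <- diamonds
d$ideal_cut <- ifelse(d$cut == "Ideal", 1, 0)

# Reduced model: carat terms only
reduced <- lm(price ~ carat + I(carat^2), data = d)

# Full model: same structure as the model in this report
full <- lm(price ~ carat + I(carat^2) + ideal_cut + carat:ideal_cut, data = d)

# F-test of H0: both cut-related coefficients are zero
cut_test <- anova(reduced, full)
print(cut_test)
```

The second row of the table reports the F-statistic and p-value for adding ideal_cut and carat:ideal_cut together.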

Residual standard error:

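The residual standard error is 1519 on 53935 degrees of freedom. This means that the model's predictions typically miss the actual price by about $1,519. That is a large error for the cheaper diamonds in the data, whose prices start at $326, so individual predictions are imprecise even though the overall fit is strong. Part of this error may come from predictors left out of the model, such as color and clarity, and from collapsing the five cut levels into a single ‘Ideal’ indicator.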
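
The residual standard error can be reproduced from the residuals directly, which makes its meaning concrete. The sketch below refits the same model under the illustrative name fit (with I(carat^2) standing in for the carat_squared column) and compares a by-hand calculation with sigma().

```r
library(ggplot2)  # provides the diamonds data

d <- diamonds
d$ideal_cut <- ifelse(d$cut == "Ideal", 1, 0)
fit <- lm(price ~ carat + I(carat^2) + ideal_cut + carat:ideal_cut, data = d)

# Square root of the residual sum of squares over the residual degrees of freedom
rse_by_hand <- sqrt(sum(residuals(fit)^2) / df.residual(fit))

# Same quantity as reported by summary(fit)
rse_builtin <- sigma(fit)
print(c(by_hand = rse_by_hand, builtin = rse_builtin))

# Residual standard error as a fraction of the mean price
print(rse_builtin / mean(d$price))
```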

R-squared value:

The R-squared value is 0.855. This indicates that the model accounts for about 85.5% of the variability in price. This value is high, but it describes the four predictors jointly, not the correlation between any two variables, and it does not show whether the regression assumptions hold.
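
To see how much of the R-squared is attributable to carat alone, the full model can be compared with a carat-only model. This is an illustrative sketch; the names carat_only, full, and r2 are not part of the original analysis, and I(carat^2) stands in for the carat_squared column.

```r
library(ggplot2)  # provides the diamonds data

d <- diamonds
d$ideal_cut <- ifelse(d$cut == "Ideal", 1, 0)

# Simple model with carat as the only predictor
carat_only <- lm(price ~ carat, data = d)

# Full model: same structure as the model in this report
full <- lm(price ~ carat + I(carat^2) + ideal_cut + carat:ideal_cut, data = d)

# R-squared of each model, side by side
r2 <- c(carat_only = summary(carat_only)$r.squared,
        full       = summary(full)$r.squared)
print(round(r2, 3))
```

The gap between the two values is the share of price variability that the quadratic and cut terms add beyond a straight line in carat.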

par(mfrow = c(2, 2))
plot(model)

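The Residuals vs. Fitted plot shows the residuals against the predicted values. For a well-specified model, the points scatter evenly around the horizontal zero line. Here they do not, which suggests that the model misses some structure in the data.

The Normal Q-Q plot compares the distribution of the residuals to a normal distribution. The points follow the reference line through the middle of the distribution but depart from it in the tails. This matches the residual summary above: the quartiles are -709.0 and 437.8, yet the extremes reach -26,066.6 and 13,085.9, which is far wider than a normal distribution would produce. The normality assumption therefore holds only approximately.

The Scale-Location plot checks for homoscedasticity. The points are not spread evenly along the line, which indicates that the residual variance changes with the fitted value (heteroscedasticity).

The Residuals vs. Leverage plot helps to identify outliers and influential observations. Most points sit close together, but the largest residual, -26,066.6, is about 17 residual standard errors from zero, so a few extreme outliers are present and should be examined.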
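
As a numeric companion to the Scale-Location plot, the spread of the residuals can be compared across ranges of the fitted values. The sketch below is one simple way to do that, not a formal test; the names fit, breaks, group, and resid_sd are illustrative, and I(carat^2) stands in for the carat_squared column.

```r
library(ggplot2)  # provides the diamonds data

d <- diamonds
d$ideal_cut <- ifelse(d$cut == "Ideal", 1, 0)
fit <- lm(price ~ carat + I(carat^2) + ideal_cut + carat:ideal_cut, data = d)

# Split the observations into five roughly equal-sized groups by fitted value
breaks <- quantile(fitted(fit), probs = seq(0, 1, by = 0.2))
group  <- cut(fitted(fit), breaks = breaks, include.lowest = TRUE,
              labels = paste0("Q", 1:5))

# Standard deviation of the residuals within each group
resid_sd <- tapply(residuals(fit), group, sd)
print(round(resid_sd))
```

Under homoscedasticity the five values would be similar. A clear trend from Q1 to Q5 would support the reading that the variance depends on the predicted price.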

Conclusion

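Overall, based on the analysis above and the plots, I would conclude that carat weight and cut quality together account for a large share (about 85.5%) of the variation in diamond prices, and that an ‘Ideal’ cut raises the price gained per additional carat. Cut quality on its own does not predict price; its effect depends on carat weight. Regarding the residual analysis, I would conclude that the linear model is only partly appropriate: it captures the overall trend, but the uneven residual spread and the heavy-tailed residuals show that the constant-variance and normality assumptions are not fully met, so the standard errors and p-values should be read with caution.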