Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
For the purposes of this assignment, I have decided to work with the diamonds dataset that ships with the ggplot2 package. This dataset records information about diamonds, such as price, carat, cut, color, clarity, and several physical measurements.
library(ggplot2)
data(diamonds)
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
I began by loading the dataset. The str() output shows a tibble with 53,940 rows and the 10 variables included in the dataset.
diamonds$carat_squared <- diamonds$carat^2 #quadratic term for carat
diamonds$ideal_cut <- ifelse(diamonds$cut == "Ideal", 1, 0) #dichotomous variable for "ideal" quality
model <- lm(price ~ carat + carat_squared + ideal_cut + carat:ideal_cut, data = diamonds) #regression model
To build this regression model, I first created a squared carat variable to serve as the quadratic term, then a 0/1 indicator for cut quality, and finally an interaction between carat and that indicator. I chose cut quality because it largely determines how well a diamond sparkles. A heavy diamond with a poor cut can look dull and sell for less than its carat weight alone would suggest. In this dataset, there are five levels of cut quality: Fair, Good, Very Good, Premium, and Ideal. I coded the highest quality, Ideal, as 1 and every other level as 0.
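As a side note, the same model can be written without storing the squared column, by wrapping the square in I() inside the formula. The sketch below is an alternative spelling, not the code used for the results in this report; the names model_alt and new_stones are made up for illustration. One practical advantage is that predict() then only needs carat and ideal_cut in the new data.

```r
library(ggplot2)  # provides the diamonds data
data(diamonds)

# Same 0/1 indicator as in the report: 1 for Ideal, 0 for every other cut
diamonds$ideal_cut <- ifelse(diamonds$cut == "Ideal", 1, 0)

# I(carat^2) tells the formula to square carat arithmetically
model_alt <- lm(price ~ carat + I(carat^2) + ideal_cut + carat:ideal_cut,
                data = diamonds)

# Hypothetical stones: a 0.5-carat Ideal and a 1-carat non-Ideal diamond
new_stones <- data.frame(carat = c(0.5, 1.0), ideal_cut = c(1, 0))
predicted_prices <- predict(model_alt, newdata = new_stones)
predicted_prices
```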
summary(model)
##
## Call:
## lm(formula = price ~ carat + carat_squared + ideal_cut + carat:ideal_cut,
## data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26066.6 -709.0 -39.5 437.8 13085.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1835.65 25.67 -71.520 < 2e-16 ***
## carat 6446.93 49.09 131.328 < 2e-16 ***
## carat_squared 542.67 20.82 26.066 < 2e-16 ***
## ideal_cut -77.19 26.45 -2.918 0.00352 **
## carat:ideal_cut 668.00 29.84 22.389 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1519 on 53935 degrees of freedom
## Multiple R-squared: 0.855, Adjusted R-squared: 0.855
## F-statistic: 7.953e+04 on 4 and 53935 DF, p-value: < 2.2e-16
The coefficient for carat is 6446.93. Because the model also contains carat squared and a carat interaction, this is not the price change for every one-carat increase. It is the slope of price with respect to carat for a non-Ideal diamond evaluated at zero carats; at any other weight the slope is 6446.93 + 2(542.67)(carat), plus 668.00 if the cut is Ideal. For the quadratic term, carat squared, the coefficient is 542.67. Because it is positive, the curve bends upward: price rises faster, not slower, as carat weight increases. For the dichotomous variable, ideal_cut, the coefficient is -77.19. With the interaction in the model, this is the expected price difference between an Ideal and a non-Ideal diamond at zero carats, which lies outside the data (the lightest diamonds weigh about 0.2 carat). At first glance the negative sign seems to go against my earlier point that cut quality is a major factor in pricing, but it has to be read together with the interaction term. The interaction coefficient is 668.00, meaning each additional carat adds $668.00 more to the price of an Ideal diamond than to a non-Ideal one. The expected Ideal premium at a given weight is therefore -77.19 + 668.00(carat), which is positive for any diamond heavier than about 0.12 carat and so for every diamond in the dataset. All five coefficients are statistically significant at the 0.01 level.
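To make the coefficient arithmetic concrete, here is a small sketch that plugs the estimates from the summary output into the slope and Ideal-premium formulas. The helper names carat_slope and ideal_gap are made up for illustration, and the numbers are copied by hand from the output above rather than pulled from the fitted model.

```r
# Coefficient estimates copied from the summary(model) output
b_carat  <- 6446.93
b_carat2 <- 542.67
b_ideal  <- -77.19
b_inter  <- 668.00

# Slope of predicted price with respect to carat at a given weight
carat_slope <- function(carat, ideal) {
  b_carat + 2 * b_carat2 * carat + b_inter * ideal
}

# Predicted price difference (Ideal minus non-Ideal) at the same weight
ideal_gap <- function(carat) {
  b_ideal + b_inter * carat
}

carat_slope(1, ideal = 0)   # 7532.27 dollars per carat at 1 carat, non-Ideal
carat_slope(1, ideal = 1)   # 8200.27 dollars per carat at 1 carat, Ideal
ideal_gap(c(0.2, 1, 2))     # 56.41, 590.81, 1258.81

# Weight at which the predicted Ideal premium is exactly zero (about 0.116)
break_even <- -b_ideal / b_inter
break_even
```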
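The F-statistic is 7.953e+04 on 4 and 53935 DF, p-value: < 2.2e-16. This overall F-test compares the fitted model with an intercept-only model. Its null hypothesis is that all four slope coefficients are zero, and the alternative is that at least one of them is not. The p-value is far below 0.05, so I reject the null hypothesis: taken together, the predictors explain a statistically significant share of the variation in price. The test by itself does not say whether Ideal and non-Ideal diamonds differ in price; that question is addressed by the ideal_cut and carat:ideal_cut coefficients.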
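If the question is specifically whether the two Ideal-cut terms improve the model, one option is a partial F-test that compares the full model with a reduced model lacking both terms. This is a sketch of that comparison, not part of the original analysis; the object names full, reduced, and cut_test are made up for illustration.

```r
library(ggplot2)  # provides the diamonds data
data(diamonds)

diamonds$ideal_cut <- ifelse(diamonds$cut == "Ideal", 1, 0)

# Full model (as in the report) and a reduced model without the Ideal terms
full    <- lm(price ~ carat + I(carat^2) + ideal_cut + carat:ideal_cut,
              data = diamonds)
reduced <- lm(price ~ carat + I(carat^2), data = diamonds)

# Partial F-test: the null hypothesis is that both Ideal coefficients are zero
cut_test <- anova(reduced, full)
cut_test
```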
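The residual standard error is 1519 on 53935 degrees of freedom. This means the model's predictions typically miss the observed price by about $1,519. Whether that is large depends on the price level: it is a modest error for an expensive stone but a very large one for the cheapest diamonds in the data (the first rows shown above are priced around $330). The residual summary also shows a few very large misses, with a minimum residual of -26,066.6 and a maximum of 13,085.9.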
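The multiple R-squared is 0.855, so the model accounts for about 85.5% of the variability in price. The adjusted R-squared is the same to three decimals, which is expected with only four predictors and 53,940 observations. A high R-squared describes how well the predictors together explain price; it is not a correlation between two variables. Judging by the t values, most of the explanatory power likely comes from carat rather than from cut.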
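The fit statistics in the summary are tied together by simple formulas, which the sketch below reproduces from the rounded R-squared. The inputs are copied by hand from the output above, so the results only approximate the reported values.

```r
r2 <- 0.855   # multiple R-squared, rounded as printed
n  <- 53940   # number of observations
k  <- 4       # number of predictors (excluding the intercept)

# Adjusted R-squared penalizes for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Overall F-statistic expressed through R-squared
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

adj_r2  # about 0.855
f_stat  # about 7.95e+04; the small gap from 7.953e+04 comes from rounding r2
```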
par(mfrow = c(2, 2))
plot(model)
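The Residuals vs. Fitted plot shows the residuals against the predicted prices. If the form of the model were right, the points would scatter evenly around the horizontal zero line; here they do not, which suggests the model misses some structure in the data. The Normal Q-Q plot compares the distribution of the standardized residuals to a normal distribution. The points follow the reference line through the middle of the distribution, which is consistent with the quartiles in the summary (-709.0 and 437.8), but the extreme residuals (-26,066.6 and 13,085.9) are far too large for a normal distribution with this spread, so the tails are heavier than normal. The Scale-Location plot checks for homoscedasticity. The spread of the residuals is not even across the fitted values, so the constant-variance assumption is not met. The Residuals vs. Leverage plot helps to identify influential observations. Most points look unremarkable, but the very large negative residual indicates at least one observation that deserves a closer look.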
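Reading diagnostic plots is partly a judgment call, so it can help to back them with a few numbers. The sketch below uses base R only; the three summaries (share of large standardized residuals, a rank correlation between fitted values and residual size, and the largest Cook's distances) are my own choice of rough checks, not formal tests, and the object names are made up for illustration.

```r
library(ggplot2)  # provides the diamonds data
data(diamonds)

diamonds$ideal_cut <- ifelse(diamonds$cut == "Ideal", 1, 0)
model <- lm(price ~ carat + I(carat^2) + ideal_cut + carat:ideal_cut,
            data = diamonds)

std_res <- rstandard(model)  # standardized residuals

# Share of standardized residuals beyond +/-3
# (roughly 0.27% would be expected if the errors were normal)
share_extreme <- mean(abs(std_res) > 3, na.rm = TRUE)

# Rank correlation between fitted values and residual size;
# values far from 0 point to non-constant variance
spread_trend <- cor(fitted(model), abs(resid(model)), method = "spearman")

# The five most influential observations by Cook's distance
top_influence <- head(sort(cooks.distance(model), decreasing = TRUE), 5)

share_extreme
spread_trend
top_influence
```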
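Overall, based on the coefficients and the plots, I would conclude that carat weight explains most of the variation in price and that an Ideal cut adds a premium that grows with carat weight. The model fits well in terms of R-squared, but that is not the same as a high correlation between price and cut quality. As for whether the linear model was appropriate: it is a reasonable first approximation of the average price, but the residual analysis shows uneven spread and heavy tails, so the constant-variance and normality assumptions are only partly met, and the standard errors and p-values should be read with caution.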