data <- read.csv("C:\\Users\\SHREYA\\OneDrive\\Documents\\Gitstuff\\modified_dataset.csv")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)
library(ggplot2)
The interaction term: The interaction term ‘cocoa_percent:company_location’ captures the combined influence of ‘cocoa_percent’ and ‘company_location’ on the rating. Because ‘company_location’ is categorical, the model estimates one interaction coefficient per non-reference location. Each coefficient shows how the slope of ‘cocoa_percent’ on the rating in that location differs from the slope in the reference location.
If a coefficient is positive and significant, the slope of ‘cocoa_percent’ on the rating is more positive (or less negative) in that location than in the reference location. If it is negative and significant, the slope is more negative in that location than in the reference location.
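To make the reading of an interaction coefficient concrete, here is a minimal sketch on synthetic data. The data frame `toy_int`, its two locations "A" and "B", and the slopes built into it are all assumptions made for illustration; none of it comes from the chocolate dataset.

```r
# Synthetic stand-in data (NOT the chocolate dataset); column names mirror the text.
set.seed(1)
n <- 200
toy_int <- data.frame(
  cocoa_percent    = runif(n, 0.5, 1.0),
  company_location = factor(sample(c("A", "B"), n, replace = TRUE))
)

# Construct a rating whose cocoa slope is -1 in location A and -1 + 2 = +1 in B.
toy_int$rating <- 3 - 1 * toy_int$cocoa_percent +
  2 * toy_int$cocoa_percent * (toy_int$company_location == "B") +
  rnorm(n, sd = 0.1)

# `a * b` expands to a + b + a:b, so both main effects and the interaction are fitted.
fit_int <- lm(rating ~ cocoa_percent * company_location, data = toy_int)

# "cocoa_percent" is the slope in the reference location (A);
# "cocoa_percent:company_locationB" is how much the slope changes in B.
coef(fit_int)
```

By construction, the estimate for `cocoa_percent` should land near -1 and the estimate for `cocoa_percent:company_locationB` near 2, so the slope in location B is roughly their sum.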
# Build the extended linear regression model
model_extended <- lm(rating ~ cocoa_percent + review_date + ingredients, data = data)
# Summarize the fitted model
summary(model_extended)
##
## Call:
## lm(formula = rating ~ cocoa_percent + review_date + ingredients,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.03025 -0.27248 0.00066 0.27472 1.12161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.425339 4.830579 -0.709 0.478332
## cocoa_percent -1.155747 0.160627 -7.195 8.20e-13 ***
## review_date 0.003535 0.002399 1.474 0.140741
## ingredients1- B 0.414298 0.184247 2.249 0.024624 *
## ingredients2- B,C 0.482479 0.429995 1.122 0.261946
## ingredients2- B,S 0.362282 0.049684 7.292 4.08e-13 ***
## ingredients2- B,S* 0.099077 0.089983 1.101 0.270975
## ingredients3- B,S*,C 0.059607 0.131780 0.452 0.651076
## ingredients3- B,S*,Sa -0.324134 0.428185 -0.757 0.449124
## ingredients3- B,S,C 0.408212 0.048966 8.337 < 2e-16 ***
## ingredients3- B,S,L -0.148310 0.157343 -0.943 0.345981
## ingredients3- B,S,V 0.318557 0.250128 1.274 0.202933
## ingredients4- B,S*,C,L 0.019729 0.304534 0.065 0.948351
## ingredients4- B,S*,C,Sa 0.230198 0.106023 2.171 0.030009 *
## ingredients4- B,S*,C,V 0.133851 0.167513 0.799 0.424338
## ingredients4- B,S*,V,L 0.255664 0.250083 1.022 0.306729
## ingredients4- B,S,C,L 0.335705 0.053063 6.327 2.96e-10 ***
## ingredients4- B,S,C,Sa 0.250620 0.196169 1.278 0.201520
## ingredients4- B,S,C,V 0.112655 0.058501 1.926 0.054257 .
## ingredients4- B,S,V,L -0.018008 0.196335 -0.092 0.926928
## ingredients5- B,S,C,L,Sa 0.071849 0.428461 0.168 0.866840
## ingredients5- B,S,C,V,L 0.204217 0.056593 3.609 0.000314 ***
## ingredients5-B,S,C,V,Sa -0.055265 0.179851 -0.307 0.758654
## ingredients6-B,S,C,V,L,Sa 0.074751 0.218020 0.343 0.731732
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4257 on 2506 degrees of freedom
## Multiple R-squared: 0.09439, Adjusted R-squared: 0.08608
## F-statistic: 11.36 on 23 and 2506 DF, p-value: < 2.2e-16
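Intercept: The intercept of -3.4253 is the predicted rating when cocoa percentage is 0, review date is 0, and the ingredient category is the reference level, i.e. the one level of ‘ingredients’ that has no row in the coefficient table. A review date of 0 is far outside anything plausible for this data, so the intercept has no practical interpretation on its own.
cocoa_percent: For each one-unit increase in cocoa_percent, we expect the rating to decrease by 1.1557 points, on average, holding other variables constant. The size of the coefficient suggests that cocoa_percent is stored as a proportion; if so, a 10-percentage-point increase (0.10 units) corresponds to a decrease of about 0.116 points. The effect is highly significant (p = 8.2e-13).
review_date: For each one-unit increase in review date, we expect the rating to increase by 0.0035 points, on average, holding other variables constant. This effect is not statistically significant (p = 0.141).
ingredients: Each ingredient combination has its own coefficient, measured against the reference category. For example, for “ingredients2- B,S,” the coefficient of 0.3623 suggests that products with this ingredient combination are expected to have a rating 0.3623 points higher, on average, than products in the reference category, holding other variables constant. The significance of each ingredient category varies; for instance, “ingredients3- B,S,C” has a highly significant positive coefficient (0.4082, p < 2e-16), while many of the rarer combinations are not distinguishable from the reference category.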
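The reference-category reading of factor coefficients can be checked directly. The sketch below uses synthetic data with three hypothetical ingredient levels ("A", "B", "C"); the names `toy_ref` and `fit_ref` are assumptions for illustration and do not come from the analysis above.

```r
# Synthetic stand-in data (NOT the chocolate dataset); levels A, B, C are hypothetical.
set.seed(42)
toy_ref <- data.frame(
  ingredients = factor(rep(c("A", "B", "C"), each = 30))
)
toy_ref$rating <- 3 + 0.3 * (toy_ref$ingredients == "B") +
  rnorm(nrow(toy_ref), sd = 0.2)

# The first factor level is the reference; it gets no row in the coefficient table.
ref_level <- levels(toy_ref$ingredients)[1]

fit_ref <- lm(rating ~ ingredients, data = toy_ref)
group_means <- tapply(toy_ref$rating, toy_ref$ingredients, mean)

# With a single factor predictor, each dummy coefficient equals that group's
# mean minus the reference group's mean.
coef_b <- coef(fit_ref)[["ingredientsB"]]
diff_b <- group_means[["B"]] - group_means[[ref_level]]
all.equal(coef_b, diff_b)
```

For the real data, `levels(factor(data$ingredients))[1]` would show which category `model_extended` uses as its baseline, assuming `ingredients` was read in as character or factor with default level ordering.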
gg_resfitted(model_extended) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Residuals vs Fitted (plot(model_extended, 1)):
Issue Detection: Look for a trend in the residuals. A distinct structure, such as curvature or a spread that widens or narrows across the fitted values, indicates nonlinearity or non-constant variance that the model does not capture. If the residuals are scattered randomly around the horizontal line at 0 with no visible pattern, the linearity and constant-variance assumptions are reasonable, as is the case here.
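For readers without lindia installed, the same diagnostic can be sketched in base R. The example below uses R's built-in `cars` data as a stand-in, so the model `fit_cars` is an assumption for illustration and not the chocolate model.

```r
# Stand-in model on R's built-in `cars` data (not the chocolate data).
fit_cars <- lm(dist ~ speed, data = cars)

res      <- resid(fit_cars)   # observed minus fitted
fit_vals <- fitted(fit_cars)  # model predictions

# Scatter of residuals against fitted values, with a zero reference line
# and a lowess smoother to make any trend easier to see.
plot(fit_vals, res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)
lines(lowess(fit_vals, res), col = "red")
```

A smoother that stays close to the dashed line, with a roughly even vertical spread, is what the text above describes as the no-pattern case. Replacing `fit_cars` with `model_extended` should give the base-R counterpart of the plot shown here.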
residual_plots <- gg_resX(model_extended)
gg_reshist(model_extended)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
gg_qqplot(model_extended)
Normal Q-Q plot (plot(model_extended, 2)):
Issue Detection: Deviation from the diagonal line indicates a departure from normality. If the points deviate markedly, particularly in the tails, the residuals are not normally distributed. If the points closely follow the diagonal, the assumption of normally distributed residuals is reasonable, as is the case here, where the points lie close to the diagonal.
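A visual Q-Q check can be paired with a numerical one. The sketch below again uses the built-in `cars` data as a stand-in (the model `fit_qq` is an assumption, not the chocolate model) and adds a Shapiro-Wilk test, which is a swapped-in complement rather than something the analysis above performs.

```r
# Stand-in model on R's built-in `cars` data (not the chocolate data).
fit_qq <- lm(dist ~ speed, data = cars)

# Standardized residuals put the points on a comparable scale.
std_res <- rstandard(fit_qq)

# Base-R Q-Q plot with a reference line through the quartiles.
qqnorm(std_res)
qqline(std_res)

# Shapiro-Wilk test of normality; a small p-value suggests non-normal residuals.
sw <- shapiro.test(std_res)
sw$p.value
```

With a few thousand observations, as in `model_extended`, a test like this tends to flag even small departures from normality, so the plot usually remains the more informative guide.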
plot(cooks.distance(model_extended))