data <- read.csv("C:\\Users\\SHREYA\\OneDrive\\Documents\\Gitstuff\\modified_dataset.csv")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)
library(ggplot2)
# Build the extended linear regression model
model_extended <- lm(rating ~ cocoa_percent  + review_date + ingredients, data = data)

# Print the coefficient table and fit statistics
summary(model_extended)
## 
## Call:
## lm(formula = rating ~ cocoa_percent + review_date + ingredients, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.03025 -0.27248  0.00066  0.27472  1.12161 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -3.425339   4.830579  -0.709 0.478332    
## cocoa_percent             -1.155747   0.160627  -7.195 8.20e-13 ***
## review_date                0.003535   0.002399   1.474 0.140741    
## ingredients1- B            0.414298   0.184247   2.249 0.024624 *  
## ingredients2- B,C          0.482479   0.429995   1.122 0.261946    
## ingredients2- B,S          0.362282   0.049684   7.292 4.08e-13 ***
## ingredients2- B,S*         0.099077   0.089983   1.101 0.270975    
## ingredients3- B,S*,C       0.059607   0.131780   0.452 0.651076    
## ingredients3- B,S*,Sa     -0.324134   0.428185  -0.757 0.449124    
## ingredients3- B,S,C        0.408212   0.048966   8.337  < 2e-16 ***
## ingredients3- B,S,L       -0.148310   0.157343  -0.943 0.345981    
## ingredients3- B,S,V        0.318557   0.250128   1.274 0.202933    
## ingredients4- B,S*,C,L     0.019729   0.304534   0.065 0.948351    
## ingredients4- B,S*,C,Sa    0.230198   0.106023   2.171 0.030009 *  
## ingredients4- B,S*,C,V     0.133851   0.167513   0.799 0.424338    
## ingredients4- B,S*,V,L     0.255664   0.250083   1.022 0.306729    
## ingredients4- B,S,C,L      0.335705   0.053063   6.327 2.96e-10 ***
## ingredients4- B,S,C,Sa     0.250620   0.196169   1.278 0.201520    
## ingredients4- B,S,C,V      0.112655   0.058501   1.926 0.054257 .  
## ingredients4- B,S,V,L     -0.018008   0.196335  -0.092 0.926928    
## ingredients5- B,S,C,L,Sa   0.071849   0.428461   0.168 0.866840    
## ingredients5- B,S,C,V,L    0.204217   0.056593   3.609 0.000314 ***
## ingredients5-B,S,C,V,Sa   -0.055265   0.179851  -0.307 0.758654    
## ingredients6-B,S,C,V,L,Sa  0.074751   0.218020   0.343 0.731732    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4257 on 2506 degrees of freedom
## Multiple R-squared:  0.09439,    Adjusted R-squared:  0.08608 
## F-statistic: 11.36 on 23 and 2506 DF,  p-value: < 2.2e-16

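Intercept: The intercept of -3.4253 is the predicted rating when cocoa_percent is 0, review_date is 0, and ingredients is at its reference level (the one level that has no row in the coefficient table). Such a combination is presumably far outside the observed data, so the value has no practical interpretation. It is also not statistically distinguishable from zero (p = 0.478).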

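cocoa_percent: For each one-unit increase in cocoa_percent, we expect the rating to decrease by 1.1557 points, on average, holding other variables constant (p < 0.001). What one unit means depends on how the column is stored: if cocoa_percent is a proportion between 0 and 1, a 10-percentage-point increase corresponds to a decrease of about 0.116 points.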

review_date: For each one-unit increase in review_date, we expect the rating to increase by 0.0035 points, on average, holding other variables constant. This effect is not statistically significant at the 5% level (p = 0.141).

ingredients: Because ingredients is a categorical predictor, each listed level has its own coefficient, measured relative to the reference level (the one level omitted from the coefficient table). For example, the coefficient of 0.3623 for “ingredients2- B,S” means that products with this ingredient combination are expected to be rated 0.3623 points higher, on average, than products in the reference level, holding other variables constant. Significance varies by level: “ingredients3- B,S,C” has a highly significant positive coefficient (0.4082, p < 2e-16), while many other levels, such as “ingredients4- B,S*,C,L” (p = 0.948), cannot be distinguished from the reference level.
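
The reference-level reading of a categorical predictor can be checked directly. The sketch below is only an illustration on R's built-in iris data, not on the chocolate data: Species stands in for ingredients, and the names fit, fit_reduced, ref_level, coef_names, and factor_test are hypothetical. It shows that the first factor level gets no coefficient row, and that the factor as a whole can be tested with a single F-test instead of reading each level's p-value separately.

```r
# Illustration on built-in data; Species plays the role of `ingredients`.
fit <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# The reference level is the first factor level; it has no coefficient row.
ref_level  <- levels(iris$Species)[1]
coef_names <- names(coef(fit))

# Joint F-test of the whole factor: compare the models without and with it.
fit_reduced <- lm(Sepal.Length ~ Petal.Length, data = iris)
factor_test <- anova(fit_reduced, fit)

print(ref_level)
print(coef_names)
print(factor_test)
```

For the model in this report, the analogous check would be levels(factor(data$ingredients))[1], assuming ingredients was read in as a character column and converted to a factor by lm().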

gg_resfitted(model_extended) +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Residuals vs Fitted (plot(model_extended, 1)):

Issue Detection: Look for a trend in the residuals. A distinct structure, such as curvature or a spread that widens or narrows with the fitted values, indicates nonlinearity or non-constant variance that the model does not capture. If the residuals scatter randomly around the horizontal line at 0 with no visible pattern, as they do in our case, the plot gives no evidence against the linearity and constant-variance assumptions.
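
A visual check of spread can be paired with a rough numeric one. The sketch below is an assumption-laden illustration on the built-in mtcars data, not part of the original analysis: it regresses the squared residuals on the fitted values and uses n times the auxiliary R-squared as a chi-square statistic, in the spirit of a Breusch-Pagan test with the fitted values as the only auxiliary regressor. The names fit, sq_res, fv, aux, stat, and p_value are hypothetical.

```r
# Illustration on built-in data, not the chocolate model.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: do squared residuals vary with the fitted values?
sq_res <- resid(fit)^2
fv     <- fitted(fit)
aux    <- lm(sq_res ~ fv)

# n * R^2 of the auxiliary regression, compared to chi-square with 1 df.
stat    <- length(sq_res) * summary(aux)$r.squared
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)

print(c(statistic = stat, p_value = p_value))
```

A small p-value would suggest that the residual spread changes with the fitted values; a large one is consistent with, but does not prove, constant variance.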

residual_plots <- gg_resX(model_extended)

gg_reshist(model_extended)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

gg_qqplot(model_extended)

Normal Q-Q plot (plot(model_extended, 2)):

Issue Detection: Deviation from the diagonal line indicates a departure from normality. Large deviations, particularly in the tails, show that the residuals are not normally distributed. If the points closely follow the diagonal, as they largely do in our case, the assumption of normally distributed residuals is reasonable. The residual summary above (minimum -2.03, maximum 1.12) suggests the lower tail is the place to look most closely.
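
The Q-Q plot can be supplemented with a formal test. The sketch below applies base R's shapiro.test() to the residuals of a small model fitted to the built-in mtcars data; it is an illustration only, and the names fit and sw are hypothetical. The Shapiro-Wilk test is a swapped-in technique here, not something the analysis above ran.

```r
# Illustration on built-in data, not the chocolate model.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(resid(fit))

print(sw$statistic)  # W statistic; values near 1 are consistent with normality
print(sw$p.value)
```

With a sample as large as the one in this report (2,530 observations), the test tends to reject for even minor deviations, so the Q-Q plot should stay the primary guide.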

plot(cooks.distance(model_extended))
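
The index plot of Cook's distance is easier to read against a cutoff. The sketch below, on the built-in mtcars data, flags observations whose Cook's distance exceeds 4/n. That cutoff is a common informal rule of thumb and an assumption here, not a threshold the analysis above states; the names fit, cd, cutoff, and flagged are hypothetical.

```r
# Illustration on built-in data, not the chocolate model.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance for every observation, and an informal 4/n cutoff.
cd     <- cooks.distance(fit)
cutoff <- 4 / length(cd)

# Observations above the cutoff deserve a closer look, not automatic removal.
flagged <- which(cd > cutoff)

plot(cd, type = "h", ylab = "Cook's distance")
abline(h = cutoff, lty = 2)  # dashed line marks the cutoff
print(flagged)
```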