data <- read.csv("C:\\Users\\SHREYA\\OneDrive\\Documents\\Gitstuff\\modified_dataset.csv")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.3.3
library(broom)
library(lindia)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

One column that can be turned into a binary outcome for modeling is “most_memorable_characteristics”. It holds free-text descriptions of each chocolate's most memorable characteristics, which can be converted into a binary variable indicating whether a specific characteristic is mentioned. For example, separate binary variables could be created for characteristics such as “cocoa,” “nutty,” “fruity,” or “spicy,” depending on whether each term appears in a chocolate's description.
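As a minimal sketch of that idea, the loop below builds one 0/1 indicator per keyword. The `chocolate` data frame is a toy stand-in invented for illustration, and the `keywords` vector is a hypothetical list; only the column name `most_memorable_characteristics` comes from the dataset.

```r
# Toy stand-in for the real data; only the column name matches the source.
chocolate <- data.frame(
  most_memorable_characteristics = c(
    "rich cocoa, nutty",
    "fruity, sour",
    "spicy, cocoa",
    "sweet, vanilla"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical keyword list; any terms of interest could be used.
keywords <- c("cocoa", "nutty", "fruity", "spicy")

# Add one 0/1 column per keyword (has_cocoa, has_nutty, ...).
for (kw in keywords) {
  chocolate[[paste0("has_", kw)]] <- as.integer(
    grepl(kw, chocolate$most_memorable_characteristics, ignore.case = TRUE)
  )
}

print(chocolate)
```

Note that `grepl()` matches substrings, so a keyword also counts when it appears inside a longer word.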

data$has_cocoa <- grepl("cocoa", data$most_memorable_characteristics, ignore.case = TRUE)

# Convert the logical values to 0 and 1
data$has_cocoa <- as.integer(data$has_cocoa)
summary(data)
##       ref       company_manufacturer company_location    review_date  
##  Min.   :   5   Length:2530          Length:2530        Min.   :2006  
##  1st Qu.: 802   Class :character     Class :character   1st Qu.:2012  
##  Median :1454   Mode  :character     Mode  :character   Median :2015  
##  Mean   :1430                                           Mean   :2014  
##  3rd Qu.:2079                                           3rd Qu.:2018  
##  Max.   :2712                                           Max.   :2021  
##  country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent   
##  Length:2530            Length:2530                      Min.   :0.4200  
##  Class :character       Class :character                 1st Qu.:0.7000  
##  Mode  :character       Mode  :character                 Median :0.7000  
##                                                          Mean   :0.7164  
##                                                          3rd Qu.:0.7400  
##                                                          Max.   :1.0000  
##  ingredients        most_memorable_characteristics     rating     
##  Length:2530        Length:2530                    Min.   :1.000  
##  Class :character   Class :character               1st Qu.:3.000  
##  Mode  :character   Mode  :character               Median :3.250  
##                                                    Mean   :3.196  
##                                                    3rd Qu.:3.500  
##                                                    Max.   :4.000  
##    has_cocoa     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.1652  
##  3rd Qu.:0.0000  
##  Max.   :1.0000

The code above creates a new binary variable has_cocoa, where 1 indicates that “cocoa” appears in the most_memorable_characteristics column for a chocolate and 0 indicates that it does not. The mean of 0.1652 in the summary shows that about 16.5% of the 2530 chocolates mention cocoa.

table(data$has_cocoa)
## 
##    0    1 
## 2112  418
model <- glm(has_cocoa ~ review_date + rating + cocoa_percent + ref, data = data, family = "binomial")

coef(summary(model))
##                    Estimate   Std. Error   z value     Pr(>|z|)
## (Intercept)   362.766249796 2.388189e+02  1.519001 1.287621e-01
## review_date    -0.182059905 1.189960e-01 -1.529967 1.260249e-01
## rating          1.212818444 1.410682e-01  8.597393 8.154723e-18
## cocoa_percent  -4.600818018 1.196982e+00 -3.843683 1.212015e-04
## ref             0.001146289 6.231175e-04  1.839603 6.582652e-02
# Summarize the model
summary(model)
## 
## Call:
## glm(formula = has_cocoa ~ review_date + rating + cocoa_percent + 
##     ref, family = "binomial", data = data)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    3.628e+02  2.388e+02   1.519 0.128762    
## review_date   -1.821e-01  1.190e-01  -1.530 0.126025    
## rating         1.213e+00  1.411e-01   8.597  < 2e-16 ***
## cocoa_percent -4.601e+00  1.197e+00  -3.844 0.000121 ***
## ref            1.146e-03  6.231e-04   1.840 0.065827 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2268.0  on 2529  degrees of freedom
## Residual deviance: 2155.7  on 2525  degrees of freedom
## AIC: 2165.7
## 
## Number of Fisher Scoring iterations: 5

The coefficients of the logistic regression model for the “has_cocoa” variable are as follows:

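Intercept: The intercept is 362.7662. This represents the log odds of the outcome (“cocoa” being mentioned) when all predictors are zero. Because review_date ranges from 2006 to 2021 in the data, a value of zero is far outside the observed range, so the intercept has no practical interpretation on its own. It is not statistically significant (p = 0.129).

review_date: For each one-year increase in review_date, the log odds of mentioning cocoa decrease by 0.1821, holding the other predictors constant. This coefficient is not statistically significant (p = 0.126).

rating: For each unit increase in rating, the log odds of mentioning cocoa increase by 1.2128. This coefficient is statistically significant (p < 0.001).

cocoa_percent: For each unit increase in cocoa_percent, the log odds of mentioning cocoa decrease by 4.6008. This coefficient is statistically significant (p < 0.001). Because cocoa_percent is stored as a proportion (0.42 to 1.00), a full unit is larger than the observed range; an increase of 0.10 (10 percentage points) corresponds to a decrease of about 0.46 in the log odds.

ref: For each unit increase in ref, the log odds of mentioning cocoa increase by 0.0011. This coefficient is not significant at the 0.05 level, only at the 0.10 level (p = 0.0658).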
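Log odds are hard to read directly, so a short sketch of two common conversions may help: exponentiating coefficients to get odds ratios, and applying the inverse logit (`plogis()`) to get a predicted probability. The coefficients are copied from the model summary above; the "typical" bar uses the sample medians from the `summary(data)` output, which is an illustrative assumption rather than part of the original analysis.

```r
# Coefficients copied from the model summary above.
coefs <- c(
  "(Intercept)" = 362.766249796,
  review_date   = -0.182059905,
  rating        = 1.212818444,
  cocoa_percent = -4.600818018,
  ref           = 0.001146289
)

# Odds ratios: multiplicative change in the odds per one-unit increase.
odds_ratios <- exp(coefs[-1])
print(round(odds_ratios, 3))

# cocoa_percent is a proportion, so a 0.10 step (10 percentage points)
# is a more natural unit than a full 1.0 step.
or_cocoa_10pts <- exp(0.10 * coefs[["cocoa_percent"]])
print(round(or_cocoa_10pts, 3))

# Illustrative bar at the sample medians (assumed values, in the same
# order as coefs; the leading 1 multiplies the intercept).
typical <- c(1, review_date = 2015, rating = 3.25, cocoa_percent = 0.70, ref = 1454)

# Linear predictor on the log-odds scale, then inverse logit.
log_odds <- sum(coefs * typical)
p_typical <- plogis(log_odds)
print(round(p_typical, 3))
```

An odds ratio above 1 means the odds of mentioning cocoa rise with the predictor; below 1 means they fall.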

Confidence Interval for the coefficient

# Estimate and standard error for rating, taken from the fitted model
coef_rating <- coef(summary(model))["rating", "Estimate"]
se_rating <- coef(summary(model))["rating", "Std. Error"]

lower_bound <- coef_rating - 1.96 * se_rating
upper_bound <- coef_rating + 1.96 * se_rating

cat("95% Confidence Interval for the coefficient of rating: (", round(lower_bound, 3), ",", round(upper_bound, 3), ")\n")
## 95% Confidence Interval for the coefficient of rating: ( 0.936 , 1.489 )

The 95% Wald confidence interval for the coefficient of rating in the logistic regression model runs from 0.936 to 1.489, computed from the estimate (1.2128) and standard error (0.1411) reported in the model summary. The interval lies entirely above zero, which is consistent with a significant positive relationship between rating and the log odds of has_cocoa being 1.
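The same Wald calculation can be applied to every coefficient at once. The sketch below uses the estimates and standard errors printed in the coefficient table above; `wald_ci()` is a hypothetical helper written for this illustration, and it uses `qnorm()` for the critical value instead of the rounded 1.96.

```r
# Estimates and standard errors copied from coef(summary(model)) above.
coef_table <- data.frame(
  term = c("(Intercept)", "review_date", "rating", "cocoa_percent", "ref"),
  estimate = c(362.766249796, -0.182059905, 1.212818444, -4.600818018, 0.001146289),
  std_error = c(238.8189, 0.1189960, 0.1410682, 1.196982, 0.0006231175)
)

# Hypothetical helper: Wald interval, estimate +/- z * standard error.
wald_ci <- function(estimate, std_error, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  data.frame(
    lower = estimate - z * std_error,
    upper = estimate + z * std_error
  )
}

ci <- cbind(coef_table["term"], wald_ci(coef_table$estimate, coef_table$std_error))

# Flag intervals that do not contain zero.
ci$excludes_zero <- ci$lower > 0 | ci$upper < 0
print(ci)
```

With the fitted model object available, `confint.default(model)` should give the same Wald intervals directly.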

Plot for Confidence Interval for the coefficient of rating

library(ggplot2)
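# Build the interval from the fitted model so the plot matches the summary
rating_est <- coef(summary(model))["rating", "Estimate"]
rating_se <- coef(summary(model))["rating", "Std. Error"]
coef_rating <- data.frame(
  Estimate = rating_est,
  Lower = rating_est - 1.96 * rating_se,
  Upper = rating_est + 1.96 * rating_se
)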


ggplot(coef_rating, aes(x = "", y = Estimate, ymin = Lower, ymax = Upper)) +
  geom_pointrange(color = "blue", size = 1.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(x = "", y = "Coefficient Estimate", title = "95% Confidence Interval for Coefficient of 'rating'") +
  theme_minimal() +
  coord_flip()