data <- read.csv("C:\\Users\\SHREYA\\OneDrive\\Documents\\Gitstuff\\modified_dataset.csv")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.3.3
library(broom)
library(lindia)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
One column that lends itself to a binary response is “most_memorable_characteristics”. It holds a short text description of each chocolate's most memorable characteristics, and that description can be converted into a binary variable recording whether a specific characteristic is mentioned. For example, separate indicators could be created for “cocoa,” “nutty,” “fruity,” or “spicy,” each set according to whether that word appears in the chocolate's description.
data$has_cocoa <- grepl("cocoa", data$most_memorable_characteristics, ignore.case = TRUE)
# Convert the logical values to 0 and 1
data$has_cocoa <- as.integer(data$has_cocoa)
summary(data)
## ref company_manufacturer company_location review_date
## Min. : 5 Length:2530 Length:2530 Min. :2006
## 1st Qu.: 802 Class :character Class :character 1st Qu.:2012
## Median :1454 Mode :character Mode :character Median :2015
## Mean :1430 Mean :2014
## 3rd Qu.:2079 3rd Qu.:2018
## Max. :2712 Max. :2021
## country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent
## Length:2530 Length:2530 Min. :0.4200
## Class :character Class :character 1st Qu.:0.7000
## Mode :character Mode :character Median :0.7000
## Mean :0.7164
## 3rd Qu.:0.7400
## Max. :1.0000
## ingredients most_memorable_characteristics rating
## Length:2530 Length:2530 Min. :1.000
## Class :character Class :character 1st Qu.:3.000
## Mode :character Mode :character Median :3.250
## Mean :3.196
## 3rd Qu.:3.500
## Max. :4.000
## has_cocoa
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1652
## 3rd Qu.:0.0000
## Max. :1.0000
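The code above creates a new binary variable has_cocoa: 1 indicates that “cocoa” appears in the most_memorable_characteristics column for that chocolate, and 0 indicates that it does not. Because the variable is coded 0/1, its mean of 0.1652 in the summary is the share of chocolates whose description mentions cocoa.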
table(data$has_cocoa)
##
## 0 1
## 2112 418
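The same grepl() pattern extends to the other characteristics mentioned earlier. The following is a minimal sketch on a small made-up data frame; the `toy` rows and the `keywords` vector are assumptions for illustration and do not come from the dataset.

```r
# Toy data standing in for the real dataset (hypothetical values).
toy <- data.frame(
  most_memorable_characteristics = c(
    "rich cocoa, nutty",
    "fruity, sour",
    "spicy, cocoa",
    "sandy, sweet"
  ),
  stringsAsFactors = FALSE
)

# Keywords to flag; this list is illustrative, not exhaustive.
keywords <- c("cocoa", "nutty", "fruity", "spicy")

# Add one 0/1 column per keyword, named has_<keyword>.
for (kw in keywords) {
  toy[[paste0("has_", kw)]] <- as.integer(
    grepl(kw, toy$most_memorable_characteristics, ignore.case = TRUE)
  )
}

print(toy[, paste0("has_", keywords)])
```

Note that grepl() matches substrings, so a keyword is also flagged when it appears inside a longer word or phrase.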
model <- glm(has_cocoa ~ review_date + rating + cocoa_percent + ref, data = data, family = "binomial")
coef(summary(model))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 362.766249796 2.388189e+02 1.519001 1.287621e-01
## review_date -0.182059905 1.189960e-01 -1.529967 1.260249e-01
## rating 1.212818444 1.410682e-01 8.597393 8.154723e-18
## cocoa_percent -4.600818018 1.196982e+00 -3.843683 1.212015e-04
## ref 0.001146289 6.231175e-04 1.839603 6.582652e-02
# Summarize the model
summary(model)
##
## Call:
## glm(formula = has_cocoa ~ review_date + rating + cocoa_percent +
## ref, family = "binomial", data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.628e+02 2.388e+02 1.519 0.128762
## review_date -1.821e-01 1.190e-01 -1.530 0.126025
## rating 1.213e+00 1.411e-01 8.597 < 2e-16 ***
## cocoa_percent -4.601e+00 1.197e+00 -3.844 0.000121 ***
## ref 1.146e-03 6.231e-04 1.840 0.065827 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2268.0 on 2529 degrees of freedom
## Residual deviance: 2155.7 on 2525 degrees of freedom
## AIC: 2165.7
##
## Number of Fisher Scoring iterations: 5
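The coefficients of the logistic regression model for the “has_cocoa” variable are on the log-odds scale. Each is interpreted with the other predictors held fixed:
Intercept: The intercept is 362.7662. This is the log odds of the outcome (“cocoa” being mentioned in the description) when all predictors are zero. Because review_date ranges from 2006 to 2021 and is never near zero, the intercept has no practical interpretation on its own.
review_date: For each one-year increase in review_date, the log odds of mentioning cocoa decrease by 0.1821. This coefficient is not statistically significant (p = 0.126).
rating: For each unit increase in rating, the log odds of mentioning cocoa increase by 1.2128. This coefficient is statistically significant (p < 0.001).
cocoa_percent: cocoa_percent is stored as a proportion (0.42 to 1.00), so a one-unit increase would span 0% to 100% cocoa and corresponds to a decrease of 4.6008 in the log odds. A more practical reading is that a 10-percentage-point increase lowers the log odds by about 0.46. This coefficient is statistically significant (p < 0.001).
ref: For each unit increase in ref, the log odds of mentioning cocoa increase by 0.0011. This coefficient is only marginally significant and does not reach the 5% level (p = 0.0658).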
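Log-odds coefficients are often easier to read as odds ratios, obtained by exponentiating them. The sketch below hard-codes the estimates printed by coef(summary(model)) above so that it runs on its own; the variable names `coefs`, `odds_ratios`, and `or_cocoa_10pt` are introduced here for illustration.

```r
# Estimates copied from the coefficient table above (intercept omitted).
coefs <- c(
  review_date   = -0.182059905,
  rating        =  1.212818444,
  cocoa_percent = -4.600818018,
  ref           =  0.001146289
)

# Odds ratio for a one-unit increase in each predictor.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# cocoa_percent is a proportion, so rescale to a 10-percentage-point
# (0.1 unit) increase for a more practical odds ratio.
or_cocoa_10pt <- exp(coefs[["cocoa_percent"]] * 0.1)
print(round(or_cocoa_10pt, 3))
```

For rating, the odds ratio comes out at about 3.36, meaning the odds of a description mentioning cocoa are multiplied by roughly that factor for each one-point increase in rating, other predictors held fixed. With the fitted model in memory, exp(coef(model)) gives the same quantities directly.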
Confidence Interval for the coefficient
# Estimate and standard error for rating, taken from the fitted model above
coef_rating <- coef(summary(model))["rating", "Estimate"]
se_rating <- coef(summary(model))["rating", "Std. Error"]
lower_bound <- coef_rating - 1.96 * se_rating
upper_bound <- coef_rating + 1.96 * se_rating
cat("95% Confidence Interval for the coefficient of rating: (", round(lower_bound, 3), ",", round(upper_bound, 3), ")\n")
## 95% Confidence Interval for the coefficient of rating: ( 0.936 , 1.489 )
The 95% Wald confidence interval for the coefficient of rating in the logistic regression model runs from 0.936 to 1.489 on the log-odds scale. The interval lies entirely above zero, which is consistent with the significant positive relationship between rating and the probability of has_cocoa being 1.
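Typing the estimate and standard error by hand is error-prone, so it can help to let R compute the interval from a fitted model object. The sketch below uses simulated data (the `sim` data frame, its coefficients, and the `fit` object are assumptions, not the chocolate dataset) to show that confint.default() reproduces the estimate plus or minus z times the standard error.

```r
# Simulated stand-in data: a binary outcome that depends on rating.
set.seed(1)
n <- 500
sim <- data.frame(rating = runif(n, 1, 4))
sim$y <- rbinom(n, 1, plogis(-4 + 1.2 * sim$rating))

fit <- glm(y ~ rating, data = sim, family = "binomial")

# Manual Wald interval: estimate +/- z * standard error.
est <- coef(summary(fit))["rating", "Estimate"]
se <- coef(summary(fit))["rating", "Std. Error"]
z <- qnorm(0.975)  # about 1.96
manual_ci <- c(est - z * se, est + z * se)

# The same interval computed by R from the fitted model.
wald_ci <- confint.default(fit)["rating", ]

print(manual_ci)
print(wald_ci)
```

Applied to the model in this report, confint.default(model) would return Wald intervals for every coefficient at once. confint(model) would instead give profile-likelihood intervals, which can differ slightly from the Wald ones.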
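# ggplot2 is already attached via tidyverse
# Point estimate and 95% Wald bounds for rating, taken from the fitted model
est_rating <- coef(summary(model))["rating", "Estimate"]
se_rating <- coef(summary(model))["rating", "Std. Error"]
coef_rating <- data.frame(
  Estimate = est_rating,
  Lower = est_rating - 1.96 * se_rating,
  Upper = est_rating + 1.96 * se_rating
)
# Horizontal point-range plot; the dashed red line marks a coefficient of zero
ggplot(coef_rating, aes(x = "", y = Estimate, ymin = Lower, ymax = Upper)) +
  geom_pointrange(color = "blue", size = 1.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(x = "", y = "Coefficient Estimate", title = "95% Confidence Interval for Coefficient of 'rating'") +
  theme_minimal() +
  coord_flip()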