data <- read.csv("C:\\Users\\SHREYA\\OneDrive\\Documents\\Gitstuff\\modified_dataset.csv")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.3.3
library(broom)
library(lindia)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
One column that can be turned into a binary response is “most_memorable_characteristics”. It holds free-text descriptions of the most memorable characteristics of each chocolate, and each description can be converted into a binary variable indicating whether a specific characteristic is mentioned. For example, binary variables for characteristics such as “cocoa”, “nutty”, “fruity”, or “spicy” can be created by checking whether the word appears in the description for each chocolate.
data$has_cocoa <- grepl("cocoa", data$most_memorable_characteristics, ignore.case = TRUE)
# Convert the logical values to 0 and 1
data$has_cocoa <- as.integer(data$has_cocoa)
summary(data)
## ref company_manufacturer company_location review_date
## Min. : 5 Length:2530 Length:2530 Min. :2006
## 1st Qu.: 802 Class :character Class :character 1st Qu.:2012
## Median :1454 Mode :character Mode :character Median :2015
## Mean :1430 Mean :2014
## 3rd Qu.:2079 3rd Qu.:2018
## Max. :2712 Max. :2021
## country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent
## Length:2530 Length:2530 Min. :0.4200
## Class :character Class :character 1st Qu.:0.7000
## Mode :character Mode :character Median :0.7000
## Mean :0.7164
## 3rd Qu.:0.7400
## Max. :1.0000
## ingredients most_memorable_characteristics rating
## Length:2530 Length:2530 Min. :1.000
## Class :character Class :character 1st Qu.:3.000
## Mode :character Mode :character Median :3.250
## Mean :3.196
## 3rd Qu.:3.500
## Max. :4.000
## has_cocoa
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1652
## 3rd Qu.:0.0000
## Max. :1.0000
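This code creates a new binary variable has_cocoa, where 1 indicates that “cocoa” appears in the most_memorable_characteristics description of a chocolate and 0 indicates that it does not. The match is on the substring and ignores case, so any word containing “cocoa” counts. The mean of 0.1652 in the summary shows that about 16.5% of the 2530 descriptions mention cocoa.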
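The same idea extends to the other characteristics mentioned earlier. The following is a minimal sketch, assuming a small made-up data frame `toy` and a hypothetical keyword list `keywords`; neither is part of the original analysis, and the descriptions are invented for illustration only.

```r
# Toy stand-in for the real data; these descriptions are invented for illustration
toy <- data.frame(
  most_memorable_characteristics = c(
    "rich cocoa, nutty",
    "fruity, tart",
    "spicy, Cocoa base",
    "sweet, vanilla"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical keyword list, taken from the examples named in the text
keywords <- c("cocoa", "nutty", "fruity", "spicy")

# Create one 0/1 column per keyword, named has_<keyword>
for (kw in keywords) {
  toy[[paste0("has_", kw)]] <- as.integer(
    grepl(kw, toy$most_memorable_characteristics, ignore.case = TRUE)
  )
}

# Number of descriptions that mention each keyword
colSums(toy[paste0("has_", keywords)])
```

Looping over a keyword vector keeps the column names consistent and makes it easy to add or drop characteristics later.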
table(data$has_cocoa)
##
## 0 1
## 2112 418
model <- glm(has_cocoa ~ review_date + rating + cocoa_percent + ref, data = data, family = "binomial")
coef(summary(model))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 362.766249796 2.388189e+02 1.519001 1.287621e-01
## review_date -0.182059905 1.189960e-01 -1.529967 1.260249e-01
## rating 1.212818444 1.410682e-01 8.597393 8.154723e-18
## cocoa_percent -4.600818018 1.196982e+00 -3.843683 1.212015e-04
## ref 0.001146289 6.231175e-04 1.839603 6.582652e-02
# Summarize the model
summary(model)
##
## Call:
## glm(formula = has_cocoa ~ review_date + rating + cocoa_percent +
## ref, family = "binomial", data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.628e+02 2.388e+02 1.519 0.128762
## review_date -1.821e-01 1.190e-01 -1.530 0.126025
## rating 1.213e+00 1.411e-01 8.597 < 2e-16 ***
## cocoa_percent -4.601e+00 1.197e+00 -3.844 0.000121 ***
## ref 1.146e-03 6.231e-04 1.840 0.065827 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2268.0 on 2529 degrees of freedom
## Residual deviance: 2155.7 on 2525 degrees of freedom
## AIC: 2165.7
##
## Number of Fisher Scoring iterations: 5
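The coefficients of the logistic regression model for the “has_cocoa” variable are on the log-odds scale, and each is interpreted holding the other predictors constant:
Intercept: The intercept is 362.7662. This is the log odds of the outcome (the description mentioning cocoa) when all predictors are zero. A review_date of 0 lies far outside the observed range (2006 to 2021), so the intercept has no practical interpretation on its own.
review_date: For each one-unit (one-year) increase in review_date, the log odds of the description mentioning cocoa decrease by 0.1821. This coefficient is not statistically significant (p = 0.126).
rating: For each unit increase in rating, the log odds of the description mentioning cocoa increase by 1.2128. This coefficient is statistically significant (p < 0.001).
cocoa_percent: cocoa_percent is stored as a proportion (0.42 to 1.00), so a one-unit increase corresponds to going from 0% to 100% cocoa. Over that range the log odds decrease by 4.6008, which is about 0.046 per percentage point. This coefficient is statistically significant (p < 0.001).
ref: For each unit increase in ref, the log odds of the description mentioning cocoa increase by 0.0011. This coefficient is not significant at the 5% level (p = 0.0658).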
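Log odds are hard to read directly, so it can help to exponentiate the coefficients into odds ratios. The sketch below copies the estimates from the model output above into a named vector so that it runs on its own; `log_odds`, `odds_ratios`, and `or_cocoa_10pts` are illustrative names, and with the fitted model in memory `exp(coef(model))` should give the same odds ratios.

```r
# Estimates copied from the model output above (log-odds scale)
log_odds <- c(
  review_date   = -0.182059905,
  rating        =  1.212818444,
  cocoa_percent = -4.600818018,
  ref           =  0.001146289
)

# Odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(log_odds)

# cocoa_percent is a proportion, so a step of 0.10 (10 percentage points)
# is a more realistic increment than a full unit
or_cocoa_10pts <- exp(0.10 * log_odds[["cocoa_percent"]])

round(odds_ratios, 4)
round(or_cocoa_10pts, 4)
```

An odds ratio above 1 means the odds of the description mentioning cocoa go up with the predictor, and a value below 1 means they go down.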
Confidence Interval for the coefficient
coefficient_rating <- coef(model)["rating"]
stderr_rating <- coef(summary(model))["rating", "Std. Error"]
# Critical value for a 95% confidence interval
critical_value <- qnorm(0.975) # Approximately 1.96 for a 95% confidence interval
# Calculate confidence interval
ci_lower <- coefficient_rating - critical_value * stderr_rating
ci_upper <- coefficient_rating + critical_value * stderr_rating
# Print confidence interval
cat("95% Confidence Interval for the Coefficient of 'rating': [", ci_lower, ",", ci_upper, "]\n")
## 95% Confidence Interval for the Coefficient of 'rating': [ 0.9363299 , 1.489307 ]
# Interpretation
cat("For each unit increase in the rating, we are 95% confident that the log odds of a product having cocoa increase by a value between approximately", ci_lower, "and", ci_upper, "holding other variables constant.\n")
## For each unit increase in the rating, we are 95% confident that the log odds of a product having cocoa increase by a value between approximately 0.9363299 and 1.489307 holding other variables constant.
The interval above is a Wald interval, computed as the estimate plus or minus about 1.96 standard errors. Because it does not include zero, higher ratings are associated with significantly higher odds that a chocolate's description mentions “cocoa” at the 5% level, holding the other variables constant. This is an association in the reviews, not evidence that a higher rating causes cocoa to be mentioned.
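The manual calculation can be cross-checked against `confint.default()` from the stats package, which returns Wald intervals for a fitted model. The following is a minimal sketch on simulated data, so that it runs without the chocolate dataset; `x`, `y`, and `fit` are hypothetical stand-ins, not objects from the analysis above.

```r
set.seed(1)

# Simulated stand-in data; x and y are hypothetical, not columns of the chocolate data
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)

# Manual Wald interval: estimate +/- z * standard error
est <- coef(fit)[["x"]]
se <- coef(summary(fit))["x", "Std. Error"]
z <- qnorm(0.975)
manual_ci <- c(est - z * se, est + z * se)

# Built-in Wald interval from stats::confint.default()
builtin_ci <- unname(confint.default(fit, parm = "x", level = 0.95)[1, ])

# The two intervals should agree up to numerical tolerance
all.equal(manual_ci, builtin_ci)

# The same interval expressed on the odds-ratio scale
exp(manual_ci)
```

For the model in this document, `confint.default(model, parm = "rating")` should reproduce the interval printed above. Note that `confint(model)` on a glm uses profile likelihood instead, so its limits may differ slightly from the Wald limits.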
library(ggplot2)
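# Use the estimate and interval computed above instead of hard-coded values,
# so the plot matches the fitted model (1.2128, 0.9363, 1.4893)
coef_rating <- data.frame(
  Estimate = unname(coefficient_rating),
  Lower = unname(ci_lower),
  Upper = unname(ci_upper)
)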
ggplot(coef_rating, aes(x = "", y = Estimate, ymin = Lower, ymax = Upper)) +
  geom_pointrange(color = "blue", size = 1.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(x = "", y = "Coefficient Estimate", title = "95% Confidence Interval for Coefficient of 'rating'") +
  theme_minimal() +
  coord_flip()