library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# dplyr, ggplot2, and purrr are already attached by library(tidyverse),
# and stats is attached by default, so only the extra packages are loaded here
library(ggthemes)
library(pwr)
books <- read.csv("bestsellers.csv")
str(books)
## 'data.frame': 550 obs. of 7 variables:
## $ Name : chr "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
## $ Author : chr "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
## $ User.Rating: num 4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
## $ Reviews : int 17350 2052 18979 21424 7665 12643 19735 19699 5983 23848 ...
## $ Price : int 8 22 15 6 12 11 30 15 3 8 ...
## $ Year : int 2016 2011 2018 2017 2019 2011 2014 2017 2018 2016 ...
## $ Genre : chr "Non Fiction" "Fiction" "Non Fiction" "Fiction" ...
Let's build a logistic regression model using the books dataset. We treat Genre as a binary outcome variable with the values “Fiction” and “Non Fiction”. The model estimates the probability that a book is “Fiction” from three other variables in the dataset: User.Rating, Price, and Year.
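Logistic regression works on the log-odds (logit) scale rather than on probabilities directly. As a minimal sketch of that relationship, the snippet below converts a few probabilities to log odds and back with base R's `qlogis()` and `plogis()`. The probabilities are arbitrary illustrative values, not taken from the books data.

```r
# Illustrative probabilities (hypothetical values, not from the books data)
p <- c(0.10, 0.50, 0.90)

# Logit: log odds = log(p / (1 - p)); qlogis() computes the same quantity
log_odds <- log(p / (1 - p))
all.equal(log_odds, qlogis(p))

# Inverse logit: plogis() maps log odds back to probabilities
p_back <- plogis(log_odds)
all.equal(p_back, p)
```

A log odds of 0 corresponds to a probability of 0.5; positive log odds mean a probability above 0.5 and negative log odds a probability below it.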
# Recode Genre as a numeric 0/1 indicator (1 = Fiction, 0 = Non Fiction)
books$Genre <- ifelse(books$Genre == "Fiction", 1, 0)
# Inspect the structure of the modified dataset
str(books)
## 'data.frame': 550 obs. of 7 variables:
## $ Name : chr "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
## $ Author : chr "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
## $ User.Rating: num 4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
## $ Reviews : int 17350 2052 18979 21424 7665 12643 19735 19699 5983 23848 ...
## $ Price : int 8 22 15 6 12 11 30 15 3 8 ...
## $ Year : int 2016 2011 2018 2017 2019 2011 2014 2017 2018 2016 ...
## $ Genre : num 0 1 0 1 0 1 1 1 0 1 ...
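As a quick check on the recoding, the sketch below applies the same `ifelse()` call to a small made-up vector (`toy_genre` is hypothetical, for illustration only) and compares it with a factor-based alternative. For a binomial `glm()`, a factor response treats the first level as failure and the other level as success, so listing "Non Fiction" first would make "Fiction" the modeled outcome.

```r
# Hypothetical genre labels for illustration only
toy_genre <- c("Non Fiction", "Fiction", "Non Fiction", "Fiction", "Fiction")

# Same recoding rule as used for books$Genre: 1 = Fiction, 0 = Non Fiction
toy_indicator <- ifelse(toy_genre == "Fiction", 1, 0)

# Alternative: a factor with "Non Fiction" as the first (reference) level
toy_factor <- factor(toy_genre, levels = c("Non Fiction", "Fiction"))

# Cross-tabulate labels against the indicator to confirm the mapping
table(label = toy_genre, indicator = toy_indicator)

# The factor's integer codes minus 1 reproduce the 0/1 indicator
all.equal(as.integer(toy_factor) - 1, toy_indicator)
```

Running the same `table()` check on the real data before overwriting the column would catch unexpected labels, since any value other than "Fiction" is silently mapped to 0.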
# Fit the logistic regression model
model <- glm(Genre ~ User.Rating + Price + Year, data = books, family = binomial())
# Summarize the model
summary(model)
##
## Call:
## glm(formula = Genre ~ User.Rating + Price + Year, family = binomial(),
## data = books)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 104.06626 58.18713 1.788 0.0737 .
## User.Rating 1.01415 0.42393 2.392 0.0167 *
## Price -0.04672 0.01165 -4.011 6.04e-05 ***
## Year -0.05384 0.02910 -1.850 0.0643 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 753.53 on 549 degrees of freedom
## Residual deviance: 723.53 on 546 degrees of freedom
## AIC: 731.53
##
## Number of Fisher Scoring iterations: 4
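The fitted coefficients define a linear predictor on the log-odds scale, which `plogis()` converts to a probability. The sketch below does this by hand for the first book listed in the `str()` output (User.Rating 4.7, Price 8, Year 2016). The estimates are copied from the printed summary and are rounded to five decimals, so the result is approximate.

```r
# Estimates copied from the summary(model) output (rounded to 5 decimals)
beta <- c("(Intercept)" = 104.06626, User.Rating = 1.01415,
          Price = -0.04672, Year = -0.05384)

# Predictor values for the first book listed in str(books);
# the leading 1 multiplies the intercept
x <- c(1, 4.7, 8, 2016)

# Linear predictor on the log-odds scale
eta <- sum(beta * x)

# Convert log odds to a probability of being Fiction
p_fiction <- plogis(eta)
p_fiction
```

The result is roughly 0.48, slightly below an even chance of Fiction for a book that is in fact listed as Non Fiction. With the fitted object available, `predict(model, newdata = ..., type = "response")` should give the same kind of probability without the rounding.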
# Extract the coefficient table for interpretation
coef_summary <- summary(model)$coefficients
coef_summary
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 104.06625617 58.18712834 1.788476 7.369932e-02
## User.Rating 1.01415204 0.42393379 2.392242 1.674582e-02
## Price -0.04672179 0.01164777 -4.011222 6.040524e-05
## Year -0.05383760 0.02909795 -1.850220 6.428189e-02
(Intercept) Coefficient: The estimate of 104.06626 (standard error 58.18713, z = 1.788) is the log odds of a book being fiction when User.Rating, Price, and Year are all zero. No book in the data is anywhere near that point (Year = 0 in particular), so the intercept is an extrapolation with no practical interpretation. Its large size mainly offsets the Year term, because Year is measured in calendar years (values above 2000). It is also not statistically significant at the 0.05 level (p = 0.0737).
User.Rating Coefficient: The estimate of 1.01415 means that each one-point increase in user rating raises the log odds of a book being fiction by about 1.014, holding Price and Year constant. Equivalently, the odds are multiplied by exp(1.01415) ≈ 2.76. The effect is statistically significant at the 0.05 level (p = 0.0167), which is evidence of a positive association between user rating and the likelihood of a book being fiction. The p-value alone does not measure how strong that association is.
Price Coefficient: The estimate of -0.04672 (p = 6.04e-05) indicates that higher prices are significantly associated with lower odds of a book being fiction. For every one-unit increase in price, the log odds of being fiction decrease by approximately 0.047, holding other factors constant, which multiplies the odds by exp(-0.04672) ≈ 0.954 (about a 4.6% decrease).
Year Coefficient: The estimate of -0.05384 (p = 0.0643) points to a downward trend over the years in the log odds of a book being fiction, holding User.Rating and Price constant, but the effect is not statistically significant at the 0.05 level. The data are compatible with a shift in publishing trends or consumer preferences over time, but they do not establish one.
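Log-odds coefficients are often easier to read as odds ratios. As a minimal sketch, the snippet below exponentiates the three slope estimates, copied by hand from the coefficient table, and expresses each as a percent change in the odds per one-unit increase in the predictor.

```r
# Slope estimates copied from the coefficient table
est <- c(User.Rating = 1.01415204, Price = -0.04672179, Year = -0.05383760)

# Odds ratio: multiplicative change in the odds per one-unit increase
odds_ratio <- exp(est)

# Percent change in the odds per one-unit increase
pct_change <- 100 * (odds_ratio - 1)

round(cbind(estimate = est, odds_ratio, pct_change), 3)

# Effects multiply on the odds scale: a 5-unit price increase
# scales the odds by exp(5 * beta_Price)
exp(5 * est["Price"])
```

With the fitted object in the session, `exp(coef(model))` gives the same odds ratios directly.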
# Calculating confidence interval for the User.Rating coefficient
ci <- confint(model, "User.Rating", level = 0.95)
## Waiting for profiling to be done...
ci
## 2.5 % 97.5 %
## 0.2001208 1.8661799
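The 95% confidence interval for the User.Rating coefficient ranges from 0.200 to 1.866. We are 95% confident that the true effect of a one-unit increase in user rating on the log odds of a book being fiction, holding Price and Year constant, lies between 0.200 and 1.866. The interval excludes zero, which is consistent with the significant p-value of 0.0167 reported in the model summary.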
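The "Waiting for profiling to be done..." message indicates that `confint()` computed a profile-likelihood interval for this model. As a rough cross-check, the sketch below builds the simpler Wald interval (estimate plus or minus a normal quantile times the standard error) from the printed values, and also moves the reported interval to the odds-ratio scale. All numbers are copied by hand from the output.

```r
# Values copied from the printed output
est <- 1.01415204   # User.Rating estimate
se  <- 0.42393379   # its standard error
profile_ci <- c(lower = 0.2001208, upper = 1.8661799)  # from confint()

# Wald interval: estimate +/- z * SE, with z the 97.5% normal quantile
z <- qnorm(0.975)
wald_ci <- c(lower = est - z * se, upper = est + z * se)

# Compare the two intervals on the log-odds scale
rbind(profile = profile_ci, wald = wald_ci)

# Exponentiate the profile interval to get a 95% interval for the odds ratio
exp(profile_ci)
```

The two intervals are close but not identical, because the profile interval does not assume a symmetric, normal-shaped likelihood. On the odds-ratio scale the interval runs from about 1.22 to about 6.46, which shows how imprecisely the size of the rating effect is estimated even though its direction is clear.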