Loading all the necessary libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# dplyr, ggplot2, and purrr are already attached by library(tidyverse),
# and stats is attached by default, so only the extra packages are loaded here
library(ggthemes)
library(pwr)

Load the Dataset

books <- read.csv("bestsellers.csv")

Check the structure of the data

str(books)

## 'data.frame':    550 obs. of  7 variables:
##  $ Name       : chr  "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
##  $ Author     : chr  "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
##  $ User.Rating: num  4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
##  $ Reviews    : int  17350 2052 18979 21424 7665 12643 19735 19699 5983 23848 ...
##  $ Price      : int  8 22 15 6 12 11 30 15 3 8 ...
##  $ Year       : int  2016 2011 2018 2017 2019 2011 2014 2017 2018 2016 ...
##  $ Genre      : chr  "Non Fiction" "Fiction" "Non Fiction" "Fiction" ...

Let's build a logistic regression model using the books dataset, treating Genre as a binary outcome variable with the values "Fiction" and "Non Fiction".

The model estimates the probability that a book is "Fiction" from three predictors in the dataset: User.Rating, Price, and Year.
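Logistic regression models the log odds of the outcome rather than the probability itself. As a small illustration of that link (not part of the original analysis), base R's `qlogis()` and `plogis()` convert between probabilities and log odds:

```r
# Probabilities chosen only for illustration
p <- c(0.1, 0.5, 0.9)

# Logit: log(p / (1 - p))
log_odds <- qlogis(p)

# Inverse logit: 1 / (1 + exp(-x)) maps log odds back to probabilities
back_to_p <- plogis(log_odds)

print(round(log_odds, 3))
print(back_to_p)
```

A log odds of 0 corresponds to a probability of 0.5, positive values to probabilities above 0.5, and negative values to probabilities below 0.5.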

Data Preparation

# Recode Genre as a numeric 0/1 indicator (1 = Fiction, 0 = Non Fiction)
books$Genre <- ifelse(books$Genre == "Fiction", 1, 0)

# Inspect the structure of the modified dataset
str(books)

## 'data.frame':    550 obs. of  7 variables:
##  $ Name       : chr  "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
##  $ Author     : chr  "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
##  $ User.Rating: num  4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
##  $ Reviews    : int  17350 2052 18979 21424 7665 12643 19735 19699 5983 23848 ...
##  $ Price      : int  8 22 15 6 12 11 30 15 3 8 ...
##  $ Year       : int  2016 2011 2018 2017 2019 2011 2014 2017 2018 2016 ...
##  $ Genre      : num  0 1 0 1 0 1 1 1 0 1 ...
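The recoding above produces a numeric 0/1 column. An alternative, sketched here on a small made-up vector rather than the real data, is to keep Genre as a factor with explicit levels: for a binomial `glm()` with a factor response, the first level is treated as failure and the others as success, so putting "Non Fiction" first models the probability of "Fiction" as well.

```r
# Toy vector standing in for books$Genre (illustrative values only)
genre <- c("Non Fiction", "Fiction", "Non Fiction", "Fiction")

# Numeric 0/1 coding, as in the analysis
genre_numeric <- ifelse(genre == "Fiction", 1, 0)

# Factor coding with "Non Fiction" as the reference (first) level
genre_factor <- factor(genre, levels = c("Non Fiction", "Fiction"))

# Both codings mark the same rows as the modeled event
print(levels(genre_factor))
print(as.integer(genre_factor) - 1L)
print(genre_numeric)
```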

Logistic Regression Model

# Fit the logistic regression model
model <- glm(Genre ~ User.Rating + Price + Year, data = books, family = binomial())

# Summarize the model
summary(model)

## 
## Call:
## glm(formula = Genre ~ User.Rating + Price + Year, family = binomial(), 
##     data = books)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 104.06626   58.18713   1.788   0.0737 .  
## User.Rating   1.01415    0.42393   2.392   0.0167 *  
## Price        -0.04672    0.01165  -4.011 6.04e-05 ***
## Year         -0.05384    0.02910  -1.850   0.0643 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 753.53  on 549  degrees of freedom
## Residual deviance: 723.53  on 546  degrees of freedom
## AIC: 731.53
## 
## Number of Fisher Scoring iterations: 4
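The deviance lines in this output allow a quick overall check of the model against the intercept-only model. The sketch below uses the printed deviances and degrees of freedom directly, so it runs without the fitted object; treat it as a rough likelihood-ratio check, not part of the original analysis.

```r
# Values copied from the summary output
null_dev  <- 753.53   # null deviance, 549 df
resid_dev <- 723.53   # residual deviance, 546 df
df_diff   <- 549 - 546

# Likelihood-ratio statistic and its chi-squared p-value
lr_stat <- null_dev - resid_dev
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

# AIC = residual deviance + 2 * (number of estimated coefficients)
aic_check <- resid_dev + 2 * 4

print(lr_stat)
print(p_value)
print(aic_check)

# With the fitted object available, the same test would be:
# with(model, pchisq(null.deviance - deviance,
#                    df.null - df.residual, lower.tail = FALSE))
```

A small p-value here indicates that the three predictors together fit better than the intercept alone, although the drop in deviance (about 30 out of 753.53) is modest.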

Coefficient Interpretation

# Coefficient interpretations
coef_summary <- summary(model)$coefficients
coef_summary

##                 Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept) 104.06625617 58.18712834  1.788476 7.369932e-02
## User.Rating   1.01415204  0.42393379  2.392242 1.674582e-02
## Price        -0.04672179  0.01164777 -4.011222 6.040524e-05
## Year         -0.05383760  0.02909795 -1.850220 6.428189e-02
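To see what these estimates imply on the probability scale, the following sketch plugs them into the model equation for one hypothetical book. The rating, price, and year are made-up illustration values, not a row from the dataset.

```r
# Estimates copied from the coefficient table
beta <- c("(Intercept)" = 104.06625617,
          User.Rating   = 1.01415204,
          Price         = -0.04672179,
          Year          = -0.05383760)

# Hypothetical book (assumed values for illustration)
x <- c("(Intercept)" = 1, User.Rating = 4.7, Price = 10, Year = 2015)

# Linear predictor (log odds), then inverse logit for the probability
log_odds <- sum(beta * x)
prob_fiction <- plogis(log_odds)

print(log_odds)
print(prob_fiction)

# With the fitted object available, the equivalent call would be:
# predict(model, newdata = data.frame(User.Rating = 4.7, Price = 10,
#                                     Year = 2015), type = "response")
```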

Model Summary Interpretation:

(Intercept) Coefficient: The estimate of 104.06626 (standard error 58.18713, z = 1.788) is the log odds of a book being fiction when User.Rating, Price, and Year are all equal to zero. That point lies far outside the data (Year = 0 in particular), so the intercept is an extrapolation with no practical interpretation of its own. It is also not statistically significant at the 0.05 level (p = 0.0737).
User.Rating Coefficient: The estimate of 1.01415 means that for each one-point increase in user rating, the log odds of a book being fiction increase by about 1.014, holding Price and Year constant. Equivalently, the odds are multiplied by roughly exp(1.014) ≈ 2.76. The effect is statistically significant at the 0.05 level (p = 0.0167), which is evidence of a positive association between user rating and the probability of being fiction. Significance alone does not measure how strong that association is.
Price Coefficient: The estimate of -0.04672 with a p-value of 6.04e-05 indicates that higher prices are significantly associated with lower odds of a book being fiction. For every one-unit increase in price, the log odds of being fiction decrease by approximately 0.047, holding other factors constant. This corresponds to multiplying the odds by about exp(-0.047) ≈ 0.954, a decrease of roughly 4.6%.
Year Coefficient: The estimate of -0.05384 points to a downward trend over the years in the log odds of a book being fiction (about a 5% decrease in the odds per year), but with a p-value of 0.0643 the effect is not statistically significant at the 0.05 level. It is at most weak evidence of a shift in publishing trends or consumer preferences over time.
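Log odds are hard to read directly, so it can help to exponentiate the slopes into odds ratios. A minimal sketch using the estimates printed in the coefficient table (so it runs without refitting the model):

```r
# Slope estimates copied from the coefficient table
est <- c(User.Rating = 1.01415204,
         Price       = -0.04672179,
         Year        = -0.05383760)

# Odds ratio: multiplicative change in the odds per one-unit increase
odds_ratios <- exp(est)

# Percentage change in the odds per one-unit increase
pct_change <- 100 * (odds_ratios - 1)

print(round(odds_ratios, 3))
print(round(pct_change, 1))

# With the fitted object available, exp(coef(model)) gives the same ratios
```

An odds ratio above 1 means the odds of fiction rise as the predictor increases. A ratio below 1 means they fall.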

Confidence Interval for Coefficients

# Calculating confidence interval for the User.Rating coefficient
ci <- confint(model, "User.Rating", level = 0.95)

## Waiting for profiling to be done...

ci

##     2.5 %    97.5 % 
## 0.2001208 1.8661799

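The 95% confidence interval for the User.Rating coefficient, computed by profile likelihood, ranges from 0.200 to 1.866. We can be 95% confident that the true effect of a one-unit increase in user rating on the log odds of a book being fiction lies between 0.200 and 1.866. The interval excludes zero, which is consistent with the coefficient being significant at the 0.05 level. On the odds scale it corresponds to an odds ratio between roughly 1.22 and 6.46, so the effect is positive but imprecisely estimated.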
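For comparison, a Wald interval can be computed by hand from the estimate and standard error in the coefficient table. This is a sketch of the normal-approximation alternative, not the method `confint()` used above, which is why the two intervals differ slightly.

```r
# Estimate and standard error for User.Rating, copied from the output
est <- 1.01415204
se  <- 0.42393379

# Wald interval: estimate +/- z * standard error
z <- qnorm(0.975)
wald_ci <- est + c(-1, 1) * z * se

# Profile-likelihood interval reported by confint()
profile_ci <- c(0.2001208, 1.8661799)

print(round(wald_ci, 3))
print(round(profile_ci, 3))

# Either interval can be exponentiated to get an odds-ratio interval
print(round(exp(profile_ci), 2))

# With the fitted object available, confint.default(model, "User.Rating")
# returns the Wald interval directly
```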

week10

Shresta Reddy Nukala

2024-04-22