Loading all the necessary libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggplot2) 
library(ggthemes)
library(purrr)
library(pwr)
library(stats)
library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(broom)

Load the Dataset

books <- read.csv("bestsellers.csv")

Check the structure of the data

str(books)

## 'data.frame':    550 obs. of  7 variables:
##  $ Name       : chr  "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
##  $ Author     : chr  "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
##  $ User.Rating: num  4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
##  $ Reviews    : int  17350 2052 18979 21424 7665 12643 19735 19699 5983 23848 ...
##  $ Price      : int  8 22 15 6 12 11 30 15 3 8 ...
##  $ Year       : int  2016 2011 2018 2017 2019 2011 2014 2017 2018 2016 ...
##  $ Genre      : chr  "Non Fiction" "Fiction" "Non Fiction" "Fiction" ...

Let's build a linear model, fit by ordinary least squares with lm(), that predicts Price from User.Rating, Reviews, and Year.

Building the model

# Building the linear model
lm_price <- lm(Price ~ User.Rating + Reviews + Year, data = books)

# Summary of the model
summary(lm_price)

## 
## Call:
## lm(formula = Price ~ User.Rating + Reviews + Year, data = books)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.392  -5.778  -2.381   2.626  90.896 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  7.726e+02  3.075e+02   2.512   0.0123 *
## User.Rating -5.131e+00  2.070e+00  -2.478   0.0135 *
## Reviews     -7.513e-05  4.028e-05  -1.865   0.0627 .
## Year        -3.649e-01  1.539e-01  -2.371   0.0181 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.65 on 546 degrees of freedom
## Multiple R-squared:  0.03957,    Adjusted R-squared:  0.03429 
## F-statistic: 7.499 on 3 and 546 DF,  p-value: 6.321e-05

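The overall F-test (p = 6.3e-05) shows that the predictors jointly explain some variation in Price. User.Rating and Year are individually significant at the 5% level, while Reviews is only marginally so (p = 0.063). However, the R-squared of 0.040 means the model accounts for only about 4% of the variation in book prices, so most of that variation is driven by factors other than user ratings, reviews, and year.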
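The summary statistics reported for the model are linked by simple formulas, which makes them easy to sanity-check. The sketch below recomputes the adjusted R-squared and the F-statistic from the printed R-squared, the sample size (550), and the number of predictors (3). The variable names are illustrative, and the inputs are the rounded values from the output, so the results should agree only up to rounding.

```r
# Values copied from the printed summary(lm_price) output (rounded)
r2 <- 0.03957   # Multiple R-squared
n  <- 550       # number of observations
p  <- 3         # number of predictors (User.Rating, Reviews, Year)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic: explained variation per predictor divided by
# unexplained variation per residual degree of freedom
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
f_pval <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

round(c(adj_r2 = adj_r2, f_stat = f_stat), 4)
f_pval
```

Both values should land close to the 0.03429 and 7.499 reported by summary().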

Model Diagnostics

# Checking for multicollinearity
vif(lm_price)

## User.Rating     Reviews        Year 
##    1.067664    1.079957    1.147361
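All three VIF values are close to 1, so multicollinearity does not appear to be a concern here (a common rule of thumb treats values above roughly 5 to 10 as a warning sign). For a model with only numeric predictors, the VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on the others. As a rough illustration, the sketch below inverts that formula using the printed values; the variable names are illustrative.

```r
# VIF values copied from the vif(lm_price) output
vif_values <- c(User.Rating = 1.067664, Reviews = 1.079957, Year = 1.147361)

# Share of each predictor's variance explained by the other predictors
r2_aux <- 1 - 1 / vif_values

# Factor by which each coefficient's standard error is inflated
se_inflation <- sqrt(vif_values)

round(rbind(r2_aux, se_inflation), 3)
```

Under these numbers, the other predictors explain well under 15% of the variance of any single predictor.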

# Plotting residuals to check for homoscedasticity and normality
par(mfrow=c(2,2))
plot(lm_price)

Residuals vs Fitted Plot: This plot checks the assumptions of linearity and homoscedasticity. Ideally, it should show a random scatter of points around zero. Here there appears to be a slight pattern, indicating potential non-linearity or non-constant variance (heteroscedasticity) of the residuals.
Normal Q-Q Plot: This plot checks the assumption that the residuals are normally distributed. If they are, the points should fall approximately along the reference line. The Q-Q plot deviates from the line at the ends, suggesting that the residuals have heavy tails and may not be normally distributed. This is consistent with the residual summary, where the maximum (90.9) is far larger in magnitude than the minimum (-15.4).
Scale-Location Plot: This plot shows whether the residuals are spread equally across the range of fitted values (homoscedasticity). The pattern seen here, with the spread increasing for larger fitted values, indicates possible heteroscedasticity.
Residuals vs Leverage Plot: This plot helps identify cases that might have an undue influence on the model. It shows a few points with higher leverage, but they do not appear influential enough to change the regression line substantially.
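Visual impressions from the four plots can be backed up with numbers. The sketch below is one possible way to do that with base R only. `diagnose_lm` is a hypothetical helper, not part of the original analysis. It runs a Shapiro-Wilk test on the residuals, a studentized (Koenker) Breusch-Pagan-style test that regresses the squared residuals on the fitted values, and counts points whose Cook's distance exceeds the 4/n rule of thumb.

```r
# Hypothetical helper (not part of the original analysis): numeric
# companions to the four diagnostic plots, using base R only.
diagnose_lm <- function(model) {
  res <- residuals(model)
  fit <- fitted(model)
  n   <- length(res)

  # Normality of residuals (Shapiro-Wilk; requires 3 <= n <= 5000)
  shapiro_p <- shapiro.test(res)$p.value

  # Studentized Breusch-Pagan-style test: regress squared residuals on
  # the fitted values; n * R^2 is approximately chi-squared with 1 df
  aux     <- lm(I(res^2) ~ fit)
  bp_stat <- n * summary(aux)$r.squared
  bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

  # Influence: points whose Cook's distance exceeds the 4/n rule of thumb
  n_influential <- sum(cooks.distance(model) > 4 / n)

  list(n = n, shapiro_p = shapiro_p, bp_p = bp_p,
       n_influential = n_influential)
}

# Apply to the fitted model if it exists in the current session
if (exists("lm_price")) print(diagnose_lm(lm_price))
```

Small p-values would support the reading of the Q-Q and Scale-Location plots. Since car is already loaded, `ncvTest(lm_price)` offers a related, non-studentized score test for non-constant variance.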

Interpretation of Coefficients

# Extracting coefficients
coef_summary <- summary(lm_price)$coefficients
coef_summary

##                  Estimate   Std. Error   t value   Pr(>|t|)
## (Intercept)  7.725638e+02 3.075369e+02  2.512101 0.01228876
## User.Rating -5.130624e+00 2.070070e+00 -2.478479 0.01349548
## Reviews     -7.513389e-05 4.028277e-05 -1.865162 0.06269487
## Year        -3.648812e-01 1.538901e-01 -2.371050 0.01808377

# Focusing on the 'User.Rating' coefficient
user_rating_coef <- coef(lm_price)["User.Rating"]
cat("For each one-point increase in user ratings, the book's price is expected to change by", user_rating_coef, "units.")

## For each one-point increase in user ratings, the book's price is expected to change by -5.130624 units.

The estimated User.Rating coefficient is -5.131 (p = 0.013). Holding Reviews and Year constant, each additional point in user rating is associated with a price that is approximately $5.13 lower. This is unexpected, since one might assume that higher-rated books could command higher prices. The estimate describes an association rather than a causal effect, and given the low R-squared and the diagnostic concerns noted earlier, it should be read with caution. The negative relationship could reflect a variety of market factors and warrants further investigation.
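A full one-point change in rating is large relative to the ratings in this data (the first few values shown by str() fall between 4.4 and 4.8), so it can help to express the effect per 0.1 rating point and to attach a confidence interval. The sketch below does this using only the printed estimate, standard error, and residual degrees of freedom; the variable names are illustrative.

```r
# Values copied from the printed coefficient table and model summary
est      <- -5.130624   # User.Rating estimate
se       <- 2.070070    # its standard error
df_resid <- 546         # residual degrees of freedom

# 95% confidence interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci     <- est + c(-1, 1) * t_crit * se

# Expected price change for a 0.1-point increase in rating
per_tenth <- est * 0.1

round(c(lower = ci[1], upper = ci[2], per_tenth = per_tenth), 3)
```

The interval should match `confint(lm_price)["User.Rating", ]` up to rounding. It excludes zero but is wide, which fits the modest significance level of this coefficient.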

week11

Shresta Reddy Nukala

2024-04-22