Week 6 | Data Dive — Confidence Intervals

bestsellers <- read.csv("bestsellers.csv")
str(bestsellers)

## 'data.frame':    550 obs. of  7 variables:
##  $ Name       : chr  "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
##  $ Author     : chr  "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
##  $ User.Rating: num  4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
##  $ Reviews    : int  17350 2052 18979 21424 7665 12643 19735 19699 5983 23848 ...
##  $ Price      : int  8 22 15 6 12 11 30 15 3 8 ...
##  $ Year       : int  2016 2011 2018 2017 2019 2011 2014 2017 2018 2016 ...
##  $ Genre      : chr  "Non Fiction" "Fiction" "Non Fiction" "Fiction" ...

Creating two variables

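Since we’re interested in the relationship between the number of reviews and the price of books, let’s create two new variables: the average number of reviews per year, and a revenue proxy equal to price times review count. The data set has no sales figures, so the proxy treats each review as one sale; it is a rough stand-in for revenue, not a measured amount.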

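# Create a new variable for average reviews per year
# (years are counted from the listed Year through 2024, inclusive)
bestsellers$Avg_Reviews_Per_Year <- bestsellers$Reviews / (2024 - bestsellers$Year + 1)

# Create a new variable for total revenue
# (a proxy: price times review count, since units sold are not recorded)
bestsellers$Total_Revenue <- bestsellers$Price * bestsellers$Reviews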

Relationship between Average Reviews per Year and Price

# Scatter plot
plot(bestsellers$Avg_Reviews_Per_Year, bestsellers$Price, 
     xlab = "Average Reviews per Year", ylab = "Price", 
     main = "Relationship between Average Reviews per Year and Price")

# Add a trend line (least-squares fit of Price on Avg_Reviews_Per_Year)
abline(lm(Price ~ Avg_Reviews_Per_Year, data = bestsellers), col = "red")

# Calculate correlation coefficient
cor_avg_reviews_price <- cor(bestsellers$Avg_Reviews_Per_Year, bestsellers$Price)
cor_avg_reviews_price

## [1] -0.1280697

The correlation coefficient of -0.128 indicates a weak negative linear association between average reviews per year and price. Higher-priced books tend to have slightly fewer reviews per year, but the relationship accounts for under 2% of the variance (r² ≈ 0.016).
The scatter plot shows a widely dispersed pattern, which is consistent with a weak correlation.
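A point estimate of the correlation says nothing about how precisely it is estimated. The following is a minimal sketch of how a 95% confidence interval for r could be obtained with cor.test() from the stats package. The data frame toy and its columns are hypothetical stand-ins generated at random, not the bestsellers data.

```r
# Hypothetical toy data (assumption: NOT the bestsellers file)
set.seed(42)
toy <- data.frame(
  reviews_per_year = runif(200, min = 100, max = 5000),
  price            = runif(200, min = 5, max = 30)
)

# Pearson correlation test; conf.int holds the interval for r
ct <- cor.test(toy$reviews_per_year, toy$price, conf.level = 0.95)

ct$estimate   # sample correlation
ct$conf.int   # lower and upper bound of the 95% interval
```

With the real data, the equivalent call would be cor.test(bestsellers$Avg_Reviews_Per_Year, bestsellers$Price). An interval that excludes zero would suggest the weak negative association is unlikely to be sampling noise alone.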

Relationship between Total Revenue and Price

# Scatter plot
plot(bestsellers$Total_Revenue, bestsellers$Price, 
     xlab = "Total Revenue", ylab = "Price", 
     main = "Relationship between Total Revenue and Price")

# Add a trend line (least-squares fit of Price on Total_Revenue)
abline(lm(Price ~ Total_Revenue, data = bestsellers), col = "blue")

# Calculate correlation coefficient
cor_total_revenue_price <- cor(bestsellers$Total_Revenue, bestsellers$Price)
cor_total_revenue_price

## [1] 0.4059496

The correlation coefficient of 0.406 indicates a moderate positive correlation between total revenue and price: higher-priced books tend to generate more revenue, although the relationship is far from perfect. Part of this association is built in, because Total_Revenue is computed as Price times Reviews, so price appears on both sides of the comparison.
The scatter plot shows a more clustered pattern, indicating a more defined relationship than in the first pair of variables.
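Because Total_Revenue is defined as Price times Reviews, some positive correlation with Price is expected even when price and reviews are unrelated. The sketch below illustrates this with simulated values. The variables price, reviews and revenue_proxy are hypothetical and drawn at random, so the numbers say nothing about the bestsellers data itself.

```r
# Hypothetical simulation: price and reviews are generated independently
set.seed(1)
n <- 2000
price   <- runif(n, min = 5, max = 30)         # hypothetical prices
reviews <- runif(n, min = 1000, max = 20000)   # independent of price

# Same construction as Total_Revenue in the analysis
revenue_proxy <- price * reviews

cor(price, reviews)         # close to 0: the two inputs are unrelated
cor(price, revenue_proxy)   # clearly positive, purely by construction
```

The takeaway is that a moderate correlation between price and a price-derived variable is weaker evidence of a real relationship than the same correlation between two separately measured variables.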

Building Confidence Intervals

# sd() and qt() come from the stats package, which R attaches by default,
# so this call is optional
library(stats)

# Calculate mean and standard deviation of price
mean_price <- mean(bestsellers$Price)
sd_price <- sd(bestsellers$Price)

# Calculate standard error of the mean
se_price <- sd_price / sqrt(length(bestsellers$Price))

# Calculate t-value for 95% confidence interval
# (assumes the sample mean is approximately normally distributed)
t_value <- qt(0.975, df = length(bestsellers$Price) - 1)

# Calculate margin of error
margin_of_error <- t_value * se_price

# Calculate confidence interval
lower_bound <- mean_price - margin_of_error
upper_bound <- mean_price + margin_of_error

# Print confidence interval
cat("95% Confidence Interval for Price: [", lower_bound, ",", upper_bound, "]\n")

## 95% Confidence Interval for Price: [ 12.19188 , 14.00812 ]
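The manual steps above can be cross-checked against t.test() from the stats package, which reports the same t-based interval in its conf.int element. This is a minimal sketch. The vector prices holds only the first ten Price values shown in the str() output, so its interval will differ from the full-sample result.

```r
# First ten prices from the str() output (a small illustrative subset)
prices <- c(8, 22, 15, 6, 12, 11, 30, 15, 3, 8)

# Manual t-based 95% interval, following the same steps as above
n  <- length(prices)
se <- sd(prices) / sqrt(n)
manual_ci <- mean(prices) + c(-1, 1) * qt(0.975, df = n - 1) * se

# The same interval computed by t.test()
builtin_ci <- as.numeric(t.test(prices, conf.level = 0.95)$conf.int)

manual_ci
builtin_ci
```

On the full data set, t.test(bestsellers$Price)$conf.int should reproduce the bounds computed by hand.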

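Based on the 95% confidence interval, we can be 95% confident that the true population mean price of books lies between approximately $12.19 and $14.01. Strictly, the 95% describes the procedure: intervals built this way from repeated samples would contain the true mean about 95% of the time. The interval is narrow (about $0.91 either side of the sample mean of $13.10) because the sample is large (n = 550), which makes it a usable baseline for pricing comparisons.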
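What a 95% confidence level means can be illustrated by simulation: draw many samples from a population with a known mean, build an interval from each, and count how often the interval contains that mean. The sketch below does this under assumed values. The names true_mean, n_sims and n, and the normal population itself, are hypothetical choices for illustration.

```r
# Hypothetical population: normal with a known mean (an assumption)
set.seed(123)
true_mean <- 13
n_sims    <- 2000   # number of repeated samples
n         <- 50     # size of each sample

# For each sample, check whether its 95% t-interval contains the true mean
covered <- replicate(n_sims, {
  x  <- rnorm(n, mean = true_mean, sd = 10)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

mean(covered)   # share of intervals that contain the true mean
```

The share should land near 0.95, which is the sense in which any single interval is described as a 95% confidence interval.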

Furthermore, because the correlations between price and the other variables (average reviews per year and total revenue) are only weak to moderate, price alone says little about review volume or revenue. Pricing decisions should therefore be made in conjunction with other factors, such as genre preferences, author popularity, and marketing strategies.

Shresta

2024-04-05
