Research Question 1: Does the average rating of books differ significantly between two different genres (fiction and nonfiction)?

Data: Sourced from Kaggle: https://www.kaggle.com/datasets/thedevastator/books-sales-and-ratings/data The dataset was trimmed by deleting the “kids” and “genre fiction” categories, leaving only the “fiction” and “nonfiction” rows.

# Load the data
books_data <- read.csv("C:\\Users\\Stepa\\Documents\\Documents\\School\\SEB\\RMaT\\Homework\\books_data.csv")

# Descriptive Statistics
# Display a summary of the Book_average_rating variable
summary(books_data$Book_average_rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.230   3.900   4.030   4.027   4.200   4.610
# Display the frequency table of the 'genre' variable
table(books_data$genre)
## 
##    fiction nonfiction 
##         62        171
# Research Question 1: Does the average rating of books differ significantly between two different genres?

# Hypothesis:
# H0: The mean rating of books is the same in the two genres (difference in means = 0).
# H1: The mean rating of books differs between the two genres (difference in means != 0).

# Parametric Test (t-test)
# Perform a t-test comparing Book_average_rating between different genres
t_test_result <- t.test(Book_average_rating ~ genre, data = books_data)
t_test_result
## 
##  Welch Two Sample t-test
## 
## data:  Book_average_rating by genre
## t = 0.52568, df = 132.25, p-value = 0.6
## alternative hypothesis: true difference in means between group fiction and group nonfiction is not equal to 0
## 95 percent confidence interval:
##  -0.04976920  0.08579636
## sample estimates:
##    mean in group fiction mean in group nonfiction 
##                 4.040645                 4.022632
# Non-parametric Test (Wilcoxon Rank Sum Test)
# Perform a Wilcoxon Rank Sum Test comparing Book_average_rating between different genres
wilcox_test_result <- wilcox.test(Book_average_rating ~ genre, data = books_data)
wilcox_test_result
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Book_average_rating by genre
## W = 5406.5, p-value = 0.8173
## alternative hypothesis: true location shift is not equal to 0
# Decision:
# Compare the p-values from both tests. If assumptions for the t-test are met, use it. Otherwise, use the Wilcoxon Rank Sum Test.

# Explanation:
# Interpret the p-value and confidence interval. If p-value < 0.05, reject the null hypothesis and conclude a significant difference in average rating between the two genres.

cat("\n\nExplanations:\n")
## 
## 
## Explanations:
# Evaluate t-test results
if (t_test_result$p.value < 0.05) {
  cat("Based on the t-test, we reject the null hypothesis. There is a significant difference in average rating between the two genres.\n")
} else {
  cat("Based on the t-test, we fail to reject the null hypothesis. There is no significant difference in average rating between the two genres.\n")
}
## Based on the t-test, we fail to reject the null hypothesis. There is no significant difference in average rating between the two genres.
# Evaluate Wilcoxon Rank Sum Test results
if (wilcox_test_result$p.value < 0.05) {
  cat("Based on the Wilcoxon Rank Sum Test, we reject the null hypothesis. There is a significant difference in average rating between the two genres.\n")
} else {
  cat("Based on the Wilcoxon Rank Sum Test, we fail to reject the null hypothesis. There is no significant difference in average rating between the two genres.\n")
}
## Based on the Wilcoxon Rank Sum Test, we fail to reject the null hypothesis. There is no significant difference in average rating between the two genres.

Conclusion:

Neither test provides statistically significant evidence of a difference in average book ratings between fiction and nonfiction in this dataset (Welch t-test: t = 0.526, df = 132.25, p = 0.6; Wilcoxon Rank Sum Test: W = 5406.5, p = 0.817). The 95% confidence interval for the difference in means, -0.050 to 0.086, includes zero. Because the parametric and non-parametric tests agree, the conclusion does not hinge on the distributional assumptions of either test.
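
As a check on how the reported numbers fit together, the 95% confidence interval and p-value can be rebuilt from the group means, the t statistic, and the Welch degrees of freedom printed in the t-test output. This is only a sketch: the values are copied by hand from that output, and the variable names are illustrative.

```r
# Values copied from the Welch t-test output
mean_fiction    <- 4.040645
mean_nonfiction <- 4.022632
t_stat          <- 0.52568
df_welch        <- 132.25

# Difference in means and the standard error implied by t = diff / SE
mean_diff <- mean_fiction - mean_nonfiction
std_error <- mean_diff / t_stat

# 95% confidence interval: diff +/- t-quantile * SE
t_crit   <- qt(0.975, df = df_welch)
ci_lower <- mean_diff - t_crit * std_error
ci_upper <- mean_diff + t_crit * std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

cat("Difference in means:", round(mean_diff, 4), "\n")
cat("95% CI:", round(ci_lower, 4), "to", round(ci_upper, 4), "\n")
cat("p-value:", round(p_value, 2), "\n")
```

The rebuilt interval should agree with the printed one up to rounding of the copied inputs, and it spans zero, which is the same information the p-value conveys.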

The Wilcoxon Rank Sum Test may be the more appropriate of the two because it does not assume normally distributed ratings, and normality was not formally checked here. Either way, we fail to reject the null hypothesis: the data do not show a difference in average ratings between fiction and nonfiction, which is not the same as showing that the two genres are rated equally.
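
The choice between the two tests could be grounded in an explicit normality check per genre. The sketch below is one way to do that; it runs on simulated ratings (the data frame `sim_books` is a made-up stand-in with the same column names and group sizes as `books_data`, not the real data), and the 0.05 cut-off is an assumption.

```r
# Simulated stand-in for books_data (NOT the real ratings)
set.seed(42)
sim_books <- data.frame(
  Book_average_rating = c(rnorm(62, mean = 4.04, sd = 0.25),
                          rnorm(171, mean = 4.02, sd = 0.25)),
  genre = rep(c("fiction", "nonfiction"), times = c(62, 171))
)

# Shapiro-Wilk test within each genre; small p-values suggest non-normality
normality_p <- tapply(sim_books$Book_average_rating, sim_books$genre,
                      function(x) shapiro.test(x)$p.value)
print(normality_p)

# Choose the test based on the assumption check
if (all(normality_p > 0.05)) {
  chosen <- t.test(Book_average_rating ~ genre, data = sim_books)
} else {
  chosen <- wilcox.test(Book_average_rating ~ genre, data = sim_books)
}
print(chosen$method)
```

To apply this to the real data, `sim_books` would be replaced with `books_data`. With group sizes of 62 and 171, the t-test is also fairly tolerant of moderate non-normality, so the two tests will often agree, as they do here.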

Research Question 2: Can we predict IMDb scores based on the release year and runtime of Netflix movies and TV shows?

Data: Sourced from Kaggle: https://www.kaggle.com/datasets/thedevastator/netflix-imdb-scores?resource=download The dataset was trimmed by deleting entries with fewer than 60 reviews to stay under 5000 entries; otherwise we could not have used the Shapiro-Wilk test, because R's shapiro.test() accepts at most 5000 observations.
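
The 5000-observation limit can be seen directly in base R. The sketch below uses simulated values (not the Netflix data) to show the error that `shapiro.test()` raises on a longer vector, and one possible alternative to trimming the dataset: testing a random subsample of 5000 values. The subsampling approach is an assumption for illustration, not what was done in this analysis.

```r
# Simulated values; 6000 exceeds shapiro.test()'s limit of 5000
set.seed(123)
sim_values <- rnorm(6000)

# Calling shapiro.test() on the full vector fails; capture the error message
full_attempt <- tryCatch(
  shapiro.test(sim_values),
  error = function(e) conditionMessage(e)
)
print(full_attempt)

# Alternative to trimming the dataset: test a random subsample of 5000 values
sub_values <- sample(sim_values, size = 5000)
sub_result <- shapiro.test(sub_values)
print(sub_result)
```

Subsampling keeps the full dataset available for the regression itself, at the cost of a normality result that varies slightly with the random seed.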

Justification of Explanatory Variables:

# Load necessary libraries
library(tidyverse)

# Load the Netflix dataset
# Adjust this path to the location of netflix_data.csv on your machine
netflix_data <- read.csv("C:\\Users\\Stepa\\Documents\\Documents\\School\\SEB\\RMaT\\Homework\\netflix_data.csv")

# Descriptive statistics
# Display a summary of selected variables: imdb_score, release_year, and runtime
summary(netflix_data[c("imdb_score", "release_year", "runtime")])
##    imdb_score     release_year     runtime      
##  Min.   :1.500   Min.   :1953   Min.   :  0.00  
##  1st Qu.:5.800   1st Qu.:2015   1st Qu.: 46.00  
##  Median :6.600   Median :2018   Median : 88.00  
##  Mean   :6.545   Mean   :2016   Mean   : 80.05  
##  3rd Qu.:7.400   3rd Qu.:2020   3rd Qu.:106.00  
##  Max.   :9.600   Max.   :2022   Max.   :235.00
# Linear Regression Analysis
# Fit a linear regression model predicting imdb_score based on release_year and runtime
linear_model <- lm(imdb_score ~ release_year + runtime, data = netflix_data)

# Check assumptions
# Set up a 2x2 grid of plots to visually assess assumptions
par(mfrow = c(2, 2))
plot(linear_model)

# Summary of the linear model
# Display detailed information about the linear regression model
summary(linear_model)
## 
## Call:
## lm(formula = imdb_score ~ release_year + runtime, data = netflix_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1330 -0.6530  0.0910  0.7885  2.9352 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  55.5211680  4.4314592   12.53   <2e-16 ***
## release_year -0.0240587  0.0021948  -10.96   <2e-16 ***
## runtime      -0.0059598  0.0004177  -14.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.118 on 4985 degrees of freedom
## Multiple R-squared:  0.05149,    Adjusted R-squared:  0.05111 
## F-statistic: 135.3 on 2 and 4985 DF,  p-value: < 2.2e-16
# Assumption checks
# Perform a Shapiro-Wilk test for normality of residuals
shapiro.test(residuals(linear_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(linear_model)
## W = 0.97552, p-value < 2.2e-16
# Plot residuals against fitted values to check for homoscedasticity
# (which = 1 is Residuals vs Fitted; which = 2 would draw the Normal Q-Q plot)
plot(linear_model, which = 1)

# Make predictions
# Generate predicted IMDb scores using the linear regression model
predicted_scores <- predict(linear_model)

# Visualize the fit
# Create a scatter plot comparing actual IMDb scores with predicted scores
plot(netflix_data$imdb_score, predicted_scores, main = "Actual vs Predicted IMDb Scores",
     xlab = "Actual IMDb Scores", ylab = "Predicted IMDb Scores")

# Assess model performance
# Calculate Root Mean Squared Error (RMSE) as a measure of model accuracy
# Use a distinct name so the residuals() function is not shadowed
model_residuals <- residuals(linear_model)
rmse <- sqrt(mean(model_residuals^2))
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 1.118134
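
The residuals-versus-fitted plot is a visual check only. A numerical complement, not used in the analysis above, is the studentized (Koenker) Breusch-Pagan test, which can be computed in base R by regressing the squared residuals on the predictors. The sketch below runs on simulated data (`sim_netflix` is a made-up data frame that is heteroscedastic by construction), so its output says nothing about the real Netflix model.

```r
# Simulated data with the same variable names as the Netflix model (NOT real data)
set.seed(7)
n <- 500
sim_netflix <- data.frame(
  release_year = sample(1990:2022, n, replace = TRUE),
  runtime      = runif(n, min = 20, max = 180)
)
# Error spread grows with runtime, so the data are heteroscedastic by construction
sim_netflix$imdb_score <- 6.5 - 0.005 * sim_netflix$runtime +
  rnorm(n, sd = 0.2 + 0.01 * sim_netflix$runtime)

sim_model <- lm(imdb_score ~ release_year + runtime, data = sim_netflix)

# Auxiliary regression: squared residuals on the same predictors
aux_model <- lm(residuals(sim_model)^2 ~ release_year + runtime, data = sim_netflix)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary model,
# compared against a chi-squared distribution with one df per predictor
bp_stat <- n * summary(aux_model)$r.squared
bp_df   <- length(coef(aux_model)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

cat("BP statistic:", round(bp_stat, 2), "df:", bp_df, "p-value:", signif(bp_p, 3), "\n")
```

A small p-value points to non-constant residual variance. For the real model, `sim_model` and `sim_netflix` would be replaced with `linear_model` and `netflix_data`, and `n` with the number of observations used in the fit.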

Conclusion:

Assumption Checks:

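  1. Normality of Residuals: The Shapiro-Wilk test rejects normality of the residuals (W = 0.97552, p < 2.2e-16). With close to 5000 observations the test detects even small departures from normality, so the Normal Q-Q plot should be read alongside it.
  2. Homoscedasticity: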

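Coefficients: Both predictors have negative, statistically significant coefficients (p < 2e-16). Holding runtime constant, each later release year is associated with an IMDb score about 0.024 points lower; holding release year constant, each additional minute of runtime is associated with a score about 0.006 points lower.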
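
To make the size of the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below copies the estimates from the regression summary and applies them to a hypothetical title with the median release year (2018) and median runtime (88 minutes) from the descriptive statistics; the variable names are illustrative.

```r
# Coefficients copied from the regression summary
intercept <- 55.5211680
b_year    <- -0.0240587
b_runtime <- -0.0059598

# Hypothetical title: median release year and runtime from the descriptive table
example_year    <- 2018
example_runtime <- 88

# Fitted equation: score = intercept + b_year * year + b_runtime * runtime
predicted_score <- intercept + b_year * example_year + b_runtime * example_runtime
cat("Predicted IMDb score:", round(predicted_score, 2), "\n")

# Change in predicted score for a title 10 years newer and 30 minutes longer
score_change <- b_year * 10 + b_runtime * 30
cat("Change in predicted score:", round(score_change, 2), "\n")
```

Even a 10-year and 30-minute shift moves the prediction by less than half a point, which is small next to the residual standard error of 1.118.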

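Adjusted R-squared: 0.05111, so release year and runtime together explain only about 5% of the variance in IMDb scores.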
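
The adjusted R-squared and the F-statistic both follow from the multiple R-squared and the degrees of freedom in the regression summary. The sketch below recomputes them from the printed values as a consistency check; the inputs are copied by hand, so small rounding differences are expected.

```r
# Values copied from the regression summary
r_squared    <- 0.05149
n_predictors <- 2
resid_df     <- 4985
n_obs        <- resid_df + n_predictors + 1  # observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / resid_df

# Overall F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom
f_statistic <- (r_squared / n_predictors) / ((1 - r_squared) / resid_df)

cat("Adjusted R-squared:", round(adj_r_squared, 5), "\n")
cat("F-statistic:", round(f_statistic, 1), "\n")
```

With almost 5000 observations and only two predictors the adjustment is tiny, which is why the multiple and adjusted R-squared are nearly identical.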

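Final conclusion: Release year and runtime are statistically significant predictors of IMDb score (F = 135.3 on 2 and 4985 DF, p < 2.2e-16), but together they explain only about 5% of the variance. The RMSE of 1.118 points is large relative to the 1.6-point interquartile range of the scores (5.8 to 7.4), so the model is of limited practical use for predicting IMDb scores.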