Homework Research Methods and Techniques

Research Question 1: Does the average rating of books differ significantly between two different genres (fiction and nonfiction)?

Data: Sourced from Kaggle: https://www.kaggle.com/datasets/thedevastator/books-sales-and-ratings/data The dataset was trimmed by deleting the “kids” and “genre fiction” categories, leaving only the “fiction” and “nonfiction” rows.
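The trimming step can also be done in R rather than by hand, which keeps it reproducible. The following is a minimal sketch under stated assumptions: `raw_books` is a made-up stand-in for the raw Kaggle export, and the raw genre labels are assumed to match the names used in this report.

```r
# Toy stand-in for the raw export; rows and raw genre labels are assumptions.
raw_books <- data.frame(
  genre = c("fiction", "nonfiction", "kids", "genre fiction", "fiction"),
  Book_average_rating = c(4.1, 3.9, 4.3, 4.0, 3.8)
)

# Keep only the two genres compared in this analysis
books_subset <- subset(raw_books, genre %in% c("fiction", "nonfiction"))

# Store genre as a factor with exactly the two remaining levels
books_subset$genre <- droplevels(factor(books_subset$genre))

# Frequency table of what is left
table(books_subset$genre)
```

Filtering in code also documents exactly which categories were excluded.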

# Load the data
books_data <- read.csv("C:\\Users\\Stepa\\Documents\\Documents\\School\\SEB\\RMaT\\Homework\\books_data.csv")

# Descriptive Statistics
# Display a summary of the Book_average_rating variable
summary(books_data$Book_average_rating)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.230   3.900   4.030   4.027   4.200   4.610

# Display the frequency table of the 'genre' variable
table(books_data$genre)

## 
##    fiction nonfiction 
##         62        171

# Research Question 1: Does the average rating of books differ significantly between two different genres?

# Hypothesis:
# H0: There is no significant difference in the average rating of books between the two genres.
# H1: There is a significant difference in the average rating of books between the two genres.

# Parametric Test (t-test)
# Perform a t-test comparing Book_average_rating between different genres
t_test_result <- t.test(Book_average_rating ~ genre, data = books_data)
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  Book_average_rating by genre
## t = 0.52568, df = 132.25, p-value = 0.6
## alternative hypothesis: true difference in means between group fiction and group nonfiction is not equal to 0
## 95 percent confidence interval:
##  -0.04976920  0.08579636
## sample estimates:
##    mean in group fiction mean in group nonfiction 
##                 4.040645                 4.022632

# Non-parametric Test (Wilcoxon Rank Sum Test)
# Perform a Wilcoxon Rank Sum Test comparing Book_average_rating between different genres
wilcox_test_result <- wilcox.test(Book_average_rating ~ genre, data = books_data)
wilcox_test_result

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Book_average_rating by genre
## W = 5406.5, p-value = 0.8173
## alternative hypothesis: true location shift is not equal to 0

# Decision:
# Compare the p-values from both tests. If assumptions for the t-test are met, use it. Otherwise, use the Wilcoxon Rank Sum Test.

# Explanation:
# Interpret the p-value and confidence interval. If p-value < 0.05, reject the null hypothesis and conclude a significant difference in average rating between the two genres.

cat("\n\nExplanations:\n")

## 
## 
## Explanations:

# Evaluate t-test results
if (t_test_result$p.value < 0.05) {
  cat("Based on the t-test, we reject the null hypothesis. There is a significant difference in average rating between the two genres.\n")
} else {
  cat("Based on the t-test, we fail to reject the null hypothesis. There is no significant difference in average rating between the two genres.\n")
}

## Based on the t-test, we fail to reject the null hypothesis. There is no significant difference in average rating between the two genres.

# Evaluate Wilcoxon Rank Sum Test results
if (wilcox_test_result$p.value < 0.05) {
  cat("Based on the Wilcoxon Rank Sum Test, we reject the null hypothesis. There is a significant difference in average rating between the two genres.\n")
} else {
  cat("Based on the Wilcoxon Rank Sum Test, we fail to reject the null hypothesis. There is no significant difference in average rating between the two genres.\n")
}

## Based on the Wilcoxon Rank Sum Test, we fail to reject the null hypothesis. There is no significant difference in average rating between the two genres.

Conclusion:

Both tests were conducted to check for a significant difference in average ratings between the genres.
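The t-test assumes approximately normal data within each group; because R's t.test() runs Welch's version by default, it does not additionally assume equal variances. The Wilcoxon Rank Sum Test is non-parametric and does not assume normality.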
Based on the p-values, neither test provides evidence to reject the null hypothesis.
For the t-test, the p-value is 0.6, which is greater than the typical significance level of 0.05.
Hence, we fail to reject the null hypothesis, suggesting no significant difference in average ratings.
The Wilcoxon Rank Sum Test also yields a p-value of 0.8173, reinforcing the conclusion that there is no significant difference in average ratings between fiction and nonfiction genres.
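The t-test output above is labelled "Welch Two Sample t-test" because t.test() defaults to var.equal = FALSE. The sketch below contrasts that default with the pooled-variance version. It uses simulated ratings with the same group sizes as the data (62 and 171); the means and standard deviations are illustrative assumptions, not values from the dataset.

```r
set.seed(1)

# Simulated ratings; group sizes follow the data, the distributions are assumed.
ratings <- data.frame(
  genre = rep(c("fiction", "nonfiction"), times = c(62, 171)),
  rating = c(rnorm(62, mean = 4.04, sd = 0.20),
             rnorm(171, mean = 4.02, sd = 0.30))
)

# Default: Welch's t-test, no equal-variance assumption
welch <- t.test(rating ~ genre, data = ratings)

# Classical Student t-test with a pooled variance estimate
pooled <- t.test(rating ~ genre, data = ratings, var.equal = TRUE)

# Welch uses fractional degrees of freedom; the pooled test uses n1 + n2 - 2
c(welch = unname(welch$parameter), pooled = unname(pooled$parameter))
c(welch = welch$p.value, pooled = pooled$p.value)
```

When the group variances are similar the two versions give nearly identical p-values, so the Welch default costs little.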

There is no statistically significant evidence to suggest a difference in average book ratings between the fiction and nonfiction genres based on the chosen dataset. Both parametric and non-parametric tests lead to the same conclusion, and the decision to use both tests helps ensure robustness in the face of assumptions about the data distribution.

Because the normality of the ratings within each genre was not formally checked here, the Wilcoxon Rank Sum Test is arguably the safer choice: it does not assume normality and is robust against non-normal distributions. Either way, the conclusion is unchanged: there is no evidence of a significant difference in average ratings between fiction and nonfiction.
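Whether the t-test's normality assumption is plausible can be checked per genre before choosing between the two tests. A minimal sketch, assuming simulated ratings in place of the real data (the column names `genre` and `rating` and all values are made up for illustration):

```r
set.seed(2)

# Simulated ratings with the group sizes from the frequency table (62 and 171)
ratings <- data.frame(
  genre = rep(c("fiction", "nonfiction"), times = c(62, 171)),
  rating = c(rnorm(62, mean = 4.04, sd = 0.25),
             rnorm(171, mean = 4.02, sd = 0.25))
)

# Shapiro-Wilk p-value within each genre
normality_p <- tapply(
  ratings$rating, ratings$genre,
  function(x) shapiro.test(x)$p.value
)
normality_p

# Standard deviation per genre, as a rough look at the spread in each group
spread <- tapply(ratings$rating, ratings$genre, sd)
spread
```

A small p-value in either group would favour the Wilcoxon Rank Sum Test; otherwise the Welch t-test is a reasonable choice.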

Research Question 2: Can we predict IMDb scores based on the release year and runtime of Netflix movies and TV shows?

Data: Sourced from Kaggle: https://www.kaggle.com/datasets/thedevastator/netflix-imdb-scores?resource=download Titles with fewer than 60 reviews were removed to bring the dataset under 5,000 entries, because R's shapiro.test() accepts at most 5,000 observations.
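The size limit of shapiro.test() can be seen directly. The sketch below uses simulated values and also shows one possible alternative to trimming the data, namely testing a random subsample. That alternative is an illustration only, not the approach taken in this report.

```r
set.seed(3)
x <- rnorm(5001)  # one observation more than shapiro.test() accepts

# shapiro.test() stops with an error when the sample size is outside 3..5000;
# tryCatch() captures the error message instead of aborting the script
too_large <- tryCatch(
  shapiro.test(x),
  error = function(e) conditionMessage(e)
)
too_large

# Possible workaround that keeps the full dataset: test a random subsample
sub_result <- shapiro.test(sample(x, 5000))
sub_result$p.value
```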

Justification of Explanatory Variables:

Release Year: This variable is included because there might be a trend or pattern over time that influences IMDb scores. For example, newer productions may have higher scores due to advancements in filmmaking or changes in audience preferences.
Runtime: The duration of a movie or TV show can impact viewer experience. Longer runtimes may allow for more in-depth storytelling, potentially influencing IMDb scores.

# Load necessary libraries
library(tidyverse)

# Load the Netflix dataset
# The path below is specific to the author's machine; adjust it to your copy of the file
netflix_data <- read.csv("C:\\Users\\Stepa\\Documents\\Documents\\School\\SEB\\RMaT\\Homework\\netflix_data.csv")

# Descriptive statistics
# Display a summary of selected variables: imdb_score, release_year, and runtime
summary(netflix_data[c("imdb_score", "release_year", "runtime")])

##    imdb_score     release_year     runtime      
##  Min.   :1.500   Min.   :1953   Min.   :  0.00  
##  1st Qu.:5.800   1st Qu.:2015   1st Qu.: 46.00  
##  Median :6.600   Median :2018   Median : 88.00  
##  Mean   :6.545   Mean   :2016   Mean   : 80.05  
##  3rd Qu.:7.400   3rd Qu.:2020   3rd Qu.:106.00  
##  Max.   :9.600   Max.   :2022   Max.   :235.00

# Linear Regression Analysis
# Fit a linear regression model predicting imdb_score based on release_year and runtime
linear_model <- lm(imdb_score ~ release_year + runtime, data = netflix_data)

# Check assumptions
# Set up a 2x2 grid of plots to visually assess assumptions
par(mfrow = c(2, 2))
plot(linear_model)

# Summary of the linear model
# Display detailed information about the linear regression model
summary(linear_model)

## 
## Call:
## lm(formula = imdb_score ~ release_year + runtime, data = netflix_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1330 -0.6530  0.0910  0.7885  2.9352 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  55.5211680  4.4314592   12.53   <2e-16 ***
## release_year -0.0240587  0.0021948  -10.96   <2e-16 ***
## runtime      -0.0059598  0.0004177  -14.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.118 on 4985 degrees of freedom
## Multiple R-squared:  0.05149,    Adjusted R-squared:  0.05111 
## F-statistic: 135.3 on 2 and 4985 DF,  p-value: < 2.2e-16

# Assumption checks
# Perform a Shapiro-Wilk test for normality of residuals
shapiro.test(residuals(linear_model))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(linear_model)
## W = 0.97552, p-value < 2.2e-16

# Plot residuals against fitted values to check for homoscedasticity
# (which = 1 is the Residuals vs Fitted plot; which = 2 would be the Normal Q-Q plot)
plot(linear_model, which = 1)

# Make predictions
# Generate predicted IMDb scores using the linear regression model
predicted_scores <- predict(linear_model)

# Visualize the fit
# Create a scatter plot comparing actual IMDb scores with predicted scores
plot(netflix_data$imdb_score, predicted_scores, main = "Actual vs Predicted IMDb Scores",
     xlab = "Actual IMDb Scores", ylab = "Predicted IMDb Scores")

# Assess model performance
# Calculate Root Mean Squared Error (RMSE) as a measure of model accuracy
model_residuals <- residuals(linear_model)  # avoid masking the residuals() function
rmse <- sqrt(mean(model_residuals^2))
cat("Root Mean Squared Error (RMSE):", rmse, "\n")

## Root Mean Squared Error (RMSE): 1.118134

Conclusion:

Assumption Checks:

Normality of Residuals:

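Shapiro-Wilk normality test was performed (shapiro.test(residuals(linear_model))). The small p-value (< 2.2e-16) indicates that the residuals deviate from a normal distribution. With nearly 5,000 observations the test detects even minor deviations, so the Normal Q-Q plot is the better guide to how severe the departure is.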

Homoscedasticity:

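Homoscedasticity is assessed from the plot of residuals against fitted values (plot(linear_model, which = 1)); a roughly constant vertical spread across the fitted values supports the assumption. Note that which = 2 produces the Normal Q-Q plot, which addresses normality rather than homoscedasticity.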
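A numeric check can complement the visual one. The sketch below computes a studentized Breusch-Pagan statistic by hand, so no extra package is needed. It is run on simulated data that is heteroscedastic by construction; the variable names mirror the analysis, but every value is made up.

```r
set.seed(4)
n <- 500

# Simulated predictors; names mirror the analysis, values are assumptions
sim <- data.frame(
  release_year = sample(1990:2022, n, replace = TRUE),
  runtime = runif(n, min = 20, max = 200)
)

# Error spread grows with runtime, so the variance is non-constant by design
sim$imdb_score <- 6.5 - 0.005 * sim$runtime +
  rnorm(n, sd = 0.2 + 0.01 * sim$runtime)

fit <- lm(imdb_score ~ release_year + runtime, data = sim)

# Residuals vs Fitted: a funnel shape indicates non-constant variance
plot(fit, which = 1)

# Auxiliary regression of squared residuals on the predictors;
# under homoscedasticity, n * R^2 is approximately chi-square with 2 df
sim$sq_resid <- residuals(fit)^2
aux <- lm(sq_resid ~ release_year + runtime, data = sim)
lm_stat <- n * summary(aux)$r.squared
bp_p_value <- pchisq(lm_stat, df = 2, lower.tail = FALSE)
bp_p_value
```

A small p-value points to heteroscedasticity, in which case the standard errors reported by summary() should be treated with caution.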

Coefficients:

The coefficients for release_year and runtime are both statistically significant (p-value < 0.05).
Interpretation:
- For each one-year increase in release_year, the IMDb score is expected to decrease by approximately 0.0241 points, holding runtime constant.
- For each one-unit increase in runtime, the IMDb score is expected to decrease by approximately 0.0060 points, holding release_year constant.
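The interpretation of the coefficients can be made concrete with a small worked example. The sketch below plugs the estimates reported in the model summary into the regression equation; `predict_score` is a hypothetical helper and the two titles are invented for illustration.

```r
# Coefficient estimates as reported in the model summary
b0 <- 55.5211680
b_year <- -0.0240587
b_runtime <- -0.0059598

# Hypothetical helper: predicted IMDb score from the fitted equation
predict_score <- function(release_year, runtime) {
  b0 + b_year * release_year + b_runtime * runtime
}

# Two invented titles from the same year, 30 runtime units apart
short_title <- predict_score(release_year = 2020, runtime = 90)
long_title <- predict_score(release_year = 2020, runtime = 120)

# The difference equals 30 * b_runtime because release_year is held constant
long_title - short_title
```

With the fitted model object available, predict(linear_model, newdata = ...) gives the same predictions without retyping the coefficients.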

Adjusted R-squared:

The adjusted R-squared value is 0.0511, suggesting that the model explains only a small proportion of the variance in IMDb scores.
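The adjusted value can be reproduced from the multiple R-squared and the degrees of freedom in the summary output. A short check, assuming n = 4988 observations (4985 residual degrees of freedom plus two predictors and the intercept):

```r
r_squared <- 0.05149  # Multiple R-squared from the model summary
n <- 4985 + 2 + 1     # residual df + predictors + intercept
p <- 2                # number of predictors

# Adjusted R-squared penalises the fit for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 5)  # 0.05111, matching the reported value
```

With almost 5,000 observations and only two predictors, the adjustment is negligible.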

Final conclusion:

Based on the linear regression analysis, both release_year and runtime are statistically significant predictors of IMDb scores for Netflix movies and TV shows, and both coefficients are negative.
However, the model explains only about 5% of the variance (adjusted R-squared = 0.0511) and its RMSE is about 1.12 points, so release year and runtime alone predict IMDb scores poorly; other factors clearly influence them.

Filip Štepanovský

2024-01-02
