Data: Sourced from Kaggle: https://www.kaggle.com/datasets/thedevastator/books-sales-and-ratings/data. The dataset was trimmed by removing the "kids" and "genre fiction" categories, keeping only the "fiction" and "nonfiction" entries.
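A minimal sketch of how this trimming could be reproduced in R, assuming the raw file has a genre column using the labels "kids", "genre fiction", "fiction", and "nonfiction" (the raw file name and exact labels are assumptions, not taken from the original):
# Hypothetical reproduction of the trimming step (file name and genre labels are assumptions)
raw_books <- read.csv("books_sales_and_ratings_raw.csv")
books_data <- subset(raw_books, genre %in% c("fiction", "nonfiction"))
table(books_data$genre)  # only the two genres of interest should remain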
# Load the data
books_data <- read.csv("C:\\Users\\Stepa\\Documents\\Documents\\School\\SEB\\RMaT\\Homework\\books_data.csv")
# Descriptive Statistics
# Display a summary of the Book_average_rating variable
summary(books_data$Book_average_rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.230 3.900 4.030 4.027 4.200 4.610
# Display the frequency table of the 'genre' variable
table(books_data$genre)
##
## fiction nonfiction
## 62 171
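As an optional visual check before running the tests, a boxplot of ratings by genre gives a quick impression of the two distributions (a sketch; no plot was produced in the original output):
# Optional: compare the rating distributions of the two genres visually
boxplot(Book_average_rating ~ genre, data = books_data,
        main = "Book average rating by genre",
        xlab = "Genre", ylab = "Average rating")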
# Research Question 1: Does the average rating of books differ significantly between two different genres?
# Hypothesis:
# H0: There is no significant difference in the average rating of books between the two genres.
# H1: There is a significant difference in the average rating of books between the two genres.
# Parametric Test (t-test)
# Perform a t-test comparing Book_average_rating between different genres
t_test_result <- t.test(Book_average_rating ~ genre, data = books_data)
t_test_result
##
## Welch Two Sample t-test
##
## data: Book_average_rating by genre
## t = 0.52568, df = 132.25, p-value = 0.6
## alternative hypothesis: true difference in means between group fiction and group nonfiction is not equal to 0
## 95 percent confidence interval:
## -0.04976920 0.08579636
## sample estimates:
## mean in group fiction mean in group nonfiction
## 4.040645 4.022632
# Non-parametric Test (Wilcoxon Rank Sum Test)
# Perform a Wilcoxon Rank Sum Test comparing Book_average_rating between different genres
wilcox_test_result <- wilcox.test(Book_average_rating ~ genre, data = books_data)
wilcox_test_result
##
## Wilcoxon rank sum test with continuity correction
##
## data: Book_average_rating by genre
## W = 5406.5, p-value = 0.8173
## alternative hypothesis: true location shift is not equal to 0
# Decision:
# Compare the p-values from both tests. If assumptions for the t-test are met, use it. Otherwise, use the Wilcoxon Rank Sum Test.
# Explanation:
# Interpret the p-value and confidence interval. If p-value < 0.05, reject the null hypothesis and conclude a significant difference in average rating between the two genres.
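A minimal sketch of how the t-test assumptions mentioned above could be checked (these checks are not part of the original output):
# Sketch: normality of ratings within each genre (Shapiro-Wilk per group)
tapply(books_data$Book_average_rating, books_data$genre, shapiro.test)
# Sketch: homogeneity of variances between the two genres (F test)
var.test(Book_average_rating ~ genre, data = books_data)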
cat("\n\nExplanations:\n")
##
##
## Explanations:
# Evaluate t-test results
if (t_test_result$p.value < 0.05) {
  cat("Based on the t-test, we reject the null hypothesis. There is a significant difference in average rating between the two genres.\n")
} else {
  cat("Based on the t-test, we fail to reject the null hypothesis. There is no significant difference in average rating between the two genres.\n")
}
## Based on the t-test, we fail to reject the null hypothesis. There is no significant difference in average rating between the two genres.
# Evaluate Wilcoxon Rank Sum Test results
if (wilcox_test_result$p.value < 0.05) {
  cat("Based on the Wilcoxon Rank Sum Test, we reject the null hypothesis. There is a significant difference in average rating between the two genres.\n")
} else {
  cat("Based on the Wilcoxon Rank Sum Test, we fail to reject the null hypothesis. There is no significant difference in average rating between the two genres.\n")
}
## Based on the Wilcoxon Rank Sum Test, we fail to reject the null hypothesis. There is no significant difference in average rating between the two genres.
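Because the Wilcoxon test compares ranks rather than means, reporting the group medians alongside the means from the t-test can support the interpretation (a sketch, not in the original output):
# Sketch: median rating per genre, a location summary matching the rank-based test
tapply(books_data$Book_average_rating, books_data$genre, median)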
Conclusion:
Both tests were conducted to check for a significant difference in average ratings between the genres.
The t-test assumes approximately normal data within each group (the Welch variant used here does not require equal variances), while the Wilcoxon Rank Sum Test is non-parametric and does not assume normality.
Based on the p-values, neither test provides evidence to reject the null hypothesis.
For the t-test, the p-value is 0.6, which is greater than the typical significance level of 0.05.
Hence, we fail to reject the null hypothesis, suggesting no significant difference in average ratings.
The Wilcoxon Rank Sum Test also yields a p-value of 0.8173, reinforcing the conclusion that there is no significant difference in average ratings between fiction and nonfiction genres.
There is no statistically significant evidence to suggest a difference in average book ratings between the fiction and nonfiction genres based on the chosen dataset. Both parametric and non-parametric tests lead to the same conclusion, and the decision to use both tests helps ensure robustness in the face of assumptions about the data distribution.
The more appropriate test in this case might be the Wilcoxon Rank Sum Test because it doesn’t assume normality and is robust against non-normal distributions. The results indicate that there is no evidence to conclude a significant difference in average ratings between fiction and nonfiction genres.
Data: Sourced from Kaggle: https://www.kaggle.com/datasets/thedevastator/netflix-imdb-scores?resource=download. The dataset was trimmed by removing titles with fewer than 60 reviews to stay under 5,000 entries; otherwise the Shapiro-Wilk test could not have been used, since R's shapiro.test() accepts at most 5,000 observations.
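A minimal sketch of how this trimming could be reproduced in R, assuming the raw file has a review-count column (the raw file name and the column name imdb_votes are assumptions):
# Hypothetical reproduction of the trimming step (file and column names are assumptions)
raw_netflix <- read.csv("netflix_imdb_scores_raw.csv")
netflix_data <- subset(raw_netflix, imdb_votes >= 60)
nrow(netflix_data)  # should be under 5000 so shapiro.test() can be applied to the residuals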
Justification of Explanatory Variables: release_year and runtime are the continuous predictors available alongside imdb_score in the trimmed dataset, so they are used as the explanatory variables for the regression.
# Load necessary libraries
library(tidyverse)
# Load the Netflix dataset
# Replace 'path_to_dataset.csv' with the actual path to your dataset
netflix_data <- read.csv("C:\\Users\\Stepa\\Documents\\Documents\\School\\SEB\\RMaT\\Homework\\netflix_data.csv")
# Descriptive statistics
# Display a summary of selected variables: imdb_score, release_year, and runtime
summary(netflix_data[c("imdb_score", "release_year", "runtime")])
## imdb_score release_year runtime
## Min. :1.500 Min. :1953 Min. : 0.00
## 1st Qu.:5.800 1st Qu.:2015 1st Qu.: 46.00
## Median :6.600 Median :2018 Median : 88.00
## Mean :6.545 Mean :2016 Mean : 80.05
## 3rd Qu.:7.400 3rd Qu.:2020 3rd Qu.:106.00
## Max. :9.600 Max. :2022 Max. :235.00
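Before fitting the model, a quick look at the pairwise correlations between the outcome and the candidate predictors can help motivate their inclusion (a sketch, not in the original output):
# Sketch: pairwise correlations among the modeled variables
cor(netflix_data[c("imdb_score", "release_year", "runtime")], use = "complete.obs")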
# Linear Regression Analysis
# Fit a linear regression model predicting imdb_score based on release_year and runtime
linear_model <- lm(imdb_score ~ release_year + runtime, data = netflix_data)
# Check assumptions
# Set up a 2x2 grid of plots to visually assess assumptions
par(mfrow = c(2, 2))
plot(linear_model)
# Summary of the linear model
# Display detailed information about the linear regression model
summary(linear_model)
##
## Call:
## lm(formula = imdb_score ~ release_year + runtime, data = netflix_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1330 -0.6530 0.0910 0.7885 2.9352
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.5211680 4.4314592 12.53 <2e-16 ***
## release_year -0.0240587 0.0021948 -10.96 <2e-16 ***
## runtime -0.0059598 0.0004177 -14.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.118 on 4985 degrees of freedom
## Multiple R-squared: 0.05149, Adjusted R-squared: 0.05111
## F-statistic: 135.3 on 2 and 4985 DF, p-value: < 2.2e-16
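To complement the coefficient table, confidence intervals for the estimates can be read directly off the fitted model (a sketch, not in the original output):
# Sketch: 95% confidence intervals for the regression coefficients
confint(linear_model, level = 0.95)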
# Assumption checks
# Perform a Shapiro-Wilk test for normality of residuals
shapiro.test(residuals(linear_model))
##
## Shapiro-Wilk normality test
##
## data: residuals(linear_model)
## W = 0.97552, p-value < 2.2e-16
# Plot residuals against fitted values to check for homoscedasticity
plot(linear_model, which = 1)
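A formal complement to the visual homoscedasticity check is the Breusch-Pagan test from the lmtest package (a sketch; the lmtest package is assumed to be installed and is not used elsewhere in the original analysis):
# Sketch: Breusch-Pagan test for heteroscedasticity of the residuals
library(lmtest)
bptest(linear_model)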
# Make predictions
# Generate predicted IMDb scores using the linear regression model
predicted_scores <- predict(linear_model)
# Visualize the fit
# Create a scatter plot comparing actual IMDb scores with predicted scores
plot(netflix_data$imdb_score, predicted_scores, main = "Actual vs Predicted IMDb Scores",
xlab = "Actual IMDb Scores", ylab = "Predicted IMDb Scores")
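A 45-degree reference line makes the actual-vs-predicted comparison easier to read (a small optional addition, not in the original plot):
# Optional: add a y = x reference line to the actual-vs-predicted plot
abline(a = 0, b = 1, col = "red", lty = 2)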
# Assess model performance
# Calculate Root Mean Squared Error (RMSE) as a measure of model accuracy
model_residuals <- residuals(linear_model)
rmse <- sqrt(mean(model_residuals^2))
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 1.118134
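As a rough check that the in-sample RMSE is not optimistic, the model could be refit on a random training split and evaluated on the held-out rows (a sketch, not part of the original analysis; the 80/20 split and the seed are arbitrary choices):
# Sketch: simple train/test split to estimate out-of-sample RMSE
set.seed(123)
train_idx <- sample(seq_len(nrow(netflix_data)), size = floor(0.8 * nrow(netflix_data)))
train_fit <- lm(imdb_score ~ release_year + runtime, data = netflix_data[train_idx, ])
test_pred <- predict(train_fit, newdata = netflix_data[-train_idx, ])
sqrt(mean((netflix_data$imdb_score[-train_idx] - test_pred)^2))  # held-out RMSE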
Conclusion:
Assumption Checks: The Shapiro-Wilk test on the residuals gives a p-value < 2.2e-16, so the normality assumption is formally violated; with nearly 5,000 observations, however, the test is sensitive to even small departures from normality, and the diagnostic plots should be weighed alongside it.
Coefficients: The coefficients for release_year and runtime are both statistically significant (p-value < 0.05).
Interpretation: Both coefficients are negative, so more recent release years and longer runtimes are associated with slightly lower IMDb scores (about -0.024 points per year and -0.006 points per minute of runtime).
Adjusted R-squared: The adjusted R-squared is about 0.051, meaning the model explains only roughly 5% of the variation in IMDb scores.
Final conclusion: release_year and runtime are statistically significant predictors of imdb_score, but the model's explanatory power is weak (adjusted R-squared of about 0.05, RMSE of about 1.12), so its predictions are of limited practical use on their own.