R Notebook - Week 6

Introduction

In this data dive, we explore the relationships between pairs of numeric variables, including one calculated column and at least one pair of response and explanatory variables. We will visualize the relationships, scrutinize the plots for insights and outliers, and calculate correlation coefficients to understand the strength of these relationships. Finally, we will construct confidence intervals for the response variables to make inferences about the population.

Documenting both the model and the data is essential. Clear documentation tells us how the dataset is structured and what each variable means; without it, the risk of misinterpreting a variable is high. We refer to the dataset documentation throughout to stay clear on the variables we use, which supports more accurate conclusions.

# Load necessary libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(readr)

# Load the dataset (replace with your data)
tv_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")

## Rows: 168639 Columns: 29

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl   (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl   (2): adult, in_production
## date  (2): first_air_date, last_air_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Display the first few rows
head(tv_data)

Building Pairs of Numeric Variables

Pair 1: Vote Average and Average Episodes per Season

For the first pair, we use vote_average as the response variable and a calculated column, the average number of episodes per season, as the explanatory variable. This shows whether the number of episodes a show airs per season is related to its overall rating.

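# Create a new column for average episodes per season
# (rows with number_of_seasons == 0 produce Inf or NaN; they are filtered out before the correlation is computed)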
tv_data <- tv_data |> 
  mutate(avg_episodes_per_season = number_of_episodes / number_of_seasons)

# Inspect the dataset with the new column
head(tv_data)

Visualization

This scatter plot visualizes the relationship between the average number of episodes per season and the vote average of each show. The red line represents the linear trend across the data points.

# Scatter plot for avg_episodes_per_season vs vote_average
ggplot(tv_data, aes(x = avg_episodes_per_season, y = vote_average)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Average Episodes per Season vs. Vote Average", 
       x = "Average Episodes per Season", 
       y = "Vote Average") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 22428 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 22393 rows containing missing values or values outside the scale range
## (`geom_point()`).
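
The warnings above are consistent with division by a number_of_seasons of 0, which R does not treat as an error. A minimal sketch on made-up numbers shows the behavior; the vector names are hypothetical stand-ins for the dataset columns, and base R is used so the snippet runs on its own.

```r
# Toy vectors standing in for number_of_episodes and number_of_seasons (made-up values)
episodes <- c(20, 10, 0, 8, NA)
seasons  <- c(2, 0, 0, 1, 3)

# Division by zero does not raise an error in R; it returns Inf (10 / 0) or NaN (0 / 0)
avg_eps <- episodes / seasons
print(avg_eps)

# is.finite() is FALSE for Inf, NaN, and NA, so it counts rows that cannot enter a linear fit
n_non_finite <- sum(!is.finite(avg_eps))
print(n_non_finite)
```

Counting non-finite values before plotting makes it easier to tell whether a warning reflects missing data or a zero denominator.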

Learnings

Outliers: The scatter plot reveals some shows with exceptionally high or low average episodes per season, which deviate from the trend. These outliers may require further investigation to understand if they are anomalies or have other explanations.

Trend: The trend line summarizes the overall linear relationship between episodes per season and a show’s rating. A positive slope would indicate that shows with more episodes per season tend to have higher ratings; how strong that tendency is must be judged from the correlation coefficient, not from the line alone.

Plotting the data first lets us assess the relationship between these two variables visually before performing more rigorous statistical analysis.
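
One way to move from eyeballing outliers to listing them is Tukey's 1.5 × IQR rule. The sketch below applies it to made-up values; it is an illustration of the rule, not a step the notebook itself performs, and the vector name is hypothetical.

```r
# Made-up average-episodes-per-season values; 250 is an intentionally extreme case
avg_eps <- c(8, 10, 12, 13, 20, 22, 24, 250)

# Tukey's rule: flag points more than 1.5 * IQR beyond the first or third quartile
q <- unname(quantile(avg_eps, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- avg_eps < q[1] - fence | avg_eps > q[2] + fence

# Values flagged for a closer look
print(avg_eps[is_outlier])
```

Flagged rows are candidates for inspection, not automatic removal; a long-running daily show can legitimately have hundreds of episodes per season.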

Correlation Coefficient

The correlation coefficient quantifies the strength and direction of the linear relationship between average episodes per season and vote average.

# Filter out rows where number_of_seasons is 0 or NA to avoid division by zero
tv_data_filtered <- tv_data |> 
  filter(number_of_seasons > 0, !is.na(number_of_seasons))

# Recalculate avg_episodes_per_season without rows where number_of_seasons is 0 or NA
tv_data_filtered <- tv_data_filtered |> 
  mutate(avg_episodes_per_season = number_of_episodes / number_of_seasons)

# Calculate the correlation coefficient after handling the issue
cor(tv_data_filtered$avg_episodes_per_season, tv_data_filtered$vote_average, use = "complete.obs")

## [1] 0.08883899

A correlation coefficient close to 0 indicates little to no linear relationship between two variables.
Here, a value of 0.0888 indicates that average episodes per season and vote average are only very weakly related: squaring it gives about 0.008, so a linear fit would account for less than 1% of the variation in ratings. As the number of episodes per season increases, a show’s average rating changes very little.
The weak correlation suggests that other factors, such as genre, network, or production quality, may play a much larger role in determining a show’s rating.
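
A single correlation value says nothing about its own uncertainty. As a sketch on simulated data (not the TMDB values), base R's cor.test() returns the same Pearson estimate as cor() together with a 95% confidence interval; the variables x and y are hypothetical.

```r
set.seed(42)  # reproducible toy data, not the TMDB values
x <- rnorm(500)
y <- 0.1 * x + rnorm(500)  # weak linear signal buried in noise

ct <- cor.test(x, y)       # Pearson correlation by default
r  <- unname(ct$estimate)

print(r)            # sample correlation, identical to cor(x, y)
print(r^2)          # share of variance in y linearly associated with x
print(ct$conf.int)  # 95% confidence interval for the correlation
```

On the real data the same call would take the two filtered columns; cor.test() drops incomplete pairs itself.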

Confidence Interval for Vote Average (Response Variable)

We now build a 95% confidence interval for the vote average to estimate the population mean rating across shows.

# Build a 95% confidence interval for vote_average
vote_avg_mean <- mean(tv_data$vote_average, na.rm = TRUE)
vote_avg_sd <- sd(tv_data$vote_average, na.rm = TRUE)
n <- sum(!is.na(tv_data$vote_average))

# Calculate the margin of error (Z-score for 95% confidence level = 1.96)
margin_of_error <- 1.96 * (vote_avg_sd / sqrt(n))

# Confidence interval
lower_bound <- vote_avg_mean - margin_of_error
upper_bound <- vote_avg_mean + margin_of_error
c(lower_bound, upper_bound)

## [1] 2.317356 2.350330

The 95% confidence interval for the Pair 1 response variable, vote average, is [2.317, 2.350]. We are 95% confident that the true population mean vote average for all TV shows falls within this range.
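
The interval calculation is repeated for each response variable, so it can be wrapped in a small function. The helper below is a hypothetical sketch (mean_ci is not part of the notebook) that uses the t critical value instead of the fixed 1.96; with a sample as large as this dataset the two are practically identical.

```r
# Hypothetical helper: confidence interval for a mean, ignoring NAs
mean_ci <- function(x, level = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  se <- sd(x) / sqrt(n)
  # t critical value; approaches 1.96 as n grows
  crit <- qt(1 - (1 - level) / 2, df = n - 1)
  c(lower = mean(x) - crit * se, upper = mean(x) + crit * se)
}

ratings <- c(7.1, 6.4, NA, 8.0, 0, 5.5, 7.7)  # made-up ratings
print(mean_ci(ratings))
print(t.test(ratings)$conf.int)  # built-in equivalent, for comparison
```

Calling mean_ci(tv_data$vote_average) would be the notebook equivalent, assuming the column is numeric.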

Pair 2: Vote Count and Ratings per Season

In the second pair, we calculate the number of ratings per season (vote_count divided by number_of_seasons) as the explanatory variable and pair it with the total vote count as the response variable. This shows whether shows that receive more ratings per season also tend to have more votes overall.

# Create a new column for the number of ratings per season
tv_data <- tv_data |> 
  mutate(ratings_per_season = vote_count / number_of_seasons)

# Inspect the dataset with the new column
head(tv_data)

Visualization

We’ll now plot the relationship between ratings per season and vote count.

# Scatter plot for ratings_per_season vs vote_count
ggplot(tv_data, aes(x = ratings_per_season, y = vote_count)) +
  geom_point(color = "green") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Ratings per Season vs. Vote Count", 
       x = "Ratings per Season", 
       y = "Vote Count") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 22428 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 20913 rows containing missing values or values outside the scale range
## (`geom_point()`).

Learnings

Outliers: Some shows seem to have a disproportionately high or low number of ratings per season compared to their total vote count. These could represent outliers and may warrant further investigation to determine their impact on the analysis.

Trend: The trend line shows whether shows with more ratings per season tend to accumulate more votes overall. A strong positive correlation would be consistent with more engaged viewership going along with more votes, although correlation alone does not establish that one causes the other.

Correlation Coefficient

We calculate the correlation coefficient to assess the strength of the relationship between ratings per season and vote count.

# Filter out rows where number_of_seasons is 0 or NA to avoid division by zero
tv_data_filtered <- tv_data |> 
  filter(number_of_seasons > 0, !is.na(number_of_seasons))

# Recalculate ratings_per_season without rows where number_of_seasons is 0 or NA
tv_data_filtered <- tv_data_filtered |> 
  mutate(ratings_per_season = vote_count / number_of_seasons)

# Calculate the correlation coefficient after handling the issue
cor(tv_data_filtered$ratings_per_season, tv_data_filtered$vote_count, use = "complete.obs")

## [1] 0.7415094

A correlation coefficient of 0.7415 indicates a strong positive linear relationship between the two variables: as the number of ratings per season increases, the total vote count tends to increase as well.
Shows with more ratings per season are therefore likely to have a higher overall vote count. Part of this association is built in, however, because ratings_per_season is computed directly from vote_count (vote_count divided by number_of_seasons), so the two variables share the same numerator.
For that reason, ratings per season should be treated with caution as a predictor of total vote count: it predicts well largely because it already contains the response.
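
Because ratings_per_season is computed from vote_count, some correlation between the two is expected even when votes and seasons are unrelated. The simulation below is a sketch under that assumption: the distributions and variable names are invented for illustration and do not describe the TMDB data.

```r
set.seed(1)  # reproducible simulation, not the TMDB data
n <- 5000
votes   <- rexp(n, rate = 1 / 50)            # simulated vote counts
seasons <- sample(1:10, n, replace = TRUE)   # drawn independently of votes

votes_per_season <- votes / seasons

r_independent <- cor(seasons, votes)           # near 0: the inputs are unrelated
r_coupled     <- cor(votes_per_season, votes)  # positive: votes appears on both sides
print(c(r_independent, r_coupled))
```

Comparing the observed correlation against a baseline like this helps separate genuine association from the part created by the shared numerator.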

Confidence Interval for Vote Count

We build a confidence interval for the vote count to estimate the population mean.

# Build a 95% confidence interval for vote_count
vote_count_mean <- mean(tv_data$vote_count, na.rm = TRUE)
vote_count_sd <- sd(tv_data$vote_count, na.rm = TRUE)
n <- sum(!is.na(tv_data$vote_count))

# Calculate the margin of error
margin_of_error <- 1.96 * (vote_count_sd / sqrt(n))

# Confidence interval
lower_bound <- vote_count_mean - margin_of_error
upper_bound <- vote_count_mean + margin_of_error
c(lower_bound, upper_bound)

## [1] 12.39435 14.21576

The 95% confidence interval for the Pair 2 response variable, vote count, is [12.394, 14.216]. We are 95% confident that the true population mean vote count for all TV shows falls within this range.
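
Count variables are often right-skewed, and this notebook does not check whether vote_count is. If it is, a percentile bootstrap offers a way to cross-check the normal-approximation interval. The sketch below uses simulated skewed data as a stand-in; the distribution, sample size, and number of resamples are all assumptions chosen for illustration.

```r
set.seed(123)  # reproducible toy data, not the TMDB values
votes <- round(rlnorm(400, meanlog = 1, sdlog = 1.5))  # right-skewed stand-in for vote_count

# Resample with replacement and record the mean of each resample
boot_means <- replicate(2000, mean(sample(votes, replace = TRUE)))

# Percentile bootstrap 95% interval for the mean
boot_ci <- unname(quantile(boot_means, c(0.025, 0.975)))
print(mean(votes))
print(boot_ci)
```

If the bootstrap interval and the 1.96-based interval roughly agree, the normal approximation is adequate for the sample at hand.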

Overall Conclusion

In this analysis, we explored the relationships between different numeric variables in the dataset, visualized trends and outliers, and calculated correlation coefficients to quantify the strength of these relationships. Confidence intervals helped us make inferences about the population parameters.

Key Insights:

Vote average and average episodes per season are only weakly correlated (r ≈ 0.089), and the scatter plot shows potential outliers that merit a closer look.
Vote count and ratings per season are strongly positively correlated (r ≈ 0.74), though part of that strength comes from ratings per season being calculated from vote count.

Further Questions to Investigate:

Are the outliers seen in the plots significant, and do they reflect data quality issues or special cases?
Could adding more variables, such as genres or networks, improve the explanatory power of the models?