Simple Linear Regression

Simple Linear Modeling

In this discussion, I develop a simple linear model that uses a movie's gross revenue to predict the number of Oscars it wins. I start by loading the data, removing missing values and outliers, and visualizing the data. Modeling and an analysis of the residuals follow.

Dependent variable: oscars
Independent variable: gross_revenue

Load the Data

Load the data, remove missing values, and remove outliers.

# Load the necessary packages
library(readr)
library(dplyr)

## Warning: package 'dplyr' was built under R version 4.3.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Read the data from my github repository
movies_df <- read_csv("https://raw.githubusercontent.com/hawa1983/DATA605-Simple-Linear-Regression/main/movies.csv", na = "N/A")

## Rows: 100 Columns: 21

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): title, genre, certificate, runtime, directors, stars, link
## dbl (12): ranking, release_year, metascore_rating, actors_atings, direction_...
## num  (2): votes, gross_revenue
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# gross_revenue is the independent (predictor) variable and oscars is the dependent (response) variable

# Replace non-numeric characters if present and convert both columns to integer.
# Caution: gsub() coerces a numeric column to character first, so a value that R
# writes in scientific notation (e.g. 3.2e+07) is reduced to its digits (3207).
movies_df$gross_revenue <- as.integer(gsub("[^0-9]", "", movies_df$gross_revenue))
movies_df$oscars <- as.integer(movies_df$oscars)

# Keep only the two modeling columns and omit rows with NAs in either of them
clean_movies_df <- na.omit(movies_df[c("gross_revenue", "oscars")])

# Now I will remove outliers from the data.
# Calculate the IQR for 'gross_revenue'
IQR_gross_revenue <- IQR(clean_movies_df$gross_revenue)
quantiles_gross_revenue <- quantile(clean_movies_df$gross_revenue, probs = c(0.25, 0.75))

# Define bounds for 'gross_revenue'
lower_bound_gross_revenue <- quantiles_gross_revenue[1] - 1.5 * IQR_gross_revenue
upper_bound_gross_revenue <- quantiles_gross_revenue[2] + 1.5 * IQR_gross_revenue

# Filter out outliers based on 'gross_revenue' and keep corresponding 'oscars'
clean_movies_df <- clean_movies_df[
  clean_movies_df$gross_revenue >= lower_bound_gross_revenue &
  clean_movies_df$gross_revenue <= upper_bound_gross_revenue, 
]
head(clean_movies_df)

## # A tibble: 6 × 2
##   gross_revenue oscars
##           <int>  <int>
## 1     134966411      3
## 2      44824144      7
## 3      11800000      3
## 4       3200000      0
## 5          3207      0
## 6      57300000      6
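One caveat about the cleaning step above: `gsub()` converts a numeric column to character before matching, and R writes some doubles in scientific notation during that conversion. The sketch below uses made-up values, not the movie data, to show how stripping non-digits can then corrupt a value, along with one possible safer conversion for a column that `read_csv()` has already parsed as numeric. The value 3207 in row 5 of the output above is consistent with 32,000,000 having been altered this way, but that is an inference and has not been checked against the source file.

```r
# Illustrative values only (not taken from movies.csv)
revenue <- c(3200000, 32000000, 134966411)

# as.character() uses scientific notation when it is shorter than fixed notation
as.character(revenue)

# Stripping non-digits after that coercion corrupts the second value
mangled <- as.integer(gsub("[^0-9]", "", revenue))
mangled

# A column that is already numeric can be converted directly
safe <- as.integer(revenue)
safe
```

If the fix were applied to the real data, the outlier bounds, the plots, and the model summary would all need to be regenerated.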

Visualize the data

We will look at the distribution of the variables here.

Distribution of gross revenue

The box plot indicates that the gross revenue data is concentrated within a smaller range, which forms the main body of the box, with the median revenue falling within this range. Several outliers suggest some films have gross revenues much higher than the typical range. These outliers could significantly influence any analysis of the dataset, highlighting the need for careful consideration of how to handle these values in statistical or economic interpretations.

library(ggplot2)

# Boxplot of 'gross_revenue'
ggplot(data = clean_movies_df, aes(y = gross_revenue)) +
  geom_boxplot() +
  labs(
    title = "Boxplot of gross_revenue",
    y = "Gross Revenue"
  )

Distribution of Oscars

The box plot for the number of Oscars won shows that most films have won a small to moderate number of Oscars. The box spans the interquartile range (the middle 50% of films), and the line inside it marks the median number of wins. The whiskers extend to the minimum and maximum wins that are not considered outliers, while individual points above the upper whisker indicate films that have won a notably high number of Oscars compared to the rest.

library(ggplot2)

# Boxplot of 'oscars'
ggplot(data = clean_movies_df, aes(y = oscars)) +
  geom_boxplot() +
  labs(
    title = "Boxplot of Oscar Winnings",
    y = "Oscar"
  )

Scatter plot of Oscars versus gross revenue

The scatter plot shows the relationship between the number of Oscars a movie won and its gross revenue. The points are widely spread with no strong visible trend, which suggests that Oscar wins do not rise consistently with gross earnings. Movies across the revenue spectrum have won varying numbers of Oscars, so critical acclaim and box office success are not directly proportional. While Oscars are a mark of recognition, they do not guarantee commercial success.

library(ggplot2)

# Scatter plot with gross_revenue on the x-axis
# and oscars (number of Oscars won) on the y-axis
ggplot(data = clean_movies_df, aes(x = gross_revenue, y = oscars)) +
  geom_point() +
  theme_minimal() +
  labs(
    title = "Scatter Plot of Oscars vs Gross Revenue",
    y = "Number of Oscars Won",
    x = "Gross Revenue ($)"
  )

Model the data

# Fit a simple linear model (lm) on the clean dataset:
# oscars is the dependent variable and gross_revenue is the independent variable
model <- lm(oscars ~ gross_revenue, data = clean_movies_df)

# Summary of the model
summary(model)

## 
## Call:
## lm(formula = oscars ~ gross_revenue, data = clean_movies_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1990 -2.5489 -0.5739  1.5392  7.5501 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.549e+00  3.899e-01   6.537 5.66e-09 ***
## gross_revenue 1.206e-08  4.591e-09   2.627   0.0103 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.681 on 79 degrees of freedom
## Multiple R-squared:  0.08036,    Adjusted R-squared:  0.06872 
## F-statistic: 6.903 on 1 and 79 DF,  p-value: 0.01033

Equation of model

$\text{Oscars} = 2.549 + 1.206 \times 10^{-8} \times \text{Gross Revenue} + \epsilon$
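As a worked example of the equation, the sketch below plugs a few revenue values into the fitted line. The coefficients are copied from the `summary()` output above; the helper name `predict_oscars` and the chosen revenue values are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the model summary above
b0 <- 2.549       # intercept
b1 <- 1.206e-08   # slope, in Oscars per dollar of gross revenue

# Hypothetical helper: predicted number of Oscars for a gross revenue in dollars
predict_oscars <- function(gross_revenue) {
  b0 + b1 * gross_revenue
}

# Predictions at $0, $50 million, and $100 million
predict_oscars(c(0, 5e7, 1e8))
```

With the fitted object available, `predict(model, newdata = data.frame(gross_revenue = 1e8))` gives the same kind of prediction without retyping rounded coefficients.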

Interpretation of the model output

The intercept (β0) of the regression equation is approximately 2.549. This means that a film with zero gross revenue would be predicted to win about 2.5 Oscars. Since no film grosses nothing in practice, the intercept mainly anchors the line and does not describe a realistic case.
The slope (β1) for gross revenue is approximately 1.206e-08, meaning that each additional dollar of gross revenue raises the predicted number of Oscars by 1.206e-08. The coefficient looks tiny only because revenue is measured in single dollars: per additional $100 million of gross revenue, the predicted number of Oscars rises by about 1.2.
The p-value for gross revenue is 0.0103, which is below the conventional alpha level of 0.05, so the relationship between gross revenue and Oscars won is statistically significant.
The R-squared value of 0.08036 means that only about 8% of the variation in the number of Oscars won is explained by gross revenue alone, so other factors account for most of it.
The F-statistic (6.903 on 1 and 79 degrees of freedom, p = 0.01033) tells us that the model as a whole is statistically significant. With a single predictor, this is the same test as the t-test on the slope.
To assess practical significance, we consider the size of the effect and whether the relationship is meaningful in the real world. Despite the statistical significance, the low R-squared value and the residual standard error of 2.681 Oscars, which is large relative to the predicted change of about 1.2 Oscars per $100 million, suggest limited practical significance. Other variables not included in the model might have a more substantial impact on the number of Oscars won.
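Because the slope is expressed per dollar, its numeric size depends on the unit chosen for revenue. The sketch below uses simulated stand-in data (not the movie data, and with assumed parameter values) to show that refitting with revenue in units of $100 million multiplies the slope by 1e8 while leaving R-squared unchanged, so the raw size of the coefficient alone says little about practical significance.

```r
set.seed(1)

# Simulated stand-in data; the parameter values are assumptions for illustration
n <- 80
gross_revenue <- runif(n, 1e6, 2e8)
oscars <- rpois(n, lambda = 2.5 + 1.2e-08 * gross_revenue)
sim <- data.frame(gross_revenue, oscars)
sim$revenue_100m <- sim$gross_revenue / 1e8

# Same model, two units for the predictor
fit_dollars <- lm(oscars ~ gross_revenue, data = sim)
fit_100m    <- lm(oscars ~ revenue_100m, data = sim)

# The slope scales by 1e8; the quality of the fit does not change
slope_ratio <- unname(coef(fit_100m)[2] / coef(fit_dollars)[2])
slope_ratio
summary(fit_dollars)$r.squared
summary(fit_100m)$r.squared
```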

par(mfrow=c(2,2))
plot(model)

The Residuals vs Fitted plot shows some curvature, indicating potential non-linearity or that the model might be missing important predictors.
The Normal Q-Q Plot indicates that residuals deviate from normality, especially at the ends.
The Scale-Location plot shows that residuals might not be equally spread across the range, suggesting potential heteroscedasticity.
The Residuals vs Leverage plot indicates there are a few potential influential points (for example, points labeled as 819 and 470), as they’re outside the Cook’s distance lines.
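The influential points seen in the Residuals vs Leverage plot can also be listed numerically with Cook's distance. The sketch below runs on simulated data with one deliberately extreme point. The 4/n cutoff is a common rule of thumb and an assumption here, not a threshold used in the original analysis.

```r
set.seed(2)

# Simulated data with one deliberately extreme observation (row 50)
x <- runif(50, 0, 10)
y <- 1 + 0.5 * x + rnorm(50)
x[50] <- 30
y[50] <- 0
fit <- lm(y ~ x)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# Rule-of-thumb cutoff (an assumption, not a formal test)
threshold <- 4 / length(cd)
flagged <- which(cd > threshold)
flagged
```

For the model in this report, the equivalent call would be `which(cooks.distance(model) > 4 / nrow(clean_movies_df))`.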

Given these observations, a simple linear regression might not be the best fit for this data; it may require a transformation or a different kind of regression model. The potentially influential points should be examined closely to determine whether they are outliers or high-leverage points and whether they should be kept in the model.
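One candidate for "a different kind of regression model" is Poisson regression, since the number of Oscars is a non-negative count. The sketch below is only an illustration on simulated data with assumed parameter values. It is not a claim that a Poisson model fits the movie data, and the overdispersion check at the end is a rough diagnostic.

```r
set.seed(3)

# Simulated count data; revenue is in units of $100 million (assumed values)
n <- 80
revenue_100m <- runif(n, 0, 2)
oscars <- rpois(n, lambda = exp(0.8 + 0.3 * revenue_100m))
sim <- data.frame(revenue_100m, oscars)

# Poisson regression with a log link
pois_fit <- glm(oscars ~ revenue_100m, family = poisson(link = "log"), data = sim)
summary(pois_fit)

# Fitted values are expected counts, so they cannot be negative
range(fitted(pois_fit))

# Rough overdispersion check: a ratio well above 1 would point toward
# a quasi-Poisson or negative binomial model instead
dispersion <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)
dispersion
```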


Fomba Kassoh

2024-04-07
