2024-11-17

Introduction

Simple linear regression models the relationship between two variables as a straight line. The fitted regression equation lets one predict how the dependent variable changes as the independent variable changes.

Examples:
- Relationship between sugar intake and weight.
- Monthly income and total hours worked.
- Speed and distance.

The Equation

The equation for simple linear regression is as follows:
\[ \hat{\gamma}=\beta_{0} + \beta_{1}x \]
where
\[ \hat{\gamma} \: \text{is the dependent variable (predicted outcome)}\newline x \: \text{is the independent variable}\newline \beta_{0} \: \text{is the y-intercept}\newline \beta_{1} \: \text{is the slope}\newline \]
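For reference, the intercept and slope are usually estimated by ordinary least squares. With observed pairs \( (x_i, y_i) \) and sample means \( \bar{x} \) and \( \bar{y} \) (notation introduced here for illustration, not used elsewhere in this document), the estimates are
\[ \beta_{1} = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}, \qquad \beta_{0} = \bar{y} - \beta_{1}\bar{x} \]
The short sketch below applies these formulas to a small made-up dataset and checks them against lm(). The data values are illustrative only and have nothing to do with the movies data.

```r
# Illustrative (made-up) data, not taken from the movies dataset
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares slope: co-variation of x and y divided by variation of x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Least-squares intercept: forces the line through (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

# lm() should return the same two numbers
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))
```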

Application

We will use the movies dataset from the ggplot2movies package as an example to examine the relationship between budget and other variables (such as votes and rating).

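library(dplyr)          # %>%, filter(), arrange()
library(ggplot2)        # ggplot(), geom_point()
library(ggplot2movies)  # provides the movies dataset
data(movies)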

Dataset ggplot2movies

budgeted = movies %>%
  filter(!is.na(budget)) %>%
  arrange(desc(budget))
knitr::kable(budgeted[1:5, 1:6])
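|title                              | year| length|   budget| rating| votes|
|:----------------------------------|----:|------:|--------:|------:|-----:|
|Spider-Man 2                       | 2004|    127| 2.00e+08|    7.9| 40256|
|Titanic                            | 1997|    194| 2.00e+08|    6.9| 90195|
|Troy                               | 2004|    162| 1.85e+08|    7.1| 33979|
|Terminator 3: Rise of the Machines | 2003|    109| 1.75e+08|    6.9| 32111|
|Waterworld                         | 1995|    176| 1.75e+08|    5.4| 19325|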

Relationship Between Budget and Rating/Votes

We will examine the relationship between budget and rating, and between budget and votes. The ggplot2movies dataset contains 58788 movies in total; we restrict the analysis to the 500 most expensive movies with a known budget released from 2000 to 2005.

budgeted <- movies %>%
  filter(!is.na(budget), year >= 2000, year <= 2005) %>%
  arrange(desc(budget)) %>%
  slice_head(n = 500)

Budget vs. Rating Scatter Plot

ggplot(budgeted, aes(x = budget/1e6, y = rating)) + geom_point(shape = 23) + 
  ggtitle("Budget vs. Rating") +
  xlab("Budget (Million)") + ylab("Rating")

Budget vs. Votes Scatter Plot

ggplot(budgeted, aes(x = budget/1e6, y = votes/1e3)) + geom_point(shape = 23) + 
  ggtitle("Budget vs. Votes") +
  xlab("Budget (Million)") + ylab("Votes (Thousand)")

Results: Budget vs. Rating

Creating a linear regression line that fits Budget vs. Rating Scatter Plot:
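The code for this step is not shown here. A minimal sketch follows, assuming the model is fitted exactly as reported in the Call line of the summary output (rating ~ budget on budgeted). Drawing the line with geom_smooth(method = "lm") is an assumption about the plotting approach, and p_rating is a name introduced only for this example.

```r
library(dplyr)
library(ggplot2)
library(ggplot2movies)

# Rebuild the subset used in this section so the block runs on its own
budgeted <- movies %>%
  filter(!is.na(budget), year >= 2000, year <= 2005) %>%
  arrange(desc(budget)) %>%
  slice_head(n = 500)

# Fit rating as a linear function of budget (formula taken from the summary output)
mod_rating <- lm(rating ~ budget, data = budgeted)

# Overlay the least-squares line on the scatter plot
# (geom_smooth is an assumed choice; se = FALSE hides the confidence band)
p_rating <- ggplot(budgeted, aes(x = budget / 1e6, y = rating)) +
  geom_point(shape = 23) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  ggtitle("Budget vs. Rating") +
  xlab("Budget (Million)") + ylab("Rating")
p_rating
```

Because the x axis is budget in millions, the plotted line is the same fit as mod_rating with the slope expressed per million dollars.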

Results: Budget vs. Votes

Creating a linear regression line that fits Budget vs. Votes Scatter Plot:
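The code for this step is not shown here either. As a sketch, the model can be fitted as reported in the Call line of the summary output (votes ~ budget on budgeted) and the line drawn from its coefficients with geom_abline(). The plotting approach, the rescaling to the plot's units, and the names b0, b1 and p_votes are assumptions made for this example.

```r
library(dplyr)
library(ggplot2)
library(ggplot2movies)

# Rebuild the subset used in this section so the block runs on its own
budgeted <- movies %>%
  filter(!is.na(budget), year >= 2000, year <= 2005) %>%
  arrange(desc(budget)) %>%
  slice_head(n = 500)

# Fit votes as a linear function of budget (formula taken from the summary output)
mod_votes <- lm(votes ~ budget, data = budgeted)
b0 <- coef(mod_votes)[["(Intercept)"]]
b1 <- coef(mod_votes)[["budget"]]

# The plot shows votes in thousands against budget in millions, so the
# coefficients are rescaled: intercept / 1e3 and slope * 1e6 / 1e3
p_votes <- ggplot(budgeted, aes(x = budget / 1e6, y = votes / 1e3)) +
  geom_point(shape = 23) +
  geom_abline(intercept = b0 / 1e3, slope = b1 * 1e3) +
  ggtitle("Budget vs. Votes") +
  xlab("Budget (Million)") + ylab("Votes (Thousand)")
p_votes
```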

Conclusions: Budget vs. Rating

summary(mod_rating)
## 
## Call:
## lm(formula = rating ~ budget, data = budgeted)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2502 -0.6919  0.0722  0.7485  2.7724 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.708e+00  1.043e-01   54.74  < 2e-16 ***
## budget      5.974e-09  1.610e-09    3.71  0.00023 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.145 on 498 degrees of freedom
## Multiple R-squared:  0.0269, Adjusted R-squared:  0.02495 
## F-statistic: 13.77 on 1 and 498 DF,  p-value: 0.0002304
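To make the size of the slope concrete, the sketch below plugs the rounded coefficients from the output above into the regression equation. The function name predict_rating and the example budgets are chosen here for illustration; with R-squared at 0.0269, budget accounts for only about 2.7% of the variation in rating, so these predictions are rough.

```r
# Coefficients copied (rounded) from the summary(mod_rating) output
b0 <- 5.708      # intercept: predicted rating at a budget of 0
b1 <- 5.974e-09  # slope: change in predicted rating per additional dollar

# Predicted rating for a given budget in dollars
predict_rating <- function(budget) b0 + b1 * budget

# Change in predicted rating for an extra $100 million of budget
effect_100m <- b1 * 1e8
effect_100m          # 0.5974

# Predicted rating for a $200 million movie
predict_rating(2e8)  # 5.708 + 1.1948 = 6.9028
```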

Conclusions: Budget vs. Votes

summary(mod_votes)
## 
## Call:
## lm(formula = votes ~ budget, data = budgeted)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -25135  -6977  -3221   2944 138477 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.736e+03  1.276e+03   2.144   0.0325 *  
## budget      1.763e-04  1.970e-05   8.948   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14010 on 498 degrees of freedom
## Multiple R-squared:  0.1385, Adjusted R-squared:  0.1368 
## F-statistic: 80.07 on 1 and 498 DF,  p-value: < 2.2e-16
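As a quick check of what this fit implies for a single movie, the sketch below uses the rounded coefficients from the output above together with the Troy row of the table shown earlier (budget 1.85e+08, 33979 votes). The variable names are introduced for this example, and because the coefficients are rounded the numbers are approximate.

```r
# Coefficients copied (rounded) from the summary(mod_votes) output
b0 <- 2.736e+03  # intercept: predicted votes at a budget of 0
b1 <- 1.763e-04  # slope: additional predicted votes per dollar of budget

# Troy, as listed in the dataset table
troy_budget <- 1.85e8
troy_votes  <- 33979

# Predicted votes and the residual (observed minus predicted)
pred_troy  <- b0 + b1 * troy_budget
resid_troy <- troy_votes - pred_troy
pred_troy   # 2736 + 32615.5 = 35351.5
resid_troy  # 33979 - 35351.5 = -1372.5
```

A residual of about -1,400 votes is small next to the residual standard error of 14010, so Troy sits close to the fitted line, while the maximum residual of 138477 shows that some movies lie very far above it.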

Summary

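The relationship between budget and rating, where x is the budget in dollars, is: \[ \hat{\gamma}=5.708 + 5.974\times10^{-9}x \]

The relationship between budget and votes, where x is again the budget in dollars, is: \[ \hat{\gamma}=2.736\times10^3 + 1.763\times10^{-4}x \]

We can thus conclude that budget is positively related to both the rating and the number of votes of a movie. Both slopes are statistically significant (p = 0.00023 for rating, p < 2e-16 for votes), but the R-squared values (0.0269 and 0.1385) show that budget alone explains only a small share of the variation in either outcome.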