Simple linear regression is used to model the relationship between two continuous (quantitative) variables. The fitted model is often used to predict the value of one variable from the other.
2025-09-15
Sir Francis Galton is credited with discovering this way of analyzing data and is said to have started with a pea experiment involving his friends. According to Stanton (2001), “He had given seven packets of seeds to seven of his friends and they reported their harvests back to him, Galton plotted weights of the daughter seeds with the weights of the mother seeds. Galton realized that the median weights…approximately described a straight line with a positive slope less than one”. The straight line he found came to be known as a regression line.
We use the movies data set from the ggplot2movies package in all examples, to show the method applied to real data.
Is there a relationship between budget and length for movies?
Formula for regression:
\(budget = \beta_0 + \beta_1\cdot length + \varepsilon\), where \(\varepsilon \sim \mathcal{N}(\mu=0; \,\,\sigma^2)\)
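For a model of this form, the least-squares estimates have a closed form: \(\hat\beta_1 = \mathrm{cov}(length, budget) / \mathrm{var}(length)\) and \(\hat\beta_0 = \overline{budget} - \hat\beta_1 \cdot \overline{length}\). The sketch below checks this against lm() in base R. The data are synthetic and made up for illustration only (they are not the movies data), and the names length_min and budget_usd are hypothetical.

```r
# Synthetic, made-up data for illustration only (not the movies data set)
set.seed(42)
length_min <- runif(200, min = 60, max = 180)          # hypothetical run times
budget_usd <- 1e5 * length_min + rnorm(200, sd = 2e6)  # hypothetical budgets

# Closed-form least-squares estimates for simple linear regression
beta1_hat <- cov(length_min, budget_usd) / var(length_min)    # slope
beta0_hat <- mean(budget_usd) - beta1_hat * mean(length_min)  # intercept

# The same estimates obtained from lm()
fit <- lm(budget_usd ~ length_min)
coef(fit)                 # intercept and slope from lm()
c(beta0_hat, beta1_hat)   # intercept and slope from the closed form
```

Both calls should print the same two numbers up to floating-point error, which is what lm() computes for us in the examples that follow.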
library(ggplot2movies) # movies data set
library(dplyr)         # select(), filter(), %>%
library(plotly)        # plot_ly(), add_lines(), layout(), config()

# keep only budget and length, and filter out NA values
movies1 <- movies %>%
  select(budget, length) %>%
  filter(!is.na(budget), !is.na(length))

movies1_mod <- lm(budget ~ length, data = movies1) # create linear model

x <- movies1$length
y <- movies1$budget
xax <- list(title = "Movie Length")
yax <- list(title = "Movie Budget")

# create scatter plot, then add the fitted linear regression line
fig <- plot_ly(x = x, y = y, type = "scatter", mode = "markers",
               name = "data", width = 800, height = 430) %>%
  add_lines(x = x, y = fitted(movies1_mod), name = "fitted") %>%
  layout(title = "Movie Budget vs. Length of Movie", xaxis = xax, yaxis = yax)
config(fig, displaylogo=FALSE)
There appears to be a positive correlation between movie length and budget.
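Reading the direction of a relationship off a plot can be backed up with a few numbers. The helper below is a minimal sketch, not part of the original analysis: fit_strength is a hypothetical name, and it simply collects the fitted slope, the Pearson correlation, and \(R^2\) for any pair of numeric vectors without missing values.

```r
# Hypothetical helper (not part of the original analysis): summarise the
# direction and strength of a simple linear relationship between x and y.
fit_strength <- function(x, y) {
  mod <- lm(y ~ x)
  c(
    slope     = unname(coef(mod)["x"]),   # sign gives the direction
    r         = cor(x, y),                # Pearson correlation
    r_squared = summary(mod)$r.squared    # share of variance explained
  )
}
```

Assuming movies1 has been created as above, it could be called as fit_strength(movies1$length, movies1$budget). In simple linear regression r_squared is the square of r, so a positive slope with a small r_squared would indicate a positive but noisy relationship.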
Have movie ratings gotten worse or better over time?
library(ggplot2) # ggplot(), geom_point(), geom_smooth(), labs()

# group average ratings by year, ignoring NA ratings
movies_rating <- movies %>%
  group_by(year) %>%
  summarise(average_rating = mean(rating, na.rm = TRUE))

movies2_mod <- lm(average_rating ~ year, data = movies_rating) # create linear model

fig2 <- ggplot(movies_rating, aes(x = year, y = average_rating)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "red") + # regression line + standard error
  labs(title = "Average Movie Ratings Over Time", x = "Year", y = "Rating")
Formula for regression:
\(average\_rating = \beta_0 + \beta_1\cdot year + \varepsilon\), where \(\varepsilon \sim \mathcal{N} (\mu=0; \,\,\sigma^2)\)
We can also restrict the limits of the x- and y-axes to zoom in on part of the plot and make the trend easier to see.
# coord_cartesian() zooms in on a section of the plot without deleting data
fig2_zoom <- ggplot(movies_rating, aes(x = year, y = average_rating)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(title = "Average Movie Ratings Over Time", x = "Year", y = "Rating") +
  coord_cartesian(xlim = c(1920, 2000), ylim = c(5.5, 6.5))
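The choice of coord_cartesian() matters because setting limits with xlim() instead drops the points outside the range before the regression line is fitted. The sketch below illustrates the difference on synthetic data (made-up values, not the movies data); it uses ggplot2's layer_data() to inspect the x-range over which each smooth was computed.

```r
library(ggplot2)

# Synthetic, made-up yearly averages for illustration only
set.seed(1)
df <- data.frame(year = 1900:2005)
df$average_rating <- 6 + 0.002 * (df$year - 1900) + rnorm(nrow(df), sd = 0.3)

base_plot <- ggplot(df, aes(x = year, y = average_rating)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, color = "red")

p_zoom <- base_plot + coord_cartesian(xlim = c(1920, 2000)) # zoom only
p_clip <- base_plot + xlim(1920, 2000)                      # drops outside points

# Layer 2 is geom_smooth(); its x values show which data the fit was based on
range_zoom <- range(layer_data(p_zoom, 2)$x)
range_clip <- range(suppressWarnings(layer_data(p_clip, 2))$x)
range_zoom  # spans all years in df
range_clip  # spans only the years kept by xlim()
```

With coord_cartesian() the line is still fitted to every year and only the view changes, whereas with xlim() the line is refitted to the years that remain.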
The relationship between year and average rating appears to be only a weak positive correlation.
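One way to judge how weak a trend is would be to look at the estimated slope together with its confidence interval and p-value. The helper below is a minimal sketch under that assumption; slope_summary is a hypothetical name, and it only wraps the base R functions summary() and confint() for a fitted lm object.

```r
# Hypothetical helper: slope estimate, 95% confidence interval, and p-value
# for one term of a fitted linear model.
slope_summary <- function(mod, term) {
  est <- summary(mod)$coefficients[term, ]        # row of the coefficient table
  ci  <- confint(mod, parm = term, level = 0.95)  # 1 x 2 matrix: lower, upper
  c(
    estimate = unname(est["Estimate"]),
    lower    = ci[1, 1],
    upper    = ci[1, 2],
    p_value  = unname(est["Pr(>|t|)"])
  )
}
```

Assuming movies2_mod has been fitted as above, slope_summary(movies2_mod, "year") would return the estimated change in average rating per year. An interval that lies close to zero would be consistent with a weak trend.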
JMP Statistical Discovery. (2025). Simple Linear Regression. Jmp.com. https://www.jmp.com/en/statistics-knowledge-portal/what-is-regression
Stanton, J. M. (2001). Galton, Pearson, and the Peas: A Brief History of Linear Regression for Statistics Instructors. Journal of Statistics Education, 9(3). DOI: 10.1080/10691898.2001.11910537