2025-09-15

What Is It?

Simple linear regression models the relationship between two continuous (quantitative) variables. The fitted model is often used to predict the value of one variable (the response) from the other (the predictor).

Who Created It?

Sir Francis Galton is credited with developing this way of analyzing data and is said to have started with a pea experiment involving his friends. According to Stanton (2001), “He had given seven packets of seeds to seven of his friends and they reported their harvests back to him, Galton plotted weights of the daughter seeds with the weights of the mother seeds. Galton realized that the median weights…approximately described a straight line with a positive slope less than one”. The straight line he found came to be known as a regression line.

Example 1

All examples use the movies data set from the ggplot2movies package, to show simple linear regression applied to real data.

Is there a relationship between budget and length for movies?

Formula for regression:

\(\text{budget} = \beta_0 + \beta_1 \cdot \text{length} + \varepsilon\), where \(\varepsilon \sim \mathcal{N}(0, \sigma^2)\)
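To make the roles of \(\beta_0\) and \(\beta_1\) concrete, here is a minimal sketch that computes the least-squares intercept and slope by hand and compares them with what lm() returns. The data values and the names x, y, b0, b1, and toy_mod are made up for illustration and are not taken from the movies data set.

```r
# Toy data, invented for illustration only (not from the movies data set)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares estimates for y = b0 + b1 * x + error
b1 <- cov(x, y) / var(x)       # slope: covariance of x and y over variance of x
b0 <- mean(y) - b1 * mean(x)   # intercept: the line passes through (mean(x), mean(y))

# lm() should return the same two estimates
toy_mod <- lm(y ~ x)
by_hand <- c(b0, b1)
by_lm <- unname(coef(toy_mod))

print(by_hand)
print(by_lm)
```

The same two quantities are what lm(budget ~ length, ...) estimates in Example 1, with length in the role of x and budget in the role of y.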

Example 1 Cont.

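library(plotly)         # plot_ly(), add_lines(), layout(), config()
library(dplyr)          # %>%, select(), filter(), group_by(), summarise()
library(ggplot2)        # ggplot(), used in Example 2
library(ggplot2movies)  # movies data set

movies1 <- movies %>%
  select(budget, length) %>%
  filter(!is.na(budget), !is.na(length)) # drop rows with missing values

movies1_mod <- lm(budget ~ length, data = movies1) # fit the linear model
x <- movies1$length
y <- movies1$budget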

xax <- list(
  title="Movie Length")

yax <- list(
  title="Movie Budget")

fig <- plot_ly(x=x, y=y, type="scatter", #create scatter plot
               mode="markers", name="data",
               width=800, height=430) %>%
  add_lines(x = x, y = fitted(movies1_mod), #add fitted linear regression line
            name="fitted") %>%
  layout(title="Movie Budget vs. Length of Movie",xaxis=xax, yaxis=yax)

Example 1 Plot

config(fig, displaylogo=FALSE)

There appears to be a positive correlation between movie length and budget.
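The visual impression can be checked numerically. The sketch below is one possible way to do so, assuming the ggplot2movies package is installed. It refits the Example 1 model using base R filtering and extracts the slope, the Pearson correlation, and \(R^2\). The names slope, r, and r_squared are introduced here for illustration.

```r
library(ggplot2movies)  # assumed to be installed; provides the movies data set

# Same filtering as in Example 1, written with base R
movies1 <- subset(movies, !is.na(budget) & !is.na(length),
                  select = c(budget, length))
movies1_mod <- lm(budget ~ length, data = movies1)

slope <- unname(coef(movies1_mod)["length"])  # estimated change in budget per extra minute
r <- cor(movies1$length, movies1$budget)      # Pearson correlation coefficient
r_squared <- summary(movies1_mod)$r.squared   # share of budget variance explained by length

print(c(slope = slope, r = r, r_squared = r_squared))
```

In simple linear regression the slope and the correlation always share the same sign, and \(R^2\) equals the squared correlation, so a positive slope with a small \(R^2\) would indicate a positive but noisy relationship.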

Example 2

Have movie ratings gotten worse or better over time?

movies_rating <- movies %>%
  group_by(year) %>%                                      # one group per release year
  summarise(average_rating = mean(rating, na.rm = TRUE))  # mean rating per year, ignoring NAs

movies2_mod <- lm(average_rating ~ year, data=movies_rating) #create linear model

fig2 <- ggplot(movies_rating, aes(x=year, y=average_rating)) +
  geom_point() +
  geom_smooth(method="lm", se=TRUE, color="red") + #regression line + standard error band
  labs(title = "Average Movie Ratings Over Time",
       x = "Year",
       y = "Rating")

Formula for regression:

\(\text{average rating} = \beta_0 + \beta_1 \cdot \text{year} + \varepsilon\), where \(\varepsilon \sim \mathcal{N}(0, \sigma^2)\)

Example 2 plot

Example 2 plot cont.

We can also restrict the limits of the x- and y-axes to zoom in on part of the plot and make the trend easier to see.

fig2_zoom <- ggplot(movies_rating, aes(x=year, y=average_rating)) +
  geom_point() +
  geom_smooth(method="lm", se=TRUE, color="red") +
  labs(title = "Average Movie Ratings Over Time",
       x = "Year",
       y = "Rating") +
  coord_cartesian(xlim=c(1920, 2000), ylim=c(5.5, 6.5)) #zooms in on a section of the
                                                        #plot without deleting data

Example 2 plot cont.

The relationship appears to be only a weak positive correlation.
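How weak the trend is can be gauged from the fitted model rather than from the plot alone. The following is a hedged sketch, assuming ggplot2movies is installed. It rebuilds the yearly averages with base R's aggregate() instead of the dplyr pipeline, then reports the slope for year, its 95% confidence interval, and \(R^2\). The names year_slope, year_ci, and r2 are introduced here for illustration.

```r
library(ggplot2movies)  # assumed to be installed; provides the movies data set

# Average rating per year (base R equivalent of the group_by/summarise step)
movies_rating <- aggregate(rating ~ year, data = movies, FUN = mean)
names(movies_rating)[2] <- "average_rating"

movies2_mod <- lm(average_rating ~ year, data = movies_rating)

year_slope <- unname(coef(movies2_mod)["year"])         # change in average rating per year
year_ci <- confint(movies2_mod, "year", level = 0.95)   # 95% confidence interval for the slope
r2 <- summary(movies2_mod)$r.squared                    # share of variance explained by year

print(year_slope)
print(year_ci)
print(r2)
```

A confidence interval that sits close to zero, or an \(R^2\) near zero, would be consistent with reading the trend as weak.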

Works Cited