2024-09-22

Understanding the relationship between two variables

Introduction to Simple Linear Regression

Simple linear regression is a method for modeling the relationship between a dependent variable and a single independent variable. The goal is to find the best-fitting straight line, in the least-squares sense, for predicting the dependent variable from the independent variable.

We’ll use an example to demonstrate how this works.

Example: Predicting Goals Scored

Imagine we want to predict the number of goals a soccer team will score based on their average shots per game. In this case:

  • The independent variable is the average shots per game.
  • The dependent variable is the number of goals scored.

We will fit a regression line to the data and see how well average shots per game predicts the number of goals scored.

Visualizing the Data

Let’s start by plotting the relationship between average shots per game and goals scored. This will help us see whether there is a roughly linear pattern that simple linear regression can model.
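A plot like this can be produced along the following lines. This is a minimal sketch in base R, not necessarily the code behind the original figure: the ten data points are made up for illustration and are not the data used in this article, and the axis labels, title, and point style are assumptions.

```r
# Hypothetical data for illustration only (NOT the dataset used in this article)
data <- data.frame(
  shots_per_game = c(8, 9, 10, 11, 12, 13, 14, 15, 16, 17),
  goals_scored   = c(2.1, 2.5, 2.9, 3.0, 3.5, 3.7, 4.2, 4.3, 4.8, 5.0)
)

# Scatter plot of goals scored against average shots per game
plot(
  goals_scored ~ shots_per_game, data = data,
  xlab = "Average shots per game", ylab = "Goals scored",
  main = "Shots per game vs. goals scored", pch = 19
)

# Overlay the least-squares regression line in red
model <- lm(goals_scored ~ shots_per_game, data = data)
abline(model, col = "red", lwd = 2)
```

With a single predictor, `abline()` accepts the fitted `lm` object directly and draws the line from its intercept and slope.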

Understanding the Regression Line

The red line in the plot is the regression line. This line shows the best fit for the relationship between shots per game and goals scored.

  • The slope of the line tells us how much the predicted number of goals increases for each additional shot per game.
  • In this case, the line slopes upward: teams that take more shots per game tend to score more goals.
  • Simple linear regression helps us quantify this relationship by giving us an equation for the line.

Next, we’ll calculate the equation for this regression line.

Calculating the Regression Line

To calculate the equation of the regression line, we use the lm() function in R, which stands for linear model.

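The equation of the line takes the form y = mx + b, where:

  • y is the predicted number of goals scored and x is the average shots per game.
  • m is the slope (how much the predicted goals increase for each additional shot per game).
  • b is the y-intercept (the predicted value when shots per game is zero).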
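To make the roles of m and b concrete, the sketch below computes them by hand with the standard least-squares formulas, m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b = ȳ − m·x̄, and checks that lm() returns the same values. The data are made up for illustration and are not the article's data, so the numbers will differ from the output reported in this article.

```r
# Hypothetical data for illustration only (NOT the dataset used in this article)
data <- data.frame(
  shots_per_game = c(8, 9, 10, 11, 12, 13, 14, 15, 16, 17),
  goals_scored   = c(2.1, 2.5, 2.9, 3.0, 3.5, 3.7, 4.2, 4.3, 4.8, 5.0)
)

x <- data$shots_per_game
y <- data$goals_scored

# Least-squares slope and intercept computed directly from the formulas
m <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b <- mean(y) - m * mean(x)

# The same line fitted with lm()
model <- lm(goals_scored ~ shots_per_game, data = data)

# coef(model) returns the intercept first, then the slope
print(c(intercept = b, slope = m))
print(coef(model))
print(all.equal(unname(coef(model)), c(b, m)))
```

The final line should print TRUE, since lm() fits the same ordinary least-squares line, only with a more numerically robust method.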

Regression Model Output

The output below shows the detailed results of the linear regression model:

## 
## Call:
## lm(formula = goals_scored ~ shots_per_game, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35622 -0.14587 -0.03618  0.19411  0.28386 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.40445    0.20308  -1.992   0.0816 .  
## shots_per_game  0.32004    0.01556  20.572 3.26e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2284 on 8 degrees of freedom
## Multiple R-squared:  0.9814, Adjusted R-squared:  0.9791 
## F-statistic: 423.2 on 1 and 8 DF,  p-value: 3.264e-08

Summary of the Regression Results

The linear regression analysis shows the following key points:

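  • Intercept (b): -0.404. Taken literally, a team with 0 shots per game would be predicted to score about -0.404 goals, which is impossible. The intercept anchors the line and should not be read as a real-world prediction; it is also not statistically significant at the 5% level (p = 0.0816).
  • Slope (m): 0.32. Each additional shot per game is associated with approximately 0.32 more goals scored, and the estimate is highly significant (p = 3.26e-08).
  • R-squared: 0.98. About 98% of the variation in goals scored is explained by shots per game. This indicates a very strong linear relationship, although the sample is small (10 observations, given the 8 residual degrees of freedom).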
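The quantities listed above can also be pulled out of a fitted model programmatically instead of being read off the printed summary. The sketch below shows one way to do that; it uses made-up data (not the article's data), so its numbers will not match the output above.

```r
# Hypothetical data for illustration only (NOT the dataset used in this article)
data <- data.frame(
  shots_per_game = c(8, 9, 10, 11, 12, 13, 14, 15, 16, 17),
  goals_scored   = c(2.1, 2.5, 2.9, 3.0, 3.5, 3.7, 4.2, 4.3, 4.8, 5.0)
)

model <- lm(goals_scored ~ shots_per_game, data = data)
fit <- summary(model)

# Intercept (b) and slope (m)
b <- coef(model)[["(Intercept)"]]
m <- coef(model)[["shots_per_game"]]

# p-value for the slope, taken from the coefficient table
slope_p <- fit$coefficients["shots_per_game", "Pr(>|t|)"]

# R-squared; with one predictor it equals the squared correlation
r2 <- fit$r.squared
r2_from_cor <- cor(data$shots_per_game, data$goals_scored)^2

# Residual degrees of freedom: number of observations minus 2
resid_df <- df.residual(model)

print(c(intercept = b, slope = m, slope_p = slope_p, r_squared = r2))
```

Here `r2` and `r2_from_cor` agree, which is a quick way to see that in simple linear regression R-squared is just the squared correlation between the two variables.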

In conclusion, more shots per game are strongly associated with more goals scored in this data.

The Regression Equation

The equation for the regression line is given by:

\[ y = mx + b \]

Where:

  • \(m\) is the slope (0.32 in our case).
  • \(b\) is the intercept (-0.404 in our case).

Substituting these values gives the fitted line:

\[ y = 0.32x - 0.404 \]
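As a worked example, the estimates from the model output (intercept -0.40445, slope 0.32004) can be plugged into y = mx + b to predict goals for a given number of shots per game. The helper name `predict_goals` and the input of 12 shots per game are chosen here purely for illustration; the prediction is only meaningful if that value lies within the range of the observed data, which is assumed rather than known.

```r
# Coefficients taken from the regression output in this article
intercept <- -0.40445  # b
slope     <-  0.32004  # m

# Hypothetical helper: apply y = m * x + b to a number of shots per game
predict_goals <- function(shots_per_game) {
  slope * shots_per_game + intercept
}

# Predicted goals for a team averaging 12 shots per game (illustrative input)
predicted <- predict_goals(12)
print(predicted)  # about 3.44
```

When the fitted `model` object is available, `predict(model, newdata = data.frame(shots_per_game = 12))` gives the same kind of prediction without copying coefficients by hand.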

R Code for Regression

Here is the R code used to calculate the regression line:

# Fit the linear regression model.
# `data` is a data frame with the columns goals_scored and shots_per_game.
model <- lm(goals_scored ~ shots_per_game, data = data)

# Display the model summary
summary(model)