---
title: "Simple Linear Regression"
author: "J Sigma"
editor: source
format:
html:
css: styles.css
toc: true
toc-depth: 3
number-sections: false
theme: cosmo
code-fold: true
code-tools: true
smooth-scroll: true
embed-resources: true
page-navigation: true
pdf:
documentclass: article
toc: true
number-sections: false
execute:
engine: knitr
echo: true
warning: false
message: false
---
# **Recap of Foundation Concepts in Statistics**
## **Populations versus Samples**
In statistics, we differentiate – for good measure – between a **population** and a **sample**.
- a **population** refers to an entire group that we are interested in studying. For example, we may want to explain the relationship (if there is any) between the number of study hours and the final marks of students for the **entire** 2026 *Applied Statistics* cohort. This would be a measure of the whole population.
- a **sample** is a smaller, representative group of the population. See, it is usually impractical in the real world to obtain data for the whole population because of time or financial constraints, or even the inability to reach the entire population. So, we take smaller, random, unbiased samples that represent the population. We then use the statistics we obtain from those samples (like the sample means and sample standard deviations) to approximate the corresponding parameters for the entire population.
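As a rough illustration of this idea, the sketch below simulates a hypothetical population of marks and draws one random sample from it. The population size, mean, standard deviation and seed are made-up values chosen for illustration only; they are not data from the course.

```{r}
# Hypothetical population: 5000 simulated marks (all values are assumptions)
set.seed(2026)
population_marks <- rnorm(5000, mean = 60, sd = 12)

# Draw a simple random sample of 50 marks without replacement
sample_marks <- sample(population_marks, size = 50)

# Population parameter versus the sample statistic that estimates it
pop_mean <- mean(population_marks)
samp_mean <- mean(sample_marks)
c(population_mean = pop_mean, sample_mean = samp_mean)
```

The two means should be close but not identical, which is exactly why inference is needed.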
## **Statistical Inference**
**Statistical inference** is the use of samples and their statistics in an attempt to reach a conclusion about the whole population from which they come.
For example, we may find that, for the particular sample(s) that we study, there is a positive relationship between the number of study hours and final marks of the students for the cohort. If we are given enough evidence to infer that this is the true and general behaviour of the whole population, we may safely conclude that there is indeed a positive relationship between these two measures.
So, we infer **population parameters** using **sample statistics**. This is the core idea behind statistical inference.
## **Hypothesis Testing**
**Hypothesis testing** is the tool or procedure which we use to make valid statistical inferences about populations. The procedure is conducted as follows:
::: {.callout-important title="Hypothesis Testing (Modified p-value Approach)"}
1. **Define a null hypothesis**, $H_{0}$ : this is the hypothesis or assumption of **no statistical significance**. It is the default assumption about the population, under which any observed results in our data are due to random variation.
For our example, we may say that there is no significant relationship between the number of study hours and the final marks of students.
2. **Define an alternative hypothesis**, $H_{1} \text{ or } H_{a}$ : this is the hypothesis of statistical significance. This hypothesis tells us that the behaviour of the population is not due to chance.
3. **Define a significance level**, $\alpha$ : this is the **type I error rate**, i.e., the probability that we will reject $H_{0}$ when it is, in fact, true. We define it so that we know how likely we are to conclude that there is some significant relationship in our population when this is not the case. Of course, we want this to be quite low, so we usually set $\alpha=0.05$; more critical studies (where we have matters of life and death) use much lower significance levels.
4. **Calculate the test statistic** : the kind of test statistic and the method by which it is calculated will differ depending on the type of test being conducted. This, in turn, depends on the **sampling distribution** of the test statistic. In all cases, we calculate this with the assumption that $H_{0}$ is true.
This is important. We calculate the test statistic under this assumption because we later measure how likely it is to obtain a test statistic at least this extreme if $H_{0}$ is true. If this probability is no larger than the significance level we have defined, we reject $H_{0}$ in favour of $H_{1}$.
5. **Calculate the** $p$**-value** : this is the probability of getting a test statistic as or more extreme than the calculated test statistic, assuming $H_{0}$ is true.
6. **Conclusion** : If $p \leq \alpha$, then we reject $H_{0}$ and conclude that the observed test statistic is significant. Otherwise, we fail to reject $H_{0}$ and conclude that there is no evidence of statistical significance in the test we have performed.
:::
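The last two steps of the procedure can be sketched in R as follows. The test statistic `t_stat_demo` and the degrees of freedom `df_demo` are hypothetical values, and a two-sided $t$-test is assumed purely for illustration.

```{r}
alpha <- 0.05        # significance level (step 3)
t_stat_demo <- 2.5   # hypothetical test statistic (step 4)
df_demo <- 18        # hypothetical degrees of freedom

# Step 5: two-sided p-value, computed assuming H0 is true
p_value_demo <- 2 * pt(abs(t_stat_demo), df = df_demo, lower.tail = FALSE)

# Step 6: compare the p-value with alpha
decision <- if (p_value_demo <= alpha) "reject H0" else "fail to reject H0"
decision
```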
# **The Problem We Want to Solve**
The aim of simple linear regression is to be able to explain and describe the relationship between two variables. We do this by determining:
1. **How strong the relationship is** between the two variables
2. **Whether the relationship is real**, or just due to chance
3. Whether we can **explain the impact that changing one variable has on the other**. So, we ask what increasing one variable by $1$ unit does to the other, for example.
4. Whether we can **predict** one variable using the other
We will answer all of these questions using the following example.
::: {.callout-warning title="Working Example" icon="false"}
As part of an experiment, a lecturer recorded the overall course marks and the number of lectures attended for $20$ students in the 2025 *Applied Statistics* cohort. The results of this experiment are shown below.
```{r}
#################################
# READING THE DATA INTO R
#################################
# capture data
Attendence <- c(46, 10, 38, 27, 45, 26, 35, 45, 48,
20, 30, 27, 38, 12, 28, 40, 38, 47, 36, 40)
Marks <- c(80, 20, 59, 34, 71, 55, 50, 78, 81, 28, 50, 47, 77,
18, 41, 79, 68, 88, 66, 70)
# data frame
lecture_data <- data.frame(Attendence, Marks)
lecture_data
# scatter plot
plot(lecture_data$Attendence, lecture_data$Marks,
     ylab="Overall Course Marks", xlab="Number of Lectures Attended",
     main="Course Mark vs Lecture Attendance")
```
:::
## **Correlation Analysis**
Correlation analysis helps us to answer part of our problem. In particular, it helps us to answer:
1. **How strong the relationship is between the two variables**. In addition to this, it gives us the direction of this relationship
2. **Whether the relationship is real, or just due to chance**
**Pearson's correlation coefficient**, $r$, as you are used to it, helps us to quantify the strength of the relationship between the two variables. We have that
$$-1\leq r \leq 1 $$
whereby:
- $1$ represents a **perfect positive relationship**
- $-1$ represents a **perfect negative relationship**
- $0$ represents **no correlation**
- the closer we are to $0$, the weaker the relationship, and the closer we are to $1$ or $-1$, the stronger the relationship
::: {}
We differentiate between the **population correlation coefficient** (given by $\rho$) and the **sample correlation coefficient** (given by $r$).
:::
We use the following formula to calculate $r$ :
$$r=\frac{\sum^{n}_{i=1}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}}=\frac{SS_{xy}}{\sqrt{SS_{x}SS_{y}}}$$
### **Calculating the Correlation Coefficient in R**
```{r}
##############################################
# CORRELATION COEFFICIENT FROM FIRST PRINCIPLES
##############################################
# x and y
x <- lecture_data$Attendence
y <- lecture_data$Marks
# means of x and y
xbar <- mean(x)
ybar <- mean(y)
# sum of squares
SSxy <- sum((x-xbar)*(y-ybar))
SSx <- sum((x-xbar)^2)
SSy <- sum((y-ybar)^2)
# calculating r
r <- (SSxy)/sqrt((SSx*SSy))
r
```
This is just the formula for $r$ defined above, turned into code. However, computing the correlation step by step like this is tedious. Instead, we can use a built-in function in R:
```{r}
###################################
# IN-BUILT CORRELATION COEFFICIENT
###################################
cor(lecture_data$Attendence, lecture_data$Marks) # order does not matter
```
We get the same answer!
### **Inference on the correlation coefficient**
The second question that correlation analysis answers is whether there is a real relationship between the two variables, or whether the relationship is just due to chance. To answer this question, we perform **inference on the correlation coefficient**. We begin by stating the null and alternative hypotheses. We have that
$$H_{0}:\rho=0 \quad \text{... no significant relationship between the variables}$$
$$\text{and}$$
$$H_{1}:\rho\neq0 \quad \text{... there is some significant relationship}$$
We perform this test at the standard $\alpha=0.05$.
::: callout-note
Sometimes, we perform the test at the $1\%$ significance level. This does not change the procedure. It only changes the threshold at which we will reject the null hypothesis.
:::
The sampling distribution of the test statistic for inference on the correlation coefficient is a $t$ distribution. The test statistic is given by
$$t=\frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} \sim t_{n-2}$$
where $n$ is the number of observations in our data set and $r$ is the sample correlation coefficient. For our example, we have that
$$\begin{align*}
t &\approx\frac{(0.9497)\sqrt{20-2}}{\sqrt{1-(0.9497)^{2}}}\\
&\approx12.87 \sim t_{18}
\end{align*}$$
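Rather than typing rounded values, the same test statistic could be computed directly from the data. This is a small sketch; the names `r_hat`, `n_obs` and `t_manual` are introduced here for illustration only.

```{r}
# Data from the working example, repeated so the chunk runs on its own
lecture_data <- data.frame(
  Attendence = c(46, 10, 38, 27, 45, 26, 35, 45, 48, 20,
                 30, 27, 38, 12, 28, 40, 38, 47, 36, 40),
  Marks = c(80, 20, 59, 34, 71, 55, 50, 78, 81, 28,
            50, 47, 77, 18, 41, 79, 68, 88, 66, 70)
)

r_hat <- cor(lecture_data$Attendence, lecture_data$Marks)  # sample correlation
n_obs <- nrow(lecture_data)                                # number of observations

# t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom
t_manual <- r_hat * sqrt(n_obs - 2) / sqrt(1 - r_hat^2)
t_manual
```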
We then calculate the $p$-value for the test statistic. It is important to note that, since the alternative hypothesis is one of a difference from $0$ (and not any particular direction from zero), the test will be two-sided. This motivates the way in which we calculate the $p$-value:
```{r}
#########################################################
# FINDING THE P-VALUE FROM THE TEST STATISTIC (MANUALLY)
##########################################################
p_val <- 2*pt(q=12.87, df=18, lower.tail=FALSE)
p_val
```
We can also perform inference on the correlation coefficient using an in-built function in R. For that, we have the following
```{r}
#################################################
# IN-BUILT INFERENCE ON CORRELATION COEFFICIENT
#################################################
cor.test(lecture_data$Attendence, lecture_data$Marks)
```
From this, we can extract
- the test statistic – $t=12.869$
- the degrees of freedom – $\text{df}=18$
- the p-value – $\text{p-value}=1.625\text{e}-10$
- the alternative hypothesis – ***alternative hypothesis: true correlation is not equal to 0***
- the correlation coefficient – $\text{cor}=0.9497185$
We then reject the null hypothesis: since the $p$-value is less than the defined level of significance, $\alpha=0.05$, there is significant evidence of a linear relationship between the final course marks and the number of lectures attended.
::: callout-note
If the $p$-value was greater than $0.05$, we would fail to reject the null hypothesis and conclude that there is no evidence of a significant linear relationship between the final marks and the number of lectures attended.
:::
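The quantities read off the printed output can also be pulled out of the object that `cor.test()` returns. The sketch below stores the result under the illustrative name `ct` and accesses its components by name.

```{r}
# Data from the working example, repeated so the chunk runs on its own
lecture_data <- data.frame(
  Attendence = c(46, 10, 38, 27, 45, 26, 35, 45, 48, 20,
                 30, 27, 38, 12, 28, 40, 38, 47, 36, 40),
  Marks = c(80, 20, 59, 34, 71, 55, 50, 78, 81, 28,
            50, 47, 77, 18, 41, 79, 68, 88, 66, 70)
)

ct <- cor.test(lecture_data$Attendence, lecture_data$Marks)

ct$statistic   # test statistic t
ct$parameter   # degrees of freedom
ct$p.value     # two-sided p-value
ct$estimate    # sample correlation coefficient r

# Decision at the 5% significance level: TRUE means reject H0
ct$p.value <= 0.05
```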
### **Limitations of Correlation Analysis**
Although correlation analysis provides information about the strength and direction of the relationship between two variables, and whether this relationship is real, it still fails to tell us:
- whether we can explain the impact that changing one variable has on the other; and
- whether we can predict one variable using the other
This is where **simple linear regression** steps in.
## **Simple Linear Regression**
Simple linear regression allows us to answer the rest of our problem, as we have established. It differs from a correlation analysis in that it now matters which variable we assign to $x$ (the **independent variable**) and which we assign to $y$ (the **dependent variable**).
Simple linear regression analysis is based on the equation of a straight line: $y=mx+c$.
### **Simple Linear Regression Model**
#### **Population Model**
$$y_{i}=\beta_{0}+\beta_{1}x_{i}+\epsilon_{i}$$
whereby:
- $i$ refers to an observation $i \in \{1,2,3,\dots, n\}$
- $y_{i}$ is an observed value for a given $x_{i}$
- $\beta_{0}$ is the **intercept parameter**
- $\beta_{1}$ is the **slope parameter**; and
- $\epsilon_{i}$ is the error for a particular observation which accounts for any variability in $y_{i}$ that is not explained by the independent variable
#### **Sample Model**
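In the sample model, we replace the unknown parameters with estimates obtained from our data. The error term drops out because the fitted line gives the predicted (average) value of $y$ for a given $x$. We have that
$$\hat{y}_{i}=\hat{\beta}_{0}+\hat{\beta}_{1}x_{i}$$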
Now,
- $\hat{y}_{i}$ is the **predicted value** of the dependent variable
- $\hat{\beta}_{0}$ is the **estimated intercept parameter**
- $\hat{\beta}_{1}$ is the **estimated slope parameter**
Here, we assume that the errors are normally distributed with a mean of $0$ and some constant variance. So, $\epsilon_{i} \sim N(0, \sigma^{2})$.
### **Estimating the** $\beta$ **Parameters**
The method used to estimate the $\beta$ parameters in SLR is called the **ordinary least squares (OLS) method**. It works by minimising
$$\sum_{i=1}^{n}\epsilon_{i}^{2}=\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}$$
i.e., it **minimises the sum of the squared error terms**. Essentially, it finds a line which gives the smallest possible sum of squared error terms. So, we obtain optimal $\beta$ values for the model we are trying to fit.
```{r}
##################################
# SIMPLE LINEAR REGRESSION PLOT
##################################
#scatter plot
plot(x, y, xlab="Number of Lectures Attended",
     ylab="Course Marks", main="Course Marks vs Lecture Attendance")
#model fit
model <- lm(Marks ~ Attendence, data = lecture_data)
#line of best fit
abline(model, col = "red", lwd = 2)
```
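For simple linear regression, the least-squares estimates have a well-known closed form: $\hat{\beta}_{1}=SS_{xy}/SS_{x}$ and $\hat{\beta}_{0}=\bar{y}-\hat{\beta}_{1}\bar{x}$. The sketch below applies these formulas to the working example and compares the result with `lm()`. The names `att`, `mk`, `b0_hand` and `b1_hand` are introduced here for illustration only.

```{r}
# Data from the working example, repeated so the chunk runs on its own
lecture_data <- data.frame(
  Attendence = c(46, 10, 38, 27, 45, 26, 35, 45, 48, 20,
                 30, 27, 38, 12, 28, 40, 38, 47, 36, 40),
  Marks = c(80, 20, 59, 34, 71, 55, 50, 78, 81, 28,
            50, 47, 77, 18, 41, 79, 68, 88, 66, 70)
)
att <- lecture_data$Attendence
mk <- lecture_data$Marks

# Slope: SS_xy / SS_x
b1_hand <- sum((att - mean(att)) * (mk - mean(mk))) / sum((att - mean(att))^2)
# Intercept: ybar - slope * xbar
b0_hand <- mean(mk) - b1_hand * mean(att)
c(intercept = b0_hand, slope = b1_hand)

# The same estimates from lm(), for comparison
coef(lm(Marks ~ Attendence, data = lecture_data))
```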
#### **Performing Linear Regression in R**
```{r}
############################################
# FITTING A SIMPLE LINEAR REGRESSION MODEL
############################################
fit <- lm(Marks ~ Attendence, data=lecture_data) #note: y ~ x
summary(fit)
```
From this, we have that our fitted regression equation is given by
$$\hat{y}=-3.6851+1.8250(\text{Attendance})$$
#### **Interpreting the** $\beta$ **Estimates**
::: {.callout-important title="Interpreting the Parameter Estimates" icon="false"}
**Interpreting** $\beta_{0}$ : it is the average value of $y$ when $x=0$
**Interpreting** $\beta_{1}$ : it is the average estimated change in $y$ for a one-unit increase in $x$. $\beta_{1}>0 \implies \text{increase}$ and $\beta_{1}<0 \implies \text{decrease}$. We need to be specific as to the context that has been given.
:::
In our example, we have that, on average, a student's mark is $-3.6851$ when a student attends no lectures at all. Notice that, sometimes, the interpretation of $\beta_{0}$ is not useful contextually, as is the case here. We know that the lowest mark a student can obtain is $0$, and so a mark of $-3.6851$ does not make any real sense.
$\beta_{1}$ tells us that, on average, a student's mark will increase by $1.8250$ for every additional lecture they attend.
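The fitted equation also addresses the prediction question posed at the start. As a sketch, the predicted mark for a student who attends $30$ lectures (a hypothetical value inside the observed range of $10$ to $48$ lectures) could be obtained with `predict()`. The names `new_student` and `pred_mark` are illustrative.

```{r}
# Data from the working example, repeated so the chunk runs on its own
lecture_data <- data.frame(
  Attendence = c(46, 10, 38, 27, 45, 26, 35, 45, 48, 20,
                 30, 27, 38, 12, 28, 40, 38, 47, 36, 40),
  Marks = c(80, 20, 59, 34, 71, 55, 50, 78, 81, 28,
            50, 47, 77, 18, 41, 79, 68, 88, 66, 70)
)
fit <- lm(Marks ~ Attendence, data = lecture_data)

# The column name must match the predictor name used in the model formula
new_student <- data.frame(Attendence = 30)
pred_mark <- predict(fit, newdata = new_student)
pred_mark
```

This should agree with substituting $30$ into the fitted equation: $-3.6851+1.8250(30)\approx 51.06$.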
### **Assessing the Accuracy of the Model**
The **residual standard error (RSE)** is used to measure the accuracy of the model. We have that
$$\text{RSE}=\sqrt{\frac{\sum_{i=1}^{n}\epsilon^{2}_{i}}{n-2}}$$
and this measures the standard deviation of the model residuals.
- higher RSE $\implies$ less accurate model. This can be seen by more deviation of the residuals from the regression line
- lower RSE $\implies$ more accurate model. Observations, in this case, will be much closer to the regression line
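The RSE formula can be checked against the value that R reports. In this sketch, `rse_hand` is an illustrative name, and the residuals of the fitted model stand in for the error terms in the formula.

```{r}
# Data from the working example, repeated so the chunk runs on its own
lecture_data <- data.frame(
  Attendence = c(46, 10, 38, 27, 45, 26, 35, 45, 48, 20,
                 30, 27, 38, 12, 28, 40, 38, 47, 36, 40),
  Marks = c(80, 20, 59, 34, 71, 55, 50, 78, 81, 28,
            50, 47, 77, 18, 41, 79, 68, 88, 66, 70)
)
fit <- lm(Marks ~ Attendence, data = lecture_data)
n_obs <- nrow(lecture_data)

# RSE = sqrt(sum of squared residuals / (n - 2))
rse_hand <- sqrt(sum(residuals(fit)^2) / (n_obs - 2))
rse_hand

# The same quantity as stored in the model summary
summary(fit)$sigma
```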
### **Assessing the Accuracy of the** $\beta$ **Estimates**
The **standard error** of a $\beta$ estimate indicates how far the sample estimate is likely to be from the population parameter. A standard error that is large relative to the size of the estimate indicates that the estimate is imprecise. We have that
$$se(\hat{\beta}_{1})=\frac{\text{RSE}}{\sqrt{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}}=\frac{\text{RSE}}{\sqrt{SS_{x}}}$$
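As a sketch, the standard error of the slope can be computed from this formula and compared with the coefficient table. The names `ss_x` and `se_b1_hand` are introduced here for illustration.

```{r}
# Data from the working example, repeated so the chunk runs on its own
lecture_data <- data.frame(
  Attendence = c(46, 10, 38, 27, 45, 26, 35, 45, 48, 20,
                 30, 27, 38, 12, 28, 40, 38, 47, 36, 40),
  Marks = c(80, 20, 59, 34, 71, 55, 50, 78, 81, 28,
            50, 47, 77, 18, 41, 79, 68, 88, 66, 70)
)
fit <- lm(Marks ~ Attendence, data = lecture_data)

# SS_x: sum of squared deviations of the predictor from its mean
ss_x <- sum((lecture_data$Attendence - mean(lecture_data$Attendence))^2)

# se(slope) = RSE / sqrt(SS_x)
se_b1_hand <- summary(fit)$sigma / sqrt(ss_x)
se_b1_hand

# The same value from the coefficient table
coef(summary(fit))["Attendence", "Std. Error"]
```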
### **Testing the Significance of the** $\beta$ **Estimates**
Testing the slope allows us to determine whether it is likely that there is a linear relationship between the independent and dependent variables. We start with the null and alternative hypotheses, obtaining that
$$H_{0}: \beta_{1}=0 \quad \text{... no significant relationship}$$
$$\text{and}$$
$$H_{1}: \beta_{1}\neq0$$
Suppose we perform this test at the $5\%$ significance level. Then, $\alpha=0.05$. We can then calculate the test statistic as
$$t=\frac{\hat{\beta}_{1}-\beta_{1}}{se(\hat{\beta}_{1})} \sim t_{n-2}$$
Since we are performing the test under the assumption that $H_{0}$ is true, we have $\beta_{1}=0$. Then,
$$t=\frac{\hat{\beta}_{1}}{se(\hat{\beta}_{1})} \sim t_{n-2}$$
From the R output for our example, we get that
$$\begin{align*}
t&\approx\frac{1.8250}{0.1418}\\
&\approx12.87 \sim t_{18}
\end{align*}$$
The $p$-value is then given by
```{r}
################################
# FINDING THE P-VALUE MANUALLY
################################
p <- 2*pt(q=12.87, df=18, lower.tail=FALSE)
p
```
We could have also obtained this using the in-built R model summary function
```{r}
#####################################
# USING MODEL SUMMARY FUNCTION IN R
#####################################
summary(fit)
```
The standard error of the $\beta_{1}$ estimate is given next to the estimate itself, and the $p$-value follows to the right of the test statistic of the $\beta_{1}$ estimate.
We then reject the null hypothesis since $p<0.05$, and conclude that there is evidence of a significant linear relationship between the course marks and lecture attendance of the students in the 2025 *Applied Statistics* cohort.
### **Confidence Intervals for the** $\beta$ **Estimates**
We calculate the confidence intervals for our $\beta$ estimates as follows:
$$\text{CI}=\hat{\beta}_{i}\pm t_{{\alpha/2}, \text{ df}} \times se(\hat{\beta}_{i})$$
for $i \in \{0, 1\}$. For our example, we can calculate the $95\%$ confidence interval for $\beta_{1}$ as follows:
$$\begin{align*}
\text{CI} &= 1.8250 \pm 2.101\times 0.1418\\
&=[1.527, 2.123]
\end{align*}$$
We could have obtained the critical $t$-value using R
```{r}
####################
# CRITICAL VALUE
####################
tcrit <- qt(p=0.025, df=18, lower.tail=FALSE)
tcrit
```
and we could have found these, as a whole, using R in-built functions
```{r}
###########################################
# CONFIDENCE INTERVALS FOR BETA ESTIMATES
###########################################
# This gives us confidence intervals for both the intercept
# and slope parameters
confint(fit)
```
::: {.callout-important title="Interpreting the Confidence Interval"}
If we were to draw many samples from our population and compute a $95\%$ confidence interval from each one, we would expect about $95\%$ of those intervals to contain the true parameter. For our sample, the interval for the slope is $[1.527, 2.123]$ and the interval for the intercept is $[-14.26, 6.89]$.
:::
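To tie the formula and the built-in function together, the slope interval could be assembled by hand from the coefficient table. This is a sketch; `est`, `se_slope`, `t_crit` and `ci_hand` are illustrative names.

```{r}
# Data from the working example, repeated so the chunk runs on its own
lecture_data <- data.frame(
  Attendence = c(46, 10, 38, 27, 45, 26, 35, 45, 48, 20,
                 30, 27, 38, 12, 28, 40, 38, 47, 36, 40),
  Marks = c(80, 20, 59, 34, 71, 55, 50, 78, 81, 28,
            50, 47, 77, 18, 41, 79, 68, 88, 66, 70)
)
fit <- lm(Marks ~ Attendence, data = lecture_data)

est <- coef(summary(fit))["Attendence", "Estimate"]         # slope estimate
se_slope <- coef(summary(fit))["Attendence", "Std. Error"]  # its standard error
t_crit <- qt(0.975, df = df.residual(fit))                  # critical value, df = n - 2

# estimate +/- critical value * standard error
ci_hand <- c(lower = est - t_crit * se_slope, upper = est + t_crit * se_slope)
ci_hand

# Should match the slope row of confint()
confint(fit)["Attendence", ]
```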
### **Checking Overall Model Significance**
In addition to assessing the significance of the $\beta$ estimates, we can also check the overall model significance by checking if our model is any different to a **null model**.
::: callout-note
A **null model** is a model that assumes no significance, relationship or pattern.
:::
To perform this test, we need a bit of information:
| Source of Variation | $\text{df}$ | Sum of Squares | Mean Squares | F-statistic |
|:------------:|:--------:|:-----------------:|:---------------:|:------------:|
| **Regression** | $1$ | $SS_{reg}=\sum(\hat{y}_{i}-\bar{y})^{2}$ | $MS_{reg}=\frac{SS_{reg}}{1}$ | $F=\frac{MS_{reg}}{MSE}$ |
| **Errors (Residuals)** | $n-2$ | $SSE=\sum(y_{i}-\hat{y}_{i})^{2}$ | $MSE=\frac{SSE}{n-2}$ | |
| **Total** | $n-1$ | $SS_{tot}=\sum(y_{i}-\bar{y})^{2}$ | | |
We can use the F-statistic to perform an **F-test.**
::: callout-note
$$\sqrt{MSE}=RSE$$
:::
We begin, once again, by stating the null and alternative hypotheses. Only, in this case, we have that
$$H_{0}:\text{the model does not differ from a null model}$$
$$\text{and}$$
$$H_{1}:\text{our model is different from a null model}$$
We define an $\alpha$ level of $0.05$. We first calculate the test statistic by hand before using any built-in functions in R
```{r}
#########################################
# CALCULATING THE F-STATISTIC MANUALLY
#########################################
n <- nrow(lecture_data)
# SS_reg
yhat <- fitted(fit)
ybar <- mean(lecture_data$Marks)
SS_reg <- sum((yhat - ybar)^2)
# SSE
y <- lecture_data$Marks
SSE <- sum((y - yhat)^2)
# MS_reg
df1 <- 1
MS_reg <- SS_reg/df1
# MSE
df2 <- n-2
MSE <- SSE/df2
# F statistic
f <- MS_reg/MSE
f
```
We can then calculate the $p$-value associated with this test statistic as
```{r}
#############
# P-VALUE
#############
pval <- pf(q=f, df1=df1, df2=df2, lower.tail=FALSE)
pval
```
We could have also gotten all of this from the model summary.
```{r}
#######################################################
# OBTAINING THE TEST STATISTIC USING THE MODEL SUMMARY
#######################################################
summary(fit)
```
In any case, we reject the null hypothesis since the $p$-value is less than the defined significance level. We then conclude that there is evidence of a significant model at the $5\%$ significance level, and that our model is likely different from a null model.
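R can also produce the analysis of variance table itself through `anova()`. The sketch below extracts the F-statistic from that table and checks a known property of simple linear regression: with a single predictor, the F-statistic equals the square of the slope's $t$-statistic. The names `anova_tab`, `f_from_anova` and `t_slope` are illustrative.

```{r}
# Data from the working example, repeated so the chunk runs on its own
lecture_data <- data.frame(
  Attendence = c(46, 10, 38, 27, 45, 26, 35, 45, 48, 20,
                 30, 27, 38, 12, 28, 40, 38, 47, 36, 40),
  Marks = c(80, 20, 59, 34, 71, 55, 50, 78, 81, 28,
            50, 47, 77, 18, 41, 79, 68, 88, 66, 70)
)
fit <- lm(Marks ~ Attendence, data = lecture_data)

# ANOVA table: df, sums of squares, mean squares, F and its p-value
anova_tab <- anova(fit)
anova_tab

# F-statistic from the regression row, and the slope's t-statistic
f_from_anova <- anova_tab[["F value"]][1]
t_slope <- coef(summary(fit))["Attendence", "t value"]
c(F = f_from_anova, t_squared = t_slope^2)
```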
### **Coefficient of Determination**
The **coefficient of determination**, $R^{2}$, is a measure of model fit, and is defined by
$$R^{2}=\frac{SS_{reg}}{SS_{tot}}=\frac{\sum(\hat{y}_{i}-\bar{y})^{2}}{\sum(y_{i}-\bar{y})^{2}}$$
::: callout-note
$R^{2}=r^{2}$ for simple linear regression, where $r$ is the Pearson correlation coefficient. Consequently, we have that
$$0\leq R^{2}\leq 1$$
:::
$R^{2}$ describes the amount (or proportion) of variation in the response variable that is explained by the variation in the explanatory variable.
- low $R^{2}$ value $\implies$ poor model fit
- high $R^{2}$ value $\implies$ good model fit
This value can be found in our model summary as "**Multiple R-squared**".
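As a final check, $R^{2}$ can be computed from its definition and compared with both the model summary and the squared correlation coefficient. This is a sketch; `ss_reg`, `ss_tot` and `r2_hand` are illustrative names.

```{r}
# Data from the working example, repeated so the chunk runs on its own
lecture_data <- data.frame(
  Attendence = c(46, 10, 38, 27, 45, 26, 35, 45, 48, 20,
                 30, 27, 38, 12, 28, 40, 38, 47, 36, 40),
  Marks = c(80, 20, 59, 34, 71, 55, 50, 78, 81, 28,
            50, 47, 77, 18, 41, 79, 68, 88, 66, 70)
)
fit <- lm(Marks ~ Attendence, data = lecture_data)

# R^2 = SS_reg / SS_tot
ss_reg <- sum((fitted(fit) - mean(lecture_data$Marks))^2)
ss_tot <- sum((lecture_data$Marks - mean(lecture_data$Marks))^2)
r2_hand <- ss_reg / ss_tot

c(by_hand = r2_hand,
  from_summary = summary(fit)$r.squared,
  r_squared = cor(lecture_data$Attendence, lecture_data$Marks)^2)
```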