Simple Linear Regression

Author

J Sigma

Recap of Foundation Concepts in Statistics

Populations versus Samples

In statistics, we distinguish between a population and a sample.

  • a population refers to the entire group that we are interested in studying. For example, we may want to explain the relationship (if there is any) between the number of study hours and the final marks of students for the entire 2026 Applied Statistics cohort. Measuring every student in that cohort would be a measurement of the whole population.

  • a sample is a smaller, representative group of the population. It is usually impractical in the real world to obtain data for the whole population because of time or financial constraints, or even the inability to reach the entire population. So, we take smaller, random, unbiased samples that represent the population. We then use the statistics we obtain from those samples (like the sample means and sample standard deviations) to approximate the corresponding parameters for the entire population.
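The idea of approximating a population parameter with a sample statistic can be sketched with simulated data. The population below is hypothetical (500 normally distributed marks generated with an arbitrary seed); it is not data from any real cohort, and the names used are illustrative.

```r
# Hypothetical population: final marks for 500 students (simulated, not real data)
set.seed(2026)
population_marks <- rnorm(500, mean = 60, sd = 12)

# Population parameter: normally unknown in practice
mu <- mean(population_marks)

# A simple random sample of 30 students, drawn without replacement
sample_marks <- sample(population_marks, size = 30)

# Sample statistic used to approximate the population parameter
x_bar <- mean(sample_marks)

c(population_mean = mu, sample_mean = x_bar)
```

The two means will generally be close but not identical; that gap is the sampling variation which statistical inference has to account for.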

Statistical Inference

Statistical inference is the use of samples and their statistics in an attempt to reach a conclusion about the whole population from which they come.

For example, we may find that, for the particular sample(s) that we study, there is a positive relationship between the number of study hours and final marks of the students for the cohort. If we are given enough evidence to infer that this is the true and general behaviour of the whole population, we may safely conclude that there is indeed a positive relationship between these two measures.

So, we infer population parameters using sample statistics. This is the core idea behind statistical inference.

Hypothesis Testing

Hypothesis testing is the tool or procedure which we use to make valid statistical inferences about populations. The procedure is conducted as follows:

Important: Hypothesis Testing (Modified p-value Approach)
  1. Define a null hypothesis, \(H_{0}\) : this is the hypothesis or assumption of no statistical significance. It is the default assumption about the population, under which any observed results from our data are due to random variation.

    For our example, we may say that there is no significant relationship between the number of study hours and the final marks of students.

  2. Define an alternative hypothesis, \(H_{1} \text{ or } H_{a}\) : this is the hypothesis of statistical significance. This hypothesis states that the pattern observed in the data is not due to chance alone.

  3. Define a significance level, \(\alpha\) : this is the type I error rate, i.e., the probability that we will reject \(H_{0}\) when it is, in fact, true. We define it so that we know how likely we are to conclude that there is some significant relationship in our population when this is not the case. Of course, we want this to be quite low, and so we usually define \(\alpha=0.05\). More critical studies (where we have matters of life and death) use much lower significance levels.

  4. Calculate the test statistic : the kind of test statistic and the method by which it is calculated will differ depending on the type of test being conducted. This, in turn, depends on the sampling distribution of the test statistic. In all cases, we calculate this with the assumption that \(H_{0}\) is true.

    This is important. We calculate the test statistic under this assumption because we later measure how likely it is to obtain a test statistic at least this extreme, assuming that \(H_{0}\) is true. If this probability is no larger than the significance level we have defined, we reject \(H_{0}\) and conclude that the data support \(H_{1}\).

  5. Calculate the \(p\)-value : this is the probability of getting a test statistic at least as extreme as the calculated test statistic, assuming \(H_{0}\) is true.

  6. Conclusion : If \(p \leq \alpha\), then we reject \(H_{0}\) and conclude that the observed test statistic is significant. Otherwise, we fail to reject \(H_{0}\) and conclude that there is insufficient evidence of statistical significance in the test we have performed.
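Steps 3 to 6 of the procedure can be sketched in R as follows. The test statistic and degrees of freedom here are made-up values for illustration, and a two-sided \(t\)-test is assumed; other tests would use a different sampling distribution in step 5.

```r
# Step 3: significance level
alpha <- 0.05

# Step 4: test statistic, computed assuming H0 is true
# (hypothetical value and degrees of freedom, for illustration only)
t_stat <- 2.5
df <- 18

# Step 5: two-sided p-value, P(|T| >= |t_stat|) under H0
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# Step 6: decision rule
decision <- if (p_value <= alpha) "reject H0" else "fail to reject H0"
decision
```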

The Problem We Want to Solve

The aim of simple linear regression is to be able to explain and describe the relationship between two variables. We do this by determining:

  1. How strong the relationship is between the two variables
  2. Whether the relationship is real, or just due to chance
  3. Whether we can explain the impact that changing one variable has on the other. So, we ask what increasing one variable by \(1\) unit does to the other, for example.
  4. Whether we can predict one variable using the other

We will answer all of these questions using the following example.

Warning: Working Example

As part of an experiment, a lecturer recorded the overall course marks and the number of lectures attended for \(20\) students in the 2025 Applied Statistics cohort. The results of this experiment are shown below.

Code
#################################
# READING THE DATA INTO R
#################################

# capture data 

Attendence <- c(46, 10, 38, 27, 45, 26, 35, 45, 48,
                20, 30, 27, 38, 12, 28, 40, 38, 47, 36, 40)
Marks <- c(80, 20, 59, 34, 71, 55, 50, 78, 81, 28, 50, 47, 77, 
           18, 41, 79, 68, 88, 66, 70)

# data frame

lecture_data <- data.frame(Attendence, Marks)
lecture_data
   Attendence Marks
1          46    80
2          10    20
3          38    59
4          27    34
5          45    71
6          26    55
7          35    50
8          45    78
9          48    81
10         20    28
11         30    50
12         27    47
13         38    77
14         12    18
15         28    41
16         40    79
17         38    68
18         47    88
19         36    66
20         40    70
Code
# scatter plot

plot(lecture_data$Attendence, lecture_data$Marks, 
     ylab="Overall Course Marks", xlab="Number of Lectures Attended",
     main="Course Mark vs Lecture Attendance")

Correlation Analysis

Correlation analysis helps us to answer part of our problem. In particular, it helps us to answer:

  1. How strong the relationship is between the two variables. In addition to this, it gives us the direction of this relationship.
  2. Whether the relationship is real, or just due to chance

Pearson's correlation coefficient, \(r\), quantifies the strength and direction of the linear relationship between the two variables. We have that

\[-1\leq r \leq 1 \]

whereby:

  • \(1\) represents a perfect positive relationship

  • \(-1\) represents a perfect negative relationship

  • \(0\) represents no correlation

  • the closer we are to \(0\), the weaker the relationship, and the closer we are to \(1\) or \(-1\), the stronger the relationship

We differentiate between the population correlation coefficient (given by \(\rho\)) and the sample correlation coefficient (given by \(r\)).

We use the following formula to calculate \(r\) :

\[r=\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}}=\frac{SS_{xy}}{\sqrt{SS_{x}SS_{y}}}\]

Calculating the Correlation Coefficient in R

Code
##############################################
# CORRELATION COEFFICIENT FROM FIRST PRINCIPLES
##############################################

# x and y

x <- lecture_data$Attendence
y <- lecture_data$Marks

# means of x and y 

xbar <- mean(x)
ybar <- mean(y)

# sum of squares

SSxy <- sum((x-xbar)*(y-ybar))
SSx <- sum((x-xbar)^2)
SSy <- sum((y-ybar)^2)

# calculating r 

r <- (SSxy)/sqrt((SSx*SSy))
r
[1] 0.9497185

This is the formula for \(r\) that we have defined above, translated directly into code. However, calculating every sum of squares by hand is tedious and error-prone. Instead, we can use a built-in function in R. We do this as follows:

Code
###################################
# IN-BUILT CORRELATION COEFFICIENT
###################################

cor(lecture_data$Attendence, lecture_data$Marks) # order does not matter
[1] 0.9497185

We get the same answer!

Inference on the correlation coefficient

The second question that correlation analysis answers is whether there is a real relationship between the two variables, or whether the relationship is just due to chance. To answer this question, we perform inference on the correlation coefficient. We begin by stating the null and alternative hypotheses. We have that

\[H_{0}:\rho=0 \quad \text{... no significant relationship between the variables}\]

\[\text{and}\]

\[H_{1}:\rho\neq0 \quad \text{... there is some significant relationship}\]

We perform this test at the standard \(\alpha=0.05\).

Note

Sometimes, we perform the test at the \(1\%\) significance level. This does not change the procedure. It only changes the threshold at which we will reject the null hypothesis.

The sampling distribution of the test statistic for inference on the correlation coefficient is a \(t\) distribution. The test statistic is given by

\[t=\frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} \sim t_{n-2}\]

where \(n\) is the number of observations in our data set and \(r\) is the sample correlation coefficient. For our example, we have that

\[\begin{align*} t &\approx\frac{(0.9497)\sqrt{20-2}}{\sqrt{1-(0.9497)^{2}}}\\ &\approx12.87 \sim t_{18} \end{align*}\]

We then calculate the \(p\)-value for the test statistic. It is important to note that, since the alternative hypothesis is one of a difference from \(0\) (and not any particular direction from zero), the test will be two-sided. This motivates the way in which we calculate the \(p\)-value, where we double the upper-tail probability:

Code
#########################################################
# FINDING THE P-VALUE FROM THE TEST STATISTIC (MANUALLY)
##########################################################

p_val <- 2*pt(q=12.87, df=18, lower.tail=FALSE)
p_val
[1] 1.622399e-10
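The value 12.87 above was typed in after rounding. As a sketch, the test statistic and \(p\)-value can also be computed from \(r\) and \(n\) directly, which avoids rounding error. The block repeats the data so that it runs on its own; the names n, t_stat and p_exact are illustrative.

```r
# Data from the working example
Attendence <- c(46, 10, 38, 27, 45, 26, 35, 45, 48,
                20, 30, 27, 38, 12, 28, 40, 38, 47, 36, 40)
Marks <- c(80, 20, 59, 34, 71, 55, 50, 78, 81, 28, 50, 47, 77,
           18, 41, 79, 68, 88, 66, 70)

n <- length(Attendence)
r <- cor(Attendence, Marks)

# t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value: double the upper-tail probability
p_exact <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

c(t = t_stat, p = p_exact)
```

Because the unrounded statistic is used, the \(p\)-value differs very slightly from the one obtained with q=12.87.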

We can also perform inference on the correlation coefficient using an in-built function in R. For that, we have the following

Code
#################################################
# IN-BUILT INFERENCE ON CORRELATION COEFFICIENT
#################################################

cor.test(lecture_data$Attendence, lecture_data$Marks)

    Pearson's product-moment correlation

data:  lecture_data$Attendence and lecture_data$Marks
t = 12.869, df = 18, p-value = 1.625e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8748863 0.9802637
sample estimates:
      cor 
0.9497185 

From this, we can extract

  • the test statistic – \(t=12.869\)

  • the degrees of freedom – \(\text{df}=18\)

  • the p-value – \(\text{p-value}=1.625\times 10^{-10}\)

  • the alternative hypothesis – alternative hypothesis: true correlation is not equal to 0

  • the correlation coefficient – \(\text{cor}=0.9497185\)

Since the \(p\)-value is less than the defined level of significance, \(\alpha=0.05\), we reject the null hypothesis and conclude that there is significant evidence of a linear relationship between the final course marks and the number of lectures attended.

Note

If the \(p\)-value were greater than \(0.05\), we would fail to reject the null hypothesis and conclude that there is insufficient evidence of a significant linear relationship between the final marks and the number of lectures attended.

Limitations of Correlation Analysis

Although correlation analysis provides information about the strength and direction of the relationship between two variables, and whether this relationship is real, it still fails to tell us:

  • whether we can explain the impact that changing one variable has on the other; and

  • whether we can predict one variable using the other

This is where simple linear regression steps in.

Simple Linear Regression

Simple linear regression allows us to answer the rest of our problem, as we have established. It differs from a correlation analysis in that it now matters which variable we assign to \(x\) (the independent variable) and which we assign to \(y\) (the dependent variable).

Simple linear regression analysis is based on the equation of a straight line: \(y=mx+c\).

Simple Linear Regression Model

Population Model

\[y_{i}=\beta_{0}+\beta_{1}x_{i}+\epsilon_{i}\]

whereby:

  • \(i\) refers to an observation \(i \in \{1,2,3,\dots, n\}\)

  • \(y_{i}\) is an observed value for a given \(x_{i}\)

  • \(\beta_{0}\) is the intercept parameter

  • \(\beta_{1}\) is the slope parameter; and

  • \(\epsilon_{i}\) is the error for a particular observation which accounts for any variability in \(y_{i}\) that is not explained by the independent variable

Sample Model

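In the sample model, we replace the unknown \(\beta\) parameters with estimates calculated from the sample. The fitted line has no error term, because it gives the estimated average value of \(y\) for a given \(x\). We have that

\[\hat{y}_{i}=\hat{\beta}_{0}+\hat{\beta}_{1}x_{i}\]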

Now,

  • \(\hat{y}_{i}\) is the predicted value of the dependent variable

  • \(\hat{\beta}_{0}\) is the estimated intercept parameter

  • \(\hat{\beta}_{1}\) is the estimated slope parameter

For inference, we assume that the errors in the population model are normally distributed with a mean of \(0\) and some constant variance. So, \(\epsilon_{i} \sim N(0, \sigma^{2})\).

Estimating the \(\beta\) Parameters

The method used to estimate the \(\beta\) parameters in SLR is called the ordinary least squares (OLS) method. It works by minimising

\[\sum_{i=1}^{n}\epsilon_{i}^{2}=\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}\]

i.e., it minimises the sum of the squared error terms. Essentially, it finds a line which gives the smallest possible sum of squared error terms. So, we obtain optimal \(\beta\) values for the model we are trying to fit.
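For simple linear regression, minimising this sum has a well-known closed-form solution: \(\hat{\beta}_{1}=SS_{xy}/SS_{x}\) and \(\hat{\beta}_{0}=\bar{y}-\hat{\beta}_{1}\bar{x}\). These formulas are standard results rather than something derived in these notes. A minimal sketch, repeating the data so that the block runs on its own (b0 and b1 are illustrative names):

```r
# Data from the working example
x <- c(46, 10, 38, 27, 45, 26, 35, 45, 48,
       20, 30, 27, 38, 12, 28, 40, 38, 47, 36, 40)  # lectures attended
y <- c(80, 20, 59, 34, 71, 55, 50, 78, 81, 28, 50, 47, 77,
       18, 41, 79, 68, 88, 66, 70)                  # course marks

xbar <- mean(x)
ybar <- mean(y)

# Sums of squares, as in the correlation formula
SSxy <- sum((x - xbar) * (y - ybar))
SSx  <- sum((x - xbar)^2)

# Closed-form OLS estimates
b1 <- SSxy / SSx         # slope
b0 <- ybar - b1 * xbar   # intercept

# Compare with the estimates that lm() computes
rbind(by_hand = c(b0, b1), lm = unname(coef(lm(y ~ x))))
```

Both rows should agree, since lm() solves the same least squares problem.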

Code
##################################
# SIMPLE LINEAR REGRESSION PLOT
##################################

#scatter plot
plot(x, y, xlab="Number of Lectures Attended",
     ylab="Course Marks", main="Course Marks vs Lecture Attendance")


#model fit (use the data frame columns so that data = lecture_data is actually used)
model <- lm(Marks ~ Attendence, data = lecture_data)


#line of best fit
abline(model, col = "red", lwd = 2)

Performing Linear Regression in R

Code
############################################
# FITTING A SIMPLE LINEAR REGRESSION MODEL
############################################

fit <- lm(Marks ~ Attendence, data=lecture_data) #note: y ~ x 
summary(fit)

Call:
lm(formula = Marks ~ Attendence, data = lecture_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.590  -5.215  -0.240   4.348  11.335 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.6851     5.0333  -0.732    0.474    
Attendence    1.8250     0.1418  12.869 1.62e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.868 on 18 degrees of freedom
Multiple R-squared:  0.902, Adjusted R-squared:  0.8965 
F-statistic: 165.6 on 1 and 18 DF,  p-value: 1.625e-10

From this, we have that our fitted regression equation is given by

\[\hat{y}=-3.6851+1.8250(\text{Attendance})\]

Interpreting the \(\beta\) Estimates

Important: Interpreting the Parameter Estimates

Interpreting \(\beta_{0}\) : it is the average value of \(y\) when \(x=0\)

Interpreting \(\beta_{1}\) : it is the average estimated change in \(y\) for a one-unit increase in \(x\). \(\beta_{1}>0 \implies \text{increase}\) and \(\beta_{1}<0 \implies \text{decrease}\). We need to state the interpretation in the context that has been given.

In our example, the model estimates that, on average, a student’s mark is \(-3.6851\) when the student attends no lectures at all. Notice that, sometimes, the interpretation of \(\beta_{0}\) is not useful contextually, as is the case here. We know that the lowest mark a student can obtain is \(0\), and so a mark of \(-3.6851\) does not make any real sense. In addition, no student in the sample attended fewer than \(10\) lectures, so \(x=0\) lies outside the range of the observed data.

\(\hat{\beta}_{1}\) tells us that, on average, a student’s mark will increase by \(1.8250\) marks for every additional lecture they attend.
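The fitted equation can also be used to predict. As a sketch, the block below refits the model and predicts the mark of a hypothetical student who attends 30 lectures (a value chosen for illustration, inside the observed range of 10 to 48 lectures). The names new_student, pred and manual are illustrative.

```r
# Data from the working example
lecture_data <- data.frame(
  Attendence = c(46, 10, 38, 27, 45, 26, 35, 45, 48,
                 20, 30, 27, 38, 12, 28, 40, 38, 47, 36, 40),
  Marks = c(80, 20, 59, 34, 71, 55, 50, 78, 81, 28, 50, 47, 77,
            18, 41, 79, 68, 88, 66, 70)
)
fit <- lm(Marks ~ Attendence, data = lecture_data)

# Hypothetical new student; the column name must match the predictor in the model
new_student <- data.frame(Attendence = 30)

# Prediction from the fitted model
pred <- predict(fit, newdata = new_student)

# The same prediction from the fitted equation: beta0_hat + beta1_hat * 30
manual <- coef(fit)[["(Intercept)"]] + coef(fit)[["Attendence"]] * 30

c(predict = unname(pred), by_hand = manual)
```

Using the rounded estimates by hand gives \(-3.6851+1.8250\times 30\approx 51.06\).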

Assessing the Accuracy of the Model

The residual standard error (RSE) is used to measure the accuracy of the model. We have that

\[\text{RSE}=\sqrt{\frac{\sum_{i=1}^{n}\epsilon^{2}_{i}}{n-2}}\]

and this measures the standard deviation of the model residuals.

  • higher RSE \(\implies\) less accurate model. The observations deviate more from the regression line

  • lower RSE \(\implies\) more accurate model. The observations, in this case, are much closer to the regression line
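As a sketch, the RSE formula can be applied to the residuals of the fitted model and compared with the value that R stores in the model summary. The block refits the model so that it runs on its own; rse_manual is an illustrative name.

```r
# Data from the working example
lecture_data <- data.frame(
  Attendence = c(46, 10, 38, 27, 45, 26, 35, 45, 48,
                 20, 30, 27, 38, 12, 28, 40, 38, 47, 36, 40),
  Marks = c(80, 20, 59, 34, 71, 55, 50, 78, 81, 28, 50, 47, 77,
            18, 41, 79, 68, 88, 66, 70)
)
fit <- lm(Marks ~ Attendence, data = lecture_data)

n <- nrow(lecture_data)

# Residuals: observed marks minus fitted marks
res <- residuals(fit)

# RSE = sqrt(sum of squared residuals / (n - 2))
rse_manual <- sqrt(sum(res^2) / (n - 2))

# Value reported as "Residual standard error" in summary(fit)
c(by_hand = rse_manual, from_summary = summary(fit)$sigma)
```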

Assessing the Accuracy of the \(\beta\) Estimates

The standard error of a \(\beta\) estimate indicates how far the sample estimate is likely to be from the population parameter. A standard error that is large relative to the size of the estimate indicates that the estimate is imprecise. For the slope, we have that

\[se(\hat{\beta}_{1})=\frac{\text{RSE}}{\sqrt{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}}=\frac{\text{RSE}}{\sqrt{SS_{x}}}\]

Here, we can see that for a smaller standard error of the slope parameter (on the left), the samples are relatively similar to one another. Where there is a larger standard error (on the right), the samples look much more different from one another. The estimates are less stable from sample to sample, and so any single estimate is a less reliable guide to the population slope.
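The formula for \(se(\hat{\beta}_{1})\) can be checked against R's output. This is a sketch that refits the model so that it runs on its own; se_b1 is an illustrative name.

```r
# Data from the working example
lecture_data <- data.frame(
  Attendence = c(46, 10, 38, 27, 45, 26, 35, 45, 48,
                 20, 30, 27, 38, 12, 28, 40, 38, 47, 36, 40),
  Marks = c(80, 20, 59, 34, 71, 55, 50, 78, 81, 28, 50, 47, 77,
            18, 41, 79, 68, 88, 66, 70)
)
fit <- lm(Marks ~ Attendence, data = lecture_data)

x <- lecture_data$Attendence

# Residual standard error and SS_x
rse <- summary(fit)$sigma
SSx <- sum((x - mean(x))^2)

# se(beta1_hat) = RSE / sqrt(SS_x)
se_b1 <- rse / sqrt(SSx)

# Value reported in the "Std. Error" column of the coefficient table
c(by_hand = se_b1,
  from_summary = coef(summary(fit))["Attendence", "Std. Error"])
```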

Testing the Significance of the \(\beta\) Estimates

Testing the slope allows us to determine whether it is likely that there is a linear relationship between the independent and dependent variables. We start with the null and alternative hypotheses, obtaining that

\[H_{0}: \beta_{1}=0 \quad \text{... no significant relationship}\]

\[\text{and}\]

\[H_{1}: \beta_{1}\neq0\]

Suppose we perform this test at the \(5\%\) significance level. Then, \(\alpha=0.05\). We can then calculate the test statistic as

\[t=\frac{\hat{\beta}_{1}-\beta_{1}}{se(\hat{\beta}_{1})} \sim t_{n-2}\]

Since we are performing the test under the assumption that \(H_{0}\) is true, we have \(\beta_{1}=0\). The test statistic then simplifies to

\[t=\frac{\hat{\beta}_{1}}{se(\hat{\beta}_{1})} \sim t_{n-2}\]

From the R output for our example, we get that

\[\begin{align*} t&\approx\frac{1.8250}{0.1418}\\ &\approx12.87 \sim t_{18} \end{align*}\]

The \(p\)-value is then given by

Code
################################
# FINDING THE P-VALUE MANUALLY
################################

p <- 2*pt(q=12.87, df=18, lower.tail=FALSE)
p
[1] 1.622399e-10

We could have also obtained this using the in-built R model summary function

Code
#####################################
# USING MODEL SUMMARY FUNCTION IN R
#####################################

summary(fit)

Call:
lm(formula = Marks ~ Attendence, data = lecture_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.590  -5.215  -0.240   4.348  11.335 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.6851     5.0333  -0.732    0.474    
Attendence    1.8250     0.1418  12.869 1.62e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.868 on 18 degrees of freedom
Multiple R-squared:  0.902, Adjusted R-squared:  0.8965 
F-statistic: 165.6 on 1 and 18 DF,  p-value: 1.625e-10

The standard error of the \(\beta_{1}\) estimate is given next to the estimate itself, and the \(p\)-value follows to the right of the test statistic of the \(\beta_{1}\) estimate.
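Rather than reading these values off the printed summary, they can be extracted programmatically from the coefficient table. A sketch, refitting the model so that the block runs on its own (coefs, est, se, t_slope and p_slope are illustrative names):

```r
# Data from the working example
lecture_data <- data.frame(
  Attendence = c(46, 10, 38, 27, 45, 26, 35, 45, 48,
                 20, 30, 27, 38, 12, 28, 40, 38, 47, 36, 40),
  Marks = c(80, 20, 59, 34, 71, 55, 50, 78, 81, 28, 50, 47, 77,
            18, 41, 79, 68, 88, 66, 70)
)
fit <- lm(Marks ~ Attendence, data = lecture_data)

# Coefficient table as a numeric matrix
coefs <- coef(summary(fit))

est     <- coefs["Attendence", "Estimate"]
se      <- coefs["Attendence", "Std. Error"]
t_slope <- coefs["Attendence", "t value"]
p_slope <- coefs["Attendence", "Pr(>|t|)"]

# The t value is the estimate divided by its standard error
c(t_from_table = t_slope, t_by_hand = est / se, p_value = p_slope)
```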

We then reject the null hypothesis since \(p<0.05\), and conclude that there is evidence of a significant linear relationship between the course marks and lecture attendance of the students in the 2025 Applied Statistics cohort.

Confidence Intervals for the \(\beta\) Estimates

We calculate the confidence intervals for our \(\beta\) estimates as follows:

\[\text{CI}=\hat{\beta}_{j}\pm t_{\alpha/2,\ \text{df}} \times se(\hat{\beta}_{j})\]

for \(j \in \{0, 1\}\), where \(\text{df}=n-2\). For our example, we can calculate the \(95\%\) confidence interval for \(\beta_{1}\) as follows:

\[\begin{align*} \text{CI} &= 1.8250 \pm 2.101\times 0.1418\\ &=[1.527, 2.123] \end{align*}\]

We could have obtained the critical \(t\)-value using R:

Code
####################
# CRITICAL VALUE
####################

tcrit <- qt(p=0.025, df=18, lower.tail=FALSE)
tcrit
[1] 2.100922

We could also have obtained the confidence intervals as a whole using built-in R functions:

Code
###########################################
# CONFIDENCE INTERVALS FOR BETA ESTIMATES
###########################################

# This gives us confidence intervals for both the intercept
# and slope parameters
confint(fit)
                 2.5 %   97.5 %
(Intercept) -14.259806 6.889518
Attendence    1.527061 2.122947
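Important: Interpreting the Confidence Interval

We are \(95\%\) confident that the population slope \(\beta_{1}\) lies in the \([1.527, 2.123]\) interval, and that the population intercept \(\beta_{0}\) lies in the \([-14.26, 6.89]\) interval. More precisely, if we were to obtain many samples from our population and calculate a \(95\%\) confidence interval from each one, we would expect about \(95\%\) of those intervals to contain the true parameter.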
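The repeated-sampling interpretation can be illustrated with a small simulation. This is only a sketch: the "true" parameter values, the error standard deviation and the seed below are hypothetical choices, not estimates taken from the working example.

```r
set.seed(1)

# Hypothetical "true" population model (values chosen for illustration)
beta0 <- -4
beta1 <- 1.8
sigma <- 7
x <- seq(10, 48, by = 2)  # 20 fixed attendance values

n_sims <- 1000
covered <- logical(n_sims)

for (s in seq_len(n_sims)) {
  # Draw a new sample from the population model
  y <- beta0 + beta1 * x + rnorm(length(x), mean = 0, sd = sigma)
  # 95% confidence interval for the slope from this sample
  ci <- confint(lm(y ~ x))["x", ]
  # Does this interval contain the true slope?
  covered[s] <- ci[1] <= beta1 && beta1 <= ci[2]
}

# Proportion of intervals that contain the true slope
mean(covered)
```

The proportion should be close to \(0.95\). Each individual interval either contains \(\beta_{1}\) or it does not; the \(95\%\) describes the long-run behaviour of the procedure.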

Checking Overall Model Significance

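In addition to assessing the significance of the \(\beta\) estimates, we can also check the overall model significance by checking if our model is any different from a null model.

Note

A null model is a model that assumes no significance, relationship or pattern. In simple linear regression, this is the model with an intercept only, which predicts \(\bar{y}\) for every observation.

To perform this test, we need the quantities in the following analysis of variance (ANOVA) table: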

| Source of Variation | \(\text{df}\) | Sum of Squares | Mean Squares | F-statistic |
|---|---|---|---|---|
| Regression | \(1\) | \(SS_{reg}=\sum(\hat{y}_{i}-\bar{y})^{2}\) | \(MS_{reg}=\frac{SS_{reg}}{1}\) | \(F=\frac{MS_{reg}}{MSE}\) |
| Errors (Residuals) | \(n-2\) | \(SSE=\sum(y_{i}-\hat{y}_{i})^{2}\) | \(MSE=\frac{SSE}{n-2}\) | |
| Total | \(n-1\) | \(SS_{tot}=\sum(y_{i}-\bar{y})^{2}\) | | |

We can use the F-statistic to perform an F-test.

Note

\[\sqrt{MSE}=RSE\]

We begin, once again, by stating the null and alternative hypotheses. In this case, we have that

\[H_{0}:\text{our model does not differ from a null model}\]

\[\text{and}\]

\[H_{1}:\text{our model is different from a null model}\]

We define an \(\alpha\) level of \(0.05\). We first calculate the test statistic by hand before using any built-in functions in R.

Code
#########################################
# CALCULATING THE F-STATISTIC MANUALLY
#########################################

n <- nrow(lecture_data) 

# SS_reg
yhat <- fitted(fit)
ybar <- mean(lecture_data$Marks)
SS_reg <- sum((yhat - ybar)^2)

# SSE
y <- lecture_data$Marks
SSE <- sum((y - yhat)^2)

# MS_reg
df1 <- 1
MS_reg <- SS_reg/df1

# MSE
df2 <- n-2
MSE <- SSE/df2

# F statistic 
f <- MS_reg/MSE
f
[1] 165.6082

We can then calculate the \(p\)-value associated with this test statistic as

Code
#############
# P-VALUE
#############

pval <- pf(q=f, df1=df1, df2=df2, lower.tail=FALSE)
pval
[1] 1.624696e-10

We could have also gotten all of this from the model summary.

Code
#######################################################
# OBTAINING THE TEST STATISTIC USING THE MODEL SUMMARY
#######################################################

summary(fit)

Call:
lm(formula = Marks ~ Attendence, data = lecture_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.590  -5.215  -0.240   4.348  11.335 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.6851     5.0333  -0.732    0.474    
Attendence    1.8250     0.1418  12.869 1.62e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.868 on 18 degrees of freedom
Multiple R-squared:  0.902, Adjusted R-squared:  0.8965 
F-statistic: 165.6 on 1 and 18 DF,  p-value: 1.625e-10

In any case, we would reject the null hypothesis since the \(p\)-value is less than the defined significance level. We then conclude that there is evidence of a significant model at the \(5\%\) significance level, and that our model is likely different from a null model.
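R can also produce the ANOVA table itself with anova(). In addition, for simple linear regression the F-statistic equals the square of the slope's \(t\)-statistic, which is why the F-test and the slope \(t\)-test give the same \(p\)-value. A sketch, refitting the model so that the block runs on its own (anova_table, f_stat and t_slope are illustrative names):

```r
# Data from the working example
lecture_data <- data.frame(
  Attendence = c(46, 10, 38, 27, 45, 26, 35, 45, 48,
                 20, 30, 27, 38, 12, 28, 40, 38, 47, 36, 40),
  Marks = c(80, 20, 59, 34, 71, 55, 50, 78, 81, 28, 50, 47, 77,
            18, 41, 79, 68, 88, 66, 70)
)
fit <- lm(Marks ~ Attendence, data = lecture_data)

# ANOVA table: df, sums of squares, mean squares, F value and p-value
anova_table <- anova(fit)
anova_table

# F-statistic from the regression row of the table
f_stat <- anova_table[["F value"]][1]

# t-statistic for the slope from the coefficient table
t_slope <- coef(summary(fit))["Attendence", "t value"]

# In simple linear regression, F = t^2
c(F = f_stat, t_squared = t_slope^2)
```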

Coefficient of Determination

The coefficient of determination, \(R^{2}\), is a measure of model fit, and is defined by

\[R^{2}=\frac{SS_{reg}}{SS_{tot}}=\frac{\sum(\hat{y}_{i}-\bar{y})^{2}}{\sum(y_{i}-\bar{y})^{2}}\]

Note

\(R^{2}=r^{2}\) for simple linear regression, where \(r\) is the Pearson correlation coefficient. Consequently, we have that

\[0\leq R^{2}\leq 1\]

\(R^{2}\) describes the proportion of variation in the response variable that is explained by the explanatory variable.

  • low \(R^{2}\) value \(\implies\) poor model fit

  • high \(R^{2}\) value \(\implies\) good model fit
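As a sketch, \(R^{2}\) can be calculated from its definition and compared with both the squared correlation coefficient and the value stored in the model summary. The block refits the model so that it runs on its own; R2 is an illustrative name.

```r
# Data from the working example
lecture_data <- data.frame(
  Attendence = c(46, 10, 38, 27, 45, 26, 35, 45, 48,
                 20, 30, 27, 38, 12, 28, 40, 38, 47, 36, 40),
  Marks = c(80, 20, 59, 34, 71, 55, 50, 78, 81, 28, 50, 47, 77,
            18, 41, 79, 68, 88, 66, 70)
)
fit <- lm(Marks ~ Attendence, data = lecture_data)

y    <- lecture_data$Marks
yhat <- fitted(fit)
ybar <- mean(y)

# R^2 = SS_reg / SS_tot
SS_reg <- sum((yhat - ybar)^2)
SS_tot <- sum((y - ybar)^2)
R2 <- SS_reg / SS_tot

# Squared Pearson correlation, which equals R^2 in simple linear regression
r <- cor(lecture_data$Attendence, lecture_data$Marks)

c(by_hand = R2, r_squared = r^2, from_summary = summary(fit)$r.squared)
```

All three values should agree. For this example they are approximately \(0.902\), so about \(90\%\) of the variation in course marks is explained by lecture attendance.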

This value can be found from our model summary as “Multiple R-Squared