2025-09-20

First Steps of Linear Regression

The first step in linear regression is to look at the data. We will work with the ‘airquality’ data set and see how temperature affects the Ozone.
Ozone Temp
41 67
36 72
12 74
18 62
23 65
19 59
8 61
16 69
11 66
14 68
18 58
14 64
34 66
6 57
30 68
11 62
1 59
11 73
4 61
32 61
23 67
45 81
115 79
37 76
29 82
71 90
39 87
23 82
21 77
37 72
20 65
12 73
13 76
135 84
49 85
32 81
64 83
40 83
77 88
97 92
97 92
85 89
10 73
27 81
7 80
48 81
35 82
61 84
79 87
63 85
16 74
80 86
108 85
20 82
52 86
82 88
50 86
64 83
59 81
39 81
9 81
16 82
122 89
89 90
110 90
44 86
28 82
65 80
22 77
59 79
23 76
31 78
44 78
21 77
9 72
45 79
168 81
73 86
76 97
118 94
84 96
85 94
96 91
78 92
73 93
91 93
47 87
32 84
20 80
23 78
21 75
24 73
44 81
21 76
28 77
9 71
13 71
46 78
18 67
13 76
24 68
16 82
13 64
23 71
36 81
7 69
14 63
30 70
14 75
18 76
20 68

In this table, the dependent variable ‘y’ is Ozone (in parts per billion) and the independent variable ‘x’ is Temp (in degrees Fahrenheit).
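The table above can be rebuilt in R roughly as follows. This is a minimal sketch that assumes `airquality_clean` (the name used later in the plotting code) is R's built-in `airquality` data with incomplete rows dropped; the names `x` and `y` are illustrative.

```r
# Assumption: airquality_clean is the built-in airquality data
# with every row containing a missing value removed.
airquality_clean <- na.omit(airquality)

# Independent variable x (Temp) and dependent variable y (Ozone).
x <- airquality_clean$Temp
y <- airquality_clean$Ozone

# Show the first rows of the two columns used in the regression.
head(airquality_clean[, c("Ozone", "Temp")])
```

Under that assumption the cleaned data has 111 rows, matching the table.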

Formula for Linear Regression

The simple linear regression formula is: \[{y = a + bx}\] Given this formula and the data, we need to work out ‘a’ (the y-intercept) and ‘b’ (the slope).

Here ‘b’ is the slope of the line: it shows how much the predicted Ozone changes when the temperature changes by one degree.

Here ‘a’ is the y-intercept: it tells us where the line crosses the y-axis.

Slope (b) of regression line - Part 1:

Let's start with b, the slope of the regression line. To calculate it, use the formula: \[{b = r{Sy \over Sx}}\] In this formula, r is the correlation coefficient between Temp and Ozone.

Correlation coefficient formula (r): \[\displaystyle r = {\sum((x - \bar{x})(y - \bar{y}))\over\sqrt{\sum(x - \bar{x})^2 \times \sum(y - \bar{y})^2}}\] The calculated correlation coefficient (r):

## [1] "r = 0.698541"
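The correlation formula can be evaluated term by term and checked against R's built-in `cor()`. This is a sketch, assuming `airquality_clean` is `na.omit(airquality)`; the names `r_manual` and `r_builtin` are illustrative.

```r
# Assumption: airquality_clean is the built-in airquality data without NA rows.
airquality_clean <- na.omit(airquality)
x <- airquality_clean$Temp
y <- airquality_clean$Ozone

# Numerator: sum of the products of deviations from the means.
numerator <- sum((x - mean(x)) * (y - mean(y)))

# Denominator: square root of the product of the sums of squared deviations.
denominator <- sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_manual <- numerator / denominator

# Built-in Pearson correlation for comparison.
r_builtin <- cor(x, y)

print(sprintf("r = %f", r_manual))
```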

Slope (b) of regression line - Part 2:

Sy is the standard deviation of Ozone, and Sx is the standard deviation of Temp.

Formulas for the standard deviations of y and x:

\(Sy = {\sqrt{\sum(y-\bar{y})^2\over(n-1)}}\), \(Sx = {\sqrt{\sum(x-\bar{x})^2\over(n-1)}}\)

Calculation of Sy and Sx:

## [1] "Sy = 33.275969"
## [1] "Sx = 9.529969"
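The sample standard deviations can be computed directly from the sum of squared deviations divided by n - 1, and compared with R's `sd()`. This is a sketch, again assuming `airquality_clean` is `na.omit(airquality)`; the `*_manual` names are illustrative.

```r
# Assumption: airquality_clean is the built-in airquality data without NA rows.
airquality_clean <- na.omit(airquality)
x <- airquality_clean$Temp
y <- airquality_clean$Ozone
n <- length(x)

# Sample standard deviation: squared deviations summed, divided by n - 1, then rooted.
Sy_manual <- sqrt(sum((y - mean(y))^2) / (n - 1))
Sx_manual <- sqrt(sum((x - mean(x))^2) / (n - 1))

# sd() uses the same n - 1 denominator.
Sy <- sd(y)
Sx <- sd(x)

print(sprintf("Sy = %f", Sy_manual))
print(sprintf("Sx = %f", Sx_manual))
```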

Slope (b) of regression line - Part 3:

Calculation of b, with the variables r, Sy and Sx:

\[{b = r{Sy \over Sx}}\]

With r, Sy and Sx plugged in: \[{b = 0.698541{33.275969 \over 9.529969}}\] Calculation of b:

## [1] "b = 2.439110"
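The slope calculation can be sketched in R as below, assuming `airquality_clean` is `na.omit(airquality)`. The cross-check uses the equivalent identity b = cov(x, y) / var(x); the name `b_check` is illustrative.

```r
# Assumption: airquality_clean is the built-in airquality data without NA rows.
airquality_clean <- na.omit(airquality)
x <- airquality_clean$Temp
y <- airquality_clean$Ozone

# Slope from the correlation and the two standard deviations.
r <- cor(x, y)
b <- r * sd(y) / sd(x)

# Equivalent form of the same slope: covariance over the variance of x.
b_check <- cov(x, y) / var(x)

print(sprintf("b = %f", b))
```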

Y Intercept of regression line (a):

Formula for the y-intercept (a): \[{a = \bar{y} - b\bar{x}}\] We have the value of b; now we need the means \(\bar{y}\) and \(\bar{x}\):

## [1] "mean of y = 42.099099"
## [1] "mean of x = 77.792793"

Plugging into the Y-Intercept formula:

\[{a = 42.099099 - 2.439110(77.792793)}\]

## [1] "a = -147.646072"
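The intercept step can be sketched the same way and compared with the intercept that `lm()` reports for Ozone regressed on Temp. This assumes `airquality_clean` is `na.omit(airquality)`; `a_lm` is an illustrative name.

```r
# Assumption: airquality_clean is the built-in airquality data without NA rows.
airquality_clean <- na.omit(airquality)
x <- airquality_clean$Temp
y <- airquality_clean$Ozone

# Slope from the earlier step, then the intercept from the two means.
b <- cor(x, y) * sd(y) / sd(x)
a <- mean(y) - b * mean(x)

# Intercept of the least-squares fit of Ozone on Temp, for comparison.
a_lm <- coef(lm(Ozone ~ Temp, data = airquality_clean))[["(Intercept)"]]

print(sprintf("mean of y = %f", mean(y)))
print(sprintf("mean of x = %f", mean(x)))
print(sprintf("a = %f", a))
```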

Simple Linear Regression:

Back to the simple linear regression formula: \[{y = a + bx}\] We now have values for a and b:

  • a = -147.646072
  • b = 2.439110

Now, given a value of x (Temp), we can use the regression line to predict y (Ozone). Example: x = 65

\[{y = -147.646072 + (2.439110 \times 65)}\]

## [1] "y = 10.896071"
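The same prediction can be obtained from a fitted model with `predict()`. This is a sketch assuming `airquality_clean` is `na.omit(airquality)`; the names `fit`, `x_new`, `y_manual` and `y_predict` are illustrative.

```r
# Assumption: airquality_clean is the built-in airquality data without NA rows.
airquality_clean <- na.omit(airquality)

# Least-squares fit of Ozone on Temp.
fit <- lm(Ozone ~ Temp, data = airquality_clean)
a <- coef(fit)[["(Intercept)"]]
b <- coef(fit)[["Temp"]]

# Prediction by hand from y = a + b * x.
x_new <- 65
y_manual <- a + b * x_new

# Same prediction through predict(); newdata must use the column name Temp.
y_predict <- as.numeric(predict(fit, newdata = data.frame(Temp = x_new)))

print(sprintf("y = %f", y_manual))
```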

Plot without the regression line:

Let's first look at the plot without the regression line to get an idea of where the line will be:
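A scatter plot of this kind can be sketched with ggplot2, the same package the regression plot code uses. The object name `scatter_plot` is illustrative, and `airquality_clean` is assumed to be `na.omit(airquality)`.

```r
library(ggplot2)

# Assumption: airquality_clean is the built-in airquality data without NA rows.
airquality_clean <- na.omit(airquality)

# Points only: Temp on the x-axis, Ozone on the y-axis, no fitted line.
scatter_plot <- ggplot(airquality_clean, aes(x = Temp, y = Ozone)) +
  geom_point()

# Printing the object draws the plot on the active graphics device.
print(scatter_plot)
```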

Linear regression plot with line code:

Code for Linear regression plot:

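# Scatter plot of Ozone against Temp with the fitted least-squares line
library(ggplot2)
line_regg <- ggplot(airquality_clean, aes(x = Temp, y = Ozone)) +
                    geom_point() +
                    # se = FALSE hides the confidence band around the line
                    stat_smooth(method = "lm", se = FALSE,
                                formula = y ~ x)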

Linear regression plot with line:

Simple Linear Regression using plotly

Click on the plot to see each of the points with the linear regression line.

P-value - Ozone and Temperature

## 
## Call:
## lm(formula = Temp ~ Ozone, data = airquality_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.980  -4.775   1.825   4.228  12.425 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 69.37059    1.05151   65.97   <2e-16 ***
## Ozone        0.20006    0.01963   10.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.851 on 109 degrees of freedom
## Multiple R-squared:  0.488,  Adjusted R-squared:  0.4833 
## F-statistic: 103.9 on 1 and 109 DF,  p-value: < 2.2e-16
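The summary above fits Temp ~ Ozone, while the earlier slides model Ozone as a function of Temp. The sketch below fits both directions and pulls out the slope's t-value and p-value to show that, in simple linear regression, the significance test is the same either way. It assumes `airquality_clean` is `na.omit(airquality)`; the object names are illustrative.

```r
# Assumption: airquality_clean is the built-in airquality data without NA rows.
airquality_clean <- na.omit(airquality)

# Model used in the slides (Ozone explained by Temp) and the reversed model.
fit_ozone <- lm(Ozone ~ Temp, data = airquality_clean)
fit_temp <- lm(Temp ~ Ozone, data = airquality_clean)

# Coefficient tables: rows are terms, columns include "t value" and "Pr(>|t|)".
coef_ozone <- summary(fit_ozone)$coefficients
coef_temp <- summary(fit_temp)$coefficients

# t-statistic and p-value of the slope in each direction.
t_ozone <- coef_ozone["Temp", "t value"]
t_temp <- coef_temp["Ozone", "t value"]
p_ozone <- coef_ozone["Temp", "Pr(>|t|)"]

# R-squared is also identical in both directions (it equals r squared).
r_squared <- summary(fit_ozone)$r.squared

summary(fit_ozone)
```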

Conclusion

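The p-value is less than \(2.2 \times 10^{-16}\), far below the usual 0.05 significance level, so there is a statistically significant linear relationship between temperature and Ozone. The summary above fits Temp ~ Ozone, the reverse of the Ozone ~ Temp line built in the earlier slides, but in simple linear regression the slope's t-test gives the same p-value in either direction, so the conclusion is unchanged. The R-squared of 0.488 means that temperature explains about 49% of the variation in Ozone.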