Simple Linear Regression
Given a set of data \[
(x_1,y_1), (x_2,y_2),\ldots,(x_n,y_n)
\] a regression analysis is a means of determining if there is a correlation between the \(x\)-data and the \(y\)-data.
A scatter plot is a means of tentatively determining whether or not the data are correlated, simply by observing the shape of the plot.
Example 1 - Strong Linear Correlation
Plot the following data points on a scatter plot:
\[
(1.2, 10.3), (1.5, 10.45), (1.6, 10.79), (2.1, 10.8), (2.1, 11.0), (2.4, 11.2), (2.5, 11.5), (2.9, 12.2), (3.1, 12.4), (3.8, 13.0), (4.1, 13.4)
\]
For this data set we create two data vectors, one for the \(x\)-variables and one for the \(y\)-variables, and plot the result, with the \(x\)-data plotted along the horizontal axis, and the \(y\)-data plotted along the vertical axis.
X1<-c(1.2,1.5,1.6,2.1,2.1,2.4,2.5,2.9,3.1,3.8,4.1)
Y1<-c(10.3,10.45,10.79,10.8,11.0,11.2,11.5,12.2,12.4,13.0,13.4)
plot(X1,Y1,xlab="x",ylab="y",col="red", main="Linearly Correlated Data",pch=18)
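The visual impression of a strong linear trend can also be put into a number. As a minimal sketch, R's built-in cor() function computes the sample (Pearson) correlation coefficient of the two data vectors. The name r1 is introduced here purely for illustration and is not part of the original notes.

```r
# Data vectors as defined in the chunk above
X1 <- c(1.2, 1.5, 1.6, 2.1, 2.1, 2.4, 2.5, 2.9, 3.1, 3.8, 4.1)
Y1 <- c(10.3, 10.45, 10.79, 10.8, 11.0, 11.2, 11.5, 12.2, 12.4, 13.0, 13.4)

# Sample correlation coefficient: values near +1 or -1 suggest a strong
# linear relationship, values near 0 suggest a weak one
r1 <- cor(X1, Y1)
r1
```

For this data set the coefficient comes out close to 1, which is consistent with the strong positive linear trend seen in the scatter plot.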

Example 2 - Non-correlated Data
In this example we will plot the following data set \[
(-1.4, 2.4), (0.4, 1.3), (1.4, 6.5), (2.1, 3.4), (4.5, 12.3), (5.2, 2.1), (5.8, 3.1), (6.5, 13.4), (6.9, 6.7)
\]
X2<-c(-1.4,0.4,1.4,2.1,4.5,5.2,5.8,6.5,6.9)
Y2<-c(2.4,1.3,6.5,3.4,12.3,2.1,3.1,13.4,6.7)
plot(X2,Y2,xlab="x",ylab="y",col="red", main="Non-correlated Data",pch=19)

It is clear from the scatter plot that there is no immediately obvious trend relating the \(x\) and \(y\) variables in this plot, in which case we say the \(x\) and \(y\) variables are not correlated.
While the scatter plot doesn’t rule out a correlation between the \(x\) and \(y\) variables, it does give an initial indication as to whether or not a relationship exists.
The Line of Best Fit
Given the data set above, there are values of the estimated parameters \(\hat{\beta}_0\) and \(\hat{\beta}_1\) which yield a line that best fits the points of the data plot.
One way to find a line which passes through or close to the data points is to obtain the line's slope \(b\) from a first and a last data point. Using the points \((-1.4, 2.4)\) and \((6.9, 6.7)\), the first and last points listed in Example 2, gives \[
b=\frac{6.7-2.4}{6.9+1.4}=\frac{4.3}{8.3}=0.518.
\]
Taking the means of the \(x\) and \(y\) data from Example 1, we have
mean(X1)
[1] 2.481818
mean(Y1)
[1] 11.54909
This now allows us to find an estimate \(a\) of the parameter \(\beta_0\). Requiring the line to pass through the point of means \((\bar{x},\bar{y})\) gives \[
\bar{y}=a+b\bar{x}\Rightarrow 11.54909=a+0.518\times 2.481818 \Rightarrow a=10.2635.
\]
Our first approximation to a linear model relating the \(x\) and \(y\) data is now given by \[y=10.2635+0.518x.\]
To plot this line through the data points, we first compute two points on it: \[x=0\Rightarrow y=10.2635\qquad x=5\Rightarrow y=12.8535\]
X1<-c(1.2,1.5,1.6,2.1,2.1,2.4,2.5,2.9,3.1,3.8,4.1)
Y1<-c(10.3,10.45,10.79,10.8,11.0,11.2,11.5,12.2,12.4,13.0,13.4)
plot(X1,Y1,xlab="x",ylab="y",col="red",main="Linearly Correlated Data",pch=18)
segments(0,10.2635,x1=5,y1=12.8535,col ="blue",lwd=2)

- Clearly, this line doesn’t fit the data very well, and there are better approximations available to represent the relationship between the \(x\) and \(y\) data.
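The arithmetic behind this first approximation can be reproduced in R. The sketch below is illustrative only: the names b_rough and a_rough are not part of the original notes, and the slope is rounded to three decimal places to match the value used in the text.

```r
# Data vectors from Example 1
X1 <- c(1.2, 1.5, 1.6, 2.1, 2.1, 2.4, 2.5, 2.9, 3.1, 3.8, 4.1)
Y1 <- c(10.3, 10.45, 10.79, 10.8, 11.0, 11.2, 11.5, 12.2, 12.4, 13.0, 13.4)

# Slope from the two points used in the text: (-1.4, 2.4) and (6.9, 6.7),
# rounded to three decimals as in the worked calculation
b_rough <- round((6.7 - 2.4) / (6.9 - (-1.4)), 3)

# Intercept chosen so that the line passes through the point of means
a_rough <- mean(Y1) - b_rough * mean(X1)

# Estimated intercept and slope
c(a = a_rough, b = b_rough)

# Heights of the line at x = 0 and x = 5, the endpoints passed to segments()
endpoints <- a_rough + b_rough * c(0, 5)
endpoints
```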
The Method of Least Squares
X1<-c(1.2,1.5,1.6,2.1,2.1,2.4,2.5,2.9,3.1,3.8,4.1)
Y1<-c(10.3,10.45,10.79,10.8,11.0,11.2,11.5,12.2,12.4,13.0,13.4)
plot(X1,Y1,xlab="x",ylab="y",col="red",main="Linearly Correlated Data",pch=18)
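# Vertical distances from each data point to the first-approximation line
segments(X1,Y1,x1=X1,y1=10.2635+0.518*X1,col ="darkgreen",lwd=2)
# The first-approximation line y = 10.2635 + 0.518x, drawn from x = 0 to x = 5
segments(0,10.2635,x1=5,y1=12.8535,col ="blue",lwd=2)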
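The green segments in this plot are the vertical distances between each data point and the blue line. The method of least squares chooses the intercept \(a\) and slope \(b\) that make the sum of the squares of these distances as small as possible, that is, it minimises \[
S(a,b)=\sum_{i=1}^{n}\left(y_i-a-bx_i\right)^2.
\] The standard closed-form solution of this minimisation is \[
b=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2},\qquad a=\bar{y}-b\bar{x}.
\] As a rough sketch, these formulas can be evaluated directly in R. The names Sxy, Sxx, a_hat and b_hat are chosen here for illustration and do not appear in the original notes.

```r
# Data vectors from Example 1
X1 <- c(1.2, 1.5, 1.6, 2.1, 2.1, 2.4, 2.5, 2.9, 3.1, 3.8, 4.1)
Y1 <- c(10.3, 10.45, 10.79, 10.8, 11.0, 11.2, 11.5, 12.2, 12.4, 13.0, 13.4)

# Sum of cross-deviations and sum of squared x-deviations about the means
Sxy <- sum((X1 - mean(X1)) * (Y1 - mean(Y1)))
Sxx <- sum((X1 - mean(X1))^2)

# Least squares slope and intercept from the closed-form formulas
b_hat <- Sxy / Sxx
a_hat <- mean(Y1) - b_hat * mean(X1)

round(c(a = a_hat, b = b_hat), 3)
```

In practice the lm() function carries out this calculation, so the hand computation is only a check on what the fitted coefficients mean.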

Model1<-lm(Y1 ~ X1)
Model1
Call:
lm(formula = Y1 ~ X1)
Coefficients:
(Intercept) X1
8.784 1.114
- Hence the values of the parameters are given by \[
a=8.784\qquad b=1.114
\]
- Lastly, we can plot the simple linear regression model \[y=8.784+1.114x\] onto our scatter plot as follows:
plot(X1,Y1,col="red",pch=18)
abline(Model1,col="darkgreen")
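One way to see that the least squares line is an improvement on the first approximation is to compare the sum of squared vertical distances for the two lines. The following sketch does this; the names sse_ls and sse_rough are illustrative and not part of the original notes.

```r
# Data vectors from Example 1 and the fitted least squares model
X1 <- c(1.2, 1.5, 1.6, 2.1, 2.1, 2.4, 2.5, 2.9, 3.1, 3.8, 4.1)
Y1 <- c(10.3, 10.45, 10.79, 10.8, 11.0, 11.2, 11.5, 12.2, 12.4, 13.0, 13.4)
Model1 <- lm(Y1 ~ X1)

# Sum of squared residuals for the least squares line
sse_ls <- sum(resid(Model1)^2)

# Sum of squared residuals for the first approximation y = 10.2635 + 0.518x
sse_rough <- sum((Y1 - (10.2635 + 0.518 * X1))^2)

c(least_squares = sse_ls, first_approximation = sse_rough)
```

By construction, no other straight line can give a smaller sum of squares than the least squares line, so the first value is the smaller of the two.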

Example 3 - From Lectures
We were given the following data relating weekly natural gas consumption in a U.S. city to the average hourly temperature in that city for the same week.
| Week | Average Hourly Temperature | Weekly Natural Gas Consumption |
|---|---|---|
| 1 | 28.0 | 12.4 |
| 2 | 28.0 | 11.7 |
| 3 | 32.5 | 12.4 |
| 4 | 39.0 | 10.8 |
| 5 | 45.9 | 9.4 |
| 6 | 57.8 | 9.5 |
| 7 | 58.1 | 8.0 |
| 8 | 62.5 | 7.5 |
Using this data answer the following:
- Create a data file to represent this data.
- Import this data into R.
- Create two data vectors corresponding to temperature and consumption.
- Create a scatter plot to represent this data.
- Use the lm() function to create a linear model for these data vectors.
- Using this model estimate the parameters \(\hat{\beta}_0\) and \(\hat{\beta}_1\) for the simple linear model.
- Plot the line of best fit along with the scatter plot for the data set.
Solution:
The data file for this data set is available on Moodle and is called FuelConsumption.csv.
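If the Moodle file is not to hand, a file with the same name can be created from the table. This is only a sketch: the column names Temp and Fuel match those used in the solution, while the Week column and the object name FuelTable are assumptions made here, and the exact layout of the file on Moodle may differ.

```r
# Build the table from the lecture data
# (Week and FuelTable are illustrative names, not taken from the Moodle file)
FuelTable <- data.frame(
  Week = 1:8,
  Temp = c(28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5),
  Fuel = c(12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5)
)

# Write it to the current working directory without row names,
# so that read.csv("FuelConsumption.csv") can read it back
write.csv(FuelTable, "FuelConsumption.csv", row.names = FALSE)
```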
FuelData<-read.csv("FuelConsumption.csv")
FuelData
Temp<-FuelData$Temp
Temp
[1] 28.0 28.0 32.5 39.0 45.9 57.8 58.1 62.5
Fuel<-FuelData$Fuel
Fuel
[1] 12.4 11.7 12.4 10.8 9.4 9.5 8.0 7.5
plot(Temp,Fuel,col="red",pch=18)

FuelModel<-lm(Fuel ~ Temp)
FuelModel
Call:
lm(formula = Fuel ~ Temp)
Coefficients:
(Intercept) Temp
15.8379 -0.1279
The parameters of the Fuel Model are \[a=15.8379\quad b=-0.1279.\]
plot(Temp,Fuel,col="red",pch=18)
abline(FuelModel,col="darkgreen")
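The negative slope indicates that each one-unit increase in average hourly temperature is associated with a decrease of about 0.128 units in weekly gas consumption. As a brief sketch of how the fitted model might be used, the code below extracts the coefficients with coef() and estimates the consumption at a temperature of 40. That temperature is a hypothetical value chosen from inside the observed range, and the name pred40 is illustrative.

```r
# Data vectors from the lecture example and the fitted model
Temp <- c(28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5)
Fuel <- c(12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5)
FuelModel <- lm(Fuel ~ Temp)

# Estimated intercept and slope as a named numeric vector
coef(FuelModel)

# Estimated weekly consumption at an average hourly temperature of 40
# (a hypothetical value chosen for illustration)
pred40 <- predict(FuelModel, newdata = data.frame(Temp = 40))
pred40
```

Predictions of this kind are most trustworthy for temperatures within the range of the observed data, here roughly 28 to 62.5.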

