I chose to work with the Iris data set. The predictor variable will be the petal length and the response variable will be the sepal length.

First, we will call the data, look at the headers and attach the data.

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"
attach(iris)

Below is a scatterplot with Petal Length as the predictor and Sepal Length as the response. I then fixed the labels so that the plot is a little easier to read.

plot(Sepal.Length~Petal.Length)

plot(Petal.Length,Sepal.Length,ylab="Sepal Length (cm)",xlab="Petal Length (cm)",main="Iris Sepal and Petal Length")

The data is a little messy but the trend seems to be a positive correlation. The variance doesn’t exactly look equal but it isn’t too bad so we can continue.

We can now create a linear regression model for this data.

mymod<-lm(Sepal.Length~Petal.Length)
mymod
## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length)
## 
## Coefficients:
##  (Intercept)  Petal.Length  
##       4.3066        0.4089

The model shows that the intercept is 4.3066 and the slope is 0.4089. The slope means that as petal length increases by 1 cm, sepal length increases by about 0.4089 cm. The intercept, theoretically, means that when petal length is 0, sepal length is 4.3066.

Now, we can add the linear regression to the scatterplot from above.

plot(Petal.Length,Sepal.Length,ylab="Sepal Length (cm)",xlab="Petal Length (cm)")
abline(4.3066,0.4089)

The line fairly accurately models the relationship.

We should check our assumptions. One of the assumptions we made was that the errors follow a normal distribution. First, I’m going to show how to find one residual (error). I used the 10th row of the data, where petal length is 1.5.

actual<-iris[10,"Sepal.Length"]
actual
## [1] 4.9
predicted<-4.3066+0.4089*1.5
predicted
## [1] 4.91995
myresid<-predicted-actual
myresid
## [1] 0.01995

Now, we can look at the distribution of all of the residuals. We can look at a histogram and a normal q-q plot.

myresids<-mymod$residuals
hist(myresids)

qqnorm(myresids)
qqline(myresids)

Both of the above graphs show that the error distribution is more or less normal. That assumption was correct.

Another assumption is the homoscedasticity. We can look at the residuals against the petal length.

plot(mymod$residuals~Petal.Length)
abline(0,0)

The plot doesn’t show any obvious heteroscedasticity so we can assume our assumption was correct.

Lastly, we can look at the summary of the linear regression.

summary(mymod)
## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.24675 -0.29657 -0.01515  0.27676  1.00269 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.30660    0.07839   54.94   <2e-16 ***
## Petal.Length  0.40892    0.01889   21.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4071 on 148 degrees of freedom
## Multiple R-squared:   0.76,  Adjusted R-squared:  0.7583 
## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

This shows the number of degrees of freedom, the intercepts, residual standard error, etc.

Our plots above show that petal length is a fairly good predictor of the sepal length. The summary of the linear regression model also showed this.