In this document, we will work with the iris dataset. We will use simple linear regression to see whether there is a significant relationship between petal length and petal width. Petal length will be our explanatory variable, and petal width will be our response variable. We will plot the data and interpret the resulting scatterplot. We will also calculate the correlation, fit the regression line, interpret it, use it for prediction, and calculate residuals.
First, we will load our data set and become familiar with it.
data(iris)
attach(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
##
##
##
Now that we’re familiar with our data, we can create a scatterplot. We can also relabel the axes.
plot(Petal.Length, Petal.Width, ylab= "Petal Width", xlab= "Petal Length")
From our scatterplot we can identify an upward trend: as petal length increases, petal width also seems to increase. The points are, in general, tightly packed, though there is slightly more variation at greater petal lengths. It looks like there is a positive linear relationship between petal length and petal width.
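We can also quantify the strength of this apparent linear relationship with the sample correlation coefficient:
# Sample correlation; should be roughly 0.96, the square root of the
# R-squared reported in the model summary at the end of this document
cor(Petal.Length, Petal.Width)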
Now that we’ve created a scatterplot, let’s find the line of best fit.
mod1 <- lm(Petal.Width ~ Petal.Length)
mod1
##
## Call:
## lm(formula = Petal.Width ~ Petal.Length)
##
## Coefficients:
##  (Intercept)  Petal.Length
##      -0.3631        0.4158
The model above gives us the estimates for the intercept (-0.3631) and the slope (0.4158) of our line of best fit. The slope estimate tells us that for every one-centimeter increase in petal length, we would expect to see a 0.4158-centimeter increase in petal width. The intercept can’t be meaningfully interpreted, because it wouldn’t make sense for the petal length to ever be zero. When a flower blooms, the entire petal opens and starts at a length greater than 0; the petal doesn’t start growing at length 0.
Our line can be written as y = -0.3630755 + 0.4157554x, where x is our petal length input used to estimate the response variable value, petal width.
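These full-precision coefficients come straight from the fitted model:
coef(mod1)
## (Intercept) Petal.Length 
##  -0.3630755    0.4157554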
Now, let’s plot our scatter plot and line of best fit together.
plot(Petal.Length, Petal.Width, ylab= "Petal Width", xlab= "Petal Length")
abline(-0.3631, 0.4158)
Beautiful! This line of best fit seems to fit our data well.
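As an aside, abline() also accepts the fitted model object itself, which would draw the same line from the stored full-precision coefficients instead of the rounded values we typed in:
# equivalent to abline(-0.3630755, 0.4157554)
abline(mod1)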
Next, we can predict the petal width for any given petal length. Let’s find the petal length for the iris in the 100th row. Petal length is the third column in the iris data set.
iris[100, 3]
## [1] 4.1
The petal length for the 100th entry is 4.1 cm.
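Equivalently, we could index the column by name instead of position:
# same value as iris[100, 3]
iris$Petal.Length[100]
## [1] 4.1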
With this value for our explanatory variable, we can estimate the value of our response variable. In other words, we can use the petal length for the iris in the 100th row to estimate that iris’ petal width.
iris.predicted <- coef(mod1) %*% c(1, 4.1)
iris.predicted
##          [,1]
## [1,] 1.341522
We get an estimated petal width of 1.341522 cm corresponding to a petal length of 4.1 cm.
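Equivalently, R’s predict() function handles this arithmetic for us; the new data must use the predictor’s name from the model formula:
# should reproduce the estimate of 1.341522 obtained above
predict(mod1, newdata = data.frame(Petal.Length = 4.1))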
Let’s calculate the residual for the 100th entry. The residual equals the observed value minus the predicted value. Thus, we must first identify the observed value of the petal width, and then take the difference between the observed and predicted values. The observed petal width for the 100th entry is:
iris.actual <- iris[100, 4]
iris.actual
## [1] 1.3
resid1 <- iris.actual - iris.predicted
resid1
## [,1]
## [1,] -0.04152169
The actual/observed petal width is 1.3 cm, and we get a residual of -0.04152169.
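As a sanity check, the same residual can be read directly from the fitted model:
# should match the -0.04152169 we computed by hand
residuals(mod1)[100]
##         100 
## -0.04152169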
Before we finish, let’s check whether our regression assumptions hold. One of the assumptions is that the errors have a normal distribution. Let’s calculate all of the residuals for the data.
allresids <- mod1$residuals
allresids
##             1             2             3             4             5
## -1.898206e-02 -1.898206e-02  2.259348e-02 -6.055760e-02 -1.898206e-02
##             6             7             8             9            10
##  5.629131e-02  8.101794e-02 -6.055760e-02 -1.898206e-02 -1.605576e-01
##            11            12            13            14            15
## -6.055760e-02 -1.021331e-01 -1.189821e-01  5.744563e-03  6.416902e-02
##            16            17            18            19            20
##  1.394424e-01  2.225935e-01  8.101794e-02 -4.370869e-02  3.944240e-02
##            21            22            23            24            25
## -1.437087e-01  1.394424e-01  1.473201e-01  1.562913e-01 -2.268598e-01
##            26            27            28            29            30
## -1.021331e-01  9.786686e-02 -6.055760e-02 -1.898206e-02 -1.021331e-01
##            31            32            33            34            35
## -1.021331e-01  1.394424e-01 -1.605576e-01 -1.898206e-02 -6.055760e-02
##            36            37            38            39            40
##  6.416902e-02  2.259348e-02 -1.189821e-01  2.259348e-02 -6.055760e-02
##            41            42            43            44            45
##  1.225935e-01  1.225935e-01  2.259348e-02  2.978669e-01 -2.685977e-02
##            46            47            48            49            50
##  8.101794e-02 -1.021331e-01 -1.898206e-02 -6.055760e-02 -1.898206e-02
##            51            52            53            54            55
## -1.909749e-01 -7.823852e-03 -1.741260e-01  5.385591e-05 -4.939939e-02
##            56            57            58            59            60
## -2.078239e-01  9.025064e-03 -8.917353e-03 -2.493994e-01  1.416294e-01
##            61            62            63            64            65
## -9.206844e-02  1.169028e-01 -2.999461e-01 -1.909749e-01  1.663560e-01
##            66            67            68            69            70
## -6.624831e-02 -7.823852e-03 -3.415217e-01 -7.823852e-03 -1.583706e-01
##            71            72            73            74            75
##  1.674495e-01  5.385591e-05 -1.741260e-01 -3.909749e-01 -1.246728e-01
##            76            77            78            79            80
## -6.624831e-02 -2.325505e-01 -1.570156e-02 -7.823852e-03 -9.206844e-02
##            81            82            83            84            85
## -1.167951e-01 -1.752195e-01 -5.837060e-02 -1.572771e-01 -7.823852e-03
##            86            87            88            89            90
##  9.217615e-02 -9.097494e-02 -1.662483e-01 -4.152169e-02  5.385591e-05
##            91            92            93            94            95
## -2.662483e-01 -1.493994e-01 -9.994614e-02 -8.917353e-03 -8.309723e-02
##            96            97            98            99           100
## -1.830972e-01 -8.309723e-02 -1.246728e-01  2.158093e-01 -4.152169e-02
##           101           102           103           104           105
##  3.685430e-01  1.427229e-01  1.011856e-02 -1.651548e-01  1.516941e-01
##           106           107           108           109           110
## -2.809102e-01  1.921761e-01 -4.561836e-01 -2.483059e-01  3.269675e-01
##           111           112           113           114           115
##  2.427229e-01  5.957181e-02  1.764207e-01  2.842984e-01  6.427229e-01
##           116           117           118           119           120
##  4.595718e-01 -1.235793e-01 -2.224858e-01 -2.056369e-01 -2.157016e-01
##           121           122           123           124           125
##  2.932696e-01  3.258740e-01 -4.224858e-01  1.258740e-01  9.326965e-02
##           126           127           128           129           130
## -3.314570e-01  1.674495e-01  1.258740e-01  1.348452e-01 -4.483059e-01
##           131           132           133           134           135
## -2.730325e-01 -2.977591e-01  2.348452e-01 -2.572771e-01 -5.651548e-01
##           136           137           138           139           140
##  1.269675e-01  4.348452e-01 -1.235793e-01  1.674495e-01  2.179963e-01
##           141           142           143           144           145
##  4.348452e-01  5.427229e-01  1.427229e-01  2.101186e-01  4.932696e-01
##           146           147           148           149           150
##  5.011474e-01  1.842984e-01  2.011474e-01  4.179963e-01  4.272290e-02
Let’s see whether these are normally distributed. First, we can make a histogram to get a quick look.
hist(allresids)
However, a more effective way to check this assumption is a normal Q-Q plot, which compares the data against a straight line. Let’s plot the quantiles of the residuals against the quantiles of a normal distribution. If the residuals are normally distributed, the points will follow a straight line.
qqnorm(allresids)
qqline(allresids)
We can see that the residuals closely follow the straight line, so it seems that the errors have a normal distribution.
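If we wanted a formal test to supplement the visual check, a Shapiro-Wilk test of the residuals is one option; a large p-value would be consistent with normally distributed errors:
shapiro.test(allresids)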
Another assumption is homoscedasticity: the variance of the errors is assumed to be equal across all predictor values. Instead of just looking at the original data set, we can check this assumption by plotting the residuals against our explanatory variable, petal length.
plot(mod1$residuals ~ Petal.Length)
abline(h = 0)
From the plot, there doesn’t seem to be clear evidence of heteroscedasticity, so it seems that the second assumption holds as well.
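For future reference, R can also produce a standard set of diagnostic plots, including versions of the residuals-vs-fitted and normal Q-Q plots we built by hand, directly from the model object:
# draws four standard diagnostic plots for the fitted model
plot(mod1)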
Lastly, we can read the residual standard error from the model summary and use it to find the mean square error.
summary(mod1)
##
## Call:
## lm(formula = Petal.Width ~ Petal.Length)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56515 -0.12358 -0.01898  0.13288  0.64272
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.363076   0.039762  -9.131  4.7e-16 ***
## Petal.Length  0.415755   0.009582  43.387  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2065 on 148 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9266 
## F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16
We can see that the residual standard error is 0.2065; squaring it gives a mean square error of about 0.0426.
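If we’d rather have R do the arithmetic, the residual standard error is stored in the summary object:
# sigma is the residual standard error; squaring it gives the
# mean square error, about 0.2065^2 = 0.0426
summary(mod1)$sigma^2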