Regression

For this document, we will be working with the iris dataset. We will use regression to see if there is a significant relationship between petal length and petal width. Petal length will be our explanatory variable, and petal width will be our response variable. We will plot the data and interpret the resulting scatterplot. We will also calculate correlation, fit the regression line, interpret it, use it for prediction, and calculate residuals.

First, we will call our data set and become familiar with it.

data(iris)
attach(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Now that we’re familair with our data, we can create a scatter plot. We can also rename the axis lables.

plot(Petal.Length, Petal.Width, ylab= "Petal Width", xlab= "Petal Length")

From our scatter plot we can identify an upward trend; as the petal length increase, the petal width also seems to increase. The points, in general, are tightly packed; there is slightly more variation for greter petal lengths though. It looks like ther is a positive linear relationship between petal length and petal width.

Now that we’ve create a scatterplot, let’s create the line of best fit.

mod1 <- lm(Petal.Width ~ Petal.Length)
mod1
## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length)
## 
## Coefficients:
##  (Intercept)  Petal.Length  
##      -0.3631        0.4158

Our above model gives us the estimates for the intercept (-0.3631) and slope of our line of best fit (0.4158). The estimate for the slope tells us that for every centimeter increase in petal length, we would expect to see a 0.4158 centimeter increase in petal width. The intercept can’t be interpreted because it wouldn’t make sense for the petal length to ever be zero.When a flower blooms the entire petal opens and starts at a value higher than 0; the petal doesn’t start growing at length 0.

Our line could be written as: y= -0.3630755 + 0.4157554x where x is out petal length input to estimate the response variable value for petal width.

Now, let’s plot our scatter plot and line of best fit together.

plot(Petal.Length, Petal.Width, ylab= "Petal Width", xlab= "Petal Length")
abline(-0.3631, 0.4158)

Beautiful! This linear line of best fit seems to fit our data well.

Next, we can predict the petal width for any given petal length. Let’s find the petal length for the iris in the 100th row. Petal length is the third column in the iris data set.

iris[100, 3]
## [1] 4.1

The petal lenght for the 100th entry is 4.1cm.

With this value for our explanatory variable, we can estimate the value of our response variable. In other words, we can use the petal length for the iris in the 100th row to estimate that iris’ petal width.

iris.predicted <- coef(mod1)%*%c(1, 4.1)
iris.predicted
##          [,1]
## [1,] 1.341522

We get an estimated petal width of 1.341522 to correspond to a petal length of 4.1.

Let’s calculate the residual for the 100th entry. The residual equals the observed value, minus the predicted value. Thus, we must first identify the actual value of the petal width, and then we need to take the difference between the observed and predicted values. The actual petal length for the 100th entry is:

iris.actual <- iris[100, 4]
iris.actual
## [1] 1.3
resid1 <- iris.actual - iris.predicted
resid1
##             [,1]
## [1,] -0.04152169

The actual/observed petal width is 1.3cm, and we get our resideual to equal -0.04152169.

Before we finish, let’s check to see if our regression assumptions hold. One of the assumptions is that the errors will have a normal distribution. Let’s calculate all of the residulas for the data.

allresids <- mod1$residuals
allresids
##             1             2             3             4             5 
## -1.898206e-02 -1.898206e-02  2.259348e-02 -6.055760e-02 -1.898206e-02 
##             6             7             8             9            10 
##  5.629131e-02  8.101794e-02 -6.055760e-02 -1.898206e-02 -1.605576e-01 
##            11            12            13            14            15 
## -6.055760e-02 -1.021331e-01 -1.189821e-01  5.744563e-03  6.416902e-02 
##            16            17            18            19            20 
##  1.394424e-01  2.225935e-01  8.101794e-02 -4.370869e-02  3.944240e-02 
##            21            22            23            24            25 
## -1.437087e-01  1.394424e-01  1.473201e-01  1.562913e-01 -2.268598e-01 
##            26            27            28            29            30 
## -1.021331e-01  9.786686e-02 -6.055760e-02 -1.898206e-02 -1.021331e-01 
##            31            32            33            34            35 
## -1.021331e-01  1.394424e-01 -1.605576e-01 -1.898206e-02 -6.055760e-02 
##            36            37            38            39            40 
##  6.416902e-02  2.259348e-02 -1.189821e-01  2.259348e-02 -6.055760e-02 
##            41            42            43            44            45 
##  1.225935e-01  1.225935e-01  2.259348e-02  2.978669e-01 -2.685977e-02 
##            46            47            48            49            50 
##  8.101794e-02 -1.021331e-01 -1.898206e-02 -6.055760e-02 -1.898206e-02 
##            51            52            53            54            55 
## -1.909749e-01 -7.823852e-03 -1.741260e-01  5.385591e-05 -4.939939e-02 
##            56            57            58            59            60 
## -2.078239e-01  9.025064e-03 -8.917353e-03 -2.493994e-01  1.416294e-01 
##            61            62            63            64            65 
## -9.206844e-02  1.169028e-01 -2.999461e-01 -1.909749e-01  1.663560e-01 
##            66            67            68            69            70 
## -6.624831e-02 -7.823852e-03 -3.415217e-01 -7.823852e-03 -1.583706e-01 
##            71            72            73            74            75 
##  1.674495e-01  5.385591e-05 -1.741260e-01 -3.909749e-01 -1.246728e-01 
##            76            77            78            79            80 
## -6.624831e-02 -2.325505e-01 -1.570156e-02 -7.823852e-03 -9.206844e-02 
##            81            82            83            84            85 
## -1.167951e-01 -1.752195e-01 -5.837060e-02 -1.572771e-01 -7.823852e-03 
##            86            87            88            89            90 
##  9.217615e-02 -9.097494e-02 -1.662483e-01 -4.152169e-02  5.385591e-05 
##            91            92            93            94            95 
## -2.662483e-01 -1.493994e-01 -9.994614e-02 -8.917353e-03 -8.309723e-02 
##            96            97            98            99           100 
## -1.830972e-01 -8.309723e-02 -1.246728e-01  2.158093e-01 -4.152169e-02 
##           101           102           103           104           105 
##  3.685430e-01  1.427229e-01  1.011856e-02 -1.651548e-01  1.516941e-01 
##           106           107           108           109           110 
## -2.809102e-01  1.921761e-01 -4.561836e-01 -2.483059e-01  3.269675e-01 
##           111           112           113           114           115 
##  2.427229e-01  5.957181e-02  1.764207e-01  2.842984e-01  6.427229e-01 
##           116           117           118           119           120 
##  4.595718e-01 -1.235793e-01 -2.224858e-01 -2.056369e-01 -2.157016e-01 
##           121           122           123           124           125 
##  2.932696e-01  3.258740e-01 -4.224858e-01  1.258740e-01  9.326965e-02 
##           126           127           128           129           130 
## -3.314570e-01  1.674495e-01  1.258740e-01  1.348452e-01 -4.483059e-01 
##           131           132           133           134           135 
## -2.730325e-01 -2.977591e-01  2.348452e-01 -2.572771e-01 -5.651548e-01 
##           136           137           138           139           140 
##  1.269675e-01  4.348452e-01 -1.235793e-01  1.674495e-01  2.179963e-01 
##           141           142           143           144           145 
##  4.348452e-01  5.427229e-01  1.427229e-01  2.101186e-01  4.932696e-01 
##           146           147           148           149           150 
##  5.011474e-01  1.842984e-01  2.011474e-01  4.179963e-01  4.272290e-02

Let’s see if these are normally distributed. First we can make a histogram to check it out.

hist(allresids)

However, a more effective way to check this assumption is to compare data against a straight line. Let’s plot the quantiles of the residuals against the quantiles of a normal distribution. If the residuals are normally distributed, they’ll follow a straight line.

qqnorm(allresids)
qqline(allresids)

We can see the the residuals closely follow the straight line, so it seems that the errors have a normal distribution.

Another assumption is homoscedasticity, which means that we assume that the variance of the residuals will be equal for all predictor values. Instead of just looking at the original data set, we can check this assumption by plotting the residuals against our explanatory variable, which is petal length.

plot(mod1$residuals ~ Petal.Length)
abline(0,0)

From the plot, there doesn’t seem to be clear evidence or heteroscedasticity, so it seems that the second assumption holds.

Lastly, we can find the mean square error from the model summary.

summary(mod1)
## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56515 -0.12358 -0.01898  0.13288  0.64272 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.363076   0.039762  -9.131  4.7e-16 ***
## Petal.Length  0.415755   0.009582  43.387  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2065 on 148 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9266 
## F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16

We can see that the mean square error is 0.2065.