First we need to load the data set. For this learning log I am using the built-in iris data to look at flower measurements.

data(iris)

We then want to check the first few rows of data just to get an idea of what it looks like and what we’re working with.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

We can also pull up the help page for the iris data set to see what each column measures.

?iris

Since we will be working with the iris data, we will attach it so that its columns can be referenced directly by name.

attach(iris)

Before trying to model anything, it is important to plot the data to get a preliminary look at it and observe any trends or relationships that might exist. In this case I will be using petal length as the explanatory variable, and petal width as the response variable.

plot(Petal.Length, Petal.Width, ylab = "Petal Width (cm)",
     xlab = "Petal Length (cm)")

Looking at the scatterplot, there appears to be a positive trend between petal length and petal width. The points are also fairly tightly clustered, and it looks like there could be a linear relationship between the length and the width.
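
To put a number on that visual impression, we can also compute the correlation between the two variables; this is just an optional check, and with the data attached the column names can be used directly.

cor(Petal.Length, Petal.Width)

Since this is a simple linear regression, squaring this correlation gives the R-squared we will see later in the model summary (0.9271), so the correlation itself is about 0.96, matching the strong positive trend in the plot.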

From here, we will create our linear regression model using the {lm} command. Because petal width is our response (y) variable, it goes on the left side of the formula. I will name the model iris.regmod.

iris.regmod <- lm(Petal.Width ~ Petal.Length)
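
As a quick aside, if we had not attached the data, the same model could be fit by pointing lm() at the data frame directly through its data argument; the sketch below is equivalent to the call above.

# Equivalent fit without attach(), using an explicit data argument
lm(Petal.Width ~ Petal.Length, data = iris)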

With our linear model now created, we can call on iris.regmod to get the coefficients of our linear regression model.

iris.regmod
## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length)
## 
## Coefficients:
##  (Intercept)  Petal.Length  
##      -0.3631        0.4158
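
If we only want the numbers themselves, coef() extracts them from the model as a named numeric vector; this is just a small convenience sketch.

# Extract the intercept and slope as a named vector
coef(iris.regmod)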

In this case our y-intercept is -0.3631 and our slope is 0.4158. That means the equation for our linear regression model is y = -0.3631 + 0.4158(Petal Length).

To put this into words, we would expect petal width to increase by 0.4158 centimeters for every one-centimeter increase in petal length. The intercept does not have a realistic interpretation here: taken literally, it says a petal of length zero would have a width of -0.3631 centimeters, and a negative width is obviously impossible (a petal length of zero is also well outside the range of the data), so we cannot appropriately interpret the intercept.
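
To see the fitted equation in action, we can ask the model for a prediction at a hypothetical petal length; the 4 cm value below is chosen purely for illustration and is not a measurement from the data.

# Predicted petal width (cm) for a hypothetical 4 cm petal length
predict(iris.regmod, newdata = data.frame(Petal.Length = 4))

This should come out to roughly -0.3631 + 0.4158 * 4, or about 1.30 centimeters, the same value we would get by plugging 4 into the equation by hand.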

In order to get a better idea of how our linear regression matches with our data, we will again plot the data, this time with our linear regression as well.

plot(Petal.Length, Petal.Width, ylab = "Petal Width (cm)",
     xlab = "Petal Length (cm)")
abline(-0.3631, 0.4158)
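
As a small aside, abline() can also take the fitted model object directly, which draws the same line using the exact (unrounded) coefficients.

# Equivalent to the abline() call above, but with full-precision coefficients
abline(iris.regmod)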

The line looks like it models the relationship between petal length and petal width quite accurately.

It is also important to check the assumptions that we made when creating this model. One assumption is that the errors follow a normal distribution, which we can check by looking at the residuals.

iris.resid <- iris.regmod$residuals
hist(iris.resid)

qqnorm(iris.resid)
qqline(iris.resid)

The histogram shows that the residuals have approximately the shape of a normal curve, and this is supported by how closely the points follow the straight line in the normal Q-Q plot.
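
If we want a numeric check to go along with these pictures, one optional supplement is the Shapiro-Wilk test applied to the residuals; with 150 observations the plots above are usually the more informative guide, so this is just an extra sketch.

# Optional formal check of the normality assumption
shapiro.test(iris.resid)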

We also need to check for homoscedasticity, meaning the errors have roughly constant variance. We can do this with the original scatterplot from above, or by plotting the residuals against petal length; either one lets us see whether the spread stays roughly the same across petal lengths.

plot(Petal.Length, Petal.Width, ylab = "Petal Width (cm)",
     xlab = "Petal Length (cm)")

plot(iris.resid ~ Petal.Length, ylab = "Residuals", xlab = "Petal Length (cm)")
abline(h = 0)   # horizontal reference line at zero

Neither of these plots shows obvious signs of heteroscedasticity, so the homoscedasticity assumption appears reasonable.
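
R also has built-in diagnostic plots for lm objects; as a quick sketch, the first of them plots the residuals against the fitted values, which tells essentially the same story as the residuals-versus-length plot above.

# Built-in residuals-vs-fitted diagnostic plot
plot(iris.regmod, which = 1)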

Finally, we will conclude with a summary of our linear regression.

summary(iris.regmod)
## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56515 -0.12358 -0.01898  0.13288  0.64272 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.363076   0.039762  -9.131  4.7e-16 ***
## Petal.Length  0.415755   0.009582  43.387  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2065 on 148 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9266 
## F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16

A few things to notice here: the summary still shows our intercept (-0.3631) and slope (0.4158). It also tells us the residual standard error (the square root of the mean squared error) is 0.2065 on 148 (n - 2) degrees of freedom, since we had n = 150 data points.
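
As a quick sanity check on that last point, we can recompute the residual standard error by hand as the square root of the residual sum of squares divided by the residual degrees of freedom; it should match the 0.2065 reported above.

# Residual standard error = sqrt( sum(residuals^2) / (n - 2) )
sqrt(sum(iris.resid^2) / iris.regmod$df.residual)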

Through the plots above, we have found that petal length is a strong predictor of petal width, and that longer petals go along with wider petals. Graphing the residuals supported the assumptions we made at the start of the analysis. We can follow similar steps for future linear regression analyses on other data sets and variables.