Simple Linear Regression

Many of you are probably very familiar with simple linear regression (SLR), but here is a (very) brief overview before we get into the code. SLR is a test where we test how changing a continuous predictor variable impacts a continous response variable. As the name suggests, SLR only tests for linear relationships. In short, both X and Y are continuous and the relationship between the two needs to be linear.

In R (and in the statistics behind the code) linear regression is a very close relative of ANOVA, so check out the ANOVA tutorial too.

Getting Started

As always, you’ll need to set your working directory.Use setwd() and see the Getting Started with R tutorial for full details on this if you don’t remember right now.

We also need to read in the data. For this tutorial, we will use the freely available iris dataset. This dataframe can be directly called. For your own data, you can read it into the program using read.csv(). Please see the Getting Started with R tutorial.

Let’s take a look at this data.

View(iris)

The first 4 columns are numeric data showing petal and sepal length and width. The 5th column is a categorical variable - species of iris measured.

For this analysis, we will test if sepal width impacts the length of the sepal using SLR.

Checking Assumptions

There are ~5 assumptions to test for SLR.

  1. The observations were randomly sampled.

  2. The observations are independent from one another.

  3. The relationship of the two variables are linearly related (no curve, just a nice straight line).

  4. Homoscedasticity: the spread of data is the same throughout. Residual values have same variance across range of data.

    See this link for a better explanation and a nice graph of heteroscedasticity.

  5. The errors of the model are distributed normally.

Random and Independent

You have determine if your sampling was independent and random. The other assumptions can be tested with the code below.

Linearity

To test linearity use the code below.

plot(iris$Sepal.Width,iris$Sepal.Length, main="", xlab="Sepal Width (mm)",ylab="Sepal Length (mm)") 

            #Note: list the predictor variable first, then the response variable.
            #2nd Note: the structure of iris$Sepal.Width is dataframe$variable

Looking at the scatterplot we made above, we see that the relationship is not super linear. This is a subjective call though. For the sake of this example, let’s continue. In real life, you may need to transform your data.

Homoscedasticity

To test is the variances are the same throughout, we again look at the scatterplot. What we want to see is pretty equal scatter from left to right; funnel shapes are bad. I would say that it is roughly homoscedastic, so we may continue.

Residual plots and qqplot

A residual plot is a great way to look at homo vs. heteroscedasticity of the residual errors using the code below.

A qq plot is the best way to see if the errors are normally distributed.

The code below is a quick way to see residual and q-q norm plots for any linear model (which you use to make ANOVA and regressions). You can run this line 4 times - a new plot pops up each time.

The most useful (to me) is the q-q norm plot (the 2nd one). If the errors of the model are normal, the data points will fall closely on the line. In this particular example, it’s pretty much perfect!

For more information, google these nested functions: plot(lm()) and see https://drsimonj.svbtle.com/visualising-residuals.

lm.iris1<-lm(Sepal.Length~Sepal.Width,data=iris) #make your linear model
par(mfrow = c(2, 2))  # Split the plotting panel into a 2 x 2 grid
plot(lm.iris1)  # Plot the linear model information you made above

#Note, if 4 graphs aren't showing up for you all at once, run the par() line and plot() line together.
par(mfrow = c(1,1)) #Returns the plotting panel to normal (1x1)

Make sure to run the last par() line above to return your plotting view to normal. #Linear Regression

summary(lm.iris1) #uses the linear model that you made above
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5561 -0.6333 -0.1120  0.5579  2.2226 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.5262     0.4789   13.63   <2e-16 ***
## Sepal.Width  -0.2234     0.1551   -1.44    0.152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8251 on 148 degrees of freedom
## Multiple R-squared:  0.01382,    Adjusted R-squared:  0.007159 
## F-statistic: 2.074 on 1 and 148 DF,  p-value: 0.1519
anova(lm.iris1)
## Analysis of Variance Table
## 
## Response: Sepal.Length
##              Df  Sum Sq Mean Sq F value Pr(>F)
## Sepal.Width   1   1.412 1.41224  2.0744 0.1519
## Residuals   148 100.756 0.68078

Looking at the summary results, we see LOTS of stuff. Let’s start by looking in the ‘Coefficients’ section. The estimate of intercept is the y-intercept, in this case, 6.5262. The estimate of Sepal.Width is the slope of the model, -0.2234. So the equation of the relationship is Sepal.Length = -0.22 X Sepal.Width + 6.5262.

Now let’s look at the bottom section (of the summary still), starting with the Adjusted R-squared which is 0.0072 - that’s a crazy low R-sq, so we know there are large residuals (the observations are far from the trend line). On the last line, we see the F and degrees of freedom. The p-value is 0.1519, which is well above our alpha of 0.05. The relationship is not significant.

The anova() returns the F, degrees of freedom, and p-value, and these should match the values from the summary.

Plot the relationship

plot(iris$Sepal.Width,iris$Sepal.Length, main="", xlab="Sepal Width (mm)",ylab="Sepal Length (mm)")
abline(lm(iris$Sepal.Length~iris$Sepal.Width), col="red") # regression line (y~x)

#if you are getting a warning similar to "new plot has not been called", try highlighting both lines (plot and abline) and run both lines at the same time))

For more advanced plotting, please see the Plotting tutorial.

Now your data!

Remember, if you need help with any function, type a ? before that function name and run it. The R help file will come up. Also check out Quick R and other online resources.

setwd("~/VCU/Lab/R_Tutorial/LinearRegression")
#read in your data here using read.csv(). See the *Getting Started with R* Tutorial for help.

#Test Linearity
plot(iris$Sepal.Width,iris$Sepal.Length, main="", xlab="Sepal Width (mm)",ylab="Sepal Length (mm)") 
            #Note: list the predictor variable first, then the response variable.
            #2nd Note: the structure of iris$Sepal.Width is dataframe$variable
#Homoscedasticity
    #Investigate the scatterplot

#Residual errors (homoscedasticity and linearity) and q-q plots (normality of errors)
lm.iris1<-lm(Sepal.Length~Sepal.Width,data=iris) #make your linear model
summary(lm.iris1)  # Report the results
par(mfrow = c(2, 2))  # Split the plotting panel into a 2 x 2 grid
plot(lm.iris1)  # Plot the linear model information you made above
#Note, if 4 graphs aren't showing up for you all at once, run the par() line and plot() line together.
par(mfrow = c(1,1)) #Returns the plotting panel to normal (1x1)

#Linear Regression
summary(lm.iris1) #uses the linear model that you made above
anova(lm.iris1)

#Plot the relationship
plot(iris$Sepal.Width,iris$Sepal.Length, main="", xlab="Sepal Width (mm)",ylab="Sepal Length (mm)")
abline(lm(iris$Sepal.Length~iris$Sepal.Width), col="red") # regression line (y~x)
#if you are getting a warning similar to "new plot has not been called", try highlighting both lines (plot and abline) and run both lines at the same time))