Introduction and Dataset

This presentation will cover the process of creating a simple linear regression model in R and using the Plotly and ggplot2 libraries to graph data with the model we create.

We’ll be using the built-in Fisher iris data set, which contains the sepal and petal lengths and widths of 150 flowers from three species of Iris. We begin by loading the set into a data frame:

library(plotly)   # interactive scatter plots
library(ggplot2)  # static plot with the regression line
data(iris)
df = iris
head(df, 3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
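Before modelling, it can help to confirm the shape of the data. A minimal check in base R, assuming only the built-in iris data set:

```r
# Load the built-in data set and confirm its size and species counts
data(iris)
dim(iris)            # number of rows and columns
table(iris$Species)  # observations per species
```

The counts show whether the three species are equally represented, which matters later when one model is fit to all of them.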

Plotly demonstration

We create a data frame for each species of Iris, create a base plot, add each species to the plot as a trace, and finally label the title and axes.

setosa = df[df$Species == 'setosa', 1:5]
versicolor = df[df$Species == 'versicolor', 1:5]
virginica = df[df$Species == 'virginica', 1:5]
dfplot = plot_ly(type='scatter', mode='markers')
dfplot = add_trace(dfplot, x = setosa$Petal.Length,
                   y = setosa$Petal.Width,
                   name='Setosa', type='scatter', mode='markers')
dfplot = add_trace(dfplot, x = versicolor$Petal.Length,
                   y = versicolor$Petal.Width,
                   name='Versicolor', type='scatter', mode='markers')
dfplot = add_trace(dfplot, x = virginica$Petal.Length,
                   y = virginica$Petal.Width,
                   name='Virginica', type='scatter', mode='markers')
dfplot = layout(dfplot, title = 'Petal Length vs. Width',
       xaxis = list(title = 'Petal Length'),
       yaxis = list(title = 'Petal Width'))
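The same plot can be built with less repetition. The following is only a sketch of one alternative, not the approach used above: it assumes the plotly package is installed, and the names speciesFrames and loopPlot are hypothetical. It uses split() to build the per-species frames and a loop to add the traces.

```r
# One data frame per species, keyed by species name
data(iris)
speciesFrames = split(iris, iris$Species)

# The guard keeps the sketch from failing where plotly is not installed
if (requireNamespace('plotly', quietly = TRUE)) {
  loopPlot = plotly::plot_ly()
  for (name in names(speciesFrames)) {
    sp = speciesFrames[[name]]
    # Add one scatter trace per species
    loopPlot = plotly::add_trace(loopPlot, x = sp$Petal.Length,
                                 y = sp$Petal.Width, name = name,
                                 type = 'scatter', mode = 'markers')
  }
  loopPlot = plotly::layout(loopPlot, title = 'Petal Length vs. Width',
                            xaxis = list(title = 'Petal Length'),
                            yaxis = list(title = 'Petal Width'))
}
```

Here the legend labels come straight from the factor levels, so they appear in lowercase unless renamed.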

Data visualization

The plot shows clear trends, both within individual species and across the three species taken together. Can we use these trends to predict petal width for other sizes and species?

Linear regression example

Let’s start with the species that has the smallest petals, Iris Setosa. R's lm() function fits a linear regression model to our data, here with petal width as the response and petal length as the predictor.

setosaLM = lm(Petal.Width ~ Petal.Length, setosa)
setosaLM
## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length, data = setosa)
## 
## Coefficients:
##  (Intercept)  Petal.Length  
##     -0.04822       0.20125

With this model, we can predict the width of a Setosa petal from its length: \[\text{Width} = 0.20125 \times \text{Length} - 0.04822\]
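The equation can also be applied through R's predict() function rather than by hand. A minimal sketch, assuming a hypothetical new petal 1.5 cm long (the value and the names newPetal, predictedWidth and manualWidth are only for illustration):

```r
# Refit the Setosa model so the example is self-contained
data(iris)
setosa = iris[iris$Species == 'setosa', ]
setosaLM = lm(Petal.Width ~ Petal.Length, data = setosa)

# Hypothetical new observation; the column name must match the predictor
newPetal = data.frame(Petal.Length = 1.5)
predictedWidth = predict(setosaLM, newdata = newPetal)

# The same prediction written out from the fitted coefficients
b = coef(setosaLM)
manualWidth = b[['(Intercept)']] + b[['Petal.Length']] * 1.5
predictedWidth
```

Both routes give the same value, about 0.25 cm, which matches plugging a length of 1.5 into the equation.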

Linear regression plot

Using the ggplot2 library, we can plot the Setosa observations and add the regression line. This lets us see how well the linear regression model fits the sample data.

# Scatter plot of the Setosa observations
slmplot = ggplot(setosa, aes(x = Petal.Length, y = Petal.Width))
slmplot = slmplot + geom_point(shape = 19, color = 'orange')
# Regression line, taken from the coefficients of the fitted model
slmplot = slmplot + geom_abline(slope = coef(setosaLM)[2],
                                intercept = coef(setosaLM)[1])
slmplot = slmplot + labs(title = 'Petal Length vs. Width (Iris Setosa)') +
  xlab('Petal Length') + ylab('Petal Width')

Linear regression plot

The line gives a fairly accurate representation of the average petal width for Iris Setosa, but what about the other species?
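The visual impression of the fit can be complemented with a numeric summary. A minimal sketch in base R, where the name fit is hypothetical:

```r
# Refit the Setosa model so the example is self-contained
data(iris)
setosa = iris[iris$Species == 'setosa', ]
setosaLM = lm(Petal.Width ~ Petal.Length, data = setosa)

fit = summary(setosaLM)
fit$r.squared  # share of the variance in petal width explained by length
fit$sigma      # residual standard error, in cm

# Least-squares residuals average to zero when the model has an intercept,
# which is why the line tracks the average petal width
mean(residuals(setosaLM))
```

The R-squared value indicates how much of the spread around the line is left unexplained, which the plot alone only suggests.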

Linear regression plot expanded

While the model fits Iris Setosa, it underestimates petal width for the Versicolor and Virginica species because it was fit only to Setosa observations, making it a poor choice for general prediction.
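The underestimate can be checked numerically by applying the Setosa model to each species and averaging the prediction errors. A minimal sketch, where the name meanError is hypothetical:

```r
# Refit the Setosa model so the example is self-contained
data(iris)
setosa = iris[iris$Species == 'setosa', ]
setosaLM = lm(Petal.Width ~ Petal.Length, data = setosa)

# Mean of (observed - predicted) petal width for each species;
# a positive value means the Setosa model predicts too small a width
meanError = sapply(split(iris, iris$Species), function(sp) {
  mean(sp$Petal.Width - predict(setosaLM, newdata = sp))
})
meanError
```

The error is essentially zero for Setosa and positive for the other two species, consistent with the plot.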

Linear regression plot expanded

If we include all three species in our linear model, we get the model: \[\text{Width} = 0.4158 \times \text{Length} - 0.3631\]
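The combined model is fit the same way as the Setosa model, using the full data set. A minimal sketch, where the name irisLM is hypothetical:

```r
# Fit petal width against petal length using all three species
data(iris)
irisLM = lm(Petal.Width ~ Petal.Length, data = iris)

# Intercept and slope of the combined model
coef(irisLM)
```

The slope is roughly double that of the Setosa-only model, which reflects the wider petals of the two larger species.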