2023-09-15

Simple Linear Regressions

  • X = independent variable set by researcher
  • Y = dependent variable measured by researcher
  • Linear regression estimates the expected change in Y per one-unit change in X

Illustrative Dataset ‘iris’

data(iris)
str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Regression Model

\[ y = a + bX \]

- y is the predicted average of Y at a given X
- a is the intercept
- b is the slope

Example with ‘iris’
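
A minimal sketch of fitting the model above to the full dataset, assuming sepal length as X and sepal width as Y (the pairing used in the plot at the end of these notes). The object name `fit_all` is only illustrative.

```r
# Fit Sepal.Width (Y) on Sepal.Length (X) across all 150 flowers
data(iris)
fit_all <- lm(Sepal.Width ~ Sepal.Length, data = iris)

# coef() returns the intercept (a) and the slope (b)
coef(fit_all)

# summary() adds standard errors, t statistics and R-squared
summary(fit_all)
```

With all three species pooled, the fitted slope comes out slightly negative.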

Giving More information

Limiting the data to a single species makes the fit more informative, because pooling species mixes groups with different sepal dimensions.
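
One way this restriction might look in R, assuming `setosa` as the species of interest; the names `setosa` and `fit_setosa` are illustrative, not taken from the original notes.

```r
# Keep only the 50 setosa flowers, then refit the same model
data(iris)
setosa <- subset(iris, Species == "setosa")
fit_setosa <- lm(Sepal.Width ~ Sepal.Length, data = setosa)

# Intercept (a) and slope (b) for setosa alone
coef(fit_setosa)
```

Within setosa the slope is positive, the opposite sign from the pooled fit.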

Other 2 Species

95% Confidence Interval

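\[ \beta = b \pm \left(t_{n-2,\,0.975}\right)\left(\text{se}_b\right) \]

The interval is the point estimate for the slope, plus or minus the 97.5th percentile of the t distribution with n - 2 degrees of freedom (from a t table) times the standard error of the slope, which is calculated from the standard error of the regression.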
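
The formula can be checked by hand against R's built-in `confint()`. This is a sketch that assumes the setosa subset as the example data; the variable names (`b`, `se_b`, `t_crit`, `ci_manual`) are illustrative.

```r
data(iris)
setosa <- subset(iris, Species == "setosa")
fit <- lm(Sepal.Width ~ Sepal.Length, data = setosa)

# Point estimate and standard error of the slope
b    <- coef(fit)[["Sepal.Length"]]
se_b <- summary(fit)$coefficients["Sepal.Length", "Std. Error"]

# 97.5th percentile of the t distribution with n - 2 degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

# b plus/minus t * se_b
ci_manual <- c(lower = b - t_crit * se_b, upper = b + t_crit * se_b)
ci_manual

# Built-in equivalent; should agree with ci_manual
confint(fit, "Sepal.Length", level = 0.95)
```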

Confidence intervals between Species
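
A possible way to tabulate the slope and its 95% confidence interval for each species, so the intervals can be compared side by side. The helper names `by_species` and `slope_ci` are assumptions made for this sketch.

```r
data(iris)

# One data frame per species
by_species <- split(iris, iris$Species)

# Fit the same model within each species and collect slope + 95% CI
slope_ci <- t(sapply(by_species, function(d) {
  fit <- lm(Sepal.Width ~ Sepal.Length, data = d)
  c(slope = coef(fit)[["Sepal.Length"]],
    confint(fit, "Sepal.Length", level = 0.95)[1, ])
}))

# Rows are species; columns are the slope, lower bound and upper bound
slope_ci
```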

Graphing with Regression Line Equation

Graphing with full information

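library(ggplot2)  # ggplot(), geom_point(), geom_smooth()
library(ggpubr)   # stat_regline_equation()

# Scatter plot of sepal width vs. length, coloured and shaped by species
fig9 <- ggplot(data = iris, aes(
      x = Sepal.Length,
      y = Sepal.Width,
      col = Species,
      shape = Species)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("setosa" = "orchid4",
                                "versicolor" = "maroon",
                                "virginica" = "steelblue")) +
  theme_classic() +
  labs(
    title = "Iris Sepal Width vs. Length",
    subtitle = "Species Comparison",
    caption = "Data from 'iris'",
    x = "Sepal Length",
    y = "Sepal Width") +
  # One least-squares line per species, without the confidence band
  geom_smooth(formula = y ~ x, method = "lm", se = FALSE) +
  # Print each species' fitted regression equation on the plot
  stat_regline_equation()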