October 16, 2025

What Is a Linear Regression?

  • A linear regression is a statistical model used to make predictions or estimates
  • It describes the relationship between an independent variable (X) and a dependent variable (Y)
  • The model is linear: each one-unit change in X is associated with the same constant change in Y

Regression Equation

The general form of a simple linear regression equation is:

\[ Y = \alpha + \beta X + \epsilon \]

  • Where:
    • \(\alpha\) = intercept
    • \(\beta\) = slope (change in Y for a one-unit change in X)
    • \(\epsilon\) = error term (unexplained variation)
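
To make the roles of \(\alpha\) and \(\beta\) concrete, the sketch below computes the ordinary least-squares estimates by hand and compares them with what lm() returns. It assumes the built-in women data set; the names beta_hat, alpha_hat, and residuals_by_hand are illustrative only.

```r
# Minimal sketch: least-squares estimates of alpha and beta computed by hand.
x <- women$weight   # independent variable (X)
y <- women$height   # dependent variable (Y)

# Slope: covariance of X and Y divided by the variance of X
beta_hat <- cov(x, y) / var(x)

# Intercept: makes the fitted line pass through (mean(x), mean(y))
alpha_hat <- mean(y) - beta_hat * mean(x)

# Residuals are the sample counterpart of the error term epsilon
residuals_by_hand <- y - (alpha_hat + beta_hat * x)

# Compare the hand-computed values with R's built-in fit
lm_coefs <- coef(lm(height ~ weight, data = women))
print(c(alpha = alpha_hat, beta = beta_hat))
print(lm_coefs)
```

The two printed vectors should agree up to floating-point error, and the residuals sum to approximately zero because the model includes an intercept.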

Linear Regression Output in R

We will regress height on weight using the women data set that ships with R:

## 
## Call:
## lm(formula = height ~ weight, data = women)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83233 -0.26249  0.08314  0.34353  0.49790 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 25.723456   1.043746   24.64 2.68e-12 ***
## weight       0.287249   0.007588   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.44 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

Height vs Weight Regression Interpretation

Equation estimated from the regression (Y = height in inches, X = weight in lbs):

\[ Y = 25.72 + 0.29X \]

  • For every 1 lb increase in weight, predicted height increases by about 0.29 inches
  • The \(R^2\) is 0.991, showing that about 99% of the variation in height is explained by weight
  • The p-value is 1.091e-14, showing that the weight coefficient is highly significant
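
The quantities in this interpretation can be pulled out of the fitted model directly. This is a sketch that assumes the model is stored as women_regression (the name the plotting code uses); the 140 lb input to predict() is an arbitrary illustrative value.

```r
# Fit the same model as in the summary output
women_regression <- lm(height ~ weight, data = women)
fit_summary <- summary(women_regression)

# Intercept and slope of the fitted line
coefs <- coef(women_regression)

# Share of the variation in height explained by weight
r_squared <- fit_summary$r.squared

# p-value for the weight coefficient, taken from the coefficient table
slope_p_value <- coef(fit_summary)["weight", "Pr(>|t|)"]

# Predicted height (inches) for a hypothetical 140 lb woman
predicted_height <- predict(women_regression,
                            newdata = data.frame(weight = 140))

print(round(coefs, 2))
print(r_squared)
print(slope_p_value)
print(predicted_height)
```

With the coefficients reported in the output, the prediction at 140 lbs works out to 25.72 + 0.287 × 140, or roughly 65.9 inches.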

Using Plotly to Plot Women’s Data Linear Regression

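library(plotly)

# Fit the model from the earlier slide so fitted() has something to work with
women_regression <- lm(height ~ weight, data = women)

# Scatter plot of the raw data with hover text for each point
women_plot <- plot_ly(women, x = ~weight, y = ~height,
                      type = 'scatter',
                      mode = 'markers',
                      marker = list(size = 5, line = list(color = 'steelblue', width = 1)),
                      name = 'Women\'s data',
                      text = ~paste('Weight:', weight, '<br>Height:', height)
                      ) |>
  # Overlay the fitted values as a line; inherit = FALSE keeps this trace
  # from inheriting the marker and text attributes set in plot_ly()
  add_trace(x = women$weight,
            y = fitted(women_regression),
            type = 'scatter',
            mode = 'lines',
            line = list(color = 'tomato', width = 2),
            name = 'Line of Best Fit',
            inherit = FALSE
            ) |>
  layout(
    title = 'Linear Regression of Height Vs Weight for Women',
    xaxis = list(title = 'Weight (lbs)'),
    yaxis = list(title = 'Height (Inches)')
  )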

Plot of Regression of Women’s Height vs Weight

  • The plot shows a positive correlation between height and weight
  • The line fits the data well, which is consistent with the high \(R^2\) of 0.991
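
The visual impression can be checked numerically. In a regression with a single predictor, the squared correlation between X and Y equals \(R^2\); the short sketch below illustrates this on the women data (variable names are illustrative).

```r
# Pearson correlation between weight and height
r <- cor(women$weight, women$height)

# R-squared from the fitted regression
fit_r_squared <- summary(lm(height ~ weight, data = women))$r.squared

print(r)
print(c(r_squared_from_cor = r^2, r_squared_from_lm = fit_r_squared))
```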

Regressions with Different Data

  • Next, we will fit linear regressions on different data sets and interpret the results
  • We will use the following data sets:
    • airquality
    • penguins

Linear Regression of airquality data (using ggplot)
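
The plotting code for this slide is not shown. A minimal sketch of what it could look like follows, assuming clean_airqual is airquality with the rows missing Solar.R removed (an assumption that matches the 144 residual degrees of freedom in the regression output). The title, colours, and axis labels are illustrative choices, not the original ones.

```r
library(ggplot2)

# Assumption: clean_airqual drops the rows where Solar.R is missing
clean_airqual <- airquality[!is.na(airquality$Solar.R), ]

# Scatter plot of temperature against solar radiation with a fitted line
air_plot <- ggplot(clean_airqual, aes(x = Solar.R, y = Temp)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "tomato") +
  labs(title = "Temperature vs Solar Radiation",
       x = "Solar Radiation (Langleys)",
       y = "Temperature (degrees F)")

print(air_plot)
```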

Air Quality Regression Results

## 
## Call:
## lm(formula = Temp ~ Solar.R, data = clean_airqual)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.3787  -4.9572   0.8932   5.9111  18.4013 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 72.863012   1.693951  43.014  < 2e-16 ***
## Solar.R      0.028255   0.008205   3.444 0.000752 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.898 on 144 degrees of freedom
## Multiple R-squared:  0.07609,    Adjusted R-squared:  0.06967 
## F-statistic: 11.86 on 1 and 144 DF,  p-value: 0.0007518

Takeaway from Air Quality Regression

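  • Shows a slightly positive relationship (slope of about 0.028)
  • The slope is statistically significant (t = 3.44, p = 0.00075)
  • The small \(R^2\) (0.076) means Solar.R explains only about 7.6% of the variation in Temp
  • Overall, the model does not explain much (poor variable choice)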
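
These takeaways can be read straight from the model object. The sketch below assumes clean_airqual is airquality with missing Solar.R rows dropped, and the name airqual_regression is hypothetical, since the original fitting code is not shown.

```r
# Assumption: clean_airqual drops the rows where Solar.R is missing
clean_airqual <- airquality[!is.na(airquality$Solar.R), ]

# Same formula as in the Call line of the output
airqual_regression <- lm(Temp ~ Solar.R, data = clean_airqual)
airqual_summary <- summary(airqual_regression)

# Slope estimate, its t statistic, and its p-value
slope_row <- coef(airqual_summary)["Solar.R", ]
print(slope_row[c("Estimate", "t value", "Pr(>|t|)")])

# Share of the variation in Temp explained by Solar.R
print(airqual_summary$r.squared)
```

A significant slope with a low \(R^2\) is not a contradiction: the p-value says the slope is unlikely to be zero, while \(R^2\) says how much of the spread in Temp the line accounts for.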

Penguin Regression Plot Code

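library(dplyr)     # filter()
library(ggplot2)   # ggplot(), geom_point(), geom_smooth()
library(plotly)    # ggplotly()
library(ggthemes)  # theme_solarized()

# Drop rows with a missing bill measurement
# (bill_len / bill_dep are the column names in the penguins data set bundled with R 4.5+)
clean_pen <- penguins |>
  filter(!is.na(bill_len) & !is.na(bill_dep))

# One fitted line per species, converted to an interactive plotly figure
pen_plot <- ggplotly(
  ggplot(clean_pen, aes(x = bill_dep, y = bill_len, color = species)) +
    geom_point(size = 2, alpha = .7) +
    geom_smooth(method = 'lm', se = FALSE) +
    labs(title = 'LR of Bill Length on Bill Depth by Species',
         x = 'Bill Depth (mm)', y = 'Bill Length (mm)') +
    theme_solarized()
)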

Regression Results/Interpretation

  • Adelie has the weakest relationship
  • Chinstrap and Gentoo have strong positive relationships
  • Within each species, penguins with deeper bills tend to have longer bills
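
The per-species comparison can also be made numerically by fitting one regression per species. This is a sketch, assuming the penguins data set bundled with R 4.5 or later (columns bill_len and bill_dep); the names species_fits and species_stats are illustrative.

```r
# Drop rows with a missing bill measurement (base R equivalent of the filter step)
clean_pen <- subset(penguins, !is.na(bill_len) & !is.na(bill_dep))

# Fit bill length on bill depth separately for each species
species_fits <- lapply(
  split(clean_pen, clean_pen$species),
  function(d) lm(bill_len ~ bill_dep, data = d)
)

# Collect the slope and R-squared of each fit into one small table
species_stats <- sapply(species_fits, function(fit) {
  c(slope = unname(coef(fit)["bill_dep"]),
    r_squared = summary(fit)$r.squared)
})

print(round(species_stats, 3))
```

Comparing the r_squared row across species gives a number to go with the visual ranking of which relationship is weakest.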

Conclusion

Hopefully you were able to learn a little about simple linear regressions and how you can interpret/plot them using R.