2024-02-04

Introduction to Simple Linear Regression

Simple linear regression is a statistical method that models the relationship between a dependent variable and a single independent variable by fitting a linear equation to the data. The goal is to find the best-fit line: the line that minimizes the sum of squared differences between the observed values and the values the line predicts. The fitted line can then be used to make predictions or to gain insight into the relationship between the independent and dependent variables.

A simple linear regression model is not useful in every case, for example when the relationship between the two variables is not actually linear.
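As a small illustration of that limitation, the sketch below fits a straight line to toy data in which `y` is completely determined by `x`, but through a parabola. The vectors `x` and `y` are made up for this example and are not part of the star dataset.

```r
# Toy data: y depends on x exactly, but through a parabola, not a line.
x <- seq(-3, 3, by = 0.5)
y <- x^2

# Fit a simple linear regression of y on x.
fit <- lm(y ~ x)

# Pearson correlation and R-squared only capture the linear part
# of a relationship, so both are essentially zero here.
linear_cor <- cor(x, y)
r_squared  <- summary(fit)$r.squared

linear_cor
r_squared
```

Both values come out at essentially zero even though the relationship is perfectly predictable, which is why it helps to plot the data before trusting a fitted line.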

Definition

According to PennState, “Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables.”

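The best-fit line must follow the formula below, where \(b_0\) is the intercept, \(b_1\) is the slope, and \(\hat{Y}_i\) is the predicted value for observation \(X_i\):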

\[\hat{Y}_i = b_0 + b_1 X_i\]
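To make the formula concrete, here is a minimal sketch that computes the intercept and slope from the standard least-squares expressions and compares them with what `lm()` returns. It uses R's built-in `cars` dataset purely as a stand-in; the variable names `b0`, `b1`, and `fit` are chosen for this example.

```r
# Built-in example data: stopping distance (dist) vs. speed.
x <- cars$speed
y <- cars$dist

# Least-squares slope: covariation of x and y divided by variation of x.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Least-squares intercept: the line passes through (mean(x), mean(y)).
b0 <- mean(y) - b1 * mean(x)

# The same coefficients as estimated by lm().
fit <- lm(dist ~ speed, data = cars)

c(intercept = b0, slope = b1)
coef(fit)

# Predicted values follow Y-hat = b0 + b1 * X.
y_hat <- b0 + b1 * x
```

The hand-computed coefficients and `coef(fit)` should agree up to floating-point rounding.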

Stars

For now, we will use a dataset from Kaggle that contains data about stars in the Milky Way Galaxy.

Here luminosity (\(L\)) is calculated by the following formula: \[ L = 4 \pi R^2 \sigma T^4 \]

  • Where
    • \(L\) is the luminosity,
    • \(R\) is the radius of the star,
    • \(T\) is the surface temperature of the star, and
    • \(\sigma\) is the Stefan-Boltzmann constant.
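The formula can be tried out with a short sketch. The function name `luminosity` and the approximate solar radius and temperature below are assumptions for illustration; they are not taken from the Kaggle dataset.

```r
# Stefan-Boltzmann constant in W m^-2 K^-4.
sigma <- 5.670374419e-8

# Luminosity in watts from radius (meters) and surface temperature (kelvin).
luminosity <- function(radius_m, temperature_k) {
  4 * pi * radius_m^2 * sigma * temperature_k^4
}

# Approximate values for the Sun (assumed inputs for this example).
sun_radius_m      <- 6.957e8
sun_temperature_k <- 5772

sun_luminosity <- luminosity(sun_radius_m, sun_temperature_k)
sun_luminosity  # roughly 3.8e26 W

# Because of the T^4 term, doubling the temperature multiplies L by 16.
luminosity(sun_radius_m, 2 * sun_temperature_k) / sun_luminosity
```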

Stars’ Absolute Magnitude

A star's absolute magnitude (\(M\)) is a measure of its intrinsic brightness: the apparent magnitude the star would have if it were observed from a distance of 10 parsecs.

\(M\) can be calculated using the formula: \[ M = m - 5 \cdot (\log_{10}(d) - 1) \]

  • Where:
    • \(M\) is the absolute magnitude,
    • \(m\) is the apparent magnitude,
    • \(d\) is the distance to the star in parsecs.
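A minimal sketch of this formula in R follows. The function name `absolute_magnitude` and the input values are hypothetical and chosen only to show how the distance term behaves.

```r
# Absolute magnitude from apparent magnitude and distance in parsecs.
absolute_magnitude <- function(apparent_magnitude, distance_pc) {
  apparent_magnitude - 5 * (log10(distance_pc) - 1)
}

# At exactly 10 parsecs the correction term vanishes, so M equals m.
at_10_pc <- absolute_magnitude(4.0, 10)

# The same apparent magnitude at 100 parsecs implies a star that is
# intrinsically 5 magnitudes brighter (a smaller magnitude is brighter).
at_100_pc <- absolute_magnitude(4.0, 100)

c(at_10_pc = at_10_pc, at_100_pc = at_100_pc)
```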

Plotly Plot

In the following plot, the temperature of each star is plotted against its luminosity. The color of each dot represents the color of that star. Based on the line of best fit, there appears to be some correlation between the two variables.

ggplot Plot One

Here we can further see that there may be a relationship between color and temperature. The means for each star color differ noticeably, which reinforces the idea that luminosity and temperature may be correlated.


ggplot Plot Two

Here is a simple example of a linear regression line comparing the radius and temperature of stars. Unfortunately, as the line shows, there is no linear correlation between the two variables.

R code

The next slide contains the code used to produce the plotly plot, including its linear regression line for luminosity vs. temperature.

R code Continued

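library(plotly)  # provides plot_ly(), add_lines(), layout() and %>%

# `df` is assumed to already hold the Kaggle star dataset
# (the dotted column names are what read.csv() produces).

# Scatter plot of temperature vs. luminosity, one marker per star.
# The column name suggests luminosity is relative to the Sun (L/Lo),
# so the hover text uses that unit rather than watts.
scatter_plot <- plot_ly(
  data = df,
  x = ~Temperature..K.,
  y = ~Luminosity.L.Lo.,
  type = "scatter", mode = "markers",
  marker = list(color = ~Star.color),
  text = ~paste("Temperature: ", Temperature..K.,
                "K<br>Luminosity: ", Luminosity.L.Lo., "L/Lo")
)

# Overlay the fitted values of the simple linear regression as a line.
scatter_plot <- scatter_plot %>%
  add_lines(x = ~Temperature..K.,
            y = ~fitted(lm(Luminosity.L.Lo. ~ Temperature..K., data = df)),
            line = list(color = "green", width = 2),
            name = "Best Fit Line")

# Pass the title and axis labels to layout() as named arguments.
scatter_plot <- layout(
  scatter_plot,
  title = "Scatter Plot of Temperature vs. Luminosity for Stars in the Milky Way",
  xaxis = list(title = "Temperature (Kelvin)"),
  yaxis = list(title = "Luminosity (L/Lo)")
)
scatter_plot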