2025-11-09

Simple Linear Regression

  • Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables.

  • It helps us predict one variable from the other.

  • Overall, it helps us understand the relationship between the two variables.

Linear Regression Formula

The formula commonly used in Data Science is \(y = \beta_0 + \beta_1 x + \epsilon\).

  • \(y\): Dependent variable (what we are trying to predict)
  • \(x\): Independent variable (what is being used to aid the prediction)
  • \(\beta_0\): the intercept (value of y when x = 0)
  • \(\beta_1\): the slope (change in y for a one-unit increase in x)
  • \(\epsilon\): the random error
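
To make the roles of the terms concrete, here is a minimal sketch in base R that simulates data from the formula and then recovers the coefficients with `lm()`. The values of `beta0`, `beta1`, the sample size, and the error standard deviation are hypothetical, chosen only for illustration.

```r
set.seed(42)                       # make the simulation reproducible

n     <- 100                       # hypothetical sample size
beta0 <- 2                         # hypothetical true intercept
beta1 <- 0.5                       # hypothetical true slope

x       <- runif(n, min = 0, max = 10)    # independent variable
epsilon <- rnorm(n, mean = 0, sd = 1)     # random error
y       <- beta0 + beta1 * x + epsilon    # dependent variable

fit_sim <- lm(y ~ x)               # estimate intercept and slope from the data
coef(fit_sim)                      # estimates should land near beta0 and beta1
```

Because of the random error, the estimates will not equal the true values exactly, but they should be close for a sample of this size.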

Line of best fit

Ordinary Least Squares regression aims to minimize the sum of squared differences between the observed and predicted values.

Residual: \(\epsilon_i = y_i - \hat{y}_i\)

  • As a line is drawn through your data, the vertical distances between the points and the regression line are called residuals or errors.
  • \(\hat{y}_i = \beta_0 + \beta_1 x_i\)
  • \(\beta_0, \beta_1\) are chosen such that \(\sum_i (y_i - \hat{y}_i)^2\) is minimized
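
For simple linear regression the minimizing values have a closed form: \(\hat{\beta}_1 = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). The sketch below applies these formulas in base R to the six characters from Luke Skywalker through Owen Lars in the starwars dataset and checks the result against `lm()`. The variable names are illustrative, and six rows are far too few for a meaningful fit; the point is only to show the arithmetic.

```r
# Height and mass for six starwars characters (Luke Skywalker ... Owen Lars)
height <- c(172, 167, 96, 202, 150, 178)
mass   <- c(77, 75, 32, 136, 49, 120)

x_bar <- mean(height)
y_bar <- mean(mass)

# Closed-form OLS estimates for slope and intercept
beta1_hat <- sum((height - x_bar) * (mass - y_bar)) / sum((height - x_bar)^2)
beta0_hat <- y_bar - beta1_hat * x_bar

# Fitted values, residuals, and the sum of squared residuals being minimized
y_hat            <- beta0_hat + beta1_hat * height
residuals_manual <- mass - y_hat
sse              <- sum(residuals_manual^2)

# lm() should return the same coefficients
fit6 <- lm(mass ~ height)
all.equal(unname(coef(fit6)), c(beta0_hat, beta1_hat))
```

Any other choice of intercept and slope would give a larger `sse` on these six points.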

Dataset starwars

For example: Is there a relationship between a character's height and their mass (weight) in the Star Wars universe?

library(dplyr)  # provides select() and the starwars dataset

head(
  select(starwars, name, height, mass)
)
## # A tibble: 6 × 3
##   name           height  mass
##   <chr>           <int> <dbl>
## 1 Luke Skywalker    172    77
## 2 C-3PO             167    75
## 3 R2-D2              96    32
## 4 Darth Vader       202   136
## 5 Leia Organa       150    49
## 6 Owen Lars         178   120

Scatter Plot with a Linear Regression Line

## `geom_smooth()` using formula = 'y ~ x'
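
A plot like this could be produced with ggplot2 roughly as follows. This is a sketch rather than the original code: the definition of `starwars_clean` (rows with both height and mass recorded) is an assumption, as are the axis labels and the object name `p`.

```r
library(dplyr)
library(ggplot2)

# Assumption: starwars_clean keeps only characters with height and mass recorded
starwars_clean <- starwars %>%
  filter(!is.na(height), !is.na(mass))

# Scatter plot of mass against height with a fitted least-squares line
p <- ggplot(starwars_clean, aes(x = height, y = mass)) +
  geom_point() +
  geom_smooth(method = "lm") +     # no formula given, so ggplot2 reports 'y ~ x'
  labs(x = "Height (cm)", y = "Mass (kg)")

p
```

Leaving out the `formula` argument is what triggers the `geom_smooth()` message; passing `formula = y ~ x` explicitly would silence it.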

Fit Linear Regression line

## 
## Call:
## lm(formula = mass ~ height, data = starwars_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -60.95  -29.51  -20.83  -17.65 1260.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.4868   111.3842  -0.103    0.918
## height        0.6240     0.6262   0.997    0.323
## 
## Residual standard error: 169.5 on 57 degrees of freedom
## Multiple R-squared:  0.01712,    Adjusted R-squared:  -0.0001194 
## F-statistic: 0.9931 on 1 and 57 DF,  p-value: 0.3232
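
A summary of this form comes from calling `summary()` on an `lm` fit. The sketch below assumes `starwars_clean` is the starwars data with missing heights and masses removed; the 57 residual degrees of freedom imply that 59 rows were used, but the exact cleaning step is not shown, so treat the filter as an assumption. The 180 cm character used for prediction is hypothetical.

```r
library(dplyr)

# Assumption: drop characters with missing height or mass
starwars_clean <- starwars %>%
  filter(!is.na(height), !is.na(mass))

# Regress mass on height
fit <- lm(mass ~ height, data = starwars_clean)
summary(fit)

# Pull out individual pieces of the summary
coef(summary(fit))        # estimates, standard errors, t values, p-values
summary(fit)$r.squared    # proportion of variance in mass explained by height

# Predicted mass for a hypothetical character 180 cm tall
predict(fit, newdata = data.frame(height = 180))
```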

Residuals and Model fit
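
One common way to inspect residuals and model fit is a residuals-versus-fitted plot. The following is a minimal sketch in base R graphics, assuming the same `starwars_clean` filter as a stand-in for the original cleaning step; `diag_df` and `worst` are illustrative names.

```r
library(dplyr)

# Assumption: drop characters with missing height or mass
starwars_clean <- starwars %>%
  filter(!is.na(height), !is.na(mass))

fit <- lm(mass ~ height, data = starwars_clean)

# Fitted values and residuals, one row per character
diag_df <- data.frame(
  name     = starwars_clean$name,
  fitted   = fitted(fit),
  residual = resid(fit)
)

# Residuals vs. fitted values; points far from the dashed line are poorly fit
plot(diag_df$fitted, diag_df$residual,
     xlab = "Fitted mass", ylab = "Residual",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)

# Character with the largest absolute residual
worst <- diag_df$name[which.max(abs(diag_df$residual))]
worst
```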

3D Plotly

## Warning: Ignoring 23 observations

R Code for 3D plot

library(plotly)  # plot_ly(), layout(), and the %>% pipe

# Axis titles share one font; the nested title list replaces the deprecated titlefont
axis_font <- list(family = "Modern Computer Roman")
xax <- list(title = list(text = "height", font = axis_font))
yax <- list(title = list(text = "mass", font = axis_font))
zax <- list(title = list(text = "birth year", font = axis_font))

# 3D scatter of height, mass, and birth year, colored by gender
plot_ly(
  data = starwars_clean,
  x = ~height, y = ~mass, z = ~birth_year,
  type = "scatter3d",
  mode = "markers",
  color = ~as.factor(gender)
) %>%
  layout(
    title = "Birth year VS. (Height,Mass)",
    scene = list(xaxis = xax, yaxis = yax, zaxis = zax)
  )

Conclusion

  • Using simple linear regression, we explored the relationship between height and mass of Star Wars characters.

  • Several unusual characters (e.g., Jabba the Hutt, Chewbacca, Yoda) affect the results; Jabba the Hutt alone accounts for the maximum residual of 1260.29.

  • The scatterplot shows a general positive trend between height and mass.

However:

  • The p-value for the height coefficient is 0.323, which is not statistically significant at the usual 0.05 level

  • The residual standard error of 169.5 indicates a large spread of errors

  • Takeaway: There is no strong evidence that height reliably predicts mass in Star Wars characters.