2026-03-08

1. Introduction

  • Statistics Topic

Simple Linear Regression (SLR).

  • Goal

Understand simple linear regression through an example dataset.

  • Dataset

R Built-in USArrests data.

  • Hypothesis

States with a higher percentage of urban population will show higher violent crime arrest rates.

2. What is Linear Regression

Linear Regression is a statistical method used to model the relationship between:

  • A response or a dependent variable (Y): what we want to predict

  • One or more predictor or independent variables (X): what we use to predict

In simple linear regression, we use only one independent variable to predict the dependent variable.

Goal: Find the straight line that best fits the data points, i.e. the line that minimizes the sum of squared vertical distances between the points and the line.

3. Mathematical Model (LaTeX)

The regression model is expressed by the following equation:

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]

Where:

  • \(X_i\): Independent variable (predictor)

  • \(Y_i\): Dependent variable (response)

  • \(\beta_0\): Intercept

  • \(\beta_1\): Slope

  • \(\varepsilon_i\): Error term, the vertical deviation of observation \(i\) from the regression line; its estimate from the fitted line is the residual
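To make the model concrete, the sketch below estimates \(\beta_0\) and \(\beta_1\) by ordinary least squares using the standard closed-form formulas, \(\hat\beta_1 = \sum (X_i - \bar X)(Y_i - \bar Y) / \sum (X_i - \bar X)^2\) and \(\hat\beta_0 = \bar Y - \hat\beta_1 \bar X\), and compares them with lm(). The choice of Assault as Y and UrbanPop as X is only an illustration; the names b0, b1, and fit are made up for this example.

```r
# Closed-form OLS estimates for Y = Assault, X = UrbanPop (built-in USArrests data)
x <- USArrests$UrbanPop
y <- USArrests$Assault

# Slope: sum of cross-deviations divided by the sum of squared X deviations
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: forces the line through the point of means (x-bar, y-bar)
b0 <- mean(y) - b1 * mean(x)

# lm() should reproduce the same two numbers
fit <- lm(Assault ~ UrbanPop, data = USArrests)
print(c(intercept = b0, slope = b1))
print(coef(fit))
```

Both printed vectors should agree, since lm() fits the same least-squares line.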

4. Model Assumptions (LaTeX)

For inference from a linear regression to be valid, we assume the errors are independent and identically distributed: \[\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)\]

Key Assumptions

  • Linearity: The relationship between X and Y is linear

  • Independence: Observations are independent of each other

  • Homoscedasticity: Constant variance of errors (\(\sigma^2\))

  • Normality: Errors are normally distributed
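These assumptions are usually checked on the residuals of a fitted model. The following is a minimal sketch using base R only, with Assault ~ UrbanPop as an illustrative model (the names fit, res, and sw are chosen for this example). Independence cannot be read off a plot and has to be argued from how the data were collected.

```r
# Fit an illustrative model and extract its residuals
fit <- lm(Assault ~ UrbanPop, data = USArrests)
res <- residuals(fit)

# Linearity and homoscedasticity: residuals vs fitted values should show
# no curve and no funnel shape around the zero line
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: points should lie close to the reference line
qqnorm(res)
qqline(res)

# Formal normality check (a small p-value suggests non-normal residuals)
sw <- shapiro.test(res)
print(sw$p.value)
```

With only 50 observations, the plots are generally more informative than the test alone.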

5. How to Interpret Regression Line

The regression line shows how one variable predicts another.

Direction

  • Positive slope (line goes up): As X increases, Y increases
  • Negative slope (line goes down): As X increases, Y decreases

Strength

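  • Steep slope: Y changes a lot per one-unit change in X
  • Gentle slope: Y changes little per one-unit change in X
  • Horizontal line: No linear relationship

Steepness depends on the units of X and Y, so the strength of the relationship is judged by how tightly the points cluster around the line: the closer they cluster, the better X predicts Y.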
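The difference between slope and clustering can be sketched in a few lines. The example below rescales the Assault rate from "per 100,000" to "per 1,000" (an arbitrary change of units chosen for illustration) and compares the slope and the correlation before and after; all variable names are made up for this example.

```r
x <- USArrests$UrbanPop
y <- USArrests$Assault      # arrests per 100,000 residents
y_scaled <- y / 100         # the same rate expressed per 1,000 residents

# The slope changes with the units of Y ...
slope_raw    <- coef(lm(y ~ x))[["x"]]
slope_scaled <- coef(lm(y_scaled ~ x))[["x"]]

# ... but the correlation, which measures clustering around the line, does not
r_raw    <- cor(x, y)
r_scaled <- cor(x, y_scaled)

# In simple linear regression, R-squared equals the squared correlation
r2 <- summary(lm(y ~ x))$r.squared

print(c(slope_raw = slope_raw, slope_scaled = slope_scaled))
print(c(r_raw = r_raw, r_scaled = r_scaled, r_squared = r2))
```

The rescaled slope is 100 times smaller while the correlation is unchanged, which is why steepness alone does not measure how well X predicts Y.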

6. Dataset Summary and Exploration

We will use R's built-in "USArrests" dataset, which contains violent crime arrest rates and the percentage of urban population for each of the 50 US states in 1973.

Variables

  • Murder: Murder arrests per 100,000 population

  • Assault: Assault arrests per 100,000 population

  • Rape: Rape arrests per 100,000 population

  • UrbanPop: Percentage of the population living in urban areas

7. Data Preview

            Murder  Assault  UrbanPop  Rape
Alabama       13.2      236        58  21.2
Alaska        10.0      263        48  44.5
Arizona        8.1      294        80  31.0
Arkansas       8.8      190        50  19.5
California     9.0      276        91  40.6
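A preview like the one above can be reproduced with base R. This is a small sketch; the variable name preview is chosen for this example.

```r
# Structure: 50 observations (states) of 4 numeric variables
str(USArrests)

# First five rows; state names are stored as row names
preview <- head(USArrests, 5)
print(preview)

# Per-variable summary statistics (min, quartiles, mean, max)
print(summary(USArrests))
```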

8. Visualizing Data: Urban Population and Murder Arrest Rate (ggplot)
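One way to draw this kind of scatterplot with a fitted line is sketched below. It assumes the ggplot2 package is installed; plot_crime is a hypothetical helper written for this example, not code from the original slides, and the axis labels are assumptions.

```r
library(ggplot2)

# Hypothetical helper: scatterplot of one arrest rate against UrbanPop
# with a least-squares regression line
plot_crime <- function(crime) {
  ggplot(USArrests, aes(x = UrbanPop, y = .data[[crime]])) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    labs(x = "Urban population (%)",
         y = paste(crime, "arrests per 100,000"),
         title = paste("Urban Population and", crime, "Arrest Rate"))
}

p_murder <- plot_crime("Murder")
print(p_murder)
```

Calling plot_crime("Assault") or plot_crime("Rape") would give the corresponding plots for the other two arrest rates.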

9. Visualizing Data: Urban Population and Assault Arrest Rate (ggplot)

10. Visualizing Data: Urban Population and Rape Arrest Rate (ggplot)

11 - 1. Interpreting the Scatterplot and Regression Line

Murder Arrest Rate and Urban Population

The regression line is nearly horizontal, which indicates little or no linear relationship between the urban population percentage and the murder arrest rate. Based on the scatterplot and regression line, the urban population percentage does not predict the murder arrest rate.

Assault Arrest Rate and Urban Population

The regression line has a positive slope, which means there is a moderate positive relationship between the urban population and the assault arrest rate. The points are moderately clustered around the line.

11 - 2. Interpreting the Scatterplot and Regression Line

Rape Arrest Rate and Urban Population

The regression line has a positive slope, which means there is a moderate positive relationship between the urban population and the rape arrest rate. In the plot, the line appears a little steeper and the points cluster more tightly around it than in the plot of assault arrest rate against urban population.

Overall Review

While the regression lines for urban population vs. assault arrest rate and urban population vs. rape arrest rate have similar slope angles in the plots, their y-intercepts differ. The rape line sits lower overall, which reflects lower absolute arrest rates. Overall, based on the scatterplots and regression lines, a higher urban population percentage predicts higher assault and rape arrest rates, but not a higher murder arrest rate.
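The visual comparison can be backed with numbers. The sketch below fits the three separate regressions and collects the intercept, slope, R-squared, and slope p-value for each; the names crimes, fits, and comparison are chosen for this example. Because each plot scales its Y axis separately, slopes in the table are in raw units (arrests per 100,000 per percentage point) and may rank differently from how steep the lines look on screen.

```r
crimes <- c("Murder", "Assault", "Rape")

# One simple linear regression per crime type, each on UrbanPop
fits <- lapply(crimes, function(crime) {
  lm(reformulate("UrbanPop", response = crime), data = USArrests)
})
names(fits) <- crimes

# Collect the key numbers for each fit in one table
comparison <- data.frame(
  intercept = sapply(fits, function(f) coef(f)[["(Intercept)"]]),
  slope     = sapply(fits, function(f) coef(f)[["UrbanPop"]]),
  r_squared = sapply(fits, function(f) summary(f)$r.squared),
  slope_p   = sapply(fits, function(f) summary(f)$coefficients["UrbanPop", "Pr(>|t|)"])
)
print(round(comparison, 3))
```

R-squared is the scale-free column here, so it is the one to read when comparing how tightly the points cluster around each line.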

12. Visualizing Data: Urban Population and Violent Crime Arrest Rate (Plotly)

We will pool the murder, assault, and rape arrest rates into a single response variable and fit one linear regression line.

13. R Code Example

# load the packages used below (pipe, reshaping, interactive plotting)
library(dplyr)
library(tidyr)
library(plotly)

# create one variable "Rate" by stacking 3 variables (Murder, Assault, Rape)
arrests_long <- USArrests %>%
  pivot_longer(cols = c(Murder, Assault, Rape),
               names_to = "Crime_Type", values_to = "Rate")

# create one regression model using "Rate" as the single dependent variable
# fit one best-fit line across the points of all crime types
total_mod <- lm(Rate ~ UrbanPop, data = arrests_long)
arrests_long$fitted <- predict(total_mod)

# visualize
# plot the points with a different color for each crime type
# draw one regression line
plot_ly(arrests_long, x = ~UrbanPop) %>%
  add_markers(y = ~Rate, color = ~Crime_Type,
              colors = c("Murder"="red", "Assault"="blue", "Rape"="green")) %>%
  add_lines(y = ~fitted, name = "Global Trend",
            line = list(color = "orange", width = 3)) %>%
  layout(title = "Integrated Arrest Trend Across 3 Crime Types",
         xaxis = list(title = "Urban Population"),
         yaxis = list(title = "Arrest Rate"))

14. Plotly Code Reflection

The core logic of the integrated regression analysis:

  • Data Reshaping

We used pivot_longer() to stack the murder, assault, and rape arrest rates into a single response variable (Y) named Rate, with Crime_Type recording which column each value came from.

  • Treating as One Variable

The model treats the three crime types as a single category of crime arrest rates, so all points are plotted on one shared Y axis.

  • Global Trend Line

Instead of separate slopes, we derived one average trend that runs through the cloud of all 150 data points (50 states × 3 crime types).

  • Why it matters

This condenses three separate relationships into one picture, showing the overall association between urban population percentage and violent crime arrest rates at a glance.

15. Key Findings from the Integrated Data

The integrated trend line reflects a statistical compromise:

  • Balanced Slope

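Because every crime type is paired with the same 50 UrbanPop values, the pooled slope equals the arithmetic mean of the three separate slopes. It is therefore steeper than the flattest individual trend (Murder) and flatter than the steepest one in raw units (Assault).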

  • Equal-Weight Average (OLS)

OLS gives each of the 150 points equal weight. The large assault arrest rates pull the line upward, while the nearly flat murder arrest trend acts as an anchor, pulling the pooled slope (\(\beta_1\)) below the assault slope.

  • Summary

The pooled line represents the average of the three violent crime arrest trends, illustrating that combining rates measured on very different scales can mask or dilute the relationship each crime type has with urban population.
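The "statistical compromise" can be checked directly. The sketch below, written in base R so it runs without extra packages, compares the slope of the pooled model with the mean of the three separate slopes; the names crimes, slopes, pooled, and pooled_slope are chosen for this example. The equality holds because all three crime types share the same X values and the same number of observations.

```r
crimes <- c("Murder", "Assault", "Rape")

# Slope of each separate regression on UrbanPop
slopes <- sapply(crimes, function(crime) {
  coef(lm(reformulate("UrbanPop", response = crime), data = USArrests))[["UrbanPop"]]
})

# Pooled data: stack the three rate columns and repeat UrbanPop for each
pooled <- data.frame(
  UrbanPop = rep(USArrests$UrbanPop, times = length(crimes)),
  Rate     = unlist(USArrests[crimes], use.names = FALSE)
)
pooled_slope <- coef(lm(Rate ~ UrbanPop, data = pooled))[["UrbanPop"]]

print(round(slopes, 3))
print(c(pooled = pooled_slope, mean_of_slopes = mean(slopes)))
```

The two printed values should match, confirming that the pooled line averages the three individual trends.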

16. Regression Line: Statistical Summary

The numerical and statistical insights of our SLR model:

  • Subtle Positive Slope (\(\beta_1 > 0\))

The line appears flatter due to the inclusion of Murder data, but it still maintains a slight positive direction.

  • Summarizing Power

The model condensed 150 complex data points into a single mathematical trend line, providing a clear overview of the dataset.

  • Predictive Foundation

The best-fit line represents the average relationship, allowing us to quantify how violent crime arrest rates change, on average, with each percentage-point increase in urban population.
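The numbers behind these points can be pulled from the fitted model. The sketch below rebuilds the pooled data in base R (equivalent in content to the pivot_longer() step, but without Crime_Type) and reports the slope, its 95% confidence interval, R-squared, and two example predictions; the UrbanPop values 50 and 80 are arbitrary illustration points. Note that the 150 rows contain three observations per state, so they are not fully independent and the interval should be read with caution.

```r
# Pooled data: 50 states x 3 crime types
pooled <- data.frame(
  UrbanPop = rep(USArrests$UrbanPop, times = 3),
  Rate     = c(USArrests$Murder, USArrests$Assault, USArrests$Rape)
)
total_mod <- lm(Rate ~ UrbanPop, data = pooled)

# Estimated intercept and slope
print(coef(total_mod))

# 95% confidence interval for the slope
slope_ci <- confint(total_mod, "UrbanPop", level = 0.95)
print(slope_ci)

# Share of the variance in Rate explained by UrbanPop
r2 <- summary(total_mod)$r.squared
print(r2)

# Predicted pooled arrest rate at 50% and 80% urban population
pred <- predict(total_mod, newdata = data.frame(UrbanPop = c(50, 80)))
print(pred)
```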

17. Conclusion

Simple Linear Regression gives an intuitive understanding of relationships.

  • Simplicity

By using one predictor, SLR transforms complex data points into a clear linear narrative.

  • Foundation of Prediction

It serves as the fundamental building block for predictive modeling and advanced data science.

  • Quantifiable Insight

It allows researchers to move beyond observation and start quantifying the impact of one variable on another.

  • Final Thought

SLR remains a powerful starting point for turning multi-layered sociological data into interpretable, actionable patterns.