- Statistics Topic
Simple Linear Regression (SLR).
- Goal
Understanding simple linear regression with example dataset.
- Dataset
R Built-in USArrests data.
- Hypothesis
States with higher urban populations will show higher crime rates.
2026-03-08
Linear Regression is a statistical method used to model the relationship between:
A response or a dependent variable (Y): what we want to predict
One or more predictor or independent variables (X): what we use to predict
In simple linear regression,
we use only one independent variable to predict the dependent variable.
Goal: Find the best-fitting straight line through all data points.
The regression model is expressed by the following equation:
\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]
Where:
\(X_i\): Independent variable (predictor)
\(Y_i\): Dependent variable (response)
\(\beta_0\): Intercept
\(\beta_1\): Slope
\(\varepsilon_i\): Error term, the vertical distance between an observation and the regression line (its estimate from the fitted line is called the residual)
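As a minimal sketch of how the intercept and slope are estimated, the least-squares formulas can be computed by hand and compared with `lm()`. Using Assault as the response here is only an illustrative choice, and the names `b0`, `b1`, and `manual` are assumptions made for this example.

```r
# Closed-form least-squares estimates for Y = b0 + b1 * X, checked against lm().
x <- USArrests$UrbanPop   # predictor: percent urban population
y <- USArrests$Assault    # response: assault arrests per 100,000

# slope: co-variation of X and Y divided by the variation of X
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# intercept: forces the line through the point (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

fit <- lm(Assault ~ UrbanPop, data = USArrests)
manual <- c(b0, b1)

print(manual)      # hand-computed intercept and slope
print(coef(fit))   # the same two values as reported by lm()
```

Both printouts should agree up to floating-point rounding, since `lm()` solves the same least-squares problem.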
For inference from the model to be valid, we assume the errors are independent and identically distributed: \[\varepsilon_i \sim N(0, \sigma^2)\]
Key Assumptions
Linearity: The relationship between X and Y is linear
Independence: Observations are independent of each other
Homoscedasticity: Constant variance of errors (\(\sigma^2\))
Normality: Errors are normally distributed
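One way these assumptions could be checked in practice is with residual diagnostics. The sketch below uses Assault ~ UrbanPop purely as an example model; the plots and the Shapiro-Wilk test are common choices rather than the only ones, and independence usually has to be argued from how the data were collected rather than from a plot.

```r
# Residual diagnostics for a simple linear regression (illustrative model).
fit <- lm(Assault ~ UrbanPop, data = USArrests)
res <- resid(fit)

# Linearity and homoscedasticity: residuals vs fitted values should show
# no curve and no funnel shape around the horizontal zero line
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: points should fall close to the reference line
qqnorm(res)
qqline(res)

# Formal normality check; a small p-value suggests non-normal residuals
sw <- shapiro.test(res)
print(sw$p.value)
```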
The regression line shows how one variable predicts another.
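Direction: the sign of the slope shows whether Y rises or falls as X increases
Strength: how tightly the points cluster around the line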
The closer the points cluster around the line, the better X predicts Y.
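Direction and strength can also be read off numerically. In this hedged sketch, the sign of the correlation coefficient gives the direction and its square (R-squared) gives the share of variance in Y explained by X; Rape is used only as an example response.

```r
# Direction and strength of a linear relationship, in numbers.
r   <- cor(USArrests$UrbanPop, USArrests$Rape)   # Pearson correlation
fit <- lm(Rape ~ UrbanPop, data = USArrests)

slope <- unname(coef(fit)["UrbanPop"])   # direction: same sign as r
r2    <- summary(fit)$r.squared          # strength: equals r^2 in SLR

print(c(r = r, slope = slope, r_squared = r2))
```

In simple linear regression R-squared is exactly the squared correlation, so either number can be used to describe how well X predicts Y.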
We will use R's built-in USArrests dataset, which contains violent crime arrest statistics and the urban population percentage for each of the 50 US states in 1973.
Variables
Murder: Murder arrests per 100,000 population
Assault: Assault arrests per 100,000 population
Rape: Rape arrests per 100,000 population
UrbanPop: Percent of the population living in urban areas
| State | Murder | Assault | UrbanPop | Rape |
|---|---|---|---|---|
| Alabama | 13.2 | 236 | 58 | 21.2 |
| Alaska | 10.0 | 263 | 48 | 44.5 |
| Arizona | 8.1 | 294 | 80 | 31.0 |
| Arkansas | 8.8 | 190 | 50 | 19.5 |
| California | 9.0 | 276 | 91 | 40.6 |
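The preview above can be reproduced with a few base R calls. This is a small sketch; USArrests ships with R, so no package needs to be loaded, and the state names are stored as row names rather than as a column.

```r
# Inspect the built-in dataset: first rows, structure, and size.
first_rows <- head(USArrests, 5)   # Alabama through California
print(first_rows)

str(USArrests)                     # four numeric columns, one row per state
n_states <- nrow(USArrests)
print(n_states)
```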
Murder Arrest Rate and Urban Population
The regression line is nearly horizontal, which means there is no meaningful relationship between the urban population size and the murder arrest rate. Based on the scatterplot and regression line, urban population size does not predict the murder arrest rate.
Assault Arrest Rate and Urban Population
The regression line has a positive slope, which means there is a moderate positive relationship between the urban population and the assault arrest rate. The points are moderately clustered around the line.
Rape Arrest Rate and Urban Population
The regression line has a positive slope, which means there is a moderate positive relationship between the urban population and the rape arrest rate. The line appears a little steeper and the points cluster more tightly around it than in the assault arrest rate plot.
Overall Review
While the regression lines for urban population vs. assault arrest rate and urban population vs. rape arrest rate have similar slope angles, their y-intercepts differ. The rape arrest rate line sits lower overall, which reflects lower absolute rates. Overall, based on the scatterplots and regression lines, a higher urban population predicts higher assault and rape arrest rates, but not a higher murder arrest rate.
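The visual comparison can be backed up with numbers. The sketch below fits one model per crime type and collects the intercept, the slope, and the correlation with UrbanPop. Note that how steep a line looks on a plot depends on the y-axis scale, whereas the slopes here are in raw units (arrests per 100,000 for each percentage point of urban population). The table name `per_crime` is an assumption for this example.

```r
# Fit Murder, Assault, and Rape against UrbanPop separately and tabulate.
crimes <- c("Murder", "Assault", "Rape")

per_crime <- t(sapply(crimes, function(v) {
  # reformulate() builds the formula  <v> ~ UrbanPop  from the column name
  fit <- lm(reformulate("UrbanPop", response = v), data = USArrests)
  c(intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r         = cor(USArrests$UrbanPop, USArrests[[v]]))
}))

print(round(per_crime, 3))   # one row per crime type
```

Comparing the `r` column is the fairer way to judge how tightly the points cluster, because correlation does not depend on the measurement scale of each crime type.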
We will combine the murder, assault, and rape arrest rates into a single response variable and fit one linear regression line to all of them.
# load packages: tidyr for pivot_longer(), plotly for plot_ly() and the %>% pipe
library(tidyr)
library(plotly)

# reshape to long format: stack Murder, Assault, and Rape into one "Rate" column
arrests_long <- USArrests %>%
  pivot_longer(cols = c(Murder, Assault, Rape),
               names_to = "Crime_Type", values_to = "Rate")

# fit one regression model with "Rate" as the single dependent variable,
# i.e. one best-fit line through the points of all three crime types
total_mod <- lm(Rate ~ UrbanPop, data = arrests_long)
arrests_long$fitted <- predict(total_mod)

# visualize: color the points by crime type and draw the single regression line
plot_ly(arrests_long, x = ~UrbanPop) %>%
  add_markers(y = ~Rate, color = ~Crime_Type,
              colors = c("Murder" = "red", "Assault" = "blue", "Rape" = "green")) %>%
  add_lines(y = ~fitted, name = "Global Trend",
            line = list(color = "orange", width = 3)) %>%
  layout(title = "Integrated Arrest Trend Across 3 Crime Types",
         xaxis = list(title = "Urban Population"),
         yaxis = list(title = "Arrest Rate"))
The core logic of the integrated regression analysis:
Used pivot_longer to unify murder, assault, and rape arrests into a single response variable (Y) named Rate.
The model treats the three crime types as a single response variable, so all points share one y-axis regardless of crime type.
Instead of three separate slopes, we fit one average trend through all 150 data points (50 states x 3 crime types).
This collapses the three-way comparison into a single line that summarizes the association between urban population and violent crime arrest rates.
The integrated trend line reflects a statistical compromise:
In raw units, the integrated slope is steeper than the Murder arrest trend but flatter than the Assault arrest trend. Because every state contributes the same UrbanPop value to each crime type, the pooled slope is the arithmetic mean of the three separate slopes.
While the assault arrest rate pulls the line upward, the murder arrest rate trend acts as an anchor, flattening the overall slope (\(\beta_1\)).
The flattened line represents the global average of the three violent crime arrest rates. It shows that pooling crime types measured on very different scales can mask or dilute the relationship between urban population and any single crime type.
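The "statistical compromise" can be checked directly. Because each state's UrbanPop value is repeated once per crime type, the pooled least-squares slope works out to the plain average of the three per-crime slopes. The sketch below rebuilds the long data with base R (a stand-in for `pivot_longer`, so no packages are needed); the names `long`, `slopes`, and `pooled_slope` are assumptions for this example.

```r
# Compare the pooled slope with the mean of the three per-crime slopes.
crimes <- c("Murder", "Assault", "Rape")

# Long format: UrbanPop repeated three times, rates stacked column by column
long <- data.frame(
  UrbanPop = rep(USArrests$UrbanPop, times = length(crimes)),
  Rate     = unlist(USArrests[crimes], use.names = FALSE)
)
pooled_slope <- unname(coef(lm(Rate ~ UrbanPop, data = long))["UrbanPop"])

# One slope per crime type
slopes <- sapply(crimes, function(v) {
  fit <- lm(reformulate("UrbanPop", response = v), data = USArrests)
  unname(coef(fit)["UrbanPop"])
})

print(slopes)          # individual slopes
print(mean(slopes))    # their average
print(pooled_slope)    # matches the average up to rounding
```

The equality holds only because the predictor values are identical across the three groups; with different X values per group the pooled slope would be a weighted combination instead.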
The numerical and statistical insights of our SLR model:
The line appears flatter due to the inclusion of Murder data, but it still maintains a slight positive direction.
The model condensed 150 complex data points into a single mathematical trend line, providing a clear overview of the dataset.
The best-fit line represents the average relationship, allowing us to quantify the association between urban population and violent crime arrest rates.
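To put numbers on these points, the slope, its confidence interval, and R-squared can be pulled from the fitted model. This is a sketch that rebuilds the pooled data with base R so it runs on its own; the names `long` and `pooled` are illustrative, and the 95% level is an assumed default rather than something the analysis above specifies.

```r
# Numerical summary of the pooled simple linear regression.
crimes <- c("Murder", "Assault", "Rape")
long <- data.frame(
  UrbanPop = rep(USArrests$UrbanPop, times = length(crimes)),
  Rate     = unlist(USArrests[crimes], use.names = FALSE)
)
pooled <- lm(Rate ~ UrbanPop, data = long)

slope <- unname(coef(pooled)["UrbanPop"])             # change in Rate per 1 point of UrbanPop
ci    <- confint(pooled, "UrbanPop", level = 0.95)    # uncertainty around the slope
r2    <- summary(pooled)$r.squared                    # share of variance explained

print(slope)
print(ci)
print(r2)
```

A low R-squared here would be expected, since most of the spread in the pooled `Rate` comes from differences between crime types rather than from urban population.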
Simple Linear Regression gives an intuitive understanding of relationships.
By using one predictor, SLR transforms complex data points into a clear linear narrative.
It serves as the fundamental building block for predictive modeling and advanced data science.
It allows researchers to move beyond visual observation and start quantifying how one variable changes with another.
SLR remains a powerful starting point for turning multi-layered sociological data into interpretable, actionable patterns.