2025-10-27

Intro

Simple linear regression is a statistical method that uses a straight line to model the relationship between a single input variable (predictor) and a single output variable (response).

Linear regression is helpful when:

  • Determining how one variable affects another
  • Making predictions based on observed patterns
  • Measuring the strength of relationships

Example: Analyzing how car mileage affects resale pricing.

head(cars)
  mileage year    price
1   61662 2016 48161.40
2   67869 2019 54828.17
3   12985 2018 60841.36
4   39924 2014 53159.49
5   78292 2019 49201.08
6   72554 2017 45979.48

Simple Linear Regression Model

Formula

\[y = \beta_0 + \beta_1 x + \epsilon\]

For this formula:

  • \(y\) is the response variable, which here is price
  • \(x\) is the predictor variable, which here is mileage
  • \(\beta_0\) is the y-intercept
  • \(\beta_1\) is the slope
  • \(\epsilon\) is the error term

Goal

Find the best straight line that shows how mileage affects price.

The fitted line is \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\), where the hats denote values estimated from the data.

Scatter Diagram

Before fitting any model, the first step should be to visualize the data and check whether a linear relationship makes sense.

Result

The scatter plot shows a negative linear relationship: as mileage increases, price decreases. The pattern of the plot also suggests that a straight-line model is appropriate for this data.
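A minimal sketch of how such a scatter plot could be drawn in base R. It assumes only the six rows visible in the head(cars) output, stored under the hypothetical name cars_head; the full data set is not reproduced here, so the picture is only illustrative.

```r
# Six rows copied from the head(cars) output; cars_head is a hypothetical name
# used to avoid clashing with R's built-in cars data set.
cars_head <- data.frame(
  mileage = c(61662, 67869, 12985, 39924, 78292, 72554),
  year    = c(2016, 2019, 2018, 2014, 2019, 2017),
  price   = c(48161.40, 54828.17, 60841.36, 53159.49, 49201.08, 45979.48)
)

# Scatter plot of price against mileage
plot(cars_head$mileage, cars_head$price,
     xlab = "Mileage", ylab = "Price",
     main = "Price vs. mileage")

# Sample correlation: a negative value matches a downward-sloping point cloud
r <- cor(cars_head$mileage, cars_head$price)
r
```

A correlation close to -1 or 1 suggests a roughly linear pattern, while a value near 0 suggests that a straight line may not describe the data well.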

Least-Squares Fit

Infinitely many lines could be drawn through these points. We use least squares to choose the one that minimizes the sum of squared vertical distances between the observed prices and the line.
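For reference, the least-squares criterion and its standard closed-form solution can be written as follows. The symbols \(n\), \(x_i\), \(y_i\), \(\bar{x}\) and \(\bar{y}\) (the number of cars, the individual mileages and prices, and their sample means) are notation introduced here for illustration.

\[\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2\]

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \hspace{1cm} \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]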

The black line is our answer and is the line that best fits the data.
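As a rough sketch, the least-squares estimates can be computed by hand and checked against lm(). The example assumes only the six rows from the head(cars) output (stored under the hypothetical name cars_head), so the numbers will differ from a fit on the full data.

```r
# Six rows copied from the head(cars) output; cars_head is a hypothetical name.
cars_head <- data.frame(
  mileage = c(61662, 67869, 12985, 39924, 78292, 72554),
  year    = c(2016, 2019, 2018, 2014, 2019, 2017),
  price   = c(48161.40, 54828.17, 60841.36, 53159.49, 49201.08, 45979.48)
)
x <- cars_head$mileage
y <- cars_head$price

# Closed-form least-squares estimates of the slope and intercept
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# The same fit via lm(), which should agree with the hand calculation
fit <- lm(price ~ mileage, data = cars_head)
c(intercept = b0, slope = b1)
coef(fit)

# Draw the points and overlay the fitted line in black
plot(x, y, xlab = "Mileage", ylab = "Price")
abline(fit, col = "black")
```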

Hypothesis Testing

We found a negative slope, but is the relationship statistically significant, or could it have happened by chance?

Test

\[H_0: \beta_1 = 0 \hspace{1cm} H_1: \beta_1 \neq 0\]

model <- lm(price ~ mileage, data = cars)
p_value <- summary(model)$coefficients["mileage", "Pr(>|t|)"]
p_value

Results

Looking at the p-value

  • If p-value < 0.05: We reject \(H_0\) - the relationship is statistically significant
  • If p-value ≥ 0.05: We fail to reject \(H_0\) - not enough evidence of a relationship

Since our p-value was less than 0.05, we reject \(H_0\): a slope this far from zero would be unlikely to arise from random chance alone if mileage had no linear relationship with price.
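To make the test less of a black box, here is a sketch of how the p-value reported by summary() can be reproduced from the slope estimate and its standard error. It assumes only the six rows from the head(cars) output (hypothetical name cars_head), so the p-value it prints is not the one obtained from the full data set.

```r
# Six rows copied from the head(cars) output; cars_head is a hypothetical name.
cars_head <- data.frame(
  mileage = c(61662, 67869, 12985, 39924, 78292, 72554),
  year    = c(2016, 2019, 2018, 2014, 2019, 2017),
  price   = c(48161.40, 54828.17, 60841.36, 53159.49, 49201.08, 45979.48)
)

fit   <- lm(price ~ mileage, data = cars_head)
coefs <- summary(fit)$coefficients

# t statistic for H0: beta_1 = 0 is the estimate divided by its standard error
est    <- coefs["mileage", "Estimate"]
se     <- coefs["mileage", "Std. Error"]
t_stat <- est / se

# Two-sided p-value from a t distribution with n - 2 degrees of freedom
df_resid <- df.residual(fit)
p_manual <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

p_manual
coefs["mileage", "Pr(>|t|)"]  # should match p_manual
```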

3D graph

While we have mainly focused on mileage, price is influenced by multiple factors. The graph below shows how both mileage and model year affect price simultaneously.

Result

Newer cars with lower mileage command higher prices. This touches on multiple linear regression, since more than one predictor is involved, but the same main concepts apply.
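A small sketch of what the two-predictor version might look like in R: the model becomes \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\), with \(x_1\) as mileage and \(x_2\) as model year (this notation is introduced here for illustration). The example assumes only the six rows from the head(cars) output (hypothetical name cars_head), and the car used for prediction is made up.

```r
# Six rows copied from the head(cars) output; cars_head is a hypothetical name.
cars_head <- data.frame(
  mileage = c(61662, 67869, 12985, 39924, 78292, 72554),
  year    = c(2016, 2019, 2018, 2014, 2019, 2017),
  price   = c(48161.40, 54828.17, 60841.36, 53159.49, 49201.08, 45979.48)
)

# Multiple linear regression: price explained by both mileage and year
fit2 <- lm(price ~ mileage + year, data = cars_head)
coef(fit2)

# Predicted price for a made-up car (values chosen only for illustration)
new_car <- data.frame(mileage = 50000, year = 2018)
pred <- predict(fit2, newdata = new_car)
pred
```

Each coefficient is read as the expected change in price for a one-unit change in that predictor while the other predictor is held fixed.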

Conclusion

We used simple linear regression to investigate the relationship between car mileage and selling price.

Overview

  • Clear negative relationship, with higher-mileage cars being associated with lower prices.

  • Our hypothesis test found the relationship to be statistically significant (p-value < 0.05), supporting this.

  • Good model fit: the \(R^2\) value indicated that mileage explains a substantial portion of the variation in price.
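For readers who want to see where \(R^2\) comes from, the sketch below computes it as one minus the ratio of residual variation to total variation and compares it with the value reported by summary(). It assumes only the six rows from the head(cars) output (hypothetical name cars_head), so the value will differ from the one for the full data set.

```r
# Six rows copied from the head(cars) output; cars_head is a hypothetical name.
cars_head <- data.frame(
  mileage = c(61662, 67869, 12985, 39924, 78292, 72554),
  year    = c(2016, 2019, 2018, 2014, 2019, 2017),
  price   = c(48161.40, 54828.17, 60841.36, 53159.49, 49201.08, 45979.48)
)

fit <- lm(price ~ mileage, data = cars_head)

# Residual sum of squares: variation in price left unexplained by the line
sse <- sum(residuals(fit)^2)
# Total sum of squares: variation in price around its mean
sst <- sum((cars_head$price - mean(cars_head$price))^2)

r2_manual <- 1 - sse / sst
r2_manual
summary(fit)$r.squared  # should match r2_manual
```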

Regression analysis transforms raw data into actionable insights, allowing us to make informed and data-driven decisions.