Modeling an Application of Simple Linear Regression

2024-11-13

What is Simple Linear Regression?

Linear regression is a statistical method to model the relationship between two variables.
It assumes a linear relationship between an independent variable X and a dependent variable Y.
It has many applications in finance, chemistry, computer science, and engineering.

Mathematical Theory

In simple linear regression, we model the relationship between two variables using the following equation:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

\(Y\): Dependent variable
\(X\): Independent variable
\(\beta_0\): Intercept (the expected value of \(Y\) when \(X = 0\))
\(\beta_1\): Slope (the change in \(Y\) for a one-unit increase in \(X\))
\(\epsilon\): Error term, representing the deviation of the observed \(Y\) from the predicted \(Y\)

The objective of simple linear regression is to estimate the coefficients \(\beta_0\) and \(\beta_1\), minimizing the differences between the observed and predicted values of \(Y\).
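To make the roles of the terms concrete, the sketch below simulates data from the model equation and then recovers the coefficients with `lm()`. The parameter values (`beta0`, `beta1`, `sigma`, `n`) and the normally distributed error term are assumptions chosen only for illustration; they do not come from the exam-score dataset.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical "true" parameters, chosen only for illustration
beta0 <- 2     # intercept
beta1 <- 10    # slope
sigma <- 3     # standard deviation of the error term (assumed normal)
n     <- 200   # number of simulated observations

# Simulate Y = beta0 + beta1 * X + epsilon
x   <- runif(n, min = 0, max = 10)
eps <- rnorm(n, mean = 0, sd = sigma)
y   <- beta0 + beta1 * x + eps

# Estimate the coefficients from the simulated data
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

The estimates should land close to the assumed values of 2 and 10, with the gap shrinking as `n` grows or `sigma` shrinks.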

Objective of Using Simple Linear Regression

To estimate the coefficients, we use the least squares method, which minimizes the sum of squared residuals:

\[ \text{Minimize } \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2 \]

This method finds the best-fitting line by minimizing the total squared vertical distance between the observed values and the regression line.
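As a rough illustration of the minimization, the sketch below writes the sum of squared residuals as an R function and minimizes it numerically with `optim()`. It uses only the six rows of the dataset printed in this deck, so the estimates will differ from the full-data model output. The function name `ssr` and the starting values are assumptions made for this example.

```r
# Six rows of the exam-score data as printed in this deck (not the full dataset)
data <- data.frame(
  Hours  = c(2.5, 5.1, 3.2, 8.5, 3.5, 1.5),
  Scores = c(21, 47, 27, 75, 30, 20)
)

# Sum of squared residuals for a candidate (intercept, slope) pair
ssr <- function(beta, x, y) {
  sum((y - beta[1] - beta[2] * x)^2)
}

# Numerical minimization, starting from an arbitrary guess of (0, 0)
fit_numeric <- optim(par = c(0, 0), fn = ssr,
                     x = data$Hours, y = data$Scores, method = "BFGS")

# lm() solves the same least squares problem in closed form
fit_lm <- coef(lm(Scores ~ Hours, data = data))

fit_numeric$par  # numerical estimates of intercept and slope
fit_lm           # should agree up to numerical tolerance
```

In practice `lm()` is the usual tool; the numerical search is shown only to make the objective being minimized explicit.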

Example Dataset

Dataset: Hours Studied vs. Exam Scores (first six of 96 observations shown)


##   Hours Scores
## 1   2.5     21
## 2   5.1     47
## 3   3.2     27
## 4   8.5     75
## 5   3.5     30
## 6   1.5     20

Calculating the Regression Line

The fitted regression line has the form:

\[ y = \beta_0 + \beta_1 x \]

Slope (\(\beta_1\)):

\[ \beta_1 = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sum{(x_i - \bar{x})^2}} \]

Intercept (\(\beta_0\)):

\[ \beta_0 = \bar{y} - \beta_1 \bar{x} \]
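The slope and intercept formulas can be applied directly in R. The sketch below does so for the six rows printed in this deck and compares the result with `lm()`. Because it uses only those six rows, the values will not match the full-data coefficients reported in the model output. The names `beta0_hat` and `beta1_hat` are chosen for this example.

```r
# Six rows of the exam-score data as printed in this deck (not the full dataset)
data <- data.frame(
  Hours  = c(2.5, 5.1, 3.2, 8.5, 3.5, 1.5),
  Scores = c(21, 47, 27, 75, 30, 20)
)

x_bar <- mean(data$Hours)
y_bar <- mean(data$Scores)

# Slope: sum of cross-deviations divided by sum of squared deviations of x
beta1_hat <- sum((data$Hours - x_bar) * (data$Scores - y_bar)) /
  sum((data$Hours - x_bar)^2)

# Intercept: forces the line through the point (x_bar, y_bar)
beta0_hat <- y_bar - beta1_hat * x_bar

c(intercept = beta0_hat, slope = beta1_hat)

# Cross-check against R's built-in least squares fit
coef(lm(Scores ~ Hours, data = data))
```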

Code for the Scatterplot and Regression Line

Using the code below, we can display our data as a scatterplot.
We can then fit the model and overlay the regression line on the scatterplot.

library(ggplot2)

# Scatterplot of the raw data
ggplot(data, aes(x = Hours, y = Scores)) +
  geom_point() +
  labs(title = "Hours Studied vs. Exam Scores", x = "Hours Studied", y = "Exam Scores") +
  theme_minimal()

# Fit the simple linear regression model
model <- lm(Scores ~ Hours, data = data)

# Scatterplot with the fitted regression line overlaid
ggplot(data, aes(x = Hours, y = Scores)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Simple Linear Regression: Hours Studied vs. Exam Scores", x = "Hours Studied", y = "Exam Scores") +
  theme_minimal()

Plotting the ScatterPlot using ggplot


Fitting the Regression Line


Interpreting the Output from the Model

## 
## Call:
## lm(formula = Scores ~ Hours, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0248  -1.6391   0.0788   1.7754   7.4621 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.8636     0.8199   2.273   0.0253 *  
## Hours         9.9013     0.1407  70.363   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.433 on 94 degrees of freedom
## Multiple R-squared:  0.9814, Adjusted R-squared:  0.9812 
## F-statistic:  4951 on 1 and 94 DF,  p-value: < 2.2e-16

Interpreting the Output from the Model, Part 2

The model shows a strong positive relationship between hours studied and exam scores: each additional hour of study is associated with an increase of about 9.9 points in the predicted score.
The p-values for both the intercept (0.0253) and the slope (< 2e-16) are below 0.05, meaning that both coefficients are statistically significant at the 5% level.
The R-squared value of 0.9814 indicates that the model explains a very high proportion of the variance in exam scores, suggesting that it is a reliable predictor of scores.
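The t values and p-values in the coefficient table follow from the estimates and standard errors: each t value is the estimate divided by its standard error, and the p-value is the two-sided tail probability of a t distribution with the residual degrees of freedom. The sketch below reproduces this from the rounded numbers printed in the output, so small rounding differences are expected. The variable names are chosen for this example.

```r
# Values copied from the printed coefficient table (already rounded)
estimate  <- c(Intercept = 1.8636, Hours = 9.9013)
std_error <- c(Intercept = 0.8199, Hours = 0.1407)
df_resid  <- 94  # residual degrees of freedom reported in the output

# t statistic: how many standard errors the estimate lies from zero
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

round(t_value, 3)
signif(p_value, 3)
```

The slope's p-value is far smaller than the smallest value R prints, which is why the summary reports it as `<2e-16`.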

Evaluating the Model Fit

## [1] 0.9813673

R-squared measures how well the regression line approximates the observed data points.
R-squared ranges from 0 to 1.
A value of 1 indicates that the regression line fits the data perfectly, while a value of 0 indicates that the line explains none of the variation in the data.
A value of 0.9813673 means that about 98.14% of the variation in exam scores can be explained by hours studied.
This suggests that the linear regression model provides an excellent fit to the data.
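R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this for the six rows printed in this deck, so its value will differ from the full-data figure. It then cross-checks the reported full-data figures using only numbers from the model output (F-statistic 4951, 94 residual degrees of freedom). The variable names are chosen for this example.

```r
# Six rows of the exam-score data as printed in this deck (not the full dataset)
data <- data.frame(
  Hours  = c(2.5, 5.1, 3.2, 8.5, 3.5, 1.5),
  Scores = c(21, 47, 27, 75, 30, 20)
)
model <- lm(Scores ~ Hours, data = data)

ss_res <- sum(resid(model)^2)                         # unexplained variation
ss_tot <- sum((data$Scores - mean(data$Scores))^2)    # total variation
r2_manual <- 1 - ss_res / ss_tot

r2_manual
summary(model)$r.squared  # should match the manual value

# Cross-check of the reported full-data output, using its printed values
f_stat   <- 4951
df_resid <- 94
n_obs    <- df_resid + 2   # two estimated coefficients

# With one predictor, R-squared = F / (F + residual df)
r2_from_f <- f_stat / (f_stat + df_resid)

# Adjusted R-squared penalizes for the number of estimated coefficients
adj_r2 <- 1 - (1 - 0.9813673) * (n_obs - 1) / df_resid
```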

Plot to Check Whether Residuals Are Randomly Distributed
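A residual plot of this kind can be drawn by plotting the residuals against the fitted values. The sketch below is one way to do it, using base R graphics to keep it free of extra packages and using only the six rows printed in this deck. With so few points it shows the mechanics rather than the pattern in the full dataset.

```r
# Six rows of the exam-score data as printed in this deck (not the full dataset)
data <- data.frame(
  Hours  = c(2.5, 5.1, 3.2, 8.5, 3.5, 1.5),
  Scores = c(21, 47, 27, 75, 30, 20)
)
model <- lm(Scores ~ Hours, data = data)

res <- resid(model)    # observed minus predicted scores
fit <- fitted(model)   # predicted scores

# Residuals vs. fitted values, with a dashed reference line at zero
plot(fit, res,
     xlab = "Fitted exam score", ylab = "Residual",
     main = "Residuals vs. Fitted Values")
abline(h = 0, lty = 2)
```

Curvature or a funnel shape in this plot would point to a nonlinear relationship or non-constant error variance, respectively.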

Conclusion

The residual plot helps us assess whether the differences between observed and predicted values are randomly distributed.
A good regression model should have residuals that are randomly scattered around the y = 0 line with no recognizable pattern.
If the residuals are randomly scattered around that line, it suggests that the model's assumptions are valid: the relationship between the independent and dependent variables is linear, and the errors have constant variance.
Judging by the shape of the residual plot and the R-squared value, this linear regression model appears to be a valid fit for the data.