Introduction to Simple Linear Regression

Simple Linear Regression is a statistical method used to model the relationship between two continuous variables.

  • Dependent variable (Y): The variable I’m trying to predict.

  • Independent variable (X): The variable I’ll use to predict the dependent variable.

The goal is to fit the line that minimizes the sum of squared vertical distances (residuals) between the observed data points and the line.

Simple Linear Regression Model

The simple linear regression equation is:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

Where:

  • \(Y\) is the dependent variable (e.g., Salary)

  • \(X\) is the independent variable (e.g., Years of Experience)

  • \(\beta_0\) is the intercept

  • \(\beta_1\) is the slope

  • \(\epsilon\) is the error term, which the residuals estimate once the line is fitted.

Estimating the Coefficients: \(\beta_0\) and \(\beta_1\)

The coefficients in the linear regression model can be estimated using the least squares method. The formula for the slope is:

\[ \hat{\beta_1} = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2} \]

Estimating the Coefficients: \(\beta_0\) and \(\beta_1\) Cont.

The formula for \(\beta_0\) (intercept) is:

\[ \hat{\beta_0} = \bar{Y} - \hat{\beta_1} \bar{X} \]

Where:

  • \(\hat{\beta_1}\) is the estimated slope.
  • \(\hat{\beta_0}\) is the estimated intercept.
  • \(\bar{Y}\) is the mean of the dependent variable (Salary).
  • \(\bar{X}\) is the mean of the independent variable (Years of Experience).
  • \(n\) is the number of data points.
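
The two formulas can be checked by hand in R. This is a minimal sketch on a small made-up dataset (the values below are hypothetical and are not taken from Salary_dataset.csv); it computes the slope and intercept directly and compares them with what lm() returns.

```r
# Hypothetical toy data (NOT from Salary_dataset.csv), used only to check the formulas
x <- c(1, 2, 3, 4, 5)                      # years of experience
y <- c(40000, 48000, 61000, 68000, 80000)  # salary in USD
n <- length(x)

# Slope from the least squares formula
b1 <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)

# Intercept: mean of Y minus slope times mean of X
b0 <- mean(y) - b1 * mean(x)

# lm() should give the same estimates
fit <- lm(y ~ x)

print(c(intercept = b0, slope = b1))  # intercept 29400, slope 10000
print(coef(fit))
```

For this toy data the hand-computed values agree with coef(fit) up to floating-point rounding.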

Example Dataset

I will use the Salary_dataset.csv, which includes data on years of experience and corresponding salaries.

Columns:

  • YearsExperience: Years of experience.

  • Salary: Salary in USD.

The task is to predict Salary from YearsExperience.

Example Dataset Cont.

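data <- read.csv("Salary_dataset.csv")  # load the dataset (path assumed relative to the working directory)

model <- lm(Salary ~ YearsExperience, data = data)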

summary(model)
## 
## Call:
## lm(formula = Salary ~ YearsExperience, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7958.0 -4088.5  -459.9  3372.6 11448.0 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      24848.2     2306.7   10.77 1.82e-11 ***
## YearsExperience   9450.0      378.8   24.95  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5788 on 28 degrees of freedom
## Multiple R-squared:  0.957,  Adjusted R-squared:  0.9554 
## F-statistic: 622.5 on 1 and 28 DF,  p-value: < 2.2e-16
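
To read the output as a prediction rule, the estimated coefficients can be plugged into the regression equation. The sketch below copies the two estimates from the summary output; the 5 years of experience is a hypothetical input chosen for illustration.

```r
# Coefficients copied from the summary(model) output
b0 <- 24848.2   # (Intercept)
b1 <- 9450.0    # YearsExperience

# Hypothetical input: an employee with 5 years of experience
years <- 5
predicted_salary <- b0 + b1 * years

print(predicted_salary)  # 72098.2
```

With the fitted model object available, predict(model, newdata = data.frame(YearsExperience = 5)) should give nearly the same value; small differences would come from the rounding of the printed coefficients.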

Scatter Plot with Regression Line

To visualize the relationship between YearsExperience and Salary, I used a scatter plot with the regression line added:

Plot with Plotly

This is an interactive 2D plot using the plotly package to visualize the relationship between years of experience and salary:

R Code for Model Evaluation

To evaluate the model’s performance, I created a plot to check the residuals:
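
One way such a residual check could be drawn is a residuals-versus-fitted plot in base R. This is a sketch, not necessarily the code behind the original plot; to keep it self-contained it uses simulated stand-in data (an assumption), which would be replaced by the data read from Salary_dataset.csv.

```r
# Simulated stand-in data (an assumption); in practice use read.csv("Salary_dataset.csv")
set.seed(1)
data <- data.frame(YearsExperience = seq(1, 10, by = 0.5))
data$Salary <- 25000 + 9500 * data$YearsExperience +
  rnorm(nrow(data), sd = 5000)

model <- lm(Salary ~ YearsExperience, data = data)

# Residuals against fitted values, with a dashed reference line at zero
plot(fitted(model), resid(model),
     xlab = "Fitted salary (USD)",
     ylab = "Residual (USD)",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)
```

If the linear model is adequate, the points should scatter around the zero line without a clear curve or funnel shape.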

How did I plot that? R Code Examples

Scatter Plot with Regression Line
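
A minimal base R sketch of a scatter plot with the fitted line. The original plotting code is not shown here, so this is one possible version; the data are simulated stand-ins (an assumption) in place of Salary_dataset.csv.

```r
# Simulated stand-in data (an assumption); in practice use read.csv("Salary_dataset.csv")
set.seed(1)
data <- data.frame(YearsExperience = seq(1, 10, by = 0.5))
data$Salary <- 25000 + 9500 * data$YearsExperience +
  rnorm(nrow(data), sd = 5000)

model <- lm(Salary ~ YearsExperience, data = data)

# Scatter plot of the observed points
plot(Salary ~ YearsExperience, data = data,
     xlab = "Years of experience",
     ylab = "Salary (USD)",
     main = "Salary vs Years of Experience")

# Overlay the fitted regression line
abline(model, col = "red", lwd = 2)
```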

Plot with Plotly
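
A sketch of an interactive version with the plotly package. The trace names, titles, and the Fitted column are illustrative choices, and the data are simulated stand-ins (an assumption) in place of Salary_dataset.csv.

```r
library(plotly)

# Simulated stand-in data (an assumption); in practice use read.csv("Salary_dataset.csv")
set.seed(1)
data <- data.frame(YearsExperience = seq(1, 10, by = 0.5))
data$Salary <- 25000 + 9500 * data$YearsExperience +
  rnorm(nrow(data), sd = 5000)

model <- lm(Salary ~ YearsExperience, data = data)
data$Fitted <- fitted(model)  # fitted values used to draw the regression line

# Observed points as markers, fitted values as a line
p <- plot_ly(data, x = ~YearsExperience, y = ~Salary,
             type = "scatter", mode = "markers", name = "Observed") %>%
  add_lines(y = ~Fitted, name = "Regression line") %>%
  layout(title = "Salary vs Years of Experience",
         xaxis = list(title = "Years of experience"),
         yaxis = list(title = "Salary (USD)"))

p
```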

Conclusion

Simple Linear Regression helps model relationships between two variables. For this example:

  • I modeled Salary based on Experience.
  • The fitted model estimates that salary increases by about $9,450 for each additional year of experience, with an R-squared of 0.957.
  • I also assessed the model’s performance using residual plots.

Next steps:

  • We could extend this to multiple linear regression if other variables are available.