2024-11-13

Introduction to Linear Regression

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It is used for predicting outcomes based on the linear relationship between variables.

In this presentation we will use SAT scores by gender and year from 1967 to 2001. We will explore:

  • The basic concept of linear regression
  • Visualizing the data
  • Interpreting results

Note: The dataset was chosen solely for its usefulness in demonstrating these principles. Source: https://www.kaggle.com/datasets/fundal/sat-by-year-and-gender-1967-2001?resource=download

Visualizing the Data

We will start by visualizing the relationship between verbal scores and math scores for all test takers.

Linear Regression Formula

The general formula for a linear regression model is:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

Where:

  • \(y\) is the dependent variable (math scores)
  • \(x\) is the independent variable (verbal scores)
  • \(\beta_0\) is the intercept
  • \(\beta_1\) is the slope (coefficient)
  • \(\epsilon\) is the error term
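To make the roles of \(\beta_0\) and \(\beta_1\) concrete, the following sketch estimates them with the closed-form least-squares formulas and checks the result against `lm()`. The data are simulated purely for illustration: the "true" values 320 and 0.35, the score range, and the noise level are assumptions, not the SAT data. Only the names `A_verbal`, `A_math`, and `df` follow the model call used in this presentation.

```r
set.seed(42)  # reproducible simulated data (NOT the real SAT data)

n <- 35
A_verbal <- runif(n, min = 500, max = 550)             # simulated predictor x
A_math   <- 320 + 0.35 * A_verbal + rnorm(n, sd = 6)   # y = b0 + b1 * x + noise
df <- data.frame(A_verbal, A_math)

# Closed-form ordinary least squares estimates
b1_hat <- cov(df$A_verbal, df$A_math) / var(df$A_verbal)  # slope
b0_hat <- mean(df$A_math) - b1_hat * mean(df$A_verbal)    # intercept

# The same estimates from lm()
fit <- lm(A_math ~ A_verbal, data = df)

print(c(intercept = b0_hat, slope = b1_hat))
print(coef(fit))
```

Both printouts should agree up to floating-point error, which shows that `lm()` with a single predictor is computing exactly these two quantities.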

Interpreting the Coefficients

From the linear regression model, we can interpret the coefficients:

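\[ A\_math = \beta_0 + \beta_1 A\_verbal \]

  • \(\beta_0\) represents the expected value of A_math when A_verbal is zero. Because a verbal score of zero lies far outside the observed data, the intercept mainly anchors the line rather than describing a realistic test taker.
  • \(\beta_1\) represents the expected change in A_math for a one-unit increase in A_verbal.

Using this, we can assess whether higher verbal scores are associated with higher math scores.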

Linear Regression Model

## 
## Call:
## lm(formula = A_math ~ A_verbal, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4862 -5.0208 -0.4862  3.1678 12.5138 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 321.50324   40.76403   7.887 4.30e-09 ***
## A_verbal      0.35640    0.07972   4.471 8.69e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.186 on 33 degrees of freedom
## Multiple R-squared:  0.3772, Adjusted R-squared:  0.3583 
## F-statistic: 19.99 on 1 and 33 DF,  p-value: 8.694e-05
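As a rough illustration of how to read this output, the sketch below plugs the reported estimates into the fitted line and re-derives the t values and the F statistic from the printed numbers. The verbal score of 500 is an arbitrary example input, and small discrepancies are expected because the printed figures are rounded.

```r
# Estimates and standard errors copied from the summary output above
b0 <- 321.50324; se_b0 <- 40.76403
b1 <- 0.35640;   se_b1 <- 0.07972

# Fitted line: predicted math score for a given verbal score
predict_math <- function(verbal) b0 + b1 * verbal
pred_500 <- predict_math(500)                      # 500 is an arbitrary example input
per_10   <- predict_math(510) - predict_math(500)  # change per 10 verbal points

# t value = estimate / standard error
t_b0 <- b0 / se_b0
t_b1 <- b1 / se_b1

# With a single predictor, F = R^2 / (1 - R^2) * residual degrees of freedom
r2     <- 0.3772
f_stat <- r2 / (1 - r2) * 33

print(round(c(pred_500 = pred_500, per_10 = per_10,
              t_b0 = t_b0, t_b1 = t_b1, F = f_stat), 3))
```

With these numbers, a verbal score of 500 maps to a predicted math score of about 499.7, and each additional 10 verbal points corresponds to roughly 3.6 more math points.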

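Regressing the Data

The p-value for the slope (8.69e-05) is far below 0.05, so verbal scores are a statistically significant predictor of math scores. The R-squared of 0.3772 indicates that verbal scores account for about 38% of the variance in math scores.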
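The reported p-value and an approximate 95% confidence interval for the slope can be recovered from the printed estimate, standard error, and residual degrees of freedom. This is a sketch based on the rounded values in the summary output, so the results will match the printed ones only approximately.

```r
b1       <- 0.35640   # slope estimate from the summary output
se_b1    <- 0.07972   # its standard error
df_resid <- 33        # residual degrees of freedom

t_val <- b1 / se_b1

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_val), df = df_resid)

# 95% confidence interval: estimate +/- critical t * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- c(lower = b1 - t_crit * se_b1, upper = b1 + t_crit * se_b1)

print(signif(p_val, 3))
print(round(ci, 3))
```

The interval runs from roughly 0.19 to 0.52 and excludes zero, which is consistent with the small p-value.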


3D Verbal Scores and Math Scores by Year

Female vs. Male Scores over Time


Did Females Close the Gap on Male Scores?

The data show that in any given year, males scored on average 15 to 31 points higher than females. The p-value of 0.215 is above the conventional 0.05 threshold, so there is no statistically significant evidence that the gap between the genders narrowed during this period.
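One way to test whether the gap narrowed is to regress the yearly male-minus-female difference on the year and examine the p-value of the slope. The sketch below uses simulated scores, and the column names `Year`, `M_math`, and `F_math` are assumptions about the dataset layout. It is not necessarily the analysis that produced the p-value quoted above.

```r
set.seed(1)  # simulated data for illustration only (NOT the real SAT scores)

Year   <- 1967:2001
F_math <- 480 + rnorm(length(Year), sd = 4)
M_math <- F_math + runif(length(Year), min = 15, max = 31)  # gap with no built-in trend
scores <- data.frame(Year, M_math, F_math)

# Yearly gap: male score minus female score
scores$gap <- scores$M_math - scores$F_math

# A negative, statistically significant slope on Year would indicate a closing gap
gap_fit <- lm(gap ~ Year, data = scores)
gap_p   <- summary(gap_fit)$coefficients["Year", "Pr(>|t|)"]

print(coef(gap_fit))
print(gap_p)
```

A large p-value on the `Year` slope means the data cannot distinguish the trend in the gap from zero, which is the sense in which no significant closure is found.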