class: center, middle, inverse, title-slide .title[ # Applied Regression Analysis ] .subtitle[ ##
Simple & Multiple Linear Regression ] .author[ ### Jorge Sinval & Kah Loong Chue ] .date[ ### 2025-08-28 ] --- class: inverse, center, middle <style> .orange { color: #EB811B; } .bg-dark-gray { background-color: #1F2D30; } .bg-light-gray { background-color: #F4F5F6; } .kbd { display: inline-block; padding: .2em .5em; font-size: 0.75em; line-height: 1.75; color: #555; vertical-align: middle; background-color: #fcfcfc; border: solid 1px #ccc; border-bottom-color: #bbb; border-radius: 3px; box-shadow: inset 0 -1px 0 #bbb } </style>
# 4. Simple Linear Regression
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=800px></html>
---
# Objectives
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=800px></html>
- Understand the logic of Simple and Multiple Linear Regression.
- Fit a regression model using the Method of Ordinary Least Squares (OLS).
- Interpret model coefficients, R-squared, and significance tests (ANOVA and t-tests).
- Check the main assumptions of linear regression using residual plots.
---
# What is Linear Regression?
Regression is "Linear" when the relationship between the dependent variable (DV) and the independent variable(s) (IVs) is described as a linear combination.
.pull-left[ ### Simple Linear Regression Uses **one** IV to predict the DV.  ]
.pull-right[ ### Multiple Linear Regression Uses **two or more** IVs to predict the DV.  ]
---
# The Simple Linear Regression (SLR) Model
The model describes the linear relationship between a DV `\(\left(Y\right)\)` and an IV `\(\left(X\right)\)`.
.pull-left[
`$$Y_i = \beta_0 + \beta_1X_i + \epsilon_i$$`
**Intercept (`\(\beta_0\)`)**
- Also known as the constant.
- The predicted value of `\(Y\)` when `\(X = 0\)`.
**Slope (`\(\beta_1\)`)**
- The regression coefficient.
- The change in `\(Y\)` for a 1-unit change in `\(X\)`.
**Error (`\(\epsilon_i\)`)**
- Also known as the residual.
- Represents natural variation in `\(Y\)` and measurement error.
]
.pull-right[ .center[] ]
---
# Fitting the Model: Ordinary Least Squares (OLS)
.pull-left[
How do we find the best estimates for `\(\beta_0\)` and `\(\beta_1\)`? We find the line that passes as close as possible to all data points, minimizing the distance between the observed values `\(\left(Y_i\right)\)` and the values predicted by the model `\(\left(\hat{Y}_i\right)\)`.
**The OLS method finds the line that minimizes the sum of the squared errors `\(\left(SSE\right)\)`.**
`$$SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$$`
Where the error (`\(e_i\)`) is the difference between the observed value (`\(Y_i\)`) and the value predicted by the model (`\(\hat{Y}_i\)`): `\(\hat{\varepsilon}_i = e_i = Y_i - \hat{Y}_i\)`
The model estimated in the sample can be used to predict values in the population:
`$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1X_{i}$$`
]
.pull-right[ .center[] Illustration of Residuals in Linear Regression ]
---
# Simple Linear Regression Coefficients
The solution is found by minimizing the Sum of Squared Errors (`\(SSE\)`), which requires finding the partial derivatives with respect to the regression coefficients and setting them to zero. This leads to the following system of normal equations:
`$$\begin{cases} \frac{\partial SSE}{\partial \beta_0} = 0 \\ \frac{\partial SSE}{\partial \beta_1} = 0 \end{cases}$$`
The solution to this system gives the ordinary least squares (OLS) estimates for the intercept `\(\left(\hat\beta_0\right)\)` and slope `\(\left(\hat\beta_1\right)\)`:
`$$\begin{cases} \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \\ \\ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \end{cases}$$`
Where:
- `\(\bar{X}\)` and `\(\bar{Y}\)` are the sample means of `\(X\)` and `\(Y\)`.
- `\(n\)` is the sample size.
---
# Simple Linear Regression Coefficients
This can be expressed using more conventional statistical notation: `\(\hat{\beta}_1 = \frac{SS_{XY}}{SS_{XX}}\)`
Where:
* `\(SS_{XY} = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})\)` is the sum of cross-products.
* `\(SS_{XX} = \sum_{i=1}^{n} (X_i - \bar{X})^2\)` is the sum of squares of `\(X\)`.
Alternative formulas for `\(\hat{\beta}_1\)` that are also commonly used:
* Using covariance and variance: `\(\hat{\beta}_1 = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}\)` This is derived by dividing both the numerator and denominator of the main formula by `\(n-1\)`:
`$$\hat{\beta}_1 = \frac{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}$$`
* Using computational formulas:
`$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}$$`
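---
# Checking the OLS Formulas in R
As a quick sanity check, here is a minimal R sketch (the vectors `x` and `y` are made up for illustration, not course data) showing that the closed-form estimates and the covariance/variance form agree with `lm()`:

```r
# Illustrative data (made up for this sketch)
x <- c(1, 2, 4, 5, 7, 8)
y <- c(2.1, 3.9, 7.8, 10.1, 13.8, 16.2)

# Closed-form OLS estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Equivalent covariance/variance form
b1_alt <- cov(x, y) / var(x)

c(b0 = b0, b1 = b1, b1_alt = b1_alt)
coef(lm(y ~ x))  # should match b0 and b1 above
```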
---
# Example 1: Fitting a Simple Linear Regression Model
.pull-left-1[ Fit the simple linear regression model between the number of aggressive behaviors and the number of hours (h) of TV watched per week (file `tv_aggression.csv`). The data matrix is as follows: ]
.pull-right-2[
<table>
<caption>Data: TV Hours and Aggressive Behaviors</caption>
<thead>
<tr> <th style="text-align:right;"> Child ID </th> <th style="text-align:right;"> TV Hours per Week </th> <th style="text-align:right;"> Aggressive Behaviors </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> 8 </td> </tr>
<tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 23 </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> 7 </td> </tr>
<tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 6 </td> </tr>
<tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 6 </td> </tr>
<tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 31 </td> <td style="text-align:right;"> 10 </td> </tr>
<tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 6 </td> </tr>
<tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 26 </td> <td style="text-align:right;"> 9 </td> </tr>
</tbody>
</table>
]
---
# Example 1: Results
In _jamovi_, we can fit the model with the following steps:
.pull-left[
1. Open `jamovi`.
2. Click on the .kbd[hamburger] menu in the top left corner.
3. Select .kbd[Open] and then the folder icon 📁 to browse for files.
4. Navigate to the directory where the data file is located and select the file (i.e., `tv_aggression.csv`).
5. Click on the .kbd[Analyses] tab and select .kbd[Regression].
6. Choose .kbd[Linear Regression].
7. Insert the variable `aggression` in the .red[Dependent Variable] box.
8. Insert the variable `tv_hours` in the .red[Covariates] box.
]
.pull-right[ .center[] ]
---
# Example 1: Results
.bg-light-gray.bg-dark-gray.ba.bw2.br3.shadow-5.ph4.mt3[ Thus, the estimated regression model equation is:
`$$\widehat{Agg} = 1.927 + 0.278\times TVhrs$$`
`\(Agg\)` — Aggressive Behaviors; `\(TVhrs\)` — TV Hours per Week. ]
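---
# Example 1: Equivalent R Code
The same model can be fit outside _jamovi_; a minimal R sketch, assuming the columns of `tv_aggression.csv` are named `tv_hours` and `aggression` as in the steps above:

```r
# Read the data (column names assumed: tv_hours, aggression)
tv <- read.csv("tv_aggression.csv")

# Fit the simple linear regression of aggression on TV hours
fit <- lm(aggression ~ tv_hours, data = tv)
coef(fit)  # should reproduce the estimated equation above (about 1.93 and 0.28)
```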
---
# Significance of the Linear Model: ANOVA
_Does our model explain a significant amount of the variation in the DV? Does the model have at least one predictor that is significantly related to the DV?_
**A. Hypotheses:**
- `\(H_0: \beta_1 = 0\)` (The model does not explain any variance in `\(Y\)`.)
- `\(H_1: \beta_1 \neq 0\)` (The model explains some variance in `\(Y\)`.)
**B. Test Statistic:**
.pull-left-1[ The **F-test** compares the variance explained by the model to the residual (error) variance. ]
.pull-right-2[ .center[] ]
---
# Significance of the Linear Model: ANOVA
**B. Test Statistic (continued):**
| Source | Sum of Squares `\(\left(SS\right)\)` | `\(df\)` | Mean Square `\(\left(MS\right)\)` | `\(\mathcal{F}\)` |
|:---|:---|:---:|:---|:---:|
| Regression | `\(SSR=\sum(\hat{Y}_i - \bar{Y})^2\)` | `\(p\)` | `\(MSR = SSR/p\)` | `\(MSR/MSE\)` |
| Residual/Error | `\(SSE=\sum(Y_i - \hat{Y}_i)^2\)` | `\(N-p-1\)` | `\(MSE = SSE/(N-p-1)\)` | |
| Total | `\(SST=\sum(Y_i - \bar{Y})^2\)` | `\(N-1\)` | | |
**C. Decision:** If `\(p \leq \alpha\)`, reject `\(H_0\)`.
**D. Conclusion:**
.bg-light-gray.bg-dark-gray.ba.bw2.br3.shadow-5.ph4.mt3[ The model **[is/is not]** statistically significant `\(\left(\mathcal{f}_{\left(p; n-p-1\right)} = ...; p = ...\right)\)`. ]
.footnote[`\\(p = \\)` number of predictors in the model (in simple linear regression, `\\(p = 1\\)`).]
---
# Example 2: ANOVA Table
Assess the statistical significance of the model fitted in Example 1. Does the number of hours of TV watched per week significantly influence the number of aggressive behaviors? Justify the applicability of the model and the inference made `\(\left(\alpha = .05\right)\)`.
---
# Example 2: Results
.pull-left[ The ANOVA table from `jamovi` for this example can be obtained with the following steps:
1. Open `jamovi`.
2. Click on the .kbd[hamburger] menu in the top left corner.
3. Select .kbd[Open] and then the folder icon 📁 to browse for files.
4. Navigate to the directory where the data file is located and select the file (i.e., `tv_aggression.csv`).
5. Click on the .kbd[Analyses] tab and select .kbd[Regression].
6. Choose .kbd[Linear Regression].
7. Insert the variable `aggression` in the .red[Dependent Variable] box.
8. Insert the variable `tv_hours` in the .red[Covariates] box.
9. Go to the .kbd[Model Fit] section and check the box for .red[F test].]
.pull-right[ .center[] ]
---
# Example 2: Results
.bg-light-gray.bg-dark-gray.ba.bw2.br3.shadow-5.ph4.mt3[ The tested model consists of one predictor (`\(p = 1\)`), so it is a simple linear regression. Thus, if the number of hours of TV significantly influences the number of aggressive behaviors, then `\(\beta_1 \neq 0\)` (in other words, the model is significant).
**Testing the significance of the Linear Model**
**A. Hypotheses:** `\(H_0: \beta_1 = 0\)` vs. `\(H_1: \beta_1 \neq 0\)`
**B. Test Statistic:** From the ANOVA (`\(\mathcal{F}\)` statistic): `\(\mathcal{f}_{\left(1; 8\right)}= 71.327; p < .001\)`
**C. Decision:** `\(p < .001 < 0.05 = \alpha\)`, reject `\(H_0\)`.
**D. Conclusion:** The model is statistically significant `\(\left(\mathcal{f}_{\left(1; 8\right)} = 71.327; p < .001\right)\)`. Thus, the number of hours of TV per week is a statistically significant predictor of the number of aggressive behaviors. ]
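---
# Example 2: The Same F-Test in R
In R, the same test can be read off the ANOVA decomposition of the fitted model; a minimal sketch, continuing the `fit` object from the earlier R example:

```r
anova(fit)
# With a single predictor, the F-test for tv_hours is the overall model test;
# it should report F(1, 8) = 71.3 with p < .001, matching the jamovi output
```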
---
# Coefficient of Determination: `\(R^2\)` and `\(R^{2}_{a}\)`
** `\(R^2\)`:** Measures the proportion of variance in the DV explained by the model.
`$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$`
The coefficient of determination is a measure of the practical significance of the model. It indicates how well the model explains the variability in the dependent variable.
**Adjusted `\(R^2\)` `\(\left(R^{2}_{a}\right)\)`:**
**.red[In the case of multiple linear regression]**, the `\(R^2\)` adjusted for the degrees of freedom should be calculated `\(\left(R^{2}_{a}\right)\)`:
`$$R^{2}_{a} = 1 - (1 - R^2) \frac{N - 1}{N - p - 1}$$`
Where:
- `\(N\)` is the sample size.
- `\(p\)` is the number of predictors in the model.
---
# Significance of the Coefficient of Determination
_Is the coefficient of determination (`\(R^2\)`) significantly different from zero? Is the amount of variance explained by the model significantly greater than zero?_
**To test whether `\(R^2\)` or `\(R^{2}_{a}\)` is significantly different from zero, we can use the ANOVA `\(\mathcal{F}\)`-test. <sup>💡</sup> This is the exact same ANOVA test that we used to assess the overall significance of the regression model.**
**A. Hypotheses:**
- `\(H_0: \rho^2 = 0\)` (The model does not explain any variance in the DV)
- `\(H_1: \rho^2 \neq 0\)` (The model explains a significant portion of the variance)
**B. Test Statistic:** The ANOVA `\(\mathcal{F}\)`-test compares the variance explained by the model to the residual (error) variance.
| Source | Sum of Squares `\(\left(SS\right)\)` | `\(df\)` | Mean Square `\(\left(MS\right)\)` | `\(\mathcal{F}\)` |
|:---|:---|:---:|:---|:---:|
| Regression | `\(SSR=\sum(\hat{Y}_i - \bar{Y})^2\)` | `\(p\)` | `\(MSR = SSR/p\)` | `\(MSR/MSE\)` |
| Residual/Error | `\(SSE=\sum(Y_i - \hat{Y}_i)^2\)` | `\(N-p-1\)` | `\(MSE = SSE/(N-p-1)\)` | |
| Total | `\(SST=\sum(Y_i - \bar{Y})^2\)` | `\(N-1\)` | | |
---
# Significance of the Coefficient of Determination
**C. Decision:** If `\(p \leq \alpha\)`, reject `\(H_0\)`.
**D. Conclusion:**
.bg-light-gray.bg-dark-gray.ba.bw2.br3.shadow-5.ph4.mt3[ The model **[does/does not]** explain a statistically significant amount of variance of the dependent variable `\(\left(\mathcal{f}_{\left(p; n-p-1\right)} = ...; p = ...\right)\)`. ]
---
# Practical Significance of the `\(R^2\)` and `\(R^{2}_{a}\)`
**Interpretation of `\(R^2\)` and `\(R^{2}_{a}\)`:** The practical significance of the model is considered **[low/moderate/high]** based on the value of `\(R^2 / R^2_a\)`. Cohen (1988) provides the following guidelines for interpreting `\(R^2\)` in the context of social science research:
| `\(R^2 / R^2_a\)` Value | Practical Significance |
|:---:|:---|
| .02 | small effect size |
| .13 | medium effect size |
| .26 | large effect size |
**<sup>⚠️</sup> These values are only rough guidelines and should be interpreted in the context of the specific research area and study design.**
.bg-light-gray.bg-dark-gray.ba.bw2.br3.shadow-5.ph4.mt3[ The model explains approximately `\(\left[ R^2~ or~ R^2_a \times 100\right]\%\)` of the variance in the dependent variable, which can be considered of **[small/medium/large]** practical significance. ]
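---
# `\(R^2\)` and `\(R^{2}_{a}\)` in R
Both quantities are reported by `summary()`; a minimal sketch (reusing `fit` from the earlier R example) that also reproduces the adjustment formula by hand:

```r
s <- summary(fit)
s$r.squared      # R^2 = SSR/SST
s$adj.r.squared  # adjusted R^2

# The adjustment formula, computed directly from its definition
n <- nobs(fit)
p <- length(coef(fit)) - 1  # number of predictors
1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
```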
---
# Example 3: Coefficient of Determination
Determine the fraction of the total variability in the number of aggressive behaviors explained by the model fitted in `Example 1`. Is this value statistically significant? Justify your answer (`\(\alpha = .05\)`).
---
# Example 3: Results
The results for this example can be obtained in `jamovi` with the following steps:
.pull-left[
1. Open `jamovi`.
2. Click on the .kbd[hamburger] menu in the top left corner.
3. Select .kbd[Open] and then the folder icon 📁 to browse for files.
4. Navigate to the directory where the data file is located and select the file (i.e., `tv_aggression.csv`).
5. Click on the .kbd[Analyses] tab and select .kbd[Regression].
6. Choose .kbd[Linear Regression].
7. Insert the variable `aggression` in the .red[Dependent Variable] box.
8. Insert the variable `tv_hours` in the .red[Covariates] box.
9. Go to the .kbd[Model Fit] section and check the box for .red[F test].
]
.pull-right[ .center[] ]
---
# Example 3: Results
.bg-light-gray.bg-dark-gray.ba.bw2.br3.shadow-5.ph4.mt3[ The fraction of the total variability in the number of aggressive behaviors explained by the fitted model refers to the `\(R^2\)` value. This value can be found in the .red[Model Fit Measures] table.
**Coefficient of Determination:** `\(r^2 = 0.899\)`
Is this value statistically significant (i.e., is `\(\rho^2 \neq 0\)`)?
**Testing the significance of the Coefficient of Determination**
**A. Hypotheses:** `\(H_0: \rho^2 = 0\)` vs. `\(H_1: \rho^2 \neq 0\)`
**B. Test Statistic:** From the ANOVA (`\(\mathcal{F}\)` statistic): `\(\mathcal{f}_{\left(1; 8\right)}= 71.327; p < .001\)`
**C. Decision:** `\(p < .001 < 0.05 = \alpha\)`, reject `\(H_0\)`.
**D. Conclusion:** The coefficient of determination is statistically significant `\(\left(\mathcal{f}_{\left(1; 8\right)} = 71.327; p < .001; r^2=0.899\right)\)`. Thus, the model explains `\(89.9\%\)` of the variance of the dependent variable, and this value is significantly different from `\(0\)`. ]
---
# Significance of Regression Coefficients: `\(\beta_0\)`
### Intercept `\(\left(\beta_0\right)\)`
Is the intercept (`\(\beta_0\)`) significantly different from a specific value `\(k\)` (**by default `\(k = 0\)`**)?
.pull-left[ **A. Hypotheses:** `\(H_0: \beta_0 = k\)` vs. `\(H_1: \beta_0 \neq, <, > k \quad \left(k \in \mathbb{R}\right)\)`
**B. Test Statistic:** The test statistic, assuming the null hypothesis is true, is:
`$$T_{\beta_0} = \frac{b_0 - k}{\sqrt{MSE \times \left(\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)}} \sim \mathcal{t}_{(n-p-1)}$$` ]
.pull-right[ **C. Decision:**
**Two-Tailed Test:** Reject `\(H_0\)` if `\(|\mathcal{T}_{\beta_0}| \ge t_{1-\frac{\alpha}{2};(n-p-1)}\)`
**Right-Tailed Test:** Reject `\(H_0\)` if `\(\mathcal{T}_{\beta_0} \ge t_{1-\alpha;(n-p-1)}\)`
**Left-Tailed Test:** Reject `\(H_0\)` if `\(\mathcal{T}_{\beta_0} \le -t_{1-\alpha;(n-p-1)}\)`
.red[or] Reject `\(H_0\)` if `\(p \leq \alpha\)` ]
**D. Conclusion:**
.bg-light-gray.bg-dark-gray.ba.bw2.br3.shadow-5.ph4.mt3[ The intercept **[is/is not]** statistically significant `\(\left(\mathcal{t}_{(n-p-1)} = ...; p = ...; \beta_0 = ...\right)\)`. ]
---
# Significance of Regression Coefficients: `\(\beta_1\)`
### Slope `\(\left(\beta_1\right)\)`
Is the slope (`\(\beta_1\)`) significantly different from a specific value `\(k\)` (**by default `\(k = 0\)`**)?
.pull-left[ **A. Hypotheses:** `\(H_0: \beta_1 = k\)` vs. `\(H_1: \beta_1 \neq, <, > k \quad \left(k \in \mathbb{R}\right)\)`
**B. Test Statistic:** The test statistic, assuming the null hypothesis is true, is:
`$$T_{\beta_1} = \frac{b_1 - k}{\sqrt{\frac{MSE}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}} \sim \mathcal{t}_{(n-p-1)}$$` ]
.pull-right[ **C. Decision:**
**Two-Tailed Test:** Reject `\(H_0\)` if `\(|\mathcal{T}_{\beta_1}| \ge t_{1-\frac{\alpha}{2};(n-p-1)}\)`
**Right-Tailed Test:** Reject `\(H_0\)` if `\(\mathcal{T}_{\beta_1} \ge t_{1-\alpha;(n-p-1)}\)`
**Left-Tailed Test:** Reject `\(H_0\)` if `\(\mathcal{T}_{\beta_1} \le -t_{1-\alpha;(n-p-1)}\)`
.red[or] Reject `\(H_0\)` if `\(p \leq \alpha\)` ]
**D. Conclusion:**
.bg-light-gray.bg-dark-gray.ba.bw2.br3.shadow-5.ph4.mt3[ The slope **[is/is not]** statistically significant `\(\left(\mathcal{t}_{(n-p-1)} = ...; p = ...; \beta_1 = ...\right)\)`. ]
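---
# Coefficient t-Tests by Hand in R
To connect the formula for `\(T_{\beta_1}\)` to software output, a minimal sketch (reusing `fit` and the `tv` data from the earlier R example, with `\(k = 0\)`):

```r
x   <- tv$tv_hours
mse <- sum(residuals(fit)^2) / df.residual(fit)  # MSE = SSE/(n-p-1)

se_b1 <- sqrt(mse / sum((x - mean(x))^2))        # standard error of the slope
t_b1  <- unname(coef(fit)["tv_hours"] - 0) / se_b1
p_b1  <- 2 * pt(abs(t_b1), df.residual(fit), lower.tail = FALSE)
c(t = t_b1, p = p_b1)  # should match the Model Coefficients table
```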
---
# Example 4: Significance of Regression Coefficients
**Let `\(\alpha = 0.05\)`:**
**4.1/** Even if a child does not watch TV, should we expect them to exhibit `\(0\)` aggressive behaviors? Justify your answer.
**4.2/** Does watching TV affect aggressive behaviors? Compare the conclusion you reach with the conclusions from Examples 2 and 3. What can you state regarding the statistical methods used in these questions?
---
# Example 4: Results
The results obtained in `jamovi` are as follows:
.pull-left[
1. Open `jamovi`.
2. Click on the .kbd[hamburger] menu in the top left corner.
3. Select .kbd[Open] and then the folder icon 📁 to browse for files.
4. Navigate to the directory where the data file is located and select the file (i.e., `tv_aggression.csv`).
5. Click on the .kbd[Analyses] tab and select .kbd[Regression].
6. Choose .kbd[Linear Regression].
7. Insert the variable `aggression` in the .red[Dependent Variable] box.
8. Insert the variable `tv_hours` in the .red[Covariates] box.
]
.pull-right[ .center[] ]
---
# Example 4: Results
.bg-light-gray.bg-dark-gray.ba.bw2.br3.shadow-5.ph4.mt3[ **4.1/** The expected value of aggressive behaviors when TV hours is `\(0\)` refers to the intercept `\(\left(\beta_0\right)\)`. This value can be found in the .red[Model Coefficients] table.
**Regression Coefficient `\(\left(\beta_0\right)\)`** `\(b_0 = 1.926\)`
Is this value statistically significant (i.e., `\(\beta_0 \neq k; k = 0\)`)?
**Testing the significance of the Regression Coefficient `\(\left(\beta_0\right)\)`**
**A. Hypotheses:** `\(H_0: \beta_0 = 0\)` vs. `\(H_1: \beta_0 \neq 0\)`
**B. Test Statistic:** From the `\(\mathcal{T}\)` statistic: `\(\mathcal{t}_{b_{0}\left(8\right)}= 3.114; p = .014\)`
**C. Decision:** `\(p = .014 < 0.05 = \alpha\)`, reject `\(H_0\)`.
**D. Conclusion:** The number of aggressive behaviors when TV hours is `\(0\)` is expected to be approximately `\(1.926\)`. The regression coefficient is statistically significant `\(\left(\mathcal{t_{b_{0}}}_{\left(8\right)} = 3.114; p = .014; b_0 = 1.926\right)\)`. ]
---
# Example 4: Results
.bg-light-gray.bg-dark-gray.ba.bw2.br3.shadow-5.ph4.mt3[ **4.2/** The effect of watching TV on aggressive behaviors can be assessed by examining the slope coefficient `\(\left(\beta_1\right)\)`. This value can be found in the .red[Model Coefficients] table.
**Regression Coefficient `\(\left(\beta_1\right)\)`** `\(b_1 = 0.278\)`
Is this value statistically significant (i.e., `\(\beta_1 \neq k; k = 0\)`)?
**Testing the significance of the Regression Coefficient `\(\left(\beta_1\right)\)`**
**A. Hypotheses:** `\(H_0: \beta_1 = 0\)` vs. `\(H_1: \beta_1 \neq 0\)`
**B. Test Statistic:** From the `\(\mathcal{T}\)` statistic: `\(\mathcal{t}_{b_{1}\left(8\right)}= 8.446; p < .001\)`
**C. Decision:** `\(p < .001 < 0.05 = \alpha\)`, reject `\(H_0\)`.
**D. Conclusion:** Each additional hour of TV watched per week is expected to increase the number of aggressive behaviors by approximately `\(0.278\)`. The regression coefficient is statistically significant `\(\left(\mathcal{t_{b_{1}}}_{\left(8\right)} = 8.446; p < .001; b_1 = 0.278\right)\)`. ]
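---
# Example 4: The Coefficients Table in R
The full table of estimates, standard errors, `\(t\)` values, and `\(p\)` values that jamovi displays in .red[Model Coefficients] is available from `summary()`; a minimal sketch (reusing `fit`):

```r
summary(fit)$coefficients
# Expected (approximately): intercept 1.926 (t = 3.11, p = .014)
# and tv_hours 0.278 (t = 8.45, p < .001)
```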
---
# Outliers
**Outliers** are data points that deviate from the overall pattern of the data. They can have a substantial impact on the regression line and the estimates of the coefficients.
**Diagnostic:** Plot the residuals (`\(e_i\)`) against the fitted values (`\(\hat{Y}_i\)`):
<img src="slides4of6_files/figure-html/residual-plots-1.png" width="80%" style="display: block; margin: auto;" />
---
# Outliers
.pull-left-1[ **Problems with outliers:**
1. They affect the estimates of the regression coefficients.
2. They alter the significance of the model.
**Solutions:**
1. Identify the cause of the *outlier* (is it natural or an artifact?).
2. Remove *outliers* (**.red[with caution!]**).
3. Use robust regression methods that are less sensitive to outliers. ]
--
.pull-right-2[ 🚨 Always report if you removed any data points and justify your decision.
.center[  Jenga Tension ] ]
---
# Assumptions of Linear Regression
For the model's significance tests to be valid, several assumptions must be met regarding the **error terms (residuals)**.
1. **Linearity:** The relationship between `\(X\)` and `\(Y\)` is linear.
2. **Independence:** The errors are independent of each other.
3. **Normality:** The errors are normally distributed.
4. **Homoscedasticity:** The errors have constant variance.
These can be summarized as: `\(\epsilon \sim \mathcal{IIN}(0, \sigma)\)`
---
# Checking Assumptions with Residual Plots
We can check assumptions by plotting the residuals against the predicted values (`\(\hat{Y}\)`).
### Ideal Situation
The residuals are randomly scattered around 0, with no clear pattern.
.center[]
---
# Checking Assumptions with Residual Plots
### Common Violations
**Linearity**: The left plot shows a distinct parabola in the residuals, indicating the linear model was a poor fit for the underlying curved data.
**Independence**: The right plot shows a sine wave pattern, indicating the residuals are not independent of each other.
.center[]
---
# Checking Assumptions with Residual Plots
**Homoscedasticity**: The left plot shows the classic "cone" shape, indicating the variance of the residuals is not constant.
**Normality**: The right plot shows the points on a Q-Q plot systematically deviating from the reference line, indicating the residuals are not normally distributed.
.center[]
---
# Example 5: Residual Plots
Using the same model from Examples 1–4, generate the residual plots to check the assumptions of linear regression for the model. Do the assumptions appear to be met? Are there any obvious outliers?
---
# Example 5: Results
In _jamovi_, after repeating the process from Examples 1 to 4, you can generate the residual plots by:
1. In the .red[Linear Regression] options section, go to the .red[Assumption Checks] chunk.
2. Check the box for `Residual plots`.
.center[]
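---
# Residual Plots in R
Outside jamovi, the same diagnostics come from base R graphics; a minimal sketch (reusing `fit` from the earlier R example):

```r
par(mfrow = c(1, 2))
# Residuals vs fitted values: look for patterns, cones, and outliers
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# Normal Q-Q plot of the residuals: points should follow the line
qqnorm(residuals(fit)); qqline(residuals(fit))
```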
---
# Example 6
.pull-left-1[ Motivation and self-concept in the school context are frequent objects of study in Educational Psychology. An educational psychologist from a secondary school randomly sampled `\(15\)` students from `\(3\)` classes that were, in turn, randomly selected from the `\(10\)` classes in the school. The psychologist applied a motivation and numerical self-concept (`AC_num`) scale, which is assessed by 21 Thurstone-type items. The data obtained are transcribed in the table below. ]
.font80[.pull-right-2[
<table>
<thead>
<tr> <th style="text-align:right;"> Student </th> <th style="text-align:right;"> Age </th> <th style="text-align:right;"> Numerical Self-Concept </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 3.2 </td> </tr>
<tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> 1.0 </td> </tr>
<tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 3.6 </td> </tr>
<tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> 4.0 </td> </tr>
<tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 2.0 </td> </tr>
<tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 3.4 </td> </tr>
<tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 1.0 </td> </tr>
<tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 5.7 </td> </tr>
<tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> 4.8 </td> </tr>
<tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 3.5 </td> </tr>
<tr> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 3.5 </td> </tr>
<tr> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 7.1 </td> </tr>
<tr> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 7.5 </td> </tr>
<tr> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> 6.9 </td> </tr>
<tr> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 1.5 </td> </tr>
</tbody>
</table>
]]
---
# Example 6
**6.1.** What is the most appropriate statistical methodology to test the psychologist's hypothesis? State the assumptions of the methodology you choose. Which statistical test(s) will you use?
**6.2.** Fit the model that allows you to estimate numerical self-concept from age.
**6.3.** What is the variation in self-concept per year of age?
**6.4.** Does age significantly influence numerical self-concept? Justify your answer.
**6.5.** What percentage of the total variability in numerical self-concept is explained by the model fitted in 6.2? Is this value statistically significant? Justify your answer.
**6.6.** Estimate the numerical self-concept of a 20-year-old student. What can you conclude about this value?
**6.7.** Graphically analyze the assumptions for the application of the fitted model.
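---
# Example 6: A Starting Point in R
For checking your answers, a minimal R sketch that enters the table above and uses `predict()` for task 6.6 (object names are illustrative):

```r
age    <- c(16, 17, 15, 17, 16, 15, 18, 16, 17, 15, 15, 16, 16, 17, 19)
ac_num <- c(3.2, 1.0, 3.6, 4.0, 2.0, 3.4, 1.0, 5.7, 4.8, 3.5,
            3.5, 7.1, 7.5, 6.9, 1.5)

m6 <- lm(ac_num ~ age)
summary(m6)  # slope (6.3), t-test (6.4), R^2 and F-test (6.5)

# 6.6: prediction for a 20-year-old (note: this extrapolates beyond the data)
predict(m6, newdata = data.frame(age = 20))
```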
---
# Example 7
.pull-left-1[ Factors such as assertiveness, social expectations, and institutional expectations can influence leadership levels. In a study on this topic, the data in the following table were collected. ]
.font80[.pull-right-2[
<table>
<thead>
<tr> <th style="text-align:right;"> Participant </th> <th style="text-align:right;"> Assertiveness </th> <th style="text-align:right;"> Leadership </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 8 </td> </tr>
<tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 17 </td> </tr>
<tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8 </td> </tr>
<tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 10 </td> </tr>
<tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 13 </td> </tr>
<tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 14 </td> </tr>
<tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 7 </td> </tr>
<tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 13 </td> </tr>
<tr> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4 </td> </tr>
<tr> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 17 </td> </tr>
<tr> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 18 </td> </tr>
<tr> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 17 </td> </tr>
</tbody>
</table>
]]
---
# Example 7
**7.1.** Write the model equation that predicts leadership from assertiveness.
**7.2.** Test whether assertiveness is a significant predictor of leadership (`\(\alpha = .05\)`).
**7.3.** Quantify the model's goodness of fit. Can the model be considered significant?
**7.4.** What assumptions would need to be validated to apply the technique that you have used?
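---
# Example 7: A Starting Point in R
As with Example 6, a minimal R sketch for checking Example 7 (data typed from the table; object names are illustrative):

```r
assertiveness <- c(11, 3, 20, 8, 10, 12, 13, 11, 2, 13, 1, 1, 20, 9, 13)
leadership    <- c(8, 9, 17, 8, 10, 9, 13, 14, 7, 13, 4, 9, 17, 18, 17)

m7 <- lm(leadership ~ assertiveness)
summary(m7)  # covers the slope test (7.2) and goodness of fit (7.3)
```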
---
class: inverse, center, middle
# 5. Multiple Linear Regression
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=800px></html>
---
# Multiple Linear Regression
In multiple linear regression, one dependent variable (DV) is related, as a linear combination, to two or more independent variables (IVs) in a model of the type:
`$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_p X_{pi} + \varepsilon_i \quad (i=1,...,n)$$`
`\(\beta_1, \beta_2, \dots, \beta_p\)` are the partial slopes or regression coefficients, and `\(p\)` is the number of IVs in the model.
### Model Fitting
This follows the same principle as the Least Squares Method from Simple Linear Regression (SLR), but extends the system of two normal equations to a system with `\(p + 1\)` equations. Using matrix notation:
`$$\mathbf{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \quad \mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{p1} \\ 1 & x_{12} & \cdots & x_{p2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{bmatrix} \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} \quad \boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$`
where the model is `\(\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\)`
---
# Multiple Linear Regression
The least squares solution is given by:
`$$\hat{\boldsymbol{\beta}}: \frac{\partial}{\partial\boldsymbol{\beta}} [(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})] = 0$$`
`$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$`
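---
# The Matrix Solution in R
A minimal R sketch verifying `\(\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\)` against `lm()` (made-up data; `cbind(1, x1, x2)` builds the design matrix `\(\mathbf{X}\)`):

```r
set.seed(1)
x1 <- rnorm(20); x2 <- rnorm(20)
y  <- 2 + 1.5 * x1 - 0.7 * x2 + rnorm(20)

X <- cbind(1, x1, x2)                    # design matrix with intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
cbind(beta_hat, coef(lm(y ~ x1 + x2)))   # the two columns should agree
```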
---
# Significance of the Linear Model: ANOVA
**Is the fitted model statistically significant? Is there at least one independent variable (IV) that significantly influences the dependent variable (DV)? Is there at least one `\(\beta_i \neq 0\)` (for `\(i=1,...,p\)`)?**
**A. Hypotheses:**
- `\(H_0: \beta_1 = \beta_2 = ... = \beta_p = 0\)` (The model does not explain any variance in `\(Y\)`.)
- `\(H_1: \exists_i: \beta_i \neq 0 \left(i = 1,...,p\right)\)` (The model explains some variance in `\(Y\)`.)
**B. Test Statistic:**
| Source | Sum of Squares `\(\left(SS\right)\)` | `\(df\)` | Mean Square `\(\left(MS\right)\)` | `\(\mathcal{F}\)` |
|:---|:---|:---:|:---|:---:|
| Regression | `\(SSR=\sum(\hat{Y}_i - \bar{Y})^2\)` | `\(p\)` | `\(MSR = SSR/p\)` | `\(MSR/MSE\)` |
| Residual/Error | `\(SSE=\sum(Y_i - \hat{Y}_i)^2\)` | `\(N-p-1\)` | `\(MSE = SSE/(N-p-1)\)` | |
| Total | `\(SST=\sum(Y_i - \bar{Y})^2\)` | `\(N-1\)` | | |
.footnote[`\\(p = \\)` number of predictors in the model (in multiple linear regression, `\\(p > 1\\)`).]
---
# Significance of the Linear Model: ANOVA
**C. Decision:** If `\(p \leq \alpha\)`, reject `\(H_0\)`.
**D. Conclusion:**
.bg-light-gray.bg-dark-gray.ba.bw2.br3.shadow-5.ph4.mt3[ The model **[is/is not]** statistically significant `\(\left(\mathcal{f}_{\left(p; n-p-1\right)} = ...; p = ...\right)\)`. ]
---
# Coefficient of Determination: `\(R^2\)` and `\(R^{2}_{a}\)`
Exactly as in Simple Linear Regression:
`$$R^2 = \frac{SSR}{SST} \quad \text{or} \quad R^2_{a} = 1 - \frac{MSE}{MST} = 1 - \frac{n-1}{n-p-1}(1-R^2)$$`
**Note: In this context, SSR is the Sum of Squares Regression, SST is the Total Sum of Squares, MSE is the Mean Squared Error, and MST is the Total Mean Square.**
<br>
🚨 In MLR (Multiple Linear Regression), one should (always...) use **.red[Adjusted R-squared (`\(R^2_{a}\)`)]** to compensate for the increase in the Sum of Squares Regression (SSR) that is always observed whenever a new independent variable (IV) is added to the model, even if that IV does not have a statistically significant effect on the variation of the dependent variable (DV).
---
# Significance of the Coefficient of Determination
_Is the coefficient of determination (`\(R^2\)`) significantly different from zero? Is the amount of variance explained by the model significantly greater than zero?_
**To test whether `\(R^2\)` or `\(R^{2}_{a}\)` is significantly different from zero, we can use the ANOVA `\(\mathcal{F}\)`-test. <sup>💡</sup> This is the exact same ANOVA test that we used to assess the overall significance of the regression model.**
**A. Hypotheses:**
- `\(H_0: \rho^2 = 0\)` (The model does not explain any variance in the DV)
- `\(H_1: \rho^2 \neq 0\)` (The model explains a significant portion of the variance)
**B. Test Statistic:** The ANOVA `\(\mathcal{F}\)`-test compares the variance explained by the model to the residual (error) variance.
| Source | Sum of Squares `\(\left(SS\right)\)` | `\(df\)` | Mean Square `\(\left(MS\right)\)` | `\(\mathcal{F}\)` |
|:---|:---|:---:|:---|:---:|
| Regression | `\(SSR=\sum(\hat{Y}_i - \bar{Y})^2\)` | `\(p\)` | `\(MSR = SSR/p\)` | `\(MSR/MSE\)` |
| Residual/Error | `\(SSE=\sum(Y_i - \hat{Y}_i)^2\)` | `\(N-p-1\)` | `\(MSE = SSE/(N-p-1)\)` | |
| Total | `\(SST=\sum(Y_i - \bar{Y})^2\)` | `\(N-1\)` | | |
---
# Significance of the Coefficient of Determination
**C. Decision:** If `\(p \leq \alpha\)`, reject `\(H_0\)`.
**D. Conclusion:**
.bg-light-gray.bg-dark-gray.ba.bw2.br3.shadow-5.ph4.mt3[ The model **[does/does not]** explain a statistically significant amount of variance of the dependent variable `\(\left(\mathcal{f}_{\left(p; n-p-1\right)} = ...; p = ...\right)\)`. ]
---
# Significance of Regression Coefficients: `\(\beta_i\)`
**A. Hypotheses** (for each coefficient `\(\beta_i\)`): `\(H_0: \beta_i = k\)` vs. `\(H_1: \beta_i \neq, <, > k \quad (i = 0, \dots, p)\)`
Note: by default in software, `\(k = 0\)`.
**B. Test Statistic:**
`$$T_{\beta_i} = \frac{b_i - k}{\sqrt{MSE \times C_{ii}}} \sim \mathcal{t}_{(n-p-1)}$$`
where `\(C_{ii}\)` is the `\(ii^{th}\)` element of the matrix `\((\mathbf{X}'\mathbf{X})^{-1}\)` corresponding to `\(\beta_i\)`.
---
# Significance of Regression Coefficients: `\(\beta_i\)`
**C. Decision:**
* **Two-Tailed Test:** Reject `\(H_0\)` if `\(|\mathcal{T}_{\beta_i}| \ge t_{1-\frac{\alpha}{2}; \left(n-p-1\right)}\)`
* **Right-Tailed Test:** Reject `\(H_0\)` if `\(\mathcal{T}_{\beta_i} \ge t_{1-\alpha; \left(n-p-1\right)}\)`
* **Left-Tailed Test:** Reject `\(H_0\)` if `\(\mathcal{T}_{\beta_i} \le -t_{1-\alpha; \left(n-p-1\right)}\)`
.red[or] Reject `\(H_0\)` if `\(p \leq \alpha\)`
**D. Conclusion:**
.bg-light-gray.bg-dark-gray.ba.bw2.br3.shadow-5.ph4.mt3[ The coefficient `\(\beta_i\)` **[is/is not]** statistically significant `\(\left(\mathcal{t}_{(n-p-1)} = ...; p = ...; \beta_i = ...\right)\)`. ]
**NOTE:** Strictly speaking, each test on a coefficient tests whether the corresponding independent variable (IV) affects the dependent variable (DV), assuming that the other IVs are held constant. Thus, to properly account for the simultaneous testing of multiple coefficients, it is necessary to use a significance level of `\(\frac{\alpha}{p}\)` rather than `\(\alpha\)` (e.g., a Bonferroni correction), as sketched on the next slide. In practice, when `\(p\)` is large, this correction is extremely conservative and is therefore rarely used.
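---
# Coefficient Tests with a Bonferroni Adjustment in R
A minimal, hedged sketch (made-up data, as before) showing the per-coefficient t-tests and a Bonferroni adjustment of the slope p-values:

```r
set.seed(2)
x1 <- rnorm(30); x2 <- rnorm(30)
y  <- 1 + 0.5 * x1 + 0.2 * x2 + rnorm(30)
m  <- lm(y ~ x1 + x2)

summary(m)$coefficients                           # estimate, SE, t, p per beta_i
pvals <- summary(m)$coefficients[-1, "Pr(>|t|)"]  # slope p-values only
p.adjust(pvals, method = "bonferroni")            # the alpha/p correction
```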
---
# Example 1
**Dataset:** `tv_aggression.csv` (use all columns). This exercise extends Exercise 1 by adding a second predictor, `family_income`.
### Tasks
Using a significance level of `\(\alpha = 0.05\)`:
1. **Fit the Model:** Fit a multiple linear regression model to predict aggressive behaviors from both TV hours and family income. Write down the estimated regression equation.
2. **Model Quality:** a. Is the overall model statistically significant? Justify your answer. b. What is the adjusted R-squared for this model? Interpret this value. c. Compare the adjusted R-squared from this model to the R-squared from the simple linear regression model in Exercise 1. Does adding `family_income` improve the model?
3. **Coefficient Interpretation:** a. Interpret the coefficient for `tv_hours`. How does its meaning differ from the slope in Exercise 1? b. Interpret the coefficient for `family_income`. c. Are both predictors statistically significant? Justify your answer.
---
# Exercise 5: Depression, Age, and Domestic Work
**Dataset:** `depression.csv`
This exercise is based on Example 2 from the multiple regression section of the slides. A clinical psychologist wants to validate the theory that age and time spent on domestic work have significant effects on depression levels (measured by the CES-D scale).
### Tasks
Using a significance level of `\(\alpha = 0.05\)`:
1. **Write the Model:** Write the equation of the fitted model that predicts the CES-D score from age and time spent on domestic work.
2. **Overall Significance:** Is the fitted model statistically significant? Justify your answer.
3. **Practical Significance:** What percentage of the variation in CES-D levels is explained by the model?
4. **Predictor Effects:** Do both independent variables have a significant effect on the CES-D level? In other words, is the theory valid for the population under study? Justify your answer.
---
# Exercise 6: Leadership Model Expansion
**Dataset:** `leadership.csv` (use all columns).
This exercise is based on Example 3 from the multiple regression section. The researcher from Exercise 3 believes the model predicting leadership can be improved by adding `social_expectations` and `institutional_expectations`.
### Task
Assuming all statistical assumptions are met, evaluate the quality of this new, expanded model using a significance level of `\(\alpha = 0.05\)`. Specifically, address the following:
- Is the overall model significant?
- How much variance in leadership does the new model explain?
- Are all three predictors (assertiveness, social expectations, institutional expectations) significant?
- Compare this model to the simple regression model from Exercise 3. Is it a better model? Justify your answer.
---
# References
Cohen, J. (1988). _Statistical power analysis for the behavioral sciences_ (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates. ISBN: 0805802835.
---
class: center, bottom, inverse
# More info
--
Slides created with the <svg viewBox="0 0 581 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#384CB7;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"></path></svg> package [`xaringan`](https://github.com/yihui/xaringan).
-- <svg viewBox="0 0 512 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;"> <g label="icon" id="layer6" groupmode="layer"> <path id="path2" d="M 132.62426,316.69067 C 119.2805,301.94483 112.56962,274.5073 112.56962,234.39862 v -54.79191 c 0,-37.32217 -5.81677,-63.58084 -17.532347,-78.83466 -11.6757,-15.293118 -31.159702,-22.922596 -58.353466,-22.922596 -5.958581,0 -11.409226,0.22492 -16.45319,0.5917 -5.04455,0.427121 -9.742846,1.037046 -14.1564111,1.83092 V 95.057199 H 16.671281 c 12.325533,0 20.908335,3.82414 25.667559,11.532201 4.77973,7.74964 7.139712,25.48587 7.139712,53.14663 v 68.01321 c 0,42.12298 13.016861,74.19672 39.233939,96.16314 19.627549,16.47424 46.636229,27.23363 81.030059,32.40064 v -20.17708 c -16.3928,-4.27176 -29.04346,-10.51565 -37.11829,-19.44413 z m 246.75144,0 c 13.34377,-14.74584 20.05466,-42.18337 20.05466,-82.29205 v -54.79191 c 0,-37.32217 5.81673,-63.58084 17.53235,-78.83466 11.67568,-15.293118 31.15971,-22.922596 58.35348,-22.922596 5.95858,0 11.40922,0.22492 16.45315,0.5917 5.04457,0.427121 9.74287,1.037046 14.15645,1.83092 v 14.785125 h -10.59712 c -12.32549,0 -20.90826,3.82414 -25.66752,11.532201 -4.77974,7.74964 -7.13972,25.48587 -7.13972,53.14663 v 68.01321 c 0,42.12298 -13.01688,74.19672 -39.23394,96.16314 -19.6275,16.47424 -46.63622,27.23363 -81.03006,32.40064 v -20.17708 c 16.39279,-4.27176 29.04347,-10.51565 37.11827,-19.44413 z M 303.95857,87.165762 c 8.42049,-6.691524 25.52576,-10.536158 51.23486,-11.492333 V 63.999997 H 156.80716 v 11.673432 c 26.1755,0.956175 43.38268,4.800809 51.68248,11.492333 8.31852,6.73139 12.40691,20.033568 12.40691,39.904818 V 384.6851 c 0,20.80641 -4.08839,34.5146 -12.40691,41.02332 -8.2998,6.56905 -25.50698,10.10729 -51.68248,10.65744 V 448 h 197.71597 l 0.67087,-11.63414 c -25.50471,-0.54955 -42.56835,-4.35266 -51.07201,-11.40918 -8.4182,-6.95638 -12.73153,-20.44184 -12.73153,-40.27158 V 127.07058 c 0,-19.87125 4.16983,-33.173428 12.56922,-39.904818 z" style="stroke-width:0.0753388"></path> </g></svg> + <svg viewBox="0 0 581 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#384CB7;"> [ comment ] <path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"></path></svg> = <svg viewBox="0 0 512 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;position:relative;display:inline-block;top:.1em;fill:red;"> [ comment ] <path d="M462.3 62.6C407.5 15.9 326 24.3 275.7 76.2L256 96.5l-19.7-20.3C186.1 24.3 104.5 15.9 49.7 62.6c-62.8 53.6-66.1 149.8-9.9 207.9l193.5 199.8c12.5 12.9 32.8 12.9 45.3 0l193.5-199.8c56.3-58.1 53-154.3-9.8-207.9z"></path></svg> -- <svg viewBox="0 0 581 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#384CB7;"> [ comment ] <path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 
14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"></path></svg> has infinite possibilities. -- Practice is the best strategy for learning. -- . -- _In God we trust; all others must bring data._ -- W. Edwards Deming -- . -- . -- . -- THE END --- class: center, bottom, inverse