class: middle background-image: url(data:image/png;base64,#LTU_logo.jpg) background-position: top left background-size: 30% # STM1001 [Topic 8](https://bookdown.org/a_shaker/STM1001_Topic_8/) Lecture ## Correlation and Simple Linear Regression ### La Trobe University This lecture complements the [Topic 8 readings](https://bookdown.org/a_shaker/STM1001_Topic_8/) --- # Topic 8: Related Links ## Readings [Topic 8 readings](https://bookdown.org/a_shaker/STM1001_Topic_8/) ## Notation [Notation for Topic 8: Correlation and Simple Linear Regression](https://bookdown.org/a_shaker/STM1001_Topic_0/notation-summary.html#topic-8-correlation-and-simple-linear-regression) --- # Topic 8: Correlation and Simple Linear Regression **Overview** <iframe src="https://bookdown.org/a_shaker/STM1001_Topic_8/" width="100%" height="400px" data-external="1"></iframe> --- # Introduction * You may recall that we considered an introduction to ***correlation*** in [Topic 2](https://bookdown.org/a_shaker/STM1001_Topic_2/4-measures-of-association-between-variables.html) -- * Today we will be revisiting correlation again in a bit more detail, followed by ***Simple Linear Regression*** -- * Both of these techniques can be used to describe the relationship between two numeric variables --- # Correlation * In today's example, we will revisit the Big Five framework (Costa and McCrae, 1992; Goldberg, 1992), which contains five main personality traits: * Conscientiousness * Agreeableness * Neuroticism * Openness * Extraversion -- * The data set we will be considering is the `Big 5` data set (Dolan, Oort, Stoel, and Wicherts, 2009) which is a freely available built-in data set in jamovi (The jamovi project, 2022) * The data set contains `\(n = 500\)` observations, and measurements for each of the Big Five personality traits -- * In particular, we will consider the association between **Openness** and **Agreeableness** --- <img src="data:image/png;base64,#Topic_8_Lecture_files/figure-html/unnamed-chunk-2-1.svg" height="100%" style="display: block; margin: auto;" /> Considering the above scatter plot, what do you think the sample correlation is between the two variables? -- `\(r = 0.16\)` --- # Correlation We can test the null hypothesis that the population correlation coefficient is 0, using `$$H_0:\rho = 0 \;\;\text{versus}\;\;H_1: \rho \neq 0,$$` where: * `\(\rho\)` denotes the true (population) correlation coefficient. -- If we reject `\(H_0\)`, we conclude there is evidence to suggest that the correlation is not equal to zero. This would mean we have evidence of a significant linear relationship (or association) between the two variables. 
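--
For reference, the test statistic used for this test is
`$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}},$$`
which, under `\(H_0\)`, follows a `\(t\)` distribution with `\(n - 2\)` degrees of freedom. With `\(r = 0.1592085\)` and `\(n = 500\)`, this gives `\(t \approx 3.60\)` on 498 degrees of freedom, matching the output we will see shortly.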
-- Summarising the notation, we have that: * `\(\rho\)` denotes the **population** correlation coefficient * `\(r\)` denotes the **sample** correlation coefficient -- We will now carry out the test to determine whether there is evidence of a significant association between Openness and Agreeableness --- # Correlation test output ``` r Pearson's product-moment correlation data: df$Agreeableness and df$Openness t = 3.5988, df = 498, p-value = 0.0003517 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.07253049 0.24349954 sample estimates: cor 0.1592085 ``` --- # Correlation test output ``` r Pearson's product-moment correlation data: df$Agreeableness and df$Openness t = 3.5988, df = 498, `p-value = 0.0003517` alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.07253049 0.24349954 sample estimates: cor 0.1592085 ``` * Since we have `\(p < 0.001\)` which is less than 0.05, we reject `\(H_0\)` -- That is, there is evidence to suggest that the population correlation coefficient is not equal to zero. -- This means we have evidence of a significant association between the two variables. --- # Correlation test output ``` r Pearson's product-moment correlation data: df$Agreeableness and df$Openness `t = 3.5988`, df = 498, `p-value = 0.0003517` alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.07253049 0.24349954 sample estimates: cor 0.1592085 ``` * Since we have `\(p < 0.001\)` which is less than 0.05, we reject `\(H_0\)`. * The test statistic is `\(t = 3.5988\)` --- # Correlation test output ``` r Pearson's product-moment correlation data: df$Agreeableness and df$Openness `t = 3.5988`, df = 498, `p-value = 0.0003517` alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.07253049 0.24349954 sample estimates: `cor` `0.1592085` ``` * Since we have `\(p < 0.001\)` which is less than 0.05, we reject `\(H_0\)`. * The test statistic is `\(t = 3.5988\)` * The sample correlation coefficient (i.e. the estimated correlation) is `\(r = 0.16\)` --- # Correlation test output ``` r Pearson's product-moment correlation data: df$Agreeableness and df$Openness `t = 3.5988`, df = 498, `p-value = 0.0003517` alternative hypothesis: true correlation is not equal to 0 `95 percent confidence interval:` `0.07253049 0.24349954` sample estimates: `cor` `0.1592085` ``` * Since we have `\(p < 0.001\)` which is less than 0.05, we reject `\(H_0\)`. * The test statistic is `\(t = 3.5988\)` * The sample correlation coefficient (i.e. the estimated correlation) is `\(r = 0.16\)` * The 95% confidence interval for the population correlation coefficient `\(\rho\)` is (0.07, 0.24). This means we are 95% confident that the true correlation between Agreeableness and Openness is between 0.07 and 0.24. --- # Does correlation imply causation? Do you think higher marriage rates would be related to higher numbers of people who drowned after falling out of a fishing boat? -- <img src="data:image/png;base64,#images/spurious.svg" style="display: block; margin: auto;" /> See [this week's readings](https://bookdown.org/a_shaker/STM1001_Topic_8/1-2-does-correlation-imply-causation.html) for further discussion. -- [Spurious correlations](https://www.tylervigen.com/spurious-correlations) (also called nonsense correlations), by [Tyler Vigen](https://www.tylervigen.com/about), licenced under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). 
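---
# Running the correlation test in R

The correlation test output shown on the earlier slides can be reproduced with R's `cor.test()` function. Below is a minimal sketch, assuming the Big 5 data have been loaded into a data frame named `df` with columns `Agreeableness` and `Openness` (as in the output shown):

``` r
# Test H0: rho = 0 versus H1: rho != 0 (two-sided Pearson test by default)
cor.test(df$Agreeableness, df$Openness)

# The sample correlation coefficient on its own
cor(df$Agreeableness, df$Openness)
```

By default, `cor.test()` performs the two-sided Pearson test described above; the `alternative` and `method` arguments can be changed if a different test is required.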
--- # Defining a straight line We will now consider Simple Linear Regression (or SLR for short), starting with defining a straight line. -- You may be familiar with the following equation, which we can use to define a straight line: `$$y = mx + c,$$` where: * `\(m\)` is the slope of the line * `\(c\)` is the `\(y\)`-intercept --- # Defining a straight line * We can see the line crosses the `\(y\)`-axis at the `\(y\)`-intercept, where `\(x = 0\)` and `\(y = 10\)`: <img src="data:image/png;base64,#Topic_8_Lecture_files/figure-html/unnamed-chunk-10-1.svg" height="100%" style="display: block; margin: auto;" /> --- # Defining a straight line * When we zoom in, we can see that as `\(x\)` increases by one unit, `\(y\)` increases by 5 (the slope) <img src="data:image/png;base64,#Topic_8_Lecture_files/figure-html/unnamed-chunk-11-1.svg" height="100%" style="display: block; margin: auto;" /> --- # Line of best fit When we have a scatter plot of data, we can add a **line of best fit**. -- * This involves choosing a slope and `\(y\)`-intercept so that the line is placed in the best spot to fit the data -- * In the graph on the following slide, we can see that, depending on the choices of the `\(y\)`-intercept and slope, we can end up with better or worse models (or lines) -- **Simple Linear Regression** (SLR), which we will define shortly, allows us to use the data to determine exactly where the line of best fit should be. * Recall that we can define a straight line as `\(y = mx + c\)`, where `\(m\)` is the slope and `\(c\)` is the `\(y\)`-intercept * Equivalently, we could write this equation as `\(y = c + mx\)`. It will be useful to bear this form in mind as we define the simple linear regression model shortly --- <img src="data:image/png;base64,#Topic_8_Lecture_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#Topic_8_Lecture_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" /> --- # Simple linear regression model definition .content-box-blue[ .center[ **Simple linear regression model definition:** ] `$$y = \beta_0 + \beta_1 x + \epsilon,$$` where: * `\(x\)` is the **explanatory variable** (also referred to as the **independent** variable or **predictor** variable) * `\(y\)` is the **response variable** (also referred to as the **dependent variable**) * `\(\beta_0\)` is the `\(y\)`-intercept of the line (just like `\(c\)` in the equation we looked at earlier) and is referred to as the **intercept coefficient** * `\(\beta_1\)` is the slope of the line (just like `\(m\)` in the equation we looked at earlier) and is referred to as the **slope coefficient** * `\(\epsilon\)` is known as the **random error** term, which has expected value `\(\text{E}(\epsilon) = 0\)` ] --- # Simple linear regression model definition Then, supposing we have a data set with `\(n\)` observations, each with a value for `\(x\)` and a value for `\(y\)`, denoted as `$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$$` we can use this data to help us obtain the ***sample estimates***, that is, `$$\widehat{y}=\widehat{\beta}_0+\widehat{\beta}_1x$$` -- So the general SLR equation is `$$y = \beta_0 + \beta_1 x + \epsilon,$$` -- and when we have fitted the SLR to our data, the fitted result is `$$\widehat{y}=\widehat{\beta}_0+\widehat{\beta}_1x$$` where `\(\widehat{\beta}_0\)` and `\(\widehat{\beta}_1\)` are estimates based on the details of our specific data set. --- # How do we choose the line of best fit? 
Using our data, we need a criterion for choosing a slope `\((\widehat{\beta}_1)\)` and intercept `\((\widehat{\beta}_0)\)` that place our line in the 'best' spot. -- * The criterion used in Simple Linear Regression involves fitting a model to the data such that the ***sum of squared residuals is minimised*** -- Let's consider a simple example based on the figure in the next slide. --- # How do we choose the line of best fit? .pull-left[ <img src="data:image/png;base64,#Topic_8_Lecture_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" /> ] .pull-right[ * Suppose we have a data set consisting of the three green observations displayed in the figure on the left {{content}} ] -- * Each observation has an associated 'fit' on the blue line, represented by the black dots {{content}} -- * Each observation also has a corresponding ***residual***, `\(e\)`, which is the vertical distance between the observation and the corresponding fit -- * If we squared all of these residuals and added them up, Simple Linear Regression would allow us to place the line in the spot such that this sum would be minimised --- # What is a residual? .content-box-blue[ .center[ **What is a residual?** ] A residual is the vertical distance between the observed value and the regression line, or `\(y - \widehat{y}\)`. ] -- [Try out this app to help you visualise how residuals are calculated](https://stm1001.shinyapps.io/slr_sept/) -- * On the next slide, we will return to our Openness versus Agreeableness example and compare the sum of squared residuals (SSR) for the three models * Using calculus, it is possible to work out the values of `\(\widehat{\beta}_0\)` and `\(\widehat{\beta}_1\)` that minimise the SSR by hand; however, we will allow statistical software packages to do the hard work for us --- <img src="data:image/png;base64,#Topic_8_Lecture_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" /> --- # Simple Linear Regression Output ``` r lm(formula = Openness ~ Agreeableness, data = df) Residuals: Min 1Q Median 3Q Max -0.99601 -0.22775 -0.00405 0.22457 1.04770 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.06254 0.14821 20.663 < 2e-16 *** Agreeableness 0.15448 0.04293 3.599 0.000352 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.3362 on 498 degrees of freedom Multiple R-squared: 0.02535, Adjusted R-squared: 0.02339 F-statistic: 12.95 on 1 and 498 DF, p-value: 0.0003517 ``` --- # Simple Linear Regression Output ``` r `Coefficients:` Estimate Std. Error t value Pr(>|t|) `(Intercept) 3.06254 0.14821 20.663 < 2e-16 ***` Agreeableness 0.15448 0.04293 3.599 0.000352 *** Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.3362 on 498 degrees of freedom Multiple R-squared: 0.02535, Adjusted R-squared: 0.02339 F-statistic: 12.95 on 1 and 498 DF, p-value: 0.0003517 ``` The results related to `\(\widehat{\beta}_0\)` and `\(\widehat{\beta}_1\)` are under the heading `Coefficients:` * The first row `(Intercept)` corresponds to the intercept coefficient `\(\widehat{\beta}_0\)` --- # Simple Linear Regression Output ``` r `Coefficients:` Estimate Std. Error t value Pr(>|t|) `(Intercept) 3.06254 0.14821 20.663 < 2e-16 ***` `Agreeableness 0.15448 0.04293 3.599 0.000352 ***` Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.3362 on 498 degrees of freedom Multiple R-squared: 0.02535, Adjusted R-squared: 0.02339 F-statistic: 12.95 on 1 and 498 DF, p-value: 0.0003517 ``` The results related to `\(\widehat{\beta}_0\)` and `\(\widehat{\beta}_1\)` are under the heading `Coefficients:` * The first row `(Intercept)` corresponds to the intercept coefficient `\(\widehat{\beta}_0\)` * The second row `Agreeableness` corresponds to the slope coefficient `\(\widehat{\beta}_1\)` --- # Simple Linear Regression Output ``` r Coefficients: `Estimate` Std. Error t value Pr(>|t|) `(Intercept)` `3.06254` 0.14821 20.663 < 2e-16 *** Agreeableness 0.15448 0.04293 3.599 0.000352 *** Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.3362 on 498 degrees of freedom Multiple R-squared: 0.02535, Adjusted R-squared: 0.02339 F-statistic: 12.95 on 1 and 498 DF, p-value: 0.0003517 ``` * The estimate for `\(\beta_0\)` is `3.06254` --- # Simple Linear Regression Output ``` r Coefficients: `Estimate` Std. Error t value Pr(>|t|) `(Intercept)` `3.06254` 0.14821 20.663 < 2e-16 *** `Agreeableness` `0.15448` 0.04293 3.599 0.000352 *** Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.3362 on 498 degrees of freedom Multiple R-squared: 0.02535, Adjusted R-squared: 0.02339 F-statistic: 12.95 on 1 and 498 DF, p-value: 0.0003517 ``` * The estimate for `\(\beta_0\)` is `3.06254` * The estimate for `\(\beta_1\)` is `0.15448` --- # Simple Linear Regression Output ``` r Coefficients: `Estimate` Std. Error t value Pr(>|t|) `(Intercept)` `3.06254` 0.14821 20.663 < 2e-16 *** `Agreeableness` `0.15448` 0.04293 3.599 0.000352 *** Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.3362 on 498 degrees of freedom Multiple R-squared: 0.02535, Adjusted R-squared: 0.02339 F-statistic: 12.95 on 1 and 498 DF, p-value: 0.0003517 ``` * The estimate for `\(\beta_0\)` is `3.06254` * The estimate for `\(\beta_1\)` is `0.15448` Given `\(\widehat{\beta}_0\)` and `\(\widehat{\beta}_1\)`, we can write down the estimated model as: `$$\widehat{\text{Openness}} = 3.06254 + 0.15448\times\text{Agreeableness}$$` --- # Simple Linear Regression Output ``` r Coefficients: `Estimate` Std. Error t value Pr(>|t|) `(Intercept)` `3.06254` 0.14821 20.663 < 2e-16 *** `Agreeableness` `0.15448` 0.04293 3.599 0.000352 *** Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.3362 on 498 degrees of freedom Multiple R-squared: 0.02535, Adjusted R-squared: 0.02339 F-statistic: 12.95 on 1 and 498 DF, p-value: 0.0003517 ``` * The estimate for `\(\beta_0\)` is `3.06254` and the estimate for `\(\beta_1\)` is `0.15448` Given `\(\widehat{\beta}_0\)` and `\(\widehat{\beta}_1\)`, we can write down the estimated model as: `$$\widehat{\text{Openness}} = 3.06254 + 0.15448\times\text{Agreeableness}$$` We can **interpret the value of `\(\widehat{\beta}_1 = 0.15448\)`** as follows: *"We estimate that, on average, for every 1 unit increase in Agreeableness, the average Openness value will be 0.15448 higher"* --- # Simple Linear Regression Output ``` r Coefficients: Estimate Std. Error t value `Pr(>|t|)` `(Intercept)` 3.06254 0.14821 20.663 `< 2e-16` *** Agreeableness 0.15448 0.04293 3.599 0.000352 *** Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.3362 on 498 degrees of freedom Multiple R-squared: 0.02535, Adjusted R-squared: 0.02339 F-statistic: 12.95 on 1 and 498 DF, p-value: 0.0003517 ``` Reading from the column labelled `Pr(>|t|)`, the `\(p\)`-value for the intercept coefficient is `< 2e-16`, which is very close to zero. -- * This is for a test of the form `\(H_0 : \beta_0 = 0\)` versus `\(H_1 : \beta_0 \neq 0\)` -- * For technical reasons, we always include our intercept coefficient (`\(\widehat{\beta}_0\)`) in our fitted model, even if the associated `\(p\)`-value is `\(> \alpha\)` --- # Simple Linear Regression Output ``` r Coefficients: Estimate Std. Error t value `Pr(>|t|)` `(Intercept)` 3.06254 0.14821 20.663 `< 2e-16` *** `Agreeableness` 0.15448 0.04293 3.599 `0.000352` *** Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.3362 on 498 degrees of freedom Multiple R-squared: 0.02535, Adjusted R-squared: 0.02339 F-statistic: 12.95 on 1 and 498 DF, p-value: 0.0003517 ``` Reading from the column labelled `Pr(>|t|)`, the `\(p\)`-value for the slope coefficient is `0.000352`, which is also very close to zero. -- * This is for a test of the form `\(H_0 : \beta_1 = 0\)` versus `\(H_1 : \beta_1 \neq 0\)` **Since we have `\(p < 0.05\)`, we reject `\(H_0\)` and conclude that `\(\beta_1\)` is not zero.** **This means there is evidence of a significant linear association between Openness and Agreeableness** *(more on this shortly)*. --- # Simple Linear Regression Output ``` r Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.06254 0.14821 20.663 < 2e-16 *** Agreeableness 0.15448 0.04293 3.599 0.000352 *** Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.3362 on 498 degrees of freedom `Multiple R-squared: 0.02535`, Adjusted R-squared: 0.02339 F-statistic: 12.95 on 1 and 498 DF, p-value: 0.0003517 ``` The `Multiple R-squared` value, which can be found in the second last row, is `\(R^2 = 0.02535\)`. * This indicates that 2.54% of the variation in the response can be explained by the model, which is a poor fit --- # `\(H_0 : \beta_1 = 0 \text{ vs } H_1 : \beta_1 \neq 0\)` Recall the simple linear regression model `$$y = \beta_0 + \beta_1x + \epsilon$$` -- If the true value of `\(\beta_1\)` were 0, then the regression model would become `$$y = \beta_0 + \epsilon,$$` meaning `\(y\)` does not depend on `\(x\)` in any way. -- In other words, **if the true value of `\(\beta_1\)` were 0, there would be no association between `\(x\)` and `\(y\)`**. For this reason, the hypothesis test for `\(\beta_1\)` is very important. --- name: menti class: middle background-image: url(data:image/png;base64,#menti.jpg) background-size: 115% # Kahoot ## Go to [www.kahoot.it](https://kahoot.it) and use ## the code provided --- # `\(R^2\)`, the Coefficient of Determination The `\(R^2\)` value can be used to ***evaluate the fit of the model***. -- * `\(R^2\)` values are always between 0 and 1 -- * `\(R^2\)` values close to 0 indicate a poor fit, whereas `\(R^2\)` values close to 1 indicate an excellent fit -- In fact, the `\(R^2\)` value is simply the ***correlation squared***. -- * For example, recall that earlier, we found that the correlation coefficient was `\(r = 0.1592085\)`. If we square this number, we get `\(R^2 = 0.1592085^2 = 0.0253\)` -- * Conversely, if we take the square root of `\(R^2\)`, we can find the correlation
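---
# `\(R^2\)` and `\(r\)` in R

We can check this relationship directly in R. This is a minimal sketch, again assuming the data frame is named `df`; the object name `slr_fit` is just an illustrative choice:

``` r
# Fit the simple linear regression model
slr_fit <- lm(Openness ~ Agreeableness, data = df)

# R-squared as reported in the model summary (approximately 0.02535)
summary(slr_fit)$r.squared

# Squaring the sample correlation gives the same value
cor(df$Agreeableness, df$Openness)^2
```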
--- # `\(R^2\)`, the Coefficient of Determination Although the interpretation of the `\(R^2\)` value can sometimes differ by subject matter, for the purposes of this subject, the table below can be used as a guide when interpreting `\(R^2\)` values: .content-box-blue[ .center[ | *`\(R^2\)` value* | *Quality of the SLR model* | |:-------------|:-------------:| | `\(0.8 \leq R^2 \leq 1\)` | Excellent | | `\(0.5 \leq R^2 < 0.8\)` | Good | | `\(0.25 \leq R^2 < 0.5\)` | Moderate | | `\(0 \leq R^2 < 0.25\)` | Weak | ]] --- # Checking assumptions As usual, it is important to check the assumptions. For SLR, we have: .content-box-blue[ .center[ **Simple linear regression model assumptions:** ] 1. **The model is linear.** Is the linear model proposed suitable for the data, or might there be another kind of model that would fit the data more accurately? 1. **The errors have constant variance.** As well as having an expected value `\(\text{E}(\epsilon) = 0\)`, they are assumed to have constant variance such that `\(\text{Var}(\epsilon) = \sigma^2\)`. This being the case, it follows that we would not expect the variance to be larger or smaller depending on `\(x\)`. 1. **The errors are normally distributed** with mean 0 and variance `\(\sigma^2\)`, such that `\(\epsilon \sim N(0, \sigma^2)\)`. Given this assumption, we would expect the residuals to look like data that has been sampled from a normal distribution. ] --- # Checking assumptions There are two plots that are very useful in helping us check for the above assumptions: the ***residuals versus fits*** plot, and the ***Normal Q-Q plot***. * Both plots are shown below for the Openness versus Agreeableness example: <img src="data:image/png;base64,#Topic_8_Lecture_files/figure-html/unnamed-chunk-31-1.svg" style="display: block; margin: auto;" /> --- # What to look for when checking the plots We will start with the **Normal Q-Q** plot, since it is already familiar to us. -- * We can use this plot to check for **Assumption 3 (Normality)** -- * The plot shows that the dots follow the line very closely, meaning there are no obvious concerns regarding normality of the errors <img src="data:image/png;base64,#Topic_8_Lecture_files/figure-html/unnamed-chunk-32-1.svg" height="100%" style="display: block; margin: auto;" /> --- # What to look for when checking the plots The ***residuals versus fits*** plot shows the residuals on the `\(y\)`-axis versus the fitted values on the `\(x\)`-axis. -- * We can use this plot to check for **Assumption 1 (linearity)** and **Assumption 2 (constant variance)** -- Firstly, **if the model is linear**, then we expect to see random scatter and **no patterns in the residuals versus fits plot**. -- * If the data follows some sort of pattern (and we may also observe this in a scatter plot of the data), it may be that a different type of model (for example a quadratic or exponential model) is appropriate -- Secondly, if the errors have constant variance, then we would expect the magnitude of the residuals to remain fairly constant across the entirety of the plot. * If we see that the spread of the residuals becomes larger or smaller, rather than remaining generally constant, as the fitted values change, it may be that the **constant variance** assumption has been violated. This is commonly referred to as **'fanning'**. A sketch of how these plots can be produced in R is given on the next slide.
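---
# Producing the diagnostic plots in R

As a sketch of how such plots can be obtained, base R's `plot()` method for fitted `lm` objects produces a residuals versus fitted plot and a Normal Q-Q plot of the residuals (these will look slightly different from the figures shown, but convey the same information). The model object name `slr_fit` is an illustrative choice:

``` r
# Fit the model, then request the first two diagnostic plots
slr_fit <- lm(Openness ~ Agreeableness, data = df)

plot(slr_fit, which = 1)  # residuals versus fitted values
plot(slr_fit, which = 2)  # Normal Q-Q plot of the residuals
```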
--- # What to look for when checking the plots Referring to the previous residuals versus fits plot, we can observe the following: 1. There are **no obvious patterns**, meaning the **linearity** assumption has been met 1. The spread of the residuals remains generally constant as the fitted values change. That is, there are **no signs of fanning**. This means the **constant variance** assumption has also been met. --- # What to look for when checking the plots Also consider the following examples: <img src="data:image/png;base64,#Topic_8_Lecture_files/figure-html/unnamed-chunk-33-1.svg" height="100%" style="display: block; margin: auto;" /> --- # What to look for when checking the plots In the first plot, we see a random scatter, which is exactly what we want to see in a residuals versus fits plot. There are no signs that either the linearity or the constant variance assumption has been violated. <img src="data:image/png;base64,#Topic_8_Lecture_files/figure-html/unnamed-chunk-34-1.svg" height="100%" style="display: block; margin: auto;" /> --- # What to look for when checking the plots In the second plot, we see an **obvious pattern** in the data, as it is showing signs of curvature. This is an indication that the **linearity** assumption has been violated. <img src="data:image/png;base64,#Topic_8_Lecture_files/figure-html/unnamed-chunk-35-1.svg" height="100%" style="display: block; margin: auto;" /> --- # What to look for when checking the plots In the third plot, we see obvious signs of **fanning.** This is an indication that the **constant variance** assumption has been violated. <img src="data:image/png;base64,#Topic_8_Lecture_files/figure-html/unnamed-chunk-36-1.svg" height="100%" style="display: block; margin: auto;" /> --- # Predictions As well as using the model to better understand the relationship between the response and explanatory variables, we can also use the model to make predictions. -- For example, suppose we had a new observation with an Agreeableness value of 3, but we did not know the Openness value. * We could predict the Openness value as follows: -- Let `\(x_0\)` denote the 'new' value of `\(x\)`. Then, our predicted response for this choice of `\(x\)` will be `$$\widehat{y}_0 = \widehat{\beta}_0 + \widehat{\beta}_1x_0$$` -- Recalling that our estimated model is `$$\widehat{\text{Openness}} = 3.06254 + 0.15448\times\text{Agreeableness},$$` we can estimate the Openness value as `$$\widehat{\text{Openness}} = 3.06254 + 0.15448\times 3 = 3.52598.$$` --- # Extrapolation ***Extrapolation*** occurs when we make a prediction based on a value of `\(x_0\)` that is not within the range of the data from which we estimated our model. -- * This can lead to inaccurate or even [disastrous](https://www.youtube.com/shorts/ecXAMpKbdig) results -- * We therefore need to be careful when using a model to make predictions and check that `\(x_0\)` is within the range of the data from which we estimated our model -- * See [this topic's readings](https://bookdown.org/a_shaker/STM1001_Topic_8/2.5-predictions.html#a-cautionary-tale-extrapolation) for an example --- # References Costa, P. T. and R. R. McCrae (1992). _Neo personality inventory-revised (NEO PI-R)_. Psychological Assessment Resources, Odessa, FL. Dolan, C. V., F. J. Oort, R. D. Stoel, et al. (2009). "Testing measurement invariance in the target rotated multigroup exploratory factor model". In: _Structural Equation Modeling: A Multidisciplinary Journal_ 16.2, pp. 295-314. 
URL: [https://doi.org/10.1080/10705510902751416](https://doi.org/10.1080/10705510902751416). Goldberg, L. R. (1992). "The development of markers for the Big-Five factor structure." In: _Psychological assessment_ 4.1, p. 26. The jamovi project (2022). _jamovi [Computer Software]_. URL: [https://www.jamovi.org](https://www.jamovi.org). --- background-image: url(data:image/png;base64,#computerlab.jpg) background-position: bottom background-size: 75% class: center # See you in the computer labs! --- class: middle <font color = "grey"> These notes have been prepared by Amanda Shaker and Rupert Kuveke. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License <a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/" target="_blank"> BY-NC-ND. </a> </font>