class: middle
background-image: url(data:image/png;base64,#LTU_logo.jpg)
background-position: top left
background-size: 30%

# STM1001 [Topic 8](https://bookdown.org/a_shaker/STM1001_Topic_8/) Workshop

## Correlation and Simple Linear Regression

### La Trobe University

This workshop complements the [Topic 8 readings](https://bookdown.org/a_shaker/STM1001_Topic_8/)

---

# Topic 8: Correlation and Simple Linear Regression

## In this week's readings:

* We will not have time to cover every concept, so please make sure you read this topic's readings thoroughly.

<iframe src="https://bookdown.org/a_shaker/STM1001_Topic_8/" width="100%" height="400px" data-external="1"></iframe>

---

# Correlation and Simple Linear Regression

* Do you think there is a relationship between money and happiness?

--

* If so, do you think the association is positive (as one goes up, the other goes up) or negative (as one goes up, the other goes down)?

--

* Recall two variables that we considered in [Topic 2](https://bookdown.org/a_shaker/STM1001_Topic_2/): GDP per capita (Income), and the Happiness index (score out of 100) (Gapminder.org 2021)

* On the next slide is a random sample of 10 countries and their income and happiness scores:

---

# Correlation and Simple Linear Regression
---

# Activity 1 (Individual)

1. Draw a scatter plot of the data with Income on the `\(x\)`-axis and Happiness on the `\(y\)`-axis

1. Looking at your scatter plot, what do you think the sample correlation is? (Remember that the correlation measures the strength and direction of the linear association. It is a number between -1 and 1.)

---
name: menti
class: middle
background-image: url(data:image/png;base64,#menti.jpg)
background-size: 115%

# Menti

## Go to [www.menti.com](https://www.menti.com) and use
## the code provided
## First three questions ONLY

---

# Correlation

We can test the null hypothesis that the population correlation coefficient is 0, using

`$$H_0:\rho = 0 \;\;\text{versus}\;\;H_1: \rho \neq 0,$$`

where:

* `\(\rho\)` denotes the true (population) correlation coefficient.

---

# Activity 2 (Individual)

```

	Pearson's product-moment correlation

data:  df$income_2019 and df$happiness_2019
t = 4.9228, df = 8, p-value = 0.00116
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5229322 0.9681534
sample estimates:
      cor 
0.8670731 
```

1. If we have evidence that the population correlation coefficient `\(\rho\)` is not equal to zero, what does this mean? (explain in your own words)
2. Above are the results of the test for significance. Using these results, answer the following questions:
    * What is the estimated correlation?
    * What is the 95% confidence interval for `\(\rho\)`?
    * What is the `\(p\)`-value?
    * Do we have enough evidence to reject `\(H_0\)`? What is the evidence?
    * What can we conclude about the relationship between the two variables?

---

# Activity 2 (Solutions)

1. This means we have evidence of a linear association between the two variables
2. * The estimated correlation is 0.8671
   * The 95% confidence interval for `\(\rho\)` is (0.5229, 0.9682)
   * The `\(p\)`-value is 0.00116
   * Yes, since the `\(p\)`-value is 0.00116, which is less than 0.05.
     Or, we can use the fact that the 95% confidence interval for `\(\rho\)` does not include 0
   * Given the evidence, we can conclude that there is a statistically significant positive association between income and happiness.

---

# Does correlation imply causation?

Do you think higher marriage rates would be related to higher numbers of people who drowned after falling out of a fishing boat?

<img src="data:image/png;base64,#images/spurious.svg" style="display: block; margin: auto;" />

[Spurious correlations](https://www.tylervigen.com/spurious-correlations) (also called nonsense correlations), by [Tyler Vigen](https://www.tylervigen.com/about), licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)

See [this week's readings](https://bookdown.org/a_shaker/STM1001_Topic_8/1-2-does-correlation-imply-causation.html) for further discussion

---

# Defining a straight line

* You may be familiar with the following equation which we can use to define a straight line:

`$$y = mx + c,$$`

where:

* `\(m\)` is the slope of the line
* `\(c\)` is the `\(y\)`-intercept.

---

# Defining a straight line

* We can see that the line crosses the `\(y\)`-axis at `\(y = 10\)`, where `\(x = 0\)`. This value is the `\(y\)`-intercept:

<img src="data:image/png;base64,#Topic_8_Workshop_files/figure-html/unnamed-chunk-5-1.svg" height="100%" style="display: block; margin: auto;" />

---

# Defining a straight line

* When we zoom in, we can see that as `\(x\)` increases by one unit, `\(y\)` increases by 5 (the slope)

<img src="data:image/png;base64,#Topic_8_Workshop_files/figure-html/unnamed-chunk-6-1.svg" height="100%" style="display: block; margin: auto;" />

---

# Activity 3 (Group)

* Consider the scatter plot of the happiness vs income data you created before. As a group, decide where a straight line through the points would fit best.

* In particular:
    * What should be the value of the slope?
    * What should be the value of the `\(y\)`-intercept?
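
---

# Defining a straight line: a quick check in R

Any candidate line is fully described by its slope and `\(y\)`-intercept, as in `\(y = mx + c\)`. The following is only a minimal sketch, using the example line from the earlier slides (`\(y\)`-intercept 10, slope 5). The names `m`, `c0`, `x` and `y` are illustrative choices and are not part of the workshop materials.

```r
# Example line from the earlier slides: y-intercept 10, slope 5
m  <- 5    # slope
c0 <- 10   # y-intercept (called c0 here because c() is a base R function)

x <- 0:4
y <- m * x + c0   # y = mx + c evaluated at x = 0, 1, 2, 3, 4

y[x == 0]   # value of y where x = 0, i.e. the y-intercept: 10
diff(y)     # change in y per one-unit increase in x, i.e. the slope: 5 5 5 5

# Plot the points and draw the line through them;
# abline(a, b) takes the intercept a and the slope b
plot(x, y, xlab = "x", ylab = "y")
abline(a = c0, b = m)
```

To try out a different candidate line, change the values of `m` and `c0` and rerun the code.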
---

# Activity 3 continued

* The next slides show the model estimated using “Simple Linear Regression” in R, followed by a scatter plot with the "line of best fit" added. Use these to answer the following questions:
    * What do you think is the estimate of the slope? How close was this to your group’s estimate?
    * What do you think is the estimate of the `\(y\)`-intercept? How close was this to your group’s estimate?
    * Write down a sentence that interprets the slope. Hint: for every one-unit increase in average income, what do we expect will happen to the happiness score, on average?

---

# Activity 3 continued

```

Call:
lm(formula = happiness_2019 ~ income_2019, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.2321 -1.7428 -0.7695  1.9069  6.6278 

Coefficients:
              Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 39.8468763  4.6388562   8.590 0.0000261 ***
income_2019  0.0007280  0.0001479   4.923   0.00116 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.464 on 8 degrees of freedom
Multiple R-squared:  0.7518,	Adjusted R-squared:  0.7208 
F-statistic: 24.23 on 1 and 8 DF,  p-value: 0.00116
```

---

# Activity 3 continued

<img src="data:image/png;base64,#Topic_8_Workshop_files/figure-html/unnamed-chunk-8-1.svg" height="100%" style="display: block; margin: auto;" />

---

# Activity 3 (Solutions)

* The estimate of the slope is 0.000728
* The estimate of the `\(y\)`-intercept is 39.85
* For each one-unit (dollar) increase in income, we expect, on average, the happiness score to increase by 0.000728.
* Equivalently, by adjusting the units, we could say that for each $10,000 increase in income, we expect, on average, the happiness score to increase by 7.28 (since `\(0.000728 \times 10{,}000 = 7.28\)`)

---

# Simple linear regression model definition

.content-box-blue[

.center[
**Simple linear regression model definition:**
]

`$$y = \beta_0 + \beta_1 x + \epsilon,$$`

where:

* `\(x\)` is the **explanatory variable** (also referred to as the **independent** variable or **predictor** variable)
* `\(y\)` is the **response variable** (also referred to as the **dependent variable**)
* `\(\beta_0\)` is the `\(y\)`-intercept of the line (just like `\(c\)` in the equation we looked at earlier) and is referred to as the **intercept coefficient**
* `\(\beta_1\)` is the slope of the line (just like `\(m\)` in the equation we looked at earlier) and is referred to as the **slope coefficient**
* `\(\epsilon\)` is known as the **random error** term which has expected value `\(\text{E}(\epsilon) = 0\)`

]

---

# Simple linear regression model definition (continued)

Then, supposing we have a data set with `\(n\)` observations, each with a value for `\(x\)` and a value for `\(y\)`, denoted as

`$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$$`

we can use this data to obtain the ***sample estimates*** `\(\widehat{\beta}_0\)` and `\(\widehat{\beta}_1\)`, which give the estimated regression line

`$$\widehat{y}=\widehat{\beta}_0+\widehat{\beta}_1x.$$`

For the income and happiness data in Activity 3, the estimated line is

`$$\widehat{y} = 39.85 + 0.000728x.$$`

---
name: menti
class: middle
background-image: url(data:image/png;base64,#menti.jpg)
background-size: 115%

# Menti

## Go to [www.menti.com](https://www.menti.com) and use
## the code provided
## Final two questions ONLY

---

# More details in the readings

* For more, see [this topic’s readings](https://bookdown.org/a_shaker/STM1001_Topic_8/)

* In particular, the readings also cover the following important content:
    * How do we decide where the “line of best fit” should go?
    * More detail about how to interpret the Simple Linear Regression output from R
    * Testing for `\(H_0 : \beta_1 = 0\)` versus `\(H_1 : \beta_1 \neq 0\)`
    * `\(R^2\)`, the Coefficient of Determination
    * Checking assumptions
    * Predictions

---
background-image: url(data:image/png;base64,#computerlab.jpg)
background-position: bottom
background-size: 75%
class: center

# See you in the computer labs!

Continue with this topic's readings: [Topic 8 Readings](https://bookdown.org/a_shaker/STM1001_Topic_8/)

---

# References

Gapminder.org. 2021. “Free Data from World Bank via Gapminder.org, CC-BY License.” 2021. https://www.gapminder.org/data/.

---
class: middle

<font color = "grey">
These notes have been prepared by Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License
<a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/" target="_blank"> BY-NC-ND. </a>
</font>