- Recap week 10: Example questions
- Regression
- Regression app in R
- Lab report 7
- Today’s lab / short answer example questions
Assignment 2
- Will be available from the morning of Friday 14th
- Due Friday 21st at 4pm
- If you cannot meet the deadline, you need to apply for special consideration (SCA) on Canvas (with a good reason and supporting evidence)
- It must be your own work. Do not plagiarise; we now have dedicated software to detect it
Example Question (1)
A Pearson correlation coefficient of 0.52 means
- that the x variable causes the y variable to increase
- that the x and y are positively correlated
- that x and y are not significantly correlated
- that x and y are highly significantly correlated
Example Question (2)
The correlation coefficient is
- the same as the covariance
- a standardised version of the covariance
- the square root of the covariance
- none of the above
Example Question (3)
When a data set looks non-normal but I would still like to check for correlations between variables,
- I should use a Wilcoxon test
- I should use a Pearson correlation test
- I should use a Kendall or Spearman correlation test
- I can use a Wilcoxon OR a Kendall test
Example Question (4)
A correlation test tests
- whether a histogram shows a normal distribution
- whether two groups are significantly different
- whether two variables are correlated
- none of the above
Example Question (5)
In a correlation analysis, if my p-value falls below 5% but in reality there is no correlation between the two variables,
- I probably did not test whether the variables were normal
- I might erroneously have used continuous variables
- I commit a type II error
- I commit a type I error
Example Question (6)
In a correlation analysis we normally have
- two categorical variables
- one categorical variable and one continuous variable
- two continuous variables
- any two variables
Example Question (7)
Assume that your waiting time for a bus is uniformly distributed between 0 and 10 minutes. What is the probability of waiting 5 minutes or longer?
- This cannot be calculated from the above
- About 5 %
- About 50 %
- About 10 %
Regression Analysis
The origin of the word
The term 'regression' was coined in a publication that dealt with the body sizes of parents and their children.
Francis Galton observed that tall parents tended to have offspring who were not as tall; the trait 'regressed' towards the mean.
Regression Analysis
- Measures the dependency of one variable on another
- E.g. it tests the hypothesis ‘The body mass index of a person can be used to predict the risk for diabetes’
- A linear regression is a predictive model: e.g. it may allow us to predict diabetes risk from the body mass index of a person
- We assume causality, i.e. one variable drives the other one - unlike in correlation analysis!
- The two coefficients of a simple linear regression are the slope and the intercept
- In a linear regression, we test for the significance of the slope (only rarely for the significance of the intercept)
Regression Analysis
\(\hat{y} = ax + b\)
\(\hat{y}\) are the estimated y values:
\(y - \hat{y} = {\color{red}{d}}\)
Assuming that the line should go through the mean of x and y, our task is to find \(a\) that minimises the sum of the squared residuals \({\color{red}{d}}\)
Regression Analysis
\(y - \hat{y} = {\color{red}{d}}\)
Now substitute \(ax + b\) for \(\hat{y}\) to obtain
\({\color{red}{d}} = y - ax - b\)
We therefore have to minimise \(\sum{(y - ax - b)^2}\). We do this by setting the first derivative to zero (remember calculus…?)
Regression Analysis
\(\frac{d\sum{(y - ax - b)^2}}{da} = -2\sum{x(y - ax - b)} = 0\)
We find (without proof; see Crawley 2014, The R Book):
\(a = \frac{\sum{(x - \bar{x})(y - \bar{y})}}{\sum{(x - \bar{x})^2}}\)
Because the regression line by definition goes through the mean of x and y, we can work out the intercept:
\(\bar{y} = a\bar{x} + b\)
\(b = \bar{y} - a \bar{x}\)
Done! Confused? Try the interactive app: https://gallery.shinyapps.io/simple_regression/
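As a sketch, these two formulas can be typed directly into R and checked against the lm() function (introduced on a later slide); the vectors x and y below are made-up example data, not from the lecture:
x = c(1, 2, 3, 4, 5) #made-up example data
y = c(2.1, 3.9, 6.2, 8.1, 9.8)
a = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2) #slope
b = mean(y) - a * mean(x) #intercept, from b = mean(y) - a * mean(x)
c(b, a)
coef(lm(y ~ x)) #lm() returns the same intercept and slope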
Regression Analysis
Assumptions
- \(x\) is measured without error (for the type of regression presented here)
- The residuals are normally distributed
- Constant variance in y (no heterogeneity of variance, or ‘heteroscedasticity’)
We know how to test for normality, but we’ll have to learn how to assess heteroscedasticity
Regression in R
in R, we only need to type:
m1 = lm(y ~ x)
m1
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
0.04133 1.02517
and we are done!
Regression in R
How do we interpret these numbers?
- The intercept is what we called \(b\) and \(a\) is the slope
- If we want to predict a \(y\)-value for a given \(x\), we need to put the intercept and the slope into the formula:
\(y\)-value = \(x\) * slope + intercept
The slope (sometimes called a) and the intercept (sometimes called b) are the coefficients
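As a minimal sketch, the coefficients can be pulled from the model with coef() and plugged into the formula; the value x = 7 is an arbitrary example:
coef(m1) #intercept and slope of model m1
coef(m1)[1] + coef(m1)[2] * 7 #predicted y-value for x = 7: x * slope + intercept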
Regression in R
To plot this is easy too, same as in correlation analysis:
plot(y ~ x)
abline(m1) #plots the regression line, based on model 'm1'
But it is important to know what is going on behind the scenes: R automatically finds the best fit using the least squares method, minimising the squared residuals by pivoting a line through the mean of x and y (as we have just done manually).
Regression in R
Using summary(m1) (‘m1’ being your model output) additionally tests the intercept and slope for significance:
- intercept null hypothesis, using a t-statistic: ‘the intercept is not significantly different from zero’. In plain English: when x is zero, y is zero too.
- slope null hypothesis, using a t-statistic: ‘the slope is not significantly different from zero’.
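The t-statistic is simply the estimate divided by its standard error; using the slope values from the summary output on the next slide:
1.02517 / 0.13763 #t value of the slope, approx. 7.449 as in summary(m1)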
Regression in R
summary(m1) #m1 as defined in previous slide
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.5034 -0.4339 0.2452 0.7240 1.6150
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.04133 0.85395 0.048 0.963
x 1.02517 0.13763 7.449 7.27e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.25 on 8 degrees of freedom
Multiple R-squared: 0.874, Adjusted R-squared: 0.8582
F-statistic: 55.49 on 1 and 8 DF, p-value: 7.272e-05
Regression in R
- First, we see the call (the model formula and data we used)
- Below, we see the statistics (min, median, max, etc.) of the residuals
- Then come the coefficients, i.e. the intercept and the slope, including their estimate, standard error, t-value and corresponding p-value.
- Note that the intercept is rarely important! Its null hypothesis is that it is not different from zero - rarely of interest!
- Last come the residual standard error, degrees of freedom, R-squared values (look at the adjusted R-squared value; see the following slides) and an F-statistic (the latter is not important at this stage)
Regression Analysis
Example where the significance of the intercept matters
Spectrophotometry: at concentration zero, is the absorbance significantly different from zero?
Regression in R
R-squared value: the proportion of explained vs. total variance; it corresponds to the square of the correlation coefficient \(R\).
\(R = \frac{cov(x,y)}{s_xs_y} = \frac{\sum{(x-\bar{x})(y-\bar{y})}}{(n-1)s_xs_y}\)
Remember?
\(R^2\) is the proportion of variance in \(y\) explained by \(x\)
How can we visualise this?
Regression in R
\(R^2 = 1 - \frac{{\color{blue}{SS_{residual}}}}{{\color{red}{SS_{total}}}} = 1 - \frac{{\color{blue}{\sum{(y_i - \hat{y}_i)^2}}}}{{\color{red}{\sum{(y_i - \bar{y})^2}}}}\)
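A sketch of this formula applied to the model m1 = lm(y ~ x) from the earlier slides; the result should match the Multiple R-squared of 0.874 reported by summary(m1):
SS_residual = sum(m1$residuals^2)
SS_total = sum((y - mean(y))^2)
1 - SS_residual / SS_total #approx. 0.874, as in summary(m1)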
Visual representation of \(R^2\) (figure)
Regression in R
Example using the ‘swiss’ data set
Does the percentage of males involved in agriculture ('Agriculture') influence the fertility index ('Fertility')? Can we establish a predictive model that forecasts 'Fertility' based on 'Agriculture'?
First, always plot the data.
Indeed, the higher 'Agriculture', the higher the variable 'Fertility':
#using the inbuilt 'swiss' data set
plot(swiss$Fertility ~ swiss$Agriculture)
Regression in R
Let us run and interpret a linear regression on the data:
m1 = lm(Fertility ~ Agriculture, data = swiss)
summary(m1)
Call:
lm(formula = Fertility ~ Agriculture, data = swiss)
Residuals:
Min 1Q Median 3Q Max
-25.5374 -7.8685 -0.6362 9.0464 24.4858
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.30438 4.25126 14.185 <2e-16 ***
Agriculture 0.19420 0.07671 2.532 0.0149 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.82 on 45 degrees of freedom
Multiple R-squared: 0.1247, Adjusted R-squared: 0.1052
F-statistic: 6.409 on 1 and 45 DF, p-value: 0.01492
Regression in R
Example using the ‘swiss’ data set
- How do we interpret the output?
- What assumptions do we need to verify again…?
To check for normality of the residuals, we extract those from the model (m1$residuals) and use our known tools (visual, Shapiro-Wilk test)
To check the assumption of homogeneity of variance, we plot the residuals against the fitted values
Regression in R - Assumptions
So we need to check whether the residuals follow a normal distribution and whether we see heterogeneity of variance
par(mfrow = c(1,2))
qqnorm(m1$residuals)
qqline(m1$residuals)
plot(m1$residuals ~ m1$fitted.values)
This is looking reasonable, most data points sit on the line. There is no obvious heterogeneity of variance. We can trust the model output.
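To back up the visual check, the Shapiro-Wilk test mentioned earlier can be run directly on the residuals (a p-value above 0.05 gives no evidence against normality):
shapiro.test(m1$residuals)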
Assumptions, the quick way
plot(m1) #note this gives you 4 plots, best used after par(mfrow = c(2, 2))
#we will consider plots 1, 2, and 4 only
This plot checks for homogeneity of variance of the residuals; we do not want to see a pattern, e.g. larger residuals at higher fitted values (a trumpet shape).
Assumptions, the quicker way
In this plot we check for the normality of the residuals in a qq-plot, just like we did earlier.
Assumptions, the quicker way
This plot tells us about very influential outliers. Values outside the 0.5 line (or, worse, outside the 1 line) have a strong influence on the regression line and may need looking at.
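As a sketch, the same information is available numerically: cooks.distance() returns the raw Cook's distances, and plot(m1, which = 5) reproduces this residuals-vs-leverage plot on its own:
cooks.distance(m1) #one Cook's distance per observation; values near or above 0.5 deserve a closer look
plot(m1, which = 5) #the residuals-vs-leverage plot shown here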
Regression in R: how to report the data
plot(swiss$Fertility ~ swiss$Agriculture)
abline(m1)
text(70, 50, 'R-squared = 0.12')
The higher the percentage of males working in agriculture, the higher the fertility index (linear regression, R-squared = 0.12, p < 0.05).
Regression in R: use the model to predict values
Intercept: 60.3
Slope: 0.19
What fertility would we predict if the percentage of males working in agriculture is 50%?
\(Fertility = 0.19 \cdot 50 + 60.3 = 69.8\)
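The same prediction can be obtained with R's predict() function; it uses the unrounded coefficients, so the result differs slightly from the manual calculation above:
predict(m1, newdata = data.frame(Agriculture = 50)) #approx. 70.0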
Learning tool: Linear regression app
Model diagnostics (testing model assumptions)
https://gallery.shinyapps.io/slr_diag/
Regression
https://rich.shinyapps.io/regression/
Regression game
https://gallery.shinyapps.io/simple_regression/
Correlation game
https://gallery.shinyapps.io/correlation_game/
What will we have learnt by the end of this week?
- How to perform a linear regression
- How to interpret the output of a linear regression
- How to check for model assumptions
- How to report a linear regression analysis
Glossary
- regression, to regress y on x
- intercept
- slope
- coefficients
- R-squared value
- heterogeneity of variance
- residuals, residual distribution
- fitted values