`geom_smooth()` (or equivalently `stat_smooth()`) adds a smoothed conditional mean to a plot; see the ggplot2 help for details. There are different smoothing methods: below we look at `method = "lm"` (linear regression) and the default for small datasets, loess.
We have two continuous normal variables X and Y. For example, in the mtcars data table, X=wt and Y=mpg. Intuitively, the regression line is the best-fitting line through your data.
mtcars %>% ggplot(aes(x=wt, y=mpg)) +geom_point() +geom_smooth(method="lm",se=FALSE)
Many scientists misuse the regression line, so it is important to know more about it:
In a linear regression model, you assume that the average value of y for a given value of x is given by the relationship \[M(x)=\beta_0 + \beta_1x.\] \(M(x)\) is the mean value of all the y values in your scatter plot in a narrow strip around x. Only Tyche, the Greek goddess of fortune, knows what \(\beta_0\) and \(\beta_1\) are.
This is called a parametric model because the relationship between \(M(x)\) and x is given by an equation with two parameters \(\beta_0\) and \(\beta_1\).
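To make this concrete, here is a small simulation sketch in which we play Tyche: the parameter values, noise level, and sample size below are made up for illustration. We choose \(\beta_0\) and \(\beta_1\) ourselves, generate data from the model plus noise, and check that `lm()` recovers estimates close to the truth.

```r
# A sketch: simulate data from M(x) = beta0 + beta1*x plus noise
# (beta0 = 37, beta1 = -5, n = 100 are made-up values for illustration)
set.seed(1)
beta0 <- 37; beta1 <- -5
x <- runif(100, 1.5, 5.5)                    # made-up "car weights"
y <- beta0 + beta1 * x + rnorm(100, sd = 3)  # the model plus random noise
coef(lm(y ~ x))                              # estimates should be close to 37 and -5
```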
The error of the regression line in estimating \(y_i\) from \(x_i\) is called the residual and is \[y_i-(\beta_0 +\beta_1x_i).\]
Here is a picture of all of the residuals in a scatter plot:
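If you want to draw this picture yourself, here is one way to sketch it (assuming the tidyverse is loaded): fit the line with `lm()` and draw each residual as a vertical segment from the point to its fitted value.

```r
# A sketch of the residual plot: each red segment is a residual
fit <- lm(mpg ~ wt, data = mtcars)

mtcars %>%
  mutate(fitted = fitted(fit)) %>%
  ggplot(aes(x = wt, y = mpg)) +
  geom_segment(aes(xend = wt, yend = fitted), color = "red") +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```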
Thinking of \[ \sum_{i=1}^{n}(y_i-\beta_0 -\beta_1x_i)^2 \] as a function of \(\beta_0\) and \(\beta_1\), we can use calculus to find the values \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) that minimize this sum of squared residuals.
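You do not have to trust the calculus: as a quick sketch, we can minimize the sum of squared residuals numerically with `optim()` and see that it lands on (roughly) the same \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) that `lm()` reports below.

```r
# A sketch: minimize the sum of squared residuals numerically
rss <- function(beta) sum((mtcars$mpg - beta[1] - beta[2] * mtcars$wt)^2)
optim(c(0, 0), rss)$par   # roughly the same numbers lm() gives below
```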
The regression line based on my sample is given by \[\widehat{M}(x)=\widehat{\beta_0} + \widehat{\beta_1}x.\] Here \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) are random variables, since you will get different values with every sample you take. Again, only Tyche knows what the true parameters \(\beta_0\) and \(\beta_1\) are.
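One way to see that \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) are random variables is to resample the rows of mtcars a few times (the resampling below is just an illustration, not part of the regression procedure): every resample gives slightly different estimates.

```r
# A sketch: each resample of the data gives different estimates
set.seed(1)
replicate(3, coef(lm(mpg ~ wt, data = mtcars[sample(32, replace = TRUE), ])))
```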
It turns out that \[\widehat{\beta_1} = Cov(x_i,y_i)/Var(x_i) \] and \[ \widehat{\beta_0}=\overline{y} -\widehat{\beta_1}\overline{x} \] where \(\overline{x}\) and \(\overline{y}\) are your sample averages.
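As a sketch, you can check these formulas directly in R against the coefficients that `lm()` reports:

```r
# A sketch: the closed-form estimates match lm()'s coefficients
x <- mtcars$wt
y <- mtcars$mpg

beta1_hat <- cov(x, y) / var(x)             # Cov(x_i, y_i) / Var(x_i)
beta0_hat <- mean(y) - beta1_hat * mean(x)  # ybar - beta1_hat * xbar

c(beta0_hat, beta1_hat)
coef(lm(mpg ~ wt, data = mtcars))           # same two numbers
```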
For example:
lm(formula = mpg ~ wt, data = mtcars)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344
so \(\widehat{\beta_0}=37.3\) is your y-intercept and \(\widehat{\beta_1}=-5.3\) is your slope.
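For instance, plugging a 3000 pound car (wt = 3) into the fitted line gives an estimated average of about 37.285 - 5.344 × 3 ≈ 21.3 mpg; `predict()` does the same arithmetic:

```r
# A sketch: the fitted line's estimate for a 3000 pound car (wt = 3)
fit <- lm(mpg ~ wt, data = mtcars)
predict(fit, newdata = data.frame(wt = 3))   # roughly 37.285 - 5.344 * 3
```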
Let \(x_0\) be an arbitrary data point (for example \(x_0=3\) is a car with weight 3000 pounds in the mtcars dataset). \(\widehat{M}(x_0)\) is then an estimate of the height of the regression line at \(x_0\) (i.e., the average mpg of a car with weight 3000 pounds).
We have, \[\widehat{M}(x_0)=\widehat{\beta_0} +\widehat{\beta_1}x_0\] \[\widehat{M}(x_0)=(\overline{y}-\widehat{\beta_1}\overline{x}) + \widehat{\beta_1}x_0\] \[\widehat{M}(x_0)=\overline{y} + \widehat{\beta_1}(x_0-\overline{x})\]
From here, using the property that \(Var(A+B)=Var(A)+Var(B)\) when A and B are independent random variables, and the amazing fact that \(\overline{y}\) and \(\widehat{\beta_1}\) are independent random variables, you can show that
\[Var(\widehat{M}(x_0))=\frac{\sigma^2}{n} + \frac{(x_0-\overline{x})^2\sigma^2}{\sum_{i=1}^{n}(x_i-\overline{x})^2}.\]
What we see from this is that the variance of the height of the regression line varies with \(x_0\) and gets larger the further \(x_0\) is from \(\overline{x}\). This is why the confidence band gets wider the further you are from the point of averages \((\overline{x},\overline{y})\).
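As a sketch, you can check this formula against the standard error that `predict()` reports, using the estimate \(\widehat{\sigma}\) from the fit in place of the unknown \(\sigma\); the point \(x_0 = 3\) is just an example.

```r
# A sketch: the variance formula vs. predict()'s standard error at x0 = 3
fit    <- lm(mpg ~ wt, data = mtcars)
sigma2 <- summary(fit)$sigma^2          # estimate of sigma^2
x      <- mtcars$wt
x0     <- 3

se_formula <- sqrt(sigma2 / length(x) +
                   (x0 - mean(x))^2 * sigma2 / sum((x - mean(x))^2))
se_predict <- predict(fit, newdata = data.frame(wt = x0), se.fit = TRUE)$se.fit

c(se_formula, se_predict)               # the two agree
```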
For example:
mtcars %>% ggplot(aes(x=wt, y=mpg)) +geom_point() +geom_smooth(method="lm") + geom_point(aes(x=mean(wt),y=mean(mpg)),size=5)
mtcars$wt %>% mean()
## [1] 3.21725
mtcars$mpg %>% mean()
## [1] 20.09062
We have two continuous variables X and Y. For example, in the mtcars data table, X=wt and Y=mpg. This time we let geom_smooth pick its default method, which for a small dataset like this is loess.
mtcars %>% ggplot(aes(x = wt, y = mpg)) + geom_point() + geom_smooth(se = FALSE)
Algorithm for Loess
Let \(x_0\) be an observation. For example \(x_0=3.435\), corresponding to the 3435 pound Merc 280.

1. Gather the fraction (the span) of the \(x_i\) that are closest to \(x_0\). With span=.4 and 32 cars, we look for the 13 car weights closest to the Merc 280's.
2. Give each of these points a weight \(K_{i0}\) that is close to 1 for the \(x_i\) nearest \(x_0\) and shrinks toward zero at the edge of the span; everything outside the span gets weight zero. In this example the cars nearest to the Merc 280 have a weight close to 1, the blue cars further away have smaller weights, and all the red cars have zero weight.
3. Just as we did for simple linear regression, find \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) that minimize \[ \sum_{i=1}^{n} K_{i0}(y_i-\beta_0 -\beta_1x_i)^2. \] The difference here is the weights \(K_{i0}\).
4. The fitted value at \(x_0\) is given by \[\widehat{M}(x_0)=\widehat{\beta_0} + \widehat{\beta_1}x_0.\] (A hand-rolled sketch of these steps follows below.)
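Here is that hand-rolled sketch of one local fit at the Merc 280, using a tricube-style weight. The real `loess()` behind geom_smooth has extra refinements (for example, local quadratic fits by default), so the number will not match the plotted curve exactly.

```r
# A sketch of one local fit at x0 = 3.435 (the Merc 280)
x <- mtcars$wt; y <- mtcars$mpg
x0   <- 3.435
span <- 0.4
k    <- ceiling(span * length(x))        # 13 of the 32 cars

d     <- abs(x - x0)                     # distance of every car from x0
d_max <- sort(d)[k]                      # distance of the k-th closest car
w     <- (1 - pmin(d / d_max, 1)^3)^3    # weights: near 1 close to x0, 0 outside the span

fit0 <- lm(y ~ x, weights = w)           # weighted least squares
predict(fit0, newdata = data.frame(x = x0))  # the fitted value M_hat(x0)
```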
We do this for every observation \(x_0\) in our dataset and connect the points \(\widehat{M}(x_0)\). How we connect the points is a little complicated and I won't go into it. What is important is to understand that if the span is close to zero, then each local line is fit over only a very small range of the data, so at every observation there will be an adjustment in the direction of the line, resulting in a wiggly curve. If the span is close to 1, then the line is fit over a large range and the curve will be almost straight.
mtcars %>% ggplot(aes(x=wt, y=mpg)) +geom_point() +stat_smooth(se=FALSE,method="loess", span=.4)
mtcars %>% ggplot(aes(x=wt, y=mpg)) +geom_point() +stat_smooth(se=FALSE,method="loess", span=.8)
The loess method is nonparametric, meaning that we entirely relax the linearity assumption.