When you’ve got a scatterplot and want to add a regression line to it, there are a number of ways to do it.
Let’s make some data:
age = 18:29
height = c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
data = data.frame(age, height)
print(data)
## age height
## 1 18 76.1
## 2 19 77.0
## 3 20 78.1
## 4 21 78.2
## 5 22 78.8
## 6 23 79.7
## 7 24 79.9
## 8 25 81.1
## 9 26 81.2
## 10 27 81.8
## 11 28 82.8
## 12 29 83.5
Now plot it. We’ll use a different plotting library this time, though, since it makes pretty plots easier.
library(ggplot2)
ggplot(data, aes(x=age, y=height)) +
geom_point()
Now, to get a best-fit line, all we have to do is:
ggplot(data, aes(x=age, y=height)) +
geom_point() +
geom_smooth(method="lm")
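By default, geom_smooth() also draws a shaded band around the line, which is the 95% confidence interval for the fit. If you only want the line, a minimal sketch (rebuilding the same data so it runs on its own; the variable name p is just for this example) could look like this:

```r
library(ggplot2)

# Same example data as above, rebuilt so the snippet is self-contained.
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
data <- data.frame(age, height)

p <- ggplot(data, aes(x = age, y = height)) +
  geom_point() +
  # se = FALSE hides the shaded confidence band;
  # formula = y ~ x spells out the straight-line fit explicitly.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(p)
```

Leaving se at its default keeps the band, which is handy for seeing how uncertain the line is at each age.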
If we want to see how “good” this is, we can look at the lm output ourselves:
fit = lm(data$height~data$age)
print(fit)
##
## Call:
## lm(formula = data$height ~ data$age)
##
## Coefficients:
## (Intercept) data$age
## 64.928 0.635
print(summary(fit))
##
## Call:
## lm(formula = data$height ~ data$age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27238 -0.24248 -0.02762 0.16014 0.47238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.9283 0.5084 127.71 < 2e-16 ***
## data$age 0.6350 0.0214 29.66 4.43e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.256 on 10 degrees of freedom
## Multiple R-squared: 0.9888, Adjusted R-squared: 0.9876
## F-statistic: 880 on 1 and 10 DF, p-value: 4.428e-11
Remember, we care about R-squared and the p-value (both are in the last two lines of the summary).
A good ‘fit’ is one where R-squared is close to 1 (meaning the model explains most of the variation in the data) and the p-value is below 0.05. Those aren’t hard-and-fast rules, but they make a good rule of thumb.
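If you would rather not read those two numbers off the printout, they can be pulled out of the summary object. This is a small sketch that rebuilds the data and model so it runs on its own; the names s, r_squared and slope_p are made up for the example.

```r
# Same example data and model as above.
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
data <- data.frame(age, height)
fit <- lm(data$height ~ data$age)

s <- summary(fit)

# "Multiple R-squared" in the printed summary.
r_squared <- s$r.squared

# p-value for the age coefficient: row 2 of the coefficient table.
# With a single predictor this equals the p-value on the F-statistic line.
slope_p <- coef(s)[2, "Pr(>|t|)"]

print(r_squared)
print(slope_p)

# The rule of thumb from the text, written as a check.
print(r_squared > 0.9 && slope_p < 0.05)
```

The 0.9 cut-off for R-squared is an arbitrary choice for this sketch; the text only asks for "close to 1".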
You can use the other plotting system we saw last week, too:
plot(data$age, data$height)
abline(fit)
It’s just not as pretty.
We can even use this model to make a prediction. The output of print(fit) gives us the coefficients for the basic regression formula. To make a prediction, we take the data$age coefficient (the slope), multiply it by the age we want to predict the height for, then add the intercept. So, our formula would be:
height = 0.635 * age + 64.928
So, if the age we want to predict height for is 32 then the estimate is that it’ll be:
print(0.635 * 32 + 64.928)
## [1] 85.248
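The coefficients typed in by hand are rounded to three decimals. A sketch of the same calculation that reads them from the model instead (predict_height is a hypothetical helper name, not something from a package):

```r
# Same example data and model as above.
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
data <- data.frame(age, height)
fit <- lm(data$height ~ data$age)

# coef(fit) returns the intercept first, then the slope for age.
b <- unname(coef(fit))

# Hypothetical helper: slope * age + intercept, using the unrounded coefficients.
predict_height <- function(a) b[2] * a + b[1]

print(predict_height(32))
```

The result should agree with the hand calculation to about two decimal places; any small difference comes from the rounding of the printed coefficients.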
It should be nearly spot-on for ages already in the data, too (say, 25):
print(0.635 * 25 + 64.928)
## [1] 80.803
print(data[data$age == 25, ])
## age height
## 8 25 81.1
Remember, it won’t be perfect (it’s a prediction), but it’ll be close, within some margin of error: here the estimate of 80.803 is about 0.3 below the observed 81.1.
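To put a number on that margin, R’s predict() can return an interval along with each estimate. This sketch refits the model as lm(height ~ age, data = data), since predict() matches new data by column name and that does not work with the data$age style of formula used earlier; fit2 and new_ages are names chosen just for this example.

```r
# Same example data as above.
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
data <- data.frame(age, height)

# Refit using column names so predict() can look up "age" in newdata.
fit2 <- lm(height ~ age, data = data)

# Ages to predict for: one inside the data (25) and one beyond it (32).
new_ages <- data.frame(age = c(25, 32))

# interval = "prediction" adds lower (lwr) and upper (upr) bounds
# for a single new observation, at the default 95% level.
pred <- predict(fit2, newdata = new_ages, interval = "prediction")
print(pred)
```

The fit column holds the estimates and lwr/upr the bounds. The interval for 32 should come out wider than the one for 25, because 32 lies outside the ages the model was fitted on.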