When you’ve got a scatterplot and want to add a regression line to it, there are a number of ways to do it.

Let’s make some data:

```
age = 18:29
height = c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
data = data.frame(age, height)
print(data)
```

```
##    age height
## 1   18   76.1
## 2   19   77.0
## 3   20   78.1
## 4   21   78.2
## 5   22   78.8
## 6   23   79.7
## 7   24   79.9
## 8   25   81.1
## 9   26   81.2
## 10  27   81.8
## 11  28   82.8
## 12  29   83.5
```

Plot it. We’ll use a different plotting library this time, though, since it’s easier to make pretty things.

```
library(ggplot2)
ggplot(data, aes(x=age, y=height)) +
  geom_point()
```

Now, to get a best-fit line, all we have to do is:

```
ggplot(data, aes(x=age, y=height)) +
  geom_point() +
  geom_smooth(method="lm")
```

If we want to see how “good” this is, we can look at the `lm` output ourselves:

```
fit = lm(data$height~data$age)
print(fit)
```

```
##
## Call:
## lm(formula = data$height ~ data$age)
##
## Coefficients:
## (Intercept)     data$age
##      64.928        0.635
```

`print(summary(fit))`

```
##
## Call:
## lm(formula = data$height ~ data$age)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.27238 -0.24248 -0.02762  0.16014  0.47238
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  64.9283     0.5084  127.71  < 2e-16 ***
## data$age      0.6350     0.0214   29.66 4.43e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.256 on 10 degrees of freedom
## Multiple R-squared: 0.9888, Adjusted R-squared: 0.9876
## F-statistic: 880 on 1 and 10 DF, p-value: 4.428e-11
```

Remember, we care about R-squared and the p-value (in the bottom two rows of the summary).

A good ‘fit’ is when R-squared is close to `1` (that means your model describes the data well) and when the p-value is below `0.05`. Those aren’t hard-and-fast rules, but they’re a good rule of thumb to use.
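If you’d rather pull those two numbers out of the model than read them off the printout, here is a minimal sketch. It rebuilds the same data so it runs on its own, and it refits the model as `fit2` using `height ~ age` with a `data=` argument (the names `fit2`, `s`, `r_squared` and `slope_p` are just for illustration). In a one-predictor model like this, the p-value on the `age` row is the same as the overall p-value in the last row of the summary. The `0.9` cut-off for R-squared is an assumption made for this example, not a standard threshold.

```r
# Recreate the data from above so this block runs on its own
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
data <- data.frame(age, height)

# Same regression as before, written with a data= argument
fit2 <- lm(height ~ age, data = data)
s <- summary(fit2)

# "Multiple R-squared" from the printed summary
r_squared <- s$r.squared

# p-value for the age coefficient (the Pr(>|t|) column)
slope_p <- coef(s)["age", "Pr(>|t|)"]

print(r_squared)
print(slope_p)

# Rule-of-thumb check; the 0.9 cut-off is an illustrative assumption
print(r_squared > 0.9 && slope_p < 0.05)
```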

You can use the other plotting system we saw last week, too:

```
plot(data$age, data$height)
abline(fit)
```

It’s just not as pretty.

We can even use this model we’ve made to make a prediction. In the output of `print(fit)` we have coefficients that we can use in the basic regression formula. To make a prediction, we take the `data$age` coefficient, multiply it by the age we want to predict the height for, then add the `(Intercept)` coefficient. So, our formula would be:

`height = 0.635 * age + 64.928`

So, if the age we want to predict height for is `32`, then the estimate is that it’ll be:

`print(0.635 * 32 + 64.928)`

`## [1] 85.248`
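You can also let R do this arithmetic, which avoids retyping rounded coefficients. The sketch below is one way to do it, not the only one. It refits the model as `fit2` with the formula `height ~ age` and a `data=` argument, because `predict()` matches the columns of `newdata` by variable name, and that matching does not work when the formula is written as `data$height ~ data$age`. The names `fit2`, `b`, `by_hand` and `by_predict` are made up for this example. Expect the result to differ slightly from the hand calculation above, since the coefficients here are not rounded.

```r
# Recreate the data from above so this block runs on its own
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
data <- data.frame(age, height)

# Refit using column names so predict() can find "age" in newdata
fit2 <- lm(height ~ age, data = data)

# Option 1: the same formula as above, using the unrounded coefficients
b <- coef(fit2)
by_hand <- unname(b["(Intercept)"] + b["age"] * 32)

# Option 2: ask the model directly
by_predict <- unname(predict(fit2, newdata = data.frame(age = 32)))

print(by_hand)
print(by_predict)
```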

It should be nearly spot-on for existing ones, too (say, `25`):

`print(0.635 * 25 + 64.928)`

`## [1] 80.803`

`print(data[data$age==25,])`

```
## age height
## 8 25 81.1
```

Remember, it won’t be perfect (it’s a prediction), but it’ll be close, within some margin of error.
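To put a number on that error margin, one option is to ask `predict()` for a prediction interval. This is a rough sketch under a few assumptions: the model is refit as `fit2` with `height ~ age` and a `data=` argument so that `newdata` is matched by column name, the 95% level is simply R’s default spelled out, and the names `fit2`, `pi25` and `actual` are made up for the example. The result has three columns: `fit` (the estimate), `lwr` and `upr` (the ends of the interval).

```r
# Recreate the data from above so this block runs on its own
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
data <- data.frame(age, height)

# Refit using column names so predict() can find "age" in newdata
fit2 <- lm(height ~ age, data = data)

# Estimate for age 25 plus a 95% prediction interval around it
pi25 <- predict(fit2, newdata = data.frame(age = 25),
                interval = "prediction", level = 0.95)
print(pi25)

# Check whether the height we actually observed at age 25 falls inside
actual <- data$height[data$age == 25]
print(actual >= pi25[1, "lwr"] && actual <= pi25[1, "upr"])
```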