regression

When you’ve got a scatterplot and want to add a regression line to it, there are a number of ways to do it.

Let’s make some data:

age = 18:29
height = c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)

data = data.frame(age, height)

print(data)

##    age height
## 1   18   76.1
## 2   19   77.0
## 3   20   78.1
## 4   21   78.2
## 5   22   78.8
## 6   23   79.7
## 7   24   79.9
## 8   25   81.1
## 9   26   81.2
## 10  27   81.8
## 11  28   82.8
## 12  29   83.5

Now plot it. We’ll use a different plotting library this time, though, since it makes it easier to produce pretty graphics.

library(ggplot2)

ggplot(data, aes(x=age, y=height)) +
  geom_point()

Now, to get a best-fit line, all we have to do is:

ggplot(data, aes(x=age, y=height)) +
  geom_point() +
  geom_smooth(method="lm")

If we want to see how “good” this is, we can look at the lm output ourselves:

fit = lm(data$height~data$age)

print(fit)

## 
## Call:
## lm(formula = data$height ~ data$age)
## 
## Coefficients:
## (Intercept)     data$age  
##      64.928        0.635

print(summary(fit))

## 
## Call:
## lm(formula = data$height ~ data$age)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27238 -0.24248 -0.02762  0.16014  0.47238 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  64.9283     0.5084  127.71  < 2e-16 ***
## data$age      0.6350     0.0214   29.66 4.43e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.256 on 10 degrees of freedom
## Multiple R-squared:  0.9888, Adjusted R-squared:  0.9876 
## F-statistic:   880 on 1 and 10 DF,  p-value: 4.428e-11

Remember, we care about R-squared and the p-value (both are in the last two lines of the summary output).

A good ‘fit’ is one where R-squared is close to 1 (meaning the model describes the data well) and the p-value is below 0.05. Those aren’t hard-and-fast rules, but they’re a good rule of thumb.
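Rather than reading those two numbers off the printed summary, you can pull them out of the summary object. This is a minimal sketch that rebuilds the same data so it runs on its own; the names `s`, `r2` and `p_slope` are just illustrative.

```r
# Rebuild the example data and model so this runs standalone
age = 18:29
height = c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
data = data.frame(age, height)
fit = lm(data$height~data$age)

s = summary(fit)

# R-squared is stored directly on the summary object
r2 = s$r.squared

# The coefficient table has one row per term; row 2 is the slope,
# and the "Pr(>|t|)" column holds its p-value
p_slope = s$coefficients[2, "Pr(>|t|)"]

print(r2)
print(p_slope)
```

With a single predictor like this, the slope’s p-value is the same as the overall p-value printed on the last line of the summary.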

You can use the other plotting system we saw last week, too:

plot(data$age, data$height)
abline(fit)

It’s just not as pretty.

We can even use this model to make a prediction. The output of print(fit) gives us coefficients that we can plug into the basic regression formula. To make a prediction, we take the data$age coefficient (the slope), multiply it by the age we want to predict the height for, then add the intercept. So, our formula would be:

height = 0.635 * age + 64.928

So, if we want to predict the height at age 32, the estimate is:

print(0.635 * 32 + 64.928)

## [1] 85.248
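The same calculation can be done without retyping rounded coefficients. This is a sketch, not the method used above: it refits the model as `lm(height ~ age, data = data)` (the same regression, spelled differently) because `predict()` needs plain column names to match against `newdata`. The names `fit2`, `by_hand`, `new_ages` and `predicted` are made up for illustration.

```r
# Rebuild the example data so this runs standalone
age = 18:29
height = c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
data = data.frame(age, height)

# Same regression as before, written with column names plus data=
fit2 = lm(height ~ age, data = data)

# By hand, using the unrounded coefficients
b = coef(fit2)   # named vector: "(Intercept)" and "age"
by_hand = unname(b["(Intercept)"] + b["age"] * 32)

# Or let predict() do the arithmetic for one or more new ages
new_ages = data.frame(age = c(25, 32))
predicted = predict(fit2, newdata = new_ages)

print(by_hand)
print(predicted)
```

The results can differ from the hand-typed version in the third decimal place, since the coefficients here are not rounded to three digits.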

It should be nearly spot-on for ages already in the data, too (say, 25):

print(0.635 * 25 + 64.928)

## [1] 80.803

print(data[data$age==25,])

##   age height
## 8  25   81.1

Remember, it won’t be perfect (it’s a prediction), but it’ll be close, within some margin of error. Here the estimate of 80.803 is off from the observed 81.1 by about 0.3.
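That margin of error can be made concrete with a prediction interval. This is a hedged sketch: it assumes the model is refit using `data =` so that `predict()` can accept new ages, and `fit2` and `pred` are illustrative names. Passing `interval = "prediction"` asks for a 95% interval by default.

```r
# Rebuild the example data so this runs standalone
age = 18:29
height = c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
data = data.frame(age, height)

# Same regression, written with column names plus data=
fit2 = lm(height ~ age, data = data)

# One row per new age; columns are fit (estimate), lwr and upr (bounds)
pred = predict(fit2, newdata = data.frame(age = c(25, 32)),
               interval = "prediction")

print(pred)
```

The interval for age 32 comes out wider than the one for age 25, because 32 lies outside the ages the model was fitted on (18 to 29).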

@hrbrmstr

February 20, 2015