statistics-dot-com

Week 3 of this class is about the basic linear model of statistics. We look at it in terms of simple linear regression, but with the right level of abstraction it can be used as a framework to do most all of the first two weeks as well. That is, the one-sample t-test is the linear model with no slope, the two-sample paired t-test is the linear model with slope 1, the two-sample t-test is a special case of oneway ANOVA which is the linear model with a categorical predictor…

But that is too much too early. For now lets think about the linear
model as a means to understand the relationship between two numeric
variables, one a *predictor* and one a *response*. The basic idea of
course being that the response is somehow modeled by the predictor and
if you knew that model you would say a value of the response is that
model for that value of the predictor *plus* some random error. The
simple linear model uses a linear equation for the model. This gives
us three things to estimate: the intercept of the line, the slope of
the line, and a standard deviation of the random error.

The basic plot for showing a relationship between two numeric
variables is the scatterplot. R has several ways to plot this. Suppose
`x`

and `y`

are variables in a data set `our_data`

. Then these all
should work:

plot(our_data$x, our_data$y)

with(our_data, plot(x, y))

`plot(y ~ x, data=our_data)`

The latter uses R's model formula. For this use we have the left side
is the response, the right side a description of the predictor. The
`data=`

argument allows variable lookup within the data frame
`our_data`

thereby avoiding the need to type it twice or use with.

For the `heartrate`

data set in the `UsingR`

package we have two
variables `age`

and `maxrate`

. Make a scatter plot with `maxrate`

as
the response variable.

Which seems true

Add two lines to your graph with the equations

```
with(heartrate, {
abline(h = mean(maxrate))
abline(v = mean(age))
})
```

Why is the intersection the “center” of the diagram?

The center is useful for identifying correlation in a scatterplot. If the center is used as a new coordinate system how many of the 15 data points are in the upper right or lower left quadrant?

Which of the following R commands will list the number of cases in
data frame (15 for the `heartrate`

data set):

Correlation is described in the notes. Visually the center picture you
made above captures the idea: data is correlated if it tends to align
in 2 opposite quadrants of the four quadrants formed by
centering. Well that is a rough idea. Basically, if the data is as in
the heartrate data, large values of `age`

are related with smaller
than average values of `heartrate`

. This is an example of negative
correlation.

Enough with graphically, numerically we can find the correlation with
`cor`

.

Find the correlation of age and max heartrate using `cor`

. Use

```
with(heartrate, cor(age, maxrate))
```

Does `with(heartrate, cor(maxrate, age))`

give the same answer as the
previous one:

Does `with(heartrate, cor(scale(maxrate), scale(age)))`

give the same
answer as the previous one:

The linear model is fit in R using formula notation. For us here, a template is simply:

```
reponse ~ predictor, data=data_set
```

More general models (multiple predictors, mathematical transformations of predictors, categorical predictors, interactions) are also fit with extensions of this notation.

Here is how we would fit the heartrate data:

```
res <- lm(maxrate ~ age, heartrate)
summary(res)
```

```
##
## Call:
## lm(formula = maxrate ~ age, data = heartrate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.926 -2.538 0.388 3.187 6.624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 210.048 2.867 73.3 < 2e-16 ***
## age -0.798 0.070 -11.4 3.8e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.58 on 13 degrees of freedom
## Multiple R-squared: 0.909, Adjusted R-squared: 0.902
## F-statistic: 130 on 1 and 13 DF, p-value: 3.85e-08
```

The key values for interpreting the modal are the parameter estimates. Here we have estimates for the intercept (210.04846 ), the slope (-0.79773) and the estimate for the standard error of the residuals (4.578).

For the built-in `mtcars`

data set, fit `mpg`

as a response and `wt`

as a predictor. What is the estimate for the slope? (The
interpretation is every 1000 pounds cost this many miles per gallon).

For the `Cars93`

data set in the `MASS`

package, the following does a
similar computation:

```
require(MASS)
lm(MPG.city ~ I(Weight/1000), data=Cars93)
```

What is the value of the slope now?

The construct *I(Weight/1000)* is used to divide the Weight variable
by 1000 so that the scale is the same as with the mtcars data
set. When using formulas, simple math notations must be specified
AsIs using I, as the formula notation co-opts the familiar
symbols: +, -, *, and /.

The model object contains more than meets the eye. This is because R,
unlike SPSS say, is very terse in its output. You need to ask it to
get more out. Look at the basic output of the model and compare to the
following when a `summary`

is requested:

```
res <- lm(mpg ~ wt, mtcars)
res
```

```
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept) wt
## 37.29 -5.34
```

```
summary(res)
```

```
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.543 -2.365 -0.125 1.410 6.873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.285 1.878 19.86 < 2e-16 ***
## wt -5.344 0.559 -9.56 1.3e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.05 on 30 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.745
## F-statistic: 91.4 on 1 and 30 DF, p-value: 1.29e-10
```

The `summary`

function in R is generic and is implemented for many
different classes. For linear models it provides a summary of the
coefficients and the model fit, whereas the `lm`

object itself (`res`

)
just shows the estimates.

The summary function is an accessor function (or extractor
function). It is defined for most modeling functions you will meet,
though perhaps defined differently. Other such functions, among
others, are coef for the coefficients, residuals for the residuals
and plot for diagnostic plots.

Define `res`

by:

```
res <- lm(mpg ~ wt, mtcars)
```

Find `mean(resid(res))`

, the sample mean of the residuals.

Define `res`

as above

```
res <- lm(mpg ~ wt, mtcars)
```

Find `sd(resid(res))`

, the sample standard deviation of the residuals. Compare it to the value from this line of `summary(res)`

:

Residual standard error: 3.046 on 30 degrees of freedom