# Week 3

statistics-dot-com

Week 3 of this class is about the basic linear model of statistics. We look at it in terms of simple linear regression, but with the right level of abstraction it can be used as a framework to do most all of the first two weeks as well. That is, the one-sample t-test is the linear model with no slope, the two-sample paired t-test is the linear model with slope 1, the two-sample t-test is a special case of oneway ANOVA which is the linear model with a categorical predictor…

But that is too much too early. For now lets think about the linear model as a means to understand the relationship between two numeric variables, one a predictor and one a response. The basic idea of course being that the response is somehow modeled by the predictor and if you knew that model you would say a value of the response is that model for that value of the predictor plus some random error. The simple linear model uses a linear equation for the model. This gives us three things to estimate: the intercept of the line, the slope of the line, and a standard deviation of the random error.

The basic plot for showing a relationship between two numeric variables is the scatterplot. R has several ways to plot this. Suppose `x` and `y` are variables in a data set `our_data`. Then these all should work:

• plot(our_data\$x, our_data\$y)

• with(our_data, plot(x, y))

• `plot(y ~ x, data=our_data)`

The latter uses R's model formula. For this use we have the left side is the response, the right side a description of the predictor. The `data=` argument allows variable lookup within the data frame `our_data` thereby avoiding the need to type it twice or use with.

For the `heartrate` data set in the `UsingR` package we have two variables `age` and `maxrate`. Make a scatter plot with `maxrate` as the response variable.

Which seems true

``````with(heartrate, {
abline(h = mean(maxrate))
abline(v = mean(age))
})
``````

Why is the intersection the “center” of the diagram?

The center is useful for identifying correlation in a scatterplot. If the center is used as a new coordinate system how many of the 15 data points are in the upper right or lower left quadrant?

Which of the following R commands will list the number of cases in data frame (15 for the `heartrate` data set):

Correlation is described in the notes. Visually the center picture you made above captures the idea: data is correlated if it tends to align in 2 opposite quadrants of the four quadrants formed by centering. Well that is a rough idea. Basically, if the data is as in the heartrate data, large values of `age` are related with smaller than average values of `heartrate`. This is an example of negative correlation.

Enough with graphically, numerically we can find the correlation with `cor`.

Find the correlation of age and max heartrate using `cor`. Use

``````with(heartrate, cor(age, maxrate))
``````

Does `with(heartrate, cor(maxrate, age))` give the same answer as the previous one:

Does `with(heartrate, cor(scale(maxrate), scale(age)))` give the same answer as the previous one:

The linear model is fit in R using formula notation. For us here, a template is simply:

``````reponse ~ predictor, data=data_set
``````

More general models (multiple predictors, mathematical transformations of predictors, categorical predictors, interactions) are also fit with extensions of this notation.

Here is how we would fit the heartrate data:

``````res <- lm(maxrate ~ age, heartrate)
summary(res)
``````
``````##
## Call:
## lm(formula = maxrate ~ age, data = heartrate)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -8.926 -2.538  0.388  3.187  6.624
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  210.048      2.867    73.3  < 2e-16 ***
## age           -0.798      0.070   -11.4  3.8e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.58 on 13 degrees of freedom
## Multiple R-squared:  0.909,  Adjusted R-squared:  0.902
## F-statistic:  130 on 1 and 13 DF,  p-value: 3.85e-08
``````

The key values for interpreting the modal are the parameter estimates. Here we have estimates for the intercept (210.04846 ), the slope (-0.79773) and the estimate for the standard error of the residuals (4.578).

For the built-in `mtcars` data set, fit `mpg` as a response and `wt` as a predictor. What is the estimate for the slope? (The interpretation is every 1000 pounds cost this many miles per gallon).

For the `Cars93` data set in the `MASS` package, the following does a similar computation:

``````require(MASS)
lm(MPG.city ~ I(Weight/1000), data=Cars93)
``````

What is the value of the slope now?

The construct *I(Weight/1000)* is used to divide the Weight variable by 1000 so that the scale is the same as with the mtcars data set. When using formulas, simple math notations must be specified AsIs using I, as the formula notation co-opts the familiar symbols: +, -, *, and /.

The model object contains more than meets the eye. This is because R, unlike SPSS say, is very terse in its output. You need to ask it to get more out. Look at the basic output of the model and compare to the following when a `summary` is requested:

``````res <- lm(mpg ~ wt, mtcars)
res
``````
``````##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept)           wt
##       37.29        -5.34
``````
``````summary(res)
``````
``````##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -4.543 -2.365 -0.125  1.410  6.873
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   37.285      1.878   19.86  < 2e-16 ***
## wt            -5.344      0.559   -9.56  1.3e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.05 on 30 degrees of freedom
## Multiple R-squared:  0.753,  Adjusted R-squared:  0.745
## F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.29e-10
``````

The `summary` function in R is generic and is implemented for many different classes. For linear models it provides a summary of the coefficients and the model fit, whereas the `lm` object itself (`res`) just shows the estimates.

The summary function is an accessor function (or extractor function). It is defined for most modeling functions you will meet, though perhaps defined differently. Other such functions, among others, are coef for the coefficients, residuals for the residuals and plot for diagnostic plots.

Define `res` by:

``````res <- lm(mpg ~ wt, mtcars)
``````

Find `mean(resid(res))`, the sample mean of the residuals.

Define `res` as above

``````res <- lm(mpg ~ wt, mtcars)
``````

Find `sd(resid(res))`, the sample standard deviation of the residuals. Compare it to the value from this line of `summary(res)`:

```Residual standard error: 3.046 on 30 degrees of freedom
```