Week 3

statistics-dot-com

Week 3 of this class is about the basic linear model of statistics. We look at it through simple linear regression, but at the right level of abstraction it serves as a framework for nearly everything in the first two weeks as well. That is, the one-sample t-test is the linear model with no slope, the two-sample paired t-test is the linear model with slope 1, the two-sample t-test is a special case of one-way ANOVA, which is the linear model with a categorical predictor…

But that is too much too early. For now let's think of the linear model as a means to understand the relationship between two numeric variables, one a predictor and one a response. The basic idea is that the response is modeled by the predictor: if you knew the model, a value of the response would be the model's value at the corresponding predictor value plus some random error. The simple linear model uses a linear equation for the model. This gives us three things to estimate: the intercept of the line, the slope of the line, and the standard deviation of the random error.
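As a minimal sketch of this idea, the following simulates data from a simple linear model and then recovers the three quantities with lm. The true values (intercept 2, slope 0.5, error standard deviation 1), the sample size, and the seed are arbitrary choices for illustration and are not part of the course data.

```r
# Simulate from: response = intercept + slope * predictor + random error
set.seed(1)                       # arbitrary seed, for reproducibility only
n <- 200
x <- runif(n, min = 0, max = 10)  # predictor
y <- 2 + 0.5 * x + rnorm(n, mean = 0, sd = 1)  # response

fit <- lm(y ~ x)
est_intercept <- unname(coef(fit)[1])
est_slope     <- unname(coef(fit)[2])
est_sigma     <- summary(fit)$sigma  # estimated sd of the random error

c(intercept = est_intercept, slope = est_slope, sigma = est_sigma)
```

The estimates should land near, but not exactly on, the values used to generate the data.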

The basic plot for showing a relationship between two numeric variables is the scatterplot. R has several ways to plot this. Suppose x and y are variables in a data set our_data. Then these all should work:

plot(our_data$x, our_data$y)
with(our_data, plot(x, y))
plot(y ~ x, data=our_data)

The last uses R's model formula. In this use, the left side is the response and the right side describes the predictor. The data= argument lets R look up variables within the data frame our_data, avoiding the need to type its name twice or to use with.

For the heartrate data set in the UsingR package we have two variables age and maxrate. Make a scatter plot with maxrate as the response variable.

Which seems true?

Roughly as age increases maximum heart rate increases

Roughly as age increases maximum heart rate decreases

Add two lines to your graph with the commands

with(heartrate, {
    abline(h = mean(maxrate))
    abline(v = mean(age))
})

Why is the intersection the “center” of the diagram?

The vertical line is the mean – or center – of the age variable

The horizontal line is the mean – or center – of the max heart rate variable

Both of the above

The center is useful for identifying correlation in a scatterplot. If the center is used as the origin of a new coordinate system, how many of the 15 data points are in the upper right or lower left quadrant?

Which of the following R commands returns the number of cases in a data frame (15 for the heartrate data set)?

nocases(heartrate)

length(heartrate)

nrow(heartrate)

n(heartrate)

Correlation is described in the notes. Visually, the centered picture you made above captures the idea: data are correlated if the points tend to fall in two opposite quadrants of the four formed by centering. That is only a rough idea. In the heartrate data, larger than average values of age tend to go with smaller than average values of maxrate. This is an example of negative correlation.
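The quadrant idea can be sketched in code by centering both variables and checking the sign of the product of the deviations. This example uses the built-in mtcars data (wt and mpg) as a stand-in rather than heartrate, and the variable names are only illustrative.

```r
# Center both variables, then look at the sign of the product of deviations.
x <- mtcars$wt
y <- mtcars$mpg
dev_product <- (x - mean(x)) * (y - mean(y))

n_same     <- sum(dev_product > 0)  # upper right or lower left quadrant
n_opposite <- sum(dev_product < 0)  # upper left or lower right quadrant

c(same = n_same, opposite = n_opposite)
```

When most points fall in the upper left and lower right quadrants, as they do here, the relationship is negative.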

So much for the graphical view. Numerically, we can find the correlation with cor.

Find the correlation of age and max heartrate using cor. Use:

with(heartrate, cor(age, maxrate))

Does with(heartrate, cor(maxrate, age)) give the same answer as the previous one?

yes

Does with(heartrate, cor(scale(maxrate), scale(age))) give the same answer as the previous one?

yes
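To see what cor computes, the sample correlation can be written as the sum of products of standardized values divided by n - 1. The following is a small sketch of that formula using the built-in mtcars data as a stand-in; the names r_by_hand and r_builtin are just illustrative.

```r
x <- mtcars$wt
y <- mtcars$mpg
n <- length(x)

# Standardize each variable: subtract the mean, divide by the sample sd
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

r_by_hand <- sum(zx * zy) / (n - 1)
r_builtin <- cor(x, y)

all.equal(r_by_hand, r_builtin)
```

Note that scale returns a one-column matrix, so cor applied to two scaled vectors returns a 1 x 1 matrix rather than a plain number.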

The linear model is fit in R using formula notation. For us here, a template is simply:

response ~ predictor, data=data_set

More general models (multiple predictors, mathematical transformations of predictors, categorical predictors, interactions) are also fit with extensions of this notation.

Here is how we would fit the heartrate data:

res <- lm(maxrate ~ age, heartrate)
summary(res)

## 
## Call:
## lm(formula = maxrate ~ age, data = heartrate)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.926 -2.538  0.388  3.187  6.624 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  210.048      2.867    73.3  < 2e-16 ***
## age           -0.798      0.070   -11.4  3.8e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.58 on 13 degrees of freedom
## Multiple R-squared:  0.909,  Adjusted R-squared:  0.902 
## F-statistic:  130 on 1 and 13 DF,  p-value: 3.85e-08

The key values for interpreting the model are the parameter estimates. Here we have estimates for the intercept (210.04846), the slope (-0.79773), and the standard deviation of the random error, reported as the residual standard error (4.578).
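These three estimates can also be pulled out of a fitted model programmatically instead of being read off the printout. A minimal sketch, using the built-in cars data set (dist and speed) as a stand-in so that it runs without the UsingR package:

```r
fit <- lm(dist ~ speed, data = cars)

intercept <- unname(coef(fit)["(Intercept)"])
slope     <- unname(coef(fit)["speed"])
res_se    <- summary(fit)$sigma  # "Residual standard error" in the printout

# The residual standard error divides by the residual degrees of freedom
# (n - 2 for a simple linear model), not by n - 1
res_se_by_hand <- sqrt(sum(resid(fit)^2) / df.residual(fit))

c(intercept = intercept, slope = slope, res_se = res_se)
all.equal(res_se, res_se_by_hand)
```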

For the built-in mtcars data set, fit mpg as a response and wt as a predictor. What is the estimate for the slope? (The interpretation is that every additional 1000 pounds costs this many miles per gallon.)

For the Cars93 data set in the MASS package, the following does a similar computation:

library(MASS)
lm(MPG.city ~ I(Weight/1000), data=Cars93)

What is the value of the slope now?

The construct I(Weight/1000) divides the Weight variable by 1000 so that the scale matches the mtcars data set, where wt is recorded in thousands of pounds. Within formulas, ordinary arithmetic must be marked AsIs using I, because the formula notation co-opts the familiar symbols +, -, *, / and ^.
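A short sketch of the difference I makes, using mtcars. The sum wt + hp has no physical meaning and is used only to show how the formula is parsed; the object names are illustrative.

```r
# Without I(), "+" means "add another predictor": three coefficients
fit_two_terms <- lm(mpg ~ wt + hp, data = mtcars)

# With I(), "+" is ordinary arithmetic: one combined predictor, two coefficients
fit_sum <- lm(mpg ~ I(wt + hp), data = mtcars)

n_coef_two_terms <- length(coef(fit_two_terms))
n_coef_sum       <- length(coef(fit_sum))

# Rescaling inside I(): wt is in 1000s of pounds, so wt * 1000 is in pounds
fit_klb <- lm(mpg ~ wt, data = mtcars)
fit_lb  <- lm(mpg ~ I(wt * 1000), data = mtcars)
slope_ratio <- unname(coef(fit_klb)[2] / coef(fit_lb)[2])

c(two_terms = n_coef_two_terms, sum = n_coef_sum, ratio = slope_ratio)
```

Changing the units of the predictor by a factor of 1000 changes the slope by the same factor, which is why the rescaling matters when comparing the two data sets.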

The model object contains more than meets the eye. This is because R, unlike, say, SPSS, is very terse in its output. You need to ask to get more out of it. Look at the basic output of the model and compare it to the following, where a summary is requested:

res <- lm(mpg ~ wt, mtcars)
res

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##       37.29        -5.34

summary(res)

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.543 -2.365 -0.125  1.410  6.873 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   37.285      1.878   19.86  < 2e-16 ***
## wt            -5.344      0.559   -9.56  1.3e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.05 on 30 degrees of freedom
## Multiple R-squared:  0.753,  Adjusted R-squared:  0.745 
## F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.29e-10

The summary function in R is generic and is implemented for many different classes. For linear models it provides a summary of the coefficients and the model fit, whereas the lm object itself (res) just shows the estimates.

The summary function is an extractor (or accessor) function. It is defined for most modeling functions you will meet, though the details differ by model class. Other such functions include coef for the coefficients, residuals for the residuals, and plot for diagnostic plots.
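As a brief sketch of how these extractors are used, the following fits the mtcars model under the name fit (an arbitrary name chosen to keep it separate from res) and checks that each observation splits into a fitted value plus a residual.

```r
fit <- lm(mpg ~ wt, data = mtcars)

class(fit)            # "lm"
names(fit)            # components stored in the model object

b  <- coef(fit)       # named vector: intercept and slope
e  <- residuals(fit)  # one residual per case (resid() is a synonym)
yh <- fitted(fit)     # fitted values on the line

# Each observation decomposes as fitted value + residual
all.equal(unname(yh + e), mtcars$mpg)
```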

Define res by:

res <- lm(mpg ~ wt, mtcars)

Find mean(resid(res)), the sample mean of the residuals.

Define res as above:

res <- lm(mpg ~ wt, mtcars)

Find sd(resid(res)), the sample standard deviation of the residuals. Compare it to the value from this line of summary(res):

Residual standard error: 3.046 on 30 degrees of freedom

They are the same

The one from sd is larger

The one from sd is smaller
