Syllabus is revised. Exam #2 is scheduled for two weeks from today. Thursday Nov 7th. We will have a review on Tuesday Nov 5th.
Regression analysis with two or more explanatory variables is called multiple regression.
Examples: Regression of ozone concentration on air temperature, wind speed, and radiation; regression of voting preference on income, education and geography, regression of exam scores on homework grade and seat location in class, regression of mammal brain size on body weight and gestation period, regression of mortality rates on family income, race, and location.
Note that you regress the response variable ON the explanatory variables. You don't say "I performed a regression of my variables" or "I regressed y and x."
Everything in simple (bivariate) regression carries over to multiple regression. Each explanatory variable contributes a term to the model. There is a slope coefficient for each explanatory variable.
With one explanatory variable the regression model is described by a straight line. With two explanatory variables it is described by a flat plane. You can view this plane using a three dimensional scatter plot. With more than two explanatory variables the model is described by a hyperplane.
Here I use the function scatterplot3d() in the scatterplot3d package to view a regression model with two explanatory variables.
require(scatterplot3d)
## Loading required package: scatterplot3d
I use the trees data frame and first regress lumber volume on tree girth and add the model as a line on the scatter plot.
one = lm(Volume ~ Girth, data = trees)
plot(trees$Volume ~ trees$Girth, pch = 16, xlab = "Tree diameter (in)", ylab = "Timber volume (cubic ft)")
abline(one)
Here I use the traditional (base) graphics system.
Next I add tree height to the bivariate (simple) regression. That is, I regress lumber volume on girth and height.
two = lm(Volume ~ Girth + Height, data = trees)
s3d = scatterplot3d(trees, angle = 55, scale.y = 0.7, pch = 16, xlab = "Tree diameter (in)",
zlab = "Timber volume (cubic ft)", ylab = "Tree height (ft)")
s3d$plane3d(two)
You can see that lumber volume increases with tree diameter and tree height.
By changing the view angle on the plot you can see that some observations are above the plane and some are below.
s3d = scatterplot3d(trees, angle = 32, scale.y = 0.7, pch = 16, xlab = "Tree diameter (in)",
zlab = "Timber volume (cubic ft)", ylab = "Tree height (ft)")
s3d$plane3d(two)
The distance along the vertical axis from the observation to the model plane is the residual.
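A quick check (added here for illustration, using the two-variable model two fit above): the residuals reported by residuals() are exactly the observed volumes minus the fitted values on the plane.
# vertical distances from the observations to the model plane (observed minus fitted)
head(trees$Volume - fitted(two))
# the same values are stored with the model object
head(residuals(two))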
The simple regression model is given by:
y_i = b0 + b1 * x_i + e_i,
where y_i is the response variable and x_i is the explanatory variable. The subscript indicates that y and x are vectors with length equal to the number of observations and the equation holds for each pair of observations.
The intercept is b0 and the slope is b1. These are single values and are estimated using the method of least squares. They are called coefficients or parameters.
The vector e is called the residuals. There is one residual for each observation. The residuals are assumed to be described by a set of normal distributions each centered on zero and having the same variance (sigma squared).
The multiple regression model is given by:
y_i = b0 + b1 * x_i1 + b2 * x_i2 + … + bp * x_ip + e_i,
where there are p explanatory variables. The explanatory variables are written with a double subscript to indicate the observation number (1st subscript) and the variable number (2nd subscript).
Each explanatory variable gets a coefficient, so together with the intercept there are p + 1 coefficients.
There is a single response variable y_i and a single set of residuals e_i. The residuals are again assumed to be described by a set of normal distributions each centered on zero and having the same variance (sigma squared).
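To make the model concrete, here is a small simulation (not part of the lecture data; the coefficient values and error standard deviation are made up) that generates observations from a two-variable regression model and then recovers the coefficients with lm().
set.seed(1)
n = 100
x1 = runif(n)
x2 = runif(n)
# true model: y = 2 + 3 * x1 - 1 * x2 + e, with e drawn from a normal distribution (sd = 0.5)
y = 2 + 3 * x1 - 1 * x2 + rnorm(n, mean = 0, sd = 0.5)
coef(lm(y ~ x1 + x2))  # the estimates should be close to 2, 3, and -1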
You inspect your data before modeling it. If you have more than one explanatory variable, a set of scatter plots is a good start.
Let's look at some data: The data set PetrolConsumption.txt contains the amount of gasoline consumed by state for one year.
PC = read.table("http://myweb.fsu.edu/jelsner/data/PetrolConsumption.txt", header = TRUE)
head(PC)
## Petrol.Tax Avg.Inc Pavement Prop.DL Petrol.Consumption
## 1 9.0 3571 1976 0.525 541
## 2 9.0 4092 1250 0.572 524
## 3 9.0 3865 1586 0.580 561
## 4 7.5 4870 2351 0.529 414
## 5 8.0 4399 431 0.544 410
## 6 10.0 5342 1333 0.571 457
The data set contains petrol tax (Petrol.Tax) [cents per gallon], per capita income (Avg.Inc) [$ /10], miles of paved highway (Pavement), proportion of drivers (Prop.DL), and consumption of petrol (Petrol.Consumption) [millions of gallons].
You are interested in a multiple regression model of gasoline consumption on tax, income, pavement and proportion of drivers.
That is, you want to know what influences petrol consumption; this is your response variable. You want to know which variables explain the variation in petrol consumption at the state level and to what relative degree.
First create pairwise scatter plots with all the variables in the data frame using the ggpairs() function from the GGally package.
library(GGally)
## Warning: package 'GGally' was built under R version 3.0.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.0.2
## Loading required package: reshape
## Warning: package 'reshape' was built under R version 3.0.2
## Loading required package: plyr
## Warning: package 'plyr' was built under R version 3.0.2
##
## Attaching package: 'reshape'
##
## The following object is masked from 'package:plyr':
##
## rename, round_any
## Warning: replacing previous import 'rename' when loading 'reshape'
## Warning: replacing previous import 'round_any' when loading 'reshape'
ggpairs(PC)
The scatter plots are arranged in rows and columns with the diagonal entry the name of the variable. Variables are arranged from upper left to lower right by column number. The first column in the data set is Petrol.Tax then Avg.Inc, etc.
The graph in row 2 column 1 is the scatter plot of Avg.Inc on the vertical axis and Petrol.Tax on the horizontal axis. The graph in row 3 column 1 is the scatter plot of Pavement on the vertical axis and Petrol.Tax on the horizontal axis. And so on.
The value in row 1 column 2 is the Pearson correlation between Petrol.Tax and Avg.Inc. The value in row 1 column 3 is the correlation between Pavement and Petrol.Tax and so on.
Since Petrol.Consumption is your response variable, you focus on the set of scatter plots in row 5 where this variable is on the vertical axis.
Which of the explanatory variables appears to be the most (least) related to petrol consumption?
The variable with the highest correlation with petrol consumption is found by typing:
cor(PC)
## Petrol.Tax Avg.Inc Pavement Prop.DL
## Petrol.Tax 1.00000 0.01267 -0.52213 -0.28804
## Avg.Inc 0.01267 1.00000 0.05016 0.15707
## Pavement -0.52213 0.05016 1.00000 -0.06413
## Prop.DL -0.28804 0.15707 -0.06413 1.00000
## Petrol.Consumption -0.45128 -0.24486 0.01904 0.69897
## Petrol.Consumption
## Petrol.Tax -0.45128
## Avg.Inc -0.24486
## Pavement 0.01904
## Prop.DL 0.69897
## Petrol.Consumption 1.00000
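To pull out just the correlations with the response and order them by absolute value (an optional convenience), subset the correlation matrix:
# correlations of each variable with petrol consumption, strongest first
r = cor(PC)[, "Petrol.Consumption"]
sort(abs(r), decreasing = TRUE)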
The pairs.panels() function in the psych package produces a similar display, with scatter plots and loess fits in the lower triangle, correlation values in the upper triangle, and histograms along the diagonal.
require(psych)
## Loading required package: psych
## Warning: package 'psych' was built under R version 3.0.2
##
## Attaching package: 'psych'
##
## The following object is masked from 'package:ggplot2':
##
## %+%
pairs.panels(PC)
## NULL
To fit a multiple regression model to the petrol data type
model1 = lm(Petrol.Consumption ~ Prop.DL + Pavement + Avg.Inc + Petrol.Tax,
data = PC)
To see the model coefficients type:
model1
##
## Call:
## lm(formula = Petrol.Consumption ~ Prop.DL + Pavement + Avg.Inc +
## Petrol.Tax, data = PC)
##
## Coefficients:
## (Intercept) Prop.DL Pavement Avg.Inc Petrol.Tax
## 3.77e+02 1.34e+03 -2.43e-03 -6.66e-02 -3.48e+01
Gasoline consumption increases with the proportion of drivers and decreases with the amount of pavement, average income and gas tax. Can you offer interpretations?
The equation for the model is: Average petrol consumption [millions of gallons] = 377.3 + 1336 * Prop.DL – 0.0024 * Pavement – 0.06659 * Avg.Inc – 34.79 * Petrol.Tax
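You can check this equation against the model object. The predictor values below describe a hypothetical state (made up for illustration); the hand calculation from coef() and predict() should agree.
b = coef(model1)
new = data.frame(Petrol.Tax = 8, Avg.Inc = 4000, Pavement = 1500, Prop.DL = 0.55)
# prediction written out from the fitted equation
b["(Intercept)"] + b["Prop.DL"] * new$Prop.DL + b["Pavement"] * new$Pavement +
    b["Avg.Inc"] * new$Avg.Inc + b["Petrol.Tax"] * new$Petrol.Tax
# the same prediction from predict()
predict(model1, newdata = new)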
The variables that are statistically significant in explaining petrol consumption are found by looking at the table of coefficients output from the summary() function.
summary(model1)
##
## Call:
## lm(formula = Petrol.Consumption ~ Prop.DL + Pavement + Avg.Inc +
## Petrol.Tax, data = PC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -122.0 -45.6 -10.7 31.5 234.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.77e+02 1.86e+02 2.03 0.04821 *
## Prop.DL 1.34e+03 1.92e+02 6.95 1.5e-08 ***
## Pavement -2.43e-03 3.39e-03 -0.72 0.47800
## Avg.Inc -6.66e-02 1.72e-02 -3.87 0.00037 ***
## Petrol.Tax -3.48e+01 1.30e+01 -2.68 0.01033 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.3 on 43 degrees of freedom
## Multiple R-squared: 0.679, Adjusted R-squared: 0.649
## F-statistic: 22.7 on 4 and 43 DF, p-value: 3.91e-10
You can see from the table that Pavement is not significant.
Recall the null hypothesis states that the variable is NOT important in explaining the variation in the response variable.
The p value is a measure of the credibility (evidence in support) of the null hypothesis. The larger the p value the more credible the null hypothesis.
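If you want just the p values (an optional shortcut), they can be pulled from the coefficient table stored in the summary object:
# p values for each coefficient in model1
summary(model1)$coefficients[, "Pr(>|t|)"]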
The coefficient estimates (slopes) are not directly comparable because they carry different measurement units; a coefficient has the units of the response variable divided by the units of its explanatory variable. That is, you can't say which variable is most important in explaining petrol consumption by looking at the magnitudes of the slopes: Petrol.Tax is NOT more important than Avg.Inc even though 34.79 is much larger than 0.0666. The t values, however, are comparable because they are standardized (each estimate is divided by its standard error).
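One optional way to put the slopes on a common scale (an illustration, not part of the original analysis) is to refit the model after standardizing (z-scoring) every variable; the resulting coefficients are in standard-deviation units and can be compared directly.
# standardize all variables to mean 0 and standard deviation 1, then refit
PCs = as.data.frame(scale(PC))
coef(lm(Petrol.Consumption ~ Prop.DL + Pavement + Avg.Inc + Petrol.Tax, data = PCs))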
The model says that for every 0.01 increase in the proportion of drivers (one percentage point of the population), mean petrol consumption increases by about 13.4 million gallons, assuming Pavement, Avg.Inc and Petrol.Tax remain constant.
For every 1 cent/gallon increase in Petrol.Tax, mean petrol consumption decreases by 34.8 million gallons assuming Prop.DL, Pavement, and Avg.Inc are held constant.
The order of the explanatory variables does not affect the magnitude or the sign of the slope coefficients.
summary(lm(Petrol.Consumption ~ Pavement + Petrol.Tax + Avg.Inc + Prop.DL, data = PC))
##
## Call:
## lm(formula = Petrol.Consumption ~ Pavement + Petrol.Tax + Avg.Inc +
## Prop.DL, data = PC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -122.0 -45.6 -10.7 31.5 234.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.77e+02 1.86e+02 2.03 0.04821 *
## Pavement -2.43e-03 3.39e-03 -0.72 0.47800
## Petrol.Tax -3.48e+01 1.30e+01 -2.68 0.01033 *
## Avg.Inc -6.66e-02 1.72e-02 -3.87 0.00037 ***
## Prop.DL 1.34e+03 1.92e+02 6.95 1.5e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.3 on 43 degrees of freedom
## Multiple R-squared: 0.679, Adjusted R-squared: 0.649
## F-statistic: 22.7 on 4 and 43 DF, p-value: 3.91e-10
The multiple R-squared value is 0.68. That means the model that includes all four explanatory variables explains 68% of the variation in petrol consumption.
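The multiple R squared can be recovered by hand (an optional check) as one minus the residual sum of squares divided by the total sum of squares.
# R squared for the four-variable model computed from its residuals
sse = sum(residuals(model1)^2)
sst = sum((PC$Petrol.Consumption - mean(PC$Petrol.Consumption))^2)
1 - sse/sst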
Note that the p value on the model (given on the last line of the output) is small. The null hypothesis in this case is that none of the variables are important in explaining petrol consumption.
You are not finished. You need to try a simpler model by removing Pavement (the variable that is not statistically significant).
model2 = lm(Petrol.Consumption ~ Prop.DL + Avg.Inc + Petrol.Tax, data = PC)
summary(model2)
##
## Call:
## lm(formula = Petrol.Consumption ~ Prop.DL + Avg.Inc + Petrol.Tax,
## data = PC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.1 -51.2 -12.9 24.5 238.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 307.328 156.831 1.96 0.05639 .
## Prop.DL 1374.768 183.669 7.49 2.2e-09 ***
## Avg.Inc -0.068 0.017 -4.00 0.00024 ***
## Petrol.Tax -29.484 10.584 -2.79 0.00785 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.9 on 44 degrees of freedom
## Multiple R-squared: 0.675, Adjusted R-squared: 0.653
## F-statistic: 30.4 on 3 and 44 DF, p-value: 8.23e-11
The remaining explanatory variables in the model are all significant. The values are slightly different as the variance in the gas consumption attributed to Pavement is now spread out among the remaining variables. Removing explanatory variables affects the values of the parameters remaining in the model.
The proportion of the population with driving licenses is the most important variable as can be seen by having the largest t value (in absolute value).
Note that by removing a variable the R squared value DECREASES to 0.675. This is always the case: R squared never increases when you drop an explanatory variable. Thus the R squared statistic cannot be used to compare models with different numbers of explanatory variables.
The adjusted R squared is a modification of the R squared that accounts for the number of explanatory variables in the model. Unlike R squared, the adjusted R squared increases only if the new term improves the model more than would be expected by chance. The adjusted R squared can be negative, and will always be less than or equal to R squared.
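For reference, the adjusted R squared is computed from the ordinary R squared, the number of observations n, and the number of explanatory variables p. The sketch below checks the formula on model2.
# adjusted R squared = 1 - (1 - R squared) * (n - 1) / (n - p - 1)
s = summary(model2)
n = nrow(PC)  # number of observations (states)
p = 3         # explanatory variables in model2
1 - (1 - s$r.squared) * (n - 1)/(n - p - 1)  # should match s$adj.r.squared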
To see how the adjusted R squared is useful, note that Petrol.Tax has the smallest t value (in absolute value) among the remaining variables. Suppose you remove it. What do you find?
model3 = lm(Petrol.Consumption ~ Prop.DL + Avg.Inc, data = PC)
summary(model3)
##
## Call:
## lm(formula = Petrol.Consumption ~ Prop.DL + Avg.Inc, data = PC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -115.5 -45.9 -13.8 30.1 243.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.8371 122.4630 0.06 0.94926
## Prop.DL 1525.0429 188.2969 8.10 2.5e-10 ***
## Avg.Inc -0.0709 0.0182 -3.90 0.00032 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 70.7 on 45 degrees of freedom
## Multiple R-squared: 0.618, Adjusted R-squared: 0.601
## F-statistic: 36.3 on 2 and 45 DF, p-value: 4.06e-10
The adjusted R squared decreases (from 0.653 to 0.601), so you conclude that Petrol.Tax should stay in the model.
Thus you settle on a final model: Average petrol consumption [millions of gallons] = 307.3 + 1375 x Prop.DL – 0.06802 x Avg.Inc – 29.48 x Petrol.Tax
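To attach uncertainty to these coefficient estimates (an optional extra, not part of the original notes), 95% confidence intervals are available with confint():
# 95% confidence intervals on the coefficients of the final model
confint(model2)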
You are not finished yet. You need to check the model assumptions.
model2.df = fortify(model2)
ggplot(model2.df, aes(x = .fitted, y = .stdresid)) + geom_point() + geom_smooth() +
geom_hline(yintercept = 0)
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
require(sm)
## Loading required package: sm
## Warning: package 'sm' was built under R version 3.0.2
## Package `sm', version 2.2-5: type help(sm) for summary information
res = residuals(model2)
sm.density(res, xlab = "Model Residuals", model = "Normal")
qqnorm(model2$residuals)
qqline(model2$residuals)
There is some evidence against normality and constant variance. The fact that the residuals do not exactly follow a normal distribution influences the confidence you can place on your inferences. Remediation measures include transformation of the response variable or weighted least squares.
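As one illustration of a response transformation (a sketch, not a recommendation for these data), you can refit the final model with the log of consumption and re-examine the residuals.
# same explanatory variables, log-transformed response
model2.log = lm(log(Petrol.Consumption) ~ Prop.DL + Avg.Inc + Petrol.Tax, data = PC)
qqnorm(residuals(model2.log))
qqline(residuals(model2.log))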
With multivariate data it can be helpful to visualize a data set of numbers using a heat map. A heat map is a table that has colors in place of numbers, with the color corresponding to the level of the measurement. Each column can be a different measurement, or every column can show the same measurement, as in the example linked below.
http://online.wsj.com/article/SB125993225142676615.html#articleTabs%3Dinteractive
Here is how to make a heat map using R. You use the NBA statistics from a few years ago that were archived by flowingdata.com.
nba = read.csv("http://datasets.flowingdata.com/ppg2008.csv", sep = ",")
The players are ordered by points scored; reordering the Name factor by PTS ensures the rows sort properly in the plot.
nba$Name = with(nba, reorder(Name, PTS))
head(nba)
## Name G MIN PTS FGM FGA FGP FTM FTA FTP X3PM X3PA
## 1 Dwyane Wade 79 38.6 30.2 10.8 22.0 0.491 7.5 9.8 0.765 1.1 3.5
## 2 LeBron James 81 37.7 28.4 9.7 19.9 0.489 7.3 9.4 0.780 1.6 4.7
## 3 Kobe Bryant 82 36.2 26.8 9.8 20.9 0.467 5.9 6.9 0.856 1.4 4.1
## 4 Dirk Nowitzki 81 37.7 25.9 9.6 20.0 0.479 6.0 6.7 0.890 0.8 2.1
## 5 Danny Granger 67 36.2 25.8 8.5 19.1 0.447 6.0 6.9 0.878 2.7 6.7
## 6 Kevin Durant 74 39.0 25.3 8.9 18.8 0.476 6.1 7.1 0.863 1.3 3.1
## X3PP ORB DRB TRB AST STL BLK TO PF
## 1 0.317 1.1 3.9 5.0 7.5 2.2 1.3 3.4 2.3
## 2 0.344 1.3 6.3 7.6 7.2 1.7 1.1 3.0 1.7
## 3 0.351 1.1 4.1 5.2 4.9 1.5 0.5 2.6 2.3
## 4 0.359 1.1 7.3 8.4 2.4 0.8 0.8 1.9 2.2
## 5 0.404 0.7 4.4 5.1 2.7 1.0 1.4 2.5 3.1
## 6 0.422 1.0 5.5 6.5 2.8 1.3 0.7 3.0 1.8
Next the data frame is converted from wide to long format. The game statistics have different ranges so to make them comparable all the values are rescaled.
require(reshape2)
## Loading required package: reshape2
## Warning: package 'reshape2' was built under R version 3.0.2
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:reshape':
##
## colsplit, melt, recast
require(plyr)
nba.m = melt(nba)
## Using Name as id variables
nba.m = ddply(nba.m, .(variable), transform, rescale = scale(value))
To create the heat map type:
ggplot(nba.m, aes(variable, Name)) + geom_tile(aes(fill = rescale), colour = "white") +
scale_fill_gradient(low = "white", high = "steelblue")
The map makes it easier to see patterns and outliers.
Let's look at another example. A realtor can use multiple regression to justify a house selling price based on a list of desirable house features. Such data are commonly compiled by realtor boards.
Here you consider a data file (houseprice.txt) containing a random sample of 107 home sales in Albuquerque, New Mexico during the period February 15 through April 30, 1993 (Albuquerque Board of Realtors, 1993).
Input the data.
hp = read.table("http://myweb.fsu.edu/jelsner/data/houseprice.txt", header = TRUE)
head(hp)
## price sqft custom corner taxes
## 1 2050 2650 1 0 1639
## 2 2080 2600 1 0 1088
## 3 2150 2664 1 0 1193
## 4 2150 2921 1 0 1635
## 5 1999 2580 1 0 1732
## 6 1900 2580 0 0 1534
The data include the selling price (price), the living space in square feet (sqft), whether or not the house is custom built (custom; 1 = yes), whether or not the house sits on a corner lot (corner; 1 = yes), and the annual taxes (taxes).
As an initial overview of your data, type
ggpairs(hp)
As expected housing prices increase with square footage and taxes.
Scatter plots are not useful for binary variables.
With categorical or factor variables we are often interested in how the response variable's relationship with an explanatory variable depends on the level of the factor.
For instance, to examine the relationship between housing price and square footage conditional on whether or not the house is on a corner lot, type:
ggplot(hp, aes(y = price, x = sqft, color = factor(corner))) + geom_point() +
geom_smooth(method = lm, se = FALSE)
ggplot(hp, aes(y = price, x = sqft)) + geom_point() + geom_smooth(method = lm,
se = FALSE) + facet_wrap(~corner) + xlab("Living Space (sq ft)") + ylab("Selling Price ($1000)")
Note in the first plot the conditioning variable must be a factor.
The plot shows that housing prices increase with square footage, but less so for corner houses.
With custom as the conditioning variable, the relationship between house size and price is about the same for custom and non-custom houses.
ggplot(hp, aes(y = price, x = sqft)) + geom_point() + geom_smooth(method = lm,
se = FALSE) + facet_wrap(~custom) + xlab("Living Space (sq ft)") + ylab("Selling Price ($1000)")
There is one outlier.
You can condition on both corner and custom by typing:
ggplot(hp, aes(y = price, x = sqft, color = factor(corner))) + geom_point() +
geom_smooth(method = lm, se = FALSE) + facet_wrap(~custom) + xlab("Living Space (sq ft)") +
ylab("Selling Price ($1000)")
We proceed by regressing price on sqft, taxes, custom, and corner.
model1 = lm(price ~ sqft + taxes + custom + corner, data = hp)
summary(model1)
##
## Call:
## lm(formula = price ~ sqft + taxes + custom + corner, data = hp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -544.6 -99.5 -4.8 64.8 510.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 175.166 56.312 3.11 0.00242 **
## sqft 0.208 0.061 3.40 0.00096 ***
## taxes 0.677 0.101 6.70 1.2e-09 ***
## custom 156.815 44.495 3.52 0.00064 ***
## corner -83.401 40.059 -2.08 0.03985 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 162 on 102 degrees of freedom
## Multiple R-squared: 0.828, Adjusted R-squared: 0.821
## F-statistic: 123 on 4 and 102 DF, p-value: <2e-16
Price is strongly associated with size, taxes, and custom. Corner has a significant negative coefficient. Thus you state that after controlling for size, taxes, and custom, there is evidence that corner houses, on average, sell at a lower price.
Note again that although we cannot directly compare the estimated slope values because they have different measurement units, the t values are comparable as they represent standardized coefficients.
On average, each additional square foot of living space corresponds to a 0.2076 x $100 = $20.76 increase in price, and on average custom houses sell for 156.8148 x $100 = $15,681 more than regular houses. The R squared of 0.828 says that 82.8% of the variability in price is accounted for by these four explanatory variables.
To predict a selling price of a 2000 sq ft house with annual taxes of $1000 that is custom built and not on a corner, type:
predict(model1, data.frame(sqft = 2000, taxes = 1000, custom = 1, corner = 0),
interval = "confidence")
## fit lwr upr
## 1 1424 1356 1493
Let's return to the body fat data set. Read the data and create a scatter plot matrix.
bf = read.table("http://myweb.fsu.edu/jelsner/data/fat.txt", header = TRUE)
ggpairs(bf)
Here you see that abdomen has the strongest linear relationship with body fat, but the other variables are also correlated with it. Also note the large correlations among the explanatory variables themselves.
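To quantify the correlation among the explanatory variables themselves (an optional check), look at their correlation matrix.
# correlations among the four explanatory variables, rounded to two decimals
round(cor(bf[, c("abdomen", "biceps", "forearm", "wrist")]), 2)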
You begin with a model that includes all four explanatory variables.
model1 = lm(bodyfat ~ abdomen + biceps + forearm + wrist, data = bf)
summary(model1)
##
## Call:
## lm(formula = bodyfat ~ abdomen + biceps + forearm + wrist, data = bf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.125 -3.228 0.079 3.332 7.709
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0265 10.5786 1.04 0.3032
## abdomen 0.7843 0.0676 11.61 1.1e-14 ***
## biceps -0.8957 0.3166 -2.83 0.0071 **
## forearm 1.4538 0.4613 3.15 0.0030 **
## wrist -4.2950 0.9021 -4.76 2.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.02 on 42 degrees of freedom
## Multiple R-squared: 0.817, Adjusted R-squared: 0.8
## F-statistic: 47 on 4 and 42 DF, p-value: 5.68e-15
Do you see anything strange in these values?
For example, the coefficient on biceps is negative, which is hard to justify physically. This is an indication of collinearity: large correlations between the explanatory variables can produce a model that does not make physical sense.
The rule of thumb is that when the correlation between two explanatory variables exceeds 0.6, collinearity can be a problem.
When two explanatory variables have large correlation then estimates of the model parameters are not precise. A model with imprecise parameter estimates is not useful.
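A common numerical diagnostic for collinearity (not used in these notes) is the variance inflation factor, available as vif() in the car package; values well above 1 indicate that a coefficient's variance is inflated by correlation with the other explanatory variables.
# variance inflation factors for the body fat model (requires the car package)
require(car)
vif(model1)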
The best approach in this situation is to reduce the set of explanatory variables.
Remove the explanatory variable that is hardest to account for based on physical arguments. For the fat data, it is probably best to remove all variables except abdomen.
Davies and Goldsmith (1972) investigated abrasion loss (abrasion) [grams per hour] of rubber samples as a function of hardness (higher values indicate harder rubber) and tensile strength (strength) [kg/cm2].
The data are in AbrasionLoss.txt. Input the data using
AL = read.table("http://myweb.fsu.edu/jelsner/data/AbrasionLoss.txt", header = TRUE)
a. Create a scatter plot matrix of the three variables. Based on the scatter of points in the plot of abrasion versus strength does it appear that tensile strength would be helpful in explaining abrasion loss?
b. Regress abrasion loss on hardness and strength. What is the adjusted R squared value? Is strength an important explanatory variable after accounting for hardness?
c. On average how much additional abrasion is lost for every 1 kg/cm2 increase in tensile strength?
d. Check the correlations between the explanatory variables. Could collinearity be a problem for interpreting the model?
e. Find the 95% prediction interval for the abrasion corresponding to a new rubber sample having a hardness of 60 units and a tensile strength of 200 kg/cm2.