Data and Code Background

Several of the data sets and code examples below are taken from Nicholas Horton's R Markdown reproducible analysis source files, which can be found at http://nhorton.people.amherst.edu/sdm4. Most of the examples are from the Fourth Edition of Stats: Data and Models (2014) by De Veaux, Velleman, and Bock.

Inferences for Regression

Exploring the data

library(mosaic); library(readr)
BodyFat <- read_csv("http://nhorton.people.amherst.edu/sdm4/data/Body_fat_complete.csv")
dim(BodyFat)
## [1] 250  16
glimpse(BodyFat)
## Rows: 250
## Columns: 16
## $ `Body Density` <dbl> 1.07, 1.09, 1.04, 1.08, 1.03, 1.05, 1.05, 1.07, 1.09, …
## $ PctBF          <dbl> 12.3, 6.1, 25.3, 10.4, 28.7, 20.9, 19.2, 12.4, 4.1, 11…
## $ Age            <dbl> 23, 22, 22, 26, 24, 24, 26, 25, 25, 23, 26, 27, 32, 30…
## $ Weight         <dbl> 154, 173, 154, 185, 184, 210, 181, 176, 191, 198, 186,…
## $ Height         <dbl> 67.8, 72.2, 66.2, 72.2, 71.2, 74.8, 69.8, 72.5, 74.0, …
## $ Neck           <dbl> 36.2, 38.5, 34.0, 37.4, 34.4, 39.0, 36.4, 37.8, 38.1, …
## $ Chest          <dbl> 93.1, 93.6, 95.8, 101.8, 97.3, 104.5, 105.1, 99.6, 100…
## $ Abdomen        <dbl> 85.2, 83.0, 87.9, 86.4, 100.0, 94.4, 90.7, 88.5, 82.5,…
## $ waist          <dbl> 33.5, 32.7, 34.6, 34.0, 39.4, 37.2, 35.7, 34.8, 32.5, …
## $ Hip            <dbl> 94.5, 98.7, 99.2, 101.2, 101.9, 107.8, 100.3, 97.1, 99…
## $ Thigh          <dbl> 59.0, 58.7, 59.6, 60.1, 63.2, 66.0, 58.4, 60.0, 62.9, …
## $ Knee           <dbl> 37.3, 37.3, 38.9, 37.3, 42.2, 42.0, 38.3, 39.4, 38.3, …
## $ Ankle          <dbl> 21.9, 23.4, 24.0, 22.8, 24.0, 25.6, 22.9, 23.2, 23.8, …
## $ Bicep          <dbl> 32.0, 30.5, 28.8, 32.4, 32.2, 35.7, 31.9, 30.5, 35.9, …
## $ Forearm        <dbl> 27.4, 28.9, 25.2, 29.4, 27.7, 30.6, 27.8, 29.0, 31.1, …
## $ Wrist          <dbl> 17.1, 18.2, 16.6, 18.2, 17.7, 18.8, 17.7, 18.8, 18.2, …
str(BodyFat) # displays structure of body fat data  (an R object)
## tibble [250 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Body Density: num [1:250] 1.07 1.09 1.04 1.08 1.03 ...
##  $ PctBF       : num [1:250] 12.3 6.1 25.3 10.4 28.7 20.9 19.2 12.4 4.1 11.7 ...
##  $ Age         : num [1:250] 23 22 22 26 24 24 26 25 25 23 ...
##  $ Weight      : num [1:250] 154 173 154 185 184 ...
##  $ Height      : num [1:250] 67.8 72.2 66.2 72.2 71.2 ...
##  $ Neck        : num [1:250] 36.2 38.5 34 37.4 34.4 39 36.4 37.8 38.1 42.1 ...
##  $ Chest       : num [1:250] 93.1 93.6 95.8 101.8 97.3 ...
##  $ Abdomen     : num [1:250] 85.2 83 87.9 86.4 100 94.4 90.7 88.5 82.5 88.6 ...
##  $ waist       : num [1:250] 33.5 32.7 34.6 34 39.4 ...
##  $ Hip         : num [1:250] 94.5 98.7 99.2 101.2 101.9 ...
##  $ Thigh       : num [1:250] 59 58.7 59.6 60.1 63.2 66 58.4 60 62.9 63.1 ...
##  $ Knee        : num [1:250] 37.3 37.3 38.9 37.3 42.2 42 38.3 39.4 38.3 41.7 ...
##  $ Ankle       : num [1:250] 21.9 23.4 24 22.8 24 25.6 22.9 23.2 23.8 25 ...
##  $ Bicep       : num [1:250] 32 30.5 28.8 32.4 32.2 35.7 31.9 30.5 35.9 35.6 ...
##  $ Forearm     : num [1:250] 27.4 28.9 25.2 29.4 27.7 30.6 27.8 29 31.1 30 ...
##  $ Wrist       : num [1:250] 17.1 18.2 16.6 18.2 17.7 18.8 17.7 18.8 18.2 19.2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Body Density` = col_double(),
##   ..   PctBF = col_double(),
##   ..   Age = col_double(),
##   ..   Weight = col_double(),
##   ..   Height = col_double(),
##   ..   Neck = col_double(),
##   ..   Chest = col_double(),
##   ..   Abdomen = col_double(),
##   ..   waist = col_double(),
##   ..   Hip = col_double(),
##   ..   Thigh = col_double(),
##   ..   Knee = col_double(),
##   ..   Ankle = col_double(),
##   ..   Bicep = col_double(),
##   ..   Forearm = col_double(),
##   ..   Wrist = col_double()
##   .. )
names(BodyFat) # displays names of the variables
##  [1] "Body Density" "PctBF"        "Age"          "Weight"       "Height"      
##  [6] "Neck"         "Chest"        "Abdomen"      "waist"        "Hip"         
## [11] "Thigh"        "Knee"         "Ankle"        "Bicep"        "Forearm"     
## [16] "Wrist"

Model Fit

BodyFatmod <- lm(PctBF ~ waist, data=BodyFat)
coef(BodyFatmod)
## (Intercept)       waist 
##       -42.7         1.7
summary(BodyFatmod)
## 
## Call:
## lm(formula = PctBF ~ waist, data = BodyFat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.899  -3.645   0.186   3.177  12.789 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -42.7341     2.7165   -15.7   <2e-16 ***
## waist         1.7000     0.0743    22.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.71 on 248 degrees of freedom
## Multiple R-squared:  0.678,  Adjusted R-squared:  0.677 
## F-statistic:  523 on 1 and 248 DF,  p-value: <2e-16

Assumptions and conditions

The P-value associated with the t-test of \(H_{0}: \beta_1 = 0\) is the same as the P-value associated with the analysis of variance F-test. This will always be true for the simple linear regression model. The P-values match because of a well-known relationship between a \(t\) random variable and an \(F\) random variable with 1 numerator degree of freedom, i.e., \(t^{2}_{(n-2)} = F_{(1,n-2)}\). Thus, for a given significance level \(\alpha\), the F-test of \(H_{0}: \beta_{1} = 0\) versus \(H_{a}: \beta_{1} \neq 0\) is algebraically equivalent to the two-tailed t-test, and we will get exactly the same P-values:

  • If one test rejects \(H_0\), then so does the other.
  • If one test does not reject \(H_0\), then neither does the other.
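A quick numerical check of this equivalence, using the BodyFatmod fit from the previous section:

t_waist <- coef(summary(BodyFatmod))["waist", "t value"]  # slope t statistic
t_waist^2                        # roughly 523
anova(BodyFatmod)[1, "F value"]  # the ANOVA F statistic, same value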

The natural question then is when should we use the F-test and when should we use the t-test?

  • The F-test is only appropriate for testing that the slope differs from 0 (a two-sided alternative).
  • Use the t-test for one-sided alternatives, i.e., to test that the slope is positive or that it is negative.

The F-test is more useful for the multiple regression model when we want to test that more than one slope parameter is 0.

msummary(BodyFatmod)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -42.7341     2.7165   -15.7   <2e-16 ***
## waist         1.7000     0.0743    22.9   <2e-16 ***
## 
## Residual standard error: 4.71 on 248 degrees of freedom
## Multiple R-squared:  0.678,  Adjusted R-squared:  0.677 
## F-statistic:  523 on 1 and 248 DF,  p-value: <2e-16
rsquared(BodyFatmod)
## [1] 0.678
confint(BodyFatmod)   
##              2.5 % 97.5 %
## (Intercept) -48.08 -37.38
## waist         1.55   1.85
anova(BodyFatmod)
## Analysis of Variance Table
## 
## Response: PctBF
##            Df Sum Sq Mean Sq F value Pr(>F)    
## waist       1  11621   11621     523 <2e-16 ***
## Residuals 248   5508      22                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Some things to note are:

  • The residual standard deviation \(s\) is 4.71 and this estimates the standard deviation of the errors.

  • \(r^2 = (SSTO - SSE)/SSTO = SSR/(SSR + SSE) = 11621/(11621 + 5508) \approx 0.678\), or 67.8% (a quick check follows this list).

    The interpretation is that waist size explains 67.8% of the variation in percent body fat.

  • The value of the F statistic is \(F=523\) with 1 and 248 degrees of freedom, and the p-value for this F statistic is <2e-16. Thus we reject the null hypothesis.
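A small check of the \(r^2\) computation, using the (rounded) sums of squares from the ANOVA table above:

SSR <- 11621; SSE <- 5508   # values printed by anova(BodyFatmod)
SSR / (SSR + SSE)           # about 0.678, matching rsquared(BodyFatmod)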

Examining residuals

We use the following notation for our regression model: \(y = \beta_0 + \beta_1 x_1 + \epsilon\), where \(\epsilon \sim N(0, \sigma^{2})\).

When evaluating a regression model, it is useful to check the following conditions:

  • The model parameters \((β_0,β_1,σ^{2})\) are constant.
  • The error terms in the regression model are independent and have been sampled from a single population (identically distributed). This is often abbreviated as iid.
  • The error terms follow a normal probability distribution centered at zero with a fixed variance \(σ^{2}\).

These conditions are only required to be met if you are conducting a hypothesis test. However, even if you are only trying to find a model with a good fit, these conditions are important. If there are clear patterns in the residuals, it is likely that there is a better model that can be fit.

Regression assumptions about the error terms are generally checked by looking at the residual plots.

xyplot(PctBF ~ waist, xlab="Waist (in.)", 
       type=c("p", "r", "smooth"), data=BodyFat)   # scatterplot of PctBF vs waist (with a superimposed regression line and a smoother)

xyplot(resid(BodyFatmod) ~ waist, xlab="Waist (in.)", 
       type=c("p", "r", "smooth"), data=BodyFat)  # residuals vs x

xyplot(resid(BodyFatmod) ~ fitted(BodyFatmod), xlab="Predicted values", 
       ylab="Residuals",
       type=c("p", "r", "smooth"), data=BodyFat) #residual vs predicted value

xqqmath(~ resid(BodyFatmod))

histogram(~ resid(BodyFatmod)) #histogram of residuals 

Questions to ask

  • Does the histogram indicate that the error terms are normally distributed?
  • Does the “residuals versus fits” plot indicate equal variances for all values of X?
  • Create a plot of residuals vs. X. Are there any clear patterns in the plot?
  • Create plots of residuals vs. other potential explanatory variables. Are there any clear patterns in these plots?

Interpretation

  • The residual versus fitted value plot looks good: the variance is roughly the same all the way across and there are no worrisome patterns. There appear to be no difficulties with the model or data.

  • If the plot of residuals versus fits shows that the residual variance (vertical spread, a funnel shape) increases as the fitted values increase, the assumption of constant error variance is violated.

  • If the pattern is curved, a straight-line model is probably not the right form for the relationship.

Quick Fixes

When the regression function is not linear, or the error terms are not normal or have unequal variances, we can use transformations (a brief sketch follows the list below). In general, although not always:

  • Transforming the y values corrects problems with the error terms (and may help the non-linearity).

  • Transforming the x values primarily corrects the non-linearity.
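A minimal sketch (illustrative only, not a recommendation for the body fat model) of how re-expressions are written directly inside lm(); the +1 offset is an assumption to guard against zero values of PctBF:

BodyFat_logy <- lm(log(PctBF + 1) ~ waist, data = BodyFat)  # transform y
BodyFat_logx <- lm(PctBF ~ log(waist), data = BodyFat)      # transform x
msummary(BodyFat_logy)
msummary(BodyFat_logx)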

Estimation and Prediction

Confidence intervals for predicted values using the panel.lmbands() function.

The two research questions that we are dealing with are

  • What is the mean response \(E(Y)\) when the predictor value is \(x_{h}\)?

  • What value will a new response \(\hat{Y}_{pred}\) be when the predictor value is \(x_{h}\)?

confpred <- predict(BodyFatmod, interval="confidence")
head(confpred)
##    fit  lwr  upr
## 1 14.3 13.6 15.0
## 2 12.8 12.0 13.6
## 3 16.1 15.5 16.7
## 4 15.1 14.4 15.8
## 5 24.2 23.5 24.9
## 6 20.4 19.8 21.0
intpred <- predict(BodyFatmod, interval="prediction")
## Warning in predict.lm(BodyFatmod, interval = "prediction"): predictions on current data refer to _future_ responses
head(intpred)
##    fit   lwr  upr
## 1 14.3  4.98 23.6
## 2 12.8  3.50 22.1
## 3 16.1  6.79 25.4
## 4 15.1  5.79 24.4
## 5 24.2 14.88 33.5
## 6 20.4 11.14 29.7
xyplot(PctBF ~ waist, xlab="Waist (in.)", 
       panel=panel.lmbands, lwd=2, cex=0.2, data=BodyFat)
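The intervals above are computed at every observed waist value. To answer the two research questions at one particular predictor value \(x_{h}\), a newdata data frame can be passed to predict(); the 36-inch waist below is simply a hypothetical choice:

new_x <- data.frame(waist = 36)                                # hypothetical x_h
predict(BodyFatmod, newdata = new_x, interval = "confidence")  # mean response
predict(BodyFatmod, newdata = new_x, interval = "prediction")  # new observation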

Introduction

Goals of Multiple Regression

Multiple regression analysis can be used to serve different goals. The goals will influence the type of analysis that is conducted. The most common goals of multiple regression are to describe, predict, or confirm.

  • Describe: A model may be developed to describe the relationship between multiple explanatory variables and the response variable.
  • Predict: A regression model may be used to generalize to observations outside the sample. Just as in simple linear regression, explanatory variables should be within the range of the sample data to predict future responses.
  • Confirm: Theories are often developed about which variables or combination of variables should be included in a model. For example, is mileage useful in predicting retail price? Inferential techniques can be used to test if the association between the explanatory variables and the response could just be due to chance.

When trying multiple models in variable selection, hypothesis tests to evaluate the importance of any specific term are often very misleading. While variable selection techniques are useful to find a descriptive or predictive model, p-values for individual terms tend to be unreliable. To conduct a hypothesis test in multivariate regression, it is best to use the extra Sum of Squares (or Drop in Deviance) Test.
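A minimal sketch of such a nested-model (extra sum of squares) test in R, using the BodyFat data loaded earlier and Height as an assumed additional predictor:

reduced <- lm(PctBF ~ waist, data = BodyFat)
full    <- lm(PctBF ~ waist + Height, data = BodyFat)
anova(reduced, full)   # F-test for the term(s) added in the full model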

With careful model building, it is possible to find a strong relationship between our explanatory variables and the price of a car. However, with any model, it is important to understand how the data was collected, and what procedures were used to create the model before using the model to make decisions.

In this tutorial you will build a multiple regression model to describe the relationship between the retail price of used GM cars and their characteristics. The dataset includes several variables that might influence the Price of a car:

  • Mileage (number of miles the car has been driven)
  • Make (Buick, Cadillac, Chevrolet, Pontiac, SAAB, Saturn)
  • Type (Sedan, Coupe, Hatchback, Convertible, or Wagon)
  • Cyl (number of cylinders: 4, 6, or 8)
  • Liter (a measure of engine size)
  • Doors (number of doors: 2, 4)
  • Cruise (1 = cruise control, 0 = no cruise control)
  • Sound (1 = upgraded speakers, 0 = standard speakers)
  • Leather (1 = leather seats, 0 = not leather seats)

We start by viewing our data and its structure.

library(readr)
Cars <- read_csv("~/OneDrive - SUNY - The College at Brockport/P-Drive copy 2020/COVID2020 Changes/MTH641 Spring 2021/Labs/C3Cars.csv")
#View(Cars) # to view Cars
# head(Cars, 5)    # Shows the first 5 rows of the data 
str(Cars[,c(1:12)])
## tibble [804 × 12] (S3: tbl_df/tbl/data.frame)
##  $ Mileage: num [1:804] 8221 9135 13196 16342 19832 ...
##  $ Price  : num [1:804] 17314 17542 16219 16337 16339 ...
##  $ Make   : chr [1:804] "Buick" "Buick" "Buick" "Buick" ...
##  $ Model  : chr [1:804] "Century" "Century" "Century" "Century" ...
##  $ Trim   : chr [1:804] "Sedan 4D" "Sedan 4D" "Sedan 4D" "Sedan 4D" ...
##  $ Type   : chr [1:804] "Sedan" "Sedan" "Sedan" "Sedan" ...
##  $ Cyl    : num [1:804] 6 6 6 6 6 6 6 6 6 6 ...
##  $ Liter  : num [1:804] 3.1 3.1 3.1 3.1 3.1 3.1 3.1 3.1 3.1 3.1 ...
##  $ Doors  : num [1:804] 4 4 4 4 4 4 4 4 4 4 ...
##  $ Cruise : num [1:804] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sound  : num [1:804] 1 1 1 0 0 1 1 1 0 1 ...
##  $ Leather: num [1:804] 1 0 0 0 1 0 0 0 1 1 ...

Exploring the data with graphical and numerical summaries

xyplot(Price ~ Mileage, xlab="Mileage", group=Make, auto.key=TRUE, lwd=2,
       type=c("p", "r", "smooth"), data=Cars) # superimposed line and a smoother

tally(Cyl~ Make|Leather, data=Cars)
## , , Leather = 0
## 
##    Make
## Cyl Buick Cadillac Chevrolet Pontiac SAAB Saturn
##   4     0        0        29      16   31     24
##   6    45        0        32      38    0      7
##   8     0        0         0       0    0      0
## 
## , , Leather = 1
## 
##    Make
## Cyl Buick Cadillac Chevrolet Pontiac SAAB Saturn
##   4     0        0       151      34   83     26
##   6    35       20        88      42    0      3
##   8     0       60        20      20    0      0

Subsetting the data to see the outliers: Cadillac hardtop convertibles

#cadil<-Cars[which(Cars$Make=="Cadillac"),] # another way of subsetting

cadil <- subset(Cars, Price >= 60000 & Cars$Make=="Cadillac", select=c(Price, Mileage,Make,Trim,Model))

head(cadil)
## # A tibble: 6 x 5
##    Price Mileage Make     Trim            Model 
##    <dbl>   <dbl> <chr>    <chr>           <chr> 
## 1 70755.     583 Cadillac Hardtop Conv 2D XLR-V8
## 2 68566.    6420 Cadillac Hardtop Conv 2D XLR-V8
## 3 69134.    7892 Cadillac Hardtop Conv 2D XLR-V8
## 4 66374.   12021 Cadillac Hardtop Conv 2D XLR-V8
## 5 65281.   15600 Cadillac Hardtop Conv 2D XLR-V8
## 6 63913.   18200 Cadillac Hardtop Conv 2D XLR-V8
#tail(cadil) # use this if you want to see the last few rows

Simple Linear Regression

We will start with a simple linear regression model to predict Price from Mileage. To fit a linear regression model in R, we will use the function lm().

Cars.lm = lm(Price ~ Mileage, data = Cars)  # Creating a simple linear model
coef(Cars.lm)
## (Intercept)     Mileage 
##   24764.559      -0.173
summary(Cars.lm)
## 
## Call:
## lm(formula = Price ~ Mileage, data = Cars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13905  -7254  -3520   5188  46091 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.48e+04   9.04e+02   27.38  < 2e-16 ***
## Mileage     -1.72e-01   4.21e-02   -4.09  4.7e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9790 on 802 degrees of freedom
## Multiple R-squared:  0.0205, Adjusted R-squared:  0.0192 
## F-statistic: 16.8 on 1 and 802 DF,  p-value: 4.68e-05
gf_point(Price ~ Mileage, data = Cars) %>%
  gf_lm(interval='confidence')

Adding fitted and residual values to the original data

Cars = mutate(Cars, res1 = resid(Cars.lm), fits1 = fitted(Cars.lm)) 
gf_histogram(~ res1, data = Cars) #histogram

gf_point(res1 ~ fits1, data = Cars)%>%
  gf_hline(yintercept = 0)
## Warning: geom_hline(): Ignoring `mapping` because `yintercept` was provided.

More than one explanatory variable: model 2 with Mileage and Make as explanatory variables

While the p-value gives some indication that Mileage is important, the \(R^2\) value indicates that our model is not a good fit. In addition, there was a very clear pattern in the residuals vs. Make plot. Thus we will add this term to our model.

Cars.lm2 = lm(Price ~ Mileage + Make, data = Cars)  # Model with Mileage and Make
Cars = mutate(Cars, res2 = resid(Cars.lm2), fits2 = fitted(Cars.lm2))
gf_point(Price ~ Mileage, data = Cars, color = ~ Make)%>%
  gf_line(fits2 ~ Mileage)

We also see that our \(R^2\) value increased significantly.

rsquared(Cars.lm)
## [1] 0.0205
rsquared(Cars.lm2)
## [1] 0.665

Residual Plots and Model Assumptions

gf_histogram(~ res2, data = Cars)

gf_point(res2 ~ fits2, data = Cars)%>%
  gf_hline(yintercept = 0)
## Warning: geom_hline(): Ignoring `mapping` because `yintercept` was provided.

gf_point(res2 ~ Mileage, data = Cars)%>%
  gf_hline(yintercept = 0)
## Warning: geom_hline(): Ignoring `mapping` because `yintercept` was provided.

Interaction Terms

Notice that simply adding the Make variable to our model forces the Mileage slope to be identical for every make. However, a closer look at the data indicates that there may be an interaction effect, since the effect of Mileage on Price appears to depend upon the Make of the car. To include interaction terms in our model, we use \(X_1 * X_2\) (in R, Mileage*Make).

Adding residuals and fitted values to the data

Cars.lm3 = lm(Price ~ Mileage*Make, data = Cars)  # Model with a Mileage-by-Make interaction
Cars = mutate(Cars, res3 = resid(Cars.lm3), fits3 = fitted(Cars.lm3)) 

Model 3 with an interaction between Mileage and Make

gf_point(Price ~ Mileage, data = Cars, color = ~ Make)%>%
  gf_line(fits2~Mileage)

gf_point(Price ~ Mileage, data = Cars, color = ~ Make)%>%
  gf_line(fits3~Mileage)

We also see that our \(R^2\) value increased slightly.

rsquared(Cars.lm2)
## [1] 0.665
rsquared(Cars.lm3)
## [1] 0.667

What is the best possible model?

We can try to find the model with the “best” R2 value, by simply putting all the terms in a model. Is the R2 value significantly better than our previous models?

Cars.lm4 = lm(Price ~ Mileage+Make+Doors+Type+Cyl+Liter+Cruise+Sound+Leather, data = Cars)

summary(Cars.lm4)
## 
## Call:
## lm(formula = Price ~ Mileage + Make + Doors + Type + Cyl + Liter + 
##     Cruise + Sound + Leather, data = Cars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8225  -1416    -12   1154  14878 
## 
## Coefficients: (1 not defined because of singularities)
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.11e+04   1.32e+03   23.55  < 2e-16 ***
## Mileage       -1.85e-01   1.09e-02  -16.99  < 2e-16 ***
## MakeCadillac   1.59e+04   4.89e+02   32.44  < 2e-16 ***
## MakeChevrolet -1.75e+03   3.60e+02   -4.85  1.5e-06 ***
## MakePontiac   -1.89e+03   3.68e+02   -5.14  3.4e-07 ***
## MakeSAAB       1.05e+04   4.58e+02   23.01  < 2e-16 ***
## MakeSaturn    -1.24e+03   4.79e+02   -2.58  0.01001 *  
## Doors         -4.14e+03   2.57e+02  -16.13  < 2e-16 ***
## TypeCoupe     -1.20e+04   4.73e+02  -25.40  < 2e-16 ***
## TypeHatchback -4.10e+03   5.47e+02   -7.50  1.8e-13 ***
## TypeSedan     -4.07e+03   3.87e+02  -10.51  < 2e-16 ***
## TypeWagon            NA         NA      NA       NA    
## Cyl           -1.23e+03   3.16e+02   -3.88  0.00011 ***
## Liter          5.76e+03   3.54e+02   16.25  < 2e-16 ***
## Cruise         1.11e+02   2.57e+02    0.43  0.66513    
## Sound          2.96e+02   2.03e+02    1.45  0.14622    
## Leather        2.34e+02   2.19e+02    1.07  0.28502    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2520 on 788 degrees of freedom
## Multiple R-squared:  0.936,  Adjusted R-squared:  0.935 
## F-statistic:  773 on 15 and 788 DF,  p-value: <2e-16

This model appears to have a fairly good adjusted R2 value. However, this may still not be the best model. There are many transformations, such as the log(Price), that may dramatically help the model. In addition we have not made any attempt to consider quadratic or cubic terms in our model. The growing number of large data sets as well as increasing computer power has dramatically improved the ability of researchers to find a parsimonious model (a model that carefully selects a relatively small number of the most useful explanatory variables). However, even with intensive computing power, the process of finding a best model is often more of an art than a science.
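For instance, a quadratic Mileage term could be added with I(); this is a sketch only and is not pursued here:

Cars.quad <- lm(Price ~ Mileage + I(Mileage^2) + Make, data = Cars)
rsquared(Cars.quad)   # compare with rsquared(Cars.lm2)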

Automated multivariate regression procedures (the process of using prespecified conditions to automatically add or delete variables) can have some limitations:

  • When explanatory variables are correlated, automated procedures often fail to include variables that are useful in describing the results.
  • Automated procedures tend to overfit the data (fit unhelpful variables) by searching for any terms that explain the variability in the sample results. This chance variability in the sample results may not be reflected in the entire population from which the sample was collected.
  • Automated procedures provide output that looks like a best model but can easily be misinterpreted, since they don’t require a researcher to explore the data to get an intuitive feel for it. For example, automated procedures don’t encourage researchers to look at residual plots that may reveal interesting patterns within the data.

Multicollinearity

Multicollinearity exists when two or more explanatory variables in a multiple regression model are highly correlated with each other. If two explanatory variables X1 and X2 are highly correlated, it can be very difficult to identify whether X1, X2, or both variables are actually responsible for influencing the response variable, Y.

Below we use the Cars dataset to examine the relationship of Cyl and Liter with Price.

model1 = lm(Price ~ Mileage + Liter, data = Cars)  # Model with Mileage and Liter
anova(model1)
## Analysis of Variance Table
## 
## Response: Price
##            Df   Sum Sq  Mean Sq F value  Pr(>F)    
## Mileage     1 1.61e+09 1.61e+09    24.4 9.4e-07 ***
## Liter       1 2.42e+10 2.42e+10   368.5 < 2e-16 ***
## Residuals 801 5.26e+10 6.57e+07                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model2 = lm(Price ~ Mileage + Cyl, data = Cars)  # Model with Mileage and Cyl
anova(model2)
## Analysis of Variance Table
## 
## Response: Price
##            Df   Sum Sq  Mean Sq F value  Pr(>F)    
## Mileage     1 1.61e+09 1.61e+09    24.8 7.7e-07 ***
## Cyl         1 2.51e+10 2.51e+10   387.5 < 2e-16 ***
## Residuals 801 5.18e+10 6.47e+07                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model3 = lm(Price ~ Mileage + Cyl + Liter, data = Cars)  # Model with Mileage, Cyl, and Liter
anova(model3)
## Analysis of Variance Table
## 
## Response: Price
##            Df   Sum Sq  Mean Sq F value  Pr(>F)    
## Mileage     1 1.61e+09 1.61e+09   24.89 7.4e-07 ***
## Cyl         1 2.51e+10 2.51e+10  388.44 < 2e-16 ***
## Liter       1 1.93e+08 1.93e+08    2.99   0.084 .  
## Residuals 800 5.16e+10 6.45e+07                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
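One common numeric check for multicollinearity (not part of the original analysis) is the variance inflation factor; this sketch assumes the car package is installed:

library(car)   # assumed to be available
vif(model3)    # large values (often taken as > 5 or 10) suggest collinearity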

Questions

  • In the first model, use only Mileage and Liter as the explanatory variables. Is Liter an important explanatory variable in this model?

  • In the second model, use only Mileage and number of cylinders (Cyl) as the explanatory variables. Is Cyl an important explanatory variable in this model?

  • In the third model, use Mileage, Liter, and number of cylinders (Cyl) as the explanatory variables. How did the test statistics and p-values change when all three explanatory variables were included in the model?

  • Compare the \(R^2\) values in each model. Which model would you suggest?

  • The \(R^2\) values are similar for all three models; however, Liter and Cyl are both measures of engine size. The \(R^2\) values and tests for the regression coefficients show that Liter is significant in predicting retail price in model 1 and, similarly, Cyl is significant in model 2, but in model 3 (where Liter enters after Cyl) Liter is no longer significant. The coefficients are unreliable in the presence of multicollinearity. Some would argue to remove Cyl and use Liter, since it is a more precise measure of engine size.

  • In multiple regression, the p-values for individual terms are highly unreliable and should not be used to test for the importance of a variable.

Checking our model for cars priced less than $50K

Cars_sub<- subset(Cars, Price < 50000 , select=c(Price,Mileage,Make,Trim,Model,Cyl))



xyplot(Price~Mileage|Cyl , data=Cars_sub,
            xlab="Mileage", ylab="Price",
            main="",
            labels=row.names(Cars_sub)) 

xyplot(Price~Mileage|Make , data=Cars_sub,
            xlab="Mileage", ylab="Price",
            main="",
            labels=row.names(Cars_sub)) 

Model Fit

  • What happens to price, in general, for each additional mile on the car?
  • Does the fact that \(b_1\) is small mean mileage is not important?
  • Does mileage help you predict price?
  • What does the R-squared value tell you?
carmodel<- lm(Price ~ Mileage + Make+Cyl,data=Cars)
#coef(carmodel) just to print coefficients
summary(carmodel)
## 
## Call:
## lm(formula = Price ~ Mileage + Make + Cyl, data = Cars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11054  -2630    -26   1651  24612 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.21e+02   1.02e+03    0.51    0.611    
## Mileage       -1.76e-01   1.76e-02   -9.99   <2e-16 ***
## MakeCadillac   1.39e+04   6.78e+02   20.48   <2e-16 ***
## MakeChevrolet -5.43e+02   5.28e+02   -1.03    0.304    
## MakePontiac   -1.01e+03   5.66e+02   -1.78    0.076 .  
## MakeSAAB       1.67e+04   6.57e+02   25.46   <2e-16 ***
## MakeSaturn    -2.19e+02   7.34e+02   -0.30    0.765    
## Cyl            3.98e+03   1.41e+02   28.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4070 on 796 degrees of freedom
## Multiple R-squared:  0.832,  Adjusted R-squared:  0.831 
## F-statistic:  564 on 7 and 796 DF,  p-value: <2e-16
#aov(carmodel) # you can print the ANOVA table too (next chapter)

Fitting a separate model for <$50K cars improves the model fit

carmodel_sub<- lm(Price ~ Mileage + Make+Cyl,data=Cars_sub)
coef(carmodel_sub)
##   (Intercept)       Mileage  MakeCadillac MakeChevrolet   MakePontiac 
##      1134.670        -0.163     11132.617      -678.379     -1049.510 
##      MakeSAAB    MakeSaturn           Cyl 
##     16437.087      -459.593      3835.060
summary(carmodel_sub)
## 
## Call:
## lm(formula = Price ~ Mileage + Make + Cyl, data = Cars_sub)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8046  -2485     -7   1637  16782 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.13e+03   8.24e+02    1.38    0.169    
## Mileage       -1.63e-01   1.43e-02  -11.37   <2e-16 ***
## MakeCadillac   1.11e+04   5.60e+02   19.86   <2e-16 ***
## MakeChevrolet -6.78e+02   4.24e+02   -1.60    0.110    
## MakePontiac   -1.05e+03   4.55e+02   -2.31    0.021 *  
## MakeSAAB       1.64e+04   5.28e+02   31.15   <2e-16 ***
## MakeSaturn    -4.60e+02   5.89e+02   -0.78    0.435    
## Cyl            3.84e+03   1.14e+02   33.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3260 on 785 degrees of freedom
## Multiple R-squared:  0.86,   Adjusted R-squared:  0.858 
## F-statistic:  687 on 7 and 785 DF,  p-value: <2e-16
aov(carmodel_sub)
## Call:
##    aov(formula = carmodel_sub)
## 
## Terms:
##                  Mileage     Make      Cyl Residuals
## Sum of Squares  1.10e+09 3.80e+10 1.21e+10  8.37e+09
## Deg. of Freedom        1        5        1       785
## 
## Residual standard error: 3265
## Estimated effects may be unbalanced

Stepwise regression

  • The stepwise regression method implemented in R uses AIC (not \(R^{2}\)) as the basis for adding or dropping a variable from the model.

  • Models with smaller AIC values are preferred.

full.model<-lm(Price~Make+Cyl+Liter+Doors+Cruise+Sound+Leather+Mileage, data = Cars)
#Now use the step function, and store the results in the object step.model:
step.model <- step(full.model)
## Start:  AIC=13118
## Price ~ Make + Cyl + Liter + Doors + Cruise + Sound + Leather + 
##     Mileage
## 
##           Df Sum of Sq      RSS   AIC
## - Leather  1  8.65e+04 9.49e+09 13116
## - Sound    1  3.03e+05 9.49e+09 13116
## - Cyl      1  3.40e+06 9.49e+09 13117
## <none>                 9.49e+09 13118
## - Cruise   1  2.53e+07 9.52e+09 13118
## - Liter    1  1.29e+09 1.08e+10 13218
## - Doors    1  1.46e+09 1.09e+10 13231
## - Mileage  1  1.74e+09 1.12e+10 13252
## - Make     5  3.40e+10 4.34e+10 14331
## 
## Step:  AIC=13116
## Price ~ Make + Cyl + Liter + Doors + Cruise + Sound + Mileage
## 
##           Df Sum of Sq      RSS   AIC
## - Sound    1  2.65e+05 9.49e+09 13114
## - Cyl      1  3.59e+06 9.49e+09 13115
## <none>                 9.49e+09 13116
## - Cruise   1  2.60e+07 9.52e+09 13117
## - Liter    1  1.32e+09 1.08e+10 13219
## - Doors    1  1.46e+09 1.09e+10 13229
## - Mileage  1  1.74e+09 1.12e+10 13250
## - Make     5  3.57e+10 4.52e+10 14360
## 
## Step:  AIC=13114
## Price ~ Make + Cyl + Liter + Doors + Cruise + Mileage
## 
##           Df Sum of Sq      RSS   AIC
## - Cyl      1  3.45e+06 9.49e+09 13113
## <none>                 9.49e+09 13114
## - Cruise   1  2.59e+07 9.52e+09 13115
## - Liter    1  1.32e+09 1.08e+10 13217
## - Doors    1  1.46e+09 1.09e+10 13227
## - Mileage  1  1.74e+09 1.12e+10 13248
## - Make     5  3.60e+10 4.55e+10 14365
## 
## Step:  AIC=13113
## Price ~ Make + Liter + Doors + Cruise + Mileage
## 
##           Df Sum of Sq      RSS   AIC
## <none>                 9.49e+09 13113
## - Cruise   1  2.64e+07 9.52e+09 13113
## - Doors    1  1.55e+09 1.10e+10 13232
## - Mileage  1  1.74e+09 1.12e+10 13246
## - Liter    1  1.12e+10 2.07e+10 13739
## - Make     5  3.79e+10 4.73e+10 14395
#To see the suggested final model, apply the summary command to step.model:
summary(step.model)
## 
## Call:
## lm(formula = Price ~ Make + Liter + Doors + Cruise + Mileage, 
##     data = Cars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9576  -1909   -183   1337  22534 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.53e+04   1.02e+03   15.08  < 2e-16 ***
## MakeCadillac   1.61e+04   5.58e+02   28.93  < 2e-16 ***
## MakeChevrolet -2.21e+03   4.72e+02   -4.69  3.2e-06 ***
## MakePontiac   -1.78e+03   4.92e+02   -3.61  0.00032 ***
## MakeSAAB       1.47e+04   5.63e+02   26.15  < 2e-16 ***
## MakeSaturn    -2.26e+03   6.45e+02   -3.51  0.00048 ***
## Liter          4.53e+03   1.48e+02   30.66  < 2e-16 ***
## Doors         -1.73e+03   1.52e+02  -11.37  < 2e-16 ***
## Cruise        -5.11e+02   3.44e+02   -1.49  0.13779    
## Mileage       -1.80e-01   1.49e-02  -12.07  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3460 on 794 degrees of freedom
## Multiple R-squared:  0.879,  Adjusted R-squared:  0.878 
## F-statistic:  641 on 9 and 794 DF,  p-value: <2e-16
summary(carmodel)
## 
## Call:
## lm(formula = Price ~ Mileage + Make + Cyl, data = Cars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11054  -2630    -26   1651  24612 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.21e+02   1.02e+03    0.51    0.611    
## Mileage       -1.76e-01   1.76e-02   -9.99   <2e-16 ***
## MakeCadillac   1.39e+04   6.78e+02   20.48   <2e-16 ***
## MakeChevrolet -5.43e+02   5.28e+02   -1.03    0.304    
## MakePontiac   -1.01e+03   5.66e+02   -1.78    0.076 .  
## MakeSAAB       1.67e+04   6.57e+02   25.46   <2e-16 ***
## MakeSaturn    -2.19e+02   7.34e+02   -0.30    0.765    
## Cyl            3.98e+03   1.41e+02   28.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4070 on 796 degrees of freedom
## Multiple R-squared:  0.832,  Adjusted R-squared:  0.831 
## F-statistic:  564 on 7 and 796 DF,  p-value: <2e-16
  • Note that the final model left out some of the original explanatory variables. Comments?
  • Best subset regression is another automated model-selection approach; it can be done using the leaps package (a brief sketch follows).
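A minimal sketch of best subset selection with the leaps package (assuming it is installed), using the same predictors as the stepwise search above:

library(leaps)
best <- regsubsets(Price ~ Make + Cyl + Liter + Doors + Cruise + Sound +
                     Leather + Mileage, data = Cars, nvmax = 10)
summary(best)$adjr2   # adjusted R^2 of the best model of each size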

Finding a Good Re-expression

carmodel_log<- lm(log(Price) ~ Mileage + Make+Cyl,data=Cars)
summary(carmodel_log)
## 
## Call:
## lm(formula = log(Price) ~ Mileage + Make + Cyl, data = Cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3438 -0.0993 -0.0003  0.0836  0.4287 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.94e+00   3.53e-02  253.40  < 2e-16 ***
## Mileage       -7.88e-06   6.05e-07  -13.03  < 2e-16 ***
## MakeCadillac   3.60e-01   2.34e-02   15.42  < 2e-16 ***
## MakeChevrolet -1.04e-01   1.82e-02   -5.69  1.8e-08 ***
## MakePontiac   -6.69e-02   1.95e-02   -3.43  0.00064 ***
## MakeSAAB       7.40e-01   2.27e-02   32.67  < 2e-16 ***
## MakeSaturn    -7.68e-02   2.53e-02   -3.04  0.00246 ** 
## Cyl            1.92e-01   4.87e-03   39.49  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.14 on 796 degrees of freedom
## Multiple R-squared:  0.884,  Adjusted R-squared:  0.883 
## F-statistic:  868 on 7 and 796 DF,  p-value: <2e-16
#residualPlot(carmodel_log)
xyplot(resid(carmodel_log) ~ Mileage, xlab="Mileage", 
       type=c("p", "r", "smooth"), data=Cars)  # residuals vs x

xyplot(resid(carmodel_log) ~ fitted(carmodel_log), xlab="Predicted values", 
       ylab="Residuals",
       type=c("p", "r", "smooth"), data=Cars) # residuals vs predicted values

xqqmath(~ resid(carmodel_log))

histogram(~ resid(carmodel_log)) #histogram of residuals 

Finding a Good Re-expression

Recall: when the regression function is not linear, or the error terms are not normal or have unequal variances, we can use transformations. In general, although not always:

  • Transforming the y values corrects problems with the error terms (and may help the non-linearity).
  • Transforming the x values primarily corrects the non-linearity.

Looking at the Penguins example, we can see how different log transformations affect the xyplot of the two variables:

Penguins <- read.csv("http://nhorton.people.amherst.edu/sdm4/data/Penguins.csv")

xyplot(DiveHeartRate ~ Duration, data = Penguins, type = c("p", "r"),
       main = "No Transformation", xlab = "Dive Duration (min)", ylab = "Dive Heart Rate")

xyplot(log(DiveHeartRate) ~ Duration, data = Penguins, type = c("p", "r"),
       main = "Y Transformation", xlab = "Dive Duration (min)", ylab = "log(Dive Heart Rate)")

xyplot(log(DiveHeartRate) ~ log(Duration), data=Penguins, type = c("p", "r"),
       main = "X and Y Transformations", xlab = "log(Dive Duration)", ylab = "log(Dive Heart Rate)")

Create data for experimental design

tip    <- factor(rep(1:4, each = 4))
coupon <- factor(rep(1:4, times = 4))
y <- c(9.3, 9.4, 9.6, 10,
       9.4, 9.3, 9.8, 9.9,
       9.2, 9.4, 9.5, 9.7,
       9.7, 9.6, 10, 10.2)
hardness <- data.frame(y, tip, coupon)
head(hardness)
##      y tip coupon
## 1  9.3   1      1
## 2  9.4   1      2
## 3  9.6   1      3
## 4 10.0   1      4
## 5  9.4   2      1
## 6  9.3   2      2
## Analyze data ####
fit <- aov(y ~ coupon + tip, data = hardness)                 
summary(fit)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## coupon       3  0.825  0.2750    30.9 4.5e-05 ***
## tip          3  0.385  0.1283    14.4 0.00087 ***
## Residuals    9  0.080  0.0089                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Subsetting Data

# using subset function 
names(Cars)
##  [1] "Mileage" "Price"   "Make"    "Model"   "Trim"    "Type"    "Cyl"    
##  [8] "Liter"   "Doors"   "Cruise"  "Sound"   "Leather" "X13"     "X14"    
## [15] "X15"     "X16"     "X17"     "X18"     "X19"     "X20"     "X21"    
## [22] "X22"     "X23"     "X24"     "res1"    "fits1"   "res2"    "fits2"  
## [29] "res3"    "fits3"
newdata <- subset(Cars, Price >= 60000 & Mileage < 35000, 
select=c(Mileage, Price,Make,Model,Cyl))
head(newdata)
## # A tibble: 6 x 5
##   Mileage  Price Make     Model    Cyl
##     <dbl>  <dbl> <chr>    <chr>  <dbl>
## 1     583 70755. Cadillac XLR-V8     8
## 2    6420 68566. Cadillac XLR-V8     8
## 3    7892 69134. Cadillac XLR-V8     8
## 4   12021 66374. Cadillac XLR-V8     8
## 5   15600 65281. Cadillac XLR-V8     8
## 6   18200 63913. Cadillac XLR-V8     8

Bells and whistles of visual displays

The figure below displays the scatterplot of calories versus sugars, stratified by which shelf the cereal is displayed on at the store.

Cereals <- read.csv("http://nhorton.people.amherst.edu/sdm4/data/Cereals.csv")
Cereals[c(1,2),c(1,2,3)]
##                name mfr calories
## 1         100%_Bran   N       70
## 2 100%_Natural_Bran   Q      120
tally(~ shelf, data=Cereals)
## shelf
##  1  2  3 
## 20 21 36
# adding a factor variable to data
Cereals <- mutate(Cereals, shelfgrp = derivedFactor(
  bottomshelf = shelf==1,
  middleshelf = shelf==2,
  topshelf = shelf==3
))
tally(~ shelfgrp, data=Cereals)
## shelfgrp
## bottomshelf middleshelf    topshelf 
##          20          21          36
xyplot(calories ~ sugars, group=shelfgrp, auto.key=TRUE, 
  lwd=2, type=c("p", "r"), data=Cereals)

data(Births78)
library(lattice)
xyplot(births ~ date, data = Births78)
xyplot(births ~ date, data = Births78, groups = wday, main = "Births by Weekday",
       t = "l", auto.key = list(space = "top", columns = 4,
                                title = "Weekday", cex.title = 1,
                                lines = TRUE, points = FALSE))