## Load dataset directly from the internet
lbw <- read.table("http://www.umass.edu/statdata/statdata/data/lowbwt.dat", head = T, skip = 4)
## Change to lower case variable names
names(lbw) <- tolower(names(lbw))
## Recoding using within function
lbw <- within(lbw, {
## Relabel race: 1, 2, 3 -> White, Black, Other
race <- factor(race, levels = 1:3, labels = c("White","Black","Other"))
## Dichotomize ptl
preterm <- factor(ptl >= 1, levels = c(FALSE,TRUE), labels = c("0","1+"))
## You can alse use cut
## preterm <- cut(ptl, breaks = c(-Inf, 0, Inf), labels = c("0","1+"))
## Change 0,1 binary to No,Yes binary
## smoke <- factor(smoke, levels = c(0,1), labels = c("No","Yes"))
## ht <- factor(ht , levels = c(0,1), labels = c("No","Yes"))
## ui <- factor(ui , levels = c(0,1), labels = c("No","Yes"))
## low <- factor(low , levels = c(0,1), labels = c("No","Yes"))
## Categorize ftv (frequency of visit): 0 1 2 3 4 6 -> None(0), Normal(1-2), Many (>= 3)
ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
## Make Normal the reference level
ftv.cat <- relevel(ftv.cat, ref = "Normal")
})
## Categorize smoke ht ui
lbw[,c("smoke","ht","ui","low")] <- lapply(lbw[,c("smoke","ht","ui")],
function(var) {
var <- factor(ifelse(var == 1, "Yes", "No"))
})
## Summary
summary(lbw)
id low age lwt race smoke ptl ht ui
Min. : 4 No :115 Min. :14.0 Min. : 80 White:96 No :115 Min. :0.000 No :177 No :161
1st Qu.: 68 Yes: 74 1st Qu.:19.0 1st Qu.:110 Black:26 Yes: 74 1st Qu.:0.000 Yes: 12 Yes: 28
Median :123 Median :23.0 Median :121 Other:67 Median :0.000
Mean :121 Mean :23.2 Mean :130 Mean :0.196
3rd Qu.:176 3rd Qu.:26.0 3rd Qu.:140 3rd Qu.:0.000
Max. :226 Max. :45.0 Max. :250 Max. :3.000
ftv bwt ftv.cat preterm
Min. :0.000 Min. : 709 Normal: 77 0 :159
1st Qu.:0.000 1st Qu.:2414 None :100 1+: 30
Median :0.000 Median :2977 Many : 12
Mean :0.794 Mean :2945
3rd Qu.:1.000 3rd Qu.:3475
Max. :6.000 Max. :4990
## Check the outcome variable
library(lattice)
densityplot(lbw$bwt)
## No interaction. Squared terms for continuous variables.
lm.full <- lm(bwt ~ age + I(age^2) + lwt + I(lwt^2) + smoke + race + preterm,
data = lbw)
summary(lm.full)
Call:
lm(formula = bwt ~ age + I(age^2) + lwt + I(lwt^2) + smoke +
race + preterm, data = lbw)
Residuals:
Min 1Q Median 3Q Max
-2259.2 -390.3 12.5 488.1 1423.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4166.2320 1100.9490 3.78 0.00021 ***
age -146.1344 62.3205 -2.34 0.02012 *
I(age^2) 2.9233 1.2229 2.39 0.01786 *
lwt 9.2296 10.7900 0.86 0.39348
I(lwt^2) -0.0181 0.0350 -0.52 0.60549
smokeYes -326.4692 112.2103 -2.91 0.00408 **
raceBlack -496.7971 155.0977 -3.20 0.00161 **
raceOther -341.7900 120.2940 -2.84 0.00501 **
preterm1+ -279.1634 138.8263 -2.01 0.04583 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 668 on 180 degrees of freedom
Multiple R-squared: 0.196, Adjusted R-squared: 0.16
F-statistic: 5.47 on 8 and 180 DF, p-value: 0.00000345
library(epicalc)
regress.display(lm.full)
Coeff lower095ci upper095ci Pr>|t|
(Intercept) 4166.23199 1993.80557 6338.65841 0.0002098
age -146.13440 -269.10716 -23.16164 0.0201231
I(age^2) 2.92325 0.51019 5.33631 0.0178605
lwt 9.22958 -12.06153 30.52068 0.3934754
I(lwt^2) -0.01813 -0.08727 0.05101 0.6054888
smokeYes -326.46917 -547.88604 -105.05231 0.0040775
raceBlack -496.79705 -802.84069 -190.75341 0.0016079
raceOther -341.78998 -579.15773 -104.42223 0.0050114
preterm1+ -279.16340 -553.09981 -5.22699 0.0458302
Most useful functions are:
## Load the car (Companion to Applied Regression) package
library(car)
Plots empirical quantiles of a variable, or of studentized residuals from a linear model, against theoretical quantiles of a comparison distribution.
## Residual normality
qqPlot(lm.full)
These functions construct added-variable (also called partial-regression) plots for linear and generalized linear models.
## Added-variable plots: bwt|others ~ age|others (Partial relationship between response and regressor)
avPlots(model = lm.full, id.n = 5)
## avPlots(model = lm.full, terms = ~ age + I(age^2), id.n = 5)
These functions draw Ceres plots for linear and generalized linear models.
## This does not work?
## ceresPlots(model = lm.full, terms = ~ age, ylim = c(-4000,4000))
These functions construct component+residual plots (also called partial-residual plots) for linear and generalized linear models.
## Component + Residual plots: (Predicted + Residual(bwt)) ~ age
crPlots(model = lm.full, id.n = 5)
## crPlots(model = lm.full, terms = ~ age + I(age^2), id.n = 5)
These functions display index plots of dfbeta (effect on coefficients of deleting each observation in turn) and dfbetas (effect on coefficients of deleting each observation in turn, standardized by a deleted estimate of the coefficient standard error). In the plot of dfbeta, horizontal lines are drawn at 0 and +/- one standard error; in the plot of dfbetas, horizontal lines are drawn and 0 and +/- 1.
## dfbeta and dfbetas Index Plots
dfbetasPlots(model = lm.full, id.n = 5)
## Easier to see if the y-axis is fixed
dfbetasPlots(model = lm.full, id.n = 5, ylim = c(-1,1)*1.5)
Provides index plots of Cook's distances, leverages, Studentized residuals, and outlier significance levels for a regression object.
influenceIndexPlot(model = lm.full, id.n = 5)
This function creates a “bubble” plot of Studentized residuals by hat values, with the areas of the circles representing the observations proportional to Cook's distances. Vertical reference lines are drawn at twice and three times the average hat value, horizontal reference lines at -2, 0, and 2 on the Studentized-residual scale.
influencePlot(model = lm.full, id.n = 5)
StudRes Hat CookD
68 0.7712 0.30289 0.16965
76 0.4098 0.14951 0.05740
93 0.5646 0.17724 0.08751
106 1.3175 0.22056 0.23314
129 2.1800 0.02508 0.11535
130 1.3655 0.53713 0.48916
131 -2.4353 0.06020 0.20270
132 -3.5297 0.02404 0.17904
133 -2.7495 0.10151 0.30259
136 -2.1492 0.03276 0.13053
For a ‘lm’ model, draws an inverse.response plot with the response Y on the vertical axis and the fitted values Yhat on the horizontal axis. Uses ‘nls’ to estimate lambda in the function Yhat = b0 + b1(Y)lambda. Adds the fitted curve to the plot. ‘invResPlot’ is an alias for ‘inverseResponsePlot’.
inverseResponsePlot(model = lm.full, id.n = 5)
lambda RSS
1 4.723 14885493
2 -1.000 17295001
3 0.000 16366354
4 1.000 15724560
These functions display a generalization, due to Sall (1990) and Cook and Weisberg (1991), of added-variable plots to multiple-df terms in a linear model. When a term has just 1 df, the leverage plot is a rescaled version of the usual added-variable (partial-regression) plot.
leveragePlots(model = lm.full, id.n = 5)
For a regression object, plots the response on the vertical axis versus a linear combination u of terms in the mean function on the horizontal axis. Added to the plot are a ‘loess’ smooth for the graph, along with a loess smooth from the plot of the fitted values on u. ‘mmps’ is an alias for ‘marginalModelPlots’, and ‘mmp’ is an alias for ‘marginalModelPlot’.
marginalModelPlots(model = lm.full, id.n = 5)
Plots the residuals versus each term in a mean function and versus fitted values. Also computes a curvature test for each of the plots by adding a quadratic term and testing the quadratic to be zero. This is Tukey's test for nonadditivity when plotting against fitted values.
residualPlots(model = lm.full, id.n = 5)
Test stat Pr(>|t|)
age 0.473 0.637
I(age^2) 1.330 0.185
lwt -0.484 0.629
I(lwt^2) 2.312 0.022
smoke NA NA
race NA NA
preterm NA NA
Tukey test 0.788 0.431
Creates plots for examining the possible dependence of spread on level, or an extension of these plots to the studentized residuals from linear models.
spreadLevelPlot(x = lm.full)
Suggested power transformation: 0.5817
## Load effects
library(effects)
## Plot effect of each term
allEffects.lm.full <- allEffects(lm.full)
plot(allEffects.lm.full)