I am working through the material in the Penn State online class STAT 501 Regression Methods. This R script is a personal exercise to translate the concepts and examples illustrated in Minitab into R and SAS. Lesson 8 in STAT 501 is Categorical Predictors. The SAS version of the code below is posted on my blog.
This lesson covers categorical predictors in MLR, including interaction effects.
It also covers the piecewise linear regression model which uses an interaction term to create a model that contains two or more different linear pieces.
** Create a non-filled circle plot in ggplot with fill=NA, and shape=21.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: car
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## `curl` package not installed, falling back to using `url()`
## `curl` package not installed, falling back to using `url()`
## Parsed with column specification:
## cols(
## y = col_double(),
## x1 = col_integer()
## )
Start with an example of the relationship between birth weight and the categorical predictor variable, maternal smoking status, after controlling for gestation period. The scatterplot matrix shows the relationship between gestation period, but not the effect of the categorical variable.
pairs(birthsmokers)
Fit a first order model weight ~ gestation + smoke where weight is the birth weight in grams, Gest is the length of gestation in weeks, Smoke is a binary variable 1 if mother smoked, and the independent error terms \(e_i\) follow a normal distribution with mean 0 and equal variance \(\sigma^2\). This is an example of a first-order model with one binary predictor and one quantitative predictor.
ggplot(birthsmokers) +
geom_point(aes(y=Wgt, x=Gest, color=Smoke)) +
geom_smooth(data=birthsmokers, aes(y=Wgt, x=Gest, color=Smoke),
method = "lm", se=FALSE) +
ggtitle("Scatterplot of Birthweight vs Gestation Length") +
xlab("Gestation (weeks)") +
ylab("Birthweight (grams)")
The scatterplot suggests birth weights for non-smoking mothers are higher than for smoking mothers, regardless of the length of gestation. Use a hypothesis test or confidence interval to see if this result extends to the larger population.
birthsmokers.lm <- lm(Wgt ~ Gest + Smoke, data = birthsmokers)
summary(birthsmokers.lm)
##
## Call:
## lm(formula = Wgt ~ Gest + Smoke, data = birthsmokers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -223.693 -92.063 -9.365 79.663 197.507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2389.573 349.206 -6.843 1.63e-07 ***
## Gest 143.100 9.128 15.677 1.07e-15 ***
## Smokeyes -244.544 41.982 -5.825 2.58e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 115.5 on 29 degrees of freedom
## Multiple R-squared: 0.8964, Adjusted R-squared: 0.8892
## F-statistic: 125.4 on 2 and 29 DF, p-value: 5.289e-15
The regression equion is Wgt = -2390 + 143 Gest - 245 Smoke. 89.64% of the variation in birth weight is explained by gestation length and smoking status. The predictor variable parameter estimates for Gest (p<0.0001) and Smoke (p<0.0001) are significantly different from 0. The ANOVA F-test (p<0.0001) indicates the model variables are more useful than a model without them.
The parameter value for a binary categorical predictor indicates how much higher (or lower) the mean response function of the one group is than that of the other for any value of the other explanatory variables.
The significance of a binary categorical predictor is determined by a hypothesis test with \(H_0: \beta=0\).
confint(birthsmokers.lm)
## 2.5 % 97.5 %
## (Intercept) -3103.7795 -1675.3663
## Gest 124.4312 161.7694
## Smokeyes -330.4064 -158.6817
There is an advantage to pooling the data. The regression model assumes the slope for the two groups are equal. It also assumes that the variances of the error terms are equal. Therefore, pooling the data improves the estimation of the these measures.
birthsmokers_0 <- birthsmokers %>% filter(Smoke=="yes")
birthsmokers_1 <- birthsmokers %>% filter(Smoke=="no")
birthsmokers.lm_0 <- lm(Wgt ~ Gest, data = birthsmokers_0)
summary(birthsmokers.lm_0)
##
## Call:
## lm(formula = Wgt ~ Gest, data = birthsmokers_0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -228.53 -64.86 -19.10 93.89 184.53
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2474.56 553.97 -4.467 0.000532 ***
## Gest 139.03 14.11 9.851 1.12e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 126.6 on 14 degrees of freedom
## Multiple R-squared: 0.8739, Adjusted R-squared: 0.8649
## F-statistic: 97.04 on 1 and 14 DF, p-value: 1.125e-07
birthsmokers.lm_1 <- lm(Wgt ~ Gest, data = birthsmokers_1)
summary(birthsmokers.lm_1)
##
## Call:
## lm(formula = Wgt ~ Gest, data = birthsmokers_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -171.52 -101.59 23.28 83.63 139.48
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2546.14 457.29 -5.568 6.93e-05 ***
## Gest 147.21 11.97 12.294 6.85e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 106.9 on 14 degrees of freedom
## Multiple R-squared: 0.9152, Adjusted R-squared: 0.9092
## F-statistic: 151.1 on 1 and 14 DF, p-value: 6.852e-09
What if the coding of the categorical predictor variable was (-1,1) instead of (0,1)? Then the parameter estimator would indicate how much each group is offset from the overal average.
If the predictor variables are are unrelated to each other, then their effects on the response variable are additive.
What if the predictor variables interact? E.g., in the smoking example, what if smoking is related to gestation length?
In the following example, the effect y of three treatments A, B, and C are plotted by age. The effects are not parallel lines. This means the effect of the treatment on the mean effectiveness \(\mu\) depends on age, and the effect of age on the treatment mean effectiveness \(\mu\) depends on the treatment, so we should add interaction terms to the model.
ggplot(depression) +
geom_point(aes(y=y, x=age, color=TRT)) +
geom_smooth(data=depression, aes(y=y, x=age, color=TRT),
method = "lm", se=FALSE) +
ggtitle("Scatterplot of y vs Age") +
xlab("Age (years)") +
ylab("y")
In this case, fit a second-order regression model with interaction terms: y ~ age + A + B + age*A + age*B where A is treatment A, B is treatment B, and the independent error terms \(e_i\) follow a normal distribution with mean 0 and equal variance \(\sigma^2\).
This model effectively defines three different regression functions, one for each of the three treatments by setting the indicator variables to 0 or 1.
When a model includes interaction terms, the slope parameter is no longer the change in the mean response for each unit increase in the predictor, all other predictors held constant. There is no way to break apart the interaction.
depression.lm <- lm(y ~ age + x2 + x3 + age*x2 + age*x3, data = depression)
summary(depression.lm)
##
## Call:
## lm(formula = y ~ age + x2 + x3 + age * x2 + age * x3, data = depression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4366 -2.7637 0.1887 2.9075 6.5634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.21138 3.34964 1.854 0.073545 .
## age 1.03339 0.07233 14.288 6.34e-15 ***
## x2 41.30421 5.08453 8.124 4.56e-09 ***
## x3 22.70682 5.09097 4.460 0.000106 ***
## age:x2 -0.70288 0.10896 -6.451 3.98e-07 ***
## age:x3 -0.50971 0.11039 -4.617 6.85e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.925 on 30 degrees of freedom
## Multiple R-squared: 0.9143, Adjusted R-squared: 0.9001
## F-statistic: 64.04 on 5 and 30 DF, p-value: 4.264e-15
Before moving on to inferences, evaluate the model.
The Standardized Residuals vs. Fits plot shows residuals with approximately zero average and constant variation variation, and no excessively outlying points.
depression.res <- data.frame(res=rstandard(depression.lm))
depression.fit <- data.frame(fit=fitted(depression.lm))
depression <- cbind(depression, depression.res, depression.fit)
ggplot(data=depression, aes(y = res, x = fit)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
geom_hline(yintercept = 2, linetype = 1, color="red") +
geom_hline(yintercept = -2, linetype = 1, color="red") +
ggtitle("Residuals vs. Fits \n(response is y)") +
xlab("Fitted Value") +
ylab("Standardized Residual")
The normal probability plot shows a linear relationship.
qqnorm(depression$res);
qqline(depression$res)
Now we can address the research questions. Are the three treatments significant after taking age into account (i.e., \(H_0: \beta_2 = \beta_3 = \beta_{12} = \beta_{13} = 0\))?
# Partial F-test.
# Reduced model excludes predictor "age".
depression.lm.red <- lm(y ~ x2 + x3 + age*x2 + age*x3, data = depression)
anova(depression.lm)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 3424.4 3424.4 222.2946 2.059e-15 ***
## x2 1 803.8 803.8 52.1784 4.857e-08 ***
## x3 1 1.2 1.2 0.0772 0.7831
## age:x2 1 375.0 375.0 24.3430 2.808e-05 ***
## age:x3 1 328.4 328.4 21.3194 6.850e-05 ***
## Residuals 30 462.1 15.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(depression.lm.red)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x2 1 924.5 924.5 60.0133 1.212e-08 ***
## x3 1 2.7 2.7 0.1731 0.6803
## age 1 3302.3 3302.3 214.3638 3.331e-15 ***
## x2:age 1 375.0 375.0 24.3430 2.808e-05 ***
## x3:age 1 328.4 328.4 21.3194 6.850e-05 ***
## Residuals 30 462.1 15.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(depression.lm.red, depression.lm)
## Analysis of Variance Table
##
## Model 1: y ~ x2 + x3 + age * x2 + age * x3
## Model 2: y ~ age + x2 + x3 + age * x2 + age * x3
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 462.15
## 2 30 462.15 0 -5.1159e-13
Does the effect of age on the treatment’s effectiveness depend on treatment (i.e., \(H_0: \beta_{12} = \beta_{13} = 0\))?
# Partial F-test.
# Reduced model excludes predictor "age".
depression.lm.red <- lm(y ~ age*x2 + age*x3, data = depression)
anova(depression.lm)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 3424.4 3424.4 222.2946 2.059e-15 ***
## x2 1 803.8 803.8 52.1784 4.857e-08 ***
## x3 1 1.2 1.2 0.0772 0.7831
## age:x2 1 375.0 375.0 24.3430 2.808e-05 ***
## age:x3 1 328.4 328.4 21.3194 6.850e-05 ***
## Residuals 30 462.1 15.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(depression.lm.red)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 3424.4 3424.4 222.2946 2.059e-15 ***
## x2 1 803.8 803.8 52.1784 4.857e-08 ***
## x3 1 1.2 1.2 0.0772 0.7831
## age:x2 1 375.0 375.0 24.3430 2.808e-05 ***
## age:x3 1 328.4 328.4 21.3194 6.850e-05 ***
## Residuals 30 462.1 15.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(depression.lm.red, depression.lm)
## Analysis of Variance Table
##
## Model 1: y ~ age * x2 + age * x3
## Model 2: y ~ age + x2 + x3 + age * x2 + age * x3
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 462.15
## 2 30 462.15 0 0
If the data follows distinct linear trends among predictor variable domains, model the response variable in pieces. The pieces may be connected or disconnected.
Data set shipment contains the cost (y) of handling a shipment in the warehouse (in thousand dollars) and the size (x1) of the shipment (in thousand parts) (n=10). Here are the first few rows.
head(shipment)
## # A tibble: 6 x 2
## y x1
## <dbl> <int>
## 1 12.0 225
## 2 14.1 350
## 3 8.93 150
## 4 11.0 200
## 5 10.0 175
## 6 10.1 180
A scatterplot of y vs x1 shows a potential kink around x1=250.
ggplot(shipment) +
geom_point(aes(y=y, x=x1)) +
geom_smooth(data=shipment, aes(y=y, x=x1), method = "lm", se=FALSE) +
ggtitle("Scatterplot of y vs x1") +
xlab("x1") +
ylab("y")
Define a piecewise linear regression model \(y_i = \beta_1x_{i1} + \beta_2(x_{i1}-250)x_{i2}+\epsilon_i\) where \(x_{i1}\) is the size of the shipment and \(x_{i2}=0\) if \(x_{i1}<250\) and 1 otherwise.
shipment$xstar <- (shipment$x1 - 250) * ifelse(shipment$x1<=250,0,1)
shipment.plm <- lm(y ~ x1 + xstar, data = shipment)
summary(shipment.plm)
##
## Call:
## lm(formula = y ~ x1 + xstar, data = shipment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.10540 -0.06390 -0.02906 0.08047 0.11811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2139257 0.1785735 18.00 4.04e-07 ***
## x1 0.0384599 0.0009554 40.25 1.52e-09 ***
## xstar -0.0247734 0.0016460 -15.05 1.37e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09303 on 7 degrees of freedom
## Multiple R-squared: 0.9988, Adjusted R-squared: 0.9985
## F-statistic: 2938 on 2 and 7 DF, p-value: 5.813e-11
Using the regression function, what is the predicted cost for a shipment with size 125? 250? 400? Notice that the prediction for x1=250 would be the same whether x2 is 0 or 1.
predict(shipment.plm, data.frame(x1=125, xstar=0), interval="confidence")
## fit lwr upr
## 1 8.021416 7.86496 8.177872
predict(shipment.plm, data.frame(x1=250, xstar=0), interval="confidence")
## fit lwr upr
## 1 12.82891 12.65523 13.00258
predict(shipment.plm, data.frame(x1=400, xstar=150), interval="confidence")
## fit lwr upr
## 1 14.88189 14.70378 15.06
shipment$x2 <- ifelse(shipment$x1<=250,"0","1")
ggplot(shipment) +
geom_point(aes(y=y, x=x1, color=x2)) +
geom_smooth(data=shipment, aes(y=y, x=x1, color=x2), method = "lm", se=FALSE) +
ggtitle("Scatterplot of y vs x1") +
xlab("x1") +
ylab("y")
Model the sale price of homes (SalePrice) as a function of square footage (SqFeet) and presence of air conditioning (Air) (1=yes). Suppose the effect of air conditioning depends upon the size of the home.
realestate.lm <- lm(SalePrice ~ SqFeet + Air + SqFeet*Air, data = realestate)
summary(realestate.lm)
##
## Call:
## lm(formula = SalePrice ~ SqFeet + Air + SqFeet * Air, data = realestate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -248.01 -37.13 -7.80 22.25 381.92
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.218 30.085 -0.107 0.914871
## SqFeet 104.902 15.748 6.661 6.96e-11 ***
## Air -78.868 32.663 -2.415 0.016100 *
## SqFeet:Air 55.888 16.580 3.371 0.000805 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 77.01 on 517 degrees of freedom
## Multiple R-squared: 0.6887, Adjusted R-squared: 0.6869
## F-statistic: 381.2 on 3 and 517 DF, p-value: < 2.2e-16
Graph the relationship.
realestate$AirCond <- ifelse(realestate$Air==0,"no","yes")
ggplot(realestate) +
geom_point(aes(y=SalePrice, x=SqFeet, color=AirCond), fill=NA, shape=21) +
geom_smooth(data=realestate, aes(y=SalePrice, x=SqFeet, color=AirCond),
method = "lm", se=FALSE) +
ggtitle("Scatterplot of Sale Price vs Square Footage") +
xlab("Square Footage") +
ylab("Sale Price")
Note the increasing variance problem.
realestate$fitted <- fitted(realestate.lm)
realestate$rstandard <- rstandard(realestate.lm)
ggplot(data=realestate, aes(y = rstandard, x = fitted)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
geom_hline(yintercept = 2, linetype = 1, color="red") +
geom_hline(yintercept = -2, linetype = 1, color="red") +
ggtitle("Residuals vs. Fits \n(response is y)") +
xlab("Fitted Value") +
ylab("Standardized Residual")
Model the risk of infection at a hospital (InfctRsk) as a function of length of stay (Stay), x-ray frequencey (Xray), and hospital region (Region) (1-4). We are interested in differences among the regions, controlling for length of stay and x-ray frequency.
# for unspecified reasons, example excludes two stays over 14 days.
infectionrisk <- infectionrisk %>% filter(Stay <= 14)
# create indicator for 4 regions with reference region 1.
infectionrisk$i2 <- ifelse(infectionrisk$Region==2,1,0)
infectionrisk$i3 <- ifelse(infectionrisk$Region==3,1,0)
infectionrisk$i4 <- ifelse(infectionrisk$Region==4,1,0)
Regress InfctRsk ~ Stay + Xray + i2 + i3 + i4.
infectionrisk.lm <- lm(InfctRsk ~ Stay + Xray + i2 + i3 + i4, data = infectionrisk)
summary(infectionrisk.lm)
##
## Call:
## lm(formula = InfctRsk ~ Stay + Xray + i2 + i3 + i4, data = infectionrisk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.66492 -0.65420 0.04265 0.64034 2.51391
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.134259 0.877347 -2.433 0.01668 *
## Stay 0.505394 0.081455 6.205 1.11e-08 ***
## Xray 0.017587 0.005649 3.113 0.00238 **
## i2 0.171284 0.281475 0.609 0.54416
## i3 0.095461 0.288852 0.330 0.74169
## i4 1.057835 0.378077 2.798 0.00612 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.036 on 105 degrees of freedom
## Multiple R-squared: 0.4198, Adjusted R-squared: 0.3922
## F-statistic: 15.19 on 5 and 105 DF, p-value: 3.243e-11
The i4 parameter estimator is significant (p=0.0061), so region 4 differs from region 1. The positive coefficient means region 4 has higher risk of infection.
Test the overall regional differences. Fit a reduced model with no region indicators.
infectionrisk.lmred <- lm(InfctRsk ~ Stay + Xray, data = infectionrisk)
summary(infectionrisk.lmred)
##
## Call:
## lm(formula = InfctRsk ~ Stay + Xray, data = infectionrisk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.76125 -0.78551 0.01725 0.71503 2.30103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.016188 0.692516 -1.467 0.14518
## Stay 0.403971 0.074642 5.412 3.79e-07 ***
## Xray 0.018541 0.005712 3.246 0.00156 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.07 on 108 degrees of freedom
## Multiple R-squared: 0.364, Adjusted R-squared: 0.3522
## F-statistic: 30.9 on 2 and 108 DF, p-value: 2.445e-11
anova(infectionrisk.lm, infectionrisk.lmred)
## Analysis of Variance Table
##
## Model 1: InfctRsk ~ Stay + Xray + i2 + i3 + i4
## Model 2: InfctRsk ~ Stay + Xray
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 105 112.71
## 2 108 123.56 -3 -10.849 3.3687 0.02135 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-statistic 3.369 (p=0.0214), so reject \(H_0: \beta_3 = \beta_4 = \beta_5 = 0\).
Test whether region 4 is the only regional difference. Fit a reduced model with just the region 4 indicator.
infectionrisk.lmred4 <- lm(InfctRsk ~ Stay + Xray + i4, data = infectionrisk)
summary(infectionrisk.lmred4)
##
## Call:
## lm(formula = InfctRsk ~ Stay + Xray + i4, data = infectionrisk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64970 -0.68798 0.00979 0.62551 2.41138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.937057 0.727267 -2.663 0.00893 **
## Stay 0.496495 0.077550 6.402 4.17e-09 ***
## Xray 0.017393 0.005503 3.161 0.00205 **
## i4 0.947784 0.301464 3.144 0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.028 on 107 degrees of freedom
## Multiple R-squared: 0.4177, Adjusted R-squared: 0.4014
## F-statistic: 25.59 on 3 and 107 DF, p-value: 1.476e-12
anova(infectionrisk.lm, infectionrisk.lmred4)
## Analysis of Variance Table
##
## Model 1: InfctRsk ~ Stay + Xray + i2 + i3 + i4
## Model 2: InfctRsk ~ Stay + Xray + i4
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 105 112.71
## 2 107 113.11 -2 -0.39949 0.1861 0.8305
The F-statistic is 0.1861 (p=0.8305), so do not reject \(H_0: \beta_3 = \beta_4 = 0\).