8 Categorical Predictors

I am working through the material in the Penn State online class STAT 501 Regression Methods. This R script is a personal exercise to translate the concepts and examples illustrated in Minitab into R and SAS. Lesson 8 in STAT 501 is Categorical Predictors. The SAS version of the code below is posted on my blog.

Overview

This lesson covers categorical predictors in MLR, including interaction effects.

It also covers the piecewise linear regression model which uses an interaction term to create a model that contains two or more different linear pieces.

R Concepts Worth Noting

** Create a non-filled circle plot in ggplot with fill=NA, and shape=21.

Data Management for this Module

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading required package: car

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## `curl` package not installed, falling back to using `url()`
## `curl` package not installed, falling back to using `url()`

## Parsed with column specification:
## cols(
##   y = col_double(),
##   x1 = col_integer()
## )

8.1 Example on Birth Weight and Smoking

Start with an example of the relationship between birth weight and the categorical predictor variable, maternal smoking status, after controlling for gestation period. The scatterplot matrix shows the relationship between gestation period, but not the effect of the categorical variable.

pairs(birthsmokers)

Fit a first order model weight ~ gestation + smoke where weight is the birth weight in grams, Gest is the length of gestation in weeks, Smoke is a binary variable 1 if mother smoked, and the independent error terms \(e_i\) follow a normal distribution with mean 0 and equal variance \(\sigma^2\). This is an example of a first-order model with one binary predictor and one quantitative predictor.

ggplot(birthsmokers) +
  geom_point(aes(y=Wgt, x=Gest, color=Smoke)) +
  geom_smooth(data=birthsmokers, aes(y=Wgt, x=Gest, color=Smoke), 
              method = "lm", se=FALSE) +
  ggtitle("Scatterplot of Birthweight vs Gestation Length") +
  xlab("Gestation (weeks)") +
  ylab("Birthweight (grams)")

The scatterplot suggests birth weights for non-smoking mothers are higher than for smoking mothers, regardless of the length of gestation. Use a hypothesis test or confidence interval to see if this result extends to the larger population.

birthsmokers.lm <- lm(Wgt ~ Gest + Smoke, data = birthsmokers)
summary(birthsmokers.lm)

## 
## Call:
## lm(formula = Wgt ~ Gest + Smoke, data = birthsmokers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -223.693  -92.063   -9.365   79.663  197.507 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2389.573    349.206  -6.843 1.63e-07 ***
## Gest          143.100      9.128  15.677 1.07e-15 ***
## Smokeyes     -244.544     41.982  -5.825 2.58e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 115.5 on 29 degrees of freedom
## Multiple R-squared:  0.8964, Adjusted R-squared:  0.8892 
## F-statistic: 125.4 on 2 and 29 DF,  p-value: 5.289e-15

The regression equion is Wgt = -2390 + 143 Gest - 245 Smoke. 89.64% of the variation in birth weight is explained by gestation length and smoking status. The predictor variable parameter estimates for Gest (p<0.0001) and Smoke (p<0.0001) are significantly different from 0. The ANOVA F-test (p<0.0001) indicates the model variables are more useful than a model without them.

8.2 The Basics

The parameter value for a binary categorical predictor indicates how much higher (or lower) the mean response function of the one group is than that of the other for any value of the other explanatory variables.

The significance of a binary categorical predictor is determined by a hypothesis test with \(H_0: \beta=0\).

confint(birthsmokers.lm)

##                  2.5 %     97.5 %
## (Intercept) -3103.7795 -1675.3663
## Gest          124.4312   161.7694
## Smokeyes     -330.4064  -158.6817

8.3 Two Separate Advantages

Why not fit two separate regression equations? Two reasons.

There is an advantage to pooling the data. The regression model assumes the slope for the two groups are equal. It also assumes that the variances of the error terms are equal. Therefore, pooling the data improves the estimation of the these measures.
It makes it a straightforward task to estmate the effect of the binary variable.

birthsmokers_0 <- birthsmokers %>% filter(Smoke=="yes")
birthsmokers_1 <- birthsmokers %>% filter(Smoke=="no")
birthsmokers.lm_0 <- lm(Wgt ~ Gest, data = birthsmokers_0)
summary(birthsmokers.lm_0)

## 
## Call:
## lm(formula = Wgt ~ Gest, data = birthsmokers_0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -228.53  -64.86  -19.10   93.89  184.53 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2474.56     553.97  -4.467 0.000532 ***
## Gest          139.03      14.11   9.851 1.12e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 126.6 on 14 degrees of freedom
## Multiple R-squared:  0.8739, Adjusted R-squared:  0.8649 
## F-statistic: 97.04 on 1 and 14 DF,  p-value: 1.125e-07

birthsmokers.lm_1 <- lm(Wgt ~ Gest, data = birthsmokers_1)
summary(birthsmokers.lm_1)

## 
## Call:
## lm(formula = Wgt ~ Gest, data = birthsmokers_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -171.52 -101.59   23.28   83.63  139.48 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2546.14     457.29  -5.568 6.93e-05 ***
## Gest          147.21      11.97  12.294 6.85e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 106.9 on 14 degrees of freedom
## Multiple R-squared:  0.9152, Adjusted R-squared:  0.9092 
## F-statistic: 151.1 on 1 and 14 DF,  p-value: 6.852e-09

8.4 Coding Qualitative Variables

What if the coding of the categorical predictor variable was (-1,1) instead of (0,1)? Then the parameter estimator would indicate how much each group is offset from the overal average.

8.5 Additive Effects

If the predictor variables are are unrelated to each other, then their effects on the response variable are additive.

8.6 Interaction Effects

What if the predictor variables interact? E.g., in the smoking example, what if smoking is related to gestation length?

In the following example, the effect y of three treatments A, B, and C are plotted by age. The effects are not parallel lines. This means the effect of the treatment on the mean effectiveness \(\mu\) depends on age, and the effect of age on the treatment mean effectiveness \(\mu\) depends on the treatment, so we should add interaction terms to the model.

ggplot(depression) +
  geom_point(aes(y=y, x=age, color=TRT)) +
  geom_smooth(data=depression, aes(y=y, x=age, color=TRT), 
              method = "lm", se=FALSE) +
  ggtitle("Scatterplot of y vs Age") +
  xlab("Age (years)") +
  ylab("y")

In this case, fit a second-order regression model with interaction terms: y ~ age + A + B + age*A + age*B where A is treatment A, B is treatment B, and the independent error terms \(e_i\) follow a normal distribution with mean 0 and equal variance \(\sigma^2\).

This model effectively defines three different regression functions, one for each of the three treatments by setting the indicator variables to 0 or 1.

When a model includes interaction terms, the slope parameter is no longer the change in the mean response for each unit increase in the predictor, all other predictors held constant. There is no way to break apart the interaction.

depression.lm <- lm(y ~ age + x2 + x3 + age*x2 + age*x3, data = depression)
summary(depression.lm)

## 
## Call:
## lm(formula = y ~ age + x2 + x3 + age * x2 + age * x3, data = depression)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4366 -2.7637  0.1887  2.9075  6.5634 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.21138    3.34964   1.854 0.073545 .  
## age          1.03339    0.07233  14.288 6.34e-15 ***
## x2          41.30421    5.08453   8.124 4.56e-09 ***
## x3          22.70682    5.09097   4.460 0.000106 ***
## age:x2      -0.70288    0.10896  -6.451 3.98e-07 ***
## age:x3      -0.50971    0.11039  -4.617 6.85e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.925 on 30 degrees of freedom
## Multiple R-squared:  0.9143, Adjusted R-squared:  0.9001 
## F-statistic: 64.04 on 5 and 30 DF,  p-value: 4.264e-15

Before moving on to inferences, evaluate the model.

The Standardized Residuals vs. Fits plot shows residuals with approximately zero average and constant variation variation, and no excessively outlying points.

depression.res <- data.frame(res=rstandard(depression.lm))
depression.fit <- data.frame(fit=fitted(depression.lm))
depression <- cbind(depression, depression.res, depression.fit)
ggplot(data=depression, aes(y = res, x = fit)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_hline(yintercept = 2, linetype = 1, color="red") +
  geom_hline(yintercept = -2, linetype = 1, color="red") +
  ggtitle("Residuals vs. Fits \n(response is y)") +
  xlab("Fitted Value") +
  ylab("Standardized Residual")

The normal probability plot shows a linear relationship.

qqnorm(depression$res); 
qqline(depression$res)

Now we can address the research questions. Are the three treatments significant after taking age into account (i.e., \(H_0: \beta_2 = \beta_3 = \beta_{12} = \beta_{13} = 0\))?

# Partial F-test.
# Reduced model excludes predictor "age".
depression.lm.red <- lm(y ~ x2 + x3 + age*x2 + age*x3, data = depression)
anova(depression.lm)

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq  F value    Pr(>F)    
## age        1 3424.4  3424.4 222.2946 2.059e-15 ***
## x2         1  803.8   803.8  52.1784 4.857e-08 ***
## x3         1    1.2     1.2   0.0772    0.7831    
## age:x2     1  375.0   375.0  24.3430 2.808e-05 ***
## age:x3     1  328.4   328.4  21.3194 6.850e-05 ***
## Residuals 30  462.1    15.4                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(depression.lm.red)

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq  F value    Pr(>F)    
## x2         1  924.5   924.5  60.0133 1.212e-08 ***
## x3         1    2.7     2.7   0.1731    0.6803    
## age        1 3302.3  3302.3 214.3638 3.331e-15 ***
## x2:age     1  375.0   375.0  24.3430 2.808e-05 ***
## x3:age     1  328.4   328.4  21.3194 6.850e-05 ***
## Residuals 30  462.1    15.4                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(depression.lm.red, depression.lm)

## Analysis of Variance Table
## 
## Model 1: y ~ x2 + x3 + age * x2 + age * x3
## Model 2: y ~ age + x2 + x3 + age * x2 + age * x3
##   Res.Df    RSS Df   Sum of Sq F Pr(>F)
## 1     30 462.15                        
## 2     30 462.15  0 -5.1159e-13

Does the effect of age on the treatment’s effectiveness depend on treatment (i.e., \(H_0: \beta_{12} = \beta_{13} = 0\))?

# Partial F-test.
# Reduced model excludes predictor "age".
depression.lm.red <- lm(y ~ age*x2 + age*x3, data = depression)
anova(depression.lm)

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq  F value    Pr(>F)    
## age        1 3424.4  3424.4 222.2946 2.059e-15 ***
## x2         1  803.8   803.8  52.1784 4.857e-08 ***
## x3         1    1.2     1.2   0.0772    0.7831    
## age:x2     1  375.0   375.0  24.3430 2.808e-05 ***
## age:x3     1  328.4   328.4  21.3194 6.850e-05 ***
## Residuals 30  462.1    15.4                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(depression.lm.red)

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq  F value    Pr(>F)    
## age        1 3424.4  3424.4 222.2946 2.059e-15 ***
## x2         1  803.8   803.8  52.1784 4.857e-08 ***
## x3         1    1.2     1.2   0.0772    0.7831    
## age:x2     1  375.0   375.0  24.3430 2.808e-05 ***
## age:x3     1  328.4   328.4  21.3194 6.850e-05 ***
## Residuals 30  462.1    15.4                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(depression.lm.red, depression.lm)

## Analysis of Variance Table
## 
## Model 1: y ~ age * x2 + age * x3
## Model 2: y ~ age + x2 + x3 + age * x2 + age * x3
##   Res.Df    RSS Df Sum of Sq F Pr(>F)
## 1     30 462.15                      
## 2     30 462.15  0         0

8.7 Leaving an Important Interaction Out of a Model

8.8 Piecewise Linear Regression Models

If the data follows distinct linear trends among predictor variable domains, model the response variable in pieces. The pieces may be connected or disconnected.

Data set shipment contains the cost (y) of handling a shipment in the warehouse (in thousand dollars) and the size (x1) of the shipment (in thousand parts) (n=10). Here are the first few rows.

head(shipment)

## # A tibble: 6 x 2
##       y    x1
##   <dbl> <int>
## 1 12.0    225
## 2 14.1    350
## 3  8.93   150
## 4 11.0    200
## 5 10.0    175
## 6 10.1    180

A scatterplot of y vs x1 shows a potential kink around x1=250.

ggplot(shipment) +
  geom_point(aes(y=y, x=x1)) +
  geom_smooth(data=shipment, aes(y=y, x=x1), method = "lm", se=FALSE) +
  ggtitle("Scatterplot of y vs x1") +
  xlab("x1") +
  ylab("y")

Define a piecewise linear regression model \(y_i = \beta_1x_{i1} + \beta_2(x_{i1}-250)x_{i2}+\epsilon_i\) where \(x_{i1}\) is the size of the shipment and \(x_{i2}=0\) if \(x_{i1}<250\) and 1 otherwise.

shipment$xstar <- (shipment$x1 - 250) * ifelse(shipment$x1<=250,0,1)
shipment.plm <- lm(y ~ x1 + xstar, data = shipment)
summary(shipment.plm)

## 
## Call:
## lm(formula = y ~ x1 + xstar, data = shipment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.10540 -0.06390 -0.02906  0.08047  0.11811 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.2139257  0.1785735   18.00 4.04e-07 ***
## x1           0.0384599  0.0009554   40.25 1.52e-09 ***
## xstar       -0.0247734  0.0016460  -15.05 1.37e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09303 on 7 degrees of freedom
## Multiple R-squared:  0.9988, Adjusted R-squared:  0.9985 
## F-statistic:  2938 on 2 and 7 DF,  p-value: 5.813e-11

Using the regression function, what is the predicted cost for a shipment with size 125? 250? 400? Notice that the prediction for x1=250 would be the same whether x2 is 0 or 1.

predict(shipment.plm, data.frame(x1=125, xstar=0), interval="confidence")

##        fit     lwr      upr
## 1 8.021416 7.86496 8.177872

predict(shipment.plm, data.frame(x1=250, xstar=0), interval="confidence")

##        fit      lwr      upr
## 1 12.82891 12.65523 13.00258

predict(shipment.plm, data.frame(x1=400, xstar=150), interval="confidence")

##        fit      lwr   upr
## 1 14.88189 14.70378 15.06

shipment$x2 <- ifelse(shipment$x1<=250,"0","1")
ggplot(shipment) +
  geom_point(aes(y=y, x=x1, color=x2)) +
  geom_smooth(data=shipment, aes(y=y, x=x1, color=x2), method = "lm", se=FALSE) +
  ggtitle("Scatterplot of y vs x1") +
  xlab("x1") +
  ylab("y")

8.9 Further Examples

Real Estate Air Conditioning

Model the sale price of homes (SalePrice) as a function of square footage (SqFeet) and presence of air conditioning (Air) (1=yes). Suppose the effect of air conditioning depends upon the size of the home.

realestate.lm <- lm(SalePrice ~ SqFeet + Air + SqFeet*Air, data = realestate)
summary(realestate.lm)

## 
## Call:
## lm(formula = SalePrice ~ SqFeet + Air + SqFeet * Air, data = realestate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -248.01  -37.13   -7.80   22.25  381.92 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -3.218     30.085  -0.107 0.914871    
## SqFeet       104.902     15.748   6.661 6.96e-11 ***
## Air          -78.868     32.663  -2.415 0.016100 *  
## SqFeet:Air    55.888     16.580   3.371 0.000805 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77.01 on 517 degrees of freedom
## Multiple R-squared:  0.6887, Adjusted R-squared:  0.6869 
## F-statistic: 381.2 on 3 and 517 DF,  p-value: < 2.2e-16

Graph the relationship.

realestate$AirCond <- ifelse(realestate$Air==0,"no","yes")
ggplot(realestate) +
  geom_point(aes(y=SalePrice, x=SqFeet, color=AirCond), fill=NA, shape=21) +
  geom_smooth(data=realestate, aes(y=SalePrice, x=SqFeet, color=AirCond), 
              method = "lm", se=FALSE) +
  ggtitle("Scatterplot of Sale Price vs Square Footage") +
  xlab("Square Footage") +
  ylab("Sale Price")

Note the increasing variance problem.

realestate$fitted <- fitted(realestate.lm)
realestate$rstandard <- rstandard(realestate.lm)
ggplot(data=realestate, aes(y = rstandard, x = fitted)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_hline(yintercept = 2, linetype = 1, color="red") +
  geom_hline(yintercept = -2, linetype = 1, color="red") +
  ggtitle("Residuals vs. Fits \n(response is y)") +
  xlab("Fitted Value") +
  ylab("Standardized Residual")

Hospital Infection Risk

Model the risk of infection at a hospital (InfctRsk) as a function of length of stay (Stay), x-ray frequencey (Xray), and hospital region (Region) (1-4). We are interested in differences among the regions, controlling for length of stay and x-ray frequency.

# for unspecified reasons, example excludes two stays over 14 days.
infectionrisk <- infectionrisk %>% filter(Stay <= 14)
# create indicator for 4 regions with reference region 1.
infectionrisk$i2 <- ifelse(infectionrisk$Region==2,1,0)
infectionrisk$i3 <- ifelse(infectionrisk$Region==3,1,0)
infectionrisk$i4 <- ifelse(infectionrisk$Region==4,1,0)

Regress InfctRsk ~ Stay + Xray + i2 + i3 + i4.

infectionrisk.lm <- lm(InfctRsk ~ Stay + Xray + i2 + i3 + i4, data = infectionrisk)
summary(infectionrisk.lm)

## 
## Call:
## lm(formula = InfctRsk ~ Stay + Xray + i2 + i3 + i4, data = infectionrisk)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.66492 -0.65420  0.04265  0.64034  2.51391 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.134259   0.877347  -2.433  0.01668 *  
## Stay         0.505394   0.081455   6.205 1.11e-08 ***
## Xray         0.017587   0.005649   3.113  0.00238 ** 
## i2           0.171284   0.281475   0.609  0.54416    
## i3           0.095461   0.288852   0.330  0.74169    
## i4           1.057835   0.378077   2.798  0.00612 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.036 on 105 degrees of freedom
## Multiple R-squared:  0.4198, Adjusted R-squared:  0.3922 
## F-statistic: 15.19 on 5 and 105 DF,  p-value: 3.243e-11

The i4 parameter estimator is significant (p=0.0061), so region 4 differs from region 1. The positive coefficient means region 4 has higher risk of infection.

Test the overall regional differences. Fit a reduced model with no region indicators.

infectionrisk.lmred <- lm(InfctRsk ~ Stay + Xray, data = infectionrisk)
summary(infectionrisk.lmred)

## 
## Call:
## lm(formula = InfctRsk ~ Stay + Xray, data = infectionrisk)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.76125 -0.78551  0.01725  0.71503  2.30103 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.016188   0.692516  -1.467  0.14518    
## Stay         0.403971   0.074642   5.412 3.79e-07 ***
## Xray         0.018541   0.005712   3.246  0.00156 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.07 on 108 degrees of freedom
## Multiple R-squared:  0.364,  Adjusted R-squared:  0.3522 
## F-statistic:  30.9 on 2 and 108 DF,  p-value: 2.445e-11

anova(infectionrisk.lm, infectionrisk.lmred)

## Analysis of Variance Table
## 
## Model 1: InfctRsk ~ Stay + Xray + i2 + i3 + i4
## Model 2: InfctRsk ~ Stay + Xray
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    105 112.71                              
## 2    108 123.56 -3   -10.849 3.3687 0.02135 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-statistic 3.369 (p=0.0214), so reject \(H_0: \beta_3 = \beta_4 = \beta_5 = 0\).

Test whether region 4 is the only regional difference. Fit a reduced model with just the region 4 indicator.

infectionrisk.lmred4 <- lm(InfctRsk ~ Stay + Xray + i4, data = infectionrisk)
summary(infectionrisk.lmred4)

## 
## Call:
## lm(formula = InfctRsk ~ Stay + Xray + i4, data = infectionrisk)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64970 -0.68798  0.00979  0.62551  2.41138 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.937057   0.727267  -2.663  0.00893 ** 
## Stay         0.496495   0.077550   6.402 4.17e-09 ***
## Xray         0.017393   0.005503   3.161  0.00205 ** 
## i4           0.947784   0.301464   3.144  0.00216 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.028 on 107 degrees of freedom
## Multiple R-squared:  0.4177, Adjusted R-squared:  0.4014 
## F-statistic: 25.59 on 3 and 107 DF,  p-value: 1.476e-12

anova(infectionrisk.lm, infectionrisk.lmred4)

## Analysis of Variance Table
## 
## Model 1: InfctRsk ~ Stay + Xray + i2 + i3 + i4
## Model 2: InfctRsk ~ Stay + Xray + i4
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    105 112.71                           
## 2    107 113.11 -2  -0.39949 0.1861 0.8305

The F-statistic is 0.1861 (p=0.8305), so do not reject \(H_0: \beta_3 = \beta_4 = 0\).