KITADA

Lesson #23

Regression with a categorical explanatory variable

Motivation:

So far, all of our variables in regression have been quantitative. But, we could have categorical explanatory variables in regression. (And even a categorical response variable, but we won’t get into that in this class.) This lesson discusses how to perform a regression analysis with a categorical explanatory variable.

What you need to know from this lesson:

After completing this lesson, you should be able to

properly code the categories of the categorical explanatory variable in R.
write and define the terms in a least-squares regression equation that contains a categorical explanatory variable
interpret coefficients for categorical explanatory variables in the least-squares regression equation
use the least-squares equation for predicting the response variable for the different categories of the explanatory variable
determine if the categorical explanatory variable is a significant predictor of the response variable
explain what an interaction term is, when to use it, and how to interpret its coefficient in the least-squares regression equation (time permitting)

To accomplish the above “What You Need to Know”, do the following:

1. Attend lecture and answer the questions on the following pages of this lesson.
2. Read Section 10.3 in the text (pages 584 – 588)
3. Do the Lesson 23 questions at the end of the lesson notes

The Lesson

Example: biofeedback in the performance of a task

Researchers wished to study the effect of biofeedback and manual dexterity on the ability of patients to perform a complicated task accurately. Twenty-eight patients were randomly selected from those referred for physical therapy. The 28 were then randomly assigned to either receive or not receive biofeedback (a technique that trains people to improve their health by controlling certain bodily processes that normally happen involuntarily). Researchers obtained a manual dexterity score for each patient and then asked each to perform a task. The number of consecutive repetitions of the task before an error was made was recorded for each patient. The data are below, with biofeedback coded as 1 if the patient received biofeedback and 0 if the patient did not receive biofeedback.

BIOFEEDBACK

##    repetitions manual biofeedback
## 1           88    225           1
## 2          102     88           1
## 3           73    162           0
## 4          105     90           1
## 5           51    245           0
## 6           52    150           1
## 7          106     87           1
## 8           76    212           1
## 9          100    112           1
## 10         112     77           1
## 11          89    137           0
## 12          52    171           0
## 13          49    199           0
## 14          75    137           1
## 15          50    149           0
## 16          75    251           1
## 17          75    102           0
## 18         112     90           1
## 19          55    180           0
## 20         115     25           1
## 21          50    142           0
## 22          87     88           0
## 23         106     87           0
## 24          91    101           0
## 25          75    211           1
## 26          70    136           1
## 27         100    100           0
## 28         100    100           1

Note:

To do a regression analysis using some statistical software, categorical variables must be treated as quantitative variables. For this example the categorical variable (biofeedback) has been converted to a quantitative variable. If the categorical variable has two categories, the procedure is to assign a 0 to all observations in one of the categories and a 1 to all of the observations in the other category.
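
As a minimal sketch of this coding step, suppose the raw data had stored the treatment as text labels. The `treatment` vector and its labels below are hypothetical and are not part of the lesson's data set, which already arrives coded as 0/1.

```r
# Hypothetical raw labels (assumption: not taken from the lesson's data).
treatment <- c("biofeedback", "none", "biofeedback", "none", "none")

# Code the category of interest as 1 and the other category as 0.
biofeedback <- ifelse(treatment == "biofeedback", 1, 0)
print(biofeedback)

# Equivalent route: store the variable as a factor with "none" as the
# reference level and let model.matrix() build the 0/1 column.
treatment_f <- factor(treatment, levels = c("none", "biofeedback"))
dummy <- model.matrix(~ treatment_f)[, "treatment_fbiofeedback"]
print(unname(dummy))
```

Whichever category is coded 0 becomes the baseline that the other category is compared against, so it is worth deciding this on purpose.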

Part one:

Let’s ignore manual dexterity score for now and determine the effect biofeedback has on # of consecutive repetitions of the complicated task without making an error. (We’ll assume all conditions are met and concentrate on the analysis.)

1. Which is the response variable and which is the explanatory variable?

RESPONSE: Repetitions
EXPLANATORY: Biofeedback

Use the output below to answer the questions that follow.

### LM MODEL ###
bio_mod<-with(BIOFEEDBACK, lm(repetitions~biofeedback))
summary(bio_mod)

## 
## Call:
## lm(formula = repetitions ~ biofeedback)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.867 -17.135   2.615  16.115  34.615 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   71.385      5.593   12.76 1.06e-12 ***
## biofeedback   19.482      7.641    2.55    0.017 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.17 on 26 degrees of freedom
## Multiple R-squared:    0.2,  Adjusted R-squared:  0.1692 
## F-statistic:   6.5 on 1 and 26 DF,  p-value: 0.01703

anova(bio_mod)

## Analysis of Variance Table
## 
## Response: repetitions
##             Df  Sum Sq Mean Sq F value  Pr(>F)  
## biofeedback  1  2643.3 2643.30  6.5002 0.01703 *
## Residuals   26 10572.8  406.65                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2. Write the least-squares regression equation and EXPLAIN THE MEANING OF EACH TERM!

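REPETITIONS = 71.385 + 19.482 * BIOFEEDBACK

REPETITIONS: predicted # of consecutive repetitions before an error
BIOFEEDBACK: 1 if the patient received biofeedback, 0 if the patient did not
71.385: predicted # of repetitions for patients who did not receive biofeedback (BIOFEEDBACK = 0)
19.482: difference in predicted # of repetitions between patients who received biofeedback and those who did not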

3. Predict # of consecutive repetitions for those receiving biofeedback. Do the same for those who do not receive biofeedback.

BIOFEEDBACK = 1
- REPETITIONS = 71.385 + 19.482 = 90.867
BIOFEEDBACK = 0
- REPETITIONS = 71.385
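
The same two predictions can be reproduced in R. This is a sketch that retypes the repetitions and biofeedback columns from the data table above (the lesson itself loads a ready-made `BIOFEEDBACK` data frame) and compares the predictions with the sample mean of each group.

```r
# Data transcribed from the table in this lesson (manual dexterity omitted).
repetitions <- c(88, 102, 73, 105, 51, 52, 106, 76, 100, 112, 89, 52, 49, 75,
                 50, 75, 75, 112, 55, 115, 50, 87, 106, 91, 75, 70, 100, 100)
biofeedback <- c(1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1,
                 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1)
BIOFEEDBACK <- data.frame(repetitions, biofeedback)

bio_mod <- lm(repetitions ~ biofeedback, data = BIOFEEDBACK)

# Predicted repetitions for biofeedback = 0 and biofeedback = 1.
pred <- predict(bio_mod, newdata = data.frame(biofeedback = c(0, 1)))

# Sample mean of repetitions within each group, for comparison.
group_means <- tapply(BIOFEEDBACK$repetitions, BIOFEEDBACK$biofeedback, mean)

print(round(pred, 3))
print(round(group_means, 3))
```

With a single 0/1 explanatory variable, the two predicted values are simply the two group means.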

4. Mark the predicted values from #3 on the scatterplot below. What is the difference in these predicted values? What is this number in the least-squares regression equation?

### BIOFEEDBACK MOD ###
par(mfrow=c(1,1))
with(BIOFEEDBACK, plot(biofeedback, repetitions,
                       main="Scatterplot: BIO vs REP"))
abline(coefficients(bio_mod), 
       lwd=2, lty=2, 
       col="red")

(Plot: scatterplot of repetitions versus biofeedback, with the least-squares regression line drawn as a dashed red line.)

5. Interpret the value of the slope in the regression equation in the context of the problem.

The slope is the effect of the biofeedback treatment. Patients who receive biofeedback are predicted to complete, on average, 19.482 more consecutive repetitions than patients in the control group.

6. Is biofeedback a statistically significant predictor of # of consecutive repetitions? Perform a hypothesis test to answer this question.

#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)  
#biofeedback   19.482      7.641    2.55    0.017 *

Test statistic: 2.55
DF: 26
P-value: 0.017

There is moderate evidence that biofeedback is a statistically significant predictor of repetitions (p-value = 0.017). Therefore, we reject the null hypothesis that the coefficient of biofeedback is 0 at the \( \alpha=0.05 \) level.
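
As a quick check, the test statistic and p-value can be rebuilt from the rounded numbers in the coefficient table. This sketch assumes the usual two-sided test of whether the coefficient of biofeedback is 0.

```r
estimate  <- 19.482   # coefficient of biofeedback from the output
std_error <- 7.641    # its standard error
df_resid  <- 26       # residual degrees of freedom: 28 patients - 2 coefficients

# Test statistic: estimate divided by its standard error.
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution with 26 degrees of freedom.
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

print(round(t_stat, 2))
print(round(p_value, 3))
```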

7. Another way of phrasing the question of interest is, “Is there a difference in the average # of consecutive repetitions between those receiving biofeedback and those that don’t?” What analysis procedure can be used to answer the question of interest? Compare this procedure with the simple regression procedure performed above.

We could also do a two-sample t-test.

### TWO SAMPLE T-TEST ###
library(dplyr)

BIOFEED_0<-BIOFEEDBACK%>%
  filter(biofeedback==0)

BIOFEED_1<-BIOFEEDBACK%>%
  filter(biofeedback==1)

t.test(BIOFEED_1$repetitions,
       BIOFEED_0$repetitions,
       var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  BIOFEED_1$repetitions and BIOFEED_0$repetitions
## t = 2.5496, df = 26, p-value = 0.01703
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   3.775026 35.189077
## sample estimates:
## mean of x mean of y 
##  90.86667  71.38462
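
To compare the two procedures directly, here is a sketch that runs both on the same data and lines up the results. It retypes the data from the table above and uses base R subsetting instead of dplyr so that it stands on its own.

```r
# Data transcribed from the table in this lesson.
repetitions <- c(88, 102, 73, 105, 51, 52, 106, 76, 100, 112, 89, 52, 49, 75,
                 50, 75, 75, 112, 55, 115, 50, 87, 106, 91, 75, 70, 100, 100)
biofeedback <- c(1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1,
                 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1)

# Pooled-variance two-sample t-test (biofeedback group minus control group).
tt <- t.test(repetitions[biofeedback == 1], repetitions[biofeedback == 0],
             var.equal = TRUE)

# Simple regression on the 0/1 variable.
fit <- summary(lm(repetitions ~ biofeedback))
slope_row <- fit$coefficients["biofeedback", ]

# The t statistics agree, and t squared equals the F statistic.
print(c(t_test = unname(tt$statistic),
        regression = unname(slope_row["t value"])))
print(c(t_squared = unname(tt$statistic)^2,
        F = unname(fit$fstatistic["value"])))
```

The agreement relies on `var.equal=TRUE`: the pooled two-sample t-test makes the same equal-variance assumption as the regression model.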

8. For extra practice, complete the Analysis of Variance table.
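
One way to check your work is to rebuild the table entries from the two sums of squares and their degrees of freedom. The sketch below starts from the rounded values printed by `anova(bio_mod)`, so small rounding differences are expected.

```r
ss_model <- 2643.3    # Sum Sq for biofeedback
ss_error <- 10572.8   # Sum Sq for residuals
df_model <- 1
df_error <- 26

# Totals.
ss_total <- ss_model + ss_error
df_total <- df_model + df_error

# Mean squares, F statistic, and its p-value.
ms_model <- ss_model / df_model
ms_error <- ss_error / df_error
f_stat   <- ms_model / ms_error
p_value  <- pf(f_stat, df_model, df_error, lower.tail = FALSE)

# Quantities reported in summary(bio_mod).
r_squared <- ss_model / ss_total
resid_se  <- sqrt(ms_error)

print(round(c(F = f_stat, p = p_value, R2 = r_squared, s = resid_se), 4))
```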

Part Two:

Let’s include manual dexterity score in the model.

9. The scatterplot of manual dexterity score versus # of consecutive repetitions without making an error with different symbols for those that receive biofeedback and those that did not is given below. Use it to answer these two questions.

### MANUAL AND BIOFEEDBACK ###
par(mfrow=c(1,1))
with(BIOFEEDBACK, plot(manual, repetitions,
                       type="n",
                       main="Scatterplot: MANUAL vs REP"))

with(BIOFEED_1, points(manual, repetitions,
                       col="red",
                       pch=16))

with(BIOFEED_0, points(manual, repetitions,
                       col="blue",
                       pch=17))

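(Plot: scatterplot of repetitions versus manual dexterity score; red circles mark patients who received biofeedback and blue triangles mark those who did not.)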

a. In the least-squares regression equation, what would you expect the coefficient of dexterity score to be? Why?

I would expect the coefficient to be negative: in general, as manual dexterity score increases, the number of repetitions decreases.

b. Do you think that the coefficient of dexterity score would be the same for those receiving biofeedback compared to those that didn’t? That is, if two separate simple linear regressions were performed (one for those who received biofeedback and one for those who did not), would the slope of the least-squares regression line be the same in both models?

No, I don't think the slopes would be the same because it looks like there are different rates of change for the two treatment groups.

10. Here is the least-squares regression equation:

### MANUAL AND BIOFEEDBACK MODEL ###
biodex_mod<-with(BIOFEEDBACK, lm(repetitions~manual+biofeedback))
summary(biodex_mod)

## 
## Call:
## lm(formula = repetitions ~ manual + biofeedback)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.490  -7.380   3.421   7.478  20.520 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 107.7089     7.9177  13.604 4.67e-13 ***
## manual       -0.2535     0.0480  -5.281 1.80e-05 ***
## biofeedback  16.8018     5.3817   3.122   0.0045 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.14 on 25 degrees of freedom
## Multiple R-squared:  0.6218, Adjusted R-squared:  0.5916 
## F-statistic: 20.56 on 2 and 25 DF,  p-value: 5.258e-06

a. Interpret the coefficient of biofeedback in the context of the problem.

For patients with the same manual dexterity score, those who receive biofeedback are predicted to complete, on average, 16.8018 more consecutive repetitions than those who do not.

b. Interpret the coefficient of dexterity score in the context of the problem.

For each additional point in the dexterity score, the predicted number of consecutive repetitions decreases by 0.2535, whether or not the patient received biofeedback.
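
To see what these two interpretations assume, here is a small sketch built from the coefficients printed in the output above. The helper `predict_reps` is a name made up for this illustration. Under this model the two groups follow parallel lines, so the gap between them is the same at every dexterity score.

```r
# Coefficients copied from the summary(biodex_mod) output.
b_intercept <- 107.7089
b_manual    <- -0.2535
b_bio       <- 16.8018

# Hypothetical helper: prediction from the model without an interaction term.
predict_reps <- function(manual, biofeedback) {
  b_intercept + b_manual * manual + b_bio * biofeedback
}

manual_scores <- c(100, 150, 200)
with_bio    <- predict_reps(manual_scores, 1)
without_bio <- predict_reps(manual_scores, 0)

# The gap between the groups does not depend on the dexterity score.
print(with_bio - without_bio)

# The change per one-point increase in dexterity is the same in both groups.
print(predict_reps(101, 0) - predict_reps(100, 0))
print(predict_reps(101, 1) - predict_reps(100, 1))
```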

11. Below is the scatterplot with the least-squares regression line (from the regression of # of consecutive repetitions versus manual dexterity score) drawn in for those receiving biofeedback and for those not receiving biofeedback. In looking at the scatterplot, are the interpretations in #10 accurate? Explain.

No, because the rates of change are different for the two treatment groups.

### INTERACTION MODELS ###
b1_mod<-with(BIOFEED_1, lm(repetitions~manual))
b0_mod<-with(BIOFEED_0, lm(repetitions~manual))

par(mfrow=c(1,1))
with(BIOFEEDBACK, plot(manual, repetitions,
                       type="n",
                       main="Scatterplot: MANUAL vs REP"))

with(BIOFEED_1, points(manual, repetitions,
                       col="red",
                       pch=16))
abline(coefficients(b1_mod), lwd=2, col="red")
with(BIOFEED_0, points(manual, repetitions,
                       col="blue",
                       pch=17))
abline(coefficients(b0_mod), lwd=2, col="blue")

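(Plot: the same scatterplot with a separate least-squares regression line for each group, red for biofeedback and blue for no biofeedback.)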

12. The problem with using the least-squares regression equation in #10 to interpret the coefficients of dexterity score and biofeedback is that the slopes of the separate regression lines (one for those receiving biofeedback and one for those not receiving biofeedback) are different. When the slopes are different, the rate of change in the response variable for a unit increase in the explanatory variable differs between the groups being compared. When this happens, we say that the two explanatory variables are interacting with each other. That is, # of consecutive repetitions reacts differently to changes in manual dexterity score for those who received biofeedback than for those who did not. When this happens, an interaction term must be included in the model. The interaction term is just the product of manual dexterity score and biofeedback and is indicated in the model by (dexterity score)*biofeedback.
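
The sketch below, which retypes the data from the table above, illustrates two points from this paragraph: the interaction column really is the product of the two variables, and a model with that column reproduces the slopes of the two separate regression lines. In an R formula, `manual * biofeedback` is shorthand for `manual + biofeedback + manual:biofeedback`.

```r
# Data transcribed from the table in this lesson.
repetitions <- c(88, 102, 73, 105, 51, 52, 106, 76, 100, 112, 89, 52, 49, 75,
                 50, 75, 75, 112, 55, 115, 50, 87, 106, 91, 75, 70, 100, 100)
manual <- c(225, 88, 162, 90, 245, 150, 87, 212, 112, 77, 137, 171, 199, 137,
            149, 251, 102, 90, 180, 25, 142, 88, 87, 101, 211, 136, 100, 100)
biofeedback <- c(1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1,
                 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1)

# Model with both variables and their interaction.
bioint_mod <- lm(repetitions ~ manual * biofeedback)
X <- model.matrix(bioint_mod)
print(colnames(X))

# The interaction column is the elementwise product of the two variables.
print(all(X[, "manual:biofeedback"] == manual * biofeedback))

# Separate fits for each group, as in the plot for #11.
slope_0 <- coef(lm(repetitions ~ manual, subset = biofeedback == 0))["manual"]
slope_1 <- coef(lm(repetitions ~ manual, subset = biofeedback == 1))["manual"]

# Slopes implied by the interaction model for each group.
b <- coef(bioint_mod)
print(c(no_biofeedback = unname(slope_0),
        from_interaction_model = unname(b["manual"])))
print(c(biofeedback = unname(slope_1),
        from_interaction_model = unname(b["manual"] + b["manual:biofeedback"])))
```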

Below is the regression output when including the interaction term:

bioint_mod<-with(BIOFEEDBACK, lm(repetitions~manual+
                                   biofeedback+
                                   manual*biofeedback))
summary(bioint_mod)

## 
## Call:
## lm(formula = repetitions ~ manual + biofeedback + manual * biofeedback)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.270  -5.243   2.064   8.959  16.354 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        121.59567   12.57298   9.671 9.41e-10 ***
## manual              -0.35037    0.08353  -4.195 0.000322 ***
## biofeedback         -3.07789   15.10554  -0.204 0.840260    
## manual:biofeedback   0.14205    0.10113   1.405 0.172936    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.87 on 24 degrees of freedom
## Multiple R-squared:  0.6506, Adjusted R-squared:  0.6069 
## F-statistic: 14.89 on 3 and 24 DF,  p-value: 1.099e-05

a. Write the least-squares regression equation.

REPETITIONS=121.59567-0.35037*DEXTERITY-3.07789*BIOFEEDBACK+0.14205*DEXTERITY*BIOFEEDBACK

b. What does the coefficient of the interaction term mean? (I’ll answer this one for you as it’s a bit confusing)

    For every one unit increase in manual dexterity scores, the number of consecutive repetitions without making an error is predicted to increase by 0.1421 more for those receiving biofeedback than those who do not receive biofeedback.

    To understand this, suppose two people have the same manual dexterity score (say 125). One of these two people received biofeedback while the other did not.

i] Use the least-squares regression equation to predict # of consecutive repetitions for each person.
Manual Dexterity: 125
Biofeedback = 0:
- Repetitions = 121.59567-0.35037*125=77.79942
Biofeedback = 1:
- Repetitions = 121.59567-0.35037*125-3.07789+0.14205*125=92.47778
ii] What is the difference in these two predicted values?

14.67836

Now, suppose two other people have manual dexterity scores of 126. Again, one received biofeedback while the other did not.

iii] Use the least-squares regression equation to predict # of consecutive repetitions for each of these two people.
Manual Dexterity: 126
Biofeedback = 0:
- Repetitions = 121.59567-0.35037*126=77.44905
Biofeedback = 1:
- Repetitions = 121.59567-0.35037*126-3.07789+0.14205*126=92.26946
iv] What is the difference in these two predicted values?

14.82041

v] What is the difference between the two values calculated in parts ii] and iv]?

So, as we increased the manual dexterity score by 1 (from 125 to 126), the predicted # of consecutive repetitions for those receiving biofeedback went from 14.67836 more than for those not receiving biofeedback to 14.82041 more, an increase of 0.14205, which is the coefficient of the interaction term!
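
The arithmetic in parts i] through v] can be checked with a few lines of R. This is a sketch that hard-codes the coefficients from the regression output; `predict_reps` is a name made up for the illustration.

```r
# Hypothetical helper using coefficients copied from summary(bioint_mod).
predict_reps <- function(manual, biofeedback) {
  121.59567 - 0.35037 * manual - 3.07789 * biofeedback +
    0.14205 * manual * biofeedback
}

# Gap between the biofeedback and no-biofeedback predictions at each score.
gap_125 <- predict_reps(125, 1) - predict_reps(125, 0)
gap_126 <- predict_reps(126, 1) - predict_reps(126, 0)

print(round(c(gap_125 = gap_125, gap_126 = gap_126), 5))

# Change in the gap for a one-point increase in manual dexterity.
print(round(gap_126 - gap_125, 5))
```

Trying other pairs of scores one point apart (say 80 and 81) gives the same change in the gap, because the gap is a straight-line function of the dexterity score.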

c. With an interaction term in the equation, we cannot interpret the coefficient of dexterity score or biofeedback in the usual way.
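
One way to see why is to substitute each value of BIOFEEDBACK into the least-squares equation from part a. This is a worked sketch using the rounded coefficients from the output.

\[ \text{BIOFEEDBACK}=0:\quad \widehat{\text{REPETITIONS}} = 121.59567 - 0.35037 \times \text{DEXTERITY} \]

\[ \text{BIOFEEDBACK}=1:\quad \widehat{\text{REPETITIONS}} = (121.59567 - 3.07789) + (-0.35037 + 0.14205) \times \text{DEXTERITY} = 118.51778 - 0.20832 \times \text{DEXTERITY} \]

So -0.35037 is the slope only for those not receiving biofeedback, and -3.07789 is the difference between the two groups only at a dexterity score of 0, which lies outside the scores observed in these data (25 to 251).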

13. If the slopes of the regression lines for the two groups were the same, there would be no need to include an interaction term as the difference in the response variable between the two groups would be the same for any value of the quantitative explanatory variable. To illustrate, suppose the scatterplot of # of consecutive repetitions versus manual dexterity score for those receiving biofeedback and those not receiving biofeedback looked like this. For any dexterity score, the difference in the # of consecutive repetitions before making an error between those receiving biofeedback and those not receiving biofeedback is the same.

(SEE PLOT ON HANDOUT)

What is the value of this difference? The regression equations for both groups are given below.

Receiving Biofeedback: \( \hat{y} = 10 + 0.5 \times \text{Dexterity} \)
Not Receiving Biofeedback: \( \hat{y} = 0 + 0.5 \times \text{Dexterity} \)

a. Where is the difference in the regression equations?

The intercept value

b. What is the value of the difference in the y-intercepts?

c. When the slopes are the same, the difference in the y-intercepts tells us the difference in the response variable between the two groups for any value of the explanatory variable.
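
As a worked check using the two equations given in this question, subtracting one line from the other makes the dexterity terms cancel:

\[ (10 + 0.5 \times \text{Dexterity}) - (0 + 0.5 \times \text{Dexterity}) = 10 \]

The result does not involve the dexterity score at all, which is exactly what "no interaction" means.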

Summary

A regression analysis can be performed with categorical explanatory variables. However, we have to “trick” the statistical software into thinking the categorical variables are quantitative by assigning two numbers to the two different categories of the variable, typically 0 and 1. (We did not and will not get into how to deal with variables with more than two categories in this class.) Most of the analysis is done the same way as if the variable were quantitative. But, there are a few changes to remember:

Always make sure to define how the categorical variable was coded when writing the regression equation (i.e. which category was coded with a 0 and which was coded with a 1)
The interpretation of the coefficient of the categorical explanatory variable is a little more than “increasing x by 1”. Be specific: what does it mean to “increase the categorical explanatory variable by 1 unit”?
If the regression equation has two explanatory variables, there may be an interaction between the two variables. If one of the explanatory variables is quantitative, it’s easy to check to see if there is an interaction: make a scatterplot of the response variable versus the quantitative explanatory variable. Include a regression line for each of the categories of the categorical variable of interest. If the lines are parallel or nearly parallel, no interaction is taking place (i.e. no interaction term is needed in the equation). However, if the lines are not parallel, there is an interaction between the two variables and an interaction term is needed in the regression equation.
Interpreting the coefficient of the interaction term is not easy. Look at the above example one more time to fully understand this idea.

When there are three or more explanatory variables, an interaction could be taking place between two or more of the explanatory variables. The model could get quite complicated, but we will not get into that situation in this class. You only need to know what an interaction is, why it might be necessary to include an interaction term in the regression model, and how to interpret the coefficient of the interaction term for two explanatory variables, one of which is quantitative and one of which is categorical.