Categorical Predictors

Prompt: Today we will explore different ways to code categorical predictors. We will be using a data set from a study that tested how different dosages of drug x (IV) influenced alertness (DV). This is an experimental design where participants were randomly assigned to different conditions.

Let’s start by loading necessary packages

Next let’s load our data

data <- read.csv("HW5data.csv")
#let's peak at our data 
head(data)
##   Dosage Alertness
## 1      a        30
## 2      a        38
## 3      a        35
## 4      a        41
## 5      a        27
## 6      a        24

1. The first thing we want to do is look at some descriptives, what are the means for each level?

describe(data) # Note that Dosage is a factor
##           vars  n  mean   sd median trimmed  mad min max range skew kurtosis
## Dosage*      1 18  1.89 0.76      2    1.88 1.48   1   3     2 0.16    -1.35
## Alertness    2 18 27.67 6.82     27   27.50 8.15  17  41    24 0.25    -1.06
##             se
## Dosage*   0.18
## Alertness 1.61
  # Descriptive for each level of Dosage by using describe.by {psych}
describeBy(data$Alertness, data$Dosage)
## 
##  Descriptive statistics by group 
## group: a
##    vars n mean  sd median trimmed  mad min max range skew kurtosis   se
## X1    1 6 32.5 6.6   32.5    32.5 8.15  24  41    17    0    -1.91 2.69
## ------------------------------------------------------------ 
## group: b
##    vars n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 8 28.25 4.43     28   28.25 4.45  21  35    14 -0.07     -1.3 1.57
## ------------------------------------------------------------ 
## group: c
##    vars n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 4 19.25 1.71   19.5   19.25 1.48  17  21     4 -0.28    -1.96 0.85

Answer: x1 = 32.5; x2 = 28.25; x3 = 19.25

2. Next, let’s model this relationship using dummy coding so first we need to create that pseudovariate.

Remember, dummy coding is most useful when you have a good control group

Also note that there are many different ways of creating pseudovariates in R. You can create them by hand as shown next, or via different packages as I will demonstrate in the subsequent chunk. For this example, let’s make group a our reference group. Write out how the dummy coding would look (i.e. table with 0’s and 1’s)

# Dummy Coding manually in R
  # First we have to pick a level to be the control (i.e. the variable that
  # gets 0 for both codes). Let's use "a" as the control.
  # Because we have 3 levels, we need 2 dummy codes.
    # The ifelse() function says to create a vector called dummyB (dummyC) and
    # give it a 1 when Dosage is "b" ("c") and 0 otherwise.
dummyB <- ifelse(data$Dosage=="b", 1, 0)
dummyC <- ifelse(data$Dosage=="c", 1, 0)

Answer:

3. Now that we have dummy coded our IV, let’s model the effect of different dosages on alertness.

# Now we estimate the model with our two predictors
mod1 <- lm(Alertness ~ dummyB + dummyC, data=data)
summary(mod1)
## 
## Call:
## lm(formula = Alertness ~ dummyB + dummyC, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.500 -2.438  0.250  2.688  8.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   32.500      2.010  16.166 6.72e-11 ***
## dummyB        -4.250      2.659  -1.598 0.130880    
## dummyC       -13.250      3.179  -4.168 0.000824 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared:  0.5396, Adjusted R-squared:  0.4782 
## F-statistic: 8.789 on 2 and 15 DF,  p-value: 0.002977
#alternative coding methods in R
#we could also dummy code in the moment by wrapping the variable in as.factor, by default it will be effects coded
data$Dosage_dummy <- factor(data$Dosage, 
                      levels = c("a","b", "c")) #we will place the group we want as the throw-away group last
contrasts(data$Dosage_dummy) <- contr.treatment(3) #
#interestingly, the same informaiton is now contained within 1 variable -- magic!

#let's try with this other method
mod1 <- lm(Alertness ~ Dosage_dummy, data=data)
summary(mod1) #nice, it's the same!
## 
## Call:
## lm(formula = Alertness ~ Dosage_dummy, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.500 -2.438  0.250  2.688  8.500 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     32.500      2.010  16.166 6.72e-11 ***
## Dosage_dummy2   -4.250      2.659  -1.598 0.130880    
## Dosage_dummy3  -13.250      3.179  -4.168 0.000824 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared:  0.5396, Adjusted R-squared:  0.4782 
## F-statistic: 8.789 on 2 and 15 DF,  p-value: 0.002977
#finally, factors are by default coded as dummy codes so we could also just do it like this:
mod1 <- lm(Alertness ~ as.factor(Dosage), data=data) #wrapoing the variable in as.factor()
summary(mod1) # sAll the same!
## 
## Call:
## lm(formula = Alertness ~ as.factor(Dosage), data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.500 -2.438  0.250  2.688  8.500 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          32.500      2.010  16.166 6.72e-11 ***
## as.factor(Dosage)b   -4.250      2.659  -1.598 0.130880    
## as.factor(Dosage)c  -13.250      3.179  -4.168 0.000824 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared:  0.5396, Adjusted R-squared:  0.4782 
## F-statistic: 8.789 on 2 and 15 DF,  p-value: 0.002977

Answer:

4. Is the intercept significant? How would you interpret it?

Answer: The intercept is sig (p < .001) and is the mean for Alertness at Dosage level “a”

5. Explain what each slope is telling you and if they are significant

Answer:

The slopes are the differences between Dosage level “b” (“c”) and dosage level “a”. Dosage level “c” is significantly different from “a”

6. What about the difference between b and c?

Answer: The difference between group b and c is not available in this analysis

7. How does the model fit?

Answer: The model F-test is included as well and the adj R^2 is .4782 which indicates good model fit

8. Compare the F from this model and an ANOVA, what does this tell you?

aov.mod <- aov(Alertness ~ Dosage, data=data)
summary(aov.mod) # Note the same model F.
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## Dosage       2  426.2  213.12   8.789 0.00298 **
## Residuals   15  363.8   24.25                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Answer: Indeed, they are both 8.789 (p = .00298)

9. Let’s try modeling the same two variables but now using effects Coding. Here we are going to make a the throw-away group. In effects coding, the group you choose to code as -1 is commonly referred to as the throw-away group because it’s information is essentially lost. However, it’s not hard to run another model where we change the throw-away group, and in certain situations we might want to do that. Write out how the effects coding would look (i.e. table with 0’s and -1’s

# Effects Coding
  # This time we want to compare all of the groups to the grand mean.
  # We need two codes because we have three levels.
    # This creates a vector called effectsB (effectsC) and codes a 1 if
    # Dosage is "b", ("c"), -1 if dosage is "a", and otherwise 0.
effectsB <- ifelse(data$Dosage=="b", 1, ifelse(data$Dosage=="a", -1, 0))
effectsC <- ifelse(data$Dosage=="c", 1, ifelse(data$Dosage=="a", -1, 0))

Answer:

10. Estimate the model using effects coding

mod2 <- lm(Alertness ~ effectsB + effectsC, data=data)
summary(mod2)
## 
## Call:
## lm(formula = Alertness ~ effectsB + effectsC, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.500 -2.438  0.250  2.688  8.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   26.667      1.208  22.073 7.52e-13 ***
## effectsB       1.583      1.572   1.007  0.32969    
## effectsC      -7.417      1.866  -3.976  0.00122 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared:  0.5396, Adjusted R-squared:  0.4782 
## F-statistic: 8.789 on 2 and 15 DF,  p-value: 0.002977
#Once again, there are other ways to effects code in R and we could once again use the contrasts function from the contrast package
data$Dosage_eff <- factor(data$Dosage, 
                      levels = c("b", "c", "a")) #we will place the group we want as the throw-away group last
contrasts(data$Dosage_eff) <- contr.sum(3) #

mod2 <- lm(Alertness ~ Dosage_eff, data=data)
summary(mod2) #same output!
## 
## Call:
## lm(formula = Alertness ~ Dosage_eff, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.500 -2.438  0.250  2.688  8.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   26.667      1.208  22.073 7.52e-13 ***
## Dosage_eff1    1.583      1.572   1.007  0.32969    
## Dosage_eff2   -7.417      1.866  -3.976  0.00122 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared:  0.5396, Adjusted R-squared:  0.4782 
## F-statistic: 8.789 on 2 and 15 DF,  p-value: 0.002977

11. Is the intercept significant? How would you interpret it?

Answer: The intercept is sig (p < .0001) and is the unweighted grand mean. It is important to recognize that this is not the mean across all Alertness scores. That is a weighted mean (notice, the different levels have different Ns).

12. Are the slopes significant and how would you interpret them?

Answer: The slope for C is significant (p = .001) and represents the difference betwee group C and the Grand Mean. There is no slope for A. The difference from the Grand Mean for “a” is either not of concern or must be computed by hand.

13. Finally, how does the model fit now?

Answer: Identical fit to when we dummy coded!

14. Finally, let’s give contrast coding a try!

Remember, contrast coding is ideal when we have a hypothesis ahead of time (without looking at the means). Let’s say that before our study we hypothesized that “a” would be highest, “b” would be second, and “c” would be lowest. We need to create a contrast code that reflects this hypothesis. No need to write this one out.

    # This ifelse() function creates a vector where dosage level "a" is coded
    # .5, level "b" is coded 0, and level "c" is coded -.5.
cont.code <- ifelse(data$Dosage=="a", .5, ifelse(data$Dosage=="b", 0, -.5))

#once again, there are other ways to create this pseudovariate 

15. Estimate the model using contrast coding

mod3 <- lm(Alertness ~ cont.code, data=data)
summary(mod3)
## 
## Call:
## lm(formula = Alertness ~ cont.code, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.364 -3.011 -0.250  3.546  8.046 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   26.955      1.174  22.952 1.13e-13 ***
## cont.code     12.818      3.151   4.068 0.000895 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.927 on 16 degrees of freedom
## Multiple R-squared:  0.5084, Adjusted R-squared:  0.4777 
## F-statistic: 16.55 on 1 and 16 DF,  p-value: 0.0008951
      # Note: The model F-test is the same as the t-test for the
      # slope coefficient. This is because we now only have 1 predictor in the
      # model.

Answer:

16. Is the intercept significant and how would we interpret it? Answer: The intercept is sig (p < .0001) and is once again the grand mean

17. What about the other regression coefficients, why do we only have 1 slope now? Answer: We only have 1 slope because with 3 levels contrast compares

18. How would you interpret the slop? Answer: The slope is sig (p = .00008) and is the difference between the mean of group 3 and the means of groups 1 and 2

19. How does the model fit now? Answer: Wow, same as before!