Categorical Predictors

Prompt: This week we will explore different ways to code and analyize categorical predictors. We will be using a data set from a study that tested how different dosages of drug x (IV) influenced alertness (DV). This is an experimental design where participants were randomly assigned to different conditions.

Let’s start by loading necessary packages

note how our knitted html did not show this output this was done by adding the include = F argument above

Next let’s load our data

data <- read.csv("HW5data.csv")

#let's peak at our data 
head(data)
##   Dosage Alertness
## 1      a        30
## 2      a        38
## 3      a        35
## 4      a        41
## 5      a        27
## 6      a        24

1. The first thing we want to do is look at some descriptives, what are the means for each level?

describe(data) # Note that Dosage is a factor
##           vars  n  mean   sd median trimmed  mad min max range skew kurtosis
## Dosage*      1 18  1.89 0.76      2    1.88 1.48   1   3     2 0.16    -1.35
## Alertness    2 18 27.67 6.82     27   27.50 8.15  17  41    24 0.25    -1.06
##             se
## Dosage*   0.18
## Alertness 1.61
  # Descriptive for each level of Dosage by using describe.by {psych}
describeBy(data$Alertness, data$Dosage)
## 
##  Descriptive statistics by group 
## group: a
##    vars n mean  sd median trimmed  mad min max range skew kurtosis   se
## X1    1 6 32.5 6.6   32.5    32.5 8.15  24  41    17    0    -1.91 2.69
## ------------------------------------------------------------ 
## group: b
##    vars n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 8 28.25 4.43     28   28.25 4.45  21  35    14 -0.07     -1.3 1.57
## ------------------------------------------------------------ 
## group: c
##    vars n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 4 19.25 1.71   19.5   19.25 1.48  17  21     4 -0.28    -1.96 0.85

Answer: Across dosages, the average alertness level is 27.67. For dosage group A, the average alertness level is 32.5. For dosage group B, the average alertness level is 28.25. For dosage group C, the average alertness level is 19.25.

2. Next, let’s model this relationship using dummy coding. The first we need to create that pseudovariate.

Remember, dummy coding is most useful when you have a good control group

Also note that there are many different ways of creating pseudovariates in R. You can create them by hand as shown next, or via different packages as I will demonstrate in the subsequent chunk. For this example, let’s make group a our reference group. Write out how the dummy coding would look (i.e. table with 0’s and 1’s)

# Dummy Coding manually in R
  # First we have to pick a level to be the control (i.e. the variable that
  # gets 0 for both codes). Let's use "a" as the control.
  # Because we have 3 levels, we need 2 dummy codes.
    # The ifelse() function says to create a vector called dummyB (dummyC) and
    # give it a 1 when Dosage is "b" ("c") and 0 otherwise.
dummyB <- ifelse(data$Dosage=="b", 1, 0)
dummyC <- ifelse(data$Dosage=="c", 1, 0)

Answer: Pseudovariates created using dummy coding.

3. Now that we have dummy coded our IV, let’s model the effect of different dosages on alertness.

# Now we estimate the model with our two predictors
mod1 <- lm(Alertness ~ dummyB + dummyC, data=data)
summary(mod1)
## 
## Call:
## lm(formula = Alertness ~ dummyB + dummyC, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.500 -2.438  0.250  2.688  8.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   32.500      2.010  16.166 6.72e-11 ***
## dummyB        -4.250      2.659  -1.598 0.130880    
## dummyC       -13.250      3.179  -4.168 0.000824 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared:  0.5396, Adjusted R-squared:  0.4782 
## F-statistic: 8.789 on 2 and 15 DF,  p-value: 0.002977
#There are however, alternative coding methods in R
#for instance, we could use the contrasts function
data$Dosage_dummy <- factor(data$Dosage, 
                      levels = c("a","b", "c")) #we will place the group we want as the throw-away group last
contrasts(data$Dosage_dummy) <- contr.treatment(3) #
#interestingly, the same informaiton is now contained within 1 variable -- magic!

#let's try with same model with the pseudovariate we generated with constrasts
mod1 <- lm(Alertness ~ Dosage_dummy, data=data)
summary(mod1) #it's the same!
## 
## Call:
## lm(formula = Alertness ~ Dosage_dummy, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.500 -2.438  0.250  2.688  8.500 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     32.500      2.010  16.166 6.72e-11 ***
## Dosage_dummy2   -4.250      2.659  -1.598 0.130880    
## Dosage_dummy3  -13.250      3.179  -4.168 0.000824 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared:  0.5396, Adjusted R-squared:  0.4782 
## F-statistic: 8.789 on 2 and 15 DF,  p-value: 0.002977
#we could also dummy code within the moment by wrapping the variable in as.factor(), this is becasue R treats factors as dummy codes by default. 
mod1 <- lm(Alertness ~ as.factor(Dosage), data=data) #wrapoing the variable in as.factor()
summary(mod1) # Also the same!
## 
## Call:
## lm(formula = Alertness ~ as.factor(Dosage), data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.500 -2.438  0.250  2.688  8.500 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          32.500      2.010  16.166 6.72e-11 ***
## as.factor(Dosage)b   -4.250      2.659  -1.598 0.130880    
## as.factor(Dosage)c  -13.250      3.179  -4.168 0.000824 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared:  0.5396, Adjusted R-squared:  0.4782 
## F-statistic: 8.789 on 2 and 15 DF,  p-value: 0.002977

Answer: Y-hat = 32.5 - 4.25X1 - 13.25X2

4. Is the intercept significant? How would you interpret it?

Answer: The intercept is the mean of group A (the control group) and is significant. The mean alertness level of group A is 32.5.

5. Explain what each slope is telling you and if they are significant

Answer: Each slope is the mean difference between the group and the comparison group (group A). The average alertness level of Group B is 4.25 units lower than Group A. The average alertness level of Group C is 13.25 units lower than Group A. Only the slope comparing groups C and A is significant (p = .0008)

6. What about the difference between b and c?

Answer: Dummy coding compares groups to a control group which receives 0. The direct comparison is between groups A and B and A and C. Relative to group B, however, the average alertness of group C is lower.

7. How does the model fit?

Answer: Yes, there is model fit F(2, 15) = 8.79, p = .002.

8. Compare the F from this model and an ANOVA, what does this tell you?

aov.mod <- aov(Alertness ~ Dosage, data=data)
summary(aov.mod) # Note the same model F.
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## Dosage       2  426.2  213.12   8.789 0.00298 **
## Residuals   15  363.8   24.25                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Answer: The F-value and p-value are the same derived from the regression model. The two statistical tests yield the same results.

9. Let’s try modeling the same two variables but now using effects Coding. Here we are going to make a the throw-away group. In effects coding, the group you choose to code as -1 is commonly referred to as the throw-away group because it’s information is essentially lost. However, it’s not hard to run another model where we change the throw-away group, and in certain situations we might want to do that. Write out how the effects coding would look (i.e. table with 0’s and -1’s

# Effects Coding
  # This time we want to compare all of the groups to the grand mean.
  # We need two codes because we have three levels.
    # This creates a vector called effectsB (effectsC) and codes a 1 if
    # Dosage is "b", ("c"), -1 if dosage is "a", and otherwise 0.
effectsB <- ifelse(data$Dosage=="b", 1, ifelse(data$Dosage=="a", -1, 0))
effectsC <- ifelse(data$Dosage=="c", 1, ifelse(data$Dosage=="a", -1, 0))

Answer:

Dosage A -1 -1 -1 Dosage B 1 0 0 Dosage C 0 1 0

10. Estimate the model using effects coding

mod2 <- lm(Alertness ~ effectsB + effectsC, data=data)
summary(mod2)
## 
## Call:
## lm(formula = Alertness ~ effectsB + effectsC, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.500 -2.438  0.250  2.688  8.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   26.667      1.208  22.073 7.52e-13 ***
## effectsB       1.583      1.572   1.007  0.32969    
## effectsC      -7.417      1.866  -3.976  0.00122 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared:  0.5396, Adjusted R-squared:  0.4782 
## F-statistic: 8.789 on 2 and 15 DF,  p-value: 0.002977
#Once again, there are other ways to effects code in R and we could once again use the contrasts function from the contrast package
data$Dosage_eff <- factor(data$Dosage, 
                      levels = c("b", "c", "a")) #we will place the group we want as the throw-away group last
contrasts(data$Dosage_eff) <- contr.sum(3) #

mod2 <- lm(Alertness ~ Dosage_eff, data=data)
summary(mod2) #same output!
## 
## Call:
## lm(formula = Alertness ~ Dosage_eff, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.500 -2.438  0.250  2.688  8.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   26.667      1.208  22.073 7.52e-13 ***
## Dosage_eff1    1.583      1.572   1.007  0.32969    
## Dosage_eff2   -7.417      1.866  -3.976  0.00122 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared:  0.5396, Adjusted R-squared:  0.4782 
## F-statistic: 8.789 on 2 and 15 DF,  p-value: 0.002977

11. Is the intercept significant? How would you interpret it

Answer: The intercept is the unweighted mean of the group means and is significant. Across groups, the average level of alertness is 26.67.

12. Are the slopes significant and how would you interpret them?

Answer: Only the slope for group C is significant (b = -7.41, p = .001), which reflects the mean difference in average alertness between group C and the grand mean.

13. Finally, how does the model fit now?

Answer: Model fit is the same, F(2, 15) = 8.78, p = .002.

14. Finally, let’s give contrast coding a try!

Remember, contrast coding is ideal when we have a hypothesis ahead of time (without looking at the means). Let’s say that before our study we hypothesized that “a” would be highest, “b” would be second, and “c” would be lowest. We need to create a contrast code that reflects this hypothesis.

No need to write this one out.

    # This ifelse() function creates a vector where dosage level "a" is coded
    # .5, level "b" is coded 0, and level "c" is coded -.5.
cont.code <- ifelse(data$Dosage=="a", .5, ifelse(data$Dosage=="b", 0, -.5))

#once again, there are other ways to create this pseudovariate 

15. Estimate the model using contrast coding

mod3 <- lm(Alertness ~ cont.code, data=data)
summary(mod3)
## 
## Call:
## lm(formula = Alertness ~ cont.code, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.364 -3.011 -0.250  3.546  8.046 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   26.955      1.174  22.952 1.13e-13 ***
## cont.code     12.818      3.151   4.068 0.000895 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.927 on 16 degrees of freedom
## Multiple R-squared:  0.5084, Adjusted R-squared:  0.4777 
## F-statistic: 16.55 on 1 and 16 DF,  p-value: 0.0008951
      # Note: The model F-test is the same as the t-test for the
      # slope coefficient. This is because we now only have 1 predictor in the
      # model.

Answer: Y-hat = 26.95 + 12.81X

16. Is the intercept significant and how would we interpret it?

Answer: The intercept is significant and represents the unweighted mean of all group means: 26.96 alertness.

17. What about the other regression coefficients, why do we only have 1 slope now?

Answer: The group that received the -.5 code, group C, has dropped from analysis.

18. How would you interpret the slope?

Answer: On average, there is higher alertness in group A relative to group B (b = 12.82, p = .0009).

19. How does the model fit now?

Answer: The model fits well F(1, 16) = 16.55, p = .0009).