Prompt: This week we will explore different ways to code and analyze categorical predictors. We will be using a data set from a study that tested how different dosages of drug x (IV) influenced alertness (DV). This is an experimental design where participants were randomly assigned to different conditions.
Let's start by loading necessary packages
note how our knitted html did not show this output this was done by adding the include = F argument above
Next let's load our data
getwd()
## [1] "C:/Users/AshMi/Desktop/hmwk"
setwd("C:/Users/AshMi/Desktop/hmwk")
getwd()
## [1] "C:/Users/AshMi/Desktop/hmwk"
data <- read.csv("HW5data.csv")
#let's peak at our data
head(data)
## ï..dosage alertness
## 1 a 30
## 2 a 38
## 3 a 35
## 4 a 41
## 5 a 27
## 6 a 24
1. The first thing we want to do is look at some descriptives, what are the means for each level?
describe(data) # Note that Dosage is a factor
## vars n mean sd median trimmed mad min max range skew kurtosis
## ï..dosage* 1 18 1.89 0.76 2 1.88 1.48 1 3 2 0.16 -1.35
## alertness 2 18 27.67 6.82 27 27.50 8.15 17 41 24 0.25 -1.06
## se
## ï..dosage* 0.18
## alertness 1.61
# Descriptive for each level of Dosage by using describe.by {psych}
colnames(data)[1] <- "dosage"
describeBy(data$alertness, data$dosage)
##
## Descriptive statistics by group
## group: a
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 6 32.5 6.6 32.5 32.5 8.15 24 41 17 0 -1.91 2.69
## ------------------------------------------------------------
## group: b
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 8 28.25 4.43 28 28.25 4.45 21 35 14 -0.07 -1.3 1.57
## ------------------------------------------------------------
## group: c
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4 19.25 1.71 19.5 19.25 1.48 17 21 4 -0.28 -1.96 0.85
Answer: Group A has a mean of 32.5, Group B has a mean of 28.25, Group C has a mean of 19.25.
2. Next, let's model this relationship using dummy coding. The first we need to create that pseudovariate.
Remember, dummy coding is most useful when you have a good control group
Also note that there are many different ways of creating pseudovariates in R. You can create them by hand as shown next, or via different packages as I will demonstrate in the subsequent chunk. For this example, let's make group a our reference group. Write out how the dummy coding would look (i.e. table with 0's and 1's)
# Dummy Coding manually in R
# First we have to pick a level to be the control (i.e. the variable that
# gets 0 for both codes). Let's use "a" as the control.
# Because we have 3 levels, we need 2 dummy codes.
# The ifelse() function says to create a vector called dummyB (dummyC) and
# give it a 1 when Dosage is "b" ("c") and 0 otherwise.
dummyB <- ifelse(data$dosage=="b", 1, 0)
dummyC <- ifelse(data$dosage=="c", 1, 0)
Answer: Table: a b c 000000 11111111 1111
3. Now that we have dummy coded our IV, let's model the effect of different dosages on alertness.
# Now we estimate the model with our two predictors
mod1 <- lm(alertness ~ dummyB + dummyC, data=data)
summary(mod1)
##
## Call:
## lm(formula = alertness ~ dummyB + dummyC, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.500 -2.438 0.250 2.688 8.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.500 2.010 16.166 6.72e-11 ***
## dummyB -4.250 2.659 -1.598 0.130880
## dummyC -13.250 3.179 -4.168 0.000824 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared: 0.5396, Adjusted R-squared: 0.4782
## F-statistic: 8.789 on 2 and 15 DF, p-value: 0.002977
#There are however, alternative coding methods in R
#for instance, we could use the contrasts function
data$Dosage_dummy <- factor(data$dosage,
levels = c("a","b", "c"))
#we will place the group we want as the throw-away group last
contrasts(data$Dosage_dummy) <- contr.treatment(3) #
#interestingly, the same informaiton is now contained within 1 variable -- magic!
#let's try with same model with the pseudovariate we generated with constrasts
mod1 <- lm(alertness ~ Dosage_dummy, data=data)
summary(mod1) #it's the same!
##
## Call:
## lm(formula = alertness ~ Dosage_dummy, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.500 -2.438 0.250 2.688 8.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.500 2.010 16.166 6.72e-11 ***
## Dosage_dummy2 -4.250 2.659 -1.598 0.130880
## Dosage_dummy3 -13.250 3.179 -4.168 0.000824 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared: 0.5396, Adjusted R-squared: 0.4782
## F-statistic: 8.789 on 2 and 15 DF, p-value: 0.002977
#we could also dummy code within the moment by wrapping the variable in as.factor(), this is becasue R treats factors as dummy codes by default.
mod1 <- lm(alertness ~ as.factor(dosage), data=data) #wrapoing the variable in as.factor()
summary(mod1) # Also the same!
##
## Call:
## lm(formula = alertness ~ as.factor(dosage), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.500 -2.438 0.250 2.688 8.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.500 2.010 16.166 6.72e-11 ***
## as.factor(dosage)b -4.250 2.659 -1.598 0.130880
## as.factor(dosage)c -13.250 3.179 -4.168 0.000824 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared: 0.5396, Adjusted R-squared: 0.4782
## F-statistic: 8.789 on 2 and 15 DF, p-value: 0.002977
Answer: Dosage C has a significant relationship with alertness, Dosage B is not.
4. Is the intercept significant? How would you interpret it?
Answer: Yes, the intercept is significant. We would interpret it as the mean when X =0, since X is the reference group is 0, the intercept is the mean of the comparison group (dummy coded as 1). In this case, 32.50 is the mean of the treatment groups dosage b and c.
5. Explain what each slope is telling you and if they are significant
Answer: Each slope is telling me the the difference between the groups (reference and comparison) in the model? Dosage b is -4.250 units less on average than Dosage A, holding Dosage C constant, and the relationship is non-significant. Dosage C is -13.250 units less on average than Dosage A, holding Dosage B constant, and is significant.
6. What about the difference between b and c?
Answer: The difference between B and C is 11? This bit confuses me.
7. How does the model fit?
Answer: The F-statistic is 8.79, so the model fit is ok, not great but not bad either.
8. Compare the F from this model and an ANOVA, what does this tell you?
aov.mod <- aov(alertness ~ dosage, data=data)
summary(aov.mod) # Note the same model F.
## Df Sum Sq Mean Sq F value Pr(>F)
## dosage 2 426.2 213.12 8.789 0.00298 **
## Residuals 15 363.8 24.25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Answer: The model F is the same for both. This tells me that ANOVA and regression can be applied in similar ways and are themselves similar?
9. Let's try modeling the same two variables but now using effects Coding. Here we are going to make a the throw-away group. In effects coding, the group you choose to code as -1 is commonly referred to as the throw-away group because it's information is essentially lost. However, it's not hard to run another model where we change the throw-away group, and in certain situations we might want to do that. Write out how the effects coding would look (i.e. table with 0's and -1's
# Effects Coding
# This time we want to compare all of the groups to the grand mean.
# We need two codes because we have three levels.
# This creates a vector called effectsB (effectsC) and codes a 1 if
# Dosage is "b", ("c"), -1 if dosage is "a", and otherwise 0.
effectsB <- ifelse(data$dosage=="b", 1, ifelse(data$dosage=="a", -1, 0))
effectsC <- ifelse(data$dosage=="c", 1, ifelse(data$dosage=="a", -1, 0))
Answer: Table: a b c -1-1-1-1-1-1 11111111 1111
10. Estimate the model using effects coding
mod2 <- lm(alertness ~ effectsB + effectsC, data=data)
summary(mod2)
##
## Call:
## lm(formula = alertness ~ effectsB + effectsC, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.500 -2.438 0.250 2.688 8.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.667 1.208 22.073 7.52e-13 ***
## effectsB 1.583 1.572 1.007 0.32969
## effectsC -7.417 1.866 -3.976 0.00122 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared: 0.5396, Adjusted R-squared: 0.4782
## F-statistic: 8.789 on 2 and 15 DF, p-value: 0.002977
#Once again, there are other ways to effects code in R and we could once again use the contrasts function from the contrast package
data$Dosage_eff <- factor(data$Dosage,
levels = c("b", "c", "a")) #we will place the group we want as the throw-away group last
contrasts(data$Dosage_eff) <- contr.sum(3) #
mod2 <- lm(alertness ~ Dosage_eff, data=data)
summary(mod2) #same output!
##
## Call:
## lm(formula = alertness ~ Dosage_eff, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.500 -2.438 0.250 2.688 8.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.667 1.208 22.073 7.52e-13 ***
## Dosage_eff1 1.583 1.572 1.007 0.32969
## Dosage_eff2 -7.417 1.866 -3.976 0.00122 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared: 0.5396, Adjusted R-squared: 0.4782
## F-statistic: 8.789 on 2 and 15 DF, p-value: 0.002977
11. Is the intercept significant? How would you interpret it
Answer: Yes, the intercept is significant. The intercept is interpreted as the grand unweighted mean of all the groups(in this case 26.66).
12. Are the slopes significant and how would you interpret them?
Answer: One of the slopes (C-Dosage_eff2) is significant. We would interpret it as the difference between a group and the grand mean, for Dosage C that's -7.41.
13. Finally, how does the model fit now?
Answer:The adjusted R squared is exactly the same (0.4782), so still not that great.
14. Finally, let's give contrast coding a try!
Remember, contrast coding is ideal when we have a hypothesis ahead of time (without looking at the means). Let's say that before our study we hypothesized that "a" would be highest, "b" would be second, and "c" would be lowest. We need to create a contrast code that reflects this hypothesis.
No need to write this one out.
# This ifelse() function creates a vector where dosage level "a" is coded
# .5, level "b" is coded 0, and level "c" is coded -.5.
cont.code <- ifelse(data$dosage=="a", .5, ifelse(data$dosage=="b", 0, -.5))
#once again, there are other ways to create this pseudovariate
15. Estimate the model using contrast coding
mod3 <- lm(alertness ~ cont.code, data=data)
summary(mod3)
##
## Call:
## lm(formula = alertness ~ cont.code, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.364 -3.011 -0.250 3.546 8.046
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.955 1.174 22.952 1.13e-13 ***
## cont.code 12.818 3.151 4.068 0.000895 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.927 on 16 degrees of freedom
## Multiple R-squared: 0.5084, Adjusted R-squared: 0.4777
## F-statistic: 16.55 on 1 and 16 DF, p-value: 0.0008951
# Note: The model F-test is the same as the t-test for the
# slope coefficient. This is because we now only have 1 predictor in the
# model.
Answer: Looking at the F-statistic, the model fit (16.55) is good.
16. Is the intercept significant and how would we interpret it?
Answer: Yes, the intercept (26.955) is significant and we interpret it as the mean of all data points in the set.
17. What about the other regression coefficients, why do we only have 1 slope now?
Answer: Because we are essentially comparing two factors (DV alertness to IV dosage-regardless of its multiple levels).
18. How would you interpret the slope?
Answer: The slope (12.818) is the difference between mean of dosage B and C and the grand mean of all the groups.
19. How does the model fit now?
Answer: The F-statistic is 16.55, much improved from previous models.