Week_5_HW_Ashley

Categorical Predictors

Prompt: This week we will explore different ways to code and analyze categorical predictors. We will be using a data set from a study that tested how different dosages of drug x (IV) influenced alertness (DV). This is an experimental design where participants were randomly assigned to different conditions.

Let's start by loading necessary packages

note how our knitted html did not show this output this was done by adding the include = F argument above

Next let's load our data

getwd()

## [1] "C:/Users/AshMi/Desktop/hmwk"

setwd("C:/Users/AshMi/Desktop/hmwk")
getwd()

## [1] "C:/Users/AshMi/Desktop/hmwk"

data <- read.csv("HW5data.csv")

#let's peak at our data 
head(data)

##   ï..dosage alertness
## 1         a        30
## 2         a        38
## 3         a        35
## 4         a        41
## 5         a        27
## 6         a        24

1. The first thing we want to do is look at some descriptives, what are the means for each level?

describe(data) # Note that Dosage is a factor

##            vars  n  mean   sd median trimmed  mad min max range skew kurtosis
## ï..dosage*    1 18  1.89 0.76      2    1.88 1.48   1   3     2 0.16    -1.35
## alertness     2 18 27.67 6.82     27   27.50 8.15  17  41    24 0.25    -1.06
##              se
## ï..dosage* 0.18
## alertness  1.61

  # Descriptive for each level of Dosage by using describe.by {psych}
colnames(data)[1] <- "dosage"
describeBy(data$alertness, data$dosage)

## 
##  Descriptive statistics by group 
## group: a
##    vars n mean  sd median trimmed  mad min max range skew kurtosis   se
## X1    1 6 32.5 6.6   32.5    32.5 8.15  24  41    17    0    -1.91 2.69
## ------------------------------------------------------------ 
## group: b
##    vars n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 8 28.25 4.43     28   28.25 4.45  21  35    14 -0.07     -1.3 1.57
## ------------------------------------------------------------ 
## group: c
##    vars n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 4 19.25 1.71   19.5   19.25 1.48  17  21     4 -0.28    -1.96 0.85

Answer: Group A has a mean of 32.5, Group B has a mean of 28.25, Group C has a mean of 19.25.

2. Next, let's model this relationship using dummy coding. The first we need to create that pseudovariate.

Remember, dummy coding is most useful when you have a good control group

Also note that there are many different ways of creating pseudovariates in R. You can create them by hand as shown next, or via different packages as I will demonstrate in the subsequent chunk. For this example, let's make group a our reference group. Write out how the dummy coding would look (i.e. table with 0's and 1's)

# Dummy Coding manually in R
  # First we have to pick a level to be the control (i.e. the variable that
  # gets 0 for both codes). Let's use "a" as the control.
  # Because we have 3 levels, we need 2 dummy codes.
    # The ifelse() function says to create a vector called dummyB (dummyC) and
    # give it a 1 when Dosage is "b" ("c") and 0 otherwise.
dummyB <- ifelse(data$dosage=="b", 1, 0)
dummyC <- ifelse(data$dosage=="c", 1, 0)

Answer: Table: a b c 000000 11111111 1111

3. Now that we have dummy coded our IV, let's model the effect of different dosages on alertness.

# Now we estimate the model with our two predictors
mod1 <- lm(alertness ~ dummyB + dummyC, data=data)
summary(mod1)

## 
## Call:
## lm(formula = alertness ~ dummyB + dummyC, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.500 -2.438  0.250  2.688  8.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   32.500      2.010  16.166 6.72e-11 ***
## dummyB        -4.250      2.659  -1.598 0.130880    
## dummyC       -13.250      3.179  -4.168 0.000824 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared:  0.5396, Adjusted R-squared:  0.4782 
## F-statistic: 8.789 on 2 and 15 DF,  p-value: 0.002977

#There are however, alternative coding methods in R
#for instance, we could use the contrasts function
data$Dosage_dummy <- factor(data$dosage, 
                      levels = c("a","b", "c")) 
#we will place the group we want as the throw-away group last
contrasts(data$Dosage_dummy) <- contr.treatment(3) #
#interestingly, the same informaiton is now contained within 1 variable -- magic!

#let's try with same model with the pseudovariate we generated with constrasts
mod1 <- lm(alertness ~ Dosage_dummy, data=data)
summary(mod1) #it's the same!

## 
## Call:
## lm(formula = alertness ~ Dosage_dummy, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.500 -2.438  0.250  2.688  8.500 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     32.500      2.010  16.166 6.72e-11 ***
## Dosage_dummy2   -4.250      2.659  -1.598 0.130880    
## Dosage_dummy3  -13.250      3.179  -4.168 0.000824 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared:  0.5396, Adjusted R-squared:  0.4782 
## F-statistic: 8.789 on 2 and 15 DF,  p-value: 0.002977

#we could also dummy code within the moment by wrapping the variable in as.factor(), this is becasue R treats factors as dummy codes by default. 
mod1 <- lm(alertness ~ as.factor(dosage), data=data) #wrapoing the variable in as.factor()
summary(mod1) # Also the same!

## 
## Call:
## lm(formula = alertness ~ as.factor(dosage), data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.500 -2.438  0.250  2.688  8.500 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          32.500      2.010  16.166 6.72e-11 ***
## as.factor(dosage)b   -4.250      2.659  -1.598 0.130880    
## as.factor(dosage)c  -13.250      3.179  -4.168 0.000824 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared:  0.5396, Adjusted R-squared:  0.4782 
## F-statistic: 8.789 on 2 and 15 DF,  p-value: 0.002977

Answer: Dosage C has a significant relationship with alertness, Dosage B is not.

4. Is the intercept significant? How would you interpret it?

Answer: Yes, the intercept is significant. We would interpret it as the mean when X =0, since X is the reference group is 0, the intercept is the mean of the comparison group (dummy coded as 1). In this case, 32.50 is the mean of the treatment groups dosage b and c.

5. Explain what each slope is telling you and if they are significant

Answer: Each slope is telling me the the difference between the groups (reference and comparison) in the model? Dosage b is -4.250 units less on average than Dosage A, holding Dosage C constant, and the relationship is non-significant. Dosage C is -13.250 units less on average than Dosage A, holding Dosage B constant, and is significant.

6. What about the difference between b and c?

Answer: The difference between B and C is 11? This bit confuses me.

7. How does the model fit?

Answer: The F-statistic is 8.79, so the model fit is ok, not great but not bad either.

8. Compare the F from this model and an ANOVA, what does this tell you?

aov.mod <- aov(alertness ~ dosage, data=data)
summary(aov.mod) # Note the same model F.

##             Df Sum Sq Mean Sq F value  Pr(>F)   
## dosage       2  426.2  213.12   8.789 0.00298 **
## Residuals   15  363.8   24.25                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Answer: The model F is the same for both. This tells me that ANOVA and regression can be applied in similar ways and are themselves similar?

9. Let's try modeling the same two variables but now using effects Coding. Here we are going to make a the throw-away group. In effects coding, the group you choose to code as -1 is commonly referred to as the throw-away group because it's information is essentially lost. However, it's not hard to run another model where we change the throw-away group, and in certain situations we might want to do that. Write out how the effects coding would look (i.e. table with 0's and -1's

# Effects Coding
  # This time we want to compare all of the groups to the grand mean.
  # We need two codes because we have three levels.
    # This creates a vector called effectsB (effectsC) and codes a 1 if
    # Dosage is "b", ("c"), -1 if dosage is "a", and otherwise 0.
effectsB <- ifelse(data$dosage=="b", 1, ifelse(data$dosage=="a", -1, 0))
effectsC <- ifelse(data$dosage=="c", 1, ifelse(data$dosage=="a", -1, 0))

Answer: Table: a b c -1-1-1-1-1-1 11111111 1111

10. Estimate the model using effects coding

mod2 <- lm(alertness ~ effectsB + effectsC, data=data)
summary(mod2)

## 
## Call:
## lm(formula = alertness ~ effectsB + effectsC, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.500 -2.438  0.250  2.688  8.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   26.667      1.208  22.073 7.52e-13 ***
## effectsB       1.583      1.572   1.007  0.32969    
## effectsC      -7.417      1.866  -3.976  0.00122 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared:  0.5396, Adjusted R-squared:  0.4782 
## F-statistic: 8.789 on 2 and 15 DF,  p-value: 0.002977

#Once again, there are other ways to effects code in R and we could once again use the contrasts function from the contrast package
data$Dosage_eff <- factor(data$Dosage, 
                      levels = c("b", "c", "a")) #we will place the group we want as the throw-away group last
contrasts(data$Dosage_eff) <- contr.sum(3) #

mod2 <- lm(alertness ~ Dosage_eff, data=data)
summary(mod2) #same output!

## 
## Call:
## lm(formula = alertness ~ Dosage_eff, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.500 -2.438  0.250  2.688  8.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   26.667      1.208  22.073 7.52e-13 ***
## Dosage_eff1    1.583      1.572   1.007  0.32969    
## Dosage_eff2   -7.417      1.866  -3.976  0.00122 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.924 on 15 degrees of freedom
## Multiple R-squared:  0.5396, Adjusted R-squared:  0.4782 
## F-statistic: 8.789 on 2 and 15 DF,  p-value: 0.002977

11. Is the intercept significant? How would you interpret it

Answer: Yes, the intercept is significant. The intercept is interpreted as the grand unweighted mean of all the groups(in this case 26.66).

12. Are the slopes significant and how would you interpret them?

Answer: One of the slopes (C-Dosage_eff2) is significant. We would interpret it as the difference between a group and the grand mean, for Dosage C that's -7.41.

13. Finally, how does the model fit now?

Answer:The adjusted R squared is exactly the same (0.4782), so still not that great.

14. Finally, let's give contrast coding a try!

Remember, contrast coding is ideal when we have a hypothesis ahead of time (without looking at the means). Let's say that before our study we hypothesized that "a" would be highest, "b" would be second, and "c" would be lowest. We need to create a contrast code that reflects this hypothesis.

No need to write this one out.

    # This ifelse() function creates a vector where dosage level "a" is coded
    # .5, level "b" is coded 0, and level "c" is coded -.5.
cont.code <- ifelse(data$dosage=="a", .5, ifelse(data$dosage=="b", 0, -.5))

#once again, there are other ways to create this pseudovariate

15. Estimate the model using contrast coding

mod3 <- lm(alertness ~ cont.code, data=data)
summary(mod3)

## 
## Call:
## lm(formula = alertness ~ cont.code, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.364 -3.011 -0.250  3.546  8.046 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   26.955      1.174  22.952 1.13e-13 ***
## cont.code     12.818      3.151   4.068 0.000895 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.927 on 16 degrees of freedom
## Multiple R-squared:  0.5084, Adjusted R-squared:  0.4777 
## F-statistic: 16.55 on 1 and 16 DF,  p-value: 0.0008951

      # Note: The model F-test is the same as the t-test for the
      # slope coefficient. This is because we now only have 1 predictor in the
      # model.

Answer: Looking at the F-statistic, the model fit (16.55) is good.

16. Is the intercept significant and how would we interpret it?

Answer: Yes, the intercept (26.955) is significant and we interpret it as the mean of all data points in the set.

17. What about the other regression coefficients, why do we only have 1 slope now?

Answer: Because we are essentially comparing two factors (DV alertness to IV dosage-regardless of its multiple levels).

18. How would you interpret the slope?

Answer: The slope (12.818) is the difference between mean of dosage B and C and the grand mean of all the groups.

19. How does the model fit now?

Answer: The F-statistic is 16.55, much improved from previous models.

Week_5_HW_Ashley_Miller

Categorical Predictors