---
title: "Week_5_HW_your_name"
output: html_document
---
### Categorical Predictors
Prompt: This week we will explore different ways to code and analyize categorical predictors. We will be using a data set from a study that tested how different dosages of drug x (IV) influenced alertness (DV). This is an experimental design where participants were randomly assigned to different conditions.
Let's start by loading necessary packages
```{r, include =F}
install.packages("contri")
library(psych)
install.packages("contrast")
library(contrast) #we can use this to create our pseudovariates
```
**note how our knitted html did not show this output**
**this was done by adding the include = F argument above**
Next let's load our data
```{r}
data <- read.csv("HW5data.csv")
#let's peak at our data
head(data)
```
*1.* The first thing we want to do is look at some descriptives, what are the means for each level?
```{r}
describe(data) # Note that Dosage is a factor
# Descriptive for each level of Dosage by using describe.by {psych}
describeBy(data$Alertness, data$Dosage)
```
Answer: The mean for alertness is 27.67 and the mean for dosage is 1.89.
*2.* Next, let's model this relationship using dummy coding. The first we need to create that pseudovariate.
**Remember, dummy coding is most useful when you have a good control group**
Also note that there are many different ways of creating pseudovariates in R. You can create them by hand as shown next, or via different packages as I will demonstrate in the subsequent chunk. For this example, let's make group a our reference group. Write out how the dummy coding would look (i.e. table with 0's and 1's)
```{r}
# Dummy Coding manually in R
# First we have to pick a level to be the control (i.e. the variable that
# gets 0 for both codes). Let's use "a" as the control.
# Because we have 3 levels, we need 2 dummy codes.
# The ifelse() function says to create a vector called dummyB (dummyC) and
# give it a 1 when Dosage is "b" ("c") and 0 otherwise.
colnames(data)[1] <- "Dosage"
dummyB <- ifelse(data$Dosage=="b", 1, 0)
dummyC <- ifelse(data$Dosage=="c", 1, 0)
```
Answer:
Dummy Coding
Dosages effect1 effect2 effect3
A 0 0 0
B 0 1 0
C 0 0 1
*3.* Now that we have dummy coded our IV, let's model the effect of different dosages on alertness.
```{r}
# Now we estimate the model with our two predictors
mod1 <- lm(Alertness~ dummyB + dummyC, data = data)
summary(mod1)
#There are however, alternative coding methods in R
#for instance, we could use the contrasts function
data$Dosage_dummy <- factor(data$Dosage,levels ="a","b","c")
#we will place the group we want as the throw-away group last
#
#interestingly, the same informaiton is now contained within 1 variable -- magic!
#let's try with same model with the pseudovariate we generated with constrasts
contrasts(data$Dosage_dummy) <- contr.treatment(3)
#it's the same!
mod1 <- lm(Alertness ~ Dosage_dummy, data=data)
summary(mod1)
#we could also dummy code within the moment by wrapping the variable in as.factor(), this is becasue R treats factors as dummy codes by default.
mod1 <- lm(Alertness ~ as.factor(Dosage), data=data) #wrapoing the variable in as.factor()
summary(mod1) # Also the same!
```
Answer: y= 32.50 + -4.25x + -13.25
*4.* Is the intercept significant? How would you interpret it?
Answer: The intercept is significant as there are asterics next to the intercept value indicating that it is significant. Additionally, the p-value is .0029, which is significant. This indicates that the Y-intercept is significantly different than zero.
*5.* Explain what each slope is telling you and if they are significant
Answer: The slopes for doasage b and dosage c are not significant and are telling me that there is a negative association.
*6.* What about the difference between b and c?
Answer: Dosage c effects alertness more negatively than dosage c.
*7.* How does the model fit?
Answer: The residual standard error is 4.924, indicating that there is a large discrepancy between the data and the regression line, indicating that the model is not a good fit.
*8.* Compare the F from this model and an ANOVA, what does this tell you?
```{r}
aov.mod <- aov(Alertness ~ Dosage, data=data)
summary(aov.mod) # Note the same model F.
```
Answer: This F value us the same and the p-value is the same as the ANOVA. This tells me that the ANOVA yields the same F statistic and P-value as the linear model. Additionally, this tells me that the null hypothesis is not true, because the F-statistic is not close to 1.
*9.* Let's try modeling the same two variables but now using effects Coding. Here we are going to make a the throw-away group. In effects coding, the group you choose to code as -1 is commonly referred to as the throw-away group because it's information is essentially lost. However, it's not hard to run another model where we change the throw-away group, and in certain situations we might want to do that. Write out how the effects coding would look (i.e. table with 0's and -1's
```{r}
# Effects Coding
# This time we want to compare all of the groups to the grand mean.
# We need two codes because we have three levels.
# This creates a vector called effectsB (effectsC) and codes a 1 if
# Dosage is "b", ("c"), -1 if dosage is "a", and otherwise 0.
effectsB <- ifelse(data$Dosage=="b", 1, ifelse(data$Dosage=="a", -1, 0))
effectsC <- ifelse(data$Dosage=="c", 1, ifelse(data$Dosage=="a", -1, 0))
```
Answer:
dosage effect1 effect2 effect3
a -1 0 0
a -1 0 0
a -1 0 0
a -1 0 0
a -1 0 0
a -1 0 0
b 0 1 0
b 0 1 0
b 0 1 0
b 0 1 0
b 0 1 0
b 0 1 0
b 0 1 0
b 0 1 0
c 0 0 1
c 0 0 1
c 0 0 1
c 0 0 1
*10.* Estimate the model using effects coding
```{r}
mod2 <- lm(Alertness ~ effectsB + effectsC, data=data)
summary(mod2)
#Once again, there are other ways to effects code in R and we could once again use the contrasts function from the contrast package
data$Dosage_eff <- factor(data$Dosage,
levels = c("b", "c", "a")) #we will place the group we want as the throw-away group last
contrasts(data$Dosage_eff) <- contr.sum(3) #
mod2 <- lm(Alertness ~ Dosage_eff, data=data)
summary(mod2) #same output!
```
*11.* Is the intercept significant? How would you interpret it
Answer: The intercept is significant because the p-value is 0.0029 and there is an asterisk next to they why intercept indicating it is significant. This indicates that the alertness for dosage is significantly different than zero.
*12.* Are the slopes significant and how would you interpret them?
Answer: The slope for effect 1 is not significant, but the slope for effect 2 is significant. I would interpret this as dosage_effect2 having a significant effect on alertness whereas dosage_effect1 does not.
*13.* Finally, how does the model fit now?
Answer: The statistics indicating the model fit have remained the same.
*14.* Finally, let's give contrast coding a try!
**Remember, contrast coding is ideal when we have a hypothesis ahead of time (without looking at the means). Let's say that before our study we hypothesized that "a" would be highest, "b" would be second, and "c" would be lowest. We need to create a contrast code that reflects this hypothesis.**
No need to write this one out.
```{r}
# This ifelse() function creates a vector where dosage level "a" is coded
# .5, level "b" is coded 0, and level "c" is coded -.5.
cont.code <- ifelse(data$Dosage=="a", .5, ifelse(data$Dosage=="b", 0, -.5))
#once again, there are other ways to create this pseudovariate
```
*15.* Estimate the model using contrast coding
```{r}
mod3 <- lm(Alertness ~ cont.code, data=data)
summary(mod3)
# Note: The model F-test is the same as the t-test for the
# slope coefficient. This is because we now only have 1 predictor in the
# model.
```
Answer: y= 26.955 + 12.818x
*16.* Is the intercept significant and how would we interpret it?
Answer: The intercept is significant as indicated in the output by the asterisk, and by the p-value of 0.00089. This indicates that when the variables are re-centered, the mean for all data points are significantly different than 0.
*17.* What about the other regression coefficients, why do we only have 1 slope now?
Answer: There is only one slope because the contrast coding re-centered the categorical data so that there will be one mean and one slope to represent the categorical data.
*18.* How would you interpret the slope?
Answer: The slope indicates that there is a significant difference between dosage a and dosage b for treatment 3.
*19.* How does the model fit now?
Answer: The residual error and degrees of freedom have increased slightly, and so this still indicates this is not a strong model fit.