Introduction

In research we often use multiple categories of respondents and then try to account for their differences with a control variable. This simulation is an attempt to show how the control works.

We create data for two animals, Kips and Kins. The data includes age, cuteness and weight. The weight is a function of age and cuteness.

We simulate 200 data points. Approximately half of them are Kips. Age and cuteness are assumed to vary uniformly over the ranges below.

Animal   Age       Cuteness
Kips     1 to 20   1 to 10
Kins     1 to 40   1 to 20
db <- as.data.frame(matrix(0,nrow =200,ncol=4))
names(db) <- c("animal","age","cuteness","weight")

# roughly half of the 200 data points would be kips (code = 0) and the other half would be kins (code = 1)
db$animal <- sample(c(0,1),200,replace=TRUE)

#Kips live up to 20 years, and Kins live up to 40 years. We ensure that our samples are at least 1 year old and we take integer values of age. The first line applies to all animals and the second line overrides the values for Kins only.
db$age <- round(runif(200,min = 1, max = 20),0)
db$age[db$animal == 1] <- round(runif(sum(db$animal),min = 1, max = 40),0)

#Cuteness varies between 1 and 10 for Kips, and between 1 and 20 for Kins
db$cuteness <- round(runif(200,min = 1, max = 10),0)
db$cuteness[db$animal == 1] <- round(runif(sum(db$animal),min = 1, max = 20),0)
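One detail of the rounding approach above is worth checking. Rounding a continuous uniform draw gives the two endpoint values only about half the probability of the interior values, so the integers are not exactly uniform. The sketch below compares it with drawing integers directly using sample(). The seed, the sample size and the names rounded and direct are assumptions made for this illustration and are not part of the original simulation.

```r
set.seed(42)   # assumed seed, only so the comparison is repeatable
n <- 100000    # large sample so the proportions are stable

# Approach used above: round a continuous uniform draw
rounded <- round(runif(n, min = 1, max = 20), 0)

# Alternative: draw the integers 1..20 directly
direct <- sample(1:20, n, replace = TRUE)

prop_rounded <- prop.table(table(rounded))
prop_direct  <- prop.table(table(direct))

# Share of draws at the endpoints (1 and 20) and at an interior value (10)
round(rbind(rounded = prop_rounded[c("1", "10", "20")],
            direct  = prop_direct[c("1", "10", "20")]), 3)
```

For this simulation the difference does not matter much, because the regressions only need age and cuteness to vary. It would matter if exactly uniform integers were required.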

For weight:

#weight depends on both age and cuteness. The effect of age differs between the two animals: the coefficient for age is 2 for Kips and 6 for Kins. The coefficient for cuteness is 10 for both animals. The noise term is uniform within +/-18 for Kips and within +/-90 for Kins.

db$weight <- 2*db$age + 10*db$cuteness + 18*round(sample(c(-1,1),200,replace = TRUE)*runif(200),2)

db$weight[db$animal == 1] <- 6*db$age[db$animal == 1] + 10*db$cuteness[db$animal == 1] + 90* round(sample(c(-1,1),sum(db$animal),replace = TRUE)*runif(sum(db$animal)),2)
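The generation steps above can also be collected into one function with a fixed seed, which makes the simulation repeatable. This is a minimal sketch and not the code that produced the output in this post. The function name simulate_animals and its seed argument are hypothetical, and the random numbers are drawn in a different order, so the values will not match the tables shown here. It relies on runif() accepting a vector for max, and on the fact that a random sign times runif(n) has the same distribution as runif(n, -1, 1).

```r
# Hypothetical helper: same design as above, written in vectorised form
simulate_animals <- function(n = 200, seed = 1) {
  set.seed(seed)                                  # assumed seed for repeatability
  animal <- sample(c(0, 1), n, replace = TRUE)    # 0 = Kips, 1 = Kins
  is_kin <- animal == 1

  # Upper limits depend on the animal: age 20/40, cuteness 10/20
  age      <- round(runif(n, min = 1, max = ifelse(is_kin, 40, 20)), 0)
  cuteness <- round(runif(n, min = 1, max = ifelse(is_kin, 20, 10)), 0)

  # Uniform noise on (-1, 1), scaled by 18 for Kips and 90 for Kins
  noise <- ifelse(is_kin, 90, 18) * round(runif(n, min = -1, max = 1), 2)

  # Age slope is 2 for Kips and 6 for Kins; cuteness slope is 10 for both
  weight <- ifelse(is_kin, 6, 2) * age + 10 * cuteness + noise

  data.frame(animal, age, cuteness, weight)
}

db_sim <- simulate_animals()
head(db_sim)
```

Calling simulate_animals() twice with the same seed returns the same data frame, so any later regression output can be regenerated.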

Let us look at the first few rows of our data.

head(db)
##   animal age cuteness weight
## 1      1  16       14 160.40
## 2      0   5        2  13.98
## 3      1   8       12 245.40
## 4      0   2        4  29.42
## 5      1   9        1  -8.90
## 6      0  13        3  65.90

Kips have a value of 0 in the animal column and Kins have a value of 1.

We will fit separate regression equations for the two animals to check whether we can recover the values of the coefficients.

db1 <- db[db$animal == 0,]
lm1 <- lm(weight~ age + cuteness, data = db1)
db2 <- db[db$animal == 1,]
lm2 <- lm(weight~ age + cuteness, data = db2)

# the R2 of the two regressions
c(summary(lm1)$r.squared,summary(lm2)$r.squared)
## [1] 0.8939976 0.8017351

As you can see from the above, the two regressions have R squared values of about 0.89 (Kips) and 0.80 (Kins), which indicates a reasonably good fit.

And the coefficients….

#coefficients of regression 1 (Kips)
coefficients(lm1)
## (Intercept)         age    cuteness 
##    1.313192    1.908354   10.027225
#coefficients of regression 2 (Kins)
coefficients(lm2)
## (Intercept)         age    cuteness 
##  -24.458124    6.440165   11.461979

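Also, the slope estimates are close to the actual design values: for age, 1.91 against 2 for Kips and 6.44 against 6 for Kins; for cuteness, 10.03 and 11.46 against 10. The Kins estimates are further off because their noise term is five times larger.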
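How close is "close" can be judged with confidence intervals around the estimates. The sketch below regenerates data with the same design and calls confint() on the two separate regressions. The seed and the vectorised data generation are assumptions of this sketch, so the numbers will differ from the output above.

```r
set.seed(1)   # assumed seed; the run shown in this post did not record one
n <- 200
animal <- sample(c(0, 1), n, replace = TRUE)   # 0 = Kips, 1 = Kins
kin <- animal == 1
age      <- round(runif(n, 1, ifelse(kin, 40, 20)), 0)
cuteness <- round(runif(n, 1, ifelse(kin, 20, 10)), 0)
weight   <- ifelse(kin, 6, 2) * age + 10 * cuteness +
  ifelse(kin, 90, 18) * round(runif(n, -1, 1), 2)
db <- data.frame(animal, age, cuteness, weight)

lm1 <- lm(weight ~ age + cuteness, data = db[db$animal == 0, ])   # Kips
lm2 <- lm(weight ~ age + cuteness, data = db[db$animal == 1, ])   # Kins

# 95% confidence intervals for intercept, age and cuteness
ci_kips <- confint(lm1, level = 0.95)
ci_kins <- confint(lm2, level = 0.95)
ci_kips
ci_kins
```

If the design values (2 or 6 for age, 10 for cuteness) fall inside the intervals, the estimates are consistent with the design. The Kins intervals should be noticeably wider because of the larger noise.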

What if we create a common regression equation that includes both Kips and Kins?

Let us start with an equation that we know is wrong because it does not have the control variable.

lm3 <- lm(weight ~ age + cuteness, data = db)
coefficients(lm3)
## (Intercept)         age    cuteness 
##   -45.89621     6.35421    12.60102

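As we can see from above, the pooled coefficients do not match the design values. A single age slope of 6.35 cannot represent both Kips (2) and Kins (6), and the cuteness slope of 12.60 is further from 10 than in either separate regression.

In the next step, let us introduce the control for the animal.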

lm4 <- lm(weight ~ age + cuteness + animal, data = db)
coefficients(lm4)
## (Intercept)         age    cuteness      animal 
##  -44.640990    5.693533   11.241050   38.162566

As we can see, introducing the control gives Kins their own intercept shift (the animal coefficient). The coefficient of cuteness (11.24) is also closer to the actual value (= 10). However, the single slope of age (5.69) still does not reflect the actual impact of age for Kips (2) and Kins (6).

Let us now introduce the interaction terms.

lm5 <- lm(weight ~ age + cuteness + animal + age:animal + cuteness:animal, data = db)
coefficients(lm5)
##     (Intercept)             age        cuteness          animal 
##        1.313192        1.908354       10.027225      -25.771316 
##      age:animal cuteness:animal 
##        4.531811        1.434754

Look at the coefficients of age and cuteness. They are exactly the same as the coefficients for Kips. Scroll up to look at the coefficients of lm1.

We can also get the coefficients for Kins by substituting animal = 1. The coefficient for age then becomes the coefficient for age plus the interaction coefficient of age and animal. Similarly, the coefficient for cuteness becomes the coefficient for cuteness plus the interaction coefficient of cuteness and animal.

c(coefficients(lm5)[2]+coefficients(lm5)[5],coefficients(lm5)[3]+coefficients(lm5)[6])
##       age  cuteness 
##  6.440165 11.461979
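The same substitution also recovers the Kins intercept: it is the intercept plus the coefficient of animal (in the output above, 1.313192 - 25.771316 = -24.458124, the intercept of lm2). The sketch below does the recovery by coefficient name instead of by position and compares the result with the separate Kins regression. The seed and the regenerated data are assumptions of this sketch. The formula (age + cuteness) * animal is R shorthand for the five terms written out in lm5.

```r
set.seed(1)   # assumed seed, only for repeatability
n <- 200
animal <- sample(c(0, 1), n, replace = TRUE)   # 0 = Kips, 1 = Kins
kin <- animal == 1
age      <- round(runif(n, 1, ifelse(kin, 40, 20)), 0)
cuteness <- round(runif(n, 1, ifelse(kin, 20, 10)), 0)
weight   <- ifelse(kin, 6, 2) * age + 10 * cuteness +
  ifelse(kin, 90, 18) * round(runif(n, -1, 1), 2)
db <- data.frame(animal, age, cuteness, weight)

lm2 <- lm(weight ~ age + cuteness, data = db[db$animal == 1, ])   # Kins only
lm5 <- lm(weight ~ (age + cuteness) * animal, data = db)          # pooled, with interactions

b <- coef(lm5)
# Kins coefficients = baseline (Kips) coefficient + matching animal term
kins_from_lm5 <- c(
  "(Intercept)" = unname(b["(Intercept)"] + b["animal"]),
  age           = unname(b["age"] + b["age:animal"]),
  cuteness      = unname(b["cuteness"] + b["cuteness:animal"])
)

kins_from_lm5
coef(lm2)
all.equal(kins_from_lm5, coef(lm2))   # equal up to numerical tolerance
```

Indexing by name keeps the calculation correct even if the order of terms in the formula changes.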

Perfect again. These values are exactly the same as the coefficients for Kins. Scroll up to look at the coefficients of lm2.

Moral of the Story

When we use a control, it is necessary to introduce an interaction term if the category affects the relationship between the predictors and the dependent variable differently.

Description

When we don't introduce the interaction term, we are essentially saying that the slope of the effect is the same for the different categories and only the intercept is different. So, it is like saying that if the Kips and Kins age by one year, the change of weight for both is the same, and only the intercept, that is, the weight at age = 0, is different for the two.

So, if we are willing to make this assumption, the control alone is sufficient and we can leave out the interaction term. However, if our assumption is that the effect could be different, then we must use the interaction term.
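One way to check whether the equal-slope assumption is tenable is to compare the model with only the control against the model with interaction terms. The sketch below uses anova() for a nested-model F test on regenerated data. The seed and the data generation are assumptions of this sketch. The F test also assumes equal error variance in both groups, which this simulation does not satisfy (noise of 18 against 90), so the p-value should be read as a rough guide only.

```r
set.seed(1)   # assumed seed, only for repeatability
n <- 200
animal <- sample(c(0, 1), n, replace = TRUE)   # 0 = Kips, 1 = Kins
kin <- animal == 1
age      <- round(runif(n, 1, ifelse(kin, 40, 20)), 0)
cuteness <- round(runif(n, 1, ifelse(kin, 20, 10)), 0)
weight   <- ifelse(kin, 6, 2) * age + 10 * cuteness +
  ifelse(kin, 90, 18) * round(runif(n, -1, 1), 2)
db <- data.frame(animal, age, cuteness, weight)

lm4 <- lm(weight ~ age + cuteness + animal, data = db)      # control only
lm5 <- lm(weight ~ (age + cuteness) * animal, data = db)    # control + interactions

# Nested-model F test: do the two interaction terms improve the fit?
cmp <- anova(lm4, lm5)
cmp
```

A small p-value in the second row suggests that the slopes differ between the categories and that the interaction terms should stay in the model.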

Important

Here the control is a categorical variable.
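Because the control is categorical, it can also be stored as an R factor instead of a 0/1 number. With two categories the fit is the same and only the coefficient names change. The sketch below shows this on regenerated data. The seed, the column name animal_f and the labels are assumptions made for this illustration.

```r
set.seed(1)   # assumed seed, only for repeatability
n <- 200
animal <- sample(c(0, 1), n, replace = TRUE)   # 0 = Kips, 1 = Kins
kin <- animal == 1
age      <- round(runif(n, 1, ifelse(kin, 40, 20)), 0)
cuteness <- round(runif(n, 1, ifelse(kin, 20, 10)), 0)
weight   <- ifelse(kin, 6, 2) * age + 10 * cuteness +
  ifelse(kin, 90, 18) * round(runif(n, -1, 1), 2)
db <- data.frame(animal, age, cuteness, weight)

# Hypothetical factor version of the control; Kips is the reference level
db$animal_f <- factor(db$animal, levels = c(0, 1), labels = c("Kips", "Kins"))

lm_num <- lm(weight ~ (age + cuteness) * animal, data = db)     # numeric 0/1 control
lm_fac <- lm(weight ~ (age + cuteness) * animal_f, data = db)   # factor control

names(coef(lm_fac))
all.equal(unname(coef(lm_num)), unname(coef(lm_fac)))   # same estimates
```

With more than two categories the factor form is the safer choice, since R then creates one dummy variable per non-reference category rather than treating the codes as a numeric scale.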