Lab 4 - Linear Regression Contd.

library(tidyverse)
library(ggplot2)
#library(summarytools)
library(jtools)
library(naniar)
library(fastDummies)
library(knitr)
library(huxtable)

In this lab we will try to use the csiw dataset but with different variable to understand how this analysis is to be done.

Let us load the required libraries first

Now we can load the dataset

csiw <- read.csv("~/Downloads/csiw07.csv")

Let us see how this dataset looks

summary(csiw, plain.ascii = FALSE)

##      school          teacher            id            group       
##  Min.   : 1.000   Min.   : 1.00   Min.   : 1.00   Min.   :-9.000  
##  1st Qu.: 1.000   1st Qu.: 4.00   1st Qu.:20.00   1st Qu.: 1.000  
##  Median : 2.000   Median : 8.00   Median :35.00   Median : 2.000  
##  Mean   : 4.509   Mean   :10.91   Mean   :32.12   Mean   : 1.514  
##  3rd Qu.: 9.000   3rd Qu.:18.00   3rd Qu.:44.00   3rd Qu.: 4.000  
##  Max.   :13.000   Max.   :55.00   Max.   :61.00   Max.   : 4.000  
##                                                                   
##     treatmt           cch1              cch2             ccpt1       
##  Min.   :1.000   Min.   :-9.0000   Min.   :-9.0000   Min.   :-9.000  
##  1st Qu.:1.000   1st Qu.: 0.0000   1st Qu.: 0.5000   1st Qu.: 0.000  
##  Median :1.000   Median : 1.0000   Median : 1.0000   Median : 5.000  
##  Mean   :1.302   Mean   :-0.8305   Mean   :-0.4619   Mean   : 3.101  
##  3rd Qu.:2.000   3rd Qu.: 2.0000   3rd Qu.: 2.0000   3rd Qu.: 8.000  
##  Max.   :2.000   Max.   : 3.0000   Max.   : 3.0000   Max.   :16.000  
##                                                                      
##      ccpt2           ccprod1         ccprod2           ccrdr1      
##  Min.   :-9.000   Min.   :-9.00   Min.   :-9.000   Min.   :-9.000  
##  1st Qu.: 3.000   1st Qu.: 0.00   1st Qu.: 1.000   1st Qu.: 0.000  
##  Median : 7.000   Median : 2.00   Median : 4.000   Median : 0.000  
##  Mean   : 5.098   Mean   : 1.41   Mean   : 2.143   Mean   :-1.052  
##  3rd Qu.:11.000   3rd Qu.: 5.00   3rd Qu.: 6.000   3rd Qu.: 1.000  
##  Max.   :15.000   Max.   :17.00   Max.   :19.000   Max.   : 8.000  
##                                                                    
##      ccrdr2            grade           hotistdum     
##  Min.   :-9.0000   Min.   :-9.0000   Min.   :0.0000  
##  1st Qu.: 0.0000   1st Qu.: 1.0000   1st Qu.:0.0000  
##  Median : 0.0000   Median : 1.0000   Median :1.0000  
##  Mean   : 0.1646   Mean   : 0.4423   Mean   :0.6196  
##  3rd Qu.: 2.0000   3rd Qu.: 2.0000   3rd Qu.:1.0000  
##  Max.   :12.0000   Max.   : 2.0000   Max.   :1.0000  
##                                      NA's   :81

Keeping in line with the assignment structure we will choose the following variables

Treatment variable (binary) - treatmt
Achievement level (4 categories and extra category for missing data) - group
Pretest variable - ccrdr1
Post test variable - ccrdr2
Grade (1=Grade 4, 2= Grade 5) - grade

Now let us subset this information into a dataset and analyze

csiw <- csiw %>% dplyr::mutate(treat=treatmt, pretest = ccrdr1, posttest=ccrdr2, grade =grade)
csiw <- csiw %>% dplyr::select(treat, pretest, posttest, grade, group)
head(csiw)

treat	pretest	posttest	grade	group
1	1	-9	1	3
1	-9	-9	1	1
1	0	6	1	1
1	3	0	1	1
1	0	0	1	2
1	0	-9	1	3

We have two recoding tasks ahead of us.

Recoding achievement levels in the group variable. But this is essentially creating dummy variables which is a little bit different than simple recoding. For creation of dummies we will use a convenient R package called fastDummies.
Recoding grade level in the grade variable. For this we will use the recode function in dplyr package.

How does the achievement level variable actuall look

table(csiw$group)

## 
##  -9   1   2   3   4 
##  40  84  82  76 125

Other than the four categories of relevant levels, we also have a -9 for the missing data. While there are many ways to deal with missing data such as mean/median/mode substitution or multiple imputation techniques among others. In our analysis, we will simply drop the data under the assumption of Missing at Random. It simple means that once we control for the relevant confounders, the data can be assumed to be missing at random and hence dropping those observations will not create bias in our estimates.

Therefore, we will replace the -9 with NAs. We will be using a powerful package called naniar to deal with missing values.

#install.packages("naniar")
library(naniar)

csiw <- csiw %>% replace_with_na(replace = list(group = -9))

We can also do this for the remaning variables in the dataset

table(csiw$grade)

## 
##  -9   1   2 
##  40 194 173

csiw <- csiw %>% replace_with_na(replace=list(pretest=-9,posttest=-9, grade=-9))

Now that we have dealt with the missing data, let us move onto the recoding process for group and grade. But remember…

Variable to be recoded has to be a Factor Variable.

Are our variables of interest in the factor type?

class(csiw$treat)

## [1] "integer"

class(csiw$group)

## [1] "integer"

class(csiw$grade)

## [1] "integer"

table(csiw$treat)

## 
##   1   2 
## 284 123

table(csiw$group)

## 
##   1   2   3   4 
##  84  82  76 125

table(csiw$grade)

## 
##   1   2 
## 194 173

So let us convert them to factors first Good practice to convert categorical variables as factors treatment needs to be considered numeric when running regression, but other covariates can remain as factors

csiw$treat <- as.factor(csiw$treat)
table(csiw$treat)

## 
##   1   2 
## 284 123

# convert into an ordered factor variable into unordered
csiw$treat <- factor(as.numeric(as.character(csiw$treat)), ordered=FALSE)
table(csiw$treat)

## 
##   1   2 
## 284 123

csiw$group <- as.factor(csiw$group)
csiw$grade <- as.factor(csiw$grade)

Since we need treat vaiable in the 1 or 0 form we will recode this variable

csiw$treat <- recode(csiw$treat, "1"="1","2"="0")
csiw$treat <- as.character(csiw$treat)
csiw$treat <- as.numeric(csiw$treat)
table(csiw$treat)

## 
##   0   1 
## 123 284

For the group variable we need to create dummies as below

# lets us first recode the entries in the group column for ease of use once we create dummies. 
# We will see why....
colnames(csiw)

## [1] "treat"    "pretest"  "posttest" "grade"    "group"

csiw$group <- recode(csiw$group, "1"="high","2"= "average", "3"="low", "4"="learndis")
table(csiw$group)

## 
##     high  average      low learndis 
##       84       82       76      125

Now we can create the required dummies

csiw <- dummy_cols(csiw,select_columns = "group")
colnames(csiw)

##  [1] "treat"          "pretest"        "posttest"       "grade"         
##  [5] "group"          "group_low"      "group_high"     "group_average" 
##  [9] "group_NA"       "group_learndis"

Similarly, let us now work with the grade variable

table(csiw$grade)

## 
##   1   2 
## 194 173

Now we have to create one dummy for use in this analysis, as before, we can create two dummies for each category but use only one in the analysis.

csiw

treat	pretest	posttest	grade	group	group_low	group_high	group_average	group_NA	group_learndis
1	1		1	low	1	0	0	0	0
1			1	high	0	1	0	0	0
1	0	6	1	high	0	1	0	0	0
1	3	0	1	high	0	1	0	0	0
1	0	0	1	average	0	0	1	0	0
1	0		1	low	1	0	0	0	0
1	0	0	1	low	1	0	0	0	0
1	0	1	1	low	1	0	0	0	0
1	0	0	1	low	1	0	0	0	0
1	0	5	1	average	0	0	1	0	0
1	0		1	high	0	1	0	0	0
1	0	7	1	average	0	0	1	0	0
1	0		1	low	1	0	0	0	0
1	0	3	1	low	1	0	0	0	0
1	3	1	1	low	1	0	0	0	0
1	0	10	1	average	0	0	1	0	0
1	0	9	1	low	1	0	0	0	0
1	0	0	1	low	1	0	0	0	0
1	6	9	1	average	0	0	1	0	0
1	0		1	average	0	0	1	0	0
1	0	3	1	average	0	0	1	0	0
1	1	2	1	low	1	0	0	0	0
1	0	11	1	high	0	1	0	0	0
1	2	7	1	high	0	1	0	0	0
1	0	8	1	average	0	0	1	0	0
1	1	10	1	average	0	0	1	0	0
1	4	11	1	average	0	0	1	0	0
1	0	0	1		0	0	0	1	0
1	3	7	1	high	0	1	0	0	0
1	0	1	1		0	0	0	1	0
1		10	1	high	0	1	0	0	0
1	1		1		0	0	0	1	0
1	1	0	1	average	0	0	1	0	0
1	0	0	2	high	0	1	0	0	0
1	0	10	2	average	0	0	1	0	0
1	0	0	2	high	0	1	0	0	0
1	2	1	1	high	0	1	0	0	0
1	1		1	high	0	1	0	0	0
1	0	0	1	low	1	0	0	0	0
1	0	0	1	low	1	0	0	0	0
1	1	2	2	high	0	1	0	0	0
1	1	1	2	low	1	0	0	0	0
1	1	1	1	low	1	0	0	0	0
1	1	2	1	high	0	1	0	0	0
1	4	9	2	high	0	1	0	0	0
1	2	1	2	average	0	0	1	0	0
1	0	0	1	high	0	1	0	0	0
1	0	1	1	average	0	0	1	0	0
1	0	2	2	high	0	1	0	0	0
1	0	2	1	average	0	0	1	0	0
1	0	6	1	average	0	0	1	0	0
1	1	0	1	high	0	1	0	0	0
1	0	0	2	average	0	0	1	0	0
1	3	2	2	average	0	0	1	0	0
1	0	0	1		0	0	0	1	0
1		0			0	0	0	1	0
1	0		2	high	0	1	0	0	0
1	0	5	2	low	1	0	0	0	0
1	0	1	2	high	0	1	0	0	0
1	1	2	2	high	0	1	0	0	0
1	0	1	2	average	0	0	1	0	0
1	0	3	2	high	0	1	0	0	0
1	0	2	2	high	0	1	0	0	0
1	0	1	2	high	0	1	0	0	0
1	0		2	high	0	1	0	0	0
1	0	4	2	low	1	0	0	0	0
1	0	6	2	average	0	0	1	0	0
1	0	2	2	average	0	0	1	0	0
1	1	0	2	low	1	0	0	0	0
1	0	1	2	average	0	0	1	0	0
1	0	1	2	average	0	0	1	0	0
1	0	3	2	average	0	0	1	0	0
1	1		2	low	1	0	0	0	0
1	0		2	high	0	1	0	0	0
1	0	1	2	low	1	0	0	0	0
1	0		2	average	0	0	1	0	0
1	0	1	2	average	0	0	1	0	0
1	0	2	2	low	1	0	0	0	0
1	0	1	2	low	1	0	0	0	0
1		3	2		0	0	0	1	0
1	2	0	2	average	0	0	1	0	0
1	0	1	2	average	0	0	1	0	0
1	1	0	2	low	1	0	0	0	0
1	3		2	low	1	0	0	0	0
1	1	2	2	high	0	1	0	0	0
1	1		2	low	1	0	0	0	0
1	0	0	2	high	0	1	0	0	0
1	0	1	2	average	0	0	1	0	0
1	0	2	2	low	1	0	0	0	0
1	0	11	2	low	1	0	0	0	0
1	1	3	2	average	0	0	1	0	0
1	0		2	low	1	0	0	0	0
1	1	0	2	average	0	0	1	0	0
1	0	0	2	average	0	0	1	0	0
1			2	low	1	0	0	0	0
1	2	1	2	high	0	1	0	0	0
1	3	2	2	average	0	0	1	0	0
1	0	6	2	high	0	1	0	0	0
1	0	0	2	low	1	0	0	0	0
1	0	0	2	average	0	0	1	0	0
1		0	2		0	0	0	1	0
1		1	2		0	0	0	1	0
1		0	2		0	0	0	1	0
1		1	2		0	0	0	1	0
1			2		0	0	0	1	0
1	1	2	2	learndis	0	0	0	0	1
1	0		2	learndis	0	0	0	0	1
1	0	0	1	learndis	0	0	0	0	1
1	1		2	learndis	0	0	0	0	1
1	0	0	1	learndis	0	0	0	0	1
1	0	0	2	learndis	0	0	0	0	1
1		1		learndis	0	0	0	0	1
1		0	1	learndis	0	0	0	0	1
1		0		learndis	0	0	0	0	1
1		0	1	learndis	0	0	0	0	1
0			2	learndis	0	0	0	0	1
1	2	10	2	high	0	1	0	0	0
1	2	3	2	average	0	0	1	0	0
1	0		2	learndis	0	0	0	0	1
1	0	11	2	high	0	1	0	0	0
1	2	1	2	low	1	0	0	0	0
1	0	1	2	high	0	1	0	0	0
1	0	10	2	low	1	0	0	0	0
1	0	0	2		0	0	0	1	0
1	0	11	2	high	0	1	0	0	0
1	4	2	2	average	0	0	1	0	0
1	0	2	1	average	0	0	1	0	0
1	2		2		0	0	0	1	0
1	1	0	2	low	1	0	0	0	0
1	2	2	2	average	0	0	1	0	0
1	2	3	2	high	0	1	0	0	0
1	0	9	2	high	0	1	0	0	0
1	2	1	2	low	1	0	0	0	0
1	0	2	2	low	1	0	0	0	0
1	4	12	2	average	0	0	1	0	0
1	2	10	2	high	0	1	0	0	0
1	0	5	1	high	0	1	0	0	0
1	4	1	2	average	0	0	1	0	0
1	0		1	high	0	1	0	0	0
1	1		1	low	1	0	0	0	0
1	0	9	1	low	1	0	0	0	0
1	1	4	1	high	0	1	0	0	0
1		10	1	high	0	1	0	0	0
1	4	2	1	low	1	0	0	0	0
1	3		1	average	0	0	1	0	0
1	3	10	1		0	0	0	1	0
1	1	7	1	high	0	1	0	0	0
1	1	1	1	average	0	0	1	0	0
1	1	10	1	high	0	1	0	0	0
1	0	4	1	average	0	0	1	0	0
1	1		1	low	1	0	0	0	0
1	4	1	1	high	0	1	0	0	0
1	1	3	1	low	1	0	0	0	0
1	2	7	1	high	0	1	0	0	0
1	0		1	low	1	0	0	0	0
1	3	3	1	average	0	0	1	0	0
1	2	1	1	low	1	0	0	0	0
1	1		1	low	1	0	0	0	0
1	0	3	1	low	1	0	0	0	0
1	5	10	1	high	0	1	0	0	0
1	3	1	1	average	0	0	1	0	0
1	3	4	1	high	0	1	0	0	0
1	4	10	1	high	0	1	0	0	0
1	4	9	1	high	0	1	0	0	0
1	3	2	1	low	1	0	0	0	0
1	2	10	1	average	0	0	1	0	0
1		9	1	average	0	0	1	0	0
1		1	1		0	0	0	1	0
1	0	11	2	high	0	1	0	0	0
1	0	0	2	high	0	1	0	0	0
1	1		2	high	0	1	0	0	0
1	1	12	2	high	0	1	0	0	0
1	1	12	2	high	0	1	0	0	0
1	3	7	2	high	0	1	0	0	0
1	0	3	2	high	0	1	0	0	0
1	3	2	2	average	0	0	1	0	0
1	0	5	2	high	0	1	0	0	0
1	2	5	2	average	0	0	1	0	0
1	1	12	2	low	1	0	0	0	0
1			2		0	0	0	1	0
1	1	5	2	high	0	1	0	0	0
1	0	1	2	average	0	0	1	0	0
1	0	7	2	high	0	1	0	0	0
1	0	12	2	high	0	1	0	0	0
1	0	4	2	high	0	1	0	0	0
1	0	8	2	average	0	0	1	0	0
1	0	9	2	high	0	1	0	0	0
1	1	12	2	high	0	1	0	0	0
1	0	9	2	high	0	1	0	0	0
1	7	5	2	average	0	0	1	0	0
1	0	6	2	average	0	0	1	0	0
1	0	7	2	high	0	1	0	0	0
1		7	2		0	0	0	1	0
1		6	2		0	0	0	1	0
1	0	10	1	high	0	1	0	0	0
1	1	0	1	average	0	0	1	0	0
1	1	9	1	average	0	0	1	0	0
1	1	0	1	high	0	1	0	0	0
1	0		1		0	0	0	1	0
1	6	0	1		0	0	0	1	0
1	2	0	1	low	1	0	0	0	0
1	2	3	1	low	1	0	0	0	0
1	0	1	1	low	1	0	0	0	0
1	1	10	1	low	1	0	0	0	0
1	0	2	1	average	0	0	1	0	0
1	0	2	1	low	1	0	0	0	0
1	0	9	1	high	0	1	0	0	0
1	8	6	1	average	0	0	1	0	0
1	0	12	1	high	0	1	0	0	0
1	2		1		0	0	0	1	0
1	0	0	1	low	1	0	0	0	0
1	0	2	1	high	0	1	0	0	0
1	3	3	1	high	0	1	0	0	0
1	1	10	1	average	0	0	1	0	0
1		0	1	low	1	0	0	0	0
1	0		1	low	1	0	0	0	0
1	2		1		0	0	0	1	0
1	2	3	1	average	0	0	1	0	0
1		1	1		0	0	0	1	0
1	0	1	2	learndis	0	0	0	0	1
1	0	1	1	learndis	0	0	0	0	1
1	2		2	learndis	0	0	0	0	1
1		0		learndis	0	0	0	0	1
1				learndis	0	0	0	0	1
1	0		2	learndis	0	0	0	0	1
1	0			learndis	0	0	0	0	1
1	0	0	2	learndis	0	0	0	0	1
1	0	0	1	learndis	0	0	0	0	1
1	0	0	1	learndis	0	0	0	0	1
1	0	2	2	learndis	0	0	0	0	1
1	1			learndis	0	0	0	0	1
1		1		learndis	0	0	0	0	1
1		0		learndis	0	0	0	0	1
1	0	0	1	learndis	0	0	0	0	1
1	0			learndis	0	0	0	0	1
1	1	1	1	learndis	0	0	0	0	1
1	0			learndis	0	0	0	0	1
1		5	2	learndis	0	0	0	0	1
1	0	6	1	learndis	0	0	0	0	1
1	0		2	learndis	0	0	0	0	1
1	0	0	2	learndis	0	0	0	0	1
1	0	0	2	learndis	0	0	0	0	1
1		0	1	learndis	0	0	0	0	1
1		2	2	learndis	0	0	0	0	1
1		1	1	learndis	0	0	0	0	1
1	0	1	1	learndis	0	0	0	0	1
1	0	1	1	learndis	0	0	0	0	1
1			2	learndis	0	0	0	0	1
1	1	0	1	learndis	0	0	0	0	1
1	0			learndis	0	0	0	0	1
1	0	0		learndis	0	0	0	0	1
1	0			learndis	0	0	0	0	1
1		0		learndis	0	0	0	0	1
1	0	0	1	learndis	0	0	0	0	1
1	0	1	1	learndis	0	0	0	0	1
1	0	0	1	learndis	0	0	0	0	1
1	0	0	2	learndis	0	0	0	0	1
1			2	learndis	0	0	0	0	1
1	0	1	1	learndis	0	0	0	0	1
1	0	0	1	learndis	0	0	0	0	1
1	1	1	2	learndis	0	0	0	0	1
1		0		learndis	0	0	0	0	1
1		0		learndis	0	0	0	0	1
1		0		learndis	0	0	0	0	1
1		0		learndis	0	0	0	0	1
1	0	1	1	learndis	0	0	0	0	1
1	0	0	1	learndis	0	0	0	0	1
1	0	0	1	learndis	0	0	0	0	1
1		1		learndis	0	0	0	0	1
1		0		learndis	0	0	0	0	1
1	0		2	learndis	0	0	0	0	1
1	0		1	learndis	0	0	0	0	1
1	0	0	1	learndis	0	0	0	0	1
1	0	0		learndis	0	0	0	0	1
1	0	0		learndis	0	0	0	0	1
1	0	0	1	learndis	0	0	0	0	1
1		0		learndis	0	0	0	0	1
1		0		learndis	0	0	0	0	1
1	1	2	2	learndis	0	0	0	0	1
1	0		2	learndis	0	0	0	0	1
1	0	0	1	learndis	0	0	0	0	1
1	0	0	1	learndis	0	0	0	0	1
1	0	0	1	learndis	0	0	0	0	1
1		0		learndis	0	0	0	0	1
1		0		learndis	0	0	0	0	1
0	1	3	1	high	0	1	0	0	0
0	3	1	1	low	1	0	0	0	0
0	0	0	1	average	0	0	1	0	0
0	1	1	1	high	0	1	0	0	0
0		8	1	high	0	1	0	0	0
0	3	3	1	average	0	0	1	0	0
0	0		1	low	1	0	0	0	0
0	1	1	1	low	1	0	0	0	0
0		0	1	average	0	0	1	0	0
0	0	0	1	average	0	0	1	0	0
0	1	1	1	low	1	0	0	0	0
0	4	4	1	low	1	0	0	0	0
0	1	0	1	high	0	1	0	0	0
0	2	4	1	low	1	0	0	0	0
0	6	2	1	average	0	0	1	0	0
0	4	0	1	average	0	0	1	0	0
0	0	2	1	high	0	1	0	0	0
0	1	2	1	low	1	0	0	0	0
0			1	low	1	0	0	0	0
0	0		1	average	0	0	1	0	0
0			1		0	0	0	1	0
0			1		0	0	0	1	0
0	2		1	low	1	0	0	0	0
0			1		0	0	0	1	0
0	1		1		0	0	0	1	0
0	1	0	2	high	0	1	0	0	0
0			2		0	0	0	1	0
0	0	0	2	low	1	0	0	0	0
0			2		0	0	0	1	0
0	0	3	2	average	0	0	1	0	0
0	1	0	2	low	1	0	0	0	0
0	0	3	2	low	1	0	0	0	0
0	0	3	2	low	1	0	0	0	0
0	2		2	high	0	1	0	0	0
0	1	0	2	low	1	0	0	0	0
0		3	2	high	0	1	0	0	0
0			2		0	0	0	1	0
0	0	0	2	average	0	0	1	0	0
0	0	0	2	average	0	0	1	0	0
0	0	0	2	low	1	0	0	0	0
0	0	0	2	low	1	0	0	0	0
0	0	0	2	high	0	1	0	0	0
0	0	0	2	average	0	0	1	0	0
0		2	2	high	0	1	0	0	0
0	0	0	2	average	0	0	1	0	0
0	1	3	2	low	1	0	0	0	0
0		0	2	high	0	1	0	0	0
0		0	2		0	0	0	1	0
0	0	1	2		0	0	0	1	0
0			2		0	0	0	1	0
0		0	2	learndis	0	0	0	0	1
0	0	0	1	learndis	0	0	0	0	1
0	0	0	1	learndis	0	0	0	0	1
0	1	0	2	learndis	0	0	0	0	1
0	1		2	learndis	0	0	0	0	1
0	1		2	learndis	0	0	0	0	1
0	0	0	2	learndis	0	0	0	0	1
0	0		2	learndis	0	0	0	0	1
0	0	0	1	learndis	0	0	0	0	1
0	0	1	2	learndis	0	0	0	0	1
0	1	0	1	learndis	0	0	0	0	1
0		0	2	learndis	0	0	0	0	1
0		1	2	learndis	0	0	0	0	1
0		2	2	learndis	0	0	0	0	1
0	0	2	1		0	0	0	1	0
0	2		1		0	0	0	1	0
0	3	0	1	low	1	0	0	0	0
0	0	0	1		0	0	0	1	0
0			1	average	0	0	1	0	0
0	1	0	1	low	1	0	0	0	0
0	3	0	1	average	0	0	1	0	0
0	0	0	1	low	1	0	0	0	0
0			1		0	0	0	1	0
0	2		1		0	0	0	1	0
0	0	1	1	high	0	1	0	0	0
0	3	1	1	average	0	0	1	0	0
0	2	0	1	average	0	0	1	0	0
0		1	1	low	1	0	0	0	0
0			1	low	1	0	0	0	0
0	2	0	1	average	0	0	1	0	0
0	1	0	1	average	0	0	1	0	0
0		3	1	average	0	0	1	0	0
0	0	0	1	high	0	1	0	0	0
0	1	0	1	average	0	0	1	0	0
0	0	0	1	average	0	0	1	0	0
0	1		1		0	0	0	1	0
0	1	0	1	average	0	0	1	0	0
0	0	0	1		0	0	0	1	0
0	1	1	1	average	0	0	1	0	0
0	0		1	learndis	0	0	0	0	1
0	0	0	1	learndis	0	0	0	0	1
0	0	0	1	learndis	0	0	0	0	1
0	0	0	2	learndis	0	0	0	0	1
0	1	0	1	learndis	0	0	0	0	1
0	1	0	2	learndis	0	0	0	0	1
0	1	0	2	learndis	0	0	0	0	1
0	0	0	2	learndis	0	0	0	0	1
0	3	0	1	learndis	0	0	0	0	1
0		0		learndis	0	0	0	0	1
0	0	0	2	learndis	0	0	0	0	1
0	0	0	2	learndis	0	0	0	0	1
0	0	0	2	learndis	0	0	0	0	1
0				learndis	0	0	0	0	1
0	2	0	1	learndis	0	0	0	0	1
0	0	0	1	learndis	0	0	0	0	1
0	0	0	2	learndis	0	0	0	0	1
0	0		2	learndis	0	0	0	0	1
0	4	0	2	learndis	0	0	0	0	1
0				learndis	0	0	0	0	1
0			1	learndis	0	0	0	0	1
0	0	1	2	learndis	0	0	0	0	1
0	2	0	1	learndis	0	0	0	0	1
0		0		learndis	0	0	0	0	1
0		0		learndis	0	0	0	0	1
0		0		learndis	0	0	0	0	1
0				learndis	0	0	0	0	1
0				learndis	0	0	0	0	1
0				learndis	0	0	0	0	1
0	0	0		learndis	0	0	0	0	1
0				learndis	0	0	0	0	1
0	0			learndis	0	0	0	0	1
0		0		learndis	0	0	0	0	1

csiw <- dummy_cols(csiw,select_columns = "grade")
colnames(csiw)

##  [1] "treat"          "pretest"        "posttest"       "grade"         
##  [5] "group"          "group_low"      "group_high"     "group_average" 
##  [9] "group_NA"       "group_learndis" "grade_1"        "grade_2"       
## [13] "grade_NA"

We will use the variable grade_2 as the required dummy in this analysis.

Analysis

Naive Model

Outcome = posttest
Predictor = treat

Let us run a simple linear regression to estimate the naive model

naive.model <- lm(posttest~treat, data=csiw)
summ(naive.model)

## MODEL INFO:
## Observations: 325 (82 missing obs. deleted)
## Dependent Variable: posttest
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(1,323) = 36.10, p = 0.00
## R² = 0.10
## Adj. R² = 0.10 
## 
## Standard errors: OLS
## -----------------------------------------------
##                     Est.   S.E.   t val.      p
## ----------------- ------ ------ -------- ------
## (Intercept)         0.71   0.35     2.06   0.04
## treat               2.44   0.41     6.01   0.00
## -----------------------------------------------

ANCOVA

In ANCOVA the idea is that we have a continuous variable as the outcome (posttest) and a discrete variable as the predictor of interest (treat) and other covariates to be controlled for (pretest).

In R, ANCOVA is also implemented in the lm() funcion we use for linear regression.

ancova.model <- lm(posttest ~ treat + pretest, data=csiw)
summ(ancova.model)

## MODEL INFO:
## Observations: 272 (135 missing obs. deleted)
## Dependent Variable: posttest
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(2,269) = 24.30, p = 0.00
## R² = 0.15
## Adj. R² = 0.15 
## 
## Standard errors: OLS
## -----------------------------------------------
##                     Est.   S.E.   t val.      p
## ----------------- ------ ------ -------- ------
## (Intercept)         0.22   0.40     0.53   0.59
## treat               2.82   0.45     6.28   0.00
## pretest             0.43   0.14     3.03   0.00
## -----------------------------------------------

Now we can also plot the predicted values as a function of pretest and treat (group membership).

# creating a new column in csiw dataset for the predicted values from ANCOVA model
# csiw$ancova.predicted <- predict(ancova.model)

This error happened because while doing linear regression, R drops the rows where column values are NA. So while doing the analysis, R is dropping 135 observations due to missing data.

Let us remove those rows and redo the analysis.

csiw <- csiw[which(complete.cases(csiw)), ]

Now we will do the ANOVA steps again to see if we get this error

ancova.model <- lm(posttest ~ treat + pretest, data=csiw)
summ(ancova.model)

## MODEL INFO:
## Observations: 258
## Dependent Variable: posttest
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(2,255) = 23.94, p = 0.00
## R² = 0.16
## Adj. R² = 0.15 
## 
## Standard errors: OLS
## -----------------------------------------------
##                     Est.   S.E.   t val.      p
## ----------------- ------ ------ -------- ------
## (Intercept)         0.19   0.42     0.44   0.66
## treat               2.95   0.47     6.34   0.00
## pretest             0.43   0.15     2.90   0.00
## -----------------------------------------------

Dropping the missing values seems to change the treatment effect, this might be an indication that the data is not missing at random. But we will continue with the analysis for now

# creating a new column in csiw dataset for the predicted values from ANCOVA model
csiw$ancova.predicted <- predict(ancova.model)

As you can see now we do not have the error message we had earlier.

Now let us implement the required plot using ggplot2 pakage.

ggplot(csiw, aes(x=pretest, y=ancova.predicted, group=factor(treat),color=factor(treat))) + geom_point()+geom_smooth(method="loess", color="darkred")

Assessing linearity

The assignment asks to plot the residuals from the ANCOVA model against the pretest covariate. We will first calculate the residuals and store it in a new column in csiw and then do the plotting

# storing residuals as new column in csiw
csiw$ancova.residuals <- residuals(ancova.model)

Now we can plot the required graph. Let us add labels to the x and y axis in this plot

ggplot(csiw, aes(x=pretest, y=ancova.residuals, color=factor(treat))) +geom_point()+geom_smooth(method="loess", color="darkred")+
  xlab("Pretest Scores") + ylab("Residuals from ANCOVA model") + labs(title="Residuals vs Pretest - ANCOVA model")

How do we add a quadratic fit? We can create a new variable (pretest.sq) for the pretest variable’s squared value and then replot the graph with the squared variable as the covariate.

csiw$pretest.sq <- csiw$pretest^2

Now let us plot the graph again

ggplot(csiw, aes(x=pretest.sq, y=ancova.residuals, color=factor(treat))) +geom_point()+geom_smooth(method="loess", color="darkred")+
  xlab("Pretest Scores Squared") + ylab("Residuals from ANCOVA model") + labs(title="Residuals vs Pretest squared - ANCOVA model")

We can see that linearity seems to be satisfied, because the residuals are close to zero on average and seem to be distributed almost equally above and below the curve. we have heterosketastity

Quadratic model

Now we have to center the variable pretest and create another variable for the squared value of the centered pretest variable.

# We will use the mutate function to create new variables
pretest.mean <- mean(csiw$pretest)
pretest.mean

## [1] 0.9186047

csiw$pretest.c <- csiw$pretest-pretest.mean
csiw$pretest.csq <- (csiw$pretest.c)^2

Now let us run the ANCOVA again with the centered variable and its square

quadratic.ancova.model <- lm(posttest ~ treat+pretest.c+pretest.csq, data=csiw)
summ(quadratic.ancova.model)

## MODEL INFO:
## Observations: 258
## Dependent Variable: posttest
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(3,254) = 15.98, p = 0.00
## R² = 0.16
## Adj. R² = 0.15 
## 
## Standard errors: OLS
## -----------------------------------------------
##                     Est.   S.E.   t val.      p
## ----------------- ------ ------ -------- ------
## (Intercept)         0.53   0.41     1.29   0.20
## treat               2.94   0.47     6.28   0.00
## pretest.c           0.35   0.23     1.49   0.14
## pretest.csq         0.03   0.07     0.46   0.65
## -----------------------------------------------

Let us look at the predicted values and residual plots of this updated model

csiw$quadratic.ancova.predicted <- predict(quadratic.ancova.model)
csiw$quadratic.ancova.residuals <- residuals(quadratic.ancova.model)

Now the plots

par(mfrow=c(1,2))

ggplot(csiw, aes(x=pretest.c, y = quadratic.ancova.predicted, color=factor(treat))) + geom_point()+
  geom_smooth(method="loess", color="darkred")+  xlab("Pretest Scores Centered") + ylab("Predicted values from Quadratic ANCOVA model") + labs(title="Predicted values vs Pretest squared - Quadratic ANCOVA model")

ggplot(csiw, aes(x=pretest.c, y = quadratic.ancova.residuals, color=factor(treat))) + geom_point()+
  geom_smooth(method="loess", color="darkred")+  xlab("Pretest Scores Centered") + ylab("Residuals from Quadratic ANCOVA model") + labs(title="Residuals vs Pretest squared - Quadratic ANCOVA model")

Searching for confounders

In order to identify confounders we need three kinds of tests:

ANOVA - aov() - for continuous outcome on discrete predictor
Linear Regression - lm () - for continuous outcome on continuous predictor
Chi Square test - chisq.test() - for discrete outcome on discrete predictor

The list of confounders to check:

Dummies from achievement level
Dummy for grade level

Let us see some examples. Let us see

# Dummies from achievement level eg. achievement level = high
## corresponding dummy in the dataset = group_high

### outcome vs. predictor = posttest vs. group_high = continuous vs. discrete = ANOVA

summary(aov(posttest~group_high, data=csiw))

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## group_high    1  474.4   474.4    43.2 2.75e-10 ***
## Residuals   256 2811.2    11.0                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(aov(posttest~group_average, data=csiw))

##                Df Sum Sq Mean Sq F value Pr(>F)
## group_average   1      2   2.322   0.181  0.671
## Residuals     256   3283  12.825

summary(aov(posttest~group_low, data=csiw))

##              Df Sum Sq Mean Sq F value Pr(>F)
## group_low     1     19   19.05   1.493  0.223
## Residuals   256   3267   12.76

summary(aov(posttest~group_learndis, data=csiw))

##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## group_learndis   1  416.5   416.5   37.16 3.99e-09 ***
## Residuals      256 2869.2    11.2                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

### treatment vs. predictor = treat vs group_learndis = Chi Square test
csiw$treat.factor <- factor(csiw$treat)
chisq.test(csiw$treat.factor,csiw$group_high)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  csiw$treat.factor and csiw$group_high
## X-squared = 9.5627, df = 1, p-value = 0.001986

chisq.test(csiw$treat.factor,csiw$group_learndis)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  csiw$treat.factor and csiw$group_learndis
## X-squared = 7.7297, df = 1, p-value = 0.005432

From the results above we can see that this is a potential confounder. Now let us add this to the model and see the output

confounder.check.model <- lm(posttest ~ treat+pretest.c+pretest.csq+group_high + group_learndis, data=csiw)
summ(confounder.check.model)

## MODEL INFO:
## Observations: 258
## Dependent Variable: posttest
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(5,252) = 20.67, p = 0.00
## R² = 0.29
## Adj. R² = 0.28 
## 
## Standard errors: OLS
## ---------------------------------------------------
##                         Est.   S.E.   t val.      p
## -------------------- ------- ------ -------- ------
## (Intercept)             0.82   0.43     1.92   0.06
## treat                   2.17   0.45     4.86   0.00
## pretest.c               0.07   0.22     0.32   0.75
## pretest.csq             0.08   0.06     1.35   0.18
## group_high              2.12   0.46     4.62   0.00
## group_learndis         -1.73   0.50    -3.48   0.00
## ---------------------------------------------------

Let us look at the predicted values and residual plots of this updated model

csiw$confounder.check.model <- predict(confounder.check.model)
csiw$confounder.check.residuals <- residuals(confounder.check.model)

Now the plots

par(mfrow=c(1,2))

ggplot(csiw, aes(x=pretest.c, y = confounder.check.model, color=factor(treat))) + geom_point()+
  geom_smooth(method="loess", color="darkred")+  xlab("Pretest Scores Centered") + ylab("Predicted values from Quadratic ANCOVA model") + labs(title="Predicted values vs Pretest squared - Confounder Check model")

ggplot(csiw, aes(x=pretest.c, y = confounder.check.residuals, color=factor(treat))) + geom_point()+
  geom_smooth(method="loess", color="darkred")+  xlab("Pretest Scores Centered") + ylab("Residuals from Quadratic ANCOVA model") + labs(title="Residuals vs Pretest squared - Confounder Check model")

Now let us look at the results we have gotten so far. For this we will use the export_summs() function from the jtools package

export_summs(naive.model, ancova.model, quadratic.ancova.model, confounder.check.model)

	Model 1	Model 2	Model 3	Model 4
(Intercept)	0.71 *	0.19	0.53	0.82
	(0.35)	(0.42)	(0.41)	(0.43)
treat	2.44 ***	2.95 ***	2.94 ***	2.17 ***
	(0.41)	(0.47)	(0.47)	(0.45)
pretest		0.43 **
		(0.15)
pretest.c			0.35	0.07
			(0.23)	(0.22)
pretest.csq			0.03	0.08
			(0.07)	(0.06)
group_high				2.12 ***
				(0.46)
group_learndis				-1.73 ***
				(0.50)
N	325	258	258	258
R2	0.10	0.16	0.16	0.29
* p < 0.001; p < 0.01; * p < 0.05.

Since the remaining list of covariates to check for confounding are all discrete, we can repeat the steps from above with different covariates.

Exercise: Do the confounder check for grade dummy called grade_2.  
          grade_2 = 1, if Grade = 5 and grade_2 = 0 if Grade = 4.

Heterogeneity

interaction.model <- lm(posttest ~ treat*grade_1 +pretest.c+pretest.csq+group_high + group_learndis, data=csiw)
summ(interaction.model)

## MODEL INFO:
## Observations: 258
## Dependent Variable: posttest
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(7,250) = 14.75, p = 0.00
## R² = 0.29
## Adj. R² = 0.27 
## 
## Standard errors: OLS
## ---------------------------------------------------
##                         Est.   S.E.   t val.      p
## -------------------- ------- ------ -------- ------
## (Intercept)             1.14   0.64     1.79   0.07
## treat                   1.80   0.69     2.63   0.01
## grade_1                -0.52   0.77    -0.67   0.50
## pretest.c               0.09   0.23     0.42   0.68
## pretest.csq             0.08   0.06     1.27   0.21
## group_high              2.13   0.46     4.62   0.00
## group_learndis         -1.77   0.50    -3.51   0.00
## treat:grade_1           0.63   0.89     0.70   0.48
## ---------------------------------------------------

interaction.model2 <- lm(posttest ~ treat*grade_2 +pretest.c+pretest.csq+group_high + group_learndis, data=csiw)
summ(interaction.model2)

## MODEL INFO:
## Observations: 258
## Dependent Variable: posttest
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(7,250) = 14.75, p = 0.00
## R² = 0.29
## Adj. R² = 0.27 
## 
## Standard errors: OLS
## ---------------------------------------------------
##                         Est.   S.E.   t val.      p
## -------------------- ------- ------ -------- ------
## (Intercept)             0.63   0.52     1.20   0.23
## treat                   2.43   0.58     4.16   0.00
## grade_2                 0.52   0.77     0.67   0.50
## pretest.c               0.09   0.23     0.42   0.68
## pretest.csq             0.08   0.06     1.27   0.21
## group_high              2.13   0.46     4.62   0.00
## group_learndis         -1.77   0.50    -3.51   0.00
## treat:grade_2          -0.63   0.89    -0.70   0.48
## ---------------------------------------------------

Exercise: Do the interaction check for treatment and prior achievement level on 4 groups.

Checking assumptions about the random part of model

plot(interaction.model2)