purrr & Regression with Baseline Model into Account

1. Purrr in R

This package is amazing. Knowledge of this package significantly simplifies the process of getting some key information during data analysis.

Loading the tidyverse package:

purr is a package within tidyverse universe. We don’t have to load “purr” to be able to use it because loading tidyverse would do it. However, for the purpose of this demonstration, I am going to load it.

#Loading Required Libraries
library(tidyverse)
library(purrr)
library(dplyr)
library(kableExtra)
library(ggplot2)
library(cowplot)

# Creating a New Dataset
m_data <- list(
  c(1,2,5,9),
  c(3,5,7,3),
  c(2,5,6,7)
)

# Checking the datafile
m_data

[[1]]
[1] 1 2 5 9

[[2]]
[1] 3 5 7 3

[[3]]
[1] 2 5 6 7

# Checking the First Set of Files
m_data[[1]]

[1] 1 2 5 9

# Checking the Third Set of Files
m_data[[3]]

[1] 2 5 6 7

First Method to Find Mean of All Variable

#First Set Mean
m_data[[1]] %>% mean()

[1] 4.25

  #Or non-professional way
mean(m_data[[1]])

[1] 4.25

#Second Set Mean
m_data[[2]] %>% mean()

[1] 4.5

#Third Set Mean
m_data[[3]] %>% mean()

[1] 5

Second Method (purrr).

We here pass the map function. It makes our task much easier. One map syntax and the job is done.

# Finding the Mean of Each Element
res<- map(m_data, mean)
res

[[1]]
[1] 4.25

[[2]]
[1] 4.5

[[3]]
[1] 5

# Checking the Class of Vector res
class(res)

[1] "list"

# OR WE CAN SIMPLY USE THE PIPE FUNCTION
res %>% map(mean)

[[1]]
[1] 4.25

[[2]]
[1] 4.5

[[3]]
[1] 5

map returns a list of numbers (mean of each variable in this case). map_at on the other hand transforms their input by applying a function to each element and returning a vector the same length as the input. For example

map_at:
map_int: returns integers
map_dbl: returns the vectors of doubles
map_chr: returns characters Lets take an example:

m_data %>% map_dbl(mean)

[1] 4.25 4.50 5.00

Let’s check an anonymous function (local function)

I want to create a local function and want to pass it over all the elements of the vectors above.

tilde to denote a function
dot (.) to let R know to act on all the vectors
plus (+) sum
8 to all of the elements one by one

m_data %>% map(~ . + 8)

[[1]]
[1]  9 10 13 17

[[2]]
[1] 11 13 15 11

[[3]]
[1] 10 13 14 15

map_dbl wants to return a single vector, so, we don’t get the consistent answer if we pass it over map_dbl.

# m_data %>% map_dbl(~ . + 8)

I get the error message: Error: Result 1 must be a single double, not a double vector of length 4 Backtrace: x 1…

I want to check if the values in the m_data are numeric:

m_data %>% map(is.numeric)

[[1]]
[1] TRUE

[[2]]
[1] TRUE

[[3]]
[1] TRUE

Yes. All of them are numeric vectors. We can check them the other way as well. map_lgl is a logical function that we can pass on our dataset. Or we can also use map_chr if we want the TRUE, FALSE as objects/characters

m_data %>% map_lgl(is.numeric)

[1] TRUE TRUE TRUE

m_data %>% map_chr(is.numeric)

[1] "TRUE" "TRUE" "TRUE"

A function within a function

func_1 <- function(x){
if (x > 25){
# stop("x is bigger than I can handle!!")#produce an error message
  return("x is bigger than I can handle!!")#return the message
}else if(x > 20){
return (x * 3)#multiply the number by 3
}
  return(x * 5)#otherwise multiply the number by 5
}

# Now, let's check how it works
# a. Let's make x bigger than 20
func_1(21)# 21 * 3 = 63

[1] 63

# b. Let's make x smaller than 20
func_1(10)

[1] 50

# c. Let's make x bigger than 25
func_1(26)

[1] "x is bigger than I can handle!!"

2. ANCOVA Demonstration

I wanted to conduct a full ANCOVA Analysis of my data in R. This is my first time doing so. I have my data in two different datafiles. Both dataset have a common “Student ID” variable. I am goint to use it to combine the files. One of the data files is saved as .xlsx file type, while the other is in .csv format. I need slightly different syntax to upload them. Let’s upload them and check the summary of the data.

# Loading Required Libraries
library(readxl)
library(tidyr)
library(pander)
# Uploading the data in R
data1 <- read_excel("data_warehouse_pull.xlsx")
summary(data1)

   student_id       school_id         grade            frpl       
 Min.   :812011   Min.   :350.0   Min.   :2.000   Min.   :0.0000  
 1st Qu.:816066   1st Qu.:357.0   1st Qu.:2.000   1st Qu.:0.0000  
 Median :820121   Median :364.0   Median :3.000   Median :0.0000  
 Mean   :820121   Mean   :365.1   Mean   :3.481   Mean   :0.3881  
 3rd Qu.:824176   3rd Qu.:373.0   3rd Qu.:4.000   3rd Qu.:1.0000  
 Max.   :828231   Max.   :381.0   Max.   :5.000   Max.   :1.0000  
                                                                  
      male            white            black            asian        
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :1.0000   Median :0.0000   Median :0.0000   Median :0.00000  
 Mean   :0.5108   Mean   :0.4805   Mean   :0.2506   Mean   :0.03976  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
                  NA's   :149      NA's   :149      NA's   :149      
    hispanic       multiracial      personalized_learning    title_1      
 Min.   :0.0000   Min.   :0.00000   Min.   :0.0000        Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000        1st Qu.:0.0000  
 Median :0.0000   Median :0.00000   Median :1.0000        Median :1.0000  
 Mean   :0.1826   Mean   :0.04424   Mean   :0.5325        Mean   :0.5762  
 3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:1.0000        3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.00000   Max.   :1.0000        Max.   :1.0000  
 NA's   :149      NA's   :149                                             
     magnet       region_bridges  region_harris     region_benjamin  
 Min.   :0.0000   Min.   :0.000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :0.0000   Median :0.000   Median :0.00000   Median :0.00000  
 Mean   :0.1809   Mean   :0.164   Mean   :0.03643   Mean   :0.09032  
 3rd Qu.:0.0000   3rd Qu.:0.000   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :1.0000   Max.   :1.000   Max.   :1.00000   Max.   :1.00000  
                                                                     
 region_patton    region_simpson   region_robinson  region_raymond  
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.1632   Mean   :0.1707   Mean   :0.1181   Mean   :0.2573  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

data2 <- read.csv("ready2readprograminfo.csv")
summary(data2)

   student_id       ready2read     met_half_Ready2Read_goal
 Min.   :812011   Min.   :0.0000   Min.   :0.0000          
 1st Qu.:816066   1st Qu.:0.0000   1st Qu.:0.0000          
 Median :820121   Median :1.0000   Median :0.0000          
 Mean   :820121   Mean   :0.5228   Mean   :0.3578          
 3rd Qu.:824176   3rd Qu.:1.0000   3rd Qu.:1.0000          
 Max.   :828231   Max.   :1.0000   Max.   :1.0000          
                                                           
 met_all_Ready2Read_goal pre_fluency_score post_fluency_score
 Min.   :0.0000          Min.   :  0.00    Min.   :  0.0     
 1st Qu.:0.0000          1st Qu.: 66.00    1st Qu.: 99.0     
 Median :0.0000          Median : 95.00    Median :128.0     
 Mean   :0.2229          Mean   : 96.24    Mean   :127.7     
 3rd Qu.:0.0000          3rd Qu.:125.00    3rd Qu.:157.0     
 Max.   :1.0000          Max.   :237.00    Max.   :299.0     
                         NA's   :1580      NA's   :2065

Based on the summary data1 has 20 variables and data2 has 6 variables. The first data file has contains the information about students, school and the regions the schools are located. The second data file on the other hand, has information about students’ participation on Ready2Read (R2R)programs, whether or not they met the half-way-through and all goals, and their pre- and post- fluency scores.

We don’t see that many missing cases in the first dataset but there are good amount of cases in the second dataset. If you want to know how I deal with them please follow this link: https://rpubs.com/nirmal/791463

Assuming that you went through the link, I am directly coming up with the final dataset “final_data”. The following steps were taken to have this final dataset produced:

Merging the datafiles
Tracing the common missing cases and deleting them case wise because they were never the part of this study.
Tracking the remaining missing cases and replacing them with column mean.

Table continues below
student_id	school_id	grade	frpl	male	white	black
812011	365	3	1	1	0	0
812012	362	2	0	0	0	0
812013	359	2	1	1	0	0
812014	360	3	1	0	1	0
812015	366	2	1	1	0	1
812016	350	3	1	1	0	0

Table continues below
hispanic	multiracial	personalized_learning	title_1	magnet
0	1	0	0	0
1	0	0	0	0
1	0	0	1	1
0	0	1	1	0
0	0	0	0	1
1	0	1	1	0

Table continues below
region_bridges	region_harris	region_benjamin	region_patton
0	0	1	0
0	0	0	0
0	0	1	0
0	0	0	0
0	0	0	0
0	0	0	0

Table continues below
region_simpson	region_robinson	ready2read
0	0	0
1	0	1
0	0	1
1	0	1
0	1	0
1	0	0

Table continues below
met_half_Ready2Read_goal	met_all_Ready2Read_goal	pre_fluency_score
0	0	0
1	1	0
0	0	0
1	0	0
0	0	0
0	0	0

post_fluency_score
0
117
13
9
5
4

Here’s the variables in this final_data:

SN	Variable	Description	Type
1	student_id	Unique student ID	student_variable
2	school_id	School ID	school_variable
3	grade_level	Student’s grade level	student_variable
4	frpl	Indicator for free and reduced price lunch, where FRPL = 1 and non-FRPL = 0	student_variable
5	male	Indicator for sex, where male = 1 and female = 0	student_variable
6	excep	Indicator for students with exceptionalities, where excep = 1 and non-excep = 0	student_variable
7	el	Indicator for English Learner, where EL = 1 and non-EL = 0	student_variable
8	tag	Indicator for talented and gifted students, where TAG = 1 and non-TAG = 0	student_variable
9	white	Race/ethnicity indicator, where white=1 and non-white=0	student_variable
10	black	Race/ethnicity indicator, where Black=1 and non-Black=0	student_variable
11	asian	Race/ethnicity indicator, where Asian=1 and non-Asian=0	student_variable
12	hispanic	Race/ethnicity indicator, where Hispanic=1 and non-Hispanic=0	student_variable
13	multiracial	Race/ethnicity indicator, where multiracial=1 and non-multiracial=0	student_variable
14	personalized_learning	Student attends a personalized learning school (attends = 1; does not attend = 0)	school_variable
15	title_1	Student attends a Title I school (attends = 1; does not attend = 0)	school_variable
16	magnet	Student attends a magnet school (attends = 1; does not attend = 0)	school_variable
17	ready2read	Indicates student received the Ready2Read program	outcome_variable
18	met_half_Ready2Read_goal	Student completed at least half of goals (met half = 1; did not meet half = 0)	outcome_variable
19	met_all_Ready2Read _goal	Student completed all Ready2Read goals (met all = 1; did not meet all = 0)	outcome_variable
20	pre_fluency_score	Beginning of year fluency assessment score	outcome_variable
21	post_fluency_score	End-of-year fluency assessment score	outcome_variable
22	region_bridges	Indicates student attends school in the Bridges Region	region_variable
23	region_harris	Indicates student attends school in the Harris Region	region_variable
24	region_benjamin	Indicates student attends school in Benjamin Region	region_variable
25	region_patton	Indicates student attends school in Patton Region	region_variable
26	region_simpson	Indicates student attends school in Simpson Region	region_variable
27	region_robinson	Indicates student attends school in Robinson Region	region_variable
28	region_raymond	Indicates student attends school in Raymond Region	region_variable

Changing the Coded Values to Original Description and Changing the Variable Types to Factor

# FRPL
final_data$frpl <- ifelse(test = final_data$frpl == 1, yes = "Low_SES_Students" , no = "High_SES_Students")
final_data$frpl <- as.factor(final_data$frpl)

# Gender
final_data$male <- ifelse(test = final_data$male == 1, yes = "Male" , no = "Female")
final_data$male <- as.factor(final_data$male)

# Ethnicity: White
final_data$white <- ifelse(test = final_data$white == 1, yes = "White_Students" , no = "Non_White_Students")
final_data$white <- as.factor(final_data$white)

# Ethnicity: Black
final_data$black <- ifelse(test = final_data$black == 1, yes = "Black_Students" , no = "Non_Black_Students")
final_data$black <- as.factor(final_data$black)

# Ethnicity: Asian
final_data$asian <- ifelse(test = final_data$asian == 1, yes = "Asian_Students" , no = "Non_Asian_Students")
final_data$asian <- as.factor(final_data$asian)

# Ethnicity: Hispanic
final_data$hispanic <- ifelse(test = final_data$hispanic == 1, yes = "Hispanic_Students" , no = "Non_Hispanic_Students")
final_data$hispanic <- as.factor(final_data$hispanic)

# Ethnicity: Multiracial
final_data$multiracial <- ifelse(test = final_data$multiracial == 1, yes = "Multiracial_Students" , no = "Non_Multiracial_Students")
final_data$multiracial <- as.factor(final_data$multiracial)

# Is School Providing Personalized Learning?
final_data$personalized_learning <- ifelse(test = final_data$personalized_learning == 1, yes = "Personalized_Learning_School" , no = "Non_Personalized_School")
final_data$personalized_learning <- as.factor(final_data$personalized_learning)

# Title One School Status
final_data$title_1 <- ifelse(test = final_data$title_1 == 1, yes = "Title_1_School" , no = "Non_Title_1_School")
final_data$title_1 <- as.factor(final_data$title_1)

# Magnet School Status
final_data$magnet <- ifelse(test = final_data$magnet == 1, yes = "Magnet_School" , no = "Non_Magnet_School")
final_data$magnet <- as.factor(final_data$magnet)

# Lies in Bridges Region
final_data$region_bridges <- ifelse(test = final_data$region_bridges == 1, yes = "Bridges_School" , no = "Non_Bridges_School")
final_data$region_bridges <- as.factor(final_data$region_bridges)

# Lies in Harris Region
final_data$region_harris <- ifelse(test = final_data$region_harris == 1, yes = "Harris_School" , no = "Non_Harris_School")
final_data$region_harris <- as.factor(final_data$region_harris)

# Lies in Benjamin Region
final_data$region_benjamin <- ifelse(test = final_data$region_benjamin == 1, yes = "Benjamin_School" , no = "Non_Benjamin_School")
final_data$region_benjamin <- as.factor(final_data$region_benjamin)

# Lies in Patton Region
final_data$region_patton <- ifelse(test = final_data$region_patton == 1, yes = "Patton_School" , no = "Non_Patton_School")
final_data$region_patton <- as.factor(final_data$region_patton)

# Lies in Simpson Region
final_data$region_simpson <- ifelse(test = final_data$region_simpson == 1, yes = "Simpson_School" , no = "Non_Simpson_School")
final_data$region_simpson <- as.factor(final_data$region_simpson)

# Lies in Robinson Region
final_data$region_robinson <- ifelse(test = final_data$region_robinson == 1, yes = "Robinson_School" , no = "Non_Robinson_School")
final_data$region_robinson <- as.factor(final_data$region_robinson)

# Lies in Raymond Region
final_data$region_raymond <- ifelse(test = final_data$region_raymond == 1, yes = "Raymond_School" , no = "Non_Raymond_School")
final_data$region_raymond <- as.factor(final_data$region_raymond)

# Did the Student Participate in Ready2Read Program?
final_data$ready2read <- ifelse(test = final_data$ready2read == 1, yes = "Ready2Read_School" , no = "Non_Ready2Read_School")
final_data$ready2read <- as.factor(final_data$ready2read)

# Did the Student Met Half Ready2Read Goal?
final_data$met_half_Ready2Read_goal <- ifelse(test = final_data$met_half_Ready2Read_goal == 1, yes = "Met_Half_Ready2Read_Goal" , no = "NMet_Half_Ready2Read_Goal")
final_data$met_half_Ready2Read_goal <- as.factor(final_data$met_half_Ready2Read_goal)

# Did the Student Met All Ready2Read Goal?
final_data$met_all_Ready2Read_goal <- ifelse(test = final_data$met_all_Ready2Read_goal == 1, yes = "Met_All_Ready2Read_Goal" , no = "NMet_All_Ready2Read_Goal")
final_data$met_all_Ready2Read_goal <- as.factor(final_data$met_all_Ready2Read_goal)

# Changing Grade Level to Factor Variable
final_data$grade <- as.factor(final_data$grade)

# Checking the Glimpse of the Modified Dataset
glimpse(final_data)

Rows: 14,845
Columns: 25
$ student_id               <dbl> 812011, 812012, 812013, 812014, 812015, 81201~
$ school_id                <dbl> 365, 362, 359, 360, 366, 350, 379, 366, 361, ~
$ grade                    <fct> 3, 2, 2, 3, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 3, ~
$ frpl                     <fct> Low_SES_Students, High_SES_Students, Low_SES_~
$ male                     <fct> Male, Female, Male, Female, Male, Male, Femal~
$ white                    <fct> Non_White_Students, Non_White_Students, Non_W~
$ black                    <fct> Non_Black_Students, Non_Black_Students, Non_B~
$ asian                    <fct> Non_Asian_Students, Non_Asian_Students, Non_A~
$ hispanic                 <fct> Non_Hispanic_Students, Hispanic_Students, His~
$ multiracial              <fct> Multiracial_Students, Non_Multiracial_Student~
$ personalized_learning    <fct> Non_Personalized_School, Non_Personalized_Sch~
$ title_1                  <fct> Non_Title_1_School, Non_Title_1_School, Title~
$ magnet                   <fct> Non_Magnet_School, Non_Magnet_School, Magnet_~
$ region_bridges           <fct> Non_Bridges_School, Non_Bridges_School, Non_B~
$ region_harris            <fct> Non_Harris_School, Non_Harris_School, Non_Har~
$ region_benjamin          <fct> Benjamin_School, Non_Benjamin_School, Benjami~
$ region_patton            <fct> Non_Patton_School, Non_Patton_School, Non_Pat~
$ region_simpson           <fct> Non_Simpson_School, Simpson_School, Non_Simps~
$ region_robinson          <fct> Non_Robinson_School, Non_Robinson_School, Non~
$ region_raymond           <fct> Non_Raymond_School, Non_Raymond_School, Non_R~
$ ready2read               <fct> Non_Ready2Read_School, Ready2Read_School, Rea~
$ met_half_Ready2Read_goal <fct> NMet_Half_Ready2Read_Goal, Met_Half_Ready2Rea~
$ met_all_Ready2Read_goal  <fct> NMet_All_Ready2Read_Goal, Met_All_Ready2Read_~
$ pre_fluency_score        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
$ post_fluency_score       <dbl> 0.0, 117.0, 13.0, 9.0, 5.0, 4.0, 19.0, 5.0, 2~

Baseline Masurement

Looking at the dataset, we can see that I am provided with pre- and post- fluency scores. Before I move any further, I want to see how the pre_fluency_scores correlate with post_fluency_scores. I am going to plot them and compare:

my_color = c("#1A425C", "#8AB63F")
ggplot(final_data, aes(x = pre_fluency_score, y = post_fluency_score, col = frpl))+
  geom_point()+
  geom_smooth(method = "lm", se = FALSE)+
  scale_color_manual(values = my_color)+
  theme_bw()+
  xlab("Students' Raw Pre Fluency Score") +
  ylab("Students' Raw Post Fluency Score")

It is not mandatory for both pre- & post- test scores to be in the same column. But, as outcome variables some analysts want to do so. It is some time referred as wide to long format. Dataset with many variables can complicate the process of transferring wide data to long data. I am going to at least try. I don’t want my “final_data” to be messed up, so, I will create a new vector called l_data and take necessary steps. We don’t need to load any other library for this, tidyverse would be enough.

What am I going to do?

Create a key variable named test_type and define pre_fluency_score and post_fluency_scores as two available keys
Create a new variable test_score and assign corresponding pre- and post- fluency scores as applicable

l_data <- gather(data = final_data, key = test_type, value = test_score,
                 pre_fluency_score : post_fluency_score, factor_key = TRUE)
pander(head(arrange(l_data, student_id), 10))

Table continues below
student_id	school_id	grade	frpl	male
812011	365	3	Low_SES_Students	Male
812011	365	3	Low_SES_Students	Male
812012	362	2	High_SES_Students	Female
812012	362	2	High_SES_Students	Female
812013	359	2	Low_SES_Students	Male
812013	359	2	Low_SES_Students	Male
812014	360	3	Low_SES_Students	Female
812014	360	3	Low_SES_Students	Female
812015	366	2	Low_SES_Students	Male
812015	366	2	Low_SES_Students	Male

Table continues below
white	black	asian
Non_White_Students	Non_Black_Students	Non_Asian_Students
Non_White_Students	Non_Black_Students	Non_Asian_Students
Non_White_Students	Non_Black_Students	Non_Asian_Students
Non_White_Students	Non_Black_Students	Non_Asian_Students
Non_White_Students	Non_Black_Students	Non_Asian_Students
Non_White_Students	Non_Black_Students	Non_Asian_Students
White_Students	Non_Black_Students	Non_Asian_Students
White_Students	Non_Black_Students	Non_Asian_Students
Non_White_Students	Black_Students	Non_Asian_Students
Non_White_Students	Black_Students	Non_Asian_Students

Table continues below
hispanic	multiracial
Non_Hispanic_Students	Multiracial_Students
Non_Hispanic_Students	Multiracial_Students
Hispanic_Students	Non_Multiracial_Students
Hispanic_Students	Non_Multiracial_Students
Hispanic_Students	Non_Multiracial_Students
Hispanic_Students	Non_Multiracial_Students
Non_Hispanic_Students	Non_Multiracial_Students
Non_Hispanic_Students	Non_Multiracial_Students
Non_Hispanic_Students	Non_Multiracial_Students
Non_Hispanic_Students	Non_Multiracial_Students

Table continues below
personalized_learning	title_1	magnet
Non_Personalized_School	Non_Title_1_School	Non_Magnet_School
Non_Personalized_School	Non_Title_1_School	Non_Magnet_School
Non_Personalized_School	Non_Title_1_School	Non_Magnet_School
Non_Personalized_School	Non_Title_1_School	Non_Magnet_School
Non_Personalized_School	Title_1_School	Magnet_School
Non_Personalized_School	Title_1_School	Magnet_School
Personalized_Learning_School	Title_1_School	Non_Magnet_School
Personalized_Learning_School	Title_1_School	Non_Magnet_School
Non_Personalized_School	Non_Title_1_School	Magnet_School
Non_Personalized_School	Non_Title_1_School	Magnet_School

Table continues below
region_bridges	region_harris	region_benjamin
Non_Bridges_School	Non_Harris_School	Benjamin_School
Non_Bridges_School	Non_Harris_School	Benjamin_School
Non_Bridges_School	Non_Harris_School	Non_Benjamin_School
Non_Bridges_School	Non_Harris_School	Non_Benjamin_School
Non_Bridges_School	Non_Harris_School	Benjamin_School
Non_Bridges_School	Non_Harris_School	Benjamin_School
Non_Bridges_School	Non_Harris_School	Non_Benjamin_School
Non_Bridges_School	Non_Harris_School	Non_Benjamin_School
Non_Bridges_School	Non_Harris_School	Non_Benjamin_School
Non_Bridges_School	Non_Harris_School	Non_Benjamin_School

Table continues below
region_patton	region_simpson	region_robinson
Non_Patton_School	Non_Simpson_School	Non_Robinson_School
Non_Patton_School	Non_Simpson_School	Non_Robinson_School
Non_Patton_School	Simpson_School	Non_Robinson_School
Non_Patton_School	Simpson_School	Non_Robinson_School
Non_Patton_School	Non_Simpson_School	Non_Robinson_School
Non_Patton_School	Non_Simpson_School	Non_Robinson_School
Non_Patton_School	Simpson_School	Non_Robinson_School
Non_Patton_School	Simpson_School	Non_Robinson_School
Non_Patton_School	Non_Simpson_School	Robinson_School
Non_Patton_School	Non_Simpson_School	Robinson_School

Table continues below
region_raymond	ready2read	met_half_Ready2Read_goal
Non_Raymond_School	Non_Ready2Read_School	NMet_Half_Ready2Read_Goal
Non_Raymond_School	Non_Ready2Read_School	NMet_Half_Ready2Read_Goal
Non_Raymond_School	Ready2Read_School	Met_Half_Ready2Read_Goal
Non_Raymond_School	Ready2Read_School	Met_Half_Ready2Read_Goal
Non_Raymond_School	Ready2Read_School	NMet_Half_Ready2Read_Goal
Non_Raymond_School	Ready2Read_School	NMet_Half_Ready2Read_Goal
Non_Raymond_School	Ready2Read_School	Met_Half_Ready2Read_Goal
Non_Raymond_School	Ready2Read_School	Met_Half_Ready2Read_Goal
Non_Raymond_School	Non_Ready2Read_School	NMet_Half_Ready2Read_Goal
Non_Raymond_School	Non_Ready2Read_School	NMet_Half_Ready2Read_Goal

met_all_Ready2Read_goal	test_type	test_score
NMet_All_Ready2Read_Goal	pre_fluency_score	0
NMet_All_Ready2Read_Goal	post_fluency_score	0
Met_All_Ready2Read_Goal	pre_fluency_score	0
Met_All_Ready2Read_Goal	post_fluency_score	117
NMet_All_Ready2Read_Goal	pre_fluency_score	0
NMet_All_Ready2Read_Goal	post_fluency_score	13
NMet_All_Ready2Read_Goal	pre_fluency_score	0
NMet_All_Ready2Read_Goal	post_fluency_score	9
NMet_All_Ready2Read_Goal	pre_fluency_score	0
NMet_All_Ready2Read_Goal	post_fluency_score	5

Based on the outcome, my plan worked. Now, I have two rows of data per student, one for pre_fluency_score, and the next for the post_fluency_score. One of benefits of such data format is, we can do many different types of activities like creating box plots or conducting long term trend analysis. I can also plot confidence interval in very easy to understand fashion. Now, lets create a box plot of pre- & post- fluency scores for schools by their magnet status.

magnet_box <- ggplot(l_data, aes(x = test_type, y = test_score, col = magnet))+
  geom_boxplot(outlier.size = 0)+
  geom_point(aes(fill = magnet), shape = 19, alpha = 0.5, position = position_jitterdodge(jitter.width = 0.2))+
 # scale_fill_manual(values = my_color)+
 # scale_color_manual(values = my_color)+
 # theme_bw()+
  xlab("Students' Pre- & Post- Fluency Scores by Magnet Status")
magnet_box

The diagram depicts the clear picture of the pre- & post- fluency scores among students in Magnet and non-magnet schools. Students in non-magnet schools seem to have higher pre_fluency_scores, but we don’t see any differences in mean post_fluency_scores. The post_fluency_scores among the students in magnet schools are spread higher than others in general.

Let’s check if there is any difference in students’ pre- & post- fluency scores based on their grades and personalized learning status. I am going to use a box plot to do so.

grade_box <- ggplot(l_data, aes(x = test_type, y = test_score, col = grade))+
  geom_boxplot(outlier.size = 0)+
  geom_point(aes(fill = personalized_learning), shape = 19, alpha = 0.5, position = position_jitterdodge(jitter.width = 0.2))+
    xlab("Students' Pre- & Post- Fluency Scores by Grades")
grade_box

Representation of variation in the results of schools that provide personalized and non-personalized leanings are not that clear in the diagram above but we have clear picture of the variation in students pre X pre, & pre X post fluency scores by grades. Second graders had the lowest mean pre- and post- fluency scores and that increased by grades. It makes complete sense because language develops in parallel with age.

I wanted to see the differences in as many variables as possible. Thus, so far, I used schools’ magnet status, students’ socioeconomic status, and grades.

As I move towards something called constrained Longitudinal Data Analysis (cLDA), I want to stick with one variable, because my intention is to show the baseline measurement and how it changed overtime. I am thinking that pre_fluency_score and post_fluency_score tell the story of two different time intervals and I assume there are differences among students based on certain characteristics. My variable of choice would be ‘ready2read’. I want to see the rate of change in scores among the students who were part of R2R program and who were not.

This analysis can be conduct using the nlme package available to R users. As my idea is to see the change in pre- and post- fluency scores for the students who participated in R2R program and who did not.

Fixed Part

My fixed part would be the test_type and test_type X R2R status. One of the rule of thumb in regression modeling is to include the baseline/lower ranked effects while modeling for an interaction in the model. If I run a regression model without doing anything, R will regard test_type and R2R status as main effects. However, I want interaction to be the main effect thus, I will first create an alternative model matrix named int_main. Here’s how we can do it:

# creating int matrix
int <- model.matrix(~ test_type * ready2read, data = l_data)

# checking column names
colnames(int)

[1] "(Intercept)"                                            
[2] "test_typepost_fluency_score"                            
[3] "ready2readReady2Read_School"                            
[4] "test_typepost_fluency_score:ready2readReady2Read_School"

# creating int_main matrix by selecting only desired columns from int
int_main <- int[, c("test_typepost_fluency_score", "test_typepost_fluency_score:ready2readReady2Read_School")]

# checking column names
colnames(int_main)

[1] "test_typepost_fluency_score"                            
[2] "test_typepost_fluency_score:ready2readReady2Read_School"

Random Part

The random parts, the standard deviations, and correlations constantly vary and they are defined as varldent() in R. For example, in this analysis, I expect differences in standard deviation for students who participated in R2R program and who did not. Likewise, I expect different correlation coefficient between the students who participated in the program and who completed the post_fluency_test and who did not.

To capture the essence of this analysis, I am going to define **weights = varldent(form ~ 1|test_type:ready2read), we get a distinct statistics estimated for test_type by R2R status (For example: pre_fluency X participated, pre_fluency X non_participated; post_fluency X participated & post_fluency X participated).

On the other hand, the correlation syntax correlation = corSymm(form = ~ 1 | student_id) defines the participants. We are bound to check the correlation of pre_fluency_scores to post_fluency_score of the same student. In this instance, we can’t correlate the scores of two different students because they can be inherently different.

Let’s run the analysis:

# Loading the required library
library(nlme)

# creating my_anl vector and saving the results 
my_anl <- gls(test_score ~ int_main,
              weights = varIdent(form = ~ 1 | test_type),
              correlation = corSymm(form = ~ 1 | student_id),
              data = l_data)
summary(my_anl)

Generalized least squares fit by REML
  Model: test_score ~ int_main 
  Data: l_data 
       AIC      BIC    logLik
  283929.4 283979.2 -141958.7

Correlation Structure: General
 Formula: ~1 | student_id 
 Parameter estimate(s):
 Correlation: 
  1    
2 0.875
Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | test_type 
 Parameter estimates:
 pre_fluency_score post_fluency_score 
          1.000000           1.000825 

Coefficients:
                                                                   Value
(Intercept)                                                     96.24237
int_maintest_typepost_fluency_score                             30.27677
int_maintest_typepost_fluency_score:ready2readReady2Read_School  2.25163
                                                                Std.Error
(Intercept)                                                     0.3403112
int_maintest_typepost_fluency_score                             0.2425019
int_maintest_typepost_fluency_score:ready2readReady2Read_School 0.3300808
                                                                  t-value
(Intercept)                                                     282.80695
int_maintest_typepost_fluency_score                             124.85167
int_maintest_typepost_fluency_score:ready2readReady2Read_School   6.82144
                                                                p-value
(Intercept)                                                           0
int_maintest_typepost_fluency_score                                   0
int_maintest_typepost_fluency_score:ready2readReady2Read_School       0

 Correlation: 
                                                                (Intr) in____
int_maintest_typepost_fluency_score                             -0.174       
int_maintest_typepost_fluency_score:ready2readReady2Read_School  0.000 -0.712

Standardized residuals:
          Min            Q1           Med            Q3           Max 
-3.103079e+00 -6.811375e-01 -5.712186e-05  6.802595e-01  4.156391e+00 

Residual standard error: 41.46353 
Degrees of freedom: 29690 total; 29687 residual

The Major Findings:

The alpha or p-value for the interaction, e.g., int_maintest_typepost_fluency_score:ready2readReady2Read_School is 0, suggesting that the change in students’ pre_fluency_scores to post_fluency_scores are statistically significant for the groups (i.e., students who participated in the Ready2Read programs and who did not).
The estimated difference in the change of post_fluency_score among the students who participated in the Ready2Read program was approximately 2.25(SE: 0.33) units higher than the ones who did not.
The estimated correlation between pre_fluency_scores and post_fluency_scores is fairly strong, i.e., 87.5%. It suggests that students’ pre_fluency_scores were the strong indicators of their post_fluency_scores, regardless of their participation in R2R programs.
The estimated standard deviation for the pre_fluency_score was 41.46, when the post_fluency_score was 41.46 * 1.000825 = 41.49.

We need to report the confidence interval of a regression analysis. I am going to use library rms and use the Gls() function within the package to calculate the confidence interval around the mean.

# Loading the required library
library(rms)

# creating my_anl vector and saving the results 
my_anl_2 <- Gls(test_score ~ int_main,
              weights = varIdent(form = ~ 1 | test_type),
              correlation = corSymm(form = ~ 1 | student_id),
              data = l_data)
# Saving the prediction and CI data into my original dataset
my_prediction <- cbind(l_data, predict(my_anl_2, l_data, conf.int = 0.95))
head(my_prediction)

  student_id school_id grade              frpl   male              white
1     812011       365     3  Low_SES_Students   Male Non_White_Students
2     812012       362     2 High_SES_Students Female Non_White_Students
3     812013       359     2  Low_SES_Students   Male Non_White_Students
4     812014       360     3  Low_SES_Students Female     White_Students
5     812015       366     2  Low_SES_Students   Male Non_White_Students
6     812016       350     3  Low_SES_Students   Male Non_White_Students
               black              asian              hispanic
1 Non_Black_Students Non_Asian_Students Non_Hispanic_Students
2 Non_Black_Students Non_Asian_Students     Hispanic_Students
3 Non_Black_Students Non_Asian_Students     Hispanic_Students
4 Non_Black_Students Non_Asian_Students Non_Hispanic_Students
5     Black_Students Non_Asian_Students Non_Hispanic_Students
6 Non_Black_Students Non_Asian_Students     Hispanic_Students
               multiracial        personalized_learning            title_1
1     Multiracial_Students      Non_Personalized_School Non_Title_1_School
2 Non_Multiracial_Students      Non_Personalized_School Non_Title_1_School
3 Non_Multiracial_Students      Non_Personalized_School     Title_1_School
4 Non_Multiracial_Students Personalized_Learning_School     Title_1_School
5 Non_Multiracial_Students      Non_Personalized_School Non_Title_1_School
6 Non_Multiracial_Students Personalized_Learning_School     Title_1_School
             magnet     region_bridges     region_harris     region_benjamin
1 Non_Magnet_School Non_Bridges_School Non_Harris_School     Benjamin_School
2 Non_Magnet_School Non_Bridges_School Non_Harris_School Non_Benjamin_School
3     Magnet_School Non_Bridges_School Non_Harris_School     Benjamin_School
4 Non_Magnet_School Non_Bridges_School Non_Harris_School Non_Benjamin_School
5     Magnet_School Non_Bridges_School Non_Harris_School Non_Benjamin_School
6 Non_Magnet_School Non_Bridges_School Non_Harris_School Non_Benjamin_School
      region_patton     region_simpson     region_robinson     region_raymond
1 Non_Patton_School Non_Simpson_School Non_Robinson_School Non_Raymond_School
2 Non_Patton_School     Simpson_School Non_Robinson_School Non_Raymond_School
3 Non_Patton_School Non_Simpson_School Non_Robinson_School Non_Raymond_School
4 Non_Patton_School     Simpson_School Non_Robinson_School Non_Raymond_School
5 Non_Patton_School Non_Simpson_School     Robinson_School Non_Raymond_School
6 Non_Patton_School     Simpson_School Non_Robinson_School Non_Raymond_School
             ready2read  met_half_Ready2Read_goal  met_all_Ready2Read_goal
1 Non_Ready2Read_School NMet_Half_Ready2Read_Goal NMet_All_Ready2Read_Goal
2     Ready2Read_School  Met_Half_Ready2Read_Goal  Met_All_Ready2Read_Goal
3     Ready2Read_School NMet_Half_Ready2Read_Goal NMet_All_Ready2Read_Goal
4     Ready2Read_School  Met_Half_Ready2Read_Goal NMet_All_Ready2Read_Goal
5 Non_Ready2Read_School NMet_Half_Ready2Read_Goal NMet_All_Ready2Read_Goal
6 Non_Ready2Read_School NMet_Half_Ready2Read_Goal NMet_All_Ready2Read_Goal
          test_type test_score linear.predictors    lower    upper
1 pre_fluency_score          0          96.24237 95.57537 96.90937
2 pre_fluency_score          0          96.24237 95.57537 96.90937
3 pre_fluency_score          0          96.24237 95.57537 96.90937
4 pre_fluency_score          0          96.24237 95.57537 96.90937
5 pre_fluency_score          0          96.24237 95.57537 96.90937
6 pre_fluency_score          0          96.24237 95.57537 96.90937

I can see in the outcomes that my_prediction data file has 3 new columns. The first among the newly added is linear.predictors, while the last two are lower and upper bound values of 95% confidence interval.

Going through all these CI values will be fruitless. We can’t figure out what is going on. I just have to capture the essence not the individual statistics. For that reason, I am going to plot the CI, the following way:

pred <- position_dodge(.1)
limits <- aes(ymax = upper, ymin = lower, shape = ready2read)
group_CI <- ggplot(my_prediction, aes(y = linear.predictors, x = test_type))+
  geom_errorbar(limits, width = 0.1, position = pred)+
  geom_line(aes(group = ready2read, y = linear.predictors), position = pred)+
  geom_point(position = pred, aes (fill = ready2read), shape = 21, size = 3)+
  scale_fill_manual(values = c("black", "white"))+
  theme_bw()+
  theme(panel.grid.major = element_blank(),
legend.title = element_blank(),
legend.position = "bottom")+
  xlab("Students' Ready2Read Status")+
  ylab("Estimated Mean and 95% Confidence Interval")
group_CI

We can see two different baseline mean and two value-added mean. In most of the analyses, we tend to think that the comparison is made only when we have a common starting point. For that reason, I am going to show single baseline score and how they change overtime for the group. To achieve this goal, I have to recreate a new group r2r_modified, and recreate the plot.

# Creating new group
my_prediction$r2r_modified <- factor(my_prediction$ready2read, levels = c("All", "Non_Ready2Read_School", "Ready2Read_School"))
my_prediction$r2r_modified[my_prediction$test_type == "pre_fluency_score"] <- "All"

# Setting the limit for the Plot
pred <- position_dodge(.07)
limits <- aes(ymax = upper, ymin = lower, shape = r2r_modified)

# Creating the Plot
group_CI2 <- ggplot(my_prediction, aes(y = linear.predictors, x = test_type))+
  geom_errorbar(limits, width = 0.08, position = pred)+
  geom_line(aes(group = ready2read, y = linear.predictors), position = pred)+
  geom_point(position = pred, aes (fill = r2r_modified), shape = 21, size = 3)+
  scale_fill_manual(values = c("grey","black","white"))+
  theme_bw()+
  theme(panel.grid.major = element_blank(),
legend.title = element_blank(),
legend.position = "bottom")+
  xlab("Students' Ready2Read Status")+
  ylab("Estimated Mean and 95% Confidence Interval")
group_CI2

Amazing!! Did it!!