This package is amazing. Knowledge of this package significantly simplifies the process of getting some key information during data analysis.
purr is a package within tidyverse universe. We don’t have to load “purr” to be able to use it because loading tidyverse would do it. However, for the purpose of this demonstration, I am going to load it.
#Loading Required Libraries
library(tidyverse)
library(purrr)
library(dplyr)
library(kableExtra)
library(ggplot2)
library(cowplot)
# Creating a New Dataset
m_data <- list(
c(1,2,5,9),
c(3,5,7,3),
c(2,5,6,7)
)
# Checking the datafile
m_data
[[1]]
[1] 1 2 5 9
[[2]]
[1] 3 5 7 3
[[3]]
[1] 2 5 6 7
# Checking the First Set of Files
m_data[[1]]
[1] 1 2 5 9
# Checking the Third Set of Files
m_data[[3]]
[1] 2 5 6 7
#First Set Mean
m_data[[1]] %>% mean()
[1] 4.25
#Or non-professional way
mean(m_data[[1]])
[1] 4.25
#Second Set Mean
m_data[[2]] %>% mean()
[1] 4.5
#Third Set Mean
m_data[[3]] %>% mean()
[1] 5
We here pass the map function. It makes our task much easier. One map syntax and the job is done.
# Finding the Mean of Each Element
res<- map(m_data, mean)
res
[[1]]
[1] 4.25
[[2]]
[1] 4.5
[[3]]
[1] 5
# Checking the Class of Vector res
class(res)
[1] "list"
# OR WE CAN SIMPLY USE THE PIPE FUNCTION
res %>% map(mean)
[[1]]
[1] 4.25
[[2]]
[1] 4.5
[[3]]
[1] 5
map returns a list of numbers (mean of each variable in this case). map_at on the other hand transforms their input by applying a function to each element and returning a vector the same length as the input. For example
m_data %>% map_dbl(mean)
[1] 4.25 4.50 5.00
I want to create a local function and want to pass it over all the elements of the vectors above.
m_data %>% map(~ . + 8)
[[1]]
[1] 9 10 13 17
[[2]]
[1] 11 13 15 11
[[3]]
[1] 10 13 14 15
map_dbl wants to return a single vector, so, we don’t get the consistent answer if we pass it over map_dbl.
# m_data %>% map_dbl(~ . + 8)
I get the error message: Error: Result 1 must be a single double, not a double vector of length 4 Backtrace: x 1…
I want to check if the values in the m_data are numeric:
m_data %>% map(is.numeric)
[[1]]
[1] TRUE
[[2]]
[1] TRUE
[[3]]
[1] TRUE
Yes. All of them are numeric vectors. We can check them the other way as well. map_lgl is a logical function that we can pass on our dataset. Or we can also use map_chr if we want the TRUE, FALSE as objects/characters
m_data %>% map_lgl(is.numeric)
[1] TRUE TRUE TRUE
m_data %>% map_chr(is.numeric)
[1] "TRUE" "TRUE" "TRUE"
A function within a function
func_1 <- function(x){
if (x > 25){
# stop("x is bigger than I can handle!!")#produce an error message
return("x is bigger than I can handle!!")#return the message
}else if(x > 20){
return (x * 3)#multiply the number by 3
}
return(x * 5)#otherwise multiply the number by 5
}
# Now, let's check how it works
# a. Let's make x bigger than 20
func_1(21)# 21 * 3 = 63
[1] 63
# b. Let's make x smaller than 20
func_1(10)
[1] 50
# c. Let's make x bigger than 25
func_1(26)
[1] "x is bigger than I can handle!!"
I wanted to conduct a full ANCOVA Analysis of my data in R. This is my first time doing so. I have my data in two different datafiles. Both dataset have a common “Student ID” variable. I am goint to use it to combine the files. One of the data files is saved as .xlsx file type, while the other is in .csv format. I need slightly different syntax to upload them. Let’s upload them and check the summary of the data.
# Loading Required Libraries
library(readxl)
library(tidyr)
library(pander)
# Uploading the data in R
data1 <- read_excel("data_warehouse_pull.xlsx")
summary(data1)
student_id school_id grade frpl
Min. :812011 Min. :350.0 Min. :2.000 Min. :0.0000
1st Qu.:816066 1st Qu.:357.0 1st Qu.:2.000 1st Qu.:0.0000
Median :820121 Median :364.0 Median :3.000 Median :0.0000
Mean :820121 Mean :365.1 Mean :3.481 Mean :0.3881
3rd Qu.:824176 3rd Qu.:373.0 3rd Qu.:4.000 3rd Qu.:1.0000
Max. :828231 Max. :381.0 Max. :5.000 Max. :1.0000
male white black asian
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
Median :1.0000 Median :0.0000 Median :0.0000 Median :0.00000
Mean :0.5108 Mean :0.4805 Mean :0.2506 Mean :0.03976
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
NA's :149 NA's :149 NA's :149
hispanic multiracial personalized_learning title_1
Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.00000 Median :1.0000 Median :1.0000
Mean :0.1826 Mean :0.04424 Mean :0.5325 Mean :0.5762
3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
NA's :149 NA's :149
magnet region_bridges region_harris region_benjamin
Min. :0.0000 Min. :0.000 Min. :0.00000 Min. :0.00000
1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.00000
Median :0.0000 Median :0.000 Median :0.00000 Median :0.00000
Mean :0.1809 Mean :0.164 Mean :0.03643 Mean :0.09032
3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :1.0000 Max. :1.000 Max. :1.00000 Max. :1.00000
region_patton region_simpson region_robinson region_raymond
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.1632 Mean :0.1707 Mean :0.1181 Mean :0.2573
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
data2 <- read.csv("ready2readprograminfo.csv")
summary(data2)
student_id ready2read met_half_Ready2Read_goal
Min. :812011 Min. :0.0000 Min. :0.0000
1st Qu.:816066 1st Qu.:0.0000 1st Qu.:0.0000
Median :820121 Median :1.0000 Median :0.0000
Mean :820121 Mean :0.5228 Mean :0.3578
3rd Qu.:824176 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :828231 Max. :1.0000 Max. :1.0000
met_all_Ready2Read_goal pre_fluency_score post_fluency_score
Min. :0.0000 Min. : 0.00 Min. : 0.0
1st Qu.:0.0000 1st Qu.: 66.00 1st Qu.: 99.0
Median :0.0000 Median : 95.00 Median :128.0
Mean :0.2229 Mean : 96.24 Mean :127.7
3rd Qu.:0.0000 3rd Qu.:125.00 3rd Qu.:157.0
Max. :1.0000 Max. :237.00 Max. :299.0
NA's :1580 NA's :2065
Based on the summary data1 has 20 variables and data2 has 6 variables. The first data file has contains the information about students, school and the regions the schools are located. The second data file on the other hand, has information about students’ participation on Ready2Read (R2R)programs, whether or not they met the half-way-through and all goals, and their pre- and post- fluency scores.
We don’t see that many missing cases in the first dataset but there are good amount of cases in the second dataset. If you want to know how I deal with them please follow this link: https://rpubs.com/nirmal/791463
Assuming that you went through the link, I am directly coming up with the final dataset “final_data”. The following steps were taken to have this final dataset produced:
| student_id | school_id | grade | frpl | male | white | black | asian |
|---|---|---|---|---|---|---|---|
| 812011 | 365 | 3 | 1 | 1 | 0 | 0 | 0 |
| 812012 | 362 | 2 | 0 | 0 | 0 | 0 | 0 |
| 812013 | 359 | 2 | 1 | 1 | 0 | 0 | 0 |
| 812014 | 360 | 3 | 1 | 0 | 1 | 0 | 0 |
| 812015 | 366 | 2 | 1 | 1 | 0 | 1 | 0 |
| 812016 | 350 | 3 | 1 | 1 | 0 | 0 | 0 |
| hispanic | multiracial | personalized_learning | title_1 | magnet |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 1 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 1 | 0 |
| region_bridges | region_harris | region_benjamin | region_patton |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| region_simpson | region_robinson | region_raymond | ready2read |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 |
| 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 |
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| met_half_Ready2Read_goal | met_all_Ready2Read_goal | pre_fluency_score |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 0 |
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 0 | 0 | 0 |
| 0 | 0 | 0 |
| post_fluency_score |
|---|
| 0 |
| 117 |
| 13 |
| 9 |
| 5 |
| 4 |
| SN | Variable | Description | Type |
|---|---|---|---|
| 1 | student_id | Unique student ID | student_variable |
| 2 | school_id | School ID | school_variable |
| 3 | grade_level | Student’s grade level | student_variable |
| 4 | frpl | Indicator for free and reduced price lunch, where FRPL = 1 and non-FRPL = 0 | student_variable |
| 5 | male | Indicator for sex, where male = 1 and female = 0 | student_variable |
| 6 | excep | Indicator for students with exceptionalities, where excep = 1 and non-excep = 0 | student_variable |
| 7 | el | Indicator for English Learner, where EL = 1 and non-EL = 0 | student_variable |
| 8 | tag | Indicator for talented and gifted students, where TAG = 1 and non-TAG = 0 | student_variable |
| 9 | white | Race/ethnicity indicator, where white=1 and non-white=0 | student_variable |
| 10 | black | Race/ethnicity indicator, where Black=1 and non-Black=0 | student_variable |
| 11 | asian | Race/ethnicity indicator, where Asian=1 and non-Asian=0 | student_variable |
| 12 | hispanic | Race/ethnicity indicator, where Hispanic=1 and non-Hispanic=0 | student_variable |
| 13 | multiracial | Race/ethnicity indicator, where multiracial=1 and non-multiracial=0 | student_variable |
| 14 | personalized_learning | Student attends a personalized learning school (attends = 1; does not attend = 0) | school_variable |
| 15 | title_1 | Student attends a Title I school (attends = 1; does not attend = 0) | school_variable |
| 16 | magnet | Student attends a magnet school (attends = 1; does not attend = 0) | school_variable |
| 17 | ready2read | Indicates student received the Ready2Read program | outcome_variable |
| 18 | met_half_Ready2Read_goal | Student completed at least half of goals (met half = 1; did not meet half = 0) | outcome_variable |
| 19 | met_all_Ready2Read _goal | Student completed all Ready2Read goals (met all = 1; did not meet all = 0) | outcome_variable |
| 20 | pre_fluency_score | Beginning of year fluency assessment score | outcome_variable |
| 21 | post_fluency_score | End-of-year fluency assessment score | outcome_variable |
| 22 | region_bridges | Indicates student attends school in the Bridges Region | region_variable |
| 23 | region_harris | Indicates student attends school in the Harris Region | region_variable |
| 24 | region_benjamin | Indicates student attends school in Benjamin Region | region_variable |
| 25 | region_patton | Indicates student attends school in Patton Region | region_variable |
| 26 | region_simpson | Indicates student attends school in Simpson Region | region_variable |
| 27 | region_robinson | Indicates student attends school in Robinson Region | region_variable |
| 28 | region_raymond | Indicates student attends school in Raymond Region | region_variable |
# FRPL
final_data$frpl <- ifelse(test = final_data$frpl == 1, yes = "Low_SES_Students" , no = "High_SES_Students")
final_data$frpl <- as.factor(final_data$frpl)
# Gender
final_data$male <- ifelse(test = final_data$male == 1, yes = "Male" , no = "Female")
final_data$male <- as.factor(final_data$male)
# Ethnicity: White
final_data$white <- ifelse(test = final_data$white == 1, yes = "White_Students" , no = "Non_White_Students")
final_data$white <- as.factor(final_data$white)
# Ethnicity: Black
final_data$black <- ifelse(test = final_data$black == 1, yes = "Black_Students" , no = "Non_Black_Students")
final_data$black <- as.factor(final_data$black)
# Ethnicity: Asian
final_data$asian <- ifelse(test = final_data$asian == 1, yes = "Asian_Students" , no = "Non_Asian_Students")
final_data$asian <- as.factor(final_data$asian)
# Ethnicity: Hispanic
final_data$hispanic <- ifelse(test = final_data$hispanic == 1, yes = "Hispanic_Students" , no = "Non_Hispanic_Students")
final_data$hispanic <- as.factor(final_data$hispanic)
# Ethnicity: Multiracial
final_data$multiracial <- ifelse(test = final_data$multiracial == 1, yes = "Multiracial_Students" , no = "Non_Multiracial_Students")
final_data$multiracial <- as.factor(final_data$multiracial)
# Is School Providing Personalized Learning?
final_data$personalized_learning <- ifelse(test = final_data$personalized_learning == 1, yes = "Personalized_Learning_School" , no = "Non_Personalized_School")
final_data$personalized_learning <- as.factor(final_data$personalized_learning)
# Title One School Status
final_data$title_1 <- ifelse(test = final_data$title_1 == 1, yes = "Title_1_School" , no = "Non_Title_1_School")
final_data$title_1 <- as.factor(final_data$title_1)
# Magnet School Status
final_data$magnet <- ifelse(test = final_data$magnet == 1, yes = "Magnet_School" , no = "Non_Magnet_School")
final_data$magnet <- as.factor(final_data$magnet)
# Lies in Bridges Region
final_data$region_bridges <- ifelse(test = final_data$region_bridges == 1, yes = "Bridges_School" , no = "Non_Bridges_School")
final_data$region_bridges <- as.factor(final_data$region_bridges)
# Lies in Harris Region
final_data$region_harris <- ifelse(test = final_data$region_harris == 1, yes = "Harris_School" , no = "Non_Harris_School")
final_data$region_harris <- as.factor(final_data$region_harris)
# Lies in Benjamin Region
final_data$region_benjamin <- ifelse(test = final_data$region_benjamin == 1, yes = "Benjamin_School" , no = "Non_Benjamin_School")
final_data$region_benjamin <- as.factor(final_data$region_benjamin)
# Lies in Patton Region
final_data$region_patton <- ifelse(test = final_data$region_patton == 1, yes = "Patton_School" , no = "Non_Patton_School")
final_data$region_patton <- as.factor(final_data$region_patton)
# Lies in Simpson Region
final_data$region_simpson <- ifelse(test = final_data$region_simpson == 1, yes = "Simpson_School" , no = "Non_Simpson_School")
final_data$region_simpson <- as.factor(final_data$region_simpson)
# Lies in Robinson Region
final_data$region_robinson <- ifelse(test = final_data$region_robinson == 1, yes = "Robinson_School" , no = "Non_Robinson_School")
final_data$region_robinson <- as.factor(final_data$region_robinson)
# Lies in Raymond Region
final_data$region_raymond <- ifelse(test = final_data$region_raymond == 1, yes = "Raymond_School" , no = "Non_Raymond_School")
final_data$region_raymond <- as.factor(final_data$region_raymond)
# Did the Student Participate in Ready2Read Program?
final_data$ready2read <- ifelse(test = final_data$ready2read == 1, yes = "Ready2Read_School" , no = "Non_Ready2Read_School")
final_data$ready2read <- as.factor(final_data$ready2read)
# Did the Student Met Half Ready2Read Goal?
final_data$met_half_Ready2Read_goal <- ifelse(test = final_data$met_half_Ready2Read_goal == 1, yes = "Met_Half_Ready2Read_Goal" , no = "NMet_Half_Ready2Read_Goal")
final_data$met_half_Ready2Read_goal <- as.factor(final_data$met_half_Ready2Read_goal)
# Did the Student Met All Ready2Read Goal?
final_data$met_all_Ready2Read_goal <- ifelse(test = final_data$met_all_Ready2Read_goal == 1, yes = "Met_All_Ready2Read_Goal" , no = "NMet_All_Ready2Read_Goal")
final_data$met_all_Ready2Read_goal <- as.factor(final_data$met_all_Ready2Read_goal)
# Changing Grade Level to Factor Variable
final_data$grade <- as.factor(final_data$grade)
# Checking the Glimpse of the Modified Dataset
glimpse(final_data)
Rows: 14,845
Columns: 25
$ student_id <dbl> 812011, 812012, 812013, 812014, 812015, 81201~
$ school_id <dbl> 365, 362, 359, 360, 366, 350, 379, 366, 361, ~
$ grade <fct> 3, 2, 2, 3, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 3, ~
$ frpl <fct> Low_SES_Students, High_SES_Students, Low_SES_~
$ male <fct> Male, Female, Male, Female, Male, Male, Femal~
$ white <fct> Non_White_Students, Non_White_Students, Non_W~
$ black <fct> Non_Black_Students, Non_Black_Students, Non_B~
$ asian <fct> Non_Asian_Students, Non_Asian_Students, Non_A~
$ hispanic <fct> Non_Hispanic_Students, Hispanic_Students, His~
$ multiracial <fct> Multiracial_Students, Non_Multiracial_Student~
$ personalized_learning <fct> Non_Personalized_School, Non_Personalized_Sch~
$ title_1 <fct> Non_Title_1_School, Non_Title_1_School, Title~
$ magnet <fct> Non_Magnet_School, Non_Magnet_School, Magnet_~
$ region_bridges <fct> Non_Bridges_School, Non_Bridges_School, Non_B~
$ region_harris <fct> Non_Harris_School, Non_Harris_School, Non_Har~
$ region_benjamin <fct> Benjamin_School, Non_Benjamin_School, Benjami~
$ region_patton <fct> Non_Patton_School, Non_Patton_School, Non_Pat~
$ region_simpson <fct> Non_Simpson_School, Simpson_School, Non_Simps~
$ region_robinson <fct> Non_Robinson_School, Non_Robinson_School, Non~
$ region_raymond <fct> Non_Raymond_School, Non_Raymond_School, Non_R~
$ ready2read <fct> Non_Ready2Read_School, Ready2Read_School, Rea~
$ met_half_Ready2Read_goal <fct> NMet_Half_Ready2Read_Goal, Met_Half_Ready2Rea~
$ met_all_Ready2Read_goal <fct> NMet_All_Ready2Read_Goal, Met_All_Ready2Read_~
$ pre_fluency_score <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
$ post_fluency_score <dbl> 0.0, 117.0, 13.0, 9.0, 5.0, 4.0, 19.0, 5.0, 2~
Looking at the dataset, we can see that I am provided with pre- and post- fluency scores. Before I move any further, I want to see how the pre_fluency_scores correlate with post_fluency_scores. I am going to plot them and compare:
my_color = c("#1A425C", "#8AB63F")
ggplot(final_data, aes(x = pre_fluency_score, y = post_fluency_score, col = frpl))+
geom_point()+
geom_smooth(method = "lm", se = FALSE)+
scale_color_manual(values = my_color)+
theme_bw()+
xlab("Students' Raw Pre Fluency Score") +
ylab("Students' Raw Post Fluency Score")
It is not mandatory for both pre- & post- test scores to be in the same column. But, as outcome variables some analysts want to do so. It is some time referred as wide to long format. Dataset with many variables can complicate the process of transferring wide data to long data. I am going to at least try. I don’t want my “final_data” to be messed up, so, I will create a new vector called l_data and take necessary steps. We don’t need to load any other library for this, tidyverse would be enough.
What am I going to do?
l_data <- gather(data = final_data, key = test_type, value = test_score,
pre_fluency_score : post_fluency_score, factor_key = TRUE)
pander(head(arrange(l_data, student_id), 10))
| student_id | school_id | grade | frpl | male |
|---|---|---|---|---|
| 812011 | 365 | 3 | Low_SES_Students | Male |
| 812011 | 365 | 3 | Low_SES_Students | Male |
| 812012 | 362 | 2 | High_SES_Students | Female |
| 812012 | 362 | 2 | High_SES_Students | Female |
| 812013 | 359 | 2 | Low_SES_Students | Male |
| 812013 | 359 | 2 | Low_SES_Students | Male |
| 812014 | 360 | 3 | Low_SES_Students | Female |
| 812014 | 360 | 3 | Low_SES_Students | Female |
| 812015 | 366 | 2 | Low_SES_Students | Male |
| 812015 | 366 | 2 | Low_SES_Students | Male |
| white | black | asian |
|---|---|---|
| Non_White_Students | Non_Black_Students | Non_Asian_Students |
| Non_White_Students | Non_Black_Students | Non_Asian_Students |
| Non_White_Students | Non_Black_Students | Non_Asian_Students |
| Non_White_Students | Non_Black_Students | Non_Asian_Students |
| Non_White_Students | Non_Black_Students | Non_Asian_Students |
| Non_White_Students | Non_Black_Students | Non_Asian_Students |
| White_Students | Non_Black_Students | Non_Asian_Students |
| White_Students | Non_Black_Students | Non_Asian_Students |
| Non_White_Students | Black_Students | Non_Asian_Students |
| Non_White_Students | Black_Students | Non_Asian_Students |
| hispanic | multiracial |
|---|---|
| Non_Hispanic_Students | Multiracial_Students |
| Non_Hispanic_Students | Multiracial_Students |
| Hispanic_Students | Non_Multiracial_Students |
| Hispanic_Students | Non_Multiracial_Students |
| Hispanic_Students | Non_Multiracial_Students |
| Hispanic_Students | Non_Multiracial_Students |
| Non_Hispanic_Students | Non_Multiracial_Students |
| Non_Hispanic_Students | Non_Multiracial_Students |
| Non_Hispanic_Students | Non_Multiracial_Students |
| Non_Hispanic_Students | Non_Multiracial_Students |
| personalized_learning | title_1 | magnet |
|---|---|---|
| Non_Personalized_School | Non_Title_1_School | Non_Magnet_School |
| Non_Personalized_School | Non_Title_1_School | Non_Magnet_School |
| Non_Personalized_School | Non_Title_1_School | Non_Magnet_School |
| Non_Personalized_School | Non_Title_1_School | Non_Magnet_School |
| Non_Personalized_School | Title_1_School | Magnet_School |
| Non_Personalized_School | Title_1_School | Magnet_School |
| Personalized_Learning_School | Title_1_School | Non_Magnet_School |
| Personalized_Learning_School | Title_1_School | Non_Magnet_School |
| Non_Personalized_School | Non_Title_1_School | Magnet_School |
| Non_Personalized_School | Non_Title_1_School | Magnet_School |
| region_bridges | region_harris | region_benjamin |
|---|---|---|
| Non_Bridges_School | Non_Harris_School | Benjamin_School |
| Non_Bridges_School | Non_Harris_School | Benjamin_School |
| Non_Bridges_School | Non_Harris_School | Non_Benjamin_School |
| Non_Bridges_School | Non_Harris_School | Non_Benjamin_School |
| Non_Bridges_School | Non_Harris_School | Benjamin_School |
| Non_Bridges_School | Non_Harris_School | Benjamin_School |
| Non_Bridges_School | Non_Harris_School | Non_Benjamin_School |
| Non_Bridges_School | Non_Harris_School | Non_Benjamin_School |
| Non_Bridges_School | Non_Harris_School | Non_Benjamin_School |
| Non_Bridges_School | Non_Harris_School | Non_Benjamin_School |
| region_patton | region_simpson | region_robinson |
|---|---|---|
| Non_Patton_School | Non_Simpson_School | Non_Robinson_School |
| Non_Patton_School | Non_Simpson_School | Non_Robinson_School |
| Non_Patton_School | Simpson_School | Non_Robinson_School |
| Non_Patton_School | Simpson_School | Non_Robinson_School |
| Non_Patton_School | Non_Simpson_School | Non_Robinson_School |
| Non_Patton_School | Non_Simpson_School | Non_Robinson_School |
| Non_Patton_School | Simpson_School | Non_Robinson_School |
| Non_Patton_School | Simpson_School | Non_Robinson_School |
| Non_Patton_School | Non_Simpson_School | Robinson_School |
| Non_Patton_School | Non_Simpson_School | Robinson_School |
| region_raymond | ready2read | met_half_Ready2Read_goal |
|---|---|---|
| Non_Raymond_School | Non_Ready2Read_School | NMet_Half_Ready2Read_Goal |
| Non_Raymond_School | Non_Ready2Read_School | NMet_Half_Ready2Read_Goal |
| Non_Raymond_School | Ready2Read_School | Met_Half_Ready2Read_Goal |
| Non_Raymond_School | Ready2Read_School | Met_Half_Ready2Read_Goal |
| Non_Raymond_School | Ready2Read_School | NMet_Half_Ready2Read_Goal |
| Non_Raymond_School | Ready2Read_School | NMet_Half_Ready2Read_Goal |
| Non_Raymond_School | Ready2Read_School | Met_Half_Ready2Read_Goal |
| Non_Raymond_School | Ready2Read_School | Met_Half_Ready2Read_Goal |
| Non_Raymond_School | Non_Ready2Read_School | NMet_Half_Ready2Read_Goal |
| Non_Raymond_School | Non_Ready2Read_School | NMet_Half_Ready2Read_Goal |
| met_all_Ready2Read_goal | test_type | test_score |
|---|---|---|
| NMet_All_Ready2Read_Goal | pre_fluency_score | 0 |
| NMet_All_Ready2Read_Goal | post_fluency_score | 0 |
| Met_All_Ready2Read_Goal | pre_fluency_score | 0 |
| Met_All_Ready2Read_Goal | post_fluency_score | 117 |
| NMet_All_Ready2Read_Goal | pre_fluency_score | 0 |
| NMet_All_Ready2Read_Goal | post_fluency_score | 13 |
| NMet_All_Ready2Read_Goal | pre_fluency_score | 0 |
| NMet_All_Ready2Read_Goal | post_fluency_score | 9 |
| NMet_All_Ready2Read_Goal | pre_fluency_score | 0 |
| NMet_All_Ready2Read_Goal | post_fluency_score | 5 |
Based on the outcome, my plan worked. Now, I have two rows of data per student, one for pre_fluency_score, and the next for the post_fluency_score. One of benefits of such data format is, we can do many different types of activities like creating box plots or conducting long term trend analysis. I can also plot confidence interval in very easy to understand fashion. Now, lets create a box plot of pre- & post- fluency scores for schools by their magnet status.
magnet_box <- ggplot(l_data, aes(x = test_type, y = test_score, col = magnet))+
geom_boxplot(outlier.size = 0)+
geom_point(aes(fill = magnet), shape = 19, alpha = 0.5, position = position_jitterdodge(jitter.width = 0.2))+
# scale_fill_manual(values = my_color)+
# scale_color_manual(values = my_color)+
# theme_bw()+
xlab("Students' Pre- & Post- Fluency Scores by Magnet Status")
magnet_box
The diagram depicts the clear picture of the pre- & post- fluency scores among students in Magnet and non-magnet schools. Students in non-magnet schools seem to have higher pre_fluency_scores, but we don’t see any differences in mean post_fluency_scores. The post_fluency_scores among the students in magnet schools are spread higher than others in general.
Let’s check if there is any difference in students’ pre- & post- fluency scores based on their grades and personalized learning status. I am going to use a box plot to do so.
grade_box <- ggplot(l_data, aes(x = test_type, y = test_score, col = grade))+
geom_boxplot(outlier.size = 0)+
geom_point(aes(fill = personalized_learning), shape = 19, alpha = 0.5, position = position_jitterdodge(jitter.width = 0.2))+
xlab("Students' Pre- & Post- Fluency Scores by Grades")
grade_box
Representation of variation in the results of schools that provide personalized and non-personalized leanings are not that clear in the diagram above but we have clear picture of the variation in students pre X pre, & pre X post fluency scores by grades. Second graders had the lowest mean pre- and post- fluency scores and that increased by grades. It makes complete sense because language develops in parallel with age.
I wanted to see the differences in as many variables as possible. Thus, so far, I used schools’ magnet status, students’ socioeconomic status, and grades.
As I move towards something called constrained Longitudinal Data Analysis (cLDA), I want to stick with one variable, because my intention is to show the baseline measurement and how it changed overtime. I am thinking that pre_fluency_score and post_fluency_score tell the story of two different time intervals and I assume there are differences among students based on certain characteristics. My variable of choice would be ‘ready2read’. I want to see the rate of change in scores among the students who were part of R2R program and who were not.
This analysis can be conduct using the nlme package available to R users. As my idea is to see the change in pre- and post- fluency scores for the students who participated in R2R program and who did not.
My fixed part would be the test_type and test_type X R2R status. One of the rule of thumb in regression modeling is to include the baseline/lower ranked effects while modeling for an interaction in the model. If I run a regression model without doing anything, R will regard test_type and R2R status as main effects. However, I want interaction to be the main effect thus, I will first create an alternative model matrix named int_main. Here’s how we can do it:
# creating int matrix
int <- model.matrix(~ test_type * ready2read, data = l_data)
# checking column names
colnames(int)
[1] "(Intercept)"
[2] "test_typepost_fluency_score"
[3] "ready2readReady2Read_School"
[4] "test_typepost_fluency_score:ready2readReady2Read_School"
# creating int_main matrix by selecting only desired columns from int
int_main <- int[, c("test_typepost_fluency_score", "test_typepost_fluency_score:ready2readReady2Read_School")]
# checking column names
colnames(int_main)
[1] "test_typepost_fluency_score"
[2] "test_typepost_fluency_score:ready2readReady2Read_School"
The random parts, the standard deviations, and correlations constantly vary and they are defined as varldent() in R. For example, in this analysis, I expect differences in standard deviation for students who participated in R2R program and who did not. Likewise, I expect different correlation coefficient between the students who participated in the program and who completed the post_fluency_test and who did not.
To capture the essence of this analysis, I am going to define **weights = varldent(form ~ 1|test_type:ready2read), we get a distinct statistics estimated for test_type by R2R status (For example: pre_fluency X participated, pre_fluency X non_participated; post_fluency X participated & post_fluency X participated).
On the other hand, the correlation syntax correlation = corSymm(form = ~ 1 | student_id) defines the participants. We are bound to check the correlation of pre_fluency_scores to post_fluency_score of the same student. In this instance, we can’t correlate the scores of two different students because they can be inherently different.
Let’s run the analysis:
# Loading the required library
library(nlme)
# creating my_anl vector and saving the results
my_anl <- gls(test_score ~ int_main,
weights = varIdent(form = ~ 1 | test_type),
correlation = corSymm(form = ~ 1 | student_id),
data = l_data)
summary(my_anl)
Generalized least squares fit by REML
Model: test_score ~ int_main
Data: l_data
AIC BIC logLik
283929.4 283979.2 -141958.7
Correlation Structure: General
Formula: ~1 | student_id
Parameter estimate(s):
Correlation:
1
2 0.875
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | test_type
Parameter estimates:
pre_fluency_score post_fluency_score
1.000000 1.000825
Coefficients:
Value
(Intercept) 96.24237
int_maintest_typepost_fluency_score 30.27677
int_maintest_typepost_fluency_score:ready2readReady2Read_School 2.25163
Std.Error
(Intercept) 0.3403112
int_maintest_typepost_fluency_score 0.2425019
int_maintest_typepost_fluency_score:ready2readReady2Read_School 0.3300808
t-value
(Intercept) 282.80695
int_maintest_typepost_fluency_score 124.85167
int_maintest_typepost_fluency_score:ready2readReady2Read_School 6.82144
p-value
(Intercept) 0
int_maintest_typepost_fluency_score 0
int_maintest_typepost_fluency_score:ready2readReady2Read_School 0
Correlation:
(Intr) in____
int_maintest_typepost_fluency_score -0.174
int_maintest_typepost_fluency_score:ready2readReady2Read_School 0.000 -0.712
Standardized residuals:
Min Q1 Med Q3 Max
-3.103079e+00 -6.811375e-01 -5.712186e-05 6.802595e-01 4.156391e+00
Residual standard error: 41.46353
Degrees of freedom: 29690 total; 29687 residual
The Major Findings:
We need to report the confidence interval of a regression analysis. I am going to use library rms and use the Gls() function within the package to calculate the confidence interval around the mean.
# Loading the required library
library(rms)
# creating my_anl vector and saving the results
my_anl_2 <- Gls(test_score ~ int_main,
weights = varIdent(form = ~ 1 | test_type),
correlation = corSymm(form = ~ 1 | student_id),
data = l_data)
# Saving the prediction and CI data into my original dataset
my_prediction <- cbind(l_data, predict(my_anl_2, l_data, conf.int = 0.95))
head(my_prediction)
student_id school_id grade frpl male white
1 812011 365 3 Low_SES_Students Male Non_White_Students
2 812012 362 2 High_SES_Students Female Non_White_Students
3 812013 359 2 Low_SES_Students Male Non_White_Students
4 812014 360 3 Low_SES_Students Female White_Students
5 812015 366 2 Low_SES_Students Male Non_White_Students
6 812016 350 3 Low_SES_Students Male Non_White_Students
black asian hispanic
1 Non_Black_Students Non_Asian_Students Non_Hispanic_Students
2 Non_Black_Students Non_Asian_Students Hispanic_Students
3 Non_Black_Students Non_Asian_Students Hispanic_Students
4 Non_Black_Students Non_Asian_Students Non_Hispanic_Students
5 Black_Students Non_Asian_Students Non_Hispanic_Students
6 Non_Black_Students Non_Asian_Students Hispanic_Students
multiracial personalized_learning title_1
1 Multiracial_Students Non_Personalized_School Non_Title_1_School
2 Non_Multiracial_Students Non_Personalized_School Non_Title_1_School
3 Non_Multiracial_Students Non_Personalized_School Title_1_School
4 Non_Multiracial_Students Personalized_Learning_School Title_1_School
5 Non_Multiracial_Students Non_Personalized_School Non_Title_1_School
6 Non_Multiracial_Students Personalized_Learning_School Title_1_School
magnet region_bridges region_harris region_benjamin
1 Non_Magnet_School Non_Bridges_School Non_Harris_School Benjamin_School
2 Non_Magnet_School Non_Bridges_School Non_Harris_School Non_Benjamin_School
3 Magnet_School Non_Bridges_School Non_Harris_School Benjamin_School
4 Non_Magnet_School Non_Bridges_School Non_Harris_School Non_Benjamin_School
5 Magnet_School Non_Bridges_School Non_Harris_School Non_Benjamin_School
6 Non_Magnet_School Non_Bridges_School Non_Harris_School Non_Benjamin_School
region_patton region_simpson region_robinson region_raymond
1 Non_Patton_School Non_Simpson_School Non_Robinson_School Non_Raymond_School
2 Non_Patton_School Simpson_School Non_Robinson_School Non_Raymond_School
3 Non_Patton_School Non_Simpson_School Non_Robinson_School Non_Raymond_School
4 Non_Patton_School Simpson_School Non_Robinson_School Non_Raymond_School
5 Non_Patton_School Non_Simpson_School Robinson_School Non_Raymond_School
6 Non_Patton_School Simpson_School Non_Robinson_School Non_Raymond_School
ready2read met_half_Ready2Read_goal met_all_Ready2Read_goal
1 Non_Ready2Read_School NMet_Half_Ready2Read_Goal NMet_All_Ready2Read_Goal
2 Ready2Read_School Met_Half_Ready2Read_Goal Met_All_Ready2Read_Goal
3 Ready2Read_School NMet_Half_Ready2Read_Goal NMet_All_Ready2Read_Goal
4 Ready2Read_School Met_Half_Ready2Read_Goal NMet_All_Ready2Read_Goal
5 Non_Ready2Read_School NMet_Half_Ready2Read_Goal NMet_All_Ready2Read_Goal
6 Non_Ready2Read_School NMet_Half_Ready2Read_Goal NMet_All_Ready2Read_Goal
test_type test_score linear.predictors lower upper
1 pre_fluency_score 0 96.24237 95.57537 96.90937
2 pre_fluency_score 0 96.24237 95.57537 96.90937
3 pre_fluency_score 0 96.24237 95.57537 96.90937
4 pre_fluency_score 0 96.24237 95.57537 96.90937
5 pre_fluency_score 0 96.24237 95.57537 96.90937
6 pre_fluency_score 0 96.24237 95.57537 96.90937
I can see in the outcomes that my_prediction data file has 3 new columns. The first among the newly added is linear.predictors, while the last two are lower and upper bound values of 95% confidence interval.
Going through all these CI values will be fruitless. We can’t figure out what is going on. I just have to capture the essence not the individual statistics. For that reason, I am going to plot the CI, the following way:
pred <- position_dodge(.1)
limits <- aes(ymax = upper, ymin = lower, shape = ready2read)
group_CI <- ggplot(my_prediction, aes(y = linear.predictors, x = test_type))+
geom_errorbar(limits, width = 0.1, position = pred)+
geom_line(aes(group = ready2read, y = linear.predictors), position = pred)+
geom_point(position = pred, aes (fill = ready2read), shape = 21, size = 3)+
scale_fill_manual(values = c("black", "white"))+
theme_bw()+
theme(panel.grid.major = element_blank(),
legend.title = element_blank(),
legend.position = "bottom")+
xlab("Students' Ready2Read Status")+
ylab("Estimated Mean and 95% Confidence Interval")
group_CI
We can see two different baseline mean and two value-added mean. In most of the analyses, we tend to think that the comparison is made only when we have a common starting point. For that reason, I am going to show single baseline score and how they change overtime for the group. To achieve this goal, I have to recreate a new group r2r_modified, and recreate the plot.
# Creating new group
my_prediction$r2r_modified <- factor(my_prediction$ready2read, levels = c("All", "Non_Ready2Read_School", "Ready2Read_School"))
my_prediction$r2r_modified[my_prediction$test_type == "pre_fluency_score"] <- "All"
# Setting the limit for the Plot
pred <- position_dodge(.07)
limits <- aes(ymax = upper, ymin = lower, shape = r2r_modified)
# Creating the Plot
group_CI2 <- ggplot(my_prediction, aes(y = linear.predictors, x = test_type))+
geom_errorbar(limits, width = 0.08, position = pred)+
geom_line(aes(group = ready2read, y = linear.predictors), position = pred)+
geom_point(position = pred, aes (fill = r2r_modified), shape = 21, size = 3)+
scale_fill_manual(values = c("grey","black","white"))+
theme_bw()+
theme(panel.grid.major = element_blank(),
legend.title = element_blank(),
legend.position = "bottom")+
xlab("Students' Ready2Read Status")+
ylab("Estimated Mean and 95% Confidence Interval")
group_CI2
Amazing!! Did it!!