Staff from the Learning and Teaching (L&T) department are deciding which programs to implement during the upcoming school year. A vocabulary-building software program called Ready2Read was piloted last year (for the full school year) in 16 elementary schools for students in grades 2-5.
The Assistant Superintendent for L&T brought you a copy of the implementation datasets from the Ready2Read pilot and has asked you to evaluate the program’s effectiveness.
Ultimately, she wants to know what the district should do with Ready2Read during the upcoming school year. The data on program participation, student demographics, and teacher feedback come from multiple data sources and are stored in separate data files.
library(kableExtra)#To create Codebook Table
library(generics)#to avoid conflicts among similar functions
library(tidyverse)#To facilitate work flow
library(tidyr)#To manipulate Data Files
library(cowplot)#To put graphs together
library(readxl)#To read the given excel file
library(ggplot2)#To Visualize data
library(stats)#for conducting regression analyses
library(graphics)#for pairwise correlation graphs
Codebook
| Variable | Description |
|---|---|
| student_id | Unique student ID |
| school_id | School ID |
| grade_level | Student’s grade level |
| frpl | Indicator for free and reduced price lunch, where FRPL = 1 and non-FRPL = 0 |
| male | Indicator for sex, where male = 1 and female = 0 |
| excep | Indicator for students with exceptionalities, where excep = 1 and non-excep = 0 |
| el | Indicator for English Learner, where EL = 1 and non-EL = 0 |
| tag | Indicator for talented and gifted students, where TAG = 1 and non-TAG = 0 |
| white | Race/ethnicity indicator, where white=1 and non-white=0 |
| black | Race/ethnicity indicator, where Black=1 and non-Black=0 |
| asian | Race/ethnicity indicator, where Asian=1 and non-Asian=0 |
| hispanic | Race/ethnicity indicator, where Hispanic=1 and non-Hispanic=0 |
| multiracial | Race/ethnicity indicator, where multiracial=1 and non-multiracial=0 |
| personalized_learning | Student attends a personalized learning school (attends = 1; does not attend = 0) |
| title_1 | Student attends a Title I school (attends = 1; does not attend = 0) |
| magnet | Student attends a magnet school (attends = 1; does not attend = 0) |
| ready2read | Indicates student received the Ready2Read program |
| met_half_Ready2Read_goal | Student completed at least half of goals (met half = 1; did not meet half = 0) |
| met_all_Ready2Read_goal | Student completed all Ready2Read goals (met all = 1; did not meet all = 0) |
| pre_fluency_score | Beginning of year fluency assessment score |
| post_fluency_score | End-of-year fluency assessment score |
| region_bridges | Indicates student attends school in the Bridges Region |
| region_harris | Indicates student attends school in the Harris Region |
| region_benjamin | Indicates student attends school in Benjamin Region |
| region_patton | Indicates student attends school in Patton Region |
| region_simpson | Indicates student attends school in Simpson Region |
| region_robinson | Indicates student attends school in Robinson Region |
| region_raymond | Indicates student attends school in Raymond Region |
datawarehouse <- read_excel("data_warehouse_pull.xlsx")
ready2readdata <- read.csv("ready2readprograminfo.csv")
# Checking the data structure by summarizing the files
summary(datawarehouse)
student_id school_id grade frpl
Min. :812011 Min. :350.0 Min. :2.000 Min. :0.0000
1st Qu.:816066 1st Qu.:357.0 1st Qu.:2.000 1st Qu.:0.0000
Median :820121 Median :364.0 Median :3.000 Median :0.0000
Mean :820121 Mean :365.1 Mean :3.481 Mean :0.3881
3rd Qu.:824176 3rd Qu.:373.0 3rd Qu.:4.000 3rd Qu.:1.0000
Max. :828231 Max. :381.0 Max. :5.000 Max. :1.0000
male white black asian
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
Median :1.0000 Median :0.0000 Median :0.0000 Median :0.00000
Mean :0.5108 Mean :0.4805 Mean :0.2506 Mean :0.03976
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
NA's :149 NA's :149 NA's :149
hispanic multiracial personalized_learning title_1
Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.00000 Median :1.0000 Median :1.0000
Mean :0.1826 Mean :0.04424 Mean :0.5325 Mean :0.5762
3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
NA's :149 NA's :149
magnet region_bridges region_harris region_benjamin
Min. :0.0000 Min. :0.000 Min. :0.00000 Min. :0.00000
1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.00000
Median :0.0000 Median :0.000 Median :0.00000 Median :0.00000
Mean :0.1809 Mean :0.164 Mean :0.03643 Mean :0.09032
3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :1.0000 Max. :1.000 Max. :1.00000 Max. :1.00000
region_patton region_simpson region_robinson region_raymond
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.1632 Mean :0.1707 Mean :0.1181 Mean :0.2573
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
The data_warehouse_pull.xlsx file contains students’ demographic and socioeconomic indicators, along with the region each school is located in. As the summary shows, 149 cases are missing in each of the five race/ethnicity columns. If I decide to use these indicators, I may have to drop these cases; they make up less than one percent of the dataset (149/16221 * 100 = 0.92).
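The share of missing values can also be computed per column rather than read off the summary. The sketch below uses a small made-up data frame (`toy` and its values are assumptions for illustration, not the district data) and then reproduces the 149-in-16,221 figure quoted above.

```r
# Toy stand-in for the warehouse pull; the values are invented for illustration.
toy <- data.frame(
  student_id = 1:8,
  white      = c(1, 0, NA, 0, 1, 0, 0, 1),
  black      = c(0, 1, NA, 0, 0, 1, 0, 0),
  frpl       = c(1, 1, 0, 0, 1, 0, 1, 0)
)

# is.na() on a data frame returns a logical matrix, so colMeans() gives
# the proportion of missing values in each column.
missing_pct <- round(colMeans(is.na(toy)) * 100, 2)
print(missing_pct)

# The figure from the text: 149 incomplete rows out of 16,221.
share_149 <- round(149 / 16221 * 100, 2)
print(share_149)
```

Applied to the real data, `colMeans(is.na(datawarehouse))` would give the same per-column view in one call.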
# Checking the ready2readprograminfo.csv file
summary(ready2readdata)
student_id ready2read met_half_Ready2Read_goal
Min. :812011 Min. :0.0000 Min. :0.0000
1st Qu.:816066 1st Qu.:0.0000 1st Qu.:0.0000
Median :820121 Median :1.0000 Median :0.0000
Mean :820121 Mean :0.5228 Mean :0.3578
3rd Qu.:824176 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :828231 Max. :1.0000 Max. :1.0000
met_all_Ready2Read_goal pre_fluency_score post_fluency_score
Min. :0.0000 Min. : 0.00 Min. : 0.0
1st Qu.:0.0000 1st Qu.: 66.00 1st Qu.: 99.0
Median :0.0000 Median : 95.00 Median :128.0
Mean :0.2229 Mean : 96.24 Mean :127.7
3rd Qu.:0.0000 3rd Qu.:125.00 3rd Qu.:157.0
Max. :1.0000 Max. :237.00 Max. :299.0
NA's :1580 NA's :2065
There are five variables unique to this dataset; student_id is the variable common to both datasets.
Missing Cases
The summary above shows 1,580 missing cases in pre_fluency_score, roughly 9.7% of the data. Likewise, the 2,065 missing cases in post_fluency_score make up approximately 12.7%. A quick inspection of the Excel file shows a total of 1,376 cases missing both scores, which is 8.48%. These cases can be deleted because they were never part of the study in the first place and I don’t want them to skew my results one way or the other. I will be better off getting rid of them; however, I have to merge the files first.
Luckily, I have been provided with a clean set of data. A few random spot checks show that the student_ids not only match across the two files but also appear in the same positions. They don’t have to be in the same positions for R to merge the data, but having such clean data removes a lot of potential confusion. Let’s merge the files first:
newdata <- merge(datawarehouse, ready2readdata, by = "student_id")
A glimpse of the result shows that the two files have been merged: we have a total of 16,221 rows and 25 variables. I have saved the merged data as ‘newdata’.
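Spot checks can be backed up with a few programmatic checks around `merge()`. The following is a minimal sketch on two made-up tables (`warehouse` and `program` are hypothetical names and values); it confirms that IDs are unique, that no ID exists in only one file, and that the default inner join keeps every row.

```r
# Two toy tables sharing student_id; the row order differs on purpose.
warehouse <- data.frame(student_id = c(101, 102, 103, 104),
                        grade      = c(2, 3, 4, 5))
program   <- data.frame(student_id = c(104, 103, 102, 101),
                        ready2read = c(1, 0, 1, 0))

# Each ID should appear once per file; duplicates would multiply rows in merge().
ids_unique <- !anyDuplicated(warehouse$student_id) &&
  !anyDuplicated(program$student_id)

# IDs found in only one file would be dropped silently by the default inner join.
only_in_warehouse <- setdiff(warehouse$student_id, program$student_id)
only_in_program   <- setdiff(program$student_id, warehouse$student_id)

# merge() matches on the key, so row order in the inputs does not matter.
merged <- merge(warehouse, program, by = "student_id")
rows_preserved <- nrow(merged) == nrow(warehouse)

print(merged)
```

If either `setdiff()` result were non-empty, passing `all = TRUE` to `merge()` would keep the unmatched rows and fill the missing side with NA.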
#checking the missing data by variables in the new dataset
summary(is.na(newdata))
student_id school_id grade frpl
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:16221 FALSE:16221 FALSE:16221 FALSE:16221
male white black asian
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:16221 FALSE:16072 FALSE:16072 FALSE:16072
TRUE :149 TRUE :149 TRUE :149
hispanic multiracial personalized_learning title_1
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:16072 FALSE:16072 FALSE:16221 FALSE:16221
TRUE :149 TRUE :149
magnet region_bridges region_harris region_benjamin
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:16221 FALSE:16221 FALSE:16221 FALSE:16221
region_patton region_simpson region_robinson region_raymond
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:16221 FALSE:16221 FALSE:16221 FALSE:16221
ready2read met_half_Ready2Read_goal met_all_Ready2Read_goal
Mode :logical Mode :logical Mode :logical
FALSE:16221 FALSE:16221 FALSE:16221
pre_fluency_score post_fluency_score
Mode :logical Mode :logical
FALSE:14641 FALSE:14156
TRUE :1580 TRUE :2065
#What share of all cells in the merged data is missing?
mean(is.na(newdata))
[1] 0.01082547
Roughly 1% of all values are missing if we take everything into account. But if we check just the two score columns discussed above, the share of missing data is higher than acceptable limits. Let’s check:
#Percentage of missing data in 'pre_fluency_score'
mean(is.na(newdata$pre_fluency_score))
[1] 0.0974046
mean(is.na(newdata$post_fluency_score))
[1] 0.1273041
The share is just under 10% for pre_fluency_score and 12.7% for post_fluency_score, which is high in both cases. As mentioned earlier, the students with neither score recorded are to be excluded from this evaluation. However, I don’t want to get rid of the students who took only one of the pre_fluency or post_fluency tests. It is not immediately obvious how to drop the rows with NAs in both variables without losing those students. The easy way would be to merge the files in Excel, use the filter options on both columns to delete these rows, and read the new file into R. However, there is a way to do so in R.
# Changing the missing values in pre and post fluency score columns to 999
newdata$pre_fluency_score[is.na(newdata$pre_fluency_score)] <- 999
newdata$post_fluency_score[is.na(newdata$post_fluency_score)] <- 999
# Checking if there is any remaining NAs in those Columns
#sum(is.na(newdata$pre_fluency_score))
#sum(is.na(newdata$post_fluency_score))
# Creating a new variable 'var'
newdata <- mutate(newdata, var = pre_fluency_score + post_fluency_score)
#Making sure it worked and checking the structure
#summary(newdata)
#Getting Rid of the rows that have values 1998
newdata <- filter(newdata, var < 1998)
#glimpse(newdata)
#Replacing 999 back to NAs
newdata$pre_fluency_score[newdata$pre_fluency_score == 999] <- NA
newdata$post_fluency_score[newdata$post_fluency_score == 999] <- NA
#summary(newdata)
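The same rows can also be dropped without a placeholder value by testing both columns for NA at once. This is a minimal sketch on made-up data (`toy` is an assumption for illustration); it keeps students with at least one score and removes only those with neither.

```r
# Toy data: student 3 has neither score, students 2 and 4 have exactly one.
toy <- data.frame(
  student_id         = 1:5,
  pre_fluency_score  = c(80, NA, NA, 95, 110),
  post_fluency_score = c(120, 130, NA, NA, 150)
)

# TRUE only where BOTH scores are missing.
both_missing <- is.na(toy$pre_fluency_score) & is.na(toy$post_fluency_score)

# Keep every row that has at least one score.
kept <- toy[!both_missing, ]
print(kept)

# dplyr equivalent (assuming dplyr is loaded):
# kept <- filter(toy, !(is.na(pre_fluency_score) & is.na(post_fluency_score)))
```

Because no sentinel is written into the score columns, there is no risk of a placeholder colliding with a real score. The 999 approach is safe here only because the observed scores top out at 299.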
#Percentage of missing data in 'pre_fluency_score'
mean(is.na(newdata$pre_fluency_score))
[1] 0.013742
mean(is.na(newdata$post_fluency_score))
[1] 0.04641293
#The percentages of missing values in these columns are below 5%, an acceptable range. I want to replace the missing values with the column mean, but I don't want the overall mean to change dramatically as a result.
# Let's calculate the means of pre_fluency_score and post_fluency_score
mean(newdata$pre_fluency_score, na.rm = TRUE)
[1] 96.2424
mean(newdata$post_fluency_score, na.rm = TRUE)
[1] 127.6974
# The average pre_fluency_score and post_fluency_scores are 96.24 & 127.70 respectively. I am now going to replace the NA in both columns by the corresponding mean values.
newdata$pre_fluency_score[is.na(newdata$pre_fluency_score)] <- 96.24
newdata$post_fluency_score[is.na(newdata$post_fluency_score)] <- 127.70
# Checking if the overall mean remained same
mean(newdata$pre_fluency_score)
[1] 96.24237
mean(newdata$post_fluency_score)
[1] 127.6975
# The means remain essentially the same. I now want to drop the variable 'var' I created a moment ago.
newdata <- select(newdata, -var)
str(newdata)
'data.frame': 14845 obs. of 25 variables:
$ student_id : num 812011 812012 812013 812014 812015 ...
$ school_id : num 365 362 359 360 366 350 379 366 361 358 ...
$ grade : num 3 2 2 3 2 3 3 2 2 2 ...
$ frpl : num 1 0 1 1 1 1 1 1 1 0 ...
$ male : num 1 0 1 0 1 1 0 1 0 1 ...
$ white : num 0 0 0 1 0 0 0 0 0 0 ...
$ black : num 0 0 0 0 1 0 0 0 0 1 ...
$ asian : num 0 0 0 0 0 0 0 0 0 0 ...
$ hispanic : num 0 1 1 0 0 1 1 1 1 0 ...
$ multiracial : num 1 0 0 0 0 0 0 0 0 0 ...
$ personalized_learning : num 0 0 0 1 0 1 1 0 0 0 ...
$ title_1 : num 0 0 1 1 0 1 1 0 1 1 ...
$ magnet : num 0 0 1 0 1 0 0 1 0 0 ...
$ region_bridges : num 0 0 0 0 0 0 0 0 1 0 ...
$ region_harris : num 0 0 0 0 0 0 0 0 0 0 ...
$ region_benjamin : num 1 0 1 0 0 0 0 0 0 0 ...
$ region_patton : num 0 0 0 0 0 0 0 0 0 1 ...
$ region_simpson : num 0 1 0 1 0 1 1 0 0 0 ...
$ region_robinson : num 0 0 0 0 1 0 0 1 0 0 ...
$ region_raymond : num 0 0 0 0 0 0 0 0 0 0 ...
$ ready2read : int 0 1 1 1 0 0 1 0 0 1 ...
$ met_half_Ready2Read_goal: int 0 1 0 1 0 0 1 0 0 1 ...
$ met_all_Ready2Read_goal : int 0 1 0 0 0 0 0 0 0 1 ...
$ pre_fluency_score : num 0 0 0 0 0 0 0 0 0 0 ...
$ post_fluency_score : num 0 117 13 9 5 4 19 5 2 13 ...
Everything worked so far. Now I have a merged dataset named ‘newdata’. It has 14845 data points and 25 variables.
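One property of the mean-substitution step used earlier is worth keeping in mind: it leaves the column mean untouched but shrinks the spread, because every filled-in value sits exactly at the centre. The sketch below demonstrates this on a made-up vector (`scores` is an assumption for illustration, not the pilot data).

```r
# Toy scores with two missing values.
scores <- c(60, 80, 100, 120, 140, NA, NA)

# Mean and standard deviation of the observed values only.
observed_mean <- mean(scores, na.rm = TRUE)
observed_sd   <- sd(scores, na.rm = TRUE)

# Replace each NA with the observed mean.
imputed <- ifelse(is.na(scores), observed_mean, scores)

imputed_mean <- mean(imputed)
imputed_sd   <- sd(imputed)

cat("mean before/after:", observed_mean, imputed_mean, "\n")
cat("sd   before/after:", round(observed_sd, 2), round(imputed_sd, 2), "\n")
```

With under 5% of values imputed, as in this dataset, the shrinkage is small, but it does make standard errors slightly too optimistic.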
The Assistant Superintendent for L&T wants to know whether the Ready2Read program is worth implementing in other schools, requires some modification, or should simply be shut down. (Note: If I had been part of this program and knew how it was run, I would be able to conduct more rigorous research. I could also use a quasi-experimental design such as propensity score matching if I were provided with data from control schools.)
I am provided with a baseline score (pre_fluency_score) and an end-of-year score (post_fluency_score), which appear to have been taken one year apart. I am not exactly sure what the ‘fluency score’ represents. Ignoring the natural linguistic growth that comes with first language acquisition, and assuming that the critical period ended before these students entered second grade, I am going to assume that the increase from pre_fluency_score to post_fluency_score is entirely due to this program. I am going to run a few linear regressions to see the relationship between the test scores and other variables. My recommendation to the Assistant Superintendent will come from these analyses.
But before that, I would like to see whether post_fluency_score changed among all students regardless of their school, school region, or socio-demographic characteristics.
ggplot()+
geom_histogram(data = newdata, aes(x = post_fluency_score), fill="blue", color="darkblue")+
geom_vline(xintercept = mean(newdata$post_fluency_score), col = "yellow", lwd = 2)+
geom_histogram(data = newdata, aes(x = pre_fluency_score), fill="red", color="darkred")+
geom_vline(xintercept = mean(newdata$pre_fluency_score), col = "black", lwd = 2)+
xlab("Fluency Score Distribution (Red:Pre; Blue:Post)")+
ylab("Student Count")
The post_fluency_score clearly has a higher mean than the pre_fluency_score, but the overlapping bars make it hard to see how the post_fluency_score is spread. Let’s make the chart easier to read:
ggplot()+
geom_freqpoly(data=newdata, aes(x = post_fluency_score, y = after_stat(density)), color="darkblue")+
geom_freqpoly(data=newdata, aes(x = pre_fluency_score, y = after_stat(density)), color="darkred")+
xlab("Fluency Score Distribution (Red:Pre; Blue:Post)")
Things are clearer now. Looking at these two graphs, we can say that scores on the post fluency test are much higher. It looks like there was overall upward movement during the study period. Is the difference statistically significant? Let’s explore.
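One way to test whether the pre-to-post rise is larger than chance is a paired t-test, since each student contributes both scores. This is a sketch on simulated data, not the pilot data: the sample size, means, and noise levels below are assumptions chosen only to resemble the scale of the fluency scores.

```r
set.seed(1)

# Simulated scores: post is pre plus an average gain of 30 and some noise.
n    <- 200
pre  <- rnorm(n, mean = 96, sd = 40)
post <- pre + 30 + rnorm(n, mean = 0, sd = 20)

# Paired test of the mean within-student difference (post - pre).
result <- t.test(post, pre, paired = TRUE)

mean_gain <- unname(result$estimate)  # estimated average gain
p_value   <- result$p.value

print(result)
```

A significant result here only says that scores rose over the year. It does not separate the program’s effect from ordinary growth, which is why the comparison between participants and non-participants matters.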
As the dataset contains 25 variables, it is not practical to include them all in our model. As the Assistant Superintendent for L&T wants to know what to do with the Ready2Read program, the simplest approach is to compare the growth of students who participated in the program with that of students who did not. Some post-hoc analysis would certainly help us pinpoint other pertinent issues; however, for this task I am going to use only the selected variables:
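Outcome Variables: pre_fluency_score, post_fluency_score
Predictive Variables: ready2read, met_half_Ready2Read_goal, met_all_Ready2Read_goal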
To achieve this goal, I need only a subset of my data. I can subset the dataset:
final_data <- select(newdata,ready2read, met_half_Ready2Read_goal, met_all_Ready2Read_goal, pre_fluency_score, post_fluency_score)
str(final_data)
'data.frame': 14845 obs. of 5 variables:
$ ready2read : int 0 1 1 1 0 0 1 0 0 1 ...
$ met_half_Ready2Read_goal: int 0 1 0 1 0 0 1 0 0 1 ...
$ met_all_Ready2Read_goal : int 0 1 0 0 0 0 0 0 0 1 ...
$ pre_fluency_score : num 0 0 0 0 0 0 0 0 0 0 ...
$ post_fluency_score : num 0 117 13 9 5 4 19 5 2 13 ...
We can see that the predictive variables are dichotomous, with two categories (0 and 1). I want to replace those codes with descriptive labels to make the results easier to explain.
final_data$ready2read <- ifelse(test = final_data$ready2read == 1, yes = "took_ready2read" , no = "no_ready2read")
final_data$met_half_Ready2Read_goal <- ifelse(test = final_data$met_half_Ready2Read_goal == 1, yes = "met_half_goal" , no = "not_met_half_goal")
final_data$met_all_Ready2Read_goal <- ifelse(test = final_data$met_all_Ready2Read_goal == 1, yes = "met_all_goal" , no = "not_met_all_goal")
summary(final_data)
ready2read met_half_Ready2Read_goal met_all_Ready2Read_goal
Length:14845 Length:14845 Length:14845
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
pre_fluency_score post_fluency_score
Min. : 0.00 Min. : 0.0
1st Qu.: 67.00 1st Qu.:101.0
Median : 96.00 Median :127.7
Mean : 96.24 Mean :127.7
3rd Qu.:125.00 3rd Qu.:156.0
Max. :237.00 Max. :299.0
Based on the summary output, these three variables are now stored as character variables. These independent variables are categorical, so I am going to change them into factor variables:
final_data$ready2read <- as.factor(final_data$ready2read)
final_data$met_half_Ready2Read_goal <- as.factor(final_data$met_half_Ready2Read_goal)
final_data$met_all_Ready2Read_goal <- as.factor(final_data$met_all_Ready2Read_goal)
str(final_data)
'data.frame': 14845 obs. of 5 variables:
$ ready2read : Factor w/ 2 levels "no_ready2read",..: 1 2 2 2 1 1 2 1 1 2 ...
$ met_half_Ready2Read_goal: Factor w/ 2 levels "met_half_goal",..: 2 1 2 1 2 2 1 2 2 1 ...
$ met_all_Ready2Read_goal : Factor w/ 2 levels "met_all_goal",..: 2 1 2 2 2 2 2 2 2 1 ...
$ pre_fluency_score : num 0 0 0 0 0 0 0 0 0 0 ...
$ post_fluency_score : num 0 117 13 9 5 4 19 5 2 13 ...
The selected variables have been changed into factor variables. The outcome variables remain the same. Now, it’s time to do some quick exploratory analyses.
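One detail of the factor conversion matters for the regressions that follow: `as.factor()` orders levels alphabetically, and `lm()` treats the first level as the reference group. For ready2read the reference is therefore "no_ready2read", but for the two goal variables it is the "met" label. The sketch below, on a made-up vector, shows the default and one way to set the reference level explicitly.

```r
# Toy 0/1 flag recoded to labels, as in the text.
flag   <- c(0, 1, 0, 1, 1)
labels <- ifelse(flag == 1, "met_half_goal", "not_met_half_goal")

# Default: alphabetical order, so "met_half_goal" comes first.
alphabetical <- as.factor(labels)
print(levels(alphabetical))

# Explicit order: make the "not met" group the reference level.
explicit <- factor(labels, levels = c("not_met_half_goal", "met_half_goal"))
print(levels(explicit))

# relevel() changes the reference of an existing factor the same way.
releveled <- relevel(alphabetical, ref = "not_met_half_goal")
```

Choosing the reference level changes only the sign and labelling of the coefficient, not the size of the gap between the two groups.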
I am now going to run a descriptive analysis and see what the summary tells us.
summary(final_data)
ready2read met_half_Ready2Read_goal
no_ready2read :7076 met_half_goal :5318
took_ready2read:7769 not_met_half_goal:9527
met_all_Ready2Read_goal pre_fluency_score post_fluency_score
met_all_goal : 3602 Min. : 0.00 Min. : 0.0
not_met_all_goal:11243 1st Qu.: 67.00 1st Qu.:101.0
Median : 96.00 Median :127.7
Mean : 96.24 Mean :127.7
3rd Qu.:125.00 3rd Qu.:156.0
Max. :237.00 Max. :299.0
The summary table shows the number of students who did and did not take the ready2read program, and how many of them met half or all of the program goals.
Now, I want to draw a quick pairwise scatter plot:
pairs(final_data, panel = panel.smooth,
main = "Pairwise Scatter Plot",
col = 3 + (final_data$pre_fluency_score >= 96.24)+ (final_data$post_fluency_score >= 127.7))
Based on the results, we can see that there are certainly correlations between the variables of our choice and students’ fluency scores. The correlation coefficient between pre_fluency_score and post_fluency_score, about 0.875, is very strong. It shows that students’ pre_fluency_score was one of the strongest predictors of their post_fluency_score.
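`pairs()` draws the scatter plots but does not print a coefficient, so the correlation itself has to be computed separately. The sketch below shows one way to do that with `cor()` and `cor.test()` on simulated scores; the data are assumptions for illustration and will not reproduce the coefficient reported above.

```r
set.seed(42)

# Simulated pre and post scores with a strong linear relationship.
pre  <- rnorm(300, mean = 96, sd = 40)
post <- pre + 30 + rnorm(300, mean = 0, sd = 20)
toy  <- data.frame(pre_fluency_score = pre, post_fluency_score = post)

# Pearson correlation; use = "complete.obs" skips rows containing NA.
r <- cor(toy$pre_fluency_score, toy$post_fluency_score, use = "complete.obs")

# cor.test() adds a confidence interval and a p-value for the same coefficient.
test <- cor.test(toy$pre_fluency_score, toy$post_fluency_score)

print(round(r, 3))
print(test$conf.int)
```

On the real data the call would be `cor(final_data$pre_fluency_score, final_data$post_fluency_score)`.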
pre_1 <- lm(pre_fluency_score ~ ready2read, data = final_data)
summary(pre_1)
Call:
lm(formula = pre_fluency_score ~ ready2read, data = final_data)
Residuals:
Min 1Q Median 3Q Max
-96.7 -29.7 -0.7 28.3 140.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 95.7397 0.4929 194.24 <2e-16 ***
ready2readtook_ready2read 0.9605 0.6813 1.41 0.159
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 41.46 on 14843 degrees of freedom
Multiple R-squared: 0.0001339, Adjusted R-squared: 6.65e-05
F-statistic: 1.987 on 1 and 14843 DF, p-value: 0.1586
The intercept was statistically significant at the 0.001 level, but the model as a whole was not (F = 1.987, p = 0.159). The intercept, 95.74, is the average pretest score of the students who did not take the program. The result showed a slightly higher pre_fluency_score (by 0.96 points) for the students who opted to take the ready2read program, but the difference was not statistically different from zero (p = 0.159). Let’s see them visually:
ggplot(final_data, aes(x = ready2read, y = pre_fluency_score, colour = factor(ready2read)))+
geom_point(size = 3, colour = "black")+
geom_point(size = 2)+
xlab("Student Participate in R2R Program")+
ylab("pre_fluency_score")
The plot shows neck-and-neck pre_fluency_scores. There is no point in running the regression analysis for the variables met_half_Ready2Read_goal and met_all_Ready2Read_goal, because these variables did not yet exist at the time the students opted into the program.
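When a regression has a single two-level factor as its predictor, the intercept is the mean of the reference group and the slope is the difference between the two group means. The following sketch verifies this on a tiny made-up dataset (`toy`, its group labels, and its scores are assumptions for illustration).

```r
# Toy data: three students per group; "no" is set as the reference level.
toy <- data.frame(
  group = factor(c("no", "no", "no", "yes", "yes", "yes"),
                 levels = c("no", "yes")),
  score = c(90, 95, 100, 98, 100, 105)
)

fit <- lm(score ~ group, data = toy)

# Group means computed directly.
group_means <- tapply(toy$score, toy$group, mean)

intercept <- coef(fit)[["(Intercept)"]]  # equals the mean of the "no" group
slope     <- coef(fit)[["groupyes"]]     # equals mean("yes") - mean("no")

print(group_means)
print(coef(fit))
```

Read this way, the intercept in each model of this report describes one subgroup, not all students.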
I now want to see the difference in post-test scores between students who did and did not participate in the program.
I want to introduce the variables one by one…
fit_post1 <- lm(post_fluency_score ~ ready2read, data = final_data)
summary(fit_post1)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 126.078929 0.4933097 255.577665 0.000000e+00
ready2readtook_ready2read 3.092779 0.6819101 4.535465 5.792695e-06
Amazing!! The model was statistically significant at the 0.001 level (p ≈ 5.8e-06). The intercept, 126.08, is the average post_fluency_score of the students who did not take part in the R2R program. The students who took part scored approximately 3.09 points higher than those who did not.
As the predictive variables in this study overlap with each other (a student who met all goals necessarily met at least half of them), I am not going to use them together in a model. I simply want to show the differences in the test scores between the subgroups.
Now, I am going to use met_half_Ready2Read_goal variable in the model and want to compare it with the previous one.
fit_post2 <- lm(post_fluency_score ~ met_half_Ready2Read_goal, data = final_data)
summary(fit_post2)$coefficients
Estimate Std. Error t value
(Intercept) 130.292159 0.5688088 229.061451
met_half_Ready2Read_goalnot_met_half_goal -4.042993 0.7100326 -5.694095
Pr(>|t|)
(Intercept) 0.000000e+00
met_half_Ready2Read_goalnot_met_half_goal 1.263764e-08
The results show that the model was statistically significant. The intercept has changed to 130.29 because the reference group is now the students who met at least half of their R2R goals. These students scored approximately 4.04 points higher than those who did not. The reference group’s average is higher than in the previous model because it consists only of students who opted in to the program and met at least half of their goals; 5,318 students did so.
Finally, let’s run the model for our final variable:
fit_post3 <- lm(post_fluency_score ~ met_all_Ready2Read_goal, data = final_data)
coef(fit_post3)
(Intercept) met_all_Ready2Read_goalnot_met_all_goal
133.449445 -7.594726
The result shows the largest gap in posttest scores of the three models. The model was statistically significant at the 0.001 level. The intercept increased to 133.45, the average post fluency score of the students who met all R2R goals. These students scored approximately 7.59 points higher than those who did not.
par(mfrow = c(2,2))
plot(fit_post3)