Performance Task

Staff from the Learning and Teaching (L&T) department are deciding which programs to implement during the upcoming school year. A vocabulary-building software program called Ready2Read was piloted last year (for the full school year) in 16 elementary schools for students in grades 2-5.

The Assistant Superintendent for L&T brought you a copy of the implementation datasets from the Ready2Read pilot and has asked you to evaluate the program’s effectiveness.

Ultimately, she wants to know what the district should do with Ready2Read during the upcoming school year. The data on program participation, student demographics, and teacher feedback come from multiple data sources and are stored in separate data files.

Invoking Required Libraries

library(kableExtra) # To create the codebook table
library(generics)   # To avoid conflicts among similar functions
library(tidyverse)  # To facilitate the workflow (also attaches tidyr and ggplot2)
library(tidyr)      # To manipulate data files (already attached by tidyverse)
library(cowplot)    # To put graphs together
library(readxl)     # To read the given Excel file
library(ggplot2)    # To visualize data (already attached by tidyverse)
library(stats)      # For conducting regression analyses
library(graphics)   # For pairwise correlation graphs
Codebook
Variable Description
student_id Unique student ID
school_id School ID
grade_level Student’s grade level
frpl Indicator for free and reduced price lunch, where FRPL = 1 and non-FRPL = 0
male Indicator for sex, where male = 1 and female = 0
excep Indicator for students with exceptionalities, where excep = 1 and non-excep = 0
el Indicator for English Learner, where EL = 1 and non-EL = 0
tag Indicator for talented and gifted students, where TAG = 1 and non-TAG = 0
white Race/ethnicity indicator, where white=1 and non-white=0
black Race/ethnicity indicator, where Black=1 and non-Black=0
asian Race/ethnicity indicator, where Asian=1 and non-Asian=0
hispanic Race/ethnicity indicator, where Hispanic=1 and non-Hispanic=0
multiracial Race/ethnicity indicator, where multiracial=1 and non-multiracial=0
personalized_learning Student attends a personalized learning school (attends = 1; does not attend = 0)
title_1 Student attends a Title I school (attends = 1; does not attend = 0)
magnet Student attends a magnet school (attends = 1; does not attend = 0)
ready2read Indicates student received the Ready2Read program
met_half_Ready2Read_goal Student completed at least half of goals (met half = 1; did not meet half = 0)
met_all_Ready2Read_goal Student completed all Ready2Read goals (met all = 1; did not meet all = 0)
pre_fluency_score Beginning of year fluency assessment score
post_fluency_score End-of-year fluency assessment score
region_bridges Indicates student attends school in the Bridges Region
region_harris Indicates student attends school in the Harris Region
region_benjamin Indicates student attends school in Benjamin Region
region_patton Indicates student attends school in Patton Region
region_simpson Indicates student attends school in Simpson Region
region_robinson Indicates student attends school in Robinson Region
region_raymond Indicates student attends school in Raymond Region

Reading the Data Files in R

datawarehouse <- read_excel("data_warehouse_pull.xlsx")
ready2readdata <- read.csv("ready2readprograminfo.csv")
# Checking the data structure by summarizing the files
summary(datawarehouse)
   student_id       school_id         grade            frpl       
 Min.   :812011   Min.   :350.0   Min.   :2.000   Min.   :0.0000  
 1st Qu.:816066   1st Qu.:357.0   1st Qu.:2.000   1st Qu.:0.0000  
 Median :820121   Median :364.0   Median :3.000   Median :0.0000  
 Mean   :820121   Mean   :365.1   Mean   :3.481   Mean   :0.3881  
 3rd Qu.:824176   3rd Qu.:373.0   3rd Qu.:4.000   3rd Qu.:1.0000  
 Max.   :828231   Max.   :381.0   Max.   :5.000   Max.   :1.0000  
                                                                  
      male            white            black            asian        
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :1.0000   Median :0.0000   Median :0.0000   Median :0.00000  
 Mean   :0.5108   Mean   :0.4805   Mean   :0.2506   Mean   :0.03976  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
                  NA's   :149      NA's   :149      NA's   :149      
    hispanic       multiracial      personalized_learning    title_1      
 Min.   :0.0000   Min.   :0.00000   Min.   :0.0000        Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000        1st Qu.:0.0000  
 Median :0.0000   Median :0.00000   Median :1.0000        Median :1.0000  
 Mean   :0.1826   Mean   :0.04424   Mean   :0.5325        Mean   :0.5762  
 3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:1.0000        3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.00000   Max.   :1.0000        Max.   :1.0000  
 NA's   :149      NA's   :149                                             
     magnet       region_bridges  region_harris     region_benjamin  
 Min.   :0.0000   Min.   :0.000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :0.0000   Median :0.000   Median :0.00000   Median :0.00000  
 Mean   :0.1809   Mean   :0.164   Mean   :0.03643   Mean   :0.09032  
 3rd Qu.:0.0000   3rd Qu.:0.000   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :1.0000   Max.   :1.000   Max.   :1.00000   Max.   :1.00000  
                                                                     
 region_patton    region_simpson   region_robinson  region_raymond  
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.1632   Mean   :0.1707   Mean   :0.1181   Mean   :0.2573  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
                                                                    

The data_warehouse_pull.xlsx file contains students’ socioeconomic status and information about the region each school is located in. As can be seen, there are 149 missing cases in the columns that represent students’ race/ethnicity. If I decide to use these indicators, I may have to drop these cases; at ((149/16221) * 100 ≈ 0.92%), they make up less than one percent of the total dataset.

# Checking the ready2readprograminfo.csv file
summary(ready2readdata)
   student_id       ready2read     met_half_Ready2Read_goal
 Min.   :812011   Min.   :0.0000   Min.   :0.0000          
 1st Qu.:816066   1st Qu.:0.0000   1st Qu.:0.0000          
 Median :820121   Median :1.0000   Median :0.0000          
 Mean   :820121   Mean   :0.5228   Mean   :0.3578          
 3rd Qu.:824176   3rd Qu.:1.0000   3rd Qu.:1.0000          
 Max.   :828231   Max.   :1.0000   Max.   :1.0000          
                                                           
 met_all_Ready2Read_goal pre_fluency_score post_fluency_score
 Min.   :0.0000          Min.   :  0.00    Min.   :  0.0     
 1st Qu.:0.0000          1st Qu.: 66.00    1st Qu.: 99.0     
 Median :0.0000          Median : 95.00    Median :128.0     
 Mean   :0.2229          Mean   : 96.24    Mean   :127.7     
 3rd Qu.:0.0000          3rd Qu.:125.00    3rd Qu.:157.0     
 Max.   :1.0000          Max.   :237.00    Max.   :299.0     
                         NA's   :1580      NA's   :2065      

There are five unique variables in this data set, with student_id being the common variable between the two data sets.

Missing Cases

The summary above shows that there are 1,580 missing cases in pre_fluency_score, roughly 10% of the total data. Likewise, the 2,065 missing cases in post_fluency_score make up approximately 12.7%. A quick inspection of the Excel file shows a total of 1,376 rows missing both scores, or 8.48%. These rows can be deleted: those students were never part of the pilot in the first place, and I don’t want them to skew my results one way or the other. I will be better off dropping these cases, but I have to merge the files first.
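The same count can be verified directly in R rather than in Excel; a minimal sketch, using the column names above:

# Rows missing BOTH fluency scores (should reproduce the 1,376 figure)
sum(is.na(ready2readdata$pre_fluency_score) &
    is.na(ready2readdata$post_fluency_score))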

Merging the Given Datafiles

Fortunately, I am provided with a clean set of data. A few random spot checks show that the student_ids not only match across files but also appear in the same row positions. They don’t have to be in the same positions for R to merge the data, but having such clean data removes a lot of ambiguity. Let’s merge the files first:

newdata <- merge(datawarehouse, ready2readdata, by = "student_id")

A quick glimpse shows that the two files have been merged and we have a total of 16,221 rows and 25 variables. I have saved the merged data as ‘newdata’.

Handling the Missing Data

#checking the missing data by variables in the new dataset
summary(is.na(newdata))
 student_id      school_id         grade            frpl        
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:16221     FALSE:16221     FALSE:16221     FALSE:16221    
                                                                
    male           white           black           asian        
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:16221     FALSE:16072     FALSE:16072     FALSE:16072    
                 TRUE :149       TRUE :149       TRUE :149      
  hispanic       multiracial     personalized_learning  title_1       
 Mode :logical   Mode :logical   Mode :logical         Mode :logical  
 FALSE:16072     FALSE:16072     FALSE:16221           FALSE:16221    
 TRUE :149       TRUE :149                                            
   magnet        region_bridges  region_harris   region_benjamin
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:16221     FALSE:16221     FALSE:16221     FALSE:16221    
                                                                
 region_patton   region_simpson  region_robinson region_raymond 
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:16221     FALSE:16221     FALSE:16221     FALSE:16221    
                                                                
 ready2read      met_half_Ready2Read_goal met_all_Ready2Read_goal
 Mode :logical   Mode :logical            Mode :logical          
 FALSE:16221     FALSE:16221              FALSE:16221            
                                                                 
 pre_fluency_score post_fluency_score
 Mode :logical     Mode :logical     
 FALSE:14641       FALSE:14156       
 TRUE :1580        TRUE :2065        
#What is the mean of the missing data?
mean(is.na(newdata))
[1] 0.01082547

Roughly 1% of the data are missing if we take everything into account. But in the fluency-score columns discussed above, the missing rates are considerably higher than that overall figure suggests. Let’s check:

#Percentage of missing data in 'pre_fluency_score'
mean(is.na(newdata$pre_fluency_score))
[1] 0.0974046
mean(is.na(newdata$post_fluency_score))
[1] 0.1273041

Below 10% in one case and about 13% in the other, which is still high. As mentioned earlier, students who took neither fluency assessment are to be excluded from this evaluation. However, I don’t want to drop students who took either the pre-fluency or the post-fluency test, and I don’t know a direct way to remove only the rows with NAs in both of these variables without losing those students. The easy way would be to open the merged file in Excel, filter on both columns, delete the rows missing both scores, and read the new file back into R. However, there is a way to do this in R:

  • Replace NAs with 999
  • I want to create a new variable named ‘var’ by adding pre_fluency_score and post_fluency_score
  • Retain all the rows having values less than 1998 in the column ‘var’
  • Replace 999 back to NAs
  • Study the missing cases again and replace NAs by column mean
  • Drop the ‘var’ column

Let’s see if my plan works:
# Changing the missing values in pre and post fluency score columns to 999
newdata$pre_fluency_score[is.na(newdata$pre_fluency_score)] <- 999
newdata$post_fluency_score[is.na(newdata$post_fluency_score)] <- 999
# Checking whether any NAs remain in those columns
  #sum(is.na(newdata$pre_fluency_score))
  #sum(is.na(newdata$post_fluency_score))
# Creating a new variable 'var'
newdata <- mutate(newdata, var = pre_fluency_score + post_fluency_score)
#Making sure it worked and checking the structure
  #summary(newdata)
# Dropping the rows where var == 1998 (i.e., both scores were missing)
newdata <- filter(newdata, var < 1998)
  #glimpse(newdata)
#Replacing 999 back to NAs
newdata$pre_fluency_score[newdata$pre_fluency_score == 999] <- NA
newdata$post_fluency_score[newdata$post_fluency_score == 999] <- NA
  #summary(newdata)
#Percentage of missing data in 'pre_fluency_score'
mean(is.na(newdata$pre_fluency_score))
[1] 0.013742
mean(is.na(newdata$post_fluency_score))
[1] 0.04641293
# Percentages of missing values in these columns are now below 5%, an acceptable range. I want to replace the missing values with the column mean, but I don't want the overall mean to change noticeably in the process.
# Let's calculate the means of the variables pre_fluency_score and post_fluency_score
mean(newdata$pre_fluency_score, na.rm = TRUE)
[1] 96.2424
mean(newdata$post_fluency_score, na.rm = TRUE)
[1] 127.6974
# The average pre_fluency_score and post_fluency_score are 96.24 and 127.70, respectively. I am now going to replace the NAs in both columns with the corresponding mean values.
newdata$pre_fluency_score[is.na(newdata$pre_fluency_score)] <- 96.24
newdata$post_fluency_score[is.na(newdata$post_fluency_score)] <- 127.70
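# (For reference, tidyr offers a one-step equivalent of the two
#  assignments above; a sketch, assuming tidyr is attached:
#  newdata <- replace_na(newdata, list(pre_fluency_score = 96.24,
#                                      post_fluency_score = 127.70)) )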
# Checking if the overall mean remained same
mean(newdata$pre_fluency_score)
[1] 96.24237
mean(newdata$post_fluency_score)
[1] 127.6975
# The values remain essentially the same. I now want to drop the variable 'var' I created a moment ago.
newdata <- select(newdata, -var)
str(newdata)
'data.frame':   14845 obs. of  25 variables:
 $ student_id              : num  812011 812012 812013 812014 812015 ...
 $ school_id               : num  365 362 359 360 366 350 379 366 361 358 ...
 $ grade                   : num  3 2 2 3 2 3 3 2 2 2 ...
 $ frpl                    : num  1 0 1 1 1 1 1 1 1 0 ...
 $ male                    : num  1 0 1 0 1 1 0 1 0 1 ...
 $ white                   : num  0 0 0 1 0 0 0 0 0 0 ...
 $ black                   : num  0 0 0 0 1 0 0 0 0 1 ...
 $ asian                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hispanic                : num  0 1 1 0 0 1 1 1 1 0 ...
 $ multiracial             : num  1 0 0 0 0 0 0 0 0 0 ...
 $ personalized_learning   : num  0 0 0 1 0 1 1 0 0 0 ...
 $ title_1                 : num  0 0 1 1 0 1 1 0 1 1 ...
 $ magnet                  : num  0 0 1 0 1 0 0 1 0 0 ...
 $ region_bridges          : num  0 0 0 0 0 0 0 0 1 0 ...
 $ region_harris           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ region_benjamin         : num  1 0 1 0 0 0 0 0 0 0 ...
 $ region_patton           : num  0 0 0 0 0 0 0 0 0 1 ...
 $ region_simpson          : num  0 1 0 1 0 1 1 0 0 0 ...
 $ region_robinson         : num  0 0 0 0 1 0 0 1 0 0 ...
 $ region_raymond          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ready2read              : int  0 1 1 1 0 0 1 0 0 1 ...
 $ met_half_Ready2Read_goal: int  0 1 0 1 0 0 1 0 0 1 ...
 $ met_all_Ready2Read_goal : int  0 1 0 0 0 0 0 0 0 1 ...
 $ pre_fluency_score       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ post_fluency_score      : num  0 117 13 9 5 4 19 5 2 13 ...

Everything has worked so far. I now have a merged dataset named ‘newdata’ with 14,845 rows and 25 variables.
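For reference, the whole 999 workaround above could be replaced by a single dplyr filter that drops only the rows missing both scores; a sketch of the equivalent step, applied before any imputation:

# Keep any row with at least one observed fluency score
newdata <- filter(newdata,
                  !(is.na(pre_fluency_score) & is.na(post_fluency_score)))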

The Assistant Superintendent for L&T wants to know whether the Ready2Read program is worth implementing in other schools, whether it requires modification, or whether it should simply be shut down. (Note: if I were part of this program and knew what goes on within it, I would be able to conduct more rigorous studies. I could also conduct a quasi-experimental analysis, such as propensity score matching, if I were provided with data from control schools.)

First Impression

I am provided with a baseline score (pre_fluency_score) and an end-of-year score (post_fluency_score), which appear to have been taken one year apart. I am not exactly sure what the ‘fluency score’ represents. Setting aside the fact that there is natural linguistic growth (on the path to first-language acquisition), and assuming the critical period ended before these students entered second grade, I am going to treat the increase from pre_fluency_score to post_fluency_score as attributable to the program. I am going to run a few linear regressions to examine the relationship between the test scores and the other variables. My recommendation to the Assistant Superintendent will come from these analyses.

But before that, I would like to see whether post_fluency_score changed among all students, regardless of their school, school region, or socio-demographic characteristics.

ggplot(newdata)+
  geom_histogram(aes(x = post_fluency_score), fill="blue", color="darkblue")+
  geom_vline(xintercept = mean(newdata$post_fluency_score), col = "yellow", lwd = 2)+
  geom_histogram(aes(x = pre_fluency_score), fill="red", color="darkred")+
  geom_vline(xintercept = mean(newdata$pre_fluency_score), col = "black", lwd = 2)+
  xlab("Fluency Score Distribution (Red: Pre; Blue: Post)")+
  ylab("Student Count")

The post_fluency_score clearly has a higher mean than the pre_fluency_score, but I cannot see how the post_fluency_score is spread. Let’s make the chart more readable:

ggplot(newdata)+
  geom_freqpoly(aes(x = post_fluency_score, y = after_stat(density)), color="darkblue")+
  geom_freqpoly(aes(x = pre_fluency_score, y = after_stat(density)), color="darkred")+
  xlab("Fluency Score Distribution (Red: Pre; Blue: Post)")

Things are clearer now. Looking at these two graphs, post-test fluency scores are noticeably higher; there appears to have been overall positive growth during the study period. Is the difference statistically significant? Let’s explore.
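One quick way to test the overall pre/post difference before any modeling is a paired t-test; a sketch, assuming each row’s two scores belong to the same student (which holds after the merge):

# Paired t-test: post- vs. pre-test fluency for the same students
t.test(newdata$post_fluency_score, newdata$pre_fluency_score, paired = TRUE)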

Exploratory Analysis, Model Selection

The dataset contains 25 variables, and it is not practical to include all of them in one model. Since the Assistant Superintendent for L&T wants to know what to do with the Ready2Read program, the straightforward approach is to compare growth between students who participated in the program and those who did not. Post-hoc analyses would certainly help pinpoint other pertinent issues, but for this task I am going to use only the following variables:

Outcome Variables

  1. pre_fluency_score
  2. post_fluency_score

Predictive Variables

  1. ready2read: indicates whether a student participated in the Ready2Read program
  2. met_half_Ready2Read_goal
  3. met_all_Ready2Read_goal

To achieve this goal, I only need a chunk of the data, so I can subset the dataset:

final_data <- select(newdata,ready2read, met_half_Ready2Read_goal, met_all_Ready2Read_goal, pre_fluency_score,  post_fluency_score)
str(final_data)
'data.frame':   14845 obs. of  5 variables:
 $ ready2read              : int  0 1 1 1 0 0 1 0 0 1 ...
 $ met_half_Ready2Read_goal: int  0 1 0 1 0 0 1 0 0 1 ...
 $ met_all_Ready2Read_goal : int  0 1 0 0 0 0 0 0 0 1 ...
 $ pre_fluency_score       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ post_fluency_score      : num  0 117 13 9 5 4 19 5 2 13 ...

We can see that the predictive variables are dichotomous, with two categories (0 and 1). I want to recode them with descriptive labels to make the results easier to explain.

final_data$ready2read <- ifelse(test = final_data$ready2read == 1, yes = "took_ready2read" , no = "no_ready2read")
final_data$met_half_Ready2Read_goal <- ifelse(test = final_data$met_half_Ready2Read_goal == 1, yes = "met_half_goal" , no = "not_met_half_goal")
final_data$met_all_Ready2Read_goal <- ifelse(test = final_data$met_all_Ready2Read_goal == 1, yes = "met_all_goal" , no = "not_met_all_goal")
summary(final_data)
  ready2read        met_half_Ready2Read_goal met_all_Ready2Read_goal
 Length:14845       Length:14845             Length:14845           
 Class :character   Class :character         Class :character       
 Mode  :character   Mode  :character         Mode  :character       
                                                                    
                                                                    
                                                                    
 pre_fluency_score post_fluency_score
 Min.   :  0.00    Min.   :  0.0     
 1st Qu.: 67.00    1st Qu.:101.0     
 Median : 96.00    Median :127.7     
 Mean   : 96.24    Mean   :127.7     
 3rd Qu.:125.00    3rd Qu.:156.0     
 Max.   :237.00    Max.   :299.0     

The summary shows that, after recoding, these variables are stored as character vectors. Because these independent variables are categorical, I am going to convert them into factor variables:

final_data$ready2read <- as.factor(final_data$ready2read)
final_data$met_half_Ready2Read_goal <- as.factor(final_data$met_half_Ready2Read_goal)
final_data$met_all_Ready2Read_goal <- as.factor(final_data$met_all_Ready2Read_goal)
str(final_data)
'data.frame':   14845 obs. of  5 variables:
 $ ready2read              : Factor w/ 2 levels "no_ready2read",..: 1 2 2 2 1 1 2 1 1 2 ...
 $ met_half_Ready2Read_goal: Factor w/ 2 levels "met_half_goal",..: 2 1 2 1 2 2 1 2 2 1 ...
 $ met_all_Ready2Read_goal : Factor w/ 2 levels "met_all_goal",..: 2 1 2 2 2 2 2 2 2 1 ...
 $ pre_fluency_score       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ post_fluency_score      : num  0 117 13 9 5 4 19 5 2 13 ...
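For reference, the three as.factor() calls above could be collapsed into one step with dplyr’s across(); a sketch, assuming dplyr 1.0 or later:

# One-step equivalent of the three conversions above
final_data <- mutate(final_data,
                     across(c(ready2read, met_half_Ready2Read_goal,
                              met_all_Ready2Read_goal), as.factor))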

Either way, the selected variables are now factors, and the outcome variables remain unchanged. Now it’s time for some quick exploratory analyses.

I am now going to run a descriptive analysis and see what the summary tells us.

summary(final_data)
           ready2read        met_half_Ready2Read_goal
 no_ready2read  :7076   met_half_goal    :5318       
 took_ready2read:7769   not_met_half_goal:9527       
                                                     
                                                     
                                                     
                                                     
     met_all_Ready2Read_goal pre_fluency_score post_fluency_score
 met_all_goal    : 3602      Min.   :  0.00    Min.   :  0.0     
 not_met_all_goal:11243      1st Qu.: 67.00    1st Qu.:101.0     
                             Median : 96.00    Median :127.7     
                             Mean   : 96.24    Mean   :127.7     
                             3rd Qu.:125.00    3rd Qu.:156.0     
                             Max.   :237.00    Max.   :299.0     

The summary table shows the number of students who did and did not take the Ready2Read program, along with their goal-completion status at the halfway mark and at full completion.

Pairwise Scatter Plot

Now, I want to produce a quick pairwise scatter plot with smoothed trend lines:

pairs(final_data, panel = panel.smooth,
      main = "Pairwise Scatter Plot", 
      col = 3 + (final_data$pre_fluency_score >= 96.24)+ (final_data$post_fluency_score >= 127.7))

Based on the results, there are clear associations between the variables of interest and students’ fluency scores. The correlation between pre_fluency_score and post_fluency_score, about 0.875, is very strong: students’ pre-test fluency scores were among the strongest determinants of their post-test scores.
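That coefficient can be confirmed numerically; a quick check with cor() (no NAs remain after the imputation above):

# Pearson correlation between pre- and post-test fluency scores
cor(final_data$pre_fluency_score, final_data$post_fluency_score)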

Running Pretest Model

pre_1 <- lm(pre_fluency_score ~ ready2read, data = final_data)
summary(pre_1)

Call:
lm(formula = pre_fluency_score ~ ready2read, data = final_data)

Residuals:
   Min     1Q Median     3Q    Max 
 -96.7  -29.7   -0.7   28.3  140.3 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                95.7397     0.4929  194.24   <2e-16 ***
ready2readtook_ready2read   0.9605     0.6813    1.41    0.159    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41.46 on 14843 degrees of freedom
Multiple R-squared:  0.0001339, Adjusted R-squared:  6.65e-05 
F-statistic: 1.987 on 1 and 14843 DF,  p-value: 0.1586

The intercept was statistically significant (p < 0.001): the average pretest score among students who did not take the program was 95.74. Students who opted into Ready2Read had slightly higher pre_fluency_scores, but the difference was not statistically significant (p = 0.159). Let’s look at this visually:

ggplot(final_data, aes(x = ready2read, y = pre_fluency_score, colour = factor(ready2read)))+
  geom_point(size = 3, colour = "black")+
  geom_point(size = 2)+
  xlab("Student Participate in R2R Program")+
  ylab("pre_fluency_score")

The result shows pre_fluency_scores that are neck and neck. There is no point in running the pretest regression on met_half_Ready2Read_goal and met_all_Ready2Read_goal, because those variables did not exist at the time students opted into the program.
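With thousands of overlapping points, a boxplot may show this comparison more clearly; an alternative sketch:

# Boxplot alternative: pretest score distributions by participation
ggplot(final_data, aes(x = ready2read, y = pre_fluency_score, fill = ready2read))+
  geom_boxplot()+
  xlab("Student Participated in R2R Program")+
  ylab("pre_fluency_score")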

Post_fluency_score analyses

I now want to see the differences in test scores between students who did and did not participate in the program.

I want to introduce the variables one by one…

fit_post1 <- lm(post_fluency_score ~ ready2read, data = final_data)
summary(fit_post1)$coefficients
                            Estimate Std. Error    t value     Pr(>|t|)
(Intercept)               126.078929  0.4933097 255.577665 0.000000e+00
ready2readtook_ready2read   3.092779  0.6819101   4.535465 5.792695e-06

Amazing! The model was statistically significant (p < 0.001). The average post_fluency_score among students who did not take the program was 126.08, and students who took part in the R2R program scored approximately 3.09 points higher than those who did not.

Because these predictor variables are nested within one another (meeting all goals implies meeting half of them, which implies participating), I am not going to use them together in one model. I simply want to show the differences in test scores between the subgroups, as sketched below.
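Subgroup means can also be read off directly without a model; a sketch using dplyr:

# Mean post-test score and group size by program participation
final_data %>%
  group_by(ready2read) %>%
  summarise(mean_post = mean(post_fluency_score), n = n())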

Now, I am going to put the met_half_Ready2Read_goal variable in the model and compare the result with the previous one.

fit_post2 <- lm(post_fluency_score ~ met_half_Ready2Read_goal, data = final_data)
summary(fit_post2)$coefficients
                                            Estimate Std. Error    t value
(Intercept)                               130.292159  0.5688088 229.061451
met_half_Ready2Read_goalnot_met_half_goal  -4.042993  0.7100326  -5.694095
                                              Pr(>|t|)
(Intercept)                               0.000000e+00
met_half_Ready2Read_goalnot_met_half_goal 1.263764e-08

The results show that the model was statistically significant (p < 0.001). The average post_fluency_score among students who met at least half of the R2R goals was 130.29; students who did not meet half of the goals scored approximately 4.04 points lower. Keep in mind that only program participants could meet goals, and 5,318 of them met at least half.

Finally, let’s fit the model for our final variable:

fit_post3 <- lm(post_fluency_score ~ met_all_Ready2Read_goal, data = final_data)
coef(fit_post3)
                            (Intercept) met_all_Ready2Read_goalnot_met_all_goal 
                             133.449445                               -7.594726 

The result shows an even larger gap in the post-test scores. The average post_fluency_score among students who met all of the R2R goals was 133.45, approximately 7.59 points higher than among students who did not meet all of the goals.
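Since coef() alone does not report standard errors or p-values, the model’s significance can be verified explicitly; a quick check:

# Standard errors and p-values for the met_all model
summary(fit_post3)$coefficients
# 95% confidence intervals for the coefficients
confint(fit_post3)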

Residual Checking

par(mfrow = c(2,2))
plot(fit_post3)

  1. The first plot shows that the red line is fairly flat and stays close to zero; I see no obvious sign of model misspecification.
  2. The Q-Q plot checks the normality of the residuals. As can be seen, the points fall close to the 45-degree line, suggesting approximate normality.
  3. The scale-location plot shows the standardized residuals against the fitted values. The spread appears fairly comparable across the subgroups.
  4. The final plot shows the relationship between residuals and leverage. With a dichotomous predictor we see a banded pattern, but no points stand out on Cook’s distance.