Data Wrangling

Missing Data

Data that was stored as ’ . ’ was changed to NA so that columns which contained numerical data (RIT scores, percentiles, quartiles, etc.) could changed to ‘numeric’ class data and be aggregated. ‘Enrollment’ did not contain missing data. Each dataframe was copied so that the original was preserved.

Scores2 <- Scores
Scores2 <- Scores2 %>% na_if('.')

SP2 <- SP
SP2 <- SP2 %>% na_if('.')

Enrollment2 <- Enrollment

Fixing data entry typos

Scores2["subject"][Scores2["subject"] == "Read"] <- "Reading"

Modify Data Types (Pt 1)

Character to Numeric

Enrollment2[3] <- lapply(Enrollment2[3], as.numeric)

Score_num <- c(5:8)
Scores2[Score_num] <- lapply(Scores2[Score_num], as.numeric)

New Dataframes & Feature Engineering

Combine fall & spring metrics per student, by subject

Fall Reading scores, Fall Math scores, Spring Reading scores, and Spring Math scores were separated into individual dataframes. Fall reading scores and spring reading scores were then joined to create a new dataframe which contains each student’s MAP test metrics across a school year. This was also repeated for Math scores per student.

Scores_FR <- Scores2 %>% filter(map_term == "Fall" & subject == "Reading")
Scores_FM <- Scores2 %>% filter(map_term == "Fall" & subject == "Math")
Scores_SR <- Scores2 %>% filter(map_term == "Spring" & subject == "Reading")
Scores_SM <- Scores2 %>% filter(map_term == "Spring" & subject == "Math")

# Join rename, & select fall + spring reading scores. 
Reading <- Scores_FR %>% 
  left_join(Scores_SR, by = c("student_id" = "student_id", "year" = "year", "subject" = "subject")) %>% 
  rename(
    fall_rit = rit_score.x,
    fall_percentile = percentile.x,
    fall_quartile = quartile.x,
    spring_rit = rit_score.y,
    spring_percentile = percentile.y,
    spring_quartile = quartile.y,
    typical_fall_to_spring_growth = typical_fall_to_spring_growth.x) %>% 
  dplyr::select(student_id, 
                subject, 
                fall_rit, fall_percentile, fall_quartile, 
                spring_rit, spring_percentile, spring_quartile, 
                typical_fall_to_spring_growth)


# Join rename, & select fall + spring math scores. 
Math <- Scores_FM %>% 
  left_join(Scores_SM, by = c("student_id" = "student_id", "year" = "year", "subject" = "subject")) %>% 
  rename(
    fall_rit = rit_score.x,
    fall_percentile = percentile.x,
    fall_quartile = quartile.x,
    spring_rit = rit_score.y,
    spring_percentile = percentile.y,
    spring_quartile = quartile.y,
    typical_fall_to_spring_growth = typical_fall_to_spring_growth.x) %>% 
  dplyr::select(student_id, 
                subject, 
                fall_rit, fall_percentile, fall_quartile, 
                spring_rit, spring_percentile, spring_quartile, 
                typical_fall_to_spring_growth)

‘Actual_growth’ & ‘RoC’ (Rate of Change)

‘Actual_growth’ was created to measure the absolute change of each student’s performance on the MAP test. Actual fall-to-spring growth was calculated by subtracting each student’s fall RIT score from their spring RIT score: \['spring\_rit' - 'fall\_rit'\]

Students who were missing RIT scores in the fall or spring were dropped from the dataset to avoid skewing the data.

Reading2 <- Reading %>% filter(!is.na(fall_rit) & !is.na(spring_rit))
Reading2$actual_growth <- (Reading2$spring_rit - Reading2$fall_rit)

Math2 <- Math %>% filter(!is.na(fall_rit) & !is.na(spring_rit))
Math2$actual_growth <- (Math2$spring_rit - Math2$fall_rit)

‘RoC’ was created to measure the relative change from fall to spring. In other words, ‘RoC’ measured the rate to which students improved their reading and math scores:

\[ (\frac{'spring\_rit'}{'fall\_rit'}) -1 \]

Reading2$RoC <- ((Reading2$spring_rit / Reading2$fall_rit) - 1)

Math2$RoC <- ((Math2$spring_rit / Math2$fall_rit) - 1)

Combine tables

The new score (per subject), enrollment, and special program tables were joined and reordered. Students assigned to special programs were then dropped to reduce skew.

Reading2 <- Reading2 %>% 
  left_join(Enrollment2, by = "student_id") %>% 
  left_join(., SP2, by = "student_id")
Reading2 <- Reading2 %>% 
  relocate(actual_growth, RoC, .after = typical_fall_to_spring_growth) %>% 
  relocate(grade_level_2019, school_name, program_name, .after = student_id)
Reading_GI <- Reading2 %>% filter(is.na(program_name))

Math2 <- Math2 %>% 
  left_join(Enrollment2, by = "student_id") %>% 
  left_join(., SP2, by = "student_id")
Math2 <- Math2 %>% 
  relocate(actual_growth, RoC, .after = typical_fall_to_spring_growth) %>% 
  relocate(grade_level_2019, school_name, program_name, .after = student_id)
Math_GI <- Math2 %>% filter(is.na(program_name))

datatable(Reading_GI,
          extensions = "FixedColumns",
          options = list(
            dom = "t",
            scrollx = TRUE,
            fixedColumns = TRUE))

datatable(Math_GI,
          extensions = "FixedColumns",
          options = list(
            dom = "t",
            scrollx = TRUE,
            fixedColumns = TRUE))

‘Education_level’

Schools E, F, & G are comprised of students from kindergarten to 4th grade. Schools A, B, C, & D are made up of students from 5th grade through 8th grade. A new column was created called ‘education_level’ which separated students into 2 groups: ‘Elementary School’ (Schools E, F, & G) & ‘Middle School’ (Schools A, B, C, & D)

Reading_edu <- Reading_GI %>% mutate(
  education_level = case_when(
    grade_level_2019 >= 0 & grade_level_2019 <= 4 ~ 'Elementary School',
    grade_level_2019 >= 5 & grade_level_2019 <= 8 ~ 'Middle School'
  )
)

Math_edu <- Math_GI %>% mutate(
  education_level = case_when(
    grade_level_2019 >= 0 & grade_level_2019 <= 4 ~ 'Elementary School',
    grade_level_2019 >= 5 & grade_level_2019 <= 8 ~ 'Middle School'
  )
)

Modify Data Types (Pt 2)

Data was modified a second time so that it could be used for statistical analysis. Coin’s median_test does not support the data type “character” so ‘school_name’ and ‘education_level’ are converted to factor.

A second copy of ‘Reading_GI’ and ‘Math_GI’ is also made and ‘grade_level_2019’ is also set to factor.

Reading_GI["school_name"] <- lapply(Reading_GI["school_name"], as.factor)
Reading_edu["education_level"] <- lapply(Reading_edu["education_level"], as.factor)
Math_GI["school_name"] <- lapply(Math_GI["school_name"], as.factor)
Math_edu["education_level"] <- lapply(Math_edu["education_level"], as.factor)


Reading_grade <- Reading_GI
Reading_grade["grade_level_2019"] <- lapply(Reading_grade["grade_level_2019"], as.factor)
Math_grade <- Math_GI
Math_grade["grade_level_2019"] <- lapply(Math_grade["grade_level_2019"], as.factor)

Analysis

Median raw scores & ‘actual_growth’ for each school

While raw RIT scores are useful for evaluating individual student performance, it is harder to generalize scores to groups of students because a student’s score is dependent on their past educational experience (one would assume that a fifth grader would have a higher reading RIT score than a kindergartner). Therefore, I created a relative measure of growth (‘RoC’) to reflect positive and negative changes in a student’s performance from fall to spring on the MAP test. Using the median rate-of-change, we can also discern the impact of education on student growth. In this analysis, the median is favored over the mean because the median is more resilient to outliers.

That being said, there are a few insights we can glean just by looking at the median scores. For example, students at School D average below the 50th percentile in Reading when every other school is in the 56th percentile or higher. For Math, Schools A, C, & D are all below the 50th percentile.

Reading_summary <- Reading_GI %>% 
  group_by(school_name) %>% 
  summarise(
    median(spring_rit),
    median(spring_percentile),
    median(spring_quartile),
    median(actual_growth))

Math_summary <- Math_GI %>% 
  group_by(school_name) %>% 
  summarise(
    median(spring_rit),
    median(spring_percentile),
    median(spring_quartile),
    median(actual_growth))

datatable(Reading_summary,
          extensions = "FixedColumns",
          options = list(
            dom = "t",
            scrollx = TRUE,
            fixedColumns = TRUE,
            keys = TRUE))

datatable(Math_summary,
          extensions = "FixedColumns",
          options = list(
            dom = "t",
            scrollx = TRUE,
            fixedColumns = TRUE,
            keys = TRUE))

Visualizing the distribution of & median RIT score rate-of-change by school name, grade level, & education level

Hybrid rug & distribution plots were used to plot the distribution of RIT score changes. The median line was also calculated and represented using a dashed line. The median of a distribution is the point where the area under the curve is halved. (in other words, half of the student population lays on the left median and the other half on the right).

These two plots demonstrate that the majority of students have a positive rate of change ranging from 1% per year to up to 30% per year, with 50% of students having a growth rate of 4% or higher in 2019. However, these plots also demonstrate an overall need to address those who fall below the 1% positive rate-of-change area (RoC < 0.5). Overall, students are showing more growth in math than reading.

# Median 
median_reading_RoC <- median(Reading_GI$RoC)
median_math_RoC <- median(Math_GI$RoC)

# Reading dist plot
Reading_dist <- ggplot(Reading_GI, aes(x = RoC, color = 'density')) +
  geom_histogram(aes(y = ..density..), bins = 50, fill = '#67B7D1', alpha = 0.5) +
  geom_density(color = '#67B7D1') +
  geom_rug(color = '#67B7D1') +
  geom_vline(xintercept = median_reading_RoC, size = 0.5, color = "gray20", linetype = 2) +
  ylab("") +
  xlab("RIT Score Rate of Change") + theme(legend.title = element_blank()) +
  scale_color_manual(values = c('density' = '#67B7D1')) +
  theme(legend.position = "none") +
  annotate("text", x = 0.2, y = 10, label = "Median Rate of Change: 4.2%")

ggplotly(Reading_dist) %>% 
  layout(plot_bgcolor = '#e5ecf6',
        title = "Reading RIT Score RoC (Fall to Spring 2019)")

# Math dist plot
Math_dist <- ggplot(Math_GI, aes(x = RoC, color = 'density')) +
  geom_histogram(aes(y = ..density..), bins = 50, fill = '#67B7D1', alpha = 0.5) +
  geom_density(color = '#67B7D1') +
  geom_rug(color = '#67B7D1') +
  geom_vline(xintercept = median_math_RoC, size = 0.5, color = "gray20", linetype = 2) +
  ylab("") +
  xlab("RIT Score Rate of Change") + theme(legend.title = element_blank()) +
  scale_color_manual(values = c('density' = '#67B7D1')) +
  theme(legend.position = "none") +
  annotate("text", x = 0.2, y = 10, label = "Median Rate of Change: 4.6%")

ggplotly(Math_dist) %>% 
  layout(plot_bgcolor = '#e5ecf6',
         title = "Math RIT Score RoC (Fall to Spring 2019)")

Next, I compared the distribution of RIT score rate-of-change by school name. Visually, we can infer that students who attend Schools E, F, & G report higher rates of change on average, with the majority of students at schools F & G reporting RoCs that are above the median. This is evident in both reading and math. On the other hand, the majority of students at Schools A, B, C, & D have RoCs below the average. The distribution of student RoC is more concentrated in schools A, B, C, & D, suggesting that this phenomenon is school-wide.

school_reading_dist <- ggplot(data = Reading_GI) +
  geom_density(aes(x = RoC, color = school_name)) +
  geom_rug(aes(x= RoC, color = school_name)) +
  geom_vline(xintercept = median_reading_RoC, size = 0.5, color = "gray20", linetype = 2) +
  ylab("") + 
  xlab("RIT Score Rate of Change") +
  scale_color_brewer(palette = "Paired") +
  annotate("text", x = 0.2, y = 12, label = "Median Rate of Change: 4.2%")

ggplotly(school_reading_dist) %>% 
    layout(plot_bgcolor='#e5ecf6',   
             xaxis = list(   
               title='RIT Score Rate of Change', 
               zerolinecolor = '#ffff',   
               zerolinewidth = 2,   
               gridcolor = 'ffff'),   
             yaxis = list(   
               title='', 
               zerolinecolor = '#ffff',   
               zerolinewidth = 2,   
               gridcolor = 'ffff'),
         title = 'Reading RoC across Schools(Fall to Spring 2019)')

school_math_dist <- ggplot(data = Math_GI) +
  geom_density(aes(x = RoC, color = school_name)) +
  geom_rug(aes(x= RoC, color = school_name)) +
  geom_vline(xintercept = median_math_RoC, size = 0.5, color = "gray20", linetype = 2) +
  ylab("") + 
  xlab("RIT Score Rate of Change") +
  guides(color = guide_legend(title = "School")) +
  scale_color_brewer(palette = "Paired") +
  annotate("text", x = 0.2, y = 12, label = "Median Rate of Change: 4.6%")

ggplotly(school_math_dist) %>% 
    layout(plot_bgcolor='#e5ecf6',   
             xaxis = list(   
               title='RIT Score Rate of Change', 
               zerolinecolor = '#ffff',   
               zerolinewidth = 2,   
               gridcolor = 'ffff'),   
             yaxis = list(   
               title='', 
               zerolinecolor = '#ffff',   
               zerolinewidth = 2,   
               gridcolor = 'ffff'),
         title = 'Math RoC across Schools (Fall to Spring 2019)')

Similar results are seen when visualizing the distribution of RIT score rate-of-change across educational levels. Once again, schools which can be classified as “elementary schools” (E, F, & G) report a higher average for reading (.74) and math (.79) than “middle schools” (A, B, C, & D).

# Median rate of change for Elementary Schools (Reading)
median_reading_elementary_roc <- Reading_edu %>% filter(education_level == "Elementary School") %>% summarise(median(RoC))
# Median rate of change for Middle Schools (Reading)
median_reading_middle_roc <-Reading_edu %>% filter(education_level == "Middle School") %>% summarise(median(RoC))
# Unlist for ggplot
median_reading_elementary_roc$`median(RoC)` <- unlist(median_reading_elementary_roc$`median(RoC)`)
median_reading_middle_roc$`median(RoC)` <- unlist(median_reading_middle_roc$`median(RoC)`)

# Median rate of change for Elementary Schools (Math)
median_math_elementary_roc <- Math_edu %>% filter(education_level == "Elementary School") %>% summarise(median(RoC))
# Median rate of change for Middle Schools (Math)
median_math_middle_roc <- Reading_edu %>% filter(education_level == "Middle School") %>% summarise(median(RoC))
# Unlist for ggplot
median_math_elementary_roc$`median(RoC)` <- unlist(median_math_elementary_roc$`median(RoC)`)
median_math_middle_roc$`median(RoC)` <- unlist(median_math_middle_roc$`median(RoC)`)

edu_level_reading_dist <- ggplot(data = Reading_edu) +
  geom_density(aes(x = RoC, color = education_level)) +
  geom_rug(aes(x= RoC, color = education_level)) +
  geom_vline(xintercept = median_reading_elementary_roc$`median(RoC)`, size = 0.5, color = "skyblue3", linetype = 2) +
  geom_vline(xintercept = median_reading_middle_roc$`median(RoC)`, size = 0.5, color = "royalblue4", linetype = 2) +
  ylab("") + 
  xlab("RIT Score Rate of Change") +
  guides(color = guide_legend(title = "Education Level")) +
  scale_color_brewer(palette = "Paired") +
  annotate("text", x = 0.255, y = 12, label = "Middle School Median Rate of Change: 2.3%", fill = "royalblue4") +
  annotate("text", x = 0.33, y = 6, label = "Elementary School Median Rate of Change: 7.4%", fill = "skyblue")

ggplotly(edu_level_reading_dist) %>% 
    layout(plot_bgcolor='#e5ecf6',   
             xaxis = list(   
               title='RIT Score Rate of Change', 
               zerolinecolor = '#ffff',   
               zerolinewidth = 2,   
               gridcolor = 'ffff'),   
             yaxis = list(   
               title='', 
               zerolinecolor = '#ffff',   
               zerolinewidth = 2,   
               gridcolor = 'ffff'),
         title = 'Reading RoC across Educational Levels (Fall to Spring 2019)')

edu_level_math_dist <- ggplot(data = Math_edu) +
  geom_density(aes(x = RoC, color = education_level)) +
  geom_rug(aes(x= RoC, color = education_level)) +
  geom_vline(xintercept = median_math_elementary_roc$`median(RoC)`, size = 0.5, color = "skyblue3", linetype = 2) +
  geom_vline(xintercept = median_math_middle_roc$`median(RoC)`, size = 0.5, color = "royalblue4", linetype = 2) +
  ylab("") + 
  xlab("RIT Score Rate of Change") +
  guides(color = guide_legend(title = "Education Level")) +
  scale_color_brewer(palette = "Paired") +
  annotate("text", x = 0.255, y = 12, label = "Middle School Median Rate of Change: 2.5%", fill = "royalblue4") +
  annotate("text", x = 0.33, y = 6, label = "Elementary School Median Rate of Change: 7.9%", fill = "skyblue")

ggplotly(edu_level_math_dist) %>% 
    layout(plot_bgcolor='#e5ecf6',   
             xaxis = list(   
               title='RIT Score Rate of Change', 
               zerolinecolor = '#ffff',   
               zerolinewidth = 2,   
               gridcolor = 'ffff'),   
             yaxis = list(   
               title='', 
               zerolinecolor = '#ffff',   
               zerolinewidth = 2,   
               gridcolor = 'ffff'),
         title = 'Math RoC across Educational Levels (Fall to Spring 2019)')

Finally, I plotted the distribution of RIT Score rate-of-change by grade level. When comparing grades, there are noticeable decreases in RoC from Kindergarten to 1st grade, 1st grade to 2nd grade, and 5th grade to 6th grade. The distribution of RIT score RoC in 6th grade, 7th grade, and 8th grade are similar. These observations are present in both reading and math.

grade_reading_dist <- ggplot(data = Reading_grade) +
  geom_density(aes(x = RoC, color = grade_level_2019)) +
  geom_rug(aes(x= RoC, color = grade_level_2019)) +
  geom_vline(xintercept = median_reading_RoC, size = 0.5, color = "white", linetype = 2) +
  ylab("") + 
  xlab("RIT Score Rate of Change") +
  scale_color_brewer(palette = "Spectral") +
  annotate("text", x = 0.2, y = 12, label = "Median Rate of Change: 4.2%", color = "white")

ggplotly(grade_reading_dist) %>% 
    layout(plot_bgcolor='#34495E',   
             xaxis = list(   
               title='RIT Score Rate of Change', 
               zerolinecolor = '#ffff',   
               zerolinewidth = 2,   
               gridcolor = 'ffff'),   
             yaxis = list(   
               title='', 
               zerolinecolor = '#ffff',   
               zerolinewidth = 2,   
               gridcolor = 'ffff'),
         title = 'Reading RoC across Grade Levels (Fall to Spring 2019)')

grade_math_dist <- ggplot(data = Math_grade) +
  geom_density(aes(x = RoC, color = grade_level_2019)) +
  geom_rug(aes(x= RoC, color = grade_level_2019)) +
  geom_vline(xintercept = median_math_RoC, size = 0.5, color = "white", linetype = 2) +
  ylab("") + 
  xlab("RIT Score Rate of Change") +
  scale_color_brewer(palette = "Spectral")+
  annotate("text", x = 0.25, y = 14.5, label = "Median Rate of Change: 4.6%", color = "white")

ggplotly(grade_math_dist) %>% 
    layout(plot_bgcolor='#34495E',   
             xaxis = list(   
               title='RIT Score Rate of Change', 
               zerolinecolor = '#ffff',   
               zerolinewidth = 2,   
               gridcolor = 'ffff'),   
             yaxis = list(   
               title='', 
               zerolinecolor = '#ffff',   
               zerolinewidth = 2,   
               gridcolor = 'ffff'),
         title = 'Math RoC per Grade Level (Fall to Spring 2019)')

Statistical Analysis (Mood’s median test)

While these visualizations suggest that RIT score rate-of-change is dependent on schools, grade levels, and education levels, it is important to evaluate whether these observations are significant or occurring due to random chance. Mood’s median test is a variation of Pearson’s chi-squared test which tests for independent samples. This test seems appropriate for this dataset as the data examines students in independent samples (students from different schools, across different grades) Results indicate that differences in median RIT score rate-of-change are statistically significant. This means that the observable differences in MAP test performance from fall to spring between schools, grade levels, and educational levels are real, and (therefore should be further evaluated and addressed.) ### Reading RIT Score Median ROC Significance

median_test(RoC~school_name, Reading_GI)           #χ^2 = 316.18, p < .001

## 
##  Asymptotic K-Sample Brown-Mood Median Test
## 
## data:  RoC by
##   school_name (School A, School B, School C, School D, School E, School F, School G)
## chi-squared = 316.18, df = 6, p-value < 2.2e-16

median_test(RoC~grade_level_2019, Reading_grade)     #χ^2 = 601.46, p < .001

## 
##  Asymptotic K-Sample Brown-Mood Median Test
## 
## data:  RoC by
##   grade_level_2019 (0, 1, 2, 3, 4, 5, 6, 7, 8)
## chi-squared = 601.46, df = 8, p-value < 2.2e-16

median_test(RoC~education_level, Reading_edu)   #Z = 16.254, p < .001

## 
##  Asymptotic Two-Sample Brown-Mood Median Test
## 
## data:  RoC by
##   education_level (Elementary School, Middle School)
## Z = 16.254, p-value < 2.2e-16
## alternative hypothesis: true mu is not equal to 0

Math RIT Score Median ROC Significance

median_test(RoC~school_name, Math_GI)           #χ^2 = 331.08, p < .001

## 
##  Asymptotic K-Sample Brown-Mood Median Test
## 
## data:  RoC by
##   school_name (School A, School B, School C, School D, School E, School F, School G)
## chi-squared = 331.08, df = 6, p-value < 2.2e-16

median_test(RoC~grade_level_2019, Math_grade)     #χ^2 = 789.42, p < .001

## 
##  Asymptotic K-Sample Brown-Mood Median Test
## 
## data:  RoC by
##   grade_level_2019 (0, 1, 2, 3, 4, 5, 6, 7, 8)
## chi-squared = 789.42, df = 8, p-value < 2.2e-16

median_test(RoC~education_level, Math_edu)   #Z = 17.838, p < .001

## 
##  Asymptotic Two-Sample Brown-Mood Median Test
## 
## data:  RoC by
##   education_level (Elementary School, Middle School)
## Z = 17.838, p-value < 2.2e-16
## alternative hypothesis: true mu is not equal to 0

KIPP Performance Task Notes

IW

2022-11-08