Packages selected for inspecting and cleaning Varsity Tutors, district data from Cumberland schools.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.1 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.3 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Import roster.
roster <-read_csv("roster_vt_cumber.csv")
Rows: 14447 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): school
dbl (3): school_code, student_id, grade
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 10572 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): School Name, MOY MAP Math Math Test Date*, MOY MAP ELA Test Date*, ...
dbl (7): Student ID, Student Grade Level, MOY MAP Math Composite Scaled Scor...
lgl (2): Treatment, FRPL
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
view(moy_map)
Change column headers for moy_map.
moy_map = moy_map %>%rename(student_id ="Student ID",grade_moy ="Student Grade Level", school_moy ="School Name", treat_moy ="Treatment", moy_math_comp ="MOY MAP Math Composite Scaled Score*", moy_math_date ="MOY MAP Math Math Test Date*", moy_ela_comp ="MOY MAP ELA Composite Scaled Score*", moy_ela_date ="MOY MAP ELA Test Date*", gender_moy ="Gender", iep_moy ="IEP", ell_moy ="ELL", frpl_moy ="FRPL", race_eth_moy ="Race/Ethnicity", at_risk_moy ="Academically at-risk" )
Import End of Year data request.
eoy_map =read_csv("eoy_map_vt_cumber.csv")
Rows: 8810 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): School Name, EOY MAP Math Math Test Date*, EOY MAP ELA Test Date*, ...
dbl (7): Student ID, Student Grade Level, EOY MAP Math Composite Scaled Scor...
lgl (2): Treatment, FRPL
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
view(eoy_map)
Change column headers for eoy_map.
eoy_map = eoy_map %>%rename(student_id ="Student ID",grade_eoy ="Student Grade Level", school_eoy ="School Name", treat_eoy ="Treatment", eoy_math_comp ="EOY MAP Math Composite Scaled Score*", eoy_math_date ="EOY MAP Math Math Test Date*", eoy_ela_comp ="EOY MAP ELA Composite Scaled Score*", eoy_ela_date ="EOY MAP ELA Test Date*", gender_eoy ="Gender", iep_eoy ="IEP", ell_eoy ="ELL", frpl_eoy ="FRPL", race_eth_eoy ="Race/Ethnicity", at_risk_eoy ="Academically at-risk" )
Inspect duplicates
Roster
Identify duplicate student records within roster file.
# frequency table of student_id occurrencesid_counts_roster = roster %>%group_by(student_id) %>%summarise(n =n()) %>%arrange(desc(n))# filter only IDs that appear > 1dupes_roster = id_counts_roster %>%filter(n >1)# view resultsprint(dupes_roster)
Inspect duplicate student records in roster. Observation: for a given duplicate student record, it appears to be the same individual (e.g., same school, same grade)
# A tibble: 10,557 × 4
school_code school_ros student_id grade_ros
<dbl> <chr> <dbl> <dbl>
1 260336 Anne Chesnutt Middle 2287512462 6
2 260336 Anne Chesnutt Middle 4431265988 6
3 260336 Anne Chesnutt Middle 9779738622 6
4 260336 Anne Chesnutt Middle 3968556194 6
5 260336 Anne Chesnutt Middle 1925959163 6
6 260336 Anne Chesnutt Middle 2638992577 6
7 260336 Anne Chesnutt Middle 8925824655 6
8 260336 Anne Chesnutt Middle 4634477858 6
9 260336 Anne Chesnutt Middle 9836237186 6
10 260336 Anne Chesnutt Middle 1961537885 6
# ℹ 10,547 more rows
MOY
Identify duplicate student records within MOY file.
# frequency table of student_id occurrencesid_counts_moy = moy_map %>%group_by(student_id) %>%summarise(n =n()) %>%arrange(desc(n))# filter only IDs that appear > 1dupes_moy = id_counts_moy %>%filter(n >1)# view list of duplicate IDsprint(dupes_moy)
# A tibble: 1 × 2
student_id n
<dbl> <int>
1 7117828382 2
Inspect duplicate student record in the MOY file. Observation: Student records contain the same demographic info and math test info, however ELA test info (date and score) differ. Decision: Considering the student has complete test info in both observations, keep only the first student record (i.e., take the duplicate row with the earlier ELA test info).
# A tibble: 2 × 14
student_id grade_moy school_moy treat_moy moy_math_comp moy_math_date
<dbl> <dbl> <chr> <lgl> <dbl> <chr>
1 7117828382 7 Lewis Chapel Middle NA 204 1/8/2026
2 7117828382 7 Lewis Chapel Middle NA 204 1/8/2026
# ℹ 8 more variables: moy_ela_comp <dbl>, moy_ela_date <chr>, gender_moy <chr>,
# iep_moy <dbl>, ell_moy <dbl>, frpl_moy <lgl>, race_eth_moy <chr>,
# at_risk_moy <dbl>
Remove duplicate student record in MOY file following above decision.
# remove duplicates entirelymoy_no_dup = moy_map %>%group_by(student_id) %>%filter(n() ==1) %>%# keep only IDs that appear onceungroup()# pull out duplicate record with earliest ELA info to keepmoy_dup_keep = moy_map %>%filter(student_id =="7117828382", moy_ela_date =="1/7/2026")# append moy w/out dupes to selected dup recordmoy_unique =bind_rows(moy_no_dup, moy_dup_keep)print(moy_unique)
# A tibble: 10,571 × 14
student_id grade_moy school_moy treat_moy moy_math_comp moy_math_date
<dbl> <dbl> <chr> <lgl> <dbl> <chr>
1 2287512462 6 Anne Chesnutt Mid… NA 225 1/14/2026
2 4431265988 6 Anne Chesnutt Mid… NA 201 1/14/2026
3 9779738622 6 Anne Chesnutt Mid… NA 235 1/15/2026
4 3968556194 6 Anne Chesnutt Mid… NA 211 1/14/2026
5 1925959163 6 Anne Chesnutt Mid… NA 226 1/14/2026
6 2638992577 6 Anne Chesnutt Mid… NA 213 1/14/2026
7 8925824655 6 Anne Chesnutt Mid… NA 201 1/14/2026
8 4634477858 6 Anne Chesnutt Mid… NA 228 1/14/2026
9 9836237186 6 Anne Chesnutt Mid… NA 194 1/14/2026
10 1961537885 6 Anne Chesnutt Mid… NA 249 1/14/2026
# ℹ 10,561 more rows
# ℹ 8 more variables: moy_ela_comp <dbl>, moy_ela_date <chr>, gender_moy <chr>,
# iep_moy <dbl>, ell_moy <dbl>, frpl_moy <lgl>, race_eth_moy <chr>,
# at_risk_moy <dbl>
EOY
Identify duplicate student records within EOY file.
# frequency table of student_id occurrencesid_counts_eoy = eoy_map %>%group_by(student_id) %>%summarise(n =n()) %>%arrange(desc(n))# filter only IDs that appear > 1dupes_eoy = id_counts_eoy %>%filter(n >1)# view list of duplicate IDsprint(dupes_eoy)
# A tibble: 1 × 2
student_id n
<dbl> <int>
1 8354674797 2
Inspect duplicate student record in the MOY file. Observation: Student records contain the same demographic info and math test info, however ELA test info (date and score) differ. Decision: Considering the student has complete test info in both observations, keep only the first student record (i.e., take the duplicate row with the earlier ELA test info).
# A tibble: 2 × 14
student_id grade_eoy school_eoy treat_eoy eoy_math_comp eoy_math_date
<dbl> <dbl> <chr> <lgl> <dbl> <chr>
1 8354674797 6 Spring Lake Middle NA 206 4/17/2026
2 8354674797 6 Spring Lake Middle NA 206 4/17/2026
# ℹ 8 more variables: eoy_ela_comp <dbl>, eoy_ela_date <chr>, gender_eoy <chr>,
# iep_eoy <dbl>, ell_eoy <dbl>, frpl_eoy <lgl>, race_eth_eoy <chr>,
# at_risk_eoy <dbl>
Remove duplicate student record in EOY file.
# remove duplicates entirelyeoy_no_dup = eoy_map %>%group_by(student_id) %>%filter(n() ==1) %>%# keep only IDs that appear onceungroup()# pull out duplicate record with earliest ELA info to keepeoy_dup_keep = eoy_map %>%filter(student_id =="8354674797", eoy_ela_date =="4/20/2026")# append eoy w/out dupes to selected dup recordeoy_unique =bind_rows(eoy_no_dup, eoy_dup_keep)# preview resultprint(eoy_unique)
# A tibble: 8,809 × 14
student_id grade_eoy school_eoy treat_eoy eoy_math_comp eoy_math_date
<dbl> <dbl> <chr> <lgl> <dbl> <chr>
1 7729773724 8 Cumberland Virtua… NA 229 4/16/2026
2 4694886258 8 Cumberland Virtua… NA 240 4/15/2026
3 1836192339 8 Cumberland Virtua… NA 268 4/15/2026
4 7362151357 8 Cumberland Virtua… NA 224 4/15/2026
5 4215536561 8 Cumberland Virtua… NA 199 4/15/2026
6 3532188732 8 Cumberland Virtua… NA 252 4/15/2026
7 3482512388 8 Cumberland Virtua… NA 283 4/15/2026
8 4294541369 8 Cumberland Virtua… NA 236 4/15/2026
9 4855429317 8 Cumberland Virtua… NA 241 4/15/2026
10 1582791228 8 Cumberland Virtua… NA 239 4/15/2026
# ℹ 8,799 more rows
# ℹ 8 more variables: eoy_ela_comp <dbl>, eoy_ela_date <chr>, gender_eoy <chr>,
# iep_eoy <dbl>, ell_eoy <dbl>, frpl_eoy <lgl>, race_eth_eoy <chr>,
# at_risk_eoy <dbl>
Merge
Merge MOY to roster.
merge_roster_moy = roster_unique %>%left_join(moy_unique, by ="student_id")
Check unmatched records in intermediary data: 257 students in the roster with no appearance in MOY.
#check students in roster with NO match in MOYmissing_moy = roster_unique %>%anti_join(moy_unique, by ="student_id")cat("Number of roster students missing MOY data:", nrow(missing_moy), "\n")
Number of roster students missing MOY data: 257
Check unexpected records (MOY): 271 students present in the MOY file that do NOT present in the roster.
# check in MOY data NOT in rosterunexpected_moy_records = moy_unique %>%anti_join(roster_unique, by ="student_id")cat("Number of MOY records not in roster:", nrow(unexpected_moy_records), "\n")
Number of MOY records not in roster: 271
Match intermediary with EOY records by student ID.
merge_cumber = merge_roster_moy %>%left_join(eoy_unique, by ="student_id")
Check unmatched records in final merge: 1753 students in the intermediary file, merge_roster_moy, with no appearance in EOY.
#check students in merge_roster_moy with NO match in EOYmissing_eoy = merge_roster_moy %>%anti_join(eoy_unique, by ="student_id")cat("Number of students missing EOY data from merge_roster_moy:", nrow(missing_eoy), "\n")
Number of students missing EOY data from merge_roster_moy: 1753
Check unexpected records (EOY): 5 students present in the EOY file that not NOT present in the roster.
# check in EOY file NOT in roster (unexpected records)unexpected_eoy_records = eoy_unique %>%anti_join(roster_unique, by ="student_id")cat("Number of EOY records not in roster:", nrow(unexpected_eoy_records), "\n")
Number of EOY records not in roster: 5
Export unmatched (missing) and unexpected (extra) records.
Import roster from VT that details which students have accounts in their system
#import ID crosswalk from VTvt_roster =read_csv("vt_roster_all_districts.csv")
Rows: 1496 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): District, School
dbl (2): District ID, Varsity Tutors ID
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 4 × 5
# Groups: time [2]
time gender n total pct
<chr> <chr> <int> <int> <dbl>
1 eoy F 4188 8648 0.484
2 eoy M 4460 8648 0.516
3 moy F 4187 8648 0.484
4 moy M 4461 8648 0.516
IEPs: 5 additional students gained IEPs at the end of year.
RACE-ETHNICITY: Following categories appeared to have stayed more or less the same: A, B, W. There was a significant drop in the count of H students, suggesting that the district and/or families re-identified away from H in favor of other new categories: I, Multi, and P.
MOY: inspect racial categories and basic counts
table(merge_vt_cumber_unclean$race_eth_moy)
A B H P W
180 4075 2385 46 1962
EOY: inspect racial categories and basic counts
table(merge_vt_cumber_unclean$race_eth_eoy)
A B H I Multi P W
181 4075 1467 100 817 46 1962
Decision: Keep racial/ethnicity categorizations and gender identifications from EOY file. All other demographic data keep from MOY file.
Percentage missing for each variable. Note: percentage missing for at_risk probably indicates the percent NOT at-risk. Observations: FRPL may be missing for all students or district forgot to add to file. All other demographic info is present for all students. Some missing data for test info.
all_tests: dummy for students with both math and ELA test info complete across MOY and EOY.
table(final_vt_cumber_test_info$all_tests)
FALSE TRUE
2337 6311
#complete math and ELA: MOY and EOY presentfinal_vt_cumber_test_info$all_tests =ifelse(final_vt_cumber_test_info$all_tests, 1, 0)table(final_vt_cumber_test_info$all_tests) #check counts
0 1
2337 6311
math_moy_missing_eoy_present: dummy for students who have EOY math test info but MOY test info is missing.
Generate school-level means for MOY ELA and Math assessment data, number of students, percentages of students by race/ethnicity, gender, and FRPL-status.
Save e2i file with all matched students from the merge, regardless of missing test info.
#select columns ready for e2ivt_cumber_e2i_test_info = final_vt_cumber_test_info %>% dplyr::select(school_code, student_id, grade, school, moy_math_comp, moy_math_date, moy_ela_comp,moy_ela_date, eoy_math_comp, eoy_math_date, eoy_ela_comp, eoy_ela_date, male, iep, ell,frpl, at_risk, treat, race_A, race_B, race_H, race_I, race_Multi, race_P, race_W, both_math, both_ela, all_tests, math_moy_missing_eoy_present, ela_moy_missing_eoy_present)#export e2i with test indicators, all matched students from the merge, regardless of missing test info (all_tests == 1 or 0)write.csv(vt_cumber_e2i_test_info, file ="vt_cumber_e2i_all_students.csv", row.names =FALSE)
Save e2i file with only containing matched students from the merge with complete test info.
table(final_vt_cumber_test_info$all_tests)
0 1
2337 6311
#subset: keep only students with complete test info (all_tests == 1)vt_cumber_e2i_all_tests = vt_cumber_e2i_test_info %>%filter(all_tests ==1)print(vt_cumber_e2i_all_tests) # N=6311
Count of treatment students among the subset with complete test info.
table(vt_cumber_e2i_all_tests$treat)
0 1
5708 603
#export e2i only containing matched students from the merge with complete test info (all_tests == 1)write.csv(vt_cumber_e2i_all_tests, file ="vt_cumber_e2i_complete_tests.csv", row.names =FALSE)