Packages selected for inspecting and cleaning Varsity Tutors, district data from Cumberland schools.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.1 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.3 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Import roster.
roster <-read_csv("roster_vt_cumber.csv")
Rows: 14447 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): school
dbl (3): school_code, student_id, grade
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 10572 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): School Name, MOY MAP Math Math Test Date*, MOY MAP ELA Test Date*, ...
dbl (7): Student ID, Student Grade Level, MOY MAP Math Composite Scaled Scor...
lgl (2): Treatment, FRPL
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
view(moy_map)
Change column headers for moy_map.
moy_map = moy_map %>%rename(student_id ="Student ID",grade_moy ="Student Grade Level", school_moy ="School Name", treat_moy ="Treatment", moy_math_comp ="MOY MAP Math Composite Scaled Score*", moy_math_date ="MOY MAP Math Math Test Date*", moy_ela_comp ="MOY MAP ELA Composite Scaled Score*", moy_ela_date ="MOY MAP ELA Test Date*", gender_moy ="Gender", iep_moy ="IEP", ell_moy ="ELL", frpl_moy ="FRPL", race_eth_moy ="Race/Ethnicity", at_risk_moy ="Academically at-risk" )
Import End of Year data request.
eoy_map =read_csv("eoy_map_vt_cumber.csv")
Rows: 8810 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): School Name, EOY MAP Math Math Test Date*, EOY MAP ELA Test Date*, ...
dbl (7): Student ID, Student Grade Level, EOY MAP Math Composite Scaled Scor...
lgl (2): Treatment, FRPL
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
view(eoy_map)
Change column headers for eoy_map.
eoy_map = eoy_map %>%rename(student_id ="Student ID",grade_eoy ="Student Grade Level", school_eoy ="School Name", treat_eoy ="Treatment", eoy_math_comp ="EOY MAP Math Composite Scaled Score*", eoy_math_date ="EOY MAP Math Math Test Date*", eoy_ela_comp ="EOY MAP ELA Composite Scaled Score*", eoy_ela_date ="EOY MAP ELA Test Date*", gender_eoy ="Gender", iep_eoy ="IEP", ell_eoy ="ELL", frpl_eoy ="FRPL", race_eth_eoy ="Race/Ethnicity", at_risk_eoy ="Academically at-risk" )
Inspect duplicates
Roster
Identify duplicate student records within roster file.
# frequency table of student_id occurrencesid_counts_roster = roster %>%group_by(student_id) %>%summarise(n =n()) %>%arrange(desc(n))# filter only IDs that appear > 1dupes_roster = id_counts_roster %>%filter(n >1)# view resultsprint(dupes_roster)
Inspect duplicate student records in roster. Observation: for a given duplicate student record, it appears to be the same individual (e.g., same school, same grade)
# A tibble: 10,557 × 5
school_code school_ros student_id grade_ros school_num
<dbl> <chr> <dbl> <dbl> <dbl>
1 260336 Anne Chesnutt Middle 2287512462 6 1
2 260336 Anne Chesnutt Middle 4431265988 6 1
3 260336 Anne Chesnutt Middle 9779738622 6 1
4 260336 Anne Chesnutt Middle 3968556194 6 1
5 260336 Anne Chesnutt Middle 1925959163 6 1
6 260336 Anne Chesnutt Middle 2638992577 6 1
7 260336 Anne Chesnutt Middle 8925824655 6 1
8 260336 Anne Chesnutt Middle 4634477858 6 1
9 260336 Anne Chesnutt Middle 9836237186 6 1
10 260336 Anne Chesnutt Middle 1961537885 6 1
# ℹ 10,547 more rows
MOY
Identify duplicate student records within MOY file.
# frequency table of student_id occurrencesid_counts_moy = moy_map %>%group_by(student_id) %>%summarise(n =n()) %>%arrange(desc(n))# filter only IDs that appear > 1dupes_moy = id_counts_moy %>%filter(n >1)# view list of duplicate IDsprint(dupes_moy)
# A tibble: 1 × 2
student_id n
<dbl> <int>
1 7117828382 2
Inspect duplicate student record in the MOY file. Observation: Student records contain the same demographic info and math test info, however ELA test info (date and score) differ. Decision: Considering the student has complete test info in both observations, keep only the first student record (i.e., take the duplicate row with the earlier ELA test info).
# A tibble: 2 × 14
student_id grade_moy school_moy treat_moy moy_math_comp moy_math_date
<dbl> <dbl> <chr> <lgl> <dbl> <chr>
1 7117828382 7 Lewis Chapel Middle NA 204 1/8/2026
2 7117828382 7 Lewis Chapel Middle NA 204 1/8/2026
# ℹ 8 more variables: moy_ela_comp <dbl>, moy_ela_date <chr>, gender_moy <chr>,
# iep_moy <dbl>, ell_moy <dbl>, frpl_moy <lgl>, race_eth_moy <chr>,
# at_risk_moy <dbl>
Remove duplicate student record in MOY file following above decision.
# remove duplicates entirelymoy_no_dup = moy_map %>%group_by(student_id) %>%filter(n() ==1) %>%# keep only IDs that appear onceungroup()# pull out duplicate record with earliest ELA info to keepmoy_dup_keep = moy_map %>%filter(student_id =="7117828382", moy_ela_date =="1/7/2026")# append moy w/out dupes to selected dup recordmoy_unique =bind_rows(moy_no_dup, moy_dup_keep)print(moy_unique)
# A tibble: 10,571 × 14
student_id grade_moy school_moy treat_moy moy_math_comp moy_math_date
<dbl> <dbl> <chr> <lgl> <dbl> <chr>
1 2287512462 6 Anne Chesnutt Mid… NA 225 1/14/2026
2 4431265988 6 Anne Chesnutt Mid… NA 201 1/14/2026
3 9779738622 6 Anne Chesnutt Mid… NA 235 1/15/2026
4 3968556194 6 Anne Chesnutt Mid… NA 211 1/14/2026
5 1925959163 6 Anne Chesnutt Mid… NA 226 1/14/2026
6 2638992577 6 Anne Chesnutt Mid… NA 213 1/14/2026
7 8925824655 6 Anne Chesnutt Mid… NA 201 1/14/2026
8 4634477858 6 Anne Chesnutt Mid… NA 228 1/14/2026
9 9836237186 6 Anne Chesnutt Mid… NA 194 1/14/2026
10 1961537885 6 Anne Chesnutt Mid… NA 249 1/14/2026
# ℹ 10,561 more rows
# ℹ 8 more variables: moy_ela_comp <dbl>, moy_ela_date <chr>, gender_moy <chr>,
# iep_moy <dbl>, ell_moy <dbl>, frpl_moy <lgl>, race_eth_moy <chr>,
# at_risk_moy <dbl>
EOY
Identify duplicate student records within EOY file.
# frequency table of student_id occurrencesid_counts_eoy = eoy_map %>%group_by(student_id) %>%summarise(n =n()) %>%arrange(desc(n))# filter only IDs that appear > 1dupes_eoy = id_counts_eoy %>%filter(n >1)# view list of duplicate IDsprint(dupes_eoy)
# A tibble: 1 × 2
student_id n
<dbl> <int>
1 8354674797 2
Inspect duplicate student record in the MOY file. Observation: Student records contain the same demographic info and math test info, however ELA test info (date and score) differ. Decision: Considering the student has complete test info in both observations, keep only the first student record (i.e., take the duplicate row with the earlier ELA test info).
# A tibble: 2 × 14
student_id grade_eoy school_eoy treat_eoy eoy_math_comp eoy_math_date
<dbl> <dbl> <chr> <lgl> <dbl> <chr>
1 8354674797 6 Spring Lake Middle NA 206 4/17/2026
2 8354674797 6 Spring Lake Middle NA 206 4/17/2026
# ℹ 8 more variables: eoy_ela_comp <dbl>, eoy_ela_date <chr>, gender_eoy <chr>,
# iep_eoy <dbl>, ell_eoy <dbl>, frpl_eoy <lgl>, race_eth_eoy <chr>,
# at_risk_eoy <dbl>
Remove duplicate student record in EOY file.
# remove duplicates entirelyeoy_no_dup = eoy_map %>%group_by(student_id) %>%filter(n() ==1) %>%# keep only IDs that appear onceungroup()# pull out duplicate record with earliest ELA info to keepeoy_dup_keep = eoy_map %>%filter(student_id =="8354674797", eoy_ela_date =="4/20/2026")# append eoy w/out dupes to selected dup recordeoy_unique =bind_rows(eoy_no_dup, eoy_dup_keep)# preview resultprint(eoy_unique)
# A tibble: 8,809 × 14
student_id grade_eoy school_eoy treat_eoy eoy_math_comp eoy_math_date
<dbl> <dbl> <chr> <lgl> <dbl> <chr>
1 7729773724 8 Cumberland Virtua… NA 229 4/16/2026
2 4694886258 8 Cumberland Virtua… NA 240 4/15/2026
3 1836192339 8 Cumberland Virtua… NA 268 4/15/2026
4 7362151357 8 Cumberland Virtua… NA 224 4/15/2026
5 4215536561 8 Cumberland Virtua… NA 199 4/15/2026
6 3532188732 8 Cumberland Virtua… NA 252 4/15/2026
7 3482512388 8 Cumberland Virtua… NA 283 4/15/2026
8 4294541369 8 Cumberland Virtua… NA 236 4/15/2026
9 4855429317 8 Cumberland Virtua… NA 241 4/15/2026
10 1582791228 8 Cumberland Virtua… NA 239 4/15/2026
# ℹ 8,799 more rows
# ℹ 8 more variables: eoy_ela_comp <dbl>, eoy_ela_date <chr>, gender_eoy <chr>,
# iep_eoy <dbl>, ell_eoy <dbl>, frpl_eoy <lgl>, race_eth_eoy <chr>,
# at_risk_eoy <dbl>
Merge
Merge MOY to roster.
merge_roster_moy = roster_unique %>%left_join(moy_unique, by ="student_id")
Check unmatched records in intermediary data: 257 students in the roster with no appearance in MOY.
#check students in roster with NO match in MOYmissing_moy = roster_unique %>%anti_join(moy_unique, by ="student_id")cat("Number of roster students missing MOY data:", nrow(missing_moy), "\n")
Number of roster students missing MOY data: 257
Check unexpected records (MOY): 271 students present in the MOY file that do NOT present in the roster.
# check in MOY data NOT in rosterunexpected_moy_records = moy_unique %>%anti_join(roster_unique, by ="student_id")cat("Number of MOY records not in roster:", nrow(unexpected_moy_records), "\n")
Number of MOY records not in roster: 271
Match intermediary with EOY records by student ID.
merge_cumber = merge_roster_moy %>%left_join(eoy_unique, by ="student_id")
Check unmatched records in final merge: 1753 students in the intermediary file, merge_roster_moy, with no appearance in EOY.
#check students in merge_roster_moy with NO match in EOYmissing_eoy = merge_roster_moy %>%anti_join(eoy_unique, by ="student_id")cat("Number of students missing EOY data from merge_roster_moy:", nrow(missing_eoy), "\n")
Number of students missing EOY data from merge_roster_moy: 1753
Check unexpected records (EOY): 5 students present in the EOY file that not NOT present in the roster.
# check in EOY file NOT in roster (unexpected records)unexpected_eoy_records = eoy_unique %>%anti_join(roster_unique, by ="student_id")cat("Number of EOY records not in roster:", nrow(unexpected_eoy_records), "\n")
Number of EOY records not in roster: 5
Export unmatched (missing) and unexpected (extra) records.
Import roster from VT that details which students have accounts in their system
#import ID crosswalk from VTvt_roster =read_csv("vt_roster_all_districts.csv")
Rows: 1496 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): District, School
dbl (2): District ID, Varsity Tutors ID
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 4 × 5
# Groups: time [2]
time gender n total pct
<chr> <chr> <int> <int> <dbl>
1 eoy F 4188 8648 0.484
2 eoy M 4460 8648 0.516
3 moy F 4187 8648 0.484
4 moy M 4461 8648 0.516
IEPs: 5 additional students gained IEPs at the end of year.
RACE-ETHNICITY: Following categories appeared to have stayed more or less the same: A, B, W. There was a significant drop in the count of H students, suggesting that the district and/or families re-identified away from H in favor of other new categories: I, Multi, and P.
MOY: inspect racial categories and basic counts
table(merge_vt_cumber_unclean$race_eth_moy)
A B H P W
180 4075 2385 46 1962
EOY: inspect racial categories and basic counts
table(merge_vt_cumber_unclean$race_eth_eoy)
A B H I Multi P W
181 4075 1467 100 817 46 1962
Decision: Keep racial/ethnicity categorizations and gender identifications from EOY file. All other demographic data keep from MOY file.
Percentage missing for each variable. Note: percentage missing for at_risk probably indicates the percent NOT at-risk. Observations: FRPL may be missing for all students or district forgot to add to file. All other demographic info is present for all students. Some missing data for test info.
all_tests: dummy for students with both math and ELA test info complete across MOY and EOY.
table(final_vt_cumber_test_info$all_tests)
FALSE TRUE
2337 6311
#complete math and ELA: MOY and EOY presentfinal_vt_cumber_test_info$all_tests =ifelse(final_vt_cumber_test_info$all_tests, 1, 0)table(final_vt_cumber_test_info$all_tests) #check counts
0 1
2337 6311
math_moy_missing_eoy_present: dummy for students who have EOY math test info but MOY test info is missing.
Generate school-level means for MOY ELA and Math assessment data, number of students, percentages of students by race/ethnicity, gender, and FRPL-status.
#import cumber usage data from VTvt_usage_cumber =read_csv("vt_usage_cumber.csv")
Rows: 4212 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Sch_name*, Sch_BOY_date*, Sch_EOY_date
dbl (7): Stu_ID_VT*, Num_days_active*, Num_sessions_AI, Num_sessions_human, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Inspect duplicate student records in usage file (based on VT Account ID - not student ID). Observation: for a given duplicate student record, school name,num_days_active, BOY date and EOY date will match but usage data differs across the duplicate records.
# A tibble: 4,081 × 10
school_vt school_BOY_date school_EOY_date vt_acct_id num_days_active
<chr> <chr> <chr> <dbl> <dbl>
1 Luther Nick Jeral… 2/2/2026 4/14/2026 13158251 6
2 Luther Nick Jeral… 2/2/2026 4/14/2026 4502225 10
3 Luther Nick Jeral… 2/2/2026 4/14/2026 13158251 6
4 Luther Nick Jeral… 2/2/2026 4/14/2026 4482381 3
5 Luther Nick Jeral… 2/2/2026 4/14/2026 4482381 3
6 Luther Nick Jeral… 2/2/2026 4/14/2026 4505683 14
7 Luther Nick Jeral… 2/2/2026 4/14/2026 13155952 14
8 Luther Nick Jeral… 2/2/2026 4/14/2026 13160999 6
9 Luther Nick Jeral… 2/2/2026 4/14/2026 4482466 6
10 Luther Nick Jeral… 2/2/2026 4/14/2026 4502186 10
# ℹ 4,071 more rows
# ℹ 5 more variables: num_sessions_AI <dbl>, num_sessions_human <dbl>,
# num_classesviewed <dbl>, num_essaysreviewed <dbl>,
# num_AI_practiceproblemsessions <dbl>
Decision: Collapse rows for each student ID. Per VT, these instances are rows that only capture one of the students multiple days active and need to be compressed.
Basic check of randomly selected student, 13159482. Inspect sum of Num_sessions_AI before collapsing.
Save e2i file with all matched students from the merge, regardless of missing test info.
#select columns ready for e2ivt_cumber_e2i_usage_test_info = merge_vt_cumber_usage_test_info %>% dplyr::select(school_num, student_id, grade, moy_math_comp, moy_math_date, moy_ela_comp,moy_ela_date, eoy_math_comp, eoy_math_date, eoy_ela_comp, eoy_ela_date, male, iep, ell,frpl, at_risk, treat, race_A, race_B, race_H, race_I, race_Multi, race_P, race_W, both_math, both_ela, all_tests, math_moy_missing_eoy_present, ela_moy_missing_eoy_present, vt_acct_id, num_days_active,num_sessions_AI, num_sessions_human, num_classesviewed, num_essaysreviewed, num_AI_practiceproblemsessions, school_BOY_date, school_EOY_date)#export e2i with test indicators, all matched students from the merge, regardless of missing test info (all_tests == 1 or 0)write.csv(vt_cumber_e2i_usage_test_info, file ="vt_cumber_e2i_all_students.csv", row.names =FALSE)
Save e2i file with only containing matched students from the merge with complete test info.
table(vt_cumber_e2i_usage_test_info$all_tests)
0 1
2337 6311
#subset: keep only students with complete test info (all_tests == 1)vt_cumber_e2i_all_tests = vt_cumber_e2i_usage_test_info %>%filter(all_tests ==1)print(vt_cumber_e2i_all_tests) # N=6311
Count of treatment students among the subset with complete test info.
table(vt_cumber_e2i_all_tests$treat)
0 1
5708 603
#export e2i only containing matched students from the merge with complete test info (all_tests == 1)write.csv(vt_cumber_e2i_all_tests, file ="vt_cumber_e2i_complete_tests.csv", row.names =FALSE)