Packages selected for inspecting and cleaning Varsity Tutors, district data from Cumberland schools.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.1 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.3 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Import roster.
roster <-read_csv("roster_vt_cumber.csv")
Rows: 14447 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): school
dbl (3): school_code, student_id, grade
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 10572 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): School Name, MOY MAP Math Math Test Date*, MOY MAP ELA Test Date*, ...
dbl (7): Student ID, Student Grade Level, MOY MAP Math Composite Scaled Scor...
lgl (2): Treatment, FRPL
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
view(moy_map)
Change column headers for moy_map.
moy_map = moy_map %>%rename(student_id ="Student ID",grade_moy ="Student Grade Level", school_moy ="School Name", treat_moy ="Treatment", moy_math_comp ="MOY MAP Math Composite Scaled Score*", moy_math_date ="MOY MAP Math Math Test Date*", moy_ela_comp ="MOY MAP ELA Composite Scaled Score*", moy_ela_date ="MOY MAP ELA Test Date*", gender_moy ="Gender", iep_moy ="IEP", ell_moy ="ELL", frpl_moy ="FRPL", race_eth_moy ="Race/Ethnicity", at_risk_moy ="Academically at-risk" )
Import End of Year data request.
eoy_map =read_csv("eoy_map_vt_cumber.csv")
Rows: 8810 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): School Name, EOY MAP Math Math Test Date*, EOY MAP ELA Test Date*, ...
dbl (7): Student ID, Student Grade Level, EOY MAP Math Composite Scaled Scor...
lgl (2): Treatment, FRPL
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
view(eoy_map)
Change column headers for eoy_map.
eoy_map = eoy_map %>%rename(student_id ="Student ID",grade_eoy ="Student Grade Level", school_eoy ="School Name", treat_eoy ="Treatment", eoy_math_comp ="EOY MAP Math Composite Scaled Score*", eoy_math_date ="EOY MAP Math Math Test Date*", eoy_ela_comp ="EOY MAP ELA Composite Scaled Score*", eoy_ela_date ="EOY MAP ELA Test Date*", gender_eoy ="Gender", iep_eoy ="IEP", ell_eoy ="ELL", frpl_eoy ="FRPL", race_eth_eoy ="Race/Ethnicity", at_risk_eoy ="Academically at-risk" )
Inspect duplicates
Roster
Identify duplicate student records within roster file.
# frequency table of student_id occurrencesid_counts_roster = roster %>%group_by(student_id) %>%summarise(n =n()) %>%arrange(desc(n))# filter only IDs that appear > 1dupes_roster = id_counts_roster %>%filter(n >1)# view resultsprint(dupes_roster)
Inspect duplicate student records in roster. Observation: for a given duplicate student record, it appears to be the same individual (e.g., same school, same grade)
# A tibble: 10,557 × 4
school_code school_ros student_id grade_ros
<dbl> <chr> <dbl> <dbl>
1 260336 Anne Chesnutt Middle 2287512462 6
2 260336 Anne Chesnutt Middle 4431265988 6
3 260336 Anne Chesnutt Middle 9779738622 6
4 260336 Anne Chesnutt Middle 3968556194 6
5 260336 Anne Chesnutt Middle 1925959163 6
6 260336 Anne Chesnutt Middle 2638992577 6
7 260336 Anne Chesnutt Middle 8925824655 6
8 260336 Anne Chesnutt Middle 4634477858 6
9 260336 Anne Chesnutt Middle 9836237186 6
10 260336 Anne Chesnutt Middle 1961537885 6
# ℹ 10,547 more rows
MOY
Identify duplicate student records within MOY file.
# frequency table of student_id occurrencesid_counts_moy = moy_map %>%group_by(student_id) %>%summarise(n =n()) %>%arrange(desc(n))# filter only IDs that appear > 1dupes_moy = id_counts_moy %>%filter(n >1)# view list of duplicate IDsprint(dupes_moy)
# A tibble: 1 × 2
student_id n
<dbl> <int>
1 7117828382 2
Inspect duplicate student record in the MOY file. Observation: Student records contain the same demographic info and math test info, however ELA test info (date and score) differ. Decision: Remove student duplicate record.
# A tibble: 10,570 × 14
student_id grade_moy school_moy treat_moy moy_math_comp moy_math_date
<dbl> <dbl> <chr> <lgl> <dbl> <chr>
1 2287512462 6 Anne Chesnutt Mid… NA 225 1/14/2026
2 4431265988 6 Anne Chesnutt Mid… NA 201 1/14/2026
3 9779738622 6 Anne Chesnutt Mid… NA 235 1/15/2026
4 3968556194 6 Anne Chesnutt Mid… NA 211 1/14/2026
5 1925959163 6 Anne Chesnutt Mid… NA 226 1/14/2026
6 2638992577 6 Anne Chesnutt Mid… NA 213 1/14/2026
7 8925824655 6 Anne Chesnutt Mid… NA 201 1/14/2026
8 4634477858 6 Anne Chesnutt Mid… NA 228 1/14/2026
9 9836237186 6 Anne Chesnutt Mid… NA 194 1/14/2026
10 1961537885 6 Anne Chesnutt Mid… NA 249 1/14/2026
# ℹ 10,560 more rows
# ℹ 8 more variables: moy_ela_comp <dbl>, moy_ela_date <chr>, gender_moy <chr>,
# iep_moy <dbl>, ell_moy <dbl>, frpl_moy <lgl>, race_eth_moy <chr>,
# at_risk_moy <dbl>
EOY
Identify duplicate student records within EOY file.
# frequency table of student_id occurrencesid_counts_eoy = eoy_map %>%group_by(student_id) %>%summarise(n =n()) %>%arrange(desc(n))# filter only IDs that appear > 1dupes_eoy = id_counts_eoy %>%filter(n >1)# view list of duplicate IDsprint(dupes_eoy)
# A tibble: 1 × 2
student_id n
<dbl> <int>
1 8354674797 2
Inspect duplicate student record in the MOY file. Observation: Student records contain the same demographic info and math test info, however ELA test info (date and score) differ. Decision: Remove student duplicate record.
# A tibble: 8,808 × 14
student_id grade_eoy school_eoy treat_eoy eoy_math_comp eoy_math_date
<dbl> <dbl> <chr> <lgl> <dbl> <chr>
1 7729773724 8 Cumberland Virtua… NA 229 4/16/2026
2 4694886258 8 Cumberland Virtua… NA 240 4/15/2026
3 1836192339 8 Cumberland Virtua… NA 268 4/15/2026
4 7362151357 8 Cumberland Virtua… NA 224 4/15/2026
5 4215536561 8 Cumberland Virtua… NA 199 4/15/2026
6 3532188732 8 Cumberland Virtua… NA 252 4/15/2026
7 3482512388 8 Cumberland Virtua… NA 283 4/15/2026
8 4294541369 8 Cumberland Virtua… NA 236 4/15/2026
9 4855429317 8 Cumberland Virtua… NA 241 4/15/2026
10 1582791228 8 Cumberland Virtua… NA 239 4/15/2026
# ℹ 8,798 more rows
# ℹ 8 more variables: eoy_ela_comp <dbl>, eoy_ela_date <chr>, gender_eoy <chr>,
# iep_eoy <dbl>, ell_eoy <dbl>, frpl_eoy <lgl>, race_eth_eoy <chr>,
# at_risk_eoy <dbl>
Merge
Merge MOY to roster.
merge_roster_moy = roster_unique %>%left_join(moy_unique, by ="student_id")
Check unmatched records in intermediary data: 258 students in the roster with no appearance in MOY.
#check students in roster with NO match in MOYmissing_moy = roster_unique %>%anti_join(moy_unique, by ="student_id")cat("Number of roster students missing MOY data:", nrow(missing_moy), "\n")
Number of roster students missing MOY data: 258
Check unexpected records (MOY): 271 students present in the MOY file that do NOT present in the roster.
# check in MOY data NOT in rosterunexpected_moy_records = moy_unique %>%anti_join(roster_unique, by ="student_id")cat("Number of MOY records not in roster:", nrow(unexpected_moy_records), "\n")
Number of MOY records not in roster: 271
Match intermediary with EOY records by student ID.
merge_cumber = merge_roster_moy %>%left_join(eoy_unique, by ="student_id")
Check unmatched records in final merge: 1754 students in the intermediary file, merge_roster_moy, with no appearance in EOY.
#check students in merge_roster_moy with NO match in EOYmissing_eoy = merge_roster_moy %>%anti_join(eoy_unique, by ="student_id")cat("Number of students missing EOY data from merge_roster_moy:", nrow(missing_eoy), "\n")
Number of students missing EOY data from merge_roster_moy: 1754
Check unexpected records (EOY): 5 students present in the EOY file that not NOT present in the roster.
# check in EOY file NOT in roster (unexpected records)unexpected_eoy_records = eoy_unique %>%anti_join(roster_unique, by ="student_id")cat("Number of EOY records not in roster:", nrow(unexpected_eoy_records), "\n")
Number of EOY records not in roster: 5
Export unmatched (missing) and unexpected (extra) records.
Keep rows where ID presents across the de-duped roster, moy file, and eoy file.
roster_moy_matched = roster_unique %>%inner_join(moy_unique, by ="student_id")final_vt_cumber_unclean = roster_moy_matched %>%inner_join(eoy_unique, by ="student_id")
Check for requested data elements. Keep only necessary columns (only keep the latest demographic info from eoy file). No missing columns following the data request.
Calculate the percentage missing for each variable. Note: percentage missing for at_risk probably indicates the percent NOT at-risk. Observations: treatment status is missing for all students; FRPL may be missing for all students or district forgot to add to file. All other demographic info is present for all students. Some missing data for test info.
# A tibble: 17 × 4
school_eoy total_students students_with_all_te…¹ percent_with_all_tests
<chr> <int> <int> <dbl>
1 Seventy-First C… 387 384 99.2
2 Cumberland Virt… 127 125 98.4
3 Anne Chesnutt M… 491 481 98.0
4 South View Midd… 567 529 93.3
5 R Max Abbott Mi… 685 630 92.0
6 Lewis Chapel Mi… 502 439 87.4
7 Gray's Creek Mi… 979 830 84.8
8 Luther Nick Jer… 515 384 74.6
9 Pine Forest Mid… 629 463 73.6
10 Douglas Byrd Mi… 632 464 73.4
11 Hope Mills Midd… 410 300 73.2
12 Westover Middle 630 461 73.2
13 Spring Lake Mid… 423 268 63.4
14 Mac Williams Mi… 796 411 51.6
15 Howard Learning… 115 48 41.7
16 John R Griffin … 558 92 16.5
17 Reid Ross Class… 200 0 0
# ℹ abbreviated name: ¹students_with_all_tests
Data Summary
Generate school-level means for MOY ELA and Math assessment data, number of students, percentages of students by race/ethnicity, gender, and FRPL-status.