Welcome to CSC 110 Lab 2! In this lab, you will work with data on schools from the Common Core of Data (CCD), collected by the National Center for Education Statistics. You can read more about the CCD and and find documentation on this datafile here.
Before attempting this lab, you should read Chapter 2 of the LOTF report. We are working with the 2013-14 dataset, for consistency with the LOTF report. This dataset contains a codebook that explains all the variables, which you can find in the data subfolder of this directory, or by clicking here.
This is a pair programming assignment: one student acts as the driver, the other as the navigator. The driver writes code while the navigator directs them on what to write based on the assignment instructions, and provides feedback and help as needed. You should rotate these roles roughly every 15 minutes so each person gets plenty of time in the “driver’s seat.”
Lab assignments can be taken open-book using the R4DS textbook and open-note using the notes you’ve taken and the materials we have covered in class. You should not rely on any other outside resources. Usage of generative AI tools like ChatGPT, Claude, or DeepSeek is not allowed on this assignment.
Next, create a data frame named “schools” and load the data from our
ccd_cms_1314.xlsx data file.
At this point, you should take some time to explore the data and jot
down your observations about what information the schools
data frame contains. This dataset has a lot of variables, many of which
(but not all) you will need for this assignment.
You should also spend some time reviewing the data definitions file that is included with this dataset, which you find here. This file contains critical information that will guide your lab team as you write your code.
In this part of the lab, we’ll get to know the CMS and CCD data.
Write a data pipeline that overwrites the original
schools data frame and creates a new variable named
sch_type. This variable should display either “Charter”,
“Magnet”, or “TPS”, to specify if the school is a charter, magnet, or
traditional public school. Make sure that the output of your pipeline
displays your new variable and all other variables from the original
data frame.
(schools <- schools |>
mutate(sch_type = case_when(
magnet == 1 ~ "Magnet",
chartr == 1 ~ "Charter",
TRUE ~ "TPS"
)
)
)
## # A tibble: 189 × 25
## ncessch leanm schnam mcity mstate fte gslo gshi level titlei magnet
## <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 370003202327 COMMU… COMMU… CHAR… NC 10.8 KG 05 1 1 2
## 2 370006302398 KENNE… KENNE… CHAR… NC 29.7 KG 12 4 1 2
## 3 370006402399 LAKE … LAKE … HUNT… NC 90.4 05 12 4 2 2
## 4 370008902472 SUGAR… SUGAR… CHAR… NC 48.6 KG 08 1 1 2
## 5 370010702530 METRO… METRO… CHAR… NC 26.9 KG 08 1 2 2
## 6 370012202591 CROSS… CROSS… CHAR… NC 14.4 09 12 3 1 2
## 7 370013502736 COMMU… COMMU… DAVI… NC 90.2 KG 12 4 2 2
## 8 370014202922 KIPP:… KIPP … CHAR… NC 27.7 05 08 2 1 2
## 9 370014302924 CHARL… CHARL… CHAR… NC 23.0 06 10 4 1 2
## 10 370032703258 CORVI… CORVI… CHAR… NC 24.3 KG 04 1 2 2
## # ℹ 179 more rows
## # ℹ 14 more variables: chartr <chr>, frelch <dbl>, redlch <dbl>, totfrl <dbl>,
## # member <dbl>, am <dbl>, asian <dbl>, hisp <dbl>, black <dbl>, white <dbl>,
## # pacific <dbl>, tr <dbl>, toteth <dbl>, sch_type <chr>
Produce a summary table that correctly identifies how many schools of
each of these three types exist in our dataset. (Do NOT produce a new
data frame for this, but use the existing schools data
frame.)
schools |>
group_by(`School type` = sch_type) |>
summarize(
`Number of schools` = n()
)
## # A tibble: 3 × 2
## `School type` `Number of schools`
## <chr> <int>
## 1 Charter 24
## 2 Magnet 18
## 3 TPS 147
Produce another table that shows the number of schools of each of the three school types that are missing data for free or reduced lunch. (Again, do not produce a new data frame.)
schools |>
filter(is.na(frelch) | is.na(redlch)) |>
group_by(`School type` = sch_type) |>
summarize(
`Missing data for free or reduced lunch` = n()
)
## # A tibble: 2 × 2
## `School type` `Missing data for free or reduced lunch`
## <chr> <int>
## 1 Charter 10
## 2 TPS 5
Produce a table that displays the number of schools that are primary
schools, middle schools, high schools, and other, as well as those that
are missing that information. Use a combination of
mutate(), case_when(), and
factor() to bucket these categories appropriately
and sort the categories in the appropriate order
according to the data definitions.
(schools |>
mutate(
sch_catg = case_when(
`level` == 1 ~"primary",
`level` == 2 ~"middle",
`level` == 3 ~"high",
`level` == 4 ~"other",
TRUE ~ "missing the info")) |>
group_by(`School category` = sch_catg) |>
summarize(
`Category count` = n()
) |>
mutate(
`School category` = factor(
`School category`,
level = c("primary", "middle", "high", "other", "missing the info")
) # Here, we couldn't sort the categories in the appropriate order using factor()
) |> arrange(`School category`)
)
## # A tibble: 5 × 2
## `School category` `Category count`
## <fct> <int>
## 1 primary 109
## 2 middle 28
## 3 high 27
## 4 other 10
## 5 missing the info 15
How many schools are eligible for Title I status? Using the same techniques as you did in the previous question, produce a summary table that displays the count of schools in that are either eligible, not eligible, or missing this information.
A school is considered a Title I school if at least 40% of its students qualify for free or reduced lunch. Title I schools receive federal funding to help students in need.
(schools |>
mutate(
titlei_elgb = case_when(
`titlei` == 1 ~"eligible",
`titlei` == 2 ~"not eligible",
TRUE ~ "missing the info")) |>
group_by(`Title I Eligibility` = titlei_elgb) |>
summarize(
`Eligibility count` = n()
)
)
## # A tibble: 3 × 2
## `Title I Eligibility` `Eligibility count`
## <chr> <int>
## 1 eligible 128
## 2 missing the info 15
## 3 not eligible 46
Write a data pipeline that produces a summary report showing the five largest schools in the Charlotte-Mecklenburg Schools (CMS) system. In your summary table, include only the following variables: sch_type, schnam, member, and your new ranking variable. You do not need to create a new data frame for this.
(schools |>
mutate(
larg_rank = min_rank(desc(member))
) |>
arrange(larg_rank) |>
head(5) |>
select(
sch_type,
schnam,
member
)
)
## # A tibble: 5 × 3
## sch_type schnam member
## <chr> <chr> <dbl>
## 1 TPS MYERS PARK HIGH 2775
## 2 TPS SOUTH MECKLENBURG HIGH 2744
## 3 TPS ARDREY KELL HIGH 2512
## 4 TPS MALLARD CREEK HIGH 2286
## 5 TPS INDEPENDENCE HIGH 2247
Using similar techniques to what you just did on the previous question, produce another table that shows only the five smallest schools in CMS. You do not need to create a new data frame for this.
(schools |>
mutate(
sml_rank = min_rank(member)
) |>
arrange(sml_rank) |>
head(5) |>
select(
sch_type,
schnam,
member
)
)
## # A tibble: 5 × 3
## sch_type schnam member
## <chr> <chr> <dbl>
## 1 Charter INVEST COLLEGIATE 99
## 2 Charter ARISTOTLE PREPARATORY ACADEMY 107
## 3 TPS LINCOLN HEIGHTS ACADEMY 107
## 4 TPS PERFORMANCE LEARNING CENTER 137
## 5 Charter COMMUNITY CHARTER SCHOOL 150
Write a pipeline that identifies the size ranking for Davidson Elementary School. For your output, you only need to return the school name, its size, and its ranking.
How many students were enrolled at Davidson Elementary School at the time this data was collected, and what was its size ranking according to your calculations?
(replace this with your answer)
Write another pipeline similar to the one above that identifies the size ranking for North Mecklenburg High School. For your output, you only need to return the school name, its size, and its ranking.
How many students were enrolled at North Meck at the time this data was collected, and what was its size ranking according to your calculations?
(replace this with your answer)
Overwrite the original schools data frame by adding a
series of new variables:
pct_black: The percentage of students who are
Blackpct_white: The percentage of students who are
Whitepct_hisp: The percentage of students who are
Hispanicpct_asian: The percentage of students who are
Asianpct_other: The percentage of students who are none of
the above (“Other”)pct_poc: The percentage of students who are
not White (POC = persons of color)pct_frlunch: The percentage of students who are
receiving free or reduced lunchUse a histogram visualize the percentage of Black students. At your discretion, use whatever bin sizes you feel are appropriate to see the variation in school populations and any outliers that may be present. Remember to include labels!
Repeat the same process as 4.2 above, but this time visualize the percentage of White students.
Repeat the same process as the questions above, but this time visualize the percentage of Hispanic students.
Repeat the same process as the questions above, but this time visualize the percentage of Asian students.
Repeat the same process as the questions above, but this time visualize the percentage of other races.
Repeat the same process as the questions above, but this time visualize the percentage of non-white students.
Repeat the same process as the questions above, but this time visualize the percentage of students receiving free/reduced lunch.
Based on these visuals, what conclusions are you able to draw about the racial makeup of the schools in Mecklenburg County and the prevalence of students receiving free/reduced lunches? Be specific in your observations and think carefully and critically about your answer.
(replace this with your answer)
Comparing CCD Data to the Leading on Opportunity Task Force Report
In this part of the lab, we’re going to compare the data we’ve been looking at with some of the analysis and conclusions that were presented by the Leading on Opportunity Task Force report.
According to our data, how many schools are high-poverty (80% or more students receiving FRL) and high-minority (80% non-white)? Produce a summary table that answers this question.
Did your table produce an exact match to what the LOTF report says? Why or why not? Think about how this population of schools might be different from the population in LOTF, and attempt to identify the correct population
For this one, you may want to examine the chartr
variable.
(your answer here)
First, create a new data frame named cms_schools,
beginning from our schools data frame we’ve been using up
to this point, and filter the observations to only
those schools in the CMS system. Then, in a new code statement directly
below that, use the new data frame to calculate variables displaying the
number of schools in the system and the total
population of all students in the system, and return a summary table
below that displays those new variables along with the name of the
school system.
There are multiple ways to solve this question using the techniques we’ve learned. Depending on which method you choose, you may need to filter NA values out of your dataset.
How does the population you just identified differ from the population in the LOTF report?
Note that this difference will impact your estimates to the below slightly; you should not expect perfect matches.
(your answer here)
In this section, you’re being asked to investigate a series of claims that are present in the LOTF report. Use the data we have here to examine the veracity of these claims. Where the authors are not specific, experiment with different thresholds for the definition of “segregated”:
LOTF states: “a third of the 168 schools in the system are segregated by poverty”
Create a summary table below that attempts to validate this statement.
LOTF states: “half are segregated by race”
Create a summary table below that attempts to validate this statement.
LOTF states: “a fifth are hypersegregated, meaning that 90% of their students are of a particular race”
Create a summary table below that attempts to validate this statement.
LOTF states: “over half of all African-American students attend schools that are 90 percent non-white”
Create a summary table below that attempts to validate this statement.
Looking back over the previous questions in Exercise 7 and your analysis of our dataset, how close do your results come to those assertions that were made in the LOTF report? What advice might you give to the authors of LOTF for the next report they write (to be helpful to researchers replicating their results)?
(your answer here)
In this final part of the lab, we’ll focus on visualizing relationships and examining possible co-variation between variables in our schools dataset.
Create a boxplot that visualizes the variation in the percent of students receiving free/reduced lunch between magnet schools, TPS and charter schools.
Create a boxplot that visualizes the variation in the percent of White students between magnet schools, TPS and charter schools.
Create a boxplot that visualizes the variation in the percent of Black students between magnet schools, TPS and charter schools.
Create a boxplot that visualizes the variation in the percent of Hispanic students between magnet schools, TPS and charter schools.
Create a boxplot that visualizes the variation in the percent of Asian students between magnet schools, TPS and charter schools.
Create a boxplot that visualizes the variation in the percent of Non-White students between magnet schools, TPS and charter schools.
Create a boxplot that visualizes the variation in the percent of students of other races between magnet schools, TPS and charter schools.
Visualize the relationship between the percent of students of color and the percent who receive free or reduced price lunch, segmented by school type. Experiment with a couple of different ways to visualize this relationship, using geom_point and one of the geoms from Chapter 7 of R4DS.
Describe your findings. What do you observe about the relationship, particularly when segmented by school type?
(your answer here)
On my honor I have neither given nor received unauthorized information regarding this work, I have followed and will continue to observe all regulations regarding it, and I am unaware of any violation of the Honor Code by others.
Person 1: (replace this with your full name to pledge this assignment) Person 2: (replace this with your full name to pledge this assignment)
Before submitting your assignment in Gradescope, review this checklist carefully to ensure you have done everything appropriately.
Point penalties will be assessed if your assignment does not meet these guidelines.
For a complete description of the grading rubrics used for this assignment, please consult the Rubrics page on the course website.