Instructions

Welcome to CSC 110 Lab 2! In this lab, you will work with data on schools from the Common Core of Data (CCD), collected by the National Center for Education Statistics. You can read more about the CCD and and find documentation on this datafile here.

Before attempting this lab, you should read Chapter 2 of the LOTF report. We are working with the 2013-14 dataset, for consistency with the LOTF report. This dataset contains a codebook that explains all the variables, which you can find in the data subfolder of this directory, or by clicking here.

This is a pair programming assignment: one student acts as the driver, the other as the navigator. The driver writes code while the navigator directs them on what to write based on the assignment instructions, and provides feedback and help as needed. You should rotate these roles roughly every 15 minutes so each person gets plenty of time in the “driver’s seat.”

Lab assignments can be taken open-book using the R4DS textbook and open-note using the notes you’ve taken and the materials we have covered in class. You should not rely on any other outside resources. Usage of generative AI tools like ChatGPT, Claude, or DeepSeek is not allowed on this assignment.

Write R code in each code block to load, transform, and plot your data.
To earn full credit on coding questions, your code and its output must both be visible.
If a coding question produces a plot/figure, make sure to include clear and descriptive labels (title, axis, etc.)
Written (non-code) answers to questions should be typed out in the answer boxes that are provided for you. Do not replace the heading in each answer box, but instead type your responses below each heading.
I recommend you save and render your Quarto doc as an HTML document regularly. This not only has the benefit of showing your output, but it also tests your code to make sure everything is working. Any errors that the rendering engine encounters will be displayed in the RStudio console.
Remember: all work is done under the Honor Code, and you must pledge your work at the end of the assignment.
When finished, render your Quarto doc as a PDF, review it carefully for completeness and cleanliness, and submit it in Gradescope. Your Qmd file will stay stored in JupyterHub if we need to refer back to it.

Setup

Next, create a data frame named “schools” and load the data from our ccd_cms_1314.xlsx data file.

At this point, you should take some time to explore the data and jot down your observations about what information the schools data frame contains. This dataset has a lot of variables, many of which (but not all) you will need for this assignment.

You should also spend some time reviewing the data definitions file that is included with this dataset, which you find here. This file contains critical information that will guide your lab team as you write your code.

Part I

In this part of the lab, we’ll get to know the CMS and CCD data.

Exercise 1

Exercise 1.1

Write a data pipeline that overwrites the original schools data frame and creates a new variable named sch_type. This variable should display either “Charter”, “Magnet”, or “TPS”, to specify if the school is a charter, magnet, or traditional public school. Make sure that the output of your pipeline displays your new variable and all other variables from the original data frame.

(schools <- schools |>
  mutate(sch_type = case_when(
      magnet == 1 ~ "Magnet",
      chartr == 1 ~ "Charter",
      TRUE ~ "TPS"
  )
)
)

## # A tibble: 189 × 25
##         ncessch leanm  schnam mcity mstate   fte gslo  gshi  level titlei magnet
##           <dbl> <chr>  <chr>  <chr> <chr>  <dbl> <chr> <chr> <chr> <chr>  <chr> 
##  1 370003202327 COMMU… COMMU… CHAR… NC      10.8 KG    05    1     1      2     
##  2 370006302398 KENNE… KENNE… CHAR… NC      29.7 KG    12    4     1      2     
##  3 370006402399 LAKE … LAKE … HUNT… NC      90.4 05    12    4     2      2     
##  4 370008902472 SUGAR… SUGAR… CHAR… NC      48.6 KG    08    1     1      2     
##  5 370010702530 METRO… METRO… CHAR… NC      26.9 KG    08    1     2      2     
##  6 370012202591 CROSS… CROSS… CHAR… NC      14.4 09    12    3     1      2     
##  7 370013502736 COMMU… COMMU… DAVI… NC      90.2 KG    12    4     2      2     
##  8 370014202922 KIPP:… KIPP … CHAR… NC      27.7 05    08    2     1      2     
##  9 370014302924 CHARL… CHARL… CHAR… NC      23.0 06    10    4     1      2     
## 10 370032703258 CORVI… CORVI… CHAR… NC      24.3 KG    04    1     2      2     
## # ℹ 179 more rows
## # ℹ 14 more variables: chartr <chr>, frelch <dbl>, redlch <dbl>, totfrl <dbl>,
## #   member <dbl>, am <dbl>, asian <dbl>, hisp <dbl>, black <dbl>, white <dbl>,
## #   pacific <dbl>, tr <dbl>, toteth <dbl>, sch_type <chr>

Exercise 1.2

Produce a summary table that correctly identifies how many schools of each of these three types exist in our dataset. (Do NOT produce a new data frame for this, but use the existing schools data frame.)

schools |>
  group_by(`School type` = sch_type) |>
  summarize(
    `Number of schools` = n()
)

## # A tibble: 3 × 2
##   `School type` `Number of schools`
##   <chr>                       <int>
## 1 Charter                        24
## 2 Magnet                         18
## 3 TPS                           147

Exercise 1.3

Produce another table that shows the number of schools of each of the three school types that are missing data for free or reduced lunch. (Again, do not produce a new data frame.)

schools |>
  filter(is.na(frelch) | is.na(redlch)) |>
  group_by(`School type` = sch_type) |>
  summarize(
    `Missing data for free or reduced lunch` = n()
)

## # A tibble: 2 × 2
##   `School type` `Missing data for free or reduced lunch`
##   <chr>                                            <int>
## 1 Charter                                             10
## 2 TPS                                                  5

Exercise 2

Exercise 2.1

Produce a table that displays the number of schools that are primary schools, middle schools, high schools, and other, as well as those that are missing that information. Use a combination of mutate(), case_when(), and factor() to bucket these categories appropriately and sort the categories in the appropriate order according to the data definitions.

(schools |>
  mutate(
    sch_catg = case_when(
      `level`  == 1 ~"primary",
      `level`  == 2 ~"middle",
      `level`  == 3 ~"high",
      `level`  == 4 ~"other",
       TRUE ~ "missing the info")) |>
    group_by(`School category` = sch_catg) |>
    summarize(
      `Category count` = n()
    ) |>
    mutate(
    `School category` = factor(
      `School category`,
      level = c("primary", "middle", "high", "other", "missing the info")
    ) # Here, we couldn't sort the categories in the appropriate order using factor()
  ) |> arrange(`School category`)
)

## # A tibble: 5 × 2
##   `School category` `Category count`
##   <fct>                        <int>
## 1 primary                        109
## 2 middle                          28
## 3 high                            27
## 4 other                           10
## 5 missing the info                15

Exercise 2.2

How many schools are eligible for Title I status? Using the same techniques as you did in the previous question, produce a summary table that displays the count of schools in that are either eligible, not eligible, or missing this information.

A school is considered a Title I school if at least 40% of its students qualify for free or reduced lunch. Title I schools receive federal funding to help students in need.

(schools |>
  mutate(
    titlei_elgb = case_when(
      `titlei`  == 1 ~"eligible",
      `titlei`  == 2 ~"not eligible",
       TRUE ~ "missing the info")) |>
    group_by(`Title I Eligibility` = titlei_elgb) |>
    summarize(
      `Eligibility count` = n()
    )
)

## # A tibble: 3 × 2
##   `Title I Eligibility` `Eligibility count`
##   <chr>                               <int>
## 1 eligible                              128
## 2 missing the info                       15
## 3 not eligible                           46

Exercise 3

Exercise 3.1

Write a data pipeline that produces a summary report showing the five largest schools in the Charlotte-Mecklenburg Schools (CMS) system. In your summary table, include only the following variables: sch_type, schnam, member, and your new ranking variable. You do not need to create a new data frame for this.

(schools |>
  mutate(
     larg_rank = min_rank(desc(member))
    ) |>
    arrange(larg_rank) |>
    head(5) |>
  select(
    sch_type,
    schnam,
    member
  )
)

## # A tibble: 5 × 3
##   sch_type schnam                 member
##   <chr>    <chr>                   <dbl>
## 1 TPS      MYERS PARK HIGH          2775
## 2 TPS      SOUTH MECKLENBURG HIGH   2744
## 3 TPS      ARDREY KELL HIGH         2512
## 4 TPS      MALLARD CREEK HIGH       2286
## 5 TPS      INDEPENDENCE HIGH        2247

Exercise 3.2

Using similar techniques to what you just did on the previous question, produce another table that shows only the five smallest schools in CMS. You do not need to create a new data frame for this.

(schools |>
  mutate(
     sml_rank = min_rank(member)
    ) |>
    arrange(sml_rank) |>
    head(5) |>
  select(
    sch_type,
    schnam,
    member
  )
)

## # A tibble: 5 × 3
##   sch_type schnam                        member
##   <chr>    <chr>                          <dbl>
## 1 Charter  INVEST COLLEGIATE                 99
## 2 Charter  ARISTOTLE PREPARATORY ACADEMY    107
## 3 TPS      LINCOLN HEIGHTS ACADEMY          107
## 4 TPS      PERFORMANCE LEARNING CENTER      137
## 5 Charter  COMMUNITY CHARTER SCHOOL         150

Exercise 3.3

Write a pipeline that identifies the size ranking for Davidson Elementary School. For your output, you only need to return the school name, its size, and its ranking.

Exercise 3.4

How many students were enrolled at Davidson Elementary School at the time this data was collected, and what was its size ranking according to your calculations?

(replace this with your answer)

Exercise 3.5

Write another pipeline similar to the one above that identifies the size ranking for North Mecklenburg High School. For your output, you only need to return the school name, its size, and its ranking.

Exercise 3.6

How many students were enrolled at North Meck at the time this data was collected, and what was its size ranking according to your calculations?

(replace this with your answer)

Exercise 4

Exercise 4.1

Overwrite the original schools data frame by adding a series of new variables:

pct_black: The percentage of students who are Black
pct_white: The percentage of students who are White
pct_hisp: The percentage of students who are Hispanic
pct_asian: The percentage of students who are Asian
pct_other: The percentage of students who are none of the above (“Other”)
pct_poc: The percentage of students who are not White (POC = persons of color)
pct_frlunch: The percentage of students who are receiving free or reduced lunch

Exercise 4.2

Use a histogram visualize the percentage of Black students. At your discretion, use whatever bin sizes you feel are appropriate to see the variation in school populations and any outliers that may be present. Remember to include labels!

Exercise 4.3

Repeat the same process as 4.2 above, but this time visualize the percentage of White students.

Exercise 4.4

Repeat the same process as the questions above, but this time visualize the percentage of Hispanic students.

Exercise 4.5

Repeat the same process as the questions above, but this time visualize the percentage of Asian students.

Exercise 4.6

Repeat the same process as the questions above, but this time visualize the percentage of other races.

Exercise 4.7

Repeat the same process as the questions above, but this time visualize the percentage of non-white students.

Exercise 4.8

Repeat the same process as the questions above, but this time visualize the percentage of students receiving free/reduced lunch.

Exercise 4.9

Based on these visuals, what conclusions are you able to draw about the racial makeup of the schools in Mecklenburg County and the prevalence of students receiving free/reduced lunches? Be specific in your observations and think carefully and critically about your answer.

(replace this with your answer)

Part II

Comparing CCD Data to the Leading on Opportunity Task Force Report

In this part of the lab, we’re going to compare the data we’ve been looking at with some of the analysis and conclusions that were presented by the Leading on Opportunity Task Force report.

Exercise 5

Exercise 5.1

According to our data, how many schools are high-poverty (80% or more students receiving FRL) and high-minority (80% non-white)? Produce a summary table that answers this question.

Exercise 5.2

Did your table produce an exact match to what the LOTF report says? Why or why not? Think about how this population of schools might be different from the population in LOTF, and attempt to identify the correct population

For this one, you may want to examine the chartr variable.

(your answer here)

Exercise 6

Exercise 6.1

First, create a new data frame named cms_schools, beginning from our schools data frame we’ve been using up to this point, and filter the observations to only those schools in the CMS system. Then, in a new code statement directly below that, use the new data frame to calculate variables displaying the number of schools in the system and the total population of all students in the system, and return a summary table below that displays those new variables along with the name of the school system.

There are multiple ways to solve this question using the techniques we’ve learned. Depending on which method you choose, you may need to filter NA values out of your dataset.

Exercise 6.2

How does the population you just identified differ from the population in the LOTF report?

Note that this difference will impact your estimates to the below slightly; you should not expect perfect matches.

(your answer here)

Exercise 7

In this section, you’re being asked to investigate a series of claims that are present in the LOTF report. Use the data we have here to examine the veracity of these claims. Where the authors are not specific, experiment with different thresholds for the definition of “segregated”:

Exercise 7.1

LOTF states: “a third of the 168 schools in the system are segregated by poverty”

Create a summary table below that attempts to validate this statement.

Exercise 7.2

LOTF states: “half are segregated by race”

Create a summary table below that attempts to validate this statement.

Exercise 7.3

LOTF states: “a fifth are hypersegregated, meaning that 90% of their students are of a particular race”

Create a summary table below that attempts to validate this statement.

Exercise 7.4

LOTF states: “over half of all African-American students attend schools that are 90 percent non-white”

Create a summary table below that attempts to validate this statement.

Exercise 7.5

Looking back over the previous questions in Exercise 7 and your analysis of our dataset, how close do your results come to those assertions that were made in the LOTF report? What advice might you give to the authors of LOTF for the next report they write (to be helpful to researchers replicating their results)?

(your answer here)

Part III

In this final part of the lab, we’ll focus on visualizing relationships and examining possible co-variation between variables in our schools dataset.

Exercise 9

Exercise 9.1

Create a boxplot that visualizes the variation in the percent of students receiving free/reduced lunch between magnet schools, TPS and charter schools.

Exercise 9.2

Create a boxplot that visualizes the variation in the percent of White students between magnet schools, TPS and charter schools.

Exercise 9.3

Create a boxplot that visualizes the variation in the percent of Black students between magnet schools, TPS and charter schools.

Exercise 9.4

Create a boxplot that visualizes the variation in the percent of Hispanic students between magnet schools, TPS and charter schools.

Exercise 9.5

Create a boxplot that visualizes the variation in the percent of Asian students between magnet schools, TPS and charter schools.

Exercise 9.6

Create a boxplot that visualizes the variation in the percent of Non-White students between magnet schools, TPS and charter schools.

Exercise 9.7

Create a boxplot that visualizes the variation in the percent of students of other races between magnet schools, TPS and charter schools.

Exercise 10

Exercise 10.1

Visualize the relationship between the percent of students of color and the percent who receive free or reduced price lunch, segmented by school type. Experiment with a couple of different ways to visualize this relationship, using geom_point and one of the geoms from Chapter 7 of R4DS.

Exercise 10.2

Describe your findings. What do you observe about the relationship, particularly when segmented by school type?

(your answer here)

Honor Code Pledge

On my honor I have neither given nor received unauthorized information regarding this work, I have followed and will continue to observe all regulations regarding it, and I am unaware of any violation of the Honor Code by others.

Person 1: (replace this with your full name to pledge this assignment) Person 2: (replace this with your full name to pledge this assignment)

Submission checklist

Before submitting your assignment in Gradescope, review this checklist carefully to ensure you have done everything appropriately.

Correct authors and date
All code is clearly visible
All output is clearly visible (does not apply to the Setup section), including labels and legends
Written responses are typed into the appropriate answer boxes
All empty or redundant plots are removed
Honor Code pledge is signed

Point penalties will be assessed if your assignment does not meet these guidelines.

For a complete description of the grading rubrics used for this assignment, please consult the Rubrics page on the course website.

Lab 2

CSC 110

Charlotte Li

Tania Kalisa Umurerwa

2025-02-17

Instructions

Setup

Part I

Exercise 1

Exercise 1.1

Exercise 1.2

Exercise 1.3

Exercise 2

Exercise 2.1

Exercise 2.2

Exercise 3

Exercise 3.1

Exercise 3.2

Exercise 3.3

Exercise 3.4

Exercise 3.5

Exercise 3.6

Exercise 4

Exercise 4.1

Exercise 4.2

Exercise 4.3

Exercise 4.4

Exercise 4.5

Exercise 4.6

Exercise 4.7

Exercise 4.8

Exercise 4.9

Part II

Exercise 5

Exercise 5.1

Exercise 5.2

Exercise 6

Exercise 6.1

Exercise 6.2

Exercise 7

Exercise 7.1

Exercise 7.2

Exercise 7.3

Exercise 7.4

Exercise 7.5

Part III

Exercise 9

Exercise 9.1

Exercise 9.2

Exercise 9.3

Exercise 9.4

Exercise 9.5

Exercise 9.6

Exercise 9.7

Exercise 10

Exercise 10.1

Exercise 10.2

Honor Code Pledge

Submission checklist