Step 1 – Load the District Data

library(ggplot2)
library(readxl)

district <- read_excel("district.xls")

head(district)
## # A tibble: 6 × 137
##   DISTNAME DISTRICT DZCNTYNM REGION DZRATING DZCAMPUS DPETALLC DPETBLAP DPETHISP
##   <chr>    <chr>    <chr>    <chr>  <chr>       <dbl>    <dbl>    <dbl>    <dbl>
## 1 CAYUGA … 001902   001 AND… 07     A               3      574      4.4     11.5
## 2 ELKHART… 001903   001 AND… 07     A               4     1150      4       11.8
## 3 FRANKST… 001904   001 AND… 07     A               3      808      8.5     11.3
## 4 NECHES … 001906   001 AND… 07     A               2      342      8.2     13.5
## 5 PALESTI… 001907   001 AND… 07     B               6     3360     25.1     42.9
## 6 WESTWOO… 001908   001 AND… 07     B               4     1332     19.7     26.2
## # ℹ 128 more variables: DPETWHIP <dbl>, DPETINDP <dbl>, DPETASIP <dbl>,
## #   DPETPCIP <dbl>, DPETTWOP <dbl>, DPETECOP <dbl>, DPETLEPP <dbl>,
## #   DPETSPEP <dbl>, DPETBILP <dbl>, DPETVOCP <dbl>, DPETGIFP <dbl>,
## #   DA0AT21R <dbl>, DA0912DR21R <dbl>, DAGC4X21R <dbl>, DAGC5X20R <dbl>,
## #   DAGC6X19R <dbl>, DA0GR21N <dbl>, DA0GS21N <dbl>, DDA00A001S22R <dbl>,
## #   DDA00A001222R <dbl>, DDA00A001322R <dbl>, DDA00AR01S22R <dbl>,
## #   DDA00AR01222R <dbl>, DDA00AR01322R <dbl>, DDA00AM01S22R <dbl>, …

Step 2 – Select the Relevant Columns

# Create a new data frame with just the three columns we need
sped_df <- district[, c("DISTNAME", "DPETSPEP", "DPFPASPEP")]

# Preview
head(sped_df)
## # A tibble: 6 × 3
##   DISTNAME      DPETSPEP DPFPASPEP
##   <chr>            <dbl>     <dbl>
## 1 CAYUGA ISD        14.6      28.9
## 2 ELKHART ISD       12.1       8.8
## 3 FRANKSTON ISD     13.1       8.4
## 4 NECHES ISD        10.5      10.1
## 5 PALESTINE ISD     13.5       6.1
## 6 WESTWOOD ISD      14.5       9.4

Step 3 – Summary Statistics

# Summary of percent special education students
summary(sped_df$DPETSPEP)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.90   12.10   12.27   14.20   51.70
# Summary of special education spending (DPFPASPEP)
summary(sped_df$DPFPASPEP)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   5.800   8.900   9.711  12.500  49.000       5
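
The NA's entry in the second summary matters for later calculations: most base R statistics return NA when the input contains missing values, unless na.rm = TRUE is set. A minimal sketch, using a made-up vector (the values are illustrative, not taken from district.xls):

```r
# Hypothetical spending values with one missing entry (not real district data)
spend <- c(5.8, 8.9, NA, 12.5)

mean(spend)                 # NA, because one value is missing
mean(spend, na.rm = TRUE)   # mean of the three observed values only
sum(is.na(spend))           # number of missing values: 1
```

summary() drops the missing values and reports them separately, which is why the output above still shows a mean alongside the NA count.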

Step 4 – Which Variable Has Missing Values?

# Count NAs in each column
colSums(is.na(sped_df))
##  DISTNAME  DPETSPEP DPFPASPEP 
##         0         0         5

Answer: DPFPASPEP (special education spending) is the only variable with missing values: 5 districts have NA for it, while DISTNAME and DPETSPEP have none.
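
To see which districts are affected, rather than only how many, the same is.na() test can be used as a row index. The sketch below uses a small stand-in data frame with hypothetical district names and values, since it is meant to run without district.xls:

```r
# Toy stand-in for sped_df (hypothetical districts and values)
toy <- data.frame(
  DISTNAME  = c("A ISD", "B ISD", "C ISD"),
  DPETSPEP  = c(12.1, 14.6, 10.5),
  DPFPASPEP = c(8.8, NA, 10.1),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE where spending is missing
missing_rows <- is.na(toy$DPFPASPEP)

# Show only the rows with missing spending
toy[missing_rows, ]
```

On the real data the equivalent call would be sped_df[is.na(sped_df$DPFPASPEP), ].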


Step 5 – Remove Missing Observations

# Remove rows with any NA values
sped_clean <- na.omit(sped_df)

# How many rows remain?
nrow(sped_clean)
## [1] 1202

Answer: After removing missing observations, there are 1202 districts remaining in the dataset.
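
na.omit() drops every row that has an NA in any column, not just in one chosen column. A small sketch on made-up values (not the district data) shows this, along with the equivalent complete.cases() filter:

```r
# Toy data with one NA in each column (hypothetical values)
toy <- data.frame(
  DPETSPEP  = c(12.1, 14.6, 10.5, NA),
  DPFPASPEP = c(8.8, NA, 10.1, 9.4)
)

# Drop rows with an NA in any column
toy_clean <- na.omit(toy)
nrow(toy) - nrow(toy_clean)   # rows dropped: 2

# The same filter written explicitly
toy_clean2 <- toy[complete.cases(toy), ]
```

Because DPFPASPEP is the only column with missing values in sped_df, nrow(sped_df) - nrow(sped_clean) should come out to 5 here.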


Step 6 – Scatter Plot: Spending vs. Percent Special Education

ggplot(sped_clean, aes(x = DPETSPEP, y = DPFPASPEP)) +
  geom_point(alpha = 0.4, color = "steelblue", size = 2) +
  labs(
    title    = "Special Education: Enrollment % vs. Spending",
    subtitle = "Each point represents one Texas school district",
    x        = "Percent of Students in Special Education (DPETSPEP)",
    y        = "Special Education Spending (DPFPASPEP)"
  ) +
  theme_minimal()

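Observation: There is a weak positive trend: districts with a higher percentage of special education students tend to show higher special education spending, but the points scatter widely around that trend.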


Step 7 – Correlation Check

cor(sped_clean$DPETSPEP, sped_clean$DPFPASPEP, use = "complete.obs")
## [1] 0.3700234

The correlation coefficient is about 0.37. A value near 1 means a strong positive relationship, near -1 a strong negative one, and near 0 little linear relationship, so 0.37 indicates a weak-to-moderate positive association between special education enrollment percentage and special education spending. The scatterplot agrees: most districts cluster at low values for both variables. There are notable outliers, particularly a few districts with very high spending that do not have the highest enrollment rates. Enrollment percentage and spending are therefore somewhat related, but other factors such as district size, student needs, and funding sources may play a significant role in determining spending levels.
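
As a check on what cor() computes, the Pearson coefficient can be worked out from its definition and compared with the built-in result. This is a sketch on five made-up pairs, not on the district data:

```r
# Hypothetical enrollment (x) and spending (y) values
x <- c(9.9, 12.1, 14.2, 10.5, 13.5)
y <- c(5.8, 8.9, 12.5, 10.1, 6.1)

# Pearson r: sum of cross-deviations divided by the
# square root of the product of the summed squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)   # TRUE
r_builtin^2                      # share of variance a linear fit would explain
```

For the r of about 0.37 reported above, r squared is about 0.137, so a straight-line fit would account for roughly 14% of the variation in spending across districts.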