library(ggplot2)
library(readxl)
district <- read_excel("district.xls")
head(district)
## # A tibble: 6 × 137
## DISTNAME DISTRICT DZCNTYNM REGION DZRATING DZCAMPUS DPETALLC DPETBLAP DPETHISP
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CAYUGA … 001902 001 AND… 07 A 3 574 4.4 11.5
## 2 ELKHART… 001903 001 AND… 07 A 4 1150 4 11.8
## 3 FRANKST… 001904 001 AND… 07 A 3 808 8.5 11.3
## 4 NECHES … 001906 001 AND… 07 A 2 342 8.2 13.5
## 5 PALESTI… 001907 001 AND… 07 B 6 3360 25.1 42.9
## 6 WESTWOO… 001908 001 AND… 07 B 4 1332 19.7 26.2
## # ℹ 128 more variables: DPETWHIP <dbl>, DPETINDP <dbl>, DPETASIP <dbl>,
## # DPETPCIP <dbl>, DPETTWOP <dbl>, DPETECOP <dbl>, DPETLEPP <dbl>,
## # DPETSPEP <dbl>, DPETBILP <dbl>, DPETVOCP <dbl>, DPETGIFP <dbl>,
## # DA0AT21R <dbl>, DA0912DR21R <dbl>, DAGC4X21R <dbl>, DAGC5X20R <dbl>,
## # DAGC6X19R <dbl>, DA0GR21N <dbl>, DA0GS21N <dbl>, DDA00A001S22R <dbl>,
## # DDA00A001222R <dbl>, DDA00A001322R <dbl>, DDA00AR01S22R <dbl>,
## # DDA00AR01222R <dbl>, DDA00AR01322R <dbl>, DDA00AM01S22R <dbl>, …
# Create a new data frame with just the three columns we need
sped_df <- district[, c("DISTNAME", "DPETSPEP", "DPFPASPEP")]
# Preview
head(sped_df)
## # A tibble: 6 × 3
## DISTNAME DPETSPEP DPFPASPEP
## <chr> <dbl> <dbl>
## 1 CAYUGA ISD 14.6 28.9
## 2 ELKHART ISD 12.1 8.8
## 3 FRANKSTON ISD 13.1 8.4
## 4 NECHES ISD 10.5 10.1
## 5 PALESTINE ISD 13.5 6.1
## 6 WESTWOOD ISD 14.5 9.4
# Summary of percent special education students
summary(sped_df$DPETSPEP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.90 12.10 12.27 14.20 51.70
# Summary of money spent on special education
summary(sped_df$DPFPASPEP)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 5.800 8.900 9.711 12.500 49.000 5
# Count NAs in each column
colSums(is.na(sped_df))
## DISTNAME DPETSPEP DPFPASPEP
## 0 0 5
Answer: The output above shows that DPFPASPEP (spending) is the only variable with missing values: it has 5 missing observations, while DISTNAME and DPETSPEP have none.
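To see which districts are affected, and not just how many, the incomplete rows can be listed directly. The sketch below uses a small made-up data frame (`toy`, with hypothetical district names and values) in place of `sped_df`, so it runs without the Excel file; the same call should work on `sped_df`.

```r
# Hypothetical toy data standing in for sped_df (names and values are made up)
toy <- data.frame(
  DISTNAME  = c("A ISD", "B ISD", "C ISD", "D ISD"),
  DPETSPEP  = c(12.1, 14.6, 10.5, 13.5),
  DPFPASPEP = c(8.8, NA, 10.1, NA),
  stringsAsFactors = FALSE
)

# complete.cases() is TRUE for rows with no NA in any column,
# so negating it keeps only the rows that have a missing value
missing_rows <- toy[!complete.cases(toy), ]

# Names of the districts with a missing value
missing_rows$DISTNAME
```

Listing the rows first makes it easier to judge whether dropping them is reasonable before calling `na.omit()`.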
# Remove rows with any NA values
sped_clean <- na.omit(sped_df)
# How many rows remain?
nrow(sped_clean)
## [1] 1202
Answer: After removing missing observations, there are 1202 districts remaining in the dataset.
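Since DPFPASPEP is the only column with missing values and it has 5 of them, exactly 5 rows were dropped, which implies 1207 districts before cleaning. A quick way to confirm this kind of bookkeeping is sketched below on a made-up data frame (`toy` is hypothetical, not the district data).

```r
# Hypothetical toy data standing in for sped_df (names and values are made up)
toy <- data.frame(
  DISTNAME  = c("A ISD", "B ISD", "C ISD", "D ISD"),
  DPETSPEP  = c(12.1, 14.6, 10.5, 13.5),
  DPFPASPEP = c(8.8, NA, 10.1, NA),
  stringsAsFactors = FALSE
)

toy_clean <- na.omit(toy)

# Number of rows removed by na.omit()
n_dropped <- nrow(toy) - nrow(toy_clean)

# na.omit() stores the positions of the dropped rows in the "na.action" attribute
dropped_idx <- as.integer(attr(toy_clean, "na.action"))

n_dropped
dropped_idx
```

On the real data, the analogous check would compare `nrow(sped_df) - nrow(sped_clean)` with `sum(!complete.cases(sped_df))`.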
ggplot(sped_clean, aes(x = DPETSPEP, y = DPFPASPEP)) +
  geom_point(alpha = 0.4, color = "steelblue", size = 2) +
  labs(
    title = "Special Education: Enrollment % vs. Spending per Pupil",
    subtitle = "Each point represents one Texas school district",
    x = "Percent of Students in Special Education (DPETSPEP)",
    y = "Per-Pupil Spending on Special Education (DPFPASPEP)"
  ) +
  theme_minimal()
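A fitted line can make the direction of any trend easier to judge than points alone. The following is a minimal sketch, not part of the original analysis: it adds a least-squares line with `geom_smooth()` to the same kind of scatterplot, using a made-up data frame (`toy`, hypothetical values) so it runs without the Excel file.

```r
library(ggplot2)

# Hypothetical toy data standing in for sped_clean (values are made up)
toy <- data.frame(
  DPETSPEP  = c(9.5, 10.5, 12.1, 13.1, 13.5, 14.5, 14.6, 16.0),
  DPFPASPEP = c(5.2, 10.1, 8.8, 8.4, 6.1, 9.4, 28.9, 12.3)
)

p <- ggplot(toy, aes(x = DPETSPEP, y = DPFPASPEP)) +
  geom_point(alpha = 0.4, color = "steelblue", size = 2) +
  # Least-squares line; se = FALSE hides the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "firebrick") +
  theme_minimal()

p
```

An upward-sloping line indicates a positive trend, but a single extreme point can tilt it, so the line is best read together with the scatter of points.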
Observation: Describe what you see — is there a positive trend, negative trend, or no clear pattern?
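# Pearson correlation between enrollment % and spending
# (sped_clean has no NAs, so use = "complete.obs" is only a safeguard)
cor(sped_clean$DPETSPEP, sped_clean$DPFPASPEP, use = "complete.obs")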
## [1] 0.3700234
The correlation coefficient is about 0.37. A value near 1 means a strong positive relationship; near –1 means a strong negative relationship; near 0 means little linear relationship. The scatterplot is consistent with this value: it shows a weak-to-moderate positive relationship between special education enrollment percentage and per-pupil spending. Most districts cluster at low values for both variables. However, there are notable outliers, particularly a few districts with very high spending that don’t necessarily have the highest enrollment rates. This suggests that while enrollment percentage and spending are somewhat related, other factors such as district size, student needs, and funding sources may play a significant role in determining spending levels.
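One way to put the strength of the relationship in context is to square the coefficient: for r of about 0.37, r² is about 0.14, so a straight-line fit would account for roughly 14% of the variation in spending. Because Pearson's r is sensitive to outliers, a rank-based (Spearman) correlation is a common cross-check. The sketch below illustrates both on made-up vectors (`x` and `y` are hypothetical, not the district data).

```r
# Hypothetical toy data: a loose upward trend plus one high-spending outlier
x <- c(8, 10, 11, 12, 13, 14, 15, 12)
y <- c(6, 7, 9, 8, 10, 9, 12, 40)

r_pearson  <- cor(x, y)                       # linear correlation, sensitive to outliers
r_spearman <- cor(x, y, method = "spearman")  # rank-based, less affected by the outlier
r_squared  <- r_pearson^2                     # share of variance explained by a linear fit

# cor.test() adds a confidence interval and p-value for Pearson's r
ct <- cor.test(x, y)

r_pearson
r_spearman
r_squared
ct$conf.int
```

On the real data, the same calls would take `sped_clean$DPETSPEP` and `sped_clean$DPFPASPEP` as arguments. A large gap between the Pearson and Spearman values would suggest that the outliers are driving the result.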