library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
setwd("~/Desktop/UTSA/Quantitative Methods/RStudio")
district <- read_excel("district.xls")
clean_district <- district |>
  filter(DZRATING %in% c("A", "B", "C")) |>
  mutate(DZRATING_num = case_when(
    DZRATING == "A" ~ 5,
    DZRATING == "B" ~ 4,
    DZRATING == "C" ~ 3
  ))
district_model <- lm(DZRATING_num ~ DPSTURNR, data = clean_district)
summary(district_model)
##
## Call:
## lm(formula = DZRATING_num ~ DPSTURNR, data = clean_district)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5611 -0.3190 -0.1774 0.6368 1.6569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.56111 0.04095 111.377 <2e-16 ***
## DPSTURNR -0.01523 0.00177 -8.601 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5975 on 1145 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.06069, Adjusted R-squared: 0.05987
## F-statistic: 73.97 on 1 and 1145 DF, p-value: < 2.2e-16
ANSWER: I was a bit confused about why I got the same unrealistic warning when using my own (your) real-world district data; it was hard for me to apply this lecture to real-world numbers. I assume this happens because my outcome variable is not continuous: the accountability ratings only go from A to F, and I excluded the Not Rated districts.
The R-squared is only about 0.06, which means teacher turnover explains around 6% of the variation in accountability ratings. That’s not much, but it seems normal and suggests that many other factors are at play. The p-value shows the relationship is statistically significant, so the pattern is unlikely to be due to chance alone; it is just not a strong one. In other words, districts with higher turnover tend to have slightly lower ratings, but turnover clearly isn’t the whole story.
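A quick arithmetic check, using only the numbers printed in the summary above, shows how the t value, the F-statistic, and the R-squared fit together when a model has a single predictor. The variable names here (t_value, df_resid, and so on) are made up for this sketch and are not part of the original analysis.

```r
# Values copied from the summary(district_model) output above
t_value  <- -8.601   # t value for DPSTURNR
df_resid <- 1145     # residual degrees of freedom

# With one predictor, the F-statistic is the squared t value
f_stat <- t_value^2                        # about 73.98

# R-squared can be recovered from F and the residual df
r_squared <- f_stat / (f_stat + df_resid)  # about 0.0607

# The correlation has the same sign as the slope
r <- sign(t_value) * sqrt(r_squared)       # about -0.246

print(c(F = f_stat, R2 = r_squared, r = r))
```

The recovered values match the reported F-statistic (73.97) and Multiple R-squared (0.06069) up to rounding, and the implied correlation of roughly -0.25 is another way to see that the relationship is weak.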
ggplot(clean_district,
       aes(x = factor(DZRATING, levels = c("C", "B", "A")), y = DPSTURNR)) +
  geom_boxplot(fill = "purple") +
  labs(x = "Accountability Rating",
       y = "Teacher Turnover (%)",
       title = "Teacher Turnover by Accountability Rating")
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
To connect the lecture to my homework and chosen variables, I also tried
a different plotting method, because I thought a boxplot would make the
pattern easier to see than a scatterplot. Districts with lower
accountability ratings tend to have higher teacher turnover, while those
with A or B ratings usually have lower turnover. The spread within each
group also shows that ratings aren’t determined by turnover alone;
there’s a lot of variation. This fits what the regression showed: a
small but statistically significant negative relationship.
ANSWER: The only independent variable I used in this model, teacher turnover (DPSTURNR), is statistically significant (p < .001). Its estimate (-0.015) means that for every 1-percentage-point increase in teacher turnover, a district’s predicted accountability rating decreases by about 0.015 points on the numeric scale (A = 5, B = 4, C = 3).
In this dataset, ratings range only from A to C, so the model is capturing small shifts within generally higher-rated districts. That limited variation helps explain why the overall relationship is pretty modest. I would guess that the excluded Not Rated districts would have been rated D or F if other factors had not kept them from being rated. The data don’t include the lowest-performing districts, where turnover might have a stronger effect.
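To make the size of the slope more concrete, here is a small sketch that plugs the reported coefficients into the fitted line by hand. The helper predict_rating() and the turnover values of 10% and 30% are hypothetical choices for illustration only; they do not come from the data.

```r
# Coefficients copied from the summary(district_model) output
intercept <- 4.56111
slope     <- -0.01523

# Hypothetical helper: predicted numeric rating at a given turnover rate (%)
predict_rating <- function(turnover) intercept + slope * turnover

low_turnover  <- predict_rating(10)  # illustrative district with 10% turnover
high_turnover <- predict_rating(30)  # illustrative district with 30% turnover

# Difference implied by a 20-percentage-point gap in turnover
gap <- high_turnover - low_turnover

print(c(low = low_turnover, high = high_turnover, gap = gap))
```

Even a 20-point gap in turnover moves the predicted rating by only about 0.3 points, which is less than a third of one letter grade on this scale.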
plot(district_model,which=1)
The residuals vs. fitted plot looks mostly fine. The red line is fairly
flat, which suggests the model roughly meets the assumption of
linearity. The striped pattern happens because the outcome takes only
three values (A, B, and C coded as 5, 4, and 3), so the data aren’t
truly continuous. It’s not a perfect fit, but I think the model still
works well enough to show the overall trend, considering the variable I
chose.
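The striped pattern can be reproduced with made-up data, which may help show that it comes from the coding of the outcome rather than from anything specific to the district file. Everything in this sketch is simulated and hypothetical (the sample size, the turnover range, and the noise level are all assumptions); it only illustrates that when the outcome has three possible values, each residual plus its fitted value equals one of those three values, so the points fall on three parallel lines.

```r
set.seed(1)

# Simulated (not real) turnover rates and a three-level numeric rating
turnover   <- runif(200, min = 5, max = 40)
latent     <- 4.5 - 0.015 * turnover + rnorm(200, sd = 0.6)
rating_num <- pmin(pmax(round(latent), 3), 5)  # forced into 3, 4, or 5

fit <- lm(rating_num ~ turnover)

# residual + fitted value always gives back the observed rating,
# so for each rating level: residual = level - fitted (a line with slope -1)
recovered <- unname(resid(fit) + fitted(fit))
print(isTRUE(all.equal(recovered, rating_num)))

# One diagonal stripe per rating level
plot(fitted(fit), resid(fit),
     col = as.integer(factor(rating_num)),
     xlab = "Fitted values", ylab = "Residuals")
```

Because the stripes are built into any linear model with a discrete outcome, they are not by themselves a sign that the model has gone wrong.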