1. Load your chosen dataset into Rmarkdown
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)

setwd("~/Desktop/UTSA/Quantitative Methods/RStudio")

district <- read_excel("district.xls")
  1. Select the dependent variable you are interested in, along with independent variables which you believe are causing the dependent variable
# Keep only districts rated A, B, or C, then code the letter rating numerically
clean_district <- district |>
  filter(DZRATING %in% c("A", "B", "C")) |>
  mutate(
    DZRATING_num = case_when(
      DZRATING == "A" ~ 5,
      DZRATING == "B" ~ 4,
      DZRATING == "C" ~ 3
    )
  )
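
The filter-and-recode step above can be checked on a small table before running it on the full file. This is a minimal sketch, assuming a made-up five-row data frame (`toy`, with invented values) that only borrows the column names `DZRATING` and `DPSTURNR`; it is not the district data.

```r
library(dplyr)

# Hypothetical toy data: these five rows are invented for illustration only
toy <- data.frame(
  DZRATING = c("A", "B", "C", "Not Rated", "B"),
  DPSTURNR = c(12.5, 18.0, 27.3, 30.1, NA)
)

# Same two steps as in the homework: keep A/B/C, then map letters to numbers
toy_clean <- toy |>
  filter(DZRATING %in% c("A", "B", "C")) |>
  mutate(
    DZRATING_num = case_when(
      DZRATING == "A" ~ 5,
      DZRATING == "B" ~ 4,
      DZRATING == "C" ~ 3
    )
  )

# Cross-tabulate the letter and numeric versions to confirm the mapping
table(toy_clean$DZRATING, toy_clean$DZRATING_num)
```

The "Not Rated" row is dropped by `filter()`, while the row with a missing turnover value is kept at this stage; `lm()` drops rows with missing values later, when the model is fit.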

  1. Create a linear model using the “lm()” command and save it to an object
  2. Call “summary()” on your new model
district_model <- lm(DZRATING_num ~ DPSTURNR, data = clean_district)

summary(district_model)
## 
## Call:
## lm(formula = DZRATING_num ~ DPSTURNR, data = clean_district)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5611 -0.3190 -0.1774  0.6368  1.6569 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.56111    0.04095 111.377   <2e-16 ***
## DPSTURNR    -0.01523    0.00177  -8.601   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5975 on 1145 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.06069,    Adjusted R-squared:  0.05987 
## F-statistic: 73.97 on 1 and 1145 DF,  p-value: < 2.2e-16
  1. interpret the model’s r-squared and p-values. How much of the dependent variable does the overall model explain? What are the significant variables? What are the insignificant variables?

ANSWER: I was a bit confused about why I got the same unrealistic warning when using my own (your) real-world district data; it was hard for me to apply this lecture to real-world numbers. I assume the cause is that my dependent variable is not continuous: accountability ratings only run from A to F, and I excluded the Not Rated districts.

The R-squared is only about 0.06, which means teacher turnover explains around 6% of the variation in accountability ratings. That’s not much, but it seems normal and suggests that many other factors are at play. The p-value for turnover (< 2e-16) shows the relationship is statistically significant, so the pattern is very unlikely to be due to chance alone; it just isn’t a strong one. Turnover is the only independent variable in the model, so there are no insignificant variables to report. In other words, districts with higher turnover tend to have slightly lower ratings, but turnover clearly isn’t the whole story.
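
The numbers discussed here can also be pulled out of the model object directly instead of being read off the printed summary. The sketch below uses simulated stand-in data (the names `turnover`, `rating`, and `toy_model` are hypothetical), so its values will not match the district model; the same calls should work on `district_model`.

```r
# Simulated stand-in data (hypothetical; not the district file)
set.seed(1)
turnover <- runif(200, min = 5, max = 50)
rating   <- 4.5 - 0.015 * turnover + rnorm(200, sd = 0.6)
toy_model <- lm(rating ~ turnover)

fit <- summary(toy_model)

# Share of the variance in the outcome explained by the model
fit$r.squared

# Coefficient table: estimate, standard error, t value, and p-value
coef_table <- coef(fit)
coef_table

# Names of predictors with p < 0.05 (the intercept row is dropped first)
p_values <- coef_table[-1, "Pr(>|t|)", drop = FALSE]
rownames(p_values)[p_values[, 1] < 0.05]
```

With several predictors, the last line gives a quick list of the significant ones; anything not listed is insignificant at the 0.05 level.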

ggplot(clean_district, aes(x = factor(DZRATING, levels = c("C", "B", "A")), y = DPSTURNR)) +
  geom_boxplot(fill = "purple") +
  labs(
    x = "Accountability Rating",
    y = "Teacher Turnover (%)",
    title = "Teacher Turnover by Accountability Rating"
  )
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

I also tried to reconcile the lecture with my homework and chosen variables by using a different plotting method; I think the boxplot makes the pattern easier to see than a scatterplot. Districts with lower accountability ratings tend to have higher teacher turnover, while those with A or B ratings usually have lower turnover. The spread within each group also shows that ratings aren’t determined by turnover alone; there’s a lot of variation. This fits what the regression showed: a small but real negative relationship.
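
The group differences that the boxplot shows can be backed up with a per-rating summary. This is a rough sketch on invented values (the data frame `toy` is hypothetical and only reuses the column names); on the real data the same `tapply()` call would take `clean_district$DPSTURNR` and `clean_district$DZRATING`.

```r
# Hypothetical toy data, made up for illustration
toy <- data.frame(
  DZRATING = c("A", "A", "B", "B", "C", "C"),
  DPSTURNR = c(10, 14, 18, NA, 25, 31)
)

# Order the ratings from lowest to highest, as in the boxplot
toy$DZRATING <- factor(toy$DZRATING, levels = c("C", "B", "A"))

# Median turnover within each rating, ignoring missing values
medians <- tapply(toy$DPSTURNR, toy$DZRATING, median, na.rm = TRUE)
medians
```

Setting `na.rm = TRUE` plays the same role as the rows ggplot2 removed in the warning above: missing turnover values are skipped rather than turning the whole group median into `NA`.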

  1. Choose some significant independent variables. Interpret its Estimates (or Beta Coefficients). How do the independent variables individually affect the dependent variable?

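ANSWER: The only independent variable I used in this model, teacher turnover (DPSTURNR), is statistically significant (p < .001). Its estimate (-0.015) means that each 1-percentage-point increase in teacher turnover is associated with a decrease of about 0.015 points in a district’s accountability rating on the numeric scale (A = 5, B = 4, C = 3). A 10-point increase in turnover therefore corresponds to a drop of only about 0.15 rating points.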
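
To make the size of the estimate concrete, the fitted line can be evaluated by hand. The intercept and slope below are copied from the summary() output above; the three turnover rates are hypothetical values chosen only for illustration.

```r
# Coefficients copied from the summary() output
intercept <- 4.56111
slope     <- -0.01523

# Hypothetical turnover rates (percent), chosen only for illustration
turnover <- c(10, 20, 30)

# Predicted numeric rating at each turnover rate
predicted <- intercept + slope * turnover
predicted

# Change in predicted rating for a 10-point increase in turnover
slope * 10
```

The predictions come out to roughly 4.41, 4.26, and 4.10, so even a 20-point gap in turnover moves the predicted rating by only about 0.3 points, well under one letter grade.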

In the cleaned dataset, ratings range only from A to C, so the model is capturing small shifts among generally higher-rated districts. That limited variation helps explain why the overall relationship is modest. My guess is that the excluded Not Rated districts would have been rated D or F had other factors not intervened. If so, the data don’t include the lowest-performing districts, where turnover might have a stronger effect.

  1. Does the model you create meet or violate the assumption of linearity? Show your work with “plot(x,which=1)”
plot(district_model,which=1)

The residuals vs. fitted plot looks mostly fine. The red line is fairly flat, which suggests the model roughly meets the assumption of linearity. The striped pattern appears because the accountability ratings take only three values (A through C), so the dependent variable isn’t truly continuous. It’s not a perfect fit, but I think the model still works well enough to show the overall trend, given the variable I chose.
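
The stripes can be reproduced with simulated data, which may help show that they come from the coding of the outcome rather than from a problem specific to this dataset. Everything below is a hypothetical stand-in (`turnover`, `latent`, `rating`, `toy_model`), not the district data: a continuous score is rounded and clamped to 3, 4, or 5 to mimic C/B/A.

```r
# Simulated stand-in data (hypothetical; not the district file)
set.seed(42)
turnover <- runif(300, min = 5, max = 50)
latent   <- 4.5 - 0.015 * turnover + rnorm(300, sd = 0.6)

# Round to whole numbers and clamp to 3..5, mimicking C/B/A coded as 3/4/5
rating <- pmin(pmax(round(latent), 3), 5)

toy_model <- lm(rating ~ turnover)

# residual = observed - fitted, so fitted + residual recovers the outcome,
# which takes only three values; this comparison should return TRUE
all.equal(unname(fitted(toy_model) + resid(toy_model)), rating)

# Each outcome value therefore traces one downward-sloping stripe
plot(toy_model, which = 1)
```

Because every point satisfies residual = rating - fitted, all districts with the same rating fall on one line with slope -1, giving one stripe per rating value.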