Homework 6 - Alise Hunte

Load your chosen dataset into Rmarkdown

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readxl)

setwd("~/Desktop/UTSA/Quantitative Methods/RStudio")

district <- read_excel("district.xls")

Select the dependent variable you are interested in, along with independent variables which you believe are causing the dependent variable

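# DA0GR21N = graduate count (dependent variable)
# DPSTURNR = teacher turnover rate, DPFRAALLK = revenue per pupil
clean_district <- district |>
  select(DA0GR21N, DPSTURNR, DPFRAALLK) |>
  drop_na()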

Create a linear model using the “lm()” command and save it to an object
Call “summary()” on your new model

district_model <- lm(DA0GR21N ~ DPSTURNR + DPFRAALLK, data = clean_district)

summary(district_model)

## 
## Call:
## lm(formula = DA0GR21N ~ DPSTURNR + DPFRAALLK, data = clean_district)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -649.3  -325.2  -214.4   -37.9 11218.8 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 965.181647  91.409334  10.559  < 2e-16 ***
## DPSTURNR    -14.029075   2.831411  -4.955 8.41e-07 ***
## DPFRAALLK    -0.021141   0.003839  -5.507 4.57e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 853.2 on 1074 degrees of freedom
## Multiple R-squared:  0.04632,    Adjusted R-squared:  0.04454 
## F-statistic: 26.08 on 2 and 1074 DF,  p-value: 8.705e-12

Interpret the model’s R-squared and p-values. How much of the dependent variable does the overall model explain? What are the significant variables? What are the insignificant variables?

ANSWER: This model explains about 4.6% of the variation in district graduate counts (R-squared = 0.0463, adjusted R-squared = 0.0445), so most of the variation in graduate counts comes from factors not included in the model. The overall model is still statistically significant (F = 26.08 on 2 and 1074 degrees of freedom, p < .001), meaning the two predictors together explain more variation than an intercept-only model would. Teacher turnover and revenue per pupil are each significant (p < .001), and the model has no insignificant variables. Districts with higher teacher turnover tend to graduate fewer students: roughly 14 fewer graduates for each one-percentage-point increase in turnover, holding revenue per pupil constant. Higher revenue per pupil is also linked to slightly lower graduate counts, which could reflect that smaller or higher-need districts spend more per student. So while both relationships are statistically significant, they explain only a small portion of the variation in graduate counts.
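The numbers quoted in this answer can also be pulled out of a model object directly instead of being read off the printed summary. The sketch below is only an illustration: it uses simulated stand-in data, and the names `toy`, `turnover`, `revenue`, and `graduates` are hypothetical, so its results will not match the district model.

```r
# Simulated stand-in data: variable names and values are hypothetical.
set.seed(42)
n <- 200
toy <- data.frame(
  turnover = runif(n, 5, 40),
  revenue  = runif(n, 8000, 20000)
)
toy$graduates <- 900 - 12 * toy$turnover - 0.02 * toy$revenue + rnorm(n, sd = 300)

toy_model   <- lm(graduates ~ turnover + revenue, data = toy)
toy_summary <- summary(toy_model)

# Share of variance explained, without and with the penalty for extra predictors
r2     <- toy_summary$r.squared
adj_r2 <- toy_summary$adj.r.squared

# Overall F-test p-value, rebuilt from the stored F statistic and its degrees of freedom
f <- toy_summary$fstatistic
overall_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# Coefficient table: estimate, standard error, t value, p-value for each term
coef_table  <- coef(toy_summary)
significant <- rownames(coef_table)[coef_table[, "Pr(>|t|)"] < 0.05]

print(round(c(r2 = r2, adj_r2 = adj_r2, overall_p = overall_p), 4))
print(significant)
```

Applied to the homework, the same calls on `district_model` should return the R-squared, F-test p-value, and coefficient table shown in the printed summary.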

Choose some significant independent variables. Interpret their Estimates (or Beta Coefficients). How do the independent variables individually affect the dependent variable?

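Both of my independent variables are statistically significant (p < .001). The coefficient for teacher turnover (-14.03) means that for every one-percentage-point increase in teacher turnover, a district graduates about 14 fewer students, holding revenue per pupil constant. The coefficient for revenue per pupil (-0.021) shows a small negative relationship: each additional dollar of revenue per student is associated with about 0.02 fewer graduates, or roughly 21 fewer graduates per $1,000, holding turnover constant. This could be because smaller or higher-need districts often spend more per student. Overall, both variables are negatively associated with graduate counts. Because the two coefficients are measured in different units (percentage points versus dollars), their sizes cannot be compared directly; judged by their t-values (-4.96 for turnover and -5.51 for revenue per pupil), the two relationships are similarly strong.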
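Since the two estimates are on very different scales, a little arithmetic makes them easier to read. The sketch below copies the estimates and standard errors from the summary() output above; it assumes DPFRAALLK is measured in dollars per pupil, which is how the answer interprets it.

```r
# Estimates and standard errors copied from the summary() output above
est <- c(DPSTURNR = -14.029075, DPFRAALLK = -0.021141)
se  <- c(DPSTURNR = 2.831411,  DPFRAALLK = 0.003839)

# Rescale the revenue slope from "per $1" to "per $1,000" (assumes dollar units)
per_thousand <- unname(est["DPFRAALLK"] * 1000)

# t statistics are unit-free, so they can be compared across predictors
t_stat <- est / se

print(round(per_thousand, 2))  # about -21.14 graduates per $1,000
print(round(t_stat, 3))        # matches the "t value" column: -4.955 and -5.507
```

The raw slope for turnover is far larger than the one for revenue per pupil, but that mostly reflects the units: a one-point change in a percentage is a much bigger step than a one-dollar change in revenue.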

Does the model you create meet or violate the assumption of linearity? Show your work with “plot(x,which=1)”

plot(district_model,which=1)

The residuals vs. fitted plot looks mostly flat, which suggests the model generally meets the assumption of linearity. The red line stays close to zero across most fitted values, meaning the relationship between the predictors and graduate counts is roughly linear. However, there are a few very large positive residuals (the largest is 11,218.8, while the median residual is -214.4), so a handful of districts graduate far more students than the model predicts, which could indicate some non-linearity for those districts. Overall, the linearity assumption is mostly satisfied, but these extreme data points might be influencing the fit.
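A visual check like this can be backed up with a numeric one. One option, sketched below, is to refit the model with squared terms added and compare the two fits with a nested F-test using anova(). This is not part of the assignment. The data are simulated, the names `toy`, `x1`, `x2`, and `y` are hypothetical, and the curvature is built in on purpose so the test has something to detect.

```r
# Simulated data with a deliberately non-linear effect of x2 (hypothetical names)
set.seed(1)
n <- 300
toy <- data.frame(x1 = runif(n, 0, 10), x2 = runif(n, 0, 10))
toy$y <- 5 + 2 * toy$x1 + 0.5 * toy$x2^2 + rnorm(n)

# Straight-line model versus a model that also allows curvature in each predictor
linear_fit <- lm(y ~ x1 + x2, data = toy)
curved_fit <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2), data = toy)

# Nested F-test: a small p-value means the squared terms improve the fit,
# which is evidence against the linearity assumption
comparison  <- anova(linear_fit, curved_fit)
p_curvature <- comparison[["Pr(>F)"]][2]

print(comparison)
```

For the homework model, the same comparison could be run with squared versions of DPSTURNR and DPFRAALLK. A large p-value there would agree with the mostly flat red line in the residuals vs. fitted plot.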

2025-10-29