library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
Public_School_Characteristics_2022_23 <- read_csv("Public_School_Characteristics_2022-23.csv")
## Rows: 101390 Columns: 77
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (23): NCESSCH, SURVYEAR, STABR, LEAID, ST_LEAID, LEA_NAME, SCH_NAME, LST...
## dbl (54): X, Y, OBJECTID, STATUS, TOTFRL, FRELCH, REDLCH, DIRECTCERT, PK, KG...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
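# Regress student-teacher ratio on lunch-status count, locale, and White/Hispanic enrollment counts
psc_model <- lm(STUTERATIO ~ TOTFRL + ULOCALE + WH + HI, data = Public_School_Characteristics_2022_23)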
summary(psc_model)
## 
## Call:
## lm(formula = STUTERATIO ~ TOTFRL + ULOCALE + WH + HI, data = Public_School_Characteristics_2022_23)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -58.5   -3.4   -0.9    1.9 3584.4 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                13.9267697  0.2052804  67.843  < 2e-16 ***
## TOTFRL                      0.0012621  0.0003691   3.420 0.000627 ***
## ULOCALE12-City: Mid-size    0.2429880  0.3463773   0.702 0.482985    
## ULOCALE13-City: Small      -0.2138650  0.3390579  -0.631 0.528197    
## ULOCALE21-Suburb: Large    -0.2444225  0.2369523  -1.032 0.302297    
## ULOCALE22-Suburb: Mid-size  0.3357734  0.4382972   0.766 0.443627    
## ULOCALE23-Suburb: Small    -0.2419736  0.5411577  -0.447 0.654774    
## ULOCALE31-Town: Fringe     -0.1383734  0.4598818  -0.301 0.763500    
## ULOCALE32-Town: Distant    -0.4625817  0.3494587  -1.324 0.185603    
## ULOCALE33-Town: Remote      1.5417867  0.4050560   3.806 0.000141 ***
## ULOCALE41-Rural: Fringe    -0.7278903  0.2868867  -2.537 0.011176 *  
## ULOCALE42-Rural: Distant   -1.5699833  0.2992098  -5.247 1.55e-07 ***
## ULOCALE43-Rural: Remote    -2.4718333  0.3444846  -7.175 7.26e-13 ***
## WH                          0.0043632  0.0002875  15.177  < 2e-16 ***
## HI                          0.0031666  0.0004385   7.222 5.16e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.13 on 97084 degrees of freedom
##   (4291 observations deleted due to missingness)
## Multiple R-squared:  0.007507,   Adjusted R-squared:  0.007364 
## F-statistic: 52.45 on 14 and 97084 DF,  p-value: < 2.2e-16

The outcome is STUTERATIO (student-teacher ratio). Predictors are TOTFRL (total number of students with free or reduced-price lunch status), ULOCALE (locale code), WH (total number of White students), and HI (total number of Hispanic students).

The overall F-test has a p-value below 2.2e-16, which is extremely low, so we can reject the null hypothesis that TOTFRL, ULOCALE, WH, and HI jointly have no effect on STUTERATIO (that is, that all of their coefficients are zero).
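The p-value on the last line of the summary comes from the F-statistic and its two degrees of freedom. A minimal sketch of how it can be recomputed from a fitted model follows. The helper name `overall_f_pvalue` is hypothetical, and the built-in `mtcars` data stands in for the school data so the example runs on its own.

```r
# Hypothetical helper: recompute the overall F-test p-value from a fitted lm object.
overall_f_pvalue <- function(model) {
  f <- summary(model)$fstatistic  # named vector: value, numdf, dendf
  unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
}

# Stand-in model on the built-in mtcars data (not the school data).
demo_model <- lm(mpg ~ wt + hp, data = mtcars)
demo_p <- overall_f_pvalue(demo_model)
print(demo_p)
```

Calling `overall_f_pvalue(psc_model)` should return the exact value that the summary prints as `< 2.2e-16`.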

The R-squared value of 0.0075, which is also extremely low, means that only about 0.75% of the variance in STUTERATIO is explained by the independent variables listed above (TOTFRL, ULOCALE, WH, HI). In other words, although the predictors are jointly significant, which is unsurprising with roughly 97,000 observations, they account for almost none of the variation in student-teacher ratio. Other variables not included in this model are likely at play.
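R-squared is the share of the outcome's variance that the model accounts for: one minus the residual sum of squares divided by the total sum of squares. The sketch below checks that definition against the value `summary()` reports, again using the built-in `mtcars` data as a stand-in rather than the school data.

```r
# Stand-in model on built-in data (not the school data).
demo_model <- lm(mpg ~ wt + hp, data = mtcars)

# Residual sum of squares: variation the model leaves unexplained.
rss <- sum(residuals(demo_model)^2)
# Total sum of squares: variation of the outcome around its mean.
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

r2_manual <- 1 - rss / tss
r2_summary <- summary(demo_model)$r.squared
print(c(manual = r2_manual, summary = r2_summary))
```

The same arithmetic applies to `psc_model`, provided the total sum of squares is computed only over the rows that were actually used in the fit, since 4291 observations were dropped for missingness.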

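Based on the p-values, the significant terms at the 0.05 level are TOTFRL, Town: Remote, Rural: Fringe, Rural: Distant, Rural: Remote, WH, and HI. The remaining locale levels are not significant. Each locale coefficient is a contrast against the omitted reference level, which is absent from the coefficient table (presumably 11-City: Large, since the listed levels start at 12). TOTFRL, with a p-value of 0.0006, lets us reject the null hypothesis that it is unrelated to STUTERATIO. More interestingly, the p-value of 7.26e-13 for Rural: Remote lets us reject the null hypothesis that its student-teacher ratio does not differ from that of the reference locale; holding the other predictors fixed, the estimate is about 2.47 students per teacher lower. Because several locale levels are significant even though most are not, we cannot conclude that locale code has no effect on student-teacher ratio.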
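The individual p-values for a categorical predictor only compare each level with the reference level. Whether the predictor matters as a whole can be tested by comparing the model with and without it. The sketch below does this with `anova()` on the built-in `iris` data, where `Species` plays the role that ULOCALE plays in the school model; it is an illustration, not a result for the school data.

```r
# Stand-in data: Species is a factor, analogous to ULOCALE in the school model.
reduced <- lm(Sepal.Length ~ Petal.Length, data = iris)
full    <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# Partial F-test: do all Species coefficients jointly equal zero?
joint_test <- anova(reduced, full)
joint_p <- joint_test$`Pr(>F)`[2]
print(joint_test)

# The first factor level is the reference that every coefficient is compared to.
reference_level <- levels(iris$Species)[1]
print(reference_level)
```

For the school data, both models would need to be fitted on the same rows, so rows with a missing ULOCALE would have to be dropped before fitting the reduced model.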

plot(psc_model,which=1)

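It seems that this model roughly meets the assumption of linearity, seeing that most of the residuals lie close to the red line. Note that the red line is a smoothed trend through the residuals, not the fitted regression line; linearity is supported when it stays near the horizontal zero line. The residual summary above also shows a maximum residual of 3584.4, so at least one extreme outlier is present and may compress the scale of the plot.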
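When a few residuals are very large, they can dominate the residuals-versus-fitted plot, so it helps to list them explicitly. The sketch below flags observations whose standardized residual exceeds a cutoff. The data are simulated with one injected outlier, and both the helper name `flag_outliers` and the cutoff of 3 are assumptions chosen for illustration.

```r
# Hypothetical helper: row positions whose standardized residual exceeds the cutoff.
flag_outliers <- function(model, cutoff = 3) {
  unname(which(abs(rstandard(model)) > cutoff))
}

# Simulated linear data with one injected outlier (not the school data).
set.seed(1)
x <- 1:100
y <- 2 + 0.5 * x + rnorm(100)
y[50] <- y[50] + 50  # push observation 50 far above the line

sim_model <- lm(y ~ x)
flagged <- flag_outliers(sim_model)
print(flagged)
```

Applied to `psc_model`, the flagged rows could be inspected for data-entry problems, such as schools reporting very few teachers, before deciding whether to refit without them.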