The SAT is a standardized test widely used for college admissions in the United States. Since its debut by the College Board in 1926, its name and scoring have changed several times; originally called the Scholastic Aptitude Test, it was later called the Scholastic Assessment Test, then the SAT I: Reasoning Test, then the SAT Reasoning Test, and now, simply the SAT. The SAT is wholly owned, developed, and published by the College Board, a private, non-profit organization in the United States. It is administered on behalf of the College Board by the Educational Testing Service, which until recently developed the SAT as well. The test is intended to assess students’ readiness for college. The SAT was originally designed not to be aligned with high school curricula, but several adjustments were made for the version introduced in 2016 to align it with the Common Core standards followed in high school. The current SAT, introduced in 2016, takes three hours to finish, plus 50 minutes for the SAT with essay. Scores on the SAT range from 400 to 1600, combining test results from two 800-point sections: mathematics, and evidence-based reading and writing.
The SAT continues to be a significant component of the application process used by US universities to assess student qualifications and make decisions regarding offers of admission. In an ideal world, the SAT would be a true indicator of a student’s academic potential and aptitude and help determine whether the student and the university are a good fit for each other. Given this, it would be natural to expect that SAT scores reflect just the student’s capabilities without being impacted by other demographic or socio-economic factors.
In this study, I attempt to assess whether that is indeed the case, or whether factors like household income influence students’ SAT scores. If there were a causal relationship between household income and SAT scores, it would run contrary to the efforts to make college admissions a fair process. If SAT scores were found to be demonstrably higher for affluent students, it would provide evidence that the results are not based on merit alone, and that certain groups of students have a distinct advantage over others.
While the US school system is dominated by public schools, these co-exist with other types of schools such as private schools, charter schools and vocational or technical schools. These are the main choices available to students who don’t have special needs. While public schools are run by individual school districts that receive state funding and follow common standards, private schools are owned and operated by private entities that, unlike public schools, charge fees. Charter schools are run by private entities but require state approval and receive public funds from the school system, albeit to a lesser extent than public schools. These are typically formed to serve a specific “charter” or objective in specific districts in order to customize the education process for certain segments of students who might otherwise be constrained within a “one size fits all” type of model. Lastly, vocational or technical schools have been started in some towns within the public school system to provide vocational or technical skills not commonly provided in regular schools. This category also includes “magnet” schools that place additional emphasis on STEM (Science, Technology, Engineering and Mathematics) skills.
I also attempt to analyze whether the type of school has a bearing on the students’ SAT scores as well. If indeed there was a causal relationship between school type and SAT scores, it would give school administrators something to think about in terms of creating the “right” type of environment and infrastructure that allows students to fulfill their potential.
I restrict my analysis to my home state i.e. New Jersey (NJ).
Issues such as whether SAT scores are the correct metric to assess student potential, or whether they are indicators of performance in freshman year, are not studied here. These could be areas for future research.
To summarize, my research question is: “Can SAT scores be predicted by household income (at the school district level) and school type (public or vocational or charter or private)?” Stated differently, is there a statistically significant association between household income, school type and SAT scores?
Null Hypothesis: The slopes of all the independent variables (household income and school type) are 0, i.e. there is no association between SAT score and either household income or school type.
Alternative Hypothesis: The slope of at least one independent variable (household income or school type) is not 0, i.e. there is an association between SAT score and household income and/or school type.
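Written out for the fuller model fitted later in this report (household income plus school type, with Charter as the baseline category), the two hypotheses can be stated as below. The symbols $\beta_j$ and the indicator names are notation introduced here for illustration only.

```latex
\begin{aligned}
\text{SAT}_i &= \beta_0 + \beta_1\,\text{Income}_i + \beta_2\,\text{Private}_i
              + \beta_3\,\text{Public}_i + \beta_4\,\text{Vocational}_i + \varepsilon_i \\
H_0 &: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0 \\
H_A &: \beta_j \neq 0 \ \text{for at least one } j \in \{1, 2, 3, 4\}
\end{aligned}
```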
For my study, I obtained data from public sources on the Internet for the 2016-17 and 2018-19 academic years. I combine data from multiple sources to come up with a dataset for this project, as listed below:
I downloaded information from NJ’s Department of Education - Performance Report for the 2016-17 academic year. This dataset provides SAT scores for Math, Reading & Writing and Total_Average, by high school.
I used information from the schools metadata to label the applicable schools as Vocational or Charter schools. I labelled the remaining schools as Public schools.
I separately searched for and discovered a dataset on the Internet showing Average Total SAT score (not broken out into Math and Reading/Writing sections) for private schools in NJ, for the 2018-19 academic year. While I understand that combining data from 2 different academic years is not the ideal approach, the paucity of data about private schools (which are under no compulsion to provide public information) makes any type of combined analysis difficult. So I make the assumption that SAT scores for a given “type” of school should have a very strong correlation year-over-year. Typically, scores in 2 consecutive years would not show a significant idiosyncratic deviation and would largely be driven by underlying common infrastructure such as teachers, equipment, facilities etc.
Data has been collected via a combination of surveys as described below:
Average Total SAT scores from the school performance reports for 2016-17 published by the NJ State Department of Education (website: https://rc.doe.state.nj.us/ReportsDatabase1617.aspx), and, for private schools, from the following website: https://www.privateschoolreview.com/sat-score-stats/new-jersey
Median household income as collected in the American Community Survey (ACS) census data for 2017 as available on the following website: https://factfinder.census.gov/faces/nav/jsf/pages/download_center.xhtml
Type of school: This is a categorical variable that indicates whether the school is a charter school, vocational or technical school, private school, or a public school.
I had to perform certain data cleansing (converting to upper case, removing extra spaces, removing redundant strings from school and district names, expanding abbreviated names, aligning spellings such as “Boro” and “Borough” to make them consistent, looking up zip codes etc.) to be able to cross-reference the SAT score data (1) and the household income data (2) above. Even after doing that, there were a few cases where the SAT scores were available but the median income was not, and vice versa. I excluded all such cases from my data set. As a result, my usable data was whittled down to 441 records.
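The cleansing and matching steps described above could look roughly like the following. This is a minimal sketch, not the code actually used for the project: the helper `normalize_name` and the toy data frames are hypothetical, and only two of the cleansing rules are shown.

```r
# Hypothetical helper: standardize a school or district name before matching.
normalize_name <- function(x) {
  x <- toupper(x)                                      # convert to upper case
  x <- gsub("\\bBORO\\b", "BOROUGH", x, perl = TRUE)   # align "Boro" with "Borough"
  x <- gsub("\\s+", " ", x, perl = TRUE)               # collapse repeated spaces
  trimws(x)                                            # drop leading/trailing spaces
}

# Toy inputs: the two matched rows reuse values shown later in this report;
# "Example City" is a made-up district with no income record.
scores <- data.frame(
  District.Name = c("Audubon  Boro", " barnegat township", "Example City"),
  Average.Total.Score = c(1087, 1104, 1000),
  stringsAsFactors = FALSE
)
income <- data.frame(
  District.Name = c("AUDUBON BOROUGH", "BARNEGAT TOWNSHIP"),
  Median.HH.Income = c(75136, 69877),
  stringsAsFactors = FALSE
)

scores$District.Name <- normalize_name(scores$District.Name)
income$District.Name <- normalize_name(income$District.Name)

# merge() defaults to an inner join, so schools without a match are dropped.
matched <- merge(scores, income, by = "District.Name")
print(matched)
```

Using an inner join mirrors the decision to exclude cases where either the SAT score or the median income is missing.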
I combined all the above data into one common CSV file, which I have posted on GitHub.
Each case here represents an individual high school. The median household income was: i) based on the school district name, so if a district has more than 1 high school, then they would all have the same value for median household income, or ii) based on zip code for private schools.
Average Total SAT scores for each school: This is a numeric variable. Individual composite scores can only take integer values between a minimum of 400 and a maximum of 1600, in 10-point increments, and the school-level averages are reported as whole numbers, so it can be treated as a discrete numeric variable.
Median household income for school district or zip code. This is a numeric variable, that can take on a wide range of positive, integer values. This can be treated as a continuous variable.
Type of school: This is a categorical variable denoting Type of School: Charter, Vocational, Public and Private. This has been coded with values 1,2,3 and 4 respectively.
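As a small illustration of the coding above, the numeric codes can be turned into a labelled factor in R. The vector `type_code` is made up for the example; the labels follow the 1 to 4 coding just described.

```r
# Numeric codes as described above; the example vector itself is illustrative.
type_code <- c(1, 2, 3, 4, 3, 3)

school_type <- factor(
  type_code,
  levels = 1:4,
  labels = c("Charter", "Vocational", "Public", "Private")
)

levels(school_type)   # the first level acts as the regression baseline
table(school_type)    # counts per category
```

With R's default treatment contrasts, `lm()` uses the first factor level as the reference category, which is why Charter ends up as the baseline in the regression reported later.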
This is an observational study. The data was collected via survey methods based on the information provided on the websites listed in the Data Collection section above. There was no experimental design - so no separation between control groups and treatment groups.
The population of interest is students, their households’ income in the state of New Jersey, and the type of schools attended. The findings from this analysis can be generalized to the general population in NJ because except for a few cases where data was not available, the sample comprises a majority of the population in NJ.
The analysis could potentially be generalized to the population of all high school seniors in the USA to some extent, if we assume that on average, household economics and school curricula are largely in sync across the country. Obviously, states differ in per capita income and economic conditions, and the schools may not follow exactly the same curriculum. Generalizing across time (scores in different years) is also not completely unreasonable, given the slow-moving nature of the variables involved: SAT scores for the same high school can be expected to be very strongly correlated from one year to the next. The SAT can have different levels of difficulty in some years, but by and large, the intention is to keep the level of difficulty consistent to make scores comparable.
Generalizing the research to outside the USA would not be feasible since the SAT is used primarily by US universities. Other countries would have their own standardized tests and education structures.
Given the observational nature of the study, causal relationships cannot be inferred from it. We can only establish association.
I perform the following exploratory data analysis on the dataset.
# Load required libraries
library(readxl)
library(ggplot2)
library(RCurl)
library(R.utils)
library(xlsx)
library(httr)
library(dplyr)
library(magrittr)
library(tidyverse)
library(kableExtra)
# Create the input parameter based on the URL for the input data file
infile<-getURL("https://raw.githubusercontent.com/Jagdish16/jagdish_r_repo/master/DATA606/Project/NJ_Schools_Data1.csv")
# Read the input data file into a tibble
all.schools<-as_tibble(read.csv(text=infile, header=TRUE))
num.all.schools<-count(all.schools)
paste("There are a total of ",num.all.schools," public and private schools in the sample")
## [1] "There are a total of 441 public and private schools in the sample"
# Address text formatting issue
all.schools%<>%rename(District.Code = X.U.FEFF.District.Code)
# Set correct data types for different columns
all.schools$Type.Label <- as.factor(all.schools$Type.Label)
all.schools$District.Code <- as.factor(all.schools$District.Code)
all.schools$School.Code <- as.factor(all.schools$School.Code)
all.schools$County.Code <- as.factor(all.schools$County.Code)
all.schools$School.Type <- as.factor(all.schools$School.Type)
all.schools$School.Name <- as.character(all.schools$School.Name)
all.schools$County.Name <- as.character(all.schools$County.Name)
all.schools$District.Name <- as.character(all.schools$District.Name)
all.schools$City <- as.character(all.schools$City)
all.schools$Zip.Code <- as.character(all.schools$Zip.Code)
# Examine the structure of the data frame
#str(all.schools)
# View the first few records
head(all.schools)%>%kable()

| District.Code | School.Code | County.Code | County.Name | District.Name | School.Name | City | Zip.Code | School.Type | Type.Label | Math | Reading.and.Writing | Average.Total.Score | Median.HH.Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6010 | 910 | 80 | MONMOUTH | ACADEMY CHARTER HIGH SCHOOL | ACADEMY CHARTER HIGH SCHOOL | LAKE COMO | 7719 | Charter | 1 | 448 | 457 | 905 | 92414 |
| 6032 | 901 | 80 | MIDDLESEX | ACADEMY FOR URBAN LEADERSHIP CHARTER SCHOOL | ACADEMY FOR URBAN LEADERSHIP CHARTER SCHOOL | Perth Amboy | 8861 | Charter | 1 | 470 | 469 | 939 | 51389 |
| 110 | 10 | 1 | ATLANTIC | ATLANTIC CITY | ATLANTIC CITY HIGH SCHOOL | Atlantic City | 8401 | Public | 3 | 524 | 526 | 1050 | 26006 |
| 120 | 10 | 1 | ATLANTIC | ATLANTIC COUNTY VOCATIONAL | ATLANTIC COUNTY INSTITUTE OF TECHNOLOGY | Mays Landing | 8330 | Vocational | 2 | 528 | 530 | 1058 | 64266 |
| 150 | 10 | 7 | CAMDEN | AUDUBON BOROUGH | AUDUBON JUNIOR/SENIOR HIGH SCHOOL | Audubon | 8106 | Public | 3 | 536 | 551 | 1087 | 75136 |
| 185 | 30 | 29 | OCEAN | BARNEGAT TOWNSHIP | BARNEGAT HIGH SCHOOL | Barnegat | 8005 | Public | 3 | 556 | 548 | 1104 | 69877 |
# Create a separate data frame for all schools except Private schools (since this data pertains to a different year)
public.schools<-all.schools%>%filter(School.Type != "Private")
num.public.schools<-count(public.schools)
paste("There are a total of ",num.public.schools," public schools in the sample.")
## [1] "There are a total of 402 public schools in the sample."
#View(public.schools)
# Create vectors for the 3 variables of interest
total.score<-all.schools$Average.Total.Score
hh.income<-all.schools$Median.HH.Income
school.type<-all.schools$Type.Label
pub.total.score<-public.schools$Average.Total.Score
pub.hh.income<-public.schools$Median.HH.Income
# Compute summary stats for the numeric variables of interest
summary(total.score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     795    1016    1106    1102    1178    1502
summary(hh.income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   20529   62739   81011   83896  106039  202862
# Compute mean, minimum, maximum and standard deviation for the numeric variables of interest
mean.total.score<-mean(total.score)
sd.total.score<-sd(total.score)
mean.hh.income<-mean(hh.income)
sd.hh.income<-sd(hh.income)
cat("The mean SAT score across all NJ high schools for 2016-17 is ",mean(total.score),"\n")
## The mean SAT score across all NJ high schools for 2016-17 is 1102.259
cat("The standard deviation of SAT scores across all NJ high schools for 2016-17 is ",sd(total.score),"\n")
## The standard deviation of SAT scores across all NJ high schools for 2016-17 is 130.7444
cat("The minimum and maximum SAT scores across all NJ high schools for 2016-17 are",min(total.score),"and ", max(total.score),"\n")
## The minimum and maximum SAT scores across all NJ high schools for 2016-17 are 795 and 1502
cat("The average median household income across all NJ school districts is ",mean(hh.income),"\n")
## The average median household income across all NJ school districts is 83895.85
cat("The standard deviation of median household income across all NJ school districts is ",sd(hh.income),"\n")
## The standard deviation of median household income across all NJ school districts is 33649.71
cat("The minimum and maximum median household incomes across all NJ school districts and zip codes for 2016-17 are",min(hh.income),"and ", max(hh.income),"\n")
## The minimum and maximum median household incomes across all NJ school districts and zip codes for 2016-17 are 20529 and 202862
# Examine a histogram, boxplot and normal Q-Q plot of the 2 numeric variables
hist(total.score, breaks = 10, probability = TRUE, main="Average Total Score", xlab="SAT Scores")
# Overlay a normal density curve with the same mean and standard deviation
x <- 400:1600
y <- dnorm(x = x, mean = mean.total.score, sd = sd.total.score)
lines(x = x, y = y, col = "blue")
boxplot(total.score)
qqnorm(total.score)
hist(hh.income, breaks = 10, probability = TRUE, main="Median Household Income in US Dollars", xlab="Household Income")
x <- 0:230000
y <- dnorm(x = x, mean = mean.hh.income, sd = sd.hh.income)
lines(x = x, y = y, col = "red")
boxplot(hh.income)
qqnorm(hh.income)
From the histogram, it can be seen that the distribution for total score seems nearly normal. The boxplot does show some outliers outside the range. The qqnorm plot shows that the distribution for total score seems nearly normal.
From the histogram and the qqnorm plot above, it can be seen that the distribution for median household income does show a little right-skewness which can be expected because of some really high income values for a few districts/towns. The boxplot does show outliers outside the range.
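The visual impression of right-skewness could be backed by a number. The sketch below computes a moment-based sample skewness; the helper `sample_skewness` and the two example vectors are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: moment-based sample skewness (positive = right-skewed).
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Illustrative values only, not taken from the dataset.
right_skewed <- c(1, 2, 3, 4, 10)
symmetric <- c(1, 2, 3, 4, 5)

sample_skewness(right_skewed)   # positive: long right tail
sample_skewness(symmetric)      # zero: symmetric around the mean
```

Applied to this report's data, the call would be `sample_skewness(hh.income)`; a clearly positive value would agree with the histogram.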
# Check for a linear relationship visually
# Create a scatter plot of the response (dependent) and the predictor (independent) variable for all schools. Use color coding to indicate the categorical variable that is the second independent variable
ggplot(all.schools, aes(Average.Total.Score, Median.HH.Income)) +
  geom_point(aes(color = School.Type)) + scale_color_brewer(palette="Dark2")
# Create a scatter plot of the response (dependent) and the predictor (independent) variable for public schools only. Use color coding to indicate the categorical variable that is the second independent variable
ggplot(public.schools, aes(Average.Total.Score, Median.HH.Income)) +
  geom_point(aes(color = School.Type)) + scale_color_brewer(palette="Dark2")
The scatter plots seem to indicate a positive-sloping relationship between household income and SAT scores, both for all schools and for public schools only.
# Examine the count for each type of the categorical variable
table(all.schools$School.Type)
##
##    Charter    Private     Public Vocational
##         18         39        341         43
It can be seen that a majority of the observations pertain to public schools. In fact, the number of observations for each of the other “types” is much lower.
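To put the imbalance in numbers, the counts from the table above can be converted to percentages. The counts are copied from the output; the variable names are chosen here for illustration.

```r
# Counts copied from the table() output above.
counts <- c(Charter = 18, Private = 39, Public = 341, Vocational = 43)

shares <- round(100 * counts / sum(counts), 1)   # percentage of all schools
print(shares)
```

Public schools make up roughly 77% of the 441 records, so estimates for the smaller categories rest on far fewer observations.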
# Calculate the z-score for both total score and household income for all schools and public schools
mean.total.score<-mean(total.score)
sd.total.score<-sd(total.score)
z.total.score<-((total.score-mean.total.score)/sd.total.score)
qqnorm(z.total.score)
mean.hh.income<-mean(hh.income)
sd.hh.income<-sd(hh.income)
z.hh.income<-((hh.income-mean.hh.income)/sd.hh.income)
qqnorm(z.hh.income)
pub.mean.total.score<-mean(pub.total.score)
pub.sd.total.score<-sd(pub.total.score)
pub.z.total.score<-((pub.total.score-pub.mean.total.score)/pub.sd.total.score)
qqnorm(pub.z.total.score)
pub.mean.hh.income<-mean(pub.hh.income)
pub.sd.hh.income<-sd(pub.hh.income)
pub.z.hh.income<-((pub.hh.income-pub.mean.hh.income)/pub.sd.hh.income)
qqnorm(pub.z.hh.income)
From the qqnorm plot of the z-scores (scaled total score) above, it can be seen that the distribution for total score seems nearly normal.
From the qqnorm plot of the z-scores (scaled hh income) above, it can be seen that the distribution for household income is a little right-skewed.
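The manual z-score calculation above is equivalent to base R's `scale()`. The short check below uses the six household incomes visible in the table preview earlier as illustrative input; the variable names are introduced here for the example.

```r
# Illustrative values: the six household incomes shown in the table preview above.
income <- c(92414, 51389, 26006, 64266, 75136, 69877)

z_manual <- (income - mean(income)) / sd(income)   # same formula as in the report
z_scale <- as.numeric(scale(income))               # base R shortcut

all.equal(z_manual, z_scale)
```

Because standardizing is a linear transformation, a Q-Q plot of the z-scores has the same shape as one of the raw values; only the axis scale changes.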
Before fitting the linear model, rescale the median household income from dollars to hundreds of dollars.
# Scale hh income to hundreds of dollars, and add the z-scores for the 2 numeric variables to the data frame
all.schools$Median.HH.Income<-all.schools$Median.HH.Income/100
head(all.schools)
## # A tibble: 6 x 14
## District.Code School.Code County.Code County.Name District.Name
## <fct> <fct> <fct> <chr> <chr>
## 1 6010 910 80 MONMOUTH ACADEMY CHAR~
## 2 6032 901 80 MIDDLESEX ACADEMY FOR ~
## 3 110 10 1 ATLANTIC ATLANTIC CITY
## 4 120 10 1 ATLANTIC ATLANTIC COU~
## 5 150 10 7 CAMDEN AUDUBON BORO~
## 6 185 30 29 OCEAN BARNEGAT TOW~
## # ... with 9 more variables: School.Name <chr>, City <chr>,
## # Zip.Code <chr>, School.Type <fct>, Type.Label <fct>, Math <int>,
## # Reading.and.Writing <int>, Average.Total.Score <int>,
## # Median.HH.Income <dbl>
all.schools$z.total.score<-z.total.score
all.schools$z.hh.income<-z.hh.income
public.schools$Median.HH.Income<-public.schools$Median.HH.Income/100
head(public.schools)
## # A tibble: 6 x 14
## District.Code School.Code County.Code County.Name District.Name
## <fct> <fct> <fct> <chr> <chr>
## 1 6010 910 80 MONMOUTH ACADEMY CHAR~
## 2 6032 901 80 MIDDLESEX ACADEMY FOR ~
## 3 110 10 1 ATLANTIC ATLANTIC CITY
## 4 120 10 1 ATLANTIC ATLANTIC COU~
## 5 150 10 7 CAMDEN AUDUBON BORO~
## 6 185 30 29 OCEAN BARNEGAT TOW~
## # ... with 9 more variables: School.Name <chr>, City <chr>,
## # Zip.Code <chr>, School.Type <fct>, Type.Label <fct>, Math <int>,
## # Reading.and.Writing <int>, Average.Total.Score <int>,
## # Median.HH.Income <dbl>
public.schools$pub.z.total.score<-pub.z.total.score
public.schools$pub.z.hh.income<-pub.z.hh.income
Calculate the correlation between the total score and the household income, for all schools and for public schools only.
cor(all.schools$Average.Total.Score, all.schools$Median.HH.Income)
## [1] 0.6657101
cor(public.schools$Average.Total.Score, public.schools$Median.HH.Income)
## [1] 0.7125073
It can be seen that the correlation between the total score and household income is 0.67 for all schools, and 0.71 for public schools only.
# Run the linear model fitting for all schools and public schools only and inspect the model statistics
all.score.income<-lm(all.schools$Average.Total.Score~all.schools$Median.HH.Income)
summary(all.score.income)
##
## Call:
## lm(formula = all.schools$Average.Total.Score ~ all.schools$Median.HH.Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -301.10 -52.22 -10.12 35.70 392.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 885.25466 12.50649 70.78 <2e-16 ***
## all.schools$Median.HH.Income 0.25866 0.01384 18.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 97.67 on 439 degrees of freedom
## Multiple R-squared: 0.4432, Adjusted R-squared: 0.4419
## F-statistic: 349.4 on 1 and 439 DF, p-value: < 2.2e-16
public.score.income<-lm(public.schools$Average.Total.Score~public.schools$Median.HH.Income)
summary(public.score.income)
##
## Call:
## lm(formula = public.schools$Average.Total.Score ~ public.schools$Median.HH.Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -290.44 -45.46 -2.20 38.45 404.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 870.54870 11.51421 75.61 <2e-16 ***
## public.schools$Median.HH.Income 0.26183 0.01289 20.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 86.83 on 400 degrees of freedom
## Multiple R-squared: 0.5077, Adjusted R-squared: 0.5064
## F-statistic: 412.5 on 1 and 400 DF, p-value: < 2.2e-16
For the public schools, the intercept is about 870. This can be ignored for the purposes of our analysis since it only sets the height of the fitted line. Given that the lowest household income is about $20,000, a household income of 0 lies outside the observed data and does not provide any meaningful insight into the linear relationship.
The slope for the household income variable is 0.26. It indicates that on average, for every $100 change in median household income, there is a 0.26 point change in the average total SAT score in the same direction, i.e. as income increases, so does the SAT score, and vice versa. The very low p-value indicates that the slope for household income is statistically significant.
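As a worked example of this interpretation, the fitted line for public schools can be evaluated by hand. The coefficients are copied from the model summary above; the helper `predict_score` and the income values are hypothetical choices for illustration.

```r
# Coefficients copied from the public-schools model summary above.
intercept <- 870.54870
slope_per_100 <- 0.26183   # SAT points per $100 of median household income

# Hypothetical helper: predicted average SAT score for an income in dollars.
predict_score <- function(income_dollars) {
  intercept + slope_per_100 * (income_dollars / 100)
}

predict_score(80000)                          # about 1080
predict_score(90000) - predict_score(80000)   # about 26.2 points per $10,000
```

In other words, a $10,000 difference in median household income corresponds to a difference of roughly 26 points in the predicted average score.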
For public schools, the adjusted R-squared is 0.51. This implies that household income explains about 51% of the variance in SAT scores, on average.
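For a regression with a single predictor, the multiple R-squared equals the square of the correlation coefficient, which gives a quick consistency check between the correlations and the model summaries reported above. The variable names below are introduced only for this check.

```r
# Correlations copied from the cor() outputs above.
r_all <- 0.6657101
r_public <- 0.7125073

round(r_all^2, 4)      # 0.4432, matching Multiple R-squared for all schools
round(r_public^2, 4)   # 0.5077, matching Multiple R-squared for public schools
```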
#plot(all.score.income)
plot(all.score.income$residuals)
abline(h = 0)  # horizontal reference line at zero
plot(all.score.income$fitted.values, all.score.income$residuals)
plot(public.score.income$residuals)
abline(h = 0)  # horizontal reference line at zero
plot(public.score.income$fitted.values, public.score.income$residuals)
hist(public.score.income$residuals)
qqnorm(public.score.income$residuals)
#qqline(public.score.income$residuals)
#plot(public.score.income)
Linear relationship between all independent variables and the response variable: The scatter plot of residuals versus fitted values seems evenly distributed around 0.
Nearly normal residuals: The histogram and the qqnorm plot of the residuals show skewness. This is because of some outliers in the data, which can also be seen in the variance of the residuals appearing lower at higher score levels. While the distribution of residuals is centered at 0 in the histogram, the spread is not symmetric due to right-skewness.
Constant variability (Homoscedasticity): Variability of points around least squares line is roughly constant i.e. variability of residuals around the “0-line” is roughly constant.
Independence between observations: It can be assumed that the observations (SAT scores and household income) are independent across different observations.
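The constant-variability condition above was judged visually. A formal check is also possible; the sketch below is not part of the original analysis. It computes a studentized Breusch-Pagan statistic (n times the R-squared of a regression of squared residuals on the predictor) in base R, using R's built-in `cars` dataset as stand-in data because the schools data is not loaded here.

```r
# Illustrative data: R's built-in `cars` dataset, not the schools data.
fit <- lm(dist ~ speed, data = cars)

# Auxiliary regression of the squared residuals on the predictor.
cars_aux <- data.frame(sq_resid = residuals(fit)^2, speed = cars$speed)
aux <- lm(sq_resid ~ speed, data = cars_aux)

n <- nrow(cars_aux)
lm_stat <- n * summary(aux)$r.squared                    # studentized (Koenker) statistic
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)   # one predictor, so df = 1

c(statistic = lm_stat, p.value = p_value)
```

A small p-value would suggest that the residual variance changes with the predictor. For this report, the same steps would be applied to `public.score.income` with median household income as the predictor.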
all.score.income.type<-lm(all.schools$Average.Total.Score~all.schools$Median.HH.Income+all.schools$School.Type)
summary(all.score.income.type)
##
## Call:
## lm(formula = all.schools$Average.Total.Score ~ all.schools$Median.HH.Income +
## all.schools$School.Type)
##
## Residuals:
## Min 1Q Median 3Q Max
## -270.93 -43.06 1.28 44.61 361.66
##
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)                       840.68480   21.87742  38.427  < 2e-16 ***
## all.schools$Median.HH.Income        0.23792    0.01281  18.574  < 2e-16 ***
## all.schools$School.TypePrivate    188.97737   25.71386   7.349 9.91e-13 ***
## all.schools$School.TypePublic      46.80933   21.67787   2.159 0.031371 *
## all.schools$School.TypeVocational  92.95837   25.20654   3.688 0.000255 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 88.2 on 436 degrees of freedom
## Multiple R-squared: 0.5491, Adjusted R-squared: 0.5449
## F-statistic: 132.7 on 4 and 436 DF, p-value: < 2.2e-16
# all.score.z.income.type<-lm(all.schools$z.total.score~all.schools$z.hh.income+all.schools$School.Type)
# summary(all.score.z.income.type)
The coefficients for the school-type indicators are all statistically significant at the 5% level: Private and Vocational have very low p-values, while Public has a p-value of 0.031. The baseline is school type = Charter, so each coefficient is the expected difference in average SAT score relative to a charter school with the same median household income. Adding the categorical variable as an independent variable on top of the numeric independent variable increases the adjusted R-squared from 0.44 to 0.54, i.e. the new model explains about 54% of the variability in SAT scores.
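To make the role of the Charter baseline concrete, the fitted coefficients can be used to compare predicted scores across school types at the same income. The coefficients are copied from the summary above; the helper `predict_score` and the $80,000 income are hypothetical choices for illustration.

```r
# Coefficients copied from the model summary above; Charter is the baseline.
intercept <- 840.68480
slope_per_100 <- 0.23792   # SAT points per $100 of median household income
type_offset <- c(Charter = 0, Private = 188.97737,
                 Public = 46.80933, Vocational = 92.95837)

# Hypothetical helper: predicted average SAT score by income and school type.
predict_score <- function(income_dollars, school_type) {
  unname(intercept + slope_per_100 * (income_dollars / 100) + type_offset[school_type])
}

# Predicted scores at a median household income of $80,000, one per school type.
sapply(names(type_offset), function(type) predict_score(80000, type))
```

At this income the predictions are about 1031 for Charter, 1078 for Public, 1124 for Vocational and 1220 for Private. These are associations within this sample, not causal effects of school type.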
Based on the above research, it can be concluded that there is indeed an association between total SAT scores and median household income and school type. Since this is not an experiment, we cannot assert any causality between these variables. It is likely that there are confounding variables at work: for example, a high household income alone would not automatically result in higher SAT scores. It is possible that both household income and SAT scores are in turn dependent on a common underlying variable such as the parents’ level of education and career success. Highly educated and successful parents are likely to create an environment that allows their children to thrive, and to guide them in how to organize their studies and work hard.
Ideas for possible future research would include finding additional variables that can help explain the remaining 46% of the variability in SAT scores. Also extending the study to other countries may be a worthwhile exercise though it would warrant finding a similar metric such as SAT.
https://blog.prepscholar.com/sat-score-range
https://www.state.nj.us/education/data/fact.htm
https://patch.com/new-jersey/tomsriver/new-jersey-sat-scores-released-every-high-school-ranked
https://patch.com/new-jersey/westfield/every-nj-school-graded-new-state-report-where-do-you-rank
https://rc.doe.state.nj.us/reportsdatabase.aspx
https://en.wikipedia.org/wiki/List_of_New_Jersey_locations_by_per_capita_income
https://www.privateschoolreview.com/sat-score-stats/new-jersey
https://www.nj.com/education/2018/03/heres_how_every_nj_high_school_fared_on_the_sat.html