On the relationship between SAT Scores, household incomes and school type

Introduction: What is your research question? Why do you care? Why should others care?

The SAT is a standardized test widely used for college admissions in the United States. Since its debut by the College Board in 1926, its name and scoring have changed several times: originally called the Scholastic Aptitude Test, it was later called the Scholastic Assessment Test, then the SAT I: Reasoning Test, then the SAT Reasoning Test, and now simply the SAT. The SAT is wholly owned, developed, and published by the College Board, a private, non-profit organization in the United States. It is administered on behalf of the College Board by the Educational Testing Service, which until recently developed the SAT as well. The test is intended to assess students’ readiness for college. The SAT was originally designed not to be aligned with high school curricula, but several adjustments were made for the version introduced in 2016 to align it with the Common Core standards followed in high schools. The current SAT takes three hours to finish, plus 50 minutes for the SAT with essay. Scores on the SAT range from 400 to 1600, combining test results from two 800-point sections: Math, and Evidence-Based Reading and Writing.

The SAT continues to be a significant component of the application process used by US universities to assess student qualifications and make decisions regarding offers of admission. In an ideal world, the SAT would be a true indicator of a student’s academic potential and aptitude and would help determine whether the student and the university are a good fit for each other. Given this, it would be natural to expect that SAT scores reflect just the student’s capabilities, without being affected by other demographic or socio-economic factors.

In this study, I attempt to assess whether that is indeed the case, or whether factors such as household income influence students’ SAT scores. If there were a causal relationship between household income and SAT scores, it would run contrary to the efforts to make college admissions a fair process. If SAT scores were found to be demonstrably higher for affluent students, it would provide evidence that the results are not based on merit alone, and that some groups of students have a distinct advantage over others.

While the US school system is dominated by public schools, these co-exist with other types of schools such as private schools, charter schools and vocational or technical schools. These are the main choices available to students who don’t have special needs. Public schools are run by individual school districts that receive state funding and follow common standards, whereas private schools are owned and operated by private entities that, unlike public schools, charge fees. Charter schools are run by private entities but require state approval and receive public funds from the school system, albeit to a lesser extent than public schools. They are typically formed to serve a specific “charter” or objective in specific districts, in order to customize the education process for segments of students who might otherwise be constrained within a “one size fits all” model. Lastly, vocational or technical schools have been started in some towns within the public school system to provide vocational or technical skills not commonly taught in regular schools. This category also includes “magnet” schools that place additional emphasis on STEM (Science, Technology, Engineering and Mathematics) skills.

I also attempt to analyze whether the type of school has a bearing on the students’ SAT scores as well. If indeed there was a causal relationship between school type and SAT scores, it would give school administrators something to think about in terms of creating the “right” type of environment and infrastructure that allows students to fulfill their potential.

I restrict my analysis to my home state i.e. New Jersey (NJ).

Issues such as whether SAT scores are the correct metric to assess student potential, or whether they are indicators of performance in freshman year, are not studied here. These could be areas for future research.

To summarize, my research question is: “Can SAT scores be predicted by household income (at the school district level) and school type (public or vocational or charter or private)?” Stated differently, is there a statistically significant association between household income, school type and SAT scores?

Null Hypothesis: The coefficients of all the independent variables (household income and school type) are 0, i.e. there is no association between SAT score and either household income or school type.

Alternative Hypothesis: At least one of the coefficients (household income or school type) is not 0, i.e. there is an association between SAT score and household income, school type, or both.
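
The two hypotheses can be written against an explicit regression model. The following is only a sketch of one way to formalize them: the symbols (the betas, the gammas and the indicator notation) are introduced here for illustration and are not part of the original write-up, and Charter is assumed to be the reference school type.

```latex
\begin{aligned}
\text{SAT}_i &= \beta_0 + \beta_1\,\text{Income}_i
  + \gamma_{\text{Priv}}\,\mathbb{1}[\text{Private}_i]
  + \gamma_{\text{Pub}}\,\mathbb{1}[\text{Public}_i]
  + \gamma_{\text{Voc}}\,\mathbb{1}[\text{Vocational}_i]
  + \varepsilon_i \\
H_0 &: \beta_1 = \gamma_{\text{Priv}} = \gamma_{\text{Pub}} = \gamma_{\text{Voc}} = 0 \\
H_A &: \text{at least one of } \beta_1,\ \gamma_{\text{Priv}},\ \gamma_{\text{Pub}},\ \gamma_{\text{Voc}} \text{ is nonzero}
\end{aligned}
```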

Data: Write about the data from your proposal in text form. Address the following points:

For my study, I obtained data from public sources on the Internet for the 2016-17 and 2018-19 academic years. I combine data from multiple sources to come up with a dataset for this project, as listed below:

For SAT scores:

I downloaded information from NJ’s Department of Education - Performance Report for the 2016-17 academic year. This dataset provides SAT scores for Math, Reading & Writing and Total_Average, by high school.
I used information from the schools metadata to label the applicable schools as Vocational or Charter schools. I labelled the remaining schools as Public schools.
I separately searched for and found a dataset on the Internet showing the Average Total SAT score (not broken out into Math and Reading/Writing sections) for private schools in NJ, for the 2018-19 academic year. While I understand that combining data from two different academic years is not the ideal approach, the paucity of data about private schools (which are under no compulsion to publish this information) makes any type of combined analysis difficult. So I make the assumption that SAT scores for a given “type” of school are very strongly correlated year-over-year. Typically, scores in two consecutive years would not show a significant idiosyncratic deviation and would largely be driven by underlying common infrastructure such as teachers, equipment, facilities etc.

For median household income, I downloaded data from the US Census Bureau 5-year survey ending in 2017, which provides median household income estimates by school district name. For private schools, I looked up median household income by zip code on the Internet, since private schools are not affiliated with public school districts.

Data collection: Describe how the data were collected.

Data has been collected via a combination of surveys as described below:

1) Average Total SAT scores: from the school performance reports for 2016-17 published by the NJ State Department of Education (website: https://rc.doe.state.nj.us/ReportsDatabase1617.aspx), and, for private schools, from https://www.privateschoolreview.com/sat-score-stats/new-jersey
2) Median household income: as collected in the American Community Survey (ACS) census data for 2017, available on the following website: https://factfinder.census.gov/faces/nav/jsf/pages/download_center.xhtml
3) Type of school: This is a categorical variable that indicates whether the school is a charter school, vocational or technical school, private school, or a public school.

I had to perform certain data cleansing steps (converting to upper case, removing extra spaces, removing redundant strings from school and district names, expanding abbreviated names, aligning spellings such as “Boro” and “Borough” to make them consistent, looking up zip codes, etc.) to be able to cross-reference data across 1) and 2) above. Even after doing that, there were a few cases where the SAT scores were available but the median income was not, and vice versa. I excluded all such cases from my data set. As a result, my usable data was whittled down to 441 records.
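
The cleansing and cross-referencing steps can be sketched roughly as follows. This is a minimal base R illustration, not the script that was actually used: the helper `normalize_name`, the "TWP" abbreviation rule and the "Unmatched City" row are hypothetical, while the two matched rows reuse values that appear in the data preview further down.

```r
# Hypothetical helper: bring district names from different sources to one spelling
normalize_name <- function(x) {
  x <- toupper(x)                                       # convert to upper case
  x <- gsub("\\bBORO\\b", "BOROUGH", x, perl = TRUE)    # align "Boro" with "Borough"
  x <- gsub("\\bTWP\\b\\.?", "TOWNSHIP", x, perl = TRUE) # expand an abbreviation (assumed rule)
  x <- gsub("\\s+", " ", x)                             # collapse repeated spaces
  trimws(x)                                             # drop leading/trailing spaces
}

scores <- data.frame(district = c("Audubon  Boro", "barnegat twp.", "Unmatched City"),
                     sat = c(1087, 1104, 1000),
                     stringsAsFactors = FALSE)
income <- data.frame(district = c("AUDUBON BOROUGH", "BARNEGAT TOWNSHIP"),
                     median.income = c(75136, 69877),
                     stringsAsFactors = FALSE)

scores$district <- normalize_name(scores$district)
income$district <- normalize_name(income$district)

# merge() performs an inner join by default, so rows without a match
# on both sides are dropped, mirroring the exclusion described above
matched <- merge(scores, income, by = "district")
print(matched)
```

Because the join is an inner join, any school with a score but no income figure (or the reverse) simply falls out of `matched`.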

I combined all the above data into one common CSV file, which I have posted on GitHub.

Cases: What are the cases? (Remember: case = units of observation or units of experiment)

Each case here represents an individual high school. The median household income was: i) based on the school district name, so if a district has more than 1 high school, then they would all have the same value for median household income, or ii) based on zip code for private schools.

Variables: What are the two variables you will be studying? State the type of each variable.

Average Total SAT scores for each school: This is a numeric variable. Individual composite scores range from 400 to 1600 in 10-point increments, and the school-level averages are reported as integers in the same range, so it can be treated as a discrete numeric variable.
Median household income for school district or zip code: This is a numeric variable that can take on a wide range of positive integer values. It can be treated as a continuous variable.
Type of school: This is a categorical variable denoting Type of School: Charter, Vocational, Public and Private. This has been coded with values 1,2,3 and 4 respectively.
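
As a small illustration of why the numeric codes should be handled as a factor rather than as numbers, here is a sketch in base R. The vector `type.label` holds made-up example codes, not rows from the dataset; only the code-to-label mapping comes from the description above.

```r
# Codes as described above: 1 = Charter, 2 = Vocational, 3 = Public, 4 = Private
type.label <- c(1, 3, 2, 3, 4)   # illustrative values only

school.type <- factor(type.label,
                      levels = 1:4,
                      labels = c("Charter", "Vocational", "Public", "Private"))

print(levels(school.type))
print(table(school.type))

# A 4-level factor enters a regression as 3 indicator (dummy) columns;
# the first level (Charter) acts as the baseline
print(contrasts(school.type))
```

Leaving the codes as plain numbers would instead impose an ordering and equal spacing between school types, which has no meaning here.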

Type of study: What is the type of study, observational or an experiment? Explain how you’ve arrived at your conclusion using information on the sampling and/or experimental design.

This is an observational study. The data was collected via survey methods based on the information provided on the websites listed in the Data Collection section above. There was no experimental design - so no separation between control groups and treatment groups.

Scope of inference - generalizability: Identify the population of interest, and whether the findings from this analysis can be generalized to that population, or, if not, a subsection of that population. Explain why or why not. Also discuss any potential sources of bias that might prevent generalizability.

The population of interest is high school students in the state of New Jersey, together with their household incomes and the type of school they attend. The findings from this analysis can be generalized to that population because, except for a few cases where data was not available, the sample covers a majority of the population in NJ.

The analysis could potentially be generalized to the population of all high school seniors in the USA to some extent, if we assume that, on average, household economics and school curricula are largely in sync across the country. Of course, states differ in per capita income and economic conditions, and their schools may not follow exactly the same curriculum. Generalizing across time (scores in different years) is also not completely unreasonable, given the slow-moving nature of the variables involved: SAT scores for the same high school can be expected to be very strongly correlated year-over-year. The SAT may be somewhat more or less difficult in some years, but by and large the intention is to keep the level of difficulty consistent so that scores are comparable.

Generalizing the research outside the USA would not be feasible, since the SAT is used only by US universities. Other countries would have their own standardized tests and education structures.

Scope of inference - causality: Can these data be used to establish causal links between the variables of interest? Explain why or why not.

There was no experimental design - so no separation between control groups and treatment groups. Given the observational nature of the study, causal relationships cannot be inferred from it. We can only establish association.

Exploratory data analysis: Perform relevant descriptive statistics, including summary statistics and visualization of the data. Also address what the exploratory data analysis suggests about your research question.

I perform the following exploratory data analysis on the dataset.

# Load required libraries

library(readxl)
library(ggplot2)
library(RCurl)
library(R.utils)
library(xlsx)
library(httr)
library(dplyr)
library(magrittr)
library(tidyverse)
library(kableExtra)
# Create the input parameter based on the URL for the input data file

infile<-getURL("https://raw.githubusercontent.com/Jagdish16/jagdish_r_repo/master/DATA606/Project/NJ_Schools_Data1.csv")

# Read the input data file into a tibble

all.schools<-as_tibble(read.csv(text=infile, header=TRUE))
num.all.schools<-count(all.schools)

paste("There are a total of ",num.all.schools," public and private schools in the sample")

## [1] "There are a total of  441  public and private schools in the sample"

# The CSV starts with a byte-order mark (BOM), which ends up in the first column name; rename that column

all.schools%<>%rename(District.Code = X.U.FEFF.District.Code)
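
An alternative to renaming the column afterwards is to let R strip the byte-order mark while reading. The sketch below assumes the data is read from a local file rather than from the text returned by getURL, so it is not a drop-in replacement for the code above; the temporary file and its two columns are made up for the demonstration.

```r
# Write a tiny CSV that starts with a UTF-8 byte-order mark (bytes EF BB BF)
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("District.Code,Math\n6010,448\n")), tmp)

# fileEncoding = "UTF-8-BOM" asks R to drop the mark while reading,
# so the first column name comes through clean
clean <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
print(names(clean))

unlink(tmp)
```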

# Set correct data types for different columns
all.schools$Type.Label <- as.factor(all.schools$Type.Label)
all.schools$District.Code <- as.factor(all.schools$District.Code)
all.schools$School.Code <- as.factor(all.schools$School.Code)
all.schools$County.Code <- as.factor(all.schools$County.Code)
all.schools$School.Type <- as.factor(all.schools$School.Type)
all.schools$School.Name <- as.character(all.schools$School.Name)
all.schools$County.Name <- as.character(all.schools$County.Name)
all.schools$District.Name <- as.character(all.schools$District.Name)
all.schools$City <- as.character(all.schools$City)
all.schools$Zip.Code <- as.character(all.schools$Zip.Code)

# Examine the structure of the data frame
#str(all.schools)

# View the first few records
head(all.schools)%>%kable()

District.Code	School.Code	County.Code	County.Name	District.Name	School.Name	City	Zip.Code	School.Type	Type.Label	Math	Reading.and.Writing	Average.Total.Score	Median.HH.Income
6010	910	80	MONMOUTH	ACADEMY CHARTER HIGH SCHOOL	ACADEMY CHARTER HIGH SCHOOL	LAKE COMO	7719	Charter	1	448	457	905	92414
6032	901	80	MIDDLESEX	ACADEMY FOR URBAN LEADERSHIP CHARTER SCHOOL	ACADEMY FOR URBAN LEADERSHIP CHARTER SCHOOL	Perth Amboy	8861	Charter	1	470	469	939	51389
110	10	1	ATLANTIC	ATLANTIC CITY	ATLANTIC CITY HIGH SCHOOL	Atlantic City	8401	Public	3	524	526	1050	26006
120	10	1	ATLANTIC	ATLANTIC COUNTY VOCATIONAL	ATLANTIC COUNTY INSTITUTE OF TECHNOLOGY	Mays Landing	8330	Vocational	2	528	530	1058	64266
150	10	7	CAMDEN	AUDUBON BOROUGH	AUDUBON JUNIOR/SENIOR HIGH SCHOOL	Audubon	8106	Public	3	536	551	1087	75136
185	30	29	OCEAN	BARNEGAT TOWNSHIP	BARNEGAT HIGH SCHOOL	Barnegat	8005	Public	3	556	548	1104	69877

# Create a separate data frame for all schools except Private schools (since the private school data pertains to a different year). Note that "public.schools" therefore also includes Charter and Vocational schools

public.schools<-all.schools%>%filter(School.Type!="Private")

num.public.schools<-count(public.schools)

paste("There are a total of ",num.public.schools," public schools in the sample.")

## [1] "There are a total of  402  public schools in the sample."

#View(public.schools)

# Create vectors for the 3 variables of interest
total.score<-all.schools$Average.Total.Score
hh.income<-all.schools$Median.HH.Income
school.type<-all.schools$Type.Label

pub.total.score<-public.schools$Average.Total.Score
pub.hh.income<-public.schools$Median.HH.Income


# Compute summary stats for the numeric variables of interest
summary(total.score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     795    1016    1106    1102    1178    1502

summary(hh.income)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20529   62739   81011   83896  106039  202862

# Compute and report the mean, standard deviation, minimum and maximum for the numeric variables of interest

mean.total.score<-mean(total.score)
sd.total.score<-sd(total.score)
mean.hh.income<-mean(hh.income)
sd.hh.income<-sd(hh.income)

cat("The mean SAT score across all NJ high schools for 2016-17 is ",mean(total.score),"\n")

## The mean SAT score across all NJ high schools for 2016-17 is  1102.259

cat("The standard deviation of SAT scores across all NJ high schools for 2016-17 is ",sd(total.score),"\n")

## The standard deviation of SAT scores across all NJ high schools for 2016-17 is  130.7444

cat("The minimum and maximum SAT scores across all NJ high schools for 2016-17 are",min(total.score),"and ", max(total.score),"\n")

## The minimum and maximum SAT scores across all NJ high schools for 2016-17 are 795 and  1502

cat("The average median household income across all NJ school districts is ",mean(hh.income),"\n")

## The average median household income across all NJ school districts is  83895.85

cat("The standard deviation of median household income across all NJ school districts is ",sd(hh.income),"\n")

## The standard deviation of median household income across all NJ school districts is  33649.71

cat("The minimum and maximum median household incomes across all NJ school districts and zip codes for 2016-17 are",min(hh.income),"and ", max(hh.income),"\n")

## The minimum and maximum median household incomes across all NJ school districts and zip codes for 2016-17 are 20529 and  202862

Exploratory Data Analysis

# Examine a histogram of the 2 numeric variables

hist(total.score, probability = TRUE, breaks = 10, main="Average Total Score", xlab="SAT Scores")
x <- 400:1600
y <- dnorm(x = x, mean = mean.total.score, sd = sd.total.score)
lines(x = x, y = y, col = "blue")

boxplot(total.score)

qqnorm(total.score)

hist(hh.income, probability = TRUE, breaks = 10, main="Median Household Income in US Dollars", xlab="Household Income")
x <- 0:230000
y <- dnorm(x = x, mean = mean.hh.income, sd = sd.hh.income)
lines(x = x, y = y, col = "red")

boxplot(hh.income)

qqnorm(hh.income)

From the histogram, it can be seen that the distribution for total score seems nearly normal. The boxplot does show some outliers outside the range. The qqnorm plot shows that the distribution for total score seems nearly normal.

From the histogram and the qqnorm plot above, it can be seen that the distribution for median household income does show a little right-skewness which can be expected because of some really high income values for a few districts/towns. The boxplot does show outliers outside the range.
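
The visual impression of right-skewness can be backed by a number. The sketch below uses a simple moment-based skewness coefficient written in base R; the function `skewness` and both example vectors are illustrative assumptions, not part of the original analysis. A value near 0 indicates symmetry and a positive value indicates a longer right tail.

```r
# Moment-based sample skewness: third central moment over the
# second central moment raised to the power 1.5
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

symmetric  <- c(1, 2, 3, 4, 5)
right.tail <- c(20, 60, 80, 85, 105, 200)  # made-up incomes in thousands, with one large value

print(skewness(symmetric))   # symmetric data
print(skewness(right.tail))  # long right tail
```

Applied to `hh.income`, the same function would give a single figure to accompany the histogram.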

# Check for a linear relationship visually

# Create a scatter plot of the response (SAT score, y-axis) against the numeric predictor (household income, x-axis) for all schools. Color indicates school type, the categorical variable that is the second independent variable

ggplot(all.schools, aes(x = Median.HH.Income, y = Average.Total.Score)) +
 geom_point(aes(color = School.Type)) + scale_color_brewer(palette="Dark2")

# Create the same scatter plot for public schools only (i.e. all schools except Private)

ggplot(public.schools, aes(x = Median.HH.Income, y = Average.Total.Score)) +
 geom_point(aes(color = School.Type)) + scale_color_brewer(palette="Dark2")

The scatter plot seems to indicate a positive-sloping relationship between the household income and SAT scores for all schools as well as public schools.

# Examine the count for each type of the categorical variable

table(all.schools$School.Type)

## 
##    Charter    Private     Public Vocational 
##         18         39        341         43

It can be seen that a majority of the observations pertain to public schools. The number of observations for each of the other types is much lower.

# Calculate the z-score for both total score and household income for all schools and public schools

mean.total.score<-mean(total.score)
sd.total.score<-sd(total.score)
z.total.score<-((total.score-mean.total.score)/sd.total.score)
qqnorm(z.total.score)

mean.hh.income<-mean(hh.income)
sd.hh.income<-sd(hh.income)
z.hh.income<-((hh.income-mean.hh.income)/sd.hh.income)
qqnorm(z.hh.income)

pub.mean.total.score<-mean(pub.total.score)
pub.sd.total.score<-sd(pub.total.score)
pub.z.total.score<-((pub.total.score-pub.mean.total.score)/pub.sd.total.score)
qqnorm(pub.z.total.score)

pub.mean.hh.income<-mean(pub.hh.income)
pub.sd.hh.income<-sd(pub.hh.income)
pub.z.hh.income<-((pub.hh.income-pub.mean.hh.income)/pub.sd.hh.income)
qqnorm(pub.z.hh.income)

From the qqnorm plot of the z-scores (scaled total score) above, it can be seen that the distribution for total score seems nearly normal.

From the qqnorm plot of the z-scores (scaled hh income) above, it can be seen that the distribution for household income is a little right-skewed.
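
The manual z-score computation can be cross-checked against base R's scale(). The sketch below uses only the six scores shown in the data preview above, so the numbers are illustrative rather than results for the full dataset.

```r
# Six Average.Total.Score values from the data preview
total.score <- c(905, 939, 1050, 1058, 1087, 1104)

# Manual z-score, as in the analysis
z.manual <- (total.score - mean(total.score)) / sd(total.score)

# scale() centers on the mean and divides by the sample standard deviation
z.scaled <- as.numeric(scale(total.score))

print(all.equal(z.manual, z.scaled))
print(c(mean = mean(z.manual), sd = sd(z.manual)))
```

Standardizing only shifts and rescales the values, so a qqnorm plot of the z-scores has the same shape as the qqnorm plot of the raw variable; only the axis units change.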

Relevant summary statistics

Before fitting the linear model, rescale the median household income to hundreds of dollars, so that the slope is expressed per $100 of income rather than per dollar.

# Rescale hh income to hundreds of dollars, and add the z-scores for the 2 numeric variables to the data frame

all.schools$Median.HH.Income<-all.schools$Median.HH.Income/100
head(all.schools)

## # A tibble: 6 x 14
##   District.Code School.Code County.Code County.Name District.Name
##   <fct>         <fct>       <fct>       <chr>       <chr>        
## 1 6010          910         80          MONMOUTH    ACADEMY CHAR~
## 2 6032          901         80          MIDDLESEX   ACADEMY FOR ~
## 3 110           10          1           ATLANTIC    ATLANTIC CITY
## 4 120           10          1           ATLANTIC    ATLANTIC COU~
## 5 150           10          7           CAMDEN      AUDUBON BORO~
## 6 185           30          29          OCEAN       BARNEGAT TOW~
## # ... with 9 more variables: School.Name <chr>, City <chr>,
## #   Zip.Code <chr>, School.Type <fct>, Type.Label <fct>, Math <int>,
## #   Reading.and.Writing <int>, Average.Total.Score <int>,
## #   Median.HH.Income <dbl>

all.schools$z.total.score<-z.total.score
all.schools$z.hh.income<-z.hh.income

public.schools$Median.HH.Income<-public.schools$Median.HH.Income/100
head(public.schools)

## # A tibble: 6 x 14
##   District.Code School.Code County.Code County.Name District.Name
##   <fct>         <fct>       <fct>       <chr>       <chr>        
## 1 6010          910         80          MONMOUTH    ACADEMY CHAR~
## 2 6032          901         80          MIDDLESEX   ACADEMY FOR ~
## 3 110           10          1           ATLANTIC    ATLANTIC CITY
## 4 120           10          1           ATLANTIC    ATLANTIC COU~
## 5 150           10          7           CAMDEN      AUDUBON BORO~
## 6 185           30          29          OCEAN       BARNEGAT TOW~
## # ... with 9 more variables: School.Name <chr>, City <chr>,
## #   Zip.Code <chr>, School.Type <fct>, Type.Label <fct>, Math <int>,
## #   Reading.and.Writing <int>, Average.Total.Score <int>,
## #   Median.HH.Income <dbl>

public.schools$pub.z.total.score<-pub.z.total.score
public.schools$pub.z.hh.income<-pub.z.hh.income

Calculate the correlation between the total score and the household income, for all schools and for public schools only

cor(all.schools$Average.Total.Score, all.schools$Median.HH.Income)

## [1] 0.6657101

cor(public.schools$Average.Total.Score, public.schools$Median.HH.Income)

## [1] 0.7125073

It can be seen that the correlation between the total score and household income is 0.67 for all schools, and 0.71 for public schools only.

# Run the linear model fitting for all schools and public schools only and inspect the model statistics

all.score.income<-lm(all.schools$Average.Total.Score~all.schools$Median.HH.Income)
summary(all.score.income)

## 
## Call:
## lm(formula = all.schools$Average.Total.Score ~ all.schools$Median.HH.Income)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -301.10  -52.22  -10.12   35.70  392.20 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  885.25466   12.50649   70.78   <2e-16 ***
## all.schools$Median.HH.Income   0.25866    0.01384   18.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 97.67 on 439 degrees of freedom
## Multiple R-squared:  0.4432, Adjusted R-squared:  0.4419 
## F-statistic: 349.4 on 1 and 439 DF,  p-value: < 2.2e-16

public.score.income<-lm(public.schools$Average.Total.Score~public.schools$Median.HH.Income)
summary(public.score.income)

## 
## Call:
## lm(formula = public.schools$Average.Total.Score ~ public.schools$Median.HH.Income)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -290.44  -45.46   -2.20   38.45  404.18 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     870.54870   11.51421   75.61   <2e-16 ***
## public.schools$Median.HH.Income   0.26183    0.01289   20.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 86.83 on 400 degrees of freedom
## Multiple R-squared:  0.5077, Adjusted R-squared:  0.5064 
## F-statistic: 412.5 on 1 and 400 DF,  p-value: < 2.2e-16

For the public schools, the intercept is about 870.5. This can be ignored for the purposes of our analysis, since it only sets the height of the fitted line. Given that the lowest median household income is about $20,000, a household income of 0 lies outside the observed range and does not provide any meaningful insight into the linear relationship.

The slope for the household income variable is 0.26. It indicates that, on average, a $100 increase in median household income is associated with a 0.26 point increase in the average total SAT score, i.e. about 26 points per $10,000. The very low p-value indicates that the household income coefficient is statistically significant.

The adjusted R-squared is 0.51. This implies that household income explains about 51% of the variance in SAT scores across these schools.
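
To make the slope concrete, the fitted line for public schools can be evaluated at two income levels. The sketch below copies the coefficients from the model output above; the helper `predict.score` and the chosen income levels are illustrative assumptions. It also checks that, for a single-predictor model, the multiple R-squared equals the squared correlation reported earlier.

```r
# Coefficients copied from summary(public.score.income);
# the model was fitted with income in hundreds of dollars
intercept <- 870.54870
slope     <- 0.26183

# Hypothetical helper: takes income in dollars and converts to hundreds
predict.score <- function(income.dollars) {
  intercept + slope * (income.dollars / 100)
}

low  <- predict.score(50000)
high <- predict.score(100000)
print(c(low = low, high = high, difference = high - low))

# For simple linear regression, R-squared is the squared correlation
r.public <- 0.7125073
print(round(r.public^2, 4))
```

These are predictions for a district's average score, not for an individual student, and they describe an association rather than a causal effect.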

Check conditions for linear regression model to apply:

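# Residuals in observation order and against fitted values, for all schools.
# abline(h = 0) draws a horizontal reference line at 0 on the current plot
#plot(all.score.income)
plot(all.score.income$residuals)
abline(h = 0)

plot(all.score.income$fitted.values, all.score.income$residuals)
abline(h = 0)

# Same diagnostics for public schools only
plot(public.score.income$residuals)
abline(h = 0)

plot(public.score.income$fitted.values, public.score.income$residuals)
abline(h = 0)

hist(public.score.income$residuals)

qqnorm(public.score.income$residuals)

#qqline(public.score.income$residuals)
#plot(public.score.income)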

Linear relationship between the independent variable and the response variable: The residuals plotted against the fitted values appear evenly distributed around 0, with no obvious curvature.
Nearly normal residuals: The histogram and the qqnorm plot of the residuals show some skewness, driven by a few outliers in the data. While the distribution of residuals is centered at 0 in the histogram, the spread is not symmetric; it is right-skewed.
Constant variability (homoscedasticity): The variability of points around the least squares line, i.e. the variability of residuals around the “0-line”, is roughly constant, although it seems to be somewhat lower at higher score levels.
Independence between observations: Each observation is a different school, so the observations are assumed to be independent. One caveat is that schools in the same district share the same median household income value, so they are not fully independent of each other.
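
The visual checks can be complemented with simple numerical ones. The sketch below runs on synthetic data generated to loosely resemble the fitted model (the seed, sample size and noise level are assumptions), so its results say nothing about the real dataset. It uses the Shapiro-Wilk test for normality of the residuals and a Spearman correlation between absolute residuals and fitted values as a rough check for non-constant variance; neither was part of the original analysis.

```r
set.seed(1)

# Synthetic data: income in hundreds of dollars, score roughly linear in income
income <- runif(200, min = 200, max = 2000)
score  <- 870 + 0.26 * income + rnorm(200, sd = 85)

fit <- lm(score ~ income)
res <- resid(fit)

# Normality of residuals: a small p-value suggests departure from normality
normality <- shapiro.test(res)
print(normality$p.value)

# Constant variance: if the spread of residuals grows or shrinks with the
# fitted values, |residual| will be correlated with the fitted values
spread <- cor.test(abs(res), fitted(fit), method = "spearman")
print(spread$estimate)
```

On the real data, the same two calls could be applied to `public.score.income$residuals` and `public.score.income$fitted.values`.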

Add school type to the linear model

all.score.income.type<-lm(all.schools$Average.Total.Score~all.schools$Median.HH.Income+all.schools$School.Type)
summary(all.score.income.type)

## 
## Call:
## lm(formula = all.schools$Average.Total.Score ~ all.schools$Median.HH.Income + 
##     all.schools$School.Type)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -270.93  -43.06    1.28   44.61  361.66 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       840.68480   21.87742  38.427  < 2e-16 ***
## all.schools$Median.HH.Income        0.23792    0.01281  18.574  < 2e-16 ***
## all.schools$School.TypePrivate    188.97737   25.71386   7.349 9.91e-13 ***
## all.schools$School.TypePublic      46.80933   21.67787   2.159 0.031371 *  
## all.schools$School.TypeVocational  92.95837   25.20654   3.688 0.000255 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88.2 on 436 degrees of freedom
## Multiple R-squared:  0.5491, Adjusted R-squared:  0.5449 
## F-statistic: 132.7 on 4 and 436 DF,  p-value: < 2.2e-16

# all.score.z.income.type<-lm(all.schools$z.total.score~all.schools$z.hh.income+all.schools$School.Type)
# summary(all.score.z.income.type)

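The coefficients of the school type indicator variables are statistically significant given their low p-values (the Public coefficient only at the 5% level). The baseline is school type = Charter, so each coefficient is the expected difference in average SAT score relative to a charter school with the same median household income, e.g. about 189 points higher for private schools. Adding the categorical variable as an independent variable on top of the numeric independent variable increases the adjusted R-squared from 0.44 to 0.54, i.e. the new model explains about 54% of the variability in SAT scores.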
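
To see what the baseline coding means in practice, the fitted coefficients can be turned into a predicted average score for each school type at one fixed income level. The sketch below copies the coefficients from the model output above; the income level of $80,000 is an arbitrary choice close to the sample median, and the variable names are illustrative.

```r
# Coefficients copied from summary(all.score.income.type)
coefs <- c(intercept  = 840.68480,
           income     = 0.23792,     # per $100 of median household income
           Private    = 188.97737,
           Public     = 46.80933,
           Vocational = 92.95837)

income.hundreds <- 800   # $80,000, expressed in hundreds of dollars

# Charter is the baseline, so it gets no extra term
charter <- coefs[["intercept"]] + coefs[["income"]] * income.hundreds

predicted <- c(Charter    = charter,
               Private    = charter + coefs[["Private"]],
               Public     = charter + coefs[["Public"]],
               Vocational = charter + coefs[["Vocational"]])

print(round(predicted, 1))
```

The gaps between the four predictions are the same at every income level, because this model has no interaction between income and school type.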

Conclusion: Write a brief summary of your findings without repeating your statements from earlier. Also include a discussion of what you have learned about your research question and the data you collected. You may also want to include ideas for possible future research.

Based on the above research, it can be concluded that there is indeed an association between total SAT scores and both median household income and school type. Since this is not an experiment, we cannot assert any causality between these variables. It is likely that there are confounding variables at work: a high household income alone would not automatically result in higher SAT scores. It is possible that both household income and SAT scores in turn depend on a common underlying variable, such as the parents’ level of education and career success. Highly educated and successful parents are likely to create an environment that allows their children to thrive, and to guide them in how to organize their studies and work hard.

Ideas for possible future research would include finding additional variables that can help explain the remaining 46% of the variability in SAT scores. Also extending the study to other countries may be a worthwhile exercise though it would warrant finding a similar metric such as SAT.

References:

https://blog.prepscholar.com/sat-score-range

https://www.state.nj.us/education/data/fact.htm

https://patch.com/new-jersey/tomsriver/new-jersey-sat-scores-released-every-high-school-ranked

https://patch.com/new-jersey/westfield/every-nj-school-graded-new-state-report-where-do-you-rank

https://rc.doe.state.nj.us/reportsdatabase.aspx

https://en.wikipedia.org/wiki/List_of_New_Jersey_locations_by_per_capita_income

https://www.privateschoolreview.com/sat-score-stats/new-jersey

https://www.nj.com/education/2018/03/heres_how_every_nj_high_school_fared_on_the_sat.html