Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Can SAT scores be predicted by household income and availability of tutoring centers, at the school district level?

Cases

What are the cases, and how many are there?

I will need to combine data from multiple sources to come up with a dataset for this project.

Each case represents a school district in New Jersey. While the total number of school districts in NJ seems to be around 690, median household income data is available for only about 340 of them. I therefore plan to use only those roughly 340 school districts for which both average SAT scores and median household income are available. Building this dataset will require some data transformation and mapping.
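The restriction to districts present in both sources amounts to an inner join on district name. The following is a minimal sketch of that step, not the final pipeline: the district names, values, and helper functions (`normalize`, `join_on_district`) are hypothetical placeholders, and it assumes district names differ between sources only in case and spacing.

```python
# Placeholder records standing in for the two sources; not real data.
sat_by_district = {
    "District A": 1100,
    "District B": 1050,
    "District C": 1200,
}
income_by_district = {
    "district a ": 85000,
    "DISTRICT C": 120000,
    "District D": 64000,
}

def normalize(name):
    """Collapse case and whitespace so the same district matches across sources."""
    return " ".join(name.upper().split())

def join_on_district(sat, income):
    """Inner join: keep only districts that appear in both sources."""
    sat_n = {normalize(k): v for k, v in sat.items()}
    inc_n = {normalize(k): v for k, v in income.items()}
    shared = sorted(sat_n.keys() & inc_n.keys())
    return [
        {"district": d, "avg_sat": sat_n[d], "median_income": inc_n[d]}
        for d in shared
    ]

cases = join_on_district(sat_by_district, income_by_district)
```

Districts found in only one source (here, B and D) drop out, which mirrors the reduction from about 690 districts to about 340.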

Data for tutoring centers is not easily available in one place. For this, I plan to collect center data from the websites of the two to four most well-known tutoring franchises. However, these websites list centers by city and zip code, so I’ll need to find a way to map them to school districts. Also, it is hard to discern which centers focus only on high school and which cater to students across all grade levels. Lastly, online tutoring services are not bound by geography, which makes them much harder to incorporate in this analysis.
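One possible way to handle the zip-code-to-district mapping is a lookup table (a crosswalk) followed by a count per district. The sketch below is only illustrative: the crosswalk, franchise names, and zip codes are made-up placeholders, and it assumes each zip code belongs to exactly one district, which may not hold in practice.

```python
from collections import Counter

# Hypothetical crosswalk from zip code to school district (placeholder values).
zip_to_district = {
    "00001": "District A",
    "00002": "District A",
    "00003": "District C",
}

# Hypothetical center listings, one (franchise, zip code) pair per center.
centers = [
    ("Franchise X", "00001"),
    ("Franchise Y", "00002"),
    ("Franchise X", "00003"),
    ("Franchise Y", "99999"),  # zip code with no crosswalk entry
]

def count_centers_by_district(centers, crosswalk):
    """Count centers per district and collect listings that cannot be mapped."""
    counts = Counter()
    unmatched = []
    for franchise, zip_code in centers:
        district = crosswalk.get(zip_code)
        if district is None:
            unmatched.append((franchise, zip_code))
        else:
            counts[district] += 1
    return counts, unmatched

counts, unmatched = count_centers_by_district(centers, zip_to_district)
```

Keeping the unmatched listings separate makes it easier to review zip codes that need manual assignment rather than silently dropping them.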

Data collection

Describe the method of data collection.

The data comes from a combination of published sources and planned self-collection, as described below:

  1. Total SAT scores from the schools performance report 2017-18 published by the NJ State Department of Education.

  2. Median household income as collected in the American Community Survey (ACS) census data for 2017, available on the following website: https://factfinder.census.gov

The above two variables are both numeric: average SAT score and median household income are both stored as integers.

  3. Availability of tutoring centers within close proximity (say, within 10 miles): I plan to collect this by collating information from the individual franchise websites, mapping centers to school districts, and finally deriving a categorical variable taking values from 0 to 5 that indicates the presence of zero, one, or multiple tutoring centers.
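As a rough illustration of how the proximity-based variable might be derived, the sketch below counts centers within a radius of a district and caps the count at 5. It rests on several assumptions not fixed in this proposal: each district is represented by a single latitude/longitude point, distance is great-circle (haversine) distance, and values above 5 are collapsed into the top level. The function names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def tutoring_level(district_point, center_points, radius_miles=10, cap=5):
    """Number of centers within radius_miles of the district, capped at cap."""
    d_lat, d_lon = district_point
    nearby = sum(
        1 for lat, lon in center_points
        if haversine_miles(d_lat, d_lon, lat, lon) <= radius_miles
    )
    return min(nearby, cap)

# Placeholder coordinates: two centers close by, one roughly a degree of latitude away.
district = (40.0, -74.5)
center_points = [(40.0, -74.5), (40.05, -74.5), (41.0, -74.5)]
level = tutoring_level(district, center_points)
```

The 10-mile radius and the cap of 5 are parameters here so that either choice can be revisited once the real data is collected.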

Type of study

What type of study is this (observational/experiment)?

This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

  1. SAT scores from the schools performance report 2017-18 published by the NJ State Department of Education on the following website: https://rc.doe.state.nj.us/ReportsDatabase.aspx

  2. Median household income as collected in the American Community Survey (ACS) census data for 2017 as available on the following website: https://factfinder.census.gov

Response / Dependent Variable

What is the response variable, and what type is it (numerical/categorical)?

The response variable is the average total SAT score by school district; it is numerical.

Explanatory / Independent Variable

You should have two independent variables, one quantitative and one qualitative. What is the explanatory variable, and what type is it (numerical/categorical)?

The explanatory variables are: 1) Quantitative: median household income by school district. This is numerical. 2) Qualitative: presence of tutoring centers. This is a categorical variable coded with the numeric levels 0 to 5.
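One way these variables could enter a model is a multiple linear regression with the tutoring variable treated as categorical. This is only a sketch of a possible specification, assuming level 0 (no tutoring center) as the baseline; the symbols $\beta_0$, $\beta_1$, $\gamma_k$, and $\varepsilon_i$ are introduced here for illustration, and the actual modeling approach has not been decided in this proposal.

```latex
\text{SAT}_i = \beta_0 + \beta_1\,\text{income}_i
  + \sum_{k=1}^{5} \gamma_k\,\mathbf{1}[\text{tutoring}_i = k] + \varepsilon_i
```

Here $i$ indexes school districts, $\beta_1$ is the expected change in average SAT score per unit of median household income, and each $\gamma_k$ is the shift for tutoring level $k$ relative to level 0.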

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc).

I still need to collect and cleanse the data, so it is not yet ready for analysis.