DATA 606 Data Project Proposal

Data Preparation

# load libraries
library(psych)
library(dplyr)

# load data
df <- read.csv("C:/Users/weberr1/Desktop/CUNY/DATA 606/Project/massachusetts-public-schools-data/MA_Public_Schools_2017.csv", stringsAsFactors = FALSE)

# Select and rename the variables of interest; keep only schools with a reported college attendance rate
df <- df %>%
  select(highNeeds = X..High.Needs 
         ,econDisadv = X..Economically.Disadvantaged
         ,salary = Average.Salary
         ,expendPerPupil = Average.Expenditures.per.Pupil
         ,attendColl = X..Attending.College) %>%
  filter(!is.na(attendColl))
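
The filter above removes only rows that are missing attendColl, so the other selected columns can still contain NA values. A minimal sketch of a per-column missingness check follows. The data frame `toy` and its values are hypothetical, invented for illustration, and stand in for df.

```r
# Hypothetical toy data standing in for df; values are invented for illustration
toy <- data.frame(
  salary     = c(70000, NA, 82000, 65000),
  attendColl = c(80.5, 62.0, NA, 91.2)
)

# Keep only rows with a known response, mirroring the filter above
toy <- toy[!is.na(toy$attendColl), ]

# Count the missing values that remain in each column
naCounts <- colSums(is.na(toy))
print(naCounts)
```

Running the same `colSums(is.na(...))` call on df would show how many schools lack each explanatory variable before any model is fit.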

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

In Massachusetts public schools, is an increase in teacher salary associated with an increase in college attendance rates, controlling for average expenditure per student, percent of economically disadvantaged students, and percent of high needs students?

Cases

What are the cases, and how many are there?
Each case in this data set is a public school in the MA public school system. There were 344 schools in this data set with average college attendance information.

Data collection

Describe the method of data collection.
This data was downloaded from Kaggle.com. The data was sourced from Department of Education reports. In particular, the reports of interest were outlined in the associated data dictionary:

Teacher Salaries
Enrollment by Selected Population
Per Pupil Expenditure
Graduates Attending Higher Ed

The Massachusetts DOE site (http://profiles.doe.mass.edu/help/data.aspx) notes the following on the data utilized to create these reports: “Schools and Districts view, add, update and delete their own directory information to ensure that the information is as up-to-date and accurate as possible.”

Type of study

What type of study is this (observational/experiment)?
This study is observational as it utilizes retrospectively collected data.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.
https://www.kaggle.com/ndalziel/massachusetts-public-schools-data/version/1

Response

What is the response variable, and what type is it (numerical/categorical)?
The response variable is percent of students attending college. It is a numerical variable.

Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)?
The explanatory variable is average teacher salary, and the control variables are average expenditures per pupil, percent of economically disadvantaged students, and percent of high needs students. All four are numerical variables.

Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

Percent of students attending college

describe(df$attendColl)

##    vars   n  mean    sd median trimmed   mad  min max range  skew kurtosis
## X1    1 344 74.51 16.18   77.8   76.86 13.57 10.5 100  89.5 -1.39      1.9
##      se
## X1 0.87

hist(df$attendColl, breaks = 100)

Minimum rate of college attendance is 10.5%, maximum rate is 100%.

Mean of 74.51% with a standard deviation of 16.18%.

This data is left-skewed (skew of -1.39): rates cannot exceed 100%, and the median (77.8%) sits above the mean, with a long tail toward low attendance rates.
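
The se column in the describe() output above is the standard error of the mean, which is the standard deviation divided by the square root of the sample size. As a quick check, the sketch below recomputes it from the reported n and sd. The variable names are chosen here for illustration.

```r
# Values copied from the describe() output for attendColl
nAttend  <- 344
sdAttend <- 16.18

# Standard error of the mean: sd / sqrt(n)
seAttend <- sdAttend / sqrt(nAttend)
print(round(seAttend, 2))
```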

Average teacher salary

describe(df$salary)

##    vars   n     mean      sd median  trimmed     mad   min    max range
## X1    1 312 74361.44 8424.16  73706 74081.97 7826.65 53763 100731 46968
##    skew kurtosis     se
## X1 0.34    -0.29 476.92

hist(df$salary, breaks = 100)

Minimum average teacher salary of $53,763, maximum of $100,731. (Will investigate the source of this high maximum. Is it a school with very few teachers?)

Mean of $74,361.44 with a standard deviation of $8,424.16.

This data is approximately normally distributed.
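
One rough way to gauge how extreme the maximum average salary is: express it as a number of standard deviations above the mean, using the summary values reported above. This is only a quick screen under the assumption of a roughly normal distribution, not a formal outlier test, and the variable names are chosen here for illustration.

```r
# Values copied from the describe() output for salary
maxSalary  <- 100731
meanSalary <- 74361.44
sdSalary   <- 8424.16

# Distance of the maximum from the mean, in standard deviations
zMax <- (maxSalary - meanSalary) / sdSalary
print(round(zMax, 2))
```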

Plot of monetary variables vs. percent of students attending college

plot(df$attendColl, df$salary)

plot(df$attendColl, df$expendPerPupil)

Looking at these variables individually, there does not seem to be a relationship between average teacher salary (or average expenditure per pupil) and percent of students attending college.

Is it reasonable to include both average expenditure per pupil and average teacher salary as explanatory variables, since they are likely highly correlated? Expenditure per pupil was included as a way to control for wealthier districts.

plot(df$salary, df$expendPerPupil)
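
Beyond the scatterplot, the overlap between two explanatory variables can be quantified with a correlation coefficient and a variance inflation factor (VIF). The sketch below assumes a small made-up data frame, `toy`, whose values are invented and are not taken from the Massachusetts data. The VIF is computed by hand as 1 / (1 - R^2) from an auxiliary regression rather than with an add-on package.

```r
# Hypothetical toy data; values are invented for illustration only
toy <- data.frame(
  salary         = c(62000, 68000, 71000, 75000, 80000, 86000, 90000, NA),
  expendPerPupil = c(12500, 13800, 13100, 15200, 14900, 17000, 16400, 15000),
  econDisadv     = c(55, 42, 47, 30, 33, 18, 12, 25)
)

# Pairwise correlation, using only rows where both values are present
r <- cor(toy$salary, toy$expendPerPupil, use = "complete.obs")

# VIF for salary: 1 / (1 - R^2), where R^2 comes from regressing
# salary on the other explanatory variables (lm drops the NA row)
aux <- lm(salary ~ expendPerPupil + econDisadv, data = toy)
vifSalary <- 1 / (1 - summary(aux)$r.squared)

print(c(correlation = r, vifSalary = vifSalary))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign that collinearity may be inflating the standard errors, although that cutoff is a convention rather than a strict test.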

Preliminary analysis

regressionModel <- lm(df$attendColl ~ df$salary + df$expendPerPupil + df$econDisadv + df$highNeeds)

summary(regressionModel)

## 
## Call:
## lm(formula = df$attendColl ~ df$salary + df$expendPerPupil + 
##     df$econDisadv + df$highNeeds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.610  -3.700   1.202   5.987  34.960 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        7.587e+01  5.869e+00  12.927  < 2e-16 ***
## df$salary          3.977e-04  8.846e-05   4.495 9.85e-06 ***
## df$expendPerPupil -5.280e-04  2.631e-04  -2.006   0.0457 *  
## df$econDisadv      2.513e-01  1.380e-01   1.821   0.0696 .  
## df$highNeeds      -7.601e-01  1.283e-01  -5.922 8.50e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.37 on 307 degrees of freedom
##   (32 observations deleted due to missingness)
## Multiple R-squared:  0.5208, Adjusted R-squared:  0.5146 
## F-statistic: 83.42 on 4 and 307 DF,  p-value: < 2.2e-16

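Here, an increase in average teacher salary is associated with a higher college attendance rate when controlling for the other three variables: the salary coefficient is positive (3.977e-04 percentage points per dollar, or about 0.40 percentage points per $1,000 of salary) with p < 0.001. Note that 32 of the 344 schools were dropped from the model because of missing values.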
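
To make the salary coefficient easier to read, it can be rescaled to percentage points per $1,000 and paired with a 95% confidence interval. The sketch below works only from the estimate, standard error, and residual degrees of freedom printed in the summary above. The variable names are chosen here for illustration.

```r
# Values copied from the regression summary
est     <- 3.977e-04   # salary coefficient (percentage points per dollar)
se      <- 8.846e-05   # standard error of the salary coefficient
dfResid <- 307         # residual degrees of freedom

# Two-sided 95% confidence interval based on the t distribution
ci <- est + c(-1, 1) * qt(0.975, dfResid) * se

# Rescale to percentage points per $1,000 of average salary
print(round(1000 * c(estimate = est, lower = ci[1], upper = ci[2]), 3))
```

Calling `confint(regressionModel)` on the fitted model should give the same interval on the per-dollar scale, up to rounding of the printed values.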

Summary of other controlling variables

Percent of high needs students

describe(df$highNeeds)

##    vars   n  mean    sd median trimmed   mad min  max range skew kurtosis
## X1    1 344 40.93 20.99  35.45   38.74 20.83 5.2 99.7  94.5 0.77    -0.37
##      se
## X1 1.13

hist(df$highNeeds, breaks = 100)

Minimum rate of high needs students is 5.2%, maximum rate is 99.7%.

Mean of 40.93% with a standard deviation of 20.99%.

This data is right-skewed. There is wide variation in % of high needs students.

Percent of economically disadvantaged students

describe(df$econDisadv)

##    vars   n  mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 344 27.87 18.97  22.85   25.98 19.13 3.1  88  84.9 0.75    -0.35
##      se
## X1 1.02

hist(df$econDisadv, breaks = 100)

Minimum rate of economically disadvantaged students is 3.1%, maximum rate is 88%.

Mean of 27.87% with a standard deviation of 18.97%.

This data is right-skewed. There is wide variation in % of economically disadvantaged students.