# load libraries
library(psych)
library(dplyr)
# load data
df <- read.csv("C:/Users/weberr1/Desktop/CUNY/DATA 606/Project/massachusetts-public-schools-data/MA_Public_Schools_2017.csv", stringsAsFactors = FALSE)
# Select and rename the variables of interest; keep only schools with a college attendance rate
df <- df %>%
select(highNeeds = X..High.Needs
,econDisadv = X..Economically.Disadvantaged
,salary = Average.Salary
,expendPerPupil = Average.Expenditures.per.Pupil
,attendColl = X..Attending.College) %>%
filter(!is.na(attendColl))
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
In Massachusetts public schools, is an increase in average teacher salary associated with an increase in college attendance rates, controlling for average expenditure per pupil, percent of economically disadvantaged students, and percent of high needs students?
What are the cases, and how many are there?
Each case in this data set is a public school in the Massachusetts public school system. There are 344 schools in this data set with a reported college attendance rate.
Describe the method of data collection.
This data was downloaded from Kaggle.com. The data was sourced from Department of Education reports. In particular, the reports of interest were outlined in the associated data dictionary:
The Massachusetts DOE site (http://profiles.doe.mass.edu/help/data.aspx) notes the following on the data utilized to create these reports: “Schools and Districts view, add, update and delete their own directory information to ensure that the information is as up-to-date and accurate as possible.”
What type of study is this (observational/experiment)?
This study is observational as it utilizes retrospectively collected data.
If you collected the data, state self-collected. If not, provide a citation/link.
https://www.kaggle.com/ndalziel/massachusetts-public-schools-data/version/1
What is the response variable, and what type is it (numerical/categorical)?
The response variable is percent of students attending college. It is a numerical variable.
What is the explanatory variable, and what type is it (numerical/categorical)?
The explanatory variable is average teacher salary, and the controlling explanatory variables are average expenditures per pupil, percent of economically disadvantaged students, and percent of high needs students.
Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
describe(df$attendColl)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 344 74.51 16.18 77.8 76.86 13.57 10.5 100 89.5 -1.39 1.9
## se
## X1 0.87
hist(df$attendColl, breaks = 100)
Minimum rate of college attendance is 10.5%, maximum rate is 100%.
Mean of 74.51% with a standard deviation of 16.18%.
The distribution is left-skewed (skew = -1.39): rates cannot exceed 100%, and the mean (74.51%) lies below the median (77.8%).
describe(df$salary)
## vars n mean sd median trimmed mad min max range
## X1 1 312 74361.44 8424.16 73706 74081.97 7826.65 53763 100731 46968
## skew kurtosis se
## X1 0.34 -0.29 476.92
hist(df$salary, breaks = 100)
Average teacher salary is reported for 312 of the 344 schools.
Minimum average teacher salary is $53,763; the maximum is $100,731. (The source of this high value still needs to be investigated. Is it a school with very few teachers?)
Mean of $74,361.44 with a standard deviation of $8,424.16.
The distribution is roughly symmetric (skew = 0.34) and approximately normal.
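The salary summary reports n = 312 against 344 schools overall, so some schools have no salary value. The following is a minimal sketch of how one might count missing values per column and list the highest-salary rows when following up on a suspected outlier. The `toy` data frame and its values are invented for illustration and only stand in for `df`.

```r
# Hypothetical stand-in for df; values are invented for illustration only
toy <- data.frame(
  salary     = c(61000, NA, 74000, 100731, NA, 68000),
  attendColl = c(70.2, 81.5, 64.0, 90.1, 55.3, 77.8)
)

# Number of missing values in each column
naCounts <- colSums(is.na(toy))
print(naCounts)

# Number of rows with no missing value in any column
# (these are the rows lm() keeps under its default NA handling)
nComplete <- sum(complete.cases(toy))
print(nComplete)

# Rows sorted by salary, highest first; NA salaries sort to the end
topSalary <- toy[order(toy$salary, decreasing = TRUE), ]
print(head(topSalary, 3))
```

Applied to the real `df`, the same calls would show which columns account for the missing observations and which school sits at the top of the salary range.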
plot(df$attendColl, df$salary)
plot(df$attendColl, df$expendPerPupil)
Looking at these variables individually, there does not seem to be a relationship between average teacher salary (or average expenditure per pupil) and percent of students attending college.
Is it reasonable to include both average expenditure per pupil and average teacher salary as explanatory variables, since they are likely highly correlated? The intent was to have some way to control for wealthier districts.
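One way to approach the collinearity question is to look at the pairwise correlation between the two spending variables and at a variance inflation factor (VIF) for each predictor. The sketch below runs on simulated data, not the school data: the column names match the project, but every value is invented, and `vifFor` is a hypothetical helper written for this example. It computes the VIF directly as 1 / (1 - R²), where R² comes from regressing one predictor on the others.

```r
# Simulated stand-in for df; column names match the project, values are invented
set.seed(606)
n <- 300
expendPerPupil <- rnorm(n, mean = 15000, sd = 2500)
salary <- 50000 + 1.5 * expendPerPupil + rnorm(n, sd = 6000)
highNeeds <- runif(n, 5, 100)
econDisadv <- 0.7 * highNeeds + rnorm(n, sd = 8)
sim <- data.frame(salary, expendPerPupil, econDisadv, highNeeds)

# Pairwise correlation between the two spending-related predictors
r <- cor(sim$salary, sim$expendPerPupil, use = "complete.obs")

# Hypothetical helper: VIF for one predictor.
# Regress it on the remaining predictors, then take 1 / (1 - R^2).
vifFor <- function(target, data) {
  others <- setdiff(names(data), target)
  fit <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(fit)$r.squared)
}

vifs <- sapply(names(sim), vifFor, data = sim)
print(round(r, 2))
print(round(vifs, 2))
```

A VIF near 1 means a predictor is nearly uncorrelated with the others. Values above roughly 5 to 10 are a common rule-of-thumb warning sign, not a strict cutoff. Running the same helper on the four predictors in `df` would indicate whether keeping both spending variables is a concern.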
plot(df$salary, df$expendPerPupil)
regressionModel <- lm(df$attendColl ~ df$salary + df$expendPerPupil + df$econDisadv + df$highNeeds)
summary(regressionModel)
##
## Call:
## lm(formula = df$attendColl ~ df$salary + df$expendPerPupil +
## df$econDisadv + df$highNeeds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.610 -3.700 1.202 5.987 34.960
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.587e+01 5.869e+00 12.927 < 2e-16 ***
## df$salary 3.977e-04 8.846e-05 4.495 9.85e-06 ***
## df$expendPerPupil -5.280e-04 2.631e-04 -2.006 0.0457 *
## df$econDisadv 2.513e-01 1.380e-01 1.821 0.0696 .
## df$highNeeds -7.601e-01 1.283e-01 -5.922 8.50e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.37 on 307 degrees of freedom
## (32 observations deleted due to missingness)
## Multiple R-squared: 0.5208, Adjusted R-squared: 0.5146
## F-statistic: 83.42 on 4 and 307 DF, p-value: < 2.2e-16
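In this model, the salary coefficient is positive and statistically significant (p = 9.85e-06), so higher average teacher salary is associated with higher college attendance rates when controlling for expenditure per pupil, percent of economically disadvantaged students, and percent of high needs students. The fit uses 312 schools, since 32 observations were dropped for missing values. Because the data are observational, this is an association and not evidence of a causal effect.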
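The salary coefficient is expressed per dollar, which makes it hard to read. As a rough sketch, the estimate and standard error can be copied from the summary output above and rescaled to percentage points per $1,000 of average salary, with an approximate 95% confidence interval from the t distribution. The variable names below are chosen for this example only.

```r
# Values copied from the summary(regressionModel) output
est <- 3.977e-04   # salary coefficient, percentage points per dollar
se  <- 8.846e-05   # its standard error
dfResid <- 307     # residual degrees of freedom

# Rescale to percentage points per $1,000 of average salary
per1000 <- est * 1000

# Approximate 95% confidence interval using the t critical value
tCrit <- qt(0.975, dfResid)
ci1000 <- (est + c(-1, 1) * tCrit * se) * 1000

print(round(per1000, 2))
print(round(ci1000, 2))
```

With the fitted model object available, `confint(regressionModel)` should give the same interval on the per-dollar scale without copying numbers by hand.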
describe(df$highNeeds)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 344 40.93 20.99 35.45 38.74 20.83 5.2 99.7 94.5 0.77 -0.37
## se
## X1 1.13
hist(df$highNeeds, breaks = 100)
Minimum rate of high needs students is 5.2%, maximum rate is 99.7%.
Mean of 40.93% with a standard deviation of 20.99%.
This data is right-skewed. There is wide variation in % of high needs students.
describe(df$econDisadv)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 344 27.87 18.97 22.85 25.98 19.13 3.1 88 84.9 0.75 -0.35
## se
## X1 1.02
hist(df$econDisadv, breaks = 100)
Minimum rate of economically disadvantaged students is 3.1%, maximum rate is 88%.
Mean of 27.87% with a standard deviation of 18.97%.
This data is right-skewed. There is wide variation in % of economically disadvantaged students.