Part 1 - Introduction

The following analysis will explore whether there is a significant linear relationship between the percent of students in a school who are economically disadvantaged and the school’s rate of proficiency on state assessments.

This is important because the presence of a strong linear relationship between a measure of poverty and educational outcomes would indicate that state and federal policies, as currently implemented, have largely been unsuccessful at mitigating the effects of poverty on educational outcomes.

The possibility of a linear relationship will be evaluated through linear regression modeling, specifically by testing whether there is strong enough evidence to conclude that the true slope of the linear relationship is not zero.

This can be formally stated as H0: the true slope of the linear relationship is zero, versus H1: the true slope of the linear relationship is not zero.
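
In simple linear regression notation (added here for clarity; the original hypotheses are stated only in words above), this is:

    \text{PctProf} = \beta_0 + \beta_1 \, \text{PctED} + \varepsilon, \qquad H_0: \beta_1 = 0 \quad \text{vs.} \quad H_A: \beta_1 \neq 0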

Part 2a - Data Descriptions

Data Collection

The dataset used in this analysis was obtained by a request to the Michigan Department of Education and is now publicly available at https://raw.githubusercontent.com/ChadRyanBailey/606-Statistics-and-Probability/master/606-Final-Project/02%20Proficiency%20Data%20with%20Entity%20Demographics.csv

Cases

The full dataset has 56,372 cases. Each case in this dataset represents a school’s proficiency rate on the state assessment for a given content area and academic year. The full dataset has cases for four content areas (Mathematics, English Language Arts, Science, and Social Studies) and five academic years (2014-15 through 2018-19).

For this analysis, the data will be limited to only the 2018-19 English Language Arts and Mathematics records. This still leaves 6,386 cases.

Variables

This analysis will make use of two numerical variables within each case: PctED and PctMetProficient.

The first variable [PctED] gives the percent of all enrolled students in the school who were reported as Economically Disadvantaged (ED). The ED category is defined in federal law and is based largely on family income. PctED is a school-level variable, meaning it will vary across schools and academic years but will not vary across content areas for a given school within a year.

The second variable [PctMetProficient] is the percent of valid tested students who were proficient on the state assessment. Within this dataset, this variable is only provided for all grades combined, but it should be noted that the grades assessed vary by content area. Mathematics and English Language Arts are tested in grades 3-8 & 11. Social Studies is tested in grades 5, 8, & 11. Science was tested in grades 4, 7, & 11 but only has records through 2016-17, because Michigan’s state assessment in science was undergoing a redesign and was administered only as a field test in 2017-18 and 2018-19.

Type of study

This analysis is an observational study. It is using data that were collected from schools as they are currently configured. The author of this study had no control over which students were assigned to which schools nor how the grades of those schools were configured.

Scope of Inference - Generalizability

The full dataset does not meet the conditions for generalizability (i.e., it was not collected through random sampling), but this analysis will also use simulation to randomly subsample the full dataset. The simulation results will then be generalizable to the population (all schools within the state of Michigan).

However, it should also be noted that, while the full dataset is not a random sample, education data is different in some respects. Primarily, by law a near full population census is required for both variables (PctED and PctMetProficient): states are required to assess all students in the applicable grades for each content area, and states are required to publicly report key demographics, such as the percent economically disadvantaged, for all schools.

Scope of Inference - Causality

No statements of causality will be possible from this study. The necessary condition of random assignment is not met: the school-level variables are aggregations of the attributes of the students at that school, and students are not randomly assigned to schools.

Part 2b - Data Load & Cleansing

Load Full Dataset

## load data
    fileLocation1 <- '02 Proficiency Data with Entity Demographics.csv'
    proficiency <- read.csv(fileLocation1, sep = ',')
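
Alternatively, the file can be read directly from the public URL listed in the Data Collection section; a minimal sketch, assuming an active internet connection:

    ## alternative: read the file directly from the public GitHub URL
    fileUrl1 <- 'https://raw.githubusercontent.com/ChadRyanBailey/606-Statistics-and-Probability/master/606-Final-Project/02%20Proficiency%20Data%20with%20Entity%20Demographics.csv'
    proficiency <- read.csv(fileUrl1, sep = ',')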

Subset the data

A new dataset is created that contains only the rows and columns of interest.

  library(dplyr)

  proficiencySlim <- proficiency %>%
    ## limit to only the fields of current interest
    filter (AcademicYear == '2018-19'
            & ContentAreaName %in% c('Mathematics', 'English Language Arts')) %>%
    select (AcademicYear
            ,BuildingCode
            ,ContentAreaName
            ,nValidTested
            ,nMetProficient
            ,nNotMetProficient
            ,PctMetProficient
            ,nTotalEnrolled
            ,nED
            ,PctED) %>%
    ## rename to shorter field names
    rename(nTested = nValidTested
           ,nProf = nMetProficient
           ,nNonProf = nNotMetProficient
           ,PctProf = PctMetProficient
           ,nEnrolled = nTotalEnrolled) 
  
  #summary(proficiencySlim)
  #summary(proficiency)

Dealing with suppression

As can be seen in the columns {nTested, nProf, nNonProf, and PctProf}, the file has records that have been suppressed. This is typical for public education data. The suppression is done to protect the privacy of small groups of students.

Flag records that have suppression applied and count records with each type of suppression case

# add flags to review each suppression condition
    proficiencySlim <- proficiencySlim %>%
      mutate( HasTestedLT10 = ifelse(nTested == '< 10', 1, 0)
              ,HasProfLT3 = ifelse(nProf == '< 3', 1, 0)
              ,HasNonProfLT3 = ifelse(nNonProf == '< 3', 1, 0)
              ,HasEitherProfOrNonProfLT3 = ifelse(HasProfLT3 == 1 | HasNonProfLT3 == 1, 1, 0)
              ,HasRecord = 1)


# get the count of records by suppression conditions
    proficiencySlim %>%
      summarise(nTotal = sum(HasRecord)
                ,nTestedLT10 = sum(HasTestedLT10)
                ,nProfLT3 = sum(HasProfLT3)
                ,nNonProfLT3 = sum(HasNonProfLT3)
                ,nEitherProfOrNonProfLT3 = sum(HasEitherProfOrNonProfLT3))
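
Before removing or imputing anything, a few of the flagged rows can be inspected directly to see what the suppressed values look like (a quick check, not part of the original cleansing steps):

    ## peek at a handful of rows where either count was suppressed
    proficiencySlim %>%
      filter(HasEitherProfOrNonProfLT3 == 1) %>%
      select(BuildingCode, ContentAreaName, nTested, nProf, nNonProf, PctProf) %>%
      head()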

Remove records where values cannot be imputed

Remove records with fewer than 10 valid tested students, as all data for those records are suppressed and a value for imputation cannot be applied.

## remove records with < 10 valid tested; all data for these records are suppressed
  proficiencySlim <- proficiencySlim %>%
    filter(HasTestedLT10 == 0) 

  nrow(proficiencySlim)
## [1] 6163

Impute suppressed values where possible

Since “< 3” corresponds to the set {0, 1, 2}, the middle value 1 will be used as the imputed count. The complementary count is imputed as nTested minus 1, and percentages for suppressed records are then recalculated from the imputed counts.

## deal with cases where suppression is applied because nearly all or nearly
## none of the students were proficient
proficiencySlim <- proficiencySlim %>%
    #convert factors to characters
    mutate(nTested = as.character(nTested)
           ,nProf = as.character(nProf)
           ,nNonProf = as.character(nNonProf)
           ,PctProf = as.character(PctProf)
           ) %>%
    
    #convert the characters to numerics
    mutate(nTested = as.numeric(nTested)
           ,nProf = as.numeric(nProf)
           ,nNonProf = as.numeric(nNonProf)
           ,PctProf = as.numeric(PctProf)
           ) %>%
    
    # for count variables (nProf and nNonProf) replace the suppression flag with imputed count
    mutate(nProf = ifelse(HasProfLT3 == 1, 1, nProf)
           ,nProf = ifelse(HasNonProfLT3 == 1, nTested - 1, nProf)

           ,nNonProf = ifelse(HasNonProfLT3 == 1, 1, nNonProf)
           ,nNonProf = ifelse(HasProfLT3 == 1, nTested - 1, nNonProf)
           ) %>%

    # for the percentage variable (PctProf) replace the suppression flag 
    # with a calculated percentage using the imputed counts
    mutate(PctProf = ifelse(HasProfLT3 == 1, round(nProf*1.0/nTested*100.0, 2), PctProf)
           , PctProf = ifelse(HasNonProfLT3 == 1, round(nProf*1.0/nTested*100.0, 2), PctProf))
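
After the imputation, a simple sanity check (not part of the original code) can confirm whether any NA values remain in the count and percentage columns:

    ## sanity check: count remaining NA values after imputation
    colSums(is.na(select(proficiencySlim, nProf, nNonProf, PctProf)))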


# drop columns
    proficiencySlim <- select(proficiencySlim
                              , -c(HasTestedLT10, HasProfLT3, HasNonProfLT3, HasRecord))
# rename column
    proficiencySlim <- proficiencySlim %>%
      rename(HasImputedValues = HasEitherProfOrNonProfLT3)
    
    
# get datasets partitioned by content area
    proficiency_all <- proficiencySlim
    proficiency_math <- proficiency_all %>% filter(ContentAreaName == 'Mathematics')
    proficiency_ela <- proficiency_all %>% filter(ContentAreaName == 'English Language Arts')

Part 3 - Exploratory data analysis

Initial Summary of the Data

#head(proficiency_all)
summary(proficiency_all)
##   AcademicYear   BuildingCode               ContentAreaName
##  2014-15:   0   Min.   :   1   English Language Arts:3081  
##  2015-16:   0   1st Qu.:1552   Mathematics          :3082  
##  2016-17:   0   Median :3172   Science              :   0  
##  2017-18:   0   Mean   :4067   Social Studies       :   0  
##  2018-19:6163   3rd Qu.:6364                               
##                 Max.   :9994                               
##     nTested           nProf          nNonProf         PctProf     
##  Min.   :  10.0   Min.   :  1.0   Min.   :   1.0   Min.   : 0.36  
##  1st Qu.: 106.0   1st Qu.: 29.0   1st Qu.:  54.0   1st Qu.:23.81  
##  Median : 192.0   Median : 74.0   Median : 103.0   Median :40.91  
##  Mean   : 244.6   Mean   :107.5   Mean   : 137.2   Mean   :40.16  
##  3rd Qu.: 317.0   3rd Qu.:140.0   3rd Qu.: 181.0   3rd Qu.:55.91  
##  Max.   :1525.0   Max.   :866.0   Max.   :1315.0   Max.   :99.56  
##    nEnrolled           nED             PctED        HasImputedValues 
##  Min.   :   8.0   Min.   :   5.0   Min.   :  1.98   Min.   :0.00000  
##  1st Qu.: 286.0   1st Qu.: 127.0   1st Qu.: 35.96   1st Qu.:0.00000  
##  Median : 422.0   Median : 212.0   Median : 55.33   Median :0.00000  
##  Mean   : 606.5   Mean   : 293.2   Mean   : 55.16   Mean   :0.05971  
##  3rd Qu.: 648.0   3rd Qu.: 344.0   3rd Qu.: 74.69   3rd Qu.:0.00000  
##  Max.   :5960.0   Max.   :4544.0   Max.   :100.00   Max.   :1.00000

Scatterplot of the primary variables

library(ggplot2)


ggplot(proficiency_all
       , aes(x = PctED, y = PctProf)) + 
  geom_point(aes(size = nEnrolled, color = ContentAreaName), alpha = 0.5) + 
  geom_smooth(method=lm) 

ggplot(proficiency_all
       , aes(x = PctED, y = PctProf)) + 
  geom_point(aes(size = nEnrolled, color = ContentAreaName), alpha = 0.5) + 
  facet_wrap(~ContentAreaName)+ 
  geom_smooth(method=lm) 

Initial Thoughts on the data’s relationships

The scatterplots clearly show a negative linear relationship. Visually, the correlation appears strong enough to justify rejecting H0. However, that decision will be postponed until after the inference tests have been run.
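
The visual impression can also be checked with a quick numeric summary before any models are fit; a minimal sketch (output not reproduced here):

    ## Pearson correlation between the two primary variables
    cor(proficiency_all$PctED, proficiency_all$PctProf)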

Part 4 - Inference

Checking the conditions for linear regression

  m_allContentAreas <- lm(data = proficiency_all, formula = PctProf ~ PctED)
  m_ela <- lm(data = proficiency_ela, formula = PctProf ~ PctED)
  m_math <- lm(data = proficiency_math, formula = PctProf ~ PctED)


plot(m_allContentAreas)

## Linearity: Residuals vs Fitted; the data appear to have a mostly linear
##            relationship

## Nearly normal residuals: Normal Q-Q; the data appear to be mostly clustered 
##                          around the normal line, although there is meaningful 
##                          deviation at the right tail

## Constant variability:  Scale-Location; the data appear mostly evenly spread
##                        across the x-axis and the regression line is nearly
##                        horizontal (which is desired).

## Independent observations:  Residuals vs Leverage; none of the observations meet 
##                            the thresholds for "influential values".
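
The same diagnostic plots can be produced for each content-area model, and a histogram of residuals offers one more quick look at near-normality; a sketch that goes beyond the original condition checks:

    ## diagnostics for the content-area models
    plot(m_ela)
    plot(m_math)

    ## histogram of residuals from the combined model
    hist(residuals(m_allContentAreas), breaks = 50)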

Theoretical Inference

All Content Areas
  summary(m_allContentAreas)
## 
## Call:
## lm(formula = PctProf ~ PctED, data = proficiency_all)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.087  -8.832   0.189   8.673  66.564 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 75.424444   0.436710  172.71   <2e-16 ***
## PctED       -0.639261   0.007231  -88.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.97 on 6161 degrees of freedom
## Multiple R-squared:  0.5592, Adjusted R-squared:  0.5591 
## F-statistic:  7816 on 1 and 6161 DF,  p-value: < 2.2e-16
  confint(m_allContentAreas, c('PctED'), 0.95)
##            2.5 %     97.5 %
## PctED -0.6534355 -0.6250866

The p-value for PctED is less than 0.05. Additionally, the 95% confidence interval does not include 0. Both of these offer strong evidence to reject the null hypothesis (H0).
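
If desired, the slope’s test statistic and p-value can also be pulled out programmatically rather than read from the printed summary; a minimal sketch:

    ## extract the PctED row from the model's coefficient table
    summary(m_allContentAreas)$coefficients['PctED', c('t value', 'Pr(>|t|)')]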

ELA
  summary(m_ela)
## 
## Call:
## lm(formula = PctProf ~ PctED, data = proficiency_ela)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.977  -7.313  -0.016   6.919  61.779 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 81.023708   0.544817  148.72   <2e-16 ***
## PctED       -0.651163   0.009022  -72.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.32 on 3079 degrees of freedom
## Multiple R-squared:  0.6285, Adjusted R-squared:  0.6284 
## F-statistic:  5209 on 1 and 3079 DF,  p-value: < 2.2e-16
  confint(m_ela, c('PctED'), 0.95)
##            2.5 %     97.5 %
## PctED -0.6688521 -0.6334735

The p-value for PctED is less than 0.05. Additionally, the 95% confidence interval does not include 0. Both of these offer strong evidence to reject the null hypothesis (H0).

Math
  summary(m_math)
## 
## Call:
## lm(formula = PctProf ~ PctED, data = proficiency_math)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.834  -8.727  -0.871   8.475  66.078 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 69.82077    0.60857  114.73   <2e-16 ***
## PctED       -0.62725    0.01007  -62.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.77 on 3080 degrees of freedom
## Multiple R-squared:  0.5572, Adjusted R-squared:  0.5571 
## F-statistic:  3876 on 1 and 3080 DF,  p-value: < 2.2e-16
  confint(m_math, c('PctED'), 0.95)
##            2.5 %     97.5 %
## PctED -0.6470071 -0.6074996

The p-value for PctED is less than 0.05. Additionally, the 95% confidence interval does not include 0. Both of these offer strong evidence to reject the null hypothesis (H0).

Simulated Inference

Build the Simulations
  # set sample size and iterations
    n <- 50
    iterations <- 500
  
  # simulate all content areas
      # initialize storage vector
      coeff_pctED_all <- rep(0, iterations)
    
      # loop to run 500 regression models based on random samples of size 50 cases
      for(i in 1:iterations){
        set.seed(i); proficiency_samples <- sample_n(proficiency_all, n, replace = FALSE)
        m <- lm(data = proficiency_samples, formula = PctProf ~ PctED)
        coeffs_all <-coefficients(m)
        coeff_pctED_all[i] <- round(coeffs_all[2], 4)  #only store the PctED coefficient
      }
 
 # simulate ELA
      # initialize storage vector
      coeff_pctED_ela <- rep(0, iterations)
    
      # loop to run 500 regression models based on random samples of size 50 cases
      for(i in 1:iterations){
        set.seed(i); proficiency_samples <- sample_n(proficiency_ela, n, replace = FALSE)
        m <- lm(data = proficiency_samples, formula = PctProf ~ PctED)
        coeffs_all <-coefficients(m)
        coeff_pctED_ela[i] <- round(coeffs_all[2], 4)  #only store the PctED coefficient
      }
    
 # simulate math
      # initialize storage vector
      coeff_pctED_math <- rep(0, iterations)
    
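      # loop to run 500 regression models based on random samples of size 50 cases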
      for(i in 1:iterations){
        set.seed(i); proficiency_samples <- sample_n(proficiency_math, n, replace = FALSE)
        m <- lm(data = proficiency_samples, formula = PctProf ~ PctED)
        coeffs_all <-coefficients(m)
        coeff_pctED_math[i] <- round(coeffs_all[2], 4)  #only store the PctED coefficient
      }

Review Simulation Outputs
  sim_output <- summary(coeff_pctED_all)
  sim_output <-rbind(sim_output, summary(coeff_pctED_ela))
  sim_output <-rbind(sim_output, summary(coeff_pctED_math))
  
  sim_output <- cbind(c('All', 'ELA', 'Math'), sim_output)
  
  sim_output
##                   Min.      1st Qu.     Median     Mean        
## sim_output "All"  "-0.8551" "-0.68535"  "-0.63985" "-0.638068" 
##            "ELA"  "-0.8878" "-0.702175" "-0.6579"  "-0.6557526"
##            "Math" "-0.8398" "-0.676825" "-0.6336"  "-0.6270372"
##            3rd Qu.     Max.     
## sim_output "-0.590175" "-0.4019"
##            "-0.61135"  "-0.4269"
##            "-0.578975" "-0.2889"

The outputs of the simulation show that the full range of the simulated slope coefficients does not include zero. This is true for all content areas combined and for each content area separately. This again supports rejecting the null hypothesis.
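
The simulated slopes can also be summarized with empirical quantiles, which act as a rough simulation-based interval for the slope; a minimal sketch that was not run as part of the original analysis:

    ## empirical 2.5% and 97.5% quantiles of the simulated slope coefficients
    quantile(coeff_pctED_all, probs = c(0.025, 0.975))
    quantile(coeff_pctED_ela, probs = c(0.025, 0.975))
    quantile(coeff_pctED_math, probs = c(0.025, 0.975))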

Part 5 - Conclusion

The inference tests confirm what was suspected in the review of the scatterplots: the relationship is statistically significant. There is strong evidence to reject the null hypothesis. At the 95% confidence level, the percent of enrolled students reported as economically disadvantaged is a strong (negative) predictor of a school’s proficiency rate on state assessments. Furthermore, this analysis shows this is true both when content areas are combined and when they are evaluated separately.

These results indicate that, at the state level, current policy is typically not successful in mitigating the effects of poverty on students’ educational outcomes. There were high-poverty schools that still had relatively high rates of proficiency (i.e., outperformed the model), but they were by far the exception and not the rule.

Further points of possible continuing analysis would be a review of high-poverty schools that are outperforming the model, comparing them with other high-poverty schools whose outcomes fall near or below the model. Specifically, it would be valuable to compare data on differences in instructional practices and curricular tools, along with observational data from educational experts visiting schools in all three buckets (above, near, and below the model).
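
As a starting point for that follow-up, the schools outperforming the model could be identified from the residuals of the fitted model; a minimal sketch, where the PctED cutoff of 75 and the residual cutoff of 15 percentage points are purely illustrative choices:

    ## hypothetical example: high-poverty schools whose observed proficiency is
    ## well above the value predicted by the combined model
    proficiency_all %>%
      mutate(predicted = predict(m_allContentAreas, newdata = proficiency_all)
             ,residual = PctProf - predicted) %>%
      filter(PctED >= 75, residual > 15) %>%   # cutoffs chosen for illustration only
      arrange(desc(residual)) %>%
      select(BuildingCode, ContentAreaName, PctED, PctProf, predicted, residual)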