Overview

This is the final project for “607 Data Acquisition and Management”.

I propose to examine the relationship and tensions of correlations of proficiency on state assessments to both the percent of students in poverty as well as the average magnitude of poverty in the surrounding geographic area. The specific measures and geographical unit (county, city, US census tract, etc.) will be dependent on the availability of data and how well that can be merged with educational data.

The motivation for this analysis is primarily to explore a policy question/problem within the field of education. There is a strong, and well documented, correlation between the rate of students in poverty and lower academic outcomes. However, while there has been significant discussion that schools exist within a context of the experiences of their students and the community in which they are located, little study on the impact of community on schools seems to have been done. This would be a start to rectifying that situation.

Proposed data sources:

  1. School, district, and/or Intermediate School District educational data to be pulled from www.mischooldata.org via csv download request
  2. County, city, and/or US census tract information pulled from www.census.gov website APIs

General Project Plan:

  1. Exploring and choosing data sources
  2. Acquire, transform, and clean each dataset
  3. Combine datasets
  4. Analyze the data
  5. Conclusions

Project Documentation and Notes

Also data and documentation for this project are available in the GitHub repository: https://github.com/ChadRyanBailey/607-Data-Acquisition-and-Management/tree/master/607-Weeks13-15-Final-Project

Notes and documentation for the project were maintained in a separate Microsoft Word (.docx) titled “01 Final Project Notes.docx” and is saved in the repo above.

1. Exploring and choosing data sources

The details of which data sources chosen as well as acquisition issues are detailed in the “01 Final Project Notes.docx” document stored in the project’s GitHub Repo: https://github.com/ChadRyanBailey/607-Data-Acquisition-and-Management/tree/master/607-Weeks13-15-Final-Project

In summary, sources for school proficiency on state assessments had to be altered for the sake of data quality. However, the granularity of Census data was better than expected, and included data by geographic districts.

2. Acquire, transform, and clean the separate datasets

2.1 Load Michigan proficiency rates on state assessments by school

# load necessary supporting packages
library(dplyr)
library(stringr)
library(httr)
library(ggplot2)

# load the school proficiency data
    fileLocation1 <- 'https://raw.githubusercontent.com/ChadRyanBailey/607-Data-Acquisition-and-Management/master/607-Weeks13-15-Final-Project/03%202015-16%20Proficiency%20All%20Tests%2C%20Grades%2C%20Students.csv'
  
# option needed to prevent long sequence id numbers from being converted to scientific notation
    options(scipen = 999)
    proficiency <-read.csv(fileLocation1, sep = ',')
    
# review column names and rename as needed
    names(proficiency)
##  [1] "ï..AcademicYearName"           "ISDCode"                      
##  [3] "ISDName"                       "DistrictCode"                 
##  [5] "DistrictName"                  "BuildingCode"                 
##  [7] "BuildingName"                  "Assessment"                   
##  [9] "Grade"                         "StudentGroup"                 
## [11] "Subject"                       "TestedIncluded"               
## [13] "TestedProficientMet"           "PercentTestedProficientMet"   
## [15] "TestedNotProficientMet"        "PercentTestedNotProficientMet"
    proficiency <- proficiency %>%
    rename(nTested = TestedIncluded
           ,nProf = TestedProficientMet
           ,nNonProf = TestedNotProficientMet
           ,PctProf = PercentTestedProficientMet
           ,PctNonProf = PercentTestedNotProficientMet) 
    
    names(proficiency)[1] <- c("AcademicYear")

# review summary of the variables/colummns
    summary(proficiency)
##     AcademicYear     ISDCode                 ISDName      DistrictCode  
##  2015-2016:6522   Min.   : 3000   Wayne RESA     :1042   Min.   : 1010  
##                   1st Qu.:33000   Oakland Schools: 666   1st Qu.:33170  
##                   Median :51000   Macomb ISD     : 454   Median :52040  
##                   Mean   :51642   Kent ISD       : 424   Mean   :51979  
##                   3rd Qu.:74000   Genesee ISD    : 262   3rd Qu.:74040  
##                   Max.   :84000   Ingham ISD     : 204   Max.   :84060  
##                                   (Other)        :3470                  
##                          DistrictName   BuildingCode 
##  Detroit City School District  : 180   Min.   :   1  
##  Grand Rapids Public Schools   :  92   1st Qu.:1538  
##  Utica Community Schools       :  74   Median :3155  
##  Dearborn City School District :  70   Mean   :4136  
##  Ann Arbor Public Schools      :  60   3rd Qu.:6592  
##  Lansing Public School District:  58   Max.   :9994  
##  (Other)                       :5988                 
##                        BuildingName            Assessment  
##  Central Elementary School   :  16   All Assessments:6522  
##  North Elementary School     :  12                         
##  Washington Elementary School:  12                         
##  Roosevelt Elementary School :  10                         
##  Central High School         :   8                         
##  Lakewood Elementary School  :   8                         
##  (Other)                     :6456                         
##         Grade            StudentGroup                   Subject    
##  All Grades:6522   All Students:6522   English Language Arts:3261  
##                                        Mathematics          :3261  
##                                                                    
##                                                                    
##                                                                    
##                                                                    
##                                                                    
##     nTested         nProf         PctProf        nNonProf   
##  <10    : 250   < 3    : 344   *      : 620   *      : 594  
##  20     :  37   *      : 276   50.00  :  39   64     :  42  
##  14     :  36   3      :  73   33.33  :  29   50     :  40  
##  11     :  35   4      :  65   25.00  :  21   52     :  40  
##  180    :  33   10     :  54   40.00  :  19   75     :  40  
##  216    :  33   7      :  54   41.67  :  14   57     :  39  
##  (Other):6098   (Other):5656   (Other):5780   (Other):5727  
##    PctNonProf  
##  *      : 620  
##  50.00  :  39  
##  66.67  :  29  
##  75.00  :  21  
##  60.00  :  19  
##  58.33  :  14  
##  (Other):5780

As can be seen in the columns {nTested, nProf, PctProf, nNonProf, and PctNonProf}, the file has records that have been suppressed. This is typical for public education data. The suppression is done to protect the privacy of small groups of students.

2.1a Resolve suppressions

The following section will review the behavior of rows having suppression. Where possible values will be imputed. If no value can be imputed, the record will be dropped.

# add flags to review each suppression condition
    proficiency <- proficiency %>%
      mutate( HasTestedLT10 = ifelse(nTested == '<10', 1, 0)
              ,HasProfLT3 = ifelse(nProf == '< 3', 1, 0)
              ,HasNonProfLT3 = ifelse(nNonProf == '< 3', 1, 0)
              ,HasEitherProfOrNonProfLT3 = ifelse(nProf == '< 3' | nNonProf == '< 3', 1, 0)
              ,HasRecord = 1)

# get the count of records by suppression conditions
    proficiency %>%
      summarise(nTotal = sum(HasRecord)
                ,nTestedLT10 = sum(HasTestedLT10)
                ,nProfLT3 = sum(HasProfLT3)
                ,nNonProfLT3 = sum(HasNonProfLT3)
                ,nEitherProfOrNonProfLT3 = sum(HasEitherProfOrNonProfLT3))
##   nTotal nTestedLT10 nProfLT3 nNonProfLT3 nEitherProfOrNonProfLT3
## 1   6522         250      344          26                     370

2.1b Starting with reocrds with <10 tested students

# review records for school's flagged as having <10 tested students
  proficiencySmallSchools <- proficiency %>%
    filter(nTested == '<10') %>%
    select(BuildingCode
           , Subject
           , nTested 
           , nProf
           , PctProf
           , nNonProf
           , PctNonProf)

  head(proficiencySmallSchools)
##   BuildingCode               Subject nTested nProf PctProf nNonProf
## 1         8425 English Language Arts     <10     *       *        *
## 2         5790 English Language Arts     <10     *       *        *
## 3         9085 English Language Arts     <10     *       *        *
## 4         9987 English Language Arts     <10     *       *        *
## 5         6950 English Language Arts     <10     *       *        *
## 6         7276 English Language Arts     <10     *       *        *
##   PctNonProf
## 1          *
## 2          *
## 3          *
## 4          *
## 5          *
## 6          *
# drop records for schools with <10 students tested as they are otherwise empty
  proficiencyCleaned <-
    anti_join(proficiency
              , proficiencySmallSchools
              , by = c('BuildingCode', 'Subject'))

As shown in the head of records where the number tested is less than 10, there was no way to impute a value with the given information. Thus, these records were dropped using an anti_join.

2.1c Moving to records with <3 students Proficient or Non-Proficient

Since <3 is equal to the set {0, 1, 2}, the middle value “1” will be used as the imputed value. Also, percentages will be calculated for suppressed records using the imputed value.

proficiencyCleaned <- proficiencyCleaned %>%
    #convet factors to characters
    mutate(nTested = as.character(nTested)
           ,nProf = as.character(nProf)
           ,nNonProf = as.character(nNonProf)
           ,PctProf = as.character(PctProf)
           ,PctNonProf = as.character(PctNonProf)
           ) %>%
    
    #convert the characters to numerics
    mutate(nTested = as.numeric(nTested)
           ,nProf = as.numeric(nProf)
           ,nNonProf = as.numeric(nNonProf)
           ,PctProf = as.numeric(PctProf)
           ,PctNonProf = as.numeric(PctNonProf)
           ) %>%
    
    # for count variables (nProf and nNonProf) replace the suppression flag with imputed count
    mutate(nProf = ifelse(HasProfLT3 == 1, 1, nProf)
           ,nProf = ifelse(HasNonProfLT3 == 1, nTested - 1, nProf)

           ,nNonProf = ifelse(HasNonProfLT3 == 1, 1, nNonProf)
           ,nNonProf = ifelse(HasProfLT3 == 1, nTested - 1, nNonProf)
           ) %>%

    # for percentage variables (PctProf and PctNonProf) replace the suppression flag 
    # with a calucuated percentage using the imputed counts
    mutate(PctProf = ifelse(HasProfLT3 == 1, round(nProf*1.0/nTested*100.0, 2), PctProf)
           , PctProf = ifelse(HasNonProfLT3 == 1, round(nProf*1.0/nTested*100.0, 2), PctProf)

           , PctNonProf = ifelse(HasNonProfLT3 == 1, round(nNonProf*1.0/nTested*100.0, 2), PctNonProf)
           , PctNonProf = ifelse(HasProfLT3 == 1, round(nNonProf*1.0/nTested*100.0, 2), PctNonProf))

2.1d Re-review the summary to make sure changes occurred as expected

summary(proficiencyCleaned)
##     AcademicYear     ISDCode                 ISDName      DistrictCode  
##  2015-2016:6272   Min.   : 3000   Wayne RESA     :1027   Min.   : 1010  
##                   1st Qu.:33000   Oakland Schools: 649   1st Qu.:33230  
##                   Median :52000   Macomb ISD     : 448   Median :52170  
##                   Mean   :51957   Kent ISD       : 406   Mean   :52243  
##                   3rd Qu.:74000   Genesee ISD    : 256   3rd Qu.:74050  
##                   Max.   :84000   Ingham ISD     : 190   Max.   :84060  
##                                   (Other)        :3296                  
##                          DistrictName   BuildingCode 
##  Detroit City School District  : 180   Min.   :   1  
##  Grand Rapids Public Schools   :  86   1st Qu.:1512  
##  Utica Community Schools       :  74   Median :3150  
##  Dearborn City School District :  70   Mean   :4094  
##  Ann Arbor Public Schools      :  60   3rd Qu.:6412  
##  Lansing Public School District:  56   Max.   :9994  
##  (Other)                       :5746                 
##                        BuildingName            Assessment  
##  Central Elementary School   :  16   All Assessments:6272  
##  North Elementary School     :  12                         
##  Washington Elementary School:  12                         
##  Roosevelt Elementary School :  10                         
##  Central High School         :   8                         
##  Lakewood Elementary School  :   8                         
##  (Other)                     :6206                         
##         Grade            StudentGroup                   Subject    
##  All Grades:6272   All Students:6272   English Language Arts:3137  
##                                        Mathematics          :3135  
##                                                                    
##                                                                    
##                                                                    
##                                                                    
##                                                                    
##     nTested         nProf          PctProf         nNonProf      
##  Min.   :  10   Min.   :  1.0   Min.   : 0.38   Min.   :   1.00  
##  1st Qu.: 108   1st Qu.: 30.0   1st Qu.:24.55   1st Qu.:  53.75  
##  Median : 195   Median : 76.0   Median :41.67   Median : 103.00  
##  Mean   : 247   Mean   :108.1   Mean   :40.88   Mean   : 138.97  
##  3rd Qu.: 323   3rd Qu.:145.0   3rd Qu.:56.67   3rd Qu.: 182.00  
##  Max.   :1865   Max.   :868.0   Max.   :99.71   Max.   :1575.00  
##                                                                  
##    PctNonProf    HasTestedLT10   HasProfLT3      HasNonProfLT3     
##  Min.   : 0.29   Min.   :0     Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:43.33   1st Qu.:0     1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :58.33   Median :0     Median :0.00000   Median :0.000000  
##  Mean   :59.12   Mean   :0     Mean   :0.05485   Mean   :0.004145  
##  3rd Qu.:75.45   3rd Qu.:0     3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :99.62   Max.   :0     Max.   :1.00000   Max.   :1.000000  
##                                                                    
##  HasEitherProfOrNonProfLT3   HasRecord
##  Min.   :0.00000           Min.   :1  
##  1st Qu.:0.00000           1st Qu.:1  
##  Median :0.00000           Median :1  
##  Mean   :0.05899           Mean   :1  
##  3rd Qu.:0.00000           3rd Qu.:1  
##  Max.   :1.00000           Max.   :1  
## 

All variables are within the expected ranges are now of the expected types. No more transformations are needed on this data set for now.

2.2 Load Entity Demographics

# load the school proficiency data
    fileLocation2 <- 'https://raw.githubusercontent.com/ChadRyanBailey/607-Data-Acquisition-and-Management/master/607-Weeks13-15-Final-Project/04%202015-16%20Entity%20Demographics.csv'

    entityDemog <-read.csv(fileLocation2, sep = ',')

# review column names and rename as needed
    names(entityDemog)
##  [1] "SchoolYear"                          
##  [2] "ISDCode"                             
##  [3] "ISDName"                             
##  [4] "DistrictCode"                        
##  [5] "DistrictName"                        
##  [6] "BuildingCode"                        
##  [7] "BuildingName"                        
##  [8] "EntityType"                          
##  [9] "LOCALE_NAME"                         
## [10] "TOTAL_ENROLLMENT"                    
## [11] "MALE_ENROLLMENT"                     
## [12] "FEMALE_ENROLLMENT"                   
## [13] "AMERICAN_INDIAN_ENROLLMENT"          
## [14] "ASIAN_ENROLLMENT"                    
## [15] "AFRICAN_AMERICAN_ENROLLMENT"         
## [16] "HISPANIC_ENROLLMENT"                 
## [17] "HAWAIIAN_ENROLLMENT"                 
## [18] "WHITE_ENROLLMENT"                    
## [19] "TWO_OR_MORE_RACES_ENROLLMENT"        
## [20] "KINDERGARTEN_ENROLLMENT"             
## [21] "GRADE_1_ENROLLMENT"                  
## [22] "GRADE_2_ENROLLMENT"                  
## [23] "GRADE_3_ENROLLMENT"                  
## [24] "GRADE_4_ENROLLMENT"                  
## [25] "GRADE_5_ENROLLMENT"                  
## [26] "GRADE_6_ENROLLMENT"                  
## [27] "GRADE_7_ENROLLMENT"                  
## [28] "GRADE_8_ENROLLMENT"                  
## [29] "GRADE_9_ENROLLMENT"                  
## [30] "GRADE_10_ENROLLMENT"                 
## [31] "GRADE_11_ENROLLMENT"                 
## [32] "GRADE_12_ENROLLMENT"                 
## [33] "ECONOMIC_DISADVANTAGED_ENROLLMENT"   
## [34] "SPECIAL_EDUCATION_ENROLLMENT"        
## [35] "ENGLISH_LANGUAGE_LEARNERS_ENROLLMENT"
    entityDemog <- entityDemog %>%
      rename(AcademicYear = SchoolYear
           ,nEnrollmentTotal = TOTAL_ENROLLMENT
           ,nEnrollmentED = ECONOMIC_DISADVANTAGED_ENROLLMENT) %>%
      select(AcademicYear
             , ISDCode
             , DistrictCode
             , BuildingCode
             , nEnrollmentTotal
             , nEnrollmentED)
    
# review the summary values for the selected columns
    summary(entityDemog)
##               AcademicYear     ISDCode       DistrictCode  
##  15 - 16 School Year:5952   Min.   : 3.00   Min.   : 1010  
##                             1st Qu.:32.00   1st Qu.:33070  
##                             Median :52.00   Median :52180  
##                             Mean   :51.61   Mean   :52167  
##                             3rd Qu.:75.00   3rd Qu.:75038  
##                             Max.   :84.00   Max.   :84090  
##                             NA's   :1       NA's   :58     
##   BuildingCode   nEnrollmentTotal  nEnrollmentED     
##  Min.   :    0   Min.   :      0   Min.   :     0.0  
##  1st Qu.:  580   1st Qu.:      0   1st Qu.:     0.0  
##  Median : 2614   Median :    293   Median :   116.0  
##  Mean   : 3646   Mean   :   1013   Mean   :   467.4  
##  3rd Qu.: 6396   3rd Qu.:    534   3rd Qu.:   263.0  
##  Max.   :84070   Max.   :1540005   Max.   :713295.0  
##  NA's   :967
# transform values as needed
    entityDemog <- entityDemog %>%
      mutate(AcademicYear = str_replace_all(AcademicYear, " School Year", "")) %>%
      mutate(AcademicYear = str_replace_all(AcademicYear, " ", "")) %>%
      mutate(AcademicYear = paste("20", AcademicYear, sep = "")) %>%
      mutate(AcademicYear = str_replace_all(AcademicYear, "-", "-20")) %>%
      mutate(AcademicYear = as.factor(AcademicYear)) %>%
      
      filter(nEnrollmentTotal > 0
             ,!is.na(BuildingCode)) %>%
      
      mutate(PctEnrollmentED = round(nEnrollmentED / nEnrollmentTotal *100, 2))
    
    summary(entityDemog)
##     AcademicYear     ISDCode       DistrictCode    BuildingCode 
##  2015-2016:3381   Min.   : 3.00   Min.   : 1010   Min.   :   1  
##                   1st Qu.:33.00   1st Qu.:33170   1st Qu.:1535  
##                   Median :51.00   Median :52015   Median :3167  
##                   Mean   :51.61   Mean   :51934   Mean   :4152  
##                   3rd Qu.:74.00   3rd Qu.:74030   3rd Qu.:6663  
##                   Max.   :84.00   Max.   :84050   Max.   :9994  
##  nEnrollmentTotal nEnrollmentED    PctEnrollmentED 
##  Min.   :   1.0   Min.   :   0.0   Min.   :  0.00  
##  1st Qu.: 234.0   1st Qu.:  82.0   1st Qu.: 31.27  
##  Median : 378.0   Median : 162.0   Median : 50.70  
##  Mean   : 433.5   Mean   : 198.8   Mean   : 50.34  
##  3rd Qu.: 538.0   3rd Qu.: 262.0   3rd Qu.: 68.72  
##  Max.   :2917.0   Max.   :1844.0   Max.   :100.00

2.3 Load the Entity data

This data set is needed to bridge the previous two data sets. The US Census data uses district codes set by the National Center for Educational Statistics (NCES) whereas the the Michigan school proficiency data uses district codes set by the state of Michigan. Thankfully both codes are available in Michigan’s Educational Enity Master.

# load the school proficiency data
    fileLocation3 <- 'https://raw.githubusercontent.com/ChadRyanBailey/607-Data-Acquisition-and-Management/master/607-Weeks13-15-Final-Project/05%202015-16%20Entity%20Data.txt'

    entity <-read.csv(fileLocation3, sep = '\t')

# review column names and rename as needed
    names(entity)
##  [1] "AcademicYearName"          "ISD"                      
##  [3] "ISDName"                   "DistrictCode"             
##  [5] "DistrictName"              "BuildingCode"             
##  [7] "BuildingName"              "DistrictNCESId"           
##  [9] "BuildingNCESId"            "EntityTypeCategoryName"   
## [11] "GeographicDistrictCode"    "GeographicDistrictName"   
## [13] "GeographicDistrictNCESId"  "ActualGradeListSearchable"
## [15] "Addr1Street"               "Addr1City"                
## [17] "Addr1Zip"                  "Latitude"                 
## [19] "Longitude"
    entity <- entity %>%
      rename(ISDCode = ISD
           ,Street = Addr1Street
           ,City = Addr1City
           ,Zip = Addr1Zip
           ,GradeList = ActualGradeListSearchable)

    names(entity)[1] <- c("AcademicYear")

# review summary of the variables/colummns
    summary(entity)
##     AcademicYear     ISDCode                 ISDName      DistrictCode  
##  2015-2016:3867   Min.   : 3000   Wayne RESA     : 590   Min.   : 1010  
##                   1st Qu.:33000   Oakland Schools: 392   1st Qu.:33200  
##                   Median :52000   Macomb ISD     : 264   Median :52170  
##                   Mean   :51843   Kent ISD       : 253   Mean   :52139  
##                   3rd Qu.:74000   Genesee ISD    : 162   3rd Qu.:74050  
##                   Max.   :84000   Ingham ISD     : 120   Max.   :84090  
##                                   (Other)        :2086                  
##                              DistrictName   BuildingCode 
##  Detroit City School District      : 105   Min.   :   1  
##  Grand Rapids Public Schools       :  52   1st Qu.:1582  
##  Dearborn City School District     :  41   Median :3219  
##  Utica Community Schools           :  40   Mean   :4298  
##  Lansing Public School District    :  36   3rd Qu.:7278  
##  Michigan Department of Corrections:  34   Max.   :9996  
##  (Other)                           :3559                 
##                        BuildingName  DistrictNCESId   
##  Central Elementary School   :   9   Min.   :2600001  
##  Early Childhood Center      :   6   1st Qu.:2608453  
##  North Elementary School     :   6   Median :2618510  
##  Washington Elementary School:   6   Mean   :2621232  
##  Roosevelt Elementary School :   5   3rd Qu.:2629483  
##  Central High School         :   4   Max.   :2680990  
##  (Other)                     :3831   NA's   :1        
##  BuildingNCESId         EntityTypeCategoryName GeographicDistrictCode
##  Min.   :260000107831   ISD  : 237             Min.   : 1010         
##  1st Qu.:260849752605   LEA  :3210             1st Qu.:33130         
##  Median :261851006998   PSA  : 375             Median :51045         
##  Mean   :262122196472   State:  45             Mean   :51701         
##  3rd Qu.:262946758114                          3rd Qu.:74010         
##  Max.   :268099008435                          Max.   :84020         
##  NA's   :3                                                           
##                     GeographicDistrictName GeographicDistrictNCESId
##  Detroit City School District  : 217       Min.   :2600001         
##  Grand Rapids Public Schools   :  71       1st Qu.:2612000         
##  Dearborn City School District :  50       Median :2619620         
##  Lansing Public School District:  43       Mean   :2621336         
##  Ann Arbor Public Schools      :  41       3rd Qu.:2629160         
##  Utica Community Schools       :  41       Max.   :2680990         
##  (Other)                       :3404                               
##               GradeList                     Street               City     
##  9,10,11,12        : 674   420 WEST 9 MILE RD  :  10   DETROIT     : 182  
##  KG,1,2,3,4,5      : 513   204 EAST MUSKEGON ST:   7   GRAND RAPIDS: 119  
##  6,7,8             : 323   22586 ANN ARBOR TRL :   7   FLINT       :  53  
##                    : 307   2413 WEST MAPLE AVE :   7   KALAMAZOO   :  47  
##  KG,1,2,3,4,5,6,7,8: 199   550 5TH ST          :   7   LANSING     :  46  
##  6,7,8,9,10,11,12  : 138   P.O. Box 308        :   7   MUSKEGON    :  44  
##  (Other)           :1713   (Other)             :3822   (Other)     :3376  
##       Zip               Latitude       Longitude     
##  Min.   :    48001   Min.   : 0.00   Min.   :-90.17  
##  1st Qu.:481628879   1st Qu.:42.34   1st Qu.:-85.52  
##  Median :485323562   Median :42.64   Median :-83.97  
##  Mean   :442878357   Mean   :42.90   Mean   :-84.29  
##  3rd Qu.:492031759   3rd Qu.:43.13   3rd Qu.:-83.26  
##  Max.   :499690217   Max.   :47.47   Max.   :  0.00  
## 
entity %>%
  filter(is.na(BuildingNCESId))
##   AcademicYear ISDCode           ISDName DistrictCode
## 1    2015-2016   47000    Livingston ESA        47000
## 2    2015-2016   84000 State of Michigan        84020
## 3    2015-2016   84000 State of Michigan        84090
##                         DistrictName BuildingCode
## 1                     Livingston ESA         2952
## 2 Michigan Department of Corrections         9996
## 3        Michigan Virtual University         2929
##                         BuildingName DistrictNCESId BuildingNCESId
## 1 Livingston County Adult Education         2680700             NA
## 2            CTE Contracted Programs        2600001             NA
## 3            Michigan Virtual School             NA             NA
##   EntityTypeCategoryName GeographicDistrictCode
## 1                    ISD                  47070
## 2                  State                  33020
## 3                  State                  33020
##           GeographicDistrictName GeographicDistrictNCESId
## 1          Howell Public Schools                  2618720
## 2 Lansing Public School District                  2621150
## 3 Lansing Public School District                  2621150
##                       GradeList                 Street    City       Zip
## 1                               1425 W Grand River Ave  Howell     48843
## 2                    9,10,11,12        704 W Kalamazoo Lansing     48933
## 3 KG,1,2,3,4,5,6,7,8,9,10,11,12  3101 Technology Blvd. Lansing 489108546
##   Latitude Longitude
## 1 42.61142 -83.95280
## 2 42.73008 -84.56324
## 3 42.70023 -84.49982

There were three entities that were missing an (Operational)[BuildingNCESId] value however, looking into the specific entities, it is somewhat expected that these would not receive federal coding. In any case, each of these records still has a GeographicDistrictNCESId which will be the primary linking id used for this analysis.

2.4 Load the data from US Census using their API

# get the parts of the api request
  url_base <-"https://api.census.gov/data/"
  dataset <- "2016/acs/acs5/profile"   #2016 5-year american community survey profiles
  query_header <- "?get="
  query_variables <- "DP05_0028E,DP03_0062E,"  #est. num. population, est. median household income
  geographicGroupings <- "NAME&for=school%20district%20(unified):*&in=state:26" #geographic school districts in MI


# construct api request and stored the results
  response <- GET(paste(url_base
                        ,dataset
                        ,query_header
                        ,query_variables
                        ,geographicGroupings
                        ,sep = ""))

# parse the api results
  response_parsed <- content(response, "parse")

# unlist the data into a matrix and transpose that matrix
  response_unlisted <- sapply(response_parsed, unlist)
  response_transpose <- t(response_unlisted)

# load into a data frame
  # rows 2:end contain data
    response_df <- data.frame(response_transpose[2:nrow(response_transpose),], stringsAsFactors = FALSE)
  # row 1 contains column headers
    names(response_df) <- response_transpose[1,]

#review column headers and rename as needed
  names(response_df)
## [1] "DP05_0028E"                "DP03_0062E"               
## [3] "NAME"                      "state"                    
## [5] "school district (unified)"
  names(response_df) <- c('TotalPopulation', 'MedianHouseholdIncome', 'DistrictName', 'StateCode', 'DistrictCode_NCES')

#update data types as needed
  response_df <- response_df %>%
    mutate(DistrictCode_NCES = paste("00000", DistrictCode_NCES, sep = "")) %>%
    mutate(DistrictCode_NCES = str_trunc(DistrictCode_NCES, 5, side="left", ellipsis="")) %>%
    mutate(DistrictCode_NCES = paste(StateCode, DistrictCode_NCES, sep = "")) %>%
    mutate(TotalPopulation =  as.numeric(TotalPopulation)
           ,MedianHouseholdIncome = as.numeric(MedianHouseholdIncome)
           ,DistrictName = as.factor(DistrictName)
           ,DistrictCode_NCES = as.numeric(DistrictCode_NCES)) 

census <- response_df


summary(census)
##  TotalPopulation  MedianHouseholdIncome
##  Min.   :   394   Min.   : 17455       
##  1st Qu.:  5535   1st Qu.: 41371       
##  Median :  9987   Median : 48398       
##  Mean   : 19131   Mean   : 51199       
##  3rd Qu.: 20073   3rd Qu.: 57983       
##  Max.   :683524   Max.   :124137       
##                                        
##                                       DistrictName  StateCode        
##  Adams Township School District, Michigan   :  1   Length:518        
##  Addison Community Schools, Michigan        :  1   Class :character  
##  Adrian City School District, Michigan      :  1   Mode  :character  
##  Airport Community School District, Michigan:  1                     
##  Akron-Fairgrove Schools, Michigan          :  1                     
##  Alanson Public Schools, Michigan           :  1                     
##  (Other)                                    :512                     
##  DistrictCode_NCES
##  Min.   :2600005  
##  1st Qu.:2610675  
##  Median :2619905  
##  Mean   :2619464  
##  3rd Qu.:2627833  
##  Max.   :2699999  
## 
head(census)
##   TotalPopulation MedianHouseholdIncome
## 1           42127                 32269
## 2            3051                 35425
## 3            4266                 41375
## 4            3897                 39618
## 5           32225                 45378
## 6            2267                 35530
##                                      DistrictName StateCode
## 1           Battle Creek Public Schools, Michigan        26
## 2         Bessemer Area School District, Michigan        26
## 3        Harbor Beach Community Schools, Michigan        26
## 4               St. Ignace Area Schools, Michigan        26
## 5         Marquette Area Public Schools, Michigan        26
## 6 Ewen-Trout Creek Consolidated Schools, Michigan        26
##   DistrictCode_NCES
## 1           2600005
## 2           2600006
## 3           2600007
## 4           2600012
## 5           2600013
## 6           2600014

2.5 Load US poverty line

library(rJava)
library(XLConnectJars)
library(XLConnect)
## XLConnect 0.2-15 by Mirai Solutions GmbH [aut],
##   Martin Studer [cre],
##   The Apache Software Foundation [ctb, cph] (Apache POI),
##   Graph Builder [ctb, cph] (Curvesapi Java library)
## http://www.mirai-solutions.com
## https://github.com/miraisolutions/xlconnect
# method to load .xls file; abondoned as it required a local file 
# (i.e., will not load a file refereced by URL)
    # location1 <- 'https://www2.census.gov/programs-surveys/cps/tables/time-series/historical-poverty-thresholds/thresh16.xls'
    # workbook <- loadWorkbook(location1)
    # povertyThresholds <- readWorksheet(workbook, sheet = "Sheet1")
    # povertyThresholds[18,]
    # povertyThreshold <- str_replace_all(povertyThresholds[18,2], ",", "")
    # povertyThresholds <- as.numeric(povertyThresholds)

# alternative method, save the file as .csv
fileLocation4 <- 'https://raw.githubusercontent.com/ChadRyanBailey/607-Data-Acquisition-and-Management/master/607-Weeks13-15-Final-Project/07%202016%20US%20Poverty%20Thresholds.csv'
povertyThresholds <- read.csv(fileLocation4, sep = ",")
povertyThresholds[18,]
##    Table.with.row.headings.in.column.A.and.column.headings.in.rows.4.to.8.
## 18                           Four people........................……………………….
##         X    X.1    X.2    X.3    X.4 X.5 X.6 X.7 X.8 X.9
## 18 24,563 24,755 25,160 24,339 24,424
povertyThreshold <- str_replace_all(povertyThresholds[18,2], ",", "")
povertyThreshold <- as.numeric(povertyThreshold)
povertyThreshold
## [1] 24563

3. Join the datasets together

3.1 Join Proficiency and Entity Demographics

proficiencyCleaned <- proficiencyCleaned %>%
  select(AcademicYear
         , DistrictCode
         , BuildingCode
         , Subject
         , nTested
         , nProf
         , PctProf
         , nNonProf
         , PctNonProf)

entityDemog <- entityDemog %>%
  select(AcademicYear
         , DistrictCode
         , BuildingCode
         , nEnrollmentTotal
         , nEnrollmentED
         , PctEnrollmentED)

prof_demog <- left_join(proficiencyCleaned, entityDemog, by = c('AcademicYear', 'DistrictCode', 'BuildingCode'))
                        
RecordsWithoutDemog <- prof_demog %>% 
  summarise(sum(ifelse(is.na(nEnrollmentTotal), 1, 0)))

prof_demog <- prof_demog %>% filter(!is.na(nEnrollmentTotal))

#sample records inappropriately missing demographics, to be followed up with Michigan after this project
#head(prof_demog %>% filter(is.na(nEnrollmentTotal)))

Michigan’s entity demographic file appears to have filtered out schools that were open during the selected snapshot but are now closed when the datafile is downloaded. This results in 372 records not having demographics and thus needing to be dropped.

3.2 Add Join to Entity Data

This join gets the ID conversions between state codes and federal codes for schools.

entity <- entity %>%
  select(AcademicYear
         ,ISDCode
         ,ISDName
         ,DistrictCode
         ,DistrictNCESId
         ,DistrictName
         ,BuildingCode
         ,BuildingName
         ,GeographicDistrictCode
         ,GeographicDistrictNCESId
         ,GeographicDistrictName
         ,Latitude
         ,Longitude
         ,EntityType = EntityTypeCategoryName)

prof_demog_ent <- left_join(prof_demog, entity, by = c('AcademicYear', 'DistrictCode', 'BuildingCode'))

RecordsWithoutEntityCodes <- prof_demog_ent %>% 
  summarise(sum(ifelse(is.na(GeographicDistrictCode), 1, 0)))

RecordsWithoutEntityCodes
##   sum(ifelse(is.na(GeographicDistrictCode), 1, 0))
## 1                                                0

No records were missing enity code conversions.

3.3 Add Join to US Census API data

census <- census %>%
  select(GeographicDistrictNCESId = DistrictCode_NCES
         ,DistrictName_NCES = DistrictName
         ,DistrictTotalPopulation = TotalPopulation
         ,MedianHouseholdIncome)

prof_demog_ent_census <- left_join(prof_demog_ent, census, by = c('GeographicDistrictNCESId'))

RecordsWithoutCensus <- prof_demog_ent_census %>% 
  summarise(sum(ifelse(is.na(DistrictTotalPopulation), 1, 0)))

DistrictsWithoutCensus <- prof_demog_ent_census %>% 
  filter(is.na(DistrictTotalPopulation)) %>%
  distinct(GeographicDistrictNCESId) %>%
  summarise(n())

RecordsWithoutCensus
##   sum(ifelse(is.na(DistrictTotalPopulation), 1, 0))
## 1                                               221
DistrictsWithoutCensus
##   n()
## 1  73
#head(prof_demog_ent_census %>% filter(is.na(DistrictTotalPopulation)))

prof_demog_ent_census <- prof_demog_ent_census %>%filter(!is.na(DistrictTotalPopulation))

The Census data is missing data for some geographic districts for reason which could not be determined at this point. This means that 221 records from 73 geographic districts will be dropped from this analysis.

3.4 Add in Poverty Line

class(povertyThreshold)
## [1] "numeric"
final_data <- prof_demog_ent_census %>%
  mutate(PovertyThreshold = povertyThreshold) %>%
  mutate(MedianHouseholdIncomePctOfPoverty = round(MedianHouseholdIncome / PovertyThreshold * 100, 2))

3.5 Review the final dataset

3.5a Look at summary of all records

This section looks at the data set as a whole looking to ensure that the dataset as a whole has the expected:

  • number of records
  • variables (columns)
  • categorical values
  • ranges/spread for numerical values
# for simplicity also limit to just one subject
final_data <- final_data %>% filter(Subject == "English Language Arts")

summary(final_data)
##     AcademicYear   DistrictCode    BuildingCode 
##  2015-2016:2840   Min.   : 1010   Min.   :   1  
##                   1st Qu.:33215   1st Qu.:1556  
##                   Median :50240   Median :3148  
##                   Mean   :51457   Mean   :4072  
##                   3rd Qu.:73030   3rd Qu.:6361  
##                   Max.   :84050   Max.   :9994  
##                                                 
##                   Subject        nTested           nProf    
##  English Language Arts:2840   Min.   :  10.0   Min.   :  1  
##  Mathematics          :   0   1st Qu.: 121.0   1st Qu.: 48  
##                               Median : 202.0   Median : 98  
##                               Mean   : 256.7   Mean   :131  
##                               3rd Qu.: 329.2   3rd Qu.:170  
##                               Max.   :1864.0   Max.   :868  
##                                                             
##     PctProf         nNonProf         PctNonProf    nEnrollmentTotal
##  Min.   : 1.75   Min.   :   1.00   Min.   : 0.84   Min.   :   9.0  
##  1st Qu.:34.73   1st Qu.:  51.75   1st Qu.:38.07   1st Qu.: 281.0  
##  Median :49.29   Median :  93.00   Median :50.71   Median : 406.0  
##  Mean   :47.94   Mean   : 125.70   Mean   :52.06   Mean   : 475.6  
##  3rd Qu.:61.93   3rd Qu.: 165.00   3rd Qu.:65.27   3rd Qu.: 575.0  
##  Max.   :99.16   Max.   :1199.00   Max.   :98.25   Max.   :2917.0  
##                                                                    
##  nEnrollmentED    PctEnrollmentED     ISDCode                 ISDName    
##  Min.   :   2.0   Min.   :  0.88   Min.   : 3000   Wayne RESA     : 403  
##  1st Qu.: 100.0   1st Qu.: 30.05   1st Qu.:33000   Oakland Schools: 306  
##  Median : 174.0   Median : 49.08   Median :50000   Macomb ISD     : 215  
##  Mean   : 211.9   Mean   : 49.33   Mean   :51144   Kent ISD       : 199  
##  3rd Qu.: 269.0   3rd Qu.: 67.35   3rd Qu.:73000   Genesee ISD    : 124  
##  Max.   :1844.0   Max.   :100.00   Max.   :84000   Ottawa Area ISD:  91  
##                                                    (Other)        :1502  
##  DistrictNCESId                            DistrictName 
##  Min.   :2600005   Grand Rapids Public Schools   :  43  
##  1st Qu.:2607620   Utica Community Schools       :  37  
##  Median :2618465   Dearborn City School District :  35  
##  Mean   :2617867   Ann Arbor Public Schools      :  30  
##  3rd Qu.:2628178   Lansing Public School District:  28  
##  Max.   :2680990   Kalamazoo Public Schools      :  24  
##                    (Other)                       :2643  
##                        BuildingName  GeographicDistrictCode
##  Central Elementary School   :   8   Min.   : 1010         
##  North Elementary School     :   6   1st Qu.:33170         
##  Washington Elementary School:   6   Median :50230         
##  Roosevelt Elementary School :   5   Mean   :51332         
##  Lakewood Elementary School  :   4   3rd Qu.:73030         
##  Lincoln Elementary School   :   4   Max.   :83070         
##  (Other)                     :2807                         
##  GeographicDistrictNCESId                    GeographicDistrictName
##  Min.   :2600005          Detroit City School District  :  81      
##  1st Qu.:2612000          Grand Rapids Public Schools   :  55      
##  Median :2619950          Dearborn City School District :  44      
##  Mean   :2619866          Utica Community Schools       :  38      
##  3rd Qu.:2628830          Ann Arbor Public Schools      :  36      
##  Max.   :2636660          Lansing Public School District:  33      
##                           (Other)                       :2553      
##     Latitude       Longitude      EntityType  
##  Min.   :41.72   Min.   :-90.16   ISD  :   1  
##  1st Qu.:42.35   1st Qu.:-85.53   LEA  :2516  
##  Median :42.66   Median :-83.97   PSA  : 322  
##  Mean   :42.95   Mean   :-84.37   State:   1  
##  3rd Qu.:43.11   3rd Qu.:-83.27               
##  Max.   :47.24   Max.   :-82.43               
##                                               
##                                 DistrictName_NCES DistrictTotalPopulation
##  Detroit City School District, Michigan  :  81    Min.   :   394         
##  Grand Rapids Public Schools, Michigan   :  55    1st Qu.: 11002         
##  Dearborn City School District, Michigan :  44    Median : 23666         
##  Utica Community Schools, Michigan       :  38    Mean   : 59374         
##  Ann Arbor Public Schools, Michigan      :  36    3rd Qu.: 59071         
##  Lansing Public School District, Michigan:  33    Max.   :683524         
##  (Other)                                 :2553                           
##  MedianHouseholdIncome PovertyThreshold MedianHouseholdIncomePctOfPoverty
##  Min.   : 17455        Min.   :24563    Min.   : 71.06                   
##  1st Qu.: 41634        1st Qu.:24563    1st Qu.:169.50                   
##  Median : 51234        Median :24563    Median :208.58                   
##  Mean   : 53721        Mean   :24563    Mean   :218.71                   
##  3rd Qu.: 62272        3rd Qu.:24563    3rd Qu.:253.52                   
##  Max.   :124137        Max.   :24563    Max.   :505.38                   
## 

The dataset as a whole seems to have the expected number of records, the correct categorical values, and the correct ranges/spread for numerical values.

Zoom in on a single known district

This section zooms in on a single geographical district with known values. This will allow the opportunity to check for possible corruption in the data. This exporation will start visually and only go to table/statistical review if issues are found.

reviewcase <- final_data %>% filter(GeographicDistrictName == 'Grand Rapids Public Schools')

ggplot(reviewcase
       , aes(x = PctEnrollmentED, y = PctProf)) + 
  geom_point(aes(size = nEnrollmentTotal, color = EntityType), alpha = 0.5) + 
  geom_smooth(method=lm)

This scatter plot validates the expectation that Grand Rapids Public Schools data would have schools with varying rates of Economically Disadvantaged (ED) enrollment and that there should be moderately strong negative correlation between PctED and PctProf (percent proficient).

ggplot(reviewcase
       , aes(x = MedianHouseholdIncomePctOfPoverty, y = PctProf)) + 
  geom_point(aes(size = nEnrollmentTotal, color = EntityType), alpha = 0.5) + 
  geom_smooth(method=lm)

This scatter plot validates the expectation that for a geographic district there should be only one MedianHouseholdIncomePctPoverty and that Grand Rapids Public Schools’ value for that measure would be near 170.

4. Analyze the data

Now that the dataset has been constructed and validated, the analysis of the desired variables can begin.

4.1 Create regression models

4.1a Create regression model for {PctEnrollmentED and PctProf}

# model for enrollmentED
    enrollmentEDLM <- lm(PctProf ~ PctEnrollmentED, data = final_data)
    summary(enrollmentEDLM)
## 
## Call:
## lm(formula = PctProf ~ PctEnrollmentED, data = final_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.773  -6.964   0.550   7.455  58.071 
## 
## Coefficients:
##                  Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)     78.522903   0.522812  150.19 <0.0000000000000002 ***
## PctEnrollmentED -0.619878   0.009503  -65.23 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.33 on 2838 degrees of freedom
## Multiple R-squared:  0.5999, Adjusted R-squared:  0.5997 
## F-statistic:  4255 on 1 and 2838 DF,  p-value: < 0.00000000000000022

4.1b Create regression model for {MedianHouseholdIncomePctOfPoverty and Pct Prof}

# model for communityPoverty
    CommunityPovertyLM <- lm(PctProf ~ MedianHouseholdIncomePctOfPoverty, data = final_data)
    summary(CommunityPovertyLM)
## 
## Call:
## lm(formula = PctProf ~ MedianHouseholdIncomePctOfPoverty, data = final_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.373 -10.062   0.934  10.570  60.932 
## 
## Coefficients:
##                                    Estimate Std. Error t value
## (Intercept)                       13.516159   0.958713   14.10
## MedianHouseholdIncomePctOfPoverty  0.157418   0.004166   37.79
##                                              Pr(>|t|)    
## (Intercept)                       <0.0000000000000002 ***
## MedianHouseholdIncomePctOfPoverty <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.9 on 2838 degrees of freedom
## Multiple R-squared:  0.3347, Adjusted R-squared:  0.3345 
## F-statistic:  1428 on 1 and 2838 DF,  p-value: < 0.00000000000000022

The regression models for both EnrollmentED and CommunityPoverty had lower than expected Adjusted R-squared values. Both were less than 0.6 (with Enrollment being 0.5997 and Community Poverty 0.3345).

4.2 Visual analyze the data

4.2a Plot of PctEnrollmentEd vs PctProf

ggplot(final_data
       , aes(x = PctEnrollmentED, y = PctProf)) + 
  geom_point(aes(size = nEnrollmentTotal, color = EntityType), alpha = 0.5) + 
  geom_smooth(method=lm) 

ggplot(final_data
       , aes(x = PctEnrollmentED, y = PctProf)) + 
  geom_point(aes(size = nEnrollmentTotal, color = EntityType), alpha = 0.5) + 
  facet_wrap(~EntityType)+ 
  geom_smooth(method=lm)

4.2b Plot of PctEnrollmentEd vs PctProf

ggplot(final_data
       , aes(x = MedianHouseholdIncomePctOfPoverty, y = PctProf)) + 
  geom_point(aes(size = nEnrollmentTotal, color = EntityType), alpha = 0.5) + 
  geom_smooth(method=lm)

ggplot(final_data
       , aes(x = MedianHouseholdIncomePctOfPoverty, y = PctProf)) + 
  geom_point(aes(size = nEnrollmentTotal, color = EntityType), alpha = 0.5) + 
  facet_wrap(~EntityType) +
  geom_smooth(method=lm)

In reviewing the above tables and graphs it is clear that the direct linkage of students in the school is a much more powerfully correlated variable. However, there is still a meanginful correlation with community poverty level.

Conclusions

This project was both fun and challenging.

Unfortunately, Michigan’s data around school proficiency and and entity data were significantly more dirty than expected. This resulted in more time than expected being spent in the data acquisition, exploration, and tidying phase. Which meant there was less time than expectd for analysis.

However, as noted above, the project does show there is a correlation with both measures of poverty. Although unsurprisingly, the measure directly linked to the school (the school’s percent of enrolled students identified as “Economically Disadvantaged”) has a stronger correlation.

Future analysis could be done to further explore this data:

  1. also pull in census data on the rate of community poverty. This would allow for a more direct measure of the correlative differences between direct school measure and community measures.
  2. create a community poverty measure that has a negative correlation with percent proficent
  3. directly look for schools with with very conflicting residuals between the two models
  4. look for differences by enrollment size, entity type
  5. Create a composite geographic district based on attending student residencies (as opposed to the current project’s use of school location). This would require additional data.