This is the final project for “607 Data Acquisition and Management”.
I propose to examine the relationship and tensions of correlations of proficiency on state assessments to both the percent of students in poverty as well as the average magnitude of poverty in the surrounding geographic area. The specific measures and geographical unit (county, city, US census tract, etc.) will be dependent on the availability of data and how well that can be merged with educational data.
The motivation for this analysis is primarily to explore a policy question/problem within the field of education. There is a strong, and well documented, correlation between the rate of students in poverty and lower academic outcomes. However, while there has been significant discussion that schools exist within a context of the experiences of their students and the community in which they are located, little study on the impact of community on schools seems to have been done. This would be a start to rectifying that situation.
Also data and documentation for this project are available in the GitHub repository: https://github.com/ChadRyanBailey/607-Data-Acquisition-and-Management/tree/master/607-Weeks13-15-Final-Project
Notes and documentation for the project were maintained in a separate Microsoft Word (.docx) titled “01 Final Project Notes.docx” and is saved in the repo above.
The details of which data sources chosen as well as acquisition issues are detailed in the “01 Final Project Notes.docx” document stored in the project’s GitHub Repo: https://github.com/ChadRyanBailey/607-Data-Acquisition-and-Management/tree/master/607-Weeks13-15-Final-Project
In summary, sources for school proficiency on state assessments had to be altered for the sake of data quality. However, the granularity of Census data was better than expected, and included data by geographic districts.
# load necessary supporting packages
library(dplyr)
library(stringr)
library(httr)
library(ggplot2)
# load the school proficiency data
fileLocation1 <- 'https://raw.githubusercontent.com/ChadRyanBailey/607-Data-Acquisition-and-Management/master/607-Weeks13-15-Final-Project/03%202015-16%20Proficiency%20All%20Tests%2C%20Grades%2C%20Students.csv'
# option needed to prevent long sequence id numbers from being converted to scientific notation
options(scipen = 999)
proficiency <-read.csv(fileLocation1, sep = ',')
# review column names and rename as needed
names(proficiency)
## [1] "ï..AcademicYearName" "ISDCode"
## [3] "ISDName" "DistrictCode"
## [5] "DistrictName" "BuildingCode"
## [7] "BuildingName" "Assessment"
## [9] "Grade" "StudentGroup"
## [11] "Subject" "TestedIncluded"
## [13] "TestedProficientMet" "PercentTestedProficientMet"
## [15] "TestedNotProficientMet" "PercentTestedNotProficientMet"
proficiency <- proficiency %>%
rename(nTested = TestedIncluded
,nProf = TestedProficientMet
,nNonProf = TestedNotProficientMet
,PctProf = PercentTestedProficientMet
,PctNonProf = PercentTestedNotProficientMet)
names(proficiency)[1] <- c("AcademicYear")
# review summary of the variables/colummns
summary(proficiency)
## AcademicYear ISDCode ISDName DistrictCode
## 2015-2016:6522 Min. : 3000 Wayne RESA :1042 Min. : 1010
## 1st Qu.:33000 Oakland Schools: 666 1st Qu.:33170
## Median :51000 Macomb ISD : 454 Median :52040
## Mean :51642 Kent ISD : 424 Mean :51979
## 3rd Qu.:74000 Genesee ISD : 262 3rd Qu.:74040
## Max. :84000 Ingham ISD : 204 Max. :84060
## (Other) :3470
## DistrictName BuildingCode
## Detroit City School District : 180 Min. : 1
## Grand Rapids Public Schools : 92 1st Qu.:1538
## Utica Community Schools : 74 Median :3155
## Dearborn City School District : 70 Mean :4136
## Ann Arbor Public Schools : 60 3rd Qu.:6592
## Lansing Public School District: 58 Max. :9994
## (Other) :5988
## BuildingName Assessment
## Central Elementary School : 16 All Assessments:6522
## North Elementary School : 12
## Washington Elementary School: 12
## Roosevelt Elementary School : 10
## Central High School : 8
## Lakewood Elementary School : 8
## (Other) :6456
## Grade StudentGroup Subject
## All Grades:6522 All Students:6522 English Language Arts:3261
## Mathematics :3261
##
##
##
##
##
## nTested nProf PctProf nNonProf
## <10 : 250 < 3 : 344 * : 620 * : 594
## 20 : 37 * : 276 50.00 : 39 64 : 42
## 14 : 36 3 : 73 33.33 : 29 50 : 40
## 11 : 35 4 : 65 25.00 : 21 52 : 40
## 180 : 33 10 : 54 40.00 : 19 75 : 40
## 216 : 33 7 : 54 41.67 : 14 57 : 39
## (Other):6098 (Other):5656 (Other):5780 (Other):5727
## PctNonProf
## * : 620
## 50.00 : 39
## 66.67 : 29
## 75.00 : 21
## 60.00 : 19
## 58.33 : 14
## (Other):5780
As can be seen in the columns {nTested, nProf, PctProf, nNonProf, and PctNonProf}, the file has records that have been suppressed. This is typical for public education data. The suppression is done to protect the privacy of small groups of students.
The following section will review the behavior of rows having suppression. Where possible values will be imputed. If no value can be imputed, the record will be dropped.
# add flags to review each suppression condition
proficiency <- proficiency %>%
mutate( HasTestedLT10 = ifelse(nTested == '<10', 1, 0)
,HasProfLT3 = ifelse(nProf == '< 3', 1, 0)
,HasNonProfLT3 = ifelse(nNonProf == '< 3', 1, 0)
,HasEitherProfOrNonProfLT3 = ifelse(nProf == '< 3' | nNonProf == '< 3', 1, 0)
,HasRecord = 1)
# get the count of records by suppression conditions
proficiency %>%
summarise(nTotal = sum(HasRecord)
,nTestedLT10 = sum(HasTestedLT10)
,nProfLT3 = sum(HasProfLT3)
,nNonProfLT3 = sum(HasNonProfLT3)
,nEitherProfOrNonProfLT3 = sum(HasEitherProfOrNonProfLT3))
## nTotal nTestedLT10 nProfLT3 nNonProfLT3 nEitherProfOrNonProfLT3
## 1 6522 250 344 26 370
# review records for school's flagged as having <10 tested students
proficiencySmallSchools <- proficiency %>%
filter(nTested == '<10') %>%
select(BuildingCode
, Subject
, nTested
, nProf
, PctProf
, nNonProf
, PctNonProf)
head(proficiencySmallSchools)
## BuildingCode Subject nTested nProf PctProf nNonProf
## 1 8425 English Language Arts <10 * * *
## 2 5790 English Language Arts <10 * * *
## 3 9085 English Language Arts <10 * * *
## 4 9987 English Language Arts <10 * * *
## 5 6950 English Language Arts <10 * * *
## 6 7276 English Language Arts <10 * * *
## PctNonProf
## 1 *
## 2 *
## 3 *
## 4 *
## 5 *
## 6 *
# drop records for schools with <10 students tested as they are otherwise empty
proficiencyCleaned <-
anti_join(proficiency
, proficiencySmallSchools
, by = c('BuildingCode', 'Subject'))
As shown in the head of records where the number tested is less than 10, there was no way to impute a value with the given information. Thus, these records were dropped using an anti_join.
Since <3 is equal to the set {0, 1, 2}, the middle value “1” will be used as the imputed value. Also, percentages will be calculated for suppressed records using the imputed value.
proficiencyCleaned <- proficiencyCleaned %>%
#convet factors to characters
mutate(nTested = as.character(nTested)
,nProf = as.character(nProf)
,nNonProf = as.character(nNonProf)
,PctProf = as.character(PctProf)
,PctNonProf = as.character(PctNonProf)
) %>%
#convert the characters to numerics
mutate(nTested = as.numeric(nTested)
,nProf = as.numeric(nProf)
,nNonProf = as.numeric(nNonProf)
,PctProf = as.numeric(PctProf)
,PctNonProf = as.numeric(PctNonProf)
) %>%
# for count variables (nProf and nNonProf) replace the suppression flag with imputed count
mutate(nProf = ifelse(HasProfLT3 == 1, 1, nProf)
,nProf = ifelse(HasNonProfLT3 == 1, nTested - 1, nProf)
,nNonProf = ifelse(HasNonProfLT3 == 1, 1, nNonProf)
,nNonProf = ifelse(HasProfLT3 == 1, nTested - 1, nNonProf)
) %>%
# for percentage variables (PctProf and PctNonProf) replace the suppression flag
# with a calucuated percentage using the imputed counts
mutate(PctProf = ifelse(HasProfLT3 == 1, round(nProf*1.0/nTested*100.0, 2), PctProf)
, PctProf = ifelse(HasNonProfLT3 == 1, round(nProf*1.0/nTested*100.0, 2), PctProf)
, PctNonProf = ifelse(HasNonProfLT3 == 1, round(nNonProf*1.0/nTested*100.0, 2), PctNonProf)
, PctNonProf = ifelse(HasProfLT3 == 1, round(nNonProf*1.0/nTested*100.0, 2), PctNonProf))
summary(proficiencyCleaned)
## AcademicYear ISDCode ISDName DistrictCode
## 2015-2016:6272 Min. : 3000 Wayne RESA :1027 Min. : 1010
## 1st Qu.:33000 Oakland Schools: 649 1st Qu.:33230
## Median :52000 Macomb ISD : 448 Median :52170
## Mean :51957 Kent ISD : 406 Mean :52243
## 3rd Qu.:74000 Genesee ISD : 256 3rd Qu.:74050
## Max. :84000 Ingham ISD : 190 Max. :84060
## (Other) :3296
## DistrictName BuildingCode
## Detroit City School District : 180 Min. : 1
## Grand Rapids Public Schools : 86 1st Qu.:1512
## Utica Community Schools : 74 Median :3150
## Dearborn City School District : 70 Mean :4094
## Ann Arbor Public Schools : 60 3rd Qu.:6412
## Lansing Public School District: 56 Max. :9994
## (Other) :5746
## BuildingName Assessment
## Central Elementary School : 16 All Assessments:6272
## North Elementary School : 12
## Washington Elementary School: 12
## Roosevelt Elementary School : 10
## Central High School : 8
## Lakewood Elementary School : 8
## (Other) :6206
## Grade StudentGroup Subject
## All Grades:6272 All Students:6272 English Language Arts:3137
## Mathematics :3135
##
##
##
##
##
## nTested nProf PctProf nNonProf
## Min. : 10 Min. : 1.0 Min. : 0.38 Min. : 1.00
## 1st Qu.: 108 1st Qu.: 30.0 1st Qu.:24.55 1st Qu.: 53.75
## Median : 195 Median : 76.0 Median :41.67 Median : 103.00
## Mean : 247 Mean :108.1 Mean :40.88 Mean : 138.97
## 3rd Qu.: 323 3rd Qu.:145.0 3rd Qu.:56.67 3rd Qu.: 182.00
## Max. :1865 Max. :868.0 Max. :99.71 Max. :1575.00
##
## PctNonProf HasTestedLT10 HasProfLT3 HasNonProfLT3
## Min. : 0.29 Min. :0 Min. :0.00000 Min. :0.000000
## 1st Qu.:43.33 1st Qu.:0 1st Qu.:0.00000 1st Qu.:0.000000
## Median :58.33 Median :0 Median :0.00000 Median :0.000000
## Mean :59.12 Mean :0 Mean :0.05485 Mean :0.004145
## 3rd Qu.:75.45 3rd Qu.:0 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :99.62 Max. :0 Max. :1.00000 Max. :1.000000
##
## HasEitherProfOrNonProfLT3 HasRecord
## Min. :0.00000 Min. :1
## 1st Qu.:0.00000 1st Qu.:1
## Median :0.00000 Median :1
## Mean :0.05899 Mean :1
## 3rd Qu.:0.00000 3rd Qu.:1
## Max. :1.00000 Max. :1
##
All variables are within the expected ranges are now of the expected types. No more transformations are needed on this data set for now.
# load the school proficiency data
fileLocation2 <- 'https://raw.githubusercontent.com/ChadRyanBailey/607-Data-Acquisition-and-Management/master/607-Weeks13-15-Final-Project/04%202015-16%20Entity%20Demographics.csv'
entityDemog <-read.csv(fileLocation2, sep = ',')
# review column names and rename as needed
names(entityDemog)
## [1] "SchoolYear"
## [2] "ISDCode"
## [3] "ISDName"
## [4] "DistrictCode"
## [5] "DistrictName"
## [6] "BuildingCode"
## [7] "BuildingName"
## [8] "EntityType"
## [9] "LOCALE_NAME"
## [10] "TOTAL_ENROLLMENT"
## [11] "MALE_ENROLLMENT"
## [12] "FEMALE_ENROLLMENT"
## [13] "AMERICAN_INDIAN_ENROLLMENT"
## [14] "ASIAN_ENROLLMENT"
## [15] "AFRICAN_AMERICAN_ENROLLMENT"
## [16] "HISPANIC_ENROLLMENT"
## [17] "HAWAIIAN_ENROLLMENT"
## [18] "WHITE_ENROLLMENT"
## [19] "TWO_OR_MORE_RACES_ENROLLMENT"
## [20] "KINDERGARTEN_ENROLLMENT"
## [21] "GRADE_1_ENROLLMENT"
## [22] "GRADE_2_ENROLLMENT"
## [23] "GRADE_3_ENROLLMENT"
## [24] "GRADE_4_ENROLLMENT"
## [25] "GRADE_5_ENROLLMENT"
## [26] "GRADE_6_ENROLLMENT"
## [27] "GRADE_7_ENROLLMENT"
## [28] "GRADE_8_ENROLLMENT"
## [29] "GRADE_9_ENROLLMENT"
## [30] "GRADE_10_ENROLLMENT"
## [31] "GRADE_11_ENROLLMENT"
## [32] "GRADE_12_ENROLLMENT"
## [33] "ECONOMIC_DISADVANTAGED_ENROLLMENT"
## [34] "SPECIAL_EDUCATION_ENROLLMENT"
## [35] "ENGLISH_LANGUAGE_LEARNERS_ENROLLMENT"
entityDemog <- entityDemog %>%
rename(AcademicYear = SchoolYear
,nEnrollmentTotal = TOTAL_ENROLLMENT
,nEnrollmentED = ECONOMIC_DISADVANTAGED_ENROLLMENT) %>%
select(AcademicYear
, ISDCode
, DistrictCode
, BuildingCode
, nEnrollmentTotal
, nEnrollmentED)
# review the summary values for the selected columns
summary(entityDemog)
## AcademicYear ISDCode DistrictCode
## 15 - 16 School Year:5952 Min. : 3.00 Min. : 1010
## 1st Qu.:32.00 1st Qu.:33070
## Median :52.00 Median :52180
## Mean :51.61 Mean :52167
## 3rd Qu.:75.00 3rd Qu.:75038
## Max. :84.00 Max. :84090
## NA's :1 NA's :58
## BuildingCode nEnrollmentTotal nEnrollmentED
## Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.: 580 1st Qu.: 0 1st Qu.: 0.0
## Median : 2614 Median : 293 Median : 116.0
## Mean : 3646 Mean : 1013 Mean : 467.4
## 3rd Qu.: 6396 3rd Qu.: 534 3rd Qu.: 263.0
## Max. :84070 Max. :1540005 Max. :713295.0
## NA's :967
# transform values as needed
entityDemog <- entityDemog %>%
mutate(AcademicYear = str_replace_all(AcademicYear, " School Year", "")) %>%
mutate(AcademicYear = str_replace_all(AcademicYear, " ", "")) %>%
mutate(AcademicYear = paste("20", AcademicYear, sep = "")) %>%
mutate(AcademicYear = str_replace_all(AcademicYear, "-", "-20")) %>%
mutate(AcademicYear = as.factor(AcademicYear)) %>%
filter(nEnrollmentTotal > 0
,!is.na(BuildingCode)) %>%
mutate(PctEnrollmentED = round(nEnrollmentED / nEnrollmentTotal *100, 2))
summary(entityDemog)
## AcademicYear ISDCode DistrictCode BuildingCode
## 2015-2016:3381 Min. : 3.00 Min. : 1010 Min. : 1
## 1st Qu.:33.00 1st Qu.:33170 1st Qu.:1535
## Median :51.00 Median :52015 Median :3167
## Mean :51.61 Mean :51934 Mean :4152
## 3rd Qu.:74.00 3rd Qu.:74030 3rd Qu.:6663
## Max. :84.00 Max. :84050 Max. :9994
## nEnrollmentTotal nEnrollmentED PctEnrollmentED
## Min. : 1.0 Min. : 0.0 Min. : 0.00
## 1st Qu.: 234.0 1st Qu.: 82.0 1st Qu.: 31.27
## Median : 378.0 Median : 162.0 Median : 50.70
## Mean : 433.5 Mean : 198.8 Mean : 50.34
## 3rd Qu.: 538.0 3rd Qu.: 262.0 3rd Qu.: 68.72
## Max. :2917.0 Max. :1844.0 Max. :100.00
This data set is needed to bridge the previous two data sets. The US Census data uses district codes set by the National Center for Educational Statistics (NCES) whereas the the Michigan school proficiency data uses district codes set by the state of Michigan. Thankfully both codes are available in Michigan’s Educational Enity Master.
# load the school proficiency data
fileLocation3 <- 'https://raw.githubusercontent.com/ChadRyanBailey/607-Data-Acquisition-and-Management/master/607-Weeks13-15-Final-Project/05%202015-16%20Entity%20Data.txt'
entity <-read.csv(fileLocation3, sep = '\t')
# review column names and rename as needed
names(entity)
## [1] "AcademicYearName" "ISD"
## [3] "ISDName" "DistrictCode"
## [5] "DistrictName" "BuildingCode"
## [7] "BuildingName" "DistrictNCESId"
## [9] "BuildingNCESId" "EntityTypeCategoryName"
## [11] "GeographicDistrictCode" "GeographicDistrictName"
## [13] "GeographicDistrictNCESId" "ActualGradeListSearchable"
## [15] "Addr1Street" "Addr1City"
## [17] "Addr1Zip" "Latitude"
## [19] "Longitude"
entity <- entity %>%
rename(ISDCode = ISD
,Street = Addr1Street
,City = Addr1City
,Zip = Addr1Zip
,GradeList = ActualGradeListSearchable)
names(entity)[1] <- c("AcademicYear")
# review summary of the variables/colummns
summary(entity)
## AcademicYear ISDCode ISDName DistrictCode
## 2015-2016:3867 Min. : 3000 Wayne RESA : 590 Min. : 1010
## 1st Qu.:33000 Oakland Schools: 392 1st Qu.:33200
## Median :52000 Macomb ISD : 264 Median :52170
## Mean :51843 Kent ISD : 253 Mean :52139
## 3rd Qu.:74000 Genesee ISD : 162 3rd Qu.:74050
## Max. :84000 Ingham ISD : 120 Max. :84090
## (Other) :2086
## DistrictName BuildingCode
## Detroit City School District : 105 Min. : 1
## Grand Rapids Public Schools : 52 1st Qu.:1582
## Dearborn City School District : 41 Median :3219
## Utica Community Schools : 40 Mean :4298
## Lansing Public School District : 36 3rd Qu.:7278
## Michigan Department of Corrections: 34 Max. :9996
## (Other) :3559
## BuildingName DistrictNCESId
## Central Elementary School : 9 Min. :2600001
## Early Childhood Center : 6 1st Qu.:2608453
## North Elementary School : 6 Median :2618510
## Washington Elementary School: 6 Mean :2621232
## Roosevelt Elementary School : 5 3rd Qu.:2629483
## Central High School : 4 Max. :2680990
## (Other) :3831 NA's :1
## BuildingNCESId EntityTypeCategoryName GeographicDistrictCode
## Min. :260000107831 ISD : 237 Min. : 1010
## 1st Qu.:260849752605 LEA :3210 1st Qu.:33130
## Median :261851006998 PSA : 375 Median :51045
## Mean :262122196472 State: 45 Mean :51701
## 3rd Qu.:262946758114 3rd Qu.:74010
## Max. :268099008435 Max. :84020
## NA's :3
## GeographicDistrictName GeographicDistrictNCESId
## Detroit City School District : 217 Min. :2600001
## Grand Rapids Public Schools : 71 1st Qu.:2612000
## Dearborn City School District : 50 Median :2619620
## Lansing Public School District: 43 Mean :2621336
## Ann Arbor Public Schools : 41 3rd Qu.:2629160
## Utica Community Schools : 41 Max. :2680990
## (Other) :3404
## GradeList Street City
## 9,10,11,12 : 674 420 WEST 9 MILE RD : 10 DETROIT : 182
## KG,1,2,3,4,5 : 513 204 EAST MUSKEGON ST: 7 GRAND RAPIDS: 119
## 6,7,8 : 323 22586 ANN ARBOR TRL : 7 FLINT : 53
## : 307 2413 WEST MAPLE AVE : 7 KALAMAZOO : 47
## KG,1,2,3,4,5,6,7,8: 199 550 5TH ST : 7 LANSING : 46
## 6,7,8,9,10,11,12 : 138 P.O. Box 308 : 7 MUSKEGON : 44
## (Other) :1713 (Other) :3822 (Other) :3376
## Zip Latitude Longitude
## Min. : 48001 Min. : 0.00 Min. :-90.17
## 1st Qu.:481628879 1st Qu.:42.34 1st Qu.:-85.52
## Median :485323562 Median :42.64 Median :-83.97
## Mean :442878357 Mean :42.90 Mean :-84.29
## 3rd Qu.:492031759 3rd Qu.:43.13 3rd Qu.:-83.26
## Max. :499690217 Max. :47.47 Max. : 0.00
##
entity %>%
filter(is.na(BuildingNCESId))
## AcademicYear ISDCode ISDName DistrictCode
## 1 2015-2016 47000 Livingston ESA 47000
## 2 2015-2016 84000 State of Michigan 84020
## 3 2015-2016 84000 State of Michigan 84090
## DistrictName BuildingCode
## 1 Livingston ESA 2952
## 2 Michigan Department of Corrections 9996
## 3 Michigan Virtual University 2929
## BuildingName DistrictNCESId BuildingNCESId
## 1 Livingston County Adult Education 2680700 NA
## 2 CTE Contracted Programs 2600001 NA
## 3 Michigan Virtual School NA NA
## EntityTypeCategoryName GeographicDistrictCode
## 1 ISD 47070
## 2 State 33020
## 3 State 33020
## GeographicDistrictName GeographicDistrictNCESId
## 1 Howell Public Schools 2618720
## 2 Lansing Public School District 2621150
## 3 Lansing Public School District 2621150
## GradeList Street City Zip
## 1 1425 W Grand River Ave Howell 48843
## 2 9,10,11,12 704 W Kalamazoo Lansing 48933
## 3 KG,1,2,3,4,5,6,7,8,9,10,11,12 3101 Technology Blvd. Lansing 489108546
## Latitude Longitude
## 1 42.61142 -83.95280
## 2 42.73008 -84.56324
## 3 42.70023 -84.49982
There were three entities that were missing an (Operational)[BuildingNCESId] value however, looking into the specific entities, it is somewhat expected that these would not receive federal coding. In any case, each of these records still has a GeographicDistrictNCESId which will be the primary linking id used for this analysis.
# get the parts of the api request
url_base <-"https://api.census.gov/data/"
dataset <- "2016/acs/acs5/profile" #2016 5-year american community survey profiles
query_header <- "?get="
query_variables <- "DP05_0028E,DP03_0062E," #est. num. population, est. median household income
geographicGroupings <- "NAME&for=school%20district%20(unified):*&in=state:26" #geographic school districts in MI
# construct api request and stored the results
response <- GET(paste(url_base
,dataset
,query_header
,query_variables
,geographicGroupings
,sep = ""))
# parse the api results
response_parsed <- content(response, "parse")
# unlist the data into a matrix and transpose that matrix
response_unlisted <- sapply(response_parsed, unlist)
response_transpose <- t(response_unlisted)
# load into a data frame
# rows 2:end contain data
response_df <- data.frame(response_transpose[2:nrow(response_transpose),], stringsAsFactors = FALSE)
# row 1 contains column headers
names(response_df) <- response_transpose[1,]
#review column headers and rename as needed
names(response_df)
## [1] "DP05_0028E" "DP03_0062E"
## [3] "NAME" "state"
## [5] "school district (unified)"
names(response_df) <- c('TotalPopulation', 'MedianHouseholdIncome', 'DistrictName', 'StateCode', 'DistrictCode_NCES')
#update data types as needed
response_df <- response_df %>%
mutate(DistrictCode_NCES = paste("00000", DistrictCode_NCES, sep = "")) %>%
mutate(DistrictCode_NCES = str_trunc(DistrictCode_NCES, 5, side="left", ellipsis="")) %>%
mutate(DistrictCode_NCES = paste(StateCode, DistrictCode_NCES, sep = "")) %>%
mutate(TotalPopulation = as.numeric(TotalPopulation)
,MedianHouseholdIncome = as.numeric(MedianHouseholdIncome)
,DistrictName = as.factor(DistrictName)
,DistrictCode_NCES = as.numeric(DistrictCode_NCES))
census <- response_df
summary(census)
## TotalPopulation MedianHouseholdIncome
## Min. : 394 Min. : 17455
## 1st Qu.: 5535 1st Qu.: 41371
## Median : 9987 Median : 48398
## Mean : 19131 Mean : 51199
## 3rd Qu.: 20073 3rd Qu.: 57983
## Max. :683524 Max. :124137
##
## DistrictName StateCode
## Adams Township School District, Michigan : 1 Length:518
## Addison Community Schools, Michigan : 1 Class :character
## Adrian City School District, Michigan : 1 Mode :character
## Airport Community School District, Michigan: 1
## Akron-Fairgrove Schools, Michigan : 1
## Alanson Public Schools, Michigan : 1
## (Other) :512
## DistrictCode_NCES
## Min. :2600005
## 1st Qu.:2610675
## Median :2619905
## Mean :2619464
## 3rd Qu.:2627833
## Max. :2699999
##
head(census)
## TotalPopulation MedianHouseholdIncome
## 1 42127 32269
## 2 3051 35425
## 3 4266 41375
## 4 3897 39618
## 5 32225 45378
## 6 2267 35530
## DistrictName StateCode
## 1 Battle Creek Public Schools, Michigan 26
## 2 Bessemer Area School District, Michigan 26
## 3 Harbor Beach Community Schools, Michigan 26
## 4 St. Ignace Area Schools, Michigan 26
## 5 Marquette Area Public Schools, Michigan 26
## 6 Ewen-Trout Creek Consolidated Schools, Michigan 26
## DistrictCode_NCES
## 1 2600005
## 2 2600006
## 3 2600007
## 4 2600012
## 5 2600013
## 6 2600014
library(rJava)
library(XLConnectJars)
library(XLConnect)
## XLConnect 0.2-15 by Mirai Solutions GmbH [aut],
## Martin Studer [cre],
## The Apache Software Foundation [ctb, cph] (Apache POI),
## Graph Builder [ctb, cph] (Curvesapi Java library)
## http://www.mirai-solutions.com
## https://github.com/miraisolutions/xlconnect
# method to load .xls file; abondoned as it required a local file
# (i.e., will not load a file refereced by URL)
# location1 <- 'https://www2.census.gov/programs-surveys/cps/tables/time-series/historical-poverty-thresholds/thresh16.xls'
# workbook <- loadWorkbook(location1)
# povertyThresholds <- readWorksheet(workbook, sheet = "Sheet1")
# povertyThresholds[18,]
# povertyThreshold <- str_replace_all(povertyThresholds[18,2], ",", "")
# povertyThresholds <- as.numeric(povertyThresholds)
# alternative method, save the file as .csv
fileLocation4 <- 'https://raw.githubusercontent.com/ChadRyanBailey/607-Data-Acquisition-and-Management/master/607-Weeks13-15-Final-Project/07%202016%20US%20Poverty%20Thresholds.csv'
povertyThresholds <- read.csv(fileLocation4, sep = ",")
povertyThresholds[18,]
## Table.with.row.headings.in.column.A.and.column.headings.in.rows.4.to.8.
## 18 Four people........................
.
## X X.1 X.2 X.3 X.4 X.5 X.6 X.7 X.8 X.9
## 18 24,563 24,755 25,160 24,339 24,424
povertyThreshold <- str_replace_all(povertyThresholds[18,2], ",", "")
povertyThreshold <- as.numeric(povertyThreshold)
povertyThreshold
## [1] 24563
proficiencyCleaned <- proficiencyCleaned %>%
select(AcademicYear
, DistrictCode
, BuildingCode
, Subject
, nTested
, nProf
, PctProf
, nNonProf
, PctNonProf)
entityDemog <- entityDemog %>%
select(AcademicYear
, DistrictCode
, BuildingCode
, nEnrollmentTotal
, nEnrollmentED
, PctEnrollmentED)
prof_demog <- left_join(proficiencyCleaned, entityDemog, by = c('AcademicYear', 'DistrictCode', 'BuildingCode'))
RecordsWithoutDemog <- prof_demog %>%
summarise(sum(ifelse(is.na(nEnrollmentTotal), 1, 0)))
prof_demog <- prof_demog %>% filter(!is.na(nEnrollmentTotal))
#sample records inappropriately missing demographics, to be followed up with Michigan after this project
#head(prof_demog %>% filter(is.na(nEnrollmentTotal)))
Michigan’s entity demographic file appears to have filtered out schools that were open during the selected snapshot but are now closed when the datafile is downloaded. This results in 372 records not having demographics and thus needing to be dropped.
This join gets the ID conversions between state codes and federal codes for schools.
entity <- entity %>%
select(AcademicYear
,ISDCode
,ISDName
,DistrictCode
,DistrictNCESId
,DistrictName
,BuildingCode
,BuildingName
,GeographicDistrictCode
,GeographicDistrictNCESId
,GeographicDistrictName
,Latitude
,Longitude
,EntityType = EntityTypeCategoryName)
prof_demog_ent <- left_join(prof_demog, entity, by = c('AcademicYear', 'DistrictCode', 'BuildingCode'))
RecordsWithoutEntityCodes <- prof_demog_ent %>%
summarise(sum(ifelse(is.na(GeographicDistrictCode), 1, 0)))
RecordsWithoutEntityCodes
## sum(ifelse(is.na(GeographicDistrictCode), 1, 0))
## 1 0
No records were missing enity code conversions.
census <- census %>%
select(GeographicDistrictNCESId = DistrictCode_NCES
,DistrictName_NCES = DistrictName
,DistrictTotalPopulation = TotalPopulation
,MedianHouseholdIncome)
prof_demog_ent_census <- left_join(prof_demog_ent, census, by = c('GeographicDistrictNCESId'))
RecordsWithoutCensus <- prof_demog_ent_census %>%
summarise(sum(ifelse(is.na(DistrictTotalPopulation), 1, 0)))
DistrictsWithoutCensus <- prof_demog_ent_census %>%
filter(is.na(DistrictTotalPopulation)) %>%
distinct(GeographicDistrictNCESId) %>%
summarise(n())
RecordsWithoutCensus
## sum(ifelse(is.na(DistrictTotalPopulation), 1, 0))
## 1 221
DistrictsWithoutCensus
## n()
## 1 73
#head(prof_demog_ent_census %>% filter(is.na(DistrictTotalPopulation)))
prof_demog_ent_census <- prof_demog_ent_census %>%filter(!is.na(DistrictTotalPopulation))
The Census data is missing data for some geographic districts for reason which could not be determined at this point. This means that 221 records from 73 geographic districts will be dropped from this analysis.
class(povertyThreshold)
## [1] "numeric"
final_data <- prof_demog_ent_census %>%
mutate(PovertyThreshold = povertyThreshold) %>%
mutate(MedianHouseholdIncomePctOfPoverty = round(MedianHouseholdIncome / PovertyThreshold * 100, 2))
This section looks at the data set as a whole looking to ensure that the dataset as a whole has the expected:
# for simplicity also limit to just one subject
final_data <- final_data %>% filter(Subject == "English Language Arts")
summary(final_data)
## AcademicYear DistrictCode BuildingCode
## 2015-2016:2840 Min. : 1010 Min. : 1
## 1st Qu.:33215 1st Qu.:1556
## Median :50240 Median :3148
## Mean :51457 Mean :4072
## 3rd Qu.:73030 3rd Qu.:6361
## Max. :84050 Max. :9994
##
## Subject nTested nProf
## English Language Arts:2840 Min. : 10.0 Min. : 1
## Mathematics : 0 1st Qu.: 121.0 1st Qu.: 48
## Median : 202.0 Median : 98
## Mean : 256.7 Mean :131
## 3rd Qu.: 329.2 3rd Qu.:170
## Max. :1864.0 Max. :868
##
## PctProf nNonProf PctNonProf nEnrollmentTotal
## Min. : 1.75 Min. : 1.00 Min. : 0.84 Min. : 9.0
## 1st Qu.:34.73 1st Qu.: 51.75 1st Qu.:38.07 1st Qu.: 281.0
## Median :49.29 Median : 93.00 Median :50.71 Median : 406.0
## Mean :47.94 Mean : 125.70 Mean :52.06 Mean : 475.6
## 3rd Qu.:61.93 3rd Qu.: 165.00 3rd Qu.:65.27 3rd Qu.: 575.0
## Max. :99.16 Max. :1199.00 Max. :98.25 Max. :2917.0
##
## nEnrollmentED PctEnrollmentED ISDCode ISDName
## Min. : 2.0 Min. : 0.88 Min. : 3000 Wayne RESA : 403
## 1st Qu.: 100.0 1st Qu.: 30.05 1st Qu.:33000 Oakland Schools: 306
## Median : 174.0 Median : 49.08 Median :50000 Macomb ISD : 215
## Mean : 211.9 Mean : 49.33 Mean :51144 Kent ISD : 199
## 3rd Qu.: 269.0 3rd Qu.: 67.35 3rd Qu.:73000 Genesee ISD : 124
## Max. :1844.0 Max. :100.00 Max. :84000 Ottawa Area ISD: 91
## (Other) :1502
## DistrictNCESId DistrictName
## Min. :2600005 Grand Rapids Public Schools : 43
## 1st Qu.:2607620 Utica Community Schools : 37
## Median :2618465 Dearborn City School District : 35
## Mean :2617867 Ann Arbor Public Schools : 30
## 3rd Qu.:2628178 Lansing Public School District: 28
## Max. :2680990 Kalamazoo Public Schools : 24
## (Other) :2643
## BuildingName GeographicDistrictCode
## Central Elementary School : 8 Min. : 1010
## North Elementary School : 6 1st Qu.:33170
## Washington Elementary School: 6 Median :50230
## Roosevelt Elementary School : 5 Mean :51332
## Lakewood Elementary School : 4 3rd Qu.:73030
## Lincoln Elementary School : 4 Max. :83070
## (Other) :2807
## GeographicDistrictNCESId GeographicDistrictName
## Min. :2600005 Detroit City School District : 81
## 1st Qu.:2612000 Grand Rapids Public Schools : 55
## Median :2619950 Dearborn City School District : 44
## Mean :2619866 Utica Community Schools : 38
## 3rd Qu.:2628830 Ann Arbor Public Schools : 36
## Max. :2636660 Lansing Public School District: 33
## (Other) :2553
## Latitude Longitude EntityType
## Min. :41.72 Min. :-90.16 ISD : 1
## 1st Qu.:42.35 1st Qu.:-85.53 LEA :2516
## Median :42.66 Median :-83.97 PSA : 322
## Mean :42.95 Mean :-84.37 State: 1
## 3rd Qu.:43.11 3rd Qu.:-83.27
## Max. :47.24 Max. :-82.43
##
## DistrictName_NCES DistrictTotalPopulation
## Detroit City School District, Michigan : 81 Min. : 394
## Grand Rapids Public Schools, Michigan : 55 1st Qu.: 11002
## Dearborn City School District, Michigan : 44 Median : 23666
## Utica Community Schools, Michigan : 38 Mean : 59374
## Ann Arbor Public Schools, Michigan : 36 3rd Qu.: 59071
## Lansing Public School District, Michigan: 33 Max. :683524
## (Other) :2553
## MedianHouseholdIncome PovertyThreshold MedianHouseholdIncomePctOfPoverty
## Min. : 17455 Min. :24563 Min. : 71.06
## 1st Qu.: 41634 1st Qu.:24563 1st Qu.:169.50
## Median : 51234 Median :24563 Median :208.58
## Mean : 53721 Mean :24563 Mean :218.71
## 3rd Qu.: 62272 3rd Qu.:24563 3rd Qu.:253.52
## Max. :124137 Max. :24563 Max. :505.38
##
The dataset as a whole seems to have the expected number of records, the correct categorical values, and the correct ranges/spread for numerical values.
This section zooms in on a single geographical district with known values. This will allow the opportunity to check for possible corruption in the data. This exporation will start visually and only go to table/statistical review if issues are found.
reviewcase <- final_data %>% filter(GeographicDistrictName == 'Grand Rapids Public Schools')
ggplot(reviewcase
, aes(x = PctEnrollmentED, y = PctProf)) +
geom_point(aes(size = nEnrollmentTotal, color = EntityType), alpha = 0.5) +
geom_smooth(method=lm)
This scatter plot validates the expectation that Grand Rapids Public Schools data would have schools with varying rates of Economically Disadvantaged (ED) enrollment and that there should be moderately strong negative correlation between PctED and PctProf (percent proficient).
ggplot(reviewcase
, aes(x = MedianHouseholdIncomePctOfPoverty, y = PctProf)) +
geom_point(aes(size = nEnrollmentTotal, color = EntityType), alpha = 0.5) +
geom_smooth(method=lm)
This scatter plot validates the expectation that for a geographic district there should be only one MedianHouseholdIncomePctPoverty and that Grand Rapids Public Schools’ value for that measure would be near 170.
Now that the dataset has been constructed and validated, the analysis of the desired variables can begin.
# model for enrollmentED
enrollmentEDLM <- lm(PctProf ~ PctEnrollmentED, data = final_data)
summary(enrollmentEDLM)
##
## Call:
## lm(formula = PctProf ~ PctEnrollmentED, data = final_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.773 -6.964 0.550 7.455 58.071
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.522903 0.522812 150.19 <0.0000000000000002 ***
## PctEnrollmentED -0.619878 0.009503 -65.23 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.33 on 2838 degrees of freedom
## Multiple R-squared: 0.5999, Adjusted R-squared: 0.5997
## F-statistic: 4255 on 1 and 2838 DF, p-value: < 0.00000000000000022
# model for communityPoverty
CommunityPovertyLM <- lm(PctProf ~ MedianHouseholdIncomePctOfPoverty, data = final_data)
summary(CommunityPovertyLM)
##
## Call:
## lm(formula = PctProf ~ MedianHouseholdIncomePctOfPoverty, data = final_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.373 -10.062 0.934 10.570 60.932
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 13.516159 0.958713 14.10
## MedianHouseholdIncomePctOfPoverty 0.157418 0.004166 37.79
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## MedianHouseholdIncomePctOfPoverty <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.9 on 2838 degrees of freedom
## Multiple R-squared: 0.3347, Adjusted R-squared: 0.3345
## F-statistic: 1428 on 1 and 2838 DF, p-value: < 0.00000000000000022
The regression models for both EnrollmentED and CommunityPoverty had lower than expected Adjusted R-squared values. Both were less than 0.6 (with Enrollment being 0.5997 and Community Poverty 0.3345).
ggplot(final_data
, aes(x = PctEnrollmentED, y = PctProf)) +
geom_point(aes(size = nEnrollmentTotal, color = EntityType), alpha = 0.5) +
geom_smooth(method=lm)
ggplot(final_data
, aes(x = PctEnrollmentED, y = PctProf)) +
geom_point(aes(size = nEnrollmentTotal, color = EntityType), alpha = 0.5) +
facet_wrap(~EntityType)+
geom_smooth(method=lm)
ggplot(final_data
, aes(x = MedianHouseholdIncomePctOfPoverty, y = PctProf)) +
geom_point(aes(size = nEnrollmentTotal, color = EntityType), alpha = 0.5) +
geom_smooth(method=lm)
ggplot(final_data
, aes(x = MedianHouseholdIncomePctOfPoverty, y = PctProf)) +
geom_point(aes(size = nEnrollmentTotal, color = EntityType), alpha = 0.5) +
facet_wrap(~EntityType) +
geom_smooth(method=lm)
In reviewing the above tables and graphs it is clear that the direct linkage of students in the school is a much more powerfully correlated variable. However, there is still a meanginful correlation with community poverty level.
This project was both fun and challenging.
Unfortunately, Michigan’s data around school proficiency and and entity data were significantly more dirty than expected. This resulted in more time than expected being spent in the data acquisition, exploration, and tidying phase. Which meant there was less time than expectd for analysis.
However, as noted above, the project does show there is a correlation with both measures of poverty. Although unsurprisingly, the measure directly linked to the school (the school’s percent of enrolled students identified as “Economically Disadvantaged”) has a stronger correlation.
Future analysis could be done to further explore this data: