School districts are usually well-regarded and ranked higher if they average high test scores and graduation rates. The problem I want to solve is whether there are certain characteristics that have a greater impact on these rankings than others.
To narrow down the scope of this project, I focused on school districts that were around the same size and population as the one I currently work at. The National Center for Education Statistics [NCES] had a peer matching tool that matches districts across the country based on type, locale, enrollment, student/teacher ratio and poverty. 99 similar districts in 31 states were selected. To help compare the districts, I needed an initial ranking to compare against.
The school district that the rest are matched against is located in a large suburb with around 24,500 students enrolled, student/teacher ratio of 16.18 and a poverty level of 21.3%.
The NCES data was downloaded as .csv files. National rankings and state rankings were pulled manually from Niche.com and SchoolDigger.com and stored in a MySQL database.
Some of the variables that will be analyzed are:
Characteristics - Total Schools, Total Students, Student/Teacher ratio, etc…
Expenditures - Instructional Expenditures, Student and Staff support, etc…
Fiscal - Federal and State Revenues, etc…
Although the disctricts are similar, the variance in quality can be considered large. The top district has a national rank of 196 and the bottom is unranked nationally and in the 0 percentile of the respective state.
Import NCES Data
This website separated the data for all schools into five separate files.
expenditures <- read.csv("https://raw.githubusercontent.com/smithchad17/Class607/master/607%20Final/expenditures.csv", header = T, stringsAsFactors = F)
finances <- read.csv("https://raw.githubusercontent.com/smithchad17/Class607/master/607%20Final/finances.csv", header = T, stringsAsFactors = F)
poverty <- read.csv("https://raw.githubusercontent.com/smithchad17/Class607/master/607%20Final/poverty.csv", header = T, stringsAsFactors = F)
revenue <- read.csv("https://raw.githubusercontent.com/smithchad17/Class607/master/607%20Final/revenue.csv", header = T, stringsAsFactors = F)
students <- read.csv("https://raw.githubusercontent.com/smithchad17/Class607/master/607%20Final/students.csv", header = T, stringsAsFactors = F)
head(expenditures)
## LEAID DistrictName State GradeSpanLow
## 1 4500720 AIKEN 01 South Carolina PK
## 2 4808090 ALVIN ISD Texas PK
## 3 400680 AMPHITHEATER UNIFIED DISTRICT Arizona PK
## 4 2200090 ASCENSION PARISH Louisiana PK
## 5 1704680 AURORA EAST USD 131 Illinois PK
## 6 1704710 AURORA WEST USD 129 Illinois PK
## GradeSpanHigh ZipCode TotalCurrentExpendituresPerStudent
## 1 12 29803 8120.246
## 2 12 77511 8208.390
## 3 12 85705 7554.084
## 4 12 70346 10356.297
## 5 12 60505 9743.412
## 6 12 60506 12781.889
## InstructionalExpendituresPerStudent StudentandStaffSupportPerStudent
## 1 4729.204 1118.5736
## 2 4943.006 678.1766
## 3 4136.213 1019.6603
## 4 6126.082 1217.4381
## 5 5064.692 961.1849
## 6 8407.229 1163.3113
## AdministrationPerStudent OperationsPerStudent TotalStudents
## 1 832.8613 1439.607 24716
## 2 998.8389 1588.369 19809
## 3 613.4874 1784.723 14191
## 4 1100.6693 1912.107 21367
## 5 1829.4178 1888.117 14685
## 6 1237.4660 1973.883 12865
Import SQL Data
Create the MySQL driver and connect to the database. Send a query to collect all the information and store it in a data frame.
#Some settings will need to be changed if you're running it from another MySQL server
con <- dbConnect(RMySQL::MySQL(),
dbname = "schoolrank",
host = "localhost",
port = 3306,
user = "chad",
password = "")
q <- "SELECT * FROM rankings;"
schoolrank <- dbGetQuery(con, q)
head(schoolrank)
## LEAID DISTRICT STATE DISTRICTS_IN_STATE
## 1 101290 ELMORE COUNTY AL 126
## 2 101920 JEFFERSON COUNTY AL 126
## 3 102070 LEE COUNTY AL 126
## 4 102220 MADISON COUNTY AL 126
## 5 103030 SHELBY COUNTY AL 126
## 6 400680 AMPHITHEATER UNIFIED DISTRICT AZ 452
## RANK_IN_STATE RANK_IN_US
## 1 27 NA
## 2 81 NA
## 3 51 NA
## 4 20 2891
## 5 22 2282
## 6 178 2761
Merge data frames by LEAID.
Merges the five data frames into one.
df1 <- merge(schoolrank, poverty, by = "LEAID")
df2 <- merge(df1, expenditures, by = "LEAID")
df3 <- merge(df2, finances, by = "LEAID")
df4 <- merge(df3, revenue, by = "LEAID")
final <- merge(df4, students, by = "LEAID")
head(final)
## LEAID DISTRICT STATE DISTRICTS_IN_STATE
## 1 101290 ELMORE COUNTY AL 126
## 2 101920 JEFFERSON COUNTY AL 126
## 3 102070 LEE COUNTY AL 126
## 4 102220 MADISON COUNTY AL 126
## 5 103030 SHELBY COUNTY AL 126
## 6 400680 AMPHITHEATER UNIFIED DISTRICT AZ 452
## RANK_IN_STATE RANK_IN_US NAME FIPST MSTATE
## 1 27 NA ELMORE COUNTY 1 AL
## 2 81 NA JEFFERSON COUNTY 1 AL
## 3 51 NA LEE COUNTY 1 AL
## 4 20 2891 MADISON COUNTY 1 AL
## 5 22 2282 SHELBY COUNTY 1 AL
## 6 178 2761 AMPHITHEATER UNIFIED DISTRICT 4 AZ
## FIPST_NAME GSLO GSHI PercentPoverty DistrictName.x
## 1 Alabama PK 12 15.66762 ELMORE COUNTY
## 2 Alabama PK 12 16.51225 JEFFERSON COUNTY
## 3 Alabama PK 12 14.74384 LEE COUNTY
## 4 Alabama PK 12 10.18080 MADISON COUNTY
## 5 Alabama PK 12 9.70268 SHELBY COUNTY
## 6 Arizona PK 12 19.84297 AMPHITHEATER UNIFIED DISTRICT
## State.x GradeSpanLow.x GradeSpanHigh.x ZipCode.x
## 1 Alabama PK 12 36092
## 2 Alabama PK 12 35209
## 3 Alabama PK 12 36803
## 4 Alabama PK 12 35804
## 5 Alabama PK 12 35051
## 6 Arizona PK 12 85705
## TotalCurrentExpendituresPerStudent InstructionalExpendituresPerStudent
## 1 7754.584 4732.686
## 2 8803.552 5069.967
## 3 8603.229 5209.912
## 4 8315.232 4847.829
## 5 9169.845 5269.324
## 6 7554.084 4136.213
## StudentandStaffSupportPerStudent AdministrationPerStudent
## 1 634.4134 837.3687
## 2 826.4232 988.8959
## 3 735.9602 785.2138
## 4 777.9748 812.5728
## 5 1055.7236 804.0247
## 6 1019.6603 613.4874
## OperationsPerStudent TotalStudents.x DistrictName.y
## 1 1550.116 11234 ELMORE COUNTY
## 2 1918.266 36203 JEFFERSON COUNTY
## 3 1872.144 9847 LEE COUNTY
## 4 1876.855 19741 MADISON COUNTY
## 5 2040.773 22809 SHELBY COUNTY
## 6 1784.723 14191 AMPHITHEATER UNIFIED DISTRICT
## State.y GradeSpanLow.y GradeSpanHigh.y ZipCode.y
## 1 Alabama PK 12 36092
## 2 Alabama PK 12 35209
## 3 Alabama PK 12 36803
## 4 Alabama PK 12 35804
## 5 Alabama PK 12 35051
## 6 Arizona PK 12 85705
## TotalExpendituresPerStudent TotalCapitalOutlayPerStudent
## 1 8396.564 273.9897
## 2 9943.126 301.7706
## 3 9347.111 376.3583
## 4 11106.276 2213.8696
## 5 10733.833 930.1153
## 6 9519.202 1613.4170
## ConstructionPerStudent NonElSecEducationPerStudent
## 1 161.2071 229.3929
## 2 221.6391 137.1433
## 3 277.0387 154.4633
## 4 2036.6749 206.9804
## 5 869.4813 225.4373
## 6 1367.8388 31.5693
## InterestOnDebtPerStudent DistrictName.x State.x
## 1 146.8756 ELMORE COUNTY Alabama
## 2 707.0409 JEFFERSON COUNTY Alabama
## 3 226.5665 LEE COUNTY Alabama
## 4 377.3365 MADISON COUNTY Alabama
## 5 418.6505 SHELBY COUNTY Alabama
## 6 313.8609 AMPHITHEATER UNIFIED DISTRICT Arizona
## GradeSpanLow.x GradeSpanHigh.x ZipCode.x TotalRevenuePerStudent
## 1 PK 12 36092 8494.748
## 2 PK 12 35209 9507.776
## 3 PK 12 36803 9258.150
## 4 PK 12 35804 9631.123
## 5 PK 12 35051 10294.007
## 6 PK 12 85705 9398.703
## TotalFederalRevenuePerStudent TotalLocalRevenuePerStudent
## 1 754.4063 2316.806
## 2 964.1742 3315.720
## 3 734.3353 3038.387
## 4 666.3796 3517.299
## 5 592.6608 4523.215
## 6 961.2430 6259.390
## TotalStateRevenuePerStudent DistrictName.y State.y
## 1 5423.536 ELMORE COUNTY Alabama
## 2 5227.882 JEFFERSON COUNTY Alabama
## 3 5485.427 LEE COUNTY Alabama
## 4 5447.444 MADISON COUNTY Alabama
## 5 5178.131 SHELBY COUNTY Alabama
## 6 2178.071 AMPHITHEATER UNIFIED DISTRICT Arizona
## GradeSpanLow.y GradeSpanHigh.y ZipCode.y TotalSchools
## 1 PK 12 36092 15
## 2 PK 12 35209 62
## 3 PK 12 36803 14
## 4 PK 12 35804 29
## 5 PK 12 35051 36
## 6 PK 12 85705 22
## TotalFiscalStudents TotalStudents.y ClassroomTeachersFTE
## 1 11234 11234 625.00
## 2 36203 36203 2209.50
## 3 9847 9847 629.00
## 4 19741 19741 1167.90
## 5 22809 22809 1426.10
## 6 14191 14191 837.85
## StudentTeacherRatio ELLStudents
## 1 17.97440 122
## 2 16.38516 1068
## 3 15.65501 58
## 4 16.90299 127
## 5 15.99397 899
## 6 16.93740 732
Remove duplicate columns
Some of the files had the exact columns and values before they were merged. The duplicates were removed after the merge.
final <- subset(final, select = c(LEAID, STATE, RANK_IN_STATE, GSLO, PercentPoverty, TotalCurrentExpendituresPerStudent, StudentandStaffSupportPerStudent, OperationsPerStudent, ZipCode.y, TotalCapitalOutlayPerStudent, NonElSecEducationPerStudent, TotalFederalRevenuePerStudent, TotalStateRevenuePerStudent, TotalSchools, StudentTeacherRatio, DISTRICT, DISTRICTS_IN_STATE, RANK_IN_US, FIPST_NAME, GSHI, InstructionalExpendituresPerStudent, AdministrationPerStudent, TotalExpendituresPerStudent, ConstructionPerStudent, InterestOnDebtPerStudent, TotalRevenuePerStudent, TotalLocalRevenuePerStudent, ClassroomTeachersFTE, ELLStudents))
colnames(final)
## [1] "LEAID"
## [2] "STATE"
## [3] "RANK_IN_STATE"
## [4] "GSLO"
## [5] "PercentPoverty"
## [6] "TotalCurrentExpendituresPerStudent"
## [7] "StudentandStaffSupportPerStudent"
## [8] "OperationsPerStudent"
## [9] "ZipCode.y"
## [10] "TotalCapitalOutlayPerStudent"
## [11] "NonElSecEducationPerStudent"
## [12] "TotalFederalRevenuePerStudent"
## [13] "TotalStateRevenuePerStudent"
## [14] "TotalSchools"
## [15] "StudentTeacherRatio"
## [16] "DISTRICT"
## [17] "DISTRICTS_IN_STATE"
## [18] "RANK_IN_US"
## [19] "FIPST_NAME"
## [20] "GSHI"
## [21] "InstructionalExpendituresPerStudent"
## [22] "AdministrationPerStudent"
## [23] "TotalExpendituresPerStudent"
## [24] "ConstructionPerStudent"
## [25] "InterestOnDebtPerStudent"
## [26] "TotalRevenuePerStudent"
## [27] "TotalLocalRevenuePerStudent"
## [28] "ClassroomTeachersFTE"
## [29] "ELLStudents"
Mutate columns to see what ranking percentile of the state the district is in.
Divide the rank of the district in the state by the total number of districts the state has.
final <- final %>% mutate(PercentileOfState = round((1 - RANK_IN_STATE/DISTRICTS_IN_STATE)*100,2))
Reorder the columns by column index
This was a personal preference so similar columns were grouped together.
final <- final[,c(1,2,19,9,16,14,17,3,18,30,4,5,6,7,8,10,11,12,13,15,20,21,22,23,24,25,26,27,28,29)]
Rename columns
Make some of the column names more readable.
colnames(final)[2] <- "StateAbbr"
colnames(final)[3] <- "State"
colnames(final)[4] <- "ZipCode"
colnames(final)
## [1] "LEAID"
## [2] "StateAbbr"
## [3] "State"
## [4] "ZipCode"
## [5] "DISTRICT"
## [6] "TotalSchools"
## [7] "DISTRICTS_IN_STATE"
## [8] "RANK_IN_STATE"
## [9] "RANK_IN_US"
## [10] "PercentileOfState"
## [11] "GSLO"
## [12] "PercentPoverty"
## [13] "TotalCurrentExpendituresPerStudent"
## [14] "StudentandStaffSupportPerStudent"
## [15] "OperationsPerStudent"
## [16] "TotalCapitalOutlayPerStudent"
## [17] "NonElSecEducationPerStudent"
## [18] "TotalFederalRevenuePerStudent"
## [19] "TotalStateRevenuePerStudent"
## [20] "StudentTeacherRatio"
## [21] "GSHI"
## [22] "InstructionalExpendituresPerStudent"
## [23] "AdministrationPerStudent"
## [24] "TotalExpendituresPerStudent"
## [25] "ConstructionPerStudent"
## [26] "InterestOnDebtPerStudent"
## [27] "TotalRevenuePerStudent"
## [28] "TotalLocalRevenuePerStudent"
## [29] "ClassroomTeachersFTE"
## [30] "ELLStudents"
Find top 10 nationally-ranked school districts
A column ‘type’ was created with parameter ‘top10’ so it will be easier to group later.
#Remove schools that are not ranked. Sort by ascending order and take the top 10.
#If a school is not ranked nationally, the RANK_IN_US value is 'NA'. na.last = NA parameter removes them.
Top10 <- head(final[order(final$RANK_IN_US, na.last = NA, decreasing = F),],10)
Top10$type <- "top10"
Find bottom 10 ranked school districts.
Since the under-performing school districts are not nationally ranked. I’ll sort the data frame by what percentile the district is in for its state.
A column ‘type’ was created with parameter ‘bottom10’ so it will be easier to group later.
bottom <- final %>% filter(is.na(RANK_IN_US))
Bottom10 <- head(bottom[order(bottom$PercentileOfState, decreasing = F),],10)
Bottom10$type <- "bottom10"
Top 10 summary
summary(Top10)
## LEAID StateAbbr State ZipCode
## Min. :1704170 Length:10 Length:10 Min. :55126
## 1st Qu.:3244770 Class :character Class :character 1st Qu.:71765
## Median :4820295 Mode :character Mode :character Median :76090
## Mean :4041132 Mean :72472
## 3rd Qu.:4828335 3rd Qu.:77447
## Max. :4842960 Max. :78660
## DISTRICT TotalSchools DISTRICTS_IN_STATE RANK_IN_STATE
## Length:10 Min. :11.00 Min. : 99.0 Min. : 17.0
## Class :character 1st Qu.:21.75 1st Qu.:817.5 1st Qu.: 85.5
## Mode :character Median :29.50 Median :933.0 Median :160.5
## Mean :28.10 Mean :785.0 Mean :170.0
## 3rd Qu.:33.75 3rd Qu.:933.0 3rd Qu.:252.2
## Max. :43.00 Max. :933.0 Max. :326.0
## RANK_IN_US PercentileOfState GSLO PercentPoverty
## Min. :196.0 Min. :65.06 Length:10 Min. :10.11
## 1st Qu.:374.2 1st Qu.:72.97 Class :character 1st Qu.:11.16
## Median :452.0 Median :79.94 Mode :character Median :12.80
## Mean :445.2 Mean :78.99 Mean :12.80
## 3rd Qu.:535.2 3rd Qu.:84.70 3rd Qu.:14.66
## Max. :682.0 Max. :95.18 Max. :15.51
## TotalCurrentExpendituresPerStudent StudentandStaffSupportPerStudent
## Min. : 7611 Min. : 555.9
## 1st Qu.: 8182 1st Qu.: 728.6
## Median : 8266 Median : 815.7
## Mean :10474 Mean :1045.7
## 3rd Qu.:11286 3rd Qu.: 921.3
## Max. :20936 Max. :3087.4
## OperationsPerStudent TotalCapitalOutlayPerStudent
## Min. :1257 Min. : 335.2
## 1st Qu.:1419 1st Qu.: 897.8
## Median :1573 Median :1929.0
## Mean :2266 Mean :1644.1
## 3rd Qu.:2075 3rd Qu.:2176.8
## Max. :6887 Max. :2663.3
## NonElSecEducationPerStudent TotalFederalRevenuePerStudent
## Min. : 15.30 Min. : 456.1
## 1st Qu.: 22.51 1st Qu.: 555.1
## Median : 58.73 Median : 644.5
## Mean :151.62 Mean : 699.1
## 3rd Qu.:195.09 3rd Qu.: 819.7
## Max. :706.53 Max. :1041.5
## TotalStateRevenuePerStudent StudentTeacherRatio GSHI
## Min. :1914 Min. :14.55 Min. :12
## 1st Qu.:3153 1st Qu.:15.52 1st Qu.:12
## Median :4223 Median :16.10 Median :12
## Mean :4217 Mean :16.07 Mean :12
## 3rd Qu.:4471 3rd Qu.:16.52 3rd Qu.:12
## Max. :9620 Max. :17.55 Max. :12
## InstructionalExpendituresPerStudent AdministrationPerStudent
## Min. : 4838 Min. : 777.6
## 1st Qu.: 5008 1st Qu.: 833.5
## Median : 5094 Median : 853.5
## Mean : 6164 Mean : 998.8
## 3rd Qu.: 5913 3rd Qu.: 943.5
## Max. :12601 Max. :2053.0
## TotalExpendituresPerStudent ConstructionPerStudent
## Min. : 9442 Min. : 140.6
## 1st Qu.:10512 1st Qu.: 700.0
## Median :11463 Median :1464.3
## Mean :12935 Mean :1373.4
## 3rd Qu.:12829 3rd Qu.:2033.1
## Max. :23478 Max. :2314.2
## InterestOnDebtPerStudent TotalRevenuePerStudent
## Min. : 212.4 Min. : 9689
## 1st Qu.: 431.9 1st Qu.: 9790
## Median : 784.6 Median :10416
## Mean : 700.3 Mean :12191
## 3rd Qu.: 901.6 3rd Qu.:12053
## Max. :1218.6 Max. :23843
## TotalLocalRevenuePerStudent ClassroomTeachersFTE ELLStudents
## Min. : 4096 Min. : 646.9 Min. : 406
## 1st Qu.: 4978 1st Qu.: 776.4 1st Qu.: 634
## Median : 5815 Median :1347.5 Median :1588
## Mean : 7275 Mean :1249.2 Mean :1863
## 3rd Qu.: 7455 3rd Qu.:1562.5 3rd Qu.:3034
## Max. :18759 Max. :2105.1 Max. :3809
## type
## Length:10
## Class :character
## Mode :character
##
##
##
Bottom 10 summary
summary(Bottom10)
## LEAID StateAbbr State ZipCode
## Min. : 801950 Length:10 Length:10 Min. : 6050
## 1st Qu.:1176342 Class :character Class :character 1st Qu.:29808
## Median :1738120 Mode :character Mode :character Median :60295
## Mean :2406114 Mean :51341
## 3rd Qu.:3431962 3rd Qu.:72095
## Max. :4841220 Max. :80022
##
## DISTRICT TotalSchools DISTRICTS_IN_STATE RANK_IN_STATE
## Length:10 Min. : 9.00 Min. : 17.0 Min. : 17.0
## Class :character 1st Qu.:13.50 1st Qu.:222.5 1st Qu.:215.8
## Mode :character Median :17.50 Median :771.5 Median :718.0
## Mean :19.30 Mean :570.9 Mean :539.2
## 3rd Qu.:22.75 3rd Qu.:779.0 3rd Qu.:739.0
## Max. :41.00 Max. :933.0 Max. :891.0
##
## RANK_IN_US PercentileOfState GSLO PercentPoverty
## Min. : NA Min. :0.000 Length:10 Min. :13.93
## 1st Qu.: NA 1st Qu.:1.815 Class :character 1st Qu.:17.90
## Median : NA Median :4.880 Mode :character Median :19.66
## Mean :NaN Mean :4.208 Mean :19.70
## 3rd Qu.: NA 3rd Qu.:6.218 3rd Qu.:22.91
## Max. : NA Max. :7.200 Max. :24.30
## NA's :10
## TotalCurrentExpendituresPerStudent StudentandStaffSupportPerStudent
## Min. : 7816 Min. : 640.1
## 1st Qu.: 9751 1st Qu.: 809.3
## Median :10608 Median : 973.6
## Mean :11655 Mean :1105.9
## 3rd Qu.:13235 3rd Qu.:1217.1
## Max. :18240 Max. :2106.8
##
## OperationsPerStudent TotalCapitalOutlayPerStudent
## Min. :1563 Min. : 127.9
## 1st Qu.:1762 1st Qu.: 188.7
## Median :1838 Median : 295.2
## Mean :2195 Mean : 426.5
## 3rd Qu.:2739 3rd Qu.: 570.4
## Max. :3301 Max. :1172.1
##
## NonElSecEducationPerStudent TotalFederalRevenuePerStudent
## Min. : 24.07 Min. : 840.6
## 1st Qu.: 71.51 1st Qu.: 994.9
## Median : 106.74 Median :1117.1
## Mean : 276.97 Mean :1208.1
## 3rd Qu.: 380.63 3rd Qu.:1326.2
## Max. :1255.89 Max. :1809.2
##
## TotalStateRevenuePerStudent StudentTeacherRatio GSHI
## Min. : 3585 Min. :14.56 Min. :12
## 1st Qu.: 5245 1st Qu.:16.06 1st Qu.:12
## Median : 7694 Median :16.69 Median :12
## Mean : 7545 Mean :16.58 Mean :12
## 3rd Qu.: 8151 3rd Qu.:17.36 3rd Qu.:12
## Max. :12864 Max. :17.80 Max. :12
##
## InstructionalExpendituresPerStudent AdministrationPerStudent
## Min. : 4392 Min. : 859.6
## 1st Qu.: 4958 1st Qu.:1156.9
## Median : 6060 Median :1424.6
## Mean : 6972 Mean :1381.1
## 3rd Qu.: 7850 3rd Qu.:1566.7
## Max. :12466 Max. :1829.4
##
## TotalExpendituresPerStudent ConstructionPerStudent
## Min. : 8777 Min. : 58.81
## 1st Qu.:10708 1st Qu.: 90.90
## Median :11774 Median : 115.07
## Mean :12638 Mean : 268.00
## 3rd Qu.:13913 3rd Qu.: 324.05
## Max. :19271 Max. :1031.58
##
## InterestOnDebtPerStudent TotalRevenuePerStudent
## Min. : 88.65 Min. : 9432
## 1st Qu.:194.77 1st Qu.:10616
## Median :286.20 Median :12937
## Mean :371.47 Mean :13227
## 3rd Qu.:526.72 3rd Qu.:14026
## Max. :855.69 Max. :19774
##
## TotalLocalRevenuePerStudent ClassroomTeachersFTE ELLStudents
## Min. :2950 Min. : 428.9 Min. : 156
## 1st Qu.:3737 1st Qu.: 580.7 1st Qu.:1009
## Median :4320 Median : 722.6 Median :2326
## Mean :4474 Mean : 867.1 Mean :2924
## 3rd Qu.:5159 3rd Qu.: 919.0 3rd Qu.:4737
## Max. :6025 Max. :2360.4 Max. :6240
##
## type
## Length:10
## Class :character
## Mode :character
##
##
##
##
Some highlights are that the bottom 10 districts receive more total revenue per student and pay for more administration. The top 10 districts spent more on construction and equipment per student as well as having a lower poverty level.
Transform The Data
The data frame with the top 10 and bottom 10 districts are joined and the columns that have dollar values are gathered under a single column called ‘category’. The columns that have non-dollar values are gathered into one column called ‘non_fund_category’. Their values will be under another column called ‘non_fund_value’.
A column ‘mean_dollars’ will hold the mean value from each ‘category’ and a column ‘non_fund_value’ will hold the mean value from each ‘non_fund_category’
#Combine the Top10 and Bottom10 data frames
topbottom <- rbind(Top10, Bottom10)
#Take all the columns that have dollar values and gather them into one column called category. Their values will be under another column called category_dollars. Group them by category and type and create another column with the mean dollar value from each category.
topbottom <- topbottom %>%
gather(category, category_dollars, select = c(TotalCurrentExpendituresPerStudent, StudentandStaffSupportPerStudent, OperationsPerStudent, TotalCapitalOutlayPerStudent, NonElSecEducationPerStudent, TotalFederalRevenuePerStudent, TotalStateRevenuePerStudent, InstructionalExpendituresPerStudent, AdministrationPerStudent, TotalExpendituresPerStudent, ConstructionPerStudent, InterestOnDebtPerStudent, TotalRevenuePerStudent, TotalLocalRevenuePerStudent)) %>%
group_by(category, type) %>%
mutate(mean_dollars = round(mean(category_dollars),2))
#Take all the columns that have non-dollar values and gather them into one column called non_fund_category. Their values will be under another column called non_fund_value. Group them by category and type and create another column with the mean dollar value from each category.
topbottom <- topbottom %>%
gather(non_fund_category, non_fund_value, select = c(PercentPoverty, StudentTeacherRatio, ClassroomTeachersFTE, ELLStudents)) %>%
group_by(non_fund_category, type) %>%
mutate(mean_non_fund = round(mean(non_fund_value), 2))
#Get only the total revenue and expenditures columns
just_total <- filter(topbottom, category == "TotalExpendituresPerStudent" | category == "TotalRevenuePerStudent")
Plot The Data
ggplot(just_total, aes(x = category, y=mean_dollars, fill = type)) +
geom_bar(stat = "identity", position = position_dodge()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Bottom 10 vs Top 10 Districts") +
geom_text(aes(label=mean_dollars,hjust=0.4, vjust=-0.1),position=position_dodge(width = 1))
ggplot(topbottom, aes(x = category, y=mean_dollars, fill = type)) +
geom_bar(stat = "identity", position = position_dodge()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Bottom 10 vs Top 10 Districts")
ggplot(topbottom, aes(x = non_fund_category, y=mean_non_fund, fill = type)) +
geom_bar(stat = "identity", position = position_dodge()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Bottom 10 vs Top 10 Districts") +
geom_text(aes(label=mean_non_fund,hjust=0.4, vjust=-0.1),position=position_dodge(width = 1))
Rank percentile of the state is the response variable.
r <- lm(PercentileOfState ~ TotalCurrentExpendituresPerStudent + OperationsPerStudent + NonElSecEducationPerStudent + AdministrationPerStudent + ConstructionPerStudent + TotalRevenuePerStudent + ClassroomTeachersFTE + PercentPoverty + StudentandStaffSupportPerStudent + TotalCapitalOutlayPerStudent + StudentTeacherRatio + InstructionalExpendituresPerStudent + InterestOnDebtPerStudent + TotalExpendituresPerStudent + TotalFederalRevenuePerStudent, data = final)
Backwards Elimination
r2 <- lm(PercentileOfState ~ ClassroomTeachersFTE + PercentPoverty + ELLStudents , data = final)
PercOfState <- summary(r)
PercOfState
##
## Call:
## lm(formula = PercentileOfState ~ TotalCurrentExpendituresPerStudent +
## OperationsPerStudent + NonElSecEducationPerStudent + AdministrationPerStudent +
## ConstructionPerStudent + TotalRevenuePerStudent + ClassroomTeachersFTE +
## PercentPoverty + StudentandStaffSupportPerStudent + TotalCapitalOutlayPerStudent +
## StudentTeacherRatio + InstructionalExpendituresPerStudent +
## InterestOnDebtPerStudent + TotalExpendituresPerStudent +
## TotalFederalRevenuePerStudent, data = final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.627 -12.606 1.973 11.952 46.038
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.441e+02 4.227e+01 3.409 0.00100
## TotalCurrentExpendituresPerStudent 6.837e-04 9.459e-03 0.072 0.94255
## OperationsPerStudent 1.192e-03 5.914e-03 0.201 0.84080
## NonElSecEducationPerStudent -3.927e-04 9.412e-03 -0.042 0.96682
## AdministrationPerStudent -2.101e-02 1.172e-02 -1.793 0.07652
## ConstructionPerStudent -1.730e-02 1.433e-02 -1.208 0.23061
## TotalRevenuePerStudent 8.791e-04 3.414e-03 0.258 0.79741
## ClassroomTeachersFTE 8.022e-03 5.314e-03 1.510 0.13487
## PercentPoverty -2.077e+00 6.814e-01 -3.048 0.00308
## StudentandStaffSupportPerStudent 9.286e-03 7.165e-03 1.296 0.19849
## TotalCapitalOutlayPerStudent 2.316e-02 1.716e-02 1.349 0.18083
## StudentTeacherRatio -2.140e+00 2.239e+00 -0.956 0.34175
## InstructionalExpendituresPerStudent NA NA NA NA
## InterestOnDebtPerStudent -3.550e-03 1.192e-02 -0.298 0.76658
## TotalExpendituresPerStudent -3.100e-03 1.083e-02 -0.286 0.77546
## TotalFederalRevenuePerStudent -1.581e-02 8.739e-03 -1.809 0.07408
##
## (Intercept) **
## TotalCurrentExpendituresPerStudent
## OperationsPerStudent
## NonElSecEducationPerStudent
## AdministrationPerStudent .
## ConstructionPerStudent
## TotalRevenuePerStudent
## ClassroomTeachersFTE
## PercentPoverty **
## StudentandStaffSupportPerStudent
## TotalCapitalOutlayPerStudent
## StudentTeacherRatio
## InstructionalExpendituresPerStudent
## InterestOnDebtPerStudent
## TotalExpendituresPerStudent
## TotalFederalRevenuePerStudent .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.2 on 84 degrees of freedom
## Multiple R-squared: 0.465, Adjusted R-squared: 0.3758
## F-statistic: 5.214 on 14 and 84 DF, p-value: 5.733e-07
PercOfState2 <- summary(r2)
PercOfState2
##
## Call:
## lm(formula = PercentileOfState ~ ClassroomTeachersFTE + PercentPoverty +
## ELLStudents, data = final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.976 -15.205 -1.337 16.455 41.029
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.796982 11.163898 5.894 5.73e-08 ***
## ClassroomTeachersFTE 0.024121 0.005351 4.508 1.87e-05 ***
## PercentPoverty -2.046776 0.520738 -3.931 0.000161 ***
## ELLStudents -0.006148 0.001538 -3.998 0.000126 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.66 on 95 degrees of freedom
## Multiple R-squared: 0.4239, Adjusted R-squared: 0.4058
## F-statistic: 23.31 on 3 and 95 DF, p-value: 2.167e-11
LM with all variables: R-squared and Adjusted R-squared tells us that the model is not a good fit for the data and it’s an overfit. Many of the coefficients have high p-values.
Backwards elminination removed the high p-value variables and gave the model a better fit, but the resulting model described the data even less with a lower R-squared value.
RANK_IN_US is the response variable.
Only the districts that are nationally ranked are analyzed.
only_ranked <- final %>% filter(!is.na(RANK_IN_US))
not_ranked <- final %>% filter(is.na(RANK_IN_US))
r3 <- lm(RANK_IN_US ~ TotalCurrentExpendituresPerStudent + OperationsPerStudent + AdministrationPerStudent + ConstructionPerStudent + TotalRevenuePerStudent + PercentPoverty + StudentandStaffSupportPerStudent + TotalCapitalOutlayPerStudent + StudentTeacherRatio + InterestOnDebtPerStudent + TotalExpendituresPerStudent + TotalFederalRevenuePerStudent, data = only_ranked)
Backwards Elimination
r4 <- lm(RANK_IN_US ~ PercentPoverty + TotalCapitalOutlayPerStudent + StudentTeacherRatio + InterestOnDebtPerStudent, data = only_ranked)
rank_usa <- summary(r3)
rank_usa
##
## Call:
## lm(formula = RANK_IN_US ~ TotalCurrentExpendituresPerStudent +
## OperationsPerStudent + AdministrationPerStudent + ConstructionPerStudent +
## TotalRevenuePerStudent + PercentPoverty + StudentandStaffSupportPerStudent +
## TotalCapitalOutlayPerStudent + StudentTeacherRatio + InterestOnDebtPerStudent +
## TotalExpendituresPerStudent + TotalFederalRevenuePerStudent,
## data = only_ranked)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2046.3 -663.3 -126.5 639.3 2779.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3017.1485 3189.3744 -0.946 0.34931
## TotalCurrentExpendituresPerStudent -1.0385 0.6462 -1.607 0.11517
## OperationsPerStudent -0.2596 0.3871 -0.671 0.50595
## AdministrationPerStudent 1.3247 1.0688 1.239 0.22175
## ConstructionPerStudent 0.6445 1.0236 0.630 0.53220
## TotalRevenuePerStudent -0.5960 0.2621 -2.274 0.02789
## PercentPoverty 84.6239 49.3982 1.713 0.09373
## StudentandStaffSupportPerStudent 0.1811 0.6973 0.260 0.79625
## TotalCapitalOutlayPerStudent -2.3721 1.2204 -1.944 0.05835
## StudentTeacherRatio 364.4852 168.8121 2.159 0.03633
## InterestOnDebtPerStudent -2.0561 0.7509 -2.738 0.00888
## TotalExpendituresPerStudent 1.4543 0.7237 2.010 0.05062
## TotalFederalRevenuePerStudent 0.4009 0.5062 0.792 0.43260
##
## (Intercept)
## TotalCurrentExpendituresPerStudent
## OperationsPerStudent
## AdministrationPerStudent
## ConstructionPerStudent
## TotalRevenuePerStudent *
## PercentPoverty .
## StudentandStaffSupportPerStudent
## TotalCapitalOutlayPerStudent .
## StudentTeacherRatio *
## InterestOnDebtPerStudent **
## TotalExpendituresPerStudent .
## TotalFederalRevenuePerStudent
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1126 on 44 degrees of freedom
## Multiple R-squared: 0.458, Adjusted R-squared: 0.3101
## F-statistic: 3.098 on 12 and 44 DF, p-value: 0.003012
rank_usa2 <- summary(r4)
rank_usa2
##
## Call:
## lm(formula = RANK_IN_US ~ PercentPoverty + TotalCapitalOutlayPerStudent +
## StudentTeacherRatio + InterestOnDebtPerStudent, data = only_ranked)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1803.73 -843.52 -13.91 689.46 2536.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2781.6327 2625.9477 -1.059 0.2944
## PercentPoverty 76.1832 34.9488 2.180 0.0338 *
## TotalCapitalOutlayPerStudent -0.3928 0.2002 -1.962 0.0551 .
## StudentTeacherRatio 304.4481 157.4314 1.934 0.0586 .
## InterestOnDebtPerStudent -1.1702 0.5317 -2.201 0.0322 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1133 on 52 degrees of freedom
## Multiple R-squared: 0.3507, Adjusted R-squared: 0.3007
## F-statistic: 7.021 on 4 and 52 DF, p-value: 0.0001346
LM with all variables: R-squared and Adjusted R-squared tells us that the model is not a good fit for the data and it’s an overfit. Many of the coefficients have high p-values.
Backwards elminination removed the high p-value variables and gave the model a better fit, but the resulting model described the data even less with a lower R-squared value.
Residuals
hist(PercOfState$residuals)
hist(rank_usa$residuals)
qqnorm(PercOfState$residuals, main = "Q-Q Plot for State Percentile Rankings")
qqline(PercOfState$residuals)
qqnorm(rank_usa$residuals, main = "Q-Q Plot for National Rankings")
qqline(rank_usa$residuals)
Unfortunately, deeper analysis will have to be done to find the main factors that influence a school district’s performance. Using only these variables alone will not reliably whether a school district will have high test scores and graduation rates.
library(ggmap)
library(maps)
library(mapdata)
library(zipcode)
states <- map_data("state")
data(zipcode)
final$ZipCode <- clean.zipcodes(final$ZipCode)
zip <- subset(final, select = c(PercentileOfState, ZipCode, RANK_IN_US))
zip <- merge(zip, zipcode, by.x="ZipCode", by.y="zip")
ggplot(data = states) +
geom_polygon(aes(x = long, y = lat, group = group), fill = "grey50", color = "black") +
coord_fixed(1.3) +
xlab("") +
ylab("") +
guides(fill=FALSE)+
ggtitle("Locations of Nationally Ranked School Districts", subtitle = "Green = high-ranked, Red = low-ranked, Blue = unranked") +
geom_point(
aes(x=longitude, y=latitude, show_guide = TRUE, colour=RANK_IN_US),
data=zip, na.rm = F) +
scale_color_gradient(low="green", high="red", na.value = "blue") +
theme(axis.ticks = element_blank(),
axis.text = element_blank())
ggplot(data = states) +
geom_polygon(aes(x = long, y = lat, group = group ), fill = "grey50", color = "black") +
coord_fixed(1.3) +
guides(fill=FALSE)+
xlab("") +
ylab("") +
ggtitle("State Percentiles of School Districts", subtitle = "Red = bottom-percentile, Green = top-percentile") +
geom_point(
aes(x=longitude, y=latitude, show_guide = TRUE, colour=PercentileOfState),
data=zip, na.rm = F) +
scale_color_gradient(low="red", high="green") +
theme(axis.ticks = element_blank(),
axis.text = element_blank())
This information is from the Common Core of Data (CCD) for the 2014-2015 school year with directory information from the 2016-2017 preliminary data file. This data is collected annually directly from State Education Agencies (SEAs).
SchoolDigger’s sources are the National Center for Education Statistics, U.S. Department of Education, the U.S. Census Bureau and each state’s Department of Education
Niche’s School Districts ranking is based on rigorous analysis of key statistics and millions of reviews from students and parents using data from the U.S. Department of Education. Ranking factors include state test scores, college readiness, graduation rates, SAT/ACT scores, teacher quality, public school district ratings, and more.
save(final, file = "final.RData")
save(topbottom, file = "topbottom.RData")