Difference in Good vs Bad School Districts

The Problem

School districts are usually well-regarded and ranked higher if they average high test scores and graduation rates. The problem I want to solve is whether there are certain characteristics that have a greater impact on these rankings than others.

To narrow down the scope of this project, I focused on school districts that were around the same size and population as the one I currently work at. The National Center for Education Statistics [NCES] had a peer matching tool that matches districts across the country based on type, locale, enrollment, student/teacher ratio and poverty. 99 similar districts in 31 states were selected. To help compare the districts, I needed an initial ranking to compare against.

The school district that the rest are matched against is located in a large suburb with around 24,500 students enrolled, student/teacher ratio of 16.18 and a poverty level of 21.3%.

The Data

The NCES data was downloaded as .csv files. National rankings and state rankings were pulled manually from Niche.com and SchoolDigger.com and stored in a MySQL database.

Some of the variables that will be analyzed are:

Characteristics - Total Schools, Total Students, Student/Teacher ratio, etc…

Expenditures - Instructional Expenditures, Student and Staff support, etc…

Fiscal - Federal and State Revenues, etc…

Although the disctricts are similar, the variance in quality can be considered large. The top district has a national rank of 196 and the bottom is unranked nationally and in the 0 percentile of the respective state.

Import NCES Data

This website separated the data for all schools into five separate files.

expenditures <- read.csv("https://raw.githubusercontent.com/smithchad17/Class607/master/607%20Final/expenditures.csv", header = T, stringsAsFactors = F)

finances <- read.csv("https://raw.githubusercontent.com/smithchad17/Class607/master/607%20Final/finances.csv", header = T, stringsAsFactors = F)

poverty <- read.csv("https://raw.githubusercontent.com/smithchad17/Class607/master/607%20Final/poverty.csv", header = T, stringsAsFactors = F)

revenue <- read.csv("https://raw.githubusercontent.com/smithchad17/Class607/master/607%20Final/revenue.csv", header = T, stringsAsFactors = F)

students <- read.csv("https://raw.githubusercontent.com/smithchad17/Class607/master/607%20Final/students.csv", header = T, stringsAsFactors = F)

head(expenditures)

##     LEAID                  DistrictName          State GradeSpanLow
## 1 4500720                      AIKEN 01 South Carolina           PK
## 2 4808090                     ALVIN ISD          Texas           PK
## 3  400680 AMPHITHEATER UNIFIED DISTRICT        Arizona           PK
## 4 2200090              ASCENSION PARISH      Louisiana           PK
## 5 1704680           AURORA EAST USD 131       Illinois           PK
## 6 1704710           AURORA WEST USD 129       Illinois           PK
##   GradeSpanHigh ZipCode TotalCurrentExpendituresPerStudent
## 1            12   29803                           8120.246
## 2            12   77511                           8208.390
## 3            12   85705                           7554.084
## 4            12   70346                          10356.297
## 5            12   60505                           9743.412
## 6            12   60506                          12781.889
##   InstructionalExpendituresPerStudent StudentandStaffSupportPerStudent
## 1                            4729.204                        1118.5736
## 2                            4943.006                         678.1766
## 3                            4136.213                        1019.6603
## 4                            6126.082                        1217.4381
## 5                            5064.692                         961.1849
## 6                            8407.229                        1163.3113
##   AdministrationPerStudent OperationsPerStudent TotalStudents
## 1                 832.8613             1439.607         24716
## 2                 998.8389             1588.369         19809
## 3                 613.4874             1784.723         14191
## 4                1100.6693             1912.107         21367
## 5                1829.4178             1888.117         14685
## 6                1237.4660             1973.883         12865

Import SQL Data

Create the MySQL driver and connect to the database. Send a query to collect all the information and store it in a data frame.

#Some settings will need to be changed if you're running it from another MySQL server
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "schoolrank",
                 host = "localhost",
                 port = 3306,
                 user = "chad",
                 password = "")

q <- "SELECT * FROM rankings;"
schoolrank <- dbGetQuery(con, q)
head(schoolrank)

##    LEAID                      DISTRICT STATE DISTRICTS_IN_STATE
## 1 101290                 ELMORE COUNTY    AL                126
## 2 101920              JEFFERSON COUNTY    AL                126
## 3 102070                    LEE COUNTY    AL                126
## 4 102220                MADISON COUNTY    AL                126
## 5 103030                 SHELBY COUNTY    AL                126
## 6 400680 AMPHITHEATER UNIFIED DISTRICT    AZ                452
##   RANK_IN_STATE RANK_IN_US
## 1            27         NA
## 2            81         NA
## 3            51         NA
## 4            20       2891
## 5            22       2282
## 6           178       2761

Tidy Data

Merge data frames by LEAID.

Merges the five data frames into one.

df1 <- merge(schoolrank, poverty, by = "LEAID")
df2 <- merge(df1, expenditures, by = "LEAID")
df3 <- merge(df2, finances, by = "LEAID")
df4 <- merge(df3, revenue, by = "LEAID")
final <- merge(df4, students, by = "LEAID")

head(final)

##    LEAID                      DISTRICT STATE DISTRICTS_IN_STATE
## 1 101290                 ELMORE COUNTY    AL                126
## 2 101920              JEFFERSON COUNTY    AL                126
## 3 102070                    LEE COUNTY    AL                126
## 4 102220                MADISON COUNTY    AL                126
## 5 103030                 SHELBY COUNTY    AL                126
## 6 400680 AMPHITHEATER UNIFIED DISTRICT    AZ                452
##   RANK_IN_STATE RANK_IN_US                          NAME FIPST MSTATE
## 1            27         NA                 ELMORE COUNTY     1     AL
## 2            81         NA              JEFFERSON COUNTY     1     AL
## 3            51         NA                    LEE COUNTY     1     AL
## 4            20       2891                MADISON COUNTY     1     AL
## 5            22       2282                 SHELBY COUNTY     1     AL
## 6           178       2761 AMPHITHEATER UNIFIED DISTRICT     4     AZ
##   FIPST_NAME GSLO GSHI PercentPoverty                DistrictName.x
## 1    Alabama   PK   12       15.66762                 ELMORE COUNTY
## 2    Alabama   PK   12       16.51225              JEFFERSON COUNTY
## 3    Alabama   PK   12       14.74384                    LEE COUNTY
## 4    Alabama   PK   12       10.18080                MADISON COUNTY
## 5    Alabama   PK   12        9.70268                 SHELBY COUNTY
## 6    Arizona   PK   12       19.84297 AMPHITHEATER UNIFIED DISTRICT
##   State.x GradeSpanLow.x GradeSpanHigh.x ZipCode.x
## 1 Alabama             PK              12     36092
## 2 Alabama             PK              12     35209
## 3 Alabama             PK              12     36803
## 4 Alabama             PK              12     35804
## 5 Alabama             PK              12     35051
## 6 Arizona             PK              12     85705
##   TotalCurrentExpendituresPerStudent InstructionalExpendituresPerStudent
## 1                           7754.584                            4732.686
## 2                           8803.552                            5069.967
## 3                           8603.229                            5209.912
## 4                           8315.232                            4847.829
## 5                           9169.845                            5269.324
## 6                           7554.084                            4136.213
##   StudentandStaffSupportPerStudent AdministrationPerStudent
## 1                         634.4134                 837.3687
## 2                         826.4232                 988.8959
## 3                         735.9602                 785.2138
## 4                         777.9748                 812.5728
## 5                        1055.7236                 804.0247
## 6                        1019.6603                 613.4874
##   OperationsPerStudent TotalStudents.x                DistrictName.y
## 1             1550.116           11234                 ELMORE COUNTY
## 2             1918.266           36203              JEFFERSON COUNTY
## 3             1872.144            9847                    LEE COUNTY
## 4             1876.855           19741                MADISON COUNTY
## 5             2040.773           22809                 SHELBY COUNTY
## 6             1784.723           14191 AMPHITHEATER UNIFIED DISTRICT
##   State.y GradeSpanLow.y GradeSpanHigh.y ZipCode.y
## 1 Alabama             PK              12     36092
## 2 Alabama             PK              12     35209
## 3 Alabama             PK              12     36803
## 4 Alabama             PK              12     35804
## 5 Alabama             PK              12     35051
## 6 Arizona             PK              12     85705
##   TotalExpendituresPerStudent TotalCapitalOutlayPerStudent
## 1                    8396.564                     273.9897
## 2                    9943.126                     301.7706
## 3                    9347.111                     376.3583
## 4                   11106.276                    2213.8696
## 5                   10733.833                     930.1153
## 6                    9519.202                    1613.4170
##   ConstructionPerStudent NonElSecEducationPerStudent
## 1               161.2071                    229.3929
## 2               221.6391                    137.1433
## 3               277.0387                    154.4633
## 4              2036.6749                    206.9804
## 5               869.4813                    225.4373
## 6              1367.8388                     31.5693
##   InterestOnDebtPerStudent                DistrictName.x State.x
## 1                 146.8756                 ELMORE COUNTY Alabama
## 2                 707.0409              JEFFERSON COUNTY Alabama
## 3                 226.5665                    LEE COUNTY Alabama
## 4                 377.3365                MADISON COUNTY Alabama
## 5                 418.6505                 SHELBY COUNTY Alabama
## 6                 313.8609 AMPHITHEATER UNIFIED DISTRICT Arizona
##   GradeSpanLow.x GradeSpanHigh.x ZipCode.x TotalRevenuePerStudent
## 1             PK              12     36092               8494.748
## 2             PK              12     35209               9507.776
## 3             PK              12     36803               9258.150
## 4             PK              12     35804               9631.123
## 5             PK              12     35051              10294.007
## 6             PK              12     85705               9398.703
##   TotalFederalRevenuePerStudent TotalLocalRevenuePerStudent
## 1                      754.4063                    2316.806
## 2                      964.1742                    3315.720
## 3                      734.3353                    3038.387
## 4                      666.3796                    3517.299
## 5                      592.6608                    4523.215
## 6                      961.2430                    6259.390
##   TotalStateRevenuePerStudent                DistrictName.y State.y
## 1                    5423.536                 ELMORE COUNTY Alabama
## 2                    5227.882              JEFFERSON COUNTY Alabama
## 3                    5485.427                    LEE COUNTY Alabama
## 4                    5447.444                MADISON COUNTY Alabama
## 5                    5178.131                 SHELBY COUNTY Alabama
## 6                    2178.071 AMPHITHEATER UNIFIED DISTRICT Arizona
##   GradeSpanLow.y GradeSpanHigh.y ZipCode.y TotalSchools
## 1             PK              12     36092           15
## 2             PK              12     35209           62
## 3             PK              12     36803           14
## 4             PK              12     35804           29
## 5             PK              12     35051           36
## 6             PK              12     85705           22
##   TotalFiscalStudents TotalStudents.y ClassroomTeachersFTE
## 1               11234           11234               625.00
## 2               36203           36203              2209.50
## 3                9847            9847               629.00
## 4               19741           19741              1167.90
## 5               22809           22809              1426.10
## 6               14191           14191               837.85
##   StudentTeacherRatio ELLStudents
## 1            17.97440         122
## 2            16.38516        1068
## 3            15.65501          58
## 4            16.90299         127
## 5            15.99397         899
## 6            16.93740         732

Remove duplicate columns

Some of the files had the exact columns and values before they were merged. The duplicates were removed after the merge.

final <- subset(final, select = c(LEAID, STATE, RANK_IN_STATE, GSLO, PercentPoverty, TotalCurrentExpendituresPerStudent, StudentandStaffSupportPerStudent, OperationsPerStudent, ZipCode.y, TotalCapitalOutlayPerStudent, NonElSecEducationPerStudent, TotalFederalRevenuePerStudent, TotalStateRevenuePerStudent, TotalSchools, StudentTeacherRatio, DISTRICT, DISTRICTS_IN_STATE, RANK_IN_US, FIPST_NAME,  GSHI, InstructionalExpendituresPerStudent, AdministrationPerStudent, TotalExpendituresPerStudent, ConstructionPerStudent, InterestOnDebtPerStudent, TotalRevenuePerStudent, TotalLocalRevenuePerStudent, ClassroomTeachersFTE, ELLStudents))

colnames(final)

##  [1] "LEAID"                              
##  [2] "STATE"                              
##  [3] "RANK_IN_STATE"                      
##  [4] "GSLO"                               
##  [5] "PercentPoverty"                     
##  [6] "TotalCurrentExpendituresPerStudent" 
##  [7] "StudentandStaffSupportPerStudent"   
##  [8] "OperationsPerStudent"               
##  [9] "ZipCode.y"                          
## [10] "TotalCapitalOutlayPerStudent"       
## [11] "NonElSecEducationPerStudent"        
## [12] "TotalFederalRevenuePerStudent"      
## [13] "TotalStateRevenuePerStudent"        
## [14] "TotalSchools"                       
## [15] "StudentTeacherRatio"                
## [16] "DISTRICT"                           
## [17] "DISTRICTS_IN_STATE"                 
## [18] "RANK_IN_US"                         
## [19] "FIPST_NAME"                         
## [20] "GSHI"                               
## [21] "InstructionalExpendituresPerStudent"
## [22] "AdministrationPerStudent"           
## [23] "TotalExpendituresPerStudent"        
## [24] "ConstructionPerStudent"             
## [25] "InterestOnDebtPerStudent"           
## [26] "TotalRevenuePerStudent"             
## [27] "TotalLocalRevenuePerStudent"        
## [28] "ClassroomTeachersFTE"               
## [29] "ELLStudents"

Mutate columns to see what ranking percentile of the state the district is in.

Divide the rank of the district in the state by the total number of districts the state has.

final <- final %>% mutate(PercentileOfState = round((1 - RANK_IN_STATE/DISTRICTS_IN_STATE)*100,2))

Reorder the columns by column index

This was a personal preference so similar columns were grouped together.

final <- final[,c(1,2,19,9,16,14,17,3,18,30,4,5,6,7,8,10,11,12,13,15,20,21,22,23,24,25,26,27,28,29)]

Rename columns

Make some of the column names more readable.

colnames(final)[2] <- "StateAbbr"
colnames(final)[3] <- "State"
colnames(final)[4] <- "ZipCode"

colnames(final)

##  [1] "LEAID"                              
##  [2] "StateAbbr"                          
##  [3] "State"                              
##  [4] "ZipCode"                            
##  [5] "DISTRICT"                           
##  [6] "TotalSchools"                       
##  [7] "DISTRICTS_IN_STATE"                 
##  [8] "RANK_IN_STATE"                      
##  [9] "RANK_IN_US"                         
## [10] "PercentileOfState"                  
## [11] "GSLO"                               
## [12] "PercentPoverty"                     
## [13] "TotalCurrentExpendituresPerStudent" 
## [14] "StudentandStaffSupportPerStudent"   
## [15] "OperationsPerStudent"               
## [16] "TotalCapitalOutlayPerStudent"       
## [17] "NonElSecEducationPerStudent"        
## [18] "TotalFederalRevenuePerStudent"      
## [19] "TotalStateRevenuePerStudent"        
## [20] "StudentTeacherRatio"                
## [21] "GSHI"                               
## [22] "InstructionalExpendituresPerStudent"
## [23] "AdministrationPerStudent"           
## [24] "TotalExpendituresPerStudent"        
## [25] "ConstructionPerStudent"             
## [26] "InterestOnDebtPerStudent"           
## [27] "TotalRevenuePerStudent"             
## [28] "TotalLocalRevenuePerStudent"        
## [29] "ClassroomTeachersFTE"               
## [30] "ELLStudents"

Eye Test

Find top 10 nationally-ranked school districts

A column ‘type’ was created with parameter ‘top10’ so it will be easier to group later.

#Remove schools that are not ranked. Sort by ascending order and take the top 10.
#If a school is not ranked nationally, the RANK_IN_US value is 'NA'. na.last = NA parameter removes them.
Top10 <- head(final[order(final$RANK_IN_US, na.last = NA, decreasing = F),],10)
Top10$type <- "top10"

Find bottom 10 ranked school districts.

Since the under-performing school districts are not nationally ranked. I’ll sort the data frame by what percentile the district is in for its state.

A column ‘type’ was created with parameter ‘bottom10’ so it will be easier to group later.

bottom <- final %>% filter(is.na(RANK_IN_US))
Bottom10 <- head(bottom[order(bottom$PercentileOfState, decreasing = F),],10)
Bottom10$type <- "bottom10"

Top 10 summary

summary(Top10)

##      LEAID          StateAbbr            State              ZipCode     
##  Min.   :1704170   Length:10          Length:10          Min.   :55126  
##  1st Qu.:3244770   Class :character   Class :character   1st Qu.:71765  
##  Median :4820295   Mode  :character   Mode  :character   Median :76090  
##  Mean   :4041132                                         Mean   :72472  
##  3rd Qu.:4828335                                         3rd Qu.:77447  
##  Max.   :4842960                                         Max.   :78660  
##    DISTRICT          TotalSchools   DISTRICTS_IN_STATE RANK_IN_STATE  
##  Length:10          Min.   :11.00   Min.   : 99.0      Min.   : 17.0  
##  Class :character   1st Qu.:21.75   1st Qu.:817.5      1st Qu.: 85.5  
##  Mode  :character   Median :29.50   Median :933.0      Median :160.5  
##                     Mean   :28.10   Mean   :785.0      Mean   :170.0  
##                     3rd Qu.:33.75   3rd Qu.:933.0      3rd Qu.:252.2  
##                     Max.   :43.00   Max.   :933.0      Max.   :326.0  
##    RANK_IN_US    PercentileOfState     GSLO           PercentPoverty 
##  Min.   :196.0   Min.   :65.06     Length:10          Min.   :10.11  
##  1st Qu.:374.2   1st Qu.:72.97     Class :character   1st Qu.:11.16  
##  Median :452.0   Median :79.94     Mode  :character   Median :12.80  
##  Mean   :445.2   Mean   :78.99                        Mean   :12.80  
##  3rd Qu.:535.2   3rd Qu.:84.70                        3rd Qu.:14.66  
##  Max.   :682.0   Max.   :95.18                        Max.   :15.51  
##  TotalCurrentExpendituresPerStudent StudentandStaffSupportPerStudent
##  Min.   : 7611                      Min.   : 555.9                  
##  1st Qu.: 8182                      1st Qu.: 728.6                  
##  Median : 8266                      Median : 815.7                  
##  Mean   :10474                      Mean   :1045.7                  
##  3rd Qu.:11286                      3rd Qu.: 921.3                  
##  Max.   :20936                      Max.   :3087.4                  
##  OperationsPerStudent TotalCapitalOutlayPerStudent
##  Min.   :1257         Min.   : 335.2              
##  1st Qu.:1419         1st Qu.: 897.8              
##  Median :1573         Median :1929.0              
##  Mean   :2266         Mean   :1644.1              
##  3rd Qu.:2075         3rd Qu.:2176.8              
##  Max.   :6887         Max.   :2663.3              
##  NonElSecEducationPerStudent TotalFederalRevenuePerStudent
##  Min.   : 15.30              Min.   : 456.1               
##  1st Qu.: 22.51              1st Qu.: 555.1               
##  Median : 58.73              Median : 644.5               
##  Mean   :151.62              Mean   : 699.1               
##  3rd Qu.:195.09              3rd Qu.: 819.7               
##  Max.   :706.53              Max.   :1041.5               
##  TotalStateRevenuePerStudent StudentTeacherRatio      GSHI   
##  Min.   :1914                Min.   :14.55       Min.   :12  
##  1st Qu.:3153                1st Qu.:15.52       1st Qu.:12  
##  Median :4223                Median :16.10       Median :12  
##  Mean   :4217                Mean   :16.07       Mean   :12  
##  3rd Qu.:4471                3rd Qu.:16.52       3rd Qu.:12  
##  Max.   :9620                Max.   :17.55       Max.   :12  
##  InstructionalExpendituresPerStudent AdministrationPerStudent
##  Min.   : 4838                       Min.   : 777.6          
##  1st Qu.: 5008                       1st Qu.: 833.5          
##  Median : 5094                       Median : 853.5          
##  Mean   : 6164                       Mean   : 998.8          
##  3rd Qu.: 5913                       3rd Qu.: 943.5          
##  Max.   :12601                       Max.   :2053.0          
##  TotalExpendituresPerStudent ConstructionPerStudent
##  Min.   : 9442               Min.   : 140.6        
##  1st Qu.:10512               1st Qu.: 700.0        
##  Median :11463               Median :1464.3        
##  Mean   :12935               Mean   :1373.4        
##  3rd Qu.:12829               3rd Qu.:2033.1        
##  Max.   :23478               Max.   :2314.2        
##  InterestOnDebtPerStudent TotalRevenuePerStudent
##  Min.   : 212.4           Min.   : 9689         
##  1st Qu.: 431.9           1st Qu.: 9790         
##  Median : 784.6           Median :10416         
##  Mean   : 700.3           Mean   :12191         
##  3rd Qu.: 901.6           3rd Qu.:12053         
##  Max.   :1218.6           Max.   :23843         
##  TotalLocalRevenuePerStudent ClassroomTeachersFTE  ELLStudents  
##  Min.   : 4096               Min.   : 646.9       Min.   : 406  
##  1st Qu.: 4978               1st Qu.: 776.4       1st Qu.: 634  
##  Median : 5815               Median :1347.5       Median :1588  
##  Mean   : 7275               Mean   :1249.2       Mean   :1863  
##  3rd Qu.: 7455               3rd Qu.:1562.5       3rd Qu.:3034  
##  Max.   :18759               Max.   :2105.1       Max.   :3809  
##      type          
##  Length:10         
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Bottom 10 summary

summary(Bottom10)

##      LEAID          StateAbbr            State              ZipCode     
##  Min.   : 801950   Length:10          Length:10          Min.   : 6050  
##  1st Qu.:1176342   Class :character   Class :character   1st Qu.:29808  
##  Median :1738120   Mode  :character   Mode  :character   Median :60295  
##  Mean   :2406114                                         Mean   :51341  
##  3rd Qu.:3431962                                         3rd Qu.:72095  
##  Max.   :4841220                                         Max.   :80022  
##                                                                         
##    DISTRICT          TotalSchools   DISTRICTS_IN_STATE RANK_IN_STATE  
##  Length:10          Min.   : 9.00   Min.   : 17.0      Min.   : 17.0  
##  Class :character   1st Qu.:13.50   1st Qu.:222.5      1st Qu.:215.8  
##  Mode  :character   Median :17.50   Median :771.5      Median :718.0  
##                     Mean   :19.30   Mean   :570.9      Mean   :539.2  
##                     3rd Qu.:22.75   3rd Qu.:779.0      3rd Qu.:739.0  
##                     Max.   :41.00   Max.   :933.0      Max.   :891.0  
##                                                                       
##    RANK_IN_US  PercentileOfState     GSLO           PercentPoverty 
##  Min.   : NA   Min.   :0.000     Length:10          Min.   :13.93  
##  1st Qu.: NA   1st Qu.:1.815     Class :character   1st Qu.:17.90  
##  Median : NA   Median :4.880     Mode  :character   Median :19.66  
##  Mean   :NaN   Mean   :4.208                        Mean   :19.70  
##  3rd Qu.: NA   3rd Qu.:6.218                        3rd Qu.:22.91  
##  Max.   : NA   Max.   :7.200                        Max.   :24.30  
##  NA's   :10                                                        
##  TotalCurrentExpendituresPerStudent StudentandStaffSupportPerStudent
##  Min.   : 7816                      Min.   : 640.1                  
##  1st Qu.: 9751                      1st Qu.: 809.3                  
##  Median :10608                      Median : 973.6                  
##  Mean   :11655                      Mean   :1105.9                  
##  3rd Qu.:13235                      3rd Qu.:1217.1                  
##  Max.   :18240                      Max.   :2106.8                  
##                                                                     
##  OperationsPerStudent TotalCapitalOutlayPerStudent
##  Min.   :1563         Min.   : 127.9              
##  1st Qu.:1762         1st Qu.: 188.7              
##  Median :1838         Median : 295.2              
##  Mean   :2195         Mean   : 426.5              
##  3rd Qu.:2739         3rd Qu.: 570.4              
##  Max.   :3301         Max.   :1172.1              
##                                                   
##  NonElSecEducationPerStudent TotalFederalRevenuePerStudent
##  Min.   :  24.07             Min.   : 840.6               
##  1st Qu.:  71.51             1st Qu.: 994.9               
##  Median : 106.74             Median :1117.1               
##  Mean   : 276.97             Mean   :1208.1               
##  3rd Qu.: 380.63             3rd Qu.:1326.2               
##  Max.   :1255.89             Max.   :1809.2               
##                                                           
##  TotalStateRevenuePerStudent StudentTeacherRatio      GSHI   
##  Min.   : 3585               Min.   :14.56       Min.   :12  
##  1st Qu.: 5245               1st Qu.:16.06       1st Qu.:12  
##  Median : 7694               Median :16.69       Median :12  
##  Mean   : 7545               Mean   :16.58       Mean   :12  
##  3rd Qu.: 8151               3rd Qu.:17.36       3rd Qu.:12  
##  Max.   :12864               Max.   :17.80       Max.   :12  
##                                                              
##  InstructionalExpendituresPerStudent AdministrationPerStudent
##  Min.   : 4392                       Min.   : 859.6          
##  1st Qu.: 4958                       1st Qu.:1156.9          
##  Median : 6060                       Median :1424.6          
##  Mean   : 6972                       Mean   :1381.1          
##  3rd Qu.: 7850                       3rd Qu.:1566.7          
##  Max.   :12466                       Max.   :1829.4          
##                                                              
##  TotalExpendituresPerStudent ConstructionPerStudent
##  Min.   : 8777               Min.   :  58.81       
##  1st Qu.:10708               1st Qu.:  90.90       
##  Median :11774               Median : 115.07       
##  Mean   :12638               Mean   : 268.00       
##  3rd Qu.:13913               3rd Qu.: 324.05       
##  Max.   :19271               Max.   :1031.58       
##                                                    
##  InterestOnDebtPerStudent TotalRevenuePerStudent
##  Min.   : 88.65           Min.   : 9432         
##  1st Qu.:194.77           1st Qu.:10616         
##  Median :286.20           Median :12937         
##  Mean   :371.47           Mean   :13227         
##  3rd Qu.:526.72           3rd Qu.:14026         
##  Max.   :855.69           Max.   :19774         
##                                                 
##  TotalLocalRevenuePerStudent ClassroomTeachersFTE  ELLStudents  
##  Min.   :2950                Min.   : 428.9       Min.   : 156  
##  1st Qu.:3737                1st Qu.: 580.7       1st Qu.:1009  
##  Median :4320                Median : 722.6       Median :2326  
##  Mean   :4474                Mean   : 867.1       Mean   :2924  
##  3rd Qu.:5159                3rd Qu.: 919.0       3rd Qu.:4737  
##  Max.   :6025                Max.   :2360.4       Max.   :6240  
##                                                                 
##      type          
##  Length:10         
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

Some highlights are that the bottom 10 districts receive more total revenue per student and pay for more administration. The top 10 districts spent more on construction and equipment per student as well as having a lower poverty level.

Visualize the Difference

Transform The Data

The data frame with the top 10 and bottom 10 districts are joined and the columns that have dollar values are gathered under a single column called ‘category’. The columns that have non-dollar values are gathered into one column called ‘non_fund_category’. Their values will be under another column called ‘non_fund_value’.

A column ‘mean_dollars’ will hold the mean value from each ‘category’ and a column ‘non_fund_value’ will hold the mean value from each ‘non_fund_category’

#Combine the Top10 and Bottom10 data frames
topbottom <- rbind(Top10, Bottom10)

#Take all the columns that have dollar values and gather them into one column called category. Their values will be under another column called category_dollars. Group them by category and type and create another column with the mean dollar value from each category.
topbottom <- topbottom %>% 
  gather(category, category_dollars, select = c(TotalCurrentExpendituresPerStudent, StudentandStaffSupportPerStudent, OperationsPerStudent, TotalCapitalOutlayPerStudent, NonElSecEducationPerStudent, TotalFederalRevenuePerStudent, TotalStateRevenuePerStudent, InstructionalExpendituresPerStudent, AdministrationPerStudent, TotalExpendituresPerStudent, ConstructionPerStudent, InterestOnDebtPerStudent, TotalRevenuePerStudent, TotalLocalRevenuePerStudent)) %>%
  group_by(category, type) %>%
  mutate(mean_dollars = round(mean(category_dollars),2))

#Take all the columns that have non-dollar values and gather them into one column called non_fund_category. Their values will be under another column called non_fund_value. Group them by category and type and create another column with the mean dollar value from each category.
topbottom <- topbottom %>% 
  gather(non_fund_category, non_fund_value, select = c(PercentPoverty, StudentTeacherRatio, ClassroomTeachersFTE, ELLStudents)) %>%
  group_by(non_fund_category, type) %>%
  mutate(mean_non_fund = round(mean(non_fund_value), 2))

#Get only the total revenue and expenditures columns
just_total <- filter(topbottom, category == "TotalExpendituresPerStudent" | category == "TotalRevenuePerStudent")

Plot The Data

ggplot(just_total, aes(x = category, y=mean_dollars, fill = type)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Bottom 10 vs Top 10 Districts") +
  geom_text(aes(label=mean_dollars,hjust=0.4, vjust=-0.1),position=position_dodge(width = 1))

ggplot(topbottom, aes(x = category, y=mean_dollars, fill = type)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Bottom 10 vs Top 10 Districts")

ggplot(topbottom, aes(x = non_fund_category, y=mean_non_fund, fill = type)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Bottom 10 vs Top 10 Districts") +
  geom_text(aes(label=mean_non_fund,hjust=0.4, vjust=-0.1),position=position_dodge(width = 1))

Lineary Regression

Rank percentile of the state is the response variable.

r <- lm(PercentileOfState ~ TotalCurrentExpendituresPerStudent + OperationsPerStudent + NonElSecEducationPerStudent + AdministrationPerStudent + ConstructionPerStudent + TotalRevenuePerStudent + ClassroomTeachersFTE + PercentPoverty + StudentandStaffSupportPerStudent + TotalCapitalOutlayPerStudent + StudentTeacherRatio + InstructionalExpendituresPerStudent + InterestOnDebtPerStudent + TotalExpendituresPerStudent + TotalFederalRevenuePerStudent, data = final)

Backwards Elimination

r2 <- lm(PercentileOfState ~   ClassroomTeachersFTE + PercentPoverty  + ELLStudents , data = final)

PercOfState <- summary(r)
PercOfState

## 
## Call:
## lm(formula = PercentileOfState ~ TotalCurrentExpendituresPerStudent + 
##     OperationsPerStudent + NonElSecEducationPerStudent + AdministrationPerStudent + 
##     ConstructionPerStudent + TotalRevenuePerStudent + ClassroomTeachersFTE + 
##     PercentPoverty + StudentandStaffSupportPerStudent + TotalCapitalOutlayPerStudent + 
##     StudentTeacherRatio + InstructionalExpendituresPerStudent + 
##     InterestOnDebtPerStudent + TotalExpendituresPerStudent + 
##     TotalFederalRevenuePerStudent, data = final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.627 -12.606   1.973  11.952  46.038 
## 
## Coefficients: (1 not defined because of singularities)
##                                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)                          1.441e+02  4.227e+01   3.409  0.00100
## TotalCurrentExpendituresPerStudent   6.837e-04  9.459e-03   0.072  0.94255
## OperationsPerStudent                 1.192e-03  5.914e-03   0.201  0.84080
## NonElSecEducationPerStudent         -3.927e-04  9.412e-03  -0.042  0.96682
## AdministrationPerStudent            -2.101e-02  1.172e-02  -1.793  0.07652
## ConstructionPerStudent              -1.730e-02  1.433e-02  -1.208  0.23061
## TotalRevenuePerStudent               8.791e-04  3.414e-03   0.258  0.79741
## ClassroomTeachersFTE                 8.022e-03  5.314e-03   1.510  0.13487
## PercentPoverty                      -2.077e+00  6.814e-01  -3.048  0.00308
## StudentandStaffSupportPerStudent     9.286e-03  7.165e-03   1.296  0.19849
## TotalCapitalOutlayPerStudent         2.316e-02  1.716e-02   1.349  0.18083
## StudentTeacherRatio                 -2.140e+00  2.239e+00  -0.956  0.34175
## InstructionalExpendituresPerStudent         NA         NA      NA       NA
## InterestOnDebtPerStudent            -3.550e-03  1.192e-02  -0.298  0.76658
## TotalExpendituresPerStudent         -3.100e-03  1.083e-02  -0.286  0.77546
## TotalFederalRevenuePerStudent       -1.581e-02  8.739e-03  -1.809  0.07408
##                                       
## (Intercept)                         **
## TotalCurrentExpendituresPerStudent    
## OperationsPerStudent                  
## NonElSecEducationPerStudent           
## AdministrationPerStudent            . 
## ConstructionPerStudent                
## TotalRevenuePerStudent                
## ClassroomTeachersFTE                  
## PercentPoverty                      **
## StudentandStaffSupportPerStudent      
## TotalCapitalOutlayPerStudent          
## StudentTeacherRatio                   
## InstructionalExpendituresPerStudent   
## InterestOnDebtPerStudent              
## TotalExpendituresPerStudent           
## TotalFederalRevenuePerStudent       . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.2 on 84 degrees of freedom
## Multiple R-squared:  0.465,  Adjusted R-squared:  0.3758 
## F-statistic: 5.214 on 14 and 84 DF,  p-value: 5.733e-07

PercOfState2 <- summary(r2)
PercOfState2

## 
## Call:
## lm(formula = PercentileOfState ~ ClassroomTeachersFTE + PercentPoverty + 
##     ELLStudents, data = final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.976 -15.205  -1.337  16.455  41.029 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          65.796982  11.163898   5.894 5.73e-08 ***
## ClassroomTeachersFTE  0.024121   0.005351   4.508 1.87e-05 ***
## PercentPoverty       -2.046776   0.520738  -3.931 0.000161 ***
## ELLStudents          -0.006148   0.001538  -3.998 0.000126 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.66 on 95 degrees of freedom
## Multiple R-squared:  0.4239, Adjusted R-squared:  0.4058 
## F-statistic: 23.31 on 3 and 95 DF,  p-value: 2.167e-11

LM with all variables: R-squared and Adjusted R-squared tells us that the model is not a good fit for the data and it’s an overfit. Many of the coefficients have high p-values.

Backwards elminination removed the high p-value variables and gave the model a better fit, but the resulting model described the data even less with a lower R-squared value.

RANK_IN_US is the response variable.

Only the districts that are nationally ranked are analyzed.

only_ranked <- final %>% filter(!is.na(RANK_IN_US))
not_ranked <- final %>% filter(is.na(RANK_IN_US))

r3 <- lm(RANK_IN_US ~ TotalCurrentExpendituresPerStudent + OperationsPerStudent + AdministrationPerStudent + ConstructionPerStudent + TotalRevenuePerStudent + PercentPoverty + StudentandStaffSupportPerStudent + TotalCapitalOutlayPerStudent + StudentTeacherRatio + InterestOnDebtPerStudent + TotalExpendituresPerStudent + TotalFederalRevenuePerStudent, data = only_ranked)

Backwards Elimination

r4 <- lm(RANK_IN_US ~  PercentPoverty +  TotalCapitalOutlayPerStudent + StudentTeacherRatio + InterestOnDebtPerStudent, data = only_ranked)

rank_usa <- summary(r3)
rank_usa

## 
## Call:
## lm(formula = RANK_IN_US ~ TotalCurrentExpendituresPerStudent + 
##     OperationsPerStudent + AdministrationPerStudent + ConstructionPerStudent + 
##     TotalRevenuePerStudent + PercentPoverty + StudentandStaffSupportPerStudent + 
##     TotalCapitalOutlayPerStudent + StudentTeacherRatio + InterestOnDebtPerStudent + 
##     TotalExpendituresPerStudent + TotalFederalRevenuePerStudent, 
##     data = only_ranked)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2046.3  -663.3  -126.5   639.3  2779.2 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)                        -3017.1485  3189.3744  -0.946  0.34931
## TotalCurrentExpendituresPerStudent    -1.0385     0.6462  -1.607  0.11517
## OperationsPerStudent                  -0.2596     0.3871  -0.671  0.50595
## AdministrationPerStudent               1.3247     1.0688   1.239  0.22175
## ConstructionPerStudent                 0.6445     1.0236   0.630  0.53220
## TotalRevenuePerStudent                -0.5960     0.2621  -2.274  0.02789
## PercentPoverty                        84.6239    49.3982   1.713  0.09373
## StudentandStaffSupportPerStudent       0.1811     0.6973   0.260  0.79625
## TotalCapitalOutlayPerStudent          -2.3721     1.2204  -1.944  0.05835
## StudentTeacherRatio                  364.4852   168.8121   2.159  0.03633
## InterestOnDebtPerStudent              -2.0561     0.7509  -2.738  0.00888
## TotalExpendituresPerStudent            1.4543     0.7237   2.010  0.05062
## TotalFederalRevenuePerStudent          0.4009     0.5062   0.792  0.43260
##                                      
## (Intercept)                          
## TotalCurrentExpendituresPerStudent   
## OperationsPerStudent                 
## AdministrationPerStudent             
## ConstructionPerStudent               
## TotalRevenuePerStudent             * 
## PercentPoverty                     . 
## StudentandStaffSupportPerStudent     
## TotalCapitalOutlayPerStudent       . 
## StudentTeacherRatio                * 
## InterestOnDebtPerStudent           **
## TotalExpendituresPerStudent        . 
## TotalFederalRevenuePerStudent        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1126 on 44 degrees of freedom
## Multiple R-squared:  0.458,  Adjusted R-squared:  0.3101 
## F-statistic: 3.098 on 12 and 44 DF,  p-value: 0.003012

rank_usa2 <- summary(r4)
rank_usa2

## 
## Call:
## lm(formula = RANK_IN_US ~ PercentPoverty + TotalCapitalOutlayPerStudent + 
##     StudentTeacherRatio + InterestOnDebtPerStudent, data = only_ranked)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1803.73  -843.52   -13.91   689.46  2536.15 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                  -2781.6327  2625.9477  -1.059   0.2944  
## PercentPoverty                  76.1832    34.9488   2.180   0.0338 *
## TotalCapitalOutlayPerStudent    -0.3928     0.2002  -1.962   0.0551 .
## StudentTeacherRatio            304.4481   157.4314   1.934   0.0586 .
## InterestOnDebtPerStudent        -1.1702     0.5317  -2.201   0.0322 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1133 on 52 degrees of freedom
## Multiple R-squared:  0.3507, Adjusted R-squared:  0.3007 
## F-statistic: 7.021 on 4 and 52 DF,  p-value: 0.0001346

LM with all variables: R-squared and Adjusted R-squared tells us that the model is not a good fit for the data and it’s an overfit. Many of the coefficients have high p-values.

Backwards elminination removed the high p-value variables and gave the model a better fit, but the resulting model described the data even less with a lower R-squared value.

Residuals

hist(PercOfState$residuals)

hist(rank_usa$residuals)

qqnorm(PercOfState$residuals, main = "Q-Q Plot for State Percentile Rankings")
qqline(PercOfState$residuals)

qqnorm(rank_usa$residuals, main = "Q-Q Plot for National Rankings")
qqline(rank_usa$residuals)

Conclusion

Unfortunately, deeper analysis will have to be done to find the main factors that influence a school district’s performance. Using only these variables alone will not reliably whether a school district will have high test scores and graduation rates.

Map

library(ggmap)
library(maps)
library(mapdata)
library(zipcode)

states <- map_data("state")

data(zipcode)
final$ZipCode <- clean.zipcodes(final$ZipCode)

zip <- subset(final, select = c(PercentileOfState, ZipCode, RANK_IN_US))
zip <- merge(zip, zipcode, by.x="ZipCode", by.y="zip")


ggplot(data = states) + 
  geom_polygon(aes(x = long, y = lat, group = group), fill = "grey50", color = "black") + 
  coord_fixed(1.3) + 
  xlab("") +
  ylab("") +
  guides(fill=FALSE)+
  ggtitle("Locations of Nationally Ranked School Districts", subtitle = "Green = high-ranked, Red = low-ranked, Blue = unranked") +
  geom_point(
        aes(x=longitude, y=latitude, show_guide = TRUE, colour=RANK_IN_US), 
        data=zip, na.rm = F)  + 
        scale_color_gradient(low="green", high="red", na.value = "blue")  +
  theme(axis.ticks = element_blank(),
        axis.text = element_blank())

ggplot(data = states) + 
  geom_polygon(aes(x = long, y = lat, group = group ), fill = "grey50", color = "black") + 
  coord_fixed(1.3) + 
  guides(fill=FALSE)+
  xlab("") +
  ylab("") +
  ggtitle("State Percentiles of School Districts", subtitle = "Red = bottom-percentile, Green = top-percentile") +
  geom_point(
        aes(x=longitude, y=latitude, show_guide = TRUE, colour=PercentileOfState), 
        data=zip, na.rm = F)  + 
        scale_color_gradient(low="red", high="green")  +
  theme(axis.ticks = element_blank(),
        axis.text = element_blank())

Sources

NCES

This information is from the Common Core of Data (CCD) for the 2014-2015 school year with directory information from the 2016-2017 preliminary data file. This data is collected annually directly from State Education Agencies (SEAs).

SchoolDigger

SchoolDigger’s sources are the National Center for Education Statistics, U.S. Department of Education, the U.S. Census Bureau and each state’s Department of Education

Niche

Niche’s School Districts ranking is based on rigorous analysis of key statistics and millions of reviews from students and parents using data from the U.S. Department of Education. Ranking factors include state test scores, college readiness, graduation rates, SAT/ACT scores, teacher quality, public school district ratings, and more.

save(final, file = "final.RData")
save(topbottom, file = "topbottom.RData")