Introduction

In order to reach the U.S. national on-time high school graduation rate by 2020, government may need to implement some new methods. In this challenge, I analyzed high school graduation rate under education agency(district) level and tried to find insights that are actionable. When dealing with observational study, it is almost impossible to draw causal relation. It is not reasonable to simply plot one chosen variable against Whole state graduation rate and claim the pattern we see is the true relation. Like the idea from multiple linear regression, we have to adjust the effects of other variables and then check the relation between one variable against graduation rate. Under certain criteria, the situation of graduation rate can be totally different for each cluster and therefore we have to implement different methods for different clusters. The challenge part is to find these criterias. I used GUIDE( a multi-purpose machine learning algorithm for constructing classification and regression trees) to build models, find the most discriminated variable and do analysis under the criteria given by the most discrimintaed variable.

Summary of data cleaning

Data cleaning

## All kinds of high school graduation rate except "All student graduation rate"are character variable, which are divided into certain ranges. Feature LEP_ RATE seems to be missing from the whole dataset.
## 
## These different kind of graduation rate feautres are my target variable and it looks like the value of those features are not consistent, there are some value suchas "PS","GE50" , that I need to clean them first.
## 
## NA percentage in features MAM_RATE_1112,MAS_RATE_1112,MBL_RATE_1112, MHI_RATE_1112,MTR_RATE_1112 are too large, remove them.
## Consider MWH_RATE_1112,CWD_RATE_1112,ECD_RATE_1112 and ALL_RATE_1112 first
## 
## Also the income features should also be converted into numeric variable, since the unique value of those income features are very close to the whole row number. If I treat them as factor, then the levels are too many and do not make sense.

## Number of state in the merged dataset:  48
## name of state missing:
## Idaho
## Kentucky
## Oklahoma
## Randomly look at the mean of some features classified by each state

## As I expected, the mean of each features of each state different a lot, therefore when handling with missing data, mean/ mode imputation based on the whole state is not suitable. Impute each features based on the mean of Tract then City can give us a more accurate result

## As we can see from the boxplot, except for alaska, mean of graduation rate all above 75%, with northeast having the highest graduation rate. The width of the boxplot indicated number of education agency in each part, with midwest having largest number of education agency. The reason hawaii is only a point here is because Hawaii only have one education agency. Therefore it does not have variation.
##   Tukey multiple comparisons of means
##     99% family-wise confidence level
## 
## Fit: aov(formula = merged_index$ALL_RATE_1112 ~ merged_index$six_part)
## 
## $`merged_index$six_part`
##                         diff        lwr       upr     p adj
## hawaii-alaska     19.0930233 -20.212345 58.398392 0.5755699
## midwest-alaska    21.8944903  15.938422 27.850558 0.0000000
## northeast-alaska  24.4738329  18.484997 30.462669 0.0000000
## south-alaska      19.7934090  13.813544 25.773274 0.0000000
## west-alaska       16.4701249  10.443480 22.496769 0.0000000
## midwest-hawaii     2.8014671 -36.059355 41.662289 0.9998862
## northeast-hawaii   5.3808096 -33.485048 44.246667 0.9972740
## south-hawaii       0.7003858 -38.164091 39.564862 0.9999999
## west-hawaii       -2.6228983 -41.494600 36.248803 0.9999179
## northeast-midwest  2.5793425   1.522162  3.636523 0.0000000
## south-midwest     -2.1010813  -3.106197 -1.095966 0.0000000
## west-midwest      -5.4243654  -6.678126 -4.170605 0.0000000
## south-northeast   -4.6804238  -5.864346 -3.496501 0.0000000
## west-northeast    -8.0037079  -9.404897 -6.602519 0.0000000
## west-south        -3.3232841  -4.685619 -1.960949 0.0000000
## Above, I did a tukey pairwise test for the 6 part, under 99% confience interval. As we saw very obvisouly from the boxplot above, mean of alaska is different from midwest, northeast, south, west area. Under the tukey test, I know the mean difference of graduation rate between midwest, northeast, south, and west are all significantly different under 99% confience interval.
## Mean of about 95% of the features are different between these 6 parts.
## Instead selecting feature based on test result above, I will use other variable selection methods later.

## Under significant level=0.001, I did a 2 sample t-test for each feature seperated by whether graduation rate is higher than mean of national graduation rate or not
## Percentage of Significance features: 0.6845018
## About 68.4% of the mean of features between agencies whose graduatino rate is higher than average and those lower than average, are significant differnce.
## This is to get a feeling of the data and I know that simply choose features which are significantly would not work.
## Below I select features relative to percentage and house value from original Merged dataset, delete variable related to marginal of error(MOE).
## Using percentage can adjust the difference of population of each school agency and reduce the noise of model building coming next.
## After basic data cleaning, I merged 2 other new dataset I found with the original dataset. New datasets are: 
## elsec12t: Education finance Data.
## Common Core data: Local education agency universal survey.
## Cleaning step did for elsec12t: Remove redundant variables.
## Cleaning step did for Common Core data: Assign special value 0,-1,-2,-9 to their related meaning. Generate new features RATE_, which are the percentage of selected variables
## Row and Columns of original Merged dataset: 9870 154
## Row and Columns of Education finance dataset: 9870 21
## Row and Columns of Common Core dataset: 9870 148

Model building

As for finding insight, I used the model Tree-StepWise, which is not that simple as multiple linear regression or Tree-S Linear, and not that complicated or uninterpretable as Random forest.

Tree plot of Tree-StepWise:

  • I digged into the data and visualized patterns based on the information I get from this model.
  • As we can see from the plot, the most discriminated variable is MEMBER_SCH, which is a variable I made when I applied feature engineering. It represents average number of students per school. The raw variables are MEMBER ( Total students, all grades) and SCH ( Aggregate number of all schools associated with this agency), from NCES dataset. I divide MEMBER by SCH to get MEMBER_SCH.
  • There are 3 terminal nodes from this plot. Each node represents a cluster and I fit stepwise linear regression for each node. The variables selected by algorithm for each node are statistically significant and the most predictive variables. I analyzed all the selected variables from each nodes and choose the variables that are meaningful and potentially actionable. Below are variables I choosed.
Variable name Definition
ALL_RATE_1112 Overall high school graduation rate
MEMBER_SCH Average number of student per school
MEMBER_SCH_L175 Factor with 2 levels, whether MEMBER_SCH is larger than 175 or not
grate_4part Factor with 4 levels, whether mean graduation rate of state belongs to lower quantile, second quantile, third quantile, or upper quantile
pct_PUB_ASST_INC_ACS_08_12 The percentage of all ACS occupied housing units that receive public assistance income (general assistance and Temporary Assistance to Needy Families)
ELL_MEMBER Number of students who are English language learner, divided by total number of students
Aggregate_HH_INC_ACS_08_12 Sum of all incomes in the household
PCTLCHAR Percent-Charges of total elementary-secondary revenue
pct_FRST_FRMS_CEN_2010 The percentage of all addresses in a 2010 Census mailback area for which the first Mailout/Mailback form mailed was completed and returned
pct_Tot_Occp_Units_CEN_2010 The percentage of all 2010 Census housing units that are classified as the usual place of residence of the individual or group living in it
pct_URBANIZED_AREA_POP_CEN_2010 The percentage of the 2010 Census total population that lives in a densely settled area containing 50,000 or more people
PPCSTOT PER PUPIL - TOTAL CURRENT SPENDING (ELEMENTARY-SECONDARY)
PPSGENAD Per Pupil - General administration
PPITOTAL Per Pupil - Total Current Spending for Instruction
PPSALWG Per Pupil - Total salaries and wages
pct_Female_No_HB_CEN_2010 The percentage of all 2010 Census occupied housing units with a female householder and no husband of householder present
pct_Sngl_Prns_HHD_CEN_2010 The percentage of all 2010 Census occupied housing units where a householder lives alone

Based on the most discriminant variable MEMBER_SCH. I made a new label variable MEMBER_SCH_L175, which seperate all the school agencies into two parts and assign labels based on whether their MEMBER_SCH are larger than 175 or not. I plot each variable against graduation rate, seperated by label MEMBER_SCH_L175.

Total graduation rate vs MEMBER_SCH.

## Percentage of Total graduation rate less than 60% of each State, showing the top 7 states
##       Hawaii       Oregon       Alaska North Dakota      Montana 
##    100.00000    100.00000     40.00000     32.41379     31.21019 
##     Colorado South Dakota 
##     25.98870     24.00000

* Based on the density plot of ALL_RATE seperated by whether MEMBER_SCH larger than 175 or not, we see there is a small tail from graduation rate that are around 50% and most of them belong to MEMBER_SCH less than 175! And from the scatterplot between ALL_RATE and MEMBER_SCH, we can see that MEMBER_SCH( average student per school) has a positive correlation with graduation rate and for the blue group ( MEMBER_SCH_LESS175), the effect is even more strong.
* I also find the states that have very high percent of school agencies with graduation rate less than 60%. They are Hawaii, Oregon, Alaska, North Dakota, Montana, Colorado, and South Dakota.
* I also plot the mean of MEMBER_SCH per state and mean of ALL_RATE per state so that you can view this variable from different perspective.
* Finally, I made a boxplot with State ID in X axis and MEMBER_SCH in Y axis, colored by whether the mean graduation rate of this state belongs to lower quantile, second quantile, third quantile and upper quantile compared to whole state.

Total graduation rate vs. Public assistant

  • Between graduation rate and percentage of public assistant, there is not much difference between the two groups MEMBER_SCH_LARGER175, MEMBER_SCH_LESS175.
  • For both groups MEMBER_SCH_LARGER175 and MEMBER_SCH_LESS175, percentage of public assistant has a negative correlation with graduation rate.

Total graduation rate vs. English language learner.

  • Between graduation rate and English language learner per 100 students, the mean of MEMBER_SCH_LESS175 is significantly smaller than MEMBER_SCH_LARGER175.
  • For both groups MEMBER_SCH_LARGER175 and MEMBER_SCH_LESS175, English language learner per 100 students has a negative correlation with graduation rate.

Total graduation rate vs. aggregate household income

  • Between graduation rate and aggregate household income, the mean of MEMBER_SCH_LESS175 is significantly smaller than MEMBER_SCH_LARGER175.
  • For both groups MEMBER_SCH_LARGER175 and MEMBER_SCH_LESS175, aggregate household income has a positive correlation with graduation rate.

Total graduation rate vs. Charges of total elementary-secondary revenue

  • Between graduation rate and Charges of total elementary-secondary revenue, there is not much difference between the two groups MEMBER_SCH_LARGER175, MEMBER_SCH_LESS175.
  • For group MEMBER_SCH_LARGER175, Charges of total elementary-secondary revenue has a positive correlation with graduation rate. For MEMBER_SCH_LESS175, as Charges of total elementary-secondary revenue increase, graduation rate does not increase.

Conclusion

My best model captured 75% variability of graduation rate, with adjusted R-square 0.75. Root mean square error is around 6%.

Average number of student per school is the most important, discriminated and actionable variable. When the number of Average number of student per school went up, graduation rate went up. States, which contain more than 24% education agencies whose average number of student per school less than 175 are:
Hawaii, Oregon, Alaska, North Dakota, Montana, Colorado, and South Dakota.

Characteristic of school agencies that have average number of student per school less than 175.
1. They have relatively low percentage of first mail was completed and returned.
2. They have relatively low percentage of housing unit that are classified as the usual place of residence of the individual or group living in it.
3. Their percentage of total population that lives in a densely settled area containing 50,000 or more people are very low and is around 2%.
4. As Total current spending per pupil increase, the graduation rate of these school agencies do not increase. Instead, graduation rate even decreased a lot. Spending more money on General administration, Instruction, Total salaries and wages are not the suitable solutions for these school agencies. And it can also mean that in order for government to achieve same education quality, government needs to spend more on these school agencies and the result-graduation rate still did not become better.

Actionable suggestions:
For whole school agencies:

For school agencies with average number of students per school less than 175:

References

[School district finance data] (http://www.census.gov/did/www/schooldistricts/data/finance.html)
[Local Education Agency Universe Survey Data] (https://nces.ed.gov/ccd/pubagency.asp)
[Guide] (http://www.stat.wisc.edu/~loh/guide.html)