Introduction

In order to reach the U.S. national on-time high school graduation rate by 2020, government may need to implement some new methods. In this challenge, I analyzed high school graduation rate under education agency(district) level and tried to find insights that are actionable. When dealing with observational study, it is almost impossible to draw causal relation. It is not reasonable to simply plot one chosen variable against Whole state graduation rate and claim the pattern we see is the true relation. Like the idea from multiple linear regression, we have to adjust the effects of other variables and then check the relation between one variable against graduation rate. Under certain criteria, the situation of graduation rate can be totally different for each cluster and therefore we have to implement different methods for different clusters. The challenge part is to find these criterias. I used GUIDE( a multi-purpose machine learning algorithm for constructing classification and regression trees) to build models, find the most discriminated variable and do analysis under the criteria given by the most discrimintaed variable.

Summary of data cleaning

Merged three datasets together. They are:
- merged(Required dataset)
- Education Finance data( elsec12t)
- Local Education Agency Universe Survey Data( common core of data).
Found the maximum overlap of school agencies of these three datasets
Remove redundant variables (EX: MOE, marginal of error from Required dataset)
Convert variables into consistent type
Apply feature engineering
Final dataset has school agency = 9866 and variable = 352

Data cleaning

## All kinds of high school graduation rate except "All student graduation rate"are character variable, which are divided into certain ranges. Feature LEP_ RATE seems to be missing from the whole dataset.
## 
## These different kind of graduation rate feautres are my target variable and it looks like the value of those features are not consistent, there are some value suchas "PS","GE50" , that I need to clean them first.
## 
## NA percentage in features MAM_RATE_1112,MAS_RATE_1112,MBL_RATE_1112, MHI_RATE_1112,MTR_RATE_1112 are too large, remove them.
## Consider MWH_RATE_1112,CWD_RATE_1112,ECD_RATE_1112 and ALL_RATE_1112 first
## 
## Also the income features should also be converted into numeric variable, since the unique value of those income features are very close to the whole row number. If I treat them as factor, then the levels are too many and do not make sense.

## Number of state in the merged dataset:  48

## name of state missing:
## Idaho
## Kentucky
## Oklahoma

## Randomly look at the mean of some features classified by each state

## As I expected, the mean of each features of each state different a lot, therefore when handling with missing data, mean/ mode imputation based on the whole state is not suitable. Impute each features based on the mean of Tract then City can give us a more accurate result

## As we can see from the boxplot, except for alaska, mean of graduation rate all above 75%, with northeast having the highest graduation rate. The width of the boxplot indicated number of education agency in each part, with midwest having largest number of education agency. The reason hawaii is only a point here is because Hawaii only have one education agency. Therefore it does not have variation.

##   Tukey multiple comparisons of means
##     99% family-wise confidence level
## 
## Fit: aov(formula = merged_index$ALL_RATE_1112 ~ merged_index$six_part)
## 
## $`merged_index$six_part`
##                         diff        lwr       upr     p adj
## hawaii-alaska     19.0930233 -20.212345 58.398392 0.5755699
## midwest-alaska    21.8944903  15.938422 27.850558 0.0000000
## northeast-alaska  24.4738329  18.484997 30.462669 0.0000000
## south-alaska      19.7934090  13.813544 25.773274 0.0000000
## west-alaska       16.4701249  10.443480 22.496769 0.0000000
## midwest-hawaii     2.8014671 -36.059355 41.662289 0.9998862
## northeast-hawaii   5.3808096 -33.485048 44.246667 0.9972740
## south-hawaii       0.7003858 -38.164091 39.564862 0.9999999
## west-hawaii       -2.6228983 -41.494600 36.248803 0.9999179
## northeast-midwest  2.5793425   1.522162  3.636523 0.0000000
## south-midwest     -2.1010813  -3.106197 -1.095966 0.0000000
## west-midwest      -5.4243654  -6.678126 -4.170605 0.0000000
## south-northeast   -4.6804238  -5.864346 -3.496501 0.0000000
## west-northeast    -8.0037079  -9.404897 -6.602519 0.0000000
## west-south        -3.3232841  -4.685619 -1.960949 0.0000000

## Above, I did a tukey pairwise test for the 6 part, under 99% confience interval. As we saw very obvisouly from the boxplot above, mean of alaska is different from midwest, northeast, south, west area. Under the tukey test, I know the mean difference of graduation rate between midwest, northeast, south, and west are all significantly different under 99% confience interval.

## Mean of about 95% of the features are different between these 6 parts.
## Instead selecting feature based on test result above, I will use other variable selection methods later.

## Under significant level=0.001, I did a 2 sample t-test for each feature seperated by whether graduation rate is higher than mean of national graduation rate or not

## Percentage of Significance features: 0.6845018

## About 68.4% of the mean of features between agencies whose graduatino rate is higher than average and those lower than average, are significant differnce.
## This is to get a feeling of the data and I know that simply choose features which are significantly would not work.

## Below I select features relative to percentage and house value from original Merged dataset, delete variable related to marginal of error(MOE).
## Using percentage can adjust the difference of population of each school agency and reduce the noise of model building coming next.
## After basic data cleaning, I merged 2 other new dataset I found with the original dataset. New datasets are: 
## elsec12t: Education finance Data.
## Common Core data: Local education agency universal survey.
## Cleaning step did for elsec12t: Remove redundant variables.
## Cleaning step did for Common Core data: Assign special value 0,-1,-2，-9 to their related meaning. Generate new features RATE_, which are the percentage of selected variables

## Row and Columns of original Merged dataset: 9870 154

## Row and Columns of Education finance dataset: 9870 21

## Row and Columns of Common Core dataset: 9870 148

Model building

I use tree based Machine learning algorithm from GUIDE, build different models for:
- Total graduation rate against selected features from Merged dataset, Education Financial dataset and NCES data.
There are two purposes for the models I built:
1. For finding insight.
2. For achieving higher prediction accuracy and R square.
I first built models using data without geographical variables( State, City, County, Tract), because I want to find pattern based on school agency level. Then I add the geographical variables to build model that can lead to a higher prediction accuracy.
The reason I don’t simple use a multiple linear regression to predict graduation rate is because there are too much noisy and factors inside whole data. We can not simply expect that some selected features from linear model can accurately explain the variability of whole state. Instead, I used decision tree, which will find different clusters by making split based on different features that can give lowest prediction error. Decision tree can find those most discriminant features and it is good enough for me to have the information to start with.
There are different parameters I can set for a specific single tree, some options are:
1. Which method to use at each node:
- Best simple linear regression
- Stepwise linear regression
1. Pruning standard error
2. Maximum depth of the tree
3. Minimum sample size at each node
I can also build model based on ensemble trees
- Random Forest
- Bagging
Different parameters will give me different tree. Generally what kind of tree model to use is based on required purpose. When building model based on machine learning algorithm, model can easily become a blackbox when I choose complex parameters.( uninterpretable, but achieve much higher prediction accuracy). If I want the tree to be as much interpretable as it can, I will use a simple tree. ex: Use best simple linear at each node for a single tree. If I want to achieve the best prediction I can, I will use ensemble tree. ex: Use random forest, average prediction results from many trees, with each tree randomly choose some features to consider for split.
One thing we need to be aware of is that we should avoid overfitting, which means you can get a relative low prediction error and relative high R^2 from the data you used to fit( In this case, data before 2015), but when you make prediction for a completely new dataset ( For example, new graduation rate of 2016), you prediction error will be larger than before and R^2 will also decrease.
One way to avoid it is to use cross validation when you train your model. The methods I finally choose and the results I provide are all calculated after using cross validation. Therefore, even though the R^2 is not the highest I can achieve, the results are more accurate and stable.
Below I make a comparision between different methods and their training error. The meanings of model name:
- 1Tree-S Linear: Single Decision tree. For each node, fit the best simple linear regression variables against Graduation Rate.
- 2M Linear: Multiple linear regression. Use normal multiple linear regression and treat result as benchmark.
- 3Tree-StepWise: Single Decision tree. For each node, use stepwise selection( forward and backward) to choose best multiple linear regression variables.
- 4Tree-RF: Ensemble Tree: Random Forest. Decorrelate variables and make prediction based on average result of all trees.
- 5Tree-RF+Geo: Ensemble Tree: Random Forest. Same algorithm as last one but adding the geographical variables inside. ( Expected to have higher prediction accuracy)

As for finding insight, I used the model Tree-StepWise, which is not that simple as multiple linear regression or Tree-S Linear, and not that complicated or uninterpretable as Random forest.

Tree plot of Tree-StepWise:

I digged into the data and visualized patterns based on the information I get from this model.
As we can see from the plot, the most discriminated variable is MEMBER_SCH, which is a variable I made when I applied feature engineering. It represents average number of students per school. The raw variables are MEMBER ( Total students, all grades) and SCH ( Aggregate number of all schools associated with this agency), from NCES dataset. I divide MEMBER by SCH to get MEMBER_SCH.
There are 3 terminal nodes from this plot. Each node represents a cluster and I fit stepwise linear regression for each node. The variables selected by algorithm for each node are statistically significant and the most predictive variables. I analyzed all the selected variables from each nodes and choose the variables that are meaningful and potentially actionable. Below are variables I choosed.

Variable name	Definition
ALL_RATE_1112	Overall high school graduation rate
MEMBER_SCH	Average number of student per school
MEMBER_SCH_L175	Factor with 2 levels, whether MEMBER_SCH is larger than 175 or not
grate_4part	Factor with 4 levels, whether mean graduation rate of state belongs to lower quantile, second quantile, third quantile, or upper quantile
pct_PUB_ASST_INC_ACS_08_12	The percentage of all ACS occupied housing units that receive public assistance income (general assistance and Temporary Assistance to Needy Families)
ELL_MEMBER	Number of students who are English language learner, divided by total number of students
Aggregate_HH_INC_ACS_08_12	Sum of all incomes in the household
PCTLCHAR	Percent-Charges of total elementary-secondary revenue
pct_FRST_FRMS_CEN_2010	The percentage of all addresses in a 2010 Census mailback area for which the first Mailout/Mailback form mailed was completed and returned
pct_Tot_Occp_Units_CEN_2010	The percentage of all 2010 Census housing units that are classified as the usual place of residence of the individual or group living in it
pct_URBANIZED_AREA_POP_CEN_2010	The percentage of the 2010 Census total population that lives in a densely settled area containing 50,000 or more people
PPCSTOT	PER PUPIL - TOTAL CURRENT SPENDING (ELEMENTARY-SECONDARY)
PPSGENAD	Per Pupil - General administration
PPITOTAL	Per Pupil - Total Current Spending for Instruction
PPSALWG	Per Pupil - Total salaries and wages
pct_Female_No_HB_CEN_2010	The percentage of all 2010 Census occupied housing units with a female householder and no husband of householder present
pct_Sngl_Prns_HHD_CEN_2010	The percentage of all 2010 Census occupied housing units where a householder lives alone

Based on the most discriminant variable MEMBER_SCH. I made a new label variable MEMBER_SCH_L175, which seperate all the school agencies into two parts and assign labels based on whether their MEMBER_SCH are larger than 175 or not. I plot each variable against graduation rate, seperated by label MEMBER_SCH_L175.

Total graduation rate vs MEMBER_SCH.

## Percentage of Total graduation rate less than 60% of each State, showing the top 7 states

##       Hawaii       Oregon       Alaska North Dakota      Montana 
##    100.00000    100.00000     40.00000     32.41379     31.21019 
##     Colorado South Dakota 
##     25.98870     24.00000

* Based on the density plot of ALL_RATE seperated by whether MEMBER_SCH larger than 175 or not, we see there is a small tail from graduation rate that are around 50% and most of them belong to MEMBER_SCH less than 175! And from the scatterplot between ALL_RATE and MEMBER_SCH, we can see that MEMBER_SCH( average student per school) has a positive correlation with graduation rate and for the blue group ( MEMBER_SCH_LESS175), the effect is even more strong.
* I also find the states that have very high percent of school agencies with graduation rate less than 60%. They are Hawaii, Oregon, Alaska, North Dakota, Montana, Colorado, and South Dakota.
* I also plot the mean of MEMBER_SCH per state and mean of ALL_RATE per state so that you can view this variable from different perspective.
* Finally, I made a boxplot with State ID in X axis and MEMBER_SCH in Y axis, colored by whether the mean graduation rate of this state belongs to lower quantile, second quantile, third quantile and upper quantile compared to whole state.

Total graduation rate vs. Public assistant

Between graduation rate and percentage of public assistant, there is not much difference between the two groups MEMBER_SCH_LARGER175, MEMBER_SCH_LESS175.
For both groups MEMBER_SCH_LARGER175 and MEMBER_SCH_LESS175, percentage of public assistant has a negative correlation with graduation rate.

Total graduation rate vs. English language learner.

Between graduation rate and English language learner per 100 students, the mean of MEMBER_SCH_LESS175 is significantly smaller than MEMBER_SCH_LARGER175.
For both groups MEMBER_SCH_LARGER175 and MEMBER_SCH_LESS175, English language learner per 100 students has a negative correlation with graduation rate.

Total graduation rate vs. aggregate household income

Between graduation rate and aggregate household income, the mean of MEMBER_SCH_LESS175 is significantly smaller than MEMBER_SCH_LARGER175.
For both groups MEMBER_SCH_LARGER175 and MEMBER_SCH_LESS175, aggregate household income has a positive correlation with graduation rate.

Total graduation rate vs. Charges of total elementary-secondary revenue

Between graduation rate and Charges of total elementary-secondary revenue, there is not much difference between the two groups MEMBER_SCH_LARGER175, MEMBER_SCH_LESS175.
For group MEMBER_SCH_LARGER175, Charges of total elementary-secondary revenue has a positive correlation with graduation rate. For MEMBER_SCH_LESS175, as Charges of total elementary-secondary revenue increase, graduation rate does not increase.

Conclusion

My best model captured 75% variability of graduation rate, with adjusted R-square 0.75. Root mean square error is around 6%.

Average number of student per school is the most important, discriminated and actionable variable. When the number of Average number of student per school went up, graduation rate went up. States, which contain more than 24% education agencies whose average number of student per school less than 175 are:
Hawaii, Oregon, Alaska, North Dakota, Montana, Colorado, and South Dakota.

Characteristic of school agencies that have average number of student per school less than 175.
1. They have relatively low percentage of first mail was completed and returned.
2. They have relatively low percentage of housing unit that are classified as the usual place of residence of the individual or group living in it.
3. Their percentage of total population that lives in a densely settled area containing 50,000 or more people are very low and is around 2%.
4. As Total current spending per pupil increase, the graduation rate of these school agencies do not increase. Instead, graduation rate even decreased a lot. Spending more money on General administration, Instruction, Total salaries and wages are not the suitable solutions for these school agencies. And it can also mean that in order for government to achieve same education quality, government needs to spend more on these school agencies and the result-graduation rate still did not become better.

Actionable suggestions:
For whole school agencies:

Government should pay more attention to children involved in a housing unit that receive public assistant.
Government should pay more attention to children who are learning English.
Government should pay more attention to female living in a housing unit without husband. Figure out some methods to decrease divorce rate.
Government should pay more attention to occupied housing units where a householder lives alone.

For school agencies with average number of students per school less than 175:

Instead of using traditional educational method, goverment need to implement new education method specifically aim at these school agencies. For example, can we try to replace part of the in-class education by online teaching? Since most of the housing unit of school agencies that have average number of students per school less than 175 live in rural area and are classified as unusual place of residence, children of these hosing units might cost more to go to school. Even when they arrive school, the average number of student in school is around 100,which means the number of students per grade is around 10. They do not have many opportunities to make new friends and the studying atmosphere is not good. It is not worthy to spend more effort in order to go to school.
Instead, if government can implement online education, children can have more opportunities to meet new friends online and government can connect the students from several schools in this area together so that
there is more students studying at the same time on line. Government just need to figure out a way to make this online teaching as real as possible. And besides study, school agency can arrange some weekly sports activities that request more children to join together in person.
The mean of graduation rate of school agencies with average number of students per school less than 175 is around 70%, if government can aim at these agencies and improve their graduation rate into average graduation rate 81.4%, we have more chance to reach the 90% graduation goal by 2020.

References

[School district finance data] (http://www.census.gov/did/www/schooldistricts/data/finance.html)
[Local Education Agency Universe Survey Data] (https://nces.ed.gov/ccd/pubagency.asp)
[Guide] (http://www.stat.wisc.edu/~loh/guide.html)

Digging into US high school graduation rate

Haozhen Wu - Individual Maker

Introduction

Summary of data cleaning

Data cleaning

Model building

Tree plot of Tree-StepWise:

Total graduation rate vs MEMBER_SCH.

Total graduation rate vs. Public assistant

Total graduation rate vs. English language learner.

Total graduation rate vs. aggregate household income

Total graduation rate vs. Charges of total elementary-secondary revenue

Conclusion

References

Digging into US high school graduation rate

Haozhen Wu - Individual Maker

Introduction

Summary of data cleaning

Data cleaning

Model building

Tree plot of Tree-StepWise:

Total graduation rate vs MEMBER_SCH.

Total graduation rate vs. Public assistant

Total graduation rate vs. English language learner.

Total graduation rate vs. aggregate household income

Total graduation rate vs. Charges of total elementary-secondary revenue

Features potential related to geographic.

Features related to total current spending per pupil

Features related to housing unit

Conclusion

References