The table below describes the four versions of k-means output that are compared in this document. A brief description of the files is as follows:
| # | File Name | Scaling applied? | No. of buckets | Growth features | Std 1 & 2 combined? |
|---|---|---|---|---|---|
| 1 | results_1905_withscaler_mult1.0 | Yes | 4 | raw_growth | No |
| 2 | results_1905_withscaler_mult1.2 | Yes | 4 | raw_growth*1.2 | No |
| 3 | results_1605_noscaler_mult1.2 | No | 4 | raw_growth*1.2 | No |
| 4 | results_1905_noscaler_mult0.7 | No | 4 | raw_growth*0.7 | No |
The table below shows the proportion of schools assigned to each of the four buckets (buckA–buckD) under each run:
| Bucket | results_1905_withscaler_mult1.0 | results_1905_withscaler_mult1.2 | results_1605_noscaler_mult1.2 | results_1905_noscaler_mult0.7 |
|---|---|---|---|---|
| buckA | 0.33 | 0.31 | 0.23 | 0.30 |
| buckB | 0.25 | 0.22 | 0.35 | 0.23 |
| buckC | 0.25 | 0.29 | 0.26 | 0.29 |
| buckD | 0.17 | 0.17 | 0.16 | 0.18 |
| Total | 1.00 | 0.99 | 1.00 | 1.00 |
We read up on scaling and wanted to discuss two ideas with you:
The first motivation to investigate the merit of scaling comes from the scatter plots above. From the plots, it is evident that the buckets formed with scaling are not as neat or intuitive as the buckets formed without scaling (and with a growth multiplier of 1.2).
The second motivation comes from a particular Stack Overflow answer on this topic. It is understood from that answer that scaling is non-negotiable when the features are in different units. For example, if a group of people is being classified based on height in metres and weight in kilograms, it is necessary to scale these two variables, lest the algorithm be dominated by the weight (since it would have more variance, owing to its unit of measurement). However, the dataset we are working with does not have features in different units: all features are percentages representing learning levels. It is understood from the previously linked answer that the necessity of scaling in such datasets depends on the nature of the dataset.
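As a minimal illustration of the height/weight point (hypothetical data, not our dataset), the sketch below shows how unscaled k-means is dominated by the higher-variance feature:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Height in metres (std ~0.1) vs weight in kilograms (std ~10):
# without scaling, squared Euclidean distances are ~10,000x more
# sensitive to weight, so the clusters split almost purely on weight.
X = np.column_stack([
    rng.normal(1.65, 0.10, 200),  # height (m)
    rng.normal(70.0, 10.0, 200),  # weight (kg)
])

labels_raw = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
labels_scaled = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(
    MinMaxScaler().fit_transform(X)
)
# labels_raw is driven almost entirely by weight; labels_scaled uses both features.
```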
The dataset primarily has two types of features: growth and baseline. The density plots below show that growth has higher variance than baseline. This is natural and expected: since these are young children, the baseline is expected to be crowded near the 0% mark, which explains the sharp peak of baseline values around 3%. Growth, however, tells a different story. Even though most classes start near the 0% baseline at the beginning of the year, they grow by different magnitudes through the year, so it is natural that the growth curve is more spread out than the baseline curve. Further, it is this variance in growth that we wish to use to distinguish better-performing schools from the others.
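A short sketch of the variance check behind these density plots (the file path and the column names `baseline` and `growth` are placeholders, not our actual schema):

```python
import pandas as pd

df = pd.read_csv("up_gp_features.csv")   # hypothetical path
print(df[["baseline", "growth"]].var())  # growth variance should dwarf baseline variance

ax = df["baseline"].plot.kde(label="baseline")  # sharp peak near the low end
df["growth"].plot.kde(ax=ax, label="growth")    # visibly more spread out
ax.legend()
```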
Min-max scaling, as can be seen in the plot below, has little effect on the baseline, because the baseline curve already lay roughly between 0 and 1 before scaling. It has a more pronounced effect on growth, since growth contained negative values before scaling.
The primary effect of the min-max scaler on the UP GP dataset is a reduction in the variance of growth relative to baseline. In other words, the variance of growth was much higher than that of baseline before scaling, but after scaling the two variances are similar. This reduction in relative variance reduces the weight naturally given to growth, thereby reducing the reward for schools that started with a low baseline but achieved high growth.
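The before/after comparison can be sketched as follows (again with the placeholder path and column names from the sketch above):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("up_gp_features.csv")  # hypothetical path
cols = ["baseline", "growth"]

scaled = pd.DataFrame(MinMaxScaler().fit_transform(df[cols]), columns=cols)
print(df[cols].var())  # before: var(growth) >> var(baseline)
print(scaled.var())    # after: both features lie in [0, 1] with comparable variances
```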
In addition to reducing the intrinsic weight of growth in the algorithm, the scaler cancels any artificial weight assigned to growth: multiplying a feature by a positive constant leaves its min-max-scaled values unchanged, since the constant cancels in the numerator and denominator. This is evident from the identical scatter plots for the two scaled runs, one with no multiplier and one with the 1.2 multiplier.
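This invariance is easy to verify: for any constant c > 0, min-max scaling maps c·x and x to identical values. A quick check, assuming nothing about our actual features:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.random.default_rng(1).normal(size=(100, 1))
scaled_x = MinMaxScaler().fit_transform(x)
scaled_12x = MinMaxScaler().fit_transform(1.2 * x)

# (c*x - min(c*x)) / (max(c*x) - min(c*x)) == (x - min(x)) / (max(x) - min(x)) for c > 0
assert np.allclose(scaled_x, scaled_12x)
```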
These considerations raise the following question about the use of scaling for the UP GP dataset: while scaling is a general practice in clustering workflows, does it apply to this particular dataset? Or should this dataset be exempt from scaling to preserve its natural variances?
While reading about PCA, we came across some literature mentioning that standard scaling is typically done before PCA. One of the blogs mentioned that standardization is done to ensure linearity of the variables being used for PCA, since linearity of the data is one of the assumptions behind PCA.
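If we do go down this route, the standardize-then-PCA pattern described in that literature would look roughly like the sketch below (the path, pipeline, and feature names are illustrative, not our current code):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("up_gp_features.csv")  # hypothetical path
features = ["maths_bl", "reading_bl", "maths_growth", "reading_growth"]  # illustrative names

# Standardize each feature to zero mean and unit variance before extracting components.
pca = make_pipeline(StandardScaler(), PCA(n_components=2))
components = pca.fit_transform(df[features])
```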
Thus, we thought it might be a good idea to check whether the variables we are using already have a linear relationship with one another. We looked at this for all the variables related to Std. 1 using pairwise scatter plots, and we see that only two of our variable combinations (Maths-BL with Reading-BL, and Reading-Growth with Maths-Growth) have a somewhat linear relationship.
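A sketch of how such a pairwise check could be scripted (the path and column names are hypothetical stand-ins for the Std. 1 variables):

```python
import pandas as pd

df = pd.read_csv("up_gp_features.csv")  # hypothetical path
std1 = df[["maths_bl", "reading_bl", "maths_growth", "reading_growth"]]

print(std1.corr())  # a high |r| flags the roughly linear pairs
pd.plotting.scatter_matrix(std1, figsize=(8, 8))  # visual check of each pairwise relationship
```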