Definition of filenames

The table below describes the four versions of k-means output that are compared in this document. A brief description of the files is as follows:

  1. File 1: These are the results obtained when taking baseline and growth with scaling
  2. File 2: These are the results obtained when taking baseline and growth x 1.2 with scaling
  3. File 3: These are the results obtained when taking baseline and growth x 1.2 without scaling
  4. File 4: These are the results obtained when taking baseline and growth x 0.7 without scaling

| # | File name | Scaling applied? | No. of buckets | Growth features | Std 1 & 2 combined? |
|---|---|---|---|---|---|
| 1 | results_1905_withscaler_mult1.0 | Yes | 4 | raw_growth | No |
| 2 | results_1905_withscaler_mult1.2 | Yes | 4 | raw_growth*1.2 | No |
| 3 | results_1605_noscaler_mult1.2 | No | 4 | raw_growth*1.2 | No |
| 4 | results_1905_noscaler_mult0.7 | No | 4 | raw_growth*0.7 | No |

Distribution of buckets

[1] “buckA” “buckB” “buckC” “buckD”

| bucket_new | results_1905_withscaler_mult1.0 | results_1905_withscaler_mult1.2 | results_1605_noscaler_mult1.2 | results_1905_noscaler_mult0.7 |
|---|---|---|---|---|
| buckA | 0.33 | 0.31 | 0.23 | 0.30 |
| buckB | 0.25 | 0.22 | 0.35 | 0.23 |
| buckC | 0.25 | 0.29 | 0.26 | 0.29 |
| buckD | 0.17 | 0.17 | 0.16 | 0.18 |
| Total | 1.00 | 0.99 | 1.00 | 1.00 |

Scatter plot

Investigation on Scaling

We read up about scaling and wanted to discuss 2 ideas with you:

  1. Looking at the Merit of Scaling
  2. Assumption of linearity for PCA

1. Looking at the Merit of Scaling

Reasons for investigating the merit of scaling

The first motivation to investigate the merit of scaling comes from the scatter plots above. From the plots, it is evident that the buckets with scaling are not as neat or intuitive as the buckets without scaling (and with growth multiplier = 1.2).

The second motivation comes from a particular Stack Overflow answer on this topic. Our understanding from this answer is that scaling is non-negotiable when the features are in different units. For example, if a group of people is being classified based on height in metres and weight in kilograms, it is necessary to scale these two variables, lest the algorithm be dominated by weight (which would have more variance, owing to its unit of measurement). However, the dataset we are working with does not have features in different units. All features are percentages representing learning levels. Our understanding from the same answer is that whether scaling is necessary for such datasets depends on the nature of the dataset.
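To make the height and weight example concrete, here is a minimal sketch using only the Python standard library. The five height and weight values are invented for illustration, and `min_max` is a helper written here, not a function from our pipeline.

```python
from statistics import pvariance

# Hypothetical heights (metres) and weights (kilograms) for five people.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [50.0, 60.0, 70.0, 80.0, 90.0]


def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Ratio of variances before scaling: driven almost entirely by the units.
raw_ratio = pvariance(weights_kg) / pvariance(heights_m)

# Ratio of variances after min-max scaling: both features now span [0, 1].
scaled_ratio = pvariance(min_max(weights_kg)) / pvariance(min_max(heights_m))

print(f"variance ratio (weight / height), raw:    {raw_ratio:.0f}")
print(f"variance ratio (weight / height), scaled: {scaled_ratio:.2f}")
```

In this toy data, weight has a far larger variance than height purely because of its unit, so a distance-based method such as k-means would be driven by weight unless the features are scaled. When every feature is already a percentage, this particular argument for scaling does not apply in the same way.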

The nature of the UP-GP dataset

The dataset primarily has two types of features: growth and baseline. The density plots below show that growth has higher variance than baseline. This is natural and expected. Because these are young children, baseline values are expected to cluster near the 0% mark, which explains the sharp peak of baseline values around 3%. Growth tells a different story. Even though most classes start near the 0% baseline at the beginning of the year, they grow by different magnitudes through the year, so the growth curve is naturally more spread out than the baseline curve. Further, it is this variance in growth that we wish to use to separate better-performing schools from the others.

How scaling affects the nature of the dataset

Min-max scaling, as can be seen in the plot below, has little effect on the baseline, because the baseline values already lay roughly between 0 and 1 before scaling. It has a more pronounced effect on growth, since growth included negative values before scaling.

The primary effect of the minmax scaler on the UP GP dataset is a reduction of the relative variance of growth as compared to baseline. In other words, the variance of growth was much higher than that of baseline before scaling. However, after scaling, the variance of growth is similar to the variance of baseline. This reduction in relative variance reduces the weightage naturally given to growth, thereby reducing the reward of schools that started with low baseline but achieved high growth.
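The reduction in relative variance can be written out explicitly. This is a sketch that assumes the standard min-max definition; the symbols g (a growth feature) and b (a baseline feature) are notation introduced here for illustration.

```latex
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},
\qquad
\operatorname{Var}(x') = \frac{\operatorname{Var}(x)}{(x_{\max} - x_{\min})^{2}}

\frac{\operatorname{Var}(g')}{\operatorname{Var}(b')}
  = \frac{\operatorname{Var}(g)}{\operatorname{Var}(b)}
    \cdot \left( \frac{b_{\max} - b_{\min}}{g_{\max} - g_{\min}} \right)^{2}
```

Whenever growth spans a wider range than baseline, the bracketed factor is below 1, so the variance of growth relative to baseline shrinks after scaling.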

Loss of control over weightage given to growth

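In addition to reducing the intrinsic weightage of growth in the algorithm, the scaler cancels out any artificial weight assigned to growth. This is evident from the near-identical scatter plots for the two scaled runs: with no multiplier and with a 1.2 multiplier. This is expected if the multiplier is applied before scaling, since min-max scaling maps each feature to the same range regardless of any positive constant it has been multiplied by.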
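The loss of control over the growth weight can be checked with a small sketch. The growth values below are hypothetical, `min_max` is a helper written for this example, and the final step (multiplying after scaling) is shown only as one possible alternative, not as what our pipeline does.

```python
def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw growth values (as fractions), including a negative one.
raw_growth = [-0.10, 0.05, 0.20, 0.45, 0.80]
multiplier = 1.2

# Multiplier applied BEFORE scaling: the scaler undoes it.
scaled_plain = min_max(raw_growth)
scaled_weighted = min_max([multiplier * g for g in raw_growth])
max_diff = max(abs(a - b) for a, b in zip(scaled_plain, scaled_weighted))
print(f"largest difference after scaling: {max_diff:.2e}")

# Multiplier applied AFTER scaling: the extra weight survives.
weighted_after = [multiplier * s for s in scaled_plain]
print(f"range of growth when weighted after scaling: "
      f"{max(weighted_after) - min(weighted_after):.2f}")
```

The first comparison shows that the scaled values match up to floating-point error whether or not the multiplier was applied. The second shows that the same multiplier does change the spread of growth if it is applied to the already scaled feature.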

Concluding questions

These considerations raise the following question about the use of scaling for the UP GP dataset: while scaling is a general practice in clustering workflows, does it apply to the particular dataset that we are working with? Or should this dataset be exempt from scaling to preserve its natural variances?

2. Assumption of linearity for PCA

While reading about PCA, we came across some literature mentioning that standard scaling is typically done before PCA. One of the blogs stated that standardization is done to ensure linearity of the variables used for PCA, since linearity of the data is one of the assumptions of PCA.

Thus, we thought it might be a good idea to check whether the variables we are using already have a linear relationship with one another. We looked at this for all the variables related to Std. 1. We see that only two of our variable pairs (Maths-BL and Reading-BL; Reading-Growth and Maths-Growth) have a somewhat linear relationship.
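One way to put a number on "somewhat linear" is the Pearson correlation coefficient for each pair of variables. The sketch below uses only the standard library; the values and the variable names `maths_bl`, `reading_bl` and `maths_growth` are hypothetical stand-ins, not our actual data, and this is not necessarily the check used to produce the plots.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical class-level values (as fractions), for illustration only.
maths_bl = [0.01, 0.03, 0.05, 0.08, 0.12]
reading_bl = [0.02, 0.04, 0.05, 0.09, 0.11]
maths_growth = [0.40, 0.10, 0.55, 0.05, 0.30]

r_linear = pearson_r(maths_bl, reading_bl)   # strongly linear pair
r_weak = pearson_r(maths_bl, maths_growth)   # weakly related pair

# A positive linear rescaling of one variable leaves r unchanged.
r_rescaled = pearson_r([10 * x + 3 for x in maths_bl], reading_bl)

print(f"Maths-BL vs Reading-BL:   r = {r_linear:.2f}")
print(f"Maths-BL vs Maths-Growth: r = {r_weak:.2f}")
print(f"change in r after linear rescaling: {abs(r_rescaled - r_linear):.1e}")
```

Values of r near +1 or -1 indicate a nearly linear relationship, while values near 0 indicate little linear association. Because r is unchanged by a positive linear rescaling of either variable, this check gives the same answer before and after min-max or standard scaling.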

Concluding Questions

  1. Is this the right check for linearity?
  2. Do we need to consider doing standard scaling to ensure that the assumption of linearity is satisfied?