Ranking by group in Python

import numpy as np
import pandas as pd
import seaborn as sns

In data analysis we sometimes need to pick out the rows where a variable Y is the largest, the smallest, or the nth within certain subgroups.

For example, take the tips dataset below. We want to know the largest tip in each of two groups: smokers and non-smokers.

tips = sns.load_dataset('tips')

tips.head()

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

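Find the mean of total_bill and tip (and size) within groups and subgroups

# Select the numeric columns explicitly: newer pandas versions raise a TypeError
# when mean() is applied to non-numeric columns such as smoker or day
tips.groupby(['time', 'sex'])[['total_bill', 'tip', 'size']].mean()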

		total_bill	tip	size
time	sex
Lunch	Male	18.048485	2.882121	2.363636
Lunch	Female	16.339143	2.582857	2.457143
Dinner	Male	21.461452	3.144839	2.701613
Dinner	Female	19.213077	3.002115	2.461538

Find the mean of one variable and the max of another, by subgroup

grp = tips[['total_bill','tip']].groupby([tips['time'],tips['sex']])
grp.agg({'total_bill' : 'mean', 'tip' : 'max'}).round(2)

		total_bill	tip
time	sex
Lunch	Male	18.05	6.70
Lunch	Female	16.34	5.17
Dinner	Male	21.46	10.00
Dinner	Female	19.21	6.50
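
The same kind of per-column aggregation can also be written with pandas named aggregation, which lets you choose the output column names. The sketch below uses a small hand-made DataFrame rather than the real tips data, and the names mean_bill and max_tip are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the tips dataset.
df = pd.DataFrame({
    'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner'],
    'total_bill': [10.0, 20.0, 30.0, 50.0],
    'tip': [1.0, 3.0, 4.0, 5.0],
})

# Each keyword is an output column: (input column, aggregation function).
summary = df.groupby('time').agg(
    mean_bill=('total_bill', 'mean'),
    max_tip=('tip', 'max'),
)
print(summary)
```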

Next we create a variable for the tip as a share of the bill (tip_pct, stored as a fraction rather than a percentage) and look for the 3 cases with the highest tip percentage in the dataset

tips['tip_pct'] = tips['tip']/tips['total_bill']
tips.head()

	total_bill	tip	sex	smoker	day	time	size	tip_pct
0	16.99	1.01	Female	No	Sun	Dinner	2	0.059447
1	10.34	1.66	Male	No	Sun	Dinner	3	0.160542
2	21.01	3.50	Male	No	Sun	Dinner	3	0.166587
3	23.68	3.31	Male	No	Sun	Dinner	2	0.139780
4	24.59	3.61	Female	No	Sun	Dinner	4	0.146808

Top 3 rows with the highest tip_pct in the dataset as a whole

def top(df, n=3, column='tip_pct'):
    # Sort ascending and keep the last n rows, i.e. the n largest values
    return df.sort_values(by=column)[-n:]
    
top(tips, n = 3)

	total_bill	tip	sex	smoker	day	time	size	tip_pct
67	3.07	1.00	Female	Yes	Sat	Dinner	1	0.325733
178	9.60	4.00	Female	Yes	Sun	Dinner	2	0.416667
172	7.25	5.15	Male	Yes	Sun	Dinner	2	0.710345
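
As an alternative to sorting and slicing, pandas also provides DataFrame.nlargest, which returns the n largest rows in descending order. The sketch below compares the two on a small hand-made DataFrame (hypothetical data, not the tips dataset); top_nlargest is an illustrative helper name.

```python
import pandas as pd

# Hypothetical toy data standing in for the tips dataset.
df = pd.DataFrame({
    'total_bill': [10.0, 20.0, 8.0, 25.0, 5.0],
    'tip': [1.0, 3.0, 2.0, 2.5, 2.0],
})
df['tip_pct'] = df['tip'] / df['total_bill']

def top_nlargest(df, n=3, column='tip_pct'):
    # nlargest returns the n rows with the largest values, largest first.
    return df.nlargest(n, column)

by_sort = df.sort_values(by='tip_pct')[-3:]   # largest value comes last
by_nlargest = top_nlargest(df, n=3)           # largest value comes first
print(by_nlargest)
```

Both approaches select the same rows; only the order of the result differs.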

Use apply() to run the function on each group

tips.groupby('smoker').apply(top)

		total_bill	tip	sex	smoker	day	time	size	tip_pct
smoker
Yes	67	3.07	1.00	Female	Yes	Sat	Dinner	1	0.325733
	178	9.60	4.00	Female	Yes	Sun	Dinner	2	0.416667
	172	7.25	5.15	Male	Yes	Sun	Dinner	2	0.710345
No	51	10.29	2.60	Female	No	Sun	Dinner	2	0.252672
	149	7.51	2.00	Male	No	Thur	Lunch	2	0.266312
	232	11.61	3.39	Male	No	Sat	Dinner	2	0.291990

To make this easier to follow, we add a rank variable to the dataset that ranks the tip percentage (tip_pct) within each group. The row with the lowest tip_pct in a group gets rank 1, and the row with the highest tip_pct gets the highest rank in that group.

tips['rank'] = tips.groupby('smoker')['tip_pct'].rank()
tips.head()

	total_bill	tip	sex	smoker	day	time	size	tip_pct	rank
0	16.99	1.01	Female	No	Sun	Dinner	2	0.059447	2.0
1	10.34	1.66	Male	No	Sun	Dinner	3	0.160542	85.0
2	21.01	3.50	Male	No	Sun	Dinner	3	0.166587	95.0
3	23.68	3.31	Male	No	Sun	Dinner	2	0.139780	45.0
4	24.59	3.61	Female	No	Sun	Dinner	4	0.146808	60.0
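
One detail worth keeping in mind: rank() uses method='average' by default, so tied values share a fractional rank. The sketch below, on a small hand-made DataFrame (hypothetical data), shows how the method argument changes the ranks within each group.

```python
import pandas as pd

# Hypothetical toy data; the two 0.20 values in the 'No' group are tied.
df = pd.DataFrame({
    'smoker': ['No', 'No', 'No', 'Yes', 'Yes'],
    'tip_pct': [0.10, 0.20, 0.20, 0.15, 0.30],
})

grouped = df.groupby('smoker')['tip_pct']
df['rank_average'] = grouped.rank()                # default: ties get the mean rank
df['rank_min'] = grouped.rank(method='min')        # ties get the lowest rank
df['rank_first'] = grouped.rank(method='first')    # ties ranked by order of appearance
print(df)
```

With the default method, neither tied row has rank 2 or rank 3, which matters when filtering on an exact rank value.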

Find the rows in the top 3 by tip_pct within each 'smoker' group

tips.groupby('smoker').apply(top)

		total_bill	tip	sex	smoker	day	time	size	tip_pct	rank
smoker
Yes	67	3.07	1.00	Female	Yes	Sat	Dinner	1	0.325733	91.0
	178	9.60	4.00	Female	Yes	Sun	Dinner	2	0.416667	92.0
	172	7.25	5.15	Male	Yes	Sun	Dinner	2	0.710345	93.0
No	51	10.29	2.60	Female	No	Sun	Dinner	2	0.252672	149.0
	149	7.51	2.00	Male	No	Thur	Lunch	2	0.266312	150.0
	232	11.61	3.39	Male	No	Sat	Dinner	2	0.291990	151.0

To find the highest-ranked row in each group:

tips.groupby(['smoker'], sort=False)['rank'].max()

smoker
No     151.0
Yes     93.0
Name: rank, dtype: float64

tips.loc[((tips['rank'] == 151) & (tips['smoker'] == 'No')) | ((tips['rank'] == 93) & (tips['smoker'] == 'Yes'))]

	total_bill	tip	sex	smoker	day	time	size	tip_pct	rank
172	7.25	5.15	Male	Yes	Sun	Dinner	2	0.710345	93.0
232	11.61	3.39	Male	No	Sat	Dinner	2	0.291990	151.0
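
Typing the maximum ranks by hand only works for this particular dataset. One possible way to avoid the hard-coded numbers is to broadcast each group's maximum rank back to its rows with transform('max') and compare. The sketch below uses a small hand-made DataFrame (hypothetical data); group_max and top_rows are illustrative names.

```python
import pandas as pd

# Hypothetical toy data standing in for the tips dataset.
df = pd.DataFrame({
    'smoker': ['No', 'No', 'No', 'Yes', 'Yes'],
    'tip_pct': [0.10, 0.25, 0.20, 0.15, 0.30],
})
df['rank'] = df.groupby('smoker')['tip_pct'].rank()

# transform('max') returns a Series aligned with df, holding each row's group maximum.
group_max = df.groupby('smoker')['rank'].transform('max')

# Keep the rows whose rank equals the maximum rank of their own group.
top_rows = df[df['rank'] == group_max]
print(top_rows)
```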

Alternatively, we can rank in the opposite direction, so that the row with the highest value gets rank 1

tips['rank'] = tips.groupby('smoker')['tip_pct'].rank(ascending = False)

tips.groupby('smoker').apply(top)

		total_bill	tip	sex	smoker	day	time	size	tip_pct	rank
smoker
Yes	67	3.07	1.00	Female	Yes	Sat	Dinner	1	0.325733	3.0
	178	9.60	4.00	Female	Yes	Sun	Dinner	2	0.416667	2.0
	172	7.25	5.15	Male	Yes	Sun	Dinner	2	0.710345	1.0
No	51	10.29	2.60	Female	No	Sun	Dinner	2	0.252672	3.0
	149	7.51	2.00	Male	No	Thur	Lunch	2	0.266312	2.0
	232	11.61	3.39	Male	No	Sat	Dinner	2	0.291990	1.0

Find the row with the highest tip_pct in each group

tips.loc[(tips['rank'] == 1)]

	total_bill	tip	sex	smoker	day	time	size	tip_pct	rank
172	7.25	5.15	Male	Yes	Sun	Dinner	2	0.710345	1.0
232	11.61	3.39	Male	No	Sat	Dinner	2	0.291990	1.0

From here we can find the row with any nth rank in each group, for example n = 10

tips.loc[(tips['rank'] == 10)]

	total_bill	tip	sex	smoker	day	time	size	tip_pct	rank
42	13.94	3.06	Male	No	Sun	Dinner	2	0.219512	10.0
174	16.82	4.00	Male	Yes	Sun	Dinner	2	0.237812	10.0
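
Filtering on an exact rank assumes that rank occurs in every group. With the default method='average', tied values share a fractional rank, so a filter such as rank == 2 can come back empty for a group. A minimal sketch of one way around this, assuming ties may be broken by order of appearance, is to rank with method='first'. The helper nth_by_group and the toy data are hypothetical.

```python
import pandas as pd

def nth_by_group(df, group_col, value_col, n):
    # method='first' breaks ties by order of appearance, so every integer
    # rank from 1 to the group size occurs exactly once in each group.
    ranks = df.groupby(group_col)[value_col].rank(method='first', ascending=False)
    return df[ranks == n]

# Hypothetical toy data; the two 0.20 values in the 'No' group are tied.
df = pd.DataFrame({
    'smoker': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes'],
    'tip_pct': [0.10, 0.20, 0.20, 0.15, 0.30, 0.25],
})

# With the default ranking, no 'No' row has rank exactly 2 (the tied rows get 1.5).
default_ranks = df.groupby('smoker')['tip_pct'].rank(ascending=False)

second = nth_by_group(df, 'smoker', 'tip_pct', n=2)
print(second)
```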

The rows ranked in the top 5 of each group

tips.loc[(tips['rank'] <= 5)]

	total_bill	tip	sex	smoker	day	time	size	tip_pct	rank
51	10.29	2.60	Female	No	Sun	Dinner	2	0.252672	3.0
67	3.07	1.00	Female	Yes	Sat	Dinner	1	0.325733	3.0
88	24.71	5.85	Male	No	Thur	Lunch	2	0.236746	5.0
109	14.31	4.00	Female	Yes	Sat	Dinner	2	0.279525	5.0
149	7.51	2.00	Male	No	Thur	Lunch	2	0.266312	2.0
172	7.25	5.15	Male	Yes	Sun	Dinner	2	0.710345	1.0
178	9.60	4.00	Female	Yes	Sun	Dinner	2	0.416667	2.0
183	23.17	6.50	Male	Yes	Sun	Dinner	4	0.280535	4.0
185	20.69	5.00	Male	No	Sun	Dinner	5	0.241663	4.0
232	11.61	3.39	Male	No	Sat	Dinner	2	0.291990	1.0
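
If the rank column itself is not needed, a similar top-n selection can be sketched by sorting once and taking the first n rows of each group with groupby().head(). The example below uses a small hand-made DataFrame (hypothetical data) and n = 2 to keep the output short.

```python
import pandas as pd

# Hypothetical toy data standing in for the tips dataset.
df = pd.DataFrame({
    'smoker': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes'],
    'tip_pct': [0.10, 0.25, 0.20, 0.15, 0.30, 0.22],
})

# Sort so the largest tip_pct comes first, then keep the first 2 rows per group.
top2 = (
    df.sort_values('tip_pct', ascending=False)
      .groupby('smoker')
      .head(2)
)
print(top2)
```

Unlike a rank filter, this always returns exactly n rows per group (or fewer if the group is smaller), even when values are tied.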

At this point we can take the rows in the top 5 by tip percentage in each group and use them for further analysis.