Some time back the UPSC 2016 results were declared and, amidst the usual hype and hoopla, a blog post by the Twitter handle yugaparivatan claimed that higher interview marks were awarded to Muslims. I was intrigued at the time and decided to analyse the whole dataset later.
I had some spare time this weekend, so I decided to analyse the whole thing along with other interesting factors such as gender, religion, year-wise and surname-wise variation.
The 2013, 2014, 2015 and 2016 UPSC lists of candidates, with categories, were collected from the Internet. The 2015 list was available only as a PDF, so it was extracted to text and then to CSV. Surnames were extracted with a regex. The code for extraction and the CSV file of the data can be found here and here. Gender and religion were coded by hand after going through the data.
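The surname step can be sketched as below. This is only an illustration, not the extraction code linked above: the names are invented, and the rule that the surname is the last whitespace-separated token is an assumption.

```r
# Hypothetical candidate names; the real list is not reproduced here.
candidate_names <- c("RAHUL KUMAR", "  ANITA SHARMA ", "PRIYA S", "MEENA")

# Assumed rule: the surname is the last whitespace-separated token.
extract_surname <- function(x) {
  x <- toupper(trimws(x))   # normalise case and strip stray spaces
  sub("^.*\\s+", "", x)     # drop everything up to the last run of whitespace
}

surnames <- extract_surname(candidate_names)
print(surnames)
```

A rule like this also explains the single-letter "surnames" (S, K, R, P) that show up later: for names written with trailing initials, the last token is an initial rather than a family name.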
First we will look at the average marks in each caste category and the percentage of Muslims and females in them.
library(dplyr)
library(ggplot2)
library(knitr)
# Mean marks and rank per category; Muslim and Gender are coded 0/1, so their means are shares
a=upm %>% group_by(caste) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0),
Muslims_percentage=round(100*mean(Muslim),2),
Females_percentage=round(100*mean(Gender),2))
knitr::kable(a)
| caste | n | marks | Written | Interview | Rank | Muslims_percentage | Females_percentage |
|---|---|---|---|---|---|---|---|
| General | 2106 | 909 | 734 | 175 | 352 | 3.04 | 20.32 |
| OBC | 1342 | 876 | 709 | 167 | 669 | 4.25 | 10.28 |
| SC | 720 | 853 | 688 | 165 | 831 | 0.00 | 14.58 |
| ST | 367 | 842 | 675 | 167 | 934 | 3.54 | 13.62 |
Thus the percentage of Muslims among UPSC selected candidates is around 3-4% in every category except SC. Interestingly, the Muslim share in the OBC category (4.25%) is higher than in the General category (3.04%). As expected, the Muslim percentage in SC is zero, since SC status is not extended to Muslims, while ST classification depends on region and hence 3.54% of ST candidates are Muslim.
What is obvious here is that, in spite of protestations, Muslims are under-represented in the UPSC pool.
Female representation is best in the General group at 20%, while it is lowest in OBC at 10.28%, even lower than in the SC group, which is surprising to say the least.
Interview and written marks follow the general order General > OBC > SC > ST, except in the interview, where ST candidates secure higher marks than SC (we shall examine this in a moment).
General category candidates have better ranks on average, but note that each service has category-wise quotas and there is a separate list for each category.
Let’s see the whole data year-wise.
b=upm %>% group_by(caste,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0),
Muslims_percentage=round(100*mean(Muslim),2),
Females_percentage=round(100*mean(Gender),2))
knitr::kable(b)
| caste | year | n | marks | Written | Interview | Rank | Muslims_percentage | Females_percentage |
|---|---|---|---|---|---|---|---|---|
| General | 2013 | 517 | 809 | 629 | 180 | 347 | 3.09 | 20.89 |
| General | 2014 | 590 | 917 | 739 | 177 | 390 | 2.03 | 22.20 |
| General | 2015 | 499 | 900 | 729 | 171 | 328 | 3.01 | 16.63 |
| General | 2016 | 500 | 1011 | 841 | 170 | 338 | 4.20 | 21.20 |
| OBC | 2013 | 327 | 775 | 606 | 169 | 643 | 3.98 | 10.09 |
| OBC | 2014 | 354 | 879 | 708 | 171 | 746 | 2.82 | 10.17 |
| OBC | 2015 | 314 | 864 | 698 | 165 | 634 | 5.10 | 6.37 |
| OBC | 2016 | 347 | 980 | 816 | 164 | 646 | 5.19 | 14.12 |
| SC | 2013 | 187 | 753 | 586 | 167 | 831 | 0.00 | 16.58 |
| SC | 2014 | 194 | 864 | 697 | 167 | 884 | 0.00 | 11.86 |
| SC | 2015 | 176 | 844 | 681 | 163 | 805 | 0.00 | 13.64 |
| SC | 2016 | 163 | 965 | 801 | 163 | 795 | 0.00 | 16.56 |
| ST | 2013 | 91 | 740 | 572 | 168 | 934 | 4.40 | 12.09 |
| ST | 2014 | 98 | 846 | 676 | 170 | 1012 | 2.04 | 14.29 |
| ST | 2015 | 89 | 834 | 672 | 162 | 867 | 3.37 | 14.61 |
| ST | 2016 | 89 | 949 | 783 | 166 | 918 | 4.49 | 13.48 |
Let’s Plot Muslim representation community wise
ggplot(b,aes(x=year,y=Muslims_percentage,color=caste,group=caste))+geom_line()+
geom_hline(yintercept =round(100*mean(upm$Muslim),2) ,color="red",linetype="dashed")+
annotate("text",x=2015.5,y=round(100*mean(upm$Muslim),2),label="Average Muslim \n Representation %")+
labs(color="Category")
Let’s Plot Female Representation Community wise
ggplot(b,aes(x=year,y=Females_percentage,color=caste,group=caste))+geom_line()+
geom_hline(yintercept =round(100*mean(upm$Gender),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(100*mean(upm$Gender),2),label="Average Female \n Representation %")+
labs(color="Category")
So we saw that OBCs have poor female representation and slightly higher Muslim representation than the General category.
Let’s look at the histogram of all Written marks across the years.
ggplot(upm,aes(x=Written,y=..density..))+geom_histogram(fill="orange")+
geom_density(color="black")
It appears fairly symmetrical. However, we know that caste-wise differences exist, so let’s plot caste-wise histograms.
ggplot(upm,aes(x=Written,y=..density..))+geom_histogram(fill="orange")+
geom_density(color="black")+facet_wrap(~caste)
So the fairly symmetrical curve becomes almost trimodal, with three peaks. What could be the cause? Let’s investigate further.
ggplot(upm,aes(x=Written,y=..density..))+geom_histogram(fill="orange")+
geom_density(color="black")+facet_wrap(caste~year)
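Once we plot by year, the plots become symmetric again. How is that? Written marks shifted from year to year: in 2013 they were considerably lower than in 2016 (629 vs. 841 on average in the General category), so the pooled histogram stacks several shifted distributions on top of each other.
Thus an innocuous-looking histogram hid many sub-populations within itself, one of the reasons a single measure of central tendency should never be accepted until we have seen a graphical representation of the data.
This is closely related to Simpson’s paradox, where pooled data hide or even reverse the patterns seen within subgroups.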
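A small simulation illustrates how pooling shifted yearly distributions produces the multi-peaked shape. It is a sketch on made-up data: the yearly means are borrowed from the General rows of the table above, while the normal shape, the standard deviation of 30 and the 500 candidates per year are assumptions.

```r
set.seed(1)

# Yearly means taken from the General rows of the year-wise table; sd = 30 is an assumption
year_means <- c(`2013` = 629, `2014` = 739, `2015` = 729, `2016` = 841)

# Simulate 500 hypothetical candidates per year
sim <- do.call(rbind, lapply(names(year_means), function(y) {
  data.frame(year = y, Written = rnorm(500, mean = year_means[[y]], sd = 30))
}))

# The pooled spread is far larger than the spread within any single year
pooled_sd <- sd(sim$Written)
within_sd <- tapply(sim$Written, sim$year, sd)
print(round(pooled_sd, 1))
print(round(within_sd, 1))

# hist(sim$Written, breaks = 40)  # uncomment to see the peaks in the pooled data
```

Each simulated year is bell-shaped on its own; the peaks only appear once the years are pooled, because 2014 and 2015 overlap while 2013 and 2016 sit on either side.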
Let’s plot Interview marks.
ggplot(upm,aes(x=Interview,y=..density..))+geom_histogram(fill="orange")+
geom_density(color="black")
ggplot(upm,aes(x=Interview,y=..density..))+geom_histogram(fill="orange")+
geom_density(color="black")+facet_wrap(~caste)
ggplot(upm,aes(x=Interview,y=..density..))+geom_histogram(fill="orange")+
geom_density(color="black")+facet_wrap(caste~year)
The histograms and density plots give a lot of insight in the distribution of marks.
Let’s Plot Interview marks community wise
ggplot(b,aes(x=year,y=Interview,color=caste,group=caste))+geom_line()+
geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")
We see that interview marks have gone down in the General category, while in the ST category they dipped in 2015 and partly recovered in 2016; the overall change is downward.
ggplot(b,aes(x=year,y=Written,color=caste,group=caste))+geom_line()+
geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")
Average written marks have gone up in all categories, and the categories keep their usual order.
upm %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>% group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>%
ggplot(aes(x=year,y=Interview,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")
We see that interview marks go down for both Muslims and non-Muslims; however, Muslims score higher (by around 5 marks) on average.
Now some people alleged that interview marks were deliberately inflated in the top ranks (ranks below 500), so let’s see the trend among top rankers.
upm %>% filter(Rank<500) %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>% group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>%
ggplot(aes(x=year,y=Interview,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")
In this graph we can clearly see that while marks have been going down in both groups, in 2016 there is an uptrend for Muslim candidates (almost 13 marks higher among the top 500 candidates), and that led to higher selection of Muslims in the top ranks. This might be a one-off event, but in 2016, while the interview marks of Muslim candidates went down on average, they went up in the top group, against the general trend.
Let’s now examine Written marks.
upm %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>% group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>%
ggplot(aes(x=year,y=Written,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")
In this graph we clearly see that there is almost no difference between Muslim and non-Muslim candidates in the UPSC written test.
Let’s examine the top candidates, with ranks under 500.
upm %>%filter(Rank<500) %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>% group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>%
ggplot(aes(x=year,y=Written,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")
Here as well there is no difference in written scores (as the mark sheets are blinded), and hence the better performance in the top group was driven entirely by better performance in the interview.
Let’s see the corresponding graphs for females, starting with written marks.
upm %>% mutate(Gender=ifelse(Gender==1,"Female","Male")) %>% group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>%
ggplot(aes(x=year,y=Written,color=factor(Gender),group=factor(Gender)))+geom_line()+
geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")
We see almost no difference here, as opposed to interview marks.
Now among top candidates:
upm %>% filter(Rank<500) %>% mutate(Gender=ifelse(Gender==1,"Female","Male")) %>% group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>%
ggplot(aes(x=year,y=Written,color=factor(Gender),group=factor(Gender)))+geom_line()+
geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")
Here as well, there is no difference.
Let’s see Interview Scores Now
upm %>% mutate(Gender=ifelse(Gender==1,"Female","Male")) %>% group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>%
ggplot(aes(x=year,y=Interview,color=factor(Gender),group=factor(Gender)))+geom_line()+
geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")
Now it appears that being female confers an advantage of almost 5 points in the interview.
Let’s analyse it in top rankers
upm %>%filter(Rank<500) %>% mutate(Gender=ifelse(Gender==1,"Female","Male")) %>% group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>%
ggplot(aes(x=year,y=Interview,color=factor(Gender),group=factor(Gender)))+geom_line()+
geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")
Among top rankers the difference is smaller, and even that closed in 2016 (compare with the performance of Muslim candidates above). Let’s examine this association statistically.
First let’s look at relationship between Interview and Written marks
upm %>% ggplot(aes(x=Interview,y=Written))+
geom_jitter(color="red")+
stat_smooth(method="lm",se=FALSE)
We see that there is a predominantly negative relationship. Does this relationship hold across years? Let’s see.
upm %>% ggplot(aes(x=Interview,y=Written))+
geom_jitter(color="red")+
stat_smooth(method="lm",se=FALSE)+
facet_wrap(~year)
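We can see that the relationship holds, but its strength varies. It is expected as well: those who do very well in the written exam might be introverts and perform worse in the interview. Part of it is also mechanical, since the list only contains selected candidates whose total (written plus interview) cleared a cutoff, so a candidate with lower written marks needed higher interview marks to make the list.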
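A negative slope can arise purely from selection on the total, even if written and interview marks were unrelated among all applicants. The simulation below is a sketch on invented data: the means, standard deviations and the top-5% cutoff are assumptions, not values from the UPSC data.

```r
set.seed(42)
n <- 20000

# Hypothetical applicants: written and interview marks drawn independently
written   <- rnorm(n, mean = 700, sd = 50)
interview <- rnorm(n, mean = 165, sd = 20)

# Keep only the top 5% by total, mimicking a final merit list (assumed cutoff)
total    <- written + interview
selected <- total >= quantile(total, 0.95)

# Correlation among all applicants vs. among the selected ones only
cor_all      <- cor(written, interview)
cor_selected <- cor(written[selected], interview[selected])
print(round(c(all = cor_all, selected = cor_selected), 2))
```

Among all simulated applicants the correlation is close to zero, while among the selected ones it is clearly negative. Some of the negative slope in the plots above may therefore reflect how the list is built rather than anything about the candidates.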
Let’s see if this relationship holds in females as well.
upm %>% mutate(Gender=ifelse(Gender==1,"Female","Male")) %>% ggplot(aes(x=Interview,y=Written,color=Gender))+
geom_jitter()+
stat_smooth(method="lm",se=FALSE)+
facet_wrap(~year)
Well, it does across years, though females tend to score higher.
Let’s look at it for Muslims.
upm %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>% ggplot(aes(x=Interview,y=Written,color=Muslim))+
geom_jitter()+
stat_smooth(method="lm",se=FALSE)+
facet_wrap(~year)
On adjusting for written marks we see that Muslims differ only slightly from non-Muslims. However, in 2016 there is a clear divergence at higher marks, indicating higher interview marks for Muslim candidates.
Let’s now do a formal statistical test. Let’s consider interview marks as the dependent variable, predicted by the individual’s caste, gender and religion (in which there are obvious trends). The model is obviously flawed, as all models are, since it has no measure of the innate ability of the participant, but it helps us control for gender, caste and religion differences between candidates as well as yearly variation. I have left out the year 2013, as there was a significant shift of about 100 points in written marks after it.
f= lm(Interview~Written.c+Gender+caste+Muslim+factor(year),data=filter(upm,year>2013))
summary(f,digits=2)
##
## Call:
## lm(formula = Interview ~ Written.c + Gender + caste + Muslim +
## factor(year), data = filter(upm, year > 2013))
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.251 -11.024 0.173 11.299 52.918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 179.155691 0.618955 289.449 < 2e-16 ***
## Written.c -0.126074 0.008255 -15.272 < 2e-16 ***
## Gender 3.770246 0.794961 4.743 2.20e-06 ***
## casteOBC -9.704373 0.718291 -13.510 < 2e-16 ***
## casteSC -13.807427 0.914435 -15.099 < 2e-16 ***
## casteST -14.421253 1.198486 -12.033 < 2e-16 ***
## Muslim 5.984816 1.701129 3.518 0.00044 ***
## factor(year)2015 -7.046271 0.704688 -9.999 < 2e-16 ***
## factor(year)2016 6.922696 1.108052 6.248 4.68e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.76 on 3404 degrees of freedom
## Multiple R-squared: 0.1344, Adjusted R-squared: 0.1324
## F-statistic: 66.06 on 8 and 3404 DF, p-value: < 2.2e-16
The analysis takes an average General-category, male, non-Muslim candidate in 2014 as the base, with about 179 marks in the interview. Every 10-mark increase in written marks lowers the expected interview marks by about 1.3. Being female raises them by about 4. OBC, SC and ST candidates score around 10, 14 and 14 marks lower than General candidates, while being a Muslim adds around 6 marks. Relative to 2014, interview marks were about 7 lower in 2015 and about 7 higher in 2016, holding the other variables fixed.
However, it must be noted that since we don’t have other measures of the innate ability of candidates, the predictive power of the model is poor (R-squared of about 0.13) despite controlling for confounders.
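To make the coefficients concrete, here is a small worked calculation using the rounded estimates printed in the output above. It assumes that `Written.c` is the written mark centred on its mean (so 0 means an average written score); the helper `predict_interview` is hypothetical and only adds up the terms of the linear model.

```r
# Rounded coefficients copied from the summary(f) output above
coefs <- c(intercept = 179.16, written_c = -0.126, female = 3.77,
           obc = -9.70, sc = -13.81, st = -14.42, muslim = 5.98,
           y2015 = -7.05, y2016 = 6.92)

# Hypothetical helper: expected interview marks for a given profile.
# written_c is assumed to be written marks minus the average written marks.
predict_interview <- function(written_c = 0, female = 0, obc = 0, sc = 0, st = 0,
                              muslim = 0, y2015 = 0, y2016 = 0) {
  x <- c(1, written_c, female, obc, sc, st, muslim, y2015, y2016)
  sum(coefs * x)
}

# Base profile: General, male, non-Muslim, 2014, average written marks
baseline <- predict_interview()

# OBC female in 2016 with written marks 10 above average
obc_female_2016 <- predict_interview(written_c = 10, female = 1, obc = 1, y2016 = 1)

print(c(baseline = baseline, obc_female_2016 = obc_female_2016))
```

With a residual standard error of about 17 marks, any individual candidate can land far from these expected values; the numbers describe group averages only.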
Let’s do the same analysis for top rankers.
g= lm(Interview~Written.c+Gender+caste+Muslim+factor(year),data=filter(upm,year>2013,Rank<500))
summary(g)
##
## Call:
## lm(formula = Interview ~ Written.c + Gender + caste + Muslim +
## factor(year), data = filter(upm, year > 2013, Rank < 500))
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.652 -9.895 -0.884 9.960 60.085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 190.60196 0.93964 202.845 < 2e-16 ***
## Written.c -0.27833 0.01455 -19.126 < 2e-16 ***
## Gender 2.06934 0.98705 2.096 0.0362 *
## casteOBC -2.41066 1.09727 -2.197 0.0282 *
## casteSC -3.54604 1.96226 -1.807 0.0709 .
## casteST -2.57364 3.81958 -0.674 0.5005
## Muslim 9.76134 2.37969 4.102 4.32e-05 ***
## factor(year)2015 -11.45810 0.99103 -11.562 < 2e-16 ***
## factor(year)2016 17.36783 1.70377 10.194 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.16 on 1488 degrees of freedom
## Multiple R-squared: 0.2482, Adjusted R-squared: 0.2442
## F-statistic: 61.41 on 8 and 1488 DF, p-value: < 2.2e-16
We see that among top rankers the difference in interview marks between General and OBC/SC/ST candidates is much smaller, and so is the female advantage. Interestingly, however, the advantage of being a Muslim is amplified among these top rankers: it is worth up to 10 marks even after controlling for the other variables, which I read as some evidence of deliberate grade inflation in interview marks based on religion.
Let’s run this analysis without the controversial year 2016.
g= lm(Interview~Written.c+Gender+caste+Muslim+factor(year),data=filter(upm,year<2016,Rank<500))
summary(g)
##
## Call:
## lm(formula = Interview ~ Written.c + Gender + caste + Muslim +
## factor(year), data = filter(upm, year < 2016, Rank < 500))
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.852 -11.569 -0.323 10.890 65.527
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 157.18403 1.38120 113.802 < 2e-16 ***
## Written.c -0.30602 0.01471 -20.803 < 2e-16 ***
## Gender 2.90094 1.10309 2.630 0.00863 **
## casteOBC -3.68990 1.19525 -3.087 0.00206 **
## casteSC -4.28438 2.16018 -1.983 0.04751 *
## casteST -4.41615 4.23344 -1.043 0.29704
## Muslim 8.81479 2.83858 3.105 0.00194 **
## factor(year)2014 34.60859 1.99323 17.363 < 2e-16 ***
## factor(year)2015 22.73994 1.79319 12.681 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.8 on 1488 degrees of freedom
## Multiple R-squared: 0.257, Adjusted R-squared: 0.253
## F-statistic: 64.35 on 8 and 1488 DF, p-value: < 2.2e-16
We see that even when we exclude 2016 (as a one-off event), being a Muslim is associated with an advantage of up to 10 points (about 9 here) in the interview among top rankers. It has the largest coefficient among the demographic predictors (gender, caste and religion), leaving aside written marks and the systematic yearly variation.
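The point estimates for the Muslim coefficient come with sizeable standard errors, so it helps to look at approximate 95% intervals. The sketch below uses only the estimates and standard errors printed in the two top-rank models above, with a normal approximation (reasonable given 1488 residual degrees of freedom).

```r
# Estimate and standard error of the Muslim coefficient, copied from the two outputs above
est <- c(years_2014_2016 = 9.76134, years_2013_2015 = 8.81479)
se  <- c(years_2014_2016 = 2.37969, years_2013_2015 = 2.83858)

# Approximate 95% interval: estimate +/- z * SE (normal approximation)
z  <- qnorm(0.975)
ci <- cbind(lower = est - z * se, upper = est + z * se)
print(round(ci, 1))
```

Both intervals exclude zero, but they are wide, so the size of the advantage is estimated much less precisely than its existence.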
We must also note that although interview marks vary by caste category, the same difference is seen in written marks. For female and Muslim candidates, however, those with similar written marks perform better in the interview, indicating potential bias. Further, while the interview advantage for females diminishes at top ranks (<500), it increases for Muslim candidates. It should be noted, though, that female and Muslim candidates are vastly under-represented relative to their share of the population. One point of this analysis is that if these candidates are being favoured, and this analysis indicates that they are, then the whole process should be open.
Let’s do some fun analysis now.
I extracted the last names of candidates with a regex, so let’s see which surnames dominate the UPSC list.
upm %>% group_by(surname) %>%
summarise(n=n()) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>%
arrange(desc(n)) %>%head(n=15)
## # A tibble: 15 x 3
## surname n proportion
## <chr> <int> <dbl>
## 1 KUMAR 269 5.93
## 2 SINGH 239 5.27
## 3 MEENA 113 2.49
## 4 SHARMA 97 2.14
## 5 S 95 2.09
## 6 YADAV 91 2.01
## 7 GUPTA 69 1.52
## 8 JAIN 58 1.28
## 9 K 58 1.28
## 10 R 55 1.21
## 11 MISHRA 51 1.12
## 12 VERMA 45 0.990
## 13 GARG 38 0.840
## 14 P 38 0.840
## 15 PANDEY 37 0.820
While generic surnames like KUMAR and SINGH dominate the scene, what is surprising is that the MEENA community, which falls under ST, has 113 seats, more than communities like SHARMA, GUPTA etc.
Let’s plot it
upm %>% group_by(surname) %>%
summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>%
arrange(desc(n)) %>%
filter(n>15) %>%
ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="blue")+
coord_flip()+
xlab("Surname")
Let’s look at which surnames have best ranks
upm %>% group_by(surname) %>%
summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>%
arrange(desc(n)) %>%
filter(n>15) %>%
ggplot(aes(x=reorder(surname,desc(Rank)),y=Rank))+geom_bar(stat="identity",fill="blue")+
coord_flip()+
xlab("Surname")
We see that the best ranks are secured by Brahmin and Bania communities, even though the absolute numbers vary.
Which community/surname (except KUMARI) has the best female representation? Let’s see.
upm %>% group_by(surname) %>%
summarise(n=n(),Rank=mean(Rank),Female_percentage=round(100*mean(Gender),2)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>%
arrange(desc(n)) %>%
filter(n>15) %>%
ggplot(aes(x=reorder(surname,Female_percentage),y=Female_percentage))+geom_bar(stat="identity",fill="blue")+
geom_hline(yintercept =round(100*mean(upm$Gender),2) ,color="red",linetype="dashed")+
annotate("text",x=2,y=round(100*mean(upm$Gender),2),label="Average Female \n Representation (15.9) %")+
coord_flip()+
xlab("Surname")
We see that Baniyas (Guptas), some Brahmins and some South Indian communities (represented by single-letter surnames like S, C, A, B, J, K) have well above average female representation.
Let’s look inside the categories. First, the General category.
upm %>% filter(caste=="General") %>% group_by(surname) %>%
summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>%
arrange(desc(n)) %>%
filter(n>15) %>%
ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
coord_flip()+
xlab("Surname")
Let’s look at the ST category.
upm %>% filter(caste=="ST") %>% group_by(surname) %>%
summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>%
arrange(desc(n)) %>%
filter(n>4) %>%
ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
coord_flip()+
xlab("Surname")
What is clearly evident here is that the MEENA group is an anomaly in this population, a sort of super-elite which is cornering a disproportionate share of the benefits reserved for ST (up to 113 of the 367 ST seats, about 31%).
Let’s look at OBC community
upm %>% filter(caste=="OBC") %>% group_by(surname) %>%
summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>%
arrange(desc(n)) %>%
filter(n>5) %>%
ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
coord_flip()+
xlab("Surname")
We see that, apart from the generic KUMAR and SINGH surnames, Yadavs and Patels from North India are capturing a lot of OBC seats. However, the story here is that many South Indian candidates (with single-letter surnames) are dominant in the OBC category.
Let’s examine the SC category.
upm %>% filter(caste=="SC") %>% group_by(surname) %>%
summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>%
arrange(desc(n)) %>%
filter(n>5) %>%
ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
coord_flip()+
xlab("Surname")
We see that while most SC candidates prefer the KUMAR, SINGH and even VERMA titles, many candidates with South Indian surnames take these benefits as well.
Thus, on the whole, Northern communities predominate in the General category, while in the OBC and SC groups South Indian surnames are well represented. In the ST group the MEENA community captures the largest share of the seats.
Here is a list of summary statistics by surnames
upm %>% group_by(surname) %>%
summarize_at(vars(Total,Rank,Interview,Written,Gender),funs(mean,n=n())) %>% arrange(desc(Total_n)) %>%
rename(count=Total_n) %>%
select(-Rank_n,-Interview_n,-Written_n,-Gender_n) %>%
filter(count>4) %>%
arrange(Rank_mean) %>% mutate(Female_percentage = round(100*Gender_mean,2)) %>%
mutate(proportion=round(100*(count/sum(count)),2)) %>%
mutate(Female_total = round(Female_percentage*count/100,0)) %>%
select(-Gender_mean) %>%
#arrange(desc(Female_total))
arrange(desc(proportion)) %>%
head(n=15)
## # A tibble: 15 x 9
## surname Total_mean Rank_mean Interview_mean Written_mean count
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 KUMAR 881 636 166 715 269
## 2 SINGH 874 600 169 706 239
## 3 MEENA 852 909 160 692 113
## 4 SHARMA 884 428 178 705 97
## 5 S 883 573 174 709 95
## 6 YADAV 871 649 164 708 91
## 7 GUPTA 908 373 171 737 69
## 8 JAIN 911 345 172 739 58
## 9 K 877 615 170 707 58
## 10 R 881 619 170 711 55
## 11 MISHRA 911 385 178 733 51
## 12 VERMA 892 607 169 723 45
## 13 GARG 911 297 174 737 38
## 14 P 865 602 174 691 38
## 15 PANDEY 902 387 175 727 37
## # ... with 3 more variables: Female_percentage <dbl>, proportion <dbl>,
## # Female_total <dbl>
Repeater analysis
Let’s look at names which repeat across years. People with the same name can of course both get selected, but it is less likely to happen in consecutive years; in any case, with limited resources, this is a slight overestimate of the repeater count.
repeater= upm %>% group_by(Name) %>% summarize(n=n()) %>% filter(n>1) %>% pull(Name)
The number of repeaters is 697. So the percentage of people who repeat, out of a total of 4535 candidates, is:
round(100*(length(repeater)/length(upm$Name)),2)
## [1] 15.37
Thus proportion of repeaters is around 15%
So what percentage of repeaters eventually end up getting a top 100 rank? Let’s calculate.
top_ranker =upm %>% filter(Name %in% repeater) %>% filter(Rank<100) %>% pull(Name)
The number of top rankers among these repeaters is 111.
So the percentage of repeaters who end up as top rankers is 111/697:
round(100*(length(top_ranker)/length(repeater)),2)
## [1] 15.93
or around 16%.
Thus around 15% of selected people end up repeating, and only about 16% of these determined ones end up getting a top 100 rank. Who said UPSC was an easy exam! It takes a lot to be a Babu.. :-)
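The repeater logic can be traced on a toy list. The names, years and ranks below are invented for illustration; the sketch also wraps the top-ranker names in `unique()`, which is a small variation so that a repeater with two top-100 ranks is not counted twice.

```r
# Hypothetical mini-list; names, years and ranks are invented
toy <- data.frame(
  Name = c("A KUMAR", "B SINGH", "A KUMAR", "C JAIN", "B SINGH", "D S"),
  year = c(2013, 2013, 2014, 2014, 2016, 2015),
  Rank = c(640, 310, 95, 410, 188, 720)
)

# A name counts as a repeater if it appears on more than one yearly list
counts   <- table(toy$Name)
repeater <- names(counts[counts > 1])

# Share of list entries that belong to a repeated name, as in the text
pct_repeater <- round(100 * length(repeater) / nrow(toy), 2)

# Repeaters who reached a rank below 100 at least once
top_ranker <- unique(toy$Name[toy$Name %in% repeater & toy$Rank < 100])
pct_top    <- round(100 * length(top_ranker) / length(repeater), 2)

print(c(pct_repeater = pct_repeater, pct_top = pct_top))
```

Matching on the name alone treats two different people with the same name as one repeater, which is why the count above is best read as an upper bound.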
1. Being a female or a Muslim confers an advantage in the interview which is minor (5 marks), since the whole interview carries 275 marks. However, Muslim candidates show an increasing trend of higher marks (10) at top ranks, so there is some evidence of deliberate grade inflation.
2. General candidates get better marks than OBC, SC and ST candidates, in that order, in the interview; however, this advantage doesn’t hold up in the top ranks.
3. Baniyas (Gupta, Agarwal) in particular get the best mean ranks and have the highest female representation, almost double the national average; they are arguably one of the most developed communities in India.
4. Female representation in SC is higher than in OBC.
5. The MEENA community captures the largest share of the seats in the ST category, which is a cause for concern.
6. Higher written marks are inversely correlated with interview marks. A person with 100 more written marks than average is likely to get about 13 fewer marks in the interview.
7. Muslims and females are under-represented in the UPSC list. However, in my personal opinion covert grade inflation is not the solution, as it will backfire; we need an open and transparent policy to increase their representation (e.g. opening more schools in Muslim-dominated areas rather than a specific quota).
8. South Indian communities are better represented in the OBC and SC lists (understandable, since despite advanced demographics on almost every parameter few people let go of the advantage of quotas); in the General list, however, North Indian communities dominate.