This is an updated version of my earlier analysis of UPSC 2013-16 data, which suggested that females and Muslims were being awarded higher marks in the interview even after controlling for written marks and caste, and that some sub-castes were cornering most of the benefits in their respective categories.
The core of the analysis remains the same:
The 2013, 2014, 2015 and 2016 UPSC lists of candidates with categories were collected from the Internet. The 2015 list was available only as a PDF, so it was converted back to text and then to CSV; surnames were extracted with a regex. 2017 data was then added. The code for extraction and the CSV file of the data can be found here and here. Gender and religion were coded by hand after examining the data.
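The extraction code is linked rather than shown, so what follows is only a minimal sketch of how a surname could be pulled out of a full name with a regex in R. The helper `extract_surname` and the sample names are hypothetical, and the rule used (take the last whitespace-separated token) is an assumption, not necessarily the exact regex used for this post.

```r
# Hypothetical helper: treat the last whitespace-separated token as the surname.
# This is an assumed rule for illustration only.
extract_surname <- function(name) {
  # collapse repeated spaces, trim the ends, and upper-case the name
  cleaned <- toupper(trimws(gsub("\\s+", " ", name)))
  # greedy match: drop everything up to and including the last space
  sub("^.*\\s", "", cleaned)
}

candidates <- c("Rahul Kumar", "ANITA  SHARMA ", "Priya S", "Meena")
surnames <- extract_surname(candidates)
print(surnames)
```

Under this rule, names that end in an initial yield single-letter "surnames" such as S or K, which is consistent with the single-letter entries in the surname tables further down.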
First we will look at the average distribution of marks across caste categories and the percentage of Muslims and females in each:
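library(dplyr)    # %>%, group_by(), summarise()
library(ggplot2)  # plots used throughout
library(knitr)    # kable() for tables
# Mean marks, mean rank and Muslim/female share per caste category
a=upm %>% group_by(caste) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0),
Muslims_percentage=round(100*mean(Muslim),2),
Females_percentage=round(100*mean(Gender),2))
knitr::kable(a)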
| caste | n | marks | Written | Interview | Rank | Muslims_percentage | Females_percentage |
|---|---|---|---|---|---|---|---|
| General | 2582 | 931 | 757 | 174 | 343 | 3.14 | 20.80 |
| OBC | 1617 | 897 | 730 | 167 | 656 | 4.70 | 10.70 |
| SC | 885 | 875 | 710 | 165 | 818 | 0.00 | 16.84 |
| ST | 441 | 863 | 697 | 166 | 909 | 3.40 | 13.15 |
Thus we see that the percentage of Muslims among UPSC-selected candidates is between 3% and 5%. Interestingly, the percentage of Muslims in the OBC category is higher than in the General category. As expected, the Muslim percentage in SC is zero, since Muslims cannot be given SC status, while ST classification depends on region and hence 3.40% of ST candidates are Muslim as well.
What is obvious here is that, in spite of protestations to the contrary, Muslims are under-represented in the UPSC pool.
Female representation is best in the General group at ~21%, while it is lowest in OBC at 10.7%, lower even than the SC group (~17%) and the ST group (~13%), which is surprising to say the least.
Mean Interview and Written marks follow the general order General > OBC > SC > ST, except for Interview, where ST candidates secure slightly higher marks than SC candidates (we shall examine this in a moment).
General category candidates have better ranks, but as we know there is quota-wise representation in each service and a separate list for each category.
Let’s see the whole data year-wise:
b=upm %>% group_by(caste,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0),
Muslims_percentage=round(100*mean(Muslim),2),
Females_percentage=round(100*mean(Gender),2))
knitr::kable(b)
| caste | year | n | marks | Written | Interview | Rank | Muslims_percentage | Females_percentage |
|---|---|---|---|---|---|---|---|---|
| General | 2013 | 517 | 809 | 629 | 180 | 347 | 3.09 | 20.89 |
| General | 2014 | 590 | 917 | 739 | 177 | 390 | 2.03 | 22.20 |
| General | 2015 | 499 | 900 | 729 | 171 | 328 | 3.01 | 16.63 |
| General | 2016 | 500 | 1011 | 841 | 170 | 338 | 4.20 | 21.20 |
| General | 2017 | 476 | 1028 | 857 | 171 | 304 | 3.57 | 22.90 |
| OBC | 2013 | 327 | 775 | 606 | 169 | 643 | 3.98 | 10.09 |
| OBC | 2014 | 354 | 879 | 708 | 171 | 746 | 2.82 | 10.17 |
| OBC | 2015 | 314 | 864 | 698 | 165 | 634 | 5.10 | 6.37 |
| OBC | 2016 | 347 | 980 | 816 | 164 | 646 | 5.19 | 14.12 |
| OBC | 2017 | 275 | 997 | 833 | 164 | 591 | 6.91 | 12.73 |
| SC | 2013 | 187 | 753 | 586 | 167 | 831 | 0.00 | 16.58 |
| SC | 2014 | 194 | 864 | 697 | 167 | 884 | 0.00 | 11.86 |
| SC | 2015 | 176 | 844 | 681 | 163 | 805 | 0.00 | 13.64 |
| SC | 2016 | 163 | 965 | 801 | 163 | 795 | 0.00 | 16.56 |
| SC | 2017 | 165 | 974 | 810 | 164 | 762 | 0.00 | 26.67 |
| ST | 2013 | 91 | 740 | 572 | 168 | 934 | 4.40 | 12.09 |
| ST | 2014 | 98 | 846 | 676 | 170 | 1012 | 2.04 | 14.29 |
| ST | 2015 | 89 | 834 | 672 | 162 | 867 | 3.37 | 14.61 |
| ST | 2016 | 89 | 949 | 783 | 166 | 918 | 4.49 | 13.48 |
| ST | 2017 | 74 | 971 | 805 | 166 | 781 | 2.70 | 10.81 |
Let’s Plot Muslim representation community wise
ggplot(b,aes(x=year,y=Muslims_percentage,color=caste,group=caste))+geom_line()+
geom_hline(yintercept =round(100*mean(upm$Muslim),2) ,color="red",linetype="dashed")+
annotate("text",x=2015.5,y=round(100*mean(upm$Muslim),2),label="Average Muslim \n Representation %")+
labs(color="Category")
We see OBC Muslims are doing comparatively better than General category Muslims.
Let’s Plot Female Representation Community wise
ggplot(b,aes(x=year,y=Females_percentage,color=caste,group=caste))+geom_line()+
geom_hline(yintercept =round(100*mean(upm$Gender),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(100*mean(upm$Gender),2),label="Average Female \n Representation %")+
labs(color="Category")
So we see that OBCs have poor female representation and slightly higher Muslim representation than the General category.
In 2017 there was a massive jump in the number of females in the SC category.
Let’s look at the histogram of all Written marks across the years.
ggplot(upm,aes(x=Written,y=..density..))+geom_histogram(fill="orange")+
geom_density(color="black")
It appears fairly symmetrical. However, we know that caste-wise differences exist, so let’s plot caste-wise histograms.
ggplot(upm,aes(x=Written,y=..density..))+geom_histogram(fill="orange")+
geom_density(color="black")+facet_wrap(~caste)
So the fairly symmetrical curve becomes almost trimodal, with three peaks. What could be the cause? Let’s investigate further.
ggplot(upm,aes(x=Written,y=..density..))+geom_histogram(fill="orange")+
geom_density(color="black")+facet_wrap(caste~year)
So once we plot by year, the plots become symmetric again. How is that?
Marks in 2013 were considerably lower than in 2016, so each group of years forms its own peak.
Within caste groups, we see that the General marks distribution is shifted to the right of SC/ST/OBC.
Thus we saw how an innocuous-looking histogram hid many sub-populations within itself, one of the reasons a single measure of central tendency should never be accepted until we see a graphical representation of the data.
This is closely related to Simpson’s paradox, where a pattern seen in pooled data changes or disappears once the data is split into subgroups.
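To see how pooling can hide sub-populations, here is a small simulation. The numbers are made up (two hypothetical years with different mean marks) and are not the UPSC data; the point is only that the pooled mean can sit in a region where almost no candidate actually scores.

```r
set.seed(1)

# Simulated marks, NOT the UPSC data: two "years" with different mean marks
sim <- data.frame(
  year  = rep(c("low-scoring year", "high-scoring year"), each = 1000),
  marks = c(rnorm(1000, mean = 600, sd = 30),
            rnorm(1000, mean = 850, sd = 30))
)

pooled_mean <- mean(sim$marks)                      # sits between the two groups
group_means <- tapply(sim$marks, sim$year, mean)    # one mean per year

# Share of candidates scoring within 50 marks of the pooled mean
near_mean <- mean(abs(sim$marks - pooled_mean) < 50)

print(pooled_mean)
print(group_means)
print(near_mean)

# Counts per 50-mark bin: two separate humps with a gap around the pooled mean
print(table(cut(sim$marks, breaks = seq(450, 1000, by = 50))))
```

The pooled mean describes neither year well, which is the same reason the pooled Written histogram above looks different from the per-year ones.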
Let’s plot Interview marks.
ggplot(upm,aes(x=Interview,y=..density..))+geom_histogram(fill="orange")+
geom_density(color="black")
ggplot(upm,aes(x=Interview,y=..density..))+geom_histogram(fill="orange")+
geom_density(color="black")+facet_wrap(~caste)
ggplot(upm,aes(x=Interview,y=..density..))+geom_histogram(fill="orange")+
geom_density(color="black")+facet_wrap(caste~year)
The histograms and density plots give a lot of insight in the distribution of marks.
Let’s Plot Interview marks community wise
ggplot(b,aes(x=year,y=Interview,color=caste,group=caste))+geom_line()+
geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")
We see that Interview marks are trending down in general, but ST candidates do better than OBC and SC candidates in 2016 and 2017.
ggplot(b,aes(x=year,y=Written,color=caste,group=caste))+geom_line()+
geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")
Average Written marks have gone up in all categories, in the usual order.
upm %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>% group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>%
ggplot(aes(x=year,y=Interview,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")
We see that Interview marks for both Muslims and non-Muslims go down; however, Muslims have higher marks on average (by around 5 marks).
Now some people alleged that Interview marks were deliberately inflated in the top ranks (ranks below 500), so let’s see the trend among top rankers.
upm %>% filter(Rank<500) %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>% group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>%
ggplot(aes(x=year,y=Interview,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")
In this graph we can clearly see that while marks have been going down in both groups, in 2016 there is an uptick for Muslims (almost 13 marks higher among the top 500 candidates), and that led to higher selection of Muslims in the top ranks. This might be a one-off event, but in 2016, while the Interview marks of Muslim candidates went down on average, in the top group they went up against the general trend.
In 2017 the difference has reduced but is still there. We will also do a regression analysis controlling for these factors.
Let’s now examine Written marks.
upm %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>% group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>%
ggplot(aes(x=year,y=Written,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")
In this graph we clearly see that there is almost no difference between Muslim and non-Muslim candidates in the UPSC written test.
Let’s examine the top candidates with ranks below 500.
upm %>%filter(Rank<500) %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>% group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>%
ggplot(aes(x=year,y=Written,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")
Here as well, there is no difference in Written scores (the mark sheets are blinded), and hence the better performance in the top group was driven entirely by better performance in the Interview.
Let’s see the corresponding graphs for females, starting with Written marks.
upm %>% mutate(Gender=ifelse(Gender==1,"Female","Male")) %>% group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>%
ggplot(aes(x=year,y=Written,color=factor(Gender),group=factor(Gender)))+geom_line()+
geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")
We see almost no difference here, unlike in Interview marks.
Now among top candidates:
upm %>% filter(Rank<500) %>% mutate(Gender=ifelse(Gender==1,"Female","Male")) %>% group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>%
ggplot(aes(x=year,y=Written,color=factor(Gender),group=factor(Gender)))+geom_line()+
geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")
Here as well, there is no difference.
Let’s see Interview scores now.
upm %>% mutate(Gender=ifelse(Gender==1,"Female","Male")) %>% group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>%
ggplot(aes(x=year,y=Interview,color=factor(Gender),group=factor(Gender)))+geom_line()+
geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")
Now it appears that being female confers an advantage of almost 5 marks in the Interview.
Let’s analyse it among top rankers.
upm %>%filter(Rank<500) %>% mutate(Gender=ifelse(Gender==1,"Female","Male")) %>% group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>%
ggplot(aes(x=year,y=Interview,color=factor(Gender),group=factor(Gender)))+geom_line()+
geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")
We see that among top rankers the difference is smaller, and it even closed in 2016 (cf. the performance of Muslim candidates that year). However, in 2017 it has shot up again.
Let’s examine this association statistically.
First let’s look at the relationship between Interview and Written marks.
upm %>% ggplot(aes(x=Interview,y=Written))+
geom_jitter(color="red")+
stat_smooth(method="lm",se=FALSE)
We see that there is a predominantly negative relationship. Does this relationship hold across years? Let’s see.
upm %>% ggplot(aes(x=Interview,y=Written))+
geom_jitter(color="red")+
stat_smooth(method="lm",se=FALSE)+
facet_wrap(~year)
We can see that the relationship holds, but its strength varies. This is perhaps to be expected, as those who do very well in the Written exam might be introverts and perform worse in the Interview.
Let’s see if this relationship holds for females as well.
upm %>% mutate(Gender=ifelse(Gender==1,"Female","Male")) %>% ggplot(aes(x=Interview,y=Written,color=Gender))+
geom_jitter()+
stat_smooth(method="lm",se=FALSE)+
facet_wrap(~year)
Well, it does across years, though females tend to score higher.
Let’s look at it for Muslims.
upm %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>% ggplot(aes(x=Interview,y=Written,color=Muslim))+
geom_jitter()+
stat_smooth(method="lm",se=FALSE)+
facet_wrap(~year)
Conditional on Written marks, we see that Muslims differ only slightly from non-Muslims. However, in 2016 there is a clear divergence among higher scorers, indicating higher marks for Muslim candidates.
Let’s now do a formal statistical test. Let’s take Interview marks as the dependent variable, predicted by the individual’s Written marks, caste, gender and religion (in which there are obvious trends), along with the year. The model is obviously flawed, as all models are, since it has no measure of the innate ability of the candidate, but it helps us control for gender, caste and religion differences among candidates as well as yearly variation.
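The model that follows uses a variable `Written.c` whose construction is not shown in this post. A plausible reading, and it is only an assumption, is that it is the Written mark centred on its overall mean. The sketch below uses a made-up toy data frame (not `upm`) to show what centring does: the slope is unchanged, and the intercept becomes the predicted Interview mark for a candidate with average Written marks.

```r
# Toy data standing in for upm; all values are invented for illustration
toy <- data.frame(
  Written   = c(700, 750, 800, 650, 720),
  Interview = c(182, 168, 161, 189, 175)
)

# Assumed definition of Written.c: Written minus its overall mean
toy$Written.c <- toy$Written - mean(toy$Written)

fit_raw      <- lm(Interview ~ Written,   data = toy)
fit_centered <- lm(Interview ~ Written.c, data = toy)

print(coef(fit_raw))       # intercept = predicted Interview at Written = 0
print(coef(fit_centered))  # intercept = predicted Interview at the mean Written mark
```

With centring, the intercept is directly readable as the baseline candidate's Interview mark, which matches how the intercept is interpreted in the discussion of the regression output.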
f= lm(Interview~Written.c+Gender+caste+Muslim+factor(year),data=upm)
summary(f,digits=2)
##
## Call:
## lm(formula = Interview ~ Written.c + Gender + caste + Muslim +
## factor(year), data = upm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -86.853 -11.350 0.112 11.551 57.613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 180.161739 0.601409 299.566 < 2e-16 ***
## Written.c -0.153148 0.006813 -22.479 < 2e-16 ***
## Gender 3.820673 0.627526 6.088 1.22e-09 ***
## casteOBC -11.315397 0.578359 -19.565 < 2e-16 ***
## casteSC -15.810523 0.735807 -21.487 < 2e-16 ***
## casteST -16.431979 0.969966 -16.941 < 2e-16 ***
## Muslim 5.132072 1.338552 3.834 0.000127 ***
## factor(year)2014 -0.602161 0.708922 -0.849 0.395694
## factor(year)2015 -6.182381 0.733325 -8.431 < 2e-16 ***
## factor(year)2016 -6.721234 0.729755 -9.210 < 2e-16 ***
## factor(year)2017 -6.423793 0.749939 -8.566 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.19 on 5514 degrees of freedom
## Multiple R-squared: 0.1579, Adjusted R-squared: 0.1564
## F-statistic: 103.4 on 10 and 5514 DF, p-value: < 2.2e-16
The analysis shows that, taking an average General category candidate in 2013 as the base with about 180 marks in the Interview, every 10-mark increase in Written marks decreases Interview marks by about 1.5; being female increases the mark by about 4; OBC, SC and ST candidates score around 11, 16 and 16 marks lower than General candidates; and being Muslim adds around 5 marks to the Interview score. However, it must be noted that since we don’t have other measures of the innate ability of candidates, the predictive power of the model is poor (R-squared of about 0.16) despite controlling for confounders.
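As a worked example of reading these coefficients, the sketch below plugs the estimates from the output above into the model equation for one hypothetical candidate: a female OBC candidate in 2015 whose Written marks are 50 above the mean. It assumes `Written.c` is the centred Written mark and that `Gender = 1` codes female, as in the plotting code earlier.

```r
# Coefficients copied from the summary(f) output above
b <- c(intercept = 180.161739,
       written_c = -0.153148,
       female    = 3.820673,
       obc       = -11.315397,
       year2015  = -6.182381)

# Hypothetical candidate: female, OBC, 2015, Written 50 marks above the mean
pred <- b[["intercept"]] +
  b[["written_c"]] * 50 +
  b[["female"]] +
  b[["obc"]] +
  b[["year2015"]]

print(round(pred, 1))
```

The prediction comes out just under 159 marks; the Muslim term is left out because this hypothetical candidate is non-Muslim (`Muslim = 0`).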
Let’s do the same analysis for top rankers (rank below 500; note that the filter used here also restricts the data to 2014 onwards).
g= lm(Interview~Written.c+Gender+caste+Muslim+factor(year),data=filter(upm,year>2013,Rank<500))
summary(g)
##
## Call:
## lm(formula = Interview ~ Written.c + Gender + caste + Muslim +
## factor(year), data = filter(upm, year > 2013, Rank < 500))
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.140 -10.004 -0.606 10.014 59.068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 189.44749 0.84339 224.627 < 2e-16 ***
## Written.c -0.28364 0.01258 -22.547 < 2e-16 ***
## Gender 2.80179 0.83515 3.355 0.000809 ***
## casteOBC -2.59404 0.95551 -2.715 0.006689 **
## casteSC -2.46735 1.69117 -1.459 0.144733
## casteST -0.70120 3.07164 -0.228 0.819451
## Muslim 7.78875 1.92950 4.037 5.63e-05 ***
## factor(year)2015 -8.28404 0.94761 -8.742 < 2e-16 ***
## factor(year)2016 -11.54453 0.94921 -12.162 < 2e-16 ***
## factor(year)2017 -10.83843 0.95581 -11.339 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.91 on 1986 degrees of freedom
## Multiple R-squared: 0.2478, Adjusted R-squared: 0.2444
## F-statistic: 72.7 on 9 and 1986 DF, p-value: < 2.2e-16
We see that among top rankers there is a much smaller difference in Interview marks between General and OBC/SC/ST candidates, and the female advantage is smaller too. Interestingly, however, the advantage of being Muslim is amplified among these top rankers: about 8 marks even after controlling for other variables, which points to some evidence of deliberate grade inflation in Interview marks based on religion.
Let’s run this analysis without the controversial year 2016 (the filter used here keeps only years before 2016, so 2017 is dropped as well).
g= lm(Interview~Written.c+Gender+caste+Muslim+factor(year),data=filter(upm,year<2016,Rank<500))
summary(g)
##
## Call:
## lm(formula = Interview ~ Written.c + Gender + caste + Muslim +
## factor(year), data = filter(upm, year < 2016, Rank < 500))
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.852 -11.569 -0.323 10.890 65.527
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 188.96596 0.93958 201.117 < 2e-16 ***
## Written.c -0.30602 0.01471 -20.803 < 2e-16 ***
## Gender 2.90094 1.10309 2.630 0.00863 **
## casteOBC -3.68990 1.19525 -3.087 0.00206 **
## casteSC -4.28438 2.16018 -1.983 0.04751 *
## casteST -4.41615 4.23344 -1.043 0.29704
## Muslim 8.81479 2.83858 3.105 0.00194 **
## factor(year)2014 1.51692 1.06957 1.418 0.15633
## factor(year)2015 -6.88825 1.06673 -6.457 1.44e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.8 on 1488 degrees of freedom
## Multiple R-squared: 0.257, Adjusted R-squared: 0.253
## F-statistic: 64.35 on 8 and 1488 DF, p-value: < 2.2e-16
We see that even when we exclude 2016 (as a one-off random event), being Muslim is associated with an advantage of about 9 marks in the Interview among top rankers, and it has the largest coefficient among the gender, caste and religion terms, excluding systematic yearly variation.
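To get a feel for how precisely this coefficient is estimated, here is a minimal sketch of a 95% confidence interval built from the estimate, standard error and residual degrees of freedom printed in the output above. On the fitted model itself, `confint(g)` would give the same kind of interval; the manual version is used here only because the data is not loaded in this snippet.

```r
# Values copied from the regression output above (Muslim term, year < 2016, Rank < 500)
est <- 8.81479   # coefficient estimate
se  <- 2.83858   # standard error
df  <- 1488      # residual degrees of freedom

t_crit <- qt(0.975, df)                 # two-sided 95% critical value
ci <- est + c(-1, 1) * t_crit * se      # lower and upper bound

print(round(ci, 1))
```

The interval runs from roughly 3 to 14 marks, so the sign of the effect is clear in this model but its size is not pinned down tightly, which is worth keeping in mind given the small number of Muslim candidates among top rankers.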
We must also note that although there is category-wise variation in Interview marks by caste, the same difference is seen in Written marks as well. In the case of female and Muslim candidates, however, those with similar Written marks perform better in the Interview, indicating potential bias. Further, while the Interview advantage for females diminishes at top ranks (<500), it increases for Muslim candidates. It should be noted, though, that female and Muslim candidates are vastly under-represented relative to their share of the population. One point of this analysis is that if these candidates are being favoured, and this analysis suggests that they are, then the whole process should be open about it.
Let’s do some fun analysis now.
I extracted the last names of candidates with a regex; now let’s see which surnames dominate the UPSC list.
upm %>% group_by(surname) %>%
summarise(n=n()) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>%
arrange(desc(n)) %>%head(n=15)
## # A tibble: 15 x 3
## surname n proportion
## <chr> <int> <dbl>
## 1 KUMAR 325 5.88
## 2 SINGH 290 5.25
## 3 MEENA 140 2.53
## 4 SHARMA 117 2.12
## 5 YADAV 114 2.06
## 6 S 113 2.05
## 7 GUPTA 83 1.50
## 8 JAIN 78 1.41
## 9 R 64 1.16
## 10 K 63 1.14
## 11 MISHRA 63 1.14
## 12 VERMA 53 0.960
## 13 GARG 46 0.830
## 14 PANDEY 43 0.780
## 15 P 41 0.740
We see that while generic surnames like KUMAR and SINGH dominate the scene, what is surprising is that the MEENA community, which falls under ST, has more seats than communities like SHARMA and GUPTA.
Let’s plot it.
upm %>% group_by(surname) %>%
summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>%
arrange(desc(n)) %>%
filter(n>15) %>%
ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="blue")+
coord_flip()+
xlab("Surname")
Let’s look at which surnames have best ranks
upm %>% group_by(surname) %>%
summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>%
arrange(desc(n)) %>%
filter(n>15) %>%
ggplot(aes(x=reorder(surname,desc(Rank)),y=Rank))+geom_bar(stat="identity",fill="blue")+
coord_flip()+
xlab("Surname")
We see that the best ranks are secured by Brahmin and Bania communities, even though absolute numbers vary.
Which community/surname (except KUMARI) has the best female representation? Let’s see.
upm %>% group_by(surname) %>%
summarise(n=n(),Rank=mean(Rank),Female_percentage=round(100*mean(Gender),2)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>%
arrange(desc(n)) %>%
filter(n>15) %>%
ggplot(aes(x=reorder(surname,Female_percentage),y=Female_percentage))+geom_bar(stat="identity",fill="blue")+
geom_hline(yintercept =round(100*mean(upm$Gender),2) ,color="red",linetype="dashed")+
annotate("text",x=2,y=round(100*mean(upm$Gender),2),label=paste0("Average Female \n Representation (",round(100*mean(upm$Gender),1),") %"))+
coord_flip()+
xlab("Surname")
We see that Baniyas (GUPTA), some Brahmins and some South Indian communities (represented by single-letter surnames like S, C, A, B, J, K) have well above average female representation.
Let’s look inside the categories. First, the General category.
upm %>% filter(caste=="General") %>% group_by(surname) %>%
summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>%
arrange(desc(n)) %>%
filter(n>15) %>%
ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
coord_flip()+
xlab("Surname")
Let’s look at the ST category.
upm %>% filter(caste=="ST") %>% group_by(surname) %>%
summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>%
arrange(desc(n)) %>%
filter(n>4) %>%
ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
coord_flip()+
xlab("Surname")
What is clearly evident here is that the MEENA group is an anomaly in this population, a sort of super-elite that is cornering most of the benefits reserved for ST.
Roughly a third of ST seats (at most 140 of 441, or about 32%, going by the tables above) are bagged by the MEENA community, which forms only 5-6% of the ST population.
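As a rough cross-check using only the counts already printed in the tables above, the MEENA share of ST seats can be bounded from above. This assumes every MEENA candidate in the surname table is in the ST category, which the tables here do not confirm, so the result is an upper bound rather than the exact share.

```r
meena_total <- 140  # MEENA rows in the overall surname table (all categories)
st_total    <- 441  # ST candidates in the caste category table

# Upper bound on the MEENA share of ST seats, in percent
meena_share_max <- round(100 * meena_total / st_total, 1)
print(meena_share_max)
```

The exact figure would come from counting MEENA rows after `filter(caste=="ST")`, as in the ST plot above.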
Let’s look at OBC community
upm %>% filter(caste=="OBC") %>% group_by(surname) %>%
summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>%
arrange(desc(n)) %>%
filter(n>5) %>%
ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
coord_flip()+
xlab("Surname")
We see that apart from the generic KUMAR and SINGH surnames, Yadavs and Patels from North India are capturing a lot of OBC seats; however, the story here is that many South Indian candidates (with single-letter surnames) are dominant in the OBC category.
Let’s examine the SC category.
upm %>% filter(caste=="SC") %>% group_by(surname) %>%
summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>%
arrange(desc(n)) %>%
filter(n>5) %>%
ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
coord_flip()+
xlab("Surname")
We see that while most SC candidates use KUMAR, SINGH and even VERMA as surnames, many candidates with South Indian surnames also take these benefits.
Thus, on the whole, Northern communities predominate in the General category, while South Indian surnames are well represented in the OBC and SC groups. In the ST group the MEENA community captures the largest share of the seats.
Here is a list of summary statistics by surname:
upm %>% group_by(surname) %>%
# mean of each column per surname, plus the number of candidates with that surname
summarise(across(c(Total,Rank,Interview,Written,Gender),mean,.names="{.col}_mean"),count=n()) %>%
filter(count>4) %>%
mutate(Female_percentage = round(100*Gender_mean,2)) %>%
mutate(proportion=round(100*(count/sum(count)),2)) %>%
mutate(Female_total = round(Female_percentage*count/100,0)) %>%
select(-Gender_mean) %>%
# sort by share of candidates, ties broken by mean rank;
# use arrange(desc(Female_total)) instead to sort by number of women
arrange(desc(proportion),Rank_mean) %>%
head(n=15)
## # A tibble: 15 x 9
## surname Total_mean Rank_mean Interview_mean Written_mean count
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 KUMAR 902. 615. 166. 737. 325
## 2 SINGH 897. 588. 169. 728. 290
## 3 MEENA 875. 883. 160. 715. 140
## 4 SHARMA 908. 409. 177. 731. 117
## 5 YADAV 896. 636. 165. 731. 114
## 6 S 902. 573. 174. 728. 113
## 7 GUPTA 928. 372. 171. 757. 83
## 8 JAIN 942. 329. 171. 772. 78
## 9 R 897. 617. 169. 728. 64
## 10 MISHRA 934. 364. 176. 758. 63
## 11 K 887. 608. 168. 719. 63
## 12 VERMA 907. 615. 168. 739. 53
## 13 GARG 930. 306. 172. 758. 46
## 14 PANDEY 921. 365. 176. 745. 43
## 15 P 874. 604. 174. 700. 41
## # ... with 3 more variables: Female_percentage <dbl>, proportion <dbl>,
## # Female_total <dbl>
Repeater analysis
Let’s look at names which repeat across years. While it is true that different people with the same name can be selected, it is less probable that this happens in consecutive years; in any case, with limited resources, this is a slight overestimation of the repeater count.
repeater= upm %>% group_by(NAME) %>% summarize(n=n()) %>% filter(n>1) %>% pull(NAME)
round(100*(length(repeater)/length(upm$NAME)),2)
## [1] 16.36
Thus the proportion of repeaters is around 16.36%.
So what percentage of repeaters eventually end up getting a top-100 rank? Let’s calculate.
top_ranker =upm %>% filter(NAME %in% repeater) %>% filter(Rank<100) %>% pull(NAME)
So the percentage of repeaters who end up as top rankers is
round(100*(length(top_ranker)/length(repeater)),2)
## [1] 16.48
or around 16.5%.
Thus around 16% of selected people appear in the list again, and only around 16% of these determined ones end up getting top-100 ranks. Who said UPSC was an easy exam! It takes a lot to be a Babu.. :-)
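The repeater calculation can be sketched on a toy data frame to make the denominators explicit. The names, years and ranks below are invented; the logic mirrors the code above (distinct repeated names divided by total rows), with one deliberate difference that is an assumption on my part: `unique()` is applied so that a repeater with two top-100 finishes is counted once.

```r
# Toy stand-in for upm; every value here is made up
toy <- data.frame(
  NAME = c("A KUMAR", "A KUMAR", "B SINGH", "C MEENA", "C MEENA", "D JAIN"),
  year = c(2014, 2015, 2014, 2015, 2016, 2016),
  Rank = c(600, 80, 300, 900, 450, 50)
)

# Names that appear in more than one row
name_counts <- table(toy$NAME)
repeater <- names(name_counts[name_counts > 1])

# Same ratio as above: distinct repeated names over total rows
pct_repeaters <- round(100 * length(repeater) / nrow(toy), 2)

# Alternative denominator: distinct names rather than rows
pct_repeaters_by_person <- round(100 * length(repeater) / length(unique(toy$NAME)), 2)

# Repeaters with at least one top-100 rank, each counted once
top_ranker <- unique(toy$NAME[toy$NAME %in% repeater & toy$Rank < 100])
pct_top <- round(100 * length(top_ranker) / length(repeater), 2)

print(c(pct_repeaters, pct_repeaters_by_person, pct_top))
```

On this toy data the row-based and person-based shares differ (a third versus a half), which shows that the choice of denominator matters when reading the 16.36% figure.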
Thus, despite adding 2017 data, the analysis remains mostly unchanged. In 2017 females had a massive jump in Interview marks, while Muslims still had higher marks than non-Muslims, though the difference was not statistically significant; in the analysis over the years, the trend still holds up.
The overall point remains: Muslims and females are under-represented in UPSC. But rather than inflating marks through covert processes, UPSC should be open about it, as its own data shows it and it will be found out sooner rather than later.
The MEENA super-elite is cornering most of the benefits in the ST category, and a correction is required by NCBC (sub-categorization of ST seats without interfering with the quota) rather than keeping quiet due to political correctness.
1. Being female or Muslim confers an advantage in the Interview which is minor (about 5 marks), since the whole Interview is out of 275 marks. However, Muslim candidates show a larger advantage (about 8 to 9 marks) at top ranks, so there is some evidence of deliberate grade inflation.
2. General candidates get better Interview marks than OBC, SC and ST candidates, in that order; however, this advantage mostly does not hold up in the top ranks.
3. Baniyas (GUPTA, AGARWAL) in particular get the best mean ranks and have the highest female representation, almost double the average; they are arguably one of the most developed communities in India.
4. Female representation in SC is higher than in OBC.
5. The MEENA community captures the largest share of seats in the ST category, which is a cause for concern.
6. Written marks are inversely correlated with Interview marks. A person with 100 more Written marks than average is likely to get about 15 fewer marks in the Interview.
7. Muslims and females are under-represented in the UPSC list. However, in my personal opinion covert grade inflation is not the solution, as it will backfire; we need an open and transparent policy to increase their representation (e.g. opening more schools in Muslim-dominated areas rather than a specific quota).
8. South Indian communities are better represented in the OBC and SC lists (understandable, since despite advanced demographics on almost every parameter few people let go of the advantage of quotas); in the General list, however, North Indian communities dominate.