Dataset: Loksabha 2019 Candidates General Information. (https://www.kaggle.com/datasets/themlphdstudent/lok-sabha-election-candidate-list-2004-to-2019)
# Importing required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)
# Loading our dataset
data <- read.csv('C:\\Users\\bhush\\Downloads\\Coursework\\I 590 INTRO TO R\\datasets\\data_final\\LokSabha2019_xl.csv')
E.g., this could be a column name, or just some value inside a cell of your data
Why do you think they chose to encode the data the way they did? What could have happened if you didn’t read the documentation?
Our dataset is scraped from https://myneta.info/ and has a total of 10 columns. Of these, the columns Candidate, Party, Age, Education, Constituency, and Gender hold general information. The columns Total Assets, Liabilities, and Criminal Cases were unclear to me until I read the documentation.
Criminal Cases - Several candidates have a large number of criminal cases against them. When you see a candidate with even one criminal case, you would naturally question how they are allowed to contest the election at all!
Upon going through the documentation, I found that most of these criminal cases are accusations that are still ongoing in the criminal courts, where the candidate has not yet been found guilty.
Also, as per the Indian Constitution, a candidate can contest elections unless he/she has been convicted of a crime by a court.
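To see how common declared cases are, one could summarise the criminal-cases column. The following is a minimal sketch on made-up rows, assuming the column is read in as a numeric `Criminal.Cases` (the name `read.csv` would typically give a "Criminal Cases" header; this is an assumption, not confirmed by the dataset). The counts are declared cases, not convictions.

```r
library(dplyr)

# Hypothetical rows for illustration only; not taken from the real dataset
toy_cases <- data.frame(
  Candidate = c("A", "B", "C", "D"),
  Criminal.Cases = c(0, 3, 0, 1)  # assumed column name and numeric type
)

# Count candidates with at least one declared case and their share of the total
case_summary <- toy_cases |>
  summarise(
    candidates = n(),
    with_cases = sum(Criminal.Cases > 0, na.rm = TRUE),
    share_with_cases = mean(Criminal.Cases > 0, na.rm = TRUE)
  )

print(case_summary)
```

On the real data, `toy_cases` would be replaced by `data`, after checking the actual column name with `names(data)`.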
Total Assets - Many candidates report the financial value of their assets through a registered Chartered Accountant, while others provide evidence of their assets directly to the Election Commission of India (ECI).
To contest a Lok Sabha election, a candidate must pay a security deposit of INR 25,000 (about $300), but some candidates have reported assets below this deposit amount, so how were they able to contest the election?
The answer is that a candidate can borrow the amount, which is then recorded as a liability, and contest the election.
Liabilities - Several candidates have liabilities greater than their assets and can still contest elections, as the Indian Constitution permits this. A court case challenging the liability rule began in 2016 and is still pending in the Supreme Court of India.
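Candidates whose liabilities exceed their assets can be flagged once both columns are numeric. This is a rough sketch under stated assumptions: the columns are named `Total.Assets` and `Liabilities`, and amounts are stored as text such as "Rs 30,99,414 ~ 30 Lacs+" (an assumed format that should be checked against the real file). `parse_inr()` is a hypothetical helper written for this illustration, and the rows are made up.

```r
library(dplyr)

# Hypothetical helper: keep the digits that appear before the "~" and
# convert them to a number; text without digits (e.g. "Nil") becomes NA
parse_inr <- function(x) {
  before_tilde <- sub("~.*$", "", x)
  digits <- gsub("[^0-9]", "", before_tilde)
  digits[digits == ""] <- NA
  as.numeric(digits)
}

# Made-up rows for illustration only
toy_money <- data.frame(
  Candidate = c("A", "B", "C"),
  Total.Assets = c("Rs 30,99,414 ~ 30 Lacs+", "Rs 5,000 ~ 5 Thou+", "Nil"),
  Liabilities = c("Rs 2,00,000 ~ 2 Lacs+", "Rs 50,000 ~ 50 Thou+", "Rs 0 ~"),
  stringsAsFactors = FALSE
)

# Parse both columns and flag rows where liabilities are larger than assets
flagged <- toy_money |>
  mutate(
    assets_inr = parse_inr(Total.Assets),
    liabilities_inr = parse_inr(Liabilities),
    liabilities_exceed_assets = liabilities_inr > assets_inr
  )

print(flagged)
```

Rows with unparseable amounts come out as `NA` rather than being silently treated as zero, so they can be inspected separately.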
You may need to do some digging, but is there anything about the data that your documentation does not explain?
Constituency - A few constituencies have far more candidates than the rest.
For example, there are about 15 candidates per constituency on average, but constituencies such as Nizamabad, Belgaum, and Karur have 30+ candidates.
# Finding independent candidates per constituency
ind_cand <- data |>
  filter(Party == 'IND') |>
  group_by(Constituency) |>
  summarise(count = n()) |>
  arrange(desc(count))
print(ind_cand)
## # A tibble: 512 × 2
## Constituency count
## <chr> <int>
## 1 Nizamabad 176
## 2 Belgaum 51
## 3 Karur 31
## 4 Beed 27
## 5 Chennai South 27
## 6 Thoothukkudi 26
## 7 Jamnagar 25
## 8 Surendranagar 25
## 9 Amritsar 21
## 10 Theni 21
## # ℹ 502 more rows
# Number of candidates per constituency
Count_per_const <- data |>
  group_by(Constituency) |>
  summarise(count = n()) |>
  arrange(desc(count))
print(Count_per_const)
## # A tibble: 542 × 2
## Constituency count
## <chr> <int>
## 1 Nizamabad 183
## 2 Belgaum 56
## 3 Karur 42
## 4 Chennai South 39
## 5 Thoothukkudi 37
## 6 Beed 36
## 7 Chandigarh 36
## 8 Nalanda 35
## 9 Aurangabad 32
## 10 Chennai Central 31
## # ℹ 532 more rows
# Average number of candidates per constituency
print(mean(Count_per_const$count))
## [1] 14.70111
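A mean like the one above can be pulled upward by a single very large constituency, so it may help to compare it with the median. This is a small sketch on made-up counts (illustrative values, not the real data):

```r
# Hypothetical candidate counts per constituency (illustrative values only)
toy_counts <- c(120, 40, 30, 15, 12, 10, 9, 8)

mean_count <- mean(toy_counts)
median_count <- median(toy_counts)

# Mean after dropping the single largest constituency
mean_without_max <- mean(toy_counts[toy_counts != max(toy_counts)])

print(c(mean = mean_count, median = median_count, mean_without_max = mean_without_max))
```

On the real data, the same comparison would use `Count_per_const$count` in place of `toy_counts`.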
# Nizamabad constituency details
Nizamabad_info <- data |>
  filter(Constituency == "Nizamabad")
dim(Nizamabad_info)
## [1] 183 10
# Plotting parties against number of candidates
Nizamabad_info |>
  ggplot() +
  geom_bar(mapping = aes(y = Party, fill = Party), stat = "count")
We can observe that most candidates are listed as “IND”, which stands for “Independent”, while each party has only one candidate!
Why is this?
Parties avoid giving tickets to more than one candidate per constituency, so candidates who do not get a party ticket may decide to contest the election independently (IND).
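To check how far independents dominate in each constituency, the two counts computed earlier can be combined into a share. This is a minimal sketch on made-up rows; only the `Constituency` and `Party` columns and the "IND" label come from the code above, everything else is illustrative.

```r
library(dplyr)

# Made-up rows for illustration only
toy_candidates <- data.frame(
  Constituency = c("X", "X", "X", "X", "Y", "Y", "Y"),
  Party = c("IND", "IND", "IND", "P1", "IND", "P1", "P2")
)

# Share of independent candidates per constituency, largest share first
ind_share <- toy_candidates |>
  group_by(Constituency) |>
  summarise(
    candidates = n(),
    independents = sum(Party == "IND"),
    share_ind = independents / candidates
  ) |>
  arrange(desc(share_ind))

print(ind_share)
```

Replacing `toy_candidates` with `data` would give the same table for the real constituencies.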
# Belgaum constituency details
Belgaum_info <- data |>
  filter(Constituency == "Belgaum")
dim(Belgaum_info)
## [1] 56 10
# Plotting parties against number of candidates
Belgaum_info |>
  ggplot() +
  geom_bar(mapping = aes(y = Party, fill = Party), stat = "count")
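Since the same filter-and-plot steps are repeated for each constituency, they could be wrapped in a small function. The sketch below assumes only the `Constituency` and `Party` columns used above; `plot_party_counts()` and its argument names are hypothetical, and the rows are made up.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: bar chart of candidates per party for one constituency
plot_party_counts <- function(df, constituency_name) {
  df |>
    filter(Constituency == constituency_name) |>
    ggplot() +
    geom_bar(mapping = aes(y = Party, fill = Party)) +
    labs(
      title = paste("Candidates per party:", constituency_name),
      x = "Number of candidates",
      y = "Party"
    )
}

# Made-up rows for illustration only
toy_plot_data <- data.frame(
  Constituency = c("X", "X", "X", "X", "Y", "Y"),
  Party = c("IND", "IND", "IND", "P1", "IND", "P2")
)

p <- plot_party_counts(toy_plot_data, "X")
print(p)
```

With the real data, a call such as `plot_party_counts(data, "Nizamabad")` would stand in for the separate filter and plot steps.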